TRAVEL INSURANCE PROJECT

Introduction

In this data set, a tour and travel company offers a travel insurance package to its customers. The dataset includes variables that could influence whether a customer buys the package, such as age and whether the customer is a frequent flyer. In this project, I use R to explore the data visually and build a logistic regression model to use as a classifier. After building the model, I evaluate it with several metrics.

Link for Data: https://www.kaggle.com/tejashvi14/travel-insurance-prediction-data

Data Visualization using 'ggplot2'

Link to R Display and Code: https://rpubs.com/tracylam1/801047

In this section, I visualized the data with bar and pie charts. The visualizations use age, employment type, frequent-flyer status, and annual income, since I expected those variables to be the most influential in whether a customer buys the travel insurance.

[Figure: bar and pie charts of the data]
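
As a rough illustration of how charts like these can be drawn with ggplot2, here is a minimal sketch. It runs on simulated data rather than the Kaggle file, and the column names (Age, FrequentFlyer, TravelInsurance) and the proportions used to simulate them are assumptions made for the example, not necessarily what the linked RPubs code uses.

```r
library(ggplot2)

# Simulated stand-in for the travel insurance data.
# Column names and proportions are assumptions for illustration only.
set.seed(1)
n <- 200
travel <- data.frame(
  Age             = sample(25:35, n, replace = TRUE),
  FrequentFlyer   = sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.8, 0.2)),
  TravelInsurance = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.65, 0.35)))
)

# Bar chart: number of buyers and non-buyers within each frequent-flyer group
bar_plot <- ggplot(travel, aes(x = FrequentFlyer, fill = TravelInsurance)) +
  geom_bar(position = "dodge") +
  labs(x = "Frequent flyer", y = "Customers", fill = "Bought insurance")

# Pie chart: a single stacked bar drawn in polar coordinates
pie_plot <- ggplot(travel, aes(x = "", fill = FrequentFlyer)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(fill = "Frequent flyer") +
  theme_void()

print(bar_plot)
print(pie_plot)
```

ggplot2 has no dedicated pie geom, so the pie chart here is a stacked bar mapped onto polar coordinates with coord_polar().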

Logistic Regression Model

In this section, I build a logistic regression model as a binary classifier to find out which of the other variables affect whether a customer purchases the travel insurance package. As seen in the model output below, the statistically significant predictors of the binary variable TravelInsurance are age, annual income, number of family members, and frequent-flyer status: the p-values of these variables are below the 0.05 significance level. This mostly matched my prediction of which variables would most influence a customer's decision to buy the travel insurance.

[Figure: logistic regression model summary]
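
The fitting step can be sketched in base R with glm() and family = binomial. This is an illustration under stated assumptions, not the project's actual code: the data are simulated, the coefficients used to simulate the outcome are arbitrary, and apart from TravelInsurance the column names (Age, AnnualIncome, FamilyMembers, FrequentFlyer) are guesses at how the variables are named.

```r
# Simulated stand-in for the travel insurance data (illustration only).
set.seed(42)
n <- 500
travel <- data.frame(
  Age           = sample(25:35, n, replace = TRUE),
  AnnualIncome  = round(runif(n, 3e5, 1.8e6)),
  FamilyMembers = sample(2:9, n, replace = TRUE),
  FrequentFlyer = factor(sample(c("No", "Yes"), n, replace = TRUE))
)

# Simulated outcome; the coefficients below are arbitrary, not estimates
# from the real data.
lin <- with(travel, -6 + 0.07 * Age + 2e-6 * AnnualIncome +
                    0.15 * FamilyMembers + 0.5 * (FrequentFlyer == "Yes"))
travel$TravelInsurance <- rbinom(n, 1, plogis(lin))

# Logistic regression: binomial family with the default logit link
fit <- glm(TravelInsurance ~ Age + AnnualIncome + FamilyMembers + FrequentFlyer,
           data = travel, family = binomial)

# Coefficient table: estimate, standard error, z value, and p-value
coefs <- summary(fit)$coefficients
p_values <- coefs[, "Pr(>|z|)"]
print(round(coefs, 4))

# Terms whose p-value falls below the 0.05 significance level
significant <- names(p_values)[p_values < 0.05]
print(significant)
```

Which terms come out significant here depends on the simulated data, so the printed list should not be read as a result about the real dataset.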

Evaluating the Logistic Regression Model

In this section, I evaluated the logistic regression model using cross-validation, the misclassification error rate, a confusion matrix, specificity, sensitivity, and the ROC curve. As the results below show, the misclassification error rate and cross-validation give about the same error estimate. However, since the error rate is somewhat high at around 22%, this may not be the best model for predicting whether someone will buy the travel insurance.

[Figure: misclassification error rate and cross-validation results]
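
To make the two error estimates concrete, here is a sketch that computes a training misclassification rate and a k-fold cross-validated one in base R. It is written under assumptions: the data are simulated, the 0.5 probability cutoff and the choice of 10 folds are mine, and misclass_rate() is a hypothetical helper, so the project itself may have used a different setup (for example a packaged cross-validation function).

```r
# Simulated stand-in data (illustration only; income in arbitrary units).
set.seed(7)
n <- 600
travel <- data.frame(
  Age          = sample(25:35, n, replace = TRUE),
  AnnualIncome = runif(n, 3, 18)
)
travel$TravelInsurance <- rbinom(
  n, 1, plogis(-5 + 0.05 * travel$Age + 0.3 * travel$AnnualIncome)
)

# Hypothetical helper: share of observations whose predicted class
# (probability above the cutoff) differs from the actual class
misclass_rate <- function(actual, prob, cutoff = 0.5) {
  mean(as.integer(prob > cutoff) != actual)
}

form <- TravelInsurance ~ Age + AnnualIncome

# Training error: fit and evaluate on the same data
fit <- glm(form, data = travel, family = binomial)
train_error <- misclass_rate(travel$TravelInsurance,
                             predict(fit, type = "response"))

# k-fold cross-validation: fit on k - 1 folds, evaluate on the held-out fold
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))
fold_error <- sapply(seq_len(k), function(i) {
  held_out <- travel[fold == i, ]
  fit_i <- glm(form, data = travel[fold != i, ], family = binomial)
  misclass_rate(held_out$TravelInsurance,
                predict(fit_i, newdata = held_out, type = "response"))
})
cv_error <- mean(fold_error)  # folds are equal-sized, so a plain mean is fine

print(c(train = train_error, cv = cv_error))
```

When the two numbers are close, as reported above, the model is probably not overfitting much, and the remaining error reflects the limits of the predictors or of the model form.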

The confusion matrix below gives more insight into the results from the misclassification rate and cross-validation. The matrix shows many false negatives, which is why the sensitivity is low, at around 49%.

[Figure: confusion matrix and sensitivity]
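
The link between false negatives and sensitivity can be seen in a small worked example. The labels and predicted probabilities below are made up for illustration and are not taken from the project, and the 0.5 cutoff is an assumption.

```r
# Hypothetical labels and predicted probabilities (not project results)
actual <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
prob   <- c(0.9, 0.7, 0.4, 0.2, 0.1, 0.3, 0.6, 0.2, 0.1, 0.4)

# Classify as a buyer when the predicted probability exceeds 0.5
predicted <- as.integer(prob > 0.5)

# Confusion matrix: rows are predictions, columns are actual classes
conf_mat <- table(Predicted = factor(predicted, levels = c(0, 1)),
                  Actual    = factor(actual,    levels = c(0, 1)))
print(conf_mat)

TP <- conf_mat["1", "1"]  # buyers correctly predicted as buyers
FN <- conf_mat["0", "1"]  # buyers missed by the model (false negatives)
TN <- conf_mat["0", "0"]  # non-buyers correctly predicted
FP <- conf_mat["1", "0"]  # non-buyers wrongly predicted as buyers

sensitivity <- TP / (TP + FN)  # share of actual buyers the model catches
specificity <- TN / (TN + FP)  # share of actual non-buyers the model catches

print(c(sensitivity = sensitivity, specificity = specificity))
```

In this toy example half of the actual buyers are missed, so the sensitivity is 0.5 even though most non-buyers are classified correctly. Every additional false negative lowers sensitivity without affecting specificity.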
