Optimize Your Email Campaign Using Data

This is part of my Udacity Nanodegree capstone project.

Sho Arata
9 min read · Sep 16, 2021


Project Definition

In this project, I try to optimize how an email campaign is deployed by estimating which demographics are more receptive to an offer, so that we don't have to contact everyone.

The data for this project was provided by Arvato Bertelsmann and will not be available in the project's GitHub repo because of their terms of use.

We are given three datasets: one with information about Arvato's customers, one with information about the general population of Germany, and one with information about potential clients who were part of an email campaign, together with their responses to it.

Each dataset has 366 columns with information about each person's interests and demographics, their household, and their neighborhood.

We are asked to explore these datasets, find what differentiates a customer from a noncustomer, and predict which people are likely to convert on our email campaign.

Problem statement

  • First, we'll do some cleaning and exploratory analysis to prepare the data and understand the key differences between customers and noncustomers. This entails dealing with missing values, adjusting values, and comparing the distributions of the two groups to surface those differences.
  • Second, we'll train a clustering model that separates customers from noncustomers based on their features; this will give us a way to systematically segment potential customers from the general population.
  • Finally, we’re asked to train a model on the email campaign dataset that will predict whether a person would be interested in our email campaign or not.

Metrics

For the first part of the project which is just cleaning and exploratory analysis, we do not need to define an evaluation metric.

For the second part of the project, we'll be training a clustering model that separates customers from noncustomers, so the evaluation metric should increase the better the model separates the two groups. We can calculate the Gini coefficient on the distribution of customers across the clusters: the higher the coefficient, the more unequally customers are distributed between the groups, and therefore the better the clusters separate customers from noncustomers.
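As a reference, here is a minimal sketch of how that Gini coefficient could be computed, where shares holds the number of customers in each cluster (this helper is my own illustration, not code from the project):

import numpy as np

def gini(shares):
    # Gini coefficient of customer counts across clusters:
    # 0 = customers spread evenly over all clusters,
    # values near 1 = customers concentrated in a few clusters
    x = np.sort(np.asarray(shares, dtype=float))
    n = len(x)
    weighted_sum = np.sum(np.arange(1, n + 1) * x)
    return 2 * weighted_sum / (n * np.sum(x)) - (n + 1) / n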

For the last part, training a classification model to predict which customers will respond to an email marketing campaign and testing it in a Kaggle competition, the evaluation metric defined for the competition is ROC-AUC, so we'll use the same metric for training.

Cleaning and Data Exploration

With the data comes a dictionary that has the feature name (Attribute) and its description.

I added the last 6 columns, going feature by feature: giving each one a new name (because the originals are in German), classifying it by type (categorical C, categorical ordinal C-O, numeric N, binary C-B), deciding whether to drop it (only 16 were dropped), deciding whether it's of interest, and finally deciding whether it will be split into dummy variables. For the most part, columns were dropped because they were categorical variables with too many possible values; converted to dummies, they would significantly increase the dataset's complexity.

Additionally, some columns in the dataset have no corresponding description in the dictionary; these will be dropped too.

There are also several distinct codes for missing values that are not coherent with the feature's numerical scale. For example, the share of car owners between the ages of 31 and 45 is an ordinal feature, but a "9" meaning "unknown" would be misinterpreted by any model as an even higher share, which would be incorrect. So we'll replace all such codes with one single value (-1).

Example of an incoherent representation of missing values, where both -1 and 9 stand for "unknown"

The cleaning steps were collected into, and are documented in, a function called clean_df.
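The full implementation lives in the project notebook; here is a minimal sketch of what it might look like, where COLS_TO_DROP, UNDESCRIBED_COLS, and UNKNOWN_CODES are placeholder names for lookups built from the feature summary and the data dictionary:

import pandas as pd

# placeholders; in the project these were built beforehand
COLS_TO_DROP = []      # columns flagged "drop" in the feature summary Excel
UNDESCRIBED_COLS = []  # columns with no entry in the data dictionary
UNKNOWN_CODES = {}     # {column: [codes meaning "unknown" for that column]}

def clean_df(df, extra_cols_to_drop=None, drop_columns_not_in_feature_excel=True):
    # drop flagged, undescribed, and caller-supplied columns
    drop = set(COLS_TO_DROP) | set(extra_cols_to_drop or [])
    if drop_columns_not_in_feature_excel:
        drop |= set(UNDESCRIBED_COLS)
    df = df.drop(columns=[c for c in drop if c in df.columns])
    # collapse every per-column "unknown" code to a single value (-1)
    for col, codes in UNKNOWN_CODES.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, -1)
    return df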

Let's check whether, after applying clean_df, any columns still have a high proportion of null values…

# check for proportion of nan values in the azdias dataframe
azdias.isnull().mean().sort_values(ascending=False)
>>
KBA13_ANZAHL_PKW 0.118714
KBA05_MOTRAD 0.000000
KBA13_ALTERHALTER_60 0.000000
KBA13_ALTERHALTER_45 0.000000
KBA13_ALTERHALTER_30 0.000000
KBA05_ZUL4 0.000000
KBA05_ZUL3 0.000000
KBA05_ZUL2 0.000000
KBA05_ZUL1 0.000000
KBA05_VORB2 0.000000
KBA05_VORB1 0.000000
KBA05_VORB0 0.000000

Looks like we're good to go. Now we can pick some of the columns that draw our attention and compare the distributions of customers vs. noncustomers.

We can see a higher proportion of people over 60 among customers than in the general population.

Customers also seem to be much more concerned about their finances than the general population, with higher proportions in the money saver, be prepared, and investor categories, and a lower proportion in the low financial interest category.

In line with the previous graph, they're also more likely to be at the higher end of the income distribution than the general population.

It also seems like they’re more traditionally and dutifully minded.

They're also more likely to be active consumers than the general population, and their characteristic consumption types, universal, versatile, and gourmet, appear in much higher proportions.

Clustering Model For Customer Segmentation

Great, we now have a general idea of some key differences between our customers and the general population. But there are literally hundreds of features, and we can't study each of them for every person to decide whether they are a potential customer. For this task, a clustering model comes in handy: once trained, it can systematically identify potential customers.

Data preprocessing

First, we're going to concatenate the customers and general population datasets, labeling each person as either customer or noncustomer:

azdias["iscustomer"] = 0
customers["iscustomer"] = 1
consolidated_demographics = pd.concat([
azdias,
customers.drop(['CUSTOMER_GROUP', 'ONLINE_PURCHASE','PRODUCT_GROUP'], axis=1)
], axis=0)

Then we're going to replace the categorical columns with dummy variables, according to the initial analysis recorded in the feature summary Excel file.

After doing so we just need to drop useless or unwanted columns such as:

  • LNR: which is just the identifier for each person
  • CAMEO_MICROCELL_CLASSIFICATION_DETAILED: a categorical feature that would add too many columns to the dataset if converted to dummy variables

Then we fill all remaining missing values with the previously defined unknown-value equivalent (-1).
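These preprocessing steps are bundled in a function called prepare_for_model; here is a minimal sketch of it, where DUMMY_COLS is a placeholder name for the list of categorical columns marked for dummy encoding in the feature summary:

import pandas as pd

DUMMY_COLS = []  # placeholder: categorical columns marked for dummy encoding

def prepare_for_model(df):
    # split the selected categoricals into dummy variables
    df = pd.get_dummies(df, columns=[c for c in DUMMY_COLS if c in df.columns])
    # drop the identifier and the overly wide categorical column
    df = df.drop(columns=["LNR", "CAMEO_MICROCELL_CLASSIFICATION_DETAILED"],
                 errors="ignore")
    # fill remaining missing values with the "unknown" sentinel
    return df.fillna(-1)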

Finally, for the clustering algorithm to work optimally, we need to scale the data to a common range; for this we'll use a min-max scaler.

from sklearn.preprocessing import MinMaxScaler

# scale every feature to [0, 1] so no single feature dominates the distances
scaler = MinMaxScaler()
consolidated_demographics_transformed = prepare_for_model(consolidated_demographics)
consolidated_demographics_transformed = scaler.fit_transform(
    consolidated_demographics_transformed.drop(["iscustomer"], axis=1)
)

Implementation & Refinement

We'll use KMeans here, trained on the consolidated dataframe built from both datasets. But instead of using the elbow method to decide how many clusters to use, we'll pick the number of clusters that maximizes the Gini coefficient calculated on the distribution of customers across the clusters: the higher the coefficient, the more unequally customers are distributed between the groups, and therefore the better the clusters separate customers from noncustomers.
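A sketch of that selection loop, reusing the gini helper from the Metrics section (the range of k and the random_state are assumptions):

from sklearn.cluster import KMeans

scores = {}
for k in range(2, 21):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(
        consolidated_demographics_transformed
    )
    # number of customers that landed in each cluster
    customers_per_cluster = (
        consolidated_demographics.assign(group=labels)
        .groupby("group")["iscustomer"].sum()
    )
    scores[k] = gini(customers_per_cluster.values)

best_k = max(scores, key=scores.get)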

With 12 clusters we get the highest Gini coefficient. Let's see how the model distributes customers between the clusters.

Results

kmeans = KMeans(n_clusters=12).fit(consolidated_demographics_transformed)
consolidated_demographics["group"] = kmeans.labels_

# share of all customers that falls into each cluster, smallest first
(consolidated_demographics.groupby("group").agg({"iscustomer": sum})
 / consolidated_demographics["iscustomer"].sum()).sort_values("iscustomer")

Awesome, the model managed to separate customers from noncustomers, placing 70% of customers in just 4 of the 12 groups. We can now use this model for future segmentation of potential customers.

Let’s see the differences between groups 8, 1, 2, and 4 (likely to be customers) vs groups 10, 11, 5, and 9 (very different from customers) to confirm that the model is working.

consolidated_demographics["group2"] = consolidated_demographics["group"].apply(lambda g: "LIKELYTOBECUSTOMER" if g in [1,4,2,8] else "NONCUSTOMER" if g in[9,11,5,10] else "MEDIUM")

The same differences as before can be seen between the two groups, with some of them, like household income, becoming even more pronounced, meaning the model is doing a good job of separating customers from noncustomers.

Conclusion

A KMeans model was able to discriminate between customers and noncustomers with good separation. However, KMeans assumes roughly convex clusters, so the separation could potentially be improved with a DBSCAN model, which does not depend on cluster shape. DBSCAN, though, would take forever to fit on such a high-dimensional dataset, so it might be a good idea to apply PCA first to reduce the number of features.
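A possible follow-up, sketched here with placeholder values (n_components, eps, and min_samples would all need tuning, e.g. via explained variance and a k-distance plot):

from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# reduce dimensionality first so DBSCAN stays tractable
pca = PCA(n_components=50, random_state=42)
reduced = pca.fit_transform(consolidated_demographics_transformed)

db = DBSCAN(eps=0.5, min_samples=20).fit(reduced)
consolidated_demographics["db_group"] = db.labels_  # -1 marks noise points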

Email Campaign Conversion Prediction Model

Last but not least, we are going to train a model on customer response data for a specific email campaign. This might prove useful as a way to efficiently send emails only to customers who are likely to respond; of course, we might have to send the email to a portion of our list first to gather responses for training future models.

Data cleaning and preprocessing

We have the same 366 columns as before, plus a RESPONSE column that tells us whether the customer converted on the campaign or not, so we can reuse the previously created clean_df and prepare_for_model functions for cleaning and preprocessing.

mailout_train = pd.read_csv('Udacity_MAILOUT_052018_TRAIN.csv', sep=';')

# drop columns where more than half of the values are missing
nan_values_per_column = mailout_train.isnull().mean()
columns_with_too_many_missing_data = list(
    nan_values_per_column[nan_values_per_column > 0.5].index
)

mailout_train = clean_df(
    mailout_train,
    extra_cols_to_drop=columns_with_too_many_missing_data,
    drop_columns_not_in_feature_excel=False,
)
mailout_train = prepare_for_model(mailout_train)

One important thing to notice is that we have a very imbalanced dataset, with only 1.2% of the almost 43,000 recipients converting to the campaign.

mailout_train["RESPONSE"].value_counts() / mailout_train.shape[0]
>>
0 0.987617
1 0.012383
Name: RESPONSE, dtype: float64

Implementation

We'll use XGBoost combined with cross-validation and BayesSearchCV from scikit-optimize, which replaces GridSearchCV's exhaustive search with Bayesian optimization over the hyperparameter space.

For the evaluation metric, we’ll use ROC-AUC since we’re going to test our model in a Kaggle competition that uses ROC-AUC as its evaluation criteria.

In order to tackle the imbalanced-data problem, we might want to use higher values for the scale_pos_weight parameter. From the XGBoost documentation:

scale_pos_weight [default=1]

Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances)
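For this dataset, that ratio works out to roughly 80 negatives per positive:

# sum(negative instances) / sum(positive instances) ≈ 0.9876 / 0.0124 ≈ 80
y = mailout_train["RESPONSE"]
scale_pos_weight = (y == 0).sum() / (y == 1).sum()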

Another parameter that needed special attention was gamma: the higher it is, the more conservative the model will be, which helps keep the number of false positives in check.

The model initially produced mediocre ROC curves, so the hyperparameter space was explored with BayesSearchCV, iterating over which ranges yielded better ROC curves…
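Here is a sketch of the search setup; the parameter ranges shown are illustrative rather than the exact ones used, and the clf__ prefix matches the pipeline step name:

from sklearn.pipeline import Pipeline
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBClassifier

pipeline = Pipeline([("clf", XGBClassifier(eval_metric="auc"))])

search = BayesSearchCV(
    pipeline,
    search_spaces={
        "clf__max_depth": Integer(2, 8),
        "clf__min_child_weight": Integer(1, 10),
        "clf__colsample_bytree": Real(0.5, 1.0),
        "clf__scale_pos_weight": Real(1, 100),
        "clf__gamma": Real(0, 5),
    },
    scoring="roc_auc",
    cv=5,
    n_iter=50,
)
search.fit(mailout_train.drop("RESPONSE", axis=1), mailout_train["RESPONSE"])
print(search.best_params_, search.best_score_)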

Results

After some iterations over various ranges of the hyperparameters, we arrived at the following ones, yielding a ROC-AUC of 0.86, much better than the 0.56 that we started with.

Best parameters: OrderedDict([
    ('clf__colsample_bytree', 0.9538787548328516),
    ('clf__max_depth', 4),
    ('clf__min_child_weight', 5),
    ('clf__scale_pos_weight', 64.19778036588166)
])
Mean CV score: 0.772314
Standard deviation of CV score: 0.028927

ROC curve for the XGBoost classifier

With a ROC curve reaching an AUC of 0.86 on the training data, we can now test the model against the test set of the corresponding Kaggle competition… and we got an AUC of 0.79.

Given the gap between the two AUC scores, the model is likely overfitting the training data.

Notes on previously tried methods

  • PCA was tried initially, giving a very high training AUC (0.91) but a low AUC against the test set; the model was overfitting
  • SMOTE was tried with no improvement; the imbalanced data was already accounted for by the scale_pos_weight parameter

Ideas for Improvement

  • Try dropping columns that are highly correlated with each other (see the sketch after this list)
  • Try dropping columns that have the same distribution between customers and noncustomers
  • Try other models, such as random forest
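For the first idea, a minimal sketch that drops one column from every highly correlated pair (the 0.95 threshold is a guess and would need validation):

import numpy as np

corr = mailout_train.corr().abs()
# keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
mailout_train = mailout_train.drop(columns=to_drop)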

Thank you for reading!
