Customer Segmentation & Forecasting Part I

10 min readFeb 7, 2021

In this example we are going to analyze the sales of a company in several countries: the consumer behavior using clustering and the revenue forecast using time-series with prophet library.

In this first part, we’ll use K-MEANS and hieralchical clustering.

Clustering is a non-supervised technique in ML.

We’ll use elbow method and dendograms to find the best number of clusters to clasify our customers and therefore prepare personalized marketing campaings.

Let’s start!

Dataset

I found the following dataset in kaggle. As you probably know Kaggle is a good site to learn, share knowledge, participate in competitions and download datasets.

Online Retail

Kaggle is the world's largest data science community with powerful tools and resources to help you achieve your data…

www.kaggle.com

Analizing the data

Pandas gives us a lot of functionality to load and analize the dataset.

# Data loading with pandas
dataset = pd.read_excel(io='Online Retail.xlsx')
dataset = pd.DataFrame(dataset)dataset.info()

dataset.describe()

The output is Statistical information for each column as max, min, mean etc.

With head and tail method you can see the first or last rows.

dataset.head(10)

Now, let’s review the null values:

To check if there are null values and the number of values for each feature, we use the following:

dataset.isnull().values.any()
dataset.isnull().sum()

You can delete the rows with null values or you can use the mean as in this case to avoid dataset reduction:

dataset = dataset.fillna(dataset.mean())

after that, we have to cast the string of InvoiceDate column to a date datatype:

dataset['InvoiceDate'] = pd.to_datetime(dataset['InvoiceDate']).dt.date

Let’s go to review the number of values of each feature:

pd.DataFrame([{'products': len(df['StockCode'].value_counts()),    
               'transactions': len(df['InvoiceNo'].value_counts()),
               'customers': len(df['CustomerID'].value_counts()),  
              }], columns = ['products', 'transactions', 'customers'], index = ['quantity'])

And the countries:

import plotly.graph_objs as go
from plotly.offline import iplotdata = dict(type='choropleth',
locations = countries.index,
locationmode = 'country names', z = countries,
text = countries.index, colorbar = {'title':'Order nb.'},
colorscale=[[0, 'rgb(224,255,255)'],
            [0.01, 'rgb(166,206,227)'], [0.02, 'rgb(31,120,180)'],
            [0.03, 'rgb(178,223,138)'], [0.05, 'rgb(51,160,44)'],
            [0.10, 'rgb(251,154,153)'], [0.20, 'rgb(255,255,0)'],
            [1, 'rgb(227,26,28)']],    
reversescale = False)
#_______________________
layout = dict(title='Number of orders per country',
geo = dict(showframe = True, projection={'type':'mercator'}))
#______________
choromap = go.Figure(data = [data], layout = layout)
iplot(choromap, validate=False)

As you can see UK is the country with more Sales

Calculating Recency, Frequency and Monetary

Recency, Frequency and Monetary(Revenue) is a very common technique used in marketing to understand the consumer behavior.

We are going to analyze the orders in Spain due to its low sales.

Calculate the number of products:

df_sp = dataset.query('Country == "Spain"')
tx_user = pd.DataFrame(df_sp['CustomerID'].unique())tx_user.columns = ['CustomerID']tx_num_productxCustomer = dataset.groupby('CustomerID').Quantity.count().reset_index()
tx_num_productxCustomer.columns = ['CustomerID','NumProducts']
tx_user = pd.merge(tx_user, tx_num_productxCustomer[['CustomerID','NumProducts']], on='CustomerID')
tx_user[:10].sort_values('CustomerID')

Calculate the Recency

# get the max purchase date for each customer and create a dataframe with ittx_max_purchase = df_sp.groupby('CustomerID').InvoiceDate.max().reset_index()
tx_max_purchase.columns = ['CustomerID','MaxPurchaseDate']# we take our observation point as the max invoice date in our datasettx_max_purchase['Recency'] = (tx_max_purchase['MaxPurchaseDate'].max() - tx_max_purchase['MaxPurchaseDate']).dt.days# merge this dataframe to our new user dataframetx_user = pd.merge(tx_user, tx_max_purchase[['CustomerID','Recency']], on='CustomerID')tx_user.head()

And let’s analyze it:

tx_user.Recency.describe()

And now with the histogram:

# Histogramlabel = tx_user["Recency"]
fig = plt.figure(figsize = (8,8))#plot the histogram
a=fig.add_subplot(2, 1, 1)
hist_bin_width = 1
label.plot.hist(color = 'lightblue', bins = range(0, 373,hist_bin_width) )#show the mean and de median values on the plot
plt.axvline(label.mean(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(label.median(), color='green', linestyle='dashed', linewidth=2)a.set_title('Histogram Customer&Recency')

The mean is near 90 and the median near 50

Calculate the Frequency

# Frequency calculation in tx_user datasettx_frequency = df_sp.groupby('CustomerID').InvoiceDate.count().reset_index()
tx_frequency.columns = ['CustomerID','Frequency']# add this data to our main dataframe
tx_user = pd.merge(tx_user, tx_frequency, on='CustomerID')tx_user.head()

Calculate the Revenue(Monetary)

# Calculate revenue for each customerdf_sp['Revenue'] = df_sp['UnitPrice'] * df_sp['Quantity']
tx_revenue = df_sp.groupby('CustomerID').Revenue.sum().reset_index()# merge it with our main dataframe
tx_user = pd.merge(tx_user, tx_revenue, on='CustomerID')tx_user.head()

Calculate the RFM score

We’ll calculate 3 numbers concatenated to form RFM value.

These numbers will be calculated using quartiles. I mean, if the value of R is great 75% the r_quartile will be 1, between 74% and 50% it will be 2 etc…

Therefore the best value of a r_quartirle will be 4, small values in recency will get higher value on the quartile (Recently means the last time customer came)

For Frequency and Monetary, it will be the opposite.

Once we get these 3 quartiles we’ll concatenate them in the RFM score value:

quantiles = tx_user.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()segmented_rfm = tx_userdef RScore(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
    else:
        return 1
    
def FMScore(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4
    
segmented_rfm['r_quartile'] = segmented_rfm['Recency'].apply(RScore, args=('Recency',quantiles,))
segmented_rfm['f_quartile'] = segmented_rfm['Frequency'].apply(FMScore, args=('Frequency',quantiles,))
segmented_rfm['m_quartile'] = segmented_rfm['Revenue'].apply(FMScore, args=('Revenue',quantiles,))
segmented_rfm.head()    
    
segmented_rfm['RFMScore'] = segmented_rfm.r_quartile.map(str) + segmented_rfm.f_quartile.map(str) + segmented_rfm.m_quartile.map(str)
segmented_rfm.head()

segmented_rfm.RFMScore.value_counts()

Standarization

#standarizationfrom sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScalerlabelencoder_X1 = LabelEncoder()sc = StandardScaler()
segmented_rfm[['CustomerID']] = sc.fit_transform(segmented_rfm[['CustomerID']])
segmented_rfm[['NumProducts']] = sc.fit_transform(segmented_rfm[['NumProducts']])
segmented_rfm[['Recency']] = sc.fit_transform(segmented_rfm[['Recency']])
segmented_rfm[['Frequency']] = sc.fit_transform(segmented_rfm[['Frequency']])
segmented_rfm[['Revenue']] = sc.fit_transform(segmented_rfm[['Revenue']])
segmented_rfm[['RFMScore']] = sc.fit_transform(segmented_rfm[['RFMScore']])segmented_rfm.head(10)

segmented_rfm.RFMScore.value_counts()

Clustering using the Elbow method and K-MEANS

We are going to use the elbow method to know the best number of cluster to classify customers and their consumer behaviors.

Let’s take the revenue and the RFMScore for clustering:

#Clustering RFM - Revenueimport seaborn as snsX = segmented_rfm.iloc[:,[4, 8]]wcss = {}
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = "k-means++", max_iter = 300, n_init = 10, random_state = 0)
    #y_kmeans = kmeans.fit(X)
    y_kmeans = kmeans.fit_predict(X)wcss[i] = kmeans.inertia_# Plot SSE for each *k*
plt.title('The Elbow Method')
plt.xlabel('Clusters numbers'); plt.ylabel('WCSS')
sns.pointplot(x=list(wcss.keys()), y=list(wcss.values()))
plt.show()

This is the output: 3 clusters. The point of the elbow method

From point 3 and 4 the graphic doesn’t change too much. So we could consider 3 or 4 clusters.

Now, we’ll apply K-MEANS :

X = tx_user.iloc[:,[4, 8 ]].values
kmeans = KMeans(n_clusters = 3, init="k-means++", max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(X)# Visualizing the clusters
plt.figure(figsize = (8,10))
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = "red", label = "CLUSTER1")
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = "blue", label = "CLUSTER2")
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = "green", label = "CLUSTER3")plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s = 30, c = "yellow", label = "Baricentros")
plt.title("Cluster de clientes")
plt.xlabel("Revenue")
plt.ylabel("RFM Score")
plt.legend()
plt.show()

Clustering using Dendograms and Hierarchical Clustering

We could thinking the elbow method is very simple to identify the best number of clusters, but it is very effective. Now we are going to use other method to view again the most optimal number of clusters using dendograms and hierarchical Clustering:

X = segmented_rfm.iloc[:,[4, 8]].valuesimport scipy.cluster.hierarchy as sch
plt.figure(figsize = (12,8))
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))plt.title('Dendogram')
plt.xlabel('Number of Customers')
plt.ylabel('Euclideam Distances')
plt.show()

To find the best number of cluster with this method you have to look for the most higher vertical line that doesn’t cross any horizontal.

In this case, it is the second vertical blue line to the right, given that it doesn’t cross in its longitude any horizontal line. This line cross 3 vertical lines, so the number of cluster is 3. But the red line bellow is near and we could also consider 4 clusters.

# Adjusting herarchical Clustering to the datasetfrom sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters = 3, 
                    affinity = 'euclidean', 
                    linkage = 'ward')y_hc = hc.fit_predict(X)# Visualising the clustersplt.figure(figsize = (8,10))
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
#plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.title('Clusters of customers')
plt.xlabel('Revenue')
plt.ylabel('RFMScore')
plt.legend()
plt.show()

Visualizing Clusters in 3D

For this representation we are going to use 4 clusters instead of 3.

from mpl_toolkits import mplot3dX = segmented_rfm.iloc[:,[5, 6, 7 ]].values# Applying k-means method
kmeans = KMeans(n_clusters = 4, init="k-means++", max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(X)plt.figure(figsize = (12,12))
ax = plt.axes(projection = "3d")
# Clusters visualizationax.scatter3D(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], X[y_kmeans == 0, 2], s = 100, c = "red", label = "Ocassional Customers")
ax.scatter3D(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], X[y_kmeans == 1, 2], s = 100, c = "blue", label = "Customers to Recover")
ax.scatter3D(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], X[y_kmeans == 2, 2], s = 100, c = "green", label = "New Customers")
ax.scatter3D(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], X[y_kmeans == 3, 2], s = 100, c = "cyan", label = "Golden Customers")
ax.scatter3D(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], kmeans.cluster_centers_[:,2], s = 200, c = "yellow", label = "Baricentros")
plt.title("RFM Score vs Revenue & Num Products")
plt.xlabel("Recency")
plt.ylabel("Frequency")
plt.ylabel("Revenue")
plt.legend()plt.show()

Now in interactive mode:

from mpl_toolkits import mplot3d
import plotly
import plotly.graph_objs as godef getTrace(x, y, z, c, label, s=2):
    trace_points = go.Scatter3d(
    x=x, y=y, z=z,
    mode='markers',
    marker=dict(size=s, line=dict(color='rgb(0, 0, 0)', width=0.5), color=c, opacity=1),
    name=label
    )
    return trace_points;def showGraph(title, x_colname, x_range, y_colname, y_range, z_colname, z_range, traces):
    layout = go.Layout(
    title=title,
    scene = dict(
        xaxis=dict(title=x_colname, range = x_range),
        yaxis=dict(title=y_colname, range = y_range),
        zaxis=dict(title=z_colname, range = z_range)
        )
    )
    fig = go.Figure(data=traces, layout=layout)
    plotly.offline.plot(fig)X = segmented_rfm.iloc[:,[5, 6, 7 ]].values# Applying k-means method kmeans = KMeans(n_clusters = 4, init="k-means++", max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(X)plt.figure(figsize = (12,12))
ax = plt.axes(projection = "3d")# clusters visualizationt1 = getTrace(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], X[y_kmeans == 0, 2], s = 5, c = "red", label = "Ocassional Customers")
t2 = getTrace(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], X[y_kmeans == 1, 2], s = 5, c = "blue", label = "Customers to recover")
t3 = getTrace(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], X[y_kmeans == 2, 2], s = 5, c = "green", label = "New Customers")
t4 = getTrace(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], X[y_kmeans == 3, 2], s = 5, c = "cyan", label = "Golden Customers")
centroids = getTrace(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], kmeans.cluster_centers_[:,2], s = 8, c = "yellow", label = "Baricentros")x = X[:,0] #recency
y = X[:,1] #Frequency
z = X[:,2] #monetaryshowGraph("Customers cluster", "Recency", [min(x), max(x)], "Revenue", [min(y), max(y)], "Frequency", [min(z), max(z)], [t1,t2,t3,t4, centroids])

This will open a new tab in your browser and you can rotate and check the R, F and M quartile value for any point.

And finally, we’ll add the segmentation labels for each customer:

cluster_labels = kmeans.labels_tx_customers = segmented_rfm.assign(Cluster = cluster_labels)
tx_customers.head(10)

Let’s see how many customers there are in each cluster:

rfm_agg = tx_customers.groupby('Cluster').agg({'CustomerID':'count'})
print(rfm_agg)

# RFM visualization
import squarify
import matplotlibnorm = matplotlib.colors.Normalize(vmin=min(rfm_agg.CustomerID.astype(float)), vmax=max(rfm_agg.CustomerID.astype(float)))
colors = [matplotlib.cm.Blues(norm(value)) for value in rfm_agg.CustomerID.astype(float)]fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(13, 7)
squarify.plot(sizes=rfm_agg['CustomerID'], 
              label=['Ocassional Customers','Customers to Recover','New Customers', 'Golden Customers'],
              color = colors,
              alpha=0.5)
plt.title("RFM Segments",fontsize=20)
plt.axis('off')
plt.show()