Survey Segmentation Tutorial

By Jason Wittenauer, Huron Consulting Group.

When reviewing survey data, you will typically be handed Likert questions (e.g., answers on a scale of 1 to 5, with 1 being bad and 5 being good). Using a few techniques, you can verify the quality of the survey and start grouping respondents into populations. The steps we will be following are listed below:

  • Analyzing our data set for scale.
  • Using Principal Component Analysis (PCA) to verify that the survey is sound and is grouping data.
  • Checking for correlated questions.
  • Setting up the Exploratory Factor Analysis (EFA) to create the final segments.


Data Review

The data set we will be using contains 90 respondents answering questions about how they like to purchase cars. This data was originally sourced from PromptCloud, but can be found in this repository under the data folder. There are 14 options that respondents consider when they purchase a car: price, safety, exterior looks, etc. You will notice that a Respondent ID column has been added to the file located in this repository. We will want to be able to tie the respondents to their segments for future reporting, so there should always be an ID included in your data set.


Setup Library

The first step is to set up all the packages that will be used in this analysis. The two main packages for the analysis are PCA (from scikit-learn) and FactorAnalyzer (from factor_analyzer), which generate all the modeling statistics we need to create our survey groupings.

In [1]:

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from factor_analyzer import FactorAnalyzer
import os



Import Data

Next, we will read in our data set.

In [2]:

os.chdir(r'C:\Projects\Survey Segmentation')
df = pd.read_csv('Data/CarPurchaseSurvey.csv')
df.head()


Out [2]:

As you can see, the data set has already been converted to numbers. If this has not been done in your data set, you will need to convert any text like “Good”, “Neutral”, “Bad”, etc. into a numeric format. In this data set, 1 is very low, and 5 is very high.
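
If your own survey arrives as text, the conversion can be sketched roughly as follows. The labels, column names, and 5-point mapping here are assumptions for illustration and are not taken from the car survey file.

```python
import pandas as pd

# Hypothetical text responses; labels and column names are assumptions.
raw = pd.DataFrame({
    "Respondent_ID": [1, 2, 3],
    "Price": ["Very Good", "Neutral", "Bad"],
    "Safety": ["Good", "Very Bad", "Neutral"],
})

# Assumed 5-point mapping: 1 is very low, 5 is very high.
likert_map = {"Very Bad": 1, "Bad": 2, "Neutral": 3, "Good": 4, "Very Good": 5}

question_cols = [c for c in raw.columns if c != "Respondent_ID"]
numeric = raw.copy()
for col in question_cols:
    # Series.map returns NaN for any label missing from the mapping.
    numeric[col] = raw[col].map(likert_map)

# Surface any label that the mapping did not cover.
unmapped = numeric[question_cols].isna().sum().sum()
print(numeric)
print("unmapped answers:", unmapped)
```

Checking for unmapped answers right away is a cheap way to catch misspelled or unexpected labels before they turn into missing values in the analysis.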


Confirm Answer Scale

If you don't know much about the survey data being analyzed, you can always check the scale of all the columns by looking at the min, max, and unique value counts. This will let you know whether you need to rescale the data.

In [3]:

columnStatistics = pd.DataFrame(df.max(axis=0)) # will return max value of each column
columnStatistics.columns = ['MaxValues']

columnStatistics['MinValues'] = df.min(axis=0) # will return min value of each column

columnStatistics['UniqueValues'] = df.nunique() # will return the number of distinct values in each column
columnStatistics


Out [3]:

It appears we have a scale of 1 to 5 for all these questions. Be careful with assuming the scale, though: you might end up with a question that just didn't have responses at the high or low end. This would make it appear to be on the same scale when it is not. The best option is always to review the original survey to verify all question scales. For our purposes, these questions were all on the same scale.
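
The caveat above can be turned into a small automated check. This is only a sketch: the helper name `scale_warnings`, the expected 1 to 5 range, and the toy columns are assumptions, and a flagged column still needs to be compared against the original survey.

```python
import pandas as pd

def scale_warnings(frame, low=1, high=5):
    """Return notes for columns whose observed answers do not span low..high."""
    notes = {}
    for col in frame.columns:
        observed_min, observed_max = frame[col].min(), frame[col].max()
        if observed_min < low or observed_max > high:
            notes[col] = "values outside the expected scale"
        elif observed_min > low or observed_max < high:
            notes[col] = "scale endpoints unused; check the original survey"
    return notes

# Toy answers, not the car survey data.
toy = pd.DataFrame({
    "Price": [1, 3, 5, 4],
    "Safety": [2, 3, 4, 4],       # never uses 1 or 5
    "Technology": [1, 5, 7, 3],   # 7 suggests a different scale
})
warnings_found = scale_warnings(toy)
print(warnings_found)
```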


Prepare Data

When we analyze the data set in PCA and EFA, we do not want to include the ID column as part of the analysis. However, we do want to keep it around for reference purposes. Let's make it the dataframe index.

In [4]:

df.set_index('Respondent_ID', inplace=True)


Check Survey Validity Using PCA

Now we can run PCA to determine whether the survey was written well enough to put respondents into various segments. First, we will set up our covariance matrix.

In [5]:

covar_matrix = PCA(n_components = len(df.columns)) # components are equal to the number of features we have
covar_matrix.fit(df)

Out [5]:

PCA(copy=True, iterated_power='auto', n_components=14, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

Next, we will plot the eigenvalues of our features to verify that there are at a minimum 2-3 features that have a value greater than 1.


In [6]:

plt.ylabel('Eigenvalue')
plt.xlabel('# of Features')
plt.title('PCA Eigenvalues')
plt.axhline(y=1, color='r', linestyle='--')
plt.plot(covar_matrix.explained_variance_)
plt.show()

Out [6]:
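
A note on where these eigenvalues come from: scikit-learn's PCA centers the data but does not rescale it, so the plotted values are eigenvalues of the sample covariance matrix. The "greater than 1" rule of thumb is usually stated for the correlation matrix, where the eigenvalues sum to the number of questions and 1 is therefore the average. The sketch below uses synthetic answers (an assumption, not the survey data) to show both versions side by side with plain NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for survey answers: 90 respondents, 14 questions on a 1-5 scale.
X = rng.integers(1, 6, size=(90, 14)).astype(float)

# Eigenvalues of the sample covariance matrix (what PCA on unscaled data reports).
cov_eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# Eigenvalues of the correlation matrix (equivalent to standardizing each question first).
corr_eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

print("covariance-based eigenvalues > 1:", int((cov_eigenvalues > 1).sum()))
print("correlation-based eigenvalues > 1:", int((corr_eigenvalues > 1).sum()))
```

When every question shares the same scale, as in this tutorial, the two views tend to tell a similar story; if scales differ, the correlation-based version is the safer one to read against the threshold of 1.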

After confirming the eigenvalues, we can check to see that something less than the total number of features explains a large portion of the variance. In this case, we set the threshold at 80% and it looks like 6 features (less than 14) are explaining at least 80% of the variance.

In [7]:

variance = covar_matrix.explained_variance_ratio_ # calculate variance ratios
var = np.cumsum(np.round(variance, decimals=3)*100) # cumulative % of variance explained

plt.ylabel('% Variance Explained')
plt.xlabel('# of Features')
plt.title('PCA Variance Explained')
plt.axhline(y=80, color='r', linestyle='--')
plt.plot(var)
plt.show()

Out [7]:
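
Instead of reading the count off the chart, the number of components needed to reach a threshold can be computed directly. This is a minimal sketch; the helper `components_needed` and the example ratios are hypothetical and are not the values from this survey.

```python
import numpy as np

def components_needed(explained_variance_ratio, threshold=0.80):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted finds the first position where cumulative >= threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical ratios, sorted from largest to smallest as PCA reports them.
ratios = np.array([0.35, 0.25, 0.15, 0.10, 0.08, 0.07])
needed = components_needed(ratios)
print(needed)
```

With a fitted scikit-learn PCA, the same helper could be called on `explained_variance_ratio_`.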

The last part of our initial survey validation checks is to make sure that the components of the PCA are showing different types of populations. If all the populations show Safety and Resale_Value as their top 2 features, then the survey is not segmenting the population very well. In our code below, we will be looking at the top 3 features for each component, which look like very different populations.

In [8]:

components = pd.DataFrame(covar_matrix.components_, columns = df.columns)
components.rename(index = lambda x: 'PC-' + str(x + 1), inplace=True)

# Top 3 positive contributors
pd.DataFrame(components.columns.values[np.argsort(-components.values, axis=1)[:, :3]], 
                  columns = ['1st Max','2nd Max','3rd Max'])

Out [8]:


Correlating Questions

Survey questions sometimes fail to produce different results. For example, everyone who rates Safety high might also rate Technology high. When that happens, having both questions will not necessarily help with doing a mathematical segmentation. This does not mean that they are invalid questions to have listed, though. There could be a lot of business value in knowing that Safety and Technology correlate highly. When you find correlating questions, it is a good idea to discuss with your business users which ones should be removed (if any!).

Our raw data output of correlating questions can be seen below (1 = perfect correlation and 0 = no correlation).

In [9]:

df.corr()

Out [9]:
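
Scanning a 14 by 14 table by eye gets tedious, so one option is to list only the strongly related pairs. The sketch below uses toy answers and an arbitrary 0.8 cutoff, both of which are assumptions. Note that Pearson correlations can also be negative, so the filter uses the absolute value.

```python
import numpy as np
import pandas as pd

# Toy answers; Safety and Technology are built to move together on purpose.
toy = pd.DataFrame({
    "Price":      [1, 2, 3, 4, 5, 3],
    "Safety":     [5, 4, 4, 2, 1, 3],
    "Technology": [5, 5, 4, 2, 1, 3],
})

corr = toy.corr()  # Pearson correlation between every pair of questions

# Keep only the upper triangle so each pair is listed once, without the diagonal.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().rename("correlation").reset_index()
pairs.columns = ["question_1", "question_2", "correlation"]

# 0.8 is an arbitrary cutoff chosen for illustration.
strong = pairs[pairs["correlation"].abs() >= 0.8]
print(strong)
```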

This can also be visually represented in a heat map. In this visualization, darker is good because the questions do not correlate.

In [10]:

plt.matshow(df.corr())
plt.xticks(range(len(df.columns)), df.columns, rotation='vertical')
plt.yticks(range(len(df.columns)), df.columns)
plt.show()

Out [10]:


Using EFA to Create Segments

Now that we have verified there is segmentation happening with the survey results, we can start analyzing how many segments we want. This is where it starts to mix art and science. Sometimes you want more segments because it is important to include a feature that might not be captured with fewer segments. Other times, the business need might just be “create 4, and only 4 segments because we have 4 flavors of this new food being marketed”. Regardless of the situation, we can use EFA to create our segments and verify that the segments are what we want.


Review the Scree Plot

To start the analysis, we need to create a scree plot. To do this, we need to look at the eigenvalues.

In [11]:

fa = FactorAnalyzer(rotation=None, n_factors=len(df.columns))
fa.fit(df)

# Check Eigenvalues
ev, v = fa.get_eigenvalues()
ev

Out [11]:

array([2.75506068, 2.1640701 , 1.46454689, 1.32990296, 1.04029066,
       0.99198697, 0.80634535, 0.68102944, 0.60136568, 0.5536899 ,
       0.51364695, 0.46653069, 0.35660717, 0.27492656])

Now that we have a list of the eigenvalues, we can map them to our factors.

In [12]:

plt.plot(range(1, len(ev) + 1), ev, marker='o')
plt.title('Scree Plot')
plt.axhline(y=1, color='r', linestyle='--')
plt.show()

Out [12]:

This plot should look very familiar, as we used a similar plot above with the PCA. And just like that earlier analysis, we are going to look for the number of factors that are above one to determine how many initial segments we want to create. In this case, we will be creating 5 segments. Ideally, we want to map the VSS Complexity and Parallel Analysis lines to give us a range of segments to try out. While this is built into packages in R, with Python, there doesn't seem to be an easy way to do this. So, we will just have to rely on trial and error to do our segmentation.
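
For readers who want something closer to a parallel analysis line, a rough version can be hand-rolled with NumPy. Treat the following as a sketch under stated assumptions rather than a replacement for the R implementations: the function `parallel_analysis` is hypothetical, it compares eigenvalues of the correlation matrix (not common-factor eigenvalues) against the 95th percentile of eigenvalues from normally distributed random data of the same shape, and the demo data is synthetic with two built-in factors.

```python
import numpy as np

def parallel_analysis(data, n_iter=200, percentile=95, seed=0):
    """Count leading factors whose eigenvalues beat those of random data."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = data.shape
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

    random_eigs = np.empty((n_iter, n_cols))
    for i in range(n_iter):
        noise = rng.standard_normal((n_rows, n_cols))
        random_eigs[i] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    threshold = np.percentile(random_eigs, percentile, axis=0)

    # Retain factors until the first observed eigenvalue that fails to beat its threshold.
    keep = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break
        keep += 1
    return keep, observed, threshold

# Synthetic data with two built-in factors, three questions each (not the survey data).
rng = np.random.default_rng(1)
latent = rng.standard_normal((300, 2))
pattern = np.zeros((2, 6))
pattern[0, :3] = 0.8
pattern[1, 3:] = 0.8
data = latent @ pattern + 0.6 * rng.standard_normal((300, 6))

n_keep, observed, threshold = parallel_analysis(data)
print("factors suggested:", n_keep)
```

Plotting `threshold` on top of the scree plot would give a data-driven alternative to the fixed line at 1, which could then be compared against the trial-and-error result.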


Create Segments and Review Loadings

Now that we know there are 5 segments for the initial analysis, we can create a new model with these segments and see how each feature is positively or negatively reflected in each segment.

In [13]:

fa = FactorAnalyzer(rotation="varimax", n_factors=5)
fa.fit(df)

# Check loadings
loadings = pd.DataFrame(fa.loadings_)
loadings.rename(columns = lambda x: 'Factor-' + str(x + 1), inplace=True)
loadings.index = df.columns
loadings

Out [13]:

In the above table, each factor can be considered a segment. You might want to combine these for business purposes into super segments, but they do represent distinct populations. When we analyze the segments, it helps to put a limit on the numbers/relationship strength. For example, if we remove everything whose absolute value is less than .4 (positively or negatively), we end up with the table below.

In [14]:

segments = loadings[loadings >= .4].fillna(loadings[loadings <= -.4])

Out [14]:

Now we can start naming the segments based on the features that are within each factor. To do this, we just name the columns.

In [15]:

segment_names = ['Overall Cost', 'Comfort and Fuel Efficiency', 'Review Confirmer', 'Service', 'Color Trumps All']
segments.columns = segment_names

Out [15]:


Check Variance and Do Adequacy Checks

In [16]:

# Check variance
factorVariance = pd.DataFrame(fa.get_factor_variance())
factorVariance.rename(columns = lambda x: 'Factor-' + str(x + 1), inplace=True)
factorVariance.index = ['SS Loadings', 'Proportion Variance', 'Cumulative Variance']
factorVariance

Out [16]:

It appears to be like like 5 components can clarify 45% of our variance. We in all probability wish to shoot for one thing over 50%, so we have to improve the issue depend. Nevertheless, what actually issues is the enterprise case concerned.


Decision Time

We have validated our survey and come up with some initial segments. There is a catch, though, as our segments didn't use all the survey questions (Safety and Technology were not used). If it is important to find people to sell Safety or Technology features to, then we would need to increase the segments from 5 to 6 and re-run the EFA portion of our analysis. Another requirement might be that we can only have 3 segments, so we need to either reduce the factors for the EFA (which would reduce the features used) or combine the 5 segments into 3 segments. Just knowing how to do a mathematically correct segmentation does not necessarily translate into something usable by a business.
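
Spotting questions that no segment uses can be automated once the loadings are thresholded. This sketch assumes a loadings table shaped like the one built earlier (questions as rows, factors as columns); the numbers and the three question names are made up for illustration.

```python
import pandas as pd

# Hypothetical loadings, not the values from this survey.
loadings = pd.DataFrame(
    {"Factor-1": [0.72, 0.10, 0.05], "Factor-2": [0.08, -0.55, 0.21]},
    index=["Price", "Color", "Safety"],
)

# Keep only loadings at or beyond the 0.4 cutoff in either direction.
strong = loadings.where(loadings.abs() >= 0.4)

# A question is unused when none of its loadings survive the cutoff.
unused = strong.index[strong.isna().all(axis=1)].tolist()
print(unused)
```

Re-running this after each change to the factor count gives a quick view of which questions the current segmentation leaves out.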


Export for Reporting

Once the segments are approved, it is time to prepare the data for exporting. We will apply our segments to the original data, unpivot the correlation matrix, and unpivot the loadings for the accompanying report dashboard.

In [17]:

# Data mapped to factors
factor_scores = pd.DataFrame(fa.transform(df))
factor_scores.columns = segment_names
factor_scores['Respondent_ID'] = df.index
df_export = pd.merge(df, factor_scores, on='Respondent_ID')
df_export['Primary Segment'] = df_export[segment_names].idxmax(axis=1)
df_export.to_csv('Data/Data_Scored.csv', index=False)

# Correlation matrix
correlation_export = df.corr().unstack().reset_index(name='value')
correlation_export.columns = ['Feature 1', 'Feature 2', 'Value']
correlation_export.to_csv('Data/Correlations.csv', index=False)

# Loadings
loadings.columns = segment_names
loadings_export = loadings.unstack().reset_index(name='value')
loadings_export.columns = ['Segment', 'Feature', 'Value']
loadings_export.to_csv('Data/Loadings.csv', index=False)
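
Before handing the files over, it can be worth checking how many respondents land in each primary segment, since a segment with only a handful of people may not be usable. The sketch below applies the same highest-score rule as the export code to made-up factor scores; the IDs and values are assumptions.

```python
import pandas as pd

# Hypothetical factor scores for four respondents and two of the named segments.
scores = pd.DataFrame(
    {"Overall Cost": [1.2, -0.3, 0.1, 0.9], "Service": [0.4, 0.8, -0.2, 1.1]},
    index=pd.Index([101, 102, 103, 104], name="Respondent_ID"),
)

# Each respondent's primary segment is the one with the highest score.
primary = scores.idxmax(axis=1)

# Segment sizes give a quick sanity check on the split.
sizes = primary.value_counts()
print(primary)
print(sizes)
```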



In this tutorial, you learned how to create a segmentation analysis based on Likert survey questions. Hopefully, you feel empowered to generate your own analysis based on new data.


Bio: Jason Wittenauer is a Lead Data Scientist specializing in the healthcare industry, and leads development of new analytics tools that incorporate data science technology in an easy-to-understand format. Jason graduated from Brigham Young University and is a data science developer, researcher, practitioner, and educator with over 12 years of industry experience. He has created many healthcare enabling technologies that include predicting denials, automating rule pattern discovery for care variation, and building a host of tools to enable healthcare professionals to work more efficiently.

