Case Study: Analysing customer sentiment with R programming

“No great marketing decisions have ever been made on quantitative data.” – John Sculley

Introduction

Written in conjunction with Rachel Kirkham and Tom Daniel.

Our chosen dataset is real anonymised data taken from a women’s online clothing retailer and consists of 23,000 customer reviews and ratings. The dataset contains ten fields including consumer age, rating number from 1-5, and written reviews.

Starting from a dataset and then deciding how to make use of it is an error, as Mela & Moorman (2018) point out. For the purpose of this exercise, however, we imagined ourselves as Marketing Executives setting objectives to create a brief for a Data Scientist.

It was clear from manually reading a selection of the reviews that there was a high level of customer service and positive emotion connected to the Brand. Positive reviews clearly outnumbered the poor ratings (1-3).

However, the data showed that 17.7% of customers would not recommend their purchase. This was an interesting opportunity to explore the impact of sentiment within the written reviews, and whether appealing to customers’ emotional motivators could convert them into customers who would recommend the product.

We defined the business problem as:
Nearly 18% of customers would not recommend their purchase and are therefore unlikely to purchase from the brand again. However, this isn’t always reflected in poor ratings or reviews of their purchase. As a Brand, we want to be able to recognise who is unsatisfied with their purchase so we can engage them and retain them as customers.

The objective is:
Finding negative sentiment within reviews that prevents the customers from recommending the product so that the brand can take targeted actions to improve.

Within this report we utilise the AI Canvas framework, outlined by Agrawal, A., Gans, J. & Goldfarb, A. (2018), to help us structure and analyse the findings from the data. Our initial analysis informs the Prediction part of the framework, enabling us to clarify what the machine learning will contribute, make a judgement on the output and decide how we can use it to take action.

This insight will be key, as the reviews skew positive in this dataset (77.5% scoring 4 or 5), and the initial analysis of the input data will enable us to break down the reviews into subsets to better identify indicators of customer dissatisfaction and communicate insights.

The insights from the ‘Input Data’ highlighted key phrases that potentially held negative sentiment. By testing a range of phrase lengths in our ‘Outcome’ section, we were able to identify phrases that indicated specific customer concerns.

Combining the framework with the sentiment analysis gives an informed insight into potential opportunities for the business that would not have been highlighted through single-word analysis alone. We are therefore able to highlight opportunities for increasing the company’s emotional connection with its customers, rather than just pointing out problems.

Step One – Prediction specification and refinement

We undertook some initial analysis of the reviews so that we could clearly specify our requirements to make actionable recommendations. This involved understanding the words associated with each category of clothing, and their related sentiments.

To process the text within the data set we wrote two functions: the first produces a word cloud, and the second analyses the reviews for NRC sentiment. NRC sentiment uses the NRC emotion lexicon to assign English words to eight basic emotions. We used this lexicon because we were interested in the emotional connection the customers had to their purchase, and wanted to highlight the context that the reviews were written in.

The word cloud function subsets the data, turns the reviews into a Simple Corpus, and then strips whitespace, converts all text to lower case, and removes numbers, punctuation and stop words. The text was not stemmed, as we wanted whole words for visualisation purposes. The corpus is then converted to a Document Term Matrix, from which word counts are produced.
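The original functions were written in R and are not reproduced here. As a rough illustration of the cleaning and counting steps just described, here is a minimal Python sketch using only the standard library. The stop-word list, the function names `clean_review` and `term_counts`, and the sample reviews are all assumptions made for this example, not part of the original analysis.

```python
import re
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real analysis
# would use a full English stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "this", "is", "it", "i",
              "in", "to", "of", "but", "was"}


def clean_review(text):
    """Lower-case, drop digits and punctuation, then remove stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() also collapses any repeated whitespace
    return [word for word in text.split() if word not in STOP_WORDS]


def term_counts(reviews):
    """Count how often each remaining word occurs across all reviews."""
    counts = Counter()
    for review in reviews:
        counts.update(clean_review(review))
    return counts


# Made-up reviews for illustration only, not taken from the dataset.
sample_reviews = [
    "This dress is perfect, and the fit is great!",
    "Beautiful fabric but the fit was tight in size 8.",
]
counts = term_counts(sample_reviews)
print(counts.most_common(5))
```

The resulting counts are the kind of input a word cloud needs: each word is drawn at a size proportional to its frequency.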

The resulting word clouds showed that customers used emotive words such as “perfect, great, beautiful” to describe their purchases, suggesting that they are generally impressed with the quality of the clothing. But there were also other words that needed greater context, such as “fit, different, tight”, that could highlight areas of negative sentiment.

We therefore set out to identify which phrases customers used that the Brand could use as performance indicators.

Step Two – Judgement

In this analysis, we are seeking to identify an indicator that a human should intervene. We value identification of words or phrases that indicate negative sentiment the company could address. Error handling is not relevant for this kind of analysis.

To drill down further into the data, we generated a word cloud for each of the 20 categories of clothing. This step highlighted the most frequently used terms for each product category and gave us a sense of the content of the reviews. It also let us manually identify a number of themes for further analysis that we would have struggled to find without the data analysis.

We found the four most frequently used keywords across the 20 category insights were “size, fabric, fit and colour”. Interestingly this insight highlighted non-emotive words which on their own held no context.

This enabled us to compare sentiment across products based on those aspects and potentially identify issues with the products based on the comparison.

Step Three – Action

We are trying to identify phrases that can be used to indicate customer dissatisfaction. This meant looking at the categories we had identified as having potential issues and identifying phrases that could be potential indicators.

The five categories of clothes we have compared are Dresses, Blouses, Jeans, Jackets and Knits. The sentiment analysis revealed that whilst the sentiments of ‘joy’ and ‘trust’ were consistently highly ranked, the sentiment of ‘anticipation’ was also ranked in the top three for all five categories. This gave a clear indicator that positive language might be being used to describe negative sentiment or feedback that was preventing the customer from recommending their purchase.
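To make the idea of ranking emotions per category concrete, the sketch below counts lexicon matches for each category and returns the most frequent emotions. It is written in Python with the standard library only, whereas the original analysis was done in R. The `TOY_LEXICON` mapping is a hypothetical handful of entries invented for this example and is not the NRC lexicon; the function names and sample reviews are likewise assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical miniature lexicon for illustration only. The real NRC
# lexicon covers thousands of words; these mappings are invented here.
TOY_LEXICON = {
    "love": ("joy",),
    "perfect": ("joy", "trust", "anticipation"),
    "excited": ("anticipation", "joy"),
    "disappointed": ("sadness", "anger"),
}


def emotion_counts(words):
    """Count one hit per emotion for every word found in the lexicon."""
    counts = Counter()
    for word in words:
        for emotion in TOY_LEXICON.get(word, ()):
            counts[emotion] += 1
    return counts


def top_emotions_by_category(reviews, top_n=3):
    """reviews: iterable of (category, text). Returns top emotions per category."""
    totals = defaultdict(Counter)
    for category, text in reviews:
        totals[category].update(emotion_counts(text.lower().split()))
    return {
        category: [emotion for emotion, _ in counts.most_common(top_n)]
        for category, counts in totals.items()
    }


# Made-up reviews for illustration only.
sample_reviews = [
    ("Dresses", "excited to wear this perfect dress"),
    ("Dresses", "i love it"),
    ("Jeans", "excited but disappointed"),
]
ranking = top_emotions_by_category(sample_reviews)
print(ranking["Dresses"])
```

Note that a word-level count like this cannot tell whether “love” appears in praise or in a phrase such as “I wanted to love”, which is why looking at longer strings of words becomes necessary.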

This revealed that we needed to analyse the most common strings of words used in the product reviews in order to understand which phrases could be used to detect customer issues. By understanding the context in which these phrases are used and common themes among the reviews we can select the best indicator.

To do this we created a function to conduct n-gram analysis. We tested analysing bigrams, trigrams and 4-word grams to understand which gave the most contextual information. We chose to conduct n-gram analysis as a quick way to get a deeper contextual understanding of the reviews that indicated potential size, fit, colour and fabric issues.

Here is an example of our output for Intimates; a loop was used to apply this analysis to the categories identified for further investigation.

We found that the tri-gram analysis was most useful, followed by 4-word grams. As the number of words included in a token increases, the frequency of those tokens generally decreases; five-word grams were meaningless, for example, as they diluted the common sentiment too much.
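The n-gram counting described above can be sketched in a few lines. This is a Python illustration using the standard library, not the authors’ R function; the names `ngrams` and `ngram_counts` and the sample reviews are assumptions. Stop words are deliberately kept in this sketch, on the assumption that phrases such as “in the back” would disappear if they were removed.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into a phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(reviews, n):
    """Count n-word phrases across reviews (text assumed already cleaned)."""
    counts = Counter()
    for review in reviews:
        counts.update(ngrams(review.lower().split(), n))
    return counts


# Made-up reviews for illustration only.
sample_reviews = [
    "i wanted to love this dress",
    "i wanted to love this top but it was tight in the back",
    "pretty but too loose in the back",
]

# Compare phrase lengths, as in the bigram / trigram / 4-word test.
for n in (2, 3, 4):
    print(n, ngram_counts(sample_reviews, n).most_common(3))
```

Running the same function with different values of `n` is one simple way to compare how much context each phrase length carries.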

Dresses

We identified phrases here that warranted further investigation: “wanted to love this” and “I wanted to love” appear frequently not only for dresses, but for other categories we analysed.

Blouses

We can see here that “wanted to love this” and “in the back” are phrases that could be indicators of an issue and appear frequently for blouses.

Jeans

We can see here that “size” and “fit” occur frequently for Jeans, so we did further analysis of the category. There were no phrases identified that seemed like a good candidate for testing.

Jackets

Here we can see that size and fit occur frequently for this category. “in the back” is a phrase that could indicate fit issues, and occurs frequently for other categories, so we selected it for further analysis.

Knits

So, if we consider the analysis undertaken for Knits, we can see that “the material is” appears frequently. We considered this phrase worthy of investigation. We can also see “in the back” appears frequently for this category.

As mentioned in the Introduction, the reviews skew positive in this dataset (77.5% scoring 4 or 5), so we also chose to analyse the positive reviews further. We took all reviews with the phrase “I love” and a review score of 5, and conducted Parts of Speech tagging using the English-EWT model (https://universaldependencies.org/format.html) to identify the adjectives used in those reviews.

With this insight, we then analysed the negative reviews, to see if there were any overlaps in language that we could draw from. The fact that anticipation ranked so high in all categories suggested that the consumer was unsure about the purchase and then pleasantly surprised, or was excited about the purchase and then let down by the reality of the product.

Step Four – Outcome

Our key metric for success was identifying key phrases the Brand could use to identify customer dissatisfaction.

Our analysis revealed the phrase “I wanted to love” as a recurring theme in reviews that wouldn’t recommend the product. This phrase stood out as a clear example of where single words associated with positive sentiment, in this case “love”, were being used to imply disappointment.

This analysis pulled out three phrases we were interested in analysing: “I wanted to love”, “the material is” and “in the back”. All three phrases imply that there is context that could be associated with anticipation. Analysing how these phrases are used across the five chosen categories, and the purchase ratings that accompany them, would enable us to refine the context in which they are being used.

“I wanted to love” is an emotive phrase, conveying disappointment despite anticipated satisfaction, and suggests a negative customer experience. For every product class, the average rating for reviews containing this phrase is significantly below the class average, and the ratings of reviews containing the phrase are limited to the lower half of the scale, with scores from one to three. This analysis suggests the phrase has an amplified negative polarity.
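The comparison described here, the average rating of reviews containing a phrase against the average for the whole product class, can be sketched as follows. This is an illustrative Python version using the standard library, not the original R code. The function name `phrase_rating_summary`, the tuple layout and the sample reviews are assumptions, and the numbers it prints come from the made-up sample rather than from the dataset.

```python
from collections import defaultdict
from statistics import mean


def phrase_rating_summary(reviews, phrase):
    """reviews: iterable of (product_class, rating, text).

    For each class with at least one review containing the phrase,
    report the class mean rating alongside the mean and range of
    ratings for the reviews that contain the phrase.
    """
    all_ratings = defaultdict(list)
    phrase_ratings = defaultdict(list)
    for product_class, rating, text in reviews:
        all_ratings[product_class].append(rating)
        if phrase in text.lower():
            phrase_ratings[product_class].append(rating)
    return {
        product_class: {
            "class_mean": mean(all_ratings[product_class]),
            "phrase_mean": mean(ratings),
            "phrase_min": min(ratings),
            "phrase_max": max(ratings),
        }
        for product_class, ratings in phrase_ratings.items()
    }


# Made-up reviews for illustration only.
sample_reviews = [
    ("Dresses", 5, "Love it"),
    ("Dresses", 4, "Great fit"),
    ("Dresses", 2, "I wanted to love this dress"),
    ("Dresses", 3, "I wanted to love it but the fit was off"),
    ("Blouses", 5, "Perfect"),
]
summary = phrase_rating_summary(sample_reviews, "i wanted to love")
print(summary["Dresses"])
```

A formal significance test would be needed on top of this before describing a difference as significant; the sketch only produces the descriptive figures.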

“I wanted to love” appears in 0.66% of reviews and with similar frequency in all product classes, so it does not point to a particular issue with certain products. However, the emotive nature of the phrase itself presents a marketing opportunity. As described in Zorfas & Leemon (2016), brands that can create an emotional connection with customers are better able to maximise customer value. Real-time sentiment monitoring could identify the phrase “I wanted to love” and immediately connect with an upset customer, engaging in a dialogue and potentially providing further action, such as a discount on their next purchase. This fast and personal response is explained by Ruttenberg et al. (2018) as the level of personal service that creates an emotional connection with consumers and ultimately drives demand and a willingness to pay more in the future for goods and services.

The phrase “the material is” initially seems to suggest neutral sentiment, with no implied customer experience. As with “I wanted to love”, however, the average ratings for reviews containing “the material is” for all product classes are significantly below the average rating for the same product class and the range for ratings containing the phrase is limited to the lower half of the ratings scale, scores from one to three. This analysis suggests the phrase has a negative connotation, suggesting there is an issue with the materials used in the products.

“The material is” appears in 1.01% of all reviews, with some variation between the product classes. However, there may be an element of confirmation bias contained in the phrase: perhaps materials are only mentioned when they fall below expectations and are otherwise taken for granted. Given the different materials used for each product, analysis of a single year’s data is not enough to gauge the expectations of customers for that product. We also don’t have enough information on the different product offerings to make specific recommendations. The important information is the trend within each product, so year-on-year analysis should be carried out, with particular attention given to changes in trend when new fabrics and materials are introduced. Changes in the frequency of “the material is” could be a leading indicator of customer satisfaction, allowing brands to choose the optimal blend of fabrics for each product.

“In the back” was selected as it could highlight fit and design issues for some products. Upon reading the reviews associated with this phrase, there is a recurring theme of blouses having fit issues (for example, “this top was too tight across my chest but billowed out in the back”). The brand should identify the specific blouses related to these reviews to adjust their product design.
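A year-on-year view of phrase frequency could be computed along the following lines. This Python sketch uses only the standard library and assumes each review carries a year, which is a hypothetical field: the analysis above covers a single year of data. The function names and sample reviews are also assumptions made for illustration.

```python
from collections import Counter


def phrase_share_by_year(reviews, phrase):
    """reviews: iterable of (year, text).

    Returns {year: share of that year's reviews containing the phrase}.
    The year field is hypothetical; it is not described in the dataset.
    """
    totals = Counter()
    hits = Counter()
    for year, text in reviews:
        totals[year] += 1
        if phrase in text.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


def year_on_year_change(shares):
    """Difference in share between each year and the previous one."""
    years = sorted(shares)
    return {
        later: shares[later] - shares[earlier]
        for earlier, later in zip(years, years[1:])
    }


# Made-up reviews for illustration only.
sample_reviews = [
    (2018, "The material is thin"),
    (2018, "Lovely colour"),
    (2018, "Great fit"),
    (2018, "Runs small"),
    (2019, "The material is scratchy"),
    (2019, "Sadly the material is see-through"),
    (2019, "Perfect"),
    (2019, "Nice dress"),
]
shares = phrase_share_by_year(sample_reviews, "the material is")
print(shares, year_on_year_change(shares))
```

Run per product, a rising share after a fabric change would be the kind of signal worth investigating.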

Final Considerations

The final considerations section of the AI Canvas framework looks at ‘the types of data that will be required to train, operate and improve AI’. As mentioned in the ‘Judgement’ section, we’re seeking an indicator for human intervention, rather than an automated algorithm. We have analysed the input data thoroughly and, whilst we haven’t trained a model on the data, our n-gram analysis (testing bigrams, trigrams and 4-word grams to understand which gave the most contextual information) identified phrases that could be used as features for training a review classifier in the future. We could also use changes in sentiment as feedback to update such a model and improve its performance.

Conclusion and recommendations

The range of outcomes within the report highlights the importance of defining the problem you want to solve before collecting your data. Whilst we had detailed information about what each customer had bought, the most insightful information was captured in free-text reviews, which made it difficult to analyse accurately why customers were not happy.

Given that 17.7% of customers would not recommend their purchase, we recommend that the Brand add a data field at the point of review, allowing the customer to indicate why they are unhappy. This would make the analysis much faster: the results could be grouped directly into key themes, such as the “size, fabric, fit and colour” themes we recognised broadly across the product categories, and fed back to the business design team to improve the product.

This may also allow the brand to use its customer service resources more efficiently. A review with the phrase ‘I wanted to love’ combined with the field suggested above could trigger a direct response by a customer service representative, engaging the customer and empathising with their disappointment to resolve the issue, retain them as a customer and potentially extend their customer lifespan. This could be extended to the brand’s social media presence and enriched by using personal sign-offs to emphasise the human response (Ruttenberg, A., Tripp, A., Dibner, C., Mitchell, J. & Huang, W. 2018).

We also recommend that the brand take the opportunity to further the use of AI across the full range of experience insights. Where the brand can engage and delight its consumers through positive interaction that builds on emotional motivators, such as a sense of belonging, as recommended by Zorfas, A. & Leemon, D. (2016), it has the opportunity to convert them into emotionally connected customers and double their value to the Brand.
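As a rough sketch of how such a trigger might work, the Python example below queues any review containing a trigger phrase for a customer service follow-up and attaches the customer’s stated reason where one was given. It uses only the standard library. The `Review` class, the `reason` field (standing in for the proposed new data field), the phrase list and the sample reviews are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Phrases treated as dissatisfaction indicators (illustrative choice).
TRIGGER_PHRASES = ("i wanted to love",)


@dataclass
class Review:
    review_id: int
    text: str
    # Hypothetical field: the customer's stated reason for being unhappy.
    reason: Optional[str] = None


def follow_up_queue(reviews):
    """Return (review_id, reason) pairs for reviews needing a human response."""
    queue = []
    for review in reviews:
        text = review.text.lower()
        if any(phrase in text for phrase in TRIGGER_PHRASES):
            queue.append((review.review_id, review.reason or "unspecified"))
    return queue


# Made-up reviews for illustration only.
sample_reviews = [
    Review(1, "I wanted to love this dress", reason="fit"),
    Review(2, "Perfect!"),
    Review(3, "i wanted to love it"),
]
queue = follow_up_queue(sample_reviews)
print(queue)
```

The detection step stays deliberately simple here; the response itself is left to a person, in line with the recommendation above.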

In conclusion, producing meaningful insight, rather than a set of facts, is a goal of analysis (Mela & Moorman, 2018), and we have shown this is best achieved with a combination of machine and human learning. The utility of information is a function of its causality rather than correlation to desired outcomes, and human experience is often a better guide to this than pure analysis, which is frequently constrained by the limited volume and variance of available data. Examples of this are the different suggested potential applications of the “I wanted to love” and “the material is” word-strings. 

A business goal of seeking to identify dissatisfied customers can benefit from machine-learning based sentiment analysis, with real-time detection of issues on a scale impractical without automation. Effective engagement with those customers will most likely be best achieved with a higher degree of human input than the identification due to the context specific nature of issues, with insight from this fed back into the automation to increase the accuracy. 

References

Agrawal, A., Gans, J. & Goldfarb, A. (Apr 17, 2018) A Simple Tool to Start Making Decisions with the Help of AI. Harvard Business Review Digital Article. Available here: https://services.hbsp.harvard.edu/api/courses/706350/items/H04BYL-PDF-ENG/sclinks/c1d93f3cde2622db2e7afad555eb7199 [Accessed 6th April 2020]

Mela, C. & Moorman, C. (May 30, 2018) Why Marketing Analytics Hasn’t Lived Up to Its Promise. Harvard Business Review. Available here: https://services.hbsp.harvard.edu/api/courses/706350/items/H04BYL-PDF-ENG/sclinks/c1d93f3cde2622db2e7afad555eb7199 [Accessed 6th April 2020]

Ruttenberg, A., Tripp, A., Dibner, C., Mitchell, J. & Huang, W. (2018) How Customer Service Can Turn Angry Customers into Loyal Ones. Harvard Business Review. Available here: https://services.hbsp.harvard.edu/api/courses/706350/items/H043QN-PDF-ENG/sclinks/394f7f731c881d6711daa4933c610768 [Accessed 5th April 2020]

Zorfas, A. & Leemon, D. (2016) An Emotional Connection Matters More than Customer Satisfaction. Harvard Business Review. Available here: https://services.hbsp.harvard.edu/api/courses/706350/items/H033FC-PDF-ENG/sclinks/5f633fb0079f3b39242e1be9b354632d [Accessed 5th April 2020]

Samantha Bonnar