August 1, 2019

Ice Cream Mining

Last week, we hit a record temperature of 41°C in Belgium. And as we were slowly melting away, we did what any data marketing firm would do under such circumstances. We explored how our three favorite ice cream companies position themselves in the market!

More specifically, we analyzed the language that each of the three ice cream companies uses on its website. Which words stand out? Which words are shared by all three? And which words make each brand unique?

The Results

We noticed that each company has its own distinctive vocabulary.

  • Ben & Jerry’s seems to be more inclined to talk about “chunks”, “swirls” and “cream”.
  • Breyers often talks about “peanut butter”, “butterscotch” and “milk”.
  • Haagen-Dazs presents itself as “extraordinary” and the “best”.

What else?

All three companies use a fairly positive tone of voice. However, Haagen-Dazs uses the most subjective language and talks most optimistically about its product. Breyers comes second, and Ben & Jerry’s is the most objective (but still positive).

Can we perform the same kind of analysis on our competitors?

Of course you can! One of the big benefits of this analysis is that the data is already there. It gives you valuable insights while being time- and budget-friendly. Read more details about our method below or get in touch if you wish to know more!

Bart Van Proeyen
+32 494 60 95 38


Our analysis rests on four pillars:

  • Web scraping to collect our data
    saving you the work of clicking and reading through all pages manually
  • Natural Language Processing to turn text into analyzable pieces of information
    saving you the headache of separating the wheat from the chaff
  • Correspondence Analyses to visualize the ice cream market
    giving you insight into how close or far certain brands are from certain words or statements
  • Sentiment Analyses to understand how positively/negatively and objectively/subjectively companies speak about their brand
    helping you determine how positive authors are about your brand… in social media this would be even more valuable

Web scraping

First, we collected all textual information from seven different ice cream websites. Some irrelevant pages (e.g. contact pages) were ignored. In total, n=821 unique pages were scraped for their text. Together they form our ‘corpus’, the body of text we analyze. The corpus consists of n=3 documents, one for each company.
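To make the step from pages to documents concrete, here is a minimal sketch using only the Python standard library. It is not the scraper we actually used: the brand names and HTML snippets are hypothetical stand-ins for downloaded pages, and the helper names (`TextExtractor`, `page_text`, `build_corpus`) are invented for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def page_text(html):
    """Return the visible text of one HTML page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_corpus(pages_by_company):
    """Merge the text of all pages of a company into one document."""
    return {
        company: " ".join(page_text(html) for html in pages)
        for company, pages in pages_by_company.items()
    }


# Hypothetical, hand-written pages standing in for downloaded HTML.
sample_pages = {
    "Brand A": [
        "<html><body><h1>Chunks and swirls</h1><script>var x = 1;</script></body></html>",
        "<html><body><p>Cream with chunks</p></body></html>",
    ],
    "Brand B": ["<html><body><p>Milk and peanut butter</p></body></html>"],
}
corpus = build_corpus(sample_pages)
print(corpus)
```

A real pipeline would also need to download the pages, remove duplicates and drop irrelevant pages before this step.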

Natural Language Processing

Natural Language Processing, usually shortened to NLP, is a branch of artificial intelligence used to read, decipher and understand human language. Ultimately, the goal is to teach machines how we communicate. Think about applications such as auto-correct features, chatbots or translation engines.

To turn our corpus into analyzable information, we used several techniques from the NLP toolbox:

  • Language detection: we only selected English pages.
  • Stopword filtering: we filtered out irrelevant stopwords (e.g. ‘the’, ‘a’, ‘in’).
  • Term Frequency: this is a count of how often each word appears in each document.
  • TF-IDF (term frequency-inverse document frequency): this is a numerical statistic that reflects how important a word is in a given document relative to the full corpus.

The result is a prepared dataset, ready for further analysis and visualization.
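The stopword filtering, term frequency and TF-IDF steps can be sketched in a few lines of plain Python. This is an illustration under assumptions, not our production code: the stopword list is deliberately tiny, the three documents are made up, and the TF-IDF variant shown (relative frequency times the unsmoothed log of N/df) is only one of several common definitions.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "in", "and", "of", "with"}


def tokenize(text):
    """Lowercase the text, split it into words and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def term_frequencies(documents):
    """Count how often each word appears in each document."""
    return {name: Counter(tokenize(text)) for name, text in documents.items()}


def tf_idf(term_counts):
    """Score each word per document: relative frequency * log(N / df)."""
    n_docs = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # number of documents containing the word
    scores = {}
    for name, counts in term_counts.items():
        total = sum(counts.values())
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Made-up documents, one per hypothetical brand.
documents = {
    "Brand A": "The chunks and the swirls in the cream",
    "Brand B": "Milk and peanut butter in a cream",
    "Brand C": "The best and extraordinary cream",
}
counts = term_frequencies(documents)
scores = tf_idf(counts)
print(scores["Brand A"])
```

In this toy corpus "cream" appears in every document, so its TF-IDF score is zero everywhere, while "chunks" scores above zero for Brand A only. That is exactly the behavior that makes TF-IDF useful for finding the words that set a brand apart.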

Correspondence Analyses

Correspondence analysis (CA) is a technique for graphically displaying a two-way table by calculating coordinates representing its rows and columns. In marketing, it is widely used to depict perceptions about brands. Brands are plotted in a two-dimensional space, and several characteristics are plotted in that same space. The plot is interpreted by looking at where the characteristics lie relative to the brands. The characteristics on the outer rim of the plot are generally the ones that carry the most weight.

The correspondence analysis was based on the frequency of words in each document. To keep the plot readable, we selected only the n=30 most frequent words overall, out of n=6738 unique words in our corpus.
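For readers who want to see the mechanics, the sketch below shows one standard way to compute a correspondence analysis: a singular value decomposition of the standardized residuals of the frequency table. It assumes NumPy is available and that every row and column has a non-zero total. The brand-by-word counts are invented for illustration and are not our actual data.

```python
import numpy as np


def correspondence_analysis(table):
    """Return row coordinates, column coordinates and principal inertias."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)              # row masses (brands)
    c = P.sum(axis=0)              # column masses (words)
    expected = np.outer(r, c)      # frequencies expected under independence
    S = (P - expected) / np.sqrt(expected)  # standardized residuals
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sigma) / np.sqrt(r)[:, None]     # principal coordinates of rows
    cols = (Vt.T * sigma) / np.sqrt(c)[:, None]  # principal coordinates of columns
    return rows, cols, sigma ** 2


# Invented counts: 3 hypothetical brands (rows) x 4 words (columns).
brands = ["Brand A", "Brand B", "Brand C"]
words = ["chunks", "swirls", "milk", "best"]
counts = [
    [30, 25, 5, 2],
    [4, 6, 28, 3],
    [3, 2, 6, 20],
]
rows, cols, inertia = correspondence_analysis(counts)

# The first two columns are the coordinates for a two-dimensional plot.
for brand, (x, y) in zip(brands, rows[:, :2]):
    print(f"{brand}: ({x:.2f}, {y:.2f})")
for word, (x, y) in zip(words, cols[:, :2]):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```

With three documents there are at most two meaningful dimensions, so a two-dimensional plot loses no information about the table.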

Sentiment Analyses

Sentiment analysis can be considered a subdomain of NLP. Its goal is to understand the tone of voice of a piece of text. We looked at the general sentiment of each of our 3 documents. We calculated two metrics: polarity (positive / negative) and subjectivity (subjective / objective). This yielded the quadrant depicted above.
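To illustrate what the two metrics mean, here is a toy lexicon-based scorer. It is an assumption-laden sketch, not the tool behind our results: the lexicon and its scores are hypothetical, and averaging the scores of matched words is only one simple way to arrive at a polarity and a subjectivity value.

```python
import re

# Hypothetical mini-lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
LEXICON = {
    "best": (1.0, 0.3),
    "extraordinary": (0.6, 1.0),
    "delicious": (1.0, 1.0),
    "creamy": (0.4, 0.6),
    "bad": (-0.7, 0.7),
}


def sentiment(text):
    """Average the polarity and subjectivity of all words found in the lexicon."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = [LEXICON[w] for w in words if w in LEXICON]
    if not hits:
        return 0.0, 0.0  # neutral and objective when no lexicon word occurs
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity


print(sentiment("The best, extraordinary ice cream"))
print(sentiment("Milk, sugar and cocoa"))
```

Plotting polarity on one axis and subjectivity on the other, with one point per document, gives a quadrant view of the kind described here.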