Recently, I had the opportunity to participate in Text Analytics World in Chicago. It was exciting to see so many industry-focused data science practitioners in the same place, talking about modern methods of leveraging text for industry insights.
Here are two takeaways that really stood out at this event:
Neural Word Embeddings are the Rule, not the Exception
Neural word embeddings, such as Word2Vec, are a clever method of representing the words in a natural language vocabulary (typically consisting of millions of terms) in a compact vector space where it is much easier to detect signals and patterns in how the words are represented. This is accomplished by training an artificial neural network to perform some well-defined (though often not directly related) task involving natural language, such as asking the network to guess the next word in a sentence after showing it the last few words. Word2Vec then throws away the rest of the network and keeps only the part that was trained to represent the vocabulary terms as vectors (lists of a few hundred numbers per word) that it found useful in solving the original task. Click here for a further introduction to these algorithms.
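To make the idea above concrete, here is a minimal sketch in plain Python. It is not Word2Vec itself, only a toy next-word predictor built on the same principle: train a small network to guess the next word from the last few words, then keep only the embedding table. The function names (`train_embeddings`, `cosine`), the toy corpus, and all hyperparameters are assumptions made for illustration.

```python
import math
import random


def train_embeddings(sentences, dim=8, window=2, epochs=200, lr=0.1, seed=0):
    """Train a toy next-word predictor and return only its word vectors."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    size = len(vocab)

    # emb: input embeddings (the part we keep).
    # out: output weights used only for the prediction task (thrown away).
    emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]

    # Build (previous few word ids, next word id) training pairs.
    pairs = []
    for s in sentences:
        ids = [index[w] for w in s]
        for t in range(1, len(ids)):
            pairs.append((ids[max(0, t - window):t], ids[t]))

    for _ in range(epochs):
        rng.shuffle(pairs)
        for ctx, target in pairs:
            # Hidden layer: average of the context word embeddings.
            h = [sum(emb[c][k] for c in ctx) / len(ctx) for k in range(dim)]
            # Softmax over the vocabulary.
            scores = [sum(out[j][k] * h[k] for k in range(dim)) for j in range(size)]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            probs = [e / total for e in exps]
            # Cross-entropy gradient, applied with plain SGD.
            grad_h = [0.0] * dim
            for j in range(size):
                g = probs[j] - (1.0 if j == target else 0.0)
                for k in range(dim):
                    grad_h[k] += g * out[j][k]
                    out[j][k] -= lr * g * h[k]
            for c in ctx:
                for k in range(dim):
                    emb[c][k] -= lr * grad_h[k] / len(ctx)

    # Discard `out`; the embedding table is the product we wanted.
    return {w: emb[index[w]] for w in vocab}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus; real embeddings need far more text than this.
corpus = [s.split() for s in [
    "the cat chased the mouse",
    "the dog chased the ball",
    "the cat ate the fish",
    "the dog ate the bone",
]]
vectors = train_embeddings(corpus)
print(cosine(vectors["cat"], vectors["dog"]))
```

On a corpus this small the similarities are not meaningful. The point is only the shape of the procedure: the prediction task is a means to an end, and the vectors are what gets kept. In practice you would reach for an established library rather than hand-rolled training loops.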
This technique (and others like it) has produced remarkable results in natural language representation for machine learning over the last few years, though until recently it was leveraged primarily by companies at the cutting edge of natural language machine learning. No longer. W2V and its cousins have permeated all corners of the industry as the standard in text processing and natural language representation. In the vast majority of the talks I attended, if W2V was not a cornerstone of the results, it played a key part in the discussions that followed, as data scientists speculated about how it might further improve the work.
A Sour Sentiment on Sentiment
While rich multi-dimensional representations of text in terms of Word2Vec vectors are clearly in vogue, tasks such as Sentiment Analysis (SA) in industry are not. Sentiment Analysis boils natural language strings pulled from social media and CRM data down to a spectrum of positively and negatively characterized sentiments. During a panel discussion, the audience was polled on their experiences with this task, revealing two insights:
- First, nearly every practitioner at some point in his or her career has been asked to do SA for a customer or stakeholder.
- Second, no one in the audience has had an experience in which SA alone was able to drive business value.
The explanation for the first point is likely a sociological one: sentiment analysis is easy to understand; customers feel good about X or they don’t, and their words will tell us the difference. The reason it is so uncommon for practitioners to be able to use SA to drive value is likely that customer opinions on a product or brand are often more complicated than binary sentiments. Customers might be pleased or displeased for a variety of reasons, so simply knowing the overall sentiment does not easily lead to actionable insights. Moreover, certain sample sets such as CRM logs are often overwhelmingly biased; you don’t need a data scientist to tell you that if someone is calling customer service, their outlook might not be so positive at that moment.
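A small sketch may help illustrate why a single sentiment number is hard to act on. The scorer below is a deliberately naive word-list approach, not a method discussed at the conference; the word lists, the `sentiment_score` function, and the example comments are all made up for illustration.

```python
# Hypothetical, hand-picked word lists; real lexicons are far larger.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"broken", "slow", "rude", "late", "expensive"}


def sentiment_score(text):
    """Return (positives - negatives) / matched words, or 0.0 if none match."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched


# Two invented complaints about entirely different problems.
comments = [
    "The delivery was late and the box arrived broken.",
    "Support was rude and the app is slow.",
]
for comment in comments:
    print(sentiment_score(comment), comment)
```

Both comments receive the same fully negative score, yet one points to a shipping problem and the other to support and software quality. The score alone does not say which team should act, which is the gap the panel audience described.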