Text classification models are used to categorize text into organized groups. Text is analyzed by a model and then the appropriate tags are applied based on the content. Machine learning models that can automatically apply tags for classification are known as classifiers.
Classifiers do not work out of the box; they need to be trained before they can make predictions for specific texts. Training a classifier is done by:
- defining a set of tags that the model will work with
- making associations between pieces of text and the corresponding tag or tags
Once enough texts have been tagged, the classifier can learn from those associations and begin to make predictions on new texts.
A classifier is most effective when it is built for a specific use case, using a set of tags and training texts that pertain to it. The following are some of the ways that classifiers, and their corresponding sets of tags, are used.
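The two training steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming a multinomial Naive Bayes model with add-one smoothing; the text does not say which algorithm a given classifier uses, and the `NaiveBayesClassifier` class, the `tokenize` helper, and the four training reviews are all hypothetical names and data made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Tiny multinomial Naive Bayes text classifier (illustrative only)."""

    def __init__(self, tags):
        # Step 1: define the set of tags the model will work with.
        self.tags = list(tags)
        self.word_counts = {tag: Counter() for tag in self.tags}
        self.doc_counts = Counter()
        self.vocabulary = set()

    def train(self, tagged_texts):
        # Step 2: associate each piece of text with its tag.
        for text, tag in tagged_texts:
            if tag not in self.word_counts:
                raise ValueError(f"unknown tag: {tag}")
            tokens = tokenize(text)
            self.word_counts[tag].update(tokens)
            self.doc_counts[tag] += 1
            self.vocabulary.update(tokens)

    def predict(self, text):
        """Return the most probable tag, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        # Reserve one extra vocabulary slot for words never seen in training.
        vocab_size = len(self.vocabulary) + 1
        tokens = tokenize(text)
        best_tag, best_score = None, -math.inf
        for tag in self.tags:
            if self.doc_counts[tag] == 0:
                continue  # no training examples for this tag yet
            # Log prior plus add-one smoothed log likelihood of each token.
            score = math.log(self.doc_counts[tag] / total_docs)
            total_words = sum(self.word_counts[tag].values())
            for token in tokens:
                count = self.word_counts[tag][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag


# Hypothetical training data: (text, tag) associations.
training_data = [
    ("Friendly staff and a lovely clean room", "positive"),
    ("Excellent breakfast, great service", "positive"),
    ("Terrible smell and a dirty bathroom", "negative"),
    ("Rude staff, awful noisy room", "negative"),
]

classifier = NaiveBayesClassifier(tags=["positive", "negative"])
classifier.train(training_data)
print(classifier.predict("The service was excellent and the staff were friendly"))
# prints "positive"
```

With only four training texts the predictions are fragile; the sketch is meant to show the shape of the workflow (define tags, tag texts, predict), not a production-quality model.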
Sentiment analysis is one of the most common use cases for classifiers. This kind of analysis is used to detect positive or negative sentiment from a user or customer in their comments, tweets, reviews, etc.
For example, with the following hotel reviews:
Text A:
"Friendly service. Superior room! Loved the high ceiling. Housekeeping service was top quality. Excellent breakfast and fitness room."
Text B:
"If possible I would give the hotel zero stars. The smell is terrible and gave me a headache after one night. Recommend avoiding."
A sentiment analysis classifier would likely classify Text A as positive, since several elements indicate that the reviewer was generally satisfied.
Text B would likely be classified as negative, since the reviewer points to many negative aspects of the hotel room and experience.
In language detection, an incoming piece of text is analyzed against a list of languages (e.g., Spanish, English, French) to programmatically detect the language of the given text.
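One simple way to picture this is to count how many common words from each candidate language appear in the text. The sketch below is an assumption-laden illustration, not how any particular language detector works: the `STOPWORDS` table and the `detect_language` function are hypothetical, and the word lists are deliberately tiny.

```python
import re

# Hypothetical, deliberately tiny stopword lists; a real detector would rely
# on much larger vocabularies or on character n-gram statistics.
STOPWORDS = {
    "English": {"the", "of", "and", "a", "that", "can", "from", "is", "to", "in"},
    "Spanish": {"el", "la", "de", "que", "es", "una", "o", "las", "los", "en"},
    "French": {"le", "la", "de", "des", "un", "est", "et", "par", "en", "les"},
}


def detect_language(text):
    """Return the language whose stopwords occur most often in the text.

    Returns None when no stopword from any language is found.
    """
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        language: sum(token in words for token in tokens)
        for language, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(detect_language("El aprendizaje automático es una rama de la inteligencia artificial"))
# prints "Spanish"
```

Because some words are shared between languages (for instance "de" and "la" appear in both the Spanish and French lists), short inputs can be ambiguous; longer texts give the counts more room to separate.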
For example, if we have the following texts:
Text A:
"Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data…"
Text B:
"El aprendizaje automático o aprendizaje de máquinas es una rama de la inteligencia artificial cuyo objetivo es desarrollar técnicas que permitan a las computadoras aprender…"
Text C:
"L’apprentissage automatique (machine learning en anglais), un des champs d’étude de l’intelligence artificielle, est la discipline scientifique concernée par le développement…"
Text A would be classified as English, Text B as Spanish, and Text C as French.
Imagine you are working with a set of apparel products and you want to automatically classify them using their descriptions.
Let's look at the following descriptions as examples:
Product A:
"This women’s printed pullover sweater is a great basic to add to your wardrobe. Throw this sweater over a top for extra warmth and to add some fun pops of color to your outfit. Pair it with jeans and boots this winter. This top is an excellent essential…"
Product B:
"Relieve tired, aching feet from the stress of high heels. The fully-lined padded insole provides a comfortable fit and a rubber outsole for durability. Bendable comes in a convenient carry-bag…"
The description from Product A mentions a sweater in various capacities, so in all probability it would be classified as
The description in Product B mentions aspects of a sandal, plus various references to feet and other kinds of shoes. The likely classification would be
Text classification is often used to organize text by topic. This is useful for tweets, documents, news, reviews, support tickets, etc.
If we trained a classifier to recognize broad categories of general topics, we could use the model to analyze text from different inputs.
For example, if one were to analyze the text from a web page that has this kind of content:
A topic classifier may understand the written text and find that the Cooking tag is most appropriate.
Alternatively, given the following tweet:
The same topic classifier would classify this text as