TL;DR: the more data that you tag for your classifier, the better. Avoid tagging mistakes, use diverse samples, and have consistent criteria.

In machine learning, more training data generally produces a better, more accurate model, as long as that data is tagged well.

So the most straightforward way to improve a classifier is to tag more data, which gives the classifier more information to learn each category.

How Much Training Data?

How much training data you need depends on your use case, how many tags you use, how broad each tag is, your data, the quality of the tagging, and the consistency of your criteria.

For starters, not every classification task is equally complex. For example, sentiment analysis is among the harder NLP classification tasks because it has to deal with subjectivity, tone, irony, sarcasm, context, and polarity. As a result, sentiment analysis classifiers usually need more training data than other classifiers.

To start seeing decent results, we suggest the following numbers of training samples:

  • Topic detection: usually needs around 50-100 samples per tag.
  • Sentiment analysis: usually needs around 500-1,000 samples per tag.
  • Intent classification: usually needs around 50-100 samples per tag.
  • Urgency detection: usually needs at least 100 samples for the Urgency tag and 200 samples for the Non-Urgent tag.
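As a rough illustration, the suggested minimums above can be checked against a tagged dataset with a few lines of Python. This is only a sketch: the function name `tags_needing_more_data`, the `(text, tag)` tuple format, and the tag names `Urgent` and `Non-Urgent` are assumptions made for this example, not part of the product.

```python
from collections import Counter

def tags_needing_more_data(tagged_samples, minimums):
    """Return {tag: samples still missing} for tags below their minimum.

    tagged_samples: iterable of (text, tag) pairs (hypothetical format).
    minimums: {tag: suggested minimum number of samples}.
    """
    counts = Counter(tag for _text, tag in tagged_samples)
    return {
        tag: minimum - counts[tag]
        for tag, minimum in minimums.items()
        if counts[tag] < minimum
    }

# Suggested minimums for urgency detection, taken from the list above.
urgency_minimums = {"Urgent": 100, "Non-Urgent": 200}

# Placeholder dataset: 120 Urgent samples and 150 Non-Urgent samples.
samples = [(f"urgent ticket {i}", "Urgent") for i in range(120)]
samples += [(f"routine ticket {i}", "Non-Urgent") for i in range(150)]

print(tags_needing_more_data(samples, urgency_minimums))  # {'Non-Urgent': 50}
```

A tag that has no samples at all is also reported, because the check iterates over the expected tags rather than over the tags found in the data.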

Keep in mind that improving a classifier is an iterative process. You tag more data, then check how accurate the model is, where it fails, and where it predicts correctly. Then you tag more data, focusing on the tags or categories where the model needs the most help. You can also concentrate on tagging samples that contain particular words or expressions the classifier usually gets wrong.
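One way to decide which tags need more help is to compare the expected tags with the model's predictions on a set of already tagged samples and compute precision and recall per tag. The sketch below assumes you have exported both lists yourself; the function name and the data are hypothetical.

```python
def per_tag_precision_recall(expected, predicted):
    """Return {tag: (precision, recall)} for single-tag predictions."""
    pairs = list(zip(expected, predicted))
    report = {}
    for tag in sorted(set(expected) | set(predicted)):
        tp = sum(1 for e, p in pairs if e == tag and p == tag)
        fp = sum(1 for e, p in pairs if e != tag and p == tag)
        fn = sum(1 for e, p in pairs if e == tag and p != tag)
        # Report 0.0 when a ratio is undefined (no predictions or no samples).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[tag] = (precision, recall)
    return report

# Hypothetical expected tags and model predictions for five samples.
expected = ["Positive", "Positive", "Negative", "Neutral", "Negative"]
predicted = ["Positive", "Neutral", "Negative", "Neutral", "Positive"]

for tag, (precision, recall) in per_tag_precision_recall(expected, predicted).items():
    print(f"{tag}: precision={precision:.2f} recall={recall:.2f}")
```

Tags with low recall are the ones the model misses most often, so they are usually good candidates for the next round of tagging.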

How to Add More Training Data?

To add more training data to a classifier, go to Build > Train. Then start tagging the examples by selecting the expected tag and clicking the ‘Confirm’ button.

If you have tagged all of your data, you can always upload more data to your classifier. Just go to Build > Data and click on ‘Upload Data’.

Finally, choose how you want to upload new data (e.g. Excel or CSV file) to your classifier and follow the upload wizard.

Quality of the Data

Besides the volume of data (how many training samples you use to train a classifier), the other important factor is the quality of the data.

Diversity of the Data

It's critical to train the classifier with quality examples that accurately represent each category and the ‘universe’ of possibilities that could be found in each tag. In other words, we recommend using a diverse set of examples for each tag, covering a variety of expressions. This enables a classifier to learn from the multiple ways people express themselves when talking about a particular tag or category.

For example, imagine that you are training a topic classifier to analyze app reviews for Facebook, and you have tags such as Security and Privacy, Notifications, News Feed, etc.
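A very rough way to spot tags with little variety is to count, per tag, how many samples are exact repeats and how many distinct words appear. The sketch below is only a heuristic under simple assumptions (lowercasing and whitespace splitting); the function name, the `(text, tag)` format, and the sample texts are hypothetical.

```python
from collections import defaultdict

def diversity_by_tag(tagged_samples):
    """Return per-tag counts of samples, unique samples, and distinct words."""
    texts_by_tag = defaultdict(list)
    for text, tag in tagged_samples:
        # Normalize case and whitespace so trivial variants count as repeats.
        texts_by_tag[tag].append(" ".join(text.lower().split()))
    report = {}
    for tag, texts in texts_by_tag.items():
        words = [word for text in texts for word in text.split()]
        report[tag] = {
            "samples": len(texts),
            "unique_samples": len(set(texts)),
            "distinct_words": len(set(words)),
        }
    return report

# Hypothetical app-review samples using tags from the example above.
reviews = [
    ("Too many notifications", "Notifications"),
    ("too many notifications", "Notifications"),
    ("Alerts arrive late at night", "Notifications"),
    ("My account was hacked", "Security and Privacy"),
]

for tag, stats in diversity_by_tag(reviews).items():
    print(tag, stats)
```

A tag with many samples but few unique samples or distinct words probably needs examples phrased in more varied ways.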

Quality of the Tagging

While tagging data to train a classifier, human taggers commonly make mistakes because of boredom, distractions, and simple human error. For example, when training a sentiment classifier, a tagger might tag a text that is clearly Positive as Neutral or even Negative.

These tagging mistakes significantly affect the accuracy of a classifier because the model learns from them and later makes the same mistakes when analyzing new data and making predictions.

Consistency in the Tagging Criteria

Having consistent criteria when tagging data for a classifier also influences how many training samples you need. If you have consistent tagging criteria, you’ll need much less training data than if you don’t.

Another common situation when training a classifier is that when you first start tagging data, you begin tagging with specific criteria. But, as you tag more and more data, you’ll learn from the data and change your criteria over time. This is natural and a good thing: the more you understand about your data, the better.

But keep in mind that these changes in your criteria will confuse your classifier and affect its accuracy. Imagine the following scenario. At the start of your tagging, you tell the classifier that for a particular input (text), you expect a specific output (tag). Then, over time, you change your criteria. For similar input, you start telling the classifier that you expect other tags that you were initially missing. By doing this, you are sending the classifier mixed signals that would confuse even a human tagger.

If you detect that you have changed your criteria over time, we recommend that you review the samples you used to train your classifier. Review the tags you used in your training data, and if you find inconsistencies, retag those samples.
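If you can export your training samples, one simple starting point for this review is to look for texts that appear more than once with different tags. The sketch below assumes a single-tag classifier and a list of `(text, tag)` pairs; the function names and data are hypothetical, and it only catches identical texts (after normalizing case and whitespace), not merely similar ones.

```python
from collections import defaultdict

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(text.lower().split())

def conflicting_samples(tagged_samples):
    """Return {normalized text: sorted tags} for texts tagged inconsistently."""
    tags_by_text = defaultdict(set)
    for text, tag in tagged_samples:
        tags_by_text[normalize(text)].add(tag)
    return {
        text: sorted(tags)
        for text, tags in tags_by_text.items()
        if len(tags) > 1
    }

# Hypothetical training samples tagged at different times.
training_samples = [
    ("Great app!", "Positive"),
    ("great  app!", "Neutral"),
    ("Too many ads", "Negative"),
]

print(conflicting_samples(training_samples))
# {'great app!': ['Neutral', 'Positive']}
```

Each text reported this way is a sample to retag once you have settled on a single criterion.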

An excellent way to find these inconsistencies is to explore the False Positive and False Negative samples for a particular tag. To find these samples, go to the Build > Stats tab, select the tag you wish to improve, and then click on either False Positives or False Negatives.
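For readers who work with exported data, the idea behind those two lists can be sketched as follows. For a given tag, a false positive is a sample the model tagged with it although you expected another tag, and a false negative is a sample you expected to get the tag but the model tagged otherwise. The function name and the `(text, expected, predicted)` format are assumptions for this example.

```python
def misclassified_for_tag(samples, tag):
    """Return (false_positives, false_negatives) texts for one tag.

    samples: iterable of (text, expected_tag, predicted_tag) triples.
    """
    false_positives = [
        text for text, expected, predicted in samples
        if predicted == tag and expected != tag
    ]
    false_negatives = [
        text for text, expected, predicted in samples
        if expected == tag and predicted != tag
    ]
    return false_positives, false_negatives

# Hypothetical samples with the tag you assigned and the model's prediction.
results = [
    ("Love the new design", "Positive", "Positive"),
    ("It works, I guess", "Neutral", "Positive"),
    ("Best update so far", "Positive", "Neutral"),
    ("Crashes on startup", "Negative", "Negative"),
]

false_positives, false_negatives = misclassified_for_tag(results, "Positive")
print("False positives:", false_positives)
print("False negatives:", false_negatives)
```

Reading both lists side by side often shows whether the model is wrong or whether the original tag no longer matches your current criteria.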

Conclusion

To train an accurate classifier, aim for both quality and volume: use a lot of training samples of excellent quality. Avoid tagging mistakes when training your classifier, use diverse samples, and have consistent tagging criteria.

If you need to choose, we suggest using fewer examples of higher quality. That is, the samples used for training the classifier should follow consistent criteria, and the tagging should have almost no mistakes.
