After gathering data from your various sources, and coming up with tags for your classifier, you’ll have to tag your texts appropriately to create a training dataset. 

This can be a manual process but it’s key for training your model. With MonkeyLearn, we try to make it as easy as possible with our interface.

By tagging the text data, you will be teaching the machine learning algorithm that for a particular input (features in texts), you expect a particular output (tags).

An accurate machine learning classifier depends completely on how accurate this initial tagging is.

How Much Training Data is Necessary?

The number of texts needed to train your model depends strongly on your particular use case. The complexity of what you want the classifier to do, and the number of tags involved, will vary and will have a large impact on how much text is necessary to build accuracy.

For example, building a classifier to analyze the sentiment of tweets is a very different project from training a model to identify topics in product reviews. Sentiment analysis is a much harder problem to solve and needs much more training data. On top of that, analyzing tweets, with all their hashtags, replies and shorthand, is a far more challenging task than analyzing well-written reviews.

In short, the more training data you have, the better. While with MonkeyLearn it's possible to build a classifier with a minimum of 4 texts per tag, we suggest having at least 20 texts per tag to build accuracy. Once you train with 20 texts in each tag, classifier statistics can be generated and it's easier to know what to do next.

More data will likely be necessary to reach acceptable accuracy rates. Here are some baselines we have seen for common classification tasks:

  • Topic detection usually needs around 500 texts in total if working with a limited number of tags
  • Sentiment analysis usually needs at least 3,000 texts in total
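Before training, it can help to check how many texts each tag actually has. The following is a minimal sketch, not part of MonkeyLearn: the function name `tag_coverage` and the status labels are made up for illustration, and the thresholds are simply the 4 and 20 texts per tag mentioned above.

```python
from collections import Counter

MIN_TO_TRAIN = 4    # minimum texts per tag mentioned in this article
MIN_FOR_STATS = 20  # suggested texts per tag mentioned in this article


def tag_coverage(rows):
    """Count texts per tag and label each tag against the two thresholds.

    rows: iterable of (text, tag) pairs.
    Returns a dict mapping tag -> (count, status).
    """
    counts = Counter(tag for _, tag in rows)
    report = {}
    for tag, n in counts.items():
        if n < MIN_TO_TRAIN:
            status = "too few to train"
        elif n < MIN_FOR_STATS:
            status = "trainable, add more"
        else:
            status = "ok"
        report[tag] = (n, status)
    return report


# Hypothetical sample data, repeated only to reach the counts we want to show.
rows = [("Great battery life", "Battery")] * 3 + [("Arrived late", "Shipping")] * 25

for tag, (n, status) in tag_coverage(rows).items():
    print(f"{tag}: {n} texts ({status})")
```

A report like this only tells you about volume per tag. It says nothing about whether the texts are tagged correctly.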

Quality over quantity

It is much better to start with less text data than to add tons of data that might have a lot of errors. Working with a small data set where you are 100% sure that the texts are correctly classified and representative of the tag in question is going to lead to more accuracy.

How "Clean" is your Data?

Many of our users add thousands of training texts when they create a classifier for the first time. It does seem like common sense that high volumes of data will be beneficial for the machine learning algorithm. However, if the data is incorrectly tagged, the model will learn from those incorrect instances as well.

Most data sets have some mistakes, and many of them come from human error. It is a challenge to maintain consistent criteria. People will make subjective judgements based on their context, but that context isn't going to be taught to a machine learning model.
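One symptom of inconsistent criteria is the same text appearing under different tags. Here is a rough sketch of how you might surface those cases for review, assuming a single-tag classifier where each text should have exactly one tag. The helper name `find_conflicts` is hypothetical, and the normalization (lowercasing and collapsing whitespace) is only one possible choice.

```python
from collections import defaultdict


def find_conflicts(rows):
    """Return texts that were given more than one distinct tag.

    rows: iterable of (text, tag) pairs.
    Texts are compared after lowercasing and collapsing whitespace.
    """
    tags_by_text = defaultdict(set)
    for text, tag in rows:
        key = " ".join(text.lower().split())
        tags_by_text[key].add(tag)
    return {
        text: sorted(tags)
        for text, tags in tags_by_text.items()
        if len(tags) > 1
    }


# Hypothetical sample: two people tagged the same tweet differently.
rows = [
    ("Not bad at all", "Positive"),
    ("not bad  at all", "Neutral"),
    ("Terrible support", "Negative"),
]

for text, tags in find_conflicts(rows).items():
    print(f"{text!r} is tagged as: {', '.join(tags)}")
```

This will not catch near-duplicates or genuinely subjective disagreements, but it gives you a short list of texts where the tagging criteria clearly were not applied consistently.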

There may also be cultural issues embedded in your data, and it may be necessary to create multiple classifiers for different cultural contexts if a high level of accuracy is needed. For example, if you are doing sentiment analysis in an international context, very similar wording can carry subtly different meanings. In the US, "this is the shit" may be a positive thing. In the UK, "this is shite" is negative. Pardon the 🤬 language, but it's a pretty common phrase on Twitter.

Other data sets can get "noisy" simply by their nature. For example, if you are classifying emails, you will likely have a problem when analyzing longer chains with multiple replies. It will be difficult for a classifier to discern the key text to analyze (the last reply) from the other text, such as previous conversations, footers, privacy snippets, etc. (we built a model to help find the last reply for just this problem).
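To illustrate the email problem, here is a deliberately crude heuristic that keeps only the text above the first quoted reply or signature. It is not the model mentioned above, and it assumes English-language clients that write an "On ... wrote:" header, prefix quoted lines with ">", and use "--" as a signature delimiter. Real email chains are far messier than this.

```python
import re

# Assumed reply header format, e.g. "On Mon, Jan 6, 2020 at 9:00 AM Ann <a@b.com> wrote:"
REPLY_HEADER = re.compile(r"^On .+ wrote:\s*$")


def last_reply(email_body):
    """Return the text before the first quoted reply or signature delimiter."""
    kept = []
    for line in email_body.splitlines():
        stripped = line.strip()
        if REPLY_HEADER.match(stripped) or stripped.startswith(">"):
            break  # everything from here on is an earlier message
        if stripped == "--":
            break  # conventional signature delimiter
        kept.append(line)
    return "\n".join(kept).strip()


# Hypothetical email body for illustration.
email = (
    "Thanks, that fixed it!\n"
    "\n"
    "On Mon, Jan 6, 2020 at 9:00 AM Support <help@example.com> wrote:\n"
    "> Have you tried restarting?\n"
)

print(last_reply(email))
```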

If your data is dirty, it may be worth spending some time cleaning it. With cleaner, higher-quality data (even at a lower volume), you'll be off to a better start building your classifier.
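What "cleaning" means depends on your data, but a first pass often looks something like the sketch below: collapse stray whitespace, drop empty texts, and drop exact duplicates. The function name `clean_rows` and the specific rules are assumptions chosen for illustration, not a MonkeyLearn feature.

```python
def clean_rows(rows):
    """Normalize whitespace, drop empty texts, and drop exact duplicates.

    rows: iterable of (text, tag) pairs.
    Returns a list of cleaned (text, tag) pairs in their original order.
    """
    seen = set()
    cleaned = []
    for text, tag in rows:
        text = " ".join(text.split())  # collapse runs of spaces, tabs, newlines
        tag = tag.strip()
        if not text:
            continue  # nothing left to classify
        key = (text.lower(), tag)
        if key in seen:
            continue  # same text with the same tag was already kept
        seen.add(key)
        cleaned.append((text, tag))
    return cleaned


# Hypothetical sample rows.
rows = [
    ("  Love   this phone ", "Positive"),
    ("love this phone", "Positive"),
    ("   ", "Negative"),
    ("Screen cracked in a week", "Negative"),
]

for text, tag in clean_rows(rows):
    print(f"{tag}: {text}")
```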

Tagging Your Data Manually

While you can use the MonkeyLearn interface to tag your text data, you can also tag it yourself in another tool and then upload a file containing both the texts and the tags applied.

Data Tagging Tools & Outsourcing

The following tools are suggested to perform manual data tagging with smaller data sets:

  • Excel
  • LibreOffice
  • Google Spreadsheets

If you have larger data sets, we would recommend using something like OpenRefine, or any other tool or technique that you might be familiar with.

In some cases, you can outsource the work of tagging your training dataset to crowdsourcing platforms like Mechanical Turk or to freelancers from sites like Upwork or Freelancer.

We wrote a blog post on the subject: How to use Mechanical Turk to tag training data.

Saving Manually Tagged Data

In order to use your text data with MonkeyLearn, the data should be saved in a CSV or Excel file with 2 columns:

  • 1 column for the text (input),
  • 1 column for the tag (expected output). This is optional.
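If your tagged data lives in a script rather than a spreadsheet, the two-column layout above can be produced with Python's standard `csv` module. This is a minimal sketch: the function name `write_training_csv` is hypothetical, and writing no header row is an assumption, so check the CSV / Excel format guidelines before uploading.

```python
import csv
import io


def write_training_csv(rows, fileobj):
    """Write (text, tag) pairs as two CSV columns: text first, then tag.

    An empty tag leaves the second column blank, since the tag is optional.
    """
    writer = csv.writer(fileobj)
    for text, tag in rows:
        writer.writerow([text, tag])


# Hypothetical sample rows. The first text contains a comma, which the
# csv module quotes automatically so it stays in a single column.
rows = [
    ("Love it, works great", "Positive"),
    ("Stopped working", "Negative"),
]

buffer = io.StringIO()
write_training_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="", encoding="utf-8")`.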

Finally, you will need to upload the CSV / Excel file to MonkeyLearn, so you can train your model. You can learn more about CSV / Excel format here.

If you have coding skills, you can instead use MonkeyLearn's API to upload your data after you have created your classifier.
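As a rough idea of what an API upload could look like, the sketch below turns (text, tag) pairs into a list of dictionaries and hands them to the `monkeylearn` Python package. Treat the details as assumptions to verify against the current API documentation: the `upload_data` call and the `text` / `tags` payload shape are written from memory of the Python SDK, the helper names are made up, and the API key and model ID are placeholders you would supply yourself.

```python
def build_payload(rows):
    """Convert (text, tag) pairs into one dict per text.

    The tag is optional, so untagged texts get no "tags" key.
    The {"text": ..., "tags": [...]} shape is an assumption about the API.
    """
    payload = []
    for text, tag in rows:
        item = {"text": text}
        if tag:
            item["tags"] = [tag]
        payload.append(item)
    return payload


def upload(rows, api_key, model_id):
    """Upload rows to an existing classifier (assumed SDK call, not verified here)."""
    # Third-party package, imported lazily: pip install monkeylearn
    from monkeylearn import MonkeyLearn

    ml = MonkeyLearn(api_key)
    return ml.classifiers.upload_data(model_id, build_payload(rows))


# Hypothetical sample rows; the second text has no tag yet.
rows = [("Love it, works great", "Positive"), ("Stopped working", "")]
print(build_payload(rows))
```

Keeping the payload construction separate from the network call makes it easy to inspect exactly what would be sent before uploading anything.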
