After gathering data from your various sources, and coming up with a hierarchy for your categories, you’ll have to "tag" the data into the corresponding categories to create a training dataset for teaching your classifier.
Tagging in this context means indicating which categories your text samples fall into. This can be a manual process, but it’s key for training your model. By tagging the data, you teach the machine learning algorithm that for a particular input (text), you expect a particular output (category).
The accuracy of a machine learning classifier depends on how accurate this initial tagging of your training dataset is.
Data Tagging Tools
The following tools are suggested to perform data tagging:
- MonkeyLearn’s GUI, in the Samples section, after creating and uploading the data (see next section for details).
- Excel / LibreOffice / Google Sheets.
- OpenRefine.
- Any particular tool or technique that you are familiar with.
How much training data do we need?
The number of training samples that you need for your model strongly depends on your particular use case, that is, the complexity of the problem and the number of categories you want to use within your model.
For example, training a model for sentiment analysis of tweets is not the same as training a model to identify the topics of product reviews. Sentiment analysis is a much harder problem to solve and needs much more training data. Analyzing tweets is also far more challenging than analyzing well-written reviews.
In short, the more training data you have, the better. We suggest starting by tagging at least 20 samples per category and taking it from there. Depending on how accurate your classifier turns out to be, add more data. For topic detection, we have seen some accurate models with 200 to 500 training samples in total. Sentiment analysis models usually need at least 3,000 training samples before they start showing acceptable accuracy.
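As a quick way to check the 20-samples-per-category starting point before training, you could count the tagged samples per category. The sketch below is only an illustration: the function name, the sample texts, and the category names are hypothetical, and it assumes your data is a list of (text, category) pairs.

```python
from collections import Counter

# Starting point suggested above; adjust it for your own use case.
MIN_SAMPLES_PER_CATEGORY = 20


def categories_needing_more_data(samples, minimum=MIN_SAMPLES_PER_CATEGORY):
    """Return {category: count} for categories with fewer than `minimum` samples."""
    counts = Counter(category for _text, category in samples)
    return {category: n for category, n in counts.items() if n < minimum}


# Hypothetical tagged data: 25 samples for one category, 5 for another.
samples = [("Great battery life", "Battery")] * 25 + [("Screen cracked", "Screen")] * 5

print(categories_needing_more_data(samples))  # {'Screen': 5}
```

Any category listed in the result is a candidate for more tagging before you judge the classifier's accuracy.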
Quality over quantity
It’s much better to start with fewer samples that you are 100% sure are representative of each of your categories and correctly tagged than to add tons of data with lots of errors.
Some of our users add thousands of training samples at once (when creating a custom classifier for the first time), thinking that a high volume of data is great for the machine learning algorithm. But in doing so, they don’t pay enough attention to the data they use as training samples, and most of the time many of those samples are incorrectly tagged.
It’s like teaching history to a kid with a history book in which many facts are plain wrong. The kid will learn from this data, but what they learn will be wrong. They definitely won’t know history, no matter how much they read and learn from this book.
So, it’s much better to start with a small amount of high-quality, correctly tagged training data and take it from there. Afterwards, you can work on improving the accuracy of your classifier by adding more quality data.
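One simple check for tagging errors is to look for the same text tagged with different categories. The following is a minimal sketch, not a MonkeyLearn feature: the function name and sample data are hypothetical, and it assumes each text should belong to exactly one category (it would not apply if a sample may legitimately carry several categories).

```python
from collections import defaultdict


def find_conflicting_tags(samples):
    """Return {normalized_text: sorted categories} for texts tagged with more than one category."""
    tags_by_text = defaultdict(set)
    for text, category in samples:
        # Normalize case and whitespace so near-identical copies are compared together.
        key = " ".join(text.lower().split())
        tags_by_text[key].add(category)
    return {text: sorted(tags) for text, tags in tags_by_text.items() if len(tags) > 1}


# Hypothetical tagged data containing one inconsistently tagged text.
samples = [
    ("Love it!", "Positive"),
    ("love  it!", "Negative"),
    ("Too slow", "Negative"),
]

print(find_conflicting_tags(samples))  # {'love it!': ['Negative', 'Positive']}
```

This only catches inconsistent duplicates; samples that are consistently tagged with the wrong category still need a manual review.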
Saving the tagged data
To use the training samples within MonkeyLearn, the data must be saved in a CSV or Excel file with 2 columns:
- 1 column for the text (input),
- 1 column for the category (expected output).
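If your tagged samples live in code rather than in a spreadsheet, the two-column layout can be produced with Python's standard `csv` module. This is a sketch under assumptions: the column order (text first, category second), the absence of a header row, the function names, and the file name are illustrative choices, not requirements confirmed by MonkeyLearn.

```python
import csv
import io


def write_tagged_samples(samples, fileobj):
    """Write (text, category) pairs as two-column CSV rows to an open file object."""
    writer = csv.writer(fileobj)
    for text, category in samples:
        writer.writerow([text, category])


def save_tagged_samples(samples, path):
    """Save the samples to a CSV file at `path` (for example, a hypothetical training_data.csv)."""
    # newline="" lets the csv module control line endings, as its documentation recommends.
    with open(path, "w", newline="", encoding="utf-8") as f:
        write_tagged_samples(samples, f)


# Hypothetical tagged data; the comma in the first text is quoted automatically.
samples = [
    ("The screen is great, but the battery is poor", "Battery"),
    ("Fast delivery", "Shipping"),
]

buffer = io.StringIO()
write_tagged_samples(samples, buffer)
print(buffer.getvalue())
```

Using `csv.writer` instead of joining strings by hand keeps texts that contain commas, quotes, or line breaks inside a single column.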
Finally, you will need to upload the CSV / Excel file to MonkeyLearn, so you can train your model. You can learn more about CSV / Excel format here.
Alternatively, if you have coding skills and feel more comfortable doing so, you can use MonkeyLearn’s API to upload your data after you have created your classifier.