When you build your own classifier, one key detail to understand is that it needs its own structure. A well-defined set of tags (or classifications) is necessary for training to work well, which in turn leads to more accurate predictions.

Defining a Set of Tags

The scope of a classifier is defined by its list of possible tags. Tags should be distinct enough that the model can avoid overlaps (or confusion between them). At the same time, the number of tags should be small enough that you have enough texts to assign to each tag.
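
As a rough way to check whether each tag has enough texts before training, you could count the texts per tag in your own data. The sketch below is only an illustration: the function names, the sample data, and the `minimum=4` default are assumptions made for this example and are not part of the MonkeyLearn API.

```python
from collections import Counter


def texts_per_tag(labeled_texts):
    """Count how many texts are assigned to each tag.

    `labeled_texts` is a list of (text, tag) pairs (a hypothetical format).
    """
    return Counter(tag for _, tag in labeled_texts)


def underpopulated_tags(labeled_texts, minimum=4):
    """Return the tags that have fewer than `minimum` texts, sorted by name."""
    counts = texts_per_tag(labeled_texts)
    return sorted(tag for tag, n in counts.items() if n < minimum)


# Made-up sample data for illustration only.
labeled_texts = [
    ("The app crashes on login", "Bug"),
    ("Checkout page throws an error", "Bug"),
    ("Please add dark mode", "Feature Request"),
    ("Love the new dashboard", "Praise"),
    ("Crash when uploading a file", "Bug"),
    ("Search returns no results", "Bug"),
]

print(underpopulated_tags(labeled_texts))  # ['Feature Request', 'Praise']
```

Tags that show up in this list are candidates for more texts, or for merging into a neighboring tag.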

MonkeyLearn allows you to define a set of tags on one level, meaning that there is no hierarchy in your set of tags.

With this in mind, let's dive into the process of classification, training and evaluation using such a tagging structure.

The Training Process

The training process for each tag has two subprocesses: the evaluation process and the final training process.

The Evaluation Process

This is the process that generates statistics for each tag. In this process, each tag is trained with 75% of the texts that pertain to it, and the other 25% are held out for testing. By testing the model trained on the 75% against the held-out 25%, you get an indication of how well the classifications for that tag are performing.

The partition into 75% and 25% is random, so it changes each time you run the training process. As a result, the statistics from each run might vary slightly.
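
The random 75/25 partition described above can be sketched in a few lines. This is a minimal illustration, not MonkeyLearn's actual implementation: the function name `split_tag_texts` and the way the cut point is rounded are assumptions.

```python
import random


def split_tag_texts(texts, train_fraction=0.75, rng=None):
    """Randomly partition one tag's texts into (training, testing) lists."""
    if rng is None:
        rng = random.Random()  # unseeded: a different partition on each run
    shuffled = list(texts)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)  # assumption: round down
    return shuffled[:cut], shuffled[cut:]


texts = [f"text {i}" for i in range(8)]
train, test = split_tag_texts(texts)
print(len(train), len(test))  # 6 2
```

Because the shuffle is unseeded by default, which texts land in the testing partition differs between runs, and that is why the resulting statistics vary slightly. Passing a seeded `random.Random` makes the partition repeatable.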

Check out the reference on Classifier Statistics for a detailed description of the statistics that classifiers generate per tag.

Number of Texts Required per Tag

Having enough texts per tag is critical to training a model accurately. If a tag doesn't have at least four texts, the evaluation process is performed without partitioning (using all the texts for both training and testing).

The result is that you will not be able to get reliable insight (and statistics) into how the classifier is performing for that tag until it has at least four texts. This is because there aren't enough texts to generate a distinct testing partition.

The best practice in machine learning is to train and test with separate, or disjoint, sets of data. To follow this practice, we recommend that every tag have at least four texts so the model can produce statistics. This also gives you the best chance of understanding how to improve accuracy.
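
The fallback for tags with fewer than four texts can be sketched as follows. This is an illustration under stated assumptions: the names `MIN_TEXTS_FOR_SPLIT` and `evaluation_sets` are hypothetical, and the rounding of the 75% cut point is a guess rather than documented behavior.

```python
import random

MIN_TEXTS_FOR_SPLIT = 4  # threshold taken from the text above


def evaluation_sets(texts, train_fraction=0.75, rng=None):
    """Return (training, testing, is_disjoint) for one tag's texts.

    With fewer than MIN_TEXTS_FOR_SPLIT texts, the same texts are used for
    both training and testing, so the two sets are not disjoint.
    """
    if len(texts) < MIN_TEXTS_FOR_SPLIT:
        return list(texts), list(texts), False
    if rng is None:
        rng = random.Random()
    shuffled = list(texts)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)  # assumption: round down
    return shuffled[:cut], shuffled[cut:], True


train, test, disjoint = evaluation_sets(["a", "b", "c"])
print(len(train), len(test), disjoint)  # 3 3 False

train, test, disjoint = evaluation_sets(["a", "b", "c", "d"])
print(len(train), len(test), disjoint)  # 3 1 True
```

When `is_disjoint` is false, the model is tested on texts it has already seen, so the resulting statistics say little about how the tag will perform on new texts.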

The Final Training Process

After a tag is evaluated, the final training is performed, this time using all the available texts in the data set. The result is the final classifier that you can use via the interface and the API.
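
The two phases fit together roughly as in the sketch below: statistics come from a model trained on 75% of the texts, while the final model is trained on all of them. Everything here is hypothetical. `train_tag`, `train_fn`, and `score_fn` are placeholder names, the toy "model" is just a set of words, and the handling of tags with fewer than four texts is omitted for brevity.

```python
import random


def train_tag(texts, train_fn, score_fn, rng=None):
    """Evaluate on a random 75/25 partition, then train on every text."""
    if rng is None:
        rng = random.Random()
    shuffled = list(texts)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.75)

    # Evaluation process: train on 75%, compute statistics on the other 25%.
    evaluation_model = train_fn(shuffled[:cut])
    stats = score_fn(evaluation_model, shuffled[cut:])

    # Final training process: use all the available texts.
    final_model = train_fn(list(texts))
    return final_model, stats


def train_fn(texts):
    """Toy 'model': the set of words seen during training."""
    return {word for text in texts for word in text.split()}


def score_fn(model, test_texts):
    """Toy statistic: fraction of test texts sharing a word with the model."""
    hits = sum(1 for text in test_texts if model & set(text.split()))
    return hits / len(test_texts)


final_model, stats = train_tag(["a b", "c d", "e f", "g h"], train_fn, score_fn)
print(sorted(final_model))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
```

The final model covers every text, including those held out during evaluation, so the statistics describe a slightly smaller model than the one you end up using.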
