One of the key details you should be familiar with when building your own classifier is how the classes (or categories) are structured and how classification and training are implemented.
The category tree
As you probably already know, categories in MonkeyLearn are hierarchical: each category may have subcategories. A special category that is always present and has no parent is called the root category. Categories that don’t have subcategories are called leaf categories.
Categories are represented by a tree structure called the category tree.
Category tree of a text classifier.
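To make the structure concrete, here is a minimal sketch of such a tree in Python. The `Category` class and its fields are purely illustrative (not MonkeyLearn's actual API); the tree mirrors the one in the figure above.

```python
# Illustrative sketch of a category tree (not MonkeyLearn's actual data model).
class Category:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    @property
    def is_leaf(self):
        # A leaf category is one with no subcategories.
        return not self.children

# The tree from the figure: a root category with two branches.
tree = Category("root", [
    Category("Computers", [Category("Software"), Category("Hardware")]),
    Category("Tourism", [Category("Accommodation"), Category("Transportation")]),
])
```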
Let’s dive into how classification, training, and evaluation work using this structure.
The classification process
A classification result for any given text is a path from the root category to one of the leaf categories.
Consider the following text:
“The hotel is conveniently located near the restaurant”
When classifying it with the classifier shown in the image above, the result would be something like this:
root > Tourism > Accommodation
Each non-leaf category works as a classifier:
- The Root category classifies the text between its sub-categories Computers and Tourism.
- It decides that Tourism is the best choice.
- The chosen Tourism category then classifies the text between its own subcategories, Accommodation and Transportation.
- It decides that Accommodation is the best choice.
- Since Accommodation is a leaf category, the path is complete and the process finishes.
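The steps above can be sketched as a top-down walk through the tree. This is a toy illustration, not MonkeyLearn's implementation: the tree layout, the `score` function, and the keyword sets are all made up to stand in for the per-category classifiers.

```python
# Sketch of the top-down classification walk. Each non-leaf category acts as
# its own classifier; `score` is a stand-in for that per-category model.
def classify_path(category, text, score):
    path = [category["name"]]
    while category.get("children"):
        # The current category chooses only among its direct subcategories.
        category = max(category["children"],
                       key=lambda child: score(text, child["name"]))
        path.append(category["name"])
    return path

tree = {"name": "root", "children": [
    {"name": "Computers", "children": [{"name": "Software"},
                                       {"name": "Hardware"}]},
    {"name": "Tourism", "children": [{"name": "Accommodation"},
                                     {"name": "Transportation"}]},
]}

# Toy scorer: keyword overlap (a real classifier returns a probability).
KEYWORDS = {"Tourism": {"hotel", "restaurant"}, "Accommodation": {"hotel"}}

def score(text, name):
    return len(KEYWORDS.get(name, set()) & set(text.lower().split()))

path = classify_path(
    tree, "The hotel is conveniently located near the restaurant", score)
# path is ["root", "Tourism", "Accommodation"]
```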
The most important thing to keep in mind here is that each non-leaf category is a classifier in its own right. This allows us, for example, to show you stats for each of these categories as if they were totally independent, which makes your classifiers more flexible.
The training process
As we saw above, each non-leaf category is actually a classifier, and in the training process each of these classifiers is trained independently.
A key aspect to know is that, when training, the samples that represent a category are not only the ones directly associated with it, but also all the samples in its subtree. For example, in the tree shown in the image above, when the root category is trained, the Computers category’s samples are those associated with Computers plus the ones associated with every category in its subtree (in this case Software and Hardware). Similarly, the Tourism category’s samples are its own plus those of Accommodation and Transportation.
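The subtree rule can be sketched as a simple recursive collection. The dictionary shape and sample names below are illustrative only:

```python
# Sketch: when training a non-leaf category, each child contributes all the
# samples in its entire subtree, not just the ones attached directly to it.
def subtree_samples(category):
    samples = list(category.get("samples", []))
    for child in category.get("children", []):
        samples.extend(subtree_samples(child))
    return samples

computers = {"name": "Computers", "samples": ["c1"], "children": [
    {"name": "Software", "samples": ["s1", "s2"]},
    {"name": "Hardware", "samples": ["h1"]},
]}

# Training the root uses all four samples below Computers, not just "c1".
assert subtree_samples(computers) == ["c1", "s1", "s2", "h1"]
```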
The training process for each category has two subprocesses, the evaluation process and the actual training process.
The evaluation process
The evaluation process generates the stats for each category. In this process, each category is trained with 75% of its samples; the remaining 25% are used as test samples and classified to measure the classifier’s performance.
The partitioning is random, so it changes each time you execute the training process, and the final stats may therefore show small variations.
If a category doesn’t have at least 4 samples (keep in mind that this count includes all of its children’s samples), the evaluation is performed without partitioning: the category is trained with all of its samples and tested on those same samples. This isn’t the correct approach in general (in Machine Learning you usually train and test with disjoint sets), and you won’t be able to see the category’s prediction performance on unseen samples. On the other hand, it still lets you see whether the category is underfitting, and partitioning such a small set would return inaccurate results anyway, since the partitions would be too small. For these reasons, it’s recommended that every leaf category have more than 4 samples.
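Both behaviors can be sketched together. The function name, parameters, and thresholds below are illustrative stand-ins for what the evaluation process does, not MonkeyLearn's actual code:

```python
import random

# Sketch of the evaluation split: 75% of a category's samples train a
# throwaway model and the other 25% are held out as test samples. Categories
# with fewer than 4 samples skip the partition and are trained and tested on
# the same full set.
def evaluation_sets(samples, min_samples=4, train_fraction=0.75, seed=None):
    if len(samples) < min_samples:
        return samples, samples  # no partition: train == test
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # random, so stats vary per run
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = evaluation_sets(list(range(20)), seed=0)
assert len(train) == 15 and len(test) == 5
assert set(train).isdisjoint(test)

# A category with only 3 samples is trained and tested on the same set.
train, test = evaluation_sets(["a", "b", "c"])
assert train == test == ["a", "b", "c"]
```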
Check out the Classifier stats reference for a detailed description of the generated stats.
The final training process
After a category is evaluated, the final training can be performed, this time using all the available samples. That is, the final classifier that you consume via the GUI and the API is trained on all the data you provided.