What categories or tags do you want to assign to your text or data? This is the first question to answer when you start working on a text classifier, because defining the categories is part of building it.
Choosing your categories
Let’s take a simple example: you want to classify daily deals from different websites. In this example, your categories for the different deals might be:
- Food & Drinks
- Health & Beauty
- Travel & Vacations
Now let’s imagine that you are interested in sentiment analysis. In that case, you might want categories such as Positive, Neutral, and Negative.
In contrast, if you are interested in classifying support tickets for an e-commerce site, you might want to design a category tree that includes:
- Shipping issue
- Billing issue
- Product availability
- Discounts, Promo Codes and Gift Cards
Sometimes you already know which categories you want to work with (for example, if you are interested in sentiment analysis), but sometimes you don’t. In those cases, you first need to explore and understand your data in order to determine which categories are appropriate for your model.
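One simple way to start exploring is to count which terms come up most often in a sample of your texts. The snippet below is only a rough sketch: the sample tickets, the stop-word list, and the `top_terms` helper are all made up for illustration, and a real project would use far more data and a fuller stop-word list.

```python
import re
from collections import Counter

# Hypothetical sample of unlabeled support tickets (illustrative only).
SAMPLE_TEXTS = [
    "My package has not arrived and the shipping tracker is stuck",
    "I was charged twice on my credit card for one order",
    "The promo code did not apply at checkout",
    "Shipping took three weeks, where is my package?",
]

# A tiny, hand-picked stop-word list (an assumption for this sketch).
STOP_WORDS = {
    "my", "the", "and", "is", "i", "was", "on", "for",
    "not", "did", "at", "has", "where", "one", "took",
}


def top_terms(texts, n=5):
    """Return the n most frequent non-stop-word terms across the texts."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and keep only runs of letters.
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    for term, count in top_terms(SAMPLE_TEXTS):
        print(term, count)
```

Terms that keep recurring (here, words related to shipping) hint at topics that may deserve their own category. Treat the counts as a starting point for reading the data yourself, not as a replacement for it.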
Structure your categories
A key part of this process is giving a proper structure to your categories. When you want to be more specific and use sub-categories, you will need to define a hierarchical tree that organizes your categories and sub-categories.
Going back to the example of classifying daily deals, you can organize the category tree so that each top-level category (for example, Food & Drinks) holds its own, more specific sub-categories.
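As a rough sketch, a two-level tree for the daily deals example could be written as a nested mapping. The top-level names come from the example above; the sub-categories and the `leaf_paths` helper are hypothetical and only illustrate the shape of the tree.

```python
# Top-level categories are taken from the daily deals example.
# The sub-categories are hypothetical placeholders.
CATEGORY_TREE = {
    "Food & Drinks": ["Restaurants", "Groceries"],
    "Health & Beauty": ["Spa & Massage", "Fitness"],
    "Travel & Vacations": ["Hotels", "Flights"],
}


def leaf_paths(tree):
    """Return every sub-category as a 'Parent / Child' path string."""
    return [
        f"{parent} / {child}"
        for parent, children in tree.items()
        for child in children
    ]


if __name__ == "__main__":
    for path in leaf_paths(CATEGORY_TREE):
        print(path)
```

Writing the tree down in one place like this makes it easier to review the structure before you start tagging any data.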
Tips for your category tree
It may be tempting to define as many categories as possible, but the best rule of thumb is to start small (you can add more and edit them later). Here are some additional tips to think about:
- Structure - Organize your categories according to their semantic relations. For example, Basketball and Baseball should be sub-categories of Sports because they are specific types of sports. A well-structured category tree can make a great difference and will help your classifier make accurate predictions.
- Avoid overlapping - Use disjoint categories and avoid defining categories that are ambiguous or overlapping: there should be no doubt about which category a text belongs in. Overlap between your categories will confuse your model and negatively affect the accuracy of its predictions.
- Don’t mix classification criteria - Use a single classification criterion per model. Imagine that you want to categorize companies based on their description. Your categories could be things like B2B, B2C, Enterprise, Finance, Media, Construction, etc. These categories mix two different criteria, so you should build two separate models: a) one to classify a company according to who its customers are (B2C, B2B, Enterprise) and b) another to classify a company according to the industry vertical it operates in (Finance, Media, Construction). Each model has its own criterion and purpose.
- Start small and then go big - If it’s your first time training a machine learning classifier, we recommend starting with a simple model. Complex models take more effort to get working well enough to make accurate predictions. Start with a small number of categories (fewer than 10) and at most 2 levels of categories.
When you get this simple model to work as expected, try adding a few more categories or a third level of categories. From there, you can keep iterating, adding more categories as you need them.
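Some of these tips can be checked mechanically before you train anything. The sketch below is one possible way to do it, under a few assumptions: the tree is stored as nested dictionaries, the "fewer than 10" limit is applied to every category in the tree (not only the top level), and a repeated name is treated as a crude sign of overlap. The `check_tree` helper and its limits are hypothetical, and real overlap between categories still needs a human review.

```python
MAX_CATEGORIES = 9  # "fewer than 10", counted over the whole tree (assumption)
MAX_DEPTH = 2       # at most 2 levels of categories


def tree_depth(tree):
    """Return the number of levels in a nested-dict category tree."""
    if not tree:
        return 0
    return 1 + max(tree_depth(children) for children in tree.values())


def all_names(tree):
    """Return every category name in the tree, parents before children."""
    names = []
    for name, children in tree.items():
        names.append(name)
        names.extend(all_names(children))
    return names


def check_tree(tree):
    """Return a list of warnings; an empty list means no issue was found."""
    warnings = []
    names = all_names(tree)
    if len(names) > MAX_CATEGORIES:
        warnings.append(f"{len(names)} categories; consider starting with fewer")
    if tree_depth(tree) > MAX_DEPTH:
        warnings.append(f"{tree_depth(tree)} levels; consider at most {MAX_DEPTH}")
    # A name that appears more than once is a rough hint of overlap.
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        warnings.append(f"repeated category names: {duplicates}")
    return warnings


# Example tree built from the Sports example in the tips above.
STARTER_TREE = {"Sports": {"Basketball": {}, "Baseball": {}}}

if __name__ == "__main__":
    print(check_tree(STARTER_TREE))
```

Running a check like this each time you add categories gives you an early signal that the tree is growing faster than your model can handle.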