Training samples give the machine learning model the information it needs to learn to associate texts with their corresponding categories. This is how we train, or “teach,” our classifiers: from the examples, the model automatically learns to generalize “rules” for classifying new, unlabeled texts.
In MonkeyLearn, training samples are simply a set of plain text examples, which could be extracted from webpages, articles, news, tweets, reviews, emails, chat conversations, and more. These texts should be representative of the texts you will want to classify in the future. Training data is usually in one of the following states:
- Tagged data: for each text you have the corresponding tag/category.
- Untagged data: you just have a bunch of texts without any tag or category.
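As a minimal sketch (the sample texts and tags below are made up), this is what the two states look like in practice, with tagged data parsed into (text, tag) pairs ready for training:

```python
import csv
import io

# Hypothetical tagged data: each row pairs a text with its tag/category.
tagged_csv = """text,tag
"The screen arrived cracked",Negative
"Great value for the price",Positive
"""

# Hypothetical untagged data: just a bunch of texts, no categories yet.
untagged = ["Shipping took three weeks", "Battery lasts all day"]

# Parse the tagged CSV into (text, tag) pairs ready for training.
reader = csv.DictReader(io.StringIO(tagged_csv))
training_samples = [(row["text"], row["tag"]) for row in reader]

print(training_samples)
# [('The screen arrived cracked', 'Negative'), ('Great value for the price', 'Positive')]
```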
With this in mind, usually you’ll be in one of the following situations:
- You have tagged data. That’s great! After getting the data, we recommend taking a look and checking that the tagging is correct. If it is, you are ready to add these samples to MonkeyLearn and train your classifier!
- You have untagged data. In that case, you should use tools to curate and tag the data. See tagging data for more information.
- You don’t have any data at all. In that case, you have to create the training dataset from scratch (gather and tag a training set). The following are some tips for gathering the data.
We suggest the following ways to gather data, covering both internal and external sources:
You can use internal data, like files, documents, spreadsheets, emails, support tickets, chat conversations, and more. You may already have this data in your own databases or in the tools that you use every day:
Customer Support / Interaction:
- HubSpot CRM
You can usually export this data, either through an export function that produces CSV files or through an API.
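Exported internal data often needs a cleanup pass before it can serve as plain text training samples; for instance, exported email or ticket bodies frequently contain HTML. A minimal sketch of stripping that markup, using only the Python standard library (the sample ticket body below is made up):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text from an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts).strip()

# Hypothetical body from an exported support-ticket CSV row.
raw_body = "<p>Hi, my <b>order</b> never arrived.</p>"

parser = TextExtractor()
parser.feed(raw_body)
print(parser.text())  # Hi, my order never arrived.
```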
Tools like Zapier or IFTTT can be very helpful for getting your training data, especially if you don’t have coding experience. You can use them to connect to the tools that you use every day through their APIs, but without writing code.
Otherwise, you can gather data automatically or programmatically from the web by using some of the following tools:
Web Scraping tools:
- Scrapy web scraping framework.
- Beautiful Soup parsing library.
- Screen scraping tools like Portia, Kimono, or Import.io.
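The tools above are far more capable, but the core idea of scraping can be sketched with just the Python standard library: parse a page’s HTML and pull out the pieces you want as texts. The page fragment below is hypothetical; in practice you would download real pages (e.g. with urllib.request or a framework like Scrapy):

```python
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collects the text inside <h2> tags, a common pattern for article titles."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles.append(data.strip())

# Hypothetical fragment of a news listing page.
html_page = "<h2>Markets rally</h2><p>...</p><h2>New phone released</h2>"

scraper = TitleScraper()
scraper.feed(html_page)
print(scraper.titles)  # ['Markets rally', 'New phone released']
```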
APIs provided by sites and companies (some free, some commercial), e.g.:
- Twitter public API for tweets.
- AngelList API for company data.
- Wikipedia API for articles.
- Yelp API for local businesses and reviews.
- Data.gov API for US government data.
- eBay API for retail product data.
- Foursquare API for places data.
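As an example of what calling one of these APIs looks like, here is a sketch that builds a request URL for the Wikipedia API to fetch a plain-text article extract. Only the URL construction is shown; the actual fetch (e.g. with urllib.request.urlopen) needs network access, and the article title is just an example:

```python
from urllib.parse import urlencode

# Parameters for the Wikipedia API's query module, asking for a
# plain-text extract of one article.
params = {
    "action": "query",      # query module
    "prop": "extracts",     # article extracts
    "explaintext": 1,       # plain text instead of HTML
    "format": "json",
    "titles": "Machine learning",  # example article title
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```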
Manually / Semi-automatically:
- Doing manual searches within the corresponding website’s search bar using related keywords to each category, and manually copying and pasting the title and content.
- Doing a manual search in Google with related keywords and restricting to a particular domain.
You can also use any other tool or technique that you are familiar with.