Text may come in any format: an article, a tweet, a cell in a spreadsheet, etc. In our industry jargon, each piece of text that will be analyzed is called a "sample," and we often refer to the entire set of samples as your "data."
In order for MonkeyLearn to work properly with text, the text needs to be stored in a recognized format so that a module can access the data inside. MonkeyLearn should only be used once you have the data readily available. Some data sources are already accessible through MonkeyLearn (see next image). Other datasets that are not listed there might need to be prepared before use.
What should your data look like? If you imagine your data in a spreadsheet, each sample would be on its own row; a large dataset could run to thousands or millions of rows. Each sample may also have one or more corresponding categories, which make up the additional columns of the spreadsheet. A slice of that spreadsheet would look like this:
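To make the row-per-sample, column-per-field layout concrete, here is a small sketch that builds a few hypothetical samples (the texts and category names are invented for illustration) and writes them out as CSV, the shape MonkeyLearn expects:

```python
import csv
import io

# Hypothetical samples: each row is one text sample plus its category.
rows = [
    {"text": "My package arrived two weeks late.", "category": "Shipping"},
    {"text": "How do I reset my password?", "category": "Account"},
    {"text": "Great product, works as advertised.", "category": "Feedback"},
]

# One sample per row, one field per column.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "category"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The same structure works whether the "text" column holds tweets, emails, or full articles; only the contents change.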
This format, although simple, can accommodate pretty much any kind of data. For example, a tweet, article, email, or support ticket would populate the "text sample" field, and the corresponding author, headline, tag, or label would populate the "category" fields as needed. In general, the closer your data is to this structure, the more prepared it will be for use in MonkeyLearn.
So, what questions about your data are most important?
The first question to consider is: where does your data come from? It may be a .csv or Excel file; it may be a list of tweets, support tickets, or emails; or it may come from live webpages on the internet. The source of your data will largely determine the answers to the questions that follow.
The second question is: how easy is your data to access? Data already housed in a spreadsheet is going to be much easier to work with than data scattered across live webpages. In the latter case, the data will have to be "scraped" and translated into something close to the rows-and-columns format mentioned previously so that MonkeyLearn can access it.
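As a rough sketch of what "scraping" means here, the snippet below pulls the visible text out of an HTML page using only Python's standard library, producing the contents of one "text sample" cell. The page is a hardcoded stand-in; in practice you would fetch it with an HTTP client or one of the tools listed later.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

# Stand-in for a downloaded page (fetching is out of scope here).
page = ("<html><body><h1>Refund policy</h1>"
        "<p>Returns accepted within 30 days.</p>"
        "<script>var x = 1;</script></body></html>")

extractor = TextExtractor()
extractor.feed(page)
sample = " ".join(extractor.parts)  # one row's "text sample" cell
print(sample)
```

Dedicated tools like Scrapy or Beautiful Soup handle the messy realities of live pages far better; this only illustrates the idea of turning a page into a row.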
A third question that might come into play is: how clean is your data? Perhaps your text samples arrive with embedded markup or other junk in them, or some of the categories have been applied incorrectly. You may also need to join several columns together so that each complete sample ends up in one cell (called "concatenation"). Ideally, all of this data cleaning would happen before trying to access the data in MonkeyLearn.
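Both cleaning steps mentioned above can be sketched in a few lines. The raw rows below are invented for illustration: each has a subject and a body in separate columns, plus HTML tags and entities to strip before concatenating everything into a single text sample per row.

```python
import html
import re

# Hypothetical raw export: subject and body in separate columns, with junk.
raw_rows = [
    {"subject": "Order &amp; delivery", "body": "<p>Where is my order?</p>"},
    {"subject": "Login issue", "body": "<div>I can't log in!!</div>\n\n"},
]

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = html.unescape(text)                  # decode entities like &amp;
    return re.sub(r"\s+", " ", text).strip()    # collapse stray whitespace

# Concatenate the columns so each sample occupies one cell.
samples = [clean(r["subject"] + " " + r["body"]) for r in raw_rows]
print(samples)
```

Real exports usually need more care (encodings, nested markup, duplicated rows), but the principle is the same: clean first, then load.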
This is all the more important when you are creating a custom module, where your current samples are used to build associations so the module can learn to classify future samples. If your current dataset is not clean, neither will the future predictions be.
What if you don't yet have access to your data?
Here are some ways to automatically, or programmatically, get data from the web:
Web Scraping tools:
- Scrapy web scraping framework.
- Beautiful Soup parsing library.
- Screen scraping tools like Portia, Kimono or Import.io
APIs provided by sites and companies (some free, some commercial), e.g.:
- Twitter public API for tweets.
- AngelList API for company data.
- Wikipedia API for articles.
- Yelp API for local businesses and reviews.
- Data.gov API for US government data.
- eBay API for retail product data.
- Foursquare API for places data.
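For APIs like the ones above, getting data is usually a matter of building a request URL with the right query parameters. As a sketch, here is how a query for the Wikipedia API (one of the sources listed) could be assembled with the standard library; actually fetching the URL with an HTTP client is left out:

```python
from urllib.parse import urlencode

# Query parameters for the Wikipedia API's article-extract endpoint.
params = {
    "action": "query",
    "prop": "extracts",
    "titles": "Machine learning",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```

Each API has its own endpoints, parameters, and authentication rules, so check the provider's documentation; the URL-building pattern is what carries over.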
Manually / Semi-automatically:
- Doing manual searches within the corresponding website’s search bar using keywords related to each category, and manually copying and pasting the title and content.
- Doing a manual search on Google with related keywords, restricting results to a particular domain.
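The domain-restricted search above can also be scripted. This sketch builds a Google search URL using the `site:` operator; the keyword and domain are hypothetical placeholders, and opening the URL (in a browser or otherwise) is left to you:

```python
from urllib.parse import quote_plus

keyword = "refund policy"   # hypothetical keyword tied to one category
domain = "example.com"      # hypothetical website to restrict results to

# "site:" limits Google results to the given domain.
query = f"{keyword} site:{domain}"
search_url = "https://www.google.com/search?q=" + quote_plus(query)
print(search_url)
```

Generating one such URL per category keyword can speed up the copy-and-paste workflow considerably.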