Text may come in any format: an article, a tweet, a cell in a spreadsheet, etc. We refer to each such instance as a "text", and to more than one as "texts" or "text data".
For MonkeyLearn to work properly with text data, the data needs to be stored in a recognizable structure so that a module can access each piece of text inside it.
MonkeyLearn should only be used if you already have the data readily available. Some data sources are already accessible through MonkeyLearn (see next image). Other datasets that are not listed here might need to be primed for use.
What should your data look like? If you imagine your data in a spreadsheet, each text would be on its own row, and if your data set is large this could mean thousands or millions of rows. Each text may also have corresponding tags, which make up the additional columns of the spreadsheet. A slice of that spreadsheet would look like this:
This format, although simple, can work with pretty much any kind of data. For example, a tweet, article, email, or support ticket would populate the "text sample" field, and the corresponding author, headline, tag, or label would populate the "category sample" fields as necessary.
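As a rough illustration of this layout, the sketch below writes a few rows to CSV with Python's standard `csv` module. The column names `text` and `tag`, the helper `rows_to_csv`, and the sample rows are assumptions made for this example, not names that MonkeyLearn requires.

```python
import csv
import io

# Hypothetical sample rows: one text per row, plus an optional tag column.
rows = [
    {"text": "The delivery arrived two days late.", "tag": "Shipping"},
    {"text": "Great support, my issue was solved quickly!", "tag": "Customer Service"},
    {"text": "The app crashes when I open settings.", "tag": "Bug"},
]


def rows_to_csv(rows, fieldnames=("text", "tag")):
    """Serialize a list of dicts to CSV text, one text per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

Texts that contain commas are quoted automatically by the `csv` writer, so each text still occupies exactly one cell.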
💥 In general, the more closely your data reflects this kind of structure, the more prepared it will be for use in MonkeyLearn.
So, what questions about your data are most important?
The first important question to consider is: where does your data come from? It may be a .csv or an Excel file, or it may be a list of tweets, support tickets or emails. It may come from live webpages on the internet. The source of your data will likely have a lot of impact on its format and on the following questions.
The second question to ask is: how easy is your data to access? If your data is already housed in a spreadsheet, it is going to be easier to work with than if it is currently contained in live webpages all over the internet. In the latter case, the data will have to be "scraped" off those pages and translated into something similar to the format of rows and columns mentioned previously so that MonkeyLearn can access it.
A third question that might come into play is: how clean is your data? Perhaps your texts come with leftover code (such as markup) or some other extra junk in them. It could be that some of the tags or categories are not applied correctly. There may be a situation where a number of columns need to be joined together to get each full sample into one cell (called "concatenation"). Ideally, all of this data cleaning would happen before you try to access the data in MonkeyLearn.
This is all the more important when you are going to be creating a custom model, where your current texts are going to be used to build associations so the model can learn to make predictions about future texts. If your current data set is not clean, predictions about future texts won't be clean either.
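The cleaning steps described above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the helpers `clean_text` and `concatenate` and the column names `subject` and `body` are hypothetical, and the regular expression is a crude way to drop markup that will not handle every page.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")      # anything that looks like an HTML tag
WHITESPACE_RE = re.compile(r"\s+")   # runs of spaces, tabs, and newlines


def clean_text(raw):
    """Strip HTML-like tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    decoded = html.unescape(no_tags)
    return WHITESPACE_RE.sub(" ", decoded).strip()


def concatenate(row, columns, separator=" "):
    """Join several columns of a row into a single cleaned text."""
    parts = [clean_text(row.get(col, "")) for col in columns]
    return separator.join(p for p in parts if p)


# Hypothetical support ticket split across two columns.
row = {
    "subject": "<b>Refund request</b>",
    "body": "I was charged twice &amp; need a refund.<br>",
}
text = concatenate(row, ["subject", "body"])
print(text)
```

Checking tags or categories for correctness is harder to automate and usually needs a manual review of a sample of rows.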
What if you don't yet have access to your data?
Here are some ways to automatically, or programmatically, get data from the web:
Web Scraping tools:
- Scrapy web scraping framework.
- Beautiful Soup HTML parsing library.
- Screen scraping tools like Portia, Kimono or Import.io
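To give a sense of what such tools do, here is a minimal sketch that turns one HTML page into rows of text. It uses Python's built-in `html.parser` as a stand-in, not Scrapy or Beautiful Soup, so that it runs without extra installs. The class `ParagraphExtractor`, the function `page_to_rows`, the `headline` column and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the page title and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag being captured: "title", "p", or None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the captured pieces and normalize whitespace.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._current = None


def page_to_rows(html_source):
    """Return one row per paragraph, with the page title as a second column."""
    parser = ParagraphExtractor()
    parser.feed(html_source)
    parser.close()
    return [{"text": p, "headline": parser.title} for p in parser.paragraphs]


sample_page = """
<html><head><title>Store news</title></head>
<body>
  <p>We opened a <b>new</b> shop downtown.</p>
  <p>Opening hours are 9 to 5.</p>
</body></html>
"""
rows = page_to_rows(sample_page)
print(rows)
```

Real pages are messier than this sample, which is why dedicated scraping tools are usually the better choice for anything beyond a quick experiment.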
APIs provided by sites and companies (some free and some commercial), e.g.:
- Twitter public API for tweets.
- AngelList API for company data.
- Wikipedia API for articles.
- Yelp API for local businesses and reviews.
- Data.gov API for US government data.
- eBay API for retail product data.
- Foursquare API for places data.
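APIs like these typically return structured records, often as JSON, which then need to be flattened into rows. The sketch below shows that step on a made-up payload. The field names (`results`, `body`, `user`, `name`) and the function `response_to_rows` are hypothetical and do not correspond to any of the APIs listed above; each real API documents its own response format.

```python
import json

# Hypothetical payload: the field names are illustrative only and do not
# match any specific API.
sample_response = """
{
  "results": [
    {"id": 1, "body": "Loved the pasta here.", "user": {"name": "ana"}},
    {"id": 2, "body": "", "user": {"name": "ben"}},
    {"id": 3, "body": "Slow service on weekends.", "user": null}
  ]
}
"""


def response_to_rows(payload):
    """Flatten a JSON payload into rows with a text and an author column."""
    data = json.loads(payload)
    rows = []
    for item in data.get("results", []):
        text = (item.get("body") or "").strip()
        if not text:
            continue  # skip records with no usable text
        author = (item.get("user") or {}).get("name", "")
        rows.append({"text": text, "author": author})
    return rows


rows = response_to_rows(sample_response)
print(rows)
```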
Manually / Semi-automatically:
- Doing manual searches in the corresponding website's search bar using keywords related to each category, and manually copying and pasting the title and content.
- Doing a manual search in Google with related keywords and restricting to a particular domain.
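For the semi-automatic route, a small helper can generate the search queries to paste into Google. This sketch relies on Google's `site:` operator to restrict results to one domain. The function names, keywords and the domain `example.com` are placeholders chosen for this example.

```python
from urllib.parse import urlencode


def build_site_query(keywords, domain):
    """Combine keywords with the site: operator to restrict the domain."""
    return " ".join(list(keywords) + [f"site:{domain}"])


def build_search_url(keywords, domain):
    """Build a search URL that can be opened manually in a browser."""
    query = build_site_query(keywords, domain)
    return "https://www.google.com/search?" + urlencode({"q": query})


# Placeholder keywords for one category and a placeholder domain.
query = build_site_query(["shipping", "delay"], "example.com")
url = build_search_url(["shipping", "delay"], "example.com")
print(query)
print(url)
```

Generating one query per category keeps the manual copying and pasting organized, since each batch of results maps to a single tag.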