MonkeyLearn classifiers accept Comma Separated Values (CSV) or Excel files when importing data, and produce CSV files when exporting their data (the category tree and its samples). The following sections describe the formats MonkeyLearn accepts in more detail.

✔️ CSV Primer

CSV files are plain text files with a specific format for representing rows and columns: usually each line represents a row, and within each line commas separate the columns.

Take, for example, a two-column table of movie titles and their genres. The same data represented as a CSV file would look like this:

2001: A Space Odyssey,Science Fiction
Kung Fu Panda,Animation

If a value itself contains a comma, wrap it in quotation marks ("). To include a quotation mark inside a quoted value, escape it by adding another quotation mark before it. Let’s add a couple of lines to our CSV to illustrate this:

2001: A Space Odyssey,Science Fiction
Kung Fu Panda,Animation
"The Good, the Bad and the Ugly",Western
"Dr. Strangelove or: ""How I Learned to Stop Worrying and Love the Bomb""",Comedy

Finally, note that values can span multiple lines, but you must wrap them in quotation marks. Take a look at the following example:

"This is a single line column","And another single line column."
"This is a multi-line column.
It continues here","This is a second multi-line column.
It ends here"

The last three lines represent a single row whose columns contain newline characters.
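
To check how a parser sees multi-line values, here is a small sketch that reads the example with Python's standard csv module. The choice of Python and the helper name read_rows are assumptions for illustration only:

```python
import csv
import io

csv_text = (
    '"This is a single line column","And another single line column."\n'
    '"This is a multi-line column.\n'
    'It continues here","This is a second multi-line column.\n'
    'It ends here"\n'
)


def read_rows(text):
    """Parse CSV text into a list of rows (each row is a list of columns)."""
    # newline="" leaves line breaks untouched so the csv module can decide
    # which ones end a row and which ones belong to a quoted value.
    return list(csv.reader(io.StringIO(text, newline="")))


rows = read_rows(csv_text)
print(len(rows))   # number of rows, not number of text lines
print(rows[1][0])  # first column of the second row, including its line break
```

The text has four lines, but the parser returns two rows with two columns each.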

Now that you have the basics of the CSV format, we can describe MonkeyLearn’s specific CSV requirements for representing classifier and dataset data.

MonkeyLearn CSV/Excel format

MonkeyLearn uses CSV or Excel files when importing data into the classification modules, and CSV files when exporting it. Each row holds two values:

  • Sample text
  • Full category path

With a set of these two values you can describe the full category tree hierarchy, the sample data, and the relation between them.

Consider the following CSV example:

"Terusan Sutami III Setrasari Bandung Hotel ...",/Travel & Vacations/Hotels
"Malacca: 2D1N Stay at 5-Star Hotel Casa Del Rio with Breakfast ...",/Travel & Vacations/Hotels

When importing a CSV like this, each of the two lines creates a new sample, along with every category in its path that does not already exist.

For the first line this particular data will:

  • Create a sample with the text Terusan Sutami III Setrasari Bandung Hotel …
  • Create the category Travel & Vacations (as a subcategory of the root category)
  • Create the Hotels category (as a Travel & Vacations subcategory)
  • Associate the sample with the Hotels category.

while the second line will:

  • Create a sample with the text Malacca: 2D1N Stay at 5-Star Hotel Casa Del Rio with Breakfast …
  • Associate the sample with the already created Hotels category.

Note that the category path is an ordered list of categories, from the root down to the category the sample belongs to, separated by the slash “/” character. A category path must always start with a slash. Excel files use the same slash notation to denote category hierarchies.
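
The import behavior and the path notation can be sketched in a few lines of Python. This is only an illustration of the rules described here, not MonkeyLearn's actual implementation: the function names, the nested dict used as a tree, and the rejection of empty category names are all assumptions.

```python
def parse_category_path(path):
    """Split '/A/B' into ['A', 'B'], enforcing the leading slash."""
    if not path.startswith("/"):
        raise ValueError(f"category path must start with a slash: {path!r}")
    names = path[1:].split("/")
    if "" in names:
        # Assumption: empty category names are treated as an error.
        raise ValueError(f"category path has an empty name: {path!r}")
    return names


def import_rows(rows):
    """Return (tree, samples) for an iterable of (text, category path) rows."""
    tree = {}      # hypothetical in-memory tree: {name: {child name: {...}}}
    samples = []   # (sample text, list of category names) pairs
    for text, path in rows:
        node = tree
        names = parse_category_path(path)
        for name in names:
            # setdefault creates a category only if it does not exist yet.
            node = node.setdefault(name, {})
        samples.append((text, names))
    return tree, samples


rows = [
    ("Terusan Sutami III Setrasari Bandung Hotel ...", "/Travel & Vacations/Hotels"),
    ("Malacca: 2D1N Stay at 5-Star Hotel Casa Del Rio with Breakfast ...", "/Travel & Vacations/Hotels"),
]
tree, samples = import_rows(rows)
print(tree)
```

The first row creates both categories, while the second row reuses them, so the tree contains a single Travel & Vacations category with a single Hotels subcategory.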

MonkeyLearn supports zipped CSV or Excel files. When importing a CSV or Excel file, we recommend always zipping it before uploading: the upload takes less time and, more importantly, the smaller file is less likely to reach the file size limit. When you export a classifier, you’ll always get a zipped CSV file to speed up the download.
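
If you prepare your files with a script, zipping can be part of the same step. The sketch below uses Python's standard zipfile module. The helper name zip_for_upload, the sample file contents, and the choice to place exactly one file in the archive are assumptions, not documented MonkeyLearn requirements.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_for_upload(source_path):
    """Create '<name>.zip' next to the source file and return its path."""
    source = Path(source_path)
    target = source.with_suffix(".zip")
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # arcname keeps only the file name, not the full directory path.
        archive.write(source, arcname=source.name)
    return target


# Demo on a throwaway file with made-up, highly repetitive content.
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "samples.csv"
    csv_path.write_text("Kung Fu Panda,/Movies/Animation\n" * 1000, encoding="utf-8")
    zip_path = zip_for_upload(csv_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive_names = archive.namelist()
    sizes = (csv_path.stat().st_size, zip_path.stat().st_size)

print(archive_names, sizes)
```

How much space you save depends on your data; repetitive text such as category paths tends to compress well.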

⚠️ Note that MonkeyLearn enforces some limitations on the tree structure:

  • A maximum of 4000 categories in total.
  • A maximum of 50 categories with the same parent.
  • A maximum tree depth of 8 levels.
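
Before uploading a large dataset, you may want to check these limits yourself. The following Python sketch is one possible way to do it. It assumes that depth is counted as the number of categories in a path (the root excluded) and that top-level categories count as sharing the root as their parent; neither detail is confirmed here, and the function name is hypothetical.

```python
from collections import defaultdict

MAX_CATEGORIES = 4000  # total categories in the tree
MAX_SIBLINGS = 50      # categories with the same parent
MAX_DEPTH = 8          # levels in the tree


def check_tree_limits(paths):
    """Return a list of human-readable problems; empty if all limits hold."""
    categories = set()           # each category identified by its full path
    children = defaultdict(set)  # parent path -> set of child paths
    problems = []
    for path in sorted(set(paths)):
        names = tuple(path.lstrip("/").split("/"))
        if len(names) > MAX_DEPTH:
            problems.append(f"{path}: depth {len(names)} exceeds the limit of {MAX_DEPTH}")
        for depth in range(1, len(names) + 1):
            node = names[:depth]
            categories.add(node)
            children[node[:-1]].add(node)
    if len(categories) > MAX_CATEGORIES:
        problems.append(f"{len(categories)} categories exceed the limit of {MAX_CATEGORIES}")
    for parent, kids in children.items():
        if len(kids) > MAX_SIBLINGS:
            parent_path = "/" + "/".join(parent)
            problems.append(
                f"{parent_path}: {len(kids)} subcategories exceed the limit of {MAX_SIBLINGS}"
            )
    return problems


print(check_tree_limits(["/Travel & Vacations/Hotels", "/Food/Breakfast"]))
```

An empty list means none of the three limits is exceeded under the assumptions above.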

Multilabel classifiers CSV/Excel file

If your classifier is multilabel, you can upload the same sample to several categories by using a format like the following CSV example:

"Terusan Sutami III Setrasari Bandung Hotel ...",/Travel & Vacations/Hotels:/Buildings/Hotels
"Malacca: 2D1N Stay at 5-Star Hotel Casa Del Rio with Breakfast ...",/Travel & Vacations/Hotels:/Food/Breakfast

As you can see, the format is almost the same as in the single-label case described before, but you can put a sample in as many categories as you want by separating the category paths with a colon “:” character.

The same applies to Excel files when denoting multiple labels for a single sample.
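
As a rough illustration, the category column of a multilabel row can be split and rebuilt with plain string operations in Python. The helper names are hypothetical, and the sketch assumes that category names never contain a colon themselves, which is not addressed here.

```python
import csv
import io


def split_categories(field):
    """Turn '/A/B:/C/D' into ['/A/B', '/C/D']."""
    return field.split(":")


def join_categories(paths):
    """Turn ['/A/B', '/C/D'] back into '/A/B:/C/D'."""
    return ":".join(paths)


line = (
    '"Terusan Sutami III Setrasari Bandung Hotel ...",'
    "/Travel & Vacations/Hotels:/Buildings/Hotels\n"
)
# Each row still has exactly two columns: the text and the category field.
text, category_field = next(csv.reader(io.StringIO(line, newline="")))
paths = split_categories(category_field)
print(paths)
```

Joining the paths again with join_categories gives back the original category column.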

Encoding requirements

The CSV file should be UTF-8 encoded; latin-1 should also work if that’s a good choice for your data. We do our best to auto-detect other encodings, but this process might fail, so we strongly recommend always using UTF-8 if possible.
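
If your data is stored in another encoding and you know which one, converting it to UTF-8 before uploading avoids relying on auto-detection. A minimal Python sketch follows; the function name and the sample row are made up for illustration, and the source encoding must be supplied by you because the code does not detect it.

```python
def reencode_to_utf8(data, source_encoding="latin-1"):
    """Decode bytes from a known encoding and return them as UTF-8 bytes."""
    return data.decode(source_encoding).encode("utf-8")


# A made-up row stored as latin-1, where "é" and "á" take one byte each.
latin1_bytes = "Café in Málaga,/Travel & Vacations/Restaurants\n".encode("latin-1")
utf8_bytes = reencode_to_utf8(latin1_bytes)
print(utf8_bytes.decode("utf-8"))
```

For a file on disk, the same idea applies: read its bytes, convert them, and write the result to a new file.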

If you are exporting the CSV from an XLS file, we don’t recommend using Excel to save it as CSV, since Excel is known to have issues with some Unicode characters. Use an alternative tool instead; Google Docs Spreadsheets and LibreOffice are good alternatives for this task.

If you find yourself having issues with your data encoding when importing a CSV, contact our support team.

Uploading data files

Uploading data files is easy using the GUI wizard. Within the classifier, go to the Sandbox/Samples tab and click the Upload button. Then follow the wizard to select the file format (CSV or Excel), select the file from your drive, and choose the column that will be used as the sample content and the column that will be used as the category.

When you upload a tagged dataset, you can specify the column that has the text content and the column that has the category. Use the combo boxes at the top of each column to select “Use as text” or “Use as category” respectively.

File upload size limitations

You must keep your file size (whether it is zipped or not) under 100 MB.
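
A quick size check before uploading can save a failed attempt. The Python sketch below assumes that 100 MB means 100,000,000 bytes, the more conservative reading, since the exact definition is not given here. The function name and the sample file are hypothetical.

```python
import os
import tempfile

# Assumption: 1 MB = 1,000,000 bytes (the stricter interpretation).
MAX_UPLOAD_BYTES = 100 * 1000 * 1000


def fits_upload_limit(path, limit=MAX_UPLOAD_BYTES):
    """Return True if the file at `path` is strictly under the size limit."""
    return os.path.getsize(path) < limit


# Demo on a tiny throwaway file with made-up content.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as handle:
    handle.write(b"Kung Fu Panda,/Movies/Animation\n")
    demo_path = handle.name

demo_ok = fits_upload_limit(demo_path)
os.remove(demo_path)
print(demo_ok)
```

The same check applies to a zipped file, since the limit holds whether the file is zipped or not.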
