The confusion matrix is a table that shows the confusion between the actual category (the category with which the sample was tagged) and the predicted category (the category predicted by the classifier).
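As a minimal sketch of how such a table is built, assuming Python with scikit-learn and some hypothetical article tags and predictions (the category names and samples below are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual tags and classifier predictions for a few news articles
categories = ["Arts & Culture", "Living", "Sports"]
y_true = ["Living", "Living", "Sports", "Arts & Culture", "Living"]
y_pred = ["Arts & Culture", "Living", "Sports", "Arts & Culture", "Living"]

# Rows are actual categories, columns are predicted categories
matrix = confusion_matrix(y_true, y_pred, labels=categories)
print(matrix)
# [[1 0 0]
#  [1 2 0]
#  [0 0 1]]
```

The off-diagonal 1 in the second row is a Living sample that the classifier predicted as Arts & Culture, exactly the kind of cell the matrix is designed to expose.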

The above example is a confusion matrix for news articles across six different categories (on each axis). You can see that where the red arrow is pointing, 130 samples were tagged as Living but were predicted as Arts & Culture. This indicates that the module is having a tough time distinguishing between the two categories, and if not retrained (with more samples or by cleaning the existing tags), the module will continue to operate with this confusion.

Note that ideally you want all numbers greater than 0 to lie on the blue diagonal of the matrix; that would mean your classifier works with 100% accuracy (100% precision and 100% recall in every involved category). Red numbers indicate substantial confusions, and the largest red numbers are the best candidates to start improving your classifier.
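If you have the matrix as raw numbers, a small sketch like the following (assuming numpy; the helper name is made up) can surface the largest off-diagonal cells, i.e. the confusions worth fixing first:

```python
import numpy as np

def largest_confusions(matrix, categories, top=3):
    """Return the largest off-diagonal cells, i.e. the worst confusions."""
    m = np.array(matrix, dtype=float)
    np.fill_diagonal(m, 0)  # ignore correct predictions on the diagonal
    # Sort flat indices by cell value, descending, keep the non-zero ones
    order = np.argsort(m, axis=None)[::-1]
    results = []
    for idx in order[:top]:
        i, j = np.unravel_index(idx, m.shape)
        if m[i, j] > 0:
            results.append((categories[i], categories[j], int(m[i, j])))
    return results

# e.g. [('Living', 'Arts & Culture', 130), ('Arts & Culture', 'Living', 49), ...]
```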

The confusion matrix is created by partitioning the dataset into 4 disjoint subsets and performing 4-fold cross-validation. This basically consists of training and testing 4 different models, each one trained on 3/4 of the data and tested on the remaining 1/4.
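The sketch below illustrates this procedure with scikit-learn's cross_val_predict on a hypothetical toy dataset; the platform's actual training pipeline is not shown here, so the vectorizer, classifier, and sample texts are just placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Hypothetical toy dataset: article snippets and their tagged categories
texts = [
    "gallery opens new exhibition", "museum hosts painting retrospective",
    "theater premieres modern play", "film festival announces lineup",
    "tips for a cozy home office", "easy weeknight dinner recipes",
    "decluttering your living room", "gardening on a small balcony",
]
tags = ["Arts & Culture"] * 4 + ["Living"] * 4

model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# cv=4 trains and tests 4 models; each sample is predicted by the model
# that did NOT see it during training
predictions = cross_val_predict(model, texts, tags, cv=4)
matrix = confusion_matrix(tags, predictions, labels=["Arts & Culture", "Living"])
print(matrix)
```

Because every prediction comes from a model that never saw that sample, the resulting matrix reflects how the classifier behaves on unseen data rather than on its own training set.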

You can click the numbers in the matrix to see the corresponding samples in the Samples section. For example, if you click on the red 49, you'll see something like this:

You can see the samples that were tagged as Arts & Culture but were predicted as Living at testing time.
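Programmatically, clicking a cell corresponds to filtering samples by their (actual, predicted) pair. A hypothetical helper, reusing the texts, tags, and predictions variables from the cross-validation sketch above:

```python
def samples_in_cell(texts, y_true, y_pred, actual, predicted):
    """Samples tagged as `actual` but predicted as `predicted`."""
    return [
        text
        for text, t, p in zip(texts, y_true, y_pred)
        if t == actual and p == predicted
    ]

# The programmatic equivalent of clicking the red 49 in the example above
confused = samples_in_cell(texts, tags, predictions,
                           actual="Arts & Culture", predicted="Living")
for sample in confused:
    print(sample)
```

Reviewing these samples is usually the fastest way to decide whether the fix is retagging mislabeled data or adding more training samples for the weaker category.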
