Understanding your extractor statistics is a key part of improving your model's performance.
MonkeyLearn offers two groups of statistics. One group applies to the extractor overall, and the other is calculated for each tag. Both groups offer the same statistics, but they refer to slightly different concepts:
Overall: F1 Score, Precision, and Recall.
Tag Level: F1 Score, Precision, and Recall.
Overall Statistics
Overall statistics can be seen under the "Build" tab in the Stats section.
F1 Score
F1 Score is a measure of how well the extractor is doing its job, combining both Precision and Recall across all the tags (see below). It accounts for imbalances in the distribution of texts among tags better than looking at precision or recall alone.
Precision
Precision refers to the percentage of texts the extractor got right, out of the total number of texts that it predicted for all tags. In other words, the more false positive results your extractor has, the lower its precision.
Recall
Recall refers to the number of correct tags divided by the number of tags that should have been returned. In other words, the more false negative results your extractor has, the lower its recall.
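To see how these three overall metrics relate to each other, here is a minimal sketch with made-up counts of true positives, false positives, and false negatives (the numbers are purely illustrative and are not how MonkeyLearn reports its internal counts):

```python
# Hypothetical counts aggregated over all tags of an extractor.
true_positives = 80   # correct extractions
false_positives = 20  # segments extracted that should not have been
false_negatives = 40  # segments that should have been extracted but weren't

precision = true_positives / (true_positives + false_positives)  # 0.80
recall = true_positives / (true_positives + false_negatives)     # ~0.67
f1_score = 2 * precision * recall / (precision + recall)         # ~0.73

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1_score:.2f}")
```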
Tag Level Statistics
Tag level statistics can be seen in the Stats section by clicking on the individual tag.
Precision
Precision refers to the percentage of texts the extractor got right, out of the total number of texts that it predicted for a given tag.
Recall
Recall refers to the percentage of texts the extractor predicted for a given tag out of the total number of texts it should have predicted for that given tag.
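To make the difference with the overall metrics concrete, the sketch below computes precision and recall separately for each tag from hypothetical per-tag counts (the tag names and numbers are made up for illustration):

```python
# Hypothetical per-tag counts for an extractor with two tags.
counts = {
    "DATE":    {"tp": 50, "fp": 5,  "fn": 10},
    "AIRPORT": {"tp": 30, "fp": 15, "fn": 30},
}

for tag, c in counts.items():
    precision = c["tp"] / (c["tp"] + c["fp"])
    recall = c["tp"] / (c["tp"] + c["fn"])
    print(f"{tag}: precision={precision:.2f}, recall={recall:.2f}")
```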
What Makes Up These Metrics?
Many of the statistics for an extractor start with a simple question: was a text correctly extracted or not?
This forms the basis for four possible outcomes:
A true positive is an outcome where the model correctly predicts a tag that applies.
Similarly, a true negative is an outcome where the model correctly leaves out a tag that doesn't apply.
A false positive is an outcome where the model predicts a tag that doesn't actually apply.
And a false negative is an outcome where the model fails to predict a tag that does apply.
Various combinations of these four outcomes are used to calculate the statistics for your extractor.
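As a rough illustration of how these outcomes are tallied, the sketch below compares a hypothetical set of expected extractions with a hypothetical set of predicted extractions for one text: anything predicted but not expected is a false positive, and anything expected but not predicted is a false negative.

```python
# Hypothetical (tag, extracted segment) pairs for a single text.
expected = {("DATE", "January 14, 2020"), ("AIRPORT", "SFO")}
predicted = {("DATE", "January 14, 2020"), ("AIRPORT", "PM")}

true_positives = predicted & expected    # correct extractions
false_positives = predicted - expected   # extracted, but shouldn't have been
false_negatives = expected - predicted   # should have been extracted, but wasn't
```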
Rigorous Evaluation Stats
Extractors are evaluated by calculating the same standard performance metrics as classifiers, namely F1 score, precision, and recall.
However, these metrics do not account for partial matches of patterns. In order for an extracted segment to be a true positive for a tag, it has to be a perfect match with the segment that was supposed to be extracted.
Consider the following example:
'Your flight will depart on January 14, 2020 at 03:30 PM from SFO'
If we created a date extractor, we would expect it to return January 14, 2020 as a date from the text above, right? So, if the output of the extractor were January 14, 2020, we would count it as a true positive for the tag DATE.
But, what if the output of the extractor were January 14? Would you say the extraction was bad? Would you say it was a false positive for the tag DATE?
In short, keep in mind that MonkeyLearn applies a strict evaluation for extractor statistics and counts partial matches as mistakes, which increases the number of false positives and false negatives.
Your extractor may show low precision or recall, but when you actually test the model, you may find that it works better than expected: some partial matches might be good enough for you, even though they are not good enough for the extractor stats.
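Assuming the exact-match rule described above, a partial extraction ends up counted twice against the model: once as a false positive and once as a false negative. The sketch below, using the date example from this article, illustrates that strict comparison:

```python
expected_span = "January 14, 2020"
predicted_span = "January 14"  # a partial match

if predicted_span == expected_span:
    outcome = "true positive"
else:
    # Under the strict evaluation, the partial extraction counts as a
    # false positive, and the missed expected segment as a false negative.
    outcome = "false positive + false negative"

print(outcome)  # -> false positive + false negative
```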