N-gram range

Change this setting to take into account more complex expressions

Written by Raul Garreta
Updated over a week ago

The N-gram range setting determines which features are used to characterize texts:

  • Unigrams, or single words (n-gram size = 1)

  • Bigrams, or terms composed of two consecutive words (n-gram size = 2)

  • Trigrams, or terms composed of three consecutive words (n-gram size = 3)

Currently we support the following combinations:

  • Unigrams 

  • Unigrams and Bigrams (Default)

  • Unigrams, Bigrams and Trigrams

  • Bigrams

  • Bigrams and Trigrams

  • Trigrams
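
To make the combinations listed above concrete, here is a minimal Python sketch of how each one could translate into a set of features. It is an illustration only, not how the product implements the setting: the `NGRAM_RANGES` names, the `(min_n, max_n)` encoding, the `extract_ngrams` helper, and the naive whitespace tokenization are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding of the supported combinations as inclusive
# (min_n, max_n) pairs. The keys are illustrative names, not product settings.
NGRAM_RANGES: Dict[str, Tuple[int, int]] = {
    "unigrams": (1, 1),
    "unigrams_and_bigrams": (1, 2),  # the default combination
    "unigrams_bigrams_and_trigrams": (1, 3),
    "bigrams": (2, 2),
    "bigrams_and_trigrams": (2, 3),
    "trigrams": (3, 3),
}


def extract_ngrams(text: str, ngram_range: Tuple[int, int]) -> List[str]:
    """Return every n-gram in text whose size lies within ngram_range (inclusive)."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenization
    min_n, max_n = ngram_range
    features = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n consecutive tokens across the text.
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


if __name__ == "__main__":
    for name, ngram_range in NGRAM_RANGES.items():
        print(name, extract_ngrams("the food was not good", ngram_range))
```

Note that a range without unigrams, such as Trigrams, produces no features at all for texts shorter than the smallest n-gram size.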

For problems like Sentiment Analysis, n-gram ranges that include bigrams or trigrams can substantially improve classification accuracy, because they capture more complex expressions made up of more than one word. The rationale is that in Sentiment Analysis the outcome depends not only on how often words appear but also on how they are combined: "good" means something different on its own than when it is preceded by "not", as in "not good".
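
The "not good" case can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the `ngram_counts` helper, the two example sentences, and the whitespace tokenization are hypothetical and say nothing about how the classifier is actually built.

```python
from collections import Counter


def ngram_counts(text: str, min_n: int, max_n: int) -> Counter:
    """Count n-grams of sizes min_n..max_n (inclusive) in a whitespace-tokenized text."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenization
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


positive = "the movie was good"
negative = "the movie was not good"

# Unigrams only: both texts contain the feature "good", and the negation
# shows up only as an isolated "not" with no link to the word it modifies.
shared_unigrams = set(ngram_counts(positive, 1, 1)) & set(ngram_counts(negative, 1, 1))

# Unigrams and bigrams: the negative text gains the feature "not good",
# which ties the negation to the word it modifies.
only_in_negative = set(ngram_counts(negative, 1, 2)) - set(ngram_counts(positive, 1, 2))

if __name__ == "__main__":
    print(sorted(shared_unigrams))
    print(sorted(only_in_negative))
```

With bigrams enabled, a classifier can learn a weight for "not good" that is separate from the weights for "not" and "good".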

You can try different n-gram ranges to see what effect they have on your classifier statistics.
