Social media preprocessing detects social media content such as mentions, hashtags, and emojis. Once detected, similar types of content are grouped under a single token, for example, _mention_ or _happy_. This way, models learn about mentions or hashtags in general rather than about specific mentions or specific hashtags.
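
To make the idea concrete, here is a minimal Python sketch of this kind of grouping. It is an illustration only, not the actual preprocessing implementation: the patterns, the function name group_social_tokens, and the _hashtag_ token are assumptions made for the example, and the emoticon pattern covers only a handful of "happy" symbols.

```python
import re

# Hypothetical patterns, deliberately simplified for illustration.
MENTION = re.compile(r"@\w+")              # e.g. @alice
HASHTAG = re.compile(r"#\w+")              # e.g. #python
HAPPY = re.compile(r"[:;]-?\)|😀|😊")       # e.g. :) ;-) and two emojis

def group_social_tokens(text):
    """Replace each detected item with a shared token (token names assumed)."""
    text = MENTION.sub("_mention_", text)
    text = HASHTAG.sub("_hashtag_", text)
    text = HAPPY.sub("_happy_", text)
    return text

print(group_social_tokens("Thanks @alice for the #python tips :)"))
```

After this step, two texts that mention different users both contain the same _mention_ token, so a model sees one general feature instead of many rare ones.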

Regular Expressions

Another way of grouping information under custom tokens or strings is by using regular expressions. A regular expression is a pattern that describes text; any text that matches the pattern is replaced with a custom token.

As an example, say you want to process email thread bodies, but your data contains lines like "On May 26, John Doe wrote:", which might be useless for the task at hand. You could type the regular expression "(?i)\bon .* wrote:" in the regular expression textbox, enter _header_ in the name textbox, and add _header_ to the stopword list. Every line that matches the pattern is then replaced with _header_, and because _header_ is a stopword, those lines will not be used as features in your model.
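
The effect of this setup can be sketched in a few lines of Python. This is a rough approximation under stated assumptions, not the product's actual pipeline: the function name extract_features, the whitespace tokenizer, and the STOPWORDS set are hypothetical, and the pattern used here is one that allows several words, digits, and punctuation between "On" and "wrote:".

```python
import re

# Assumed pattern: "on", any text on the same line, then "wrote:" (case-insensitive).
HEADER = re.compile(r"(?i)\bon .* wrote:")

# Hypothetical stopword list containing only the custom token.
STOPWORDS = {"_header_"}

def extract_features(body):
    """Replace quoted-reply headers with _header_, then drop stopwords."""
    replaced = HEADER.sub("_header_", body)
    tokens = replaced.split()  # naive whitespace tokenizer, for illustration
    return [t for t in tokens if t not in STOPWORDS]

body = "Sounds good.\nOn May 26, John Doe wrote:\nSee you then."
print(extract_features(body))
```

Because "." does not match a newline by default, the pattern stays within a single line, so only the header line is collapsed into the token and then filtered out.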

To learn more, consult a reference on regular expressions. There are many online regular expression testers to help you get started, and some of them also include a lot of information about regular expression syntax.
