This tab lets you clean up the data by stacking filters in a user-defined order, each with its own parameters. There are two main types of filters:
Ignore filters: discard rows that are just noise for your application. Any row that matches the filter is ignored completely.
Clean filters: clean up the text field by removing and/or substituting artifacts within the text.
The Cleaning Filters table on the left shows the list of filters that will be applied. You can add as many filters as you want, and you can also edit them, delete them, or change the order in which they are applied.
The Preview button applies the current filters to the subset of data shown in the table on the right. It takes a few seconds to process and update. This is very useful for iterating on the filter configuration before saving and processing the entire dataset.
Save and Clean saves the filter settings and processes all the training samples in the project with those filters. The result is then reflected in the Dataset View: you will see the cleaned texts, and you will not see samples that were filtered out.
💡 Keep in mind that the list of filters is a pipeline: the output of the first filter is the input of the second filter, and so on.
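To make the pipeline idea concrete, here is a minimal Python sketch. It is not the product's implementation: the function names (filter_empty, clean_urls, run_pipeline) and the convention that a filter returns None to ignore a row are assumptions chosen for illustration only.

```python
import re
from typing import Callable, Iterable, List, Optional

# Assumed convention for this sketch: a filter receives a text and returns
# the (possibly cleaned) text, or None when the row should be ignored.
Filter = Callable[[str], Optional[str]]


def filter_empty(text: str) -> Optional[str]:
    """Ignore filter: drop rows that contain only whitespace."""
    return text if text.strip() else None


def clean_urls(text: str) -> Optional[str]:
    """Clean filter: remove URLs, then collapse leftover whitespace."""
    without_urls = re.sub(r"https?://\S+", "", text)
    return " ".join(without_urls.split())


def run_pipeline(rows: Iterable[str], filters: List[Filter]) -> List[str]:
    """Apply filters in order; each filter sees the previous filter's output."""
    kept = []
    for text in rows:
        current: Optional[str] = text
        for apply_filter in filters:
            current = apply_filter(current)
            if current is None:
                break  # row ignored, later filters never run on it
        if current is not None:
            kept.append(current)
    return kept


rows = ["", "see https://example.com now", "hello"]
cleaned = run_pipeline(rows, [filter_empty, clean_urls])
```

In this sketch the empty row is dropped by the first filter, so the URL cleaner only runs on the two remaining rows.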
When you add a filter, you select its type from a predefined list of available filters. For each filter, you can configure parameters to customize its behavior. In the example below, you can edit the phrase list parameter (a list of phrases; if the text contains any of them, the sample is discarded):
You can always edit a filter you have already added by clicking on the edit button on the corresponding filter in the list.
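As a rough illustration of how a phrase list parameter could behave, the following Python sketch builds an ignore filter from a list of phrases. The name make_filter_containing is hypothetical, and the case-insensitive matching is an assumption of this sketch; the actual filter may match differently.

```python
from typing import Callable, List, Optional


def make_filter_containing(phrases: List[str]) -> Callable[[str], Optional[str]]:
    """Build an ignore filter from a phrase list (hypothetical helper).

    The returned filter gives back None (sample discarded) when the text
    contains any of the phrases, and the unchanged text otherwise.
    Matching is case-insensitive here, which is an assumption.
    """
    lowered = [phrase.lower() for phrase in phrases]

    def _filter(text: str) -> Optional[str]:
        haystack = text.lower()
        if any(phrase in haystack for phrase in lowered):
            return None
        return text

    return _filter


# Editing the filter amounts to rebuilding it with a different phrase list.
discard_auto_replies = make_filter_containing(["out of office", "unsubscribe"])
```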
💡 Put ignore filters that can quickly discard rows first in the list (e.g., Filter Empty, Filter Containing, Filter Matching Regex). This reduces the processing done by downstream filters: if a row is removed in step 1 or 2, there is no need to process it in step 3 and later.
💡 Some ignore filters, such as "Filter by Length", are better placed at the end, because earlier cleaning steps (e.g., removing URLs or long tokens) can reduce the length of the text.
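The effect of filter order on a length-based filter can be seen in a small Python sketch. The 40-character limit, the function names, and the sample text are all made up for illustration and do not reflect the product's defaults.

```python
import re
from typing import Callable, List, Optional

MAX_CHARS = 40  # hypothetical limit, for illustration only


def clean_urls(text: str) -> Optional[str]:
    """Clean filter: remove URLs, then collapse leftover whitespace."""
    return " ".join(re.sub(r"https?://\S+", "", text).split())


def filter_by_length(text: str) -> Optional[str]:
    """Ignore filter: drop texts longer than MAX_CHARS."""
    return text if len(text) <= MAX_CHARS else None


def apply_in_order(
    text: str, filters: List[Callable[[str], Optional[str]]]
) -> Optional[str]:
    """Run one text through the filters in order; None means discarded."""
    current: Optional[str] = text
    for apply_filter in filters:
        current = apply_filter(current)
        if current is None:
            return None
    return current


sample = "Great product https://example.com/a/very/long/tracking/path?id=12345"

# Length check first: the URL makes the text too long, so the row is lost.
length_first = apply_in_order(sample, [filter_by_length, clean_urls])

# Cleaning first: the URL is removed, and the short remaining text is kept.
clean_first = apply_in_order(sample, [clean_urls, filter_by_length])
```

Here length_first is None, while clean_first is "Great product", because the same row only fits under the limit once the URL has been cleaned away.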