Extraction modules extract data from text: the result you are looking for already exists within the text.

Extraction differs from classification. In classification, the result is a predicted label, tag, or category that is usually not present in the text and therefore has to be inferred from its contents.

MonkeyLearn has different extraction modules for different types of data: addresses, emails, entities, company names, keywords, etc. Select the extraction module that fits your particular problem. In the near future, we will add the ability for users to create their own custom extractors.
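To make the idea of "the result exists within the text" concrete, here is a minimal sketch of a toy email extractor in Python. It is only an illustration built on a simple regular expression, not how MonkeyLearn's email extractor is implemented; the function name `extract_emails` and the sample text are hypothetical.

```python
import re

# Deliberately simple pattern for illustration; real email syntax is broader.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_PATTERN.findall(text)

sample_text = "Contact sales@example.com or support@example.org for details."
emails = extract_emails(sample_text)

# Every extracted value is a literal substring of the input text,
# which is what distinguishes extraction from classification.
print(emails)
print(all(email in sample_text for email in emails))
```

A classifier run on the same sentence would instead return a label such as a category name, which does not have to appear anywhere in the input.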

Examples of Extraction

MonkeyLearn provides several pre-created extraction modules that are ready to use. The following are some of them.

Keyword Extraction

Keywords are relevant terms within a text that in some way summarize its contents. A keyword can consist of one or more words. Keywords can be used to index data for search, summarize texts, generate tag clouds, etc.

For example, if we have the following text as an input to a keyword extractor:

A panel of Goldman Sachs employees spent a recent Tuesday night at the Columbia University faculty club trying to convince a packed room of potential recruits that Wall Street, not Silicon Valley, was the place to be for computer scientists. The Goldman employees knew they had an uphill battle. They were fighting against perceptions of Wall Street as boring and regulation-bound and Silicon Valley as the promised land of flip-flops, beanbag chairs and million-dollar stock options. Their argument to the room of technologically inclined students was that Wall Street was where they could find far more challenging, diverse and, yes, lucrative jobs working on some of the world’s most difficult technical problems. “Whereas in other opportunities you might be considering, it is working one type of data or one type of application, we deal in hundreds of products in hundreds of markets, with thousands or tens of thousands of clients, every day, millions of times of day worldwide,” Afsheen Afshar, a managing director at Goldman Sachs, told the students.

The results may be something like:

{
  "result": [
    [
      {
        "count": 2,
        "relevance": "0.968",
        "positions_in_text": [
          164,
          339
        ],
        "keyword": "Wall Street"
      },
      {
        "count": 2,
        "relevance": "0.968",
        "positions_in_text": [
          181,
          386
        ],
        "keyword": "Silicon Valley"
      },
      {
        "count": 1,
        "relevance": "0.806",
        "positions_in_text": [
          456
        ],
        "keyword": "million-dollar stock options"
      },
      {
        "count": 1,
        "relevance": "0.806",
        "positions_in_text": [
          11
        ],
        "keyword": "Goldman Sachs employees"
      },
      {
        "count": 1,
        "relevance": "0.806",
        "positions_in_text": [
          80
        ],
        "keyword": "University faculty club"
      },
      {
        "count": 1,
        "relevance": "0.806",
        "positions_in_text": [
          43
        ],
        "keyword": "recent Tuesday night"
      },
      {
        "count": 1,
        "relevance": "0.484",
        "positions_in_text": [
          408
        ],
        "keyword": "promised land"
      },
      {
        "count": 1,
        "relevance": "0.484",
        "positions_in_text": [
          247
        ],
        "keyword": "Goldman employees"
      },
      {
        "count": 1,
        "relevance": "0.484",
        "positions_in_text": [
          221
        ],
        "keyword": "computer scientists"
      },
      {
        "count": 1,
        "relevance": "0.484",
        "positions_in_text": [
          282
        ],
        "keyword": "uphill battle"
      }
    ]
  ]
}

The extractor returned the most relevant terms within the text. As you can see, a term can consist of more than one word, and each term has a corresponding relevance measure that indicates how important it is within that particular content.
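A response like the sample above can be post-processed with a few lines of Python. The sketch below is illustrative only: it uses an abbreviated copy of the sample response, assumes the outer "result" list holds one inner list per input text, and the helper name `top_keywords` and the 0.5 threshold are hypothetical choices. Note that "relevance" arrives as a string in the sample, so it is converted to a number before comparing.

```python
import json

# Abbreviated copy of the sample keyword response (same field names).
response_text = """
{
  "result": [
    [
      {"count": 2, "relevance": "0.968", "positions_in_text": [164, 339], "keyword": "Wall Street"},
      {"count": 1, "relevance": "0.806", "positions_in_text": [456], "keyword": "million-dollar stock options"},
      {"count": 1, "relevance": "0.484", "positions_in_text": [408], "keyword": "promised land"}
    ]
  ]
}
"""

def top_keywords(response, min_relevance=0.5):
    """Return (keyword, relevance) pairs for the first text, most relevant first."""
    keywords = response["result"][0]  # assumption: one inner list per input text
    scored = [(item["keyword"], float(item["relevance"])) for item in keywords]
    kept = [pair for pair in scored if pair[1] >= min_relevance]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

response = json.loads(response_text)
selected = top_keywords(response)
print(selected)
```

With the abbreviated data, only "Wall Street" and "million-dollar stock options" pass the threshold; lowering `min_relevance` would keep "promised land" as well.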

Entity Extraction

Entities can be persons, organizations, or locations. A Named Entity Recognition (NER) extractor returns the entities that exist within the text. NER extractors label sequences of words in a text that are the names of things, alongside their corresponding types: PERSON, ORGANIZATION, and LOCATION.

For example, if we have the following text as an input to an entity extractor:

In the 19th century, the major European powers had gone to great lengths to maintain a balance of power throughout Europe, resulting in the existence of a complex network of political and military alliances throughout the continent by 1900.[7] These had started in 1815, with the Holy Alliance between Prussia, Russia, and Austria. Then, in October 1873, German Chancellor Otto von Bismarck negotiated the League of the Three Emperors (German: Dreikaiserbund) between the monarchs of Austria-Hungary, Russia and Germany.

The results may be something like:

{
  "result": [
    [
      {
        "count": 1,
        "tag": "LOCATION",
        "entity": "Europe"
      },
      {
        "count": 1,
        "tag": "LOCATION",
        "entity": "Prussia"
      },
      {
        "count": 1,
        "tag": "LOCATION",
        "entity": "Austria-Hungary"
      },
      {
        "count": 1,
        "tag": "LOCATION",
        "entity": "Austria"
      },
      {
        "count": 1,
        "tag": "LOCATION",
        "entity": "Germany"
      },
      {
        "count": 1,
        "tag": "PERSON",
        "entity": "Otto von Bismarck"
      },
      {
        "count": 2,
        "tag": "LOCATION",
        "entity": "Russia"
      }
    ]
  ]
}

The extractor found that the text mentions six different locations (Europe, Prussia, Austria-Hungary, Austria, Germany and Russia) and one person (Otto von Bismarck).
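The same summary can be derived programmatically by grouping the returned entities by their tag. This is a minimal Python sketch that uses the entity list from the sample response as plain data; the helper name `group_by_tag` is hypothetical and not part of any MonkeyLearn library.

```python
from collections import defaultdict

# Entity list copied from the sample response (the inner list of "result").
entities = [
    {"count": 1, "tag": "LOCATION", "entity": "Europe"},
    {"count": 1, "tag": "LOCATION", "entity": "Prussia"},
    {"count": 1, "tag": "LOCATION", "entity": "Austria-Hungary"},
    {"count": 1, "tag": "LOCATION", "entity": "Austria"},
    {"count": 1, "tag": "LOCATION", "entity": "Germany"},
    {"count": 1, "tag": "PERSON", "entity": "Otto von Bismarck"},
    {"count": 2, "tag": "LOCATION", "entity": "Russia"},
]

def group_by_tag(items):
    """Map each tag (e.g. LOCATION) to the list of entity names carrying it."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["tag"]].append(item["entity"])
    return dict(grouped)

grouped = group_by_tag(entities)
for tag, names in grouped.items():
    print(tag, len(names), names)
```

Grouping by tag keeps each entity name once, regardless of its "count" value, which is why Russia contributes a single location even though its count is 2.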

See more of our pre-trained extraction modules.
