What is Extraction?

What does it mean to extract from texts, and how does it differ from classification?

Written by Raul Garreta
Updated over a week ago

Extraction models pull data out of a text: the result you are looking for already exists within the text itself.

Extraction differs from classification in that a classifier returns a tag that is usually not present within the text, and therefore has to be predicted or deduced from the text's contents.
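The contrast can be made concrete with a toy sketch. Nothing here is a MonkeyLearn model: the regular expression, the word list, and the function names `extract_emails` and `classify_topic` are assumptions chosen only to illustrate the distinction.

```python
import re

def extract_emails(text):
    """Toy extractor: every result is a substring of the input text."""
    # Simplified pattern for illustration; it is not a full email validator.
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)

def classify_topic(text):
    """Toy classifier: the returned tag does not need to appear in the text."""
    billing_words = {"invoice", "refund", "charge"}  # hypothetical word list
    words = set(re.findall(r"[a-z]+", text.lower()))
    return "Billing" if words & billing_words else "Other"

text = "I was charged twice, please send the refund to jane@example.com."
emails = extract_emails(text)   # values copied out of the text
tag = classify_topic(text)      # a label deduced from the text

print(emails, all(email in text for email in emails))
print(tag, tag in text)
```

The extracted email can be found verbatim in the input, while the tag is a label that the input never spells out.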

MonkeyLearn offers extraction models for different types of data: addresses, emails, entities, company names, keywords, etc. You can use the publicly available extraction models to solve your particular problem, or you can create your own custom extractor.

Examples of Extraction

To illustrate some possible applications of extraction, we'll look at a few pre-trained models that are available to all MonkeyLearn users.

Keyword Extraction

Keywords are relevant terms within a text that in some way summarize its contents. A keyword can consist of one or more words. Keywords can be used to index data for search, summarize texts, generate tag clouds, etc.

For example, if we have the following text as an input to a keyword extractor:

A panel of Goldman Sachs employees spent a recent Tuesday night at the Columbia University faculty club trying to convince a packed room of potential recruits that Wall Street, not Silicon Valley, was the place to be for computer scientists. The Goldman employees knew they had an uphill battle. They were fighting against perceptions of Wall Street as boring and regulation-bound and Silicon Valley as the promised land of flip-flops, beanbag chairs and million-dollar stock options. Their argument to the room of technologically inclined students was that Wall Street was where they could find far more challenging, diverse and, yes, lucrative jobs working on some of the world’s most difficult technical problems. “Whereas in other opportunities you might be considering, it is working one type of data or one type of application, we deal in hundreds of products in hundreds of markets, with thousands or tens of thousands of clients, every day, millions of times of day worldwide,” Afsheen Afshar, a managing director at Goldman Sachs, told the students.

The results may be something like:

{
  "result": [
    [
      {
        "count": 2,
        "relevance": "0.968",
        "positions_in_text": [
          164,
          339
        ],
        "keyword": "Wall Street"
      },
      {
        "count": 2,
        "relevance": "0.968",
        "positions_in_text": [
          181,
          386
        ],
        "keyword": "Silicon Valley"
      },
      {
        "count": 1,
        "relevance": "0.806",
        "positions_in_text": [
          456
        ],
        "keyword": "million-dollar stock options"
      },
      {
        "count": 1,
        "relevance": "0.806",
        "positions_in_text": [
          11
        ],
        "keyword": "Goldman Sachs employees"
      },
      {
        "count": 1,
        "relevance": "0.806",
        "positions_in_text": [
          80
        ],
        "keyword": "University faculty club"
      },
      {
        "count": 1,
        "relevance": "0.806",
        "positions_in_text": [
          43
        ],
        "keyword": "recent Tuesday night"
      },
      {
        "count": 1,
        "relevance": "0.484",
        "positions_in_text": [
          408
        ],
        "keyword": "promised land"
      },
      {
        "count": 1,
        "relevance": "0.484",
        "positions_in_text": [
          247
        ],
        "keyword": "Goldman employees"
      },
      {
        "count": 1,
        "relevance": "0.484",
        "positions_in_text": [
          221
        ],
        "keyword": "computer scientists"
      },
      {
        "count": 1,
        "relevance": "0.484",
        "positions_in_text": [
          282
        ],
        "keyword": "uphill battle"
      }
    ]
  ]
}

The extractor returned the most relevant terms within the text. As you can see, a keyword can consist of more than one word, and each keyword has a relevance score that indicates how important it is within that particular content.
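As a rough sketch of how such a response could be consumed, the snippet below works on a trimmed copy of the JSON above. It assumes that the outer list in "result" holds one entry per input text, and that positions_in_text holds zero-based character offsets; the second assumption holds for the excerpt checked here. The helper names `rank_keywords` and `offsets_match` are hypothetical.

```python
import json

# First sentence of the sample text (the rest is omitted to keep the sketch short).
text = (
    "A panel of Goldman Sachs employees spent a recent Tuesday night at the "
    "Columbia University faculty club trying to convince a packed room of "
    "potential recruits that Wall Street, not Silicon Valley, was the place "
    "to be for computer scientists."
)

# Trimmed copy of the response shown above.
raw_response = """
{"result": [[
  {"count": 2, "relevance": "0.968", "positions_in_text": [164, 339], "keyword": "Wall Street"},
  {"count": 1, "relevance": "0.806", "positions_in_text": [11], "keyword": "Goldman Sachs employees"},
  {"count": 1, "relevance": "0.484", "positions_in_text": [221], "keyword": "computer scientists"}
]]}
"""

def rank_keywords(response):
    """Flatten the nested result and sort keywords by numeric relevance."""
    items = [item for per_text in response["result"] for item in per_text]
    # Relevance arrives as a string such as "0.968", so convert before sorting.
    return sorted(items, key=lambda item: float(item["relevance"]), reverse=True)

def offsets_match(text, item):
    """Check that the keyword starts at each reported offset inside `text`."""
    # Offsets beyond the excerpt are skipped because only one sentence is kept here.
    in_range = [p for p in item["positions_in_text"] if p < len(text)]
    return all(text.startswith(item["keyword"], p) for p in in_range)

ranked = rank_keywords(json.loads(raw_response))
for item in ranked:
    print(item["keyword"], float(item["relevance"]), offsets_match(text, item))
```

Converting relevance to a number before sorting avoids comparing the values as strings, and the offsets make it possible to highlight each keyword in the original text.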

Entity Extraction

Entities can be persons, organizations, or locations. A Named Entity Recognition (NER) extractor returns the entities that appear in a text. It labels sequences of words that are names of things with their corresponding types: PERSON, ORGANIZATION, or LOCATION.

For example, if we have the following text as an input to an entity extractor:

In the 19th century, the major European powers had gone to great lengths to maintain a balance of power throughout Europe, resulting in the existence of a complex network of political and military alliances throughout the continent by 1900.[7] These had started in 1815, with the Holy Alliance between Prussia, Russia, and Austria. Then, in October 1873, German Chancellor Otto von Bismarck negotiated the League of the Three Emperors (German: Dreikaiserbund) between the monarchs of Austria-Hungary, Russia and Germany.

The results may be something like:

{
  "result": [
    [
      {
        "count": 1,
        "tag": "LOCATION",
        "entity": "Europe"
      },
      {
        "count": 1,
        "tag": "LOCATION",
        "entity": "Prussia"
      },
      {
        "count": 1,
        "tag": "LOCATION",
        "entity": "Austria-Hungary"
      },
      {
        "count": 1,
        "tag": "LOCATION",
        "entity": "Austria"
      },
      {
        "count": 1,
        "tag": "LOCATION",
        "entity": "Germany"
      },
      {
        "count": 1,
        "tag": "PERSON",
        "entity": "Otto von Bismarck"
      },
      {
        "count": 2,
        "tag": "LOCATION",
        "entity": "Russia"
      }
    ]
  ]
}

The extractor found six different locations in the text (Europe, Prussia, Austria-Hungary, Austria, Germany, and Russia) and one person (Otto von Bismarck). Russia has a count of 2 because it is mentioned twice.
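A summary like this one can be derived from the response by grouping entities by tag. The sketch below is a minimal illustration that copies the result above into a Python dictionary; the helper name `group_by_tag` is hypothetical, and treating the outer list in "result" as one entry per input text is an assumption.

```python
from collections import defaultdict

# Copy of the entity extraction response shown above.
response = {
    "result": [[
        {"count": 1, "tag": "LOCATION", "entity": "Europe"},
        {"count": 1, "tag": "LOCATION", "entity": "Prussia"},
        {"count": 1, "tag": "LOCATION", "entity": "Austria-Hungary"},
        {"count": 1, "tag": "LOCATION", "entity": "Austria"},
        {"count": 1, "tag": "LOCATION", "entity": "Germany"},
        {"count": 1, "tag": "PERSON", "entity": "Otto von Bismarck"},
        {"count": 2, "tag": "LOCATION", "entity": "Russia"},
    ]]
}

def group_by_tag(response):
    """Collect entity names under their tag (PERSON, LOCATION, ...)."""
    grouped = defaultdict(list)
    for per_text in response["result"]:
        for item in per_text:
            grouped[item["tag"]].append(item["entity"])
    return dict(grouped)

grouped = group_by_tag(response)

# Total number of mentions, adding up the "count" of every entity.
total_mentions = sum(
    item["count"] for per_text in response["result"] for item in per_text
)

for tag, entities in grouped.items():
    print(tag, len(entities), entities)
print("mentions:", total_mentions)
```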
