
Support for NLP tags in the Siren record table (The NLP cell formatters)

Published: Wednesday, March 16th, 2022

Natural language processing (NLP) is a critical AI capability when investigators need to make sense of large amounts of unstructured data such as e-mails, web content, or scientific publications.

Entity recognition is an area of NLP that extracts mentions of entities such as people, organisations, and locations from free text, often using sophisticated statistical models that can distinguish the fruit in “Apple juice, please” from the company in “I will invest in Apple”.

Siren is a modular platform, so text can be processed with either Siren’s built-in NLP or an external NLP engine.

All engines, however, typically produce outputs which “tag” the text with annotations: for example indicating that a specific word at a certain position is to be considered a “company”.

This tagging is stored in the record before indexing, and there is an expectation that the software will be able to show such tagging visually, for example by letting users click on tags to create filters.

The result is typically something like this: 

But this is only the beginning. 

As Siren supports data entry and information editing, there should also be a way to correct/edit the NLP annotations when in edit mode.

Siren Investigate version 11.0 introduced the capability to do all this via the Siren NLP formatter (part of the Record Table visualization), which has all of the above-mentioned features. 

This post explores some of the key features of this tool, including the required format of NLP annotation data, the configuration of the NLP formatter, and the editing of annotations in the NLP formatter.

The NLP Cell Formatter

The NLP Cell Formatter is an operator that can be applied to columns of the record table from the configuration options.

It formats the cells of a column containing textual data in the Record Table visualization with colored highlighting that depends on each annotation’s category, and displays a legend of the categories.

Clicking on a highlighted span sets a filter for documents that contain the entity represented by the span.

Note that text search highlighting is also handled properly: Siren displays it behind the NLP highlighting, as in the example below:

You can also watch it in action in this video:

Siren Investigative Intelligence : NLP & Media Annotations

Annotation data formats compatible with the NLP Cell Formatter

To use the NLP Cell Formatter, documents need to have a field containing annotations for the textual field you want to highlight. 

For example, in the JSON below the value of the field siren.nlp.instances.snippet is such an annotation field for the textual field snippet. Its value is an object whose keys are annotation categories (such as entity/location or entity/person) and whose values are arrays of objects holding the annotation data for that category. Each annotation object represents one highlighted span of text, with start and end offsets and an id, which is a normalised value for the entity of the kind NLP systems often generate. For brevity, only two categories with two annotations each are shown.

{
  "snippet": "A controversial e-commerce patent granted in Australia late last year could have an unexpected impact on trade with the United States according to NSW Liberal senator, John Tierney. It's alleged that the patent's owner, IP lawyer Edward Pool, told Australian trade officials in Washington he may use United States customs authorities to help enforce his claim in Australia.",
  "siren": {
    "nlp": {
      "instances": {
        "snippet": {
          "entity/location": [
            {
              "start": 120,
              "end": 133,
              "id": "entity/location:united states"
            },
            {
              "start": 278,
              "end": 288,
              "id": "entity/location:washington"
            }
          ],
          "entity/person": [
            {
              "start": 168,
              "end": 180,
              "id": "entity/person:john tierney"
            },
            {
              "start": 230,
              "end": 241,
              "id": "entity/person:edward pool"
            }
          ]
        }
      }
    }
  }
}
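In this example the offsets behave as zero-based character positions with an exclusive end, so slicing the text with start and end recovers each mention. A minimal Python sketch for sanity-checking annotation data before indexing, assuming that offset convention holds for your engine; the helper name annotated_spans is hypothetical and not part of Siren, and the text is shortened to the first sentence of the snippet:

```python
from typing import Dict, List, Tuple


def annotated_spans(text: str, annotations: Dict[str, List[dict]]) -> List[Tuple[str, str, str]]:
    """Return (category, surface text, id) for every annotation, in text order.

    Assumes zero-based character offsets with an exclusive end, as in the
    example document above. Raises ValueError for offsets outside the text.
    """
    spans = []
    for category, items in annotations.items():
        for item in items:
            start, end = item["start"], item["end"]
            if not 0 <= start < end <= len(text):
                raise ValueError(f"offsets out of range: {item}")
            spans.append((start, category, text[start:end], item["id"]))
    spans.sort()  # order by start offset
    return [(category, surface, ident) for _, category, surface, ident in spans]


snippet = (
    "A controversial e-commerce patent granted in Australia late last year "
    "could have an unexpected impact on trade with the United States "
    "according to NSW Liberal senator, John Tierney."
)
annotations = {
    "entity/location": [
        {"start": 120, "end": 133, "id": "entity/location:united states"},
    ],
    "entity/person": [
        {"start": 168, "end": 180, "id": "entity/person:john tierney"},
    ],
}

for category, surface, ident in annotated_spans(snippet, annotations):
    print(f"{category}: {surface!r} -> {ident}")
```

If an external engine reports offsets in bytes or UTF-16 code units rather than characters, the slices will drift on non-ASCII text, which is the kind of mismatch this check is meant to surface.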

Configuration of the NLP Cell Formatter

To configure a Record Table to show the annotations as highlights in the corresponding text field, add an NLP Cell Formatter to the column for the textual field and select its Annotations field. You also need to enter, under Annotations nested tags field, the name of the field within each annotation object that uniquely identifies the highlighted entity (id in the example above). Clicking the highlighted text then sets a filter for documents containing that normalized value, no matter how the entity is mentioned in the text, e.g. “U.S.” or “United States”. Available annotation types is the list of types offered when editing or creating annotations.
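The benefit of filtering on the identifier rather than on the surface text can be illustrated outside Siren. The sketch below is plain Python over in-memory documents and is not the query that Siren Investigate generates; the three sample documents, their offsets, and the helpers make_doc and docs_mentioning are invented for illustration, and the nested tag field is assumed to be id as in the JSON example.

```python
def make_doc(text, category, start, end, entity_id):
    """Build a document shaped like the annotation example (hypothetical helper)."""
    return {
        "snippet": text,
        "siren": {"nlp": {"instances": {"snippet": {
            category: [{"start": start, "end": end, "id": entity_id}],
        }}}},
    }


def docs_mentioning(docs, entity_id, text_field="snippet", id_field="id"):
    """Return the documents with at least one annotation whose id_field equals entity_id."""
    matches = []
    for doc in docs:
        by_category = doc["siren"]["nlp"]["instances"].get(text_field, {})
        if any(ann.get(id_field) == entity_id
               for anns in by_category.values()
               for ann in anns):
            matches.append(doc)
    return matches


docs = [
    make_doc("Trade with the U.S. grew.", "entity/location", 15, 19,
             "entity/location:united states"),
    make_doc("The United States signed.", "entity/location", 4, 17,
             "entity/location:united states"),
    make_doc("Talks were held in Washington.", "entity/location", 19, 29,
             "entity/location:washington"),
]

for doc in docs_mentioning(docs, "entity/location:united states"):
    print(doc["snippet"])
```

Both the “U.S.” and the “United States” documents match because they share the same normalized id, while a plain text search for either string would find only one of them.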

Editing annotations in the NLP Formatter

For tables that have been configured as editable, you can change, add, or delete annotations in the document Big View. Click on the Edit Document icon.

Click on the annotation you want to edit or delete (or click Add Annotation to add a new annotation). Here we have clicked on the annotation Bush.

Here we have changed the annotation from Bush to Bush administration by selecting the text (which also changes the offset values) and by changing the Annotation tag drop-down menu from entity/person to entity/organisation.
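In terms of the annotation data, an edit like this amounts to removing the span from one category and adding a span with new offsets to another. The following Python sketch shows that transformation under stated assumptions: the sentence is invented (the post does not give the edited text), the id is assumed to follow the category:lowercased-text pattern seen in the JSON example, and the helper retag is hypothetical. How Siren Investigate persists the edit internally is not covered here.

```python
import copy


def retag(annotations, old_category, index, new_category, start, end, new_id):
    """Return a copy of annotations with one span moved to new_category
    and given new offsets and a new id. The input is left unchanged."""
    updated = copy.deepcopy(annotations)
    updated[old_category].pop(index)
    if not updated[old_category]:
        del updated[old_category]  # drop categories that became empty
    updated.setdefault(new_category, []).append(
        {"start": start, "end": end, "id": new_id}
    )
    return updated


# Hypothetical sentence, for illustration only.
text = "The Bush administration announced new tariffs."
before = {
    "entity/person": [{"start": 4, "end": 8, "id": "entity/person:bush"}],
}

# Selecting the longer text changes the offsets as well as the tag.
selection = "Bush administration"
start = text.index(selection)
end = start + len(selection)

after = retag(
    before, "entity/person", 0,
    "entity/organisation", start, end,
    "entity/organisation:" + selection.lower(),
)
print(after)
```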

Conclusion

NLP is a critical AI capability for investigators who need to quickly make sense of corpora of textual information.

By extracting entities automatically, investigators can find non-trivial connections between documents and entities. The Siren NLP cell formatters allow these tags to be visualized (and edited!) in a convenient way.

If you want to explore the Siren Investigate NLP Cell Formatter yourself, please download Siren Platform Demo Data Community Edition, which contains an Article dashboard with a preconfigured NLP annotation table.

Resources

Siren Investigate

Siren Platform Demo Data Community Edition

NLP Cell Formatter documentation

Siren NLP

Siren NLP Plugin Download

Siren NLP Plugin Documentation
