Dictionary TSV format

Dictionaries are used to normalize entities (also called entity linking or disambiguation).

The dictionary file you upload must be a .tsv (tab-separated values) file, or a compressed .zip or .tar.gz archive containing a single .tsv file.
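As an illustration of the packaging rule, the following minimal Python sketch wraps a list of already formatted dictionary lines into a .zip archive that contains exactly one .tsv file. The function name `dictionary_to_zip`, the default file name `dictionary.tsv`, and the UTF-8 encoding are assumptions made for this example; they are not requirements stated by this page.

```python
import io
import zipfile


def dictionary_to_zip(lines, tsv_name="dictionary.tsv"):
    """Return the bytes of a .zip archive holding a single .tsv file.

    `lines` is an iterable of tab-separated dictionary lines without
    trailing newlines. The helper name, the default file name, and the
    UTF-8 encoding are assumptions for this sketch.
    """
    # One entity per line, each line terminated by a newline.
    tsv_text = "".join(line + "\n" for line in lines)

    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # The archive contains exactly one .tsv member.
        zf.writestr(tsv_name, tsv_text.encode("utf-8"))
    return archive.getvalue()
```

The returned bytes can be written to disk, for example with `open("dictionary.zip", "wb").write(dictionary_to_zip(lines))`, before uploading.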

The dictionary format should follow this pattern:

  entity_1_id    recName1    recName2    ...
  entity_2_id    recName1    recName2    @@@    altName1    altName2    ...

The syntax is simple:

Each entity is defined on its own line. All columns are separated by tabs.

The first column is the entity's unique id. It can be an internal id (e.g. from your own database) or a publicly recognized one (e.g. from a known source such as Wikipedia).

After the id, a list of names follows. These are considered different names (synonyms) of the entity.

You can define recommended names and, optionally, alternative names. At least one recommended name must be given. Alternative names are those placed after the special delimiter @@@, which is itself separated from the neighboring names by tabs. Use alternative names when you know that some names appear less frequently than the standard ones. With this information, the system can handle synonyms better.
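The rules above can be sketched as a small Python check that you might run on a file before uploading it. This is only an illustration of the format as described on this page, not the parser the system itself uses; the function name `parse_line` and the choice to ignore empty columns are assumptions of this example.

```python
DELIMITER = "@@@"  # special column that separates recommended from alternative names


def parse_line(line):
    """Split one dictionary line into (entity_id, recommended, alternative).

    Raises ValueError if the id is missing or if no recommended name is given.
    """
    columns = line.rstrip("\r\n").split("\t")
    entity_id, names = columns[0], columns[1:]
    if not entity_id:
        raise ValueError("missing entity id in the first column")

    if DELIMITER in names:
        cut = names.index(DELIMITER)
        recommended, alternative = names[:cut], names[cut + 1:]
    else:
        recommended, alternative = names, []

    # Assumption of this sketch: empty columns are ignored.
    recommended = [name for name in recommended if name]
    alternative = [name for name in alternative if name]

    if not recommended:
        raise ValueError(f"entity {entity_id!r} has no recommended name")
    return entity_id, recommended, alternative
```

For a hypothetical line such as `parse_line("id_1\tNew York\tNew York City\t@@@\tBig Apple")`, the function returns the id `id_1`, the recommended names `New York` and `New York City`, and the alternative name `Big Apple`.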

👉 Here are some sample, reference dictionaries.