A tab-separated format for representing the entities of an annotated document. The main characteristics are:

The EntitiesTsv format closely resembles the TSV output of the Stanford NER tool, as briefly shown here and here (see outputFormat). The main differences in EntitiesTsv are:

This leaves the user free to choose any text tokenizer later. If the user still wants to use Stanford’s default tokenizer, this is the corresponding Java class.


The format is best explained with an example 🙂:

Sample output

For comparison, the Stanford NER TSV format would produce the following output for the same text (obtained by running java -mx600m -cp ".:*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -outputFormat tsv -textFile my-sample.txt 2>/dev/null):

Sample Stanford NER TSV
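
To make the comparison concrete, here is a minimal Python sketch of how the Stanford NER TSV output could be read and its per-token labels grouped into whole entities. It assumes the usual layout of that output: one token and its label per line, separated by a tab, with a blank line between sentences and `O` marking tokens outside any entity. The function names `parse_stanford_tsv` and `merge_entities` and the sample rows are hypothetical and for illustration only; they are not part of EntitiesTsv or of the Stanford tools, and the rows are not the sample used above.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (token, label)


def parse_stanford_tsv(text: str) -> List[List[Row]]:
    """Group token/label rows into sentences, splitting on blank lines."""
    sentences: List[List[Row]] = []
    current: List[Row] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip rows that do not have a token and a label
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def merge_entities(sentence: List[Row], outside: str = "O") -> List[Row]:
    """Join adjacent tokens that share the same non-outside label."""
    entities: List[Row] = []
    prev_label = outside
    for token, label in sentence:
        if label != outside:
            if label == prev_label:
                # Same label as the previous token: extend the last entity.
                text, _ = entities[-1]
                entities[-1] = (text + " " + token, label)
            else:
                entities.append((token, label))
        prev_label = label
    return entities


# Hypothetical rows, made up for this sketch.
sample = "Barack\tPERSON\nObama\tPERSON\nvisited\tO\nParis\tLOCATION\n.\tO\n\n"
sentences = parse_stanford_tsv(sample)
entities = merge_entities(sentences[0])
print(entities)
```

Joining tokens with a single space is a simplification: it discards the original spacing of the text, and two distinct adjacent entities with the same label are merged into one.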