Inputs & Outputs

Input types

These are the types of content you can import into tagtog.

Input type	Description	Default `format`
Text	Plain text.	`verbatim`
File	See below for the supported file types	See below for the default formats for each file type. You can import one or more files in a single request.
URL	Web address pointing to any website (e.g. `http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000245.v1.p1`) or resource (e.g. `https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf`).	The default format for the file type that the URL points to (see below). For example, if the URL points to a text file or a PDF file, that file is imported to tagtog with its default format. If the URL points to an HTML file, the text is stripped from the HTML content and imported to tagtog using the default format for the `html` file type.
PMID	PubMed is a free online database of references on life sciences. Each record in the PubMed database is assigned a unique number that identifies it: the PMID. A PMID consists only of digits, e.g. `12781165`. The prefixed form `PMID12781165` is also accepted. You can enter a comma-separated list of PMIDs, e.g. `25821226,12781165`, and each document will be uploaded. You can find this ID at the bottom of the document's page on PubMed.	Bio XML format
PMCID	PubMed Central® (PMC) is a free archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). Each record in the PubMed Central database is assigned a unique identifier: the PMCID. A PMCID is a number with the `PMC` prefix, e.g. `PMC165443`. You can enter a comma-separated list of PMCIDs, e.g. `PMC165443,PMC165213`, and each document will be uploaded. You can usually find this ID at the top of the document's page on PubMed Central. This feature relies on the availability of the PubMed provider.	Bio XML format

All input types can be imported through the user interface or the API.
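
The ID formats described in the table can be checked on the client side before a request is sent. The following is a minimal sketch, not part of tagtog itself: the function names `normalize_pmids` and `normalize_pmcids` are hypothetical, and stripping the `PMID` prefix is optional, since tagtog accepts both forms.

```python
import re


def normalize_pmids(raw: str) -> list[str]:
    """Split a comma-separated PMID list and strip the optional `PMID` prefix.

    Hypothetical helper: accepts `12781165`, `PMID12781165`, or a
    comma-separated mix of both, as described in the table above.
    """
    ids = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        match = re.fullmatch(r"(?:PMID)?(\d+)", token)
        if match is None:
            raise ValueError(f"not a PMID: {token!r}")
        ids.append(match.group(1))
    return ids


def normalize_pmcids(raw: str) -> list[str]:
    """Split a comma-separated PMCID list; each ID must carry the `PMC` prefix."""
    ids = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue
        if re.fullmatch(r"PMC\d+", token) is None:
            raise ValueError(f"not a PMCID: {token!r}")
        ids.append(token)
    return ids
```

For example, `normalize_pmids("PMID12781165, 25821226")` yields the two bare numbers, while `normalize_pmcids("165443")` raises a `ValueError` because the `PMC` prefix is missing.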

Files

You can import files to tagtog. The supported file types are listed below.

File extension	Description	Default `format`
`txt`	Any plain text file	`verbatim`
`md` (Markdown)	Any Markdown file, supporting a subset of the CommonMark spec. Go to documentation. Using Markdown you can also use tagtog blocks to build a customized annotation layout for your project! E.g. question answering datasets, chatbot training, tweets, etc.	`markdown`
`pdf`	Two variants are possible: Native PDF (supported on Cloud-Large and On-Premises ML only) to annotate directly on top of the PDF, and Simple to annotate on a stripped-out plain text representation of the PDF.	Native PDF format if native PDF is activated; Simple PDF format if native PDF is not activated.
`html`	Sections are not recognized. Currently, the text content is just stripped out.	HTML format
`csv` and `tsv`	Go to documentation.	CSV format or TSV format
source code files	Supported programming language extensions include: `.c, .coffee, .cpp, .cs, .css, .diff, .go, .h, .java, .js, .jsx, .less, .log, .m, .matlab, .mm, .patch, .php, .pl, .py, .python, .r, .rb, .sass, .scala, .sh, .shell, .sql, .swift, .ts, .tsx, .vb`	`markdown`
`xml`	NCBI Journal Publishing Tag Set (versions JATS 1.0 and NLM 2.x and 3.0). This includes all PLOS journals or F1000Research articles. BioMed Central format. This includes all articles in BioMed Central, ChemistryCentral, or SpringerOpen, among others.	Bio XML format

All input types can be imported through the user interface or the API.

The format is automatically recognized by the file extension; no other parameter is needed.
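
The extension-to-format mapping in the table above can be summarized as a small lookup. This is only an illustrative sketch of the table, not tagtog's own code: `default_format` and its `native_pdf` flag are hypothetical names, and the returned labels are the ones used in the "Default `format`" column.

```python
from pathlib import Path

# Source code extensions listed in the table above; they default to `markdown`.
SOURCE_CODE_EXTENSIONS = {
    "c", "coffee", "cpp", "cs", "css", "diff", "go", "h", "java", "js", "jsx",
    "less", "log", "m", "matlab", "mm", "patch", "php", "pl", "py", "python",
    "r", "rb", "sass", "scala", "sh", "shell", "sql", "swift", "ts", "tsx", "vb",
}

# Fixed defaults, labeled as in the "Default `format`" column.
FIXED_DEFAULTS = {
    "txt": "verbatim",
    "md": "markdown",
    "html": "HTML format",
    "csv": "CSV format",
    "tsv": "TSV format",
    "xml": "Bio XML format",
}


def default_format(filename: str, native_pdf: bool = False) -> str:
    """Return the default format label for a file name (hypothetical helper).

    `native_pdf` stands in for "native PDF is activated" in the table.
    """
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext == "pdf":
        return "Native PDF format" if native_pdf else "Simple PDF format"
    if ext in FIXED_DEFAULTS:
        return FIXED_DEFAULTS[ext]
    if ext in SOURCE_CODE_EXTENSIONS:
        return "markdown"
    raise ValueError(f"unsupported file type: {filename!r}")
```

For instance, `default_format("main.py")` gives `markdown`, and `default_format("paper.pdf", native_pdf=True)` gives the Native PDF label.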

Bundle files

File extension	Description
`tar.gz`	Gzip-compressed tarball. Bundle of files in accepted formats. Coming soon.
`zip`	Zip file. Bundle of files in accepted formats. Coming soon.

Input formats

If there is no format specified, the default format for the content imported is used.

In the API, use the `format` parameter to set the format. In the GUI, open the Advanced options under the upload panel to select a format. Either way, you explicitly force tagtog to represent the content in the selected format.

The available formats are listed below. Some are used when you import only content; others should be used when you import pre-annotated content. The latter are useful if you want to import documents that were annotated outside tagtog (for example, by your own machine learning model) or if you want to update the annotations of a specific document.

Content	Format	Description
Only content	`verbatim`	Parsed as already pre-formatted. No transformation at all is applied to the given content. This is ideal, for instance, for files that contain arbitrary indentation or white space. It creates a single block with the whole content. It is the simplest option if you are dealing with plain text or simple text files. Example.
	`markdown`	The content is expected to follow the markdown syntax. The content will be formatted and visualized as markdown (e.g. you can include images, different sections, lists, code blocks, etc.). Using this format you can also insert tagtog blocks to customize your annotation layout (e.g. question answering datasets, chatbot training, tweets, multi-column, etc.).
	`formatted`	The content is formatted and cleaned. For example, one content part is created for each paragraph. Ideal if your content has different discourse units, for example chatbot conversations. Up to tagtog version 3.2020-W30.1, this was the default mode when a user imported plain text. If you use the API to pre-annotate documents with annotations created in that period, please use `formatted-plus-annjson`. See below.
Pre-annotated content	`default-plus-annjson`	Use it if you are importing pre-annotated documents (content + `ann.json`) and you want the content to be recognized using the default format. For example to import pre-annotated native PDFs, plain text, or markdown files. Choose this option if you are not sure which format to use when sending pre-annotated documents. Example
	`verbatim-plus-annjson`	Analogous to `default-plus-annjson`, and complementary to the `verbatim` format. Use it if you are sending pre-annotated content (content + `ann.json`) and you want to force the content to be recognized with the `verbatim` format.
	`formatted-plus-annjson`	Analogous to `default-plus-annjson`, and complementary to the `formatted` format. Use it if you are sending pre-annotated content (content + `ann.json`) and you want the content to be recognized with the `formatted` format (instead of the default). Example
	`nativepdfv1-plus-annjson`	(DEPRECATED) Analogous to `default-plus-annjson`, and complementary to the `nativepdfv1` format. Use this format only to import annotations based on old Native PDFs, that is, Native PDFs uploaded to tagtog up to and including version 3.2020-W28.2. You can double-check which PDF parsing format was originally used in the `plain.html`.
	`anndoc`	Use the anndoc format to import a pre-annotated `plain.html` (`plain.html` + `ann.json`). Example.

In the GUI, it is not necessary to specify a pre-annotated format; tagtog automatically recognizes that you are importing content + annotations.
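
When using the API, the choice among the formats in the table can be sketched as a small decision function. This is a rough illustration under stated assumptions, not tagtog's implementation: `import_format` and `format_params` are hypothetical helpers, and only the `format` parameter name and the format values come from this page. The sketch rejects a forced `markdown` format for pre-annotated content simply because the table lists no dedicated value for it.

```python
from typing import Optional

# Formats for content without annotations, as listed in the table above.
CONTENT_FORMATS = {"verbatim", "markdown", "formatted"}

# Pre-annotated counterparts listed in the table (content + ann.json).
PLUS_ANNJSON = {
    None: "default-plus-annjson",
    "verbatim": "verbatim-plus-annjson",
    "formatted": "formatted-plus-annjson",
}


def import_format(
    content_format: Optional[str] = None,
    pre_annotated: bool = False,
    plain_html: bool = False,
) -> Optional[str]:
    """Pick a value for the `format` parameter (hypothetical helper).

    Returns None when the default format should be used, in which case
    the parameter can be omitted from the request.
    """
    if content_format is not None and content_format not in CONTENT_FORMATS:
        raise ValueError(f"unknown content format: {content_format!r}")
    if not pre_annotated:
        return content_format
    if plain_html:
        # A pre-annotated plain.html + ann.json pair uses `anndoc`.
        return "anndoc"
    if content_format not in PLUS_ANNJSON:
        raise ValueError(f"no pre-annotated format listed for {content_format!r}")
    return PLUS_ANNJSON[content_format]


def format_params(fmt: Optional[str]) -> dict:
    """Build the request parameters that carry the format, if any."""
    return {} if fmt is None else {"format": fmt}
```

For example, `format_params(import_format(pre_annotated=True))` produces `{"format": "default-plus-annjson"}`, the option the table recommends when you are not sure which format to use.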

Output formats

Find below the available output formats in tagtog. Some of these outputs are available through the GUI and others through the API.

Format	Description	Type	GUI	API
`ann.json`	This is the official format for annotations. It supports all the annotation tasks in tagtog. Documentation.	Only annotations	✅	✅
`entitiestsv`	Tabulated annotation format, with both plain content and annotations. It closely resembles the output by the Stanford NER tool. Documentation.	Content + annotations	✅	✅
`entitiesonlyclassestsv`	Tabulated annotation format, with both partial plain content and annotations. Similar to `entitiestsv`. The non-labeled text is not included, and it supports overlapping entities. Documentation.	Only annotations	✅	✅
`plain.html`, `html`, `xml`	This is the official representation of the content imported. Any piece of text/document you import to tagtog is converted to plain content and the annotation offsets always refer back to this format. No annotations provided within this format, only content. Documentation.	Only content	✅	✅
`txt`	Plain text. No annotations provided within this format, only content.	Only content	✅	✅
`orig`, `original`	The originally submitted file (e.g. the original html or pdf document that was imported to tagtog).	Only content	✅	✅
`visualize`	This is the default value. If the User Agent is a recognized browser, the web page is returned directly: `web` if tagtog project information was given, or `web-editor-only` if no tagtog project was given. Otherwise (typically when the User Agent is a command-line program), the `weburl` is returned.	Visualization	✅	✅
`web`	Visual representation of the document and its annotations on the tagtog web interface (HTML page).	Visualization	❌	✅
`web-editor-only`	Analogous to `web`, but without the information of a tagtog project, i.e., only the document editor layout. Use this output to integrate tagtog in your own web app with iframes.	Visualization	❌	✅
`weburl`	URL of the annotated document at tagtog web interface.	Visualization	❌	✅
`csv`	List of the project's documents and the status of their `master` (ground truth) annotation version. Currently, it only works with a search query (i.e. with the API parameter `search`).	Search	❌	✅
`null`	Special output to signify that no document output is desired. A JSON response of the request will be returned instead. For example, when importing a document: `{"ok": 1, "errors": 0, "items": [{"origid": "text", "names": ["text.txt"], "tagtogID": "aOM6EFIvULWc6J.7MAYQB3V2sF84-text", "result": "created"}], "warnings": []}`, where `ok` is the number of documents successfully changed, `errors` is the number of documents with errors, and `items` is the list of documents changed. You can use this parameter, for example, if you need the API to return the ID of each document imported.	Operation result	❌	✅
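
The JSON response returned for the `null` output can be read with a few lines of code to recover the ID of each imported document. A minimal sketch, assuming the response body has the shape shown in the table; `imported_ids` is a hypothetical helper name, and `SAMPLE_RESPONSE` reuses the example values from the table without its inline comments.

```python
import json

# Example response from the table above, with the inline comments removed.
SAMPLE_RESPONSE = """
{
  "ok": 1,
  "errors": 0,
  "items": [
    {
      "origid": "text",
      "names": ["text.txt"],
      "tagtogID": "aOM6EFIvULWc6J.7MAYQB3V2sF84-text",
      "result": "created"
    }
  ],
  "warnings": []
}
"""


def imported_ids(response_text: str) -> dict[str, str]:
    """Map each original ID to the tagtog ID reported in the response."""
    payload = json.loads(response_text)
    return {item["origid"]: item["tagtogID"] for item in payload.get("items", [])}


ids = imported_ids(SAMPLE_RESPONSE)
```

Here `ids` maps the original ID `text` to its tagtog ID, which can then be stored for later requests about the same document.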

All output formats are returned in their latest format versions. The output format versions cannot be chosen.