Input types
These are the types of content you can import to tagtog.
Input type | Description | Default format |
---|---|---|
Text | Plain text. | verbatim |
File | See below for the supported file types. | See below for the default format of each file type. You can import one or more files in a single request. |
URL | Web address pointing to any website (e.g. http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000245.v1.p1 ) or resource (e.g. https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf ). | The default format associated with the type of the file the URL points to (see below). For example, if the URL points to a text file or a PDF file, that file is imported to tagtog with the corresponding default format. If the URL points to an HTML file, the text is stripped from the HTML content and imported to tagtog using the default format for the html file type. |
PMID | PubMed is a free online database of references on life sciences. Each record in the PubMed database is assigned a unique identifying number, the PMID. A PMID is only a number, e.g. 12781165 . The prefixed form is also accepted, e.g. PMID12781165 . You can enter a comma-separated list of documents, e.g. 25821226,12781165 , and each of them will be uploaded. You can find this id at the bottom of the document at PubMed. | Bio XML format |
PMCID | PubMed Central® (PMC) is a free archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). Each record in the PubMed Central database is assigned a unique identifying number, the PMCID. A PMCID is a number with the PMC prefix, e.g. PMC165443 . You can enter a comma-separated list of documents, e.g. PMC165443,PMC165213 , and each of them will be uploaded. You can usually find this id at the top of the document at PubMed Central. This feature relies on the availability of the PubMed provider. | Bio XML format |
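
The PMID and PMCID rules described in the table can be checked on the client side before sending anything. The following is a minimal sketch, not tagtog's own validation: the function names are hypothetical, and tolerating spaces around the commas is an assumption made here for convenience.

```python
from __future__ import annotations

import re

# Per the table above: a PMID is only digits (an optional "PMID" prefix is
# also accepted); a PMCID is digits with the "PMC" prefix.
_PMID = re.compile(r"^(?:PMID)?(\d+)$")
_PMCID = re.compile(r"^PMC\d+$")


def split_ids(raw: str) -> list[str]:
    """Split a comma-separated id list, dropping blanks and extra spaces.

    Stripping whitespace is an assumption of this sketch, not documented
    tagtog behavior.
    """
    return [part.strip() for part in raw.split(",") if part.strip()]


def normalize_pmids(raw: str) -> list[str]:
    """Return bare numeric PMIDs; raise ValueError for anything else."""
    result = []
    for item in split_ids(raw):
        match = _PMID.match(item)
        if match is None:
            raise ValueError(f"not a PMID: {item!r}")
        result.append(match.group(1))
    return result


def validate_pmcids(raw: str) -> list[str]:
    """Return the PMCIDs unchanged; raise ValueError for anything else."""
    items = split_ids(raw)
    for item in items:
        if _PMCID.match(item) is None:
            raise ValueError(f"not a PMCID: {item!r}")
    return items
```

For instance, `normalize_pmids("PMID12781165")` yields the bare number, while passing a PMCID to it raises `ValueError`, which catches mixed-up lists early.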
Files
You can import files to tagtog. The supported file types are listed below.
File extension | Description | Default format |
---|---|---|
txt | Any plain text file | verbatim |
md (Markdown) | Any Markdown file, supporting a subset of the CommonMark spec. Go to documentation. Using Markdown you can also use tagtog blocks to build a customized annotation layout for your project! E.g. question answering datasets, chatbot training, tweets, etc. | markdown |
pdf | Two variants are possible: NativePDF (supported on Cloud-Large and On-Premises ML only) to annotate directly on top of the PDF, and Simple to annotate on a stripped out plain text representation of the PDF. | Native PDF format if native PDF is activated; Simple PDF format if native PDF is not activated |
html | Sections are not recognized. Currently, the text content is just stripped out. | HTML format |
csv and tsv | Go to documentation. | CSV format or TSV format |
source code files | Supported programming language extensions include: .c, .coffee, .cpp, .cs, .css, .diff, .go, .h, .java, .js, .jsx, .less, .log, .m, .matlab, .mm, .patch, .php, .pl, .py, .python, .r, .rb, .sass, .scala, .sh, .shell, .sql, .swift, .ts, .tsx, .vb | markdown |
xml | NCBI Journal Publishing Tag Set (versions JATS 1.0 and NLM 2.x and 3.0). This includes all PLOS journals or F1000Research articles. BioMed Central format. This includes all articles in BioMed Central, ChemistryCentral, or SpringerOpen, among others. | Bio XML format |
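
The table above can be read as a lookup from file extension to default format. The sketch below encodes it that way for illustration only; the function name and the `native_pdf` flag are hypothetical, and the returned strings are the labels used in the table, not necessarily the values the API expects.

```python
from pathlib import Path

# Extensions listed in the "source code files" row of the table.
SOURCE_CODE_EXTENSIONS = {
    "c", "coffee", "cpp", "cs", "css", "diff", "go", "h", "java", "js",
    "jsx", "less", "log", "m", "matlab", "mm", "patch", "php", "pl", "py",
    "python", "r", "rb", "sass", "scala", "sh", "shell", "sql", "swift",
    "ts", "tsx", "vb",
}

# Default format labels, copied from the table.
_DEFAULTS = {
    "txt": "verbatim",
    "md": "markdown",
    "html": "HTML format",
    "csv": "CSV format",
    "tsv": "TSV format",
    "xml": "Bio XML format",
}


def default_format(filename: str, native_pdf: bool = False) -> str:
    """Return the table's default format label for a file name.

    native_pdf stands for "native PDF is activated" (hypothetical flag).
    """
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext == "pdf":
        return "Native PDF format" if native_pdf else "Simple PDF format"
    if ext in SOURCE_CODE_EXTENSIONS:
        return "markdown"
    if ext in _DEFAULTS:
        return _DEFAULTS[ext]
    raise ValueError(f"unsupported file type: {filename!r}")
```

A call such as `default_format("paper.pdf", native_pdf=True)` returns the native PDF label, and an extension missing from the table raises `ValueError` rather than guessing.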
Bundle files
File extension | Description |
---|---|
tar.gz | tarball gzip. Bundle of files with accepted format. Coming soon. |
zip | zip file. Bundle of files with accepted format. Coming soon. |
Input formats
If there is no format specified, the default format for the content imported is used.
In the API, use the format parameter to set the format. In the GUI, open the Advanced options under the upload panel to select a format. Either way, you explicitly force tagtog to represent the content in the selected format.
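
As an illustration of the API case, the sketch below builds a request URL that only carries the format parameter when one is forced. Only the parameter name format comes from this page; the endpoint URL is a placeholder, and the project and owner parameter names are assumptions, so check the API documentation for the real values.

```python
from typing import Optional
from urllib.parse import urlencode

# Placeholder endpoint, assumed for illustration; not taken from this page.
API_URL = "https://tagtog.example/api/documents"


def build_import_url(project: str, owner: str, fmt: Optional[str] = None) -> str:
    """Build the query string for an import request.

    "project" and "owner" are assumed parameter names. When fmt is None,
    the "format" parameter is omitted, so the default format for the
    imported content applies.
    """
    params = {"project": project, "owner": owner}
    if fmt is not None:
        params["format"] = fmt  # explicitly force the given format
    return f"{API_URL}?{urlencode(params)}"
```

Leaving `fmt` unset mirrors the rule stated above: with no format specified, the default format for the imported content is used.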
The available formats are listed below. Some are used when you import only content; others should be used when you import pre-annotated content. The latter are useful if you want to import documents that were annotated outside tagtog (for example by your own machine learning model) or to update the annotations of a specific document.
Content | Format | Description |
---|---|---|
Only content | verbatim | Parsed as already pre-formatted. No transformation is applied to the given content. This is ideal, for instance, for files that contain arbitrary indentation or white space. It creates a single block with the whole content. It is the simplest option if you are dealing with plain text or simple text files. Example. |
Only content | markdown | The content is expected to follow the markdown syntax. The content will be formatted and visualized as markdown (e.g. you can include images, different sections, lists, code blocks, etc.). Using this format you can also insert tagtog blocks to customize your annotation layout (e.g. question answering datasets, chatbot training, tweets, multi-column, etc.). |
Only content | formatted | The content is formatted and cleaned. For example, one content part is created for each paragraph. Ideal if your content has different discourse units, for example chatbot conversations. Up to tagtog version 3.2020-W30.1 this was the default mode when a user imported plain text. If, using the API, you want to pre-annotate with annotations created in this period of time, please use |
Pre-annotated content | default-plus-annjson | Use it if you are importing pre-annotated documents (content + annotations), for example to import pre-annotated native PDFs, plain text, or markdown files. Choose this option if you are not sure which format to use when sending pre-annotated documents. |
Pre-annotated content | verbatim-plus-annjson | Analogous to |
Pre-annotated content | formatted-plus-annjson | Analogous to |
Pre-annotated content | nativepdfv1-plus-annjson | (DEPRECATED) Analogous to |
Pre-annotated content | anndoc | Use the anndoc format to import a pre-annotated plain.html (plain.html + ann.json ). Example. |
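
When sending pre-annotated documents from a script, it can help to pick the format name in one place. The sketch below is a hypothetical helper that assumes each -plus-annjson format pairs with the content format it is named after, and falls back to default-plus-annjson, which the table recommends when in doubt. The deprecated nativepdfv1-plus-annjson is left out on purpose.

```python
from typing import Optional

# Content formats with a "-plus-annjson" counterpart in the table above.
# The pairing by name is an assumption of this sketch.
_PLUS_ANNJSON = {
    "verbatim": "verbatim-plus-annjson",
    "formatted": "formatted-plus-annjson",
}


def preannotated_format(content_format: Optional[str] = None) -> str:
    """Pick the format name to use for pre-annotated content.

    Falls back to "default-plus-annjson" when no content format is given
    or when it has no dedicated counterpart (e.g. markdown).
    """
    if content_format is None:
        return "default-plus-annjson"
    return _PLUS_ANNJSON.get(content_format, "default-plus-annjson")
```

With this helper, `preannotated_format("markdown")` resolves to default-plus-annjson, in line with the table's note that this format covers pre-annotated markdown files.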
Output formats
Find below the available output formats in tagtog. Some of these outputs are available through the GUI and others through the API.
Format | Description | Type | GUI | API |
---|---|---|---|---|
ann.json | This is the official format for annotations. It supports all the annotation tasks in tagtog. Documentation. | Only annotations | ✅ | ✅ |
entitiestsv | Tabulated annotation format, with both plain content and annotations. It closely resembles the output by the Stanford NER tool. Documentation. | Content + annotations | ✅ | ✅ |
entitiesonlyclassestsv | Tabulated annotation format, with both partial plain content and annotations. Similar to entitiestsv . The non-labeled text is not included, and it supports overlapping entities. Documentation. | Only annotations | ✅ | ✅ |
plain.html , html , xml | This is the official representation of the content imported. Any piece of text/document you import to tagtog is converted to plain content and the annotation offsets always refer back to this format. No annotations are provided within this format, only content. Documentation. | Only content | ✅ | ✅ |
txt | Plain text. No annotations are provided within this format, only content. | Only content | ✅ | ✅ |
orig , original | The originally submitted file (e.g. the original html or pdf document that was imported to tagtog). | Only content | ✅ | ✅ |
visualize | This is the default value. If the User Agent is a recognized browser, the document's web page is returned directly: web if tagtog project information was given, or web-editor-only if no tagtog project was given. Otherwise (typically when the User Agent is a command line program), the weburl is returned. | Visualization | ✅ | ✅ |
web | Visual representation of the document and its annotations on the tagtog web interface (HTML page). | Visualization | ❌ | ✅ |
web-editor-only | Analogous to web , but without the information of a tagtog project, i.e. only the document editor layout. Use this output to integrate tagtog in your own web app with iframes. | Visualization | ❌ | ✅ |
weburl | URL of the annotated document at the tagtog web interface. | Visualization | ❌ | ✅ |
csv | List of the project's documents and the status of their master (ground truth) annotation version. Currently, it only works with a search query (i.e. with the API parameter search ). | Search | ❌ | ✅ |
null | Special output to signify that no document output is desired. A JSON response of the request is returned instead. You can use it, for example, when importing a document, if you need the API to return the id of each imported document. | Operation result | ❌ | ✅ |
All output formats are returned in their latest format versions. The output format versions cannot be chosen.
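
The rules in the table above can be summarized in a few lines of code. The sketch below is an illustration of the documented behavior, not tagtog's implementation: it resolves what the visualize output returns, and checks a requested output against the GUI/API columns and the csv restriction. The function names and arguments are hypothetical.

```python
# Availability columns from the table above.
GUI_AND_API = {
    "ann.json", "entitiestsv", "entitiesonlyclassestsv", "plain.html",
    "html", "xml", "txt", "orig", "original", "visualize",
}
API_ONLY = {"web", "web-editor-only", "weburl", "csv", "null"}


def resolve_visualize(is_browser: bool, has_project: bool) -> str:
    """Return the concrete output that "visualize" maps to."""
    if not is_browser:
        # E.g. a command line program: only the URL is returned.
        return "weburl"
    return "web" if has_project else "web-editor-only"


def check_output(output: str, via_gui: bool = False, has_search: bool = False) -> None:
    """Raise ValueError if the requested output is not usable as described."""
    if output not in GUI_AND_API | API_ONLY:
        raise ValueError(f"unknown output format: {output!r}")
    if via_gui and output in API_ONLY:
        raise ValueError(f"{output!r} is only available through the API")
    if output == "csv" and not has_search:
        raise ValueError("csv currently only works with a search query")
```

For example, `resolve_visualize(is_browser=False, has_project=True)` gives weburl, matching the case where the User Agent is a command line program.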