Data Formats
Prepare your input data
Upload one or more files containing documents to be annotated to the data folder.
We support multiple raw data file formats, including csv, tsv, json, and jsonl.
Each document needs, at minimum, a unique identifier and the body of the document.
You can find example data files here. We currently support the following document formats:
- Text: body is the document plaintext (example)
- Image, Video, or GIF: body is the filepath (example)
- Dialogue or a list of text: body is a comma-separated list of documents, and Potato will automatically display the list of text horizontally (example)
- Pairs of text displayed in separate boxes: body is a dictionary of documents (example)
- HTML files
- Best-Worst Scaling: body is a comma-separated list of documents to order (example)
- Custom Arguments: body is one of the above plus extra fields for whatever custom arguments you want to enter (example; in this, kwargs and other_kwargs are the custom endpoints for a Likert scale)
- Annotating Document A in context of Document B: body is document A plus an extra context field with the body of document B (example)
You can also use HTML tags to control how your text is displayed. In the match finding example project, HTML tags are used to create two separate boxes for the finding pairs.
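As a rough illustration of the body shapes listed above, the following Python sketch writes a small jsonl file containing one plain-text document, one list-style document, and one dictionary-style document. The file name, ids, and dictionary keys are made up for the example, and storing the list and dictionary bodies as JSON arrays and objects is an assumption; check the linked example files for the exact layout Potato expects.

```python
import json
import os
import tempfile

# Hypothetical documents; "id" and "text" are the key names used in the
# config example on this page.
documents = [
    {"id": "doc-1", "text": "This is a plain-text document."},
    {"id": "doc-2", "text": ["Hi, how are you?", "Fine, thanks."]},
    {"id": "doc-3", "text": {"A": "First passage.", "B": "Second passage."}},
]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a jsonl file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Written to a temporary directory here; in a real project the file
# would be placed in the data folder instead.
out_path = os.path.join(tempfile.mkdtemp(), "toy-example.jsonl")
write_jsonl(documents, out_path)
```

Reading the file back with read_jsonl(out_path) is a quick way to confirm that every line parses and that each record still carries its identifier and body.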
Displaying a list or a dictionary of instances
Potato can display a list or a dictionary of instances. To enable this, set list_as_text to True.
If you are using a list of instances, you can also choose whether to add a prefix to each item:
"list_as_text": {
    "text_list_prefix_type": "alphabet"
    # whether to add a prefix to each instance in the list
    # when creating displayed_text,
    # options:
    #   alphabet: add a letter before each instance (e.g. A.this is good, B.nice)
    #   number: add a number before each instance (e.g. 1.this is good, 2.nice)
},
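To make the two prefix options concrete, here is a minimal Python sketch that reproduces the formatting shown in the comments above. It is not Potato's implementation: the function name add_prefixes is hypothetical, and how Potato handles lists longer than 26 items with the alphabet option is not stated here, so this sketch simply raises an error in that case.

```python
import string


def add_prefixes(items, prefix_type=None):
    """Return a copy of items with "A." or "1." style prefixes added."""
    if prefix_type == "alphabet":
        labels = string.ascii_uppercase
        if len(items) > len(labels):
            # Assumption: the behavior past "Z" is unknown, so fail loudly.
            raise ValueError("more items than letters in the alphabet")
        return [f"{labels[i]}.{item}" for i, item in enumerate(items)]
    if prefix_type == "number":
        return [f"{i}.{item}" for i, item in enumerate(items, start=1)]
    # No prefix requested: return the items unchanged.
    return list(items)


print(add_prefixes(["this is good", "nice"], "alphabet"))
# ['A.this is good', 'B.nice']
print(add_prefixes(["this is good", "nice"], "number"))
# ['1.this is good', '2.nice']
```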
The content is displayed vertically by default. To display the list or dictionary horizontally, set horizontal to True:
"list_as_text": {
    "horizontal": True
},
To randomize the displayed content of a dictionary, use randomization. The order of the displayed content will then be shuffled, which helps avoid potential bias caused by ordering effects. You can access the displayed content in the annotated outputs.
"list_as_text": {
    "randomization": "value"  # whether to randomize the list or dictionary when creating displayed_text,
    # options:
    #   "value": only shuffle the values but keep the order of the keys
    #   "key": shuffle the order of keys
},
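The difference between the two randomization options can be sketched in Python as follows. This reflects one reading of the option descriptions above, not Potato's actual code: "value" is taken to mean the keys stay in place while the values are reassigned among them, and "key" is taken to mean each key keeps its own value while the display order changes. The function name shuffle_dict and the seed parameter are hypothetical.

```python
import random


def shuffle_dict(doc, mode, seed=None):
    """Return a shuffled copy of doc according to the chosen mode."""
    rng = random.Random(seed)
    keys = list(doc.keys())
    if mode == "value":
        # Keys keep their order; values are shuffled among them.
        values = list(doc.values())
        rng.shuffle(values)
        return dict(zip(keys, values))
    if mode == "key":
        # Each key keeps its value; only the order of the pairs changes.
        rng.shuffle(keys)
        return {key: doc[key] for key in keys}
    raise ValueError(f"unknown randomization mode: {mode!r}")


pair = {"A": "first text", "B": "second text", "C": "third text"}
by_value = shuffle_dict(pair, "value", seed=0)
by_key = shuffle_dict(pair, "key", seed=0)
```

Under this reading, "value" hides which original slot a text came from, so the displayed content recorded in the annotated outputs is what lets you map annotations back to the source texts.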
Update input data formats on the YAML config file
Pass the input data paths and field names into the YAML config file as follows (please make sure your data has the "id" and "text" keys):
# Pass in a comma-separated list of data files containing documents to be annotated in this task
"data_files": [
    "data/toy-example1.json",
    "data/toy-example2.json"
],
# Specify the field names containing the document unique identifier (id) and document body (text)
"item_properties": {
    "id_key": "id",
    "text_key": "text"
},
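Before launching a task, it can help to confirm that every record carries the fields named in item_properties and that the identifiers are unique. The following Python sketch shows one way to do that; the helper check_records is hypothetical and not part of Potato, and the sample records are made up.

```python
def check_records(records, id_key="id", text_key="text"):
    """Return a list of problems found; an empty list means no issues."""
    problems = []
    seen_ids = set()
    for index, record in enumerate(records):
        # Every document needs both the identifier and the body.
        for key in (id_key, text_key):
            if key not in record:
                problems.append(f"record {index}: missing key {key!r}")
        # Identifiers must be unique across the data.
        if id_key in record:
            if record[id_key] in seen_ids:
                problems.append(f"record {index}: duplicate id {record[id_key]!r}")
            seen_ids.add(record[id_key])
    return problems


sample = [
    {"id": "a", "text": "first document"},
    {"id": "a", "text": "second document"},  # duplicate id
    {"id": "b"},                             # missing text
]
problems = check_records(sample)
for problem in problems:
    print(problem)
# record 1: duplicate id 'a'
# record 2: missing key 'text'
```

If your data uses different field names, pass the same values you set for id_key and text_key in the config.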
Update output data preferences on the YAML config file
The output file will include each labeled document's id and annotations; the header will consist of the question and answer labels specified in the schema.
You need to specify a subdirectory of the annotation_output directory where files for each annotator should be placed. We support multiple output formats, including csv, tsv, json, and jsonl.
# Potato will write the annotation file for all annotations to this
# directory, as well as per-annotator output files and state information
# necessary to restart annotation.
"output_annotation_dir": "annotation_output/folder_name/",
# The output format for the all-annotator data. Allowed formats are:
# * jsonl
# * json (same output as jsonl)
# * csv
# * tsv
#
"output_annotation_format": "json",