Semantically Tagged Glosses
Word forms from the definitions ("glosses") in WordNet's synsets are
manually linked to the context-appropriate sense in WordNet. Thus, the
glosses are a sense-disambiguated corpus, and WordNet version 3.0 is
the dictionary against which the corpus was annotated.
Release Contents
This release, once extracted, comprises three subdirectories:
/WordNet-3.0/glosstag/merged | WordNet glosses in merged format |
/WordNet-3.0/glosstag/standoff | WordNet glosses in standoff format |
/WordNet-3.0/glosstag/dtd | DTD describing the markup for the merged annotations |
When using this freely available resource, we ask that you refer to it
as the "Princeton WordNet Gloss Corpus."
Statistics
Tokenized text (word and collocation forms)
Types 47334
Tokens 1621129
Multi-word forms (globs)
man 7168
auto 45967
all 53135
Taggable lemmas (potential lemmas)
Types 55561
Tokens 1504077
Sense tags (sense keys on sense tags)
Kind Types Tokens
man 33862 339969
auto 26139 118856
all 59250 458825
Taggable tokens (word forms and globs)
Kind wf glob all
man 317812 12687 330499
auto 82238 36618 118856
un 202881 3830 206711
ignore 457502 0 457502
Key
wf word form
man manually-inserted sense tag or collocation
auto automatically generated sense tag or collocation
un taggable item that has not been tagged
ignore stoplist item
glob collocation/multi-word term
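As a quick consistency check on the figures above, the following minimal Python sketch re-adds the "Taggable tokens" table and derives two summary ratios. The counts are copied from the tables; the choice of denominator (tagged plus untagged items, excluding stoplist items) is an assumption made for illustration, not a definition taken from the release.

```python
# Counts copied from the "Taggable tokens" table above.
taggable = {
    "man":    {"wf": 317812, "glob": 12687, "all": 330499},
    "auto":   {"wf": 82238,  "glob": 36618, "all": 118856},
    "un":     {"wf": 202881, "glob": 3830,  "all": 206711},
    "ignore": {"wf": 457502, "glob": 0,     "all": 457502},
}

# Each row's "all" column should equal wf + glob.
for kind, row in taggable.items():
    assert row["wf"] + row["glob"] == row["all"], kind

# The glob column should add up to the "all" figure for multi-word forms.
glob_total = sum(row["glob"] for row in taggable.values())
assert glob_total == 53135

tagged = taggable["man"]["all"] + taggable["auto"]["all"]
untagged = taggable["un"]["all"]
# Assumption: stoplist ("ignore") items are left out of the denominator.
candidates = tagged + untagged

print(f"tagged tokens: {tagged}")
print(f"share tagged:  {tagged / candidates:.1%}")
print(f"share manual:  {taggable['man']['all'] / tagged:.1%}")
```

The row and column checks pass on the published numbers; the two ratios depend on the denominator assumption stated above.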
Disclaimer
While standoff annotation has many benefits, particularly the ability
to isolate annotations of choice, it is not a well-supported format.
Our standoff encoding is based heavily on the ANC format but is not
identical to it, as our markup is necessarily different. Therefore,
some tools that work with the ANC data may work with ours, but not
all. We are supplying the data in this format as a service to users
who are used to working with standoff annotations and who will build
software, or modify existing software, to work with it. We are not
supporting the ANC standoff annotation format, nor any software that
uses or manipulates it, nor are we providing any tools ourselves.
The standoff annotations do not contain more, or better, information
than the merged files. The annotations contained in them are identical
to the merged data, just expressed in a different way. If you have any
doubts about which format to use, use the merged files.
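To make the relationship between the two formats concrete, here is a toy Python sketch of the general idea: the same annotations stored inline with the tokens ("merged") or as offset records kept apart from the text ("standoff"). The tuple layout, the example gloss, and the sense labels are hypothetical and do not reproduce the markup in this release or the ANC format; only the kind labels (man, auto, un, ignore) come from the key above.

```python
# Toy illustration only: the data layout and sense labels are hypothetical.
text = "a domesticated feline mammal"

# "Merged" view: each token carries its annotation inline.
merged = [
    ("a", "ignore", None),
    ("domesticated", "man", "sense-A"),
    ("feline", "auto", "sense-B"),
    ("mammal", "un", None),
]

def to_standoff(tokens):
    """Turn inline annotations into (start, end, kind, sense) records."""
    records, pos = [], 0
    for form, kind, sense in tokens:
        records.append((pos, pos + len(form), kind, sense))
        pos += len(form) + 1  # skip the single space between tokens
    return records

def to_merged(raw_text, records):
    """Rebuild the inline view from the raw text plus standoff records."""
    return [(raw_text[s:e], kind, sense) for s, e, kind, sense in records]

standoff = to_standoff(merged)

# Round trip: both layouts carry the same information.
assert to_merged(text, standoff) == merged

# Standoff records make it easy to isolate one kind of annotation.
manual_only = [r for r in standoff if r[2] == "man"]
print(manual_only)
```

The round-trip assertion mirrors the point made above: neither layout holds more information than the other, and the difference lies only in how the annotations are attached to the text.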
Acknowledgment
This work was sponsored by ARDA/DTO through the AQUAINT Program.