--------------------------------------------------------------------------------
WordNet Gloss Disambiguation Project
First release 04/01/2008
CONTACT: Helen Langone, hlangone@princeton.edu
--------------------------------------------------------------------------------

Release contents
----------------

This release, once extracted, consists of three subdirectories:

   /WordNet-3.0/glosstag/merged   = WordNet glosses in merged format
   /WordNet-3.0/glosstag/standoff = WordNet glosses in standoff format
   /WordNet-3.0/glosstag/dtd      = DTD describing the markup for the merged
                                    annotations

/WordNet-3.0/glosstag/merged

   Contains WordNet glosses with all annotations combined in a single file
   per part of speech. There are four XML files in this directory, which
   group the glosses by WordNet part of speech.

   /merged/noun.xml
   /merged/verb.xml
   /merged/adj.xml
   /merged/adv.xml

   The XML data in these files conforms to the glosstag.dtd found in the /dtd
   directory in this release. See glosstag.dtd for detailed information about
   the encoding of the merged files.

/WordNet-3.0/glosstag/standoff

   Contains WordNet glosses annotated using standoff markup in XCES format.
   Annotations for sense tags and other markup are stored in documents
   separate from the gloss text.

   For the standoff annotation, the database of 117,659 WordNet glosses was
   split over 1,177 files of up to 100 synsets each. To make this more
   manageable, the files are organized into 12 subdirectories: the first 11
   hold 10 sub-subdirectories each, and the 12th holds the final 8.
   /standoff/00 - containing sub-subdirs 000-009
   /standoff/01 - containing sub-subdirs 010-019
   /standoff/02 - containing sub-subdirs 020-029
   /standoff/03 - containing sub-subdirs 030-039
   /standoff/04 - containing sub-subdirs 040-049
   /standoff/05 - containing sub-subdirs 050-059
   /standoff/06 - containing sub-subdirs 060-069
   /standoff/07 - containing sub-subdirs 070-079
   /standoff/08 - containing sub-subdirs 080-089
   /standoff/09 - containing sub-subdirs 090-099
   /standoff/10 - containing sub-subdirs 100-109
   /standoff/11 - containing sub-subdirs 110-117

   Each sub-subdir contains glosses and standoff annotations for up to 10
   subgroups of 100 synsets each. There are seven files for each subgroup of
   100 synsets.

   [prefix].txt         = the gloss text
   [prefix].wn          = the header file, which contains the XCES header for
                          the text
   [prefix]-wngloss.xml = annotations for gloss structure
   [prefix]-wnann.xml   = token-level annotations
   [prefix]-wnword.xml  = sense tag annotations for single-word forms
   [prefix]-wncoll.xml  = sense tag annotations for multi-word forms, treating
                          discontiguous spans as contiguous
   [prefix]-wndc.xml    = same as wncoll.xml, but using proposed markup for
                          annotating discontiguous spans

   The filename prefix consists of wsd- plus a 6-digit number: the
   sub-subdir name concatenated with a 3-digit number indicating the
   100-synset subgroup. For example, the first chunk of 100 synsets in the
   first sub-subdir (/000) has the prefix wsd-000000, the second chunk has
   the prefix wsd-000100, the third wsd-000200, and so on.

   The encoding for these files is detailed below.

   There are indexes mapping WordNet 3.0 synsets, sense keys, and terms to
   the filename prefix and the path to it. The indexes are found in

   /standoff/index.byid.tab
   /standoff/index.bysk.tab
   /standoff/index.bylem.tab
   /standoff/index.bylem.noun.tab
   /standoff/index.bylem.verb.tab
   /standoff/index.bylem.adj.tab
   /standoff/index.bylem.adv.tab

   All indexes are tab-delimited, mapping an item to one or more filename
   prefixes.
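The directory layout and prefix scheme described above can be sketched in Python. This is only an illustration, not a tool shipped with the release. The names chunk_location, chunk_files and parse_index_line are hypothetical. The sketch assumes that chunks are numbered consecutively from 0, that the 3-digit subgroup number continues the 000, 100, 200 pattern up to 900, and that each index line is an item followed by one or more tab-separated path+prefix values.

```python
# Illustrative sketch; function names and TOTAL_FILES are not part of the release.
TOTAL_FILES = 1177  # number of chunk files stated in this README

# The seven file suffixes listed for each 100-synset subgroup.
SUFFIXES = (".txt", ".wn", "-wngloss.xml", "-wnann.xml",
            "-wnword.xml", "-wncoll.xml", "-wndc.xml")


def chunk_location(file_index):
    """Return (path, prefix) for the chunk at file_index, counting from 0.

    Assumes ten chunks per sub-subdir and ten sub-subdirs per subdir.
    """
    if not 0 <= file_index < TOTAL_FILES:
        raise ValueError("file_index out of range")
    subsub, chunk = divmod(file_index, 10)
    path = f"{subsub // 10:02d}/{subsub:03d}/"      # relative to /standoff
    prefix = f"wsd-{subsub:03d}{chunk * 100:03d}"   # e.g. wsd-000100
    return path, prefix


def chunk_files(file_index):
    """Return the seven relative file names for one chunk."""
    path, prefix = chunk_location(file_index)
    return [path + prefix + suffix for suffix in SUFFIXES]


def parse_index_line(line):
    """Split one tab-delimited index line into (item, [path+prefix, ...])."""
    fields = line.rstrip("\r\n").split("\t")
    return fields[0], fields[1:]
```

For example, chunk_location(1) gives the path 00/000/ and the prefix wsd-000100, which matches the second chunk named in the text above.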
   The paths are relative to /standoff, e.g., 00/000/, 00/001/, etc. The
   basic file structure is:

   item<TAB>path+prefix[<TAB>path+prefix]*

   index.byid.tab maps a WordNet synset by its id (consisting of
   {n,v,a,r}+offset) to the path+prefix of the files containing annotations
   for that synset. Each synset id maps to exactly one path+prefix. E.g.,

   n00001740   00/000/wsd-000000
   n00001930   00/000/wsd-000000
   n00002137   00/000/wsd-000000

   index.bysk.tab maps a WordNet sense key to the path+prefix of the files
   containing annotations for the sense key's synset. Each sense key maps to
   exactly one path+prefix. E.g.,

   abandonment%1:04:01::             00/000/wsd-000100
   abandonment%1:04:02::             00/000/wsd-000300
   abdominoplasty%1:04:00::          00/000/wsd-000000
   abort%1:04:00::                   00/000/wsd-000000
   abscondment%1:04:00::             00/000/wsd-000100
   absence_without_leave%1:04:00::   00/000/wsd-000100
   absolute_space%1:03:00::          00/000/wsd-000000
   absolution%1:04:00::              00/000/wsd-000300

   index.bylem.tab maps WordNet terms to path+prefix. A "term" here is the
   WordNet synset term, lower-cased, with underscores between words (for
   multi-word terms). Since a term may appear in more than one synset, there
   is a one-to-many correspondence between term and path+prefix. E.g.,

   abandonment             00/000/wsd-000300   00/000/wsd-000100
   abdominoplasty          00/000/wsd-000000
   abort                   00/000/wsd-000000
   abscondment             00/000/wsd-000100
   absence_without_leave   00/000/wsd-000100
   absolute_space          00/000/wsd-000000
   absolution              00/000/wsd-000300
   abstract_entity         00/000/wsd-000000

   The index.bylem.{noun,verb,adj,adv}.tab indexes are identical in format to
   index.bylem.tab, and contain mappings for WordNet terms broken down by
   their part of speech.

Document Encoding
-----------------

The annotated disambiguated glosses are provided in both standoff and merged
formats. The former uses the XCES markup for standoff annotations, the XML
schemas for which are found at http://www.xces.org/schema/2003/.
While XCES is an emerging standard for representing corpus data, certain
aspects of the gloss annotations cannot be represented in its current version
(0.4). The gloss annotations contain markup for discontiguous spans of text,
as when a WordNet collocation (multi-word form) is interrupted by a word
(usually a conjunction, or, in the case of phrasal verbs, an object).
Examples of discontiguous collocations in the glosses are:

   personal or business relationship = personal_relationship,
                                       business_relationship
   canon and civil law               = canon_law, civil_law
   take a player out                 = take_out
   put oneself forward               = put_forward
   pay them off                      = pay_off
   bring something about             = bring_about

In this project we treat the body of glosses as a corpus; however, the data
also contains information relevant to the WordNet synset. This information
does not fit neatly within the XCES standard. For this reason, and because we
allow for the annotation of discontiguous collocations, the merged files are
encoded in a format specific to the project, described by glosstag.dtd.

The WordNet glosses are marked up with definition and example sentence
boundaries, and within these, text is tokenized and marked up with part of
speech, potential lemma forms, and a small set of semantic classes
(indicating that the token is punctuation, an abbreviation, an acronym, a
number, a year, a currency, or some kind of symbol). Collocations are
delimited, including markup to indicate discontiguous forms. Words and
collocations that have been disambiguated are further annotated with WordNet
sense keys.

Encoding of merged files
------------------------

Detailed information about the encoding of the merged files is found in
glosstag.dtd. The DTD also lists values for part of speech and other
attributes that are found in the standoff annotation files.

The following is a sample of the merged annotation for the noun synset at
offset 00003553. Line numbers are for reference.
[XML markup not preserved; only the text content of each numbered line is
shown]

    1
    2
    3  whole
    4  unit
    5
    6
    7  whole%1:03:00::
    8  unit%1:03:00::
    9
   10
   11  an assemblage of parts that is regarded as a single entity; "how big is that part compared to the whole?"; "the team is a unit"
   12
   13
   14  an assemblage of parts that is regarded as a single entity ; “ how big is that part compared to the whole ? ” ; “ the team is a unit ”
   15
   16
   17
   18  an
   19
   20  assemblage
   21  of
   22  parts
   23  that
   24  is
   25
   26
   27
   28  regarded
   29  as
   30  a
   31  single
   32
   33  entity
   34  ;
   35
   36
   37
   38  how
   39  big
   40  is
   41  that
   42  part
   43  compared
   44  to
   45  the
   46
   47  whole
   48  ?
   49
   50  ;
   51
   52
   53
   54  the
   55  team
   56  is
   57  a
   58
   59  unit
   60
   61  ;
   62
   63
   64

Lines 2 through 12 contain synset terms, sense keys, and the original text of
the gloss from the WordNet synset. Lines 13 through 15 contain the tokenized
gloss text, unannotated. Line 16 starts the annotations.

Automatic processing of the glosses delimited the definition text (lines 17
to 35) and two example sentences (lines 36 to 51 and 52 to 62), and tokenized
the text. Each token is wrapped in either <wf> or <cf>, where <wf> is a word
token and <cf> is a token that is part of a collocation. The <glob> tag
contains the main markup for a collocation, and also its sense tag, if
disambiguated. A child tag carries sense tag information for disambiguated
words (when a child of wf) and collocations (when a child of glob). Each
annotatable item has a unique id. Within example sentences, only the synset
terms were disambiguated.

Encoding of standoff files
--------------------------

In the standoff annotation files, the annotations are contained in separate
XML documents linked to the original gloss text, and also linked to the
merged files via ids. For each [prefix].txt document containing gloss text,
there are six XML files.

We took the standoff format used by the second release of the ANC as the
basis of our format, and modified it from there. In the ANC format, each
standoff annotation file is composed of a series of annotations consisting of
one or more features (zero or more, in our case).
An annotation is represented by a tag, and its features by child tags. Each
annotation specifies an edge between two nodes in the gloss text. Nodes are
located between characters in the text, so an edge between two nodes spans a
stretch of text. The following example is from the .txt file containing the
gloss text for the synset above:

3             3                   3                   4                   4
7             8                   9                   0                   1
3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
|a|n| |a|s|s|e|m|b|l|a|g|e| |o|f| |p|a|r|t|s| |t|h|a|t| |i|s| |r|e|g|a|r|d|e|d|

The following annotation delimits the token "assemblage" as an edge between
nodes 376 and 386.

   [XML sample not preserved]

Header ([prefix].wn)

   Contains information about the source data, and links to the annotation
   documents for a filename prefix. For example:

   [XML markup not preserved; only the descriptive text is shown]

   Document content
   WordNet glosses with definitions and examples delimited
   WordNet collocations (multi-word forms) and sense tags
   WordNet single-word forms and sense tags
   WordNet token-level annotations
   WordNet markup for discontiguous collocations and sense tags

Annotations for gloss structure ([prefix]-wngloss.xml)

   The annotations in this file delimit gloss definitions and example
   sentences. The following sample applies to the merged gloss above.

   [XML sample not preserved]

Token-level annotations ([prefix]-wnann.xml)

   The annotations in this file delimit tokens as wf (word forms), cf
   (collocation forms), punc (punctuation), or ignore (stoplist words). The
   features for each annotation indicate the attributes from the merged file
   for the token, as well as the token text. See glosstag.dtd for
   explanations of attribute values. The following sample applies to the
   tokenized definition in the merged example above.

   [XML sample not preserved]
Single-word form annotations ([prefix]-wnword.xml)

   The annotations in this file are for word forms that are disambiguatable
   (i.e., tokens that are not punctuation and are not part of a collocation).
   Sense keys for disambiguated word forms are indicated by a feature. The
   following sample applies to the merged gloss above.

   [XML sample not preserved]

Multi-word form annotations ([prefix]-wncoll.xml)

   The annotations in this file are for collocations. In order to conform to
   the current XCES standard, discontiguous collocations in this file are
   represented as contiguous, spanning any intervening text. Therefore, this
   file is both correct (conformant to the standard) and incorrect (some
   collocations are wrongly delimited). As for single-word forms, sense keys
   for disambiguated collocations are indicated by a feature. The following
   sample applies to the merged gloss above.

   [XML sample not preserved]

Discontiguous collocation annotations ([prefix]-wndc.xml)

   The annotations in this file are the same as those in [prefix]-wncoll.xml,
   except that discontiguous collocations are represented using the markup
   proposed by Keith Suderman and Nancy Ide in their paper "Layering and
   Merging Linguistic Annotations"
   (http://acl.ldc.upenn.edu/W/W06/W06-2716.pdf). Since the markup used in
   this file does not conform to the XCES standard, it will not validate
   under the schema, nor will it be usable with software written for the
   current standard. We believe it may be useful for users who wish to write
   or modify their own software in order to make use of the correctly
   delimited discontiguous collocations.
   The following sample is the representation of "neurological and visual
   disorders", where "neurological disorders" is discontiguous. The
   contiguous collocation "visual disorders" is encoded as in the
   [prefix]-wncoll.xml file. First, two pseudo nodes are created to reference
   the discontiguous parts (lines 37 and 38), and then an edge is created
   between the two pseudo nodes (lines 39 to 42).

   [XML sample not preserved; it covered lines 37 to 46 of the file]

   The corresponding merged markup is this:

   [XML markup not preserved; only the text content is shown]

   neurological and visual disorders

Character encoding
------------------

All .txt files within the standoff dirs are encoded as UTF-16. All other
files are UTF-8.

Disclaimer
----------

While standoff annotations have many benefits, particularly the ability to
isolate annotations of choice, the format is not well supported. Our standoff
encoding is based heavily on the ANC format, but is not identical to it, as
our markup is necessarily different. Therefore, some tools that work with the
ANC data may work with ours, but not all. We are supplying the data in this
format as a service to users who are used to working with standoff
annotations, and who will build or modify existing software to work with it.
We are not supporting the ANC standoff annotation format, nor any software
that uses or manipulates it, nor are we providing any tools ourselves.

The standoff annotations do not contain more, or better, information than the
merged files. The annotations contained in them are identical to the merged
data, just expressed in a different form. If you have any doubts about which
format to use, use the merged files.

--------------------------------------------------------------------------
WordNet Gloss Disambiguation Project Copyright (c) 2008 by Princeton
University. All rights reserved.