Copyright © 2008 Chusslove Illich (Часлав Илић)
Last modification (repository rev. c2188bc583a4ccd1594a9ca75575b6c203f1b230).
This document informally describes the Divergloss XML format. It can be read as a tutorial and usage guide to Divergloss, with some design rationales provided along the way.
Divergloss is a work in progress, and should be considered experimental at this moment.
Divergloss, short for "Diversity Glossary", is an XML format for handling glossary data: succinct descriptions connected with the terms which name certain concepts. A necessary question to answer up front is: why yet another XML format for glossaries? For example, there is the TBX format ("TermBase eXchange") by LISA, widely used in the localization industry and submitted for adoption as an ISO standard. Another option is the glossary document type of the DocBook format. To answer this, we must consider the goals behind Divergloss:
Producers of glossaries should not need in-depth understanding of linguistics or markup languages. The glossary data comes from "ordinary" fields, rather than e.g. linguistic research or natural language AI.
Advanced glossary features are hidden when not needed, rather than requiring a lot of scaffolding for simple glossaries. This relates to both the format as such, and tools needed to validate and produce it.
The format should be easy for humans to read and edit. While special GUI applications can present glossary data in a nicer form for user consumption, the plain XML source should make intuitive sense.
The same glossary may be used in various contexts within the same language. There may be several terms naming the same concept, not only synonymously, but according to the user's environment. The same holds for the concept descriptions.
Glossaries should evolve by contributions of many people, and be updated according to various external sources of glossary data. A form of version control on the file level is assumed, but some intrinsic support for this process should be provided as well.
Conversely, Divergloss does not place much importance on the following:
Wide generality, with features for custom specialization according to the field and desired levels of representation depth. Instead, common needs are supported by many built-in tags in a rather flat hierarchy.
Strict adherence to the XML data representation practices, when these would impair readability and editability. For example, some records may be encoded as special strings, rather than long sequences of nodes and subnodes.
In essence, Divergloss favors simplicity and ease of getting involved, both in terms of cognitive effort and of needed tools, at the expense of generality and of formal correctness as viewed from a pure data-encoding standpoint. This is in contrast with the aforementioned TBX, which is intended as the end-all of glossary formats, and in practice is mostly handled through features of dedicated glossary or CAT (Computer-Aided Translation) tools. The DocBook glossary format, on the other hand, while offering similar simplicity, is too lightweight to cover all envisaged uses of Divergloss.
It is, however, possible that Divergloss could be made an instance of TBX, i.e. one of its XCS ("eXtensible Constraint Specification") customizations. This would facilitate the conversion of Divergloss into TBX glossaries, for use by existing CAT tools. Some sort of conversion to TBX is, of course, possible in any case, but no effort to that end has been undertaken as of yet.
Divergloss strives for language neutrality. There is no limitation on how many languages a single glossary document may support, or in precisely which fashion. This need is highly dependent on locale and field of use: a glossary of pre-Columbian civilizations for American users may get away with being fully monolingual, a glossary of oceanic freight for German users may state English terms too, while a glossary of astronomy for Tunisian users may need to be fully Arabic-French bilingual (both descriptions and terms).
Within each language, Divergloss supports another level of diversity: in different environments, e.g. within two large companies doing similar business, the same concepts may be named differently. An outsider may thus consider the terms synonymous, but a user within one of those environments should be primarily presented with the term native to it. "Fully" synonymous terms, in this sense, share both the language and the environment.
Aside from language and environment, each description and term can be equipped with a myriad of supplementary data. For example, the description may have an additional comment on the source of the text, while a base term may come with a few declensions which would be hard for the user to guess by the general rules of the language.
Bearing the previous passages in mind, the entries in Divergloss documents are organized by concepts, not linked a priori to any of the terms. Instead, each concept is referenced by a unique key, and contains any number of descriptions and terms. Here is the simplest example of a glossary with two terms in it:
<?xml version="1.0" encoding="UTF-8"?>
<glossary id="cosmogloss" lang="en">
  <metadata>
    <title>A Short Cosmic Glossary</title>
  </metadata>
  <keydefs>
    <languages>
      <language id="en">
        <name>English</name>
        <shortname>En.</shortname>
      </language>
    </languages>
  </keydefs>
  <concepts>
    <concept id="blackhole">
      <desc>
        The leftover core of a super massive star after a supernova,
        that exerts a tremendous gravitational pull.</desc>
      <term>black hole</term>
    </concept>
    <concept id="quasar">
      <desc>
        A distant energy source which gives off vast amounts of radiation,
        including radio waves and X-rays.</desc>
      <term>quasar</term>
    </concept>
  </concepts>
</glossary>
The glossary top node states the unique glossary identifier, by the id attribute. The default language used in the glossary is given by the lang attribute, which applies to all text in the document where not locally overridden. The metadata node contains general info about the glossary, like title, description, etc. All keys in a Divergloss glossary, such as the language values of the lang attributes, are defined by the keydefs node.
Glossary entries are grouped under the concepts node. Each concept node has an id attribute, a key which uniquely defines the concept. Each concept can contain any number of descriptions and terms, defined by desc and term nodes; each may override the default language using the lang attribute:
<concept id="blackhole">
  <desc>
    The leftover core of a super massive star after a supernova,
    that exerts a tremendous gravitational pull.</desc>
  <term>black hole</term>
  <term lang="fr">trou noir</term>
  <term lang="de">Schwarzes Loch</term>
</concept>
The term without a lang attribute is in English, which was stated as the default language, whereas the other two terms override the language to French and German. In a fully multilingual case, description nodes too could be stated in several languages in the same way.
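The language defaulting described above can be sketched in a few lines of Python. This is only an illustration using the standard xml.etree.ElementTree module, not the packaged dg module; the function name terms_by_language is hypothetical, and the sketch assumes that only the glossary-level lang attribute supplies the default.

```python
import xml.etree.ElementTree as ET

GLOSSARY = """\
<glossary id="cosmogloss" lang="en">
  <concepts>
    <concept id="blackhole">
      <term>black hole</term>
      <term lang="fr">trou noir</term>
      <term lang="de">Schwarzes Loch</term>
    </concept>
  </concepts>
</glossary>
"""

def terms_by_language(xml_text):
    """Map each concept key to a list of (language, term) pairs."""
    root = ET.fromstring(xml_text)
    default_lang = root.get("lang")  # glossary-wide default language
    result = {}
    for concept in root.iter("concept"):
        pairs = []
        for term in concept.findall("term"):
            # A local lang attribute overrides the glossary default.
            lang = term.get("lang", default_lang)
            pairs.append((lang, term.text.strip()))
        result[concept.get("id")] = pairs
    return result

print(terms_by_language(GLOSSARY))
```

A real client would also have to honor lang values set on intermediate nodes, which this sketch ignores.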
<concept id="directory">
  <desc>
    An entity in a file system which contains a group of files
    and other directories.</desc>
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>
Here, the first term, as before, defines no environment. The other two terms define an environment by use of the env attribute. The value of the env attribute is not a text that may be shown to the user, but an environment key. It must be defined elsewhere in the document, together with the proper environment name (see the environments node). As with language, it depends on the client how the environment information will be used when presenting concepts to the user. For example, if the user is reading the glossary on an Amiga, the application may show "drawer" as the primary term.
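How a client might pick the primary term for the user's environment can be sketched as follows. The function name primary_term and the fallback policy (use the first term without an env attribute) are assumptions made for illustration; the format itself leaves this decision to the client.

```python
import xml.etree.ElementTree as ET

CONCEPT = """\
<concept id="directory">
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>
"""

def primary_term(concept, user_env=None):
    """Return the term for user_env, else the first term with no env."""
    fallback = None
    for term in concept.findall("term"):
        # The env attribute may hold several whitespace-separated keys.
        envs = term.get("env", "").split()
        if user_env is not None and user_env in envs:
            return term.text.strip()
        if not envs and fallback is None:
            fallback = term.text.strip()
    return fallback

concept = ET.fromstring(CONCEPT)
print(primary_term(concept, "amiga"))  # term native to the Amiga environment
print(primary_term(concept, "beos"))   # unknown environment: generic term
```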
In the last example, the description text itself mentioned the word "directories". If the user is using an Amiga, shouldn't they see "drawers" instead? When this occurs in a multi-environment glossary, the descriptions too can be specialized by environment. Furthermore, the terms should be properly cross-referenced, so that the client may allow the user to click on a term in the description and go to the corresponding concept. Putting it all together, we get:
<concept id="directory">
  <desc>
    An entity in a <ref c="filesystem">file system</ref> which contains
    a group of <ref c="file">files</ref> and other directories.</desc>
  <desc env="mac">
    An entity in a <ref c="filesystem">file system</ref> which contains
    a group of <ref c="file">files</ref> and other folders.</desc>
  <desc env="amiga">
    An entity in a <ref c="filesystem">file system</ref> which contains
    a group of <ref c="file">files</ref> and other drawers.</desc>
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>
The reference node tag ref points to a concept key by its c attribute. If duplication of descriptions due to different terms by environment is expected to be frequent, the terse embedded selection can be used instead.
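Since every ref must point to an existing concept key, a simple consistency check is easy to sketch. The snippet below is an illustration with the standard library only; the function name dangling_refs and the short "file" concept are made up for the example, and this is not claimed to be how dgproc.py validates references.

```python
import xml.etree.ElementTree as ET

CONCEPTS = """\
<concepts>
  <concept id="directory">
    <desc>An entity in a <ref c="filesystem">file system</ref> which
    contains a group of <ref c="file">files</ref>.</desc>
    <term>directory</term>
  </concept>
  <concept id="file">
    <desc>A named unit of stored data.</desc>
    <term>file</term>
  </concept>
</concepts>
"""

def dangling_refs(concepts):
    """Return (concept key, missing key) pairs for unresolved references."""
    known = {c.get("id") for c in concepts.findall("concept")}
    missing = []
    for concept in concepts.findall("concept"):
        # iter() also finds ref nodes nested inside desc text.
        for ref in concept.iter("ref"):
            if ref.get("c") not in known:
                missing.append((concept.get("id"), ref.get("c")))
    return missing

# "filesystem" is referenced but never defined in this fragment.
print(dangling_refs(ET.fromstring(CONCEPTS)))
```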
Descriptions and terms may also credit the person who added them into the glossary. This person is an editor of the glossary, not necessarily the one who wrote the description, or coined the term. The editor's credit is assigned using the by attribute, for example:
<desc by="hal">
  An instruction by a superior factor which, if improperly formulated,
  can cause a sentient being to undertake unethical actions.</desc>
<term by="hal">order</term>
"hal" is the person key, possibly made out of the person's initials. These keys and the corresponding editors' real names and contact data are defined within the editors node.
A Divergloss document is divided into the following top-level sections:
<glossary id="glosskey" lang="xx" env="yyyy">
  <metadata>
    ...
  </metadata>
  <keydefs>
    ...
  </keydefs>
  <concepts>
    ...
  </concepts>
</glossary>
A particular combination of values of the id, lang, and env attributes should uniquely determine the glossary within the ecosystem of other published Divergloss glossaries. The lang and env attributes are not mandatory; if provided, their values apply to all subnodes where meaningful and not locally overridden.
The title of the glossary.
The description of the glossary.
The release version. There are no constraints or recommendations on the versioning scheme.
The release date. The format is YYYY-MM-DD regardless of the languages of the glossary; it is the client's duty to format it for presentation according to the user's locale.
All child nodes except the title are optional. The desc can have the attributes of language (lang) and environment (env), and can be repeated for unique combinations of those. The values of these attributes are keys defined by the keydefs node.
The main body of the glossary, concepts and the terms naming them, is given by the concepts node. This node is mandatory -- not much point in a glossary without a single concept.
Within metadata and concepts, many keyword-valued attributes may be used, which denote global data in the glossary: languages, environments, editors, topics, etc. These keywords are defined by the keydefs node, where the user-presentable info on them is also stated. This node is not mandatory, but will be needed for all but the simplest glossaries.
The concepts node can actually appear more than once, each instance containing the data as described previously. In this way the document can be chunked into files, such that each file remains valid XML. E.g. files containing groups of concepts, categorized in some way, can all start with their own concepts root element.
Concept nodes reside within the
concepts node of the glossary:
<glossary id="glosskey" lang="en">
  ...
  <concepts>
    <concept id="ckey1">
      ...
    </concept>
    <concept id="ckey2">
      ...
    </concept>
    ...
  </concepts>
  ...
</glossary>
The ordering of concepts is not important, and each concept key, given by the id attribute, must be unique. The keys are best chosen mnemonically, for easier crossreferencing.
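The uniqueness requirement on concept keys can be checked mechanically, also when the concepts are spread over several concepts nodes. The following is a minimal sketch with a hypothetical function name, duplicate_concept_keys, and a made-up glossary in which one key is deliberately repeated.

```python
import xml.etree.ElementTree as ET
from collections import Counter

GLOSSARY = """\
<glossary id="glosskey" lang="en">
  <concepts>
    <concept id="ckey1"/>
    <concept id="ckey2"/>
  </concepts>
  <concepts>
    <concept id="ckey2"/>
  </concepts>
</glossary>
"""

def duplicate_concept_keys(glossary):
    """Return the sorted concept keys that occur more than once."""
    # iter() walks all concept nodes, regardless of which concepts
    # node they belong to.
    counts = Counter(c.get("id") for c in glossary.iter("concept"))
    return sorted(key for key, n in counts.items() if n > 1)

print(duplicate_concept_keys(ET.fromstring(GLOSSARY)))
```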
The concept node may have the following optional attributes:
topic: A list of topic keys under which this concept may be grouped. The topic keys and names are defined within the topics node.
level: Usage level of the concept, as in basic, intermediate, advanced, etc. among the users of the terminology described by the glossary. The level value is a key, defined with the corresponding level name by the levels node.
related: Reference to closely related concepts, given by a list of concept keys.
An example of a concept node equipped with several attributes:
<concept id="saturnv" topic="apollo spacerace" level="newbie"
         related="n1 proton">
  <desc>
    A multistage liquid-fuel expendable rocket used by NASA's Apollo
    and Skylab programs. Popularly known as the Moon Rocket.</desc>
  <term>Saturn V</term>
</concept>
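Judging by the example above, list-valued attributes hold whitespace-separated keys. Under that assumption, reading them in Python could look like the sketch below; the function name concept_attributes and the shape of the returned dictionary are illustrative choices only.

```python
import xml.etree.ElementTree as ET

CONCEPT = """\
<concept id="saturnv" topic="apollo spacerace" level="newbie"
         related="n1 proton">
  <term>Saturn V</term>
</concept>
"""

def concept_attributes(concept):
    """Split the list-valued attributes of a concept into Python lists."""
    return {
        # Assumed: keys within one attribute are separated by whitespace.
        "topics": concept.get("topic", "").split(),
        "level": concept.get("level"),  # single key, None when absent
        "related": concept.get("related", "").split(),
    }

print(concept_attributes(ET.fromstring(CONCEPT)))
```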
All child nodes of the concept may have the attributes lang, env, and by. The client should use these attributes to decide, possibly based on the execution environment, how and which information to present to the user.
Furthermore, description and term nodes may have the src attribute, which unlike the by attribute, defines the source of the description or term: another glossary, a publication in the field, etc. The source value is a key, with the source data behind it defined by the sources node.
There can be one or several description nodes, or even none. When two descriptions have the same values of the lang and env attributes, they are to be considered different takes on the same explanation. Similarly, the concept may be named by one or several terms (with term nodes being detailed in their own section); but there may also be no terms, for concepts which are as yet unnamed. The following is a valid concept definition:
<concept id="alexq" topic="arch"> <desc>The quality without a name.</desc> </concept>
Other than descriptions and terms, the concept node can have the following child nodes:
Reference to external information about the concept, explaining the concept in more detail. The source of the information is pointed to using two attributes: the rel attribute states the relative path, while the root attribute is a key identifying the root of the path. The root keys and related names and data retrieval instructions are defined in the extroots node. The text content of the node, if non-empty, is a free-form remark (only for special cases, generally not necessary).
Non-text resource of value to the conveyance of the concept. This could be, for example, an image of the embodiment of the concept. The media file is pointed to with attributes just like for the
details node. The text content is the caption of the data, which can be empty, but should be provided nevertheless.
Information on when, where, how, and by whom the concept was originally formulated, introduced, or demonstrated.
Editor's comment on the concept. For example, doubts on the accuracy or wording of the description, topic qualification, etc. Several editors may add their own comments.
Each concept may be named by several terms, given by term child nodes of the concept node. In the simplest case of a traditional glossary, these terms would be synonymous. In a Divergloss glossary, clients should consider as synonymous only those terms with equal language and environment attributes.
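The synonymy rule can be illustrated by grouping terms on their (language, environment) pair. In the sketch below the function name synonym_groups is hypothetical, the extra terms "dir" and "Verzeichnis" are made up for the example, and the raw attribute strings are compared as a whole, which is a simplification when an attribute lists several keys.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

CONCEPT = """\
<concept id="directory">
  <term>directory</term>
  <term>dir</term>
  <term env="mac">folder</term>
  <term lang="de">Verzeichnis</term>
</concept>
"""

def synonym_groups(concept, default_lang):
    """Group terms by (language, environment); each group is a synonym set."""
    groups = defaultdict(list)
    for term in concept.findall("term"):
        lang = term.get("lang", default_lang)
        env = term.get("env", "")  # empty string: no environment stated
        groups[(lang, env)].append(term.text.strip())
    return dict(groups)

print(synonym_groups(ET.fromstring(CONCEPT), "en"))
```

Only "directory" and "dir" end up in the same group here, so only those two would be presented as synonyms.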
All attributes of the term node are optional, and are as follows:
lang: The language of the term. The value is a list of language codes, as defined by the languages node.
env: The environment in which this term is used. The value is a list of environment keys, as defined by the environments node.
by: The editor who added the term into the glossary. The value is a key defined by the editors node.
src: The source of the term: an organization, publication, another glossary, a person. The value is a list of source keys, as defined by the sources node.
gr: Any grammatical categories to which the term may belong. These can be, for example, gender for a noun, aspect for a verb, etc. The value is a list of category keys, defined by the grammar node.
Terms may sometimes need additional information, in which case the extended eterm node is used. It has the same attributes as the ordinary term node, but branches into child nodes, where the nominal form of the term is stated by the nom child node:
<eterm> <nom>phenomenon</nom> ... </eterm>
Other, optional child nodes of the extended term include:
Especially for inflected languages, it may be useful to know the stem of the nominal form of the term. Clients may use it for processing user queries into the glossary. Only one stem node is allowed.
A particular declension of the term: cases, genders, moods, etc. The declension category is stated by the mandatory gr attribute (like for the term node itself), which holds one of the grammar keys defined by the grammar node. There can be as many declension nodes as needed. Several declensions may be of the same grammar category, which means that they are all acceptable variations of that category.
Text describing the history behind the term. Optional attributes are by (the editor who added the text), src (the key of the source from which the information was obtained), as well as lang and env when differing from those of the term. There can be more than one origin node, possibly offering alternative views.
Editor's comment on the term. Optional attributes are by, stating the editor's key, and lang and env, in case the language or environment of the comment differ from those of the term. There can be several comments, by the same or different editors.
An example of a term with some of the extended data:
<eterm>
  <nom>phenomenon</nom>
  <decl gr="plu">phenomena</decl>
  <stem>phenomen</stem>
  <origin src="dictcom">
    From Greek "phainómenon", over Late Latin "phaenomenon",
    to appear.</origin>
</eterm>
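Reading the extended term into a plain data structure might look like the following sketch. The function name read_eterm and the returned dictionary layout are assumptions for illustration; declensions are collected per grammar key, since several declensions may share one category.

```python
import xml.etree.ElementTree as ET

ETERM = """\
<eterm>
  <nom>phenomenon</nom>
  <decl gr="plu">phenomena</decl>
  <stem>phenomen</stem>
</eterm>
"""

def read_eterm(eterm):
    """Collect nominal form, stem, and declensions of an extended term."""
    decls = {}
    for decl in eterm.findall("decl"):
        # Several declensions may carry the same grammar key.
        decls.setdefault(decl.get("gr"), []).append(decl.text.strip())
    stem = eterm.findtext("stem")  # at most one stem node is allowed
    return {
        "nom": eterm.findtext("nom").strip(),
        "stem": stem.strip() if stem is not None else None,
        "decl": decls,
    }

print(read_eterm(ET.fromstring(ETERM)))
```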
All the various keys used in concepts and terms are collected and defined within this node, by sections:
<keydefs>
  <languages>
    ...
  </languages>
  <environments>
    ...
  </environments>
  ...
</keydefs>
The keys themselves are always defined by the id attribute of the respective key definition node.
Each node that has a text value, within any of the key definition sections, may be equipped with lang and env attributes, and repeated for different combinations of them. Clients should use this info to select the text to present to the user as a description behind a particular key.
The key definition sections are as follows:
The languages used within the glossary, as applied by the
lang attribute. The definition of a language provides its full and short name:
<languages>
  <language id="en">
    <name>English</name>
    <shortname>En.</shortname>
  </language>
  <language id="fr">
    <name>French</name>
    <shortname>Fr.</shortname>
  </language>
  ...
</languages>
Language identifiers should follow the codes from ISO 639, when available. This is important for relating the language to a system locale, such that language-dependent processing (e.g. alphabetical sorting) may be correctly performed.
Usage environments for text content, as applied by the
env attribute. For each environment the full and short name are given, and the description of the environment:
<environments>
  <environment id="unix">
    <name>Unix</name>
    <shortname>U.</shortname>
    <desc>
      A computer operating system originally developed in 1969
      by a group of AT&amp;T employees...</desc>
  </environment>
  ...
</environments>
If the environment is also one of the concepts, note the difference between the description here and the concept description: the environment's description may provide info on the terminology aspects of the environment (if none are needed, the environment description may just point to the concept by a ref node).
An environment may specify terminology-wise close environments: if a term is not defined in the present environment, another from a close environment can be used as if it were its own. Close environments are specified by a list of environment keys in the closeto attribute of the environment node. The list order matters: the first environment is considered the closest, etc.
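The fallback through close environments can be sketched as below. The "linux" environment and the function name term_for_env are hypothetical, and the sketch follows only the direct closeto list of the requested environment, in order; whether closeness should be followed transitively is left open here.

```python
import xml.etree.ElementTree as ET

KEYDEFS = """\
<keydefs>
  <environments>
    <environment id="unix"><name>Unix</name></environment>
    <environment id="linux" closeto="unix"><name>Linux</name></environment>
  </environments>
</keydefs>
"""

CONCEPT = """\
<concept id="directory">
  <term env="unix">directory</term>
  <term env="mac">folder</term>
</concept>
"""

def term_for_env(concept, keydefs, env):
    """Find a term for env, falling back to its close environments."""
    close = {e.get("id"): e.get("closeto", "").split()
             for e in keydefs.iter("environment")}
    # Try the environment itself first, then close ones, closest first.
    for candidate in [env] + close.get(env, []):
        for term in concept.findall("term"):
            if candidate in term.get("env", "").split():
                return term.text.strip()
    return None

keydefs = ET.fromstring(KEYDEFS)
concept = ET.fromstring(CONCEPT)
print(term_for_env(concept, keydefs, "linux"))  # borrowed from "unix"
```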
Clients may sometimes need to pick one environment among others, or to order them in a certain way. Two additional attributes may be specified to guide clients in this. The meta attribute states that the environment is not a true environment (e.g. it may be an umbrella for several environments), and takes a truth value. The weight attribute specifies the environment's priority, in a case-dependent sense, and takes numbers from 0 to 9 as values (0 is the default).
People who are, or were at one point, adding and modifying the content of the glossary. These keys are applied using the
by attribute. An editor definition contains the name and short name (usually initials), email address, affiliation, and description:
<editors>
  <editor id="hjjr">
    <name>Henry Jones, Jr.</name>
    <shortname>IJ</shortname>
    <email>email@example.com</email>
    <affiliation>
      Barnett College, visiting professor</affiliation>
    <desc>
      Dr. Jones is an eminent archaeologist, who teaches
      at Barnett College in New York...</desc>
  </editor>
  ...
</editors>
Email address, affiliation and description are optional.
Sources which the editors use to assemble the glossary, as applied by the src attribute. A source can be just about anything: a publication, an institution, a person, etc. Each source defines its full and short name, description, email address, and a URL:
<sources>
  <source id="wp">
    <name>Wikipedia, the Free Encyclopedia</name>
    <shortname>Wp.</shortname>
    <url>http://en.wikipedia.org</url>
    <desc>
      A free, multilingual, open content encyclopedia project operated
      by the non-profit Wikimedia Foundation...</desc>
  </source>
  <source id="jbl">
    <name>J. Bigshot Linguist</name>
    <shortname>JBL</shortname>
    <email>firstname.lastname@example.org</email>
    <desc>
      Esteemed and prolific originator and commentator of many
      of the terms found within this glossary...</desc>
  </source>
  ...
</sources>
Email address and URL are optional.
The topics to which the concepts belong, as applied by the
topic attribute of the concept. The definition contains the full and short name, and a description:
<topics>
  <topic id="apollo">
    <name>The Apollo Program</name>
    <shortname>Apollo</shortname>
    <desc>
      The Apollo program was a human spaceflight program undertaken
      by NASA during the years...</desc>
  </topic>
  ...
</topics>
Usage levels applied to the concepts by the
level attribute. They are defined by the full and short name, and a description:
<levels>
  <level id="basic">
    <name>Basic Concepts</name>
    <shortname>basic</shortname>
    <desc>
      The concepts that every user should know about.</desc>
  </level>
  ...
</levels>
The description is optional.
Grammar categories for terms and declensions, as given by their
gr attributes. Each is defined by the full and short name, and a description:
<grammar>
  <gramm id="pl">
    <name>plural</name>
    <shortname>pl.</shortname>
    <desc>The plural form of the word.</desc>
  </gramm>
  ...
</grammar>
The description is optional.
External locations of more detailed information on the concept, as provided by the details child node of a concept. An external root is defined by its full and short name, description, the URL root to which relative paths are appended (given by the rel attribute in concepts), and a URL for manual browsing:
<extroots>
  <extroot id="rloc">
    <name>Local files</name>
    <shortname>loc.</shortname>
    <rooturl>file:///usr/share/thisgloss/data</rooturl>
    <desc>Files on local disk, installed by this glossary.</desc>
  </extroot>
  <extroot id="rwp">
    <name>Wikipedia</name>
    <shortname>Wp.</shortname>
    <rooturl>http://en.wikipedia.org/wiki</rooturl>
    <browseurl>http://en.wikipedia.org</browseurl>
    <desc>Links to articles on Wikipedia.</desc>
  </extroot>
  ...
</extroots>
The URL for manual browsing is not mandatory.
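To show how a root key and a relative path combine, here is a small sketch that resolves a details node into a full URL. The rel value "Black_hole", the function name details_url, and the choice to join the two parts with a single slash are assumptions of this example; the text above only says that relative paths are appended to the root URL.

```python
import xml.etree.ElementTree as ET

EXTROOTS = """\
<extroots>
  <extroot id="rwp">
    <name>Wikipedia</name>
    <rooturl>http://en.wikipedia.org/wiki</rooturl>
  </extroot>
</extroots>
"""

# Hypothetical details node, as it might appear inside a concept.
DETAILS = '<details root="rwp" rel="Black_hole"/>'

def details_url(details, extroots):
    """Append the relative path of a details node to its root URL."""
    roots = {r.get("id"): r.findtext("rooturl").strip()
             for r in extroots.findall("extroot")}
    root_url = roots[details.get("root")]  # KeyError on an undefined root
    # Assumed joining rule: exactly one slash between root and path.
    return root_url.rstrip("/") + "/" + details.get("rel").lstrip("/")

print(details_url(ET.fromstring(DETAILS), ET.fromstring(EXTROOTS)))
```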
Some nodes may contain larger bodies of text, where additional markup is advantageous (e.g. referencing). Such nodes are desc, origin, etc. The following markup can be applied within the text contents of such nodes:
A reference to a concept defined within the glossary. It wraps a phrase indicative of the concept, and points to a concept using the
c attribute (the value being the key of the concept).
An emphasis on a word or a phrase.
A word or a phrase in another language, as opposed to that of the text. Must have a
lang attribute stating the language of the phrase. An optional attribute is
wl, which if present indicates that the short language name should be formatted together with the phrase (its value must be one of
Link to an external resource. The URL of the resource is given by the
url attribute, which is mandatory.
Although glossary texts should be kept short and to the point, sometimes the text content could still be long enough to warrant splitting into several paragraphs, and other higher level groupings. All nodes which could reasonably benefit from such structure have an l* variant, which contains structured text content:
<ldesc>
  <para>
    A huge cloud which is thought to surround our solar system
    and reach over halfway to the nearest star.</para>
  <para>
    Comets originate in the Oort cloud.</para>
</ldesc>
Such nodes include ldesc and lorigin. These can be used everywhere instead of their simpler counterparts. For the moment, the only structuring element is the paragraph (the para node), but more may be introduced in the future.
Sometimes a lot of text may need duplicating due to a single phrase in it differing across environments, typically in description nodes -- see an earlier example. To prevent this duplication, clients will support special embedded text selection by environment. Using embedded selection, the mentioned example can be rewritten as:
<concept id="directory">
  <desc>
    An entity in a <ref c="filesystem">file system</ref> which contains
    a group of <ref c="file">files</ref> and other
    ~directories|mac:folders|amiga:drawers~.</desc>
  <term>directory</term>
  <term env="mac">folder</term>
  <term env="amiga">drawer</term>
</concept>
i.e. the embedded selector is of the form ~env1:phrase1|env2:phrase2|...~, where if one of the environment keys is empty or omitted (as in the example), that phrase inherits the surrounding text's environment. Instead of a single environment key, a whitespace separated list can also be given. The tilde character (~) cannot be a part of ordinary text by itself, but it can be escaped by doubling it (~~). This kind of special-form selection is unusual by XML standards, but has been introduced due to being more human-readable and editable in the running text than e.g. a selection node with subnodes per environment: <select><for env="env1">phrase1</for><for env="env2">phrase2</for>...</select>.
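The selector syntax is simple enough to resolve with one regular expression. The following is a rough sketch, not the parser used by the packaged tools: the function name resolve_selectors is hypothetical, the phrase without an environment key is treated as the fallback for any environment that has no phrase of its own, and escaped tildes inside a selector, as well as colons in a key-less phrase, are not handled.

```python
import re

# Either an escaped tilde (~~) or a selector ~...~ with no tilde inside.
_TOKEN = re.compile(r"~~|~([^~]*)~")

def resolve_selectors(text, env=None):
    """Replace embedded selectors in text by the phrase chosen for env."""
    def pick(match):
        if match.group(0) == "~~":
            return "~"  # doubled tilde stands for a literal tilde
        default = None
        for alternative in match.group(1).split("|"):
            keys, sep, phrase = alternative.partition(":")
            if not sep:
                # Key omitted: the whole alternative is the phrase.
                keys, phrase = "", alternative
            if not keys.strip():
                default = phrase
            elif env is not None and env in keys.split():
                return phrase
        return default if default is not None else ""
    return _TOKEN.sub(pick, text)

TEXT = "a group of files and other ~directories|mac:folders|amiga:drawers~."
print(resolve_selectors(TEXT))
print(resolve_selectors(TEXT, "amiga"))
```

Since the environment key part is split on whitespace, an alternative such as "mac amiga:folders" would apply to both environments.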
Divergloss is distributed in a package which, aside from the format definition and documentation, contains command-line tools for processing Divergloss glossaries into various end-user formats, and requires minimal installation fuss. This enables users to quickly start writing glossary data and putting it to practical use.
$ git clone git://gitorious.org/divergloss/mainline.git
This will create the directory mainline/ with the complete repository. In it there will be the README file with short setup instructions. The repository can later always be updated to the newest version by issuing:
$ cd mainline/
$ git pull
In the package there is a Python module,
dg, which provides easy access to glossary content and functionality frequently needed for manipulating glossary data. While e.g. XSLT is very succinct for straightforward mappings of XML data, building glossary outputs (among other things) may be much more demanding than that, and therefore more easily tackled with a general purpose programming language such as Python. Not the least is Python's ease of use and rich variety of modules, which makes any special processing of glossaries that much more viable.
The packaged dgproc.py script is one immediate user of the
dg module. It operates by pushing Divergloss files through sieves, which build outputs and perform other operations on the glossary. In the basic mode, when run with the glossary file as the single argument, dgproc.py will validate the glossary, reporting also the problems not discoverable by DTD validation. If the glossary file is
gloss.xml, then executing:
$ dgproc.py gloss.xml
will give no output if the glossary is technically valid.
The list of applicable sieves may be seen by issuing the
-S) option. For example, if the glossary contains terms in English (en) and German (de), the html-bidict sieve may be used to create an embeddable HTML dictionary table, with collapsible concept descriptions:
$ dgproc.py html-bidict gloss.xml -solang:en -stlang:de -sfile:gloss.html
The -s... options pass sieve parameters. Or, to create a TBX glossary file for use in tools that can make use of it (e.g. a translation editor may automatically issue terminology recommendations):
$ dgproc.py tbx gloss.xml -sfile:gloss.tbx
List of parameters for each sieve may be seen by following the sieve name with the
-H). Each sieve is described in more detail in the
dg.sieve module documentation contained in the package.
The following sources were used when making up the examples: