
This is the unity library, which is able to parse scientific unit
specifications using a variety of syntaxes.

Version: @DIST@

The code is made available under the terms of the 2-clause BSD licence.
See the file LICENCE, in the distribution, for the terms.

The source is on Bitbucket: <https://bitbucket.org/nxg/unity>.
There's an [issues list there][issues], and bug reports are of course very welcome.

Goals
-----

The library was written with the following goals:

  * Producing formal grammars of the existing and proposed standard unit
    syntaxes, for reference purposes.

  * Participating in the [VOUnits standardisation process][vounitsstd]
    by acting as a locus for experimentation with syntaxes and
    proposed standards.

  * To that end, discovering edge cases and producing test cases.

  * Producing parsing libraries which are fast and standalone, and so
    could be conveniently used by other software; this will act as an
    implementation of the eventual VOUnits standard.  The distribution
    is buildable using no extra software beyond a Java and a C
    compiler.  It is also a goal that the distributed Java is
    source-compatible with Java 1.5 (though this isn't automatically
    tested, so bug reports are welcome).

It is not a goal for this library to do any processing of the
resulting units, such as unit conversion or arithmetic.  Parsing unit
strings isn't a deep or particularly interesting problem, but it's
more fiddly than one might expect, and so it's useful for it to be
done properly once.

A major goal of the VOUnits process is to identify a syntax for unit
strings (called 'VOUnits') which is as nearly as possible in the
intersection of the various existing standards.  The intention is that
if file creators target the VOUnits syntax, the resulting string has
a chance of being readable by as many other parsers as possible.  This
isn't completely possible (the OGIP syntax doesn't allow dots as
multipliers), but we can get close.

Outputs
-------

**Yacc grammars for the three well-known syntaxes, plus a proposed 'VOUnits'
grammar**  These are consistent, in the sense that any string
which parses in more than one of these grammars means the same thing in each
case (ignoring questions of per-syntax valid units).  The 'VOUnits' syntax
is almost in the intersection of the three, in the sense that anything which
conforms to that grammar will parse (and mean the same thing) in the others.
The only exception is that the OGIP grammar uses '*' for multiplication, where the
others accept '.', but that could be got around with a fairly simple character
substitution.  That is, a unit string written out in the VOUnits syntax can be
read by almost anything.
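
The character substitution mentioned above might look something like the
following.  This is only a sketch, in Python, and is not part of the library:
the function name `dots_to_stars` is hypothetical, and it assumes that the only
dots which are _not_ multiplication signs are decimal points sitting between
two digits in a numeric factor.  Reading the string in one syntax and writing
it in another with the library itself is the more robust route.

```python
import re

# Assumption: a '.' is a multiplication sign unless it has a digit on
# both sides, in which case it is taken to be a decimal point.
_MULTIPLY_DOT = re.compile(r"(?<!\d)\.|\.(?!\d)")


def dots_to_stars(unit_string):
    """Replace multiplication dots with '*' (hypothetical helper)."""
    return _MULTIPLY_DOT.sub("*", unit_string)


print(dots_to_stars("kg.m.s-2"))
print(dots_to_stars("0.1nm.s-1"))
```

The lookarounds leave `0.1` alone, but still replace the dot in `m2.s`, where
the digit before the dot is a power rather than part of a decimal number.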

**Parsers in multiple languages** This distribution includes parsers in Java and
C.  Thus the core content here is demonstrably language-agnostic.  Python would
be an obvious next language.

**Test cases** There are more than 200 test cases, of which between 130 and 200
apply to each syntax.

**A collection of 'known' units** There are multiple collections of
these in circulation in different libraries, but this library gathers
them together and generates per-language lookups of the information.
The [VOUnits document][vounits] discusses the various compromises
necessary here.

Parsing units and prefixes
--------------------------

The grammars defined by this library do not cover the parsing of unit prefixes
since (as it turns out) this cannot be usefully done at this level, and the
grammars identify, in the terminal `STRING`, only the combination of
prefix+unit.  These are subsequently parsed in the following manner:

  1. if the whole string is a 'known unit' then this is the base unit (so
     'pixel' is recognised in some of the syntaxes as a unit, and a 'Pa' is a
     pascal and not a peta-year);

  2. or if the first character in the `STRING` is one of the SI prefixes and
     the string is longer than one character (or the first two characters are
     'da' and the string is longer than two), then that's a prefix and the rest
     of the string is the unit (so 'pixe' would be parsed as a pico-ixe);

  3. or else the whole thing is a unit (so 'wibble' is an unknown unit called
     the 'wibble', 'm' is the metre and not a milli-nothing, but 'furlong' would
     be a femto-urlong).

That is, validity checking – checking whether this is an allowed unit,
or whether it's allowed to have an SI prefix – happens at a later stage
from the parsing of the units string and only on request, since it's
essentially an auxiliary parse.  This (a) avoids the cumbersomeness of
doing this check earlier, (b) separates the _grammatical_ error of having a star
in the wrong place from the stylistic or semantic error of using an
inappropriate unit, and (c) retains the freedom to use odd
units if someone really wants to.

The library also recognises the binary prefixes (kibi, mebi, and so on) of ISO/IEC 80000-13.

Summary:

  * `pixel` --> 'pixel' in the FITS and OGIP syntaxes, the pico-ixel in CDS
  * `furlong/pixe` --> femto-urlong per pico-ixe
  * `m` --> metre in all syntaxes
  * `mm` --> millimetre
  * `dam` --> dekametre (not the deci-`am`)
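
The three-step rule above can be sketched as follows.  This is an illustration
in Python, not the library's own code or API: the names `split_prefix` and
`SI_PREFIXES`, and the small sets of 'known units', are hypothetical, and
testing for 'da' before the single-character prefixes is an assumption made so
that `dam` comes out as in the summary.

```python
# Single-character SI prefixes; 'da' is handled separately below.
SI_PREFIXES = {
    "y": "yocto", "z": "zepto", "a": "atto", "f": "femto", "p": "pico",
    "n": "nano", "u": "micro", "m": "milli", "c": "centi", "d": "deci",
    "h": "hecto", "k": "kilo", "M": "mega", "G": "giga", "T": "tera",
    "P": "peta", "E": "exa", "Z": "zetta", "Y": "yotta",
}


def split_prefix(token, known_units):
    """Return (prefix, unit) for one STRING token; prefix is '' if none."""
    # Step 1: a known unit is taken whole, so 'Pa' is the pascal.
    if token in known_units:
        return ("", token)
    # Step 2: 'da' is tried first (an assumption), so 'dam' is deka-m.
    if token.startswith("da") and len(token) > 2:
        return ("da", token[2:])
    if token[:1] in SI_PREFIXES and len(token) > 1:
        return (token[0], token[1:])
    # Step 3: otherwise the whole token is a (possibly unknown) unit.
    return ("", token)


# Hypothetical subsets of the per-syntax 'known unit' tables.
fits_units = {"m", "s", "Pa", "pixel"}
cds_units = {"m", "s", "Pa"}

print(split_prefix("pixel", fits_units))   # ('', 'pixel')
print(split_prefix("pixel", cds_units))    # ('p', 'ixel')
print(split_prefix("furlong", fits_units)) # ('f', 'urlong')
print(split_prefix("dam", fits_units))     # ('da', 'm')
```

Note that the function never rejects anything: whether 'urlong' is an
acceptable unit is a question for the separate validation step.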

Notes
=====

The recognised syntaxes are:

* **fits**
    FITS v3.0, section 4.3, W.D. Pence et al., A&A 524, A42, 2010.
    [doi:10.1051/0004-6361/201015362][fits]

* **ogip**
    [OGIP memo OGIP/93-001, 1993][ogip]

* **cds**
    [Standards for Astronomical Catalogues, Version 2.0, section 3.2, 2000][cds]

* **vounits**
    The VOUnits syntax.  This is a subset of the FITS syntax,
    specified by the [(still-draft) VOUnits specification][vounitsstd].

The grammars are available in `src/grammar/unity.y`.  Note that this
file is pre-processed before it is fed into a parser generator, and
isn't a valid yacc file as it stands; see the relevant targets in
`src/java` and `src/c`.

The grammars are implemented by (at present) two libraries, one in C
and one in Java.  Each of these generates its parsers directly from
the grammars.  See `src/c/docs` and `src/java/docs` for documentation.

The Java implementation has, and will probably continue to have, more
functionality than the C one.

Each of the implementations supports reading and writing each of the
grammars, plus LaTeX output (supported by the LaTeX siunitx package).

The main testcases – a set of unit strings and the intended parse
results – are in `src/grammar/testcases*.csv`.  There are also
library-specific unit tests within the source trees.

If you want to experiment with the library, build `src/c/unity`:

    % ./unity -icds -oogip 'mm2/s'
    mm**2 /s
    % ./unity -icds -ofits -v mm/s
    mm s-1
    check: all units recognised?           yes
    check: all units recommended?          yes
    check: all units satisfy constraints?  yes
    % ./unity -ifits -ocds -v 'merg/s'
    merg/s
    check: all units recognised?           yes
    check: all units recommended?          no
    check: all units satisfy constraints?  no
    % ./unity -icds -ofits -v 'merg/s'
    merg s-1
    check: all units recognised?           no
    check: all units recommended?          no
    check: all units satisfy constraints?  yes

In the last three cases, the `-v` option _validates_ the input string
against various constraints.  The expression `mm/s` is completely valid
in all the syntaxes.  In the FITS syntax, the erg is a recognised
unit, but it is deprecated; although it is recognised, it is not
permitted to have SI prefixes.  In the CDS syntax, the erg is neither
recognised nor (a fortiori) recommended; since there are no
constraints on it in this syntax, it satisfies all of them (this
last behaviour is admittedly slightly counterintuitive).
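
If you want to drive the command-line tool from a script, something like the
following Python sketch may be a starting point.  It is not part of the
distribution: the function names are hypothetical, it uses only the `-i`, `-o`
and `-v` options shown above, and it assumes that the tool is at `./unity` and
that both the converted string and the `check:` lines appear on standard output
in the layout of the transcript above.

```python
import subprocess


def unity_command(expression, in_syntax, out_syntax, validate=False,
                  exe="./unity"):
    """Build the argument list for one invocation of the unity tool."""
    cmd = [exe, "-i" + in_syntax, "-o" + out_syntax]
    if validate:
        cmd.append("-v")
    cmd.append(expression)
    return cmd


def parse_output(text):
    """Split the output into (converted string, {check question: bool})."""
    converted = None
    checks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("check:"):
            # e.g. 'check: all units recognised?    yes'
            question, _, answer = line[len("check:"):].rpartition("?")
            checks[question.strip()] = (answer.strip() == "yes")
        elif converted is None:
            converted = line
    return converted, checks


def run_unity(expression, in_syntax, out_syntax, validate=False):
    """Run the tool (assumed to be at ./unity) and parse what it prints."""
    result = subprocess.run(
        unity_command(expression, in_syntax, out_syntax, validate),
        capture_output=True, text=True, check=True)
    return parse_output(result.stdout)
```

For example, `run_unity("merg/s", "fits", "cds", validate=True)` would, under
those assumptions, return the string `merg/s` together with a dictionary
recording which of the three checks passed.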


Portability
-----------

The library builds on OS X (tested on 10.6 and 10.8), on Scientific
Linux, on Ubuntu, and on OpenBSD (with all checks on).  I don't
systematically test on all these platforms, however.  I have as yet
made no serious attempt to port the library more broadly, but I don't
anticipate problems.  Reports of success or failure, and fixes, are
both welcome ([issues list][issues]).

The Java implementation is source-compatible with Java 1.5, and
unity.jar is built to be compatible with a 1.5 JRE.


Building
---------

The usual:

    % ./configure
    % make
    % make check
    % make install

The build process requires GNU make (as opposed to BSD make).

Pre-requirements: distribution tarball
--------------------------------------

**No library dependencies.**  To build from a distribution, the only
pre-requirements are a C compiler and a JDK (1.5 or later).  You can
build either or both of the C and Java libraries, at your option
(eg `cd src/c; make check`).

If the JUnit jar is in the CLASSPATH, then `make check` will run more tests
than if it's absent.

Pre-requirements: repository checkout
-------------------------------------

To build from a source checkout, retrieved from
<https://bitbucket.org/nxg/unity>,
you need to download, or have installed, rather more.

Before doing anything else, run `./bootstrap` (which runs autoconf and
autoheader).  This generates the `./configure` script, as above.

Tools:

  * autoconf
  * bison or byacc or byaccj (original yacc might work), and flex or lex
  * Java
  * [byaccj][]
  * [doxygen][] and [graphviz][] if you wish to build the C documentation

Make sure these are all in the path before configuring (for example by setting
PATH as one of the `./configure` arguments).  Byaccj is required to
generate the Java parsers; it will also work for generating the C
parsers, if bison happens not to be present.

Some of the source code is generated using a Java program,
therefore you do need Java to be present,
even if you only want to build the C library.

Java dependencies:

  * [jflex][]
  * [JUnit4][]
  * [Mulgara MRG][]

These dependencies can most conveniently be obtained using Maven
(`% mvn dependency:copy-dependencies`);
the Java dependencies must live in `<build-directory>/lib`.

Doxygen is optional:
if it is not in the PATH, the C documentation will be skipped.

Note on installing byaccj: byaccj's Makefile needs to be tweaked to
remove OS Xisms, and it needs to be installed, by hand, some place
like `.../bin/byaccj`.



Limitations
-----------

  * Currently ignores some of the odder unit restrictions (such as the
    OGIP requirement that 'Crab' can have a 'milli' prefix, but no
    other SI prefixes).


Hacking
-------

The distributed source set is assembled using quite a lot of
preprocessing, involving parser- and documentation-generators.  It's
not intended to be a useful starting point for hacking on the
software.  For that, see the section on
_Pre-requirements: repository checkout_ above.


[vounits]:	http://www.ivoa.net/documents/VOUnits/
[vounitsstd]:	http://wiki.ivoa.net/cgi-bin/twiki/bin/view/IVOA/DiscussionOnVOUnits
[cds]:		http://cdsweb.u-strasbg.fr/doc/catstd-3.2.htx
[fits]:		http://dx.doi.org/10.1051/0004-6361/201015362
[ogip]: 	ftp://legacy.gsfc.nasa.gov/fits_info/fits_formats/docs/general/ogip_93_001/ogip_93_001.ps
[byaccj]:	http://byaccj.sourceforge.net/
[doxygen]:	http://www.doxygen.org
[graphviz]:	http://www.graphviz.org/
[jflex]:	http://jflex.de/
[JUnit4]:	http://junit.org
[Mulgara MRG]:	http://code.google.com/p/mrg/
[issues]:       https://bitbucket.org/nxg/unity/issues

Norman Gray  
http://nxg.me.uk  
@RELEASE_DATE@
