\documentclass[11pt,twoside]{article}

%%% XXX Comment out before submission!!!
%%%\def\RCS$#1: #2 ${\expandafter\def\csname RCS#1\endcsname{#2}}
%%%\RCS$Revision: 1.6 $

\usepackage{adassconf}

% NB: wierdo conference style has \title and co typeset material,
% rather than store it, so \begin{document} must go here, and there's no
% \maketitle
\begin{document}

\paperID{O6-3}
%%%% ID=O6-3
\contact{Norman Gray}
\email{norman@astro.gla.ac.uk}

\title{Data Models in the VO: How Do They Make Code Better?}
%%%\marginpar{[v\RCSRevision]}
\titlemark{Data Models in the VO}

\author{Norman Gray,
        David L. Giaretta,
        David S. Berry,
        Malcolm J. Currie,
        Peter W. Draper,
        Mark B. Taylor\altaffilmark{1}%
        }
\affil{Starlink Project,
        Rutherford Appleton Laboratory,
        Chilton,
        Didcot,
        OX11 0QX,
        UK}

\altaffiltext{1}{Alternate affiliations:
NG,
%        Department of Physics and Astronomy,
        University of Glasgow%,
%        Glasgow,
%        G12 8QQ, 
%        UK%
;
DSB,
%        Centre for Astrophysics,
        University of Central Lancashire%,
%        Preston,
%        PR1 2HE,
%        UK%
;
PWD,
%        Department of Physics, 
        University of Durham%, 
        %Rochester Building, 
        %Science Laboratories, 
        %South Road, 
%        Durham, DH1 3LE, 
%        UK%
;
MBT,
%        Department of Physics,
        University of Bristol%,
%        %Tyndall Avenue,
%        Bristol,
%        BS8 1TL,
%        UK%
        }

\paindex{Gray, N.}
\aindex{Giaretta, D. L.}
\aindex{Berry, D. S.}
\aindex{Currie, M. J.}
\aindex{Draper, P.}
\aindex{Taylor, M.}

\authormark{Gray, Giaretta, Berry, Currie, Draper \& Taylor}

\keywords{Virtual Observatory, data: model, usability, Starlink}



\begin{abstract}
Data Models exist in people's heads.  Data modelling consists of making
these explicit on paper, so that (a) we can discover if there is more
than one important model, and (b) we can develop using the model which
%has the best impedance match with the community being targeted.
has the best impedance match with the targeted community.
%Thus
%modelling is not just about communications -- about bits (or angle
%brackets!) on the wire -- but is a software quality and usability
%issue.

We contend that there is in fact more than one model relevant to the
Virtual Observatory (VO), and that while the VOTable model is a
valuable fit to the archivists' model of data, it may be a poor match
for many users or (much the same thing) for the software written to
service the sort of end-user astronomical applications which the VO
targets.

%% Modelling work in other areas shows the importance of abstraction in
%% the concrete goal of freeing software design from the particulars of
%% any single implementation. This is extremely important for the VO
%% because it allows us, and the software we write, to deal with the
%% essentials of the data rather than the superficial aspects of a
%% particular format such as XML or FITS. We discuss the work that we and
%% others have been doing within this context; with this in mind, 
We will also review some of the various modelling languages available.
%, such as XSchemas, UML, OMG MDA, HUTN, RDF, and Topic Maps.
\end{abstract}



\section{Language, models and usability}

In linguistics, the well-known Sapir-Whorf hypothesis claims that the
way we conceive of the world depends on the language we use to
describe it.  This means that the language we use affects what we can
think; and conversely, if we wish to have and manipulate a thought, we
must find some language to express it.

All this talk of languages matters to us, since when we create
software systems, we are generally creating a `language' -- in a
user-interface, an API or a protocol -- which users must employ to
interact with the underlying system, be it an application, a library,
or a remote service.  The user, the language and the system each
have a \emph{model} associated with them (see Figure~\ref{O6-3:f:models}), and
if there is a good three-way match between the models,
%If there is a good
%three-way match between the concepts in the user's head, the concepts
%implied by the language, and the concepts implicit in (some aspect of)
%the system's actual design,
then the user's interactions with the
system will be straightforward and generally error-free.  If there is
a mismatch, they will not.  This is not just a matter of
user-interface design; if the `user' is a programmer using an
API or protocol, then this three-way match will help them write
correct code faster, and help produce an application which will be
usable by its eventual mouse-wielding audience.

In some cases this match is reasonably obvious (think of the web and
either HTML or HTTP); in other cases more work is involved (Unix shell
language is tightly bound to the underlying system, but it takes
effort for the user to acquire the corresponding model); and in other
situations (notoriously video recorder interfaces) the complete
dislocation between the three models makes the interface language
almost unusable.

% Figure must go here rather than just before the paragraph which
% refers to it, to ensure that it doesn't end up on the first page
\begin{figure}
\epsscale{0.9}
\plotone{O6-3_1.eps}
\caption{\label{O6-3:f:models}Systems, users, and the languages which
mediate between them, all have potentially separate implicit models.}
\end{figure}

Ideally, then, there is a single model, which our user thinks with
(possibly with the help of documentation), which the language
expresses, and which the underlying system implements; it is the
rendezvous which helps the system as a whole hang together.  If the
model is made explicit during the development process, then it can
itself be examined, criticised, and checked for consistency with
itself and with the external system it is supposed to model, whether
that is an archive or a `quantity'.  The Sapir-Whorf hypothesis
suggests that it is only once the model is itself part of the language
of development, that we can think with it and talk about it.  Thus
this is a software quality and usability issue; it is an abstraction
with the concrete goal of freeing software design from the details of
any particular implementation.

It is important to note that the interface language's syntax is not a
model.  Instead, that syntax should be chosen so as to faithfully
reflect the model which the language hopefully shares with user and
system.  However, we can only discuss the faithfulness of the syntax
once we have an explicit model.

So how do we make the model explicit?  There are several obvious
answers to this, of which the most currently fashionable will be
XSchema, UML, and `a Java class library'.  A potential problem with
each of these is that they each come with a good deal of baggage.
%
This is another aspect of the Sapir-Whorf hypothesis: we are driven to
see the world in terms of the structure of the language we use to
describe it, irrespective of whether these features are present in, or
adequately describe, the
system being modelled.
That means that if we use XSchemas to model the world, we will
discover that the world is hierarchical with attributes, and if we use
an OO language, the world turns out to have methods.  For example,
while it is certainly possible in XML to model circular reference or
multiple containment (an RA IsA quantity \emph{and} IsA position), the
resulting language would likely be confusing and hard to use
correctly.
%
%% This is another aspect of the Sapir-Whorf hypothesis: we are driven to see
%% the world in terms of the structure of the language we use to describe
%% it.  That means that if we use XSchemas to model the world, we will
%% discover that the world is hierarchical with
%% attributes, and if we use an O-O language, the world turns out to have
%% methods.  That is, our choice of modelling language biases our model
%% to have certain features, irrespective of whether these
%% features are present in the system being modelled, and it makes other
%% things hard to express.  For example, while it is certainly possible
%% in XML to model circular reference or multiple containment (an RA IsA
%% quantity \emph{and} IsA position), the resulting language would likely
%% be confusing and hard to use correctly.

%A separate problem is to start thinking about syntactical or
%presentational details too early.  If we retrofit a data model, it
%will end up being a poor model of the world, but an excellent
%model of our syntax.
% During the UCD modelling effort, we found ourselves asking questions
% which purported to be about astronomy, but which were really about
% underscores.

There are several strategies to deal with this.  The first is to
decide that syntax is more important than anything else and let that
lead the process (this was arguably the case for UCD1).  Another is to confront
the problem, acknowledge that our choice of language is not neutral,
and make sure that choice is a good one.  A third is to choose a
more primitive modelling language, which will push fewer things into
the model.  We will return to this question in Section~\ref{O6-3:s:techniques}



\section{Different models for the same data -- the case of VOTable}

Our second point was first made by the Emperor
Charles~V: ``I speak Spanish to God, Italian to women, French to men
and German to my horse''.
% Variant: `German for soldiers, French for women, Italian for
% princes, Spanish for God''
% (Charles -- who was born in Ghent, and so would doubtless now
%count as Dutch -- seems to have had no use for fringe languages like
%English or \textsc{.net}).
% If we wish to say several things, we may need several languages.
If there is more than one type of user, and thus more than one user
model, we may need more than one language to achieve the required three-way
match.

Several of the systems in the current VO use VOTable
syntax (Williams et~al.\ 2002).  Despite this, we claim that they do
not all use the underlying VOTable data model.
%
VOTable is excellent as a way of archiving catalogue metadata, and
encoding \emph{all} the available information about a catalogue or image.
This comes about because VOTable is
%extremely
expressive, recursive and
flexible, and these are advantages for many users, with the result that
there is a good match between the user, system and (VOTable) language.
There is another category of users, however, who do not need or want this
sophistication, and who simply wish to extract a more modest amount of
information (such as image and variance data) with as little knowledge as
possible; this may be because they are busy and cannot afford the time to
read full documentation, or because they are writing generic applications,
and so cannot afford the luxury of reading documentation and building
in to their application knowledge of the various ways that a set of data
providers have exploited VOTable's flexibility.  The Simple Image Access
(SIA) protocol implicitly uses this simple model at both the user and
system ends, even though it uses in its responses the VOTable syntax.
%which implies a potentially much more complicated model.
This
dissonance between models is a potential usability problem, and is
addressable by using an alternative model such as that
of % implied by
HDX (Giaretta et~al.\ 2003).

The UCD system (Derriere et~al.\ 2004) implies a third distinct data
model, corresponding to a distinct constituency which is interested in
metadata rather than pixels, and whose focus is on registries rather
than image viewers.

This (Kuhnian) incommensurability is not a defect which can be evaded
by a yet more comprehensive Grand Unified Data Model; instead, it
is a fixed feature of the problem that the VO is addressing.  Although
they are less severe, there are also similar dislocations between the
metadata models which different VO participants prefer.  These gaps
between models can be bridged by software that understands more than
one model (such as an archive which can service two communities), by a
consensus model, if one can be produced in a useful timescale, or
by accepting the existence of multiple models, and declaring what
mappings exist between them in such a way that software can
mechanically extract the semantics of a given set of metadata with at
least as much fidelity as it would from a consensus model.  It is this
last which seems the most flexible and scalable of the options.

To illustrate what is gained by being able to discuss the data model
explicitly, consider the following pair of RDF statements, which is
how we might express the fact that a column with ID \texttt{raerr},
say, is the error in an RA:
\begin{verbatim}
_x rdf:type :pos.eq.ra .
_x :stat.err #raerr .
\end{verbatim}
This is RDF `notation3', and expresses in detail that there
exists a thing
{\catcode`\_=12 \texttt{_x}} % plain \_ looks too thin, so make _ an `other'
that is of type \texttt{pos.eq.ra}, and
% that {\catcode`\_=12 \texttt{_x}}
has a property \texttt{stat.err} which is in the URL
\texttt{\#raerr}; however the details are much less important than the
observation that such a neutral notation exists, and that this 
notation is distinct from the language we would use to communicate
this in practice.
%
This prompts a number of valuable questions: \texttt{pos.eq.ra}
appears to be a type whereas \texttt{stat.err} is a property (in RDF
terms): did we want this?  Is this what we want a UCD like
\texttt{stat.err;pos.eq.ra} to mean?  If not, what?  For a given
proposed UCD syntax, what would such a set of statements look like?
Can we construct similar sets of statements which our proposed syntax
could not express?  Such questions are extremely difficult to ask, let
alone answer, without a notation which is distinct from the
syntax under discussion.

In summary, the VO contains multiple user groups and models, 
% which are incompatible to a greater or lesser extent,
and its challenges therefore cannot be met by an approach focused on a
single central model.  Even when groups have largely compatible models
(such as the producers and consumers of complicated archive data), the
mappings between them, and consensus models, must be discussed using a
notation \emph{which is distinct from the proposed syntax}, if it to
be possible to discuss that syntax with clarity.


\section{Modelling techniques\label{O6-3:s:techniques}}

RDF, illustrated above, is one of several possible modelling languages.
%Above, we illustrated using RDF as a modelling language; it is one of
%several possibilities.

XSchema is the World-Wide Web Consortium's (W3C) \htmladdnormallink
{standard schema language} {http://www.w3.org/XML/Schema} which, as
well as validation, is usable is a modelling language.  It is both
verbose and rather complicated, and its main benefit over DTDs is its
elaborate type system, which looks familiar to those with a database
schema background, as well as being the type system for a variety of
other W3C standards.

The Unified Modelling Language (UML) is intended for designing
object-oriented systems, and modelling their environments.
Object-oriented notions of subclassing and object manipulation are
very natural within UML, in the way that containment, for example, is
natural in XML schemas.
%
The Object Management Group (\htmladdnormallink {\texttt{www.omg.org}}
{http://www.omg.org}) curates the UML standard as part of the larger
Model Driven Architecture (MDA) as a means of specifying
platform-independent application descriptions: XMI (XML Metadata
Interchange) is an XML-based modelling language intended to help
generate and exchange consistent XSchemas, UML, and code.

Resource Description Format (\htmladdnormallink {RDF}
{http://www.w3.org/RDF/}) is a W3C low-level modelling notation,
which is usable by the inferencing engines which
should drive the Semantic Web; it is the modelling aspect, rather than
this inferencing one, we have emphasised above.
Other W3C standards such as RDFSchema and OWL supplement the base RDF syntax to the point where it is
useful for describing realistic ontologies.

%% Resource Description Format (\htmladdnormallink {RDF}
%% {http://www.w3.org/RDF/}) is a W3C low-level modelling notation, with
%% properties which make it usable by the inferencing engines which
%% should drive the Semantic Web; it is the modelling aspect, rather than
%% this inferencing one, we have emphasised in the discussion above.
%% Other W3C standards such as RDFSchema and OWL (Ontology Language for
%% the Web) supplement the base RDF syntax to the point where it is
%% useful for describing realistic ontologies.


%% Above, we illustrated using RDF as a modelling language; it is one of
%% several possibilities.

%% \paragraph{XSchema}

%% XSchema is the World-Wide Web Consortium's (W3C) replacement for the
%% DTD syntax which XML inherited from SGML.  As such, it is intended
%% primarily for validation, but since modelling is involved in the
%% process of developing a markup schema, it is usable as a modelling
%% language also.  XSchema is both verbose and rather complicated, and
%% its main benefit over DTDs is its elaborate type system, which looks
%% familiar to those with a database schema background, as well as being
%% the type system for a variety of other W3C standards.  An alternative
%% schema language is RelaxNG, which has a syntax which is substantially
%% easier to read than XSchema, but which can be easily transformed to
%% it.

%% \paragraph{UML}

%% The Unified Modelling Language is primarily intended for designing
%% object-oriented systems, but it is also designed to be able to model the
%% environments those systems must operate within.  Object-oriented notions
%% of subclassing and object manipulation are clearly very natural within
%% UML, in the way that containment, for example, is natural in XML schemas.

%% \paragraph{OMG standards}

%% UML as a standard is curated by the Object Management Group
%% (\htmladdnormallink{www.omg.org}{http://www.omg.org}), along with
%% a collection of other object-modelling technologies including IDL and
%% CORBA.  These can be tied together using yet more acronyms.  MDA
%% (Model Driven Architecture) is a means of specifying
%% platform-independent application descriptions: XMI (XML Metadata Interchange) is an XML-based modelling
%% language intended to help generate and exchange mutually consistent
%% XSchemas, UML, and code in the form of CORBA and IDL interfaces.
%% %Since XMI suffers from the general XML verbosity, work is being done
%% %on HUTNs (Human-Usable Textual Notations), as more
%% %readable versions of XMI models.

%% \paragraph{RDF and OWL}

%% Resource Description Format
%% (RDF) is a modelling notation standardised 
%% by the W3C.  It is a low-level modelling language, with properties
%% which make it usable by the inferencing engines which, it is hoped,
%% will drive the Semantic Web; it is the modelling aspect, rather than
%% this inferencing one, which we have emphasised in the discussion
%% above.  Other W3C standards such as RDFSchema and OWL (Ontology
%% Language for the Web) supplement the base RDF syntax to the point
%% where it is useful for describing realistic ontologies.


%% \section{Conclusions}

%% The VO will have several distinct types of users, and thus several
%% potentially incommensurable data models; since we believe this to be
%% inevitable, it follows that it should be planned for, rather than
%% attempting to force all applications and protocols into a single
%% compromise data model.  This requires us to articulate these user or
%% system data models so that we can bring about the three-way consistency
%% that will be necessary for a usable and robust VO.  Linking these
%% independent models is probably more profitable than attempting to develop
%% some Grand Unified Data Model.




\begin{references}
% Reference format from proceedings instructions, which refers to
%http://ukads.nottingham.ac.uk/cgi-bin/nph-bib_query?bibcode=1990ApJ...357....1A
\reference Derriere, S. et al.\ 2004, \adassxiii, \paperref{P3-17}

\reference
Giaretta, D., % David,
%% Taylor, M., % Mark,
%% Draper, P., % Peter,
%% Gray, N., % Norman
%% and
%% McIlwrath, B., % Brian  
    et al.\ 2003, \adassxii, 221

\reference
Williams, R., % Roy,
%% Ochsenbein, F., % Fran\c{c}ois,
%% Davenhall, C., % Clive,
%% Durand, D., % Daniel,
%% Fernique, P., % Pierre,
%% Giaretta, D., % David,
%% Hanisch, R., % Robert,
%% McGlynn, T., % Tom,
%% Szalay, A., % Alex,
%% and
%% Wicenec, A., % Andreas.
et al.\ 2002,
VOTable: A Proposed XML Format for Astronomical Tables, version 1.0,
% April 2002
%online at URL 
\htmladdURL{http://cdsweb.u-strasbg.fr/doc/VOTable/}

\end{references}

\end{document}

