%ADASS_PROCEEDINGS_FORM%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
\documentclass[11pt,twoside]{article}  % Leave intact
\usepackage{adassconf}

\begin{document}   % Leave intact
\paperID{P4-9}
%%%% ID=P4-9

\title{Grid Data Distribution Strategy: Design and Implementation of a
 Pipeline Oriented Data Management System}
\titlemark{Pipeline Oriented Data Management System for the Grid}
%-----------------------------------------------------------------------
%                 Authors of Paper
%-----------------------------------------------------------------------
 \author{N.\ Lama, C.\ Vuerli, R.\ Smareglia, F.\ Gasparo, F.\ Pasian}
\affil{ INAF/Osservatorio Astronomico di Trieste,
   Via G.B.Tiepolo 11, I-34131, Trieste, Italy.
   e-mail: family-name@ts.astro.it }
\author{M.\ Genghini}
\affil{Istituto Astrofisica Spaziale e Fisica Cosmica, Bologna }
%-----------------------------------------------------------------------
%            Contact Information
%-----------------------------------------------------------------------
% This information will not appear in the paper but will be used by
% the editors in case you need to be contacted concerning your
% submission.  Enter your name as the contact along with your email
% address.
\contact{Nicola Lama} \email{lama@ts.astro.it}

\paindex{Lama, N.} 
\aindex{Vuerli, C.} 
\aindex{Smareglia, R.}
\aindex{Gasparo, F.} 
\aindex{Genghini, M.} 
\aindex{Pasian, F.}
%-----------------------------------------------------------------------
%                     Author list for page header
%-----------------------------------------------------------------------
\authormark{Lama et al.}

\keywords{archives, data: processing, pipelines, data: management, databases, Grid}
%-----------------------------------------------------------------------
%                  Abstract
%-----------------------------------------------------------------------

\begin{abstract}          % Leave intact
Dynamic data distribution is a key factor in Grid computing. This
paper describes the DMC project, which aims to improve collaborative
research by allowing data to be shared more easily across applications
cooperating within a federated environment. DMC is
the data management system chosen by the Planck Satellite Survey
Community, and specifically by the two Data Processing Centers, as
a common infrastructure for the data handling applications being
developed. The focus here is on the design of the
model, the data structures, and the portability of the Planck
experience to other pipeline-oriented distributed environments,
in particular Grid-enabled systems.
\end{abstract}
%-----------------------------------------------------------------------
%                 Main Body
%-----------------------------------------------------------------------
% Place the text for the main body of the paper here.  You should use
% the \section command to label the various sections; use of
% \subsection is optional.  Significant words in section titles should
% be capitalized.  Sections and subsections will be numbered
% automatically.
\section{Introduction}
The aim of the project is to provide a pipeline-oriented
data management system specialized for the data products required by
Grid-oriented data processing modules.
%% (see Figure~\ref{P4-9:Fig1}).
The underlying principle of DMC is to have a service tool through
which a pool of applications can store and retrieve their data
products from a number of geographically distributed data
repositories. These concepts make the DMC a tool particularly
suited to data grid  \htmladdnormallinkfoot{{applications}
}{http://wwwas.oat.ts.astro.it/draco/DRACO-home.htm}. Originally
required within the framework of the Planck IDIS (Integrated Data
and Information System) Working Group, the system has been
designed so as to be fully portable to other experiments, missions
and data management projects. Design details are given in {[Vuerli
2001a; Vuerli 2001b; Lama 2002]}.
%%\begin{figure} [t]
%%\epsscale{.70} \plotone{P4-9_f1.eps}
%%\caption{ Grid Data Management System design} \label{P4-9:Fig1}%
%%\end{figure}

 \section{DMC model: THE CORE}

The DMC has a multi-tier, object-oriented software architecture
organised into two independent layers: the DMCI
(DMC Interface) and the physical implementation (see
Figure~\ref{P4-9:Fig1}). The DMCI is the User Interface (or
Presentation Layer), a set of API-like interfaces through which
scientific applications can exploit the DMC services. These
interfaces hide the actual physical implementation from the user
or the calling application. The DMC physical implementation is the
Data Services Layer, which communicates directly with the database.
A crucial objective was to develop the DMC hierarchically; as a
result, the DMC is implemented by a Business Services
Layer, related to application-oriented objects, plus a DMC Core
implementation. The latter is the Basic Services Layer, which
implements the foundation for data handling. It provides a set
of basic services portable to any experiment that is
pipeline/module oriented. The core organizes data products within
the object representing the module or pipeline that produced them,
in order to speed up data exchange between clients.
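
The separation between the interface layer and the physical
implementation can be illustrated with a minimal Java sketch. All names
used here (\texttt{ProductStore}, \texttt{InMemoryProductStore},
\texttt{MapMakerModule}) are hypothetical and are not part of the DMCI;
an in-memory map is assumed in place of the database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical presentation-layer interface (a stand-in for the DMCI):
// applications see only these methods, never the storage back end.
interface ProductStore {
    void store(String name, double[] data);
    Optional<double[]> retrieve(String name);
}

// Hypothetical data-services implementation; an in-memory map stands in
// for the database that a real back end would talk to.
final class InMemoryProductStore implements ProductStore {
    private final Map<String, double[]> products = new HashMap<>();

    @Override
    public void store(String name, double[] data) {
        products.put(name, data.clone()); // defensive copy
    }

    @Override
    public Optional<double[]> retrieve(String name) {
        double[] data = products.get(name);
        return data == null ? Optional.empty() : Optional.of(data.clone());
    }
}

// A client module written only against the interface.
final class MapMakerModule {
    static double mean(ProductStore store, String name) {
        double[] data = store.retrieve(name)
                .orElseThrow(() -> new IllegalArgumentException("no product " + name));
        double sum = 0.0;
        for (double v : data) {
            sum += v;
        }
        return sum / data.length;
    }
}
```

Because \texttt{MapMakerModule} depends only on the interface, the
in-memory back end could be replaced by a database or file back end
without changing the client code.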

   \begin{figure*}[t]
    \epsscale{.90}
    \plotone{P4-9_f1.eps}
    \caption{ DMC multi-tier layout }  \label{P4-9:Fig1}%
    \end{figure*}

 \section{DMC compatibility with Grid concepts}
The DMC is a Digital Library that can be mounted on top of a
Data Grid infrastructure [{Pasian 2004, Smareglia 2004}] and
provides services for manipulating, presenting, discovering,
browsing and displaying digital objects. It is a particular
implementation of the Generic Virtual Data Access and Integration
Technology layer. It enhances and specializes the following core
services of a Grid-enabled data storage resource [{Stockinger
2001}].

%\subsection{Data Formats}
{\bf Data Formats} -- Metadata management is a Virtual
Organization (VO) task. According to [{Segal 2001}], experiment-specific
or, more generally, VO-specific metadata is managed by the
VO's software infrastructure and not by DataGrid Middleware tools.
The DMC data model design [{Vuerli 2001, Lama 2002}] expects clients
to store information through a common metadata management API (e.g.
database schema, FITS file structure). The use of
undifferentiated Binary Large Object (BLOB) data is
discouraged since it limits data sharing, which is the aim of the DMC itself.
The Digital Library nature of the DMC guarantees that it can evolve
smoothly to follow forthcoming metadata requirements.

%\subsection{Data access operations}
{\bf Data access operations} -- The DMC data model is composed of
an inventory of objects representing the variety of data products
created along the pipeline processing path. Objects are aggregated
into containers (namespaces) and connected into data flows,
expressing an invocation sequence of scientific solvers and
visualization tools. DMC provides primitives for uniform access to
metadata and storage structure through data model browsing
(virtual directories) and advanced lookup mechanisms (queries, see
below).

%\subsection{Local transparency and global name space}
{\bf Location transparency and global name space} -- Through the DMCI,
users can transparently access data in a federation of data
repositories. DMC-enabled applications deal with a set of virtual
data repositories and access data independently of their physical
location. Currently, an LDAP-based IDIS Federation Layer component
is in charge of dynamically resolving the link between virtual and
physical repositories at runtime. The plan is to move towards a
DataGrid-like Storage Resource Broker approach.
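
The runtime resolution of virtual repositories described above can be
sketched as follows. This is only an illustration under stated
assumptions: the class \texttt{FederationDirectory}, the repository
name and the URIs are hypothetical, and a plain map stands in for the
LDAP directory.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical federation directory: maps a virtual repository name to the
// physical location currently serving it. A plain map stands in for the
// LDAP directory used by the real Federation Layer.
final class FederationDirectory {
    private final Map<String, URI> locations = new ConcurrentHashMap<>();

    // Registers or updates the physical location of a virtual repository.
    void register(String virtualName, URI physicalLocation) {
        locations.put(virtualName, physicalLocation);
    }

    // Resolved on every call, so a repository can be relocated between
    // calls without any change in client code.
    URI resolve(String virtualName) {
        URI uri = locations.get(virtualName);
        if (uri == null) {
            throw new IllegalArgumentException("unknown repository: " + virtualName);
        }
        return uri;
    }
}
```

A client that always calls \texttt{resolve} with the virtual name keeps
working when the repository is re-registered at a different host.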

%\subsection{Persistence and Replication}
{\bf Persistence and Replication} -- The DMC emphasizes the ability,
essential in scientific computing, to access large amounts of data
stored in blocks. Data objects can be used as temporary containers
(non-persistent objects) for local processing or for particular
high-performance applications. Replication can be wrapped on
top of the physical COTS (e.g. the Versant replication API).

%\subsection{Privilege and security issues}
{\bf Privilege and security issues} -- To encourage
resource/data sharing and a collaborative approach among DMC users,
read/write privileges are handled at the data repository
granularity level (authentication level). This fits well the security
requirements of projects that, like Planck, are working-group
oriented. The plan is to enforce security through the EDG Java
Security package [{Bosio 2003}], data cryptography and digital
certificates.

%\subsection{Error and exception handling}
{\bf Error and exception handling} -- The DMC manages data handling
errors and exceptions generated when accessing a data repository,
and throws exceptions when the consistency checks that
enforce data model integrity, or the pre-processing data quality
checks, fail.

%\subsection{Check-pointing and state management}
{\bf Check-pointing and state management} -- DMC services are
transaction oriented; on failure, it is possible to rebuild state and
restart operations. The DMC provides multiple
database connections with sophisticated locking models
(optimistic locking, transactions shared among different data
repositories).
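
As an illustration of the optimistic locking idea only, the following
Java sketch commits an update only if the record has not changed since
it was read. It is not the DMC or Versant locking API; the class
\texttt{OptimisticStore} and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical versioned store illustrating optimistic locking: no lock is
// held while a client works; a commit succeeds only if the record version is
// still the one the client originally read.
final class OptimisticStore {
    // Immutable snapshot of a record: value plus version counter.
    static final class Versioned {
        final String value;
        final long version;

        Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<String, Versioned> records = new HashMap<>();

    // A record that was never written is reported as version 0, value null.
    synchronized Versioned read(String key) {
        return records.getOrDefault(key, new Versioned(null, 0L));
    }

    // Returns true on success, false if another writer committed first.
    synchronized boolean commit(String key, String newValue, long expectedVersion) {
        if (read(key).version != expectedVersion) {
            return false; // conflict: the caller should re-read and retry
        }
        records.put(key, new Versioned(newValue, expectedVersion + 1));
        return true;
    }
}
```

When two clients read the same version, the first commit wins and the
second is rejected, so the losing client can rebuild its state from a
fresh read and restart the operation.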

\section{Implementation issues: TECHNOLOGY}

%\subsection{COTS adopted}
{\bf COTS adopted} The programming language is Java (to ensure
high portability), with JNI used for ad hoc integration with non-Java
client modules. Versant is the OODBMS of choice, supported
Planck-wide. \htmladdnormallinkfoot{{Java Data Objects}
}{http://access1.sun.com/jdo/} (JDO) technology is being
evaluated: a JDO-compliant DMC implementation would provide access
to relational databases, object databases, flat files, or any
other compatible persistent storage device. A Java Servlet
Web-based visualization tool is being developed, exploiting
Starlink software experience on VO data viewing and modeling
[{Gray 2004, Taylor 2003}].

%\subsection{Core implementation}
{\bf Core implementation} The data model has been designed to
reflect data usage and to be pipeline oriented. Data are
organized within a graph structure modeling the pipeline path. This
allows fast data browsing by link and
avoids time-expensive internal queries that traverse the
databases to find and evaluate starting-point objects. The history
of the processing path of data products is logged, so that
clients can browse data products by following their processing path.
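
Browsing by link along a processing path can be sketched with a small
Java graph in which each product keeps direct references to its inputs.
The class \texttt{ProductNode} and the product and module names in the
usage note are hypothetical and do not reproduce the actual DMC data
model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node of the processing graph: each data product keeps a
// direct link to the products it was derived from.
final class ProductNode {
    final String name;
    final String producedBy; // module that created the product
    final List<ProductNode> inputs = new ArrayList<>();

    ProductNode(String name, String producedBy, ProductNode... inputs) {
        this.name = name;
        this.producedBy = producedBy;
        for (ProductNode in : inputs) {
            this.inputs.add(in);
        }
    }

    // Walks the links back to the raw inputs (depth-first) and returns the
    // processing history, earliest step first, without any lookup query.
    List<String> history() {
        List<String> out = new ArrayList<>();
        collect(this, out);
        return out;
    }

    private static void collect(ProductNode node, List<String> out) {
        for (ProductNode in : node.inputs) {
            collect(in, out);
        }
        String step = node.producedBy + " -> " + node.name;
        if (!out.contains(step)) {
            out.add(step); // a product shared by several paths is listed once
        }
    }
}
```

For a chain of three products, for example raw data, cleaned data and a
sky map, calling \texttt{history()} on the last node returns the three
processing steps in the order in which they were executed.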

%\section{Data retrieval and facilities: QUERIES.}
{\bf Queries} Data retrieval features include object lookup by
mnemonic alias, by version and by attribute values. Modules can
retrieve products owned by a specified user or produced by a
module or pipeline with certain parameter values. Advanced lookup
services are under construction, namely the lookup of time-ordered
data by sky position through scanning strategy information.
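
Lookup by alias, version, owner and attribute value can be expressed
as composable predicates. The Java sketch below assumes a hypothetical
in-memory \texttt{Catalogue}; the field and method names are
illustrative and are not the DMC query interface.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical catalogue entry; field names are illustrative only.
final class CatalogueEntry {
    final String alias;
    final int version;
    final String owner;
    final Map<String, String> attributes = new HashMap<>();

    CatalogueEntry(String alias, int version, String owner) {
        this.alias = alias;
        this.version = version;
        this.owner = owner;
    }

    // Adds an attribute and returns this entry to allow chaining.
    CatalogueEntry with(String key, String value) {
        attributes.put(key, value);
        return this;
    }
}

// Hypothetical in-memory catalogue with composable search criteria.
final class Catalogue {
    private final List<CatalogueEntry> entries = new ArrayList<>();

    void add(CatalogueEntry entry) {
        entries.add(entry);
    }

    // Returns every entry matching the given criteria.
    List<CatalogueEntry> find(Predicate<CatalogueEntry> criteria) {
        return entries.stream().filter(criteria).collect(Collectors.toList());
    }

    static Predicate<CatalogueEntry> alias(String a) {
        return e -> e.alias.equals(a);
    }

    static Predicate<CatalogueEntry> version(int v) {
        return e -> e.version == v;
    }

    static Predicate<CatalogueEntry> owner(String o) {
        return e -> e.owner.equals(o);
    }

    static Predicate<CatalogueEntry> attribute(String key, String value) {
        return e -> value.equals(e.attributes.get(key));
    }
}
```

Criteria combine with the standard \texttt{Predicate.and} method, for
instance \texttt{Catalogue.alias("sky-map").and(Catalogue.version(2))}.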

%\subsection{Huge sized data management}
{\bf Huge-sized data management} Maps and time series are
internally managed as segmented arrays: data are buffered in
chunks, and this structure allows the
DMC to manage huge-sized data. Data can also be stored in
compressed form. This aspect of the DMC architecture is being reviewed
in view of forthcoming data distribution services optimized for
parallel computing on Beowulf workstation clusters using MPI
{[Gropp 2000a, 2000b]}.
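
A segmented array can be sketched in Java as a logical array split into
fixed-size chunks that are allocated only when first written. The class
\texttt{SegmentedArray} below is a hypothetical, in-memory illustration
of the idea and makes no claim about the DMC internal layout, chunk
size or compression.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical segmented array: a logical array of doubles stored as
// fixed-size chunks that are allocated only when first written.
final class SegmentedArray {
    private final long length;
    private final int chunkSize;
    private final Map<Long, double[]> chunks = new HashMap<>();

    SegmentedArray(long length, int chunkSize) {
        if (length < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("invalid length or chunk size");
        }
        this.length = length;
        this.chunkSize = chunkSize;
    }

    // Elements in chunks that were never written read as 0.0.
    double get(long index) {
        check(index);
        double[] chunk = chunks.get(index / chunkSize);
        return chunk == null ? 0.0 : chunk[(int) (index % chunkSize)];
    }

    void set(long index, double value) {
        check(index);
        // Allocate the owning chunk on first write only.
        double[] chunk = chunks.computeIfAbsent(index / chunkSize, k -> new double[chunkSize]);
        chunk[(int) (index % chunkSize)] = value;
    }

    // Number of chunks actually held in memory.
    int allocatedChunks() {
        return chunks.size();
    }

    private void check(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

The logical length can exceed the limit of a single Java array, while
memory use grows only with the number of chunks touched. In a
persistent variant each chunk could be stored, compressed or
distributed independently.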

\section{Conclusions}
In the future, a FITS file implementation of the DMCI will be
developed. Modules that rely on the DMCI will be able to store data
within database structures or FITS files transparently. JDO
technology will let the DMC persist Java objects to any transactional
data store transparently. The DMC was released in late October
2003, after the completion of the alpha testing campaign. The DMC is
currently being tuned while undergoing beta tests at the Max Planck
Institute for Astrophysics and at the LFI DPC pipeline integration
site {[Zacchei 2004]}.

% Finally, we have a little acknowledgments section.
\acknowledgments We wish to thank the Research and Science Support
Department of ESA ESTEC for their alpha testing activity, the Max
Planck Institute for Astrophysics for their beta testing activity,
and the Planck IDIS community and the LFI DPC Consortium
Institutes for comments and suggestions.

%-----------------------------------------------------------------------
%                 References
%-----------------------------------------------------------------------

\begin{references}
\reference Bosio et al.\ 2003, Computing in High Energy Physics (CHEP 2003).
\reference Gray et al.\ 2004, \adassxiii \paperref{O6-3}
\reference Gropp et al.\ 2000a, Using MPI, MIT Press.
\reference Gropp et al.\ 2000b, Using MPI-2: Advanced Features, MIT Press.
\reference Lama et al.\ 2002, \textsc{Planck Int. Doc.},
   IDIS DMC Architectural Design Doc.
\reference Pasian et al.\ 2004, \adassxiii \paperref{P3-1}
\reference Segal 2001, DataGrid Data Management (WP2) Architecture Report. 
\reference Smareglia et al.\ 2004, \adassxiii \paperref{P8-2}
\reference Stockinger et al.\ 2001, European High Performance Computing 
   Conference. 
\reference Taylor et al.\ 2003, \adassxii, \adassref{xii:P2-5}{325} % Paper added FO
	%  APS Conf. Ser. Vol 295, ADASS XII 
\reference Vuerli et al.\ 2001a, \textsc{Planck Int. Doc.},
  IDIS DMC Users Requirements Doc.
\reference Vuerli et al.\ 2001b, \textsc{Planck Int. Doc.},
  IDIS DMC Data Model Specification.
\reference Zacchei et al.\ 2004, \adassxiii \paperref{P4-8}

\end{references}

% Do not place any material after the references section

\end{document}  % Leave intact
