\documentclass[11pt,twoside]{article}  % Leave intact
\usepackage{adassconf}


\begin{document}   % Leave intact
\paperID{P1-16}
%%%% ID=P1-16
\title{Distributed Data Storage}
\author{Justin Kanoa Withington}
\affil{Canada France Hawaii Telescope, Hawaii, USA}
\contact{Justin Kanoa Withington}
\email{kanoa@cfht.hawaii.edu}
\authormark{Withington}
\keywords{archives, data: storage, distributed}

\paindex{Withington, J. K.}

\begin{abstract}          % Leave intact

Distributed storage is the technique of storing a single data set
across multiple hosts. This paper focuses on massive localized systems
for distributed storage which are directed at overcoming the capacity
and performance limitations of single-host, or monolithic, storage
systems. Furthermore, it focuses on applications in astronomy and
draws examples from several existing astronomical distributed storage
systems.

\end{abstract}

%-----------------------------------------------------------------------
%			      Main Body
%-----------------------------------------------------------------------
 

\section{Methods}

Several aspects of distributed storage systems are described. Where
appropriate, examples from existing storage systems are given. For
this paper, distributed data storage systems at five astronomical
facilities were surveyed.

\begin{deluxetable}{lcc}
\scriptsize
\tablecaption{Example Systems Surveyed \label{P1-16:T1.10-tbl-1}}
\tablehead{
\colhead{Location} & \colhead{Aggregate Capacity} & \colhead{Number of Nodes}} 
\startdata
CFHT &18 Terabytes &12 \nl
CADC &55 Terabytes &21 \nl
SDSS &44 Terabytes &24 \nl
JAC  &11 Terabytes &21 \nl
ESO  &20 Terabytes &15 \nl
\enddata
\end{deluxetable}

\section{Cost}

Central to leveraging the cost benefits of a distributed storage
system is the relationship between the base cost of a host platform
and the point at which it becomes impractical to attach additional
low-cost local storage.

For example, the lowest-cost class of random access storage device
today is the IDE disk drive. As inexpensive disks are added to the
base system, the cost per gigabyte tends to decrease until capacity
approaches four terabytes, at which point the cost per gigabyte is
about \$2.60. As capacity grows past this point, the cost per
gigabyte starts increasing because the size and number of attached
disks begin to require specialized equipment.


This ``sweet spot'' depends on changing market conditions and should
be periodically recalculated to determine the optimum node size.
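
Because the sweet spot moves with the market, it can be recomputed
mechanically from a current price list. The following Python sketch
finds the disk count that minimizes the cost per gigabyte of one
node; every price and capacity in it is a hypothetical placeholder,
not survey data.

\begin{verbatim}
# Sketch: estimate the cost-per-gigabyte "sweet spot" for one node.
# All prices and capacities below are hypothetical placeholders.
BASE_COST = 2000.0      # host platform without disks (USD)
DISK_COST = 400.0       # one commodity IDE disk (USD)
DISK_SIZE_GB = 250      # capacity of that disk (GB)
SPECIAL_COST = 3000.0   # extra hardware needed past 16 disks

def cost_per_gb(n_disks):
    """Total node cost divided by raw capacity for n_disks disks."""
    cost = BASE_COST + n_disks * DISK_COST
    if n_disks > 16:            # crude model of the point where the
        cost += SPECIAL_COST    # disk count needs specialized gear
    return cost / (n_disks * DISK_SIZE_GB)

best = min(range(1, 33), key=cost_per_gb)
print(best, round(cost_per_gb(best), 2))
\end{verbatim}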

\section{Scalability}

Scalability is perhaps the most liberating quality of a distributed
storage system. Capacity can be dynamically added or removed, and
because nodes are not limited to a previous design strategy,
expansions can also take advantage of the moving sweet spot mentioned
earlier.

A well-designed system can scale in a perfectly linear fashion; that
is, there is no upper limit on the capacity of such a system.
Furthermore, the cost/capacity relationship is also linear. Given the
estimate of four terabytes per node at \$2.60 per gigabyte, a 100
terabyte system today would cost roughly \$260,000 and consist of
about 25 nodes.
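
Explicitly,
\begin{displaymath}
100{,}000\ \mathrm{GB} \times \$2.60/\mathrm{GB} = \$260{,}000,
\qquad
\frac{100\ \mathrm{TB}}{4\ \mathrm{TB\ per\ node}} = 25\ \mathrm{nodes}.
\end{displaymath}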

\section{Data Tracking}

Data tracking is simply the maintenance of a single point of reference
for the location of all the elements in the data set. Two common ways
of accomplishing this are through the use of a database and the use of
a virtual file system.

A database provides a great deal of interpretive flexibility, but it
also requires some kind of interface for clients to look up the
location of data, and some existing software might need to be
modified to use that interface.
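
As a minimal sketch of the database approach, assuming a hypothetical
one-table schema (none of the names or paths below are drawn from a
surveyed system):

\begin{verbatim}
import sqlite3

# Hypothetical schema: one row per file, mapping a dataset name
# to the node and subvolume path that currently hold it.
db = sqlite3.connect("tracking.db")
db.execute("""CREATE TABLE IF NOT EXISTS files (
                  dataset TEXT PRIMARY KEY,
                  node    TEXT NOT NULL,
                  path    TEXT NOT NULL)""")

def locate(dataset):
    """Return (node, path) for a dataset, or None if untracked."""
    return db.execute("SELECT node, path FROM files WHERE dataset = ?",
                      (dataset,)).fetchone()

db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
           ("740123o.fits", "node07", "/sub3/740123o.fits"))
db.commit()
print(locate("740123o.fits"))   # ('node07', '/sub3/740123o.fits')
\end{verbatim}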

A virtual file system symbolically represents the entire data set in a
hierarchical fashion similar to a normal file system where each data
point is a link to the actual data on a storage node. An advantage of
a virtual file system is that users do not need to be retrained since
the file access mode is familiar. Multiple virtual file systems can
represent the same data set differently to suit particular
applications. For example, one virtual file system might represent the
data organized by instrument and another might represent the same set
of files by program ID.
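
A view of this kind can be as simple as a tree of symbolic links. The
Python sketch below builds a by-instrument view from a hypothetical
catalog; the file names and node paths are placeholders, and a
by-program-ID view could be built the same way over the same catalog.

\begin{verbatim}
import os

# Hypothetical catalog: file name -> (instrument, path on a node).
catalog = {
    "740123o.fits": ("megacam", "/net/node07/sub3/740123o.fits"),
    "740124o.fits": ("wircam",  "/net/node02/sub1/740124o.fits"),
}

root = "by_instrument"          # root of this particular view
for name, (inst, real) in catalog.items():
    d = os.path.join(root, inst)
    os.makedirs(d, exist_ok=True)
    link = os.path.join(d, name)
    if not os.path.lexists(link):
        os.symlink(real, link)  # link stands in for the remote file
\end{verbatim}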

The choice of database or virtual file system depends primarily on the
usage pattern of the data set. For example, the CADC is an archive and
a self-contained system in the sense that the only clients of the
storage system are the CADC themselves and software written by the
CADC to access the system. In this situation it is most reasonable to
track the data in a database, as it will be faster and more flexible
for specialized interfaces. At CFHT, a diversity of engineers and
pipelines use the storage system, so a more traditional data
representation is desirable.

\section{Data Organization}

Every distributed data storage volume consists of a collection of
subvolumes located on the storage nodes. Each storage node may host
one or more subvolumes. Data organization is the method used by the
system to store the files across the subvolumes.

At one extreme, the system might treat each file as a distinct data
element and disperse the files serially or pseudo-randomly across
every available subvolume. At the other extreme, the system might only
recognize classes of data as elements. For example, the system at
SDSS tracks stripes of observations as directories; under each
directory are raw and processed data not directly managed by the
storage system. JAC's WFCAM storage system places each extension of
the camera on a dedicated subvolume, with another dedicated subvolume
for the processed data of each extension.

Such high-level organization of data is attractive because it is less
complicated to design and manage. It only works well, however, when
the data set is fairly well organized or homogeneous. It is less
practical for a system which manages diverse data types or data
from multiple instruments.
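
The per-file extreme can be approximated by a stable hash from file
name to subvolume, as in the following Python sketch (the subvolume
names are hypothetical):

\begin{verbatim}
import hashlib

# Hypothetical subvolume pool.
SUBVOLUMES = ["node01:/sub0", "node01:/sub1",
              "node02:/sub0", "node02:/sub1"]

def place(filename):
    """Stable pseudo-random mapping of a file to a subvolume."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return SUBVOLUMES[h % len(SUBVOLUMES)]

print(place("740123o.fits"))
\end{verbatim}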

\begin{deluxetable}{lll}
\scriptsize
\tablecaption{Data Organization\label{P1-16:T1.20-tbl-1}}
\tablehead{
\colhead{Location} & \colhead{Data Organization} & \colhead{Data Set Type}} 
\startdata
CFHT &file         &heterogeneous \nl
CADC &file         &heterogeneous \nl
SDSS &subdirectory &homogeneous \nl
JAC  &subvolume    &homogeneous \nl
ESO  &file         &heterogeneous \nl
\enddata
\end{deluxetable}

\section{Redundancy}

Some degree of redundancy is a necessity in a massive storage
system. Even if the data can be easily restored from an archive, the
administrative overhead and operational downtime of responding to the
inevitable failures can be impractical unless there is some layer of
protection. The two most common techniques are data duplication and
the use of internally redundant subvolumes.

In data duplication each data element is stored twice. The
higher-level system ensures that each copy is on a different subvolume
and preferably on a different node. In this way the loss of any single
subvolume, and perhaps any single node, will not result in the loss of
any data from the overall collection. This is a simple and effective
method but also a costly one: the storage system must be essentially
twice the size of the data set.
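
In code, the duplication policy reduces to choosing two subvolumes on
distinct nodes for every element. A minimal sketch, with a
hypothetical pool of (node, subvolume) pairs:

\begin{verbatim}
import hashlib

# Hypothetical pool of (node, subvolume) pairs.
POOL = [("node01", "/sub0"), ("node01", "/sub1"),
        ("node02", "/sub0"), ("node03", "/sub0")]

def place_copies(filename):
    """Choose two subvolumes on distinct nodes for the two copies."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    first = POOL[h % len(POOL)]
    others = [p for p in POOL if p[0] != first[0]]  # other nodes only
    second = others[h % len(others)]
    return first, second

print(place_copies("740123o.fits"))
\end{verbatim}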

One alternative is to make the subvolumes internally redundant, for
example by using RAID level 5 arrays. Such an array can tolerate the
loss of any single member disk and has a usable capacity of $c(n-1)$,
where $c$ is the capacity of the smallest disk and $n$ is the number
of member disks. In other words, the fractional cost of redundancy
decreases as the number of disks in the array increases.
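
For example, an array of eight 250-gigabyte disks (an illustrative
size, not one drawn from the survey) has a usable capacity of
\begin{displaymath}
c(n-1) = 250\ \mathrm{GB} \times (8 - 1) = 1750\ \mathrm{GB},
\end{displaymath}
so only $1/n = 12.5$\% of the raw capacity is spent on redundancy,
compared with 50\% under data duplication.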

This tendency of RAID level 5 arrays to be rather large is the basis
of the cost/risk trade-off. A data duplicating system will tend
towards using the smallest possible subvolume (usually a single disk),
whereas an internally redundant subvolume will tend to be as large as
possible (usually eight or more disks). While an internally redundant
subvolume is less likely to fail, if it ever does fail the effect is
more likely to be catastrophic.

The purpose of the storage system is usually the deciding factor. If
it is an archive, as in the case of ESO and CADC, then data
duplication is really the only appropriate method. The other systems
are used for data processing and operate in parallel with an archive,
for which purpose internally redundant subvolumes are adequate.

\begin{deluxetable}{llll}
\scriptsize
\tablecaption{Redundancy and Subvolume Type\label{P1-16:T1.30-tbl-1}}
\tablehead{
\colhead{Location} & \colhead{Redundancy Type} &\colhead{Subvolume Type} & \colhead{Primary Purpose}} 
\startdata
CFHT &internally redundant &RAID 5 &analysis \nl
CADC &data duplication &single disk &archive \nl
SDSS &internally redundant &RAID 5 &analysis \nl
JAC  &internally redundant &RAID 5 &analysis \nl
ESO  &data duplication &single disk &archive \nl
\enddata
\end{deluxetable}

\section{Conclusion}

While the technique of building storage clusters at this scale is
relatively new and the underlying technology is in a state of rapid
change, sound design methodologies can be developed. Existing systems
provide examples of design considerations and a proof of concept for
future systems.

\acknowledgments
%%FO
The following individuals graciously contributed data for the examples
used in this paper: %%FO \\
\begin{list}{}{\leftmargin=0.5em\labelwidth=0em\itemsep=0ex\labelsep=-2em} 
\item[] Dan Yocum, Sloan Digital Sky Survey %%FO \\
\item[] Nick Reese, Joint Astronomy Centre %%FO \\
\item[] Andreas Wicenec, European Southern Observatory %%FO \\
\item[] Severin Gaudet, Canadian Astronomy Data Centre %%FO \\
\end{list} %%FO

%\begin{references}

% Do not place any material after the references section

\end{document}  % Leave intact
