RFC 12: Backends for the Data Access Layer
Author: | Stephan Krause |
---|---|
Created: | 2011-08-31 |
Last Edit: | $Date$ |
Status: | ACCEPTED |
Discussion: | http://eoxserver.org/wiki/DiscussionRfc12 |
This RFC proposes the implementation of different backends that provide common interfaces for data stored in different ways. It describes the first version of the Data Access Layer implementation as well as changes to the Data Integration Layer that are caused by the changes to the data model.
Introduction
RFC 1: An Extensible Software Architecture for EOxServer introduced the Data Access Layer as an abstraction layer for access to different kinds of data storages. These are most notably:
- data stored on the local file system
- data stored on a remote file system that can be accessed using FTP
- data stored in a rasdaman database
The term backend has been coined for the part of the software implementing data access to different storages.
This RFC discusses an architecture for these backends which is based on the extension mechanisms discussed in RFC 2: Extension Mechanism for EOxServer. After the Requirements section, the architecture of the Data Access Layer is presented, starting with the Data Access Layer Data Model, which consists essentially of Storages and Locations.
Furthermore, the necessary changes to the Data Integration Layer are explained. On the one hand, these affect the Data Model, which is altered considerably. On the other hand, new structures (Data Sources and Data Packages) are introduced that provide more flexible solutions for data handling by the Data Integration Layer and the layers that build on it.
Requirements
The requirements are given in the Backends Requirements section and in the description of the Data Access Layer in RFC 1: An Extensible Software Architecture for EOxServer. These state the need for different backends to access local and remote data in different ways and are thus the motivation for this RFC and the respective implementation.
Data Access Layer Data Model
The new database model for the Data Access Layer is shown in the figure below:
The core element of the Data Access Layer data model is the Location. A location designates a piece of data or metadata, actually any object that can be stored in one of the Storage facilities supported. Each backend defines its own subclasses of Location and Storage to represent repositories, databases, directories and objects stored therein.
The database model is embedded in wrappers that add logic to the model and provide common interfaces to access the data and metadata of the objects in the backend. Internally, they make use of the extension mechanism of RFC 2 to find and retrieve the right model records and wrappers.
Last but not least, there is a File Cache for storing files retrieved from remote hosts. The locations of the cache files are stored in the database so EOxServer can keep track of them and implement an intelligent cleanup process.
Storages
The Storage subclasses represent different types of storage facilities. In the database model, only the FTP and rasdaman backends have their own models defined; these contain the information on how to connect to the server. This is not needed for locally mounted file systems, so the local backend does not have a representation in the database.
The wrapper layer constructed on top of the database model on the other hand knows three classes of storages that provide a common interface to access their data:
- LocalStorage which implements access to locally mounted file systems
- FTPStorage which implements access to a remote FTP server
- RasdamanStorage which implements access to a rasdaman database
Each of these storage classes is associated with a certain type of location. The common interface for storages allows retrieving their type and their capabilities. Depending on these capabilities, the storage classes also provide methods for getting a local copy of the data, retrieving the size of an object and scanning a directory for files. At the moment these three methods are implemented by the file-based backends only (LocalStorage and FTPStorage).
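The capability-based interface described above could be sketched roughly as follows. This is a minimal illustration and not the EOxServer implementation: the method names (get_type, get_capabilities, get_local_copy, get_size, scan) and the capability labels are assumptions chosen for this example, and only the local case is worked out.

```python
import fnmatch
import os
import shutil
from abc import ABC, abstractmethod


class Storage(ABC):
    """Hypothetical common interface; all method names are illustrative."""

    @abstractmethod
    def get_type(self):
        """Return a short identifier for the storage type."""

    @abstractmethod
    def get_capabilities(self):
        """Return the names of the optional operations that are supported."""


class LocalStorage(Storage):
    """File-based storage on a locally mounted file system."""

    def get_type(self):
        return "local"

    def get_capabilities(self):
        return frozenset({"local_copy", "size", "scan"})

    def get_local_copy(self, path, target_dir):
        # Copy the file into target_dir and return the path of the copy.
        return shutil.copy(path, target_dir)

    def get_size(self, path):
        # Size of the object in bytes.
        return os.path.getsize(path)

    def scan(self, directory, pattern="*"):
        # Sorted paths of the files directly inside 'directory' whose
        # names match the shell-style pattern.
        return sorted(
            os.path.join(directory, name)
            for name in os.listdir(directory)
            if fnmatch.fnmatch(name, pattern)
            and os.path.isfile(os.path.join(directory, name))
        )


class RasdamanStorage(Storage):
    """Database storage; advertises none of the file-based capabilities."""

    def get_type(self):
        return "rasdaman"

    def get_capabilities(self):
        return frozenset()
```

In such a design a caller would check get_capabilities() before invoking one of the optional methods, so that a storage without file semantics never has to provide stubs for them.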
Locations
Locations represent the points at which single objects on a storage facility can be accessed. At the moment, three types of locations, corresponding to the three storage types, are implemented:
- LocalPath defines a path on the locally mounted file system
- RemotePath defines a path on a remote server reachable via FTP
- RasdamanLocation defines a collection (database table) and oid corresponding to a single rasdaman array
Locations share a common interface that is closely related to the storage interface. So, given the storage capabilities, it is possible to fetch a local copy, retrieve the size of an object and scan the location for files. The LocationWrapper subclasses extend these interfaces to make storage-specific location information (e.g. host name for remote storages) accessible.
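To make the three location types more concrete, the following sketch models them as small Python classes. It is only an illustration under assumptions: the attribute and method names (path, host, port, collection, oid, get_type, get_host, get_size) are hypothetical, the numeric oid is an assumption, and the real wrappers are backed by database records rather than plain objects.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalPath:
    """Hypothetical location on a locally mounted file system."""
    path: str

    def get_type(self):
        return "local"

    def get_size(self):
        # Part of the shared interface; here answered by the file system.
        return os.path.getsize(self.path)


@dataclass(frozen=True)
class RemotePath:
    """Hypothetical location on a server reachable via FTP."""
    host: str
    path: str
    port: int = 21  # default FTP control port

    def get_type(self):
        return "ftp"

    def get_host(self):
        # Storage-specific accessor that only remote locations offer.
        return self.host


@dataclass(frozen=True)
class RasdamanLocation:
    """Hypothetical location of a single rasdaman array."""
    collection: str
    oid: int  # assumed to be numeric in this sketch

    def get_type(self):
        return "rasdaman"
```

The shared get_type() method stands for the common interface, while get_host() shows the kind of storage-specific information a subclass adds.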
File Cache
With the CacheFileWrapper class, the Data Access Layer currently provides a very simple file cache implementation that caches remote files retrieved via FTP. The cache keeps track of the files it contains using the CacheFile model in the database.
So far, no synchronization for data access is implemented, i.e. threads that are processing requests cannot lock a cache file in order to prevent it from being removed by another thread or process (e.g. a periodic cleanup process). This is foreseen for the future.
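The bookkeeping idea behind such a cache can be sketched as follows. This is a hypothetical, simplified model: an in-memory dictionary stands in for the CacheFile database model, the class and method names are invented for the example, and, like the implementation described above, it performs no locking.

```python
import os
import shutil
import time


class SimpleFileCache:
    """Hypothetical cache; a dict stands in for the CacheFile model."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.entries = {}  # remote identifier -> (cached path, time stored)

    def add(self, remote_id, source_path, now=None):
        # Copy an already retrieved file into the cache and record it.
        # Files with equal base names would overwrite each other here.
        target = os.path.join(self.cache_dir, os.path.basename(source_path))
        shutil.copyfile(source_path, target)
        stored = time.time() if now is None else now
        self.entries[remote_id] = (target, stored)
        return target

    def get(self, remote_id):
        # Return the cached path, or None if the file is not cached.
        entry = self.entries.get(remote_id)
        return entry[0] if entry else None

    def cleanup(self, max_age, now=None):
        # Remove all files stored more than max_age seconds ago and
        # return the identifiers of the removed entries.
        now = time.time() if now is None else now
        removed = []
        for remote_id, (path, stored) in list(self.entries.items()):
            if now - stored > max_age:
                if os.path.exists(path):
                    os.remove(path)
                del self.entries[remote_id]
                removed.append(remote_id)
        return removed
```

Because cleanup() deletes files without consulting their users, a request thread that obtained a path from get() may find the file gone, which is exactly the gap that the missing synchronization leaves open.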
Changes to Data Integration Layer Data Model
In order to use the new possibilities brought by the implementation of the Data Access Layer, the Data Integration Layer had to be revised and changed considerably. Up until now there has been a strong link between the type of coverage and the way it was stored. Datasets had to be stored as files in the local file system whereas mosaics were stored in tile indexes. This strong link had to be weakened to allow for new combinations.
The solution is a compromise between flexibility and simplicity. Although one can think of many more combinations, we introduce three classes of so-called DataPackage objects. A data package combines a data resource with an accompanying metadata resource. Both resources are referred to by Location subclass instances. The three data package classes are:
- LocalDataPackage which combines a local data file with a local metadata file
- RemoteDataPackage which combines a remote data file with a remote metadata file (both reachable via FTP); it contains a CacheFile reference for data in the local cache
- RasdamanDataPackage which combines a rasdaman array with a local metadata file
Furthermore, the concept of data directories, in which datasets are looked up automatically, had to be revised in order to use the new capabilities of the Data Access Layer. Data directories were replaced by a concept called data sources, which includes local and remote repositories. The DataSource model combines a local or remote Location with a search pattern for dataset names. Automatic lookup of rasdaman arrays is not foreseen at the moment.
Like most database objects, data packages and data sources are accessible using wrappers that provide a common interface and add application logic to the data model.
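The pairing of a data resource with a metadata resource can be illustrated with a small sketch. The classes below are hypothetical stand-ins, not the EOxServer models: locations are reduced to a kind and a reference string, and the package class is derived from the three combinations listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Location:
    """Hypothetical stand-in for a Location subclass instance."""
    kind: str        # "local", "ftp" or "rasdaman"
    reference: str   # a path, or a description of collection and oid


@dataclass
class DataPackage:
    """Pairs a data resource with its accompanying metadata resource."""
    data: Location
    metadata: Location
    cache_file: Optional[str] = None  # only meaningful for remote packages

    def package_type(self):
        # Only the three combinations named in this RFC are accepted.
        combinations = {
            ("local", "local"): "local",
            ("ftp", "ftp"): "remote",
            ("rasdaman", "local"): "rasdaman",
        }
        try:
            return combinations[(self.data.kind, self.metadata.kind)]
        except KeyError:
            raise ValueError("unsupported combination of locations")
```

Restricting the lookup table to three entries mirrors the stated compromise between flexibility and simplicity: further combinations would only require additional entries.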
Data Packages
The DataPackageInterface defines methods for high-level and low-level data access and for metadata extraction from the underlying datasets. It is implemented by wrappers for local, remote and rasdaman data packages (LocalDataPackageWrapper, RemoteDataPackageWrapper and RasdamanDataPackageWrapper respectively).
The implementation of the data package wrappers is based on the GDAL library and its Python binding for data access as well as for geospatial metadata extraction. It contains an open() method that returns a GDAL dataset providing a uniform interface for raster data from different sources and formats. For low-level data access, a getGDALDatasetIdentifier() method is provided which returns the correct connection string for GDAL and can thus be used to configure MapServer.
Geospatial metadata is read from the datasets themselves at the moment. Note that this is not possible for rasdaman arrays so far, so automatic detection and ingestion of these is not enabled.
EO Metadata is read from the accompanying metadata file and translated into the internal data model of EOxServer. The existing metadata extraction classes have been revised in order to comply with the extensible architecture presented in RFC 1 and RFC 2.
Data Sources
The wrappers for data sources (DataSourceWrapper) provide the capability to search a local or remote location for datasets. At the moment only file lookup is implemented whereas automatic rasdaman array lookup has been omitted. This is mostly due to the fact that rasdaman arrays do not contain geospatial metadata and a separate mechanism has to be found to retrieve this vital information.
The wrapper implementations provide a detect method that returns a list of DataPackageWrapper objects with which coverages are initialized (using the geospatial and EO metadata read from the data package).
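A file lookup of this kind might look like the sketch below for the local case. It is a hypothetical illustration: the function signature is invented, the result is a list of plain path pairs rather than DataPackageWrapper objects, and the rule that the metadata file shares the base name of the data file with an .xml extension is an assumption made for the example.

```python
import fnmatch
import os


def detect(directory, search_pattern, metadata_suffix=".xml"):
    """Return (data file, metadata file) pairs found in a local directory.

    Hypothetical sketch: the pairing rule (same base name, different
    extension) is an assumption of this example.
    """
    packages = []
    for name in sorted(os.listdir(directory)):
        # Apply the search pattern of the data source to the file name.
        if not fnmatch.fnmatch(name, search_pattern):
            continue
        data_path = os.path.join(directory, name)
        metadata_path = os.path.splitext(data_path)[0] + metadata_suffix
        # Skip data files that have no accompanying metadata file.
        if os.path.isfile(data_path) and os.path.isfile(metadata_path):
            packages.append((data_path, metadata_path))
    return packages
```

A remote variant would have to list the directory through the FTP storage instead of os.listdir, but the pattern matching and pairing steps could stay the same.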
Ingestion and Synchronization
The Synchronizer implementation in eoxserver.resources.coverages.synchronize has to be revised according to the changes in the Data Access Layer and Data Integration Layer.
The implementations for containers, i.e. Rectified Stitched Mosaics and Dataset Series, shall retrieve the data sources associated with a coverage and use their detect method to obtain the data packages included in them. Rectified or Referenceable Datasets are constructed from these. The interfaces of both container implementations should not change.
The interface of RectifiedDatasetSynchronizer, on the other hand, will have to change in order to allow for remote files to be ingested. In detail, the create() and update() methods will no longer expect a file name, but a location wrapper instance (either LocalPathWrapper or RemotePathWrapper). These can be generated by a call to the LocationFactory like this:
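    from eoxserver.core.system import System

    # Bind the location factory through the registry.
    factory = System.getRegistry().bind("backends.factories.LocationFactory")

    # Create a location wrapper for a file on the local file system.
    location = factory.create(
        type="local",
        path="<path/to/file>"
    )
    ...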
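How a factory can dispatch on the type keyword may be easier to see in a self-contained sketch. The class below is hypothetical and does not reproduce the EOxServer LocationFactory or its registry: the register method, the dictionary-based locations and the host keyword for the "ftp" type are all assumptions made for illustration.

```python
class LocationFactorySketch:
    """Hypothetical factory; not the EOxServer LocationFactory itself."""

    def __init__(self):
        self._builders = {}

    def register(self, type_name, builder):
        # Associate a type name with a callable that builds a location.
        self._builders[type_name] = builder

    def create(self, **kwargs):
        # 'type' selects the builder; all other keywords are passed on.
        type_name = kwargs.pop("type")
        try:
            builder = self._builders[type_name]
        except KeyError:
            raise ValueError("unknown location type: %r" % type_name)
        return builder(**kwargs)


factory = LocationFactorySketch()
factory.register("local", lambda path: {"type": "local", "path": path})
factory.register(
    "ftp",
    lambda host, path: {"type": "ftp", "host": host, "path": path},
)

# Same calling convention as in the snippet above, with an example path.
location = factory.create(type="local", path="/tmp/example.tif")
```

Keeping the type-specific keywords out of the factory itself means that a new backend only has to register its own builder.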
Voting History
Motion: | To accept RFC 12 |
---|---|
Voting Start: | 2011-09-06 |
Voting End: | 2011-09-15 |
Result: | +5 for ACCEPTED (including 1 +0) |
Traceability
Requirements: | N/A |
---|---|
Tickets: | N/A |