Overview¶
This section gives an overview of the components of the Data Library system.
Infrastructure¶
The Data Library ansible playbook currently targets CentOS 7 and CentOS Stream 9. IRI does not test or support the software on other platforms. CentOS 7 has reached its end of life, and installing on it is no longer recommended.
The Data Library services run under Docker and are managed with docker compose. Most services log to stdout, which the Docker daemon forwards to journald.
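Because container output ends up in journald, it can be read with journalctl. The sketch below wraps one such query. It assumes Docker's journald logging driver, which tags each forwarded entry with a CONTAINER_NAME field. The function name service_logs is hypothetical, and the real container names depend on how docker compose names them on a given server (docker ps lists them).

```shell
# Sketch: print the journald entries written by one container.
# Assumes Docker's journald logging driver, which sets the
# CONTAINER_NAME field on every entry it forwards.
service_logs() {
    # $1: container name, as reported by "docker ps"
    journalctl --no-pager CONTAINER_NAME="$1"
}
```

For example, service_logs datalib-squid-1 would print the log of a container with that (hypothetical) name. Reading the journal may require running the command with sudo.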
Configuration management using ansible¶
Installation and configuration of the Data Library software is automated using ansible, a configuration management tool. Automating the installation and configuration process has the following advantages over doing it by hand:
Convenience: an automated installation process makes setting up a new Data Library site quicker.
Repeatability: having relevant system configuration details documented in executable form makes it easier to reproduce the same configuration on another server, e.g. after a hardware failure or upgrade.
These advantages could also be achieved by automating the installation with a shell script, but ansible brings further benefits. The same ansible “playbook” (configuration management script) that performs the initial software installation can also be used afterwards to manage the server’s configuration.
When the server configuration needs to change, those changes can be described in the playbook and checked into version control, so there is a record of what was changed and when. Running the playbook then applies the changes to the server.
An ansible playbook can be run in “check mode”. In this mode, the playbook makes no changes, but merely reports any differences between the server’s configuration and the desired state. Knowing what changes the tool will make before it makes them gives the administrator more confidence in the tool and helps avoid some kinds of configuration errors.
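As a minimal sketch of check mode, the function below runs a playbook with the --check option, plus --diff so that differences in changed files are printed. The function name preview_changes is hypothetical, and the inventory and playbook file names are site-specific and not given in this section.

```shell
# Sketch: preview what a playbook would change, without changing anything.
# --check selects check mode; --diff additionally prints the differences
# for files and templates that would change.
preview_changes() {
    # $1: inventory file, $2: playbook file (both names are site-specific)
    ansible-playbook --check --diff -i "$1" "$2"
}
```

A call such as preview_changes inventory.cfg playbook.yaml (hypothetical file names) reports the pending changes. Running the same ansible-playbook command without --check applies them.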
Software components¶
The system is composed of four containerized services.
squid is configured as both a forward and a reverse proxy. It is the only service that is directly reachable from the public network interface (the others listen only on a docker virtual network). Docker forwards the host’s port 80 to squid, which then forwards each request to one of the three other services depending on the URL path.
ingrid is a web application for browsing, analyzing, and visualizing climate data in a browser.
PostgreSQL is used by ingrid to store relational data. It is used with the PostGIS plugin to store GIS shapes, particularly boundaries of administrative regions (countries, states, cities, etc.) and bodies of water.
Apache httpd serves web applications called maprooms. Whereas ingrid is a general-purpose tool that lets the user perform arbitrary calculations, each maproom is tailored to a particular dataset and a particular application. Maprooms typically embed interactive maps that are generated by ingrid.
Squid routes the root URL / and URLs that begin with /maproom to httpd; all other URLs are routed to ingrid.
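The routing rule can be modeled in a few lines of shell. This is only an illustration of the rule as stated above, not the squid configuration itself, and the function name backend_for_path is hypothetical.

```shell
# Sketch of the routing rule described above. This is not squid
# configuration; it only models which backend receives a URL path.
backend_for_path() {
    case "$1" in
        /|/maproom*) echo httpd ;;   # root URL and anything starting with /maproom
        *)           echo ingrid ;;  # everything else
    esac
}
```

With this model, backend_for_path /maproom/Climate/ prints httpd, while a path outside /maproom, such as the illustrative /SOURCES/, prints ingrid.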
User groups¶
Two kinds of users will need accounts (unix logins) on a Data Library server:
Administrators are responsible for system configuration, software updates, backups, and user support. They are members of the wheel group and thus have permission to assume root privileges using sudo.
Authors are responsible for adding datasets to the Data Library, extending existing ones, and creating maprooms. By virtue of being members of the datag group, they have permission to add data files to the directory read by ingrid, and to execute SQL queries that modify ingrid’s database. They don’t have general sudo privileges, but they are granted targeted permission to run certain scripts in /usr/local/bin as root.
User accounts for authors are managed by the ansible playbook, but administrator accounts are not.
Important file and directory paths¶
Ingrid datasets are defined in a data catalog, which is developed in a git repository traditionally named dlentries or dlentries_countryname, e.g. dlentries_madagascar. Authors cannot edit ingrid’s copy of the data catalog directly; to make changes, they edit the catalog in another location, push their changes to their git host, and then run /usr/local/bin/update_datalib on the server, which downloads the latest catalog from the git host.
Catalog entries refer to data files, which should be located in subdirectories of /data/datalib/data. Members of the datag group have permission to write to that directory, and ingrid has permission to read from it.
Each member of the datag group also has their own personal data catalog, located in /data/datalib/home/<username>/DataCatalog. Whereas authors don’t have permission to edit the main data catalog except by pushing changes to the git host, each author can edit their own personal data catalog directly. This is useful for testing catalog entries during development.
Catalog entries in the main catalog should refer to data files located in /data/datalib/data. All members of the datag group have permission to modify the files in that directory. Catalog entries in personal data catalogs can refer to data files in /data/datalib/data or /data/datalib/home/<username>.
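The location rules for data files can be summarized as a small check. The sketch below is illustrative only: the function name data_path_allowed and its arguments are hypothetical, and it compares path prefixes as plain text, so it does not resolve symbolic links or .. components.

```shell
# Sketch: check whether a data file path is an allowed location for a
# catalog entry, following the rules described above.
# $1: "main" or "personal"; $2: username (used for personal catalogs);
# $3: absolute path of the data file.
data_path_allowed() {
    case "$3" in
        /data/datalib/data/*) return 0 ;;   # shared data directory: always allowed
    esac
    if [ "$1" = "personal" ]; then
        case "$3" in
            "/data/datalib/home/$2"/*) return 0 ;;   # the author's own directory
        esac
    fi
    return 1
}
```

The function returns success for a file under /data/datalib/data in either kind of catalog, and for a file under an author's own /data/datalib/home/<username> directory only when the entry is in that author's personal catalog.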
Configuration files for the Data Library services are located under /usr/local/datalib. In particular, the docker-compose file that defines the Data Library services is /usr/local/datalib/docker-compose.yaml, so to use the docker-compose command to manage services, an administrator should first cd /usr/local/datalib and then run sudo docker-compose. Note that administrators should not edit most configuration files directly. The configuration is managed by ansible, so changes should be made by modifying the ansible configuration and then applying it using ansible-playbook. This process will be explained in more detail below.
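The two-step invocation described above can be wrapped in a small helper so that docker-compose always runs from the right directory. This is a convenience sketch, not part of the Data Library software; the function name datalib_compose and the DATALIB_DIR override variable are hypothetical.

```shell
# Sketch: run docker-compose from the directory that holds the
# Data Library compose file, so that it addresses the right services.
# DATALIB_DIR is a hypothetical override; by default the path named
# above is used.
datalib_compose() {
    # The subshell keeps the caller's working directory unchanged.
    (cd "${DATALIB_DIR:-/usr/local/datalib}" && sudo docker-compose "$@")
}
```

For example, datalib_compose ps lists the state of the services defined in the compose file.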
In the typical configuration, /data is a mount point for a large storage volume that is distinct from the root volume. In addition to /data/datalib, which was mentioned above, there is /data/docker. /var/lib/docker is a symbolic link to /data/docker, so data created by the Docker daemon and containers, such as container images and volumes, are stored in that directory.