Overview

This section gives an overview of the components of the Data Library system.

Infrastructure

The Data Library ansible playbook currently targets CentOS 7. IRI does not test or support the software on other platforms. Plans to migrate to CentOS 8 have been abandoned because Red Hat has ended support for it prematurely. Support for CentOS 7 is scheduled to continue through June 2024. If you have thoughts about what we should target as the next platform after CentOS 7, please communicate them to help@iri.columbia.edu.

The Data Library services run under Docker, using docker-compose. Most services log to stdout, which the Docker daemon forwards to journald.

Configuration management using ansible

Installation and configuration of the Data Library software is automated using ansible, a configuration management tool. Automating the installation and configuration process has the following advantages over doing it by hand:

  • Convenience: an automated installation process makes setting up a new Data Library site quicker.

  • Repeatability: having relevant system configuration details documented in executable form makes it easier to reproduce the same configuration on another server, e.g. after a hardware failure or upgrade.

The above advantages could be achieved by automating the installation process with a shell script, but using ansible instead of a shell script brings further advantages. The same ansible “playbook” (configuration management script) that performs the initial software installation can also be used subsequently to manage the server’s configuration.

  • When the server configuration needs to change, those changes can be described in the playbook and checked into version control, so there is a record of what was changed and when. Running the playbook then applies the changes to the server.

  • An ansible playbook can be run in “check mode”. In this mode, the playbook makes no changes, but merely reports any differences between the server’s configuration and the desired state. Knowing what changes the tool will make before it makes them gives the administrator more confidence in the tool and helps avoid some kinds of configuration errors.
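The idea behind check mode can be illustrated with a small Python sketch. This is a toy model, not how ansible is implemented: it assumes a server's configuration can be represented as a flat dictionary of settings. The function names plan_changes and apply_changes and the setting names are hypothetical.

```python
def plan_changes(desired, actual):
    """Return {key: (current_value, desired_value)} for every setting that differs.

    This corresponds to the "check mode" step: it only reports
    differences and never modifies `actual`.
    """
    changes = {}
    for key, wanted in desired.items():
        current = actual.get(key)
        if current != wanted:
            changes[key] = (current, wanted)
    return changes


def apply_changes(desired, actual):
    """Bring `actual` into the desired state and return what was changed."""
    changes = plan_changes(desired, actual)
    for key, (_, wanted) in changes.items():
        actual[key] = wanted
    return changes


# Hypothetical settings, for illustration only.
desired = {"timezone": "UTC", "http_port": 80}
actual = {"timezone": "UTC", "http_port": 8080}

print(plan_changes(desired, actual))  # report only; `actual` is untouched
apply_changes(desired, actual)
print(plan_changes(desired, actual))  # nothing left to change
```

With the real tool, check mode is selected by passing the --check option to ansible-playbook.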

Software components

The system is composed of four containerized services.

  • squid is configured as both a forward and a reverse proxy. This is the only service that is directly reachable from the public network interface (the others listen only on a docker virtual network). Docker forwards the host’s port 80 to squid, which then forwards each request to one of the three other services depending on the URL path.

  • ingrid is a web application for browsing, analyzing, and visualizing climate data in a browser.

  • PostgreSQL is used by ingrid to store relational data. It is used with the PostGIS plugin to store GIS shapes, particularly boundaries of administrative regions (countries, states, cities, etc.) and bodies of water.

  • Apache httpd serves web applications called maprooms. Whereas ingrid is a general-purpose tool that lets the user perform arbitrary calculations, each maproom is tailored to a particular dataset and a particular application. Maprooms typically embed interactive maps that are generated by ingrid.

Squid routes the root URL / and URLs that begin with /maproom to httpd; all other URLs are routed to ingrid.
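The routing rule can be summarized in a few lines of Python. This is only a model of the rule as stated above, not the actual squid configuration, which is not shown here. It reads “begin with /maproom” as a plain string prefix, and the function name route and the example paths are hypothetical.

```python
def route(path):
    """Return the name of the service that would handle a URL path."""
    # The root URL and anything starting with /maproom go to Apache httpd.
    if path == "/" or path.startswith("/maproom"):
        return "httpd"
    # Every other path is handled by ingrid.
    return "ingrid"


# Hypothetical example paths.
for example in ["/", "/maproom/Climatology/index.html", "/SOURCES/"]:
    print(example, "->", route(example))
```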

User groups

Two kinds of users will need accounts (unix logins) on a Data Library server:

  • Administrators are responsible for system configuration, software updates, backups, and user support. They are members of the wheel group and thus have permission to assume root privileges using sudo.

  • Authors are responsible for adding datasets to the Data Library, extending existing datasets, and creating maprooms. By virtue of being members of the datag group, they have permission to add data files to the directory read by ingrid, and to execute SQL queries that modify ingrid’s database. They don’t have general sudo privileges, but they are granted targeted permission to run certain scripts in /usr/local/bin as root.

User accounts for authors are managed by the ansible playbook, but accounts for administrators are not.

Important file and directory paths

Ingrid datasets are defined in a data catalog, which is developed in a git repository traditionally named dlentries or dlentries_countryname, e.g. dlentries_madagascar. Authors cannot edit ingrid’s copy of the data catalog directly; to make changes, they edit the catalog in another location, push their changes to their git host, and then run /usr/local/bin/update_datalib on the server, which downloads the latest catalog from the git host.

Catalog entries refer to data files, which should be located in subdirectories of /data/datalib/data. Members of the datag group have permission to write to that directory, and ingrid has permission to read from it.

Each member of the datag group also has their own personal data catalog, located in /data/datalib/home/<username>/DataCatalog. Whereas authors don’t have permission to edit the main data catalog except by pushing changes to the git host, each author can edit their own personal data catalog directly. This is useful for testing catalog entries during development.

Catalog entries in the main catalog should refer to data files located in /data/datalib/data, which all members of the datag group have permission to modify. Catalog entries in a personal data catalog can refer to data files in either /data/datalib/data or /data/datalib/home/<username>.
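The path conventions above can be expressed as a small check. The following Python sketch is one interpretation of those conventions, not a tool that ships with the Data Library. The function names, the username alice, and the file names are hypothetical.

```python
from pathlib import PurePosixPath

SHARED_DATA = PurePosixPath("/data/datalib/data")
HOME_ROOT = PurePosixPath("/data/datalib/home")


def _is_under(path, root):
    """True if `path` is `root` itself or lies somewhere below it.

    Note: ".." components are not resolved, so callers should pass
    normalized absolute paths.
    """
    return path == root or root in path.parents


def allowed_data_path(data_file, catalog_owner=None):
    """Check a data file path against the conventions described above.

    catalog_owner is None for the main catalog, or a username for that
    user's personal data catalog.
    """
    path = PurePosixPath(data_file)
    # Both kinds of catalog may refer to the shared data directory.
    if _is_under(path, SHARED_DATA):
        return True
    # Only a personal catalog may refer to its owner's home directory.
    if catalog_owner is not None:
        return _is_under(path, HOME_ROOT / catalog_owner)
    return False


# Hypothetical examples.
print(allowed_data_path("/data/datalib/data/rainfall/monthly.nc"))
print(allowed_data_path("/data/datalib/home/alice/test.nc"))
print(allowed_data_path("/data/datalib/home/alice/test.nc", catalog_owner="alice"))
```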

Configuration files for the Data Library services are located under /usr/local/datalib. In particular, the docker-compose file that defines the Data Library services is /usr/local/datalib/docker-compose.yaml. To use the docker-compose command to manage services, an administrator should therefore first cd /usr/local/datalib and then run sudo docker-compose. Note that administrators should not edit most configuration files directly. The configuration is managed by ansible, so changes should be made by modifying the ansible configuration and then applying it using ansible-playbook. This process will be explained in more detail below.

In the typical configuration, /data is a mount point for a large storage volume that is distinct from the root volume. In addition to /data/datalib, which was mentioned above, there is /data/docker. /var/lib/docker is a symbolic link to /data/docker, so data created by the Docker daemon and containers, such as container images and volumes, are stored under /data/docker rather than on the root volume.