BDRC Airflow Development

This document describes the different elements of airflow-docker and how they relate.

Architecture of this module

The principal components of this module are:

  • the DAG code that runs the operations.

  • facilities for marshalling code blocks.

  • test utilities.

Development and Test Before Docker

Before moving the DAGs into a Docker environment, you can test your airflow DAGs and the system in two other modes:

  • debugging (in an IDE)

  • on a local web server, to test parallelism and scheduling

Debugging and initial test

  • create a development environment with pyenv that includes the apache and the BDRC libraries (see requirements.txt).

  • In your DAG code, include a __main__ block that runs your_dag.test() (see dags/FileWatcherDebagSync.py for an example). You can run this under your IDE’s debugger. (Watch your PATH, because the shell that syncs is imperfect.)

Local airflow services test

The IDE environment above doesn’t test some parallelism that you might need (e.g., can several instances of the same DAG run in parallel?). To do this, you need to run airflow locally. Happily, this is easy. airflow-docker/local-airflow.sh provides shorthands for:

  • re-initializing the airflow database (good for cleaning out old records)

  • starting and stopping the airflow webserver and scheduler (these are necessarily separate services, but are all you’ll need to test locally)

  • creating the admin user you’ll need.

local-airflow.sh is largely self-documenting and easy to read. The bare minimum to get up and running the first time (or to clear out a clutter of old test runs) is:

./local-airflow.sh -r
./local-airflow.sh -a admin admin some@email.address
./local-airflow.sh -w -u
./local-airflow.sh -s -u

and to stop:

./local-airflow.sh -w -d
./local-airflow.sh -s -d

Eager beaver improvers are welcome to coalesce the stop actions into one command.

After starting the services, airflow will look in ~/airflow/dags for DAGs to run. You can copy your DAGs there, and they will be available in the UI. I make and test changes in my IDE, under source control, and copy what I need into ~/airflow/dags as needed.

Building the image

Note

bdrc-docker-compose.yml has a dual role in this process. It is not only a standard compose file for running the image, but is also used to build it. Building through the compose file frees you from having to worry about changing directories to access the material, and keeps the build process out of the development repo.

bdrc-docker.sh is the entry point for building the bdrc-airflow image, which docker compose runs later. It:

  1. Sets up a COMPOSE_BUILD_DIR

  2. if -rebuild is given, wipes out the COMPOSE_BUILD_DIR; otherwise adds to it.

  3. Adds these elements to it:

  4. Copies in the archive-ops/scripts/syncAnywhere/deployment scripts (syncOneWork.sh)

  5. Merges two sets of requirements files and invokes RUN pip install -r requirements.txt in the image. This installs all the python libraries that both the DAG and the sync scripts require.

  6. Brings down the audit tool install image from github and installs it.

  7. Adds the audit tool configuration to the image.

  8. Exports environment variables for the docker compose build step. These are referenced in the bdrc-docker-compose.yml file:

Tip

It is really important to be careful about .config. We could conceivably bind mount ~service/.config into the container (since the container runs under the host’s service uid; see the scheduler:....user: clause in bdrc-docker-compose.yml), but that brings in the whole tree and is fragile. So copying the material from .config is a manual, selective operation. As the range of operations in airflow-docker expands, images may need to be built that need more entries from .config (e.g., Google Books). For now, just copy bdrc/auditTool into a config dir, and give that dir as the --config_dir argument. After the build is complete, it can be deleted, but it should be preserved for subsequent builds.
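The selective copy described in the tip might look like the following sketch. The source tree here is a temporary stand-in for ~service/.config, and the property file name is illustrative:

```shell
# Stage only bdrc/auditTool from a .config tree into a build config dir.
SRC=$(mktemp -d)                      # stand-in for ~service/.config
mkdir -p "$SRC/bdrc/auditTool"
echo "property=value" > "$SRC/bdrc/auditTool/audit.properties"   # illustrative content

BUILD_CONFIG=$(mktemp -d)             # the dir later passed as --config_dir
mkdir -p "$BUILD_CONFIG/bdrc"
cp -r "$SRC/bdrc/auditTool" "$BUILD_CONFIG/bdrc/"
ls "$BUILD_CONFIG/bdrc/auditTool"
```

bdrc-docker.sh would then be given this directory as its --config_dir argument.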

In bdrc-docker.sh

# in the bdrc-docker.sh:
export DAG_REQUIREMENTS_DEFAULT="./StagingGlacierProcess-requirements.txt"
export COMPOSE_AIRFLOW_IMAGE=bdrc-airflow
export COMPOSE_BDRC_DOCKER=bdrc-docker-compose.yml
export COMPOSE_BDRC_DOCKERFILE=Dockerfile-bdrc
export BIN=bin
export AUDIT_HOME=
export BUILD_CONFIG_ROOT=.config

These are read by bdrc-docker-compose.yml to build the image:

#--------------------------------------------
# Referenced in bdrc-docker-compose.yml:
any-name:
  build:
    context: ${COMPOSE_BUILD_DIR}
    dockerfile: ${COMPOSE_BDRC_AIRFLOW_DOCKERFILE:-Dockerfile-bdrc}
    args:
      SYNC_SCRIPTS_HOME: ${BIN}
      PY_REQS: ${COMPOSE_PY_REQS}
      CONFIG_ROOT: ${BUILD_CONFIG_ROOT}

Note especially the args: clause above. These are exported into Dockerfile-bdrc to build the image. Here are some examples of how Dockerfile-bdrc uses them:

ARG SYNC_SCRIPTS_HOME
ARG PY_REQS
ARG CONFIG_ROOT
.....
ADD $SYNC_SCRIPTS_HOME bin
ADD $PY_REQS .
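Step 5 of the build (merging requirement files) can be sketched roughly as follows. The input file names are illustrative, not necessarily those bdrc-docker.sh actually uses:

```shell
# Create two example requirement sets (stand-ins for the DAG and sync lists).
printf 'boto3\nrequests\n' > dag-requirements.txt
printf 'requests\npyyaml\n' > sync-requirements.txt

# Merge and deduplicate into the requirements.txt that the image build installs.
sort -u dag-requirements.txt sync-requirements.txt > requirements.txt
cat requirements.txt
# -> boto3, pyyaml, requests (one line each; the duplicate collapsed)
```

Inside the image, the Dockerfile then runs RUN pip install -r requirements.txt against the merged list.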

Building the container

The other purpose of bdrc-docker-compose.yml is to guide the run-time execution of the bdrc-airflow image. The deploy.sh script sets this up. It:

  1. Creates a compose build directory (the --dest argument)

  2. Copies the bdrc-docker-compose.yml file to the compose build directory/docker-compose.yaml (for normalization).

  3. Creates useful folders in the --dest directory:

    • logs for the logs

    • dags for the DAGs

    • plugins for the plugins (none used)

    • processing for the logs

    • data for working data (most usually, downloaded archives)

  4. Populates secrets - See Docker concepts

  5. Populates the .env file, the default **and only** external source for the environment available to the docker compose command. .env is the source for resolving variables in the docker-compose.yaml file.
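The folder creation in step 3 amounts to something like this sketch, using a temporary directory in place of the real --dest argument:

```shell
DEST=$(mktemp -d)     # stand-in for the --dest directory
for d in logs dags plugins processing data; do
  mkdir -p "$DEST/$d"
done
ls "$DEST"
```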

.env fragment:

COMPOSE_PY_REQS=
BIN=
ARCH_ROOT=/mnt
... # other variables
SYNC_ACCESS_UID=1001

references in bdrc-docker-compose.yml:

scheduler:
 ...
  user: ${SYNC_ACCESS_UID}
  ...
    - ${ARCH_ROOT:-.}/AO-staging-Incoming/bag-download:/home/airflow/bdrc/data

Note

The - ${ARCH_ROOT:-.}/AO-staging-Incoming entry uses standard bash variable resolution: if ARCH_ROOT is not set, . (the current directory) is used. This is a common pattern in the docker-compose.yaml file.
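The ${VAR:-default} expansion is easy to check in any shell:

```shell
unset ARCH_ROOT
echo "${ARCH_ROOT:-.}/AO-staging-Incoming/bag-download"
# -> ./AO-staging-Incoming/bag-download

ARCH_ROOT=/mnt
echo "${ARCH_ROOT:-.}/AO-staging-Incoming/bag-download"
# -> /mnt/AO-staging-Incoming/bag-download
```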

From the --dest dir, you can then control the docker compose with docker compose commands.

Configuring Dev/Test and Production Environments

config invariant:

The item referred to does not have any differences between dev/test and production.

What you can skip

Building the docker image and the container is config invariant. Even though bdrc-docker.sh adds in BDRC code, the variables that determine the dev or production environment are all configured at run time (see airflow-docker/dags/glacier_staging_to_sync.py:sync_debagged for the implementation).

Patterns

The general pattern in the code is to specify global and environment variable variants:

_DEV_THING="Howdy"
_PROD_THING="Folks"
# ...
THING=${_DEV_THING}
# THING=${_PROD_THING}

In some cases, THING appears as MY_THING.

Things to change

There are two locations that specify a dev/test or production environment. These are all in airflow-docker:

deploy.sh

  • Change the SYNC_ACCESS_UID to the current value.

dags/glacier_staging_to_sync.py

  • Change the MY_DB global to the current value.

Tip

deploy.sh writes the changed environment variables to compose_build_dir/.env. You can change these values in .env and simply run docker compose down && docker compose up -d to update them.
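For example, a SYNC_ACCESS_UID change could be scripted against the deployed .env. The file contents and the new uid below are illustrative:

```shell
# A stand-in for compose_build_dir/.env:
printf 'SYNC_ACCESS_UID=1001\nARCH_ROOT=/mnt\n' > .env.example

# Rewrite the uid, leaving everything else untouched.
sed 's/^SYNC_ACCESS_UID=.*/SYNC_ACCESS_UID=1002/' .env.example > .env.updated
cat .env.updated
```

After updating the real .env, docker compose down && docker compose up -d picks up the new value.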

The MY_DB global is used in the sync_debagged function to determine the database to use. To update it, replace the compose_build_dir/dags/glacier_staging_to_sync.py file with the new version. You may have to check the auto update settings in the airflow UI to be sure this takes effect.


What is actually happening

All this work supports essentially four functions, which comprise the process. The process container is an airflow DAG named sqs_scheduled_dag. It appears in the airflow UI (https://sattva:8089) as sqs_scheduled_dag.

_images/Dag_view.png

The DAG contains four tasks, which run sequentially. Their relationship is defined quite directly in the code, using the airflow TaskFlow API:

msgs = get_restored_object_messages()
downloads = download_from_messages(msgs)
to_sync = debag_downloads(downloads)
sync_debagged(to_sync)

In the Airflow UI, their relationship is shown in the task graph:

_images/Task-graph.png

The actions of the scripts are mostly straightforward Python, but there are two airflow-specific elements worth noting:

Retrying when there is no data

The get_restored_object_messages task will retry if there are no messages. This is shown in the task graph above: the task is labeled ‘up-for-retry’. The retry behavior is given as a parameter to the task’s decorator. This is the only task that retries on failure, as it is the only one expected to fail routinely, when there are no object messages to retrieve.
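Conceptually, the retry-until-data behavior is just a bounded polling loop, sketched below. The counts are illustrative, and the real mechanism is airflow's retry parameter on the task decorator, not hand-rolled shell:

```shell
attempt=0
max_attempts=3
data_ready() { [ "$attempt" -ge 2 ]; }   # pretend messages arrive on the third poll

# Keep retrying until data arrives or the retry budget is spent.
while [ "$attempt" -lt "$max_attempts" ] && ! data_ready; do
  attempt=$((attempt + 1))
done
echo "attempts made: $attempt"
# -> attempts made: 2
```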

Using a bash shell

The sync_debagged task uses a bash shell to run the syncOneWork.sh script. The environment for that script is configured in the task itself; it is separate from the environment of the docker image and the airflow container.
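Run outside airflow, the pattern amounts to invoking the script under an explicit, task-scoped environment. The script body and the DB variable below are stand-ins, not the real syncOneWork.sh:

```shell
# A stand-in for syncOneWork.sh:
cat > syncOneWork-example.sh <<'EOF'
#!/usr/bin/env bash
echo "syncing $1 with DB=$DB"
EOF
chmod +x syncOneWork-example.sh

# The task supplies the environment at invocation time:
env DB=example_db ./syncOneWork-example.sh W00001
# -> syncing W00001 with DB=example_db
```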