Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. While it offers robust features for managing complex workflows, it has experienced security vulnerabilities. One notable vulnerability, CVE-2024-39877, is the DAG (Directed Acyclic Graph) code execution vulnerability. This allows authenticated DAG authors to craft a doc_md parameter in a way that can execute arbitrary code in the scheduler context, which is prohibited according to the Airflow security model.
Patch Diffing
From the pull request on GitHub that patches the vulnerability, we can see that the DAG code execution vulnerability arises from improper handling of the doc_md parameter, which allows attackers to inject and execute arbitrary code within the scheduler context. The doc_md parameter in Airflow’s DAG allows for the inclusion of Markdown documentation. However, because Airflow renders the content of this parameter with Jinja2 and does not sanitize it first, it is possible to inject Jinja2 template expressions that execute arbitrary Python code. Since the Airflow scheduler processes this parameter, any injected code runs in the context of the scheduler. The vulnerability was patched by treating the data within the doc_md parameter as raw data.
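To make "treating the data as raw data" concrete, here is a minimal sketch of the idea. It is not a copy of the upstream patch: the function name get_doc_md_raw and the way a .md file reference is handled are assumptions made for illustration only.

```python
from __future__ import annotations

from pathlib import Path


def get_doc_md_raw(doc_md: str | None) -> str | None:
    """Return DAG documentation as plain data, never as a template.

    Hypothetical helper for illustration; not the actual Airflow code.
    """
    if doc_md is None:
        return None
    # Assumed handling of a file reference: read the file verbatim.
    if doc_md.endswith(".md"):
        path = Path(doc_md)
        if path.is_file():
            return path.read_text(encoding="utf-8")
    # Inline text (or a missing file) is passed through untouched,
    # so Jinja2 delimiters stay literal characters.
    return doc_md


print(get_doc_md_raw("# Title\n{{ 7 * 7 }}"))
```

Because no template engine is involved, an expression such as {{ 7 * 7 }} comes back unchanged instead of being evaluated.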
Testing Lab
1. We will build the lab on Docker. First, we need to pull the vulnerable image:
airflow % docker pull apache/airflow:2.4.0
2. Then, download the Docker Compose file:
airflow % curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.4.0/docker-compose.yaml'
3. Create the logs, dags, plugins, and config folders, and the .env file:
airflow % mkdir -p ./dags ./logs ./plugins ./config && echo -e "AIRFLOW_UID=$(id -u)" > .env
4. Check the created directories and files:
airflow % ls
config dags docker-compose.yaml logs plugins
5. Initialize Airflow:
airflow % sudo docker compose up airflow-init
6. Now, run Airflow:
airflow % sudo docker compose up
We can find it working on port 8080. Username and password are airflow:airflow.
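Before moving on, it can help to confirm that the services are really up. The sketch below assumes the webserver's /health endpoint is reachable at http://localhost:8080 (the port used in this lab) and that its JSON response reports a status for the metadatabase and scheduler components. The helper names are hypothetical.

```python
import json
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # assumed address of this lab


def is_healthy(payload: dict) -> bool:
    """True when both core components report the status 'healthy'."""
    return all(
        payload.get(component, {}).get("status") == "healthy"
        for component in ("metadatabase", "scheduler")
    )


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> dict:
    """Download and decode the health document (needs the lab running)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the containers running, is_healthy(fetch_health()) should return True once the scheduler has sent its first heartbeat.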
The Analysis
Now, to reproduce the vulnerability, we need to create a DAG.
What is a DAG?
A Directed Acyclic Graph (DAG) is a finite graph with directed edges and no cycles. In the context of Apache Airflow, a DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
- Directed: Each edge in the graph has a direction, going from one node (task) to another.
- Acyclic: There are no cycles in the graph, meaning that you cannot start at one task and follow the directed edges back to the same task.
- Graph: A collection of nodes (tasks) and edges (dependencies between tasks).
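The three properties above can be illustrated without Airflow at all. The following sketch uses Python's standard graphlib module (Python 3.9+); the task names are made up for the example, and this is not how Airflow stores DAGs internally.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (directed edges).
dependencies = {
    "start": set(),
    "process": {"start"},   # hypothetical middle task
    "end": {"process"},
}


def run_order(deps: dict) -> list:
    """Return one valid execution order for the graph."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps: dict) -> bool:
    """A graph with a cycle has no valid order, so it is not a DAG."""
    try:
        run_order(deps)
    except CycleError:
        return True
    return False


print(run_order(dependencies))
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

The first call yields an order in which every task comes after its dependencies; the second shows that two tasks depending on each other are rejected.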
DAGs in Apache Airflow
In Apache Airflow, DAGs are defined in Python scripts, which specify the relationships and dependencies between tasks. Here are some key components:
- Tasks: The individual units of work, which can be anything from running a shell command to calling an API or running a machine learning model.
- Dependencies: The relationships between tasks, specifying which tasks need to be completed before others can start.
- Scheduling: Defines when and how often the DAG should run.
DAG Example
The following DAG includes a doc_md parameter. This parameter allows you to document your DAG using Markdown. The documentation will be visible in the Airflow web interface when you view the DAG details.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'retries': 1
}

# Define the DAG
dag = DAG(
    'example_dag_with_doc_md',
    default_args=default_args,
    description='An example DAG with doc_md',
    schedule='@daily',
    doc_md="""
# Example DAG
This is an example DAG that demonstrates the use of the `doc_md` parameter to add documentation.
## Description
This DAG has two dummy tasks: `start` and `end`.
## Tasks
- `start`: This is the starting task.
- `end`: This is the ending task.
## Dependencies
The `end` task depends on the `start` task.
"""
)

# Define the tasks
start = EmptyOperator(
    task_id='start',
    dag=dag
)
end = EmptyOperator(
    task_id='end',
    dag=dag
)

# Set the task dependencies
start >> end
- doc_md: This parameter is used to add Markdown documentation to the DAG. The content within the doc_md string is written in Markdown and will be rendered in the Airflow web interface when you view the DAG details.
- EmptyOperator: This is a simple operator that does nothing. It is used here to create placeholder tasks.
Now, Let’s Try Our DAG
Save the DAG File
Save the above code as a Python file (e.g., example_dag_with_doc_md.py) in the Airflow DAGs folder. In our Docker setup, this is the ./dags folder we created earlier, which the Compose file mounts at /opt/airflow/dags/ inside the containers.
Trigger the DAG
Go to the Airflow web interface and trigger the DAG named example_dag_with_doc_md.
View Documentation
Click on the DAG in the Airflow web interface to view its details. You will see the rendered Markdown documentation in the Doc tab.
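The same documentation can also be inspected without the browser. The sketch below assumes that Airflow's stable REST API is enabled with basic authentication in this lab and that the DAG details endpoint returns a doc_md field. The helper name build_details_request is hypothetical.

```python
import base64
import urllib.request


def build_details_request(
    dag_id: str,
    base_url: str = "http://localhost:8080",  # assumed lab address
    user: str = "airflow",
    password: str = "airflow",
) -> urllib.request.Request:
    """Prepare a GET request for the DAG details endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    url = f"{base_url}/api/v1/dags/{dag_id}/details"
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


request = build_details_request("example_dag_with_doc_md")
print(request.full_url)
```

With the lab running, passing the request to urllib.request.urlopen and decoding the JSON body should expose the documentation under the doc_md key, if the assumptions above hold.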
What Happened Here Exactly?
Let’s take a look at the get_doc_md method from the vulnerable code to see how it resolves the Markdown content from doc_md:
def get_doc_md(self, doc_md: str | None) -> str | None:
    if doc_md is None:
        return doc_md

    env = self.get_template_env(force_sandboxed=True)

    if not doc_md.endswith(".md"):
        template = jinja2.Template(doc_md)
    else:
        try:
            template = env.get_template(doc_md)
        except jinja2.exceptions.TemplateNotFound:
            return f"""
            # Templating Error!
            Not able to find the template file: `{doc_md}`.
            """

    return template.render()
The get_doc_md method is designed to process the doc_md parameter, allowing DAG authors to embed Markdown documentation within their DAGs. Here’s a breakdown of how it works:
1. Check if doc_md is None:
If doc_md is None, the function returns early.
2. Initialize Jinja2 Environment:
It creates a sandboxed Jinja2 environment using self.get_template_env(force_sandboxed=True). As the next step shows, only the .md branch actually uses this environment.
3. Handle doc_md Content:
- If doc_md does not end with .md, it directly creates a Jinja2 template from the doc_md string using template = jinja2.Template(doc_md). This step is highly dangerous: the template is built outside the sandboxed env created above, so any string provided in doc_md is treated as a regular, unsandboxed Jinja2 template without any sanitization. An attacker who controls this content can inject Jinja2 expressions that reach arbitrary Python objects and code.
- If doc_md ends with .md, the method attempts to load the template from the sandboxed environment using env.get_template(doc_md). If the template file is not found, it returns a templating error message. This branch is less critical than the direct template creation.
4. Render the Template:
The final step, template.render(), renders the template and evaluates every expression it contains, which is where the injected code gets executed.
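The difference between a plain jinja2.Template and a sandboxed environment can be reproduced outside Airflow. The sketch below requires the jinja2 package and uses a generic SandboxedEnvironment as a stand-in for the environment Airflow builds. The function names are hypothetical, and the payload only counts classes rather than doing anything harmful.

```python
import jinja2
from jinja2.sandbox import SandboxedEnvironment, SecurityError

# Harmless probe: count the subclasses of `object` via introspection.
PAYLOAD = "{{ ''.__class__.__mro__[1].__subclasses__() | length }}"


def render_like_vulnerable_branch(source: str) -> str:
    """Mirror `jinja2.Template(doc_md)`: a plain, unsandboxed template."""
    return jinja2.Template(source).render()


def render_in_sandbox(source: str) -> str:
    """Render the same source in a sandbox and report a refusal."""
    try:
        return SandboxedEnvironment().from_string(source).render()
    except SecurityError as exc:
        return f"blocked: {exc}"


print(render_like_vulnerable_branch("{{ 7 * 7 }}"))  # expression is evaluated
print(render_like_vulnerable_branch(PAYLOAD))        # introspection succeeds
print(render_in_sandbox(PAYLOAD))                    # sandbox refuses the access
```

The unsandboxed render evaluates both expressions, while the sandbox rejects the access to underscore-prefixed attributes. This is why building the template with jinja2.Template rather than through the sandboxed environment matters.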
So, the vulnerability is a classic example of an injection attack (Server-Side Template Injection, SSTI).
Exploitation
Let’s see how this vulnerability can be exploited:
Attack Scenario
Step-by-Step Breakdown
1. Send Malicious doc_md Payload:
The attacker sends a malicious payload through the doc_md parameter to the web server.
2. Forward Payload to Airflow:
The web server forwards this payload to the Airflow application, which then invokes the get_doc_md method.
3. Invoke get_doc_md Method:
The method checks if the doc_md parameter is None and proceeds to initialize the Jinja2 environment.
4. Create Jinja2 Template:
Next, it creates a Jinja2 template using the doc_md content and renders the template. During rendering, the malicious expressions embedded in the doc_md parameter are evaluated by the Python interpreter, and any command they spawn is run by the operating system (OS).
5. Execute Injected Code:
The OS performs the command and returns the output to the Airflow application.
6. Send Response:
Finally, Airflow sends the rendered template output back to the web server, which then delivers the response, including the command output, back to the attacker.
Example of Injected Code
To demonstrate this, let’s inject code to dump the available classes:
doc_md="""
{{ ''.__class__.__mro__[1].__subclasses__() }}
"""
The {{ ''.__class__.__mro__[1].__subclasses__() }} expression in Jinja2 template code leverages Python’s introspection capabilities to list all subclasses of the object class, effectively revealing the classes loaded in the current Python environment. Here’s how it works:
- ''.__class__ retrieves the class of an empty string, which is str.
- Accessing .__mro__ on this class provides the method resolution order (MRO), a tuple that includes the str class itself and its base classes, including object.
- The expression .__mro__[1] selects the object class from this tuple.
- Finally, .__subclasses__() lists all known direct subclasses of the object class, allowing us to enumerate the classes available in the runtime. This can be used to identify useful classes such as subprocess.Popen, which can execute commands on the OS and achieve code execution.
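The same chain can be followed step by step in an ordinary Python session, which makes each intermediate value visible. This is a sketch for illustration: the helper find_class is hypothetical, and subprocess is imported explicitly so that Popen is guaranteed to be loaded.

```python
import subprocess  # imported so that Popen is loaded in this interpreter

string_class = "".__class__          # the str class
mro = string_class.__mro__           # (str, object)
base = mro[1]                        # the object class
subclasses = base.__subclasses__()   # direct subclasses of object


def find_class(module: str, name: str):
    """Return the first loaded subclass matching module and name, if any."""
    for cls in subclasses:
        if cls.__module__ == module and cls.__name__ == name:
            return cls
    return None


print(string_class, base, len(subclasses))
print(find_class("subprocess", "Popen"))
```

Which classes show up depends on what the process has already imported, so the list differs between a bare interpreter and a running Airflow scheduler.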
After updating the DAG, our injected expression was rendered and dumped all the available classes.
In the output, we can see useful classes such as subprocess.Popen, which can be used to execute commands. Exploitation depends on the environment and on which classes are available.
Conclusion
In this analysis, we examined the CVE-2024-39877 vulnerability, which allows authenticated DAG authors to exploit the doc_md parameter to execute arbitrary code in the scheduler context, violating Airflow’s security model. The vulnerability arises from improper handling and sanitization of the doc_md parameter, which is rendered using Jinja2 templates. This oversight allows attackers to inject malicious Jinja2 expressions that can execute Python code.
The method get_doc_md in the vulnerable code initializes a Jinja2 environment and directly creates a template from the doc_md string if it does not end with .md, rendering the template without adequate sanitization. This process can be exploited by injecting payloads that leverage Python’s introspection capabilities to enumerate available classes and execute commands, thereby compromising the system.
To mitigate this, the patch ensures proper handling of doc_md as raw data, preventing the execution of arbitrary code.