Debugging Airflow with py-spy

Recently we ran into a problem where our Airflow scheduler got “stuck”: it just stopped doing anything, without any logging, errors, or unusual CPU or memory usage. A restart resolved the issue, and we have healthchecks in place that detect this kind of stall and automatically restart the scheduler. But because the issue occurred multiple times in a short period, I figured this was a good time to do a deep dive; normally these issues are hard to reproduce.

I got in contact with @ash on Slack, who recommended trying py-spy.

What py-spy does

Py-spy is a sampling profiler, meaning that it looks at what the Python program is doing: in which lines of which files is the CPU spending its time? This can give you an indication of why a program is slow or even grinding to a halt.

Py-spy can be attached to a running process, without needing to change the configuration of the process or restart it.

Enabling py-spy on Kubernetes

Since we were running Airflow on Kubernetes, I had to install py-spy in our Docker container. I just added it to the Dockerfile and built a new image.
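
For illustration, the change can be as small as one extra instruction; the base image below is a placeholder for whatever image you already build from.

# Sketch only: install py-spy on top of the existing Airflow image.
FROM your-airflow-image:tag
RUN pip install --no-cache-dir py-spy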

Also, some configuration is needed, because the default security settings do not allow py-spy to attach to another process (it needs the SYS_PTRACE capability). Specifically, in the deployment manifest, under Deployment.spec.template.spec.containers you need to add:

securityContext:
  capabilities:
    add:
    - SYS_PTRACE

(source)
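
For context, this is roughly where that snippet ends up in the manifest (the container name and image are placeholders, and the rest of the Deployment is omitted):

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: airflow-scheduler
        image: your-airflow-image:tag
        securityContext:
          capabilities:
            add:
            - SYS_PTRACE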

So in this case I actually did have to change some configuration. But that is because Docker containers are immutable: we can't change the specification of a running container. This has nothing to do with py-spy. If you are running on a plain VM or server, you can simply install py-spy next to your running Python program and start profiling.

Profiling a process

With this configuration in place I was ready to dive into the issue. After some time I got the notification in the UI: “The scheduler does not appear to be running. Last heartbeat was received 9 minutes ago. The DAGs list may not update, and new tasks will not be scheduled.”

You can now exec into the pod (kubectl exec -ti <pod> bash) and run py-spy. Py-spy attaches to a process ID, and since the main process in a Docker container has PID 1, we can just run py-spy --pid 1. (Note that Airflow actually runs multiple processes in a single pod; you can find the other process IDs with ps -ef.)
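
Put together, the sequence looks roughly like this (the pod name is a placeholder; recent py-spy releases expose this live view as the top subcommand):

kubectl exec -ti airflow-scheduler-<pod> bash
ps -ef            # list the Airflow processes running in this pod
py-spy --pid 1    # attach to the main process (on newer py-spy: py-spy top --pid 1)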

This will give you an output such as this:

  Collecting samples from 'pid: 1' (python v3.6.9)
  Total Samples 43994
  GIL: 0.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
 100.00% 100.00%   38.00s    38.00s   recv_into (OpenSSL/SSL.py:1821)
  0.00% 100.00%   0.000s    38.00s   sync (airflow/contrib/executors/kubernetes_executor.py:795)
  0.00% 100.00%   0.000s    38.00s   request_encode_body (urllib3/request.py:150)
  0.00% 100.00%   0.000s    38.00s   begin (http/client.py:307)
  0.00% 100.00%   0.000s    38.00s   recv_into (urllib3/contrib/pyopenssl.py:304)
  0.00% 100.00%   0.000s    38.00s   wrapper (airflow/utils/cli.py:74)
  0.00% 100.00%   0.000s    38.00s   create_namespaced_pod_with_http_info (kubernetes/client/apis/core_v1_api.py:6206)
  0.00% 100.00%   0.000s    38.00s   _make_request (urllib3/connectionpool.py:383)
  0.00% 100.00%   0.000s    38.00s   _execute_helper (airflow/jobs/scheduler_job.py:1412)

From this trace we learn that the scheduler is stuck waiting on a response from the Kubernetes API while creating the Pod for the task it just launched (create_namespaced_pod_with_http_info, blocked in recv_into).

Debugging the issue further

To see which HTTP calls are causing the problem, I added some extra logging to cli.py:

# added for debugging scheduler
import http.client; http.client.HTTPConnection.debuglevel = 1
import logging
logging.basicConfig() # you need to initialize logging, otherwise you will not see any of the debug output
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True

This enables debug logging at the http.client level (in this case the stack is kubernetes client -> urllib3 -> http.client). You will see the HTTP requests, including headers and data, and the responses, including headers (but without data).

I built a new container with this debug logging and collected some logs. I then grepped for the module that was launching the pods (pod_launcher):

(base) macbook:~ roland$ kubectl logs airflow-scheduler-7c5d6df8f4-w4tpn > logs.txt
(base) macbook:~ roland$ cat logs.txt | grep pod_launcher.py
[2019-08-20 14:40:10,471] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:40:10,652] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:43:03,213] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:43:03,296] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:44:38,078] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:44:38,409] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:44:39,804] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:44:39,867] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:45:21,785] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:45:22,040] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:50:22,224] {pod_launcher.py:60} DEBUG - Pod Creation Request:

This is all traffic to the Kubernetes API for creating pods. We can see that the final request, at 14:50:22,224, never gets a response; that is the point at which the Airflow scheduler hangs. From there I could dive into the Airflow source code, and it turned out that there is no timeout on this API call.

Fixing the issue

First I configured a timeout on the urllib3 call. This causes the scheduler to exit when a timeout occurs. Instead of deadly silence, I could now observe obvious errors like this:

  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 307, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='rg-akscode-platform-a-101-1ba9537e.hcp.westeurope.azmk8s.io', port=443): Read timed out. (read timeout=9.99950340400028)
[2019-08-20 15:25:20,241] {helpers.py:319} INFO - Sending Signals.SIGTERM to GPID 41
[2019-08-20 15:25:20,598] {settings.py:238} DEBUG - Disposing DB connection pool (PID 5578)
[2019-08-20 15:25:20,661] {settings.py:238} DEBUG - Disposing DB connection pool (PID 5576)
[2019-08-20 15:25:20,695] {helpers.py:297} INFO - Process psutil.Process(pid=5578, status='terminated') (5578) terminated with exit code None
[2019-08-20 15:25:20,696] {helpers.py:297} INFO - Process psutil.Process(pid=41, status='terminated') (41) terminated with exit code 0
[2019-08-20 15:25:20,696] {helpers.py:297} INFO - Process psutil.Process(pid=5576, status='terminated') (5576) terminated with exit code None
[2019-08-20 15:25:20,696] {scheduler_job.py:1328} INFO - Exited execute loop
[2019-08-20 15:25:20,719] {cli_action_loggers.py:82} DEBUG - Calling callbacks: []
[2019-08-20 15:25:20,719] {settings.py:238} DEBUG - Disposing DB connection pool (PID 1)

A crashing process is slightly better than a process that hangs, but still not ideal. Therefore I added a try-except in the KubernetesExecutor that puts the task back on the internal queue, so it will be retried.
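
The idea is roughly the following (a simplified sketch, not the literal patch; kube_scheduler, run_next and task_queue mirror names in the executor, but treat the details as illustrative):

# Simplified sketch of the retry logic inside the executor's sync loop.
from kubernetes.client.rest import ApiException

def sync(self):
    ...
    while not self.task_queue.empty():
        task = self.task_queue.get()
        try:
            self.kube_scheduler.run_next(task)
        except ApiException as e:
            # The pod-creation call failed (for example with the read timeout
            # above); re-queue the task so the next sync cycle retries it,
            # instead of letting the scheduler crash.
            self.log.warning('ApiException when running task, re-queueing: %s', e)
            self.task_queue.put(task)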

I created a ticket on the Airflow JIRA (AIRFLOW-5282), created a pull request, and resolved the issue. This has made Airflow more stable when running on Kubernetes. Thanks to @ash for helping with the analysis.

Conclusion

Py-spy is a very handy tool to keep in mind when you run into issues with Python programs and have no logging output to go on. Since it is not intrusive, it is safe to use in production, which may allow you to debug otherwise hard-to-reproduce issues.