Recently we ran into a problem where our Airflow scheduler got "stuck": it just stopped doing anything, without any log output, errors, or unusual CPU or memory usage. A restart resolved the issue, and we have health checks in place that detect this kind of stalling and automatically restart the scheduler. But because the issue occurred multiple times in a short period, I figured this was a good moment for a deep dive, since these issues are normally hard to reproduce.
I got into contact with @ash on Slack, who recommended trying py-spy.
What py-spy does
Py-spy is a sampling profiler: it periodically samples what a Python program is doing, i.e. at which lines in which files the program is spending its time. This can give you an indication of why a program is slow or even grinding to a halt.
Py-spy can be attached to a running process, without changing the configuration of that process or restarting it.
Enabling py-spy on kubernetes
Since we were running Airflow on Kubernetes, I had to install py-spy in our Docker container. I just added it to the Dockerfile and built a new image.
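For reference, the change to the Dockerfile can be as small as this (the base image name and tag are placeholders for whatever Airflow image you are already using):

# Placeholder for your existing Airflow base image.
FROM my-registry/airflow:1.10.4

# Install py-spy so it is available inside the running container.
RUN pip install py-spy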
Some extra configuration is also needed, because the default security settings do not allow py-spy to attach to a process. Specifically, in the deployment manifest, under Deployment.spec.template.spec.containers, you need to add:
securityContext:
  capabilities:
    add:
      - SYS_PTRACE
(source)
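For context, the securityContext goes on the container entry itself, not on the pod spec. A stripped-down scheduler deployment would look roughly like this (all names and the image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-scheduler
spec:
  selector:
    matchLabels:
      app: airflow-scheduler
  template:
    metadata:
      labels:
        app: airflow-scheduler
    spec:
      containers:
        - name: scheduler
          image: my-registry/airflow:1.10.4  # placeholder image
          securityContext:
            capabilities:
              add:
                - SYS_PTRACE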
So in this case I did have to change some configuration, but that is because Docker containers are immutable: we cannot change the specification of an already running container. It has nothing to do with py-spy itself. If we were running on a plain VM or server, we could simply install py-spy next to the running Python program and start profiling.
Profiling a process
With this configuration in place I was ready to dive into the issue. After some time I got the notification in the UI: The scheduler does not appear to be running. Last heartbeat was received 9 minutes ago.
The DAGs list may not update, and new tasks will not be scheduled.
You can now exec into the pod (kubectl exec -ti <pod> bash) and run py-spy. Py-spy attaches to a process ID, and since the main process in a Docker container is 1, we can just run py-spy --pid 1. (Note that Airflow actually runs multiple processes in a single pod. You can find the other process IDs with ps -ef.)
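Putting those steps together, the whole interaction looks roughly like this (the pod name is an example; newer py-spy releases use the subcommand form py-spy top --pid 1 instead):

# exec into the scheduler pod
kubectl exec -ti airflow-scheduler-7c5d6df8f4-w4tpn -- bash

# inside the container: list the processes, in case you want to profile something other than PID 1
ps -ef

# attach py-spy to the scheduler process
py-spy --pid 1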
This will give you an output such as this:
Collecting samples from 'pid: 1' (python v3.6.9)
Total Samples 43994
GIL: 0.00%, Active: 100.00%, Threads: 1
%Own %Total OwnTime TotalTime Function (filename:line)
100.00% 100.00% 38.00s 38.00s recv_into (OpenSSL/SSL.py:1821)
0.00% 100.00% 0.000s 38.00s sync (airflow/contrib/executors/kubernetes_executor.py:795)
0.00% 100.00% 0.000s 38.00s request_encode_body (urllib3/request.py:150)
0.00% 100.00% 0.000s 38.00s begin (http/client.py:307)
0.00% 100.00% 0.000s 38.00s recv_into (urllib3/contrib/pyopenssl.py:304)
0.00% 100.00% 0.000s 38.00s wrapper (airflow/utils/cli.py:74)
0.00% 100.00% 0.000s 38.00s create_namespaced_pod_with_http_info (kubernetes/client/apis/core_v1_api.py:6206)
0.00% 100.00% 0.000s 38.00s _make_request (urllib3/connectionpool.py:383)
0.00% 100.00% 0.000s 38.00s _execute_helper (airflow/jobs/scheduler_job.py:1412)
From this trace we learn that the scheduler is stuck in an SSL recv_into call, waiting for the Kubernetes API to respond to a pod creation request for a task it just launched.
Debugging the issue further
To see which HTTP calls are causing the problem, I added some extra logging to cli.py:
# added for debugging scheduler
import http.client; http.client.HTTPConnection.debuglevel = 1
import logging
logging.basicConfig() # you need to initialize logging, otherwise you will not see anything from requests
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
The first line enables debug output at the http.client level (requests -> urllib3 -> http.client): you will see the HTTP request, including headers and data, and the response, including headers (but without the body). The rest configures Python logging so that the urllib3 debug messages are actually shown.
I built a new container with this debug logging enabled and collected some logs. I then grepped for the component that was causing the issue (pod_launcher):
(base) macbook:~ roland$ kubectl logs airflow-scheduler-7c5d6df8f4-w4tpn > logs.txt
(base) macbook:~ roland$ cat logs.txt | grep pod_launcher.py
[2019-08-20 14:40:10,471] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:40:10,652] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:43:03,213] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:43:03,296] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:44:38,078] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:44:38,409] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:44:39,804] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:44:39,867] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:45:21,785] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:45:22,040] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:50:22,224] {pod_launcher.py:60} DEBUG - Pod Creation Request:
This is all traffic to the Kubernetes API for creating pods. We can see that the final request, at 14:50:22,224, never gets a response; that is when the Airflow scheduler hangs. From there I could dive into the Airflow source code, and it turned out that there is no timeout on this API call.
Fixing the issue
First I configured a timeout on urllib3. This causes the scheduler to exit when a timeout occurs. Instead of deadly silence, I could now observe obvious errors like this:
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 307, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='rg-akscode-platform-a-101-1ba9537e.hcp.westeurope.azmk8s.io', port=443): Read timed out. (read timeout=9.99950340400028)
[2019-08-20 15:25:20,241] {helpers.py:319} INFO - Sending Signals.SIGTERM to GPID 41
[2019-08-20 15:25:20,598] {settings.py:238} DEBUG - Disposing DB connection pool (PID 5578)
[2019-08-20 15:25:20,661] {settings.py:238} DEBUG - Disposing DB connection pool (PID 5576)
[2019-08-20 15:25:20,695] {helpers.py:297} INFO - Process psutil.Process(pid=5578, status='terminated') (5578) terminated with exit code None
[2019-08-20 15:25:20,696] {helpers.py:297} INFO - Process psutil.Process(pid=41, status='terminated') (41) terminated with exit code 0
[2019-08-20 15:25:20,696] {helpers.py:297} INFO - Process psutil.Process(pid=5576, status='terminated') (5576) terminated with exit code None
[2019-08-20 15:25:20,696] {scheduler_job.py:1328} INFO - Exited execute loop
[2019-08-20 15:25:20,719] {cli_action_loggers.py:82} DEBUG - Calling callbacks: []
[2019-08-20 15:25:20,719] {settings.py:238} DEBUG - Disposing DB connection pool (PID 1)
A crashing process is slightly better than a hanging process, but still not ideal. Therefore I added a try-except in the KubernetesExecutor that puts the task back on the internal queue, so it will be retried.
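The actual change is in the pull request mentioned below; conceptually it boils down to something like this sketch (simplified, with attribute names paraphrased from the executor rather than copied verbatim):

from kubernetes.client.rest import ApiException

def run_next_task(executor, task):
    # Simplified sketch, not the actual Airflow code: try to launch the pod for the
    # next task, and put the task back on the internal queue if the API call fails.
    try:
        executor.kube_scheduler.run_next(task)
    except ApiException as e:
        executor.log.warning(
            "ApiException when attempting to run task, re-queueing. Message: %s", e
        )
        # Re-queue the task so the scheduler retries it instead of crashing or hanging.
        executor.task_queue.put(task)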
I created a ticket on the Airflow JIRA (AIRFLOW-5282), opened a pull request, and resolved the issue. This has made Airflow more stable when running on Kubernetes. Thanks to @ash for helping with the analysis.
Conclusion
Py-spy is a very handy tool to keep in mind when you run into issues with Python programs and have no logging output to go on. Since it is not intrusive, it is safe to use in production, which may allow you to debug issues that are otherwise very hard to pin down.