Diagnosing a Google Cloud Functions build failure
My Google Cloud Functions deployed fine to the Python 3.7 runtime, but failed to build for Python 3.8. Here's how I investigated and discovered the cause.
Recently, I had a small project with a few Google Cloud Functions which I planned to write in Python. I set up Terraform to manage all of the configuration including the deployment of a dummy hello-world function with the Python 3.7 runtime. This was fine, but once I started working on the real functions and deploying them for testing, I hit this ongoing issue with the Python 3.7 runtime where no logs or stack traces are available when there’s a crash; the suggested workaround is to move to the Python 3.8 runtime, currently in Beta.
So, I changed my configuration to target the Python 3.8 runtime and re-ran Terraform… only to have it fail with the rather cryptic error message:
Error: Error waiting for Creating CloudFunctions Function: Error code 3, message:
Build failed: Listing '.'...; Error ID: 0cfbe9cc
Here is how I solved it…
Cloud Functions with Terraform
First, I need to explain my use of Terraform. Using the gcloud
CLI, you can deploy/update a function from a local directory, a ZIP file in a storage bucket, or a Cloud Source Repository (Git repo). However, the current Terraform resource google_cloudfunctions_function doesn’t support deploying directly from a local directory, so I chose the ZIP-file-in-a-storage-bucket route.
I used the archive_file data source from Terraform’s archive provider:
data "archive_file" "hello_zip" {
type = "zip"
source_dir = "./hello"
output_path = "./staging/hello.zip"
}
When I ran terraform plan
this (re)generated a ZIP file from my source directory. I then used this as the source of a google_storage_bucket_object:
resource "google_storage_bucket_object" "hello_zip" {
name = format("hello_%s.zip", data.archive_file.hello_zip.output_md5)
bucket = "my-bucket-name"
source = data.archive_file.hello_zip.output_path
}
I’m using the MD5 of the ZIP in the filename so that Terraform will only update the remote file when the ZIP has changed. This will, in turn, cause an update to the function, which I defined like this:
resource "google_cloudfunctions_function" "hello" {
provider = google-beta
name = "hello"
runtime = "python38"
source_archive_bucket = google_storage_bucket_object.hello_zip.bucket
source_archive_object = google_storage_bucket_object.hello_zip.name
entry_point = "handle"
event_trigger { ... } // details not relevant
}
Running terraform apply
successfully created the ZIP file in the GCS bucket, but when it tried to update the cloud function, Terraform failed with this error:
google_cloudfunctions_function.hello: Creating...
google_cloudfunctions_function.hello: Still creating... [10s elapsed]
google_cloudfunctions_function.hello: Still creating... [20s elapsed]
google_cloudfunctions_function.hello: Still creating... [30s elapsed]
google_cloudfunctions_function.hello: Still creating... [40s elapsed]
2020/08/09 12:14:33 [ERROR] <root>: eval: *terraform.EvalApplyPost, err: Error waiting for Creating CloudFunctions Function: Error code 3, message: Build failed: Listing '.'...; Error ID: 0cfbe9cc
2020/08/09 12:14:33 [ERROR] <root>: eval: *terraform.EvalSequence, err: Error waiting for Creating CloudFunctions Function: Error code 3, message: Build failed: Listing '.'...; Error ID: 0cfbe9cc
Error: Error waiting for Creating CloudFunctions Function: Error code 3, message: Build failed: Listing '.'...; Error ID: 0cfbe9cc
on main.tf line 47, in resource "google_cloudfunctions_function" "hello":
47: resource "google_cloudfunctions_function" "hello" {
So, what to do next?
Pinpoint the conditions triggering the issue
Simplify the test case
In cases like this, I find that a good approach is to come up with a simple example that does work, then gradually make it closer and closer to your broken example until you make a change that causes it to stop working. Alternatively, gradually make your broken example closer and closer to the working example until it starts working. If they are quite different, you may want to make larger changes, a bit like a binary search.
My original problem wasn’t actually on this simple “hello world” function, so my first step was to create this repo with a simple function that I could comfortably share publicly (useful if I needed to submit an issue on GitHub or raise a support request with Google).
I confirmed that this simple function still had the same issue when deployed in the same way.
I next tried to ensure that my function code actually worked with the Python 3.8 runtime. Using the Cloud Console (Google Cloud’s web UI), I edited the function definition and changed the runtime to Python 3.7. This deployed successfully, and I was able to invoke the function, but re-running the Terraform apply again still resulted in the same error. So, it definitely seemed like something to do with the deployment process, rather than my code.
Eliminate Terraform
I next tried to eliminate the Terraform provider as a possible cause of the problem, by using the gcloud CLI to deploy the function from the same ZIP file in GCS:
$ gcloud functions deploy hello \
--project=cloudfunctions-terraform \
--runtime=python38 \
--service-account=hello-function@cloudfunctions-terraform.iam.gserviceaccount.com \
--entry-point=handle \
--trigger-topic=hello-trigger \
--source=gs://cloudfunctions-terraform-staging/hello_91948d92dbabc8aa1614b1516bf03a0f.zip \ --max-instances=1 \
--memory=128MB \
--region=europe-west2
This failed with the same kind of error message:
Deploying function (may take a while - up to 2 minutes)...
For Cloud Build Stackdriver Logs, visit: https://console.cloud.google.com/logs/viewer?project=cloudfunctions-terraform&advancedFilter=resource.type%3Dbuild%0Aresource.labels.build_id%3D79124a26-6733-480f-aa04-d57154c26faf%0AlogName%3Dprojects%2Fcloudfunctions-terraform%2Flogs%2Fcloudbuild
Deploying function (may take a while - up to 2 minutes)...failed.
ERROR: (gcloud.beta.functions.deploy) OperationError: code=3, message=Build failed: Listing '.'...; Error ID: 0cfbe9cc
but note that this time I was given a link to the Cloud Build Stackdriver Logs1.
My repo includes a copy of the failed build log where you can see a Python traceback just before the error message I’d seen already:
Step #4 - "builder": === Python - Functions Framework (google.python.functions-framework@0.9.0) ===
Step #4 - "builder": ————————————————————————————————————————————————————————————————————————————————
Step #4 - "builder": Running "python3 -m compileall ."
Step #4 - "builder": Traceback (most recent call last):
Step #4 - "builder": File "/opt/python3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
Step #4 - "builder": return _run_code(code, main_globals, None,
Step #4 - "builder": File "/opt/python3.8/lib/python3.8/runpy.py", line 87, in _run_code
Step #4 - "builder": exec(code, run_globals)
Step #4 - "builder": File "/opt/python3.8/lib/python3.8/compileall.py", line 332, in <module>
Step #4 - "builder": exit_status = int(not main())
Step #4 - "builder": File "/opt/python3.8/lib/python3.8/compileall.py", line 314, in main
Step #4 - "builder": if not compile_dir(dest, maxlevels, args.ddir,
Step #4 - "builder": File "/opt/python3.8/lib/python3.8/compileall.py", line 96, in compile_dir
Step #4 - "builder": if not compile_file(file, dfile, force, rx, quiet,
Step #4 - "builder": File "/opt/python3.8/lib/python3.8/compileall.py", line 151, in compile_file
Step #4 - "builder": expect = struct.pack('<4sll', importlib.util.MAGIC_NUMBER,
Step #4 - "builder": struct.error: 'l' format requires -2147483648 <= number <= 2147483647
Step #4 - "builder": Listing '.'...
Step #4 - "builder": Done "python3 -m compileall ." (88.129337ms)
Step #4 - "builder": Failure: (ID: 0cfbe9cc) Listing '.'...
Eliminate ZIP file source?
Before digging further I really wanted to see a Python 3.8 function work, and so I tried deploying not from the ZIP file but from local sources (which is how all Cloud Functions “getting started” documentation does it).
$ gcloud functions deploy hello \
--project=cloudfunctions-terraform \
--runtime=python38 \
--service-account=hello-function@cloudfunctions-terraform.iam.gserviceaccount.com \
--entry-point=handle \
--trigger-topic=hello-trigger \
--source=hello \ --max-instances=1 \
--memory=128MB \
--region=europe-west2
(Note that the only difference to the previous command is highlighted: --source
was specified as a relative path to the directory containing my Python source, instead of a location in Cloud Storage.)
This worked! The build log is here, and you can see that the equivalent section of output showed no traceback:
Step #4 - "builder": === Python - Functions Framework (google.python.functions-framework@0.9.0) ===
Step #4 - "builder": ————————————————————————————————————————————————————————————————————————————————
Step #4 - "builder": Running "python3 -m compileall ."
Step #4 - "builder": Listing '.'...
Step #4 - "builder": Compiling './main.py'...
Step #4 - "builder": Done "python3 -m compileall ." (94.3883ms)
Step #4 - "builder": Handling functions without dependency on functions-framework.
Figure out why these conditions cause a failure
To recap, it seems like the failure occurs only when we attempt to use the zip file generated by Terraform’s archive_file
resource. An exception is thrown when running a python module called compileall
, and specifically the exception message is:
File "/opt/python3.8/lib/python3.8/compileall.py", line 151, in compile_file
expect = struct.pack('<4sll', importlib.util.MAGIC_NUMBER,
struct.error: 'l' format requires -2147483648 <= number <= 2147483647
So, we know when the error occurs… but why?
Examine the zip files
I observed in the build logs that even when deploying from a source folder, there was still a zip file in GCS being used as the source for the build – presumably created by gcloud:
Fetching storage object: gs://gcf-sources-4108251557-europe-west2/hello-8141ccd5-3daa-41e0-b263-4f01bc16966c/version-3/function-source.zip#1596972617211838
I downloaded this zip file, and compared it with the one generated by Terraform. I unzipped both, and observed that while the file contents were identical, the Terraform-generated zip set all the file modification dates to 1 Jan 2049:
$ ls -la function-source.from-terraform
total 8
drwxr-xr-x 4 michael staff 128 Aug 9 12:50 .
drwxr-xr-x 24 michael staff 768 Aug 9 13:04 ..
-rw-r--r-- 1 michael staff 259 Jan 1 2049 main.py-rw-r--r-- 1 michael staff 953 Jan 1 2049 requirements.txt
$ ls -la function-source.from-filesystem
total 8
drwxr-xr-x 4 michael staff 128 Aug 9 12:50 .
drwxr-xr-x 24 michael staff 768 Aug 9 13:04 ..
-rw-r--r-- 1 michael staff 259 Aug 9 11:55 main.py
-rw-r--r-- 1 michael staff 953 Aug 9 12:02 requirements.txt
At this point, I went down a bit of a rabbit-hole examing the zip files themselves, using
unzip -Z
. I’ll skip that, because it turned out to be a red herring, but if you’re interested, you can read a bit more here.
I found a mention of this aspect of the archive_file
resource in a GitHub issue Files in the archive do not have the correct modification time metadata, where the rationale for setting a fixed modification date is explained.
Examine the Python module source
So, why is this causing a problem? I found the exception-throwing line in Lib/compileall.py in the Python source repo in GitHub:
def compile_file(fullname, ddir=None, force=False, rx=None, quiet=0,
legacy=False, optimize=-1,
invalidation_mode=None):
# (some code omitted)
if tail == '.py':
if not force:
try:
mtime = int(os.stat(fullname).st_mtime)
expect = struct.pack('<4sll', importlib.util.MAGIC_NUMBER, 0, mtime) with open(cfile, 'rb') as chandle:
actual = chandle.read(12)
if expect == actual:
return success
except OSError:
pass
if not quiet:
print('Compiling {!r}...'.format(fullname))
try:
ok = py_compile.compile(fullname, cfile, dfile, True,
optimize=optimize,
invalidation_mode=invalidation_mode)
This appears to be fetching the st_mtime
attribute of the python source file, formatting it (along with a couple of other values) into a bytes representation and comparing with the first few bytes of another file2. The l
format character denotes a signed long (see docs) which the exception says must be between -2147483648 and 2147483647.
Using epochconverter.com I found that midnight on 1 Jan 2049 gives a unix epoch time of 2493072000, which is out of range for a signed long.
A bit of searching found me an open issue on Python’s bug tracker for this very issue, and from there I found it’s a specific example of a more general issue known as the Year 2038 problem or “Unix Y2K” when the Unix time will exceed the maximum value representable by a signed 32-bit integer.
So, I finally understood the exact nature of the failure.
The solution
At work, our level of support with Google Cloud means that I was able to raise a support ticket, explaining what I’d found and linking to my public repo to show the problem. Within a day I was told it’d been passed to the Cloud Functions engineering team, and a few days after that I was assured that they’d enabled the “force” option (which you can see in the compileall.py
code snippet above) and that this would be rolled out to the Python 3.8 builder within a couple of weeks.
At the time I’m publishing this, I can see the fix against the Buildpack but Cloud Functions is still using an older builder image. I’ll update here when a new release fixes it.
Edit (2020-10-10): the Python 3.8 runtime has been updated, the cloud buildpack used to build the functions includes the fix, and I can finally deploy functions to Python 3.8 through Terraform!
In conclusion
I could have simply raised a support ticket as soon as I found the problem. But I’m reasonably sure that I wouldn’t have got such a quick and positive response from the Cloud Functions engineering team if I hadn’t provided a sample repo with plenty of detail about what I believed was the specific cause of the issue. It would have been easy to say “we’re not familiar with / don’t support this Terraform archive_file
module”, for example. But by narrowing the problem down to a specific behaviour of compileall
I’d saved them time and effort, and demonstrated that I wasn’t just someone who needed to be referred to a Getting Started guide.
I also learned a number of things along the course of my investigation which may not be immediately or obviously useful, but which are now filed away in my brain for possible future reference. I enjoy this kind of detective work – when it’s not getting in the way of an urgent task – and I almost always learn something interesting.
Image credit: Photo by Agence Olloweb on Unsplash
-
Since I captured this output, something seems to have been updated in Cloud Functions so that I now get a link to the build logs in the error message output by Terraform.
↩ -
I assume this is an optimisation for the case when you might be re-running
↩compileall
for Python modules that you’ve changed, where the.pyc
(compiled) files are still present. You would only need Python to re-compile modules that have changed. This will never apply for a Cloud Functions build because the previously-compiled files would not be present in the build container.