Running a Scrapy spider in a GCP cloud function

What you’ve probably tried but didn’t work

Scrapy docs explains how to Run Scrapy from a script. My first try was to simply copy that code, and run it in a Cloud Functions.

from ... import MySpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def my_cloud_function(event, context):
    process = CrawlerProcess(settings={
        'FEED_FORMAT': 'json',
        'FEED_URI': 'items.json'
    })

    process.crawl(MySpider)
    process.start()

    return 'ok'

This worked, but surprisingly only on the first execution of the each function instance. Starting the second invocation, I started seeing in the logs:

Function execution took 509 ms, finished with status: 'crash'

Function execution took 509 ms, finished with status: 'crash'

This is because my Cloud Function is not idempotent. Under the hood, Scrapy is using Twisted which sets up many global objects to power its events driven machinery.

When asked to run a Cloud Function, GCP will either create a new instance of your function (they call this a cold start), or reuse one of the instances it previously created. Therefore it’s very important for your function to be idempotent.

My Cloud Function here cannot run twice, and you can see this even on your local environment:

my_cloud_function(None, None)
my_cloud_function(None, None)

Running a Scrapy spider in a GCP cloud function

The solution: wrapping Scrapy in its own child process

from multiprocessing import Process, Queue
from ... import MySpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def my_cloud_function(event, context):
    def script(queue):
        try:
            settings = get_project_settings()

            settings.setdict({
                'LOG_LEVEL': 'ERROR',
                'LOG_ENABLED': True,
            })

            process = CrawlerProcess(settings)
            process.crawl(MySpider)
            process.start()
            queue.put(None)
        except Exception as e:
            queue.put(e)

    queue = Queue()

    # wrap the spider in a child process
    main_process = Process(target=script, args=(queue,))
    main_process.start()    # start the process
    main_process.join()     # block until the spider finishes

    result = queue.get()    # check the process did not return an error
    if result is not None:
        raise result 

    return 'ok'

Again you can check this works locally by calling your function twice in a row.