Working with distributed workers
Apr 1, 2024
I discussed the challenges of concurrency in Python in a previous post. As an application grows in complexity, it eventually becomes necessary to distribute tasks across multiple workers. Celery is the de facto standard distributed task queue in Python for executing tasks asynchronously. My goal in this post is to use Celery as an example to illustrate the considerations involved in distributing tasks across multiple workers.
Key concepts
AMQP primer
The client sending messages is typically called a publisher or producer, while the entity receiving messages is called a consumer. The broker is the message server, routing messages from producers to consumers. This is essentially the pub/sub pattern: a producer sends messages to a broker, which then distributes them to consumers.
Exchanges are the AMQP 0-9-1 entities to which messages are sent. An exchange takes a message and routes it into zero or more queues. The default exchange is a direct exchange with no name (an empty string) pre-declared by the broker. It has one special property that makes it very useful for simple applications: every queue that is created is automatically bound to it with a routing key equal to the queue name.
Bindings are rules that exchanges use (among other things) to route messages to queues. To instruct an exchange E to route messages to a queue Q, Q has to be bound to E. Bindings may have an optional routing key attribute used by some exchange types. The purpose of the routing key is to select certain messages published to an exchange to be routed to the bound queue. In other words, the routing key acts like a filter.
The steps required to send and receive messages are (a minimal sketch follows the list):
- Create an exchange.
- Create a queue.
- Bind the queue to the exchange.
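Here is a minimal sketch of these steps using pika (assuming a local RabbitMQ with default credentials; the exchange and queue names are made up):
import pika

# Connect to a local broker (assumed: default guest/guest credentials)
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

channel.exchange_declare(exchange='tasks_exchange', exchange_type='direct')        # 1. create an exchange
channel.queue_declare(queue='tasks')                                               # 2. create a queue
channel.queue_bind(queue='tasks', exchange='tasks_exchange', routing_key='tasks')  # 3. bind them

connection.close()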
Concurrency
By default, multiprocessing is used to perform concurrent execution of tasks, but you can also use Eventlet or gevent. The Celery documentation notes: “More pool processes are usually better, but there’s a cut-off point where adding more pool processes affects performance in negative ways. There’s even some evidence to support that having multiple worker instances running, may perform better than having a single worker. For example 3 workers with 10 pool processes each. You need to experiment to find the numbers that works best for you, as this varies based on application, work load, task run times and other factors.”
# start celery worker with the gevent pool
$ celery worker --app=worker.app --pool=gevent --concurrency=100
- prefork: The default pool implementation, using multiple processes. You want to use the prefork pool if your tasks are CPU-bound. This is why Celery defaults to the number of CPUs available on the machine if the --concurrency argument is not set.
- threads: Uses Python’s ThreadPoolExecutor. These threads are real OS threads, managed directly by the operating system kernel.
- solo: Runs everything in a single process; this is useful for debugging and small tasks.
- green threads: Uses gevent and eventlet to provide a coroutine-based concurrency model. Celery doesn’t support asyncio, so green threads are your best bet for I/O-bound tasks (see the sketch after this list).
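As a sketch of how this choice plays out in practice (the module name worker.py and the broker URL are assumptions):
# worker.py -- a minimal Celery app
import requests
from celery import Celery

app = Celery('worker', broker='amqp://guest:guest@localhost:5672//')

@app.task
def fetch(url):
    # I/O-bound: mostly waiting on the network, a good fit for the gevent pool
    return requests.get(url).status_code

@app.task
def crunch(n):
    # CPU-bound: burns CPU cycles, a good fit for the default prefork pool
    return sum(i * i for i in range(n))
With the gevent pool from the command above, hundreds of fetch calls can wait on the network concurrently, while crunch is better served by one prefork process per core.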
Prefetching
Prefetch is a term inherited from AMQP: it is essentially a buffer of pre-fetched messages that saves a network round trip for each additional task. A worker’s default prefetch count is the worker_prefetch_multiplier setting multiplied by the number of concurrency slots; for example, a worker started with --concurrency=10 and the default multiplier of 4 will prefetch up to 40 messages. Kombu, the messaging library that is part of the Celery ecosystem, is what actually sends and receives these messages.
The AMQP 0-9-1 specification does not explain what happens if you invoke basic.qos multiple times with different global values. RabbitMQ interprets this as meaning that the two prefetch limits should be enforced independently of each other:
Channel channel = ...;
Consumer consumer1 = ...;
Consumer consumer2 = ...;
channel.basicQos(10, false); // Per consumer limit
channel.basicQos(15, true); // Per channel limit
channel.basicConsume("my-queue1", false, consumer1);
channel.basicConsume("my-queue2", false, consumer2);
These two consumers will only ever have 15 unacknowledged messages between them, with a maximum of 10 messages for each consumer. This will be slower than using a single per-consumer limit, due to the additional overhead of coordinating between the channel and the queues to enforce the global limit.
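For comparison, the same pair of limits in Python with pika might look like this (a sketch; the two queues are assumed to already exist):
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

def handle(ch, method, properties, body):
    ch.basic_ack(method.delivery_tag)  # ack so further deliveries can flow

channel.basic_qos(prefetch_count=10, global_qos=False)  # per-consumer limit
channel.basic_qos(prefetch_count=15, global_qos=True)   # per-channel limit
channel.basic_consume('my-queue1', on_message_callback=handle, auto_ack=False)
channel.basic_consume('my-queue2', on_message_callback=handle, auto_ack=False)
channel.start_consuming()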
If you want to disable prefetching, you can set worker_prefetch_multiplier to 1. The default is 4 (four messages for each process). If it is set to zero, the worker will keep consuming messages without respecting that there may be other available worker nodes that could process them sooner, or that the messages may not even fit in memory.
Acknowledgement
Positive acknowledgements (basic.ack) simply instruct RabbitMQ to record a message as delivered so that it can be discarded. Negative acknowledgements (basic.nack and basic.reject) have the same effect. The difference is primarily in the semantics: positive acknowledgements assume a message was successfully processed, while their negative counterparts suggest that a delivery wasn’t processed but should still be deleted.
In automatic acknowledgement mode, a message is considered to be successfully delivered immediately after it is sent. This mode trades off higher throughput (as long as the consumers can keep up) for reduced safety of delivery and consumer processing. Consumers therefore can be overwhelmed by the rate of deliveries, potentially accumulating a backlog in memory and running out of heap or getting their process terminated by the OS.
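To make the trade-off concrete, a pika sketch with manual acknowledgements (the jobs queue and the processing function are made up):
import pika

def do_work(body):
    print(f"processing {body!r}")  # stand-in for real processing

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='jobs')

def handle(ch, method, properties, body):
    do_work(body)
    ch.basic_ack(method.delivery_tag)  # ack only after processing succeeds

# With auto_ack=True the broker considers the message delivered immediately,
# so a crash inside do_work would lose it.
channel.basic_consume('jobs', on_message_callback=handle, auto_ack=False)
channel.start_consuming()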
task_acks_late is by default set to False, which means that a task is acknowledged just before it is executed. If the worker crashes after acknowledging but before finishing the task, the task is lost. For mission-critical tasks, you should set task_acks_late to True to ensure that the task is acknowledged only after it is completed; the task then needs to be idempotent, since it may be redelivered and executed more than once. When using the default of early acknowledgement, a prefetch multiplier setting of one means the worker will reserve at most one extra task for every worker process. You can disable prefetching by setting worker_prefetch_multiplier to 1 and task_acks_late to True.
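A sketch of this configuration (a minimal app; the broker URL and the task are assumptions):
from celery import Celery

app = Celery('worker', broker='amqp://guest:guest@localhost:5672//')

# Ack only after the task finishes; if the worker dies mid-task,
# the broker redelivers the message to another worker.
app.conf.task_acks_late = True
# With late acks, a multiplier of 1 effectively disables prefetching.
app.conf.worker_prefetch_multiplier = 1

@app.task
def process(record_id):
    # Must be idempotent: with late acks this may execute more than once
    print(f"processing {record_id}")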
Priority
RabbitMQ supports adding “priorities” to classic queues. Classic queues with the “priority” feature turned on are commonly referred to as “priority queues”. Priorities between 1 and 255 are supported, however, values between 1 and 5 are highly recommended. It is important to know that higher priority values require more CPU and memory resources, since RabbitMQ needs to internally maintain a sub-queue for each priority from 1, up to the maximum value configured for a given queue.
from kombu import Exchange, Queue

app.conf.task_queues = [
    # This queue accepts priorities 0-10 via its own x-max-priority argument
    Queue('tasks', Exchange('tasks'), routing_key='tasks',
          queue_arguments={'x-max-priority': 10}),
]
# Default maximum priority for queues that don't set x-max-priority themselves
app.conf.task_queue_max_priority = 3
# Priority applied to tasks that don't specify one
app.conf.task_default_priority = 1
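Producers can then attach a priority per message (a sketch; the add task is made up, and on RabbitMQ higher values are delivered first):
@app.task
def add(x, y):
    return x + y

# Route to the 'tasks' queue declared above with an explicit priority;
# priority=5 outranks tasks sent with the default priority of 1.
add.apply_async(args=(2, 2), routing_key='tasks', priority=5)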
Heartbeats
Heartbeats are important because they help ensure that tasks are not lost if a worker goes offline or crashes. If a worker stops sending heartbeats, the broker assumes the connection is dead, closes it, and requeues the worker’s unacknowledged messages so that other workers can pick them up.
# Configure Celery's AMQP heartbeat
app.conf.update(
    broker_heartbeat=120,  # request a 120-second heartbeat timeout from the broker
)
Other task related settings
For CPU-bound tasks, you may want to configure the following settings:
- task_time_limit (default: None): The maximum number of seconds a task may run before the worker processing it is killed and replaced with a new one.
- task_soft_time_limit (default: None): Similar to task_time_limit, but instead of terminating the task, a SoftTimeLimitExceeded exception is raised inside it, giving the task a chance to clean up (see the sketch after this list).
- task_track_started (default: False): If True, the task will report its status as STARTED when it is started by a worker. This can be useful for long-running tasks.
- task_remote_tracebacks (default: False): If True, exceptions will include the worker’s remote traceback in the error message. This can be useful for debugging (it requires the tblib library).
Simple Worker
# pub.py
import pika
import json
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='hello')
# Publish to the default exchange; the routing key matches the queue name
channel.basic_publish(exchange='',
                      routing_key='hello',
                      body=json.dumps({"task": "cpu_bound", "param": "1"}))
print(" [x] Sent task 'cpu_bound'")
connection.close()
# sub.py
import pika, sys, os
from time import sleep
import json
def cpu_bound(param=0):
sleep(1)
print(f" [x] Received {param}")
def main():
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()
channel.queue_declare(queue='hello')
    def callback(ch, method, properties, body):
        task = json.loads(body)
        # Dispatch to the function named in the message payload
        globals()[task["task"]](task["param"])
        print(f" [x] processed {task}")

    # auto_ack=True acknowledges on delivery, so a crash mid-task loses the message
    channel.basic_consume(queue='hello', on_message_callback=callback, auto_ack=True)
print(' [*] Waiting for messages. To exit press CTRL+C')
channel.start_consuming()
if __name__ == '__main__':
try:
main()
except KeyboardInterrupt:
print('Interrupted')
try:
sys.exit(0)
except SystemExit:
os._exit(0)
A more advanced example
Modified from the pika documentation examples.
# pub.py
import logging
import pika
from pika import DeliveryMode
from pika.exchange_type import ExchangeType
logging.basicConfig(level=logging.INFO)
credentials = pika.PlainCredentials('guest', 'guest')
parameters = pika.ConnectionParameters('localhost', credentials=credentials)
connection = pika.BlockingConnection(parameters)
channel = connection.channel()
channel.exchange_declare(exchange="test_exchange",
exchange_type=ExchangeType.direct,
passive=False,
durable=True,
auto_delete=False)
print("Sending message to create a queue")
channel.basic_publish(
'test_exchange', 'standard_key', 'queue:group',
pika.BasicProperties(content_type='text/plain',
delivery_mode=DeliveryMode.Transient))
print("Sending text message to group")
channel.basic_publish(
'test_exchange', 'group_key', 'Message to group_key',
pika.BasicProperties(content_type='text/plain',
delivery_mode=DeliveryMode.Transient))
print("Sending text message")
channel.basic_publish(
'test_exchange', 'standard_key', 'Message to standard_key',
pika.BasicProperties(content_type='text/plain',
delivery_mode=DeliveryMode.Transient))
connection.close()
# sub.py
import functools
import logging
import time
import pika
from pika.adapters.asyncio_connection import AsyncioConnection
from pika.exchange_type import ExchangeType
LOG_FORMAT = ('%(levelname) -10s %(asctime)s %(name) -30s %(funcName) '
'-35s %(lineno) -5d: %(message)s')
LOGGER = logging.getLogger(__name__)
class ExampleConsumer(object):
"""This is an example consumer that will handle unexpected interactions
with RabbitMQ such as channel and connection closures.
If RabbitMQ closes the connection, this class will stop and indicate
that reconnection is necessary. You should look at the output, as
there are limited reasons why the connection may be closed, which
usually are tied to permission related issues or socket timeouts.
If the channel is closed, it will indicate a problem with one of the
commands that were issued and that should surface in the output as well.
"""
EXCHANGE = 'message'
EXCHANGE_TYPE = ExchangeType.topic
QUEUE = 'text'
ROUTING_KEY = 'example.text'
def __init__(self, amqp_url):
"""Create a new instance of the consumer class, passing in the AMQP
URL used to connect to RabbitMQ.
:param str amqp_url: The AMQP url to connect with
"""
self.should_reconnect = False
self.was_consuming = False
self._connection = None
self._channel = None
self._closing = False
self._consumer_tag = None
self._url = amqp_url
self._consuming = False
# In production, experiment with higher prefetch values
# for higher consumer throughput
self._prefetch_count = 1
def connect(self):
"""This method connects to RabbitMQ, returning the connection handle.
When the connection is established, the on_connection_open method
will be invoked by pika.
:rtype: pika.adapters.asyncio_connection.AsyncioConnection
"""
LOGGER.info('Connecting to %s', self._url)
return AsyncioConnection(
parameters=pika.URLParameters(self._url),
on_open_callback=self.on_connection_open,
on_open_error_callback=self.on_connection_open_error,
on_close_callback=self.on_connection_closed)
def close_connection(self):
self._consuming = False
if self._connection.is_closing or self._connection.is_closed:
LOGGER.info('Connection is closing or already closed')
else:
LOGGER.info('Closing connection')
self._connection.close()
def on_connection_open(self, _unused_connection):
"""This method is called by pika once the connection to RabbitMQ has
been established. It passes the handle to the connection object in
case we need it, but in this case, we'll just mark it unused.
:param pika.adapters.asyncio_connection.AsyncioConnection _unused_connection:
The connection
"""
LOGGER.info('Connection opened')
self.open_channel()
def on_connection_open_error(self, _unused_connection, err):
"""This method is called by pika if the connection to RabbitMQ
can't be established.
:param pika.adapters.asyncio_connection.AsyncioConnection _unused_connection:
The connection
:param Exception err: The error
"""
LOGGER.error('Connection open failed: %s', err)
self.reconnect()
def on_connection_closed(self, _unused_connection, reason):
"""This method is invoked by pika when the connection to RabbitMQ is
closed unexpectedly. Since it is unexpected, we will reconnect to
RabbitMQ if it disconnects.
:param pika.connection.Connection connection: The closed connection obj
:param Exception reason: exception representing reason for loss of
connection.
"""
self._channel = None
if self._closing:
self._connection.ioloop.stop()
else:
LOGGER.warning('Connection closed, reconnect necessary: %s', reason)
self.reconnect()
def reconnect(self):
"""Will be invoked if the connection can't be opened or is
closed. Indicates that a reconnect is necessary then stops the
ioloop.
"""
self.should_reconnect = True
self.stop()
def open_channel(self):
"""Open a new channel with RabbitMQ by issuing the Channel.Open RPC
command. When RabbitMQ responds that the channel is open, the
on_channel_open callback will be invoked by pika.
"""
LOGGER.info('Creating a new channel')
self._connection.channel(on_open_callback=self.on_channel_open)
def on_channel_open(self, channel):
"""This method is invoked by pika when the channel has been opened.
The channel object is passed in so we can make use of it.
Since the channel is now open, we'll declare the exchange to use.
:param pika.channel.Channel channel: The channel object
"""
LOGGER.info('Channel opened')
self._channel = channel
self.add_on_channel_close_callback()
self.setup_exchange(self.EXCHANGE)
def add_on_channel_close_callback(self):
"""This method tells pika to call the on_channel_closed method if
RabbitMQ unexpectedly closes the channel.
"""
LOGGER.info('Adding channel close callback')
self._channel.add_on_close_callback(self.on_channel_closed)
def on_channel_closed(self, channel, reason):
"""Invoked by pika when RabbitMQ unexpectedly closes the channel.
Channels are usually closed if you attempt to do something that
violates the protocol, such as re-declare an exchange or queue with
different parameters. In this case, we'll close the connection
to shutdown the object.
:param pika.channel.Channel: The closed channel
:param Exception reason: why the channel was closed
"""
LOGGER.warning('Channel %i was closed: %s', channel, reason)
self.close_connection()
def setup_exchange(self, exchange_name):
"""Setup the exchange on RabbitMQ by invoking the Exchange.Declare RPC
command. When it is complete, the on_exchange_declareok method will
be invoked by pika.
:param str|unicode exchange_name: The name of the exchange to declare
"""
LOGGER.info('Declaring exchange: %s', exchange_name)
# Note: using functools.partial is not required, it is demonstrating
# how arbitrary data can be passed to the callback when it is called
cb = functools.partial(
self.on_exchange_declareok, userdata=exchange_name)
self._channel.exchange_declare(
exchange=exchange_name,
exchange_type=self.EXCHANGE_TYPE,
callback=cb)
def on_exchange_declareok(self, _unused_frame, userdata):
"""Invoked by pika when RabbitMQ has finished the Exchange.Declare RPC
command.
:param pika.Frame.Method unused_frame: Exchange.DeclareOk response frame
:param str|unicode userdata: Extra user data (exchange name)
"""
LOGGER.info('Exchange declared: %s', userdata)
self.setup_queue(self.QUEUE)
def setup_queue(self, queue_name):
"""Setup the queue on RabbitMQ by invoking the Queue.Declare RPC
command. When it is complete, the on_queue_declareok method will
be invoked by pika.
:param str|unicode queue_name: The name of the queue to declare.
"""
LOGGER.info('Declaring queue %s', queue_name)
cb = functools.partial(self.on_queue_declareok, userdata=queue_name)
self._channel.queue_declare(queue=queue_name, callback=cb)
def on_queue_declareok(self, _unused_frame, userdata):
"""Method invoked by pika when the Queue.Declare RPC call made in
setup_queue has completed. In this method we will bind the queue
and exchange together with the routing key by issuing the Queue.Bind
RPC command. When this command is complete, the on_bindok method will
be invoked by pika.
:param pika.frame.Method _unused_frame: The Queue.DeclareOk frame
:param str|unicode userdata: Extra user data (queue name)
"""
queue_name = userdata
LOGGER.info('Binding %s to %s with %s', self.EXCHANGE, queue_name,
self.ROUTING_KEY)
cb = functools.partial(self.on_bindok, userdata=queue_name)
self._channel.queue_bind(
queue_name,
self.EXCHANGE,
routing_key=self.ROUTING_KEY,
callback=cb)
def on_bindok(self, _unused_frame, userdata):
"""Invoked by pika when the Queue.Bind method has completed. At this
point we will set the prefetch count for the channel.
:param pika.frame.Method _unused_frame: The Queue.BindOk response frame
:param str|unicode userdata: Extra user data (queue name)
"""
LOGGER.info('Queue bound: %s', userdata)
self.set_qos()
def set_qos(self):
"""This method sets up the consumer prefetch to only be delivered
one message at a time. The consumer must acknowledge this message
before RabbitMQ will deliver another one. You should experiment
with different prefetch values to achieve desired performance.
"""
self._channel.basic_qos(
prefetch_count=self._prefetch_count, callback=self.on_basic_qos_ok)
def on_basic_qos_ok(self, _unused_frame):
"""Invoked by pika when the Basic.QoS method has completed. At this
point we will start consuming messages by calling start_consuming
which will invoke the needed RPC commands to start the process.
:param pika.frame.Method _unused_frame: The Basic.QosOk response frame
"""
LOGGER.info('QOS set to: %d', self._prefetch_count)
self.start_consuming()
def start_consuming(self):
"""This method sets up the consumer by first calling
add_on_cancel_callback so that the object is notified if RabbitMQ
cancels the consumer. It then issues the Basic.Consume RPC command
which returns the consumer tag that is used to uniquely identify the
consumer with RabbitMQ. We keep the value to use it when we want to
cancel consuming. The on_message method is passed in as a callback pika
will invoke when a message is fully received.
"""
LOGGER.info('Issuing consumer related RPC commands')
self.add_on_cancel_callback()
self._consumer_tag = self._channel.basic_consume(
self.QUEUE, self.on_message)
self.was_consuming = True
self._consuming = True
def add_on_cancel_callback(self):
"""Add a callback that will be invoked if RabbitMQ cancels the consumer
for some reason. If RabbitMQ does cancel the consumer,
on_consumer_cancelled will be invoked by pika.
"""
LOGGER.info('Adding consumer cancellation callback')
self._channel.add_on_cancel_callback(self.on_consumer_cancelled)
def on_consumer_cancelled(self, method_frame):
"""Invoked by pika when RabbitMQ sends a Basic.Cancel for a consumer
receiving messages.
:param pika.frame.Method method_frame: The Basic.Cancel frame
"""
LOGGER.info('Consumer was cancelled remotely, shutting down: %r',
method_frame)
if self._channel:
self._channel.close()
def on_message(self, _unused_channel, basic_deliver, properties, body):
"""Invoked by pika when a message is delivered from RabbitMQ. The
channel is passed for your convenience. The basic_deliver object that
is passed in carries the exchange, routing key, delivery tag and
a redelivered flag for the message. The properties passed in is an
instance of BasicProperties with the message properties and the body
is the message that was sent.
:param pika.channel.Channel _unused_channel: The channel object
:param pika.Spec.Basic.Deliver: basic_deliver method
:param pika.Spec.BasicProperties: properties
:param bytes body: The message body
"""
LOGGER.info('Received message # %s from %s: %s',
basic_deliver.delivery_tag, properties.app_id, body)
self.acknowledge_message(basic_deliver.delivery_tag)
def acknowledge_message(self, delivery_tag):
"""Acknowledge the message delivery from RabbitMQ by sending a
Basic.Ack RPC method for the delivery tag.
:param int delivery_tag: The delivery tag from the Basic.Deliver frame
"""
LOGGER.info('Acknowledging message %s', delivery_tag)
self._channel.basic_ack(delivery_tag)
def stop_consuming(self):
"""Tell RabbitMQ that you would like to stop consuming by sending the
Basic.Cancel RPC command.
"""
if self._channel:
LOGGER.info('Sending a Basic.Cancel RPC command to RabbitMQ')
cb = functools.partial(
self.on_cancelok, userdata=self._consumer_tag)
self._channel.basic_cancel(self._consumer_tag, cb)
def on_cancelok(self, _unused_frame, userdata):
"""This method is invoked by pika when RabbitMQ acknowledges the
cancellation of a consumer. At this point we will close the channel.
This will invoke the on_channel_closed method once the channel has been
closed, which will in-turn close the connection.
:param pika.frame.Method _unused_frame: The Basic.CancelOk frame
:param str|unicode userdata: Extra user data (consumer tag)
"""
self._consuming = False
LOGGER.info(
'RabbitMQ acknowledged the cancellation of the consumer: %s',
userdata)
self.close_channel()
def close_channel(self):
"""Call to close the channel with RabbitMQ cleanly by issuing the
Channel.Close RPC command.
"""
LOGGER.info('Closing the channel')
self._channel.close()
def run(self):
"""Run the example consumer by connecting to RabbitMQ and then
starting the IOLoop to block and allow the AsyncioConnection to operate.
"""
self._connection = self.connect()
self._connection.ioloop.run_forever()
def stop(self):
"""Cleanly shutdown the connection to RabbitMQ by stopping the consumer
with RabbitMQ. When RabbitMQ confirms the cancellation, on_cancelok
        will be invoked by pika, which will then close the channel and
connection. The IOLoop is started again because this method is invoked
when CTRL-C is pressed raising a KeyboardInterrupt exception. This
exception stops the IOLoop which needs to be running for pika to
communicate with RabbitMQ. All of the commands issued prior to starting
the IOLoop will be buffered but not processed.
"""
if not self._closing:
self._closing = True
LOGGER.info('Stopping')
if self._consuming:
self.stop_consuming()
self._connection.ioloop.run_forever()
else:
self._connection.ioloop.stop()
LOGGER.info('Stopped')
class ReconnectingExampleConsumer(object):
"""This is an example consumer that will reconnect if the nested
ExampleConsumer indicates that a reconnect is necessary.
"""
def __init__(self, amqp_url):
self._reconnect_delay = 0
self._amqp_url = amqp_url
self._consumer = ExampleConsumer(self._amqp_url)
def run(self):
while True:
try:
self._consumer.run()
except KeyboardInterrupt:
self._consumer.stop()
break
self._maybe_reconnect()
def _maybe_reconnect(self):
if self._consumer.should_reconnect:
self._consumer.stop()
reconnect_delay = self._get_reconnect_delay()
LOGGER.info('Reconnecting after %d seconds', reconnect_delay)
time.sleep(reconnect_delay)
self._consumer = ExampleConsumer(self._amqp_url)
def _get_reconnect_delay(self):
if self._consumer.was_consuming:
self._reconnect_delay = 0
else:
self._reconnect_delay += 1
if self._reconnect_delay > 30:
self._reconnect_delay = 30
return self._reconnect_delay
def main():
logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
amqp_url = 'amqp://guest:guest@localhost:5672/%2F'
consumer = ReconnectingExampleConsumer(amqp_url)
consumer.run()
if __name__ == '__main__':
main()