Scaling Redis
When vertical scaling isn’t an option
When in the course of human events you encounter a scenario that drives Redis CPU into the red zone, what is your move? I recently encountered such a scenario.
Redis is a popular open-source, in-memory database. It is available as a managed service on AWS (ElastiCache) and Google Cloud (Memorystore).
Here’s a chart of Google Memorystore CPU utilization for the past 2 weeks.
On March 3, CPU crossed the 80% threshold. You may be wondering what kind of workload can drive Redis CPU to this level.
Here’s a timechart by Redis operation over the same period.
The workload is almost entirely scans, peaking at over 25,000 scans/second!
Now you may feel a sense of moral indignation at this scan-heavy workload. SCAN is used to incrementally iterate over all the keys in the database. Each call returns only a small batch of keys; a complete pass over the keyspace is an O(N) operation.
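For reference, scan_iter in redis-py wraps the cursor-based SCAN command in a loop, roughly like the sketch below (the connection, match pattern, and count are placeholders):

import redis

r = redis.Redis()  # placeholder connection

# SCAN walks the keyspace in batches: each call returns a new cursor and a
# batch of matching keys; iteration is complete when the cursor comes back as 0.
cursor = 0
while True:
    cursor, keys = r.scan(cursor=cursor, match="lp-*", count=1000)
    for key in keys:
        pass  # process each key
    if cursor == 0:
        break

Each call is cheap on its own; it’s the repeated full passes over every key in the database that add up.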
Here’s the block of code driving this load.
for key in self._redis.scan_iter(match=key_mask):
    _, lp_candidate = key.decode("utf-8").rsplit(":", 1)
    sim = Similarity.lp_similarity_ratio(plate, lp_candidate)
    if Similarity.is_similar(sim, self.__lp_sim_threshold) and (best_match is None or best_match[1] < sim):
        best_match = LicensePlate(lp_candidate), sim
The block iterates over a set of keys, extracts a license plate, and compares this candidate to a search key using a fuzzy similarity function. If the candidate is deemed more similar than any candidate seen so far, it becomes the new best_match.
Here are some example keys to make this more clear.
"lp-day-sess:loc:3d5a6232-43b4-4cd8-a423-5f322d05cd85:lp:JT4344"
"lp-day-sess:loc:84bba9e7-0a8a-4af5-8b23-f1a16844c7bb:lp:HXS290"
"lp-day-sess:loc:7eae9995-23dd-4f0d-be0c-175979bfdc5b:lp:GMP1936"
"lp-day-sess:loc:ea9ef92b-1814-4a24-b4b9-ec94a496f301:lp:944101"
"lp-sess:loc:741100c2-4a1c-45b5-a44f-c315b63aa30b:lp:IKR069"
"lp-day-sess:loc:a1981989-aabf-4e52-9925-78e3b93d5d2e:lp:AJT138"
"lp-sess:loc:f31077cc-0bb5-4157-939c-adc08941e7ed:lp:KDC3600"
"lp-day-sess:loc:72cf7004-abef-47c5-9d8e-22d32e22f73b:lp:9J1252"
"lp-day-sess:loc:d76f798e-4653-4240-bb11-ec1981f1f228:lp:P556P"
"lp-day-sess:loc:a36c0c25-4c68-46bc-962c-47f55c2198c6:lp:M275079"
The application here is a car wash analytics platform. As a car arrives into view, we want to know whether we have seen it previously (i.e. at another video camera at the same location).
It may be possible to solve this problem in a more efficient manner. Perhaps we can maintain sets of keys by location id and limit iteration to the relevant set. Perhaps a different technology that supports indexing would be more appropriate.
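The first idea could look something like the sketch below: whenever a session key is written, also add the plate to a set keyed by location id, and then read back just that set instead of scanning the whole keyspace. The key layout (lp-by-loc:<location_id>) and helper names are hypothetical, for illustration only.

import redis

def record_plate(r: redis.Redis, location_id: str, plate: str) -> None:
    # Alongside the existing session key, index the plate under its location.
    r.sadd(f"lp-by-loc:{location_id}", plate)

def candidate_plates(r: redis.Redis, location_id: str) -> list[str]:
    # Read back only this location's plates: cost proportional to one set,
    # not to every key in the database.
    return [p.decode("utf-8") for p in r.smembers(f"lp-by-loc:{location_id}")]

The similarity loop would then run over candidate_plates() for the relevant location, and the set members would need the same expiry treatment as the session keys.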
But what can be done quickly? How can we buy time? Once we’ve removed this bottleneck to scale, we can come back and refactor the solution.
First, it’s not an option to turn a dial in Google Memorystore and give Redis more CPU. Redis executes commands on a single thread, so it effectively uses one core. While it is possible to give a Memorystore instance more memory, vertical scaling of Memorystore CPU is not possible.
Second, as described in the FAQ, Memorystore does not support Redis Cluster. There is no out-of-the-box scale-out option.
Proxy Service: twemproxy
One option is twemproxy. Twemproxy is an open-source proxy that supports the Redis protocol. It also supports sharding across Redis instances by hashing keys.
I put together an example of how we can shard a Redis workload across multiple nodes using twemproxy. In the example, we start 4 Redis containers, a twemproxy container, and a client container. The client will set values, and these will be sharded across the Redis instances.
A nice thing about this approach is that no change to the client code is required in order to scale out: the client just points at the proxy.
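In code, that just means a different connection target. A minimal sketch (the proxy hostname and listen port below are placeholders):

import redis

# The client talks to twemproxy as if it were a single Redis node; the
# proxy hashes each key and forwards the command to one of the shards.
r = redis.Redis(host="twemproxy", port=6380)  # placeholder proxy address

for i in range(100):
    r.set(f"key:{i}", i)  # writes are spread across the 4 Redis containers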
A problem with this approach is that twemproxy does not support the scan operation. That’s a dealbreaker for our use case.
twemproxy_1 | [2021-03-14 16:16:51.572] nc_redis.c:1092 parsed unsupported command 'SCAN'
twemproxy_1 | [2021-03-14 16:16:51.572] nc_core.c:237 close c 11 '172.18.0.7:54580' on event 00FF eof 0 done 0 rb 269 sb 34: Invalid argument
client_1 | Traceback (most recent call last):
client_1 | File "/usr/src/app/./app.py", line 33, in <module>
client_1 | for key in r.scan_iter(match='*ock*'):
…
Client-side sharding: redis-shard
Another option is client-side sharding. You could roll your own, or you could use a library available in your application language. I tried the latter and explored the Python redis-shard package.
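For comparison, rolling your own can be as simple as hashing each key to pick a connection. Here is a minimal sketch; the node hostnames (redis1 through redis4) and the CRC32 choice are illustrative assumptions:

import zlib

import redis

# Illustrative client-side sharding: hash each key and use the result to
# pick one of N Redis connections. Hostnames are placeholders.
NODES = [redis.Redis(host=f"redis{i}", port=6379) for i in range(1, 5)]

def node_for(key: str) -> redis.Redis:
    return NODES[zlib.crc32(key.encode("utf-8")) % len(NODES)]

key = "lp-sess:loc:741100c2-4a1c-45b5-a44f-c315b63aa30b:lp:IKR069"
node_for(key).set(key, 1)

A library like redis-shard packages this idea up, along with niceties like hash tags.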
I put together an example of how we can apply redis-shard. The example starts 4 Redis containers and a client container. The client sets keys and values, and these are sharded across the instances.
Like twemproxy, redis-shard supports a hash_tag feature, which enables us to send keys with a common substring to the same node.
from redis_shard.shard import RedisShardAPI

r = RedisShardAPI(servers)  # servers: the list of shard configs for redis1 through redis4
print(r.get_server_name('bar'))
print(r.get_server_name('c{bar}'))
print(r.get_server_name('c{bar}d'))
print(r.get_server_name('c{bar}e'))
print(r.get_server_name('e{bar}f'))
The substring in curly braces is used as the hash key.
client_1 | redis1
client_1 | redis1
client_1 | redis1
client_1 | redis1
client_1 | redis1
However, redis-shard does not support the scan operation.
client_1 | Traceback (most recent call last):
client_1 | File "/usr/src/app/./app.py", line 42, in <module>
client_1 | for key in r.scan_iter(match='*ock*'):
client_1 | File "/usr/local/lib/python3.9/site-packages/redis_shard/shard.py", line 115, in __getattr__
client_1 | raise NotImplementedError("method '%s' cannot be sharded" % method)
client_1 | NotImplementedError: method 'scan_iter' cannot be sharded
Again, dealbreaker.
Redis Enterprise
The next option is the heavy artillery. Some of the folks behind the Redis project have created a Redis Enterprise product. This is clustered Redis available as a managed service, and it is available for purchase on Google Cloud.
This may be our Obi-Wan Kenobi. I’ll let you know after I experiment.