Rate Limiting Product
Rate Limiting has become a common system design interview question.
But when I was working on it, there didn't seem to be nearly as much content out there. In retrospect, this would have been an opportunity to write about it and socialize it, since it's a common-enough problem.
Not seizing that opportunity was probably a failure on my part.
At the end of this essay, I reflect on why there's so much content on this problem.
What was the problem?
Customers were concerned about two kinds of attacks:
- Too many requests against their origin, which caused a denial of service
- Too many requests across their site from a single IP, indicating a scraping bot or a shopping bot
Both of these were addressed through rate limiting, which meant we could throttle the rate of requests per period of time for the customer.
The use cases, however, required us to give the user control along several dimensions (a sketch of what a rule might look like follows the list):
- Rate per period (including defining what that period is)
- URL pattern (including wildcards)
- Request method
- Response code
- Cookie string
- IP address
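As a sketch only (the field names are mine, not the shipped schema), a single rule combining these dimensions might look something like this:

```python
# Hypothetical rule definition combining the dimensions above.
# Field names are illustrative, not the actual product schema.
rule = {
    "id": "rule-001",
    "match": {
        "url_pattern": "/login*",       # URL pattern, with wildcard
        "methods": ["POST"],            # request methods to count
        "response_codes": [401, 403],   # e.g., count only failed attempts
        "key_by": ["ip", "cookie"],     # what identifies a distinct client
    },
    "threshold": {
        "requests": 100,                # allowed amount...
        "period_seconds": 60,           # ...per period
    },
    "action": "block",                  # what to do past the threshold
}
```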
This flexibility addressed the different kinds of use cases, but it also presented a UX challenge: providing granular controls alongside an "easy-button" option.
We also needed a way to show how rate limiting rules were actually behaving, so we included a metrics component per rule.
This was important so the customer could see how the rules were working in practice (for example, if a rule was rate limiting what appeared to be regular traffic, users needed to be able to identify that quickly).
Constraints and Concerns
- Fast reaction time: we didn't pick a hard number, but felt that less than one second to start or stop throttling was reasonable in a DDoS situation
- Low-to-zero false positives: we were afraid of blocking legitimate customers
- Scalability: whatever design we chose needed to scale across traffic volume, number of users, and potentially the number of rules per user
Engineering
The system was distributed, which meant that as little as 1 request per second arriving at each of 50 different locations was effectively 50 rps at the origin.
So the first problem to solve was how to maintain state for every rule bucket.
Because of the potentially high throughput, we didn't want every request to trigger write operations against a database or a distributed key-value store.
But we could leverage in-memory caching at each location, keyed on the rules and client data (a sketch follows the list below).
This decision seemed the most robust compared with some other options we considered, which didn't make sense from a product perspective:
- Sticky sessions
- Chatty gossip
- A KV store with read/write to disk
- Kafka pub/sub
Most of these seemed like they introduced complexity or latency.
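As an illustration of the in-memory direction (my sketch, not the actual implementation): one common way to avoid cross-node writes is for each of the N locations to enforce roughly limit / N against local counters keyed on rule and client.

```python
import time
from collections import defaultdict

class LocalRateLimiter:
    """Per-location counters keyed on (rule_id, client_key).

    Each of num_locations nodes enforces limit / num_locations
    locally, approximating the global limit with no shared writes.
    A sketch under assumptions, not the production design.
    """

    def __init__(self, global_limit: int, period_seconds: float, num_locations: int):
        self.local_limit = max(1, global_limit // num_locations)
        self.period = period_seconds
        # (rule_id, client_key) -> [count, window_start]
        self.counters = defaultdict(lambda: [0, time.monotonic()])

    def allow(self, rule_id: str, client_key: str) -> bool:
        entry = self.counters[(rule_id, client_key)]
        now = time.monotonic()
        if now - entry[1] >= self.period:   # window expired: reset
            entry[0], entry[1] = 0, now
        if entry[0] >= self.local_limit:
            return False                    # over the local share: throttle
        entry[0] += 1
        return True
```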
Next we had to consider algorithms.
My role took a step back on this, but I was able to provide some requirements which helped to scope things down:
- We were less interested in supporting spiky bursts that exceed some threshold
- We wanted it to be easy and intuitive for customers to understand how to parameterize it
- We wanted to balance protecting the origin with serving legitimate users during a surge in traffic
The options we looked at were:
- Token bucket
- Leaky bucket
- Sliding window
From a user perspective, the token bucket seemed to permit spiking traffic, and it also seemed to require customers to set a bucket size and refill rate, and those parameters felt less intuitive for the use case.
The use case where it potentially made more sense was for password attempts, but we felt we could capture that in another way.
The second approach was the leaky bucket. Because it tended to smooth out the request rate that went through to the origin, I preferred it. It was more intuitive, and a way for the end user to associate origin capacity with a rule more directly.
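As a textbook illustration of why this felt intuitive (a single-node sketch, not the distributed implementation): the leaky bucket reduces to two parameters a customer can map to origin capacity, a drain rate and a capacity.

```python
import time

class LeakyBucket:
    """Textbook leaky bucket: admitted requests drain at a fixed
    rate, smoothing what reaches the origin. Single-node sketch."""

    def __init__(self, rate_per_second: float, capacity: float):
        self.rate = rate_per_second   # steady rate the origin can absorb
        self.capacity = capacity      # burst we tolerate before rejecting
        self.level = 0.0              # current "water" in the bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain for the time elapsed since the last check.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1           # admit the request
            return True
        return False                  # bucket full: throttle
```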
Engineering initially looked into implementing this in a distributed environment using memcached. But I believe they looked for another solution because of some limitation around the atomicity needed to support this.
Here I just made a checkpoint that, as long as our tests passed, I was okay with the implementation details.
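My understanding of the atomicity concern, sketched below: a plain read-then-write to a shared counter is a race, because two edge processes can read the same value and each write back an undercount. Caches like memcached address this with atomic increments or check-and-set (CAS) tokens. The client here is an in-process stand-in so the sketch runs; it is not the real memcached API.

```python
class FakeCache:
    """In-process stand-in for a cache with CAS-style semantics.
    Hypothetical, for illustrating the race, not a real client."""

    def __init__(self):
        self.data = {}  # key -> (value, version token)

    def gets(self, key):
        return self.data.get(key, (0, 0))   # returns (value, token)

    def cas(self, key, value, token) -> bool:
        # Write succeeds only if nobody else wrote since our read.
        if self.data.get(key, (0, 0))[1] != token:
            return False
        self.data[key] = (value, token + 1)
        return True

def increment_counter(cache: FakeCache, key: str) -> int:
    # A naive get-then-set loses updates under concurrency;
    # the CAS retry loop re-reads and re-applies until it lands.
    while True:
        value, token = cache.gets(key)
        if cache.cas(key, value + 1, token):
            return value + 1
```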
API first
While the engineers worked on designing the distributed system, blocking, and counters, I started to put together requirements and specifications for the API.
This felt less about its capabilities and more about its ease of use.
I read the API docs for existing rate-limiting products and was confused, so I started by writing the documentation for how I would want to understand and engage with the API.
I worked alongside the front-end engineers on their existing API patterns, since they were also going to be internal customers of the rate limiting product.
This helped me to think about a couple of concepts (a sketch of a rule payload follows the list):
- Work in understandable chunks around defining rules
- For example, the rate controls needed a period, the size of that period, and the allowed amount per period -- that was a clear 'chunk'
- Actions: block, mark but don't block, do nothing
- Types of filters, such as IP, cookie, or URL, and allowing a combination of those
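Putting those chunks together, a request to create a rule might have looked roughly like this (the endpoint and field names are my reconstruction, not the shipped API):

```python
import json
import urllib.request

# Hypothetical payload shape built from the chunks above:
# rate controls, an action, and composable filters.
payload = {
    "rate": {"amount": 300, "period_size": 1, "period_unit": "minute"},
    "action": "simulate",   # block | simulate (mark but don't block) | none
    "filters": {
        "url": "/api/*",
        "method": "GET",
        "ip": "203.0.113.0/24",
    },
}

req = urllib.request.Request(
    "https://api.example.com/v1/rate-limit/rules",  # hypothetical endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # commented out: the endpoint is made up
```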
Although this was API first, at the same time I started to mock up the UI: how would a user make choices without being overwhelmed by too many options?
This was non-trivial.
Metrics
At the same time, I had to think about how we were logging the events. This was a different pipeline from the caching used for rule enforcement.
This meant starting with the types of queries and filters that an end user would want to use.
I was also working on the analytics pipeline, so I found some overlap here.
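As a sketch of the kind of event record this implied (the fields are hypothetical), each rule match needed enough dimensions to support the queries a customer would run:

```python
# Hypothetical per-match event; the fields mirror the dimensions
# a customer would want to query and filter on.
event = {
    "timestamp": "2017-06-01T12:00:00Z",
    "rule_id": "rule-001",
    "action_taken": "block",     # or "simulate" / "none"
    "client_ip": "203.0.113.7",
    "method": "POST",
    "url": "/login",
    "response_code": 429,
}
# Typical questions: how many blocks per rule per hour? Which IPs
# trip a rule most often? Is a rule catching "regular" traffic?
```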
Billing
While working on the metrics, I also needed to work with the Billing system.
The Billing system (which I was also a PM for) was adding a new business model -- usage-based billing.
This meant thinking through the billing approach (and then transitioning the details to the dedicated Billing PM brought on later).
Since I had set up the initial foundation for the Billing system to support usage-based billing, I felt comfortable we could do this. But for the details around how we charged, and the behavior with upgrades and downgrades, I needed to define the specifications (for example, what happened if a customer downgraded in the middle of a period and then had a large rate limiting usage).
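A toy example of the kind of question that needed a specification, with entirely made-up numbers: when a downgrade takes effect mid-period, which tier's included quota applies to the usage on either side of it?

```python
# Made-up numbers illustrating the mid-period downgrade question;
# one possible policy (not necessarily what we shipped): meter each
# sub-period against the quota of the tier active at the time.
def overage_charge(usage_before, usage_after, quota_before, quota_after, unit_price):
    overage = max(0, usage_before - quota_before) + max(0, usage_after - quota_after)
    return overage * unit_price

# Customer downgrades mid-month, then absorbs a large attack:
# 40k matched requests on the old tier (50k included), 900k on the
# new tier (10k included), at $0.05 per 1,000 requests.
print(overage_charge(40_000, 900_000, 50_000, 10_000, 0.05 / 1000))  # -> 44.5
```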
Pricing
The goal for pricing was to make it effectively free for the long tail of customers by coming up with a free tier that matched their existing usage.
But we also wanted to model the ramp-up curve of what customers would need to pay in the event of large L7 attacks being mitigated.
This involved looking at usage patterns, such as attacks on /login paths and attacks from the same IP address, to build a set of "buckets" around large usage.
I did interviews with both enterprise and pro-sumer customers to get a feel for the costs they incurred from the problems that rate limiting could solve.
This included loss of business, costs in servers, costs of downtime, costs of someone trying to build something themselves.
UX
The UX presented challenges on two fronts. The first was that the two personas were different: those who just wanted it to "just work" and those with complex, often pre-existing rule sets.
The second was that, in the first part of the project, I worked with the Head of Design, and we got along fabulously in understanding the customer and the flow. Unfortunately, he then transferred me to work with a Junior Designer whom he felt could handle it.
It turned out he didn't know what a User Flow meant (I had to create it), and he didn't understand how to think through UX edge cases for the browser, for CRUD operations, for error handling... I was suddenly scrambling to define them.
Even then, he had a hard time translating the requirements into actual designs.
He was really more of a graphic designer than a product designer. The Head of Design, however, had gone on paternity leave and wasn't scheduled to be back until after the release!
This was very hard, and in retrospect I would have done things differently.
What I did to get the project done was take over: I made mockups, I explained graceful degradation, I wrote a User Flow including failure cases, and I asked the designer to produce designs that worked, giving daily feedback on each case.
At the time, I didn't see a way out. He had no understanding of the core concepts I had hoped he would own, and the engineering team was desperate for the designs: if we didn't get them shipped, the team's upcoming sprints were already committed to other product teams!
On top of it, this Junior Designer's father had just died, so he was very emotional and confrontational in all of our meetings, especially when the designs he produced weren't good.
The result was great: people loved the designs, and I got feedback on the API from developers (I released it with just a PDF tutorial using Postman, that's it!).
But it was messy and ugly to get there.
How would I have addressed it?
First, I would have held a transition meeting with the Head of Design before he left, and vetted the Junior Designer together. Warning signs had already come up: when I asked questions and the Junior Designer flubbed them, the HoD would step in and coach him.
But I didn't think, "Wait, this guy can't answer basic questions and I'm going to be left with just him. Do something about it, now!"
The second would be to sit down with the designer and say, "We have problems. We are in a tough situation. We need to find a way to fix this. Here's what I am looking for, and it seems we are not on the same page. Is there anyone on your team or from past work that we can leverage? We can't continue this way -- I mean, we can but it will be bloody and if that's the case, we need to talk about what it will look like moving forward."
I think an option would have been for us to talk about whether there were books, courses on YouTube, or some other resource we could pull in that he could use, so that he at least had the basic concepts and I wasn't the one diving in.
I think I would also have spent the time to write up a "rubric" of quality control on what a good design looked like to me, with questions that he could use on his own to test it.
I did try to do that in pieces through feedback, but that ended up being difficult because he would take my comments literally and then say, "but it's what you asked for," and I would have to come up with a more comprehensive approach.
The best approach would have been, before the HoD left, to really call out what we needed and the complexity involved, and to switch designers. That was viable, but I didn't have the full situational awareness to do it.