The previous post on Stateless Session Logout in OpenAM 13 has proved to be quite popular in terms of this blog. While it went into some detail about the technology and the problems that need to be solved in a production system, it was a bit short on actual figures to illustrate the gains. In this post we will rectify that with some theoretical numbers on memory usage and some measurements from early performance testing.
Firstly, how much memory do these bloom filters take up? There is a formula for calculating the optimum size of the bit vector (m) based on the number of items we expect to insert (n) and the desired probability of false positives (p):
For example, for n = 10,000 elements and p = 0.01 (1% chance of false positives), we get m = 95,851 bits (~9.6 bits per element) or around 11.7KiB. The following table shows how this scales up:
|1% FPP||0.1% FPP||0.01% FPP|
Here we can see that we can store over 10 million expired sessions on each server with a false positive rate of 1 in 10,000 in less than 25MiB of RAM — i.e., for every 10,000 sessions that we think are blacklisted and therefore query the CTS for, we would expect a single false positive (unnecessary request). At the extreme end, we can scale up to over a billion blacklisted sessions in around 1-2.5GiB of RAM, which is achievable with proper JVM tuning (an off-heap Bloom Filter backend could help at these extreme sizes, but we do not yet support that). In theory, the absolute limit is around 7.15 billion items at 0.01% FPP with the current implementation (which uses an array of
long values to store the bit vector), using almost 12GiB of RAM. In practice, the rolling bloom filter implementation may use more memory than this in the worst case, but will self-adjust to limit the overuse. For realistic use-cases, we do not expect memory usage to be a significant problem.
The implementation in OpenAM server provides reasonable default values and automatically adapts itself to the observed usage patterns, to avoid any tuning burden on system administrators. The initial values are a capacity of 10,000 elements, and an expected false positive probability of 0.1% when at capacity. The system will expand the capacity dynamically, using a doubling rate of expansion. This ensures that the system will rapidly adjust to accommodate higher than expected rates of logout. Importantly, the system will also release resources when the rate drops below this capacity, ensuring that bursts of activity do not impact long-term resource usage. The false positive probability is automatically adjusted to ensure that it never exceeds 0.1% in total.
The following graphs show some captures from internal JMX logging during development testing showing the system automatically adjusting to load. The system in this test was tuned for a 1% overall false positive probability, which is larger than the default production value in AM 13:
Here you can see the total Bloom Filter size adjusting between around 50KiB up to a maximum of around 250KiB when using 3 bloom filters in the chain. The overall expected FPP similarly fluctuates between bounds up to around 0.5%. As can be seen, the actual observed false positive probability (i.e., the number of requests requiring a lookup on the CTS but turning out to still be active sessions) in this scenario stabilises at around 0.2%.
But what does this actually mean?
To require a blacklist of 1 billion sessions with the default maximum session time of 2 minutes, the system would need to be processing logouts (not timeouts) at a rate of 8⅓ million sessions per second. For a longer session time of 1 day, the rate would be around 11,500 logouts per second. The longer the maximum session expiry timeout, the greater the requirement to blacklist potentially large number of sessions, but the technology should be able to handle very large logout rates. If long maximum session times are desired then using short idle timeouts will also reduce the pressure on the blacklist as sessions would only need to be blacklisted until their idle timeout has been reached (the blacklist will prevent activity until then). However, idle timeouts are not currently implemented for stateless sessions—watch this space. As we can see from the above calculations, even with relatively long maximum session timeouts, the bloom filter technology can already handle very large load rates.
What is the impact of the false positive probability?
The false positive probability is the probability that the bloom filter thinks a session has been logged out when in fact it is still active. For any session that either (a) has been logged out (true positives) or (b) has not been logged out but is flagged as a maybe by the bloom filter (false positives), a query will be made against the CTS master blacklist to check for sure. If the result was a true positive (the session is really logged out), then this is cached in a per-server LRU cache. For false positives, we always check the CTS in case the session is subsequently blacklisted from a different server. This ensures the cache is always accurate, as once a session is blacklisted it will always be blacklisted, but an active session may be blacklisted at any time.
The effectiveness of the bloom filter comes from the expected usage patterns. Once a session has been explicitly logged out, the expectation is that it will not be used again anyway. We might expect one or two extra requests if the user has an old browser tab open with the old session cookie and tries to use it again. In many cases, this will be after the session times out and so can be checked locally. Therefore, in normal use we would expect that most requests will be for sessions that are in fact still valid. This is the usage pattern that the bloom filter excels at. The vast majority of active sessions will be determined as still active by the bloom filter without any additional work. This check is very fast: typically in the order of 1 or 2 microseconds even for very large bloom filters.
The impact of the false positive probability is that in the worst case (when the bloom filter is near capacity) some number of active sessions will be indicated as blacklisted by the bloom filter when they actually are not. This will result in those sessions requiring a lookup against the CTS master blacklist to check for sure, which will add latency to those requests. So long as the bloom filter is not used beyond its configured capacity, this should never exceed the false positive probability. For example, if the FPP is 0.1% at 10,000 elements, then we would expect no more than 10 sessions in those 10,000 to be false positives. Due to the monotonic nature of bloom filters, a session that is indicated as a false positive once will always be considered a false positive (until it expires or is logged out and becomes a true positive). However, as this set is much smaller than the full active set of sessions, the load on the CTS OpenDJ backend is significantly reduced. As mentioned previously, reads can be directed to any DJ server in the replication topology, allowing for scalable load-balancing strategies and additional caching is performed to minimise the impact of these false positives.
Possible future enhancements
The implementation of bloom filter session blacklisting in OpenAM 13 is already very advanced, but there are some areas we are considering for future enhancements to improve the performance and scalability even further. None of these items are concrete roadmap items yet, and we will be prioritising based on feedback from customers deploying these features in the wild.
- Improving worst-case latency for the false positive case via more sophisticated caching
- Specialised strategies to reduce network bandwidth required for bloom filter replication
- Support for off-heap memory, to reduce garbage collection pressure of large bloom filters
- Improved monitoring and parameter tuning to provide more control to administrators