
Commit 63eda17

Add advanced cache documentation and belady approximator to hitrates

- belady is useful to get *some* sort of semi-realistic expectation of
  a cache, as the maximum hit rate is only somewhat realistic once
  cache sizes get close to the number of unique entries
- caches have been busting my balls and I'd assume the average user
  doesn't have the time and inclination to bother, so some guidance is
  useful
- as caching is generally a CPU/memory tradeoff, while ``hitrates``
  provides a cache overhead estimation, giving users a better grasp of
  the implementation details and where the overhead comes from is
  useful
- plus I regularly re-wonder and re-research and re-discover the size
  complexity of various collections, so this gives me the opportunity
  to actually write it down for once

3 files changed

Lines changed: 442 additions & 11 deletions

doc/advanced/caches.rst

Lines changed: 372 additions & 0 deletions
@@ -0,0 +1,372 @@
=========
On Caches
=========


Evaluating Caches
=================

UA-Parser tries to provide a somewhat decent cache by default, but
cache algorithms react differently to traffic patterns, and setups can
have different amounts of space to dedicate to cache overhead.

Thus, ua-parser also provides some tooling to evaluate fitness, in the
form of two built-in command-line scripts. Both scripts take a
mandatory *sample file* so that the evaluation runs on representative
traffic; this sample should therefore be drawn from your real-world
traffic as-is (no sorting, no deduplicating, ...).

``python -mua_parser hitrates``
-------------------------------

As its name indicates, the ``hitrates`` script measures the hit rates
of ua-parser's available caches by simulating cache use at various
sizes on the sample file. It also reports the memory overhead of each
cache implementation at those sizes, both in total and per entry.
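
For instance (the ``useragents.txt`` file name below is illustrative,
any path to a raw sample of user agent lines works)::

    python -mua_parser hitrates useragents.txt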

.. warning::

   The cache overhead does not include the size of the cached entries
   themselves, which is generally 500~700 bytes for a complete entry
   (all three domains matched).

``hitrates`` also includes Bélády's MIN (aka OPT) algorithm for
reference. MIN is not a practical cache as it requires knowledge of
the future, but it provides the theoretical upper bound at a given
cache size (very theoretical: practical cache algorithms tend to be
way behind until cache sizes close in on the total number of unique
values in the dataset).
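
Conceptually, a MIN simulation can be written as the following sketch
(an illustration, not the implementation shipped in ua-parser)::

    def belady_hit_rate(trace, size):
        """Hit rate of Bélády's MIN over a list of keys."""
        # Precompute, for each position, when its key is next used.
        INF = float("inf")
        next_use = [INF] * len(trace)
        last_pos = {}
        for i in range(len(trace) - 1, -1, -1):
            next_use[i] = last_pos.get(trace[i], INF)
            last_pos[trace[i]] = i

        cache = {}  # key -> position of that key's next use
        hits = 0
        for i, key in enumerate(trace):
            if key in cache:
                hits += 1
            elif len(cache) >= size:
                # Evict whichever entry is reused farthest in the
                # future; don't even insert the new key if it is
                # reused later than everything currently cached.
                victim = max(cache, key=cache.get)
                if cache[victim] <= next_use[i]:
                    continue
                del cache[victim]
            cache[key] = next_use[i]
        return hits / len(trace)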

``hitrates`` has the advantage of being very cheap as it only
exercises the caches themselves and barely looks at the data.

``python -mua_parser bench``
----------------------------

``bench`` is much more expensive in both CPU and wallclock time as it
actually runs the base resolvers, combined with various caches at
various sizes. For usability, it can report its data (the average
parse time per input entry) as either human-readable text with one
result per line, or CSV with resolver configurations as the columns
and cache sizes as the rows.

``hitrates`` is generally sufficient: for the same base resolver,
performance tends to more or less follow hit rates, as a cache hit is
close to free compared to a cache miss. This is all the truer for the
basic resolver, for which misses tend to be very expensive. ``bench``
is mostly useful to validate or tie-break decisions based on
``hitrates``, and allows creating nice graphs in your spreadsheet
software of choice.

Cache Algorithms
================

[S3-FIFO]_
----------

[S3-FIFO]_ is a novel fifo-based cache algorithm. It might seem odd to
pick it as the default rather than a "tried and true" LRU_, but its
principles are interesting and on our sample it shows very good
performance for an acceptable implementation complexity.
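
In broad strokes, the algorithm can be sketched as below (a simplified
illustration following the paper, not ua-parser's implementation; the
10%/90% queue split and the frequency cap of 3 follow the paper)::

    from collections import deque

    class S3Fifo:
        def __init__(self, size):
            self.small_cap = max(1, size // 10)    # ~10%, probationary
            self.main_cap = size - self.small_cap  # ~90%, protected
            self.small, self.main = deque(), deque()
            self.ghost = deque(maxlen=self.main_cap)  # evicted keys only
            self.entries = {}  # key -> [value, frequency]

        def get(self, key):
            entry = self.entries.get(key)
            if entry is None:
                return None
            entry[1] = min(entry[1] + 1, 3)  # a hit only bumps a counter
            return entry[0]

        def put(self, key, value):
            if key in self.entries:
                self.entries[key][0] = value
                return
            while len(self.entries) >= self.small_cap + self.main_cap:
                self._evict()
            self.entries[key] = [value, 0]
            if key in self.ghost:        # seen recently: straight to main
                self.main.append(key)
            else:
                self.small.append(key)

        def _evict(self):
            if len(self.small) >= self.small_cap:
                key = self.small.popleft()
                if self.entries[key][1] > 0:
                    self.main.append(key)  # re-accessed: promote to main
                    return
                del self.entries[key]      # one-hit wonder: only its key
                self.ghost.append(key)     # is remembered
                return
            while self.main:
                key = self.main.popleft()
                entry = self.entries[key]
                if entry[1] > 0:
                    entry[1] -= 1          # second chance, requeue
                    self.main.append(key)
                else:
                    del self.entries[key]
                    return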

Advantages
''''''''''

- excellent hit rates
- thread-safe on hits
- excellent handling of one-hit wonders (entries unique to the data
  set) and rare hitters (entries whose occurrences are separated by a
  lot of other traffic)
- flexible implementation

Drawbacks
'''''''''

- O(n) eviction
- somewhat demanding on memory, especially at small sizes

Space
'''''

An S3Fifo of size n is composed of:

- one :ref:`dict` of size 1.9 * n
- three :ref:`deque` of sizes 0.1 * n, 0.9 * n, and 0.9 * n

[SIEVE]_
--------

[SIEVE]_ is another novel fifo-based algorithm. A cousin of S3Fifo, it
works on a somewhat different principle. It has good performance and a
more straightforward implementation than S3, but it is strongly wedded
to linked lists as it needs to remove entries from the middle of the
fifo (whereas S3 uses strict fifos).
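
The core of the algorithm fits in a short sketch (again illustrative,
not ua-parser's implementation): hits only set a flag on the node, and
eviction sweeps a "hand" from the oldest entry towards the newest,
evicting the first unvisited node it finds::

    class _Node:
        __slots__ = ("key", "value", "visited", "prev", "next")

    class Sieve:
        def __init__(self, size):
            self.size = size
            self.map = {}     # key -> node
            self.head = None  # newest entry
            self.tail = None  # oldest entry
            self.hand = None  # where the eviction sweep resumes

        def get(self, key):
            node = self.map.get(key)
            if node is None:
                return None
            node.visited = True  # a hit only sets a flag: no locking
            return node.value

        def put(self, key, value):
            if key in self.map:
                self.map[key].value = value
                return
            if len(self.map) >= self.size:
                self._evict()
            node = _Node()
            node.key, node.value, node.visited = key, value, False
            node.prev, node.next = None, self.head
            if self.head is not None:
                self.head.prev = node
            self.head = node
            if self.tail is None:
                self.tail = node
            self.map[key] = node

        def _evict(self):
            node = self.hand or self.tail
            while node.visited:       # visited nodes get a second chance
                node.visited = False
                node = node.prev or self.tail  # wrap from newest to oldest
            self.hand = node.prev
            # unlink the victim, possibly from the middle of the list
            if node.prev is not None:
                node.prev.next = node.next
            else:
                self.head = node.next
            if node.next is not None:
                node.next.prev = node.prev
            else:
                self.tail = node.prev
            del self.map[node.key]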

Advantages
''''''''''

- good hit rates
- thread-safe on hits
- memory efficient

Drawbacks
'''''''''

- O(n) eviction

Space
'''''

A SIEVE of size n is composed of:

- a :ref:`dict` of size n
- a linked list with n :ref:`nodes of 4 pointers each <class>`

LRU
---

The grandpappy of non-trivial cache eviction, it's mostly included as
a safety net in case users encounter workloads for which the
fifo-based algorithms completely fall over (do report them, I'm sure
the authors would be interested).
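
Its principle is a few lines away from
:class:`~collections.OrderedDict`; a minimal sketch (illustrative, not
necessarily ua-parser's exact implementation)::

    from collections import OrderedDict

    class Lru:
        def __init__(self, size):
            self.size = size
            self.entries = OrderedDict()

        def get(self, key):
            value = self.entries.get(key)
            if value is not None:
                # hits reorder entries, so they need synchronisation
                self.entries.move_to_end(key)
            return value

        def put(self, key, value):
            self.entries[key] = value
            self.entries.move_to_end(key)
            if len(self.entries) > self.size:
                self.entries.popitem(last=False)  # drop the oldest, O(1)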

Advantages
''''''''''

- basically built into the Python stdlib (via
  :class:`~collections.OrderedDict`)
- O(1) eviction
- nobody ever got evicted for using an LRU

Drawbacks
'''''''''

- must be synchronised on hit: entries are moved
- poor hit rates

Space
'''''

An LRU of size n is composed of:

- an :ref:`ordered dict <odict>` of size n

Memory analysis of Python objects
=================================

Measurements are as of Python 3.11, on a 64-bit platform. The
information is the overhead of the object itself, not the data it
stores, e.g. if an object stores strings, the sizes of the strings are
not included in the calculations.

.. _class:

``class``
---------

With ``__slots__``, a Python object is 32 bytes + 8 bytes for each
member. An additional 8 bytes is necessary for weakref support
(slotted objects in UA-Parser don't have weakref support).

Without ``__slots__``, a Python object is 48 bytes plus an instance
:ref:`dict`.

.. note:: The instance dict is normally key-sharing, which is not
   included in the analysis, see :pep:`412`.
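
These numbers are easy enough to sanity-check (on other versions or
platforms the exact figures may differ slightly)::

    import sys

    class Slotted:
        __slots__ = ("prev", "next")

    class Plain:
        pass

    # On a 64-bit CPython 3.11, the slotted instance should come out
    # to 32 + 2 * 8 = 48 bytes; the plain instance reports its base
    # size only, as the instance dict is created lazily.
    print(sys.getsizeof(Slotted()))
    print(sys.getsizeof(Plain()))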

.. _dict:

``dict``
--------

Python's ``dict`` is a relatively standard hash map, but it has a bit
of a twist in that it stores the *entries* in a dense array, which
only needs to be sized up to the dict's load factor, while the sparse
array used for hash lookups (which needs to be sized to match
capacity) only holds indexes into the dense array. This also allows
the *size* of the indices to only be as large as needed to index into
the dense array, so for small dicts the sparse array is an array of
bytes (8 bits).

*However*, because the dense array of entries is used as a stack (only
the last entry can be replaced), a dict which "churns" (entries get
added and removed without the size changing) whose size is close to
the next break-point would need to be compacted frequently, leading to
poor performance.

As a result, although a dictionary being created or added to will just
be the next size up, a dict with a lot of churn will be two sizes up
to limit the amount of compaction necessary, e.g. 10000 entries would
fit in ``2**14`` (capacity 16384, for a usable size of 10922) but the
dict may be sized up to ``2**15`` (capacity 32768, for a usable size
of 21845).

Python dicts also have a concept of *key kinds* which influences parts
of the layout. As of 3.12 there are 3 kinds, called
``DICT_KEYS_GENERAL``, ``DICT_KEYS_UNICODE``, and ``DICT_KEYS_SPLIT``.
This is relevant here because UA-Parser caches are keyed on strings,
which means they should always use the ``DICT_KEYS_UNICODE`` kind.

In the ``DICT_KEYS_GENERAL`` layout, each entry of the dense array has
to store three pointer-sized items: a pointer to the key, a pointer to
the value, and a cached version of the key hash. However, since
strings memoize their hash internally, the ``DICT_KEYS_UNICODE``
layout retrieves the hash value from the key itself when needed and
can save 8 bytes per entry.

Thus the space necessary for a dict is:

- the standard 4-pointer object header (``prev``, ``next``, and type
  pointers, and reference count)
- ``ma_size``, 8 bytes, the number of entries
- ``ma_version_tag``, 8 bytes, deprecated
- ``ma_keys``, a pointer to the dict entries
- ``ma_values``, a pointer to the split values in the
  ``DICT_KEYS_SPLIT`` layout (not relevant for UA-Parser)

The dict entries then are:

- ``dk_refcnt``, an 8-byte refcount (used for the ``DICT_KEYS_SPLIT``
  layout)
- ``dk_log2_size``, 1 byte, the total capacity of the hash map, as a
  power of two
- ``dk_log2_index_bytes``, 1 byte, the size of the sparse indexes
  array in bytes, as a power of two; it essentially memoizes the log2
  size of the sparse indexes array by incrementing ``dk_log2_size`` by
  3 if above 32, 2 if above 16, and 1 if above 8

  .. note::

     This means the dict bumps up the indexes array a bit early to
     avoid having to resize again within a ``dk_log2_size``, e.g. at
     171 elements the dict will move to size 9 (total capacity 512,
     usable capacity 341) and the index size will immediately get
     bumped to 10 even though it could still fit ~80 additional items
     with a u8 index.

- ``dk_kind``, 1 byte, the key kind explained above
- ``dk_version``, 4 bytes, used for some internal optimisations of
  cpython
- ``dk_usable``, 8 bytes, the number of usable entries in the dense
  array
- ``dk_nentries``, 8 bytes, the number of used entries in the dense
  array; this can't be computed from ``dk_usable`` and
  ``dk_log2_size`` because (from the mention of ``DKIX_DUMMY`` I
  assume) ``dk_usable`` is used to know when the dict needs to be
  compacted or resized, and because python uses open addressing and
  leaves tombstones (``DKIX_DUMMY``) in the sparse array, which matter
  for collision performance and thus load calculations
- ``dk_indices``, the sparse array of size
  ``1 << dk_log2_index_bytes``
- ``dk_entries``, the dense array of size
  ``USABLE_FRACTION(1 << dk_log2_size) * 16``

.. note:: ``USABLE_FRACTION`` is 2/3

Thus the space formula for dicts -- in the context of string-indexed
caches -- is::

    32 + 32 + 32
    + 2**(ceil(log2(n)) + 1) * ceil(log256(n))
    + floor(2/3 * 2**ceil(log2(n)) + 1) * 16
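
Or, as a Python translation (a hypothetical helper mirroring the
formula above literally)::

    from math import ceil, floor, log, log2

    def dict_overhead(n):
        """Bytes of overhead for a churning dict of n string keys."""
        return (
            32 + 32 + 32                                    # fixed headers
            + 2 ** (ceil(log2(n)) + 1) * ceil(log(n, 256))  # sparse indices
            + floor(2 / 3 * 2 ** ceil(log2(n)) + 1) * 16    # dense entries
        )

    print(dict_overhead(10_000))  # 240400, i.e. ~24 bytes/entry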

.. _odict:

``collections.OrderedDict``
---------------------------

While CPython has a pure-python ``OrderedDict``, it's not actually
used; instead, a native implementation with a native doubly linked
list and a bespoke secondary hashmap is used, leading to a much denser
collection than achievable in Python. The broad strokes are similar
though:

- a regular ``dict`` links keys to values
- a secondary hashmap links keys to *nodes* of the linked list,
  allowing reordering entries easily

The secondary hashmap is only composed of a dense array of nodes,
using the internal details of the dict in order to handle lookups in
the sparse array and collision resolution. Unlike ``dict``, however,
it's sized to the dict's capacity rather than ``USABLE_FRACTION``
thereof.

The entire layout is:

- a full dict object (see above), inline
- pointers to the first and last nodes of the doubly linked list
- a pointer to the array of nodes
- ``od_fast_nodes_size``, 8 bytes, which is used to see if the
  underlying dict has been resized
- ``*od_resize_sentinel``, which is *also* used to see if the
  underlying dict has been resized (a pointer to the dict entries
  object)
- ``od_state``, 8 bytes, to check for concurrent mutations during
  iteration
- ``od_inst_dict``, 8 bytes, used to provide a fake ``__dict__`` and
  better imitate ``dict``
- ``od_weakreflist``, 8 bytes, weakref support

And each node in the linked list is 4 pointers: previous, next, key,
and hash.

.. note::

   The hash is (likely) there to speed up lookups, since going from an
   odict node to a dict entry requires a full lookup, and such a
   lookup is what happens during iteration (except it uses a regular
   ``PyDict_GetItem`` instead of a low-level lookup, why?).

So the ordereddict space requirement formula is::

    dict(n) + 64 + 8 * 2**(ceil(log2(n)) + 1) + 32 * n

Because its capacity matches dict's, like dict's it is double what's
strictly required, to amortise churn.
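
Translated to Python (assuming the hypothetical ``dict_overhead``
helper sketched in the dict section above)::

    from math import ceil, log2

    def odict_overhead(n):
        return (
            dict_overhead(n)                # the inline dict, see above
            + 64                            # extra OrderedDict fields
            + 8 * 2 ** (ceil(log2(n)) + 1)  # fast-nodes array (capacity)
            + 32 * n                        # one 4-pointer node per entry
        )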

.. _deque:

``collections.deque``
---------------------

Deque is an unrolled doubly linked list of order 64, that is, every
node of the linked list stores 64 items, plus two pointers for the
previous and next links. Note that the deque always allocates a block
upfront (nb: why not allocate on use?).

The deque metadata (excluding the blocks) is 232 bytes:

- the standard 32-byte object header (next pointer, previous pointer,
  refcount, and type pointer)
- the ``ob_size`` of a var-object, apparently used to store the number
  of items as the deque does not track its blocks' sizes
- pointers to the left and right blocks
- offsets into the left and right blocks (as they may only be
  partially filled)
- ``state``, a mutation counter used to track mutations during
  iteration
- ``maxlen``, in case the deque is length-bounded
- ``numfreeblocks``, the actual size of the freelist
- ``freelist``, 16 pointers to already allocated available blocks
- ``weakreflist``, the weakref support pointer

So the deque space requirement formula is::

    232 + max(1, ceil(n / 64)) * 66 * 8
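
And in Python, together with a combined estimate for the S3Fifo
composition given earlier (both ``dict_overhead`` and the helper names
are assumptions carried over from the sketches above)::

    from math import ceil

    def deque_overhead(n):
        # 232 bytes of metadata plus one block (64 slots + 2 links,
        # i.e. 66 pointers) per 64 items, with at least one block.
        return 232 + max(1, ceil(n / 64)) * 66 * 8

    def s3fifo_overhead(n):
        # One dict of size 1.9 * n and three deques of sizes 0.1 * n,
        # 0.9 * n, and 0.9 * n, per the S3-FIFO space section.
        return (
            dict_overhead(ceil(1.9 * n))
            + deque_overhead(ceil(0.1 * n))
            + 2 * deque_overhead(ceil(0.9 * n))
        )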

:func:`~functools.lru_cache`
----------------------------

While not strictly relevant to ua-parser, it should be noted that
:func:`~functools.lru_cache` is *not* built on
:class:`~collections.OrderedDict`; it has its own native
implementation which uses a single dict and a different bespoke
doubly linked list with larger nodes (9 pointers).

.. [S3-FIFO] Juncheng Yang, Yazhuo Zhang, Ziyue Qiu, Yao Yue, Rashmi
   Vinayak. 2023. FIFO queues are all you need for cache eviction.
   SOSP '23. https://dl.acm.org/doi/10.1145/3600006.3613147

.. [SIEVE] Yazhuo Zhang, Juncheng Yang, Yao Yue, Ymir Vigfusson,
   K. V. Rashmi. 2023. SIEVE is Simpler than LRU: an Efficient
   Turn-Key Eviction Algorithm for Web Caches. NSDI '24.
   https://junchengyang.com/publication/nsdi24-SIEVE.pdf

doc/index.rst

Lines changed: 2 additions & 0 deletions
@@ -9,9 +9,11 @@ For more detailed insight and advanced uses, see the :doc:`api` and
 :doc:`guides`.

 .. toctree::
+   :maxdepth: 2
    :caption: Contents:

    installation
    quickstart
    guides
    api
+   advanced/caches
