=========
On Caches
=========

Evaluating Caches
=================

UA-Parser tries to provide a somewhat decent cache by default, but
cache algorithms react differently to traffic patterns, and setups can
have different amounts of space to dedicate to cache overhead.

Thus, ua-parser also provides some tooling to try and evaluate
fitness, in the form of two built-in command-line scripts. Both
scripts take a mandatory *sample file* in order to provide an
evaluation on representative traffic. This sample file should be a
representative sample of your real-world traffic (no sorting, no
deduplicating, ...).

``python -mua_parser hitrates``
-------------------------------

As its name indicates, the ``hitrates`` script measures the hit rates
of ua-parser's available caches by simulating cache use at various
sizes on the sample file. It also reports the memory overhead of each
cache implementation at those sizes, both in total and per entry.

.. warning::

    The cache overhead does not include the size of the cached entries
    themselves, which is generally 500 to 700 bytes for a complete
    entry (all three domains matched).

``hitrates`` also includes Bélády's MIN (aka OPT) algorithm for
reference. MIN is not a practical cache as it requires knowledge of
the future, but it provides the theoretical upper bound at a given
cache size (very theoretical: practical cache algorithms tend to lag
far behind until cache sizes close in on the total number of unique
values in the dataset).
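
For intuition, MIN can be simulated offline because the whole trace is
known in advance: on eviction it drops the cached key whose next use
lies furthest in the future. A minimal sketch (hypothetical, not the
script's actual implementation):

```python
def min_hit_rate(trace, size):
    """Hit rate of Bélády's MIN on `trace` with `size` cache slots."""
    # Precompute, for each position, when that key is next used.
    next_use = [0] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # key -> position of its next use
    hits = 0
    for i, key in enumerate(trace):
        if key in cache:
            hits += 1
        elif len(cache) >= size:
            # evict the entry whose next use is furthest away
            del cache[max(cache, key=cache.get)]
        cache[key] = next_use[i]
    return hits / len(trace)
```

On a cyclic trace like ``a b c a b c`` with two slots, an LRU scores
zero hits while MIN still manages a third; that is exactly the kind of
gap the reference column makes visible.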

``hitrates`` has the advantage of being very cheap, as it only
exercises the caches themselves and barely looks at the data.

``python -mua_parser bench``
----------------------------

``bench`` is much more expensive in both CPU and wallclock time, as
it actually runs the base resolvers, combined with various caches at
various sizes. For usability, it can report its data (the average
parse time per input entry) both as human-readable text with one
result per line and as CSV with resolver configurations as the
columns and cache sizes as the rows.

``hitrates`` is generally sufficient: for the same base resolver,
performance tends to more or less follow hit rates, as a cache hit is
close to free compared to a cache miss. This is truer for the basic
resolver, for which misses tend to be very expensive. ``bench`` is
mostly useful to validate or tie-break decisions based on
``hitrates``, and allows creating nice graphs in your spreadsheet
software of choice.

Cache Algorithms
================

[S3-FIFO]_
----------

[S3-FIFO]_ is a novel fifo-based cache algorithm. It might seem odd to
pick it as the default rather than a "tried and true" LRU_, but its
principles are interesting and on our sample it shows very good
performance for an acceptable implementation complexity.

Advantages
''''''''''

- excellent hit rates
- thread-safe on hits
- excellent handling of one-hit wonders (entries unique to the data
  set) and rare repeats (entries recurring with a lot of separation)
- flexible implementation

Drawbacks
'''''''''

- O(n) eviction
- somewhat demanding on memory, especially at small sizes

Space
'''''

An S3Fifo of size n is composed of:

- one :ref:`dict` of size 1.9 * n
- three :ref:`deque` of sizes 0.1 * n, 0.9 * n, and 0.9 * n
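
The moving parts can be sketched as follows (a simplified toy,
assuming a small queue of ~10% of capacity and frequency counters
capped at 3; ua-parser's actual implementation differs in the
details):

```python
from collections import deque

class S3Fifo:
    """Toy S3-FIFO: a small probationary FIFO, a main FIFO, and a
    ghost FIFO remembering recently evicted keys (keys only)."""

    def __init__(self, size):
        self.cap = size
        self.small_cap = max(1, size // 10)
        self.data = {}
        self.freq = {}          # access count, capped at 3
        self.small = deque()
        self.main = deque()
        self.ghost = deque()    # evicted keys, values long gone
        self.ghost_set = set()

    def get(self, key):
        if key in self.data:
            # a hit only bumps a counter, no structure moves:
            # this is what makes hits thread-safe
            self.freq[key] = min(self.freq[key] + 1, 3)
            return self.data[key]
        return None

    def put(self, key, value):
        if key in self.data:
            self.data[key] = value
            return
        while len(self.data) >= self.cap:
            self._evict()
        if key in self.ghost_set:
            self.ghost_set.discard(key)
            self.main.append(key)   # seen recently: skip probation
        else:
            self.small.append(key)
        self.data[key] = value
        self.freq[key] = 0

    def _evict(self):
        if len(self.small) >= self.small_cap:
            key = self.small.popleft()
            if self.freq[key] > 0:
                self.main.append(key)   # re-used while probationary
            else:
                # one-hit wonder: drop the value, remember the key
                del self.data[key], self.freq[key]
                self.ghost.append(key)
                self.ghost_set.add(key)
                if len(self.ghost) > self.cap:
                    self.ghost_set.discard(self.ghost.popleft())
        else:
            key = self.main.popleft()
            if self.freq[key] > 0:
                self.freq[key] -= 1
                self.main.append(key)   # second chance, demoted
            else:
                del self.data[key], self.freq[key]
```

Note how a one-hit wonder only ever occupies the small queue before
being demoted to a key-only ghost entry, which is why the algorithm
handles them so well.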

[SIEVE]_
--------

[SIEVE]_ is another novel fifo-based algorithm. A cousin of S3Fifo,
it works on a somewhat different principle. It has good performance
and a more straightforward implementation than S3, but it is strongly
wedded to linked lists as it needs to remove entries from the middle
of the fifo (whereas S3 uses strict fifos).

Advantages
''''''''''

- good hit rates
- thread-safe on hits
- memory efficient

Drawbacks
'''''''''

- O(n) eviction

Space
'''''

A SIEVE of size n is composed of:

- a :ref:`dict` of size n
- a linked list with n :ref:`nodes of 4 pointers each <class>`
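
The hand-and-visited-bit mechanics can be sketched in a few lines (a
toy using a plain list, so mid-removal is O(n); the real
implementation's linked list makes that removal O(1)):

```python
class Sieve:
    """Toy SIEVE: a FIFO scanned by a "hand" that spares visited
    entries once, evicting the first unvisited entry it finds."""

    def __init__(self, size):
        self.size = size
        self.data = {}
        self.visited = {}
        self.order = []  # index 0 = oldest insertion
        self.hand = 0    # scans from oldest towards newest

    def get(self, key):
        if key in self.data:
            self.visited[key] = True  # hits only flip a bit
            return self.data[key]
        return None

    def put(self, key, value):
        if key in self.data:
            self.data[key] = value
            return
        if len(self.data) >= self.size:
            self._evict()
        self.data[key] = value
        self.visited[key] = False
        self.order.append(key)

    def _evict(self):
        while True:
            if self.hand >= len(self.order):
                self.hand = 0  # wrap back to the oldest entry
            key = self.order[self.hand]
            if self.visited[key]:
                self.visited[key] = False  # second chance
                self.hand += 1
            else:
                # removal from the middle of the fifo
                del self.data[key], self.visited[key]
                self.order.pop(self.hand)
                return
```

Because the hand keeps its position between evictions, recently
inserted unvisited entries get sieved out quickly while repeatedly
visited ones survive without ever being moved.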

LRU
---

The grandpappy of non-trivial cache eviction, it's mostly included as
a safety in case users encounter workloads for which the fifo-based
algorithms completely fall over (do report them, I'm sure the authors
would be interested).

Advantages
''''''''''

- essentially built into the Python stdlib (via
  :class:`~collections.OrderedDict`)
- O(1) eviction
- nobody ever got evicted for using an LRU

Drawbacks
'''''''''

- must be synchronised on hit: entries are moved
- poor hit rates

Space
'''''

An LRU of size n is composed of:

- an :ref:`ordered dict <odict>` of size n
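
A minimal version of the ``OrderedDict``-based approach (a sketch,
not ua-parser's exact implementation):

```python
from collections import OrderedDict

class Lru:
    """Minimal LRU over OrderedDict: most recent entry at the end."""

    def __init__(self, size):
        self.size = size
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            # a hit mutates the structure, hence the need to
            # synchronise even reads in a threaded setting
            self.data.move_to_end(key)
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.size:
            self.data.popitem(last=False)  # evict the oldest, O(1)
```

The ``move_to_end`` on every hit is the drawback listed above: unlike
the fifo-based caches, reads cannot be lock-free.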

Memory analysis of Python objects
=================================

Measured as of Python 3.11, on a 64-bit platform. The information is
the overhead of the object itself, not the data it stores, e.g. if an
object stores strings, the sizes of the strings are not included in
the calculations.

.. _class:

``class``
---------

With ``__slots__``, a Python object is 32 bytes, plus 8 bytes for each
member. An additional 8 bytes is necessary for weakref support
(slotted objects in UA-Parser don't have weakref support).

Without ``__slots__``, a Python object is 48 bytes plus an instance
:ref:`dict`.

.. note:: The instance dict is normally key-sharing, which is not
    included in the analysis, see :pep:`412`.

.. _dict:

``dict``
--------

Python's ``dict`` is a relatively standard hash map, but with a bit of
a twist: it stores the *entries* in a dense array, which only needs to
be sized up to the dict's load factor, while the sparse array used for
hash lookups (which needs to be sized to match capacity) only holds
indexes into the dense array. This also allows the *size* of the
indices to only be as large as needed to index into the dense array,
so for small dicts the sparse array is an array of bytes (8 bits).

*However*, because the dense array of entries is used as a stack (only
the last entry can be replaced), if a dict "churns" (entries get added
and removed without the size changing) while its size is close to the
next break-point, it would need to be compacted frequently, leading to
poor performance.

As a result, although a dictionary being created or added to will be
just the next size up, a dict with a lot of churn will be two sizes up
to limit the amount of compaction necessary, e.g. 10000 entries would
fit in ``2**14`` (capacity 16384, for a usable size of 10922) but the
dict may be sized up to ``2**15`` (capacity 32768, for a usable size
of 21845).
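
These break-points can be checked against CPython's
``USABLE_FRACTION`` macro (2/3 of capacity, computed with integer
division):

```python
def usable(log2_size):
    """Usable entries of a dict with capacity 2**log2_size,
    mirroring CPython's USABLE_FRACTION: (capacity * 2) // 3."""
    return (2 ** log2_size * 2) // 3

# capacity 16384 -> 10922 usable, capacity 32768 -> 21845 usable
break_points = {14: usable(14), 15: usable(15)}
```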

Python dicts also have a concept of *key kinds* which influences parts
of the layout. As of 3.12 there are 3 kinds, called
``DICT_KEYS_GENERAL``, ``DICT_KEYS_UNICODE``, and ``DICT_KEYS_SPLIT``.
This is relevant here because UA-Parser caches are keyed on strings,
which means they should always use the ``DICT_KEYS_UNICODE`` kind.

In the ``DICT_KEYS_GENERAL`` layout, each entry of the dense array has
to store three pointer-sized items: a pointer to the key, a pointer to
the value, and a cached version of the key's hash. However, since
strings memoize their hash internally, the ``DICT_KEYS_UNICODE``
layout retrieves the hash value from the key itself when needed and
saves 8 bytes per entry.

Thus the space necessary for a dict is:

- the standard 4-pointer object header (``prev``, ``next``, and type
  pointers, and reference count)
- ``ma_size``, 8 bytes, the number of entries
- ``ma_version_tag``, 8 bytes, deprecated
- ``ma_keys``, a pointer to the dict entries
- ``ma_values``, a pointer to the split values in ``DICT_KEYS_SPLIT``
  layout (not relevant for UA-Parser)

The dict entries then are:

- ``dk_refcnt``, an 8-byte refcount (used for the ``DICT_KEYS_SPLIT``
  layout)
- ``dk_log2_size``, 1 byte, the total capacity of the hash map, as a
  power of two
- ``dk_log2_index_bytes``, 1 byte, the size in bytes of the sparse
  indexes array, as a power of two; it essentially memoizes the log2
  size of the sparse indexes array by incrementing ``dk_log2_size`` by
  3 if above 32, 2 if above 16, and 1 if above 8

  .. note::

     This means the dict bumps up the indexes array a bit early, to
     avoid having to resize again within a ``dk_log2_size``, e.g. at
     171 elements the dict will move to size 9 (total capacity 512,
     usable capacity 341) and the index size will immediately get
     bumped to 10 even though it could still fit ~80 additional items
     with a u8 index.

- ``dk_kind``, 1 byte, the key kind explained above
- ``dk_version``, 4 bytes, used for some internal optimisations of
  cpython
- ``dk_usable``, 8 bytes, the number of usable entries in the dense array
- ``dk_nentries``, 8 bytes, the number of used entries in the dense
  array; this can't be computed from ``dk_usable`` and
  ``dk_log2_size``: from the mention of ``DKIX_DUMMY`` I assume
  ``dk_usable`` is used to know when the dict needs to be compacted
  or resized, because python uses open addressing and leaves
  tombstones (``DKIX_DUMMY``) in the sparse array, and those matter
  for collision performance and thus load calculations
- ``dk_indices``, the sparse array of size
  ``1<<dk_log2_index_bytes``
- ``dk_entries``, the dense array of size
  ``USABLE_FRACTION(1<<dk_log2_size) * 16``

  .. note:: ``USABLE_FRACTION`` is 2/3

Thus the space formula for dicts -- in the context of string-indexed
caches -- is::

   32 + 32 + 32
   + 2**(ceil(log2(n)) + 1) * ceil(log256(n))
   + floor(2/3 * 2**(ceil(log2(n)) + 1)) * 16
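
The formula translates directly to Python (a sketch assuming
``n > 1``, using the churn-amortised doubled capacity throughout):

```python
from math import ceil, floor, log, log2

def dict_space(n):
    """Approximate overhead in bytes of a churny string-keyed dict
    holding n entries (CPython 3.11, 64-bit)."""
    capacity = 2 ** (ceil(log2(n)) + 1)  # two sizes up for churn
    index_bytes = ceil(log(n, 256))      # 1, 2, 4 or 8 per slot
    entries = floor(2 / 3 * capacity)    # USABLE_FRACTION
    return 96 + capacity * index_bytes + entries * 16
```

For n = 10000 this evaluates to 415152 bytes (about 405 KiB), i.e.
roughly 42 bytes of dict overhead per cached entry.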

.. _odict:

``collections.OrderedDict``
---------------------------

While CPython has a pure-python ``OrderedDict``, it's not actually
used: instead there is a native implementation, with a native doubly
linked list and a bespoke secondary hashmap, leading to a much denser
collection than achievable in Python. The broad strokes are similar
though:

- a regular ``dict`` links keys to values
- a secondary hashmap links keys to *nodes* of the linked list,
  allowing reordering entries easily

The secondary hashmap is only composed of a dense array of nodes,
using the internal details of the dict in order to handle lookups in
the sparse array and collision resolution. Unlike the dict's dense
array, however, it's sized to the dict's full capacity rather than
``USABLE_FRACTION`` thereof.

The entire layout is:

- a full dict object (see above), inline
- pointers to the first and last nodes of the doubly linked list
- a pointer to the array of nodes
- ``od_fast_nodes_size``, 8 bytes, used to see if the underlying dict
  has been resized
- ``od_resize_sentinel``, a pointer to the dict's entries object,
  which is *also* used to see if the underlying dict has been resized
- ``od_state``, 8 bytes, to check for concurrent mutations during
  iteration
- ``od_inst_dict``, 8 bytes, used to provide a fake ``__dict__`` and
  better imitate ``dict``
- ``od_weakreflist``, 8 bytes, weakref support

And each node in the linked list is 4 pointers: previous, next, key,
and hash.

.. note::

   The hash is (likely) stored to speed up lookups, since going from
   an odict node to a dict entry requires a full lookup; such a
   lookup is what happens during iteration, except it uses a regular
   ``PyDict_GetItem`` instead of a low-level lookup, why?

So the ordereddict space requirement formula is::

   dict(n) + 64 + 8 * 2**(ceil(log2(n)) + 1) + 32 * n

Because the node array's sizing matches the dict's, its capacity is
likewise double what's strictly required, to amortise churn.

.. _deque:

``collections.deque``
---------------------

Deque is an unrolled doubly linked list of order 64, that is, every
node of the linked list stores 64 items, plus two pointers for the
previous and next links. Note that the deque always allocates a block
upfront (nb: why not allocate on use?).

The deque metadata (excluding the blocks) is 232 bytes:

- the standard 32-byte object header (next pointer, previous pointer,
  refcount, and type pointer)
- the ``ob_size`` of a VAR_OBJ, apparently used to store the number of
  items, as the deque does not track its blocks' sizes
- pointers to the left and right blocks
- offsets into the left and right blocks (as they may only be
  partially filled)
- ``state``, a mutation counter used to track mutations during
  iteration
- ``maxlen``, in case the deque is length-bounded
- ``numfreeblocks``, the actual size of the freelist
- ``freelist``, 16 pointers to already-allocated available blocks
- ``weakreflist``, the weakref support pointer

So the deque space requirement formula is::

   232 + max(1, ceil(n / 64)) * 66 * 8
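
As a quick sketch of the formula (the S3Fifo deque sizes in the
example come from the Space section above):

```python
from math import ceil

def deque_space(n):
    """Overhead in bytes of a deque holding n items: 232 bytes of
    metadata plus one 66-pointer block per 64 items (minimum one)."""
    return 232 + max(1, ceil(n / 64)) * 66 * 8

# e.g. the three deques of an S3Fifo cache of size 10000:
s3_deques = deque_space(1000) + 2 * deque_space(9000)
```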

:func:`~functools.lru_cache`
----------------------------

While not strictly relevant to ua-parser, it should be noted that
:func:`~functools.lru_cache` is *not* built on
:class:`~collections.OrderedDict`: it has its own native
implementation, which uses a single dict and a different bespoke
doubly linked list with larger nodes (9 pointers).

.. [S3-FIFO] Juncheng Yang, Yazhuo Zhang, Ziyue Qiu, Yao Yue, Rashmi
   Vinayak. 2023. FIFO queues are all you need for cache eviction.
   SOSP '23. https://dl.acm.org/doi/10.1145/3600006.3613147

.. [SIEVE] Yazhuo Zhang, Juncheng Yang, Yao Yue, Ymir Vigfusson,
   K. V. Rashmi. 2023. SIEVE is Simpler than LRU: an Efficient
   Turn-Key Eviction Algorithm for Web Caches. NSDI '24.
   https://junchengyang.com/publication/nsdi24-SIEVE.pdf