Add compression to replicator#6013
Conversation
|
Can you please use our PR template? |
There was a problem hiding this comment.
Nice work @lacklacklack!
A few suggestions / ideas as bullet points:
-
We're trying to do a bit of both: request and response sides; let's keep it simpler at first and focus on the request side (_bulk_docs and _revs_diff). Since we already have server side handling for gzip, let's handle gzip only for the request side. Then we can cleanly test the feature in CI without extra proxies or having to add server side compression sending here too.
-
Let's skip
deflateand use gzip only. But add the possibility to addzstdin the future (not this PR though) so keep the configurable compression algorithm. The reason to skip deflate is simplicity (our server handles gzip only), and deflate is a bit of a mess according to https://en.wikipedia.org/wiki/HTTP_compression
Another problem found while deploying HTTP compression on large scale is due to the deflate encoding definition: while HTTP 1.1 defines the deflate encoding as data compressed with deflate (RFC 1951) inside a zlib formatted stream (RFC 1950), Microsoft server and client products historically implemented it as a "raw" deflated stream making its deployment unreliable. For this reason, some software, including the Apache HTTP Server, only implements gzip encoding.
-
As it stands, the most important thing to compress (_bulk_docs body) won't actually be compressed. The body there isn't an iolist or binary but
{BodyFun, [prefix | Docs]}}. So that makes me think maybe a better place for this is not in httpc but in api_wrap -
Do not set
AcceptEncodings = config:get("replicator", "accept_encodings", "gzip, deflate, zstd")unless we can always handle these responses and decompress them. If the server then sends us zstd data and we're on OTP 27 we won't be able to handle it and fail the request. For this pr let's just skip setting that altogether -
Do not enable gzip compression by default. Since that is not a negotiated setting, if the replicator was talking to an older CouchDB or other server not implementing gzip decompression we'd break a customers' setup as soon as they upgrade.
-
Don't forget to fill out the template like Jan suggested
ea146fb to
f1fe550
Compare
nickva
left a comment
There was a problem hiding this comment.
Looks much better!
Added a few more comments with some minor tweaks
One new major bit I thought of is if we can make this a per job overridable. In theory we have https://docs.couchdb.org/en/stable/config/replicator.html#replicator/worker_processes to copy from (and a few others). But if that proves difficult we can punt it for later
| ; Compress outbound replication request bodies (_bulk_docs, _revs_diff) with gzip. | ||
| ; Disabled by default. Only gzip is supported. Enable only when talking to CouchDB | ||
| ; servers that support gzip Content-Encoding on inbound requests. | ||
| ;compress_requests = false |
There was a problem hiding this comment.
Maybe we can drive both the request compression toggle and the algorithm with just a single config.
We have [couchdb] file_compression = none | snappy | deflate | zstd maybe we can use the same scheme
[replicator]
request_compression = none | gzipEventually add zstd in the future perhaps.
| -export([stop_http_worker/0]). | ||
| -export([full_url/2]). | ||
|
|
||
|
|
There was a problem hiding this comment.
Avoid adding extra changes to the PR that are not relevant. We can do a separate PR later to cleanup or fix up whitespace.
| {Doc, Acc + iolist_size(Doc)} | ||
| end, | ||
| byte_size(Prefix) + byte_size(Suffix) + length(DocList) - 1, | ||
| 0, |
There was a problem hiding this comment.
I think it might be nice to only materialize the full body and compress it if we're compressing it (algorithm != none) otherwise keep it as a stream
{Body, Headers} =
case compress_requests(Len) of
true ->
FullBody = [Prefix, lists:join(<<",">>, Docs), Suffix],
gzip_request_body(FullBody, Headers0);
false ->
{{BodyFun, [prefix | Docs]}, Headers0}
end| case config:get_boolean("replicator", "compress_requests", false) of | ||
| true -> | ||
| Algorithm = config:get("replicator", "compression_algorithm", "gzip"), | ||
| MinSize = config:get_integer("replicator", "compress_min_size", 1024), |
There was a problem hiding this comment.
Tiny nit: pull 1024 as a define to the top (-define(COMPRESS_MIN_SIZE, 1024)) also can make a few defines for compression algorithms: "none", "gzip" that way we can ensure we don't mistype them as the compiler will check them for us automatically
|
|
||
| compress_with("gzip", Body) -> | ||
| Compressed = zlib:gzip(Body), | ||
| couch_stats:increment_counter([couch_replicator, requests_compressed]), |
| couch_stats:increment_counter([couch_replicator, requests_compressed]), | ||
| couch_stats:increment_counter([couch_replicator, requests_compressed, gzip]), | ||
| {Compressed, [{"Content-Encoding", "gzip"}]}; | ||
| compress_with(Other, Body) -> |
There was a problem hiding this comment.
If user specified an unusable compression we'd spam the logs potentially thousands of times a second with the warning. We could probably leave it out and user can judge by the metrics
| @@ -0,0 +1,16 @@ | |||
| # CouchDB Replicator Request Compression | |||
There was a problem hiding this comment.
Don't think we need this file. If it pertains to replicator internal design update the dev md file in the replicator app, otherwise documentation should go in src/docs and some in default.ini related to the replicator options
|
|
||
| %% Compress Body with gzip if enabled and body is large enough. | ||
| %% Returns {Body, ExtraHeaders} where ExtraHeaders may contain Content-Encoding. | ||
| maybe_compress(Body) when is_binary(Body) -> |
There was a problem hiding this comment.
Not sure if it would be much harder but this could also be a candidate to set per jobs and/or per cluster. For example replicating from a->b running on c we may want to compress requests to b but then for another job, a->d we might not, because d can't handle gzip. So it might be nice if this option can be the same as https://docs.couchdb.org/en/stable/config/replicator.html#replicator/worker_processes? If that looks too unwieldy or tricky we can skip it
Overview
Adds optional gzip compression of outbound request bodies in the replicator.
When enabled,
_bulk_docsand_revs_diffrequest bodies are gzip-compressedbefore sending. CouchDB already supports
Content-Encoding: gzipon inboundrequests so no server-side changes are needed.
Compression is disabled by default to avoid breaking setups with older or
non-CouchDB targets.
Testing recommendations
Enable compression with a low threshold and run a replication, then check the stats:
Automated tests:
couch_replicator_compression_testsinsrc/couch_replicator/test/eunit/.Related Issues or Pull Requests
Checklist
rel/overlay/etc/default.inisrc/docsfolder