Skip to content

Add compression to replicator#6013

Open
lacklacklack wants to merge 3 commits into
apache:mainfrom
lacklacklack:replicator-compression-support
Open

Add compression to replicator#6013
lacklacklack wants to merge 3 commits into
apache:mainfrom
lacklacklack:replicator-compression-support

Conversation

@lacklacklack

@lacklacklack lacklacklack commented May 22, 2026

Copy link
Copy Markdown

Overview

Adds optional gzip compression of outbound request bodies in the replicator.
When enabled, _bulk_docs and _revs_diff request bodies are gzip-compressed
before sending. CouchDB already supports Content-Encoding: gzip on inbound
requests so no server-side changes are needed.

Compression is disabled by default to avoid breaking setups with older or
non-CouchDB targets.

Testing recommendations

Enable compression with a low threshold and run a replication, then check the stats:

# Start CouchDB
./dev/run --admin=admin:admin

# Lower the threshold for testing
curl -X PUT "http://admin:admin@127.0.0.1:15984/_node/_local/_config/replicator/compress_requests" -d '"true"'
curl -X PUT "http://admin:admin@127.0.0.1:15984/_node/_local/_config/replicator/compress_min_size" -d '"10"'

# Create DBs, add docs, replicate
curl -X PUT http://admin:admin@127.0.0.1:15984/source_db
curl -X PUT http://admin:admin@127.0.0.1:25984/target_db
curl -X PUT http://admin:admin@127.0.0.1:15984/source_db/doc1 \
  -H "Content-Type: application/json" \
  -d '{"data":"'$(python3 -c "print('x'*2000)'"}'
curl -X POST http://admin:admin@127.0.0.1:15984/_replicate \
  -H "Content-Type: application/json" \
  -d '{"source":"http://admin:admin@127.0.0.1:15984/source_db","target":"http://admin:admin@127.0.0.1:25984/target_db"}'

# Check stats — value should be > 0
curl -s http://admin:admin@127.0.0.1:15984/_node/_local/_stats/couch_replicator/requests_compressed | python3 -m json.tool

Automated tests: couch_replicator_compression_tests in src/couch_replicator/test/eunit/.

Related Issues or Pull Requests

Checklist

  • This is my own work, I did not use AI, LLM's or similar technology
  • Code is written and works correctly
  • Changes are covered by tests
  • Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • Documentation changes were made in the src/docs folder
  • Documentation changes were backported (separated PR) to affected branches

@janl

janl commented May 29, 2026

Copy link
Copy Markdown
Member

Can you please use our PR template?

@nickva nickva left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @lacklacklack!

A few suggestions / ideas as bullet points:

  1. We're trying to do a bit of both: request and response sides; let's keep it simpler at first and focus on the request side (_bulk_docs and _revs_diff). Since we already have server side handling for gzip, let's handle gzip only for the request side. Then we can cleanly test the feature in CI without extra proxies or having to add server side compression sending here too.

  2. Let's skip deflate and use gzip only. But add the possibility to add zstd in the future (not this PR though) so keep the configurable compression algorithm. The reason to skip deflate is simplicity (our server handles gzip only), and deflate is a bit of a mess according to https://en.wikipedia.org/wiki/HTTP_compression

Another problem found while deploying HTTP compression on large scale is due to the deflate encoding definition: while HTTP 1.1 defines the deflate encoding as data compressed with deflate (RFC 1951) inside a zlib formatted stream (RFC 1950), Microsoft server and client products historically implemented it as a "raw" deflated stream making its deployment unreliable. For this reason, some software, including the Apache HTTP Server, only implements gzip encoding.

  1. As it stands, the most important thing to compress (_bulk_docs body) won't actually be compressed. The body there isn't an iolist or binary but {BodyFun, [prefix | Docs]}}. So that makes me think maybe a better place for this is not in httpc but in api_wrap

  2. Do not set AcceptEncodings = config:get("replicator", "accept_encodings", "gzip, deflate, zstd") unless we can always handle these responses and decompress them. If the server then sends us zstd data and we're on OTP 27 we won't be able to handle it and fail the request. For this pr let's just skip setting that altogether

  3. Do not enable gzip compression by default. Since that is not a negotiated setting, if the replicator was talking to an older CouchDB or other server not implementing gzip decompression we'd break a customers' setup as soon as they upgrade.

  4. Don't forget to fill out the template like Jan suggested

@lacklacklack lacklacklack requested a review from nickva June 25, 2026 15:07
@lacklacklack lacklacklack marked this pull request as ready for review June 25, 2026 15:08
@lacklacklack lacklacklack force-pushed the replicator-compression-support branch from ea146fb to f1fe550 Compare June 25, 2026 15:21

@nickva nickva left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks much better!

Added a few more comments with some minor tweaks

One new major bit I thought of is if we can make this a per job overridable. In theory we have https://docs.couchdb.org/en/stable/config/replicator.html#replicator/worker_processes to copy from (and a few others). But if that proves difficult we can punt it for later

; Compress outbound replication request bodies (_bulk_docs, _revs_diff) with gzip.
; Disabled by default. Only gzip is supported. Enable only when talking to CouchDB
; servers that support gzip Content-Encoding on inbound requests.
;compress_requests = false

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can drive both the request compression toggle and the algorithm with just a single config.

We have [couchdb] file_compression = none | snappy | deflate | zstd maybe we can use the same scheme

[replicator]
request_compression = none | gzip

Eventually add zstd in the future perhaps.

-export([stop_http_worker/0]).
-export([full_url/2]).


Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid adding extra changes to the PR that are not relevant. We can do a separate PR later to cleanup or fix up whitespace.

{Doc, Acc + iolist_size(Doc)}
end,
byte_size(Prefix) + byte_size(Suffix) + length(DocList) - 1,
0,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be nice to only materialize the full body and compress it if we're compressing it (algorithm != none) otherwise keep it as a stream

    {Body, Headers} =
        case compress_requests(Len) of
            true ->
                FullBody = [Prefix, lists:join(<<",">>, Docs), Suffix],
                gzip_request_body(FullBody, Headers0);
            false ->
                {{BodyFun, [prefix | Docs]}, Headers0}
        end

case config:get_boolean("replicator", "compress_requests", false) of
true ->
Algorithm = config:get("replicator", "compression_algorithm", "gzip"),
MinSize = config:get_integer("replicator", "compress_min_size", 1024),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit: pull 1024 as a define to the top (-define(COMPRESS_MIN_SIZE, 1024)) also can make a few defines for compression algorithms: "none", "gzip" that way we can ensure we don't mistype them as the compiler will check them for us automatically


compress_with("gzip", Body) ->
Compressed = zlib:gzip(Body),
couch_stats:increment_counter([couch_replicator, requests_compressed]),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant/extra metric?

couch_stats:increment_counter([couch_replicator, requests_compressed]),
couch_stats:increment_counter([couch_replicator, requests_compressed, gzip]),
{Compressed, [{"Content-Encoding", "gzip"}]};
compress_with(Other, Body) ->

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user specified an unusable compression we'd spam the logs potentially thousands of times a second with the warning. We could probably leave it out and user can judge by the metrics

@@ -0,0 +1,16 @@
# CouchDB Replicator Request Compression

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think we need this file. If it pertains to replicator internal design update the dev md file in the replicator app, otherwise documentation should go in src/docs and some in default.ini related to the replicator options


%% Compress Body with gzip if enabled and body is large enough.
%% Returns {Body, ExtraHeaders} where ExtraHeaders may contain Content-Encoding.
maybe_compress(Body) when is_binary(Body) ->

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it would be much harder but this could also be a candidate to set per jobs and/or per cluster. For example replicating from a->b running on c we may want to compress requests to b but then for another job, a->d we might not, because d can't handle gzip. So it might be nice if this option can be the same as https://docs.couchdb.org/en/stable/config/replicator.html#replicator/worker_processes? If that looks too unwieldy or tricky we can skip it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants