Commit bf0457d

Merge pull request #355 from kurok/doc/json-utf8-encoding-caveat
docs: document non-UTF-8 encoding behavior in JSON output
2 parents 20f06ad + 40e4c7d commit bf0457d

4 files changed: +76 -2 lines changed

Lsof.8

Lines changed: 25 additions & 1 deletion
@@ -924,7 +924,10 @@ is mutually exclusive with
 .B \-j
 and
 .BR \-t .
-Warnings and errors are sent to stderr; stdout is always valid JSON.
+Warnings and errors are sent to stderr; stdout is always valid JSON
+(see
+.B "CHARACTER ENCODING NOTE"
+below).
 .TP \w'names'u+4
 .B \-j
 selects JSON Lines output mode. Each open file produces one JSON
@@ -940,6 +943,27 @@ is mutually exclusive with
 .B \-J
 and
 .BR \-t .
+.IP
+.B "Character encoding note:"
+JSON (RFC\ 8259) mandates that strings be valid UTF\-8.
+However, file names on Unix\-like systems are arbitrary byte sequences
+and may contain bytes that are not valid UTF\-8.
+When such bytes appear,
+.B lsof
+passes them through to the output unchanged.
+This means the output is not strictly conformant JSON, but the
+original file name can be recovered.
+This is consistent with the behaviour of
+.BR lsfd (1),
+.BR ip (8)
+.RB ( \-j ),
+and other Linux utilities that produce JSON output.
+Consumers that require strict RFC\ 8259 conformance should
+filter or re\-encode such values (e.g.\& using
+.BR iconv (1)
+or Python's
+.B surrogateescape
+error handler).
 .TP \w'names'u+4
 .BI \-i " [i]"
 selects the listing of files any of whose Internet address
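
The man page note names Python's `surrogateescape` error handler as one way to consume output that contains non-UTF-8 bytes. The following is a minimal sketch of that idea, not code from this commit. It assumes the consumer reads `lsof -j` output as raw bytes; the sample line and the `name` key are hypothetical, since the diff does not show the actual field names.

```python
import json


def parse_json_lines(raw: bytes) -> list:
    """Parse JSON Lines bytes, keeping undecodable bytes recoverable."""
    records = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        # surrogateescape maps each undecodable byte 0xNN to the lone
        # surrogate U+DCNN instead of raising UnicodeDecodeError.
        text = line.decode("utf-8", errors="surrogateescape")
        records.append(json.loads(text))
    return records


def original_bytes(value: str) -> bytes:
    """Turn a surrogate-escaped string back into the original bytes."""
    return value.encode("utf-8", errors="surrogateescape")


# Hypothetical sample: 0xE9 is Latin-1 "e acute", which is not valid UTF-8 here.
# The "name" key is an assumption made for illustration.
sample = b'{"name":"/tmp/caf\xe9.txt"}\n'
records = parse_json_lines(sample)
recovered = original_bytes(records[0]["name"])
```

The decoded string carries a lone surrogate, so it should be converted back with `original_bytes()` before being passed to file system calls or written to a strict UTF-8 sink.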

docs/options.md

Lines changed: 17 additions & 0 deletions
@@ -76,6 +76,23 @@ Lsof has these options to control its output format:
 - -F produce output that can be parsed by a subsequent
 program.
 
+- -J produce nested JSON output. Instead of tabular or
+field output, lsof emits a single JSON object with a
+`processes` array. Field selection follows -F rules.
+Mutually exclusive with -j and -t.
+
+- -j produce JSON Lines output. Each open file produces
+one JSON object per line (denormalized with process
+fields). Suitable for streaming pipelines and log
+ingestion tools. Mutually exclusive with -J and -t.
+
+**Note:** Unix file names are arbitrary byte sequences and may
+contain bytes that are not valid UTF-8. When this occurs, lsof
+passes the raw bytes through unchanged, producing output that is
+not strictly conformant with RFC 8259. This matches the behavior
+of `lsfd(1)`, `ip -j`, `systemctl --output=json`, and other Linux
+tools.
+
 - -g print process group (PGID) IDs.
 
 - -l list UID numbers instead of login names.
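
To illustrate how the two modes relate, the sketch below flattens a nested `-J` style document into denormalized per-file records of the kind `-j` is described as producing. It is only an illustration under stated assumptions: the `processes` and `files` array names come from the documentation in this commit, while the `pid` and `fd` keys and the exact shape of real `-j` records are hypothetical.

```python
import json


def flatten(document: dict) -> list:
    """Merge each process's fields into every one of its open-file entries."""
    records = []
    for process in document.get("processes", []):
        # Everything except the nested "files" array is treated as a process field.
        process_fields = {k: v for k, v in process.items() if k != "files"}
        for open_file in process.get("files", []):
            records.append({**process_fields, **open_file})
    return records


# Hypothetical sample document; "pid" and "fd" are assumed key names.
nested = json.loads(
    '{"processes": [{"pid": 1, "files": [{"fd": "cwd"}, {"fd": "txt"}]}]}'
)
flat = flatten(nested)
json_lines = [json.dumps(record) for record in flat]
```

Each element of `json_lines` is one self-contained object, which is what makes the line-oriented form convenient for streaming consumers.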

docs/tutorial.md

Lines changed: 24 additions & 1 deletion
@@ -602,7 +602,30 @@ homogeneous across Unix dialects. Thus, if you write a script
 to post-process field output for AIX, it probably will work for
 HP-UX, Solaris, and Ultrix as well.
 
-Support for other formats e.g. JSON is planned.
+### JSON Output
+
+Lsof supports two JSON output modes:
+
+- **`-J`** (nested JSON) — produces a single JSON object containing a
+`processes` array, where each process has a `files` array of open-file
+entries. Suitable for tools that consume a complete document (e.g.
+`python3 -m json.tool`, `jq`).
+
+- **`-j`** (JSON Lines) — produces one JSON object per line, combining
+process and file fields in a single denormalized record. Suitable for
+streaming pipelines, log ingestion (Splunk, Datadog, Elastic Stack),
+and line-oriented tools.
+
+Both modes reuse the `-F` field-selection mechanism. For example,
+`lsof -J -Fpcfn` limits output to PID, command, fd, and name fields.
+
+**Encoding caveat:** JSON (RFC 8259) requires strings to be valid UTF-8,
+but Unix file names are arbitrary byte sequences. When file names
+contain non-UTF-8 bytes, lsof passes them through unchanged — the output
+is technically not valid JSON, but preserves the original file name.
+This is the same approach taken by `lsfd`, `ip -j`, and most Linux tools
+that produce JSON. If your consumer requires strict UTF-8, use a filter
+such as `iconv` or Python's `surrogateescape` codec error handler.
 
 ## The Lsof Exit Code and Shell Scripts
 
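
For consumers that need strictly valid UTF-8, one possible filter is sketched below. It is not part of the commit and takes a different, lossy route from the `surrogateescape` handler mentioned in the caveat: invalid bytes are replaced with U+FFFD, so the original file name cannot be recovered afterwards. The sample line and its `name` key are hypothetical.

```python
import json


def to_strict_utf8(line: bytes) -> bytes:
    """Return the line unchanged if it is valid UTF-8, else a lossy copy."""
    try:
        line.decode("utf-8")
        return line
    except UnicodeDecodeError:
        # Each invalid byte sequence becomes U+FFFD (REPLACEMENT CHARACTER).
        return line.decode("utf-8", errors="replace").encode("utf-8")


# Hypothetical sample line containing the invalid byte 0xE9.
sample = b'{"name":"/tmp/caf\xe9.txt"}'
strict = to_strict_utf8(sample)
record = json.loads(strict.decode("utf-8"))
```

Because lsof escapes control characters and the replacement only touches bytes at or above 0x80, the JSON structure around the string values is left intact.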

src/print.c

Lines changed: 10 additions & 0 deletions
@@ -99,6 +99,16 @@ static int human_readable_size(SZOFFTYPE sz, int print, int col);
  * JSON output helpers
  */
 
+/*
+ * json_puts_escaped() - write a C string as a JSON string value (without
+ * the surrounding quotes).
+ *
+ * Control characters (< 0x20) are escaped as \uXXXX. Bytes >= 0x80 are
+ * passed through unchanged. This means non-UTF-8 file names produce
+ * output that is not strictly RFC 8259 conformant, but preserves the
+ * original byte sequence. This is the same trade-off made by lsfd(1),
+ * ip(8) -j, and other Linux JSON-producing tools. See issue #354.
+ */
 static void json_puts_escaped(const char *s) {
     const unsigned char *p = (const unsigned char *)s;
     while (*p) {
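
The rest of the function body is not shown in this hunk. As a rough model of the rule the new comment describes, the Python sketch below escapes control bytes and passes bytes at or above 0x80 through untouched. It is an illustration only, not a translation of the C code: the handling of `"` and `\`, and the lowercase hex digits, are assumptions that the comment does not confirm.

```python
def json_escape_bytes(raw: bytes) -> bytes:
    """Model of the documented rule for writing a JSON string body."""
    out = bytearray()
    for byte in raw:
        if byte < 0x20:
            # Control characters are written as \uXXXX (hex case is assumed).
            out += b"\\u%04x" % byte
        elif byte in (0x22, 0x5C):
            # Assumption: quote and backslash get a backslash escape.
            out += b"\\" + bytes([byte])
        else:
            # Printable ASCII and all bytes >= 0x80 pass through unchanged,
            # so invalid UTF-8 in the input is also invalid in the output.
            out.append(byte)
    return bytes(out)


tab_escaped = json_escape_bytes(b"a\tb")
high_byte = json_escape_bytes(b"caf\xe9")
```

Unlike the C function, which stops at the terminating NUL of its argument, this model processes every byte it is given.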
