feat(data-masking): add Data Masking utility#5143
Conversation
b8ad946 to
8bf61c2
Compare
|
|
||
| /** Thrown when input data contains an unsupported type for the operation. */ | ||
| export class DataMaskingUnsupportedTypeError extends Error { | ||
| constructor(message: string) { |
There was a problem hiding this comment.
Is the optional options?: ErrorOptions missing intentional?
| @@ -0,0 +1,23 @@ | |||
| /** Thrown when a field path does not match any value in the data. */ | |||
| export class DataMaskingFieldNotFoundError extends Error { | |||
| constructor(message: string) { | |||
There was a problem hiding this comment.
Is the optional options?: ErrorOptions missing intentional?
| @@ -0,0 +1 @@ | |||
| export const DEFAULT_MASK_VALUE = '*****'; | |||
There was a problem hiding this comment.
Let's add a comment to explain what this is
| readonly #provider?: EncryptionProvider; | ||
| readonly #throwOnMissingField: boolean; | ||
|
|
||
| constructor(options?: DataMaskingConstructorOptions) { |
There was a problem hiding this comment.
Need to set public on the constructor
| readonly #provider?: EncryptionProvider; | ||
| readonly #throwOnMissingField: boolean; |
There was a problem hiding this comment.
Add comments documenting the two props
| @@ -0,0 +1 @@ | |||
| export const DEFAULT_MASK_VALUE = '*****'; | |||
There was a problem hiding this comment.
consider if needs to be as const
| erase<T>(data: T, options?: EraseOptions): T { | ||
| if (data === null || data === undefined) return data; | ||
| if (!options?.fields && !options?.maskingRules) { | ||
| return DEFAULT_MASK_VALUE as unknown as T; |
There was a problem hiding this comment.
This as unknown as T needs looking into it - right now I think it's needed because DEFAULT_MASK_VALUE is a string and we're casting it to T but the bigger question is: does the return type need to stay T or should it be string? Can it ever not be a string?
There was a problem hiding this comment.
Ok this is a recursive function that can return objects - so this as unknown as T is basically "the leaf state" of the recursion which is always a string but we're telling it to be T. So because it's a public method we need to keep T but it could've been probably T | string.
| * ``` | ||
| */ | ||
| erase<T>(data: T, options?: EraseOptions): T { | ||
| if (data === null || data === undefined) return data; |
There was a problem hiding this comment.
Might want to use the "isNullorUndefined" (look up exact name) helper from commons here
| * const masked = masker.erase(data, { fields: ['email', 'customer.ssn'] }); | ||
| * ``` | ||
| */ | ||
| erase<T>(data: T, options?: EraseOptions): T { |
There was a problem hiding this comment.
Need to document args & options in the comment above.
| */ | ||
| erase<T>(data: T, options?: EraseOptions): T { | ||
| if (data === null || data === undefined) return data; | ||
| if (!options?.fields && !options?.maskingRules) { |
There was a problem hiding this comment.
This looks odd - the way I understand it right now is that we replace the entire value/object with DEFAULT_MASK_VALUE if I don't specify any option - which feels odd. I don't know for sure, but at the moment I'd expect instead to replace ALL values in the object instead?
| if (options.maskingRules) { | ||
| this.#applyMaskingRules(copy, options.maskingRules); | ||
| } else { | ||
| /* v8 ignore next -- @preserve fallback unreachable: line 46 returns early when fields is falsy */ |
There was a problem hiding this comment.
Is this really unreachable or is the agent lazy?
There was a problem hiding this comment.
Ok I got it now, maybe we want to consider changing options.fields in the signature to not optional and make it default to[] so that the below never happens.
|
|
||
| const copy = this.#deepCopy(data); | ||
|
|
||
| if (options.maskingRules) { |
There was a problem hiding this comment.
Can we consider destructuring options at the top of this method.
| return current; | ||
| }; | ||
|
|
||
| const setAtPath = (data: unknown, path: string[], value: unknown): void => { |
There was a problem hiding this comment.
This function needs a comment explaining what it does as it's a bit hard to reason about
| throw new DataMaskingFieldNotFoundError(`Field not found: '${field}'`); | ||
| } | ||
| for (const path of paths) { | ||
| setAtPath(copy, path, DEFAULT_MASK_VALUE); |
There was a problem hiding this comment.
Should this always be the default mask value or should the customer be able to override it? Current behavior always sets the default
| field | ||
| )) { | ||
| const value = getAtPath(copy, path); | ||
| if (typeof value !== 'string') { |
There was a problem hiding this comment.
might want to use isString helper from commons
| if (rule.dynamicMask) return '*'.repeat(value.length); | ||
| if (rule.customMask !== undefined) return rule.customMask; |
There was a problem hiding this comment.
What happens if I set a custom mask (i.e. +) and also want it dynamic at the same time?
| dynamicMask?: boolean; | ||
| /** Fixed replacement string. */ | ||
| customMask?: string; |
There was a problem hiding this comment.
These seem to be mutually exclusive in the implementation and meaning - if that's the case the type should reflect it.
| } | ||
|
|
||
| /** Per-field custom masking configuration for {@link DataMasking.erase}. */ | ||
| export interface MaskingRule { |
There was a problem hiding this comment.
Review all the options to make sure that the mutual exclusion is reflecting the implementation
| const provider = this.#requireProvider(); | ||
|
|
||
| if (!options?.fields) { | ||
| return provider.encrypt(JSON.stringify(data), options?.context); |
There was a problem hiding this comment.
Maybe this could be its own named function in the method
const encrypt = provider.encrypt(JSON.stringify(data), options?.context);
| } | ||
| } | ||
|
|
||
| await Promise.all(operations); |
There was a problem hiding this comment.
This might need to be allSettled - check what happens if one of them fails
There was a problem hiding this comment.
Maybe this is fine because as you said it's an all or nothing op - I'm just curious to see how the error is bubbled up.
| const copy = this.#deepCopy(data); | ||
| if (options?.fields) { | ||
| await this.#transformFields(copy, options.fields, async (value) => { | ||
| if (typeof value !== 'string') return value; |
There was a problem hiding this comment.
Should we consider emitting a warning here? It's good that we prevent the customer from shooting themselves in the foot, but they should know so they can fix the field paths or the data.
| export interface AWSEncryptionSDKProviderOptions { | ||
| keys: string[]; | ||
| localCacheCapacity?: number; | ||
| maxCacheAgeSeconds?: number; | ||
| maxMessagesEncrypted?: number; | ||
| maxBytesEncrypted?: number; | ||
| } | ||
|
|
||
| type EncryptFn = ReturnType<typeof buildEncrypt>['encrypt']; | ||
| type DecryptFn = ReturnType<typeof buildDecrypt>['decrypt']; | ||
| type CacheManager = Parameters<EncryptFn>[0]; | ||
|
|
||
| interface SDKClients { | ||
| encryptFn: EncryptFn; | ||
| decryptFn: DecryptFn; | ||
| cacheManager: CacheManager; | ||
| } | ||
|
|
There was a problem hiding this comment.
Consider splitting these in their own types file under provider
| import type { buildDecrypt, buildEncrypt } from '@aws-crypto/client-node'; | ||
| import type { EncryptionProvider } from '../types.js'; | ||
|
|
||
| export interface AWSEncryptionSDKProviderOptions { |
There was a problem hiding this comment.
All these need documenting
|
|
||
| export class AWSEncryptionSDKProvider implements EncryptionProvider { | ||
| readonly #options: AWSEncryptionSDKProviderOptions; | ||
| #clients?: SDKClients; |
There was a problem hiding this comment.
Can we call this client (singular?) I get that it's an object we compose from functions coming from the encryption sdk but it reads weird and a client class is an object after all
|
|
||
| // Dynamic import — @aws-crypto/client-node is an optional peer dep, | ||
| // so we only load it when encryption is actually used. | ||
| const { |
There was a problem hiding this comment.
Consider making this a regular import at the top using the same pattern that we use in Idempotency for the persistence layers - which looks something like this:
import { DataMasking } from '@aws-lambda-powertools/data-masking'; // this does not include any provider
import { AWSEncryptionSDKProvider } from '@aws-lambda-powertools/data-masking/aws-provider';
const datamasking = new DataMasking({ provider: new AWSEncryptionSDKProvider() });This way the provider gets its own sub-path export which functionally makes it opt-in.
There was a problem hiding this comment.
This might also allow you to not have #init to be async - which then in turn means you could do it in the constructor.
| maxAge: (this.#options.maxCacheAgeSeconds ?? 300) * 1000, | ||
| maxMessagesEncrypted: | ||
| this.#options.maxMessagesEncrypted ?? 4294967296, |
There was a problem hiding this comment.
Check that the defaults are aligned with the Python implementation.
| if (context) { | ||
| for (const [key, value] of Object.entries(context)) { | ||
| if (messageHeader.encryptionContext[key] !== value) { | ||
| throw new Error( |
There was a problem hiding this comment.
Should we make this a custom package-managed error like all others
|
|
||
| === "Result" | ||
|
|
||
| ```json hl_lines="6" |
There was a problem hiding this comment.
Consider extracting all these in separate files
| | --- | --- | --- | | ||
| | `localCacheCapacity` | `100` | The maximum number of entries that can be retained in the local cryptographic materials cache | | ||
| | `maxCacheAgeSeconds` | `300` | The maximum time (in seconds) that a cache entry may be kept in the cache | | ||
| | `maxMessagesEncrypted` | `4294967296` | The maximum number of messages that may be encrypted under a cache entry | |
There was a problem hiding this comment.
Why is this magic number so special?
| const kmsKey = new Key(testStack.stack, 'TestKmsKey', { | ||
| description: 'KMS key for DataMasking e2e tests', | ||
| }); |
There was a problem hiding this comment.
Let's make sure that the KMS Key actually gets cleaned up after the test
Summary
Changes
This PR is an experiment in delivering a full feature, end to end, using spec-driven development and agentic coding. As such, I have set it as a draft. We may or may not merge this, but if we do, only after a thorough review by the team. The purpose of this PR is as much to provoke discussion as it is to implement a feature. I would appreciate if @dreamorosi and @sdangol could look at the code and give their opinions.
From my perspective, I think this was a very successful experiment: I am happy with the code quality and I also took the opportunity to add property tests, which are a perfect fit for this sort of logic.
Something I would note is that this probably worked so well because of how well-defined the issue was by @walmsles, and also that we could use the Python implementation as a reference.
One place I differed from the proposed implementation was that I don't batch the calls to KMS. This simplifies the API and means we mirror the Python implementation exactly. While ordinarily this would be a performance concern, we use the caching feature in the AWS Cryptography library to ensure that we only ever make one call to KMS when encrypting multiple fields.
What's included
@aws-lambda-powertools/data-maskingpackage with erase, encrypt, and decrypt operationsAWSEncryptionSDKProviderusing KMS envelope encryption (@aws-crypto/client-nodeas optional peer dep)[*]array wildcards, and.*object wildcardsfast-checkA note on field path resolutionThe Python implementation usesjsonpath_ngfor field selection, which natively supports both querying and path extraction for write-back. We considered using a JavaScript JSONPath library (e.g.jsonpath-plus) to match this approach, but decided against it for a few reasons:-jsonpath-plus(9.8m weekly downloads) is marked as unmaintained by its maintainers- We already have@aws-lambda-powertools/jmespathin-houseInstead, we use JMESPath to validate expressions and a small ((approx. 20 line)) custom walker to resolve wildcards ([*]and*) into concrete paths for write-back. JMESPath is read-only by design so it can't be used for path extraction directly, but the walker is simple, well-tested, and avoids any new dependencies.There is however another library, jsonpath, that has 3.8m weekly downloads. I am always hesitant to introduce new dependencies to the project but I think if we want feature parity with Python we will need to take the dependency on. I would like to hear the maintainers thoughts before committing to this course of action though.Update: The JMESPath dependency has been removed. It was unnecessary since we only used it for reading, and keeping it could mislead users into thinking we support the full JMESPath spec for writes. Instead, we use a small custom walker that supports:
obj.key)obj.*.key)obj.[*].key)These three cases are validated with property tests, which are much more thorough than traditional unit tests.
Issue number: closes #4960
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.