|
1 | | -# Solution: Level 4 / Project 01 - Schema Validator Engine |
| 1 | +# Schema Validator Engine — Annotated Solution |
2 | 2 |
|
3 | | -> **STOP** — Have you attempted this project yourself first? |
4 | | -> |
5 | | -> Learning happens in the struggle, not in reading answers. |
6 | | -> Spend at least 20 minutes trying before reading this solution. |
7 | | -> If you are stuck, try the [Walkthrough](./WALKTHROUGH.md) first — it guides |
8 | | -> your thinking without giving away the answer. |
| 3 | +> **STOP!** Try solving this yourself first. Use the [project README](./README.md) and [walkthrough](./WALKTHROUGH.md) before reading the solution. |
9 | 4 |
|
10 | 5 | --- |
11 | 6 |
|
12 | | - |
13 | | -## Complete solution |
| 7 | +## Complete Solution |
14 | 8 |
|
15 | 9 | ```python |
16 | | -# WHY configure_logging: [explain the design reason] |
17 | | -# WHY load_schema: [explain the design reason] |
18 | | -# WHY load_records: [explain the design reason] |
19 | | -# WHY validate_record: [explain the design reason] |
20 | | -# WHY validate_all: [explain the design reason] |
21 | | -# WHY run: [explain the design reason] |
22 | | -# WHY parse_args: [explain the design reason] |
23 | | -# WHY main: [explain the design reason] |
24 | | - |
25 | | -# [paste the complete working solution here] |
26 | | -# Include WHY comments on every non-obvious line. |
| 10 | +"""Level 4 / Project 01 — Schema Validator Engine. |
| 11 | +
|
| 12 | +Validates data records against a JSON schema definition. |
| 13 | +Demonstrates: schema loading, type checking, required-field enforcement, |
| 14 | +and structured error collection. |
| 15 | +""" |
| 16 | + |
| 17 | +from __future__ import annotations |
| 18 | + |
| 19 | +import argparse |
| 20 | +import json |
| 21 | +import logging |
| 22 | +from pathlib import Path |
| 23 | + |
| 24 | +# ---------- logging setup ---------- |
| 25 | + |
| 26 | +def configure_logging() -> None: |
| 27 | + """Set up structured logging so every validation event is traceable.""" |
| 28 | + # WHY: The pipe-delimited layout (timestamp | level | message) makes logs |
| 29 | + # easy to parse with CLI tools like awk and grep, which matters when |
| 30 | + # debugging validation failures across thousands of records. |
| 31 | + logging.basicConfig( |
| 32 | + level=logging.INFO, |
| 33 | + format="%(asctime)s | %(levelname)s | %(message)s", |
| 34 | + ) |
| 35 | + |
| 36 | +# ---------- schema helpers ---------- |
| 37 | + |
| 38 | +# WHY: We translate JSON schema type names ("string", "integer") into Python |
| 39 | +# builtins so isinstance() can check values directly. This avoids scattered |
| 40 | +# if/elif chains and makes adding new types a one-line change. |
| 41 | +TYPE_MAP: dict[str, type | tuple[type, ...]] = {
| 42 | + "string": str,
| 43 | + "integer": int,
| 44 | + "float": float,
| 45 | + "boolean": bool,
| 46 | + "number": (int, float), # isinstance() accepts a tuple of types
| 47 | +}
| 48 | + |
| 49 | + |
| 50 | +def load_schema(path: Path) -> dict: |
| 51 | + """Load a JSON schema file that describes expected fields.""" |
| 52 | + # WHY: Fail early with a clear message if the schema file is missing, |
| 53 | + # rather than letting json.loads raise a confusing error later. |
| 54 | + if not path.exists(): |
| 55 | + raise FileNotFoundError(f"Schema not found: {path}") |
| 56 | + return json.loads(path.read_text(encoding="utf-8")) |
| 57 | + |
| 58 | + |
| 59 | +def load_records(path: Path) -> list[dict]: |
| 60 | + """Load a JSON array of data records to validate.""" |
| 61 | + if not path.exists(): |
| 62 | + raise FileNotFoundError(f"Records file not found: {path}") |
| 63 | + data = json.loads(path.read_text(encoding="utf-8")) |
| 64 | + # WHY: Validate the top-level structure up front. A JSON object instead |
| 65 | + # of an array would cause cryptic errors during iteration. |
| 66 | + if not isinstance(data, list): |
| 67 | + raise ValueError("Records file must contain a JSON array") |
| 68 | + return data |
| 69 | + |
| 70 | +# ---------- validation logic ---------- |
| 71 | + |
| 72 | + |
| 73 | +def validate_record(record: dict, schema: dict) -> list[str]: |
| 74 | + """Validate one record against the schema, returning a list of errors. |
| 75 | +
|
| 76 | + Checks performed: |
| 77 | + 1. Required fields must be present and non-null. |
| 78 | + 2. Field values must match the declared type. |
| 79 | + 3. Numeric fields must fall within min/max bounds (if specified). |
| 80 | + """ |
| 81 | + # WHY: Returning a list instead of raising exceptions lets the caller |
| 82 | + # decide how to handle invalid records (log, quarantine, fail, etc.). |
| 83 | + errors: list[str] = [] |
| 84 | + fields_spec = schema.get("fields", {}) |
| 85 | + |
| 86 | + for field_name, rules in fields_spec.items(): |
| 87 | + value = record.get(field_name) |
| 88 | + |
| 89 | + # WHY: Check both "value is None" and "field_name not in record" |
| 90 | + # because a field could exist with a None value or be entirely |
| 91 | + # absent — both count as missing. |
| 92 | + if rules.get("required", False) and (value is None or field_name not in record): |
| 93 | + errors.append(f"missing required field '{field_name}'") |
| 94 | + continue # no point checking type/range on a missing field |
| 95 | + |
| 96 | + if field_name not in record: |
| 97 | + continue # optional and absent — that is fine |
| 98 | + |
| 99 | + # WHY: Look up the Python type from TYPE_MAP so we can use isinstance()
| 100 | + # for a clean, extensible type check. bool is a subclass of int, so
| 101 | + # reject booleans explicitly unless the schema declares "boolean".
| 102 | + expected = TYPE_MAP.get(rules.get("type", ""))
| 103 | + wrong_bool = isinstance(value, bool) and rules.get("type") != "boolean"
| 104 | + if expected is not None and (wrong_bool or not isinstance(value, expected)):
| 103 | + errors.append( |
| 104 | + f"field '{field_name}' expected {rules['type']}, " |
| 105 | + f"got {type(value).__name__}" |
| 106 | + ) |
| 107 | + continue # skip range check if type is wrong |
| 108 | + |
| 109 | + # WHY: Range checks only make sense for numeric values, so guard |
| 110 | + # with isinstance before comparing. |
| 111 | + if isinstance(value, (int, float)): |
| 112 | + if "min" in rules and value < rules["min"]: |
| 113 | + errors.append( |
| 114 | + f"field '{field_name}' value {value} < min {rules['min']}" |
| 115 | + ) |
| 116 | + if "max" in rules and value > rules["max"]: |
| 117 | + errors.append( |
| 118 | + f"field '{field_name}' value {value} > max {rules['max']}" |
| 119 | + ) |
| 120 | + |
| 121 | + # WHY: Flag extra fields because in data pipelines, unexpected columns |
| 122 | + # often signal upstream schema drift. Surfacing them early prevents |
| 123 | + # silent data loss or misinterpretation downstream. |
| 124 | + for key in record: |
| 125 | + if key not in fields_spec: |
| 126 | + errors.append(f"unexpected field '{key}'") |
| 127 | + |
| 128 | + return errors |
| 129 | + |
| 130 | + |
| 131 | +def validate_all(records: list[dict], schema: dict) -> dict: |
| 132 | + """Validate every record and return a structured report.""" |
| 133 | + report: dict = {"total": len(records), "valid": 0, "invalid": 0, "errors": []} |
| 134 | + |
| 135 | + for idx, record in enumerate(records): |
| 136 | + issues = validate_record(record, schema) |
| 137 | + if issues: |
| 138 | + report["invalid"] += 1 |
| 139 | + report["errors"].append({"record_index": idx, "issues": issues}) |
| 140 | + logging.warning("record %d invalid: %s", idx, issues) |
| 141 | + else: |
| 142 | + report["valid"] += 1 |
| 143 | + |
| 144 | + return report |
| 145 | + |
| 146 | +# ---------- CLI ---------- |
| 147 | + |
| 148 | + |
| 149 | +def run(schema_path: Path, records_path: Path, output_path: Path) -> dict: |
| 150 | + """Full validation run: load schema + records, validate, write report.""" |
| 151 | + schema = load_schema(schema_path) |
| 152 | + records = load_records(records_path) |
| 153 | + report = validate_all(records, schema) |
| 154 | + |
| 155 | + # WHY: Create parent directories automatically so the user does not need |
| 156 | + # to manually mkdir before running the tool. |
| 157 | + output_path.parent.mkdir(parents=True, exist_ok=True) |
| 158 | + output_path.write_text(json.dumps(report, indent=2), encoding="utf-8") |
| 159 | + logging.info("Validation complete — %d valid, %d invalid", report["valid"], report["invalid"]) |
| 160 | + return report |
| 161 | + |
| 162 | + |
| 163 | +def parse_args() -> argparse.Namespace: |
| 164 | + parser = argparse.ArgumentParser(description="Validate records against a JSON schema") |
| 165 | + parser.add_argument("--schema", default="data/schema.json", help="Path to schema file") |
| 166 | + parser.add_argument("--input", default="data/records.json", help="Path to records file") |
| 167 | + parser.add_argument("--output", default="data/validation_report.json", help="Output report path") |
| 168 | + return parser.parse_args() |
| 169 | + |
| 170 | + |
| 171 | +def main() -> None: |
| 172 | + configure_logging() |
| 173 | + args = parse_args() |
| 174 | + report = run(Path(args.schema), Path(args.input), Path(args.output)) |
| 175 | + print(json.dumps(report, indent=2)) |
| 176 | + |
| 177 | + |
| 178 | +if __name__ == "__main__": |
| 179 | + main() |
27 | 180 | ``` |
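The solution reads its inputs from the default paths declared in `parse_args`. This sketch shows the file shapes `load_schema` and `load_records` expect — the field names (`name`, `age`) are illustrative, not from the project's actual data files:

```python
import json
from pathlib import Path

# Illustrative schema: one required string field, one bounded integer field.
schema = {
    "fields": {
        "name": {"type": "string", "required": True},
        "age": {"type": "integer", "required": True, "min": 0, "max": 150},
    }
}
# Records must be a top-level JSON array (load_records enforces this).
records = [
    {"name": "Ada", "age": 36},   # valid
    {"name": "Bob", "age": -5},   # fails the min bound
]

Path("data").mkdir(exist_ok=True)
Path("data/schema.json").write_text(json.dumps(schema, indent=2), encoding="utf-8")
Path("data/records.json").write_text(json.dumps(records, indent=2), encoding="utf-8")
# Then run: python validator.py  (filename assumed; uses the CLI defaults above)
```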
28 | 181 |
|
29 | | -## Design decisions |
| 182 | +## Design Decisions |
30 | 183 |
|
31 | | -| Decision | Why | Alternative considered | |
32 | | -|----------|-----|----------------------| |
33 | | -| configure_logging function | [reason] | [alternative] | |
34 | | -| load_schema function | [reason] | [alternative] | |
35 | | -| load_records function | [reason] | [alternative] | |
| 184 | +| Decision | Why | |
| 185 | +|----------|-----| |
| 186 | +| `TYPE_MAP` as a module-level constant | Keeps the mapping in one place. Adding a new type (e.g., `"date"`) is a single-line change instead of editing validation logic. | |
| 187 | +| Collect all errors per record instead of stopping at the first | Batch reporting is more useful for data pipelines — fixing one error at a time and re-running is slow when you have thousands of records. | |
| 188 | +| Flag unexpected fields in the record | Catches upstream schema drift early. In production, a new column appearing silently can cause downstream bugs that are hard to trace. | |
| 189 | +| Separate `load_schema` / `load_records` / `validate_record` functions | Each function has one job. You can test validation without touching the filesystem, or swap the loader for a database reader. | |
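The one-line extensibility claim in the first row can be sketched directly. `TYPE_MAP` and `type_ok` here are local stand-ins for illustration, not imports from the solution module:

```python
# Local stand-in for the solution's TYPE_MAP.
TYPE_MAP = {"string": str, "integer": int, "float": float, "boolean": bool}

# Supporting a new type is a single new entry — no validation logic changes:
TYPE_MAP["list"] = list

def type_ok(value, type_name):
    """Return True if value matches the declared schema type."""
    expected = TYPE_MAP.get(type_name)
    return expected is not None and isinstance(value, expected)

print(type_ok([1, 2, 3], "list"))     # True
print(type_ok("not a list", "list"))  # False
```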
36 | 190 |
|
37 | | -## Alternative approaches |
| 191 | +## Alternative Approaches |
38 | 192 |
|
39 | | -### Approach B: [Name] |
| 193 | +### Using a validation library (e.g., `jsonschema` or `pydantic`) |
40 | 194 |
|
41 | 195 | ```python |
42 | | -# [Different valid approach with trade-offs explained] |
| 196 | +from pydantic import BaseModel, validator  # v1 API; pydantic v2 renames this to field_validator
| 197 | + |
| 198 | +class PersonRecord(BaseModel): |
| 199 | + name: str |
| 200 | + age: int |
| 201 | + email: str | None = None |
| 202 | + |
| 203 | + @validator("age") |
| 204 | + def age_in_range(cls, v): |
| 205 | + if not 0 <= v <= 150: |
| 206 | + raise ValueError("age out of range") |
| 207 | + return v |
43 | 208 | ``` |
44 | 209 |
|
45 | | -**Trade-off:** [When you would prefer this approach vs the primary one] |
| 210 | +**Trade-off:** Libraries like `pydantic` handle nested objects, custom validators, and type coercion out of the box, but they add a dependency and hide the validation mechanics. Writing your own validator teaches you exactly how schema checking works, which matters when you need to customize behavior or debug failures. |
| 211 | + |
| 212 | +### Using `try/except` per record instead of error lists |
46 | 213 |
|
47 | | -## What could go wrong |
| 214 | +```python |
| 215 | +def validate_or_raise(record, schema): |
| 216 | + for field, rules in schema["fields"].items(): |
| 217 | + if rules["required"] and field not in record: |
| 218 | + raise ValueError(f"Missing {field}") |
| 219 | +``` |
48 | 220 |
|
49 | | -| Scenario | What happens | Prevention | |
50 | | -|----------|-------------|------------| |
51 | | -| [bad input] | [error/behavior] | [how to handle] | |
52 | | -| [edge case] | [behavior] | [how to handle] | |
| 221 | +**Trade-off:** Raising exceptions is simpler to write but only reports the first error per record. The list-based approach in the main solution is better for batch data work where you want to see all problems at once. |
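The difference is easy to see side by side. This hypothetical mini-validator mirrors the main solution's collect-everything strategy: one pass over a record surfaces every missing field at once, where the raising version above would stop at the first:

```python
def validate_collecting(record, required_fields):
    # Gather every missing field instead of raising on the first one.
    return [f"missing required field '{f}'" for f in required_fields if f not in record]

errors = validate_collecting({"name": "Ada"}, ["name", "age", "email"])
print(errors)  # both problems reported in a single pass
```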
53 | 222 |
|
54 | | -## Key takeaways |
| 223 | +## Common Pitfalls |
55 | 224 |
|
56 | | -1. [Most important lesson from this project] |
57 | | -2. [Second lesson] |
58 | | -3. [Connection to future concepts] |
| 225 | +1. **Forgetting that `bool` is a subclass of `int` in Python** — `isinstance(True, int)` returns `True`. If your schema has both `"boolean"` and `"integer"` types, check for `bool` first or a boolean value will pass an integer check. |
| 226 | +2. **Checking `value is None` but not `field_name not in record`** — A field can be present with value `None` (explicit null in JSON), or entirely absent from the dict. Both are "missing" but require different checks. |
| 227 | +3. **Mutating the input records during validation** — If you add or modify fields on the original dicts, subsequent validation passes or downstream code will see corrupted data. Always work on copies if you need to transform. |
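Pitfall 1 can be demonstrated in a few lines. `is_strict_integer` is a hypothetical helper showing the fix — exclude `bool` before the `int` check:

```python
# bool is a subclass of int, so a naive isinstance check accepts booleans:
print(isinstance(True, int))  # True

def is_strict_integer(value):
    # Exclude bool explicitly before checking for int.
    return isinstance(value, int) and not isinstance(value, bool)

print(is_strict_integer(42))    # True
print(is_strict_integer(True))  # False
```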