Add support for use_answer_as_expected_output and use_answer_as_test_code in evaluation function

m-messer · m-messer · commit f5b5a033a2f8 · 2026-05-26T16:26:21.000+01:00
Extends the `io_test` and `unit_test` modes so the `answer` argument can act as the reference solution (`use_answer_as_expected_output`) or as the test code (`use_answer_as_test_code`). Updates the evaluation logic, adds unit tests, and documents both flags in `CLAUDE.md`, `README.md`, `user.md`, and `dev.md`.
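
For reviewers, the idea behind `use_answer_as_expected_output` can be sketched in isolation. The snippet below is a minimal illustration, not the code in this commit: `run_python`, `run_test`, and `all_tests_pass` are hypothetical names, and the in-process `exec` stands in for the repository's sandboxed runner (no timeout, no isolation, no REPL-style auto-print).

```python
import contextlib
import io
import sys


def run_python(code: str, stdin: str = "") -> str:
    """Execute `code` in-process and return what it wrote to stdout.

    Hypothetical stand-in for the sandboxed runner; no timeout or isolation.
    """
    captured = io.StringIO()
    original_stdin = sys.stdin
    sys.stdin = io.StringIO(stdin)  # lets input() read the test's stdin text
    try:
        with contextlib.redirect_stdout(captured):
            exec(code, {"__name__": "__main__"})
    finally:
        sys.stdin = original_stdin
    return captured.getvalue()


def run_test(code: str, test: dict) -> str:
    """Run `code` against one test case using either `inject` or `input`."""
    inject = test.get("inject")
    if inject:
        # Injected variables become assignments placed before the code.
        prefix = "".join(f"{name} = {value!r}\n" for name, value in inject.items())
        return run_python(prefix + code).rstrip()
    return run_python(code, test.get("input", "")).rstrip()


def all_tests_pass(response: str, answer: str, tests: list) -> bool:
    """Compare student output with reference output on every test."""
    return all(run_test(response, t) == run_test(answer, t) for t in tests)


reference = "n = int(input())\nprint(n * n)"
tests = [{"input": "5\n"}, {"input": "-3\n"}]
print(all_tests_pass("print(int(input()) ** 2)", reference, tests))
print(all_tests_pass("print(int(input()) * 2)", reference, tests))
```

Because the reference solution and the student code receive identical inputs, the test list only needs `input` or `inject` entries. One caveat visible in the diff below: stderr and the timeout flag from the reference run are discarded, so a reference solution that crashes yields an empty expected output.
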
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -50,11 +50,32 @@ All source lives in `evaluation_function/`:
     ]
 }
 
+# io_test — expected outputs derived from answer code (preferred when using LF UI)
+# Write the reference solution in the answer field; only provide inputs in tests.
+# The system runs the answer code with each test's input to compute expected output.
+{
+    "mode": "io_test",
+    "use_answer_as_expected_output": True,   # runs answer code to get expected output
+    "tests": [
+        {"input": "5\n"},
+        {"inject": {"n": 5}}
+    ]
+}
+
 # unit_test — run student code then execute test functions/TestCases
 {
     "mode": "unit_test",
     "test_code": "def test_square():\n    assert square(5) == 25\n"
 }
+
+# unit_test — test code in the answer field (preferred when using LF UI)
+# The LF params editor handles multiline code poorly; the answer field is a
+# proper code editor. Set use_answer_as_test_code=True and write test code
+# in the response area's answer field instead of params["test_code"].
+{
+    "mode": "unit_test",
+    "use_answer_as_test_code": True   # reads test code from the answer argument
+}
 ```
 
 ### Security model (`preview.py`)
diff --git a/README.md b/README.md
@@ -31,6 +31,8 @@ The function supports three modes, set via `params.mode`.
 }
 ```
 
+A bare expression on the last line (e.g. `5 * 5`) prints automatically without `print()`, like a Jupyter notebook cell.
+
 **`io_test`** — compare stdout against expected output for each test case:
 
 ```json
@@ -46,6 +48,8 @@ The function supports three modes, set via `params.mode`.
 }
 ```
 
+Set `"use_answer_as_expected_output": true` to run the `answer` (reference solution) against each test instead of hardcoding `expected_output`. The reference solution receives the same `input` (stdin) or `inject` (variable injection) values as the student code.
+
 **`unit_test`** — run student code then execute `test_*` functions or `unittest.TestCase` subclasses (including Hypothesis tests):
 
 ```json
@@ -58,6 +62,8 @@ The function supports three modes, set via `params.mode`.
 }
 ```
 
+Set `"use_answer_as_test_code": true` to read test code from the `answer` field instead of `params.test_code` — useful in the LF UI where the answer field is a proper code editor.
+
 ## Development
 
 ### Prerequisites
@@ -99,6 +105,14 @@ python -m evaluation_function.dev "print(int(input())**2)" "" \
 # unit_test mode
 python -m evaluation_function.dev "def square(n): return n*n" "" \
   '{"mode":"unit_test","test_code":"def test_sq():\n    assert square(3)==9\n"}'
+
+# unit_test — test code from answer field
+python -m evaluation_function.dev "def square(n): return n*n" $'def test_sq():\n    assert square(3)==9\n' \
+  '{"mode":"unit_test","use_answer_as_test_code":true}'
+
+# io_test — expected output from answer field
+python -m evaluation_function.dev "3.14159*2*5" "3.14159*2*5" \
+  '{"mode":"io_test","use_answer_as_expected_output":true,"tests":[{"input":""}]}'
 ```
 
 ### Running Tests
diff --git a/docs/dev.md b/docs/dev.md
@@ -16,7 +16,7 @@
 ```json
 {
   "response": "<student code string>",
-  "answer":   "<unused — may be null>",
+  "answer":   "<reference solution when use_answer_as_expected_output is set; test code when use_answer_as_test_code is set; otherwise unused>",
   "params": { ... }
 }
 ```
@@ -35,6 +35,8 @@ Run student code with no stdin and return its stdout as output feedback. No pass
 }
 ```
 
+If the last statement is a bare expression (e.g. `3.14 * 2 * 5`), it is automatically wrapped in `print(repr(...))` so it prints like a REPL. Existing `print()` calls are not double-wrapped.
+
 Feedback tags produced: `output` (stdout + any plots), or `error` (timeout / runtime error).
 
 ---
@@ -67,14 +69,32 @@ Each test case uses either `input` (stdin-based) or `inject` (variable injection
 |-------|-------------|
 | `input` | Text piped to stdin. Mutually exclusive with `inject`. |
 | `inject` | Dict of `{variable_name: value}` prepended as assignments before student code. Values can be any JSON type. Mutually exclusive with `input`. |
-| `expected_output` | Expected stdout; trailing whitespace stripped before comparison. |
+| `expected_output` | Expected stdout; trailing whitespace stripped before comparison. Required unless `use_answer_as_expected_output` is set. |
 | `hidden` | `true` = suppress input/variables and expected output from feedback. |
 
 - `tests` is required; an empty list sets `is_correct = true` with `0/0 tests passed`.
 - `hidden: true` replaces details with `"Hidden test N: failed."` so students cannot reverse-engineer the answer.
 - With `inject`, feedback shows a "Variables:" block (e.g. `n = 5`) instead of "Input:".
+- Bare final expressions in student code are auto-wrapped in `print(repr(...))` (REPL behaviour).
 - Matplotlib figures generated during a test are uploaded to S3 and embedded in the feedback.
 
+#### `use_answer_as_expected_output`
+
+When `true`, the `answer` argument (reference solution code) is executed with the same input/inject as each test, and its stdout is used as the expected output. The `expected_output` field on each test object is ignored.
+
+```json
+{
+  "mode": "io_test",
+  "use_answer_as_expected_output": true,
+  "tests": [
+    { "input": "5\n" },
+    { "inject": {"n": 5} }
+  ]
+}
+```
+
+This avoids hardcoding expected outputs in params — useful when the LF UI code editor holds the reference solution.
+
 Feedback tags produced per test: `pass`, `fail`, or `hidden_fail`. Global: `summary`, `error` (timeout / runtime error).
 
 ---
@@ -98,6 +118,17 @@ Append teacher-supplied test code to the student submission, then execute the co
 - Student `print()` calls do not pollute test results (stdout is discarded; results are passed via a temp JSON file).
 - `is_correct` is `true` only when all tests pass and at least one test ran.
 
+#### `use_answer_as_test_code`
+
+When `true`, the `answer` argument is used as the test code instead of `params["test_code"]`. This is preferred when using the LF UI, whose params field is a plain JSON editor (poor for multiline code) while the answer field is a proper code editor.
+
+```json
+{
+  "mode": "unit_test",
+  "use_answer_as_test_code": true
+}
+```
+
 Feedback tags produced per test: `pass`, `fail`. Global: `summary`, `error` (timeout / module-level crash / empty test_code).
 
 ---
diff --git a/docs/user.md b/docs/user.md
@@ -21,7 +21,7 @@ Runs the student's code and shows them their output. No pass/fail verdict is giv
 { "mode": "demo" }
 ```
 
-Students see their stdout and any matplotlib figures they produced.
+Students see their stdout and any matplotlib figures they produced. A bare expression on the last line (e.g. `3.14 * 2 * 5`) prints automatically without needing `print()`, just like a Jupyter notebook.
 
 ---
 
@@ -59,6 +59,32 @@ Each test case uses **either** `input` (student reads via `input()`) **or** `inj
 - You can mix visible and hidden tests in the same question.
 - Matplotlib figures produced during a passing or failing test are shown to the student.
 - A 25-second per-test timeout applies; timed-out tests count as failures.
+- Students can end their code with a bare expression (e.g. `3.14 * r * r`) without `print()`; its value is printed and captured automatically.
+
+### Using the answer field as the reference solution
+
+If you set `"use_answer_as_expected_output": true`, you can write your reference solution in the **answer** field (the code editor in the LF UI) instead of hardcoding `expected_output` in every test case. The system runs your solution with each test's input and uses its output as the expected result.
+
+**Params**
+```json
+{
+  "mode": "io_test",
+  "use_answer_as_expected_output": true,
+  "tests": [
+    { "input": "5\n" },
+    { "input": "0\n" },
+    { "input": "-3\n", "hidden": true }
+  ]
+}
+```
+
+**Answer field** (reference solution):
+```python
+n = int(input())
+print(n * n)
+```
+
+This is especially convenient when the reference solution is already in the answer field for the worked solution display — you don't need to duplicate the expected outputs.
 
 ### Example — square a number (stdin-based)
 
@@ -159,6 +185,27 @@ def test_square_is_nonnegative(n):
 - Student `print()` calls do not affect test results.
 - A 25-second total timeout applies to the entire execution.
 
+### Writing test code in the answer field
+
+The LF params editor is a plain JSON editor, which makes writing multiline test code awkward. Instead, set `"use_answer_as_test_code": true` and write your test functions in the **answer** field (the proper code editor):
+
+**Params**
+```json
+{
+  "mode": "unit_test",
+  "use_answer_as_test_code": true
+}
+```
+
+**Answer field** (test code):
+```python
+def test_positive():
+    assert square(5) == 25, "square(5) should be 25"
+
+def test_zero():
+    assert square(0) == 0
+```
+
 ### Example — testing a `square` function
 
 Student code:
diff --git a/evaluation_function/evaluation.py b/evaluation_function/evaluation.py
@@ -151,14 +151,13 @@ def _evaluate_demo(response: str, result: Result) -> Result:
     return result
 
 
-def _evaluate_io(response: str, tests: list, result: Result) -> Result:
+def _evaluate_io(response: str, tests: list, result: Result, answer: str = "") -> Result:
     passed = 0
     response = _add_repl_print(response)
 
     for i, test in enumerate(tests, 1):
         inject = test.get("inject")
         stdin = test.get("input", "")
-        expected = test.get("expected_output", "").rstrip()
         hidden = test.get("hidden", False)
 
         if inject:
@@ -167,10 +166,19 @@ def _evaluate_io(response: str, tests: list, result: Result) -> Result:
             run_stdin = ""
             input_block = _code_block("Variables", "\n".join(f"{k} = {v!r}" for k, v in inject.items()))
         else:
+            prefix = ""
             run_code = response
             run_stdin = stdin
             input_block = _code_block("Input", stdin.rstrip()) if stdin.strip() else None
 
+        if answer:
+            ans_code = _add_repl_print(answer)
+            ans_run_code = (prefix + ans_code) if inject else ans_code
+            ans_stdout, _, _, _ = _run_code(ans_run_code, run_stdin)
+            expected = ans_stdout.rstrip()
+        else:
+            expected = test.get("expected_output", "").rstrip()
+
         stdout, stderr, timed_out, images = _run_code(run_code, run_stdin)
         actual = stdout.rstrip()
         label = f"Hidden test {i}" if hidden else f"Test {i}"
@@ -274,5 +282,7 @@ def evaluation_function(response: Any, answer: Any, params: Params) -> Result:
     if mode == "demo":
         return _evaluate_demo(str(response), result)
     if mode == "io_test":
-        return _evaluate_io(str(response), params.get("tests", []), result)
-    return _evaluate_unit(str(response), params.get("test_code", ""), result)
+        ans = str(answer) if params.get("use_answer_as_expected_output") else ""
+        return _evaluate_io(str(response), params.get("tests", []), result, answer=ans)
+    test_code = str(answer) if params.get("use_answer_as_test_code") else params.get("test_code", "")
+    return _evaluate_unit(str(response), test_code, result)
diff --git a/evaluation_function/evaluation_test.py b/evaluation_function/evaluation_test.py
@@ -131,6 +131,32 @@ def test_assignment_no_auto_print(self):
         self.assertFalse(result["is_correct"])
 
 
+class TestIoAnswerMode(unittest.TestCase):
+
+    def test_answer_used_as_expected(self):
+        params = {"mode": "io_test", "use_answer_as_expected_output": True,
+                  "tests": [{"input": ""}]}
+        result = evaluation_function("3.14159*2*5", "3.14159*2*5", params).to_dict()
+        self.assertTrue(result["is_correct"])
+
+    def test_answer_used_student_wrong(self):
+        params = {"mode": "io_test", "use_answer_as_expected_output": True,
+                  "tests": [{"input": ""}]}
+        result = evaluation_function("3.14159*2*6", "3.14159*2*5", params).to_dict()
+        self.assertFalse(result["is_correct"])
+
+    def test_answer_with_inject(self):
+        params = {"mode": "io_test", "use_answer_as_expected_output": True,
+                  "tests": [{"inject": {"n": 5}}]}
+        result = evaluation_function("print(n * n)", "print(n * n)", params).to_dict()
+        self.assertTrue(result["is_correct"])
+
+    def test_flag_absent_uses_expected_output_field(self):
+        params = _params(_test("", "31.4159\n"))
+        result = evaluation_function("3.14159*2*5", "ignored", params).to_dict()
+        self.assertTrue(result["is_correct"])
+
+
 _PLOT_CODE = "import matplotlib.pyplot as plt\nplt.plot([1, 2, 3])\n"
 _MULTI_PLOT_CODE = (
     "import matplotlib.pyplot as plt\n"
@@ -262,6 +288,12 @@ def test_hypothesis_pass(self):
         self.assertTrue(result["is_correct"])
         self.assertIn("1/1 tests passed", result["feedback"])
 
+    def test_use_answer_as_test_code(self):
+        params = {"mode": "unit_test", "use_answer_as_test_code": True}
+        result = evaluation_function(_SQUARE_FN, _SQUARE_TESTS, params).to_dict()
+        self.assertTrue(result["is_correct"])
+        self.assertIn("2/2 tests passed", result["feedback"])
+
     def test_hypothesis_fail_shows_minimal_example(self):
         result = evaluation_function(_WRONG_SQUARE_FN, None, _unit_params(_SQUARE_TESTS_HYPOTHESIS)).to_dict()