Commit f5b5a03

Add support for use_answer_as_expected_output and use_answer_as_test_code in evaluation function
Enhances `io_test` and `unit_test` modes by allowing the answer field to serve as the reference solution or test code. Updates evaluation logic, adds unit tests, and refines documentation across `CLAUDE.md`, `README.md`, `user.md`, and `dev.md` to detail usage and advantages.
1 parent 0bf23f8 commit f5b5a03

6 files changed

Lines changed: 162 additions & 7 deletions

CLAUDE.md

Lines changed: 21 additions & 0 deletions
@@ -50,11 +50,32 @@ All source lives in `evaluation_function/`:
     ]
 }

+# io_test — expected outputs derived from answer code (preferred when using LF UI)
+# Write the reference solution in the answer field; only provide inputs in tests.
+# The system runs the answer code with each test's input to compute expected output.
+{
+    "mode": "io_test",
+    "use_answer_as_expected_output": True,  # runs answer code to get expected output
+    "tests": [
+        {"input": "5\n"},
+        {"inject": {"n": 5}}
+    ]
+}
+
 # unit_test — run student code then execute test functions/TestCases
 {
     "mode": "unit_test",
     "test_code": "def test_square():\n assert square(5) == 25\n"
 }
+
+# unit_test — test code in the answer field (preferred when using LF UI)
+# The LF params editor handles multiline code poorly; the answer field is a
+# proper code editor. Set use_answer_as_test_code=True and write test code
+# in the response area's answer field instead of params["test_code"].
+{
+    "mode": "unit_test",
+    "use_answer_as_test_code": True  # reads test code from the answer argument
+}
 ```

 ### Security model (`preview.py`)
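The CLAUDE.md comments say the system runs the answer code with each test's input to compute the expected output. A minimal sketch of that idea follows, assuming plain `subprocess` execution and stdin-only tests. `run_python` and `expected_from_reference` are hypothetical names, not the repository's `_run_code`, and the real sandboxing and timeout handling are not shown in this commit.

```python
import subprocess
import sys


def run_python(code: str, stdin: str = "", timeout: float = 10.0) -> str:
    """Run `code` in a fresh interpreter and return its stdout (hypothetical helper)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        input=stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.stdout


def expected_from_reference(reference: str, tests: list) -> list:
    """Derive one expected output per test by running the reference solution."""
    return [run_python(reference, t.get("input", "")).rstrip() for t in tests]


reference = "n = int(input())\nprint(n * n)"
tests = [{"input": "5\n"}, {"input": "0\n"}, {"input": "-3\n"}]
print(expected_from_reference(reference, tests))  # ['25', '0', '9']
```

The student's stdout would then be compared against these strings after the same trailing-whitespace strip, so the test objects only need to carry inputs.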

README.md

Lines changed: 14 additions & 0 deletions
@@ -31,6 +31,8 @@ The function supports three modes, set via `params.mode`.
 }
 ```

+Bare expressions (e.g. `5 * 5`) print automatically without `print()`, like a Jupyter notebook cell.
+
 **`io_test`** — compare stdout against expected output for each test case:

 ```json
@@ -46,6 +48,8 @@ The function supports three modes, set via `params.mode`.
 }
 ```

+Set `"use_answer_as_expected_output": true` to run the `answer` (reference solution) against each test's input instead of hardcoding `expected_output`. Variable injection via `inject` is also supported as an alternative to stdin.
+
 **`unit_test`** — run student code then execute `test_*` functions or `unittest.TestCase` subclasses (including Hypothesis tests):

 ```json
@@ -58,6 +62,8 @@ The function supports three modes, set via `params.mode`.
 }
 ```

+Set `"use_answer_as_test_code": true` to read test code from the `answer` field instead of `params.test_code` — useful in the LF UI where the answer field is a proper code editor.
+
 ## Development

 ### Prerequisites
@@ -99,6 +105,14 @@ python -m evaluation_function.dev "print(int(input())**2)" "" \
 # unit_test mode
 python -m evaluation_function.dev "def square(n): return n*n" "" \
   '{"mode":"unit_test","test_code":"def test_sq():\n assert square(3)==9\n"}'
+
+# unit_test — test code from answer field
+python -m evaluation_function.dev "def square(n): return n*n" "def test_sq():\n assert square(3)==9\n" \
+  '{"mode":"unit_test","use_answer_as_test_code":true}'
+
+# io_test — expected output from answer field
+python -m evaluation_function.dev "3.14159*2*5" "3.14159*2*5" \
+  '{"mode":"io_test","use_answer_as_expected_output":true,"tests":[{"input":""}]}'
 ```

 ### Running Tests

docs/dev.md

Lines changed: 33 additions & 2 deletions
@@ -16,7 +16,7 @@
 ```json
 {
   "response": "<student code string>",
-  "answer": "<unused — may be null>",
+  "answer": "<reference solution — used when use_answer_as_test_code or use_answer_as_expected_output is set>",
   "params": { ... }
 }
 ```
@@ -35,6 +35,8 @@ Run student code with no stdin and return its stdout as output feedback. No pass
 }
 ```

+If the last statement is a bare expression (e.g. `3.14 * 2 * 5`), it is automatically wrapped in `print(repr(...))` so it prints like a REPL. Existing `print()` calls are not double-wrapped.
+
 Feedback tags produced: `output` (stdout + any plots), or `error` (timeout / runtime error).

 ---
@@ -67,14 +69,32 @@ Each test case uses either `input` (stdin-based) or `inject` (variable injection
 |-------|-------------|
 | `input` | Text piped to stdin. Mutually exclusive with `inject`. |
 | `inject` | Dict of `{variable_name: value}` prepended as assignments before student code. Values can be any JSON type. Mutually exclusive with `input`. |
-| `expected_output` | Expected stdout; trailing whitespace stripped before comparison. |
+| `expected_output` | Expected stdout; trailing whitespace stripped before comparison. Required unless `use_answer_as_expected_output` is set. |
 | `hidden` | `true` = suppress input/variables and expected output from feedback. |

 - `tests` is required; an empty list sets `is_correct = true` with `0/0 tests passed`.
 - `hidden: true` replaces details with `"Hidden test N: failed."` so students cannot reverse-engineer the answer.
 - With `inject`, feedback shows a "Variables:" block (e.g. `n = 5`) instead of "Input:".
+- Bare final expressions in student code are auto-wrapped in `print(repr(...))` (REPL behaviour).
 - Matplotlib figures generated during a test are uploaded to S3 and embedded in the feedback.

+#### `use_answer_as_expected_output`
+
+When `true`, the `answer` argument (reference solution code) is executed with the same input/inject as each test, and its stdout is used as the expected output. The `expected_output` field on each test object is ignored.
+
+```json
+{
+  "mode": "io_test",
+  "use_answer_as_expected_output": true,
+  "tests": [
+    { "input": "5\n" },
+    { "inject": {"n": 5} }
+  ]
+}
+```
+
+This avoids hardcoding expected outputs in params — useful when the LF UI code editor holds the reference solution.
+
 Feedback tags produced per test: `pass`, `fail`, or `hidden_fail`. Global: `summary`, `error` (timeout / runtime error).

 ---
@@ -98,6 +118,17 @@ Append teacher-supplied test code to the student submission, then execute the co
 - Student `print()` calls do not pollute test results (stdout is discarded; results are passed via a temp JSON file).
 - `is_correct` is `true` only when all tests pass and at least one test ran.

+#### `use_answer_as_test_code`
+
+When `true`, the `answer` argument is used as the test code instead of `params["test_code"]`. This is preferred when using the LF UI, whose params field is a plain JSON editor (poor for multiline code) while the answer field is a proper code editor.
+
+```json
+{
+  "mode": "unit_test",
+  "use_answer_as_test_code": true
+}
+```
+
 Feedback tags produced per test: `pass`, `fail`. Global: `summary`, `error` (timeout / module-level crash / empty test_code).

 ---
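docs/dev.md states that a bare final expression is wrapped in `print(repr(...))` and that existing `print()` calls are not double-wrapped. One way such a transform could be written with the standard `ast` module is sketched here. It is an assumption about the approach, not the repository's `_add_repl_print`, whose body is not part of this commit. `wrap_last_expression` is a hypothetical name, and `ast.unparse` needs Python 3.9 or later.

```python
import ast


def wrap_last_expression(source: str) -> str:
    """Wrap a trailing bare expression in print(repr(...)) (hypothetical helper)."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return source  # leave broken code untouched so the real error surfaces at run time

    if not tree.body or not isinstance(tree.body[-1], ast.Expr):
        return source  # last statement is an assignment, def, loop, etc.

    value = tree.body[-1].value
    is_print_call = (
        isinstance(value, ast.Call)
        and isinstance(value.func, ast.Name)
        and value.func.id == "print"
    )
    if is_print_call:
        return source  # do not double-wrap an explicit print(...)

    repr_call = ast.Call(func=ast.Name(id="repr", ctx=ast.Load()), args=[value], keywords=[])
    print_call = ast.Call(func=ast.Name(id="print", ctx=ast.Load()), args=[repr_call], keywords=[])
    tree.body[-1] = ast.Expr(value=print_call)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(wrap_last_expression("r = 5\n3.14 * r * r"))
```

Because the wrapper uses `repr`, a bare string expression such as `'hi'` prints with its quotes, as it would in a REPL. Note that `ast.unparse` drops comments and normalises formatting, which may shift line numbers in tracebacks; a production version might splice text instead.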

docs/user.md

Lines changed: 48 additions & 1 deletion
@@ -21,7 +21,7 @@ Runs the student's code and shows them their output. No pass/fail verdict is giv
 { "mode": "demo" }
 ```

-Students see their stdout and any matplotlib figures they produced.
+Students see their stdout and any matplotlib figures they produced. Bare expressions (e.g. `3.14 * 2 * 5`) print automatically without needing `print()`, just like a Jupyter notebook.

 ---

@@ -59,6 +59,32 @@ Each test case uses **either** `input` (student reads via `input()`) **or** `inj
 - You can mix visible and hidden tests in the same question.
 - Matplotlib figures produced during a passing or failing test are shown to the student.
 - A 25-second per-test timeout applies; timed-out tests count as failures.
+- Students can write bare expressions (e.g. `3.14 * r * r`) without `print()` — the output is captured automatically.
+
+### Using the answer field as the reference solution
+
+If you set `"use_answer_as_expected_output": true`, you can write your reference solution in the **answer** field (the code editor in the LF UI) instead of hardcoding `expected_output` in every test case. The system runs your solution with each test's input and uses its output as the expected result.
+
+**Params**
+```json
+{
+  "mode": "io_test",
+  "use_answer_as_expected_output": true,
+  "tests": [
+    { "input": "5\n" },
+    { "input": "0\n" },
+    { "input": "-3\n", "hidden": true }
+  ]
+}
+```
+
+**Answer field** (reference solution):
+```python
+n = int(input())
+print(n * n)
+```
+
+This is especially convenient when the reference solution is already in the answer field for the worked solution display — you don't need to duplicate the expected outputs.

 ### Example — square a number (stdin-based)

@@ -159,6 +185,27 @@ def test_square_is_nonnegative(n):
 - Student `print()` calls do not affect test results.
 - A 25-second total timeout applies to the entire execution.

+### Writing test code in the answer field
+
+The LF params editor is a plain JSON editor, which makes writing multiline test code awkward. Instead, set `"use_answer_as_test_code": true` and write your test functions in the **answer** field (the proper code editor):
+
+**Params**
+```json
+{
+  "mode": "unit_test",
+  "use_answer_as_test_code": true
+}
+```
+
+**Answer field** (test code):
+```python
+def test_positive():
+    assert square(5) == 25, "square(5) should be 25"
+
+def test_zero():
+    assert square(0) == 0
+```
+
 ### Example — testing a `square` function

 Student code:
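To make the `unit_test` flow concrete, here is a simplified sketch of how student code and answer-field test code might be combined: execute both in one namespace, then call every `test_*` function and count passes. It runs in-process and ignores `unittest.TestCase` classes and Hypothesis, whereas the docs describe the real evaluator as appending the test code to the submission and reporting results through a temp JSON file. `run_unit_tests` is a hypothetical name.

```python
def run_unit_tests(student_code: str, test_code: str) -> dict:
    """Run student code, then each test_* function; report pass counts (hypothetical helper)."""
    namespace: dict = {}
    exec(student_code, namespace)  # defines the student's functions
    exec(test_code, namespace)     # test functions see the same names

    failures = {}
    total = 0
    for name, obj in list(namespace.items()):
        if not (name.startswith("test_") and callable(obj)):
            continue
        total += 1
        try:
            obj()
        except Exception as exc:  # AssertionError included
            failures[name] = str(exc)
    return {"passed": total - len(failures), "total": total, "failures": failures}


ANSWER_FIELD = (
    "def test_positive():\n"
    "    assert square(5) == 25, 'square(5) should be 25'\n"
    "\n"
    "def test_zero():\n"
    "    assert square(0) == 0\n"
)

print(run_unit_tests("def square(n):\n    return n * n\n", ANSWER_FIELD))
print(run_unit_tests("def square(n):\n    return n + n\n", ANSWER_FIELD))
```

With the correct `square`, both tests pass. With the `n + n` version only `test_zero` passes, and the assertion message from `test_positive` is what a student-facing failure report could surface.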

evaluation_function/evaluation.py

Lines changed: 14 additions & 4 deletions
@@ -151,14 +151,13 @@ def _evaluate_demo(response: str, result: Result) -> Result:
     return result


-def _evaluate_io(response: str, tests: list, result: Result) -> Result:
+def _evaluate_io(response: str, tests: list, result: Result, answer: str = "") -> Result:
     passed = 0
     response = _add_repl_print(response)

     for i, test in enumerate(tests, 1):
         inject = test.get("inject")
         stdin = test.get("input", "")
-        expected = test.get("expected_output", "").rstrip()
         hidden = test.get("hidden", False)

         if inject:
@@ -167,10 +166,19 @@ def _evaluate_io(response: str, tests: list, result: Result) -> Result:
             run_stdin = ""
             input_block = _code_block("Variables", "\n".join(f"{k} = {v!r}" for k, v in inject.items()))
         else:
+            prefix = ""
             run_code = response
             run_stdin = stdin
             input_block = _code_block("Input", stdin.rstrip()) if stdin.strip() else None

+        if answer:
+            ans_code = _add_repl_print(answer)
+            ans_run_code = (prefix + ans_code) if inject else ans_code
+            ans_stdout, _, _, _ = _run_code(ans_run_code, run_stdin)
+            expected = ans_stdout.rstrip()
+        else:
+            expected = test.get("expected_output", "").rstrip()
+
         stdout, stderr, timed_out, images = _run_code(run_code, run_stdin)
         actual = stdout.rstrip()
         label = f"Hidden test {i}" if hidden else f"Test {i}"
@@ -274,5 +282,7 @@ def evaluation_function(response: Any, answer: Any, params: Params) -> Result:
     if mode == "demo":
         return _evaluate_demo(str(response), result)
     if mode == "io_test":
-        return _evaluate_io(str(response), params.get("tests", []), result)
-    return _evaluate_unit(str(response), params.get("test_code", ""), result)
+        ans = str(answer) if params.get("use_answer_as_expected_output") else ""
+        return _evaluate_io(str(response), params.get("tests", []), result, answer=ans)
+    test_code = str(answer) if params.get("use_answer_as_test_code") else params.get("test_code", "")
+    return _evaluate_unit(str(response), test_code, result)

evaluation_function/evaluation_test.py

Lines changed: 32 additions & 0 deletions
@@ -131,6 +131,32 @@ def test_assignment_no_auto_print(self):
         self.assertFalse(result["is_correct"])


+class TestIoAnswerMode(unittest.TestCase):
+
+    def test_answer_used_as_expected(self):
+        params = {"mode": "io_test", "use_answer_as_expected_output": True,
+                  "tests": [{"input": ""}]}
+        result = evaluation_function("3.14159*2*5", "3.14159*2*5", params).to_dict()
+        self.assertTrue(result["is_correct"])
+
+    def test_answer_used_student_wrong(self):
+        params = {"mode": "io_test", "use_answer_as_expected_output": True,
+                  "tests": [{"input": ""}]}
+        result = evaluation_function("3.14159*2*6", "3.14159*2*5", params).to_dict()
+        self.assertFalse(result["is_correct"])
+
+    def test_answer_with_inject(self):
+        params = {"mode": "io_test", "use_answer_as_expected_output": True,
+                  "tests": [{"inject": {"n": 5}}]}
+        result = evaluation_function("print(n * n)", "print(n * n)", params).to_dict()
+        self.assertTrue(result["is_correct"])
+
+    def test_flag_absent_uses_expected_output_field(self):
+        params = _params(_test("", "31.4159\n"))
+        result = evaluation_function("3.14159*2*5", "ignored", params).to_dict()
+        self.assertTrue(result["is_correct"])
+
+
 _PLOT_CODE = "import matplotlib.pyplot as plt\nplt.plot([1, 2, 3])\n"
 _MULTI_PLOT_CODE = (
     "import matplotlib.pyplot as plt\n"
@@ -262,6 +288,12 @@ def test_hypothesis_pass(self):
         self.assertTrue(result["is_correct"])
         self.assertIn("1/1 tests passed", result["feedback"])

+    def test_use_answer_as_test_code(self):
+        params = {"mode": "unit_test", "use_answer_as_test_code": True}
+        result = evaluation_function(_SQUARE_FN, _SQUARE_TESTS, params).to_dict()
+        self.assertTrue(result["is_correct"])
+        self.assertIn("2/2 tests passed", result["feedback"])
+
     def test_hypothesis_fail_shows_minimal_example(self):
         result = evaluation_function(_WRONG_SQUARE_FN, None, _unit_params(_SQUARE_TESTS_HYPOTHESIS)).to_dict()
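`test_answer_with_inject` relies on the student code and the reference code seeing the same injected variables. docs/dev.md describes `inject` as assignments prepended before the code. A minimal sketch of building such a prefix follows, assuming values are rendered with `repr()`: the diff shows `{v!r}` in the feedback block, but the actual prefix construction lies outside the shown hunks. `build_inject_prefix` is a hypothetical name.

```python
import json


def build_inject_prefix(inject: dict) -> str:
    """Turn {"n": 5} into 'n = 5\\n' so it can be prepended to code (hypothetical helper)."""
    return "".join(f"{name} = {value!r}\n" for name, value in inject.items())


# JSON true/null become Python True/None after json.loads, and repr() emits valid literals.
inject = json.loads('{"n": 5, "name": "Ada", "flags": [true, null]}')
prefix = build_inject_prefix(inject)
print(prefix)

# The same prefix can be placed in front of both student and reference code.
namespace: dict = {}
exec(prefix + "result = n * n", namespace)
print(namespace["result"])  # 25
```

Reusing one prefix for both runs is what keeps the reference output and the student output comparable for a given test case.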
