Install all the dependencies for the evaluation script by running the following command:
```bash
pip install -r requirements-dev.txt
```
## Generate ground truth data
Generate ground truth data by running the following command:
```bash
python evals/generate.py
```
Review the generated data after running that script, removing any question/answer pairs that don't seem like realistic user input.
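If you prefer to do that review with a quick script rather than in an editor, the following is a minimal sketch of an interactive pass over the generated data. It assumes the generator writes JSON Lines with `question` and `truth` fields to `evals/ground_truth.jsonl`; those names are illustrative, so adjust the path and field names to match what `evals/generate.py` actually produces.

```python
# Interactive review pass over the generated ground truth data.
# NOTE: the file path and the "question"/"truth" field names are assumptions;
# adjust them to whatever evals/generate.py actually writes.
import json
from pathlib import Path

ground_truth_path = Path("evals/ground_truth.jsonl")
pairs = [json.loads(line) for line in ground_truth_path.read_text().splitlines() if line.strip()]

kept = []
for pair in pairs:
    print(f"\nQ: {pair['question']}\nA: {pair['truth']}")
    if input("Keep this pair? [y/N] ").strip().lower() == "y":
        kept.append(pair)

# Write back only the pairs that look like realistic user input.
ground_truth_path.write_text("\n".join(json.dumps(pair) for pair in kept) + "\n")
```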
## Evaluate the RAG answer quality
Review the configuration in `evals/eval_config.json` to ensure that everything is correctly set up. You may want to adjust the metrics used. [TODO: link to evaluator docs]
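As a quick sanity check before starting a run, you can confirm the config file parses and see exactly which settings will be used. The snippet below is just a sketch: the keys inside the file depend on the evaluator, so it only echoes whatever is there.

```python
# Print the evaluation config so you can confirm the settings before a run.
# The available keys depend on the evaluator; this just echoes the file.
import json

with open("evals/eval_config.json") as f:
    config = json.load(f)

print(json.dumps(config, indent=2))
```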
By default, the evaluation script will evaluate every question in the ground truth data.
Run the evaluation script with the following command:
```bash
python evals/evaluate.py
```
## Review the evaluation results
The evaluation script will output a summary of the evaluation results inside the `evals/results` directory.
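Each run gets its own subdirectory under `evals/results`. If you want to pull the numbers into your own tooling rather than use the CLI commands below, here is a minimal sketch; it assumes each run directory contains a `summary.json` file, so check the actual files written by `evals/evaluate.py` and adjust the filename accordingly.

```python
# Print the metric summary for every run under evals/results.
# NOTE: the summary.json filename is an assumption; check what
# evals/evaluate.py actually writes into each run directory.
import json
from pathlib import Path

for run_dir in sorted(p for p in Path("evals/results").iterdir() if p.is_dir()):
    summary_file = run_dir / "summary.json"
    if summary_file.exists():
        print(f"=== {run_dir.name} ===")
        print(json.dumps(json.loads(summary_file.read_text()), indent=2))
```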
You can see a summary of results across all evaluation runs by running the following command:
```bash
python -m evaltools summary evals/results
```
Compare answers across runs by running the following command:
```bash
python -m evaltools diff evals/results/baseline/
```
## Run the evaluation in GitHub Actions
- TODO: Add GPT-4 deployment with high capacity for evaluation
- TODO: Add CI workflow that can be triggered to run the evaluation on the local app