diff --git a/.claude/skills/port-pr/SKILL.md b/.claude/skills/port-pr/SKILL.md new file mode 100644 index 00000000..4fe0b391 --- /dev/null +++ b/.claude/skills/port-pr/SKILL.md @@ -0,0 +1,145 @@ +--- +name: port-pr +description: Port a machine (C#) PR into machine.py (Python). Given a GitHub issue number for a porting task in sillsdev/machine.py, finds the linked machine (C#) PR, ports its changes to the Python codebase, runs the local checks (black/flake8/isort/pyright/pytest), and opens a PR that closes the issue. Use when asked to port a PR/issue from machine (C#), complete a "porting" issue, or sync a machine change into machine.py. +--- + +# Port a machine (C#) PR into machine.py (Python) + +`machine` (C#) and `machine.py` (Python) are direct, intentionally-synced ports of each +other. This skill ports a change that already landed in `machine` into `machine.py`, +driven by a "porting" issue in `sillsdev/machine.py`. + +**Required argument:** the GitHub issue number in `sillsdev/machine.py` (always exists). + +## Repos + +- Python target repo: the current working directory (`sillsdev/machine.py`). +- C# source repo: the sibling clone at `../machine` (`sillsdev/machine`). + Use the local clone for reading surrounding context; use `gh ... --repo sillsdev/machine` + for authoritative PR data. + +## Step 1 — Read the porting issue + +```bash +gh issue view --json title,body,labels +``` + +- The body looks like: `Port any relevant changes in https://github.com/sillsdev/machine/pull/ from machine to machine.py.` + Extract `` — the machine (C#) PR number — from that URL. +- The issue title looks like `Port ''`. Keep `<Title>` for the branch and PR. +- If the body has no machine (C#) PR link, stop and ask the user for the source PR. + +## Step 2 — Understand the source change + +```bash +gh pr view <PR> --repo sillsdev/machine --json title,body,files,commits +gh pr diff <PR> --repo sillsdev/machine +``` + +Read the full diff. For each changed C# file, open the corresponding file(s) in +`../machine` to understand the surrounding context, and identify the Python counterpart +(see mapping below). Read the existing Python code you're about to change so the port +matches local idiom. + +Note: not every change ports. Skip C#-only concerns (`.csproj`/`.sln`/`Directory.*.props`, +`AssemblyInfo`, `omnisharp.json`, csharpier/editorconfig formatting, NuGet packaging). +The issue says "any *relevant* changes" — use judgment and call out anything you +intentionally skip. + +## Step 3 — File & API mapping + +| machine (C#) | machine.py (Python) | +|---|---| +| `src/SIL.Machine/<PascalArea>/<PascalCase>.cs` (or the matching `SIL.Machine.*` project) | `machine/<area>/<snake_case>.py` | +| `tests/SIL.Machine.Tests/<PascalArea>/<PascalCase>Tests.cs` (or the matching `*.Tests` project) | `tests/<area>/test_<snake_case>.py` | +| `PascalCase` methods / `camelCase` locals / `_camelCase` fields | `snake_case` functions/vars | +| `IReadOnlyList<T>` / `IDictionary<,>` / `ISet<T>` etc. | `Sequence[T]`/`list` / `Mapping`/`dict` / `Set`/`set` etc. | +| NUnit `Assert.That(...)` | pytest plain `assert` (check neighboring test files) | +| `AssemblyInfo`/`.csproj` `<Version>` | `pyproject.toml` `version` (poetry) | + +The top-level Python areas are: `annotations`, `clusterers`, `corpora`, `jobs`, +`optimization`, `punctuation_analysis`, `scripture`, `sequence_alignment`, `statistics`, +`tokenization`, `translation`, `utils`. Some C# code lives in tool/plugin projects +(`SIL.Machine.Tool`, `SIL.Machine.Translation.Thot`, `SIL.Machine.Morphology.HermitCrab`, +etc.); the Python equivalent may live under `machine/jobs` or a different area, or may not +exist at all. Find the Python counterpart by searching for the type/method name (translated +to snake_case): `grep -ri "<type_or_method_name>" machine tests` before assuming a path. + +Port the **behavior**, not the syntax. Match existing Python patterns in the neighboring +code (naming, type hints, dataclasses, generators vs. loops, async conventions). Port the +tests too. + +## Step 4 — Branch & apply + +Create a branch off `main` (do not commit to `main`): + +```bash +git switch main && git pull && git switch -c port-<slug> +``` + +where `<slug>` is a short kebab-case form of the issue title (e.g. +`port-update-library-version-to-1.8.11`). + +Apply the ported changes with Edit/Write. + +## Step 5 — Verify locally + +Install, format, lint, type-check, and test (this is `local_check.sh`): + +```bash +poetry install +poetry run black . +poetry run flake8 . +poetry run isort . +poetry run pyright +poetry run pytest +``` + +`black` and `isort` rewrite files in place; `flake8` and `pyright` are gates that must pass +clean. Fix any formatting, lint, type, or test failures before proceeding. Report the test +results plainly (pass/fail counts); don't claim success if anything failed. + +## Step 6 — Commit, push, open PR (pause first) + +Show the user a summary of the diff and the proposed PR title/body, and **confirm before +pushing**. Then: + +```bash +git add -A +git commit # message: Port '<Title>' from machine PR #<PR> +git push -u origin port-<slug> +gh pr create --title "Port '<Title>' from machine" --body "<body>" +``` + +Commit message footer: + +``` +Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> +``` + +### PR body template + +```markdown +Ports [machine PR #<PR>](https://github.com/sillsdev/machine/pull/<PR>) — <one-line summary of the change>. + +## <Section per area changed> +<What changed and why, mirroring the source PR. Include short before/after or code snippets where helpful.> + +## Tests +<Tests ported / added.> + +Closes #<ISSUE> +``` + +PR body footer: + +``` +🤖 Generated with [Claude Code](https://claude.com/claude-code) +``` + +## Notes + +- Keep the two codebases as similar as is reasonable for a Python-vs-C# port. +- If the source PR spans multiple commits, the squashed PR diff is the source of truth, but + reading individual commits can clarify intent. +- If a change has no sensible Python counterpart, say so in the PR body rather than forcing it. diff --git a/machine/corpora/__init__.py b/machine/corpora/__init__.py index d07e52ee..52c034f6 100644 --- a/machine/corpora/__init__.py +++ b/machine/corpora/__init__.py @@ -11,7 +11,7 @@ from .file_paratext_project_settings_parser import FileParatextProjectSettingsParser from .file_paratext_project_terms_parser import FileParatextProjectTermsParser from .file_paratext_project_text_updater import FileParatextProjectTextUpdater -from .file_paratext_project_versification_error_detector import FileParatextProjectVersificationErrorDetector +from .file_usfm_versification_analyzer import FileUsfmVersificationAnalyzer from .flatten import flatten from .memory_alignment_collection import MemoryAlignmentCollection from .memory_stream_container import MemoryStreamContainer @@ -28,7 +28,6 @@ from .paratext_project_settings_parser_base import ParatextProjectSettingsParserBase from .paratext_project_terms_parser_base import KeyTerm, ParatextProjectTermsParserBase from .paratext_project_text_updater_base import ParatextProjectTextUpdaterBase -from .paratext_project_versification_error_detector_base import ParatextProjectVersificationErrorDetectorBase from .paratext_text_corpus import ParatextTextCorpus from .place_markers_usfm_update_block_handler import PlaceMarkersAlignmentInfo, PlaceMarkersUsfmUpdateBlockHandler from .scripture_element import ScriptureElement @@ -78,10 +77,12 @@ from .usfm_update_block import UsfmUpdateBlock from .usfm_update_block_element import UsfmUpdateBlockElement, UsfmUpdateBlockElementType from .usfm_update_block_handler import UsfmUpdateBlockHandler -from .usfm_versification_error_detector import ( - UsfmVersificationError, - UsfmVersificationErrorDetector, - UsfmVersificationErrorType, +from .usfm_versification_analyzer_base import UsfmVersificationAnalyzerBase +from .usfm_versification_analyzer_handler import ( + UsfmVersificationAnalysis, + UsfmVersificationAnalyzerHandler, + UsfmVersificationDiagnostic, + UsfmVersificationDiagnosticType, ) from .usx_file_alignment_collection import UsxFileAlignmentCollection from .usx_file_alignment_corpus import UsxFileAlignmentCorpus @@ -93,7 +94,7 @@ from .zip_paratext_project_settings_parser import ZipParatextProjectSettingsParser from .zip_paratext_project_terms_parser import ZipParatextProjectTermsParser from .zip_paratext_project_text_updater import ZipParatextProjectTextUpdater -from .zip_paratext_project_versification_detector import ZipParatextProjectVersificationErrorDetector +from .zip_usfm_versification_analyzer import ZipUsfmVersificationAnalyzer __all__ = [ "AlignedWordPair", @@ -114,7 +115,7 @@ "FileParatextProjectSettingsParser", "FileParatextProjectTermsParser", "FileParatextProjectTextUpdater", - "FileParatextProjectVersificationErrorDetector", + "FileUsfmVersificationAnalyzer", "flatten", "is_scripture", "KeyTerm", @@ -139,7 +140,6 @@ "ParatextProjectSettingsParserBase", "ParatextProjectTermsParserBase", "ParatextProjectTextUpdaterBase", - "ParatextProjectVersificationErrorDetectorBase", "ParatextTextCorpus", "parse_usfm", "PlaceMarkersAlignmentInfo", @@ -187,9 +187,11 @@ "UsfmUpdateBlockElement", "UsfmUpdateBlockElementType", "UsfmUpdateBlockHandler", - "UsfmVersificationError", - "UsfmVersificationErrorDetector", - "UsfmVersificationErrorType", + "UsfmVersificationAnalysis", + "UsfmVersificationAnalyzerBase", + "UsfmVersificationAnalyzerHandler", + "UsfmVersificationDiagnostic", + "UsfmVersificationDiagnosticType", "UsxFileAlignmentCollection", "UsxFileAlignmentCorpus", "UsxFileText", @@ -200,5 +202,5 @@ "ZipParatextProjectSettingsParser", "ZipParatextProjectTermsParser", "ZipParatextProjectTextUpdater", - "ZipParatextProjectVersificationErrorDetector", + "ZipUsfmVersificationAnalyzer", ] diff --git a/machine/corpora/file_paratext_project_versification_error_detector.py b/machine/corpora/file_usfm_versification_analyzer.py similarity index 73% rename from machine/corpora/file_paratext_project_versification_error_detector.py rename to machine/corpora/file_usfm_versification_analyzer.py index 41138922..c284e8f8 100644 --- a/machine/corpora/file_paratext_project_versification_error_detector.py +++ b/machine/corpora/file_usfm_versification_analyzer.py @@ -4,10 +4,10 @@ from .file_paratext_project_file_handler import FileParatextProjectFileHandler from .file_paratext_project_settings_parser import FileParatextProjectSettingsParser from .paratext_project_settings import ParatextProjectSettings -from .paratext_project_versification_error_detector_base import ParatextProjectVersificationErrorDetectorBase +from .usfm_versification_analyzer_base import UsfmVersificationAnalyzerBase -class FileParatextProjectVersificationErrorDetector(ParatextProjectVersificationErrorDetectorBase): +class FileUsfmVersificationAnalyzer(UsfmVersificationAnalyzerBase): def __init__(self, project_dir: StrPath, parent_settings: Optional[ParatextProjectSettings] = None) -> None: super().__init__( FileParatextProjectFileHandler(project_dir), diff --git a/machine/corpora/paratext_project_versification_error_detector_base.py b/machine/corpora/usfm_versification_analyzer_base.py similarity index 62% rename from machine/corpora/paratext_project_versification_error_detector_base.py rename to machine/corpora/usfm_versification_analyzer_base.py index 4b96be59..5a04db02 100644 --- a/machine/corpora/paratext_project_versification_error_detector_base.py +++ b/machine/corpora/usfm_versification_analyzer_base.py @@ -1,14 +1,14 @@ -from typing import List, Optional, Set, Union +from typing import Dict, Optional, Set, Union from ..scripture.canon import book_id_to_number from .paratext_project_file_handler import ParatextProjectFileHandler from .paratext_project_settings import ParatextProjectSettings from .paratext_project_settings_parser_base import ParatextProjectSettingsParserBase from .usfm_parser import parse_usfm -from .usfm_versification_error_detector import UsfmVersificationError, UsfmVersificationErrorDetector +from .usfm_versification_analyzer_handler import UsfmVersificationAnalysis, UsfmVersificationAnalyzerHandler -class ParatextProjectVersificationErrorDetectorBase: +class UsfmVersificationAnalyzerBase: def __init__( self, paratext_project_file_handler: ParatextProjectFileHandler, @@ -20,10 +20,20 @@ def __init__( else: self._settings = settings - def get_usfm_versification_errors( - self, handler: Optional[UsfmVersificationErrorDetector] = None, books: Optional[Set[int]] = None - ) -> List[UsfmVersificationError]: - handler = handler or UsfmVersificationErrorDetector(self._settings) + def analyze_usfm_versification( + self, + books_and_chapters: Optional[Union[Dict[str, Optional[Set[int]]], Dict[int, Optional[Set[int]]]]] = None, + handler: Optional[UsfmVersificationAnalyzerHandler] = None, + ) -> UsfmVersificationAnalysis: + book_nums_and_chapters = ( + { + (book_id_to_number(book) if isinstance(book, str) else book): chapters + for book, chapters in books_and_chapters.items() + } + if books_and_chapters is not None + else None + ) + handler = handler or UsfmVersificationAnalyzerHandler(self._settings, book_nums_and_chapters) for book_id in self._settings.get_all_scripture_book_ids(): file_name = self._settings.get_book_file_name(book_id) @@ -31,7 +41,7 @@ def get_usfm_versification_errors( if not self._paratext_project_file_handler.exists(file_name): continue - if books is not None and not book_id_to_number(book_id) in books: + if book_nums_and_chapters is not None and book_id_to_number(book_id) not in book_nums_and_chapters: continue with self._paratext_project_file_handler.open(file_name) as sfm_file: @@ -45,4 +55,4 @@ def get_usfm_versification_errors( f". Error: '{e}'" ) raise RuntimeError(error_message) from e - return handler.errors + return handler.get_analysis() diff --git a/machine/corpora/usfm_versification_analyzer_handler.py b/machine/corpora/usfm_versification_analyzer_handler.py new file mode 100644 index 00000000..f7cc7384 --- /dev/null +++ b/machine/corpora/usfm_versification_analyzer_handler.py @@ -0,0 +1,274 @@ +from dataclasses import dataclass +from enum import Enum, auto +from typing import Dict, Iterator, List, Optional, Sequence, Set + +from machine.scripture import canon + +from ..scripture.constants import ORIGINAL_VERSIFICATION +from ..scripture.verse_ref import ValidStatus, VerseRef +from .paratext_project_settings import ParatextProjectSettings +from .usfm_parser_handler import UsfmParserHandler +from .usfm_parser_state import UsfmParserState + + +class UsfmVersificationDiagnosticType(Enum): + MISSING = auto() # Missing content + EXTRA = auto() # Extra content + INVALID = auto() # Invalid verse or chapter reference + INCORRECT_VERSE_SEGMENT = auto() # Verse segment in vrs but not in USFM or segment in USFM but not in vrs + UNSUPPORTED_VERSE_RANGE = auto() # Verse range that will cross chapter boundaries when mapped to ScrVers.Original + + +@dataclass +class UsfmVersificationDiagnostic: + type: UsfmVersificationDiagnosticType + # Expected verses for MISSING, actual verses for EXTRA and INVALID + references: List[VerseRef] + filename: Optional[str] + line_numbers: List[int] + + @property + def num_affected_verses(self) -> int: + return sum(len(list(vr.all_verses())) for vr in self.references) + + def extend(self, verse_reference: VerseRef, line_number: Optional[int] = None) -> None: + combined = False + if self.references: # Combine contiguous references + last_reference = self.references[-1] + if ( + verse_reference.book == last_reference.book + and verse_reference.chapter_num == last_reference.chapter_num + ): + last_verse_num = list(last_reference.all_verses())[-1].verse_num + next_verse_num = list(verse_reference.all_verses())[0].verse_num + first_verse_num = list(last_reference.all_verses())[0].verse_num + final_verse_num = list(verse_reference.all_verses())[-1].verse_num + if ( + next_verse_num == last_verse_num + 1 + and ( + updated_verse_reference := VerseRef.try_from_string( + f"{verse_reference.book} {verse_reference.chapter_num}:{first_verse_num}-{final_verse_num}" + ) + ) + is not None + ): + self.references[-1] = updated_verse_reference + combined = True + + if not combined: + self.references.append(verse_reference) + if line_number is not None: + self.line_numbers.append(line_number) + + +@dataclass +class UsfmVersificationAnalysis: + total_num_affected_verses: int + total_num_encountered_verses: int + diagnostics: Sequence[UsfmVersificationDiagnostic] + project_settings: ParatextProjectSettings + + +class UsfmVersificationAnalyzerHandler(UsfmParserHandler): + def __init__( + self, settings: ParatextProjectSettings, only_chapters: Optional[Dict[int, Optional[Set[int]]]] = None + ) -> None: + self._settings = settings + self._only_chapters = only_chapters + self._expected_verses: Iterator[VerseRef] = iter(settings.versification.all_included_verses(only_chapters)) + self._cursor: Optional[VerseRef] = None + self._has_more = self._move_next() + self._next_expected_verse = VerseRef() + self._prev_encountered_verse_ref = VerseRef(1, 1, 0) + self._diagnostics: List[UsfmVersificationDiagnostic] = [] + self._filename: Optional[str] = None + self._last_verse_in_error = False + self._last_verse_was_extra = False + self._last_verse_was_invalid = False + self._total_verses_analyzed = 0 + self._last_line_number = 1 + + def _move_next(self) -> bool: + try: + self._cursor = next(self._expected_verses) + return True + except StopIteration: + self._cursor = None + return False + + def _get_next_expected_verse(self) -> None: + self._next_expected_verse = self._cursor if self._cursor is not None else VerseRef() + self._has_more = self._move_next() + + @property + def _current_error(self) -> UsfmVersificationDiagnostic: + return self._diagnostics[-1] + + def get_analysis(self) -> UsfmVersificationAnalysis: + while self._has_more: + if not self._last_verse_was_invalid: + self._get_next_expected_verse() + self._handle_missing_verse() + self._last_verse_was_invalid = False + return UsfmVersificationAnalysis( + total_num_affected_verses=sum(d.num_affected_verses for d in self._diagnostics), + total_num_encountered_verses=self._total_verses_analyzed, + diagnostics=self._diagnostics, + project_settings=self._settings, + ) + + def start_book(self, state: UsfmParserState, marker: str, code: str) -> None: + self._filename = self._settings.get_book_file_name(state.verse_ref.book) + + def chapter( + self, state: UsfmParserState, number: str, marker: str, alt_number: Optional[str], pub_number: Optional[str] + ) -> None: + verse_ref = state.verse_ref.copy() + if not canon.is_canonical(verse_ref.book): + return + verse_ref.chapter = number + if verse_ref.chapter_num == -1: + self._diagnostics.append( + UsfmVersificationDiagnostic( + type=UsfmVersificationDiagnosticType.INVALID, + references=[verse_ref], + filename=self._filename, + line_numbers=[state.line_number], + ) + ) + self._last_verse_in_error = True + + def verse( + self, state: UsfmParserState, number: str, marker: str, alt_number: Optional[str], pub_number: Optional[str] + ) -> None: + current_verses = state.verse_ref.copy() + if self._only_chapters is not None: + if current_verses.book_num not in self._only_chapters: + return + chapters_filter = self._only_chapters[current_verses.book_num] + if chapters_filter is not None and current_verses.chapter_num not in chapters_filter: + return + + verse_ref = current_verses.copy() + if not canon.is_canonical(verse_ref.book): + return + verse_ref.verse = number + invalid_verse_num = verse_ref.verse_num == -1 + bad_verse_range = verse_ref.valid_status in (ValidStatus.VERSE_OUT_OF_ORDER, ValidStatus.VERSE_REPEATED) + if invalid_verse_num or bad_verse_range: + self._handle_invalid_verse(state, verse_ref) + self._last_verse_was_invalid = True + else: + self._last_verse_was_invalid = False + + segment_mismatch = (not current_verses.segment()) == current_verses.has_segments_defined + if segment_mismatch: + self._handle_incorrect_verse_segment(state, verse_ref) + + if current_verses.has_multiple: + copy = current_verses.copy() + has_cross_chapter_verse_range = not copy.change_versification(ORIGINAL_VERSIFICATION) + if has_cross_chapter_verse_range: + self._diagnostics.append( + UsfmVersificationDiagnostic( + type=UsfmVersificationDiagnosticType.UNSUPPORTED_VERSE_RANGE, + references=[current_verses], + filename=self._filename, + line_numbers=[state.line_number], + ) + ) + + for current_verse in sorted(current_verses.all_verses()): + # Properly handle verse segments + if self._prev_encountered_verse_ref.compare_to(current_verse, compare_segments=False) < 0: + self._total_verses_analyzed += 1 + if not self._last_verse_was_extra and self._has_more: + self._get_next_expected_verse() + if self._next_expected_verse.is_default: + continue + compare = self._next_expected_verse.compare_to(current_verse, compare_segments=False) + if compare < 0 and self._has_more: + self._handle_missing_verse() + self._get_next_expected_verse() + while ( + self._has_more and self._next_expected_verse.compare_to(current_verse, compare_segments=False) < 0 + ): + self._current_error.extend(self._next_expected_verse) + self._get_next_expected_verse() + elif (compare > 0 and not self._last_verse_was_invalid) or (compare < 0 and not self._has_more): + # We want Invalid and Extra to be mutually exclusive to avoid duplicate errors for every + # Invalid/Extra verse + if not self._has_more and self._next_expected_verse.compare_to(self._prev_encountered_verse_ref) > 0: + self._diagnostics.append( + UsfmVersificationDiagnostic( + type=UsfmVersificationDiagnosticType.MISSING, + references=[self._next_expected_verse], + filename=self._filename, + line_numbers=[self._last_line_number], + ) + ) + + self._handle_extra_verse(state.line_number, current_verse) + else: + self._last_verse_in_error = False + if compare <= 0: + self._last_verse_was_extra = False + + self._prev_encountered_verse_ref = current_verse + + self._last_line_number = state.line_number + + def _handle_invalid_verse(self, state: UsfmParserState, verse_ref: VerseRef) -> None: + self._diagnostics.append( + UsfmVersificationDiagnostic( + type=UsfmVersificationDiagnosticType.INVALID, + references=[verse_ref], + filename=self._filename, + line_numbers=[state.line_number], + ) + ) + self._last_verse_in_error = True + + def _handle_incorrect_verse_segment(self, state: UsfmParserState, verse_ref: VerseRef) -> None: + self._diagnostics.append( + UsfmVersificationDiagnostic( + type=UsfmVersificationDiagnosticType.INCORRECT_VERSE_SEGMENT, + references=[verse_ref], + filename=self._filename, + line_numbers=[state.line_number], + ) + ) + self._last_verse_in_error = True + + def _handle_extra_verse(self, line_number: int, current_verse: VerseRef) -> None: + if not self._last_verse_in_error or ( + self._last_verse_in_error and self._current_error.type != UsfmVersificationDiagnosticType.EXTRA + ): + self._diagnostics.append( + UsfmVersificationDiagnostic( + type=UsfmVersificationDiagnosticType.EXTRA, + references=[current_verse], + filename=self._filename, + line_numbers=[line_number], + ) + ) + self._last_verse_in_error = True + else: + self._current_error.extend(current_verse, line_number) + self._last_verse_was_extra = True + + def _handle_missing_verse(self) -> None: + if not self._last_verse_in_error or ( + self._last_verse_in_error and self._current_error.type != UsfmVersificationDiagnosticType.MISSING + ): + self._diagnostics.append( + UsfmVersificationDiagnostic( + type=UsfmVersificationDiagnosticType.MISSING, + references=[self._next_expected_verse], + filename=self._filename, + line_numbers=[self._last_line_number], + ) + ) + self._last_verse_in_error = True + else: + self._current_error.extend(self._next_expected_verse) diff --git a/machine/corpora/usfm_versification_error_detector.py b/machine/corpora/usfm_versification_error_detector.py deleted file mode 100644 index fc0374ad..00000000 --- a/machine/corpora/usfm_versification_error_detector.py +++ /dev/null @@ -1,238 +0,0 @@ -from enum import Enum, auto -from typing import List, Optional - -from machine.scripture import canon - -from ..scripture.verse_ref import ValidStatus, VerseRef -from .paratext_project_settings import ParatextProjectSettings -from .usfm_parser_handler import UsfmParserHandler -from .usfm_parser_state import UsfmParserState - - -class UsfmVersificationErrorType(Enum): - MISSING_CHAPTER = auto() - MISSING_VERSE = auto() - EXTRA_VERSE = auto() - INVALID_VERSE_RANGE = auto() - MISSING_VERSE_SEGMENT = auto() - EXTRA_VERSE_SEGMENT = auto() - INVALID_CHAPTER_NUMBER = auto() - INVALID_VERSE_NUMBER = auto() - - -class UsfmVersificationError: - def __init__( - self, - book_num: int = 0, - expected_chapter: int = 0, - expected_verse: int = 0, - actual_chapter: int = 0, - actual_verse: int = 0, - project_name: str = "", - verse_ref: Optional[VerseRef] = None, - actual_value: Optional[str] = None, - usfm_versification_error_type: Optional[UsfmVersificationErrorType] = None, - ): - self._book_num = book_num - self._expected_chapter = expected_chapter - self._expected_verse = expected_verse - self._actual_chapter = actual_chapter - self._actual_verse = actual_verse - self._verse_ref = verse_ref - self._type: UsfmVersificationErrorType - if usfm_versification_error_type is not None: - self._type = usfm_versification_error_type - self._actual_value = actual_value or "" - self._project_name = project_name - - @property - def type(self) -> UsfmVersificationErrorType: - return self._type - - @property - def project_name(self) -> str: - return self._project_name - - def check_error(self) -> bool: - """Returns true if there is an error""" - if self._expected_chapter > self._actual_chapter and self._expected_verse != 0: - self._type = UsfmVersificationErrorType.MISSING_CHAPTER - return True - if self._expected_verse > self._actual_verse and self._expected_chapter == self._actual_chapter: - self._type = UsfmVersificationErrorType.MISSING_VERSE - return True - if self._verse_ref is not None: - if not self._verse_ref.segment() and self._verse_ref.has_segments_defined: - self._type = UsfmVersificationErrorType.MISSING_VERSE_SEGMENT - return True - if self._verse_ref.segment() and not self._verse_ref.has_segments_defined: - self._type = UsfmVersificationErrorType.EXTRA_VERSE_SEGMENT - return True - if not self._verse_ref.is_valid: - self._type = UsfmVersificationError.map(self._verse_ref.valid_status) - return True - return False - - @staticmethod - def map(valid_status: ValidStatus) -> UsfmVersificationErrorType: - if valid_status == ValidStatus.OUT_OF_RANGE: - return UsfmVersificationErrorType.EXTRA_VERSE - if valid_status == ValidStatus.VERSE_REPEATED or valid_status == ValidStatus.VERSE_OUT_OF_ORDER: - return UsfmVersificationErrorType.INVALID_VERSE_RANGE - raise ValueError( - f"{ValidStatus.__name__} {valid_status} does not map to any {UsfmVersificationErrorType.__name__}" - ) - - @property - def expected_verse_ref(self) -> str: - if self._type in [ - UsfmVersificationErrorType.EXTRA_VERSE, - UsfmVersificationErrorType.INVALID_CHAPTER_NUMBER, - UsfmVersificationErrorType.INVALID_VERSE_NUMBER, - ]: - return "" - if ( - default_verse_ref := VerseRef.try_from_string( - f"{canon.book_number_to_id(self._book_num)} {self._expected_chapter}:{self._expected_verse}" - ) - ) is None: - return self.default_verse(self._expected_chapter, self._expected_verse) - if self._type == UsfmVersificationErrorType.MISSING_VERSE_SEGMENT: - if ( - verse_ref_with_segment := VerseRef.try_from_string( - f"{canon.book_number_to_id(self._book_num)} {self._expected_chapter}:{self._expected_verse}a" - ) - ) is not None: - return str(verse_ref_with_segment) - if self._type == UsfmVersificationErrorType.INVALID_VERSE_RANGE and self._verse_ref is not None: - sorted_all_unique_verses = sorted(set(self._verse_ref.all_verses())) - first_verse = sorted_all_unique_verses[0] - last_verse = sorted_all_unique_verses[-1] - if first_verse == last_verse: - return str(first_verse) - elif ( - corrected_verse_range_ref := VerseRef.try_from_string( - f"{canon.book_number_to_id(self._book_num)} " - f"{self._expected_chapter}:{first_verse.verse_num}-{last_verse.verse_num}" - ) - ) is not None: - return str(corrected_verse_range_ref) - return str(default_verse_ref) - - @property - def actual_verse_ref(self) -> str: - if self.type == UsfmVersificationErrorType.INVALID_CHAPTER_NUMBER: - return f"{canon.book_number_to_id(self._book_num)} {self._actual_value}" - if self.type == UsfmVersificationErrorType.INVALID_VERSE_NUMBER: - return f"{canon.book_number_to_id(self._book_num)} {self._expected_chapter}:{self._actual_value}" - if self._verse_ref is not None: - return str(self._verse_ref) - if actual_verse_ref := VerseRef.try_from_string( - f"{canon.book_number_to_id(self._book_num)} {self._actual_chapter}:{self._actual_verse}" - ): - return str(actual_verse_ref) - return self.default_verse(self._actual_chapter, self._actual_verse) - - def default_verse(self, chapter: int, verse: int): - verse_string = "" if self._actual_verse == -1 else str(verse) - return f"{canon.book_number_to_id(self._book_num)} {chapter}:{verse_string}" - - -class UsfmVersificationErrorDetector(UsfmParserHandler): - def __init__(self, settings: ParatextProjectSettings): - self._project_name = settings.name - self._versification = settings.versification - self._current_book = 0 - self._current_chapter = 0 - self._current_verse = VerseRef() - self._errors: List[UsfmVersificationError] = [] - - @property - def errors(self) -> List[UsfmVersificationError]: - return self._errors.copy() - - def end_usfm(self, state: UsfmParserState) -> None: - if self._current_book > 0 and canon.is_canonical(self._current_book): - versification_error = UsfmVersificationError( - self._current_book, - self._versification.get_last_chapter(self._current_book), - self._versification.get_last_verse( - self._current_book, self._versification.get_last_chapter(self._current_book) - ), - self._current_chapter, - list(self._current_verse.all_verses())[-1].verse_num, - self._project_name, - ) - if versification_error.check_error(): - self._errors.append(versification_error) - - def start_book(self, state: UsfmParserState, marker: str, code: str) -> None: - self._current_book = state.verse_ref.book_num - self._current_chapter = 0 - self._current_verse = VerseRef() - - def chapter( - self, state: UsfmParserState, number: str, marker: str, alt_number: Optional[str], pub_number: Optional[str] - ) -> None: - if self._current_book > 0 and canon.is_canonical(self._current_book) and self._current_chapter > 0: - versification_error = UsfmVersificationError( - self._current_book, - self._current_chapter, - self._versification.get_last_verse(self._current_book, self._current_chapter), - self._current_chapter, - list(self._current_verse.all_verses())[-1].verse_num, - self._project_name, - ) - if versification_error.check_error(): - self._errors.append(versification_error) - - self._current_chapter = state.verse_ref.chapter_num - self._current_verse = VerseRef() - - # See whether the chapter number is invalid - verse_ref = state.verse_ref.copy() - verse_ref.chapter = number - if verse_ref.chapter_num == -1: - self._errors.append( - UsfmVersificationError( - book_num=self._current_book, - expected_chapter=self._current_chapter, - actual_value=number, - project_name=self._project_name, - usfm_versification_error_type=UsfmVersificationErrorType.INVALID_CHAPTER_NUMBER, - ) - ) - - def verse( - self, state: UsfmParserState, number: str, marker: str, alt_number: Optional[str], pub_number: Optional[str] - ) -> None: - verse_in_error = False - self._current_verse = state.verse_ref.copy() - if self._current_book > 0 and canon.is_canonical(self._current_book) and self._current_chapter > 0: - versification_error = UsfmVersificationError( - self._current_book, - self._current_chapter, - list(self._current_verse.all_verses())[-1].verse_num, - self._current_chapter, - list(self._current_verse.all_verses())[-1].verse_num, - self._project_name, - self._current_verse, - ) - if versification_error.check_error(): - self._errors.append(versification_error) - verse_in_error = True - - if not verse_in_error: - # See whether the verse number is invalid - verse_ref = self._current_verse.copy() - verse_ref.verse = number - if verse_ref.verse_num == -1: - self._errors.append( - UsfmVersificationError( - book_num=self._current_book, - expected_chapter=self._current_chapter, - actual_value=number, - project_name=self._project_name, - usfm_versification_error_type=UsfmVersificationErrorType.INVALID_VERSE_NUMBER, - ) - ) diff --git a/machine/corpora/zip_paratext_project_versification_detector.py b/machine/corpora/zip_usfm_versification_analyzer.py similarity index 72% rename from machine/corpora/zip_paratext_project_versification_detector.py rename to machine/corpora/zip_usfm_versification_analyzer.py index bc1964a3..8bd1fdac 100644 --- a/machine/corpora/zip_paratext_project_versification_detector.py +++ b/machine/corpora/zip_usfm_versification_analyzer.py @@ -2,12 +2,12 @@ from zipfile import ZipFile from .paratext_project_settings import ParatextProjectSettings -from .paratext_project_versification_error_detector_base import ParatextProjectVersificationErrorDetectorBase +from .usfm_versification_analyzer_base import UsfmVersificationAnalyzerBase from .zip_paratext_project_file_handler import ZipParatextProjectFileHandler from .zip_paratext_project_settings_parser import ZipParatextProjectSettingsParser -class ZipParatextProjectVersificationErrorDetector(ParatextProjectVersificationErrorDetectorBase): +class ZipUsfmVersificationAnalyzer(UsfmVersificationAnalyzerBase): def __init__(self, archive: ZipFile, parent_settings: Optional[ParatextProjectSettings] = None): super().__init__( ZipParatextProjectFileHandler(archive), ZipParatextProjectSettingsParser(archive, parent_settings).parse() diff --git a/machine/scripture/verse_ref.py b/machine/scripture/verse_ref.py index 11a3931e..2a362ce1 100644 --- a/machine/scripture/verse_ref.py +++ b/machine/scripture/verse_ref.py @@ -876,7 +876,9 @@ def change_versification(self, vref: VerseRef, ignore_segments: bool = False) -> vref.versification = self return True - def all_included_verses(self) -> Generator[VerseRef, None, None]: + def all_included_verses( + self, only_chapters: Optional[Dict[int, Optional[Set[int]]]] = None + ) -> Generator[VerseRef, None, None]: for book, chapters in enumerate(self.book_list): book = book + 1 if not is_canonical(book) or (book > 86 and book < 93): @@ -885,6 +887,12 @@ def all_included_verses(self) -> Generator[VerseRef, None, None]: chapter = chapter + 1 first_verse = self.first_included_verse(book, chapter) yielded_first_verse = False + if only_chapters is not None: + if book not in only_chapters: + continue + chapters_filter = only_chapters[book] + if chapters_filter is not None and chapter not in chapters_filter: + continue for verse_number in range(2, last_verse + 1): verse = VerseRef(book=book, chapter=chapter, verse=verse_number, versification=self) if self.is_excluded(verse.bbbcccvvv): diff --git a/tests/corpora/test_usfm_manual.py b/tests/corpora/test_usfm_manual.py index cb214f85..c093a828 100644 --- a/tests/corpora/test_usfm_manual.py +++ b/tests/corpora/test_usfm_manual.py @@ -18,9 +18,10 @@ StandardParallelTextCorpus, UpdateUsfmRow, UpdateUsfmTextBehavior, - ZipParatextProjectVersificationErrorDetector, + ZipUsfmVersificationAnalyzer, ) from machine.punctuation_analysis import ZipParatextProjectQuoteConventionDetector +from machine.scripture import ALL_BOOK_IDS @pytest.mark.skip(reason="This is for manual testing only. Remove this decorator to run the test.") @@ -68,6 +69,6 @@ def test_analyze_corpora_quote_conventions(): @pytest.mark.skip(reason="This is for manual testing only. Remove this decorator to run the test.") def test_validate_usfm_versification(): archive = zipfile.ZipFile(USFM_SOURCE_PROJECT_ZIP_PATH, "r") - versification_error_detector = ZipParatextProjectVersificationErrorDetector(archive) - errors = versification_error_detector.get_usfm_versification_errors() - assert len(errors) == 0 + versification_analyzer = ZipUsfmVersificationAnalyzer(archive) + analysis = versification_analyzer.analyze_usfm_versification({book_id: None for book_id in ALL_BOOK_IDS}) + assert len(analysis.diagnostics) == 0 diff --git a/tests/corpora/test_usfm_versification_analyzer.py b/tests/corpora/test_usfm_versification_analyzer.py new file mode 100644 index 00000000..da6807f3 --- /dev/null +++ b/tests/corpora/test_usfm_versification_analyzer.py @@ -0,0 +1,489 @@ +from io import StringIO +from typing import Dict, List, Optional, Set + +from testutils.memory_paratext_project_file_handler import DefaultParatextProjectSettings +from testutils.memory_usfm_versification_analyzer import MemoryUsfmVersificationAnalyzer + +from machine.corpora import ParatextProjectSettings, UsfmVersificationAnalysis, UsfmVersificationDiagnosticType +from machine.scripture import ENGLISH_VERSIFICATION, ORIGINAL_VERSIFICATION, Versification + + +def test_no_errors(): + env = _TestEnvironment( + files={ + "653JNTest.SFM": r"""\id 3JN + \c 1 + \v 1 + \v 2 + \v 3 + \v 4 + \v 5 + \v 6 + \v 7 + \v 8 + \v 9 + \v 10 + \v 11 + \v 12 + \v 13 + \v 14 + \v 15 + """ + } + ) + analysis = env.analyze_usfm_versification(only_books=["3JN"]) + assert analysis.total_num_encountered_verses == 15 + assert len(analysis.diagnostics) == 0 + + +def test_missing_verse(): + env = _TestEnvironment( + files={ + "653JNTest.SFM": r"""\id 3JN + \c 1 + \v 1 + \v 2 + \v 3 + \v 4 + \v 5 + \v 6 + \v 7 + \v 8 + \v 9 + \v 10 + \v 11 + \v 12 + \v 13 + \v 14 + """ + } + ) + analysis = env.analyze_usfm_versification(only_books=["3JN"]) + assert len(analysis.diagnostics) == 1 + assert analysis.total_num_encountered_verses == 14 + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.MISSING + assert analysis.diagnostics[0].num_affected_verses == 1 + assert analysis.diagnostics[0].filename == "653JNTest.SFM" + assert analysis.diagnostics[0].line_numbers == [16] + assert len(analysis.diagnostics[0].references) == 1 + assert str(analysis.diagnostics[0].references[0]) == "3JN 1:15" + + analysis = env.analyze_usfm_versification(only_chapters={"3JN": set()}) + assert len(analysis.diagnostics) == 0 + assert analysis.total_num_encountered_verses == 0 + + +def test_missing_chapter(): + env = _TestEnvironment( + files={ + "653JNTest.SFM": r"""\id 3JN + """ + } + ) + analysis = env.analyze_usfm_versification(only_books=["3JN"]) + assert len(analysis.diagnostics) == 1 + assert analysis.total_num_encountered_verses == 0 + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.MISSING + assert analysis.diagnostics[0].num_affected_verses == 15 + assert analysis.diagnostics[0].line_numbers == [1] + assert len(analysis.diagnostics[0].references) == 1 + assert str(analysis.diagnostics[0].references[0]) == "3JN 1:1-15" + + +def test_extra_verse(): + env = _TestEnvironment( + files={ + "653JNTest.SFM": r"""\id 3JN + \c 1 + \v 1 + \v 2 + \v 3 + \v 4 + \v 5 + \v 6 + \v 7 + \v 8 + \v 9 + \v 10 + \v 11 + \v 12 + \v 13 + \v 14 + \v 15 + \v 16 + """ + } + ) + analysis = env.analyze_usfm_versification(only_books=["3JN"]) + assert len(analysis.diagnostics) == 1 + assert analysis.total_num_encountered_verses == 16 + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.EXTRA + assert analysis.diagnostics[0].num_affected_verses == 1 + assert analysis.diagnostics[0].line_numbers == [18] + assert len(analysis.diagnostics[0].references) == 1 + assert str(analysis.diagnostics[0].references[0]) == "3JN 1:16" + + +def test_invalid_verse(): + env = _TestEnvironment( + files={ + "653JNTest.SFM": r"""\id 3JN + \c 1 + \v 1 + \v 2 + \v 3 + \v 4 + \v 5 + \v 6 + \v 7 + \v 8 + \v 9 + \v 10 + \v 11 + \v 13-12 + \v 14 + \v 15 + """ + } + ) + analysis = env.analyze_usfm_versification(only_books=["3JN"]) + assert len(analysis.diagnostics) == 1 + assert analysis.total_num_encountered_verses == 15 + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.INVALID + assert analysis.diagnostics[0].num_affected_verses == 2 + assert analysis.diagnostics[0].line_numbers == [14] + assert len(analysis.diagnostics[0].references) == 1 + assert str(analysis.diagnostics[0].references[0]) == "3JN 1:13-12" + + +def test_extra_verse_segment(): + env = _TestEnvironment( + files={ + "653JNTest.SFM": r"""\id 3JN + \c 1 + \v 1 + \v 2 + \v 3 + \v 4 + \v 5 + \v 6 + \v 7 + \v 8 + \v 9 + \v 10 + \v 11 + \v 12 + \v 13 + \v 14a + \v 14b + \v 15 + """ + } + ) + analysis = env.analyze_usfm_versification(only_books=["3JN"]) + assert len(analysis.diagnostics) == 2 + assert analysis.total_num_encountered_verses == 15 + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.INCORRECT_VERSE_SEGMENT + assert analysis.diagnostics[0].num_affected_verses == 1 + assert analysis.diagnostics[0].line_numbers == [16] + assert len(analysis.diagnostics[0].references) == 1 + assert str(analysis.diagnostics[0].references[0]) == "3JN 1:14a" + + +def test_missing_verse_segment(): + env = _TestEnvironment( + files={ + "653JNTest.SFM": r"""\id 3JN + \c 1 + \v 1 + \v 2 + \v 3 + \v 4 + \v 5 + \v 6 + \v 7 + \v 8 + \v 9 + \v 10 + \v 11 + \v 12 + \v 13 + \v 14 + \v 15 + """ + }, + settings=DefaultParatextProjectSettings(versification=get_custom_versification(r"*3JN 1:13,a,b")), + ) + analysis = env.analyze_usfm_versification(only_books=["3JN"]) + assert len(analysis.diagnostics) == 1 + assert analysis.total_num_encountered_verses == 15 + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.INCORRECT_VERSE_SEGMENT + assert analysis.diagnostics[0].num_affected_verses == 1 + assert analysis.diagnostics[0].line_numbers == [15] + assert len(analysis.diagnostics[0].references) == 1 + assert str(analysis.diagnostics[0].references[0]) == "3JN 1:13" + + +def test_ignore_noncanonicals(): + env = _TestEnvironment( + files={ + "98XXETest.SFM": r"""\id XXE + \c 1 + \v 3-2 + """ + } + ) + analysis = env.analyze_usfm_versification(only_books=["XXE"]) + assert len(analysis.diagnostics) == 0 + + +def test_extra_verse_excluded_in_custom_vrs(): + env = _TestEnvironment( + files={ + "653JNTest.SFM": r"""\id 3JN + \c 1 + \v 1 + \v 2 + \v 3 + \v 4 + \v 5 + \v 6 + \v 7 + \v 8 + \v 9 + \v 10 + \v 11 + \v 12 + \v 13 + \v 14 + \v 15 + """, + }, + settings=DefaultParatextProjectSettings(versification=get_custom_versification(r"-3JN 1:13")), + ) + analysis = env.analyze_usfm_versification(only_books=["3JN"]) + assert len(analysis.diagnostics) == 1 + assert analysis.total_num_encountered_verses == 15 + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.EXTRA + assert analysis.diagnostics[0].num_affected_verses == 1 + assert analysis.diagnostics[0].line_numbers == [15] + assert len(analysis.diagnostics[0].references) == 1 + assert str(analysis.diagnostics[0].references[0]) == "3JN 1:13" + + +def test_multiple_books(): + env = _TestEnvironment( + files={ + "642JNTest.SFM": r"""\id 2JN + \c 1 + \v 1 + \v 2 + \v 3 + \v 4 + \v 5 + \v 6 + \v 7 + \v 8 + \v 9 + \v 10 + \v 11 + \v 12 + """, + "653JNTest.SFM": r"""\id 3JN + \c 1 + \v 1 + \v 2 + \v 3 + \v 4 + \v 5 + \v 6 + \v 7 + \v 8 + \v 9 + \v 10 + \v 11 + \v 12 + \v 13 + \v 14 + \v 15 + """, + } + ) + analysis = env.analyze_usfm_versification(only_books=["2JN", "3JN"]) + assert len(analysis.diagnostics) == 1 + assert analysis.total_num_encountered_verses == 27 + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.MISSING + assert analysis.diagnostics[0].num_affected_verses == 1 + assert analysis.diagnostics[0].line_numbers == [14] + assert len(analysis.diagnostics[0].references) == 1 + assert str(analysis.diagnostics[0].references[0]) == "2JN 1:13" + + +def test_multiple_chapters(): + env = _TestEnvironment( + files={ + "642JNTest.SFM": r"""\id 2JN + \c 1 + \v 1 + \v 2 + \v 3 + \v 4 + \v 5 + \v 6 + \v 7 + \v 8 + \v 9 + \v 10 + \v 11 + \v 12 + \c 2 + \v 1 + """ + } + ) + analysis = env.analyze_usfm_versification(only_books=["2JN"]) + assert len(analysis.diagnostics) == 2 + assert analysis.total_num_encountered_verses == 13 + + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.MISSING + assert analysis.diagnostics[0].num_affected_verses == 1 + assert analysis.diagnostics[0].line_numbers == [14] + assert len(analysis.diagnostics[0].references) == 1 + assert str(analysis.diagnostics[0].references[0]) == "2JN 1:13" + + assert analysis.diagnostics[1].type == UsfmVersificationDiagnosticType.EXTRA + assert analysis.diagnostics[1].num_affected_verses == 1 + assert analysis.diagnostics[1].line_numbers == [16] + assert len(analysis.diagnostics[1].references) == 1 + assert str(analysis.diagnostics[1].references[0]) == "2JN 2:1" + + analysis = env.analyze_usfm_versification(only_chapters={"2JN": {1}}) + assert len(analysis.diagnostics) == 1 + assert analysis.total_num_encountered_verses == 12 + + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.MISSING + assert analysis.diagnostics[0].num_affected_verses == 1 + assert analysis.diagnostics[0].line_numbers == [14] + assert len(analysis.diagnostics[0].references) == 1 + assert str(analysis.diagnostics[0].references[0]) == "2JN 1:13" + + +def test_invalid_chapter_number(): + env = _TestEnvironment( + files={ + "653JNTest.SFM": r"""\id 3JN + \c 1. + """ + } + ) + analysis = env.analyze_usfm_versification(only_books=["3JN"]) + assert len(analysis.diagnostics) == 2 + assert analysis.total_num_encountered_verses == 0 + + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.INVALID + assert analysis.diagnostics[0].num_affected_verses == 1 + assert analysis.diagnostics[0].line_numbers == [2] + assert len(analysis.diagnostics[0].references) == 1 + assert str(analysis.diagnostics[0].references[0]) == "3JN :0" + + assert analysis.diagnostics[1].type == UsfmVersificationDiagnosticType.MISSING + assert analysis.diagnostics[1].num_affected_verses == 15 + assert analysis.diagnostics[1].line_numbers == [1] + assert len(analysis.diagnostics[1].references) == 1 + assert str(analysis.diagnostics[1].references[0]) == "3JN 1:1-15" + + +def test_invalid_verse_number(): + env = _TestEnvironment( + files={ + "653JNTest.SFM": r"""\id 3JN + \c 1 + \v v1 + """ + } + ) + analysis = env.analyze_usfm_versification(only_books=["3JN"]) + assert len(analysis.diagnostics) == 2 + assert analysis.total_num_encountered_verses == 1 + + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.INVALID + assert analysis.diagnostics[0].num_affected_verses == 1 + assert analysis.diagnostics[0].line_numbers == [3] + assert len(analysis.diagnostics[0].references) == 1 + assert str(analysis.diagnostics[0].references[0]) == "3JN 1:v1" + + assert analysis.diagnostics[1].type == UsfmVersificationDiagnosticType.MISSING + assert analysis.diagnostics[1].num_affected_verses == 15 + assert analysis.diagnostics[1].line_numbers == [3] + assert len(analysis.diagnostics[1].references) == 1 + assert str(analysis.diagnostics[1].references[0]) == "3JN 1:1-15" + + +def test_unsupported_cross_chapter_verse_reference(): + env = _TestEnvironment( + files={ + "03LEVTest.SFM": r"""\id LEV + \c 6 + \v 1 + \v 2 + \v 3 + \v 4 + \v 5 + \v 6-9 + \v 10-30 + """ + }, + # The project uses the English versification, in which LEV 6:6-9 maps to + # 5:25-6:2 in the Original versification (crossing a chapter boundary). + settings=DefaultParatextProjectSettings(versification=ENGLISH_VERSIFICATION), + ) + analysis = env.analyze_usfm_versification(only_books=["LEV"]) + assert len(analysis.diagnostics) == 3 + assert analysis.total_num_encountered_verses == 30 + assert analysis.total_num_affected_verses == 833 + + assert analysis.diagnostics[0].type == UsfmVersificationDiagnosticType.MISSING + assert analysis.diagnostics[0].num_affected_verses == 104 + assert analysis.diagnostics[0].line_numbers == [1] + assert len(analysis.diagnostics[0].references) == 5 + assert str(analysis.diagnostics[0].references[0]) == "LEV 1:1-17" + + assert analysis.diagnostics[1].type == UsfmVersificationDiagnosticType.UNSUPPORTED_VERSE_RANGE + assert analysis.diagnostics[1].num_affected_verses == 4 + assert analysis.diagnostics[1].line_numbers == [8] + assert len(analysis.diagnostics[1].references) == 1 + assert str(analysis.diagnostics[1].references[0]) == "LEV 6:6-9" + + assert analysis.diagnostics[2].type == UsfmVersificationDiagnosticType.MISSING + assert analysis.diagnostics[2].num_affected_verses == 725 + assert analysis.diagnostics[2].line_numbers == [9] + assert len(analysis.diagnostics[2].references) == 21 + assert str(analysis.diagnostics[2].references[0]) == "LEV 7:1-38" + + +class _TestEnvironment: + def __init__(self, settings: Optional[ParatextProjectSettings] = None, files: Optional[Dict[str, str]] = None): + self._settings = settings + self._files = files or {} + self.analyzer = MemoryUsfmVersificationAnalyzer(settings, self._files) + + def analyze_usfm_versification( + self, + only_books: Optional[List[str]] = None, + only_chapters: Optional[Dict[str, Optional[Set[int]]]] = None, + ) -> UsfmVersificationAnalysis: + if only_chapters is not None: + return self.analyzer.analyze_usfm_versification(only_chapters) + book_ids_and_chapters: Optional[Dict[str, Optional[Set[int]]]] = ( + {book: None for book in only_books} if only_books is not None else None + ) + return self.analyzer.analyze_usfm_versification(book_ids_and_chapters) + + +def get_custom_versification( + custom_vrs_contents: str, base_versification: Optional[Versification] = None +) -> Versification: + stream = StringIO(custom_vrs_contents) + versification = base_versification or Versification("custom", "vers.txt", ORIGINAL_VERSIFICATION) + versification = Versification.parse(stream, "vers.txt", versification, "custom") + return versification diff --git a/tests/corpora/test_usfm_versification_error_detector.py b/tests/corpora/test_usfm_versification_error_detector.py deleted file mode 100644 index 15cca145..00000000 --- a/tests/corpora/test_usfm_versification_error_detector.py +++ /dev/null @@ -1,377 +0,0 @@ -from io import StringIO -from typing import Dict, List, Optional - -from testutils.memory_paratext_project_file_handler import DefaultParatextProjectSettings -from testutils.memory_paratext_project_versification_error_detector import ( - MemoryParatextProjectVersificationErrorDetector, -) - -from machine.corpora import ParatextProjectSettings, UsfmVersificationError, UsfmVersificationErrorType -from machine.scripture import ORIGINAL_VERSIFICATION, Versification - - -def test_get_usfm_versification_errors_no_errors(): - env = _TestEnvironment( - files={ - "653JNTest.SFM": r"""\id 3JN - \c 1 - \v 1 - \v 2 - \v 3 - \v 4 - \v 5 - \v 6 - \v 7 - \v 8 - \v 9 - \v 10 - \v 11 - \v 12 - \v 13 - \v 14 - \v 15 - """ - } - ) - assert len(env.get_usfm_versification_errors()) == 0 - - -def test_get_usfm_versification_errors_missing_verse(): - env = _TestEnvironment( - files={ - "653JNTest.SFM": r"""\id 3JN - \c 1 - \v 1 - \v 2 - \v 3 - \v 4 - \v 5 - \v 6 - \v 7 - \v 8 - \v 9 - \v 10 - \v 11 - \v 12 - \v 13 - \v 14 - """ - } - ) - errors = env.get_usfm_versification_errors() - assert len(errors) == 1 - assert errors[0].type == UsfmVersificationErrorType.MISSING_VERSE - assert errors[0].expected_verse_ref == "3JN 1:15" - assert errors[0].actual_verse_ref == "3JN 1:14" - - -def test_get_usfm_versification_missing_chapter(): - env = _TestEnvironment( - files={ - "653JNTest.SFM": r"""\id 3JN - """ - } - ) - errors = env.get_usfm_versification_errors() - assert len(errors) == 1 - assert errors[0].type == UsfmVersificationErrorType.MISSING_CHAPTER - assert errors[0].expected_verse_ref == "3JN 1:15" - assert errors[0].actual_verse_ref == "3JN 0:0" - - -def test_get_usfm_versification_errors_extra_verse(): - env = _TestEnvironment( - files={ - "653JNTest.SFM": r"""\id 3JN - \c 1 - \v 1 - \v 2 - \v 3 - \v 4 - \v 5 - \v 6 - \v 7 - \v 8 - \v 9 - \v 10 - \v 11 - \v 12 - \v 13 - \v 14 - \v 15 - \v 16 - """ - } - ) - errors = env.get_usfm_versification_errors() - assert len(errors) == 1 - assert errors[0].type == UsfmVersificationErrorType.EXTRA_VERSE - assert errors[0].expected_verse_ref == "" - assert errors[0].actual_verse_ref == "3JN 1:16" - - -def test_get_usfm_versification_errors_invalid_verse(): - env = _TestEnvironment( - files={ - "653JNTest.SFM": r"""\id 3JN - \c 1 - \v 1 - \v 2 - \v 3 - \v 4 - \v 5 - \v 6 - \v 7 - \v 8 - \v 9 - \v 10 - \v 11 - \v 13-12 - \v 14 - \v 15 - """ - } - ) - errors = env.get_usfm_versification_errors() - assert len(errors) == 1 - assert errors[0].type == UsfmVersificationErrorType.INVALID_VERSE_RANGE - assert errors[0].expected_verse_ref == "3JN 1:12-13" - assert errors[0].actual_verse_ref == "3JN 1:13-12" - - -def test_get_usfm_versification_errors_extra_verse_segment(): - env = _TestEnvironment( - files={ - "653JNTest.SFM": r"""\id 3JN - \c 1 - \v 1 - \v 2 - \v 3 - \v 4 - \v 5 - \v 6 - \v 7 - \v 8 - \v 9 - \v 10 - \v 11 - \v 12 - \v 13 - \v 14a - \v 14b - \v 15 - """ - } - ) - errors = env.get_usfm_versification_errors() - assert len(errors) == 2 - assert errors[0].type == UsfmVersificationErrorType.EXTRA_VERSE_SEGMENT - assert errors[0].expected_verse_ref == "3JN 1:14" - assert errors[0].actual_verse_ref == "3JN 1:14a" - - -def test_get_usfm_versification_errors_missing_verse_segments(): - env = _TestEnvironment( - files={ - "653JNTest.SFM": r"""\id 3JN - \c 1 - \v 1 - \v 2 - \v 3 - \v 4 - \v 5 - \v 6 - \v 7 - \v 8 - \v 9 - \v 10 - \v 11 - \v 12 - \v 13 - \v 14 - \v 15 - """ - }, - settings=DefaultParatextProjectSettings(versification=get_custom_versification(r"*3JN 1:13,a,b")), - ) - errors = env.get_usfm_versification_errors() - assert len(errors) == 1 - assert errors[0].type == UsfmVersificationErrorType.MISSING_VERSE_SEGMENT - assert errors[0].expected_verse_ref == "3JN 1:13a" - assert errors[0].actual_verse_ref == "3JN 1:13" - - -def test_get_usfm_versification_errors_ignore_noncanonicals(): - env = _TestEnvironment( - files={ - "98XXETest.SFM": r"""\id XXE - \c 1 - \v 3-2 - """ - } - ) - assert len(env.get_usfm_versification_errors()) == 0 - - -def test_get_usfm_versification_errors_excluded_in_custom_vrs(): - env = _TestEnvironment( - files={ - "653JNTest.SFM": r"""\id 3JN - \c 1 - \v 1 - \v 2 - \v 3 - \v 4 - \v 5 - \v 6 - \v 7 - \v 8 - \v 9 - \v 10 - \v 11 - \v 12 - \v 13 - \v 14 - \v 15 - """, - }, - settings=DefaultParatextProjectSettings(versification=get_custom_versification(r"-3JN 1:13")), - ) - errors = env.get_usfm_versification_errors() - assert len(errors) == 1 - assert errors[0].type == UsfmVersificationErrorType.EXTRA_VERSE - assert errors[0].expected_verse_ref == "" - assert errors[0].actual_verse_ref == "3JN 1:13" - - -def test_get_usfm_versification_errors_multiple_books(): - env = _TestEnvironment( - files={ - "642JNTest.SFM": r"""\id 2JN - \c 1 - \v 1 - \v 2 - \v 3 - \v 4 - \v 5 - \v 6 - \v 7 - \v 8 - \v 9 - \v 10 - \v 11 - \v 12 - """, - "653JNTest.SFM": r"""\id 3JN - \c 1 - \v 1 - \v 2 - \v 3 - \v 4 - \v 5 - \v 6 - \v 7 - \v 8 - \v 9 - \v 10 - \v 11 - \v 12 - \v 13 - \v 14 - \v 15 - """, - } - ) - errors = env.get_usfm_versification_errors() - assert len(errors) == 1 - assert errors[0].type == UsfmVersificationErrorType.MISSING_VERSE - assert errors[0].expected_verse_ref == "2JN 1:13" - assert errors[0].actual_verse_ref == "2JN 1:12" - - -def test_get_usfm_versification_errors_multiple_chapters(): - env = _TestEnvironment( - files={ - "642JNTest.SFM": r"""\id 2JN - \c 1 - \v 1 - \v 2 - \v 3 - \v 4 - \v 5 - \v 6 - \v 7 - \v 8 - \v 9 - \v 10 - \v 11 - \v 12 - \c 2 - \v 1 - """ - } - ) - errors = env.get_usfm_versification_errors() - assert len(errors) == 2 - assert errors[0].type == UsfmVersificationErrorType.MISSING_VERSE - assert errors[1].type == UsfmVersificationErrorType.EXTRA_VERSE - assert errors[0].expected_verse_ref == "2JN 1:13" - assert errors[0].actual_verse_ref == "2JN 1:12" - assert errors[1].expected_verse_ref == "" - assert errors[1].actual_verse_ref == "2JN 2:1" - - -def test_get_usfm_versification_errors_invalid_chapter_number(): - env = _TestEnvironment( - files={ - "653JNTest.SFM": r"""\id 3JN - \c 1. - """ - } - ) - errors = env.get_usfm_versification_errors() - assert len(errors) == 2 - assert errors[0].type == UsfmVersificationErrorType.INVALID_CHAPTER_NUMBER - assert errors[1].type == UsfmVersificationErrorType.MISSING_CHAPTER - assert errors[0].expected_verse_ref == "" - assert errors[0].actual_verse_ref == "3JN 1." - assert errors[1].expected_verse_ref == "3JN 1:15" - assert errors[1].actual_verse_ref == "3JN -1:0" - - -def test_get_usfm_versification_errors_invalid_verse_number(): - env = _TestEnvironment( - files={ - "653JNTest.SFM": r"""\id 3JN - \c 1 - \v v1 - """ - } - ) - errors = env.get_usfm_versification_errors() - assert len(errors) == 2 - assert errors[0].type == UsfmVersificationErrorType.INVALID_VERSE_NUMBER - assert errors[1].type == UsfmVersificationErrorType.MISSING_VERSE - assert errors[0].expected_verse_ref == "" - assert errors[0].actual_verse_ref == "3JN 1:v1" - assert errors[1].expected_verse_ref == "3JN 1:15" - assert errors[1].actual_verse_ref == "3JN 1:0" - - -class _TestEnvironment: - def __init__(self, settings: Optional[ParatextProjectSettings] = None, files: Optional[Dict[str, str]] = None): - self._settings = settings - self._files = files or {} - self.detector = MemoryParatextProjectVersificationErrorDetector(settings, self._files) - - def get_usfm_versification_errors(self) -> List[UsfmVersificationError]: - return self.detector.get_usfm_versification_errors() - - -def get_custom_versification( - custom_vrs_contents: str, base_versification: Optional[Versification] = None -) -> Versification: - stream = StringIO(custom_vrs_contents) - versification = base_versification or Versification("custom", "vers.txt", ORIGINAL_VERSIFICATION) - versification = Versification.parse(stream, "vers.txt", versification, "custom") - return versification diff --git a/tests/scripture/test_versification.py b/tests/scripture/test_versification.py index 0a42e80b..6ab530ae 100644 --- a/tests/scripture/test_versification.py +++ b/tests/scripture/test_versification.py @@ -109,6 +109,12 @@ def test_all_included_verses() -> None: assert len(russian_orthodox_verses) == 37280 assert russian_orthodox_verses[-1].bbbcccvvv == 83001015 + original_verses_genesis = list(ORIGINAL_VERSIFICATION.all_included_verses({1: None})) + assert len(original_verses_genesis) == 1533 + + original_verses_genesis_chapter_one = list(ORIGINAL_VERSIFICATION.all_included_verses({1: {1}})) + assert len(original_verses_genesis_chapter_one) == 31 + def test_has_cross_book_mappings() -> None: assert not ORIGINAL_VERSIFICATION.has_cross_book_mappings() diff --git a/tests/testutils/memory_paratext_project_versification_error_detector.py b/tests/testutils/memory_usfm_versification_analyzer.py similarity index 65% rename from tests/testutils/memory_paratext_project_versification_error_detector.py rename to tests/testutils/memory_usfm_versification_analyzer.py index 3a710169..386fa1d2 100644 --- a/tests/testutils/memory_paratext_project_versification_error_detector.py +++ b/tests/testutils/memory_usfm_versification_analyzer.py @@ -1,10 +1,10 @@ from typing import Dict, Optional -from machine.corpora import ParatextProjectSettings, ParatextProjectVersificationErrorDetectorBase +from machine.corpora import ParatextProjectSettings, UsfmVersificationAnalyzerBase from .memory_paratext_project_file_handler import DefaultParatextProjectSettings, MemoryParatextProjectFileHandler -class MemoryParatextProjectVersificationErrorDetector(ParatextProjectVersificationErrorDetectorBase): +class MemoryUsfmVersificationAnalyzer(UsfmVersificationAnalyzerBase): def __init__(self, settings: Optional[ParatextProjectSettings], files: Optional[Dict[str, str]]) -> None: super().__init__(MemoryParatextProjectFileHandler(files), settings or DefaultParatextProjectSettings())