
[GSoC] Train DeBERTa NER model for required phrase prediction #5137

Description

@Kaushik-Kumar-CEG

Continuation of #5077. Uses the BIOES dataset produced there.

Goal

Fine-tune DeBERTa-v3-large to predict required phrase spans in license rule text using BIOES labels.

Tasks

  • Subword tokenization with label alignment (-100 for continuation tokens)
  • Training with class-weighted loss (handle O vs B/I/E/S imbalance)
  • Include negative samples (rules without markers, all-O labels) for balanced training
  • Evaluation: token F1, exact span match
  • ONNX export for CPU inference
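
A rough sketch of two of the tasks above (label alignment and exact span match), in plain Python so it runs without a model. It assumes a single entity type with untyped tags `B`, `I`, `E`, `S`, `O`, and a `word_ids` list shaped like the one Hugging Face fast tokenizers return from `BatchEncoding.word_ids()` (`None` for special tokens). The helper names `align_labels`, `bioes_spans`, and `span_f1` are hypothetical and not taken from #5077.

```python
from typing import List, Optional, Set, Tuple

# -100 is the default ignore_index of PyTorch's CrossEntropyLoss,
# so positions carrying it do not contribute to the loss.
IGNORE_INDEX = -100


def align_labels(word_labels: List[int], word_ids: List[Optional[int]]) -> List[int]:
    """Give each word's label to its first subword and mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens ([CLS], [SEP], padding) belong to no word.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a new word keeps the word-level label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword of the same word is masked.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


def bioes_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Return inclusive (start, end) indices of well-formed B..E and S spans."""
    spans = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.add((i, i))
            start = None
        elif tag == "B":
            start = i
        elif tag == "E":
            if start is not None:
                spans.add((start, i))
            start = None
        elif tag == "O":
            # An O tag abandons any span that was opened but never closed.
            start = None
        # "I" leaves an open span open and is ignored otherwise.
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact span match F1: a predicted span counts only if both ends match."""
    gold_spans, pred_spans = bioes_spans(gold), bioes_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        # Also covers all-O negative samples; scoring those as 0.0 is a
        # simplification of this sketch, not a decision made in the issue.
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)
```

For example, `align_labels([1, 0, 4], [None, 0, 0, 1, 2, 2, 2, None])` keeps one label per word and masks the special tokens and continuation subwords. Scoring only the first subword is one common convention; labelling every subword is an alternative this sketch does not cover.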
