fixes issues with data extraction #245

Merged
tomjemmett merged 3 commits into main from fix_data_extraction on May 1, 2026

@tomjemmett
Member

  • Files were being created under inputs-data/inputs-data/dev; the container name is no longer duplicated in the path.
  • Previous files were not being removed, which could cause issues when an old file was not overwritten by the new upload.
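
The second point can be illustrated with a minimal sketch that models the storage container as an in-memory dict. The `upload` helper, paths and file names are hypothetical and not taken from the pipeline; the sketch only shows that an overwrite-only upload leaves behind any file the new extract does not produce.

```python
def upload(store: dict[str, bytes], prefix: str, files: dict[str, bytes], clear_first: bool) -> None:
    """Write files under prefix, optionally deleting what is already there."""
    if clear_first:
        # Remove everything from the previous run before writing new files.
        for key in [k for k in store if k.startswith(prefix + "/")]:
            del store[key]
    for name, data in files.items():
        store[f"{prefix}/{name}"] = data


# Hypothetical previous extract: two files under dev/
old = {"dev/a.parquet": b"old", "dev/b.parquet": b"old"}

# The new extract only produces a.parquet.
overwrite_only = dict(old)
upload(overwrite_only, "dev", {"a.parquet": b"new"}, clear_first=False)

cleared = dict(old)
upload(cleared, "dev", {"a.parquet": b"new"}, clear_first=True)

print(sorted(overwrite_only))  # the stale dev/b.parquet is still present
print(sorted(cleared))         # only the newly uploaded file remains
```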

Copilot AI review requested due to automatic review settings April 27, 2026 13:03
@tomjemmett tomjemmett requested a review from a team as a code owner April 27, 2026 13:03
Contributor

Copilot AI left a comment

Pull request overview

Fixes issues in the data extraction upload process to prevent incorrect container path duplication and to ensure old extracted files are removed before uploading new ones.

Changes:

  • Switch blob uploads to use a container URL directly (via ContainerClient.from_container_url) to avoid inputs-data/inputs-data/... style duplication.
  • Add a pre-upload delete of the target extract_version directory using the ADLS Gen2 (Data Lake) API.
  • Add azure-storage-file-datalake as a dependency (and lock it in uv.lock).
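
As a rough sketch of how the first two changes might fit together (this is not the code in src/nhp/data/extract_data.py): the function names `replace_directory` and `make_clients`, and the parameters `dfs_account_url` and `file_system_name`, are assumptions made for illustration. The Azure calls used (`ContainerClient.from_container_url`, `upload_blob`, `DataLakeServiceClient.get_file_system_client`, `get_directory_client`, `exists`, `delete_directory`) come from the azure-storage-blob and azure-storage-file-datalake packages.

```python
from pathlib import Path


def replace_directory(container, file_system, local_dir: Path, remote_dir: str) -> list[str]:
    """Delete remote_dir if it exists, then upload every file under local_dir."""
    # Remove the previous extract through the Data Lake client so that
    # files missing from the new extract do not linger.
    directory = file_system.get_directory_client(remote_dir)
    if directory.exists():
        directory.delete_directory()

    uploaded = []
    for path in sorted(p for p in local_dir.rglob("*") if p.is_file()):
        # Blob names are relative to the container; the container name
        # itself is never part of the blob name.
        blob_name = f"{remote_dir}/{path.relative_to(local_dir).as_posix()}"
        with path.open("rb") as fh:
            container.upload_blob(name=blob_name, data=fh, overwrite=True)
        uploaded.append(blob_name)
    return uploaded


def make_clients(container_url: str, dfs_account_url: str, file_system_name: str, credential):
    """Build the two clients (hypothetical helper; arguments are assumptions)."""
    # Imported lazily so replace_directory can be exercised without the Azure SDK.
    from azure.storage.blob import ContainerClient
    from azure.storage.filedatalake import DataLakeServiceClient

    # The container URL already identifies the container, so no separate
    # container name is supplied here.
    container = ContainerClient.from_container_url(container_url, credential=credential)
    service = DataLakeServiceClient(dfs_account_url, credential=credential)
    return container, service.get_file_system_client(file_system_name)
```

Passing the clients in as arguments keeps the delete-then-upload logic separate from authentication, which also makes it straightforward to test with stand-in objects.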

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/nhp/data/extract_data.py | Refactors upload client creation and adds deletion of prior extracted directories before upload. |
| pyproject.toml | Adds azure-storage-file-datalake dependency needed for directory deletion. |
| uv.lock | Locks the new dependency and associated wheel metadata. |

Comment thread src/nhp/data/extract_data.py Outdated
Comment thread src/nhp/data/extract_data.py Outdated
Comment thread src/nhp/data/extract_data.py Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@tomjemmett tomjemmett force-pushed the fix_data_extraction branch from b3dae3f to 11e8806 Compare April 28, 2026 08:18
@tomjemmett tomjemmett linked an issue Apr 28, 2026 that may be closed by this pull request
@tomjemmett tomjemmett force-pushed the fix_data_extraction branch from 11e8806 to f35b8d3 Compare April 29, 2026 15:37
Member

@yiwen-h yiwen-h left a comment

Thank you! The update-sas-tokens.ps1 script is also very useful. I have updated our guidance for running the data pipeline accordingly.

@tomjemmett tomjemmett merged commit c959c5f into main May 1, 2026
3 checks passed
@tomjemmett tomjemmett deleted the fix_data_extraction branch May 1, 2026 08:25

Development

Successfully merging this pull request may close these issues.

Bug with where files are being written to in new pipeline

3 participants