SOLR-17949: Add Azure Blob Storage backup repository module#3750
SOLR-17949: Add Azure Blob Storage backup repository module#3750prateeksinghalgit wants to merge 8 commits into
Conversation
|
Quick update: following up on the dev@ thread. To clarify scope — although the diff is large, the actual implementation is centered in 8 main files and 8 test files; the rest are license header updates. Happy to split the PR into smaller logical chunks (core module / tests / docs) if that helps with review. Thanks everyone! |
|
Don’t have time to review but asked copilot for an opinion 😉 |
janhoy
left a comment
There was a problem hiding this comment.
Only immediate comment is naming. «Blob-repository» is too generic. Should contain word «azure»?
There was a problem hiding this comment.
Pull Request Overview
This PR implements Azure Blob Storage backup repository support for Apache Solr, enabling collections to be backed up to and restored from Microsoft Azure. The implementation follows established patterns from existing S3 and GCS modules, providing 4 authentication methods (Connection String, Account Key, SAS Token, Azure Identity), incremental backup support, and comprehensive documentation with 76 passing unit tests.
Reviewed Changes
Copilot reviewed 70 out of 70 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| backup-restore.adoc | Adds comprehensive documentation for BlobBackupRepository with authentication methods and configuration |
| BlobBackupRepository.java | Main repository implementation following Solr's BackupRepository interface |
| BlobStorageClient.java | Azure SDK wrapper providing blob storage operations |
| BlobOutputStream.java | Custom output stream for block-based blob uploads |
| BlobIndexInput.java | Lucene IndexInput implementation with page caching |
| Test files | 8 test classes with 76 passing tests covering all functionality |
| build.gradle | Module build configuration with Azure SDK dependencies |
| License files | License and notice files for Azure and Netty dependencies |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This commit adds support for backing up and restoring Solr collections to Azure Blob Storage with multiple authentication options. Features: - Full backup/restore functionality to Azure Blob Storage - Support for 4 authentication methods: * Connection String (for development) * Account Name + Key (for simple production) * SAS Token (recommended for production) * Azure Identity (Managed Identity, Service Principal, Azure CLI) - Incremental backup support with versioning - Data integrity verification (checksum validation) - Compatible with Azurite emulator for local testing - Comprehensive documentation and 76 passing unit tests Implementation: - 8 implementation files (1,606 LOC) - 8 test files (2,180 LOC) - All dependencies Apache 2.0 licensed - Follows Solr's backup repository patterns
32efbbd to
5462186
Compare
- Renamed module from blob-repository to azure-blob-repository - Renamed all classes from Blob* to AzureBlob* for clarity - Updated package from org.apache.solr.blob to org.apache.solr.azureblob - Added Azure SDK dependencies (azure-storage-blob, azure-identity) - Updated Solr Reference Guide with Azure Blob Storage documentation - Added .gitignore entries for Azurite test infrastructure All authentication methods tested successfully with real Azure Blob Storage: - Connection String authentication - Account Name + Key authentication - SAS Token authentication - Service Principal (Azure Identity) authentication Testing completed with 100% success rate on backup/restore operations.
5462186 to
47f456a
Compare
made the change to azure-blob-repository |
24daaaa to
e6d9a64
Compare
- Switch from Netty to OkHttp for better Security Manager compatibility - Use static shared HttpClient for better resource management - Fix licenses: msal4j and Azure SDK are MIT licensed - Add changelog entry - Add JFR permissions for Reactor
e6d9a64 to
f7adb05
Compare
|
I'll leave the rest of the review to others more proficient in Azure Blob than me (have never used it). I'd love to test it with a read AzBlob but I'll leave that to someone who already have an active account. I have a concern that the PR feels largely AI generated(?), lacking the care to details that we require for contributions. Have me excused @prateeksinghalgit if this is not correct, it was just a hunch I got while reviewing. I'd rather have a short well thought out README with useful advice for users than three pages of detailed step by step instructions for testing and developing the feature. For other reviewers to pick up where I left, consider in partifular the HTTP client choice, correct licensing and avoiding hard coded ports in tests. I have not done a complete review, not read the ref-guide part at all, just commented on things me and Copilot saw immediately. |
8e812b1 to
3545833
Compare
3545833 to
e18dd13
Compare
|
Hi all, I’ve pushed a set of updates based on the feedback so far:
Please let me know if anything else should be adjusted that would help move the review forward. Happy to make further changes. Thanks again for the thoughtful review and guidance so far! — Prateek |
e18dd13 to
b28e265
Compare
|
Hi All, I’ve fixed the recent CI failures:
Please let me know if anything else should be adjusted to help move the review forward. I’m happy to iterate further as needed. |
- Use Testcontainers (Azurite) for integration tests to avoid hardcoded ports and external dependencies - Disable Security Manager for Azure Blob tests to support Testcontainers (similar to extraction module) - Fix OkHttp compilation error by adding explicit testImplementation - Update documentation: BlobBackupRepository -> AzureBlobBackupRepository
b28e265 to
428ca18
Compare
3cef18e to
d015bbe
Compare
|
This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the dev@solr.apache.org mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution! |
|
Thanks @janhoy for adding more reviewers. @HoustonPutman and @psalagnac ,let me know if you have any questions related to the pr. |
|
Thanks for adding me as a reviewer. |
…eam merge Made-with: Cursor
psalagnac
left a comment
There was a problem hiding this comment.
Only reviewed the production code (skiped tests and other stuff).
Some general comments on the review:
- can you replace all unnecessary qualified class names by imports? (replacing
java.lang.StringbyString) This makes the code painful to read - there is a lot of shared logic between Azure backups and S3 backups. Lot of code is duplicated. I wonder if abstract classes should be introduced to factor out some logic
- we always upload files with a block list. At least for small files, can we keep things simples with a simple pull? It think we may skip the commit call.
- overall, I think it's pretty inefficient that
AzureBlobIndexInputalways reads files by small pages. Mostly for backup/restore, we will only fetch full files. You should not do a request by page, or use much bigger pages (without keeping them in memory).
Please note I did a code review, but I have no way to test it with Azure.
|
|
||
| String destPath = getBlobPath(dest); | ||
|
|
||
| String blobPath = destPath.endsWith("/") ? destPath + destFileName : destPath; |
There was a problem hiding this comment.
I think something is wrong here. We fully drop destFileName when destPath does not end with a /
| String accountKey, | ||
| String sasToken, | ||
| String tenantId, | ||
| String clientId, |
There was a problem hiding this comment.
clientId and clientSecret are unused. These parameters could be removed from this class.
There was a problem hiding this comment.
using them now to build, ClientSecretCredentialBuilder. Helps with users using service principal authentication method to do solr backup and restore to azure blob storage.
|
|
||
| try { | ||
| BlobClient blobClient = containerClient.getBlobClient(blobPath); | ||
| final long contentLength = blobClient.getProperties().getBlobSize(); |
There was a problem hiding this comment.
This is under optimized. We do two queries to open the stream. My understanding is invoking blobClient.getProperties() issues a HEAD query.
The result of blobClient.openInputStream() also contains the length. This may be retrieved with blobInputStream.getProperties().getBlobSize()
|
|
||
| try { | ||
| BlobClient blobClient = containerClient.getBlobClient(sanitizedDirPath); | ||
| blobClient.upload(new ByteArrayInputStream(new byte[0]), 0, true); |
There was a problem hiding this comment.
This is under optimized. We do two requests to create an empty object. Metadata should be set in the initial request.
Something like this should work:
BlobParallelUploadOptions options =
new BlobParallelUploadOptions(new ByteArrayInputStream(new byte[0]))
.setMetadata(metadata);
blobClient.uploadWithResponse(options, null, Context.NONE);
|
|
||
| private Collection<String> deleteBlobs(Collection<String> paths) throws AzureBlobException { | ||
| try { | ||
| return deleteBlobs(paths, DELETE_BATCH_SIZE); |
There was a problem hiding this comment.
DELETE_BATCH_SIZE has no impact (unused parameter). Does it come from S3 code?
| } | ||
|
|
||
| @Override | ||
| public void clear() { |
| } | ||
|
|
||
| AzureBlobIndexInput slice = | ||
| new AzureBlobIndexInput( |
There was a problem hiding this comment.
This seems broken to me. We pass the description as path. If we create a slice from a slice, there is a high change we end if fetching a bad object.
| new AzureBlobIndexInput( | ||
| getFullSliceDescription(sliceDescription), client, length, pageSize, MAX_CACHED_PAGES); | ||
|
|
||
| slice.position = 0L; |
There was a problem hiding this comment.
Position is already set to 0 at construction
| } | ||
|
|
||
| // Internal state for slices: base offset to add to all range requests | ||
| private long clientViewBaseOffset = 0L; |
There was a problem hiding this comment.
This should be immutable and passed to the constructor.
Also, I'm rather it is moved with other class members.
| public void seek(long pos) throws IOException { | ||
| ensureOpen(); | ||
| if (pos < 0 || pos > length) { | ||
| throw new IOException("Seek position out of bounds: " + pos); |
Adopt Azure SDK BOM 1.3.6 and Azurite 3.35.0 to enable the single-call pullStream pattern. Rewrite AzureBlobIndexInput on BufferedIndexInput with closed flag, overflow-safe slice, and proactive stream close. Fix DELETEBACKUP double-slash path. Cache BlobProperties in isDirectory. Document SecurityManager + Azure Identity limitation in the README. Add tests for clone independence, backward seek, slice, random access. Co-authored-by: Cursor <cursoragent@cursor.com>
Adopt Azure SDK BOM 1.3.6 and Azurite 3.35.0 to enable the single-call pullStream pattern. Rewrite AzureBlobIndexInput on BufferedIndexInput with closed flag, overflow-safe slice, and proactive stream close. Fix DELETEBACKUP double-slash path. Cache BlobProperties in isDirectory. Document SecurityManager + Azure Identity limitation in the README. Add tests for clone independence, backward seek, slice, random access. Co-authored-by: Cursor <cursoragent@cursor.com>
1923c1e to
f944f72
Compare
@psalagnac , thank you for taking the time to review.
|
Implements Azure Blob Storage backup repository as discussed in SOLR-17949.
Description
This PR adds a new backup repository implementation for Azure Blob Storage, enabling Solr collections to be backed up to and restored from Microsoft Azure.
Key Features:
Solution
The implementation follows Solr's
BackupRepositoryinterface pattern, similar to existing S3 and GCS repository modules:BackupRepositoryinterfaceIndexInputfor reading from Azure blobssolr.xmlAll streaming operations are compatible with Solr's
ResumableInputStreamfor fault-tolerant transfers.Implementation stats:
Tests
Unit Tests: 76/76 passing (100%)
./gradlew :solr:modules:blob-repository:test # Result: BUILD SUCCESSFUL - 76 test(s)Test Coverage:
Testing Instructions:
Can be tested locally with Azurite emulator (no Azure account needed) or with real Azure Blob Storage. See
solr/modules/blob-repository/README.mdfor detailed setup instructions.Checklist
Please review the following and check all that apply:
mainbranch../gradlew :solr:modules:blob-repository:check(module-specific check passed).