Compaction deletes non-shared files#6060
Conversation
keith-turner
left a comment
There was a problem hiding this comment.
Will also need to modify the split code.
| // File is not shared - try to delete directly | ||
| try { | ||
| Path filePath = new Path(file.getMetadataPath()); | ||
| boolean deleted = ctx.getVolumeManager().deleteRecursively(filePath); |
There was a problem hiding this comment.
Deleting files here is not correct because the conditional mutation to the metadata table has not yet been made. Only after successfully making the conditional mutation will we know what was shared or not. Need to do something like the following.
- Modify conditional mutation to check that the file values (this checks the new shared field) are the same for the tablet. This atomically ensures that what is thought to be shared is actually shared when the metadata table update happens. Currently the conditional mutation only checks the set of files, not their values.
- After successfully submitting the condtional mutation look at the table metadata and the list of files compacting (from commitData) . Compute a set of files that is in commitData.inputPaths and is marked as not shared in the tablet metadata. If a file is not in tablet metadata, add it to the set because its shared status is unknown (this can happen because of failure and retry). This is the set of compacting files that are known for sure to not be shared, in the case of failure and retry this set should be empty.
- If the conditional mutation was not successful, this set above should be empty.
- Pass the set computed above to PutGcCandidates
- In PutGcCandidates delete non shared files and add gc delete markers for shared files. Doing the actual file deletes in PutGcCandidates makes reasoning about retries (caused by process dying in middle of fate step) much easier.
There is existing code for comparing the files values, but its not exactly what we need. Would probably need to create a new requireFiles(Map<StoredTabletFile, DataFileValue> files) method that uses code similar to this in its impl.
| } | ||
| } | ||
|
|
||
| markSourceFilesAsShared(srcTableId, tableId, context, bw); |
There was a problem hiding this comment.
Marking the source tablets as shared here has race conditions, these may not be the files in the clone. Also marking them as shared must be done using a conditional writer that only changes the marking if the file exists in the tablet w/ the same value (this check must be done by the conditional mutation).
The following conditions must be true for clone to avoid race conditions.
- Files are marked as shared before cloning.
- After copying the tablets file they must still exist in the source tablet.
Need to rework the code to do the following algorithm for each tablet.
- Before cloning mark all files in the source tablet as shared using a conditional mutation.
- Copy the source tablets files to the destination tablet.
- Read the source and desitation tablets and ensure the destitnations tablet has a subset of the sources files. If so this tablet was successfully cloned and we know the GC did not delete any files because the source still has them. Done with tablet and do not need to next step.
- There was a concurrent change w/ the source tablets files, possible that GC even deleted some of them. Delete the destination tablet and go back to step 1.
The current code does this w/o step 1. Also the current code batches tablet read/writes for efficiency. The code would probably be much easier to understand w/o the batching, but doing one tablet at a time w/o batching would be really slow. So need to modify the code work in step 1 and still do the batching.
|
@ArbaazKhan1 - looks like this has a conflict now. |
| time = Long.parseLong(ba[2]); | ||
| } else { | ||
| time = -1; | ||
| // Could be old format with time, or new format with shared |
There was a problem hiding this comment.
I wonder if we should always encode the time. If there are 3 parts, then the DataFileValue was written before this code and shared is false. If there are 4 parts, then it was written with the new code.
| @Override | ||
| public int hashCode() { | ||
| return Long.valueOf(size + numEntries).hashCode(); | ||
| return Long.valueOf(size + numEntries + (shared ? 1 : 0)).hashCode(); |
There was a problem hiding this comment.
We might want to change this to Objects.hash(size, numEntries, shared)
| "Conditional mutation to mark files as shared was rejected for tablet {}, " | ||
| + "tablet changed concurrently, clone will retry for this tablet", | ||
| srcTablet.getExtent()); |
There was a problem hiding this comment.
It might be better to return a boolean here for success/fail (see comment below). Could probably move the debug log statement down to the calling location too.
| try { | ||
| try (TabletsMetadata tabletsMetadata = createCloneScanner(null, srcTableId, context)) { | ||
| for (TabletMetadata tablet : tabletsMetadata) { | ||
| markTabletFilesAsShared(context, tablet); |
There was a problem hiding this comment.
If there is a failure here, then you probably want a continue to restart the loop. I don't think you want to continue on a failure.
| var tabletFilesMap = tablet.getFilesMap(); | ||
| for (StoredTabletFile file : ecm.getJobFiles()) { | ||
| DataFileValue dfv = tabletFilesMap.get(file); | ||
| if (dfv == null || !dfv.isShared()) { |
There was a problem hiding this comment.
I don't think dfv should not be null. If it is, that seems like a larger issue. I don't think we should delete these files.
| var tabletMutator = tabletsMutator.mutateTablet(getExtent()).requireAbsentOperation() | ||
| .requireCompaction(ecid).requireSame(tablet, LOCATION) | ||
| .requireFiles(commitData.getJobFiles()); | ||
| .requireFiles(tablet.getFilesMap()); |
There was a problem hiding this comment.
The code before this change was validating that the files that are part of the compaction are part of the tablet. I think that should always be true. This change is checking that the tablets files are the same as when this method started. This might not be true as a compaction could have occurred adding a new file.
|
|
||
| public PutGcCandidates(CompactionCommitData commitData, String refreshLocation) { | ||
| public PutGcCandidates(CompactionCommitData commitData, String refreshLocation, | ||
| ArrayList<StoredTabletFile> filesToDeleteViaGc) { |
There was a problem hiding this comment.
If I'm understanding correctly, the filesToDeleteViaGc list is the list of non-shared files that we want to delete directly, not via the Garbage Collector. If that's the case, then I would suggest renaming this variable in all places to avoid confusion.
| LOG.debug("Directly deleted non-shared compaction input file: {}", | ||
| jobFile.getFileName()); | ||
| } else { | ||
| LOG.debug("Non-shared file {} was already absent (likely retry), no GC marker needed", |
There was a problem hiding this comment.
I think there may be other cases where this returns false. I think you should add the file to the fallback set here to create the delete marker.
| // Compute which files appear in more than one tablet range | ||
| Set<String> sharedFiles; | ||
| try (LoadMappingIterator lmiPass1 = | ||
| BulkSerialize.getUpdatedLoadMapping(bulkDir.toString(), bulkInfo.tableId, fs::open)) { | ||
| sharedFiles = computeSharedFiles(fateId, lmiPass1); | ||
| } |
There was a problem hiding this comment.
I know why this is here, it's too bad that we have to read the underlying files twice.
|
I didn't look at the tests yet. You have covered compactions, split, and bulk import. For merge, I guess the resulting tablet just contains all of the file references from the merged tablets? I'm just trying to determine if there is any chance of a collision on files with different shared values and how that's handled. It might be useful to see if you can add a test for this case. |
Closes Issue #5387
Added shared marker to
DataFileValueto track files used by single or multiple tablets. Compactions now delete non-shared files directly, instead of creating GC markers. This Pr adds the following changes:DataFileValue