[KITE-843] Add in-place compaction to the CLI - Cloudera Open Source

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.17.1
Fix Version/s: 1.1.0
Component/s: Command-line Interface
Labels: None

Description

There are a few alternatives for compaction (herringbone, filecrusher) that work approximately like the CLI does: they copy content in a MapReduce (MR) job and delete the old data, so the compacted files replace the originals in place. Kite can almost be used to do this, but the process requires copying to a different dataset with the copy command, removing the original data by hand, and copying the new files back.

First, I think we should update the delete command to work with view URIs and call deleteAll(), so users don't have to remove files by hand to do this with Kite.

Second, I think we should implement in-place compaction that creates a temporary dataset, runs a copy job, then deletes the source data and merges the temporary dataset (maybe add a replaceMerge() to do the work of delete and merge by partition). This would still leave the data in an inconsistent state for a short period of time, between the delete and the merge, but affected queries can be resubmitted.
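The sequence proposed above can be sketched on a local filesystem. This is only an illustration of the ordering (copy to a temporary location, delete the source, merge the temporary output back), not Kite code: the function name `compact_partition`, the output file name, and the use of plain files instead of a dataset and an MR job are all assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def compact_partition(partition_dir: Path, merged_name: str = "compacted.txt") -> Path:
    """Merge every file in partition_dir into one file, replacing the originals.

    Hypothetical local-filesystem stand-in for the proposed compaction steps.
    """
    sources = sorted(p for p in partition_dir.iterdir() if p.is_file())

    # Step 1: write the compacted copy into a temporary location
    # (stand-in for the temporary dataset and the copy job).
    temp_dir = Path(tempfile.mkdtemp(prefix="compact-", dir=partition_dir.parent))
    temp_file = temp_dir / merged_name
    with temp_file.open("wb") as out:
        for src in sources:
            with src.open("rb") as f:
                shutil.copyfileobj(f, out)

    # Step 2: delete the source data. From here until step 3 finishes,
    # a reader of partition_dir sees missing data.
    for src in sources:
        src.unlink()

    # Step 3: merge the temporary output back into the partition.
    final_path = partition_dir / merged_name
    shutil.move(str(temp_file), str(final_path))
    temp_dir.rmdir()
    return final_path
```

The gap between step 2 and step 3 is the short window described above in which a concurrent reader can see incomplete data.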

Last, I think we should integrate with Hive's locking mechanism so that we can do this safely. We can copy the data, lock the directory, replace the content, then unlock.
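The copy, lock, replace, unlock ordering can be illustrated as follows. This sketch does not use Hive's lock manager; it assumes a hypothetical marker-file lock (`directory_lock`) purely to show that only the short replace step needs to run under the lock, while the expensive copy happens beforehand into a staging directory.

```python
import os
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def directory_lock(directory: Path):
    """Hypothetical exclusive lock: a marker file created atomically next to the directory."""
    lock_path = directory.parent / (directory.name + ".lock")
    # O_EXCL makes creation fail with FileExistsError if the lock is already held.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield lock_path
    finally:
        os.close(fd)
        lock_path.unlink()


def replace_under_lock(directory: Path, staged: Path) -> None:
    """Swap the files in `directory` for the already-copied files in `staged`."""
    with directory_lock(directory):
        # Delete the old content, then move the staged content in.
        for old in list(directory.iterdir()):
            old.unlink()
        for new in list(staged.iterdir()):
            shutil.move(str(new), str(directory / new.name))
```

A reader that honors the same lock would wait (or retry) during the swap instead of seeing a partially replaced directory.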

Issue Links

depends on: KITE-1004 Update compaction to work with unpartitioned datasets (Resolved)
relates to: KITE-769 Command-line utility for file concatenation on the cluster (Resolved)

People

Assignee: Ryan Blue
Reporter: Ryan Blue

Votes: 0
Watchers: 3

Dates

Created: 11/Dec/14 11:02 PM
Updated: 28/May/15 10:04 PM
Resolved: 28/May/15 10:04 PM