I've been updating the Hive/HCatalog integration, which includes code that uses a private field on the HCatalogMetadataProvider to share the location of a Dataset. There are two cases:
- If the table data is managed by Hive, then the FileSystemDatasetRepository calls getDataDirectory() to find where Hive put the data. The returned value is only valid after the Descriptor has been loaded and before the next one is loaded.
- If the data is managed by a FileSystemDatasetRepository, then when a data set is created, the repo calls setDataDirectory to give the MetadataProvider the location it should register with Hive.
This approach has a few problems and bugs, so I'm refactoring to avoid the in/out parameter. It appears to have been done this way because the DatasetDescriptor doesn't carry a location for the data set. The reasoning was probably that not all data sets have a location in DFS or the local FS, but I think that concern is outweighed by the problems the current approach is causing.
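To illustrate why the shared field is fragile, here is a minimal sketch of the pattern. The class and method names (StatefulProviderSketch, register, load) are hypothetical stand-ins and not the actual HCatalogMetadataProvider code; only getDataDirectory mirrors a name mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a provider that shares a location through a field.
final class StatefulProviderSketch {
  // Assumed for illustration: table name -> location as reported by Hive.
  private final Map<String, String> tableLocations = new HashMap<>();

  // The shared in/out field: overwritten as a side effect of every load.
  private String dataDirectory;

  void register(String name, String location) {
    tableLocations.put(name, location);
  }

  // Loading a descriptor silently replaces the previously shared location.
  String load(String name) {
    dataDirectory = tableLocations.get(name);
    return name; // stands in for the loaded descriptor
  }

  // Only meaningful for the most recently loaded descriptor.
  String getDataDirectory() {
    return dataDirectory;
  }
}
```

In this sketch, calling load("a") and then load("b") leaves getDataDirectory() pointing at the location of "b", so a caller that still holds the descriptor for "a" reads the wrong directory.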
I propose adding a String location property to the DatasetDescriptor and an accessor, String getLocation(). This string will optionally contain a URI-formatted location for the data set, like "hdfs:/path/to/data".
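A rough sketch of what the proposed property could look like follows. DescriptorSketch is a hypothetical, simplified stand-in for DatasetDescriptor (the schema field is just a placeholder for the existing contents), and validating the string with java.net.URI is my assumption, not part of the proposal.

```java
import java.net.URI;

// Hypothetical, simplified stand-in for DatasetDescriptor.
final class DescriptorSketch {
  private final String schema;   // placeholder for the existing descriptor contents
  private final String location; // proposed: URI-formatted location, or null if none

  DescriptorSketch(String schema, String location) {
    if (location != null) {
      // Assumption: reject malformed locations early.
      // URI.create throws IllegalArgumentException on invalid syntax.
      URI.create(location);
    }
    this.schema = schema;
    this.location = location;
  }

  String getSchema() {
    return schema;
  }

  // Returns the URI-formatted location, or null when the data set has none.
  String getLocation() {
    return location;
  }
}
```

Keeping the location optional (null here) preserves the case of data sets that have no location in DFS or the local FS.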
As part of this change, we can also fix a problem in the FileSystem implementation, where both the MetadataProvider and the DatasetRepository have a method to determine where the data lives (currently, the MP calls a static method in the repo impl). Because the descriptor will carry the location, only one of the MP or the repo should determine it. And because the location of the dataset must be managed by the MetadataProvider for Hive, it makes sense to do this in the MetadataProvider for the FS implementation as well. The FS repo will just open Datasets from the locations specified in the descriptors.
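The division of responsibility could look roughly like the sketch below. All types here (LocatedDescriptor, FsProviderSketch, FsRepoSketch) and the "root + name" layout are hypothetical and only illustrate the idea: the provider chooses the location, and the repo reads it back from the descriptor.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical descriptor that carries the location chosen by the provider.
final class LocatedDescriptor {
  private final String location;

  LocatedDescriptor(String location) {
    this.location = location;
  }

  String getLocation() {
    return location;
  }
}

// Hypothetical FS metadata provider: the only place that decides locations.
final class FsProviderSketch {
  private final String rootLocation; // e.g. "file:/tmp/repo", no trailing slash
  private final Map<String, LocatedDescriptor> descriptors = new HashMap<>();

  FsProviderSketch(String rootLocation) {
    this.rootLocation = rootLocation;
  }

  LocatedDescriptor create(String name) {
    // Assumed layout for illustration: one directory per data set under the root.
    LocatedDescriptor descriptor = new LocatedDescriptor(rootLocation + "/" + name);
    descriptors.put(name, descriptor);
    return descriptor;
  }

  LocatedDescriptor load(String name) {
    return descriptors.get(name);
  }
}

// Hypothetical FS repo: never computes a location, only uses the descriptor's.
final class FsRepoSketch {
  private final FsProviderSketch provider;

  FsRepoSketch(FsProviderSketch provider) {
    this.provider = provider;
  }

  URI dataLocation(String name) {
    return URI.create(provider.load(name).getLocation());
  }
}
```

With this shape there is no shared field to go stale: each descriptor carries its own location, and loading a second descriptor does not affect the first.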