It would be really nice to support correctly loading DatasetRepository instances from a URI.
I currently have code that uses the following format for URIs:
<storage component> can be one of the following:
- file:<path> where <path> is relative or absolute and indicates the root directory. It is not legal to have an authority in a file: storage component. It is also legal to specify this storage component using the null-authority form that is common in the wild, e.g. file://<path>, in which case the path must always be absolute.
- hdfs://<host>:<port>/<path> where <host> and <port> are required, and <path> designates the root directory. Path may not be relative.
- hive://<host>:<port>/<database> where <host>, <port>, and <database> are required. The authority (host+port) indicates the metastore server to connect to.
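
To make the rules above concrete, here is a minimal sketch of how a storage component could be validated with java.net.URI. It is not the code referred to in this issue: the class name StorageUriSketch, the describe method, and the returned description strings are all hypothetical, and the sketch only checks the shape of the URI without instantiating any repository.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper that checks a storage component against the rules listed above. */
final class StorageUriSketch {

    /** Returns a short description of the parsed component, or throws IllegalArgumentException. */
    static String describe(String storage) {
        URI uri;
        try {
            uri = new URI(storage);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URI: " + storage, e);
        }
        String scheme = uri.getScheme();
        if ("file".equals(scheme)) {
            return describeFile(uri);
        } else if ("hdfs".equals(scheme)) {
            requireHostAndPort(uri);
            // With an authority present the path is either empty or absolute.
            if (uri.getPath().isEmpty()) {
                throw new IllegalArgumentException("hdfs: requires a root directory path");
            }
            return "hdfs " + uri.getHost() + ":" + uri.getPort() + " root=" + uri.getPath();
        } else if ("hive".equals(scheme)) {
            requireHostAndPort(uri);
            // Strip the leading '/' to get the database name.
            String path = uri.getPath();
            String database = path.isEmpty() ? "" : path.substring(1);
            if (database.isEmpty() || database.contains("/")) {
                throw new IllegalArgumentException("hive: requires exactly one database name");
            }
            return "hive metastore=" + uri.getHost() + ":" + uri.getPort()
                    + " database=" + database;
        }
        throw new IllegalArgumentException("Unsupported storage component: " + storage);
    }

    private static String describeFile(URI uri) {
        if (uri.isOpaque()) {
            // file:relative/path has no leading '/', so java.net.URI treats it as opaque.
            return "file root=" + uri.getSchemeSpecificPart();
        }
        if (uri.getAuthority() != null) {
            throw new IllegalArgumentException("file: must not have an authority");
        }
        // Covers both file:/abs/path and the null-authority form file:///abs/path.
        return "file root=" + uri.getPath();
    }

    private static void requireHostAndPort(URI uri) {
        if (uri.getHost() == null || uri.getPort() == -1) {
            throw new IllegalArgumentException("<host> and <port> are required: " + uri);
        }
    }
}
```

Under these assumptions, file:data/repo, file:/tmp/repo, and file:///tmp/repo are all accepted, while file://somehost/tmp/repo is rejected for carrying an authority and hdfs://namenode/data is rejected for lacking a port. The host names and port numbers used here are illustrative only.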
The hive: storage component implementation is currently incomplete. I'll open another JIRA for that as an enhancement. The intention is to let DatasetRepository implementations continue to pick their datasets' paths. In the case of hive://, the thinking is that a dataset created with a specific location in its DatasetDescriptor will function as an external table. All others will be "internal" or "normal" Hive tables. This is entirely independent of CDK-139 and does not conflict with it.
All of this is done outside of the existing code, as a thin layer on top of it that simply instantiates things correctly.
URIs are used rather than URLs because these identifiers are opaque locations and not necessarily singular resources, per the RFC.
RFC 2396 (URIs) - http://www.ietf.org/rfc/rfc2396.txt