Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.1.0
Fix Version/s: None
Component/s: None
Labels: None
Description
This sub-task is broader than CDK-1008: the idea is to cache View/Dataset instances for longer than the lifetime of the Context returned by getContext(), since those Contexts are short-lived.
In a busy environment there could be many Oozie coordinators using Kite datasets. For each coordinator action materialization, Oozie polls for the datasets referenced by the input events declared by the coordinator. Each poll results in a call to KiteURIHandler.exists(), which calls Datasets.load() to load the dataset before it can call isReady() to check for the ready signal.
The frequent calls to Datasets.load() could take a toll on the Oozie server, the NameNode, the DataNodes hosting the metadata, and the Hive Metastore. The idea here is to add another layer of caching to reduce the number of Datasets.load() invocations.
Caching Dataset instances in the context of KiteURIHandler should be relatively safe, I think, since isReady() is the only method being invoked and it doesn't depend on having the latest instance of the Dataset metadata. For example, if a schema change is committed to the dataset descriptor but Oozie is still using an old instance of the Dataset with the old schema, it is not really a problem, as isReady() doesn't care about the schema.
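To make the proposal concrete, here is a minimal sketch of what such a caching layer might look like. It is only an illustration under stated assumptions: the class name ExpiringCache, the time-to-live policy, and the injected clock are all hypothetical and are not part of Kite or Oozie. The loader stands in for the call to Datasets.load(), and the cache key stands in for the dataset URI.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/**
 * Hypothetical time-to-live cache (not a Kite or Oozie class).
 * A cached value is reused until it is older than ttlMillis,
 * after which the loader is invoked again on the next get().
 */
final class ExpiringCache<K, V> {

  /** A cached value together with the time it was loaded. */
  private static final class Entry<V> {
    final V value;
    final long loadedAtMillis;

    Entry(V value, long loadedAtMillis) {
      this.value = value;
      this.loadedAtMillis = loadedAtMillis;
    }
  }

  private final ConcurrentMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
  private final Function<K, V> loader;  // assumption: would wrap Datasets.load()
  private final long ttlMillis;
  private final LongSupplier clock;     // injectable so expiry can be tested

  ExpiringCache(Function<K, V> loader, long ttlMillis, LongSupplier clock) {
    this.loader = loader;
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /** Returns the cached value, reloading it if it is missing or expired. */
  V get(K key) {
    long now = clock.getAsLong();
    // compute() is atomic per key, so concurrent polls for the same key
    // trigger at most one reload at a time.
    Entry<V> entry = entries.compute(key, (k, old) ->
        (old == null || now - old.loadedAtMillis >= ttlMillis)
            ? new Entry<>(loader.apply(k), now)
            : old);
    return entry.value;
  }

  /** Drops a cached value, for example after a failed readiness check. */
  void invalidate(K key) {
    entries.remove(key);
  }
}
```

In this sketch, KiteURIHandler.exists() would ask the cache for the dataset instead of calling Datasets.load() directly, then call isReady() on the returned instance. A time-to-live bounds how stale a cached Dataset can get, which fits the argument above that isReady() does not need current metadata; the right expiry value and whether eviction by size is also needed are open questions.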