Based on the Impala team's recommendation:
Implement INVALIDATE on manual refresh, with the following requirements:
1. On a refresh request, programmatically check the HMS for each db to find tables that exist in the HMS (e.g. by issuing "SHOW TABLES IN <db>" through Hive) but not in Impala, and issue INVALIDATE METADATA calls for only those tables.
2. Have a warning message enabled by default
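The table comparison in requirement 1 can be sketched roughly as below. This is only an illustration, not the agreed implementation: the function names and the dict-of-lists input shape are hypothetical, and the table listings are assumed to have been fetched already (e.g. from SHOW TABLES output on each side). Names are lowercased before comparison on the assumption that both systems treat identifiers case-insensitively.

```python
def missing_in_impala(hms_tables, impala_tables):
    """Return {db: sorted tables that exist in HMS but are unknown to Impala}.

    Both arguments map a database name to an iterable of table names.
    (Hypothetical input shape; how the listings are fetched is out of scope.)
    """
    # Assumption: identifiers are case-insensitive, so compare in lowercase.
    known = {
        db.lower(): {t.lower() for t in tables}
        for db, tables in impala_tables.items()
    }
    missing = {}
    for db, tables in hms_tables.items():
        diff = {t.lower() for t in tables} - known.get(db.lower(), set())
        if diff:
            missing[db.lower()] = sorted(diff)
    return missing


def invalidate_statements(hms_tables, impala_tables):
    """Build one INVALIDATE METADATA statement per table missing from Impala."""
    missing = missing_in_impala(hms_tables, impala_tables)
    return [
        f"INVALIDATE METADATA {db}.{table}"
        for db in sorted(missing)
        for table in missing[db]
    ]
```

Tables already known to Impala produce no statement, which keeps the number of invalidations (and the later full metadata loads) as small as possible.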
Implement REFRESH, with the following requirements:
1. For "simple" external modifications, such as adding new files/partitions to a table that is already known to Impala, we recommend calling REFRESH.
2. If the table is unknown to Impala, or some radical change occurred in the table's metadata (e.g. HDFS rebalancing), it's best to run INVALIDATE METADATA.
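The two rules above amount to a small decision function. The sketch below is illustrative only: the function name and its boolean parameters are hypothetical, and deciding what counts as a "radical" change is left to the caller.

```python
def metadata_statement(db, table, known_to_impala, radical_change=False):
    """Pick the statement to issue for one table after an external change.

    known_to_impala: True if Impala already has the table in its catalog.
    radical_change:  True for changes such as HDFS rebalancing (caller decides).
    """
    if not known_to_impala or radical_change:
        # Unknown table or radical change: discard and lazily reload metadata.
        return f"INVALIDATE METADATA {db}.{table}"
    # Simple change (new files/partitions) on a known table.
    return f"REFRESH {db}.{table}"
```

For example, `metadata_statement("sales", "orders", known_to_impala=True)` yields a REFRESH statement, while the same call with `known_to_impala=False` yields INVALIDATE METADATA.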
Be aware of the tradeoffs with respect to the response time of these two operations:
If you invalidate a table that has a large number of partitions, the INVALIDATE METADATA statement itself may return quickly, but the next access to that table (e.g. a SELECT statement) will take a really long time because the user has to wait for the metadata to be fully loaded. So, in general, it is good practice to avoid running INVALIDATE METADATA unless you really have to.
REFRESH is a synchronous call, i.e. the client making the call blocks until the metadata for the table has been reloaded. REFRESH requires the table to be known to Impala, but it reuses cached metadata to speed up the reload.
INVALIDATE METADATA is an asynchronous call. It returns almost immediately for a table because all it does is mark the metadata as "invalid". However, the next time the table is accessed (e.g. from a SELECT statement), a full metadata load is triggered for that table, and the latency of that load is added to the statement's response time.
Can revisit once https://issues.cloudera.org/browse/IMPALA-2047 is done