[KITE-977] Don't Require MIME Type if a Parser is Explicitly Specified

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.0.0
Fix Version/s: None
Component/s: Morphlines Module
Labels: None

Description

When using the solrCell command in Morphlines, you can either use the autodetect parser or provide the class name of a specific parser to use. My understanding is that the latter is mainly used in special cases where the autodetect parser fails to correctly determine the file type (e.g., because the files have nonstandard extensions).

While it makes perfect sense that the autodetect parser would require an upstream detectMimeType command, I found that I also had to use detectMimeType (or hardcode the MIME type header using setValues) when the solrCell command defined a single, specific parser (e.g., org.apache.tika.parser.pdf.PDFParser).

Since the Morphline defines a specific parser in this case, I would expect there to be no reason to inspect the MIME type header, which the code inspects to determine which parser to use. In other words, I think that the doProcess method in org.kitesdk.morphline.solrcell.SolrCellBuilder should check whether a single specific parser was defined in the command and, if so, use it rather than calling the detectParser method to select one based on the MIME type.
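To make the proposal concrete, here is a minimal, self-contained sketch of the selection logic I have in mind. It is not the actual SolrCellBuilder code: the class ParserSelection, the method selectParser, and the detectByMimeType function are hypothetical stand-ins, and parsers are represented by class-name strings so the example runs without Kite or Tika on the classpath.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical stand-in for the parser selection step; not the Kite API.
final class ParserSelection {

    private ParserSelection() {}

    /**
     * Returns the parser to use for a record.
     *
     * @param configuredParsers parser class names listed in the command
     * @param mimeType          MIME type header of the record, or null if absent
     * @param detectByMimeType  assumed lookup from MIME type to parser name
     */
    static Optional<String> selectParser(List<String> configuredParsers,
                                         String mimeType,
                                         Function<String, String> detectByMimeType) {
        // Proposed shortcut: one explicit parser means there is nothing to choose.
        if (configuredParsers.size() == 1) {
            return Optional.of(configuredParsers.get(0));
        }
        // Otherwise the MIME type is still needed to pick among the candidates.
        if (mimeType == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(detectByMimeType.apply(mimeType));
    }

    public static void main(String[] args) {
        List<String> single =
                Collections.singletonList("org.apache.tika.parser.pdf.PDFParser");
        // No MIME type on the record, yet the explicit parser is still selected.
        System.out.println(selectParser(single, null, mime -> null));
    }
}
```

With this shape, a command that names exactly one parser would no longer depend on an upstream detectMimeType or setValues step, while commands that list several parsers would behave as they do today.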

I realize that this is probably a rare corner case, but the behavior I found (i.e., having to detect the MIME type even though I specified which parser to use) was neither intuitive nor clearly documented. I recommend either changing the code as described or changing the documentation to explain that setting the MIME type is required even if you specify which parser to use.

People

Assignee: Unassigned
Reporter: Tom Wheeler

Votes: 0
Watchers: 2

Dates

Created: 31/Mar/15 3:32 PM
Updated: 31/Mar/15 8:08 PM