Details
Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.0.0
Fix Version/s: None
Component/s: Morphlines Module
Labels: None
Description
When using the solrCell command in Morphlines, you can either use the autodetect parser or provide the class name of a specific parser to use. My understanding is that the latter is mainly used in special cases where the autodetect parser fails to correctly determine the file type (e.g., because the files have nonstandard extensions).
While it makes perfect sense that the autodetect parser would require an upstream detectMimeType command, I found that it was necessary to also use the detectMimeType command (or hardcode the MIME type header using setValues) when a single, specific parser (e.g., org.apache.tika.parser.pdf.PDFParser) was defined in the solrCell command.
Since the morphline names a specific parser in this case, I would expect there to be no reason to inspect the MIME type header, which the code inspects to determine which parser to use. In other words, I think that the doProcess method in org.kitesdk.morphline.solrcell.SolrCellBuilder should check whether a single specific parser was defined in the command, and if so, use it rather than calling out to the detectParser method to select one based on the MIME type.
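To make the proposal concrete, here is a minimal sketch of the selection logic I have in mind. It is not the actual Kite code: the class ParserSelectionSketch, the Parser interface (a stand-in for a Tika parser), the selectParser method, and the MIME-type lookup map are all hypothetical names introduced for illustration, and the real doProcess/detectParser internals may be structured differently.

```java
import java.util.List;
import java.util.Map;

/** Illustrative only; does not use or mirror the real SolrCellBuilder classes. */
public class ParserSelectionSketch {

  /** Hypothetical stand-in for a Tika parser; only its name matters here. */
  public interface Parser {
    String name();
  }

  /**
   * Picks the parser to run for one record.
   *
   * @param configuredParsers parsers listed in the solrCell command
   * @param parsersByMimeType hypothetical lookup used when detection is needed
   * @param mimeType          value of the MIME type header, or null if unset
   */
  public static Parser selectParser(List<Parser> configuredParsers,
                                    Map<String, Parser> parsersByMimeType,
                                    String mimeType) {
    // Proposed shortcut: exactly one parser configured, so the MIME type
    // header is irrelevant and does not need to be set upstream.
    if (configuredParsers.size() == 1) {
      return configuredParsers.get(0);
    }
    // Otherwise the MIME type is required to choose among the parsers.
    if (mimeType == null) {
      throw new IllegalStateException(
          "MIME type must be set upstream (e.g. via detectMimeType or setValues)");
    }
    // May return null if no configured parser supports this MIME type.
    return parsersByMimeType.get(mimeType);
  }
}
```

With this shape, a morphline that lists only a single parser such as org.apache.tika.parser.pdf.PDFParser would no longer need an upstream detectMimeType or setValues step, while multi-parser and autodetect configurations would behave as they do today.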
I realize that this is probably a rare corner case, but the behavior I found (i.e., having to detect the MIME type even though I specified which parser to use) was neither intuitive nor clearly documented. I recommend either changing the code as described or changing the documentation to explain that setting the MIME type is required even if you specify which parser to use.