CDH (READ-ONLY) / DISTRO-659

Does Cloudera Search support file systems other than HDFS?

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: search-1.0.0
    • Fix Version/s: None
    • Component/s: Search
    • Labels:
      None
    • Environment:
      CDH 5.1.2 + search-1.0.0 + solr-4.4 + Intel Enterprise Edition for Lustre, version 2.2

      Description

      Hi,
      I am testing Cloudera components with the Lustre file system, following the Cloudera certification instructions.
      So far, Lustre works with CDH, HBase, Hive, Pig, Mahout, and Spark.

      Recently I encountered the following issues while testing Cloudera Search. They raise the same question I had for Impala (see https://issues.cloudera.org/browse/IMPALA-1404): does Cloudera Search support only HDFS?

      My reference for Cloudera Search is the "Cloudera Search User Guide" at http://www.cloudera.com/content/cloudera/en/documentation/cloudera-search/v1-latest/PDF/Cloudera-Search-User-Guide.pdf

      Issue 1:
      This happened when I created my first Solr collection with the following command:

      $ solrctl collection --create collection1 -s 1
      Error: A call to SolrCloud WEB APIs failed: HTTP/1.1 200 OK
      Server: Apache-Coyote/1.1
      Content-Type: application/xml;charset=UTF-8
      Transfer-Encoding: chunked
      Date: Wed, 29 Oct 2014 06:27:51 GMT
      
      <?xml version="1.0" encoding="UTF-8"?>
      
      <response>
      <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">4797</int>
      </lst>
      <lst name="failure">
      <str>org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error CREATEing SolrCore 'collection1_shard1_replica1': Unable to create core: collection1_shard1_replica1 Caused by: /lustre:/solr/collection1/core_node1/data/tlog</str>
      </lst>
      </response>
      

      The directory exists and is accessible to Solr, and I found nothing wrong in the Lustre logs.
      Then, after checking solrconfig.xml, I disabled the updateLog feature, and collection creation worked. But the Solr documentation says realtime-get currently relies on the update log feature. So, is disabling updateLog the right fix for this problem? Why can't the tlog directory be created?
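For reference, the workaround described above amounts to commenting out the updateLog element in solrconfig.xml. A sketch of what that looks like in a stock Solr 4.x config (the exact dir variable in your config may differ):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Commenting out updateLog disables the transaction log (tlog),
       which sidesteps the tlog-creation failure on Lustre. Note this
       also disables realtime-get and weakens SolrCloud recovery, so
       it is a workaround, not a fix. -->
  <!--
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  -->
</updateHandler>
```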

      Issue 2:
      I moved on to batch indexing with MapReduce, still with the updateLog feature disabled. This time I hit the following issue:

      hadoop --config /etc/hadoop/conf jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf --output-dir lustre:///user/solr/outdir --verbose --go-live --zk-host centos6-hadoop:2181/solr --collection collection3 lustre:/user/solr/indir
      ...
      1038 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Indexing 2 files using 2 real mappers into 2 reducers
      Error: org.kitesdk.morphline.api.MorphlineRuntimeException: java.lang.IllegalArgumentException: Host must not be null: lustre:/user/solr/indir/sample-statuses-20120906-141433.avro
      	at org.kitesdk.morphline.base.FaultTolerance.handleException(FaultTolerance.java:73)
      	at org.apache.solr.hadoop.morphline.MorphlineMapRunner.map(MorphlineMapRunner.java:213)
      	at org.apache.solr.hadoop.morphline.MorphlineMapper.map(MorphlineMapper.java:86)
      	at org.apache.solr.hadoop.morphline.MorphlineMapper.map(MorphlineMapper.java:54)
      	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
      	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
      	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
      	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:415)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
      	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
      Caused by: java.lang.IllegalArgumentException: Host must not be null: lustre:/user/solr/indir/sample-statuses-20120906-141433.avro
      	at org.apache.solr.hadoop.PathParts.<init>(PathParts.java:61)
      	at org.apache.solr.hadoop.morphline.MorphlineMapRunner.map(MorphlineMapRunner.java:185)
      	... 10 more
      

      This looks like a path resolution problem in Solr. Since Lustre is a parallel distributed file system, when we use it instead of HDFS the URI carries no hostname (lustre:///path, unlike hdfs://hostname:port/path).
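The failure can be reproduced with plain java.net.URI, which is roughly what the path handling in org.apache.solr.hadoop.PathParts builds on: a lustre:/// URI has an empty authority, so getHost() returns null, while an hdfs:// URI names a host. A minimal sketch (the class name and the hostname "namenode" are illustrative, not from Search):

```java
import java.net.URI;

// Demonstrates why PathParts rejects lustre:/// paths with
// "Host must not be null": the URI has no authority component.
public class HostCheckDemo {
    static String hostOf(String uri) {
        // getHost() is null when the URI carries no authority/host.
        return URI.create(uri).getHost();
    }

    public static void main(String[] args) {
        System.out.println(hostOf("hdfs://namenode:8020/user/solr/indir")); // "namenode"
        System.out.println(hostOf("lustre:///user/solr/indir"));            // null
    }
}
```

This suggests the check would need to tolerate a null host for filesystems that, like Lustre, are mounted locally on every node and need no hostname in the URI.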

      So, if Cloudera Search is run on Lustre, how can the above issues be fixed?

      Thanks in advance!


        People

        • Assignee: whoschek Wolfgang Hoschek
        • Reporter: emoly.liu liu ying
