Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 0.2.0
- Fix Version/s: 0.3.0
- Component/s: None
- Labels: None
Description
Got the following error when running 8 concurrent MapReduce jobs on a Parquet table:
2016-03-17 10:15:02,130 INFO [main] com.cloudera.recordservice.core.ThriftUtils: Connecting to RecordServiceWorker at vd0220.halxg.cloudera.com:13050, with timeout: 10000ms
2016-03-17 10:15:02,130 INFO [main] com.cloudera.recordservice.core.ThriftUtils: Connected to RecordServiceWorker at vd0220.halxg.cloudera.com:13050
2016-03-17 10:15:12,141 WARN [main] com.cloudera.recordservice.core.RecordServiceWorkerClient: Could not get service protocol version from RecordServiceWorker at vd0220.halxg.cloudera.com:13050. com.cloudera.recordservice.shade.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
2016-03-17 10:15:12,213 INFO [main] com.cloudera.recordservice.core.RecordServiceWorkerClient: Closing RecordServiceWorker task: TUniqueId(hi:8666625902207997972, lo:8192009521421407619)
2016-03-17 10:15:16,168 INFO [main] com.cloudera.recordservice.core.RecordServiceWorkerClient: Closing RecordServiceWorker connection.
2016-03-17 10:15:16,168 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2016-03-17 10:15:16,182 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
2016-03-17 10:15:16,239 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:impala (auth:SIMPLE) cause:java.io.IOException: Could not get service protocol version from RecordServiceWorker at vd0220.halxg.cloudera.com:13050.
2016-03-17 10:15:16,240 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: Could not get service protocol version from RecordServiceWorker at vd0220.halxg.cloudera.com:13050.
at com.cloudera.recordservice.core.RecordServiceWorkerClient.connect(RecordServiceWorkerClient.java:482)
at com.cloudera.recordservice.core.RecordServiceWorkerClient.access$1200(RecordServiceWorkerClient.java:45)
at com.cloudera.recordservice.core.RecordServiceWorkerClient$Builder.connect(RecordServiceWorkerClient.java:234)
at com.cloudera.recordservice.mr.RecordReaderCore.<init>(RecordReaderCore.java:68)
at com.cloudera.recordservice.mapreduce.RecordServiceInputFormatBase$RecordReaderBase.initialize(RecordServiceInputFormatBase.java:94)
at com.cloudera.recordservice.mapreduce.RecordServiceInputFormat$RecordServiceRecordReader.initialize(RecordServiceInputFormat.java:107)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: com.cloudera.recordservice.shade.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
at com.cloudera.recordservice.shade.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
at com.cloudera.recordservice.shade.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at com.cloudera.recordservice.shade.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at com.cloudera.recordservice.shade.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at com.cloudera.recordservice.shade.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at com.cloudera.recordservice.shade.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at com.cloudera.recordservice.thrift.RecordServiceWorker$Client.recv_GetProtocolVersion(RecordServiceWorker.java:103)
at com.cloudera.recordservice.thrift.RecordServiceWorker$Client.GetProtocolVersion(RecordServiceWorker.java:91)
at com.cloudera.recordservice.core.RecordServiceWorkerClient.connect(RecordServiceWorkerClient.java:441)
... 13 more
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at com.cloudera.recordservice.shade.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
... 21 more
This can be fixed either by increasing recordservice.worker.connection.timeoutMs to 60 seconds or by setting the RPC timeout before sending the GetProtocolVersion request.
Both changes work because, until the RPC timeout is set, the client uses the connection timeout as the RPC timeout (10000ms in the log above, which matches the 10 seconds between connecting and the read timeout). Getting the protocol version is also an RPC, so the RPC timeout should be set before this request as well.
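To illustrate the distinction, here is a minimal sketch using plain java.net sockets rather than the actual RecordService or Thrift client code. The class RpcTimeoutSketch, its methods, and the slow worker are hypothetical stand-ins; the only point is that the read timeout is set separately from the connect timeout before the first request is read.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only; not the RecordServiceWorkerClient implementation.
class RpcTimeoutSketch {

  // Connects with connectTimeoutMs, then applies a separate read (RPC) timeout
  // before waiting for the first reply. Returns true if a reply byte arrives in time.
  static boolean firstReplyArrives(int port, int connectTimeoutMs, int rpcTimeoutMs)
      throws IOException {
    try (Socket socket = new Socket()) {
      socket.connect(
          new InetSocketAddress(InetAddress.getLoopbackAddress(), port), connectTimeoutMs);
      // Without this call the read below would not use the RPC timeout at all.
      socket.setSoTimeout(rpcTimeoutMs);
      try {
        return socket.getInputStream().read() != -1;
      } catch (SocketTimeoutException e) {
        return false; // the "Read timed out" case from the stack trace
      }
    }
  }

  // Starts a stand-in worker that answers each connection only after delayMs,
  // imitating a worker that is busy serving concurrent jobs.
  static ServerSocket startSlowWorker(int delayMs) throws IOException {
    ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
    Thread acceptor = new Thread(() -> {
      while (!server.isClosed()) {
        try (Socket client = server.accept()) {
          Thread.sleep(delayMs);
          client.getOutputStream().write(1);
          client.getOutputStream().flush();
        } catch (IOException e) {
          // Client gave up or the server was closed; the loop condition decides.
        } catch (InterruptedException e) {
          return;
        }
      }
    });
    acceptor.setDaemon(true);
    acceptor.start();
    return server;
  }

  public static void main(String[] args) throws IOException {
    ServerSocket worker = startSlowWorker(300);
    int port = worker.getLocalPort();
    System.out.println("generous RPC timeout: " + firstReplyArrives(port, 1000, 3000));
    System.out.println("short RPC timeout:    " + firstReplyArrives(port, 1000, 50));
    worker.close();
  }
}
```

In this sketch the connect timeout is the same in both calls; only the read timeout decides whether the slow reply is received.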
In addition, when a read timeout error occurs, the log should include more useful suggestions, e.g. asking users to increase the RPC timeout.
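One possible shape for such a message, as a rough sketch: the helper below walks the cause chain and appends a hint only when a SocketTimeoutException is found. The class TimeoutHint and its method names are hypothetical, and the wording of the hint is an assumption, not the message that was actually committed.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical helper; names and message wording are illustrative assumptions.
class TimeoutHint {

  // Returns true if any exception in the cause chain is a socket read timeout.
  static boolean hasReadTimeout(Throwable t) {
    for (Throwable c = t; c != null; c = c.getCause()) {
      if (c instanceof SocketTimeoutException) {
        return true;
      }
    }
    return false;
  }

  // Wraps the failure in an IOException whose message tells the user what to change.
  static IOException withHint(String worker, int rpcTimeoutMs, Exception cause) {
    StringBuilder msg = new StringBuilder(
        "Could not get service protocol version from RecordServiceWorker at " + worker + ".");
    if (hasReadTimeout(cause)) {
      msg.append(" The read timed out after ").append(rpcTimeoutMs)
         .append("ms; consider increasing the RPC timeout.");
    }
    return new IOException(msg.toString(), cause);
  }

  public static void main(String[] args) {
    Exception failure = new RuntimeException(new SocketTimeoutException("Read timed out"));
    System.out.println(withHint("worker-host:13050", 10000, failure).getMessage());
  }
}
```

The original exception is kept as the cause, so the full stack trace is still available alongside the suggestion.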