Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Incomplete
- Affects Version/s: CDH3u2
- Fix Version/s: None
- Component/s: HDFS
- Labels: None
- Environment: CentOS 5.5, JDK 1.6.0_21, CDH3u2, CDH2u2
Description
We have an older cluster of 40 nodes running CDH2 (Hadoop 0.20.1), and we need to DistCp data/files from this cluster over to a newer cluster running CDH3u2 (Hadoop 0.20.2). Both environments use CentOS 5.5 and JDK 1.6.0_21.
When we tried DistCp between these clusters (hadoop distcp hftp://<40-Node-Old-Cluster>:50070/<Folder>/ hdfs://<29-Node-New-Cluster>:8020/<Folder>/), the job fails after some time with a JT exception:
java.io.IOException: Copied: 24 Skipped: 0 Failed: 1
    at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:582)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Lower down, in the task attempt logs, I see this error:
2011-11-23 18:45:17,546 INFO org.apache.hadoop.tools.DistCp: FAIL cloud401.ws0-all-servers.2011-11-22-11.seq : java.io.IOException: File size not matched: copied 55836672 bytes (53.2m) to tmpfile (=hdfs://<29-Node-New-Cluster>:8020/<folder>/<file.name>) but expected 309896332 bytes (295.5m) from hftp://<40-Node-Old-Cluster>:50070/<folder>/<file.name>
We tried the same exercise between two clusters running the same Hadoop version (both CDH2, then both CDH3u2), and DistCp works perfectly fine on both occasions. DistCp fails only when the target is CDH3u2 and the source is CDH2 (we don't have other types of environments).
The original DistCp we tried was for about 2TB of data (and we need to do about 50TB total), and it failed. We tried a smaller 50GB test transfer and it succeeded, although several tasks failed with the same kind of errors; after the JT submitted retry attempts (a second or, in some cases, a third attempt), the job finished successfully. As the data volume increases, the job fails entirely.
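To narrow down which files were only partially copied, one approach is to compare file sizes between the source and destination listings. Below is a minimal sketch that diffs two saved `hadoop fs -ls` outputs; the field layout assumes the Hadoop 0.20-era `ls` format, and the sample lines and paths are hypothetical:

```python
# Compare file sizes between two saved `hadoop fs -ls` listings to
# find files that DistCp copied only partially.
# NOTE: sample lines and paths are hypothetical; the field positions
# assume the Hadoop 0.20-era `hadoop fs -ls` output format:
#   perms repl owner group size date time path

def parse_listing(lines):
    """Map file name -> size in bytes from `hadoop fs -ls` output lines."""
    sizes = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[0].startswith('d'):
            continue  # skip summary/header lines and directories
        name = fields[7].rsplit('/', 1)[-1]
        sizes[name] = int(fields[4])
    return sizes

def find_mismatches(src_lines, dst_lines):
    """Return {name: (src_size, dst_size)} for files whose sizes differ."""
    src = parse_listing(src_lines)
    dst = parse_listing(dst_lines)
    return {n: (s, dst[n]) for n, s in src.items()
            if n in dst and dst[n] != s}

if __name__ == "__main__":
    src = ["-rw-r--r-- 3 hdfs hdfs 309896332 2011-11-22 11:00 /folder/a.seq"]
    dst = ["-rw-r--r-- 3 hdfs hdfs  55836672 2011-11-23 18:45 /folder/a.seq"]
    print(find_mismatches(src, dst))  # {'a.seq': (309896332, 55836672)}
```

Running this over the full listings would show whether the failures always truncate at the same byte offset, which might help pinpoint where the hftp transfer breaks.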
Please let me know if you need any other details. Any help you can provide is greatly appreciated.
- Karthik