Using DistCp to copy files from the local file system to Hadoop HDFS

DistCp (distributed copy) is a tool generally used for large inter/intra-cluster copying in Hadoop.

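It can also be used to copy files from the local file system into HDFS. Because DistCp runs as a MapReduce job, the file:// source path must be readable on every node that runs a map task, for example on a single-node cluster or through a shared mount.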

To test this, I created a little over 3,000 files on my local file system.


My local file system: /home/rajesh/testfiles

rajesh@namenode1:~/testfiles$ ls -lrt |wc -l
3133
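
One detail about the count above: ls -l also prints a "total" line, so ls -lrt | wc -l reports one more line than there are entries. A minimal sketch of a count that skips that line is shown below. It assumes the files sit directly in one directory and that no file name contains a newline; the function name count_regular_files is only illustrative.

```shell
# Count regular files directly inside a directory (no recursion).
# Unlike `ls -l | wc -l`, this does not include the "total" header line.
count_regular_files() (
    dir=${1:-.}
    # tr strips the padding some wc implementations add around the number
    find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' '
)

# Example: count_regular_files /home/rajesh/testfiles
```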


HDFS directory (I have not created the target folder in HDFS):

rajesh@namenode1:~/testfiles$ hadoop fs -ls /user/rajesh
Found 5 items
drwx------   - rajesh hdfs          0 2016-07-18 06:59 /user/rajesh/.Trash
drwx------   - rajesh hdfs          0 2016-07-18 06:06 /user/rajesh/.staging
-rw-r--r--   3 rajesh hdfs     428959 2016-07-05 07:54 /user/rajesh/Hadoop_Tuning_Guide-Version5.pdf
drwxr-xr-x   - rajesh hdfs          0 2016-07-05 07:27 /user/rajesh/hive



Command to Copy:

hadoop distcp file:///home/rajesh/testfiles /user/rajesh
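
The command above can be wrapped in a small helper that checks the source directory first and turns it into an absolute file:// URI. This is only a sketch: the function name distcp_from_local and the DRY_RUN variable are hypothetical and not part of Hadoop, and the path is assumed to contain no spaces or other characters that would need URI escaping.

```shell
# Build (and optionally run) a DistCp command for a local source directory.
# Set DRY_RUN=1 to print the command instead of running it.
distcp_from_local() (
    src=$1
    dst=$2
    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        exit 1   # leaves the function's subshell with status 1
    fi
    # Resolve to an absolute path so the file:// URI is unambiguous.
    abs=$(cd "$src" && pwd -P)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "hadoop distcp file://$abs $dst"
    else
        hadoop distcp "file://$abs" "$dst"
    fi
)

# Example: DRY_RUN=1 distcp_from_local /home/rajesh/testfiles /user/rajesh
```

The check only covers the host where the command is submitted, so it cannot tell whether the same path exists on the other nodes.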

Logs:


rajesh@namenode1:~/testfiles$ hadoop distcp file:///home/rajesh/testfiles /user/rajesh
16/07/18 07:00:50 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[file:/home/rajesh/testfiles], targetPath=/user/rajesh, targetPathExists=true, preserveRawXattrs=false}
16/07/18 07:00:52 INFO impl.TimelineClientImpl: Timeline service address: http://namenode1.rajesh.com:8188/ws/v1/timeline/
16/07/18 07:00:52 INFO client.RMProxy: Connecting to ResourceManager at namenode1.rajesh.com/192.168.0.100:8050
16/07/18 07:01:28 INFO impl.TimelineClientImpl: Timeline service address: http://namenode1.rajesh.com:8188/ws/v1/timeline/
16/07/18 07:01:28 INFO client.RMProxy: Connecting to ResourceManager at namenode1.rajesh.com/192.168.0.100:8050
16/07/18 07:01:31 INFO mapreduce.JobSubmitter: number of splits:21
16/07/18 07:01:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468835763123_0002
16/07/18 07:01:33 INFO impl.YarnClientImpl: Submitted application application_1468835763123_0002
16/07/18 07:01:33 INFO mapreduce.Job: The url to track the job: http://namenode1.rajesh.com:8088/proxy/application_1468835763123_0002/
16/07/18 07:01:33 INFO tools.DistCp: DistCp job-id: job_1468835763123_0002
16/07/18 07:01:33 INFO mapreduce.Job: Running job: job_1468835763123_0002
16/07/18 07:01:57 INFO mapreduce.Job: Job job_1468835763123_0002 running in uber mode : false
16/07/18 07:01:57 INFO mapreduce.Job:  map 0% reduce 0%
16/07/18 07:02:29 INFO mapreduce.Job:  map 1% reduce 0%
16/07/18 07:02:32 INFO mapreduce.Job:  map 2% reduce 0%
16/07/18 07:02:37 INFO mapreduce.Job:  map 3% reduce 0%
16/07/18 07:02:38 INFO mapreduce.Job:  map 4% reduce 0%
16/07/18 07:02:42 INFO mapreduce.Job:  map 5% reduce 0%
16/07/18 07:02:45 INFO mapreduce.Job:  map 6% reduce 0%
16/07/18 07:02:48 INFO mapreduce.Job:  map 7% reduce 0%
16/07/18 07:02:51 INFO mapreduce.Job:  map 8% reduce 0%
16/07/18 07:02:53 INFO mapreduce.Job:  map 9% reduce 0%
16/07/18 07:02:54 INFO mapreduce.Job:  map 10% reduce 0%
16/07/18 07:02:57 INFO mapreduce.Job:  map 11% reduce 0%
16/07/18 07:03:00 INFO mapreduce.Job:  map 12% reduce 0%
16/07/18 07:03:01 INFO mapreduce.Job:  map 13% reduce 0%
16/07/18 07:03:03 INFO mapreduce.Job:  map 14% reduce 0%
16/07/18 07:03:26 INFO mapreduce.Job:  map 15% reduce 0%
16/07/18 07:03:31 INFO mapreduce.Job:  map 16% reduce 0%
16/07/18 07:03:37 INFO mapreduce.Job:  map 17% reduce 0%
16/07/18 07:03:40 INFO mapreduce.Job:  map 18% reduce 0%
16/07/18 07:03:46 INFO mapreduce.Job:  map 19% reduce 0%
16/07/18 07:03:49 INFO mapreduce.Job:  map 20% reduce 0%
16/07/18 07:03:52 INFO mapreduce.Job:  map 21% reduce 0%
16/07/18 07:03:55 INFO mapreduce.Job:  map 22% reduce 0%
16/07/18 07:03:58 INFO mapreduce.Job:  map 23% reduce 0%
16/07/18 07:04:01 INFO mapreduce.Job:  map 24% reduce 0%
16/07/18 07:04:02 INFO mapreduce.Job:  map 25% reduce 0%
16/07/18 07:04:05 INFO mapreduce.Job:  map 26% reduce 0%
16/07/18 07:04:08 INFO mapreduce.Job:  map 27% reduce 0%
16/07/18 07:04:10 INFO mapreduce.Job:  map 28% reduce 0%
16/07/18 07:04:12 INFO mapreduce.Job:  map 29% reduce 0%
16/07/18 07:04:44 INFO mapreduce.Job:  map 30% reduce 0%
16/07/18 07:04:47 INFO mapreduce.Job:  map 31% reduce 0%
16/07/18 07:04:50 INFO mapreduce.Job:  map 32% reduce 0%
16/07/18 07:04:53 INFO mapreduce.Job:  map 33% reduce 0%
16/07/18 07:04:56 INFO mapreduce.Job:  map 34% reduce 0%
16/07/18 07:05:01 INFO mapreduce.Job:  map 35% reduce 0%
16/07/18 07:05:02 INFO mapreduce.Job:  map 36% reduce 0%
16/07/18 07:05:05 INFO mapreduce.Job:  map 37% reduce 0%
16/07/18 07:05:08 INFO mapreduce.Job:  map 38% reduce 0%
16/07/18 07:05:10 INFO mapreduce.Job:  map 39% reduce 0%
16/07/18 07:05:13 INFO mapreduce.Job:  map 40% reduce 0%
16/07/18 07:05:15 INFO mapreduce.Job:  map 41% reduce 0%
16/07/18 07:05:18 INFO mapreduce.Job:  map 42% reduce 0%
16/07/18 07:05:20 INFO mapreduce.Job:  map 43% reduce 0%
16/07/18 07:05:51 INFO mapreduce.Job:  map 44% reduce 0%
16/07/18 07:05:55 INFO mapreduce.Job:  map 45% reduce 0%
16/07/18 07:05:58 INFO mapreduce.Job:  map 46% reduce 0%
16/07/18 07:06:01 INFO mapreduce.Job:  map 47% reduce 0%
16/07/18 07:06:04 INFO mapreduce.Job:  map 48% reduce 0%
16/07/18 07:06:05 INFO mapreduce.Job:  map 49% reduce 0%
16/07/18 07:06:08 INFO mapreduce.Job:  map 50% reduce 0%
16/07/18 07:06:10 INFO mapreduce.Job:  map 51% reduce 0%
16/07/18 07:06:13 INFO mapreduce.Job:  map 52% reduce 0%
16/07/18 07:06:14 INFO mapreduce.Job:  map 53% reduce 0%
16/07/18 07:06:17 INFO mapreduce.Job:  map 54% reduce 0%
16/07/18 07:06:19 INFO mapreduce.Job:  map 55% reduce 0%
16/07/18 07:06:22 INFO mapreduce.Job:  map 56% reduce 0%
16/07/18 07:06:23 INFO mapreduce.Job:  map 57% reduce 0%
16/07/18 07:06:52 INFO mapreduce.Job:  map 58% reduce 0%
16/07/18 07:06:55 INFO mapreduce.Job:  map 59% reduce 0%
16/07/18 07:06:59 INFO mapreduce.Job:  map 60% reduce 0%
16/07/18 07:07:02 INFO mapreduce.Job:  map 61% reduce 0%
16/07/18 07:07:05 INFO mapreduce.Job:  map 62% reduce 0%
16/07/18 07:07:08 INFO mapreduce.Job:  map 63% reduce 0%
16/07/18 07:07:11 INFO mapreduce.Job:  map 64% reduce 0%
16/07/18 07:07:14 INFO mapreduce.Job:  map 65% reduce 0%
16/07/18 07:07:16 INFO mapreduce.Job:  map 66% reduce 0%
16/07/18 07:07:19 INFO mapreduce.Job:  map 67% reduce 0%
16/07/18 07:07:20 INFO mapreduce.Job:  map 68% reduce 0%
16/07/18 07:07:23 INFO mapreduce.Job:  map 69% reduce 0%
16/07/18 07:07:27 INFO mapreduce.Job:  map 70% reduce 0%
16/07/18 07:07:30 INFO mapreduce.Job:  map 71% reduce 0%
16/07/18 07:07:50 INFO mapreduce.Job:  map 72% reduce 0%
16/07/18 07:07:57 INFO mapreduce.Job:  map 73% reduce 0%
16/07/18 07:08:03 INFO mapreduce.Job:  map 74% reduce 0%
16/07/18 07:08:10 INFO mapreduce.Job:  map 75% reduce 0%
16/07/18 07:08:13 INFO mapreduce.Job:  map 76% reduce 0%
16/07/18 07:08:17 INFO mapreduce.Job:  map 77% reduce 0%
16/07/18 07:08:20 INFO mapreduce.Job:  map 78% reduce 0%
16/07/18 07:08:22 INFO mapreduce.Job:  map 79% reduce 0%
16/07/18 07:08:25 INFO mapreduce.Job:  map 80% reduce 0%
16/07/18 07:08:28 INFO mapreduce.Job:  map 81% reduce 0%
16/07/18 07:08:31 INFO mapreduce.Job:  map 82% reduce 0%
16/07/18 07:08:34 INFO mapreduce.Job:  map 83% reduce 0%
16/07/18 07:08:39 INFO mapreduce.Job:  map 84% reduce 0%
16/07/18 07:08:43 INFO mapreduce.Job:  map 85% reduce 0%
16/07/18 07:08:55 INFO mapreduce.Job:  map 86% reduce 0%
16/07/18 07:09:05 INFO mapreduce.Job:  map 87% reduce 0%
16/07/18 07:09:12 INFO mapreduce.Job:  map 88% reduce 0%
16/07/18 07:09:18 INFO mapreduce.Job:  map 89% reduce 0%
16/07/18 07:09:21 INFO mapreduce.Job:  map 90% reduce 0%
16/07/18 07:09:22 INFO mapreduce.Job:  map 91% reduce 0%
16/07/18 07:09:24 INFO mapreduce.Job:  map 92% reduce 0%
16/07/18 07:09:27 INFO mapreduce.Job:  map 94% reduce 0%
16/07/18 07:09:30 INFO mapreduce.Job:  map 95% reduce 0%
16/07/18 07:09:33 INFO mapreduce.Job:  map 96% reduce 0%
16/07/18 07:09:34 INFO mapreduce.Job:  map 97% reduce 0%
16/07/18 07:09:35 INFO mapreduce.Job:  map 98% reduce 0%
16/07/18 07:09:38 INFO mapreduce.Job:  map 99% reduce 0%
16/07/18 07:09:41 INFO mapreduce.Job:  map 100% reduce 0%
16/07/18 07:09:43 INFO mapreduce.Job: Job job_1468835763123_0002 completed successfully
16/07/18 07:09:43 INFO mapreduce.Job: Counters: 33
        File System Counters
                FILE: Number of bytes read=7228656
                FILE: Number of bytes written=2873945
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=559335
                HDFS: Number of bytes written=7228656
                HDFS: Number of read operations=22093
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=6308
        Job Counters
                Launched map tasks=21
                Other local map tasks=21
                Total time spent by all maps in occupied slots (ms)=1341918
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=1341918
                Total vcore-seconds taken by all map tasks=1341918
                Total megabyte-seconds taken by all map tasks=1374124032
        Map-Reduce Framework
                Map input records=3133
                Map output records=0
                Input split bytes=2457
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=9853
                CPU time spent (ms)=240430
                Physical memory (bytes) snapshot=4918603776
                Virtual memory (bytes) snapshot=70847594496
                Total committed heap usage (bytes)=3467640832
        File Input Format Counters
                Bytes Read=556878
        File Output Format Counters
                Bytes Written=0
        org.apache.hadoop.tools.mapred.CopyMapper$Counter
                BYTESCOPIED=7228656
                BYTESEXPECTED=7228656
                COPY=3133
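
The CopyMapper counters at the end of the log give a quick integrity check: for a fresh copy like this one, BYTESCOPIED should equal BYTESEXPECTED. A minimal sketch that runs this check on a saved copy of the log follows. It assumes the job output was captured to a file; the file name distcp.log and the function name check_distcp_bytes are hypothetical.

```shell
# Compare the BYTESCOPIED and BYTESEXPECTED counters in a saved DistCp log.
# Prints both values; returns non-zero if they differ or are missing.
check_distcp_bytes() (
    log=$1
    copied=$(sed -n 's/^[[:space:]]*BYTESCOPIED=\([0-9][0-9]*\).*/\1/p' "$log" | tail -n 1)
    expected=$(sed -n 's/^[[:space:]]*BYTESEXPECTED=\([0-9][0-9]*\).*/\1/p' "$log" | tail -n 1)
    echo "copied=$copied expected=$expected"
    [ -n "$copied" ] && [ "$copied" = "$expected" ]
)

# Example: check_distcp_bytes distcp.log
```

One way to capture the log in the first place is to append 2>&1 | tee distcp.log to the distcp command.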

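Validating the copied files (listing truncated). At 2308 bytes each, 3132 files come to 7,228,656 bytes, which matches the BYTESCOPIED counter: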

rajesh@namenode1:~/testfiles$ hadoop fs -ls /user/rajesh/testfiles/
Found 3132 items
-rw-r--r--   3 rajesh hdfs       2308 2016-07-18 07:06 /user/rajesh/testfiles/1.txt
-rw-r--r--   3 rajesh hdfs       2308 2016-07-18 07:07 /user/rajesh/testfiles/10001x.txt
-rw-r--r--   3 rajesh hdfs       2308 2016-07-18 07:08 /user/rajesh/testfiles/10001x_0001x.txt
-rw-r--r--   3 rajesh hdfs       2308 2016-07-18 07:04 /user/rajesh/testfiles/10001x_0001x_0001x.txt
-rw-r--r--   3 rajesh hdfs       2308 2016-07-18 07:07 /user/rajesh/testfiles/10001x_0001xy2z.txt
-rw-r--r--   3 rajesh hdfs       2308 2016-07-18 07:08 /user/rajesh/testfiles/10001x_0001xy2z_0001x.txt
-rw-r--r--   3 rajesh hdfs       2308 2016-07-18 07:08 /user/rajesh/testfiles/10001x_0001xy3z.txt
-rw-r--r--   3 rajesh hdfs       2308 2016-07-18 07:08 /user/rajesh/testfiles/10001xy1000z_0001x.txt
-rw-r--r--   3 rajesh hdfs       2308 2016-07-18 07:05 /user/rajesh/testfiles/10001xy1000z_0001x_0001x.txt
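
Reading through thousands of listing lines is hard to do by eye. As a sketch of an alternative, hadoop fs -count prints the directory count, file count, content size and path for a target, and the file count can be pulled out and compared with the local number. The helper name hdfs_file_count is hypothetical.

```shell
# Extract the file count (second column) from `hadoop fs -count` output,
# whose columns are: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME.
hdfs_file_count() {
    awk 'NF >= 4 { print $2; exit }'
}

# Usage (needs a working Hadoop client):
#   hadoop fs -count /user/rajesh/testfiles | hdfs_file_count
```

The CONTENT_SIZE column of the same output can be compared with the BYTESCOPIED counter from the job log.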





Side note: creating the test files

I created these 3,000+ files on my Windows machine and then copied them to the Linux box with scp.

Since they are all copies of the same file made in Windows, every file name contained a blank space and the word "copy".

I used the rename command to rename them, which I found very useful for bulk rename operations like this. Note that with the Perl-based rename, the -n flag only prints what would be renamed; drop it to actually rename the files.

rename -n 's/ copy/xyz[001]/' *.txt
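
The rename utility differs between distributions, and the Perl-expression form is not available everywhere, so the same idea can also be sketched with a plain shell loop. The version below previews by default and only renames when asked. The function name rename_copy_files, the apply argument and the _copy replacement are assumptions for illustration, not the exact pattern used for this test.

```shell
# Replace the first " copy" in each *.txt file name in a directory with "_copy".
# Prints "old -> new" for every match; renames only if the 2nd argument is "apply".
rename_copy_files() (
    cd "$1" || exit 1
    mode=${2:-preview}
    for f in *.txt; do
        [ -e "$f" ] || continue            # glob matched nothing
        new=$(printf '%s\n' "$f" | sed 's/ copy/_copy/')
        [ "$new" = "$f" ] && continue      # name has no " copy"
        echo "$f -> $new"
        if [ "$mode" = "apply" ]; then
            mv -- "$f" "$new"
        fi
    done
)

# Preview: rename_copy_files /home/rajesh/testfiles
# Apply:   rename_copy_files /home/rajesh/testfiles apply
```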


1 comment:

  1. Hi Rajesh,
    Are you sure we can run distcp from a local Linux machine to HDFS? I tried the same command you mentioned, but I am getting this error:

    Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.FileNotFoundException: File file:/home/koti.karri/Sample/file1.txt does not exist

    Can you please help me find what I am doing wrong?
