Using PIG to load into multiple column family of a Hbase table

Using PIG to load into Hbase:-

In this article we will see how to join 2 datasets in PIG and load the joined, filtered data into HBASE table with 2 different column families

Sample Datasets:-

Dataset1:  It has customer no, firstname, lastname, age, profession

[hbase@hbase1 ~]$ cat sample_data.csv
4000001,Kristina,Chung,55,Pilot
4000002,Paige,Chen,74,Teacher
4000003,Sherri,Melton,34,Firefighter
4000004,Gretchen,Hill,66,Computer hardware engineer
4000005,Karen,Puckett,74,Lawyer
4000006,Patrick,Song,42,Veterinarian
4000007,Elsie,Hamilton,43,Pilot
4000008,Hazel,Bender,63,Carpenter
4000009,Malcolm,Wagner,39,Artist
4000010,Dolores,McLaughlin,60,Writer
4000011,Francis,McNamara,47,Therapist
4000012,Sandy,Raynor,26,Writer
4000013,Marion,Moon,41,Carpenter
4000014,Beth,Woodard,65,
4000015,Julia,Desai,49,Musician
4000016,Jerome,Wallace,52,Pharmacist
4000017,Neal,Lawrence,72,Computer support specialist
4000018,Jean,Griffin,45,Childcare worker
4000019,Kristine,Dougherty,63,Financial analyst

Dataset2:  It has customer phone number details mapped with customer no.

[hbase@hbase1 ~]$ cat sample_customer_contact.csv
4000001,2025550106
4000002,2025550110
4000003,2025550179
4000004,2025550197
4000005,2025550148
4000006,2025550119
4000007,4045550173
4000008,4045550175
4000009,4045550185
4000010,4045550129
4000011,4045550155
4000012,4045550151
4000013,2255550174
4000014,2255550166
4000015,2255550181
4000016,2255550145
4000017,2255550191
4000018,2255550123
4000019,4045550173

Put these 2 files in the HDFS path /user/hbase/ using hadoop fs -put commands.

hadoop fs -put sample_data.csv /user/hbase/
hadoop fs -put sample_customer_contact.csv /user/hbase/

++++++++++++++++++++++++++++++++
HBase Create table with 2 column family ( customers_data , customers_contact)
++++++++++++++++++++++++++++++++

In hbase-shell :-

create 'customers', {NAME => 'customers_data'} ,{NAME => 'customers_contact'}

--------------
Pig Script :-
--------------

customers_data_relation = LOAD 'hdfs:/user/hbase/sample_data.csv' USING PigStorage(',') AS (
           custno:chararray,
           firstname:chararray,
           lastname:chararray,
           age:int,
           profession:chararray
);

/*  If need to store customer info into customers_data column family individually  */

STORE customers_data_relation INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'customers_data:firstname
 customers_data:lastname
 customers_data:age
 customers_data:profession'
);

customers_contact_relation = LOAD 'hdfs:/user/hbase/sample_customer_contact.csv' USING PigStorage(',') AS (
        custno:chararray,
phone:chararray
);

/* if need to store phone numbers into customers_contact column family separately */

STORE customers_contact_relation INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'customers_contact:phone'
);

/* joining customers_data_relation &  customers_contact_relation on custno. field */

customer_info_joined = join customers_data_relation by custno, customers_contact_relation by custno;

customer_info_final = FOREACH customer_info_joined {
 GENERATE
  (chararray) customers_data_relation::custno,
  (chararray) customers_data_relation::firstname,
  (chararray) customers_data_relation::lastname,
  (chararray) customers_data_relation::age,
  (chararray) customers_data_relation::profession,
  (chararray) customers_contact_relation::phone
  ;
}

/* To store the filtered , joined data into hbase */

STORE Final_Join_filtered  INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'customers_data:firstname
 customers_data:lastname
 customers_data:age
 customers_data:profession
 customers_contact:phone'
);

The above pig script will run the Map reduce job and store the data into Hbase customers table

To Verify the data:-

Now if you scan the customers table in hbase shell , you should see the 19 records with phone number.

hbase(main):064:0> scan 'customers'
ROW                                           COLUMN+CELL
 4000001                                      column=customers_contact:phone, timestamp=1530651843319, value=2025550106
 4000001                                      column=customers_data:age, timestamp=1530651843319, value=55
 4000001                                      column=customers_data:firstname, timestamp=1530651843319, value=Kristina
 4000001                                      column=customers_data:lastname, timestamp=1530651843319, value=Chung
 4000001                                      column=customers_data:profession, timestamp=1530651843319, value=Pilot
 4000002                                      column=customers_contact:phone, timestamp=1530651843325, value=2025550110
 4000002                                      column=customers_data:age, timestamp=1530651843325, value=74
 4000002                                      column=customers_data:firstname, timestamp=1530651843325, value=Paige
 4000002                                      column=customers_data:lastname, timestamp=1530651843325, value=Chen
 4000002                                      column=customers_data:profession, timestamp=1530651843325, value=Teacher
 4000003                                      column=customers_contact:phone, timestamp=1530651843326, value=2025550179
 4000003                                      column=customers_data:age, timestamp=1530651843326, value=34
 4000003                                      column=customers_data:firstname, timestamp=1530651843326, value=Sherri
 4000003                                      column=customers_data:lastname, timestamp=1530651843326, value=Melton
 4000003                                      column=customers_data:profession, timestamp=1530651843326, value=Firefighter
 4000004                                      column=customers_contact:phone, timestamp=1530651843326, value=2025550197
 4000004                                      column=customers_data:age, timestamp=1530651843326, value=66
 4000004                                      column=customers_data:firstname, timestamp=1530651843326, value=Gretchen
 4000004                                      column=customers_data:lastname, timestamp=1530651843326, value=Hill
 4000004                                      column=customers_data:profession, timestamp=1530651843326, value=Computer hardware engineer
 4000005                                      column=customers_contact:phone, timestamp=1530651843327, value=2025550148
 4000005                                      column=customers_data:age, timestamp=1530651843327, value=74
 4000005                                      column=customers_data:firstname, timestamp=1530651843327, value=Karen
 4000005                                      column=customers_data:lastname, timestamp=1530651843327, value=Puckett
 4000005                                      column=customers_data:profession, timestamp=1530651843327, value=Lawyer
 4000006                                      column=customers_contact:phone, timestamp=1530651843328, value=2025550119
 4000006                                      column=customers_data:age, timestamp=1530651843328, value=42
 4000006                                      column=customers_data:firstname, timestamp=1530651843328, value=Patrick
 4000006                                      column=customers_data:lastname, timestamp=1530651843328, value=Song
 4000006                                      column=customers_data:profession, timestamp=1530651843328, value=Veterinarian
 4000007                                      column=customers_contact:phone, timestamp=1530651843328, value=4045550173
 4000007                                      column=customers_data:age, timestamp=1530651843328, value=43
 4000007                                      column=customers_data:firstname, timestamp=1530651843328, value=Elsie
 4000007                                      column=customers_data:lastname, timestamp=1530651843328, value=Hamilton
 4000007                                      column=customers_data:profession, timestamp=1530651843328, value=Pilot
 4000008                                      column=customers_contact:phone, timestamp=1530651843329, value=4045550175
 4000008                                      column=customers_data:age, timestamp=1530651843329, value=63
 4000008                                      column=customers_data:firstname, timestamp=1530651843329, value=Hazel
 4000008                                      column=customers_data:lastname, timestamp=1530651843329, value=Bender
 4000008                                      column=customers_data:profession, timestamp=1530651843329, value=Carpenter
 4000009                                      column=customers_contact:phone, timestamp=1530651843329, value=4045550185
 4000009                                      column=customers_data:age, timestamp=1530651843329, value=39
 4000009                                      column=customers_data:firstname, timestamp=1530651843329, value=Malcolm
 4000009                                      column=customers_data:lastname, timestamp=1530651843329, value=Wagner
 4000009                                      column=customers_data:profession, timestamp=1530651843329, value=Artist
 4000010                                      column=customers_contact:phone, timestamp=1530651843330, value=4045550129
 4000010                                      column=customers_data:age, timestamp=1530651843330, value=60
 4000010                                      column=customers_data:firstname, timestamp=1530651843330, value=Dolores
 4000010                                      column=customers_data:lastname, timestamp=1530651843330, value=McLaughlin
 4000010                                      column=customers_data:profession, timestamp=1530651843330, value=Writer
 4000011                                      column=customers_contact:phone, timestamp=1530651843330, value=4045550155
 4000011                                      column=customers_data:age, timestamp=1530651843330, value=47
 4000011                                      column=customers_data:firstname, timestamp=1530651843330, value=Francis
 4000011                                      column=customers_data:lastname, timestamp=1530651843330, value=McNamara
 4000011                                      column=customers_data:profession, timestamp=1530651843330, value=Therapist
 4000012                                      column=customers_contact:phone, timestamp=1530651843331, value=4045550151
 4000012                                      column=customers_data:age, timestamp=1530651843331, value=26
 4000012                                      column=customers_data:firstname, timestamp=1530651843331, value=Sandy
 4000012                                      column=customers_data:lastname, timestamp=1530651843331, value=Raynor
 4000012                                      column=customers_data:profession, timestamp=1530651843331, value=Writer
 4000013                                      column=customers_contact:phone, timestamp=1530651843331, value=2255550174
 4000013                                      column=customers_data:age, timestamp=1530651843331, value=41
 4000013                                      column=customers_data:firstname, timestamp=1530651843331, value=Marion
 4000013                                      column=customers_data:lastname, timestamp=1530651843331, value=Moon
 4000013                                      column=customers_data:profession, timestamp=1530651843331, value=Carpenter
 4000014                                      column=customers_contact:phone, timestamp=1530651843332, value=2255550166
 4000014                                      column=customers_data:age, timestamp=1530651843332, value=65
 4000014                                      column=customers_data:firstname, timestamp=1530651843332, value=Beth
 4000014                                      column=customers_data:lastname, timestamp=1530651843332, value=Woodard
 4000014                                      column=customers_data:profession, timestamp=1530651843332, value=
 4000015                                      column=customers_contact:phone, timestamp=1530651843332, value=2255550181
 4000015                                      column=customers_data:age, timestamp=1530651843332, value=49
 4000015                                      column=customers_data:firstname, timestamp=1530651843332, value=Julia
 4000015                                      column=customers_data:lastname, timestamp=1530651843332, value=Desai
 4000015                                      column=customers_data:profession, timestamp=1530651843332, value=Musician
 4000016                                      column=customers_contact:phone, timestamp=1530651843333, value=2255550145
 4000016                                      column=customers_data:age, timestamp=1530651843333, value=52
 4000016                                      column=customers_data:firstname, timestamp=1530651843333, value=Jerome
 4000016                                      column=customers_data:lastname, timestamp=1530651843333, value=Wallace
 4000016                                      column=customers_data:profession, timestamp=1530651843333, value=Pharmacist
 4000017                                      column=customers_contact:phone, timestamp=1530651843333, value=2255550191
 4000017                                      column=customers_data:age, timestamp=1530651843333, value=72
 4000017                                      column=customers_data:firstname, timestamp=1530651843333, value=Neal
 4000017                                      column=customers_data:lastname, timestamp=1530651843333, value=Lawrence
 4000017                                      column=customers_data:profession, timestamp=1530651843333, value=Computer support specialist
 4000018                                      column=customers_contact:phone, timestamp=1530651843339, value=2255550123
 4000018                                      column=customers_data:age, timestamp=1530651843339, value=45
 4000018                                      column=customers_data:firstname, timestamp=1530651843339, value=Jean
 4000018                                      column=customers_data:lastname, timestamp=1530651843339, value=Griffin
 4000018                                      column=customers_data:profession, timestamp=1530651843339, value=Childcare worker
 4000019                                      column=customers_contact:phone, timestamp=1530651843341, value=4045550173
 4000019                                      column=customers_data:age, timestamp=1530651843341, value=63
 4000019                                      column=customers_data:firstname, timestamp=1530651843341, value=Kristine
 4000019                                      column=customers_data:lastname, timestamp=1530651843341, value=Dougherty
 4000019                                      column=customers_data:profession, timestamp=1530651843341, value=Financial analyst
19 row(s) in 0.0740 seconds


No comments:

Post a Comment

Boost Your Download Speed with lftp Segmentation

Looking for a faster way to download files via sftp to a Linux machine? Try using "lftp" instead. This tool offers segmented downl...

Other relevant topics