In this article we will see how to join 2 datasets in PIG and load the joined, filtered data into HBASE table with 2 different column families
Sample Datasets:-
Dataset1: It has customer no, firstname, lastname, age, profession
[hbase@hbase1 ~]$ cat sample_data.csv
4000001,Kristina,Chung,55,Pilot
4000002,Paige,Chen,74,Teacher
4000003,Sherri,Melton,34,Firefighter
4000004,Gretchen,Hill,66,Computer hardware engineer
4000005,Karen,Puckett,74,Lawyer
4000006,Patrick,Song,42,Veterinarian
4000007,Elsie,Hamilton,43,Pilot
4000008,Hazel,Bender,63,Carpenter
4000009,Malcolm,Wagner,39,Artist
4000010,Dolores,McLaughlin,60,Writer
4000011,Francis,McNamara,47,Therapist
4000012,Sandy,Raynor,26,Writer
4000013,Marion,Moon,41,Carpenter
4000014,Beth,Woodard,65,
4000015,Julia,Desai,49,Musician
4000016,Jerome,Wallace,52,Pharmacist
4000017,Neal,Lawrence,72,Computer support specialist
4000018,Jean,Griffin,45,Childcare worker
4000019,Kristine,Dougherty,63,Financial analyst
Dataset2: It has customer phone number details mapped with customer no.
[hbase@hbase1 ~]$ cat sample_customer_contact.csv
4000001,2025550106
4000002,2025550110
4000003,2025550179
4000004,2025550197
4000005,2025550148
4000006,2025550119
4000007,4045550173
4000008,4045550175
4000009,4045550185
4000010,4045550129
4000011,4045550155
4000012,4045550151
4000013,2255550174
4000014,2255550166
4000015,2255550181
4000016,2255550145
4000017,2255550191
4000018,2255550123
4000019,4045550173
Put these 2 files in the HDFS path /user/hbase/ using hadoop fs -put commands.
hadoop fs -put sample_data.csv /user/hbase/
hadoop fs -put sample_customer_contact.csv /user/hbase/
++++++++++++++++++++++++++++++++
HBase Create table with 2 column family ( customers_data , customers_contact)
++++++++++++++++++++++++++++++++
In hbase-shell :-
create 'customers', {NAME => 'customers_data'} ,{NAME => 'customers_contact'}
--------------
Pig Script :-
--------------
customers_data_relation = LOAD 'hdfs:/user/hbase/sample_data.csv' USING PigStorage(',') AS (
custno:chararray,
firstname:chararray,
lastname:chararray,
age:int,
profession:chararray
);
/* If need to store customer info into customers_data column family individually */
STORE customers_data_relation INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'customers_data:firstname
customers_data:lastname
customers_data:age
customers_data:profession'
);
customers_contact_relation = LOAD 'hdfs:/user/hbase/sample_customer_contact.csv' USING PigStorage(',') AS (
custno:chararray,
phone:chararray
);
/* if need to store phone numbers into customers_contact column family separately */
'customers_contact:phone'
);
/* joining customers_data_relation & customers_contact_relation on custno. field */
customer_info_joined = join customers_data_relation by custno, customers_contact_relation by custno;
customer_info_final = FOREACH customer_info_joined {
GENERATE
(chararray) customers_data_relation::custno,
(chararray) customers_data_relation::firstname,
(chararray) customers_data_relation::lastname,
(chararray) customers_data_relation::age,
(chararray) customers_data_relation::profession,
(chararray) customers_contact_relation::phone
;
}
/* To store the filtered , joined data into hbase */
STORE Final_Join_filtered INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'customers_data:firstname
customers_data:lastname
customers_data:age
customers_data:profession
customers_contact:phone'
);
The above pig script will run the Map reduce job and store the data into Hbase customers table
To Verify the data:-
Now if you scan the customers table in hbase shell , you should see the 19 records with phone number.
hbase(main):064:0> scan 'customers'
ROW COLUMN+CELL
4000001 column=customers_contact:phone, timestamp=1530651843319, value=2025550106
4000001 column=customers_data:age, timestamp=1530651843319, value=55
4000001 column=customers_data:firstname, timestamp=1530651843319, value=Kristina
4000001 column=customers_data:lastname, timestamp=1530651843319, value=Chung
4000001 column=customers_data:profession, timestamp=1530651843319, value=Pilot
4000002 column=customers_contact:phone, timestamp=1530651843325, value=2025550110
4000002 column=customers_data:age, timestamp=1530651843325, value=74
4000002 column=customers_data:firstname, timestamp=1530651843325, value=Paige
4000002 column=customers_data:lastname, timestamp=1530651843325, value=Chen
4000002 column=customers_data:profession, timestamp=1530651843325, value=Teacher
4000003 column=customers_contact:phone, timestamp=1530651843326, value=2025550179
4000003 column=customers_data:age, timestamp=1530651843326, value=34
4000003 column=customers_data:firstname, timestamp=1530651843326, value=Sherri
4000003 column=customers_data:lastname, timestamp=1530651843326, value=Melton
4000003 column=customers_data:profession, timestamp=1530651843326, value=Firefighter
4000004 column=customers_contact:phone, timestamp=1530651843326, value=2025550197
4000004 column=customers_data:age, timestamp=1530651843326, value=66
4000004 column=customers_data:firstname, timestamp=1530651843326, value=Gretchen
4000004 column=customers_data:lastname, timestamp=1530651843326, value=Hill
4000004 column=customers_data:profession, timestamp=1530651843326, value=Computer hardware engineer
4000005 column=customers_contact:phone, timestamp=1530651843327, value=2025550148
4000005 column=customers_data:age, timestamp=1530651843327, value=74
4000005 column=customers_data:firstname, timestamp=1530651843327, value=Karen
4000005 column=customers_data:lastname, timestamp=1530651843327, value=Puckett
4000005 column=customers_data:profession, timestamp=1530651843327, value=Lawyer
4000006 column=customers_contact:phone, timestamp=1530651843328, value=2025550119
4000006 column=customers_data:age, timestamp=1530651843328, value=42
4000006 column=customers_data:firstname, timestamp=1530651843328, value=Patrick
4000006 column=customers_data:lastname, timestamp=1530651843328, value=Song
4000006 column=customers_data:profession, timestamp=1530651843328, value=Veterinarian
4000007 column=customers_contact:phone, timestamp=1530651843328, value=4045550173
4000007 column=customers_data:age, timestamp=1530651843328, value=43
4000007 column=customers_data:firstname, timestamp=1530651843328, value=Elsie
4000007 column=customers_data:lastname, timestamp=1530651843328, value=Hamilton
4000007 column=customers_data:profession, timestamp=1530651843328, value=Pilot
4000008 column=customers_contact:phone, timestamp=1530651843329, value=4045550175
4000008 column=customers_data:age, timestamp=1530651843329, value=63
4000008 column=customers_data:firstname, timestamp=1530651843329, value=Hazel
4000008 column=customers_data:lastname, timestamp=1530651843329, value=Bender
4000008 column=customers_data:profession, timestamp=1530651843329, value=Carpenter
4000009 column=customers_contact:phone, timestamp=1530651843329, value=4045550185
4000009 column=customers_data:age, timestamp=1530651843329, value=39
4000009 column=customers_data:firstname, timestamp=1530651843329, value=Malcolm
4000009 column=customers_data:lastname, timestamp=1530651843329, value=Wagner
4000009 column=customers_data:profession, timestamp=1530651843329, value=Artist
4000010 column=customers_contact:phone, timestamp=1530651843330, value=4045550129
4000010 column=customers_data:age, timestamp=1530651843330, value=60
4000010 column=customers_data:firstname, timestamp=1530651843330, value=Dolores
4000010 column=customers_data:lastname, timestamp=1530651843330, value=McLaughlin
4000010 column=customers_data:profession, timestamp=1530651843330, value=Writer
4000011 column=customers_contact:phone, timestamp=1530651843330, value=4045550155
4000011 column=customers_data:age, timestamp=1530651843330, value=47
4000011 column=customers_data:firstname, timestamp=1530651843330, value=Francis
4000011 column=customers_data:lastname, timestamp=1530651843330, value=McNamara
4000011 column=customers_data:profession, timestamp=1530651843330, value=Therapist
4000012 column=customers_contact:phone, timestamp=1530651843331, value=4045550151
4000012 column=customers_data:age, timestamp=1530651843331, value=26
4000012 column=customers_data:firstname, timestamp=1530651843331, value=Sandy
4000012 column=customers_data:lastname, timestamp=1530651843331, value=Raynor
4000012 column=customers_data:profession, timestamp=1530651843331, value=Writer
4000013 column=customers_contact:phone, timestamp=1530651843331, value=2255550174
4000013 column=customers_data:age, timestamp=1530651843331, value=41
4000013 column=customers_data:firstname, timestamp=1530651843331, value=Marion
4000013 column=customers_data:lastname, timestamp=1530651843331, value=Moon
4000013 column=customers_data:profession, timestamp=1530651843331, value=Carpenter
4000014 column=customers_contact:phone, timestamp=1530651843332, value=2255550166
4000014 column=customers_data:age, timestamp=1530651843332, value=65
4000014 column=customers_data:firstname, timestamp=1530651843332, value=Beth
4000014 column=customers_data:lastname, timestamp=1530651843332, value=Woodard
4000014 column=customers_data:profession, timestamp=1530651843332, value=
4000015 column=customers_contact:phone, timestamp=1530651843332, value=2255550181
4000015 column=customers_data:age, timestamp=1530651843332, value=49
4000015 column=customers_data:firstname, timestamp=1530651843332, value=Julia
4000015 column=customers_data:lastname, timestamp=1530651843332, value=Desai
4000015 column=customers_data:profession, timestamp=1530651843332, value=Musician
4000016 column=customers_contact:phone, timestamp=1530651843333, value=2255550145
4000016 column=customers_data:age, timestamp=1530651843333, value=52
4000016 column=customers_data:firstname, timestamp=1530651843333, value=Jerome
4000016 column=customers_data:lastname, timestamp=1530651843333, value=Wallace
4000016 column=customers_data:profession, timestamp=1530651843333, value=Pharmacist
4000017 column=customers_contact:phone, timestamp=1530651843333, value=2255550191
4000017 column=customers_data:age, timestamp=1530651843333, value=72
4000017 column=customers_data:firstname, timestamp=1530651843333, value=Neal
4000017 column=customers_data:lastname, timestamp=1530651843333, value=Lawrence
4000017 column=customers_data:profession, timestamp=1530651843333, value=Computer support specialist
4000018 column=customers_contact:phone, timestamp=1530651843339, value=2255550123
4000018 column=customers_data:age, timestamp=1530651843339, value=45
4000018 column=customers_data:firstname, timestamp=1530651843339, value=Jean
4000018 column=customers_data:lastname, timestamp=1530651843339, value=Griffin
4000018 column=customers_data:profession, timestamp=1530651843339, value=Childcare worker
4000019 column=customers_contact:phone, timestamp=1530651843341, value=4045550173
4000019 column=customers_data:age, timestamp=1530651843341, value=63
4000019 column=customers_data:firstname, timestamp=1530651843341, value=Kristine
4000019 column=customers_data:lastname, timestamp=1530651843341, value=Dougherty
4000019 column=customers_data:profession, timestamp=1530651843341, value=Financial analyst
19 row(s) in 0.0740 seconds
No comments:
Post a Comment