Apache Pig - BinStorage()
The BinStorage() function loads and stores data in Pig using a machine-readable binary format. Pig generally uses BinStorage() to store the temporary data generated between MapReduce jobs. It supports multiple locations (files, directories, or globs) as input.
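As a minimal sketch of the multiple-location support mentioned above, the statements below load BinStorage data from more than one place in a single LOAD. The paths run_1, run_2, and run_* are hypothetical and are assumed to contain data previously written with BinStorage().

```pig
-- Hypothetical paths: both directories are assumed to hold BinStorage output.
-- Several locations can be given as one comma-separated string.
combined = LOAD '/pig_Output/run_1,/pig_Output/run_2' USING BinStorage();

-- A glob can select the same kind of input by pattern (hypothetical pattern).
all_runs = LOAD '/pig_Output/run_*' USING BinStorage();
```

In both cases the tuples from every matching location end up in one relation, so the inputs should share the same field layout.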
Syntax
Given below is the syntax of the BinStorage() function. It takes no arguments and is used in the USING clause of a LOAD or STORE statement.
BinStorage()
Example
Assume that we have a file named stu_data.txt in the HDFS directory /pig_data/ as shown below.
stu_data.txt
001,Rajiv_Reddy,21,Hyderabad
002,siddarth_Battacharya,22,Kolkata
003,Rajesh_Khanna,22,Delhi
004,Preethi_Agarwal,21,Pune
005,Trupthi_Mohanthy,23,Bhuwaneshwar
006,Archana_Mishra,23,Chennai
007,Komal_Nayak,24,trivendram
008,Bharathi_Nambiayar,24,Chennai
Let us load this data into a Pig relation as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/stu_data.txt' USING PigStorage(',') as (id:int, firstname:chararray, age:int, city:chararray);
Now, we can store this relation in the HDFS directory /pig_Output/mydata using the BinStorage() function.
grunt> STORE student_details INTO 'hdfs://localhost:9000/pig_Output/mydata' USING BinStorage();
After executing the above statement, the relation is stored in the given HDFS directory. You can see it using the HDFS ls command as shown below.
$ hdfs dfs -ls hdfs://localhost:9000/pig_Output/mydata/
Found 2 items
-rw-r--r-- 1 Hadoop supergroup 0 2015-10-26 16:58 hdfs://localhost:9000/pig_Output/mydata/_SUCCESS
-rw-r--r-- 1 Hadoop supergroup 372 2015-10-26 16:58 hdfs://localhost:9000/pig_Output/mydata/part-m-00000
Now, load the data from the file part-m-00000.
grunt> result = LOAD 'hdfs://localhost:9000/pig_Output/mydata/part-m-00000' USING BinStorage();
Verify the contents of the relation using the Dump operator as shown below.
grunt> Dump result;
(1,Rajiv_Reddy,21,Hyderabad)
(2,siddarth_Battacharya,22,Kolkata)
(3,Rajesh_Khanna,22,Delhi)
(4,Preethi_Agarwal,21,Pune)
(5,Trupthi_Mohanthy,23,Bhuwaneshwar)
(6,Archana_Mishra,23,Chennai)
(7,Komal_Nayak,24,trivendram)
(8,Bharathi_Nambiayar,24,Chennai)
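The interactive steps above can also be collected into a single Pig script. The following is a minimal sketch, assuming the same HDFS paths used in this chapter; the file name binstorage_demo.pig is hypothetical. It loads the whole output directory instead of a single part file, which Pig also accepts.

```pig
-- binstorage_demo.pig (hypothetical file name)

-- Load the comma-separated text file with an explicit schema.
student_details = LOAD 'hdfs://localhost:9000/pig_data/stu_data.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, age:int, city:chararray);

-- Write the relation in the binary BinStorage format.
-- The output directory is assumed not to exist yet.
STORE student_details INTO 'hdfs://localhost:9000/pig_Output/mydata'
    USING BinStorage();

-- Read the stored data back from the output directory and print it.
result = LOAD 'hdfs://localhost:9000/pig_Output/mydata' USING BinStorage();
DUMP result;
```

Such a script can be run with a command like pig binstorage_demo.pig. The STORE statement fails if the output directory already exists, so remove the directory or choose a new path before running the script again.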