Apache Pig - SIZE()

Syntax

Given below is the syntax of the SIZE() function.

grunt> SIZE(expression)

The return value depends on the data type of the expression, as listed below.

Data type	Return value
int, long, float, double	Returns 1 for all of these types.
chararray	Returns the number of characters in the string.
bytearray	Returns the number of bytes in the array.
tuple	Returns the number of fields in the tuple.
bag	Returns the number of tuples in the bag.
map	Returns the number of key/value pairs in the map.
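
The rules in the table can be tried out in a single statement. The following is a minimal sketch, not part of the original example: it assumes a hypothetical comma-separated file /pig_data/people.txt with an int field and a chararray field, and it uses the built-in functions TOTUPLE, TOBAG, and TOMAP only to construct values of the complex types. The relation and alias names are made up for illustration.

```pig
-- Hypothetical input file (an assumption): lines such as "1,John".
people = LOAD '/pig_data/people.txt' USING PigStorage(',')
    AS (id:int, name:chararray);

sizes = FOREACH people GENERATE
    SIZE(id)                  AS int_size,      -- numeric type: always 1
    SIZE(name)                AS name_chars,    -- chararray: number of characters
    SIZE(TOTUPLE(id, name))   AS tuple_fields,  -- tuple built from two fields: 2
    SIZE(TOBAG(id, id, id))   AS bag_tuples,    -- bag built from three tuples: 3
    SIZE(TOMAP('name', name)) AS map_entries;   -- map with one key/value pair: 1

DUMP sizes;
```

Only the name_chars column should vary from row to row here; the other columns are fixed by how the values are constructed.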

Example

Assume that we have a file named employee.txt in the HDFS directory /pig_data/ as shown below.

employee.txt

1,John,2007-01-24,250
2,Ram,2007-05-27,220
3,Jack,2007-05-06,170
3,Jack,2007-04-06,100
4,Jill,2007-04-06,220
5,Zara,2007-06-06,300
5,Zara,2007-02-06,350

Load this file into Pig under the relation name employee_data as shown below.

grunt> employee_data = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
   as (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);

Calculating the Size of the Type

To calculate the size of the values in a particular column, use the SIZE() function. The statement below calculates the size of each value in the name column.

grunt> size = FOREACH employee_data GENERATE SIZE(name);

Verification

Verify the relation size using the DUMP operator as shown below.

grunt> Dump size;

Output

The statement produces the following output, which is the contents of the relation size. Because the name column is of type chararray, the SIZE() function returns the number of characters in each employee's name.

(4)
(3)
(4)
(4)
(4)
(4)
(4)
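
SIZE() also works on the bags that GROUP creates. The sketch below is an illustrative extension, not part of the original example: it groups the same employee.txt data by name and uses SIZE() to count the tuples in each group's bag. The aliases by_name, records_per_name, and record_count are made-up names.

```pig
-- Same LOAD as in the example above, repeated so the script stands alone.
employee_data = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
    AS (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);

-- GROUP yields one tuple per name: (group, bag of matching employee_data tuples).
by_name = GROUP employee_data BY name;

-- SIZE on the bag returns the number of tuples it holds, i.e. records per name.
records_per_name = FOREACH by_name GENERATE group AS name, SIZE(employee_data) AS record_count;

DUMP records_per_name;
```

With the sample data, Jack and Zara each appear in two records and every other employee in one, so their bags should have sizes 2 and 1 respectively.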
