Thursday, June 21, 2012

Hadoop Hive SPLIT example

For example you have a table that tracks your visitors and you want to see who was the top referers but you don't want to see anything after the "?", i.e. to remove the http parameters:

select split(referer, '\\?')[0], count(*) as cnt from event_log where referer is not null group by split(referer, '\\?')[0] order by cnt desc limit 100;

Note that we need to escape the ? because it's a special symbol in regular expressions. So if you have the following input

http://www.a.com/something?param1=yes&param2=no
http://www.b.com/hehe?submit=true

then the output will be

http://www.a.com/something             1
http://www.b.com/hehe                     1

No comments: