Hive分析nginx日志

昵称23016082 2015-07-09

展开全文

日志格式：

（3条）

223.66.226.171 - - 02/Jul/2015:00:00:09 +0800 "GET /api/BFAdvertising?mid=194094807113E0&cpid=2&ver=5.49.0528.2111 HTTP/1.1" 200 5 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E)" "-"

221.3.220.29 - - 02/Jul/2015:00:00:09 +0800 "GET /api/BFAdvertising?mid=193728294B5054&cpid=2&ver=5.49.0528.2222 HTTP/1.1" 200 5 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)" "-"

14.204.246.135 - - 02/Jul/2015:00:00:09 +0800 "GET /api/pvCallback?adid=f0998bac69d381b777d14671e5d4b124&out_id=1948892A2DC9C9&platform=2&material_id=6361&template_id=7933711&template_ext_id=7933711l&work_order_id=2728&member_id=00000263&product_id=395&drama_id=207 HTTP/1.1" 200 5 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)" "-"

这几条日志里面有十一列（为了展示的美观，我在里面加入了换行符），每列之间是用空格分割的，每列的含义分别是客户端访问IP、用户标识、用户、访问时间（两段）、请求页面、请求状态、返回文件的大小、跳转来源、浏览器UA，最后一个不知道。如果想用一般的方法来解析这条日志的话，有点困难。但是我们可以如果我们用正则表达式去匹配这九列的话还是很简单的：

CREATE TABLE nginxlogs2(

host STRING,

identity STRING,

user STRING,

time STRING,

timeZone STRING,

request STRING,

status STRING,

size STRING,

referer STRING,

agent STRING,

anyother STRING)

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'

WITH SERDEPROPERTIES (

"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*?\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\") ([^ ]*)",

"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s"

)

STORED AS TEXTFILE;

将日志放置到这个表的目录下，gz格式的和未压缩的格式都可以直接被Hive解析。所以我可以用下面已经查询每小时的访问量超过20的IP：

hive> select substring(time, 2, 14) date ,host, count(*) as count 
from logs 
group by substring(time, 2, 14), host 
having count > 20 
sort by date, count;