分享

Hive分析nginx日志

 昵称23016082 2015-07-09
日志格式:
(3条)
223.66.226.171 - - 02/Jul/2015:00:00:09 +0800 "GET /api/BFAdvertising?mid=194094807113E0&cpid=2&ver=5.49.0528.2111 HTTP/1.1" 200 5 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E)" "-"
221.3.220.29 - - 02/Jul/2015:00:00:09 +0800 "GET /api/BFAdvertising?mid=193728294B5054&cpid=2&ver=5.49.0528.2222 HTTP/1.1" 200 5 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)" "-"
14.204.246.135 - - 02/Jul/2015:00:00:09 +0800 "GET /api/pvCallback?adid=f0998bac69d381b777d14671e5d4b124&out_id=1948892A2DC9C9&platform=2&material_id=6361&template_id=7933711&template_ext_id=7933711l&work_order_id=2728&member_id=00000263&product_id=395&drama_id=207 HTTP/1.1" 200 5 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)" "-"
这几条日志里面有十一列(为了展示的美观,我在里面加入了换行符),每列之间是用空格分割的,每列的含义分别是客户端访问IP、用户标识、用户、访问时间(两段)、请求页面、请求状态、返回文件的大小、跳转来源、浏览器UA,最后一个不知道。如果想用一般的方法来解析这条日志的话,有点困难。但是我们可以如果我们用正则表达式去匹配这九列的话还是很简单的:


CREATE TABLE nginxlogs2(
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  timeZone STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING,
  anyother STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*?\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\") ([^ ]*)",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s"
)
STORED AS TEXTFILE;


将日志放置到这个表的目录下,gz格式的和未压缩的格式都可以直接被Hive解析。所以我可以用下面已经查询每小时的访问量超过20的IP:

hive> select substring(time, 2, 14) date ,host, count(*) as count 
from logs 
group by substring(time, 2, 14), host 
having count > 20 
sort by date, count;

    本站是提供个人知识管理的网络存储空间,所有内容均由用户发布,不代表本站观点。请注意甄别内容中的联系方式、诱导购买等信息,谨防诈骗。如发现有害或侵权内容,请点击一键举报。
    转藏 分享 献花(0

    0条评论

    发表

    请遵守用户 评论公约

    类似文章 更多