Sort: A Handy Tool for Archiving Logs from Distributed Front Ends

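This requirement looks simple, but it is actually tricky. Once a site grows, it will naturally sit behind a CDN with N front-end servers, and merging the logs pulled back from them is a chore. We cannot guarantee that every server's clock is accurate to the millisecond, nor can we use a cluster file system to have all servers write to a single statistics server at once; neither is realistic, and both add far too much complexity. That leaves merging the logs as the only practical target.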

Linux ships with sort, a tool that solves this problem quite handily. As usual, I start by running sort --help to list the options:

Usage: sort [OPTION]... [FILE]...
Write sorted concatenation of all FILE(s) to standard output.

Mandatory arguments to long options are mandatory for short options too.
Ordering options:

-b, --ignore-leading-blanks ignore leading blanks
-d, --dictionary-order consider only blanks and alphanumeric characters
-f, --ignore-case fold lower case to upper case characters
-g, --general-numeric-sort compare according to general numerical value
-i, --ignore-nonprinting consider only printable characters
-M, --month-sort compare (unknown) < `JAN' < ... < `DEC'
-n, --numeric-sort compare according to string numerical value
-r, --reverse reverse the result of comparisons

Other options:

-c, --check check whether input is sorted; do not sort
-k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)
-m, --merge merge already sorted files; do not sort
-o, --output=FILE write result to FILE instead of standard output
-s, --stable stabilize sort by disabling last-resort comparison
-S, --buffer-size=SIZE use SIZE for main memory buffer
-t, --field-separator=SEP use SEP instead of non-blank to blank transition
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
-u, --unique with -c, check for strict ordering;
without -c, output only the first of an equal run
-z, --zero-terminated end lines with 0 byte, not newline
--help display this help and exit
--version output version information and exit

POS is F[.C][OPTS], where F is the field number and C the character position
in the field. OPTS is one or more single-letter ordering options, which
override global ordering options for that key. If no key is given, use the
entire line as the key.

SIZE may be followed by the following multiplicative suffixes:
% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.

With no FILE, or when FILE is -, read standard input.

*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.

Report bugs to <bug-coreutils@gnu.org>.

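It wants me to set the locale environment variable to C? Does that improve performance? Since that is the recommendation, I went along with it. With LC_ALL=C, sort compares raw byte values instead of applying locale collation rules, so the order is predictable and the comparisons are usually cheaper.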
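To make the locale warning concrete, here is a minimal sketch of what LC_ALL=C does to the ordering. The four input lines are made-up sample data, not real log content; the point is only that byte-value comparison places uppercase ASCII letters before lowercase ones.

```shell
# Sort four sample lines by raw byte value (LC_ALL=C).
# In ASCII, 'A' (65) and 'B' (66) come before 'a' (97) and 'b' (98).
c_order=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
printf '%s\n' "$c_order"
```

Under a locale such as en_US.UTF-8 the same input typically comes out with upper and lower case interleaved, which is why a log merged on one machine can look "unsorted" to sort on another machine unless both use the same locale.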

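Here is what the options we need actually do:

  • -m merge files that are already sorted instead of sorting them again, which is much faster
  • -t set the field separator
  • -k choose which field to sort by; when several -k options are given they are applied in order (ties on the first key are broken by the second, and so on). Note that -k 1 starts the key at field 1 and runs to the end of the line, while -k 1,1 limits it to the first field
  • -o name of the output file
  • -S size of the in-memory buffer; in my tests this had no visible effect, so pointers are welcome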

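The general form is simple: sort [options] [input files]
It looks easy enough, so let's try it. A squid log line starts with a Unix timestamp as its first field, and the fields are separated by spaces. The command to merge 1.log and 2.log into all.log is: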

sort -m -t " " -k 1 -o all.log 1.log 2.log
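Because -m assumes its inputs are already sorted, it can be worth checking that first. The following is a rough sketch, not the exact procedure used on the production logs: the log lines are invented samples in a squid-like layout, the scratch directory `workdir` is hypothetical, and the key is restricted to the first field with -k 1,1 as a small variation on the command above.

```shell
# Hypothetical sample data: two tiny, already-sorted logs in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
    '1286536309.450 12 10.0.0.1 TCP_HIT/200' \
    '1286536311.020 30 10.0.0.2 TCP_MISS/200' > "$workdir/1.log"
printf '%s\n' \
    '1286536310.115 8 10.0.0.3 TCP_HIT/200' \
    '1286536312.700 15 10.0.0.4 TCP_MISS/304' > "$workdir/2.log"

# -c only checks the order; it exits non-zero if the file is not sorted.
for f in "$workdir/1.log" "$workdir/2.log"; do
    LC_ALL=C sort -c -t ' ' -k 1,1 "$f" || echo "$f is not sorted" >&2
done

# Merge the two sorted inputs on the timestamp field.
LC_ALL=C sort -m -t ' ' -k 1,1 -o "$workdir/all.log" "$workdir/1.log" "$workdir/2.log"

# Show the timestamps of the merged file to confirm the interleaving.
cut -d ' ' -f 1 "$workdir/all.log"
```

If the check reports an unsorted file, running a plain sort (without -m) on that file first is the safer route, since -m does not reorder lines within a single input.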

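Done? Yes, it really is that simple. There is still another problem: the merged log has to be split by day, accurate to the millisecond, before it can be handed over to the statistics system. But with the thorniest part, the archive sorting, out of the way, the rest is much easier.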
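One possible way to do the per-day split is sketched below with awk. This is only an illustration under stated assumptions: days are cut at UTC midnight (a local-time boundary would need the timezone offset added to the timestamp first), the output names day_N.log use the number of days since the Unix epoch and are made up for this example, and the three log lines are invented samples that straddle one midnight.

```shell
# Hypothetical merged log with entries 1 ms before, exactly at, and 1 ms after
# a UTC midnight (1286582400 = 14891 * 86400).
workdir=$(mktemp -d)
printf '%s\n' \
    '1286582399.999 12 10.0.0.1 TCP_HIT/200' \
    '1286582400.000 30 10.0.0.2 TCP_MISS/200' \
    '1286582400.001 8 10.0.0.3 TCP_HIT/200' > "$workdir/all.log"

# Field 1 is the timestamp in seconds; integer-dividing by 86400 gives the
# UTC day number, which picks the output file for each line.
awk -v dir="$workdir" '{
    day = int($1 / 86400)
    print > (dir "/day_" day ".log")
}' "$workdir/all.log"

ls "$workdir"
```

Because the input is already sorted by time, each per-day file stays sorted as well, and the millisecond part of the timestamp decides which side of the boundary a line lands on.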
