ganlia 和 nagios 等工具,是业界优秀的监控告警工具;这种工具主要是面向运维的,也可以用来进行性能稳定性的测试。
面对分布式系统测试,耗时都比较长,往往一台机器安装多套系统,影响监控指标的准确性。
下面是一种进行进程级别监控的方n法,可以通过扩展,集群的监控力度;同时将监控脚本加入告警,防止脚本异常退出(Nagios扩展另文描述)
GEngin.py:总体的引擎,根据conf下配置文件的配置项,轮询监控指标,调用gmetric广播出去
conf:目录中保存metrix配置文件,配置参数指标
flag:目录中仅保存一个flag文件,文件名就是任务名,监控指标将根据任务名分离,便于汇总统计对比
log: 目录中记录GEngin的log及每个指标收取脚本的log
pid: GEngin的pid 为告警脚本使用
script: 指标收集的具体的脚本
cat conf/metrix.cfg:
YARN|ResourceManager|cpu|ResourceManager_cpu.py|ResourceManager_cpu.txt|int16|Percent| YARN|ResourceManager|mem|ResourceManager_mem.py|ResourceManager_mem.txt|int16|Percent| YARN|ResourceManager|lsof|ResourceManager_lsof.py|ResourceManager_lsof.txt|int16|Number| |
ls flag/:
yarntestD001.flag
ll log/:
-rw-r--r-- 1 yarn users 168 Mar 19 20:02 yarntestD001_YARNResourceManagercputdw-10-16-19-91.txt -rw-r--r-- 1 yarn users 168 Mar 19 20:02 yarntestD001_YARNResourceManagerlsoftdw-10-16-19-91.txt -rw-r--r-- 1 yarn users 168 Mar 19 20:02 yarntestD001_YARNResourceManagermemtdw-10-16-19-91.txt |
ll script/:
-rw-r--r-- 1 yarn users 882 Feb 28 17:20 ResourceManager_cpu.py -rw-r--r-- 1 yarn users 1093 Feb 28 17:45 ResourceManager_lsof.py -rw-r--r-- 1 yarn users 882 Feb 28 17:18 ResourceManager_mem.py |
cat script/SAMPLE.py:
#!/usr/bin/env python # coding=gbk import sys import os import datetime import time def CheckInput(): "Check Input parameters , they should be a pysql file." if len(sys.argv) < 2 : print "Usage: " + sys.argv[0] + " FileNamePrefix " sys.exit() if __name__== '__main__': CheckInput() # check parameter and asign PyFileName ## result file log to directory of LOG LogFile = open("log/"+sys.argv[1],'a') res = "29" ## Interface to Gmetrix ,must be value:Value print "value:"+res ntime = str(time.strftime("%Y-%m-%d %X",time.localtime())) LogFile.write(ntime+" "+res+"\n") LogFile.close() |