TPCx-BB is a big data benchmark. It simulates 30 application scenarios of a retailer and runs 30 queries to measure the performance, hardware and software included, of a Hadoop-based big data system. Some of the scenarios also use machine learning algorithms (clustering, linear regression, etc.). To properly understand the performance of a system under test, you need a thorough understanding of the whole TPCx-BB test flow. This article walks through the source code of the TPCx-BB kit in detail; I hope it helps you understand TPCx-BB.
Code structure
The main directory ($BENCH_MARK_HOME) contains the following subdirectories:
· bin
· conf
· data-generator
· engines
· tools
bin contains the modules, i.e. the scripts invoked during a run: bigBench, cleanLogs, logEnvInformation, runBenchmark, zipLogs, etc.
conf contains two configuration files: bigBench.properties and userSettings.conf.
bigBench.properties mainly configures the workload (the benchmark phases to run) and power_test_0 (the SQL queries to run in the POWER_TEST phase).
Default workload: workload=CLEAN_ALL,ENGINE_VALIDATION_DATA_GENERATION,ENGINE_VALIDATION_LOAD_TEST,ENGINE_VALIDATION_POWER_TEST,ENGINE_VALIDATION_RESULT_VALIDATION,CLEAN_DATA,DATA_GENERATION,BENCHMARK_START,LOAD_TEST,POWER_TEST,THROUGHPUT_TEST_1,BENCHMARK_STOP,VALIDATE_POWER_TEST,VALIDATE_THROUGHPUT_TEST_1
Default power_test_0: 1-30
userSettings.conf holds the basic settings: the JAVA environment, the default settings for the benchmark (database, engine, map_tasks, scale_factor ...), the HADOOP environment, the HDFS config and paths, and the Hadoop data generation options (DFS_REPLICATION, HADOOP_JVM_ENV ...).
data-generator contains the scripts and configuration files related to data generation; the details are covered below.
engines contains the four engines supported by TPCx-BB: biginsights, hive, impala and spark_sql. The default engine is hive. In fact, only the hive directory is non-empty; the other three are empty, presumably because they are not finished yet.
tools contains two jar files: HadoopClusterExec.jar and RunBigBench.jar. RunBigBench.jar is a very important file for running the TPCx-BB test; most of the program lives in this jar.
Data generation
The programs and configuration for data generation live under the data-generator directory, which holds a pdgf.jar plus three subdirectories: config, dicts and extlib.
pdgf.jar is the Java program that generates the data. config contains two configuration files: bigbench-generation.xml and bigbench-schema.xml.
bigbench-generation.xml mainly defines which tables make up the generated raw data (not the database tables), and for each table its name, size, output directory, file suffix, delimiter, character encoding, and so on.
<schema name="default">
  <tables>
    <!-- not refreshed tables -->
    <!-- tables not used in benchmark, but some tables have references to them. not refreshed. Kept for legacy reasons -->
    <table name="income_band"></table>
    <table name="reason"></table>
    <table name="ship_mode"></table>
    <table name="web_site"></table>
    <!-- /tables not used in benchmark -->
    <!-- Static tables (fixed small size, generated only on node 1, skipped on others, not generated during refresh) -->
    <table name="date_dim" static="true"></table>
    <table name="time_dim" static="true"></table>
    <table name="customer_demographics" static="true"></table>
    <table name="household_demographics" static="true"></table>
    <!-- /static tables -->
    <!-- "normal" tables. split over all nodes. not generated during refresh -->
    <table name="store"></table>
    <table name="warehouse"></table>
    <table name="promotion"></table>
    <table name="web_page"></table>
    <!-- /"normal" tables.-->
    <!-- /not refreshed tables -->
    <!-- refreshed tables. Generated on all nodes. Refresh tables generate extra data during refresh (e.g. add new data to the existing tables)
         In "normal"-Phase generate table rows: [0,REFRESH_PERCENTAGE*Table.Size];
         In "refresh"-Phase generate table rows: [REFRESH_PERCENTAGE*Table.Size+1, Table.Size].
         Has effect only if ${REFRESH_SYSTEM_ENABLED}==1.
    -->
    <table name="customer">
      <scheduler name="DefaultScheduler">
        <partitioner name="pdgf.core.dataGenerator.scheduler.TemplatePartitioner">
          <prePartition><![CDATA[
            if(${REFRESH_SYSTEM_ENABLED}>0){
              int tableID = table.getTableID();
              int timeID = 0;
              long lastTableRow=table.getSize()-1;
              long rowStart;
              long rowStop;
              boolean exclude=false;
              long refreshRows=table.getSize()*(1.0-${REFRESH_PERCENTAGE});
              if(${REFRESH_PHASE}>0){
                //Refresh part
                rowStart = lastTableRow - refreshRows +1;
                rowStop = lastTableRow;
                if(refreshRows<=0){
                  exclude=true;
                }
              }else{
                //"normal" part
                rowStart = 0;
                rowStop = lastTableRow - refreshRows;
              }
              return new pdgf.core.dataGenerator.scheduler.Partition(tableID, timeID,rowStart,rowStop,exclude);
            }else{
              //DEFAULT
              return getParentPartitioner().getDefaultPrePartition(project, table);
            }
          ]]></prePartition>
        </partitioner>
      </scheduler>
    </table>
    <output name="SplitFileOutputWrapper">
      <!-- DEFAULT output for all Tables, if no table specific output is specified-->
      <output name="CSVRowOutput">
        <fileTemplate><![CDATA[outputDir + table.getName() +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
        <outputDir>output/</outputDir>
        <fileEnding>.dat</fileEnding>
        <delimiter>|</delimiter>
        <charset>UTF-8</charset>
        <sortByRowID>true</sortByRowID>
      </output>
      <output name="StatisticsOutput" active="1">
        <size>${item_size}</size><!-- a counter per item .. initialize later-->
        <fileTemplate><![CDATA[outputDir + table.getName()+"_audit" +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
        <outputDir>output/</outputDir>
        <fileEnding>.csv</fileEnding>
        <delimiter>,</delimiter>
        <header><!--"" + pdgf.util.Constants.DEFAULT_LINESEPARATOR--> </header>
        <footer></footer>
bigbench-schema.xml sets a large number of parameters. Some control table scale, e.g. the size (number of records) of each table; most concern table fields, e.g. start and end times, gender ratio, married ratio, upper and lower bounds of metrics, etc. It also defines exactly how each field is generated, along with its constraints. For example:
The size of the generated data is determined by SCALE_FACTOR (-f). With -f 1 the total size is roughly 1 GB; with -f 100, roughly 100 GB. So how does SCALE_FACTOR (-f) control the size of the generated data so precisely?
The reason is that SCALE_FACTOR (-f) determines the number of records in each table. As shown below, the customer table holds 100000.0d * ${SF_sqrt} records: with -f 1 it has 100000*sqrt(1) = 100,000 records; with -f 100 it has 100000*sqrt(100) = 1,000,000 records.
<property name="${customer_size}" type="long">100000.0d * ${SF_sqrt}</property>
<property name="${DIMENSION_TABLES_START_DAY}" type="datetime">2000-01-03 00:00:00</property>
<property name="${DIMENSION_TABLES_END_DAY}" type="datetime">2004-01-05 00:00:00</property>
<property name="${gender_likelihood}" type="double">0.5</property>
<property name="${married_likelihood}" type="double">0.3</property>
<property name="${WP_LINK_MIN}" type="double">2</property>
<property name="${WP_LINK_MAX}" type="double">25</property>

<field name="d_date" size="13" type="CHAR" primary="false">
  <gen_DateTime>
    <disableRng>true</disableRng>
    <useFixedStepSize>true</useFixedStepSize>
    <startDate>${date_dim_begin_date}</startDate>
    <endDate>${date_dim_end_date}</endDate>
    <outputFormat>yyyy-MM-dd</outputFormat>
  </gen_DateTime>
</field>
<field name="t_time_id" size="16" type="CHAR" primary="false">
  <gen_ConvertNumberToString>
    <gen_Id/>
    <size>16.0</size>
    <characters>ABCDEFGHIJKLMNOPQRSTUVWXYZ</characters>
  </gen_ConvertNumberToString>
</field>
<field name="cd_dep_employed_count" size="10" type="INTEGER" primary="false">
  <gen_Null probability="${NULL_CHANCE}">
    <gen_WeightedListItem filename="dicts/bigbench/ds-genProbabilities.txt" list="dependent_count" valueColumn="0" weightColumn="0" />
  </gen_Null>
</field>
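As a quick sanity check of the row-count formula above, a minimal sketch (the customer_rows helper is mine, for illustration only; the only thing taken from the kit is the 100000 * sqrt(SCALE_FACTOR) expression defined by ${customer_size}):

```shell
#!/usr/bin/env bash
# Reproduce the ${customer_size} formula: 100000.0d * ${SF_sqrt},
# i.e. 100000 * sqrt(SCALE_FACTOR).
# customer_rows is a hypothetical helper, not part of the TPCx-BB kit.
customer_rows() {
  awk -v sf="$1" 'BEGIN { printf "%d\n", 100000 * sqrt(sf) }'
}

customer_rows 1    # -f 1   -> 100000   (100,000 records)
customer_rows 100  # -f 100 -> 1000000  (1,000,000 records)
```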
dicts contains dictionary files such as city.dict, country.dict, male.dict, female.dict, state.dict and mail_provider.dict; the individual fields of each record are presumably drawn from these dictionaries.
extlib contains the external jar dependencies, such as lucene-core-4.9.0.jar, commons-net-3.3.jar, xml-apis.jar and log4j-1.2.15.jar.
Summary:
pdgf.jar reads the configuration in bigbench-generation.xml and bigbench-schema.xml (table names, field names, record counts, generation rules for each field), takes the values of each record and field from the corresponding .dict files under dicts, and generates the raw data.
A record in the customer table looks like this:
0 AAAAAAAAAAAAAAAA 1824793 3203 2555 28776 14690 Ms. Marisa Harrington N 17 4 1988 UNITED ARAB EMIRATES RRCyuY3XfE3a Marisa.Harrington@lawyer.com gdMmGdU9
If the TPCx-BB test is run with -f 1 (SCALE_FACTOR = 1), the generated raw data totals about 1 GB (977 MB + 8.6 MB):
[root@node-20-100 ~]# hdfs dfs -du -h /user/root/benchmarks/bigbench/data
12.7 M    38.0 M    /user/root/benchmarks/bigbench/data/customer
5.1 M     15.4 M    /user/root/benchmarks/bigbench/data/customer_address
74.2 M    222.5 M   /user/root/benchmarks/bigbench/data/customer_demographics
14.7 M    44.0 M    /user/root/benchmarks/bigbench/data/date_dim
151.5 K   454.4 K   /user/root/benchmarks/bigbench/data/household_demographics
327       981       /user/root/benchmarks/bigbench/data/income_band
405.3 M   1.2 G     /user/root/benchmarks/bigbench/data/inventory
6.5 M     19.5 M    /user/root/benchmarks/bigbench/data/item
4.0 M     12.0 M    /user/root/benchmarks/bigbench/data/item_marketprices
53.7 M    161.2 M   /user/root/benchmarks/bigbench/data/product_reviews
45.3 K    135.9 K   /user/root/benchmarks/bigbench/data/promotion
3.0 K     9.1 K     /user/root/benchmarks/bigbench/data/reason
1.2 K     3.6 K     /user/root/benchmarks/bigbench/data/ship_mode
3.3 K     9.9 K     /user/root/benchmarks/bigbench/data/store
4.1 M     12.4 M    /user/root/benchmarks/bigbench/data/store_returns
88.5 M    265.4 M   /user/root/benchmarks/bigbench/data/store_sales
4.9 M     14.6 M    /user/root/benchmarks/bigbench/data/time_dim
584       1.7 K     /user/root/benchmarks/bigbench/data/warehouse
170.4 M   511.3 M   /user/root/benchmarks/bigbench/data/web_clickstreams
7.9 K     23.6 K    /user/root/benchmarks/bigbench/data/web_page
5.1 M     15.4 M    /user/root/benchmarks/bigbench/data/web_returns
127.6 M   382.8 M   /user/root/benchmarks/bigbench/data/web_sales
8.6 K     25.9 K    /user/root/benchmarks/bigbench/data/web_site
Execution flow
To run the TPCx-BB test, change into the TPCx-BB source directory, enter the bin directory, and run:
./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5
Here -f, -m, -s and -j are options; set them according to your cluster's capability and your own needs. If not given, the defaults are used, specified in userSettings.conf under the conf directory:
export BIG_BENCH_DEFAULT_DATABASE="bigbench"
export BIG_BENCH_DEFAULT_ENGINE="hive"
export BIG_BENCH_DEFAULT_MAP_TASKS="80"
export BIG_BENCH_DEFAULT_SCALE_FACTOR="1000"
export BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS="2"
export BIG_BENCH_DEFAULT_BENCHMARK_PHASE="run_query"
The default MAP_TASKS is 80 (-m 80), the default SCALE_FACTOR is 1000 (-f 1000), and the default NUMBER_OF_PARALLEL_STREAMS is 2 (-s 2).
All available options and their meanings:
General options:
-d database to use (default: $BIG_BENCH_DEFAULT_DATABASE -> bigbench)
-e engine to use (default: $BIG_BENCH_DEFAULT_ENGINE -> hive)
-f scale factor of the data set (default: $BIG_BENCH_DEFAULT_SCALE_FACTOR -> 1000)
-h show help
-m number of map tasks for data generation (default: $BIG_BENCH_DEFAULT_MAP_TASKS)
-s number of parallel streams (default: $BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS -> 2)
Driver specific options:
-a run in pretend mode
-b print to stdout the bash scripts invoked during the run
-i specify the phases to run (see $BIG_BENCH_CONF_DIR/bigBench.properties for details)
-j specify the queries to run (default: all 30 queries, 1-30)
-U unlock expert mode
If -U is given, i.e. expert mode is unlocked, the script prints:
echo "EXPERT MODE ACTIVE"
echo "WARNING - INTERNAL USE ONLY:"
echo "Only set manually if you know what you are doing!"
echo "Ignoring them is probably the best solution"
echo "Running individual modules:"
echo "Usage: `basename $0` module [options]"
-D specify the query part to debug; most queries have only a single part
-p benchmark phase to run (default: $BIG_BENCH_DEFAULT_BENCHMARK_PHASE -> run_query)
-q specify a single query to run (only one)
-t specify the stream number used when running that query
-v sql script for metastore population (default: ${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"})
-w sql script for metastore refresh (default: ${USER_REFRESH_FILE:-"$BIG_BENCH_REFRESH_DIR/hiveRefreshCreateLoad.sql"})
-y file with additional user-defined query parameters (global: $BIG_BENCH_ENGINE_CONF_DIR/queryParameters.sql)
-z file with additional user-defined engine settings (global: $BIG_BENCH_ENGINE_CONF_DIR/engineSettings.sql)
List of available modules:
$BIG_BENCH_ENGINE_BIN_DIR
Back to the command used to launch the TPCx-BB test:
./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5
bigBench
bigBench is the main script; runBenchmark is a module.
bigBench sets many environment variables (paths, engine, number of streams, and so on), because RunBigBench.jar later reads these variables from within the Java program.
The earlier part of bigBench does groundwork: setting environment variables, parsing user options, setting file permissions, setting paths, etc. Its final step calls runBenchmark's runModule() function:
1. Set the basic paths
export BIG_BENCH_VERSION="1.0"
export BIG_BENCH_BIN_DIR="$BIG_BENCH_HOME/bin"
export BIG_BENCH_CONF_DIR="$BIG_BENCH_HOME/conf"
export BIG_BENCH_DATA_GENERATOR_DIR="$BIG_BENCH_HOME/data-generator"
export BIG_BENCH_TOOLS_DIR="$BIG_BENCH_HOME/tools"
export BIG_BENCH_LOGS_DIR="$BIG_BENCH_HOME/logs"
2. Set the paths of core-site.xml and hdfs-site.xml
Data generation uses the Hadoop cluster; the data is generated on HDFS.
export BIG_BENCH_DATAGEN_CORE_SITE="$BIG_BENCH_HADOOP_CONF/core-site.xml"
export BIG_BENCH_DATAGEN_HDFS_SITE="$BIG_BENCH_HADOOP_CONF/hdfs-site.xml"
3. Make every executable file in the package (.sh/.jar/.py) executable
find "$BIG_BENCH_HOME" -name '*.sh' -exec chmod 755 {} +
find "$BIG_BENCH_HOME" -name '*.jar' -exec chmod 755 {} +
find "$BIG_BENCH_HOME" -name '*.py' -exec chmod 755 {} +
4. Set the path of userSettings.conf and source it
USER_SETTINGS="$BIG_BENCH_CONF_DIR/userSettings.conf"
if [ ! -f "$USER_SETTINGS" ]
then
echo "User settings file $USER_SETTINGS not found"
exit 1
else
source "$USER_SETTINGS"
fi
5. Parse the input arguments and options, and configure accordingly
The first argument must be a module name.
If there are no arguments, or the first argument starts with "-", the user did not supply a module to run.
if [[ $# -eq 0 || "`echo "$1" | cut -c1`" = "-" ]]
then
export MODULE_NAME=""
SHOW_HELP="1"
else
export MODULE_NAME="$1"
shift
fi
export LIST_OF_USER_OPTIONS="$@"
Parse the user-supplied options and set environment variables according to them:
while getopts ":d:D:e:f:hm:p:q:s:t:Uv:w:y:z:abi:j:" OPT; do
  case "$OPT" in
    # script options
    d) #echo "-d was triggered, Parameter: $OPTARG" >&2
       USER_DATABASE="$OPTARG" ;;
    D) #echo "-D was triggered, Parameter: $OPTARG" >&2
       DEBUG_QUERY_PART="$OPTARG" ;;
    e) #echo "-e was triggered, Parameter: $OPTARG" >&2
       USER_ENGINE="$OPTARG" ;;
    f) #echo "-f was triggered, Parameter: $OPTARG" >&2
       USER_SCALE_FACTOR="$OPTARG" ;;
    h) #echo "-h was triggered, Parameter: $OPTARG" >&2
       SHOW_HELP="1" ;;
    m) #echo "-m was triggered, Parameter: $OPTARG" >&2
       USER_MAP_TASKS="$OPTARG" ;;
    p) #echo "-p was triggered, Parameter: $OPTARG" >&2
       USER_BENCHMARK_PHASE="$OPTARG" ;;
    q) #echo "-q was triggered, Parameter: $OPTARG" >&2
       QUERY_NUMBER="$OPTARG" ;;
    s) #echo "-t was triggered, Parameter: $OPTARG" >&2
       USER_NUMBER_OF_PARALLEL_STREAMS="$OPTARG" ;;
    t) #echo "-s was triggered, Parameter: $OPTARG" >&2
       USER_STREAM_NUMBER="$OPTARG" ;;
    U) #echo "-U was triggered, Parameter: $OPTARG" >&2
       USER_EXPERT_MODE="1" ;;
    v) #echo "-v was triggered, Parameter: $OPTARG" >&2
       USER_POPULATE_FILE="$OPTARG" ;;
    w) #echo "-w was triggered, Parameter: $OPTARG" >&2
       USER_REFRESH_FILE="$OPTARG" ;;
    y) #echo "-y was triggered, Parameter: $OPTARG" >&2
       USER_QUERY_PARAMS_FILE="$OPTARG" ;;
    z) #echo "-z was triggered, Parameter: $OPTARG" >&2
       USER_ENGINE_SETTINGS_FILE="$OPTARG" ;;
    # driver options
    a) #echo "-a was triggered, Parameter: $OPTARG" >&2
       export USER_PRETEND_MODE="1" ;;
    b) #echo "-b was triggered, Parameter: $OPTARG" >&2
       export USER_PRINT_STD_OUT="1" ;;
    i) #echo "-i was triggered, Parameter: $OPTARG" >&2
       export USER_DRIVER_WORKLOAD="$OPTARG" ;;
    j) #echo "-j was triggered, Parameter: $OPTARG" >&2
       export USER_DRIVER_QUERIES_TO_RUN="$OPTARG" ;;
    \?) echo "Invalid option: -$OPTARG" >&2
        exit 1 ;;
    :) echo "Option -$OPTARG requires an argument." >&2
       exit 1 ;;
  esac
done
Set the global variables: if the user specified a value for an option, that value is used; otherwise the default applies.
export BIG_BENCH_EXPERT_MODE="${USER_EXPERT_MODE:-"0"}"
export SHOW_HELP="${SHOW_HELP:-"0"}"
export BIG_BENCH_DATABASE="${USER_DATABASE:-"$BIG_BENCH_DEFAULT_DATABASE"}"
export BIG_BENCH_ENGINE="${USER_ENGINE:-"$BIG_BENCH_DEFAULT_ENGINE"}"
export BIG_BENCH_MAP_TASKS="${USER_MAP_TASKS:-"$BIG_BENCH_DEFAULT_MAP_TASKS"}"
export BIG_BENCH_SCALE_FACTOR="${USER_SCALE_FACTOR:-"$BIG_BENCH_DEFAULT_SCALE_FACTOR"}"
export BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS="${USER_NUMBER_OF_PARALLEL_STREAMS:-"$BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS"}"
export BIG_BENCH_BENCHMARK_PHASE="${USER_BENCHMARK_PHASE:-"$BIG_BENCH_DEFAULT_BENCHMARK_PHASE"}"
export BIG_BENCH_STREAM_NUMBER="${USER_STREAM_NUMBER:-"0"}"
export BIG_BENCH_ENGINE_DIR="$BIG_BENCH_HOME/engines/$BIG_BENCH_ENGINE"
export BIG_BENCH_ENGINE_CONF_DIR="$BIG_BENCH_ENGINE_DIR/conf"
6. Check that the values of -m, -f, -s and -t are numeric
if [ -n "`echo "$BIG_BENCH_MAP_TASKS" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_MAP_TASKS is not a number"
fi
if [ -n "`echo "$BIG_BENCH_SCALE_FACTOR" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_SCALE_FACTOR is not a number"
fi
if [ -n "`echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS is not a number"
fi
if [ -n "`echo "$BIG_BENCH_STREAM_NUMBER" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_STREAM_NUMBER is not a number"
fi
7. Check that the engine exists
if [ ! -d "$BIG_BENCH_ENGINE_DIR" ]
then
  echo "Engine directory $BIG_BENCH_ENGINE_DIR not found. Aborting script..."
  exit 1
fi
if [ ! -d "$BIG_BENCH_ENGINE_CONF_DIR" ]
then
  echo "Engine configuration directory $BIG_BENCH_ENGINE_CONF_DIR not found. Aborting script..."
  exit 1
fi
8. Set the engineSettings.conf path and source it
ENGINE_SETTINGS="$BIG_BENCH_ENGINE_CONF_DIR/engineSettings.conf"
if [ ! -f "$ENGINE_SETTINGS" ]
then
  echo "Engine settings file $ENGINE_SETTINGS not found"
  exit 1
else
  source "$ENGINE_SETTINGS"
fi
9. Check that the module exists
Given a module name, the script first looks for it under $BIG_BENCH_ENGINE_BIN_DIR/; if it exists there, it runs source "$MODULE". If not, it looks under $BIG_BENCH_BIN_DIR/; if the module exists there, it is sourced; otherwise the script prints "Module $MODULE not found, aborting script." and exits.
export MODULE="$BIG_BENCH_ENGINE_BIN_DIR/$MODULE_NAME"
if [ -f "$MODULE" ]
then
  source "$MODULE"
else
  export MODULE="$BIG_BENCH_BIN_DIR/$MODULE_NAME"
  if [ -f "$MODULE" ]
  then
    source "$MODULE"
  else
    echo "Module $MODULE not found, aborting script."
    exit 1
  fi
fi
10. Check that the module defines the runModule(), helpModule() and runEngineCmd() functions
MODULE_RUN_METHOD="runModule"
if ! declare -F "$MODULE_RUN_METHOD" > /dev/null 2>&1
then
  echo "$MODULE_RUN_METHOD was not implemented, aborting script"
  exit 1
fi
11. Run the module
If the module is runBenchmark, this executes
runCmdWithErrorCheck "$MODULE_RUN_METHOD"
which is equivalent to
runCmdWithErrorCheck runModule()
As shown above, the bigBench script mostly does groundwork, such as setting environment variables, setting permissions, and checking and parsing the input arguments, and finally calls runBenchmark's runModule() function to continue.
runBenchmark
Now let's look at the runBenchmark script.
runBenchmark defines two functions: helpModule() and runModule().
helpModule() just prints the help.
runModule() is the function actually invoked when the runBenchmark module runs. It does four things:
1. clean up the logs from previous runs
2. invoke RunBigBench.jar to execute the benchmark
3. logEnvInformation
4. zip up the log directory
The source is as follows:
runModule () {
  #check input parameters
  if [ "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" -le 0 ]
  then
    echo "The number of parallel streams -s must be greater than 0"
    return 1
  fi

  "${BIG_BENCH_BIN_DIR}/bigBench" cleanLogs -U $LIST_OF_USER_OPTIONS
  "$BIG_BENCH_JAVA" -jar "${BIG_BENCH_TOOLS_DIR}/RunBigBench.jar"
  "${BIG_BENCH_BIN_DIR}/bigBench" logEnvInformation -U $LIST_OF_USER_OPTIONS
  "${BIG_BENCH_BIN_DIR}/bigBench" zipLogs -U $LIST_OF_USER_OPTIONS

  return $?
}
So running the runBenchmark module in turn invokes the cleanLogs, logEnvInformation and zipLogs modules as well as RunBigBench.jar. RunBigBench.jar, written in Java, is the core of the TPCx-BB test execution. Next we analyze the source of RunBigBench.jar.
runModule()
The runModule() function executes a given module. As we have seen, running a module means switching to the bin directory under the main directory and executing:
./bigBench module_name arguments
In runModule(), cmdLine is used to build this command.
ArrayList cmdLine = new ArrayList();
cmdLine.add("bash");
cmdLine.add(this.runScript);
cmdLine.add(benchmarkPhase.getRunModule());
cmdLine.addAll(arguments);
where this.runScript is:
this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";
benchmarkPhase.getRunModule() returns the module to execute.
arguments holds the user-supplied arguments.
So cmdLine ends up as:
bash $BIG_BENCH_BIN_DIR/bigBench module_name arguments
How is this bash command actually executed? By calling the runCmd() method:
boolean successful = this.runCmd(this.homeDir, benchmarkPhase.isPrintStdOut(), (String[])cmdLine.toArray(new String[0]));
Next, the runCmd() method.
runCmd()
runCmd() uses a ProcessBuilder to create an operating system process, and that process executes the bash command above.
ProcessBuilder can also set the working directory and the environment.
ProcessBuilder pb = new ProcessBuilder(command);
pb.directory(new File(workingDirectory));
Process p = null;
---
p = pb.start();

getQueryList()
getQueryList() returns the list of queries to run. It is read from $BIG_BENCH_LOGS_DIR/bigBench.properties, whose content matches $BIG_BENCH_HOME/conf/bigBench.properties.
In bigBench.properties, power_test_0=1-30 specifies which queries the power_test_0 phase runs, and in what order.
Queries can be given as ranges like 5-12 or as single numbers like 21, separated by commas.
power_test_0=28-25,2-5,10,22,30
means the power_test_0 phase runs the queries in this order: 28, 27, 26, 25, 2, 3, 4, 5, 10, 22, 30.
To run all 30 queries in order:
power_test_0=1-30
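The range syntax above can be illustrated with a small sketch (expand_queries is my own illustrative function, not code from the kit; it assumes GNU seq, whose negative increment handles descending ranges such as 28-25):

```shell
#!/usr/bin/env bash
# Expand a power_test_N query spec ("X" or "X-Y", comma separated)
# into an ordered query list; descending ranges like 28-25 count down.
# expand_queries is a hypothetical helper for illustration only.
expand_queries() {
  local spec="$1" out="" part start end
  for part in $(echo "$spec" | tr ',' ' '); do
    case "$part" in
      *-*)
        start="${part%%-*}"
        end="${part##*-}"
        if [ "$start" -le "$end" ]; then
          out="$out $(seq "$start" "$end" | tr '\n' ' ')"
        else
          out="$out $(seq "$start" -1 "$end" | tr '\n' ' ')"
        fi
        ;;
      *) out="$out $part" ;;
    esac
  done
  echo $out   # unquoted on purpose: collapses whitespace to single spaces
}

expand_queries "28-25,2-5,10,22,30"  # -> 28 27 26 25 2 3 4 5 10 22 30
```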
The source that builds the query list:
private List<Integer> getQueryList(BigBench.BenchmarkPhase benchmarkPhase, int streamNumber) {
    String SHUFFLED_NAME_PATTERN = "shuffledQueryList";
    BigBench.BenchmarkPhase queryOrderBasicPhase = BigBench.BenchmarkPhase.POWER_TEST;
    String propertyKey = benchmarkPhase.getQueryListProperty(streamNumber);
    boolean queryOrderCached = benchmarkPhase.isQueryOrderCached();
    if(queryOrderCached && this.queryListCache.containsKey(propertyKey)) {
        return new ArrayList((Collection)this.queryListCache.get(propertyKey));
    } else {
        Object queryList;
        String basicPhaseNamePattern;
        if(!this.properties.containsKey(propertyKey)) {
            if(benchmarkPhase.isQueryOrderRandom()) {
                if(!this.queryListCache.containsKey("shuffledQueryList")) {
                    basicPhaseNamePattern = queryOrderBasicPhase.getQueryListProperty(0);
                    if(!this.properties.containsKey(basicPhaseNamePattern)) {
                        throw new IllegalArgumentException("Property " + basicPhaseNamePattern + " is not defined, but is the basis for shuffling the query list.");
                    }
                    this.queryListCache.put("shuffledQueryList", this.getQueryList(queryOrderBasicPhase, 0));
                }
                queryList = (List)this.queryListCache.get("shuffledQueryList");
                this.shuffleList((List)queryList, this.rnd);
            } else {
                queryList = this.getQueryList(queryOrderBasicPhase, 0);
            }
        } else {
            queryList = new ArrayList();
            String[] var11;
            int var10 = (var11 = this.properties.getProperty(propertyKey).split(",")).length;
            label65:
            for(int var9 = 0; var9 < var10; ++var9) {
                basicPhaseNamePattern = var11[var9];
                String[] queryRange = basicPhaseNamePattern.trim().split("-");
                switch(queryRange.length) {
                case 1:
                    ((List)queryList).add(Integer.valueOf(Integer.parseInt(queryRange[0].trim())));
                    break;
                case 2:
                    int startQuery = Integer.parseInt(queryRange[0]);
                    int endQuery = Integer.parseInt(queryRange[1]);
                    int i;
                    if(startQuery > endQuery) {
                        i = startQuery;
                        while(true) {
                            if(i < endQuery) {
                                continue label65;
                            }
                            ((List)queryList).add(Integer.valueOf(i));
                            --i;
                        }
                    } else {
                        i = startQuery;
                        while(true) {
                            if(i > endQuery) {
                                continue label65;
                            }
                            ((List)queryList).add(Integer.valueOf(i));
                            ++i;
                        }
                    }
                default:
                    throw new IllegalArgumentException("Query numbers must be in the form X or X-Y, comma separated.");
                }
            }
        }
        if(queryOrderCached) {
            this.queryListCache.put(propertyKey, new ArrayList((Collection)queryList));
        }
        return new ArrayList((Collection)queryList);
    }
}
parseEnvironment()
parseEnvironment() reads and parses the system environment variables.
Map env = System.getenv();
this.version = (String)env.get("BIG_BENCH_VERSION");
this.homeDir = (String)env.get("BIG_BENCH_HOME");
this.confDir = (String)env.get("BIG_BENCH_CONF_DIR");
this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";
this.datagenDir = (String)env.get("BIG_BENCH_DATA_GENERATOR_DIR");
this.logDir = (String)env.get("BIG_BENCH_LOGS_DIR");
this.dataGenLogFile = (String)env.get("BIG_BENCH_DATAGEN_STAGE_LOG");
this.loadLogFile = (String)env.get("BIG_BENCH_LOADING_STAGE_LOG");
this.engine = (String)env.get("BIG_BENCH_ENGINE");
this.database = (String)env.get("BIG_BENCH_DATABASE");
this.mapTasks = (String)env.get("BIG_BENCH_MAP_TASKS");
this.numberOfParallelStreams = Integer.parseInt((String)env.get("BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS"));
this.scaleFactor = Long.parseLong((String)env.get("BIG_BENCH_SCALE_FACTOR"));
this.stopAfterFailure = ((String)env.get("BIG_BENCH_STOP_AFTER_FAILURE")).equals("1");
It also automatically appends -U (unlock expert mode) to the user-specified arguments:
this.userArguments.add("-U");
If the user specified PRETEND_MODE, PRINT_STD_OUT, WORKLOAD or QUERIES_TO_RUN, those values take precedence; otherwise the defaults are used.
if(env.containsKey("USER_PRETEND_MODE")) {
    this.properties.setProperty("pretend_mode", (String)env.get("USER_PRETEND_MODE"));
}
if(env.containsKey("USER_PRINT_STD_OUT")) {
    this.properties.setProperty("show_command_stdout", (String)env.get("USER_PRINT_STD_OUT"));
}
if(env.containsKey("USER_DRIVER_WORKLOAD")) {
    this.properties.setProperty("workload", (String)env.get("USER_DRIVER_WORKLOAD"));
}
if(env.containsKey("USER_DRIVER_QUERIES_TO_RUN")) {
    this.properties.setProperty(BigBench.BenchmarkPhase.POWER_TEST.getQueryListProperty(0), (String)env.get("USER_DRIVER_QUERIES_TO_RUN"));
}
It then reads the workload and assigns it to benchmarkPhases. If the workload does not contain BENCHMARK_START and BENCHMARK_STOP, they are automatically added as the first and last entries of benchmarkPhases.
this.benchmarkPhases = new ArrayList();
Iterator var7 = Arrays.asList(this.properties.getProperty("workload").split(",")).iterator();
while(var7.hasNext()) {
    String benchmarkPhase = (String)var7.next();
    this.benchmarkPhases.add(BigBench.BenchmarkPhase.valueOf(benchmarkPhase.trim()));
}
if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_START)) {
    this.benchmarkPhases.add(0, BigBench.BenchmarkPhase.BENCHMARK_START);
}
if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_STOP)) {
    this.benchmarkPhases.add(BigBench.BenchmarkPhase.BENCHMARK_STOP);
}