Tutorial
The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
https://www.cs.virginia.edu/stream/ref.html

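1. Introduction

STREAM is the industry-recognized benchmark for measuring memory bandwidth. It is provided by the University of Virginia and measures the memory bandwidth of high-performance computers by generating memory reads and writes in four different patterns. Because modern computers all use caches, the data set used in the test should be much larger than the cache so that the results reflect actual memory read/write performance. For more information, see: http://www.cs.virginia.edu/stream/ref.html

STREAM is a synthetic benchmark program written in C and Fortran 77. It tests the following four operations: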

``Copy'' measures transfer rates in the absence of arithmetic.
``Scale'' adds a simple arithmetic operation.
``Add'' adds a third operand to allow multiple load/store ports on vector machines to be tested.
``Triad'' allows chained/overlapped/fused multiply/add operations.

------------------------------------------------------------
per iteration:
name      kernel                     bytes    FLOPS
------------------------------------------------------------
COPY:     a(i) = b(i)                16       0
SCALE:    a(i) = q*b(i)              16       1
ADD:      a(i) = b(i) + c(i)         24       1
TRIAD:    a(i) = b(i) + q*c(i)       24       2
------------------------------------------------------------

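In terms of the test, the four kernels mean:

COPY: read a value from one memory location and write it to another memory location.

SCALE: read a value from memory, multiply it by a factor, and write the result to another memory location.

ADD: read two values from memory, add them, and write the result to another memory location.

TRIAD: read two values b and c from memory, combine them with a multiply-add (b + q * c, where q is a factor), and write the result to another memory location.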
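To make the bytes-per-iteration column concrete, here is a minimal sketch of how an MB/s figure follows from it. It assumes 8-byte (double) elements and 1 MB = 10^6 bytes. The function name stream_mbps is hypothetical and is not part of STREAM.

```shell
# stream_mbps BYTES_PER_ITER N SECONDS
# Hypothetical helper: bandwidth in MB/s (1 MB = 10^6 bytes) for a kernel
# that moves BYTES_PER_ITER bytes per loop iteration over N elements
# in SECONDS seconds.
stream_mbps() {
    awk -v b="$1" -v n="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", b * n / t / 1e6 }'
}

# Example: Copy (16 bytes per iteration) over 10,000,000 elements in 0.5 s
# stream_mbps 16 10000000 0.5
```

The same call with 24 as the first argument applies to Add and Triad, which touch three arrays instead of two.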

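Main parameters:

-DNTIMES=10: number of times each kernel is run; the best result across the runs is reported.
stream.c: the source file to compile
stream: name of the output executable

Other parameters:

-mtune=native -march=native: optimize for the CPU's instruction set. Because the build machine is also the machine that runs the test, native is used here. For more on CPU-specific optimizations, refer to the compiler's documentation.
-mcmodel=medium: required when a single memory array is larger than 2 GB
-DOFFSET=4096: offset applied to the arrays; usually does not need to be defined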

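2. Download, compile, and install

source code: http://www.cs.virginia.edu/stream/FTP/Code/

Multithreaded build:

gcc -O -fopenmp -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=20 stream.c -o stream.o

Parameter notes
-DSTREAM_ARRAY_SIZE: number of elements in each test array; the default is 10000000. As a rule of thumb, each array should be at least 4 times the size of the cache.
-DNTIMES: number of times each kernel is run; the default is 10
-DOFFSET: adjusts the memory alignment of the arrays; the default is 0 and it usually does not need to be changed
-fopenmp: enables OpenMP multithreading with gcc; the equivalent option is -openmp for icc, -mp for pgcc, and -openmp for Open64's opencc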
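As a rough sketch of how these parameters fit together, the helper below only prints a gcc command line from an array size and a repetition count. The name stream_build_cmd is hypothetical, and the flags are limited to the ones discussed in this tutorial.

```shell
# stream_build_cmd ARRAY_SIZE [NTIMES]
# Hypothetical helper: prints (does not run) a gcc command line for an
# OpenMP build of stream.c.
stream_build_cmd() {
    size=$1
    ntimes=${2:-10}    # 10 is the NTIMES default described above
    echo "gcc -O -fopenmp -DSTREAM_ARRAY_SIZE=${size} -DNTIMES=${ntimes} stream.c -o stream.o"
}

# To actually build, evaluate the printed command, for example:
# eval "$(stream_build_cmd 50000000 20)"
```

Printing the command first makes it easy to record the exact build line alongside the results.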

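3. Testing

Key points

In the output, the results to focus on are array copy (Copy), array scaling (Scale), vector sum (Add), and compound vector sum (Triad).

By default, STREAM starts as many threads as the system has CPUs, so fix the thread count explicitly when running comparison tests.

export OMP_NUM_THREADS=8 makes the compiled program run the test with 8 threads.

STREAM comes in two versions, 5.9 and 5.10; the version is stated in the first lines of stream.c.
Version 5.10 uses -DSTREAM_ARRAY_SIZE=<4*cache size>; version 5.9 uses -DN=<4*cache size>.

gcc -O -fopenmp -DSTREAM_ARRAY_SIZE=100000000 -mcmodel=large -DNTIMES=20 ./stream.c -o ./stream.o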
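For comparison runs it can help to sweep several thread counts in one go. The sketch below is one possible way to do that; the function names thread_counts and run_sweep are hypothetical, and the choice of power-of-two thread counts is an assumption rather than something STREAM requires.

```shell
# thread_counts MAX
# Hypothetical helper: prints 1, 2, 4, ... (powers of two) up to MAX,
# one value per line.
thread_counts() {
    t=1
    while [ "$t" -le "$1" ]; do
        echo "$t"
        t=$((t * 2))
    done
}

# run_sweep BINARY MAX
# Runs BINARY once per thread count, setting OMP_NUM_THREADS for that run only.
run_sweep() {
    for t in $(thread_counts "$2"); do
        echo "== OMP_NUM_THREADS=$t =="
        OMP_NUM_THREADS=$t "$1"
    done
}

# Example: run_sweep ./stream.o 8
```

Setting the variable on the command line keeps each run's thread count from leaking into the rest of the shell session.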
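One way to avoid passing the wrong macro is to check which name the source file actually uses. The helper below is a heuristic sketch: it assumes that a stream.c mentioning STREAM_ARRAY_SIZE is version 5.10 and that anything else is 5.9. The function name stream_size_macro is hypothetical.

```shell
# stream_size_macro FILE
# Hypothetical heuristic: prints the macro that sets the array length for
# the given stream.c. A file that mentions STREAM_ARRAY_SIZE is treated as
# version 5.10; otherwise the file is assumed to be 5.9, which uses N.
stream_size_macro() {
    if grep -q 'STREAM_ARRAY_SIZE' "$1"; then
        echo "STREAM_ARRAY_SIZE"
    else
        echo "N"
    fi
}

# Example:
# gcc -O -fopenmp -D"$(stream_size_macro stream.c)"=50000000 stream.c -o stream.o
```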

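Steps

Prepare the test machine

1. If the test target is a physical machine, look up and record the number of CPUs, the number of sockets, the CPU frequency, the memory size, and the cache size.
2. If the test target is a virtual machine, record the host's number of CPUs, CPU frequency, and memory size, and configure the virtual machine with 2 cores, 2 threads, and 8 GB of memory.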
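The machine facts listed above can be captured with a small script so that they are stored next to the results. This is a sketch that assumes a Linux system; lscpu and free are common utilities there, and each one is skipped if it is not installed. The function name collect_sysinfo is hypothetical.

```shell
# collect_sysinfo
# Hypothetical helper: prints basic machine facts to record with each run.
collect_sysinfo() {
    echo "kernel: $(uname -sr)"
    if command -v lscpu >/dev/null 2>&1; then
        # Socket count, CPU count, clock speed, and cache sizes
        lscpu | grep -E 'Socket|^CPU\(s\)|MHz|cache' || true
    fi
    if command -v free >/dev/null 2>&1; then
        # Total and available memory
        free -h || true
    fi
    return 0
}

# Example: collect_sysinfo > sysinfo.txt
```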

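Test data notes

For a physical machine, each array must be at least 4 times the size of the machine's cache. The array size is calculated as: L3 cache (MB) * 4 * (number of CPU sockets) * 10^6 / 8. For example, with 50 MB of L3 cache and 2 sockets, the array size must be at least 50 * 4 * 2 * 10^6 / 8 = 50000000.
For a virtual machine, the default array size of 10000000 is sufficient; for a multithreaded run, use 50 * 4 * 2 * 10^6 / 8, or about 50000000.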
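The array size formula is simple enough to script. The sketch below assumes the L3 size is given per socket in whole MB and that elements are 8 bytes; the function name stream_array_size is hypothetical.

```shell
# stream_array_size L3_MB SOCKETS
# Minimum STREAM_ARRAY_SIZE: four times the total L3 cache, counted in
# 8-byte elements, with 1 MB taken as 10^6 bytes.
stream_array_size() {
    echo $(( $1 * 4 * $2 * 1000000 / 8 ))
}

# Example from the text: 50 MB of L3 cache, 2 sockets
# stream_array_size 50 2    # prints 50000000
```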

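Test procedure

Physical machine:
1. Go to the directory that contains STREAM and compile:
gcc -O -fopenmp -DSTREAM_ARRAY_SIZE=<4*cache size> -mcmodel=large -DNTIMES=100 ./stream.c -o ./stream.o
2. Run: ./stream.o
Virtual machine:
1. Go to the directory that contains STREAM and compile:
gcc -O -fopenmp -DNTIMES=100 ./stream.c -o ./stream.o
2. Run: ./stream.o
Record all configuration details and test results.

Test results

Record the Copy, Scale, Add, and Triad results from the output.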
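Recording the four results can be automated with a small filter. The sketch below assumes that each result line starts with the kernel name followed by a colon (for example "Copy:") and that the second column is the best rate in MB/s, as in STREAM 5.10 output; adjust the pattern if the output of another version differs. The function name stream_rates is hypothetical.

```shell
# stream_rates
# Hypothetical helper: reads STREAM output on stdin and prints
# "<kernel> <best rate in MB/s>" for each of the four result lines.
stream_rates() {
    awk '/^(Copy|Scale|Add|Triad):/ { sub(":", "", $1); print $1, $2 }'
}

# Example: ./stream.o | tee result.txt | stream_rates
```

Using tee keeps the full output on disk while the filter extracts only the four rates.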

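4. Investigation

In the STREAM test, the value of STREAM_ARRAY_SIZE has a large effect on the results, so it needs to be chosen from a reasonable range.

Formula

According to explanations found online and the comments in the source code: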

STREAM requires different amounts of memory to run on different systems, depending on both the system cache size(s) and the granularity of the system timer.
You should adjust the value of 'STREAM_ARRAY_SIZE' (below) to meet both of the following criteria:
(a) Each array must be at least 4 times the size of the available cache memory. I don't worry about the difference between 10^6 and 2^20, so in practice the minimum array size is about 3.8 times the cache size.
Example 1: One Xeon E3 with 8 MB L3 cache
STREAM_ARRAY_SIZE should be >= 4 million, giving
an array size of 30.5 MB and a total memory requirement
of 91.5 MB.
Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP)
STREAM_ARRAY_SIZE should be >= 20 million, giving
an array size of 153 MB and a total memory requirement
of 458 MB.
(b) The size should be large enough so that the 'timing calibration'
output by the program is at least 20 clock-ticks.
Example: most versions of Windows have a 10 millisecond timer
granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds.
If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec.
This means the each array must be at least 1 GB, or 128M elements.
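Criterion (b) can be worked through numerically. The sketch below follows the arithmetic of the quoted example: traffic equals bandwidth times 20 ticks, split across the two arrays a Copy touches, in 8-byte elements. It takes 1 GB as 10^9 bytes, which is an assumption; the function name min_elements_for_timer is hypothetical.

```shell
# min_elements_for_timer BANDWIDTH_GBPS TICK_MS
# Hypothetical helper: elements per array so that a two-array kernel (Copy)
# runs for at least 20 timer ticks.
min_elements_for_timer() {
    awk -v bw="$1" -v tick="$2" 'BEGIN {
        bytes = bw * 1e9 * 20 * tick / 1000   # traffic during 20 ticks
        printf "%d\n", bytes / 2 / 8          # per array, 8-byte elements
    }'
}

# Example from the comment: 10 GB/s with a 10 ms timer
# min_elements_for_timer 10 10    # prints 125000000
```

The result of 125 million elements is in line with the comment's figure of 1 GB per array, or 128M elements, which rounds to binary units.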

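The formula is: STREAM_ARRAY_SIZE = L3 cache (MB) x 4 (times) x 1000000 x sockets / 8

Applying this formula to the data in the two examples:

Example 1: with an 8 MB L3 cache and 1 socket, STREAM_ARRAY_SIZE = 8 x 4 x 1000000 x 1 / 8 = 4000000

Example 2: with a 20 MB L3 cache and 2 sockets, STREAM_ARRAY_SIZE = 20 x 4 x 1000000 x 2 / 8 = 20000000

These results match the values in the examples.

At this point, it seems the formula can be applied directly to start testing with STREAM.

Questions:

In the spirit of splitting hairs, two small questions remain:

Once STREAM_ARRAY_SIZE is above 4 times the L3 cache, does increasing it further affect the results much? Must the value be exactly 4 times the L3 cache?
Each run consumes a fixed amount of memory. What happens if the arrays grow so large that memory runs out?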
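The memory figures in the quoted examples can be checked the same way. The sketch below assumes three arrays of 8-byte elements and reports whole MiB, rounded down; the function name stream_footprint is hypothetical.

```shell
# stream_footprint N
# Hypothetical helper: prints the size of one array and of all three arrays
# for STREAM_ARRAY_SIZE=N, first in bytes and then in whole MiB (rounded down).
stream_footprint() {
    per=$(( $1 * 8 ))
    total=$(( per * 3 ))
    echo "per_array_bytes=$per total_bytes=$total"
    echo "per_array_mib=$(( per / 1048576 )) total_mib=$(( total / 1048576 ))"
}

# Example 1 from the source comment: stream_footprint 4000000
```

For 4 million elements this gives 30 MiB per array and 91 MiB in total, consistent with the 30.5 MB and 91.5 MB quoted in Example 1.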

Verification environment

Test platform                 L3 cache   Sockets   ARRAY_SIZE at 4.1x L3 cache   Available physical memory   0.6x available memory
Physical machine              114 MB     2         120000000                     1024 G                      614.4 G
VM, 4C8G, cores pinned 1:1    64 MB      4         131199999                     8 G                         4.8 G
VM, 4C8G, shared cores
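To pick the array size that corresponds to a given share of memory, the relationship can be inverted. The sketch below assumes that the "G" in the table means GiB (2^30 bytes), that the three arrays dominate memory use, and that the shell has 64-bit integer arithmetic; the function name max_array_size is hypothetical.

```shell
# max_array_size MEM_GIB PERCENT
# Hypothetical helper: largest STREAM_ARRAY_SIZE whose three 8-byte arrays
# (24 bytes per element in total) fit in PERCENT of MEM_GIB GiB of memory,
# rounded down.
max_array_size() {
    echo $(( $1 * 1073741824 * $2 / 100 / 24 ))
}

# Example: 60% of an 8 GiB virtual machine
# max_array_size 8 60    # prints 214748364
```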

Test command

gcc -mtune=native -march=native -O3 -fno-pic -ffp-contract=fast -mcmodel=large -fopenmp -DSTREAM_ARRAY_SIZE=${array_size} -DNTIMES=80 stream.c -o stream.o

Adding -mcmodel=large is what allows compilation with a very large array size to succeed without errors.

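Conclusions

When STREAM_ARRAY_SIZE is larger than 4.1 times the L3 cache, the STREAM results differ little: most vary within 3%, and the larger deviations stay within 5%. So as long as the arrays are larger than 4 times the L3 cache, the STREAM results can be trusted, and increasing STREAM_ARRAY_SIZE further only lengthens the run time.
When the memory STREAM needs exceeds 0.6 times the available memory, the effect on the results is also small. But past a certain size, the run fails outright with a segmentation fault.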
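A deviation figure like the 3% and 5% above can be computed from a set of measured rates. The sketch below uses (max - min) / min as the measure, which is an assumption: the text does not say which baseline the percentages refer to. The function name rate_spread_pct is hypothetical.

```shell
# rate_spread_pct RATE...
# Hypothetical helper: prints (max - min) / min as a percentage across the
# given MB/s values.
rate_spread_pct() {
    printf '%s\n' "$@" | awk '
        NR == 1  { min = max = $1 }
        $1 < min { min = $1 }
        $1 > max { max = $1 }
        END      { printf "%.1f\n", (max - min) * 100 / min }'
}

# Example: rate_spread_pct 100 103 101    # prints 3.0
```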
