Intel SIMD and AVX-512

Single Instruction, Multiple Data - computers with multiple processing elements perform the same operation on multiple data points simultaneously, which exploit data level parallelism, but not concurrency - simultaneous computations, only a single process

SIMD

Intel and AMD comparation

Instruction	Num	Date	ICPU	IDate	ACPU	ADate
MMX	57	1996-10-12	Pentium MMX(P55C)	1996-10-12	K6	1997-4-1
SSE	70	1999-5-1	Pentium III(Katmai)	1999-5-1	Athlon XP	2001-10-9
SSE2	144	2000-11-1	Pentium 4(Willamette)	2000-11-1	Opteron	2003-4-22
SSE3	13	2004-2-1	Pentium 4(Prescott)	2004-2-1	Athlon 64	2005-4-1
SSSE3	16	2006-1-1	Core	2006-1-1	Fusion(Bobcat)	2011-1-5
SSE4.1	47	2006-9-27	Penryn	2007-11-1	Bulldozer	2011-9-7
SSE4.2	7	2008-11-17	Nehalem	2008-11-17	Bulldozer	2011-9-7
SSE4a	4	2007-11-11			K10	2007-11-11
SSE5		2007-8-30
AVX		2008-3-1	Sandy Bridge	2011-1-9	Bulldozer	2011-9-7
AVX2		2011-6-13	Haswell	2013-4-1
AES	7	2008-3-1	Westmere	2010-1-7	Bulldozer	2011-9-7
3DNowPrefetch	2	2010-8-1			K6-2	1998-5-28
3DNow!	21	1998-1-1			K6-2	1998-5-28
3DNow!+		1999-6-23			Athlon	1999-6-23
MmxExt					Athlon	1999-6-23
3DNow! Pro					Athlon XP	2001-10-9
POPCNT	1	2007-11-11			K10	2007-11-11
ABM	1	2007-11-11			K10	2007-11-11
CLMUL	5	2008-5-1	Westmere	2010-1-7	Bulldozer	2011-9-7
F16C		2009-5-1	Ivy Bridge	2012-4-1	Bulldozer	2011-9-7
FAM4		2009-5-1			Bulldozer	2011-9-7
XOP		2009-5-1			Bulldozer	2011-9-7

AVX-512

Content

Intel® AVX-512 Foundation (F)

512-bit vector width.
32 512-bit long vector registers.
Data expand and data compress instructions.
Ternary logic instruction.
8 new 64-bit long mask registers.
Two source cross-lane permute instructions.
Scatter instructions. — Embedded broadcast/rounding.
Transcendental support.

Intel® AVX-512 Conflict Detection Instructions (CD)

Intel® AVX-512 Exponential and Reciprocal Instructions (ER)

Intel® AVX-512 Prefetch Instructions (PF)

Intel® AVX-512 Byte and Word Instructions (BW)

Intel® AVX-512 Double Word and Quad Word Instructions (DQ) — New QWORD and Compute and Convert Instructions.

Intel® AVX-512 Vector Length Extensions (VL)

When any 512-bit instruction is used, there is a moderate reduction in speed, and if a core uses the heaviest of these instructions in a sustained way, the core may run much slower. Furthermore, the slowdown is usually worse when more cores use these new instructions. In the worst case, you might be running at half the advertised frequency and thus your whole application could run slower.

Downclocking

Only happens when using 512-bits register - means we can benefit from AVX-512 instructions and features without downclocking.
Never get downclocking when working on 128-bit registers
Downclocking, when it happens, is per core and for a short time after you have used particular instructions
Heavy instructions
- floating point operations, integer multiplication, leading-zero bits AVX-512 instructions …
- used in deep learning, numerical analysis, high performance computing, cryptograph
Light instructions
- integer operations, logical operations, data shuffling
- text processing, fast compression routines, vectorized implementation of library routines

Intel cores run in three model from license 0 to 2 - fastest to slowest

L2 will happen when sustained use of heavy 512-bit instructions - one such instruction every cycle
L1 will happen when sustained use of heavy 256-bit instructions
using taskset and numactl to control which cores run the heavy instructions

Intel Xeon Gold 5120 processor…

mode	1 active core	9 active cores
Normal	3.2 GHz	2.7 GHz
AVX2	3.1 GHz	2.3 GHz
AVX-512	2.9 GHz	1.6 GHz

Notes
- Downclocking will happen in any case even if you are not using AVX-512 instructions
- Downclocking is modest when using light AVX-512 even if it across all cores
- Downclocking becomes significant when using heavy numerical work

Thoughts

We engineers should monitor the frequency of cores and identify massive downclocking using perf.
Massive downclocking is maybe not an important risk on machines with few cores.
Consider isolating heavy numerical instructions work to specific threads/processes/cores, which limit the downclocking to cores and take full advantages of AVX-512. We should ensure the benefits of AVX-512 are substantial compared with non-AVX-512.
Check that light AVX-512 gives you a greater gain compared with frequency downclocking. Even the cost is lower.
Library, especially performance sensitive libraries, should determine whether AVX-512 is worth it and document the approach with the likely speedups from the wider instructions.
From the compiler’s perspective, it may decide to use the AVX-512 instructions while copying structure, inline memcpy and vectorizing loops, the decision is difficult.
We may need a tool that monitor the workload and identify when and why downclocking occurs. Meanwhile, OS or application frameworks could assign threads to specific cores.

Le's Zone

A basic learning for Intel AVX-512