Welcome to JiKe DevOps Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

performance - Cache bandwidth per tick for modern CPUs

What is the speed of cache access on modern CPUs? How many bytes can be read from or written to memory per processor clock tick on Intel P4, Core 2, Core i7, and AMD processors?

Please answer with both theoretical numbers (width of the load/store unit and its throughput in uops/tick) and practical numbers (even memcpy speed tests or the STREAM benchmark), if any.

PS: This question is about the maximum rate of load/store instructions in assembly. There is a theoretical load rate (every instruction issued per tick is the widest possible load), but the processor may sustain only part of it, and that part is the practical load limit.

He who fights dragons for too long becomes a dragon himself; gaze too long into the abyss, and the abyss will gaze back…

1 Answer

For Nehalem: rolfed.com/nehalem/nehalemPaper.pdf

Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache. 

128 bits = 16 bytes/clock for reads AND 128 bits = 16 bytes/clock for writes (can a read and a write be combined in a single cycle?)
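
The arithmetic above can be sketched in C. This is only an illustration of the unit conversion, not a measurement: bytes_per_cycle and peak_gb_per_s are hypothetical helper names, and the clock frequency is a parameter you have to supply, since the paper quote gives port widths only.

```c
#include <assert.h>

/* Bytes moved per clock through a port of the given width in bits. */
unsigned bytes_per_cycle(unsigned port_bits)
{
    assert(port_bits % 8 == 0);   /* port widths are whole bytes */
    return port_bits / 8;
}

/* Theoretical peak for one port in GB/s (1 GB = 1e9 bytes).
   clock_ghz is an assumed core clock, not a value from the paper. */
double peak_gb_per_s(unsigned port_bits, double clock_ghz)
{
    return bytes_per_cycle(port_bits) * clock_ghz;
}
```

For example, with an assumed 3.0 GHz clock, peak_gb_per_s(128, 3.0) evaluates to 48 GB/s for the read port and the same again for the write port. Whether both can be sustained in the same cycle is the open question above.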

The L2 and L3 caches each have a 256-bit port for reading or writing, 
but the L3 cache must share its port with three other cores on the chip.

Can the L2 and L3 ports be used for a read and a write in the same clock?

Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.

Latency (clock ticks), some measured by CPU-Z's latency tool or by lmbench's lat_mem_rd; both use a long linked-list walk to correctly measure modern out-of-order cores like the Intel Core i7:

(L1, L2 and L3 latencies are in cycles)
CPU          L1     L2     L3      Memory              Link
Core 2       3      15     --      66 ns               http://www.anandtech.com/show/2542/5
Core i7-xxx  4      11     39      40c+67ns            http://www.anandtech.com/show/2542/5
Itanium      1      5-6    12-17   130-1000 cycles
Itanium2     2      6-10   20      35c+160ns           http://www.7-cpu.com/cpu/Itanium2.html
AMD K8              12             40-70c+64ns         http://www.anandtech.com/show/2139/3
Intel P4     2      19     43      200-210 cycles      http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k  3      20             180 cycles          same as above
AthlonFX-51  3      13             125 cycles          same as above
POWER4       4      12-20  ??      hundreds of cycles  same as above
Haswell      4      11-12  36      36c+57ns            http://www.realworldtech.com/haswell-cpu/5/
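
Memory entries such as 40c+67ns appear to combine a part that scales with the core clock (in cycles) with a fixed part in nanoseconds. Under that reading, which is an assumption, the two hypothetical helpers below convert such an entry to a single figure for a given clock. The clock frequency is not in the table and has to be supplied.

```c
#include <assert.h>

/* Total latency in nanoseconds for an entry "<cycles>c+<ns>ns".
   clock_ghz is cycles per nanosecond, so cycles / clock_ghz is in ns. */
double latency_ns(double cycles, double ns, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return cycles / clock_ghz + ns;
}

/* The same entry expressed in core cycles at the given clock. */
double latency_cycles(double cycles, double ns, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return cycles + ns * clock_ghz;
}
```

For the Core i7 row at an assumed 3.2 GHz, latency_ns(40, 67, 3.2) is 40/3.2 + 67 = 79.5 ns, and latency_cycles(40, 67, 3.2) is 40 + 67*3.2 = 254.4 cycles.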

Another good source of latency data is the 7-cpu website, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html

More about the lat_mem_rd program can be found in its man page or here on SO.
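
The linked-list walk mentioned above can be sketched roughly as follows. This is not lat_mem_rd's code: the function names are hypothetical, the list is laid out in random order (built with Sattolo's algorithm, a choice made here to keep hardware prefetchers from guessing the next address), and timing uses POSIX clock_gettime, so a POSIX system is assumed.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Fill next[] so that following next[] from index 0 visits every slot
   exactly once before returning to 0 (one cycle of length n).
   Sattolo's algorithm: swap slot i with a random slot strictly below i. */
void build_cycle(size_t *next, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i-1] */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Perform `steps` dependent loads. Each address comes from the previous
   load, so an out-of-order core cannot overlap them. */
size_t chase(const size_t *next, size_t steps)
{
    size_t p = 0;
    for (size_t s = 0; s < steps; s++)
        p = next[p];
    return p;
}

/* Average nanoseconds per dependent load over a buffer of n slots.
   Returns -1.0 if the buffer cannot be allocated. */
double ns_per_load(size_t n, size_t steps)
{
    size_t *next = malloc(n * sizeof *next);
    if (next == NULL)
        return -1.0;
    build_cycle(next, n);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile size_t sink = chase(next, steps);   /* keep the walk live */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    free(next);
    double elapsed = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return elapsed / (double)steps;
}
```

Calling ns_per_load with buffer sizes that fit in L1, L2, L3 and then exceed the last-level cache should show the latency stepping up at each boundary. Multiply the result by the clock in GHz to compare with the cycle counts in the table.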

