c - L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes -


i want achieve maximum bandwidth of following operations intel processors.

for(int i=0; i<n; i++) z[i] = x[i] + y[i]; //n=2048 

where x, y, , z float arrays. doing on haswell, ivy bridge , , westmere systems.

i allocated memory this

char *a = (char*)_mm_malloc(sizeof(float)*n, 64); char *b = (char*)_mm_malloc(sizeof(float)*n, 64); char *c = (char*)_mm_malloc(sizeof(float)*n, 64); float *x = (float*)a; float *y = (float*)b; float *z = (float*)c; 

when did got 50% of peak bandwidth expected each system.

the peak values calculated frequency * average bytes/clock_cycle. average bytes/clock cycle each system is:

core2: 2 16 byte reads 1 16 byte write per 2 clock cycles     -> 24 bytes/clock cycle sb/ib: 2 32 byte reads , 1 32 byte write per 2 clock cycles -> 48 bytes/clock cycle haswell: 2 32 byte reads , 1 32 byte write per clock cycle  -> 96 bytes/clock cycle 

this means e.g. on haswell i observe 48 bytes/clock cycle (could 2 reads in 1 clock cycle , 1 write next clock cycle).

i printed out difference in address of b-a , c-b , each 8256 bytes. value 8256 8192+64. each larger array size (8192 bytes) 1 cache-line.

on whim tried allocating memory this.

const int k = 0; char *mem = (char*)_mm_malloc(1<<18,4096); char *a = mem; char *b = a+n*sizeof(float)+k*64; char *c = b+n*sizeof(float)+k*64; float *x = (float*)a; float *y = (float*)b; float *z = (float*)c; 

this doubled peak bandwidth around 90% of peak bandwidth. however, when tried k=1 dropped 50%. have tried other values of k , found e.g. k=2, k=33, k=65 gets 50% of peak e.g. k=10, k=32, k=63 gave full speed. i don't understand this.

in agner fog's micrarchitecture manual says there false dependency memory address same set , offset

it not possible read , write simultaneously addresses spaced multiple of 4 kbytes.

but that's see biggest benefit! when k=0 memory address differ 2*4096 bytes. agner talks cache bank conflicts. haswell , westmere not suppose have these bank conflicts should not explain observing. what's going on!?

i understand ooo execution decides address read , write if arrays' memory addresses differ 4096 bytes not mean processor reads e.g. &x[0] , writes &z[0] @ same time why being off single cache line cause choke?

edit: based on evgeny kluev's answer believe agner fog calls "bogus store forwarding stall". in manual under pentium pro, ii , ii writes:

interestingly, can get bogus store forwarding stall when writing , reading different addresses if happen have same set-value in different cache banks:

; example 5.28. bogus store-to-load forwarding stall mov byte ptr [esi], al mov ebx, dword ptr [esi+4092] ; no stall mov ecx, dword ptr [esi+4096] ; bogus stall 

edit: here table of efficiencies on each system k=0 , k=1.

               k=0      k=1         westmere:      99%      66% ivy bridge:    98%      44% haswell:       90%      49% 

i think can explain these numbers if assume k=1 writes , reads cannot happen in same clock cycle.

       cycle     westmere          ivy bridge           haswell            1     read  16          read  16 read  16    read  32 read 32            2     write 16          read  16 read  16    write 32            3                       write 16            4                       write 16    k=1/k=0 peak    16/24=66%          24/48=50%            48/96=50% 

this theory works out pretty well. ivy bridge bit lower expect ivy bridge suffers bank cache conflicts others don't may effect consider.

below working code test yourself. on system without avx compile g++ -o3 sum.cpp otherwise compile g++ -o3 -mavx sum.cpp. try varying value k.

//sum.cpp #include <x86intrin.h> #include <stdio.h> #include <string.h> #include <time.h>  #define timer_type clock_realtime  double time_diff(timespec start, timespec end) {     timespec temp;     if ((end.tv_nsec-start.tv_nsec)<0) {         temp.tv_sec = end.tv_sec-start.tv_sec-1;         temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;     } else {         temp.tv_sec = end.tv_sec-start.tv_sec;         temp.tv_nsec = end.tv_nsec-start.tv_nsec;     }     return (double)temp.tv_sec +  (double)temp.tv_nsec*1e-9; }  void sum(float * __restrict x, float * __restrict y, float * __restrict z, const int n) {     #if defined(__gnuc__)     x = (float*)__builtin_assume_aligned (x, 64);     y = (float*)__builtin_assume_aligned (y, 64);     z = (float*)__builtin_assume_aligned (z, 64);     #endif     for(int i=0; i<n; i++) {         z[i] = x[i] + y[i];     } }  #if (defined(__avx__)) void sum_avx(float *x, float *y, float *z, const int n) {     float *x1 = x;     float *y1 = y;     float *z1 = z;     for(int i=0; i<n/64; i++) { //unroll 8 times         _mm256_store_ps(z1+64*i+  0,_mm256_add_ps(_mm256_load_ps(x1+64*i+ 0), _mm256_load_ps(y1+64*i+  0)));         _mm256_store_ps(z1+64*i+  8,_mm256_add_ps(_mm256_load_ps(x1+64*i+ 8), _mm256_load_ps(y1+64*i+  8)));         _mm256_store_ps(z1+64*i+ 16,_mm256_add_ps(_mm256_load_ps(x1+64*i+16), _mm256_load_ps(y1+64*i+ 16)));         _mm256_store_ps(z1+64*i+ 24,_mm256_add_ps(_mm256_load_ps(x1+64*i+24), _mm256_load_ps(y1+64*i+ 24)));         _mm256_store_ps(z1+64*i+ 32,_mm256_add_ps(_mm256_load_ps(x1+64*i+32), _mm256_load_ps(y1+64*i+ 32)));         _mm256_store_ps(z1+64*i+ 40,_mm256_add_ps(_mm256_load_ps(x1+64*i+40), _mm256_load_ps(y1+64*i+ 40)));         _mm256_store_ps(z1+64*i+ 48,_mm256_add_ps(_mm256_load_ps(x1+64*i+48), _mm256_load_ps(y1+64*i+ 48)));         _mm256_store_ps(z1+64*i+ 56,_mm256_add_ps(_mm256_load_ps(x1+64*i+56), _mm256_load_ps(y1+64*i+ 56)));     } } #else void sum_sse(float *x, float *y, float *z, const int n) {     float *x1 = x;     float *y1 = y;     float *z1 = z;     for(int i=0; i<n/32; i++) { //unroll 8 times         _mm_store_ps(z1+32*i+  0,_mm_add_ps(_mm_load_ps(x1+32*i+ 0), _mm_load_ps(y1+32*i+  0)));         _mm_store_ps(z1+32*i+  4,_mm_add_ps(_mm_load_ps(x1+32*i+ 4), _mm_load_ps(y1+32*i+  4)));         _mm_store_ps(z1+32*i+  8,_mm_add_ps(_mm_load_ps(x1+32*i+ 8), _mm_load_ps(y1+32*i+  8)));         _mm_store_ps(z1+32*i+ 12,_mm_add_ps(_mm_load_ps(x1+32*i+12), _mm_load_ps(y1+32*i+ 12)));         _mm_store_ps(z1+32*i+ 16,_mm_add_ps(_mm_load_ps(x1+32*i+16), _mm_load_ps(y1+32*i+ 16)));         _mm_store_ps(z1+32*i+ 20,_mm_add_ps(_mm_load_ps(x1+32*i+20), _mm_load_ps(y1+32*i+ 20)));         _mm_store_ps(z1+32*i+ 24,_mm_add_ps(_mm_load_ps(x1+32*i+24), _mm_load_ps(y1+32*i+ 24)));         _mm_store_ps(z1+32*i+ 28,_mm_add_ps(_mm_load_ps(x1+32*i+28), _mm_load_ps(y1+32*i+ 28)));     } } #endif  int main () {     const int n = 2048;     const int k = 0;     float *z2 = (float*)_mm_malloc(sizeof(float)*n, 64);      char *mem = (char*)_mm_malloc(1<<18,4096);     char *a = mem;     char *b = a+n*sizeof(float)+k*64;     char *c = b+n*sizeof(float)+k*64;      float *x = (float*)a;     float *y = (float*)b;     float *z = (float*)c;     printf("x %p, y %p, z %p, y-x %d, z-y %d\n", a, b, c, b-a, c-b);      for(int i=0; i<n; i++) {         x[i] = (1.0f*i+1.0f);         y[i] = (1.0f*i+1.0f);         z[i] = 0;     }     int repeat = 1000000;     timespec time1, time2;      sum(x,y,z,n);     #if (defined(__avx__))     sum_avx(x,y,z2,n);     #else     sum_sse(x,y,z2,n);     #endif     printf("error: %d\n", memcmp(z,z2,sizeof(float)*n));      while(1) {         clock_gettime(timer_type, &time1);         #if (defined(__avx__))         for(int r=0; r<repeat; r++) sum_avx(x,y,z,n);         #else         for(int r=0; r<repeat; r++) sum_sse(x,y,z,n);         #endif         clock_gettime(timer_type, &time2);          double dtime = time_diff(time1,time2);         double peak = 1.3*96; //haswell @1.3ghz         //double peak = 3.6*48; //ivy bridge @ 3.6ghz         //double peak = 2.4*24; // westmere @ 2.4ghz         double rate = 3.0*1e-9*sizeof(float)*n*repeat/dtime;         printf("dtime %f, %f gb/s, peak, %f, efficiency %f%%\n", dtime, rate, peak, 100*rate/peak);     } } 

i think gap between a , b not matter. after leaving 1 gap between b , c i've got following results on haswell:

k   % ----- 1  48 2  48 3  48 4  48 5  46 6  53 7  59 8  67 9  73 10 81 11 85 12 87 13 87 ... 0  86 

since haswell known free of bank conflicts, remaining explanation false dependence between memory addresses (and you've found proper place in agner fog's microarchitecture manual explaining problem). difference between bank conflict , false sharing bank conflict prevents accessing same bank twice during same clock cycle while false sharing prevents reading offset in 4k piece of memory after you've written same offset (and not during same clock cycle several clock cycles after write).

since code (for k=0) writes offset after doing 2 reads same offset , not read long time, case should considered "best", placed k=0 @ end of table. k=1 read offset overwritten, means false sharing , therefore performance degradation. larger k time between write , read increases , cpu core has more chances pass written data through memory hierarchy (which means 2 address translations read , write, updating cache data , tags , getting data cache, data synchronization between cores, , many more stuff). k=12 or 24 clocks (on cpu) enough every written piece of data ready subsequent read operations, starting value performance gets usual. looks not different 20+ clocks on amd (as said @mysticial).


Comments

Popular posts from this blog

javascript - how to protect a flash video from refresh? -

visual studio 2010 - Connect to informix database windows form application -

android - Associate same looper with different threads -