E3-1240 v5 3.50GHz single core perf worse than E5-2650 v2 2.60GHz PHP 5.X
E3-1240 v5 @ 3.50GHz performance is worse than E5-2650 v2 @ 2.60GHz for PHP 5.X. For PHP 7 (and everything else) the E3 is better.
The test setup is XenServer 6.5 with CentOS 6.8 (kernel 2.6.32-642.6.2.el6.x86_64) HVM guests. Each VM has 2 cores assigned. The test uses siege. PHP 5.4, 5.5, and 5.6 are all nearly 50% slower on the E3. PHP 7 is 200% faster on the E3. Varnish is nearly 300% faster on the E3. Sysbench tests are 150% to 300% faster on the E3. Only PHP 5.X is faster on the E5.
I've torn down and rebuilt the VMs several times and confirmed they are the same. I've even live migrated them across to the other host/proc and confirmed the same results.
I've tried strace, but it isn't going to work because it adds overhead to every call and the E3 executes that overhead faster. In a browser the TTFB is 167ms on the E5 and 318ms on the E3. With strace attached, the same call takes 548ms on the E5 and 557ms on the E3. The E3 executes the strace overhead faster, so the execution times equalize.
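As a low-overhead alternative to tracing, plain wall-clock timing averaged over several runs avoids the per-syscall cost described above. The sketch below is only an illustration: `avg_ms` is a hypothetical helper name, and it assumes GNU `date` (for `%N` nanoseconds), which is what CentOS ships.

```shell
# Hypothetical helper: average wall-clock time (whole milliseconds) of a
# command over N runs. Assumes GNU date, which supports %N (nanoseconds).
# The body runs in a subshell, so its variables do not leak to the caller.
avg_ms() (
    runs=$1
    shift
    total=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s%N)
        # Discard the command's output; ignore its exit status.
        "$@" >/dev/null 2>&1 || true
        end=$(date +%s%N)
        total=$(( total + end - start ))
        i=$(( i + 1 ))
    done
    # Integer division: sub-millisecond precision is dropped.
    echo $(( total / runs / 1000000 ))
)

# Example, using the command names from the thread (assumed to exist):
# avg_ms 20 php56 index.php
```

Running the same line on both guests gives a number per host that includes process startup but no tracer overhead.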
What is different about PHP 5.X that it would run so much better on the older-generation, slower-clocked E5? Is the larger L1/L2 cache making the difference? Or something else, instruction set related maybe? What other tool could I use, one that adds little overhead, to see the PHP execution performance?

You are comparing low-end Xeon processors with high-end Xeon processors ($250-$280 vs $1166-$1180 per processor). You would need to compare within the same series, E3-1240 v2 vs E3-1240 v5, to have a more accurate test.
http://ark.intel.com/products/88176/Intel-Xeon-Processor-E3-...
http://ark.intel.com/products/65730/Intel-Xeon-Processor-E3-...

Maybe perf stat would show some difference?

Unfortunately, the hardware events come back as <not supported> in the performance counter stats for 'php56 index.php'.

PHP 5 allocates memory differently from PHP 7. That's why you see a difference there, and memory is the biggest difference between the E5 and E3 here (memory bandwidth, cache size). PHP 7 makes fewer memory allocations because it allocates in chunks; PHP 5 is allocating and reallocating all the time.

Memory seems likely. On the E5, PHP 5.6 shows sys at 15% in top. On the E3, PHP 5.6 shows sys at 7%. On the E5, PHP 7 shows sys at 13%. We are exploring memory performance more now.

I'm just speculating here, but if I remember correctly, PHP 5 uses significantly more memory than PHP 7, and the E5 has a 2.5x larger L3 cache and almost twice the memory bandwidth of the E3. Perhaps that has something to do with it?

We thought that as well. The E5 has 4 memory channels with a max bandwidth of 51.2 GB/s. The E3 has 2 memory channels with a max bandwidth of 34.1 GB/s. But we see a dramatic difference in single-core tests. Our virtual machines have 2 cores assigned and there's also a dramatic difference. I wouldn't think that 1-2 cores would saturate 2 memory channels or 34.1 GB/s of bandwidth. If we were testing all 8 cores on the E3 vs an E5 8-core virtual machine, yeah, maybe, but 1-2 cores? The L3 cache is much larger on the E5, at 20MB SmartCache vs 8MB SmartCache on the E3. That seems to be the more likely suspect, but I don't know enough about how the CPU cache is used in relation to PHP to say for sure. Hopefully, someone else does :) Ref:
http://ark.intel.com/products/88176/Intel-Xeon-Processor-E3-...
http://ark.intel.com/products/64590/Intel-Xeon-Processor-E5-...

You have talked about everything but what the parent was mentioning: actual cache. As in, L1 and L2. Those vary sizably among the different price tiers, somewhat understandably, for reasons related to this. On recent IBM Power chips, there's a so-called PowerCore option that turns off half the cores and lets the remaining cores double their L2. On some workloads that's a net win. I also tend to think it's there for those people paying a pricey per-core or per-socket fee, where a modest 15% performance gain per core could be very rewarding in a way that scale-out/more-cores can't replicate, but that's in a different realm than anyone I know.

See the other comment above re: perf stat. Working on the event descriptors to see and confirm the L1/L2 cache hits/misses.

Look at the cache miss counters, I suspect that's the explanation. Your other workloads are more cache friendly.

> Varnish is nearly 300% faster for E3

That is the most worrying thing IMO. If the cache is hot (i.e. all loads are from RAM), then the E5 should be vastly more powerful, not vice versa...

Could you try the benchmarks with Gentoo, with optimised builds for each CPU?

Just to make sure: you built all binaries and linked libraries yourself, from scratch, with the same optimization settings, right?

The binaries are from the remi repo. We have a template we provision from. I used the same template on each virtual machine.

PHP 5.4.45 (cli) (built: Sep 19 2016 15:31:07)
PHP 5.5.38 (cli) (built: Nov 9 2016 17:32:11)
PHP 5.6.28 (cli) (built: Nov 9 2016 07:04:38)

The binaries are the same on each virtual machine. Are there build optimizations for E3/V5 vs E5/V2 that could make such a difference?

>Are there build optimizations for E3/V5 vs E5/V2 that could make such a difference?

Potentially, yes. Either way you're not comparing apples to apples using packages from a 3rd party repo built for a different system. Newer processors have different instruction sets (AVX is a big one). You'd want to make sure you're not only compiling it on each platform, but also using a new enough compiler to support the instruction sets.

Correct. We are going to build using the correct -march flags: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

>>gcc -march=native -Q --help=target
>>march=silvermont

gcc native thinks this E3 is Silvermont, a low-power SoC.

We built 5.6 with -O2 -march=broadwell -mno-avx (had to remove, probably pecl ext issue). There was about a 15% performance gain. Nothing that would explain the large difference between E3 and E5.

Found this and will try it: https://github.com/centminmod/centminmod/commit/755dd9e87eac...

Have you tried comparing on bare metal?

We have not compared on bare metal.
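For the -march=native check discussed above, a small sketch like the following can put gcc's answer next to the CPU model the guest actually reports, which makes a mismatch such as silvermont on an E3 easy to spot. The function names are hypothetical, and the parsing assumes the `-march=` line format printed by `gcc -Q --help=target`; other gcc versions may format it differently.

```shell
# Hypothetical helper: reads `gcc -Q --help=target` output on stdin and
# prints the value on the -march= line (first match only).
march_from_target_help() {
    sed -n 's/^[[:space:]]*-march=[[:space:]]*\([^[:space:]]\{1,\}\).*/\1/p' | head -n 1
}

# Hypothetical helper: prints the guest-visible CPU model (Linux only) and
# what gcc resolves -march=native to, if gcc is installed.
show_native_march() {
    if [ -r /proc/cpuinfo ]; then
        grep -m1 'model name' /proc/cpuinfo || true
    fi
    if command -v gcc >/dev/null 2>&1; then
        # Same query as in the thread, reduced to the -march value.
        gcc -march=native -Q --help=target 2>/dev/null | march_from_target_help
    else
        echo "gcc not found"
    fi
}

# Example:
# show_native_march
```

If the two lines disagree, passing an explicit -march value, as was done with -march=broadwell, sidesteps the detection.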
perf stat -d php ./benchmark.php

Would that show some difference? It measures some kernel and CPU events like context switches, page faults, L1 and L3 cache misses.

Working on finding the event descriptors...

      394.588620  task-clock (msec)         #    0.983 CPUs utilized
             226  context-switches          #    0.573 K/sec
               2  cpu-migrations            #    0.005 K/sec
          17,447  page-faults               #    0.044 M/sec
 <not supported>  cycles
 <not supported>  stalled-cycles-frontend
 <not supported>  stalled-cycles-backend
 <not supported>  instructions
 <not supported>  branches
 <not supported>  branch-misses
 <not supported>  L1-dcache-loads
 <not supported>  L1-dcache-load-misses
 <not supported>  LLC-loads
 <not supported>  LLC-load-misses

     0.401580145  seconds time elapsed
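To keep track of which counters are missing while hunting for event descriptors, the perf stat output can be filtered for the events it marks as unsupported. This is a minimal sketch: `unsupported_events` is a hypothetical helper, and it assumes the event name is the last field on each `<not supported>` line, as in the output above.

```shell
# Hypothetical helper: reads perf stat output on stdin and prints the names
# of events reported as <not supported> (assumed to be the last field).
unsupported_events() {
    awk '/<not supported>/ { print $NF }'
}

# Example, using the command from the thread. perf stat prints its report
# on stderr, so stderr goes to the pipe and the program's stdout is dropped:
# perf stat -d php56 index.php 2>&1 >/dev/null | unsupported_events
```

In a virtualized guest, a full list like the one above may mean the hypervisor is not exposing the hardware performance counters to the VM; that is an assumption here, not something the output confirms.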