
The Weird and Wacky World of VIA, the 3rd player in the “Modern” x86 market

September 1, 2021, by Cheese

Header Image credit goes to Martijn Boer.

In the world of x86 CPUs there are two major players, Intel and AMD. However, there is one (well two but that will be expanded on later) other company that designs and produces CPUs that are fully compatible with modern x86 extensions, yes even AVX, and that company is VIA.

How VIA got the x86 license is a bit of a messy story that involves the purchase of both Cyrix from National Semiconductor and Centaur Technology from Integrated Device Technology, but in the end VIA ended up with an x86 license and is currently designing and developing x86 compatible CPUs.

VIA’s Cores through Time

When VIA acquired Cyrix and Centaur, there were several cores in design. From Cyrix there was a brand new core by the name of Jalapeno. Jalapeno was a spicy 2-wide out of order design with an on-die RIMM memory controller. Cyrix also had a more modest Cayenne (later renamed Joshua) core which was less a revolution and more an evolution of the MII. Meanwhile, VIA got the Samuel core from Centaur.
VIA scrapped both of the Cyrix cores, Joshua and Jalapeno, and went forward with Centaur's Samuel core, as VIA was focusing on low power designs, which neither Joshua nor Jalapeno targeted.

The Samuel core and its many derivatives were used in the Cyrix III (later renamed to the C3 as it was not based on Cyrix tech) all the way up to the VIA C7.

The Samuel core, which dates back to 2000, was replaced in 2008 by the Isaiah core. Isaiah was a very ambitious design for its time: a low-power, fully out-of-order core in 2008 was not trivial considering the process nodes available, and as a reminder, Intel's Bonnell core from the same year was a dual issue in-order design. ARM did have an out-of-order core in 2008 in the form of the Cortex A9, but as pointed out in our article about how ISA doesn't matter, A9 lost to Bonnell in terms of both power and performance, so VIA attempting a fully out-of-order CPU in 2008 was an ambitious project. Isaiah was first fabbed on Fujitsu 65nm and then shrunk to TSMC 40nm. Further down the line it was shrunk again, from the aforementioned TSMC 40nm to TSMC 28nm, and renamed Isaiah II.

VIA's latest core, CNS, built on TSMC N16 and yet to be released in any shipping products, is a very large jump compared to Isaiah. In the grand scheme of CPU design in the year 2021, however, it looks fairly lackluster, and is sadly nowhere near as ambitious as Isaiah was.

VIA’s Isaiah: You Call This Low-Power?

Isaiah is very complex for a low-power core. That’s clear just from a mile high block diagram.

Most of the structure sizes here were determined using microbenchmarks

For comparison, here’s Intel’s Pentium III, which was discontinued just a year before Isaiah launched as VIA Nano.

Data compiled from a variety of sources, some of which are unofficial, and some of which don't agree. Take this with a grain of salt, but it should be mostly accurate

Compared to Intel’s prior “big core”, VIA Nano is giant. It’s deeper and wider with bigger L1 caches, but VIA’s engineers didn’t stop there.

Let’s start with the branch predictor, where VIA took things to the next level.

The Complex Maze that is Isaiah’s Branch Predictor

The branch prediction unit (BPU) tells the core where to fetch the next instruction from. If there’s no branch, obviously you just keep fetching instructions in a straight line. But if there is, the BPU must decide whether the branch is taken. If it is taken, the BPU needs to determine where the branch is going (and preferably do so quickly to avoid keeping the instruction fetch unit waiting).

Let’s look at speed first. As described in our previous article, this test shows how fast the BPU can provide the branch target.
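
To make this concrete, here is a minimal sketch of the kind of loop such a test uses (assuming x86-64 with GCC/Clang inline assembly; the real tests generate far longer jump chains and also vary branch count and spacing, and the TAKEN4 macro and function name here are purely illustrative):

```c
// Each "jmp 1f" is an always-taken branch to the very next instruction. Timing the
// loop and dividing by the branch count gives cycles per taken branch; growing the
// chain past the BTB's capacity shows where branch target latency jumps.
#define TAKEN4       \
    "jmp 1f\n1:\n\t" \
    "jmp 1f\n1:\n\t" \
    "jmp 1f\n1:\n\t" \
    "jmp 1f\n1:\n\t"

void taken_branch_chain(long iterations) {
    for (long i = 0; i < iterations; i++)
        asm volatile(TAKEN4 TAKEN4 TAKEN4 TAKEN4); /* 16 taken branches per iteration */
}
```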

This graph shows how many branch targets can be tracked in a loop. The longer the line stays flat, the better.

VIA says they used a 4096 entry 4-way BTB to cache branch targets, and our results confirm that. For 2008, that’s insane. AMD’s Bobcat only has a 512 entry BTB. Even big cores like Intel’s Conroe and AMD’s K8/K10 only had 2048 entries. Intel didn’t bring out a 4096 entry BTB until Sandy Bridge in 2011.

But this huge BTB doesn’t come for free and ends up being a bit slow. Nano can only do one taken branch every three cycles. Or alternatively, the instruction fetch unit sits around doing nothing for two cycles (bubbles) after a taken branch while the branch predictor grabs the branch target out of the BTB.

Our experiments only show 1024 BTB entries for Core 2, but other research with more detailed testing concluded it had 2048 entries

Core 2 uses a smaller and faster BTB capable of handling a taken branch every two cycles. If there are four or fewer taken branches in play, Core 2 can do “zero bubble” branch prediction.

Now let’s look at Bobcat’s approach.

It looks like Bobcat's BTB is tied to the instruction cache, as changing branch spacing affects how many branches it can cache. Branches spaced by 4 bytes were disastrously bad for Bobcat.

Bobcat’s BTB isn’t as big as VIA’s, but it’s slightly faster. In the best case, Bobcat can do a taken branch every two cycles. Unlike VIA, Bobcat is also sensitive to branch spacing and can only utilize its maximum BTB capacity if branches are moderately spaced. Dense branches cause catastrophically bad behavior for Bobcat. It looks even worse than missing the BTB, so maybe Bobcat’s BTB isn’t even tagged at 4 byte granularity and ends up feeding wrong branch targets to the frontend, causing mispredicts.

Now that we’ve covered how fast the branch predictor is, let’s see how well it can track patterns. That should correlate well with how accurate it is. Unfortunately, we can’t compare real world accuracy because we couldn’t get performance counters to work on the Nano.

The X axis shows how long the repeating pattern is for each branch, while the Y axis shows how many branches are in play. The Z axis shows how much longer each branch took on average when the pattern was random, versus all 0s (branch always taken). We're looking at the 'wall' here, and how far away it is from the origin. The farther, the better.
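
As a rough illustration of the method (a simplified sketch rather than our exact harness; the function and parameter names are just for illustration):

```c
// Sketch of the pattern recognition test. Fill 'patterns' with a repeating
// taken/not-taken pattern of length 'pattern_length' for each of 'branch_count'
// branches, time the loop, then repeat with random patterns; the per-branch time
// difference between the two runs is what the Z axis shows.
void run_branches(const char *patterns, int branch_count, int pattern_length,
                  long iters, volatile long *sink) {
    for (long i = 0; i < iters; i++) {
        int step = (int)(i % pattern_length);
        for (int b = 0; b < branch_count; b++) {
            // One conditional branch per pattern entry. The volatile store keeps
            // the compiler from turning this into a branchless conditional move.
            if (patterns[(long)b * pattern_length + step])
                *sink = b;
        }
    }
}
```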

This wacky looking graph is the pattern recognition graph for the VIA Nano U4025

Nano uses a very complex direction predictor with four Branch History Tables (BHTs). Three of them predict whether a branch is taken, while the fourth predicts which prediction to use. This 'tournament' style predictor with competing prediction methods isn't new, but saw more use on very high performance, high power designs. For example, DEC's Alpha EV5 had three BHTs – one with local history, one with global history, and one for the meta predictor (that selects between the previous two). VIA didn't disclose the exact prediction methods in play. But from the graph above, its pattern recognition abilities are quite impressive.

For perspective, compare this to Core 2, a contemporary "big core" design. Intel's BPU uses an 8-bit Global History Buffer.

It's quite clear here that Core 2 Duo's BPU can't track very long patterns

And as you can see, in this test, Isaiah’s BPU wins hands down.


Bobcat lands somewhere in between. It’s not quite as good as Isaiah, but is better than Core 2.

Taking a step back, VIA preferred a more complex frontend design. The instruction cache is large with high associativity. Decode is 3-wide (wider than any low power x86 core until Intel’s Goldmont came out in 2016). And the branch predictor is incredibly sophisticated for 2008. For Bobcat, AMD opted for a narrower frontend with smaller caches and a less capable (but lower latency) branch predictor.

VIA kept the same philosophy for Isaiah’s backend.

A Very Capable Out of Order Engine

A 65 entry reorder buffer (ROB) is massive for the time; for context, Bobcat has a 56 entry ROB and is 3 years newer. This beefiness also carries over to the integer and floating point register files, where Isaiah has 46 integer and 48 SIMD/FP registers available for renaming. Comparatively, Bobcat has 34 integer registers and 21 FP registers available for renaming.

The reason this graph shows 46 entries for the Floating Point Register File (FP Reg. File) and 48 entries for the Integer Register File (Int. Reg. File), while our diagram above says 62 and 64 entries, is that x86 has 16 architectural registers that have to be tracked for exception recovery and can't be used for holding speculative register values.

We got some mixed results with our microbenchmarks regarding Isaiah's ROB size. On Windows we observed a 48 entry ROB, however on Linux we saw a 65 entry ROB. We have not observed that kind of discrepancy with any other CPU in our testing; we went with the higher value because it's clear that Isaiah's ROB can hold more than 48 micro-ops.
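
For reference, reordering capacity is usually probed with two independent long latency loads separated by a variable amount of filler; a minimal sketch of the idea (assuming x86-64 with GCC/Clang, and chain1/chain2 set up as independent pointer chasing chains that miss in cache):

```c
#include <stdint.h>

#define NOP4  "nop\n\tnop\n\tnop\n\tnop\n\t"
#define NOP16 NOP4 NOP4 NOP4 NOP4

// Iteration latency stays near one cache miss as long as both loads fit in the
// out-of-order window, and roughly doubles once the filler pushes the second load
// out of it. Varying the filler count locates the capacity limit.
void *rob_probe(uint64_t **chain1, uint64_t **chain2, long iters) {
    for (long i = 0; i < iters; i++) {
        chain1 = (uint64_t **)*chain1;   // long latency load #1
        asm volatile(NOP16 NOP16);       // filler micro-ops that only occupy ROB entries
        chain2 = (uint64_t **)*chain2;   // long latency load #2
        asm volatile(NOP16 NOP16);
    }
    return (void *)((uintptr_t)chain1 ^ (uintptr_t)chain2); // keep the chases live
}
```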


AMD’s Bobcat didn’t emphasize FP/SIMD reordering capacity as much as VIA did. The difference here is especially exaggerated because we used 128-bit packed integer operations to test the FP/SIMD register file. Bobcat breaks those into 2×64-bit operations, and thus consumes two 64-bit registers for each 128-bit operation.

VIA says in the Isaiah whitepaper that both the load and store queues are 16 entries each and our microbenchmarks show this for the most part.


Compared to Bobcat, Isaiah has 6 more load queue entries but 6 fewer store queue entries. Core 2 has 4 more load queue entries and double the store queue entries that Isaiah has.

The load/store unit is one of the only places where Isaiah looks normal for a low power architecture. It’s laid out differently than Bobcat, and has a slightly bigger store queue than Intel’s older Pentium III.

A Big Distributed Scheduler

We used instruction-to-port mappings from Agner's instruction tables to determine the size of each scheduling queue

We couldn’t directly measure the store scheduler’s size. Unlike newer CPUs, Nano does not allow a load to execute ahead of a store with an unknown address. This restriction is understandable, as this memory dependency speculation just debuted in Intel’s Core 2 architecture. AMD didn’t do this until Bulldozer launched in 2011.

In the Isaiah whitepaper, VIA says there are a total of 76 entries across all of the reservation stations. To estimate the store scheduler's size, we subtracted our estimates for the other schedulers from 76, to get 13 entries.
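
Spelled out, that subtraction (using the whitepaper's 76 entry total and our estimates listed below) is:

$$76 - (12 + 11 + 8 + 24 + 8) = 76 - 63 = 13$$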

Thus, the breakdown of the individual scheduling queues is as follows:
Integer 1 – 12 entries
Integer 2 – 11 entries
Media A – 8 entries
Media B – 24 entries
Load – 8 entries
Store – Most likely 13 entries

Relative to Bobcat, Isaiah has 14 more SIMD scheduler entries total, 7 more integer scheduler entries, and 13 more load/store entries, or to put it another way, "Bloody Huge". Now, Core 2 has a fully unified 32 entry scheduler, which while smaller than both Isaiah and Bobcat in terms of number of entries, is more flexible and can do well with fewer entries.

Nano’s Wide and Fast Execution Engine

Constructed using Agner's Instruction Tables

For a low power architecture, Nano’s designers built a very powerful execution engine. The Media A (MA) and Media B (MB) ports feature 128-bit wide execution units and datapaths, matching Intel’s Core 2 in vector throughput. Low power cores didn’t have that kind of vector throughput until AMD’s Jaguar architecture debuted in 2013. The MA/MB schedulers together have 32 entries – more than the SIMD scheduler on AMD’s Bobcat and Jaguar.

VIA’s engineers didn’t stop at hitting big core FP/SIMD capability on a “low power” core. According to Agner’s instruction tables, floating point addition latency is just two cycles, faster than any contemporary (or modern) CPU. FP multiply latency is three cycles. For comparison, Skylake does floating point adds and multiplies with four cycle latency, and Zen 3 does both with three cycle latency.

Similarly, Nano’s L1D can return data in two cycles. If that’s not impressive enough already, remember that it’s with a 64 KB 16-way cache. Assuming address generation and TLB lookup takes a cycle, Nano is checking 16 tags and getting data back in one cycle. That’s insane. At the time, 3 cycle L1D latencies were typical, with 64 KB 2-way (Athlon) or 32 KB 8-way (Core 2) caches. Today, L1D latency is around 4 cycles (or 5 cycles on Intel’s Ice Lake and Tiger Lake cores).
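
Load-to-use latency like this is typically measured with a pointer chase; here is a minimal sketch (assuming `buf` has already been initialized as a randomized cycle of pointers sized to fit within the 64 KB L1D):

```c
#include <time.h>

double ns_per_load(void **buf, long iters) {
    struct timespec t0, t1;
    void **p = buf;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        p = (void **)*p;                 // each load's address depends on the previous load
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile void *sink = p;             // keep the chain from being optimized away
    (void)sink;
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;           // multiply by clock frequency to get cycles per load
}
```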

Wrapping Up: Hitting the Wrong Target?

Isaiah is unique and very ambitious. Perhaps too ambitious for its own good.

In places Isaiah (remember, this is a "low power" design) is bigger than Core 2, which came out a scant 18 months before Isaiah on a similar class of node. In almost all areas, Isaiah is also wider than Bobcat, which came out 3 years after Isaiah on a smaller node than Isaiah's launch node.

Now to add some anecdotal evidence, when I was testing all 3 platforms for this article, the Nano output either similar amounts or seemingly more heat than the Core 2 Duo. For some context, the Nano was able to keep my coffee above lukewarm which is quite impressive; the Bobcat system was easily the coolest of the three.

Clearly Nano wasn’t very low power. But it also wasn’t high performance, so Nano’s priorities need some explaining. I especially wonder why Nano’s design team used full 128-bit vector execution units, and made them very low latency. That can’t have been cheap in terms of power. One possibility is that VIA’s engineers targeted multimedia applications like video playback. That would explain why Nano’s FP/SIMD ports are named “Media A” and “Media B”. But hardware video decoders became very prevalent not long after Nano debuted, leaving Nano’s powerful vector units in an awkward spot.

Isaiah’s large, fast L1D and powerful branch predictor are also culprits. Low latency and high associativity scream high power and low clocks. The branch predictor is a bit more difficult to analyze. Nano went all out for accuracy, but paid the price in frontend latency. With 20-20 hindsight, we can see AMD reduce branch predictor related frontend bubbles generation after generation, and conclude that branch predictor speed is more important than VIA thought it was.

Ultimately, Isaiah represents an ambitious attempt by VIA to compete with AMD and Intel. There's no doubt VIA had a capable and determined engineering team. But creating a successful CPU is all about anticipating important workloads and correctly tuning the architecture for them. Intel and AMD correctly guessed that video playback would be offloaded. Also, Intel and AMD are larger companies with extensive simulation resources that can better guide engineering decisions. Both developed low power architectures that were narrower and less sophisticated than Isaiah, but more than made up for it with higher clock speed and careful resource allocation.

Now in the very beginning I said there is a fourth x86 design house, and that is true; however, Zhaoxin is a joint venture between VIA and the Shanghai Municipal Government. If you paid attention to our ISA doesn't matter article you would know that we have a Zhaoxin CPU in-house, ready for testing. Because of time constraints, the full in-depth dive into that CPU will have to wait for a part 2 to this article. However, from what testing we have done with Zhaoxin's newest microarchitecture, Lujiazui, it's more evolutionary than revolutionary.

 

The Weird and Wacky World of VIA Part 2: Zhaoxin’s not quite Electric Boogaloo

September 22, 2021, by Cheese and clamchowder

In Part 1 of this piece we talked about the third x86 design house, VIA and more specifically VIA’s most recent commercially available architecture Isaiah. Today we are talking about the joint venture that VIA has with the Shanghai Municipal Government, Zhaoxin.

A Little Background History

Zhaoxin was started in 2013 as a joint venture between the Shanghai Government and VIA.

Chart of Zhaoxin's CPU families and architectures

Based on the information on WikiChip, ZhangJiang is a minor modification of the Isaiah II core, where the biggest change was the addition of the Chinese SM3 and SM4 cryptographic algorithms to Padlock.

The big architectural changes came with the Wudaokou core, where Zhaoxin claimed a 25% performance per clock increase, which is no small feat. The core we are looking at today is Zhaoxin's latest, Lujiazui, which Zhaoxin claims has an up to 50% performance increase over Wudaokou. However, the clock speed increase is also up to 50%, which makes me suspect that Lujiazui is the Wudaokou core ported from HLMC's 28nm to TSMC's 16nm, with a nice clock bump as a consequence.

Lujiazui versus Isaiah
What has changed?

Cache Money…..
Or in this case an exchange

One of the largest changes that Zhaoxin did with the Lujiazui architecture is to the cache hierarchy.

The absolute values for the L1 latency are the same for both the Zhaoxin and the VIA at 5 cycles. But VIA documents a 2 cycle latency for the L1, whereas we are showing a 5 cycle latency; this could be because our test uses complex address generation, which looks to add 3 cycles to the L1 latency. We suspect that Zhaoxin has a 4 cycle L1 latency with a 1 cycle complex address generation penalty, based on what Intel and AMD document, but it is impossible for us to tell from this test.

Zhaoxin has cut the L1 in half from Isaiah, going from a 64KB L1 to a 32KB L1. This is an interesting move: a smaller L1 is lower power and costs less die area, but it will also reduce the L1 hitrate, so more requests go out to the L2.

Speaking of that, Lujiazui’s 4MB L2 is now shared across a quad core cluster, ditching Isaiah’s private 1 MB of L2 per core. Thus Lujiazui has a last level cache configuration similar to that of AMD’s Zen 1 and Zen 2 APUs which is a massive change because Isaiah did not have shared caches.

However, this new shared L2 came at a large cost. L2 latency went from 20 cycles on Isaiah to an astonishing 48 cycles on Lujiazui. For comparison, Zen 1's L3 (LLC) is twice as big at 8MB but has 35 cycle latency, even though Zen 1 runs at a higher clock.
This would suggest to us that Lujiazui is not meant to be a high performance design. However, both Zhaoxin and VIA claim that the KX-6000 series is designed to go up against Skylake, which is a high performance core. Now, Tom's Hardware did a very in-depth performance review of the KX-U6780A and found that it in no way matches up to Skylake in performance.

Comparing Zhaoxin's cache bandwidth to Piledriver in a 1 thread per module config. PD similarly supports AVX by splitting 256-bit instructions into two 128-bit micro-ops.

Lujiazui’s cache bandwidth is unimpressive even when compared to other CPUs that don’t use a full-width AVX implementation (more on that later). We weren’t even able to sustain 8 bytes per cycle to each core from the L2 cache. Single threaded bandwidth, shown below, is also nothing to write home about. Clearly, the cache hierarchy will need to be improved if Zhaoxin really does want to go up against Zen 2.

Memory bandwidth with one thread. Again, Piledriver is provided for comparison.

Beyond the L2, Lujiazui brings the memory controller on-die instead of accessing it through a Front-Side Bus (FSB), reducing memory latency. Lujiazui's intra-cluster interconnect also gives reasonably low latencies when bouncing cache lines between cores:

"Core to core latencies" for LuJiaZui, where cores use atomic instructions to bounce cache line ownership
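
The idea behind this kind of test is simple: pin two threads to different cores and have them hand a single cache line back and forth with atomic compare-exchange. A minimal sketch (thread creation and core pinning are left to the caller; names are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

_Alignas(64) static _Atomic uint64_t shared_line;  // one cache line bounced between two cores

// Thread A calls bounce(0, 1, iters) while thread B calls bounce(1, 0, iters).
// Each successful compare-exchange hands the line to the other core, so total time
// divided by iterations approximates a core-to-core round trip.
void bounce(uint64_t mine, uint64_t theirs, long iters) {
    for (long i = 0; i < iters; i++) {
        uint64_t expected = mine;
        while (!atomic_compare_exchange_weak(&shared_line, &expected, theirs))
            expected = mine;             // spin until the other core hands the line back
    }
}
```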

The All-Important Branch Predictor Unit

In the previous part of this article we talked a lot about Isaiah’s very unique BPU and how it seemed to trade off speed for accuracy.


Zhaoxin tries to get the best of both worlds by adding a 16 entry L0 BTB that can handle taken branches with no bubbles. Zen 2 has a similar mechanism, and can typically handle about half of the branches it sees without going to larger, higher latency BTBs. Lujiazui’s main BTB, like Isaiah’s, has 4096 entries and creates two bubbles if it’s used to provide a branch target. If there’s a BTB miss, Lujiazui is able to provide a branch target slightly faster than Isaiah.

Now let’s look at how well the branch predictor can handle long, repeating patterns.


Pattern recognition capabilities of the VIA (left) and the Zhaoxin (right). Getting farther along the X axis without suffering a rise in execution times indicates the branch predictor can track longer patterns. And getting farther along the Y axis indicates the branch predictor can track patterns for more branches.

Pattern recognition capabilities haven’t changed much. As you can see, the VIA and the Zhaoxin BPUs are very similar in this test. Isaiah’s direction predictor was very sophisticated for a low power core in 2008. But a decade later, it’s rather mediocre. High performance cores available when Lujiazui launched, like AMD’s Zen and Intel’s Skylake, can handle much longer patterns without mispredicts:

Zen 1 and Skylake have much more powerful pattern recognition capabilities, as shown by these graphs

ARM's Cortex A75 is a contemporary architecture with a more comparable direction predictor. The A75 is a low power core designed for use in smartphones, so this is another sign that Lujiazui isn't really designed to go head to head with Intel and AMD's best.

Lujiazui's pattern recognition ability is roughly in the same ballpark as the Cortex A75's

Lujiazui’s call/return prediction is peculiar:

Lujiazui's behavior with deeply nested calls, with Haswell and Zen 2 results also here for comparison

It looks like Lujiazui implements a fast but tiny L1 return stack with only two entries. Lujiazui probably has an L2 return stack with 34 entries, as performance counters only show mispredicts after call depth exceeds that.

5-6 nanoseconds is 13 to 16 cycles at 2.6 GHz, so this L2 return stack is horrendously slow. Hitting Lujiazui's L2 return stack is about as bad as overflowing the return stack on Haswell. Deeply nested calls should be avoided on Lujiazui, even if they're correctly predicted.

A Smaller, Narrower Core?

This came as a bit of a surprise to us but it looks like Lujiazui has narrowed the decode width from 3-wide in Isaiah to 2-wide. Rename and/or retirement appear to be 2-wide as well – a test where half the instructions decode into two micro-ops only runs at 1.5 IPC, or 2 micro-ops per cycle.

There are three reasons to narrow a core: higher clock speed, lower power, and/or smaller die area. The idea that Zhaoxin was going purely for higher clock speed doesn't hold up, because Isaiah hit the same clock speed as Wudaokou, with both architectures maxing out at 2GHz on a 28nm process.

Zhaoxin probably cut width down in order to build an 8-core chip with acceptable power and die area. This is a good reminder that a wider core running at lower clocks isn’t always better. ARM’s A73 also decreased core width to 2-wide (from 3-wide on the A72) to improve power efficiency, so Zhaoxin’s move is not without precedent.

The ROB and Register Files – Going Backwards?

Zhaoxin also cut the ROB (reorder buffer) size from 65 to 48 entries. We’re limited by Lujiazui’s 48 entry ROB even if we use instructions that write to integer or floating point registers, indicating that there are enough register file entries for every instruction in the ROB. Apparently, Lujiazui ditched Isaiah’s separate register files and went back to a P6-style ROB+RRF. Pretty much all modern high performance CPUs use register files decoupled from the ROB, so this is a puzzling choice.

One guess is that a ROB+RRF scheme requires less storage for registers, because integer and FP/SIMD instructions share the same register file. It’s probably simpler to implement too, because you only have to deal with the ROB regardless of whether an instruction writes to a register.


To their credit, Zhaoxin has implemented an interesting optimization for not-taken branches to soften the blow of a smaller ROB. Specifically, we were able to get two long latency loads to execute in parallel with up to 64 not-taken branches between them. Each of these branches is a cmp + je pair, so that’s 128 instructions in flight. Not-taken branches account for around 5 or 7% of the instruction stream (in Cinebench R20 and Call of Duty Vanguard respectively). So, Lujiazui’s effective reordering capacity may not be as far behind Nano’s as the ROB size suggests. We couldn’t find any other instruction or combination of instructions that could get past a limit of 48 in flight.
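
In terms of the test itself, this just means swapping the filler between the two long latency loads from NOPs to conditional branches that never get taken, roughly like this sketch (the article's pairs were cmp + je; jne with equal operands is used here purely to guarantee the branch is never taken):

```c
#include <stdint.h>

// "cmp rax, rax" sets ZF, so the following "jne" is never taken. On Lujiazui, these
// pairs appear not to hold ROB entries the way ordinary instructions do, which is
// why both loads can still overlap with dozens of pairs between them.
#define NOT_TAKEN4                       \
    "cmp %%rax, %%rax\n\tjne 1f\n1:\n\t" \
    "cmp %%rax, %%rax\n\tjne 1f\n1:\n\t" \
    "cmp %%rax, %%rax\n\tjne 1f\n1:\n\t" \
    "cmp %%rax, %%rax\n\tjne 1f\n1:\n\t"

void *not_taken_probe(uint64_t **chain1, uint64_t **chain2, long iters) {
    for (long i = 0; i < iters; i++) {
        chain1 = (uint64_t **)*chain1;                 // long latency load #1
        asm volatile(NOT_TAKEN4 NOT_TAKEN4 ::: "cc");  // 8 never-taken branches
        chain2 = (uint64_t **)*chain2;                 // long latency load #2
        asm volatile(NOT_TAKEN4 NOT_TAKEN4 ::: "cc");
    }
    return (void *)((uintptr_t)chain1 ^ (uintptr_t)chain2);
}
```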

Scheduling All These Trains

Ok the CPU isn’t actually scheduling trains, it’s just my really bad attempt to say that we are now looking at Lujiazui’s schedulers.

Zhaoxin seems to have largely kept Nano’s port layout while rebalancing scheduler sizes to improve performance with common integer workloads. The integer scheduling queues grew by a little bit, and we see one more scheduler entry available for the ALUs compared to Nano.


On the memory side, Lujiazui increases the load scheduler size to 12 entries, and similarly beefs up the store scheduler to 20 entries. Compared to Isaiah the store scheduler has increased by 4 entries and the load scheduler has increased by approximately 7 entries.


Lujiazui has rebalanced and slightly cut down the FP/SIMD side schedulers. The FP multiply and add ports have 12 entry scheduling queues, unlike Nano’s 8+24 split.


Execution Ports

We’re assuming Lujiazui keeps most of Nano’s port layout and execution units. Because the core is 2-wide, it’s very difficult to determine which execution units are on separate ports. When possible, we tried to see which instructions could start executing together. If that technique is impossible (for example if one instruction can go to two ports, making core width a limitation rather than execution ports), we tried to see whether they shared a scheduling queue.

Like Isaiah/Nano, Lujiazui has two ALU ports and handles integer multiplication differently based on register width. For example, IMUL with 16-bit inputs has a latency of 2 cycles. With 64-bit multiplication, latency goes up to 9 cycles. They also go to different scheduling queues. Floating point throughput has not changed.

Zhaoxin has beefed up the execution engine in some areas too. There are now two 128-bit vector integer ALUs, instead of one in Isaiah. Lujiazui also gets an extra load pipe, letting it sustain two 128-bit loads per cycle from the L1 data cache.

On the other hand, Lujiazui loses Isaiah’s impressive 2-cycle floating point adder. FP addition latency is now 3 cycles.

AVX – Supported on Paper, Not Useful in Practice

Tom's Hardware described Lujiazui's AVX performance as "subpar", and we can easily see why. 256-bit AVX instructions are split into two 128-bit micro-ops and thus consume two scheduler entries. Since the core seems to be two micro-ops wide, using AVX instructions doesn't provide a throughput advantage over SSE.

256-bit AVX instructions consume twice as many scheduler entries, indicating they are split into 2×128-bit micro-ops. X axis = instructions between long latency loads, Y axis = iteration latency in nanoseconds

But things get worse. Reordering capacity drops to less than half for 256-bit instructions:


It seems like 256-bit AVX instructions consume more than two ROB slots. That’s a worrying penalty in a core with an already small ROB. To put things in perspective, AMD’s K10 architecture featured a 72 entry ROB, and suffered heavily from the ROB filling up:

K10's 72 entry ROB already fills up quite often. Lujiazui's 48 entry one is smaller, so losing even more entries from storing 256-bit state is going to hurt.

256-bit ops are further penalized by increased latency on Lujiazui:

Instruction | Latency | Throughput
128-bit FP Add | 3 clocks | 1 per clock
128-bit FP Multiply | 3 clocks | 1 per clock
256-bit FP Add | 5.8 clocks | 0.5 per clock
256-bit FP Multiply | 5 clocks | 0.5 per clock

Measured throughput and latencies for vector floating point operations on the Zhaoxin KX-6640MA

Zhaoxin’s AVX implementation on Lujiazui looks like a minimum effort job done only to ensure compatibility with newer software. Piledriver and Zen 1 also split 256-bit AVX instructions into two 128-bit micro-ops, but keep latency the same. And, they don’t suffer penalties beyond what you’d expect from an instruction that decodes into two micro-ops. It’s totally fine to use 256-bit AVX instructions on those two AMD architectures. You won’t get an increase in reordering capacity or math throughput, but you still get better code density.

On the other hand, 256-bit AVX should be avoided on Lujiazui. Any instruction density advantage is outweighed by a wombo combo of increased execution latency and reduced reordering capacity.

Loading Loading Loading
Storing Storing Storing

Isaiah had fairly small load and store queues, with both being 16 entries apiece. Lujiazui has increased the load and store queues to 24 and 22 entries respectively, delivering a generational improvement.


Like Isaiah, Lujiazui seems to lack memory dependence prediction. Loads can’t be reordered ahead of stores with an unknown address. This was a shiny new feature when it debuted in Core 2, but just about every high performance core can do it today.

And now to put it ALL Together

With all the data we have gathered, we have been able to create an approximate block diagram of Lujiazui.


In Conclusion

Wudaokou and Lujiazui are very much not just a reskinned Isaiah. They amount to a significantly different design that trims and rebalances VIA's original Nano, letting it reach higher clock speeds with a moderate IPC increase.

But Lujiazui is definitely a low power core without high performance aspirations. A lot of changes seem to target die area and power efficiency rather than performance. A 25% IPC gain over a decade is nowhere near enough to catch AMD or Intel. In terms of per-core performance, contemporary high performance Intel and AMD CPUs completely sweep Lujiazui away.

In some ways, Lujiazui is Nano in reverse. Nano was a low power core on paper, but its architecture was beefier than low power cores that launched years later, and wasn’t far off contemporary high performance designs. Meanwhile, Lujiazui aims to compete with Intel and AMD’s big cores, but is nowhere near that. AMD’s Jaguar and Intel’s Goldmont would probably be better points for comparison.

Now this article has gotten very long so the benchmarks of the Zhaoxin versus the Nano will have to wait for a Part 3 but I will say that Zhaoxin’s claim of 25 percent more performance per clock isn’t just hot air.

If you like what we do and you would like to help us with acquiring more things to test, then head on over to our Patreon if you want to chip in a few bucks.

Appendix

P6-Style ROB and Register Files


When mixing integer and 256-bit AVX instructions, we see one renamed register file used for both. With separate integer and FP register files, mixing integer and FP register file usage will increase your register-limited reordering capacity to the sum of both (which generally means ROB capacity becomes the limit). But with Lujiazui, mixing in 256-bit AVX instructions reduces reordering capacity, so integer and vector instructions are competitively sharing one register file.

And as we mentioned earlier, 256-bit AVX instructions seem to consume more than two ROB slots/renamed registers. If we have one 256-bit AVX instruction blocked from retiring by a long latency load, we can only get 45 integer adds in flight before a second load is blocked from entering the backend.

Even if there are no 256-bit AVX instructions pending retirement, simply having 256-bit values present in the YMM registers influences Lujiazui’s ROB/RF entry reclamation logic. As you can see from the black line, having YMM state introduces a spike shortly before we get to 48. That result was repeatable, and isn’t noise.

We were surprised to see a newer CPU with less reordering capacity, so we tried a batch of other things to see if we could get long latency loads to execute in parallel with more than 48 instructions between them.

X axis = instructions between loads, Y axis = iteration latency in nanoseconds

Of those attempts, only one using not-taken branches was successful (as noted earlier). Also, Zhaoxin can have about 20 taken branches pending retirement before the backend has to stall the frontend.

It's interesting that Lujiazui's branch tracking is decoupled from the ROB, even though we found no evidence of fast (checkpointed) mispredict recovery

Integer Multiplication

Lujiazui also has incredibly weird behavior with 32-bit and 16-bit multiplies. Reordering capacity varies depending on the values being multiplied, so the CPU might be able to eliminate multiplies (like when an input is 1) when the scheduler is full.


Partial AVX2 Support

Officially, Lujiazui supports AVX, but not AVX2. AVX2 includes FMA (fused multiply add) and 256-bit integer instructions. As expected, FMA instructions generate a fault. However, 256-bit integer instructions appear to work correctly. Just like with AVX/FP, 256-bit integer instructions are decoded into two 128-bit micro-ops. Funny enough, 256-bit packed integer multiplication doesn’t suffer additional latency compared to 128-bit ops.

Instruction | Latency | Throughput
128-bit Integer Add | 1 clock | 2 per cycle
128-bit Integer Multiply | 3 clocks | 1 per cycle
256-bit Integer Add | 1.66 clocks | 1 per cycle
256-bit Integer Multiply | 3 clocks | 0.5 per cycle

Performance with unsupported (but working) AVX2 instructions

Also, 256-bit integer addition doesn't reduce reordering capacity as severely as 256-bit floating point ops do. This is pretty strange – Zen 1 gives identical results regardless of whether the 256-bit registers are being accessed by integer or floating point instructions.


Because Lujiazui has two 128-bit integer ports, each with its own scheduling queue, 256-bit integer operations end up getting more scheduling capacity too.


It looks like Zhaoxin has actually done a decent job with 256-bit integer operations, but couldn’t expose them because they don’t support other parts of AVX2 (namely, FMA).

Interpreting FP Scheduler Results

If you're particularly eagle-eyed, you might have noticed that our FP scheduler graphs have multiple latency jumps. That's because creating a dependency for a floating point instruction isn't 100% straightforward. Integer dependencies are easy – just have test instructions consume the result from the long latency loads used to block retirement.

One way to create the dependency for FP instructions is to convert the integer load result to floating point (cvtsi2ss). But the cvtsi2ss instruction itself could consume a floating point scheduler slot. So for the most part, we used the result from a long latency load to index into a separate array of floating point values. For those tests, we're looking at the final jump in latency, because that shows when only one load is executing in parallel.
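
In code, that approach looks roughly like this (a simplified sketch; `chain` is a cache-missing pointer chase and `fp_vals` is a small array of doubles, both named here just for illustration):

```c
#include <stdint.h>

double fp_sched_probe(uint64_t **chain, const double *fp_vals, long iters) {
    double acc = 1.0;
    for (long i = 0; i < iters; i++) {
        chain = (uint64_t **)*chain;                 // long latency load
        double x = fp_vals[(uintptr_t)chain & 0xF];  // FP load that depends on its result
        acc *= x;                                    // dependent FP ops that wait in the FP
        acc *= x;                                    // scheduler until the load completes
        acc *= x;
    }
    return acc;
}
```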

Here’s a visual explanation, with the test loop unrolled (as it would be in the CPU’s backend)

Yeah, the graphs are for multiplies, but adds are shown. Same concept. Also, Zhaoxin is not fully overlapping loads like Zen 2. We get 96-100 "scheduler" entries for Zen 2 because we can't differentiate between the scheduling queue and non-scheduling queue with this test

Assume the first two pointer chasing loads have completed. That is, the loads that put results into registers edi and esi. Loads marked with green lines are ones that are ready to execute.

The first, lowest latency stretch is when the CPU is only limited by memory latency and available instruction level parallelism. Then, iteration latency increases as the CPU’s out of order engine is able to see fewer loads (because the FP scheduler is filling up, preventing it from accepting more instructions).

 

Zhaoxin Part 3: A Sort of Anti-Climax

November 16, 2021, by Cheese

I'll be blunt here: this part will seem like an anti-climax compared to Part 2 of this series, but I hope to wrap the series up nicely with this conclusion piece, covering what we know about how the changes discussed in Part 2 have affected the performance of the Lujiazui architecture, and the potential future of Zhaoxin products and microarchitectures.

A Note on the Benchmarks used for this Piece

There is no perfect benchmark that can spit out one single number telling you how much better CPU X is compared to CPU Y, because every workload is different and every use case is different; the benchmarks we chose for this article are ones that either are freely available or that we were able to get a license for.

With that said, for this piece we used Cinebench R11.5, Cinebench R20, and Geekbench 5. We wanted to also run our standard suite of tests found in our N1 deep dive piece, plus Cinebench R15 and UL's PCMark; however, the VIA system would not run either Cinebench R15 or PCMark. PCMark refused to even open on the VIA, and Cinebench R15 would just crash on both the VIA and Zhaoxin systems, which we suspect is down to the poor driver support for the integrated graphics on these systems.

Also, please keep in mind that we had to clock normalize by scaling each CPU's results to an arbitrary clock speed. This is unfortunately not a good method for getting a performance per clock comparison, but it was the best we could do because we did not have control of the CPU clock ratios. We also had to divide the Zhaoxin's R20 scores by two because we only had the multi thread scores, due to the VIA not being able to finish the R20 single thread test.
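
For clarity, the clock normalization is nothing more than linear scaling to a common frequency, which assumes performance scales perfectly with clock (it generally does not):

$$\text{normalized score} = \text{measured score} \times \frac{f_{\text{reference}}}{f_{\text{measured}}}$$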

Cinebench R11.5


Starting off with Cinebench R11.5, which would have been a common benchmark around the time of the launch of the VIA Nano and the Intel Core 2 T7100: looking at just the raw single thread results, Isaiah takes up the rear with Lujiazui at the front of the pack. However, when we look at the clock normalized single thread results, things change; Isaiah now pulls ahead of Bobcat and Merom pulls in front of Lujiazui. While our clock normalized results show that Lujiazui has about a 13 to 14 percent performance per clock increase over Isaiah, this is not a good start for Lujiazui, losing to an architecture that is a decade and a half old at the time of writing, considering that Lujiazui is the newest architecture Zhaoxin offers.

The story arguably gets even worse with the multithreaded results. The KX-6640MA that we tested is a 4 core CPU, and unfortunately Zhaoxin does not sell a 2 core version, so we extrapolated a 2 core result: we divided the 4 core result by the single core result to get the ratio between the two, halved that ratio, then multiplied it by the single thread result. That is not ideal, but it does show something interesting. A note on the memory configuration of the Zhaoxin: we ran it with both single channel DDR4-2666 and dual channel DDR4-2400, and there were only minimal changes to any of our results. That lack of change, along with the fairly poor multithread scaling, suggests either that the cores are being bandwidth starved due to the poor L2, or that the cores simply cannot take advantage of the near doubling of bandwidth that dual channel memory gives them in Cinebench R11.5.

Cinebench R20


Moving on to a more modern benchmark in the form of Cinebench R20, this is a much more promising result for Lujiazui. Looking at the clock (and core) normalized results first, Lujiazui is getting double the performance per clock of Isaiah, which for a generational increase is very impressive. However, it has to be noted that Cinebench R20 can take advantage of AVX acceleration, which no CPU tested here has other than the Zhaoxin, which diminishes its win over Isaiah; and Lujiazui still loses to Merom once again. Something else to notice is that in this more modern application, Bobcat takes the lead over Isaiah even in the clock normalized results, the opposite of R11.5 where Isaiah beat Bobcat.

Geekbench 5

Isaiah versus Lujiazui

The Geekbench portion of this review will be split in to several sections with the first section being the comparison of Isaiah to Lujiazui.

·        Raw GB5 scores for Isaiah versus Lujiazui

·        Clock Normalized GB5 scores for Isaiah versus Lujiazui

·        Percentage values for clock normalized GB5

Looking at the raw Geekbench 5 scores, Lujiazui just runs away with the win, but the KX-6640MA also runs at over double the clock speed of the U4025, so you would expect Lujiazui to win. One place where Lujiazui has a huge lead over Isaiah is the AES subtest, and the reason for this is that Lujiazui has cryptographic acceleration while Isaiah, and every other CPU in this test, simply does not.

Now looking at the clock normalized results, Lujiazui pulls out wins for the most part, but there are a few subtests where there was minimal improvement over Isaiah, and even one subtest, face detection, where there was a regression in the clock normalized results. However, on average, Lujiazui is roughly 20 to 30 percent faster than Isaiah per clock, which is in line with Zhaoxin's claim of a 25% increase in performance per clock.

Lujiazui versus the Competition: Integer Subtests


Geekbench 5 Integer Subtests Charts

Comparing Lujiazui to the other architectures in Geekbench 5's Integer subtests shows that while Lujiazui is beating the pants off of Bobcat, it is still behind Merom, with the only Integer subtest where Lujiazui takes a win versus Merom being Navigation.

Lujiazui versus the Competition: Floating Point Subtests


Geekbench 5 Floating Point Subtests Charts

Comparing Lujiazui to the competition in Geekbench 5's Floating Point subtests is a little more positive than the Integer subtests: there are now 2 subtests where Lujiazui beats Merom, and against Bobcat in the FP subtests, Lujiazui just kicks it to the curb, no contest.

Lujiazui versus the Competition: Rounding out the Totals


In the end, even when factoring in the AES acceleration that Lujiazui has, Lujiazui still loses to a 15 year old architecture, which needless to say is not a good look for Lujiazui or Zhaoxin if they truly are trying to catch Intel and AMD.

The Future of Zhaoxin and their Architectures

Well….. this section has had to have a rewrite due to Centaur seemingly being acquired by Intel in the past week or so. The relationship between Centaur and Zhaoxin is a complex one that we outlined in Part 1 of this series, but the short of it is that Zhaoxin uses modified versions of Centaur's architectures rather than making a new design completely from scratch. With Intel grabbing the people from Centaur's design office in Austin, the future for Zhaoxin is not looking too bright at the moment; however, the details of the deal are far from clear right now.

However, let's assume that at the very least Centaur sent the design for their CNS microarchitecture over to Zhaoxin before the buyout by Intel. The CNS architecture would be a very large uplift over Lujiazui, with Geekbench 5 giving CNS roughly a 2x increase in performance per clock over Lujiazui. However, in Geekbench 5, CNS has roughly the same performance per clock as Sandy Bridge, a decade old architecture; while that is an improvement over Lujiazui, it is still not close to where Intel and AMD are in terms of performance per clock.

Now, if Centaur did not transfer CNS's IP to Zhaoxin, then Zhaoxin is going to have to iterate on Lujiazui, which is even further behind Intel and AMD than CNS is. However, I think Zhaoxin can take some cues from Jaguar's improvements over Bobcat, which yielded a roughly 50 percent performance per clock increase. What AMD did was increase the size of certain structures in the core, like the AGU scheduler, and improve the L1 and L2 BTBs, along with reducing the latency of certain math operations. AMD also widened the FPU pipes from 64b to 128b; that is already the case for Lujiazui, but Zhaoxin could make the FPU pipes 256b wide to improve AVX operations, which are a large weak spot for Lujiazui.

Now, something interesting we were told by AtopNUC, the makers of the Zhaoxin system we used for our testing, is that Zhaoxin apparently will be launching a new CPU series later this year. What that series could be, I am unsure, but my best guess is that it could be Lujiazui with proper AVX2 support, considering that there is a GCC optimization target with roughly the ISA level of Haswell but a few missing instructions, which excludes Intel and AMD CPUs from the possible CPUs it could be.

Regardless, if Zhaoxin does want to try and match AMD and Intel, then they have a long road ahead of them. However, they did achieve the 25 percent performance per clock improvement over Isaiah that they claimed, so kudos to them for achieving that feat.

If you like our articles and journalism and you want to support us in our endeavors then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way.


VIA Part 4 – A Deep Dive into Centaur’s Last CPU Core: CNS

March 23, 2022, by Cheese and clamchowder

The x86-64 instruction set powers the vast majority of PCs, consoles, and servers. However, the number of x86 licensees has always been small, so it's important to keep track of the few that are left. While Intel and AMD have been at each other's throats chasing the performance crown, VIA, the former owner of Centaur, has quietly targeted the lower power laptop and embedded market. We've covered a couple of their previous chips, like the VIA Nano and Zhaoxin LuJiaZui. However, VIA hasn't made any attempt to go head to head with Intel or AMD's top end designs since the cancellation of the then newly acquired Cyrix's Jalapeno core in 1999.

In 2019 things changed and VIA again set their sights on higher performance targets. Following in the footsteps of their now defunct subsidiary Cyrix, the Centaur team announced an x86 core called “CNS”. Unlike Nano, CNS targets server applications and prioritizes high performance with high IPC and AVX-512 support. That puts CNS in a very interesting position. It not only represents a shift away from VIA’s strategy of targeting the low power market, but also stands out as the first non-Intel microarchitecture to implement AVX-512.

From Centaur’s announcement of their CNS architecture and CHA (SoC with Ncore AI accelerator + CNS)

But late last year, news broke that Intel was buying the Centaur design house from VIA. This sent shock waves throughout the tech world, because Centaur was the only remaining high performance x86 design house besides Intel and AMD.

A CHA chip obtained by Brutus from Centaur’s closeout auction

Unfortunately, that most likely killed the CNS effort. However, thanks to Brutus along with our wonderful patrons and supporters, we managed to get one of the few CHA chip samples. In this article, we're going to take a deep look at the CNS microarchitecture, and see what could have been if it had actually released.

Overview and Block Diagram

Centaur’s CHA chip comes with eight CNS cores along with a machine learning accelerator, called the NCore. To feed those components, CHA has a 16 MB last level cache, and a quad channel DDR4 memory controller. All of that is implemented on a 194 mm2 die, with TSMC’s 16nm process:

CHA die layout, from Centaur's presentation at the Linley Spring Processor Conference

We’re going to focus on the CNS cores, because there were no NCore drivers released publicly or that we could find. Here’s what the architecture looks like from our testing:


Our CNS chip runs at 2.2 GHz, which is impressive for an engineering sample chip when the production chips targeted 2.5 GHz. For testing, it’s set up with quad channel DDR4-3200 memory. For comparison purposes, we ran tests on an Azure NC12 instance. That has a Xeon E5-2690 v3 with SMT disabled, and appears to be running at 3 GHz. However, we’re not sure what kind of memory it has.

Frontend: Branch Prediction Accuracy

Centaur refers to the CNS as "Haswell-Class"

Centaur Adds AI to Server Processor, by Linley Gwennap

Since Centaur says the CNS core is Haswell class, that’s what we’ll compare it with. CNS’s predictor can recognize pretty long patterns, but generally falls short of Haswell’s capabilities. However CNS does appear to have ample storage for branch history. With 512 branches in play, it can keep up excellent prediction accuracy with repeating history lengths up to 24-long. Haswell in contrast falls apart once the pattern length exceeds 16.

CNS’s direction predictor is much more powerful than the one in Nano. Centaur made significant progress since their last attempt at taking on the x86 market. But Intel is a juggernaut with lots of engineering muscle, and they’re not so easy for a small company to catch.

Indirect Branch Prediction

Indirect branches can jump to different places, and add another layer of difficulty to branch prediction. The predictor has to track all of those targets, and pick between them. Here, we’re only looking at how many targets the indirect predictor can track.
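
Conceptually, the test cycles each indirect branch through a fixed rotation of targets, something like this sketch (with `targets` holding small dummy functions; the real tests also vary the number of branch sites):

```c
typedef void (*target_fn)(void);

// One indirect call site cycling through n_targets destinations in a fixed order.
// A capable indirect predictor learns the rotation; time per call jumps once the
// number of (branch, target) combinations exceeds what the predictor can track.
void indirect_probe(target_fn *targets, int n_targets, long iters) {
    int idx = 0;
    for (long i = 0; i < iters; i++) {
        targets[idx]();          // the indirect branch under test
        idx++;
        if (idx == n_targets)    // fixed, learnable rotation of targets
            idx = 0;
    }
}
```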

Additional time taken per indirect branch when it cycles though multiple targets, instead of hitting the same target every time

For a core of its size, CNS has impressive indirect branch prediction capabilities. We saw up to 1024 indirect targets handled without much of a penalty, with 256 branches and four targets per branch. Haswell does better when tracking a few branches with a lot of targets, but CNS is better with a lot of branches and a few targets per branch.

Call/Return Prediction

Call and return pairs are a special case of indirect branches, because returns usually go back to where the call came from. To accelerate return prediction, Centaur uses a 7 entry return stack. For comparison, Haswell has a 16 entry return stack, and Zen has 31 entries. CNS’s return stack is small, and it’ll suffer in code with deeply nested calls.

Frontend: Branch Prediction Speed

To speed up branch handling in the frontend, CNS uses a complex, multi-level cache of branch targets. Haswell and CNS can both track 128 branches and handle them with no fetch bubbles. They also look like they hit BTB miss penalties after more than 4096 branches are in play, but the similarities end there.

With under 16 branches, CNS can sustain two taken branches per cycle. It therefore joins Rocket Lake and Golden Cove in an exclusive club of CPUs that can do more than one taken branch per cycle. Intel likely does this by unrolling loops within its micro-op queue, called the LSD (loop stream detector) in Intel’s documentation. Centaur probably has a similar mechanism. Otherwise, they’d need a dual ported instruction cache to achieve such impressive branching performance.

Once we move past 128 branches into the main BTB, Intel and Centaur’s approaches diverge again. CNS seems to use a BTB tied to the L1 instruction cache. Once the loop size exceeds 32 KB, we see a sharp jump in cycles taken per branch. Within the L1i, CNS can do one branch every three cycles. In other words, two fetch cycles are wasted after a taken branch. Perhaps this indicates that CNS’s L1i has 3 cycles of latency. Haswell uses a more modern decoupled BTB, which can track 4096 branches regardless of spacing or L1i hit/miss. Intel’s implementation is also faster, and only wastes one fetch cycle after a taken branch.

Some comparisons using branches spaced by 16 bytes

For perspective though, CNS does a lot better than AMD’s Zen 2, at least when the two are running at similar clock speeds. Zen 2 can do zero-bubble taken branches with a 16 entry L0 BTB. But after that, it takes steep penalties when it takes branch targets from its larger, slower L1 and L2 BTBs.

Frontend: Instruction Fetch and Decode

Testing L1 instruction cache bandwidth with 8 byte NOPs, which are representative of vector instructions that tend to be longer (because of prefixes)

MPR’s article says CNS can fetch 32 bytes per cycle from the L1 instruction cache. However, we were only able to achieve this for the 2 KB test size. CNS’s instruction fetch performance ends up looking a bit like Haswell’s, where 32B/cycle fetch can only be achieved for small code sizes, from the uop cache.

Testing instruction fetch/decode throughput with 4 byte NOPs, which are more representative of non-vector workloads

From L2, CNS can sustain around 16 bytes of code per cycle, which is still enough to sustain 4 IPC unless code is dominated by vector instructions. Throughput makes a sharp drop once instructions spill out of L2, but that’s typical behavior for many designs, including Haswell.

CNS can sustain 5 NOPs per cycle, as long as they’re in a tight loop with no more than 24 instructions.

MPR’s article says CNS has a predecode stage that can handle four instructions per cycle. Predecoded instructions are placed into an instruction queue, which feeds the main 4-wide decoder.

One possible explanation for the behavior above is that the instruction queue has 24 entries, and can act as a loop buffer. If code fits within this loop buffer, it bypasses predecode restrictions and can run at 5 IPC, as long as the main decoder can fuse a pair of instructions. This is quite different from Haswell, where the predecode stage is 6-wide, and the loop buffer is behind the main 4-wide decoder.

Like Haswell, CNS can fuse conditional jumps with a previous instruction that sets flags, including arithmetic operations. Unlike other CPUs we’ve looked at, it can also fuse NOPs with adjacent instructions. A fused pair is tracked as a single micro-op in the backend.
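
As a concrete example of the first case, this is the kind of pair the decoders can fuse into a single micro-op (illustrative inline assembly; the loop is just a decrementing counter):

```c
static inline void fused_count_down(long n) {
    asm volatile(
        "1:\n\t"
        "sub $1, %[n]\n\t"   // arithmetic operation that sets flags
        "jnz 1b\n\t"         // conditional jump that can fuse with the sub above
        : [n] "+r"(n)
        :
        : "cc");
}
```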

Rename/Allocate Stage Optimizations

CNS’s renamer can recognize when an operation will always generate a zero, like when a register is xor-ed with itself or subtracted from itself. For these cases, it can tell the scheduler that it doesn’t need to wait for inputs, allowing more IPC to be extracted. It’s on par with Haswell in that respect.

Unlike Haswell, CNS doesn’t seem to have move elimination capabilities. A chain of dependent register to register move instructions will execute at one per cycle.

Backend: Out of Order Resources

Like any modern high performance CPU, CNS has large buffers to enable out of order execution. Key structures like the register files, ROB, and memory ordering queues are roughly comparable to Haswell’s. Even the scheduler and branch order buffer have similar sizes. Centaur was really targeting Haswell-level performance with this core.

Structure | Instruction needs an entry if it… | Centaur CNS | Intel Haswell | Intel Skylake-X
ROB | Exists | 192 | 192 | 224
Integer Register File | Writes to an integer register (GPR) | 146 (130+16) | 168 (136+32) | 180 (150+32 measured)
FP/Vector Register File | Writes to a 256-bit (or smaller) AVX register | 144 (80+64) | 168 (136+32) | 168 (148+32 measured)
FP/Vector Register File, 512-bit | Writes to an AVX-512 (ZMM) register | 40 | N/A | 168 (148+32 measured)
AVX-512 Mask Register File | Modifies an AVX-512 mask (K) register | 138 (130+8), aliased to integer RF | N/A | 128
Load Queue | Reads from memory | 72 | 72 | 72
Store Queue | Writes to memory | 46 | 42 | 56
Branch Order Buffer | Affects control flow | 46 | 48 | 64
Scheduler | Is waiting on an execution unit | 64 | 60 | 97

But CNS’s headline feature is AVX-512 support, so let’s look deeper into that. It’s nothing like Skylake-X’s fully featured AVX-512 implementation. Instead, Centaur still uses 256-bit vector registers, and splits 512-bit instructions into two micro-ops. That means CNS doesn’t gain throughput or reordering capacity from using AVX-512.

It could still benefit from AVX-512’s masking features, but that’s problematic as well. On CNS, those mask registers and general purpose integer registers both competitively share the same renamed register file. That’s not great, because CNS doesn’t have a particularly large pool of integer registers to begin with (it’s slightly smaller than Haswell’s). Combined with how 512-bit results consume two vector registers, CNS’s reordering capacity could see much lower limits when AVX-512 is in use.

Backend: Execution Units

Integer Execution

Centaur has not skimped on CNS’s integer units. The core has four ALU pipes like Haswell, but specialized execution units are duplicated across more of CNS’s pipes.


All four of CNS’s ALU pipes can do rotate and shift operations, compared to two on Haswell. Complex bit manipulation operations like PDEP and PEXT can execute at two per cycle on CNS, while Haswell only has one pipe for that. Haswell and CNS can both do integer multiplication with 3 cycle latency, but CNS has two integer multipliers to Haswell’s one. While both CPUs have four ALU pipes on the surface, CNS’s pipes are more flexible, and could enable better throughput especially for applications that take advantage of specialized instructions.
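For reference, these are the BMI2 instructions in question, shown through their intrinsics in a toy example of ours (requires BMI2; the masks are arbitrary):

```c
#include <stdint.h>
#include <immintrin.h>   /* _pdep_u64 / _pext_u64 (BMI2) */

/* _pext_u64 gathers the bits of x selected by a mask into the low bits
   of the result; _pdep_u64 scatters low-order bits of y out under a
   mask. Two independent operations like these could issue in the same
   cycle on CNS, but would serialize on Haswell's single pipe. */
uint64_t pdep_pext_demo(uint64_t x, uint64_t y) {
    const uint64_t even = 0x5555555555555555ULL;
    uint64_t lo = _pext_u64(x, even);    /* compress the even-numbered bits of x */
    uint64_t hi = _pdep_u64(y, ~even);   /* expand y into the odd bit positions  */
    return lo | hi;
}
```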

Vector and Floating Point Execution

On the vector execution side, CNS's setup looks a lot like Haswell's, except CNS uses separate vector execution pipes instead of having the integer execution ports do double duty. Vector integer execution units are spread across three pipes, while FP operations have two pipes to pick from. Again, Centaur doesn't skimp, and execution units are duplicated across more pipes than on Haswell.


CNS’s floating point unit can execute two 256-bit FP additions or multiplies per cycle, at 3 cycle latency. Haswell’s FP multiplication latency is worse, at 5 cycles. And, Haswell can only do a single FP add operation per cycle at 3 cycle latency. Funnily enough, you could match CNS’s FP addition throughput on Haswell by using FMA ops with a multiplier of 1, albeit at higher latency. Fused multiply add execution is nearly the same across both architectures, with 2×256-bit throughput and 5 cycle latency.

Centaur’s vector integer execution is also more capable. All three pipes can do vector integer addition, while only two of Haswell’s can. Like on the scalar integer side, CNS’s vector side has two integer multipliers, versus one on Haswell. Depending on the exact multiply operation in question, Haswell’s performance can degrade further. For example, pmulld (vector multiplication with 64-bit elements) executes at half rate with a latency of 10 cycles. CNS executes the same operation at ~1.68 per cycle, with 3 cycle latency.

Address Generation

Address generation is one of CNS’s only weaknesses on the execution unit front. CNS has two AGU pipes, each capable of handling either loads or stores. Haswell has three AGUs, allowing it to do two loads and a store in the same cycle.


Centaur still has a few tricks up its sleeve though. It can write 64 bytes per cycle by executing either a single AVX-512 store, or two 256-bit AVX ones. That gives it twice as much store bandwidth as Haswell. In terms of maximum L1D bandwidth, CNS can theoretically hit 128 bytes per cycle with a 1:1 mix of reads and writes. We weren’t able to hit that in our tests, but we did get over 90 bytes per cycle. That’s close to Haswell’s 96 B/cycle theoretical max.

In practice, Haswell’s triple AGU setup probably has a slight edge. Most applications have far more loads than stores, and Haswell’s extra store AGU will reduce pressure on the two general purpose AGUs. This advantage would be minimal though.
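To illustrate the store side, the two 64-byte-per-cycle store patterns mentioned above look like this in intrinsics form (a simplified sketch of ours; it assumes a 64-byte-aligned buffer, a length that's a multiple of 64, and AVX/AVX-512 support):

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* One 512-bit store per 64 bytes... */
void fill_avx512(uint8_t *buf, size_t len, __m512i v) {
    for (size_t i = 0; i < len; i += 64)
        _mm512_store_si512((void *)(buf + i), v);
}

/* ...or two 256-bit stores per 64 bytes. CNS's two store-capable AGU
   pipes can retire either pattern at 64 bytes per cycle. */
void fill_avx2(uint8_t *buf, size_t len, __m256i v) {
    for (size_t i = 0; i < len; i += 64) {
        _mm256_store_si256((__m256i *)(buf + i), v);
        _mm256_store_si256((__m256i *)(buf + i + 32), v);
    }
}
```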

Memory Ordering and Store Forwarding

CNS features a rather sophisticated load/store unit. Unlike the VIA Nano and Zhaoxin’s LuJiaZui, it can speculatively execute loads ahead of stores with unknown addresses.

Centaur also has a robust store forwarding mechanism. All cases where a load is completely contained within a prior store are handled with a latency of 7 cycles. It’s also able to complete two loads and two stores per cycle if they’re independent and neither cross a 64 byte cache line boundary. We don’t see that from Intel or AMD until Sunny Cove and Zen 3.

Results from our implementation of Henry Wong's store forwarding test. 64-bit store offset goes down vertically, and 32-bit load offset goes across

However, latencies are a bit high, especially if forwarding fails. If the load only partially overlaps the store, latency jumps to 21 cycles. Store forwarding latency increases by a cycle if the load crosses a 64 byte cache line. If a load crosses a cache line boundary and store forwarding fails, there’s a 6 cycle penalty.

Results from a run on Haswell

Haswell seems to do a fast check at four byte granularity. If the load and store both access the same 4 byte aligned region, it does a more thorough check that causes a half cycle penalty. That slight penalty applies even if there’s no overlap. Successful store forwarding on Haswell has a latency of 5.5 cycles, while failed store forwarding costs 15. Both latencies are lower than CNS’s, which is quite impressive considering Haswell’s higher clock speeds.

If the load crosses a cache line boundary, Haswell’s store forwarding takes an extra two cycles, and the penalty for a failed store forward increases by one cycle.
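For reference, store forwarding tests like these boil down to timing a dependent store-then-load chain while sweeping the two offsets. Here's a heavily simplified sketch (ours, not Henry Wong's code); the unaligned casts and volatile accesses are only there to keep the example short, and it reports TSC ticks rather than core cycles:

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

/* A 64-bit store at one offset, then a dependent 32-bit load at another.
   The loaded value feeds the next store, so iterations serialize on the
   store-to-load path. Sweeping store_off and load_off across a cache
   line maps out the different forwarding cases. */
static uint8_t buf[256];

double sf_latency(size_t store_off, size_t load_off, uint64_t iters) {
    volatile uint64_t *st = (volatile uint64_t *)(buf + store_off);
    volatile uint32_t *ld = (volatile uint32_t *)(buf + load_off);
    uint64_t v = 1;
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iters; i++) {
        *st = v;            /* 64-bit store                      */
        v = *ld + 1;        /* 32-bit load, depends on the store */
    }
    uint64_t ticks = __rdtsc() - start;
    __asm__ volatile("" : : "r"(v));        /* keep the chain live */
    return (double)ticks / (double)iters;   /* TSC ticks per store+load pair */
}
```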

Backend: Cache and Memory Access

| Cache | Centaur CNS | Intel Haswell | Intel Skylake-X | Intel Golden Cove | AMD Zen 2 |
|---|---|---|---|---|---|
| L1 Instruction Cache | 32 KB 8-Way, 3 Cycle? | 32 KB 8-Way | 32 KB 8-Way | 32 KB 8-Way | 32 KB 8-Way |
| L1 Data Cache | 32 KB 8-Way, 5 Cycle | 32 KB 8-Way, 4 Cycle | 32 KB 8-Way, 4 Cycle | 48 KB 12-Way, 5 Cycle | 32 KB 8-Way, 4 Cycle |
| L2 Cache | 256 KB 16-Way, 13 Cycle | 256 KB 8-Way, 12 Cycle | 1024 KB 16-Way, 12 Cycle | 1280 KB 10-Way, 15 Cycle | 512 KB 8-Way, 12 Cycle |
| L3 Cache | 16 MB 16-Way, 56 Cycle | 30 MB, 49 Cycle | 35.75 MB, 55 Cycle | 30 MB, 67 Cycle | 16 MB 16-Way, 39 Cycle |

Haswell and Skylake L3 parameters vary. The E5-2690 v3 was used for Haswell here, and the Xeon 8171M (on Azure) was used for Skylake

For the most part, Centaur’s CNS suffers from higher cache latencies than Intel’s Haswell, even when the latter is in a server platform. That’s partially because Haswell is running at 3 GHz, while CNS only runs at 2.2 GHz.

Actual time on the left, core clocks on the right

But Centaur falls behind even if we normalize for clock speed. With 5 cycles of latency, Centaur’s L1D is slow. Its L2 is also a cycle slower than Haswell’s. In the L3 region, Centaur again loses both in terms of cycles and absolute time. It only pulls a slight win when we get out to memory.

Things look a bit better for both CPUs if we use 2 MB pages to avoid address translation penalties, but the picture is largely the same. 24 ns of latency is slightly sub-par for a ring servicing just eight cores. Looking at differences in L3 latencies, we can also see that an L2 TLB access takes an extra eight cycles.

Actual time on the left, core clocks on the right

In terms of clock cycles, it’s not too much higher than Haswell-E. But the E5-2690 v3 has more cache, more cores, and runs at a higher clock. Intel has done a very impressive job of scaling up the ring interconnect while keeping latency under control.
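The latency numbers above come from pointer chasing. A rough sketch of such a test, with the buffer backed by 2 MB pages to keep address translation out of the way, might look like the following (Linux-specific, assumes preallocated hugepages and a buffer size that's a multiple of 2 MB; warm-up and error handling are omitted):

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <x86intrin.h>

/* Pointer-chasing latency sketch. The buffer sits on 2 MB pages so TLB
   misses don't inflate the numbers, and the chain is a random cyclic
   permutation so hardware prefetchers can't follow it. */
double chase_latency(size_t bytes, uint64_t iters) {
    size_t n = bytes / sizeof(void *);
    void **arr = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (arr == MAP_FAILED) return -1.0;

    /* Sattolo's algorithm: a random permutation with a single cycle */
    size_t *idx = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        arr[idx[i]] = &arr[idx[(i + 1) % n]];
    free(idx);

    void **p = arr;
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        p = (void **)*p;                 /* serialized dependent loads */
    uint64_t ticks = __rdtsc() - start;
    __asm__ volatile("" : : "r"(p));     /* keep the chain live */

    munmap(arr, bytes);
    return (double)ticks / (double)iters;   /* TSC ticks per load */
}
```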

Bandwidth

CNS’s L1D can do a 64-byte load and a 64-byte store every cycle. We weren’t able to get close to its theoretical bandwidth even with read-modify-write or copy patterns (which give a 1:1 load-to-store ratio), but at least we got more than 64 bytes per cycle.

Actual bandwidth to the left, bytes per cycle to the right

Read bandwidth barely drops when going from L1 to L2, staying at just under 64 bytes per cycle.

Actual bandwidth to the left, bytes per cycle to the right

Even Intel’s latest cores can’t sustain that amount of L2 read bandwidth. That’s an impressive performance from Centaur’s architecture. Once we get to L3 though, bandwidth takes a sharp drop.

I don't trust my copy bandwidth measurement just yet. We're still working on expanding our benchmark capabilities.

With all eight cores loaded, we see over 1.6 TB/s from CHA’s L1 data caches with a memory copy pattern. L2 bandwidth is highest using a read pattern, at just under 1.1 TB/s. L3 bandwidth hits about 325 GB/s. Finally, we get just above 55 GB/s from memory, with a copy pattern.

On-Die Interconnect and System Architecture

lstopo output for the CHA server we tested on

Centaur uses a ring interconnect to connect cores with the L3 cache, and off-die IO. Each ring stop can move 64 bytes per cycle in each direction – twice that of Haswell’s.

Bandwidth Scaling

With all eight cores loaded, CNS averages 97.4 bytes per cycle, while Haswell-E gets 81.62 bytes per cycle. CNS's wider ring does help with bandwidth, but it's not enough to offset Intel's clock speed advantage.

As a consolation prize, Centaur’s ring-based L3 is able to provide more bandwidth under heavy load than Ice Lake’s mesh-based cache. Mesh based interconnects tend to have issues clocking up without consuming tremendous amounts of power, resulting in high latency and low bandwidth. Ice Lake’s is no exception.

CHA has a quad channel memory controller that supports DDR4-3200, but memory bandwidth isn’t too impressive. 53 GB/s is well short of the theoretical 102.4 GB/s. Getting close to theoretical bandwidth is hard, because DRAM bandwidth gets lost to refresh cycles and read/write turnarounds. But 52% of theoretical is way too low.

Haswell-E doesn’t do much better. Its first generation DDR4 controller has trouble running at high speeds. Still, Intel’s relatively old architecture shows better memory bandwidth scaling as more cores are loaded. One thought is that CHA wasn’t optimized with CPU memory performance in mind. Centaur has a large NPU taking up a significant chunk of die area, capable of nearly 7 bfloat16 TFLOPs. That could demand a lot of memory bandwidth, especially with the CPU cores active at the same time. Also, the quad channel memory controller could allow more DIMM slots, making it easier to install lots of memory without needing very expensive high capacity modules.

Locks and Cache Coherency

Cache coherency is handled pretty competently with a mechanism that’s likely tied to the L3 slices. The time taken for a core to see a write from another depends on how close both cores are to the L3 slice that the cache line is homed to:

A 4K aligned cacheline seems to be closest to core 7

Lock latencies aren’t bad, especially considering CNS’s low clock speeds. Haswell-E ends up being in the same ballpark. On one hand, Intel benefits from higher clocks. But on the other, it’s using a dual ring setup to connect more cores and more cache. More hops across the interconnect generally translates to higher latency.

Intel’s dual ring setup is pretty darn good

Clam’s Take

In a vacuum, CNS’s architecture shows how far the Centaur design team has come. CNS is wider and has more reordering capacity than any previous VIA or Centaur architecture. It also implements a grab bag of new microarchitecture features, showcasing the small Centaur team’s design prowess. Loads can be hoisted ahead of stores with an unknown address. Store forwarding is very robust, with behaviour that resembles Sunny Cove’s. There’s a large, unified, multi-ported scheduler. Vector execution units are very powerful for a core of its size. Multiple sets of architectural registers (GPR and AVX-512 mask) are aliased to the same physical register file for better area efficiency. All of this is done in a very compact core:

Approximate sizes for Centaur CNS and various Intel cores. CHA die photo from Centaur's presentation, Haswell photo from Cole L, Skylake-X photo from Fritzchens Fritz, Golden Cove from Intel's press materials

There’s progress at the system level too. Centaur designed a modern ring interconnect to link the cores with cache and IO. That interconnect enables a large, shared L3, and lets its bandwidth scale to meet the demands of eight cores. Finally, the CHA chip’s quad channel DDR4 controller and 44 PCIe lanes give it more off-chip bandwidth than any previous Centaur design.

But CNS doesn’t exist in a vacuum. Haswell level IPC is great. What’s not great (for Centaur) is that Haswell clocks much higher. Intel also refined their ring bus over several generations, letting it support higher core counts and scale to higher bandwidth, even with narrower links between ring stops. Centaur’s architecture, while ambitious, would have a tough time against Haswell.

Worse, Centaur's CHA chip still wasn't on the market in 2021. That put it up against Intel's Ice Lake-SP based Xeons and AMD's Zen 3 based EPYC chips. Centaur was a small company with limited resources, and was a process node behind. In a pure CPU versus CPU battle, they stood no chance.

CNS really starts falling behind as other designs get a process node advantage. By the time Zen 2 comes around, AMD has a smaller core with higher IPC and higher clocks. AMD core photos are from Fritzchens Fritz

But Centaur knew that. That’s where NCore comes in. It’s a beefy machine learning accelerator capable of 6.8 trillion bfloat16 operations per second. Centaur wanted to pair their CNS architecture with NCore to create a uniquely competitive product. Core for core, CNS can’t take on Skylake or Zen 2, but it’s more than adequate for driving an ML accelerator. For context, Intel targeted 5G base stations (which counts as “edge”) with Snow Ridge. Snow Ridge has Tremont cores running at 2.2 GHz. ARM targets edge applications with Neoverse E1. CNS would stand up well against them, especially with vectorized workloads.

Placing the NCore on-die also reduces latency, and leaves PCIe lanes free for other IO. On the edge, that'd undoubtedly be a ton of networking bandwidth. CNS and NCore together make sense if you want a server with powerful inference capabilities but more moderate demands for CPU performance, and don't have the space or power budget for external ML accelerators.

I suppose that market never materialized. Or if it did, it wasn’t enough to save Centaur. Then, Intel snapped up Centaur’s design team, because those engineers are very good at doing a lot with limited resources. CNS is a testament to that.

Final Words

In conclusion, while it is sad to see a CPU design house close up shop, Centaur would not have competed with either AMD or Intel. Even if Centaur had launched a system with the CNS core in 2017, both Intel and AMD had more expandable platforms in the form of Skylake-X and Zen 1 EPYC: CNS can only scale to 2 sockets with a total of 16 cores, versus Skylake-X's maximum of 224 cores across 8 sockets and EPYC's maximum of 64 cores across 2 sockets.

However, in Centaur's planned release year of 2020, as Clam said, things were not looking good for either Centaur or the CNS core. By late 2021 the outlook was even grimmer, and Centaur closed up shop via a buyout by Intel for 125 million US dollars.

Now, the buyout does not necessarily mean that the CNS core is dead. The wildcard in all of this is Zhaoxin. Zhaoxin has been very quiet throughout Intel's acquisition of Centaur, but they have proven that they can improve on existing Centaur designs with LuJiaZui.

In 2018, Zhaoxin claimed they would catch up to AMD's then fairly new Zen 1 architecture with their KX-7000 series of CPUs, and I suspect they were planning on using the CNS core to do it. This hypothesis has credible evidence supporting it: the prototype boards use the ZX-200 as their southbridge, and VIA transferred IP, including CPU IP, to Zhaoxin in October of 2020. Now, whether or not Zhaoxin can clock the CNS core to the 3.5+ GHz that a modern desktop CPU needs in order to be competitive is a different story.

Regardless of whether Zhaoxin could get CNS up to 3.5 GHz, the point is moot. This is not early 2019; it's early 2022, so this hypothetical 3.5 GHz CNS would not come close to either AMD's or Intel's current architectures, let alone what will be out in roughly 6 to 9 months.

But back to the buyout of Centaur by Intel, the 125 million dollar question is why? Why did Intel buy the Centaur design house? To be frank, we have no idea. Centaur was no threat to Intel or even AMD. There are a few possible reasons that I can think of:

· Intel wanted some IP owned by Centaur, such as NCore

· Intel wanted the Centaur engineers because, as Clam said, they seemingly can do a lot with very limited resources

· Intel wanted a good enough x86 CPU core on a TSMC node for projects where the fastest CPU core is not required and Intel wants to use x86 compatible cores

· And the most cynical reason: Intel just wanted one fewer company with the x86 license, although VIA still has the Cyrix license as far as I know

Unfortunately, the actual reason why Intel acquired Centaur may never be known, as Intel has been very tight-lipped about why they gobbled up Centaur. This is quite possibly the end of an era that most didn't know still existed. Maybe Zhaoxin will take up the mantle of the third high performance x86 design house, previously held by Cyrix and Centaur, or maybe we will be reduced to two high performance x86 design houses in AMD and Intel. I cannot predict the future, but what I do know is that this article is at its end.

If you like our articles and journalism and you want to support us in our endeavors then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way or if you would like to talk with the Chips and Cheese staff and the people behind the scenes then consider joining our Discord.
