The Weird and Wacky World of VIA, the 3rd player in the “Modern” x86 market
September 1, 2021 Cheese
Header Image credit goes to Martijn Boer.
In the world of x86 CPUs there are two major players, Intel and AMD. However, there is one other company (well, two, but that will be expanded on later) that designs and produces CPUs that are fully compatible with modern x86 extensions, yes even AVX, and that company is VIA. How VIA got the x86 license is a bit of a messy story that involves the purchase of both Cyrix from National Semiconductor and Centaur Technology from Integrated Device Technology, but in the end VIA ended up with an x86 license and is currently designing and developing x86 compatible CPUs.
VIA’s Cores through Time
When VIA acquired Cyrix and Centaur, there were several cores in design. From Cyrix there was a brand new core by the name of Jalapeno. Jalapeno was a spicy 2-wide out of order design with an on-die RIMM memory controller. Cyrix also had a more modest Cayenne (later renamed Joshua) core which was less a revolution and more an evolution of the MII. Meanwhile, VIA got the Samuel core from Centaur. The Samuel core and its many derivatives were used in the Cyrix III (later renamed to the C3 as it was not based on Cyrix tech) all the way up to the VIA C7.
The Samuel core, which dates back to 2000, was replaced in 2008 by the Isaiah core. Isaiah was a very ambitious design at the time: a low-power, fully out-of-order core was not trivial in 2008 considering the nodes of the time, and as a reminder, Intel’s Bonnell core, also from 2008, was a dual-issue in-order core. Yes, ARM did have an out-of-order core in 2008 in the form of the Cortex A9, but as pointed out in our article about how ISA doesn’t matter, A9 lost to Bonnell in terms of both power and performance, so for VIA to also try a fully out-of-order CPU in 2008 was an ambitious project.
Isaiah was first fabbed on Fujitsu 65nm and then later shrunk to TSMC 40nm. It was later shrunk again, from TSMC 40nm to TSMC 28nm, and renamed Isaiah II. VIA’s latest core, CNS, is built on TSMC N16 and has yet to be released in any shipping products. It is a very large jump compared to Isaiah. In the grand scheme of CPU design in the year 2021, however, it looks to be fairly lackluster, and is sadly nowhere near as ambitious as Isaiah.
VIA’s Isaiah: You Call This Low-Power?
Isaiah is very complex for a low-power core. That’s clear just from a mile high block diagram.
Most of the structure sizes here were determined using microbenchmarks
For comparison, here’s Intel’s Pentium III, which was discontinued just a year before Isaiah launched as VIA Nano.
Data compiled from a variety of sources, some of which are unofficial, and some of which don’t agree. Take this with a grain of salt, but it should be mostly accurate
Compared to Intel’s prior “big core”, VIA Nano is giant.
It’s deeper and wider with bigger L1 caches, but VIA’s engineers didn’t stop there. Let’s start with the branch predictor, where VIA took things to the next level.
The Complex Maze that is Isaiah’s Branch Predictor
The branch prediction unit (BPU) tells the core where to fetch the next instruction from. If there’s no branch, obviously you just keep fetching instructions in a straight line. But if there is, the BPU must decide whether the branch is taken. If it is taken, the BPU needs to determine where the branch is going (and preferably do so quickly to avoid keeping the instruction fetch unit waiting).
Let’s look at speed first. As described in our previous article, this test shows how fast the BPU can provide the branch target.
This graph shows how many branch targets can be tracked in a loop
VIA says they used a 4096 entry 4-way BTB to cache branch targets, and our results confirm that. For 2008, that’s insane. AMD’s Bobcat only has a 512 entry BTB. Even big cores like Intel’s Conroe and AMD’s K8/K10 only had 2048 entries. Intel didn’t bring out a 4096 entry BTB until Sandy Bridge in 2011.
But this huge BTB doesn’t come for free and ends up being a bit slow. Nano can only do one taken branch every three cycles. Put another way, the instruction fetch unit sits around doing nothing for two cycles (bubbles) after a taken branch while the branch predictor grabs the branch target out of the BTB.
Our experiments only show 1024 BTB entries for Core 2, but other research with more detailed testing concluded it had 2048 entries
Core 2 uses a smaller and faster BTB capable of handling a taken branch every two cycles. If there are four or fewer taken branches in play, Core 2 can do “zero bubble” branch prediction.
Now let’s look at Bobcat’s approach.
It looks like Bobcat’s BTB is tied to the instruction cache, as changing branch spacing affects how many branches it can cache. Branches spaced by 4 bytes were disastrously bad for Bobcat.
Bobcat’s BTB isn’t as big as VIA’s, but it’s slightly faster. In the best case, Bobcat can do a taken branch every two cycles. Unlike VIA, Bobcat is also sensitive to branch spacing and can only utilize its maximum BTB capacity if branches are moderately spaced. Dense branches cause catastrophically bad behavior for Bobcat. It looks even worse than missing the BTB, so maybe Bobcat’s BTB isn’t even tagged at 4 byte granularity and ends up feeding wrong branch targets to the frontend, causing mispredicts.
Now that we’ve covered how fast the branch predictor is, let’s see how well it can track patterns. That should correlate well with how accurate it is. Unfortunately, we can’t compare real world accuracy because we couldn’t get performance counters to work on the Nano.
The X axis shows how long the repeating pattern is for each branch, while the Y axis shows how many branches are in play. The Z axis shows how much longer each branch took on average when the pattern was random, versus all 0s (branch always taken). We’re looking at the ‘wall’ here, and how far it is away from the origin. The farther, the better.
This wacky looking graph is the pattern recognition graph for the VIA Nano U4025
Nano uses a very complex direction predictor with four Branch History Tables (BHTs). Three of them predict whether a branch is taken, while the fourth predicts which prediction to use. This ‘tournament’ style predictor with competing prediction methods isn’t new, but saw more use on very high performance, high power designs. For example, DEC’s Alpha 21264 (EV6) had three BHTs: one with local history, one with global history, and one for the meta predictor (that selects between the previous two). VIA didn’t disclose the exact prediction methods in play. But from the graph above, its pattern recognition abilities are quite impressive.
For perspective, compare this to Core 2, a contemporary “big core” design. Intel’s BPU uses an 8-bit Global History Buffer.
It’s quite clear here that Core 2 Duo’s BPU can’t track very long patterns
And as you can see, in this test, Isaiah’s BPU wins hands down. Bobcat lands somewhere in between. It’s not quite as good as Isaiah, but is better than Core 2.
Taking a step back, VIA preferred a more complex frontend design. The instruction cache is large with high associativity. Decode is 3-wide (wider than any low power x86 core until Intel’s Goldmont came out in 2016). And the branch predictor is incredibly sophisticated for 2008. For Bobcat, AMD opted for a narrower frontend with smaller caches and a less capable (but lower latency) branch predictor. VIA kept the same philosophy for Isaiah’s backend.
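To make the ‘tournament’ idea described above concrete, here is a minimal Python sketch of a selector choosing between two component predictors. It is not VIA’s design, which was never disclosed and uses three direction tables rather than two. The table sizes, the gshare-style indexing, and every name below are assumptions made purely for illustration.

```python
def bump(counter, up):
    """Saturating 2-bit counter update (values 0..3)."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    """Toy tournament predictor: two direction tables plus a chooser table.

    Hypothetical structure for illustration only, not Isaiah's real predictor.
    """

    def __init__(self, index_bits=10, history_bits=8):
        size = 1 << index_bits
        self.mask = size - 1
        self.history_mask = (1 << history_bits) - 1
        self.by_address = [1] * size  # indexed by branch address only
        self.by_history = [1] * size  # indexed by address XOR global history
        self.chooser = [1] * size     # >= 2 means "trust the history table"
        self.history = 0              # most recent outcomes, newest in bit 0

    def _indices(self, pc):
        return pc & self.mask, (pc ^ self.history) & self.mask

    def predict(self, pc):
        ai, hi = self._indices(pc)
        if self.chooser[ai] >= 2:
            return self.by_history[hi] >= 2
        return self.by_address[ai] >= 2

    def update(self, pc, taken):
        ai, hi = self._indices(pc)
        addr_guess = self.by_address[ai] >= 2
        hist_guess = self.by_history[hi] >= 2
        # Train the chooser only when the two component predictors disagree.
        if addr_guess != hist_guess:
            self.chooser[ai] = bump(self.chooser[ai], hist_guess == taken)
        self.by_address[ai] = bump(self.by_address[ai], taken)
        self.by_history[hi] = bump(self.by_history[hi], taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def run(predictor, pattern, iterations, pc=0x40):
    """Feed one branch a repeating pattern and return the fraction predicted correctly."""
    correct = 0
    for i in range(iterations):
        taken = pattern[i % len(pattern)]
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / iterations


if __name__ == "__main__":
    print(run(TournamentPredictor(), [True, True, True, False], 2000))
```

With a repeating taken, taken, taken, not-taken pattern, the per-address counter keeps mispredicting the not-taken iteration, so the chooser drifts toward the history-indexed table. A longer history register lets the toy model follow longer patterns, which is the property the pattern recognition graphs are probing.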
A Very Capable Out of Order Engine
A 65 entry reorder buffer (ROB) is massive for the time; for context, Bobcat has a 56 entry ROB and is 3 years newer. This beefiness carries over to the integer and floating point register files, where Isaiah has 46 integer and 48 SIMD/FP registers for renaming. Comparatively, Bobcat has 34 integer registers and 21 FP registers available for renaming.
The graph shows 46 and 48 renaming entries while our diagram above lists 62 and 64 entries for the same register files. That is because x86 has 16 architectural registers of each type that have to be tracked for exception recovery and can’t be used for holding speculative register values.
We got some mixed results with our microbenchmarks in regards to Isaiah’s ROB size. On Windows we observed a 48 entry ROB, but on Linux we saw a 65 entry ROB. We have not observed that kind of discrepancy with any other CPU in our testing; we went with the higher value because it’s clear that Isaiah’s ROB can hold more than 48 micro-ops.
AMD’s Bobcat didn’t emphasize FP/SIMD reordering capacity as much as VIA did. The difference here is especially exaggerated because we used 128-bit packed integer operations to test the FP/SIMD register file. Bobcat breaks those into 2×64-bit operations, and thus consumes two 64-bit registers for each 128-bit operation.
VIA says in the Isaiah whitepaper that the load and store queues are 16 entries each, and our microbenchmarks show this for the most part. Compared to Bobcat, Isaiah has 6 more load queue entries but 6 fewer store queue entries. Core 2 has 4 more load queue entries and double the store queue entries that Isaiah has.
The load/store unit is one of the only places where Isaiah looks normal for a low power architecture. It’s laid out differently than Bobcat, and has a slightly bigger store queue than Intel’s older Pentium III.
A Big Distributed Scheduler
We used instruction-to-port mappings from Agner’s instruction tables to determine the size of each scheduling queue
We couldn’t directly measure the store scheduler’s size. Unlike newer CPUs, Nano does not allow a load to execute ahead of a store with an unknown address. This restriction is understandable, as this memory dependency speculation just debuted in Intel’s Core 2 architecture. AMD didn’t do this until Bulldozer launched in 2011.
In the Isaiah whitepaper, VIA says there are a total of 76 entries across all of the reservation stations. To estimate the store scheduler’s size, we subtracted our estimates for the other schedulers from 76, to get 13 entries. Thus, the breakdown of the individual scheduling queues is as follows:
Relative to Bobcat, Isaiah has 14 more SIMD scheduler entries total, 7 more Integer scheduler entries, and 13 more Load/Store entries, or to put it another way “Bloody Huge”. Now, Core 2 has a fully unified 32 entry scheduler, which, while smaller than both Isaiah’s and Bobcat’s in terms of number of entries, is more flexible and can do well with fewer entries.
Nano’s Wide and Fast Execution Engine
Constructed using Agner’s Instruction Tables
For a low power architecture, Nano’s designers built a very powerful execution engine. The Media A (MA) and Media B (MB) ports feature 128-bit wide execution units and datapaths, matching Intel’s Core 2 in vector throughput. Low power cores didn’t have that kind of vector throughput until AMD’s Jaguar architecture debuted in 2013. The MA/MB schedulers together have 32 entries, more than the SIMD scheduler on AMD’s Bobcat and Jaguar.
VIA’s engineers didn’t stop at hitting big core FP/SIMD capability on a “low power” core. According to Agner’s instruction tables, floating point addition latency is just two cycles, faster than any contemporary (or modern) CPU. FP multiply latency is three cycles. For comparison, Skylake does floating point adds and multiplies with four cycle latency, and Zen 3 does both with three cycle latency.
Similarly, Nano’s L1D can return data in two cycles. If that’s not impressive enough already, remember that it’s with a 64 KB 16-way cache. Assuming address generation and TLB lookup takes a cycle, Nano is checking 16 tags and getting data back in one cycle. That’s insane. At the time, 3 cycle L1D latencies were typical, with 64 KB 2-way (Athlon) or 32 KB 8-way (Core 2) caches. Today, L1D latency is around 4 cycles (or 5 cycles on Intel’s Ice Lake and Tiger Lake cores).
Wrapping Up: Hitting the Wrong Target?
Isaiah is unique and very ambitious. Perhaps too ambitious for its own good.
In places Isaiah (remember, this is a “Low-Power” design) is bigger than Core 2, which came out a scant 18 months before Isaiah on a similar class of node. In almost all areas, Isaiah is also wider than Bobcat, which came out 3 years after Isaiah on a smaller node than Isaiah’s launch node.
Now to add some anecdotal evidence: when I was testing all 3 platforms for this article, the Nano output either similar amounts or seemingly more heat than the Core 2 Duo. For some context, the Nano was able to keep my coffee above lukewarm, which is quite impressive; the Bobcat system was easily the coolest of the three.
Clearly Nano wasn’t very low power. But it also wasn’t high performance, so Nano’s priorities need some explaining. I especially wonder why Nano’s design team used full 128-bit vector execution units, and made them very low latency. That can’t have been cheap in terms of power. One possibility is that VIA’s engineers targeted multimedia applications like video playback. That would explain why Nano’s FP/SIMD ports are named “Media A” and “Media B”. But hardware video decoders became very prevalent not long after Nano debuted, leaving Nano’s powerful vector units in an awkward spot.
Isaiah’s large, fast L1D and powerful branch predictor are also culprits. Low latency and high associativity scream high power and low clocks. The branch predictor is a bit more difficult to analyze. Nano went all out for accuracy, but paid the price in frontend latency. With 20-20 hindsight, we can see AMD reduce branch predictor related frontend bubbles generation after generation, and conclude that branch predictor speed is more important than VIA thought it was.
Ultimately, Isaiah represents an ambitious attempt by VIA to compete with AMD and Intel. There’s no doubt VIA had a capable and determined engineering team. But creating a successful CPU is all about anticipating important workloads and correctly tuning the architecture for them.
Intel and AMD correctly guessed that video playback would be offloaded. Also, Intel and AMD are larger companies with extensive simulation resources that can better guide engineering decisions. Both developed low power architectures that were narrower and less sophisticated than Isaiah, but more than made up for it with higher clock speed and careful resource allocation.
Now, in the very beginning I said that there is a fourth x86 design house, and that is true; however, Zhaoxin is a joint venture between VIA and the Shanghai Municipal Government. If you paid attention to our ISA doesn’t matter article you would know that we have a Zhaoxin CPU, in-house, ready for testing. Because of time constraints, the full in-depth dive into that CPU will have to wait for a part 2 to this article. However, from what testing we have done with Zhaoxin’s newest microarchitecture, Lujiazui, it’s more evolutionary than revolutionary.
The Weird and Wacky World of VIA Part 2: Zhaoxin’s not quite Electric Boogaloo
September 22, 2021 Cheese, clamchowder
In Part 1 of this piece we talked about the third x86 design house, VIA, and more specifically VIA’s most recent commercially available architecture, Isaiah. Today we are talking about the joint venture that VIA has with the Shanghai Municipal Government, Zhaoxin.
A Little Background History
Zhaoxin was started in 2013 as a joint venture between the Shanghai Government and VIA.
Chart of Zhaoxin’s CPU families and architectures
Based on the information on Wikichip, Zhangjiang is a minor modification of the Isaiah II core where the biggest changes were the addition of the Chinese hashing algorithms SM3 and SM4 to Padlock. The big architectural changes came with the Wudaokou core where Zhaoxin claimed a 25% performance per clock increase, which is no small feat.
The core we are looking at today is Zhaoxin’s latest, Lujiazui, which Zhaoxin claims has an up to 50% performance increase over Wudaokou. However, the clock speed increase is also up to 50%, which makes me suspect that Lujiazui is the Wudaokou core ported from HLMC’s 28nm to TSMC’s 16nm, with a nice clock bump as a consequence.
Lujiazui versus Isaiah
Instruction | Latency | Throughput |
128-bit FP Add | 3 clocks | 1 per clock |
128-bit FP Multiply | 3 clocks | 1 per clock |
256-bit FP Add | 5.8 clocks | 0.5 per clock |
256-bit FP Multiply | 5 clocks | 0.5 per clock |
Measured throughput and latencies for vector floating point operations on the Zhaoxin KX-6640MA
Zhaoxin’s AVX implementation on Lujiazui looks like a minimum effort job done only to ensure compatibility with newer software. Piledriver and Zen 1 also split 256-bit AVX instructions into two 128-bit micro-ops, but keep latency the same. And, they don’t suffer penalties beyond what you’d expect from an instruction that decodes into two micro-ops. It’s totally fine to use 256-bit AVX instructions on those two AMD architectures. You won’t get an increase in reordering capacity or math throughput, but you still get better code density.
On the other hand, 256-bit AVX should be avoided on Lujiazui. Any instruction density advantage is outweighed by a wombo combo of increased execution latency and reduced reordering capacity.
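The throughput side of that advice can be checked with a little arithmetic on the measured table. The sketch below assumes 32-bit floating point elements (the table does not state the element size), and the helper name is hypothetical.

```python
# Measured on the Zhaoxin KX-6640MA, copied from the table above:
# name -> (vector width in bits, latency in clocks, instructions per clock)
MEASURED = {
    "128-bit FP Add": (128, 3.0, 1.0),
    "128-bit FP Multiply": (128, 3.0, 1.0),
    "256-bit FP Add": (256, 5.8, 0.5),
    "256-bit FP Multiply": (256, 5.0, 0.5),
}

ELEMENT_BITS = 32  # assumption: single precision elements


def elements_per_clock(width_bits, per_clock):
    """Peak elements processed per clock for one instruction type."""
    return (width_bits // ELEMENT_BITS) * per_clock


for name, (width, latency, per_clock) in MEASURED.items():
    rate = elements_per_clock(width, per_clock)
    print(f"{name}: {rate:.1f} elements/clock at {latency} clocks latency")
```

Under these numbers the 256-bit forms process the same number of elements per clock as the 128-bit forms, because doubling the width is cancelled by halving the issue rate, while the latency goes up. That leaves code density as the only potential gain.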
Isaiah had fairly small load and store queues, with both being 16 entries apiece. Lujiazui has increased the load and store queues to 24 and 22 entries respectively, delivering a generational improvement.
Like Isaiah, Lujiazui seems to lack memory dependence prediction. Loads can’t be reordered ahead of stores with an unknown address. This was a shiny new feature when it debuted in Core 2, but just about every high performance core can do it today.
With all the data we have gathered, we have been able to create an approximate block diagram of Lujiazui.
Wudaokou and Lujiazui are very much not just reskinned Isaiah cores. They are a significantly different design that trims and rebalances VIA’s original Nano design, letting it reach higher clock speeds with a moderate IPC increase.
But Lujiazui is definitely a low power core without high performance aspirations. A lot of changes seem to target die area and power efficiency rather than performance. A 25% IPC gain over a decade is nowhere near enough to catch AMD or Intel. In terms of per-core performance, contemporary high performance Intel and AMD CPUs completely sweep Lujiazui away.
In some ways, Lujiazui is Nano in reverse. Nano was a low power core on paper, but its architecture was beefier than low power cores that launched years later, and wasn’t far off contemporary high performance designs. Meanwhile, Lujiazui aims to compete with Intel and AMD’s big cores, but is nowhere near that. AMD’s Jaguar and Intel’s Goldmont would probably be better points for comparison.
Now, this article has gotten very long, so the benchmarks of the Zhaoxin versus the Nano will have to wait for a Part 3. But I will say that Zhaoxin’s claim of 25 percent more performance per clock isn’t just hot air.
If you like what we do and you would like to help us with acquiring more things to test, then head on over to our Patreon if you want to chip in a few bucks.
When mixing integer and 256-bit AVX instructions, we see one renamed register file used for both. With separate integer and FP register files, mixing integer and FP register file usage will increase your register-limited reordering capacity to the sum of both (which generally means ROB capacity becomes the limit). But with Lujiazui, mixing in 256-bit AVX instructions reduces reordering capacity, so integer and vector instructions are competitively sharing one register file.
And as we mentioned earlier, 256-bit AVX instructions seem to consume more than two ROB slots/renamed registers. If we have one 256-bit AVX instruction blocked from retiring by a long latency load, we can only get 45 integer adds in flight before a second load is blocked from entering the backend.
Even if there are no 256-bit AVX instructions pending retirement, simply having 256-bit values present in the YMM registers influences Lujiazui’s ROB/RF entry reclamation logic. As you can see from the black line, having YMM state introduces a spike shortly before we get to 48. That result was repeatable, and isn’t noise.
We were surprised to see a newer CPU with less reordering capacity, so we tried a batch of other things to see if we could get long latency loads to execute in parallel with more than 48 instructions between them.
X axis = instructions between loads, Y axis = iteration latency in nanoseconds
Of those attempts, only one using not-taken branches was successful (as noted earlier). Also, Zhaoxin can have about 20 taken branches pending retirement before the backend has to stall the frontend.
It’s interesting that Lujiazui’s branch tracking is decoupled from the ROB, even though we found no evidence of fast (checkpointed) mispredict recovery
Lujiazui also has incredibly weird behavior with 32-bit and 16-bit multiplies. Reordering capacity varies depending on the values being multiplied, so the CPU might be able to eliminate multiplies (like when an input is 1) when the scheduler is full.
Officially, Lujiazui supports AVX, but not AVX2. AVX2 adds 256-bit integer instructions, and it arrived alongside FMA (fused multiply add), which is technically a separate extension. As expected, FMA instructions generate a fault. However, 256-bit integer instructions appear to work correctly. Just like with AVX/FP, 256-bit integer instructions are decoded into two 128-bit micro-ops. Funnily enough, 256-bit packed integer multiplication doesn’t suffer additional latency compared to 128-bit ops.
Instruction | Latency | Throughput |
128-bit Integer Add | 1 clock | 2 per cycle |
128-bit Integer Multiply | 3 clocks | 1 per cycle |
256-bit Integer Add | 1.66 clocks | 1 per cycle |
256-bit Integer Multiply | 3 clocks | 0.5 per cycle |
Performance with unsupported (but working) AVX2 instructions
Also, 256-bit integer addition doesn’t reduce reordering capacity as severely as 256-bit floating point ops do. This is pretty strange. Zen 1 gives identical results regardless of whether the 256-bit registers are being accessed by integer or floating point instructions.
Because Lujiazui has two 128-bit integer ports, each with its own scheduling queue, 256-bit integer operations end up getting more scheduling capacity too.
It looks like Zhaoxin has actually done a decent job with 256-bit integer operations, but couldn’t expose them because they don’t support the other features that shipped alongside AVX2 (namely, FMA).
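On Linux, what a CPU officially advertises can be read from the `flags` line of `/proc/cpuinfo`, where `avx`, `avx2` and `fma` appear as separate entries. The sketch below parses that text. The function names and the sample string are made up for illustration, and it says nothing about instructions that work without being advertised, like the 256-bit integer ops discussed above.

```python
def parse_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def report(flags, wanted=("avx", "avx2", "fma")):
    """Map each wanted feature name to whether the CPU advertises it."""
    return {name: name in flags for name in wanted}


def host_report(path="/proc/cpuinfo"):
    """Read the running machine's flags (Linux on x86 only)."""
    with open(path) as handle:
        return report(parse_flags(handle.read()))


# Hypothetical sample text, not captured from a real Zhaoxin system.
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse sse2 avx\n"
print(report(parse_flags(SAMPLE)))
```

Calling `host_report()` on a Linux machine gives the advertised set for that machine. Finding out whether an unadvertised instruction actually executes still needs a test like ours that runs the instruction and watches for a fault.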
If you’re particularly eagle-eyed, you might have noticed that our FP scheduler graphs have multiple latency jumps. That’s because creating a dependency for a floating point instruction isn’t 100% straightforward. Integer dependencies are easy: just have test instructions consume the result from the long latency loads used to block retirement.
One way to create the dependency for FP instructions is to convert the integer load result to floating point (cvtsi2ss). But the cvtsi2ss instruction itself could consume a floating point scheduler slot. So for the most part, we used the result from a long latency load to index into a separate array of floating point values. For those tests, we’re looking at the final jump in latency, because that shows when only one load is executing in parallel.
Here’s a visual explanation, with the test loop unrolled (as it would be in the CPU’s backend)
Yeah, the graphs are for multiplies, but adds are shown. Same concept. Also, Zhaoxin is not fully overlapping loads like Zen 2. We get 96-100 “scheduler” entries for Zen 2 because we can’t differentiate between the scheduling queue and non-scheduling queue with this test
Assume the first two pointer chasing loads have completed. That is, the loads that put results into registers edi and esi. Loads marked with green lines are ones that are ready to execute.
The first, lowest latency stretch is when the CPU is only limited by memory latency and available instruction level parallelism. Then, iteration latency increases as the CPU’s out of order engine is able to see fewer loads (because the FP scheduler is filling up, preventing it from accepting more instructions).
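Reading a structure size off one of these curves comes down to finding where iteration latency steps up. The sketch below shows one way to do that. The 25% threshold, the function name and the sample points are all assumptions for illustration; the numbers are synthetic, not our measurements.

```python
def find_jumps(samples, ratio=1.25):
    """Return the instruction counts at which latency rises by more than `ratio`.

    samples: list of (instructions_between_loads, iteration_latency_ns),
             sorted by instruction count.
    """
    jumps = []
    for (_, previous), (count, current) in zip(samples, samples[1:]):
        if current > previous * ratio:
            jumps.append(count)
    return jumps


# Synthetic curve with two steps, shaped like the FP scheduler graphs.
SYNTHETIC = [(8, 100.0), (16, 100.0), (24, 150.0), (32, 150.0), (40, 200.0), (48, 200.0)]

jumps = find_jumps(SYNTHETIC)
print("all jumps:", jumps)
print("final jump:", jumps[-1] if jumps else None)
```

For the FP tests described above, the final jump is the one of interest, since earlier jumps can come from the extra instructions used to build the dependency. The granularity of the estimate is only as fine as the spacing between the tested instruction counts.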
November 16, 2021 Cheese
I’ll be blunt here: this part will seem like an anti-climax compared to Part 2 of this series. Still, I hope to wrap up the series nicely with this conclusion piece, covering what we know about how the changes discussed in Part 2 have affected the performance of the Lujiazui architecture, and the potential future of Zhaoxin products and microarchitectures.
There is no perfect benchmark that can spit out one single number telling you how much better CPU X is compared to CPU Y, because every workload is different and every use case is different. The benchmarks that we chose for this article were ones that either are freely available or ones for which we were able to get a license.
With that said, for this piece we used Cinebench R11.5, Cinebench R20, and Geekbench 5. Now, we wanted to run our standard suite of tests found in our N1 deep dive piece, Cinebench R15, and UL’s PCMark; however, the VIA system would not run either Cinebench R15 or PCMark. PCMark just refused to even open on the VIA, and Cinebench R15 would just crash on both the VIA and Zhaoxin systems, which we suspect is down to the poor driver support for the integrated graphics on these systems.
Also, please keep in mind that we had to clock normalize by either dividing or multiplying the clock speed of the CPU in question to an arbitrary clock speed. This is unfortunately not a good method for getting a performance per clock comparison; however, it was the best that we could do because we did not have control of the CPU clock ratios. We also had to divide the Zhaoxin’s R20 scores by two because we only had the multi thread scores, due to the VIA not being able to finish the R20 single thread test.
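As a rough sketch of that clock normalization, the helper below scales a score to a common reference clock. It assumes performance scales linearly with clock speed, which is exactly the weakness noted above, and the function names, scores and clock speeds are hypothetical placeholders rather than our results.

```python
def clock_normalize(score, actual_mhz, reference_mhz):
    """Scale a benchmark score to a reference clock, assuming linear scaling."""
    return score * reference_mhz / actual_mhz


def percent_gain(new, old):
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1) * 100


# Hypothetical numbers for two CPUs at different clocks.
cpu_a = clock_normalize(score=300.0, actual_mhz=2000.0, reference_mhz=1000.0)
cpu_b = clock_normalize(score=120.0, actual_mhz=1000.0, reference_mhz=1000.0)
print(f"CPU A per-clock advantage: {percent_gain(cpu_a, cpu_b):.1f}%")
```

Memory latency does not shrink when the core clock drops, so a real CPU running at the reference clock would usually do somewhat better per clock than this linear estimate suggests. That is the reason fixing the clock ratios directly would have been preferable.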
Starting off with Cinebench R11.5, which would have been a common benchmark around the time of the launch of the VIA Nano and the Intel Core 2 T7100: looking at just the raw single thread results, Isaiah is bringing up the rear with Lujiazui at the front of the pack. However, when we look at the single thread clock normalized results, things change. Isaiah now pulls ahead of Bobcat, and Merom pulls in front of Lujiazui. While our clock normalized results show that Lujiazui has about a 13 to 14 percent performance per clock increase over Isaiah, this is not a good start for Lujiazui. It is losing to an architecture that is a decade and a half old at the time of writing, even though Lujiazui is the newest architecture that Zhaoxin offers.
The story arguably gets even worse with the multithreaded results. Now, the KX-6640MA that we tested is a 4 core CPU and unfortunately Zhaoxin does not sell a 2 core version, so we do have an extrapolated result for a 2 core version. We got it by dividing the 4 core result by the single core result to get the ratio between the two, dividing that ratio by 2, then multiplying the halved ratio by the single thread result. This is not ideal, but it does show something interesting. Now, a note on the memory configuration of the Zhaoxin: we ran the Zhaoxin with both single channel DDR4 2666 and dual channel DDR4 2400, and there were only minimal changes to any of our results. That lack of change, along with the fairly poor multithread ratio, suggests either that the cores are being bandwidth starved due to the poor L2, or that the cores simply cannot take advantage of the near doubling of bandwidth that dual channel memory provides in Cinebench R11.5.
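The 2 core extrapolation described above can be written out step by step. The function name and the example scores below are hypothetical, chosen only to show the arithmetic.

```python
def extrapolate_two_core(single_thread_score, four_core_score):
    """Estimate a 2 core score from 1 thread and 4 core results, following the steps in the text."""
    scaling_ratio = four_core_score / single_thread_score  # multi-thread scaling on 4 cores
    halved_ratio = scaling_ratio / 2                       # assume half the cores give half the scaling
    return halved_ratio * single_thread_score


# Hypothetical scores, not our measured results.
single, quad = 1.0, 3.0
print("extrapolated 2 core score:", extrapolate_two_core(single, quad))
print("half of the 4 core score: ", quad / 2)
```

Algebraically the steps reduce to half of the 4 core score, so the estimate bakes in whatever scaling losses the 4 core run suffered. A real 2 core part with less contention for shared resources could plausibly do better than this figure.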
Moving on to a more modern benchmark in the form of Cinebench R20, we get a much more promising result for Lujiazui. Looking at the clock (and core) normalized results first, Lujiazui is getting double the performance per clock of Isaiah, which for a generational increase is very impressive. However, it has to be noted that Cinebench R20 can take advantage of AVX acceleration, which no CPU tested other than the Zhaoxin has. That diminishes its win over Isaiah, and Lujiazui still loses to Merom once again. Something else to notice here is that in this more modern application, Bobcat takes the lead over Isaiah even in the clock normalized results, which is the opposite of R11.5, where Isaiah beat Bobcat.
The Geekbench portion of this review will be split into several sections, with the first section being the comparison of Isaiah to Lujiazui.
· Raw GB5 scores for Isaiah versus Lujiazui
· Clock Normalized GB5 scores for Isaiah versus Lujiazui
· Percentage values for clock normalized GB5
Looking at the raw Geekbench 5 scores, Lujiazui just runs away with the win, but the KX-6640MA is also at over double the clock speed of the U4025, so you would expect Lujiazui to win. One place where Lujiazui has a huge lead over Isaiah is the AES subtest. The reason is that Lujiazui has cryptographic acceleration, while neither Isaiah nor any of the other CPUs in this test does.
Now looking at the clock normalized results, Lujiazui pulls out wins for the most part, but there are a few subtests where there was minimal improvement over Isaiah and even one subtest, face detection, where there was a regression in the clock normalized results. However, on average, Lujiazui is roughly 20 to 30 percent faster than Isaiah per clock, which is in line with Zhaoxin’s claim of a 25% increase in performance per clock.
Geekbench 5 Integer Subtests Charts
Comparing Lujiazui to the other architectures in Geekbench 5’s Integer subtests shows that while Lujiazui is beating the pants off of Bobcat, it is still behind Merom, with the only place where Lujiazui takes a win versus Merom being the Navigation subtest.
Geekbench 5 Floating Point Subtests Charts
Comparing Lujiazui to the competition in Geekbench 5’s Floating Point subtests is a little more positive than in the Integer subtests. There are now 2 subtests where Lujiazui beats Merom, and in the FP subtests Lujiazui just kicks Bobcat to the curb; it is no contest.
In the end, even when factoring in the AES acceleration that Lujiazui has, Lujiazui still loses to a 15 year old architecture, which needless to say is not a good look for Lujiazui or Zhaoxin if they truly are trying to catch Intel and AMD.
Well….. this section has had to have a rewrite due to Centaur seemingly being acquired by Intel in the past week or so. The relationship between Centaur and Zhaoxin is a complex one that we outlined in Part 1 of this series, but the short of it is that Zhaoxin uses modified versions of Centaur’s architectures rather than making a new design completely from scratch. So with Intel grabbing the people from Centaur’s design office in Austin, the future for Zhaoxin is not looking too bright, although the details of this deal are far from clear at the moment.
However, let’s assume that at the very least Centaur sent over the design for their CNS microarchitecture to Zhaoxin before the buyout by Intel. The CNS architecture will be a very large uplift over Lujiazui, with Geekbench 5 giving CNS roughly a 2x increase in performance per clock over Lujiazui. However, in Geekbench 5, CNS has roughly the same performance per clock as Sandy Bridge, which is a decade old architecture. While that is an improvement over Lujiazui, it still is not close to where Intel and AMD are in terms of performance per clock.
Now, if Centaur did not transfer CNS’s IP to Zhaoxin, then Zhaoxin is going to have to iterate on Lujiazui, which is even further behind Intel and AMD than CNS is. However, I think that Zhaoxin can take some cues from Jaguar’s improvements to Bobcat, which yielded a roughly 50 percent performance per clock increase. What AMD did was increase the size of certain structures in the core like the AGU scheduler, improve the L1 and L2 BTBs, and reduce the latency of certain math operations. AMD also widened the FPU pipes from 64b to 128b, which is already the case for Lujiazui, but Zhaoxin could make the FPU pipes 256b wide to improve AVX operations, which are a large weak spot for Lujiazui.
Now, something interesting we were told by AtopNUC, the makers of the Zhaoxin system we used for our testing, is that Zhaoxin apparently will be launching a new CPU series later this year. What that series could be, I am unsure; however, my best guess is that it could be Lujiazui with proper AVX2 support, considering that there is a GCC optimization target that has roughly the ISA level of Haswell but has a few missing instructions, which excludes Intel and AMD CPUs from the possible CPUs it could be.
Regardless, if Zhaoxin does want to try and match AMD and Intel, then they have a long road ahead of them. However, they did achieve the 25 percent performance per clock improvement over Isaiah that they claimed, so kudos to them for achieving that feat.
If you like our articles and journalism and you want to support us in our endeavors then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way.
March 23, 2022 Cheese, clamchowder
The x86-64 instruction set powers the vast majority of PCs, consoles, and servers. However, the number of x86 licensees has always been small, so it’s important to keep track of the few that are left. While Intel and AMD have been at each other’s throats chasing the performance crown, VIA, the former owner of Centaur, has quietly targeted the lower power laptop and embedded market. We’ve covered a couple of their previous chips, like the VIA Nano and Zhaoxin LuJiaZui. However, VIA hasn’t made any attempts to go head to head with Intel or AMD’s top end designs since the cancellation of the then newly acquired Cyrix’s Jalapeno core in 1999.
In 2019 things changed and VIA again set their sights on higher performance targets. Following in the footsteps of their now defunct subsidiary Cyrix, the Centaur team announced an x86 core called “CNS”. Unlike Nano, CNS targets server applications and prioritizes high performance with high IPC and AVX-512 support. That puts CNS in a very interesting position. It not only represents a shift away from VIA’s strategy of targeting the low power market, but also stands out as the first non-Intel microarchitecture to implement AVX-512.
From Centaur’s announcement of their CNS architecture and CHA (SoC with Ncore AI accelerator + CNS)
But late last year, news broke that Intel was buying the Centaur design house from VIA. This sent shock waves throughout the tech world, because Centaur was the only remaining high performance x86 design house besides Intel and AMD.
A CHA chip obtained by Brutus from Centaur’s closeout auction
Unfortunately, that most likely killed the CNS effort. However, thanks to Brutus along with our wonderful patrons and supporters, we managed to get one of the few CHA chip samples. In this article, we’re going to take a deep look at the CNS microarchitecture, and see what could have been if it had actually been released.
Centaur’s CHA chip comes with eight CNS cores along with a machine learning accelerator, called the NCore. To feed those components, CHA has a 16 MB last level cache, and a quad channel DDR4 memory controller. All of that is implemented on a 194 mm² die, with TSMC’s 16nm process:
CHA die layout, from Centaur’s presentation at the Linley Spring Processor Conference
We’re going to focus on the CNS cores, because there were no NCore drivers released publicly or that we could find. Here’s what the architecture looks like from our testing:
Our CNS chip runs at 2.2 GHz, which is impressive for an engineering sample chip when the production chips targeted 2.5 GHz. For testing, it’s set up with quad channel DDR4-3200 memory. For comparison purposes, we ran tests on an Azure NC12 instance. That has a Xeon E5-2690 v3 with SMT disabled, and appears to be running at 3 GHz. However, we’re not sure what kind of memory it has.
Centaur refers to the CNS as “Haswell-Class”
Centaur Adds AI to Server Processor, by Linley Gwennap
Since Centaur says the CNS core is Haswell class, that’s what we’ll compare it with. CNS’s predictor can recognize pretty long patterns, but generally falls short of Haswell’s capabilities. However CNS does appear to have ample storage for branch history. With 512 branches in play, it can keep up excellent prediction accuracy with repeating history lengths up to 24-long. Haswell in contrast falls apart once the pattern length exceeds 16.
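To make the pattern-length test a little more concrete, here is a toy model of a history-based direction predictor. It is only a sketch under simple assumptions, not a description of Centaur’s or Intel’s actual hardware: the predictor remembers the last outcome seen after each window of `history_len` previous outcomes, and the function name and parameters are made up for illustration.

```python
def mispredict_rate(pattern, history_len, repeats=50, warmup_repeats=10):
    """Toy predictor: remembers the last outcome seen after each history window.

    history_len must be at least 1.
    """
    table = {}                      # history tuple -> last observed outcome
    history = (0,) * history_len    # start with an all-not-taken history
    mispredicts = 0
    counted = 0
    for rep in range(repeats):
        for taken in pattern:
            prediction = table.get(history, 0)   # default: predict not taken
            if rep >= warmup_repeats:            # only score after warmup
                counted += 1
                mispredicts += prediction != taken
            table[history] = taken               # train on the real outcome
            history = (history + (taken,))[-history_len:]
    return mispredicts / counted

pattern = [1, 1, 1, 0]                            # taken, taken, taken, not taken
short = mispredict_rate(pattern, history_len=1)   # history too short to disambiguate
long_enough = mispredict_rate(pattern, history_len=3)
print(short, long_enough)
```

With enough history to tell every position in the pattern apart, the toy predictor stops mispredicting. A real predictor also has finite storage, which is why the number of branches in play matters in the test above.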
CNS’s direction predictor is much more powerful than the one in Nano. Centaur made significant progress since their last attempt at taking on the x86 market. But Intel is a juggernaut with lots of engineering muscle, and they’re not so easy for a small company to catch.
Indirect branches can jump to different places, and add another layer of difficulty to branch prediction. The predictor has to track all of those targets, and pick between them. Here, we’re only looking at how many targets the indirect predictor can track.
Additional time taken per indirect branch when it cycles through multiple targets, instead of hitting the same target every time
For a core of its size, CNS has impressive indirect branch prediction capabilities. We saw up to 1024 indirect targets handled without much of a penalty, with 256 branches and four targets per branch. Haswell does better when tracking a few branches with a lot of targets, but CNS is better with a lot of branches and a few targets per branch.
Call and return pairs are a special case of indirect branches, because returns usually go back to where the call came from. To accelerate return prediction, Centaur uses a 7 entry return stack. For comparison, Haswell has a 16 entry return stack, and Zen has 31 entries. CNS’s return stack is small, and it’ll suffer in code with deeply nested calls.
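A rough way to see why a small return stack hurts with deeply nested calls is to model it as a fixed-depth stack that drops its oldest entry on overflow. That overflow behavior is an assumption made for this sketch (the entry counts are the only part taken from the text), and the return addresses are made up.

```python
from collections import deque

def return_mispredicts(call_depth, stack_entries):
    """Nest `call_depth` calls, then return from all of them.

    Returns how many of the returns the toy return stack gets wrong.
    """
    predictor = deque(maxlen=stack_entries)  # oldest entry is dropped on overflow
    real_stack = []
    for level in range(call_depth):
        return_address = 0x1000 + level       # made-up return addresses
        real_stack.append(return_address)
        predictor.append(return_address)
    mispredicts = 0
    while real_stack:
        actual = real_stack.pop()
        predicted = predictor.pop() if predictor else None
        mispredicts += predicted != actual
    return mispredicts

results = {name: return_mispredicts(20, entries)
           for name, entries in [("CNS", 7), ("Haswell", 16), ("Zen", 31)]}
print(results)
```

In this model, a call chain 20 levels deep overflows the 7 and 16 entry stacks but fits in a 31 entry one, so only the returns beyond each stack’s capacity are mispredicted.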
To speed up branch handling in the frontend, CNS uses a complex, multi-level cache of branch targets. Haswell and CNS can both track 128 branches and handle them with no fetch bubbles. They also look like they hit BTB miss penalties after more than 4096 branches are in play, but the similarities end there.
With under 16 branches, CNS can sustain two taken branches per cycle. It therefore joins Rocket Lake and Golden Cove in an exclusive club of CPUs that can do more than one taken branch per cycle. Intel likely does this by unrolling loops within its micro-op queue, called the LSD (loop stream detector) in Intel’s documentation. Centaur probably has a similar mechanism. Otherwise, they’d need a dual ported instruction cache to achieve such impressive branching performance.
Once we move past 128 branches into the main BTB, Intel and Centaur’s approaches diverge again. CNS seems to use a BTB tied to the L1 instruction cache. Once the loop size exceeds 32 KB, we see a sharp jump in cycles taken per branch. Within the L1i, CNS can do one branch every three cycles. In other words, two fetch cycles are wasted after a taken branch. Perhaps this indicates that CNS’s L1i has 3 cycles of latency. Haswell uses a more modern decoupled BTB, which can track 4096 branches regardless of spacing or L1i hit/miss. Intel’s implementation is also faster, and only wastes one fetch cycle after a taken branch.
Some comparisons using branches spaced by 16 bytes
For perspective though, CNS does a lot better than AMD’s Zen 2, at least when the two are running at similar clock speeds. Zen 2 can do zero-bubble taken branches with a 16 entry L0 BTB. But after that, it takes steep penalties when it takes branch targets from its larger, slower L1 and L2 BTBs.
Testing L1 instruction cache bandwidth with 8 byte NOPs, which are representative of vector instructions that tend to be longer (because of prefixes)
MPR’s article says CNS can fetch 32 bytes per cycle from the L1 instruction cache. However, we were only able to achieve this for the 2 KB test size. CNS’s instruction fetch performance ends up looking a bit like Haswell’s, where 32B/cycle fetch can only be achieved for small code sizes, from the uop cache.
Testing instruction fetch/decode throughput with 4 byte NOPs, which are more representative of non-vector workloads
From L2, CNS can sustain around 16 bytes of code per cycle, which is still enough to sustain 4 IPC unless code is dominated by vector instructions. Throughput makes a sharp drop once instructions spill out of L2, but that’s typical behavior for many designs, including Haswell.
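The relationship between fetch bandwidth and IPC here is simple division. The sketch below uses the 16 bytes per cycle figure from the text and the 4 byte and 8 byte instruction lengths from the two NOP tests; the function name is made up, and it ignores every other bottleneck in the core.

```python
def fetch_limited_ipc(fetch_bytes_per_cycle, avg_instruction_bytes):
    """Upper bound on instructions per cycle imposed by fetch bandwidth alone."""
    return fetch_bytes_per_cycle / avg_instruction_bytes

ipc_4byte = fetch_limited_ipc(16, 4)   # L2 fetch rate with 4 byte instructions
ipc_8byte = fetch_limited_ipc(16, 8)   # same rate with longer, vector-like instructions
print(ipc_4byte, ipc_8byte)
```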
CNS can sustain 5 NOPs per cycle, as long as they’re in a tight loop with no more than 24 instructions.
MPR’s article says CNS has a predecode stage that can handle four instructions per cycle. Predecoded instructions are placed into an instruction queue, which feeds the main 4-wide decoder.
One possible explanation for the behavior above is that the instruction queue has 24 entries, and can act as a loop buffer. If code fits within this loop buffer, it bypasses predecode restrictions and can run at 5 IPC, as long as the main decoder can fuse a pair of instructions. This is quite different from Haswell, where the predecode stage is 6-wide, and the loop buffer is behind the main 4-wide decoder.
Like Haswell, CNS can fuse conditional jumps with a previous instruction that sets flags, including arithmetic operations. Unlike other CPUs we’ve looked at, it can also fuse NOPs with adjacent instructions. A fused pair is tracked as a single micro-op in the backend.
CNS’s renamer can recognize when an operation will always generate a zero, like when a register is xor-ed with itself or subtracted from itself. For these cases, it can tell the scheduler that it doesn’t need to wait for inputs, allowing more IPC to be extracted. It’s on par with Haswell in that respect.
Unlike Haswell, CNS doesn’t seem to have move elimination capabilities. A chain of dependent register to register move instructions will execute at one per cycle.
Like any modern high performance CPU, CNS has large buffers to enable out of order execution. Key structures like the register files, ROB, and memory ordering queues are roughly comparable to Haswell’s. Even the scheduler and branch order buffer have similar sizes. Centaur was really targeting Haswell-level performance with this core.
Structure | Instruction needs an entry if it… | Centaur CNS | Intel Haswell | Intel Skylake-X
ROB | Exists | 192 | 192 | 224
Integer Register File | Writes to an integer register (GPR) | 146 (130+16) | 168 (136+32) | 180 (150+32 measured)
FP/Vector Register File | Writes to a 256-bit (or smaller) AVX register | 144 (80+64) | 168 (136+32) | 168 (148+32 measured)
FP/Vector Register File, 512-bit | Writes to an AVX-512 (ZMM) register | 40 | N/A | 168
AVX-512 Mask Register File | Modifies an AVX-512 mask (K) register | 138 (130+8) – aliased to integer RF | N/A | 128
Load Queue | Reads from memory | 72 | 72 | 72
Store Queue | Writes to memory | 46 | 42 | 56
Branch Order Buffer | Affects control flow | 46 | 48 | 64
Scheduler | Is waiting on an execution unit | 64 | 60 | 97
But CNS’s headline feature is AVX-512 support, so let’s look deeper into that. It’s nothing like Skylake-X’s fully featured AVX-512 implementation. Instead, Centaur still uses 256-bit vector registers, and splits 512-bit instructions into two micro-ops. That means CNS doesn’t gain throughput or reordering capacity from using AVX-512.
It could still benefit from AVX-512’s masking features, but that’s problematic as well. On CNS, those mask registers and general purpose integer registers both competitively share the same renamed register file. That’s not great, because CNS doesn’t have a particularly large pool of integer registers to begin with (it’s slightly smaller than Haswell’s). Combined with how 512-bit results consume two vector registers, CNS’s reordering capacity could see much lower limits when AVX-512 is in use.
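The register file numbers in the table make this concrete. The sketch below assumes that the first number in each parenthesized pair (80 for the vector file, 130 for the shared integer and mask file) is the count available for in-flight results; that reading, and the helper names, are our assumptions for illustration.

```python
VECTOR_RENAME_ENTRIES = 80   # table: FP/vector register file, 144 (80+64)
SHARED_INT_ENTRIES = 130     # table: integer RF 146 (130+16), mask RF 138 (130+8)

def max_inflight_vector_writes(width_bits):
    """In-flight vector results, assuming a 512-bit result takes two 256-bit entries."""
    entries_per_result = 2 if width_bits == 512 else 1
    return VECTOR_RENAME_ENTRIES // entries_per_result

def integer_entries_left(mask_writes_in_flight):
    """Entries left for GPR results when mask writes share the same pool."""
    return max(SHARED_INT_ENTRIES - mask_writes_in_flight, 0)

print(max_inflight_vector_writes(256), max_inflight_vector_writes(512))
print(integer_entries_left(30))
```

Halving 80 gives the 40 entry figure measured for 512-bit writes, and every mask register write in flight takes an entry away from integer results.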
Centaur has not skimped on CNS’s integer units. The core has four ALU pipes like Haswell, but specialized execution units are duplicated across more of CNS’s pipes.
All four of CNS’s ALU pipes can do rotate and shift operations, compared to two on Haswell. Complex bit manipulation operations like PDEP and PEXT can execute at two per cycle on CNS, while Haswell only has one pipe for that. Haswell and CNS can both do integer multiplication with 3 cycle latency, but CNS has two integer multipliers to Haswell’s one. While both CPUs have four ALU pipes on the surface, CNS’s pipes are more flexible, and could enable better throughput especially for applications that take advantage of specialized instructions.
On the vector execution side, CNS’s setup looks a lot like Haswell’s, except CNS uses separate vector execution pipes instead of having the integer execution ports do double duty. Vector integer execution units are spread across three pipes, while FP operations have two pipes to pick from. Again, Centaur doesn’t skimp and execution units are duplicated across more pipes than with Haswell.
CNS’s floating point unit can execute two 256-bit FP additions or multiplies per cycle, at 3 cycle latency. Haswell’s FP multiplication latency is worse, at 5 cycles. And, Haswell can only do a single FP add operation per cycle at 3 cycle latency. Funnily enough, you could match CNS’s FP addition throughput on Haswell by using FMA ops with a multiplier of 1, albeit at higher latency. Fused multiply add execution is nearly the same across both architectures, with 2×256-bit throughput and 5 cycle latency.
Centaur’s vector integer execution is also more capable. All three pipes can do vector integer addition, while only two of Haswell’s can. Like on the scalar integer side, CNS’s vector side has two integer multipliers, versus one on Haswell. Depending on the exact multiply operation in question, Haswell’s performance can degrade further. For example, pmulld (vector multiplication with 32-bit elements) executes at half rate with a latency of 10 cycles. CNS executes the same operation at ~1.68 per cycle, with 3 cycle latency.
Address generation is one of CNS’s only weaknesses on the execution unit front. CNS has two AGU pipes, each capable of handling either loads or stores. Haswell has three AGUs, allowing it to do two loads and a store in the same cycle.
Centaur still has a few tricks up its sleeve though. It can write 64 bytes per cycle by executing either a single AVX-512 store, or two 256-bit AVX ones. That gives it twice as much store bandwidth as Haswell. In terms of maximum L1D bandwidth, CNS can theoretically hit 128 bytes per cycle with a 1:1 mix of reads and writes. We weren’t able to hit that in our tests, but we did get over 90 bytes per cycle. That’s close to Haswell’s 96 B/cycle theoretical max.
In practice, Haswell’s triple AGU setup probably has a slight edge. Most applications have far more loads than stores, and Haswell’s extra store AGU will reduce pressure on the two general purpose AGUs. This advantage would be minimal though.
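The theoretical L1D figures quoted above are just the sum of load and store widths per cycle. This sketch assumes one 64 byte load plus one 64 byte store per cycle for CNS, and two 32 byte loads plus one 32 byte store for Haswell; the helper name is made up.

```python
def l1d_peak_bytes_per_cycle(loads_per_cycle, load_bytes, stores_per_cycle, store_bytes):
    """Peak L1D traffic per cycle with an ideal mix of loads and stores."""
    return loads_per_cycle * load_bytes + stores_per_cycle * store_bytes

cns_peak = l1d_peak_bytes_per_cycle(1, 64, 1, 64)       # one 64B load + one 64B store
haswell_peak = l1d_peak_bytes_per_cycle(2, 32, 1, 32)   # two 32B loads + one 32B store
measured_fraction = 90 / cns_peak                        # using the "over 90 bytes per cycle" result
print(cns_peak, haswell_peak, round(measured_fraction, 2))
```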
CNS features a rather sophisticated load/store unit. Unlike the VIA Nano and Zhaoxin’s LuJiaZui, it can speculatively execute loads ahead of stores with unknown addresses.
Centaur also has a robust store forwarding mechanism. All cases where a load is completely contained within a prior store are handled with a latency of 7 cycles. It’s also able to complete two loads and two stores per cycle if they’re independent and neither cross a 64 byte cache line boundary. We don’t see that from Intel or AMD until Sunny Cove and Zen 3.
Results from our implementation of Henry Wong’s store forwarding test. 64-bit store offset goes down vertically, and 32-bit load offset goes across
However, latencies are a bit high, especially if forwarding fails. If the load only partially overlaps the store, latency jumps to 21 cycles. Store forwarding latency increases by a cycle if the load crosses a 64 byte cache line. If a load crosses a cache line boundary and store forwarding fails, there’s a 6 cycle penalty.
Haswell seems to do a fast check at four byte granularity. If the load and store both access the same 4 byte aligned region, it does a more thorough check that causes a half cycle penalty. That slight penalty applies even if there’s no overlap. Successful store forwarding on Haswell has a latency of 5.5 cycles, while failed store forwarding costs 15. Both latencies are lower than CNS’s, which is quite impressive considering Haswell’s higher clock speeds.
If the load crosses a cache line boundary, Haswell’s store forwarding takes an extra two cycles, and the penalty for a failed store forward increases by one cycle.
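The cases discussed above (load fully contained in the store, partial overlap, and cache line crossing) can be told apart with a few comparisons. This is only a sketch of how a test harness might label each store and load offset pair, not the actual test code; the function name and return format are made up.

```python
CACHE_LINE = 64

def classify_forwarding(store_offset, store_size, load_offset, load_size):
    """Classify a load against an earlier store; offsets are byte addresses."""
    store_end = store_offset + store_size
    load_end = load_offset + load_size
    if load_end <= store_offset or load_offset >= store_end:
        overlap = "none"
    elif load_offset >= store_offset and load_end <= store_end:
        overlap = "contained"   # the case where forwarding can succeed
    else:
        overlap = "partial"     # the case where forwarding fails
    # The load crosses a line if its first and last bytes sit in different lines.
    crosses_line = load_offset // CACHE_LINE != (load_end - 1) // CACHE_LINE
    return overlap, crosses_line

# 64-bit store and 32-bit load, as in the test above
print(classify_forwarding(0, 8, 4, 4))
print(classify_forwarding(0, 8, 6, 4))
print(classify_forwarding(60, 8, 62, 4))
```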
Cache | Centaur CNS | Intel Haswell | Intel Skylake-X | Intel Golden Cove | AMD Zen 2
L1 Instruction Cache | 32 KB 8-Way, 3 Cycle? | 32 KB 8-Way | 32 KB 8-Way | 32 KB 8-Way | 32 KB 8-Way
L1 Data Cache | 32 KB 8-Way, 5 Cycle | 32 KB 8-Way, 4 Cycle | 32 KB 8-Way, 4 Cycle | 48 KB 12-Way, 5 Cycle | 32 KB 8-Way, 4 Cycle
L2 Cache | 256 KB 16-Way, 13 Cycle | 256 KB 8-Way, 12 Cycle | 1024 KB 16-Way, 12 Cycle | 1280 KB 10-Way, 15 Cycle | 512 KB 8-Way, 12 Cycle
L3 Cache | 16 MB 16-Way, 56 Cycle | 30 MB, 49 Cycle | 35.75 MB, 55 Cycle | 30 MB, 67 Cycle | 16 MB 16-Way, 39 Cycle
Haswell and Skylake L3 parameters vary. The E5-2690 v3 was used for Haswell here, and the Xeon 8171M (on Azure) was used for Skylake
For the most part, Centaur’s CNS suffers from higher cache latencies than Intel’s Haswell, even when the latter is in a server platform. That’s partially because Haswell is running at 3 GHz, while CNS only runs at 2.2 GHz.
Actual time on the left, core clocks on the right
But Centaur falls behind even if we normalize for clock speed. With 5 cycles of latency, Centaur’s L1D is slow. Its L2 is also a cycle slower than Haswell’s. In the L3 region, Centaur again loses both in terms of cycles and absolute time. It only pulls a slight win when we get out to memory.
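Converting between cycles and absolute time is a single division by the clock speed. The sketch below uses the L3 cycle counts from the table and assumes the clocks mentioned earlier (2.2 GHz for CNS, roughly 3 GHz for the Haswell instance), so the outputs are approximations rather than measured values.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Latency in nanoseconds for a given cycle count at a given core clock."""
    return cycles / clock_ghz

cns_l3_ns = cycles_to_ns(56, 2.2)       # CNS L3: 56 cycles at 2.2 GHz
haswell_l3_ns = cycles_to_ns(49, 3.0)   # Haswell L3: 49 cycles at about 3 GHz
print(round(cns_l3_ns, 1), round(haswell_l3_ns, 1))
```

Even a modest gap in cycle counts turns into a larger gap in absolute time once the lower clock is factored in.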
Things look a bit better for both CPUs if we use 2 MB pages to avoid address translation penalties, but the picture is largely the same. 24 ns of latency is slightly sub-par for a ring servicing just eight cores. Looking at differences in L3 latencies, we can also see that an L2 TLB access takes an extra eight cycles.
Actual time on the left, core clocks on the right
In terms of clock cycles, it’s not too much higher than Haswell-E. But the E5-2690 v3 has more cache, more cores, and runs at a higher clock. Intel has done a very impressive job of scaling up the ring interconnect while keeping latency under control.
CNS’s L1D can do a 64-byte load and a 64-byte store every cycle. We weren’t able to get close to its theoretical bandwidth even with read-modify-write or copy patterns (which give a 1:1 load-to-store ratio), but at least we got more than 64 bytes per cycle.
Actual bandwidth to the left, bytes per cycle to the right
Read bandwidth barely drops when going from L1 to L2, staying at just under 64 bytes per cycle.
Actual bandwidth to the left, bytes per cycle to the right
Even Intel’s latest cores can’t sustain that amount of L2 read bandwidth. That’s an impressive performance from Centaur’s architecture. Once we get to L3 though, bandwidth takes a sharp drop.
I don’t trust my copy bandwidth measurement just yet. We’re still working on expanding our benchmark capabilities.
With all eight cores loaded, we see over 1.6 TB/s from CHA’s L1 data caches with a memory copy pattern. L2 bandwidth is highest using a read pattern, at just under 1.1 TB/s. L3 bandwidth hits about 325 GB/s. Finally, we get just above 55 GB/s from memory, with a copy pattern.
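As a sanity check, the aggregate numbers can be converted back to bytes per cycle per core. This sketch assumes all eight cores run at 2.2 GHz and that TB/s means 10^12 bytes per second; the function name is made up.

```python
def bytes_per_cycle_per_core(total_bytes_per_second, cores, clock_hz):
    """Average bytes moved per core per cycle for an aggregate bandwidth figure."""
    return total_bytes_per_second / (cores * clock_hz)

l1_per_core = bytes_per_cycle_per_core(1.6e12, 8, 2.2e9)   # L1D, copy pattern
l2_per_core = bytes_per_cycle_per_core(1.1e12, 8, 2.2e9)   # L2, read pattern
print(round(l1_per_core, 1), round(l2_per_core, 1))
```

Both results line up with the single core figures: a bit over 90 bytes per cycle from L1D, and just under 64 bytes per cycle from L2.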
lstopo output for the CHA server we tested on
Centaur uses a ring interconnect to connect cores with the L3 cache, and off-die IO. Each ring stop can move 64 bytes per cycle in each direction – twice that of Haswell’s.
With eight cores loaded, CNS averages 97.4 bytes per cycle across all eight cores, while Haswell-E gets 81.62 bytes per cycle. CNS’s wider ring does help with bandwidth, but it’s not enough to offset Intel’s clock speed advantage:
As a consolation prize, Centaur’s ring-based L3 is able to provide more bandwidth under heavy load than Ice Lake’s mesh-based cache. Mesh based interconnects tend to have issues clocking up without consuming tremendous amounts of power, resulting in high latency and low bandwidth. Ice Lake’s is no exception.
CHA has a quad channel memory controller that supports DDR4-3200, but memory bandwidth isn’t too impressive. 53 GB/s is well short of the theoretical 102.4 GB/s. Getting close to theoretical bandwidth is hard, because DRAM bandwidth gets lost to refresh cycles and read/write turnarounds. But 52% of theoretical is way too low.
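The theoretical figure comes from channels × transfer rate × bus width. The sketch below assumes a standard 64-bit (8 byte) DDR4 channel; the helper name is made up.

```python
def ddr_peak_gb_per_s(channels, megatransfers_per_s, bytes_per_transfer=8):
    """Peak DRAM bandwidth in GB/s, assuming 8 bytes per transfer per channel."""
    return channels * megatransfers_per_s * bytes_per_transfer / 1000

peak = ddr_peak_gb_per_s(4, 3200)   # quad channel DDR4-3200
efficiency = 53 / peak              # measured 53 GB/s against the peak
print(peak, round(efficiency, 2))
```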
Haswell-E doesn’t do much better. Its first generation DDR4 controller has trouble running at high speeds. Still, Intel’s relatively old architecture shows better memory bandwidth scaling as more cores are loaded. One thought is that CHA wasn’t optimized with CPU memory performance in mind. Centaur has a large NPU taking up a significant chunk of die area, capable of nearly 7 bfloat16 TFLOPs. That could demand a lot of memory bandwidth, especially with the CPU cores active at the same time. Also, the quad channel memory controller could allow more DIMM slots, making it easier to install lots of memory without needing very expensive high capacity modules.
Cache coherency is handled pretty competently with a mechanism that’s likely tied to the L3 slices. The time taken for a core to see a write from another depends on how close both cores are to the L3 slice that the cache line is homed to:
A 4K aligned cacheline seems to be closest to core 7
Lock latencies aren’t bad, especially considering CNS’s low clock speeds. Haswell-E ends up being in the same ballpark. On one hand, Intel benefits from higher clocks. But on the other, it’s using a dual ring setup to connect more cores and more cache. More hops across the interconnect generally translates to higher latency.
Intel’s dual ring setup is pretty darn good
In a vacuum, CNS’s architecture shows how far the Centaur design team has come. CNS is wider and has more reordering capacity than any previous VIA or Centaur architecture. It also implements a grab bag of new microarchitecture features, showcasing the small Centaur team’s design prowess. Loads can be hoisted ahead of stores with an unknown address. Store forwarding is very robust, with behaviour that resembles Sunny Cove’s. There’s a large, unified, multi-ported scheduler. Vector execution units are very powerful for a core of its size. Multiple sets of architectural registers (GPR and AVX-512 mask) are aliased to the same physical register file for better area efficiency. All of this is done in a very compact core:
Approximate sizes for Centaur CNS and various Intel cores. CHA die photo from Centaur’s presentation, Haswell photo from Cole L, Skylake-X photo from Fritzchens Fritz, Golden Cove from Intel’s press materials
There’s progress at the system level too. Centaur designed a modern ring interconnect to link the cores with cache and IO. That interconnect enables a large, shared L3, and lets its bandwidth scale to meet the demands of eight cores. Finally, the CHA chip’s quad channel DDR4 controller and 44 PCIe lanes give it more off-chip bandwidth than any previous Centaur design.
But CNS doesn’t exist in a vacuum. Haswell level IPC is great. What’s not great (for Centaur) is that Haswell clocks much higher. Intel also refined their ring bus over several generations, letting it support higher core counts and scale to higher bandwidth, even with narrower links between ring stops. Centaur’s architecture, while ambitious, would have a tough time against Haswell.
Worse, Centaur’s CHA chip still wasn’t on the market in 2021. That put it up against Intel’s Icelake-SP based Xeons, and AMD’s Zen 3 based EPYC chips. Centaur is a small company with limited resources, and is a process node behind. In a pure CPU versus CPU battle, they stood no chance.
CNS really starts falling behind as other designs get a process node advantage. By the time Zen 2 comes around, AMD has a smaller core with higher IPC and higher clocks. AMD core photos are from Fritzchens Fritz
But Centaur knew that. That’s where NCore comes in. It’s a beefy machine learning accelerator capable of 6.8 trillion bfloat16 operations per second. Centaur wanted to pair their CNS architecture with NCore to create a uniquely competitive product. Core for core, CNS can’t take on Skylake or Zen 2, but it’s more than adequate for driving an ML accelerator. For context, Intel targeted 5G base stations (which counts as “edge”) with Snow Ridge. Snow Ridge has Tremont cores running at 2.2 GHz. ARM targets edge applications with Neoverse E1. CNS would stand up well against them, especially with vectorized workloads.
Placing the NCore on-die also reduces latency, and leaves PCIe lanes free for other IO. On the edge, that’d undoubtedly be a ton of networking bandwidth. CNS and NCore together make sense if you want a server with powerful inference capabilities, but have more moderate demands for CPU performance and don’t have the space or power budget for external ML accelerators.
I suppose that market never materialized. Or if it did, it wasn’t enough to save Centaur. Then, Intel snapped up Centaur’s design team, because those engineers are very good at doing a lot with limited resources. CNS is a testament to that.
In conclusion, while it is sad to see a CPU design house close shop, Centaur would not have competed with either AMD or Intel. Even if Centaur had launched a system with the CNS core in 2017, both Intel and AMD had more expandable systems in the form of Skylake-X and Zen 1 EPYC. CNS can only scale to 2 sockets with a total of 16 cores, versus Skylake-X’s max of 224 cores across 8 sockets and EPYC’s max of 64 cores across 2 sockets.
However, in Centaur’s planned release year of 2020, like Clam said, things were not looking good for either Centaur or the CNS core. And in late 2021, the scene was looking even grimmer for Centaur, and they decided to close up shop with a buyout by Intel for 125 million US dollars.
Now, the buyout does not necessarily mean that the CNS core is dead. The wildcard in all of this is Zhaoxin. Zhaoxin has been very quiet during this acquisition of Centaur by Intel but they have proven that they can improve on Centaur’s already existing designs with Lujiazui.
In 2018, Zhaoxin claimed to be able to catch up to AMD’s then fairly new Zen 1 architecture with their KX-7000 series of CPUs, and I suspect that they were planning on using the CNS core to do that. There is credible evidence for this hypothesis: the prototype boards used the ZX-200 as their southbridge, and VIA transferred IP to Zhaoxin in October of 2020, which included CPU IP. Now, whether or not Zhaoxin can clock the CNS core to the 3.5+ GHz that a modern desktop CPU needs in order to be competitive is a different story.
Regardless of whether Zhaoxin could get CNS up to 3.5 GHz, the point is moot. This is not early 2019, it’s early 2022, so this hypothetical 3.5 GHz CNS would not be able to get near either AMD’s or Intel’s current architectures, let alone what will be out in roughly 6 to 9 months.
But back to the buyout of Centaur by Intel, the 125 million dollar question is why? Why did Intel buy the Centaur design house? To be frank, we have no idea. Centaur was no threat to Intel or even AMD. There are a few possible reasons that I can think of:
· Intel wanted some IP owned by Centaur such as NCore
· Intel wanted the Centaur engineers because, as Clam said, they seemingly can do a lot with very few resources
· Intel wanted a good enough x86 CPU core on a TSMC node for projects where the fastest CPU core is not required and Intel wants to use x86 compatible cores
· And the most cynical reason is that Intel just wanted one fewer company with the x86 license, although VIA still has the Cyrix license as far as I know
Unfortunately, the actual reason as to why Intel acquired Centaur may never be known due to Intel being very tight lipped as to why they gobbled up Centaur. This is quite possibly the end of an era that most didn’t know still existed. Maybe Zhaoxin will take up the mantle of the third high performance x86 design house that has been held by Cyrix and Centaur or maybe we will be reduced to two high performance x86 design houses in AMD and Intel. I cannot predict the future but what I do know is that this article is at its end.
If you like our articles and journalism and you want to support us in our endeavors then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way or if you would like to talk with the Chips and Cheese staff and the people behind the scenes then consider joining our Discord.