SX Benchmarks :.
MP Scalability: Have We Hit The Wall ? :.
With the incredible increase in scaling achieved over the last 12-18 months, it was becoming more and more evident that in certain instances computer hardware was overtaking the capabilities of native DAW applications, something that was unthinkable not too long ago.
This was driven home quite dramatically with the release of the Clovertown series of QuadCore Xeons, which for the first time put 8 physical cores logistically within reach of a Dual Socket Professional DAW Workstation.
|Nuendo/ SX's Multi Processor support :|
The question of Nuendo's Multi Processor support, which is claimed to be unlimited, has been less than clear cut. In the past we had encountered an issue when moving from 2 to 4 physical cores, when the first series of Dual Dualcore Opteron systems hit the wild. The result then was less than inspiring, with the systems performing worse than the Dual Single Core systems. The exact details were kept close to the chest by a select few, beyond the admission that performance was not as expected. The solution then was a patch to repair Nuendo's Multi Threading capability with the new AMD chips. Bear in mind it was claimed at the time that Quad Single Core CPUs worked fine; that did not, however, translate to the Dual Dualcore arena. This is important to remember as we fast forward to the present day, where we needed to move from 4 to 8 cores.
As we moved into the Dual Quad arena, the Nuendo forum was again the hotbed of preliminary information. One early report showed a distinct loss of scalability compared to a Single Quad system using the same RME Fireface audio interface.
It was quickly surmised that the Dual Quad Xeon systems were suffering bus arbitration issues, which was a possibility.
The flaw in that original thesis, however, was that the conclusion being postured was based on the Thonex benchmark, which did not correlate in any way to scalability per core.
Further testing by other early Conroe/Woodcrest adopters, using some of the other available benchmarks that did correlate to scalability per core, such as Blofeld's DSP40, proved without a doubt that there were no bus arbitration or scalability issues, at least at the Dual DualCore level.
The Dual Quadcore specter reared its head again shortly after, when another member at N.com, Steven from Yellowcab Studios in France, jumped the gun and opted for a Dual Quadcore Clovertown 5320 Xeon Dell 690 Workstation. This system was identical to the one being used successfully by another member, psvennevig (Pal), except that Pal’s system had Dual Dualcore Woodcrest 5160 CPUs.
The new Dual Quad system was using multiple MADI cards for a total of 112 I/O. This is the configuration that had been working successfully on an earlier model Dual Xeon system at the facility, and it also worked flawlessly on the Dual Dualcore Woodcrest system. What transpired on the Dual Quad Xeon system turned into a 2 month long soap opera that saw not only me being banished for simply suggesting that the issues could be Nuendo related, but also the extent to which opinions could be manipulated by those who were nowhere near qualified to even express
|The Investigation Begins : 2 Tribes Go to War..|
The issue being experienced by the Dual Quad Xeon system was twofold. The first part was a disproportionate delta between the VST meter and the Task Manager. The second was a complete VST meter and system overload when Multi Processing was enabled, which is the complete reverse of what should happen when MP is working correctly. This behavior was accentuated at the lower latencies. In short, the system could not be operated successfully below 1024 samples with the high I/O configuration and MP enabled.
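To put the buffer sizes quoted throughout this report into time terms, here is a minimal Python sketch that converts a buffer size in samples to milliseconds. The 44.1 kHz sample rate is an assumption (the report does not state the rate used), and `buffer_latency_ms` is a hypothetical helper name. It covers a single buffer only, not the full round trip through the driver and converters.

```python
def buffer_latency_ms(buffer_samples: int, sample_rate_hz: int = 44_100) -> float:
    """Time covered by one audio buffer, in milliseconds.

    sample_rate_hz defaults to 44.1 kHz, which is an assumption;
    the report does not say which rate the test projects used.
    """
    return buffer_samples / sample_rate_hz * 1000.0


# Buffer sizes mentioned in the report, largest to smallest.
for size in (1024, 256, 128, 64, 32):
    print(f"{size:>5} samples -> {buffer_latency_ms(size):.2f} ms")
```

At the assumed rate, a system that cannot go below 1024 samples is stuck with roughly 23 ms per buffer, against under 3 ms at 128 samples.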
The questions that needed to be answered from my perspective were:
1: Was Nuendo’s MP capability correctly scaling across the 8 physical cores when configured as Dual Quad?
2: Was XP Pro’s MP HAL correctly identifying and multithreading across the 8 available cores, and in that respect, what was the correlation between the HAL and Nuendo’s MP capability?
With Steinberg asleep at the wheel, we needed to try and isolate whether Nuendo’s MP capability was a possible cause. It was suggested that Steven test another Multi Processor capable DAW application, and Sonar 6 was chosen. Sonar did not display any of the issues being experienced; in all areas tested, everything worked as expected.
The system was then configured with 112 virtual I/O across the inbuilt soundcard using ASIO4All. The behavior was close enough to identical, with a slightly lower overall CPU loading due to the non physical nature of the I/O, again highlighting that the issues were associated more with Nuendo than with the architecture. However, the discussion at N3 was still being manipulated by those who continued to insist that it was not a Nuendo issue, despite the collected evidence indicating otherwise.
It was time to let those boys chat amongst themselves for a while. I had already been placed on “vacation” weeks earlier, and it was obvious that the other parties involved were growing increasingly frustrated by the proceedings.
We initiated an alternate investigation at the DAWbench Forum, away from the influence of those behaving like corporate apologists, and continued through a process of elimination to isolate, as clearly as we could, what was behind the issues being experienced.
|Windows XP Pro, Server 2003 Multi Processor HALS: :|
Firstly, we decided to isolate the Operating System, its MP capability and the possible correlation to Nuendo’s MP capability. My thinking was that XP Pro's Multi Processor HAL was never designed for 8 physical cores; we may have been lucky with 4, as the HAL could have been updated to accommodate the earlier Hyperthreading Xeon CPUs. Windows 2003, however, supports anywhere from 1-4 physical CPUs for Standard, 1-8 for Enterprise and 8-32 for Data Center, so there is definitely more than one version of the Multi Processor HAL.

I organized for Steven to have access to the latest Server 2003 R2 Enterprise build. This version of W2K3 supports 8 physical CPUs, so we could quickly establish whether the HAL was a contributing factor. The behavior remained, which eliminated the O.S. from the equation. We still had not heard anything from Steinberg at this point, so we were still chasing a ghost, but the further we progressed, the more it seemed to be a Nuendo issue.
*Note: During this stage of testing, Vista 32 and Cubase 4 were also tested. The behavior remained consistent with that experienced previously, which suggested that the issue had unfortunately also found its way into the new seq4 audio engine implemented in Cubase 4.*
|Nuendo 3 : RME MADI 112 I/O : Accumulative Resource Allocation : :|
We decided to separate the 2 issues being experienced by testing what effect the I/O was having independently of the MP issue.
It appeared that the higher the number of I/O, the larger the delta between the VST and TM meters, which then seemed to be accentuated by the number of cores: the more cores, the higher the accumulated impact.
On 2 and 4 core systems the higher VST/TM loadings were evident, but not to the extent experienced with 8 cores. The other systems could also be operated easily at extremely low latencies, while the 8 core system was hobbled at anything below 1024.
A test was conducted using a simple project, the audio in the test project was a sine tone playing on one track only, but staggered so that it covered all audio tracks in succession.
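As an illustration of how such a project is laid out, the Python sketch below places one sine tone region per track, each starting where the previous one ends, so that only a single track is ever playing. The track count and region length are hypothetical values chosen for the example, and `staggered_regions` is an illustrative name; neither comes from the actual test project.

```python
from typing import List, Tuple


def staggered_regions(track_count: int, region_seconds: float) -> List[Tuple[int, float, float]]:
    """Return (track, start, end) tuples so that exactly one track plays at a time."""
    return [
        (track, (track - 1) * region_seconds, track * region_seconds)
        for track in range(1, track_count + 1)
    ]


# Hypothetical layout: 8 tracks, 2 seconds of sine tone each.
layout = staggered_regions(track_count=8, region_seconds=2.0)
for track, start, end in layout:
    print(f"track {track:>2}: sine tone from {start:5.1f}s to {end:5.1f}s")
```

Because the playback load stays constant across the whole project, any change in the meters as channels are added can be attributed to the I/O count rather than to the audio being processed.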
The number of I/O channels was progressively increased and the behavior noted. The results are listed below.
With Multiprocessor ON:
With Multiprocessor OFF:
The results indicated that there was in fact more than one issue being experienced, something I had suspected very early on.
Just how much the I/O issues were inter-related with the MP issue was clarified when we moved the testing away from the Dual Quad area.
Dual Xeon Quadcore-Clovertown : Lynx AES16 - 16 I/O - 128 Sample Buffer : Multi Processing On - Multi Processing Off
|Goodbye Clovertown, Hello Woodcrest :|
At this point, Steven had had enough of Steinberg's lack of communication and/or clarification of the issue, and took up an offer from a fellow DAWbench forum member, Ten, to swap the QuadCore Xeons for a pair of DualCore 5160 chips. These chips were identical to those being used by Pal in his Dell 690 system, so we already had a clear indication of the performance to be expected.
On installation of the Dual DualCore Xeons, the system behavior instantly improved to the point where the full 112 channels of I/O could be used effectively right down to 128 samples, whereas the previous configuration had stalled at 1024.
The correlation between the VST and TM meters was still disproportionate, and higher than what Steven had experienced on the earlier Dual Xeon system, but behavior with MP On was as expected, with huge amounts of overhead available over the MP Off configuration.
This highlighted another factor: when enabling high numbers of I/O, Nuendo would assign resources, perhaps as a buffering mechanism. These resources seemed to be accumulative with the number of cores available, so the higher the number of cores, the higher the VST / CPU loading. That, combined with the 8 core MP issue, resulted in the extreme issues being experienced.
When comparing the screenshots of the Dual DualCore / MADI rig, configured with both minimal and maximum I/O, the added resource allocation is clearly evident. This has been reported in the past, but was never really clarified. What needs to be clarified is why it seems to be accumulative per number of cores when running the higher number of I/O.
The bottom set of screenshots shows the Clovertown CPUs in their new home in Ten's rig, running a Lynx AES16 / Aurora 16. The difference at 128 samples between the earlier configuration running 112 I/O and the current rig running 16 I/O is quite dramatic.
|Scalability Per Core : Dualcore – Quadcore, Dual Dualcore – Dual Quadcore :|
With that variable out of the way, we needed to clarify the scalability per core moving from Dual DualCore to Dual QuadCore. At the same time, I took the opportunity to have a closer look at the Single DualCore to Single Quadcore platform to gauge comparable scalability within the respective platforms. Some early reports on Quadcore to Dual Quadcore were less than conclusive: a mishmash of platforms, with results using different audio cards, speed grades, etc. We based our tests on Blofeld's DSP 40 benchmark, as it is the most established, has a huge collated database, and of the available benchmarks is the best indicator of scalability per core. We also did not shy away from reporting the performance droop presented by the Save / Reopen issue.
Firstly let’s look at the scalability achieved on the Single Socket Platform.
The scaling from Single DualCore to Single Quadcore was extremely smooth, with the added bonus of the systems being capable of running at the 32 Sample latency settings.
Performance gains across the comparable latency settings of 256 to 064 samples ranged from 52% to 34%. There was no indication of FSB / memory arbitration issues and, quite simply, the performance is astounding for a Single CPU DAW.
Scalability for the Dual Socket Platform shows dramatically different behavior.
While on the surface the results at the higher latencies look reasonably impressive, the % scalability, set against the Single Socket system, shows an increase of only 27% to 5% at the latencies from 256 to 064, and a 13% decrease at 032 samples.
This is a dramatic departure from the results achieved on the i975 chipset / Single Socket system.
This could indicate an arbitration issue navigating the Dual QuadCores/ Cache Coherency/Multi FSB architecture of the i5000x.
On the other hand, it could simply be an inability of the application to scale accordingly.
The difficulty in concluding to what degree the variable was software or hardware related rested purely on Nuendo’s MP optimization for the Dual Quadcore architecture.
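To put the percentages quoted above for the two platforms on a common footing, the following Python sketch converts a percentage gain into a per-core scaling efficiency. It assumes each comparison doubles the core count (2 to 4 on the Single Socket platform, 4 to 8 on the Dual Socket platform). The function names and the efficiency definition are illustrative conventions and not something the DSP 40 benchmark itself reports.

```python
def gain_pct(baseline: float, result: float) -> float:
    """Percentage change of a benchmark result relative to a baseline."""
    return (result - baseline) / baseline * 100.0


def per_core_efficiency(gain_percent: float, core_ratio: float = 2.0) -> float:
    """Fraction of ideal linear scaling achieved.

    1.0 means throughput grew by the same factor as the core count;
    0.5 with core_ratio=2.0 means the extra cores added nothing.
    core_ratio=2.0 is an assumption: every comparison doubles the cores.
    """
    return (1.0 + gain_percent / 100.0) / core_ratio


# Gains quoted in the report, in percent (negative means a loss).
reported = {
    "Single Socket, upper figure": 52,
    "Single Socket, lower figure": 34,
    "Dual Socket, upper figure": 27,
    "Dual Socket, lower figure": 5,
    "Dual Socket, 032 samples": -13,
}
for label, pct in reported.items():
    print(f"{label:<30} {pct:>4}% gain -> efficiency {per_core_efficiency(pct):.2f}")
```

Under this definition, anything at or below 0.5 means the second set of cores contributed nothing, which is the territory the Dual Socket results fall into at the lowest latencies.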
Side Note: The only other report of Nuendo’s MP ability above 4 cores was a thinly detailed account of Nuendo successfully scaling on a Quad Dual Core AMD system. No details were offered, apart from the claim that the scalability from 4 to 8 cores was as expected.
Something that needs to be noted is that the system scalability being reported was based on a 1024 buffer setting, which would negate a vast majority of the issues being experienced and reported on the Dual Quad Xeon systems.
Unfortunately we do not have a clearer idea of the exact scaling of the AMD platform from 4 – 8 Cores , as any comparable benchmarks being utilized are dismissed as irrelevant by the individual involved.
|Nuendo MP Bug Officially Confirmed : Conclusion:|
Just as we were finalizing the testing, Steinberg acknowledged that the issues being experienced were in fact the result of a bug in the Nuendo MP engine. 6 Full weeks after the issue was brought to their attention. No further clarification was offered unfortunately, so the speculation still raged at N.com from those that now felt more than a little vulnerable after their earlier posturing.
The debate then shifted to trying to attribute the issues to Intel's SMP architecture / approach being different enough from AMD’s to be somehow to blame. How this conclusion was drawn is anyone’s guess, unless they are privy to some inside information that has not been shared publicly.
If there is some fire under the smoke, it can only be attributed to the Intel Dual Quad arena. The simple fact that the current Intel Core2 Duo/Quad / Dual Woodcrest systems are more than easily accounting for any of the AMD offerings, clearly shows that the issue is not across the board, as is being suggested.
It’s just a ridiculous continuation of the disinformation and FUD being postured by a select few.
It is now, at the time of writing, a further 6 full weeks since the initial announcement, and Steinberg have not offered any more detail on the actual MP bug or on when we will see a resolution. From the Diagonalese announcements that they have officially offered, a fix will not be offered for SX / N3, and has not been offered in the recent C4 4.02 update.
So until further notice, those of us using the current crop of Core2 based systems would be wiser to limit ourselves to 4 cores on any and all Steinberg products.
Those wanting to move to 8 cores will more than likely need to wait for C 4.1 ? / N4 in Q3 2007, unless of course you can justify 10-15K for a Quad Dual Opti and/or accept running at higher latency settings. Considering the advancements made by the current Intel architecture, which runs comfortably at a true 32 sample latency on numerous audio cards, the latter option seems more than a little like a cop out by those posturing it as a solution..!
Edit: 09 March 2007: Steinberg today announced some more detail of the encountered bug, and it confirms what I have written in this report, that there are 2 distinct areas being affected: the first being the accumulative loadings per core when using high I/O, and the second the inability to scale correctly across multiple cores at lower latencies. They also confirmed that the fix is not scheduled until Q3 / N4 :
What I find unnerving is the influx of new participants who have come out of the woodwork to “thank” Steinberg for their acknowledgment and communication on this issue, and how appreciative we all are that they have come to the rescue after we were chasing our tails ?
Quite simply, people were running around in circles for two reasons. The first was the insistence of the resident corporate apologists on continually trying to shift the focus away from it being a Nuendo issue.
This inevitably slowed the process, as every step forward was continually dragged 2 steps back. I cannot understand the insistence on continually defending a position that is consistently being eroded by the accumulating evidence. Even after the official announcement and acknowledgment, the resistance still didn’t lessen.
The second, and most significant, was that Steinberg refused to acknowledge the issue for 6-7 weeks. I specifically asked for an official response in early December; as I mentioned earlier, the response was that it worked in theory.
Now let’s put that into perspective shall we.
What Steinberg admitted was that they did not have the appropriate systems or resources to actually test the Multithreading capability past “theory”. They did not get an appropriate test system until 6-7 weeks after the issue was brought to their attention. How any company developing Professional Audio applications can possibly be in a situation where they do not have appropriate hardware to test their coding, is beyond me, as it undoubtedly puts their capability of QA into question. The situation also highlights that the width and breadth of the BETA list is also inadequate.
It took a further 6-7 weeks after the first announcement to “officially” confirm the bug and give an ETA for the “fix”. Just how useful and appreciated this newfound interaction is remains to be seen. Talk is cheap, patience is thin, and having dealt with this company at a corporate level for over a decade, I am not holding my breath.