
Thursday, February 6, 2014

Diagnose frame skips and stutter in DirectX applications

(Although I'll be writing specifically about applications using DirectX for rendering, the general concepts can also be applied to applications using OpenGL.)

The user experience of games and multimedia applications depends heavily on a feeling of smoothness. This feeling is quickly lost when
  • the application does not respond fast enough to input,
  • animations stutter, or
  • audio output is distorted.
Developers have to work hard to make sure none of these three happens. But when they do happen, how do you find out what's responsible? In this article I'll show a way to diagnose frame skips.

Tools

Your Swiss Army knife for diagnosing performance problems on Windows 7 and newer is the Windows Performance Toolkit, which is part of the Windows SDK. In this article I'll use Windows Performance Recorder (WPR, wprui), Windows Performance Analyzer (WPA) and GPUView.

The second step will be to record an event trace using WPR. But to get a clean event trace with as little background noise as possible, the first step is to prepare a clean Windows instance. This means rebooting the computer and ending all processes which are not vital for the operation of the OS. Start WPR (wprui, not the command line tool) and select the profiles "First level triage", "CPU usage", "GPU activity", "Desktop composition activity", "Audio glitches" and "Video glitches". Select detail level "Verbose" and log to memory. Start your application, then start the event recording. Wait a little and let WPR record the "normal" state of your application. Then provoke a frame skip or wait until one appears on its own, depending on the nature of your frame skip problem. If you realize that the frame skip does not appear any more, the processes you killed may be a hint: repeat the steps, re-enabling one process after another, until the frame skipping appears again. Save the recorded trace.

Different Types of Problems

There are many reasons why your application may experience a frame skip:
  • Application CPU load: your application is heavily CPU dependent. The CPU is not able to process the work required between two vsyncs (16ms for 60Hz).
  • Application GPU load: your application creates a lot of GPU load (usually shader processing) so the GPU is not able to process the work between two vsyncs.
  • System CPU load: background processes occupy a big chunk of CPU time leaving your application insufficient CPU time.
  • System GPU load: similar to the previous one, but this time GPU related.
  • Failing to process CPU and GPU work concurrently: CPU and GPU are able to work in parallel. If your application just waits for the vsync after finishing the last frame, it wastes CPU time that would be available while the GPU is still busy.
  • Bad vsync synchronization: This is similar to the previous one. If you synchronize to vsync in a bad way (and there are a lot of ways to get it wrong) your application wakes up too late after the last vsync, leaving not enough time until the next vsync to process the work required to construct the next frame.
  • Locks: either locks in your application, or system wide locks, may block your application and make it skip a frame or two.
  • System / driver bugs: this is the nastiest way to lose frames. It's possible for the OS or graphics driver to have bugs which are triggered by your application and which lead to frame losses.
I'll omit the last point, because there's usually nothing you can do about it. Most of the time the reason for your frame loss will fall into one of the other categories.
All these types of problems can be made visible using GPUView or the more versatile WPA (Windows Performance Analyzer). Load your recorded trace in GPUView. Matt's Webcorner has a very nice introduction to GPUView. Read it before you continue here.

Application CPU Load

If the CPU sits all the time in your application's thread lanes and the 'present' GPU packet arrives too late for the vsync, you're affected by this kind of problem. See the section "Large CPU Delay - Dalaran" in Matt's introduction to GPUView for screenshots of how this looks in GPUView.
Once you've confirmed that you're suffering from this problem, you can switch to WPA and have a look at the "CPU Usage (Sampled)" view to get a hint about where your application spends most of its time. Any other profiler will help you find bottlenecks in your code, too.
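A quick way to confirm a CPU-bound frame, even before reaching for a profiler, is to time the frame loop yourself. Here's a minimal sketch (not taken from any particular application) using QueryPerformanceCounter; DoFrame() is a hypothetical placeholder for the per-frame work:
#include <windows.h>
#include <cstdio>

// Hypothetical placeholder for the application's per-frame CPU work.
static void DoFrame() { Sleep(5); }

int main()
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);

    for (int frame = 0; frame < 1000; ++frame) {
        QueryPerformanceCounter(&start);
        DoFrame();                                  // build and submit one frame
        QueryPerformanceCounter(&end);

        double ms = (end.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;
        if (ms > 16.7)                              // CPU budget at 60 Hz
            std::printf("CPU-bound frame %d: %.2f ms\n", frame, ms);
    }
    return 0;
}
If the logged times regularly exceed the vsync budget, the trace will almost certainly confirm the picture described above.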

Application GPU Load

See section "Large Render Time - Mothership Cloaking" in Matt's GPUView introduction for screen shots how this looks like. Diagnose this deeper is harder than CPU load. The GPU profiler industry sector is not so mature as the one for CPU profilers. Still, there are couple of tools which are available. And of course there's always "ninja debugging".
  • NVIDIA PerfKit: if you own a graphics card with an NVIDIA GPU, these tools might help you get to the core of the problem.
  • AMD GPU PerfStudio 2
  • AMD GPU ShaderAnalyzer
  • Intel Graphics Performance Analyzers
  • gDEBugger: for OpenGL applications
  • Ninja debugging: cut away everything. Add the parts back again, one after another, and measure the performance between the steps. When you have identified a part responsible for a big chunk of GPU work, cut it further down and repeat.
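Besides these tools, you can bracket a suspicious render pass with GPU timestamp queries and log how long the GPU takes for it. A minimal Direct3D 11 sketch, assuming a device and immediate context already exist; DrawSuspiciousPass() is a hypothetical placeholder for the pass being measured:
#include <d3d11.h>

// Hypothetical placeholder for the render pass whose GPU time we want to know.
void DrawSuspiciousPass(ID3D11DeviceContext* context);

// Returns the GPU time of one pass in milliseconds, or -1.0 if the
// measurement was disjoint (e.g. the GPU clock changed during the pass).
double MeasurePassGpuMs(ID3D11Device* device, ID3D11DeviceContext* context)
{
    D3D11_QUERY_DESC desc = {};
    desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    ID3D11Query* disjoint = nullptr;
    ID3D11Query* tsBegin = nullptr;
    ID3D11Query* tsEnd = nullptr;
    device->CreateQuery(&desc, &disjoint);
    desc.Query = D3D11_QUERY_TIMESTAMP;
    device->CreateQuery(&desc, &tsBegin);
    device->CreateQuery(&desc, &tsEnd);

    context->Begin(disjoint);
    context->End(tsBegin);            // timestamp before the pass
    DrawSuspiciousPass(context);      // the GPU work being measured
    context->End(tsEnd);              // timestamp after the pass
    context->End(disjoint);

    // Busy-wait for the results; fine for a debugging build.
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj = {};
    while (context->GetData(disjoint, &dj, sizeof(dj), 0) == S_FALSE) {}
    UINT64 t0 = 0, t1 = 0;
    while (context->GetData(tsBegin, &t0, sizeof(t0), 0) == S_FALSE) {}
    while (context->GetData(tsEnd, &t1, sizeof(t1), 0) == S_FALSE) {}

    disjoint->Release();
    tsBegin->Release();
    tsEnd->Release();
    return dj.Disjoint ? -1.0 : (t1 - t0) * 1000.0 / dj.Frequency;
}
This is essentially ninja debugging with numbers: move the query brackets around until you find the pass that eats the GPU time.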
 

System CPU Load

If you see the CPU sitting in the lanes of other processes than yours, you know you need to get rid of them.

System GPU Load

Not so common, but possible. If the graphics card queue is full, but not with packets from your process, check which process issued them.

Failing to process CPU and GPU work concurrently

If your application has a lot of CPU work to do, you'd better use every bit of CPU time available. If you rely on the 'present' command to block and to wake up your process once the vsync occurred, you're wasting important milliseconds of CPU time. I'll cover how to synchronize to vsync without relying on a blocking 'present' call in a later article.
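Until that article is written, the basic idea is simply to reorder the frame loop so the CPU already prepares the next frame while the GPU is still rendering the current one. A rough Direct3D 11 sketch; UpdateSimulation() and BuildAndSubmitDrawCalls() are hypothetical placeholders for the application's work:
#include <dxgi.h>

// Hypothetical placeholders for the application's work.
void UpdateSimulation();              // CPU-only work for the next frame
void BuildAndSubmitDrawCalls();       // records and submits the GPU commands

void FrameLoop(IDXGISwapChain* swapChain)
{
    for (;;) {
        BuildAndSubmitDrawCalls();    // queue the GPU work for frame N first
        UpdateSimulation();           // CPU work for frame N+1 overlaps with the
                                      // GPU still processing frame N
        swapChain->Present(1, 0);     // synchronize to vsync as the last step
    }
}
The point is the ordering: the potentially blocking Present call comes after the CPU work for the next frame, not before it.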

Bad vsync synchronization

If you try not to rely on a blocking 'present' call and do it wrong, it won't be any better than relying on 'present'. I'll cover this in the same future article as the previous topic.

Locks

It's possible that you're your own enemy, if you do network or file system operations in the same thread that does the graphics work. But it's also possible that you avoided those obviously blocking APIs and your application was still blocked by the system. If you see that your graphics thread wakes up too late after vsync to do its job, you need to analyse what it was waiting on. The second part of this article explains such a situation in detail.
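One common way to keep blocking calls away from the graphics thread is to hand them to a worker thread. A minimal C++ sketch of such a job queue (nothing application specific is assumed here):
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Minimal job queue so blocking work (file or network I/O) never runs on
// the render thread: the render loop only enqueues, the worker blocks.
class WorkerQueue {
public:
    WorkerQueue() : worker_([this] { Run(); }) {}
    ~WorkerQueue() {
        { std::lock_guard<std::mutex> lock(mutex_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    void Post(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(mutex_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void Run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (done_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();   // the blocking call happens here, off the render thread
        }
    }
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;
    std::thread worker_;
};
This doesn't help against system-wide locks, of course, which is exactly the kind of situation the next part of the article digs into.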


Blocked by the System

Let's have a look at a simple application which does some basic DirectX operations. There is not much CPU load. There is little GPU load. Usually the application runs at a constant 60 fps, but from time to time it skips a frame. I recorded an event trace and loaded it in GPUView. First I have to find the position in the trace where the frame skip occurs. There's a nice shortcut for this: locate the row with your application's process and enable the 'Present Chart' (Charts > Toggle Present Chart). Have a look at the Latency lane and watch out for spikes:

I select the spike and press CTRL-Z to zoom into the selection. Now I enable vsync events (F8) and disable the Present Chart again. The result looks like this:
You see three horizontal lanes. The first is the graphics queue of the process. The next two are the two threads running in the application. There are five frames in the screenshot, the blue lines being the vsync events. Three frames look healthy: little work in the threads, a couple of graphics command packets quickly processed. But in one of the frames the 'present' packet (cross-hatched) is kept unusually long in the process queue. Usually the command packets are moved from the process queue to the graphics card queue by the system video scheduler when there's room in the hardware queue. Here's a screenshot of the same time period (I increased the zoom a little bit) including the graphics queue.
In the first frame of this screenshot the 'present' packet is moved nearly instantly into the hardware queue and is processed quickly. But in the next frame it remains in the process queue for some reason and is not moved and processed until the third frame.
The video scheduler thread is hosted by the System process. Here's a "healthy" frame zoomed in to show about 3ms after the vsync event.
You see the hardware lane, the test application "Source.exe", the System process and the Desktop Window Manager (dwm.exe) process. The name of the visible thread in the System process is already a hint to what it does: "VidSchiWorkerThread". If you look closely you see how this thread moves the graphics command packets. Shortly after dwm.exe added a packet to its queue, the video scheduler wakes up and moves the packet to the hardware lane (the packet remains in the process lane until it's completely processed by the hardware). The video scheduler goes to sleep again and wakes up when Source.exe adds a command packet to its queue. It's moved to the hardware queue nearly immediately. Then again, it wakes up when the 'present' packet is added to the process queue. But the hardware queue already contains two packets, so the scheduler goes to sleep again and wakes up once the GPU has finished the dwm.exe packet. The 'present' packet is moved to the GPU queue.

GPUView is able to display the stack of a thread at the time when it was woken up. If the zoom level is high enough, small handles appear at the bottom of the squares denoting the CPU time slices used by the threads. Clicking the handle of the first CPU slice of the System thread opens a window displaying a lot of information and a stack trace at the bottom. (This assumes the symbol path was set up correctly (Options > Symbol Path...). You can set the symbol path to something similar to this: srv*C:\Symbols*http://msdl.microsoft.com/download/symbols; and enable "Use Symbol Server".)
The stack trace of the video scheduler waking up for the first time is (zero is the youngest stack frame, the invocation order is from bottom to top):
0 0xFFFFF80003686F97 ntoskrnl.exe!SwapContext_PatchXRstor+0x103 
1 0xFFFFF80003686A8A ntoskrnl.exe!KiSwapContext+0x7A 
2 0xFFFFF800036795F2 ntoskrnl.exe!KiCommitThreadWait+0x1D2 
3 0xFFFFF800036860EA ntoskrnl.exe!KeWaitForMultipleObjects+0x26A 
4 0xFFFFF880043CA3F8 dxgmms1.sys!VidSchiWaitForSchedulerEvents+0x1BC 
5 0xFFFFF8800439D2D2 dxgmms1.sys!VidSchiScheduleCommandToRun+0x1DA 
6 0xFFFFF880043C9FF6 dxgmms1.sys!VidSchiWorkerThread+0xBA 
7 0xFFFFF800039202EA ntoskrnl.exe!PspSystemThreadStartup+0x5A 
8 0xFFFFF800036748E6 ntoskrnl.exe!KiStartSystemThread+0x16 
Everything above "KeWaitForMultipleObjects" can be ignored. Just below this call we see what the thread was waiting for: "VidSchiWaitForSchedulerEvents". And it was invoked by "VidSchiScheduleCommandToRun". I checked the stack traces of the three successive wake ups. Two times it was the same stack trace. But the last time the thread was waiting for something different:
0 0xFFFFF80003686F97 ntoskrnl.exe!SwapContext_PatchXRstor+0x103 
1 0xFFFFF80003686A8A ntoskrnl.exe!KiSwapContext+0x7A 
2 0xFFFFF800036795F2 ntoskrnl.exe!KiCommitThreadWait+0x1D2 
3 0xFFFFF800036860EA ntoskrnl.exe!KeWaitForMultipleObjects+0x26A 
4 0xFFFFF880043A10B5 dxgmms1.sys!VidSchWaitForEvents+0xF5 
5 0xFFFFF880043CFE79 dxgmms1.sys!VidSchWaitForCompletionEvent+0x139 
6 0xFFFFF8800439D2AA dxgmms1.sys!VidSchiScheduleCommandToRun+0x1B2 
7 0xFFFFF880043C9FF6 dxgmms1.sys!VidSchiWorkerThread+0xBA 
8 0xFFFFF800039202EA ntoskrnl.exe!PspSystemThreadStartup+0x5A 
9 0xFFFFF800036748E6 ntoskrnl.exe!KiStartSystemThread+0x16 
This time it was waiting for completion ("VidSchWaitForCompletionEvent"). As already said the thread waits for graphic command packets to appear in process queues and moves them to the GPU queue. When the GPU queue is full, it waits until there's room.

Next I want to see what the bad frame looks like:
The VidSchiWorkerThread was woken up six times in this frame. It woke up for the first time as dwm.exe added a packet to its queue. It also started adjusting its own priority for some reason ("VidSchiAdjustWorkerThreadPriority"). When it was woken up for the second time, it finished the priority adjustment and went to sleep again. The third time it was woken up when Source.exe added a packet to its queue. It finished its work and went to sleep again. The fourth time it was woken up by the new 'present' packet. But the GPU queue was full, so it went to sleep again, this time waiting for completion of the packet being processed by the GPU. Just like in the "good" frame. But when it woke up for the fifth time, it did not move the 'present' packet to the GPU queue. Even though there was room for a new packet, it went to sleep again, waiting for something. The stack trace of the sixth wake-up shows what it was waiting for:
0 0xFFFFF80003686F97 ntoskrnl.exe!SwapContext_PatchXRstor+0x103 
1 0xFFFFF80003686A8A ntoskrnl.exe!KiSwapContext+0x7A 
2 0xFFFFF800036795F2 ntoskrnl.exe!KiCommitThreadWait+0x1D2 
3 0xFFFFF800036860EA ntoskrnl.exe!KeWaitForMultipleObjects+0x26A 
4 0xFFFFF880043A10B5 dxgmms1.sys!VidSchWaitForEvents+0xF5 
5 0xFFFFF880043CFE79 dxgmms1.sys!VidSchWaitForCompletionEvent+0x139 
6 0xFFFFF880043CB82B dxgmms1.sys!VidSchiWaitForEmptyHwQueue+0x8F 
7 0xFFFFF880043CA19F dxgmms1.sys!VidSchiHandleControlEvent+0x3B 
8 0xFFFFF880043CA47A dxgmms1.sys!VidSchiWaitForSchedulerEvents+0x23E 
9 0xFFFFF8800439D2D2 dxgmms1.sys!VidSchiScheduleCommandToRun+0x1DA 
10 0xFFFFF880043C9FF6 dxgmms1.sys!VidSchiWorkerThread+0xBA 
11 0xFFFFF800039202EA ntoskrnl.exe!PspSystemThreadStartup+0x5A 
12 0xFFFFF800036748E6 ntoskrnl.exe!KiStartSystemThread+0x16
It's waiting for an empty GPU queue ("VidSchiWaitForEmptyHwQueue"). And "VidSchiHandleControlEvent" was the reason it started to wait for an empty GPU queue. So some kind of control command arrived which made the scheduler drain the GPU queue. But even this was not enough: the 'present' packet remained in the software queue until the end of the frame. Just after the next vsync the scheduler thread wakes up:
and the stack trace shows what it was waiting for throughout the frame:
0 0xFFFFF80003686F97 ntoskrnl.exe!SwapContext_PatchXRstor+0x103 
1 0xFFFFF80003686A8A ntoskrnl.exe!KiSwapContext+0x7A 
2 0xFFFFF800036795F2 ntoskrnl.exe!KiCommitThreadWait+0x1D2 
3 0xFFFFF800036860EA ntoskrnl.exe!KeWaitForMultipleObjects+0x26A 
4 0xFFFFF880043A10B5 dxgmms1.sys!VidSchWaitForEvents+0xF5 
5 0xFFFFF880043CFE79 dxgmms1.sys!VidSchWaitForCompletionEvent+0x139 
6 0xFFFFF880043CB91A dxgmms1.sys!VidSchiWaitForDrainFlipQueue+0x92 
7 0xFFFFF880043CA1B0 dxgmms1.sys!VidSchiHandleControlEvent+0x4C 
8 0xFFFFF880043CA47A dxgmms1.sys!VidSchiWaitForSchedulerEvents+0x23E 
9 0xFFFFF8800439D2D2 dxgmms1.sys!VidSchiScheduleCommandToRun+0x1DA 
10 0xFFFFF880043C9FF6 dxgmms1.sys!VidSchiWorkerThread+0xBA 
11 0xFFFFF800039202EA ntoskrnl.exe!PspSystemThreadStartup+0x5A 
12 0xFFFFF800036748E6 ntoskrnl.exe!KiStartSystemThread+0x16 
The control command even required the flip queue to run completely dry. It looks like it was some kind of GPU "flush" command. But what requested this flush? In the screenshots above there was a collapsed process (csrss.exe) which I made visible for the next screenshot. It shows the beginning of the bad frame again:
You see that csrss.exe executed something before the video scheduler started to flush the GPU. Unfortunately I can't tell from this CPU time slice what csrss.exe did. The event trace contains only the stack from the moment the thread woke up, and that didn't reveal anything about what csrss.exe was about to do. If I had been lucky, the sampled stack collector would have been active at exactly this point in time and I would have seen the stack trace of csrss.exe in the "CPU Usage (Sampled)" view in WPA. But I was not lucky. The information I was looking for was in the next CPU time slice of csrss.exe, which also woke up after the next vsync:
When I clicked the small handle of the csrss.exe CPU time slice, GPUView showed me that the System video scheduler thread readied the csrss.exe thread; the small white triangle indicates this. In other words: the video scheduler unlocked something the csrss.exe thread was waiting for (to learn more about wait analysis read this great article by Bruce Dawson: Xperf Wait Analysis – Finding Idle Time). The video scheduler did this after the GPU flush completed, so csrss.exe was waiting for the GPU flush. And indeed the stack trace of csrss.exe contains the next hint:
0 0xFFFFF80003686F97 ntoskrnl.exe!SwapContext_PatchXRstor+0x103 
1 0xFFFFF80003686A8A ntoskrnl.exe!KiSwapContext+0x7A 
2 0xFFFFF800036795F2 ntoskrnl.exe!KiCommitThreadWait+0x1D2 
3 0xFFFFF8000368A99F ntoskrnl.exe!KeWaitForSingleObject+0x19F 
4 0xFFFFF880043CD132 dxgmms1.sys!VidSchiRequestSchedulerStatus+0x86 
5 0xFFFFF880043CE002 dxgmms1.sys!VidSchSuspendAdapter+0x32 
6 0xFFFFF880042D09A3 dxgkrnl.sys!DXGADAPTER::SetGammaRamp+0xFF 
7 0xFFFFF880042D0EA2 dxgkrnl.sys!DxgkCddSetGammaRamp+0x1BA 
8 0xFFFFF960006D5E75 cdd.dll!PresentWorkerThread+0xB09 
9 0xFFFFF800039202EA ntoskrnl.exe!PspSystemThreadStartup+0x5A 
10 0xFFFFF800036748E6 ntoskrnl.exe!KiStartSystemThread+0x16 
"SetGammaRamp" was the command which required the GPU to complete all the work in the queue (so it won't be affected by the change probably). But I still don't know which process told csrss.exe to execute this command. I don't see no more processes in GPUView. But this doesn't mean there are no more processes. GPUView does hide some processes which it find to be not relevant for the trace. I switch to WPA and load the same trace.

"SetGammaRamp" was a blocking call for csrss.exe. So maybe it did also block the process which called csrsss.exe. If it did, I would see the culprit wake up after csrss.exe with it's stack trace some where around a gamma ramp. I use GPUView to determine the time stamps where this all happened. Like the time when the the video scheduler woke up. In WPA I select the "CPU Usage (Precise) Timeline by Process" view. The precise CPU view contains stack traces of the threads when they wake up. Right click in the graph area and choose "Select Time Range...". Then right click "Zoom". The view zooms to the time range. Here's what I saw:
I saw the first slice of the system video scheduler thread (this was right after the vsync event); it readied csrss.exe. csrss.exe did some work and went to sleep for two times. It looks like csrss.exe readied either dmw.exe or flux.exe. I verified this by looking at the "ReadyingProcess" column in the tabular view (you can adjust the visible columns by clicking the cog icon in the graph head line right to the graph name). It looks like csrss.exe did ready both threads:
Examining the stack traces (don't forget to setup symbol path and tell WPA to load symbols (Trace->Load Symbols)) showed that dwm.exe was waiting for shared access to the graphical device to do some frame statistics. Looks harmless. flux.exe on the other hand, had this wake up stack trace:
(Note that WPA presents the stack traces the other way round as GPUView. The oldest frame is at the top and the youngest frame is at the bottom. So frames invoked lower frames.)
Bulls eye! So flux.exe told the system to set the gamma ramp. f.lux is a small back ground tool which adjusts the colour temperature of your screen according to sun rise and sun set times. This shows very nicely that one has to consider everything when analysing problems. Even though f.lux consumed no CPU and no GPU power it was responsible for the frame skip.
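For illustration, setting a gamma ramp takes just one GDI call. The following sketch is my own guess at the kind of call involved; it is not taken from f.lux:
#include <windows.h>

// Sets a linear (identity) gamma ramp on the primary display via GDI.
// This is only an illustration of the kind of request that ends up in
// csrss.exe / SetGammaRamp and forces the GPU flush seen in the trace.
bool SetIdentityGammaRamp()
{
    HDC screen = GetDC(nullptr);                   // DC of the primary display
    WORD ramp[3][256];                             // red, green and blue ramps
    for (int i = 0; i < 256; ++i) {
        WORD value = static_cast<WORD>(i * 257);   // spread 0..255 over 0..65535
        ramp[0][i] = ramp[1][i] = ramp[2][i] = value;
    }
    BOOL ok = SetDeviceGammaRamp(screen, ramp);
    ReleaseDC(nullptr, screen);
    return ok != FALSE;
}
A tool that adjusts colour temperature presumably issues a call like this on a timer, which is all it takes to stall an otherwise idle renderer for a frame.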




Tuesday, January 21, 2014

Diagnose OpenGL Performance Problems

I stumbled upon an interesting OpenGL slowdown issue on stackoverflow.com: Curious slowdown in opengl when using instanced rendering. The author created a program which renders 20 x 20 x 20 cubes in two different ways: indexed drawing and instanced drawing. When using indexed drawing, performance seemed fine. But after switching to instanced drawing it broke down, and only in a special case. As the question author described it: when the camera position changed, the performance went from decent (distance of 40) to horrible (distance of 0). This sounded strange.

The first step in finding a problem is to reproduce it. So I copied the example code, fixed small compiler issues, and created the shaders which were referenced but missing in the question's example code. They are simple, so I guess the author used similar ones.

Fragment shader for both drawing methods:
#version 330 core

in vec3 fragmentColor;
out vec3 color;

void main() {
  color = fragmentColor;
}
The vertex shader for basic indexed drawing:
#version 330 core

layout(location = 0) in vec3 vertexPosition;
layout(location = 1) in vec3 vertexColor;

out vec3 fragmentColor;
uniform mat4 MVP;

void main(){
  gl_Position = MVP * vec4((vertexPosition), 1);
  fragmentColor = vertexColor;
}
The vertex shader for instanced drawing:
#version 330 core

layout(location = 0) in vec3 vertexPosition;
layout(location = 1) in vec3 vertexColor;
layout(location = 2) in vec3 translation;

out vec3 fragmentColor;

uniform mat4 MVP;

void main(){
  gl_Position = MVP * vec4((vertexPosition + translation), 1);
  fragmentColor = vertexColor;
}
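For completeness: the per-instance translation attribute (location 2) has to be fed from the lattice point buffer with an attribute divisor. The setup presumably looked roughly like this; latticePointBuffer is a hypothetical name for the buffer holding the packed x/y/z translations:
// Hypothetical setup of the per-instance 'translation' attribute (location 2).
// latticePointBuffer is assumed to hold the packed x/y/z lattice points.
glBindBuffer(GL_ARRAY_BUFFER, latticePointBuffer);
glEnableVertexAttribArray(2);
glVertexAttribPointer(2, 3, GL_FLOAT, GL_FALSE, 0, (GLvoid*)0);
glVertexAttribDivisor(2, 1);   // advance the attribute once per instance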
With those changes I was able to confirm the question's claim. The performance of indexed drawing was fine. The performance of the instanced drawing method was horrible. It not only rendered extremely slowly, it also stalled the whole computer. The CPU load was nearly 0%. Since the program stalled the whole computer, the bottleneck must have been some kind of system-wide lock, either in software or in hardware in the form of a contended resource.

The next step was to create a system event trace file. This would tell me not only what the analysed program was doing but also how it interacted with the system. From Windows 7 onwards there is a very powerful event tracing system. The tools to create and analyse event traces are WPR (Windows Performance Recorder, wprui) and WPA (Windows Performance Analyzer), both part of the Windows 8.1 SDK. If you want to get down to the details of performance analysis, randomascii.wordpress.com is a very nice blog.

I enabled "GPU activity" and "Video glitches" profiles in wpr, recorded a session of the slow running program and opened it with GPUView. Also part of the Windows 8.1 SDK. The GPUView project site of Matt Fisher describes nicely how to read GPUView output. Here is a screen shot from GPUView's output generated from the recorded event trace:


The light green lane is the test process having enormous rendering problems. The darker green parts are GPU command packets queued up in the process. The gray strip at the bottom is the CPU time consumed by the process. Since you see only thin black lines, the process is hardly consuming any CPU time. But it heavily uses the GPU. At the top you see a ruler; one tick of the ruler is about 100 ms. The top blue lane is the graphics card and its hardware queue of command packets. The dark green packets are packets from the test process which are moved from software to hardware when there is room in the hardware queue. So it takes the GPU around 100 ms to process one dark green packet!
The red packets are from the idle process and the brown ones from the dwm.exe process. Don't get confused by the size of the red and brown packets; one needs to get used to GPUView's way of presenting things. The packet in the bottom row of the hardware queue is the packet being processed, and this packet determines the width of the whole (!) column. When it's the turn of the red and brown packets to be processed, they are so fast that you never see them in the bottom row. Here's a screenshot zoomed in on one of those red packets:


The ruler ticks are now 1 ms. Once it's the red packet's turn to be processed, it finishes in less than 1 ms.

So we found out that the GPU is the bottleneck. The application hardly uses CPU but it creates heaps of GPU command packets and each packet takes ages (in terms of GPU time) to be processed. The question is why.

The next step in finding a problem is to reduce the complexity of the test. Reduce it as far as possible. Sometimes it's easier to restart from zero and add complexity until the problem occurs. This is what I did. I recreated the program step by step. First it drew small numbers of cubes with indexed drawing and instanced drawing. Then I added more and more cubes. GPUView confirmed that the number of command packets increased along with cube count. This was expected. But the execution time for each packet was around 0.8-3ms depending on the camera position (how many cubes were visible) but independent of the drawing style! Indexed drawing and instanced drawing were equally fast.

So my rewrite reached the complexity of the original program without triggering the problem. I started to replace parts of the original program with parts from my program. After I replaced the lattice calculation I noticed a strange cube right in the center of the grid. The grid in the original program was aligned in a way that the camera would face the center of a cube:

But in my rewrite I placed the camera between the rows and columns:


The strange cube was there in both versions, but in my rewrite it became visible. The performance dropped when the camera moved towards this cube.

I continued replacing parts of the program until the problem disappeared. And the cube disappeared with it! I reverted my last change and realized that it was a bug in the code triggering the problem, not some strange OpenGL behaviour. This is the buggy line drawing the cubes:
glDrawElementsInstanced(GL_TRIANGLES, sqIndice.size(), GL_UNSIGNED_INT,
 (GLvoid*) (0), latticePoints.size());
This is the line from my rewrite:
glDrawElementsInstanced(GL_TRIANGLES, indices.size(), GL_UNSIGNED_INT,
 (GLvoid*)NULL, gridCoordinates.size()/3);
glDrawElementsInstanced expects the number of instances (cubes) to draw, not the number of array elements. So instead of drawing 8000 cubes as intended, the program drew 24000 cubes. But that's not the whole story. ;)
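Assuming latticePoints is a flat array of floats holding one x/y/z triple per cube, the original call presumably just needed the element count divided by three:
// Presumed fix for the original call: latticePoints stores 3 floats per cube,
// so the instance count is latticePoints.size() / 3 (8000 instead of 24000).
glDrawElementsInstanced(GL_TRIANGLES, sqIndice.size(), GL_UNSIGNED_INT,
 (GLvoid*) (0), latticePoints.size() / 3);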

Let's remember the vertex shader. For each instance it reads three float values from the latticePoints array and stores them in the vec3 translation variable. But OpenGL tried to draw far more instances than the latticePoints array could supply: latticePoints contained 24000 float elements, while OpenGL tried to access 24000*3 = 72000 elements. Such an out-of-bounds access would usually cause an illegal memory access. Luckily there are facilities which prevent a shader from crashing, so instead of a crash the shader just received zero values. translation was set to (0,0,0), so no translation happened. This explains the mysterious cube in the center. Or more precisely: the 16000 mysterious cubes in the center.

But why did it perform so horribly only when the camera moved towards the center cube? The vertices of the 16000 additional cubes alone did not make it that much worse. I did some testing and came up with the answer: the program hits the graphics card memory speed limit. How so? Here is what happens: each fragment (pixel) painted by OpenGL has to be written to graphics card memory. OpenGL avoids most of these writes by not painting invisible parts; the depth buffer is very good at this task. But what about identical objects at exactly the same position? Is the front face of cube 8001 concealed by the front face of cube 8002, which is at the same position? It looks like OpenGL thinks it is not. So it paints the front faces of all 16000 cubes in the center. When the cube is far away these are not many pixels, but as the camera moves towards the center the front face of the cube covers more and more pixels. When the calculation time for one frame reaches 500 ms the cube is roughly 250x250 pixels big. Let's do some math:

250 * 250 = 62,500 - pixels to draw for one cube front face
62,500 * 16,000 = 1,000,000,000 - pixels to draw for all 16,000 cubes in the center
1,000,000,000 * 3 = 3,000,000,000 - bytes per frame (3 bytes are transferred for each pixel)
3,000,000,000 * 2 = 6,000,000,000 - bytes per second (we have 2 fps)

That's a lot. Let's see what my graphics card is able to handle. I'm using a notebook with an NVIDIA Quadro NVS 160M. These are the specs:
  • Memory clock: 700 MHz
  • Memory bus width: 64 bit
  • Memory type: GDDR2, GDDR3
 
700,000,000 * 8 = 5,600,000,000 - that's the memory clock multiplied by the bus width in bytes (64 bit = 8 bytes)

So the NVS 160M is able to move about 5.6 GBytes per second (I'm ignoring the memory type here for simplicity). According to the calculation above, the program moves about 6 GBytes per second. Don't nail me down on the numbers, everything is approximate! But this little calculation shows that the memory speed is clearly at its limit. I did other tests to verify this. I moved the 16000 cubes a little bit along the z axis so the first one would conceal the others. The frame processing time dropped drastically (16 ms per frame)! Then I disabled the depth buffer, which had kept OpenGL from drawing unnecessary pixels, and there it was again: the horrible performance. Then I created another small program which just painted two triangles forming a square and let OpenGL paint it a thousand times. It surely took long. For a full screen window of 800x600 the GPU required 400 ms per frame. Nearly no geometry, only 6000 vertices!
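That overdraw test is easy to reproduce. Here's a sketch of its core, assuming a window, GL context and a trivial passthrough shader program already exist; createPassthroughProgram(), windowIsOpen() and swapBuffers() are hypothetical helpers:
// Overdraw test: draw the same large quad 1000 times per frame with depth
// testing disabled, so every draw writes every covered pixel again.
// createPassthroughProgram(), windowIsOpen() and swapBuffers() are
// hypothetical helpers for shader, window and buffer-swap handling.
void runOverdrawTest()
{
    const float quad[] = {            // two triangles forming a square
        -0.9f, -0.9f,   0.9f, -0.9f,   0.9f,  0.9f,
        -0.9f, -0.9f,   0.9f,  0.9f,  -0.9f,  0.9f,
    };
    GLuint vao = 0, vbo = 0;
    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(quad), quad, GL_STATIC_DRAW);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, (GLvoid*)0);

    glDisable(GL_DEPTH_TEST);         // make sure nothing is rejected early
    glUseProgram(createPassthroughProgram());

    while (windowIsOpen()) {
        glClear(GL_COLOR_BUFFER_BIT);
        for (int i = 0; i < 1000; ++i)
            glDrawArrays(GL_TRIANGLES, 0, 6);   // 6 vertices, drawn 1000 times
        swapBuffers();                // present the frame
    }
}
With almost no geometry and no shader math, any slowness here can only come from filling pixels, which is exactly the point of the test.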

So here's the culprit: graphic card memory speed. Or: the bug. Depends on the perspective. ;)

For the vigilant reader of the original stackoverflow.com question: the enabled multisampling surely took its toll as well.