pocak's comments | Hacker News

There is: it's Druid. Intel announced the first four codenames in 2021.

> [...] first generation, based on the Xe HPG microarchitecture, codenamed Alchemist (formerly known as DG2). Intel also revealed the code names of future generations under the Arc brand: Battlemage, Celestial and Druid.

https://www.intel.com/content/www/us/en/newsroom/news/introd...


C should have been Cleric, and I don't know about E (Eldritch Knight?!), but if F ain't Fighter I'm going to be disappointed.


E could be Evoker, Enchanter, or Exorcist. F could be Firedancer, but will probably be Fighter :)


In the post about the texture unit, that ROM table for mip level address offsets seems to use quite a bit of space. Have you considered making the mip base addresses a part of the texture spec instead?


The problem with doing that is that it would require significantly more space in the spec: at a minimum, one offset for each possible mip level. That data needs to be moved around the GPU internally quite a bit, crossing clock domains and everything else, and would require a ton of extra registers to keep track of. Putting it in a ROM is basically free - a pair of BRAMs versus a ton of registers (and the associated timing considerations), and the BRAM wins almost every time.
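To make the trade-off concrete: the per-level data in that ROM is essentially a prefix sum of mip level sizes, which is fixed for a given format and so never needs to travel with the texture spec. A minimal sketch (my own illustration, assuming square power-of-two textures and offsets counted in texels):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MIPS 12

/* Sketch: the base-address offset of each mip level is just a running
 * sum of the sizes of the levels above it - exactly the kind of fixed
 * table a texture unit can bake into a ROM instead of shipping with
 * every texture spec. */
static void mip_offsets(int log2_dim, uint32_t out[MAX_MIPS]) {
    uint32_t offset = 0;
    for (int level = 0; level <= log2_dim && level < MAX_MIPS; level++) {
        out[level] = offset;
        uint32_t dim = 1u << (log2_dim - level);
        offset += dim * dim;            /* size of this level in texels */
    }
}
```

For a 1024x1024 texture, level 0 sits at offset 0, level 1 at 1024*1024, level 2 at 1024*1024 + 512*512, and so on - a handful of constants per texture size, versus one register-backed offset per level per in-flight texture.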


I don't understand why the programs are the same. The partial render store program has to write out both the color and the depth buffer, while the final render store should only write out color and throw away depth.


Possibly pixel local storage - I think this can be accessed with extended raster order groups and image blocks in Metal.

https://developer.apple.com/documentation/metal/resource_fun...

E.g. in their example in the link above for deferred rendering (figure 4), the multiple G-buffers never actually need to leave the on-chip tile buffer - unless there's a partial render before the final shading pass is run.


I think multisampling may be the answer.

For partial rendering all samples must be written out, but for the final one you can resolve (average) them before writeout.
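The resolve step is just a per-channel average across the samples of each pixel; a minimal sketch (my own illustration, assuming 4x MSAA and RGBA8 packed into a uint32_t):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of an MSAA resolve at writeout: a partial render must store
 * all N samples per pixel, but the final flush can average them down
 * to a single value first, cutting writeout bandwidth by a factor of
 * N.  Each 8-bit channel is averaged independently. */
static uint32_t resolve4(const uint32_t samples[4]) {
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t sum = 0;
        for (int i = 0; i < 4; i++)
            sum += (samples[i] >> shift) & 0xffu;
        out |= (sum / 4) << shift;
    }
    return out;
}
```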


Not necessarily, other render passes could need the depth data later.


Right, I had the article's bunny test program on my mind, which looks like it has only one pass.

In OpenGL, the driver would have to scan the following commands to see if it can discard the depth data. If it doesn't see the depth buffer get cleared, it has to be conservative and save the data. I assume mobile GPU drivers in general do make the effort to do this optimization, as the bandwidth savings are significant.

In Vulkan, the application explicitly specifies which attachments (i.e. stencil, depth, color buffers) must be persisted at the end of a render pass, and which need not. So that maps nicely to the "final render flush program".
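Concretely, this is the per-attachment store op in the render pass description. A sketch of the setup described above (the enum values match the Vulkan spec, but the types here are local stand-ins for what normally comes from `<vulkan/vulkan.h>`):

```c
#include <assert.h>

/* Stand-ins for the relevant Vulkan enum values (numeric values match
 * the spec); in real code these come from <vulkan/vulkan.h>. */
typedef enum {
    VK_ATTACHMENT_STORE_OP_STORE     = 0,  /* persist to memory */
    VK_ATTACHMENT_STORE_OP_DONT_CARE = 1   /* contents may be dropped */
} VkAttachmentStoreOp;

typedef struct { VkAttachmentStoreOp storeOp; } AttachmentDesc;

/* The application states up front what must survive the render pass,
 * so the driver needs no lookahead: keep color, discard depth. */
static const AttachmentDesc color_attachment = { VK_ATTACHMENT_STORE_OP_STORE };
static const AttachmentDesc depth_attachment = { VK_ATTACHMENT_STORE_OP_DONT_CARE };
```

With `DONT_CARE` on the depth attachment, the final flush can skip the depth writeout entirely without any conservative scanning of later commands.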

The quote is about Metal, though, which I'm not familiar with, but a sibling comment points out it's similar to Vulkan in this aspect.

So that leaves me wondering: did Rosenzweig happen to only try Metal apps that always use MTLStoreAction.store in passes that overflow the TVB, or is the Metal driver skipping a useful optimization, or neither? E.g. because the hardware has another control for this?


Most likely that would depend on what storeAction is set to: https://developer.apple.com/documentation/metal/mtlrenderpas...


So it seems it allows for optimization. If you know you don’t need everything, one of the steps can do less than the other.


That's what I thought, too, until I saw ARM's Hot Chips 2016 slides. Page 24 shows that they write transformed positions to RAM, and later write varyings to RAM. That's for Bifrost, but it's implied Midgard is the same, except it doesn't filter out vertices from culled primitives.

That makes me wonder whether the other GPUs with position-only shading - Intel and Adreno - do the same.

As for PowerVR, I've never seen them described as position-only shaders - I think they've always done full vertex processing upfront.

edit: slides are at https://old.hotchips.org/wp-content/uploads/hc_archives/hc28...


Mali's slides here still show them doing two vertex shading passes: one for positions, and a second for the other attributes. I'm guessing "memory" here means high-performance in-unit memory like TMEM, rather than a full frame's worth of data, but I'm not sure!


In the article, cost is cpu time, and benefit is file size reduction multiplied by number of times watched.


Linux allocates page tables lazily, and fills them lazily. The only upfront work is to mark the virtual address range as valid and associated with the file. I'd expect mapping giant files to be fast enough to not need windowing.
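A sketch of why mapping is cheap regardless of file size: the mmap() call itself only records the address-range-to-file association, and page table entries are only filled in when a first access faults the page in (this is standard POSIX mmap usage, not anything exotic):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* mmap() only marks the range valid and associates it with the file;
 * no page tables are populated until the first access faults a page
 * in.  So this function's cost is dominated by the single page fault,
 * not by the size of the mapping. */
static char first_byte_of(const char *path) {
    int fd = open(path, O_RDONLY);
    assert(fd >= 0);
    off_t len = lseek(fd, 0, SEEK_END);
    char *p = mmap(NULL, (size_t)len, PROT_READ, MAP_PRIVATE, fd, 0);
    assert(p != MAP_FAILED);
    close(fd);              /* the mapping stays valid after close */
    char c = p[0];          /* first touch: page fault fills the PTE */
    munmap(p, (size_t)len);
    return c;
}
```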


Good point, scratch that part of my answer.

There are still some cases where you'd not want unlimited VM mapping, but those are getting a bit esoteric and at least the most obvious ones are in the process of getting fixed.


Only one row at a time has voltage applied to it. In one update, the image is scanned out multiple times, so it appears as if all pixels were changing simultaneously (and perhaps they do, if the electrodes have significant capacitance.)

With fewer rows to update, each row gets a push more often, flipping the grains faster.
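The relationship is linear: with row-at-a-time scanning, the interval between successive pushes to any one row grows with the number of rows in the update region. A trivial sketch of that (my own illustration, with a hypothetical fixed per-row drive time):

```c
#include <assert.h>

/* Sketch: each scan visits every active row once, so the time between
 * successive pushes to a given row is (rows in update region) x (time
 * to drive one row).  Halving the region halves the revisit interval,
 * which is why partial updates flip the grains faster. */
static double row_revisit_us(int active_rows, double row_drive_us) {
    return active_rows * row_drive_us;
}
```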


Took me a while to get past the home screen. I kept clicking 'paint' at the bottom and 'no' at the top.


Noted


Thread migration only costs on the order of 100 microseconds, including the effect of cold caches. If you keep the AVX thread on the big core for at least 100 milliseconds at a time, you only lose ~0.2% performance.
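The arithmetic behind that figure, assuming a full round trip (migrate onto the big core and back off) per residency period:

```c
#include <assert.h>

/* Back-of-the-envelope: a round trip costs ~2 x 100 us of migration
 * overhead; amortized over a 100 ms stay on the big core, that is
 * 200 us / 100,000 us = 0.2%. */
static double migration_overhead(double migrate_us, double residency_ms) {
    return 2.0 * migrate_us / (residency_ms * 1000.0);
}
```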


Good to know the switching cost is low. Runtime profiling would be key to provide insight into when to switch back.


Migrate back to the little core if the thread hasn't used AVX in a while.

Linux already tracks how long ago a task used AVX-512. I assume the same mechanism could be used to track AVX as well.

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/...
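The policy itself is simple to state: remember when the task last executed an AVX instruction, and demote it once that timestamp goes stale. A hypothetical sketch (the struct and function names are my own, not the kernel's; the kernel keeps a comparable per-task record of AVX-512 use):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the demotion policy: track the last time a
 * task used AVX, and move it back to a little core once that is older
 * than some grace period.  Names are illustrative, not kernel API. */
struct task_avx {
    unsigned long last_avx_jiffies;  /* updated on each AVX trap/use */
};

static bool should_demote(const struct task_avx *t,
                          unsigned long now_jiffies,
                          unsigned long grace_jiffies) {
    return now_jiffies - t->last_avx_jiffies > grace_jiffies;
}
```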

