pocak's comments | Hacker News

There is: it's Druid. Intel announced the first four codenames in 2021.

> [...] first generation, based on the Xe HPG microarchitecture, codenamed Alchemist (formerly known as DG2). Intel also revealed the code names of future generations under the Arc brand: Battlemage, Celestial and Druid.

https://www.intel.com/content/www/us/en/newsroom/news/introd...


C should have been Cleric, and I don't know about E (Eldritch Knight?!), but if F ain't Fighter I'm going to be disappointed.


E could be Evoker, Enchanter, or Exorcist. F could be Firedancer, but will probably be Fighter :)


In the post about the texture unit, that ROM table for mip level address offsets seems to use quite a bit of space. Have you considered making the mip base addresses a part of the texture spec instead?


The problem with doing that is that it would require significantly more space in the spec: at a minimum, one offset for each possible mip level. That data needs to be moved around the GPU internally quite a bit, crossing clock domains and everything else, and would require a ton of extra registers to keep track of. Putting it in a ROM is basically free - a pair of BRAMs versus a ton of registers (and the associated timing considerations), and the BRAM wins almost every time.
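To make the trade-off concrete: the per-level data in that ROM is essentially a prefix sum of mip level sizes, which is fixed for a given format and so never needs to travel with the texture spec. A minimal sketch (my own illustration, assuming square power-of-two textures and offsets counted in texels):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MIPS 12

/* Sketch: the base-address offset of each mip level is just a running
 * sum of the sizes of the levels above it - exactly the kind of fixed
 * table a texture unit can bake into a ROM instead of shipping with
 * every texture spec. */
static void mip_offsets(int log2_dim, uint32_t out[MAX_MIPS]) {
    uint32_t offset = 0;
    for (int level = 0; level <= log2_dim && level < MAX_MIPS; level++) {
        out[level] = offset;
        uint32_t dim = 1u << (log2_dim - level);
        offset += dim * dim;            /* size of this level in texels */
    }
}
```

For a 1024x1024 texture, level 0 sits at offset 0, level 1 at 1024*1024, level 2 at 1024*1024 + 512*512, and so on - a handful of constants per texture size, versus one register-backed offset per level per in-flight texture.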


I don't understand why the programs are the same. The partial render store program has to write out both the color and the depth buffer, while the final render store should only write out color and throw away depth.


Possibly pixel local storage - I think this can be accessed with extended raster order groups and image blocks in Metal.

https://developer.apple.com/documentation/metal/resource_fun...

E.g. in their example in the link above for deferred rendering (figure 4), the multiple G-buffers never actually need to leave the on-chip tile buffer - unless there's a partial render before the final shading pass is run.


I think multisampling may be the answer.

For partial rendering all samples must be written out, but for the final one you can resolve (average) them before writeout.
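The resolve step is just a per-channel average across the samples of each pixel; a minimal sketch (my own illustration, assuming 4x MSAA and RGBA8 packed into a uint32_t):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of an MSAA resolve at writeout: a partial render must store
 * all N samples per pixel, but the final flush can average them down
 * to a single value first, cutting writeout bandwidth by a factor of
 * N.  Each 8-bit channel is averaged independently. */
static uint32_t resolve4(const uint32_t samples[4]) {
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t sum = 0;
        for (int i = 0; i < 4; i++)
            sum += (samples[i] >> shift) & 0xffu;
        out |= (sum / 4) << shift;
    }
    return out;
}
```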


Not necessarily, other render passes could need the depth data later.


Right, I had the article's bunny test program on my mind, which looks like it has only one pass.

In OpenGL, the driver would have to scan the following commands to see if it can discard the depth data. If it doesn't see the depth buffer get cleared, it has to be conservative and save the data. I assume mobile GPU drivers in general do make the effort to do this optimization, as the bandwidth savings are significant.

In Vulkan, the application explicitly specifies which attachments (i.e. stencil, depth, color buffers) must be persisted at the end of a render pass, and which need not. So that maps nicely to the "final render flush program".
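Concretely, this is the per-attachment store op in the render pass description. A sketch of the setup described above (the enum values match the Vulkan spec, but the types here are local stand-ins for what normally comes from `<vulkan/vulkan.h>`):

```c
#include <assert.h>

/* Stand-ins for the relevant Vulkan enum values (numeric values match
 * the spec); in real code these come from <vulkan/vulkan.h>. */
typedef enum {
    VK_ATTACHMENT_STORE_OP_STORE     = 0,  /* persist to memory */
    VK_ATTACHMENT_STORE_OP_DONT_CARE = 1   /* contents may be dropped */
} VkAttachmentStoreOp;

typedef struct { VkAttachmentStoreOp storeOp; } AttachmentDesc;

/* The application states up front what must survive the render pass,
 * so the driver needs no lookahead: keep color, discard depth. */
static const AttachmentDesc color_attachment = { VK_ATTACHMENT_STORE_OP_STORE };
static const AttachmentDesc depth_attachment = { VK_ATTACHMENT_STORE_OP_DONT_CARE };
```

With `DONT_CARE` on the depth attachment, the final flush can skip the depth writeout entirely without any conservative scanning of later commands.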

The quote is about Metal, though, which I'm not familiar with, but a sibling comment points out it's similar to Vulkan in this aspect.

So that leaves me wondering: did Rosenzweig happen to only try Metal apps that always use MTLStoreAction.store in passes that overflow the TVB, or is the Metal driver skipping a useful optimization, or neither? E.g. because the hardware has another control for this?


Most likely that would depend on what storeAction is set to: https://developer.apple.com/documentation/metal/mtlrenderpas...


So it seems it allows for optimization. If you know you don’t need everything, one of the steps can do less than the other.


That's what I thought, too, until I saw ARM's Hot Chips 2016 slides. Page 24 shows that they write transformed positions to RAM, and later write varyings to RAM. That's for Bifrost, but it's implied Midgard is the same, except it doesn't filter out vertices from culled primitives.

That makes me wonder whether the other GPUs with position-only shading - Intel and Adreno - do the same.

As for PowerVR, I've never seen them described as position-only shaders - I think they've always done full vertex processing upfront.

edit: slides are at https://old.hotchips.org/wp-content/uploads/hc_archives/hc28...


Mali's slides here still show them doing two vertex shading passes: one for positions, and a second for the other attributes. I'm guessing "memory" here means high-performance in-unit memory like TMEM, rather than a full frame's worth of data, but I'm not sure!


In the article, cost is cpu time, and benefit is file size reduction multiplied by number of times watched.


Linux allocates page tables lazily, and fills them lazily. The only upfront work is to mark the virtual address range as valid and associated with the file. I'd expect mapping giant files to be fast enough to not need windowing.
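A sketch of why mapping is cheap regardless of file size: the mmap() call itself only records the address-range-to-file association, and page table entries are only filled in when a first access faults the page in (this is standard POSIX mmap usage, not anything exotic):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* mmap() only marks the range valid and associates it with the file;
 * no page tables are populated until the first access faults a page
 * in.  So this function's cost is dominated by the single page fault,
 * not by the size of the mapping. */
static char first_byte_of(const char *path) {
    int fd = open(path, O_RDONLY);
    assert(fd >= 0);
    off_t len = lseek(fd, 0, SEEK_END);
    char *p = mmap(NULL, (size_t)len, PROT_READ, MAP_PRIVATE, fd, 0);
    assert(p != MAP_FAILED);
    close(fd);              /* the mapping stays valid after close */
    char c = p[0];          /* first touch: page fault fills the PTE */
    munmap(p, (size_t)len);
    return c;
}
```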


Good point, scratch that part of my answer.

There are still some cases where you'd not want unlimited VM mapping, but those are getting a bit esoteric and at least the most obvious ones are in the process of getting fixed.


Only one row at a time has voltage applied to it. In one update, the image is scanned out multiple times, so it appears as if all pixels were changing simultaneously (and perhaps they do, if the electrodes have significant capacitance.)

With fewer rows to update, each row gets a push more often, flipping the grains faster.
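The relationship is linear: with row-at-a-time scanning, the interval between successive pushes to any one row grows with the number of rows in the update region. A trivial sketch of that (my own illustration, with a hypothetical fixed per-row drive time):

```c
#include <assert.h>

/* Sketch: each scan visits every active row once, so the time between
 * successive pushes to a given row is (rows in update region) x (time
 * to drive one row).  Halving the region halves the revisit interval,
 * which is why partial updates flip the grains faster. */
static double row_revisit_us(int active_rows, double row_drive_us) {
    return active_rows * row_drive_us;
}
```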


Took me a while to get past the home screen. I kept clicking 'paint' at the bottom and 'no' at the top.


Noted


Thread migration only costs on the order of 100 microseconds, including the effect of cold caches. If you keep the AVX thread on the big core for at least 100 milliseconds at a time, you only lose ~0.2% performance.
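The arithmetic behind that figure, assuming a full round trip (migrate onto the big core and back off) per residency period:

```c
#include <assert.h>

/* Back-of-the-envelope: a round trip costs ~2 x 100 us of migration
 * overhead; amortized over a 100 ms stay on the big core, that is
 * 200 us / 100,000 us = 0.2%. */
static double migration_overhead(double migrate_us, double residency_ms) {
    return 2.0 * migrate_us / (residency_ms * 1000.0);
}
```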


Good to know the switching cost is low. Runtime profiling would be key to provide insight into when to switch back.


Migrate back to the little core if the thread hasn't used AVX in a while.

Linux already tracks how long ago a task used AVX-512. I assume the same mechanism could be used to track AVX as well.

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/...
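The policy itself is simple to state: remember when the task last executed an AVX instruction, and demote it once that timestamp goes stale. A hypothetical sketch (the struct and function names are my own, not the kernel's; the kernel keeps a comparable per-task record of AVX-512 use):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the demotion policy: track the last time a
 * task used AVX, and move it back to a little core once that is older
 * than some grace period.  Names are illustrative, not kernel API. */
struct task_avx {
    unsigned long last_avx_jiffies;  /* updated on each AVX trap/use */
};

static bool should_demote(const struct task_avx *t,
                          unsigned long now_jiffies,
                          unsigned long grace_jiffies) {
    return now_jiffies - t->last_avx_jiffies > grace_jiffies;
}
```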

