Hacker News | def-pri-pub's comments

>> It also gets in the way of elegance and truth.

Where did that come from in the article?


I revisited that article and ... now I have no idea. Maybe I stumbled into some other trig-related article and came back here. Or maybe this one had some A/B content going on?

The only thing I remember at this point is that I copied and pasted that sentence (I didn't type it.) Even search doesn't find the sentence anywhere but HN.


Search finds that sentence on this blog post https://iquilezles.org/articles/noacos/


Thanks for finding it! Still not sure how I got there while thinking I was at "Even Faster Asin()..."


Thank you for linking that!


The reason for writing out all of the x multiplications like that is that I was hoping the compiler would detect such a pattern and perform an optimization for me. Matt Godbolt's "Advent of Compiler Optimizations" series mentions some of these cases where the compiler can do more auto-optimization for the developer.


Horner's form is typically also more accurate; at the very least, it is not bit-identical, so the compiler won't rewrite to it unless you pass `-funsafe-math-optimizations`, and maybe not even then.
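To make the two forms concrete, here is a sketch with placeholder coefficients (c1..c4 are hypothetical, not the article's actual polynomial):

```cpp
#include <cassert>
#include <cmath>

// Expanded form: many independent multiplies, which the compiler may be
// able to pair into FMAs or vectorize.
float poly_expanded(float x, float c1, float c2, float c3, float c4) {
    return c1 * x
         + c2 * x * x * x
         + c3 * x * x * x * x * x
         + c4 * x * x * x * x * x * x * x;
}

// Horner's form: fewest operations, but a serial dependency chain.
float poly_horner(float x, float c1, float c2, float c3, float c4) {
    float x2 = x * x;
    return x * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4)));
}
```

The two are mathematically equal but round differently, which is exactly why the compiler won't substitute one for the other without an unsafe-math flag.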


I did see that, but isn't the vast majority of that page talking about acos() instead?


That’s equivalent, right? acos(x) = pi/2 - asin(x)

So if you’ve got one that’s fast, you have them both.
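As a sketch of the identity (using `std::asin` as a stand-in for whatever fast asin you have):

```cpp
#include <cmath>

// acos(x) = pi/2 - asin(x); any fast asin doubles as a fast acos.
float acos_from_asin(float x) {
    const float half_pi = 1.57079632679489661923f;
    return half_pi - std::asin(x);  // swap in the fast asin here
}
```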


I did scan some (major) open-source game and graphics-related projects and found a few of them using `std::asin()`. I plan on submitting some patches.


When I was working on this project, I was trying to restrict myself to the architecture of the original Ray Tracing in One Weekend book series. I am aware that things are not as SIMD-friendly and that becomes a major bottleneck. While I am confident that an architectural change could yield a massive performance boost, it's something I don't want to spend my time on.

I think it's also more fun sometimes to take existing systems and to try to optimize them given whatever constraints exist. I've had to do that a lot in my day job already.


I can relate to setting an arbitrary challenge for myself. FWIW, I don't know where you draw the line for an architectural change, but I think that switching AoS -> SoA may actually be an approachably-sized mechanical refactor, and then taking advantage of it to SIMDify object lists can be done incrementally.

The value of course is contingent on there being a decent number of objects of a given type in the list rather than just a huge number of rays being sent to a small number of objects; I didn't evaluate that. If it's the other way around, the structure would be better flipped, and I don't know how reasonable that is with bounces (that maybe then aren't all being evaluated against the same objects?).
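The AoS -> SoA switch being suggested might look something like this (type and field names are hypothetical, not from the book's code):

```cpp
#include <vector>

// AoS: one struct per sphere; fields are interleaved in memory,
// so a SIMD load of "all the radii" has to gather.
struct SphereAoS { float cx, cy, cz, r; };

// SoA: one contiguous array per field; lane i of each array is sphere i,
// so intersection tests can load 4/8 spheres' worth of a field at once.
struct SpheresSoA {
    std::vector<float> cx, cy, cz, r;
    void push(const SphereAoS& s) {
        cx.push_back(s.cx); cy.push_back(s.cy);
        cz.push_back(s.cz); r.push_back(s.r);
    }
};
```

The refactor is mostly mechanical: each field access `spheres[i].r` becomes `spheres.r[i]`, and the SIMD work can then land one loop at a time.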


You'd be surprised; it actually is worth the effort, even for just a 1% improvement. If you have the time, this is a great talk to listen to: https://www.youtube.com/watch?v=kPR8h4-qZdk

For a little toy ray tracer, it is pretty measly. But for a larger corporation (with a professional project) a 4% speed improvement can mean MASSIVE cost savings.

Some of these tiny improvements can also have a cascading effect. Imagine finding a +4%, a +2% somewhere else, a +3% in neighboring code, and a bunch of +1%s here and there. Eventually you'll have built up something that is 15-20% faster. Down the road you'll also come across the optimizations that yield the big results (e.g. a +25%).
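The arithmetic behind that 15-20% figure: independent speedups compound multiplicatively, not additively. A quick check (the specific wins are the hypothetical ones from the comment above):

```cpp
#include <cmath>

// Compound a +4%, +2%, +3%, and eight scattered +1% wins.
double combined_speedup() {
    double s = 1.04 * 1.02 * 1.03;  // the three medium wins
    s *= std::pow(1.01, 8);         // eight small +1% wins
    return s;                       // lands around 1.18, i.e. ~18% faster
}
```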


It's a cool talk, but the relevance to the present problem escapes me.

If you're alluding to gcc vs fbstring's performance (circa 15:43), then the performance improvement is not because fbstring is faster/simpler, but due to a foundational gcc design decision to always use the heap for string variables. Also, at around 16:40, the speaker concedes that gcc's simpler size() implementation runs significantly faster (3x faster at 0.3 ns) when the test conditions are different.


Funny enough, the fdlibm implementation of asin() did come up in my research. I believe it might have been more performant in the past. But taking a quick scan of `e_asin.c`, I see it doing something similar to the Cg asin() implementation, though with more terms and more multiplications, which I'd guess makes it slower. It also seems to take more branches (which could contribute to a further slowdown).


Yeah, Ng’s work in fdlibm is cool and really clever in parts, but there's a lot of branching. Some of the ways they reach correct rounding are… so cool.


These are books that my uni courses never had me read. I'm a little shocked at times at how my degree program skimped on some of the more famous texts.


It is not a textbook, it is an extremely dense reference manual, so that honestly makes sense.

In physics grad school, professors would occasionally allude to it, and textbooks would cite it ... pretty often. So it's a thing anyone with postgraduate physics education should know exists, but you wouldn't ever be assigned it.


Presumably someone read it though, at some point, in order to be able to cite it.


The relevant sections, at any rate.


I didn't need Abramowitz and Stegun until grad school. In the 1990s. It was a well-known reference book for people at that level, not a textbook.

For my undergrad the CRC math handbook was enough.


Wait, what? Do you have a resource I could read up on about that? That is moderately concerning if your math isn't portable across chips.


When Intel specced the rsqrt[ps]s and rcp[ps]s instructions ~30 years ago, they didn't fully specify their behavior. They just said their relative error is "smaller than 1.5 * 2⁻¹²," which someone thought was very clever because it gave them leeway to use tables or piecewise linear approximations or digit-by-digit computation or whatever was best suited to future processors. Since these are not IEEE 754 correctly-rounded operations, and there was (by definition) no software that currently used them, this was "fine".

And mostly it has been OK, except for some cases like games or simulations that want to get bitwise identical results across HW, which (if they're lucky) just don't use these operations or (if they're unlucky) use them and have to handle mismatches somehow. Compilers never generate these operations implicitly unless you're compiling with some sort of fast-math flag, so you mostly only get to them by explicitly using an intrinsic, and in theory you know what you're signing up for if you do that.
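As a sketch of what using the intrinsic looks like, and why code that needs tighter error pairs it with a Newton-Raphson step (x86-only; the raw estimate's bits are vendor-dependent, which is the whole point of this thread):

```cpp
#include <immintrin.h>  // SSE intrinsics
#include <cmath>

// Raw rsqrtss estimate: only guaranteed |relative error| < 1.5 * 2^-12,
// and the exact bits may differ between CPU vendors and generations.
float rsqrt_est(float x) {
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

// One Newton-Raphson step for 1/sqrt(x) roughly squares the relative
// error, bringing it close to full float precision.
float rsqrt_nr(float x) {
    float y = rsqrt_est(x);
    return y * (1.5f - 0.5f * x * y * y);
}
```

Note that the refined result still inherits vendor-specific bits from the initial estimate, so it doesn't fix the cross-hardware reproducibility problem by itself; it only shrinks the error.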

However, this did make them unusable for some scenarios where you would otherwise like to use them, so a bunch of graphics and scientific computing and math library developers said "please fully specify these operations next time" and now NEON/SVE and AVX512 have fully-specified reciprocal estimates,¹ which solves the problem unless you have to interoperate between x86 and ARM.

¹ e.g. Intel "specifies" theirs here: https://www.intel.com/content/www/us/en/developer/articles/c...

ARM's is a little more readable: https://developer.arm.com/documentation/ddi0596/2021-03/Shar...


Thanks!


Take a look at the "rsqrt_rcp" section of reference [6] in the accuracy report by Gladman et al referenced above. I did that work 10 years ago because some people at CERN had reported getting different results from certain programs depending on whether the exact same executables were run on Intel or AMD cpus. The result of the investigation was that the differing results were due to different implementations of the rsqrt instruction on the different cpus.

