I know that the classic SHL instead of multiplication has been outdated for a n...

ComSubVie · on April 28, 2010

That is pretty interesting, how can a shift operation be slower than a multiplication?

Robin_Message · on April 28, 2010

A fast, variable shifter is a reasonably complicated slice of chip. Having a fast multiplier is one of the best improvements chips have been making for the last 20 years. At some point, you'll make the multiplier as fast as the fast shifter, and then you might as well slow the shifter down to save space, as no-one should be using it any more.

ComSubVie · on April 28, 2010

Which is a good thing (tm).

I think I'll have to take a look at state-of-the-art implementations of shifters and multipliers. Does anybody know of some resources regarding this topic (papers or reasonably optimized VHDL code)?

Robin_Message · on April 28, 2010

http://en.wikipedia.org/wiki/Wallace_tree is the implementation used on the Pentium III and http://en.wikipedia.org/wiki/Dadda_tree is an improvement on that. The Dadda tree was invented in 1965 [1], so I guess the issue is more in having enough transistors and joining them up fast than in designing the multiply algorithm.
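For anyone who wants to see the idea in software first: both tree schemes are built from carry-save (3:2) adders, which turn three addends into two without propagating carries. Below is a minimal C sketch of that step. The names csa and mul_carry_save are made up for illustration, and the loop reduces the partial products one row at a time rather than in a tree, so it shows the building block only, not the Wallace or Dadda wiring itself.

```c
#include <stdint.h>

/* 3:2 compressor: three addends in, two out.
 * Invariant: *sum + *carry == a + b + c (mod 2^32). */
static void csa(uint32_t a, uint32_t b, uint32_t c,
                uint32_t *sum, uint32_t *carry)
{
    *sum = a ^ b ^ c;                              /* per-bit sum, no carries */
    *carry = ((a & b) | (a & c) | (b & c)) << 1;   /* per-bit majority, moved up one place */
}

/* Hypothetical helper: multiply by folding each partial product into a
 * (sum, carry) pair. Only the final addition propagates carries. */
static uint32_t mul_carry_save(uint32_t x, uint32_t y)
{
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; i++) {
        /* Row i of the partial-product array: x shifted left by i if bit i of y is set. */
        uint32_t partial = ((y >> i) & 1u) ? (x << i) : 0u;
        csa(sum, carry, partial, &sum, &carry);
    }
    return sum + carry;   /* the single carry-propagate add at the end */
}
```

A hardware tree applies the same compressor to many rows in parallel, so the number of compressor levels grows roughly with the logarithm of the number of rows instead of linearly as in this loop.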

Having said that, the following might be some possible references.

Glenn Colón-Bonet and Paul Winterrowd, "Multiplier Evolution: A Family of Multiplier VLSI Implementations" -- http://comjnl.oxfordjournals.org/cgi/content/abstract/bxm123...

Shailendra Jain, Vasantha Erraguntla, Sriram R. Vangal, Yatin Hoskote, Nitin Borkar, Tulasi Mandepudi, Karthik VP, "A 90mW/GFlop 3.4GHz Reconfigurable Fused/Continuous Multiply-Accumulator for Floating-Point and Integer Operands in 65nm," VLSI Design, International Conference on, pp. 252-257, 2010 23rd International Conference on VLSI Design, 2010. -- http://www.computer.org/portal/web/csdl/doi/10.1109/VLSI.Des...

I would have thought the state of the art is a trade secret for Intel and AMD, unless anyone actually depackages their chips to look at the multipliers? Patent applications might be worth looking at, but they are probably incomprehensible.

Also, http://www.ece.ucsb.edu/~parhami/ece_252b.htm looks like a good course web page on VLSI maths, but I don't know.

[1] Dadda, L. (1965). "Some schemes for parallel multipliers". Alta Frequenza 34: 349–356.

ComSubVie · on April 28, 2010

thanks!

ssp · on April 28, 2010

Maybe he means that things like reducing (a * 5) to ((a << 2) + a) are now obsolete.
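A small C sketch of that rewrite, in case it helps. The function names times5_mul and times5_shift are made up for illustration. Note that in C the shift needs its own parentheses, because + binds tighter than <<.

```c
#include <stdint.h>

/* Plain multiply: leaves the choice of instruction to the compiler. */
static uint32_t times5_mul(uint32_t a)
{
    return a * 5u;
}

/* Hand-written strength reduction: a * 5 == a * 4 + a.
 * Without the inner parentheses this would parse as a << (2 + a). */
static uint32_t times5_shift(uint32_t a)
{
    return (a << 2) + a;
}
```

Both functions give the same result for every 32-bit input, including when the product wraps around, since unsigned arithmetic in C is defined modulo 2^32.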

ComSubVie · on April 28, 2010

The latency of an (integer) multiplication is 3 clock cycles, while an (integer) addition and an (integer) shift each take 1 clock cycle (according to the Intel® 64 and IA-32 Architectures Optimization Reference Manual [1]). So shifting is still faster than multiplication, but optimizing to save 1 or 2 clock cycles doesn't make sense anymore...

[1] http://www.intel.com/products/processor/manuals/

Robin_Message · on April 28, 2010

That was something like lea eax,[ebx*4 + ebx] anyway :)