I know that the classic trick of using SHL instead of multiplication has been outdated for a number of years. My Intel IA32 manuals confirm that this is now slower than the multiply operation in most cases.
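For anyone who hasn't seen the trick, here is a minimal sketch of what "SHL instead of multiplication" means. The function names are made up for illustration. Python integers are unbounded, so this shows only the arithmetic identity, not fixed-width overflow behaviour or anything about speed.

```python
def times_pow2(x, k):
    """Multiply x by 2**k using a single left shift (hypothetical helper)."""
    return x << k


def times_10(x):
    """Multiply x by 10 as 8*x + 2*x: two shifts and one add (hypothetical helper)."""
    return (x << 3) + (x << 1)


if __name__ == "__main__":
    for x in range(100):
        assert times_pow2(x, 3) == x * 8
        assert times_10(x) == x * 10
```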
A fast, variable shifter is a reasonably complicated slice of chip. Faster multipliers are one of the best improvements chips have made over the last 20 years. At some point, you'll make the multiplier as fast as the fast shifter, and then you might as well slow the shifter down to save space, as no one should be using it any more.
I think I'll have to take a look at state-of-the-art implementations of shifters and multipliers. Does anybody know of some resources regarding this topic (papers or reasonably optimized VHDL code)?
http://en.wikipedia.org/wiki/Wallace_tree is the implementation used on the Pentium III and http://en.wikipedia.org/wiki/Dadda_tree is an improvement on that. The Dadda tree was invented in 1965 [1], so I guess the issue is more in having enough transistors and joining them up fast than in designing the multiply algorithm.
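To give a rough idea of what these trees do, here is a word-level sketch of the shared building block: a carry-save adder (3:2 compressor) that turns three partial products into two without propagating carries, applied repeatedly until only two rows remain. This is an assumption-laden simplification. Real Wallace and Dadda trees work per bit column in hardware and differ in how they schedule the reductions, and the names `carry_save_add` and `tree_multiply` are made up for this example.

```python
def carry_save_add(a, b, c):
    """3:2 compressor applied to whole words.

    Returns (s, carry) such that s + carry == a + b + c,
    without any carry rippling between bit positions.
    """
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def tree_multiply(x, y):
    """Multiply two non-negative ints via carry-save reduction of partial products."""
    # One shifted copy of x for every set bit of y.
    rows = [x << i for i in range(y.bit_length()) if (y >> i) & 1]
    # Each pass replaces every group of three rows with two rows.
    while len(rows) > 2:
        reduced = []
        for i in range(0, len(rows) - 2, 3):
            reduced.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        # Rows that did not fit into a group of three pass through unchanged.
        reduced.extend(rows[len(rows) - len(rows) % 3:])
        rows = reduced
    # Final step: one ordinary (carry-propagating) addition.
    return sum(rows)
```

The point of the structure is that the number of rows shrinks by roughly a third per pass, so the depth grows logarithmically with the operand width, and only the final addition needs a full carry chain.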
Having said that, the following might be some possible references.
Shailendra Jain, Vasantha Erraguntla, Sriram R. Vangal, Yatin Hoskote, Nitin Borkar, Tulasi Mandepudi, Karthik VP, "A 90mW/GFlop 3.4GHz Reconfigurable Fused/Continuous Multiply-Accumulator for Floating-Point and Integer Operands in 65nm," VLSI Design, International Conference on, pp. 252-257, 2010 23rd International Conference on VLSI Design, 2010. -- http://www.computer.org/portal/web/csdl/doi/10.1109/VLSI.Des...
I would have thought the state-of-the-art is trade secret for Intel and AMD, unless anyone actually depackages their chips to look at the multipliers? Patent applications might be worth looking at, but they are probably incomprehensible.
The latency of an (integer) multiplication is 3 clock cycles, while an (integer) addition and an (integer) shift each take 1 clock cycle (according to the Intel® 64 and IA-32 Architectures Optimization Reference Manual [1]). So shifting is still faster than multiplication, but optimizing for a saving of 1 or 2 clock cycles doesn't make sense anymore...
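As a back-of-the-envelope illustration using the latencies quoted above, here is a sketch that estimates the critical path of replacing x * constant with shifts and adds. It assumes every shift can issue in parallel and that the adds form a balanced tree, which a real core may not manage, and it ignores throughput, register pressure and code size. The function name is made up for this example.

```python
MUL_LATENCY = 3    # cycles, as quoted from the optimization manual
ADD_LATENCY = 1
SHIFT_LATENCY = 1


def shift_add_latency(constant):
    """Estimated critical-path latency (cycles) of x * constant via shifts and adds.

    Assumes one shifted term per set bit of a positive constant, all shifts
    issued in parallel, and a balanced tree of adds combining the terms.
    """
    terms = bin(constant).count("1")
    # A constant of 1 needs no shift at all; anything larger needs at least one.
    shift = SHIFT_LATENCY if constant > 1 else 0
    # Depth of a balanced add tree over `terms` inputs: ceil(log2(terms)).
    depth = (terms - 1).bit_length()
    return shift + depth * ADD_LATENCY


if __name__ == "__main__":
    for c in (8, 10, 15, 255):
        print(c, shift_add_latency(c), "vs multiply:", MUL_LATENCY)
```

Under these assumptions a power of two costs 1 cycle and a two-term constant such as 10 costs 2, so the saving over a 3-cycle multiply is at most 1 or 2 cycles, and a constant with many set bits such as 255 comes out at 4 cycles, which is worse than just multiplying.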