Technically, H(H(info) || msg) is enough, which is what we do in libra: https://...

kwantam · on March 14, 2020

There are a lot of things that are much safer to do with SHA-3 (as in Libra) than with SHA-2.

If H is in the SHA-2 family, you're much better off doing

    HMAC(info, msg)

than

    H(H(info) || msg)

In particular, HMAC has provable security properties that the latter construction does not. (HMAC is immune to length extension, it remains secure even when the M-D compression function is quite non-ideal, etc.)

lisper · on March 14, 2020

> you're much better off doing HMAC(info, msg)

In terms of security this may be true. But the whole point in this discussion is to avoid having to re-hash msg multiple times for different values of info.

kwantam · on March 14, 2020

D'oh. Of course. One thing we could do in this case is use NMAC rather than HMAC.

First, let's more explicitly define a Merkle Damgaard function in terms of its compression function and initialization vector. For some compression function

    h(prior, next)

(here, think of `prior` as b bits---say, 256---and `next` as 2b bits) we can define a Merkle Damgaard hash with explicit initialization vector IV as

    H_{IV}(block0) =
        h(IV, block0)

    H_{IV}(... || blockN) =
        h(H_{IV}(... || block(N-1)), blockN)

(here, blockX represents an input block of 2b bits; I'm assuming we'll pad the message, say, the way SHA-2 does).

Now we can define HMAC-SHA-256 in terms of the above notation as

    HMAC(key, msg) = H_{oiv}(H_{iiv}(msg))
      where
        oiv = h(sha256iv, key XOR opad)
        iiv = h(sha256iv, key XOR ipad)

(here, h is the SHA-256 compression function and sha256iv is the standards-specified IV for SHA-256). Here, I'm slightly glossing over the way `key` is padded, but the point is that in HMAC the key inputs are always expanded or squashed to exactly one block in length, which means that the key is effectively being used to create two initialization vectors, one for the inner call to H and a different one (with overwhelming probability) for the outer call.

This is related to the reason that HMAC(info, msg) doesn't let us reuse the hashing work for different info strings: the inner H's IV depends on info, so we have to recompute the whole inner hash. When info is public, however, there's no need to make the inner IV depend on info---it can just as well be some other fixed public string. Said another way, we can use a generalization of HMAC called NMAC, which we can define as follows:

    NMAC(oiv, iiv, msg) = H_{oiv}(H_{iiv}(msg))

one natural way to do this would be

    NMAC(f(info), sha256iv, msg)

for some reasonable function f. A natural one is

    f(info) = h(sha256iv, H(info) || 0000...0000)

where the zero padding is b bits. Re-writing this in terms of H without explicit IVs, we have

    H(H(info) || 0000...0000 || H(msg))

which is very reasonable to compute (and is related to a construction in the Coron et al paper I linked [2] in my original note). Note that the zero padding between H(info) and H(msg) is crucial if we want to be able to leverage the NMAC analysis. Intuitively, we don't want the outer IV to depend on msg, so we have to push msg into the next block.

I should also note that all of the above is a bit far afield from where we started, and it probably seems like we've come full circle to prepending again! The kind of exercise in this post is essentially expanding my comment that "you will want to stop length extension attacks and related issues"---this is one way to do it, but by no means the only way. For example, if we want to use the info string for domain separation and be able to reuse work hashing msg, you could do something like:

    H(fixed_zeros_block || H(msg || info) || info)

This puts msg in prefix position (good for reuse), uses a different IV for inner and outer hash (good for preventing length extension), and always appends info to the input to H (good for domain separation).

There has been a lot of work studying the security of NMAC and HMAC, and the upshot is that both constructions protect us really well from non-idealities in the compression function for secret IVs, though somewhat less well for public ones. Off the top of my head, one further nice reference is https://eprint.iacr.org/2006/043

kwantam · on March 20, 2020

Ah, quick update to the above (can't edit anymore):

the final construction I posted in the above message is not good for arbitrary messages, because if msg starts with an all-zeros block then the inner and outer hash invocations won't have different IVs (which is bad).

If msg is structured in a way that guarantees that it doesn't have an all-zeros block as a prefix, then this construction is fine, but that does seem quite fragile...

Goes to show how easy it is to mess this up. :\