It would be good to have them measure aspects other than performance, like how long it took to build each one, whether there was a learning curve because the language was unfamiliar, how secure the resulting code is, etc.
I'm pretty sure that an exhaustive answer to your last question is "history will tell", with the additional assumption that all those drivers get deployed in many production environments.
To measure how long it took to build it, track how many days it took to implement in each language.
To measure how secure the resulting code is, run a test suite of malformed packets or other hostile input against each implementation and see how many of them it handles gracefully.
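One way to sketch such a suite in Python, assuming a hypothetical `parse_packet` function standing in for whichever driver is under test: feed it truncated and bit-flipped variants of a valid packet and count how many are rejected cleanly rather than crashing.

```python
import random

def parse_packet(data: bytes) -> dict:
    # Stand-in for the real driver's parser (hypothetical).
    # It should raise ValueError on malformed input; any other
    # exception counts as a crash.
    if len(data) < 4:
        raise ValueError("packet too short")
    if data[0] != 0x01:
        raise ValueError("bad version byte")
    return {"version": data[0], "payload": data[4:]}

def malformed_variants(valid: bytes, n: int = 100) -> list[bytes]:
    """Generate truncated and bit-flipped copies of a valid packet."""
    rng = random.Random(42)  # fixed seed so runs are comparable
    variants = [valid[:i] for i in range(len(valid))]  # truncations
    for _ in range(n):
        flipped = bytearray(valid)
        i = rng.randrange(len(valid))
        flipped[i] ^= 1 << rng.randrange(8)  # flip one bit
        variants.append(bytes(flipped))
    return variants

def score(valid: bytes) -> tuple[int, int]:
    """Return (handled_cleanly, crashed) over all variants."""
    handled = crashed = 0
    for pkt in malformed_variants(valid):
        try:
            parse_packet(pkt)
            handled += 1   # parsed despite damage: acceptable
        except ValueError:
            handled += 1   # rejected with a clean error: good
        except Exception:
            crashed += 1   # anything else counts against the code
    return handled, crashed
```

Pointed at each language's implementation (via a subprocess or FFI shim instead of the in-process stub above), the same harness yields a directly comparable handled-vs-crashed score.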
I'm not qualified to judge the latter, but I can speak to the former.
That measures how long a particular developer takes, with some noise induced by the working environment (say there's construction noise during Python week but not during Java week, or more meetings one week than another, or the developer has relationship troubles at home). Randomness happens.
Deducing a number that's more generally valid requires having n workers doing the same work and then doing statistical analysis. Happily, that also takes care of the single-worker problem. Still, the cost of the experiment easily rises by a factor of twenty or a hundred, depending on how well the noise can be controlled and how much accuracy is needed.
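To be fair, the analysis itself is the cheap part. A minimal sketch, assuming each worker's implementation time in days has been recorded (the sample values below are made up for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def summarize(days: list[float]) -> tuple[float, float]:
    """Sample mean and a rough 95% confidence half-width
    (normal approximation; a sanity check, not a proof)."""
    m = mean(days)
    half_width = 1.96 * stdev(days) / sqrt(len(days))
    return m, half_width

# Hypothetical measurements: n = 8 workers per language.
python_days = [9, 11, 10, 12, 8, 10, 11, 9]
java_days = [19, 22, 18, 21, 20, 23, 19, 18]

for name, sample in [("python", python_days), ("java", java_days)]:
    m, hw = summarize(sample)
    print(f"{name}: {m:.1f} +/- {hw:.1f} days")
```

Non-overlapping intervals suggest a real difference; overlapping ones mean you need more workers or better noise control, which is exactly where the factor of twenty or a hundred comes from.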
Asking for improvements that would increase the cost of an experiment by many thousands of per cent is a %$#@!%#@! $%#@!%@#$ thing to do, IMO.
The perfect is the enemy of the good. I'll take a rough estimate over no data at all. If the Python guy took 10 days and the Java guy took 20, then the conclusion is that Python is more productive than Java. Is it more productive exactly by 100%? No, maybe 60%, or maybe 120%. But whether the benefit is 60% or 120%, I know which tool I'll choose next time.