Okay, I'm dead tired of hearing this argument (I've been getting it a lot from reviewers). There are plenty of reasons not to use a pretrained model.
The most obvious one is that your dataset doesn't share a lot of features with the pretrained dataset. This is pretty common in vision tasks where the pretrained set is usually ImageNet. If there isn't significant mutual information in the tasks (e.g. language to vision or vice versa), then you aren't going to transfer the knowledge that well and you've just wasted your time.
Scientific datasets often have this issue: usually you can't collect much data, and even if there is shared knowledge with the pre-training set, it will give you a bias that you don't want or can't use. It may even prevent the model from generalizing. Training from scratch can smooth this out in some cases. Pretraining means you're starting from a different point in the optimization space, and that can pigeonhole you towards certain optima.
The pre-trained models often being suggested are HUGE. Sometimes you might as well write something from scratch (or train a smaller model from scratch) because you just don't have the compute. There are many small models that are highly capable and can be trained from scratch; you can even train transformers on CPUs. Different architectures suit different tasks, so even a random pretrained model that does well on some dataset isn't going to save you. You may also just be wasting significant/costly compute.
So when you can, use a pretrained model. But knowledge transfer isn't always going to help you. Your mileage may vary is all I'm saying, and pretrained models are not the solution to everything.
> The most obvious one is that your dataset doesn't share a lot of features with the pretrained dataset. This is pretty common in vision tasks where the pretrained set is usually ImageNet.
If you've seen a paper that verifies this I'd love to see it. The early layers of the network detecting simple features like lines, curves, textures, shapes, colors, etc could still benefit from what they learn on ImageNet even if the features later in the network are not similar.
FWIW I have not yet seen a model starting from a pre-trained COCO checkpoint that does worse than random initialization.
> The early layers of the network detecting simple features like lines, curves, textures, shapes, colors, etc
This is pretty common, especially with convolutions, but it is not guaranteed. How you embed matters a lot. For example, there are transformers that use early convolutions for the embeddings, and that makes them just work; though with too many, they become less performant (ViT tried pre-ResNets, which wasn't great). [0] also investigates transfer learning in medical domains and shows that CNNs depend more on statistics reuse while transformers depend more on feature reuse.
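To make the embedding choice concrete, here is a rough sketch comparing a single strided-conv "patchify" embedding against a small conv stem of the kind that comment describes (channel sizes and depths are made up for illustration, not taken from any particular paper):

```python
import torch
import torch.nn as nn

# ViT-style patchify: one big strided conv, i.e. a linear projection of
# non-overlapping 16x16 patches.
patchify = nn.Conv2d(3, 192, kernel_size=16, stride=16)

# An "early convolutions" stem: several small 3x3 convs with stride 2.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 48, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(48, 96, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(96, 192, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(192, 192, 3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
# Both produce the same 14x14 grid of 192-d tokens for the transformer,
# but the stem gets there through a stack of small local filters.
print(patchify(x).shape)   # torch.Size([1, 192, 14, 14])
print(conv_stem(x).shape)  # torch.Size([1, 192, 14, 14])
```

Either output can be flattened into a token sequence; the point is that the two embeddings reach the same shape through very different inductive biases.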
> If you've seen a paper that verifies this I'd love to see it.
[0] also discusses this: when it does and doesn't work in medical domains. Basically any paper that discusses transfer learning will also discuss its limitations, but note that there is a bias towards results that work. [1] also shows some of these results, where ImageNet pretraining both helps and doesn't (and references others doing the same). Note in Figures 2 and 3 how InceptionV{3,4} and MNASNet have higher performance without pretraining (Fig 4 is a summary). This shows part of what I was saying: it isn't always about the dataset either. You have a coupled problem that is hard to disentangle. There are also plenty of papers that try to claim LLMs are good at learning vision classification and never get past 50-60% accuracy (or worse) on ImageNet. Lots of scientific papers will also just straight up train from scratch and not mention transfer learning because it simply didn't work for them, but you'd need to physically talk to these people as it isn't in their papers.
> FWIW I have not yet seen a model starting from a pre-trained COCO checkpoint that does worse than random initialization.
Additionally, ImageNet performance doesn't correlate 1-to-1 with how well a model works as a backbone for object detection and segmentation.
As another note, I would often say to be careful with pretrained models. The vast majority of papers are using test accuracy to hyper-parameter tune. So you're leaking knowledge into your model. I think this is mostly caused by reviewer benchmarkism (desk reject if you aren't SOTA) so bad practices become standard.
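The discipline that benchmarkism encourages people to skip can be sketched in a few lines: tune hyper-parameters against a validation split, and touch the held-out test split exactly once at the end. The `evaluate` function below is a stand-in for "train with these settings and return validation accuracy", not a real training loop:

```python
import random

def evaluate(train_data, val_data, lr):
    """Stand-in for: fit a model with this lr, return validation accuracy."""
    # Purely illustrative score; a real pipeline would fit and score a model.
    return 1.0 - abs(lr - 0.01)

data = list(range(100))  # stand-in for a labeled dataset
random.seed(0)
random.shuffle(data)

# Three disjoint splits. The test split is never seen during tuning.
train, val, test = data[:70], data[70:85], data[85:]

# Hyper-parameter search looks only at the validation split...
best_lr = max([0.1, 0.01, 0.001], key=lambda lr: evaluate(train, val, lr))
print(best_lr)  # 0.01 with this toy scoring function

# ...and `test` is evaluated once, after all tuning decisions are frozen.
```

Tuning on test accuracy collapses the last two splits into one, which is exactly the leak being described.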
Yeah, I do want to make clear that transfer learning frequently works. I think my initial comment probably comes off too strong (just reeling from terrible and unproductive reviewers who are reject-happy). But there is a common belief that you never need to train from scratch, and that's what I'm really trying to counter.
That medium-sized dry objects in photographs and satellite imagery (for example) have little in common strikes me as self-evident. Separating "lines", "colours", etc. at the most abstract level is a trivial task. It is not trivial at the semantic level, where the network is supposed to be useful, and where the overlap in mutual information makes all the difference.
People don't understand ML and AI well enough to do this, and don't realize a surface understanding of the math is all that is needed to create cool tools like this.
I ask you kindly to stop showing people, so I can keep feeling smart for knowing discrete mathematics /lh
If it works, then great. If it doesn't, it's difficult to know why, and even more difficult to fix it. The fix might involve retraining with better data, retraining with a different architecture, regularisation, and endless unknown knobs to tune.
CLIP is one of the most powerful and underutilized pretrained models.
More flexible than any other pretrained image recognition model (that I'm aware of). Labels are given in plain English, and can be changed without data annotation or retraining.
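Under the hood, CLIP-style zero-shot classification is just cosine similarity between one image embedding and the embeddings of candidate captions, which is why changing labels needs no retraining. A minimal sketch with made-up vectors (in a real pipeline these would come from CLIP's image and text encoders, e.g. via the `open_clip` package):

```python
import numpy as np

# Hypothetical stand-ins for CLIP embeddings; the numbers are invented.
image_emb = np.array([0.9, 0.1, 0.2])
label_embs = {
    "a photo of a cat":  np.array([0.8, 0.2, 0.1]),
    "a photo of a dog":  np.array([0.1, 0.9, 0.3]),
    "a satellite image": np.array([0.2, 0.1, 0.95]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Zero-shot classification: pick the caption whose embedding is closest.
scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
best = max(scores, key=scores.get)
print(best)  # "a photo of a cat" for these made-up vectors
```

Swapping in a new label set is just a matter of embedding new caption strings; no annotation, no gradient steps.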
Yeah exactly, deep learning became the universal hammer some time ago, and I think people forget that it is far more sparsely used and a whole lot less dominant outside fairly specialised domains such as text/image use cases.
Yup, and even in non-CV/NLP tasks where a DL model is a good choice, there's often nothing pre-trained to start from. Tabular datasets vary too much from dataset to dataset.
Yep, for so many NLP and vision tasks it makes sense to use a pretrained model as a preprocessing step and train a couple of dense layers with your own dataset on top of it.
I meant the former (easy, fast, works well very often) but you could do either. If you decide to tune the whole thing, the training cost, time and complexity grow by a lot, and of course you're destroying a lot of information that exists in the model.
Fine tuning the whole thing could potentially take far more work (but could also be worth it - the answer is probably that this depends on the practical use case you've got in mind).
I think almost everyone deploying ML solutions now is using pretrained models. Sometimes there's some fine tuning (especially on image tasks) but it's pretty rare for a ML pipeline to exist without at least starting with existing models.
The intuition for this is that lines, colors, textures, shapes are all general concepts that can be learned from a different domain & used in the earlier layers of the model to build up to more complex features.
Coming to the conclusion that models don't generalize well could be because you are seeking to solve more interesting problems. I don't spend resources on an AI team and have been surprised at how well models do generalize.
One of the critical points here is using pretrained models to generate embeddings that the application-level programmer consumes. This technique isn't common knowledge among programmers without ML experience, but it is key to getting good, generalizable representations that extend to unseen domains.
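A toy sketch of that workflow (the vectors are invented; in practice they'd come from a pretrained encoder such as a sentence-transformers model): embed your documents once, then the application code is just nearest-neighbour lookup, no ML knowledge required.

```python
import numpy as np

# Hypothetical document embeddings from a pretrained encoder.
docs = {
    "refund policy":   np.array([0.7, 0.1, 0.1]),
    "shipping times":  np.array([0.1, 0.8, 0.2]),
    "API rate limits": np.array([0.1, 0.2, 0.9]),
}

def top_match(query_emb, docs):
    # Nearest neighbour by cosine similarity.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(docs, key=lambda k: cos(query_emb, docs[k]))

# Pretend this is the embedding of "can I get my money back?" (made up).
query = np.array([0.6, 0.2, 0.1])
print(top_match(query, docs))  # "refund policy"
```

Everything below the encoder is ordinary application code: store the vectors, compare them, return the best match.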
The "model" is not the hard part! If a pretrained model is generalized enough and valuable enough, it can exist as a simple API or a runtime, and it stops being a "model." If you have the engineering chops to deploy models in production for your application, then deploying a pretrained model is trivial. If it's valuable enough to the business, then squeezing out a few more points by fine-tuning the model is worth it.
A little known feature of Swift/iOS is NLEmbedding (https://developer.apple.com/documentation/naturallanguage/nl...). As the name implies, it is a built-in natural language embedding model that will produce the same embeddings across all iOS devices. I can't tell for sure but I would guess it is some sort of BERT model and probably weighs in at hundreds of megabytes.
Two major issues are generalizability (or lack thereof, sometimes pretrained weights can even be detrimental to your specific tasks) and licensing (pertaining to the data used in the initial training)
Suppose you produce a network whose weights can reproduce the content of the training dataset, as can happen. Are you saying that because it's just weights for a model, it can bypass the license of the data used to produce it and which can be reproduced by it?
If you’re using it at work then license of the training datasets absolutely matter, at least as far as lawyers are concerned (note that I am not a lawyer, I’ve just had to share this information with them in the past)
Because it doesn't apply to my use case, obviously; I'm not doing any of those. But more generally there's usually an "easy win" to be had from doing a little final training on your actual target data. Plus, I've found it more reliable to store and distribute my own models, so at least I'm the one making and shipping the black box. NNs are hard enough to debug without relying on somebody else's as well.
Shameless plug, this is why I built Mighty Inference Server [0]. It wraps ONNX Runtime in a production ready rust binary, and can be installed quickly and painlessly on most machines.
The sentence transformer models in the example notebook from the OP will infer queries in <10ms, which is fast enough for a production customer-facing service, and lets you run on commodity instances without expensive GPUs.
Some of this might be difficult to set up but I don't think that's the reason why people don't use more pretrained models. If you have a working ML environment adding a pretrained model to it is trivial most of the time.
And even once the initial set up is done, ML software is constantly changing/breaking. You're forced to reinstall a new proprietary Nvidia driver, which is then incompatible with the old version of PyTorch the model uses... But that other model requires a newer PyTorch... It's a pain... The fact that there's a dependency on GPU driver versions is ridiculous.
This should apparently improve, as CUDA at least no longer depends on specific driver versions (thank god). I was pleasantly surprised the last time I installed it.
Uhh, conda create a new environment first. Conda literally does everything else for you (including the actual hard part: getting your GPUs to play nice with CUDA and to be recognized by pytorch or tensorflow).
Just make a new conda environment, and throw it away if it gets clobbered.
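For instance (package and channel names follow PyTorch's official conda instructions at the time of writing; adjust the CUDA version to whatever your driver supports):

```shell
# Create an isolated env; conda resolves a CUDA toolkit matching pytorch.
conda create -n ml python=3.10 pytorch torchvision pytorch-cuda=11.8 \
    -c pytorch -c nvidia
conda activate ml

# If the env gets clobbered, throw it away and start over.
conda env remove -n ml
```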
But don't listen to me, I only work for a billion dollar company that does all of its cloud data science using conda, so I must not know what I'm talking about...
I think the point is that conda is great until suddenly it's not because of various compatibility issues. Especially when the errors you get will sound very cryptic to someone who's not used to ML and libraries that use CUDA.
That's at least my experience, I've personally always preferred to not use it at all (N=1).
Also, you sound very insecure, not sure what the company drop at the end is for.