Okay, I'm dead tired of hearing this argument (I've been getting it a lot from reviewers). There are plenty of reasons not to use a pretrained model.
The most obvious one is that your dataset doesn't share a lot of features with the pretrained dataset. This is pretty common in vision tasks where the pretrained set is usually ImageNet. If there isn't significant mutual information in the tasks (e.g. language to vision or vice versa), then you aren't going to transfer the knowledge that well and you've just wasted your time.
Scientific datasets often have this issue: usually you can't collect much data, and even if there is shared knowledge with the pre-training set, it will give you a bias that you don't want or can't use. It may even prevent the model from generalizing. Training from scratch can smooth this out in some cases. Pretraining means you're starting from a different point in the optimization space, and that can pigeonhole you towards certain optima.
The pre-trained models often being suggested are HUGE. Sometimes you might as well write something from scratch (or train a smaller model from scratch) because you just don't have the compute. There are many small models that are highly capable and can be trained from scratch; you can even train transformers on CPUs. Different architectures suit different tasks, so even a random pretrained model that does well on some dataset isn't going to save you. You may also just be wasting significant/costly compute.
So when you can, use a pretrained model. But knowledge transfer isn't always going to help you. Your mileage may vary is all I'm saying, and pretrained models are not the solution to everything.
> The most obvious one is that your dataset doesn't share a lot of features with the pretrained dataset. This is pretty common in vision tasks where the pretrained set is usually ImageNet.
If you've seen a paper that verifies this I'd love to see it. The early layers of the network detecting simple features like lines, curves, textures, shapes, colors, etc could still benefit from what they learn on ImageNet even if the features later in the network are not similar.
FWIW I have not yet seen a model starting from a pre-trained COCO checkpoint that does worse than random initialization.
> The early layers of the network detecting simple features like lines, curves, textures, shapes, colors, etc
This is pretty common, especially with convolutions, but it is not guaranteed. How you embed matters a lot. For example, there are transformers that use early convolutions for the embeddings, and that makes them just work; though with too many, they become less performant (ViT tried pre-ResNets, which wasn't great). [0] also investigates transfer learning in medical domains and shows that CNNs depend more on statistics reuse while transformers depend more on feature reuse.
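To make the embedding choice concrete, here is a rough sketch comparing a single strided-conv "patchify" embedding against a small conv stem of the kind that comment describes (channel sizes and depths are made up for illustration, not taken from any particular paper):

```python
import torch
import torch.nn as nn

# ViT-style patchify: one big strided conv, i.e. a linear projection of
# non-overlapping 16x16 patches.
patchify = nn.Conv2d(3, 192, kernel_size=16, stride=16)

# An "early convolutions" stem: several small 3x3 convs with stride 2.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 48, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(48, 96, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(96, 192, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(192, 192, 3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
# Both produce the same 14x14 grid of 192-d tokens for the transformer,
# but the stem gets there through a stack of small local filters.
print(patchify(x).shape)   # torch.Size([1, 192, 14, 14])
print(conv_stem(x).shape)  # torch.Size([1, 192, 14, 14])
```

Either output can be flattened into a token sequence; the point is that the two embeddings reach the same shape through very different inductive biases.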
> If you've seen a paper that verifies this I'd love to see it.
[0] also discusses this: when it does and doesn't work in medical domains. Basically any paper that discusses transfer learning will also discuss its limitations, but note that there is a bias towards results that work. [1] also shows some of these results, where ImageNet pretraining both helps and doesn't (and references others doing the same). Note in Figures 2 and 3 how InceptionV{3,4} and MNASNet have higher performance without pretraining (Fig 4 is a summary). This shows part of what I was saying: it isn't always about the dataset either. You have a coupled problem that is hard to disentangle. There are also plenty of papers that try to claim LLMs are good at learning vision classification and never get past 50-60% accuracy (or worse) on ImageNet. Lots of scientific papers will also just straight up train from scratch and not mention transfer learning because it simply didn't work for them, but you'd need to physically talk to these people as it isn't in their papers.
> FWIW I have not yet seen a model starting from a pre-trained COCO checkpoint that does worse than random initialization.
Additionally, ImageNet performance doesn't correlate 1-to-1 with how well a model works as a backbone for object detection and segmentation.
As another note, I would often say to be careful with pretrained models. The vast majority of papers are using test accuracy to hyper-parameter tune. So you're leaking knowledge into your model. I think this is mostly caused by reviewer benchmarkism (desk reject if you aren't SOTA) so bad practices become standard.
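The discipline that benchmarkism encourages people to skip can be sketched in a few lines: tune hyper-parameters against a validation split, and touch the held-out test split exactly once at the end. The `evaluate` function below is a stand-in for "train with these settings and return validation accuracy", not a real training loop:

```python
import random

def evaluate(train_data, val_data, lr):
    """Stand-in for: fit a model with this lr, return validation accuracy."""
    # Purely illustrative score; a real pipeline would fit and score a model.
    return 1.0 - abs(lr - 0.01)

data = list(range(100))  # stand-in for a labeled dataset
random.seed(0)
random.shuffle(data)

# Three disjoint splits. The test split is never seen during tuning.
train, val, test = data[:70], data[70:85], data[85:]

# Hyper-parameter search looks only at the validation split...
best_lr = max([0.1, 0.01, 0.001], key=lambda lr: evaluate(train, val, lr))
print(best_lr)  # 0.01 with this toy scoring function

# ...and `test` is evaluated once, after all tuning decisions are frozen.
```

Tuning on test accuracy collapses the last two splits into one, which is exactly the leak being described.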
Yeah, I do want to make clear that transfer learning frequently works. I think my initial comment probably comes off too strong (just reeling from terrible and unproductive reviewers who are reject-happy). But there is a common belief that you never need to train from scratch, and that's what I'm really trying to counter.
That medium-sized dry objects in photographs and satellite imagery (for example) have little in common strikes me as self-evident. Separating "lines", "colours", etc. at the most abstract level is a trivial task. It is not trivial at the semantic level, where the network is supposed to be useful, and where the overlap in mutual information makes all the difference.
People don't understand ML and AI well enough to do this, and don't realize a surface understanding of the math is all that is needed to create cool tools like this.
I ask you kindly to stop showing people, so I can keep feeling smart for knowing discrete mathematics /lh
If it works, then great. If it doesn't, it's difficult to know why, and even more difficult to fix it. The fix might involve retraining with better data, retraining with a different architecture, regularisation, and endless unknown knobs to tune.
CLIP is one of the most powerful and underutilized pretrained models.
More flexible than any other pretrained image recognition model (that I'm aware of). Labels are given in plain English, and can be changed without data annotation or retraining.
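Under the hood, CLIP-style zero-shot classification is just cosine similarity between one image embedding and the embeddings of candidate captions, which is why changing labels needs no retraining. A minimal sketch with made-up vectors (in a real pipeline these would come from CLIP's image and text encoders, e.g. via the `open_clip` package):

```python
import numpy as np

# Hypothetical stand-ins for CLIP embeddings; the numbers are invented.
image_emb = np.array([0.9, 0.1, 0.2])
label_embs = {
    "a photo of a cat":  np.array([0.8, 0.2, 0.1]),
    "a photo of a dog":  np.array([0.1, 0.9, 0.3]),
    "a satellite image": np.array([0.2, 0.1, 0.95]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Zero-shot classification: pick the caption whose embedding is closest.
scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
best = max(scores, key=scores.get)
print(best)  # "a photo of a cat" for these made-up vectors
```

Swapping in a new label set is just a matter of embedding new caption strings; no annotation, no gradient steps.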
Yeah exactly, deep learning became the universal hammer some time ago, and I think people forget that it is far more sparsely used and a whole lot less dominant outside fairly specialised domains such as text/image use cases.
Yup, and even in non-CV/NLP tasks where a DL model is a good choice, there's often nothing pre-trained to start from. Tabular datasets vary too much from dataset to dataset.
Yep, for so many NLP and vision tasks it makes sense to use a pretrained model as a preprocessing step and train a couple of dense layers with your own dataset on top of it.
I meant the former (easy, fast, works well very often) but you could do either. If you decide to tune the whole thing, the training cost, time and complexity grow by a lot, and of course you're destroying a lot of information that exists in the model.
Fine tuning the whole thing could potentially take far more work (but could also be worth it - the answer is probably that this depends on the practical use case you've got in mind).
I think almost everyone deploying ML solutions now is using pretrained models. Sometimes there's some fine tuning (especially on image tasks) but it's pretty rare for a ML pipeline to exist without at least starting with existing models.
The intuition for this is that lines, colors, textures, shapes are all general concepts that can be learned from a different domain & used in the earlier layers of the model to build up to more complex features.
Coming to the conclusion that models don't generalize well could be because you are seeking to solve more interesting problems. I don't spend resources on an AI team and have been surprised at how well models do generalize.
One of the critical points here is using pretrained models to generate embeddings that the application-level programmer consumes. This technique isn't common knowledge among programmers without ML experience, but it is key to getting good, generalizable representations that extend to unseen domains.
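A toy sketch of that workflow (the vectors are invented; in practice they'd come from a pretrained encoder such as a sentence-transformers model): embed your documents once, then the application code is just nearest-neighbour lookup, no ML knowledge required.

```python
import numpy as np

# Hypothetical document embeddings from a pretrained encoder.
docs = {
    "refund policy":   np.array([0.7, 0.1, 0.1]),
    "shipping times":  np.array([0.1, 0.8, 0.2]),
    "API rate limits": np.array([0.1, 0.2, 0.9]),
}

def top_match(query_emb, docs):
    # Nearest neighbour by cosine similarity.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(docs, key=lambda k: cos(query_emb, docs[k]))

# Pretend this is the embedding of "can I get my money back?" (made up).
query = np.array([0.6, 0.2, 0.1])
print(top_match(query, docs))  # "refund policy"
```

Everything below the encoder is ordinary application code: store the vectors, compare them, return the best match.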
The "model" is not the hard part! If a pretrained model is generalized enough and valuable enough, it can exist as a simple API or a runtime, and it stops being a "model." If you have the engineering chops to deploy models in production for your application, then deploying a pretrained model is trivial. If it's valuable enough to the business, then squeezing out a few more points by fine-tuning the model is worth it.
A little known feature of Swift/iOS is NLEmbedding (https://developer.apple.com/documentation/naturallanguage/nl...). As the name implies, it is a built-in natural language embedding model that will produce the same embeddings across all iOS devices. I can't tell for sure but I would guess it is some sort of BERT model and probably weighs in at hundreds of megabytes.
Two major issues are generalizability (or lack thereof, sometimes pretrained weights can even be detrimental to your specific tasks) and licensing (pertaining to the data used in the initial training)
Suppose you produce a network whose weights can reproduce the content of the training dataset, as can happen. Are you saying that because it's just weights for a model, it can bypass the license of the data used to produce it and which can be reproduced by it?
If you’re using it at work then license of the training datasets absolutely matter, at least as far as lawyers are concerned (note that I am not a lawyer, I’ve just had to share this information with them in the past)
Because it doesn't apply to my use case, obviously; I'm not doing any of those. But more generally there's usually an "easy win" to be had from doing a little final training on your actual target data. Plus, I've found it more reliable to store and distribute my own models, so at least I'm the one making and shipping the black box. NNs are hard enough to debug without relying on somebody else's as well.
Shameless plug, this is why I built Mighty Inference Server [0]. It wraps ONNX Runtime in a production ready rust binary, and can be installed quickly and painlessly on most machines.
The sentence transformer models in the example notebook from the OP will infer queries in <10ms, which is fast enough for a production customer-facing service, and lets you run on commodity instances without expensive GPUs.
Some of this might be difficult to set up but I don't think that's the reason why people don't use more pretrained models. If you have a working ML environment adding a pretrained model to it is trivial most of the time.
And even once the initial set up is done, ML software is constantly changing/breaking. You're forced to reinstall a new proprietary Nvidia driver, which is then incompatible with the old version of PyTorch the model uses... But that other model requires a newer PyTorch... It's a pain... The fact that there's a dependency on GPU driver versions is ridiculous.
This should apparently improve, as CUDA at least no longer depends on specific driver versions (thank god). I was pleasantly surprised the last time I installed it.
Uhh, conda create a new environment first. Conda literally does everything else for you (including the actual hard part: getting your GPUs to play nice with CUDA and to be recognized by pytorch or tensorflow).
Just make a new conda environment, and throw it away if it gets clobbered.
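For instance (package and channel names follow PyTorch's official conda instructions at the time of writing; adjust the CUDA version to whatever your driver supports):

```shell
# Create an isolated env; conda resolves a CUDA toolkit matching pytorch.
conda create -n ml python=3.10 pytorch torchvision pytorch-cuda=11.8 \
    -c pytorch -c nvidia
conda activate ml

# If the env gets clobbered, throw it away and start over.
conda env remove -n ml
```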
But don't listen to me, I only work for a billion dollar company that does all of its cloud data science using conda, so I must not know what I'm talking about...
I think the point is that conda is great until suddenly it's not because of various compatibility issues. Especially when the errors you get will sound very cryptic to someone who's not used to ML and libraries that use CUDA.
That's at least my experience, I've personally always preferred to not use it at all (N=1).
Also, you sound very insecure, not sure what the company drop at the end is for.