I agree with this statement, but that home is definitely not with your code. There are tools better suited for that: Nexus and Maven, for instance.
Nexus stores artifacts (which are really any piece of data: a jar, a tar.gz, a zip file); that's where you put your dependencies. Maven manages a project's build; that's where you declare what your dependencies are.
But really, there are many other tools out there to handle that (Ant, NAnt, PyPI, etc.).
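For reference, the Maven half of that is just a declaration in your pom.xml; Maven then resolves the artifact from whatever repository you've configured (e.g. your Nexus instance). The coordinates below are only an example:

```xml
<!-- pom.xml fragment: declare the dependency; Maven fetches it from the
     configured repository at build time, so it never lives in your repo -->
<dependencies>
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.1</version>
  </dependency>
</dependencies>
```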
I agree with you - the "not invented here" libs don't belong in your repo. I'm always amazed when I walk up to a "mature" codebase and find third-party and open source dependencies festering all over the repository. This drives me crazy. There is just NO reason this stuff needs to be in source control. It's pure laziness that drives developers to punt and check in whatever version they happened to be using during dev. This makes final packaging, as well as future upgrading, a nightmare (particularly if other projects have glommed onto the same dependency in the meantime).
The lowest-budget solution is simply to store the third-party and open source projects on a reliable network file share (NFS, Samba, etc.). Then you can either sync locally or just build/link directly against the mounted share.
It's also important to keep each project in a directory structure that includes the version number in the path (e.g. /nih_libs/boost/1.44.1/...) so that you can easily drop in a new version and start using it as needed on a project-by-project basis. I'm always amazed how many places neglect this step and then have nothing but pain when they want to upgrade to a new version of a lib.
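A sketch of that versioned layout (all paths here are made up):

```shell
# Stand-in for the network share; each lib version lives side by side,
# so projects opt in to upgrades independently.
root=$(mktemp -d)
mkdir -p "$root/nih_libs/boost/1.44.1" "$root/nih_libs/boost/1.46.0"
# A project then pins the exact version it has qualified, e.g.:
#   g++ -I"$root/nih_libs/boost/1.46.0/include" ...
ls "$root/nih_libs/boost"
```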
I still don't understand why dependencies don't belong in the repo. My naive reasoning: I just have to clone the repo and bam, I already have all the dependencies needed to build the artifact. When I need to upgrade to a new version of a library, I only have to commit the new version and delete the old one (possibly integrating the new library in a separate branch if the process is not trivial).
If you use something like Git submodules, Mercurial subrepos, or Subversion externals, you can get the best of both worlds. Your repo contains just your code, but a fresh clone will set up dependencies automatically.
I've also seen really simple projects get by with just an "install_deps" task in the Makefile, which you run first thing on a new clone. ("The simplicity of Maven meets the dependency management of Make," the wags will say.)
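Something like this, say (the target body is invented; a real one would curl or git-clone pinned versions rather than touch(1)):

```shell
dir=$(mktemp -d); cd "$dir"
# printf keeps the hard tab that make requires in front of recipe lines.
printf 'install_deps:\n\tmkdir -p deps\n\ttouch deps/libfoo-1.2.tar.gz\n' > Makefile
make install_deps                 # first thing you run on a new clone
ls deps
```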
Sure, this is a solution. But my confusion remains: why shouldn't I put all these dependencies in the repository with the code that uses them? In the end, doing so gives me (almost) everything I need to produce the intended software artifact.
There are a few reasons to keep them separate. One is to at least nominally isolate code that's licensed differently from yours (something your lawyers may ask you to do). Another is to make it easier to share your tweaks to the third-party code among several of your own projects.
However, yes, checking in dependencies with your code is the way to go.
Deps that can be reliably included with just a meta-descriptor (e.g. a Gemfile, a pom.xml, etc.) are still exceptionally rare. Thus the natural choice is to put the rest right next to the code they belong to. Well, the natural choice unless you're stuck in a 90s Java mindset...
If you want to do something simple, it takes a 60-line pom.xml to express it. If you want to do something a little trickier, it takes another 60 lines and a minimum of three hours poking around in inadequately written docs. If you want to do something difficult, forget it. Ant, which is a usability nightmare and still makes me wake up screaming in the middle of the night, is at least as complex to configure, but it's better documented and not a quarter as constrained as Maven.
That said, the Maven dependency model isn't too bad. Fortunately you don't have to adopt Maven to get it; you can just use Ivy.
You've got that backwards. The "90s java mindset" is the lib dir. The 2011 approach is Maven, as evidenced by the fact that 100% of my clients use it, and practically every major open source project uses it as well.
Agreed. It's a problem, but VC just isn't built for dependency management.
Since the histories of the dependencies don't get tracked, merging them can be tricky. Especially when dependencies get upgraded separately by multiple people.
This is a legitimate question. But what if you need to support different versions of open source packages, and you need to be sure that they will still exist years later? This is something I'm facing now: I may be using numpy 1.2.x, but suppose someone else is using numpy 1.3.x and we all need to play well together in a distributed environment; the context needs to be stable. If I don't put this in source control, then where's a good place to stash it so that I can reconstruct the various contexts when I have to migrate to a different server?
I disagree. Even back in 1995 it had fatal flaws. More than once it got so confused that we had to restore the source tree from backup. There was also the continual problem of files being locked by others when they should not have been, hampering progress and adding complexity. I was glad when we could kick it out the door.
I'm pretty lucky that I came to version control after fast, reliable, and easy-to-use tools like git and hg were created. Every time I have to interact with a CVS repository I'm like "wow, people actually used to use this every day".
This is wrong. One should be able to commit every little meaningful change he has made, with an equally meaningful commit message. If you wait until everything works, you'll end up with huge, meaningless, conflicting commits. Of course, this is quite annoying if you use a centralized SCM, since everything you commit becomes public. Well, that is actually why you should not use a centralized SCM.
> One should be able to commit every little meaningful change he has made, with an equally meaningful commit message.
If your commit is broken, it's not meaningful, it's just broken.
> If you wait until everything works you'll end with huge meaningless conflicting commits.
Reading comprehension, please. b0sk talked about commits which cannot be built. It's a very clear and simple requirement, and it definitely does not mean every commit has to be feature-complete.
> Of course it is quite annoying if you use a centralized SCM since everything you commit becomes public. Well this is actually why you should not use centralized SCM.
There really is no relation, and a DVCS will not save you when an axe-murderer with a short temper tries to bisect a bug, and you break his bisection because your commits made the whole project unbuildable.
Let's try this again. Speaking as a git user, my workflow is a little different.
> If your commit is broken, it's not meaningful, it's just broken.
Sometimes I make broken commits just to have something for the written record. Sometimes I rebase them away. Sometimes I ask friends to pull from a broken commit (gasp!) when I need their help to fix my bug. Oftentimes I 'git commit --amend' to fix broken commits before I push. In any case, there's no reason to artificially hide these mistakes as long as the result works.
> There really is no relation, and a DVCS will not save you when an axe-murderer with a short temper tries to bisect a bug, and you break his bisection because your commits made the whole project un-buildable.
In my workflow, I usually use a 'master' branch and 'topic' branches. The master must always build, as you say. Topic branches don't -- they're experimental by definition. When a topic is ready to go, we rebase and clean up the commits. This way, we get a traceable, ever-buildable source tree from master and experimental branches when we want them.
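A condensed version of that workflow (throwaway repo, invented messages; a non-interactive squash merge stands in here for the interactive rebase/clean-up step):

```shell
set -e
repo=$(mktemp -d); cd "$repo"; git init -q
git config user.email a@b; git config user.name demo
git commit -q --allow-empty -m "master: always builds"
git checkout -q -b topic                 # experimental branch: may be broken
echo wip > feature; git add feature
git commit -q -m "WIP: checkpoint, does not build"
echo done > feature; git add feature
git commit -q -m "WIP: builds again"
git checkout -q -                        # back to the stable branch
git merge -q --squash topic              # fold the topic into one clean commit
git commit -q -m "feature: add feature"
git log --oneline
```

The stable branch ends up with a single buildable commit; the messy intermediate history stays on the topic branch.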
With modern technologies this is definitely a no-no.
With DVCS you can create your own branch and push it to other systems super easily, so there's little excuse for checking in broken code to a shared branch. Also, with a lot of newer VCSes there's often a feature to "shelve" a changeset to a central DB without committing it.
Not to mention that breaking the build for more than a few minutes raises some serious red flags about the way you're working.
I agree that breaking the build is normally a very bad idea. I've gone after friends with Nerf weapons for doing this. :-)
But there are a few cases where it's a necessary evil. For example, when upgrading a large Rails application from Rails 2.3 to 3.0, you're likely to make hundreds of small changes before everything works again.
In this case, I create a new branch, prepend "BROKEN:" to each commit message, and record the number of unit tests that are currently passing. Once all the tests are fixed, I hand-test, add new tests for any regressions that weren't caught by the automatic tests, and merge back to the main development branch.
I disagree. Get your code updates in there when leaving for the evening/weekend. But, of course, you'll do it in your own project branch that doesn't affect other people.
Even then, I usually comment out partial code then slap on a TODO. And where it makes sense, I'll also add in a "throw NotImplementedException" or some equivalent, so it's explicit that the code isn't expected behavior when executed. It's usually not much effort to do this. Especially when you plan ahead while you write.
There are always repercussions to checking in unbuildable code. Suddenly nobody else can contribute to or pull from that branch without first fixing your bad code, for one. What's worse is when they think it's a mistake and tweak your code themselves; you'll have to backtrack then, adding noise to the file's history. Another is when you need to roll back: if you allow broken commits, there's always a chance of hitting a broken version. In times of emergency, you really don't want to be bogged down by unbuildable code.
I had to dig for it, but he references another article he wrote [1] which says:
There is never a reason to use source control to version your data. This will be painfully obvious to most people, but I've seen it done before, and more than once too. Source control management exists to version, um, source code...
This seems to assume that everyone is committing into the same branch. How about the workflow where you create a new branch for each work package and merge it into the main branch when finished? It lets you check in to your own branch as often as you want (even broken code) without worrying about breaking anything for the other devs. The merges will be bigger though, days or weeks of work. A good point is that the branch merge is a natural time to look through all the diffs.
Compilation output does not belong in source control
I've seen this before, but never found a suitable alternative. Where does it belong? Suppose multiple developers are compiling a C++ .dll which testers are grabbing through websvn and testing. To track down crashes they get, we need the associated .pdb for the right revision. Where should these files be kept? In a plain folder where each revision gets its own subfolder named as the revision number? That means updating two separate locations with each build, using two different interfaces...
> Suppose multiple developers are compiling a C++ .dll which testers are grabbing through websvn and testing.
The DLL is compilation output; it doesn't belong in source control.
> To track down crashes they get, we need the associated .pdb for the right revision. Where should these files be kept?
To track down crashes they need the DLL to start with.
Testers should either have the ability to build the project on their own machines, or they should be able to grab the output from the CI server (just go to the CI server, open the latest good (compiled, tested, green) revision, grab the files from there, and test that).
> The DLL is a compilation output, it's not to be in source control.
Mmm... I've been working in a medium-sized group project, and we've been having pretty good luck with source controlling our driver .lib file. The intermediate object files are discarded, but it's really nice not needing to recompile the library every time somebody updates the library code.
(This is a 'project' that produces a library rather than an executable, for inclusion into other 'projects')
>Where should these files be kept? In a plain folder where each revision gets its own subfolder named as the revision number?
Yes. Although revision-number naming would be problematic if you're using git, as it uses SHA1s to identify revisions; naming them with date.branch.author.sha1 might be better.
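A name in roughly that shape can be generated straight from git (the repo below is a throwaway demo; the format itself is just the suggestion above):

```shell
set -e
repo=$(mktemp -d); cd "$repo"; git init -q
git config user.email a@b; git config user.name alice
git commit -q --allow-empty -m "build me"
d=$(git log -1 --date=short --format=%ad)   # commit date
b=$(git rev-parse --abbrev-ref HEAD)        # branch name
a=$(git log -1 --format=%an)                # author
s=$(git rev-parse --short HEAD)             # abbreviated SHA1
echo "$d.$b.$a.$s"                          # the artifact folder name
```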
> That means updating two separate locations with each build, using two different interfaces...
It just means having your VCS do the build and create the directory in a post-commit script.
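As a toy illustration (git-flavored, since hook mechanics differ per VCS, and every path is made up), a post-commit hook can "build" and file the output under the new revision's id:

```shell
set -e
repo=$(mktemp -d); cd "$repo"; git init -q
git config user.email a@b; git config user.name demo
share=$(mktemp -d)                 # stand-in for the artifact file share
cat > .git/hooks/post-commit <<EOF
#!/bin/sh
# Toy hook: "build", then drop the output in a per-revision folder.
rev=\$(git rev-parse --short HEAD)
mkdir -p "$share/\$rev"
echo "pretend this is app.dll and app.pdb" > "$share/\$rev/app.dll"
EOF
chmod +x .git/hooks/post-commit
echo 'int main(){return 0;}' > main.c
git add main.c; git commit -q -m "build me"
ls "$share"                        # one folder per committed revision
```

So devs commit once; the per-revision artifact folder appears without anyone touching a second interface.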
Thanks for the ideas. However, in this case the DLL relies on MSVC for binary compatibility with its host, and the version control server is a *nix box.
There are tools to solve this problem. And these tools are not source version control.
In my company we set up the following workflow:
-> commit code
-> code is built and tested using Jenkins
-> generated artifacts are stored on Nexus
On top of that, daily we deploy everything to an internal PyPI (we do mostly Python). Deploying the code to production is then not much more than running a script that easy_installs everything from this internal PyPI...
A CI server solves this issue. We have it set up so that whenever someone commits, the CI server checks it out, builds, runs the unit tests, and then sends nag e-mails to whoever may have inadvertently broken the build. This also gets rid of the "works on my machine" excuse.
"6. You must commit your own changes - you can’t delegate it"
My company does this, and I'm not entirely sure how I feel about it. The motivation for delegating changes is that different groups in our company have different check-in privileges. This results in half of my changes being committed by me and half being committed by someone else, which, of course, presents some coordination problems.
"subsequent commit messages from the same author should never be identical"
I violate this one sometimes. I'm not a VCS magician, so sometimes I get commit errors and wind up having to make a second commit. I figure identical commit messages make it pretty obvious the two commits were intended to be one.
The most useful thing that git has given my workflow is `git stash`. I almost never find myself branching and absolutely hate having to push, pull and pick changes between separate branches.
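`git stash` in thirty seconds (throwaway repo; file contents invented):

```shell
set -e
repo=$(mktemp -d); cd "$repo"; git init -q
git config user.email a@b; git config user.name demo
echo v1 > file; git add file; git commit -q -m "clean state"
echo half-done > file              # uncommitted work in progress
git stash push -q                  # shelve it; the working tree is clean again
cat file                           # v1
git stash pop -q                   # bring the work back
cat file                           # half-done
```

Handy for pulling or switching contexts without committing half-finished work anywhere.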
Git Immersion has been my friend on this. I went through the set of labs several times (like katas) in order to feel minimally competent to use Git. Lab 19 deals with amend:
I know this is a tangent and that this story isn't about dependency management, but it's important to me and I've spent a good amount of time trying to understand it completely.
I spent what I thought to be a generous amount of time trying to wrap my head around Ivy and I just couldn't make it work the way I thought it should work. I use Maven, and while I'm not a Maven...er...maven, I can work with it well enough and I'm certainly adept with its dependency management mechanisms. Ivy just seems to have way too many moving pieces and their documentation always seems to be missing key pieces of information that link theory to example. I'm left with an incomplete understanding of how Ivy works and, as a result, I can't use it effectively.