Git is an excellent way to collaborate, but it manages data differently from other version control systems, especially Subversion. In Git you have to clone the entire repository before you can make a change. In contrast, Subversion (and CVS) let you check out just a subtree of the repo.
Below is a set of recommendations on how to use Git (and potentially GitHub/Bitbucket/Gitorious) to collaborate with others and manage your research repositories.
The key is to think of one repository for each major collaboration task. My suggestion is that every paper should have three repositories:
- The actual paper, including all the figures that it needs
- A data repository, where the data collection and analysis scripts are held
- A replication package. This repo will contain the data that will be shared for replication
I also keep a repo for each of the students. This repo is used before deciding on a venue as the means to exchange information.
Depending on your constraints you might want to choose one of the main Git hosting sites, such as GitHub or Bitbucket. You can also keep the master repositories on a machine accessible via ssh, but administration becomes more difficult. Specifically, consider the following:
- Ease of browsing repos and their content
- Permissions per repo
- Do different repos have different people who can modify/browse them?
- Should different people be able to administer repos (rename, create, delete)?
- Are you ok with a service hosting your repos?
- Would the other co-authors have access to the repo?
Git hosting sites support “organizations”. I recommend creating an organization for the lab, where all the expected papers are to be hosted (eventually) and archived. Every hosting service has different constraints with respect to permissions, number of repos you can have, etc.
I also recommend applying for an academic account for your lab with the hosting service of your preference. Students should still apply for a student account as well (e.g., it allows them to have unlimited private repos).
See above. I recommend you explore how each hosting site handles permissions. I believe Bitbucket is a bit better at this than GitHub, but things might have changed recently (in the last year).
Here are some things you should consider:
- You will have many repos
- You will probably try to find the repos by conference/journal, title and year. I strongly recommend including these three “tokens” in the name of a repo
I personally prefer to name my repos as YYYY_conference-name_title-of-the-paper<type>. The <type> can be empty (for the paper itself), _data, or _replication. This way the paper appears first, and the other two repos after.
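The naming scheme above can be sketched as a small helper. This is only an illustration: the function names, the slug rules (lowercase, hyphen-separated), the suffixes (empty for the paper, `_data`, `_replication`) and the example venue and title are assumptions, not part of any hosting site's API.

```python
import re

# Assumed suffixes for the three repos of one paper; the paper itself has none.
REPO_TYPES = {"paper": "", "data": "_data", "replication": "_replication"}


def slugify(text: str) -> str:
    """Lowercase the text and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", text.lower()))


def repo_name(year: int, venue: str, title: str, kind: str = "paper") -> str:
    """Build a name of the form YYYY_conference-name_title-of-the-paper<type>."""
    return f"{year:04d}_{slugify(venue)}_{slugify(title)}{REPO_TYPES[kind]}"


if __name__ == "__main__":
    # Hypothetical paper, used only to show the resulting names.
    for kind in REPO_TYPES:
        print(repo_name(2015, "MSR", "Mining Commit Messages", kind))
```

Because the paper repo's name is a prefix of the other two, a lexicographical listing shows the paper first, followed by the data and replication repos.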
I let my students start the repos. Once we move to a collaborative mode, I might move the repo into the organization account (Git hosting sites let you change ownership and move repositories across accounts very easily). Typically, it is at that moment that the moved repo is renamed.
Authorship (who and in what order) should be discussed early on, as it dictates the expectations and responsibilities of the authors. The first author is expected to lead the paper and “push” it toward publication. However, all authors hold a responsibility. For example,
Anyone listed as Author on an ACM paper must meet all the following criteria:
- they have made substantial intellectual contributions to some components of the original work described in the paper; and
- they have participated in drafting and/or revision of the paper; and
- they are aware that the paper has been submitted for publication; and
- they agree to be held accountable for any issues relating to correctness or integrity of the work.
All data should be tidy. Data should be saved by default in a transparent, open format like CSV (some exceptions may exist).
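As an illustration, and assuming “tidy” is meant in the usual sense of one observation per row and one variable per column, the sketch below reshapes a wide table into that form and serializes it as CSV. The table contents and the column names (`project`, `year`, `commits`) are invented for the example.

```python
import csv
import io

# Hypothetical wide table: one row per project, one column per year.
wide_rows = [
    {"project": "alpha", "2014": 120, "2015": 150},
    {"project": "beta", "2014": 80, "2015": 95},
]


def to_tidy(rows, id_field, variable, value):
    """Turn every non-id cell of every row into its own record."""
    tidy = []
    for row in rows:
        for column, cell in row.items():
            if column != id_field:
                tidy.append({id_field: row[id_field], variable: column, value: cell})
    return tidy


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


tidy_rows = to_tidy(wide_rows, "project", "year", "commits")
print(to_csv(tidy_rows))
```

The resulting file has one line per (project, year) pair, which is easy to load in R or any other tool and produces readable diffs in Git.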
We use LaTeX to write papers, with bibliographies in BibTeX. Typically, we do the analysis with R and save the scripts to allow for replication later. Diomidis Spinellis provides great advice on writing LaTeX documents.
When starting to write, consider using Google Docs first to flesh out the outline and make sure that the relevant collaborators have agreed on some aspects of it before you turn it into a full document. Google Docs allows brainstorming and easy inline commenting, which can save a lot of time later and lead to a more organized manuscript.
If you are revising a paper following a peer review process, the first thing you should do is paste the reviews into a GitHub issue and break them up into specific and actionable comments. These comments serve as a “to do list” for the revision.
Different Git hosting sites show the repos you have in an account in different ways. GitHub lists them in order of activity (the most recently modified first). Bitbucket lists them in lexicographical order. Either way, after a while you will end up with lots of repositories.
If you are worried about this, I recommend creating an archival area. Once the paper is published, move the repos from the collaborative area to the archival one. Note that this might break the permissions of others who want to access the repo. Alternatively, you can organize these in “ongoing papers” and “archived papers” locally without affecting the co-authors (e.g., by moving and renaming the repo's main folder).
When you prepare the camera-ready version of the paper, you can make the replication package public. This is simply flipping a switch, which I find convenient. I also recommend that, once you are sure about the replication data, you:
- create a tag (e.g. published)
- create a zip file (you can use GitHub to do this)
- test files for completeness
- submit the file to Zenodo (http://zenodo.org) for long-term archival
- add a link to both locations in your paper
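The first three steps of the list above (tag, zip, completeness test) can be scripted. The following is a minimal sketch that shells out to `git tag` and `git archive`; the function names, the tag message, the archive file name and the idea of passing an explicit list of expected files are assumptions made for illustration. Uploading to Zenodo and adding the links to the paper remain manual.

```python
import subprocess
import zipfile
from pathlib import Path


def release_commands(tag: str, archive: Path) -> list[list[str]]:
    """Return the Git commands that tag HEAD and export that tag as a zip file."""
    return [
        ["git", "tag", "-a", tag, "-m", f"Replication package ({tag})"],
        ["git", "archive", "--format=zip", f"--output={archive}", tag],
    ]


def missing_files(archive: Path, expected: list[str]) -> list[str]:
    """List the expected paths that are not present in the zip file."""
    with zipfile.ZipFile(archive) as bundle:
        present = set(bundle.namelist())
    return [path for path in expected if path not in present]


def release(repo: Path, expected: list[str], tag: str = "published") -> Path:
    """Tag the repo, write the zip next to it, and check its contents."""
    repo = repo.resolve()
    archive = repo.parent / f"{repo.name}_{tag}.zip"
    for command in release_commands(tag, archive):
        subprocess.run(command, cwd=repo, check=True)
    missing = missing_files(archive, expected)
    if missing:
        raise FileNotFoundError(f"missing from {archive.name}: {missing}")
    return archive
```

Note that `git archive` exports only the files tracked at that tag, so anything left uncommitted will be reported by the completeness check rather than silently shipped. Remember to push the tag (`git push origin published`) so that the hosted repo has it too.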
You can automate (i.e., script) the synchronization of local repos to the outside repos.
This script, for example, will back up all the repos of an organization. You will have to create an ssh account without a password that can read all the repos, so be careful about how you store these credentials.
https://gist.github.com/rodw/3073987
You can also use mr. It is a great package for working with a large number of repos, but it requires that you keep its config file up to date. mr is most useful for users who want to keep a large number of repos synced on their computers.
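If you would rather write your own synchronization script, a minimal sketch follows. It assumes you already have the list of clone URLs (obtaining that list from the hosting site is not shown) and that the machine running it has read access to every repo. The function names and the example URL are hypothetical.

```python
import re
import subprocess
from pathlib import Path


def mirror_command(url: str, backup_dir: Path) -> tuple[list[str], Path]:
    """Pick the Git command (and working directory) that creates or refreshes a mirror."""
    name = re.split(r"[/:]", url.rstrip("/"))[-1]
    if not name.endswith(".git"):
        name += ".git"
    target = backup_dir / name
    if target.exists():
        # The mirror already exists: fetch all refs and drop deleted ones.
        return ["git", "remote", "update", "--prune"], target
    # First run: create a bare mirror clone holding every branch and tag.
    return ["git", "clone", "--mirror", url, str(target)], backup_dir


def backup(urls: list[str], backup_dir: Path) -> None:
    """Mirror every repo in urls into backup_dir."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    for url in urls:
        command, cwd = mirror_command(url, backup_dir)
        subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    # Hypothetical URL; replace with the repos of your organization.
    backup(["git@example.org:my-lab/2015_msr_paper.git"], Path("backups"))
```

Running this from a scheduled job keeps a local mirror of each repo. One caveat: because `--prune` follows deletions on the remote, the mirror does not protect against a branch being deleted upstream, so keep dated copies of the backup directory if that matters to you.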
- If you commit a large file, even if you remove it later, the file is still in the history of the project and will continue to take up space.
- The same happens if you commit sensitive information. Even if you remove the file, its history will allow its recovery by anybody who can clone the repo.
If you commit any of these files, use the BFG Repo-Cleaner (https://rtyley.github.io/bfg-repo-cleaner/) to scrub them from the history.
- Do not store large files in Git. They make it a pain for others to clone and synchronize the repo, because version control was not designed for large files.
- If you want to keep a backup of such a file, compress it and commit it to the “data” repo of the paper.
- Anybody with write access can rewrite the history of a repo. It is possible that somebody might completely delete all the history of a repo and all its files. But if you have a copy of the repo, you have a full backup of it.
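Since a large file is hard to remove once it is in the history, it helps to look for such files before committing. The sketch below walks the working tree and reports files above a size limit. The function name and the 10 MB threshold are arbitrary assumptions; pick a limit that suits your hosting site.

```python
import os
from pathlib import Path


def large_files(root: Path, limit_bytes: int) -> list[tuple[str, int]]:
    """Return (relative path, size) for files over the limit, largest first."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own object store; only the working tree matters here.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for filename in filenames:
            path = Path(dirpath) / filename
            size = path.lstat().st_size  # lstat: do not follow symlinks
            if size > limit_bytes:
                found.append((path.relative_to(root).as_posix(), size))
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Assumed threshold of 10 MB.
    for name, size in large_files(Path("."), 10 * 1024 * 1024):
        print(f"{size / 1_048_576:8.1f} MB  {name}")
```

Run it from the top of the repo before `git add`; anything it lists is a candidate for compression, for the data repo, or for storage outside Git.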