Checking authors.txt before migrating to GitHub

Ensuring that contributors are correctly recognised for their work is a cornerstone of the free and open source software community. Here I present a convenient script to help.

When projects migrate to git (and, these days, usually to GitHub), there is often so much enthusiasm to get up and running that the authors.txt file is looked at only briefly, or not at all.

GitHub does a great job of summarising the free software contributions of each member. To do this, it cross-references the email address recorded on each commit with the email addresses registered to each account. This possibility is not unique to GitHub though: as git is a distributed version control system, it is quite possible that other services will seek to mirror and report on code contributions in the future.

authors.txt can be tedious

For a large project with a long history, there may be more than 100 committers, many of them identified only by SVN user IDs that are unique to the project and not easily mapped to the user's email address.

If the project is hosted on Google Code SVN, then Google often makes dummy gmail.com accounts for the committers, and many people don't use those accounts for anything else, not even for email. Including such an address in a git repository is a bad idea if it is not the person's preferred email address. If they don't have that address associated with their GitHub account, then their commits won't be easily attributed to them at all.
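
To get a feel for how many entries are affected, one option is to list the lines of authors.txt that still carry a gmail.com address. This is only a sketch: it assumes the usual "svnuser = Full Name <email>" line format, and the function name list_gmail_entries is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper, not part of the script discussed in this post.
# Prints every authors.txt line whose address ends in @gmail.com.
# Assumes lines of the form: svnuser = Full Name <user@gmail.com>
list_gmail_entries() {
  # "|| true" keeps a no-match result from aborting callers that use "set -e"
  grep -i -E '<[^<>@]+@gmail\.com>' "$1" || true
}
```

Calling list_gmail_entries authors.txt prints the candidates; each one still needs a human to decide whether the address is really the person's preferred one.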

Building an accurate authors.txt takes some effort, but a few steps make it easier:

  • Use a script like svn-authors-extract to quickly build a template file by scanning the SVN repository.
  • Scan the email list history: if necessary, download the email archive as a mailbox and use regular expressions to extract all email addresses.
  • Find out if the SVN server keeps a list of email addresses with the SVN user IDs somewhere.
  • Use searches in Google, LinkedIn and various other places to try to find out where the person is now.
  • Finally, use the script below to test the results.
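
The first two steps can be roughed out in a few lines of shell. This is a sketch under assumptions, not the svn-authors-extract script itself: the function names are invented, authors_template expects the output of "svn log -q" on standard input, and mbox_addresses uses a deliberately loose regular expression.

```shell
#!/bin/bash
# Hypothetical helpers; names and formats are assumptions for illustration.

# Read "svn log -q" output on stdin and print one template line per SVN user,
# e.g. "jsmith = jsmith <jsmith>", ready to be corrected by hand.
authors_template() {
  awk -F '|' '/^r[0-9]+ \|/ {
    gsub(/^ +| +$/, "", $2)          # trim the spaces around the user ID
    print $2 " = " $2 " <" $2 ">"
  }' | sort -u
}

# Print every distinct address-like string found in a mailbox file, lower-cased.
mbox_addresses() {
  grep -o -i -E '[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}' "$1" | tr 'A-Z' 'a-z' | sort -u
}
```

For example, "svn log -q | authors_template > authors.txt" run inside a working copy gives a starting file, and "mbox_addresses archive.mbox" gives candidate addresses to match against it. Message-ID headers also look like email addresses to this pattern, so the mailbox output needs filtering by eye.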

Testing the results against GitHub automatically

How do you know which email addresses are already matched to a GitHub account and which need further research?

I've written a convenient script to help.

It creates a dummy repository with one commit per committer. Upload this dummy repository as a new project on GitHub and view the commit log page: all the matched user IDs will be highlighted, so the missing ones stand out and you can quickly focus your search for missing email addresses.

I've tested this with the authors.txt file I'm building for the Sipdroid/Lumicall projects. These projects come from Google Code SVN and almost all the identities are obscured by gmail.com addresses that may or may not be the real/preferred email addresses.

I created a dummy project on GitHub called sipdroid-users and pushed the dummy commits there so I can see how many of the contributors are correctly mapped to a GitHub account.

A full copy of the script is below; you can also find it in the sync2git repository:

#!/bin/bash

set -e

if [ $# -lt 2 ];
then
  echo "$0 <authors file> <dest repo for push>"
  exit 1
fi

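# keep absolute paths as given; make relative paths absolute, since we cd below
case "$1" in
  /*) AUTHORS_FILE="$1" ;;
  *)  AUTHORS_FILE="$(pwd)/$1" ;;
esac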
DEST_REPO="$2"

if [ ! -e "${AUTHORS_FILE}" ];
then
  echo "can't find ${AUTHORS_FILE}, aborting"
  exit 1
fi

TMP_REPO=`mktemp -d`
TARGET_FILE="testing.txt"

cd "${TMP_REPO}"
git init .

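cat "${AUTHORS_FILE}" | while read -r ;
do
  # skip blank lines, which would otherwise produce a commit with no author
  [ -n "${REPLY}" ] || continue

  # each line looks like: svnuser = Full Name <email@example.org>
  full_info="$(echo "${REPLY}" | cut -f2 -d'=' | cut -b2-)"
  # name: everything before " <"; email: the text between "<" and ">"
  an="$(echo "${full_info}" | sed -e 's/ <.*$//')"
  am="$(echo "${full_info}" | sed -e 's/^.*<//' -e 's/>.*$//')"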
  cn="$an"
  cm="$am"

  echo "$REPLY" >> "${TARGET_FILE}"
  git add "${TARGET_FILE}"

  export GIT_AUTHOR_NAME="$an"
  export GIT_AUTHOR_EMAIL="$am"
  export GIT_COMMITTER_NAME="$cn"
  export GIT_COMMITTER_EMAIL="$cm"

  git commit -m "Test adding ${REPLY}"
done

git remote add origin "${DEST_REPO}"
# push whichever branch "git init" created (master or main, depending on configuration)
git push -u origin HEAD
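
Because the script takes everything after the "=" on each line at face value, it can help to check the file's format before running it. The following is a sketch of such a pre-flight check, not part of the original script; the function name check_authors_format and the exact pattern are my own assumptions about what a well-formed line looks like.

```shell
#!/bin/bash
# Hypothetical pre-flight check, not part of the script above.
# Prints (with line numbers) every line that does not look like
#   svnuser = Full Name <email@example.org>
# and returns non-zero if any such line exists.
check_authors_format() {
  # grep -v selects the malformed lines; "!" turns "found some" into failure
  ! grep -n -v -E '^[^=]+ = [^<>]+ <[^<>@ ]+@[^<>@ ]+>$' "$1"
}
```

A possible invocation is "check_authors_format authors.txt && ./check-authors.sh authors.txt git@github.com:example/dummy-users.git", where both the script file name and the repository URL are placeholders rather than real names.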