Unpaid work training Google's spam filters


This week, there has been increased discussion about the pain of spam filtering by large companies, especially Google.

It started with Google's announcement that they are offering a service for email senders to know if their messages are wrongly classified as spam. Two particular things caught my attention: the statement that less than 0.05% of genuine email goes to the spam folder by mistake and the statement that this new tool to understand misclassification is only available to "help qualified high-volume senders".

From there, discussion has proceeded with Linus Torvalds blogging about his own experience of Google misclassifying patches from Linux contributors as spam and that has been widely reported in places like Slashdot and The Register.

Personally, I've observed much the same thing from the other perspective. While Torvalds complains that he isn't receiving email, I've observed that my own emails are not always received when the recipient is a Gmail address.

It seems that Google expects their users work a little bit every day going through every message in the spam folder and explicitly clicking the "Not Spam" button:

so that Google can improve their proprietary algorithms for classifying mail. If you just read or reply to a message in the folder without clicking the button, or if you don't do this for every message, including mailing list posts and other trivial notifications that are not actually spam, more important messages from the same senders will also continue to be misclassified.

If you are not willing to volunteer your time to do this, or if you are simply one of those people who has better things to do, Google's Gmail service is going to have a corrosive effect on your relationships.

A few months ago, we visited Australia and I sent emails to many people who I wanted to catch up with, including invitations to a family event. Some people received the emails in their inboxes yet other people didn't see them because the systems at Google (and other companies, notably Hotmail) put them in a spam folder. The rate at which this appeared to happen was definitely higher than the 0.05% quoted in the Google article above. Maybe the Google spam filters noticed that I haven't sent email to some members of the extended family for a long time and this triggered the spam algorithm? Yet it was at that very moment that we were visiting Australia that email needs to work reliably with that type of contact as we don't fly out there every year.

A little bit earlier in the year, I was corresponding with a few students who were applying for Google Summer of Code. Some of them also observed the same thing, they sent me an email and didn't receive my response until they were looking in their spam folder a few days later. Last year I know a GSoC mentor who lost track of a student for over a week because of Google silently discarding chat messages, so it appears Google has not just shot themselves in the foot, they managed to shoot their foot twice.

What is remarkable is that in both cases, the email problems and the XMPP problems, Google doesn't send any error back to the sender so that they know their message didn't get through. Instead, it is silently discarded or left in a spam folder. This is the most corrosive form of communication problem as more time can pass before anybody realizes that something went wrong. After it happens a few times, people lose a lot of confidence in the technology itself and try other means of communication which may be more expensive, more synchronous and time intensive or less private.

When I discussed these issues with friends, some people replied by telling me I should send them things through Facebook or WhatsApp, but each of those services has a higher privacy cost and there are also many other people who don't use either of those services. This tends to fragment communications even more as people who use Facebook end up communicating with other people who use Facebook and excluding all the people who don't have time for Facebook. On top of that, it creates more tedious effort going to three or four different places to check for messages.

Despite all of this, the suggestion that Google's only response is to build a service to "help qualified high-volume senders" get their messages through leaves me feeling that things will get worse before they start to get better. There is no mention in the Google announcement about what they will offer to help the average person eliminate these problems, other than to stop using Gmail or spend unpaid time meticulously training the Google spam filter and hoping everybody else does the same thing.

Some more observations on the issue

Many spam filtering programs used in corporate networks, such as SpamAssassin, add headers to each email to suggest why it was classified as spam. Google's systems don't appear to give any such feedback to their users or message senders though, just a very basic set of recommendations for running a mail server.

Many chat protocols work with an explicit opt-in. Before you can exchange messages with somebody, you must add each other to your buddy lists. Once you do this, virtually all messages get through without filtering. Could this concept be adapted to email, maybe giving users a summary of messages from people they don't have in their contact list and asking them to explicitly accept or reject each contact?

If a message spends more than a week in the spam folder and Google detects that the user isn't ever looking in the spam folder, should Google send a bounce message back to the sender to indicate that Google refused to deliver it to the inbox?

I've personally heard that misclassification occurs with mailing list posts as well as private messages.