Image Classification Efficiency for Beers
- Andrew Argeros
- May 9, 2022
- 8 min read
Say you're an enterprising brewery owner who wants to track how long beers stay in your patrons' glasses. Sure, you could have your bartender sit and take data from behind the bar, but not only is that not very cool to your bartender, it's also not very accurate, since the bartender has other things to do. One option is cameras: since you know the different styles of beer you produce, maybe you can segment the images you periodically snap of patrons' brews into those buckets using computer vision.
Taking this route raises many concerns, however. Obviously the model should be accurate, but a taproom probably doesn't have much budget for things like MLOps; that's why parameters, latency, and training time are also of concern. Not to mention, every taproom I have been to lacked an Nvidia GPU strapped to the bar.
Part 1: Getting the Data
To start this project, as with any data science project, I needed data. Originally, I scanned a few different GitHub repos (OpenBeer and OpenBreweryDB) for pictures of pints, but got nowhere. It turns out there is a big market for repositories of beer labels, but not the beer itself. Thus I turned to the next best place: Untappd.
For those who are unfamiliar, Untappd is essentially Yelp for beer. Users review and rate the beers they drink, and accompany the text with a photo of their beverage. This creates a hierarchy of beers within each style and substyle, and lets brewers gather data on what makes a popular beer. While Untappd has an API for said brewers, it is egregiously expensive and not something feasible on a collegiate (or average brewer's) budget, so I went to the next best thing: scraping.
Given that they have this paid API, Untappd really doesn't want people to scrape their data, so I had to be creative in my approach. Any request made from a normal Chromium user-agent returns a 403: Forbidden status. To get around this, I passed different headers on each request, using the Python function below. This essentially passes random junk to the web server to convince it that each request comes from someone new.
from uuid import uuid4

def make_headers():
    # Randomize the User-Agent and From headers on every call so each
    # request looks like it comes from a brand-new client
    headers = {
        'User-Agent': uuid4().hex,
        'From': f"{uuid4().hex}@{uuid4().hex}.com"
    }
    return headers
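Plugged into the requests library, each call gets a fresh identity. A quick usage sketch (the URL here is a placeholder rather than the exact page I scraped):

import requests

resp = requests.get("https://untappd.com/beer/top_rated",
                    headers=make_headers())
resp.raise_for_status()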
This allowed me to gather data from Untappd's Top Rated section. First, I scraped the 215 styles of beer from the menu, then used each to scrape a list of the corresponding top 50 beers per style. From there, I iterated through each beer to find links to its page's top 12 images, and then it was a matter of downloading the images into a directory. This takes some time, however, as the expectation is around 129,000 images (215 × 50 × 12). The resulting data takes up roughly 12.5 GB, which is certainly chunky. Additionally, I compiled the metadata on each beer's URL, style, and number of files downloaded into a .csv that I used later for splitting the image data.
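The download step itself might look like the sketch below; download_images, image_links, and style are hypothetical stand-ins for values parsed from each beer's page, with error handling kept minimal.

import os
import requests

def download_images(image_links, style, out_dir="Data/raw"):
    # Save each image under a folder named for its style
    os.makedirs(os.path.join(out_dir, style), exist_ok=True)
    for i, link in enumerate(image_links):
        resp = requests.get(link, headers=make_headers())
        if resp.ok:
            with open(os.path.join(out_dir, style, f"{i}.jpg"), "wb") as f:
                f.write(resp.content)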
Then, both due to cardinality and the lack of visual difference between many styles, I binned Untappd's 215 styles into six overarching classes: Dark and Malty Beers, Light Beers, Stouts & Porters, Fruit Beers & Seltzers, IPAs, and a "Not Applicable" catchall that was not trained on. This is where things started to get messy, as styles like Trappist beers don't have an associated look but are more a product of their manufacturing; those landed in the catchall. The binning itself was done in Label Studio, mapping each Untappd style to one of the underlying classes. The question here is essentially: can a model learn the crisp, golden color of a pilsner, the lupulin lacings of a New England IPA, or the coffee head of a Milk Stout? Sure, there is some overlap, but I chose these classes to create the most interclass differentiation without collapsing into a light/dark binary classification that would frankly lack meaningful insight.
Part 2: Cleaning the Data
Since Untappd data is, by definition, not the cleanest source of data, there was a great deal of cleaning to be done. Since the project is about classifying beers, we need pictures of beer. This might sound self-explanatory, but the scraped images contain myriad forms of junk that only muddy the signal in the data. Two of the most common offenders are photos of beer bottles and photos of the group of friends with whom the user drank the beer in question.
There is almost enough in this to warrant an image classification problem of its own, yet I did not have the time, desire, or energy to do so. Thankfully, there is CLIP from OpenAI. CLIP essentially takes the power of large language models and combines it with images. One of the things we can do with it is zero-shot classification via HuggingFace (see my previous post on zero-shot NLP). This allowed me to label each image as essentially "relevant" or not without training a model on this data. Using the function below, I ran CLIP over my data directory to keep only images of pints of beer.
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

options = ['a photo of a can or bottle', 'a photo of a glass of beer', 'a photo of a person or group of people', 'a photo of a company logo']

def classify_image(img, options, model=model, processor=processor):
    # Score the image against each candidate caption
    image = Image.open(img)
    inputs = processor(text=options, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)  # softmax for probabilities
    return dict(zip(options, probs.tolist()[0]))
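Filtering then amounts to walking the data directory and keeping an image only when CLIP rates the beer-glass caption highest. A hedged sketch of that pass (the directory name and the keep-or-delete logic are assumptions, not my exact script):

import os

for root, _, files in os.walk("Data/raw"):
    for name in files:
        path = os.path.join(root, name)
        probs = classify_image(path, options)
        # Drop anything CLIP thinks is more likely a bottle, person, or logo
        if max(probs, key=probs.get) != 'a photo of a glass of beer':
            os.remove(path)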
This took the directory down to about 106,000 usable images that then needed to be split into train and test. Since the distribution of images per class was held roughly constant by Untappd's Top Rated layout, the data had a natural balance to it, so over/under-sampling was not needed. From there, since I would go on to use PyTorch, I had to restructure the folder. This was simple enough, as some Linux commands, with some Python to boot (think shutil and os), got the job done, such that the data directory looked something like this (a sketch of the split follows the tree):
/Data
|
/Train
|
/Dark and Malty Beers
|
001_1.jpg
001_2.jpg
:
/Stouts
/Test
|
/Dark and Malty Beers
|
006_2.jpg
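Here is the minimal sketch of that restructuring promised above; the 80/20 split ratio and the raw-folder layout are assumptions, not the project's exact values.

import os
import random
import shutil

random.seed(42)  # make the split reproducible
for style in os.listdir("Data/raw"):
    src = os.path.join("Data/raw", style)
    for name in os.listdir(src):
        split = "Train" if random.random() < 0.8 else "Test"
        dest = os.path.join("Data", split, style)
        os.makedirs(dest, exist_ok=True)
        shutil.copy(os.path.join(src, name), dest)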
This is where the real mess happens. This dataset is huge by the standards of most file transfer services: too big to push to GitHub, too big for GitHub LFS, too big for email. Not to mention, this was all performed on a work PC that limits access to other file-sharing sites like Drive or Dropbox. Sure, I could have kept everything local, but my laptop doesn't have a big enough GPU to reasonably train models. So, to accommodate this, I pushed a .zip of the directory to a Google Cloud bucket and based the project out of there.
Part 3: The Models
A while ago, I saw a tweet thread from Brandon Rohrer that talked about the nuances of convolutions for image classification. His point is that, while convolutions may be the status quo for image processing, they are ill-equipped to be the "feature detectors" we want them to be. Essentially, convolutional kernels tend to squish everything toward zero while leaving some noise in the resulting matrix.
The alternative he proposed is based on cosine similarity, a measure typically used in the field of Natural Language Processing. It slides a window over the image and computes the cosine similarity between each patch and the kernel. The result is then "sharpened" by raising the similarity to some exponent that accentuates the peaks of the resulting matrix. This yields orders of magnitude fewer parameters, and thus a leaner network.
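To make that concrete, below is a minimal sketch of a sharpened-cosine-similarity layer in PyTorch. This is my own simplification for illustration, not Rohrer's exact formulation: a single 2D layer with unit stride, no padding, no norm-floor term, and a learned per-channel exponent p.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharpenedCosineSimilarity(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_channels, in_channels, kernel_size, kernel_size))
        self.p = nn.Parameter(torch.ones(out_channels))  # sharpening exponent
        self.kernel_size = kernel_size
        self.eps = eps

    def forward(self, x):
        # Dot product of every kernel with every input patch (a plain convolution)
        dot = F.conv2d(x, self.weight)
        # Norm of each input patch: convolve the squared input with ones, then sqrt
        ones = torch.ones(1, x.shape[1], self.kernel_size, self.kernel_size,
                          device=x.device, dtype=x.dtype)
        patch_norm = torch.sqrt(F.conv2d(x ** 2, ones) + self.eps)
        # Norm of each kernel
        w_norm = self.weight.flatten(1).norm(dim=1).view(1, -1, 1, 1)
        # Cosine similarity in [-1, 1], then "sharpened" by the exponent p
        cos = dot / (patch_norm * w_norm + self.eps)
        return torch.sign(cos) * torch.abs(cos) ** self.p.view(1, -1, 1, 1)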
Weirdly, in my testing, Sharpened Cosine Similarity performed better on the CPU than with CUDA. This is due to the exponentiation aspect of the algorithm: the sharpening step doesn't map onto the heavily optimized convolution kernels that GPUs provide, so without a custom fused kernel the layer loses much of the usual GPU advantage, and in practice I ran it on CPU tensors.
Thus, to create a suite of models to test, I assembled a prototypical Convolutional Neural Network (CNN), Brandon Rohrer's Sharpened Cosine Similarity (SCS) network, an example of CoAtNet (the ImageNet SOTA at the time of writing), and CoAtNet with its convolutions replaced by Sharpened Cosine Similarity. CoAtNet is a model proposed by Dai et al. in 2021 that combines convolutions with the attention mechanism from Transformers. This creates a wide network whose many parameters allow for strong feature recognition on image datasets. In theory, replacing its convolutions with sharpened cosines should allow for high accuracy with fewer parameters.
To keep the experiment constant (and compute costs down), each model was trained for five epochs. That is obviously fewer than optimal, but it allows for a decent understanding of each model's performance without the time overhead of a full training run. The training setup was also held constant: the learning rate was set to 0.001, the batch size to 128, the optimizer was Adam, and the loss was categorical cross entropy.
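In PyTorch, that shared setup looks roughly like the sketch below; model stands in for each of the four networks, and train_loader for the DataLoader described in the next section.

import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()                     # categorical cross entropy
optimizer = optim.Adam(model.parameters(), lr=0.001)  # constant across models

for epoch in range(5):                                # five epochs per model
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()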
Part 4: Model Training & Evaluation
Theoretically, these models could be trained anywhere, but I certainly don't recommend it. Since I was dealing with a large dataset and neural nets, this problem was set up for CUDA. Thus, I began the training phase of the project in Google Colab, using their free tier of notebooks with a GPU runtime. It should be known, however, that Google severely limits this access for its non-paying customers: about two hours in, Google informed me that free GPUs would no longer come my way, and I was out of luck. I was then essentially cornered into requisitioning a virtual machine from Google Cloud. This had the added benefits of living within the existing Google Cloud SDK and offering a persistent connection. The instance was set up with 8 vCPU cores, 30 GB of RAM, and an NVIDIA Tesla T4 GPU. Obviously, this has a cost to it, which can run pretty high.
Conveniently, the Google VMs come preloaded with PyTorch and JupyterLab, which I was able to SSH into from my laptop's command line. From there, I used shell commands in the notebook to retrieve the zipped data from the cloud bucket via:
gcloud config set project cds-5950-capstone
gsutil cp gs://dataset-cds-5950/Data.zip /content/Data.zip
This was then unzipped into the project directory such that the /Data directory would be located at /content/cds-5950-capstone/Data. From there, it was a matter of copying the models into notebook cells and running the training loop for each algorithm. This loop recorded each model's parameter count, trained the model, and timed the training.
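Loading the restructured directory is straightforward with torchvision's ImageFolder. A hedged sketch follows; the 224x224 resize is an assumption, as each architecture may expect its own input size.

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),  # assumed input size
    transforms.ToTensor(),
])
train_data = datasets.ImageFolder(
    "/content/cds-5950-capstone/Data/Train", transform=transform)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)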
Then, in a separate cell, I set up an accuracy calculation for each model that loaded the test DataLoader and predicted each batch within the test set. These values were all recorded and serve as the conclusions for the project.
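That calculation amounts to a standard evaluation loop. A minimal sketch, assuming a test_loader built the same way as train_loader above:

import torch

model.eval()  # disable dropout, batch-norm updates, etc.
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Accuracy: {correct / total:.2%}")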
Part 5: Results
Below are the results of the training runs:
| Model Name | Parameters | Training Time | Accuracy | Latency |
| --- | --- | --- | --- | --- |
| CNN | 7,656,632 | 03:17:22 | 49.14% | 0.09 s |
| SCS | 39,261 | 02:01:48 | 39.52% | 0.06 s |
| CoAtNet | 16,054,392 | 04:46:13 | 58.88% | 0.12 s |
| CoAtNet w/ SCS | 13,107,902 | 06:27:27 | 48.26% | 0.09 s |
On many of the criteria, Sharpened Cosine Similarity is a clear winner. However, it suffered in terms of accuracy, coming in nearly 10 percentage points lower than the next least accurate model, the plain CNN. CoAtNet is the favorite in terms of accuracy, but takes more than double the time to train that SCS took, largely due to the massive increase in parameter count.
Thus, in the case of the hypothetical brewer at the start of this article, it may be worth the brewery's resources to invest in longer training (more epochs) of the Sharpened Cosine Similarity model. This would let the taproom make classifications with a lighter-weight model, yielding lower computational overhead and removing the brewery's need for GPUs or computing clusters.