HuggingFace Zero Shot NLP: As Cool as Trendy
- Andrew Argeros
- Sep 23, 2020
- 4 min read
Lately, the realm of Natural Language Processing (NLP) has been dominated by Zero-Shot learning. OpenAI's GPT-3 is incredibly cool and applicable to almost anything; check out this article on Towards Data Science to see some mind-blowing demos of how GPT-3 is being used. The main downside is that once it is fully rolled out, GPT-3 will be a paid service accessed via an API.
This is a departure from what has largely been the status quo of data science and algorithm development. Git-based sites a la GitHub and GitLab have long hosted the repositories holding the source code for the algorithms that drive much of data science. GPT-3, however, is held close to the chest instead of being released as an open-source model.
However, HuggingFace has its own Zero-Shot model, used for classification. Unlike OpenAI's model, this one is open source and completely democratized. The implications of this are massive. Not only is the model versatile, but it doesn't require the extensive "train-improve-repeat" cycle common among neural models. Basically, once you have the classifier established in the Python kernel, you can feed it a string and some candidate labels, and the model will return the best fit for the string.
from transformers import pipeline

# Load the default zero-shot classification pipeline
classifier = pipeline('zero-shot-classification')

# string: the text to classify; options: a list of candidate labels
result = classifier(string, options)
Using this, I have found two solid uses: fuzzy string matching and deduction. Sure, there are other fuzzy matchers out there (SeatGeek's FuzzyWuzzy comes to mind), but that approach not only has its limitations, it is almost too slow to use on decent-sized data frames. Say you have a bunch of survey results of people's favorite animals at the zoo, and whoever designed the survey left the field open for free text. The obvious problem now is that "Pnada" and its properly spelled counterpart "Panda" are two different things. Account for every spelling error across your data frame, and you can easily find yourself with 18 different strings that represent one animal. Searching with a regex could work, but this would likely involve crafting a regex around each animal (hopefully you have a small zoo) and would be far more headache-inducing than it needs to be. Here's what you could do:
from transformers import pipeline
import pandas as pd

classifier = pipeline('zero-shot-classification')

survey = ['pnada', 'jellyfishes', 'leon']  # free-text survey responses
options = ['panda', 'jellyfish', 'lion']   # canonical animal names

fixed_survey = []
for animal in survey:
    result = classifier(animal, options)
    print(result['labels'][0])  # best class label
    print(result['scores'][0])  # its prediction score
    fixed_survey.append({'input_term': animal,
                         'output_term': result['labels'][0],
                         'output_score': result['scores'][0]})

fixed_data = pd.DataFrame(fixed_survey)
By putting all of the animals in your hypothetical zoo into a list (probably read in from an Excel sheet), you have every possible option a bad-spelling zoo-goer could be referring to. This also scales, as the algorithm is limited only by the technical specs of your computer. With the growing prevalence of cloud computing platforms like AWS, Azure, and Google Cloud, a cheap virtual machine could run this without a problem.
Using largely the same code as above, we can also use HuggingFace to run some deduction-style learning. Again with the hypotheticals... this time, say you work in real estate (a real journeyman) and are interested in houses with pools. You have the listing description; however, pools come in a few forms, namely private pools and community amenities, and you want to know which kind of pool is at which house. Again, in the spirit of devil's advocacy, there are other ways to do this: computer vision, searching government records, and others. You could also just read the listing description, but if you're reading this blog, I assume you're not a heathen and would do this programmatically.
Take this listing for example:
“Located in the prestigious community of Stonehenge, this breathtaking property is a haven from the chaos of everyday life. The stunning yard which is entirely fenced boasts a luxurious in-ground pool, hot tub, and spacious patio for relaxing on warm summer days. Readily available is a pool house with a full bath and kitchenette as well. For tennis enthusiasts, there is a private, fenced-in tennis court with lighting for night time play, and a large gazebo for shade. The property also boasts a stall barn and plenty of space for equestrian activities--- all horses are welcome! Moving into the main house itself, this dazzling home features 5 bedrooms, 8 full, 2 half baths, and over 9,000 square feet of living space. The home is highlighted by an open floor plan, and showcases a beautifully lit sun room with access to the pool and patio area” -Zillow
By setting the classifications to something like [‘Private Pool’, ‘Community Pool’, ‘Whirlpool’, ‘Pool Table’], we can get the model to determine that this listing contains a private pool. If we were to set the multi_class parameter to True we could theoretically see if the listing falls into more than one class.
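Concretely, that check might look like the sketch below. The listing text is abbreviated here, and the variable names are only illustrative; the label set is the one suggested above.

```python
from transformers import pipeline

classifier = pipeline('zero-shot-classification')

# Abbreviated listing description from the example above
listing = ("The stunning yard which is entirely fenced boasts a luxurious "
           "in-ground pool, hot tub, and spacious patio for relaxing on "
           "warm summer days. Readily available is a pool house with a "
           "full bath and kitchenette as well.")

options = ['Private Pool', 'Community Pool', 'Whirlpool', 'Pool Table']

result = classifier(listing, options)
print(result['labels'][0])  # highest-scoring label
print(result['scores'][0])  # its confidence score
```

The labels come back sorted by score, so the first entry is the model's best guess; passing multi_class=True instead scores each label independently, letting a listing match more than one class.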
Obviously these aren't the only uses for Zero-Shot, just a few that I have encountered. If you know of any others that you'd like to share, feel free to comment below!