The Leedz is a platform for monetizing structured, time-sensitive, private information within a gated social network. One source of friction in its original implementation was the process of creating a leed from an arbitrary email, social inbox, Yelp listing, or other booking site, a process that could be automated with ML.
The challenge: extract schema-level fields (name, date, time, budget, address, etc.) from arbitrary, semi-structured blobs of platform (and person) specific text scraped from the DOM.
My first approach treated the DOM text blob as a document-level input to a classification model that filled out a JSON schema containing fields for the event's date, location, start time, etc. Budget constraints limited me to an AWS EC2 instance only large enough to support a tiny transformer (~700K parameters). The context-to-parameter ratio was too high: the model could not compress the blob's token space into a meaningful latent representation. Even a phrase like "I want to book you for July 4 2025" went unrecognized.
If the model couldn't get bigger, I could still reduce the numerator by trimming the input. I introduced heuristics to remove headers, footers, navbars, and anything else unlikely to contain booking data, and stripped known phrases ("learn more", "terms and conditions") to improve the signal-to-noise ratio.
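A minimal sketch of what such trimming heuristics might look like. The phrase list and the rules here are illustrative assumptions, not The Leedz's actual filters:

```python
# Hypothetical trimming heuristics: drop nav-style lines and lines
# dominated by known boilerplate phrases before the blob reaches the model.
BOILERPLATE_PHRASES = [
    "learn more",
    "terms and conditions",
    "privacy policy",
    "unsubscribe",
]

def trim_blob(lines: list[str]) -> str:
    """Keep only lines likely to contain booking data."""
    kept = []
    for line in lines:
        lowered = line.lower()
        # Skip lines containing known boilerplate phrases
        if any(p in lowered for p in BOILERPLATE_PHRASES):
            continue
        # Skip very short nav-style lines ("Home", "About", "Contact")
        if len(line.split()) < 2:
            continue
        kept.append(line.strip())
    return "\n".join(kept)

blob = [
    "Home",
    "Hi, I'm Susan and I want to book you for July 4 2025.",
    "Learn more about our terms and conditions.",
]
print(trim_blob(blob))
# → Hi, I'm Susan and I want to book you for July 4 2025.
```

Even crude rules like these shrink the token count dramatically on scraped pages, where navigation and legal text often outweigh the actual inquiry.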
Then I compromised my ML aspirations and adopted a pragmatic regex approach for identifying phone numbers and emails, orders of magnitude faster and with near-perfect precision, and stripped those from the trimmed blob as well.
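A sketch of that regex pass. The patterns below are simplified (real phone and email formats vary far more) and the function name is my own for illustration:

```python
import re

# Simplified illustrative patterns; production patterns need broader coverage
# (international phone formats, quoted local parts in emails, etc.).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def extract_contacts(text: str) -> dict:
    """Pull emails and phone numbers out, and return the text without them."""
    emails = EMAIL_RE.findall(text)
    phones = PHONE_RE.findall(text)
    stripped = PHONE_RE.sub("", EMAIL_RE.sub("", text))
    return {"emails": emails, "phones": phones, "stripped": stripped}

result = extract_contacts("Reach me at susan@example.com or 555-123-4567.")
print(result["emails"], result["phones"])
# → ['susan@example.com'] ['555-123-4567']
```

Handing these fields to a regex frees the model from learning rigid surface patterns it would otherwise waste capacity on.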
The classification model maps the entire blob to a schema in one jump. Named Entity Recognition instead examines every token in the input and assigns each one a tag (NAME, DATE, etc.); the tags matching the schema are then harvested. It's more granular and flexible, but training an NER model requires generating thousands of blobs, breaking each one down word by word, and tagging each word according to the schema.
I decided to use the classifier to generate the training data for the NER model. Take the blob, let the classifier predict the schema, then search the original text for the predicted values. If the classifier generated a schema assigning "name : Susan", then the training data for the NER model would include "Hi I'm Susan" tagged [O][O][NAME]. The classifier output becomes pseudo-labeled data for training the NER model. And while I was at it, I might as well use a top-level LLM like ChatGPT to take a small set of hand-written blob-to-schema training mappings for the classifier (20 samples) and extrapolate it 5x.
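The pseudo-labeling step can be sketched as projecting the classifier's predicted schema values back onto the raw tokens. The function name and the O/NAME tag set here are illustrative assumptions, not the project's actual code:

```python
# Sketch of pseudo-labeling: search the original blob for each value the
# classifier predicted, and tag the matching tokens with the field name.
# Unmatched tokens get the conventional "O" (outside) tag.
def pseudo_label(blob: str, schema: dict) -> list[tuple[str, str]]:
    tokens = blob.split()
    tags = ["O"] * len(tokens)
    for field, value in schema.items():
        value_tokens = value.split()
        n = len(value_tokens)
        # Slide a window over the token stream looking for the value
        for i in range(len(tokens) - n + 1):
            window = [t.strip(",.!?").lower() for t in tokens[i:i + n]]
            if window == [v.lower() for v in value_tokens]:
                for j in range(i, i + n):
                    tags[j] = field.upper()
    return list(zip(tokens, tags))

print(pseudo_label("Hi I'm Susan", {"name": "Susan"}))
# → [('Hi', 'O'), ("I'm", 'O'), ('Susan', 'NAME')]
```

The same loop extends naturally to multi-token values like dates and addresses, since the window length follows the predicted value.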
The story of this project's evolution represents an entirely new paradigm. The center of gravity in programming has moved:
Old Way: define inputs, spend time coding the logic, test the outputs.
New Way: The right program will be auto-generated in seconds. The challenge is figuring out the right one to generate and how to train it. Time is spent designing the inputs, debugging the outputs, and aligning the two with data.
I don't need to write the function parse_gig_posting() anymore. I need to create the conditions under which a black-box system can understand the mapping between a blob and a schema: data curation, pipeline design, and functional auditing, not logic and unit tests. This is the new software engineering.
Scott Gross
Founder, the Leedz