My First AI Project: Building an MVP with Less Than Marvelous Data

We had an interesting idea. And we had an amazing team. The next step in making Smart Staffing a reality was funding. But, unfortunately, people just do not give out money willy-nilly. You have to do something to earn it.

So before we could get funding, we needed to prove out our concept by building a prototype, or a Minimum Viable Product (MVP), as they say in hoity-toity innovation literature.

The obvious solution that came to mind was building a down-and-dirty recommendation algorithm that matched past project data to people data. We would train the algorithm using historical data of the people who had been assigned to past projects.

Project Data seemed simple enough to obtain. After all, we had been tracking projects, and the people who were on those projects, in order to bill our clients. That data had to be good, right?

So we skipped ahead to People Data. The problem with a bunch of super excited, super smart people is that their brains work in overdrive and see unlimited possibilities. The list we generated of possible sources of People Data was endless: resumes, our proprietary HRM system, GitHub, LinkedIn, Degreed, and on and on and on.

We needed to return the focus to the “Minimum” in Minimum Viable Product. What was the simplest people data source to obtain that would still provide meaningful data?

Resumes were the easy choice. While some of the resumes were a little dated, they were still stuffed to the brim with important keywords about an employee’s skillset. And we already had a repository of everyone’s resumes that we could fairly easily extract into a single data file.

Awesome. Let’s get to work building that MVP Recommendation Algorithm!

Not so fast. We soon realized that the Project Data we had was not sufficient. While we did have the projects people had been assigned to, the metadata about the projects was pretty general. For instance, the data we had showed that Bobby the BI Developer had worked on the Acme project in the past, but it did not tell us if Bobby used Tableau or PowerBI. We needed to feed our algorithm more granular data if we wanted to train it properly.

We explored a few options. The Statements of Work for each project were too unwieldy to provide any data value. Manually enriching past project data was too tedious and time-consuming.

The MVP Recommendation Algorithm was not going to happen.

We had one source of data at this point – resumes. What if we compared that data to itself?

One of the common requests I get around staffing is “Find me someone just like…” In other words, “I really want Janey Javadeveloper on my project, but she is booked elsewhere. Is there someone else just like her I could use?”

Could we use Machine Learning to find people within the company whose backgrounds closely matched each other?

Hence, the MVP Similarity Algorithm was born.

Key development elements:

  • Downloaded and parsed employee resumes
  • Used a word stemmer to reduce words to their stems
  • Got rid of dates and unimportant words and phrases during data pre-processing
  • Emphasized words that distinguished one person from another
  • Implemented cosine distance to measure similarity between two employees
  • Tools used included Python, Pandas, Numpy, SKLearn, and Natural Language Toolkit (all open source)
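For the curious, here is roughly what those steps look like in code. This is a minimal sketch, not our actual implementation: it uses only the Python standard library, a toy suffix-stripping function stands in for a real stemmer such as the ones in Natural Language Toolkit, and a hand-rolled TF-IDF weighting stands in for what SKLearn provides. The resume snippets, the names in RESUMES, the stop-word list, and every function name are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical resume snippets; names and text are invented for illustration.
RESUMES = {
    "janey": "Java developer building Spring services and REST APIs since 2012",
    "sam": "Developed Java services with Spring and designed REST APIs",
    "bobby": "BI developer creating Tableau dashboards and SQL reports",
}

# A tiny, assumed stop-word list; a real one would be much longer.
STOP_WORDS = {"and", "the", "with", "since", "a", "of", "in", "for"}


def crude_stem(word):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, keep alphabetic tokens only (drops dates), remove stop words, stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Weight each term so that words appearing in fewer resumes count for more."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    doc_freq = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        vectors[name] = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (cosine distance is 1 minus this)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(name, vectors):
    """Rank everyone else by similarity to the named person, best match first."""
    scores = [
        (other, cosine_similarity(vectors[name], vec))
        for other, vec in vectors.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


VECTORS = tfidf_vectors(RESUMES)
print(most_similar("janey", VECTORS))
```

Asking for people like "janey" ranks "sam" first, because their resumes share most of their stemmed keywords, while "bobby" only overlaps on the word "developer." That is the "Find me someone just like…" request in miniature.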

We ran several tests and found that the Similarity Algorithm returned strong people matches, as evaluated by a human being.

While initial results proved good, we were still struggling with how to incorporate a feedback loop into the process so the Similarity Algorithm could continuously learn. Nevertheless, we felt we had checked the boxes for both “Minimum” and “Viable Product.” It was time to ask for money.

Next up: Getting funding


I am slowly catching this series up to the actual progress point of the project. Until then, here is a little taste of current events:

  • Why I am feeling at the top of the world: Last week I traded the Smart Staffing project for skiing. We took a family trip to Colorado to sneak in some powder before the season ended. While it definitely felt like I was at the top of the world standing at the crest of the mountain (12,060 feet to be exact), what really energized me was taking a break from emails, meetings, and project plans. I am a firm believer in mindful rest.
  • Except for this not-so-minor detail: Ok, I did check my email intermittently over vacation. And my heart sank when I saw the subject line “Resignation.” A few weeks ago, our team lost our Machine Learning guru, Michael, when he took a job with a Machine Learning startup. And now, our up-and-coming ML mind, Hitanshu, was leaving as well, taking a great opportunity elsewhere as a Data Scientist. I am thrilled that Michael and Hitanshu were able to land their dream jobs. And I am also happy that this project was able to provide them a great foundation for their data science careers. After all, one of the secondary objectives of this project was providing ML learning opportunities for our people. But now two-thirds of my ML development team is gone, taking both their expertise and enthusiasm with them. They will be difficult to replace.
  • What’s making me feel smarter: Also last week (I promise I did rest the remainder of my vacation!), we met with our Microsoft account reps. The MVP we built was not based on the Microsoft platform, so we needed to get up to speed on Azure ML. In addition to discussing various technology options, they shared training options our team could explore. I have already taken a Machine Learning class, but it was somewhat technology agnostic. I felt that to better understand the Microsoft Azure ML environment, I really needed Microsoft-specific training. So, with much insecurity-fueled trepidation, I signed up to pursue my Microsoft Professional Program in Data Science certificate.
  • What’s making me feel dumb: My first consulting internship (I will not give the year, but let’s say it was not in this century) involved a copious amount of data crunching in Excel. By the end of the summer, I could pivot-table and VB script LIKE A BOSS. So, I thought the first course in the Microsoft Data Science certificate (“Use Microsoft Excel to explore data”) would be a breeze. Ha. Microsoft has added a lot of features since my “Summer of Spreadsheets.” Things like this and this. Note to self: It’s probably good to brush up on your software skills every decade or so.

This is the third installment of my real-time case study on my first AI project. I plan to share what we are working on, what is going well, what is sucking at the moment – everything – as it happens.

My hope is by sharing our project’s small victories and painful bruises, you will be encouraged to tackle a project that scares the sh?! out of you too.