Building A Lead Scoring Model In-House: 10 Things to Consider
We’ve spoken with dozens, if not hundreds, of marketing teams that have considered building a lead scoring model internally. They’ve already got the lead scoring basics figured out, and both sales and marketing are reaping the benefits of it. They’ve got a ton of data and are now ready to take it to the next level. And they’ve got the resources to help them do some impressive data analysis and build an algorithm specifically for their use case - bonus!
Not so fast… Given the extensive experience we have with the matter, we thought we'd share some of the most frequently overlooked aspects of such an endeavor. Check out the top 10 things to consider below!
1. Building a Lead Scoring Model In-House Will Cost You More Than You Think
We've already covered a lot of this in our build vs. buy article here. The gist of it is that, while it might seem straightforward to build some point-based or even basic regression-based models, the opportunity cost is high. Opportunity cost is probably one of the most underrated metrics in startup-land, yet it is fundamental. Sure... building it in-house doesn't cost you anything from a "new cost = new line on the financial report" perspective, but it does cost you in features that won't be shipped because your engineering team is working on scoring leads. Growth engineers are better off building experiences and automation than optimizing internal alignment.
The very basic cost calculator we've put together is pretty eye-opening.
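To make the opportunity-cost math concrete, here is a minimal back-of-envelope sketch in Python. Every figure in it is a hypothetical placeholder, not a benchmark; swap in your own numbers.

```python
# Back-of-envelope cost of building lead scoring in-house.
# All numbers below are hypothetical placeholders -- plug in your own.
fully_loaded_engineer_cost = 180_000   # $/year per data/growth engineer
engineers_on_project = 2
build_months = 4                       # initial build
maintenance_fraction = 0.25            # ongoing share of one engineer's time

build_cost = engineers_on_project * fully_loaded_engineer_cost * (build_months / 12)
yearly_maintenance = maintenance_fraction * fully_loaded_engineer_cost

print(f"Initial build: ${build_cost:,.0f}")
print(f"Ongoing maintenance: ${yearly_maintenance:,.0f}/year")
# ...and none of this counts the features those engineers did NOT ship.
```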
2. Build With Marketing AND SALES in Mind
One of the most challenging aspects of building a lead scoring model is that your data engineers aren't necessarily marketers, and even fewer have any background in sales - the end-users of the lead score you are trying to implement. Sales and marketing alignment is critical, so it is worth addressing some of the fundamental differences in their personalities.
- Marketers are evolving into data-driven professionals, and this helps them align well with Engineers. However, it is also leading them further away from sales teams, who don't care about conversion rates and aggregated funnel metrics.
- Sales people think in specifics (a particular deal, person, meeting). Plus, there is a human element to sales, so if they become too aggregate-oriented and robotic, they will be less successful because they are unable to be consultative in every conversation.
It is critical when building a lead score model to bear in mind that it will be evaluated by Sales as much as it will be by Marketing.
We recommend companies use their recurring sales+marketing meetings to better define the requirements for a lead score model that is most likely to be adopted by Sales. Knowing precisely how the model will be used and surfaced will determine which trade-offs can be made. For example, I've seen companies hide the scores from reps to avoid having to debate why a lead is a 92 and another an 88… #facepalm
Get the complete sales and marketing alignment playbook for 5 ways (including tactical how-tos) to harness your data and create go-to-market alignment through effective lead management, productive meetings, and more.
3. Fancy ML is BS in B2B
Neural Nets were all the rage a couple of years ago. AI and Machine Learning are the hottest topics of 2023. But there is absolutely no need for fancy algorithms in B2B sales. The datasets are simply too small to warrant anything fancier than basic regression algorithms or maybe a kNN.
I've spent the past 10 years building models for B2B and reading research papers on the lookout for a breakthrough, but there hasn't been one to date. If you're curious, the most interesting paper I've seen is one on how IBM built size-of-wallet predictions (see here for the full paper).
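For a sense of scale, here is a minimal sketch of the kind of "unfancy" model that is usually sufficient: a plain logistic regression over a handful of firmographic features using scikit-learn. The file name, feature names, and label column are hypothetical stand-ins for whatever your CRM export contains.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical training set: one row per lead, with firmographic features
# and a binary "converted" label pulled from your CRM.
leads = pd.read_csv("leads.csv")
features = ["employee_count", "web_traffic", "uses_salesforce", "is_b2b"]

X_train, X_test, y_train, y_test = train_test_split(
    leads[features], leads["converted"], test_size=0.3, random_state=42
)

# Plain logistic regression -- no neural nets required at B2B data volumes.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```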
As you'll see in the next few paragraphs, the complexity instead lies in the data prep for the model and the success metric.
4. You'll Still Suffer From "Garbage In, Garbage Out"
Weren't we all so disappointed when the promise of Big Data analytics platforms failed to deliver because of "bad data"? Most Go-To-Market teams realized they were unable to unlock the benefits of these platforms. This IBM study claimed that poor data quality costs US businesses over $3.1 trillion per year.
Data can be "Big" along any of the 4 Vs referenced in the IBM article. I also believe that data can be "Bad" for any of the following reasons (a small data-hygiene sketch follows the list):
- Duplicate: We often find CRMs containing duplicate records with conflicting information.
- Incomplete: We'll address data sparsity in its own section, but a classic issue I've seen many times is companies heavily relying on self-input form fields. This artificially boosts the scores of leads created through specific forms while leaving everyone else under-scored.
- Inaccurate: Prospects will fill forms with incorrect information, and even 3rd party enrichment tools will have some inaccurate values. Kael Kelly from Avalara explains here how he ended up being given demos where Box was flagged as a 20-employee company.
- Siloed: Leads are not necessarily linked to accounts, and disparate systems (analytics vs. CRM) will contain records for the same person but won't share information. This makes it even harder to create a holistic view of your customers and leads to feed into the ML pipeline.
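As a concrete example of the "Duplicate" problem, here is a minimal pandas sketch that collapses duplicate lead records by email, keeping the most recently updated row (a real merge would reconcile conflicting values field by field). The file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical CRM export with duplicate leads and conflicting values.
# Columns: email, company_size, industry, updated_at
leads = pd.read_csv("crm_leads.csv")

# Normalize the join key, then keep the most recently updated record per email
# so conflicting values resolve to the freshest data.
leads["email"] = leads["email"].str.lower().str.strip()
deduped = leads.sort_values("updated_at").drop_duplicates(subset="email", keep="last")

print(f"{len(leads) - len(deduped)} duplicate records collapsed")
```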
5. Your Data Sparsity is Crippling
B2B marketing teams are constantly seeking the best enrichment tools that will unlock the ever-elusive 100% match rate. In the meantime, we have to deal with sparse records in Salesforce missing company size, industry, and even HQ country. This missing data makes it harder to run a regression on top of your data. It also explains why Salesforce Einstein hasn't crushed every single lead score model built outside of the platform. During our incubation by Salesforce, we ran a test to validate this assumption. The TL;DR is that a model with fewer but more consistently populated data points outperformed a model using 5 features (company size, industry, traffic, technologies used, and 1 custom data point from the form) by 50%.
Figuring out how you manage self-input information is also more complicated than it may seem. Many of our customers complain that the model they've built internally relies heavily on a few fields from specific forms (e.g. # of sales people for sales automation tools, # of images on website for cloud hosting). The challenge is that they know they should reduce the number of fields on their forms to increase conversion rates, but if they remove these heavily weighted fields all of the leads will get scored as medium quality. You want to avoid having the same person (defined as having the same email) be scored 95 and 55 in your system just because of the absence of that self-input data. Based on what you are trying to optimize for -- increase the MQL-to-SQL conversion rate, solve for capacity, or increase the predictability of revenue -- you will need to configure your scoring mechanism differently to handle self-input information.
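One way to keep optional self-input fields from whipsawing scores is to let firmographics drive the base score and treat form data as an adjustment that only applies when it is present. The thresholds, field names, and point values below are purely hypothetical; this is a sketch of the pattern, not a recommended weighting.

```python
from typing import Optional

def fit_score(employee_count: int, industry: str,
              num_salespeople: Optional[int] = None) -> float:
    """Hypothetical point-based sketch: firmographics drive the base score,
    and self-input form data only adjusts it when present."""
    score = 0.0
    if employee_count >= 50:
        score += 40
    if industry in {"software", "fintech"}:
        score += 30

    # Self-input field: adjust, never gate. A missing value leaves the
    # base score untouched instead of collapsing it to "medium".
    if num_salespeople is not None and num_salespeople >= 10:
        score += 15

    return min(score, 100)

print(fit_score(200, "software"))       # 70.0 -- no form data, still high
print(fit_score(200, "software", 25))   # 85.0 -- form data adds signal
```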
The TL;DR still stands true: no amount of algorithmic brute force will ever make up for appropriate data preparation.
6. Keep Behavioral and Fit Scores Separate in Your Lead Scoring Models
I've found that most marketing operators were not trained to keep behavioral and firmographic signals separate. I would often see scoring models that allocated +50 pts for a demo request and +20 pts for being an executive. While this might seem to make sense at first, it quickly creates operational problems, the main one being that you are setting a hierarchy between intrinsic (fit) and intent (behavioral) attributes.
We recommend starting with the ideal business goal before building anything and writing down how each campaign response should be evaluated for being MQL-ed. In general, we'd recommend building a model to tier your hand-raiser SLAs (which hand-raisers should be contacted within 5 minutes vs. 2 hours vs. 24 hours), as sketched below. Then you can work your way down to other campaign response types.
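A minimal illustration of keeping the two dimensions separate: compute a fit grade and a behavioral grade independently, then map the pair onto an SLA tier instead of summing them into one number. The grades and SLAs below are hypothetical examples.

```python
def sla_tier(fit: str, behavior: str) -> str:
    """Map independent fit and behavior grades to a follow-up SLA.
    Grades and SLA values are hypothetical examples."""
    matrix = {
        ("high", "hand_raiser"): "5 minutes",
        ("high", "engaged"):     "2 hours",
        ("high", "cold"):        "24 hours",
        ("low",  "hand_raiser"): "24 hours",
        ("low",  "engaged"):     "nurture",
        ("low",  "cold"):        "nurture",
    }
    return matrix[(fit, behavior)]

print(sla_tier("high", "hand_raiser"))  # -> 5 minutes
print(sla_tier("low", "hand_raiser"))   # -> 24 hours: same action, different urgency
```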
7. Keep Job Titles/Seniority Separate From Firmographics
When building a lead score, you essentially want to mimic the qualification process your best reps go through. While the BANT framework (Budget, Authority, Need, Timing) is extremely helpful, the order of the acronym, along with lumping the 4 elements together, can be misleading. From a business standpoint, we would rather use NBTA:
- Need: Does the company have a problem our solution can solve?
- Budget: Does the company value solving that problem highly enough that they can justify the price to their CFO?
- Timing: When is the company looking to solve that problem? A core element of this is understanding what stage in the procurement process the company is at -- education, evaluation, or decision.
- Authority: Do the people we are interacting with have authority over the budget or are they influencers (or Ninas as Mark Suster calls them)?
Speaking to someone with authority but no budget is a classic time-sink. Speaking to the wrong person at the right company, on the other hand, just means you need to work your way up the authority ladder within the org before marking the deal as qualified.
This is why we highly recommend building a firmographic model to predict B & N, then layering on the A component. T is usually predicted through intent data.
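A minimal sketch of that layering, with hypothetical helper functions: a firmographic check stands in for the B & N model at the account level, a seniority check stands in for A at the person level, and the two are surfaced side by side rather than blended into one number.

```python
def account_fit(company_size: int, industry: str) -> str:
    """Hypothetical firmographic model standing in for Budget & Need."""
    return "good" if company_size >= 100 and industry == "software" else "poor"

def authority(job_title: str) -> str:
    """Hypothetical seniority check standing in for Authority."""
    title = job_title.lower()
    if any(k in title for k in ("vp", "chief", "head of", "director")):
        return "decision_maker"
    return "influencer"

# Layered, not blended: a great account with the wrong contact stays visible
# as "work your way up the ladder" rather than as a mediocre single score.
lead = {"company_size": 500, "industry": "software", "job_title": "Marketing Analyst"}
print(account_fit(lead["company_size"], lead["industry"]), authority(lead["job_title"]))
# -> good influencer
```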
8. Standard Model Performance Evaluation Isn't Relevant
I've seen many companies task an in-house data scientist to build a lead scoring model. The core challenge with this is that the typical model performance metrics (f1-score, R^2, AUC, etc.) aren't meaningful in B2B lead scoring. False negatives and false positives don't have the same impact and that impact differs from one company to another. For instance, a false negative (lead scored as poor quality but is, in fact, high quality) is going to hurt your top line because you might not have a rep talk to them. But a false positive (lead scored as high quality but is, in fact, poor quality) can destroy trust in your model.
I've heard countless stories of models weighing company size heavily and therefore scoring universities as A-grade leads. Once your sales reps have seen this, it will be an uphill battle to convince them to trust any lead score regardless of your f1-scores and AUC numbers.
At MadKudu we've created a custom error function to measure the performance of the models we create for our customers. Put simply, it has 4 components that we recommend implementing (a minimal sketch follows the list):
- Recall: The percentage of your pipeline captured by your top scores (aim for 70% or more).
- Precision: The ratio of conversion rates between your top and bottom scores (aim for 10x or more).
- Rejection Rate: The rejection rate of your top scores (aim for less than 5%).
- Deal Value: How the average deal size correlates to scores (aim for 2x between the top and bottom scores).
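Here is a minimal pandas sketch of those four checks run on a holdout set of scored leads. The file name, column names, and grade labels are hypothetical; adapt them to however your scores and CRM outcomes are stored.

```python
import pandas as pd

# Hypothetical holdout: one row per lead, with its grade ("A" through "D"),
# a 0/1 "converted" flag, a 0/1 "rejected_by_sales" flag, and a deal amount.
leads = pd.read_csv("scored_leads.csv")
top = leads[leads["grade"] == "A"]
bottom = leads[leads["grade"] == "D"]

# 1. Recall: share of all conversions captured by the top grade (target >= 70%).
recall = top["converted"].sum() / leads["converted"].sum()

# 2. Precision: conversion-rate ratio between top and bottom grades (target >= 10x).
precision_ratio = top["converted"].mean() / bottom["converted"].mean()

# 3. Rejection rate of top-grade leads by sales (target < 5%).
rejection_rate = top["rejected_by_sales"].mean()

# 4. Deal value: average won deal size of top vs bottom grades (target >= 2x).
deal_value_ratio = (top.loc[top["converted"] == 1, "deal_amount"].mean()
                    / bottom.loc[bottom["converted"] == 1, "deal_amount"].mean())

print(recall, precision_ratio, rejection_rate, deal_value_ratio)
```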
In any case, remember that the final users of lead scores are sales reps, and therefore spot-checking is a critical part of validating that the model is ready to go live.
9. Beware of Class Imbalance
Class imbalance is the machine learning problem where one class of data (positive) has far fewer examples than another class (negative). The challenge is that most ML models work better when the classes are roughly balanced, which is not the case with lead conversion.
To make this more obvious, consider the exaggerated scenario of a 1% conversion rate from lead to SQO, where you are looking to build a model to predict whether a lead is likely to convert. The typical approach here is to use some logistic regression technique since the outcome is binary. However, the model would be right 99% of the time if it always predicted 0 (negative outcome).
To solve this issue, you should do both sampling and boosting (a sampling sketch follows this list):
- Sampling: Reduce the number of negative-class data points by randomly down-sampling them.
- Boosting: Increase the number of positive-class data points. This can be done in a couple of different ways:
- Add conversions that could have been generated outside of the cohort used in creating your training data set.
- Add target account leads that you've curated. The danger here is that you might be artificially training your model to recognize your preconceived definition of your ICP.
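A minimal sketch of the sampling side, with hypothetical file and column names: randomly down-sample the negatives so the converted leads are no longer drowned out before you fit the model. The 3:1 ratio is an arbitrary illustrative choice, not a recommendation.

```python
import pandas as pd

# Hypothetical training data with a 0/1 "converted" label.
leads = pd.read_csv("training_leads.csv")
positives = leads[leads["converted"] == 1]
negatives = leads[leads["converted"] == 0]

# Down-sample negatives to 3 non-converters per converter. With a ~1%
# conversion rate there are plenty of negatives to sample from.
negatives_sampled = negatives.sample(n=3 * len(positives), random_state=42)

balanced = pd.concat([positives, negatives_sampled]).sample(frac=1, random_state=42)
print(balanced["converted"].value_counts())
```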
10. Watch Out for Simpson's Paradox
We've spoken about Simpson's paradox many times in the past. It is highly relevant in marketing, especially when building a lead scoring model.
Companies will generally only prospect "Good" leads, but the conversion rate of those outbound leads (from lead created to any deal stage) is expected to be 10x lower than that of an average inbound lead. When we look at the distribution of leads overall in a CRM, we tend to see a 50/50 split between Good and Bad leads, mainly because outbound adds a large volume of Good leads that convert poorly. However, when we look at inbound leads only, the ratio is expected to be 20/80.
Looking at the numbers in aggregate, it seems like the average conversion rate of Good leads is lower than that of Bad leads (8.4% vs 8.6%). However, when we look at each channel individually, Good leads consistently convert at over 3x the rate of Bad leads (30% vs 9% for inbound; 3% vs 0.7% for outbound). Therefore, building a model on the overall data rather than at the channel level will cause it to learn the wrong relationship.
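The reversal is easy to reproduce. The lead volumes below are hypothetical, chosen only to roughly match the rates quoted above.

```python
# (leads, conversions) per channel and lead quality -- hypothetical volumes.
data = {
    ("inbound",  "good"): (2_000, 600),   # 30%
    ("inbound",  "bad"):  (8_000, 720),   # 9%
    ("outbound", "good"): (8_000, 240),   # 3%
    ("outbound", "bad"):  (430,   3),     # ~0.7%
}

for quality in ("good", "bad"):
    total = sum(n for (_, q), (n, _) in data.items() if q == quality)
    convs = sum(c for (_, q), (_, c) in data.items() if q == quality)
    print(quality, f"{convs / total:.1%}")
# -> good 8.4%, bad 8.6%: the aggregate ordering flips even though
#    Good leads convert at 3x+ the rate of Bad leads within each channel.
```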
We recommend building the training cohort from a single channel with uniform intent in order to ensure the main conversion rate differentiator will be firmographic quality.