Q&A with a Data Scientist: What is Lookalike Modeling?

Posted on 15 Feb 2015

By Team Intertrust

The following is a post by Sanjiv Jha, Chief Data Scientist at Personagraph. In September of 2015, Sanjiv won Top Chief Architect in India, awarded by iCMG.

Every company on mobile has a deep-rooted desire to reach its ideal customer by leveraging data. Often, it is a brand or media agency that wants to target a specific audience; other times, it is a mobile publisher that seeks to understand more about who is using its app. Both lead to core drivers such as increasing profits, lifting sales and conversions, or growing monthly active users.

The key data questions for a healthy ROI include who the ideal audience is and how to reach and target that audience. Collected data is used to predict a user's interests and demographics, that is, the user's profile. Predictive modeling produces sparse data, which may or may not include all of the demographic data we need for a healthy ROI. Look-alike modeling fills the gaps in the predicted data by finding user profiles that look alike.

Data science has risen in popularity as one way to train a customer model that finds and replicates your best audience, based on predictions of who behaves like it. This specific practice is called “lookalike modeling,” and although it is not new to the offline world, it requires special consideration, especially when new parties onboard from offline into the complex mobile space.

How Lookalike Modeling Works at Personagraph

A look-alike model starts with a user profile that is mapped to an interest vector for that user. A typical interest vector for a user “Harry” might look like this:

Interest1	Weight	Interest2	Weight	Interest3	Weight
Classic films	0.75	Action games	0.6	Boating	0.75

Each weight in the interest vector ranges from 0 to 1. The numbers are an average of the weights.
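As a rough sketch of how such a vector could be held in code, the snippet below stores Harry's three interests as a sparse mapping and expands it into a fixed-order list of weights. The names `to_dense` and `CATEGORIES`, and the extra "Cooking" category, are assumptions made for illustration; they are not part of the Personagraph platform.

```python
# Harry's interest vector from the table above, stored sparsely:
# only interests with a known weight are present.
harry = {"Classic films": 0.75, "Action games": 0.6, "Boating": 0.75}

# Hypothetical category list. "Cooking" is invented here to show
# what happens when a user has no weight for a category.
CATEGORIES = ["Classic films", "Action games", "Boating", "Cooking"]


def to_dense(profile, categories):
    """Return the weights in a fixed category order, using 0.0 for missing interests."""
    return [profile.get(category, 0.0) for category in categories]


harry_vector = to_dense(harry, CATEGORIES)
```

Keeping the sparse form is convenient when there are hundreds of categories and most users only have weights for a few of them.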

Now, imagine this interest vector consists of hundreds of interest categories and their respective weights. These interest vectors are then used to generate demographic attributes using look-alike modeling. To illustrate, we will use this interest vector to generate a look-alike for gender.

First, we do extensive offline analysis to generate a relevant feature list, select a training data set, and train the model. The trained model is then integrated with the Personagraph platform, which has plug-and-play capability to integrate any model on the fly.

From the training data set, we extract the profiles and generate an interest vector for each class. In other words, for each class we take all of the interests and average their corresponding weights, so we have one interest vector per class. For gender there are two classes, male and female, so we have two interest vectors: one for the male persona, say Vm, and one for the female persona, say Vf.
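The averaging step can be sketched as follows. This is a minimal illustration, not the production pipeline: the training rows are invented, `class_vectors` is a hypothetical helper, and it assumes that an interest missing from a user's profile counts as a weight of 0 when averaging.

```python
from collections import defaultdict


def class_vectors(training_rows):
    """Average the interest weights per class label.

    training_rows: iterable of (label, {interest: weight}) pairs.
    Assumption: an interest absent from a profile contributes 0 to the average.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for label, profile in training_rows:
        counts[label] += 1
        for interest, weight in profile.items():
            sums[label][interest] += weight
    return {
        label: {interest: total / counts[label] for interest, total in interests.items()}
        for label, interests in sums.items()
    }


# Invented training data, for illustration only.
training = [
    ("male", {"Action games": 0.8, "Boating": 0.6}),
    ("male", {"Action games": 0.4}),
    ("female", {"Classic films": 0.9, "Boating": 0.2}),
]

vectors = class_vectors(training)
vm = vectors["male"]    # interest vector for the male persona (Vm)
vf = vectors["female"]  # interest vector for the female persona (Vf)
```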

Now take a population for which we have to find the look-alike for gender. For a given user, Harry, we find the distance between Vm, which denotes the male persona, and his interest vector; let’s call it Dm. Similarly, we find the distance between Vf (the interest vector for the female persona) and his interest vector, say Df.

If Dm > Df, then Harry is classified as female, because his interests are closer to those of the female persona. If Df > Dm, then Harry is classified as male.
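The comparison can be sketched in a few lines. The post does not say which distance measure is used, so Euclidean distance is an assumption here, as are the persona weights, the helper names, and the choice to label a tie (Dm equal to Df) as female.

```python
import math


def distance(a, b):
    """Euclidean distance between two sparse interest vectors (an assumed metric)."""
    interests = set(a) | set(b)
    return math.sqrt(sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in interests))


def predict_gender(user, vm, vf):
    """Label the user with the persona whose interest vector is closer."""
    dm = distance(user, vm)  # distance to the male persona
    df = distance(user, vf)  # distance to the female persona
    # Ties fall through to "female"; the source does not define tie handling.
    return "male" if df > dm else "female"


# Invented persona vectors, for illustration only.
vm = {"Action games": 0.6, "Boating": 0.3}
vf = {"Classic films": 0.9, "Boating": 0.2}
harry = {"Classic films": 0.75, "Action games": 0.6, "Boating": 0.75}

label = predict_gender(harry, vm, vf)
```

With hundreds of interest categories, a different measure such as cosine distance may behave better; the rule itself stays the same.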

In practice, when we onboard a new partner, we have to solve the onboarding problem for machine learning to be effective. We do not always have a perfect training data set. To take care of noise or inaccuracies in the data, we introduce a weight variable that biases the model towards one class. Let’s say we introduce a weight of 1.5 for male. Now, the above look-alike model becomes:

If (1.5 Df > Dm) then Harry is a male.
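A minimal sketch of the biased rule, assuming the distances Dm and Df have already been computed. The function name and the sample distances are hypothetical; the 1.5 default comes from the example above.

```python
def predict_gender_weighted(dm, df, male_weight=1.5):
    """Apply the biased rule: male if male_weight * Df > Dm, otherwise female."""
    return "male" if male_weight * df > dm else "female"


# Invented distances: Harry is slightly closer to the female persona (Df < Dm).
dm, df = 1.2, 1.0

unbiased = predict_gender_weighted(dm, df, male_weight=1.0)  # plain Df > Dm rule
biased = predict_gender_weighted(dm, df)                     # male weight of 1.5
```

A weight above 1 moves borderline users towards male, and a weight below 1 would move them towards female, which is how the aggregate split can be tuned against reference data.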

With this weight variable, we can match data quality at the aggregate level with any industry-standard aggregated data. Over time, the model improves through a process of reinforcement learning. We build look-alike models for each demographic attribute we support, and separately for each partner, so that we can tune the model for each attribute without affecting the data of any other partner.