Customer Segmentation Model for News Publishers

A powerful ML model that analyzes reader behavior and segments users into distinct personas without compromising their privacy.

Leverages NLP and clustering algorithms
Scalable to suit large publishers
Enhanced revenue & subscriptions

Story behind

The client is a Danish startup in the AdTech and FinTech sectors, aiming to revolutionize how news publishers understand and engage with their audience. Their platform receives 10,000 to 20,000 new anonymized user profiles daily, using fingerprinting techniques to track user behavior and interactions with paywalls without using any personal information.

With around 1.3 million profiles and various performance metrics, the client sought to leverage machine learning (ML) to predict the type of content that would drive subscriptions and enhance the efficiency of ads, aspiring to emulate the success of The New York Times' ML-driven strategies.

Goal

To build an ML model that calculates user engagement scores and clusters newspaper readers, assigning each reader to the closest explainable persona using the limited data available.

This would enable publishers to understand their audience better and increase revenue by allowing advertisers to target key metrics more effectively.

Challenges

Accuracy

Ensuring the model's accuracy was critical. An inaccurate model would produce incorrect engagement scores and misclassified personas. The main risk factors were a general lack of data, poorly designed training algorithms, and suboptimal clustering algorithms.

Scalability

The model needed to scale effectively for both small and large publishers and across different user groups, given that newspaper readers on the web, mobile, and Smart TV behave differently.

Speed

The model needed to train quickly. Fast inference (generating predictions with very low latency) was even more important.

Privacy concerns

Handling user data securely and in compliance with privacy regulations was essential to maintain user trust and meet legal requirements.

Tensorway’s solution

Data collection

Gathered data from different publishers on various user interactions, including activity (time on page, page depth, scroll depth), location (IP address, country, area, time zone), and more. Pre-processed the data to clean and normalize it, removing outliers and handling missing values. Extended the preprocessing pipeline to handle data from other sources, such as Smart TVs and mobile apps.
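The cleaning steps described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the function name `preprocess`, the sample values, and the 1.5 x IQR fence are assumptions made for illustration, and outliers are clipped rather than removed so that each value stays aligned with its user profile.

```python
from statistics import median, quantiles

def preprocess(values):
    """Impute missing values, clip outliers, and scale one feature to [0, 1]."""
    present = [v for v in values if v is not None]
    # Handle missing values: fill gaps with the median of the observed values.
    fill = median(present)
    filled = [fill if v is None else v for v in values]
    # Handle outliers: clip to the usual 1.5 * IQR fences (an assumed rule).
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    clipped = [min(max(v, low), high) for v in filled]
    # Normalize: min-max scaling so features share a common range.
    cmin, cmax = min(clipped), max(clipped)
    span = cmax - cmin
    return [0.0 if span == 0 else (v - cmin) / span for v in clipped]

# Hypothetical "time on page" values in seconds; None marks a missing reading
# and 3600 stands in for a tab left open for an hour.
time_on_page = [30, 45, None, 60, 50, 40, 3600]
scaled = preprocess(time_on_page)
```

Running the same function per feature and per source (web, mobile, Smart TV) is one simple way to keep differently scaled inputs comparable.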

Feature engineering

Developed features associated with engagement scores and applied techniques like feature selection to enhance model performance and optimize the training process.
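As a rough illustration of this step, the sketch below drops features that barely vary across readers and then combines the remaining ones into a single engagement score. The feature names come from the activity signals listed earlier, but the weights, the variance threshold, the constant `has_cookie` feature, and both function names are hypothetical and not taken from the project.

```python
from statistics import pvariance

# Hypothetical weights; the real scoring formula is not described in the source.
WEIGHTS = {"time_on_page": 0.5, "page_depth": 0.3, "scroll_depth": 0.2}

def select_features(rows, min_variance=1e-4):
    """Keep only features whose variance across readers exceeds a threshold."""
    return [name for name in rows[0]
            if pvariance([row[name] for row in rows]) > min_variance]

def engagement_score(features, weights):
    """Weighted average of normalized features; stays in [0, 1] for inputs in [0, 1]."""
    total = sum(weights.values())
    return sum(w * features[name] for name, w in weights.items()) / total

# Made-up, already normalized reader features.
readers = [
    {"time_on_page": 0.8, "page_depth": 0.5, "scroll_depth": 1.0, "has_cookie": 1.0},
    {"time_on_page": 0.2, "page_depth": 0.1, "scroll_depth": 0.3, "has_cookie": 1.0},
    {"time_on_page": 0.5, "page_depth": 0.9, "scroll_depth": 0.6, "has_cookie": 1.0},
]
kept = select_features(readers)          # the constant feature is dropped
scores = [engagement_score(r, WEIGHTS) for r in readers]
```

A variance threshold is only one of many feature selection techniques; it is used here because it needs no labels, which fits an unsupervised clustering setup.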

Text embeddings

Used text embeddings to capture the meaning and context of the content on the page, allowing for accurate content matching across different languages. Created text embeddings using a neural network trained on a large corpus of text to encode each word into a high-dimensional vector. Trained a specialized neural network (embedder) to create better representations of different articles and content, allowing the final model to better differentiate between similar and different documents.
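To make the idea of content matching with embeddings concrete, here is a toy sketch. It swaps the trained neural embedder for a tiny hand-made lookup table: the three-dimensional vectors in `TOY_VECTORS` are invented for illustration, and a document is embedded simply by averaging its word vectors. Cosine similarity then shows how an English and a Danish headline on the same topic can land close together.

```python
import math

# Invented word vectors; a real system would obtain these from a trained model.
TOY_VECTORS = {
    "football": [0.90, 0.10, 0.00],
    "fodbold":  [0.85, 0.15, 0.00],   # Danish for "football"
    "match":    [0.80, 0.20, 0.10],
    "kamp":     [0.75, 0.20, 0.10],   # Danish for "match"
    "budget":   [0.10, 0.90, 0.20],
    "tax":      [0.00, 0.95, 0.10],
}

def embed(text):
    """Average the vectors of known words; returns None if no word is known."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    return [sum(axis) / len(vectors) for axis in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

english = embed("football match")
danish = embed("fodbold kamp")
finance = embed("budget tax")

same_topic = cosine(english, danish)
different_topic = cosine(english, finance)
```

In this toy setup `same_topic` is larger than `different_topic`, which is the property a specialized embedder is trained to strengthen.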

Clustering model development

Built a separate clustering model using engagement-related features to ensure users were segmented based on engagement rather than behavior patterns. Tested various clustering algorithms, such as K-means, DBSCAN, and hierarchical clustering, to determine the best fit for the dataset.
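The following sketch shows one of the tested algorithm families, K-means, applied to two engagement features and then used to assign a new reader to the closest persona. It is a plain Lloyd's-algorithm implementation with a deterministic initialization chosen for reproducibility; the data points and the persona labels in `PERSONAS` are hypothetical.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with evenly spaced input points as initial centroids."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Move each centroid to the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids

# Made-up (engagement score, normalized visit frequency) pairs.
readers = [(0.10, 0.20), (0.15, 0.10), (0.20, 0.15),
           (0.80, 0.90), (0.90, 0.85), (0.85, 0.80)]

# Sort centroids from least to most engaged so labels map predictably.
centroids = sorted(kmeans(readers, k=2), key=sum)
PERSONAS = ["casual visitor", "loyal reader"]  # hypothetical persona names

def persona_for(point):
    """Assign a reader to the persona of the nearest centroid."""
    return PERSONAS[closest(point, centroids)]
```

For example, `persona_for((0.95, 0.90))` resolves to the second persona. Because assignment is a nearest-centroid lookup, it stays cheap at inference time, and each persona can be explained by inspecting its centroid's feature values.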

Behavioral analytics

Implemented clustering algorithms to categorize users based on their website activity. Conducted extensive exploratory data analysis (EDA) to extract actionable insights for publishers, helping them understand audience expansion and content prioritization.

Model training and validation

Created a pipeline that allowed us to find the best data processing and clustering hyperparameters given custom objectives. Performed thorough testing and validation to ensure model accuracy and reliability. This involved using a test dataset to evaluate the model's performance and adjust parameters as needed.

Scalability considerations

Developed models that work for both small and large publishers, along with a plan to scale the solution to bigger sites and publishers, ensuring the models can be trained and applied on varying volumes of data.

Model deployment

Deployed the trained models as a service in production, allowing for real-time application and continuous improvement. Provided pre-processing and training scripts with detailed comments, documentation, and instructions to reproduce the work.

As a result...

Tensorway’s customer segmentation solution demonstrated impressive scalability, applicable to both small and large publishers, with strategic recommendations for future enhancements.

The clustering model and behavioral analytics provided publishers with a deep understanding of their audience, enabling targeted content delivery and personalized user experiences. This led to optimized revenue through precise advertiser targeting, higher conversion rates, and increased subscriber numbers.

The deployment of our models as a real-time service facilitated immediate application and continuous improvement, ensuring the solution remains effective and adaptive to evolving user behaviors and market trends.

Project team, steps, and timeline

Team

3 full-time engineers

Timeline

3–4 months

Other possible applications

Content personalization

Using similar models to personalize content for individual users, increasing engagement and retention.

Ad targeting optimization

Enhancing ad targeting by understanding user preferences and behavior more accurately.

Subscription prediction

Predicting which users are more likely to subscribe based on their engagement patterns.

Churn prediction

Identifying users at risk of leaving, allowing for targeted retention strategies.

Cross-site analysis

Applying the model across different sites to gather broader insights and refine strategies.
