Personalized recommendations shape every internet user’s online experience. Users generate a wealth of data as they browse and share information with websites and online portals, such as ecommerce platforms. Ecommerce companies can use this data to train text-based recommendation systems that benefit both buyers and sellers.
Implementing a text-based recommendation system adds value to your company by improving results in several key business areas, like sales, customer experience, and employee satisfaction.
While most technology officers agree that text-based systems are effective, it’s always helpful to know how they can be fine-tuned even further. In this article, we will look at how to measure the performance of text-based recommendation systems.
What is a text-based recommendation system meant to do?
The evidence for the effectiveness of recommendation systems can be seen everywhere online. From Netflix and YouTube’s suggested content to TikTok, Instagram, and X’s “For You” section, recommendation engines drive a significant amount of online traffic. In ecommerce, recommendation systems take on even greater importance, as they are key to driving repeated sales and increasing average order value (AOV). A report by McKinsey found that 35% of all sales on Amazon came through personalized product recommendations.
Whether you refer to it as a recommendation system or a recommendation engine, the software serves the same purpose: directing users to items that will appeal to them based on the software’s analysis. Industry reports from Grand View Research predict the global recommendation engine market will experience a compound annual growth rate (CAGR) of 33% until 2028, reaching a total value of $17.3 billion.
From ecommerce giants like Amazon to small and mid-sized retailers, companies across the industry are turning to recommendation engines, and this promising technology is set for a period of healthy growth.
Evaluating text-based recommendation systems
As the options for recommendation engines multiply, companies need ways to determine how well the software performs against their business goals. This is the most crucial aspect of evaluating recommendation systems.
Before choosing a system, you should have a clear use case for how it will benefit your business. Only then will you have the appropriate context to measure its performance.
There are various types of recommendation systems, and each of them uses a different approach. Briefly, these are:
Content-based filtering: Users receive recommendations for items whose attributes resemble those of items they have already viewed, liked, or purchased.
Collaborative filtering: Recommendations are pushed based on shared preferences with other users, online behavior, and purchase history.
Hybrid filtering: A recommendation engine that combines content-based and collaborative filtering to deliver results (a minimal sketch of the first two approaches follows this list).
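To make the distinction concrete, here is a minimal, illustrative sketch of the first two approaches. The product descriptions and interaction matrix are hypothetical placeholders, and the code uses cosine similarity from scikit-learn as one common way to compare items or users:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# --- Content-based filtering: compare item descriptions directly ---
descriptions = [
    "wireless noise-cancelling over-ear headphones",
    "bluetooth over-ear headphones with noise cancelling",
    "stainless steel kitchen knife set",
]
tfidf = TfidfVectorizer().fit_transform(descriptions)
item_sim = cosine_similarity(tfidf)          # item-to-item similarity
print(item_sim[0])  # item 0 is far more similar to item 1 than to item 2

# --- Collaborative filtering: compare users by their interactions ---
# Rows = users, columns = items; 1 means the user bought/clicked the item.
interactions = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
])
user_sim = cosine_similarity(interactions)   # user-to-user similarity
# Recommend to user 1 items liked by their nearest neighbor (user 0).

A hybrid engine would blend both similarity signals, for instance by averaging the two scores, before producing its final ranked list.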
Your choice of recommendation engine will play a role in determining which evaluation method will give you the most accurate insights. Broadly, there are two main ways to evaluate the performance of a text-based recommendation engine. They are metric-based and user-centric methods.
With metric-based methods, you evaluate the recommender system by tracking key performance indicators (KPIs) that show how well the system can predict user preference for items in its database. These metrics can be used to get quantitative measurements of its performance.
In contrast, user-centric methods rely on data from users to see how they rate their experience. This can be either explicit data (shared by the users themselves) or implicit data (automatically collected from user activities).
Key areas of interest for metric-based methods of evaluating recommendation engines
When using metric-based methodologies, you must first understand what each metric means. Every metric can be categorized according to its function. The three main functions of commonly tracked evaluation metrics are measuring recommendation accuracy, assessing ranking quality, and quantifying user interactions.
Let’s take a closer look at these functions and their associated metrics.
Measure recommendation accuracy
Accuracy is a key measure of a recommendation system’s effectiveness. It can be calculated using the following formulas:
Precision: Determines the percentage of recommended items relevant to the user. Precision (P) can be calculated by dividing the total number of selected and relevant items (Nrs) by the number of selected items (Ns). The formula is P = Nrs / Ns.
Recall: Measures the percentage of all relevant items that make it into the recommendation list. Another term for recall is sensitivity, and higher recall means more of the relevant items are included in the recommendation list. Recall (R) is calculated by dividing the number of selected and relevant items (Nrs) by the total number of relevant items (Nr). The formula is R = Nrs / Nr.
F-measure: Also called the F1-score, this is the harmonic mean of precision and recall, balancing the two so a system cannot score well by excelling at one while neglecting the other. The formula is F-measure = 2PR / (P + R). A sketch of all three metrics follows this list.
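Here is a minimal sketch of how these three accuracy metrics can be computed for a single user's recommendation list. The item IDs are hypothetical placeholders:

def precision_recall_f1(recommended, relevant):
    """Compute P, R, and F1 for one user's recommendation list."""
    recommended, relevant = set(recommended), set(relevant)
    n_rs = len(recommended & relevant)                    # selected AND relevant (Nrs)
    p = n_rs / len(recommended) if recommended else 0.0   # P = Nrs / Ns
    r = n_rs / len(relevant) if relevant else 0.0         # R = Nrs / Nr
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0          # F1 = 2PR / (P + R)
    return p, r, f1

# Hypothetical example: 5 items recommended, 3 of the 4 relevant ones among them.
print(precision_recall_f1(["a", "b", "c", "d", "e"], ["a", "c", "e", "f"]))
# -> (0.6, 0.75, 0.666...)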
Assess ranking quality
Not all recommendations are created equal. Certain recommendations will be more relevant than others, depending on the user profile and context. The following metrics assess how well an engine ranks its results, rewarding lists that place the most relevant items first.
Mean average precision (MAP): MAP rewards systems that rank relevant items near the top of the list. For each user, average precision (AP) is computed by averaging the precision at each rank position where a relevant item appears; MAP is then the mean of AP across all users. The formula is MAP = Sum of each user's AP / Number of users.
Normalized discounted cumulative gain (NDCG): This metric builds on discounted cumulative gain (DCG), which assigns greater weight to relevant items and discounts items that appear lower in the list, measuring the value of placing an item at a certain position. NDCG divides a list's DCG by the DCG of the ideal ordering, so scores are comparable across lists. A sketch of both metrics follows this list.
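Here is a minimal sketch of both ranking metrics. The ranked lists and relevance gains are hypothetical; MAP is simply the mean of average_precision taken over all users:

import math

def average_precision(ranked, relevant):
    """Precision averaged over the rank positions of relevant items."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / k            # precision@k at each hit
    return score / len(relevant) if relevant else 0.0

def ndcg(ranked, gains):
    """NDCG: DCG of the list divided by the DCG of the ideal ordering."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    actual = dcg([gains.get(item, 0) for item in ranked])
    ideal = dcg(sorted(gains.values(), reverse=True)[:len(ranked)])
    return actual / ideal if ideal else 0.0

print(average_precision(["a", "b", "c"], ["a", "c"]))   # 0.833...
print(ndcg(["a", "b", "c"], {"a": 3, "b": 0, "c": 1}))  # ~0.96: near-ideal order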
Measure user interaction
The most explicit endorsement of a recommendation engine’s effectiveness is how much time users spend interacting with it.
There are many metrics to measure user interaction, and we will take a look at the two most insightful ones.
Click-through rate (CTR): This applies to individual recommendations in a list. CTR measures the percentage of users who click on a recommendation after it appears. The formula is CTR = (Number of clicks / Number of times the recommendation appears) x 100.
Dwell time (DT): This metric indicates what interests users and holds their attention. DT refers to how long a user spends looking at an item or webpage after clicking on a recommendation. A sketch of both metrics follows this list.
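Here is a minimal sketch of both interaction metrics. The click counts and timestamps are hypothetical figures of the kind you would pull from an analytics store:

from datetime import datetime

def click_through_rate(clicks, impressions):
    """CTR = (clicks / impressions) x 100, as a percentage."""
    return clicks / impressions * 100 if impressions else 0.0

def dwell_time_seconds(entered, left):
    """Seconds spent on a page after clicking a recommendation."""
    return (left - entered).total_seconds()

print(click_through_rate(clicks=42, impressions=1_000))      # 4.2%
print(dwell_time_seconds(datetime(2024, 1, 1, 12, 0, 0),
                         datetime(2024, 1, 1, 12, 3, 30)))   # 210.0 seconds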
Key areas of interest for user-centric methods of evaluating recommendation engines
The above metrics provide reliable quantitative KPIs for a recommendation engine. However, for a qualitative study, in which unstructured data and abstract opinions are captured to help refine your strategy, you will need user-centric methodologies.
Text analysis plays a significant role in user-centric methods since most of the data collected is in the form of unstructured text documents. Artificial intelligence (AI) programs with natural language processing (NLP) capabilities and machine learning (ML) models are used to scan text data and find helpful information. These are the key areas where user-centric methods help in evaluating the effectiveness of recommendation engines.
User feedback and sentiment analysis
Customers are happy to share their opinions, especially if it leads to better service in the future. Text-based analysis helps identify useful information in customer emails, social media posts, and other communications.
Qualitative analysis: Customer feedback is collected to see whether users find the recommendations useful. Collecting the data can be as simple and convenient as Netflix asking you to rate a movie after watching it. Customer surveys can deliver more detailed feedback.
Sentiment analysis: This branch of NLP can be used to gauge user sentiment toward certain items. Knowing how users feel about certain products lets recommendation engines effectively predict their reactions to related ones. A minimal sketch follows below.
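As a simple illustration, here is a sentiment-scoring sketch using NLTK's VADER analyzer, one common off-the-shelf option; any NLP sentiment model would serve the same purpose, and the review texts are hypothetical:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

reviews = [
    "Loved these headphones, great recommendation!",
    "Arrived broken and support never replied.",
]
for text in reviews:
    # compound score ranges from -1 (most negative) to +1 (most positive)
    print(analyzer.polarity_scores(text)["compound"], text)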
User interaction and engagement
An effective recommendation system should be able to hold your customers’ attention. The more focused they are on these recommendations, the more likely they will make purchases.
If you want to measure the effectiveness of your recommendation engine in terms of user interaction and engagement, you should focus on a couple of areas.
Repeat purchase rate (RPR): This indicates how many customers return to buy from your store again after their first purchase. A high RPR suggests your recommendation engine is performing well, since it means those customers keep discovering new products in your inventory. The formula is RPR = (Return Customers / All Customers) x 100.
Conversion rate (CVR): Measuring CVR tells you how many customers completed a purchase after clicking on a relevant recommendation. Viewed together, CVR and CTR give a fair representation of the customer interactions powered by your recommendation system. A sketch of both metrics follows this list.
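Here is a minimal sketch of both engagement metrics. The monthly figures are hypothetical numbers of the kind you would read off an ecommerce dashboard:

def repeat_purchase_rate(return_customers, all_customers):
    """RPR = (return customers / all customers) x 100."""
    return return_customers / all_customers * 100 if all_customers else 0.0

def conversion_rate(purchases, recommendation_clicks):
    """CVR: share of recommendation clicks that end in a purchase."""
    return purchases / recommendation_clicks * 100 if recommendation_clicks else 0.0

print(repeat_purchase_rate(180, 1_200))   # 15.0%
print(conversion_rate(36, 420))           # ~8.6%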
User satisfaction surveys and A/B testing
Users are essential stakeholders in your recommendation system, so you should seek out their feedback directly. Surveys are valuable for gathering intelligence on the quality and relevance of recommendations that users are getting.
Another important way to judge how well recommendations perform is to conduct A/B testing. Track the interactions and CVR of a set of users who received recommendations and compare them to a control set that received none over the same period. The difference in engagement rates will show how well your recommendation engine performs, as sketched below.
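Here is a minimal sketch of comparing conversion rates between a treatment group (saw recommendations) and a control group (did not) with a standard two-proportion z-test; the group sizes and conversion counts are hypothetical:

from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Test whether two conversion rates differ more than chance allows."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided test
    return p_a, p_b, z, p_value

# Treatment: 480 of 5,000 users converted; control: 390 of 5,000.
print(two_proportion_ztest(480, 5_000, 390, 5_000))
# z ~ 3.2, p ~ 0.001: the lift from recommendations is unlikely to be noise.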
Real-world applications of recommendation engine evaluation
The success of Amazon’s recommendation engine in driving ecommerce sales is one of the best examples of these systems in action. However, Amazon is far from the only company that has used text-based recommendation systems to great effect. In several fields, from retail to entertainment, brands are constantly upgrading their recommendation engines to achieve better results.
Netflix’s recommendation algorithm is another prominent example. The streaming platform is constantly refining the way it delivers recommendations to its users. As far back as 2006, Netflix signaled its commitment to evaluating and improving its recommendation system by announcing “The Netflix Prize,” a $1 million reward for an algorithm that could more accurately predict the ratings users would give each movie. Today, 75% of the content viewed on Netflix is discovered by users through the recommendation engine.
Another brand that has committed to constantly evaluating and iterating on its recommendation system is Instagram. The social media platform has also embraced ecommerce, enabling users to shop directly through the app. The user-generated data available on the app allows its recommendation engines to update lists for each user in real time, directing them to relevant pages and brands according to their interests and activity.
Limitations of text-based recommendation system evaluation methodologies
If you want to thoroughly evaluate your chosen recommendation engine, you need a comprehensive approach.
Metric-based methods will give you quantitative data but lack insights about what drives customer behavior. Meanwhile, user-centric methods can capture customers’ motivations accurately but may not paint the whole picture.
Evaluating text-based recommendation systems: a combined approach
The best way to evaluate your text-based recommendation engine is to adopt a combined approach using metric- and user-centric methodologies. This will let you analyze every aspect of your recommendation engine and see if it is fulfilling its purpose.