Introduction

Tracking users' online behavior is valuable for predicting their preferences and serving them relevant content and ads. At the same time, such tracking raises privacy concerns among users. In this project, I quantify the predictive value and the perceived privacy cost of tracking various types of browsing data, and propose a simple algorithm to balance the two.

Data

I use data from about a quarter of a million visitors to an e-commerce website that tracked various behavioral variables for its users, as described in Table 1. These include temporal variables, such as how long a visitor stayed on the website and how long it had been since their last visit, as well as frequency variables and measures of the breadth and depth of browsing.

Table 1: Variable description

Variable                  Description
Searches                  How many times searched for a product on the website using the search function
Popular category views    How many times viewed most-viewed product category page
Popular product views     How many times viewed most-viewed product page
Products viewed           How many products viewed on the website
Purchases                 How many purchases made on the website
Categories viewed         How many product category pages viewed on the website
Category view time        How much time spent viewing product category pages
Products added            How many products added to your shopping cart
Product view time         How much time spent viewing product pages
Total time                The total time spent on the website
Time last visit           Time since last visit to the website

Quantifying the predictive value of behavioral data

I first quantify the predictive value of the tracked behavioral data. I use a few non-behavioral variables (type of operating system, device, etc.) for the base prediction. To quantify the incremental predictive value of each tracked behavioral variable, I use several machine learning algorithms, including logistic regression, random forests, and XGBoost. I use forward as well as backward selection to measure the incremental increase in AUC (area under the receiver operating characteristic curve) from adding each behavioral variable to the set of base non-behavioral variables. Figure 1 shows the relative predictive value of each tracked behavioral variable. I find that the total time a user has spent on the website is the most valuable variable for predicting future purchases, while, somewhat surprisingly, past purchases are not highly predictive.

Figure 1: Predictive value of each variable, in descending order
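
The forward-selection step behind Figure 1 can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used to produce the results. The names `auc`, `forward_selection`, and `evaluate` are hypothetical; `evaluate` is assumed to be a user-supplied function that fits a model (for example, logistic regression) on the given variables and returns its held-out AUC.

```python
from typing import Callable, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative (Mann-Whitney form); ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def forward_selection(
    base: Sequence[str],
    candidates: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> List[Tuple[str, float]]:
    """Greedily add the candidate with the largest AUC gain at each step.

    `evaluate` is an assumed callback: it takes a list of variable names
    and returns the AUC of a model trained on those variables.
    Returns (variable, incremental AUC) pairs in the order they were added.
    """
    selected = list(base)
    remaining = list(candidates)
    current = evaluate(selected)
    steps: List[Tuple[str, float]] = []
    while remaining:
        scores = {v: evaluate(selected + [v]) for v in remaining}
        best = max(scores, key=scores.get)
        steps.append((best, scores[best] - current))
        current = scores[best]
        selected.append(best)
        remaining.remove(best)
    return steps
```

Backward selection would mirror this loop by starting from all variables and, at each step, removing the one whose removal lowers AUC the least.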


Next, I estimate the relative privacy cost of tracking each of these variables, as perceived by users. I conduct a binary-choice survey on Amazon Mechanical Turk, asking respondents which of two variables they perceive as more privacy-intrusive when tracked. The results are shown in Figure 2. I find that users perceive the tracking of their searches as the most privacy-intrusive, and do not perceive the tracking of the time they have spent on the website as very privacy-intrusive.

Figure 2: Privacy cost of each variable, in descending order
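
One simple way to turn binary-choice responses into a relative privacy-cost score is the share of comparisons in which a variable was picked as the more intrusive one. The sketch below assumes this win-share aggregation. The text does not state how the survey responses were actually aggregated, and `win_shares` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def win_shares(responses: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute, for each variable, the fraction of its comparisons in which
    it was chosen as more privacy-intrusive.

    Each response is a (chosen, other) pair: `chosen` is the variable the
    respondent picked as more intrusive, `other` is the alternative shown.
    """
    wins: Counter = Counter()
    appearances: Counter = Counter()
    for chosen, other in responses:
        wins[chosen] += 1
        appearances[chosen] += 1
        appearances[other] += 1
    return {v: wins[v] / appearances[v] for v in appearances}
```

Sorting the resulting dictionary by value in descending order gives a ranking of the kind plotted in Figure 2. A model-based alternative such as Bradley-Terry could be substituted if variables appear in unequal numbers of comparisons.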


We can now bring these two results together to compare the predictive value and privacy cost of each behavioral variable. Figure 3 shows that a variable such as the time a user spends on the website ought to be tracked, since it has high predictive value and low privacy cost. On the other hand, a variable such as user searches should probably not be tracked, since it provides limited predictive value yet is perceived to carry a high privacy cost. Using such a framework to decide which variables to track and which to avoid can increase overall surplus.

Figure 3: Predictive value vs. privacy cost
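
The comparison in Figure 3 suggests a simple decision rule, sketched below under stated assumptions. Both measures are rescaled to the range 0 to 1, and a variable is tracked when its scaled predictive value exceeds a weighted scaled privacy cost. The rescaling, the `weight` parameter, and the function names are illustrative choices, not necessarily the algorithm proposed in this project.

```python
from typing import Dict, List


def min_max(values: Dict[str, float]) -> Dict[str, float]:
    """Rescale values to [0, 1]; all zeros if every value is identical."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 0.0 for k in values}
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}


def variables_to_track(
    value: Dict[str, float],
    cost: Dict[str, float],
    weight: float = 1.0,
) -> List[str]:
    """Return variables whose scaled value exceeds weight * scaled cost.

    `value` maps each variable to its predictive value (e.g. AUC gain) and
    `cost` maps it to its perceived privacy cost. `weight` is an assumed
    knob for how heavily privacy cost counts against predictive value.
    """
    if set(value) != set(cost):
        raise ValueError("value and cost must cover the same variables")
    v, c = min_max(value), min_max(cost)
    return sorted(k for k in value if v[k] - weight * c[k] > 0)
```

Lowering `weight` makes the rule more permissive, so more variables are tracked. Raising it gives privacy cost more influence and shrinks the tracked set.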