Tracking users' online behavior is valuable for predicting their preferences and serving them relevant content and ads. At the same time, such tracking raises privacy concerns among users. In this project, I quantify the value and cost of tracking various types of browsing data, and propose a simple algorithm to balance these trade-offs.
I use data from about a quarter of a million visitors to an e-commerce website that tracked various behavioral variables of its users, as described in Table 1. These included temporal variables, such as how long a visitor stayed on the website and how long it had been since their last visit, as well as frequency variables and measures of the breadth and depth of browsing.
Variable | Description |
---|---|
Searches | Number of times the visitor searched for a product using the website's search function |
Popular category views | Number of times the visitor viewed the most-viewed product category page |
Popular product views | Number of times the visitor viewed the most-viewed product page |
Products viewed | Number of products the visitor viewed on the website |
Purchases | Number of purchases the visitor made on the website |
Categories viewed | Number of product category pages the visitor viewed on the website |
Category view time | Time the visitor spent viewing product category pages |
Products added | Number of products the visitor added to the shopping cart |
Product view time | Time the visitor spent viewing product pages |
Total time | Total time the visitor spent on the website |
Time last visit | Time since the visitor's last visit to the website |
I first quantify the predictive value of the tracked behavioral data. I use a few non-behavioral variables (type of operating system, device, etc.) for the base prediction. To quantify the incremental predictive value of each tracked behavioral variable, I use several machine learning prediction algorithms, such as logistic regression, random forests, and XGBoost. I use forward as well as backward selection to measure the incremental increase in the AUC (area under the receiver operating characteristic curve) from adding each behavioral variable to the set of base non-behavioral variables. In Figure 1, I show the relative predictive value of each tracked behavioral variable. I find that the total time a user has spent on the website is the most valuable variable for predicting future purchases, while, somewhat surprisingly, past purchases are not highly predictive.
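To make the forward-selection step concrete, here is a minimal sketch in Python. It is not the project's actual pipeline: the `evaluate` callable stands in for "fit a model on these features and return its held-out AUC", and the feature names and toy gains in the usage lines are hypothetical values chosen only to exercise the code. The `auc` helper uses the rank-based definition of AUC (the share of positive/negative pairs in which the positive example gets the higher score).

```python
from typing import Callable, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def forward_selection(
    base: Sequence[str],
    candidates: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> List[Tuple[str, float]]:
    """Greedy forward selection: starting from the base features, repeatedly
    add the candidate that yields the highest AUC and record its gain."""
    selected = list(base)
    remaining = list(candidates)
    current = evaluate(selected)
    gains = []
    while remaining:
        best = max(remaining, key=lambda v: evaluate(selected + [v]))
        new = evaluate(selected + [best])
        gains.append((best, new - current))
        selected.append(best)
        remaining.remove(best)
        current = new
    return gains


# Hypothetical stand-in for "train a model and return its held-out AUC".
# The numbers are made up for illustration, not results from the study.
toy_gain = {"total_time": 0.10, "searches": 0.02}


def toy_evaluate(features: List[str]) -> float:
    return 0.60 + sum(toy_gain.get(f, 0.0) for f in features)


gains = forward_selection(["os", "device"], ["searches", "total_time"], toy_evaluate)
print(gains)  # candidates in the order they were added, with their AUC gains
```

In practice `evaluate` would wrap a cross-validated model fit, and backward selection would run the same loop in reverse, dropping one variable at a time from the full set and recording the AUC lost.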
Next, I estimate the relative privacy cost of tracking each of these variables, as perceived by users. I conduct a binary-choice survey on Amazon Mechanical Turk, asking respondents which of two variables they perceive as more privacy-intrusive when tracked. The results are shown in Figure 2. I find that users perceive the tracking of their searches as the most privacy-intrusive, and the tracking of the time they have spent on the website as not very privacy-intrusive.
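One simple way to turn binary-choice responses into a relative privacy score is to compute, for each variable, the share of comparisons in which it was picked as the more intrusive option. The sketch below assumes this aggregation; the text does not specify how the survey responses were scored, and the response tuples are hypothetical examples rather than survey data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def intrusiveness_shares(
    responses: Iterable[Tuple[str, str, str]],
) -> Dict[str, float]:
    """Each response is (option_a, option_b, chosen), where `chosen` is the
    variable the respondent judged more privacy-intrusive. Returns, per
    variable, the fraction of its comparisons in which it was chosen."""
    appearances = Counter()
    picks = Counter()
    for a, b, chosen in responses:
        if chosen not in (a, b):
            raise ValueError(f"{chosen!r} is not one of the options {a!r}, {b!r}")
        appearances[a] += 1
        appearances[b] += 1
        picks[chosen] += 1
    return {v: picks[v] / appearances[v] for v in appearances}


# Hypothetical responses, for illustration only.
toy_responses = [
    ("searches", "total_time", "searches"),
    ("searches", "purchases", "searches"),
    ("purchases", "total_time", "purchases"),
    ("total_time", "searches", "searches"),
]
shares = intrusiveness_shares(toy_responses)
print(shares)
```

A share near 1 means the variable was almost always judged the more intrusive of the pair. A pairwise-comparison model such as Bradley-Terry would be a more principled alternative when some pairs are shown more often than others.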
I now bring these two results together to compare the predictive value and privacy cost of each behavioral variable. Figure 3 shows that a variable such as the time a user spends on the website ought to be tracked, since it has high predictive value and low privacy cost. On the other hand, a variable such as user searches should probably not be tracked, since it provides limited predictive value yet is perceived as imposing a high privacy cost on the user. Such a framework, when used to decide which variables to track and which to avoid, can increase overall surplus.
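The comparison above can be sketched as a simple decision rule. The version below is an assumption rather than the algorithm used in the project: it rescales predictive value and privacy cost to the range 0 to 1, subtracts a weighted cost from the value, and keeps the variables with a positive net score. The `weight` parameter and all input numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max(values: Dict[str, float]) -> Dict[str, float]:
    """Rescale values to [0, 1]; if all values are equal, return zeros."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 0.0 for k in values}
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}


def select_variables(
    value: Dict[str, float],
    cost: Dict[str, float],
    weight: float = 1.0,
) -> Tuple[List[str], Dict[str, float]]:
    """Return the variables whose rescaled predictive value exceeds their
    weighted, rescaled privacy cost, sorted by net score (highest first),
    along with the net score of every variable."""
    if set(value) != set(cost):
        raise ValueError("value and cost must cover the same variables")
    v, c = min_max(value), min_max(cost)
    net = {k: v[k] - weight * c[k] for k in value}
    tracked = sorted((k for k in net if net[k] > 0), key=lambda k: -net[k])
    return tracked, net


# Hypothetical inputs: AUC gains and intrusiveness shares, not study results.
toy_value = {"total_time": 0.05, "searches": 0.01, "purchases": 0.02}
toy_cost = {"total_time": 0.2, "searches": 0.8, "purchases": 0.5}
tracked, net = select_variables(toy_value, toy_cost)
print(tracked)
```

Raising `weight` makes the rule more privacy-protective, so fewer variables pass the threshold. Lowering it favors predictive value.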