The Quantitative Challenges from Click Stream Data
... repeatedly and sometimes pauses to do other tasks between page views (for example, to run other applications or watch television). Only five of the 12 viewings the user requested could generate a “hit” to the server. This illustrates the advantage of collecting data at a user's machine rather than at a host site: the data include all requests, which eliminates a potential source of bias.
Information about where and how frequently users access web sites is used for various tasks. Marketers use such information to target banner ads. For example, users who often visit business sites may receive targeted banner ads for financial services even while browsing at nonbusiness sites. Web managers may use this information to understand consumer behavior at their site. Additionally, it can be used to compare competing web sites. Members of the financial community use such information to value dot-com companies. Analysts use click stream information to track trends at a particular site or within a general community. Financial analysts find this type of intelligence useful for assessing the values of companies because many traditional accounting and finance measures can be poor predictors of firms’ values.
Another use of click stream data is to profile visitors to a web site. Identifying the characteristics of visitors to a site is an important precept of personalization. One way to find out the characteristics of visitors is to ask them to fill out a survey. However, not everyone is willing to fill one out, which creates what is known in marketing research as a self-selection bias. The information may be inaccurate as well; for example, visitors may give invalid mailing addresses to protect their privacy or misreport their incomes to inflate their egos. Also, completing a survey takes time, and the effort required may severely skew the type of individuals who complete it and, in turn, the results. An alternative way to profile users is with click stream data.
The demographic profiles of sites reported by companies like Media Metrix can be used to determine what type of individuals visit a site. For example, Media Metrix reports that 66 percent of visitors to ivillage.com are female. Knowing nothing about a user except that they visit ivillage.com, one can say the odds are about two to one that the visitor is female. This is quite reasonable because ivillage.com offers content geared toward issues of primary concern to women. Some gaming sites appeal primarily to teenage boys, and sports sites may draw predominantly adult men. On the other hand, such portals as Yahoo! and Excite draw audiences that are fairly representative of the web as a whole. Media Metrix can identify the demographic characteristics of visitors using information provided by its panelists. However, knowing only the web sites a user visits and the profiles of those web sites (that is, the demographic characteristics of a sample of users) is enough to make a good prediction about a visitor’s demographics. For example, suppose we wish to predict whether a user is a woman. In general, about 45 percent of web users are female. Therefore, without knowing what sites a person visited, one would guess that there is a 45 percent probability of the user being female and a 55 percent probability of the user being male. If forced to choose, one would guess the user to be male, but the guess would be unreliable because the odds are almost even. However, if one knows that this individual visited ivillage.com, whose visitors are 66 percent women, the hypothesis that this user is female can be improved. This is a Bayesian hypothesis updating problem, and an analyst could apply Bayes' formula to recompute the probability that the user is female using this new information. The original probability of being female is denoted by p = .45, and the new information we have is p' = .66. The updated, or posterior, probability of our hypothesis is denoted by p'' = .62.
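The updating step can be illustrated with a short Python sketch. It is not the author's exact formula, which is not reproduced in the text. It assumes one common rule for combining two probabilities that bear on the same hypothesis: multiply the values that favor "female", multiply the values that favor "male", and normalize. The function names combine and combine_many are hypothetical, and treating each site as independent evidence is a simplifying assumption. Under these assumptions the ivillage.com example comes out at about .61, close to but not exactly the 62 percent quoted, so the original calculation may differ in its details or rounding.

```python
def combine(prior: float, site_share: float) -> float:
    """Update the probability of a hypothesis (here, "the user is female").

    prior      -- probability before seeing the site visit (e.g. 0.45)
    site_share -- share of the site's visitors for whom the hypothesis
                  holds (e.g. 0.66 for ivillage.com)

    Assumption: the two numbers are treated as independent evidence.
    """
    for name, value in (("prior", prior), ("site_share", site_share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    favors = prior * site_share                    # both point to "female"
    against = (1.0 - prior) * (1.0 - site_share)   # both point to "male"
    if favors + against == 0.0:
        raise ValueError("the two probabilities contradict each other")
    return favors / (favors + against)


def combine_many(prior: float, site_shares) -> float:
    """Apply the same rule once per visited site, in order."""
    probability = prior
    for share in site_shares:
        probability = combine(probability, share)
    return probability


if __name__ == "__main__":
    # Figures from the text: 45 percent prior, 66 percent female audience.
    print(f"{combine(0.45, 0.66):.3f}")  # prints 0.614
```

One property of this rule is worth noting: with an uninformative prior of .50, the result equals the site's own share, so the prior only moves the answer when it departs from even odds.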
In other words, the probability that this user is female has increased from 45 percent to 62 percent. While most of the web sites visited by this individual indicate that the user is most likely female, some of the sites visited (aol.com, eplay.com, halcyon.com, lycos.com, and netradio.net) might point to the individual being male. However, based on information from all 22 sites, the probability that the user is female is 99.97 percent. To assess the accuracy of this technique, I applied it to actual usage information from a representative sample of 19,000 web users with one month of usage and known gender. Users in this sample vary a great deal: some visit only one site, while others visit hundreds. If the model predicts more than an 80 percent probability that the user is male then the user is predicted...