Clickstream Consulting
       The data warehouse experts



Cookies: The Perfect User Identification Snack

By Mark Sweiger, President and Principal, Clickstream Consulting

In the last article, I presented a generalized meta-schema for an e-business clickstream data warehouse. While it is tempting to forge ahead and apply this model to a specific example, we first need a much better understanding of the data contained in such a schema. Since the focus of a clickstream data warehouse is the analysis of user activity, there must be a mechanism to identify each user. This is the job of one of the most reviled and misunderstood components of a typical e-business architecture, the cookie file.

Most people are at least vaguely aware of the existence of cookie files on their client systems. Web browsers create and modify them whenever they parse a “Set-Cookie” header string in the response to a particular HTTP request (like a GET of a Web page). Although the exact format of cookies varies from browser to browser, they all have at least the following six fields:

  1. Name: The name of the cookie variable, for example, “UserID”. The name is a required field and has no default values.
  2. Value: The string value assigned to the cookie variable. For example, the cookie variable named “UserID” could be set to a value of “334”. If the value is empty then the value of the client cookie is cleared.
  3. Domain: The domain name that created the cookie. It is the only domain permitted to receive or modify the cookie on subsequent accesses; other domains have no access. The domain value must contain at least two dots, as in “.clickstreamconsulting.com”; otherwise one could create a cookie for .com or .net, which is not permitted.
  4. Path: The top level of the subtree within the domain for which the cookie is valid; the cookie is returned on access to any page within that subtree. A path of “/” means the cookie is good for all pages in the website, while a more qualified path, like “/ClickstreamConsulting/articles”, means the cookie applies only to pages in the /articles subtree.
  5. Expires: The expiration date of the cookie. The cookie persists on the client system until this date. If this value is not set, the cookie lasts only for the duration of the browser session, after which it is automatically deleted.
  6. Secure: If TRUE, a secure connection to the domain is needed to pass the cookie. The default value is FALSE.
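
As a rough illustration of how the six fields travel together, the sketch below assembles the value of a Set-Cookie header with Python's standard http.cookies module. The helper name build_set_cookie and the expiration date are assumptions made for this example only, and the two-dot check is a simplified reading of the domain rule described above, not a complete validation.

```python
from http.cookies import SimpleCookie


def build_set_cookie(name, value, domain, path="/", expires=None, secure=False):
    """Return the value of a Set-Cookie header carrying the six fields."""
    # Simplified domain rule: at least two dots, so ".com" is rejected.
    if domain.count(".") < 2:
        raise ValueError("domain must contain at least two dots: %r" % domain)

    cookie = SimpleCookie()
    cookie[name] = value                    # 1. Name and 2. Value
    cookie[name]["domain"] = domain         # 3. Domain
    cookie[name]["path"] = path             # 4. Path
    if expires is not None:
        cookie[name]["expires"] = expires   # 5. Expires (omit for a session cookie)
    if secure:
        cookie[name]["secure"] = True       # 6. Secure (off by default)
    return cookie[name].OutputString()


# Hypothetical values: the UserID example from the list, with a made-up date.
header = build_set_cookie(
    "UserID", "334", ".clickstreamconsulting.com",
    expires="Wed, 31 Dec 2003 23:59:59 GMT",
)
print("Set-Cookie: " + header)
```

Leaving out the expires argument produces a session cookie, which matches the default behavior described for the Expires field.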

The key to quickly determining a user’s identity is the cookie file. If a user accesses your website for the first time, no cookie will be returned for your domain, because the cookie hasn’t been created yet. Assuming your web server is configured to issue cookies, on first access by a user it will note that no cookie was passed, and it will then add a Set-Cookie header to the response that sends back the requested page, causing the cookie to be created on the client system. If the value of the cookie variable in the Set-Cookie header is unique, then all subsequent accesses by that user will be identified by the unique value of the returned cookie variable.
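
The first-visit logic can be sketched in a few lines. This is a minimal model under stated assumptions, not any particular web server's implementation: the function name identify_user is hypothetical, the cookie variable is called UserID as in the examples above, and a random UUID stands in for whatever unique value a real server would generate.

```python
import uuid
from http.cookies import SimpleCookie


def identify_user(cookie_header):
    """Return (user_id, set_cookie_value_or_None) for one request.

    cookie_header is the raw Cookie request header, or None if absent.
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    if "UserID" in cookies:
        # Returning visitor: the browser sent the cookie back, nothing to set.
        return cookies["UserID"].value, None

    # First visit: no cookie was passed, so mint a unique value (assumed
    # here to be a random UUID) and ask the browser to store it.
    user_id = uuid.uuid4().hex
    reply = SimpleCookie()
    reply["UserID"] = user_id
    reply["UserID"]["path"] = "/"
    return user_id, reply["UserID"].OutputString()


# First request carries no cookie; the second returns the one just issued.
first_id, set_cookie = identify_user(None)
again_id, nothing_to_set = identify_user("UserID=" + first_id)
print(first_id == again_id, nothing_to_set)
```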

Knowing that a particular browser instance has a cookie variable “UserID” with a value of “334” is certainly helpful, because it distinguishes the web server activity of that UserID from that of other cookied UserIDs. But this level of knowledge about user identity is not very specific, and we can probably do much better. One way to increase the level of user knowledge associated with a cookie UserID is to allow users to register themselves at your site, and then associate the registration information with the cookie. This common technique may increase your knowledge about the user to include things like a registration ID and an email address. Using syndicated data available from a number of providers, email addresses can often be resolved into more specific information like real name, address, phone number, psychographic and demographic data, etc.
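
One simple way to picture the association is a lookup keyed by the cookie UserID. The sketch below is illustrative only: the record layout, the function names register and enrich, and the sample values are all assumptions, and a real site would keep this mapping in a database rather than an in-memory dictionary.

```python
# Maps a cookie UserID to the registration details captured at sign-up.
registrations = {}


def register(user_id, registration_id, email):
    """Associate registration details with an existing cookie UserID."""
    registrations[user_id] = {
        "registration_id": registration_id,
        "email": email,
    }


def enrich(click):
    """Return a copy of a click record with any known registration data added."""
    enriched = dict(click)
    enriched.update(registrations.get(click["user_id"], {}))
    return enriched


# Hypothetical data: user 334 has registered, user 335 has not.
register("334", "R-1001", "someone@example.com")
known = enrich({"user_id": "334", "url": "/articles"})
unknown = enrich({"user_id": "335", "url": "/articles"})
print(known)
print(unknown)
```

Unregistered visitors keep only their cookie UserID, so their activity is still distinguishable even though nothing more is known about them.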

A clever new way to quickly determine user identity is available from Coremetrics. Coremetrics subscribers insert special JavaScript tags, which call a Coremetrics server, at the start of their web pages. If the user has previously accessed any site that uses Coremetrics (and chances of this are high given its popularity), a cookie for the domain “.data.coremetrics.com” will already exist on the client system. This cookie identifies the user to Coremetrics, and Coremetrics passes this identity back to the subscribing site in its response to the original call. Because the same Coremetrics cookie, and therefore the same user identity, is used by multiple subscribing web sites, it is possible to identify user activities that cross site boundaries, like referring sites, advertising engines, affiliated sites, etc., not to mention the ability to share detailed identity information among Coremetrics and those sites. While few will admit to using Coremetrics, it is one of the more popular user identity services, and definitely worth a look.
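
The cross-site effect of a shared cookie can be shown with a toy model. This is not the Coremetrics API, whose details are not given here; the class SharedIdentityService, its tag_call method, and the site names are hypothetical, and the "browser" is reduced to a single variable holding the service's cookie.

```python
import itertools


class SharedIdentityService:
    """Toy stand-in for an identity service called from subscribers' page tags."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.sightings = {}  # visitor id -> subscriber sites where it was seen

    def tag_call(self, service_cookie, subscriber_site):
        """Handle one tag request; return (visitor_id, cookie_to_set_or_None)."""
        if service_cookie is None:
            # Unknown browser: issue a new identity under the service's domain.
            visitor_id = "v%d" % next(self._next_id)
            cookie_to_set = visitor_id
        else:
            # Known browser: the same cookie comes back whichever site calls.
            visitor_id, cookie_to_set = service_cookie, None
        self.sightings.setdefault(visitor_id, []).append(subscriber_site)
        return visitor_id, cookie_to_set


service = SharedIdentityService()
browser_cookie = None  # the service's cookie as stored by one browser
for site in ["shop.example", "news.example"]:
    visitor_id, cookie_to_set = service.tag_call(browser_cookie, site)
    if cookie_to_set is not None:
        browser_cookie = cookie_to_set
print(service.sightings)
```

Because both subscriber sites trigger calls that carry the same service cookie, the service sees one visitor across two sites, which is the boundary-crossing behavior described above.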

In the next article, I will explain web server log files, the primary data source for a clickstream data warehouse. When coupled with cookie files, this data stream is key to any clickstream data warehouse implementation.


© Copyright 2001, 2002 Clickstream Consulting, All Rights Reserved