|
Web Server Log Files: The Ultimate Clickstream Data Source
By Mark Sweiger, President and Principal, Clickstream Consulting
In the last article, I shed some light on the inner workings of the mysterious cookie file, the principal user identity mechanism for a clickstream data warehouse. Armed with that knowledge, we now turn our attention to the principal data source for a clickstream data warehouse, Web server log files. While clickstream data warehouses typically have multiple data sources just like any other data warehouse, one type of data source dominates all the others, namely the records contained in Web server log files.
In many ways, Web server log file records are a dream data source. They are automatically emitted by every popular type of web server, including Apache, Microsoft IIS, IBM Websphere, BEA WebLogic, etc. There is one record for each HTTP transaction, providing a very fine granularity of user activity detail. And unless the site chooses to customize its logging, log records have a universal standard format, the NCSA Combined Log File Format, making them exceptionally easy to parse and manipulate.
Lets take a look at the Combined Log File Format and see whats inside. The Combined Log File Format contains the following fields:
- h = The remote requestor ID, in IP format (example: 165.32.124.3) if Directory Name Service (DNS) is not enabled
- I = The remote username via identd, which is blank (indicated by a -) unless the client is Unix or Linux.
- u = The HTTP authenticated username, which is likely blank unless your site requires user registration
- r = The entire text of the HTTP request
- s = The returned HTTP status code
- b = The number of bytes in the returned page
- {referer}i = most recent referring URL, taken from the referer field of the HTTP request header
- {user-agent}i = the requesting software type, usually a browser like Microsoft IE or Netscape, but it can also be things like search engine robots
Here is an example of three Combined Log File Format log records from my companys web site, ClickstreamConsulting.com:
204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 "http://metacrawler.com/crawler?general=dimensional+modeling" "Mozilla/4.5 [en] (Win98; I)"
204.243.130.5 - - [26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" 200 1900 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" 200 7363 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
Lets examine the first log record. The first field, 204.243.130.5, is the IP address of system issuing the HTTP requests. It is quite likely that this IP address is an Internet proxy for the client, like a server at his Internet Service Provider (ISP). As a matter of course, I run a DNS mapper against our sites log file to decode IP addresses into real host names (like earthlink.net, or ibm.com), but in this case the IP address did not correspond to a registered domain name and so it was not decoded. This happens more often than you might guess, somewhere between 20% and 50% of the time in my experience. The next two fields in the log record are blanks, which are indicated by the two hyphen placeholders. These blank fields indicate that the client was not authenticated by identd or by HTTP authentication, which makes sense because currently my site does not require any authentication to access it. The next field, [26/Feb/2001:15:34:52 -0600], is the GMT time when the users request was logged. The next field, "GET / HTTP/1.0", is the actual text of the HTTP request, which in this case is a GET of my sites home page using the shorthand / for the root page of the site. The next field, 200, is the returned HTTP status code, and 200 means that the operation was successful. The next field following the status code, 8437, is the actual number of bytes transferred to the client as the result of the HTTP transaction. If the number of bytes transferred is not equal to the number of bytes in the actual home page, then we know that the user abandoned the GET by hitting STOP, BACK, or by clicking through to another link before the page was fully loaded. The next field, "http://metacrawler.com/crawler?general=dimensional+modeling", is the referring URL, and in this case the referrer is the search engine metacrawler.com, which found my site as the result of a search for sites that mention the term dimensional modeling, which is specified by the query string parameters after the ? in the URL. The last field in the log record, "Mozilla/4.5 [en] (Win98; I)" indicates which type of browser and OS the user employed to get this page, and in this case the browser was Microsoft IE version 4.5 running on a Windows 98 system.
Moving on to the second log record, we see that it records an HTTP GET of the image logo1.gif, which is the file containing the logo of Clickstream Consulting. This additional GET is embedded inside the HTML that generates the home page of ClickstreamConsulting.com, which resulted in the creation of the first log record above. The existence of this second log record for the home page illustrates a key concept: Most page views generate multiple HTTP transactions, resulting in multiple log records per page view.
The final log record in the example indicates that the user decided to visit our sites Articles and Information link, whose content is in the file articles.html. This page contains a variety information on subject of clickstream data warehousing. By subtracting the logged GMT time in this record from the GMT time in previous log record, we can see that the user viewed our home page for 34 seconds before clicking on the link for the Articles and Information page. Since no more log records for this IP address appear after this one, the exited the site probably by clicking on one of the many external links on Articles and Information page, making it the exit page for this users visit.
While the fields contained in the standard Combined Log File Format are obviously useful as the clickstream data source, additional fields can be added to the log record by adding additional objects to the log record template used by your Web servers logging module. Virtually all Web server software supports extensions to the format of the log file.
For example, if your site creates client-side cookies, you should always append the resultant cookie text to your Web server log file records. In an Apache or Netscape web server environment, adding the string {cookie}i to the log record format causes the client-side cookie to get logged as the string cookie-variable-name=value for each HTTP transaction. A logged cookie is a much more accurate way of establishing user identity than a host IP address. Many ISPs, including AOL and Microsoft Network (MSN), change the user IP address with every transaction, making it impossible to group log records into a visit by a single user using the IP address, as we did in the above example. Unless the log file format is extended to log a users cookie, it would be impossible to do visit analysis on users that come from those ISPs since one cannot establish which set of records pertain to a particular user because of the ever-changing IP addresses. And because AOL and MSN have such a large combined user community, currently numbering in the tens of millions, failure to include the cookie in the log record could mean that the majority of a sites clickstream data could not be analyzed at a visit level of granularity.
In the next article we will dive into the clickstream data warehouse extract, transformation and load (ET&L), that is, the process of taking the flat file data contained in the raw log file records and loading it into a dimensional clickstream data warehouse.
|