Clickstream Consulting
       The data warehouse experts



Search this site or  the web
powered by FreeFind




    Site search
    Web search
Clickstream Data Warehouse Extract Transformation and Load

By Mark Sweiger, President and Principal, Clickstream Consulting

In the previous article in this series, we went over the details of web server log file records, the main data source for a clickstream data warehouse. Now that we understand the format of the main data source, we can explore the architecture of the extract, transformation and load (ETL) process needed to get the data into a clickstream data warehouse.

The distributed nature of Web page content makes clickstream ETL more of an around-the-world safari rather than a short trip to the data stored inside the back-office data center. Consider the architecture in the diagram below:



The average web page may contain content that is served-up by all the types of servers pictured in the diagram. Let’s go through them one-by-one, starting with the internal systems:
  1. Replicated Web Servers: These are your site’s web servers, probably Apache or Microsoft IIS.
  2. Web Application Servers: Some vendors, like BEA, Blue Martini, and Art Technology Group, produce application servers that create dynamically generated, personalized web pages on the fly. They also often contain specialized logic to create application-specific constructs like shopping carts and product hierarchies, checkout pages, etc. If these application servers can log these application specific constructs, they can be combined with the site-hit-oriented web server log files to produce a more complete picture of user activity than what can be derived from just the web logs.
  3. Internal Cached Content Servers: Many sites use these to speed access to popular pages, and these specialized servers typically log at least at the Combined Log File format level.
  4. SSL Web Servers: These servers log secure transactions like user registrations and credit card authorizations.
External systems also log information:
  1. Advertising Engines: If you contract with an advertising engine to run advertisements for your e-business, then every time a banner ad is viewed, that information is logged by the advertising engine’s servers. This information can be shipped back to the clickstream data warehouse for analysis.
  2. External Cached Content Engines: Many sites use external caching engines to push large content objects to users faster than they would be served by their own web servers. These external caching engines, like Akamai, log their activities and this information can also be shipped back to the clickstream data warehouse for analysis.
  3. External User Identity Engines: These engines, like Coremetrics, use a special cookie to identify users to a greater level than may be possible using normal mechanisms like a site-created persistent cookie. The identity information is then passed back to the requesting site using redirection, and the more complete identity information can be logged.

All of this logged information gets passed over the Internet and the intranet to a separate log file server. There the log information is parsed, merged together, page dwell time is calculated, and user session-level information like the type of visit, the length of the session, and other interesting aggregates are calculated. This information is then put into a staging database schema on the log file server for eventual insertion into the actual clickstream data warehouse. By having a separate log file server, data extraction can be performed continuously without impacting the query performance of the clickstream data warehouse. Since the staging schema closely resembles the actual clickstream data warehouse schema, loading is straightforward, and can be done in batches or even as a continuous trickle feed of new data, so long as queries are always properly constrained by date and time.

Using a flexible architecture like this one can make clickstream data warehouse ETL construction almost a mechanical process. But one has to make sure that similar log data sources use the same log record formats, or information will be lost. For example, if your site uses cookies to identify users or sessions, your web servers, application servers, and internal caching servers should all produce identical log record formats that have been extended to log cookies.

In the next article we investigate the end-user analytical environment of a clickstream data warehouse, including the kinds of business questions can be answered by the warehouse.


Home    Our Services    Company Overview     Consultant Profiles     Articles and Papers     Contact Us!     Book Site     Links

© Copyright 2001, 2002 Clickstream Consulting, All Rights Reserved