As our final project for CS 4920 we chose to visualize the SUU web server logs for January 2003. For this project we present five different visualizations of the data, showing information such as page hits and referer links.
To do this we first needed to get the raw server logs from Mark Walton; with a 200 MB file in hand we had all the data we needed. It should be noted that the file is only 200 MB because we used a filtered version of the logs which records only page hits and ignores image requests; the unfiltered log would have been over 2 GB.
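We didn't do the filtering ourselves, but the idea behind it can be sketched in a few lines. This is purely illustrative: the column position of the URL and the set of image extensions are assumptions, not details from the actual SUU logs.

```python
# Hypothetical sketch of the filtering that shrinks the raw log:
# keep page hits, drop image requests (identified by file extension).
# The URL column index and extension list are assumptions.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".ico")

def is_page_hit(line: str, url_column: int = 4) -> bool:
    """Return True if the whitespace-separated log line looks like a
    page request rather than an image request."""
    fields = line.split()
    if len(fields) <= url_column:
        return False
    return not fields[url_column].lower().endswith(IMAGE_EXTENSIONS)

def filter_log(lines):
    """Keep only the lines that represent page hits."""
    return [line for line in lines if is_page_hit(line)]
```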
The first step was to make the data more manageable. The raw data had 15 columns, including information such as the request method (GET, POST) and the browser used. As none of the visualizations we chose to display required this information, we needed to trim it down. At the same time we needed to be able to generate statistics on the data, such as the number of times the home page was accessed at 7:00 PM on January 5th. We could have written a pile of custom code for these operations, but that would have been a lot of effort for very little return. Rather than reinvent the wheel we imported the data into Microsoft SQL Server, which comes with some powerful tools for importing data. Once we had the data in a relational database, we were able to index all the fields we would be querying on, such as date, time, and page.
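The setup amounts to a table plus an index on each queried field. Here is a minimal sketch of that idea, using SQLite in place of SQL Server; the column names (LogDate, LogTime, Page) are illustrative, not the log's real column names.

```python
import sqlite3

# Minimal sketch of the database setup, with SQLite standing in for
# SQL Server. Column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (LogDate TEXT, LogTime TEXT, Page TEXT)")

# Index every field the later queries filter or group on.
conn.execute("CREATE INDEX idx_logs_date ON logs (LogDate)")
conn.execute("CREATE INDEX idx_logs_time ON logs (LogTime)")
conn.execute("CREATE INDEX idx_logs_page ON logs (Page)")

# Example statistic: hits on the home page at 7:00 PM on January 5th.
conn.execute("INSERT INTO logs VALUES ('2003-01-05', '19:00:14', 'index.html')")
count = conn.execute(
    "SELECT COUNT(*) FROM logs "
    "WHERE LogDate = '2003-01-05' AND LogTime LIKE '19:%' "
    "AND Page = 'index.html'"
).fetchone()[0]
```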
At this point we could easily extract just the information we wanted from the database. Using Microsoft's Query Analyzer tool, which comes with SQL Server 7, we were able to run queries on the data and have the results written to a file. For those of you unfamiliar with SQL, queries are statements describing the operation you wish to perform on your database, such as the following:
SELECT COUNT(*) FROM logs WHERE Col003='index.html'
The above query returns a count of the number of times the file index.html was accessed. I'm not going to list the actual queries we used to extract the data (because I forgot to save them), but the query above is extremely simple and does not reflect the queries we actually used. In reality, almost all of our queries counted the hits for each distinct value of a field on a certain day, restricted to hours beginning with a certain value.
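Since the real queries weren't saved, the following only reconstructs the general pattern just described: a count per distinct field value, filtered by day and hour. Column names are assumptions, and SQLite stands in for SQL Server.

```python
import sqlite3

# Reconstruction of the general query pattern (not the actual queries):
# count hits per distinct page on one day, within one hour.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (LogDate TEXT, LogTime TEXT, Page TEXT)")
rows = [
    ("2003-01-05", "19:02:11", "index.html"),
    ("2003-01-05", "19:45:03", "index.html"),
    ("2003-01-05", "19:59:59", "courses.html"),
    ("2003-01-05", "20:01:00", "index.html"),   # wrong hour, excluded
    ("2003-01-06", "19:10:00", "index.html"),   # wrong day, excluded
]
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)

# Hits per distinct page on January 5th during the 19:00 hour.
hits = conn.execute(
    "SELECT Page, COUNT(*) FROM logs "
    "WHERE LogDate = '2003-01-05' AND LogTime LIKE '19:%' "
    "GROUP BY Page ORDER BY COUNT(*) DESC"
).fetchall()
```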
Once this was done we had the raw data formatted in a way that was useful to us. However, many of the smaller data files still contained over 30,000 rows. As we didn't want to build visualizations that handled 30,000+ actors, we manually extracted the data we thought would make an interesting visualization, often limiting ourselves to about five pieces of data per visualization.
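We did this winnowing by hand, but the selection it amounts to is simple: from the aggregated (page, hits) rows, keep only the few largest. A hedged sketch of that idea:

```python
# Sketch of the manual winnowing step: from thousands of aggregated
# (value, count) rows, keep only the top few for a readable visualization.
# The function name and tuple shape are illustrative assumptions.
def top_n(counts, n=5):
    """Return the n pairs with the highest counts, largest first."""
    return sorted(counts, key=lambda pair: pair[1], reverse=True)[:n]
```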
With data in hand we were then ready to create the actual visualizations: