C#: Parsing Apache / IIS log files (NCSA/W3C)

June 11, 2011 by C#   IIS  

Webserver log files (e.g. NCSA/W3C) contain all kinds of useful data - amount of visitors to a website, how pages are performing/underperforming, bandwidth consumption, missing pages etc.

There are a number of excellent applications out there that parse these log files and provides the webmaster with all kinds of sexy reports (containing funky graphs/grids about the unfortunate victims/visitors to our websites).

For example (feel free to add to the list):

http://www.weblogexpert.com
http://awstats.sourceforge.net
http://www.smartertools.com/smarterstats/web-analytics-seo-software.aspx

Some of these tools, like smarterstats (if memory serves me correctly) even feature a simple report builder which empowers a webmaster to build their own reports.

But even that amount of flexibility can prove to be a bit unreliable, observe the following extract of a NCSA format log:

crawl-66-249-72-249.googlebot.com - - [11/Apr/2011:00:01:47 +0200] "GET /robots.txt HTTP/1.1" 301 291 "-" "Googlebot-Image/1.0"
41.0.34.4 - - [11/Apr/2011:00:02:21 +0200] "GET /components/com_virtuemart/show_image_in_imgtag.php?filename=resized%2FPaddle_Colt_Gove_4c68c62b3111e_90x90.jpg&newxsize=70&newysize=80&fileout= HTTP/1.1" 200 5129 "-" "Mozilla/4.0 (compatible;)"
41.0.34.4 - - [11/Apr/2011:00:14:15 +0200] "GET /components/com_virtuemart/show_image_in_imgtag.php?filename=resized%2FLooper_Kydex_Bel_4ce4c8c4a74b1_90x90.gif&newxsize=70&newysize=80&fileout= HTTP/1.1" 200 3046 "-" "Mozilla/4.0 (compatible;)"
41.0.34.4 - - [11/Apr/2011:00:14:16 +0200] "GET /components/com_virtuemart/show_image_in_imgtag.php?filename=resized%2FESP_Baton_21__4c52fdff9221e_90x90.gif&newxsize=70&newysize=80&fileout= HTTP/1.1" 200 2041 "-" "Mozilla/4.0 (compatible;)"
41.0.34.4 - - [11/Apr/2011:00:14:37 +0200] "GET /components/com_virtuemart/show_image_in_imgtag.php?filename=resized%2FITP_A3_EOS_4bb9a8780aaef_90x90.jpg&newxsize=70&newysize=80&fileout= HTTP/1.1" 200 5761 "-" "Mozilla/4.0 (compatible;)"
ool-18bc86a4.dyn.optonline.net - - [11/Apr/2011:00:17:28 +0200] "GET /index.php?page=shop.product_details&category_id=3&flypage=tpflypage.tpl&product_id=8&option=com_virtuemart&Itemid=10 HTTP/1.1" 200 7503 "http://www.google.com/url?sa=t&source=web&cd=12&ved=0CCYQFjABOAo&url=http%3A%2F%2Fwww.hailstormza.com%2Findex.php%3Fpage%3Dshop.product_details%26category_id%3D3%26flypage%3Dtpflypage.tpl%26product_id%3D8%26option%3Dcom_virtuemart%26Itemid%3D10&rct=j&q=esp%20baton&ei=bP2UTZvGH6OD0QHkn-XkCw&usg=AFQjCNGur0K_j-lqCC8VZX5Ftz5I-NnjuA" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SearchToolbar 1.2; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; InfoPath.2; .NET CLR 3.0.30729; .NET4.0C; SRS_IT_E8790476BD76555A36A199; SRS_IT_E8790476BD76555A31A090)"

In the preceding extract you might have noticed the page "show_image_in_imgtag.php" being called multiple times. The page in question (from virtuemart - a joomla shop web application) is used to dynamically output resized images to the browser.

If we're interested in an accurate amount of page views (not to be confused with hits), we'll need to exclude that file from our report (since its not actually a page view) - something we can't expect the reporting application to know (perhaps some apps support manual exclusion of files?).

Which obviously means that the structure of a website can influence the accuracy of our reports; Incidently the website where I took this extract from reported over 150,000 in page views using AWStats, but as soon as I manually excluded the "show_image_in_imgtag.php" page, views fell to under 10,000 - a rather horrible inflation of the actual stats don't you think?

Now I am not going to tell everyone to dump their reporting applications and to write their own, I am however going to give you a few clues on how to manually retrieve and interpret the raw data yourself.

The first thing we're going to have a look at is how to parse a log file.

Note that log files come in various formats, like seen in the following image (from IIS 7.5 Manager):
Image listing log file format options

Instead of writing our own log parser(s) we can use a pre-written one like the "log parser 2.2" tool Microsoft created back in 2005 (it can be downloaded from here), the tool supports all of the formats seen in the image above (and even a few others).

In your project add a reference to the LogParser.dll - if you installed the above mentioned tool it will generally be located within your 32bit program files folder under the "Log Parser 2.2" folder.

Observe the following snippet:

using System;
using System.Data;
using MSUtil;

public static Type[] types = new Type[] 
{  
    Type.GetType("System.Int32"), 
    Type.GetType("System.Single"),
    Type.GetType("System.String"), 
    Type.GetType("System.DateTime"),
    Type.GetType("System.Nullable")
};

public static DataTable ParseLog<T>(string query) where T : new()
{
    LogQueryClassClass log = new LogQueryClassClass();
    ILogRecordset recordset = log.Execute(query, new T());
    ILogRecord record = null;

    DataTable dt = new DataTable();
    Int32 columnCount = recordset.getColumnCount();

    for (int i = 0; i < columnCount; i++)
    {
        dt.Columns.Add(recordset.getColumnName(i), types[recordset.getColumnType(i) - 1]);
    }

    for (; !recordset.atEnd(); recordset.moveNext())
    {
        DataRow dr = dt.NewRow();

        record = recordset.getRecord();

        for (int i = 0; i < columnCount; i++)
        {
            dr[i] = record.getValue(i);
        }
        dt.Rows.Add(dr);
    }
    return dt;
}

In order to use the ParseLog generic method (like seen in the preceding snippet), we need to pass the type of log we're attempting to parse and a SQL style query as parameter.

SQL style query? Well, that is the true power of the microsoft log parser tool, you're able to write queries against the parsed entries e.g.

// W3C
DataTable dt = ParseLog<COMW3CInputContextClassClass>(@"select * from c:\logs\w3c\*.*");

// NCSA
DataTable dt2 = ParseLog<COMIISNCSAInputContextClassClass>(@"select * from C:\logs\ncsa\*.*");


Awesome, so now we've got our logs all parsed and ready to go, everyone happy?

Well, not exactly, there is a few crucial shortcomings I need to point out...

As soon as we're working with custom fields everything goes a bit haywire, something I only noticed by accident.

The COMIISNCSAInputContextClassClass parser strictly requires the log file to be in NCSA format with an exact set of fields, no custom fields allowed and lines that don't conform gets excluded - makes sense since the log doesn't contain any column definitions (What happend to me was that the administrators of this webserver added two extra fields to the log files and all of a sudden the parser failed parsing lines).

The COMW3CInputContextClassClass parser handles things a bit better in that it actually supports custom fields (also due to the fact that the log file contains column definitions). Unfortunately there is a problem in the way it handles custom fields, observe the following extract:

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2011-05-24 11:49:35
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2011-05-24 11:49:35 ::1 POST /UIT/AppWebServices/LookupMruWebService.asmx - 5555 UIT\Administrator ::1 Mozilla/4.0+(compatible;+MSIE+8.0;+Windows+NT+6.1;+WOW64;+Trident/4.0;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+.NET4.0C;+.NET4.0E) 200 0 0 2387
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2011-05-24 13:04:55
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs-host sc-status sc-substatus sc-win32-status time-taken
2011-05-24 13:04:55 ::1 GET /blog/247/default.aspx - 5555 - ::1 Mozilla/4.0+(compatible;+MSIE+8.0;+Windows+NT+6.1;+WOW64;+Trident/4.0;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+.NET4.0C;+.NET4.0E) localhost:5555 401 1 2148074254 1640

Notice the #Fields rows, it contains a definition of the columns represented in each of its following rows, but oddly enough you will find that the last entry in the extract won't be parsed.

The reason being that the COMW3CInputContextClassClass parser takes the first #Fields line in the log file and assumes that it is an accurate representation for all its following rows (disregarding every other #Fields that might be different, like seen in the above extract).

Personally I think it would have made a lot more sense if the parser returned all W3C Logging Fields (ones that don't exist in the log returned as nulls) and simply use the #Fields line as reference to which column a field belongs - which means we're not going to need to exclude entries that are actually valid.


Leave a Comment


August 25, 2016 by Eric Brown

Greetings - I know this article was written back in 2011 but was wondering if you may have an answer. When trying to use this code I'm getting an error: "Interop type 'LogQueryClassClass' cannot be embedded. Use the applicable interface instead" Any suggestions would be much appreciated! Thank you.