Logfeed
Logfeed generates custom-filtered, templated Atom feeds from Apache access logs.
1. Overview
Feeds are defined by simple configuration files which contain
- feed metadata such as the title, author, URI, and number of entries,
- filtering rules based on the URL, referrer, IP address, etc., and
- an entry template (optional).
Filtering is performed by matching or ignoring user-defined regular expressions written in plain Perl. Entry templates define the content of entries and are simple XML+XHTML fragments containing template variables. The easiest way to understand how logfeed works is through a few simple examples.
Logfeed is very similar in nature to Blosxom both in the way config files are defined and how templates are interpolated. In fact, I used several bits of code from Blosxom itself and the Blosxom config plugin. If you have used Blosxom before then logfeed’s behavior will probably seem natural.
2. Dependencies
Logfeed requires the File::ReadBackwards Perl module. This is a “non-standard” module and may need to be installed. On Debian-based Linux distributions, this is as easy as:
sudo apt-get install libfile-readbackwards-perl
If there is no similar package available for your operating system you can download the module from CPAN.
3. Download
You can either browse the repository, download a snapshot, or clone the repository using Git:
git clone git://jblevins.org/git/logfeed.git/
4. Configuration
Logfeed config files are actually just Perl fragments. They are evaluated each time logfeed runs. Each configuration variable is described below with examples.
Metadata
The following metadata variables are required:
$log_file
- The location of the log file.
    $log_file = '/var/log/httpd/access.log';
$feed_title
- The title of the feed.
    $feed_title = 'Recent Referrers';
$base_url
- The base URL for the site (with no trailing slash).
    $base_url = 'https://jblevins.org';
$feed_path
- The full path to the feed.
    $feed_path = '/feeds/referrers.atom';
$author_name
- The author’s name.
    $author_name = 'Jason Blevins';
$id_year
- The year this feed was started. This is used to construct a unique Tag URI for the feed.
    $id_year = '2008';
All of the following are optional.
$feed_subtitle
- A short description of the feed.
    $feed_subtitle = 'A list of recent referrers.';
$feed_icon
- The URL of the feed’s icon.
    $feed_icon = 'https://jblevins.org/favicon.ico';
$author_email
- The author’s email address.
    $author_email = 'jrblevin@sdf.lonestar.org';
$author_uri
- The author’s URI.
    $author_uri = 'https://jblevins.org/';
$num_entries
- The number of entries to include. Defaults to 50.
    $num_entries = 25;
$reverse_dns
- Set to 1 to enable reverse DNS lookup and to 0 otherwise. Defaults to 0.
    $reverse_dns = 1;
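Putting the variables above together, a minimal config file might look like the following sketch. The filename referrers.conf is hypothetical, and the values simply reuse the examples given for each variable.

```perl
# referrers.conf: a hypothetical logfeed config file (plain Perl).

# Required metadata
$log_file    = '/var/log/httpd/access.log';
$feed_title  = 'Recent Referrers';
$base_url    = 'https://jblevins.org';      # no trailing slash
$feed_path   = '/feeds/referrers.atom';
$author_name = 'Jason Blevins';
$id_year     = '2008';

# Optional metadata
$feed_subtitle = 'A list of recent referrers.';
$num_entries   = 25;    # defaults to 50 when omitted
$reverse_dns   = 0;     # set to 1 to look up hostnames
```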
Filters
You can match or ignore lines using the %match and %ignore hashes with the following keys:
'ip'
- IP address
'user'
- Username (if authenticated)
'req'
- Request filename
'code'
- Status code
'ref'
- Referring URL
'ua'
- User agent string
Values in these hashes should consist of regular expressions. Lines that match at least one of the %ignore rules will be excluded. Remaining lines that match all of the %match rules for each key will be included. This is perhaps best illustrated with an example.
The following rules will create a feed of all requests whose referring URL contains (‘google’ OR ‘yahoo’) AND that resulted in a 404 status code:
$match{'ref'} = 'google|yahoo';
$match{'code'} = '404';
Below are some more examples:
Match hits coming from Wikipedia:
$match{'ref'} = 'wikipedia\.org';
Ignore hits on files in /css and /code:
$ignore{'req'} = '^/css|^/code';
Match requests for the feeds index.atom and index.rss:
$match{'req'} = '^/index\.atom$|^/index\.rss$';
Ignore Googlebot and Yahoo! Slurp:
$ignore{'ua'} = 'slurp|googlebot';
Match Internet Explorer users:
$match{'ua'} = 'MSIE';
Since the configuration file is just Perl code, you can even do things like the following, which ignores hits with no referring URL and hits from Google and Yahoo:
my @temp = qw! ^-$ google\.com ^http://search\.yahoo\.com !;
$ignore{'ref'} = join '|', @temp;
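The include/exclude logic described in this section can be sketched in a few lines of Perl. This is only an illustration of the stated rules, not logfeed's actual code: the keep_line function and the sample hits are hypothetical, and the sketch assumes case-sensitive matching.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether a parsed log line (a hash of
# field name => value) passes the %ignore and %match rules.
sub keep_line {
    my ($fields, $match, $ignore) = @_;

    # Exclude the line if any %ignore rule matches its field.
    for my $key (keys %$ignore) {
        my $value = defined $fields->{$key} ? $fields->{$key} : '';
        my $re    = $ignore->{$key};
        return 0 if $value =~ /$re/;
    }

    # Keep the line only if every %match rule matches its field.
    for my $key (keys %$match) {
        my $value = defined $fields->{$key} ? $fields->{$key} : '';
        my $re    = $match->{$key};
        return 0 unless $value =~ /$re/;
    }

    return 1;
}

# Rules from the example above: referrer is google or yahoo, status is 404.
my %match  = ('ref' => 'google|yahoo', 'code' => '404');
my %ignore = ();

# Invented sample hits, already split into fields.
my @hits = (
    { 'ref' => 'http://www.google.com/search?q=logfeed', 'code' => '404' },
    { 'ref' => 'http://www.google.com/search?q=logfeed', 'code' => '200' },
    { 'ref' => 'http://example.org/',                    'code' => '404' },
);

for my $hit (@hits) {
    my $verdict = keep_line($hit, \%match, \%ignore) ? 'keep' : 'skip';
    print "$verdict  $hit->{'code'}  $hit->{'ref'}\n";
}
```

Only the first sample hit satisfies both %match rules, so it is the only one kept. The second fails the 'code' rule and the third fails the 'ref' rule.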
Templates
The body of feed entries can be completely customized using a template, a string stored in the variable $entry. This template tells logfeed how to generate <entry> items in the Atom feed. If you do not define this variable, the default template will be used. You need to use single quotes (or the q operator) so that Perl does not interpolate the variables when the config file is evaluated.
If you modify the default template, make sure the body of the <content> element is valid XHTML and that the required elements, <id>, <title>, and <updated>, are all included. It is very important that the IDs are unique.
The following variables will be interpolated using information from the log file:
$ip
- IP address
$host
- hostname when reverse DNS is enabled, $ip otherwise
$user
- username (if authenticated)
$time
- time and date as printed in the log file
$utc_time
- time and date as called for by the Atom specification
$id_time
- the UNIX time of the log entry
$req
- the request filename
$code
- status code
$sz
- file size
$ref
- referring URL
$short_ref
- referrer without CGI parameters
$ua
- user agent
And the following will be interpolated using the metadata defined above:
$log_file
- Path to the Apache access log
$feed_title
- The title of the overall feed
$base_url
- The base URL of the site
$feed_path
- The absolute path to the feed
$id_year
- Your chosen identifying year
$id_domain
- Your domain
Here is the default template:
$entry = '<entry>
<id>tag$colon$id_domain,$id_year$colon$feed_path/$id_time/$ip$req</id>
<title>$host: $req</title>
<author>
<name>$author_name</name>
$author_uri$author_email
</author>
<updated>$utc_date</updated>
<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li><strong>Date:</strong> $utc_date</li>
<li><strong>User:</strong> $user</li>
<li><strong>Host:</strong> $host</li>
<li><strong>User Agent:</strong> $ua</li>
<li><strong>Referrer:</strong> <a href="$ref">$ref</a></li>
<li><strong>File:</strong> <a href="$base_url$req">$base_url$req</a></li>
<li><strong>Size:</strong> $sz</li>
<li><strong>Status:</strong> $code</li>
</ul>
</div>
</content>
<link rel="alternate" href="$ref"/>
</entry>
';
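As an illustration of customizing the template, the following sketch trims the default down to a one-line summary per hit. It is a hypothetical example rather than a template shipped with logfeed. It keeps the <id> line unchanged from the default so IDs stay unique, and it uses only variable names that appear in the default template.

```perl
# Hypothetical compact template. Single quotes keep Perl from
# interpolating $host, $req, etc. when the config file is evaluated.
$entry = '<entry>
<id>tag$colon$id_domain,$id_year$colon$feed_path/$id_time/$ip$req</id>
<title>$code: $req</title>
<author>
<name>$author_name</name>
</author>
<updated>$utc_date</updated>
<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">
<p>$host requested <a href="$base_url$req">$req</a> (status $code).</p>
</div>
</content>
</entry>
';
```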
5. Usage
Config files can be named anything. For the following examples, let’s assume files have the .conf extension. This is completely optional.
Logfeed can be run from the command line, as in
perl log-feed.pl conf=bar.conf
This command can, for example, be called periodically via a cron job. It can also run as a CGI script:
http://foo.net/feeds/log-feed.pl?conf=bar.conf
Optionally, one could use Apache’s mod_rewrite to help clean up the URLs when running in CGI mode. For example, the following rewrite rule could be placed in .htaccess:
RewriteRule ^feeds/(.*)\.atom$ feeds/log-feed.pl?conf=$1.conf
Then, given a config file called bar.conf, the feed would be made available at http://foo.net/feeds/bar.atom.
6. Notes
logfeed only operates on logs in the combined format. Other formats are possible by a simple modification of the relevant regular expression in the script.
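For readers who need to adapt that expression, here is a rough sketch of a pattern for Apache's combined format. It is not the expression used inside logfeed. The parse_combined function is hypothetical, and the hash keys merely mirror the template variable names used earlier.

```perl
use strict;
use warnings;

# Illustrative pattern for the combined log format, whose fields are:
# host, identd, user, [time], "request", status, size, "referer", "agent".
my $combined = qr{
    ^(\S+)\s              # remote host or IP address
    \S+\s                 # identd field (ignored)
    (\S+)\s               # authenticated user, or "-"
    \[([^\]]+)\]\s        # timestamp as printed in the log
    "\S+\s(\S+)[^"]*"\s   # request line; capture only the path
    (\d{3})\s             # status code
    (\S+)\s               # response size, or "-"
    "([^"]*)"\s           # referring URL
    "([^"]*)"\s*\z        # user agent string
}x;

# Hypothetical helper: return a hash reference of fields, or nothing
# if the line is not in the combined format.
sub parse_combined {
    my ($line) = @_;
    my @fields = $line =~ $combined;
    return unless @fields;

    my %hit;
    @hit{qw(ip user time req code sz ref ua)} = @fields;
    return \%hit;
}

my $line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
         . '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
         . '"http://www.example.com/start.html" '
         . '"Mozilla/4.08 [en] (Win98; I ;Nav)"';

if (my $hit = parse_combined($line)) {
    print "$hit->{'ip'} requested $hit->{'req'} (status $hit->{'code'})\n";
}
```

Supporting another format would mean adding, removing, or reordering the captured groups to follow that format's field order.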
Reverse DNS lookup can be slow. This should be taken into consideration when choosing whether to run logfeed as a CGI script.
The default entry IDs should be unique under most circumstances. They could fail to be unique if you receive two hits in the same second from the same IP address for the same file. This is an unlikely event and in this case both entries would be identical anyhow.