A spam filter for Usenet servers. Cleanfeed blocks spam on the way
into your server, before it is written to disk or propagated to outbound
feeds. It can also block binaries in non-binary newsgroups and includes
several other features to keep your newsfeed clean.
Cleanfeed currently works with INN, Cyclone, Typhoon, Breeze, and
NNTPRelay servers. See my webpage (listed at the end of this document)
for pointers to information about using Cleanfeed with CNews, Diablo,
Collabra, or INN versions earlier than 1.5.1.
USAGE
For all versions, place the cleanfeed.conf configuration file
somewhere, then edit the Cleanfeed source file and change the
$config_dir option at the top to point to the directory where
the config file lives.
INN
Install the filter file (called cleanfeed) as filter_innd.pl, and
cleanfeed.conf, in the location you specified in config.data (INN
1.7.2 and earlier) or when configuring INN 2.x (usually the bin/filter
directory under the installation root). Make sure both files are readable
by the news user. Once in place, the filter is loaded with the command
ctlinnd reload filter.perl meow. Filtering can be turned on with
ctlinnd perl y and turned off with ctlinnd perl n.
Cyclone/Typhoon/Breeze
Add the -program <file> and -body options to the bin/start
script, where <file> is the location and name of the Cleanfeed
program. Restart the server. Cleanfeed will run as an external process
(standalone mode). IMPORTANT: make sure both cleanfeed and cleanfeed.conf
are readable by the news user! Double-check the permissions as this is
a fairly common mistake!
NNTPRelay
Find the ExternalFilter directive in config.txt and make it look like:
Cleanfeed will run as an external process (standalone mode).
More detailed installation instructions are provided later in this
document.
CONFIGURATION OPTIONS
Configuration is accomplished by setting the various options in the
cleanfeed.conf configuration file. This file is evaluated as Perl
code, so comments can be included in the usual Perl # syntax. A
sample default file is included with the distribution.
If you would rather not use cleanfeed.conf, you can set its
location to ``undef'' in the source and edit the configuration
variables directly in the source file.
cleanfeed.conf has two sections (which define perl hashes):
%config_local and %config_append. Entries in %config_local
will override the default settings of the same name in the Cleanfeed
source. Entries in %config_append can be used to add to most of
the default regular expressions, for items such as badguys,
bin_allowed, poison_groups, etc. Settings in %config_append
for these items will be appended to the default regexps, separated by
``|'' (or).
If you want to completely override the default regexps for these options,
rather than just add to the defaults, you can add an entry for them into
the %config_local section of cleanfeed.conf.
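As a rough illustration of the two sections, a cleanfeed.conf might look
something like the sketch below. The option names are taken from this
document; the values and group patterns are made-up examples, not the
defaults shipped with Cleanfeed.

```perl
# cleanfeed.conf: illustrative sketch only. Values are examples,
# not the defaults shipped with Cleanfeed.

# Entries here replace the default of the same name outright.
%config_local = (
    block_binaries => 1,     # on/off options take 1 or 0
    verbose        => 0,
    MD5maxlife     => 24,    # hours
);

# Entries here are appended to the default regexp with "|".
%config_append = (
    poison_groups => 'example\.spam\.',        # hypothetical pattern
    bin_allowed   => '^example\.binaries\.',   # hypothetical pattern
);
```

Moving poison_groups from %config_append to %config_local would replace
the default pattern instead of extending it.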
All of this is done quite blindly, so if you do anything odd, be careful.
(Cleanfeed will remove the common mistake of including two ``|'' (or) signs
in a row.) All config options are exposed to %config_local, including
any that may not be present in the sample file. Only the defined list of
options is exposed to %config_append.
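The append-and-clean-up behaviour can be pictured roughly as follows.
This is a sketch of the idea, not Cleanfeed's actual code; append_pattern
is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: join a default regexp and a local addition with "|",
# then collapse any run of "|" signs into a single one.
sub append_pattern {
    my ($default, $extra) = @_;
    return $default unless defined $extra && length $extra;
    my $joined = length $default ? "$default|$extra" : $extra;
    $joined =~ s/\|{2,}/|/g;    # "a||b" becomes "a|b"
    return $joined;
}

# The stray leading "|" in the addition would otherwise produce "bar||baz".
print append_pattern('foo|bar', '|baz'), "\n";    # prints foo|bar|baz
```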
Options that are on/off or yes/no should be set to 1 for on/yes, or 0
for off/no.
First, you need to tell Cleanfeed which news server software you are
using. At the top of the file, set the appropriate variable to 1. For
INN, set $inn; for Cyclone, Typhoon, or Breeze, set $highwind; and
for NNTPRelay, set $nntprelay. Ensure the other two (the ones you're
not using) are set to 0.
General Settings
aggressive
Set this to 0 to disable all content-based filters. Helpful to please
paranoid lawyers, or paranoid customers.
active_file
Set this to the full path to an active file, to allow Cleanfeed to know
what groups are moderated. This is normally your server's active file,
but it doesn't have to be; it is possible, for example, to run Cyclone
with no active file, but give one to Cleanfeed anyway.
MD5 Body Filter Settings
do_md5
When turned on, the MD5 EMP checks will be done. This should be left
on unless you have a really good reason to turn it off. If you're
running Hippo along with Cleanfeed, you might feel Cleanfeed's MD5
checks are redundant and want to turn them off, for example. It
would probably be better to leave it on with the history turned
down, instead.
md5maxmultiposts
Start rejecting articles after we have seen this many copies, according
to the MD5 checksum filter.
MD5History
How many articles to remember for MD5-based EMP comparison. Since the MD5
filter is not prone to false positives, setting this higher is a good idea
to catch more spam, if you have the RAM to spare.
MD5maxlife
When a spam is identified by the MD5 EMP filter, it is saved for continual
rejection. MD5maxlife specifies how long, in hours, to keep a saved
MD5 id which is no longer getting any hits. (A spam id which is still
getting matches will be saved regardless of age.) 24 hours works well.
fuzzy_md5
When turned on, the message bodies will be munged up a bit before MD5
checksums are generated. Whitespace and other non-alphanumeric
characters are stripped and letters are forced to lowercase, as well
as a couple other bits of treachery to try to defeat the ``hashbuster''
spam-bots. This adds a bit of ``fuzziness'' to the MD5 filter, and
results in a performance hit as well.
Since the smarter spammers have discovered hashbusting, I recommend
that this be turned on.
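The basic idea behind the munging can be sketched as below. This is a
simplified illustration, not Cleanfeed's code: fuzzy_digest is a
hypothetical name, the sketch uses the Digest::MD5 module rather than the
older MD5 module referred to in this document, and the real filter does
more than strip punctuation and fold case.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: reduce a body to lowercase letters and digits,
# then checksum what is left.
sub fuzzy_digest {
    my ($body) = @_;
    my $munged = lc $body;
    $munged =~ s/[^a-z0-9]//g;
    return md5_hex($munged);
}

my $first  = "Make MONEY fast!!!\nCall now.\n";
my $second = "make money   fast\n\n*** call NOW ***\n";

# Both bodies reduce to "makemoneyfastcallnow", so the digests agree.
print fuzzy_digest($first) eq fuzzy_digest($second) ? "same\n" : "different\n";
```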
fuzzy_max_length
Sets the maximum number of lines for an article body to be subject to
the fuzzy_md5 munging (above). This keeps extremely large articles
out of those nasty regular expressions.
md5_skips_followups
Determines whether the MD5 filter checks articles with References
headers. The default is to skip them. Setting this option to 0
will result in all articles passing through the MD5 filter, which
can result in a major performance hit, but does close another hole
in the filter. If you turn this off, you should increase MD5History
as well to avoid shortening your ``window''.
MD5HistSize
The maximum allowed size of the EMP memory for the MD5-checksum EMP filter.
Use this as a ``sanity check'' to prevent a sudden burst of spam from eating
up all of your memory. It should be set high enough so that you normally
never hit this number; use MD5maxlife to expire the hash instead.
Header-Based EMP Filter Settings
do_phl
Turns on the NNTP-Posting-Host/Lines EMP filter. This filter identifies
spam by identical posting-host headers and article sizes in a short period
of time. You really don't want to turn this off.
do_fsl
Turns on the From/Subject/Lines EMP filter. This filter identifies spam
by identical From and Subject headers and article sizes in a short period
of time. This is the one that gets the least number of hits these days,
so you won't lose much by shutting it off.
maxmultiposts
Start rejecting articles after we have seen this many copies, according
to the header-based EMP filter. Since false positives are somewhat more
likely with this filter than with MD5, this should be set appropriately
higher to reduce the odds.
ArticleHistory
How many ids to remember for header-based EMP comparison. Setting this
higher will catch more spam because there will be a larger ``window'' to
look at. Larger settings will also consume more memory and have a (small)
impact on performance, as well as slightly increase the chance of a false
positive (since the sample size will be larger). Most articles will
actually take up two entries in this history because there are two
different header-based filters.
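To see why one article usually occupies two history entries, consider
this sketch. The key layout and the helper names (phl_key, fsl_key,
record_article) are assumptions made for illustration; Cleanfeed's own
hashing is not reproduced here.

```perl
use strict;
use warnings;

my %history;    # key => number of times seen

# Hypothetical key builders, one per header-based filter.
sub phl_key {
    my ($h) = @_;
    return join "\0", 'phl', lc($h->{'NNTP-Posting-Host'}), $h->{'Lines'};
}

sub fsl_key {
    my ($h) = @_;
    return join "\0", 'fsl', $h->{'From'}, $h->{'Subject'}, $h->{'Lines'};
}

# Record one article and return the highest count seen for either key.
sub record_article {
    my ($h) = @_;
    my $max = 0;
    for my $key (phl_key($h), fsl_key($h)) {
        my $seen = ++$history{$key};
        $max = $seen if $seen > $max;
    }
    return $max;
}

my %article = (
    'NNTP-Posting-Host' => 'host.example.com',
    'From'              => 'someone@example.com',
    'Subject'           => 'Hello',
    'Lines'             => 12,
);

my $copies = 0;
$copies = record_article(\%article) for 1 .. 3;
printf "%d history entries, %d copies seen\n", scalar(keys %history), $copies;
```

A filter built this way would compare the returned count against
maxmultiposts to decide when to start rejecting.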
EMPmaxlife
Same as MD5maxlife but for the header-based EMP filter.
EMPHistSize
Same as MD5HistSize but for the header-based EMP filter. If you are
running the header-based filter but not the MD5 filter for whatever
reason, set this high.
Excessive Crosspost Settings
maxgroups
Reject articles crossposted so that followups will be to more than
this many newsgroups.
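One reading of this rule is sketched below: count the groups in
Followup-To if that header is present, otherwise those in Newsgroups.
followup_group_count is a hypothetical helper, the limit shown is an
example value, and special cases such as "Followup-To: poster" are
ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: count the groups that followups would go to.
# Uses Followup-To when present, otherwise Newsgroups.
sub followup_group_count {
    my (%hdr) = @_;
    my $list = $hdr{'Followup-To'};
    $list = $hdr{'Newsgroups'} unless defined $list && length $list;
    $list = '' unless defined $list;
    my @groups = grep { length } split /\s*,\s*/, $list;
    return scalar @groups;
}

my $maxgroups = 3;    # example value only
my %article = (
    'Newsgroups'  => 'a.one,a.two,a.three,a.four,a.five',
    'Followup-To' => 'a.one',
);

# Five groups in Newsgroups, but followups go to only one.
print followup_group_count(%article) > $maxgroups ? "reject\n" : "accept\n";
```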
low_xpost_maxgroups
Specify a special, lower crosspost limit for certain groups, specified
by regular expression in low_xpost_groups (below). Useful for being
more strict in groups plagued by crossposting, such as sex, binaries,
and jobs groups. (Replaces the old tfjmaxgroups option.)
Misplaced Binaries Filter
block_binaries
Enables blocking of binary posts in non-binary newsgroups. Which newsgroups
allow binaries is configured with bin_allowed (below).
max_encoded_lines
Sets the number of uuencoded or base64-encoded lines to allow before
considering a post to be a binary. This should be set high enough to pass
regular PGP signatures. (Those satanic Netscape crypto-sigs can die along
with the other binaries.) Default is 15 lines, which may be a little low if
you are lenient, which you're not.
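A simplified picture of the counting, for uuencoded data only: a full
uuencoded line is the letter M followed by 60 characters from the
uuencode alphabet. The helper name and the pattern are assumptions for
illustration and are not Cleanfeed's own; whether the limit is inclusive
is also an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: count lines that look like full uuencoded lines
# ("M" followed by 60 characters in the uuencode alphabet).
sub count_uuencoded_lines {
    my ($body) = @_;
    my $count = 0;
    for my $line (split /\n/, $body) {
        $count++ if $line =~ /^M[\x20-\x60]{60}$/;
    }
    return $count;
}

my $max_encoded_lines = 15;                # the default named above
my $uu_line = 'M' . ('A' x 60);            # synthetic full-length line
my $body    = "Some text\n" . ("$uu_line\n" x 20) . "-- \nsig\n";

my $n = count_uuencoded_lines($body);
print "$n encoded lines: ", ($n > $max_encoded_lines ? "binary" : "text"), "\n";
```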
binaries_in_mod_groups
If set, binaries are allowed in spite of block_binaries if they are
posted only to moderated groups (requires active_file).
HTML
block_mime_html
Enables blocking of MIME-encapsulated HTML posts. This does NOT affect
straight text/html or multipart/alternative posts of the type created by
misconfigured Netscape and Microsoft ``newsreaders'', but ONLY posts which
are MIME-encapsulated HTML, a favorite format of sex spammers which
often sneaks in under the EMP radar.
block_html
Enables blocking of HTML and multipart/alternative posts. You can specify
group patterns where HTML is allowed by setting html_allowed (below).
Cancel Message Filtering
block_late_cancels
If turned on, cancels for recently rejected articles will be rejected.
Set the window with MIDmaxlife (below). This will result in a
huge number of rejections if you have multiple full feeds and you
aren't backlogging. If you are concerned about your downstream sites
receiving the cancels, leave this off. If you need a performance boost,
turn it on.
MIDmaxlife
How long to remember rejected message-ids so cancels for these posts can
later be rejected. Specified in hours. This only has an effect if
block_late_cancels is enabled (above).
Disabling Other Filters
do_scoring_filter
Enables the (new) ``scoring'' filter. You probably want to leave this on,
even if you need to turn off aggressive mode (turning off aggressive
mode will disable the content-based parts of the scoring filter).
do_mid_filter (INN only)
Enables the message-id filter. This requires an additional patch to
INN 1.7.2, which is included with Cleanfeed (but optional). The patch
adds a new Perl hook to check message-ids during the NNTP CHECK
transaction, and decide whether to refuse the article. There is a
patch for this for INN 2.0 which may get incorporated into the INN
distribution at some point. The default is off.
do_bot_checks
Enables the filters that check for spam bot signatures. The only reason
you would ever want to turn this off is if you've written your own
version, or something. Otherwise, leave it on.
do_supersedes_filter
Enables the Excessive Supersedes filter, to catch rogue Supersedes
attacks. This filter begins dropping articles with Supersedes headers
if too many appear from the same posting-host in a short time. Moderated
groups are given a higher limit (if active_file is set), as is
news.answers. Default is on.
check_supersedes_path
If set, bad_cancel_paths will also be applied to Supersedes articles.
Articles with Supersedes headers, where a path element matches the regexp
in bad_cancel_paths, will be dropped. Default is on.
drop_useless_controls
If set, all control messages of types sendsys, senduuname, and version
will be dropped. These are no longer useful and are a hole for
denial-of-service attacks due to the way INN and some other servers
handle them. On by default.
drop_ihave_sendme
If set, control messages of types ihave and sendme will be dropped.
See drop_useless_controls. If you use these types of control messages,
turn this off. If you're not sure, then you're not using them.
drop_control_with_supersedes
Drops any and all control messages which contain a Supersedes header.
Since control messages are not passed through the same filters as regular
messages, a rogue Supersedes attack can use control messages to avoid
filtering; this option closes this hole. Legitimate control messages
don't have Supersedes headers. On by default.
Hash-Trimming
trimcycles
The EMP memories are trimmed every trimcycles times through the filter.
EMPstarttrimming
Tells the filter not to waste time trimming the EMP memories until they
have this many entries. Just a minor performance enhancement during
the first hours the filter is running or when you first start innd.
Logging
verbose
When turned on, verbose logging to news.notice will happen; spam domains
will be listed, etc. When off, only general messages will be logged,
making the news.daily summaries less interesting but much shorter and
more to the point. (There is, alas, no way to shut off news.notice
logging entirely; news.notice only applies to INN.) Note that turning
verbose off will not reduce the number of log entries, only their verbosity.
logfile (Standalone Mode)
If set to the path to a file, this will enable logging of message-ids
of all articles processed by the filter. Rejections will be logged
with the reason for rejection. Note that this will create a very large
logfile which you will need to rotate or delete (see max_log_size,
below).
reportfile (Standalone Mode)
If set to the path to a file, this will enable generation of a simple
report of articles accepted and rejected. The report file will contain
one entry per line with the start time, end time, number of articles
accepted, and number of articles rejected, tab-separated.
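A report in this format is easy to post-process. The sketch below parses
one line and works out a rejection percentage. parse_report_line is a
hypothetical helper, the sample line is made up, and the two time fields
are kept as opaque strings because their exact format is not described
here.

```perl
use strict;
use warnings;

# Hypothetical parser for one tab-separated line of the report file.
sub parse_report_line {
    my ($line) = @_;
    chomp $line;
    my ($start, $end, $accepted, $rejected) = split /\t/, $line;
    return {
        start    => $start,
        end      => $end,
        accepted => $accepted,
        rejected => $rejected,
    };
}

# Made-up sample line: start, end, accepted, rejected.
my $entry = parse_report_line("900000000\t900000600\t1200\t300\n");
my $total = $entry->{accepted} + $entry->{rejected};
printf "rejected %.1f%% of %d articles\n",
    100 * $entry->{rejected} / $total, $total;
```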
log_accepts (Standalone Mode)
When using the above logfiles, this setting determines whether articles
accepted should be logged. When disabled, only rejections will be logged.
max_log_size (Standalone Mode)
The size at which to rotate the logfile. This will be replaced by
time-based rotation at some point.
statfile
If this is set to the full path of a file, a crude stats file will be
written each time the filter is reloaded with ctlinnd reload
filter.perl meow (for INN) or whenever the Cleanfeed process receives a
SIGUSR1 (for standalone mode). The file shows how many entries are
present in each of the EMP histories, MID history and excessive
supersedes history; timer information if enabled (see timer_info);
and the contents of all configuration settings. Posting-hosts for
each supersedes entry will be listed, along with their counts; these
are not being rejected unless they are over the threshold. The
default for this is undef, which disables creation of the stat file.
More comprehensive stats are planned for the future.
Timing Info
timer_info
When enabled, Cleanfeed will generate timing statistics telling you
how many articles per second are being examined by the filter and
being accepted by the filter. This information will appear in the
statfile if this is enabled, and in the output of INN's ctlinnd mode
if the mode.patch is applied to INN. Note that the accepted/second
rate is not necessarily the rate at which your server is accepting
articles; articles can be rejected by the server after Cleanfeed
passes them, for example if they are posted to groups not in your
active file.
timer_interval
The period over which to average timing information, in seconds. The
default is 600 seconds, or 10 minutes.
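The averaging amounts to dividing a count by the interval. In the sketch
below the article counts are made up and rate is a hypothetical helper;
the result is rounded to one decimal place, as the ctlinnd mode output is
described later in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: articles per second over an interval, one decimal place.
sub rate {
    my ($count, $seconds) = @_;
    return sprintf "%.1f", $count / $seconds;
}

my $timer_interval = 600;                      # seconds
my ($examined, $accepted) = (4321, 2987);      # made-up counts for one interval

print "examined ", rate($examined, $timer_interval), "/sec, ",
      "accepted ", rate($accepted, $timer_interval), "/sec\n";
```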
Debugging
debug_batch_directory
Specifies a directory where debugging ``batchfiles'' can be written.
See the Hacker's Guide in this document for more information.
debug_batch_size
The maximum size of a debugging batchfile before it gets rotated.
Rotation is done by renaming the file to file.1, file.2, etc.,
using the lowest number that doesn't already exist.
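The naming rule can be sketched as follows. next_rotation_name is a
hypothetical helper and the demonstration runs in a temporary directory;
this is not Cleanfeed's own rotation code.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: find the lowest-numbered name "file.N" not in use.
sub next_rotation_name {
    my ($file) = @_;
    my $n = 1;
    $n++ while -e "$file.$n";
    return "$file.$n";
}

# Set up a scratch directory holding batch, batch.1 and batch.2.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/batch";
for my $name ($file, "$file.1", "$file.2") {
    open my $fh, '>', $name or die "cannot create $name: $!";
    close $fh;
}

my $target = next_rotation_name($file);
rename $file, $target or die "cannot rename $file: $!";
print "rotated to ", substr($target, length($dir) + 1), "\n";    # batch.3
```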
Regular Expressions
You can add to most of these regular expressions in the %config_append
section of cleanfeed.conf; settings you add there will be added to
the defaults, rather than overriding them. If you want to completely
override the default settings you can add entries for these to the
%config_local section instead.
bin_allowed
This is a regular expression telling the anti-binary filter in which
newsgroups binaries are allowed. If all groups in the Newsgroups header
match this pattern, binaries are allowed through the filter. (This
obviously has no effect when the binary filter is disabled.) If the
binary filter is enabled and this is set to a null string (by overriding
the default in the local config) the result will be blocking all binaries
regardless of where they are posted.
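The "all groups must match" rule looks roughly like this. The helper
name and the group pattern are made up for illustration; the pattern is
not the default.

```perl
use strict;
use warnings;

# Hypothetical helper: binaries pass only if every group matches the pattern.
sub binaries_permitted {
    my ($newsgroups, $bin_allowed) = @_;
    # A null pattern means binaries are allowed nowhere.
    return 0 unless defined $bin_allowed && length $bin_allowed;
    for my $group (split /\s*,\s*/, $newsgroups) {
        return 0 unless $group =~ /$bin_allowed/;
    }
    return 1;
}

my $bin_allowed = '^example\.binaries\.';    # made-up pattern, not the default

# Every group matches: allowed.
print binaries_permitted('example.binaries.pics,example.binaries.sounds', $bin_allowed), "\n";
# One group does not match: blocked.
print binaries_permitted('example.binaries.pics,example.chat', $bin_allowed), "\n";
```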
poison_groups
If any groups in the Newsgroups header match this regexp, the article
will be rejected. Thus you can reject crossposts to certain groups even
if they are also posted to groups you carry.
html_allowed
This is a regular expression telling the anti-HTML filter in which
newsgroups HTML and multipart/alternative posts are allowed. This
only has an effect if block_html is turned on (above). The default
(to allow HTML in microsoft.* groups) can be added to in cleanfeed.conf.
If you don't want to allow HTML anywhere, not even the microsoft.*
groups, override this setting in the local configuration and set it
to a null string or undef.
md5exclude
If an article is posted only to groups matching this regexp, the MD5 EMP
filter will not be applied. Useful for ``test'' groups where it's okay
for lots of the posts to be the same.
allexclude
If an article is posted only to groups matching this regexp, NO checks
are applied at all.
low_xpost_groups
If a group matches this regular expression, it gets a special crosspost
limit, set in low_xpost_maxgroups, rather than the general crosspost
limit set in maxgroups. This is useful for groups plagued by excessive
crossposting, such as sex, binaries, and jobs groups. The default is
to limit crossposts to 6 groups in test, forsale, and jobs groups.
Setting this to a null string, or undef, will disable this feature.
badguys
This is a monster regular expression containing domains of known spammers.
Only the ``middle'' part of the domains are listed; these are checked as
email addresses in From headers by appending a list of top-level domains
to the end, and as URLs by prepending http:// and an optional ``www.''. If
you modify this list, be very careful not to end up with ``||'' in there
(two ``or'' signs in a row); this will match every single post that comes
through, which is Bad.
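The reason is that an empty alternative in a regular expression matches
the empty string, which is found in every text. A short demonstration,
using made-up entries rather than anything from the real list:

```perl
use strict;
use warnings;

my $good = 'examplespam|junkexample';     # made-up entries
my $bad  = 'examplespam||junkexample';    # empty alternative between the bars

my $innocent = 'an ordinary article about gardening';

# Only the second line is printed: the empty alternative matches anything.
print "good pattern matches\n" if $innocent =~ /$good/;
print "bad pattern matches\n"  if $innocent =~ /$bad/;
```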
baddomainpat
If a post contains a URL for a site whose domain name matches this
pattern (in .com, .net, and .nu TLDs only) the post will be rejected.
For example, there are hundreds of spamming porn sites whose domain names
begin or end with ``xxx''. This prevents us from having to keep up with
their nonsense. Yes, it's a little aggressive, but it works.
exempt
Regular expression of NNTP-Posting-Hosts that are exempt from the
posting-host-based EMP filter. This is for high-output systems where
all posts contain the same NNTP-Posting-Host header, such as AOL, which
if not exempted would end up hitting the posting-host EMP filter with
all of their posts. There aren't many of these out there; a ``regular''
multi-user system does not present a problem because the filter doesn't
kick in until it sees a large number of posts from the same posting-host
and also of the same length, in a short period of time.
supersedes_exempt
Regular expression of NNTP-Posting-Hosts that are exempt from the
excessive supersedes filter. Generally this will be systems which
post a lot of FAQs.
bad_cancel_paths
Cancel messages will be rejected if the Path header contains elements
matching this regular expression. Also applied to the NNTP-Posting-Host.
If check_supersedes_path is set, this will also be checked against
the Path header of articles with Supersedes headers. This list contains
sites which are or have recently been the source of rogue cancel attacks.
refuse_messageids (INN only)
If you have do_mid_filter (above) enabled, and you have the optional
message-id patch applied to INN (or otherwise have obtained the hook
for filter_messageid in INN 2.0), this regular expression will be applied
to message-ids as they are offered to your server, and they will be
refused if it matches.
net_abuse_groups
spam_report_groups
These regular expressions are used to exempt certain groups from certain
filters; for example, groups expected to contain spam reports, example
spams, NoCeM notices, etc. These are not in cleanfeed.conf; if you
need to add to them please let me know.
After modifying the filter file, always check for mistakes by typing:
perl -cw filter_innd.pl (or cleanfeed or whatever you called it)
There should be no errors and no warnings.
You can check cleanfeed.conf with:
perl -cw cleanfeed.conf
You will get several warnings about variables being used only once;
these can be ignored.
If you are running INN, you can modify the file and reload it with
ctlinnd reload filter.perl meow while the server is running. The
configuration in cleanfeed.conf will be reloaded at this time as
well.
With the Highwind servers, modifying the program will require a server
restart (use the bin/restart script). Note that this will result in
all connections (including newsreader clients) being dropped. This
is not my fault. :)
When in standalone mode, configuration from cleanfeed.conf can be
reloaded by sending Cleanfeed a SIGHUP.
I have no idea what NNTPRelay does, but I'm guessing it needs a restart
as well.
IMPORTANT NOTE: A common mistake is not setting file permissions on
cleanfeed/filter_innd.pl, cleanfeed.conf, and cleanfeed.local so that
they are readable by the news user. Please double-check your permissions!
If Cleanfeed is running, and fails to successfully load cleanfeed.conf,
it will use the default settings instead of those you specified in the
config file.
INSTALLATION - INN
These instructions assume you have the Perl hooks compiled into INN.
If you don't, you will need to add them and rebuild the INN distribution
before proceeding.
With INN, Perl is embedded into the innd program. The filter file
defines subroutines that are called by innd at the appropriate times.
SYSTEM REQUIREMENTS
In order to run Cleanfeed with INN, you will need:
* INN 1.5.1 or later (1.7.2+insync1.1d or 2.1 recommended)
INN 2.0 includes everything you need to run Cleanfeed, except the MD5
Perl module.
With earlier versions, Cleanfeed requires some patches to INN in order
to function properly.
If you are running INN 1.7.2+insync1.1d, you already have the original
filter.patch and the dynamic-load.patch; you need only apply the
upgrade.patch.
None of these patches are against INN 2.1; the ``extra feature'' ones
like mode.patch may not apply to 2.1. Ports are always welcome.
filter.patch
This patch provides the basic functionality for Cleanfeed by making some
extra headers available to the Perl filter, as well as message bodies.
This patch was changed in version 0.95.3. It is against INN 1.7.2 and
should be applied in the innd directory. This patch is included in the
insync ``megapatch'' for INN as of version 1.1c, so if you are running this
version of INN you need not apply this patch. Not necessary for INN 2.x.
dynamic-load.patch
This patch enables INN's Perl interpreter to load dynamic modules. It is
necessary for MD5 support. The patch is against INN 1.7+insync and should
be applied in the lib directory (NOT the innd directory). It applies cleanly
to other versions of INN including 1.5.1 and 1.7.2. This patch is included
in the insync ``megapatch'' for INN as of version 1.1d, so if you are running
this version of INN you need not apply this patch. Not necessary for INN 2.x.
If you are still using INN 1.5.1, you can use dynamic-1.5.1.patch instead.
In order to compile INN with the new patch, you need to edit the PERL_LIB
entry in config.data. Type this command at the shell, and paste its output
into config.data as PERL_LIB:
perl -MExtUtils::Embed -e ldopts
Most systems also allow you to simply enter that line in backquotes as PERL_LIB.
This patch requires Perl 5.004 or later! INN will not compile linked with
Perl 5.003 after following these instructions!
AIX: There is a problem with Perl dynamic loading from INN under the
AIX operating system. In simple terms, it doesn't work. This seems to
be a problem with the gcc compiler. Success has been reported by
rebuilding both Perl and INN with IBM's commercial compiler CSet
(a.k.a. xlC).
Solaris: There have been multiple reports of Cleanfeed not working
under Solaris if any part of the system -- INN, Perl, or the MD5 module --
is compiled using egcs. Success has been reported by recompiling
everything with gcc, and by upgrading to the very newest egcs.
upgrade.patch
For current users of Cleanfeed, this is a patch for an already-patched
INN, or for 1.7.2+insync1.1d, to bring you up to the new version of the
Cleanfeed patch. Not applying this patch right now will only lose you a
couple of filters, and nothing will break if you don't apply it (no
changes to the filter source or configuration will be required).
messageid.patch
This is a patch which adds a new Perl hook to innd, filter_messageid.
This allows you to run a Perl subroutine against each message-id as
it is offered to your server, and decide whether to refuse the article
before it is even sent to your server. Cleanfeed includes a small
filter_messageid. This patch is entirely optional.
mode.patch
This patch adds a line to INN's ctlinnd mode output for Perl filter
status. The output line is generated by the filter_stats subroutine.
The default output contains the number of articles accepted, rejected
and refused since the filter started, and the sizes of the EMP,
Message-ID, and Excessive Supersedes hashes. If timer_info is enabled,
this will also include the rate in articles per second (rounded to the
nearest tenth) at which articles were examined (total sent through the
filter) and accepted by the filter, averaged over the timer_interval
number of seconds.
After applying the patches, rebuild all of INN and do a ``make update''.
The first patch (filter.patch) only requires innd to be rebuilt, but
the dynamic-load.patch requires you to rebuild the whole distribution.
Current users upgrading with upgrade.patch need only rebuild innd and
reinstall that executable.
Thus:
cd inn [to the top-level source directory]
make clean
cd innd
cp wherever/filter.patch . [from the Cleanfeed distribution]
patch <filter.patch
cd ../lib
cp wherever/dynamic-load.patch . [from the Cleanfeed distribution]
patch <dynamic-load.patch
cd ../config
emacs config.data [edit the PERL_LIB entry as above]
make all
make update
Finally, you need to install the MD5 Perl module, no matter what version of
INN you are running.
INSTALLING CLEANFEED - INN
In INN 1.7.2 and earlier, the location where INN looks for the Perl filter
is set in config.data, as _PATH_PERL_FILTER_INND. By default, the
filename is filter_innd.pl. The Cleanfeed filter program file should
be installed in this location. INN comes with an example filter_innd.pl
file; move this file (or whatever other filter is in place) out of the way
first.
Before putting the filter in place, edit the file, changing $config_dir
to the location of your cleanfeed.conf file.
After editing the file, always check for errors with the command:
perl -cw filter_innd.pl
Once the file is in place, tell innd to reload it:
ctlinnd reload filter.perl meow
And, if Perl filtering is currently disabled, enable it:
ctlinnd perl y
Now, you can watch it working by looking at your news.notice log:
tail -f /var/log/news/news.notice
If your server is running a full feed, you should start seeing a
constant stream of rejections almost immediately.
INSTALLATION - HIGHWIND SERVERS
The various Highwind server packages (Cyclone, Typhoon, and Breeze)
all have the same external filter interface. The filter runs as
its own process, reading from standard input and writing to standard
output.
SYSTEM REQUIREMENTS
In order to run Cleanfeed with a Highwind server, you will need:
The Cleanfeed program file should be installed as ``cleanfeed'' in your
news server's bin directory (cyclone/bin, etc). Make it owned by
news:news and make it executable.
Before putting the filter in place, edit the file, changing $config_dir
to the location of your cleanfeed.conf file. Also ensure that the
shebang line (the first line of the file, starting with #!) points to
the correct location of your perl executable.
After editing the file, always check for errors with the command:
perl -cw cleanfeed
There should be no warnings.
Now, edit your bin/start script. You need to add two options to the
command line that starts up the server process, the -program option to
tell it what program to use as a filter, and the -body option to tell
it to send the bodies as well as the headers.
typhoond -program /typhoon/bin/cleanfeed -body
...along with whatever else you have cluttering up the command line.
(Highwind has indicated that this may/will be a config file option
in a future release.)
Now you can restart the server with the bin/restart script. Check
to make sure Cleanfeed is running, with ``ps -ef'' or ``top''. If
Cyclone/Typhoon is unable to start the filter for some reason, it will
log an error via syslog. The error will not be terribly helpful.
You can make Cleanfeed reload its configuration from cleanfeed.conf
and local code from cleanfeed.local by sending it a SIGHUP.
INSTALLATION - NNTPRELAY
Please note that I do not have an NNTPRelay server, nor access to one,
nor much interest in mucking around with Windows NT, and thus I have
not tested the NNTPRelay filtering support myself. The necessary changes
and notes were contributed by someone else. Additions and improvements
to this documentation would be most welcome.
The filter interface in NNTPRelay is pretty much the same as in the
Highwind servers.
SYSTEM REQUIREMENTS
In order to run Cleanfeed with NNTPRelay, you will need:
An NT binary release of Perl 5.004, which apparently includes the MD5
module, can be found at:
http://www.perl.com/CPAN/ports/win32/Standard/x86
The MD5 module (in source code) can be found at:
http://www.perl.com/CPAN-local/modules/by-module/MD5/
INSTALLING CLEANFEED - NNTPRELAY
Before putting the filter in place, edit the file, changing $config_dir
to the location of your cleanfeed.conf file.
Install the Cleanfeed program file wherever is appropriate on
your system, as ``cleanfeed.pl''. Edit NNTPRelay's config.txt
file, adding an entry like this:
Of course, use the correct path to your Perl executable and to
the Cleanfeed program file. Now restart NNTPRelay. If you
defined a logfile in Cleanfeed, it should appear.
THE HACKER'S GUIDE
Cleanfeed will look for a file called cleanfeed.local, in the same
directory as cleanfeed.conf. If this file exists, it will be loaded
and evaluated as Perl code right after the config file. This enables
you to provide your own local filter code which will survive an upgrade
of the main Cleanfeed source.
It will be reloaded when the filter is reloaded with ctlinnd reload
filter.perl meow (for INN), or when configuration is reloaded with a
SIGHUP (in standalone mode). This means that you can modify the running
code without restarting Cleanfeed.
cleanfeed.local can define a number of different subroutines, which,
if defined, will be called at various points in the filter process.
Other subroutines can, of course, be defined as required by your code.
The file is simply re-evaluated each time. So, if you remove a subroutine
from the file completely, that subroutine will remain defined after the
reload, because nothing replaced it. You will need instead to define it
as an empty subroutine, or explicitly undef it, to make it go away.
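The effect can be reproduced in a few lines of plain Perl. The two eval
calls below stand in for loading cleanfeed.local twice; they are only a
simulation of the reload, not how Cleanfeed itself loads the file.

```perl
use strict;
use warnings;

# First load: the local file defines a hook.
eval q{ sub local_filter_last { return "rejected by old code" } 1 } or die $@;

# Second load: the hook was deleted from the file, so nothing redefines it.
eval q{ 1 } or die $@;
print "still defined after reload\n" if defined &local_filter_last;

# Either of these lines in cleanfeed.local deals with the leftover hook:
#   sub local_filter_last { return "" }    # empty hook, accepts everything
#   undef &local_filter_last;              # remove the definition
undef &local_filter_last;
print "gone\n" unless defined &local_filter_last;
```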
STUFF YOU CAN DEFINE
Cleanfeed will call the following subroutines, if they are defined.
See the section on return values for instructions on what your code
should return.
local_config
This is called after configuration is loaded, each time. It will be
called when the filter is reloaded (with INN) or when configuration
is reloaded with SIGHUP (running standalone), as well as when the
filter is first run. No return value is expected.
local_filter_before_emp
Called for each (non-control) article, before any other filters.
General-purpose spam filters shouldn't go here, because you really
want to populate the EMP hashes first.
local_filter_after_emp
Called for each (non-control) article, after the EMP filters but
before any other filters.
local_filter_middle
Called for each (non-control) article, after the ``simple'' filters
but before the ``expensive'' body checks.
local_filter_scoring
Called during the scoring filter. Return the value, positive or
negative, by which to adjust the article's score.
Warning: Here there be dragons! If you're going to play with
this please examine the existing source, and use the debugging
routines to watch what you're doing.
local_filter_last
Called for each (non-control) article, after all other filters
are done.
local_filter_cancel
Called for all cancel control messages.
local_filter_newrmgroup
Called for all newgroup and rmgroup control messages.
RETURN VALUES
The general filtering subroutines you can define (local_filter_before_emp,
local_filter_after_emp, local_filter_middle, local_filter_last,
local_filter_cancel, and local_filter_newrmgroup) are expected to
return a value indicating whether you want to accept the article being
examined. If the article is okay, you should return "" (empty string),
in which case filtering will proceed as usual. If you want to reject the
article, you return any other string, which will be used as the reason.
The rejection code actually expects two return values -- the first string
is the ``verbose'' rejection message, and the second is the ``non-verbose''
message (see the verbose configuration option). If only one is
supplied, it will be used for both purposes.
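As a sketch of the convention just described, here is a filter that
rejects on a subject pattern and returns both forms of the reason. The
pattern and the wording of the messages are made up for illustration;
%hdr is described under WHAT YOU GET below.

```perl
# cleanfeed.local -- sketch only
use strict;
use vars qw(%hdr);

sub local_filter_middle {
    # Hypothetical pattern, for illustration only.
    if (defined $hdr{'Subject'} && $hdr{'Subject'} =~ /make money fast/i) {
        # First string: verbose reason. Second string: non-verbose reason.
        return ("Local: bad subject ($hdr{'Subject'})", "Local: bad subject");
    }

    # Empty string: the article is okay, carry on filtering.
    return "";
}
```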
The scoring filter calls local_filter_scoring, which is expected
to return the value, positive or negative, by which the article's score
should be adjusted.
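A scoring hook might look something like the sketch below. The weights
and the two tests are invented for illustration and are not tuned against
Cleanfeed's own scoring, so read the existing scoring code before using
anything like this. @groups and %hdr are described under WHAT YOU GET
below.

```perl
# cleanfeed.local -- sketch only
use strict;
use vars qw(%hdr @groups);

sub local_filter_scoring {
    my $adjust = 0;

    # Hypothetical weights: wide crossposts look worse,
    # followups within a thread look better.
    $adjust += 2 if scalar(@groups) > 5;
    $adjust -= 1 if defined $hdr{'References'};

    # A single number: the amount to add to the article's score.
    return $adjust;
}
```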
WHAT YOU GET
Your subroutines get information about the article in several variables.
%hdr
A hash containing the article headers. The key is the header name, in
``canonical'' case as INN likes them; the value is the content of the header.
When running under INN, only headers known to INN will be included in the
hash (which includes any header used anywhere in Cleanfeed). In standalone
mode, all headers will be present, but only the known headers will be sent
in canonical case; others will have the header name (and thus hash key) in
whatever case they are in the article itself, making them difficult to find
and use consistently.
The message body is in this hash under the key __BODY__. If running INN
2.x with storageapi, it will be provided in wireformat, with lines
terminated in \r\n rather than just \n. With the traditional spool
format (and in all cases with INN prior to 2.x) lines will be terminated
only with \n.
Examples:
To get the Subject header as a scalar: $hdr{'Subject'}
To get the entire message body as a scalar: $hdr{'__BODY__'}
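Because the line endings depend on how the server stores articles, local
code that looks at the body line by line may want to normalize them
first. A small sketch follows; local_body_lines is a helper name I made
up, not something Cleanfeed provides.

```perl
# cleanfeed.local -- sketch only
use strict;
use vars qw(%hdr);

# Hypothetical helper: return the body as a list of lines,
# whether it arrived in wire format (\r\n) or with plain \n.
sub local_body_lines {
    my $body = defined $hdr{'__BODY__'} ? $hdr{'__BODY__'} : '';
    $body =~ s/\r\n/\n/g;
    return split /\n/, $body;
}
```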
%lch
A hash containing lowercased versions of some of the article headers.
The hash keys are the header names in all lowercase; the values are the
contents of the headers, with all letters forced to lowercase.
Currently, the only headers added to this hash are From, Organization,
Subject, Content-Type, X-Newsreader, X-Newsposter, Message-ID, and Sender.
This hash is not available to local_filter_before_emp.
@groups
An array containing the newsgroups the article is posted to (from the
Newsgroups header). You can find out how many groups the article is
crossposted to with ``scalar @groups''.
@followups
An array containing the newsgroups to which followups are set (from the
Followup-To header). If the article has no Followup-To header, this
array will be identical to @groups. You can find out how many groups
followups are set to with ``scalar @followups''. This is the preferred
way to limit crossposting, because limiting only by the Newsgroups
header will catch FAQs and such.
$lines
The number of lines in the message body. This is not taken from the Lines
header as that can be client-supplied to fool filtering; this is determined
by counting the lines in the message body.
%gr
A hash containing information about the groups the article is posted
to. This isn't very straightforward and may not be useful to you, but
I'm including it in this documentation for completeness. The following
entries may be present in this hash:
$gr{'net'} - the number of net.* (Usenet II) newsgroups the article is
posted to, if any.
$gr{'other'} - the number of non-net.* groups the article is posted to.
$gr{'md5skip'} - true if the article should be exempted from the MD5
body checks (if all newsgroups match the regexp in md5exclude).
$gr{'binary'} - true if the article is posted only to groups where
binaries are allowed (if all newsgroups match bin_allowed).
$gr{'html'} - true if the article is posted only to groups where html
is allowed (if all newsgroups match html_allowed).
$gr{'poison'} - number of `poison' newsgroups this article is posted
to (matching poison_groups). If this is present, you'll only see this
entry in local_filter_before_emp and local_filter_after_emp because
the article will be rejected after that.
$gr{'abuse'} - number of `net abuse' newsgroups this article is posted
to (matching net_abuse_groups).
$gr{'reports'} - number of `spam reports' newsgroups this article is
posted to (matching spam_report_groups).
$gr{'low_xpost'} - number of `low crosspost limit' groups this article
is posted to (matching low_xpost_groups).
$gr{'mod'} - number of moderated groups this article is posted to
(requires that Cleanfeed have an active file).
$gr{'allmod'} - true if this article is posted only to moderated groups.
$gr{'faq'} - true if this article is crossposted to news.answers.
%config
A hash containing all configuration options.
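To show how these variables fit together, here is a sketch of a local
crosspost limit based on @followups, with an exemption taken from %gr.
The limit of 5 and the reason strings are arbitrary choices of mine, and
$gr{'allmod'} is only meaningful if Cleanfeed has an active file.

```perl
# cleanfeed.local -- sketch only
use strict;
use vars qw(@followups %gr);

sub local_filter_after_emp {
    my $max_followups = 5;    # hypothetical local limit

    # Leave articles alone if they go only to moderated groups.
    return "" if $gr{'allmod'};

    my $n = scalar @followups;
    if ($n > $max_followups) {
        return ("Local: followups set to $n groups",
                "Local: excessive followups");
    }
    return "";
}
```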
DEBUGGING
When you make filtering changes, you should always check the results for
false positives. I've provided two subroutines to help you do this:
writeheaders() and writefull().
First, make sure debug_batch_directory is set in your configuration.
Set this to a directory that is writable by the news user.
Call either of these subroutines with one argument, the basename of the
batch file you want to write the current article to. writeheaders
will dump the article's headers out to the file (with INN this will only
give you the known headers). writefull will dump the full article,
headers (again, known headers with INN) and body. The file will be
rotated if it becomes larger than debug_batch_size, set in your
configuration. The rotation is simple: a number is appended to the end
of the file name and incremented until the filename does not exist. You'll
have to delete the old files yourself.
When testing a new filter, simply call writeheaders ("batchfile") or
writefull ("batchfile") when you're going to reject an article.
Then you can look at the file to make sure you're doing what you think
you're doing.
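For instance, a new filter under test could log what it catches like
this. The subject pattern and the batch file name "local-subject" are
placeholders; writeheaders is the Cleanfeed routine described above, so
this only runs inside the filter.

```perl
# cleanfeed.local -- sketch only
use strict;
use vars qw(%hdr);

sub local_filter_last {
    # Hypothetical pattern being tried out.
    if (defined $hdr{'Subject'} && $hdr{'Subject'} =~ /\btest pattern\b/i) {
        # One argument: the basename of the batch file, which is
        # written under debug_batch_directory.
        writeheaders("local-subject");
        return "Local: test pattern";
    }
    return "";
}
```

Swap in writefull if you need to see the bodies as well.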
SIGNALS
When running under Cyclone, Typhoon, Breeze, or NNTPRelay (standalone
mode), Cleanfeed will catch SIGHUP, and reload its configuration from
cleanfeed.conf. It will also reload and reevaluate cleanfeed.local
if you're using it. Note that, unlike with INN, there is no way to reload
the filter code itself without restarting the server.
Cleanfeed in standalone mode will also catch SIGUSR1 and write its crude
current-status file (see statfile in the config section) on the next
cycle through the filter.
(I honestly don't know if SIGUSR1 and SIGHUP are things which exist on NT
for NNTPRelay.)
I can't possibly mention everyone who has submitted ideas or fixes
for the filter, but I'd like to acknowledge the substantial
contributions of several people: Danhiel Baker, Frank Copeland,
Brian Moore, John Payne, Russ Allbery, David Riley, and SeokChan LEE.
Thanks, guys.
dynamic-load.patch is from Piers Cawley.
The body-filtering portion of the INN filter.patch is from Jeff Garzik.
messageid.patch is from Ed Mooring.
mode.patch is from John Payne.
COPYRIGHT
Copyright 1997-1998 by Jeremy Nixon, All Rights Reserved.
LICENSE
This software may be distributed freely, provided it is intact (including
all the files from the original archive). You may modify it, and you
may distribute your modified version, provided the original work is
credited to the appropriate authors, and your work is credited to you
(don't make changes and pass them off as my work), and that you aren't
charging for it.