
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Tuning bogofilter</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</head>
<body>
<h1>TUNING BOGOFILTER'S ROBINSON-FISHER METHOD -- an updated
HOWTO</h1>

<address>(<a href="mailto:glouis@dynamicro.on.ca">Greg Louis</a>,
September 2004)</address>

<p>NB: Bogotune is a tool (shipped with bogofilter) that automates
the tuning process. Its "full search" mode performs a
five-dimensional grid search over possible values of the parameters
to be described below, and comes up with recommendations for
optimal settings. There's also a "partial search" mode that is only
three-dimensional. If you have enough spam and nonspam messages (at
least 2,500 of each), using bogotune is highly recommended for
optimizing bogofilter's accuracy.</p>

<h2>CONTENTS</h2>

<ul>
<li><a href="#intro">Introduction</a></li>

<li><a href="#robx">Robinson's x</a></li>

<li><a href="#robs">Robinson's s</a></li>

<li><a href="#md">The minimum deviation</a></li>

<li><a href="#esf">Effective size factors</a></li>

<li><a href="#co">The spam and nonspam cutoffs</a></li>

<li><a href="#bogot">Overview of bogotune</a></li>

<li><a href="#freq">How often to tune</a></li>

<li><a href="#train">A note on training</a></li>
</ul>

<p><a name="intro"></a></p>

<h2>INTRODUCTION</h2>

<p>The bogofilter program has evolved through three classification
methods: the original as proposed by Paul Graham and implemented in
bogofilter by Eric S. Raymond; a variation proposed by Gary
Robinson and implemented in bogofilter by Greg Louis; and a further
variation, also proposed by Gary Robinson, which uses Fisher's
method <cite>(Fisher, R. A., 1950: Statistical Methods for Research
Workers, pp. 99ff.&nbsp; London: Oliver and Boyd)</cite> of
combining probabilities; for bogofilter, your author has
implemented this one too.&nbsp; Recently, Gary Robinson described a
further improvement, the application of effective size factors in
the scoring process; this is available by default in bogofilter,
but since it's relatively new, an option exists in both bogofilter
and bogotune to switch it off.</p>

<p>Each of Gary Robinson's proposed classification methods works
better than the earlier versions.&nbsp; For optimal results, they
(and the original) require some tuning.&nbsp; As distributed,
bogofilter attempts to supply good starting values for the tunable
parameters.&nbsp; Since the optimum values depend on the size and
content of the wordlists at <em>your</em> installation, the best
results can only be determined by some experimentation using
<em>your</em> wordlists.&nbsp; After several thousand each of spam
and nonspam messages have been fed to bogofilter for training, this
experimentation can be well worthwhile: you may be able to cut the
number of spams that are still getting through by more than
half.</p>

<p>The purpose of this document is to explain how to adjust
bogofilter's parameters for best spam filtering.&nbsp; A manual
tuning process is described; though you'll be wise to use bogotune
to help find your parameter values, understanding the manual
process will help you make sense of what bogotune does.</p>

<p>With Robinson's changes as implemented in bogofilter, there are
seven (five without effective size factors) things to tune, six (or
four) of which are highly interdependent, as explained below:</p>

<p><a name="robx"></a></p>

<h2>ROBINSON'S x</h2>

<p>First off, you should determine the value of x appropriate to
your training database.</p>

<p>The way bogofilter works, summarizing briefly, is that the
message being classified is separated into "tokens" -- words, IP
addresses and other logical units of information.&nbsp; Each token
is looked up in the wordlist that makes up the training
database.&nbsp; The number of times it's been seen in a spam
message is divided by the total number of times it's been seen, and
that gives an indication of the likelihood that the token is in a
spam message.&nbsp; The likelihood estimates for all the tokens are
combined to give a score between 0 and 1 -- 0 means the message is
not likely to be a spam, 1 means it is.&nbsp; That's fine, but what
happens if a token's never been seen before? That's where x comes
in: it's a "first guess" at what the presence of an unknown token
means, in terms of its contribution to the score.&nbsp; It's the
value used as the likelihood estimate when a new token is
found.</p>

<p>Obtaining a value for x is easy: bogoutil will calculate it for
you.</p>

<p>Assuming your bogofilter wordlist is in ~/.bogofilter, run</p>

<div style="margin-left: 2em">
<pre>
  bogoutil -r ~/.bogofilter
</pre>
</div>

<p>This will print out an x value. If it's in the range of 0.4 to
0.6, you can run&nbsp;
<code>bogoutil&nbsp;-R&nbsp;~/.bogofilter</code>&nbsp; to install
the calculated value so bogofilter will use it from then on.</p>

<p>The value of x that bogoutil calculates is just the average
of</p>

<div style="margin-left: 2em">
<pre>
  p(w) = badcount / (goodcount + badcount)
</pre>
</div>

<p>for every word in your training set that appears at least 10
times in spam and/or ham counts in your wordlist (i.e.
badcount&nbsp;+&nbsp;goodcount&nbsp;&gt;=&nbsp;10).</p>

<p>To be honest, that's an oversimplification, for the sake of
explaining the basic concept clearly.&nbsp; In real life, you have
to scale the counts somehow.&nbsp; If you had exactly the same
number of spam and nonspam messages contributing to your wordlist,
the formula for p(w) given above would be ok; but we actually have
to use</p>

<div style="margin-left: 2em">
<pre>
  p(w) = (badcount / badlist_msgcount) /
         (badcount / badlist_msgcount + goodcount / goodlist_msgcount)
</pre>
</div>

<p>where badlist_msgcount is the number of spam messages that were
fed into the training set, and goodlist_msgcount is the number of
nonspams used in training.</p>

<p>An equivalent way of calculating x, which is a little easier to
read, is:</p>

<div style="margin-left: 2em">
<pre>
  scalefactor = badlist_msgcount / goodlist_msgcount
  p(w) = badcount / (badcount + goodcount * scalefactor)
</pre>
</div>

<p>In either case, x is the average of those p(w) values.</p>
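
<p>To make the arithmetic concrete, here is a minimal Python sketch
of that calculation.&nbsp; The wordlist entries and message counts
are invented for illustration; in practice, bogoutil does this for
you over the real database.</p>

<div style="margin-left: 2em">
<pre>
# Illustrative sketch only: compute Robinson's x as the mean p(w)
# over tokens seen at least 10 times, scaling for unequal numbers
# of spam and nonspam training messages.  All counts are invented.
badlist_msgcount = 4000    # spam messages trained
goodlist_msgcount = 5000   # nonspam messages trained
scalefactor = badlist_msgcount / goodlist_msgcount

# token: (badcount, goodcount)
wordlist = {
    "valium":  (120, 1),
    "meeting": (3, 250),
    "offer":   (90, 60),
    "the":     (3000, 4000),
    "rare":    (4, 3),     # skipped: badcount + goodcount &lt; 10
}

pw_values = [bad / (bad + good * scalefactor)
             for bad, good in wordlist.values()
             if bad + good >= 10]
x = sum(pw_values) / len(pw_values)
print("calculated x = %.4f" % x)
</pre>
</div>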

<p>The calculated x is just a first guess, and it may be
worthwhile experimenting (after tuning s and the minimum deviation
as described below) to see what happens if you adjust it upward or
downward within a range of about 0.1 either way. <a name=
"robs"></a></p>

<h2>ROBINSON'S s</h2>

<p>With a count of zero for a given token, we have only x to go on,
so that's what we use as the likelihood estimate for that
token.&nbsp; The question then arises, what if we have seen the
token before, but only a few times? Statistical variation will
result in the ratio of two small numbers being rather unreliable,
so perhaps we ought to compromise between that ratio and our "first
guess" value.&nbsp; This is what the Robinson method does.</p>

<p>The compromise works like this: a parameter s is defined, that
serves as a weighting factor; the larger the value of s, the more
importance is given to x in the presence of low token counts.&nbsp;
The token count is&nbsp;
<code>n&nbsp;=&nbsp;badcount&nbsp;+&nbsp;goodcount</code>&nbsp; and
the p(w) value for the token is modified as follows:</p>

<div style="margin-left: 2em">
<pre>
  f(w) = (s * x + n * p(w))/(s + n)
</pre>
</div>

<p>As you see, if n is zero, f(w) is just x, as we have been saying
all along; but if n is nonzero and small, the s&nbsp;*&nbsp;x term
will influence the f(w) value.</p>
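
<p>Here is the same idea as a small Python sketch (function and
parameter names are invented for illustration).&nbsp; It also shows
why a very small s is dangerous: a token seen once in spam and never
in nonspam has a p(w) of exactly 1, and its f(w) races toward 1 as
s shrinks.</p>

<div style="margin-left: 2em">
<pre>
# Illustrative sketch only: Robinson's f(w), which shrinks p(w)
# toward the "first guess" x when the token count n is small.
def f_w(badcount, goodcount, scalefactor=1.0, s=1.0, x=0.52):
    n = badcount + goodcount
    if n == 0:
        return x                   # unseen token: x alone
    pw = badcount / (badcount + goodcount * scalefactor)
    return (s * x + n * pw) / (s + n)

# One spam occurrence, no nonspam occurrences: p(w) is exactly 1.
for s in (1.0, 0.1, 0.01, 1.0e-8):
    print("s = %-7g  f(w) = %.6f" % (s, f_w(1, 0, s=s)))
</pre>
</div>

<p>With s at 1 the estimate stays at 0.76, but by
s&nbsp;=&nbsp;1.0e-8 it is effectively 1, which is what makes
scoring erratic at tiny s, as described below.</p>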

<p>Parameters s and x are only important when the counts are small,
and the value of s reflects what we think of as "small." If s is
large, then when counts are small we trust our x value more than we
do the p(w); if s is small, we give more weight to p(w) and less to
x.</p>

<p>So how big should s be? Gary Robinson suggested, on a
theoretical basis, that we start with a value of 1; it might be
worth trying values in the range of 0.01 to 10, though I've never
had good results with values larger than 1.&nbsp; Making s smaller
than 0.01 or so is a bad idea, because of what happens when a token
is encountered that's been seen in spam before, but never in
nonspam, or vice versa.&nbsp; In that case, p(w) is exactly 1 or 0,
and f(w) will vary greatly as the value of s diminishes.&nbsp; As
an example, one spam that had about ten such tokens, out of 78 that
contributed to the spam score, scored 0.999 with s set to 0.001,
and was found to score 0.505 when s was 1.0e-8!</p>

<p>Choosing a value for s is a matter of trial and error.&nbsp;
Experience suggests that 0.1 might be a better starting value than
1, at least if your training database is moderately large.</p>

<p><a name="md"></a></p>

<h2>THE MINIMUM DEVIATION</h2>

<p>MIN_DEV is another thing you might need to vary.&nbsp; Paul
Graham's original method was based on looking only at the fifteen
tokens with p(w) values farthest from 0.5 (nearest to 0 or
1).&nbsp; We don't do it that way; instead, we set a minimum
deviation from 0.5 (Gary Robinson coined the term "exclusion
radius" for this parameter), and look at all the tokens with f(w)
values farther away than that.&nbsp; If the minimum is set to zero,
every token in the message contributes its spammishness value f(w)
to the final calculation.&nbsp; It might save time, and perhaps
improve discrimination, to ignore tokens with f(w) values near 0.5,
since those tokens obviously make less difference to the outcome of
the calculation.&nbsp; It seems helpful, at least once the training
database is a good size, to use a MIN_DEV value between 0.3 and
0.46.&nbsp; You might try 0.35 initially; one experiment suggests
that even 0.44 might be a good value, but that may not work for
everyone.&nbsp; In fact, some people are likely to find that quite
a small value of MIN_DEV (around 0.05) works best.</p>
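
<p>In code terms the selection rule is a one-liner; a Python sketch
(the f(w) values are invented):</p>

<div style="margin-left: 2em">
<pre>
# Illustrative sketch only: keep a token for scoring just when its
# f(w) lies outside the "exclusion radius" around 0.5.
min_dev = 0.35
fw_values = [0.01, 0.48, 0.52, 0.63, 0.97, 0.99]   # invented

used = [fw for fw in fw_values if abs(fw - 0.5) >= min_dev]
print(used)    # [0.01, 0.97, 0.99]: near-0.5 tokens are ignored
</pre>
</div>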

<p>Note that higher values of MIN_DEV accentuate the distortion
caused by small s, because tokens appearing in only one of the spam
and nonspam counts will (if present) be a higher proportion of the
total number of tokens considered.</p>

<p><a name="esf"></a></p>

<h2>EFFECTIVE SIZE FACTORS (ESF)</h2>

<p>Tokens tend to appear more than once in a given message.&nbsp;
If a spam contains the word "valium" it's likely to occur several
times.&nbsp; Bogofilter only uses any given token once in
calculating the message score, no matter how many times it appears
in the message; but since "spammy" and "non-spammy" tokens may
occur with very different frequency in the populations of spam and
nonspam messages respectively, it helps (as Gary Robinson pointed
out) to take this difference in redundancy into account.&nbsp; This
is done as follows:&nbsp; Without effective size factors, the score
is calculated with the inverse chi-squared function
<strong>prbx</strong> thus:

<div style="margin-left: 2em">
<pre>
P = prbx(-2 * sum(ln(1-f(w))), 2*N)
Q = prbx(-2 * sum(ln(f(w))), 2*N)
S = (1 + Q - P) / 2
</pre>
</div>

<p>and to apply ESF, we instead calculate:</p>

<div style="margin-left: 2em">
<pre>
P = prbx(-2 * ln(prod(1-f(w))^y), 2*N*y)
Q = prbx(-2 * ln(prod(f(w))^z), 2*N*z)
S = Q / (Q + P)
</pre>
</div>

<p>where y and z are the spam and nonspam ESF values, the
<strong>prod</strong> function returns the product of all its
arguments, and S is forced to 0.5 if Q and P are both very near
zero.</p>
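
<p>The following Python sketch puts the two variants side by
side.&nbsp; It is a reading of the formulas above, not bogofilter's
actual code: it assumes prbx is the upper-tail (survival)
probability of the chi-squared distribution, obtained here from
scipy's regularized upper incomplete gamma function
(scipy.special.gammaincc), which also handles the fractional
degrees of freedom that ESF introduces.&nbsp; The f(w) values are
invented.</p>

<div style="margin-left: 2em">
<pre>
# Illustrative sketch only: Robinson-Fisher combining, with and
# without effective size factors y and z.
from math import log
from scipy.special import gammaincc

def prbx(chi2, df):
    # upper-tail chi-squared probability; df may be fractional
    return gammaincc(df / 2.0, chi2 / 2.0)

def score(fw_values, y=1.0, z=1.0):
    n = len(fw_values)
    # -2 * ln(prod(1-f(w))^y) == -2 * y * sum(ln(1-f(w)))
    P = prbx(-2.0 * y * sum(log(1.0 - f) for f in fw_values),
             2.0 * n * y)
    Q = prbx(-2.0 * z * sum(log(f) for f in fw_values),
             2.0 * n * z)
    if P + Q &lt; 1.0e-300:
        return 0.5                 # both tails vanish
    if y == 1.0 and z == 1.0:
        return (1.0 + Q - P) / 2.0
    return Q / (Q + P)

fw = [0.97, 0.93, 0.88, 0.61, 0.99]    # invented token estimates
print("without ESF: %.4f" % score(fw))
print("with ESF:    %.4f" % score(fw, y=0.75**3, z=0.75**2))
</pre>
</div>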

<p>Determining the correct values to use for y and z is, like
choosing the value for s, a matter of trial and error.&nbsp; A
significant corpus of messages is required and users who don't have
access to large collections of their spam and nonspam messages
should probably opt to do without ESF.&nbsp; Useful values to try
seem to be in the range of 0.75 raised to powers between 1 and
20.</p>

<p><a name="co"></a></p>

<h2>THE SPAM AND NONSPAM CUTOFFS</h2>

<p>Most spam messages will have scores very close to 1 when the
Fisher method of combining the likelihood estimates is applied, and
most nonspam will have scores very close to 0.&nbsp; In between,
there is a grey area, where messages will fall that have both
spammish and nonspammish characteristics.&nbsp; The spam and
nonspam cutoffs are thresholds: bogofilter classes messages with
scores below the nonspam cutoff as nonspam, and those with scores
at or above the spam cutoff as spam. Messages with scores between
the two cutoff values are classed as uncertain.&nbsp; (Usually,
mail administrators will want to deliver mail classed as uncertain,
even though some of it may well be spam.)</p>

<p>The best value for the spam cutoff depends strongly on the
values of the x, s, MIN_DEV and ESF parameters that are being used
in the calculation.&nbsp; For that reason, the way to test a given
parameter set is the following:</p>

<ol>
<li>Determine a suitable level of false positives (nonspams
classified as spam); this will probably need to fall somewhere in
the range of 0.05 to 0.3 percent (the lower the better, of course,
except that lowering the false-positive target too much gives too
many false negatives).</li>

<li>Apply the parameters to classify a number of known nonspam
messages. From the distribution of scores, pick a spam cutoff value
that will give the selected proportion of false positives (a sketch
of this step follows the list).</li>

<li>Apply the parameters and the chosen spam cutoff to classify a
number of spam messages; the parameter set should be judged on how
few false negatives (spam messages classed as uncertain or nonspam)
are obtained.</li>
</ol>
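
<p>Step 2 amounts to a percentile calculation.&nbsp; A Python
sketch (the scores here are invented; in practice they come from
running bogofilter over your own nonspam corpus):</p>

<div style="margin-left: 2em">
<pre>
# Illustrative sketch only: choose the spam cutoff that admits a
# selected proportion of false positives among known nonspams.
fp_target = 0.002             # 0.2 percent false positives

nonspam_scores = sorted([0.01] * 900 + [0.2] * 90 +
                        [0.60, 0.72, 0.81, 0.85, 0.88,
                         0.90, 0.93, 0.95, 0.97, 0.99])

allowed_fp = int(len(nonspam_scores) * fp_target)
if allowed_fp > 0:
    # lowest score we are prepared to misclassify as spam
    cutoff = nonspam_scores[-allowed_fp]
else:
    cutoff = 1.0
print("spam cutoff = %.3f" % cutoff)   # 0.970 with these scores
</pre>
</div>

<p>Step 3 then scores the spam corpus with this cutoff and counts
how many spams fall below it.</p>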

<p>The value for the nonspam threshold should be such that no more
than about one in ten thousand spams is classified as
nonspam.&nbsp; A value of 0.2 to 0.25 might be a good starting
point for use with the recommended s and MIN_DEV (0.1 and 0.35
respectively).</p>

<p><a name="bogot"></a></p>

<h2>AN OVERVIEW OF BOGOTUNE</h2>

<p>As already mentioned, bogotune attempts to automate the above
process by doing a grid search over useful ranges of s, MIN_DEV, x,
y and z (an option exists to disable ESF, i.e. leave the ESF values
[y and z] at 1.0).&nbsp; Bogotune validates the test inputs,
calculates a suitable cache size for the training database,
calculates the starting x value as described above, and picks a
false-positive target with which to run the grid search.&nbsp;
There is an option to force a specific target (more about this
shortly).&nbsp; If not coerced, the target is calculated based on
the number of nonspam messages in the test set, and then adjusted
downward to give a cutoff between 0.5 and 0.975, above 0.55 if
possible.&nbsp; It's important to note that this false-positive
target is chosen to facilitate the grid search, and is usually very
much larger than one would want to see in production; there's no
relation between the two.&nbsp; At the end of its run, bogotune
attempts to suggest a reasonable production target.</p>

<p>With these preliminaries completed, bogotune performs two
scans:&nbsp; The first is a coarse grid search over the entire
useful range of each parameter, which should locate an approximate
optimum.&nbsp; A finer grid, centered on the optimum derived from
the coarse search, is then scanned to produce the final
recommendations.&nbsp; At the end of each scan, outliers --
apparently good parameter sets from which even slight deviation
degrades performance significantly -- are rejected.&nbsp; Bogotune
usually manages to find a robust parameter set that gives good
discrimination between spam and nonspam messages, provided that
sufficient training and test messages are supplied.</p>

<p>There are two reasons why one might want to force a bogotune run
to use a specific false-positive target.&nbsp; One is that
sometimes bogotune's target calculation algorithm is overly
optimistic, i.e. it occasionally sets the target too low.&nbsp; The
symptom of this problem is that many parameter combinations in the
coarse grid scan simply can't deliver that few false positives from
the test message corpus.&nbsp; Manually increasing the target by 20
to 30 percent usually fixes this.&nbsp; The other reason to coerce
the target to a specific value is that one might want to compare
two bogotune runs -- with and without ESF, for example -- and
comparisons aren't valid unless both runs use the same target.</p>

<p>Bogotune may produce very different recommendations for very
similar sets of spam and nonspam messages.&nbsp; That's not
necessarily a defect.&nbsp; For many message populations, the
scoring algorithm doesn't depend heavily on precise values of any
of the parameters but the spam cutoff.&nbsp; In such cases it's
common to see bogotune's coarse scan settle on one of several local
optima that may have quite different parameter values.&nbsp; In
general, the parameter values have more influence when a single
training database is used for a large number of users, and are less
crucial when the wordlist is for just one or a few users. <a name=
"freq"></a></p>

<h2>HOW OFTEN TO TUNE</h2>

<p>It's probably wise to review the spam cutoff frequently.&nbsp;
If the false-negative count is gratifyingly low, or if false
positives are occurring, increasing the cutoff will reduce the
false-positive rate. If, on the other hand, there are absolutely no
false positives but the false-negative rate is high, lowering the
cutoff a bit may improve discrimination.</p>

<p>How often to tune the other parameters depends on how fussy you
are about optimizing performance.&nbsp; If you're eager to get the
best discrimination you can, here are my recommendations:&nbsp; In
the early stage, while the training database is still small and
growing rapidly, it's probably wise to experiment with tuning x, s
and MIN_DEV once a month or so.&nbsp; Once the training database is
a good size (over 5000 spams and 5000 nonspams), this can be
reduced to quarterly or half-yearly.&nbsp; If you use ESF (which I
recommend you do), retuning should probably happen a bit more
often, especially if you see that the characteristics of spam
you're receiving seem to be changing.</p>

<p>FWIW, my own practice, after two years' experience, is to review
the spam cutoff monthly and do a bogotune run about quarterly. <a
name="train"></a></p>

<h2>A NOTE ON TRAINING</h2>

<p>Bogofilter's ability to distinguish accurately between spam and
nonspam messages depends on the quality of its training
database.&nbsp; Here is a way of maximizing that quality with
relatively little effort.</p>

<p>When starting afresh, feed every spam and every nonspam you get
into bogofilter.&nbsp; Do not use bogofilter's -u option to do
this: there will be far too many errors when your training database
is small.&nbsp; Instead, classify messages manually and train
bogofilter with the -n and -s options appropriately.&nbsp; You can
do it in batches: if you work with standard mbox files, use a mail
reader to move spam and nonspam into separate files, then do&nbsp;
<code>bogofilter&nbsp;-s&nbsp;&lt;&nbsp;spambox</code>&nbsp;
and&nbsp; <code>bogofilter&nbsp;-n&nbsp;&lt;&nbsp;nonspambox</code>
to register the messages.</p>

<p>To find out how many spam and nonspam messages have gone into
your wordlist, assuming it's kept in ~/.bogofilter, do</p>

<div style="margin-left: 2em">
<pre>
  bogoutil -w ~/.bogofilter .MSG_COUNT
</pre>
</div>

<p>Once you've accumulated about 5,000 spam or nonspam messages in
the list, you need to let the other count catch up unless they're
growing at about the same rate.&nbsp; Stop adding every message to
the larger of the two counts, and instead, add only messages that
bogofilter got wrong or was unsure about.&nbsp; To do this, you
need to start classifying messages into 4 sets instead of 2: spam,
nonspam, unsures that were actually spam, and unsures that were
nonspam.</p>
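
<p>One way to presort incoming mail into those four sets is to use
bogofilter's exit status (0 for spam, 1 for nonspam, 2 for
unsure).&nbsp; The Python sketch below assumes standard mbox files
with invented names; the unsure mailbox still has to be reviewed by
hand and split into its spam and nonspam parts before training.</p>

<div style="margin-left: 2em">
<pre>
# Illustrative sketch only: bucket an mbox by bogofilter's exit
# status (0 = spam, 1 = nonspam, 2 = unsure).  File names invented.
import mailbox
import subprocess

inbox = mailbox.mbox("incoming.mbox")
buckets = {0: mailbox.mbox("looks-spam.mbox"),
           1: mailbox.mbox("looks-nonspam.mbox"),
           2: mailbox.mbox("unsure.mbox")}

for msg in inbox:
    result = subprocess.run(["bogofilter"], input=msg.as_bytes())
    # anything unexpected (e.g. status 3, an error) goes to unsure
    buckets.get(result.returncode, buckets[2]).add(msg)

for box in buckets.values():
    box.flush()
</pre>
</div>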

<p>Once the counts are both over 5,000 and fairly similar, you can
train only on errors and unsures.&nbsp; By this time there should
be very few errors (spams classed as nonspam or vice versa), but
there will still be a proportion of unsure spam and unsure nonspam
messages. Training on these will keep bogofilter working well, as
you're telling it what it needs to learn, and not so much of what
it already knows. If one list grows faster than the other, extra
(correctly classified) messages may be added from time to time to
equalize them again; try to keep the smaller list's message count
at least two thirds of the larger's.</p>

<p>The use of bogofilter's -u option is convenient but
dangerous.&nbsp; Even when well trained, bogofilter <em>will</em>
misclassify a small number of messages.&nbsp; With the -u option,
these mistakes are entered into the database and decrease its
accuracy.&nbsp; More mistakes result and the process
snowballs.&nbsp; You therefore need to review and correct the
databases frequently, and in the interval between such reviews,
bogofilter's effectiveness will keep falling off.&nbsp; The method
described above has the advantage that (human error excepted) no
wrongly classified messages are ever entered into the database, and
if you have to leave it for a time without updating it, its
effectiveness doesn't diminish.</p>

<hr>
<p>Much of the advice given here arises out of experiments reported
on the author's bogofilter web pages; in particular, the report at
<a href="http://www.bgl.nu/bogofilter/smindev3.html">http://www.bgl.nu/bogofilter/smindev3.html</a>
might interest those who'd like more information about the basis of
bogofilter tuning.</p>

<p>Thanks go to David Relson for reviewing this document in draft
and suggesting several improvements.</p>

<address>[&copy; <a href="mailto:glouis@dynamicro.on.ca">Greg
Louis</a>, 2004; last modified 2004-09-09]</address>
</body>
</html>