Sophie: bogofilter-1.2.2-2.2.mga2 i586

WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING
------------------------------------------------------------------------
POTENTIAL FOR DATA CORRUPTION DURING UPDATES

If you plan to upgrade your database library, if only as a side effect
of an operating system upgrade, DO HEED the relevant documentation, for
instance, the doc/README.db file.  You may need to prepare the upgrade
with the old version of the software.

Otherwise, you may cause irrecoverable damage to your databases.

DO backup your databases before making the upgrade.
------------------------------------------------------------------------
WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING



This file documents changes in bogofilter since version 0.11.  In
particular it describes: (1) Features, which are significant changes
(noteworthy and compatible) and (2) Incompatibilities, which are
changes that require action upon update.

Caution: If upgrading from an old version and skipping several
intervening versions of bogofilter, be smart and check all the
changes of the versions you skipped!  In particular, read the sections
labeled "Incompat" and "Major".

NOTE: the NEWS document has greater detail on some of these changes.

------------------------------------------------------------------------
[Incompat 0.96.0] TDB removed

Support for the TDB database library has been removed.

------------------------------------------------------------------------
[Incompat 0.95.2] Applies to Berkeley DB Transactional ONLY:

This release gives up on locking the databases at page granularity and
locks whole environments, to overcome lock sizing requirements which are
a major issue in unattended setups.

This however means that a writer (token registration) will lock out
readers (message scoring) and readers will prevent new writers from
starting. This may be fixed in a future version.

------------------------------------------------------------------------
[Major 0.95.0] Unicode in UTF-8

This release supports Unicode (UTF-8).  A new meta-token .ENCODING has
been added to the wordlist so that bogofilter can determine if it's
using Unicode or not.  A value of 1 indicates raw storage and 2
indicates UTF-8 encoded tokens.  Bogofilter checks for this meta-token
and converts incoming text to UTF-8 as appropriate.  

Command line options "--unicode=yes" and "--unicode=no" can be used.

 - With bogofilter, they control encoding of newly created databases.
 - With bogoutil, --unicode=yes converts the wordlist to Unicode.
 - For bogolexer, they print parser results in new and old modes.

./configure options allow bogofilter customization:

 - "./configure --unicode=yes" will _always_ operate in Unicode mode
 - "./configure --unicode=no"  will _never_ operate in Unicode mode

Wordlists can be converted from raw storage to Unicode using the
commands below.
NOTE: Replace iso-8859-1 with the character set and encoding that is
dominant among the tokens in your wordlist!

    bogoutil -d wordlist.db > wordlist.raw.txt
    iconv -f iso-8859-1 -t UTF-8 < wordlist.raw.txt > wordlist.UTF-8.txt
    bogoutil -l wordlist.db.new < wordlist.UTF-8.txt

For a wordlist containing tokens from multiple languages, particularly
non-European languages, the conversion methods described above may not
work well for you.  Building a new wordlist (from scratch) will likely
work better as the new wordlist will be based solely on Unicode.

------------------------------------------------------------------------
[Incompat 0.94.12] Changed Options

Some options have been added or modified.  If you use any of the
changed options, you will probably need to modify your scripts,
procmail recipes, etc.  As an example, some bogoutil options which
used to allow either filenames or directory names are now restricted
to filenames.  See the man pages and help messages if you have
questions.

------------------------------------------------------------------------
[Incompat 0.94.0] Transactions

The transactional mode now defaults to off because the lock table sizing
issue is unresolved.

Bogofilter and bogoutil now let you choose, at build time or at run
time, whether to operate with or without transaction support.
They can also auto-detect whether you've been using transactions or not.

Run-time Selection:

For bogofilter and bogoutil, transactions can be enabled or disabled
in 2 ways -- by command line options or config file options.

Command line option "--db-transaction=yes" enables transactions and
"--db-transaction=no" disables them.

Config file options "db_transaction=yes" and "db_transaction=no"
have the same effect.

Auto-detection:

If none of the above methods are used to enable/disable transactions,
bogofilter and bogoutil will query Berkeley DB to see if a transaction
environment already exists.  If so, transactions will be enabled.  If
not, they will be disabled.

Compile-time selection:

A default build includes the run-time and auto-detect capabilities.
If you wish to minimize program size, ./configure can be used to
create single mode versions of bogofilter and bogoutil, i.e. programs
that only run transactionally or non-transactionally.  Use
"./configure --enable-transactions" to enable transactions and
"./configure --disable-transactions" to disable them.  These programs
will be _slightly_ smaller than the default build.

-----------------------------------------------------------------------
[Incompat 0.93] Summary for the hasty

YOU MUST ADJUST YOUR SCRIPTS EVALUATING "X-Bogosity" HEADERS!

YOU MAY NEED TO ADJUST YOUR SCRIPTS THAT PARSE 'bogofilter -V'!

WHEN USING BERKELEY DB (DEFAULT), NFS NO LONGER WORKS AND
YOU   M U S T   READ doc/README.db AND POSSIBLY CONFIGURE THE DATABASE!

-----------------------------------------------------------------------
[Incompat 0.93] Defaults changed

Bogofilter's defaults have been changed.  It now operates in tri-state
mode and will classify messages as Spam, Ham, or Unsure.

If you're checking messages for "X-Bogosity: Yes" or "X-Bogosity: No",
you _need_ to change your checks.  Use "X-Bogosity: Spam" and
"X-Bogosity: Ham" instead of the old forms.  Also, checking for
"X-Bogosity: Unsure" and putting those messages in a separate folder (or
mailbox) will give you an excellent set of messages for training, as
"Unsure" messages are messages that bogofilter has too little
information to classify (with certainty) as spam or ham.
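
As a rough illustration of the adjusted checks, a delivery script could
map the header to a folder along the following lines.  This is only a
sketch: the function name bogosity_folder and the folder names (spam,
inbox, unsure, legacy-format) are assumptions made for the example and
are not part of bogofilter.

```shell
# Map an X-Bogosity header line to a destination folder.
# Function name and folder names are hypothetical.
bogosity_folder() {
    case "$1" in
        "X-Bogosity: Spam"*)   echo spam ;;
        "X-Bogosity: Ham"*)    echo inbox ;;
        "X-Bogosity: Unsure"*) echo unsure ;;   # good candidates for training
        "X-Bogosity: Yes"*|"X-Bogosity: No"*)
                               echo legacy-format ;;  # header written before 0.93
        *)                     echo inbox ;;    # no verdict: deliver normally
    esac
}
```

It might be called with the first X-Bogosity line of a filtered message:

    bogosity_folder "$(sed -n '/^X-Bogosity:/{p;q;}' message)"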

-----------------------------------------------------------------------
[Incompat 0.93] Berkeley DB switched to Transactional Data Store

Bogofilter will now use the Berkeley DB Transactional Data Store when
compiled with Berkeley DB as the data base engine (the default).

This means the Berkeley DB directory can no longer reside on a networked
or otherwise shared file system (such as NFS, AFS, Coda).

When using BerkeleyDB 4.1 - 4.3, it is recommended that you dump and
load the data bases to add checksums, for enhanced reliability. See
section 2.2 in doc/README.db for details.

This means that bogofilter programs now exhibit the A C I D traits:
changes are atomic (all-or-nothing); the data base is always consistent;
changes are always isolated from each other; and all changes that are
acknowledged are durable.

Bogofilter can support multiple writers at the same time, mixed freely
with simultaneous readers, and the data base will not be corrupted by
application or system crashes, except when the disk drive gets damaged.

Note that this requires that the operating system and disk drive
maintain proper write order on the disk, and that both be honest about
synchronous I/O completion.

Note also that this causes bogofilter to write additional "log" files
to its ~/.bogofilter (or other) home directory.  The log files need to
be archived or deleted periodically.

For detailed instructions, be sure to _read_ doc/README.db and check the
BerkeleyDB documentation.

As a backwards compatibility option, for instance when space and I/O
bandwidth are tight, it is possible to use the old non-transactional,
non-concurrent Berkeley DB Data Store, which can only register messages
when there are NO scoring processes at all and that may not be able to
recover from application or system crashes.

These benefits are not available when bogofilter is compiled to use the
TDB or QDBM data bases.

-----------------------------------------------------------------------
[Major 1.1.6] Tokyo Cabinet support (B+-trees with transactions) added

Bogofilter, as of release 1.1.6, supports Tokyo Cabinet databases,
courtesy of Pierre Habouzit. Tokyo Cabinet is the successor to QDBM,
supports larger files, and was also written by Mikio Hirabayashi.

For new installations, if you considered using QDBM, consider using
Tokyo Cabinet instead.

-----------------------------------------------------------------------
[Incompat 0.93] Berkeley DB version strings changed

Bogofilter will now return the BerkeleyDB's actual DB_VERSION_STRING
in the output of 'bogofilter -V'. The OLD format was:

    Database: BerkeleyDB (4.3.21)

The NEW format is:

    Database: Sleepycat Software: Berkeley DB 4.3.21: (November  8, 2004)

You may need to adjust your scripts.
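
One possible adjustment is to extract only the dotted version number,
which appears in both formats.  The helper below is a sketch under that
assumption; the name bdb_version is made up for the example, and it
expects a version with at least two components (such as 4.3 or 4.3.21).

```shell
# Print the first dotted version number found on a "Database:" line.
# Works for the OLD and the NEW format shown above (hypothetical helper).
bdb_version() {
    printf '%s\n' "$1" |
        sed -n 's/^[[:space:]]*Database:[^0-9]*\([0-9][0-9.]*[0-9]\).*/\1/p'
}
```

For example, with a line taken from the version output:

    bdb_version "Database: BerkeleyDB (4.3.21)"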

-----------------------------------------------------------------------
[Incompat 0.93] QDBM database format changed to B+ trees

The QDBM database format has been changed from hash tables to B+
trees, i.e. from the Depot API to the Villa API.  This results in
significantly better performance, i.e. faster speed.  Unfortunately,
the two modes are incompatible, so upgrading to 0.93 requires running
a special command to convert the database once:

bogoQDBMupgrade wordlist.qdbm wordlist.tmp wordlist.qdbm.old

If this command prints nothing, the conversion succeeded and your old
data base has been left in wordlist.qdbm.old.

NOTE: bogoQDBMupgrade needs qdbm-1.7.23 or newer to build.

-----------------------------------------------------------------------
[Incompat 0.93] Bogotune option parsing changes

In bogotune 0.93.2 and newer, you must repeat the -n or -s option as a
prefix for each mailbox.

Old form: bogotune -n good1 good2 -s bad1 bad2 ...
New form: bogotune -n good1 -n good2 -s bad1 -s bad2
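
Scripts that build the bogotune command line from a list of mailboxes
can generate the repeated options mechanically.  A minimal sketch,
assuming mailbox names without whitespace; the helper name repeat_flag
is hypothetical:

```shell
# Emit "<flag> <mailbox>" once per mailbox, separated by spaces.
# Assumes mailbox names contain no whitespace (hypothetical helper).
repeat_flag() {
    flag=$1
    shift
    out=
    for mbox in "$@"; do
        out="${out:+$out }$flag $mbox"
    done
    printf '%s\n' "$out"
}
```

It could then be used as:

    bogotune $(repeat_flag -n good1 good2) $(repeat_flag -s bad1 bad2)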

-----------------------------------------------------------------------
[Major 0.93.3] SQLite 3.0.8 (and newer) is now supported. It isn't
nearly as fast as Berkeley DB but uses only one permanent and one
transient file (hence less maintenance work) and is supposed to be
proof against application and system crashes.

-----------------------------------------------------------------------
[Incompat 0.92]

The formatting parameters have changed:
      '%A' is now the message's IP address.
      '%I' is now the Message-ID.
      '%Q' is now the Queue-ID.

-----------------------------------------------------------------------
[Incompat 0.17] Support for --enable-deprecated-code (see the 0.16
release notes) has been removed. If you've run 0.16.X without that
switch, nothing changes for you.

Support for Berkeley DB 3.0 was removed in bogofilter 0.17.3
as a side effect of adding Concurrent Database support.

-----------------------------------------------------------------------
[Incompat 0.16] A number of features have been deprecated.  The
relevant code is bracketed by "#ifdef ENABLE_DEPRECATED_CODE" and
"#endif" statements.  The default build will not include the
deprecated features.  For those who still need these features,
configure option "--enable-deprecated-code" exists to allow them to be
turned on.

THIS MAY REQUIRE MAJOR CHANGES TO YOUR CONFIGURATION OR SCRIPTS!

The following list is supposed to be complete.  Let us know if we've
omitted anything. We shall try to provide workarounds and migration
paths whenever possible.

1) Scoring algorithms
---------------------

Bogofilter will support only the Robinson-Fisher algorithm, commonly
called the "Fisher algorithm".  The Graham algorithm and Robinson
geometric-mean algorithm, a.k.a. Robinson algorithm, have been
deprecated.

2) Wordlist support
-------------------

Bogofilter will now support only the combined wordlist, i.e.
wordlist.db, which contains both the ham and spam counts for each token.
The older, separate wordlists (spamlist.db and goodlist.db) are no
longer supported.

The bogoupgrade program can still be used to merge the separate
databases for you.  Type "bogoupgrade -d /your/wordlist/directory/".

Ignore lists, i.e. ignorelist.db, are also being deprecated.  The ignore
list feature has never been thoroughly tested and is not used (as far as
we know).

3) Command line switches
------------------------

Bogofilter will no longer support the switches listed in this section.
If used, bogofilter will print an error message and exit.

  Scoring related switches:

    -g - select Graham algorithm
    -r - select Robinson Geometric-Mean algorithm
    -f - select Robinson-Fisher algorithm

    see section 1 above

    -2 - set binary classification mode
    -3 - set ternary classification mode

    Bogofilter will use binary mode if ham_cutoff is zero and will use
    ternary mode (Yes, No, Unsure) if ham_cutoff is non-zero and less
    than spam_cutoff.

  Wordlist modes:

    -W   - use combined wordlist  for spam and ham tokens
    -WW  - use separate wordlists for spam and ham tokens

    Bogofilter will always operate in combined mode now.

  Backwards compatible token generation switches:

    -Pi and -PI - ignore_case
    -Pt and -PT - tokenize_html_tags
    -Pc and -PC - strict_check
    -Pd and -PD - degen_enabled
    -Pf and -PF - first_match

    Note: Since last May, the default values for these switches
    have been:

    ignore_case         disabled
    tokenize_html_tags  enabled
    strict_check        disabled
    degen_enabled       disabled
    first_match         disabled

    There will be no change in the default values.

4) Configuration options
------------------------

The following configuration options (for the above switches) are
deprecated:

    algorithm

    wordlist
    wordlist_mode

    ignore_case
    tokenize_html_tags
    tokenize_html_script
    header_degen
    degen_enabled
    first_match

The following configuration options (which don't correspond to
switches) are deprecated:

    thresh_stats
    thresh_rtable

Note:  Bogofilter will print a warning message if it sees any of
these options, but will run fine anyhow.

5) Miscellany
-------------

The user formatted SPAM_HEADER will no longer support format
specification "%a" (for algorithm) since bogofilter now has only one
algorithm.

-----------------------------------------------------------------------
[Incompat 0.15.9]

Bogofilter no longer allows disabling of algorithms, a feature which has
never been well supported.

-----------------------------------------------------------------------
[Incompat 0.15.4]

All header line tokens are now tagged as:

	Subject:      subj:
	To:           to:
	From:         from:
	Return-Path:  rtrn:
	Received:     rcvd:   ***new***
	any other:    head:   ***new***

Because existing wordlists don't have "head:???" tokens, the new tokens
won't be found in the wordlist and bogofilter's accuracy will go down.

To correct this you can do one of the following things:

1 - Use the new "-H" (for header-degen) option when scoring messages.
This option tells bogofilter to check the wordlist twice for each header
token - once for "head:xyz" and a second time for "xyz".  The ham and
spam counts are added together to give a cumulative result.

Note that, with bogofilter 0.15.4 and later, during message
registration, "head:xyz" tokens are added to the wordlist (for the
header lines).  The "-H" option is only applied during scoring.

The "-H" option is meant for temporary usage to cover the period while
bogofilter goes from having no "head:xyz" tokens in the wordlist to the
time when there are enough such tokens to score messages effectively.
After a few weeks, or perhaps months, of registering messages with the
new bogofilter, use of the "-H" option can end and bogofilter will use
the newly added "head:xyz" tokens.

2 - Retrain bogofilter with whatever ham and spam you have available.
This will create "head:xyz" tokens and allow the new, more effective
header tagging to be used to fullest advantage.
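
To make the effect of "-H" concrete, the sketch below adds up the
counts of "head:xyz" and "xyz" from a text dump of a wordlist.  It only
illustrates the arithmetic and is not how bogofilter is implemented.
It assumes dump lines of the form "token spamcount hamcount ...", and
the helper name degen_counts is made up for the example.

```shell
# Read a text dump on stdin and print "<spam> <ham>" summed over the
# entries "head:TOKEN" and "TOKEN".
# Assumption: column 2 is the spam count, column 3 the ham count.
degen_counts() {
    awk -v t="$1" '
        $1 == "head:" t || $1 == t { spam += $2; ham += $3 }
        END { printf "%d %d\n", spam, ham }
    '
}
```

For instance:

    bogoutil -d wordlist.db | degen_counts xyz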

-----------------------------------------------------------------------
[Major 0.15]

The code for processing multiple messages has been rewritten.  In
addition to understanding mbox format files, bogofilter now understands
maildirs and MH folders.

-----------------------------------------------------------------------
[Incompat 0.14]

The exit codes returned by bogofilter have been expanded.  They are:

	Spam   = 0 -- unchanged
	Ham    = 1 -- unchanged
	Unsure = 2 -- *NEW*
	Error  = 3 -- *CHANGED*
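
Scripts that branch on the exit code need a case for each of the four
values.  A minimal sketch follows; the function name bogofilter_verdict
and the printed labels are assumptions made for the example.

```shell
# Translate a bogofilter exit code into a label (hypothetical helper).
bogofilter_verdict() {
    case "$1" in
        0) echo spam ;;
        1) echo ham ;;
        2) echo unsure ;;
        3) echo error ;;
        *) echo unknown ;;   # not one of the documented codes
    esac
}
```

Typical use, right after scoring a message:

    bogofilter < message; bogofilter_verdict $?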

-----------------------------------------------------------------------
[Major 0.14] Bogofilter now supports TDB (Trivial Data base).

Instead of separate wordlists for spam and ham tokens, bogofilter can
now use a single, combined wordlist that stores all tokens.
In the combined wordlist each token carries two counts, one for spam and
one for ham.  The name of the new file is wordlist.db.

However, this change broke the early versions (up to and including
0.14.2) of bogofilter. You should use at least bogofilter 0.14.3.

Bogofilter will check in $BOGOFILTER_DIR and use the wordlist(s) that
are there.  If wordlist.db is present, bogofilter will use the combined
mode.  If wordlist.db is not present, but both spamlist.db and
goodlist.db are present, bogofilter will use the separate wordlist mode.
If no wordlists are present, bogofilter will create wordlist.db and use
it.
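
The selection rules above can be sketched as a small shell function.
It only mirrors the description in this file; the name wordlist_mode
and the printed labels are hypothetical, and the case where only one of
spamlist.db and goodlist.db exists is not described here, so the sketch
reports it as "unspecified".

```shell
# Report which wordlist mode the rules above would select for a directory.
wordlist_mode() {
    dir=$1
    if [ -e "$dir/wordlist.db" ]; then
        echo combined
    elif [ -e "$dir/spamlist.db" ] && [ -e "$dir/goodlist.db" ]; then
        echo separate
    elif [ ! -e "$dir/spamlist.db" ] && [ ! -e "$dir/goodlist.db" ]; then
        echo create-combined      # no wordlists: wordlist.db gets created
    else
        echo unspecified          # only one separate list is present
    fi
}
```

For example:

    wordlist_mode "$BOGOFILTER_DIR"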

Command line switches '-W' and '-WW' can be used to tell bogofilter the
mode you want.  Also config file options "wordlist_mode=combined" and
"wordlist_mode=separate" can be used.

Upgrading from an old bogofilter environment with its two wordlists
(spamlist.db and goodlist.db) to the new 0.14.x environment with its
single, combined wordlist.db involves 3 main steps - dumping the current
spamlist.db and goodlist.db files, formatting that output, and then
loading the data into a new file wordlist.db.  The script "bogoupgrade" is
included with bogofilter and performs the task.  Use command
"bogoupgrade -d /path/to/your/wordlists" to do the upgrade.  After
running it, your BOGOFILTER_DIR will contain all 3 database files.  When
started, bogofilter checks for wordlist.db and will use it.

-----------------------------------------------------------------------
[Incompat 0.13]

Parsing has changed.  As background, Paul Graham has done work to
improve the results of his Bayesian filter and has published them in
"Better Bayesian Filtering" at http://www.paulgraham.com/better.html.
He found the following definition of a token to be beneficial:

       1. Case is preserved.

       2. Exclamation points are constituent characters.

       3. Periods and commas are constituents if they occur between two
	  digits. This lets me get ip addresses and prices intact.

       4. A price range like $20-25 yields two tokens, $20 and $25.

       5. Tokens that occur within the To, From, Subject, and Return-Path
	  lines, or within urls, get marked accordingly.

Bogofilter has always done #3 and has tagged tokens in Subject lines for a
while.  Its parser now does all of these things.  Several command line
switches and config file options have been added to allow enabling or
disabling them.  Here are the new switches and options:

       -Pi/-PI	ignore_case		default - disabled
       -Ph/-PH	header_line_markup 	default - enabled
       -Pt/-PT	tokenize_html_tags 	default - enabled

The options can be enabled using the lower case switch or disabled using
the upper case switch.

When header_line_markup is enabled, tokens in To:, From:, Subject:, and
Return-Path: lines are prefixed by "to:", "from:", "subj:", and "rtrn:"
respectively.

When tokenize_html_tags is enabled, tokens in A, IMG, and FONT tags are
scored while classifying the message.

NOTE: To take full advantage of these changes, additional training of
bogofilter is necessary.  Here's why:

With bogofilter's use of upper and lower case, the wordlists won't match
as many words as before.  For example, "From" and "from" both used to
match "from", but this is no longer the case.  As additional training is
done, words like these will be added to the wordlists and bogofilter
will have a larger number of distinct tokens to use when classifying
messages.  This will improve its classification accuracy.

Similarly, the use of header_line_markup will tokenize "Subject: great
p0rn site" as "subj:great", "subj:p0rn", and "subj:site".  At first
these tokens won't be recognized, so bogofilter won't use them to score
the message.  After being trained, bogofilter will have these additional
tokens to aid in the classification process.
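
The prefixing can be illustrated with a deliberately simplified sketch
that splits a header line on whitespace only.  Bogofilter's real lexer
does much more; the function name markup_header is made up for the
example.

```shell
# Print one prefixed token per line for a single header line.
# Simplification: tokens are whitespace-separated words.
markup_header() {
    printf '%s\n' "$1" | awk '
        BEGIN {
            p["Subject:"] = "subj:"; p["To:"] = "to:"
            p["From:"] = "from:";    p["Return-Path:"] = "rtrn:"
        }
        {
            pre = ($1 in p) ? p[$1] : ""   # other headers get no prefix here
            for (i = 2; i <= NF; i++) print pre $i
        }
    '
}
```

Calling it as

    markup_header "Subject: great p0rn site"

prints subj:great, subj:p0rn, and subj:site on separate lines.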

-----------------------------------------------------------------------
[Major 0.12]

Directory bogofilter/tuning has been added and contains
scripts for running tuning experiments as described in the new
HOWTO. See file bogofilter/tuning/README for more information.

Bogofilter's man page and help message describe the many command line
switches.  They have been divided into groups (help, classification,
registration, general, algorithm, parameter, and info) in both places.

Bogofilter 0.12.0 has three new command line switches for rapidly
scoring large numbers of messages.  These "bulk mode" switches are
especially useful for the tuning process.  The new switches are:

    -M - allows scoring all the messages in an mbox formatted file.  If
    used with "-v", an X-Bogosity line is printed as each message is
    scored.  Using the "-t" (terse) option is recommended to reduce the
    amount of output.

    -B - allows scoring of multiple message files, with each file
    containing a single message.  With this option, bogofilter expects the
    file names to be at the end of the command line.  If used with "-v",
    the file name is included in each printed line.  Using "-t" is
    recommended.

    -b - allows scoring of multiple message files, with each file
    containing a single message.  With this option, bogofilter reads the
    file names from stdin.  This option can be used with maildirs, as in
    "ls Maildir/* | bogofilter -b ..."  If used with "-v", the file name
    is included in each printed line.  Using "-t" is recommended.

New script bogolex.sh converts an email to a special file format that
contains the information needed by bogofilter to score the email.  Its
use speeds up the message scoring done by the tuning scripts.  The
script is described in more detail in bogofilter/tuning/README.

-----------------------------------------------------------------------
[Incompat 0.11]

Command line flags:

The meaning of command line flags '-S' and '-N' was changed in version
0.11.0.  Previously '-S' meant to unregister a message from the spam
wordlist and register the message in the non-spam wordlist and '-N'
meant to unregister from non-spam and register as spam.

Each of the flags now performs a single action.

	'-S' unregisters a message from the spam wordlist and
	'-N' unregisters a message from the non-spam wordlist.

To duplicate the old (compound) actions, it is necessary to use two
options - an unregister option ('-S' or '-N') and a register option
('-s' or '-n').

To duplicate the effect of the old '-S' option, use '-N -s'.  To
duplicate the effect of the old '-N' option, use '-S -n'.  The order of
the options doesn't matter and they can be concatenated, as in '-Sn' and
'-sN'.
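
When updating old scripts, the substitution can be written down as a
small lookup.  This is a sketch only; the helper name
translate_old_flag is hypothetical.

```shell
# Print the 0.11 equivalent of a pre-0.11 compound flag.
# Any other argument is passed through unchanged.
translate_old_flag() {
    case "$1" in
        -S) printf '%s\n' "-N -s" ;;   # old -S: leave non-spam, register as spam
        -N) printf '%s\n' "-S -n" ;;   # old -N: leave spam, register as non-spam
        *)  printf '%s\n' "$1" ;;
    esac
}
```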

Config file processing
----------------------

The code to process config files now checks numeric values for validity.
It complains when it detects something wrong.  In particular, double
precision values are no longer allowed to have a terminal 'f'.  For
example "spam_cutoff=0.95f" will generate a message.
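
Before upgrading, config values can be screened for this problem.  The
check below is only an approximation written for illustration; it is
not bogofilter's own validation, it does not accept exponent notation,
and the name is_valid_double is made up.

```shell
# Succeed if the argument looks like a plain decimal number
# (optional sign, digits, optional fraction) with no trailing 'f'.
is_valid_double() {
    printf '%s\n' "$1" | grep -Eq '^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)$'
}
```

For example:

    is_valid_double 0.95f || echo "fix this value"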

-----------------------------------------------------------------------
[Major 0.11]

New parameter query option:

Using options "-q -v" in a bogofilter command line will run the
query_config() function and will display bogofilter's various parameter
values.  This can be very useful in finding the reason for an unexpected
message classification.

-----------------------------------------------------------------------
End of RELEASE.NOTES