BERKELEY DB ENVIRONMENT CODE
============================

$Id: README.db 6898 2010-04-01 15:11:23Z m-a $

This document does not apply when you are installing a bogofilter
version that has been configured to use the QDBM or SQLite3
database libraries. Common entry points are these:

QUICK LINKS:
------------
Upgrade of Berkeley DB or operating system    section 2.6
Switching transactions on or off              section 2.2.1 or 2.2.2
Recover broken database                       section 3.2 or 3.3

0. Definitions ---------------------------------------------------------

Whenever ~/.bogofilter appears in the text below, this is the directory
where bogofilter keeps its data base by default. If you are overriding
this directory by configuration or environment variables, substitute
your actual bogofilter data base directory.

1. Overview ------------------------------------------------------------

Operating bogofilter with a Berkeley DB back-end requires some
attention. Bogofilter offers two operating modes that can be selected
on a directory-by-directory basis:

  - the traditional, simple Berkeley DB Data Store

    easy to use, but also easy to corrupt: an interrupted registration
    of spam or non-spam mail, an application crash, a system crash, a
    disk that runs out of space or a user that runs out of quota can
    each damage the data base.

  - the Berkeley DB Transactional Data Store,
    transactional, TXN or XA in bogofilter jargon for short

    a bit more complex to use, but pretty crash-proof as long as it is
    not abused. Upgrading bogofilter or Berkeley DB, copying databases
    around, and making backups require special care, as described in
    this document.

Note that the system administrator can disable either mode of operation
at compile time; in the default install, both will be present.

2. Prerequisites and Caveats -------------------------------------------

2.1 Compatibility, Berkeley DB versions

These versions are supported (with all Sleepycat-posted patches applied
- if using a pre-packaged Berkeley DB version, the packager should have
applied the patches, check your vendor's update site regularly):

  Sleepycat Software: Berkeley DB 3.1.17: (July     31, 2000)
  Sleepycat Software: Berkeley DB 3.2.9:  (January  24, 2001)
  Sleepycat Software: Berkeley DB 3.3.11: (July     12, 2001)
  Sleepycat Software: Berkeley DB 4.0.14: (November 18, 2001)
  Sleepycat Software: Berkeley DB 4.1.25: (December 19, 2002)
  Sleepycat Software: Berkeley DB 4.2.52: (December  3, 2003)
  Sleepycat Software: Berkeley DB 4.3.29: (September 6, 2005)
  Sleepycat Software: Berkeley DB 4.4.20: (January  10, 2006)
  Berkeley DB 4.5.20: (September 20, 2006)
  Berkeley DB 4.6.19: (August 10, 2007)
  Berkeley DB 4.7.25: (May 15, 2008)
  Berkeley DB 4.8.24: (August 14, 2009)
  Berkeley DB 4.8.26: (December 18, 2009)
  Berkeley DB 5.0.21: (March 30, 2010)

Other versions of Berkeley DB between the first and last listed above
may or may not work but usually they will.

Berkeley DB versions 4.1 and newer are recommended over the previous
versions, because the newer can detect data corruptions more reliably
(through the use of checksums that detect partially written data base
pages); Berkeley DB 4.2 and 4.3 appear a bit faster under load than 4.1
and older versions, 4.4 and newer have not yet been evaluated for speed.

2.2.1 Upgrading to transactional databases, also from older bogofilter versions

NOTE: for updates of Berkeley DB itself, see section 2.6!

Bogofilter should transparently upgrade the existing data base to the
new transactional data base, but going the reverse way requires manual
intervention, see section 2.2.2 below.

For enhanced reliability (only available with Berkeley DB 4.1 or newer),
it is recommended that you dump and reload the database once so that
Berkeley DB adds page checksums.  Skip this procedure for Berkeley DB
versions 4.0 or older.  For 4.1 and newer, use these commands:

  cd ~/.bogofilter
  bogoutil -d wordlist.db > wordlist.txt
  mv wordlist.db wordlist.db.old
  bogoutil --db-transaction=yes -l wordlist.db < wordlist.txt

And if all commands succeeded: rm wordlist.txt wordlist.db.old

2.2.2 Downgrading to non-transactional database

To downgrade a transactional database to a non-transactional one, a
dump/reload cycle is required. These commands should do the job:

  cd ~/.bogofilter
  bogoutil -d wordlist.db > wordlist.txt
  mv wordlist.db wordlist.db.old
  rm -f log.?????????? __db.???
  bogoutil --db-transaction=no -l wordlist.db < wordlist.txt
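
The dump/reload cycles of sections 2.2.1 and 2.2.2 differ only in the
--db-transaction value and in the removal of the old environment files.
The following is a minimal sketch that wraps both and removes the
temporary files only if every step succeeded. The function name
reload_wordlist and its argument order are made up for this example;
only the bogoutil options shown in this document are used.

```shell
# Hypothetical helper, not part of bogofilter.
# $1: data base directory, $2: "yes" or "no" for --db-transaction
reload_wordlist() {
    dir=$1 mode=$2
    (
        cd "$dir" || exit 1
        # dump the word list to text and keep the old file as a fallback
        bogoutil -d wordlist.db > wordlist.txt || exit 1
        mv wordlist.db wordlist.db.old || exit 1
        # when downgrading, the old environment and log files must go
        if [ "$mode" = no ]; then
            rm -f log.?????????? __db.???
        fi
        if bogoutil --db-transaction="$mode" -l wordlist.db < wordlist.txt; then
            rm -f wordlist.txt wordlist.db.old
        else
            echo "reload failed; wordlist.db.old and wordlist.txt kept" >&2
            exit 1
        fi
    )
}

# Usage (assumed default directory):
#   reload_wordlist "$HOME/.bogofilter" yes
```

If the reload fails, the sketch leaves wordlist.db.old and wordlist.txt
in place so that the previous state can be restored by hand.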

2.3 Recoverability

The ability to recover the data base after a crash (power failure!)
depends on data being written to the disk (or a battery-backed write
cache) _immediately_ rather than delayed to be written later.

Common disk drives in current PCs and Macs are of the ATA or SATA kind
and usually ship with their write cache enabled. They write fast, but
can lose or corrupt up to a few MB of data when the power fails.
Note: This problem is not specific to bogofilter.

It is possible to sacrifice a bit of the write speed and get
reliability in return, by switching off the disk's write cache (see
appendix A for instructions).

Switching the write cache off may, however, reduce performance below
acceptable levels, particularly for large writes such as recording
live audio or video data to hard disk.
If performance is degraded too much, consider getting a separate disk
drive and using one for fast writes (with the write cache on) and one
for reliable writes (with the write cache off, for bogofilter, mail
servers and other applications that need to survive a power loss
without data loss).

2.4 Choosing a file system

If your computer saves the data on its own disk drive (a "local file
system"), Berkeley DB should work fine. Such file systems are ext2, ext3,
ffs, jfs, hfs, hfs+, reiserfs, ufs, xfs.

Berkeley DB Transactional and Concurrent data stores do not work
reliably with a networked file system for various technical reasons.
AFS, CIFS, Coda, NFS, SMBFS fall into this category.

Strictly speaking, with Berkeley DB 4.0 and older versions, the data base
block size must be written atomically. The bogofilter maintainers are
not currently aware of a file system that meets this requirement and is
production quality at the same time.

2.5 Making a snapshot backup

The transactional data store is no good if the disk drive has become
inaccessible (which happens after some months or years with every
drive), so you _must_ back up your data base regularly (see the
db_archive utility manual for additional documentation of a "hot"
backup). Bogofilter cannot, of course, guess data that got lost through
a hard drive fault.

When copying or archiving directory contents, be sure to copy or archive
the *.db files FIRST, BEFORE archiving/copying the log.* files. This
order is needed for the database copy or archive to remain recoverable.

You can use the bf_tar script for convenience. It requires the pax
utility (UNIX standard for over a decade now) and writes a tar archive
to stdout, and it can optionally remove inactive log files before or
after writing the tarball.

Run bf_tar without arguments to see its synopsis.
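
The ordering rule above can be sketched as a small shell function. This
is not the bf_tar script; it is only an illustration that assumes tar
is available (bf_tar itself uses pax) and that no training runs while
the archive is written. The function name bf_snapshot is made up for
this example. It relies on tar storing members in the order in which
they are named on the command line.

```shell
# Hypothetical helper, not part of bogofilter.
# $1: data base directory (defaults to ~/.bogofilter)
# Writes a tar archive to stdout with the *.db files before the log.* files.
bf_snapshot() {
    dir=${1:-"$HOME/.bogofilter"}
    (
        cd "$dir" || exit 1
        set --
        # data base files first ...
        for f in *.db; do
            [ -f "$f" ] && set -- "$@" "$f"
        done
        # ... then the log files
        for f in log.*; do
            [ -f "$f" ] && set -- "$@" "$f"
        done
        if [ "$#" -eq 0 ]; then
            echo "nothing to archive in $dir" >&2
            exit 1
        fi
        tar -cf - "$@"
    )
}

# Usage:
#   bf_snapshot "$HOME/.bogofilter" > /backup/bogofilter.tar
```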

2.6 Updating Berkeley DB version underneath bogofilter

When upgrading the Berkeley DB library to a new version, or recompiling
bogofilter to use a newer version, two things in the on-disk data format
can change, generally speaking: the database format (we use the Btree
access method), the log file format, or both.

You need a "log file upgrade" for your transactional databases if at
least one of these conditions is true ("-->" means "to"):

- you upgraded Berkeley DB from a 3.X version --> a 4.Y version
- you upgraded Berkeley DB from a 4.X version --> a newer 4.Y version
  that means Y > X; EXCEPT if you upgraded from 4.0 to 4.1

Non-transactional databases do not need log file format upgrades as they
do not use log files.
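
The conditions above can be written down as a small decision function.
This is a sketch only: the name needs_log_upgrade is made up, versions
are expected in "major.minor" or "major.minor.patch" form, and any
combination that the list above does not cover (for example a move to
a 5.X version, or a downgrade) is reported as "unknown" rather than
guessed.

```shell
# Hypothetical helper, not part of bogofilter.
# $1: old Berkeley DB version, $2: new Berkeley DB version
# Prints "yes", "no" or "unknown".
needs_log_upgrade() {
    old_major=${1%%.*}; old_minor=${1#*.}; old_minor=${old_minor%%.*}
    new_major=${2%%.*}; new_minor=${2#*.}; new_minor=${new_minor%%.*}

    if [ "$old_major" -eq 3 ] && [ "$new_major" -eq 4 ]; then
        echo yes                      # 3.X --> 4.Y
    elif [ "$old_major" -eq 4 ] && [ "$new_major" -eq 4 ]; then
        if [ "$new_minor" -eq "$old_minor" ]; then
            echo no                   # same 4.X release series
        elif [ "$new_minor" -lt "$old_minor" ]; then
            echo unknown              # downgrade, not covered here
        elif [ "$old_minor" -eq 0 ] && [ "$new_minor" -eq 1 ]; then
            echo no                   # the 4.0 --> 4.1 exception
        else
            echo yes                  # 4.X --> newer 4.Y
        fi
    else
        echo unknown                  # not covered by the list above
    fi
}

# Usage:
#   needs_log_upgrade 4.2.52 4.3.29
```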

If you need a log file upgrade, the upgrade procedure is:
(NOTE: DO NOT UPGRADE BERKELEY DB OR BOGOFILTER UNTIL STEP 4!)

  1. shut down your mail system,
  2. run 'bogoutil --db-remove-environment ~/.bogofilter' (for each user!)
     (this will automatically recover the environment)
  3. archive the database for catastrophic recovery (make a backup)
  4. install the new Berkeley DB version, recompile bogofilter (unless
     using a binary package), install the new bogofilter
  5. start your mail system.
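
Step 2 has to be repeated for each user. A minimal sketch of such a
loop follows; the function name remove_environments is made up, and the
list of home directories is passed in by the caller because this
document does not say where they live on your system. Since recovery
may create files, it is probably wise to run the command as the owner
of each directory rather than as root.

```shell
# Hypothetical helper, not part of bogofilter.
# Arguments: one home directory per user.
# Returns non-zero if bogoutil failed for at least one directory.
remove_environments() {
    rc=0
    for home in "$@"; do
        dir=$home/.bogofilter
        # users without a bogofilter directory are skipped
        [ -d "$dir" ] || continue
        if ! bogoutil --db-remove-environment "$dir"; then
            echo "failed to remove environment in $dir" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Usage (assuming home directories below /home):
#   remove_environments /home/*
```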

If you've been using Berkeley DB 3.0 (only supported with bogofilter
versions 0.17.2 and older) and are about to update to any newer 3.X or
4.X version, you need a database format upgrade. You can either:

- dump the wordlists to a text file, then update Berkeley
  DB and bogofilter, remove the wordlists and load them from the dump.
  The procedure is the same as in section 2.2.1 or 2.2.2 (depending on
  whether you want to use transactions or not).

or

- use the db_upgrade utility to upgrade the databases in place
  (this is dangerous and must not be interrupted, backup first!)

3. Use and troubleshooting ---------------------------------------------

3.1 LOG FILE HANDLING

This section only applies to transactional databases.

The Berkeley DB Transactional Data Store uses log files to store data
base changes so that they can be rolled back or restored after an
application crash.

The logs of the transactional data store, log.NNNNNNNNNN files of up to
1 MB in size (in the default configuration), can consume considerable
amounts of disk space. Bogofilter therefore removes log files that are
no longer part of active transactions by default.

This automatic removal of log files can be disabled by either using
--db-log-autoremove=no on the command line, or by configuring
db_log_autoremove=no in the configuration file.  If automatic removal is
disabled, it can manually be triggered by running
'bogoutil --db-prune ~/.bogofilter'.

Note that removing log files makes catastrophic recovery impossible
without backups, so you *must* make snapshot
backups that contain both the *.db and log.* files (in this order!). See
section 2.5 for details.

Referral: Berkeley DB's db_archive documentation contains suggestions
for several backup strategies.

3.2 RECOVERY AND FAILED RECOVERY, FOR TRANSACTIONAL DATABASES ONLY

The recovery procedures should be tried in the order shown in this
section. If you aren't willing to experiment much, but have kept
sufficient spam and ham that you can easily and quickly retrain
bogofilter from scratch, read only sections 3.2.1 and 3.2.5, skipping
subsections 3.2.2 to 3.2.4.

3.2.1 Regular recovery

Bogofilter and related bogo* utilities will automatically detect when
regular recovery is needed and run it. This process is transparent;
the user will usually be unaware that it happens.

This process needs the *.db file and the corresponding _active_ log
files.

It is possible to trigger regular recovery by running
bogoutil --db-recover ~/.bogofilter
although this should not be needed.

If this fails, remove the __db.*, *.db and log.* files,
restore from the latest snapshot backup (see section 2.5) and force
recovery as shown in the previous paragraph.

3.2.2 Catastrophic recovery

If regular recovery fails after severe damage to hardware, file system
or database files, you can attempt to run catastrophic recovery. If log
files have been damaged, catastrophic recovery may not work.

This may need *all* log files from the backup and is therefore not
available if log files have been pruned.

To run catastrophic recovery, replace the log files from your archives,
then run:
bogoutil --db-recover-harder ~/.bogofilter

If this fails, read on.

3.2.3 Last-resort recovery method #1: nuke the environment

This recovery method does not guarantee that you get all of your
database back. You may already have lost part or all of your data when
this is required, and you may lose recent updates to the database and
corrupt it. Only attempt this method if the regular and catastrophic
recovery methods have failed or were unavailable.

To use this method:

  1. remove the __db.* and log.* files
  2. run: bogoutil -v --db-verify ~/.bogofilter/wordlist.db
  3a. if that printed OK, watch carefully if bogofilter performs to 
      the standards you are used to.
  3b. if verify failed, read the next section.

3.2.4 Last-resort recovery method #2: salvage the raw .db file

This recovery method does not guarantee that you get all of your
database back. You may already have lost part or all of your data when
this is required, and you may lose recent updates to the database and
corrupt it. Only attempt this method if the regular and catastrophic
recovery methods have failed or were unavailable.

To try this method:

  # salvage raw data
  db_dump -r ~/.bogofilter/wordlist.db > ~/.bogofilter/wordlist.saved
  rm ~/.bogofilter/{__db.*,log.*,wordlist.db}
  db_load ~/.bogofilter/wordlist.db < ~/.bogofilter/wordlist.saved
  # convert to transactional store
  bogoutil -d ~/.bogofilter/wordlist.db > ~/.bogofilter/wordlist.saved
  rm ~/.bogofilter/{__db.*,log.*,wordlist.db}
  bogoutil --db-transaction=yes -l ~/.bogofilter/wordlist.db \
    < ~/.bogofilter/wordlist.saved

3.2.5 No recovery possible?

We're sorry. This should happen really rarely. There's nothing left to
try, so you need to retrain from scratch.

First, remove the database and environment, type:

  rm ~/.bogofilter/{__db.*,log.*,wordlist.db}

Then retrain from scratch with the usual bogofilter -n and bogofilter -s
commands.

3.3 RECOVERY OF TRADITIONAL (NON-TRANSACTIONAL) DATABASES

Non-transactional databases do not store data that could help with
recovering data, so the recovery options are:

3.3.1 Salvage raw data

Note that your database may already have lost data, so this procedure
may not recover everything that used to be in the database before the
corruption happened.

To try this, suspend mail delivery to your directory, then run these
commands:

  db_dump -r ~/.bogofilter/wordlist.db > ~/.bogofilter/wordlist.saved
  rm ~/.bogofilter/{__db.*,log.*,wordlist.db}
  db_load ~/.bogofilter/wordlist.db < ~/.bogofilter/wordlist.saved

and resume mail delivery.

3.3.2 Restore from backup

If the previous procedure has provided you with incomplete data, simply
replace your wordlist.db file by the latest good copy that you have.

3.3.3 Retrain from scratch

If you don't have good backups and the procedure from 3.3.1 also failed,
you have no other choice than to remove the wordlist.db file and redo
the whole training process with the usual bogofilter -s and bogofilter
-n commands.

3.4 COPYING DATABASES

If you intend to copy a transactional database, you MUST copy the *.db
file(s) first and then the log.* files. Note that the database should
NOT be subject to training while the copy is in progress, or automatic
log file removal should be suspended while the copy is made.

If you intend to copy a non-transactional database, you MUST NOT perform
any training while the copy is in progress.

Note that bogofilter's -u option can cause automatic training as mail
arrives.

If you need to copy the *.db files, DO NOT USE cp, BUT DO USE dd instead
and give it a block size that matches the data base's block size, which
can be found by running bogoutil --db-print-pagesize as shown in section
4.1 below. Example:

dd if="$HOME/.bogofilter/wordlist.db" bs=4096 of=/tmp/copy.db

The reason is that some operating systems let their cp command use
mmap() for efficiency without guaranteeing consistency of mmap() versus
simultaneous write().

A bf_copy script is provided for your convenience.
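
For illustration, the dd-based copy can be combined with the page size
query. This sketch is not the bf_copy script. The function name
copy_wordlist is made up, and the code assumes that
bogoutil --db-print-pagesize prints the page size as a bare number; it
refuses to continue if the output looks different. It copies a single
*.db file only; log.* files of a transactional database still have to
be copied afterwards.

```shell
# Hypothetical helper, not part of bogofilter.
# $1: source *.db file, $2: destination file
copy_wordlist() {
    src=$1 dst=$2
    pagesize=$(bogoutil --db-print-pagesize "$src") || return 1
    # assumption: the output is a plain decimal number such as 4096
    case $pagesize in
        ''|*[!0-9]*)
            echo "unexpected page size output: $pagesize" >&2
            return 1
            ;;
    esac
    # dd with a matching block size, as recommended above, instead of cp
    dd if="$src" of="$dst" bs="$pagesize" 2>/dev/null
}

# Usage:
#   copy_wordlist "$HOME/.bogofilter/wordlist.db" /tmp/copy.db
```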

3.5 OTHER TROUBLESHOOTING

3.5.1 "Logging region out of memory; you may need to increase its size"

This happens only for transactional databases. Use the DB_CONFIG file to 
set the log region size (set_lg_regionmax) higher and after that, run 
database recovery - see section 4.2 for details.
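
A minimal sketch of this fix follows. The function name
raise_lg_regionmax is made up for this example; it replaces any
existing set_lg_regionmax line in DB_CONFIG, keeps all other lines and
then runs recovery so that the new size takes effect. The value 262144
in the usage line is the one used as an example in section 4.2, not a
recommendation for your installation.

```shell
# Hypothetical helper, not part of bogofilter.
# $1: data base directory, $2: new log region size in bytes
raise_lg_regionmax() {
    dir=$1 bytes=$2
    cfg=$dir/DB_CONFIG
    tmp=$cfg.tmp.$$
    if [ -f "$cfg" ]; then
        # keep every existing line except an older set_lg_regionmax setting
        sed '/^set_lg_regionmax[[:space:]]/d' "$cfg" > "$tmp" || return 1
    else
        : > "$tmp" || return 1
    fi
    echo "set_lg_regionmax $bytes" >> "$tmp"
    mv "$tmp" "$cfg" || return 1
    # this setting requires recovery to take effect
    bogoutil --db-recover "$dir"
}

# Usage:
#   raise_lg_regionmax "$HOME/.bogofilter" 262144
```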

4. Other Information of Interest ---------------------------------------

4.1 GENERAL INFORMATION

Berkeley DB keeps some additional statistics about the database itself,
and in transactional databases also about caching, logging,
transactions and their efficiency.

The page size of the database file can be queried by bogoutil:

bogoutil --db-print-pagesize ~/.bogofilter/wordlist.db

Further statistics can be obtained by running the db_stat utility:

db_stat -h ~/.bogofilter -d wordlist.db # data base statistics

The following are only available in transactional mode:

db_stat -h ~/.bogofilter -e # environment statistics
db_stat -h ~/.bogofilter -m # buffer pool statistics
db_stat -h ~/.bogofilter -l # log statistics
db_stat -h ~/.bogofilter -t # transaction statistics

Note that statistics may disappear when the data base is recovered. They
will reappear after running bogofilter and become more reliable the
more often bogofilter has been used since the last recovery.

You MUST NOT manually remove files named __db.NNN and log.NNNNNNNNNN
from the ~/.bogofilter directory.
================================================
REMOVING THESE FILES CAUSES DATABASE CORRUPTION!
================================================
These can contain update data for the data base that must still be
written back to the wordlist.db file - this happens when there are many
concurrent processes alongside a registration process.

You can safely remove the __db.NNN files with the following command, but
note that these will reappear when bogofilter or bogoutil is used:

  bogoutil --db-remove-environment ~/.bogofilter

log.* files are removed automatically in the default configuration; the
remaining ones must remain in place. For further information, including
safe removal, see section 3.1.

4.2 INTERESTING CONFIGURATION PARAMETERS FOR DB_CONFIG

A "DB_CONFIG" file that is in your bogofilter home directory, usually
~/.bogofilter, can be used to configure the data base behavior of
transactional databases. Some options take effect immediately, some need
'bogoutil --db-recover' to become effective. All options are ignored if
the database in the same directory as DB_CONFIG is non-transactional.

Here is a list of interesting settings. Use one per line, and omit the
leading spaces and hyphen:

SIZING OPTIONS:

 - set_cachesize      G B C
   (valid in Berkeley DB 3.1 - 4.6, requires recovery to change)

   sets the cache size to G gigabytes plus B bytes, spread out over C
   equally sized caches (all figures are natural numbers). You cannot
   configure caches smaller than 20 kB. Berkeley DB will increase all
   caches under 500 MB by 25%, and you must provide at least one cache
   per 4 GB.

   This option takes precedence over Bogofilter's db_cachesize option!

   Example: set_cachesize 0 60000000 6
   will create six caches sized 12500000 bytes (12.5 MB, +25% applied)

 - set_lg_max         250000
   (valid in Berkeley DB 3.1 - 4.6, takes effect immediately)

   this option configures the maximum log file size, in bytes, before
   Berkeley DB starts a new log file. The default is 1 MB.

 - set_lg_regionmax   262144
   (valid in Berkeley DB 3.3 - 4.6, requires recovery to take effect)

   this option configures the log region size, in bytes, and may 
   sometimes need to be increased. The default is around 60 kB.
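
The arithmetic of the set_cachesize example above can be checked with a
few lines of shell. This is a sketch under stated assumptions: the
function name cache_bytes_each is made up, one gigabyte is taken as
1073741824 bytes and 500 MB as 500 * 1024 * 1024 bytes, and the 25%
increase is applied when the total configured size is below that
threshold. Berkeley DB's own rounding may differ in detail.

```shell
# Hypothetical helper, not part of bogofilter or Berkeley DB.
# $1: G (gigabytes), $2: B (bytes), $3: C (number of caches)
# Prints the approximate size of each cache in bytes.
cache_bytes_each() {
    g=$1 b=$2 c=$3
    total=$(( g * 1073741824 + b ))
    # assumption: sizes under 500 MB are increased by 25%
    if [ "$total" -lt $(( 500 * 1024 * 1024 )) ]; then
        total=$(( total * 5 / 4 ))
    fi
    echo $(( total / c ))
}

# Usage, with the figures from the example above:
#   cache_bytes_each 0 60000000 6
```

With the figures from the example, 60000000 bytes grow to 75000000
bytes, and each of the six caches gets 12500000 bytes.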

SAFE OPTIONS:

 - set_flags DB_DIRECT_DB
   (valid in Berkeley DB 4.1 - 4.6, takes effect immediately)

   this option turns off system buffering of *database* files, to avoid
   double caching of data. NOT SUPPORTED ON ALL PLATFORMS!

 - set_flags DB_DIRECT_LOG
   (valid in Berkeley DB 4.1 - 4.6, takes effect immediately)

   this option turns off system buffering of *log* files, to avoid
   double caching of data. NOT SUPPORTED ON ALL PLATFORMS!

 - set_flags DB_DSYNC_LOG
   (valid in Berkeley DB 4.3 - 4.6, takes effect immediately)

   this option can increase performance on some systems (and decrease on
   other systems), by using the O_DSYNC POSIX flag rather than a
   separate function to flush the logs.

 - set_flags DB_NOMMAP
   (valid in Berkeley DB 3.2 - 4.6, takes effect immediately)

   this option can reduce memory consumption at run time, particularly
   with large databases, at some cost of performance

 - set_flags DB_REGION_INIT
   (valid in Berkeley DB 3.2 - 4.6, takes effect immediately)

   this option causes all shared memory regions to be "page faulted"
   into core memory at application start and written at data base
   creation. This can improve performance under load, and it allows
   bogofilter or bogoutil to allocate disk space in advance, to avoid
   some out-of-space conditions later on.

 - set_verbose DB_VERB_CHKPOINT
   (valid in Berkeley DB 3.1 - 4.2, takes effect immediately)
 - set_verbose DB_VERB_DEADLOCK
   (valid in Berkeley DB 3.1 - 4.6, takes effect immediately)
 - set_verbose DB_VERB_RECOVERY
   (valid in Berkeley DB 3.1 - 4.6, takes effect immediately)
 - set_verbose DB_VERB_WAITSFOR
   (valid in Berkeley DB 3.1 - 4.6, takes effect immediately)

   these verbose flags cause extended output for long-lasting
   operations, ...CHKPOINT prints location information when searching
   for checkpoints, ...DEADLOCK prints information on deadlock
   detection, ...RECOVERY prints information during recovery and
   ...WAITSFOR prints the waits-for table during deadlock detection.

UNSAFE OPTIONS - these may impair robustness/recoverability of the data base

 - set_flags DB_TXN_NOSYNC
   (valid in Berkeley DB 3.3 - 4.4, takes effect immediately)

   if set, the log is not written or synchronously flushed at commit
   time. In case of an application or system crash, the last few
   registrations can be lost.

 - set_flags DB_TXN_WRITE_NOSYNC
   (valid in Berkeley DB 4.1 - 4.4, takes effect immediately)

   if set, the log is written, but not synchronously flushed at commit
   time. In case of a system crash, the last few registrations can be
   lost.

DANGEROUS OPTIONS - these can improve performance, but should be avoided

 - set_flags DB_TXN_NOT_DURABLE
   (valid in Berkeley DB 4.2 - 4.4, takes effect at next application start,
    replaced by DB_LOG_INMEMORY for version 4.3 - 4.4)

   this option prevents writing into log files. In case of application
   or system crashes, the data base can become corrupt, and large
   registrations can exhaust the log buffer space and then fail.

 - set_flags DB_LOG_INMEMORY
   (valid in Berkeley DB 4.3 - 4.4, takes effect at next application start)

   this option prevents the writing of any log files to disk. In case of
   application or system crashes, the data base can become corrupt, and
   large registrations can exhaust the log buffer space and then fail.

   After a crash, verify the data base (see below) and if it is corrupt,
   restore from backup or rebuild from scratch.

4.3 VERIFYING DATABASES TO CHECK FOR CORRUPTION

To verify that the database is intact, type:

    bogoutil --db-verify ~/.bogofilter/wordlist.db

There should be no output. If there are errors, you must recover your
database, see section 3.2 for methods.
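
For use in scripts, the check can be wrapped so that both a failing
exit status and any output count as a problem. This is a sketch; the
function name verify_wordlist is made up, and it assumes that bogoutil
exits with a non-zero status when verification fails.

```shell
# Hypothetical helper, not part of bogofilter.
# $1: data base file (defaults to ~/.bogofilter/wordlist.db)
# Returns 0 only if bogoutil succeeded and printed nothing.
verify_wordlist() {
    db=${1:-"$HOME/.bogofilter/wordlist.db"}
    if out=$(bogoutil --db-verify "$db" 2>&1) && [ -z "$out" ]; then
        return 0
    fi
    printf '%s\n' "verification of $db failed:" "$out" >&2
    return 1
}

# Usage:
#   verify_wordlist || echo "see section 3.2 for recovery methods"
```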

A. Switching the disk drive's write cache off and on -------------------

A.1 Introduction

You need to determine the name of the disk device and its type.
Type "mount" and look at its output. Find the line that contains
"on /home" if you have it; if you don't, check for "on /usr"; failing
that, resort to the "on /" line.

From this line, look at the left hand column, usually starting with /dev.
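
As an alternative to reading the mount output by eye, the device can be
looked up with df. This is a different technique from the one described
above, shown only as a sketch: the function name device_for is made up,
and df -P (the POSIX output format) is assumed to be available. df
reports the file system that holds the given path, so asking for /home
also works when /home is not a separate mount.

```shell
# Hypothetical helper, not part of bogofilter.
# $1: a path on the file system of interest (defaults to /home)
# Prints the first column of the df output, usually a /dev name.
device_for() {
    df -P "${1:-/home}" | awk 'NR == 2 { print $1 }'
}

# Usage:
#   device_for /home
```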

If you have FreeBSD, skip to section A.3 now.

A.2 Switching the write cache off or on in Linux

In the line you've found (see previous section A.1), the left hand
column usually starts with /dev/hda, /dev/hde or /dev/sda; you can
ignore the trailing number. /dev/hd* means ATA, /dev/sd* means SCSI.

If the drive name starts with /dev/hd, type the following line, but
replace hda by hde or whatever else you may have found:

/sbin/hdparm -W0 /dev/hda
                 (replace -W0 by -W1 to re-enable the write cache)

If your drive name starts with /dev/sd, use the graphical scsi-config
utility and give it the device name on the command line; for
example:

scsi-config /dev/sda

You can "try changes" (they will be forgotten the next time the computer
is switched off) or "save changes" (settings will be saved permanently);
you can use the same utility to restore the previous setting or load
manufacturer defaults. Skip to section 2.4.

What is this scsi-config?

The scsi-config command is a Tk script, delivered with the scsiinfo
package.  At the time of writing, scsiinfo can be found at
ftp://tsx-11.mit.edu/pub/linux/ALPHA/scsi/scsiinfo-1.7.tar.gz .

For users who don't run X on their mail servers, there is also a
command-line utility, scsiinfo, in the package.  Setting parameters
with scsiinfo is a bit hairy, but the following sequence worked for two
of us who tried it (back up your drive first):

# get current disk settings and turn off the write cache
# (substitute the appropriate device for /dev/sda in all these commands)
parms=`scsiinfo -cX /dev/sda | sed 's/^./0/'`

# write the parameters back to the hard drive's current settings
# this needs to be put in a boot script
scsiinfo -cXR /dev/sda $parms

# if you don't want to put this in a boot script, you can alternatively
# save the parameters to the hard drive's settings area:
scsiinfo -cXRS /dev/sda $parms

You did back up your drive before trying that, right? :)


A.3 Switching the write cache off in FreeBSD

Have you read section A.1 already? You should have.

The line you've found (see section A.1) usually starts with /dev/ad0 or
/dev/wd0 (either means you have ATA) or with /dev/da0 (which means you
have SCSI).

If you have ATA, add the line

  hw.ata.wc="0"

to /boot/loader.conf.local, shut down all applications and reboot. (To
revert the change, remove the line, shut down all applications and
reboot.)

If you have SCSI, you'll need to decide if you want the setting to last until
the next reboot or to be permanent (the permanent setting can be changed back,
don't worry).
In either case, omit the leading /dev and trailing s<NUMBER><LETTER> parts
(/dev/da0s1a -> da0; /dev/da4s3f -> da4). Replace da0 by your device name in
these examples, and leave out the part in parentheses:

  camcontrol modepage da0 -m8 -e -P0 (effective until computer is switched off)
  camcontrol modepage da0 -m8 -e -P3 (save parameters permanently)

camcontrol will open a temporary file with a WCE: line on top. Edit the
figure to read 0 (cache disabled) or 1 (cache enabled), then save the
file and exit the editor.