Cross-validation is a widely used method for comparing the quality of classification and learning algorithms. It therefore permits rudimentary comparisons between classifiers which make use of dbacl(1) and bayesol(1), and other competing classifiers.
The mechanics of cross-validation are as follows: A set of pre-classified email messages is first split into a number of roughly equal-sized subsets. For each subset, the filter (by default, dbacl(1)) is used to classify each message within this subset, based upon having learned the categories from the remaining subsets. The resulting classification errors are then averaged over all subsets.
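The splitting and averaging described above can be sketched in a few lines of shell. This is only an illustration of the idea, not the code mailcross itself runs: the function names assign_subsets and average_errors, the round-robin assignment, and the message names m1 to m6 are all assumptions made for the example.

```shell
#!/bin/sh
# Illustrative sketch only; mailcross may split and average differently.

# Assign each message to one of N subsets in round-robin order.
# Prints "subset message" pairs, one per line.
assign_subsets() {
    n_subsets=$1; shift
    i=0
    for msg in "$@"; do
        echo "$((i % n_subsets)) $msg"
        i=$((i + 1))
    done
}

# Average per-subset error percentages (integer arithmetic, rounds down).
average_errors() {
    total=0; count=0
    for e in "$@"; do
        total=$((total + e))
        count=$((count + 1))
    done
    echo $((total / count))
}

# For each subset in turn: classify it after learning from all the others.
k=3
assignment=$(assign_subsets "$k" m1 m2 m3 m4 m5 m6)
fold=0
while [ "$fold" -lt "$k" ]; do
    test_set=$(echo "$assignment" | awk -v f="$fold" '$1 == f { printf "%s ", $2 }')
    train_set=$(echo "$assignment" | awk -v f="$fold" '$1 != f { printf "%s ", $2 }')
    echo "fold $fold: classify [ $test_set] after learning [ $train_set]"
    fold=$((fold + 1))
done
```

Every message is classified exactly once, and never by a filter that has already seen it during learning.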
The results obtained by cross-validation essentially do not depend upon the ordering of the sample emails. Other methods (see mailtoe(1), mailfoot(1)) attempt to capture the behaviour of classification errors over time.
mailcross uses the environment variables MAILCROSS_LEARNER and MAILCROSS_FILTER when executing, which permits the cross-validation of arbitrary filters, provided these satisfy the compatibility conditions stated in the ENVIRONMENT section below.
For convenience, mailcross implements a testsuite framework with predefined wrappers for several open source classifiers. This permits the direct comparison of dbacl(1) with competing classifiers on the same set of email samples. See the USAGE section below.
During preparation, mailcross builds a subdirectory named mailcross.d in the current working directory. All needed calculations are performed inside this subdirectory.
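The normal usage pattern is the following: first, you should separate your email collection into several categories (manually or otherwise). Each category should be associated with one or more folders, but no folder should contain more than one category. Next, you should decide how many subsets to use, say 10. Note that the calculations slow down rapidly as the number of subsets grows. Once the working directory has been prepared for that number of subsets, you must add every folder for each category. For example, with three categories named spam, work and play, stored in the mbox files spam.mbox, work.mbox and play.mbox respectively, you would type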
% mailcross add spam spam.mbox
% mailcross add work work.mbox
% mailcross add play play.mbox
You can now perform as many simulations as desired. Every cross-validation consists of a learning stage, a running stage and a summarizing stage. These operations are performed on the classifier specified in the MAILCROSS_FILTER and MAILCROSS_LEARNER variables. By setting these variables appropriately, you can compare classification performance as you vary the command line options of your classifier(s).
% mailcross learn
% mailcross run
% mailcross summarize
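Since each stage depends on the one before it, it can be convenient to chain the three commands so that a failure stops the sequence. The following is a minimal sketch; the helper name run_stages is made up for this example, and it assumes that mailcross returns a nonzero exit status when a stage fails.

```shell
# Run the three stages in order with the given tool,
# stopping at the first one that fails.
# Assumption: the tool signals failure through its exit status.
run_stages() {
    tool=$1
    for stage in learn run summarize; do
        "$tool" "$stage" || {
            echo "stage '$stage' failed" >&2
            return 1
        }
    done
}

# Typical call (needs mailcross on the PATH and a populated mailcross.d):
# run_stages mailcross
```

Passing the tool name as an argument keeps the helper usable with a stand-in command when trying it out.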
The testsuite commands are designed to simplify the above steps and allow comparison of a wide range of email classifiers, including but not limited to dbacl. Classifiers are supported through wrapper scripts, which are located in the /usr/share/dbacl/testsuite directory.
The first stage when using the testsuite is deciding which classifiers to compare. You can view a list of available wrappers by typing:
% mailcross testsuite list
Note that the wrapper scripts are NOT the actual email classifiers, which must be installed separately by your system administrator or otherwise. Once this is done, you can select one or more wrappers for the simulation by typing, for example:
% mailcross testsuite select dbaclA ifile
If some of the selected classifiers cannot be found on the system, they are not selected. Note also that some wrappers can have hard-coded category names, e.g. if the classifier only supports binary classification. Heed the warning messages.
It remains only to run the simulation. Beware, this can take a long time (several hours depending on the classifier).
% mailcross testsuite run
% mailcross testsuite summarize
Once you are all done with simulations, you can delete the working files, log files etc. by typing
% mailcross clean
The progress of the cross-validation is written silently to various log files located in the mailcross.d/log directory. Check these in case of problems.
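One way to narrow down which log files deserve a closer look is to search them for telltale words. The sketch below is a guess at a useful starting point: the helper name problem_logs and the search terms "error" and "warning" are assumptions, since the exact wording of the log messages is not documented here.

```shell
# Print the names of log files that mention "error" or "warning"
# (case-insensitive). The directory defaults to mailcross.d/log.
# Assumption: problems are reported using one of these two words.
problem_logs() {
    log_dir=${1:-mailcross.d/log}
    [ -d "$log_dir" ] || {
        echo "no log directory: $log_dir" >&2
        return 1
    }
    grep -rilE 'error|warning' "$log_dir"
}

# Typical call from the directory that contains mailcross.d:
# problem_logs
```

An empty result only means that neither word was found, not that the run was free of problems.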
mailcross testsuite takes care of learning and classifying your prepared email corpora for each selected classifier. Since classifiers have widely varying interfaces, this is only possible by wrapping those interfaces individually into a standard form which can be used by mailcross testsuite.
Each wrapper script is a command line tool which accepts a single command followed by zero or more optional arguments, in the standard form:
wrapper command [argument]...
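The calling convention itself can be illustrated with a small shell skeleton. This sketch shows only the argument handling; "hello" and "count" are placeholder commands invented for the example and are not commands that mailcross testsuite issues, and the function name wrapper_main is likewise hypothetical.

```shell
# Skeleton of a script following "wrapper command [argument]...".
# "hello" and "count" are placeholders for illustration only; they are
# NOT part of the mailcross testsuite interface.
wrapper_main() {
    [ $# -ge 1 ] || {
        echo "usage: wrapper command [argument]..." >&2
        return 2
    }
    cmd=$1
    shift                       # the remaining "$@" are the optional arguments
    case "$cmd" in
        hello) echo "hello" ;;
        count) echo "$#" ;;     # number of arguments after the command
        *)     echo "unknown command: $cmd" >&2
               return 2 ;;
    esac
}

# In a standalone script the last line would be:  wrapper_main "$@"
```

Rejecting unknown commands with a nonzero status makes interface mismatches visible instead of silently producing no output.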
Each wrapper script also makes use of STDIN and STDOUT in a well defined way. If no behaviour is described, then no output or input should be used. The possible commands are described below:
Right after loading, mailcross reads the hidden file .mailcrossrc in the $HOME directory, if it exists, so this would be a good place to define custom values for environment variables.
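A .mailcrossrc file might look like the sketch below, assuming the file is read as ordinary shell variable assignments. The commands my-learner and my-filter are hypothetical stand-ins; replace them with commands that satisfy the compatibility conditions stated in the ENVIRONMENT section.

```shell
# Example $HOME/.mailcrossrc
# Assumption: the file is read as shell variable assignments.
# my-learner and my-filter are hypothetical commands, not real programs.
MAILCROSS_LEARNER="my-learner"
MAILCROSS_FILTER="my-filter"
export MAILCROSS_LEARNER MAILCROSS_FILTER
```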
The subdirectory mailcross.d can grow quite large. It contains a full copy of the training corpora, learning files for every combination of subset and added category (size times the number of categories, where size is the number of subsets), and various log files.
Cross-validation is a widely used but ad hoc statistical procedure, completely unrelated to Bayesian theory, and subject to controversy. Use it at your own risk.
The source code for the latest version of this program is available at the following locations:
Laird A. Breyer <email@example.com>
bayesol(1), dbacl(1), mailinspect(1), mailtoe(1), mailfoot(1), regex(7)