AutoClass is looking for the best classification(s) of the data it can find. A classification is composed of:
As a strictly Bayesian system (accept no substitutes!), the quality measure AutoClass uses is the total probability that, had you known nothing about your data or its domain, you would have found this set of data generated by this underlying model. This includes the prior probability that the "world" would have chosen this number of classes, this set of relative class weights, and this set of parameters for each class, and the likelihood that such a set of classes would have generated this set of values for the attributes in the data cases.
These probabilities are typically very small, in the range of e^-30000, and so are usually expressed in exponential notation.
When run with the -search command, AutoClass searches for a classification. The required arguments are the paths to the four input files, which supply the data, the data format, the desired classification model, and the search parameters, respectively.
By default, AutoClass writes intermediate results in a binary file. With the -reports command, AutoClass generates an ASCII report. The arguments are the full path names of the .results, .search, and .r-params files.
When run with the -predict command, AutoClass predicts the class membership of a "test" data set based on classes found in a "training" data set (see "PREDICTIONS" below).
For more detailed information on the formats of these files, see /usr/share/doc/autoclass/preparation-c.text.
For example:
white 38.991306 0.54248405 2 2 1
red 25.254923 0.5010235 9 2 1
yellow 32.407973 ? 8 2 1
all_white 28.953982 0.5267696 0 1 1
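As an illustration of how such a data line breaks down into attribute values, here is a minimal Python sketch. It is not part of AutoClass; the function name parse_case is hypothetical, and the defaults assume the blank separator and the '?' unknown token used in the example above.

```python
def parse_case(line, separator_char=" ", unknown_token="?"):
    """Split one data line into fields; unknown values become None.

    Hypothetical helper for illustration only. AutoClass itself reads
    these settings from the header (.hd2) file.
    """
    fields = [f.strip() for f in line.strip().split(separator_char)]
    fields = [f for f in fields if f != ""]  # drop empties from repeated separators
    return [None if f == unknown_token else f for f in fields]


if __name__ == "__main__":
    print(parse_case("yellow 32.407973 ? 8 2 1"))
```

The third attribute of the "yellow" case is reported as None, which mirrors how an unknown value is marked in the data file.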
A header file follows this general format:
;; num_db2_format_defs value (number of format def lines
;; that follow), range of n is 1 -> 5
num_db2_format_defs n
;; number_of_attributes token and value required
number_of_attributes <as required>
;; following are optional - default values are specified
separator_char ' '
comment_char ';'
unknown_token '?'
separator_char ','
;; attribute descriptors
;; <zero-based att#> <att_type> <att_sub_type> <att_description>
;; <att_param_pairs>
Each attribute descriptor is a line of:
Attribute index (zero based, beginning in column 1)
Attribute type. See below.
Attribute subtype. See below
Attribute description: symbol (no embedded blanks) or
string; <= 40 characters
Specific property and value pairs.
Currently available combinations:
type       subtype    property type(s)
----       --------   ----------------
dummy      none/nil   --
discrete   nominal    range
real       location   error
real       scalar     zero_point rel_error
The ERROR property should represent your best estimate of the
average error expected in the measurement and recording of that real
attribute. Lacking better information, the error can be taken as 1/2
the minimum possible difference between measured values. It can be
argued that real values are often truncated, so that smaller errors
may be justified, particularly for generated data. But AutoClass only
sees the recorded values. So it needs the error in the recorded
values, rather than the actual measurement error. Setting this error
much smaller than the minimum expressible difference implies the
possibility of values that cannot be expressed in the data. Worse, it
implies that two identical values must represent measurements that
were much closer than they might actually have been. This leads to
over-fitting of the classification.
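The fallback rule above (half the minimum possible difference between recorded values) can be approximated from the data itself. The following Python sketch is an assumption-laden illustration, not an AutoClass facility: the function name default_error is hypothetical, and it uses the smallest gap actually observed between distinct recorded values, which can only overestimate the true recording resolution.

```python
def default_error(values):
    """Half the smallest gap between distinct recorded values.

    Hypothetical helper: the observed minimum gap stands in for the
    minimum expressible difference, so treat the result as an upper
    bound on the value suggested in the text.
    """
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    min_gap = min(b - a for a, b in zip(distinct, distinct[1:]))
    return min_gap / 2.0


if __name__ == "__main__":
    # Values recorded to two decimal places.
    recorded = [0.54, 0.50, 0.52, 0.53, 0.50]
    print(default_error(recorded))
```

If the recording precision is known (for example, two decimal places), using that directly is preferable to estimating it from the data.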
The REL_ERROR property is used for SCALAR reals when the error is proportional to the measured value. The ERROR property is not supported for SCALAR reals.
AutoClass uses the error as a lower bound on the width of the normal distribution. So small error estimates tend to give narrower peaks and to increase both the number of classes and the classification probability. Broad error estimates tend to limit the number of classes.
The scalar ZERO_POINT property is the smallest value that the measurement process could have produced. This is often 0.0, or less by some error range. Similarly, the bounded real's min and max properties are exclusive bounds on the attribute's generating process. For a calculated percentage these would be 0-e and 100+e, where e is an error value. The discrete attribute's range is the number of possible values the attribute can take on. This range must include unknown as a value when such values occur.
Header File Example:
!#; AutoClass C header file -- extension .hd2
!#; the following chars in column 1 make the line a comment:
!#; '!', '#', ';', ' ', and '\n' (empty line)
;#! num_db2_format_defs <num of def lines -- min 1, max 4>
num_db2_format_defs 2
;; required
number_of_attributes 7
;; optional - default values are specified
;; separator_char ' '
;; comment_char ';'
;; unknown_token '?'
separator_char ','
;; <zero-based att#> <att_type> <att_sub_type> <att_description> <att_param_pairs>
0 dummy nil "True class, range = 1 - 3"
1 real location "X location, m. in range of 25.0 - 40.0" error .25
2 real location "Y location, m. in range of 0.5 - 0.7" error .05
3 real scalar "Weight, kg. in range of 5.0 - 10.0" zero_point 0.0 rel_error .001
4 discrete nominal "Truth value, range = 1 - 2" range 2
5 discrete nominal "Color of foobar, 10 values" range 10
6 discrete nominal Spectral_color_group range 6
Each model is specified by one or more model group definition lines. Each model group line associates attribute indices with a model term type.
Here is an example model file:
# AutoClass C model file -- extension .model
model_index 0 7
ignore 0
single_normal_cn 3
single_normal_cn 17 18 21
multi_normal_cn 1 2
multi_normal_cn 8 9 10
multi_normal_cn 11 12 13
single_multinomial default
Here, the first line is a comment. The following characters in column 1 make the line a comment: `!', `#', ` ', `;', and `\n' (empty line).
The tokens "model_index n m" must appear on the first non-comment line, and precede the model term definition lines. n is the zero-based model index, typically 0 where there is only one model -- the majority of search situations. m is the number of model term definition lines that follow.
The last seven lines are model group lines. Each model group line
consists of:
Notes:
See the documentation in models-c.text for further information about specific model terms.
Once you have succeeded in describing your data with a header file and model file that passes the AUTOCLASS -SEARCH <...> input checks, you will have entered the search domain where AutoClass classifies your data. (At last!)
The main function to use in finding a good classification of your data is AUTOCLASS -SEARCH, and using it will take most of the computation time. Searches are invoked with:
autoclass -search <.db2 file path> <.hd2 file path>
<.model file path> <.s-params file path>
All files must be specified as fully qualified relative or absolute
pathnames. File name extensions (file types) for all files are forced
to canonical values required by the AutoClass program:
data file ("ascii") db2
data file ("binary") db2-bin
header file hd2
model file model
search params file s-params
The sample-run (/usr/share/doc/autoclass/examples/) that comes with AutoClass shows some sample searches, and browsing these is probably the fastest way to get familiar with how to do searches. The test data sets located under /usr/share/doc/autoclass/examples/ will show you some other header (.hd2), model (.model), and search params (.s-params) file setups. The remainder of this section describes how to do searches in somewhat more detail.
The bold faced tokens below are generally search params file parameters. For more information on the s-params file, see SEARCH PARAMETERS below, or /usr/share/doc/autoclass/search-c.text.gz.
The relative probability between different classifications found can be very large, like e^1000, so the very best classification found is usually overwhelmingly more probable than the rest (and overwhelmingly less probable than any better classifications as yet undiscovered). If AutoClass should manage to find two classifications that are within about exp(5-10) of each other (i.e. within 100 to 10,000 times more probable) then you should consider them to be about equally probable, as our computation is usually not more accurate than this (and sometimes much less).
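Because these probabilities are far too small (and their ratios far too large) to hold in floating point, comparisons are best made on the logarithms. The Python sketch below is illustrative only: the function names are hypothetical, and the default tolerance of 5.0 is simply the lower end of the exp(5-10) range mentioned above.

```python
def log_ratio(log_p_a, log_p_b):
    """Natural log of P(a) / P(b), computed without leaving log space."""
    return log_p_a - log_p_b


def about_equal(log_p_a, log_p_b, tolerance=5.0):
    """True when two classifications are within exp(tolerance) of each other.

    The tolerance is an assumption taken from the 5-10 range in the text.
    """
    return abs(log_ratio(log_p_a, log_p_b)) <= tolerance


if __name__ == "__main__":
    best, second, third = -30000.0, -30003.0, -31000.0
    print(about_equal(best, second))  # within exp(5) of each other
    print(about_equal(best, third))   # a factor of exp(1000) apart
```

Working with differences of logs avoids calling exp() on values such as 1000, which would overflow a double.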
The standard approach to massaging is to
and repeat till they stop changing. There are three available convergence algorithms: "converge_search_3" (the default), "converge_search_4" and "converge". Their specification is controlled by search params file parameter try_fn_type.
Deciding when to stop is a judgment call and it's up to you. Since the search includes a random component, there's always the chance that if you let it keep going it will find something better. So you need to trade off how much better it might be with how long it might take to find it. The search status reports that are printed when a new best classification is found are intended to provide you information to help you make this tradeoff.
One clear sign that you should probably stop is if most of the classifications found are duplicates of previous ones (flagged by "dup" as they are found). This should only happen for very small sets of data or when fixing a very small number of classes, like two.
Our experience is that for moderately large to extremely large data sets (~200 to ~10,000 data cases), it is necessary to run AutoClass for at least 50 trials.
By default AUTOCLASS -SEARCH will write out a number of files, both at the end and periodically during the search (in case your system crashes before it finishes). These files will all have the same name (taken from the search params pathname [<name>.s-params]), and differ only in their file extensions. If your search runs are very long and there is a possibility that your machine may crash, you can have intermediate "results" files written out. These can be used to restart your search run with minimum loss of search effort. See the documentation file /usr/share/doc/autoclass/checkpoint-c.text.
A ".log" file will hold a listing of most of what was printed to the screen during the run, unless you set log_file_p to false to say you want no such foolishness. Unless results_file_p is false, a binary ".results-bin" file (the default) or an ASCII ".results" text file, will hold the best classifications that were returned, and unless search_file_p is false, a ".search" file will hold the record of the search tries. save_compact_p controls whether the "results" files are saved as binary or ASCII text.
If the C global variable "G_safe_file_writing_p" is defined as TRUE in "autoclass-c/prog/globals.c", the names of "results" files (those that contain the saved classifications) are modified internally to account for redundant file writing. If the search params file name is "my_saved_clsfs" you will see the following "results" file names (ignoring directories and pathnames for this example)
save_compact_p = true --
"my_saved_clsfs.results-bin" - completely written file
"my_saved_clsfs.results-tmp-bin" - partially written file, renamed
when complete
save_compact_p = false --
"my_saved_clsfs.results" - completely written file
"my_saved_clsfs.results-tmp" - partially written file, renamed
when complete
If check pointing is being done, these additional names will appear
save_compact_p = true --
"my_saved_clsfs.chkpt-bin" - completely written checkpoint file
"my_saved_clsfs.chkpt-tmp-bin" - partially written checkpoint file,
renamed when complete
save_compact_p = false --
"my_saved_clsfs.chkpt" - completely written checkpoint file
"my_saved_clsfs.chkpt-tmp" - partially written checkpoint file,
renamed when complete
autoclass -search <.db2 file path> <.hd2 file path>
<.model file path> <.s-params file path>
To restart a previous search, specify that force_new_search_p
has the value false in the search params file, since its default is
true. Specifying false tells AUTOCLASS -SEARCH to try to find a
previous compatible search (<...>.results[-bin] & <...>.search) to
continue from, and will restart using it if found. To force a new
search instead of restarting an old one, give the parameter
force_new_search_p the value of true, or use the default. If
there is an existing search (<...>.results[-bin] & <...>.search), the
user will be asked to confirm continuation since continuation will
discard the existing search.
If a previous search is continued, the message "RESTARTING SEARCH" will be given instead of the usual "BEGINNING SEARCH". It is generally better to continue a previous search than to start a new one, unless you are trying a significantly different search method, in which case statistics from the previous search may mislead the current one.
After each try a very short report (only a few characters long) is given. After each new best classification, a longer report is given, but no more often than min_report_period (default is 30 seconds).
"converge_search_3" uses an absolute stopping criterion (rel_delta_range, default value of 0.0025) which tests the variation of each class of the delta of the log approximate-marginal-likelihood of the class statistics with-respect-to the class hypothesis (class->log_a_w_s_h_j) divided by the class weight (class->w_j) between successive convergence cycles. Increasing this value loosens the convergence and reduces the number of cycles. Decreasing this value tightens the convergence and increases the number of cycles. n_average (default value of 3) specifies how many successive cycles must meet the stopping criterion before the trial terminates.
"converge_search_4" uses an absolute stopping criterion (cs4_delta_range, default value of 0.0025) which tests the variation of each class of the slope for each class of log approximate-marginal-likelihood of the class statistics with-respect-to the class hypothesis (class->log_a_w_s_h_j) divided by the class weight (class->w_j) over sigma_beta_n_values (default value 6) convergence cycles. Increasing the value of cs4_delta_range loosens the convergence and reduces the number of cycles. Decreasing this value tightens the convergence and increases the number of cycles. Computationally, this try function is more expensive than "converge_search_3", but may prove useful if the computational "noise" is significant compared to the variations in the computed values. Key calculations are done in double precision floating point, and for the largest data base we have tested so far ( 5,420 cases of 93 attributes), computational noise has not been a problem, although the value of max_cycles needed to be increased to 400.
"converge" uses one of two absolute stopping criterion which test the variation of the classification (clsf) log_marginal (clsf->log_a_x_h) delta between successive convergence cycles. The largest of halt_range (default value 0.5) and halt_factor * current_clsf_log_marginal) is used (default value of halt_factor is 0.0001). Increasing these values loosens the convergence and reduces the number of cycles. Decreasing these values tightens the convergence and increases the number of cycles. n_average (default value of 3) specifies how many cycles must meet the stopping criteria before the trial terminates. This is a very approximate stopping criterion, but will give you some feel for the kind of classifications to expect. It would be useful for "exploratory" searches of a data base.
The purpose of reconverge_type = "chkpt" is to complete an interrupted classification by continuing from its last checkpoint. The purpose of reconverge_type = "results" is to attempt further refinement of the best completed classification using a different value of try_fn_type ("converge_search_3", "converge_search_4", "converge"). If max_n_tries is greater than 1, then in each case, after the reconvergence has completed, AutoClass will perform further search trials based on the parameter values in the <...>.s-params file.
With the use of reconverge_type ( default value ""), you may apply more than one try function to a classification. Say you generate several exploratory trials using try_fn_type = "converge", and quit the search saving .search and .results[-bin] files. Then you can begin another search with try_fn_type = "converge_search_3", reconverge_type = "results", and max_n_tries = 1. This will result in the further convergence of the best classification generated with try_fn_type = "converge", with try_fn_type = "converge_search_3". When AutoClass completes this search try, you will have an additional refined classification.
A good way to verify that any of the alternate try_fn_type values are generating a well converged classification is to run AutoClass in prediction mode on the same data used for generating the classification. Then generate and compare the corresponding case or class cross reference files for the original classification and the prediction. Small differences between these files are to be expected, while large differences indicate incomplete convergence. Differences between such file pairs should, on average and modulo class deletions, decrease monotonically with further convergence.
The standard way to create a random classification to begin a try is with the default value of "random" for start_fn_type. There are no other random alternatives at this point. Specifying "block" for start_fn_type instead produces repeatable non-random searches. That is how the <..>.s-params files in the autoclass-c/data/.. sub-directories are specified, and how development testing is done.
max_cycles controls the maximum number of convergence cycles that will be performed in any one trial by the convergence functions. Its default value is 200. The screen output shows a period (".") for each cycle completed. If your search trials run for 200 cycles, then either your data base is very complex (increase the value), or the try_fn_type is not adequate for the situation (try another of the available ones, and use converge_print_p to get more information on what is going on).
Specifying converge_print_p to be true will generate a brief print-out for each cycle which will provide information so that you can modify the default values of rel_delta_range & n_average for "converge_search_3"; cs4_delta_range & sigma_beta_n_values for "converge_search_4"; and halt_range, halt_factor, and n_average for "converge". Their default values are given in the <..>.s-params files in the autoclass-c/data/.. sub-directories.
n_classes_fn_type = "random_ln_normal" is the default way to make this choice. It fits a log normal to the number of classes (usually called "j" for short) of the 10 best classifications found so far, and randomly selects from that. There is currently no alternative.
To start the game off, the default is to go down start_j_list for the first few tries, and then switch to n_classes_fn_type. If you believe that the probable number of classes in your data base is say 75, then instead of using the default value of start_j_list (2, 3, 5, 7, 10, 15, 25), specify something like 50, 60, 70, 80, 90, 100.
If one wants to always look for, say, three classes, one can use fixed_j and override the above. Search status reports will describe what the current method for choosing j is.
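The selection of j described above can be sketched in Python as follows. This is only a reading of the text, not the AutoClass C code: the function name choose_j is hypothetical, indexing start_j_list by try number is an assumption about how the list is consumed, and rounding the log-normal sample to an integer of at least 1 is also an assumption.

```python
import math
import random
import statistics

DEFAULT_START_J_LIST = [2, 3, 5, 7, 10, 15, 25]


def choose_j(try_index, best_js, start_j_list=DEFAULT_START_J_LIST,
             fixed_j=None, rng=random):
    """Pick the number of classes for the next try (illustrative sketch).

    best_js holds the class counts of the best classifications so far,
    best first; it must be non-empty once start_j_list is used up.
    """
    if fixed_j is not None:
        return fixed_j                      # fixed_j overrides everything
    if try_index < len(start_j_list):
        return start_j_list[try_index]      # walk down start_j_list first
    logs = [math.log(j) for j in best_js[:10]]   # 10 best classifications
    mu = statistics.mean(logs)
    sigma = statistics.pstdev(logs)
    if sigma == 0.0:
        return max(1, round(math.exp(mu)))
    return max(1, round(rng.lognormvariate(mu, sigma)))


if __name__ == "__main__":
    print(choose_j(0, []))                          # first entry of start_j_list
    print(choose_j(9, [6, 7, 5, 8], rng=random.Random(0)))
```

Passing a seeded random.Random instance makes the sketch repeatable, in the same spirit as the repeatable searches discussed for start_fn_type.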
AutoClass C is configured to handle a maximum of 999 attributes. If you attempt to run with more than that you will get array bound violations. In that case, change these configuration parameters in prog/autoclass.h and recompile AutoClass C:
#define ALL_ATTRIBUTES 999
#define VERY_LONG_STRING_LENGTH 20000
#define VERY_LONG_TOKEN_LENGTH 500
For example, these values will handle several thousand attributes:
#define ALL_ATTRIBUTES 9999
#define VERY_LONG_STRING_LENGTH 50000
#define VERY_LONG_TOKEN_LENGTH 50000
Disk space taken up by the "log" file will of course depend on the duration of the search. n_save (default value = 2) determines how many best classifications are saved into the ".results[-bin]" file. save_compact_p controls whether the "results" and "checkpoint" files are saved as binary. Binary files are faster and more compact, but are not portable. The default value of save_compact_p is true, which causes binary files to be written.
If the time taken to save the "results" files is a problem, consider increasing min_save_period (default value = 1800 seconds or 30 minutes). Files are saved to disk this often if there is anything different to report.
The running time of very large data sets will be quite uncertain. We advise that a few small scale test runs be made on your system to determine a baseline. Specify n_data to limit how many data vectors are read. Given a very large quantity of data, AutoClass may find its most probable classifications at upwards of a hundred classes, and this will require that start_j_list be specified appropriately (See above section HOW MANY CLASSES?). If you are quite certain that you only want a few classes, you can force AutoClass to search with a fixed number of classes specified by fixed_j. You will then need to run separate searches with each different fixed number of classes.
However, since the ".results" file is ASCII text, those pathnames could be changed with a text editor (save_compact_p must be specified as false).
n_clsfs 1
n_clsfs = 1
n_clsfs<tab>1
Spaces are ignored if "=" or "<tab>" are used as separators. Note there are no trailing semicolons.
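The three equivalent forms above can be read by a very small parser. The Python sketch below is illustrative and is not how AutoClass itself reads the file: the function name parse_param_line is hypothetical, and treating lines that begin with '!', '#', or ';' as comments is an assumption carried over from the header and model file conventions.

```python
def parse_param_line(line):
    """Return (name, value_text) for a parameter line, else None.

    Accepts "name value", "name = value" and "name<tab>value".
    Comment handling is an assumption (see the lead-in).
    """
    stripped = line.strip()
    if not stripped or stripped[0] in "!#;":
        return None
    if "=" in stripped:
        name, value = stripped.split("=", 1)
    else:
        parts = stripped.split(None, 1)   # any run of blanks or tabs
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
    return name.strip(), value.strip()


if __name__ == "__main__":
    for text in ("n_clsfs 1", "n_clsfs = 1", "n_clsfs\t1", ";; a comment"):
        print(repr(text), "->", parse_param_line(text))
```

Values are returned as text; converting them to integers, booleans, or lists is left out of this sketch.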
The search parameters, with their default values, are as follows:
Random Number Generator. This implementation of AutoClass C uses the Unix srand48/lrand48 random number generator which generates pseudo-random numbers using the well-known linear congruential algorithm and 48-bit integer arithmetic. lrand48() returns non-negative long integers uniformly distributed over the interval [0, 2**31).
Search Parameters. The following .s-params file parameters should be specified:
force_new_search_p = true
start_fn_type "block"
randomize_random_p = false
;; specify the number of trials you wish to run
max_n_tries = 50
;; specify a time greater than duration of run
min_report_period = 30000
Note that no current best classification reports will be produced. Only a final classification summary will be output.
Checkpointing is initiated by specifying "checkpoint_p = true" in the ".s-params" file. This causes the inner convergence step to save a copy of the classification onto the checkpoint file each time the classification is updated, provided a certain period of time has elapsed. The file extension is ".chkpt[-bin]".
Each time AutoClass completes a cycle, a "." is output to the screen to provide you with information to be used in setting the min_checkpoint_period value (default 10800 seconds or 3 hours). There is obviously a trade-off between frequency of checkpointing and the probability that your machine may crash, since the repetitive writing of the checkpoint file will slow the search process.
Restarting AutoClass Search:
To recover the classification and continue the search after rebooting and reloading AutoClass, specify reconverge_type = "chkpt" in the ".s-params" file (specify force_new_search_p as false).
AutoClass will reload the appropriate database and models, provided there has been no change in their filenames since the time they were loaded for the checkpointed classification run. The ".s-params" file contains any non-default arguments that were provided to the original call.
In the beginning of a search, before start_j_list has been emptied, it will be necessary to trim the original list to what would have remained in the crashed search. This can be determined by looking at the ".log" file to determine what values were already used. If the start_j_list has been emptied, then an empty start_j_list should be specified in the ".s-params" file. This is done either by
start_j_list =
or
start_j_list = -9999
Here is a set of scripts to demonstrate check-pointing:
autoclass -search data/glass/glassc.db2 data/glass/glass-3c.hd2 \
data/glass/glass-mnc.model data/glass/glassc-chkpt.s-params
Run 1)
## glassc-chkpt.s-params
max_n_tries = 2
force_new_search_p = true
## --------------------
;; run to completion
Run 2)
## glassc-chkpt.s-params
force_new_search_p = false
max_n_tries = 10
checkpoint_p = true
min_checkpoint_period = 2
## --------------------
;; after 1 checkpoint, ctrl-C to simulate cpu crash
Run 3)
## glassc-chkpt.s-params
force_new_search_p = false
max_n_tries = 1
checkpoint_p = true
min_checkpoint_period = 1
reconverge_type = "chkpt"
## --------------------
;; checkpointed trial should finish
The attribute influence values report attempts to provide relative measures of the "influence" of the data attributes on the classes found by the classification. The normalized class strengths, the normalized attribute influence values summed over all classes, and the individual influence values (I[jkl]) are all only relative measures and should be interpreted with more meaning than rank ordering, but not like anything approaching absolute values.
The reports are output to files whose names and pathnames are taken from the ".r-params" file pathname. The report file types (extensions) are:
or, if report_mode is overridden to "data":
where n is the classification number from the "results" file. The first or best classification is numbered 1, the next best 2, etc. The default is to generate reports only for the best classification in the "results" file. You can produce reports for other saved classifications by using report params keywords n_clsfs and clsf_n_list. The "influ-o-text-n" file type is the default (order_attributes_by_influence_p = true), and lists each class's attributes in descending order of attribute influence value. If the value of order_attributes_by_influence_p is overridden to be false in the <...>.r-params file, then each class's attributes will be listed in ascending order by attribute number. The extension of the file generated will be "influ-no-text-n". This method of listing facilitates the visual comparison of attribute values between classes.
For example, this command:
autoclass -reports sample/imports-85c.results-bin
sample/imports-85c.search sample/imports-85c.r-params
with this line in the ".r-params" file:
xref_class_report_att_list = 2, 5, 6
will generate these output files:
imports-85c.influ-o-text-1
imports-85c.case-text-1
imports-85c.class-text-1
The AutoClass C reports provide the capability to compute sigma class contour values for specified pairs of real valued attributes, when generating the influence values report with the data option (report_mode = "data"). Note that sigma class contours are not generated from discrete type attributes.
The sigma contours are the two dimensional equivalent of n-sigma error bars in one dimension. Specifically, for two independent attributes the n-sigma contour is defined as the ellipse where
((x - xMean) / xSigma)^2 + ((y - yMean) / ySigma)^2 == n
With covariant attributes, the n-sigma contours are defined identically, in the rotated coordinate system of the distribution's principal axes. Thus independent attributes give ellipses oriented parallel with the attribute axes, while the axes of sigma contours of covariant attributes are rotated about the center determined by the means. In either case the sigma contour represents a line where the class probability is constant, irrespective of any other class probabilities.
With three or more attributes the n-sigma contours become k-dimensional ellipsoidal surfaces. This code takes advantage of the fact that the parallel projection of an n-dimensional ellipsoid, onto any 2-dim plane, is bounded by an ellipse. In this simplified case of projecting the single sigma ellipsoid onto the coordinate planes, it is also true that the 2-dim covariances of this ellipse are equal to the corresponding elements of the n-dim ellipsoid's covariances. The Eigen-system of the 2-dim covariance then gives the variances w.r.t. the principal components of the ellipse, and the rotation that aligns it with the data. This represents the best way to display a distribution in the marginal plane.
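The Eigen-system of a 2-dim covariance has a closed form, which the following Python sketch works through. It is an independent illustration rather than the AutoClass report code; the function name and the returned tuple layout are hypothetical.

```python
import math


def ellipse_from_covariance(var_x, var_y, cov_xy):
    """Return (major_sigma, minor_sigma, angle) for a 2x2 covariance.

    The sigmas are square roots of the eigenvalues of
    [[var_x, cov_xy], [cov_xy, var_y]]; angle is the rotation of the
    major axis from the x axis, in radians.
    """
    mean_var = (var_x + var_y) / 2.0
    half_diff = (var_x - var_y) / 2.0
    radius = math.hypot(half_diff, cov_xy)
    lam_major = mean_var + radius
    lam_minor = max(mean_var - radius, 0.0)   # guard against round-off
    angle = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    return math.sqrt(lam_major), math.sqrt(lam_minor), angle


if __name__ == "__main__":
    # Independent attributes: axes stay parallel to the attribute axes.
    print(ellipse_from_covariance(4.0, 1.0, 0.0))
    # Covariant attributes: the ellipse is rotated about the means.
    print(ellipse_from_covariance(2.0, 2.0, 1.0))
```

For independent attributes the covariance term is zero and the angle is zero, which matches the statement that such ellipses are oriented parallel with the attribute axes.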
To get contour values, set the keyword sigma_contours_att_list to a list of real valued attribute indices (from .hd2 file), and request an influence values report with the data option. For example,
report_mode = "data"
sigma_contours_att_list = 3, 4, 5, 8, 15
n_clsfs 1
n_clsfs = 1
n_clsfs<tab>1
Spaces are ignored if "=" or "<tab>" are used as separators. Note there are no trailing semicolons.
The following are the allowed parameters and their default values:
clsf_n_list = 1, 2
will produce the same output as
n_clsfs = 2
but
clsf_n_list = 2
will only output the "second best" classification report.
The first part of this report gives the heuristic class "strengths". Class "strength" is here defined as the geometric mean probability that any instance "belonging to" class, would have been generated from the class probability model. It thus provides a heuristic measure of how strongly each class predicts "its" instances.
The second part is a listing of the overall "influence" of each of the attributes used in the classification. These give a rough heuristic measure of the relative importance of each attribute in the classification. Attribute "influence values" are a class probability weighted average of the "influence" of each attribute in the classes, as described below.
The next part of the report is a summary description of each of the classes. The classes are arbitrarily numbered from 0 up to n, in order of descending class weight. A class weight of, say, 34.1 means that the weighted sum of membership probabilities for the class is 34.1. Note that a class weight of 34 does not necessarily mean that 34 cases belong to that class, since many cases may have only partial membership in that class. Within each class, attributes or attribute sets are ordered by the "influence" of their model term.
We define the modeled term's "influence" on a class to be the cross entropy term for the class distribution w.r.t. the global class distribution of the single class classification. "Influence" is thus a measure of how strongly the model term helps differentiate the class from the whole data set. With independently modeled attributes, the influence can legitimately be ascribed to the attribute itself. With correlated or covariant attributes sets, the cross entropy factor is a function of the entire set, and we distribute the influence value equally over the modeled attributes.
The .case and .class reports give probabilities that cases are members of classes. Any assignment of cases to classes requires some decision rule. The maximum probability assignment rule is often implicitly assumed, but it cannot be expected that the resulting partition sizes will equal the class weights unless nearly all class membership probabilities are effectively one or zero. With non-1/0 membership probabilities, matching the class weights requires summing the probabilities.
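The difference between class weights and maximum probability partition sizes is easy to see on a toy example. The Python sketch below uses made-up membership probabilities for four cases and two classes; the function names are hypothetical and nothing here reads actual AutoClass report files.

```python
def class_weights(memberships):
    """Sum each class's membership probabilities over all cases."""
    n_classes = len(memberships[0])
    return [sum(row[j] for row in memberships) for j in range(n_classes)]


def max_probability_counts(memberships):
    """Count cases per class under the maximum probability rule."""
    counts = [0] * len(memberships[0])
    for row in memberships:
        counts[row.index(max(row))] += 1
    return counts


if __name__ == "__main__":
    # Invented probabilities: one row per case, one column per class.
    memberships = [
        [0.90, 0.10],
        [0.60, 0.40],
        [0.55, 0.45],
        [0.20, 0.80],
    ]
    print(class_weights(memberships))           # close to [2.25, 1.75]
    print(max_probability_counts(memberships))  # [3, 1]
```

Three cases are assigned to the first class by the maximum probability rule, yet its weight is only about 2.25, because two of those cases have substantial partial membership in the second class.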
In addition, there is the question of completeness of the EM (expectation maximization) convergence. EM alternates between estimating class parameters and estimating class membership probabilities. These estimates converge on each other, but never actually meet. AutoClass implements several convergence algorithms with alternate stopping criteria using appropriate parameters in the .s-params file. Proper setting of these parameters, to get reasonably complete and efficient convergence may require experimentation.
This technique for predicting class probabilities is applicable to all attributes, regardless of data type/sub_type or likelihood model term type.
If the class membership of a data case does not exceed 0.0099999 for any of the "training" classes, the following message will appear in the screen output for each such case:
xref_get_data: case_num xxx => class 9999
Class 9999 members will appear in the "case" and "class" cross-reference reports with a class membership of 1.0.
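The rule above can be sketched as follows. The function and constant names are hypothetical, and returning the memberships unchanged for all other cases is an assumption made for illustration; only the threshold, the class number 9999, and the membership of 1.0 come from the description above.

```python
UNASSIGNED_CLASS = 9999      # class number reported for such cases
MIN_MEMBERSHIP = 0.0099999   # threshold quoted in the text above

def report_memberships(memberships):
    """Return (class, probability) pairs for one test case."""
    if max(memberships) <= MIN_MEMBERSHIP:
        # No training class exceeds the threshold: report class 9999 at 1.0.
        return [(UNASSIGNED_CLASS, 1.0)]
    return list(enumerate(memberships))

# Hypothetical membership values for two test cases.
weak = report_memberships([0.004, 0.003])
normal = report_memberships([0.7, 0.3])
```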
Cautionary Points:
The usual way of using AutoClass is to put all of your data in a data_file, describe that data with model and header files, and run "autoclass -search". For prediction, instead of one data_file you will have two: a training_data_file and a test_data_file.
It is most important that both databases have the same AutoClass internal representation. Should this not be true, AutoClass will exit or, in some situations, possibly crash. The prediction mode is designed to steer the user toward conforming to this requirement.
Preparation:
Prediction requires a training classification and a test database. The training classification is generated by running "autoclass -search" on the training data_file ("data/soybean/soyc.db2"), for example:
autoclass -search data/soybean/soyc.db2 data/soybean/soyc.hd2
data/soybean/soyc.model data/soybean/soyc.s-params
This will produce "soyc.results-bin" and "soyc.search". Then create a "reports" parameter file, such as "soyc.r-params" (see /usr/share/doc/autoclass/reports-c.text), and run AutoClass in "reports" mode, such as:
autoclass -reports data/soybean/soyc.results-bin
data/soybean/soyc.search data/soybean/soyc.r-params
This will generate class and case cross-reference files, and an influence values file. The file names are based on the ".r-params" file name:
data/soybean/soyc.class-text-1
data/soybean/soyc.case-text-1
data/soybean/soyc.influ-text-1
These will describe the classes found in the training_data_file. Now this classification can be used to predict the probabilistic class membership of the test_data_file cases ("data/soybean/soyc-predict.db2") in the training_data_file classes.
autoclass -predict data/soybean/soyc-predict.db2
data/soybean/soyc.results-bin data/soybean/soyc.search
data/soybean/soyc.r-params
This will generate class and case cross-reference files for the test_data_file cases predicting their probabilistic class memberships in the training_data_file classes. The file names are based on the ".db2" file name:
data/soybean/soyc-predict.class-text-1
data/soybean/soyc-predict.case-text-1
/usr/share/doc/autoclass/introduction-c.text Guide to the documentation
/usr/share/doc/autoclass/preparation-c.text How to prepare data for use by AutoClass
/usr/share/doc/autoclass/search-c.text How to run AutoClass to find classifications.
/usr/share/doc/autoclass/reports-c.text How to examine the classification in various ways.
/usr/share/doc/autoclass/interpretation-c.text How to interpret AutoClass results.
/usr/share/doc/autoclass/checkpoint-c.text Protocols for running a checkpointed search.
/usr/share/doc/autoclass/prediction-c.text Use classifications to predict class membership for new cases.
These provide supporting documentation:
/usr/share/doc/autoclass/classes-c.text What classification is all about, for beginners.
/usr/share/doc/autoclass/models-c.text Brief descriptions of the model term implementations.
The mathematical theory behind AutoClass is explained in these documents:
/usr/share/doc/autoclass/kdd-95.ps Postscript file containing: P. Cheeseman, J. Stutz, "Bayesian Classification (AutoClass): Theory and Results", in "Advances in Knowledge Discovery and Data Mining", Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, & Ramasamy Uthurusamy, Eds. The AAAI Press, Menlo Park, expected fall 1995.
/usr/share/doc/autoclass/tr-fia-90-12-7-01.ps Postscript file containing: R. Hanson, J. Stutz, P. Cheeseman, "Bayesian Classification Theory", Technical Report FIA-90-12-7-01, NASA Ames Research Center, Artificial Intelligence Branch, May 1991 (The figures are not included, since they were inserted by "cut-and-paste" methods into the original "camera-ready" copy.)
Dr. Peter Cheeseman, Principal Investigator - NASA Ames, Computational Sciences Division, cheesem@ptolemy.arc.nasa.gov
John Stutz, Research Programmer - NASA Ames, Computational Sciences Division, stutz@ptolemy.arc.nasa.gov
Will Taylor, Support Programmer - NASA Ames, Computational Sciences Division, taylor@ptolemy.arc.nasa.gov