Section: User Commands (1)
Updated: December 10, 2001
NAME
multimix, multimix-prep - automatically discover classes in data
SYNOPSIS
multimix
multimix-prep
DESCRIPTION
multimix fits a mixture of multivariate distributions to a set
of observations using the EM algorithm. The data file may contain both
categorical and continuous variables.
multimix prompts for the names of the data and parameter files.
The assignment of the observations to groups and the posterior
probabilities are written to GROUPS.OUT. Parameter estimates,
convergence information, and group assignment probabilities are
written to GENERAL.OUT.
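As a rough illustration of what the posterior probabilities in GROUPS.OUT represent: in the usual finite-mixture formulation, the posterior probability that an observation belongs to group K is proportional to the mixing proportion of group K times the density of the observation under that group's model, and an observation is commonly assigned to the group with the largest posterior. The Python sketch below assumes the per-group density values have already been computed. The function names and the numbers are hypothetical, and this is not multimix's own code.

```python
def posterior_probabilities(pi, densities):
    """Return the posterior group probabilities for one observation.

    pi:        mixing proportions, one per group (assumed to sum to 1)
    densities: density of the observation under each group's model
    """
    weighted = [p * f for p, f in zip(pi, densities)]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("observation has zero density under every group")
    return [w / total for w in weighted]


def assign_group(posteriors):
    """Return the 1-based index of the group with the largest posterior."""
    return max(range(len(posteriors)), key=lambda k: posteriors[k]) + 1


# Hypothetical two-group example with equal mixing proportions.
post = posterior_probabilities([0.5, 0.5], [0.2, 0.6])
group = assign_group(post)
```

Here the second group has three times the density of the first, so the observation would be assigned to group 2.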
If multimix does not converge after ITER=200 iterations,
the estimates of the parameters will be written to
EMPARAMEST.OUT. This file can then be used as the parameter
input file for multimix if desired.
multimix is limited to a maximum of
1500 observations (IOB=1500)
6 groups (IK6=6)
15 attributes and partition cells (IP15=15)
10 levels of categories (IM10=10)
200 iterations to convergence (ITER=200)
Recompilation is required to change these parameters.
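Before running multimix it can be useful to check a problem against these compiled-in limits. The Python sketch below is a hypothetical helper, not part of multimix. It assumes that the limits apply to the parameter-file values NOBS, NG, NVAR, NPAR, and the largest entry of NCAT.

```python
# Compiled-in limits quoted in the list above.
LIMITS = {
    "observations": 1500,    # IOB
    "groups": 6,             # IK6
    "attributes": 15,        # IP15
    "partition cells": 15,   # IP15
    "category levels": 10,   # IM10
}


def exceeded_limits(nobs, ng, nvar, npar, ncat):
    """Return the names of all limits that the given problem exceeds.

    ncat is the list of category counts, one per attribute
    (0 for continuous attributes).
    """
    sizes = {
        "observations": nobs,
        "groups": ng,
        "attributes": nvar,
        "partition cells": npar,
        "category levels": max(ncat, default=0),
    }
    return [name for name, size in sizes.items() if size > LIMITS[name]]
```

An empty result means the problem fits. Any name returned points at a limit that would need to be raised by recompiling.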
DATA FILE
The data file has one line for each observation. Each line has one
entry for each variable. Only the first NVAR entries on each
line are read.
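The rule that only the first NVAR entries of each line are read can be sketched in Python as follows. This is a hypothetical reader for checking a data file before a run, not multimix's own input routine. It assumes whitespace-separated numeric entries, with categorical levels coded as numbers.

```python
def read_observations(lines, nvar):
    """Parse one observation per line, keeping only the first nvar entries."""
    rows = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < nvar:
            raise ValueError(
                f"line {lineno}: expected at least {nvar} entries, "
                f"found {len(fields)}"
            )
        # Entries beyond the first nvar are ignored.
        rows.append([float(x) for x in fields[:nvar]])
    return rows


# Hypothetical data: the fourth entry on the first line is not read.
sample = ["1 2.5 3 99", "2 0.5 1"]
observations = read_observations(sample, nvar=3)
```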
PARAMETER FILE
The parameter file contains free field values which describe the data
and the fitting models. multimix-prep will ask the user a
series of questions and write a suitable parameter file. If the
starting point for the fit is given by specifying initial group
assignments for the observations, then the user should prepare the
file of group assignments before starting multimix-prep. The
file format is simple: the Ith line of the file contains an
integer between 1 and NG giving the group number of the
Ith observation. (The experienced user finds it faster to edit
old parameter files into new ones.)
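A file of initial group assignments in the format just described (line I holds the group number of observation I) could be produced with a short Python sketch such as the one below. The function name is hypothetical, and the assignments themselves would come from the user's own starting guess.

```python
def format_group_assignments(groups, ng):
    """Return the file contents: one group number (1..ng) per line."""
    for i, g in enumerate(groups, start=1):
        if not 1 <= g <= ng:
            raise ValueError(
                f"observation {i}: group {g} is not between 1 and {ng}"
            )
    return "".join(f"{g}\n" for g in groups)


# Hypothetical starting assignment for three observations and NG=3.
contents = format_group_assignments([1, 3, 2], ng=3)
```

The returned string can then be written to a file of the user's choosing, and that file name given to multimix-prep when it asks for the group assignments.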
multimix requires variables in a partition to be stored
contiguously. Hence the data is read in with the variable order being
specified by JP(J). IVARTYPE(J) and NCAT(J) both refer
to the rearranged data.
The first five values are
NG
The number of groups (distributions) in the finite mixture to be fitted.
NOBS
The number of observations.
NVAR
The number of attributes.
NPAR
The number of partition cells (sets of attributes associated within
each distribution).
ISPEC
Flag indicating how the starting point is specified for the fit:
1 Initial parameter estimates are specified.
2 Observations are assigned to groups.
Next come eight arrays of data:
JP
JP(J)
is the column of the data array into which the
Jth attribute of the data file will be stored, where J
varies from 1 to NVAR. For example, suppose we want the third
attribute in the first column, attribute 4 in the second column,
attribute 7 in the 3rd column, and then attributes 1, 2, 5, and 6.
Then JP(J) = 4 5 1 2 6 7 3, for J=1,...,7.
IP
IP(L)
is the number of attributes in the Lth
partition cell, L=1,...,NPAR.
IPC
IPC(L)
is the number of continuous attributes in the
Lth partition cell.
ISV
ISV(L)
gives the index J of the start of partition
cell L. E.g. if attributes 6, 7, and 8 are in the same
partition cell L, then ISV(L)=6 and IEV(L)=8.
IEV
IEV(L)
gives the index J of the end of partition cell L.
IPARTYPE
IPARTYPE(L)
is an indicator giving the type of model for partition L:
1 for a categorical model.
2 for a multivariate normal model.
3 for a location model.
IVARTYPE
IVARTYPE(J)
is an indicator for the type of attribute
J:
1 for a categorical attribute.
2 for a multivariate normal attribute.
3 for a categorical attribute in a location model.
4 for a multivariate normal attribute in a location model.
NCAT
NCAT(J)
is the number of categories for the Jth categorical attribute.
For continuous attributes, NCAT(J) should be 0.
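The index arrays above can be checked with a small Python sketch. The first function applies JP to one line of the data file. The second derives ISV and IEV from IP, assuming the partition cells occupy consecutive columns in order with no gaps, which is suggested by the contiguity requirement but is still an assumption. The function names, the labels a1 to a7, and the IP values are hypothetical.

```python
def rearrange(row, jp):
    """Store the Jth entry of a data-file row in column JP(J) (1-based)."""
    out = [None] * len(jp)
    for j, col in enumerate(jp):
        out[col - 1] = row[j]
    return out


def cell_bounds(ip):
    """Derive ISV and IEV (1-based, inclusive) from the cell sizes IP."""
    isv, iev = [], []
    start = 1
    for size in ip:
        isv.append(start)
        iev.append(start + size - 1)
        start += size
    return isv, iev


# The JP example from the text: attributes 3, 4, 7 first, then 1, 2, 5, 6.
row = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]
stored = rearrange(row, [4, 5, 1, 2, 6, 7, 3])
# stored == ["a3", "a4", "a7", "a1", "a2", "a5", "a6"]

# Hypothetical cell sizes: the third cell covers columns 6 to 8.
isv, iev = cell_bounds([3, 2, 3])
# isv == [1, 4, 6], iev == [3, 5, 8]
```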
If observations are assigned to groups (ISPEC=2), then those
assignments are next:
IGRP
IGRP(I)
is the index of the group that observation I
is in.
If observations are not assigned to groups (ISPEC=1), then
estimates of the parameters are next:
PI
PI(K)
is the estimated mixing proportion for group K
(K=1,...,NG).
The parameters for each group depend on the type of attribute:
THETA
THETA(K,J,M)
is the estimated probability that the Jth categorical attribute
is at level M, given that the observation is in group K. Repeat for
each attribute, J=ISV(L),...,IEV(L).
categorical attributes only
EMU
EMU(K,L,J)
is the estimated mean vector for group K, partition cell L
and attribute J.
multivariate normal model only
THETA
THETA(K,J,M)
is the estimated probability that the Jth categorical attribute
in the location model is at level M, given that the observation is
in group K.
categorical attributes only
EMUL
EMUL(K,L,J,M)
is the estimated mean vector for group K, partition cell L
and attribute J, at the Mth level of the categorical
attribute in the location model.
multivariate normal model only
VARIX
((VARIX(K,L,I,J),J=1,IPC(L)), I=1,IPC(L))
An entry in VARIX is the estimated covariance between attributes
I and J for group K, partition cell L, where
I=1,...,IPC(L), and J=1,...,IPC(L).
The required parameters are read in for each partition cell,
L=1,...,NPAR.
For example, if the attributes within the partition cell are all
categorical, that is, IPARTYPE(L)=1, then THETA(K,J,M), for
M=1,...,NCAT(J), is required for each attribute in that partition cell.
If the attributes within the partition cell are continuous,
multivariate normal attributes, that is, IPARTYPE(L)=2, then estimates
of EMU(K,L,J) are required for each attribute.
If the attributes within the partition cell follow the location model,
that is, IPARTYPE(L)=3, then THETA(K,J,M), M=1,...,NCAT(J), is required
for the categorical attribute, and EMUL(K,L,J,M), M=1,...,IM(L), is
required for each continuous multivariate normal attribute. (Note that
IM(L) is the number of categories of the categorical attribute
associated with the location model.)
The estimates are read in first for group 1, then for group 2, etc.
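When preparing initial estimates by hand, a checklist of which arrays each partition cell needs can help. The Python sketch below lists them group by group, using the model-type codes given for IPARTYPE(L). It is a hypothetical aid, not a specification of the file layout: it assumes the cells are visited in order within each group, and it leaves out PI and the VARIX covariance entries, whose position is not covered by the rules above.

```python
# Arrays required for each partition-cell model type.
REQUIRED = {
    1: ["THETA"],           # categorical model
    2: ["EMU"],             # multivariate normal model
    3: ["THETA", "EMUL"],   # location model
}


def required_estimates(ng, ipartype):
    """List (group, cell, array name) triples, group 1 first, then group 2, ..."""
    order = []
    for k in range(1, ng + 1):
        for cell, cell_type in enumerate(ipartype, start=1):
            for name in REQUIRED[cell_type]:
                order.append((k, cell, name))
    return order


# Hypothetical model: two groups, a categorical cell and a location-model cell.
checklist = required_estimates(2, [1, 3])
```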
EXAMPLES
See /usr/share/doc/multimix/examples.
FILES
GROUPS.OUT
multimix output: the assignment of the observations to groups and the
posterior probabilities. If observations were initially assigned to
groups (ISPEC=2), the final assignments may differ from the initial
ones. Some are likely to differ if the fitted distributions overlap.
GENERAL.OUT
multimix output: parameter estimates, convergence information, and
group assignment probabilities.
EMPARAMEST.OUT
multimix output on failure to converge: current parameter estimates.
This file can then be used as the parameter input file for multimix
if desired.