The next few sections explain in some depth the notion of statistical models, and especially that of (statistical) appearance models. They then move on to describe active appearance models, which are an extension of active shape models, so a brief introduction to shape models is worthwhile to begin with.
Given a collection of images depicting an object which possesses some
innate properties, it is possible to express the visual appearance
or shape of that object in a way that discards subtle changes in view-point,
object position, object size et cetera, and that is robust to some level
of object deformation. The object appearing in the group of images
need not even be the exact same one; it can be any object belonging
to one common class. Some variation that is typical of that
class can be handled (essentially understood) reliably with the
help of elementary transformations (to be described in ),
but the functionality of such transformations is inevitably very limited and constrained.
There are statistical methods which allow the encoding of the variability that was learned during a so-called training process. That training process requires little more than an exhaustive inspection of the set of images in which the objects (or shapes) appear. However, in order to interpret a large set of objects, some simplification steps are required. This results from the fact that most images in which objects lie are expected to be of relatively large scale in practice - certainly large enough to result in an exponential blow-up.
A method is sought which reduces the amount of information required
to describe an object of interest and the different forms
it can take. This is done by selecting points of interest which lie
in the image - ones which form a representative sub-set of the
image contents. Points must be picked so that they jointly preserve knowledge regarding
the object of interest, an object which is often well-hidden in that pool
of image pixels. Such points are often chosen to become what are termed
landmarks. Landmarks are positions in the image which effectively
distinguish one object from another in the set of images (see Figure
cap:Landmark-identification). They also have some interesting
spatial traits: they can form near-optimal curves (or contours) which
together make up genuine shapes. The concatenation of the coordinates
of these landmarks can then describe an image (or rather the object
being focused on) in a concise and useful representation. In 2-D,
for $n$ landmarks, a vector of size $2n$ can infer the shape of
the object present in an image. This lossy inference can be described
as follows:
\begin{displaymath}
\mathbf{x}=(x_{1},y_{1},x_{2},y_{2},\ldots,x_{n},y_{n})^{T},
\end{displaymath}
where $\mathbf{x}$ is simply a discrete reconstruction of the shape
in the image. It is not the actual image.
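As a minimal sketch of this representation, the following packs a set of 2-D landmark coordinates into a single shape vector (the landmark values are hypothetical, chosen purely for illustration):

```python
import numpy as np

# Hypothetical 2-D landmarks (x, y) annotated on one image; n = 3 points.
landmarks = np.array([[10.0, 20.0],
                      [15.0, 22.0],
                      [18.0, 30.0]])

# Concatenate the coordinates into a single shape vector of size 2n,
# ordered (x1, y1, x2, y2, ..., xn, yn).
x = landmarks.flatten()

assert x.shape == (2 * len(landmarks),)
```

The vector discards the image pixels entirely; only the sampled positions survive, which is exactly why the inference is lossy.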
It is worth pointing out that landmark points can be chosen arbitrarily. This turns out to be a serious issue, as will be seen later along with possible solutions. Identification of objects is in most cases done by drawing lines or selecting surfaces which surround these objects. Given continuous elements such as lines or surfaces, it is by no means obvious how to sample them suitably using points. The choice of points affects the quality of reconstruction as measured by the assigned errors.
With the concise landmark-based representation (described above in
) set to be the convention, and a collection of
fair-sized vectors rather than a massive collection of images, it
should be possible to express (in a feasible way) the legal range of each one of the vector components. This in essence establishes
the model. It is an entity that can be manipulated to reconstruct
all the shapes (or, as later explained, images) it originated from,
and far beyond that. The model encapsulates the variation which was
learned from the data, and its performance usually improves as more
legal examples are 'fed' to it to support further training.
Varying the parameters of the model can generate new (unseen) examples,
as long as that variation is restricted to the legal range
learned from the training examples. The vector representation mentioned
beforehand can also be viewed as describing a fixed location
in a space of $2n$ dimensions (see the illustrative scatter
in Figure ). This turns out to be a useful demonstrative
idea, as will be seen later when dimensionality reduction is applied.
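That dimensionality-reduction idea can be sketched as follows: each shape vector is one point in $2n$-dimensional space, and Principal Component Analysis finds the few directions along which the training points actually vary. The data and the number of retained modes below are synthetic, chosen only to make the sketch runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: 50 shape vectors, each a point
# in 2n-dimensional space (here n = 4, so 8 dimensions).
shapes = rng.normal(size=(50, 8))

mean_shape = shapes.mean(axis=0)
centred = shapes - mean_shape

# Eigen-decomposition of the covariance matrix yields the principal
# modes of variation learned from the training examples.
cov = centred.T @ centred / (len(shapes) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]   # largest-variance modes first
P = eigvecs[:, order[:3]]           # keep t = 3 modes

# Any shape is then approximated as mean + P b; constraining each b_i
# (e.g. to within 3 standard deviations of its mode) keeps generated
# shapes inside the legal range learned from the training set.
b = P.T @ centred[0]
reconstruction = mean_shape + P @ b
```

Varying `b` within those limits is precisely what generates new, unseen yet plausible examples.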
Shape models are ``statistical information containers'' which can be built from the images once landmark points have been overlaid, identified and recorded. In order to make such a mechanism possible, it is vital to first achieve consistency amongst the coordinates of all landmarks. This means that all points need to be projected onto a common space - a process whose purpose is to ease collective analysis. That process can also be thought of as an alignment step, which links to the next chapter. Further issues concerning normalisation, projection and the like are described in slightly more detail later in this document.
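The kind of projection onto a common space meant here can be sketched as below, removing translation and scale from each landmark set before any collective analysis. This is a simplified stand-in for a full Procrustes alignment (rotation is ignored), and the function name and values are illustrative:

```python
import numpy as np

def normalise(landmarks):
    """Map an (n, 2) landmark array into a common space:
    centre it on the origin and scale it to unit RMS size."""
    centred = landmarks - landmarks.mean(axis=0)
    scale = np.sqrt((centred ** 2).sum() / len(centred))
    return centred / scale

a = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 4.0]])
b = a * 2.5 + 7.0   # the same shape at a different position and size

# After normalisation the two landmark sets coincide, so their
# coordinates can be analysed jointly.
assert np.allclose(normalise(a), normalise(b))
```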
A human expert usually performs annotation or landmarking of the images
with the aid of some computerised special-purpose tools. In recent
years, automatic alternatives have shown great promise []
and these extend to 3-D too []. The later
chapter on page is dedicated purely to that
one piece of work, which is so fundamental to the present research.
Appearance models were later developed by Edwards et al. [,]. Their greatest advantage, and indeed their essence, was the ability to sample grey-level data from images rather than just points (incorporation of full colour has since been made possible, e.g. Stegmann et al. [], [WWW-5]). Appearance models therefore retain information about what an image looks like, rather than just its form as visualised by contours (or surfaces in 3-D). Just as points in the image were chosen earlier, grey-level values (also referred to as intensity or texture) can be systematically extracted from a normalised image and preserved in an intensity vector for later analysis. This normalisation process and the representation of the intensity vector are outlined later in this chapter.
What enables appearance models to exhibit quite an astonishing graphical
resemblance to reality is that at the later stages of the process
a combined vector is made available. It incorporates both
shape and intensity while accounting for how change in one affects
the other (e.g. how expansion results in darkening, and vice versa).
Hence it has a notion of the correlation between the two -
a notion that depends on the training data and Principal Component
Analysis. Although appearance models are usually not as quick and
accurate as shape models, they contain all the information that is held in the shape models
and in that sense are a superset of shape models. Also, some techniques have been developed and employed
to speed up the matching of appearance models to image targets (see
later in Section and Appendix
).
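A sketch of how such a combined vector might be formed is given below. A weighting factor balances the different units of shape and intensity so that neither dominates; the parameter values and the weight are illustrative only, not taken from the cited work:

```python
import numpy as np

# Shape parameters and intensity (texture) parameters for one example,
# e.g. as obtained from two separate PCA models (values are made up).
b_shape = np.array([1.2, -0.4])
b_texture = np.array([0.3, 0.9, -0.1])

# W weighs shape against intensity; a common choice derives it from
# the ratio of the variances of the two parameter sets.
W = 2.0
combined = np.concatenate([W * b_shape, b_texture])

# A further PCA over such combined vectors (across the whole training
# set) captures the correlation between shape change and intensity
# change - e.g. how expansion co-occurs with darkening.
assert combined.shape == (5,)
```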
Tasks such as the matching of an appearance model to some target image
are described later in this chapter and illustrated in [].