Version: | 1.5.0 |
Date: | 2025-02-27 |
Encoding: | UTF-8 |
Title: | Flexible Cluster Algorithms |
Depends: | R (≥ 2.14.0) |
Imports: | graphics, grid, lattice, methods, modeltools, parallel, stats, stats4, class |
Suggests: | ellipse, clue, cluster, seriation, skmeans |
Description: | The main function kcca implements a general framework for k-centroids cluster analysis supporting arbitrary distance measures and centroid computation. Further cluster methods include hard competitive learning, neural gas, and QT clustering. There are numerous visualization methods for cluster results (neighborhood graphs, convex cluster hulls, barcharts of centroids, ...), and bootstrap methods for the analysis of cluster stability. |
License: | GPL-2 |
LazyLoad: | yes |
NeedsCompilation: | yes |
Packaged: | 2025-02-27 21:19:24 UTC; gruen |
Author: | Friedrich Leisch |
Maintainer: | Bettina Grün <Bettina.Gruen@R-project.org> |
Repository: | CRAN |
Date/Publication: | 2025-02-28 06:40:02 UTC |
Artificial Example with 4 Gaussians
Description
A simple artificial regression example with 4 clusters, all of them having a Gaussian distribution.
Usage
data(Nclus)
Details
The Nclus
data set can be re-created by loading package
flexmix and running ExNclus(100)
using set.seed(2602)
. It has been saved as a data set for
simplicity of examples only.
Examples
data(Nclus)
cl <- cclust(Nclus, k=4, simple=FALSE, save.data=TRUE)
plot(cl)
Achievement Test Scores for New Haven Schools
Description
Measurements at the beginning of the 4th grade (when the national average is 4.0) and of the 6th grade in 25 schools in New Haven.
Usage
data(achieve)
Format
A data frame with 25 observations on the following 4 variables.
read4
4th grade reading.
arith4
4th grade arithmetic.
read6
6th grade reading.
arith6
6th grade arithmetic.
References
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
Automobile Customer Survey Data
Description
A German manufacturer of premium cars asked customers approximately 3 months after a car purchase which characteristics of the car were most important for the decision to buy the car. The survey was done in 1983 and the data set contains all responses without missing values.
Usage
data(auto)
Format
A data frame with 793 observations on the following 46 variables.
model
A factor with levels
A
,B
,C
, orD
; model bought by the customer.gear
A factor with levels
4 gears
,5 econo
,5 sport
, orautomatic
.leasing
A logical vector, was leasing used to finance the car?
usage
A factor with levels
private
,both
,business
.previous_model
A factor describing which type of car was owned directly before the purchase.
other_consider
A factor with levels
same manuf
,other manuf
,both
, ornone
.test_drive
A logical vector, did you do a test drive?
info_adv
A logical vector, was advertising an important source of information?
info_exp
A logical vector, was experience an important source of information?
info_rec
A logical vector, were recommendations an important source of information?
ch_clarity
A logical vector.
ch_economy
A logical vector.
ch_driving_properties
A logical vector.
ch_service
A logical vector.
ch_interior
A logical vector.
ch_quality
A logical vector.
ch_technology
A logical vector.
ch_model_continuity
A logical vector.
ch_comfort
A logical vector.
ch_reliability
A logical vector.
ch_handling
A logical vector.
ch_reputation
A logical vector.
ch_concept
A logical vector.
ch_character
A logical vector.
ch_power
A logical vector.
ch_resale_value
A logical vector.
ch_styling
A logical vector.
ch_safety
A logical vector.
ch_sporty
A logical vector.
ch_consumption
A logical vector.
ch_space
A logical vector.
satisfaction
A numeric vector describing overall satisfaction (1=very good, 10=very bad).
good1
Conception, styling, dimensions.
good2
Auto body.
good3
Driving and coupled axles.
good4
Engine.
good5
Electronics.
good6
Financing and customer service.
good7
Other.
sporty
What do you think about the balance of sportiness and comfort? (
good
,more sport
,more comfort
).drive_char
Driving characteristis (
gentle
<speedy
<powerfull
<extreme
).tempo
Which average speed do you prefer on German Autobahn in km/h? (
< 130
<130-150
<150-180
<> 180
)consumption
An ordered factor with levels
low
<ok
<high
<too high
.gender
A factor with levels
male
andfemale
occupation
A factor with levels
self-employed
,freelance
, andemployee
.household
Size of household, an ordered factor with levels
1-2
<>=3
.
Source
The original German data are in the public domain and available from LMU Munich (doi:10.5282/ubm/data.14). The variable names and help page were translated to English and converted into Rd format by Friedrich Leisch.
References
Open Data LMU (1983): Umfrage unter Kunden einer Automobilfirma, doi:10.5282/ubm/data.14
Examples
data(auto)
summary(auto)
Barplot/chart Methods in Package ‘flexclust’
Description
Barplot of cluster centers or other cluster statistics.
Usage
## S4 method for signature 'kcca'
barplot(height, bycluster = TRUE, oneplot = TRUE,
data = NULL, FUN = colMeans, main = deparse(substitute(height)),
which = 1:height@k, names.arg = NULL,
oma = par("oma"), col = NULL, mcol = "darkred", srt = 45, ...)
## S4 method for signature 'kcca'
barchart(x, data, xlab="", strip.labels=NULL,
strip.prefix="Cluster ", col=NULL, mcol="darkred", mlcol=mcol,
which=NULL, legend=FALSE, shade=FALSE, diff=NULL, byvar=FALSE,
clusters=1:x@k, ...)
## S4 method for signature 'hclust'
barchart(x, data, xlab="", strip.labels=NULL,
strip.prefix="Cluster ", col=NULL, mcol="darkred", mlcol=mcol,
which=NULL, shade=FALSE, diff=NULL, byvar=FALSE, k=2, ...)
## S4 method for signature 'bclust'
barchart(x, data, xlab="", strip.labels=NULL,
strip.prefix="Cluster ", col=NULL, mcol="darkred", mlcol=mcol,
which=NULL, legend=FALSE, shade=FALSE, diff=NULL, byvar=FALSE,
k=x@k, clusters=1:k, ...)
Arguments
height , x |
An object of class |
bycluster |
If |
oneplot |
If |
data |
If not |
FUN |
The function to be applied to each cluster for calculating
the bar heights. Only used, if |
which |
For |
names.arg |
A vector of names to be plotted below each bar. |
main , oma , xlab , ... |
Graphical parameters. |
col |
Vector of colors for the clusters. |
mcol , mlcol |
If not |
srt |
Number between 0 and 90, rotation of the x-axis labels. |
strip.labels |
Vector of strings for the strips of the Trellis display. |
strip.prefix |
Prefix string for the strips of the Trellis display. |
legend |
If |
shade |
If |
diff |
A numerical vector of length two with absolute and
relative deviations for shading, default is |
byvar |
If |
clusters |
Integer vector of clusters to plot. |
k |
Integer specifying the desired number of clusters. |
Note
The flexclust barchart method uses a horizontal arrangements of the bars, and sorts them from top to bottom. Default barcharts in lattice are the other way round (bottom to top). See the examples below how this affects, e.g., manual labels for the y axis.
The barplot
method is legacy code and only maintained to keep up
with changes in R, all active development is done on barchart
.
Author(s)
Friedrich Leisch
References
Sara Dolnicar and Friedrich Leisch. Using graphical statistics to better understand market segmentation solutions. International Journal of Market Research, 56(2), 97-120, 2014.
Examples
cl <- cclust(iris[,-5], k=3)
barplot(cl)
barplot(cl, bycluster=FALSE)
## plot the maximum instead of mean value per cluster:
barplot(cl, bycluster=FALSE, data=iris[,-5],
FUN=function(x) apply(x,2,max))
## use lattice for plotting:
barchart(cl)
## automatic abbreviation of labels
barchart(cl, scales=list(abbreviate=TRUE))
## origin of bars at zero
barchart(cl, scales=list(abbreviate=TRUE), origin=0)
## Use manual labels. Note that the flexclust barchart orders bars
## from top to bottom (the default does it the other way round), hence
## we have to rev() the labels:
LAB <- c("SL", "SW", "PL", "PW")
barchart(cl, scales=list(y=list(labels=rev(LAB))), origin=0)
## deviation of each cluster center from the population means
barchart(cl, origin=rev(cl@xcent), mlcol=NULL)
## use shading to highlight large deviations from population mean
barchart(cl, shade=TRUE)
## use smaller deviation limit than default and add a legend
barchart(cl, shade=TRUE, diff=0.2, legend=TRUE)
Bagged Clustering
Description
Cluster the data in x
using the bagged clustering
algorithm. A partitioning cluster algorithm such as
cclust
is run repeatedly on bootstrap samples from the
original data. The resulting cluster centers are then combined using
the hierarchical cluster algorithm hclust
.
Usage
bclust(x, k = 2, base.iter = 10, base.k = 20, minsize = 0,
dist.method = "euclidian", hclust.method = "average",
FUN = "cclust", verbose = TRUE, final.cclust = FALSE,
resample = TRUE, weights = NULL, maxcluster = base.k, ...)
## S4 method for signature 'bclust,missing'
plot(x, y, maxcluster = x@maxcluster, main = "", ...)
## S4 method for signature 'bclust,missing'
clusters(object, newdata, k, ...)
## S4 method for signature 'bclust'
parameters(object, k)
Arguments
x |
Matrix of inputs (or object of class |
k |
Number of clusters. |
base.iter |
Number of runs of the base cluster algorithm. |
base.k |
Number of centers used in each repetition of the base method. |
minsize |
Minimum number of points in a base cluster. |
dist.method |
Distance method used for the hierarchical
clustering, see |
hclust.method |
Linkage method used for the hierarchical
clustering, see |
FUN |
Partitioning cluster method used as base algorithm. |
verbose |
Output status messages. |
final.cclust |
If |
resample |
Logical, if |
weights |
Vector of length |
maxcluster |
Maximum number of clusters memberships are to be computed for. |
object |
Object of class |
main |
Main title of the plot. |
... |
Optional arguments top be passed to the base method
in |
y |
Missing. |
newdata |
An optional data matrix with the same number of columns as the cluster centers. If omitted, the fitted values are used. |
Details
First, base.iter
bootstrap samples of the original data in
x
are created by drawing with replacement. The base cluster
method is run on each of these samples with base.k
centers. The base.method
must be the name of a partitioning
cluster function returning an object with the same slots as the
return value of cclust
.
This results in a collection of iter.base * base.centers
centers, which are subsequently clustered using the hierarchical
method hclust
. Base centers with less than
minsize
points in there respective partitions are removed
before the hierarchical clustering. The resulting dendrogram is
then cut to produce k
clusters.
Value
bclust
returns objects of class
"bclust"
including the slots
hclust |
Return value of the hierarchical clustering of the
collection of base centers (Object of class |
cluster |
Vector with indices of the clusters the inputs are assigned to. |
centers |
Matrix of centers of the final clusters. Only useful, if the hierarchical clustering method produces convex clusters. |
allcenters |
Matrix of all |
Author(s)
Friedrich Leisch
References
Friedrich Leisch. Bagged clustering. Working Paper 51, SFB “Adaptive Information Systems and Modeling in Economics and Management Science”, August 1999. doi:10.57938/9b129f95-b53b-44ce-a129-5b7a1168d832
Sara Dolnicar and Friedrich Leisch. Winter tourist segments in Austria: Identifying stable vacation styles using bagged clustering techniques. Journal of Travel Research, 41(3):281-292, 2003.
See Also
Examples
data(iris)
bc1 <- bclust(iris[,1:4], 3, base.k=5)
plot(bc1)
table(clusters(bc1, k=3))
parameters(bc1, k=3)
Birth and Death Rates
Description
Birth and death rates for 70 countries.
Usage
data(birth)
Format
A data frame with 70 observations on the following 2 variables.
birth
Birth rate (in percent).
death
Death rate (in percent).
References
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
Bootstrap Flexclust Algorithms
Description
Runs clustering algorithms repeatedly for different numbers of clusters on bootstrap replica of the original data and returns corresponding cluster assignments, centroids and (adjusted) Rand indices comparing pairs of partitions.
Usage
bootFlexclust(x, k, nboot=100, correct=TRUE, seed=NULL,
multicore=TRUE, verbose=FALSE, ...)
## S4 method for signature 'bootFlexclust'
summary(object)
## S4 method for signature 'bootFlexclust,missing'
plot(x, y, ...)
## S4 method for signature 'bootFlexclust'
boxplot(x, ...)
## S4 method for signature 'bootFlexclust'
densityplot(x, data, ...)
Arguments
x , k , ... |
Passed to |
nboot |
Number of bootstrap pairs of partitions. |
correct |
Logical, correct the Rand index for agreement by chance also called adjusted Rand index)? |
seed |
If not |
multicore |
If |
verbose |
If |
y , data |
Not used. |
object |
An object of class |
Details
Availability of multicore is checked
when flexclust is loaded. This information is stored and can be
obtained using
getOption("flexclust")$have_multicore
. Set to FALSE
for debugging and more sensible error messages in case something
goes wrong.
Author(s)
Friedrich Leisch
See Also
Examples
## Not run:
## data uniform on unit square
x <- matrix(runif(400), ncol=2)
cl <- FALSE
## to run bootstrap replications on a workstation cluster do the following:
library("parallel")
cl <- makeCluster(2, type = "PSOCK")
clusterCall(cl, function() require("flexclust"))
## 50 bootstrap replicates for speed in example,
## use more for real applications
bcl <- bootFlexclust(x, k=2:7, nboot=50, FUN=cclust, multicore=cl)
bcl
summary(bcl)
## splitting the square into four quadrants should be the most stable
## solution (increase nboot if not)
plot(bcl)
densityplot(bcl, from=0)
## End(Not run)
German Parliament Election Data
Description
Results of the elections 2002, 2005 or 2009 for the German Bundestag, the first chamber of the German parliament.
Usage
data(btw2002)
data(btw2005)
data(btw2009)
bundestag(year, second=TRUE, percent=TRUE, nazero=TRUE, state=FALSE)
Arguments
year |
Numeric or character, year of the election. |
second |
Logical, return second or first votes? |
percent |
Logical, return percentages or absolute numbers? |
nazero |
Logical, convert |
state |
Logical or character. If |
Format
btw200x
are data frames with 299 rows
(corresponding to constituencies) and 17 columns. All columns except
state
are numeric.
state
Factor, the 16 German federal states.
eligible
Number of citizens eligible to vote.
votes
Number of eligible citizens who did vote.
invalid1, invalid2
Number of invalid first and second votes (see details below).
valid1, valid2
Number of valid first and second votes.
SPD1, SPD2
Number of first and second votes for the Social Democrats.
UNION1, UNION2
Number of first and second votes for CDU/CSU, the conservative Christian Democrats.
GRUENE1, GRUENE2
Number of first and second votes for the Green Party.
FDP1, FDP2
Number of first and second votes for the Liberal Party.
LINKE1, LINKE2
Number of first and second votes for the Left Party (PDS in 2002).
Missing values indicate that a party did not candidate in the corresponding constituency.
Details
btw200x
are the original data sets.
bundestag()
is a helper function which extracts first
or second votes, calculates percentages (number of votes for a party divided by
number of valid votes), replaces missing values by zero, and converts
the result from a data frame to a matrix. By default
it returns the percentage of second votes for each party, which
determines the number of seats each party gets in parliament.
German Federal Elections
Half of the Members of the German Bundestag are elected directly from Germany's 299 constituencies, the other half on the parties' state lists. Accordingly, each voter has two votes in the elections to the German Bundestag. The first vote, allowing voters to elect their local representatives to the Bundestag, decides which candidates are sent to Parliament from the constituencies.
The second vote is cast for a party list. And it is this second vote that determines the relative strengths of the parties represented in the Bundestag. At least 598 Members of the German Bundestag are elected in this way. In addition to this, there are certain circumstances in which some candidates win what are known as “overhang mandates” when the seats are being distributed.
References
Homepage of the Bundestag: https://www.bundestag.de
Examples
p02 <- bundestag(2002)
pairs(p02)
p05 <- bundestag(2005)
pairs(p05)
p09 <- bundestag(2009)
pairs(p09)
state <- bundestag(2002, state=TRUE)
table(state)
start.with.b <- bundestag(2002, state="^B")
table(start.with.b)
pairs(p09, col=2-(state=="Bayern"))
Box-Whisker Plot Methods in Package ‘flexclust’
Description
Seperate boxplot of variables in each cluster in comparison with boxplot for complete sample.
Usage
## S4 method for signature 'kcca'
bwplot(x, data, xlab="",
strip.labels=NULL, strip.prefix="Cluster ",
col=NULL, shade=!is.null(shadefun), shadefun=NULL, byvar=FALSE, ...)
## S4 method for signature 'bclust'
bwplot(x, k=x@k, xlab="", strip.labels=NULL,
strip.prefix="Cluster ", clusters=1:k, ...)
Arguments
x |
An object of class |
data |
If not |
xlab , ... |
Graphical parameters. |
col |
Vector of colors for the clusters. |
strip.labels |
Vector of strings for the strips of the Trellis display. |
strip.prefix |
Prefix string for the strips of the Trellis display. |
shade |
If |
shadefun |
A function or name of a function to compute which
boxes are shaded, e.g. |
byvar |
If |
k |
Number of clusters. |
clusters |
Integer vector of clusters to plot. |
Examples
set.seed(1)
cl <- cclust(iris[,-5], k=3, save.data=TRUE)
bwplot(cl)
bwplot(cl, byvar=TRUE)
## fill only boxes with color which do not contain the overall median
## (grey dot of background box)
bwplot(cl, shade=TRUE)
## fill only boxes with color which do not overlap with the box of the
## complete sample (grey background box)
bwplot(cl, shadefun="boxOverlap")
Convex Clustering
Description
Perform k-means clustering, hard competitive learning or neural gas on a data matrix.
Usage
cclust(x, k, dist = "euclidean", method = "kmeans",
weights=NULL, control=NULL, group=NULL, simple=FALSE,
save.data=FALSE)
Arguments
x |
A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns). |
k |
Either the number of clusters, or a vector of cluster
assignments, or a matrix of initial
(distinct) cluster centroids. If a number, a random set of (distinct)
rows in |
dist |
Distance measure, one of |
method |
Clustering algorithm: one of |
weights |
An optional vector of weights for the observations
(rows of the |
control |
An object of class |
group |
Currently ignored. |
simple |
Return an object of class |
save.data |
Save a copy of |
Details
This function uses the same computational engine as the earlier
function of the same name from package ‘cclust’. The main difference
is that it returns an S4 object of class "kcca"
, hence all
available methods for "kcca"
objects can be used. By default
kcca
and cclust
use exactly the same algorithm,
but cclust
will usually be much faster because it uses compiled
code.
If dist
is "euclidean"
, the distance between the cluster
center and the data points is the Euclidian distance (ordinary kmeans
algorithm), and cluster means are used as centroids.
If "manhattan"
, the distance between the cluster
center and the data points is the sum of the absolute values of the
distances, and the column-wise cluster medians are used as centroids.
If method
is "kmeans"
, the classic kmeans algorithm as
given by MacQueen (1967) is
used, which works by repeatedly moving all cluster
centers to the mean of their respective Voronoi sets. If
"hardcl"
,
on-line updates are used (AKA hard competitive learning), which work by
randomly drawing an observation from x
and moving the closest
center towards that point (e.g., Ripley 1996). If
"neuralgas"
then the neural gas algorithm by Martinetz et al
(1993) is used. It is similar to hard competitive learning, but in
addition to the closest centroid also the second closest centroid is
moved in each iteration.
Value
An object of class "kcca"
.
Author(s)
Evgenia Dimitriadou and Friedrich Leisch
References
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281–297. Berkeley, CA: University of California Press.
Martinetz T., Berkovich S., and Schulten K (1993). ‘Neural-Gas’ Network for Vector Quantization and its Application to Time-Series Prediction. IEEE Transactions on Neural Networks, 4 (4), pp. 558–569.
Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge.
See Also
Examples
## a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd=0.3), ncol=2),
matrix(rnorm(100, mean=1, sd=0.3), ncol=2))
cl <- cclust(x,2)
plot(x, col=predict(cl))
points(cl@centers, pch="x", cex=2, col=3)
## a 3-dimensional example
x <- rbind(matrix(rnorm(150, sd=0.3), ncol=3),
matrix(rnorm(150, mean=2, sd=0.3), ncol=3),
matrix(rnorm(150, mean=4, sd=0.3), ncol=3))
cl <- cclust(x, 6, method="neuralgas", save.data=TRUE)
pairs(x, col=predict(cl))
plot(cl)
Cluster Similarity Matrix
Description
Returns a matrix of cluster similarities. Currently two methods for computing similarities of clusters are implemented, see details below.
Usage
## S4 method for signature 'kcca'
clusterSim(object, data=NULL, method=c("shadow", "centers"),
symmetric=FALSE, ...)
## S4 method for signature 'kccasimple'
clusterSim(object, data=NULL, method=c("shadow", "centers"),
symmetric=FALSE, ...)
Arguments
object |
Fitted object. |
data |
Data to use for computation of the shadow values. If
the cluster object |
method |
Type of similarities, see details below. |
symmetric |
Compute symmetric or asymmetric shadow values?
Ignored if |
... |
Currently not used. |
Details
If method="shadow"
(the default), then the similarity of two
clusters is proportional to the number of points in a cluster, where
the centroid of the other cluster is second-closest. See Leisch (2006,
2008) for detailed formulas.
If method="centers"
, then first the pairwise distances between
all centroids are computed and rescaled to [0,1]. The similarity
between tow clusters is then simply 1 minus the rescaled distance.
Author(s)
Friedrich Leisch
References
Friedrich Leisch. A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51 (2), 526–544, 2006.
Friedrich Leisch. Visualizing cluster analysis and finite mixture models. In Chun houh Chen, Wolfgang Haerdle, and Antony Unwin, editors, Handbook of Data Visualization, Springer Handbooks of Computational Statistics. Springer Verlag, 2008.
Examples
example(Nclus)
clusterSim(cl)
clusterSim(cl, symmetric=TRUE)
## should have similar structure but will be numerically different:
clusterSim(cl, symmetric=TRUE, data=Nclus[sample(1:550, 200),])
## different concept of cluster similarity
clusterSim(cl, method="centers")
Conversion Between S3 Partition Objects and KCCA
Description
These functions can be used to convert the results from cluster
functions like
kmeans
or pam
to objects
of class "kcca"
and vice versa.
Usage
as.kcca(object, ...)
## S3 method for class 'hclust'
as.kcca(object, data, k, family=NULL, save.data=FALSE, ...)
## S3 method for class 'kmeans'
as.kcca(object, data, save.data=FALSE, ...)
## S3 method for class 'partition'
as.kcca(object, data=NULL, save.data=FALSE, ...)
## S3 method for class 'skmeans'
as.kcca(object, data, save.data=FALSE, ...)
## S4 method for signature 'kccasimple,kmeans'
coerce(from, to="kmeans", strict=TRUE)
Cutree(tree, k=NULL, h=NULL)
Arguments
object |
Fitted object. |
data |
Data which were used to obtain the clustering. For
|
save.data |
Save a copy of the data in the return object? |
k |
Number of clusters. |
family |
Object of class |
... |
Currently not used. |
from , to , strict |
Usual arguments for |
tree |
A tree as produced by |
h |
Numeric scalar or vector with heights where the tree should be cut. |
Details
The standard cutree
function orders clusters such that
observation one is in cluster one, the first observation (as ordered
in the data set) not in cluster one is in cluster two,
etc. Cutree
orders clusters as shown in the dendrogram from
left to right such that similar clusters have similar numbers. The
latter is used when converting to kcca
.
For hierarchical clustering the cluster memberships of the converted
object can be different from the result of Cutree
,
because one KCCA-iteration has to be performed in order to obtain a
valid kcca
object. In this case a warning is issued.
Author(s)
Friedrich Leisch
Examples
data(Nclus)
cl1 <- kmeans(Nclus, 4)
cl1
cl1a <- as.kcca(cl1, Nclus)
cl1a
cl1b <- as(cl1a, "kmeans")
library("cluster")
cl2 <- pam(Nclus, 4)
cl2
cl2a <- as.kcca(cl2)
cl2a
## the same
cl2b <- as.kcca(cl2, Nclus)
cl2b
## hierarchical clustering
hc <- hclust(dist(USArrests))
plot(hc)
rect.hclust(hc, k=3)
c3 <- Cutree(hc, k=3)
k3 <- as.kcca(hc, USArrests, k=3)
barchart(k3)
table(c3, clusters(k3))
Dentition of Mammals
Description
Mammal's teeth divided into the 4 groups: incisors, canines, premolars and molars.
Usage
data(dentitio)
Format
A data frame with 66 observations on the following 8 variables.
top.inc
Top incisors.
bot.inc
Bottom incisors.
top.can
Top canines.
bot.can
Bottom canines.
top.pre
Top premolars.
bot.pre
Bottom premolars.
top.mol
Top molars.
bot.mol
Bottom molars.
References
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
Compute Pairwise Distances Between Two Data sets
Description
This function computes and returns the distance matrix computed by using the specified distance measure to compute the pairwise distances between the rows of two data matrices.
Usage
dist2(x, y, method = "euclidean", p=2)
Arguments
x |
A data matrix. |
y |
A vector or second data matrix. |
method |
the distance measure to be used. This must be one of
|
p |
The power of the Minkowski distance. |
Details
This is a two-data-set equivalent of the standard function
dist
. It returns a matrix of all pairwise
distances between rows in x
and y
. The current
implementation is efficient only if y
has not too many
rows (the code is vectorized in x
but not in y
).
Note
The definition of Canberra distance was wrong for negative data prior to version 1.3-5.
Author(s)
Friedrich Leisch
See Also
Examples
x <- matrix(rnorm(20), ncol=4)
rownames(x) = paste("X", 1:nrow(x), sep=".")
y <- matrix(rnorm(12), ncol=4)
rownames(y) = paste("Y", 1:nrow(y), sep=".")
dist2(x, y)
dist2(x, y, "man")
data(milk)
dist2(milk[1:5,], milk[4:6,])
Distance and Centroid Computation
Description
Helper functions to create kccaFamily
objects.
Usage
distAngle(x, centers)
distCanberra(x, centers)
distCor(x, centers)
distEuclidean(x, centers)
distJaccard(x, centers)
distManhattan(x, centers)
distMax(x, centers)
distMinkowski(x, centers, p=2)
centAngle(x)
centMean(x)
centMedian(x)
centOptim(x, dist)
centOptim01(x, dist)
Arguments
x |
A data matrix. |
centers |
A matrix of centroids. |
p |
The power of the Minkowski distance. |
dist |
A distance function. |
Author(s)
Friedrich Leisch
Classes "flexclustControl" and "cclustControl"
Description
Hyperparameters for cluster algorithms.
Objects from the Class
Objects can be created by calls of the form
new("flexclustControl", ...)
. In addition, named lists can be
coerced to flexclustControl
objects, names are completed if unique (see examples).
Slots
Objects of class "flexclustControl"
have the following slots:
iter.max
:Maximum number of iterations.
tolerance
:The algorithm is stopped when the (relative) change of the optimization criterion is smaller than
tolerance
.verbose
:If a positive integer, then progress is reported every
verbose
iterations. If 0, no output is generated during model fitting.classify
:Character string, one of
"auto"
,"weighted"
,"hard"
or"simann"
.initcent
:Character string, name of function for initial centroids, currently
"randomcent"
(the default) and"kmeanspp"
are available.gamma
:Gamma value for weighted hard competitive learning.
simann
:Parameters for simulated annealing optimization (only used when
classify="simann"
).ntry
:Number of trials per iteration for QT clustering.
min.size
:Clusters smaller than this value are treated as outliers.
Objects of class "cclustControl"
inherit from
"flexclustControl"
and have the following additional slots:
method
:Learning rate for hard competitive learning, one of
"polynomial"
or"exponential"
.pol.rate
:Positive number for polynomial learning rate of form
1/iter^{par}
.exp.rate
Vector of length 2 with parameters for exponential learning rate of form
par1*(par2/par1)^{(iter/iter.max)}
.
ng.rate
:Vector of length 4 with parameters for neural gas, see details below.
Learning Rate of Neural Gas
The neural gas algorithm uses updates of form
cnew = cold + e*exp(-m/l)*(x - cold)
for every centroid, where m
is the order (minus 1) of the
centroid with
respect to distance to data point x
(0=closest, 1=second,
...). The parameters e
and l
are given by
e = par1*(par2/par1)^{(iter/iter.max)},
l = par3*(par4/par3)^{(iter/iter.max)}.
See Martinetz et al (1993) for details of the algorithm, and the examples section on how to obtain default values.
Author(s)
Friedrich Leisch
References
Martinetz T., Berkovich S., and Schulten K. (1993). "Neural-Gas Network for Vector Quantization and its Application to Time-Series Prediction." IEEE Transactions on Neural Networks, 4 (4), pp. 558–569.
Arthur D. and Vassilvitskii S. (2007). "k-means++: the advantages of careful seeding". Proceedings of the 18th annual ACM-SIAM symposium on Discrete algorithms. pp. 1027-1035.
See Also
Examples
## have a look at the defaults
new("flexclustControl")
## corce a list
mycont <- list(iter=500, tol=0.001, class="w")
as(mycont, "flexclustControl")
## some additional slots
as(mycont, "cclustControl")
## default values for ng.rate
new("cclustControl")@ng.rate
Flexclust Color Palettes
Description
Create and access palettes for the plot methods.
Usage
flxColors(n=1:8, color=c("full","medium", "light","dark"), grey=FALSE)
flxPalette(n, ...)
Arguments
n |
Index number of color to return (1 to 8) for |
color |
Type of color, see details. |
grey |
Return grey value corresponding to palette. |
... |
Passed on to |
Details
This function creates color palettes in HCL space for up to 8 colors. All palettes have constant chroma and luminance, only the hue of the colors change within a palette.
Palettes "full"
and "dark"
have the same luminance, and
palettes "medium"
and "light"
have the same luminance.
Author(s)
Friedrich Leisch
See Also
Examples
opar <- par(c("mfrow", "mar", "xaxt"))
par(mfrow=c(2, 2), mar=c(0, 0, 2, 0), yaxt="n")
x <- rep(1, 8)
barplot(x, col = flxColors(color="full"), main="full")
barplot(x, col = flxColors(color="dark"), main="dark")
barplot(x, col = flxColors(color="medium"), main="medium")
barplot(x, col = flxColors(color="light"), main="light")
par(opar)
Methods for Function histogram in Package ‘flexclust’
Description
Plot a histogram of the similarity of each observation to each cluster.
Usage
## S4 method for signature 'kccasimple,missing'
histogram(x, data, xlab="", ...)
## S4 method for signature 'kccasimple,data.frame'
histogram(x, data, xlab="", ...)
## S4 method for signature 'kccasimple,matrix'
histogram(x, data, xlab="Similarity",
power=1, ...)
Arguments
x |
An object of class |
data |
If not missing, the distance and thus similarity between observations and cluster centers is determined for the new data and used for the plots. By default the values from the training data are used. |
xlab |
Label for the x-axis. |
power |
Numeric indicating how similarities are transformed, for more details see Dolnicar et al. (2018). |
... |
Additional arguments passed to
|
Author(s)
Friedrich Leisch
References
Dolnicar S., Gruen B., and Leisch F. (2018) Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful. Springer Singapore.
Methods for Function image in Package ‘flexclust’
Description
Image plot of cluster segments overlaid by neighbourhood graph.
Usage
## S4 method for signature 'kcca'
image(x, which = 1:2, npoints = 100,
xlab = "", ylab = "", fastcol = TRUE, col=NULL,
clwd=0, graph=TRUE, ...)
Arguments
x |
An object of class |
which |
Index number of dimensions of input space to plot. |
npoints |
Number of grid points for image. |
fastcol |
If |
col |
Vector of background colors for the segments. |
clwd |
Line width of contour lines at cluster boundaries, use
larger values for |
graph |
Logical, add a neighborhood graph to the plot? |
xlab , ylab , ... |
Graphical parameters. |
Details
This works only for "kcca"
objects, no method is available for
"kccasimple" objects.
Author(s)
Friedrich Leisch
See Also
Get Information on Fitted Flexclust Objects
Description
Returns descriptive information about fitted flexclust objects like cluster sizes or sum of within-cluster distances.
Usage
## S4 method for signature 'flexclust,character'
info(object, which, drop=TRUE, ...)
Arguments
object |
Fitted object. |
which |
Which information to get. Use |
drop |
Logical. If |
... |
Passed to methods. |
Details
Function info
can be used to access slots of fitted flexclust
objects in a portable way, and in addition computes some
meta-information like sum of within-cluster distances.
Function infoCheck
returns a logical value that is TRUE
if the requested information can be computed from the object
.
Author(s)
Friedrich Leisch
See Also
Examples
data("Nclus")
plot(Nclus)
cl1 <- cclust(Nclus, k=4)
summary(cl1)
## these two are the same
info(cl1)
info(cl1, "help")
## cluster sizes
i1 <- info(cl1, "size")
i1
## average within cluster distances
i2 <- info(cl1, "av_dist")
i2
## the sum of all within-cluster distances
i3 <- info(cl1, "distsum")
i3
## sum(i1*i2) must of course be the same as i3
stopifnot(all.equal(sum(i1*i2), i3))
## This should return TRUE
modeltools::infoCheck(cl1, "size")
## and this FALSE
modeltools::infoCheck(cl1, "Homer Simpson")
## both combined
i4 <- modeltools::infoCheck(cl1, c("size", "Homer Simpson"))
i4
stopifnot(all.equal(i4, c(TRUE, FALSE)))
K-Centroids Cluster Analysis
Description
Perform k-centroids clustering on a data matrix.
Usage
kcca(x, k, family=kccaFamily("kmeans"), weights=NULL,
group=NULL, control=NULL, simple=FALSE, save.data=FALSE)
kccaFamily(which=NULL, dist=NULL, cent=NULL, name=which, preproc = NULL,
genDist=NULL, trim=0, groupFun = "minSumClusters")
## S4 method for signature 'kccasimple'
summary(object)
Arguments
x |
A numeric matrix of data, or an object that can be coerced to such a matrix using data.matrix. |
k |
Either the number of clusters, or a vector of cluster
assignments, or a matrix of initial
(distinct) cluster centroids. If a number, a random set of (distinct)
rows in |
family |
Object of class |
weights |
An optional vector of weights to be used in the clustering process, cannot be combined with all families. |
group |
An optional grouping vector for the data, see details below. |
control |
An object of class |
simple |
Return an object of class |
save.data |
Save a copy of |
which |
One of |
name |
Optional long name for family, used only for show methods. |
dist |
A function for distance computation, ignored
if |
cent |
A function for centroid computation, ignored
if |
preproc |
Function for data preprocessing. Defaults to
|
genDist |
Function for updating the family object based on
|
trim |
A number in between 0 and 0.5, if non-zero then trimmed
means are used for the |
groupFun |
Function or name of function to obtain clusters for grouped data, see details below. |
object |
Object of class |
Details
See the paper A Toolbox for K-Centroids Cluster Analysis referenced below for details.
Value
Function kcca
returns objects of class "kcca"
or
"kccasimple"
depending on the value of argument
simple
. The simpler objects contain fewer slots and hence are
faster to compute, but contain no auxiliary information used by the
plotting methods. Most plot methods for "kccasimple"
objects do
nothing and return a warning. If only centroids, cluster membership or
prediction for new data are of interest, then the simple objects are
sufficient.
Predefined Families
Function kccaFamily()
currently has the following predefined
families (distance / centroid):
- kmeans:
Euclidean distance / mean
- kmedians:
Manhattan distance / median
- angle:
angle between observation and centroid / standardized mean
- jaccard:
Jaccard distance / numeric optimization
- ejaccard:
Jaccard distance / mean
See Leisch (2006) for details on all combinations.
Group Constraints
If group
is not NULL
, then observations from the same
group are restricted to belong to the same cluster (must-link
constraint) or different clusters (cannot-link constraint) during the
fitting process. If groupFun = "minSumClusters"
, then all group
members are
assign to the cluster where the center has minimal average distance to
the group members. If groupFun = "majorityClusters"
, then all
group members are assigned to the cluster the majority would belong to
without a constraint.
groupFun = "differentClusters"
implements a cannot-link
constraint, i.e., members of one group are not allowed to belong to
the same cluster. The optimal allocation for each group is found by
solving a linear sum assignment problem using
solve_LSAP
. Obviously the group sizes must be smaller
than the number of clusters in this case.
Ties are broken at random in all cases.
Note that at the moment not all methods for fitted
"kcca"
objects respect the grouping information, most
importantly the plot method when a data argument is specified.
Author(s)
Friedrich Leisch
References
Friedrich Leisch. A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51 (2), 526–544, 2006.
Friedrich Leisch and Bettina Gruen. Extending standard cluster algorithms to allow for group constraints. In Alfredo Rizzi and Maurizio Vichi, editors, Compstat 2006-Proceedings in Computational Statistics, pages 885-892. Physica Verlag, Heidelberg, Germany, 2006.
See Also
stepFlexclust
, cclust
,
distances
Examples
data("Nclus")
plot(Nclus)
## try kmeans
cl1 <- kcca(Nclus, k=4)
cl1
image(cl1)
points(Nclus)
## A barplot of the centroids
barplot(cl1)
## now use k-medians and kmeans++ initialization, cluster centroids
## should be similar...
cl2 <- kcca(Nclus, k=4, family=kccaFamily("kmedians"),
control=list(initcent="kmeanspp"))
cl2
## ... but the boundaries of the partitions have a different shape
image(cl2)
points(Nclus)
Convert Cluster Result to Data Frame
Description
Convert object of class "kcca"
to a data frame in long format.
Usage
kcca2df(object, data)
Arguments
object |
Object of class |
data |
Optional data if not saved in |
Value
A data.frame
with columns value
, variable
and
group
.
Examples
c.iris <- cclust(iris[,-5], 3, save.data=TRUE)
df.c.iris <- kcca2df(c.iris)
summary(df.c.iris)
densityplot(~value|variable+group, data=df.c.iris)
Milk of Mammals
Description
The data set contains the ingredients of mammal's milk of 25 animals.
Usage
data(milk)
Format
A data frame with 25 observations on the following 5 variables (all in percent).
water
Water.
protein
Protein.
fat
Fat.
lactose
Lactose.
ash
Ash.
References
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
Nutrients in Meat, Fish and Fowl
Description
The data set contains the measurements of nutrients in several types of meat, fish and fowl.
Usage
data(nutrient)
Format
A data frame with 27 observations on the following 5 variables.
energy
Food energy (calories).
protein
Protein (grams).
fat
Fat (grams).
calcium
calcium (milli grams).
iron
Iron (milli grams).
References
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
Methods for Function pairs in Package ‘flexclust’
Description
Plot a matrix of neighbourhood graphs.
Usage
## S4 method for signature 'kcca'
pairs(x, which=NULL, project=NULL, oma=NULL, ...)
Arguments
x |
An object of class |
which |
Index numbers of dimensions of (projected) input space to plot, default is to plot all dimensions. |
project |
Projection object for which a |
oma |
Outer margin. |
... |
Passed to the |
Details
This works only for "kcca"
objects, no method is available for
"kccasimple" objects.
Author(s)
Friedrich Leisch
Get Centroids from KCCA Object
Description
Returns the matrix of centroids of a fitted object of class "kcca"
.
Usage
## S4 method for signature 'kccasimple'
parameters(object, ...)
Arguments
object |
Fitted object. |
... |
Currently not used. |
Author(s)
Friedrich Leisch
Methods for Function plot in Package ‘flexclust’
Description
Plot the neighbourhood graph of a cluster solution together with projected data points.
Usage
## S4 method for signature 'kcca,missing'
plot(x, y, which=1:2, project=NULL,
data=NULL, points=TRUE, hull=TRUE, hull.args=NULL,
number = TRUE, simlines=TRUE,
lwd=1, maxlwd=8*lwd, cex=1.5, numcol=FALSE, nodes=16,
add=FALSE, xlab="", ylab="", xlim = NULL,
ylim = NULL, pch=NULL, col=NULL, ...)
Arguments
x |
An object of class |
y |
Not used |
which |
Index numbers of dimensions of (projected) input space to plot. |
project |
Projection object for which a |
data |
Data to include in plot. If the cluster object |
points |
Logical, shall data points be plotted (if available)? |
hull |
If |
hull.args |
A list of arguments for the hull function. |
number |
Logical, plot number labels in nodes of graph? |
numcol , cex |
Color and size of number labels in nodes of
graph. If |
nodes |
Plotting symbol to use for nodes if no numbers are drawn. |
simlines |
Logical, plot edges of graph? |
lwd , maxlwd |
Numerical, thickness of lines. |
add |
Logical, add to existing plot? |
xlab , ylab |
Axis labels. |
xlim , ylim |
Axis range. |
pch , col , ... |
Plotting symbols and colors for data points. |
Details
This works only for "kcca"
objects, no method is available for
"kccasimple" objects.
Author(s)
Friedrich Leisch
References
Friedrich Leisch. Visualizing cluster analysis and finite mixture models. In Chun houh Chen, Wolfgang Haerdle, and Antony Unwin, editors, Handbook of Data Visualization, Springer Handbooks of Computational Statistics. Springer Verlag, 2008.
Predict Cluster Membership
Description
Return either the cluster membership of training data or predict for new data.
Usage
## S4 method for signature 'kccasimple'
predict(object, newdata, ...)
## S4 method for signature 'flexclust,ANY'
clusters(object, newdata, ...)
Arguments
object |
Object of class inheriting from |
newdata |
An optional data matrix with the same number of columns as the cluster centers. If omitted, the fitted values are used. |
... |
Currently not used. |
Details
clusters
can be used on any object of class "flexclust"
and returns the cluster memberships of the training data.
predict
can be used only on objects of class "kcca"
(which inherit from "flexclust"
). If no newdata
argument
is specified, the function is identical to clusters
, if
newdata
is specified, then cluster memberships for the new data
are predicted. clusters(object, newdata, ...)
is an alias for
predict(object, newdata, ...)
.
Author(s)
Friedrich Leisch
Artificial 2d Market Segment Data
Description
Simple artificial 2-dimensional data to demonstrate clustering for market segmentation. One dimension is the hypothetical feature sophistication (or performance or quality, etc) of a product, the second dimension the price customers are willing to pay for the product.
Usage
priceFeature(n, which=c("2clust", "3clust", "3clustold", "5clust",
"ellipse", "triangle", "circle", "square",
"largesmall"))
Arguments
n |
Sample size. |
which |
Shape of data set. |
References
Sara Dolnicar and Friedrich Leisch. Evaluation of structure and reproducibility of cluster solutions using the bootstrap. Marketing Letters, 21:83-101, 2010.
Examples
plot(priceFeature(200, "2clust"))
plot(priceFeature(200, "3clust"))
plot(priceFeature(200, "3clustold"))
plot(priceFeature(200, "5clust"))
plot(priceFeature(200, "ell"))
plot(priceFeature(200, "tri"))
plot(priceFeature(200, "circ"))
plot(priceFeature(200, "square"))
plot(priceFeature(200, "largesmall"))
Add Arrows for Projected Axes to a Plot
Description
Adds arrows for original coordinate axes to a projection plot.
Usage
projAxes(object, which=1:2, center=NULL,
col="red", radius=NULL,
minradius=0.1, textargs=list(col=col),
col.names=getColnames(object),
which.names="", group = NULL, groupFun = colMeans,
plot=TRUE, ...)
placeLabels(object)
## S4 method for signature 'projAxes'
placeLabels(object)
Arguments
object |
Return value of a projection method like
|
which |
Index number of dimensions of (projected) input space that have been plotted. |
center |
Center of the coordinate system to use in projected space. Default is the center of the plotting region. |
col |
Color of arrows. |
radius |
Relative size of the arrows. |
minradius |
Minimum radius of arrows to include (relative to arrow size). |
textargs |
List of arguments for |
col.names |
Variable names of the original data. |
which.names |
A regular expression which variable names to include in the plot. |
group |
An optional grouping variable for the original
coordinates. Coordinates with group |
groupFun |
Function used to aggregate the projected coordinates
if |
plot |
Logical,if |
... |
Passed to |
Value
projAxes
invisibly returns an object of class
"projAxes"
, which can be
added to an existing plot by its plot
method.
Author(s)
Friedrich Leisch
Examples
data(milk)
milk.pca <- prcomp(milk, scale=TRUE)
## create a biplot step by step
plot(predict(milk.pca), type="n")
text(predict(milk.pca), rownames(milk), col="green", cex=0.8)
projAxes(milk.pca)
## the same, but arrows are blue, centered at origin and all arrows are
## plotted
plot(predict(milk.pca), type="n")
text(predict(milk.pca), rownames(milk), col="green", cex=0.8)
projAxes(milk.pca, col="blue", center=0, minradius=0)
## use points instead of text, plot PC2 and PC3, manual radius
## specification, store result
plot(predict(milk.pca)[,c(2,3)])
arr <- projAxes(milk.pca, which=c(2,3), radius=1.2, plot=FALSE)
plot(arr)
## Not run:
## manually try to find new places for the labels: each arrow is marked
## active in turn, use the left mouse button to find a better location
## for the label. Use the right mouse button to go on to the next
## variable.
arr1 <- placeLabels(arr)
## now do the plot again:
plot(predict(milk.pca)[,c(2,3)])
plot(arr1)
## End(Not run)
Barcharts and Boxplots for Columns of a Data Matrix Split by Groups
Description
Split a binary or numeric matrix by a grouping variable, run a series of tests on all variables, adjust for multiple testing and graphically represent results.
Usage
propBarchart(x, g, alpha=0.05, correct="holm", test="prop.test",
sort=FALSE, strip.prefix="", strip.labels=NULL,
which=NULL, byvar=FALSE, ...)
## S4 method for signature 'propBarchart'
summary(object, ...)
groupBWplot(x, g, alpha=0.05, correct="holm", xlab="", col=NULL,
shade=!is.null(shadefun), shadefun=NULL,
strip.prefix="", strip.labels=NULL, which=NULL, byvar=FALSE,
...)
Arguments
x |
A binary data matrix. |
g |
A factor specifying the groups. |
alpha |
Significance level for test of differences in proportions. |
correct |
Correction method for multiple testing, passed to
|
test |
Test to use for detecting significant differences in proportions. |
sort |
Logical, sort variables by total sample mean? |
strip.prefix |
Character string prepended to strips of the
|
strip.labels |
Character vector of labels to use for strips of
|
which |
Index numbers or names of variables to plot. |
byvar |
If |
... |
|
object |
Return value of |
xlab |
A title for the x-axis: see |
col |
Vector of colors for the panels. |
shade |
If |
shadefun |
A function or name of a function to compute which
boxes are shaded, e.g. |
Details
Function propBarchart
splits a binary data matrix into
subgroups, computes the percentage of ones in each column and compares
the proportions in the groups using prop.test
. The
p-values for all variables are adjusted for multiple testing and a
barchart of group percentages is drawn highlighting variables with
significant differences in proportion. The summary
method can
be used to create a corresponding table for publications.
Function groupBWplot
takes a general numeric matrix, also
splits into subgroups and uses boxes instead of bars. By default
kruskal.test
is used to compute significant differences
in location, in addition the heuristics from
bwplot,kcca-method
can be used. Boxes of the complete sample
are used as reference in the background.
Author(s)
Friedrich Leisch
See Also
barplot-methods
,
bwplot,kcca-method
Examples
## create a binary matrix from the iris data plus a random noise column
x <- apply(iris[,-5], 2, function(z) z>median(z))
x <- cbind(x, Noise=sample(0:1, 150, replace=TRUE))
## There are significant differences in all 4 original variables, Noise
## has most likely no significant difference (of course the difference
## will be significant in alpha percent of all random samples).
p <- propBarchart(x, iris$Species)
p
summary(p)
propBarchart(x, iris$Species, byvar=TRUE)
x <- iris[,-5]
x <- cbind(x, Noise=rnorm(150, mean=3))
groupBWplot(x, iris$Species)
groupBWplot(x, iris$Species, shade=TRUE)
groupBWplot(x, iris$Species, shadefun="medianInside")
groupBWplot(x, iris$Species, shade=TRUE, byvar=TRUE)
Stochastic QT Clustering
Description
Perform stochastic QT clustering on a data matrix.
Usage
qtclust(x, radius, family = kccaFamily("kmeans"), control = NULL,
save.data=FALSE, kcca=FALSE)
Arguments
x |
A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns). |
radius |
Maximum radius of clusters. |
family |
Object of class |
control |
An object of class |
.
save.data |
Save a copy of |
kcca |
Run |
Details
This function implements a variation of the QT clustering algorithm by
Heyer et al. (1999), see Scharl and Leisch (2006). The main difference
is that in each iteration not
all possible cluster start points are considered, but only a random
sample of size control@ntry
. We also consider only points as initial
centers where at least one other point is within a circle with radius
radius
. In most cases the resulting
solutions are almost
the same at a considerable speed increase, in some cases even better
solutions are obtained than with the original algorithm. If
control@ntry
is set to the size of the data set, an algorithm
similar to the original algorithm as proposed by Heyer et al. (1999)
is obtained.
Value
Function qtclust
by default returns objects of class
"kccasimple"
. If argument kcca
is TRUE
, function
kcca()
is run afterwards (initialized on the QT cluster
solution). Data points
not clustered by the QT cluster algorithm are omitted from the
kcca()
iterations, but filled back into the return
object. All plot methods defined for objects of class "kcca"
can be used.
Author(s)
Friedrich Leisch
References
Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression data: Identification and analysis of coexpressed genes. Genome Research 9, 1106–1115.
Theresa Scharl and Friedrich Leisch. The stochastic QT-clust algorithm: evaluation of stability and variance on time-course microarray data. In Alfredo Rizzi and Maurizio Vichi, editors, Compstat 2006 – Proceedings in Computational Statistics, pages 1015-1022. Physica Verlag, Heidelberg, Germany, 2006.
Examples
x <- matrix(10*runif(1000), ncol=2)
## maximum distrance of point to cluster center is 3
cl1 <- qtclust(x, radius=3)
## maximum distrance of point to cluster center is 1
## -> more clusters, longer runtime
cl2 <- qtclust(x, radius=1)
opar <- par(c("mfrow","mar"))
par(mfrow=c(2,1), mar=c(2.1,2.1,1,1))
plot(x, col=predict(cl1), xlab="", ylab="")
plot(x, col=predict(cl2), xlab="", ylab="")
par(opar)
Compare Partitions
Description
Compute the (adjusted) Rand, Jaccard and Fowlkes-Mallows index for agreement of two partitions.
Usage
comPart(x, y, type=c("ARI","RI","J","FM"))
## S4 method for signature 'flexclust,flexclust'
comPart(x, y, type)
## S4 method for signature 'numeric,numeric'
comPart(x, y, type)
## S4 method for signature 'flexclust,numeric'
comPart(x, y, type)
## S4 method for signature 'numeric,flexclust'
comPart(x, y, type)
randIndex(x, y, correct=TRUE, original=!correct)
## S4 method for signature 'table,missing'
randIndex(x, y, correct=TRUE, original=!correct)
## S4 method for signature 'ANY,ANY'
randIndex(x, y, correct=TRUE, original=!correct)
Arguments
x |
Either a 2-dimensional cross-tabulation of cluster
assignments (for |
y |
An object inheriting from class
|
type |
character vector of abbreviations of indices to compute. |
correct , original |
Logical, correct the Rand index for agreement by chance? |
Value
A vector of indices.
Rand Index
Let A
denote the number of all pairs of data
points which are either put into the same cluster by both partitions or
put into different clusters by both partitions. Conversely, let D
denote the number of all pairs of data points that are put into one
cluster in one partition, but into different clusters by the other
partition. The partitions disagree for all pairs D
and
agree for all pairs A
. We can measure the agreement by the Rand
index A/(A+D)
which is invariant with respect to permutations of
cluster labels.
The index has to be corrected for agreement by chance if the sizes of the clusters are not uniform (which is usually the case), or if there are many clusters, see Hubert & Arabie (1985) for details.
Jaccard Index
If the number of clusters is very large, then usually the vast
majority of pairs of points will not be in the same cluster. The
Jaccard index tries to account for this by using only pairs of points
that are in the same cluster in the defintion of A
.
Fowlkes-Mallows
Let A
again be the pairs of points that
are in the same cluster in both partitions. Fowlkes-Mallows divides
this number by the geometric mean of the sums of the number of pairs in each
cluster of the two partitions. This gives the probability that a pair
of points which are in the same cluster in one partition are also in the
same cluster in the other partition.
Author(s)
Friedrich Leisch
References
Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2, 193–218, 1985.
Marina Meila. Comparing clusterings - an axiomatic view. In Stefan Wrobel and Luc De Raedt, editors, Proceedings of the International Machine Learning Conference (ICML). ACM Press, 2005.
Examples
## no class correlations: corrected Rand almost zero
g1 <- sample(1:5, size=1000, replace=TRUE)
g2 <- sample(1:5, size=1000, replace=TRUE)
tab <- table(g1, g2)
randIndex(tab)
## uncorrected version will be large, because there are many points
## which are assigned to different clusters in both cases
randIndex(tab, correct=FALSE)
comPart(g1, g2)
## let pairs (g1=1,g2=1) and (g1=3,g2=3) agree better
k <- sample(1:1000, size=200)
g1[k] <- 1
g2[k] <- 1
k <- sample(1:1000, size=200)
g1[k] <- 3
g2[k] <- 3
tab <- table(g1, g2)
## the index should be larger than before
randIndex(tab, correct=TRUE, original=TRUE)
comPart(g1, g2)
Plot a Random Tour
Description
Create a series of projection plots corresponding to a random tour through the data.
Usage
randomTour(object, ...)
## S4 method for signature 'ANY'
randomTour(object, ...)
## S4 method for signature 'matrix'
randomTour(object, ...)
## S4 method for signature 'flexclust'
randomTour(object, data=NULL, col=NULL, ...)
randomTourMatrix(x, directions=10,
steps=100, sec=4, sleep = sec/steps,
axiscol=2, axislab=colnames(x),
center=NULL, radius=1, minradius=0.01, asp=1,
...)
Arguments
object , x |
A matrix or an object of class |
data |
Data to include in plot. |
col |
Plotting colors for data points. |
directions |
Integer value, how many different directions are toured. |
steps |
Integer, number of steps in each direction. |
sec |
Numerical, lower bound for the number of seconds each direction takes. |
sleep |
Numerical, sleep for as many seconds after each picture has been plotted. |
axiscol |
If not |
axislab |
Optional labels for the projected axes. |
center |
Center of the coordinate system to use in projected space. Default is the center of the plotting region. |
radius |
Relative size of the arrows. |
minradius |
Minimum radius of arrows to include. |
asp , ... |
Passed on to |
Details
Two random locations are chosen, and the data are then projected onto hyperplanes which are orthogonal to step vectors interpolating the two locations. The first two coordinates of the projected data are plotted. If directions is larger than one, then after the first steps plots one more random location is chosen, and the procedure is repeated from the current position to the new location, etc.
The whole procedure is similar to a grand tour, but no attempt is made to optimize subsequent directions; randomTour simply chooses a random direction in each iteration. Use rggobi for the real thing.
Obviously the function needs a reasonably fast computer and graphics device to give a smooth impression; for x11 it may be necessary to use type="Xlib" rather than cairo.
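The following is a hedged sketch of a single projection step, not the internal code of randomTour: one random direction v is drawn, and the data are projected onto two directions orthogonal to v.
x <- as.matrix(iris[, 1:4])
v <- rnorm(ncol(x))                      ## one random direction
v <- v / sqrt(sum(v^2))
## orthonormal basis whose first column spans v; columns 2:3 span a
## plane orthogonal to v
B <- qr.Q(qr(cbind(v, diag(ncol(x)))))[, 2:3]
plot(x %*% B, xlab="", ylab="")          ## first two projected coordinates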
Author(s)
Friedrich Leisch
Examples
if(interactive()){
par(ask=FALSE)
randomTour(iris[,1:4], axiscol=2:5)
randomTour(iris[,1:4], col=as.numeric(iris$Species), axiscol=4)
x <- matrix(runif(300), ncol=3)
x <- rbind(x, x+1, x+2)
cl <- cclust(x, k=3, save.data=TRUE)
randomTour(cl, center=0, axiscol="black")
## now use predicted cluster membership for new data as colors
randomTour(cl, center=0, axiscol="black",
data=matrix(rnorm(3000, mean=1, sd=2), ncol=3))
}
Relabel Cluster Results.
Description
The clusters are relabelled to obtain a unique labeling.
Usage
relabel(object, by, ...)
## S4 method for signature 'kccasimple,character'
relabel(object, by, which = NULL, ...)
## S4 method for signature 'kccasimple,integer'
relabel(object, by, ...)
## S4 method for signature 'kccasimple,missing'
relabel(object, by, ...)
## S4 method for signature 'stepFlexclust,integer'
relabel(object, by = "series", ...)
## S4 method for signature 'stepFlexclust,missing'
relabel(object, by, ...)
Arguments
object |
An object of class "kccasimple" or "stepFlexclust". |
by |
If a character vector, it needs to be one of "mean", "median", "variable", "manual", "centers", "shadow", "symmshadow" or "series"; see Details. |
which |
Either an integer vector indicating the ordering or a vector of length one indicating the variable used for ordering. |
... |
Currently not used. |
Details
If by is a character vector with value "mean" or "median", the clusters are ordered by the mean or median values over all variables for each cluster. If by = "manual", argument which needs to be a vector indicating the ordering. If by = "variable", argument which needs to indicate the variable used to determine the ordering. If by is "centers", "shadow" or "symmshadow", cluster similarities are calculated using clusterSim and used to determine an ordering using seriate from package seriation.
If by = "series", the relabeling is performed over a series of clusterings to minimize the misclassification.
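A hedged usage sketch of the variants described above (the manual ordering here is arbitrary):
data(Nclus)
cl <- cclust(Nclus, k=4)
relabel(cl, by="mean")                        ## order clusters by mean over all variables
relabel(cl, by="manual", which=c(3, 1, 4, 2)) ## explicit ordering
relabel(cl, by="variable", which=1)           ## order by variable 1 (assuming an index is accepted)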
Author(s)
Friedrich Leisch
Cluster Shadows and Silhouettes
Description
Compute and plot shadows and silhouettes.
Usage
## S4 method for signature 'kccasimple'
shadow(object, ...)
## S4 method for signature 'kcca'
Silhouette(object, data=NULL, ...)
Arguments
object |
An object of class "kccasimple" (shadow) or "kcca" (Silhouette). |
data |
Data to compute silhouette values for. If the cluster object was created with save.data=TRUE, the saved data are used by default. |
... |
Currently not used. |
Details
The shadow value of each data point is defined as twice the distance to the closest centroid divided by the sum of distances to the closest and second-closest centroid. If the shadow value of a point is close to 0, then the point is close to its cluster centroid. If the shadow value is close to 1, it is almost equidistant to the two centroids. Thus, a cluster that is well separated from all other clusters should have many points with small shadow values.
The silhouette value of a data point is defined as the scaled difference between the average dissimilarity of a point to all points in its own cluster and the smallest average dissimilarity to the points of a different cluster. Large silhouette values indicate good separation.
The main difference between silhouette values and shadow values is that we replace average dissimilarities to points in a cluster by dissimilarities to point averages (=centroids). See Leisch (2010) for details.
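For Euclidean distance the shadow definition can be checked directly; a minimal sketch assuming a fitted object c5 as in the examples below (dist2 and parameters are flexclust functions, the remaining names are ad hoc):
d <- dist2(Nclus, parameters(c5))        ## distances of points to all centroids
ds <- t(apply(d, 1, sort))               ## sorted distances per point
2 * ds[, 1] / (ds[, 1] + ds[, 2])        ## closest vs. second-closest centroid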
Author(s)
Friedrich Leisch
References
Friedrich Leisch. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing, 20(4), 457–469, 2010.
Examples
data(Nclus)
set.seed(1)
c5 <- cclust(Nclus, 5, save.data=TRUE)
c5
plot(c5)
## high shadow values indicate clusters with *bad* separation
shadow(c5)
plot(shadow(c5))
## high Silhouette values indicate clusters with *good* separation
Silhouette(c5)
plot(Silhouette(c5))
Shadow Stars
Description
Shadow star plots and corresponding panel functions.
Usage
shadowStars(object, which=1:2, project=NULL,
width=1, varwidth=FALSE,
panel=panelShadowStripes,
box=NULL, col=NULL, add=FALSE, ...)
panelShadowStripes(x, col, ...)
panelShadowViolin(x, ...)
panelShadowBP(x, ...)
panelShadowSkeleton(x, ...)
Arguments
object |
An object of class "kcca". |
which |
Index numbers of dimensions of (projected) input space to plot. |
project |
Projection object for which a predict method exists, e.g., created by prcomp. |
width |
Width of edges connecting the cluster centroids. |
varwidth |
Logical, shall all edges have the same width or should the width be proportional to the number of points shown on the edge? |
panel |
Function used to draw edges. |
box |
Color of the rectangle drawn around each edge. |
col |
A vector of colors for the clusters. |
add |
Logical, add to an existing plot instead of starting a new one? |
... |
Passed on to panel function. |
x |
Shadow values of the data points corresponding to the edge. |
Details
The shadow value of each data point is defined as twice the distance to the closest centroid divided by the sum of distances to the closest and second-closest centroid. If the shadow value of a point is close to 0, then the point is close to its cluster centroid. If the shadow value is close to 1, it is almost equidistant to the two centroids. Thus, a cluster that is well separated from all other clusters should have many points with small shadow values.
The neighborhood graph of a cluster solution connects two centroids by an edge if at least one data point has the two centroids as closest and second-closest. The width of the edge is proportional to the sum of shadow values of all points having these two as closest and second-closest. A shadow star depicts the distribution of shadow values on the edge, see Leisch (2010) for details.
Currently four panel functions are available:
panelShadowStripes: line segment for each shadow value.
panelShadowViolin: violin plot of shadow values.
panelShadowBP: box-percentile plot of shadow values.
panelShadowSkeleton: average shadow value.
Author(s)
Friedrich Leisch
References
Friedrich Leisch. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing, 20(4), 457–469, 2010.
Examples
data(Nclus)
set.seed(1)
c5 <- cclust(Nclus, 5, save.data=TRUE)
c5
plot(c5)
shadowStars(c5)
shadowStars(c5, varwidth=TRUE)
shadowStars(c5, panel=panelShadowViolin)
shadowStars(c5, panel=panelShadowBP)
## always use varwidth=TRUE with panelShadowSkeleton, otherwise a few
## large shadow values can lead to misleading results:
shadowStars(c5, panel=panelShadowSkeleton)
shadowStars(c5, panel=panelShadowSkeleton, varwidth=TRUE)
Segment Level Stability Across Solutions Plot.
Description
Create a segment level stability across solutions plot, possibly using an additional variable for coloring the nodes.
Usage
slsaplot(object, nodecol = NULL, ...)
Arguments
object |
An object returned by stepFlexclust. |
nodecol |
A numeric vector of length equal to the number of observations clustered in object. |
... |
Additional graphical parameters to modify the plot. |
Details
For more details see Dolnicar and Leisch (2017) and Dolnicar et al. (2018).
Value
List of length equal to the number of different cluster solutions minus one, containing numeric vectors of the entropy values used by default to color the nodes.
Author(s)
Friedrich Leisch
References
Dolnicar S. and Leisch F. (2017) "Using Segment Level Stability to Select Target Segments in Data-Driven Market Segmentation Studies" Marketing Letters, 28 (3), pp. 423–436.
Dolnicar S., Gruen B., and Leisch F. (2018) Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful. Springer Singapore.
See Also
stepFlexclust
, relabel
, slswFlexclust
Examples
data("Nclus")
cl25 <- stepFlexclust(Nclus, k=2:5)
slsaplot(cl25)
cl25 <- relabel(cl25)
slsaplot(cl25)
Segment Level Stability Within Solution.
Description
Assess segment level stability within solution.
Usage
slswFlexclust(x, object, ...)
## S4 method for signature 'resampleFlexclust,missing'
plot(x, y, ...)
## S4 method for signature 'resampleFlexclust'
boxplot(x, which=1, ylab=NULL, ...)
## S4 method for signature 'resampleFlexclust'
densityplot(x, data, which=1, ...)
## S4 method for signature 'resampleFlexclust'
summary(object)
Arguments
x |
A numeric matrix of data, or an object that can be coerced to
such a matrix (such as a numeric vector or a data frame with all
numeric columns) passed to |
object |
Object of class |
y |
Missing. |
which |
Integer or character indicating which validation measure is used for plotting. |
ylab |
Axis label. |
data |
Not used. |
... |
Additional arguments; for details see below. |
Details
slswFlexclust takes the following additional arguments: nsamp (default 100) sets the number of bootstrap pairs drawn; seed sets a random seed; multicore (default TRUE) indicates whether bootstrap samples should be drawn in parallel; and verbose (default FALSE) indicates whether progress information is shown during computations.
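A hedged usage sketch of these arguments (nsamp is kept small here only to reduce run time):
data("Nclus")
cl3 <- kcca(Nclus, k=3)
slsw.cl3 <- slswFlexclust(Nclus, cl3, nsamp=20, seed=123,
                          multicore=FALSE, verbose=TRUE)
summary(slsw.cl3)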
Plotting as well as printing and summary methods are implemented for objects of class "resampleFlexclust". In addition to a standard plot method, methods for densityplot and boxplot are provided.
For more details see Dolnicar and Leisch (2017) and Dolnicar et al. (2018).
Value
An object of class "resampleFlexclust"
.
Author(s)
Friedrich Leisch
References
Dolnicar S. and Leisch F. (2017) "Using Segment Level Stability to Select Target Segments in Data-Driven Market Segmentation Studies" Marketing Letters, 28 (3), pp. 423–436.
Dolnicar S., Gruen B., and Leisch F. (2018) Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful. Springer Singapore.
Examples
data("Nclus")
cl3 <- kcca(Nclus, k = 3)
slsw.cl3 <- slswFlexclust(Nclus, cl3, nsamp = 20)
plot(Nclus, col = clusters(cl3))
plot(slsw.cl3)
densityplot(slsw.cl3)
boxplot(slsw.cl3)
Run Flexclust Algorithms Repeatedly
Description
Runs clustering algorithms repeatedly for different numbers of clusters and returns the minimum within-cluster distance solution for each.
Usage
stepFlexclust(x, k, nrep=3, verbose=TRUE, FUN = kcca, drop=TRUE,
group=NULL, simple=FALSE, save.data=FALSE, seed=NULL,
multicore=TRUE, ...)
stepcclust(...)
## S4 method for signature 'stepFlexclust,missing'
plot(x, y,
type=c("barplot", "lines"), totaldist=NULL,
xlab=NULL, ylab=NULL, ...)
## S4 method for signature 'stepFlexclust'
getModel(object, which=1)
Arguments
x , ... |
|
k |
A vector of integers passed in turn to the k argument of FUN. |
nrep |
For each value of k the algorithm is run nrep times and only the solution with minimum within-cluster distance is kept. |
FUN |
Cluster function to use, typically |
verbose |
If TRUE, show progress information during computations. |
drop |
If TRUE and k is of length 1, a single cluster object is returned instead of a "stepFlexclust" object. |
group |
An optional grouping vector for the data, see
|
simple |
Return objects of class "kccasimple" instead of "kcca"? |
save.data |
Save a copy of x in the returned object? |
seed |
If not NULL, used to set the random seed before the repeated runs. |
multicore |
If TRUE, the repeated runs are performed in parallel. |
y |
Not used. |
type |
Create a barplot or lines plot. |
totaldist |
Include value for 1-cluster solution in plot? Default
is |
xlab , ylab |
Graphical parameters. |
object |
Object of class "stepFlexclust". |
which |
Number of model to get. If character, interpreted as number of clusters. |
Details
stepcclust is a simple wrapper for stepFlexclust(..., FUN=cclust).
Author(s)
Friedrich Leisch
Examples
data("Nclus")
plot(Nclus)
## multicore off for CRAN checks
cl1 <- stepFlexclust(Nclus, k=2:7, FUN=cclust, multicore=FALSE)
cl1
plot(cl1)
# two ways to do the same:
getModel(cl1, 4)
cl1[[4]]
opar <- par("mfrow")
par(mfrow=c(2, 2))
for(k in 3:6){
image(getModel(cl1, as.character(k)), data=Nclus)
title(main=paste(k, "clusters"))
}
par(opar)
Stripes Plot
Description
Plot distance of data points to cluster centroids using stripes.
Usage
stripes(object, groups=NULL, type=c("first", "second", "all"),
beside=(type!="first"), col=NULL, gp.line=NULL, gp.bar=NULL,
gp.bar2=NULL, number=TRUE, legend=!is.null(groups),
ylim=NULL, ylab="distance from centroid",
margins=c(2,5,3,2), ...)
Arguments
object |
An object of class "kcca". |
groups |
Grouping variable to color-code the stripes. By default cluster membership is used as groups. |
type |
Plot distance to closest, closest and second-closest or to all centroids? |
beside |
Logical, make different stripes for different clusters? |
col |
Vector of colors for clusters or groups. |
gp.line , gp.bar , gp.bar2 |
Graphical parameters for horizontal lines and background rectangular areas, see gpar. |
number |
Logical, write cluster numbers on x-axis? |
legend |
Logical, plot a legend for the groups? |
ylim , ylab |
Graphical parameters for y-axis. |
margins |
Margin of the plot. |
... |
Further graphical parameters. |
Details
A simple, yet very effective plot for visualizing the distance of each point from its closest and second-closest cluster centroids is a stripes plot. For each of the k clusters we have a rectangular area, which we optionally divide vertically into k smaller rectangles (beside=TRUE). Then we draw a horizontal line segment for each data point, marking the distance of the data point from the corresponding centroid.
Author(s)
Friedrich Leisch
References
Friedrich Leisch. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing, 20(4), 457–469, 2010.
Examples
bw05 <- bundestag(2005)
bavaria <- bundestag(2005, state="Bayern")
set.seed(1)
c4 <- cclust(bw05, k=4, save.data=TRUE)
plot(c4)
stripes(c4)
stripes(c4, beside=TRUE)
stripes(c4, type="sec")
stripes(c4, type="sec", beside=FALSE)
stripes(c4, type="all")
stripes(c4, groups=bavaria)
## ugly, but shows how colors of all parts can be changed
library("grid")
stripes(c4, type="all",
gp.bar=gpar(col="red", lwd=3, fill="white"),
gp.bar2=gpar(col="green", lwd=3, fill="black"))
Vacation Motives of Australians
Description
In 2006 a sample of 1000 respondents representative of the adult Australian population was asked about their environmental behaviour when on vacation. In addition, the survey included a list of statements about vacation motives such as "I want to rest and relax," "I use my holiday for the health and beauty of my body," and "Cultural offers and sights are a crucial factor." Answers are binary ("applies", "does not apply").
Usage
data(vacmot)
Format
Data frame vacmot has 1000 observations on 20 binary variables on travel motives. Data frame vacmotdesc has 1000 observations on sociodemographic descriptor variables, mean moral obligation to protect the environment score, mean NEP score, and mean environmental behaviour score; see Dolnicar & Leisch (2008) for details.
In addition, the integer vector vacmot6 contains the 6-cluster partition presented in Dolnicar & Leisch (2008).
Source
The data set was collected by the Institute for Innovation in Business and Social Research, University of Wollongong (NSW, Australia).
References
Sara Dolnicar and Friedrich Leisch. An investigation of tourists' patterns of obligation to protect the environment. Journal of Travel Research, 46:381-391, 2008.
Sara Dolnicar and Friedrich Leisch. Using graphical statistics to better understand market segmentation solutions. International Journal of Market Research, 56(2):97-120, 2014.
Examples
data(vacmot)
summary(vacmotdesc)
dotchart(sort(colMeans(vacmot)))
## reproduce Figure 6 from Dolnicar & Leisch (2008)
cl6 <- kcca(vacmot, k=vacmot6, control=list(iter=0))
barchart(cl6)
Motivation of Australian Volunteers
Description
Part of an Australian survey on the motivation of volunteers to work for non-profit organisations like Red Cross, State Emergency Service, Rural Fire Service, Surf Life Saving, Rotary, Parents and Citizens Associations, etc.
Usage
data(volunteers)
Format
A data frame with 1415 observations on the following 21 variables: age and gender of respondents plus 19 binary motivation items (1 = applies, 0 = does not apply).
GENDER
Gender of respondent.
AGEG
Age group, a factor with categorized age of respondents.
meet.people
I can meet different types of people.
no.one.else
There is no-one else to do the work.
example
It sets a good example for others.
socialise
I can socialise with people who are like me.
help.others
It gives me the chance to help others.
give.back
I can give something back to society.
career
It will help my career prospects.
lonely
It makes me feel less lonely.
active
It keeps me active.
community
It will improve my community.
cause
I can support an important cause.
faith
I can put faith into action.
services
I want to maintain services that I may use one day.
children
My children are involved with the organisation.
good.job
I feel like I am doing a good job.
benefited
I know someone who has benefited from the organisation.
network
I can build a network of contacts.
recognition
I can gain recognition within the community.
mind.off
It takes my mind off other things.
Source
The volunteering data was collected by the Institute for Innovation in Business and Social Research, University of Wollongong (NSW, Australia), using funding from Bushcare Wollongong and the Australian Research Council under the ARC Linkage Grant scheme (LP0453682).
References
Melanie Randle and Sara Dolnicar. Not Just Any Volunteers: Segmenting the Market to Attract the High-Contributors. Journal of Non-profit and Public Sector Marketing, 21(3), 271-282, 2009.
Melanie Randle and Sara Dolnicar. Self-congruity and volunteering: A multi-organisation comparison. European Journal of Marketing, 45(5), 739-758, 2011.
Melanie Randle, Friedrich Leisch, and Sara Dolnicar. Competition or collaboration? The effect of non-profit brand image on volunteer recruitment strategy. Journal of Brand Management, 20(8):689-704, 2013.