Statistics Toolbox
Grouping variables are utility variables used to indicate which elements in a data set are to be considered together when computing statistics and creating visualizations. They may be numeric vectors, string arrays, cell arrays of strings, or categorical arrays.
Grouping variables have the same length as the variables (columns) in a data set. Observations (rows) i and j are considered to be in the same group if the values of the corresponding grouping variable are identical at those indices. Grouping variables with multiple columns are used to specify different groups within multiple variables.
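To make the grouping rule concrete, here is a minimal sketch in base MATLAB using made-up data. The names x, g, names, idx, and groupMeans are illustrative only and are not part of the toolbox. Each distinct label is mapped to an integer index, so two rows share a group exactly when their indices match, and the data are then averaged within each group:

```matlab
% Toy data (hypothetical): five observations and their group labels
x = [2 10 4 20 6]';
g = {'a'; 'b'; 'a'; 'b'; 'a'};

% idx(i) == idx(j) exactly when g{i} and g{j} are the same string
[names, firstIdx, idx] = unique(g);   % firstIdx is not used here

% Mean of x within each group, in the order given by names
groupMeans = accumarray(idx, x, [], @mean);
```

With this toy data, names is {'a';'b'} and groupMeans is [4; 15]. The toolbox functions for grouped data take care of this kind of bookkeeping for you.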
For example, the following loads the 150-by-4 numerical array meas and the 150-by-1 cell array of strings species into the workspace:
load fisheriris % Fisher's iris data (1936)
The data are 150 observations of four measured variables (by column number: sepal length, sepal width, petal length, and petal width, respectively) over three species of iris (setosa, versicolor, and virginica). To group the observations by species, the following are all acceptable (and equivalent) grouping variables:
group1 = species; % Cell array of strings
group2 = grp2idx(species) % Numeric vector
group3 = char(species); % Character array
group4 = nominal(species); % Categorical array
These grouping variables can be supplied as input arguments to any of the functions described in Functions for Grouped Data. Examples are given in Using Grouping Variables.
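As a rough check of that equivalence, the sketch below passes three of the forms to grpstats and compares the resulting group means. The variable names m1, m3, m4, and sameMeans are illustrative. Note that the comparison relies on the groups coming out in the same order for every form, which holds for the iris data because the order of first appearance and the alphabetical order of the species names coincide; for other data sets the rows may be ordered differently.

```matlab
load fisheriris                         % provides meas and species

m1 = grpstats(meas, species);           % cell array of strings
m3 = grpstats(meas, char(species));     % character array
m4 = grpstats(meas, nominal(species));  % categorical (nominal) array

% True when all three calls return the same 3-by-4 matrix of group means
sameMeans = isequal(m1, m3, m4);
```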
The following table lists functions in Statistics Toolbox that accept a grouping variable group as an input argument. The grouping variable may be in the form of a vector, string array, cell array of strings, or categorical array, as described in Grouping Variables.
For a full description of the syntax of any particular function, and examples of its use, consult its reference page, linked from the table. Using Grouping Variables also includes examples.
| Function | Basic Syntax for Grouped Data |
|---|---|
| andrewsplot | andrewsplot(X, ... ,'Group',group) |
| anova1 | p = anova1(X,group) |
| anovan | p = anovan(x,group) |
| aoctool | aoctool(x,y,group) |
| boxplot | boxplot(x,group) |
| classify | class = classify(sample,training,group) |
| controlchart | controlchart(x,group) |
| crosstab | crosstab(group1,group2) |
| dummyvar | D = dummyvar(group) |
| gagerr | gagerr(x,group) |
| gplotmatrix | gplotmatrix(x,y,group) |
| grp2idx | [G,GN] = grp2idx(group) |
| grpstats | means = grpstats(X,group) |
| gscatter | gscatter(x,y,group) |
| interactionplot | interactionplot(X,group) |
| kruskalwallis | p = kruskalwallis(X,group) |
| maineffectsplot | maineffectsplot(X,group) |
| manova1 | d = manova1(X,group) |
| multivarichart | multivarichart(x,group) |
| parallelcoords | parallelcoords(X, ... ,'Group',group) |
| silhouette | silhouette(X,group) |
| tabulate | tabulate(group) |
| treefit | T = treefit(X,y,'cost',S) or T = treefit(X,y,'priorprob',S), where S.group = group |
| vartestn | vartestn(X,group) |
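As a brief illustration of how two of the functions in the table take a grouping variable, the following sketch uses the iris data with species as the group. The variable names sepalLength, p, and counts are illustrative, and the 'off' argument to anova1 is used here only to suppress the ANOVA table and box plot figures.

```matlab
load fisheriris                       % provides meas and species
sepalLength = meas(:,1);              % first measured variable

% One-way ANOVA of sepal length across the three species, no figures
p = anova1(sepalLength, species, 'off');

% Frequency table of the grouping variable: value, count, percent
counts = tabulate(species);
```

A small p suggests that mean sepal length differs between at least two species, and the second column of counts holds the number of observations in each group.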
This section provides an example demonstrating the use of grouping variables and associated functions. Grouping variables are introduced in Grouping Variables. A list of functions accepting grouping variables as input arguments is given in Functions for Grouped Data.
Load the 150-by-4 numerical array meas and the 150-by-1 cell array of strings species:
load fisheriris % Fisher's iris data (1936)
The data are 150 observations of four measured variables (by column number: sepal length, sepal width, petal length, and petal width, respectively) over three species of iris (setosa, versicolor, and virginica).
Create a categorical array (see Categorical Arrays) from species to use as a grouping variable:
group = nominal(species);
While species, as a cell array of strings, is itself a grouping variable, the categorical array has the advantage that it can be easily manipulated with categorical methods. (See Categorical Array Operations.)
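A minimal sketch of such manipulation follows, assuming the getlabels, levelcounts, and mergelevels methods of the categorical classes in this toolbox release. The variable names labels, counts, and merged, and the level name 'other', are illustrative.

```matlab
load fisheriris
group = nominal(species);

labels = getlabels(group);     % names of the levels in group
counts = levelcounts(group);   % number of observations in each level

% Combine two species into a single level named 'other'
merged = mergelevels(group, {'versicolor','virginica'}, 'other');
```

The same operations on a cell array of strings would need explicit string comparisons and loops or index arithmetic.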
Compute some basic statistics for the data (median and interquartile range), by group, using the grpstats function:
[order,number,group_median,group_iqr] = ...
grpstats(meas,group,{'gname','numel',@median,@iqr})
order =
'setosa'
'versicolor'
'virginica'
number =
50 50 50 50
50 50 50 50
50 50 50 50
group_median =
5.0000 3.4000 1.5000 0.2000
5.9000 2.8000 4.3500 1.3000
6.5000 3.0000 5.5500 2.0000
group_iqr =
0.4000 0.5000 0.2000 0.1000
0.7000 0.5000 0.6000 0.3000
0.7000 0.4000 0.8000 0.5000
The statistics appear in 3-by-4 arrays, corresponding to the 3 groups (categories) and 4 variables in the data. The order of the groups (not encoded in the nominal array group) is indicated by the group names in order.
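One way to see what the by-group computation does is to reproduce a single row by hand. The sketch below selects the setosa observations with a logical index and applies median and iqr directly; the names isSetosa, setosaMedian, and setosaIqr are illustrative.

```matlab
load fisheriris

% Logical index of the rows belonging to one group
isSetosa = strcmp(species, 'setosa');

% Column-wise statistics over that group's rows only
setosaMedian = median(meas(isSetosa,:));
setosaIqr    = iqr(meas(isSetosa,:));
```

These two row vectors should correspond to the setosa rows of group_median and group_iqr.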
To improve the labeling of the data, create a dataset array (see Dataset Arrays) from meas:
NumObs = size(meas,1);
ObsNames = strcat({'Obs'},num2str((1:NumObs)','%d'));
iris = dataset({group,'species'},...
{meas,'SL','SW','PL','PW'},...
'obsnames',ObsNames);
When you call grpstats with a dataset array as an argument, you invoke the grpstats method of the dataset class, grpstats (dataset), rather than the regular grpstats function. The method has a slightly different syntax than the regular grpstats function, but it returns the same results, with better labeling:
stats = grpstats(iris,'species',{@median,@iqr})
stats =
species GroupCount
setosa setosa 50
versicolor versicolor 50
virginica virginica 50
median_SL iqr_SL
setosa 5 0.4
versicolor 5.9 0.7
virginica 6.5 0.7
median_SW iqr_SW
setosa 3.4 0.5
versicolor 2.8 0.5
virginica 3 0.4
median_PL iqr_PL
setosa 1.5 0.2
versicolor 4.35 0.6
virginica 5.55 0.8
median_PW iqr_PW
setosa 0.2 0.1
versicolor 1.3 0.3
virginica 2 0.5
Grouping variables are also used to create visualizations based on categories of observations. The following scatter plot, created with the gscatter function, shows the correlation between sepal length and sepal width in two species of iris. The ismember function is used to subset the two species from group:
subset = ismember(group,{'setosa','versicolor'});
scattergroup = group(subset);
gscatter(iris.SL(subset),...
iris.SW(subset),...
scattergroup)
xlabel('Sepal Length')
ylabel('Sepal Width')

© 1984-2007 The MathWorks, Inc.