## Grouped Data

### Grouping Variables

Grouping variables are utility variables used to indicate which elements in a data set are to be considered together when computing statistics and creating visualizations. They may be numeric vectors, string arrays, cell arrays of strings, or categorical arrays.

Grouping variables have the same length as the variables (columns) in a data set. Observations (rows) i and j are considered to be in the same group if the values of the corresponding grouping variable are identical at those indices. Grouping variables with multiple columns are used to specify different groups within multiple variables.
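
As a minimal sketch of this rule, the toy vectors below (hypothetical values, not part of any shipped data set) use grp2idx to convert a cell array of strings into integer group indices. Rows that share an index belong to the same group. The variable names x, g, idx, names, and groupMeans are illustrative only.

```matlab
% Toy data: five observations and one label per observation.
% (Hypothetical values chosen for illustration.)
x = [10; 20; 30; 40; 50];
g = {'a'; 'b'; 'a'; 'c'; 'b'};

% grp2idx maps each distinct label to an integer index and returns the names.
[idx, names] = grp2idx(g);

% Rows i and j are in the same group when their grouping values are identical.
sameGroup13 = isequal(g{1}, g{3});   % both 'a'
sameGroup12 = isequal(g{1}, g{2});   % 'a' versus 'b'

% Per-group means computed from the integer indices.
groupMeans = accumarray(idx, x, [], @mean);
```

Here rows 1 and 3 fall in one group, rows 2 and 5 in another, and row 4 in a group of its own, so groupMeans holds one mean per label.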

For example, the following loads the 150-by-4 numerical array meas and the 150-by-1 cell array of strings species into the workspace:

`load fisheriris % Fisher's iris data (1936)`

The data are 150 observations of four measured variables (by column number: sepal length, sepal width, petal length, and petal width, respectively) over three species of iris (setosa, versicolor, and virginica). To group the observations by species, the following are all acceptable (and equivalent) grouping variables:

```
group1 = species;           % Cell array of strings
group2 = grp2idx(species);  % Numeric vector
group3 = char(species);     % Character array
group4 = nominal(species);  % Categorical array
```

These grouping variables can be supplied as input arguments to any of the functions described in Functions for Grouped Data. Examples are given in Using Grouping Variables.
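
One way to check that the four forms describe the same grouping is to compare the integer indices that grp2idx returns for each. This is a sketch, assuming a Statistics Toolbox release that provides nominal. The names idx1 through idx4 and allEqual are illustrative.

```matlab
load fisheriris                       % provides meas and species

idx1 = grp2idx(species);              % from the cell array of strings
idx2 = grp2idx(grp2idx(species));     % from the numeric vector
idx3 = grp2idx(char(species));        % from the character array
idx4 = grp2idx(nominal(species));     % from the categorical array

% If the forms are equivalent, this comparison should return true.
allEqual = isequal(idx1, idx2, idx3, idx4);
```

In this data set the species happen to appear in alphabetical order, so the index numbering is expected to agree across all four forms. For other data the numbering may differ between forms even though the partition of the observations is the same.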

### Functions for Grouped Data

The following table lists functions in Statistics Toolbox that accept a grouping variable group as an input argument. The grouping variable may be in the form of a vector, string array, cell array of strings, or categorical array, as described in Grouping Variables.

For a full description of the syntax of any particular function, and examples of its use, consult its reference page, linked from the table. Using Grouping Variables also includes examples.

| Function | Basic Syntax for Grouped Data |
| --- | --- |
| andrewsplot | andrewsplot(X, ... ,'Group',group) |
| anova1 | p = anova1(X,group) |
| anovan | p = anovan(x,group) |
| aoctool | aoctool(x,y,group) |
| boxplot | boxplot(x,group) |
| classify | class = classify(sample,training,group) |
| controlchart | controlchart(x,group) |
| crosstab | crosstab(group1,group2) |
| dummyvar | D = dummyvar(group) |
| gagerr | gagerr(x,group) |
| gplotmatrix | gplotmatrix(x,y,group) |
| grp2idx | [G,GN] = grp2idx(group) |
| grpstats | means = grpstats(X,group) |
| gscatter | gscatter(x,y,group) |
| interactionplot | interactionplot(X,group) |
| kruskalwallis | p = kruskalwallis(X,group) |
| maineffectsplot | maineffectsplot(X,group) |
| manova1 | d = manova1(X,group) |
| multivarichart | multivarichart(x,group) |
| parallelcoords | parallelcoords(X, ... ,'Group',group) |
| silhouette | silhouette(X,group) |
| tabulate | tabulate(group) |
| treefit | T = treefit(X,y,'cost',S) or T = treefit(X,y,'priorprob',S), where S.group = group |
| vartestn | vartestn(X,group) |
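
As a brief illustration of the calling pattern, the sketch below passes the iris species labels to three of the listed functions. It assumes the fisheriris data set that ships with the toolbox. The names idx, D, and means are illustrative.

```matlab
load fisheriris                 % provides meas and species

tabulate(species)               % frequency table of the group labels

% Indicator (dummy) variables, one column per group.
idx = grp2idx(species);         % integer group indices
D = dummyvar(idx);              % matrix of zeros and ones

% Group means of the four measurements, one row per species.
means = grpstats(meas, species);
```

Each column of D marks membership in one species, so the column sums give the group sizes.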

### Using Grouping Variables

This section provides an example demonstrating the use of grouping variables and associated functions. Grouping variables are introduced in Grouping Variables. A list of functions accepting grouping variables as input arguments is given in Functions for Grouped Data.

Load the 150-by-4 numerical array meas and the 150-by-1 cell array of strings species:

`load fisheriris % Fisher's iris data (1936)`

The data are 150 observations of four measured variables (by column number: sepal length, sepal width, petal length, and petal width, respectively) over three species of iris (setosa, versicolor, and virginica).

Create a categorical array (see Categorical Arrays) from species to use as a grouping variable:

`group = nominal(species);`

While species, as a cell array of strings, is itself a grouping variable, the categorical array has the advantage that it can be easily manipulated with categorical methods. (See Categorical Array Operations.)

Compute some basic statistics for the data (median and interquartile range), by group, using the grpstats function:

```
[order,number,group_median,group_iqr] = ...
    grpstats(meas,group,{'gname','numel',@median,@iqr})
order =
    'setosa'
    'versicolor'
    'virginica'
number =
    50    50    50    50
    50    50    50    50
    50    50    50    50
group_median =
    5.0000    3.4000    1.5000    0.2000
    5.9000    2.8000    4.3500    1.3000
    6.5000    3.0000    5.5500    2.0000
group_iqr =
    0.4000    0.5000    0.2000    0.1000
    0.7000    0.5000    0.6000    0.3000
    0.7000    0.4000    0.8000    0.5000
```

The statistics appear in 3-by-4 arrays, corresponding to the 3 groups (categories) and 4 variables in the data. The order of the groups (not encoded in the nominal array group) is indicated by the group names in order.
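
To see what a single row of these arrays represents, the following sketch recomputes the setosa statistics by hand with logical indexing. It uses the species cell array directly. The names isSetosa, setosaMedian, and setosaCount are illustrative.

```matlab
load fisheriris                              % provides meas and species

isSetosa = strcmp(species, 'setosa');        % logical index of one group
setosaMedian = median(meas(isSetosa, :));    % 1-by-4, one value per variable
setosaCount = nnz(isSetosa);                 % number of observations in the group
```

The values in setosaMedian should match the first row of group_median above, and setosaCount should match the entries in the first row of number.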

To improve the labeling of the data, create a dataset array (see Dataset Arrays) from meas:

```
NumObs = size(meas,1);
ObsNames = strcat({'Obs'},num2str((1:NumObs)','%d'));
iris = dataset({group,'species'},...
               {meas,'SL','SW','PL','PW'},...
               'obsnames',ObsNames);
```

When you call grpstats with a dataset array as an argument, you invoke the grpstats method of the dataset class, grpstats (dataset), rather than the regular grpstats function. The method has a slightly different syntax than the regular grpstats function, but it returns the same results, with better labeling:

```
stats = grpstats(iris,'species',{@median,@iqr})
stats =
                  species       GroupCount
    setosa        setosa        50
    versicolor    versicolor    50
    virginica     virginica     50

                  median_SL    iqr_SL
    setosa          5          0.4
    versicolor    5.9          0.7
    virginica     6.5          0.7

                  median_SW    iqr_SW
    setosa        3.4          0.5
    versicolor    2.8          0.5
    virginica       3          0.4

                  median_PL    iqr_PL
    setosa         1.5         0.2
    versicolor    4.35         0.6
    virginica     5.55         0.8

                  median_PW    iqr_PW
    setosa        0.2          0.1
    versicolor    1.3          0.3
    virginica       2          0.5
```
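
Because the result is itself a dataset array, individual statistics can be pulled out by variable name. The sketch below rebuilds the array without observation names for brevity and assumes the variable naming shown in the output above (median_SL, GroupCount). The names medSL and counts are illustrative.

```matlab
load fisheriris                                   % provides meas and species
group = nominal(species);
iris = dataset({group,'species'}, ...
               {meas,'SL','SW','PL','PW'});       % same variables as above
stats = grpstats(iris, 'species', {@median,@iqr});

medSL = stats.median_SL;     % median sepal length, one entry per species
counts = stats.GroupCount;   % number of observations in each group
```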

Grouping variables are also used to create visualizations based on categories of observations. The following scatter plot, created with the gscatter function, plots sepal width against sepal length for two species of iris, so that the relationship between the two measurements can be compared across species. The ismember function selects the observations in group that belong to those two species:

```
subset = ismember(group,{'setosa','versicolor'});
scattergroup = group(subset);
gscatter(iris.SL(subset),...
         iris.SW(subset),...
         scattergroup)
xlabel('Sepal Length')
ylabel('Sepal Width')
```
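
The same kind of plot can be sketched without the nominal and dataset arrays by using the species cell array of strings as the grouping variable and indexing meas by column. This variant is an illustration rather than the method used above, and nSelected is an illustrative name.

```matlab
load fisheriris                                        % provides meas and species

% Logical index of the observations in the two species of interest.
subset = ismember(species, {'setosa','versicolor'});
nSelected = nnz(subset);                               % observations kept

% Column 1 is sepal length and column 2 is sepal width.
gscatter(meas(subset,1), meas(subset,2), species(subset))
xlabel('Sepal Length')
ylabel('Sepal Width')
```

With 50 observations per species, nSelected is 100.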