Grouped Data

Grouping Variables

Grouping variables are utility variables used to indicate which elements in a data set are to be considered together when computing statistics and creating visualizations. They may be numeric vectors, string arrays, cell arrays of strings, or categorical arrays.

Grouping variables have the same length as the variables (columns) in a data set. Observations (rows) i and j are considered to be in the same group if the values of the corresponding grouping variable are identical at those indices. Grouping variables with multiple columns are used to specify different groups within multiple variables.
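As a minimal sketch of this rule, the following uses a small made-up data set (the names x and g and their values are illustrative only, not part of the iris data). Rows with identical labels in g fall into the same group:

```matlab
% Hypothetical toy data: five observations of a single variable
x = [2.1; 3.4; 1.9; 3.6; 2.0];

% Grouping variable with one entry per observation (row)
g = {'a'; 'b'; 'a'; 'b'; 'a'};

% Rows 1, 3, and 5 share the label 'a', so they form one group;
% rows 2 and 4 share the label 'b' and form the other
[idx,names] = grp2idx(g);     % integer group index per row, plus group names

% One mean per group, computed over the rows that share a label
groupMeans = grpstats(x,g);
```

Here idx assigns the same integer to every row that carries the same label, and groupMeans contains one entry per distinct label.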

For example, the following loads the 150-by-4 numerical array meas and the 150-by-1 cell array of strings species into the workspace:

 load fisheriris % Fisher's iris data (1936)

The data are 150 observations of four measured variables (by column number: sepal length, sepal width, petal length, and petal width, respectively) over three species of iris (setosa, versicolor, and virginica). To group the observations by species, the following are all acceptable (and equivalent) grouping variables:

group1 = species; % Cell array of strings
group2 = grp2idx(species); % Numeric vector
group3 = char(species); % Character array
group4 = nominal(species); % Categorical array

These grouping variables can be supplied as input arguments to any of the functions described in Functions for Grouped Data. Examples are given in Using Grouping Variables.
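One way to check that the four forms define the same grouping is to convert each to integer group indices with grp2idx and compare the results. This is only a sketch: the names idx1 through idx4 are illustrative. The comparison is expected to succeed for this data set because the species appear in the data in the same order as their sorted labels; for other data, grp2idx may number the groups differently depending on the type of the grouping variable.

```matlab
load fisheriris                            % provides meas and species

idx1 = grp2idx(species);                   % from the cell array of strings
idx2 = grp2idx(grp2idx(species));          % from the numeric vector
idx3 = grp2idx(char(species));             % from the character array
idx4 = grp2idx(nominal(species));          % from the categorical array

% True when all four assign every observation to the same group number
sameGrouping = isequal(idx1,idx2,idx3,idx4);
```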

Functions for Grouped Data

The following table lists functions in Statistics Toolbox that accept a grouping variable group as an input argument. The grouping variable may be in the form of a vector, string array, cell array of strings, or categorical array, as described in Grouping Variables.

For a full description of the syntax of any particular function, and examples of its use, consult its reference page, linked from the table. Using Grouping Variables also includes examples.

Function          Basic Syntax for Grouped Data
andrewsplot       andrewsplot(X, ... ,'Group',group)
anova1            p = anova1(X,group)
anovan            p = anovan(x,group)
aoctool           aoctool(x,y,group)
boxplot           boxplot(x,group)
classify          class = classify(sample,training,group)
controlchart      controlchart(x,group)
crosstab          crosstab(group1,group2)
dummyvar          D = dummyvar(group)
gagerr            gagerr(x,group)
gplotmatrix       gplotmatrix(x,y,group)
grp2idx           [G,GN] = grp2idx(group)
grpstats          means = grpstats(X,group)
gscatter          gscatter(x,y,group)
interactionplot   interactionplot(X,group)
kruskalwallis     p = kruskalwallis(X,group)
maineffectsplot   maineffectsplot(X,group)
manova1           d = manova1(X,group)
multivarichart    multivarichart(x,group)
parallelcoords    parallelcoords(X, ... ,'Group',group)
silhouette        silhouette(X,group)
tabulate          tabulate(group)
treefit           T = treefit(X,y,'cost',S) or
                  T = treefit(X,y,'priorprob',S), where S.group = group
vartestn          vartestn(X,group)
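As a brief illustration of two entries in the table, the sketch below passes grouping variables to tabulate and crosstab. The second grouping variable, longSepal, and its 5.8 threshold are hypothetical choices made here for illustration and do not come from the toolbox documentation.

```matlab
load fisheriris                          % provides meas and species

% Frequency table: count and percentage of observations in each species
tabulate(species)

% Hypothetical second grouping variable: 1 if sepal length exceeds 5.8, else 0
longSepal = double(meas(:,1) > 5.8);

% Cross-tabulation: rows correspond to species, columns to values of longSepal
counts = crosstab(species,longSepal);
```

Each row of counts sums to the number of observations in that species, so the whole table sums to 150.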

Using Grouping Variables

This section provides an example demonstrating the use of grouping variables and associated functions. Grouping variables are introduced in Grouping Variables. A list of functions accepting grouping variables as input arguments is given in Functions for Grouped Data.

Load the 150-by-4 numerical array meas and the 150-by-1 cell array of strings species:

 load fisheriris % Fisher's iris data (1936)

The data are 150 observations of four measured variables (by column number: sepal length, sepal width, petal length, and petal width, respectively) over three species of iris (setosa, versicolor, and virginica).

Create a categorical array (see Categorical Arrays) from species to use as a grouping variable:

 group = nominal(species);

While species, as a cell array of strings, is itself a grouping variable, the categorical array has the advantage that it can be easily manipulated with categorical methods. (See Categorical Array Operations.)
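For instance, a few categorical methods can query the levels of the array directly. The sketch below assumes the getlabels and levelcounts methods of the nominal class are available in your release; the variable names labels and counts are illustrative.

```matlab
load fisheriris
group = nominal(species);

labels = getlabels(group);     % names of the levels in the categorical array
counts = levelcounts(group);   % number of observations at each level
```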

Compute some basic statistics for the data (median and interquartile range), by group, using the grpstats function:

[order,number,group_median,group_iqr] = ...
grpstats(meas,group,{'gname','numel',@median,@iqr})
order = 
    'setosa'
    'versicolor'
    'virginica'
number =
    50    50    50    50
    50    50    50    50
    50    50    50    50
group_median =
    5.0000    3.4000    1.5000    0.2000
    5.9000    2.8000    4.3500    1.3000
    6.5000    3.0000    5.5500    2.0000
group_iqr =
    0.4000    0.5000    0.2000    0.1000
    0.7000    0.5000    0.6000    0.3000
    0.7000    0.4000    0.8000    0.5000

The statistics appear in 3-by-4 arrays, corresponding to the 3 groups (categories) and 4 variables in the data. The order of the groups (not encoded in the nominal array group) is indicated by the group names in order.
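To see what grpstats computes for one group, a single row of group_median can be reproduced by hand with logical indexing. This is a sketch; isSetosa and setosaMedian are illustrative names.

```matlab
load fisheriris

% Logical index selecting the rows (observations) in the setosa group
isSetosa = strcmp(species,'setosa');

% Column-wise median over those rows only: a 1-by-4 vector that should
% match the first row of group_median above
setosaMedian = median(meas(isSetosa,:));
```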

To improve the labeling of the data, create a dataset array (see Dataset Arrays) from meas:

NumObs = size(meas,1);
ObsNames = strcat({'Obs'},num2str((1:NumObs)','%d'));
iris = dataset({group,'species'},...
               {meas,'SL','SW','PL','PW'},...
               'obsnames',ObsNames);

When you call grpstats with a dataset array as an argument, you invoke the grpstats method of the dataset class, grpstats (dataset), rather than the regular grpstats function. The method has a slightly different syntax than the regular grpstats function, but it returns the same results, with better labeling:

stats = grpstats(iris,'species',{@median,@iqr})
stats = 
                  species       GroupCount
    setosa        setosa        50        
    versicolor    versicolor    50        
    virginica     virginica     50        

                  median_SL    iqr_SL
    setosa          5          0.4   
    versicolor    5.9          0.7   
    virginica     6.5          0.7   

                  median_SW    iqr_SW
    setosa        3.4          0.5   
    versicolor    2.8          0.5   
    virginica       3          0.4   

                  median_PL    iqr_PL
    setosa         1.5         0.2   
    versicolor    4.35         0.6   
    virginica     5.55         0.8   

                  median_PW    iqr_PW
    setosa        0.2          0.1   
    versicolor    1.3          0.3   
    virginica       2          0.5 

Grouping variables are also used to create visualizations based on categories of observations. The following code uses the gscatter function to create a scatter plot of sepal width against sepal length for two species of iris, with the points marked by species. The ismember function selects the observations belonging to those two species from group:

subset = ismember(group,{'setosa','versicolor'});
scattergroup = group(subset);
gscatter(iris.SL(subset),...
         iris.SW(subset),...
         scattergroup)
xlabel('Sepal Length')
ylabel('Sepal Width')

  


© 1984-2007 The MathWorks, Inc.