An Introduction to Applied Multivariate Analysis with R

Use R!
Series Editors: Robert Gentleman, Kurt Hornik, Giovanni Parmigiani
For other titles published in this series, go to http://www.springer.com/series/6991

An Introduction to Applied Multivariate Analysis with R
Brian Everitt • Torsten Hothorn

Brian Everitt
Professor Emeritus
King's College
London, SE5 8AF, UK
brian.everitt@btopenworld.com

Torsten Hothorn
Institut für Statistik
Ludwig-Maximilians-Universität München
Ludwigstr. 33
80539 München, Germany
Torsten.Hothorn@stat.uni-muenchen.de

Series Editors:
Robert Gentleman
Program in Computational Biology, Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue, N. M2-B876, Seattle, Washington 98109, USA

Kurt Hornik
Department of Statistik and Mathematik
Wirtschaftsuniversität Wien
Augasse 2-6, A-1090 Wien, Austria

Giovanni Parmigiani
The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University
550 North Broadway, Baltimore, MD 21205-2011, USA

ISBN 978-1-4419-9649-7        e-ISBN 978-1-4419-9650-3
DOI 10.1007/978-1-4419-9650-3
Library of Congress Control Number: 2011926793
Springer New York Dordrecht Heidelberg London
Printed on acid-free paper. Springer is part of Springer Science+Business Media (www.springer.com)

© Springer Science+Business Media, LLC 2011. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

To our wives, Mary-Elizabeth and Carolin.

Preface

The majority of data sets collected by researchers in all disciplines are multivariate, meaning that several measurements, observations, or recordings are taken on each of the units in the data set. These units might be human subjects, archaeological artifacts, countries, or a vast variety of other things. In a few cases, it may be sensible to isolate each variable and study it separately, but in most instances all the variables need to be examined simultaneously in order to fully grasp the structure and key features of the data. For this purpose, one or another method of multivariate analysis might be helpful, and it is with such methods that this book is largely concerned. Multivariate analysis includes methods both for describing and exploring such data and for making formal inferences about them. The aim of all the techniques is, in a general sense, to display or extract the signal in the data in the presence of noise and to find out what the data show us in the midst of their apparent chaos.

The computations involved in applying most multivariate techniques are considerable, and their routine use requires a suitable software package. In addition, most analyses of multivariate data should involve the construction of appropriate graphs and diagrams, and this will also need to be carried out using the same package. R is a statistical computing environment that is powerful, flexible, and, in addition, has excellent graphical facilities. It is for these reasons that it is the use of R for multivariate analysis that is illustrated in this book.

In this book, we concentrate on what might be termed the "core" or "classical" multivariate methodology, although mention will be made of recent developments where these are considered relevant and useful.
But there is an area of multivariate statistics that we have omitted from this book, and that is multivariate analysis of variance (MANOVA) and related techniques such as Fisher's linear discriminant function (LDF). There are a variety of reasons for this omission. First, we are not convinced that MANOVA is now of much more than historical interest; researchers may occasionally pay lip service to using the technique, but in most cases it really is no more than this. They quickly move on to looking at the results for individual variables. And MANOVA for repeated measures has been largely superseded by the models that we shall describe in Chapter 8. Second, a classification technique such as LDF needs to be considered in the context of modern classification algorithms, and these cannot be covered in an introductory book such as this.

Some brief details of the theory behind each technique described are given, but the main concern of each chapter is the correct application of the methods so as to extract as much information as possible from the data at hand, particularly as some type of graphical representation, via the R software.

The book is aimed at students in applied statistics courses, both undergraduate and post-graduate, who have attended a good introductory course in statistics that covered hypothesis testing, confidence intervals, simple regression and correlation, analysis of variance, and basic maximum likelihood estimation. We also assume that readers will know some simple matrix algebra, including the manipulation of matrices and vectors and the concepts of the inverse and rank of a matrix. In addition, we assume that readers will have some familiarity with R at the level of, say, Dalgaard (2002).
In addition to such a student readership, we hope that many applied statisticians dealing with multivariate data will find something of interest in the eight chapters of our book.

Throughout the book, we give many examples of R code used to apply the multivariate techniques to multivariate data. Samples of code that could be entered interactively at the R command line are formatted as follows:

R> library("MVA")

Here, R> denotes the prompt sign from the R command line, and the user enters everything else. The symbol + indicates additional lines, which are appropriately indented. Finally, output produced by function calls is shown below the associated code:

R> rnorm(10)
 [1]  1.8808  0.2572 -0.3412  0.4081  0.4344  0.7003  1.8944
 [8]  0.8960 -0.2993 -0.7355

In this book, we use several R packages to access different example data sets (many of them contained in the package HSAUR2), standard functions for the general parametric analyses, and the MVA package to perform analyses. All of the packages used in this book are available at the Comprehensive R Archive Network (CRAN), which can be accessed from http://CRAN.R-project.org. The source code for the analyses presented in this book is available from the MVA package. A demo containing the R code to reproduce the individual results is available for each chapter by invoking

R> library("MVA")
R> demo("Ch-MVA") ### Introduction to Multivariate Analysis
R> demo("Ch-Viz") ### Visualization
R> demo("Ch-PCA") ### Principal Components Analysis
R> demo("Ch-EFA") ### Exploratory Factor Analysis
R> demo("Ch-MDS") ### Multidimensional Scaling
R> demo("Ch-CA")  ### Cluster Analysis
R> demo("Ch-SEM") ### Structural Equation Models
R> demo("Ch-LME") ### Linear Mixed-Effects Models

Thanks are due to Lisa Möst, BSc., for help with data processing and LaTeX typesetting, to the copy editor for many helpful corrections, and to John Kimmel for all his support and patience during the writing of the book.

January 2011
Brian S. Everitt, London
Torsten Hothorn, München

Contents

Preface ....... vii

1 Multivariate Data and Multivariate Analysis ....... 1
  1.1 Introduction ....... 1
  1.2 A brief history of the development of multivariate analysis ....... 3
  1.3 Types of variables and the possible problem of missing values ....... 4
    1.3.1 Missing values ....... 5
  1.4 Some multivariate data sets ....... 7
  1.5 Covariances, correlations, and distances ....... 12
    1.5.1 Covariances ....... 12
    1.5.2 Correlations ....... 14
    1.5.3 Distances ....... 14
  1.6 The multivariate normal density function ....... 15
  1.7 Summary ....... 23
  1.8 Exercises ....... 23

2 Looking at Multivariate Data: Visualisation ....... 25
  2.1 Introduction ....... 25
  2.2 The scatterplot ....... 26
    2.2.1 The bivariate boxplot ....... 28
    2.2.2 The convex hull of bivariate data ....... 32
    2.2.3 The chi-plot ....... 34
  2.3 The bubble and other glyph plots ....... 34
  2.4 The scatterplot matrix ....... 39
  2.5 Enhancing the scatterplot with estimated bivariate densities ....... 42
    2.5.1 Kernel density estimators ....... 42
  2.6 Three-dimensional plots ....... 47
  2.7 Trellis graphics ....... 50
  2.8 Stalactite plots ....... 53
  2.9 Summary ....... 56
  2.10 Exercises ....... 60

3 Principal Components Analysis ....... 61
  3.1 Introduction ....... 61
  3.2 Principal components analysis (PCA) ....... 61
  3.3 Finding the sample principal components ....... 63
  3.4 Should principal components be extracted from the covariance or the correlation matrix? ....... 65
  3.5 Principal components of bivariate data with correlation coefficient r ....... 68
  3.6 Rescaling the principal components ....... 70
  3.7 How the principal components predict the observed covariance matrix ....... 70
  3.8 Choosing the number of components ....... 71
  3.9 Calculating principal components scores ....... 72
  3.10 Some examples of the application of principal components analysis ....... 74
    3.10.1 Head lengths of first and second sons ....... 74
    3.10.2 Olympic heptathlon results ....... 78
    3.10.3 Air pollution in US cities ....... 86
  3.11 The biplot ....... 92
  3.12 Sample size for principal components analysis ....... 93
  3.13 Canonical correlation analysis ....... 94
    3.13.1 Head measurements ....... 96
    3.13.2 Health and personality ....... 99
  3.14 Summary ....... 101
  3.15 Exercises ....... 102

4 Multidimensional Scaling ....... 105
  4.1 Introduction ....... 105
  4.2 Models for proximity data ....... 105
  4.3 Spatial models for proximities: Multidimensional scaling ....... 106
  4.4 Classical multidimensional scaling ....... 106
    4.4.1 Classical multidimensional scaling: Technical details ....... 107
    4.4.2 Examples of classical multidimensional scaling ....... 110
  4.5 Non-metric multidimensional scaling ....... 121
    4.5.1 House of Representatives voting ....... 123
    4.5.2 Judgements of World War II leaders ....... 124
  4.6 Correspondence analysis ....... 127
    4.6.1 Teenage relationships ....... 130
  4.7 Summary ....... 131
  4.8 Exercises ....... 132

5 Exploratory Factor Analysis ....... 135
  5.1 Introduction ....... 135
  5.2 A simple example of a factor analysis model ....... 136
  5.3 The k-factor analysis model ....... 137
  5.4 Scale invariance of the k-factor model ....... 138
  5.5 Estimating the parameters in the k-factor analysis model ....... 139
    5.5.1 Principal factor analysis ....... 141
    5.5.2 Maximum likelihood factor analysis ....... 142
  5.6 Estimating the number of factors ....... 142
  5.7 Factor rotation ....... 143
  5.8 Estimating factor scores ....... 147
  5.9 Two examples of exploratory factor analysis ....... 148
    5.9.1 Expectations of life ....... 148
    5.9.2 Drug use by American college students ....... 151
  5.10 Factor analysis and principal components analysis compared ....... 157
  5.11 Summary ....... 159
  5.12 Exercises ....... 159

6 Cluster Analysis ....... 163
  6.1 Introduction ....... 163
  6.2 Cluster analysis ....... 165
  6.3 Agglomerative hierarchical clustering ....... 166
    6.3.1 Clustering jet fighters ....... 171
  6.4 K-means clustering ....... 175
    6.4.1 Clustering the states of the USA on the basis of their crime rate profiles ....... 176
    6.4.2 Clustering Romano-British pottery ....... 180
  6.5 Model-based clustering ....... 183
    6.5.1 Finite mixture densities ....... 186
    6.5.2 Maximum likelihood estimation in a finite mixture density with multivariate normal components ....... 187
  6.6 Displaying clustering solutions graphically ....... 191
  6.7 Summary ....... 197
  6.8 Exercises ....... 200

7 Confirmatory Factor Analysis and Structural Equation Models ....... 201
  7.1 Introduction ....... 201
  7.2 Estimation, identification, and assessing fit for confirmatory factor and structural equation models ....... 202
    7.2.1 Estimation ....... 202
    7.2.2 Identification ....... 203
    7.2.3 Assessing the fit of a model ....... 204
  7.3 Confirmatory factor analysis models ....... 206
    7.3.1 Ability and aspiration ....... 206
    7.3.2 A confirmatory factor analysis model for drug use ....... 211
  7.4 Structural equation models ....... 216
    7.4.1 Stability of alienation ....... 216
  7.5 Summary ....... 222
  7.6 Exercises ....... 223

8 The Analysis of Repeated Measures Data ....... 225
  8.1 Introduction ....... 225
  8.2 Linear mixed-effects models for repeated measures data ....... 232
    8.2.1 Random intercept and random intercept and slope models for the timber slippage data ....... 233
    8.2.2 Applying the random intercept and the random intercept and slope models to the timber slippage data ....... 235
    8.2.3 Fitting random-effect models to the glucose challenge data ....... 240
  8.3 Prediction of random effects ....... 247
  8.4 Dropouts in longitudinal data ....... 248
  8.5 Summary ....... 257
  8.6 Exercises ....... 257

References ....... 259
Index ....... 271

1 Multivariate Data and Multivariate Analysis

1.1 Introduction

Multivariate data arise when researchers record the values of several random variables on a number of subjects or objects or perhaps one of a variety of other things (we will use the general term "units") in which they are interested, leading to a vector-valued or multidimensional observation for each. Such data are collected in a wide range of disciplines, and indeed it is probably reasonable to claim that the majority of data sets met in practice are multivariate. In some studies, the variables are chosen by design because they are known to be essential descriptors of the system under investigation. In other studies, particularly those that have been difficult or expensive to organise, many variables may be measured simply to collect as much information as possible as a matter of expediency or economy.

Multivariate data are ubiquitous, as is illustrated by the following four examples:

- Psychologists and other behavioural scientists often record the values of several different cognitive variables on a number of subjects.
- Educational researchers may be interested in the examination marks obtained by students for a variety of different subjects.
- Archaeologists may make a set of measurements on artefacts of interest.
- Environmentalists might assess pollution levels of a set of cities along with noting other characteristics of the cities related to climate and human ecology.

Most multivariate data sets can be represented in the same way, namely in a rectangular format known from spreadsheets, in which the elements of each row correspond to the variable values of a particular unit in the data set and the elements of the columns correspond to the values taken by a particular variable. We can write data in such a rectangular format as

Unit  Variable 1  ...  Variable q
1     x11         ...  x1q
...   ...         ...  ...
n     xn1         ...  xnq

where n is the number of units, q is the number of variables recorded on each unit, and xij denotes the value of the jth variable for the ith unit. The observation part of the table above is generally represented by an n × q data matrix, X. In contrast to the observed data, the theoretical entities describing the univariate distributions of each of the q variables and their joint distribution are denoted by so-called random variables X1, ..., Xq.

Although in some cases where multivariate data have been collected it may make sense to isolate each variable and study it separately, in the main it does not. Because the whole set of variables is measured on each unit, the variables will be related to a greater or lesser degree. Consequently, if each variable is analysed in isolation, the full structure of the data may not be revealed. Multivariate statistical analysis is the simultaneous statistical analysis of a collection of variables, which improves upon separate univariate analyses of each variable by using information about the relationships between the variables. Analysis of each variable separately is very likely to miss uncovering the key features of, and any interesting "patterns" in, the multivariate data.

The units in a set of multivariate data are sometimes sampled from a population of interest to the investigator, a population about which he or she wishes to make some inference or other. More often perhaps, the units cannot really be said to have been sampled from some population in any meaningful sense, and the questions asked about the data are then largely exploratory in nature, with the ubiquitous p-value of univariate statistics being notable by its absence.
Consequently, there are methods of multivariate analysis that are essentially exploratory and others that can be used for statistical inference.

For the exploration of multivariate data, formal models designed to yield specific answers to rigidly defined questions are not required. Instead, methods are used that allow the detection of possibly unanticipated patterns in the data, opening up a wide range of competing explanations. Such methods are generally characterised both by an emphasis on the importance of graphical displays and visualisation of the data and the lack of any associated probabilistic model that would allow for formal inferences. Multivariate techniques that are largely exploratory are described in Chapters 2 to 6.

A more formal analysis becomes possible in situations when it is realistic to assume that the individuals in a multivariate data set have been sampled from some population and the investigator wishes to test a well-defined hypothesis about the parameters of that population's probability density function. Now the main focus will not be the sample data per se, but rather on using information gathered from the sample data to draw inferences about the population. And the probability density function almost universally assumed as the basis of inferences for multivariate data is the multivariate normal. (For a brief description of the multivariate normal density function and ways of assessing whether a set of multivariate data conform to the density, see Section 1.6.) Multivariate techniques for which formal inference is of importance are described in Chapters 7 and 8.
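The multivariate normal density just mentioned is described fully in Section 1.6. As a preview, it can be evaluated directly in base R; the function name dmvn_sketch and the example values below are ours, a minimal sketch rather than the book's code:

```r
## Multivariate normal density evaluated from its definition:
## f(x) = (2*pi)^(-q/2) * det(Sigma)^(-1/2) *
##        exp(-0.5 * t(x - mu) %*% solve(Sigma) %*% (x - mu))
dmvn_sketch <- function(x, mu, Sigma) {
  q <- length(mu)
  d <- x - mu
  ## quadratic form (squared Mahalanobis distance of x from mu)
  quad <- as.numeric(t(d) %*% solve(Sigma) %*% d)
  (2 * pi)^(-q / 2) * det(Sigma)^(-1 / 2) * exp(-0.5 * quad)
}

## Bivariate standard normal at the origin: density is 1 / (2*pi)
dmvn_sketch(c(0, 0), mu = c(0, 0), Sigma = diag(2))
```

With q = 1 the formula reduces to the familiar univariate normal density, which gives an easy sanity check against dnorm().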
But in many cases when dealing with multivariate data, this implied distinction between the exploratory and the inferential may be a red herring because the general aim of most multivariate analyses, whether implicitly exploratory or inferential, is to uncover, display, or extract any "signal" in the data in the presence of noise and to discover what the data have to tell us.

1.2 A brief history of the development of multivariate analysis

The genesis of multivariate analysis is probably the work carried out by Francis Galton and Karl Pearson in the late 19th century on quantifying the relationship between offspring and parental characteristics and the development of the correlation coefficient. And then, in the early years of the 20th century, Charles Spearman laid down the foundations of factor analysis (see Chapter 5) whilst investigating correlated intelligence quotient (IQ) tests. Over the next two decades, Spearman's work was extended by Hotelling and by Thurstone.

Multivariate methods were also motivated by problems in scientific areas other than psychology, and in the 1930s Fisher developed linear discriminant function analysis to solve a taxonomic problem using multiple botanical measurements. And Fisher's introduction of analysis of variance in the 1920s was soon followed by its multivariate generalisation, multivariate analysis of variance, based on work by Bartlett and Roy. (These techniques are not covered in this text for the reasons set out in the Preface.)

In these early days, computational aids to take the burden of the vast amounts of arithmetic involved in the application of the multivariate methods being proposed were very limited and, consequently, developments were primarily mathematical and multivariate research was, at the time, largely a branch of linear algebra.
However, the arrival and rapid expansion of the use of electronic computers in the second half of the 20th century led to increased practical application of existing methods of multivariate analysis and renewed interest in the creation of new techniques.

In the early years of the 21st century, the wide availability of relatively cheap and extremely powerful personal computers and laptops allied with flexible statistical software has meant that all the methods of multivariate analysis can be applied routinely even to very large data sets such as those generated in, for example, genetics, imaging, and astronomy. And the application of multivariate techniques to such large data sets has now been given its own name, data mining, which has been defined as "the nontrivial extraction of implicit, previously unknown and potentially useful information from data." Useful books on data mining are those of Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (1996) and Hand, Mannila, and Smyth (2001).

1.3 Types of variables and the possible problem of missing values

A hypothetical example of multivariate data is given in Table 1.1. The special symbol NA denotes a missing value ("Not Available"): the value of that variable for the subject was not recorded.

Table 1.1: hypo data. Hypothetical Set of Multivariate Data.

individual sex    age IQ  depression health    weight
1          Male   21  120 Yes        Very good 150
2          Male   43  NA  No         Very good 160
3          Male   22  135 No         Average   135
4          Male   86  150 No         Very poor 140
5          Male   60  92  Yes        Good      110
6          Female 16  130 Yes        Good      110
7          Female NA  150 Yes        Very good 120
8          Female 43  NA  Yes        Average   120
9          Female 22  84  No         Average   105
10         Female 80  70  No         Good      100

Here, the number of units (people in this case) is n = 10, with the number of variables being q = 7 and, for example, x34 = 135. In R, a "data.frame" is the appropriate data structure to represent such rectangular data.
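The data of Table 1.1 can be entered by hand as such a data.frame. The construction below is our own sketch (the book's accompanying packages supply its example data sets ready-made), but the values match the table, with NA marking the missing entries:

```r
## Hypothetical multivariate data of Table 1.1; row numbers serve as the
## "individual" labels, and NA marks the missing values
hypo <- data.frame(
  sex        = factor(rep(c("Male", "Female"), each = 5)),
  age        = c(21, 43, 22, 86, 60, 16, NA, 43, 22, 80),
  IQ         = c(120, NA, 135, 150, 92, 130, 150, NA, 84, 70),
  depression = factor(c("Yes", "No", "No", "No", "Yes",
                        "Yes", "Yes", "Yes", "No", "No")),
  health     = factor(c("Very good", "Very good", "Average", "Very poor",
                        "Good", "Good", "Very good", "Average", "Average",
                        "Good")),
  weight     = c(150, 160, 135, 140, 110, 110, 120, 120, 105, 100)
)
dim(hypo)  # 10 units (rows) by 6 recorded variables (columns)
```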
Subsets of units (rows) or variables (columns) can be extracted via the [ subset operator; i.e.,

R> hypo[1:2, c("health", "weight")]
     health weight
1 Very good    150
2 Very good    160

extracts the values x15, x16 and x25, x26 from the hypothetical data presented in Table 1.1. These data illustrate that the variables that make up a set of multivariate data will not necessarily all be of the same type. Four levels of measurement are often distinguished:

Nominal: Unordered categorical variables. Examples include treatment allocation, the sex of the respondent, hair colour, presence or absence of depression, and so on.
Ordinal: Where there is an ordering but no implication of equal distance between the different points of the scale. Examples include social class, self-perception of health (each coded from I to V, say), and educational level (no schooling, primary, secondary, or tertiary education).
Interval: Where there are equal differences between successive points on the scale but the position of zero is arbitrary. The classic example is the measurement of temperature using the Celsius or Fahrenheit scales.
Ratio: The highest level of measurement, where one can investigate the relative magnitudes of scores as well as the differences between them. The position of zero is fixed. The classic example is the absolute measure of temperature (in Kelvin, for example), but other common ones include age (or any other time from a fixed event), weight, and length.

In many statistical textbooks, discussion of different types of measurements is often followed by recommendations as to which statistical techniques are suitable for each type; for example, analyses on nominal data should be limited to summary statistics such as the number of cases, the mode, etc. And, for ordinal data, means and standard deviations are not suitable.
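In R, the first two measurement levels map directly onto unordered and ordered factors. The values below are made up for illustration:

```r
## Nominal: unordered categories
sex <- factor(c("Male", "Female", "Female"))

## Ordinal: ordered categories with no claim of equal spacing between levels
health <- factor(c("Very poor", "Good", "Very good"),
                 levels = c("Very poor", "Poor", "Average", "Good",
                            "Very good"),
                 ordered = TRUE)

is.ordered(sex)       # FALSE
is.ordered(health)    # TRUE
health[1] < health[2] # TRUE: order comparisons are defined for ordered factors
```

Interval- and ratio-scaled variables are simply stored as numeric vectors; R itself does not distinguish those two levels.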
But Velleman and Wilkinson (1993) make the important point that restricting the choice of statistical methods in this way may be a dangerous practice for data analysis: in essence, the measurement taxonomy described is often too strict to apply to real-world data. This is not the place for a detailed discussion of measurement, but we take a fairly pragmatic approach to such problems. For example, we will not agonise over treating variables such as measures of depression, anxiety, or intelligence as if they are interval-scaled, although strictly they fit into the ordinal category described above.

1.3.1 Missing values

Table 1.1 also illustrates one of the problems often faced by statisticians undertaking statistical analysis in general and multivariate analysis in particular, namely the presence of missing values in the data; i.e., observations and measurements that should have been recorded but, for one reason or another, were not. Missing values in multivariate data may arise for a number of reasons; for example, non-response in sample surveys, dropouts in longitudinal data (see Chapter 8), or refusal to answer particular questions in a questionnaire. The most important approach for dealing with missing data is to try to avoid them during the data-collection stage of a study. But despite all the efforts a researcher may make, he or she may still be faced with a data set that contains a number of missing values. So what can be done? One answer to this question is to take the complete-case analysis route because this is what most statistical software packages do automatically. Using complete-case analysis on multivariate data means omitting any case with a missing value on any of the variables. It is easy to see that if the number of variables is large, then even a sparse pattern of missing values can result in a substantial number of incomplete cases.
One possibility to ease this problem is to simply drop any variables that have many missing values. But complete-case analysis is not recommended for two reasons:

- Omitting a possibly substantial number of individuals will cause a large amount of information to be discarded and lower the effective sample size of the data, making any analyses less effective than they would have been if all the original sample had been available.
- More worrisome is that dropping the cases with missing values on one or...
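The complete-case route described above corresponds to complete.cases() and na.omit() in base R. The toy data frame below is our own illustration of how even a sparse pattern of NAs shrinks the usable sample:

```r
## Four units, three variables, with only two scattered missing values
dat <- data.frame(
  x = c(1, NA, 3, 4),
  y = c(10, 20, NA, 40),
  z = c(100, 200, 300, 400)
)

complete.cases(dat)  # TRUE only where no variable is missing
nrow(na.omit(dat))   # complete-case analysis keeps just 2 of the 4 units
```

Two missing values in different rows already discard half of this small sample, which is exactly the effect the text warns about when the number of variables is large.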
