Generalized Linear Models for Insurance Data

Actuaries should have the tools they need. Generalized linear models are used in the insurance industry to support critical decisions. Yet no text introduces GLMs in this context and addresses problems specific to insurance data. Until now.

Practical and rigorous, this book treats GLMs, covers all standard exponential family distributions, extends the methodology to correlated data structures, and discusses other techniques of interest and how they contrast with GLMs. The focus is on issues which are specific to insurance data, and all techniques are illustrated on data sets relevant to insurance.

Exercises and data-based practicals help readers to consolidate their skills, with solutions and data sets given on the companion website. Although the book is package-independent, SAS code and output examples feature in an appendix and on the website. In addition, R code and output for all examples are provided on the website.

International Series on Actuarial Science

Mark Davis, Imperial College London
John Hylands, Standard Life
John McCutcheon, Heriot-Watt University
Ragnar Norberg, London School of Economics
H. Panjer, Waterloo University
Andrew Wilson, Watson Wyatt

The International Series on Actuarial Science, published by Cambridge University Press in conjunction with the Institute of Actuaries and the Faculty of Actuaries, will contain textbooks for students taking courses in or related to actuarial science, as well as more advanced works designed for continuing professional development or for describing and synthesizing research. The series will be a vehicle for publishing books that reflect changes and developments in the curriculum, that encourage the introduction of courses on actuarial science in universities, and that show how actuarial science can be used in all areas where there is long-term financial risk.
GENERALIZED LINEAR MODELS FOR INSURANCE DATA

PIET DE JONG
Department of Actuarial Studies, Macquarie University, Sydney

GILLIAN Z. HELLER
Department of Statistics, Macquarie University, Sydney
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi

Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: http://www.afas.mq.edu.au/research/books/glms for insurance data

© P. de Jong and G. Z. Heller 2008

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2008
Third printing 2009

Printed in the United Kingdom at the University Press, Cambridge

A catalog record for this publication is available from the British Library

ISBN 978-0-521-87914-9 hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate. Information regarding prices, travel timetables and other factual information given in this work are correct at the time of first printing but Cambridge University Press does not guarantee the accuracy of such information thereafter.
Contents

Preface                                                        page ix

1   Insurance data                                                   1
    1.1  Introduction                                                2
    1.2  Types of variables                                          3
    1.3  Data transformations                                        4
    1.4  Data exploration                                            6
    1.5  Grouping and runoff triangles                              10
    1.6  Assessing distributions                                    12
    1.7  Data issues and biases                                     13
    1.8  Data sets used                                             14
    1.9  Outline of rest of book                                    19

2   Response distributions                                          20
    2.1  Discrete and continuous random variables                   20
    2.2  Bernoulli                                                  21
    2.3  Binomial                                                   22
    2.4  Poisson                                                    23
    2.5  Negative binomial                                          24
    2.6  Normal                                                     26
    2.7  Chi-square and gamma                                       27
    2.8  Inverse Gaussian                                           29
    2.9  Overdispersion                                             30
    Exercises                                                       33

3   Exponential family responses and estimation                     35
    3.1  Exponential family                                         35
    3.2  The variance function                                      36
    3.3  Proof of the mean and variance expressions                 37
    3.4  Standard distributions in the exponential family form      37
    3.5  Fitting probability functions to data                      39
    Exercises                                                       41

4   Linear modeling                                                 42
    4.1  History and terminology of linear modeling                 42
    4.2  What does "linear" in linear model mean?                   43
    4.3  Simple linear modeling                                     43
    4.4  Multiple linear modeling                                   44
    4.5  The classical linear model                                 46
    4.6  Least squares properties under the classical linear model  47
    4.7  Weighted least squares                                     47
    4.8  Grouped and ungrouped data                                 48
    4.9  Transformations to normality and linearity                 49
    4.10 Categorical explanatory variables                          51
    4.11 Polynomial regression                                      53
    4.12 Banding continuous explanatory variables                   54
    4.13 Interaction                                                55
    4.14 Collinearity                                               55
    4.15 Hypothesis testing                                         56
    4.16 Checks using the residuals                                 58
    4.17 Checking explanatory variable specifications               60
    4.18 Outliers                                                   61
    4.19 Model selection                                            62

5   Generalized linear models                                       64
    5.1  The generalized linear model                               64
    5.2  Steps in generalized linear modeling                       65
    5.3  Links and canonical links                                  66
    5.4  Offsets                                                    66
    5.5  Maximum likelihood estimation                              67
    5.6  Confidence intervals and prediction                        70
    5.7  Assessing fits and the deviance                            71
    5.8  Testing the significance of explanatory variables          74
    5.9  Residuals                                                  77
    5.10 Further diagnostic tools                                   79
    5.11 Model selection                                            80
    Exercises                                                       80

6   Models for count data                                           81
    6.1  Poisson regression                                         81
    6.2  Poisson overdispersion and negative binomial regression    89
    6.3  Quasi-likelihood                                           94
    6.4  Counts and frequencies                                     96
    Exercises                                                       96

7   Categorical responses                                           97
    7.1  Binary responses                                           97
    7.2  Logistic regression                                        98
    7.3  Application of logistic regression to vehicle insurance    99
    7.4  Correcting for exposure                                   102
    7.5  Grouped binary data                                       105
    7.6  Goodness of fit for logistic regression                   107
    7.7  Categorical responses with more than two categories       110
    7.8  Ordinal responses                                         111
    7.9  Nominal responses                                         116
    Exercises                                                      119

8   Continuous responses                                           120
    8.1  Gamma regression                                          120
    8.2  Inverse Gaussian regression                               125
    8.3  Tweedie regression                                        127
    Exercises                                                      128

9   Correlated data                                                129
    9.1  Random effects                                            131
    9.2  Specification of within-cluster correlation               136
    9.3  Generalized estimating equations                          137
    Exercise                                                       140

10  Extensions to the generalized linear model                     141
    10.1 Generalized additive models                               141
    10.2 Double generalized linear models                          143
    10.3 Generalized additive models for location, scale and shape 143
    10.4 Zero-adjusted inverse Gaussian regression                 145
    10.5 A mean and dispersion model for total claim size          148
    Exercises                                                      149

Appendix 1  Computer code and output                               150
    A1.1  Poisson regression                                       150
    A1.2  Negative binomial regression                             156
    A1.3  Quasi-likelihood regression                              159
    A1.4  Logistic regression                                      160
    A1.5  Ordinal regression                                       169
    A1.6  Nominal regression                                       175
    A1.7  Gamma regression                                         178
    A1.8  Inverse Gaussian regression                              181
    A1.9  Logistic regression GLMM                                 183
    A1.10 Logistic regression GEE                                  185
    A1.11 Logistic regression GAM                                  187
    A1.12 GAMLSS                                                   189
    A1.13 Zero-adjusted inverse Gaussian regression                190

Bibliography                                                       192
Index                                                              195
Preface

The motivation for this book arose out of our many years of teaching actuarial students and analyzing insurance data. Generalized linear models are ideally suited to the analysis of non-normal data which insurance analysts typically encounter. However the acceptance, uptake and understanding of this methodology have been slow in insurance compared to other disciplines. Part of the reason may be the lack of a suitable textbook geared towards an actuarial audience. This book seeks to address that need.

We have tried to make the book as practical as possible. Analyses are based on real data. All but one of the data sets are available on the companion website to this book: http://www.afas.mq.edu.au/research/books/glms for insurance data. Computer code and output for all examples is given in Appendix 1.

The SAS software is widely used in the insurance industry. Hence computations in this text are illustrated using SAS. The statistical language R is used where computations are not conveniently performed in SAS. In addition, R code and output for all the examples is provided on the companion website. Exercises are given at the end of chapters, and fully worked solutions are available on the website.

The body of the text is independent of software or software "runs." In most cases, fitting results are displayed in tabular form. Remarks on computer implementation are confined to paragraphs headed "SAS notes" and "Implementation," and these notes can be skipped without loss of continuity.

Readers are assumed to be familiar with the following statistical concepts: discrete and continuous random variables, probability distributions, estimation, hypothesis testing, and linear regression (the normal model). Relevant basics of probability and estimation are covered in Chapters 2 and 3, but familiarity with these concepts is assumed. Normal linear regression is covered in Chapter 4: again it is expected readers have previously encountered the material.
This chapter sets the scene for the rest of the book and discusses concepts that are applicable to regression models in general.
Excessive notation is avoided. The meanings of symbols will be clear from the context. For example a response variable is denoted by y, and there is no notational distinction between the random variable and its realization. The vector of outcomes is also denoted by y. Derivatives are denoted using the dot notation: ḟ(y), and double dots denote second derivatives. This avoids confusion with the notation X′ for matrix transposition, frequently required in the same mathematical expressions. Tedious and generally uninformative subscripting is avoided. For example, the expression y = x′β used in this text can be written as y_i = x′_i β, or even more explicitly and laboriously as y_i = β_0 + β_1 x_i1 + · · · + β_p x_ip. Generally such laboring is avoided. Usually x denotes the vector (1, x_1, . . . , x_p)′ and β denotes (β_0, . . . , β_p)′. The equivalence symbol "≡" is used when a quantity is defined. The symbol "∼" denotes "distributed as," either exactly or approximately.

Both authors contributed equally to this book, and authorship order was determined by the alphabetical convention. Much of the book was written while GH was on sabbatical leave at CSIRO Mathematical and Information Sciences, Sydney, whom she thanks for their hospitality. We thank Christine Lu for her assistance. And to our families Dana, Doryon, Michelle and Dean, and Steven, Ilana and Monique, our heartfelt thanks for putting up with the many hours that we spent on this text.

Piet de Jong
Gillian Heller
Sydney, 2007
1
Insurance data

Generalized linear modeling is a methodology for modeling relationships between variables. It generalizes the classical normal linear model, by relaxing some of its restrictive assumptions, and provides methods for the analysis of non-normal data. The tools date back to the original article by Nelder and Wedderburn (1972) and have since become part of mainstream statistics, used in many diverse areas of application.

This text presents the generalized linear model (GLM) methodology, with applications oriented to data that actuarial analysts are likely to encounter, and the analyses that they are likely required to perform.

With the GLM, the variability in one variable is explained by the changes in one or more other variables. The variable being explained is called the "dependent" or "response" variable, while the variables that are doing the explaining are the "explanatory" variables. In some contexts these are called "risk factors" or "drivers of risk." The model explains the connection between the response and the explanatory variables.

Statistical modeling in general and generalized linear modeling in particular is the art or science of designing, fitting and interpreting a model. A statistical model helps in answering the following types of questions:

• Which explanatory variables are predictive of the response, and what is the appropriate scale for their inclusion in the model?
• Is the variability in the response well explained by the variability in the explanatory variables?
• What is the prediction of the response for given values of the explanatory variables, and what is the precision associated with this prediction?

A statistical model is only as good as the data underlying it. Consequently a good understanding of the data is an essential starting point for modeling. A significant amount of time is spent on cleaning and exploring the data. This chapter discusses different types of insurance data. Methods for
2Insurance datathe display, exploration and transformation of the data are demonstrated andbiases typically encountered are highlighted.1.1 IntroductionFigure 1.1 displays summaries ofinsurance data relating to n = 22 036 settledpersonal injury insurance claims, described on page 14. These claims werereported during the period from July 1989 through to the end of 1999. Claimssettled with zero payment are excluded.The top left panel of Figure 1.1 displays a histogram of the dollar values ofthe claims. The top right indicates the proportion of cases which are legallyrepresented. The bottom left indicates the proportion of various injury codesas discussed in Section 1.2 below. The bottom right panel is a histogram ofsettlement delay.0204060801000.000.020.04Claim size ($1000s)FrequencyNoYesLegal representationFrequency0.00.20.40.61234569Injury codeFrequency0.00.20.40.6020Settlement delay (months)4060801000.0000.0150.030FrequencyFig. 1.1. Graphical representation ofpersonal injury insurance dataThis data set is typical of those amenable to generalized linear modeling.The aim ofstatistical modeling is usually to address questions ofthe followingnature:• What is the relationship between settlement delay and the finalized claimamount?• Does legal representation have any effect on the dollar value of the claim?• What is the impact on the dollar value of claims of the level of injury?• Given a claim has already dragged on for some time and given the level ofinjury and the fact that it is legally represented, what is the likely outcomeof the claim?
Answering such questions is subject to pitfalls and problems. This book aims to point these out and outline useful tools that have been developed to aid in providing answers.

Modeling is not an end in itself, rather the aim is to provide a framework for answering questions of interest. Different models can, and often are, applied to the same data depending on the question of interest. This stresses that modeling is a pragmatic activity and there is no such thing as the "true" model.

Models connect variables, and the art of connecting variables requires an understanding of the nature of the variables. Variables come in different forms: discrete or continuous, nominal, ordinal, categorical, and so on. It is important to distinguish between different types of variables, as the way that they can reasonably enter a model depends on their type. Variables can, and often are, transformed. Part of modeling requires one to consider the appropriate transformations of variables.

1.2 Types of variables

Insurance data is usually organized in a two-way array according to cases and variables. Cases can be policies, claims, individuals or accidents. Variables can be level of injury, sex, dollar cost, whether there is legal representation, and so on. Cases and variables are flexible constructs: a variable in one study forms the cases in another. Variables can be quantitative or qualitative. The data displayed in Figure 1.1 provide an illustration of types of variables often encountered in insurance:

• Claim amount is an example of what is commonly regarded as a continuous variable even though, practically speaking, it is confined to an integer number of dollars. In this case the variable is skewed to the right. Not indicated on the graphs are a small number of very large claims in excess of $100 000. The largest claim is around $4.5 million.
Continuous variables are also called "interval" variables to indicate they can take on values anywhere in an interval of the real line.

• Legal representation is a categorical variable with two levels "no" or "yes." Variables taking on just two possible values are often coded "0" and "1" and are also called binary, indicator or Bernoulli variables. Binary variables indicate the presence or absence of an attribute, or occurrence or non-occurrence of an event of interest such as a claim or fatality.

• Injury code is a categorical variable, also called qualitative. The variable has seven values corresponding to different levels of physical injury: 1–6 and 9. Level 1 indicates the lowest level of injury, 2 the next level and so on up to level 5 which is a catastrophic level of injury, while level 6 indicates death. Level 9 corresponds to an "unknown" or unrecorded level of injury
and hence probably indicates no physical injury. The injury code variable is thus partially ordered, although there are no levels 7 and 8 and level 9 does not conform to the ordering. Categorical variables generally take on one of a discrete set of values which are nominal in nature and need not be ordered. Other types of categorical variables are the type of crash (non-injury, injury, fatality) or claim type on household insurance (burglary, storm, other types). When there is a natural ordering in the categories, such as (none, mild, moderate, severe), then the variable is called ordinal.

• The distribution of settlement delay is in the final panel. This is another example of a continuous variable, which in practical terms is confined to an integer number of months or days.

Data are often converted to counts or frequencies. Examples of count variables are: number of claims on a class of policy in a year, number of traffic accidents at an intersection in a week, number of children in a family, number of deaths in a population. Count variables are by their nature non-negative integers. They are sometimes expressed as relative frequencies or proportions.

1.3 Data transformations

The panels in Figure 1.2 indicate alternative transformations and displays of the personal injury data:

• Histogram of log claim size. The top left panel displays the histogram of log claim size. Compared to the histogram in Figure 1.1 of actual claim size, the logarithm is roughly symmetric and indeed almost normal. Historically normal variables have been easier to model. However generalized linear modeling has been at least partially developed to deal with data that are not normally distributed.

• Claim size versus settlement delay. The top right panel does not reveal a clear picture of the relationship between claim sizes and settlement delay. It is expected that larger claims are associated with longer delays since larger claims are often more contentious and difficult to quantify.
Whatever the relationship, it is masked by noise.

• Claim size versus operational time. The bottom left panel displays claim size versus the percentile rank of the settlement delay. The percentile rank is the percentage of cases that settle faster than the given case. In insurance data analysis the settlement delay percentile rank is called operational time. Thus a claim with operational time 23% means that 23% of claims in the group are settled faster than the given case. Note that both the mean and variability of claim size appear to increase with operational time.

• Log claim size versus operational time. The bottom right panel of Figure 1.2 plots log claim size versus operational time. The relationship between log claim size and settlement delay is now apparent: log claim size increases virtually linearly with operational time. The log transform has "stabilized the variance." Thus whereas in the bottom left panel the variance appears to increase with the mean and operational time, in the bottom right panel the variance is approximately constant. Variance-stabilizing transformations are further discussed in Section 4.9.

[Fig. 1.2. Relationships between variables in personal injury insurance data set: histogram of log claim size; claim size ($1000s) versus settlement delay (months); claim size ($1000s) versus operational time; log claim size versus operational time.]

The above examples illustrate ways of transforming a variable. The aim of transformations is to make variables more easily amenable to statistical analysis, and to tease out trends and effects. Commonly used transformations include:

• Logarithms. The log transform applies to positive variables. Logs are usually "natural" logs (to the base e ≈ 2.718 and denoted ln y). If x = log_b(y) then x = ln(y)/ln(b) and hence logs to different bases are multiples of each other.

• Powers. The power transform of a variable y is y^p. For mathematical convenience this is rewritten as y^(1−p/2) for p ≠ 2, and interpreted as ln y if p = 2. This is known as the "Box–Cox" transform. The case p = 0 corresponds to the identity transform, p = 1 the square root and p = 4 the reciprocal. The transform is often used to stabilize the variance – see Section 4.9.

• Percentile ranks and quantiles. The percentile rank of a case is the percentage of cases having a value less than the given case. Thus the percentile
rank depends on the value of the given case as well as all other case values. Percentile ranks are uniformly distributed from 0 to 100. The quantile of a case is the value associated with a given percentile rank. For example the 75% quantile is the value of the case which has percentile rank 75. Quantiles are often called percentiles.

• z-score. Given a variable y, the z-score of a case is the number of standard deviations the value of y for the given case is away from the mean. Both the mean and standard deviation are computed from all cases and hence, similar to percentile ranks, z-scores depend on all cases.

• Logits. If y is between 0 and 1 then the logit of y is ln{y/(1 − y)}. Logits lie between minus and plus infinity, and are used to transform a variable in the (0,1) interval to one over the whole real line.

1.4 Data exploration

Data exploration using appropriate graphical displays and tabulations is a first step in model building. It makes for an overall understanding of relationships between variables, and it permits basic checks of the validity and appropriateness of individual data values, the likely direction of relationships and the likely size of model parameters. Data exploration is also used to examine:

(i) relationships between the response and potential explanatory variables; and
(ii) relationships between potential explanatory variables.

The findings of (i) suggest variables or risk factors for the model, and their likely effects on the response. The second point highlights which explanatory variables are associated. This understanding is essential for sensible model building. Strongly related explanatory variables are included in a model with care.

Data displays differ fundamentally, depending on whether the variables are continuous or categorical.

Continuous by continuous. The relationship between two continuous variables is explored with a scatterplot.
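The transformations listed in Section 1.3 are all simple to compute. A minimal Python sketch (the book's own computations are done in SAS and R; these helper functions are purely illustrative and not from the text):

```python
import math

def box_cox(y, p):
    """Power transform y^(1 - p/2), read as ln y when p = 2.

    As in Section 1.3: p = 0 is the identity, p = 1 the square root,
    p = 4 the reciprocal; the choice of p can stabilize the variance.
    """
    if p == 2:
        return math.log(y)
    return y ** (1 - p / 2)

def percentile_rank(value, values):
    """Percentage of cases with a value less than the given case."""
    return 100 * sum(v < value for v in values) / len(values)

def z_score(value, values):
    """Standard deviations the given case lies from the mean;
    mean and standard deviation are computed over all cases."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return (value - mean) / sd

def logit(y):
    """Map a value in (0, 1) to the whole real line."""
    return math.log(y / (1 - y))
```

For example, `box_cox(9, 1)` gives the square root 3.0, and `logit(0.5)` is 0: the logit is symmetric about y = 1/2.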
A scatterplot is sometimes enhanced with the inclusion of a third, categorical, variable using color and/or different symbols. This is illustrated in Figure 1.3, an enhanced version of the bottom right panel of Figure 1.2. Here legal representation is indicated by the color of the plotted points. It is clear that the lower claim sizes tend to be the faster-settled claims without legal representation.

Scatterplot smoothers are useful for uncovering relationships between variables. These are similar in spirit to weighted moving average curves, albeit more sophisticated. Splines are commonly used scatterplot smoothers. They
have a tuning parameter controlling the smoothness of the curve. The point of a scatterplot smoother is to reveal the shape of a possibly nonlinear relationship. The left panel of Figure 1.4 displays claim size plotted against vehicle value, in the vehicle insurance data (described on page 15), with a spline curve superimposed. The right panel shows the scatterplot and spline with both variables log-transformed. Both plots suggest that the relationship between claim size and value is nonlinear. These displays do not indicate the strength or statistical significance of the relationships.

[Fig. 1.3. Scatterplot for personal injury data: log claim size versus operational time, with legal representation (no/yes) indicated by color.]

[Fig. 1.4. Scatterplots with splines for vehicle insurance data: claim size versus vehicle value (in $10 000 units), and log claim size versus log vehicle value.]
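Spline fitting requires specialized routines, but the moving-average idea behind scatterplot smoothers can be sketched in a few lines. A hypothetical Python illustration (the smoothers in the figures were fitted with SAS and R, not with this code):

```python
def moving_average_smooth(x, y, window=5):
    """Crude scatterplot smoother: at each point, average the y-values
    of the nearest neighbours in x. The window size plays the role of
    the tuning parameter controlling the smoothness of the curve."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    half = window // 2
    smoothed = []
    for i in range(len(ys)):
        lo = max(0, i - half)
        hi = min(len(ys), i + half + 1)
        smoothed.append(sum(ys[lo:hi]) / (hi - lo))
    return xs, smoothed
```

A larger `window` yields a smoother, flatter curve; splines achieve the same trade-off in a statistically more refined way.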
Categorical by categorical. A frequency table is the usual means of display when examining the relationship between two categorical variables. Mosaic plots are also useful. A simple example is given in Table 1.1, displaying the occurrence of a claim in the vehicle insurance data tabulated by driver's age category. Column percentages are also shown. The overall percentage of no claims is 93.2%. This percentage increases monotonically from 91.4% for the youngest drivers to 94.4% for the oldest drivers. The effect is shown graphically in the mosaic plot in the left panel of Figure 1.5. The areas of the rectangles are proportional to the frequencies in the corresponding cells in the table, and the column widths are proportional to the square roots of the column frequencies. The relationship of claim occurrence with age is clearly visible.

Table 1.1. Claim by driver's age in vehicle insurance

                          Driver's age category
Claim        1        2        3        4        5        6     Total
Yes        496      932    1 113    1 104      614      365    4 624
          8.6%     7.2%     7.1%     6.8%     5.7%     5.6%     6.8%
No       5 246   11 943   14 654   15 085   10 122    6 182   63 232
         91.4%    92.8%    92.9%    93.2%    94.3%    94.4%    93.2%
Total    5 742   12 875   15 767   16 189   10 736    6 547   67 856

[Fig. 1.5. Mosaic plots: vehicle insurance (claim occurrence by driver's age) and private health insurance (insurance type by income).]

A more substantial example is the relationship of type of private health insurance with personal income, in the National Health Survey data, described on page 17. The tabulation and mosaic plot are shown in Table 1.2 and the right panel of Figure 1.5, respectively. "Hospital and ancillary" insurance is coded as 1, and is indicated as the red cells on the mosaic plot. The trend for increasing uptake of hospital and ancillary insurance with increasing income level is apparent in the plot.
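The column percentages in Table 1.1 can be reproduced directly from the cell counts. A short Python sketch (counts copied from the table; the function name is illustrative):

```python
# Claim counts by driver's age category, from Table 1.1.
claims    = {1: 496,  2: 932,   3: 1113,  4: 1104,  5: 614,   6: 365}
no_claims = {1: 5246, 2: 11943, 3: 14654, 4: 15085, 5: 10122, 6: 6182}

def column_percentages(yes, no):
    """Percentage of policies with a claim, within each age category."""
    return {age: round(100 * yes[age] / (yes[age] + no[age]), 1)
            for age in yes}

print(column_percentages(claims, no_claims))
# The claim percentage falls monotonically with driver's age,
# from 8.6% in the youngest category to 5.6% in the oldest.
```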
Table 1.2. Private health insurance type by income

                                        Income
Private health         <$20 000   $20 000–   $35 000–   >$50 000     Total
insurance type                      $35 000    $50 000
1  Hospital and           2 178      1 534        875        693     5 280
   ancillary              22.6%      32.3%      45.9%      54.8%     30.1%
2  Hospital only            611        269        132        120     1 132
                           6.3%       5.7%       6.9%       9.5%      6.5%
3  Ancillary only           397        306        119         46       868
                           4.1%       6.5%       6.2%       3.6%      4.9%
4  None                   6 458      2 638        780        405    10 281
                          67.0%      55.6%      40.9%      32.0%     58.5%
Total                     9 644      4 747      1 906      1 264    17 561

Mosaic plots are less effective when the number of categories is large. In this case, judicious collapsing of categories is helpful. A reference for mosaic plots and other visual displays is Friendly (2000).

Continuous by categorical. Boxplots are appropriate for examining a continuous variable against a categorical variable. The boxplots in Figure 1.6 display claim size against injury code and legal representation for the personal injury data. The left plots are of raw claim sizes: the extreme skewness blurs the relationships. The right plots are of log claim size: the log transform clarifies the effect of injury code. The effect of legal representation is not as obvious, but there is a suggestion that larger claim sizes are associated with legal representation.

Scatterplot smoothers are useful when a binary variable is plotted against a continuous variable. Consider the occurrence of a claim versus vehicle value, in the vehicle insurance data. In Figure 1.7, boxplots of vehicle value (top) and log vehicle value (bottom), by claim occurrence, are on the left. On the right, occurrence of a claim (1 = yes, 0 = no) is plotted on the vertical axis, against vehicle value on the horizontal axis, with a scatterplot smoother. Raw vehicle values are used in the top plot and log-transformed values in the bottom plot. In the boxplots, the only discernible difference between vehicle values of those policies which had a claim and those which did not, is that policies with a claim have a smaller variation in vehicle value. The plots on the right are more informative.
They show that the probability of a claim is nonlinear, possibly quadratic, with the maximum probability occurring for vehicles valued
around $40 000. This information is important for formulating a model for the probability of a claim. This is discussed in Section 7.3.

[Fig. 1.6. Personal injury claim sizes by injury code and legal representation: boxplots of claim size and log claim size, against injury code (1–6, 9) and against legal representation (no/yes).]

1.5 Grouping and runoff triangles

Cases are often grouped according to one or more categorical variables. For example, the personal injury insurance data may be grouped according to injury code and whether or not there is legal representation. Table 1.3 displays the average log claim sizes for such different groups.

An important form of grouping occurs when claims data is classified according to year of accident and settlement delay. Years are often replaced by months or quarters and the variable of interest is the total number of claims or total amount for each combination. If i denotes the accident year and j the settlement delay, then the matrix with (i, j) entry equal to the total number or amount is called a runoff triangle. Table 1.4 displays the runoff triangle corresponding to the personal injury data. Runoff triangles have a triangular structure since i + j > n is not yet observed, where n is current time.
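Assembling such a triangle from individual claim records is a simple aggregation. A hedged Python sketch (the record layout and function name are illustrative, not the format of the book's data sets):

```python
def runoff_triangle(claims, current_year):
    """Aggregate (accident_year, settlement_delay, amount) records into
    a runoff triangle: entry (i, j) holds the total amount for accident
    year i settled after delay j. Cells with i + j > current_year lie
    in the future and are omitted, giving the triangular structure."""
    years = sorted({i for i, _, _ in claims})
    triangle = {}
    for i in years:
        for j in range(current_year - i + 1):
            triangle[(i, j)] = 0.0
    for i, j, amount in claims:
        if i + j <= current_year:
            triangle[(i, j)] += amount
    return triangle
```

For instance, with records `[(1, 0, 100.0), (1, 1, 50.0), (2, 0, 80.0)]` and `current_year=2`, accident year 1 has entries for delays 0 and 1 but accident year 2 only for delay 0, since its delay-1 settlements have not yet been observed.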
[Fig. 1.7. Vehicle insurance claims by vehicle value: boxplots of vehicle value and log vehicle value by claim occurrence, and occurrence of a claim (0/1) against vehicle value and log vehicle value, with scatterplot smoothers.]

Table 1.3. Personal injury average log claim sizes

Legal                            Injury code
representation      1      2      3      4      5      6      9
No               9.05  10.06  10.42  11.16  10.17   9.36   7.69
Yes              9.54  10.46  10.88  11.00  11.25   9.97   9.07

This runoff triangle also has the feature of zeroes in the top left, which occurs because data collection on settlements did not commence until the fourth year. Accident years, corresponding to each row, show how claims "run off." Settlement delay, corresponding to the columns, is often called the "development" year. Each calendar year leads to another diagonal of entries. For this triangle the final diagonal corresponds to a partial year, explaining the low totals. Runoff triangles are often standardized on the number of policies written in each year. The above triangle suggests this number is increasing over time.

Runoff triangles are usually more regular than the one displayed in Table 1.4, with a smooth and consistent progression of settlement amounts either in the development year direction or in the accident year direction.
Table 1.4. Runoff triangle of amounts
[Table entries not reproduced: rows are accident years, columns development years 0–8, with zeroes in the top left for the first three accident years.]

Runoff triangles are often the basis for forecasting incurred but not yet settled liabilities, corresponding to the lower triangular entries. One approach to forecasting these liabilities is to use generalized linear modeling, discussed in Section 8.1.

1.6 Assessing distributions

Statistical modeling, including generalized linear modeling, usually makes assumptions about the random process generating the data. For example it may be assumed that the logarithm of a variable is approximately normally distributed. Distributional assumptions are checked by comparing empirical percentile ranks to those computed on the basis of the assumed distribution. For example suppose a variable is assumed normal. To check this, the observed percentile ranks of the variable values are first computed. If the sample size is n then the smallest value has percentile rank 100/n, the second smallest 200/n, and so on. These sample percentile ranks are compared to the theoretical percentile ranks of the sample values, based on the normal with the same mean and standard deviation as the given sample. The "pp-plot" is a graphical means of assessing the agreement between these two sets of ranks. The sample and theoretical percentile ranks are plotted against one another. Points falling near the 45° line indicate the normal model fits the data well. A similar procedure is used for testing distributions other than the normal.

The two panels in Figure 1.8 illustrate pp-plots for assessing the distribution of the size of claims for the personal injury insurance data. The l...