Universidad de Valladolid - UVaDOC


Universidad de Valladolid
FACULTAD DE CIENCIAS
DEPARTAMENTO DE ESTADÍSTICA E INVESTIGACIÓN OPERATIVA

DOCTORAL THESIS:
IMPROVEMENTS IN CLASSIFICATION RULES THROUGH THE INCORPORATION OF ADDITIONAL INFORMATION

Presented by David Conde del Río in application for the degree of Doctor from the Universidad de Valladolid

Supervised by: Bonifacio Salvador, Miguel A. Fernández

Bonifacio Salvador González, Professor (Catedrático de Universidad), and Miguel A. Fernández Temprano, Associate Professor (Profesor Titular de Universidad), certify that this dissertation was carried out, under their supervision, by David Conde del Río in the Departamento de Estadística e Investigación Operativa of the Universidad de Valladolid.

Valladolid, 31 January 2014

My most sincere thanks to my supervisors, Bonifacio Salvador and Miguel A. Fernández, and to Cristina Rueda, for their dedication and guidance, for their understanding, ideas and suggestions, and, by extension, to all the members of the Departamento de Estadística e Investigación Operativa.

To Alma, and to my family

Contents

1 Introduction
2 A classification rule for ordered exponential populations
   2.1 Objectives
      2.1.1 Initial hypotheses
      2.1.2 Background
      2.1.3 Specific objectives
   2.2 Methodology
   2.3 Results
      2.3.1 Rules for two populations
      2.3.2 Rules for two populations with censored data
      2.3.3 Rules for more than two populations
   2.4 Conclusions
3 Classification of samples into two or more ordered populations with application to a cancer trial
   3.1 Objectives
      3.1.1 Initial hypotheses
      3.1.2 Background
      3.1.3 Specific objectives
   3.2 Methodology
   3.3 Results
      3.3.1 Estimators and rules for more than two populations
      3.3.2 Behaviour of the rules in simulations
      3.3.3 Application
   3.4 Conclusions
4 Performance and estimation of the true error rate of classification rules built with additional information. An application to a cancer trial
   4.1 Objectives
      4.1.1 Initial hypotheses
      4.1.2 Background
      4.1.3 Specific objectives
   4.2 Methodology
   4.3 Results
      4.3.1 Comparison with shrinkage-based contractive rules
      4.3.2 Estimation of the true error rate of the restricted rules
      4.3.3 Application
   4.4 Conclusions
5 Work in progress and future work
References
ANNEXES: Published articles

1 Introduction

This dissertation is presented in the so-called "compendium of publications" format, so we begin by describing the topic to which the contributions belong and by summarizing the most notable existing results in it to date. The sole purpose of this description is to place the results obtained in their proper context and to justify their thematic unity. The subsequent sections describe the contributions made and the work currently in progress.

The contributions in this dissertation belong to constrained statistical inference, a branch of statistics born in the 1950s whose goal, when non-sample prior information is available, is to design procedures that exploit this information and are more efficient than procedures that ignore it. In general, under the frequentist methodology, this additional information is incorporated into the model through restrictions (order, non-negativity, shape...) on the parameters. In practice there are many situations in which it is reasonable to assume order restrictions on the parameters. For example, in a clinical trial it is common for the mean toxicity levels of a drug to increase with the dose level; see for example Gasparini and Eisele (2000). On other occasions it is the model itself that imposes such restrictions. This is the case in the nonparametric estimation of distribution functions, where the monotone nature of these functions must be taken into account; see Lo (1987). Fundamental references in the field of constrained inference are Barlow et al. (1972), Robertson et al. (1988) and Silvapulle and Sen (2004), books that collect the most relevant contributions up to their publication dates.

An alternative that we do not consider in this dissertation is the Bayesian methodology, in which the additional information is incorporated into the model through the prior distributions. Some references of interest in this field are Marchand and Strawderman (2004), Taylor et al. (2007) and Hoijtink (2013).

One of the most developed topics within constrained statistical inference is that of restricted tests for the means of normal populations. These tests are motivated by the need, in many applications, to test hypotheses such as homogeneity or symmetry against restricted alternatives such as monotonicity or unimodality. Many statistical methods have been developed to detect these properties within the normal model, and also outside it using nonparametric methods.

Likelihood-based methodology is the main approach to restricted testing. Bartholomew (1959) studied the test of homogeneity of the means of univariate normal populations with equal variances against the alternative hypothesis of order restrictions among the means, and derived the likelihood ratio statistic and its distribution under the null hypothesis. Since then, the literature on this problem has been abundant. The distribution of the likelihood ratio statistic has been obtained in various situations [Bartholomew (1961), Barlow et al. (1972), Sasabuchi et al. (1983), Kulatunga and Sasabuchi (1984), Shapiro (1988), Robertson et al. (1988), Cohen et al. (2000)], as well as properties of its power function [Perlman (1969), Mukerjee et al. (1986), Tsai (1992), Praestgaard (2012)]. Anomalies have been detected in these restricted likelihood ratio tests, and the situations in which they are dominated by tests that ignore the information given by the restrictions have been studied and characterized: Menéndez et al. (1991, 1992a, 1992b), Menéndez and Salvador (1991, 1992), Hu and Wright (1994), Cohen et al. (2000), Perlman and Wu (2002), Cohen and Sackrowitz (2004). Procedures outside the likelihood ratio framework have also been considered, with positive results, such as the linear tests proposed by Abelson and Tukey (1963) and later extended by, among others, Schaafsma and Smid (1966), Shi and Kudô (1987), Rueda (1989) and Tsai and Sen (1993). Very recently, El Barmi and McKeague (2013) have proposed a test statistic based on empirical likelihood to test for stochastic order among k univariate distributions.

Another major topic of constrained inference is parameter estimation. Ayer et al. (1955) studied the binomial model under linear order restrictions on the success probabilities and gave, for the first time, a minimax formulation of the solution. Van Eeden (1956, 1957a, 1957b) studied conditions for the existence of, and proposed algorithms to find, the Maximum Likelihood Estimator (MLE) for the problem of several univariate populations with inequalities among, and bounds on, the parameters. Brunk (1955, 1958) considered the problem of several populations from the one-parameter exponential family with restrictions among the parameters. Many of these procedures fall within isotonic regression or generalized isotonic regression; see Brunk (1970) and Robertson et al. (1988), where the subject is treated in detail. When the estimation of the multidimensional mean vector is considered globally, the restricted MLE has smaller mean squared error than the unrestricted one; see Robertson et al. (1988). However, when the interest lies in estimating a linear function of the coordinates, the result depends on the type of restrictions and on the linear function to be estimated. Rueda and Salvador (1995) provide conditions under which, for cones determined by one or two linear restrictions, any linear function of the restricted estimator behaves better than the same linear function of the unrestricted estimator. In the case of order restrictions, and for estimating the coordinates of the mean vector, the result depends on the type of restriction. Under simple order restrictions, Kelly (1989) and Lee (1981) prove that the restricted MLE dominates the unrestricted one. However, under tree order restrictions, Lee (1988) proves that the restricted estimator does not dominate the unrestricted one for the root component of the tree. Hwang and Peddada (1994) strengthen and unify the results of Lee (1981, 1988) and Kelly (1989) for simple order and tree order restrictions: they propose alternative estimators that are better than the unrestricted ones for the i-th coordinate, in the sense that confidence intervals centered at these estimators have higher coverage probability. In this direction, Rueda et al. (1997a) show that if the mean vector is restricted to a half-space, the coverage probability of the confidence intervals is higher for the restricted MLE, whether or not the estimation is simultaneous. Fernández et al. (1998, 1999, 2000) characterize situations in which the restricted estimator behaves better than the unrestricted one in terms of coverage probability and mean squared error. The estimation of scale parameters under several types of order restrictions is also studied by Hwang and Peddada (1994), who obtain partial domination results: they show that, under a simple order, the restricted MLE of the smallest parameter stochastically dominates the unrestricted one, while being dominated for the largest parameter. Fernández et al. (1997) also study this problem for uniform populations, and determine functions for which the restricted estimator dominates the unrestricted one and a class of functions for which it does not. Marchand and Strawderman (2004) and van Eeden (2006) review the methods of estimation under parameter restrictions present in the literature, their admissibility and minimaxity properties, and when these good properties are preserved under estimation of functions of the parameter, in particular the i-th coordinate. These properties depend on the loss function used. Most authors use squared error loss, but other loss functions are also used, such as Stein's (Tsukuma and Kubokawa, 2011) or LINEX (Ma and Liu, 2013).

The linear model deserves separate mention. The literature on the estimation of the parameters of the linear model in restricted parameter spaces, i.e. when the (k-dimensional) mean vector is restricted to a closed convex subset of R^k, is abundant. One of the first works is due to Lovell and Prescott (1970). Silvapulle and Sen (2004) study this problem in detail, with numerous references. Van Eeden (2006) reviews the literature in terms of admissibility in three situations: a single linear inequality restriction, interval and quadrant restrictions, and polyhedral and ellipsoidal restrictions, focusing in the latter case on the results of Moors and van Houwelingen (1993). Rueda et al. (1997b) deal with simultaneous estimation in a linear model when the parameter vector is restricted to a polyhedral cone.

Regarding confidence intervals, the constrained-inference literature has essentially been limited to suggesting methods for building confidence intervals under order restrictions on the parameters. Marcus and Peritz (1976) build simultaneous lower confidence bounds under certain restricted linear models with known variances. Schoenfeld (1986) proposes confidence bounds for the ordered means of normal populations by inverting the likelihood ratio test. Hwang and Peddada (1994) construct one-dimensional, constant-width confidence intervals for the coordinates of the mean of a symmetric elliptically distributed variable under order restrictions on the coordinates. These intervals, centered at improved estimators such as those coming from isotonic regression, dominate the standard confidence intervals (centered at the MLE) when the covariance matrix is diagonal. They also present a new scheme for building better confidence intervals, centered at the proposed estimator, for other general order restrictions. Korn (1982) proposes a procedure for building confidence bands for isotonic dose-response curves for which no parametric model is assumed. More recently, Sampson et al. (2008) have proposed a new procedure that generalizes that of Korn (1982).

Bootstrap and other resampling methodologies have also been studied in order-restricted models. Peddada (1997) studies the construction of confidence intervals for population means under order restrictions using jackknife and bootstrap methodologies. Rueda et al. (2002) propose a bootstrap-based methodology for estimating a linear combination of the coordinates of the mean that yields estimators with smaller mean squared error. Li et al. (2010) build confidence intervals and regions for ordered binomial probabilities based on asymptotic distributions and on several versions of the bootstrap. Care must be taken, however, when applying the bootstrap to restricted models: the bootstrap estimator is generally inconsistent in inference with inequality restrictions (Shaw and Geyer, 1997). Andrews (2000) proves that the bootstrap estimator is inconsistent when the parameter lies on the boundary of a parameter space defined by linear or nonlinear restrictions, some of which are inequalities.
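Under a simple order restriction, the restricted MLE of a vector of normal means mentioned above is the isotonic regression of the sample means, computable with the Pool Adjacent Violators Algorithm (PAVA); see Robertson et al. (1988). A minimal sketch (a generic illustrative implementation, not code from the thesis; function and variable names are ours):

```python
def pava(y, w=None):
    """Pool Adjacent Violators: weighted least-squares fit of a
    nondecreasing sequence to y, i.e. the restricted MLE of ordered
    normal means with weights w (defaulting to equal weights)."""
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    # each block is [fitted value, total weight, number of points pooled]
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # pool adjacent blocks while the order constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v1, w1, c1 = blocks.pop()
            v0, w0, c0 = blocks.pop()
            wt = w0 + w1
            blocks.append([(w0 * v0 + w1 * v1) / wt, wt, c0 + c1])
    out = []
    for v, _, c in blocks:
        out.extend([v] * c)
    return out
```

For example, `pava([3, 1, 2])` pools the violating pair 3, 1 into their average 2, yielding a nondecreasing fit.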

After the study of the fundamental estimation and testing procedures, constrained inference has also been applied to multivariate techniques. Among the multivariate analysis methods that have included parameter restrictions are canonical correlation analysis, correspondence analysis, cluster analysis and discriminant analysis. Das and Sen (1994) introduce parameter restrictions into the computation of canonical correlations. Their analysis initially focuses on non-negativity restrictions, and is later extended to more general inequality restrictions, and also to restrictions on only some of the coefficients. When the restrictions are of non-negativity, Das and Sen (1996) establish asymptotic normality results for the estimators (obtained from the sample covariance matrix). Kuriki (2005) develops tests for independence in two-way ordered contingency tables within a general framework of canonical correlation analysis with inequality restrictions, and discusses some numerical methods for fitting order-restricted models. Several authors have included restrictions in correspondence analysis. Böckenholt and Böckenholt (1990) and Takane et al. (1991) consider various ways of incorporating linear restrictions, Groenen and Poblome (2003) apply order-restricted correspondence analysis to the problem of archaeological dating, and van de Velden et al. (2009) study its behaviour through a simulation study. As for cluster analysis, Peddada et al. (2003) propose an algorithm that uses order-restricted statistical inference to select genes and group them into clusters according to their time-course evolution. In the same setting, Liu et al. (2009) propose a fast algorithm with good accuracy and robustness properties.

The contributions in this dissertation are developed in the context of discriminant analysis with additional information, so the incorporation of additional information into discriminant analysis through parameter restrictions is reviewed in greater detail.

Suppose we have a finite number of populations Π1, ..., Πk, whose existence is known a priori. Any given observation belongs to one, and only one, of the k populations. Let Y be the categorical variable identifying the population to which the observation belongs, where Y = i means that the observation belongs to population Πi, i = 1, ..., k. Likewise, let X = (X1, ..., Xp) be a p-dimensional vector containing p characteristics associated with the observation. Basically, the discriminant analysis problem consists of predicting Y from X, that is, of classifying the observation into one of the k populations using the information contained in X. Discriminant analysis is also known as supervised classification since, in general, the distribution functions of the populations are unknown and a training sample of size n, M_n = {(X_i, Y_i), i = 1, ..., n}, is available, formed by observations for which we know both the p-dimensional vector of classifiers, X_i, and the population they belong to, Y_i, for i = 1, ..., n. The goal is to find a classification rule that is optimal in some sense.

Discriminant analysis has a very large number of applications: biological classification (Sneath and Sokal, 1973), medical diagnosis and pattern recognition (McLachlan, 2004), drug development (Wang, 2008), optical character recognition (Bunke and Wang, 1997), face recognition (Li and Jain, 2011), speech and handwritten character recognition (Hastie et al., 1995), medical imaging (Sonka and Fitzpatrick, 2004), biometric identification (Boulgouris et al., 2009), credit scoring (Thomas et al., 2002), quantitative structure-activity relationship models (Benigni, 2013), geostatistics (Fraley and Raftery, 2002), natural language processing (Kao and Poteet, 2007), document classification (Berry, 2004), search engines (Chang et al., 2001)...

Some of the ideas associated with discriminant analysis go back to the 1920s, in particular Pearson's (1926) coefficient of racial likeness and Mahalanobis' (1927) positional index and various measures of distance between groups. The first work on discriminant analysis is due to Fisher (1936), dealing with a taxonomic problem with two populations. Rao (1948) generalized it to more than two populations. Since then, the literature is extensive; significant examples include Lachenbruch and Goldstein (1979), Hand (1981), McLachlan (2004) and Huberty and Olejnik (2006).

In this dissertation, discriminant analysis is studied from a parametric perspective when additional information on the parameters is available. There is not much literature on this topic. An initial reference is Long and Gupta (1998), who deal with the problem of two normal populations with the same covariance matrix. Their results are of very limited applicability, since they are confined to the case where the covariance matrix equals the identity matrix and the (order) restrictions state that one of the means is greater than or equal to the other, component by component. Their rules are built by replacing, in Fisher's classification rule, the usual estimators of the means by estimators that take the restrictions into account, and they show that some improvements over Fisher's rule are achieved. Later, Fernández et al. (2006) proposed rules for more general situations, based on new estimators of the difference of the means that take into account the additional information that this difference belongs to a closed, convex, polyhedral cone. These rules behave better when the covariance matrix is common to the populations and known, and simulations confirm that this also happens when the covariance matrix is common but unknown. Salvador et al. (2008) study robustness properties of the classification rules that incorporate the additional information against contamination of the training sample, and prove that these rules not only have a smaller total misclassification probability but also protect against some types of contamination.

From a nonparametric perspective, Dykstra et al. (1999) present nonparametric classification procedures with order restrictions between the response variable and the explanatory variables. Along these lines, some works have recently appeared on classification when the response variable is ordinal and there are order restrictions between the explanatory variables and the response. One example is Auh and Sampson (2006), who propose an isotonic logistic discrimination procedure that generalizes linear logistic discrimination, relaxing linearity and requiring only monotonicity in the discriminators. Kotlowski and Slowinski (2013) analyze nonparametric learning procedures with monotonicity constraints from a statistical point of view, through a theoretical analysis confirmed by simulations.

To close this introduction, we give a brief summary of the three articles that make up this dissertation. In Conde et al. (2005) we consider a classification rule for exponential populations when the parameters are known to be ordered, and we prove that this rule behaves better than the likelihood-based one. In addition, we study its behaviour in each of the populations by comparing the misclassification probabilities, and we consider the incorporation of type II censored data. Finally, we evaluate the rule for more than two populations by means of a simulation study. The study of exponential populations shows that classification rules can be improved in contexts other than the normal distribution.

In Conde et al. (2012) we make a non-trivial generalization to more than two populations of the restricted classification rules that appear in Fernández et al. (2006), and we provide empirical evidence that the proposed restricted methodology behaves better than the existing unrestricted methodology. We apply the methodology to a cancer diagnosis and treatment setting, where patients are to be classified into one of several diagnostic groups on the basis of certain biological markers, and it is known that some of the predictor variables take, on average, higher or lower values in the patients of some groups than in those of others. This kind of data is very common in practice, for example when dealing with protein or microarray data. We use a data set of this type, specifically on bladder cancer, and verify that the new restricted rules behave better than the unrestricted ones. The improvement is all the more important the more restrictions the training sample fails to satisfy.

In Conde et al. (2013) we compare the rules presented in Fernández et al. (2006) with those based on the shrinkage estimators of the means proposed by Tong et al. (2012), with respect to several of the criteria most used in practice: total misclassification probability, area under the ROC curve, calibration and refinement. The comparison is motivated by the fact that the rules of Fernández et al. (2006) can be seen as contractive rules, since they are based on projections and projection is a contractive operator. We verify that these rules compete well, improving the results in several of the scenarios. Furthermore, results about the apparent error rate are proved that show the need for new estimators of the true error rate of the above rules, and four new bootstrap estimators are proposed. We compare the behaviour of these estimators in a simulation study and apply them to a bladder cancer data set.

Next, sections 2, 3 and 4 detail the objectives (specifying initial hypotheses, background and specific objectives), the methodology, the results and the most relevant conclusions of these three articles. Finally, section 5 describes the work in progress and future work.
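Fisher's rule for two normal populations with a common covariance matrix, the starting point of the restricted rules of Long and Gupta (1998) and Fernández et al. (2006) discussed above, can be sketched with plug-in estimates as follows (an illustrative sketch assuming equal priors and misclassification costs; function and variable names are ours):

```python
import numpy as np

def fisher_rule(X1, X2):
    """Plug-in Fisher linear discriminant for two populations with a
    common covariance matrix: classify z into population 1 when
    (z - (m1 + m2)/2)' S^{-1} (m1 - m2) > 0, where m1, m2 are the
    sample means and S the pooled sample covariance matrix."""
    X1, X2 = np.asarray(X1), np.asarray(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # pooled covariance estimate
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    Sinv = np.linalg.inv(S)

    def classify(z):
        score = (np.asarray(z) - (m1 + m2) / 2) @ Sinv @ (m1 - m2)
        return 1 if score > 0 else 2

    return classify
```

The restricted rules replace the sample means here by order-restricted estimators; the rule's form stays the same.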

2 A classification rule for ordered exponential populations

David Conde, Miguel A. Fernández, Bonifacio Salvador. Journal of Statistical Planning and Inference 135 (2), 339-356 (2005). 2012 impact factor: 0.713. Position in the Statistics and Probability category: 72/117.

2.1 Objectives

2.1.1 Initial hypotheses

In many real problems it makes sense to impose restrictions on the parameters using the available additional information, and thus to improve on the results obtained with conventional methods. This is the case in discriminant analysis. Up to the date of this work, the incorporation of additional information into classification rules through parameter restrictions had not been investigated in depth. Moreover, while the literature on discriminant analysis for normal populations is abundant, it is much scarcer in the non-normal case. The exponential distribution is of particular interest, since it frequently appears in practical contexts such as reliability or survival analysis. Examples of discriminant analysis with exponential populations can be found in Zelen (1966), in the context of cancer research, and in Basu and Gupta (1974), in reliability. In this work we study discriminant analysis for exponential populations when there are simple order restrictions on their parameters, and how to improve the classification rules in this context.

2.1.2 Background

Some references on classification in the exponential case, with applications, are Bhattacharya and Das Gupta (1964), Basu and Gupta (1974) and Adegboye (1993). Interesting references on the estimation of ordered parameters in the non-normal case are Kaur and Singh (1991) and Vijayasree and Singh (1991, 1993) for exponential populations, and Chang and Shinozaki (2002) for gamma populations. Here both situations are combined, considering classification for exponential populations with simple order restrictions on the parameters.

2.1.3 Specific objectives

The specific objectives pursued are:

1. To define classification rules for the model of two exponential populations with simple order restrictions on the parameters, and to study their consistency.
2. To study the behaviour of these rules in comparison with the likelihood-based one, comparing the misclassification probability both globally and in each of the two populations.
3. To incorporate type II censored data into the above rules: definition and behaviour.
4. To define classification rules when there are more than two exponential populations with simple order restrictions on the parameters, and to study their consistency.
5. To assess by simulation the gain of the rules defined in the previous item with respect to the usual rule based on the likelihood ratio.

2.2 Methodology

In this work we introduce order restrictions on the parameters in the discriminant analysis of exponential populations, with the aim of improving the classification rules in this context. The loss function used is the 0-1 loss, so the expected loss is given by the misclassification probability, and the goal is to improve on the rules that do not take the additional information into account. To this end, we look for analytical results that allow us to obtain estimators and establish their properties, and we try to establish the conditions under which the new restricted rules are consistent and provide better solutions than the unrestricted ones. In addition, intensive computation is used throughout: it is necessary for the implementation and application of the new procedures, it is essential as support for the hypotheses to be verified analytically and, in some cases, due to the complexity of the calculations (as when the number of populations is large), it is the only feasible way to tackle the problem.

2.3 Results

2.3.1 Rules for two populations

Let $\Pi_1$ and $\Pi_2$ be two populations with one-parameter exponential distributions whose density function is

$$f_i(x) = \lambda_i e^{-\lambda_i x}, \quad x > 0, \ i = 1, 2.$$

Suppose that the prior probabilities and the misclassification costs are equal. The optimal Bayes classification rule is

$$R: \text{classify } z \text{ into } \Pi_1 \text{ iff } (z - x_0)(\lambda_1 - \lambda_2) < 0,$$

where $x_0 = \frac{\ln \lambda_1 - \ln \lambda_2}{\lambda_1 - \lambda_2}$. If $\lambda_1$ and $\lambda_2$ are unknown and a training sample from each population, of sizes $n_1$ and $n_2$ respectively, is available, we can obtain the rule by estimating the parameters. The estimated rule is

$$R_U: \text{classify } z \text{ into } \Pi_1 \text{ iff } (z - \hat{x}_0)(\hat{\lambda}_1 - \hat{\lambda}_2) < 0,$$

where $\hat{\lambda}_1$ and $\hat{\lambda}_2$ are the MLEs of $\lambda_1$ and $\lambda_2$ and $\hat{x}_0 = \frac{\ln \hat{\lambda}_1 - \ln \hat{\lambda}_2}{\hat{\lambda}_1 - \hat{\lambda}_2}$.

If we know that $\lambda_1 \geq \lambda_2 > 0$, i.e. that the expected value in $\Pi_1$ is less than or equal to that in $\Pi_2$, we can incorporate this information into the rule. The rule we propose, which we call the ordered rule, is

$$R_O: \text{classify } z \text{ into } \Pi_1 \text{ iff } z - \hat{x}_0 < 0,$$

since, given the information that $\lambda_1 \geq \lambda_2 > 0$, the second factor is not needed.

Suppose we have $m$ observations $z_1, \ldots, z_m$ from one of the populations $\Pi_1$ or $\Pi_2$ (all the observations come from the same population) and we want to classify them (obviously, all into the same population). Let $P_*(i/j)$ be the probability that the $m$ observations from population $\Pi_j$ are classified into population $\Pi_i$ when rule $R_*$ is used, $i, j = 1, 2$, and let $P_*(MC) = \frac{1}{2}(P_*(1/2) + P_*(2/1))$ be the total misclassification probability of $R_*$.

Theorem 1. Let $z_1, \ldots, z_m$ be a random sample from $\Pi_1$ or $\Pi_2$. If $m, n_1, n_2 > N$, then $P_O(MC) \to 0$ as $N \to \infty$.
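The rules $R_U$ and $R_O$ above admit a direct plug-in implementation; a minimal sketch (assuming the two estimated rates differ, so the cut-off is well defined; function and variable names are ours):

```python
import math

def exponential_rules(train1, train2):
    """Plug-in classification rules for two exponential populations:
    the unrestricted rule R_U and the ordered rule R_O, built from
    training samples of each population.  The ordered rule assumes
    the order restriction lambda1 >= lambda2.  Requires the two
    estimated rates to differ (an event of probability one)."""
    # MLEs of the rates: lambda_hat = sample size / sample total
    l1 = len(train1) / sum(train1)
    l2 = len(train2) / sum(train2)
    # estimated cut-off point x0_hat
    x0 = (math.log(l1) - math.log(l2)) / (l1 - l2)

    def rule_U(z):
        # unrestricted: uses the sign of (lambda1_hat - lambda2_hat)
        return 1 if (z - x0) * (l1 - l2) < 0 else 2

    def rule_O(z):
        # ordered: the restriction makes the sign factor unnecessary
        return 1 if z < x0 else 2

    return rule_U, rule_O
```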

This result shows that the ordered rule is consistent, in the sense that the misclassification probability tends to 0 as the sample sizes m, n1 and n2 grow. We now give a result showing that the new ordered rule behaves better than the usual rule: its misclassification probability is smaller.

Theorem 2. If λ1 ≥ λ2 > 0, then PO(MC) ≤ PU(MC). Moreover, PO(MC) < PU(MC) when λ1 > λ2 > 0.
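The two estimated rules can be sketched as follows, assuming λ̂1 ≠ λ̂2 so that the cutoff x̂0 is well defined (illustrative code, not from the thesis; the function names are ours):

```python
import math

def mle_rate(sample):
    """MLE of the exponential rate: lambda-hat = 1 / sample mean."""
    return len(sample) / sum(sample)

def cutoff(l1, l2):
    """Bayes cutoff x0 = (ln l1 - ln l2) / (l1 - l2); assumes l1 != l2."""
    return (math.log(l1) - math.log(l2)) / (l1 - l2)

def rule_U(z, l1, l2):
    """Unrestricted rule RU: population 1 iff (z - x0)(l1 - l2) < 0."""
    return 1 if (z - cutoff(l1, l2)) * (l1 - l2) < 0 else 2

def rule_O(z, l1, l2):
    """Ordered rule RO: uses the knowledge l1 >= l2, so only the sign of z - x0 matters."""
    return 1 if z < cutoff(l1, l2) else 2
```

Note that RU and RO coincide whenever the estimates satisfy the known order λ̂1 > λ̂2; they differ only when the training samples invert it, which is exactly where the ordered rule gains.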

We now compare the misclassification probabilities of rules RO and RU within each of the two populations.

Theorem 3. Let n1 = n2 = n and m = 1.
1. If n > 1, PO(1/1) ≥ PU(1/1) for all λ1 ≥ λ2 > 0.
2. If n = 1, there exists δ0 ∈ (0, 1) such that
   PO(1/1) ≥ PU(1/1) for all λ1, λ2 > 0 with 0 < λ2/λ1 ≤ δ0;
   PO(1/1) < PU(1/1) for all λ1, λ2 > 0 with δ0 < λ2/λ1 < 1.

Theorem 4. Let n1 = n2 = n and m = 1.
1. If n = 1, PO(2/2) ≥ PU(2/2) for all λ1 ≥ λ2 > 0.
2. If n > 1, there exist δ0* and δ1* in (0, 1) such that
   PO(2/2) ≥ PU(2/2) for all λ1, λ2 > 0 with 0 < λ2/λ1 ≤ δ0*;
   PO(2/2) < PU(2/2) for all λ1, λ2 > 0 with δ1* < λ2/λ1 < 1.

When m > 1 the calculations become harder. However, we know from simulations that, as m grows, PO(1/1) − PU(1/1) decreases and PO(2/2) − PU(2/2) increases, while the overall difference PO(MC) − PU(MC) decreases. Hence the rule RO improves on RU even more when m > 1 than when m = 1.

2.3.2 Rules for two populations with censored data

Type II censoring arises in experiments that observe the failure times of n units and stop when a predetermined number r < n of units have failed: the n − r units that have not failed are right-censored. The MLE of the parameter λ of an exponential distribution from a random sample X1, ..., Xn is:

λ* = r / (Σ_{i=1}^{r} X_(i) + (n − r) X_(r)).

It is immediate to derive the previous rules under Type II censoring:

RU*: Classify z in Π1 iff (z − x0*)(λ1* − λ2*) < 0,
RO*: Classify z in Π1 iff (z − x0*) < 0,

where x0* = (ln λ1* − ln λ2*)/(λ1* − λ2*). Under these conditions we also prove that RO* is consistent, and that PO*(MC) ≤ PU*(MC) for λ1 ≥ λ2 > 0.

2.3.3 Rules for more than two populations

When there are k > 2 populations, the definitions of the rules are more involved. Suppose λ1 ≥ ... ≥ λk, let λ̂1, ..., λ̂k be the MLEs of the parameters in each population, and let λ̂(1) ≤ ... ≤ λ̂(k) be the ordered estimates. If λ̂(i) = λ̂j, then:

RU: Classify z in Πj iff (ln λ̂(i) − ln λ̂(i+1))/(λ̂(i) − λ̂(i+1)) ≤ z < (ln λ̂(i−1) − ln λ̂(i))/(λ̂(i−1) − λ̂(i)),

with the conventions that the lower bound is 0 when i = k and the upper bound is ∞ when i = 1. The ordered rule RO assigns z by the same intervals, but to Πk−i+1, the population that the known order λ1 ≥ ... ≥ λk associates with the i-th smallest estimate. As in the case of 2 populations, even for large values of k, RO behaves globally better than RU. For the particular case of three populations and equal sample sizes n1 = n2 = n3 = n, with n ≥ 1, the simulations indicate that PO(1/1) ≥ PU(1/1), PO(2/2) ≤ PU(2/2) and PO(3/3) ≥ PU(3/3).
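A sketch of the ordered rule for k populations, assuming the supplied rates already respect the known order λ1 ≥ ... ≥ λk (so Π1, with the largest rate, owns the smallest observations); illustrative code with names of our own choosing:

```python
import math

def cutpoints(rates_desc):
    """Bayes cutpoints between consecutive exponential populations; rates are
    given in decreasing order, so the cutpoints come out in increasing order."""
    return [(math.log(a) - math.log(b)) / (a - b)
            for a, b in zip(rates_desc, rates_desc[1:])]

def classify_ordered(z, rates_desc):
    """Return the 1-based index of the population whose interval contains z."""
    for idx, cp in enumerate(cutpoints(rates_desc), start=1):
        if z < cp:
            return idx
    return len(rates_desc)
```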

2.4 Conclusions

We propose a classification rule RO for two populations Π1 and Π2 with univariate exponential distributions with parameters λ1 and λ2, respectively, when it is known that, on average, the observations from Π1 are smaller than or equal to those from Π2, that is, λ1 ≥ λ2.

We first prove that this rule is consistent: the misclassification probability of a test sample whose m elements all belong to one of the two populations tends to 0 as the sizes of the test and training samples tend to infinity. We then prove that the new ordered rule behaves better than the usual rule RU (based on the MLEs), which ignores the additional information contained in the restrictions: the misclassification probability of the ordered rule is smaller than or equal for any pair of values λ1 ≥ λ2 > 0, and strictly smaller for λ1 > λ2 > 0. This holds both for a single observation to classify and for a sample of observations all coming from the same population.

It is worth noting that the misclassification probabilities of the rules RO and RU depend on λ1 and λ2 only through the ratio λ2/λ1, and also on the sample sizes, whether equal (n1 = n2 = n) or not. When a test sample of size m > 1 is classified, we have verified through simulations that the probability of correct classification is larger for the ordered rule, the difference with respect to the unrestricted rule growing with m.

We have studied the behavior of this rule RO in each population, comparing the probabilities of correct classification, and we have verified, for equal sample sizes n1 = n2 = n and a single test observation, that although the ordered rule behaves better globally, there are some parameter configurations for which this property fails in one of the two populations. In the most usual case n > 1, the ordered rule always behaves better for population Π1 but not for Π2; the opposite happens when n = 1, a case of little practical interest. The specific values determining these configurations are of practical interest and can be obtained numerically; a table with these values for different sample sizes is provided.

In settings such as reliability analysis, where the exponential distribution is common, the data often exhibit Type II censoring. In this case, the MLE of each parameter λi, i = 1, 2, has a known closed form, and the previous rules are defined for censored data by plugging in these estimators instead of the uncensored ones. It is easy to check that the ordered rule again outperforms the unordered one, in the sense of a smaller misclassification probability, when λ1 ≥ λ2 > 0.

Finally, we define the rule for more than two populations and evaluate its behavior through a simulation study, which indicates that the ordered rule behaves globally better than the unordered one whatever the number of populations.


3 Classification of samples into two or more ordered populations with application to a cancer trial

David Conde, Miguel A. Fernández, Cristina Rueda, Bonifacio Salvador. Statistics in Medicine 31, 3773-3786 (2012). 2012 impact factor: 2.044. Rank in the Statistics and Probability category: 11/117.

3.1 Objectives

3.1.1 Starting hypotheses

In many applications, such as those related to cancer diagnosis and treatment, one wants to classify patients into one of several diagnostic groups on the basis of certain biological markers. It is often known that some of the predictor variables take, on average, larger or smaller values in the patients of some groups than in those of other groups. In many cases this additional information is directly available, and in many others it can be obtained by looking into the problem. Traditional classification rules that ignore the additional information can show high misclassification rates, especially when the number of groups is larger than two.

Here we aim to design classification rules for more than two normal populations with a common covariance matrix, in situations where there is information on the order of the means of some predictors, with a lower misclassification rate than the unrestricted rules. From a theoretical point of view, what we pursue is an extension, by no means trivial, of the order-restricted classification rules for two populations. Note that the kind of data we will use is very common in practice, for example when dealing with protein or microarray data or, in general, whenever a classification rule must be derived from a small data set and additional information is available.

3.1.2 Background

The most relevant previous results on the discrimination problem concern the case of two normal populations. We highlight the work of Fernández et al. (2006), where classification rules are designed for two ordered normal populations, with unknown but equal covariance matrices, that improve on Fisher's rule by incorporating new estimators. Salvador et al. (2008) study the robustness properties of the rules of Fernández et al. (2006).


Here we aim to generalize to more than two populations the methodology available for two populations in Fernández et al. (2006). It is important to point out that the extension to more than two populations is not straightforward: with only two populations there are few ways to order them, but with more than two populations a different type of order may appear for each predictor variable. Moreover, in Fernández et al. (2006) the definition of the restricted rules was based on the difference of the means of the two populations, whereas with more than two populations there are multiple pairwise differences, so the extension is by no means direct.

3.1.3 Specific objectives

The specific objectives pursued are:

1. Definition of estimators of the means of more than two normal populations with order restrictions among the parameters, suitable for discrimination.
2. Definition of classification rules using the estimators defined above, and proof that they generalize the existing rules for two populations.
3. Design of simulation experiments complete enough to show that the new rules improve on the rules that ignore the additional information.
4. Application of the classification rules to real problems.

3.2 Methodology

In this work we consider introducing order restrictions on the parameters in discriminant analysis with more than two normal populations. Since the problem has practical implications, the main interest of the rules lies in their ability to classify observations correctly, so the estimators, and the rules defined from them, must take this into account. The loss function used is 0-1 loss, so the expected loss is the probability of misclassification, which is compared with that of the rules ignoring the additional information. To that end, intensive computation is used: it is necessary to implement and apply the new procedures, it supports the hypotheses to be proved, and in some cases, owing to the complexity of the calculations when the number of populations is large, it is the only way to tackle the problem.


3.3 Results

3.3.1 Estimators and rules for more than two populations

Suppose we have k populations Π1, ..., Πk. Let X = (X1, ..., Xp)' be a vector of predictor variables used to classify the observations. We assume normality, i.e., if the observation under consideration comes from population Πi, then X ∼ Np(µi, Σ), i = 1, ..., k. We assume different means in each population but a common covariance matrix Σ, 0-1 loss and equal prior probabilities for the populations (the extension to unequal priors is trivial). In this situation, the optimal Bayes rule is:

Classify z in Πi iff (z − µi)' Σ^{−1} (z − µi) ≤ (z − µj)' Σ^{−1} (z − µj), j = 1, ..., k,

which, replacing the unknown parameters µ1, ..., µk and Σ by their estimators X̄1, ..., X̄k and S, becomes the usual Fisher rule:

Classify z in Πi iff (z − X̄i)' S^{−1} (z − X̄i) ≤ (z − X̄j)' S^{−1} (z − X̄j), j = 1, ..., k,

where S is the pooled sample covariance matrix:

S = Σ_{j=1}^{k} (nj − 1) Sj / (n − k),

with nj and Sj the sample size and sample covariance matrix of the sample from population Πj, j = 1, ..., k.

Next, classification rules are defined that take the additional information into account, incorporating it by restricting the parameter space. We propose to work in an extended space R^{k×p} and, using the available additional information, to define a restriction cone C in R^{k×p} to which the mean vector belongs: (µ1', ..., µk')' ∈ C. Projecting the sample mean vector onto the cone with a suitable metric ensures that the estimator of the means is as close to the sample mean vector as possible while satisfying the restrictions that encode the additional information. The metric used for the projection is the one given by the (k×p) × (k×p) matrix:

S* = diag(S/n1, ..., S/nk).

Order restrictions are easily expressed through cones in R^{k×p}. In our multivariate case, if the restrictions affect a set of variables T ⊆ {1, ..., p}, the most common cones can be written:
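The pooled covariance matrix and Fisher's rule above can be sketched as follows (an illustrative implementation; the function names are ours):

```python
import numpy as np

def pooled_cov(samples):
    """Pooled covariance S = sum_j (n_j - 1) S_j / (n - k), over a list of (n_j x p) arrays."""
    n = sum(len(s) for s in samples)
    k = len(samples)
    p = samples[0].shape[1]
    acc = np.zeros((p, p))
    for s in samples:
        acc += (len(s) - 1) * np.cov(s, rowvar=False)
    return acc / (n - k)

def fisher_classify(z, means, S):
    """Fisher rule: assign z to the population with the smallest Mahalanobis distance."""
    Sinv = np.linalg.inv(S)
    d2 = [float((z - m) @ Sinv @ (z - m)) for m in means]
    return int(np.argmin(d2)) + 1  # 1-based population index
```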

- Positive octant:
  C_O+ = {x ∈ R^{k×p} : x_{t+np} ≥ 0, t ∈ T, n = 0, 1, ..., k − 1}

- Simple order:
  C_SO = {x ∈ R^{k×p} : x_t ≤ x_{t+p} ≤ ... ≤ x_{t+(k−1)p}, t ∈ T}   (1)

- Tree order:
  C_TO = {x ∈ R^{k×p} : x_t ≤ x_{t+np}, t ∈ T, n = 1, ..., k − 1}   (2)

- Loop order:
  C_LO = {x ∈ R^{k×p} : x_t ≤ [x_{t+p}, ..., x_{t+(k−2)p}] ≤ x_{t+(k−1)p}, t ∈ T}

- Unimodal order:
  C_UO = {x ∈ R^{k×p} : x_t ≤ ... ≤ x_{t+(r−1)p} ≥ x_{t+rp} ≥ ... ≥ x_{t+(k−1)p}, t ∈ T}

- Star shaped:
  C_SS = {x ∈ R^{k×p} : x_t ≤ (x_t + x_{t+p})/2 ≤ (x_t + x_{t+p} + x_{t+2p})/3 ≤ ..., t ∈ T}

We now define a family of estimators µ̂iγ, γ ∈ [0, 1], i = 1, ..., k, using projections.

Definition 5. For γ ∈ [0, 1], let µ̂γ ∈ R^{k×p} be the limit as l → ∞ of the following iterative procedure:

µ̂γ^{(l)} = P_{S*^{−1}}(µ̂γ^{(l−1)} | C) − γ P_{S*^{−1}}(µ̂γ^{(l−1)} | C^P), l = 1, 2, ...,

where µ̂γ^{(0)} = (X̄1', ..., X̄k')' ∈ R^{k×p}, P_{S*^{−1}}(Y | C) is the projection of Y onto the cone C with the metric given by S*^{−1}, and C^P = {y ∈ R^{k×p} : y' S*^{−1} x ≤ 0, ∀x ∈ C} is the polar cone of C.

Theorem 6. The above procedure converges and, moreover:

µ̂γ = lim_{l→∞} µ̂γ^{(l)} ∈ C, ∀γ ∈ [0, 1].
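A minimal sketch of the iterative procedure in Definition 5 for a single variable (p = 1) under the simple-order cone (1), assuming the identity metric so that the polar-cone projection reduces to the Moreau decomposition P(x | C^P) = x − P(x | C); `pava` is our own helper, not thesis code:

```python
def pava(y):
    """Projection onto the nondecreasing cone (isotonic regression, equal weights)
    via the pool-adjacent-violators algorithm."""
    out = []
    for v in y:
        out.append([v, 1])                 # [block mean, block size]
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, n2 = out.pop()
            m1, n1 = out.pop()
            out.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    return [m for m, n in out for _ in range(n)]

def restricted_estimate(y, gamma, iters=200):
    """Iterate mu^(l) = P(mu^(l-1)|C) - gamma * P(mu^(l-1)|C^P),
    with P(x|C^P) = x - P(x|C) under the identity metric."""
    mu = list(y)
    for _ in range(iters):
        proj = pava(mu)
        mu = [(1 + gamma) * p - gamma * m for p, m in zip(proj, mu)]
    return mu
```

For example, starting from (3, 1, 2), γ = 0 gives the boundary point (2, 2, 2), while γ = 1 converges to a point in the interior of the cone, as Theorem 6 and Figure 1 suggest.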

28

Figure 1 illustrates why the iterative procedure is needed for some types of restriction cones: in general, a single step of the procedure does not guarantee membership of the cone.

Figure 1: Examples of the iterative procedure for the estimation of a mean vector with an acute cone (a) and a non-acute cone (b).

Let µ̂iγ = ((µ̂γ)_{(i−1)p+1}, (µ̂γ)_{(i−1)p+2}, ..., (µ̂γ)_{ip})', i = 1, ..., k. The new rules, denoted Rµ(γ), are defined as:

Classify z in Πi iff (z − µ̂iγ)' S^{−1} (z − µ̂iγ) ≤ (z − µ̂jγ)' S^{−1} (z − µ̂jγ), j = 1, ..., k.

If γ = 0, µ̂0 is the standard MLE of order-restricted inference: the projection onto the restriction cone. When γ > 0, the estimators belong to the interior of the cone (see Figure 1). The rules Rµ(γ) with γ > 0 are found to behave better than the rules Rµ(0). The following result shows that these rules generalize those defined for two populations in Fernández et al. (2006), denoted Rδ(γ):

Theorem 7. If k = 2, then Rδ(γ) and Rµ(γ) are equivalent for γ ∈ [0, 1].

3.3.2 Behavior of the rules in simulations

To study the behavior of the rules Rµ(γ) we carried out a simulation study. For simplicity, we focus on the case of k = 3 populations with N3(µi, Σ) distributions, i = 1, 2, 3, under the two order restrictions most common in practice: the simple order (1) and the tree order (2). For each of them, we consider scenarios corresponding to several configurations of the mean vectors and several covariance matrices intended to cover the correlation coefficient values most common in practice.

A first conclusion is that the rules Rµ(γ) behave better than the unrestricted rule in all scenarios and under both order restrictions. A second conclusion is the advisability of using values γ > 0: the simulations show that Rµ(1) behaves better than Rµ(0) in practically all situations. However, as expected, the differences are not very large, since when the training samples satisfy the restrictions the rules coincide with Fisher's. We have analyzed the results according to the number of restrictions satisfied by the training samples, and we have verified that the improvements are larger when the training samples violate several of the restrictions. The simple order restrictions are more restrictive than the tree order ones, and the simulation results are better for the simple order, whatever the covariance matrix. We can conclude that the more precise the available additional information, the better the rules.

3.3.3 Application

Finally, after obtaining the corresponding authorization, we applied the rules to a bladder cancer data set provided by Proteomika S.L. and Laboratorios SALVAT, S.A., collected with the goal of developing a non-invasive in vitro test for the diagnosis of bladder cancer recurrence, thereby complementing cystoscopies and reducing their number. Patients are classified into five levels according to the cystoscopy results. The first level is the control (absence of bladder cancer); the others, Ta, T1G1, T1G3 and T2, are progressively more advanced cancer levels: the T stage describes the size and extension of the tumor, and the grade G relates to the appearance of the cells under the microscope.

We considered two data sets, both obtained with Luminex xMAP technology to measure the protein values. The first data set we received, D1, contained information on 141 patients and 11 proteins, together with the true level of each patient. This initial data set is the one used to build the rules: the training sample. The second data set, D2, is the test sample, used to measure the performance of the classification rules built with the first one. It was received later and contains information on another 149 patients and the same 11 proteins, as well as the true disease level. In both samples we took logarithms so that the variables have an approximately normal distribution.

Based on the additional information provided, we used for the analysis only three proteins considered relevant, denoted P1, P2 and P3: the mean values of these three proteins increase with the disease level. Moreover, simple studies using t-tests indicated that not all the proteins were significant. Because the sample sizes at some levels were very small compared with the others, and because it seemed clinically reasonable, we reduced the levels to three: the control group and the groups Ta + T1G1 and T1G3 + T2. We have verified that the new rules Rµ(γ) improve not only on Fisher's rule but also on many other classification methods, such as support vector machines, k nearest neighbors, naive Bayes classifiers, neural networks and even ordinal classification, and that the new rules behave much better than the others both for γ = 0 and, above all, for γ = 1, as can be seen in the article.

3.4 Conclusions

The results of the simulations and the example show that the new restricted rules defined in this article behave better than the unrestricted ones, and that the improvement can be very significant when the training sample does not satisfy some of the restrictions. The fact that the restricted rules coincide with the unrestricted ones when the training sample satisfies the restrictions allows us to recommend the proposed rules in practice, since nothing is ever lost with respect to the usual unrestricted rule. To facilitate their use, we have compiled an R package for discriminant analysis with additional information, named dawai, which is discussed later.

We have shown that the more precise the additional information, the more powerful the rules that incorporate it. We therefore recommend incorporating as much information as is available. This kind of information is frequent, for example, in medical diagnosis problems, where the predictor variables are usually isotonically related to the response variable.

An important issue is the estimation of the misclassification probability. This question remains open in this article, although there are hints that the apparent error rate of these rules behaves differently from that of the usual rules, so new procedures correcting the bias of the apparent error rate of the restricted rules may be needed.


4 Performance and estimation of the true error rate of classification rules built with additional information. An application to a cancer trial

David Conde, Bonifacio Salvador, Cristina Rueda, Miguel A. Fernández. Statistical Applications in Genetics and Molecular Biology 12 (5), 583-602 (2013). 2012 impact factor: 1.717. Rank in the Statistics and Probability category: 18/117.

4.1 Objectives

4.1.1 Starting hypotheses

As already discussed, classification rules that incorporate the additional information available in the problem through restrictions on the parameters behave better than rules that ignore this information, since they have a smaller total misclassification probability than Fisher's rule (Fernández et al., 2006). In this article we compare these rules with the rules based on the shrinkage estimators of the means proposed by Tong et al. (2012), in high-dimensional scenarios, with respect to several of the criteria most used in practice. The comparison is motivated by the fact that our rules, being based on projections, can be viewed as contractive rules.

A second objective of this article is to complete the study of the restricted rules in the two-population case by evaluating their performance for a given training sample, which is more useful in practice than the (unconditional) misclassification probability. The true error rate is the misclassification probability conditional on the available training sample. The best way to estimate the true error is with a test sample, which is usually not available in practice. We therefore aim to find estimators of the true error rate based on the training sample and the available additional information.

4.1.2 Background

As mentioned above, additional information is commonly available in applications, and incorporating this kind of information into classification rules has been shown to improve their performance. Since the rules of Fernández et al. (2006) can be viewed as contractive rules, being based on projections and the projection being a contractive operator, we start by comparing them with the rules proposed by Tong et al. (2012) with respect to several of the most used criteria: total misclassification probability, area under the ROC curve (see Pepe et al., 2006), and calibration and refinement (introduced by Kim and Simon, 2011).

The other question of interest considered in this article, the estimation of the true error rate, has been widely studied in the literature for rules such as Fisher's, quadratic discriminant analysis, nearest neighbors or random forests. Parametric and nonparametric estimators of the true error have been proposed, and the nonparametric estimators based on resampling behave well for those rules. Schiavo and Hand (2000) summarize the work done to date; more recent references are Fu et al. (2005) and Kim (2009).

4.1.3 Specific objectives

The specific objectives pursued are:

1. Comparison of the restricted rules with the contractive rules of Tong et al. (2012) in high-dimensional scenarios.
2. Derivation of properties of the apparent error rate of the restricted rules.
3. Definition of new estimators of the true error rate based on the additional information.
4. Study of the behavior of the new true error rate estimators against classical estimators that ignore the additional information.
5. Application of the classification rules and the new true error rate estimators to real data.

4.2 Methodology

The first objective is addressed using several of the criteria most used nowadays, namely total misclassification probability, area under the ROC curve, and calibration and refinement, by means of simulations, since, because of the high dimensionality, this is the only way to tackle the problem.

- The area under the ROC curve is frequently (though not exclusively) used in medical contexts. A diagnosis is positive or negative depending on whether the corresponding probabilistic classifier p̃(u) is greater than, or less than or equal to, a given threshold value in the interval (0, 1). For each threshold, the sensitivity is defined as the probability of a true positive and the specificity as the probability of a true negative. The ROC curve is the plot of sensitivity against 1−specificity over all thresholds, and the area under the curve is the area between the curve and the 1−specificity axis. When samples of sizes n1 and n2 from the two populations are available, the area under the ROC curve (AUC) can be estimated through the Mann-Whitney U statistic (see Pepe et al., 2006):

AUC-hat = (1/(n1 n2)) Σ_{i=1}^{n1} Σ_{j=1}^{n2} ( I[p̃(ui) > p̃(uj)] + (1/2) I[p̃(ui) = p̃(uj)] ).

- A probabilistic classifier p̃ is said to be well calibrated if P(Y = 1 | p̃(u) = w) = w for every prediction probability w, 0 < w < 1; that is, if the proportion of individuals that the classifier assigns to population Y = 1 with probability w is indeed w. A probabilistic classifier p̃ is said to be refined if the prediction probabilities w tend to 0 or 1. Refinement is defined as:

E_w[ P(Y = 1 | p̃(u) = w) P(Y = 2 | p̃(u) = w) ].

To evaluate calibration and refinement, Kim and Simon (2011) define two measures: CS (calibration score) and RS (refinement score). The unit interval is divided into m equal subintervals Bk = ((k − 1)/m, k/m], k = 1, ..., m; for each Bk, let qk be the proportion of predictions w falling in Bk, rk the relative frequency of predictions in Bk for population 1, and uk the midpoint of Bk, k = 1, ..., m. Then:

CS = Σ_{k=1}^{m} (rk − uk)² qk,    RS = Σ_{k=1}^{m} rk (1 − rk) qk.

The remaining objectives concern the estimation of the true error rate. First, analytical results on the apparent error rate are sought that shed light on the behavior of the apparent error bias for the restricted rules. Then, new estimators of the true error rate are defined for the restricted rules that correct the bias of the apparent error. These estimators are based on resampling techniques, namely bootstrap and cross-validation, and their definition starts from the idea that the bootstrap world should reflect the real world. Finally, the behavior of the proposed estimators is evaluated, through a simulation study and on a real data set, in terms of the so-called deviation distribution (Braga-Neto and Dougherty, 2004): if Ê is an estimator of the true error En, the distribution of the random variable (Ê − En) is known as the deviation distribution, and a global measure of the estimator's performance is given by E[(Ê − En)²].
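The Mann-Whitney estimator of the AUC and the CS/RS scores above can be sketched as follows (an illustrative implementation; the function names are ours, and the bin-edge handling for predictions falling exactly on a boundary is a simplification):

```python
def auc_mann_whitney(scores1, scores2):
    """AUC-hat = (1/(n1*n2)) * sum over pairs of [ I(s1 > s2) + 0.5 * I(s1 == s2) ]."""
    total = 0.0
    for s1 in scores1:
        for s2 in scores2:
            total += 1.0 if s1 > s2 else (0.5 if s1 == s2 else 0.0)
    return total / (len(scores1) * len(scores2))

def calibration_refinement(preds, labels, m=10):
    """CS = sum (r_k - u_k)^2 q_k and RS = sum r_k (1 - r_k) q_k over m equal bins.
    `labels` are 1 or 2; r_k is the relative frequency of population 1 in bin k."""
    bins = [[] for _ in range(m)]
    for w, y in zip(preds, labels):
        k = min(int(w * m), m - 1)       # index of the bin containing w
        bins[k].append(y)
    n = len(preds)
    cs = rs = 0.0
    for k, b in enumerate(bins):
        if not b:
            continue
        q = len(b) / n                   # q_k: proportion of predictions in bin k
        r = sum(1 for y in b if y == 1) / len(b)   # r_k
        u = (k + 0.5) / m                # u_k: bin midpoint
        cs += (r - u) ** 2 * q
        rs += r * (1 - r) * q
    return cs, rs
```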

4.3 Results

4.3.1 Comparison with shrinkage-based contractive rules

Sean Π1 y Π2 dos poblaciones con distribuciones N p (µ1 , Σ) y N p (µ2 , Σ). Suponemos diferentes medias en cada poblaci´on pero matriz de covarianzas com´un, funci´on de p´erdida 0-1 e iguales probabilidades a priori para las poblaciones. Si z = (z1 , ..., z p )0 es una nueva observaci´on a clasificar, la regla o´ ptima (te´orica) de Bayes se puede escribir, como muestran Fern´andez et al. (2006): Clasificar z en Π1 sii (z − (c1 µ1 + c2 µ2 ) + cδ )0 Σ−1 δ ≥ 0, con ci = ni /(n1 + n2 ), i = 1, 2, c = c1 − c2 y δ = µ1 − µ2 . Si X 1 y X 2 son los vectores de medias muestrales para las observaciones de las poblaciones Π1 y Π2 , respectivamente, entonces la regla de Fisher se obtiene sustituyendo los par´ametros desconocidos por sus estimadores: Clasificar z en Π1 sii (z − (c1 X 1 + c2 X 2 ) + cδ )0 S−1 δ ≥ 0, siendo δ = X 1 − X 2 y S la matriz de covarianzas muestral pooled. Asumimos que la informaci´on adicional se incorpora a trav´es de la restricci´on δ ∈ C, con C un cono poli´edrico, cerrado y convexo de R p : C = {x ∈ R p : a0j x ≥ 0, j = 1, ..., m}. Las reglas de clasificaci´on definidas en Fern´andez et al. (2006), denominadas Rδ (γ), se obtienen sustituyendo en la regla de Bayes Σ por S, c1 µ1 + c2 µ2 por c1 X 1 + c2 X 2 y δ por un miembro de la familia δγ∗ , γ ∈ [0, 1], definida como el l´ımite de un procedimiento iterativo cuya convergencia se demuestra en dicho art´ıculo: (i) (i−1) (i−1) (0) δbγ = X 1 − X 2 , δbγ = PS−1 (δbγ /C) − γPS−1 (δbγ /CP ), i = 1, 2, ...

As´ı, las nuevas reglas son: ∗

Rδ (γ) : Clasificar z en Πi sii (z − (c1 X 1 + c2 X 2 ) + cδγ∗ )0 S−1 δ γ ≥ 0. ∗ = c X + c (X + δ ∗ ) y µ ∗ = c (X − δ ∗ ) + c X , entonces las Si denotamos µγ1 1 1 2 2 1 1 2 2 γ γ γ2 reglas Rδ (γ) se pueden obtener sustituyendo en la regla de Bayes µ1 , µ2 , δ y Σ ∗ , µ ∗ , µ ∗ − µ ∗ y S. por µγ1 γ2 γ1 γ2 ∗ y Dado que la proyecci´on es un operador contractivo, los estimadores µγ1 ∗ anteriores se pueden considerar como estimadores contractivos de las meµγ2 dias. Tong et al. (2012) proponen un estimador shrinkage del tipo James-Stein para la media bajo funci´on de p´erdida cuadr´atica, para tama˜no muestral fijo ni , i = 1, 2, y dimensi´on p elevada, y proponen la regla que denominan SmDLDA,

36

reemplazando las medias poblacionales por los estimadores shrinkage de las medias, y considerando una matriz de covarianzas diagonal para evitar problemas de singularidad si p es mayor que ni , i = 1, 2. Ver Tong et al. (2012) para una descripci´on detallada de su regla y el buen funcionamiento de la misma. A trav´es de un estudio de simulaci´on en escenarios similares a los que aparecen en Tong et al. (2012) (p grande y no tanto el tama˜no muestral), comparamos el comportamiento de las reglas de Fisher, SmDLDA y Rδ (γ) para γ = 0, 1, es decir, Rδ (0) y Rδ (1), con respecto a varios de los criterios m´as utilizados en la actualidad: probabilidad de clasificaci´on err´onea total (tambi´en conocida como tasa de clasificaci´on err´onea o error de predicci´on), a´ rea bajo la curva ROC, y calibrado y refinamiento (ver Pepe et al., 2006, y Kim y Simon, 2011). Como resultado de las simulaciones, podemos decir que, incluso con datos de altas dimensiones en problemas de tama˜no muestral peque˜no, Rδ (0) y Rδ (1) compiten bien con SmDLDA, super´andola en varias configuraciones. 4.3.2

4.3.2 Estimation of the true error rate of the restricted rules

In practice, one needs to be able to evaluate the behavior of a classification rule given a particular training sample, and since an independent test sample with which to estimate the true error rate (the misclassification probability conditional on the available training sample) is seldom available, we set out to find estimators of the true error rate. The apparent error rate (APP) estimates the true error rate as the proportion of observations in the training sample misclassified by the rule. It is known that, in general, it underestimates the true error rate, since the training sample is used both to build the rule and to evaluate it. For the restricted rules we obtain the following result concerning APP:

Theorem 8. If n1 = n2, for known covariance matrix Σ and γ ∈ [0, 1], it holds that:

E(APP(Rδ(γ))) ≥ E(APP(Rδ(0))) ≥ E(APP(Fisher)).

In Fernández et al. (2006) it is proved that, if n1 = n2, the true error rate of the rules Rδ(γ), γ ∈ [0, 1], is smaller than that of Fisher's rule. Moreover, their simulations show that the true error rate is larger than APP. By the theorem above, if n1 = n2 the bias of the apparent error of the rules Rδ(γ), γ ∈ [0, 1], is smaller than that of Fisher's rule. Consequently, the usual bootstrap-based procedures for correcting the bias of the apparent error cannot be expected to work well for the rules Rδ(γ), which leads us to propose new estimators of the true error rate.

We propose four methods based on resampling techniques, which modify LOOBT (Efron, 1983) and BCV (Fu et al., 2005) so that they take the additional information into account. To obtain LOOBT, B bootstrap samples are drawn; with each one, the corresponding bootstrap version of the classification rule is built and used to classify the observations of the original sample that are not in the bootstrap sample, LOOBT being the proportion of misclassified observations. Efron (1983) observes that LOOBT overestimates the true error rate and proposes BT632 = 0.368 · APP + 0.632 · LOOBT. BCV is the average of the cross-validation errors (CV, Lachenbruch and Mickey, 1968) of B bootstrap samples: for each sample, each observation is left out in turn, the rule is built with the remaining ones and the left-out observation is classified, the cross-validation error being the proportion of misclassified observations.

The inconsistency of the bootstrap in restricted models stems from the fact that the bootstrap world does not represent the real world well: in the real world the population parameters lie in the cone, δ ∈ C, whereas in the bootstrap world the sample is the population and, in general, δ̄ ∉ C. The definition of the new true error estimators for the restricted rules is based on the idea that the bootstrap world should mirror the real world. To this end we make two proposals. In the first, the restriction cone is modified according to what has been observed; in the second, the sample is modified so that it satisfies the restrictions. Let C̃ be the random cone defined from C and δ̄ = X̄1 − X̄2 as follows:

C̃ = { x ∈ R^p : aj′x ≥ 0 if aj′δ̄ ≥ 0, aj′x ≤ 0 if aj′δ̄ < 0, j = 1, ..., m }.

We define BT2 as LOOBT with the particularity that the classification rules built from each bootstrap sample are obtained by projecting not onto C but onto C̃. Likewise, we define BT2CV as BCV but projecting onto C̃ instead of C; note that δ̄ = X̄1 − X̄2 ∈ C̃ by construction. As already mentioned, our second proposal consists of modifying the training sample so that the new sample means satisfy the restrictions. To do so, we transform the original training sample {(Xi, Yi), i = 1, ..., n} as follows: Wi = Xi − X̄j + µ*γj, if Yi = j, j = 1, 2. In this way, the transformed training sample {(Wi, Yi), i = 1, ..., n}

satisfies the restrictions:

W̄1 − W̄2 = µ*γ1 − µ*γ2 ∈ C,

where W̄1 and W̄2 are the new sample means. We define BT3 and BT3CV as LOOBT and BCV, respectively, once the original training sample has been replaced by the transformed one.

We carry out a simulation study to compare the true error estimators APP, CV, LOOBT, BT632, BCV, BT2, BT2CV, BT3 and BT3CV for the rules Rδ(γ). The behavior of an estimator Ê of the true error En is analyzed through the distribution of the random variable Ê − En (the deviation distribution, Braga-Neto and Dougherty, 2004). As a global measure of the behavior of Ê we use E[(Ê − En)²], which can be decomposed as:

E[(Ê − En)²] = Var(Ê − En) + [E(Ê − En)]².

We denote A(Ê) = E[(Ê − En)²]^(1/2) and B(Ê) = E(Ê − En). As in other simulation studies, we find that APP has the largest negative bias (B(Ê)), CV the smallest bias but the largest A(Ê), LOOBT a positive bias, and BCV, BT2CV, BT3CV and BT632 a negative bias. BT2 and BT3 behave similarly, which is surprising given that they are motivated by very different ideas, and they are the best estimators of the true error rate for the smallest values of ‖δ‖², precisely the most interesting situation in practice, when classification is hardest and the additional information can play an important role.
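The two bootstrap proposals can be sketched jointly with two hypothetical helper functions (illustrative Python with the positive orthant as cone; the names are ours, not from the thesis or the dawai package): the first flips the constraint signs so that the observed δ̄ satisfies them (the BT2 idea), the second shifts each training observation so that the new class means satisfy the original restrictions (the BT3 idea).

```python
import numpy as np

def flipped_cone(A, dbar):
    """BT2 idea: rows of A define the cone {x : a_j' x >= 0}. Flip the sign of
    every row that the observed mean difference dbar violates, so that dbar
    belongs to the modified cone by construction."""
    signs = np.where(A @ dbar >= 0, 1.0, -1.0)
    return signs[:, None] * A

def shifted_sample(X, y, class_means, restricted_means):
    """BT3 idea: W_i = X_i - X_j_bar + mu*_j for each observation of class j,
    so that the transformed class means equal the restricted estimators."""
    W = X.astype(float).copy()
    for j, (m, t) in enumerate(zip(class_means, restricted_means)):
        W[y == j] += t - m
    return W

A = np.eye(3)                        # cone: positive orthant of R^3
dbar = np.array([0.8, -0.3, 0.1])    # observed difference, violates the cone
print((flipped_cone(A, dbar) @ dbar >= 0).all())   # True
```

In the bootstrap loop, BT2 would project onto the modified cone instead of C, while BT3 simply reruns LOOBT or BCV on the shifted sample.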

4.3.3 Application

We apply the methodology presented above to a real bladder cancer data set, described earlier in Section 3.3.3. Based on prior knowledge and on our results, we decided to use p = 4 proteins, denoted P1, P2, P3 and P4, to discriminate between two populations: Π1, formed by the control group, and Π2, formed by the T1G3 + T2 groups. Each of these four proteins was expected to take, on average, higher values the more advanced the stage of the disease, that is, δ = µ2 − µ1 ∈ O⁺₄ (the positive orthant), restrictions that do not hold in the training sample, which makes the rules Rδ(γ) relevant in this problem. Four rules are considered: Fisher, SmDLDA, Rδ(0) and Rδ(1), with Rδ(1) showing the best behavior. We also compare the true error rate estimates given by the new estimators, and verify the good behavior of BT2 for Rδ(0) and Rδ(1).

4.4

Conclusions

The rules Rδ(γ), γ ∈ [0, 1], are obtained by replacing the unknown parameters of the Bayes rule with estimators that take the additional information into account. These estimators are defined through projections of the sample means onto the cone defined by the additional information and onto its polar cone. Since projection is a contractive operator, these restricted rules can be viewed as shrinkage rules. Tong et al. (2012) propose shrinkage estimators of the means to define the classification rules known as SmDLDA rules. We have compared the behavior of the rules Rδ(γ), SmDLDA and Fisher in scenarios similar to those of Tong et al. (2012), using several of the criteria most common in the literature, and found that the rules Rδ(γ) compete well under those criteria, even in high-dimensional settings, improving the results in many of the scenarios. We believe that the fact that the estimators used in the rules Rδ(γ) incorporate the additional information available in the problem is a conceptual advantage over the shrinkage performed in SmDLDA, which is not motivated by information associated with the problem.

An important issue for any classification rule is the estimation of its true error rate. We have shown that the apparent error of the restricted rules has a smaller bias than that of Fisher's rule. A possible explanation is that the rules Rδ(γ), γ ∈ [0, 1], are less dependent on the training sample. Consequently, the usual procedures for reducing the bias of the apparent error rate as an estimator of the true error rate do not work well in this context, so we set out to find new estimators of the true error rate, specific to the restricted rules. We consider two methods based on different bootstrap procedures that take the available additional information into account.
The first, BT2, adjusts the restriction cone to the training sample, whereas the second, BT3, adjusts the training sample to the restriction cone. BT2CV and BT3CV are the corresponding cross-validations after bootstrap of these procedures. From a simulation study, we find that BT2 and BT3 are the best estimators of the true error rate for the smallest values of ‖δ‖², situations in which the populations are not very separated, which is the most likely case for training samples that do not satisfy the restrictions, while BT2CV and BT3CV are the best for larger values of ‖δ‖². These methods are included in the R package we have built, named dawai and described in the next section.

The methodology presented has been applied to a real bladder cancer data set. Four rules have been considered: Fisher, SmDLDA, Rδ(0) and Rδ(1), with Rδ(1) showing the best behavior. We have also compared the true error rate estimates given by the new estimators and verified the good behavior of BT2 for Rδ(0) and Rδ(1). We can conclude by noting that the combination of the rule Rδ(1) with the estimator BT2 gives good results in this case.



5

Work in progress and future work

As a complement to the work presented in the previous sections, we have written a package for the R environment, named dawai, an acronym for discriminant analysis with additional information, which can be downloaded from http://cran.r-project.org/web/packages/dawai/. This package contains all the functions needed to define the linear classification rules Rµ(γ) introduced in Conde et al. (2012), which take into account additional information in the form of restrictions on the means, to classify samples, and to evaluate the accuracy of the results through the true error rate estimators BT2, BT3, BT2CV and BT3CV proposed in Conde et al. (2013). In a paper currently under review, we present the dawai package; we extend the results and definitions appearing in Fernández et al. (2006), Conde et al. (2012) and Conde et al. (2013) from the case of equal covariance matrices across populations to the case of different covariance matrices, thereby defining the corresponding quadratic classification rules and their true error rate estimators; and we extend the definition of the true error rate estimators to the general case of more than two populations, all of these functionalities being included in dawai. The software can be applied in a wide variety of contexts, and its use is illustrated on two data sets from fields as different as biology and pattern recognition. The main functions contained in the dawai package are briefly described below:

• rlda(): Builds classification rules for normal populations with equal covariance matrices that take into account additional information in the form of restrictions on the means. It also provides the apparent error rate of the rules.
• predict.rlda(): Classifies multivariate observations with the restricted classification rules built with rlda().

• err.est.rlda(): Provides the true error rate estimators BT2, BT3, BT2CV and BT3CV of the restricted classification rules built with rlda().

• rqda(): The equivalent of rlda() for normal populations with different covariance matrices.

• predict.rqda(): Classifies multivariate observations with the restricted classification rules built with rqda().

• err.est.rqda(): Provides the true error rate estimators BT2, BT3, BT2CV and BT3CV of the restricted classification rules built with rqda().

In the future we intend to incorporate this methodology into other classification techniques. Several machine learning algorithms have been modified to guarantee monotonicity, including, for example, rule induction (Dembczynski et al., 2001), decision trees (Potharst and Feelders, 2002), k nearest neighbors (Duivesteijn and Feelders, 2008) and rule learning (Kotlowski and Slowinski, 2009). Tian et al. (2012) survey the latest advances regarding support vector machines. In other works, instead of modifying algorithms to guarantee monotonicity, data preprocessing methods have been developed, such as relabeling techniques, which repair possible inconsistencies in the training sample so that the classifiers are monotone; see Feelders (2010). We believe there is still considerable room for improvement in this type of procedure through the incorporation of additional information by means of procedures such as those developed in this thesis.
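To make the relabeling idea concrete, here is a minimal sketch (Python; a generic pool-adjacent-violators pass, offered only as an illustration of the idea, not the algorithm of any of the papers cited above): observations are sorted by a single feature, the binary labels are isotonically smoothed, and thresholding yields a monotone relabeling of the training sample.

```python
def pav(y):
    """Pool Adjacent Violators: nondecreasing least-squares fit to the sequence y."""
    blocks = []                       # each block: [mean, weight]
    for v in y:
        blocks.append([float(v), 1.0])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()     # pool adjacent blocks that violate order
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    fit = []
    for m, w in blocks:
        fit.extend([m] * int(w))
    return fit

# Labels ordered by increasing feature value, with monotonicity violations:
labels = [0, 1, 0, 1, 1, 0, 1]
relabeled = [1 if f >= 0.5 else 0 for f in pav(labels)]
print(relabeled)                      # [0, 1, 1, 1, 1, 1, 1] -- monotone
```

Because the isotonic fit is nondecreasing, the thresholded labels are monotone by construction, and only the two violating observations change label in this example.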


References

[1] Abelson, R. P., Tukey, J. W. (1963). Efficient utilization of non-numerical information in quantitative analysis: general theory and the case of simple order. The Annals of Mathematical Statistics 34, 1347-1369.
[2] Adegboye, O. S. (1993). The optimal classification rule for exponential populations. Australian Journal of Statistics 35 (2), 185-194.
[3] Andrews, D. W. K. (2000). Inconsistency of the bootstrap when a parameter is on the boundary of the parameter space. Econometrica 68, 399-405.
[4] Auh, S., Sampson, A. R. (2006). Isotonic logistic discrimination. Biometrika 93 (4), 961-972.
[5] Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T., Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics 26 (4), 641-647.
[6] Barlow, R. E., Bartholomew, D. J., Bremner, J. M., Brunk, H. D. (1972). Statistical inference under order restrictions. Wiley, New York.
[7] Bartholomew, D. J. (1959). A test of homogeneity for ordered alternatives. Biometrika 46, 36-48.
[8] Bartholomew, D. J. (1961). A test of homogeneity of means under restricted alternatives. Journal of the Royal Statistical Society, Series B 23 (2), 239-281.
[9] Basu, A. P., Gupta, A. K. (1974). Classification rules for exponential populations. Proc. Conference on Reliability and Biometry. SIAM Philadelphia, 637-650.
[10] Benigni, R. (2013). Quantitative Structure-Activity Relationship (QSAR) Models of Mutagens and Carcinogens. CRC Press.
[11] Berry, M. W. (2004). Survey of Text Mining: Clustering, Classification, and Retrieval. Springer.
[12] Bhattacharya, P. K., Das Gupta, S. (1964). Classification between univariate exponential distributions. Sankhya: The Indian Journal of Statistics, Series A 26, 17-24.
[13] Bockenholt, U., Bockenholt, I. (1990). Canonical analysis of contingency tables with linear constraints. Psychometrika 55, 633-639.

[14] Boulgouris, N. V., Plataniotis, K. N., Micheli-Tzanakou, E. (2009). Biometrics: Theory, Methods, and Applications. Wiley.
[15] Braga-Neto, U. M., Dougherty, E. R. (2004). Is cross-validation valid for small-sample microarray classification? Bioinformatics 20, 374-380.
[16] Brunk, H. D. (1955). Maximum likelihood estimates of monotone parameters. The Annals of Mathematical Statistics 26 (4), 607-616.
[17] Brunk, H. D. (1958). On the estimation of parameters restricted by inequalities. The Annals of Mathematical Statistics 29 (2), 437-454.
[18] Brunk, H. D. (1970). Estimation of isotonic regression. Nonparametric Techniques in Statistical Inference. Cambridge University Press, 177-195.
[19] Bunke, H., Wang, P. S. P. (1997). Handbook of Character Recognition and Document Image Analysis. World Scientific.
[20] Chang, G., Healey, M. J., McHugh, J. A. M., Wang, J. T. L. (2001). Mining the World Wide Web: An Information Search Approach. Kluwer Academic Publishers.
[21] Chang, Y. T., Shinozaki, N. (2002). A comparison of restricted and unrestricted estimators in estimating linear functions of ordered scale parameters of two gamma distributions. Annals of the Institute of Statistical Mathematics 54 (4), 848-860.
[22] Cohen, A., Kemperman, J. H. B., Sackrowitz, H. B. (2000). Properties of likelihood inference for order restricted models. Journal of Multivariate Analysis 72, 50-77.
[23] Cohen, A., Sackrowitz, H. B. (2004). A discussion of some inference issues in order restricted models. The Canadian Journal of Statistics 32, 199-205.
[24] Conde, D., Fernández, M. A., Salvador, B. (2005). A classification rule for ordered exponential populations. Journal of Statistical Planning and Inference 135 (2), 339-356.
[25] Conde, D., Fernández, M. A., Rueda, C., Salvador, B. (2012). Classification of samples into two or more ordered populations with application to a cancer trial. Statistics in Medicine 31, 3773-3786.
[26] Conde, D., Salvador, B., Rueda, C., Fernández, M. A. (2013). Performance and estimation of the true error rate of classification rules built with additional information. An application to a cancer trial. Statistical Applications in Genetics and Molecular Biology 12 (5), 583-602.

[27] Das, S., Sen, P. K. (1994). Restricted canonical correlations. Linear Algebra and its Applications 210, 29-47.
[28] Das, S., Sen, P. K. (1996). Asymptotic distribution of restricted canonical correlations and relevant resampling methods. Journal of Multivariate Analysis 56 (1), 1-19.
[29] Dembczynski, K., Kotlowski, W., Slowinski, R. (2001). Learning rule ensembles for ordinal classification with monotonicity constraints. Fundamenta Informaticae XXI, 1001-1016.
[30] Duivesteijn, W., Feelders, A. (2008). Nearest neighbour classification with monotonicity constraints. Lecture Notes in Computer Science 5211, 301-316.
[31] Dykstra, R., Hewett, J., Robertson, T. (1999). Nonparametric, isotonic discriminant procedures. Biometrika 86, 429-438.
[32] Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association 78, 316-331.
[33] El Barmi, H., McKeague, I. W. (2013). Empirical likelihood-based tests for stochastic ordering. Bernoulli 19 (1), 295-367.
[34] Feelders, A. (2010). Monotone relabelling in ordinal classification. ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining, 803-808.
[35] Fernández, M. A., Rueda, C., Salvador, B. (1997). On the maximum likelihood estimator under order restrictions in uniform probability models. Communications in Statistics - Theory and Methods 26 (8), 1971-1980.
[36] Fernández, M. A., Rueda, C., Salvador, B. (1998). Simultaneous estimation by isotonic regression. Journal of Statistical Planning and Inference 70, 111-119.
[37] Fernández, M. A., Rueda, C., Salvador, B. (1999). The loss of efficiency estimating contrast under restrictions. Scandinavian Journal of Statistics 26 (4), 579-592.
[38] Fernández, M. A., Rueda, C., Salvador, B. (2000). Parameter estimation under orthant restrictions. The Canadian Journal of Statistics 28, 171-181.


[39] Fernández, M. A., Rueda, C., Salvador, B. (2006). Incorporating additional information to normal linear discriminant rules. Journal of the American Statistical Association 101 (474), 569-577.
[40] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (2), 179-188.
[41] Fraley, C., Raftery, A. E. (2002). Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association 97 (458), 611-631.
[42] Fu, W. J., Carroll, R. J., Wang, S. (2005). Estimating misclassification error with small samples via bootstrap cross-validation. Bioinformatics 21, 1979-1986.
[43] Gasparini, M., Eisele, J. (2000). A curve-free method for phase I clinical trials. Biometrics 56, 609-615.
[44] Groenen, P. J. F., Poblome, J. (2003). Constrained correspondence analysis for seriation in archaeology applied to Sagalassos ceramic tablewares. Exploratory Data Analysis in Empirical Research, 90-97.
[45] Hand, D. J. (1981). Discrimination and Classification. Wiley, New York.
[46] Hastie, T., Buja, A., Tibshirani, R. (1995). Penalized discriminant analysis. The Annals of Statistics, 73-102.
[47] Hoijtink, H. (2013). Objective Bayes factors for inequality constrained hypotheses. International Statistical Review 81 (2), 207-229.
[48] Hu, X., Wright, F. T. (1994). Likelihood ratio tests for a class of nonoblique hypotheses. Annals of the Institute of Statistical Mathematics 46, 137-145.
[49] Huberty, C. J., Olejnik, S. (2006). Applied MANOVA and Discriminant Analysis. John Wiley & Sons.
[50] Hwang, J. T. G., Peddada, S. D. (1994). Confidence interval estimation subject to order restrictions. The Annals of Statistics 22, 67-93.
[51] Kao, A., Poteet, S. R. (2007). Natural Language Processing and Text Mining. Springer.
[52] Kaur, A., Singh, H. (1991). On the estimation of ordered means of two exponential populations. Annals of the Institute of Statistical Mathematics 43 (2), 347-356.

[53] Kelly, R. (1989). Stochastic reduction of loss in estimating normal means by isotonic regression. The Annals of Statistics 17 (2), 937-940.
[54] Kim, J. H. (2009). Estimating classification error rate: repeated cross-validation, repeated hold-out and bootstrap. Computational Statistics & Data Analysis 53 (11), 3735-3745.
[55] Kim, K. I., Simon, R. (2011). Probabilistic classifiers with high-dimensional data. Biostatistics 12 (3), 399-412.
[56] Korn, E. L. (1982). Confidence bands for isotonic dose-response curves. Applied Statistics 31 (1), 59-63.
[57] Kotlowski, W., Slowinski, R. (2009). Rule learning with monotonicity constraints. ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning, 537-544.
[58] Kotlowski, W., Slowinski, R. (2013). On nonparametric ordinal classification with monotonicity constraints. IEEE Transactions on Knowledge and Data Engineering 25 (11), 2576-2589.
[59] Kulatunga, D. D. S., Sasabuchi, S. (1984). A test of homogeneity of mean vectors against multivariate isotonic alternatives. Mem. Fac. Sci. Kyushu Univ. Ser. A Math. 38, 151-161.
[60] Kuriki, S. (2005). Asymptotic distribution of inequality-restricted canonical correlation with application to tests for independence in ordered contingency tables. Journal of Multivariate Analysis 94 (2), 420-449.
[61] Lachenbruch, P. A., Goldstein, M. (1979). Discriminant analysis. Biometrics 35, 69-85.
[62] Lachenbruch, P., Mickey, M. (1968). Estimation of error rates in discriminant analysis. Technometrics 10, 167-178.
[63] Lee, C. I. C. (1981). The quadratic loss of isotonic regression under normality. The Annals of Statistics 9 (3), 686-688.
[64] Lee, C. I. C. (1988). The quadratic loss of order restricted estimators for several treatment means and a control mean. The Annals of Statistics 16, 751-758.
[65] Li, Z., Taylor, J. M. G., Nan, B. (2010). Construction of confidence intervals and regions for ordered binomial probabilities. The American Statistician 64 (4), 291-298.

[66] Li, S. Z., Jain, A. K. (2011). Handbook of Face Recognition. Springer.
[67] Liu, T., Lin, N., Shi, N., Zhang, B. (2009). Information criterion-based clustering with order-restricted candidate profiles in short time-course microarray experiments. Bioinformatics 10, 146.
[68] Lo, S. H. (1987). Estimation of distribution functions under order restrictions. Statistics & Decisions 5, 251-262.
[69] Long, T., Gupta, R. D. (1998). Alternative linear classification rules under order restrictions. Communications in Statistics - Theory and Methods 27 (3), 559-575.
[70] Lovell, M. C., Prescott, E. (1970). Multiple regression with inequality constraints: pretesting bias, hypothesis testing and efficiency. Journal of the American Statistical Association 65 (330), 913-925.
[71] Ma, T., Liu, S. (2013). Estimation of order-restricted means of two normal populations under the LINEX loss function. Metrika 76 (3), 409-425.
[72] Mahalanobis, P. C. (1927). Analysis of race mixture in Bengal. Journal and Proceedings of the Asiatic Society of Bengal 23, 301-333.
[73] Marchand, E., Strawderman, W. E. (2004). Estimation in restricted parameter spaces: A review. A Festschrift for Herman Rubin. Institute of Mathematical Statistics Lecture Notes - Monograph Series 45, 21-44.
[74] Marcus, R., Peritz, E. (1976). Some simultaneous confidence bounds in normal models with restricted alternatives. Journal of the Royal Statistical Society, Series B (Methodological) 38 (2), 157-165.
[75] McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley-Interscience.
[76] Menéndez, J. A., Rueda, C., Salvador, B. (1991). Conditional test for testing a face of the tree order cone. Communications in Statistics - Simulation and Computation 20, 751-762.
[77] Menéndez, J. A., Rueda, C., Salvador, B. (1992a). Dominance of likelihood ratio tests under order constraints. The Annals of Statistics 20, 2087-2099.
[78] Menéndez, J. A., Rueda, C., Salvador, B. (1992b). Testing non-oblique hypotheses. Communications in Statistics - Theory and Methods 21, 471-484.

[79] Menéndez, J. A., Salvador, B. (1991). Anomalies of the likelihood ratio test for testing restricted hypotheses. The Annals of Statistics 19, 889-898.
[80] Menéndez, J. A., Salvador, B. (1992). Equivalence of likelihood ratio tests and obliquity. Statistics & Probability Letters 14, 223-229.
[81] Moors, J. J. A., van Houwelingen, J. C. (1993). Estimation of linear models with inequality restrictions. Statistica Neerlandica 47, 185-198.
[82] Mukerjee, H., Robertson, T., Wright, F. T. (1986). A probability inequality for elliptically contoured densities with applications in order restricted inference. The Annals of Statistics 14, 1544-1554.
[83] Pearson, K. (1926). On the coefficient of racial likeness. Biometrika 18 (1/2), 105-117.
[84] Peddada, S. D. (1997). Confidence interval estimation of population means subject to order restrictions using resampling procedures. Statistics & Probability Letters 31, 255-265.
[85] Peddada, S., Lobenhofer, E., Li, L., Afshari, C., Weinberg, C., Umbach, D. (2003). Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics 19, 834-841.
[86] Pepe, M. S., Cai, T., Longton, G. (2006). Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics 62 (1), 221-229.
[87] Perlman, M. D. (1969). One-sided testing problems in multivariate analysis. The Annals of Mathematical Statistics 40 (2), 549-567.
[88] Perlman, M. D., Wu, L. (2002). A defense of the likelihood ratio test for one-sided and order-restricted alternatives. Journal of Statistical Planning and Inference 107, 173-186.
[89] Potharst, R., Feelders, A. (2002). Classification trees for problems with monotonicity constraints. ACM SIGKDD Explorations Newsletter 4 (1), 1-10.
[90] Praestgaard, J. (2012). A note on the power superiority of the restricted likelihood ratio test. Journal of Multivariate Analysis 104, 1-15.


[91] Rao, C. R. (1948). The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological) 10 (2), 159-203.
[92] Robertson, T., Wright, F. T., Dykstra, R. L. (1988). Order Restricted Statistical Inference. Wiley, New York.
[93] Rueda, C. (1989). Contrastes de Hipótesis con Restricciones bajo Condiciones de Oblicuidad. Tesis Doctoral. Universidad de Valladolid.
[94] Rueda, C., Salvador, B. (1995). Reduction of risk using restricted estimators. Communications in Statistics - Theory and Methods 24, 1011-1022.
[95] Rueda, C., Salvador, B., Fernández, M. A. (1997a). A good property of the maximum likelihood estimator in a restricted normal model. Test 6 (1), 127-135.
[96] Rueda, C., Salvador, B., Fernández, M. A. (1997b). Simultaneous estimation in a restricted linear model. Journal of Multivariate Analysis 61, 61-66.
[97] Rueda, C., Menéndez, J. A., Salvador, B. (2002). Bootstrap adjusted estimators in a restricted setting. Journal of Statistical Planning and Inference 107, 123-131.
[98] Sampson, A. R., Singh, H., Whitaker, L. R. (2008). Simultaneous confidence bands for isotonic functions. Journal of Statistical Planning and Inference 139, 828-842.
[99] Salvador, B., Fernández, M. A., Martín, I., Rueda, C. (2008). Robustness of classification rules that incorporate additional information. Computational Statistics & Data Analysis 52, 2489-2495.
[100] Sasabuchi, S., Inutsuka, M., Kulatunga, D. D. S. (1983). A multivariate version of isotonic regression. Biometrika 70 (2), 465-472.
[101] Schaafsma, W., Smid, L. J. (1966). Most stringent somewhere most powerful tests against alternatives restricted by a number of linear inequalities. The Annals of Mathematical Statistics 37, 1161-1172.
[102] Schiavo, R. A., Hand, D. J. (2000). Ten more years of error rate research. International Statistical Review 68, 295-310.
[103] Schoenfeld, D. A. (1986). Confidence bounds for normal means under order restrictions, with application to dose-response curves, toxicology experiments, and low-dose extrapolation. Journal of the American Statistical Association 81, 186-195.

[104] Shapiro, A. (1988). Towards a unified theory of inequality constrained testing in multivariate analysis. International Statistical Review 56 (1), 49-62.
[105] Shaw, F. H., Geyer, C. J. (1997). Estimation and testing in constrained covariance component models. Biometrika 84 (1), 95-102.
[106] Shi, N. Z., Kudô, A. (1987). The most stringent somewhere most powerful one-sided test of the multivariate normal mean. Mem. Fac. Sci. Kyushu Univ. 29, 303-328.
[107] Silvapulle, M. J., Sen, P. K. (2004). Constrained Statistical Inference. John Wiley & Sons, New Jersey.
[108] Simmons, S., Peddada, S. (2007). Order-restricted inference for ordered gene expression (ORIOGEN) data under heteroscedastic variances. Bioinformation 1, 414-419.
[109] Sneath, P. H. A., Sokal, R. R. (1973). Numerical taxonomy. The principles and practice of numerical classification. W. H. Freeman Limited.
[110] Sonka, M., Fitzpatrick, J. (2004). Handbook of Medical Imaging: Medical image processing and analysis. The Society of Photo-Optical Instrumentation Engineers.
[111] Takane, Y., Yanai, H., Mayekawa, S. (1991). Relationships among several methods of linearly constrained correspondence analysis. Psychometrika 56, 667-684.
[112] Taylor, J. M. G., Wang, L., Li, Z. (2007). Analysis on binary responses with ordered covariates and missing data. Statistics in Medicine 26, 3443-3458.
[113] Thomas, L. C., Edelman, D. B., Crook, J. N. (2002). Credit Scoring and Its Applications. SIAM.
[114] Tian, Y., Shi, Y., Liu, X. (2012). Recent advances in support vector machines research. Technological and Economic Development of Economy 18, 5-33.
[115] Tong, T., Chen, L., Zhao, H. (2012). Improved mean estimation and its application to diagonal discriminant analysis. Bioinformatics 28 (4), 531-537.
[116] Toussaint, G. T. (1974). Bibliography on estimation of misclassification. IEEE Transactions on Information Theory 20 (4), 472-479.

[117] Tsai, M. (1992). On the power superiority of likelihood ratio tests for restricted alternatives. Journal of Multivariate Analysis 42, 102-109.
[118] Tsai, M., Sen, P. K. (1993). On the local optimality of optimal linear tests for restricted alternatives. Statistica Sinica 3, 103-115.
[119] Tsukuma, H., Kubokawa, T. (2011). Modifying estimators of ordered positive parameters under the Stein loss. Journal of Multivariate Analysis 102 (1), 164-181.
[120] Van de Velden, M., Groenen, P. J. F., Poblome, J. (2009). Seriation by constrained correspondence analysis: A simulation study. Computational Statistics and Data Analysis 53, 3129-3138.
[121] Van Eeden, C. (1956). Maximum likelihood estimation of ordered probabilities. Proc. Kon. Nederl. Akad. Wetensch. Ser. A 59, 444-455.
[122] Van Eeden, C. (1957a). Maximum likelihood estimation of partially or completely ordered parameters. Proc. Kon. Nederl. Akad. Wetensch. Ser. A 60, 128-136.
[123] Van Eeden, C. (1957b). Note on two methods for estimating ordered parameters of probability distributions. Proc. Kon. Nederl. Akad. Wetensch. Ser. A 60, 506-512.
[124] Van Eeden, C. (2006). Restricted parameter space estimation problems: admissibility and minimaxity properties. Springer.
[125] Vijayasree, G., Singh, H. (1991). Simultaneous estimation of two ordered exponential parameters. Communications in Statistics - Theory and Methods 20 (8), 2559-2576.
[126] Vijayasree, G., Singh, H. (1993). Mixed estimators of two ordered exponential means. Journal of Statistical Planning and Inference 35, 47-56.
[127] Wang, F. (2008). Biomarker Methods in Drug Discovery and Development. Humana Press.
[128] Zelen, M. (1966). Application of exponential models to problems in cancer research with discussion. Journal of the Royal Statistical Society: Series A 129, 368-398.

54

ANEXOS: Artículos publicados


Journal of Statistical Planning and Inference 135 (2005) 339 – 356 www.elsevier.com/locate/jspi

A classification rule for ordered exponential populations
David Conde, Miguel A. Fernández*, Bonifacio Salvador
Departamento de Estadística e Investigación Operativa, C/Prado de la Magdalena s/n, Universidad de Valladolid, 47005 Valladolid, Spain
Received 30 April 2003; accepted 1 May 2004. Available online 21 July 2004

Abstract. In this paper, we consider classification procedures for exponential populations when an order on the population parameters is known. We define and study the behavior of a classification rule which takes into account the additional information and outperforms the likelihood-ratio-based rule when two populations are considered. Moreover, we study the behavior of this rule in each of the two populations and compare the misclassification probabilities with the classical ones. Type II censorship, which is usual in practice, is considered and results are obtained for this setting. The performance for more than two populations is evaluated by simulation. © 2004 Elsevier B.V. All rights reserved.
MSC: primary 62H30; 62F30
Keywords: Classification rules; Exponential populations; Restricted parameter spaces; Misclassification probabilities

1. Introduction

There is an extensive literature on classification in normal populations. However, not much work has been done for the non-normal case. The exponential distribution is perhaps one of the most interesting ones to be considered since it is quite frequent to find it in practical

Research partially supported by Spanish DGES Grant PB 97-0475.

∗ Corresponding author. Fax: +34-9834-23013.

E-mail address: [email protected] (M.A. Fernández). 0378-3758/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2004.05.004


contexts such as reliability or survival analysis. Some papers dealing with classification for the exponential case and its applications are Basu and Gupta (1974), Bhattacharya and Das Gupta (1964) and Adegboye (1993). Interesting papers dealing with estimation of non-normal ordered parameters are Vijayasree and Singh (1991, 1993) for exponential populations and Chang and Shinozaki (2002) for gamma populations. However, the study of classification rules when additional information in the form of parameter ordering is available and how the rules can be improved in this context has not been thoroughly investigated. To our best knowledge the only paper in this line is Long and Gupta (1998) where an order restriction on the means of two normal populations is assumed and some improvements over Anderson’s classification rule are achieved. In this paper, we combine both situations and consider classification on exponential populations whose parameters are known to follow a simple order. Examples where this sort of scheme can be used may be found in Zelen (1966) in the cancer research context and in Basu and Gupta (1974) in the reliability one. The layout of the paper is the following. In Section 2, we study the two populations case. First, we define a rule which takes into account the information given by the order restrictions and compare it with the usual and unrestricted likelihood-ratio based one. We find that the proposed rule is consistent and that it performs globally better than the usual rule. The second part of Section 2 is related to the question of misclassification in each of the two populations for the new rule and results comparing these probabilities with the ones for the usual rule are obtained. In Section 3, we consider type II censorship and how this scheme affects the rule. We obtain results for this situation quite usual in reliability and survival analysis applications. Section 4 is devoted to the case of more than two populations. 
The proposed rule is extended to this case and shown to be consistent. We also compare it with the likelihood-ratio-based one and, using simulation, we evaluate its performance in terms of misclassification probabilities both globally and in each of the populations involved. An appendix contains the more involved proofs that have been delayed to improve the readability of the paper.

2. Two populations case

2.1. Global rule behavior

Let π1 and π2 be two one-parameter exponential populations with probability density functions

f_i(x) = λ_i e^(−λ_i x), x > 0, λ_i > 0, i = 1, 2,

such that it is known that λ1 ≥ λ2. In other words, the expected value in the first population is not bigger than in the second one. We also assume that we have a training sample from each of the two populations, X = (X_1, ..., X_n1), Y = (Y_1, ..., Y_n2). Now, we want to classify a new observation z coming from one of those two populations but whose exact origin is unknown. In what follows, we will suppose that the populations are equally likely and that the costs of misclassification are equal.


The usual likelihood classification rule appearing in Basu and Gupta (1974) can be written in the following way:

R: Classify z into π1 iff (z − x0)(λ1 − λ2) < 0, and into π2 otherwise,

where x0 = (ln λ1 − ln λ2)/(λ1 − λ2). If we consider the usual case where the parameters are unknown, the rule turns into

R_U: Classify z into π1 iff (z − x̂0)(λ̂1 − λ̂2) < 0, and into π2 otherwise,   (1)

where λ̂1 and λ̂2 are the MLEs of λ1 and λ2 and

x̂0 = (ln λ̂1 − ln λ̂2)/(λ̂1 − λ̂2).

This rule is shown to be consistent in Theorem 3 of the same aforementioned paper and has been compared with the one based on Fisher's discriminant function for normal populations in Adegboye (1993), where the superiority of R_U over the normal translated one is exhibited. The rule we propose, which we will refer to as the ordered rule, takes into account that we know that λ1 ≥ λ2 and is defined as follows:

R_O: Classify z into π1 iff z − x̂0 < 0, and into π2 otherwise.   (2)

This rule is based on the estimator given in Vijayasree and Singh (1991) with mixing parameter equal to 0. First, we prove the consistency of this rule. Let us denote as P_*(i/j) the probability that an observation coming from population j is classified in population i when the rule R_* is being used, and let P_*(MC) = (1/2)(P_*(1/2) + P_*(2/1)) be the global misclassification probability of R_*.

Theorem 1. Let z1, ..., zm be a random sample from π0 where π0 = πi for exactly one i (i = 1 or 2). Then P_O(MC) → 0 for m, n1, n2 > N when N → ∞.

Proof. The proof of this result follows the same lines of Theorems 2 and 3 in Basu and Gupta (1974) and therefore is not developed here. □

Next, we prove that the new ordered rule outperforms the usual one, which does not take into account the additional information given by the restrictions. Notice that this is not obvious at all, since restricted procedures do not always perform better than the unrestricted ones. This has been observed as much in likelihood ratio tests under order restrictions (see Menéndez and Salvador, 1991) as in restricted parameter estimation (see Lee, 1988; Fernández et al., 1999).


Theorem 2. Let R_O and R_U be the classification rules defined in (2) and (1), respectively. Then P_O(MC) ≤ P_U(MC) for any λ1 ≥ λ2 > 0.

Proof. Let us work with the correct classification probabilities

1 − P_O(MC) = (1/2)[P_O(1/1) + P_O(2/2)]
  = (1/2)[P_λ1(Z ≤ x̂0, λ̂1 ≤ λ̂2) + P_λ1(Z ≤ x̂0, λ̂1 > λ̂2)]
  + (1/2)[P_λ2(Z > x̂0, λ̂1 ≤ λ̂2) + P_λ2(Z > x̂0, λ̂1 > λ̂2)]   (3)

and for the usual rule

1 − P_U(MC) = (1/2)[P_U(1/1) + P_U(2/2)]
  = (1/2)[P_λ1(Z > x̂0, λ̂1 ≤ λ̂2) + P_λ1(Z ≤ x̂0, λ̂1 > λ̂2)]
  + (1/2)[P_λ2(Z ≤ x̂0, λ̂1 ≤ λ̂2) + P_λ2(Z > x̂0, λ̂1 > λ̂2)].   (4)

Obviously, it is enough to prove

P_λ1(Z ≤ x̂0, λ̂1 ≤ λ̂2) + P_λ2(Z > x̂0, λ̂1 ≤ λ̂2) ≥ P_λ1(Z > x̂0, λ̂1 ≤ λ̂2) + P_λ2(Z ≤ x̂0, λ̂1 ≤ λ̂2).

Taking into account that λ̂1 = n1/X_T and λ̂2 = n2/Y_T, where X_T = Σ_{i=1}^{n1} X_i ~ Γ(λ1, n1) and Y_T = Σ_{i=1}^{n2} Y_i ~ Γ(λ2, n2), if we denote

f(x, λ, n) = λ^n e^(−λx) x^(n−1)/(n − 1)!,   (5)

we can write, for λ = λ1 or λ = λ2,

P_λ(Z ≤ x̂0, λ̂1 ≤ λ̂2) = P_λ(Z ≤ x̂0, X_T ≥ (n1/n2) Y_T)
  = ∫∫_{x ≥ (n1/n2)y} P_λ(Z ≤ x̂0 | X_T = x, Y_T = y) dP_{X_T}(x) dP_{Y_T}(y)
  = ∫_0^∞ ∫_{(n1/n2)y}^∞ (1 − e^(−λ x̂0)) f(x, λ1, n1) f(y, λ2, n2) dx dy.   (6)

From this expression it is clear that the probability we are calculating is increasing in λ, so that, as λ1 ≥ λ2,

P_λ1(Z ≤ x̂0, λ̂1 ≤ λ̂2) ≥ P_λ2(Z ≤ x̂0, λ̂1 ≤ λ̂2).   (7)

Moreover, as

P_λ1(Z ≤ x̂0, λ̂1 ≤ λ̂2) + P_λ1(Z > x̂0, λ̂1 ≤ λ̂2) = P(λ̂1 ≤ λ̂2)
  = P_λ2(Z ≤ x̂0, λ̂1 ≤ λ̂2) + P_λ2(Z > x̂0, λ̂1 ≤ λ̂2),

we have

P_λ2(Z > x̂0, λ̂1 ≤ λ̂2) ≥ P_λ1(Z > x̂0, λ̂1 ≤ λ̂2)   (8)

and the theorem is done. □

Remark 3. Notice that from the proof it is clear that P_O(MC) < P_U(MC) whenever λ1 > λ2 > 0; that is, the equality only holds when the parameters of the two populations are equal.

Corollary 4. Theorem 2 is also true when we have to classify a sample of size m > 1 coming from the same unknown population.

Proof. In this case we use the sufficient statistic Z̄ = (1/m)(Z_1 + ... + Z_m) instead of Z for the classification. Following the same lines as the previous theorem, it is enough to replace P_λ(Z ≤ x̂0) = 1 − e^(−λ x̂0) in Eq. (6) with the probability corresponding to the gamma distribution, P_λ(Z̄ ≤ x̂0) = 1 − Σ_{i=0}^{m−1} e^(−λ x̂0 m) (λ x̂0 m)^i / i!, which is also increasing in λ when all other parameters are fixed. □

These results can be applied to situations interesting in practice, such as the air-conditioning equipment in two aircraft example appearing in Basu and Gupta (1974), if it is known that one of the two aircraft has a mean time to failure lower than the other. In that case, the use of rule R_O instead of R_U will decrease the misclassification probabilities, as shown in the previous results. It is clearly interesting to know more precisely how much we gain. It will become apparent in the proof of Theorems 5 and 6 in the next subsection that the misclassification probabilities of R_O and R_U depend on λ1 and λ2 only through λ2/λ1. This fact has already been noticed by Adegboye (1993) and we use it in the graphics appearing in the paper. The horizontal axis represents λ2/λ1 ∈ (0, 1] and the vertical axis the correct classification probability differences between R_O and R_U. In Fig. 1, we show how the correct classification probabilities change when the sample sizes are equal (n1 = n2 = n). When sample sizes are different we have observed that the value depends much more strongly on n2 than on n1. The maximum value is obtained when n2 = 1 and n1 → ∞ and is about 0.0747.

[Figure 1 here]
Fig. 1. Global correct probability difference between R_O and R_U for several sample sizes (curves for n = 1, 2, 10).
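The gain shown in Fig. 1 can be approximated by simulation. A sketch of a Monte Carlo estimate of P_U(MC) − P_O(MC) for equal priors and one new observation per repetition; this is our illustration, not the computation used for the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_gain(lam1, lam2, n, reps=20000):
    """Monte Carlo estimate of P_U(MC) - P_O(MC): the gain of the
    ordered rule R_O over the usual rule R_U for two exponential
    populations with rates lam1 >= lam2 and training samples of
    size n each.  Illustrative sketch only."""
    wrong_u = wrong_o = 0
    for _ in range(reps):
        x = rng.exponential(1.0 / lam1, n)      # training sample, population 1
        y = rng.exponential(1.0 / lam2, n)      # training sample, population 2
        l1, l2 = n / x.sum(), n / y.sum()       # unrestricted MLEs
        x0 = (np.log(l1) - np.log(l2)) / (l1 - l2)
        pop = rng.integers(1, 3)                # true origin of z: 1 or 2
        z = rng.exponential(1.0 / (lam1 if pop == 1 else lam2))
        u = 1 if (z - x0) * (l1 - l2) < 0 else 2    # rule R_U
        o = 1 if z < x0 else 2                      # rule R_O
        wrong_u += (u != pop)
        wrong_o += (o != pop)
    return (wrong_u - wrong_o) / reps
```

For example, `mc_gain(2.0, 1.0, 2)` estimates the difference at λ2/λ1 = 0.5 with n = 2, which should be small but positive, in line with the n = 2 curve in Fig. 1.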

2.2. Misclassification probabilities in each population

The next step in the evaluation of the rule is to compare the misclassification probabilities of R_O and R_U in each of the two populations. Notice that the proof of Theorem 2 cannot be used for this purpose, as the inequalities (7) and (8) appearing there do not involve the appropriate sets to be compared. The comparison of individual population probabilities is interesting since in Long and Gupta (1998) some ordered rules for the restricted normal classification problem are defined and one of them is shown to improve the misclassification probabilities for both populations. Unfortunately, this property is not true for all possible parameter values when dealing with exponential populations. In this subsection of the paper, we will assume that both training sample sizes are equal, that is, n1 = n2 = n, and that m = 1. The proofs of the results are delayed to the appendix in order to improve the readability of the paper. The results are written in terms of the correct classification probabilities. We have the following result for the first population.

Theorem 5. Assume n1 = n2 = n.
1. If n > 1, P_O(1/1) > P_U(1/1) for all λ1 ≥ λ2 > 0.
2. If n = 1 there is ρ0 ∈ (0, 1) such that
   P_O(1/1) ≥ P_U(1/1) for all λ1, λ2 > 0 with 0 < λ2/λ1 ≤ ρ0,
   P_O(1/1) < P_U(1/1) for all λ1, λ2 > 0 with ρ0 < λ2/λ1 < 1.

And we have the following theorem for the second one.


Table 1. Values of ρ0, ρ*0 and ρ*1 for different sample sizes.

Sample size (n):   1        2        5        10       25       50
ρ0:                0.56108  —        —        —        —        —
ρ*0 = ρ*1:         —        0.84678  0.63983  0.58354  0.55164  0.54110

Theorem 6. Assume n1 = n2 = n.
1. If n = 1, P_O(2/2) ≥ P_U(2/2) for all λ1 ≥ λ2 > 0.
2. If n > 1 there are ρ*0 and ρ*1 in (0, 1) such that
   P_O(2/2) ≥ P_U(2/2) for all λ1, λ2 > 0 with 0 < λ2/λ1 ≤ ρ*0,
   P_O(2/2) < P_U(2/2) for all λ1, λ2 > 0 with ρ*1 < λ2/λ1 < 1.
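The crossover value ρ0 = 0.56108 of Table 1 can be reproduced numerically: for n = 1 the sign of the correct-classification difference is governed by the function H(n, ρ) = ∫_0^1 (h0 − 2 h1) dt defined in the Appendix, equation (A.4). A numerical sketch (ours, not the authors' computation) using trapezoidal integration and bisection:

```python
import numpy as np

def H(rho, n=1, grid=20000):
    """Trapezoidal approximation of H(n, rho) = int_0^1 (h0 - 2*h1) dt,
    with h0 = t^(n-1)/(1+rho*t)^(2n) and
    h1 = t^(n-1)/(1 + rho*t + (1/n)*t*ln(t)/(t-1))^(2n)."""
    t = np.linspace(1e-9, 1.0 - 1e-9, grid)
    s = t * np.log(t) / (t - 1.0)                   # t*ln(t)/(t-1), in (0, 1)
    h0 = t**(n - 1) / (1.0 + rho * t)**(2 * n)
    h1 = t**(n - 1) / (1.0 + rho * t + s / n)**(2 * n)
    f = h0 - 2.0 * h1
    return float(np.sum((f[1:] + f[:-1]) * np.diff(t)) / 2.0)  # trapezoid rule

# H(1, .) is positive near rho = 0 and negative at rho = 1 and is
# strictly decreasing, so the crossover rho_0 is found by bisection:
lo, hi = 0.01, 0.99
for _ in range(50):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if H(mid) > 0 else (lo, mid)
rho0 = 0.5 * (lo + hi)
```

With these definitions the bisection lands close to the tabulated value 0.56108.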

From these results it is apparent that, although the rule performs globally better, there are certain configurations of the parameters for which that property does not hold for each of the populations. For the most usual case, n > 1, performance is always better for the first population but not for the second one, while the opposite is true when the training samples are of size 1. These results are obviously interesting when interest is focused on correct classification in one of the two populations. In Fig. 2 these results can be checked graphically for different sample sizes. The figure also shows how big the differences are. The values of ρ0, ρ*0 and ρ*1 are interesting from a practical point of view. From the proofs of the theorems it is clear that they cannot be written in terms of elementary functions, but these values can be computed numerically, and from Fig. 2 it is also clear that ρ*0 = ρ*1. Table 1 gives the values of these parameters for different sample sizes.

Remark 7. When m > 1 the calculations are much more involved, so that no analytic results can be easily obtained. However, some comments can be given from what we have observed in the simulations made for this situation. For m > 1 the difference of probabilities of correct classification for the first population, P_O(1/1) − P_U(1/1), behaves worse when m increases, while the difference for the second population, P_O(2/2) − P_U(2/2), increases when m > 1. The final result is that the global difference P_O(MC) − P_U(MC) behaves better for m > 1 than for m = 1. Therefore, in this situation our rule R_O is even better than in the usual m = 1 case when compared with R_U.

3. Classification under type II censorship

Up to this point we have assumed that complete samples are available for classification. In this section, we will study how the proposed rule performs when data are censored. It is well known that in many physical situations where the exponential distribution appears


[Figure 2 here: two panels, First Population and Second Population]
Fig. 2. Difference between the correct classification probabilities for R_O and R_U for each of the two populations for several sample sizes (curves for n = 1, 2, 10).

such as life-testing problems or reliability analysis, the data are often censored. We will consider here one of the most frequent types of censorship, the so-called type II censorship. In the reliability context, this kind of censorship appears, for example, when n units are put to test and the study stops when the first r units have failed. In this type of censorship, the number of data we collect is known in advance but the time needed to complete the study is random. In type I censorship the opposite happens: the number of data is random and the time to end is fixed. Under type II censorship, the MLE of the exponential parameter λ from a size n random sample X_1, ..., X_n is known to be

λ* = r / (Σ_{i=1}^r X_(i) + (n − r) X_(r)),

using the common notation for the order statistics.

The usual and ordered rules under censorship, R*_U and R*_O, can be easily defined replacing λ̂_i by λ*_i in the uncensored rules R_U and R_O defined in (1) and (2):

R*_U: Classify z into π1 iff (z − x*_0)(λ*_1 − λ*_2) < 0, and into π2 otherwise,

where x*_0 = (ln λ*_1 − ln λ*_2)/(λ*_1 − λ*_2), and

R*_O: Classify z into π1 iff z − x*_0 < 0, and into π2 otherwise.

Now, taking into account that Σ_{i=1}^r X_(i) + (n − r) X_(r) follows a Γ(λ, r) distribution (cf. Basu, 1965) and the proofs of the theorems in the previous section, we can obtain the following results in this case. From Theorem 1 we can prove that the rule R*_O is consistent. Using the proof of Theorem 2 it is easy to check that this rule outperforms R*_U, i.e. P*_O(MC) ≤ P*_U(MC) for any λ1 ≥ λ2 > 0. Finally, from the proofs of Theorems 5 and 6 we can study the behavior in each population provided the same number of observations is fixed in each population (r1 = r2).
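The censored estimator is easy to compute from the r smallest observed lifetimes; a small sketch (function name ours, for illustration):

```python
import numpy as np

def censored_mle(sample, r):
    """MLE of the exponential rate under type II censorship: of n units
    on test, only the r smallest lifetimes X_(1), ..., X_(r) are observed.
    lambda* = r / (sum of the r smallest + (n - r) * r-th smallest).
    The full unordered sample is passed here only for convenience."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    total_time_on_test = x[:r].sum() + (n - r) * x[r - 1]
    return r / total_time_on_test
```

With r = n the estimator reduces to the uncensored MLE n/ΣX_i.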

4. More than two populations case

When k > 2 populations are present the rules definitions are more complex. Assume λ1 ≥ λ2 ≥ ⋯ ≥ λk. Let λ̂1, λ̂2, ..., λ̂k be the (unrestricted) MLEs of the parameters in each population, λ̂_(1) ≤ λ̂_(2) ≤ ⋯ ≤ λ̂_(k) the ordered estimators, and suppose λ̂_(i) = λ̂_j. Then the usual rule can be written in the following way:

z ∈ P_j ⟺ (ln λ̂_(i) − ln λ̂_(i+1))/(λ̂_(i) − λ̂_(i+1)) ≤ z < (ln λ̂_(i−1) − ln λ̂_(i))/(λ̂_(i−1) − λ̂_(i)),

while the ordered rule is

z ∈ P_{k−i+1} ⟺ (ln λ̂_(i) − ln λ̂_(i+1))/(λ̂_(i) − λ̂_(i+1)) ≤ z < (ln λ̂_(i−1) − ln λ̂_(i))/(λ̂_(i−1) − λ̂_(i)),

where in both cases only the appropriate inequality is considered for the extreme populations. The consistency of the ordered rule can be proved using the same techniques appearing in Basu and Gupta (1974). We have used simulations to evaluate the global behavior of the ordered rule. The simulations considered for these cases indicate that the ordered rule globally outperforms the usual one no matter how many populations are considered. Table 2 deals with the three populations case, taking into account that the classification probabilities only depend on λ1, λ2, λ3 through λ2/λ1 and λ3/λ2. In this table we have assumed n1 = ⋯ = nk = n = 10. In each cell of the table we give the following values: the left column contains values for R_U, first the probability of correct classification and then P_U(1/1), P_U(2/2) and P_U(3/3); the right column of each cell contains the same values for the ordered rule R_O. Notice that in all cases P_O(1/1) ≥ P_U(1/1), P_O(2/2) ≤ P_U(2/2) and P_O(3/3) ≥ P_U(3/3). The same configuration holds for other values of n, including n = 1.
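The k-population ordered rule just described can be sketched as follows (our code, illustrative; ties among the estimates are not handled, and the slot search implements the cutpoint conditions above):

```python
import numpy as np

def ordered_rule_k(z, samples):
    """Ordered rule for k exponential populations with known order
    lambda_1 >= lambda_2 >= ... >= lambda_k.
    `samples` is a list of k training samples (population 1 first)."""
    k = len(samples)
    lam = np.array([len(s) / float(np.sum(s)) for s in samples])
    srt = np.sort(lam)          # lambda_(1) <= ... <= lambda_(k)
    # cutpoints between consecutive ordered estimates; decreasing in i
    cuts = (np.log(srt[:-1]) - np.log(srt[1:])) / (srt[:-1] - srt[1:])
    # locate slot i (1-based): cut_i <= z < cut_{i-1}, one-sided at extremes
    i = 1
    while i <= len(cuts) and z < cuts[i - 1]:
        i += 1
    # slot i holds the i-th smallest estimate -> population k - i + 1
    return k - i + 1
```

A large z (long lifetime) lands in the first slot and is assigned to population k, the one with the smallest rate and hence the largest mean, regardless of which training sample happened to produce the smallest estimate.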


Table 2. Global and individual correct classification probabilities for k = 3 (each entry: R_U/R_O).

Columns: λ2/λ1 = 0.1, 0.25, 0.5, 0.75, 0.9, 1

λ3/λ2 = 0.1
  Global   0.77/0.77  0.69/0.69  0.62/0.62  0.59/0.59  0.57/0.57  0.56/0.56
  P(1/1)   0.98/0.98  0.96/0.96  0.94/0.94  0.91/0.91  0.89/0.89  0.88/0.88
  P(2/2)   0.54/0.54  0.32/0.32  0.14/0.14  0.07/0.07  0.06/0.06  0.05/0.05
  P(3/3)   0.78/0.78  0.78/0.78  0.78/0.78  0.77/0.77  0.76/0.76  0.75/0.75

λ3/λ2 = 0.25
  Global   0.71/0.71  0.63/0.63  0.56/0.56  0.52/0.52  0.50/0.50  0.49/0.49
  P(1/1)   0.96/0.96  0.93/0.93  0.89/0.89  0.84/0.84  0.81/0.81  0.78/0.78
  P(2/2)   0.52/0.52  0.30/0.30  0.15/0.15  0.08/0.08  0.07/0.07  0.07/0.07
  P(3/3)   0.64/0.64  0.65/0.65  0.64/0.64  0.63/0.63  0.63/0.63  0.62/0.62

λ3/λ2 = 0.5
  Global   0.61/0.64  0.55/0.56  0.47/0.49  0.44/0.45  0.42/0.43  0.40/0.41
  P(1/1)   0.86/0.94  0.84/0.89  0.72/0.81  0.70/0.76  0.69/0.73  0.62/0.69
  P(2/2)   0.46/0.46  0.30/0.28  0.22/0.15  0.14/0.08  0.10/0.06  0.16/0.07
  P(3/3)   0.50/0.52  0.51/0.52  0.49/0.52  0.47/0.50  0.47/0.49  0.43/0.49

λ3/λ2 = 0.75
  Global   0.51/0.60  0.47/0.52  0.42/0.45  0.38/0.40  0.36/0.38  0.35/0.37
  P(1/1)   0.64/0.92  0.63/0.87  0.58/0.78  0.50/0.71  0.49/0.67  0.45/0.65
  P(2/2)   0.46/0.43  0.38/0.26  0.29/0.13  0.27/0.07  0.25/0.06  0.27/0.06
  P(3/3)   0.43/0.44  0.39/0.44  0.37/0.44  0.35/0.43  0.34/0.41  0.33/0.40

λ3/λ2 = 0.9
  Global   0.48/0.57  0.43/0.50  0.38/0.43  0.36/0.38  0.34/0.36  0.33/0.35
  P(1/1)   0.56/0.92  0.51/0.83  0.45/0.76  0.43/0.68  0.38/0.64  0.36/0.62
  P(2/2)   0.47/0.39  0.43/0.25  0.39/0.12  0.31/0.06  0.30/0.06  0.32/0.05
  P(3/3)   0.40/0.41  0.35/0.41  0.31/0.40  0.32/0.40  0.33/0.37  0.32/0.36

λ3/λ2 = 1
  Global   0.46/0.56  0.41/0.49  0.37/0.42  0.34/0.37  0.33/0.35  0.33/0.33
  P(1/1)   0.49/0.91  0.41/0.84  0.39/0.76  0.39/0.68  0.35/0.64  0.32/0.61
  P(2/2)   0.47/0.38  0.50/0.23  0.42/0.11  0.35/0.06  0.30/0.05  0.35/0.06
  P(3/3)   0.40/0.40  0.31/0.39  0.29/0.38  0.30/0.37  0.35/0.35  0.33/0.34

Table 3. Global correct classification probabilities for different values of k (each entry: R_U/R_O).

ρ \ k     4          5          7          10
0.1    0.70/0.70  0.64/0.64  0.55/0.55  0.44/0.44
0.25   0.56/0.56  0.50/0.50  0.40/0.41  0.31/0.31
0.5    0.41/0.42  0.37/0.37  0.30/0.31  0.24/0.24
0.75   0.30/0.33  0.26/0.28  0.22/0.23  0.17/0.18
0.9    0.26/0.28  0.21/0.23  0.16/0.18  0.12/0.13
0.99   0.25/0.25  0.20/0.20  0.14/0.15  0.10/0.10

In Table 3, we have considered several values of k from 4 to 10. The simulations have also been done for ni = 10, i = 1, ..., k, and now we have further assumed λ_{i+1}/λ_i = ρ ∈ (0, 1] for i = 1, ..., k − 1. For this case only the global classification probabilities are given. The first value in each cell corresponds to R_U and the second one to R_O. We can observe that the ordered rule seems to perform globally better even for high values of k.


Acknowledgements The authors thank the referees for their suggestions and comments which led to this improved version of the paper.

Appendix A.

Here, we develop the proofs of Theorems 5 and 6. First, we state a pair of lemmas that will be used in the proof of the theorems.

Lemma A.1. For any n > 0 and a, b ∈ R+,

(1 + a + b/n)^(2n) / (1 + a + b/(n+1))^(2n+2) ≤ 1/(1 + a)^2.

Let us denote

g(t, n, ρ) = 1/(1 + ρt)^(2n) − 2/(1 + ρt + (1/n)·t ln(t)/(t − 1))^(2n).   (A.1)

Lemma A.2. For each ρ ∈ (0, 1] and n > 0 there is a single point t0(ρ, n) ∈ (0, 1) such that g(t, n, ρ) < 0 for 0 < t < t0(ρ, n) and g(t, n, ρ) > 0 for t0(ρ, n) < t < 1.

Proof of Lemma A.1. Define

f(b) = (1 + a + b/n)^(2n) / (1 + a + b/(n+1))^(2n+2).

It is straightforward to check that f′(b) < 0 as a and b are positive numbers. Therefore, f(b) is decreasing in b, and we have f(b) ≤ f(0) for all b ≥ 0, so that

(1 + a + b/n)^(2n) / (1 + a + b/(n+1))^(2n+2) ≤ (1 + a)^(2n)/(1 + a)^(2n+2) = 1/(1 + a)^2. □

Proof of Lemma A.2. Define

g*(t) = (1 + ρt + (1/n)·t ln(t)/(t − 1))^(2n) − 2(1 + ρt)^(2n)

and assume that ρ and n are fixed numbers. As g*(t) has the same sign as g(t, n, ρ), to prove the lemma it is enough to show that for these fixed values of ρ and n there is a single value t0 ∈ (0, 1) such that g*(t0) = 0. It is easy to check that

lim_{t→0+} g*(t) = −1 < 0.

Next, we prove that

lim_{t→1−} g*(t) = (1 + ρ + 1/n)^(2n) − 2(1 + ρ)^(2n) > 0.

This is equivalent to

(1/n)(2^(1/2n) − 1)^(−1) > 1 + ρ,

but this is true since (1/n)(2^(1/2n) − 1)^(−1) is increasing in n and we have

(1/n)(2^(1/2n) − 1)^(−1) ≥ 1/(√2 − 1) > 2 ≥ 1 + ρ for any ρ ∈ (0, 1].

We have proved that there is a point t0 ∈ (0, 1) such that g*(t0) = 0. To conclude the result we just have to check that this point is unique. Now,

g*(t) = 0 ⟺ (1 + (1/n)·t ln(t)/((t − 1)(1 + ρt)))^(2n) − 2 = 0.

Then to prove the result it is enough to prove that

f(t) = t ln(t)/((t − 1)(1 + ρt))

is strictly increasing in t ∈ (0, 1). The first derivative is

f′(t) = [(t − 1)(1 + ρt) − (1 + ρt^2) ln(t)] / [(t − 1)^2 (1 + ρt)^2].

Using Taylor's expansion of the logarithm,

−ln(t) = Σ_{i=1}^∞ (1 − t)^i / i > (1 − t) + (1 − t)^2/2   for all t ∈ (0, 1),

we have

(t − 1)(1 + ρt) − (1 + ρt^2) ln(t) > (t − 1)(1 + ρt) + (1 + ρt^2)[(1 − t) + (1 − t)^2/2] = (1 − t)^2 [(1 + ρt^2)/2 − ρt],

and (1 + ρt^2)/2 − ρt > 0 in (0, 1), so that f′(t) > 0 for all t ∈ (0, 1), f(t) is strictly increasing in (0, 1), and the result is done. □

Now we prove the theorems.


Proof of Theorem 5. First part: Taking into account (3) and (4), we just have to prove

P_λ1(Z ≤ x̂0, λ̂1 ≤ λ̂2) ≥ P_λ1(Z > x̂0, λ̂1 ≤ λ̂2),

or

P_λ1(Z ≤ x̂0, Σ_{i=1}^n X_i ≥ Σ_{i=1}^n Y_i) ≥ P_λ1(Z > x̂0, Σ_{i=1}^n X_i ≥ Σ_{i=1}^n Y_i).

If we use (5) as in Theorem 2, we have

P_λ1(Z > x̂0, Σ X_i ≥ Σ Y_i) = ∫_0^∞ ∫_0^x ∫_{x̂0}^∞ λ1 e^(−λ1 z) f(x, λ1, n) f(y, λ2, n) dz dy dx
  = ∫_0^∞ ∫_0^x exp(−(λ1/n)·x y ln(x/y)/(x − y)) f(x, λ1, n) f(y, λ2, n) dy dx.

Now, if we consider the change of variable t = y/x, u = x, the integral can be written as

∫_0^1 ∫_0^∞ u exp(−(λ1/n)·u t ln(t)/(t − 1)) f(u, λ1, n) f(ut, λ2, n) du dt
  = [(λ1 λ2)^n/((n − 1)!)^2] ∫_0^1 t^(n−1) [∫_0^∞ u^(2n−1) exp(−u(λ1 + λ2 t + (λ1/n)·t ln(t)/(t − 1))) du] dt
  = [(λ1 λ2)^n (2n − 1)!/((n − 1)!)^2] ∫_0^1 t^(n−1) / (λ1 + λ2 t + (λ1/n)·t ln(t)/(t − 1))^(2n) dt
  = K ∫_0^1 h1(t, n, ρ) dt,

where

ρ = λ2/λ1 ∈ (0, 1],   K = ρ^n (2n − 1)!/((n − 1)!)^2,

and

h1(t, n, ρ) = t^(n−1) / (1 + ρt + (1/n)·t ln(t)/(t − 1))^(2n).   (A.2)

With similar arguments,

P(λ̂1 ≤ λ̂2) = P(Σ_{i=1}^n X_i ≥ Σ_{i=1}^n Y_i) = [(λ1 λ2)^n (2n − 1)!/((n − 1)!)^2] ∫_0^1 t^(n−1)/(λ1 + λ2 t)^(2n) dt = K ∫_0^1 h0(t, n, ρ) dt,

where

h0(t, n, ρ) = t^(n−1)/(1 + ρt)^(2n).   (A.3)

Now from (A.2) and (A.3),

P_λ1(Z ≤ x̂0, λ̂1 ≤ λ̂2) = P(λ̂1 ≤ λ̂2) − P_λ1(Z > x̂0, λ̂1 ≤ λ̂2) = K ∫_0^1 (h0(t, n, ρ) − h1(t, n, ρ)) dt,

and it is enough to prove

P_λ1(Z ≤ x̂0, λ̂1 ≤ λ̂2) − P_λ1(Z > x̂0, λ̂1 ≤ λ̂2) > 0,

or

K ∫_0^1 (h0(t, n, ρ) − 2 h1(t, n, ρ)) dt > 0.

As K > 0, we will just check that the integral is positive. To prove this we use induction on n. Denote

H(n, ρ) = ∫_0^1 (h0(t, n, ρ) − 2 h1(t, n, ρ)) dt.   (A.4)

We start with n = 2. In this case,

H(2, ρ) = ∫_0^1 t [1/(1 + ρt)^4 − 2/(1 + ρt + (1/2)·t ln(t)/(t − 1))^4] dt.

Since t ln(t)/(t − 1) is a concave function, it is not difficult to check that

t ln(t)/(t − 1) ≥ 2t for t ∈ [0, 0.2),   t ln(t)/(t − 1) ≥ (3/4)t + 1/4 for t ∈ [0.2, 1],

so that

H(2, ρ) ≥ ∫_0^1 t/(1 + ρt)^4 dt − ∫_0^0.2 2t/(1 + ρt + t)^4 dt − ∫_0.2^1 2t/(1 + ρt + (3/8)t + 1/8)^4 dt,

which evaluates to a quotient of polynomials in ρ with denominator 6(1 + ρ)^3 (3 + 2ρ)^2 (6 + ρ)^3,

and the quotient is positive because the denominator is positive in (0, 1] and the numerator is a concave function with positive values in 0 and 1. As we are using induction on n, now we assume the result is true for n and we check it for n + 1. From (A.2) and (A.3),

H(n + 1, ρ) = ∫_0^1 (h0(t, n + 1, ρ) − 2 h1(t, n + 1, ρ)) dt
  = ∫_0^1 [h0(t, n, ρ)·t/(1 + ρt)^2 − 2 h1(t, n, ρ)·t·(1 + ρt + (1/n)·t ln(t)/(t − 1))^(2n) / (1 + ρt + (1/(n+1))·t ln(t)/(t − 1))^(2n+2)] dt.

Now using Lemma A.1,

H(n + 1, ρ) ≥ ∫_0^1 [t/(1 + ρt)^2] (h0(t, n, ρ) − 2 h1(t, n, ρ)) dt = ∫_0^1 r(t) t^(n−1) g(t, n, ρ) dt,

where g(t, n, ρ) is the function defined in (A.1) and used in Lemma A.2, and r(t) = t/(1 + ρt)^2. It is straightforward to check that r(t) is positive and increasing in t ∈ [0, 1], so if we consider t0 = t0(ρ, n) defined in Lemma A.2,

H(n + 1, ρ) ≥ ∫_0^{t0} r(t) t^(n−1) g(t, n, ρ) dt + ∫_{t0}^1 r(t) t^(n−1) g(t, n, ρ) dt
  ≥ ∫_0^{t0} r(t0) t^(n−1) g(t, n, ρ) dt + ∫_{t0}^1 r(t0) t^(n−1) g(t, n, ρ) dt
  = r(t0) ∫_0^1 t^(n−1) g(t, n, ρ) dt = r(t0) H(n, ρ) > 0

for each ρ ∈ (0, 1], since in Lemma A.2 we proved that g(t, n, ρ) < 0 for t < t0 and positive for t > t0. Then the first part of the theorem is proved.

Second part: Now n = 1. For this part it is enough to prove that there exists ρ0 ∈ (0, 1) such that P_λ1(Z ≤ x̂0, λ̂1 ≤ λ̂2) ≥ P_λ1(Z > x̂0, λ̂1 ≤ λ̂2) for any λ1, λ2 > 0 satisfying 0 < λ2/λ1 ≤ ρ0 and the opposite inequality for ρ0 < λ2/λ1 < 1. Using (A.4), we can write

P_λ1(Z ≤ x̂0, λ̂1 ≤ λ̂2) − P_λ1(Z > x̂0, λ̂1 ≤ λ̂2) = ρ·H(1, ρ).

H(1, ρ) is strictly decreasing in ρ ∈ (0, 1) because

H′(1, ρ) = (−2) ∫_0^1 t [1/(1 + ρt)^3 − 2/(1 + ρt + t ln(t)/(t − 1))^3] dt
  ≤ (−2) ∫_0^1 t [1/(1 + ρt)^3 − 2/(1 + (ρ + 1)t)^3] dt
  = (ρ^2 − 2)/[(1 + ρ)^2 (2 + ρ)^2] < 0,   (A.5)

where in (A.5) we have used the left side of the inequality

t ≤ t ln(t)/(t − 1) ≤ t + 0.22   for all t ∈ (0, 1).   (A.6)

Using (A.6) again,

H(1, 0) = ∫_0^1 [1 − 2/(1 + t ln(t)/(t − 1))^2] dt > ∫_0^1 [1 − 2/(1 + t)^2] dt = 0,

H(1, 1) = ∫_0^1 [1/(1 + t)^2 − 2/(1 + t + t ln(t)/(t − 1))^2] dt ≤ ∫_0^1 [1/(1 + t)^2 − 2/(1 + 2t + 0.22)^2] dt = −0.009 < 0,

and the result is done. □

Proof of Theorem 6. Using arguments similar to those in the proof of Theorem 5, we can write

P_O(2/2) − P_U(2/2) = P_λ2(Z > x̂0, λ̂1 ≤ λ̂2) − P_λ2(Z ≤ x̂0, λ̂1 ≤ λ̂2) = [ρ^n (2n − 1)!/((n − 1)!)^2] H2(n, ρ),

where

H2(n, ρ) = ∫_0^1 t^(n−1) [2/(1 + ρt + (ρ/n)·t ln(t)/(t − 1))^(2n) − 1/(1 + ρt)^(2n)] dt.   (A.7)

First part: Now, we fix n = 1. From (A.6) and (A.7),

H2(1, 0) = ∫_0^1 (2 − 1) dt = 1,

H2(1, 1) = ∫_0^1 [2/(1 + t + t ln(t)/(t − 1))^2 − 1/(1 + t)^2] dt ≥ ∫_0^1 [2/(1 + 2t + 0.22)^2 − 1/(1 + t)^2] dt = 0.009 > 0.

Moreover, H2(1, ρ) is decreasing in ρ:

H2′(1, ρ) = (−2) ∫_0^1 [2(t + t ln(t)/(t − 1))/(1 + ρt + ρ·t ln(t)/(t − 1))^3 − t/(1 + ρt)^3] dt
  ≤ (−2) ∫_0^1 [4t/(1 + ρt + ρ(t + 0.22))^3 − t/(1 + ρt)^3] dt,

and the last integral evaluates to a quotient of polynomials in ρ that is negative in (0, 1], since the denominator is positive and the numerator is a convex function whose values in 0 and 1 are negative. Therefore, H2(1, ρ) is decreasing in ρ and the first part of the theorem is proved.

Second part: From Theorem 2,

P_O(1/1) + P_O(2/2) ≥ P_U(1/1) + P_U(2/2),

and in Remark 3 we have noticed that the equality only holds when λ1 = λ2, i.e. when ρ = 1. Moreover, from Theorem 5, when n > 1,

P_O(1/1) > P_U(1/1) for any λ1 ≥ λ2 > 0.

Then we have

P_O(2/2) < P_U(2/2) for ρ = 1,

and as the probabilities are continuous functions in ρ, that is true in a neighborhood of ρ = 1. On the other hand,

H2(n, 0) = ∫_0^1 t^(n−1) dt = 1/n > 0,

and as ρ^n (2n − 1)!/((n − 1)!)^2 > 0 when ρ > 0, using continuity in ρ again, the difference is positive in a neighborhood of ρ = 0 and the result is done. □

References

Adegboye, O.S., 1993. The optimal classification rule for exponential populations. Austral. J. Statist. 35 (2), 185–194.
Basu, A.P., 1965. On some tests of hypotheses relating to the exponential distribution when some outliers are present. J. Amer. Statist. Assoc. 60, 548–559.
Basu, A.P., Gupta, A.K., 1974. Classification rules for exponential populations. In: Proschan, F., Serfling, R.J. (Eds.), Reliability and Biometry. SIAM, Philadelphia, pp. 637–650.
Bhattacharya, P.K., Das Gupta, S., 1964. Classification between univariate exponential distributions. Sankhya Ser. A 26, 17–24.
Chang, Y.-T., Shinozaki, N., 2002. A comparison of restricted and unrestricted estimators in estimating linear functions of ordered scale parameters of two gamma distributions. Ann. Inst. Statist. Math. 54 (4), 848–860.
Fernández, M.A., Rueda, C., Salvador, B., 1999. The loss of efficiency estimating contrast under restrictions. Scand. J. Statist. 26, 79–92.
Lee, C.C., 1988. The quadratic loss of order restricted estimators for treatment means with a control. Ann. Statist. 16, 751–758.


Long, T., Gupta, R.D., 1998. Alternative linear classification rules under order restrictions. Comm. Statist.—Theory Methods 27 (3), 559–575.
Menéndez, J.A., Salvador, B., 1991. Anomalies of the likelihood ratio test for testing restricted hypotheses. Ann. Statist. 19, 889–898.
Vijayasree, G., Singh, H., 1991. Simultaneous estimation of two ordered exponential parameters. Comm. Statist.—Theory Methods 20 (8), 2559–2576.
Vijayasree, G., Singh, H., 1993. Mixed estimators of two ordered exponential means. J. Statist. Plann. Inference 35, 47–56.
Zelen, M., 1966. Application of exponential models to problems in cancer research with discussion. J. Roy. Statist. Soc. Ser. A 129, 368–398.

Research Article
Received 11 August 2011, Accepted 14 May 2012. Published online 19 July 2012 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.5476

Classification of samples into two or more ordered populations with application to a cancer trial

D. Conde, M. A. Fernández,*† C. Rueda and B. Salvador

In many applications, especially in cancer treatment and diagnosis, investigators are interested in classifying patients into various diagnosis groups on the basis of molecular data such as gene expression or proteomic data. Often, some of the diagnosis groups are known to be related to higher or lower values of some of the predictors. The standard methods of classifying patients into various groups do not take into account the underlying order. This could potentially result in high misclassification rates, especially when the number of groups is larger than two. In this article, we develop classification procedures that exploit the underlying order among the mean values of the predictor variables and the diagnostic groups by using ideas from order-restricted inference. We generalize the existing methodology on discrimination under restrictions and provide empirical evidence to demonstrate that the proposed methodology improves over the existing unrestricted methodology. The proposed methodology is applied to a bladder cancer data set where the researchers are interested in classifying patients into various groups. Copyright © 2012 John Wiley & Sons, Ltd.

Keywords:

classification rules; order-restricted inference; cancer diagnostic test research

1. Introduction

An important component of proper patient care, especially in cancer treatment, is correct classification of the patient into one of the disease stages. Such a classification problem in bladder cancer motivated this work. We aim to investigate new methodologies to select the best classifiers in the context of an in vitro diagnostic tool for the disease. Our industrial and pharmaceutical partners in this research are Proteomika S.L. and Laboratorios SALVAT, S.A. The final aim of the research is to develop a non-invasive in vitro test for the diagnosis of bladder cancer recurrence (transitional cell carcinoma). Because of its high recurrence rates, the disease is perceived as chronic (as many as 80% of patients have at least one recurrence). Therefore, high rates of recurrence and progression make careful long-term follow-up a clinical priority. An ideal diagnostic test must show elevated levels of sensitivity and specificity and could be used on patients with a history of transitional cell carcinoma and on clinical follow-up. Development of new non-invasive methods to monitor patients could be very useful to complement, and reduce the number of, cystoscopies, which to date remain the gold standard for diagnosis. The institutional review board of the clinical center approved the study, and all patients signed an informed consent. We classified patients into five levels on the basis of cystoscopy. The first level is the control level (i.e., a negative cystoscopy result, therefore considered as absence of bladder cancer), and the other four levels are denoted Ta, T1G1, T1G3, and T2, each corresponding to increasingly advanced levels of cancer. This combines the TNM staging, which uses the size and extension of the primary Tumor, its lymphatic Nodes involvement, and the presence of Metastases to classify the progression of cancer [1, 2], and the grading. Grading is also important, as there is interobserver variability in classifying


Statist. Med. 2012, 31 3773–3786


Departamento de Estadística e Investigación Operativa, Universidad de Valladolid, Spain *Correspondence to: M. A. Fernández, Departamento de Estadística e Investigación Operativa. C/Prado de la Magdalena s/n. Universidad de Valladolid, 47005 Valladolid, Spain. † E-mail: [email protected]


due not only to staging but also to grading [3]. To be more precise, stage T describes the size of the tumor and whether it has spread, and grade G refers to the appearance of the cells under the microscope. In the Ta stage, the tumor is only in the innermost lining of the bladder, whereas in T1 it has started to grow into the connective tissue just under the bladder lining. Ta and T1 are non-invasive tumors, but T2 is an invasive tumor because in that stage the tumor has grown through the connective tissue into the muscle. The grade also gives an idea about how rapidly the cancer may develop [4]. G1, low grade, means that the cancer looks much like normal bladder cells, whereas in grade G3 the cells look very abnormal, are likely to grow more quickly, and are more likely to spread. As usual in this kind of research, an initial database with a moderate number of patients was provided. The purpose of this pilot study was to confirm or discard the associations between the proteins and the illness (i.e., to check whether the protein levels are useful to predict the level of the illness) in order to establish a larger multicenter study. On the basis of a priori information provided by our pharmaceutical and industrial partners, we use three proteins for our analysis. As, for intellectual property reasons, the names of the proteins used in this study are not disclosed in the paper, we denote these proteins as P1, P2, and P3. We expect the mean values of these three proteins to increase with the severity of the illness. Because sample sizes in some of the illness groups are very small compared with those of the others, and also because it seemed clinically reasonable, we merged the five initial levels of the illness into the following three groups: the control group, G1; the Ta + T1G1 group, G2; and the T1G3 + T2 group, G3.
As usual, the values of the protein levels have been transformed logarithmically so that the variables are approximately normally distributed. When the database was provided, the classification results based on standard statistical methodology were surprisingly poor. Some protein levels were expected to have good classification power for discriminating among the levels of the illness, as those levels were expected to increase with the severity of the disease. However, the percentages of correct classification obtained with these protein levels as predictors were quite low. A careful exploration of the database revealed that, in the sample, the levels of those proteins known to be associated with the disease did not conform to the severity of the illness. In situations such as these, the classical classification rules, which are designed to discriminate between groups, do not exploit the underlying information regarding the order relations among the mean values of the predictor variables and the diagnostic groups. Consequently, as will be seen in this paper, they are not expected to perform as well as rules that honor the underlying 'inequality' structure. In this work, we present discriminant rules designed to deal with that sort of information. From the theoretical point of view, this method is an extension and non-trivial generalization of the previous efforts made in trying to define classification rules under order restrictions for the two-population case [5, 6]. The extension to more than two populations is not straightforward: when only two populations are present, there are not many ordering possibilities among them, but when there are more than two populations, a different ordering may appear for each of the predictors. Notice that, although the populations may be ordered, our focus is different from approaches considering an ordinal response variable, as those ordinal rules rely only on the natural ordering existing among the populations.
From the practical point of view, we would like to highlight that the kind of data set used in this work is quite common in statistical practice when dealing with protein or microarray data or, in general, when a small data set is provided to derive a classification rule and there is some additional knowledge about the problem. This work shows the power of order-restricted inference in the solution of real statistical problems; some recent papers in this line are [7–11]. The layout of the work is as follows. In Section 2, we develop the methodology underlying the proposed solution and its properties. We devote Section 3 to the simulation study showing the good behavior of the new rules. We present in Section 4 the results for the bladder cancer problem; finally, we give the discussion and future developments in Section 5.

2. Isotonized rules


In this section, we describe the new classification rules we propose. These rules are especially interesting for the analysis of problems such as the one described in the Introduction. Initially, ours is the classical statistical problem of classifying observations into one of $k$ populations; the main difference is that we assume additional information on the parameters of the populations. We first describe Fisher's linear discriminant rule, which is the classical solution for this situation and is available in any statistical package dealing with discrimination problems. Then, we detail how the additional information is mathematically taken into account and define the new rules incorporating it. The new isotonized rules have a structure very similar to Fisher's rule; the main difference is that we use alternative estimators of the parameters, estimators that incorporate the additional information available on the problem.

Let us denote population $i$ as $G_i$ and let $X = (X_1, \dots, X_p)'$ be the vector of predictors, that is, the variables to be used for classifying the observations. We assume normality, so that if the observation considered comes from $G_i$, then $X \sim N_p(\mu_i, \Sigma)$, where $\mu_i = (\mu_{i1}, \dots, \mu_{ip})'$ is the mean of vector $X$ in population $i$ and $\Sigma$ is the covariance matrix of vector $X$. We are assuming different means in each population but a common covariance matrix. In the rest of this work, we will also assume equal a priori probabilities for the groups; the case of different a priori probabilities follows straightforwardly. Under these assumptions, it is well known that the best way of classifying a new observation $z = (z_1, \dots, z_p)'$ is the following linear classification rule:

classify $z$ in $G_i$ iff $i = \arg\min_j \left\{ (z - \mu_j)' \Sigma^{-1} (z - \mu_j),\ j = 1, \dots, k \right\}$.

As the mean vectors and the covariance matrix are usually unknown, they are estimated, so that the usual classification rule (Fisher's rule) for this case is

classify $z$ in $G_i$ iff $i = \arg\min_j \left\{ (z - \bar{x}_j)' S^{-1} (z - \bar{x}_j),\ j = 1, \dots, k \right\}$. (1)

Next, we define rules that take the additional information into account. This information is incorporated into the rule by restricting the parameter space. For our bladder cancer case study, we are assuming that the levels of the proteins increase with the stage of the illness; in the usual statistical terminology, there is a simple order among the three population means in each of the three variables.

Mathematically, the additional information is formulated using cones. In our case, we can write

$(\mu_{1s}, \mu_{2s}, \mu_{3s}) \in C_{BC} = \{ x \in \mathbb{R}^3 : x_1 \le x_2 \le x_3 \}$ for $s = 1, 2, 3$.

Fernández et al. [6] dealt with the case $k = 2$. In that case, the problem can be simplified by rewriting the rule and the additional information on the parameters in terms of the difference of means $\delta = \mu_1 - \mu_2$. Then, the usual estimator of the difference of means is projected onto the cone containing the information. A family of estimators of the difference of means, $\hat{\delta}_\gamma$ with $\gamma \in [0,1]$, fulfilling the additional information is obtained by means of an iterative procedure that is shown to converge. These estimators are plugged into the original rule to obtain the following isotonized rules:

$R_\delta(\gamma)$: classify $z$ in $G_1$ iff $\left( z - \dfrac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} + \dfrac{n_1 - n_2}{2(n_1 + n_2)}\, \hat{\delta}_\gamma \right)' S^{-1} \hat{\delta}_\gamma \ge 0$, (2)

where $n_1$ and $n_2$ are the sample sizes of the populations. For full details on these rules, the reader is referred to [6], where results and simulations showing that this rule performs better than the two-population version of Fisher's rule are given. However, these rules do not have a direct generalization to the case of $k > 2$ populations because, in that case, the reduction of the problem to a single parameter $\delta \in \mathbb{R}^p$ is not possible. Our proposal in this paper is to work in an extended space $\mathbb{R}^{kp}$ and to define a cone of restrictions in $\mathbb{R}^{kp}$ representing the additional information available on the problem. For the bladder cancer case study, we have that

$(\mu_1', \mu_2', \mu_3')' \in C_{BC} = \{ x \in \mathbb{R}^9 : x_i \le x_{i+3} \le x_{i+6},\ i = 1, 2, 3 \}$.
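As an illustration of rule (1), the following is a minimal pure-Python sketch of Fisher-type classification by minimum Mahalanobis distance; the means and pooled covariance below are hypothetical, not taken from the case study.

```python
def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def mahalanobis2(z, mean, S):
    # Squared Mahalanobis distance (z - mean)' S^{-1} (z - mean).
    d = [zi - mi for zi, mi in zip(z, mean)]
    Sinv_d = solve(S, d)
    return sum(di * si for di, si in zip(d, Sinv_d))

def fisher_classify(z, means, S):
    # Rule (1): assign z to the group with the smallest Mahalanobis distance.
    return min(range(len(means)), key=lambda j: mahalanobis2(z, means[j], S))

means = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # hypothetical group means
S = [[1.0, 0.3], [0.3, 1.0]]                  # hypothetical pooled covariance
print(fisher_classify([1.9, 2.1], means, S))  # closest to the third group (index 2)
```

With the pooled covariance replaced by the identity matrix, the rule reduces to nearest-mean classification.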


Projecting the sample means onto the cone with the appropriate metric ensures that we find the estimator of the means that is as close as possible to the sample means while verifying the additional information. The most usual order restrictions are easily written using cones. For example, the simple order cone appears, as in our case study, when there is an increasing relation between the level of the illness and the predictors. This cone can be expressed as

$C_{SO} = \left\{ x \in \mathbb{R}^{kp} : x_t \le x_{t+p} \le \dots \le x_{t+(k-1)p},\ t \in T \right\}$,

where $T \subseteq \{1, \dots, p\}$ is the set of predictors for which the simple order relation is known to hold. Clearly, $C_{BC}$ is the particular case of $C_{SO}$ with $k = p = 3$ and $T = \{1, 2, 3\}$. Another usual cone of restrictions among the mean values of variables is the so-called tree order. This cone usually appears when comparing $k - 1$ treatments with a control: if $G_1$ is the control group, it is common to know that $\mu_{1j} \le \mu_{ij}$ for $i = 2, \dots, k$ and $j \in T$. This cone can be written as

$C_{TO} = \left\{ x \in \mathbb{R}^{kp} : x_t \le x_{t+np},\ t \in T,\ n = 1, \dots, k-1 \right\}$.

Because the projections are made in a $kp$-dimensional space, we also need a metric for the projection. The appropriate metric is the one given by the $kp \times kp$ matrix

$S_1 = \mathrm{diag}\left( \dfrac{S}{n_1}, \dfrac{S}{n_2}, \dots, \dfrac{S}{n_k} \right)$, (3)

where $n_1, \dots, n_k$ are the sample sizes in the populations $i = 1, 2, \dots, k$. For details on cones of restrictions, their geometry, and projections onto them, the reader may consult [12, 13], which are books containing the main results in restricted inference. Now, a family of estimators $\hat{\mu}_{\gamma i}$, $i = 1, \dots, k$, indexed by $\gamma \in [0, 1]$, is defined using projections as follows.

Definition 1
Let $\hat{\mu}_\gamma \in \mathbb{R}^{kp}$, for $\gamma \in [0, 1]$, be the limit value, as $m \to \infty$, of the following iterative procedure:

$\hat{\mu}_\gamma^{(m)} = P_{S_1}\!\left(\hat{\mu}_\gamma^{(m-1)} \mid C\right) - \gamma\, P_{S_1}\!\left(\hat{\mu}_\gamma^{(m-1)} \mid C^P\right), \quad m = 1, 2, \dots,$

where $\hat{\mu}_\gamma^{(0)} = (\bar{x}_1', \dots, \bar{x}_k')' \in \mathbb{R}^{kp}$, $P_{S_1}(Y \mid C)$ is the projection of $Y$ onto cone $C$ using the metric given by $S_1$, and $C^P = \{ y \in \mathbb{R}^{kp} : y'x \le 0,\ \forall x \in C \}$ is the polar cone of $C$.

Remark 2
From this, $\hat{\mu}_{\gamma i} = \left( (\hat{\mu}_\gamma)_{(i-1)p+1}, \dots, (\hat{\mu}_\gamma)_{ip} \right)'$ for $i = 1, \dots, k$. Notice that the iterative procedure is the same one appearing in [6]. In that paper, the procedure was used to estimate $\delta = \mu_1 - \mu_2$, whereas here we estimate $\mu_1, \mu_2, \dots, \mu_k$. The proof of the convergence of this iterative procedure is similar to the one in [6] and therefore is not given here.

When $\gamma = 0$, $\hat{\mu}_{0i}$ is a standard estimator in order-restricted inference: it is the projection of the sample means onto the cone of restrictions and also the restricted maximum likelihood estimator. As the value of $\gamma$ increases, we obtain estimators that lie more deeply inside the cone. We will see in what follows that rules based on estimators with $\gamma > 0$ perform better than the one with $\gamma = 0$. The main reason for this, which also justifies the choice of this family of estimators, may be that restricted estimators belonging to the frontier of the cone of restrictions (i.e., the points of the cone for which at least one of the inequalities in the definition of the cone is verified as an equality) are not admissible under the squared error loss [14]. The question of the choice of $\gamma$ will be further considered in the final Discussion section. Now, the new rules, which we denote as $R_\mu(\gamma)$, can be defined as

$R_\mu(\gamma)$: classify $z$ in $G_i$ iff $i = \arg\min_j \left\{ (z - \hat{\mu}_{\gamma j})' S^{-1} (z - \hat{\mu}_{\gamma j}),\ j = 1, \dots, k \right\}$.

The following result states that these rules generalize the ones defined for two populations. Therefore, the isotonized rules share the good properties already proved for the $k = 2$ rules defined in the aforementioned paper [6] (i.e., the probability of correct classification is higher than that of Fisher's rule). The proof of the result is deferred to the Appendix to improve the readability of the paper.

Theorem 3
If $k = 2$, rules $R_\delta(\gamma)$ and $R_\mu(\gamma)$ are equivalent for $\gamma \in [0, 1]$.
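To give a concrete feel for Definition 1, here is an illustrative pure-Python sketch under the simplifying assumption of an identity metric and a single predictor with a simple order, so that the projection onto the cone is plain isotonic regression, computable by the pool-adjacent-violators algorithm (PAVA). It is not the authors' general $S_1$-metric implementation. The iteration uses Moreau's decomposition, $P(y \mid C^P) = y - P(y \mid C)$, valid for a closed convex cone in the Euclidean metric.

```python
def pava(y):
    # Pool-adjacent-violators: least-squares projection of y onto
    # the simple order cone {x : x_1 <= x_2 <= ... <= x_k}.
    out = []  # list of [block mean, block size]
    for v in y:
        out.append([v, 1])
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, s2 = out.pop()
            m1, s1 = out.pop()
            out.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    res = []
    for mean, size in out:
        res.extend([mean] * size)
    return res

def isotonized_estimator(y, gamma, iters=50):
    # Definition 1 specialized to an identity metric: since
    # P(y|C^P) = y - P(y|C), each step is mu <- (1+gamma)*P(mu|C) - gamma*mu.
    mu = list(y)
    for _ in range(iters):
        proj = pava(mu)
        mu = [(1 + gamma) * p - gamma * m for p, m in zip(proj, mu)]
    return mu

sample_means = [0.4, 0.1, 0.5]  # hypothetical sample means violating the order
print(isotonized_estimator(sample_means, 0.0))  # gamma = 0: restricted MLE
print(isotonized_estimator(sample_means, 1.0))  # gamma = 1: strictly inside the cone
```

In this toy example, $\gamma = 0$ returns the restricted MLE $(0.25, 0.25, 0.5)$, which lies on the frontier of the cone, while $\gamma = 1$ yields a strictly increasing vector, illustrating how larger $\gamma$ moves the estimator deeper inside the cone.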


Therefore, we have managed to extend the rules defined for two populations to the situation of more than two populations. Notice that the case of three or more populations is more interesting from the additional information point of view, because more ordering relations can be established in this case. For example, in the case of two populations the simple order cone $C_{SO}$ and the tree order cone $C_{TO}$ coincide, whereas this does not happen if there are more populations. All these different situations can be treated in a homogeneous way with the procedure we have just developed.
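The membership conditions defining $C_{SO}$ and $C_{TO}$ can be checked mechanically. The following sketch (0-based indices, hypothetical stacked mean vector) illustrates how a vector in $\mathbb{R}^{kp}$ may satisfy the tree order while violating the simple order.

```python
def in_simple_order(x, k, p, T):
    # C_SO: x_t <= x_{t+p} <= ... <= x_{t+(k-1)p} for every predictor t in T.
    return all(x[t + n * p] <= x[t + (n + 1) * p]
               for t in T for n in range(k - 1))

def in_tree_order(x, k, p, T):
    # C_TO: control group below every other group, x_t <= x_{t+n*p}.
    return all(x[t] <= x[t + n * p] for t in T for n in range(1, k))

# k = 3 groups, p = 3 predictors; stacked means (mu_1', mu_2', mu_3')'.
x = [0, 0, 0, 1, 2, 1, 2, 1, 3]
print(in_simple_order(x, 3, 3, [0, 1, 2]))  # False: predictor 2 has 2 > 1
print(in_tree_order(x, 3, 3, [0, 1, 2]))    # True: both groups above the control
```

Any vector in $C_{SO}$ is automatically in $C_{TO}$, which is the containment $C_{SO} \subset C_{TO}$ used later when comparing the two cones.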

3. Simulation study

3.1. Study design

In order to check the good performance of the $R_\mu(\gamma)$ rules, we have designed and developed the following simulation study. For simplicity, we concentrate on the case $k = 3$. We consider three 3-dimensional populations with distribution $N_3(\mu_i, \Sigma)$ under two different order restrictions, namely the simple order and tree order restrictions, which are the most common in practice. The simple order cone for this three-population, three-variable case is again the cone $C_{BC}$ appearing in our bladder cancer example, and the tree order cone is the $C_{TO}$ cone defined earlier for $k = 3$ and $p = 3$. We generate training samples of size $n_1 = n_2 = n_3 = 5$. It can be objected that these training sample sizes are smaller than is usual in practice; however, notice that variability also has to be taken into account, and we have also run simulations with higher sample sizes ($n_1 = n_2 = n_3 = 50$), rescaling the covariance matrices accordingly, obtaining similar results. Sixty-four different scenarios are considered for each of the two order cones, covering cases where the means and/or $\Sigma$ differ. The parameter configurations are different for each cone and are generated by 16 mean vectors and 4 covariance matrices. The values chosen for the covariance matrices are intended to cover values of the correlation coefficients that are usual in practice. As for the means, they are chosen so that the performance of the different restricted rules can be compared across parameter configurations. In all cases, the mean $\mu_1$ of the first population has been set equal to $(0, 0, 0)'$. Then, the 64 configurations are generated as $(\mu_{2,i}, \mu_{3,j}, \Sigma_k)$ with $i, j, k = 1, \dots, 4$, where full details of the configurations are given in Table I. For each scenario, we generated 10,000 training samples for which the rules are determined. For each of these training samples, 100 test observations coming from each of the three populations have been classified.
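A schematic version of this design can be sketched in a few lines of Python. For illustration we use an identity covariance (so rule (1) reduces to nearest-mean classification), hypothetical mean configurations, and far fewer replications than the 10,000 training samples used in the study.

```python
import random

random.seed(1)

def nearest_mean(z, means):
    # With identity covariance, rule (1) reduces to nearest-mean classification.
    return min(range(len(means)),
               key=lambda j: sum((zi - mi) ** 2 for zi, mi in zip(z, means[j])))

# Hypothetical configuration: k = 3 trivariate populations, identity covariance,
# training samples of size 5 per group as in the study design.
mus = [[0.0, 0.0, 0.0], [0.5, 0.5, 1.0], [1.0, 1.0, 2.0]]
n_train, n_test = 5, 100
correct = total = 0
for rep in range(200):  # the paper uses 10,000 training samples per scenario
    xbars = [[sum(random.gauss(m[d], 1.0) for _ in range(n_train)) / n_train
              for d in range(3)] for m in mus]
    for i, m in enumerate(mus):  # 100 test observations from each population
        for _ in range(n_test):
            z = [random.gauss(m[d], 1.0) for d in range(3)]
            total += 1
            correct += nearest_mean(z, xbars) == i
print(round(correct / total, 3))  # estimated total correct classification probability
```

In the study itself, the rules compared are Fisher's rule and the restricted rules built from the projected estimators, rather than the plain nearest-mean classifier used here for brevity.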

3.2. Simulation results

The results of the simulation study appear in Tables II and III. Each cell of those tables contains the total probability of correct classification $\hat{p}$ for the corresponding rule. We obtained this probability from the estimated probabilities of correct classification in each population as $\hat{p} = \frac{1}{3}(\hat{p}_{11} + \hat{p}_{22} + \hat{p}_{33})$, where

Table I. Mean vectors and covariance matrices.

Simple order cone means:
μ2,1 = μ1 + (0.1, 0.1, 0.1)    μ3,1 = μ2 + (0.1, 0.1, 0.1)
μ2,2 = μ1 + (0.1, 0.5, 0.5)    μ3,2 = μ2 + (0.1, 0.5, 0.5)
μ2,3 = μ1 + (0.5, 0.5, 1)      μ3,3 = μ2 + (0.5, 0.5, 1)
μ2,4 = μ1 + (1, 1, 1)          μ3,4 = μ2 + (1, 1, 1)

Tree order cone means:
μ2,1 = μ1 + (0.1, 0.1, 0.1)    μ3,1 = μ1 + (0.2, 0.2, 0.2)
μ2,2 = μ1 + (0.1, 0.5, 0.5)    μ3,2 = μ1 + (0.2, 1, 1)
μ2,3 = μ1 + (0.5, 0.5, 1)      μ3,3 = μ1 + (1, 1, 2)
μ2,4 = μ1 + (1, 1, 1)          μ3,4 = μ1 + (2, 2, 2)

Covariance matrices (unit variances, equal off-diagonal correlations):
Σ1: correlation 0       Σ2: correlation 0.3
Σ3: correlation −0.3    Σ4: correlation 0.7


Table II. Correct classification probabilities under C_SO.

                 Covariance Σ1               Covariance Σ2
Means         Fisher  R_μ(0)  R_μ(1)      Fisher  R_μ(0)  R_μ(1)
μ2,1 μ3,1     0.342   0.359   0.368       0.339   0.352   0.360
μ2,1 μ3,2     0.388   0.410   0.419       0.376   0.394   0.401
μ2,1 μ3,3     0.450   0.466   0.473       0.423   0.439   0.445
μ2,1 μ3,4     0.507   0.516   0.519       0.464   0.474   0.477
μ2,2 μ3,1     0.388   0.409   0.419       0.377   0.395   0.402
μ2,2 μ3,2     0.443   0.465   0.473       0.425   0.444   0.451
μ2,2 μ3,3     0.500   0.517   0.525       0.469   0.485   0.492
μ2,2 μ3,4     0.554   0.567   0.571       0.507   0.518   0.522
μ2,3 μ3,1     0.449   0.466   0.472       0.424   0.440   0.446
μ2,3 μ3,2     0.501   0.519   0.526       0.469   0.486   0.492
μ2,3 μ3,3     0.559   0.574   0.581       0.517   0.531   0.538
μ2,3 μ3,4     0.612   0.622   0.627       0.554   0.563   0.568
μ2,4 μ3,1     0.506   0.516   0.518       0.464   0.474   0.477
μ2,4 μ3,2     0.554   0.566   0.571       0.507   0.518   0.522
μ2,4 μ3,3     0.611   0.622   0.626       0.554   0.563   0.568
μ2,4 μ3,4     0.666   0.671   0.675       0.592   0.598   0.602

                 Covariance Σ3               Covariance Σ4
Means         Fisher  R_μ(0)  R_μ(1)      Fisher  R_μ(0)  R_μ(1)
μ2,1 μ3,1     0.345   0.364   0.375       0.337   0.346   0.352
μ2,1 μ3,2     0.406   0.428   0.438       0.383   0.396   0.400
μ2,1 μ3,3     0.480   0.494   0.500       0.422   0.434   0.440
μ2,1 μ3,4     0.534   0.544   0.547       0.432   0.442   0.445
μ2,2 μ3,1     0.406   0.428   0.437       0.383   0.396   0.400
μ2,2 μ3,2     0.475   0.496   0.505       0.445   0.457   0.460
μ2,2 μ3,3     0.543   0.560   0.566       0.478   0.490   0.493
μ2,2 μ3,4     0.594   0.607   0.611       0.484   0.492   0.494
μ2,3 μ3,1     0.480   0.494   0.500       0.424   0.436   0.440
μ2,3 μ3,2     0.544   0.560   0.567       0.478   0.489   0.493
μ2,3 μ3,3     0.611   0.624   0.630       0.521   0.531   0.536
μ2,3 μ3,4     0.663   0.672   0.676       0.525   0.533   0.537
μ2,4 μ3,1     0.534   0.544   0.547       0.432   0.441   0.444
μ2,4 μ3,2     0.595   0.608   0.613       0.483   0.492   0.495
μ2,4 μ3,3     0.664   0.672   0.676       0.524   0.532   0.536
μ2,4 μ3,4     0.716   0.720   0.723       0.535   0.541   0.545


$\hat{p}_{ii}$ is the estimated probability that an observation coming from population $G_i$ is correctly classified in that population, for $i = 1, 2, 3$. We performed all these simulations using R. It is worth commenting on the results appearing in Tables II and III. The first obvious conclusion is that the new rules perform better than the rule that does not take into account the additional information, for all the scenarios and order restrictions considered. Another interesting conclusion is the convenience of values of $\gamma$ strictly greater than 0: in all situations but one, the rule with $\gamma = 1$ outperforms the one with $\gamma = 0$ based on the restricted maximum likelihood estimator. The differences between the correct classification probabilities of the rules are not very big. The main reason is that the isotonized rules are equal to Fisher's rule when the training sample verifies the restrictions imposed by the additional information. The isotonized rules show their power when the training set does not verify all the restrictions, which is obviously when these rules should be used in statistical practice. In fact, the improvement increases with the number of restrictions that are not verified. For example, consider the scenario given by $C_{SO}$, covariance $\Sigma_1$, and means $\mu_{2,3}$ and $\mu_{3,3}$. Table IV details the results obtained for this case depending on the number of restrictions fulfilled. The improvements in the correct classification probabilities with respect to the unrestricted Fisher rule are bigger when several restrictions are not verified by the training set. The scenario appearing in our case study is one of these.


Table III. Correct classification probabilities under C_TO.

                 Covariance Σ1               Covariance Σ2
Means         Fisher  R_μ(0)  R_μ(1)      Fisher  R_μ(0)  R_μ(1)
μ2,1 μ3,1     0.342   0.356   0.361       0.339   0.349   0.354
μ2,1 μ3,2     0.456   0.466   0.466       0.437   0.448   0.448
μ2,1 μ3,3     0.556   0.563   0.564       0.525   0.531   0.531
μ2,1 μ3,4     0.616   0.623   0.626       0.577   0.582   0.582
μ2,2 μ3,1     0.370   0.388   0.394       0.366   0.380   0.385
μ2,2 μ3,2     0.442   0.457   0.460       0.425   0.437   0.439
μ2,2 μ3,3     0.562   0.573   0.577       0.527   0.536   0.538
μ2,2 μ3,4     0.638   0.650   0.654       0.591   0.600   0.601
μ2,3 μ3,1     0.425   0.440   0.444       0.406   0.420   0.423
μ2,3 μ3,2     0.471   0.483   0.485       0.455   0.467   0.471
μ2,3 μ3,3     0.560   0.567   0.570       0.517   0.525   0.528
μ2,3 μ3,4     0.664   0.672   0.675       0.605   0.612   0.615
μ2,4 μ3,1     0.482   0.493   0.495       0.442   0.452   0.452
μ2,4 μ3,2     0.518   0.525   0.525       0.492   0.500   0.501
μ2,4 μ3,3     0.583   0.586   0.589       0.551   0.556   0.559
μ2,4 μ3,4     0.665   0.668   0.670       0.591   0.594   0.597

                 Covariance Σ3               Covariance Σ4
Means         Fisher  R_μ(0)  R_μ(1)      Fisher  R_μ(0)  R_μ(1)
μ2,1 μ3,1     0.345   0.360   0.366       0.337   0.344   0.349
μ2,1 μ3,2     0.485   0.496   0.497       0.463   0.471   0.473
μ2,1 μ3,3     0.589   0.597   0.599       0.530   0.534   0.536
μ2,1 μ3,4     0.637   0.645   0.649       0.538   0.543   0.542
μ2,2 μ3,1     0.383   0.402   0.409       0.376   0.386   0.391
μ2,2 μ3,2     0.474   0.488   0.490       0.445   0.453   0.453
μ2,2 μ3,3     0.609   0.621   0.625       0.544   0.551   0.552
μ2,2 μ3,4     0.674   0.687   0.692       0.564   0.571   0.571
μ2,3 μ3,1     0.451   0.466   0.470       0.411   0.421   0.426
μ2,3 μ3,2     0.510   0.519   0.521       0.494   0.504   0.506
μ2,3 μ3,3     0.611   0.618   0.620       0.521   0.526   0.529
μ2,3 μ3,4     0.708   0.715   0.718       0.577   0.583   0.586
μ2,4 μ3,1     0.511   0.523   0.527       0.414   0.422   0.423
μ2,4 μ3,2     0.554   0.559   0.558       0.513   0.522   0.524
μ2,4 μ3,3     0.617   0.622   0.623       0.570   0.574   0.578
μ2,4 μ3,4     0.714   0.717   0.718       0.537   0.539   0.542

Table IV. Correct classification probabilities under C_SO depending on the number of restrictions verified by the training set, for covariance Σ1 and means μ2,3 and μ3,3.

Restrictions verified    Fisher    R_μ(0)    R_μ(1)
6                        0.583     0.583     0.583
5                        0.560     0.574     0.581
4                        0.535     0.565     0.580
3                        0.495     0.543     0.576
2                        0.431     0.512     0.573
1                        0.354     0.501     0.606


Next, we compare the results obtained using different cones. Notice that the pairs of means $(\mu_{2,i}, \mu_{3,i})$ for $i = 1, \dots, 4$ are the same for the two cones we are considering ($C_{SO}$ and $C_{TO}$). Figure 1 shows the results obtained for the first two pairs of means ($i = 1, 2$) for the unrestricted rule and for the new rules with $\gamma$ equal to 0 or 1. It is easy to check that, independently of the covariance matrix, better classification results are obtained if the information given by $C_{SO}$ is included instead of the one given by $C_{TO}$. Similar results are obtained for the other two pairs of means ($i = 3, 4$). Therefore, as $C_{SO} \subset C_{TO}$, we conclude that better classification results are obtained when the additional information available is more precise.


Figure 1. Correct classification probabilities for means μ2,1 μ3,1 (lower set of lines) and μ2,2 μ3,2 (upper set of lines).

4. Case study

We consider two data sets in this work. Both data sets were obtained using the xMAP® technology from Luminex for measuring the levels of the proteins. This technology is widely used in many medical and biological research areas such as gene expression or cancer markers; an example of its recent use in bladder cancer research is [15]. The first data set we received, D1, contained information on 141 patients and 11 proteins, together with the real stage of the illness each patient belonged to. The proteins were expected to be useful for correctly classifying the patients into one of the five levels we have defined. This is the initial data set and the one we use to build the rules; in the usual statistical terminology, this is the training set. The second data set, D2, is the test set, that is, the one we use for checking the goodness of the classification rule built with the first one. This data set was received at a later stage of the research and contains information on a different set of 149 patients for whom we have measures of the same 11 proteins and whose real illness stage we know. As stated in the Introduction, using the judgment provided by our pharmaceutical and industrial partners, we consider three proteins, P1, P2, and P3, for illustrating the methodology. However, we also analyzed the data using all 11 available proteins. As expected by our partners, the 11 proteins used together were not as informative for classifying subjects as the three specific proteins they suggested we focus on, and the estimated probability of misclassification increased significantly when all proteins were considered. The mean values in each of the groups and the pooled covariance matrix for P1, P2, and P3, obtained from data set D1, appear in Table V.

Table V. Mean for each group and pooled covariance matrix from D1.

Means:
Group   N    log(P1)   log(P2)   log(P3)
G1      41   2.935     3.879     1.416
G2      68   2.670     3.944     1.029
G3      32   3.245     4.348     1.578

Pooled covariance matrix:
S = ( 1.018  0.469  0.410
      0.469  0.985  0.284
      0.410  0.284  0.575 )


Table VI. Classification of test data using Fisher's rule.

            Predicted
Observed    G1    G2    G3
G1          12    66     7
G2           5    18     8
G3           2    11    20
Correct: 33.56%

Table VII. Classification results using several discrimination procedures.

Procedure     Training set (%)    Test set (%)
Fisher        46.10               33.56
SVM type 1    48.23               20.80
SVM type 2    45.39               26.85
1-NN          —                   28.85
2-NN          —                   29.53
NBC           48.22               21.47
ANN           51.77               32.88
Ordinal       43.97               35.57

The results of the application of Fisher's rule to our test data set, D2, appear in Table VI. We can check that the overall percentage of correct classification is 33.56%, which is quite a poor result. The overall percentage of correct classification in the training sample was 46.10%, showing that the bias in correct classification of this rule is quite high. In Table VII, we summarize the results obtained considering other, more 'state-of-the-art' methods of building classification rules, such as Support Vector Machines (SVM), K-Nearest Neighbors (K-NN), Naive Bayes Classifiers (NBC), and Artificial Neural Networks (ANN). Because this situation could also be viewed as an ordinal classification problem, we also include ordinal classification in the comparison list in order to contrast that focus with the one proposed here. There are many recent references where these techniques are used in cancer research [16–19]. For a type 1 SVM with capacity 10 and a radial basis function kernel ($\gamma = 0.333$), we obtained a classification accuracy of 48.23% for the training set and 20.80% for the test set. When we considered a type 2 SVM with $\nu = 0.1$ and a radial basis kernel with the same value of $\gamma$, the accuracy improved for the test set: we obtained 45.39% for the training set and 26.85% for the test set. Other methods such as K-NN, NBC, ANN, and ordinal classification did not reach 36% of observations correctly classified in the test data set. Therefore, we can conclude that the usual methods do not perform well in this case. The main reason is that, as can be observed in Figure 2, the training set does not verify the expected ordering between the levels of the illness and those of the proteins, while the test data set does follow that ordering. Consequently, rules built using D1 do not perform as expected. The usual solution would be to discard the training data set; that would be time-consuming and expensive.
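The overall percentage reported in Table VI can be recomputed directly from the confusion matrix:

```python
# Confusion matrix from Table VI (Fisher's rule on the test set D2);
# rows = observed group, columns = predicted group.
confusion = [
    [12, 66, 7],   # observed G1
    [5, 18, 8],    # observed G2
    [2, 11, 20],   # observed G3
]
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(3))
print(total, correct, round(100 * correct / total, 2))  # 149 patients, 50 correct -> 33.56
```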
Now, we show the results obtained applying the isotonized rules. In this particular case, there are several restrictions that are not fulfilled (check Table V), so an improvement when the new rules are used is expected. For this case, the new rules defined in the previous section are

$R_\mu(\gamma)$: classify $z$ in $G_i$ iff $i = \arg\min_j \left\{ (z - \hat{\mu}_{\gamma j})' S^{-1} (z - \hat{\mu}_{\gamma j}),\ j = 1, 2, 3 \right\}$,

where the values $\hat{\mu}_{\gamma j}$ for $j = 1, 2, 3$ and $\gamma = 0, 1$ appear in Table VIII. The classification matrices obtained when these rules are used for classifying the observations in the test set D2 appear in Table IX. In that table, we can check that the new rules yield very good results for our case study. In fact, using the new rules, we can see that the global probabilities of correct classification in the test set are 53.02%


Figure 2. Box plot of variables log(P1), log(P2), and log(P3) for each data set and illness category.

Table VIII. Restricted estimators of the means for γ = 0, 1.

Group   μ̂_0j                     μ̂_1j
G1      (2.762, 3.760, 1.175)    (2.589, 3.640, 0.933)
G2      (2.774, 4.016, 1.175)    (2.878, 4.088, 1.321)
G3      (3.245, 4.348, 1.578)    (3.245, 4.348, 1.578)

Table IX. Classification of test data using rules R_μ(γ).

               Predicted (γ = 0)        Predicted (γ = 1)
Observed      G1    G2    G3           G1    G2    G3
G1            42    34     9           67    11     7
G2             5    16    10           15     8     8
G3             1    11    21            5     7    21
Correct:         53.02%                   64.43%

for $\gamma = 0$ and 64.43% for $\gamma = 1$. This is certainly an improvement over the values under 36% obtained with the non-restricted classification procedures in Table VII.

5. Discussion


The case study and the simulation results in Sections 3 and 4 show that the new restricted rules defined in this paper outperform the unrestricted ones and that, as seen in Tables IV and IX, the improvement can be quite significant when several restrictions are not fulfilled by the training sample. These results are also very interesting from the industrial and pharmaceutical points of view because they confirm the results from previous works that suggested the orderings used to define the rules. Moreover, these new rules allow research to go on without dropping the initial sample, which saves time and money. The fact that there is no loss in classification performance when the training set verifies the restrictions allows us to recommend the use of the restricted rules proposed in this paper in general practice. In order to make them easy to use, an R library for restricted classification is being compiled. For the moment, R programs for performing classification under additional information can be obtained from the authors on request.

Statist. Med. 2012, 31 3773–3786


Table X. Error rates for the discrimination procedures.

          Actual error estimate (%)   Apparent error rate (%)
Fisher            66.44                      53.90
γ = 0             46.98                      58.15
γ = 1             35.57                      67.37

We have also shown that more precise additional information translates into more powerful rules. Therefore, we recommend that researchers incorporate as much information as the application at hand provides. This kind of information is often available in applications, for example, in medical diagnosis problems, where the predictors are usually isotonically related to the response. Another interesting point is the estimation of the actual probability of misclassification. In our case study, a test sample is available to evaluate the misclassification probability, so we have a good estimator of the actual error rate. However, in practice it is quite usual that there is no second sample with which to do this. The usual procedure is then to evaluate the rule on the same sample that was used to build it, obtaining the so-called apparent error rate. This procedure is obviously biased downward, and correction procedures have to be used in order to obtain a good estimator of the actual error rate. There is much literature on this subject; for example, the research of Jiang and Simon [20] contains a good study of the most usual estimators of the actual error rate. However, it is clear from Table X that, for our case study, the bias of the apparent error rate for the restricted rules does not behave like that of Fisher's rule. Therefore, new procedures specifically designed to correct the apparent error rate in restricted procedures have to be developed. These procedures will also help the user make the optimal choice of the parameter γ. Although we have seen in this work that γ = 1 is a good choice, the value of γ for which the estimator of the actual error rate is lowest will obviously be the natural one. This is part of our present research and will hopefully appear in future works.
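To make the two estimates concrete, the following Python sketch computes both the apparent (resubstitution) error rate and a test-sample estimate of the actual error rate for a Fisher-type rule. The populations, sample sizes and the diagonal-covariance simplification are our own illustrative choices, not those of the case study.

```python
import numpy as np

def fisher_rule(X1, X2):
    # Train a Fisher-type linear rule; a diagonal pooled covariance is used
    # here for simplicity (an assumption of this sketch, not of the paper).
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    centered = np.vstack([X1 - m1, X2 - m2])
    s = centered.var(axis=0, ddof=2)          # pooled diagonal variances
    w = (m1 - m2) / s                         # discriminant direction
    b = 0.5 * (m1 + m2) @ w                   # threshold at the midpoint
    return lambda U: np.where(U @ w - b >= 0, 1, 2)

def error_rate(rule, U, labels):
    return float(np.mean(rule(U) != labels))

rng = np.random.default_rng(1)
n, p = 20, 5
X1 = rng.normal(0.5, 1.0, (n, p))             # hypothetical population 1
X2 = rng.normal(-0.5, 1.0, (n, p))            # hypothetical population 2
rule = fisher_rule(X1, X2)

train = np.vstack([X1, X2])
train_lab = np.r_[np.ones(n), 2 * np.ones(n)]
apparent = error_rate(rule, train, train_lab)  # resubstitution (apparent) rate

test = np.vstack([rng.normal(0.5, 1.0, (5 * n, p)),
                  rng.normal(-0.5, 1.0, (5 * n, p))])
test_lab = np.r_[np.ones(5 * n), 2 * np.ones(5 * n)]
actual = error_rate(rule, test, test_lab)      # test-sample estimate of E_n
```

On average over many repetitions the apparent rate underestimates the test-sample rate for rules of this kind, which is the downward bias discussed above.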
Finally, we would like to mention that, as pointed out by one of the reviewers, the methodology developed in this work opens the possibility of modifying other classification procedures so that they can cope with the sort of additional information considered here.

Appendix: Proofs

In order to prove Theorem 3, we will first prove the following lemma. We denote as K and K̄ the cones K = {z ∈ R^p : aj′z ≥ 0, j = 1, …, r} and K̄ = {z ∈ R^{2p} : bj′z ≥ 0, j = 1, …, r} with bj = (aj′, −aj′)′. Let S be a p-dimensional positive definite matrix and denote as S̄ the matrix whose inverse is S̄^{-1} = diag(d1 S^{-1}, d2 S^{-1}), where d1, d2 > 0.

Lemma 4
If x, y ∈ R^p, d1 + d2 = 1 (a normalization that, as the proof shows, entails no loss of generality) and v = P_{S^{-1}}(x − y | K), then

P_{S̄^{-1}}((x′, y′)′ | K̄) = ((d1 x + d2 (y + v))′, (d1 (x − v) + d2 y)′)′.
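Before giving the formal proof, the lemma can be sanity-checked numerically in the simplest setting S = I, d1 = d2 = 1/2 and K the positive orthant (so K̄ = {(z1′, z2′)′ : z1 − z2 ≥ 0 componentwise}), where all three projections have closed forms. This check is our own illustration, not part of the paper.

```python
import numpy as np

def proj_orthant(z):
    # Euclidean projection onto K = {z >= 0} (positive orthant)
    return np.maximum(z, 0.0)

def proj_Kbar(x, y):
    # Euclidean projection of (x', y')' onto Kbar = {(z1', z2')': z1 - z2 >= 0}.
    # The constraint decouples coordinatewise: keep (x_k, y_k) when x_k >= y_k,
    # otherwise replace both coordinates by their average.
    w1, w2 = x.astype(float).copy(), y.astype(float).copy()
    bad = x < y
    avg = 0.5 * (x + y)
    w1[bad], w2[bad] = avg[bad], avg[bad]
    return w1, w2

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)

v = proj_orthant(x - y)              # v = P_{S^{-1}}(x - y | K) for S = I
w1, w2 = proj_Kbar(x, y)             # left-hand side of the lemma
lemma_w1 = 0.5 * x + 0.5 * (y + v)   # right-hand side, d1 = d2 = 1/2
lemma_w2 = 0.5 * (x - v) + 0.5 * y

assert np.allclose(w1, lemma_w1) and np.allclose(w2, lemma_w2)
assert np.allclose(v, w1 - w2)                      # Corollary 5 below
assert np.allclose(0.5 * (w1 + w2), 0.5 * (x + y))  # Corollary 5 below
```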

Copyright © 2012 John Wiley & Sons, Ltd.


Proof
First, let us check that we can assume the condition d1 + d2 = 1 without loss of generality. If it is not fulfilled, as in (3), we can define d1* = d1/(d1 + d2), d2* = d2/(d1 + d2) and S̄* with (S̄*)^{-1} = diag(d1* S^{-1}, d2* S^{-1}). The new values d1*, d2* verify the condition and, as these changes of scale in the projection matrix do not change the projection, we have that P_{S̄^{-1}}((x′, y′)′ | K̄) = P_{(S̄*)^{-1}}((x′, y′)′ | K̄). Therefore, in the rest of the proof, we assume d1 + d2 = 1.


Now, denote w1 = d1 x + d2 (y + v) and w2 = d1 (x − v) + d2 y. To prove the result, we have to check (cf. [12], Theorem 1.3.2) that

((x′, y′)′ − (w1′, w2′)′)′ S̄^{-1} (w1′, w2′)′ = 0,
((x′, y′)′ − (w1′, w2′)′)′ S̄^{-1} h ≤ 0, for all h ∈ K̄.

As v = P_{S^{-1}}(x − y | K), we know that

(x − y − v)′ S^{-1} v = 0,
(x − y − v)′ S^{-1} f ≤ 0, for all f ∈ K.

Now, since x − w1 = d2 (x − y − v) and y − w2 = −d1 (x − y − v),

((x′, y′)′ − (w1′, w2′)′)′ S̄^{-1} (w1′, w2′)′
= d1 d2 (x − y − v)′ S^{-1} (d1 x + d2 (y + v)) − d1 d2 (x − y − v)′ S^{-1} (d1 (x − v) + d2 y)
= d1 d2 (x − y − v)′ S^{-1} v = 0.

Let h = (h1′, h2′)′ ∈ K̄. Notice that h ∈ K̄ if and only if h1 − h2 ∈ K. Then,

((x′, y′)′ − (w1′, w2′)′)′ S̄^{-1} h = d1 d2 (x − y − v)′ S^{-1} (h1 − h2) ≤ 0,

and the proof is done.



Corollary 5
Under the conditions of Lemma 4 and assuming d1 + d2 = 1, v = w1 − w2 and d1 w1 + d2 w2 = d1 x + d2 y.

Now we can prove the pending result.

Proof of Theorem 3
After some easy calculations, we can write the rule R*(γ) for two populations as

classify z in G1 iff (z − (μ̂1* + μ̂2*)/2)′ S^{-1} (μ̂1* − μ̂2*) > 0,

so that, taking into account (2), to prove the result we have to check that

δ̂* = μ̂1* − μ̂2*   and   (μ̂1* + μ̂2*)/2 = c1 x̄1 + c2 x̄2 − c δ̂*.

These estimators are obtained as the limit of the iterative procedure appearing in Definition 1, so that

(μ̂1^(i)′, μ̂2^(i)′)′ = (1 + γ) (ω1^(i)′, ω2^(i)′)′ − γ (μ̂1^(i−1)′, μ̂2^(i−1)′)′, where

(ω1^(i)′, ω2^(i)′)′ = P_{S̄^{-1}}((μ̂1^(i−1)′, μ̂2^(i−1)′)′ | K̄) = ((c1 μ̂1^(i−1) + c2 (μ̂2^(i−1) + Δ^(i)))′, (c1 (μ̂1^(i−1) − Δ^(i)) + c2 μ̂2^(i−1))′)′,

with Δ^(i) = P_{S^{-1}}(μ̂1^(i−1) − μ̂2^(i−1) | K) and (μ̂1^(0)′, μ̂2^(0)′)′ = (x̄1′, x̄2′)′; the second equality for ω^(i) follows from Lemma 4 with d1 = c1 and d2 = c2.


It is easy to prove that, for any i ≥ 1,

c1 μ̂1^(i) + c2 μ̂2^(i) = (1 + γ)(c1 ω1^(i) + c2 ω2^(i)) − γ (c1 μ̂1^(i−1) + c2 μ̂2^(i−1))
= (1 + γ)(c1 μ̂1^(i−1) + c2 μ̂2^(i−1)) − γ (c1 μ̂1^(i−1) + c2 μ̂2^(i−1))
= c1 μ̂1^(i−1) + c2 μ̂2^(i−1) = … = c1 μ̂1^(0) + c2 μ̂2^(0) = c1 x̄1 + c2 x̄2.   (4)

Now we use an induction argument to obtain the proof. First, we assume m = 1. Then, from Corollary 5,

μ̂1^(1) − μ̂2^(1) = (1 + γ)(ω1^(1) − ω2^(1)) − γ (x̄1 − x̄2) = (1 + γ) Δ^(1) − γ (x̄1 − x̄2) = δ̂*^(1),

(μ̂1^(1) + μ̂2^(1))/2 = (1 + γ)(ω1^(1) + ω2^(1))/2 − γ (x̄1 + x̄2)/2
= (1 + γ)(c1 x̄1 + c2 x̄2 − c Δ^(1)) − γ (c1 x̄1 + c2 x̄2 − c (x̄1 − x̄2))
= c1 x̄1 + c2 x̄2 − c ((1 + γ) Δ^(1) − γ (x̄1 − x̄2)) = c1 x̄1 + c2 x̄2 − c δ̂*^(1).

Now we assume that the equalities hold for m = i − 1 and we prove them for m = i. Taking into account Corollary 5 and (4), we have

μ̂1^(i) − μ̂2^(i) = (1 + γ)(ω1^(i) − ω2^(i)) − γ (μ̂1^(i−1) − μ̂2^(i−1)) = (1 + γ) Δ^(i) − γ δ̂*^(i−1) = δ̂*^(i),

(μ̂1^(i) + μ̂2^(i))/2 = (1 + γ)(ω1^(i) + ω2^(i))/2 − γ (μ̂1^(i−1) + μ̂2^(i−1))/2
= (1 + γ)(c1 μ̂1^(i−1) + c2 μ̂2^(i−1) − c Δ^(i)) − γ (c1 x̄1 + c2 x̄2 − c δ̂*^(i−1))
= (1 + γ)(c1 x̄1 + c2 x̄2 − c Δ^(i)) − γ (c1 x̄1 + c2 x̄2 − c δ̂*^(i−1))
= c1 x̄1 + c2 x̄2 − c ((1 + γ) Δ^(i) − γ δ̂*^(i−1)) = c1 x̄1 + c2 x̄2 − c δ̂*^(i),

and the proof is finished.

Acknowledgements The authors would like to thank the editor and two anonymous reviewers for their detailed reading that resulted in this improved version of the paper. This research is partially supported by Spanish DGES grant MTM2009-11161.

References


1. UICC. TNM Classification of Malignant Tumours, 7th edition. Wiley-Blackwell: New Jersey, 2009.
2. Oosterlinck W, Bernard L, Jakse G, Malmström P, Stöckle M, Sternberg C. Guidelines on bladder cancer. European Urology 2002; 41:105–112.
3. Tosoni I, Wagner U, Sauter G, Egloff M, Knönagel H, Alund G, Bannwart F, Mihatsch MJ, Gasser TC, Maurer R. Clinical significance of interobserver differences in the staging and grading of superficial bladder cancer. British Journal of Urology International 2000; 85:48–53.
4. Lokeshwar VB, Young MJ, Gourdazi G, Iida N, Yudin AI, Cherr GN, Selzer MG. Identification of bladder tumor-derived hyaluronidase: its similarity to HYAL1. Cancer Research 1999; 59:4464–4470.
5. Long T, Gupta RD. Alternative linear classification rules under order restrictions. Communications in Statistics - Theory and Methods 1998; 27:559–575.
6. Fernández MA, Rueda C, Salvador B. Incorporating additional information to normal linear discriminant rules. Journal of the American Statistical Association 2006; 101:569–577.
7. Peddada SD, Lobenhofer E, Li LP, Afshari CA, Weinberg CR, Umbach D. Gene selection and clustering for time course and dose-response microarray experiments using order-restricted inference. Bioinformatics 2003; 19:834–841.
8. Peddada SD, Dinse G, Haseman J. A survival-adjusted quantal response test for comparing tumor incidence rates. Journal of the Royal Statistical Society, Series C 2005; 54:51–61.
9. Cai B, Dunson DB. Bayesian multivariate isotonic regression splines: applications to carcinogenicity studies. Journal of the American Statistical Association 2007; 102:1158–1171.
10. Simmons S, Peddada SD. Order-restricted inference for ordered gene expression (ORIOGEN) data under heteroscedastic variances. Bioinformation 2007; 1:414–419.
11. Ghosh D, Banerjee M, Biswas P. Inference for constrained estimation of tumor size distributions. Biometrics 2008; 64:1009–1017.
12. Robertson T, Wright FT, Dykstra RL. Order Restricted Statistical Inference. John Wiley & Sons: New York, 1988.
13. Silvapulle MJ, Sen PK. Constrained Statistical Inference. John Wiley & Sons: New Jersey, 2005.
14. Van Eeden C. Restricted Parameter Space Estimation Problems, Lecture Notes in Statistics 188. Springer: New York, 2006.
15. Wang M, Wang M, Zhang W, Yuan L, Fu G, Wei Q, Zhang Z. Common genetic variants on 8q24 contribute to susceptibility to bladder cancer in a Chinese population. Carcinogenesis 2009; 30:991–996.
16. Rapaport F, Barillot E, Vert J-P. Classification of arrayCGH data using fused SVM. Bioinformatics 2008; 24:375–382.
17. Dulewicz A, Jaszczak P, Pietka BD. Pattern recognition techniques in recognition of neoplastic changes in images of cell nuclei. In IFMBE Proceedings, Vol. 25/5: Information and Communication in Medicine, Telemedicine and e-Health, Dössel O, Schlegel WC (eds). Springer: New York, 2010; 105–108.
18. Bengtsson S, Krogh M, Al-Khalili C, Uhlen M, Schedvins K, Silfverswärd C, Linder S, Auer G, Alaiya A, James P. Large-scale proteomics analysis of human ovarian cancer for biomarkers. Journal of Proteome Research 2007; 6:1440–1450.
19. Margulis V, Lotan Y, Montorsi F, Shariat SF. Predicting survival after radical cystectomy for bladder cancer. British Journal of Urology International 2008; 102:15–22.
20. Jiang W, Simon R. A comparison of bootstrap methods and an adjusted bootstrap approach for estimating the prediction error in microarray classification. Statistics in Medicine 2007; 26:5320–5334.


DOI 10.1515/sagmb-2012-0037. Statistical Applications in Genetics and Molecular Biology 2013; 12(5): 583–602

David Conde, Bonifacio Salvador, Cristina Rueda and Miguel A. Fernández*

Performance and estimation of the true error rate of classification rules built with additional information. An application to a cancer trial

Abstract: Classification rules that incorporate additional information usually present in discrimination problems have received increasing attention in recent years, as they perform better than the usual rules. Fernández, M. A., C. Rueda and B. Salvador (2006): "Incorporating additional information to normal linear discriminant rules," J. Am. Stat. Assoc., 101, 569–577, proved that these rules have lower total misclassification probability than the usual Fisher's rule. In this paper we consider two issues. On the one hand, we compare these rules with those based on shrinkage estimators of the mean proposed by Tong, T., L. Chen and H. Zhao (2012): "Improved mean estimation and its application to diagonal discriminant analysis," Bioinformatics, 28(4), 531–537, with regard to four criteria: total misclassification probability, area under the ROC curve, well-calibratedness and refinement. On the other hand, we consider the estimation of the true error rate, which is a very interesting parameter in applications. We prove results on the apparent error rate of the rules that expose the need for new estimators of their true error rate. We propose four such new estimators. Two of them are defined by incorporating the additional information into the leave-one-out bootstrap. The other two are the corresponding cross-validation after bootstrap versions. We compare these estimators with the usual ones in a simulation study and in a cancer trial application, showing the good behavior of the rules that incorporate additional information and of the new leave-one-out bootstrap estimators of their true error rate.

Keywords: area under ROC curve; bootstrap; cancer diagnostic test research; discriminant analysis; order restrictions; true error rate.

*Corresponding author: Miguel A. Fernández, Departamento de Estadística e I.O., Universidad de Valladolid, 47011 Valladolid, Spain, e-mail: [email protected]
David Conde, Bonifacio Salvador and Cristina Rueda: Departamento de Estadística e I.O., Universidad de Valladolid, 47011 Valladolid, Spain

1 Introduction

Consider the classical discrimination problem with two populations Π1 and Π2. Denote the training sample from which the rule is built as Mn = {(Xi, Yi), i = 1, …, n}, where X is the p-dimensional vector of classifiers and Y = 1, 2 is the binary variable identifying the population. Denote also as PXY the joint distribution of the vector (X, Y). With this scheme, a classification rule is a map Rn: {R^p × {1, 2}}^n × R^p → {1, 2} that classifies a new observation u ∈ R^p into one of the two available populations: Rn(Mn, u) ∈ {1, 2}. In applications it is usual that some additional information is available. Recent papers considering additional information issues are, for example, Lin et al. (2007), Simmons and Peddada (2007), Beran and Dümbgen (2010) and Oh and Shin (2011). It is frequent that this information tells us that the observations from one of the populations, for example Π1, take higher (or lower) values than those coming from the other, i.e., Π2. The incorporation of this kind of information into the classification rule has been shown to improve its performance. To the best of our knowledge, the first paper in this line was Long and Gupta (1998). More recently, Fernández et al. (2006) generalized and improved the results in that paper and proposed rules that take this additional information into account and have lower total misclassification probabilities (TMP) than the classical rules that do not consider it. A good example of this situation appears in Section 5, where bladder cancer patients are known to usually take higher values in some variables (and lower


in others) than healthy people. This information is then used to build a classification rule that outperforms Fisher's rule and that, as studied in Salvador et al. (2008), has good robustness properties with respect to its theoretical assumptions. From now on, we will refer to these rules as restricted rules, as they come from order restriction information on the populations. There are more rules proposed in the literature that may be compared with those in Fernández et al. (2006). In this paper we compare the latter with those based on shrinkage estimators of the mean proposed by Tong et al. (2012). The comparison is motivated by the fact that the restricted rules can also be viewed as "shrinkage" rules, since they are based on projections and projection is a contractive operator. In this paper we will not only compare the behavior of the discrimination procedures through the TMP (which was the only criterion considered in Fernández et al.). It is well known that there are other ways of comparing the behavior of the procedures. For example, in medical classification problems, probabilistic classifiers are more frequently used than classification rules, since they allow making complex clinical decisions. For probabilistic classifiers, it is frequent to use other measures such as the area under the ROC (receiver operating characteristic) curve, usually known as AUC (see, for example, Faraggi and Reiser, 2002; Pepe et al., 2004, and references therein), or well-calibratedness and refinement, introduced and developed by Kim and Simon (2011). All these measures will be considered in the comparison of the procedures. Another key issue for a discrimination procedure is the evaluation of its performance for a given training sample.
When TMP is considered, this is usually done by estimating the true error rate En of the rule Rn, which is the misclassification probability of the rule conditional on the available training sample. Namely, En = Error(Rn) = PXY(Rn(Mn, X) ≠ Y | Mn). The evaluation of the behavior of the rule using TMP, which is the expected, or unconditional, true error rate E(En), allows the study of global properties of the rule but not the evaluation of En for a given sample Mn. The best way of estimating En for a classification rule is to use an independent sample, usually called a test sample. However, in practice it is common that the sample size is not large enough to split it into a training and a test sample, as that would decrease the efficiency of the rule. For this reason, the estimation of En for the usual rules, such as Fisher's linear rule, the quadratic discriminant rule, nearest neighbors rules or random forest rules, is a widely studied topic in the literature. Parametric and non-parametric estimators of En have been proposed, and non-parametric estimators based on resampling have shown good performance for the above-mentioned rules. Schiavo and Hand (2000) summarize the work done on this topic up to that date. More recent references are, for example, Steele and Patterson (2000), Wehberg and Schumacher (2004), Fu et al. (2005), Molinaro et al. (2005), Kim and Cha (2006) and Kim (2009). In this paper, we check that the usual estimators are not expected to work well for the restricted rules and tackle the problem of proper estimation of En for the restricted rules, defining new estimators that will also be useful when other performance measures, such as AUC, are considered. The layout of the paper is as follows. In Section 2 we start by reviewing the restricted rules defined in Fernández et al. (2006) and those in Tong et al. (2012). In Section 3 we consider different performance measures and compare the behavior of the rules described in Section 2 with respect to those measures.
Section 4 is devoted to the practical estimation of En. A real data case dealing with bladder cancer is presented in Section 5. Finally, in Section 6 we discuss the results and summarize the conclusions.

2 Classification rules

In this section we present the rules that will be compared throughout the paper. The rules that incorporate additional information defined in Fernández et al. (2006) are summarized in Subsection 2.1 and related to Fisher's discriminant rule and to those based on shrinkage defined in Tong et al. (2012), which are described in Subsection 2.2. From now on, we assume two p-dimensional normal populations Π1 and Π2 with means μ1 and μ2, respectively, and common covariance matrix Σ. Using the notation given in the Introduction, we have that



X | Y = j ~ Np(μj, Σ), j = 1, 2. Let us denote as X̄j the sample mean vector of the observations coming from population j, i.e., X̄j = (Σ_{i=1}^n Xi I(Yi = j)) / (Σ_{i=1}^n I(Yi = j)), for j = 1, 2, and as S the pooled sample covariance matrix. If we denote as u ∈ R^p a new observation to be classified, the optimal (theoretical) Bayes rule is:

Classify u in Π1 iff (u − (μ1 + μ2)/2)′ Σ^{-1} δ ≥ log(π2/π1),   (1)

where δ = μ1 − μ2. Let us further assume equal a priori probabilities πj, j = 1, 2, so that, from now on, log(π2/π1) = 0. The usual linear classification rule, also known as Fisher's discriminant rule, is obtained by replacing the unknown parameters μ1, μ2 and Σ by their estimators X̄1, X̄2 and S:

Classify u in Π1 iff (u − (X̄1 + X̄2)/2)′ S^{-1} δ̂ ≥ 0,

where δ̂ = X̄1 − X̄2. As already mentioned, we will also assume that there is some additional information available on the problem being considered. For example, in medical contexts it is usual that patients take higher values in some variables (and lower in others) than healthy people. Mathematically, this information on the mean vectors can be incorporated using cones that restrict the parameter space. Namely, we assume that the difference between the mean vectors δ belongs to C, where C is a closed, convex, polyhedral cone in R^p, C = {z ∈ R^p : aj′z ≥ 0, j = 1, …, m}. For the two-populations case, one of the most interesting cones is the positive orthant cone O+ = {x ∈ R^p : xi ≥ 0, i ∈ T}, where T is the subset of predictors whose means are known to be higher in the first of the two populations. Generally speaking, polyhedral cones are widely used in restricted inference because they cover the most interesting cases from a practical standpoint. Among these cones are the simple order cone C_SO = {z ∈ R^q : z1 ≤ … ≤ zq} (where q is the number of illness levels, with 1 denoting absence of illness or the control level), used when there is an increasing relation between the level of the illness and the predictors; the tree order cone C_TO = {z ∈ R^q : z1 ≤ zi, i = 2, …, q}, used when it is known that the predictors increase when the illness is present but it is not sure that they increase with the severity of the illness; and the already mentioned positive orthant cone. For more details on cones of restrictions and their geometry, the reader may consult Robertson et al. (1988) or Silvapulle and Sen (2005).
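As an aside, the least-squares projection onto the simple order cone C_SO is exactly isotonic regression and can be computed with the pool-adjacent-violators algorithm. The following Python sketch is our own illustration (see Robertson et al., 1988, for the theory); it covers the unweighted and weighted cases.

```python
import numpy as np

def project_simple_order(z, w=None):
    # Weighted least-squares projection of z onto the simple order cone
    # C_SO = {z1 <= ... <= zq}, via pool-adjacent-violators (PAVA).
    z = np.asarray(z, dtype=float)
    w = np.ones_like(z) if w is None else np.asarray(w, dtype=float)
    means, weights, sizes = [], [], []
    for zi, wi in zip(z, w):
        means.append(zi); weights.append(wi); sizes.append(1)
        # pool adjacent blocks while the order constraint is violated
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2, s2 = means.pop(), weights.pop(), sizes.pop()
            m1, w1, s1 = means.pop(), weights.pop(), sizes.pop()
            wt = w1 + w2
            means.append((w1 * m1 + w2 * m2) / wt)
            weights.append(wt); sizes.append(s1 + s2)
    return np.concatenate([np.full(s, m) for m, s in zip(means, sizes)])

print(project_simple_order([3, 1, 2]))  # → [2. 2. 2.]
```

For example, projecting (3, 1, 2) pools the violating pair (3, 1) into their mean 2 and then pools again with the last coordinate, giving (2, 2, 2).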

2.1 Classification rules with additional information

In order to obtain a classification rule that incorporates the additional information available for the problem, Fernández et al. (2006) start by rewriting rule (1) as

Classify u in Π1 iff (u − (c1 μ1 + c2 μ2) + c δ)′ Σ^{-1} δ ≥ 0,

where ci = ni/(n1 + n2), i = 1, 2, and c = (c1 − c2)/2. The new classification rule is then obtained replacing Σ by S, c1 μ1 + c2 μ2 by c1 X̄1 + c2 X̄2, and the restricted parameter δ by an estimator that incorporates the additional information. To be more precise, δ is replaced by a member of the family δ*γ, with γ ∈ [0, 1], defined as the limit of the following iterative procedure, which Fernández et al. show to be convergent. Let δ̂γ^(0) = X̄1 − X̄2, and δ̂γ^(i) = p_{S^{-1}}(δ̂γ^(i−1) | C) − γ p_{S^{-1}}(δ̂γ^(i−1) | C^P) for i = 1, 2, …, where



p_{S^{-1}}(Z | C) is the projection of Z onto the cone C with the metric given by S^{-1}, and C^P = {z ∈ R^p : z′x ≤ 0, x ∈ C} is the so-called polar cone of C. In this way the following family of new classification rules Rn(γ), with γ ∈ [0, 1], is obtained:

Classify u in Π1 iff (u − (c1 X̄1 + c2 X̄2) + c δ*γ)′ S^{-1} δ*γ ≥ 0.   (2)

If we denote as μ*γ1 = c1 X̄1 + c2 (X̄2 + δ*γ) and μ*γ2 = c1 (X̄1 − δ*γ) + c2 X̄2, then rule (2) can also be obtained replacing μ1, μ2 and Σ in (1) by their estimators μ*γ1, μ*γ2 and S. The computation of the projection of a vector onto a cone can be carried out using the "lsConstrain.fit" method contained in the "ibdreg" R package. Moreover, the R programs with which the results presented throughout the paper have been calculated will be included in an R library that is being compiled. For the moment, these programs, which include the computation of the estimators, can be obtained from the authors on request. For more details on these rules and their properties the reader is referred to Fernández et al. (2006). These rules can also be extended to deal with more than two populations. Full details on how this is done can be found in Conde et al. (2012).
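To fix ideas, in the special case C = O+ with a diagonal metric both projections in the iteration have closed coordinatewise forms: the projection onto O+ is clipping at zero, and for this cone/metric pair p_{S^{-1}}(z | C^P) = z − p_{S^{-1}}(z | C). The following Python sketch of the iteration is our own illustration of this special case, not the "lsConstrain.fit"-based implementation mentioned above.

```python
import numpy as np

def proj_orthant(z):
    # Projection onto C = O+; with a diagonal metric this is still
    # coordinatewise clipping at zero.
    return np.maximum(z, 0.0)

def delta_star(x1bar, x2bar, gamma, n_iter=50):
    # Iterative restricted estimator of delta = mu1 - mu2 for C = O+:
    #   delta^(i) = p(delta^(i-1) | C) - gamma * p(delta^(i-1) | C^P),
    # using p(z | C^P) = z - p(z | C), valid for this cone/metric pair.
    d = np.asarray(x1bar, dtype=float) - np.asarray(x2bar, dtype=float)
    for _ in range(n_iter):
        pc = proj_orthant(d)
        d = pc - gamma * (d - pc)
    return d

x1bar = np.array([1.0, 0.3, -0.4])   # hypothetical sample means
x2bar = np.array([0.2, 0.5, 0.1])
d0 = delta_star(x1bar, x2bar, gamma=0.0)  # negative components clipped to 0
d1 = delta_star(x1bar, x2bar, gamma=1.0)  # negative components reflected
```

With γ = 0 the components of X̄1 − X̄2 that violate the restriction are set to zero, while with γ = 1 their signs are flipped; the iteration reaches its fixed point after the first step in this one-cone example.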

2.2 Rules based on shrinkage estimators

Since projection is a contractive operator, the estimators μ*γ1 and μ*γ2 proposed in the previous subsection may be seen as "shrinkage estimators" of the mean. Other shrinkage estimators have been proposed for discrimination rules. Tong et al. (2012) describe an analytical James-Stein type shrinkage estimator of the mean under the quadratic loss function. The estimator is proposed for fixed sample sizes nj, j = 1, 2, for each population, and large dimension p. They construct a shrinkage-based diagonal discriminant rule replacing the population means by the proposed shrinkage means and considering a diagonal covariance matrix. The reason for considering a diagonal covariance matrix is that, if p is greater than nj, j = 1, 2, singularity problems appear. To overcome these problems, Dudoit et al. (2002) introduced the diagonal linear discriminant analysis (DLDA), which, when the sample size is small, performs very well compared with more sophisticated classifiers in terms of accuracy and stability (Dettling, 2005; Lee et al., 2005). The shrinkage mean-based diagonal linear discriminant analysis (SmDLDA) proposed by Tong et al. is based on the following shrinkage estimators of μj, j = 1, 2:

μ̃j(r̂j^opt) = X̃j + (1 − r̂j^opt / ‖X̄j − X̃j‖²_{Sj}) (X̄j − X̃j),   r̂j^opt ≈ (nj − 1)(p − 2) / (nj (nj − 3)),

where r̂j^opt is the estimated optimal shrinkage parameter, Sj the diagonal of the sample covariance matrix, X̄j the sample mean vector, and X̃j the grand sample mean across all variables in population Πj for j = 1, 2:

X̃j = (X̄j·, …, X̄j·)′,   X̄j· = (1/p) Σ_{k=1}^p X̄jk.

Tong et al. show that SmDLDA outperforms DLDA in a wide range of situations when the TMP criterion is considered.
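In code, the SmDLDA shrinkage mean can be sketched as follows. This is our reading of the formulas above (the function and variable names are ours); Tong et al. (2012) remains the authoritative reference for the definitions.

```python
import numpy as np

def shrunken_mean(X):
    # James-Stein-type shrinkage of the sample mean toward the grand mean
    # across variables, following the SmDLDA construction sketched above.
    n, p = X.shape
    xbar = X.mean(axis=0)                      # sample mean vector
    grand = np.full(p, xbar.mean())            # grand mean across variables
    s = X.var(axis=0, ddof=1)                  # diagonal of sample covariance
    r_opt = (n - 1) * (p - 2) / (n * (n - 3))  # optimal shrinkage parameter
    dist2 = np.sum((xbar - grand) ** 2 / s)    # ||xbar - grand||^2 in S_j metric
    return grand + (1.0 - r_opt / dist2) * (xbar - grand)

X = np.random.default_rng(0).normal(size=(10, 6))  # one population's sample
mu_tilde = shrunken_mean(X)
```

The SmDLDA rule then plugs μ̃1, μ̃2 and diag(S) into the linear discriminant in place of X̄1, X̄2 and S.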



3 Performance measures

In this section we compare the performance of the Fisher, SmDLDA, Rn(0) and Rn(1) rules with regard to some of the most usual performance measures considered in practice, in scenarios similar to those in Tong et al. (2012) (with large p and not so large n), via a simulation study. As already mentioned, we consider two populations Π1 and Π2 with multivariate normal distributions Np(μ1, Σ) and Np(μ2, Σ). In the first scenario we assume an identity covariance matrix, i.e., Σ = Ip. We consider μ1 = (0, …, 0, μ_{1,d+1}, …, μ_{1,p}) and μ2 = −μ1, where (μ_{1,d+1}, …, μ_{1,p}) is a random sample of size p − d from the uniform distribution U(0, 0.5). With these mean vectors, it is clear that we must focus on the positive orthant restrictions case, that is, δ = μ1 − μ2 ∈ O+ = {x ∈ R^p : xi ≥ 0, i = 1, …, p}. We consider two values of p (50 and 200) and two values of n = n1 = n2 (10 and 20). For each p, we consider six different values of d: 0, 0.1p, …, 0.5p. For each simulation, we generate a training set of size n and a test set of size 5n for each population Πj, j = 1, 2. We repeat the procedure 1000 times. To overcome the singularity of the pooled sample covariance matrix, all rules use as estimated covariance matrix the diagonal of the pooled sample covariance matrix, diag(S). The results for p = 50 are similar to those for p = 200, so we only include the results for p = 50.

The second scenario considers the case where the observations are correlated. Let the true p × p covariance matrix have the block diagonal structure

Σ = diag(Σρ, Σ−ρ, Σρ, Σ−ρ, …),

where Σρ is the 10 × 10 matrix with the auto-regressive structure (Σρ)_{kl} = ρ^{|k−l|}, that is, with ones on the diagonal, ρ on the first off-diagonals, and so on up to ρ^9 in the corners.

We use different values of the correlation coefficient: ρ = 0, 0.2, 0.4, 0.6, 0.8. Except for d = 0.1p, all other settings remain the same.

3.1 TMP

The total misclassification probability (TMP), also known as the misclassification rate or prediction error, is the probability that a new observation is misclassified. To approximate this value we use the proportion of misclassified observations over the total number of observations in the test sets. The TMP values obtained for the Fisher, SmDLDA, Rn(0) and Rn(1) rules are summarized in Figure 1. As we can see in the figure, Rn(0) takes the lowest TMP values for high values of d, followed by Rn(1), while for low values of d SmDLDA is the one that yields the lowest TMP values; these values are not far from those of Rn(0) and Rn(1), which take very similar TMP values. For the second scenario, we can see that, except for ρ = 0, Rn(0) takes the lowest TMP values, followed very closely by Rn(1).


[Figure 1 here: TMP plotted against d (Σ = I) and against ρ (auto-regressive Σ), with panels for n = 10 and n = 20 and curves for Fisher, Rn(0), Rn(1) and SmDLDA.]

Figure 1 TMP values for Σ = I (two left graphs) and auto-regressive Σ (two right graphs).

3.2 Other performance measures

As already commented in the Introduction, probabilistic classifiers that provide an estimate of the probability of membership in each class for new cases are often more useful than classification rules that just assign cases to a class. For this reason, in the construction of medical probabilistic classifiers, performance measures other than the misclassification rate are used. In fact, one-dimensional summary measures such as the misclassification rate are rarely used in practice (Pepe et al., 2004). Specifically, the most commonly used global index of diagnostic accuracy is the area under the ROC curve (AUC). Additionally, other measures such as well-calibratedness and refinement, introduced and developed by Kim and Simon (2011), can be used. In order to assess performance measures such as AUC, well-calibratedness and refinement, we need to transform the classification rules (Fisher, SmDLDA and Rn(γ), γ = 0, 1) into probabilistic classifiers. Let f1 and f2 be the normal probability density functions of populations Π1 and Π2. The optimal (theoretical) Bayes rule (1) is equivalent to

Classify u in Π1 iff p(u; μ1, μ2, Σ) = π1 f1(u; μ1, Σ) / (π1 f1(u; μ1, Σ) + π2 f2(u; μ2, Σ)) ≥ 1/2,

where p(u; μ1, μ2, Σ) is the predictive probability of class 1 for a new case u. The function p(·; μ1, μ2, Σ): R^p → [0, 1] is the corresponding probabilistic classifier. As we did in the previous section, in the rest of the paper we will continue assuming that π1 = π2 = 1/2. If we replace in p(u; μ1, μ2, Σ) the unknown parameters μ1, μ2 and Σ by each of the estimators considered in Section 2, we obtain different predictive probabilities of class 1 for a new case u and, therefore, different probabilistic classifiers. If the estimators are μ̃1(r̂1^opt), μ̃2(r̂2^opt) and diag(S), we obtain the SmDLDA probabilistic classifier, which we will denote as p*_SmDLDA. If the estimators are X̄1, X̄2 and S, we obtain the Fisher probabilistic classifier, which we will denote as p*_Fisher. Finally, if the estimators are μ*γ1, μ*γ2 and S, we obtain the restricted Rn(γ) probabilistic classifiers, which we will denote as p*_{Rn(γ)}. For the simulations, as the dimension p is too high and it is not possible to estimate all the values in S, we replace S by diag(S).
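As a consistency check, for a common (here diagonal) covariance matrix the predictive probability of class 1 is the logistic transform of the linear discriminant score. The following Python sketch (our own illustration, with hypothetical parameter values) verifies this equivalence numerically.

```python
import numpy as np

def log_normal_pdf(u, mu, var):
    # log density of N(mu, diag(var)) evaluated at u
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (u - mu) ** 2 / var)

def predictive_prob(u, mu1, mu2, var):
    # p(u; mu1, mu2, Sigma) with Sigma = diag(var) and pi1 = pi2 = 1/2
    l1 = log_normal_pdf(u, mu1, var)
    l2 = log_normal_pdf(u, mu2, var)
    return 1.0 / (1.0 + np.exp(l2 - l1))

mu1, mu2 = np.array([1.0, 0.5]), np.array([-1.0, 0.0])
var = np.array([1.0, 2.0])
u = np.array([0.3, 0.4])
p = predictive_prob(u, mu1, mu2, var)

# Same value from the linear discriminant score of rule (1):
score = (u - 0.5 * (mu1 + mu2)) @ ((mu1 - mu2) / var)
assert np.isclose(p, 1.0 / (1.0 + np.exp(-score)))
```

The rule "classify u in Π1 iff p(u) ≥ 1/2" then coincides with "score ≥ 0", i.e., with rule (1) for equal priors.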

3.2.1 AUC

The AUC is very frequently used in medical contexts as a measure of the effectiveness of diagnostic markers. In these contexts, a patient is assessed as positive or negative depending on whether the corresponding probabilistic classifier value p̂(u) is greater than, or less than or equal to, a given threshold value. For each threshold value, there is a corresponding probability of a true positive (sensitivity) and a corresponding probability of a true negative (specificity). The ROC curve is a plot of sensitivity versus 1 − specificity for all threshold values.



The AUC of the empirical ROC curve for a probabilistic classifier p̂ is the Mann-Whitney U statistic (cf. Pepe et al., 2006):

AÛC = (1/(n1 n2)) Σ_{i=1}^{n1} Σ_{j=1}^{n2} [ I(p̂(ui) > p̂(uj)) + (1/2) I(p̂(ui) = p̂(uj)) ],

where the ui are the class-1 cases and the uj the class-2 cases. The results of the simulations conducted to compare p*_Fisher, p*_SmDLDA, p*_{Rn(0)} and p*_{Rn(1)} with regard to AUC appear in Figure 2. From this figure, it is clear that p*_{Rn(0)} takes the highest mean AUC values with the lowest standard deviations for high values of d, followed by p*_{Rn(1)}, while p*_SmDLDA yields the highest mean AUC values with the lowest standard deviations for low values of d. For the second scenario, p*_{Rn(0)} and p*_{Rn(1)} take the highest mean AUC values (p*_{Rn(1)} for n = 10, p*_{Rn(0)} for n = 20) with the lowest standard deviations for ρ > 0, while p*_SmDLDA yields the highest mean AUC values with the lowest standard deviations for ρ = 0.

3.2.2 Well-calibratedness and refinement

A probabilistic classifier p is well calibrated if, for any predictive probability w, 0 < w < 1, P(C = 1 | p(u) = w) = w. That is, among the cases for which the probabilistic classifier predicts probability w, the proportion actually belonging to class 1 is w. A probabilistic classifier is refined if the predictive probabilities w tend to be close to 0 or 1. Refinement is defined as the expected value of P(C = 1 | p(u) = w)(1 − P(C = 1 | p(u) = w)) with respect to the predictive probability w.


Figure 2 AUC values for Σ = I (four left graphs) and autoregressive Σ (four right graphs).


To assess the calibration and refinement, Kim and Simon define two measures: the calibration score (CS) and the refinement score (RS). First, let us partition the unit interval into m equal subintervals or bins Bk = [(k−1)/m, k/m], k = 1, …, m. For each bin Bk, let qk be the proportion of predictions w that fall into Bk, rk the relative frequency of predictions in Bk for class 1, and uk the center point of Bk. Then

CS = Σ_{k=1}^{m} (rk − uk)² qk,   RS = Σ_{k=1}^{m} rk (1 − rk) qk.

The results of the simulations conducted to compare p*Fisher, p*SmDLDA, p*Rn(0) and p*Rn(1) with regard to well-calibratedness and refinement appear in Figure 3. From this figure, it is clear that, again, p*Rn(0) and p*Rn(1) take very close CS and RS values. We can also see that p*Rn(0) is the one with the lowest calibration scores for high values of ρ (ρ = 0.6, 0.8), while p*SmDLDA is the one with the lowest calibration scores for all values of d and for low values of ρ. With regard to refinement, it can be noticed that p*Rn(0) has the lowest refinement scores for high values of d and ρ, while p*SmDLDA takes the lowest refinement scores for low values of d and ρ. As a final conclusion for this Section, we can say that, even in high dimensional data with small sample size problems, Rn(0) and Rn(1) compete well with SmDLDA, outperforming it in a number of configurations. We also think that the fact that the estimators used in the restricted rules are motivated by the additional information available on the problem can be seen as a conceptual advantage over the "blind" shrinkage estimators considered by SmDLDA.
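The binned calibration and refinement scores can be sketched as follows (the convention for predictions falling exactly on a bin boundary is our assumption; the paper does not specify a tie-breaking rule):

```python
import numpy as np

def calibration_refinement_scores(w, y, m=10):
    """Calibration score CS = sum_k (r_k - u_k)^2 q_k and refinement score
    RS = sum_k r_k (1 - r_k) q_k over m equal bins B_k of [0, 1]."""
    w = np.asarray(w, dtype=float)   # predictive probabilities of class 1
    y = np.asarray(y)                # true labels, 1 = class 1
    edges = np.linspace(0.0, 1.0, m + 1)
    # bin index 0..m-1; clip so that w == 1 falls into the last bin
    k = np.clip(np.searchsorted(edges, w, side="right") - 1, 0, m - 1)
    cs = rs = 0.0
    for j in range(m):
        mask = (k == j)
        if not mask.any():
            continue
        q = mask.mean()                    # proportion of predictions in B_j
        r = (y[mask] == 1).mean()          # class-1 frequency in B_j
        u = (edges[j] + edges[j + 1]) / 2  # center point of B_j
        cs += (r - u) ** 2 * q
        rs += r * (1 - r) * q
    return cs, rs
```

Empty bins contribute nothing, since their qk is zero.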

4 Practical estimation of true error rate

In the previous Section we considered the evaluation of the global (unconditional) performance of the discrimination procedures. As mentioned in the Introduction, in practice it is necessary to evaluate the behavior of the discrimination procedure for a given training data set. The best way of performing this conditional


Figure 3 CS and RS values for Σ = I (four left graphs) and autoregressive Σ (four right graphs).


evaluation would be to have an independent test data set. However, in practice it is not common to have this second data set, so estimators of performance measures not based on the existence of the test set have been developed. Many of these procedures are, in one way or another, based on the resubstitution error, also called the apparent error rate. The resubstitution estimator or apparent error rate, APP, estimates the true error rate En as the proportion of observations in the training sample that are wrongly classified by the rule. It is well known (see, for example, McLachlan, 1976, or Efron, 1983) that APP is a biased estimator that underestimates the true error rate because the training sample data are used twice, both to build the rule and to check its accuracy. We start this Section by describing, in Subsection 4.1, the most usual procedures designed to correct the bias of APP. Then, in Subsection 4.2, we prove a property of the APP of the restricted rules showing that the APP of these rules does not behave in the same way as that of Fisher's rule. For this reason, the usual correction procedures cannot be expected to perform well. Consequently, in Subsection 4.3 we give new proposals for the estimation of the true error rate for the restricted rules. Finally, in Subsection 4.4 we show how these estimators perform via a simulation study. Notice that these estimators can also be useful for other performance measures such as AUC, since the estimation of these measures can be based on the computation of misclassification probabilities (1 − specificity and 1 − sensitivity).

4.1 Usual resampling based estimators

There are many nonparametric estimators for the true error rate of a classification rule based on resampling techniques. In this subsection we describe the most usual ones. The cross-validation, or leave-one-out, method was proposed in Lachenbruch and Mickey (1968). With this method one of the observations in the training sample is left out, then the classification rule is computed and the excluded observation is classified. This is repeated with each of the observations in the training sample. Then the cross-validation error, CV, is just the proportion of observations misclassified using this procedure. It is well known that this estimator has lower bias than APP. Efron shows that CV has a low bias but a not so low variability and proposes estimators based on the bootstrap methodology. Let us denote as Mn = {(Xi, Yi), i = 1, …, n} the original training sample. A bootstrap training sample Mn* = {(Xi*, Yi*), i = 1, …, n} is a size-n sample randomly obtained (with replacement) from the original training sample (i.e., P((Xi*, Yi*) = (Xs, Ys)) = 1/n for s, i ∈ {1, …, n}). The probability that an observation from the original training sample is not included in the bootstrap training sample depends on n and is approximately 0.368. The bootstrap version of the classification rule is the rule based on the bootstrap training sample. From this methodology Efron proposes several ways of estimating the classification error. We consider two of them: the leave-one-out bootstrap (LOOBT) and the bootstrap 632 (BT632). For the LOOBT estimator, B bootstrap training samples are considered and B bootstrap versions of the classification rule are obtained. Then each of these rules is used for classifying the original observations not belonging to the corresponding bootstrap training sample. Finally, LOOBT is the proportion of observations not correctly classified using this procedure.
Efron notices that LOOBT tends to overestimate the true error rate and then proposes BT632 = 0.368·APP + 0.632·LOOBT. In certain cases the value of APP is close to 0 (overfitting), so that BT632 is close to 0.632·LOOBT and the true error is underestimated. For these situations with high overfitting, Efron and Tibshirani (1997) propose the bootstrap 632+, defined as BT632+ = (1 − α)·APP + α·LOOBT, with α > 0.632. In Section 4.2 we will see that APP for the rules Rn(γ), γ ∈ [0, 1], is higher than APP for Fisher's rule. Consequently, the overfitting problem for these rules is less important and we do not consider BT632+ in our study. More recently, Fu et al. propose a method based on cross-validation after bootstrap (BCV) that has a lower relative mean squared error than LOOBT and BT632 for small training samples. In this procedure B bootstrap samples Mb*, b = 1, …, B, are obtained from Mn. Let CVb be the true error rate estimator obtained using the cross-validation method on sample Mb*. The final true error rate estimator is now BCV = B⁻¹ Σ_{b=1}^{B} CVb.
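The LOOBT and BT632 estimators described above can be sketched generically for any rule-building procedure; the `fit` interface below is ours, for illustration:

```python
import numpy as np

def loobt(X, y, fit, B=100, rng=None):
    """Leave-one-out bootstrap estimate of the true error rate.
    `fit(Xb, yb)` must return a classifier, i.e. a callable mapping a
    matrix of cases to predicted labels."""
    rng = np.random.default_rng(rng)
    n = len(y)
    wrong = total = 0
    for _ in range(B):
        idx = rng.integers(0, n, size=n)       # bootstrap sample indices
        out = np.setdiff1d(np.arange(n), idx)  # left-out observations (~36.8%)
        if out.size == 0:
            continue
        rule = fit(X[idx], y[idx])
        wrong += int(np.sum(rule(X[out]) != y[out]))
        total += out.size
    return wrong / total

def bt632(app, loobt_value):
    """Efron's .632 bootstrap: a weighted combination of APP and LOOBT."""
    return 0.368 * app + 0.632 * loobt_value
```

BCV follows the same pattern, running leave-one-out cross-validation inside each bootstrap sample and averaging the B resulting error estimates.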


4.2 Theoretical results for the APP of restricted rules

Here we obtain some properties of APP for the restricted rules. In particular, in Theorem 1 we prove that APP is less optimistic for the rules Rn(γ), γ ∈ [0, 1], than for Fisher's rule. The proof of the Theorem is deferred to the Appendix in order to improve the readability of the paper. The result will be proved for known Σ. Under this condition a simple transformation allows us to further assume that Σ = I. Recall also that throughout the paper equal a priori probabilities are being considered. In these conditions, the apparent error rate of rule Rn(γ), γ ∈ [0, 1], is APP(γ) = (APP1(γ) + APP2(γ))/2, where

APP1(γ) = (1/n1) Σ_{i=1}^{n} I[(Xi − (c1 X̄1 + c2 X̄2) + c δ*γ)′ δ*γ < 0] I[Yi = 1]

and

APP2(γ) = (1/n2) Σ_{i=1}^{n} I[(Xi − (c1 X̄1 + c2 X̄2) + c δ*γ)′ δ*γ > 0] I[Yi = 2]

are the apparent error rates of populations Π1 and Π2, respectively. This is the result:

Theorem 1 If n1 = n2 then, for any γ ∈ [0, 1], E(APP(γ)) ≥ E(APP(0)) ≥ E(APP(Fisher)).

Remark 2 In Fernández et al. it is proved that, if n1 = n2, the true error rate of the rules Rn(γ), γ ∈ [0, 1], is lower than that of Fisher's rule. Moreover, in all simulations performed the true error rate of the rules Rn(γ) is higher than their expected apparent error rates. Since, from Theorem 1, E(APP(γ)) ≥ E(APP(Fisher)), this suggests that if n1 = n2 the bias of APP for the rules Rn(γ), γ ∈ [0, 1], is lower than that for Fisher's rule. A possible explanation for this is that the restricted rules are less dependent on the training sample values than Fisher's rule, as they are built not only using the training sample but also the additional information available for the problem. Consequently, the usual procedures designed to correct the bias of APP cannot be expected to perform well. This points to the need for new estimators of En, specific for these restricted rules.
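For Σ = I, APP(γ) above can be computed directly from the training sample; the constants c1, c2 and c are those defined for the rules Rn(γ) earlier in the paper and are taken here as inputs:

```python
import numpy as np

def apparent_error_rate(X, y, x1bar, x2bar, delta_star, c1, c2, c):
    """APP(gamma) = (APP1 + APP2) / 2 under Sigma = I, as in Theorem 1:
    a class-1 observation is miscounted when its discriminant score is
    negative, a class-2 observation when it is positive."""
    X = np.asarray(X, dtype=float)
    scores = (X - (c1 * x1bar + c2 * x2bar) + c * delta_star) @ delta_star
    n1, n2 = int(np.sum(y == 1)), int(np.sum(y == 2))
    app1 = np.sum((scores < 0) & (y == 1)) / n1
    app2 = np.sum((scores > 0) & (y == 2)) / n2
    return (app1 + app2) / 2
```

Plugging in the restricted estimator δ*γ gives APP for Rn(γ); plugging in X̄1 − X̄2 gives APP for Fisher's rule.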

4.3 New proposals

In this subsection we propose new true error rate estimation procedures based on resampling techniques for the restricted rules. These methods modify LOOBT and BCV to make them able to cope properly with the information included in the rule. We will denote as BT2 and BT3 the methods generated from LOOBT, and as BT2CV and BT3CV the ones coming from BCV. The additional information we are considering can be written as δ = μ1 − μ2 ∈ C, where C = {z ∈ R^p : a′j z ≥ 0, j = 1, …, m} is the appropriate cone of restrictions. Let us denote as Ĉ the following random cone generated by δ̂ = X̄1 − X̄2:

Ĉ = {z ∈ R^p : a′j z ≥ 0 if a′j δ̂ ≥ 0, and a′j z ≤ 0 if a′j δ̂ < 0, j = 1, …, m}.

The true error rate estimator BT2 is computed in a way similar to LOOBT but considering bootstrap classification rules generated using projections onto the cone Ĉ instead of C for each bootstrap training sample. In


other words, for each bootstrap sample Mb* = {(Xib*, Yib*), i = 1, 2, …, n} we compute the bootstrap version of the estimator of δ, which we denote as δγ*b (with γ ∈ [0, 1]), defined as the limit of the following iterative procedure, similar to the one considered in Section 2. Let δ̂γ(0)b = X̄1*b − X̄2*b and δ̂γ(i)b = pS⁻¹(δ̂γ(i−1)b / Ĉ) − γ pS⁻¹(δ̂γ(i−1)b / Ĉ^P) for i = 1, 2, … Now we denote as Rn*b(γ) the bootstrap versions of the classification rules Rn(γ), defined as: Classify u in Π1 iff (u − (c1 X̄1*b + c2 X̄2*b) + c δγ*b)′ S⁻¹ δγ*b ≥ 0. For each rule Rn*b(γ), b = 1, 2, …, B, we classify the observations in the original training sample that do not belong to the bootstrap sample Mb*. The true error rate estimator BT2 is the proportion of observations wrongly classified. The heuristic behind BT2 is that the "bootstrap world" should mirror the "real world". In the "real world" the original training sample Mn is obtained from the populations Πj, j = 1, 2, which verify δ = μ1 − μ2 ∈ C. In the "bootstrap world" the population is Mn, which verifies δ̂ = X̄1 − X̄2 ∈ Ĉ. Therefore, the bootstrap versions of the rules should be obtained replacing the cone C by Ĉ. Our second proposal to use the additional information in a way that the "bootstrap world" imitates the "real world" is to adapt the original training sample, instead of modifying the cone, as follows. Assume that the original training sample Mn does not verify the restrictions, i.e., δ̂ = X̄1 − X̄2 ∉ C. For any γ ∈ [0, 1] we can use δ*γ, the restricted estimator of δ, to obtain estimators for μi, i = 1, 2. As μ1 = (μ1 + μ2 + δ)/2 and μ2 = (μ1 + μ2 − δ)/2, we can consider μ*γ1 = (X̄1 + X̄2 + δ*γ)/2 and μ*γ2 = (X̄1 + X̄2 − δ*γ)/2 as estimators for μ1 and μ2, respectively. Now, we transform the original training sample in such a way that the difference of the new sample means belongs to C. The transformed training sample is {(Wi, Yi), i = 1, 2, …, n}, where Wi = Xi − X̄j + μ*γj if Yi = j, j = 1, 2.
In this way W̄1 − W̄2 = μ*γ1 − μ*γ2 = δ*γ ∈ C. Now, the proposed estimator of the true error rate, which we denote as BT3, is LOOBT replacing the original training sample by the transformed one. In this way, the bootstrap samples are extracted from a population that verifies the same property that is fulfilled by the populations from which the original training sample is extracted. We also consider cross-validation after bootstrap versions of BT2 and BT3, denoted as BT2CV and BT3CV, respectively.
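The two ingredients that distinguish BT2 and BT3 from plain LOOBT — the sign-adjusted cone Ĉ and the mean-shifted training sample — can be sketched as follows (function names are ours; the cone projection step itself is omitted):

```python
import numpy as np

def adjusted_cone_matrix(A, delta_hat):
    """BT2 ingredient: rows a_j' of A define C = {z : A z >= 0}. Flip the
    sign of every row with a_j' delta_hat < 0, so that the random cone
    C_hat = {z : A_hat z >= 0} always contains delta_hat = X1bar - X2bar."""
    A = np.asarray(A, dtype=float)
    signs = np.where(A @ delta_hat >= 0, 1.0, -1.0)
    return signs[:, None] * A

def bt3_transform(X, y, mu1_star, mu2_star):
    """BT3 ingredient: shift each class so the transformed sample means
    equal the restricted estimators, W_i = X_i - Xbar_j + mu*_gamma_j
    when Y_i = j."""
    X = np.asarray(X, dtype=float)
    W = X.copy()
    for j, mu in ((1, np.asarray(mu1_star, dtype=float)),
                  (2, np.asarray(mu2_star, dtype=float))):
        mask = (y == j)
        W[mask] = X[mask] - X[mask].mean(axis=0) + mu
    return W
```

BT2 runs the bootstrap on the original sample but projects onto the adjusted cone; BT3 runs the ordinary LOOBT on the transformed sample, whose mean difference now satisfies the restrictions.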

4.4 Estimators behavior: simulation study

The behavior of an estimator Ê of the true error rate En is analyzed through the distribution of the random variable Ê − En. This distribution has been called the deviation distribution of the error estimator by Braga-Neto and Dougherty (2004). As a global measure of the behavior of Ê we will use E[(Ê − En)²]. As usual, this measure can be decomposed into a variance and a bias component: E[(Ê − En)²] = Var(Ê − En) + [E(Ê − En)]². In this section we conduct a simulation study to compare the behavior of the estimators APP(γ), CV(γ), LOOBT(γ), BT632(γ), BCV(γ), BT2(γ), BT3(γ), BT2CV(γ) and BT3CV(γ) of the true error rate En(γ) of the restricted classification rules Rn(γ). The purpose of this study is to propose a reasonable estimator of En(γ) when the training sample does not fulfill the restrictions. For simplicity we consider p = 3 and the identity covariance matrix, and study the positive orthant restrictions case, i.e., δ ∈ O3+ = {x ∈ R³ : xi ≥ 0, i = 1, 2, 3}. We generate training samples of size n1 = n2 = 10 from populations Π1, N3(δ, Σ), and Π2, N3(0, Σ), for different values of δ and Σ. Since in practice sample sizes and covariances are usually larger, we have also run the simulations with bigger sample sizes (n1 = n2 = 50)


rescaling the covariance matrix accordingly, obtaining similar results. The simulations were performed for many values of δ, both in the interior of the cone and on the frontier of O3+, and for several values of Σ. However, since there was no significant variation in the results, in order to save space we only present here the results obtained for Σ = I and when δ is the vertex of the cone (0, 0, 0) or lies in the interior of O3+. To be more precise, we show the results for values of δ in the diagonal direction of the cone, i.e., δ = λ(1, 1, 1) with λ ≥ 0. The values of λ have been chosen so that ||δ||² = 0, 0.25, 0.5, …, 2.5. Notice that the values considered cover situations where discrimination is easy, since the distance between the means ||δ||² is large, and others where the samples from the populations are much more likely to overlap. Larger values of ||δ||² are not given since for those values the restrictions are almost always fulfilled and therefore the true error estimation procedures are equivalent. For each scenario, we generate 1000 training samples for which the rules Rn(γ), with γ ∈ {0, 0.5, 1}, are determined. For each of these three rules we compute APP, CV, LOOBT, BT632, BCV, BT2, BT3, BT2CV and BT3CV. The number of bootstrap replicas considered for the estimators involving bootstrap was B = 100. The true error rate En for each training sample was computed using a test sample with 1000 observations from each of the two populations. Using this procedure we have 1000 values of the deviation distribution of each of the 9 error estimators. With these values we approximate (E[(Ê − En)²])^{1/2} and E(Ê − En), which we will denote as A(Ê) and B(Ê), respectively. For example, if we denote as BT2^i(0.5) and En^i(0.5) the values of BT2 and En obtained from the i-th training sample for rule Rn(0.5), we can estimate A(BT2(0.5)) = (E[(BT2(0.5) − En(0.5))²])^{1/2} and B(BT2(0.5)) = E(BT2(0.5) − En(0.5)) by

( (1/1000) Σ_{i=1}^{1000} [BT2^i(0.5) − En^i(0.5)]² )^{1/2} and (1/1000) Σ_{i=1}^{1000} [BT2^i(0.5) − En^i(0.5)],

respectively. Again, in order to save space, as the results obtained for the three classification rules were similar, in Table 1 we only present the values for γ = 1. Notice that, for almost all values of ||δ||², the two lowest values of A(Ê) are the ones for BT2 and BT3. In Figure 4 we represent the values of A(Ê) and B(Ê) depending on ||δ||² for the nine estimators of the true error rate that we are considering. As in other simulation studies, APP generally has the largest negative bias, CV has the lowest bias but is the one with the highest variance, and LOOBT shows a positive bias. The estimators BCV, BT2CV, BT3CV and BT632 exhibit a negative bias. The new estimators proposed in this paper, BT2 and BT3, which modify the bootstrap in order to cope with the additional information incorporated into the rules, have similar behavior. This is somewhat surprising since they are based on very different ideas. They are also the best estimators of the true error rate for the smallest values of ||δ||². These are obviously the most interesting situations in practice, as they correspond to scenarios where the discrimination is more difficult and where the additional information is more likely to play a key role in the rule. In order to have a more thorough idea of their behavior, we also obtained kernel estimators of the density of the deviation distribution for each of the nine estimators of En. The kernel density estimators corresponding to the scenario ||δ||² = 0.3 for the estimators BT2, BT3, BT632, BT2CV and BT3CV are represented in Figure 5. From this figure it is clear that the kernel estimators for BT2 and BT3 have the lowest values of bias and variance among the five represented in the graph.
The estimators BT2CV and BT3CV have a similar variance component but are much more biased, while BT632 has a higher variance component than the rest. Therefore, we conclude with the recommendation of BT2 and BT3 as estimators of the true error rate of the restricted rules.
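Given paired values of an estimator Ê and the true error En over simulated training samples, A(Ê) and B(Ê) are simply the root mean square and the mean of the deviations; a one-function sketch (names are ours):

```python
import numpy as np

def deviation_summary(e_hat, e_true):
    """A(E_hat) = sqrt(E[(E_hat - En)^2]) and B(E_hat) = E[E_hat - En],
    estimated from paired values over simulated training samples."""
    d = np.asarray(e_hat, dtype=float) - np.asarray(e_true, dtype=float)
    return float(np.sqrt(np.mean(d ** 2))), float(np.mean(d))
```

A(Ê) combines the variance and bias components, while the sign of B(Ê) shows whether the estimator is optimistic or pessimistic.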

5 Application: bladder cancer data

The data considered in this application come from a bladder cancer project aimed at selecting classifiers in the context of an in vitro diagnostic tool for the disease. Our industrial and pharmaceutical partners in this


Table 1 Simulation results for the nine estimators under Σ = I and γ = 1.

Estimator      ||δ||²:  0       0.25    0.5     0.75    1       1.25    1.5     1.75    2       2.25    2.5

APP      A              0.144   0.141   0.141   0.137   0.133   0.133   0.129   0.131   0.121   0.122   0.115
         B             −0.015  −0.054  −0.071  −0.076  −0.076  −0.074  −0.080  −0.080  −0.075  −0.081  −0.072
CV       A              0.130   0.126   0.126   0.123   0.118   0.125   0.114   0.119   0.112   0.104   0.104
         B              0.001   0.008   0.001   0.006   0.003   0.009   0.000   0.006   0.008  −0.006   0.005
LOOBT    A              0.085   0.090   0.091   0.092   0.090   0.097   0.094   0.098   0.094   0.092   0.091
         B             −0.005   0.008   0.006   0.014   0.015   0.018   0.014   0.019   0.022   0.014   0.024
BT632    A              0.104   0.102   0.102   0.098   0.095   0.100   0.095   0.098   0.091   0.091   0.087
         B             −0.009  −0.015  −0.022  −0.019  −0.018  −0.016  −0.021  −0.017  −0.014  −0.021  −0.011
BCV      A              0.114   0.114   0.112   0.108   0.105   0.107   0.103   0.103   0.097   0.097   0.092
         B             −0.011  −0.033  −0.045  −0.046  −0.049  −0.048  −0.052  −0.049  −0.046  −0.053  −0.043
BT2      A              0.094   0.079   0.086   0.088   0.088   0.093   0.093   0.096   0.095   0.091   0.092
         B             −0.069  −0.010  −0.002   0.010   0.013   0.015   0.013   0.018   0.021   0.014   0.024
BT3      A              0.093   0.078   0.084   0.087   0.087   0.094   0.092   0.096   0.094   0.091   0.091
         B             −0.069  −0.012  −0.002   0.010   0.013   0.016   0.013   0.018   0.021   0.014   0.024
BT2CV    A              0.146   0.109   0.109   0.105   0.102   0.103   0.102   0.100   0.097   0.096   0.091
         B             −0.129  −0.074  −0.068  −0.059  −0.057  −0.055  −0.055  −0.052  −0.048  −0.054  −0.044
BT3CV    A              0.145   0.110   0.108   0.104   0.102   0.103   0.102   0.100   0.097   0.097   0.091
         B             −0.128  −0.076  −0.067  −0.058  −0.057  −0.055  −0.056  −0.053  −0.048  −0.054  −0.044


Figure 4 A(Ê) and B(Ê) for the true error rate estimators for Σ = I and γ = 1.



Figure 5 Kernel estimators for the density function of Ê − En for several estimators, for ||δ||² = 0.3.

research are Proteomika S.L. and Laboratorios SALVAT, S.A. For intellectual property reasons, the names of the proteins used in this study are not disclosed in the paper. Patients were classified into five levels based on cystoscopy. The first level is the control level (i.e., a negative result of the cystoscopy, therefore considered as absence of bladder cancer) and the other four are denoted as Ta, T1G1, T1G3 and T2, each of them corresponding to increasingly advanced levels of cancer. This combines the TNM staging (see UICC, 2009) and the grading. To be more precise, stage T describes the size of the tumor and whether it has spread, and grade G refers to the appearance of the cells under the microscope. For this application, and in order to keep the populations balanced, we will consider the control level as population Π1 and levels T1G3 and T2 as population Π2. As usual in this kind of research, an initial database was provided. The purpose of this pilot study was to confirm or discard the associations between the proteins and the illness in order to establish a larger multicenter study. The data set D1 contained information on 41 patients from Π1 and 32 from Π2 and 11 proteins, together with the real stage of the illness the patients belonged to. This is the initial data set and the one we will use to build the rules. In the usual statistical terminology this is the training set. We started by considering all 11 available proteins, but we found that, when these proteins were used together for classifying, the results were not as good as when a smaller number of proteins was considered. This was confirmed when simple studies made using t-tests told us that not all proteins were relevant. Moreover, our industrial and pharmaceutical partners informed us that, based on previous knowledge, three of the 11 proteins were expected to be more informative than the rest.
Therefore, based on that previous knowledge and on the results of our studies, we decided to use four proteins in this study to discriminate between Π1 and Π2. It is clear that this sort of selection procedure can only be done when there is some prior information and the total number of variables is small. Obviously, we are aware that nowadays in many disciplines it is common to collect high dimensional data in a limited number of samples, and it becomes necessary to use variable selection algorithms. This is a very important issue that will be considered further in the Discussion Section. We will denote the four proteins selected as P1, P2, P3 and P4. For each of these four proteins it was expected that higher values of the protein were related to more advanced stages of the illness, i.e., δ = μ2 − μ1 ∈ O+. As usual, the values of the protein levels have been transformed logarithmically so that the variables are approximately normally distributed. The mean values in each of the populations and the pooled covariance matrix obtained from this data set appear in Table 2. From this table it is obvious that the additional information was not fulfilled by the training set, so the classification rules Rn(γ) are relevant in this problem. Table 3 contains the values of the restricted estimator δ*γ appearing in these rules for γ ∈ {0, 1}.


Table 2 Mean for each group and pooled covariance matrix from D1.

Means       N     log(P1)   log(P2)   log(P3)   log(P4)
Π1          41    1.416     1.356     3.879     1.417
Π2          32    1.409     0.976     4.348     1.578

     (  1.065    0.455   −0.154    0.106 )
S =  (  0.455    0.515   −0.052   −0.053 )
     ( −0.154   −0.052    0.544    0.148 )
     (  0.106   −0.053    0.148    0.450 )

Table 3 Values of δ*γ for the Rn(γ) rules built from D1.

Rn(0):  (0.328, 0, 0.430, 0.123)
Rn(1):  (0.664, 0.380, 0.392, 0.084)

We use this data set D1 as a training set to build the rules. In this bladder cancer research, a second data set D2, containing measures of the same 11 proteins and the real illness stage for a different set of 118 patients, was received at a later stage. We use this second set D2 as a test set. Before obtaining an estimator of the true error rate of the rules, it will be useful to compare the four rules considered in the paper with regard to the performance measures introduced in Section 3. Rules are built from D1 and evaluated with D2. We can see in Table 4 that Rn(1) takes the lowest calibration score. Refinement scores are very similar for all the rules, with Rn(0) taking the lowest value. In Figure 6 we represent the ROC curves for the rules built from D1. We can see that the best ROC curve is that of Rn(1). Consequently, Rn(1) has the largest AUC (see Table 4). We use the D2 set as a test set in order to obtain an estimator of the true error rate of the rules. In this way we will be able to compare the estimators of the true error rate previously defined with another value obtained from an independent sample and evaluate the behavior of the true error rate estimators in this application. Table 5 contains the results obtained with the nine estimators of the true error rate considered in the paper and the independent estimation obtained from D2. The bootstrap values have been obtained generating B = 100 bootstrap samples, as in Subsection 4.4. There are several questions that are worth noticing. One of them is the fact that, as mentioned in Subsection 4.2, APP increases with γ, which is logical since the rules with higher values of γ are less dependent on the original training sample. Moreover, for the data in the application, APP is higher than the independent estimation of the true error obtained from D2 for rule Rn(1). This is not usual for Fisher's rule, although it may happen more frequently for the restricted rules since APP

Table 4 Performance measures of the rules.

Measure   Fisher   SmDLDA   Rn(0)   Rn(1)
CS        0.082    0.083    0.048   0.028
RS        0.170    0.171    0.163   0.170
AUC       0.652    0.652    0.681   0.708



Figure 6 ROC curves of the rules.

Table 5 Estimations of the true error rate of the rules.

Estimator            Fisher    SmDLDA    Rn(0)     Rn(1)
APP                  30.14%    30.14%    32.88%    50.68%
CV                   36.99%    36.99%    36.99%    49.32%
LOOBT                36.49%    36.22%    39.05%    48.93%
BT632                34.15%    33.98%    36.78%    49.57%
BCV                  29.37%    30.75%    32.95%    45.15%
BT2                  –         –         35.53%    34.76%
BT3                  –         –         41.77%    42.80%
BT2CV                –         –         29.05%    29.33%
BT3CV                –         –         34.67%    37.03%
Estimation from D2   39.83%    39.83%    36.44%    33.90%

usually increases and the true error decreases with γ. Notice, however, that, from the results obtained in the simulations Section, APP still has a negative bias as an estimator of the true error rate. We can observe that, even in this non-standard case, the BT2 estimator, which had the second best behavior in the simulations, has a very good performance for the values of γ considered. In conclusion, we can see that, as in previous Sections, the combination of rule Rn(1) with the estimator BT2 yields good results in this case.

6 Discussion

In the classical problem of discrimination between two normal populations with equal covariance matrices, Fernández et al. defined new classification rules that take into account the additional information that is frequently available in classification problems (restricted rules) and showed that these rules have lower misclassification error than the usual Fisher's rule. These rules, which we denote as Rn(γ), γ ∈ [0, 1], are obtained from the Bayes rule considering estimators of the mean vectors of the populations that take into account the


additional information of the problem by projecting the sample means on the cone defined by the additional information. As projection is a contractive operator, these restricted rules can be viewed as shrinkage rules. Tong et al. have recently proposed to consider James-Stein type shrinkage estimators of the means to define the discrimination rules known as SmDLDA rules. Considering scenarios similar to those in Tong et al., we have compared, via a simulation study, the performance of the restricted rules Rn(γ) with that of SmDLDA and Fisher's rule using several of the most common criteria in the literature. The criteria considered are the total misclassification probability (TMP), the area under the ROC curve (AUC), the refinement (RS) and the well-calibratedness (CS). The results obtained show that the restricted rules compete well with SmDLDA, even in high dimensional situations, under any of the four criteria considered, showing better performance under many of the conditions considered in the simulations. The restricted rules also have the advantage that the shrinkage in these rules is not "blind" but motivated by the information at hand. Another important issue for any classification rule is the estimation of the true error rate, i.e., the error rate for a given training sample. This problem has not been considered so far for the restricted rules. In this paper we check that the apparent error rate (APP) as an estimator of the true error rate of the restricted rules has a different behavior than that of Fisher's rule. Namely, in Theorem 1 we prove that the expected apparent error of these rules is higher than that of Fisher's rule. As the true error rate of Fisher's rule is higher than that of the new rules, this means that these new rules do not suffer as much overfitting as Fisher's rule.
Consequently, the usual procedures that try to reduce the bias of APP as an estimator of the true error rate, such as CV, LOOBT, BT632 or BCV, do not work as well as they should, and new estimators of the true error rate, specific to the restricted rules, are needed. We consider two methods based on different bootstrap procedures that take into account the additional information available on the problem. The first one, denoted BT2, adjusts the cone of restrictions to the training sample, while the second one, denoted BT3, adjusts the training sample to the cone of restrictions. The corresponding cross-validation-after-bootstrap versions of these procedures, BT2CV and BT3CV, are also considered.

Based on a simulation study, we check that the new procedures BT2 and BT3 generally perform better as estimators of the true error rate, En, than the usual estimators designed for rules that do not account for additional information. Their performance is especially good when the populations are not too separated. This is the scenario where the new rules are most interesting, since it is the case where training samples not fulfilling the restrictions are most likely to appear. We also notice that for these rules it is not necessary to perform cross-validation after bootstrap, since BT2CV and BT3CV do not behave better than BT2 or BT3. Therefore, we conclude by recommending the estimators BT2 and BT3 to evaluate the true error rate of the discrimination rules defined in Fernández et al. (2006).

All the work in this paper has been applied to a real bladder cancer project. Four rules have been considered: Fisher's, SmDLDA, Rn(0) and Rn(1). We observed that Rn(1) has the best behavior with respect to three of the four evaluation measures considered (namely, true error rate, area under the ROC curve and calibration score) and that all four rules give similar results for the refinement score.
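As a rough illustration of bootstrap estimation of the true error rate, the sketch below refits a rule on each resample and evaluates it on the original training sample. This is only a plain bootstrap skeleton: the names `fit_rule` and `bootstrap_error` are hypothetical, the rule is a Fisher-type rule with identity covariance, and the cone adjustments that distinguish BT2 and BT3 are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_rule(X, y):
    """Fisher-type linear rule with identity covariance (sketch only)."""
    m1, m2 = X[y == 1].mean(axis=0), X[y == 2].mean(axis=0)
    d, mid = m1 - m2, (m1 + m2) / 2
    return lambda x: 1 if (x - mid) @ d > 0 else 2

def bootstrap_error(X, y, B=200):
    """Plain bootstrap estimate of the true error rate: refit the rule on
    each resample, evaluate it on the original training sample, average."""
    n, errs = len(y), []
    for _ in range(B):
        idx = rng.choice(n, size=n, replace=True)
        if len(set(y[idx])) < 2:          # need both classes to refit
            continue
        rule = fit_rule(X[idx], y[idx])
        errs.append(np.mean([rule(x) != yi for x, yi in zip(X, y)]))
    return float(np.mean(errs))

# toy data: two well-separated bivariate normal classes
X = np.vstack([rng.normal(size=(30, 2)) + 3.0, rng.normal(size=(30, 2))])
y = np.array([1] * 30 + [2] * 30)
err = bootstrap_error(X, y, B=100)
```

BT2 and BT3 would modify the resampling step so that each bootstrap replicate respects (or re-estimates) the order restrictions; the skeleton above shows only the shared refit-and-evaluate structure.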
We also compared the estimations of the true error rate given by the new estimators and showed the good behavior of the new estimator BT2 for the restricted rules Rn(0) and Rn(1).

There is still work to be done on the restricted rules. A very important issue for any rule is variable selection. In many disciplines, high-dimensional data with small sample sizes are collected, and these cases highlight the need for variable selection algorithms and for some kind of general variable selection strategy. A good recent proposal is that of Graf and Bauer (2009), who perform model selection based on FDR-thresholding, optimizing the area under the ROC curve. We think that this proposal may be very useful for restricted rules. In any case, this is a very interesting area that we will explore in future work.

Acknowledgment: This research was partially supported by Spanish DGES grant MTM2012-37129. We thank the anonymous reviewers for several useful comments which improved this manuscript.



Appendix

The expected apparent error rate for $\Pi_1$ is
$$
E(\mathrm{APP}_1(\gamma)) = P\left(\left(X_1-(c_1\bar X_1+c_2\bar X_2)+c\,\delta^*_\gamma\right)'\delta^*_\gamma<0,\; Y_1=1\right)
$$
$$
= E\left[P\left(\left(X_1-(c_1\bar X_1+c_2\bar X_2)+c\,\delta^*_\gamma\right)'\delta^*_\gamma<0,\; Y_1=1 \mid \bar X_1,\bar X_2\right)\right].
$$

For the proof of Theorem 1 we will need a preliminary result:

Lemma 3.
$$
P\left(\left(X_1-(c_1\bar X_1+c_2\bar X_2)+c\,\delta^*_\gamma\right)'\delta^*_\gamma<0,\; Y_1=1 \mid \bar X_1,\bar X_2\right)
= \Phi\left(-\sqrt{\frac{n_1}{n_1-1}}\,\frac{(c_2\delta+c\,\delta^*_\gamma)'\delta^*_\gamma}{\sqrt{\delta^{*\prime}_\gamma\delta^*_\gamma}}\right).
$$

Proof. In order to make the proof clearer and to emphasize the dependence of $\delta^*_\gamma$ on $\bar X_1$ and $\bar X_2$, during the proof we write $\delta^*_\gamma$ as $\delta^*_\gamma(\bar X_1,\bar X_2)$. It is easy to check that
$$
\left(X_1-(c_1\bar X_1+c_2\bar X_2)+c\,\delta^*_\gamma(\bar X_1,\bar X_2)\right)'\delta^*_\gamma(\bar X_1,\bar X_2)
=(X_1-\bar X_1)'\delta^*_\gamma(\bar X_1,\bar X_2)+\left(c_2\delta+c\,\delta^*_\gamma(\bar X_1,\bar X_2)\right)'\delta^*_\gamma(\bar X_1,\bar X_2),
$$
so that
$$
P\left(\left(X_1-(c_1\bar X_1+c_2\bar X_2)+c\,\delta^*_\gamma(\bar X_1,\bar X_2)\right)'\delta^*_\gamma(\bar X_1,\bar X_2)<0,\; Y_1=1 \mid \bar X_1=t_1,\bar X_2=t_2\right)
$$
$$
=P\left((X_1-\bar X_1)'\delta^*_\gamma(t_1,t_2)<-\left(c_2(t_1-t_2)+c\,\delta^*_\gamma(t_1,t_2)\right)'\delta^*_\gamma(t_1,t_2),\; Y_1=1 \mid \bar X_1=t_1,\bar X_2=t_2\right).
$$
Now, $(X_1-\bar X_1)'\delta^*_\gamma(t_1,t_2)\sim N\!\left(0,\,\frac{n_1-1}{n_1}\,\delta^*_\gamma(t_1,t_2)'\delta^*_\gamma(t_1,t_2)\right)$ is an ancillary statistic, as its distribution does not depend on $\mu_1$ or $\mu_2$. As $(\bar X_1,\bar X_2)$ is sufficient and complete, from Basu's theorem we have that $(X_1-\bar X_1)'\delta^*_\gamma(t_1,t_2)$ and $(\bar X_1,\bar X_2)$ are independent. From this fact we have that
$$
P\left((X_1-\bar X_1)'\delta^*_\gamma(t_1,t_2)<-\left(c_2(t_1-t_2)+c\,\delta^*_\gamma(t_1,t_2)\right)'\delta^*_\gamma(t_1,t_2),\; Y_1=1 \mid \bar X_1=t_1,\bar X_2=t_2\right)
$$
$$
=P\left((X_1-\bar X_1)'\delta^*_\gamma(t_1,t_2)<-\left(c_2(t_1-t_2)+c\,\delta^*_\gamma(t_1,t_2)\right)'\delta^*_\gamma(t_1,t_2),\; Y_1=1\right)
$$
$$
=\Phi\left(-\sqrt{\frac{n_1}{n_1-1}}\,\frac{\left(c_2(t_1-t_2)+c\,\delta^*_\gamma(t_1,t_2)\right)'\delta^*_\gamma(t_1,t_2)}{\sqrt{\delta^*_\gamma(t_1,t_2)'\delta^*_\gamma(t_1,t_2)}}\right).
$$

See Lehmann and Casella (1998, p. 93) for the same argument in a similar situation.

In a similar way, for $\Pi_2$ we have
$$
E(\mathrm{APP}_2(\gamma)) = P\left(\left(X_1-(c_1\bar X_1+c_2\bar X_2)+c\,\delta^*_\gamma\right)'\delta^*_\gamma>0,\; Y_1=2\right)
$$
$$
= E\left[P\left(\left(X_1-(c_1\bar X_1+c_2\bar X_2)+c\,\delta^*_\gamma\right)'\delta^*_\gamma>0,\; Y_1=2 \mid \bar X_1,\bar X_2\right)\right]
$$
and
$$
P\left(\left(X_1-(c_1\bar X_1+c_2\bar X_2)+c\,\delta^*_\gamma\right)'\delta^*_\gamma>0,\; Y_1=2 \mid \bar X_1,\bar X_2\right)
= \Phi\left(-\sqrt{\frac{n_2}{n_2-1}}\,\frac{(c_1\delta-c\,\delta^*_\gamma)'\delta^*_\gamma}{\sqrt{\delta^{*\prime}_\gamma\delta^*_\gamma}}\right).
$$
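The reduction of these conditional probabilities to normal tail probabilities hinges on the distributional fact used in the lemma: for $X_1,\dots,X_n$ i.i.d. $N(\mu, I_p)$ and any fixed vector $a$, $(X_1-\bar X)'a \sim N\!\left(0,\frac{n-1}{n}\,a'a\right)$, whatever $\mu$ is. A quick Monte Carlo sketch (the values of $n$, $p$, $a$ and $\mu$ below are arbitrary illustrations) confirms this variance:

```python
import numpy as np

rng = np.random.default_rng(1)

# (X_1 - Xbar)' a should have mean 0 and variance (n-1)/n * a'a,
# independently of mu (it is ancillary).
n, p, reps = 5, 3, 200_000
a = np.array([1.0, -2.0, 0.5])
mu = np.array([0.3, 1.0, -0.7])

X = rng.normal(size=(reps, n, p)) + mu       # reps independent samples of size n
stat = (X[:, 0, :] - X.mean(axis=1)) @ a     # (X_1 - Xbar)' a, one per sample
target_var = (n - 1) / n * (a @ a)           # = 0.8 * 5.25 = 4.2
```

The empirical mean of `stat` should be close to 0 and its empirical variance close to `target_var`.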

Following the same lines we can also prove that, for Fisher's rule,
$$
E(\mathrm{APP}_1(\mathrm{Fisher})) = E\left[P\left(\left(X_1-\tfrac12(\bar X_1+\bar X_2)\right)'\delta<0,\; Y_1=1 \mid \bar X_1,\bar X_2\right)\right],
$$
$$
E(\mathrm{APP}_2(\mathrm{Fisher})) = E\left[P\left(\left(X_1-\tfrac12(\bar X_1+\bar X_2)\right)'\delta>0,\; Y_1=2 \mid \bar X_1,\bar X_2\right)\right],
$$
and
$$
P\left(\left(X_1-\tfrac12(\bar X_1+\bar X_2)\right)'\delta<0,\; Y_1=1 \mid \bar X_1,\bar X_2\right) = \Phi\left(-\frac12\sqrt{\frac{n_1}{n_1-1}}\,\|\delta\|\right),
$$
$$
P\left(\left(X_1-\tfrac12(\bar X_1+\bar X_2)\right)'\delta>0,\; Y_1=2 \mid \bar X_1,\bar X_2\right) = \Phi\left(-\frac12\sqrt{\frac{n_2}{n_2-1}}\,\|\delta\|\right).
$$

Now we are ready to prove Theorem 1.

Proof of Theorem. As $n_1=n_2$, we have that $c_1=c_2=\tfrac12$ and $c=0$. Now, $\delta^*_0=p(\delta\mid C)$ and $\delta^*_\gamma\in C$, so, taking into account Theorem 1.3.2 in Robertson et al. (1988), $(\delta-\delta^*_0)'\delta^*_0=0$ and $(\delta-\delta^*_0)'\delta^*_\gamma\le 0$. From this,
$$
\frac{\delta'\delta^*_\gamma}{\sqrt{\delta^{*\prime}_\gamma\delta^*_\gamma}}
\le \frac{\delta^{*\prime}_0\delta^*_\gamma}{\sqrt{\delta^{*\prime}_\gamma\delta^*_\gamma}}
= \|\delta^*_0\|\cos(\delta^*_0,\delta^*_\gamma)
\le \|\delta^*_0\|
= \frac{\delta'\delta^*_0}{\sqrt{\delta^{*\prime}_0\delta^*_0}}
\le \|\delta\|,
$$
and the result follows from Lemma 3.
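The inequality chain in the proof relies only on general properties of projections onto closed convex cones. A small numeric sketch, using the nonnegative orthant as a stand-in cone with componentwise clipping as its projection (the actual cones come from the order restrictions), checks the chain on random vectors:

```python
import numpy as np

rng = np.random.default_rng(2)

def proj_orthant(v):
    """Projection onto the nonnegative orthant, a simple closed convex cone."""
    return np.maximum(v, 0.0)

# check: delta'd_g/||d_g|| <= delta0'd_g/||d_g|| <= ||delta0|| <= ||delta||,
# where delta0 = p(delta | C) and d_g is any point of the cone C.
ok = True
for _ in range(1000):
    delta = rng.normal(size=4)
    d0 = proj_orthant(delta)                 # plays the role of delta*_0
    dg = proj_orthant(rng.normal(size=4))    # stands in for delta*_gamma in C
    if min(np.linalg.norm(d0), np.linalg.norm(dg)) < 1e-9:
        continue                             # degenerate draws: skip
    lhs = delta @ dg / np.linalg.norm(dg)
    mid = d0 @ dg / np.linalg.norm(dg)
    ok &= bool(lhs <= mid + 1e-9)
    ok &= bool(mid <= np.linalg.norm(d0) + 1e-9)
    ok &= bool(np.linalg.norm(d0) <= np.linalg.norm(delta) + 1e-9)
```

For this cone, the first inequality is $(\delta-\delta^*_0)'\delta^*_\gamma\le 0$, the second is Cauchy-Schwarz, and the last is the contractiveness of cone projection.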

References

Beran, R. and L. Dümbgen (2010): "Least squares and shrinkage estimation under bimonotonicity constraints," Stat. Comput., 20(2), 177–189.
Braga-Neto, U. M. and E. R. Dougherty (2004): "Is cross-validation valid for small-sample microarray classification?," Bioinformatics, 20, 374–380.
Conde, D., M. A. Fernández, C. Rueda and B. Salvador (2012): "Classification of samples into two or more ordered populations with application to a cancer trial," Stat. Med., 31(28), 3773–3786.
Dettling, M. (2005): "Bagboosting for tumor classification with gene expression data," Bioinformatics, 20, 3583–3593.
Dudoit, S., J. Fridlyand and T. P. Speed (2002): "Comparison of discrimination methods for the classification of tumors using gene expression data," J. Am. Stat. Assoc., 97, 77–87.
Efron, B. (1983): "Estimating the error rate of a prediction rule: improvement on cross-validation," J. Am. Stat. Assoc., 78, 316–331.
Efron, B. and R. Tibshirani (1997): "Improvements on cross-validation: the .632+ bootstrap method," J. Am. Stat. Assoc., 92, 548–560.
Faraggi, D. and B. Reiser (2002): "Estimation of the area under the ROC curve," Stat. Med., 21(20), 3093–3106.
Fernández, M. A., C. Rueda and B. Salvador (2006): "Incorporating additional information to normal linear discriminant rules," J. Am. Stat. Assoc., 101, 569–577.


Fu, W. J., R. J. Carroll and S. Wang (2005): "Estimating misclassification error with small samples via bootstrap cross-validation," Bioinformatics, 21, 1979–1986.
Graf, A. C. and P. Bauer (2009): "Model selection based on FDR-thresholding optimizing the area under the ROC-curve," Stat. Appl. Genet. Mol. Biol., 8(1), 1–20.
Kim, J. H. (2009): "Estimating classification error rate: repeated cross-validation, repeated hold-out and bootstrap," Comput. Stat. Data An., 53(11), 3735–3745.
Kim, J. and E. Cha (2006): "Estimating prediction errors in binary classification problem: cross-validation versus bootstrap," Korean Commun. Stat., 13, 151–165.
Kim, K. I. and R. Simon (2011): "Probabilistic classifiers with high-dimensional data," Biostatistics, 12(3), 399–412.
Lachenbruch, P. and M. Mickey (1968): "Estimation of error rates in discriminant analysis," Technometrics, 10, 167–178.
Lee, J. W., J. B. Lee, M. Park and S. H. Song (2005): "An extensive comparison of recent classification tools applied to microarray data," Comput. Stat. Data An., 48(4), 869–885.
Lehmann, E. L. and G. Casella (1998): Theory of Point Estimation, 2nd edition, New York: Springer-Verlag.
Lin, D., Z. Shkedy, D. Yekutieli, T. Burzykowski, H. W. H. Göhlmann, A. De Bondt, T. Perera, T. Geerts and L. Bijnens (2007): "Testing for trends in dose-response microarray experiments: a comparison of several testing procedures, multiplicity and resampling-based inference," Stat. Appl. Genet. Mol. Biol., 6(1), article 26.
Long, T. and R. D. Gupta (1998): "Alternative linear classification rules under order restrictions," Commun. Stat. A-Theor., 27, 559–575.
McLachlan, G. J. (1976): "The bias of the apparent error rate in discriminant analysis," Biometrika, 63, 239–244.
Molinaro, A. M., R. Simon and R. M. Pfeiffer (2005): "Prediction error estimation: a comparison of resampling methods," Bioinformatics, 21(15), 3301–3307.
Oh, M. S. and D. W. Shin (2011): "A unified Bayesian inference on treatment means with order constraints," Comput. Stat. Data An., 55(1), 924–934.
Pepe, M. S., H. Janes, G. Longton, W. Leisenring and P. Newcomb (2004): "Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker," Am. J. Epidemiol., 159, 882–890.
Pepe, M. S., T. Cai and G. Longton (2006): "Combining predictors for classification using the area under the receiver operating characteristic curve," Biometrics, 62(1), 221–229.
Robertson, T., F. T. Wright and R. L. Dykstra (1988): Order Restricted Statistical Inference, New York: Wiley.
Salvador, B., M. A. Fernández, I. Martín and C. Rueda (2008): "Robustness of classification rules that incorporate additional information," Comput. Stat. Data An., 52(5), 2489–2495.
Schiavo, R. A. and D. J. Hand (2000): "Ten more years of error rate research," Int. Stat. Rev., 68, 295–310.
Silvapulle, M. J. and P. K. Sen (2005): Constrained Statistical Inference, New Jersey: John Wiley & Sons.
Simmons, S. and S. D. Peddada (2007): "Order-restricted inference for ordered gene expression (ORIOGEN) data under heteroscedastic variances," Bioinformation, 1, 414–419.
Steele, B. M. and D. A. Patterson (2000): "Ideal bootstrap estimation of expected prediction error for k-nearest neighbor classifiers: applications for classification and error assessment," Stat. Comput., 10(4), 349–355.
Tong, T., L. Chen and H. Zhao (2012): "Improved mean estimation and its application to diagonal discriminant analysis," Bioinformatics, 28(4), 531–537.
UICC (2009): TNM Classification of Malignant Tumours, 7th edition, New Jersey: Wiley-Blackwell.
Wehberg, S. and M. Schumacher (2004): "A comparison of nonparametric error rate estimation methods in classification problems," Biometrical J., 46, 35–47.

