Wright-et-al_2019_Splitting on categorical predictors in random forests.pdf 1.36 MB
1000 Title
  • Splitting on categorical predictors in random forests
1000 Author
  1. Wright, Marvin N.
  2. König, Inke R.
1000 Year of publication 2019
1000 LeibnizOpen
1000 File type
1000 Publication type
  1. Article
1000 Published online
  • 2019-02-07
1000 Published in
1000 Source citation
  • 7: e6339
1000 FRL collection
1000 Copyright year
  • 2019
1000 License
1000 Publisher's version
1000 Supplementary material
1000 Publication status
1000 Peer review status
1000 Language of publication
1000 Abstract/Summary
  • One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical methods and machine learning approaches, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors is to consider all $2^{k-1} - 1$ 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k − 1 splits have to be considered for a nominal predictor with k categories. For multiclass classification and survival prediction, no ordering method producing equivalent splits exists. We therefore propose to use a heuristic which orders the categories according to the first principal component of the weighted covariance matrix in multiclass classification and by log-rank scores in survival prediction. This ordering of categories can be done either in every split or a priori, that is, just once before growing the forest. With this approach, the nominal predictor can be treated as ordinal in the entire RF procedure, speeding up the computation and avoiding category limits. We compare the proposed methods with the standard approach, dummy coding, and simply ignoring the nominal nature of the predictors in several simulation settings and on real data in terms of prediction performance and computational efficiency. We show that ordering the categories a priori is at least as good as the standard approach of considering all 2-partitions in all datasets considered, while being computationally faster. We recommend using this approach as the default in RFs.
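The regression case mentioned in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the toy data, the function names, and the use of the within-node sum of squared errors as split criterion are assumptions made here for illustration. The sketch enumerates all 2-partitions of k = 5 categories, then evaluates only the k − 1 splits obtained after ordering the categories by their mean response, and compares the best cost found by each search.

```python
from itertools import combinations
from statistics import mean

# Hypothetical toy data: (category, numeric response) pairs with k = 5 categories.
DATA = [("a", 1.0), ("a", 1.4), ("b", 5.0), ("b", 5.5), ("c", 2.0),
        ("c", 2.2), ("d", 9.0), ("d", 8.0), ("e", 4.9), ("e", 5.2)]

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def split_cost(data, left_cats):
    """Total within-child SSE when categories in left_cats go to the left child."""
    left = [y for c, y in data if c in left_cats]
    right = [y for c, y in data if c not in left_cats]
    return sse(left) + sse(right)

def exhaustive_candidates(cats):
    """All 2-partitions, each represented by the side containing cats[0].

    Fixing the first category on the left avoids counting mirror images;
    stopping at size k - 1 excludes the split with an empty right child.
    """
    first, rest = cats[0], cats[1:]
    return [frozenset((first,) + combo)
            for r in range(len(rest))
            for combo in combinations(rest, r)]

def ordered_candidates(data, cats):
    """The k - 1 splits between neighbours after sorting by mean response."""
    by_mean = sorted(cats, key=lambda c: mean(y for cc, y in data if cc == c))
    return [frozenset(by_mean[:i]) for i in range(1, len(by_mean))]

def best_cost(data, candidates):
    """Lowest split cost among the candidate left-child category sets."""
    return min(split_cost(data, left) for left in candidates)

cats = sorted({c for c, _ in DATA})
exhaustive = exhaustive_candidates(cats)
ordered = ordered_candidates(DATA, cats)
cost_exhaustive = best_cost(DATA, exhaustive)
cost_ordered = best_cost(DATA, ordered)

print(len(exhaustive), len(ordered))
print(abs(cost_exhaustive - cost_ordered) < 1e-9)
```

For this toy example the exhaustive search evaluates 15 candidate splits and the ordered search only 4, while both reach the same minimum cost. The multiclass and survival heuristics proposed in the abstract are not covered by this sketch.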
1000 Subject indexing
local Survival analysis
local Classification
local Random forest
local Categorical predictors
1000 Subject group
  1. Medicine
  2. Health care
1000 Subject classification (DDC)
1000 List of contributors
1000 Label
1000 Funder
  1. National Institutes of Health
  2. National Arthritis Foundation
1000 Grant number
  1. NO1-AR-2-2263; RO1-AR-44422
  2. -
1000 Funding program
  1. -
  2. -
1000 Files
  1. Splitting on categorical predictors in random forests
1000 Object type article
1000 Described by
1000 @id frl:6413210.rdf
1000 Created on 2019-03-04T14:55:33.393+0100
1000 Created by 266
1000 describes frl:6413210
1000 Edited by 122
1000 Last edited Thu Jan 30 19:39:20 CET 2020
1000 Object edited Wed Mar 06 13:38:52 CET 2019
1000 Cf. frl:6413210
1000 OAI ID
1000 Metadata visibility public
1000 Data visibility public
1000 Subject of
