Finding Influential Training Samples for Gradient Boosted Decision Trees
| Authors | |
|---|---|
| Publication date | 2018 |
| Journal | Proceedings of Machine Learning Research |
| Event | 35th International Conference on Machine Learning, ICML 2018 |
| Volume | Issue number | 80 |
| Pages (from-to) | 4577-4585 |
| Organisations | |
| Abstract | We address the problem of finding influential training samples for a particular case of tree ensemble-based models, e.g., Random Forest (RF) or Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this problem is studying how the model's predictions change upon leave-one-out retraining, leaving out each individual training sample. Recent work has shown that, for parametric models, this analysis can be conducted in a computationally efficient way. We propose several ways of extending this framework to non-parametric GBDT ensembles under the assumption that tree structures remain fixed. Furthermore, we introduce a general scheme of obtaining further approximations to our method that balance the trade-off between performance and computational complexity. We evaluate our approaches on various experimental setups and use-case scenarios and demonstrate both the quality of our approach to finding influential training samples in comparison to the baselines and its computational efficiency. |
| Document type | Article |
| Note | International Conference on Machine Learning, 10-15 July 2018, Stockholmsmässan, Stockholm, Sweden. - With supplementary file. - In print proceedings pp. 7287-7296. |
| Language | English |
| Published at | http://proceedings.mlr.press/v80/sharchilev18a.html |
| Other links | http://www.proceedings.com/40527.html https://www.scopus.com/pages/publications/85057338736 |
| Downloads | sharchilev18a (Final published version) |
| Supplementary materials | |
| Permalink to this page | |
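
The abstract formalizes influence as the change in a model's prediction under leave-one-out retraining, and assumes that tree structures remain fixed. The toy sketch below illustrates only that idea. It is not the paper's method: it assumes a single regression tree with one hand-picked split on a one-dimensional feature and squared loss, so each leaf value is the mean target of its training samples. The function names, the data, and the threshold are all hypothetical.

```python
from statistics import mean


def leaf_of(x, threshold):
    """Return the leaf index (0 or 1) for a tree with one fixed split."""
    return 0 if x <= threshold else 1


def predict(xs, ys, threshold, x_query):
    """Predict with the fixed-structure tree: mean target in the query's leaf.

    Assumes the query's leaf contains at least one training sample.
    """
    leaf = leaf_of(x_query, threshold)
    return mean(y for x, y in zip(xs, ys) if leaf_of(x, threshold) == leaf)


def loo_influence(xs, ys, threshold, x_query):
    """Brute-force leave-one-out influence of each training sample.

    The split threshold is kept fixed; only the leaf values are recomputed
    after removing one sample at a time.
    """
    base = predict(xs, ys, threshold, x_query)
    influences = []
    for i in range(len(xs)):
        xs_without_i = xs[:i] + xs[i + 1:]
        ys_without_i = ys[:i] + ys[i + 1:]
        influences.append(predict(xs_without_i, ys_without_i, threshold, x_query) - base)
    return influences


# Hypothetical toy data: three samples fall left of the split, two fall right.
xs = [0.1, 0.2, 0.3, 0.8, 0.9]
ys = [1.0, 2.0, 6.0, 10.0, 12.0]
influences = loo_influence(xs, ys, threshold=0.5, x_query=0.25)
most_influential = max(range(len(xs)), key=lambda i: abs(influences[i]))
```

In this toy setting, samples in the other leaf have zero influence on the query, because the fixed structure keeps them from affecting the query's leaf value. Brute-force retraining like this costs one refit per training sample, which is the cost that motivates the efficient approximations the abstract refers to.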
