2020-11-27 Parallelization of Random Forest predictions
I’ve been struggling to understand why the first implementation of TreeEnsemble could not match the speed of the scikit-learn implementation of a RandomForest once the number of observations reached 100,000 or more, with 100 trees and a depth >= 10. The only difference was that the computation was parallelized by trees and not by observations. These measurements are benchmarked in Benchmark Random Forests, Tree Ensemble, (AoS and SoA) (Benchmark Random Forests, Tree Ensemble, Multi-Classification for the multiclass version).
Parallelizing by tree requires saving the output of every tree for every observation. That means the computation needs an additional buffer (at least one per thread) to store the tree outputs. However, it is approximately two to three times faster to do it that way than to parallelize per observation, even though the computational cost is the same in both cases. The only explanation would be a better use of the caches (L1, L2, L3) when the computation is parallelized per tree. The answer is probably hidden in that book.
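To make the two strategies concrete, here is a minimal C++/OpenMP sketch, not the actual TreeEnsemble code; the Node layout and the names predict_one, predict_par_obs and predict_par_tree are hypothetical, chosen for illustration only. Per-observation parallelism needs no extra memory, while per-tree parallelism allocates one accumulation buffer per thread and merges them at the end.

```cpp
// Sketch only: illustrative flat-tree layout, not the TreeEnsemble source.
// Compile with: g++ -O2 -fopenmp sketch.cpp
#include <omp.h>
#include <cstddef>
#include <vector>

struct Node {
    int feature;      // feature tested at this node, -1 marks a leaf
    float threshold;  // split threshold
    int left, right;  // child node indices
    float value;      // prediction stored at a leaf
};
using Tree = std::vector<Node>;   // a tree is an array of nodes

static float predict_one(const Tree& t, const float* x) {
    int n = 0;
    while (t[n].feature >= 0)     // descend until a leaf is reached
        n = (x[t[n].feature] <= t[n].threshold) ? t[n].left : t[n].right;
    return t[n].value;
}

// Strategy 1: parallelize over observations; every thread owns a slice
// of the output, so no extra buffer is needed.
void predict_par_obs(const std::vector<Tree>& trees, const float* X,
                     int n_obs, int n_feat, float* out) {
    #pragma omp parallel for
    for (int i = 0; i < n_obs; ++i) {
        float s = 0.f;
        for (const Tree& t : trees)
            s += predict_one(t, X + (size_t)i * n_feat);
        out[i] = s / (float)trees.size();
    }
}

// Strategy 2: parallelize over trees; every thread walks one tree over
// all observations, accumulating into a thread-local buffer which is
// reduced at the end.
void predict_par_tree(const std::vector<Tree>& trees, const float* X,
                      int n_obs, int n_feat, float* out) {
    const int n_threads = omp_get_max_threads();
    std::vector<float> buf((size_t)n_threads * n_obs, 0.f);  // one buffer per thread
    #pragma omp parallel for
    for (int k = 0; k < (int)trees.size(); ++k) {
        float* local = buf.data() + (size_t)omp_get_thread_num() * n_obs;
        for (int i = 0; i < n_obs; ++i)
            local[i] += predict_one(trees[k], X + (size_t)i * n_feat);
    }
    for (int i = 0; i < n_obs; ++i) {     // merge the per-thread buffers
        float s = 0.f;
        for (int th = 0; th < n_threads; ++th)
            s += buf[(size_t)th * n_obs + i];
        out[i] = s / (float)trees.size();
    }
}
```

The per-tree loop keeps a single tree's nodes hot in cache while it scans all observations, which is consistent with the cache explanation above: the working set per thread is one tree rather than the whole forest.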
The next investigation should be a study of the difference between a tree described as an array of nodes (AoS) and as a structure of arrays (SoA), where every node field gets its own array; a sketch of the two layouts follows.
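Under the same assumptions as the sketch above (the field names are illustrative, not the TreeEnsemble source), the two layouts look like this:

```cpp
#include <vector>

// Array of structures (AoS): one struct per node, all fields
// interleaved in memory; reading one field pulls the others into
// the same cache line whether they are needed or not.
struct NodeAoS {
    int feature;      // feature tested at this node, -1 for a leaf
    float threshold;  // split threshold
    int left, right;  // child node indices
    float value;      // leaf prediction
};
using TreeAoS = std::vector<NodeAoS>;

// Structure of arrays (SoA): every node field gets its own contiguous
// array; walking the tree touches only the fields it actually reads,
// so each cache line carries more useful data.
struct TreeSoA {
    std::vector<int> feature;
    std::vector<float> threshold;
    std::vector<int> left, right;
    std::vector<float> value;
};
```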
Other readings: