
Swap Training and Test Data During Cross-Validation in scikit-learn

Last updated on October 11, 2018

Scikit-learn is a well-known Python machine learning library. It provides various utilities for machine learning, including utilities for cross-validation. In standard \(K\)-fold cross-validation, the data are split into \(K\) subsets of (approximately) equal size, and there are \(K\) rounds of training and testing. In each round, one subset is used as test data and all the other subsets are used as training data. Under this setup, as long as \(K > 2\), every round has more training data than test data. Whilst this is desirable in most cases, some machine learning applications call for less training data than test data. For example, in graph embedding, each node in the network has a vector representation and labels. When running cross-validation, it is preferable to train on fewer nodes than are used for testing, since this better mimics the real-world scenario in terms of the amount of available training data (e.g., here). In scikit-learn, we can achieve this by swapping the training and test data in each round, so that each round trains on one subset and tests on the other \(K - 1\).
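As a minimal sketch of the idea (not necessarily the approach taken in the full post), one way to do the swap is to wrap `KFold.split` and yield each pair of index arrays in reverse order. The helper name `swapped_kfold_splits`, the synthetic dataset, and the choice of `LogisticRegression` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score


def swapped_kfold_splits(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs with the usual K-fold roles swapped.

    Hypothetical helper: the single held-out fold becomes the training set
    and the remaining K - 1 folds become the test set.
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    placeholder = np.zeros((n_samples, 1))  # KFold only needs the sample count
    for large_idx, small_idx in kf.split(placeholder):
        # KFold yields (train, test); return them in the opposite order.
        yield small_idx, large_idx


# Synthetic data, used here only to make the sketch runnable.
X, y = make_classification(n_samples=100, random_state=0)

splits = list(swapped_kfold_splits(len(X), n_splits=5))
for train_idx, test_idx in splits:
    print(len(train_idx), "training samples,", len(test_idx), "test samples")

# The cv argument of cross_val_score accepts an iterable of
# (train, test) index arrays, so the swapped splits can be passed directly.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=splits)
```

With 100 samples and \(K = 5\), each round in this sketch trains on 20 samples and tests on the remaining 80.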
