Swap Training and Test Data During Cross-Validation in scikit-learn

Scikit-learn is a well-known Python machine learning library. It provides various utilities for machine learning, including utilities for cross-validation. In standard $$K$$-fold cross-validation, the data are split into $$K$$ subsets of (approximately) equal size, and there are $$K$$ rounds of training and testing. In each round, one subset is used as test data and all the other subsets are used as training data. Under this setup, as long as $$K > 2$$, there are always more training data than test data in each round. Whilst this is desirable in most cases, some machine learning applications call for less training data than test data. For example, in graph embedding, each node in the network has a vector representation and labels. When running cross-validation, it is preferable to train on fewer nodes than are used for testing, since this better mimics the amount of training data available in real-world scenarios (e.g., here). In scikit-learn, we can achieve this by swapping training and test data.
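One way the swap could look is sketched below. This is only an illustration, not necessarily the approach taken in the full post: the helper name swapped_kfold and its parameters are made up for this example, and it relies only on KFold from sklearn.model_selection, whose split method yields (training indices, test indices) pairs.

```python
import numpy as np
from sklearn.model_selection import KFold


def swapped_kfold(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs with the usual K-fold roles swapped.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # KFold.split yields (K-1 folds, 1 held-out fold). Swapping the pair makes
    # the single fold the training set and the other K-1 folds the test set.
    placeholder = np.zeros((n_samples, 1))  # only the number of rows matters
    for large_idx, small_idx in kf.split(placeholder):
        yield small_idx, large_idx


for train_idx, test_idx in swapped_kfold(n_samples=10, n_splits=5):
    # With 10 samples and K = 5, each round trains on 2 samples and tests on 8.
    print(len(train_idx), len(test_idx))
```

Each sample still appears in the training set exactly once across the $$K$$ rounds, so every node is used for training in one round and for testing in the other $$K - 1$$.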

Enable Auto Completion for pip in Zsh

Pip is a package management system for installing and managing Python software packages. To enable auto completion for pip in zsh, the documentation of pip suggests adding the following line to ~/.zshrc:

eval "`pip completion --zsh`"

However, merely having this line would not enable auto completion for pip3. To enable auto completion for pip3 as well, add the following line after the line above:

compctl -K _pip_completion pip3

Too Many Escaping Backslashes? Avoid Them!

Backslash escaping is common in programming. Sometimes a file has to pass through several filters or template engines, such as markdown, quik, etc., and things get even worse if the template files are themselves written out from a string literal, in which every literal backslash must be escaped again. On Windows, things are more horrible than on Unices (You know why, right? Hint: path separator). As a result, if you need one “real” backslash in the final output, you may end up with four or eight or sixteen backslashes in the original file. This is horrible. To avoid this situation, I wrote a short preprocessing script backbackslash.py in Python to double or quadruple or octuple or zzzuple your backslashes.
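The core idea can be sketched in a few lines. This is not the actual backbackslash.py; the function name multiply_backslashes and its levels parameter are assumptions made for this illustration, where each level stands for one filter that will later collapse a pair of backslashes into one.

```python
def multiply_backslashes(text, levels=1):
    """Replace every backslash in text with 2**levels backslashes.

    Hypothetical sketch: levels=1 doubles, levels=2 quadruples, and so on.
    """
    if levels < 0:
        raise ValueError("levels must be non-negative")
    return text.replace("\\", "\\" * (2 ** levels))


# A Windows-style path that must survive two rounds of unescaping.
print(multiply_backslashes("C:\\temp", levels=2))
```

Working on the whole text with a single str.replace call keeps the sketch short; a real preprocessor might restrict the replacement to marked regions of the file.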

Load A Matrix from An ASCII Format File (C++ and Python)

It is common for a scientific program to load an ASCII format matrix file, i.e. an ASCII text file consisting of lines of floating-point numbers separated by whitespace. In this post, I am going to show my code (C++ and Python) to load a matrix from such a file.
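For the Python side, a minimal sketch using only the standard library might look like the following. It is not the code from the full post; the names parse_matrix and load_matrix are chosen for this example, and the choice to skip blank lines and reject ragged rows is an assumption about how such a file should be handled.

```python
def parse_matrix(text):
    """Parse whitespace-separated floats, one matrix row per line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:
            continue  # assumption: blank lines are ignored
        rows.append([float(field) for field in fields])
    # Assumption: a valid matrix has the same number of columns in every row.
    if rows and any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("rows have different numbers of columns")
    return rows


def load_matrix(path):
    """Read an ASCII matrix file and return it as a list of rows."""
    with open(path, "r", encoding="ascii") as handle:
        return parse_matrix(handle.read())
```

For example, parse_matrix("1 2\n3 4\n") returns a list of two rows, each holding two floats.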

Use Travis CI with Jython

This post was updated on Feb 11, 2013, since the old way no longer works.

Travis CI is a hosted continuous integration service for the open source community that runs the tests of your GitHub projects on every push and pull request. However, at the time of writing, Travis CI does not officially support Jython, a Python interpreter written in Java. This post will help you set up a Jython testing environment for a Python project on Travis CI.