GSoC 2015 Proposal: Cross validation and Meta estimators for Semi supervised Learning

Boyuan Deng edited this page Mar 25, 2015 · 7 revisions

Student Information

University Information

  • University: Saarland University (Universität des Saarlandes)
  • Major: Erasmus Mundus LCT (mainly natural language processing; I am also doing machine learning and information retrieval at the Max Planck Institute for Informatics. After GSoC I’ll move to another university, as arranged by the program.)
  • Current Year and Expected Graduation Date: Year 1, expected to graduate in the latter half of 2016.
  • Degree: MSc

Other Related Backgrounds

KDD Cup 2013 - Author-Paper Identification Challenge (Track 1) 6th/554

Project Proposal

Abstract

Although scikit-learn is the de facto statistical machine learning library for Python, its semi-supervised learning capabilities are still not fully established.

The goal of this project is to provide new algorithm implementations for the sklearn.semi_supervised subpackage, improve the existing ones, and make the subpackage interact smoothly and correctly with other components. In particular, we want to support cross validation for semi-supervised learning.

Details

Cross Validation for Semi-supervised Learning

Currently the sklearn.cross_validation module is unaware of unlabeled data. When splitting a dataset, it blindly puts unlabeled samples into the test set, which is meaningless and also confuses the scoring function.

We have to modify the current cross validation infrastructure so that it works correctly for semi-supervised algorithms (including newly added ones), while trying to maintain backward compatibility (existing cross validation code for supervised learning should run without modification).
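A minimal sketch of the splitting behavior described above, assuming unlabeled samples are marked with -1 (the convention used by the existing label propagation estimators). The function name `semi_supervised_kfold` and its parameters are hypothetical and only illustrate the idea; they are not a proposed API.

```python
import numpy as np


def semi_supervised_kfold(y, n_splits=3, unlabeled=-1, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation.

    Hypothetical helper: only labeled samples are split into folds,
    and every unlabeled sample is placed in each training set.
    """
    y = np.asarray(y)
    labeled_idx = np.flatnonzero(y != unlabeled)
    unlabeled_idx = np.flatnonzero(y == unlabeled)

    # Shuffle the labeled indices, then cut them into n_splits test folds.
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(labeled_idx)

    for test_idx in np.array_split(shuffled, n_splits):
        # Training set = remaining labeled samples + all unlabeled samples.
        train_labeled = np.setdiff1d(labeled_idx, test_idx)
        train_idx = np.concatenate([train_labeled, unlabeled_idx])
        yield np.sort(train_idx), np.sort(test_idx)


# Toy labels: six labeled samples and three unlabeled ones (-1).
y = np.array([0, 1, -1, 0, -1, 1, 0, -1, 1])
folds = list(semi_supervised_kfold(y, n_splits=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

With this kind of split, the scoring function only ever sees samples that have a true label, while the estimator can still use the unlabeled samples during fitting.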

Because we are going to modify the cross validation API, this step should be done before implementing new algorithms.

If anyone takes on the “Multiple Metric Support for Cross Validation and Gridsearches” project, the API design needs to be fully discussed with that participant and our mentors.

Improve Existing Implementations

So far, the sklearn.semi_supervised subpackage only provides graph-based label propagation algorithms.

There is still room for improvement in the existing implementations. For example, the documentation can be improved, and the closed-form solution presented in [2] could replace the current iterative framework.
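As an illustration of the closed-form solution in [2], the sketch below computes F* = (I - alpha * S)^(-1) Y, where S = D^(-1/2) W D^(-1/2) is the symmetrically normalized affinity matrix and Y is the one-hot matrix of initial labels. It uses plain NumPy rather than the scikit-learn code base; the function names, the RBF affinity, and the parameter defaults are assumptions made for the example, not the planned implementation.

```python
import numpy as np


def normalized_affinity(X, gamma=1.0):
    """Return S = D^(-1/2) W D^(-1/2) for an RBF affinity W with zero diagonal."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-gamma * sq_dists)
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    return W / np.sqrt(np.outer(d, d))


def label_spreading_closed_form(X, y, alpha=0.9, gamma=1.0, unlabeled=-1):
    """Hypothetical helper: solve (I - alpha * S) F = Y instead of iterating."""
    y = np.asarray(y)
    classes = np.unique(y[y != unlabeled])
    S = normalized_affinity(X, gamma)
    # One-hot initial labels; rows of unlabeled samples stay all zero.
    Y = (y[:, None] == classes[None, :]).astype(float)
    F = np.linalg.solve(np.eye(len(y)) - alpha * S, Y)
    return classes[np.argmax(F, axis=1)], F


# Two well-separated clusters with one labeled point each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
y = np.array([0, -1, -1, 1, -1, -1])
predicted, F = label_spreading_closed_form(X, y)
print(predicted)
```

The iteration F(t+1) = alpha * S * F(t) + (1 - alpha) * Y from [2] converges to (1 - alpha) times this F, so both give the same predicted labels. The closed form trades the iteration loop for a single linear solve, which is cubic in the number of samples for a dense system and therefore mainly attractive for small or sparse graphs.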

New Algorithm Implementations for Semi-supervised Learning

We plan to implement more algorithms for sklearn.semi_supervised.

One example is the “self-taught learning” algorithm described in [1]. It is broadly a semi-supervised algorithm because it uses both labeled and unlabeled data, though the authors emphasize that the labeled and unlabeled data need not share the same distribution. This makes the algorithm useful when only random unlabeled data can be obtained, a situation in which label propagation won't work.

Timeline

  • Week 1 (May 25 - May 31): API design (may start early, during the community bonding period, together with other participants).
  • Week 2, 3 (Jun 1 - Jun 14): Implement the new API for cross validation.
  • Week 4, 5 (Jun 15 - Jun 28): Continue implementation, write tests and update documentation. The new API should be mergeable at the end of this period.
  • Week 6, 7 (Jun 29 - Jul 12): Improve existing graph-based algorithms.
  • Week 8, 9 (Jul 13 - Jul 26): Implement new semi-supervised algorithms and write corresponding documentation.
  • Week 10, 11 (Jul 27 - Aug 9): Continue implementation and write tests.
  • Week 12 (Aug 10 - Aug 16): Improve documentation.

Links to patches

References

  1. Raina, Rajat, et al. "Self-taught learning: transfer learning from unlabeled data." Proceedings of the 24th international conference on Machine learning. ACM, 2007.
  2. Zhou, Dengyong, et al. "Learning with local and global consistency." Advances in neural information processing systems 16.16 (2004): 321-328.
  3. https://github.com/scikit-learn/scikit-learn/issues/1243
  4. https://github.com/scikit-learn/scikit-learn/issues/2593
  5. https://github.com/scikit-learn/scikit-learn/issues/4449

Other Schedule Information

GSoC overlaps with the teaching period (summer semester) in Germany, but my course workload will be light thanks to the extra credits I earned earlier this year. I’m also glad to work on weekends for GSoC. I have one or more exams in August (or possibly in the last week of July).

On June 8-9, I’ll attend a meeting in Groningen, the Netherlands.