GSoC 2015 Proposal: Cross validation and Meta estimators for Semi supervised Learning
- Name: Boyuan Deng
- Email: bryanhsudeng@gmail.com
- Telephone: 008618906286868
- Time zone: Central European Time (UTC+1)
- IRC handle including network: bryandeng@irc.freenode.net
- Source control username: bryandeng on Github
- Twitter: @boyuandeng
- Blog: http://boyuandeng-gsoc2015.blogspot.de
- GSoC Blog RSS feed: http://boyuandeng-gsoc2015.blogspot.com/feeds/posts/default
- University: Saarland University (Universität des Saarlandes)
- Major: Erasmus Mundus LCT (mainly natural language processing; I am also doing machine learning and information retrieval at the Max-Planck Institute for Informatics. After GSoC I’ll move to another university, as arranged by the program.)
- Current Year and Expected Graduation date: Year 1, expected to graduate in the latter half of 2016.
- Degree: MSc
KDD Cup 2013 - Author-Paper Identification Challenge (Track 1) 6th/554
Although scikit-learn is the de facto statistical machine learning library for Python, its capabilities for semi-supervised learning are still not fully established.
The goal of this project is to provide new algorithm implementations for the sklearn.semi_supervised subpackage, improve existing ones and enable it to interact smoothly and correctly with other components. We particularly want to support cross validation for semi-supervised learning.
Currently the sklearn.cross_validation module is unaware of unlabeled data. When splitting the dataset, it blindly puts unlabeled data into the testing set, which is meaningless because there is no ground truth to score against, and it also confuses the scoring function.
We have to modify the current cross validation infrastructure so that it works correctly for semi-supervised algorithms (including newly added ones) while maintaining backward compatibility: existing cross validation code for supervised learning should run without modification.
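The intended splitting behaviour can be sketched in a few lines. This is only an illustration, not a proposed API: the helper name `semi_supervised_kfold` and its `n_folds` parameter are hypothetical, and the only convention borrowed from scikit-learn is that unlabeled samples carry the label -1, as in sklearn.semi_supervised. Folds are built from labeled samples only, and every unlabeled sample is added to each training set.

```python
UNLABELED = -1  # marker that sklearn.semi_supervised uses for unlabeled samples


def semi_supervised_kfold(y, n_folds=3):
    """Yield (train_indices, test_indices) pairs for semi-supervised data.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    Test folds contain labeled samples only, and unlabeled samples are
    always placed in the training set.
    """
    labeled = [i for i, label in enumerate(y) if label != UNLABELED]
    unlabeled = [i for i, label in enumerate(y) if label == UNLABELED]
    if n_folds < 2 or n_folds > len(labeled):
        raise ValueError("n_folds must be between 2 and the number of labeled samples")

    for k in range(n_folds):
        # Interleaved, unshuffled assignment of labeled samples to folds.
        test = labeled[k::n_folds]
        held_out = set(test)
        train = [i for i in labeled if i not in held_out] + unlabeled
        yield sorted(train), test


if __name__ == "__main__":
    y = [0, -1, 1, -1, 0, 1, -1, 1]
    for train, test in semi_supervised_kfold(y, n_folds=2):
        print("train:", train, "test:", test)
```

When `y` contains no -1 entries the sketch reduces to an ordinary k-fold split over all samples, which is the backward-compatible behaviour we are aiming for. Shuffling and stratification are left out to keep the example short.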
Because we are going to modify the cross validation API, this step should be completed before implementing new algorithms.
If anyone takes on the “Multiple Metric Support for Cross Validation and Gridsearches” project, the API design needs to be discussed thoroughly with that participant and our mentors.
So far the sklearn.semi_supervised subpackage only provides graph-based label propagation algorithms.
There is still room for improvement in the existing implementations, for example better documentation and the closed-form solution presented in [2] as a replacement for the current iterative framework.
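For reference, the relation between the iterative scheme and the closed form in [2] can be written as follows. The notation follows [2] rather than any scikit-learn parameter names: W is the affinity matrix with zero diagonal, D the diagonal degree matrix of W, Y the initial label matrix with one row per sample and one column per class, and α a clamping factor in (0, 1).

```latex
\begin{aligned}
S &= D^{-1/2} W D^{-1/2}, \\
F(t+1) &= \alpha S F(t) + (1 - \alpha) Y, \qquad F(0) = Y, \\
F^{*} &= \lim_{t \to \infty} F(t) = (1 - \alpha)\,(I - \alpha S)^{-1} Y .
\end{aligned}
```

Each sample is assigned the class with the largest entry in its row of F*, so the constant factor (1 - α) does not affect the result and it is enough to solve the linear system (I - αS) F = Y. A dense solve costs on the order of n³ operations for n samples, so the iterative version may remain preferable for large or sparse graphs.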
We plan to implement more algorithms for sklearn.semi_supervised.
One example is the “self-taught learning” algorithm specified in [1]. It is broadly a semi-supervised algorithm because it uses both labeled and unlabeled data, though the authors emphasize that the labeled and unlabeled data do not necessarily share the same distribution. This makes the algorithm useful when only random unlabeled data can be obtained, a situation in which label propagation won't work.
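A minimal sketch of the two-stage idea in [1], built from estimators that already exist in scikit-learn, might look like this. It is an assumption-laden illustration rather than the planned implementation: the function names `fit_self_taught` and `predict_self_taught` are hypothetical, `DictionaryLearning` stands in for the sparse coding step, and `LogisticRegression` is an arbitrary choice of supervised classifier.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression


def fit_self_taught(X_unlabeled, X_labeled, y_labeled,
                    n_components=8, alpha=0.5, random_state=0):
    """Learn a sparse basis from unlabeled data, then train on the codes.

    Hypothetical helper for illustration; not an existing scikit-learn API.
    """
    # Stage 1: learn basis vectors from the unlabeled data only.
    coder = DictionaryLearning(
        n_components=n_components,
        alpha=alpha,
        max_iter=20,
        transform_algorithm="lasso_lars",
        transform_alpha=alpha,
        random_state=random_state,
    )
    coder.fit(X_unlabeled)

    # Stage 2: re-express the labeled data as sparse activations of that
    # basis and train an ordinary supervised classifier on them.
    codes = coder.transform(X_labeled)
    clf = LogisticRegression().fit(codes, y_labeled)
    return coder, clf


def predict_self_taught(coder, clf, X):
    """Encode new samples with the learned basis and classify the codes."""
    return clf.predict(coder.transform(X))


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X_unlabeled = rng.randn(100, 10)   # synthetic stand-in for unlabeled data
    X_labeled = rng.randn(30, 10)      # synthetic stand-in for labeled data
    y_labeled = np.array([0, 1] * 15)
    coder, clf = fit_self_taught(X_unlabeled, X_labeled, y_labeled)
    print(predict_self_taught(coder, clf, X_labeled[:5]))
```

The unlabeled data never needs labels or even the same class structure as the labeled data; it only shapes the basis. The data in the usage part is random noise, so the predictions carry no meaning beyond showing that the pieces fit together.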
- Week 1 (May 25 - May 31): API design (may start early, during the community bonding period, and together with other participants).
- Week 2, 3 (Jun 1 - Jun 14): Implement the new API for cross validation.
- Week 4, 5 (Jun 15 - Jun 28): Continue implementation, write tests and update documentation. The new API should be mergeable at the end of this period.
- Week 6, 7 (Jun 29 - Jul 12): Improve existing graph-based algorithms.
- Week 8, 9 (Jul 13 - Jul 26): Implement new semi-supervised algorithms and write corresponding documentation.
- Week 10, 11 (Jul 27 - Aug 9): Continue implementation and write tests.
- Week 12 (Aug 10 - Aug 16): Improve documentation.
- https://github.com/scikit-learn/scikit-learn/pull/4409
- https://github.com/scikit-learn/scikit-learn/pull/4439
- [1] Raina, Rajat, et al. "Self-taught learning: transfer learning from unlabeled data." Proceedings of the 24th international conference on Machine learning. ACM, 2007.
- [2] Zhou, Dengyong, et al. "Learning with local and global consistency." Advances in neural information processing systems 16.16 (2004): 321-328.
- https://github.com/scikit-learn/scikit-learn/issues/1243
- https://github.com/scikit-learn/scikit-learn/issues/2593
- https://github.com/scikit-learn/scikit-learn/issues/4449
GSoC overlaps with the teaching period (summer semester) in Germany, but my course workload will be light thanks to the extra credits I earned earlier this year. I’m also glad to work on weekends for GSoC. I will have one or more exams in August (or possibly in the last week of July).
On June 8-9, I’ll attend a meeting in Groningen, the Netherlands.