Master’s thesis

In fact the whole thing lasted a year, from November 2012 to November 2013. The stretch from 2012-11 to 2013-02 counted as a seminar, and I kept running experiments until 2013-05. The master's project proper only started after the meeting at ams in May, so it ran for six months and did not end until 2013-11.

The whole story starts in November 2012. My internship contract was about to expire, and the parent project of my internship had just ended, so rather than continue I went back to school to look for a master's project. I got lucky: I asked a former teacher of mine to supervise me and he agreed. He had a series of projects just getting started, so he gave me a direction and had a postdoc mentor me. The topic was pattern mining, loosely connected to the fashionable big data; in essence, it was parallelizing an algorithm.

From 2012-11 to 2013-02 I worked on that parallelization. I first wrote it in Python, but later rewrote it in Java. The algorithm needed high-precision arithmetic, so I used a Java library (Apfloat), which I later abandoned for performance reasons. That left the algorithm's correctness slightly off, a problem still unresolved when the thesis ended.

After February 2013 I was still tinkering with the algorithm and running experiments. Then in May I went to ams for a meeting with the company and picked up some data they said was already processed. Back home I learned I had been had: I still had to preprocess it myself. With no experience at all, I flailed around until August before things started to make sense. In July, reading papers and books on my own, I discovered that all of this had long been studied under the name web usage mining, and that LinkedIn even had a toolkit called DataFu. I could have wept.

By September I knew what the whole pipeline should look like, but because of the earlier precision problem I still did not dare be casual about feature selection. I ran some experiments and got very poor results. Nobody online had published results on this kind of problem either, so there was nothing to compare against. In October the thesis still could not be finished, since the main experiments had not even been done, so it was postponed to November, then to the end of November. Once the date was fixed, things moved quickly: carrying a pile of crap, I graduated.

That was the whole thesis. After a month of goofing off, I have nearly forgotten it all.

Topic: Distributed Context Discovery for Predictive Modeling

A Dutch portal website wanted to know whether a user would click on their index page.
Knowing this can help them adjust content and make personalized recommendations.
Unlike Google or Facebook, the portal knows very little about its users, since most of them are not registered on the site.
For example, Google knows what a user has visited, what a user has clicked, and the user's age… This portal website does not.
So my task was to predict whether a user would click, based on this limited information.


The raw data was not directly classifiable, and there was no documentation about its semantics.
The data was big and dynamic.
The features were weak.
The class distribution was skewed.
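A common way to cope with a skewed class distribution like this is to downsample the majority (non-click) class before training. A minimal sketch; the 1:1 target ratio is my assumption, not the setting used in the thesis:

```python
import random

def downsample(examples, labels, ratio=1.0, seed=42):
    """Keep all positives and a random subset of negatives so that
    #negatives is about ratio * #positives.  A standard trick for
    skewed click data; the default 1:1 ratio is an assumption."""
    rng = random.Random(seed)
    pos = [(x, y) for x, y in zip(examples, labels) if y == 1]
    neg = [(x, y) for x, y in zip(examples, labels) if y == 0]
    keep = min(len(neg), int(ratio * len(pos)))
    sampled = pos + rng.sample(neg, keep)
    rng.shuffle(sampled)
    return sampled

balanced = downsample(list(range(10)), [1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
```

With 2 positives and 8 negatives, the balanced set keeps both positives plus 2 sampled negatives. If the model then needs calibrated probabilities rather than just a ranking, the predicted scores have to be corrected for the sampling rate.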

Project / What I did

Preprocessed raw data for classification
Identified the semantics of tables and columns
Sessionized pageviews
Transformed raw sessions into classifiable data
Developed a distributed algorithm that can mine discriminative features for classification
Built classification models over data streams
Predicted incoming pageviews
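Sessionizing pageviews, as listed above, usually means splitting each user's time-ordered views wherever the gap between two consecutive views exceeds a timeout. A minimal sketch, assuming (user, timestamp) pairs and the common 30-minute threshold (the threshold is my assumption, not necessarily the thesis's setting):

```python
from itertools import groupby

def sessionize(pageviews, gap=30 * 60):
    """Split each user's time-ordered pageviews into sessions.

    pageviews: iterable of (user_id, unix_timestamp) pairs.
    gap: a new session starts when two consecutive views of the same
         user are more than `gap` seconds apart (30 minutes is a
         common web-usage-mining heuristic, assumed here).
    Returns a list of sessions, each a list of (user_id, timestamp).
    """
    sessions = []
    ordered = sorted(pageviews)  # by user, then by time
    for _, views in groupby(ordered, key=lambda pv: pv[0]):
        current = []
        last_t = None
        for user, t in views:
            if last_t is not None and t - last_t > gap:
                sessions.append(current)  # timeout: close the session
                current = []
            current.append((user, t))
            last_t = t
        sessions.append(current)  # close the user's final session
    return sessions

sessions = sessionize([("u1", 0), ("u1", 100), ("u1", 100 + 31 * 60), ("u2", 50)])
```

Here u1's third view arrives 31 minutes after the second, so it opens a new session, giving three sessions in total. In the actual pipeline this step ran over Hive/Pig rather than in-memory Python, but the grouping logic is the same.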


The pattern mining algorithm scales with the number of mappers and with sample size.
The click prediction achieved an average AUC of 0.675 over a period of 36 days.
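AUC, the metric quoted above, equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (the Mann-Whitney formulation, with ties counting half). A small self-contained sketch, not the thesis's actual evaluation code:

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic:
    the fraction of (positive, negative) pairs where the positive
    is scored higher, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

So an AUC of 0.675 means the model ranks a random clicker above a random non-clicker about two-thirds of the time; 0.5 would be random guessing. The quadratic pairwise loop is fine for a sketch; rank-based implementations run in O(n log n).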

Learnt technologies
  1. Hadoop
    • command lines for MapReduce and HDFS
    • Java APIs (0.20 and 0.22, CDH3 and CDH4)
    • streaming
  2. Mahout
    • command line – vectorization, naive Bayes, feature hashing, PFP (parallel FP-growth)
    • Java API – regression, model dissection
  3. Hive
    • HiveQL
    • UDFs and streaming with Python
  4. Pig
    • scripting
    • UDFs, with Python and DataFu
    • streaming with Python
  5. Python – processing data and plotting with Matplotlib
  6. Bash
  7. Apfloat – Java library for high-precision arithmetic
  8. R (introductory)
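Hadoop streaming, which appears under several of the entries above, runs arbitrary executables as mapper and reducer, wired together over stdin/stdout with tab-separated key-value lines. A minimal page-count sketch, simulated locally; the field layout is an assumption, not the thesis's actual schema:

```python
# Minimal Hadoop-streaming style mapper and reducer in Python,
# simulated locally.  In a real job each function would be its own
# script reading sys.stdin and printing tab-separated lines, passed
# to hadoop-streaming via -mapper and -reducer.
from itertools import groupby

def mapper(lines):
    """Emit (page, 1) for every log line 'page<TAB>rest...'
    (assumed layout: first field is the page URL)."""
    for line in lines:
        page = line.rstrip("\n").split("\t")[0]
        yield page, 1

def reducer(pairs):
    """Sum counts per page.  Hadoop delivers pairs sorted by key
    between the map and reduce phases; sorted() simulates that."""
    for page, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield page, sum(c for _, c in group)

# Local simulation of map -> shuffle/sort -> reduce:
counts = dict(reducer(mapper(["a\t1\n", "b\tx\n", "a\ty\n"])))
```

The appeal of streaming for this project is that the same Python preprocessing code can be tested locally with a shell pipe (`cat log | mapper.py | sort | reducer.py`) and then shipped to the cluster unchanged.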
