
Master’s thesis

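This thing actually dragged on for a year, from November 2012 to November 2013. November 2012 through February 2013 counted as the seminar, and after that I kept running experiments until May 2013. The master's project proper only started after I went to a meeting in Amsterdam in May. So the project itself took six months and did not end until November 2013.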

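The whole story starts in November 2012. My internship contract was about to expire, and the parent project of my internship project had just wrapped up as well, so I did not continue there and went back to the university to look for a master's project. I was fairly lucky: I asked a former teacher of mine to be my supervisor and he agreed. He had a series of projects that were just starting, so he gave me a direction and had a postdoc guide me. The topic was pattern mining, with a touch of the fashionable big data; in practice it came down to parallelizing an algorithm.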

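From November 2012 to February 2013 I worked on this parallelization. I first wrote a version in Python, then rewrote it in Java. The algorithm needs high-precision arithmetic, so I used a Java library (Apfloat), but later dropped it for performance reasons. As a result the algorithm's correctness was slightly off, and that was never fixed by the end of the thesis.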
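To make the trade-off concrete, here is a minimal sketch of why dropping arbitrary-precision arithmetic can hurt correctness. It is not the thesis algorithm: the quantity being computed (a product of many small probabilities) is a made-up example, and Python's standard `decimal` module stands in for Apfloat.

```python
from decimal import Decimal, localcontext
import math


def product_float(probs):
    """Multiply probabilities with ordinary double-precision floats."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def product_decimal(probs, digits=50):
    """Multiply the same probabilities with arbitrary-precision decimals."""
    with localcontext() as ctx:
        ctx.prec = digits  # number of significant digits to keep
        result = Decimal(1)
        for p in probs:
            result *= Decimal(str(p))
        return result


def log_product(probs):
    """Cheaper workaround: sum logarithms instead of multiplying."""
    return sum(math.log(p) for p in probs)


# Hypothetical input: 100 events, each with probability 1e-5.
probs = [1e-5] * 100
print(product_float(probs))    # underflows to 0.0
print(product_decimal(probs))  # keeps the exact value
print(log_product(probs))      # stays representable as a float
```

The float version silently collapses to zero, the decimal version is exact but slower, and working in log space is a common middle ground when only comparisons between such values are needed.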

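After February 2013 I kept tinkering with the algorithm and running experiments. Then in May I went to Amsterdam for a meeting with the company and got some data that they said had already been processed. Only after getting back did I realize I had been had: I still had to preprocess it myself. With no experience at all, I fumbled around until August before things started to make sense. In July, reading papers and books on my own, I found out that all of this had already been studied under the name web usage mining. And LinkedIn even had a toolkit called DataFu; I could have cried.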

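By September I knew what the whole process should look like, but because of the earlier precision problem I still did not dare to be casual about feature selection. I ran a few experiments and got poor results. Nobody online seemed to have published results on this kind of problem, so there was nothing to compare against. In October the thesis still could not be finished; in fact the main experiments had not even been done. So it was postponed to November, and then to the end of November. Once the date was fixed things moved fast, and I graduated carrying a pile of crap.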

That is the whole thesis story. After a month of goofing off, I have honestly almost forgotten it.

Topic: Distributed Context Discovery for Predictive Modeling
Scenario

A Dutch portal website wanted to know whether a user would click on its index page.
Knowing this can help the site adjust its content and make personalized recommendations.
Unlike Google or Facebook, the portal knows very little about its users, since most of them are not registered on the site.
For example, Google knows what a user has visited, what a user has clicked, and the user’s age. This portal website does not.
So my task was to predict whether a user would click, based on this small amount of information.

Challenges

The raw data was not directly classifiable, and there was no documentation of its semantics.
The data was big and dynamic.
The features were weak.
The class distribution was skewed.

Project / What I did

Preprocessed raw data for classification
    • Identified the semantics of tables and columns
    • Sessionized pageviews
    • Transformed raw sessions into classifiable data
Developed a distributed algorithm that can mine discriminative features for classification
Built classification models over data streams
Predicted incoming pageviews
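
As an illustration of the sessionizing step listed above, here is a minimal sketch. It assumes each pageview is a `(user_id, timestamp_in_seconds)` pair and that a gap of more than 30 minutes starts a new session; both the record layout and the timeout are assumptions made for this example, not details of the thesis data.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed, commonly used cut-off


def sessionize(pageviews, timeout=SESSION_TIMEOUT):
    """Group (user_id, timestamp) pageviews into per-user sessions.

    A new session starts whenever the gap between two consecutive
    pageviews of the same user exceeds `timeout` seconds.
    Returns a list of (user_id, [timestamps]) tuples.
    """
    by_user = defaultdict(list)
    for user_id, ts in pageviews:
        by_user[user_id].append(ts)

    sessions = []
    for user_id, stamps in by_user.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] > timeout:
                sessions.append((user_id, current))  # close the old session
                current = [ts]
            else:
                current.append(ts)
        sessions.append((user_id, current))  # close the last session
    return sessions


# Hypothetical pageviews: user "a" returns after a long pause.
views = [("a", 0), ("a", 600), ("a", 2401), ("b", 100)]
for user, stamps in sorted(sessionize(views)):
    print(user, stamps)
```

On a cluster the same grouping would be done per user in a reducer or with a sessionizing UDF; the in-memory version only shows the logic.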

Achieved

The pattern mining algorithm scales with the number of mappers and with the sample size.
The click prediction achieved an average AUC of 0.675 over a period of 36 days.
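
For readers unfamiliar with the metric: AUC is the probability that a randomly chosen clicked pageview gets a higher score than a randomly chosen non-clicked one, so 0.5 is chance level and 1.0 is a perfect ranking. Because it only compares positives against negatives, it is not inflated by a skewed class distribution the way plain accuracy is. The sketch below computes it by direct pairwise comparison; the labels and scores are toy values invented for the example, not thesis data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 (1 = click); scores: predicted scores.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two clicks among five pageviews.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
print(auc(labels, scores))
```

The pairwise loop is quadratic, which is fine for a demonstration; a rank-based formulation is the usual choice for large data.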

Learnt technologies
  1. Hadoop
    • command lines for MapReduce and HDFS
    • Java APIs (0.20 and 0.22, CDH3 and CDH4)
    • streaming
  2. Mahout
    • command line – vectorization, naive Bayes (nb), feature hashing, Parallel FP-Growth (PFP)
    • Java API – regression, model dissection
  3. Hive
    • HiveQL
    • UDFs and streaming with Python
  4. Pig
    • scripting
    • UDFs, with Python and DataFu
    • streaming with Python
  5. Python – processing data and plotting with Matplotlib
  6. Bash
  7. Apfloat – Java lib for high-precision calculation
  8. R (introductory)
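
Of the techniques in this list, feature hashing is the easiest to show in a few lines. The sketch below is a generic version of the hashing trick, not Mahout's implementation: string features are mapped into a fixed-size vector by a hash function, so the feature space does not need to be known in advance, which suits streaming data. The token names and the bucket count are hypothetical.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map string features into a fixed-length vector (the hashing trick).

    Each token is hashed to a bucket index; a second part of the hash
    picks a sign so that collisions tend to cancel instead of piling up.
    """
    vec = [0.0] * n_buckets
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_buckets
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    return vec


# Hypothetical pageview features encoded as "name=value" tokens.
tokens = ["hour=21", "referrer=search", "browser=firefox"]
print(hash_features(tokens))
```

The vector length stays fixed no matter how many distinct tokens appear, at the cost of occasional collisions between unrelated features.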

Hello world!

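I paid 100 yuan for a year of hosting, and after half a day of fiddling my blog is online again.

It turns out I forgot to back up the images when I made the backup, so only the text could be restored.

After all that fiddling, at least this page opens now.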

Hello world!

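Hi everyone, I am going to start blogging with WordPress. From now on, this is the place.

This blog will definitely contain a lot of complaining, to ease my eighth-grader syndrome.

From now on I am settling down here and saying goodbye to all those painful blog platforms such as Baidu, Xiaonei, Qzone, BlogBus, BlogCN, and so on.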