Master’s thesis

In fact this thing lasted a whole year, from November 2012 to November 2013. From 2012-11 to 2013-2 it counted as a seminar, and then I kept running experiments until 2013-5. The master's project proper only started after the meeting at ams in May, so it ran for six months and finished in 2013-11.

The whole story starts in November 2012. My internship contract was about to expire, and the parent project of my internship project had just ended, so instead of continuing I went back to school to look for a master's project. I was fairly lucky: I asked a former teacher to be my supervisor and he agreed. He had a series of projects that had just started, so he gave me a direction and had a postdoc mentor me. The topic was pattern mining, with a touch of the fashionable big data; essentially it was parallelizing an algorithm.

From 2012-11 to 2013-02 I worked on this parallelization. I first wrote it in Python, then rewrote it in Java. The algorithm needs high-precision arithmetic, so I used a Java library (Apfloat), but later dropped it for performance reasons. That left the algorithm's correctness slightly off, and it was never fixed by the end of the thesis.

After February 2013 I was still fiddling with the algorithm and running experiments. In May I went to ams for a meeting with the company and got some data they claimed was already processed. Only after coming back did I realize I had been had: I still had to preprocess it myself. With zero experience, I messed around until August before getting anywhere. In July, reading papers and books, I found out that all of this had already been studied under the name web usage mining, and that LinkedIn even had a toolkit called DataFu. I nearly cried.

By September I knew what the whole pipeline should look like, but because of the earlier precision problem I still did not dare to be careless about feature selection. I ran some experiments and got very poor results. Nobody online had published results on this, so there was nothing to compare against. In October the thesis still could not be finished; the main experiments had not even been done. So it was postponed to November, and then to the end of November. Once the date was fixed things went fast, and I graduated carrying a pile of crap.

That was the whole thesis. After a month of slacking off, I have almost forgotten it all.

Topic: Distributed Context Discovery for Predictive Modeling
Scenario

A Dutch portal website wanted to know whether a user would click on their index page.
Knowing this helps them adjust content and make personalized recommendations.
Unlike Google or Facebook, the portal website knows very little about its users, since most of them are not registered on the website.
For example, Google knows what a user has visited, what a user has clicked, and the user's age… This portal website, however, does not.
So my task was to predict whether a user would click based on this scarce information.

Challenges

The raw data was not directly classifiable, and there was no documentation on the semantics of the data.
The data was big and dynamic.
The features were weak.
The class distribution was skewed.

Project / What I did

Preprocessed raw data for classification
Identified the semantics of tables and columns
Sessionized pageviews (see the sketch after this list)
Transformed raw sessions into classifiable data
Developed a distributed algorithm that mines discriminative features for classification
Built classification models over data streams
Predicted incoming pageviews
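
The post does not spell out the sessionization rule, so here is only a minimal sketch of the usual inactivity-timeout approach from web usage mining; the 30-minute timeout and all class and field names are my assumptions, not details from the thesis:

    import java.util.ArrayList;
    import java.util.List;

    public class Sessionizer {
        // Assumed inactivity timeout; 30 minutes is the common web-usage-mining default.
        private static final long TIMEOUT_MS = 30 * 60 * 1000L;

        /** A pageview: (userId, url, timestampMs). All names are illustrative. */
        public record Pageview(String userId, String url, long timestampMs) {}

        /**
         * Splits one user's time-ordered pageviews into sessions:
         * a new session starts whenever the gap between two consecutive
         * pageviews exceeds the timeout.
         */
        public static List<List<Pageview>> sessionize(List<Pageview> ordered) {
            List<List<Pageview>> sessions = new ArrayList<>();
            List<Pageview> current = new ArrayList<>();
            long lastTs = Long.MIN_VALUE;
            for (Pageview pv : ordered) {
                if (!current.isEmpty() && pv.timestampMs() - lastTs > TIMEOUT_MS) {
                    sessions.add(current);
                    current = new ArrayList<>();
                }
                current.add(pv);
                lastTs = pv.timestampMs();
            }
            if (!current.isEmpty()) {
                sessions.add(current);
            }
            return sessions;
        }
    }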

Achieved

The pattern mining algorithm scales with the number of mappers and with the sample size.
The click prediction achieved an average AUC of 0.675 over a period of 36 days.

Learnt technologies
  1. Hadoop
    • command lines for MapReduce and HDFS
    • Java APIs (0.20 and 0.22, CDH3 and CDH4)
    • streaming
  2. Mahout
    • command line – vectorization, naive Bayes (nb), feature hashing, PFP
    • Java API – regression, model dissection
  3. Hive
    • HiveQL
    • UDFs and streaming with Python
  4. Pig
    • scripting
    • UDFs, with Python and DataFu
    • streaming with Python
  5. Python – processing data and plotting with Matplotlib
  6. Bash
  7. Apfloat – Java library for high-precision arithmetic (see the sketch after this list)
  8. R (introductory)
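
Since Apfloat comes up both in the story above and in this list, here is a tiny illustration of the arbitrary-precision arithmetic it provides; the 100-digit precision and the values are arbitrary examples, not ones from the thesis:

    import org.apfloat.Apfloat;

    public class ApfloatDemo {
        public static void main(String[] args) {
            // Two numbers with 100 significant digits of precision.
            // With Apfloat's default decimal radix, 0.1 and 0.2 are represented exactly.
            Apfloat a = new Apfloat("0.1", 100);
            Apfloat b = new Apfloat("0.2", 100);

            // Arbitrary-precision arithmetic: no binary floating-point rounding error.
            Apfloat sum = a.add(b);        // exactly 0.3
            Apfloat prod = a.multiply(b);  // exactly 0.02

            System.out.println(sum);
            System.out.println(prod);
        }
    }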

Hello world!


I spent 100 RMB on a year of hosting space, and after a lot of fiddling my blog is back online.

Turns out that when I made the backup I forgot to back up the images, so only the text could be restored.

After all that fiddling, at least this page opens now.

2012.9-2013.1 Courses taken in Semester 3

The third semester.

2IW15 Automated Reasoning

  • Learned:
    • Obtained insight into how various problems can be transformed into formulas and solved automatically by computer programs manipulating these formulas.
  • Project:
    • 3 assignments
  • Skills:
    • Yices, bddsolve
  • Ps: It felt like I mostly just practiced vim and Python

2IN28 Grid and Cloud Computing

  • Learned:
    • Scheduling and Resource Management
    • Data Centers and Energy Efficiency
    • Multi-tenancy and Virtualization
    • Cloud Programming Models
  • Project:
    • Deployed a term extraction system on Amazon EC2
    • The system consists of two parts: a resource management server and the workers
    • The workers are virtual machines on Amazon EC2 that process the inputs and output the extracted terms
    • The resource management server controls the resources (workers) and allocates them elastically (see the sketch below)
  • Skills:
    • AWS SDK, web2py, Spring framework
  • Ps: elaborate
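
For the elastic allocation mentioned above, a minimal sketch of launching and terminating worker VMs with the AWS SDK for Java (v1); the class name, AMI id, and instance type are placeholders, and the real project's management server was built with web2py and Spring rather than this exact code:

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.RunInstancesRequest;
    import com.amazonaws.services.ec2.model.RunInstancesResult;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

    public class WorkerPool {
        // Credentials and region come from the SDK's default provider chain.
        private final AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        /** Launches n worker VMs from a prebuilt image (AMI id is a placeholder). */
        public RunInstancesResult scaleUp(int n) {
            RunInstancesRequest request = new RunInstancesRequest()
                    .withImageId("ami-12345678")   // placeholder worker image
                    .withInstanceType("m1.small")  // placeholder instance type
                    .withMinCount(n)
                    .withMaxCount(n);
            return ec2.runInstances(request);
        }

        /** Terminates a worker that is no longer needed. */
        public void scaleDown(String instanceId) {
            ec2.terminateInstances(
                    new TerminateInstancesRequest().withInstanceIds(instanceId));
        }
    }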

2IV35 Visualization

  • Learned:
    • color mapping, contouring
    • Vector visualization
    • Volume visualization
    • Information visualization
  • Project:
    • 3 assignments
  • Skills:
    • OpenGL in Java, Prefuse library

2ID95 Information System Seminar

  • Project: distributed sampling
    • Adapted a pattern sampling algorithm to a distributed setting
      • Involved reservoir sampling (see the sketch below)
      • Implemented a Cartesian join in Hadoop
    • Implemented it in the Hadoop environment
  • Skills: Hadoop old and new APIs
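
For reference, reservoir sampling keeps a uniform random sample of fixed size k from a stream of unknown length. This is only a minimal standalone sketch of the classic Algorithm R, separate from the Hadoop implementation used in the seminar:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class ReservoirSampler<T> {
        private final int k;                  // reservoir size
        private final List<T> reservoir;
        private final Random rng = new Random();
        private long seen = 0;                // items seen so far

        public ReservoirSampler(int k) {
            this.k = k;
            this.reservoir = new ArrayList<>(k);
        }

        /** Offer one stream item; every item ends up in the sample with probability k/seen. */
        public void offer(T item) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);              // fill the reservoir first
            } else {
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item); // replace a random slot
                }
            }
        }

        public List<T> sample() {
            return reservoir;
        }
    }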

Internship in Philips Research

  • Position: software engineer
  • Project: TClouds, TPaaS (Trustworthy Platform as a Service)
  • Skills: Spring framework (MVC, Security), Hibernate, Maven, OAuth

2012.2-2012.7 Courses taken in Semester 2

I have put this off until today, and I am not sure I still remember what I did back then.

9ST14 Academic Skills in English

  • Learned: presentation and writing
  • Assignments: presentation and writing
  • Ps: Actually a very good course; too bad I had no idea what I was doing at the time

5MB20 Adaptive Information Processing

  • Learned: Bayesian machine learning; Gaussians; Linear regression; Generative classification; Discriminative classification; Gaussian mixture models; maximum entropy algorithm; Hidden Markov models; Context tree; MDL principle
  • Ps: I did learn a lot in this course; I have just forgotten all of it

2ID45 Advanced Databases

  • Learned:
    • deductive databases (datalog);
    • data warehousing and online analytical processing (OLAP);
    • XML data model
  • Project: a CV generator website. The user can enter blocks of information on the website, decide which of them to include in each edition of the final CV, and in the end produce a PDF version of the CV.
    • Used BaseX server on an Apache server to support the XML database and XQuery
    • Used PHP to implement the webpage interface to allow users to access their profiles
    • Used LaTeX as the backend processor to generate the CV
  • Skill acquired: XQuery, BaseX, PHP
  • Ps: The Russian guy said the stuff taught in this course has no practical value

2II55 Business Process Management Systems

  • Learned:
    • Modeling and implementation of workflows
    • analysis of workflows/business processes
  • Project: modeled a Retail Supply Chain, and a Robotic Distribution Centre
  • Skill acquired: YAWL, Protos
  • Ps: Failed it, passed the resit

1BM46 Data Mining and Process Mining

  • Learned:
    • Data mining:
      • K-nearest neighbors, decision trees, information gain, overfitting
      • performance measurements, experimental design, k-fold cross-validation
      • ANN, clustering
    • Process mining:
      • structure of an event-log, CPN-tools for generating event logs, alpha-algorithm
      • Conformance checking (fitness, precision, generalization, understandability) and the LTL checker
      • Process mining in practice: C-nets, Flexible Heuristics Miner, Fuzzy Miner
      • Process mining and k-fold cross-validation, multidimensional process mining
  • Project:
    • Prediction of cancer patients’ duration in hospital
    • Divided the duration into two classes: short-term and long-term
    • Eliminated irrelevant attributes by backward feature elimination
    • Developed a decision tree
    • Cross-validated the developed tree and achieved an accuracy of 75.89% (see the k-fold sketch after this list)
    • Explored the relation between durations and treating physicians
  • Skill acquired: KNIME, CPN-tools, ProM 5&6
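
For reference, a minimal sketch of the index splitting behind k-fold cross-validation; the project itself was done in KNIME, so the class and method names here are purely illustrative:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class KFold {
        /**
         * Splits the row indices 0..n-1 into k shuffled folds.
         * Each fold serves once as the test set while the rest form the training set.
         */
        public static List<List<Integer>> folds(int n, int k, long seed) {
            List<Integer> indices = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                indices.add(i);
            }
            Collections.shuffle(indices, new Random(seed));

            List<List<Integer>> folds = new ArrayList<>();
            for (int f = 0; f < k; f++) {
                folds.add(new ArrayList<>());
            }
            for (int i = 0; i < n; i++) {
                folds.get(i % k).add(indices.get(i)); // round-robin assignment
            }
            return folds;
        }
    }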

2ID35 Database Technology

  • Learned:
    • Storage, the I/O computational model, & external sorting
    • Indexing: B-trees, R-trees, and GiST
    • Query processing and optimization
    • Distributed query processing
    • Transaction management
  • Project:
    • Cardinality estimation method for RDF star joins
    • Implemented the characteristic sets method in Java
  • Skill acquired: Apache Jena APIs, SPARQL

2IL55 Geometry Algorithms

  • Learned: Geometry algorithms, e.g.
  • Project: Map matching
    • Map matching is the process of identifying the roads a vehicle actually drives on, based on its GPS trajectory.
    • Implemented two algorithms for the map matching problem:
      • Incremental
      • Global based on weak Fréchet distance
  • Skill acquired: Nothing
  • Ps: The Russian guy single-handedly did the coding, the report, and the presentation for the project

2IW02 Real-time Software Development

  • Learned: CSP (Communicating sequential processes)
  • Project: controlling a camera…
  • Skill acquired:
  • Ps: I did the final assignment at UTwente, but did not finish it, did not take the exam either, and failed.
    • A bizarre course: the tool was still being developed while the students were using it
    • At Twente I even ran into the developer, a Chinese girl; I was floored

5N520 Statistical Bioinformatics

  • Learned:
    • essential molecular biology
    • sequence alignment and dynamic programming
    • BLAST statistics and substitution matrices
    • multiple sequence alignment
    • Hidden Markov Models for sequence alignment
    • phylogenetic trees
    • sequencing and genome assembly
  • Project: No project
  • Skill acquired: Matlab Bioinformatics Toolbox

It was indeed a DNS problem

The network had been flaky for a week: QQ worked, PES played fine online.. I could not figure out why, but my gut said DNS. Setting the DNS manually to 8.8.8.8 did not help, which was odd. I downloaded 彗星DNS, ran a scan, let it set 8.8.8.8 automatically, and the network was fine again. Really strange that doing it manually did not work. By the way, I have to say Google is impressive: after speed-testing some 1900 DNS servers worldwide, 8.8.8.8 came out fastest and 8.8.4.4 third. UPC is really unreliable (Henan-grade stuff); so is KPN, which does not even seem to have its own DNS. (Update: apparently it is not a DNS problem after all. How strange.)

“you don’t have permission to access on this server”

Yesterday, while setting up an Apache2 + PHP environment on Ubuntu, the default index was accessible, but after switching sites (a2dissite default && a2ensite [mysite]) I got this error.

A lot of posts online say to edit the apache2 config files; I tried that too, but it did not help.

Then I thought of file permissions and set my index.php to rwx for everyone, but that did not help either.

Later I found out it was Dropbox's doing: I had put all the web files inside a directory in my Dropbox.

And the permissions on that thing, unbelievably, were…

According to https://help.ubuntu.com/community/FilePermissions, if a directory has no x permission, you cannot cd into it.

I changed the Dropbox directory to 755, and the problem was solved.

Also, the Russian guy tested it: given test1/test2/, if test2 has r, then regardless of test1's permissions, test2 is r.

Database Technology assignment

Cardinality Estimation using Characteristic Sets

- RDF format, RDFS, TDB, Notation3, N-Triples

- Jena API: opening a TDB store and building a Model

- SPARQL, star joins


This project was harder than I imagined. The Characteristic Sets algorithm itself is simple, but it requires… scanning the entire database multiple times.
At first I wanted to compute the plain sets without annotations first and then compute the annotations; of course that ended in failure, because the data structures were designed terribly.
Later I refactored (this happens in every project; I really do not know what to do about it, I seriously need to reread the software engineering books), and it got a bit better.
I fantasized about using plain SPARQL to group first and then scan everything, which of course ended in tears. Back then I did not yet know that a ResultSet in ARQ is actually a stream; I naively assumed the query would be executed once and the results kept in memory.
It seems database engines generally return results this way; in undergrad the data was too small for me to notice.
Asking around on Stack Overflow, I learned there is also an aggregate called group_concat, so I used it right away: concatenate all the counts within the same set, then split them apart afterwards. Hmm… below is what I came up with..

    // Nested group_concat query: per-subject property sets with per-property counts.
    String queryString =
          "SELECT (COUNT(?s) AS ?distinct) "
        + "?propset "
        + "(group_concat(?count; separator = \"\\t\") AS ?counts) "
        + "{"
        + "SELECT ?s "
        + "(group_concat(?p; separator = \" \") AS ?propset) "
        + "(group_concat(?c; separator = \" \") AS ?count)"
        + " {"
        + "SELECT ?s ?p ?c"
        + " WHERE "
        + "{"
        + "SELECT ?s ?p (COUNT(*) AS ?c) "
        + "WHERE { ?s ?p ?o .}"
        + " GROUP BY ?s ?p"
        + "} ORDER BY ?s ?p"
        + "} GROUP BY ?s ORDER BY ?s"
        + "} GROUP BY ?propset ORDER BY ?propset";
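
For context, a minimal sketch of running such a query with Jena ARQ over a TDB store (using current org.apache.jena package names; the 2012-era API lived under com.hp.hpl.jena, and the dataset path is a placeholder). execSelect() returns a streaming ResultSet, which is exactly the behavior mentioned above:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.tdb.TDBFactory;

    public class CharacteristicSetsQuery {
        public static void main(String[] args) {
            // Open the TDB-backed dataset; the directory path is a placeholder.
            Dataset dataset = TDBFactory.createDataset("/path/to/tdb");

            String queryString = "..."; // the nested group_concat query shown above

            // execSelect() returns a streaming ResultSet: rows are pulled from TDB
            // lazily as you iterate, not materialized in memory all at once.
            try (QueryExecution qexec = QueryExecutionFactory.create(queryString, dataset)) {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(row.get("propset") + "\t"
                            + row.get("distinct") + "\t" + row.get("counts"));
                }
            }
        }
    }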

Of course, it was unbelievably slow: it seems to scan the entire database three times, not even counting the cost of the GROUP BYs and ORDER BYs.
On YAGO it was a total disaster; a whole night passed with no result.

I emailed the author; he said it takes three steps… and in the end you still have to scan the whole database and do a group by. I really wanted to ask how they had the nerve to write "implemented it by only two group-by operators" in the paper.
Then again, I suppose he only replied in such detail because he could not stand the obvious sarcasm in my email = =


Well, now I know why it is slow: TDB does not load everything into memory at once. So naturally the next idea was to just load YAGO into memory; I have 8G after all, what is a mere 2G of data?
And so I failed again. Jena could not parse YAGO.. they are supposed to be N3 files, so why would they not parse?
Still naive, I tried a few text tools to patch up the unparseable parts, and of course failed again. Btw, Notepad++ cannot open a file this big; vim can open and edit it, but crashes the moment you save. 010 Editor could handle it,
but then I found there were lots of places that needed fixing, so let's use search… and that software crashed too.
Now that was strange: why could it not be parsed? After some digging I found that YAGO is actually encoded in ASCII.. and on top of that it has its own homegrown scheme for encoding UTF-8 content into it. I was completely stunned.
They do not provide a conversion tool either, only one that converts a String. So what, am I supposed to build a 2G String?
In the end.. since Jena can read TDB, why not just convert the TDB back into N3?
Apparently still too naive: writing out the Model created from TDB has been running for 20 minutes and the file size is still 0; no idea what it is doing…
Ok, off to bed; I will check in the morning. Hoping for a surprise.

Hotpot before the deadline

Today was my second time having hotpot since coming to the Netherlands, a bit better than the last time: the soup base tasted properly Chinese, and we had beef (thankfully no lamb) and potatoes.

Back in China I always drank the most soft drinks at hotpot, and it was about the same here; thank goodness for Coke. The only letdown was that the so-called Guilin chili sauce was salty.

For the Business Process Management Systems course, I had decided that if those two teammates still did not respond I would simply drop it. On Friday I went to class and the room was empty; it turned out the lecture was cancelled. The bizarre thing is, just as I was about to leave, the two teammates showed up, saw nobody there, figured out there was no class, and said: might as well have a meeting then…

Then the Romanian teammate said he could not drop the course anymore, otherwise he would not have enough credits and would have to pay tuition. After that I did not have the heart to mention dropping it either. The Dutch guy said he had finished his part in Protos: the teacher asked for 15 tasks and he had done 40. Sensing trouble, I quickly claimed the report, and then the Dutch guy flatly refused to do the YAWL part; he knew perfectly well that 40 tasks in YAWL would be brutal. So I had no choice but to agree.

Saturday, which is today, my procrastination flared up and I did not feel like doing the assignment. I barely finished drawing the YAWL model, had not done the data part yet, and happily went off to eat hotpot.

Well, it is 1 a.m. now; getting up at 6:30 tomorrow.