SeriousProg: 2012

Wednesday, December 5, 2012

Web Data Mining

Web数据挖掘，刘兵(UIC)

Chapter 1: Association Rule and Sequence Pattern

关联规则是从统计的意义来调查数据中蕴含的规律的方法，它用两个变量来描述一组数据的规律：支持度(Support)和置信度(Confidence)。给出一组事务(Transaction)数据，通常是客户的购买记录，点击记录等数据，我们希望找出具备较明显的关联的一些项目(Item)。我们定义规则'X->Y'为一个关联规则(Association Rule)，假设其支持度为10%，置信度为80%，就表明，所有数据中，X和Y同时出现的概率为10%，而X出现的情况下，Y出现的情况为80%。

通常我们使用Apriori算法来计算给定的一组事务，一个最低支持度阈值(频繁项目集)，最低置信度阈值(可信关联规则)时，所有的关联规则。

Apriori算法包括了两个步骤：1）求出所有的频繁项目集；2）从频繁项目集中找出可信关联规则。

频繁项目集有一个很好的特性，任一个包含k个项目的频繁项目集中任何一个非空子集都满足频繁项目集的条件。我们从只含1个项目的平凡集开始，根据k-1个项目的平凡集构造含k个项目的平凡集。构造的方法有个技巧，首先选取两个k频繁项目集f1和f2，其前k-1个元素相等，第k个元素f1[k]<f2[k]，那么构造的k+1频繁项目集的一个元素就为前k-1个元素，f1[k], f2[k]，共计k+1个项目。

这一算法的正确性很容易证明：根据频繁项目集的特点，k+1个元素的频繁项目集的任意的k元素子集都一定是频繁项目集，并且我们已经求出，那么显然{a1, a2, a3, … ak-1, ak}和{a1, a2, a3, … ak-1, ak+1}都在这个k频繁项目集中，那么聚合他们一定能得到k+1个元素的频繁项目集。

而关联规则的生成则更加简单，在有了频繁项目集的基础之上，穷举f - Y->Y中的Y并且求置信度即可。这里面还有一个技巧，如果f-Y->Y满足置信度要求，那么所有的f-Ysub->Ysub都满足置信度要求。因为该公式的置信度计算公式为f.count / (f – Ysub).count。

然而在实际应用中，单最小支持度的限制还是使得该算法的实用性不佳，特别是对那些本来就比较稀有的项目。因此引入了多最小支持度的关联规则。

Talking about trigger with HBase

We have been pretty familiar with triggers in traditional SQL databases. Database administrators use triggers to keep data integrity or finish some useful actions. Most of these triggers are implemented based on the ECA model (event-condition-action). Today's HBase has not provided this features which i believe is critical for lots of applications, so in next few articles i would like to talk about how i designed and implemented this feature based on current HBase release(0.94.2).

1. Our goal

As co-processor has been introduced in HBase since 0.92, it makes no sense to give HBase a new tool for keeping data integrity or finishing some pre-, post-actions. In fact, what we are doing here is a general framework that allows programmers to write applications which can monitor some fields of HBase table, check whether conditions are fulfilled, and finally run user defined functions. So basically, we need a trigger-plused HBase that:

allows programmers to submit 'TRIGGERS'
supports "ACTIONS" to be executed distributed automatically

There are lots of difficulties here:

we needs to guarantee that when programmers got the successful submission return value, the "TRIGGERS" must have began to work.
we need to make sure all the relevant servers should contain the "TRIGGER" information. Parts are not allowed.
we need to provide an efficient way to make sure that trigger event can be detected quickly with small 'PUT' performance degradation.
user-defined functions are not restricted to some types, so we must keep eyes on failures, inconsistency, and dis-order.
we need to provide good load-balance algorithm to redistribute "TRIGGERS" according to both the disk load and cpu load.
we need to have a simple failure-tolerant approach to make whole system workable in practical environment.

2. The Architecture

Just like the fellowing figure shows, Triggers are submitted from users to the TriggerMaster node, which is also the HMaster node in HBase implementation. The TriggerMaster will distribute the trigger structure to other TriggerWorkers, which takes charge of the trigger deployment and execution.

This architecture is quite simple, but it will be much easier to do schedule or fault-tolerance stuffs. Also, the TriggerMaster was selected from all the nodes in the data center through a ZooKeeper like service, so even this single node fails, the system can recovery by select a new TriggerMaster again.

2.1 The TriggerMaster

Trigger
2.2 The TriggerWorker
2.3 The Communication

3. Fault-Tolerance
4. Scheduler
5. Use Cases
6. Analysis