基于时滞特征的时序依赖情节发现

2019-08-01顾佩月刘峥李云李涛

计算机应用 2019年2期

关键词：时滞

顾佩月刘峥李云李涛

摘要：对于事件序列中的时序依赖发现，传统的频繁情节发现方法一方面使用时间窗口机制挖掘事件之间简单的关联依赖，另一方面无法有效处理事件的交叉时序关联。针对以上问题，提出了时滞情节发现的概念，在频繁情节发现的基础上，设计了一种基于相邻事件匹配集（AEM）的时滞情节发现算法。首先，引入时滞的概率统计模型进行事件序列匹配，避免预先设定时间窗口，处理可能存在的交叉关联;然后，将时滞挖掘转化为最优化问题，使用迭代的方式得到时滞情节之间的时间间隔分布;最后，利用假设检验区分串行时滞情节和并行时滞情节。理论分析与实验结果表明，与目前最新的时滞挖掘方法迭代最近事件（ICE）算法相比，基于AEM的时滞情节发现算法模拟的时滞分布与真实时滞分布的平均KL距离为0.056，缩短了20.68%。基于AEM的时滞情节发现算法通过时滞的概率统计模型衡量事件多种匹配情况的可能性，获得一对多的相邻事件匹配集，比ICE算法中的一对一匹配更加有效地模拟了实际情况。

关键词：时序依赖;事件序列;频繁情节;时滞;概率统计模型

中图分类号： TP391

文献标志码：A

Abstract： Concerning the problem that a predefined time window is usually used to mine simple association dependencies between events in the traditional frequent episode discovery， which cannot effectively handle interleaved temporal correlations between events， a concept of time-lag episode discovery was proposed. And on the basis of frequent episode discovery， Adjacent Event Matching set （AEM） based time-lag episode discovery algorithm was proposed. Firstly， a probability statistical model introduced with time-lag was introduced to realize event sequence matching and handle optional interleaved associations without a predefined time window. Then the discovery of time lag was formulated as an optimization problem which can be solved iteratively to obtain time interval distribution between time-lag episodes. Finally， the hypothesis test was used to distinguish serial and parallel time-lag episodes. The experimental results show that compared with Iterative Closest Event （ICE） algorithm which is the latest method of time-lag mining， the Kullback-Leibler （KL） divergence between true and experimental distributions discovered by AEM is 0.056 on average， which is decreased by 20.68%. AEM algorithm measures the possibility of multiple matches of events through a probability statistical model of time lag and obtains a one-to-many adjacent event matching set， which is more effective than the one-to-one matching set in ICE for simulating the actual situation.

Key words： temporal dependency; event sequence; frequent episode; time lag; probability statistical model

0 引言

近年來，随着数据采集和存储技术的不断发展，各领域海量时序数据的研究逐渐成为数据挖掘和知识发现的热点问题[1]。时序数据通常记录了事物随着时间变化发展的不同状态。

在各类领域中，时序数据通常会有两种形式：第一种是时间序列，记录连续时间内事物变化值，例如股票数据、系统内存占用率等;第二种是事件数据，记录离散时间内事物变化情况，例如购物篮数据[2]、Web数据[3]、社交网络事件数据[4]等。同一事物的发展往往蕴含着一些不为人知的规律或固有模式，从事件数据中发现学习这些隐藏的有趣时序依赖模式，有助于对未来事物的发展进行预测或是对引起事件的根源进行追查。不同形式的时序数据依赖发现方式也有很大不同，本文主要研究离散时间上的序列集合，即事件数据中的时序依赖发现。

关联规则[5]是现有工作中针对事件序列最简单的时序依赖发现，也是后续许多时序依赖发现方法的基础。关联规则旨在发现时序数据中频繁出现的项集，比较经典的应用就是购物篮数据。在此数据集中，每一条记录代表一个客户所有的购买记录，通过关联分析得到客户可能同时购买的商品组合，从而帮助商场进行营销。但是关联规则的依赖事件是不考虑时间先后关系的，随之又衍生了频繁情节挖掘[6]，即在关联规则的基础上添加简单的时间约束，通过设定时间窗口从发现的依赖中得知事件发生的先后顺序。随着生产应用的需求变化，时序依赖发现中的时间参数从先后关系再到模糊的时间区间进而转向具体的滞后时间挖掘。

在考虑两类事件A、B之间的依赖时，通常是基于依赖存在于相邻事件的假设上进行挖掘的，即若事件A、B是依赖的，那么事件B的发生只能与它之前的第一个事件A有关。但在实际复杂的生产环境中，事件之间的依赖不仅仅存在于相邻发生，也有可能出现交叉依赖[7]的情况，导致无法判断事件之间的对应关系。例如图1，在关联规则、序列模式等时序依赖发现中，通常只考虑彼此相邻的两类事件依赖，即实线箭头表示的依赖关系，但真实的时序依赖是相互交叉的，事件B可能与它之前的第一个或第二个甚至更前面的A有关，即b3不仅可能与a3有关，也可能与a2或a1有关。以上这些需求的变化都为事件时序依赖发现带来了更多的问题和困难。

1 相关工作

来自不同领域的多样化需求衍生了多种类型的事件依赖，一般根据处理的数据类型分为两类。第一类是事务数据库，其中每个事务是一系列事件的集合。每个事务通常记录的是同一个实体的事件，例如银行账户流水、顾客购物记录。这种类型的时序依赖主要用于发现在某些事务中频繁出现的子序列，关注不同个体之间相同的依赖模式。序列模式[8-9]、全依赖模式[10]、互依赖模式[11]、部分周期模式[12]是该类事件依赖中比较典型的模式依赖发现。

在实际应用中因为某些条件限制，用户无法获得足够的信息将事件数据划分为事务，此时用户面对的就是一系列事件。这就是第二类的时序依赖，也是本文重点关注的类型。其中，最基本的时序依赖是频繁情节发现[6]。同一个情节中的事件发生在相近的时间区间内，所以情节发现的一般方式都是采用用户预定义时间窗口机制。通过滑动时间窗口将发生在同一个窗口中的事件视作一个事务，然后将从事务数据库中发现频繁子序列的思想应用到频繁情节发现中。传统的情节发现只能揭示在一个预定义窗口内频繁发生的有序事件，情节内事件之间发生的具体时间间隔是未知的。例如在大小为5的时间窗口中发现的情节〈A，B〉表示事件B会在事件A发生之后的4个单位时间内发生，但不能准确提供具体的发生时刻。文献[13]提出了固定间隔情节的概念，构建了一种基于前缀树的精确位置情节规则（Precise-Position Episode Rule trie， PER-trie）挖掘框架，用于获得事件之间准确发生的时间间隔区间。

以上这些挖掘算法通常都需要用户提前选定一个合适的时间窗口大小，这在实际应用中是一件非常困难的事情，因为实际应用中时序依赖发现的时间区间可能会从1秒到1天甚至更长不等，而不合适的时间窗口可能会导致一些超出窗口以外的依赖被忽视。文献[14]算法不需要预先定义窗口，将空间统计学中的思想应用到事件的时序依赖发现中，用空间点过程的距离度量计算事件发生的无条件分布和条件分布，将成对事件之间的依赖关系问题转化为这两个概率分布的比较，而事件之间的等待时间用一个卡方检验来计算。Tang等[7]考虑到时间间隔的随机性，将交错事件存在关联的可能性纳入依赖发现过程中，构建了一个基于有序表的算法STScan （Sorted Table Scan），将成对事件之间可能存在的所有时间间隔都保存在链表中，通过扫描链表的子段得到所有的合格滞后区间。文献[15]将时序依赖发现问题转化为成对事件的时间间隔分布的学习，提出了一种基于期望最大化（Expectation Maximization， EM）的方法用于发现时序依赖中时滞的最大似然模型。Zller[16]基于迭代最近点（Iterative Closest Point， ICP）算法[17]思想构建了迭代最近事件（Iterative Closest Event， ICE）算法寻找最合适的事件匹配集合满足时滞分布波动性最小。

6 结语

本文针对事件数据中的时序依赖问题进行研究，提出了时滞情节发现这一概念，采用基于相邻事件匹配集（AEM）的时滞情节发现算法，不需要预先定义时间窗口，通过时滞的概率统计模型发现成对并行和串行时滞情节。与算法ICE相比，基于AEM的时滞情节发现算法有效地提高了时滞模拟效果。但目前本文研究只关注了成对的时滞情节，三个及三个以上事件类型之间的时滞情节发现需要在事件序列匹配中构建更加复杂的概率模型，这也是我们未来工作的重点研究方向，以便更好地将时滞情节发现应用于实际生产中。

参考文献：

[1] 李涛，刘峥，周绮凤.应用驱动的大数据挖掘[J].中兴通讯技术，2016，22（2）：49-52. （LI T， LIU Z， ZHOU Q F. Application-driven big data mining[J]. ZTE Communications， 2016，22（2）：49-52.）

[2] WU C-W， LIN Y-F， YU P S， et al. Mining high utility episodes in complex event sequences [C]// Proceedings of the 2013 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York： ACM， 2013： 536-544.

[3] 武健.时序Web数据挖掘方法[J].计算机应用，2014，34（S2）：120-122. （WU J. Data mining method from time series Web data [J]. Journal of Computer Applications， 2014， 34（S2）： 120-122.）

[4] 郭跇秀，呂学强，李卓.基于突发词聚类的微博突发事件检测方法[J].计算机应用，2014，34（2）：486-490. （GUO Y X， LYU X Q， LI Z. Bursty topics detection approach on Chinese microblog based on burst words clustering [J]. Journal of Computer Applications， 2014， 34（2）： 486-490.）

[5] AGRAWAL R， IMIELINSKI T， SWAMI A N. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. New York： ACM， 1993： 207-216.

[6] MANNILA H， TOIVONEN H， VERKAMO A I. Discovery of frequent episodes in event sequences [J]. Data Mining & Knowledge Discovery， 1997， 1（3）：259-289.

[7] TANG L， LI T， SHWARTZ L. Discovering lag intervals for temporal dependencies [C]// Proceedings of the 2012 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York： ACM， 2012： 633-641.

[8] AGRAWAL R， SRIKANT R. Mining sequential patterns [C]// Proceedings of the 1995 Eleventh International Conference on Data Engineering. Washington， DC： IEEE Computer Society， 1995： 3-14.

[9] 李艷辉，刘浩，袁野，等.基于差分隐私的频繁序列模式挖掘算法[J].计算机应用，2017，37（2）：316-321. （LI Y H， LIU H， YUAN Y. Frequent sequence pattern mining with differential privacy[J]. Journal of Computer Applications， 2017， 37（2）： 316-321.）

[10] LIANG F， MA S， HELLERSTEIN J L. Discovering fully dependent patterns [C]// Proceedings of the 2002 SIAM International Conference on Data Mining. Philadelphia， PA： Society for Industrial and Applied Mathematics （SIAM）， 2002： 511-527.

[11] MA S， HELLERSTEIN J L. Mining mutually dependent patterns [C]// Proceedings of the 2001 IEEE International Conference on Data Mining. Washington， DC： IEEE Computer Society， 2001： 409-416.

[12] MA S， HELLERSTEIN J L. Mining partially periodic event patterns with unknown periods [C]//Proceedings of the 2001 17th International Conference on Data Engineering. Piscataway， NJ： IEEE， 2001：205-214.

[13] AO X， LUO P， WANG J， et al. Mining precise-positioning episode rules from event sequences [J]. IEEE Transactions on Knowledge and Data Engineering， 2018， 30（3）： 530-543.

[14] LI T， MA S. Mining temporal patterns without predefined time windows [C]// Proceedings of the 4th IEEE International Conference on Data Mining. Washington， DC： IEEE Computer Society， 2004： 451-454.

[15] ZENG C， TANG L， LI T， et al. Mining temporal lag from fluctuating events for correlation and root cause analysis [C]// Proceedings of the 2014 10th International Conference on Network and Service Management. Washington， DC： IEEE Computer Society， 2014： 19-27.

[16] ZLLER M-A， BAUM M， HUBER M F. Framework for mining event correlations and time lags in large event sequences [C]// Proceedings of the IEEE 2017 15th International Conference of Industrial Informatics. Piscataway， NJ： IEEE， 2017： 3-4.

[17] BESL P J， McKAY N D. A method for registration of 3-D shapes [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2002， 14（2）：239-256.

[18] GUO S， LIU Z， CHEN W， et al. Event extration from streaming system logs [C]// ICISA 2018Proceedings of the 9th International Conference on Information Science and Applications， LNEE 514. Singapore： Springer， 2018： 465-474.

[19] LI T， JIANG Y， ZENG C， et al. FLAP： an end-to-end event log analysis platform for system management [C]// Proceedings of the 2017 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York： ACM， 2017： 1547-1556.

[20] FISCHLER M A， BOLLES R C. Random sample consensus： a paradigm for model fitting with applications to image analysis and automated cartography [J]. Readings in Computer Vision， 1987： 726-740.

[21] LI T， LIANG F， MA S， et al. An integrated framework on mining logs files for computing system management[C]// Proceedings of the 2005 Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York： ACM， 2005： 776-781.