CN109033419B - Multi-source data stream frequent plot mining method and device - Google Patents

Info

Publication number
CN109033419B
Authority
CN
China
Prior art keywords
data
lattice
grid
order
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810889153.4A
Other languages
Chinese (zh)
Other versions
CN109033419A (en
Inventor
尤涛
杜承烈
陈进朝
李亚敏
李宇博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN201810889153.4A
Publication of CN109033419A
Application granted
Publication of CN109033419B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a device for mining frequent episodes in multi-source data streams, and relates to the technical field of data mining. It addresses the problem that existing data stream mining techniques take a single data stream as the mining object and therefore cannot mine data that is multi-level, multi-angle and multi-directional. The method comprises the following steps: starting from the initial point of a data grid, traversing the data streams represented by the data points without precedence relationship in the data grid, and determining the data streams without precedence relationship as a data grid sequence according to the characteristic that the data streams of different processes occur at intervals; and confirming the mixed frequent episodes in the data grid sequence by a frequent episode mining method.

Description

Multi-source data stream frequent plot mining method and device
Technical Field
The invention relates to the technical field of data mining, in particular to a method and a device for mining frequent plots of a multi-source data stream.
Background
At present, data mining is widely applied in intelligent transportation systems. An intelligent transportation system is a comprehensive traffic management system built by effectively integrating advanced information technology, data communication and transmission technology, electronic sensing technology, control technology, computer technology and the like into the whole ground traffic management system; it operates over a wide area and in all directions, and is real-time, accurate and efficient. In road-section flow control for intelligent transportation, multi-source data stream frequent episode mining is applied mainly as follows: based on the road information collected by traffic sensors on different vehicles, the mining method identifies the road sections that many vehicles are likely to pass through, and corresponding road improvement measures are proposed. This helps prevent traffic jams and traffic accidents, keeps traffic moving at the highest allowable speed on the road, and facilitates commuting, travel and other activities.
In a wireless sensor network system, the sensor nodes deployed in the monitoring area cooperatively sense, collect and process information about the sensed objects within the network's coverage area and send that information to an observer. Only by combining the data collected by all nodes can an object be perceived jointly and complete information be obtained. For example, let {ABADCCA…} and {HFFEGHH…} be the stream data on two nodes of a wireless sensor network system, where B occurs before E and G occurs before D; an observer may then observe the combined data stream in the order {HFFABEGADCCAHH…}.
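The observed order in this example can be checked mechanically. The following is a minimal sketch, not part of the patent: it tests whether an observed sequence is a merge of two streams that keeps each stream's own order, and whether the stated cross-stream constraints (B before E, G before D) hold. The function names `is_interleaving` and `occurs_before` are hypothetical, and only the stream prefixes shown in the text are used.

```python
from functools import lru_cache


def is_interleaving(observed: str, s1: str, s2: str) -> bool:
    """Return True if `observed` merges s1 and s2 while keeping each stream's order."""
    if len(observed) != len(s1) + len(s2):
        return False

    @lru_cache(maxsize=None)
    def ok(i: int, j: int) -> bool:
        # i items of s1 and j items of s2 have been consumed so far.
        if i == len(s1) and j == len(s2):
            return True
        k = i + j
        if i < len(s1) and s1[i] == observed[k] and ok(i + 1, j):
            return True
        return j < len(s2) and s2[j] == observed[k] and ok(i, j + 1)

    return ok(0, 0)


def occurs_before(observed: str, x: str, y: str) -> bool:
    """Return True if the first x precedes the first y in the observed order."""
    return observed.index(x) < observed.index(y)


observed = "HFFABEGADCCAHH"  # prefix quoted in the text
node_1, node_2 = "ABADCCA", "HFFEGHH"
valid = (
    is_interleaving(observed, node_1, node_2)
    and occurs_before(observed, "B", "E")
    and occurs_before(observed, "G", "D")
)
```

Many other merges of the two streams satisfy the same constraints; the observer sees only one of them.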
Existing data stream mining techniques are limited to mining a single data stream, and the mining is usually performed over a specific time interval or window. Three window models are common: the landmark window model, the sliding window model and the decay (attenuation) window model. In the landmark window model, the data set used for mining is the set of all tuples from the start of the data stream up to the current arrival, so the window grows with the data stream. In the sliding window model, the data set used for mining is the set of the N most recently arrived tuples counted back from the current time, where N is the size of the sliding window; the window slides continuously along the time axis as the data stream advances. In the decay window model, the data set used for mining is again the set of all tuples from the start of the data stream up to the current arrival, but each tuple carries a weight that decays continuously over time according to some decay function. The mining object of these typical techniques is a single-source data stream. In many applications, however, the sources of data are multi-level, multi-angle and multi-directional, and traditional single-source data stream mining can no longer meet these needs.
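The three window models can be contrasted in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the exponential decay function is an assumed choice, since the text says only "some decay function".

```python
import math
from collections import deque


def landmark_window(stream):
    """After each arrival, yield every tuple seen since the start of the stream."""
    window = []
    for item in stream:
        window.append(item)
        yield list(window)


def sliding_window(stream, n):
    """After each arrival, yield the n most recently arrived tuples."""
    window = deque(maxlen=n)
    for item in stream:
        window.append(item)
        yield list(window)


def decay_window(stream, decay_rate):
    """After each arrival, yield (tuple, weight) pairs for all tuples so far.

    Assumption: weight = exp(-decay_rate * age), with age counted in arrivals.
    """
    window = []
    for item in stream:
        window.append(item)
        now = len(window) - 1
        yield [(x, math.exp(-decay_rate * (now - t))) for t, x in enumerate(window)]
```

For the stream `"ABCD"`, the last landmark window holds all four items, the last sliding window with `n=2` holds `['C', 'D']`, and the decay window holds all four items with the oldest weighted least.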
Disclosure of Invention
The embodiment of the invention provides a multi-source data stream frequent plot mining method and device, which are used for solving the problem that a mining object of the existing data stream mining technology is a single data stream, and mining cannot be performed aiming at multiple layers, multiple angles and multiple directions.
The embodiment of the invention provides a multi-source data stream frequent plot mining method, which comprises the following steps:
traversing data streams including data points without precedence relationship in the data grids from the initial point positions of the data grids, and determining the data streams without precedence relationship as a data grid sequence according to the characteristics of interval occurrence of the data streams corresponding to different processes;
and confirming mixed frequent plots in the data grid sequence by a frequent plot mining method.
Preferably, the data grid sequence comprises single-order data grids, multi-order data grids and complex data grids;
before the data stream without the precedence relationship is determined as the data lattice sequence, the method further includes:
confirming a first data item and a second data item which form a first multi-order data grid from the data stream without the precedence relationship, and generating a first binary sequence corresponding to the first multi-order data grid according to the first data item and the second data item; the first data item and the second data item belong to a first data stream and a second data stream, respectively; and/or
Confirming a third data item or a fourth data item forming a first single-order data lattice from the data stream without the precedence relationship, and generating the first single-order data lattice corresponding to the first single-order data lattice according to the third data item or the fourth data item; the third data item belongs to the first data stream and the fourth data item belongs to the second data stream; and/or
And confirming that at least two groups of multi-order data grids exist in the data stream without the precedence relationship, and confirming the two groups of multi-order data grids with at least one adjacent edge as complex data grids when at least one adjacent edge exists in the two groups of multi-order data grids.
Preferably, the determining the data stream without the precedence relationship as a data lattice sequence specifically includes:
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data lattice and at least one first single-order data lattice, confirming the combination of the first multi-order data lattice and the first single-order data lattice as the data lattice sequence; or
When the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid and at least one complex data grid, confirming the combination of the first multi-order data grid and the complex data grid as the data grid sequence; or
When the data stream without the precedence relationship is confirmed to comprise at least one first single-order data grid and at least one complex data grid, confirming the combination of the first single-order data grid and the complex data grid as the data grid sequence; or
And when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data lattice, at least one first single-order data lattice and at least one complex data lattice, confirming the combination of the first multi-order data lattice, the first single-order data lattice and the complex data lattice as the data lattice sequence.
Preferably, confirming the mixed frequent episodes in the data grid sequence by the frequent episode mining method specifically includes:
confirming the occurrence time of the episodes included in the data grid sequence and the length of each episode, wherein the episodes include at least two multi-order data grids and the data items forming a single-order data grid;
within the detection window, discarding an episode when its support is smaller than the minimum support threshold; otherwise, confirming the episode as a serial frequent episode;
confirming data items that exist simultaneously in at least two of the multi-order data grids and whose support is greater than the minimum support threshold as parallel frequent episodes;
and taking the union of the serial frequent episodes and the parallel frequent episodes to obtain the mixed frequent episodes.
An embodiment of the present invention further provides a device for mining frequent plots of multiple source data streams, including:
the first confirming unit is used for traversing data streams including data points without precedence relationship in the data grids from the initial point positions of the data grids, and confirming the data streams without precedence relationship into a data grid sequence according to the characteristics of interval occurrence of the data streams corresponding to different processes;
and the second confirming unit is used for confirming the mixed frequent plots in the data grid sequence by a frequent plot mining method.
Preferably, the data grid sequence comprises single-order data grids, multi-order data grids and complex data grids;
the first validation unit is further configured to: confirming a first data item and a second data item which form a first multi-order data grid from the data stream without the precedence relationship, and generating a first binary sequence corresponding to the first multi-order data grid according to the first data item and the second data item; the first data item and the second data item belong to a first data stream and a second data stream, respectively; and/or
Confirming a third data item or a fourth data item forming a first single-order data lattice from the data stream without the precedence relationship, and generating the first single-order data lattice corresponding to the first single-order data lattice according to the third data item or the fourth data item; the third data item belongs to the first data stream and the fourth data item belongs to the second data stream; and/or
And confirming that at least two groups of multi-order data grids exist in the data stream without the precedence relationship, and confirming the two groups of multi-order data grids with at least one adjacent edge as complex data grids when at least one adjacent edge exists in the two groups of multi-order data grids.
Preferably, the first confirming unit is specifically configured to:
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data lattice and at least one first single-order data lattice, confirming the combination of the first multi-order data lattice and the first single-order data lattice as the data lattice sequence; or
When the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid and at least one complex data grid, confirming the combination of the first multi-order data grid and the complex data grid as the data grid sequence; or
When the data stream without the precedence relationship is confirmed to comprise at least one first single-order data grid and at least one complex data grid, confirming the combination of the first single-order data grid and the complex data grid as the data grid sequence; or
And when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data lattice, at least one first single-order data lattice and at least one complex data lattice, confirming the combination of the first multi-order data lattice, the first single-order data lattice and the complex data lattice as the data lattice sequence.
Preferably, the second confirming unit is specifically configured to:
confirm the occurrence time of the episodes included in the data grid sequence and the length of each episode, wherein the episodes include at least two multi-order data grids and the data items forming a single-order data grid;
within the detection window, discard an episode when its support is smaller than the minimum support threshold; otherwise, confirm the episode as a serial frequent episode;
confirm data items that exist simultaneously in at least two of the multi-order data grids and whose support is greater than the minimum support threshold as parallel frequent episodes;
and take the union of the serial frequent episodes and the parallel frequent episodes to obtain the mixed frequent episodes.
The embodiment of the invention provides a method and a device for mining frequent episodes in multi-source data streams. The method comprises the following steps: starting from the initial point of a data grid, traversing the data streams represented by the data points without precedence relationship in the data grid, and determining the data streams without precedence relationship as a data grid sequence according to the characteristic that the data streams of different processes occur at intervals; and confirming the mixed frequent episodes in the data grid sequence by a frequent episode mining method. In this method, for data streams without precedence relationship, the data streams with precedence relationship are combined into complex data grids according to the characteristic that the data streams of different processes occur at intervals. This achieves the combination of multiple source data streams and, compared with the traditional method, preserves the parallel relationships among the data. Furthermore, without losing any data source information, it lays a foundation for the subsequent frequent episode mining and reduces episode redundancy, so that in the subsequent mining process the computational complexity of the dynamic programming method and the storage cost are both reduced, making the method better suited to complex data environments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a multi-source data stream frequent episode mining method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a data grid structure according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of two connection scenarios of complex data grids according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a data stream structure with "precedence relationship" according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the data grid structure corresponding to Fig. 4 according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a multi-source data stream frequent episode mining method according to a second embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a multi-source data stream frequent episode mining device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 shows an exemplary flowchart of a multi-source data stream frequent episode mining method provided by an embodiment of the present invention. As shown in Fig. 1, the method mainly includes the following steps:
Step 101: starting from the initial point of a data grid, traverse the data streams represented by the data points without precedence relationship in the data grid, and determine the data streams without precedence relationship as a data grid sequence according to the characteristic that the data streams of different processes occur at intervals;
Step 102: confirm the mixed frequent episodes in the data grid sequence by a frequent episode mining method.
It should be noted that the data lattices discussed in the embodiments of the present invention include two types, namely, "data points with precedence relationship" and "data points without precedence relationship". In practical applications, the "precedence relationship" includes the sequence relationship of events occurring in time and the causal relationship of events occurring logically. In the embodiment of the invention, uncertain characteristics exist in the precedence relationship of concurrent data generated by a plurality of source data streams. Therefore, when a multi-source data stream is synthesized, there are many possible cases of the data lattice sequence of the whole data stream observed in the conventional mode, that is, there are multiple data lattice sequences based on the sequence characteristics.
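To give a sense of how many observed sequences are possible: if two streams contribute m and n data items and no precedence relationship links any item of one stream to any item of the other, the number of merged orders that keep each stream's internal order is the binomial coefficient C(m+n, m). The sketch below computes it; the function name `count_orderings` is hypothetical, and the fully unconstrained case is assumed (every precedence relationship between the streams reduces the count).

```python
from math import comb


def count_orderings(m: int, n: int) -> int:
    """Number of merges of two ordered streams of lengths m and n.

    Assumes no precedence relationship between items of different streams.
    """
    return comb(m + n, m)


# Two streams of 7 items each, with no cross-stream constraints:
possible = count_orderings(7, 7)
```

Even for short streams the count grows quickly, which is why enumerating every possible sequence explicitly is costly.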
The embodiment of the invention takes as an example the data grid generated by combining two data streams, W(1) and W(2).
Fig. 2 is a schematic diagram of a data grid structure provided in an embodiment of the present invention. In Fig. 2, one symbol indicates that no precedence relationship exists between the two corresponding events; such a point is called a "data point without precedence relationship". The symbol 'x' indicates that a precedence relationship exists between the two corresponding events; such a point is called a "data point with precedence relationship".
As shown in Fig. 2, data grids are of two types: single-order data grids and multi-order data grids.
Single-order data grid [formula image BDA0001755381660000072]: when the data included in the data grid shown in Fig. 2 have a uniquely determined sequence [formula image BDA0001755381660000073], the data grid formed by that sequence is called a single-order data grid and is denoted S_{n-1}, where n is the number of data points in the single-order data grid. S_2 in Fig. 2 is a single-order data grid.
Multi-order data grid [formula image BDA0001755381660000074]: when the data included in the data grid shown in Fig. 2 have multiple undetermined sequences, and the data of the different processes occur at intervals, the data grid formed by these sequences is called a multi-order data grid and is denoted M_{(m-1)(n-1)}, where m and n are the numbers of data points of the respective processes in the multi-order data grid. M_{31} and M_{22} in Fig. 2 are both multi-order data grids.
Table 1 lists the symbols used above.
TABLE 1
[table images BDA0001755381660000071 and BDA0001755381660000081]
In practical applications, the data grid model of two source data streams shows that, as the two data streams are combined, the data grid may contain single-order data grids, multi-order data grids and complex data grids. A single-order data grid S_{n-1} generates a unique sequence [formula image BDA0001755381660000082], so it has low computation and storage overhead. For a multi-order data grid, the orders of the data items in the data stream are generated subject to the precedence relationship. The embodiment of the invention therefore provides a method for constructing a binary sequence: a single, simple binary sequence is generated to represent the multiple possible orderings in a data grid, which reduces space overhead.
In particular, the following describes methods for generating sequences of single-order data cells, multiple-order data cells, and complex data cells that may occur within a sequence of data cells.
When the data grid is composed of two groups of different data streams corresponding to different processes, because the precedence relationship does not exist between the two groups of different data streams, a plurality of sequences which can possibly occur exist in the data grid.
In one case:
when a group of data forming a single-order data grid is confirmed in the data streams without precedence relationship included in the data grid, that group of data is confirmed as a single-order data grid. For example, S_2 in Fig. 2 is a single-order data grid.
In another case:
when two groups of data items forming a multi-order data grid are confirmed in the data streams without precedence relationship included in the data grid, and the two groups belong to different data streams, the two groups of data items are confirmed as forming a multi-order data grid.
In another case:
when only two groups of multi-order data grids exist in the data streams without precedence relationship in the data grid and the two groups share at least one adjacent edge, the two groups of multi-order data grids sharing the adjacent edge are confirmed as a complex data grid.
The confirmation steps for a multi-order data grid are described in detail below with reference to Fig. 2:
Step 201: starting from the initial point at the lower left corner of the data grid, traverse the data streams represented by the data points without precedence relationship in the data grid, and find the multi-order data grids, which are characterized by data streams of different processes occurring at intervals;
Step 202: determine the data items that the different processes contribute to the multi-order data grid. For example, the first data item, contributed by process W(1), is [formula image BDA0001755381660000091], and the second data item, contributed by process W(2), is [formula image BDA0001755381660000092].
Step 203: from the first data item and the second data item, generate a binary sequence of the following form:
[formula image BDA0001755381660000093]
In the parentheses, row 1 holds the data items from W(1) and row 2 holds the data items from W(2). The two rows may be swapped; the embodiment of the invention does not limit the specific form of the binary sequence.
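The space saving of the binary sequence can be illustrated with a small data structure. This is a sketch under assumptions, not the patent's implementation: the class name `BinarySequence` and its method are hypothetical, and the two rows are assumed to be fully unordered relative to each other. The structure stores only the two rows and expands the possible orderings on demand.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class BinarySequence:
    """Two-row representation of a multi-order data grid (hypothetical sketch)."""

    row1: Tuple[str, ...]  # data items from W(1), in stream order
    row2: Tuple[str, ...]  # data items from W(2), in stream order

    def orderings(self) -> Iterator[Tuple[str, ...]]:
        """Lazily enumerate every merge that keeps each row's own order."""

        def merge(i: int, j: int, prefix: Tuple[str, ...]):
            if i == len(self.row1) and j == len(self.row2):
                yield prefix
                return
            if i < len(self.row1):
                yield from merge(i + 1, j, prefix + (self.row1[i],))
            if j < len(self.row2):
                yield from merge(i, j + 1, prefix + (self.row2[j],))

        return merge(0, 0, ())


# Four stored items stand in for all four possible orderings.
grid = BinarySequence(row1=("a",), row2=("b", "c", "d"))
all_orders = list(grid.orderings())
```

Storage grows with the number of items in the two rows, while the number of orderings represented grows combinatorially.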
In practical applications, several multi-order data grids may turn out to be cross-connected while the data grid is traversed. Such cases are identified as complex data grids, and in practice they are quite likely to occur. For these cases the embodiment of the invention uses a binary sequence group, that is, several binary sequences that together represent all possible sequences in the complex data grid.
Fig. 3 is a schematic diagram of two connection scenarios of complex data grids according to an embodiment of the present invention. The confirmation steps for a complex data grid are described in detail below with reference to Fig. 3:
Step 301: starting from the initial point at the lower left corner of the data grid, traverse the data streams represented by the data points without precedence relationship in the data grid, and find the complex data grids, which are characterized by several cross-connected multi-order data grids.
Step 302: when two multi-order data grids overlap on one edge, divide the complex data grid by multi-order data grid; for example, in Fig. 3(a), the "data point without precedence relationship" between event e and event f can be taken as the division point. When two multi-order data grids are cross-connected on two edges, likewise divide the complex data grid by multi-order data grid; for example, in Fig. 3(b), the "data point without precedence relationship" between event s and event t and the one between event n and event p can be taken as the division points of the respective data processes.
Step 303: list the binary sequences according to the division points and combine them to obtain the binary sequence group representing the complex data grid. The binary sequence group for Fig. 3(a) consists of three items:
[formula image BDA0001755381660000101]
The binary sequence group for Fig. 3(b) consists of four items:
[formula image BDA0001755381660000102]
When the data streams without precedence relationship in the data grid contain multi-order data grids and single-order data grids together with complex data grids, the multi-order data grids, single-order data grids and complex data grids can be combined into a data grid sequence. In the embodiment of the invention, the data grid sequence may be composed in any of the following ways:
one way is as follows:
when it is determined that the data stream without the precedence relationship includes at least one multi-order data cell and at least one single-order data cell, the at least one multi-order data cell and the at least one single-order data cell may be combined to form a data cell sequence. In this manner, the data cells do not include complex data cells, but the specific number of included multi-order data cells and single-order data cells is not limited.
In another mode:
when it is determined that the data stream without the precedence relationship includes at least one multi-order data cell and at least one complex data cell, the at least one multi-order data cell and the at least one complex data cell may be combined to form a data cell sequence. In this manner, single-order data cells are not included in the data cells, but the specific number of multi-order data cells and complex data cells included is not limited.
In another mode:
when it is determined that the data stream without the precedence relationship includes at least one single-order data cell and at least one complex data cell, the at least one single-order data cell and the at least one complex data cell may be combined to form a data cell sequence. In this manner, the data cells do not include multiple-order data cells, but the specific number of single-order data cells and complex data cells included is not limited.
In another mode:
when it is determined that the data stream without the precedence relationship includes at least one single-order data cell, at least one complex data cell, and at least one multi-order data cell, the at least one single-order data cell, the at least one complex data cell, and the at least one multi-order data cell may be combined to form a data cell sequence. In this manner, the specific number of single-order data cells, complex data cells, and multi-order data cells included in the data cells is not limited.
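The four ways listed above are exactly the combinations in which at least two of the three grid kinds are present. The sketch below encodes that observation; the enum `GridKind` and the function `can_form_grid_sequence` are hypothetical names, and whether a sequence made of a single kind is also permitted is not stated in the text, so the check treats it as not covered.

```python
from enum import Enum
from typing import Iterable


class GridKind(Enum):
    SINGLE_ORDER = "single-order"
    MULTI_ORDER = "multi-order"
    COMPLEX = "complex"


def can_form_grid_sequence(kinds: Iterable[GridKind]) -> bool:
    """True when at least two different grid kinds are present.

    This covers the four listed combinations; the count of each kind is unrestricted.
    """
    return len(set(kinds)) >= 2


example = [GridKind.MULTI_ORDER, GridKind.SINGLE_ORDER, GridKind.SINGLE_ORDER]
covered = can_form_grid_sequence(example)
```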
In step 102, after the data streams without precedence relationship are combined into a data lattice sequence, the data lattices may be processed according to a frequent plot mining method.
In practical application, the frequent episode mining is mainly divided into serial episode mining and parallel episode mining, and the mixed episode mining is to perform aggregate merging on the existing serial frequent episode mining and the existing parallel frequent episode mining.
Specifically, a plurality of plots included in the data lattice and the length of each plot are confirmed first, and in practical applications, the plots include multiple-order data lattices and data items constituting a single-order data lattice.
For example, consider the complex data grid composed of
[formula image BDA0001755381660000111]
In [formula image BDA0001755381660000112], a and bcd both occur at time 1, and the length is recorded as 3, taking the longest episode as the standard; e occurs at time 2 with length 1; f occurs at time 3 with length 1; in [formula image BDA0001755381660000113], xa and by both occur at time 4 with length 2.
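The occurrence times and lengths in this example follow a simple rule: each grid in the sequence gets the next time index, and its length is that of its longest row. A minimal sketch, assuming the occurrence time equals the grid's position in the sequence (as in the example) and using the hypothetical names `Grid` and `episode_table`:

```python
from typing import List, Tuple

# One string per row; a single-order grid has one row, a binary sequence has two.
Grid = Tuple[str, ...]


def episode_table(grids: List[Grid]) -> List[Tuple[int, Grid, int]]:
    """Return (occurrence time, grid, length) for each grid in the sequence.

    The length of a grid is the length of its longest row.
    """
    return [(t, g, max(len(row) for row in g)) for t, g in enumerate(grids, start=1)]


sequence = [("a", "bcd"), ("e",), ("f",), ("xa", "by")]
table = episode_table(sequence)
```

Here `table` records lengths 3, 1, 1 and 2 at times 1 to 4, matching the values stated in the text.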
The steps of frequent episode mining are described below with reference to the above example:
Step 401: scan the data grid to determine the occurrence times of the episodes included in the data grid and the length of each episode. For example, in [formula image BDA0001755381660000114], the length of a and bcd is 3; the length of episode e is 1; the length of episode f is 1; and in [formula image BDA0001755381660000115], the length of xa and by is 2.
Step 402: set a minimum support and a sequence window; the sequence window slides backward by a specific time interval each time. When a new data item arrives in the data stream, an episode whose support within the detection window is smaller than the minimum support threshold is discarded; otherwise, the episode is confirmed as a serial frequent episode.
For example, in the first sequence window
Figure BDA0001755381660000121
the episode <A,C> is a frequent pattern. It occurs twice in total, as <(A,1),(C,3)> and <(A,3),(C,4)>, so sup(<A,C>) ≥ min_sup = 2. The episode <A,C> is therefore called a serial frequent episode.
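The support count in this example can be sketched as follows. The contents of the window are an assumption (the figure is not reproduced in the text), and the counting rule used here, a greedy left-to-right scan with strictly increasing timestamps, is one common way to count serial episode occurrences rather than necessarily the exact rule of the method. The names `serial_occurrences` and `window` are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Occurrence = List[Tuple[str, int]]

def serial_occurrences(window: Dict[int, Set[str]],
                       episode: Sequence[str]) -> List[Occurrence]:
    """Greedily collect occurrences of a non-empty serial episode.

    Items of one occurrence must have strictly increasing timestamps.
    A time slot that completes an occurrence may also start the next one.
    """
    found: List[Occurrence] = []
    partial: Occurrence = []
    for t in sorted(window):
        items = window[t]
        expected = episode[len(partial)]
        if expected in items:
            partial.append((expected, t))
            if len(partial) == len(episode):
                found.append(partial)
                partial = []
                # Allow the same slot to open a new occurrence.
                if len(episode) > 1 and episode[0] in items:
                    partial.append((episode[0], t))
    return found

# Assumed window of six time slots; slot 3 stands for a multi-order grid.
window = {1: {"A"}, 2: {"B"}, 3: {"D", "C", "A", "E"},
          4: {"C"}, 5: {"A"}, 6: {"B"}}
min_sup = 2

occurrences = serial_occurrences(window, ["A", "C"])
is_frequent = len(occurrences) >= min_sup
```

With this assumed window the scan finds the two occurrences named in the text, <(A,1),(C,3)> and <(A,3),(C,4)>, so <A,C> meets a minimum support of 2.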
Step 403: when a binary sequence structure is generated at some moment while scanning the data stream, store it in a two-dimensional array until the next binary sequence is encountered; data items that co-exist in at least two multi-order data grids and satisfy the minimum support threshold are identified as parallel frequent episodes.
For example,
Figure BDA0001755381660000122
and
Figure BDA0001755381660000123
are binary sequence structures in the data stream, and both contain the episode
Figure BDA0001755381660000124
with sup
Figure BDA0001755381660000125
so it is a frequent parallel episode. Considering episode expansion,
Figure BDA0001755381660000126
also satisfies the minimum support threshold, so
Figure BDA0001755381660000127
is identified as a parallel frequent episode.
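A minimal sketch of the parallel counting in step 403 follows. It assumes that each multi-order data grid is stored as a pair of item strings (one per data stream) and that a candidate parallel episode is a pair of items, one from each stream, inside the same grid. The grids below are hypothetical, and episode expansion to larger item sets is not covered.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

def parallel_frequent(grids: List[Tuple[str, str]],
                      min_sup: int) -> Dict[Tuple[str, str], int]:
    """Count cross-stream item pairs that co-occur inside one grid."""
    counts: Counter = Counter()
    for row1, row2 in grids:
        # set() ensures each pair is counted at most once per grid.
        counts.update(set(product(row1, row2)))
    return {pair: n for pair, n in counts.items() if n >= min_sup}

# Hypothetical multi-order grids: (items from stream 1, items from stream 2).
grids = [("DCA", "E"), ("DC", "BE")]

frequent = parallel_frequent(grids, min_sup=2)
```

Here the pairs (D, E) and (C, E) appear in both grids and are kept, while pairs seen in only one grid are dropped.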
Step 404: finally, take the set union of the serial frequent episodes and the parallel frequent episodes to obtain the mixed frequent episodes.
For example, merging the above episode <A,C> with the episode
Figure BDA0001755381660000128
yields the mixed frequent episodes <A,C>,
Figure BDA0001755381660000129
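The merging step is a plain set union. A short sketch, assuming serial episodes are stored as ordered tuples and parallel episodes as unordered frozensets (both representations, and the parallel episode shown, are illustrative assumptions):

```python
from typing import FrozenSet, Set, Tuple, Union

Episode = Union[Tuple[str, ...], FrozenSet[str]]

def merge_episodes(serial: Set[Tuple[str, ...]],
                   parallel: Set[FrozenSet[str]]) -> Set[Episode]:
    """Mixed frequent episodes are the union of both result sets."""
    return set(serial) | set(parallel)

serial_result = {("A", "C")}                # the serial episode <A,C>
parallel_result = {frozenset({"D", "E"})}   # hypothetical parallel episode

mixed = merge_episodes(serial_result, parallel_result)
```

Keeping tuples for serial episodes and frozensets for parallel ones preserves the distinction between ordered and unordered episodes after the merge.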
In order to more clearly describe the method for mining the frequent episode of the multi-source data stream according to the embodiment of the present invention, the method is further described below by taking fig. 4, fig. 5, and fig. 6 as examples.
Example one
Fig. 4 is a schematic diagram of a data stream structure with "precedence relationship" according to an embodiment of the present invention; Fig. 5 is a schematic diagram of the data grid structure corresponding to Fig. 4 according to an embodiment of the present invention.
As shown in Fig. 4, there are two data streams, W(1) and W(2), where W(1): ADCADCBC… and W(2): BECABEAA…. The precedence relationships among the data items of the two data streams are shown in Fig. 4. In Fig. 5, 'x' marks a pair of corresponding events that have a precedence relationship, and the other mark indicates that the two corresponding events have no precedence relationship.
For data streams W(1) and W(2), frequent episode mining may be performed according to Fig. 5, specifically as follows:
Step 501, traverse the whole data grid and find the first multi-order data grid M31;
Step 502, determine the two groups of data items of the first multi-order data grid M31: the data items contributed by data source W(1) are {DCA}, and the data item contributed by data source W(2) is {E};
Step 503, generate a binary sequence from the two groups of data items according to the binary sequence generation method:
Figure BDA0001755381660000131
It should be noted that, by repeating steps 501 to 503, the second multi-order data grid M22 can be found, yielding the binary sequence
Figure BDA0001755381660000132
The sequence corresponding to the first single-order data grid S2 is {CA}, and the sequence corresponding to the second single-order data grid S4 is {BCAA}. The multi-order data grids and the single-order data grids are combined to form the data grid sequence, specifically
Figure BDA0001755381660000133
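The exact layout of the binary sequence produced in step 503 is given in a figure that is not reproduced here. One plausible encoding, following the two-dimensional array mentioned in step 403, is a two-row array padded to the longer row. The sketch below shows this for the two groups {DCA} and {E} of M31; the function name and the `None` padding are assumptions.

```python
from typing import List, Optional

def binary_sequence(items_w1: str, items_w2: str) -> List[List[Optional[str]]]:
    """Store two groups of data items as a two-row array of equal width."""
    width = max(len(items_w1), len(items_w2))

    def pad(items: str) -> List[Optional[str]]:
        # Missing positions in the shorter row are filled with None.
        return list(items) + [None] * (width - len(items))

    return [pad(items_w1), pad(items_w2)]

# First multi-order data grid M31: {DCA} from W(1) and {E} from W(2).
m31 = binary_sequence("DCA", "E")
m31_length = len(m31[0])
```

The array width equals the grid length used in step 504, so M31 has length 3.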
Step 504, scan the data grid sequence and determine the episodes it includes and the length of each episode. For example, episode A occurs at time 1, episode B occurs at time 2,
Figure BDA0001755381660000134
occurs at time 3 and has length 3, C and A occur at times 4 and 5, respectively, each with length 1,
Figure BDA0001755381660000135
occurs at time 6 and has length 2, and B, C, A, and A occur at times 7, 8, 9, and 10, respectively, each with length 1.
Step 505, determine the serial frequent episodes. For example, set the minimum support min_sup = 2 and the sequence window size to 6, and slide the sequence window backward by a time interval of 2 each time to simulate the streaming arrival of data. In the first sequence window
Figure BDA0001755381660000136
the episode <A,C> is a frequent pattern. It occurs twice in total, as <(A,1),(C,3)> and <(A,3),(C,4)>, so sup(<A,C>) ≥ min_sup = 2. The episode <A,C> is therefore called a serial frequent episode. The sequence window is then slid backward by a time interval of 2, and the next window is examined in the same way.
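The window movement in step 505 can be sketched with a small generator. It assumes the ten time slots 1 to 10 of step 504, a window size of 6, and a slide interval of 2; `sliding_windows` is a hypothetical helper name.

```python
from typing import Iterator, List

def sliding_windows(time_slots: List[int], size: int, step: int) -> Iterator[List[int]]:
    """Yield consecutive windows of `size` slots, advancing by `step` slots."""
    last_start = max(len(time_slots) - size, 0)
    for start in range(0, last_start + 1, step):
        yield time_slots[start:start + size]

# Time slots 1..10, window size 6, slide interval 2.
windows = list(sliding_windows(list(range(1, 11)), size=6, step=2))

for w in windows:
    print(f"window covers times {w[0]} to {w[-1]}")
```

Each window would then be passed to the support counting for serial episodes before the window slides again.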
Step 506, mine the parallel frequent episodes. For example,
Figure BDA0001755381660000141
and
Figure BDA0001755381660000142
are binary sequence structures in the data stream, and both contain the episode
Figure BDA0001755381660000143
with sup
Figure BDA0001755381660000144
so it is a frequent parallel episode. Considering episode expansion,
Figure BDA0001755381660000145
also satisfies the minimum support threshold, so
Figure BDA0001755381660000146
is identified as a parallel frequent episode.
Step 507, finally, merge the serial frequent episodes and the parallel frequent episodes to obtain the mixed frequent episodes <A,C>,
Figure BDA0001755381660000147
Example two
Fig. 6 is a schematic diagram of a multi-source data stream frequent episode mining method according to a second embodiment of the present invention. As shown in Fig. 6, the method includes the following steps:
Step 601, form a data grid from two groups of data streams without precedence relationship;
Step 602, traverse the data grid without precedence relationship; if a single-order data grid is found, execute step 603, and if a multi-order data grid is found, execute step 604;
Step 603, because a single-order data grid generates a unique sequence, determine from the data streams without precedence relationship the first data item or the second data item forming the first single-order data grid, and directly generate a single-order data grid sequence from the first data item or the second data item;
Step 604, if it is determined that the multi-order data grids have no adjacency or crossing relationship with one another, execute step 605; otherwise, execute step 606;
Step 605, determine from the data streams without precedence relationship the third data item and the fourth data item forming a multi-order data grid, and form a binary sequence from the third data item and the fourth data item, the binary sequence corresponding to the multi-order data grid;
Step 606, determine at least two groups of multi-order data grids from the data streams without precedence relationship, the two groups of multi-order data grids having at least one adjacent edge, and generate a binary sequence group from the two groups of multi-order data grids;
It should be noted that, in step 606, the two multi-order data grids may share one coinciding edge, or two of their edges may cross; see the description of Fig. 3 for details.
Step 607, when the current data item is the last item in the whole data stream without precedence relationship, execute step 608; otherwise, repeat steps 602 to 606;
Step 608, combine the sequences formed in steps 603, 605 and 606 to form the data grid sequence.
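The branching in steps 602 to 608 can be sketched as a single loop. The sketch covers the control flow only: it assumes the grids have already been detected and tagged, and that each multi-order grid carries a flag saying whether it is adjacent to or crosses the next grid. The tuple layout, the flag, and all sample grids are hypothetical.

```python
from typing import List, Tuple

# (kind, rows, touches_next): kind is "single" or "multi"; touches_next marks a
# multi-order grid that is adjacent to or crosses the following grid.
Cell = Tuple[str, Tuple[str, ...], bool]

def build_grid_sequence(cells: List[Cell]) -> list:
    sequence = []
    group = []  # multi-order grids currently being merged into a complex grid
    for kind, rows, touches_next in cells:
        if kind == "single":                    # step 603: unique sequence
            sequence.append(("single", rows[0]))
        elif touches_next or group:             # step 606: binary sequence group
            group.append(rows)
            if not touches_next:                # the group is complete
                sequence.append(("complex", tuple(group)))
                group = []
        else:                                   # step 605: one binary sequence
            sequence.append(("multi", rows))
    if group:                                   # flush a group left open at the end
        sequence.append(("complex", tuple(group)))
    return sequence                             # step 608: combined sequence

cells = [
    ("single", ("A",), False),
    ("multi", ("DCA", "E"), False),
    ("multi", ("XA", "BY"), True),
    ("multi", ("PQ", "R"), False),
    ("single", ("CA",), False),
]
grid_sequence = build_grid_sequence(cells)
```

In this sample input the isolated grid becomes one binary sequence, while the two touching grids are grouped into one complex data grid.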
In summary, in the multi-source data stream frequent episode mining method, the data streams without precedence relationship are combined into complex data grids according to the characteristic that data streams corresponding to different processes occur at intervals. This combines the multi-source data streams and, compared with conventional methods, preserves the parallel relationships between data. Furthermore, while ensuring that no data source information is lost, the method provides a foundation for subsequent frequent item mining and reduces episode redundancy, so that in the subsequent frequent item mining process the computational complexity of the dynamic programming method and the storage cost are both reduced, making the method better suited to complex data environments.
Based on the same inventive concept, embodiments of the present invention provide a device for mining a multiple-source data stream frequent plot, and because the principle of the device for solving the technical problem is similar to that of a method for mining a multiple-source data stream frequent plot, the implementation of the device may refer to the implementation of the method, and repeated parts are not described again.
Fig. 7 is a schematic structural diagram of a multi-source data stream frequent episode mining apparatus according to an embodiment of the present invention. As shown in Fig. 7, the apparatus includes a first confirming unit 701 and a second confirming unit 702.
The first confirming unit 701 is configured to traverse, from the initial point position of the data grid, the data streams including data points without precedence relationship in the data grid, and to confirm the data streams without precedence relationship as a data grid sequence according to the characteristic that data streams corresponding to different processes occur at intervals;
the second confirming unit 702 is configured to confirm the mixed frequent episodes in the data grid sequence by a frequent episode mining method.
Further, the data grid sequence comprises single-order data grids, multi-order data grids and complex data grids;
the first confirming unit 701 is further configured to: confirm a first data item and a second data item which form a first multi-order data grid from the data stream without the precedence relationship, and generate a first binary sequence corresponding to the first multi-order data grid according to the first data item and the second data item; the first data item and the second data item belong to a first data stream and a second data stream, respectively; and/or
confirm a third data item or a fourth data item forming a first single-order data grid from the data stream without the precedence relationship, and generate a first single-order data grid sequence corresponding to the first single-order data grid according to the third data item or the fourth data item; the third data item belongs to the first data stream and the fourth data item belongs to the second data stream; and/or
confirm that at least two groups of multi-order data grids exist in the data stream without the precedence relationship, and, when the two groups of multi-order data grids have at least one adjacent edge, confirm the two groups of multi-order data grids as a complex data grid.
Further, the first confirming unit 701 is specifically configured to:
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid and at least one first single-order data grid, confirming the combination of the first multi-order data grid and the first single-order data grid as the data grid sequence; or
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid and at least one complex data grid, confirming the combination of the first multi-order data grid and the complex data grid as the data grid sequence; or
when the data stream without the precedence relationship is confirmed to comprise at least one first single-order data grid and at least one complex data grid, confirming the combination of the first single-order data grid and the complex data grid as the data grid sequence; or
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid, at least one first single-order data grid and at least one complex data grid, confirming the combination of the first multi-order data grid, the first single-order data grid and the complex data grid as the data grid sequence.
Further, the second confirming unit 702 is specifically configured to:
confirm the occurrence time of the episodes included in the data grid sequence and the length of each episode, wherein the episodes include at least two multi-order data grids and the data items forming a single-order data grid;
discard an episode when its support within the detection window is smaller than the minimum support threshold; otherwise, confirm the episode as a serial frequent episode;
confirm data items that co-exist in at least two of the multi-order data grids and satisfy the minimum support threshold as parallel frequent episodes;
and take the set union of the serial frequent episodes and the parallel frequent episodes to obtain the mixed frequent episodes.
It should be understood that the units included in the above multi-source data stream frequent episode mining apparatus are only a logical division according to the functions implemented by the apparatus; in practical applications, the above units may be combined or split. The functions implemented by the apparatus provided in this embodiment correspond one to one to the multi-source data stream frequent episode mining method provided in the above embodiment. The more detailed processing flow implemented by the apparatus has already been described in the method embodiment and is not repeated here.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (6)

1. A multi-source data stream frequent plot mining method is characterized by comprising the following steps:
traversing data streams without precedence relations in the data grids from the initial point positions of the data grids, and confirming the data streams without precedence relations as data grid sequences according to the characteristics of interval occurrence of the data streams corresponding to different processes;
confirming mixed frequent plots in the data grid sequence by a frequent plot mining method;
the method for determining the mixed frequent plots in the data grid sequence by the frequent plot mining method specifically comprises the following steps:
confirming the occurrence time of the plots included by the data grid sequence and the length of each plot, wherein the plots include at least two multi-sequence data grids and data items forming a single-sequence data grid;
when the support degree of the occurrence plot is smaller than the minimum support degree threshold value in the detection window, discarding the plot; otherwise, confirming the episode greater than the minimum support threshold as a serial frequent episode;
identifying data items that co-exist within at least two of the multi-ordinal data grids and that are greater than a minimum support threshold as being parallel frequent episodes;
and collecting the serial frequent plots and the parallel frequent plots to obtain the mixed frequent plots.
2. The method of claim 1, wherein the sequence of data cells comprises a single-order data cell, a multiple-order data cell, a complex data cell;
before the data stream without the precedence relationship is determined as the data lattice sequence, the method further includes:
confirming a first data item and a second data item which form a first multi-order data grid from the data stream without the precedence relationship, and generating a first binary sequence corresponding to the first multi-order data grid according to the first data item and the second data item; the first data item and the second data item belong to a first data stream and a second data stream, respectively; and/or
Confirming a third data item or a fourth data item forming a first single-order data lattice from the data stream without the precedence relationship, and generating the first single-order data lattice corresponding to the first single-order data lattice according to the third data item or the fourth data item; the third data item belongs to the first data stream and the fourth data item belongs to the second data stream; and/or
And confirming that at least two groups of multi-order data grids exist in the data stream without the precedence relationship, and confirming the two groups of multi-order data grids with at least one adjacent edge as complex data grids when at least one adjacent edge exists in the two groups of multi-order data grids.
3. The method according to claim 2, wherein the confirming the data stream without the precedence relationship as a data lattice sequence specifically comprises:
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data lattice and at least one first single-order data lattice, confirming the combination of the first multi-order data lattice and the first single-order data lattice as the data lattice sequence; or
When the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid and at least one complex data grid, confirming the combination of the first multi-order data grid and the complex data grid as the data grid sequence; or
When the data stream without the precedence relationship is confirmed to comprise at least one first single-order data grid and at least one complex data grid, confirming the combination of the first single-order data grid and the complex data grid as the data grid sequence; or
And when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data lattice, at least one first single-order data lattice and at least one complex data lattice, confirming the combination of the first multi-order data lattice, the first single-order data lattice and the complex data lattice as the data lattice sequence.
4. A multi-source data stream frequent plot mining device is characterized by comprising:
the first confirming unit is used for traversing data streams including data points without precedence relationship in the data grids from the initial point positions of the data grids, and confirming the data streams without precedence relationship into a data grid sequence according to the characteristics of interval occurrence of the data streams corresponding to different processes;
the second confirming unit is used for confirming the mixed frequent plots in the data grid sequence by a frequent plot mining method;
wherein the second confirmation unit is specifically configured to:
confirming the occurrence time of the plots included by the data grid sequence and the length of each plot, wherein the plots include at least two multi-sequence data grids and data items forming a single-sequence data grid;
when the support degree of the occurrence plot is smaller than the minimum support degree threshold value in the detection window, discarding the plot; otherwise, confirming the episode greater than the minimum support threshold as a serial frequent episode;
identifying data items that co-exist within at least two of the multi-ordinal data grids and that are greater than a minimum support threshold as being parallel frequent episodes;
and collecting the serial frequent plots and the parallel frequent plots to obtain the mixed frequent plots.
5. The apparatus of claim 4, wherein the sequence of data cells comprises a single-order data cell, a multiple-order data cell, a complex data cell;
the first confirming unit is further configured to: confirming a first data item and a second data item which form a first multi-order data grid from the data stream without the precedence relationship, and generating a first binary sequence corresponding to the first multi-order data grid according to the first data item and the second data item; the first data item and the second data item belong to a first data stream and a second data stream, respectively; and/or
Confirming a third data item or a fourth data item forming a first single-order data lattice from the data stream without the precedence relationship, and generating the first single-order data lattice corresponding to the first single-order data lattice according to the third data item or the fourth data item; the third data item belongs to the first data stream and the fourth data item belongs to the second data stream; and/or
And confirming that at least two groups of multi-order data grids exist in the data stream without the precedence relationship, and confirming the two groups of multi-order data grids with at least one adjacent edge as complex data grids when at least one adjacent edge exists in the two groups of multi-order data grids.
6. The apparatus of claim 5, wherein the first confirming unit is specifically configured to:
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data lattice and at least one first single-order data lattice, confirming the combination of the first multi-order data lattice and the first single-order data lattice as the data lattice sequence; or
When the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid and at least one complex data grid, confirming the combination of the first multi-order data grid and the complex data grid as the data grid sequence; or
When the data stream without the precedence relationship is confirmed to comprise at least one first single-order data grid and at least one complex data grid, confirming the combination of the first single-order data grid and the complex data grid as the data grid sequence; or
And when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data lattice, at least one first single-order data lattice and at least one complex data lattice, confirming the combination of the first multi-order data lattice, the first single-order data lattice and the complex data lattice as the data lattice sequence.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810889153.4A CN109033419B (en) 2018-08-06 2018-08-06 Multi-source data stream frequent plot mining method and device

Publications (2)

Publication Number Publication Date
CN109033419A CN109033419A (en) 2018-12-18
CN109033419B true CN109033419B (en) 2022-03-11

Family

ID=64649765

Country Status (1)

Country Link
CN (1) CN109033419B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101667197A (en) * 2009-09-18 2010-03-10 Zhejiang University Mining method of data stream association rules based on sliding window
US20110184922A1 (en) * 2010-01-18 2011-07-28 Industry-Academic Cooperation Foundation, Yonsei University Method for finding frequent itemsets over long transaction data streams

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
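"Online Frequent Episode Mining"; Xiang Ao et al.; 2015 IEEE 31st International Conference on Data Engineering; 2015-06-01; pp. 891-902 *
"Frequent Itemset Mining Algorithm for Multi-Source Heterogeneous Information"; Liu Zili (刘自力) et al.; Computer Technology and Development; 2017-06-30; Vol. 27, No. 6; pp. 76-80 *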

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant