CN109033419B - Multi-source data stream frequent plot mining method and device - Google Patents
Multi-source data stream frequent plot mining method and device
- Publication number
- CN109033419B (application CN201810889153.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- lattice
- grid
- order
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a method and a device for mining frequent episodes in multi-source data streams, and relates to the technical field of data mining. It addresses the problem that existing data stream mining techniques take a single data stream as the mining object and therefore cannot mine data that is multi-level, multi-angle, and multi-directional. The method comprises: starting from the initial point of a data grid, traversing the data streams represented by the data points without precedence relationship in the data grid, and determining those data streams as a data grid sequence according to the characteristic that data streams corresponding to different processes occur at intervals; and identifying the mixed frequent episodes in the data grid sequence by a frequent episode mining method.
Description
Technical Field
The invention relates to the technical field of data mining, in particular to a method and a device for mining frequent plots of a multi-source data stream.
Background
At present, data mining is widely applied in intelligent traffic systems. An intelligent traffic system is a comprehensive traffic management system that integrates advanced information technology, data communication and transmission technology, electronic sensing technology, control technology, computer technology, and the like into the whole ground traffic management system; it operates over a wide area and in all directions, and is real-time, accurate, and efficient. In intelligent traffic road-section flow control, multi-source data stream frequent episode mining is applied mainly as follows: based on the road information collected by traffic sensors on different vehicles, the mining method identifies the road sections that multiple vehicles are likely to pass through, so that road improvement measures can be proposed. This helps prevent traffic jams and traffic accidents, keeps traffic flowing at the highest allowable speed, and facilitates activities such as commuting and travel.
In a wireless sensor network system, the sensor nodes deployed in the monitoring area cooperatively sense, collect, and process information about the sensed objects in the coverage area of the network, and send that information to an observer. By combining the data collected by each node, the object can be perceived jointly and complete information can be obtained. For example, suppose the stream data {ABADCCA…} and {HFFEGHH…} come from two nodes in a wireless sensor network system, where B occurs before E and G occurs before D. An observer may then see the merged data stream in the order {HFFABEGADCCAHH…}.
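The example above can be checked mechanically. The following is a minimal sketch, not part of the patented method, that tests whether an observed sequence is a valid merge of two source streams and whether it respects a stated precedence constraint. The function names `is_interleaving` and `occurs_before` are hypothetical, and the precedence check assumes that the constraint refers to the first occurrence of each item.

```python
from functools import lru_cache


def is_interleaving(observed: str, a: str, b: str) -> bool:
    """Return True if `observed` merges `a` and `b` while keeping each stream's own order."""
    if len(observed) != len(a) + len(b):
        return False

    @lru_cache(maxsize=None)
    def fits(i: int, j: int) -> bool:
        # i items of `a` and j items of `b` have been consumed so far.
        k = i + j
        if k == len(observed):
            return True
        take_a = i < len(a) and a[i] == observed[k] and fits(i + 1, j)
        take_b = j < len(b) and b[j] == observed[k] and fits(i, j + 1)
        return take_a or take_b

    return fits(0, 0)


def occurs_before(observed: str, first: str, second: str) -> bool:
    """Assumption: a constraint compares the first occurrence of each item."""
    return observed.index(first) < observed.index(second)


if __name__ == "__main__":
    merged = "HFFABEGADCCAHH"
    print(is_interleaving(merged, "ABADCCA", "HFFEGHH"))
    print(occurs_before(merged, "B", "E"), occurs_before(merged, "G", "D"))
```

A merge such as "ABADCCAHFFEGHH" is still a valid interleaving of the two streams, but it violates the constraint that G occurs before D, so only some interleavings are admissible observations.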
Existing data stream mining techniques are limited to mining a single data stream and usually operate on a specific time interval or window. There are three common window models: the landmark window model, the sliding window model, and the decay window model. In the landmark window model, the data set used for mining is the set of all tuples from the start of the data stream up to the current arrival, so the window grows with the data stream. In the sliding window model, the data set used for mining is the set of the N most recently arrived tuples counted back from the current time, where N is the size of the sliding window; the window position slides continuously along the time axis as the data stream advances. In the decay window model, the data set used for mining is again the set of all tuples from the start of the data stream up to the current arrival, but the tuples carry different weights, and the weight of each tuple decays continuously over time according to a decay function. Typical existing data stream mining techniques therefore mine a single-source data stream, whereas in many applications the data sources are multi-level, multi-angle, and multi-directional, and conventional single-source data stream mining cannot meet these needs.
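The three window models can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the stream is modelled as an in-memory sequence indexed by arrival order, and the exponential decay function with rate `lam` is an assumption, since the text only says "a decay function".

```python
import math
from typing import List, Sequence, Tuple


def landmark_window(stream: Sequence[str], now: int) -> List[str]:
    """All tuples from the start of the stream up to the current arrival."""
    return list(stream[:now])


def sliding_window(stream: Sequence[str], now: int, n: int) -> List[str]:
    """The n most recently arrived tuples."""
    return list(stream[max(0, now - n):now])


def decay_window(stream: Sequence[str], now: int, lam: float) -> List[Tuple[str, float]]:
    """All tuples so far, each weighted by exp(-lam * age) (assumed decay function)."""
    return [(item, math.exp(-lam * (now - 1 - t)))
            for t, item in enumerate(stream[:now])]


if __name__ == "__main__":
    stream = "ABADCCA"
    print(landmark_window(stream, 5))
    print(sliding_window(stream, 5, 3))
    print(decay_window(stream, 5, 0.5))
```

The landmark and decay windows both keep the whole history and differ only in weighting, while the sliding window bounds memory by discarding older tuples.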
Disclosure of Invention
The embodiments of the invention provide a method and a device for mining frequent episodes in multi-source data streams, which address the problem that existing data stream mining techniques take a single data stream as the mining object and therefore cannot mine multi-level, multi-angle, and multi-directional data.
The embodiment of the invention provides a multi-source data stream frequent plot mining method, which comprises the following steps:
starting from the initial point of the data grid, traversing the data streams represented by the data points without precedence relationship in the data grid, and determining the data streams without precedence relationship as a data grid sequence according to the characteristic that data streams corresponding to different processes occur at intervals;
and confirming the mixed frequent episodes in the data grid sequence by a frequent episode mining method.
Preferably, the data grid sequence comprises single-order data grids, multi-order data grids and complex data grids;
before the data stream without the precedence relationship is determined as the data lattice sequence, the method further includes:
confirming, from the data stream without the precedence relationship, a first data item and a second data item which form a first multi-order data grid, and generating a first binary sequence corresponding to the first multi-order data grid according to the first data item and the second data item, wherein the first data item and the second data item belong to a first data stream and a second data stream, respectively; and/or
confirming, from the data stream without the precedence relationship, a third data item or a fourth data item forming a first single-order data grid, and generating the first single-order data grid according to the third data item or the fourth data item, wherein the third data item belongs to the first data stream and the fourth data item belongs to the second data stream; and/or
confirming that at least two groups of multi-order data grids exist in the data stream without the precedence relationship, and, when at least one adjacent edge exists between the two groups of multi-order data grids, confirming the two groups of multi-order data grids as a complex data grid.
Preferably, the determining the data stream without the precedence relationship as a data lattice sequence specifically includes:
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid and at least one first single-order data grid, confirming the combination of the first multi-order data grid and the first single-order data grid as the data grid sequence; or
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid and at least one complex data grid, confirming the combination of the first multi-order data grid and the complex data grid as the data grid sequence; or
when the data stream without the precedence relationship is confirmed to comprise at least one first single-order data grid and at least one complex data grid, confirming the combination of the first single-order data grid and the complex data grid as the data grid sequence; or
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid, at least one first single-order data grid, and at least one complex data grid, confirming the combination of the first multi-order data grid, the first single-order data grid, and the complex data grid as the data grid sequence.
Preferably, the determining the mixed frequent episodes in the data lattice sequence by the frequent episode mining method specifically includes:
confirming the occurrence time of the episodes included in the data grid sequence and the length of each episode, wherein the episodes include at least two multi-order data grids and the data items forming a single-order data grid;
when the support of an episode within the detection window is smaller than the minimum support threshold, discarding the episode; otherwise, confirming the episode as a serial frequent episode;
confirming data items that co-exist in at least two of the multi-order data grids and whose support is greater than the minimum support threshold as parallel frequent episodes;
and taking the union of the serial frequent episodes and the parallel frequent episodes to obtain the mixed frequent episodes.
An embodiment of the present invention further provides a device for mining frequent plots of multiple source data streams, including:
the first confirming unit, configured to traverse, starting from the initial point of the data grid, the data streams represented by the data points without precedence relationship in the data grid, and to determine the data streams without precedence relationship as a data grid sequence according to the characteristic that data streams corresponding to different processes occur at intervals;
and the second confirming unit, configured to confirm the mixed frequent episodes in the data grid sequence by a frequent episode mining method.
Preferably, the data grid sequence comprises single-order data grids, multi-order data grids and complex data grids;
the first confirming unit is further configured to: confirm, from the data stream without the precedence relationship, a first data item and a second data item which form a first multi-order data grid, and generate a first binary sequence corresponding to the first multi-order data grid according to the first data item and the second data item, wherein the first data item and the second data item belong to a first data stream and a second data stream, respectively; and/or
confirm, from the data stream without the precedence relationship, a third data item or a fourth data item forming a first single-order data grid, and generate the first single-order data grid according to the third data item or the fourth data item, wherein the third data item belongs to the first data stream and the fourth data item belongs to the second data stream; and/or
confirm that at least two groups of multi-order data grids exist in the data stream without the precedence relationship, and, when at least one adjacent edge exists between the two groups of multi-order data grids, confirm the two groups of multi-order data grids as a complex data grid.
Preferably, the first confirming unit is specifically configured to:
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid and at least one first single-order data grid, confirming the combination of the first multi-order data grid and the first single-order data grid as the data grid sequence; or
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid and at least one complex data grid, confirming the combination of the first multi-order data grid and the complex data grid as the data grid sequence; or
when the data stream without the precedence relationship is confirmed to comprise at least one first single-order data grid and at least one complex data grid, confirming the combination of the first single-order data grid and the complex data grid as the data grid sequence; or
when the data stream without the precedence relationship is confirmed to comprise at least one first multi-order data grid, at least one first single-order data grid, and at least one complex data grid, confirming the combination of the first multi-order data grid, the first single-order data grid, and the complex data grid as the data grid sequence.
Preferably, the second confirming unit is specifically configured to:
confirming the occurrence time of the episodes included in the data grid sequence and the length of each episode, wherein the episodes include at least two multi-order data grids and the data items forming a single-order data grid;
when the support of an episode within the detection window is smaller than the minimum support threshold, discarding the episode; otherwise, confirming the episode as a serial frequent episode;
confirming data items that co-exist in at least two of the multi-order data grids and whose support is greater than the minimum support threshold as parallel frequent episodes;
and taking the union of the serial frequent episodes and the parallel frequent episodes to obtain the mixed frequent episodes.
The embodiments of the invention provide a method and a device for mining frequent episodes in multi-source data streams. The method comprises: starting from the initial point of the data grid, traversing the data streams represented by the data points without precedence relationship in the data grid, and determining the data streams without precedence relationship as a data grid sequence according to the characteristic that data streams corresponding to different processes occur at intervals; and confirming the mixed frequent episodes in the data grid sequence by a frequent episode mining method. For data streams without precedence relationship, the method combines the streams into complex data grids according to the characteristic that the data streams of different processes occur at intervals. This merges multiple source data streams and, compared with conventional methods, preserves the parallel relationships among the data. Furthermore, without losing any data source information, it provides a foundation for the subsequent frequent episode mining and reduces episode redundancy, so that in the subsequent mining process the computational complexity of the dynamic programming method and the storage cost are both reduced, which makes the method better suited to complex data environments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a multi-source data stream frequent episode mining method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a data grid structure according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of two connection situations of complex data grids according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a data stream structure with "precedence relationship" according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the data grid structure corresponding to Fig. 4 according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a multi-source data stream frequent episode mining method according to a second embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a multi-source data stream frequent episode mining device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
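Fig. 1 exemplarily shows a schematic flowchart of the multi-source data stream frequent episode mining method provided by an embodiment of the present invention. As shown in Fig. 1, the method mainly includes the following steps:
Step 101: starting from the initial point of the data grid, traverse the data streams represented by the data points without precedence relationship in the data grid, and determine the data streams without precedence relationship as a data grid sequence according to the characteristic that data streams corresponding to different processes occur at intervals;
Step 102: confirm the mixed frequent episodes in the data grid sequence by a frequent episode mining method.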
It should be noted that a data grid in the embodiments of the present invention contains two types of data points: "data points with precedence relationship" and "data points without precedence relationship". In practical applications, a "precedence relationship" covers both the temporal order in which events occur and the logical causal relationship between events. In the embodiments of the invention, the precedence relationship among concurrent data generated by multiple source data streams is uncertain. Therefore, when multiple source data streams are merged, the whole data stream observed in the conventional mode may correspond to many possible orderings; that is, multiple data grid sequences exist.
In the embodiments of the invention, the data grid generated by combining two data streams $W^{(1)}$ and $W^{(2)}$ is taken as an example.
Fig. 2 is a schematic diagram of a data grid structure provided in an embodiment of the present invention. In Fig. 2, one marker indicates that there is no precedence relationship between the two corresponding events, and such a point is called a "no precedence relationship data point"; the marker '×' indicates that a precedence relationship exists between the two corresponding events, and such a point is called a "precedence relationship data point".
As shown in Fig. 2, data grids are of two types: single-order data grids and multi-order data grids. When the data included in a data grid has a uniquely determined order, the data grid formed by that sequence is called a single-order data grid and is denoted $S_{(n-1)}$, where $n$ is the number of data points in the single-order data grid. $S_2$ in Fig. 2 is a single-order data grid.
When the data included in a data grid has multiple uncertain orders and the data corresponding to different processes occur at intervals, the data grid formed by those sequences is called a multi-order data grid and is denoted $M_{(m-1)(n-1)}$, where $m$ and $n$ are the numbers of data points corresponding to the different processes in the multi-order data grid. $M_{31}$ and $M_{22}$ in Fig. 2 are both multi-order data grids.
Table 1 shows the above symbols.
TABLE 1
In practical applications, the data grid model of two source data streams shows that, as the two data streams are combined, single-order data grids, multi-order data grids, and complex data grids may all appear. A single-order data grid $S_{(n-1)}$ generates a unique sequence, so it has low computation and storage overhead. For a multi-order data grid, multiple orders of the data items are possible under the precedence constraints. The embodiments of the invention therefore provide a method for constructing a binary sequence: one compact binary sequence represents the multiple possible orders in a data grid, which reduces space overhead.
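To illustrate why a two-row binary sequence saves space, the sketch below enumerates every order that two item groups can take when each group keeps its own internal order. It is a hedged illustration rather than the patent's own procedure: the function name `interleavings` is hypothetical, and the inputs DCA and E are the item groups that the worked example in this document later gives for $M_{31}$. The number of orders equals the binomial coefficient $\binom{p+q}{p}$ for groups of $p$ and $q$ items, which is a standard combinatorial fact.

```python
from math import comb
from typing import List


def interleavings(row1: str, row2: str) -> List[str]:
    """Every merge of the two rows that preserves the order inside each row."""
    if not row1:
        return [row2]
    if not row2:
        return [row1]
    from_row1 = [row1[0] + rest for rest in interleavings(row1[1:], row2)]
    from_row2 = [row2[0] + rest for rest in interleavings(row1, row2[1:])]
    return from_row1 + from_row2


if __name__ == "__main__":
    orders = interleavings("DCA", "E")
    print(orders)
    print(len(orders), comb(len("DCA") + len("E"), len("E")))
```

Storing the two rows takes $p+q$ symbols, whereas listing every order explicitly takes $\binom{p+q}{p}(p+q)$ symbols, so the saving grows quickly with the size of the multi-order data grid.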
In particular, the following describes methods for generating sequences of single-order data cells, multiple-order data cells, and complex data cells that may occur within a sequence of data cells.
When a data grid is formed from two different data streams corresponding to different processes, no precedence relationship exists between the two streams, so multiple possible sequences exist within the data grid.
One case is:
when a group of data forming a single-order data grid is confirmed in the data streams without precedence relationship included in the data grid, the group of data is confirmed as a single-order data grid. For example, $S_2$ in Fig. 2 is a single-order data grid.
Another case is:
when two groups of data items forming a multi-order data grid are confirmed in the data streams without precedence relationship included in the data grid, and the two groups of data items belong to different data streams, the two groups of data items are confirmed to form a multi-order data grid.
A further case is:
when at least two groups of multi-order data grids exist in the data streams without precedence relationship in the data grid and at least one adjacent edge exists between the two groups of multi-order data grids, the two groups of multi-order data grids sharing the adjacent edge are confirmed as a complex data grid.
The following describes the validation step of a multi-order data grid in detail with reference to fig. 2:
Step 201: starting from the initial point at the lower left corner of the data grid, traverse the data streams represented by the data points without precedence relationship in the data grid, and find the multi-order data grids, which are characterized by the data streams of different processes occurring at intervals;
Step 202: determine the data items that each process contributes to the multi-order data grid, namely the first data item contributed by process $W^{(1)}$ and the second data item contributed by process $W^{(2)}$;
Step 203: generate a binary sequence from the first data item and the second data item, written as two rows inside one pair of parentheses: row 1 holds the data items from $W^{(1)}$ and row 2 holds the data items from $W^{(2)}$. It should be noted that the first row and the second row may be swapped; the embodiments of the invention do not limit the specific form of the binary sequence.
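Steps 202 and 203 can be sketched as follows, assuming a hypothetical representation in which every data item inside a multi-order data grid is tagged with the index of its source stream (1 for $W^{(1)}$, 2 for $W^{(2)}$). The function name `build_binary_sequence` and the tuple layout are illustrative assumptions, not the patent's data structure.

```python
from typing import List, Tuple


def build_binary_sequence(items: List[Tuple[str, int]]) -> Tuple[str, str]:
    """Split tagged items into (row 1 from stream 1, row 2 from stream 2).

    The relative order of items from the same stream is preserved; the
    relative order across streams is deliberately dropped, because it is
    undetermined inside a multi-order data grid.
    """
    row1 = "".join(symbol for symbol, source in items if source == 1)
    row2 = "".join(symbol for symbol, source in items if source == 2)
    if not row1 or not row2:
        raise ValueError("a multi-order data grid needs items from both streams")
    return row1, row2


if __name__ == "__main__":
    # One possible arrival order of the items D, C, A (stream 1) and E (stream 2).
    print(build_binary_sequence([("D", 1), ("E", 2), ("C", 1), ("A", 1)]))
```

Any arrival order of the same items yields the same pair of rows, which is what lets one binary sequence stand for all of the possible orders.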
In practical applications, several multi-order data grids may turn out to be cross-connected while the data grid is traversed. Such a structure is identified as a complex data grid, and in practice complex data grids are quite likely to occur. For this case, the embodiments of the invention use a binary sequence group, that is, several binary sequences that together represent all possible orders in the complex data grid.
The confirmation steps for a complex data grid are described in detail below with reference to Fig. 3, which is a schematic diagram of two connection situations of complex data grids according to an embodiment of the present invention:
Step 301: starting from the initial point at the lower left corner of the data grid, traverse the data streams represented by the data points without precedence relationship in the data grid, and find the complex data grids, which are characterized by multiple multi-order data grids being cross-connected.
Step 302: when the two multi-order data grids overlap along one edge, divide the complex data grid by multi-order data grid; for example, in (a) of Fig. 3, the "no precedence relationship data point" between event e and event f can be determined as the division point. When the two multi-order data grids are cross-connected along two edges, divide the complex data grid by multi-order data grid; for example, in (b) of Fig. 3, the "no precedence relationship data point" between event s and event t and the "no precedence relationship data point" between event n and event p can be determined as the division points of each data process.
Step 303: list the binary sequences according to the division points and combine them into a binary sequence group representing the complex data grid. The binary sequence group of (a) in Fig. 3 consists of three items, and the binary sequence group of (b) in Fig. 3 consists of four items.
When the data streams without precedence relationship in the data grid include multi-order data grids and single-order data grids and also include complex data grids, the multi-order data grids, the single-order data grids, and the complex data grids can be combined into a data grid sequence. It should be noted that, in the embodiments of the invention, the data grid sequence may be composed in any of the following ways:
one way is as follows:
when it is determined that the data stream without the precedence relationship includes at least one multi-order data cell and at least one single-order data cell, the at least one multi-order data cell and the at least one single-order data cell may be combined to form a data cell sequence. In this manner, the data cells do not include complex data cells, but the specific number of included multi-order data cells and single-order data cells is not limited.
In another mode:
when it is determined that the data stream without the precedence relationship includes at least one multi-order data cell and at least one complex data cell, the at least one multi-order data cell and the at least one complex data cell may be combined to form a data cell sequence. In this manner, single-order data cells are not included in the data cells, but the specific number of multi-order data cells and complex data cells included is not limited.
In another mode:
when it is determined that the data stream without the precedence relationship includes at least one single-order data cell and at least one complex data cell, the at least one single-order data cell and the at least one complex data cell may be combined to form a data cell sequence. In this manner, the data cells do not include multiple-order data cells, but the specific number of single-order data cells and complex data cells included is not limited.
In another mode:
when it is determined that the data stream without the precedence relationship includes at least one single-order data cell, at least one complex data cell, and at least one multi-order data cell, the at least one single-order data cell, the at least one complex data cell, and the at least one multi-order data cell may be combined to form a data cell sequence. In this manner, the specific number of single-order data cells, complex data cells, and multi-order data cells included in the data cells is not limited.
In step 102, after the data streams without precedence relationship have been combined into a data grid sequence, the data grid sequence may be processed by a frequent episode mining method.
In practical applications, frequent episode mining is mainly divided into serial episode mining and parallel episode mining; mixed episode mining takes the set union of the results of serial frequent episode mining and parallel frequent episode mining.
Specifically, the episodes included in the data grid sequence and the length of each episode are confirmed first. In practical applications, the episodes include multi-order data grids and the data items that make up single-order data grids.
For example, consider a complex data grid whose data grid sequence consists of a binary sequence with rows a and bcd, the item e, the item f, and a binary sequence with rows xa and by. In the first binary sequence, a and bcd both occur at time 1, and the length is recorded as 3, taking the longest row as the standard; e occurs at time 2 with length 1; f occurs at time 3 with length 1; and in the second binary sequence, xa and by both occur at time 4 with length 2.
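The occurrence times and lengths in this example can be reproduced with a short sketch. The representation is an assumption made for illustration: a data grid sequence is a Python list whose elements are either a single item (a string) or a binary sequence (a pair of row strings); each element occupies one time step, and the length of a binary sequence is the length of its longest row, as stated above. The function name `timeline` is hypothetical.

```python
from typing import List, Tuple, Union

Element = Union[str, Tuple[str, str]]


def timeline(sequence: List[Element]) -> List[Tuple[int, int, Element]]:
    """Return (occurrence time, length, element) for each element, times starting at 1."""
    result = []
    for time, element in enumerate(sequence, start=1):
        if isinstance(element, tuple):
            length = max(len(row) for row in element)  # longest row sets the length
        else:
            length = len(element)
        result.append((time, length, element))
    return result


if __name__ == "__main__":
    grid_sequence = [("a", "bcd"), "e", "f", ("xa", "by")]
    for time, length, element in timeline(grid_sequence):
        print(time, length, element)
```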
The mining steps of frequent plot mining are described with reference to the above example:
Step 401: scan the data grid sequence to determine the occurrence time of each episode it includes and the length of each episode. For example, the episode formed by a and bcd has length 3, episode e has length 1, episode f has length 1, and the episode formed by xa and by has length 2.
Step 402: set a minimum support and a sequence window; the sequence window slides backward by a fixed time interval each time. When a new data item arrives in the data stream, an episode whose support within the detection window is smaller than the minimum support threshold is discarded; otherwise, the episode is identified as a serial frequent episode.
For example, in the first sequence window, the episode $\langle A,C\rangle$ is a frequent pattern: it occurs twice, as $\langle (A,1),(C,3)\rangle$ and $\langle (A,3),(C,4)\rangle$, so $\mathrm{sup}(\langle A,C\rangle) = 2 \geq \mathit{min\_sup}$. The episode $\langle A,C\rangle$ is therefore called a serial frequent episode.
Step 403: when a binary sequence structure is generated at some moment during the scan of the data stream, store it in a two-dimensional array until the next binary sequence is encountered, and identify data items that co-exist in at least two multi-order data grids and whose support reaches the minimum support threshold as parallel frequent episodes.
For example, when two binary sequence structures in a data stream both contain the same episode and the support of that episode reaches the minimum support threshold, the episode is a parallel frequent episode. Episode expansion is then considered: an expanded episode that also reaches the minimum support threshold is likewise identified as a parallel frequent episode.
Step 404: finally, take the set union of the serial frequent episodes and the parallel frequent episodes to obtain the mixed frequent episodes.
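The parallel-episode check and the final merge can be sketched as below. Several points are assumptions made only for illustration: a parallel episode is modelled as a pair of row fragments, "contains" is taken to mean that each fragment is a contiguous substring of the corresponding row, support is the number of stored binary sequences that contain the episode, and the toy binary sequences use made-up items p, q, r, s, t rather than values from the patent. The function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

BinarySequence = Tuple[str, str]
TaggedEpisode = Tuple[str, Tuple[str, ...]]


def parallel_support(stored: List[BinarySequence], episode: BinarySequence) -> int:
    """Count the stored binary sequences whose rows contain the episode's rows."""
    top, bottom = episode
    return sum(1 for row1, row2 in stored if top in row1 and bottom in row2)


def frequent_parallel(stored: List[BinarySequence],
                      candidates: Iterable[BinarySequence],
                      min_sup: int) -> Set[TaggedEpisode]:
    """Keep the candidates whose support reaches min_sup, tagged as parallel."""
    return {("parallel", ep) for ep in candidates
            if parallel_support(stored, ep) >= min_sup}


def mixed_frequent(serial: Set[TaggedEpisode],
                   parallel: Set[TaggedEpisode]) -> Set[TaggedEpisode]:
    """Mixed frequent episodes are the set union of both result sets."""
    return serial | parallel


if __name__ == "__main__":
    stored = [("pqr", "s"), ("pq", "ts")]          # hypothetical binary sequences
    candidates = [("p", "s"), ("pq", "s"), ("pqr", "s")]
    parallel = frequent_parallel(stored, candidates, min_sup=2)
    serial = {("serial", ("A", "C"))}
    print(sorted(mixed_frequent(serial, parallel)))
```

Here the episode ("p", "s") and its expansion ("pq", "s") both appear in two binary sequences, while the further expansion ("pqr", "s") appears in only one and is dropped, which mirrors the expansion step described in step 403.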
In order to more clearly describe the method for mining the frequent episode of the multi-source data stream according to the embodiment of the present invention, the method is further described below by taking fig. 4, fig. 5, and fig. 6 as examples.
Example one
Fig. 4 is a schematic diagram of a data stream structure with "precedence relationship" according to an embodiment of the present invention; Fig. 5 is a schematic diagram of the data grid structure corresponding to Fig. 4.
As shown in Fig. 4, there are two data streams, $W^{(1)}$: ADCADCBC… and $W^{(2)}$: BECABEAA…. The precedence relationships among the data items of the two data streams are shown in Fig. 4. In Fig. 5, one marker indicates that there is no precedence relationship between the two corresponding events, and the marker '×' indicates that a precedence relationship exists between the two corresponding events.
For the data streams W(1) and W(2), frequent episode mining can be performed according to Fig. 5, specifically as follows:
Step 501: traverse the whole data grid and find the first multi-order data grid M31;
Step 502: determine the two groups of data items of the first multi-order data grid M31: the data items contributed by data source W(1) are {DCA}, and the data item contributed by data source W(2) is {E};
Step 503: generate a binary sequence from the two groups of data items according to the binary sequence generation method.
It should be noted that, by repeating steps 501 to 503, a second multi-order data grid M22 can be found and its binary sequence obtained.
The sequence corresponding to the first single-order data grid S2 is {CA}, and the sequence corresponding to the second single-order data grid S4 is {BCAA}. The multi-order data grids and the single-order data grids are combined to form a data grid sequence.
Step 504: scan the data grid sequence and identify the episodes it contains, together with the occurrence time and length of each. For example, episode A occurs at time 1 and episode B at time 2; the first binary sequence occurs at time 3 and has length 3; C and A occur at times 4 and 5 respectively, each with length 1; the second binary sequence occurs at time 6 and has length 2; and B, C, A and A occur at times 7, 8, 9 and 10 respectively, each with length 1.
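The time and length bookkeeping of step 504 can be sketched as follows. This is a minimal Python sketch under stated assumptions: each element of the data grid sequence is either a single data item (from a single-order data grid) or a pair of item groups (from a multi-order data grid), and the length of a multi-order element is taken to be the length of its longest group. The element representation and the function name `occurrence_times` are hypothetical, and the second pair uses placeholder items because the text gives only its length.

```python
from typing import List, Tuple, Union

# A single-order element is one data item; a multi-order element is a pair of
# item groups, one per data source (hypothetical representation).
Element = Union[str, Tuple[str, str]]


def occurrence_times(sequence: List[Element]) -> List[Tuple[Element, int, int]]:
    """Return (element, time, length) for each element; times start at 1."""
    result = []
    for time, element in enumerate(sequence, start=1):
        if isinstance(element, tuple):
            # Assumption: a multi-order element is as long as its longest group.
            length = max(len(group) for group in element)
        else:
            length = 1
        result.append((element, time, length))
    return result


grid_sequence: List[Element] = [
    "A", "B", ("DCA", "E"), "C", "A",
    ("xx", "yy"),  # placeholder contents; only the length 2 comes from the text
    "B", "C", "A", "A",
]
for element, time, length in occurrence_times(grid_sequence):
    print(time, element, length)
```

With this input the pair `("DCA", "E")` is reported at time 3 with length 3 and the placeholder pair at time 6 with length 2, which agrees with the times and lengths listed in the step.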
Step 505: determine the serial frequent episodes. For example, set the minimum support min_sup to 2 and the sequence window to 6, and slide the sequence window backward by a time interval of 2 each time to simulate the streaming arrival of data. In the first sequence window, the episode <A,C> is a frequent pattern: it occurs twice in total, as <(A,1),(C,3)> and <(A,3),(C,4)>, so sup(<A,C>) = 2 ≥ min_sup, and <A,C> is called a serial frequent episode. The sequence window is then slid backward by a time interval of 2, and the occurrences in the next window are judged.
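The support count in step 505 can be illustrated with a short Python sketch. The text does not define how occurrences are counted, so the sketch assumes that the support of a serial episode is its number of minimal occurrences inside the window; the function name `minimal_occurrences` and the event dictionary are hypothetical. Only times 1 to 5 are included, because the items of the later multi-order data grid are not given in the text.

```python
from typing import Dict, List, Sequence, Set, Tuple


def minimal_occurrences(events: Dict[int, Set[str]], episode: Sequence[str],
                        start: int, width: int) -> List[Tuple[int, int]]:
    """Minimal occurrences (first_time, last_time) of a serial episode in a window."""
    end = start + width - 1
    times = [t for t in sorted(events) if start <= t <= end]
    found: List[Tuple[int, int]] = []
    for t_end in times:
        if episode[-1] not in events[t_end]:
            continue
        # Walk backwards, matching earlier episode items at strictly earlier times.
        t, ok = t_end, True
        for item in reversed(episode[:-1]):
            earlier = [u for u in times if u < t and item in events[u]]
            if not earlier:
                ok = False
                break
            t = earlier[-1]  # the latest match keeps the occurrence as short as possible
        # The same start as the previous hit means the previous hit is shorter: skip.
        if ok and (not found or found[-1][0] != t):
            found.append((t, t_end))
    return found


# Items observed at each time; time 3 holds every item of the {DCA}/{E} grid.
events = {1: {"A"}, 2: {"B"}, 3: {"D", "C", "A", "E"}, 4: {"C"}, 5: {"A"}}
min_sup, width, step = 2, 6, 2

for window_start in (1, 1 + step):
    hits = minimal_occurrences(events, ("A", "C"), window_start, width)
    print(window_start, hits, len(hits) >= min_sup)
# 1 [(1, 3), (3, 4)] True
# 3 [(3, 4)] False
```

For the first window the sketch finds the same two occurrences as the text, (A,1),(C,3) and (A,3),(C,4). The result for the second window reflects only the truncated input above and says nothing about the full data grid sequence.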
Step 506: mine the parallel frequent episodes. For example, suppose two binary sequence structures in the data stream both contain the same episode. If the support of that episode is greater than the minimum support threshold, it is a parallel frequent episode; if, when episode expansion is considered, the expanded episode also exceeds the minimum support threshold, the expanded episode is likewise identified as a parallel frequent episode.
Step 507: finally, the serial frequent episodes and the parallel frequent episodes are merged to obtain the mixed frequent episodes, which include <A,C> and the parallel frequent episodes identified in step 506.
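The merge in step 507 amounts to a set union. A minimal Python sketch, assuming serial episodes are stored as ordered tuples and parallel episodes as unordered frozensets; both type aliases, the function name `mix_episodes` and the sample parallel episode are hypothetical.

```python
from typing import FrozenSet, Set, Tuple, Union

SerialEpisode = Tuple[str, ...]   # ordered items, e.g. ("A", "C")
ParallelEpisode = FrozenSet[str]  # unordered items that co-occur


def mix_episodes(serial: Set[SerialEpisode],
                 parallel: Set[ParallelEpisode]) -> Set[Union[SerialEpisode, ParallelEpisode]]:
    """Set union of the two result sets; duplicates collapse automatically."""
    return set(serial) | set(parallel)


# ("A", "C") is the serial episode from the example; the parallel one is made up.
mixed = mix_episodes({("A", "C")}, {frozenset({"C", "E"})})
print(len(mixed))  # 2
```

Keeping the two kinds of episode in different container types lets the merged set retain whether each entry is ordered or unordered.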
Example two
Fig. 6 is a schematic diagram of a multi-source data stream frequent episode mining method according to the second embodiment of the present invention. As shown in Fig. 6, the method includes the following steps:
Step 601: form a data grid from two groups of data streams without precedence relationship;
It should be noted that, in step 606, one edge of two multi-order data grids may coincide, or two edges may be cross-connected; see Fig. 3 for details.
In summary, in the multi-source data stream frequent episode mining method provided by the embodiments of the present invention, for data streams without precedence relationship, the data streams are combined into complex data grids according to the characteristic that the data streams corresponding to different processes occur at intervals. This merges the multi-source data streams and, compared with conventional methods, preserves the parallel relationships among the data. Furthermore, while ensuring that no data source information is lost, the method lays a foundation for subsequent frequent item mining and reduces episode redundancy, so that the subsequent frequent item mining has a lower dynamic programming computational complexity and a lower storage cost, and is better suited to complex data environments.
Based on the same inventive concept, embodiments of the present invention provide a multi-source data stream frequent episode mining device. Because the principle by which the device solves the technical problem is similar to that of the multi-source data stream frequent episode mining method, the implementation of the device may refer to the implementation of the method, and repeated parts are not described again.
Fig. 7 is a schematic structural diagram of a multi-source data stream frequent episode mining device according to an embodiment of the present invention. As shown in Fig. 7, the device includes a first confirming unit 701 and a second confirming unit 702.
The first confirming unit 701 is configured to traverse, from the initial point position of a data grid, the data streams comprising data points without precedence relationship in the data grid, and to confirm the data streams without precedence relationship as a data grid sequence according to the characteristic that the data streams corresponding to different processes occur at intervals;
the second confirming unit 702 is configured to confirm the mixed frequent episodes in the data grid sequence by a frequent episode mining method.
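The two-unit structure can be pictured as a simple composition of two callables. The following Python sketch is illustrative only: the class `MiningDevice`, its field names and both stub functions are hypothetical stand-ins, and the stubs do not implement the grid traversal or the episode mining described in the method embodiments.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set

GridSequence = List[str]


@dataclass
class MiningDevice:
    """Two logical units chained together; each unit is any suitable callable."""
    first_unit: Callable[[Sequence[str]], GridSequence]
    second_unit: Callable[[GridSequence], Set[str]]

    def run(self, streams: Sequence[str]) -> Set[str]:
        return self.second_unit(self.first_unit(streams))


def stub_first_unit(streams: Sequence[str]) -> GridSequence:
    # Stand-in only: concatenates the streams instead of building data grids.
    return [item for stream in streams for item in stream]


def stub_second_unit(sequence: GridSequence) -> Set[str]:
    # Stand-in only: keeps items seen at least twice instead of mining episodes.
    return {item for item, n in Counter(sequence).items() if n >= 2}


device = MiningDevice(stub_first_unit, stub_second_unit)
print(device.run(["ADC", "BEC"]))  # {'C'}
```

Because each unit is just a callable, the units can be merged into one function or split into several without changing how the device is invoked.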
Further, the data grid sequence comprises single-order data grids, multi-order data grids and complex data grids;
the first confirming unit 701 is further configured to: confirm, from the data streams without precedence relationship, a first data item and a second data item that form a first multi-order data grid, and generate a first binary sequence corresponding to the first multi-order data grid according to the first data item and the second data item, the first data item and the second data item belonging to a first data stream and a second data stream, respectively; and/or
confirm, from the data streams without precedence relationship, a third data item or a fourth data item that forms a first single-order data grid, and generate the first single-order data grid corresponding to the first single-order data grid according to the third data item or the fourth data item, the third data item belonging to the first data stream and the fourth data item belonging to the second data stream; and/or
confirm that at least two groups of multi-order data grids exist in the data streams without precedence relationship and, when the two groups of multi-order data grids have at least one adjacent edge, confirm the two groups of multi-order data grids having at least one adjacent edge as a complex data grid.
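The adjacent-edge condition for complex data grids can be sketched geometrically. This is a minimal Python sketch that assumes, purely for illustration, that each multi-order data grid occupies an axis-aligned rectangular block of the data grid with half-open row and column ranges, and that two grids are adjacent when their boundaries touch along a segment of positive length. The class `Grid`, the function `share_edge` and the coordinates are hypothetical.

```python
from typing import NamedTuple


class Grid(NamedTuple):
    """Axis-aligned block of the data grid with half-open row/column ranges."""
    row0: int
    row1: int
    col0: int
    col1: int


def _overlap(a0: int, a1: int, b0: int, b1: int) -> bool:
    """True when the half-open intervals [a0, a1) and [b0, b1) intersect."""
    return max(a0, b0) < min(a1, b1)


def share_edge(a: Grid, b: Grid) -> bool:
    """True when the two blocks touch along an edge segment, not just a corner."""
    touch_rows = a.row1 == b.row0 or b.row1 == a.row0
    touch_cols = a.col1 == b.col0 or b.col1 == a.col0
    return ((touch_rows and _overlap(a.col0, a.col1, b.col0, b.col1))
            or (touch_cols and _overlap(a.row0, a.row1, b.row0, b.row1)))


m_a = Grid(0, 3, 0, 1)  # a 3-by-1 block
m_b = Grid(3, 5, 0, 2)  # a 2-by-2 block placed directly below it
m_c = Grid(6, 8, 4, 6)  # a block that touches neither of the others

print(share_edge(m_a, m_b))  # True
print(share_edge(m_a, m_c))  # False
```

Under this reading, `m_a` and `m_b` would be grouped into one complex data grid, while `m_c` would remain a separate multi-order data grid.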
Further, the first confirming unit 701 is specifically configured to:
when it is confirmed that the data streams without precedence relationship comprise at least one first multi-order data grid and at least one first single-order data grid, confirm the combination of the first multi-order data grid and the first single-order data grid as the data grid sequence; or
when it is confirmed that the data streams without precedence relationship comprise at least one first multi-order data grid and at least one complex data grid, confirm the combination of the first multi-order data grid and the complex data grid as the data grid sequence; or
when it is confirmed that the data streams without precedence relationship comprise at least one first single-order data grid and at least one complex data grid, confirm the combination of the first single-order data grid and the complex data grid as the data grid sequence; or
when it is confirmed that the data streams without precedence relationship comprise at least one first multi-order data grid, at least one first single-order data grid and at least one complex data grid, confirm the combination of the first multi-order data grid, the first single-order data grid and the complex data grid as the data grid sequence.
Further, the second confirming unit 702 is specifically configured to:
confirm the occurrence times of the episodes included in the data grid sequence and the length of each episode, wherein the episodes comprise at least two multi-order data grids and the data items forming a single-order data grid;
when the support of an occurring episode within the detection window is smaller than the minimum support threshold, discard the episode; otherwise, confirm the episode greater than the minimum support threshold as a serial frequent episode;
confirm data items that exist simultaneously in at least two of the multi-order data grids and are greater than the minimum support threshold as parallel frequent episodes;
and combine the serial frequent episodes and the parallel frequent episodes into one set to obtain the mixed frequent episodes.
It should be understood that the units included in the above multi-source data stream frequent episode mining device are only a logical division according to the functions implemented by the device; in practical applications, the above units may be combined or split. The functions implemented by the multi-source data stream frequent episode mining device provided in this embodiment correspond one to one to the multi-source data stream frequent episode mining method provided in the above embodiments. The more detailed processing flow implemented by the device has already been described in the method embodiments and is not repeated here.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (6)
1. A multi-source data stream frequent episode mining method, characterized by comprising the following steps:
traversing, from the initial point position of a data grid, the data streams without precedence relationship in the data grid, and confirming the data streams without precedence relationship as a data grid sequence according to the characteristic that the data streams corresponding to different processes occur at intervals;
confirming the mixed frequent episodes in the data grid sequence by a frequent episode mining method;
wherein confirming the mixed frequent episodes in the data grid sequence by the frequent episode mining method specifically comprises:
confirming the occurrence times of the episodes included in the data grid sequence and the length of each episode, wherein the episodes comprise at least two multi-order data grids and the data items forming a single-order data grid;
when the support of an occurring episode within the detection window is smaller than the minimum support threshold, discarding the episode; otherwise, confirming the episode greater than the minimum support threshold as a serial frequent episode;
confirming data items that exist simultaneously in at least two of the multi-order data grids and are greater than the minimum support threshold as parallel frequent episodes;
and combining the serial frequent episodes and the parallel frequent episodes into one set to obtain the mixed frequent episodes.
2. The method of claim 1, wherein the data grid sequence comprises single-order data grids, multi-order data grids and complex data grids;
before the data streams without precedence relationship are confirmed as the data grid sequence, the method further comprises:
confirming, from the data streams without precedence relationship, a first data item and a second data item that form a first multi-order data grid, and generating a first binary sequence corresponding to the first multi-order data grid according to the first data item and the second data item, the first data item and the second data item belonging to a first data stream and a second data stream, respectively; and/or
confirming, from the data streams without precedence relationship, a third data item or a fourth data item that forms a first single-order data grid, and generating the first single-order data grid corresponding to the first single-order data grid according to the third data item or the fourth data item, the third data item belonging to the first data stream and the fourth data item belonging to the second data stream; and/or
confirming that at least two groups of multi-order data grids exist in the data streams without precedence relationship and, when the two groups of multi-order data grids have at least one adjacent edge, confirming the two groups of multi-order data grids having at least one adjacent edge as a complex data grid.
3. The method according to claim 2, wherein confirming the data streams without precedence relationship as a data grid sequence specifically comprises:
when it is confirmed that the data streams without precedence relationship comprise at least one first multi-order data grid and at least one first single-order data grid, confirming the combination of the first multi-order data grid and the first single-order data grid as the data grid sequence; or
when it is confirmed that the data streams without precedence relationship comprise at least one first multi-order data grid and at least one complex data grid, confirming the combination of the first multi-order data grid and the complex data grid as the data grid sequence; or
when it is confirmed that the data streams without precedence relationship comprise at least one first single-order data grid and at least one complex data grid, confirming the combination of the first single-order data grid and the complex data grid as the data grid sequence; or
when it is confirmed that the data streams without precedence relationship comprise at least one first multi-order data grid, at least one first single-order data grid and at least one complex data grid, confirming the combination of the first multi-order data grid, the first single-order data grid and the complex data grid as the data grid sequence.
4. A multi-source data stream frequent episode mining device, characterized by comprising:
a first confirming unit, configured to traverse, from the initial point position of a data grid, the data streams comprising data points without precedence relationship in the data grid, and to confirm the data streams without precedence relationship as a data grid sequence according to the characteristic that the data streams corresponding to different processes occur at intervals;
a second confirming unit, configured to confirm the mixed frequent episodes in the data grid sequence by a frequent episode mining method;
wherein the second confirming unit is specifically configured to:
confirm the occurrence times of the episodes included in the data grid sequence and the length of each episode, wherein the episodes comprise at least two multi-order data grids and the data items forming a single-order data grid;
when the support of an occurring episode within the detection window is smaller than the minimum support threshold, discard the episode; otherwise, confirm the episode greater than the minimum support threshold as a serial frequent episode;
confirm data items that exist simultaneously in at least two of the multi-order data grids and are greater than the minimum support threshold as parallel frequent episodes;
and combine the serial frequent episodes and the parallel frequent episodes into one set to obtain the mixed frequent episodes.
5. The device of claim 4, wherein the data grid sequence comprises single-order data grids, multi-order data grids and complex data grids;
the first confirming unit is further configured to: confirm, from the data streams without precedence relationship, a first data item and a second data item that form a first multi-order data grid, and generate a first binary sequence corresponding to the first multi-order data grid according to the first data item and the second data item, the first data item and the second data item belonging to a first data stream and a second data stream, respectively; and/or
confirm, from the data streams without precedence relationship, a third data item or a fourth data item that forms a first single-order data grid, and generate the first single-order data grid corresponding to the first single-order data grid according to the third data item or the fourth data item, the third data item belonging to the first data stream and the fourth data item belonging to the second data stream; and/or
confirm that at least two groups of multi-order data grids exist in the data streams without precedence relationship and, when the two groups of multi-order data grids have at least one adjacent edge, confirm the two groups of multi-order data grids having at least one adjacent edge as a complex data grid.
6. The device of claim 5, wherein the first confirming unit is specifically configured to:
when it is confirmed that the data streams without precedence relationship comprise at least one first multi-order data grid and at least one first single-order data grid, confirm the combination of the first multi-order data grid and the first single-order data grid as the data grid sequence; or
when it is confirmed that the data streams without precedence relationship comprise at least one first multi-order data grid and at least one complex data grid, confirm the combination of the first multi-order data grid and the complex data grid as the data grid sequence; or
when it is confirmed that the data streams without precedence relationship comprise at least one first single-order data grid and at least one complex data grid, confirm the combination of the first single-order data grid and the complex data grid as the data grid sequence; or
when it is confirmed that the data streams without precedence relationship comprise at least one first multi-order data grid, at least one first single-order data grid and at least one complex data grid, confirm the combination of the first multi-order data grid, the first single-order data grid and the complex data grid as the data grid sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810889153.4A CN109033419B (en) | 2018-08-06 | 2018-08-06 | Multi-source data stream frequent plot mining method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109033419A (en) | 2018-12-18 |
CN109033419B (en) | 2022-03-11 |
Family
ID=64649765
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810889153.4A Active CN109033419B (en) | 2018-08-06 | 2018-08-06 | Multi-source data stream frequent plot mining method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033419B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||