CN114153881A - High-recall cause and effect discovery method, device and equipment based on time sequence operation and maintenance big data - Google Patents

Info

Publication number
CN114153881A
Authority
CN
China
Prior art keywords
algorithm
graph
edge
relationship
deleted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111486699.3A
Other languages
Chinese (zh)
Inventor
欧阳梦云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCB Finetech Co Ltd filed Critical CCB Finetech Co Ltd
Priority to CN202111486699.3A priority Critical patent/CN114153881A/en
Publication of CN114153881A publication Critical patent/CN114153881A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474Sequence data queries, e.g. querying versioned data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of operation and maintenance data, and provides a high-recall cause and effect discovery method, device and equipment based on time sequence operation and maintenance big data. The high-recall cause and effect discovery method based on the time sequence operation and maintenance big data is based on a cause and effect discovery algorithm and comprises the following steps: predefining a number of relationship rules to be applied to edges and points in the full graph; acquiring streaming operation and maintenance data to be analyzed; initializing the streaming operation and maintenance data into a complete graph by adopting a graph library and a graph algorithm; processing the relationship in the complete graph of the streaming operation and maintenance data according to the relationship rules; outputting a processed cause and effect relationship graph, wherein the cause and effect relationship graph is used for describing cause and effect relationships in the streaming operation and maintenance data. The embodiment provided by the invention can improve the accuracy of the causal relationship analysis of the operation and maintenance data.

Description

High-recall cause and effect discovery method, device and equipment based on time sequence operation and maintenance big data
Technical Field
The invention relates to the technical field of operation and maintenance data, in particular to a high recall cause and effect discovery method based on time sequence operation and maintenance big data, a high recall cause and effect discovery device based on the time sequence operation and maintenance big data, high recall cause and effect discovery equipment based on the time sequence operation and maintenance big data and a corresponding storage medium.
Background
Causal discovery on operation and maintenance data mostly uses the SVAR-FCI method, which removes edges by exploiting stationarity and additional inference. However, that method does not consider assumptions about functional relationships or complex structures, so its recall is low. The resulting graph cannot accurately reflect the causal relationships in the operation and maintenance data, which reduces the accuracy and reliability of big data analysis.
Disclosure of Invention
The embodiment of the invention aims to provide a high-recall cause and effect discovery method, device and equipment based on time sequence operation and maintenance big data.
In order to achieve the above object, a first aspect of the present invention provides a high recall causal discovery method based on time series operation and maintenance big data, where the method is based on a causal discovery algorithm, and the method includes: predefining a number of relationship rules to be applied to edges and points in the full graph; acquiring streaming operation and maintenance data to be analyzed; initializing the streaming operation and maintenance data into a complete graph by adopting a graph library and a graph algorithm; processing the relationship in the complete graph of the streaming operation and maintenance data according to the relationship rules; outputting a processed cause and effect relationship graph, wherein the cause and effect relationship graph is used for describing cause and effect relationships in the streaming operation and maintenance data.
In this embodiment of the present application, processing the relationship in the full graph of the streaming operation and maintenance data according to the several relationship rules includes: defining the following algorithm according to the plurality of relationship rules: a full graph processing main algorithm, a first removal algorithm, a second removal algorithm and a delete edge processing algorithm; processing edges in the complete graph of the streaming operation and maintenance data by adopting the first removal algorithm and the second removal algorithm to realize relational processing; the first removal algorithm is used for determining that all edges of the ordered variable pairs between the complete graph and the non-adjacent variables are to be deleted or reserved; determining the direction of the edge to be reserved according to the causal relationship between the points connected with the edge to be reserved; the second removal algorithm is used for determining edges between the ordered variable pairs, which meet a preset separation set, to be deleted or reserved; determining the direction of the edge to be reserved according to the causal relationship between the points connected with the edge to be reserved; the full graph processing main algorithm is used for providing a main function and calling the first removal algorithm and the second removal algorithm; and the deleted edge processing algorithm is used for carrying out secondary processing on the edges to be deleted determined by the first removing algorithm and the second removing algorithm.
In an embodiment of the present application, the full graph processing main algorithm is configured to: calling the first removal algorithm to perform first traversal on the complete graph, and marking the parent-child relationship in the complete graph; calling the first removal algorithm and the second removal algorithm to identify the complete graph after the first traversal, and determining a partial ancestor graph corresponding to the complete graph; and outputting the determined part of ancestor graphs as processed causal relationship graphs.
In an embodiment of the present application, the first removal algorithm is further configured to: determining ordered variable pairs according to the ordered variables in the complete graph and the adjacency relation; determining that a middle label of an edge between the ordered variable pair is a first type label; determining edges between the ordered variable pairs to be reserved or deleted according to a first preset condition; and if the edges between the ordered variable pairs are determined to be deleted, calling the deleted edge processing algorithm to perform secondary processing.
In an embodiment of the present application, the second removal algorithm is further configured to: determining ordered variable pairs according to the ordered variables in the complete graph and the adjacency relation; determining that a middle label of an edge between the ordered variable pair is a second type label; determining edges between the ordered variable pairs to be reserved or deleted according to a second preset condition; and if the edges between the ordered variable pairs are determined to be deleted, calling the deleted edge processing algorithm to perform secondary processing.
In an embodiment of the present application, the edge deletion processing algorithm is configured to: acquiring edges to be deleted determined according to the first removal algorithm and the second removal algorithm; if the determined edge to be deleted is a directed edge, solving the deleted conflict; and executing deletion operation on the determined edge to be deleted, and weakening and minimizing the corresponding separation set.
A second aspect of the present application provides a high recall cause and effect discovery apparatus based on time series operation and maintenance big data, the apparatus comprising: a rule definition module for predefining a number of relationship rules to be applied to edges and points in the full graph; the operation and maintenance data access module is used for acquiring streaming operation and maintenance data to be analyzed; the complete graph generation module is used for initializing the streaming operation and maintenance data into a complete graph by adopting a graph library and a graph algorithm; the edge processing module is used for processing the relationship in the complete graph of the streaming operation and maintenance data according to the relationship rules; and the result output module is used for outputting the processed cause-and-effect relationship diagram, and the cause-and-effect relationship diagram is used for describing cause-and-effect relationships in the streaming operation and maintenance data.
In this embodiment of the present application, the edge processing module includes: the complete graph processing main algorithm sub-module, the first removal algorithm sub-module, the second removal algorithm sub-module and the delete edge processing algorithm sub-module; the first removal algorithm submodule is used for determining that all edges of the ordered variable pairs between the ordered variable pairs and the nonadjacent variable in the complete graph are to be deleted or reserved through a first removal algorithm; determining the direction of the edge to be reserved according to the causal relationship between the points connected with the edge to be reserved; the second removal algorithm submodule is used for determining edges between the ordered variable pairs, which meet the preset separation set, as to-be-deleted or to-be-reserved through a second removal algorithm; determining the direction of the edge to be reserved according to the causal relationship between the points connected with the edge to be reserved; the full image processing main algorithm sub-module is used for providing a main function and calling the first removal algorithm and the second removal algorithm; the deleted edge processing algorithm submodule is used for carrying out secondary processing on the edges to be deleted determined by the first removing algorithm and the second removing algorithm.
The third aspect of the present application provides a high recall cause and effect discovery apparatus based on time series operation and maintenance big data, which includes a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the aforementioned high recall cause and effect discovery method based on time series operation and maintenance big data.
A fourth aspect of the present application provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the aforementioned time-series operation and maintenance big data-based high-recall causal discovery method.
In a fifth aspect of the invention, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the aforementioned time-series operation and maintenance big data based high-recall causal discovery method.
The technical scheme has the following beneficial effects: the operation and maintenance data are processed through the optimized cause and effect discovery method, so that cause and effect relationships extracted from the operation and maintenance data are more accurate, and the accuracy and reliability of analysis of operation and maintenance big data are improved.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 is a schematic flow chart illustrating a high recall causal discovery method based on time series operation and maintenance big data according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of a high recall cause and effect discovery apparatus based on time series operation and maintenance big data according to an embodiment of the present application.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.
To help those skilled in the art understand and implement the technical principles of the present invention and its embodiments, the underlying theory is described first:
For a true connection
Figure BDA0003397770590000051
the detection power quantifies the probability that the connection is not erroneously deleted because of an erroneous partial correlation (ParCorr) CI test, and is denoted
Figure BDA0003397770590000052
and
Figure BDA0003397770590000053
It is noted that the method is also applicable to non-time-series settings. The detection power depends on four factors: (1) the sample size (usually fixed); (2) the significance level α of the CI test (usually fixed to a desired false-positive level); (3) the estimation dimension of the CI test; (4) the effect size.
The effect size is defined as the minimum of the CI test statistic I(A; B | S) over all condition sets S that are tested. This minimum can be very small, which ultimately results in low detection power. The method increases the effect size in two ways: (1) restricting the condition sets S that are tested to those needed to delete all erroneous connections, for which it is sufficient to consider only condition sets consisting of ancestors of A or B; (2) extending each tested condition set S with so-called default conditions S_def, i.e. testing S ∪ S_def, in order to increase the CI test statistic without creating spurious dependencies. If S_def consists only of ancestors of A or B, no spurious dependencies arise.
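As a rough illustration of the factors listed above (sample size, significance level, estimation dimension and effect size), the following sketch implements a partial correlation (ParCorr) CI test with a Fisher z-transform using only the Python standard library. The function names `corr`, `partial_corr` and `parcorr_pvalue`, the toy data and the normal approximation of the p-value are assumptions made for illustration and are not part of the disclosure.

```python
import math
import random


def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(data, i, j, cond=()):
    """Partial correlation of columns i and j given the columns in cond.

    Uses the standard recursive formula, removing one conditioning
    variable at a time.
    """
    if not cond:
        return corr(data[i], data[j])
    k, rest = cond[0], tuple(cond[1:])
    r_ij = partial_corr(data, i, j, rest)
    r_ik = partial_corr(data, i, k, rest)
    r_jk = partial_corr(data, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def parcorr_pvalue(r, n, dim):
    """Two-sided p-value via the Fisher z-transform.

    n is the sample size and dim the size of the conditioning set
    (the estimation dimension grows with dim).
    """
    z = math.atanh(r) * math.sqrt(n - dim - 3)
    return math.erfc(abs(z) / math.sqrt(2))


# Toy data (assumed): C is a common driver of A and B.
rng = random.Random(0)
n = 2000
c = [rng.gauss(0, 1) for _ in range(n)]
a = [ci + rng.gauss(0, 1) for ci in c]
b = [ci + rng.gauss(0, 1) for ci in c]
data = [a, b, c]

r_marg = partial_corr(data, 0, 1)          # A and B without conditioning
r_cond = partial_corr(data, 0, 1, (2,))    # A and B given C
print(parcorr_pvalue(r_marg, n, 0), parcorr_pvalue(r_cond, n, 1))
```

In this toy setting the unconditional test should report a clear dependence between A and B, while the test given C should not. With a fixed sample size and significance level, a larger conditioning set costs degrees of freedom, so a connection survives only if its effect size is large enough.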
The theory behind the above effect is as follows:
Let A → B (where
Figure BDA0003397770590000054
and
Figure BDA0003397770590000055
) be a link (→ or
Figure BDA0003397770590000056
) in M(G). Take the default conditions S_def = pa({A, B}, M(G)) \ {A, B}, and let X* = X \ S_def. Let
Figure BDA0003397770590000057
be the set of condition sets that defines the effect size of the enhanced method. Suppose the following two conditions are satisfied simultaneously: (1) this set contains an S* for which
Figure BDA0003397770590000058
or
Figure BDA0003397770590000059
(2) there is a proper subset
Figure BDA00033977705900000510
satisfying I(A; B; S_def \ Q | S* ∪ Q) < 0. Then
Figure BDA00033977705900000511
If the two conditions are not both satisfied, the ">" in the above formula becomes "≥".
Here I denotes (conditional) mutual information, and I(A; B; C | D) ≡ I(A; B | D) - I(A; B | C ∪ D) denotes the (conditional) interaction information. The theorem shows that choosing S_def as the union of the parents of A and B increases the effect size.
It follows from the above derivation that conditioning on parents yields higher detection power, and hence higher recall, unless the larger effect size is outweighed by the increased estimation dimension (which is caused by the higher cardinality of the condition sets). The principle is only useful if some (non-)ancestor relationships are known before all CI tests have been completed. The method achieves this by interleaving edge removal and edge orientation, for example by first learning ancestral relationships and then deleting erroneous edges.
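The effect described above can be checked numerically on a small example. The sketch below assumes a hypothetical linear Gaussian model, in which conditional mutual information follows from the partial correlation ρ as -0.5·ln(1 - ρ²), and compares the statistic for A and B without conditioning against the statistic given the parents of A and B. The model, its coefficients and the function names are illustrative assumptions. The example compares only two condition sets and does not evaluate the minimum over all tested condition sets.

```python
import math


def parcorr_from_cov(cov, i, j, cond=()):
    """Partial correlation of variables i and j given cond, from a covariance matrix."""
    if not cond:
        return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])
    k, rest = cond[0], tuple(cond[1:])
    r_ij = parcorr_from_cov(cov, i, j, rest)
    r_ik = parcorr_from_cov(cov, i, k, rest)
    r_jk = parcorr_from_cov(cov, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def gaussian_cmi(rho):
    """(Conditional) mutual information of jointly Gaussian variables."""
    return -0.5 * math.log(1.0 - rho ** 2)


# Hypothetical model (coefficients are assumptions):
#   PA, PB ~ N(0, 1);  A = a*PA + eA;  B = c*A + b*PB + eB;  eA, eB ~ N(0, 1)
a, b, c = 0.5, 3.0, 0.5
var_a = a * a + 1.0
var_b = c * c * var_a + b * b + 1.0

# Variable order: 0 = PA (parent of A), 1 = PB (parent of B), 2 = A, 3 = B
cov = [
    [1.0,   0.0, a,         c * a],
    [0.0,   1.0, 0.0,       b],
    [a,     0.0, var_a,     c * var_a],
    [c * a, b,   c * var_a, var_b],
]

effect_plain = gaussian_cmi(parcorr_from_cov(cov, 2, 3))
effect_with_parents = gaussian_cmi(parcorr_from_cov(cov, 2, 3, (0, 1)))
print(effect_plain, effect_with_parents)
```

In this model the strong parent PB contributes most of the variance of B. Conditioning on it removes that variance, so the dependence between A and B is easier to detect.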
In the technical solution of this application, the acquisition, storage, use and processing of data comply with the relevant national laws and regulations.
FIG. 1 is a schematic flow chart illustrating a high recall causal discovery method based on time series operation and maintenance big data according to an embodiment of the present application. As shown in fig. 1, in an embodiment of the present application, a method for high recall cause and effect discovery based on time series operation and maintenance big data is provided, including:
101. predefining a number of relationship rules to be applied to edges and points in the full graph;
102. acquiring streaming operation and maintenance data to be analyzed;
103. initializing the streaming operation and maintenance data into a complete graph by adopting a graph library and a graph algorithm;
104. processing the relationship in the complete graph of the streaming operation and maintenance data according to the relationship rules;
105. outputting a processed cause and effect relationship graph, wherein the cause and effect relationship graph is used for describing cause and effect relationships in the streaming operation and maintenance data.
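Step 103 can be sketched as follows. This is a minimal illustration that uses plain Python sets rather than a graph library. The representation of each point as a hypothetical (variable index, lag) pair and the names `init_complete_graph`, `n_vars` and `tau_max` are assumptions made here and are not part of the disclosure.

```python
from itertools import combinations


def init_complete_graph(n_vars, tau_max):
    """Return the points (variable, lag) and one edge between every pair of points."""
    # lag = -tau, so (j, 0) is variable j at time t and (j, -1) is variable j at t-1
    nodes = [(j, -tau) for tau in range(tau_max, -1, -1) for j in range(n_vars)]
    edges = {frozenset(pair) for pair in combinations(nodes, 2)}
    return nodes, edges


nodes, edges = init_complete_graph(n_vars=3, tau_max=2)
print(len(nodes), len(edges))  # 9 points, 36 edges
```

Every pair of points starts out connected. The later steps only delete or orient edges, so the initial graph never has to be extended.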
The causal relationships between points in the causal relationship graph are represented by edges, including directed edges, which correspond to a cause-effect relationship, and bidirectional edges, whose endpoints are mutually cause and effect. In the above embodiment, the predefined relationship rules are based on the causal rules used in causal discovery, including but not limited to the rules for causal association between points, the edge-removal rules and the orientation rules mentioned in the technical principles section. These relationship rules are the basis for processing the relationships in step 104.
The streaming operation and maintenance data to be analyzed are acquired, for example through real-time access via Kafka. The acquired streaming operation and maintenance data are then initialized: the data are mapped to a number of points in a graph, and every pair of points is connected by an edge to obtain a complete graph. The complete graph contains all possible causal relationships between all points and therefore does not yet reflect the actual causal relationships in the streaming operation and maintenance data. Some of the edges in the complete graph are deleted according to the predefined relationship rules. The points that remain connected by edges represent the causal relationships between them, and the graph at this stage is the final causal relationship graph. The deletion of edges according to the predefined relationship rules can be implemented by one or more preset edge deletion algorithms.
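The deletion of edges from a complete graph can be illustrated with a generic constraint-based (PC-style) pruning loop. The sketch below is not the first or second removal algorithm of this application. It assumes a hypothetical `is_independent(x, y, s)` callback that stands in for a CI test, and it only removes edges without orienting them.

```python
from itertools import combinations


def prune_edges(nodes, is_independent, max_cond=2):
    """Delete every edge whose endpoints are independent given some set of neighbours."""
    adj = {v: set(nodes) - {v} for v in nodes}  # start from the complete graph
    sepsets = {}
    for size in range(max_cond + 1):            # condition sets of growing size
        for x in nodes:
            for y in sorted(adj[x]):            # sorted() copies, so adj may be edited
                candidates = sorted(adj[x] - {y})
                for s in combinations(candidates, size):
                    if is_independent(x, y, set(s)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepsets[frozenset((x, y))] = set(s)
                        break
    return adj, sepsets


def chain_oracle(x, y, s):
    """Assumed ground truth for the chain X -> Y -> Z: X and Z are independent given Y."""
    return {x, y} == {"X", "Z"} and "Y" in s


adj, sepsets = prune_edges(["X", "Y", "Z"], chain_oracle)
print(adj)      # X-Y and Y-Z remain, X-Z is deleted
print(sepsets)  # the separating set {Y} is recorded for the pair X, Z
```

Recording the separating set of each deleted edge matters because the later orientation rules refer to these sets.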
In the above embodiment, the constraint-based causal discovery algorithm includes causal parents in the condition sets, which increases the effect size. The embodiment identifies the low power of the conditional independence tests as the main cause of low recall and uses the theory above to increase the effect size of the CI tests. Throughout the process, as many parents as possible are identified, latent confounders of the time series are taken into account, and new orientation rules are used to determine parent-child or ancestral relationships during the edge-removal phase, which is iterated. The maximally informative causal relationship graph obtained in this way can accurately represent the causal relationships in the streaming operation and maintenance data.
In an embodiment provided by the present invention, processing the relationship in the full graph according to the relationship rules includes: defining the following algorithm according to the plurality of relationship rules: a full graph processing main algorithm, a first removal algorithm, a second removal algorithm and a delete edge processing algorithm; processing the edges in the complete graph by adopting a first removal algorithm and a second removal algorithm to realize the processing of the relationship; wherein the full graph processing main algorithm is used for providing a main function and calling the first removal algorithm and the second removal algorithm; the first removal algorithm is used for determining that all edges of the ordered variable pairs between the complete graph and the non-adjacent variables are to be deleted or reserved; determining the direction of the edge to be reserved according to the causal relationship between the points connected with the edge to be reserved; the second removal algorithm is used for determining edges between the ordered variable pairs, which meet a preset separation set, to be deleted or reserved; determining the direction of the edge to be reserved according to the causal relationship between the points connected with the edge to be reserved; and the deleted edge processing algorithm is used for carrying out secondary processing on the edges to be deleted determined by the first removing algorithm and the second removing algorithm. Specific implementations of the above algorithms will be described in detail later.
For a better understanding of the embodiments described below, the terms used in the following pseudo-code are defined as follows:
Define the multivariate time series V^j
Figure BDA0003397770590000081
Figure BDA0003397770590000082
to follow a stationary discrete-time structural vector-autoregressive process described by the structural causal model (SCM) as follows:
Figure BDA0003397770590000083
where j = 1, …,
Figure BDA0003397770590000084
The measurable functions f_j depend on their input arguments, and the noise variables
Figure BDA0003397770590000085
are mutually independent. The sets
Figure BDA0003397770590000086
define the causal parents of V_t^j, where V_t = (V_t^1, V_t^2, …) and p_ts is the order of the time series. Owing to stationarity, the causal relationship of a pair of variables
Figure BDA0003397770590000087
is the same as that of all time-shifted pairs
Figure BDA0003397770590000088
where τ ≥ 0 is called the lag.
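To make the process defined above concrete, the following sketch simulates a linear special case of such a structural vector-autoregressive process with independent Gaussian noise. The linear form of f_j, the coefficient layout `coeffs` and the function name `simulate_svar` are assumptions chosen for illustration. The definition above allows general functions f_j.

```python
import random


def simulate_svar(coeffs, n_steps, seed=0):
    """Simulate V_t^j = sum of c * V_{t-tau}^i over the parents (i, tau) of j, plus noise.

    coeffs maps each target index j to {(i, tau): c}. tau = 0 denotes a
    contemporaneous parent, which must have a smaller index than j so that
    no cycles arise.
    """
    rng = random.Random(seed)
    n_vars = len(coeffs)
    tau_max = max((tau for parents in coeffs.values() for (_, tau) in parents), default=0)
    series = [[0.0] * n_vars for _ in range(tau_max)]  # zero-valued start-up rows
    for t in range(tau_max, tau_max + n_steps):
        row = [0.0] * n_vars
        series.append(row)
        for j in range(n_vars):
            value = rng.gauss(0.0, 1.0)                # independent noise term
            for (i, tau), c in coeffs[j].items():
                value += c * series[t - tau][i]
            row[j] = value
    return series[tau_max:]


# Two autocorrelated variables with a lagged link from variable 0 to variable 1.
coeffs = {
    0: {(0, 1): 0.9},
    1: {(1, 1): 0.9, (0, 1): 0.5},
}
data = simulate_svar(coeffs, n_steps=500)
print(len(data), len(data[0]))  # 500 time steps, 2 variables
```

Because the same coefficients are applied at every time step, the link from variable 0 at lag 1 to variable 1 holds for all time-shifted pairs, which is the stationarity property used above.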
Cyclic causal relationships are assumed to be absent; because of time order, this assumption only restricts contemporaneous (τ = 0) interactions. The method allows for unobserved variables, i.e. only a subset
Figure BDA0003397770590000089
is observed, where
Figure BDA00033977705900000810
It is further assumed that there are no selection variables and that faithfulness holds, i.e. conditional independence (CI) in the observed distribution P(V) generated by the SCM implies d-separation in the associated time series graph G over the variables V.
Maximal ancestral graphs and partial ancestral graphs are defined. Maximal ancestral graphs (MAGs) may contain directed edges (denoted "→" in the graph) and bidirectional edges (denoted by
Figure BDA00033977705900000811
in the graph), where a bidirectional edge may also be referred to as a link; partial ancestral graphs (PAGs) may have additional directed and bidirectional edge types.
The maximum time lag is defined. In time series causal discovery, the stationarity assumption and the length of the chosen time lag window t - τ_max ≤ t' ≤ t play an important role. Under causal sufficiency (X = V), the causal graph is the same for all τ_max ≥ p_ts. The latent case is different. Let
Figure BDA00033977705900000812
be the MAG obtained by marginalizing over all unobserved variables and over all generally observed variables at times t' < t - τ_max. Then enlarging the time lag window by increasing τ_max may cause edges contained in the original window to be deleted, even if all statistical decisions are correct. That is, for τ_max,1 < τ_max,2,
Figure BDA0003397770590000091
need not be a subgraph of
Figure BDA0003397770590000092
Thus τ_max is an analysis choice and not a tunable parameter.
Soundness and completeness are defined. For the same reason, stationarity also affects the definition of the MAGs and PAGs being estimated. Therefore, a single PAG of
Figure BDA0003397770590000093
cannot usually be determined. The above logic is formalized as follows: (1) the MAG obtained by enforcing repeating adjacencies in
Figure BDA0003397770590000094
is denoted
Figure BDA0003397770590000095
(2) the maximally informative PAG of the Markov equivalence class, obtained by running the orientation rules on
Figure BDA0003397770590000096
is denoted
Figure BDA0003397770590000097
(3) the PAG obtained by additionally enforcing time order and repeating orientations at each step of applying the orientation rules is denoted
Figure BDA0003397770590000098
In addition,
Figure BDA0003397770590000099
may have fewer circle marks than
Figure BDA00033977705900000910
and therefore contain more information. In summary, the aim is to construct
Figure BDA00033977705900000911
An algorithm is said to be sound if it returns a PAG of
Figure BDA00033977705900000912
and complete if it returns
Figure BDA00033977705900000913
Hereinafter, these are simply denoted
Figure BDA00033977705900000914
and
Figure BDA00033977705900000915
Define the set
Figure BDA00033977705900000916
Here
Figure BDA00033977705900000917
refers to the set, excluding
Figure BDA00033977705900000918
of all non-future adjacencies of
Figure BDA00033977705900000919
where
Figure BDA00033977705900000920
has not been determined to be a non-ancestor of
Figure BDA00033977705900000921
On the basis of the aforementioned napds_t set, the following sets are defined:
The set
Figure BDA00033977705900000922
is the union of
Figure BDA00033977705900000923
and
Figure BDA00033977705900000924
The set
Figure BDA00033977705900000925
is the result of removing from the set
Figure BDA00033977705900000926
all variables
Figure BDA00033977705900000927
where
Figure BDA00033977705900000928
denotes all variables that are connected to
Figure BDA00033977705900000929
by an edge whose tail is at
Figure BDA00033977705900000930
The set
Figure BDA00033977705900000931
is the set of all variables
Figure BDA00033977705900000932
where
Figure BDA00033977705900000933
is connected via a path p to
Figure BDA00033977705900000934
The properties of path p are defined. Path p has the following properties:
a) no node on path p other than
Figure BDA00033977705900000935
has a tail;
b) the middle node of every three consecutive nodes on path p is a collider on path p;
c) path p does not contain
Figure BDA0003397770590000101
d) And
Figure BDA0003397770590000102
adjacent node
Figure BDA0003397770590000103
And
Figure BDA0003397770590000104
without a head as
Figure BDA0003397770590000105
Are connected without the edges of
Figure BDA0003397770590000106
Is a tail part;
e) on the path p except
Figure BDA0003397770590000107
And
Figure BDA0003397770590000108
all nodes and
Figure BDA0003397770590000109
or
Figure BDA00033977705900001010
A tail part is
Figure BDA00033977705900001011
Or
Figure BDA00033977705900001012
Are not connected by edges and are not connected by
Figure BDA00033977705900001013
And
Figure BDA00033977705900001014
the tail part or the head part is connected with both sides.
An intermediate mark is defined. To facilitate early localization of edges, a clear causal explanation is given to the graph at each step of the algorithm, by adding intermediate labels to the edges. The intermediate mark is represented on a link symbol, and the options include: "? "," L "," R ","! "or empty. For example:
Figure BDA00033977705900001015
is shown if A<B(B<A) Then there is
Figure BDA00033977705900001016
Or is absent
Figure BDA00033977705900001017
I.e., a and B have several distances in m (g). "<"herein refers to any order of variable sets for the purpose of distinguishing
Figure BDA00033977705900001018
And
Figure BDA00033977705900001019
the choice of this symbol is arbitrary and does not affect the content of the cause and effect information. The "+" in the formula is a wildcard symbol indicating the label of all three connecting sides (tail, head, loop) present in PAGs. In addition to this, the present invention is,
Figure BDA00033977705900001020
to represent
Figure BDA00033977705900001021
And
Figure BDA00033977705900001022
are all true. The empty middle marker a × B then indicates a ∈ adj (B, m (g)). Is there a And does not represent any state. Non-circular edge labels (which may be hidden under a symbol here) may represent ancestors and non-ancestors in the standard sense, while the absence of an edge between A and B may still be expressed in the sense that
Figure BDA00033977705900001023
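The edge notation described above can be sketched as a small data structure. This is only an illustrative sketch: the class and member names (`MiddleMark`, `EndMark`, `Edge`, `adjacency_confirmed`, `render`) are hypothetical and do not come from the application, and the only semantics encoded are those stated in the text (five possible middle marks, three end marks, and an empty middle mark confirming adjacency in M(G)).

```python
from dataclasses import dataclass
from enum import Enum


class MiddleMark(Enum):
    """The five middle marks listed in the text."""
    QUESTION = "?"      # asserts nothing yet
    L = "L"
    R = "R"
    EXCLAMATION = "!"
    EMPTY = ""          # adjacency in M(G) is confirmed


class EndMark(Enum):
    """The three PAG edge marks that the wildcard may stand for."""
    TAIL = "-"
    HEAD = ">"
    CIRCLE = "o"


@dataclass
class Edge:
    a: str
    b: str
    mark_at_a: EndMark = EndMark.CIRCLE
    mark_at_b: EndMark = EndMark.CIRCLE
    middle: MiddleMark = MiddleMark.QUESTION

    def adjacency_confirmed(self) -> bool:
        # Per the text, only an empty middle mark states A in adj(B, M(G)).
        return self.middle is MiddleMark.EMPTY

    def render(self) -> str:
        # A head at the left end is drawn as "<"; the other marks keep their symbol.
        left = {EndMark.TAIL: "-", EndMark.HEAD: "<", EndMark.CIRCLE: "o"}[self.mark_at_a]
        return f"{self.a} {left}{self.middle.value}{self.mark_at_b.value} {self.b}"


if __name__ == "__main__":
    print(Edge("A", "B").render())
    print(Edge("A", "B", EndMark.TAIL, EndMark.HEAD, MiddleMark.EMPTY).render())
```

Keeping the middle mark separate from the two end marks mirrors the text, where orientation information and adjacency information are updated independently.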
An ancestor-parent rule is defined. 1) From A → B one obtains
Figure BDA00033977705900001024
2) From A > B and A → B one obtains
Figure BDA00033977705900001025
3) From A < B and A → B one obtains
Figure BDA00033977705900001026
The total order is chosen to be consistent with time order, i.e.
Figure BDA00033977705900001027
where τ > 0, or τ = 0 and i < j. Lagged links can then be initialized with the edge
Figure BDA00033977705900001028
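A time-consistent total order of this kind can be sketched as follows. The formula itself is an image in the source, so this sketch relies only on the stated condition (τ > 0, or τ = 0 and i < j) and on the assumption that a node is a pair of a variable index i and a lag τ, read as X^i at time t − τ. The names `Node` and `precedes` are hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    var: int   # index i of the time series X^i
    lag: int   # tau >= 0; the node stands for X^i at time t - tau


def precedes(a: Node, b: Node) -> bool:
    """Return True if a < b in a total order consistent with time.

    Earlier variables (larger lag) come first; ties at the same time
    step are broken by the variable index.
    """
    if a.lag != b.lag:
        return a.lag > b.lag
    return a.var < b.var


if __name__ == "__main__":
    nodes = [Node(1, 0), Node(0, 0), Node(1, 2), Node(0, 1)]
    # Sorting by (-lag, var) gives the same order as `precedes`.
    print(sorted(nodes, key=lambda n: (-n.lag, n.var)))
```

Because a lagged variable always precedes the present one under this order, a lagged link can only point forward in time, which is what allows lagged links to be initialized with a partially oriented edge.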
A weakly minimal separation set is defined. In the MAG M(G), assume that A and B are m-separated by a set S. The set S is called a weakly minimal separation set of A and B if it satisfies the following two conditions:
1) S can be decomposed as
Figure BDA0003397770590000111
where
Figure BDA0003397770590000112
2) If
Figure BDA0003397770590000113
and
Figure BDA0003397770590000114
and A and B are m-separated, then S'_2 = S_2.
The pair (S_1, S_2) is then called a weakly minimal decomposition of S.
This generalizes the definition of a minimal separation set; to recover the latter, one must additionally guarantee that
Figure BDA0003397770590000115
A strong unshielded triple rule is defined. Let
Figure BDA0003397770590000116
be an unshielded triple in the PAG C(G), and let S_AC be a separation set of A and C. Then:
1) If B ∈ S_AC and S_AC is weakly minimal, then B ∈ an({A, C}, G).
2) Let
Figure BDA0003397770590000117
be arbitrary. If
Figure BDA0003397770590000118
A and B cannot be separated by S_AC ∪ τ_AB \ {A, B}, and C and B cannot be separated by S_AC ∪ τ_CB \ {C, B}, then
Figure BDA0003397770590000119
The conditioning sets in the latter two conditions may be intersected with future or past variables. The rules described above apply to each step of the algorithm and may be executed at any time and in any order.
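The two cases of the rule can be sketched as a decision function. Several parts of the rule are images in the source, so the following are assumptions of this sketch rather than statements of the application: that case 2 requires B not to be in S_AC, that its conclusion is that B is not an ancestor of A or C, and that the set expressions are read as (S_AC ∪ τ_AB) \ {A, B}. The function name, its parameters, and the `separates` oracle (which in practice would be a CI test) are hypothetical.

```python
from typing import Callable, FrozenSet, Optional

# Hypothetical oracle: separates(x, y, cond) is True if x and y are separated given cond.
Separates = Callable[[str, str, FrozenSet[str]], bool]


def strong_triple_rule(a: str, b: str, c: str,
                       s_ac: FrozenSet[str],
                       s_ac_weakly_minimal: bool,
                       t_ab: FrozenSet[str],
                       t_cb: FrozenSet[str],
                       separates: Separates) -> Optional[bool]:
    """Return True (B is an ancestor of A or C), False (assumed: B is a
    non-ancestor of both), or None if neither case of the rule applies."""
    if b in s_ac:
        # Case 1: B lies in a weakly minimal separation set of A and C.
        return True if s_ac_weakly_minimal else None
    # Case 2 (assumed reading): B is not in S_AC and neither pair can be separated.
    cond_ab = (s_ac | t_ab) - {a, b}
    cond_cb = (s_ac | t_cb) - {c, b}
    if not separates(a, b, cond_ab) and not separates(c, b, cond_cb):
        return False
    return None


if __name__ == "__main__":
    never = lambda x, y, cond: False
    print(strong_triple_rule("A", "B", "C", frozenset({"B"}), True,
                             frozenset(), frozenset(), never))
```

Returning `None` when the rule is silent keeps the function safe to call at any step and in any order, as the text allows.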
In one embodiment of the invention, the complete graph processing main algorithm comprises: calling the first removal algorithm to perform a first traversal of the complete graph and mark the parent-child relationships in it; calling the first removal algorithm and the second removal algorithm on the complete graph after the first traversal to determine the partial ancestral graph (PAG) corresponding to the complete graph; and taking the resulting PAG as the most informative causal graph. To facilitate understanding and implementation by those skilled in the art, the corresponding pseudocode is provided as follows:
Require: time series data set X = {X^1, ..., X^N}, maximum time lag τ_max, significance level α, CI test CI(X, Y, S), non-negative integer k;
1. initialize C(G) as a complete graph, where
Figure BDA00033977705900001110
2. for 0 ≤ l ≤ k-1, do:
3. remove edges and apply the orientation rules using the first removal algorithm;
4. repeat line 1; if
Figure BDA0003397770590000121
the directed edge is marked in C(G);
5. remove edges and apply the orientation rules using the first removal algorithm;
6. remove edges and apply the orientation rules using the second removal algorithm;
7.
Figure BDA0003397770590000122
The above pseudocode is explained as follows. After C(G) is initialized as a complete graph, the algorithm enters its initial phase, which calls the first removal algorithm. That algorithm removes many (but typically not all) of the false links and, while deleting them, repeatedly applies the orientation rules described above. These rules identify a subset of the (non-)ancestor relationships in G and set the head or tail marks on the edges of C(G) accordingly. The non-ancestor relationships then further constrain the conditioning sets S of subsequent CI tests, and the ancestor relationships are used to extend these sets to S ∪ S_def, where
Figure BDA0003397770590000123
consists of the parents, already known from C(G), of the variables being tested for independence. All parent-child relationships marked in C(G) after line 3 are then recorded and carried over to the re-initialized C(G) before the first removal algorithm is applied again, so that the conditioning sets can be extended by known parents from the start. The purpose of this iterative process is to determine an accurate subset of the parent-child relationships in G. These results are then passed to the final phase in lines 5-6, where the first removal algorithm is applied one last time. At this point there may still be false links, because the first removal algorithm may fail to delete the link between the variables
Figure BDA0003397770590000124
and
Figure BDA0003397770590000125
when they are falsely connected and
Figure BDA0003397770590000126
and
Figure BDA0003397770590000127
are not ancestors of each other. Removing such links is the purpose of the second removal algorithm, which is invoked in line 6 as the second deletion phase. Like the first removal algorithm, it repeatedly applies the orientation rules and uses the identified (non-)ancestor relationships. In this way the PAG P(G) is obtained. In addition, the output of the method does not depend on the order of the N time series variables X^j. The number k of iterations in the initial phase is a hyperparameter, and stationarity is maintained at every step of the algorithm.
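The control flow of lines 1 to 7 can be sketched as follows. This is a structural sketch only: the two removal algorithms are passed in as callables, the graph is reduced to a dictionary from edges to a status string, and the initial edge types (which are images in the source) are replaced by a single hypothetical "?" status. The names `init_complete_graph` and `main_algorithm` are not from the application.

```python
from itertools import product
from typing import Callable, Dict, Optional, Set, Tuple

Node = Tuple[int, int]        # (variable index i, lag tau): X^i at time t - tau
EdgeKey = Tuple[Node, Node]
Graph = Dict[EdgeKey, str]    # edge -> status: "?" (undecided) or "parent"


def init_complete_graph(n_vars: int, tau_max: int,
                        parents: Optional[Set[EdgeKey]] = None) -> Graph:
    """Line 1 (and line 4): complete graph, keeping already known parent links."""
    parents = parents or set()
    graph: Graph = {}
    for i, j, tau in product(range(n_vars), range(n_vars), range(tau_max + 1)):
        if tau == 0 and i >= j:
            continue  # one contemporaneous edge per unordered pair, no self-edge
        edge = ((i, tau), (j, 0))
        graph[edge] = "parent" if edge in parents else "?"
    return graph


def main_algorithm(n_vars: int, tau_max: int, k: int,
                   first_removal: Callable[[Graph], Graph],
                   second_removal: Callable[[Graph], Graph]) -> Graph:
    graph = init_complete_graph(n_vars, tau_max)                  # line 1
    for _ in range(k):                                            # line 2
        graph = first_removal(graph)                              # line 3
        parents = {e for e, s in graph.items() if s == "parent"}
        graph = init_complete_graph(n_vars, tau_max, parents)     # line 4
    graph = first_removal(graph)                                  # line 5
    graph = second_removal(graph)                                 # line 6
    return graph                                                  # line 7


if __name__ == "__main__":
    identity = lambda g: g
    print(len(main_algorithm(2, 1, 1, identity, identity)))
```

Note that line 4 deliberately discards every deletion made in line 3 and keeps only the parent-child relationships, so the later passes start from a complete graph but with larger default conditioning sets.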
In one embodiment of the present invention, the first removal algorithm further comprises: determining ordered variable pairs according to the ordered variables in the complete graph and the adjacency relation; determining that the middle mark of the edge between an ordered variable pair is a first type of mark; determining, according to a first preset condition, whether the edge between the ordered variable pair is to be retained or deleted; and, if the edge is to be deleted, calling the deleted edge processing algorithm to perform secondary processing. Specifically, the pseudocode of the first removal algorithm is provided as follows:
Require: C(G), memory of minimum test statistics I_min(·, ·), separation sets SepSet(·, ·), time series data set X = {X^1, ..., X^N}, maximum time lag τ_max, significance level α, CI test CI(X, Y, S)
Figure BDA0003397770590000131
Figure BDA0003397770590000141
The above pseudocode is explained as follows. The algorithm deletes, for pairs
Figure BDA0003397770590000142
all edges between variables that are not adjacent in M(G). To this end, the algorithm tests for independence given S ∪ S_def, with S drawn from
Figure BDA0003397770590000143
where the cardinality p of S is increased successively and apds_t is as defined earlier, that is, it excludes from
Figure BDA0003397770590000144
all variables that have already been identified as non-ancestors. The default conditioning set
Figure BDA0003397770590000145
consists of all variables that are parents of
Figure BDA0003397770590000146
or
Figure BDA0003397770590000147
as marked in C(G). The algorithm must restart with p = 0, since otherwise separation sets computed later might not be weakly minimal. The rules mentioned above may also identify further non-ancestor relationships, which then further restrict the apds_t sets. Another innovation of the method is that some edges can be decided and deleted in advance, before any edge test is carried out. This can also be described as follows:
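The search over conditioning sets S ∪ S_def with successively increasing cardinality p can be sketched as follows. This sketch covers only the enumeration and a single pass over it; the restart at p = 0 and the interleaved rule applications described above are omitted. The function names are hypothetical, and the assumption that the CI test returns a p-value compared against α is made here for illustration only.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Iterator, Optional, Tuple


def conditioning_sets(candidates: Iterable[str],
                      s_def: Iterable[str]) -> Iterator[Tuple[int, FrozenSet[str]]]:
    """Yield (p, S ∪ S_def) for all subsets S of the candidates, smallest p first."""
    default = frozenset(s_def)
    pool = [v for v in candidates if v not in default]
    for p in range(len(pool) + 1):
        for subset in combinations(pool, p):
            yield p, frozenset(subset) | default


def find_separation_set(x: str, y: str,
                        candidates: Iterable[str],
                        s_def: Iterable[str],
                        ci_test: Callable[[str, str, FrozenSet[str]], float],
                        alpha: float) -> Optional[FrozenSet[str]]:
    """Return the first conditioning set under which x and y test as independent."""
    for _, cond in conditioning_sets(candidates, s_def):
        if ci_test(x, y, cond) > alpha:   # assumed: independence if p-value > alpha
            return cond
    return None


if __name__ == "__main__":
    toy_test = lambda x, y, cond: 1.0 if "b" in cond else 0.0
    print(find_separation_set("x", "y", ["a", "b"], {"z"}, toy_test, 0.05))
```

Because known parents are always part of the conditioning set through S_def, only the remaining candidates contribute to the combinatorial search.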
All auto-dependency links are tested first, followed by cross-links with lags increasing step by step from 0 to τ_max. This overall ordering is independent of the order of the N time series variables X^j, so no order dependence is introduced. The algorithm converges once every middle mark in C(G) is "!" or empty: such edges cannot be separated any further, so no further tests are needed. The memory updated in line 11 tracks the minimum test statistic over all previous CI tests for a given pair of variables. These values are used to sort the search set S_search; for example, if
Figure BDA0003397770590000148
then in S_search
Figure BDA0003397770590000149
is guaranteed to appear before
Figure BDA00033977705900001410
Note that in line 18 only a selected subset of the rules is applied, and only to directed lagged links.
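The memory of minimum test statistics and its use for ordering the search set can be sketched as follows. The exact ordering condition is an image in the source, so the direction used here (candidates with a larger minimum statistic come first) is an assumption of this sketch, as are the function names and the choice to place never-tested candidates at the front.

```python
from typing import Dict, FrozenSet, List


def update_min_stat(i_min: Dict[FrozenSet[str], float],
                    x: str, y: str, stat: float) -> None:
    """Keep the smallest absolute test statistic seen so far for the pair (x, y)."""
    key = frozenset((x, y))
    i_min[key] = min(i_min.get(key, float("inf")), abs(stat))


def sort_search_set(candidates: List[str], target: str,
                    i_min: Dict[FrozenSet[str], float]) -> List[str]:
    """Order candidates by their remembered statistic with `target`, largest first.

    Candidates never tested against `target` default to infinity and therefore
    come first (an assumption of this sketch).
    """
    return sorted(candidates,
                  key=lambda v: i_min.get(frozenset((v, target)), float("inf")),
                  reverse=True)


if __name__ == "__main__":
    memory: Dict[FrozenSet[str], float] = {}
    update_min_stat(memory, "a", "y", 0.9)
    update_min_stat(memory, "a", "y", 0.4)
    update_min_stat(memory, "b", "y", 0.7)
    print(sort_search_set(["a", "b"], "y", memory))
```

Using the minimum over all previous tests, rather than the latest value, makes the ordering depend on the weakest dependence observed so far for each pair.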
In one embodiment of the present invention, the second removal algorithm further comprises: determining ordered variable pairs according to the ordered variables in the complete graph and the adjacency relation; determining that the middle mark of the edge between an ordered variable pair is a second type of mark; determining, according to a second preset condition, whether the edge between the ordered variable pair is to be retained or deleted; and, if the edge is to be deleted, calling the deleted edge processing algorithm to perform secondary processing. Specifically, the pseudocode of the second removal algorithm is provided as follows:
Require: C(G), memory of minimum test statistics I_min(·, ·), separation sets SepSet(·, ·), time series data set X = {X^1, ..., X^N}, maximum time lag τ_max, significance level α, CI test CI(X, Y, S)
Figure BDA0003397770590000151
Figure BDA0003397770590000161
The above pseudocode is explained as follows. At this point all middle marks in C(G) are "!" or empty. An edge with an empty middle mark must be in M(G), whereas an edge marked "!" is not necessarily in M(G). The latter edges connect pairs of variables in which neither is an ancestor of the other. The algorithm searches for separation sets only within
Figure BDA0003397770590000162
In addition to the parent nodes in C(G), the algorithm also adds to the default conditioning set, from
Figure BDA0003397770590000163
all nodes of the current napds_t sets. Once all middle marks are empty, the algorithm has converged, and the orientation rules are then applied exhaustively one final time to guarantee completeness.
In an embodiment of the present invention, the deleted edge processing algorithm includes: acquiring the edges to be deleted as determined by the first removal algorithm and the second removal algorithm; if an edge to be deleted is a directed edge, resolving the deletion conflict; and executing the deletion operation on the edge to be deleted and making the corresponding separation set weakly minimal. Specifically, the pseudocode is provided as follows:
Require: C(G), ordered rule list τ, memory of minimum test statistics I_min(·, ·), separation sets SepSet(·, ·), time series data set X = {X^1, ..., X^N}, maximum time lag τ_max, significance level α, CI test CI(X, Y, S)
Figure BDA0003397770590000171
The above pseudocode is explained as follows. The algorithm exhaustively applies an ordered set of orientation rules to the edges. Since many rules require the separation sets to be weakly minimal, line 10 makes them weakly minimal as follows. The separation sets
Figure BDA0003397770590000172
and
Figure BDA0003397770590000173
are not necessarily weakly minimal, but they can be made so by successively removing single elements, other than those identified as ancestors of
Figure BDA0003397770590000174
and
Figure BDA0003397770590000175
until the resulting set no longer separates. In particular, it is not necessary to search all subsets of the original separation set. The validity of this approach follows from the equivalence of weak minimality and weak minimality of the second type.
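The element-by-element reduction described above can be sketched as follows. The sketch assumes that elements already identified as ancestors of the separated pair are kept and only the remaining elements are candidates for removal; the `separates` oracle (in practice a CI test on the pair in question) and the function name are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def make_weakly_minimal(sep_set: Iterable[str],
                        known_ancestors: Iterable[str],
                        separates: Callable[[FrozenSet[str]], bool]
                        ) -> Tuple[FrozenSet[str], FrozenSet[str]]:
    """Return (S1, S2): S1 holds the known ancestors, S2 the remaining elements
    after greedily dropping single elements while the set still separates."""
    s1 = set(sep_set) & set(known_ancestors)
    s2 = set(sep_set) - s1
    changed = True
    while changed:
        changed = False
        for v in sorted(s2):                      # deterministic removal order
            trial = frozenset((s1 | s2) - {v})
            if separates(trial):                  # v is not needed for separation
                s2.discard(v)
                changed = True
                break
    return frozenset(s1), frozenset(s2)


if __name__ == "__main__":
    toy_separates = lambda cond: "m" in cond      # separation needs "m" in the set
    print(make_weakly_minimal({"m", "x", "anc"}, {"anc"}, toy_separates))
```

Each pass removes at most one element and needs at most one test per remaining element, so the number of tests grows polynomially with the size of the set instead of exponentially as in a search over all subsets.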
Fig. 2 is a block diagram schematically illustrating the structure of a high-recall causal discovery apparatus based on time series operation and maintenance big data according to an embodiment of the present application. As shown in Fig. 2, the apparatus includes: a rule definition module for predefining a number of relationship rules to be applied to edges and points in the complete graph; an operation and maintenance data access module for acquiring streaming operation and maintenance data to be analyzed; a complete graph generation module for initializing the streaming operation and maintenance data into a complete graph by using a graph library and a graph algorithm; an edge processing module for processing the relationships in the complete graph of the streaming operation and maintenance data according to the relationship rules; and a result output module for outputting the processed causal relationship graph, which describes the causal relationships in the streaming operation and maintenance data.
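The division into modules can be sketched as a small pipeline. This is an illustrative sketch under simplifying assumptions: each record of streaming data is taken to be a dictionary from metric name to value, the complete graph is a set of unordered metric pairs built without any particular graph library, and each relationship rule is a function from an edge set to an edge set. All names are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]
Rule = Callable[[Set[Edge]], Set[Edge]]


def build_complete_graph(records: Iterable[Dict[str, float]]) -> Set[Edge]:
    """Complete graph generation module: one edge per pair of observed metrics."""
    metrics = sorted({name for record in records for name in record})
    return set(combinations(metrics, 2))


@dataclass
class CausalDiscoveryPipeline:
    # Rule definition module: the predefined relationship rules.
    rules: List[Rule] = field(default_factory=list)

    def run(self, records: Iterable[Dict[str, float]]) -> List[Edge]:
        # `records` plays the role of the data access module's output.
        graph = build_complete_graph(records)
        for rule in self.rules:          # edge processing module
            graph = rule(graph)
        return sorted(graph)             # result output module


if __name__ == "__main__":
    stream = [{"cpu": 0.1, "latency": 3.0}, {"cpu": 0.2, "errors": 1.0}]
    drop_rule = lambda g: g - {("errors", "latency")}
    print(CausalDiscoveryPipeline([drop_rule]).run(stream))
```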
The high-recall causal discovery apparatus based on time series operation and maintenance big data comprises a processor and a memory. The rule definition module, the operation and maintenance data access module, the complete graph generation module, the edge processing module, the result output module and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor comprises a kernel, which fetches the corresponding program unit from the memory. One or more kernels may be provided, and the high-recall causal discovery method based on time series operation and maintenance big data is implemented by adjusting the kernel parameters.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
Those skilled in the art will appreciate that the architecture shown in fig. 2 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The embodiment of the application provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program, and the steps of the high-recall causal discovery method based on the time sequence operation and maintenance big data are realized.
The present application further provides a computer program product which, when executed on a data processing apparatus, is adapted to execute a program initialized with the steps of the high-recall causal discovery method based on time series operation and maintenance big data.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (11)

1. A high recall causal discovery method based on time series operation and maintenance big data is characterized in that the method is based on a causal discovery algorithm and comprises the following steps:
predefining a number of relationship rules to be applied to edges and points in the complete graph;
acquiring streaming operation and maintenance data to be analyzed;
initializing the streaming operation and maintenance data into a complete graph by adopting a graph library and a graph algorithm;
processing the relationship in the complete graph of the streaming operation and maintenance data according to the relationship rules;
outputting a processed cause and effect relationship graph, wherein the cause and effect relationship graph is used for describing cause and effect relationships in the streaming operation and maintenance data.
2. The method of claim 1, wherein processing the relationship in the complete graph of the streaming operation and maintenance data according to the relationship rules comprises:
defining the following algorithms according to the plurality of relationship rules: a complete graph processing main algorithm, a first removal algorithm, a second removal algorithm and a deleted edge processing algorithm;
processing edges in the complete graph of the streaming operation and maintenance data by adopting the first removal algorithm and the second removal algorithm to realize relationship processing;
the first removal algorithm is used for determining whether each edge between an ordered variable pair of non-adjacent variables in the complete graph is to be deleted or retained, and for determining the direction of each edge to be retained according to the causal relationship between the points connected by that edge;
the second removal algorithm is used for determining whether each edge between an ordered variable pair that satisfies a preset separation set is to be deleted or retained, and for determining the direction of each edge to be retained according to the causal relationship between the points connected by that edge;
the complete graph processing main algorithm is used for providing a main function and calling the first removal algorithm and the second removal algorithm;
and the deleted edge processing algorithm is used for performing secondary processing on the edges to be deleted determined by the first removal algorithm and the second removal algorithm.
3. The method of claim 2, wherein the complete graph processing main algorithm is configured to:
calling the first removal algorithm to perform a first traversal of the complete graph, and marking the parent-child relationships in the complete graph;
calling the first removal algorithm and the second removal algorithm on the complete graph after the first traversal, and determining the partial ancestral graph corresponding to the complete graph;
and outputting the determined partial ancestral graph as the processed causal relationship graph.
4. The method of claim 2, wherein the first removal algorithm is further configured to:
determining ordered variable pairs according to the ordered variables in the complete graph and the adjacency relation;
determining that a middle label of an edge between the ordered variable pair is a first type label;
determining, according to a first preset condition, whether the edges between the ordered variable pairs are to be deleted or retained;
and if the edges between the ordered variable pairs are determined to be deleted, calling the deleted edge processing algorithm to perform secondary processing.
5. The method of claim 2, wherein the second removal algorithm is further configured to:
determining ordered variable pairs according to the ordered variables in the complete graph and the adjacency relation;
determining that a middle label of an edge between the ordered variable pair is a second type label;
determining, according to a second preset condition, whether the edges between the ordered variable pairs are to be deleted or retained;
and if the edges between the ordered variable pairs are determined to be deleted, calling the deleted edge processing algorithm to perform secondary processing.
6. The method of claim 2, wherein the deleted edge processing algorithm is configured to:
acquiring the edges to be deleted determined by the first removal algorithm and the second removal algorithm;
if the determined edge to be deleted is a directed edge, resolving the deletion conflict;
and executing the deletion operation on the determined edge to be deleted, and making the corresponding separation set weakly minimal.
7. A high recall causal discovery apparatus based on time series operation and maintenance big data, the apparatus comprising:
a rule definition module for predefining a number of relationship rules to be applied to edges and points in the complete graph;
the operation and maintenance data access module is used for acquiring streaming operation and maintenance data to be analyzed;
the complete graph generation module is used for initializing the streaming operation and maintenance data into a complete graph by adopting a graph library and a graph algorithm;
the edge processing module is used for processing the relationship in the complete graph of the streaming operation and maintenance data according to the relationship rules; and
the result output module is used for outputting the processed causal relationship graph, and the causal relationship graph is used for describing causal relationships in the streaming operation and maintenance data.
8. The apparatus of claim 7, wherein the edge processing module comprises:
a complete graph processing main algorithm submodule, a first removal algorithm submodule, a second removal algorithm submodule and a deleted edge processing algorithm submodule;
the first removal algorithm submodule is used for determining, through a first removal algorithm, whether each edge between an ordered variable pair of non-adjacent variables in the complete graph is to be deleted or retained, and for determining the direction of each edge to be retained according to the causal relationship between the points connected by that edge;
the second removal algorithm submodule is used for determining, through a second removal algorithm, whether each edge between an ordered variable pair that satisfies a preset separation set is to be deleted or retained, and for determining the direction of each edge to be retained according to the causal relationship between the points connected by that edge;
the complete graph processing main algorithm submodule is used for providing a main function and calling the first removal algorithm and the second removal algorithm;
the deleted edge processing algorithm submodule is used for performing secondary processing on the edges to be deleted determined by the first removal algorithm and the second removal algorithm.
9. A high recall cause and effect discovery apparatus based on time series operation and maintenance big data, comprising a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor when executing the computer program implements the high recall cause and effect discovery method based on time series operation and maintenance big data according to any one of claims 1 to 6.
10. A computer-readable storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the time-series operation and maintenance big data-based high-recall causal discovery method of any of claims 1 to 6.
11. A computer program product comprising a computer program which, when executed by a processor, implements a time series operation and maintenance big data based high recall causal discovery method according to any of claims 1 to 6.
CN202111486699.3A 2021-12-07 2021-12-07 High-recall cause and effect discovery method, device and equipment based on time sequence operation and maintenance big data Pending CN114153881A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111486699.3A CN114153881A (en) 2021-12-07 2021-12-07 High-recall cause and effect discovery method, device and equipment based on time sequence operation and maintenance big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111486699.3A CN114153881A (en) 2021-12-07 2021-12-07 High-recall cause and effect discovery method, device and equipment based on time sequence operation and maintenance big data

Publications (1)

Publication Number Publication Date
CN114153881A true CN114153881A (en) 2022-03-08

Family

ID=80453645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111486699.3A Pending CN114153881A (en) 2021-12-07 2021-12-07 High-recall cause and effect discovery method, device and equipment based on time sequence operation and maintenance big data

Country Status (1)

Country Link
CN (1) CN114153881A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115051870A (en) * 2022-06-30 2022-09-13 浙江网安信创电子技术有限公司 Method for detecting unknown network attack based on causal discovery
CN115051870B (en) * 2022-06-30 2024-02-06 浙江网安信创电子技术有限公司 Method for detecting unknown network attack based on causal discovery

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination