CN117938173A - Alarm data compression method and device - Google Patents
- Publication number
- CN117938173A (application number CN202410027474.9A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- word segmentation
- node
- target
- history
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The application discloses an alarm data compression method and device. The method comprises: performing word segmentation on current alarm data received in real time to obtain a target word segmentation sequence; determining the longest common subsequence between the target word segmentation sequence and a historical word segmentation sequence, where the historical word segmentation sequence is obtained by performing word segmentation on historical alarm data; calculating the similarity between the current alarm data and the historical alarm data based on the longest common subsequence; and, if the similarity exceeds a preset threshold, compressing the current alarm data.
Description
Technical Field
The present application relates to the field of data processing, and in particular, to a method and apparatus for compressing alarm data.
Background
In some data processing scenarios, alarm data must be compressed: classifying and merging a large volume of alarm data filters out the useful data and makes data acquisition more efficient. Compression rules are usually defined manually, and the alarm data that matches a rule is classified and merged offline before being sent to the user side for consumption.
However, defining compression rules manually demands strong expertise and extensive experience from staff, and manual compression is time-consuming, which increases labor cost and lowers compression efficiency.
Disclosure of Invention
The embodiments of the application aim to provide an alarm data compression method and device that address the increased labor cost and reduced compression efficiency caused by manually defined compression rules.
In order to achieve the above object, the embodiment of the present application adopts the following technical scheme:
In a first aspect, an embodiment of the present application provides an alarm data compression method, including:
performing word segmentation on current alarm data received in real time to obtain a target word segmentation sequence;
determining the longest common subsequence between the target word segmentation sequence and a historical word segmentation sequence, where the historical word segmentation sequence is obtained by performing word segmentation on historical alarm data;
calculating the similarity between the current alarm data and the historical alarm data based on the longest common subsequence;
and, if the similarity exceeds a preset threshold, compressing the current alarm data.
In a second aspect, an embodiment of the present application provides an alarm data compression apparatus, including:
a processing unit, configured to perform word segmentation on current alarm data received in real time to obtain a target word segmentation sequence;
a determining unit, configured to determine the longest common subsequence between the target word segmentation sequence and a historical word segmentation sequence, where the historical word segmentation sequence is obtained by performing word segmentation on historical alarm data;
a calculating unit, configured to calculate the similarity between the current alarm data and the historical alarm data based on the longest common subsequence;
and a compression unit, configured to compress the current alarm data if the similarity exceeds a preset threshold.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method of the first aspect.
At least one of the above technical schemes adopted by the embodiments of the application can achieve the following beneficial effects. Taking historical alarm data as a reference, word segmentation is performed on the current alarm data received in real time to obtain a target word segmentation sequence, and the current alarm data is compressed based on the longest common subsequence between the target word segmentation sequence and the historical word segmentation sequence. The longer the longest common subsequence, the more similar the content of the current alarm data is to that of the historical alarm data. The similarity between the two is therefore calculated from the longest common subsequence, and the current alarm data is compressed if the similarity exceeds a preset threshold. Because every newly generated alarm is handled this way as it arrives, alarm data is compressed continuously and online, and compression rules no longer need to be defined manually offline, which saves compression time, reduces labor cost, and improves compression efficiency.
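The overall flow described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the regex tokenizer stands in for a real word segmentation tool, and the similarity formula (longest common subsequence length divided by the length of the longer sequence), the threshold value 0.6, and all function names are assumptions, since this part of the text does not define them.

```python
import re

def tokenize(text):
    # Stand-in tokenizer (assumption): lowercase and split on non-word characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def lcs_length(a, b):
    # Dynamic-programming length of the longest common subsequence of two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def similarity(target, history):
    # Assumed formula: LCS length relative to the longer of the two sequences.
    if not target or not history:
        return 0.0
    return lcs_length(target, history) / max(len(target), len(history))

def should_compress(current_alarm, historical_alarm, threshold=0.6):
    # Compress the current alarm when it is similar enough to the historical one.
    target = tokenize(current_alarm)
    history = tokenize(historical_alarm)
    return similarity(target, history) > threshold

print(should_compress("Disk usage above 90 percent on host a",
                      "Disk usage above 95 percent on host b"))
```

In this sketch the two alarms share six of eight tokens in order, so their similarity is 0.75 and the current alarm would be compressed under the assumed threshold.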
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of an alarm data compression method according to an embodiment of the present application;
FIG. 2 (a) is a first schematic diagram of a prefix tree according to an embodiment of the present application;
FIG. 2 (b) is a second schematic diagram of a prefix tree according to an embodiment of the present application;
FIG. 2 (c) is a third schematic diagram of a prefix tree according to an embodiment of the present application;
FIG. 2 (d) is a fourth schematic diagram of a prefix tree according to an embodiment of the present application;
FIG. 2 (e) is a fifth schematic diagram of a prefix tree according to an embodiment of the present application;
FIG. 2 (f) is a sixth schematic diagram of a prefix tree according to an embodiment of the present application;
FIG. 3 is a flow chart illustrating another method for compressing alarm data according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an alarm data compression device according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first", "second", and the like in the description and in the claims are used to distinguish between similar objects and not necessarily to describe a particular sequential or chronological order. It is to be understood that data so described may be interchanged where appropriate, so that embodiments of the application can be implemented in an order other than those illustrated or described herein. Furthermore, in the present specification and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects before and after it.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
The alarm data compression method provided by the embodiment of the application is suitable for various scenarios in which alarms are raised. An application scenario of the method is explained first. It should be understood that the following scenario is only an exemplary illustration and should not be construed as limiting the application scenarios of the method.
One practical application scenario is alarm data that a monitoring system or device generates about the state of a system, application, or network. Alarm data generally indicates potential problems, anomalies, or security events, and an administrator or staff member must take corresponding measures to handle them. For example, the alarm data is classified and merged and then sent to the user for consumption, which improves the timeliness of the alarms. Compression rules are usually defined manually, and the data that matches a rule is classified and merged offline before being sent to the user for consumption. However, defining compression rules manually demands strong expertise and extensive experience from staff and takes a lot of time, which increases labor cost and lowers compression efficiency.
In view of the above, the embodiment of the application provides an alarm data compression method, which can automatically compress alarm data received in real time without manually defining a compression rule for compression, thereby saving compression time, reducing labor cost and improving compression efficiency.
Specifically, please refer to fig. 1, which is a flow chart of an alarm data compression method according to an embodiment of the present application. As shown in fig. 1, a method for compressing alarm data according to an embodiment of the present application may include the following steps:
S102, word segmentation processing is carried out on the current alarm data received in real time, and a target word segmentation sequence is obtained.
A word segmentation sequence is a sequence whose elements are words arranged in a certain order; it can be obtained by performing word segmentation on alarm data with a word segmentation tool. Common word segmentation tools for Chinese include jieba, HanLP (Han Language Processing), THULAC (THU Lexical Analyzer for Chinese), and LTP (Language Technology Platform); tools for English include NLTK (Natural Language Toolkit) and spaCy. A suitable tool can be selected as required, and the embodiment of the application places no particular limit on the choice. Before the alarm data is segmented, a dictionary library must be configured. The default dictionary library shipped with the tool can be used, or a custom dictionary library can be built as required: specific domain terms are added to the dictionary library, a certain amount of sample alarm data is used as training material, and the word segmentation tool counts word frequencies to generate the custom dictionary library. The alarm data is then segmented by the tool based on the configured dictionary library.
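To illustrate the effect of a custom dictionary library, the following Python sketch uses a regular expression as a stand-in for a real word segmentation tool. It is an assumption made for illustration only: a production system would call one of the tools named above, and the function name `build_segmenter` and the sample terms are hypothetical.

```python
import re

def build_segmenter(custom_terms=()):
    # Longer dictionary terms come first so the regex prefers the longest entry.
    terms = sorted(custom_terms, key=len, reverse=True)
    alternatives = [re.escape(term) for term in terms] + [r"\w+"]
    pattern = re.compile("|".join(alternatives))

    def segment(text):
        # Return dictionary terms as single tokens, other words one by one.
        return pattern.findall(text)

    return segment

default_segment = build_segmenter()
custom_segment = build_segmenter(["CPU usage", "order-service"])

alarm = "CPU usage of order-service exceeds 90%"
print(default_segment(alarm))
print(custom_segment(alarm))
```

With the custom terms, "CPU usage" and "order-service" are each kept as one token instead of being split, which is the kind of domain-specific behavior a custom dictionary library is meant to provide.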
It should be understood that, in the embodiment of the present application, the alarm data is composed of field values of a plurality of fields, for example, fields of an alarm name, an alarm object, an alarm level, an alarm time, an alarm content, and the like.
As an alternative embodiment, in the case that the current alarm data includes field values of a plurality of fields, S102 may include the following steps:
selecting, from the plurality of fields, a target field corresponding to the service scenario to which the current alarm data belongs;
and performing word segmentation on the field value of the target field to obtain the target word segmentation sequence.
A service scenario relates to a specific service, for example an insurance sales service or an insurance claims service. Alarm data is generated by monitoring the system, application, network state, and so on of each service. The target field is determined according to the service scenario to which the current alarm data belongs, and word segmentation is performed on the field values of the target field to obtain the target word segmentation sequence. For example, if the current alarm data belongs to the insurance sales service scenario, differences between alarm sources and alarm objects can be ignored in that scenario, so the alarm name and the alarm content are selected as target fields, and word segmentation is performed on their field values to obtain the target word segmentation sequence.
According to the embodiment of the application, the target field is selected as the compression basis according to the service scene to which the received current alarm data belongs, and the target word segmentation sequence is obtained by carrying out word segmentation processing on the field value of the target field of the current alarm data, so that the accuracy of subsequent similarity calculation can be improved, and the accuracy of compressing the current alarm data is further improved.
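The field selection step can be sketched as a lookup from service scenario to target fields. The mapping, the field names, and the regex tokenizer below are hypothetical choices made for illustration; only the insurance sales example (alarm name plus alarm content) follows the text.

```python
import re

# Hypothetical mapping from service scenario to the fields used as the compression basis.
TARGET_FIELDS = {
    "insurance_sales": ["alarm_name", "alarm_content"],
    "insurance_claims": ["alarm_name", "alarm_object", "alarm_content"],
}

def target_token_sequence(alarm, scenario):
    # Concatenate the tokens of the target fields, in the configured field order.
    tokens = []
    for field in TARGET_FIELDS[scenario]:
        # Stand-in tokenizer (assumption): runs of word characters.
        tokens.extend(re.findall(r"\w+", str(alarm.get(field, ""))))
    return tokens

alarm = {
    "alarm_name": "CPU high",
    "alarm_object": "host-01",
    "alarm_level": "critical",
    "alarm_time": "2024-01-08 10:00:00",
    "alarm_content": "CPU usage exceeds 90",
}
print(target_token_sequence(alarm, "insurance_sales"))
```

Fields that are not target fields for the scenario, such as the alarm time, never reach the similarity calculation, so two alarms that differ only in those fields produce the same target word segmentation sequence.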
S104, determining the longest common subsequence between the target word segmentation sequence and the historical word segmentation sequence.
The historical word segmentation sequence is obtained by performing word segmentation on historical alarm data, that is, alarm data that was previously sent to the user side for consumption. It should be understood that the historical alarm data belongs to the same service scenario as the current alarm data, so the same target fields are selected during word segmentation, and the historical word segmentation sequence is obtained by performing word segmentation on the field values of the target fields of the historical alarm data.
The longest common subsequence is the common subsequence with the greatest sequence length, where the length of a sequence is the number of elements in it. A common subsequence is a subsequence shared by two or more sequences. A subsequence is a new sequence obtained by removing some elements of the original sequence without changing the relative positions of the remaining elements; a sequence of length n has 2^n subsequences in total, including the empty sequence and the original sequence itself. Example 1 illustrates the longest common subsequence. Example 1: sequence 1 is {b, d, e}, and its subsequences are the empty sequence, {b}, {d}, {e}, {b, d}, {b, e}, {d, e}, and {b, d, e}; sequence 2 is {a, b, c, d}, and its subsequences are the empty sequence, {a}, {b}, {c}, {d}, {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}, {a, b, c}, {a, b, d}, {a, c, d}, {b, c, d}, and {a, b, c, d}. The subsequences shared by sequence 1 and sequence 2, that is, their common subsequences, are the empty sequence, {b}, {d}, and {b, d}. Of these, {b, d} has length 2 and is the longest, so it is the longest common subsequence.
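Example 1 can be reproduced with a short Python sketch that enumerates every subsequence. The function name `subsequences` is chosen for illustration; the approach is only practical for very short sequences.

```python
from itertools import combinations

def subsequences(seq):
    # Every way of keeping a subset of positions in their original order.
    # For a sequence of n distinct elements this yields 2**n subsequences,
    # including the empty one and the sequence itself.
    return {
        tuple(seq[i] for i in positions)
        for size in range(len(seq) + 1)
        for positions in combinations(range(len(seq)), size)
    }

s1 = ["b", "d", "e"]
s2 = ["a", "b", "c", "d"]

common = subsequences(s1) & subsequences(s2)
longest = max(common, key=len)

print(len(subsequences(s1)), len(subsequences(s2)))
print(sorted(common, key=len))
print(longest)
```

The set intersection gives the common subsequences, and the longest of them is the pair (b, d), matching Example 1.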
In the embodiment of the application, the longest common subsequence between the target word segmentation sequence and the historical word segmentation sequence can be determined by complete comparison: every subsequence of the target word segmentation sequence is compared with every subsequence of the historical word segmentation sequence, and the longest subsequence found in both is taken as the longest common subsequence. This approach needs no additional storage, but its computational complexity is exponential and its efficiency is low; comparing the subsequences of the two sequences in Example 1, for instance, requires 2^3 × 2^4 = 128 comparisons. For this reason, the embodiment of the application can also determine the longest common subsequence between the target word segmentation sequence and the historical word segmentation sequence by dynamic programming.
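As a rough cost comparison for a target sequence of length m and a historical sequence of length n, the counts below contrast the two approaches. They are standard counting results rather than figures taken from the source, and the dynamic programming count assumes a table with one extra row and column for the empty prefix.

```latex
\underbrace{2^{m}\cdot 2^{n}}_{\text{subsequence pairs, complete comparison}}
\quad\text{versus}\quad
\underbrace{(m+1)(n+1)}_{\text{table cells, dynamic programming}},
\qquad
m=3,\ n=4:\quad 2^{3}\cdot 2^{4}=128 \ \text{versus}\ 4\cdot 5=20 .
```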
Specifically, as an alternative embodiment, when there is one historical word segmentation sequence, S104 may include the following steps: comparing the words in the target word segmentation sequence with the words in the historical word segmentation sequence based on a dynamic programming algorithm, determining the value of each element of a two-dimensional array based on the comparison results, and taking the maximum element value in the two-dimensional array as the length of the longest common subsequence, where the element in row i, column j of the array represents the length of the longest common subsequence between the first i words of the target word segmentation sequence and the first j words of the historical word segmentation sequence, and i and j are positive integers;
taking the element in the last row and last column of the two-dimensional array as the backtracking starting point, backtracking through the comparison results between the words of the target word segmentation sequence and the words of the historical word segmentation sequence based on the backtracking starting point and the length of the longest common subsequence, and determining the common words of the two sequences based on the backtracked comparison results;
and determining, based on the common words, the longest common subsequence between the target word segmentation sequence and the historical word segmentation sequence.
The basic idea of dynamic programming is to decompose the problem to be solved into several sub-problems, solve the sub-problems first, and obtain the solution of the original problem from the solutions of the sub-problems. Based on this idea, the embodiment of the application iteratively compares the words in the target word segmentation sequence with the words in the historical word segmentation sequence, determines the length of the longest common subsequence between the first i words of the target word segmentation sequence and the first j words of the historical word segmentation sequence, and records it in row i, column j of a pre-created two-dimensional array, where i and j are positive integers. If the i-th word of the target word segmentation sequence is the same as the j-th word of the historical word segmentation sequence, the element in row i, column j equals the element in row i-1, column j-1 plus 1. If the two words differ, the element in row i, column j equals the larger of the element in row i, column j-1 and the element in row i-1, column j. The maximum element value in the two-dimensional array is the length of the longest common subsequence between the target word segmentation sequence and the historical word segmentation sequence. Example 2 illustrates this.
Example 2: continuing Example 1, let sequence 1 {b, d, e} be the target word segmentation sequence, denoted S1, where S1[i] is its i-th element; for example, S1[2] = d. Let sequence 2 {a, b, c, d} be the historical word segmentation sequence, denoted S2, where S2[j] is its j-th element; for example, S2[3] = c. A two-dimensional array is created in advance from the two sequences. Its elements are written dp[i][j], the value in row i, column j; for example, dp[1][2] is the value in row 1, column 2. The array is first initialized as shown in Table (1), and its element values are then determined row by row from top to bottom. If S1[i] and S2[j] are the same, dp[i][j] = dp[i-1][j-1] + 1; if they differ, dp[i][j] = max{dp[i][j-1], dp[i-1][j]}. For example, S1[1] = b and S2[1] = a differ, so dp[1][1] = max{dp[1][0], dp[0][1]} = max{0, 0} = 0; S1[1] = b and S2[2] = b are the same, so dp[1][2] = dp[0][1] + 1 = 0 + 1 = 1. The values of all elements are determined in the same way, giving Table (2). As Table (2) shows, the maximum element value in the array is 2, so the length of the longest common subsequence between the target word segmentation sequence S1 and the historical word segmentation sequence S2 is 2.
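Table (1): the initialized two-dimensional array dp (rows follow S1, columns follow S2)

            j=0    j=1 (a)    j=2 (b)    j=3 (c)    j=4 (d)
i=0          0        0          0          0          0
i=1 (b)      0
i=2 (d)      0
i=3 (e)      0

Table (2): the two-dimensional array dp after all element values are determined

            j=0    j=1 (a)    j=2 (b)    j=3 (c)    j=4 (d)
i=0          0        0          0          0          0
i=1 (b)      0        0          1          1          1
i=2 (d)      0        0          1          1          2
i=3 (e)      0        0          1          1          2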
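The table-filling rule of Example 2 can be sketched in Python as follows. The function name `lcs_table` is chosen for illustration; row 0 and column 0 stand for the empty prefix and stay 0, which is the usual convention and is assumed here.

```python
def lcs_table(target, history):
    rows, cols = len(target) + 1, len(history) + 1
    # Row 0 and column 0 represent an empty prefix and remain 0.
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if target[i - 1] == history[j - 1]:
                # Same word: extend the LCS of the two shorter prefixes.
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                # Different words: keep the better of the two neighbors.
                dp[i][j] = max(dp[i][j - 1], dp[i - 1][j])
    return dp

table = lcs_table(["b", "d", "e"], ["a", "b", "c", "d"])
for row in table:
    print(row)
print("LCS length:", table[-1][-1])
```

Because the values never decrease along a row or down a column, the maximum of the array always sits in the last row and last column, which is why that element also serves as the backtracking starting point.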
The embodiment of the application takes the element in the last row and last column of the two-dimensional array as the backtracking starting point. Based on this starting point and the length of the longest common subsequence, it backtracks through the comparison results between the words of the target word segmentation sequence and the words of the historical word segmentation sequence, determines the common words of the two sequences from the backtracked comparison results, and determines the longest common subsequence from the common words. Suppose the current backtracking starting point is the element in row i, column j. If the i-th word of the target word segmentation sequence is the same as the j-th word of the historical word segmentation sequence, that word is marked as a common word and the element in row i-1, column j-1 becomes the new backtracking starting point. If the two words differ, whichever of the element in row i, column j-1 and the element in row i-1, column j has the larger value becomes the new backtracking starting point. If the two words differ and those two elements are equal, either of them may be chosen as the new backtracking starting point, provided the same choice is made every time this situation occurs; for example, if the element in row i, column j-1 is chosen once, it is chosen in every later tie as well.
The following example illustrates the procedure.
For example, referring to Example 2 and Table (2), dp[3][4] is the first backtracking starting point. S1[3] = e and S2[4] = d differ, and dp[2][4] = 2 is greater than dp[3][3] = 1, so dp[2][4] becomes the new backtracking starting point. At dp[2][4], S1[2] = d and S2[4] = d are the same, so d is marked as a common word and dp[1][3] becomes the new backtracking starting point. At dp[1][3], S1[1] = b and S2[3] = c differ, and dp[1][2] = 1 is greater than dp[0][3] = 0, so dp[1][2] becomes the new backtracking starting point. At dp[1][2], S1[1] = b and S2[2] = b are the same, so b is marked as a common word. The number of common words now equals the maximum value in the two-dimensional array, so backtracking stops. Because backtracking runs from the end of the sequences toward the start, the common words are arranged in the reverse of the order in which they were marked, which gives {b, d}, the longest common subsequence between the target word segmentation sequence S1 and the historical word segmentation sequence S2.
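The backtracking walk can be sketched in Python as below. The function names are chosen for illustration, and ties between the two neighbors are resolved toward column j-1 every time, which is one of the consistent choices the description allows.

```python
def lcs_table(target, history):
    # Same table-filling rule as in Example 2.
    dp = [[0] * (len(history) + 1) for _ in range(len(target) + 1)]
    for i in range(1, len(target) + 1):
        for j in range(1, len(history) + 1):
            if target[i - 1] == history[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i][j - 1], dp[i - 1][j])
    return dp

def backtrack(target, history, dp):
    i, j = len(target), len(history)   # start at the last row, last column
    shared = []                        # common words, in marking order
    while i > 0 and j > 0 and len(shared) < dp[-1][-1]:
        if target[i - 1] == history[j - 1]:
            shared.append(target[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] > dp[i][j - 1]:
            i -= 1                     # the element above is larger
        else:
            j -= 1                     # the element to the left is larger, or a tie
    shared.reverse()                   # marking order runs from back to front
    return shared

s1 = ["b", "d", "e"]
s2 = ["a", "b", "c", "d"]
print(backtrack(s1, s2, lcs_table(s1, s2)))
```

For Example 2 the walk visits dp[3][4], dp[2][4], dp[1][3], and dp[1][2], marks d and then b, and returns the sequence b, d after reversal.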
According to the embodiment of the application, the longest common subsequence of the target word segmentation sequence and the historical word segmentation sequence is determined by dynamic programming, which, compared with complete comparison, lowers the computational complexity, reduces the amount of calculation, and improves computational efficiency.
It should be understood that if there is a large amount of historical alarm data covering several types, there are correspondingly several historical word segmentation sequences. In that case it is necessary to determine which type of historical alarm data the received current alarm data belongs to, and to compress the current alarm data together with the historical alarm data of that type. For this purpose, the embodiment of the application uses a prefix tree to determine the longest common subsequence between the target word segmentation sequence and the historical word segmentation sequences.
Specifically, as an optional implementation, when there are a plurality of historical word segmentation sequences obtained by performing word segmentation on a plurality of pieces of historical alarm data, S104 may include the following steps:
S140, obtaining a plurality of historical longest common subsequences based on the plurality of historical word segmentation sequences, where each historical longest common subsequence is the longest common subsequence among the historical word segmentation sequences of the same type.
S142, generating a prefix tree based on the plurality of historical longest common subsequences, where the prefix tree includes a plurality of paths in one-to-one correspondence with the historical longest common subsequences, and the nodes on each path represent the words of the corresponding historical longest common subsequence.
S144, matching the target word segmentation sequence against the nodes on the paths to obtain the matched target nodes.
S146, determining the longest common subsequence based on the words represented by the target nodes.
When the number of the history word segmentation sequences of the same type is multiple, the longest public sequence between any two history word segmentation sequences can be determined, and the longest public sequence between the longest public sequence and another history word segmentation sequence is determined, so that the longest public sequences among the history word segmentation sequences of the same type are sequentially obtained. The process of S140 described above will be described here as an example.
For example, suppose there are a plurality of pieces of historical alarm data with corresponding historical word segmentation sequences and types, as follows. Word segmentation is performed on historical alarm data A to obtain historical word segmentation sequence A', which is marked as type 1; the historical longest common sequence corresponding to type 1 is A' itself. Word segmentation is performed on historical alarm data B and C to obtain historical word segmentation sequences B' and C', which are marked as type 2; the historical longest common sequence corresponding to type 2 is the longest common sequence between B' and C', calculated in a manner similar to Example 3, which is not repeated here. Word segmentation is performed on historical alarm data D, E and F to obtain historical word segmentation sequences D', E' and F', which are marked as type 3; the historical longest common sequence corresponding to type 3 is the longest common sequence among D', E' and F'. Specifically, the longest common sequence between any two of these sequences may be calculated first, and then the longest common sequence between that result and the remaining sequence is calculated. For example, the longest common sequence between D' and F' is calculated first to obtain m, and then the longest common sequence between m and E' is calculated.
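The pairwise procedure of S140 can be sketched by folding a two-sequence routine over all sequences of one type. The names `lcs` and `historical_longest_common_sequence` and the sample data are assumptions of this sketch. Note that folding pairwise always yields a sequence common to all inputs, but it is not guaranteed to be the longest one common to all of them.

```python
from functools import reduce


def lcs(s1, s2):
    """Longest common sequence of two word lists (dynamic programming)."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def historical_longest_common_sequence(sequences):
    """Fold lcs over all sequences of one type.

    With a single sequence (type 1 in the text) the sequence itself is
    returned; an empty list of sequences raises TypeError.
    """
    return reduce(lcs, sequences)


# Hypothetical sequences standing in for D', F' and E' of type 3.
d_seq = ["a", "b", "c", "d"]
f_seq = ["a", "c", "d", "e"]
e_seq = ["x", "a", "d"]
m_seq = lcs(d_seq, f_seq)            # first pair, "m" in the text
print(m_seq, lcs(m_seq, e_seq))
```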
A prefix tree (trie), also called a dictionary tree, is a tree data structure dedicated to processing character strings. In a prefix tree, each node represents one character of a string, and the path from the root node to any node represents a prefix of a string. For example, for a string composed of f, h, i, k, the path root node→f represents a prefix, the path root node→f→h also represents a prefix, and so on. The path from the root node to a leaf node represents a complete string, and the link between two adjacent nodes is called a connecting edge. In the embodiment of the application, each historical longest common sequence is regarded as a string, and the words in it are regarded as the characters of that string. The path from the root node to each leaf node corresponds to one historical longest common sequence; the words represented by the nodes on a path are the words of the historical longest common sequence corresponding to that path; the word represented by the leaf node of a path is the last word of that sequence; and the words represented by the nodes directly connected to the root node are the first words of the historical longest common sequences. The target word segmentation sequence is matched with the nodes on the plurality of paths of the prefix tree to obtain matched target nodes, and the longest common sequence is determined based on the words represented by the matched target nodes.
The embodiment of the application adopts a prefix tree mode to determine the longest public sequence between the target word segmentation sequence and the history word segmentation sequence, can be quickly matched with the history alarm data similar to the current alarm data, and further improves the calculation efficiency and the compression efficiency.
Next, a detailed description will be given of a process of generating a prefix tree, and a process of matching a target word segmentation sequence with nodes on multiple paths of the prefix tree to obtain a matched target node.
Specifically, as an alternative embodiment, the step S142 may include the following steps: creating a root node and creating a corresponding node for words in a plurality of historic longest public sequences;
Establishing a connection edge between nodes corresponding to the words in the history longest public sequence based on the arrangement sequence of the words in the history longest public sequence aiming at each history longest public sequence, and obtaining a path corresponding to the history longest public sequence;
And creating a connecting edge between the first node and the root node of each path to obtain the prefix tree.
Wherein the root node is an empty node and does not represent any word.
In the embodiment of the application, after a connecting edge is created between the first node of each path and the root node, if several paths represent the same prefix from the root node to some node, the connecting edges of those paths from the root node to that node are merged into the same connecting edges, so as to obtain the prefix tree. The entire prefix tree generation process is described here with Example 3.
Example 3: there are 4 historical longest common sequences with the following elements: Y1{a, b, c}, Y2{a, c, d}, Y3{a, d}, Y4{e, f}. First, an empty node is created as the root node, and corresponding nodes are created for the words in the historical longest common sequences. For each historical longest common sequence, connecting edges are created between the nodes corresponding to its words according to the order of the words, giving the path corresponding to that sequence, and a connecting edge is created between the first node of each path and the root node, as shown in FIG. 2 (a). For instance, the nodes corresponding to the words in the historical longest common sequence Y1 are node a, node b and node c, node c is a leaf node, and the path corresponding to Y1 is root node→node a→node b→node c. For the historical longest common sequences Y1, Y2 and Y3, the prefix represented from the root node to node a is the same, so the connecting edges of their paths from the root node to node a are merged into one connecting edge, giving the final prefix tree shown in FIG. 2 (b).
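The generation of the prefix tree in Example 3 can be sketched with nested dictionaries, where each key is a node (a word), each nesting level is a connecting edge, and the outermost dictionary plays the role of the empty root node. The representation and the names `build_prefix_tree` and `root_to_leaf_paths` are assumptions of this sketch; inserting word by word merges shared prefixes directly, instead of first creating separate paths and then merging them as in FIG. 2 (a) and FIG. 2 (b).

```python
def build_prefix_tree(sequences):
    """Build a prefix tree from historical longest common sequences."""
    root = {}                            # empty root node, represents no word
    for seq in sequences:
        node = root
        for word in seq:
            # Reuse the child if the prefix already exists, else create it.
            node = node.setdefault(word, {})
    return root


def root_to_leaf_paths(tree, prefix=()):
    """List the word sequence of every root-to-leaf path."""
    if not tree:                         # no children: prefix ends at a leaf
        return [list(prefix)]
    paths = []
    for word, subtree in tree.items():
        paths.extend(root_to_leaf_paths(subtree, prefix + (word,)))
    return paths


# Y1..Y4 from Example 3.
histories = [["a", "b", "c"], ["a", "c", "d"], ["a", "d"], ["e", "f"]]
tree = build_prefix_tree(histories)
print(tree)
print(root_to_leaf_paths(tree))
```

One limitation of this simplified representation is that a sequence that is a strict prefix of another sequence does not end at a leaf; a real implementation would likely need an explicit end-of-sequence marker.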
According to the embodiment of the application, the prefix tree is generated based on the historical longest public sequence corresponding to each type of historical alarm data, so that the subsequent matching of the target word segmentation sequence and nodes on a plurality of paths of the prefix tree based on the created prefix tree is facilitated, the matched target node is obtained, the longest public sequence between the target word segmentation sequence and the historical word segmentation sequence is determined, the historical alarm data similar to the current alarm data can be quickly matched, and the calculation efficiency and the compression efficiency are further improved.
Specifically, as an alternative embodiment, the step S144 may include the following steps: traversing words in the target word segmentation sequence;
matching the traversed current word with nodes on candidate paths corresponding to the current word to obtain target nodes matched with the current word, wherein the candidate paths corresponding to the 1 st word in the target word segmentation sequence comprise a plurality of paths;
Determining a path of a target node matched with the current word as a candidate path corresponding to the next word of the current word;
and continuing traversing until the target word segmentation sequence is traversed or a target node matched with the current word is a leaf node.
It should be understood that traversing the words in the target word segmentation sequence starts from the first word in the target word segmentation sequence and proceeds traversing according to the arrangement sequence of the words in the target word segmentation sequence, wherein the candidate paths corresponding to the 1 st word in the target word segmentation sequence are paths corresponding to the longest common sequences of histories.
The embodiment of the application matches the traversed current word with the nodes on the candidate paths corresponding to the current word until the target word segmentation sequence has been traversed. This is illustrated in Examples 4 and 5. Example 4: with the prefix tree shown in FIG. 2 (b) of Example 3, assume that the target word segmentation sequence is {a, b}. First, the first word traversed from the target word segmentation sequence is a. Its candidate paths are null node→node a→node b→node c, null node→node a→node c→node d, null node→node a→node d, and null node→node e→node f, and node a, which matches the first word, is marked as a target node. Then the second word traversed is b. Its candidate paths are node a→node b→node c, node a→node c→node d, and node a→node d, 3 candidate paths in total, and node b, which matches it, is marked as a target node. At this point the words in the target word segmentation sequence have all been traversed.
Example 5: with the prefix tree shown in FIG. 2 (b) of Example 3, assume that the target word segmentation sequence is {a, c, g, j}. First, the first word traversed from the target word segmentation sequence is a. Its candidate paths are null node→node a→node b→node c, null node→node a→node c→node d, null node→node a→node d, and null node→node e→node f, and node a, which matches the first word, is marked as a target node. Then the second word traversed is c. Its candidate paths are node a→node b→node c, node a→node c→node d, and node a→node d, 3 candidate paths in total, and node c, which matches it, is marked as a target node. Then the third word traversed is g. Its candidate path is node c→node d; word g does not match node d, and node d is a leaf node, so matching of word g stops. Then the fourth word j is traversed. Its candidate path is node c→node d; word j does not match node d, and node d is a leaf node, so matching of word j stops. At this point the words in the target word segmentation sequence have all been traversed.
The embodiment of the application matches the traversed current word with the nodes on the candidate paths corresponding to the current word until the target node matched with the current word is a leaf node. This is illustrated in Example 6. Example 6: with the prefix tree shown in FIG. 2 (b) of Example 3, assume that the target word segmentation sequence is {a, d, e, g}. First, the first word traversed from the target word segmentation sequence is a. Its candidate paths are null node→node a→node b→node c, null node→node a→node c→node d, null node→node a→node d, and null node→node e→node f, and node a, which matches the first word, is marked as a target node. Then the second word traversed is d. Its candidate paths are node a→node b→node c, node a→node c→node d, and node a→node d, 3 candidate paths in total, and node d, which matches it, is marked as a target node. The matched target node is a leaf node, so the traversal ends.
When the embodiment of the application matches the traversed current word with the nodes on the candidate paths corresponding to the current word, if there are multiple candidate paths and none of the candidate nodes matches the current word of the target word segmentation sequence, matching continues along the candidate path corresponding to the first branch, until the target word segmentation sequence has been traversed or the target node matched with the current word is a leaf node. This is illustrated in Example 7. Example 7: with the prefix tree shown in FIG. 2 (b) of Example 3, assume that the target word segmentation sequence is {a, e, f}. First, the first word traversed from the target word segmentation sequence is a. Its candidate paths are null node→node a→node b→node c, null node→node a→node c→node d, null node→node a→node d, and null node→node e→node f, and node a, which matches the first word, is marked as a target node. Then the second word e is traversed. Its candidate paths are node a→node b→node c, node a→node c→node d, and node a→node d, 3 candidate paths in total. None of node b, node c and node d matches word e, so the path corresponding to the first branch, node a→node b→node c, is selected as the candidate path and matching continues until leaf node c is reached, without finding a node that matches word e. Then the third word f is traversed; no node matching word f is found by the time leaf node c is reached. At this point the words in the target word segmentation sequence have all been traversed.
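One possible reading of the matching procedure of S144 that reproduces Examples 4 to 7 is sketched below. It is an interpretation, not the implementation of the embodiment: the nested-dictionary tree, the function names, and the choice to treat the "first branch" as the first child that was inserted are all assumptions of this sketch.

```python
def build_prefix_tree(sequences):
    """Nested-dict prefix tree; the outer dict is the empty root node."""
    root = {}
    for seq in sequences:
        node = root
        for word in seq:
            node = node.setdefault(word, {})
    return root


def match_target_nodes(tree, target):
    """Return the words of the target nodes matched in the prefix tree."""
    node = tree                      # node whose children are the candidates
    matched = []
    for word in target:
        probe = node
        found = False
        while probe:                 # stop once a leaf has been passed
            if word in probe:
                found = True
                break
            # No candidate matches: continue along the first branch.
            probe = next(iter(probe.values()))
        if found:
            matched.append(word)
            node = probe[word]
            if not node:             # matched target node is a leaf node
                break
    return matched


tree = build_prefix_tree([["a", "b", "c"], ["a", "c", "d"], ["a", "d"], ["e", "f"]])
for target in (["a", "b"], ["a", "c", "g", "j"], ["a", "d", "e", "g"], ["a", "e", "f"]):
    print(target, "->", match_target_nodes(tree, target))
```

Because Python dictionaries preserve insertion order, the first branch under node a is node b, as in Example 7.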
Based on the created prefix tree, the embodiment of the application matches the target word segmentation sequence with the nodes on the plurality of paths of the prefix tree to obtain the matched target nodes, and determines the longest common sequence between the target word segmentation sequence and the historical word segmentation sequence based on the target nodes. In this way, historical alarm data similar to the current alarm data can be matched quickly, which further improves the calculation efficiency and the compression efficiency.
As an optional manner, in S146 the target historical longest common sequence may be determined according to the path on which all the target nodes lie, that is, the historical longest common sequence corresponding to that path is determined as the target historical longest common sequence. For example, in Example 4 the matched target nodes a and b both lie on the path corresponding to the historical longest common sequence Y1, so Y1 is determined as the target historical longest common sequence. Alternatively, the target historical longest common sequence may be determined according to the leaf node reached while matching the words of the target word segmentation sequence, that is, the historical longest common sequence corresponding to the path of that leaf node is determined as the target historical longest common sequence. For example, in Example 5 matching reaches leaf node d, so the historical longest common sequence Y2 corresponding to the path of leaf node d is determined as the target historical longest common sequence. The longest common sequence between the target word segmentation sequence and the historical word segmentation sequence is then determined based on the words represented by the target nodes.
S106, calculating the similarity between the current alarm data and the historical alarm data based on the longest public sequence.
The ratio of the length of the longest common sequence to the length of the target word segmentation sequence is calculated and taken as the similarity between the current alarm data and the historical alarm data, and whether to compress the current alarm data is determined based on the similarity.
S108, if the similarity exceeds a preset threshold, compressing the current alarm data.
It should be understood that if the calculated similarity exceeds the preset threshold, the current alarm data is consistent with the historical alarm data and the two belong to the same type of alarm data, so the current alarm data needs to be compressed and does not need to be sent to the user for consumption. As an optional manner, compressing the current alarm data may consist of removing the content that is similar to the historical alarm data and then sending the remainder to the user for consumption. Specifically, the field values corresponding to the words in the longest common sequence can be located in the target field of the current alarm data, and the current alarm data is sent to the user for consumption after those field values are removed.
The embodiment of the application takes the historical alarm data as a reference, performs word segmentation on the current alarm data received in real time to obtain the target word segmentation sequence, and determines the longest common sequence between the target word segmentation sequence and the historical word segmentation sequence obtained from the historical alarm data. The longer the longest common sequence, the more similar the content of the current alarm data and the historical alarm data. The similarity between the current alarm data and the historical alarm data is then calculated, and if the similarity exceeds the preset threshold, the current alarm data is compressed. The current alarm data received in real time is thus compressed automatically: each time new alarm data is generated, it is compressed when its similarity exceeds the preset threshold, which achieves continuous online compression of alarm data. Compression rules do not need to be defined manually offline, which saves compression time, reduces labor cost, and improves compression efficiency.
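The similarity of S106 and the threshold decision of S108 can be sketched as follows. The names `similarity` and `strip_common_words` and the threshold value 0.4 are assumptions chosen for this sketch; the removal step works on the word sequence rather than on real field values, which depend on the alarm format.

```python
def similarity(common, target):
    """Length of the longest common sequence divided by the target length."""
    if not target:
        return 0.0
    return len(common) / len(target)


def strip_common_words(target, common):
    """Drop the words of `common` from `target`, keeping the remainder.

    `common` is assumed to be a common sequence of `target`, so a single
    left-to-right pass is enough to remove it.
    """
    remainder = []
    k = 0                                   # next word of `common` to remove
    for word in target:
        if k < len(common) and word == common[k]:
            k += 1
        else:
            remainder.append(word)
    return remainder


PRESET_THRESHOLD = 0.4                      # hypothetical value

target = ["a", "c", "g", "j"]               # target sequence of Example 5
common = ["a", "c"]                         # its matched longest common sequence
sim = similarity(common, target)
if sim > PRESET_THRESHOLD:
    print("compress, remainder:", strip_common_words(target, common))
else:
    print("send unchanged:", target)
```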
It should be appreciated that if the similarity exceeds a preset threshold, the historical longest common sequences and prefix tree also need to be updated.
Specifically, in the case where the number of the history word segmentation sequences is plural, and the plurality of history word segmentation sequences are obtained by word segmentation processing on the plurality of history alert data, as another optional embodiment, after S106, the method may further include the following steps:
S110, if the similarity exceeds a preset threshold, determining the longest common sequence between the target word segmentation sequence and the historical longest common sequence corresponding to the path of the target node as a new historical longest common sequence.
S112, updating the path of the target node based on the new historical longest common sequence.
The embodiment of the application determines the longest public sequence between the historical longest public sequence corresponding to the path of the target node and the target word segmentation sequence as a new historical longest public sequence under the condition that the similarity exceeds a preset threshold value so as to update the original historical longest public sequence, and updates the path of the target node based on the new historical longest public sequence so as to update the prefix tree so as to compress the next current alarm data.
As an alternative embodiment, the step S112 may include the following steps: sequentially deleting nodes on paths of the target node according to the sequence from the leaf node to the root node until the next node of the currently deleted node is a branch node, wherein the branch node refers to a node shared by at least two paths;
creating corresponding nodes for words in the longest public sequence of the new history;
Based on the arrangement order of the words in the new history longest public sequence, connecting edges are created between the nodes corresponding to the words in the new history longest public sequence.
The embodiment of the application creates the corresponding path for the new historical longest public sequence by updating the historical longest public sequence so as to update the prefix tree, thereby being convenient for compressing the next current alarm data.
The procedure of updating the historical longest common sequence and the prefix tree when the similarity exceeds the preset threshold is described here by way of example. Example 8: with reference to Example 5, based on the longest common sequence {a, c} obtained by matching, the similarity between the current alarm data and the historical alarm data of the type corresponding to the historical longest common sequence Y2 is calculated. That is, the ratio of the length of the longest common sequence {a, c} to the length of the target word segmentation sequence {a, c, g, j} is calculated, giving a similarity of 50%. Assuming that the similarity of 50% exceeds the preset threshold, the current alarm data and the historical alarm data corresponding to Y2 belong to the same type, and compression is required. The longest common sequence {a, c} is taken as the new historical longest common sequence, denoted Y5, and replaces Y2. The path on which the target nodes a and c lie, namely the path corresponding to Y2, is deleted. Specifically, deletion starts from leaf node d of the path corresponding to Y2, then node c is deleted; the next node, node a, is a branch node, so deletion stops, as shown in FIG. 2 (c). Then nodes corresponding to the words in Y5 are created in the prefix tree, connecting edges are created according to the order of the words in Y5, and a connecting edge is created between the first node and the root node. Finally, the connecting edges with the same prefix are merged to obtain the updated prefix tree, as shown in FIG. 2 (d).
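The update of Example 8 can be sketched on a nested-dictionary prefix tree as follows. The representation and the names `insert_sequence`, `remove_path` and `replace_sequence` are assumptions of this sketch. Deletion walks from the leaf toward the root and stops at the first node that still has other children, which corresponds to the branch node of S112.

```python
def insert_sequence(tree, seq):
    """Create nodes and connecting edges for seq, reusing shared prefixes."""
    node = tree
    for word in seq:
        node = node.setdefault(word, {})


def remove_path(tree, seq):
    """Delete the path of seq from its leaf up to the nearest branch node."""
    parents = []                     # (children dict holding the node, word)
    node = tree
    for word in seq:
        if word not in node:
            return                   # path not present, nothing to delete
        parents.append((node, word))
        node = node[word]
    if node:
        return                       # seq does not end at a leaf, keep it
    for parent, word in reversed(parents):
        if parent[word]:             # node still has children: branch node
            break
        del parent[word]


def replace_sequence(tree, old_seq, new_seq):
    """Replace one historical longest common sequence by a new one."""
    remove_path(tree, old_seq)
    insert_sequence(tree, new_seq)


# Prefix tree of Example 3, built from Y1..Y4.
tree = {}
for y in (["a", "b", "c"], ["a", "c", "d"], ["a", "d"], ["e", "f"]):
    insert_sequence(tree, y)

replace_sequence(tree, ["a", "c", "d"], ["a", "c"])   # Y2 replaced by Y5
print(tree)
```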
It should be understood that if the calculated similarity does not exceed the preset threshold, it is indicated that the current alarm data and the historical alarm data are not of the same type, and the current alarm data needs to be sent to the user for consumption. As an alternative, the target field may be determined according to the service scenario, and the target field of the current alert data and its field value may be sent to the user for consumption. Meanwhile, a new classification is newly established for the current alarm data, and a history longest public sequence is newly added and a prefix tree is updated.
Specifically, in the case where the number of the history word segmentation sequences is plural, and the plurality of history word segmentation sequences are obtained by word segmentation processing on the plurality of history alert data, as another optional embodiment, after S106, the method may further include the following steps: if the similarity is smaller than or equal to a preset threshold value, the target word segmentation sequence is used as a new historical longest public sequence;
Creating corresponding nodes for words in the longest public sequence of the new history in the prefix tree;
Based on the arrangement order of the words in the new history longest public sequence, connecting edges are created between the nodes corresponding to the words in the new history longest public sequence.
The procedure of adding a historical longest common sequence and updating the prefix tree when the similarity does not exceed the preset threshold is described here by way of example.
For example, with reference to Example 8, assume that the similarity of 50% does not exceed the preset threshold. This indicates that the current alarm data and the historical alarm data corresponding to Y2 do not belong to the same type, and a new classification needs to be established for the current alarm data. The target word segmentation sequence {a, c, g, j} is taken as a new historical longest common sequence, denoted Y5', and is added to the original historical longest common sequences Y1, Y2, Y3 and Y4. Nodes corresponding to the words in Y5' are created in the prefix tree, connecting edges are created according to the order of the words in Y5', and a connecting edge is created between the first node and the root node, as shown in FIG. 2 (e). The connecting edges with the same prefix are then merged to obtain the updated prefix tree, as shown in FIG. 2 (f).
The embodiment of the application updates the prefix tree by adding the history longest public sequence and the corresponding path in the prefix tree so as to compress the next current alarm data.
It should be understood that when the longest common sequence between the target word segmentation sequence and the historical word segmentation sequence is determined by means of the prefix tree and, after S106, the similarity does not exceed the preset threshold, each historical longest common sequence may additionally be traversed to further ensure compression accuracy. The longest common sequence between each historical longest common sequence and the target word segmentation sequence is calculated, the similarity between the current alarm data and the historical alarm data of the type corresponding to each historical longest common sequence is calculated, and whether to compress the current alarm data is determined based on the similarity with the maximum value. If the similarity with the maximum value exceeds the preset threshold, the current alarm data is compressed.
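The fallback traversal can be sketched as follows, using the target sequence of Example 7, for which prefix tree matching alone finds only node a. The names `lcs` and `best_history_match` and the labels Y1 to Y4 used as dictionary keys are assumptions of this sketch.

```python
def lcs(s1, s2):
    """Longest common sequence of two word lists (dynamic programming)."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def best_history_match(target, histories):
    """Return (similarity, label, common sequence) with maximum similarity."""
    best = (0.0, None, [])
    for label, history in histories.items():
        common = lcs(target, history)
        sim = len(common) / len(target) if target else 0.0
        if sim > best[0]:            # keep the first maximum encountered
            best = (sim, label, common)
    return best


histories = {
    "Y1": ["a", "b", "c"],
    "Y2": ["a", "c", "d"],
    "Y3": ["a", "d"],
    "Y4": ["e", "f"],
}
sim, label, common = best_history_match(["a", "e", "f"], histories)
print(label, common, round(sim, 2))
```

In this illustration the full traversal selects Y4, which the prefix tree walk never visits because the first word a leads into the subtree shared by Y1, Y2 and Y3.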
As an alternative, the current alarm data is compressed and processed
It should be understood that the foregoing examples are provided to illustrate the method of the embodiments of the present application more clearly and are not to be construed as limiting it.
The following describes the method for compressing alarm data according to the embodiment of the present application by using a complete example, which should not be construed as limiting the method according to the embodiment of the present application. Referring to fig. 3, a flowchart of another alarm data compression method according to an embodiment of the application is shown.
First, a preparation work is performed, including: determining a service scene, and determining a target field to be aimed at in word segmentation processing according to the service scene; selecting a word segmentation tool and setting a dictionary library; setting a similarity threshold, namely a preset threshold; word segmentation processing is carried out on the plurality of historical alarm data to obtain a plurality of historical word segmentation sequences; acquiring a plurality of historical longest public sequences, wherein each historical longest public sequence is the longest public sequence among the historical word segmentation sequences of the same type; and generating a prefix tree based on the plurality of historic longest public sequences, wherein the prefix tree comprises a plurality of paths corresponding to the historic longest public sequences, and nodes on the paths correspond to words in the historic longest public sequences.
After the preparation work is finished, the received current alarm data is compressed, specifically, as shown in fig. 3, the current alarm data is received first, word segmentation is carried out on the current alarm data, and a target word segmentation sequence is obtained; matching words in the target word segmentation sequence with the prefix tree to obtain target nodes; determining a target history longest public sequence matched with the target word segmentation sequence based on the target node, further determining the longest public sequence between the target word segmentation sequence and the target history longest public sequence, calculating the similarity between the current alarm data and the history alarm data corresponding to the target history longest public sequence, and if the similarity exceeds a preset threshold, updating the target history longest public sequence and a prefix tree, and performing compression processing on the current alarm data;
If the similarity does not exceed the preset threshold, all the historical longest common sequences are traversed to find the similarity with the maximum value, the historical longest common sequence corresponding to that similarity is taken as the target historical longest common sequence, and the maximum similarity is compared with the preset threshold. If the maximum similarity exceeds the preset threshold, the target historical longest common sequence and the prefix tree are updated, and the current alarm data is compressed. If it does not, a new classification is established for the current alarm data, the target word segmentation sequence is taken as the historical longest common sequence corresponding to alarm data of that classification and is added to the plurality of historical longest common sequences, the prefix tree is updated, and the current alarm data is sent to the user for consumption.
In the above complete example, the historical alarm data is taken as a reference and, since the historical alarm data is of multiple types, a prefix tree is established. Word segmentation is performed on the current alarm data to obtain the target word segmentation sequence, the words in the target word segmentation sequence are matched against the prefix tree, and the similarity between the current alarm data and the historical alarm data is determined. If the similarity exceeds the preset threshold, the current alarm data is compressed, and the historical longest common sequence and the prefix tree are updated. If the similarity does not exceed the preset threshold, all the historical longest common sequences are traversed, the similarity between the current alarm data and each type of historical alarm data is calculated, and the similarity with the maximum value is compared with the preset threshold. If it exceeds the preset threshold, the current alarm data is compressed, and the historical longest common sequence and the prefix tree are updated; if it does not, the current alarm data is sent to the user for consumption, a new classification is established for the current alarm data, a historical longest common sequence is added, and the prefix tree is updated. The current alarm data received in real time is thus compressed automatically: each time new alarm data is generated, it is compressed when its similarity exceeds the preset threshold, which achieves continuous online compression of alarm data. Compression rules do not need to be defined manually offline, which saves compression time, reduces labor cost, and improves compression efficiency.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Corresponding to the above-mentioned method for compressing alarm data shown in fig. 1, the embodiment of the application also provides an alarm data compression device. Referring to fig. 4, a schematic structural diagram of an alarm data compression device 400 according to an embodiment of the present application is provided, where the device 400 includes: a processing unit 410, a determining unit 420, a calculating unit 430 and a compressing unit 440.
The processing unit 410 is configured to perform word segmentation on the current alarm data received in real time to obtain a target word segmentation sequence.
The determining unit 420 is configured to determine the longest common subsequence between the target word segmentation sequence and a history word segmentation sequence, where the history word segmentation sequence is obtained by performing word segmentation on historical alarm data.
The calculating unit 430 is configured to calculate the similarity between the current alarm data and the historical alarm data based on the longest common subsequence.
The compression unit 440 is configured to compress the current alarm data if the similarity exceeds a preset threshold.
Optionally, when determining the longest common subsequence between the target word segmentation sequence and the history word segmentation sequence, the determining unit 420 may perform the following steps: comparing the words in the target word segmentation sequence with the words in the history word segmentation sequence based on a dynamic programming algorithm, determining the value of each element in a two-dimensional array based on the comparison results, and determining the maximum value of the elements in the two-dimensional array as the length of the longest common subsequence, where the element in the ith row and jth column of the two-dimensional array represents the length of the longest common subsequence between the first i words in the target word segmentation sequence and the first j words in the history word segmentation sequence, and i and j are positive integers;
taking the last element of the last row of the two-dimensional array as a backtracking starting point, backtracking through the comparison results between the words in the target word segmentation sequence and the words in the history word segmentation sequence based on the backtracking starting point and the length of the longest common subsequence, and determining the common words of the target word segmentation sequence and the history word segmentation sequence based on the backtracked comparison results;
and determining, based on the common words, the longest common subsequence between the target word segmentation sequence and the history word segmentation sequence.
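The dynamic programming and backtracking steps above correspond to the textbook longest common subsequence algorithm, which can be sketched as follows. The function name and the sample alarm texts are illustrative assumptions; the array carries an extra row and column 0 for the empty prefixes, so `dp[i][j]` holds the value described for the first i and first j words.

```python
def longest_common_subsequence(target, history):
    """Return the longest common subsequence of two word lists."""
    m, n = len(target), len(history)
    # dp[i][j] = LCS length of target[:i] and history[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if target[i - 1] == history[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Backtrack from the last element of the last row to collect the common words.
    common = []
    i, j = m, n
    while i > 0 and j > 0:
        if target[i - 1] == history[j - 1]:
            common.append(target[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    common.reverse()  # words were collected from the end towards the start
    return common


target = "disk usage on host-a exceeds 90 percent".split()
history = "disk usage on host-b exceeds 85 percent".split()
print(longest_common_subsequence(target, history))
```

Because the array is filled so that values never decrease along a row or column, its maximum value is the last element of the last row, which is why that element serves as the backtracking starting point.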
Optionally, where there are a plurality of history word segmentation sequences obtained by performing word segmentation on a plurality of pieces of historical alarm data, the determining unit 420 may perform the following steps when determining the longest common subsequence between the target word segmentation sequence and the history word segmentation sequences: acquiring a plurality of historical longest common subsequences based on the plurality of history word segmentation sequences, where each historical longest common subsequence is the longest common subsequence among the history word segmentation sequences of the same type;
generating a prefix tree based on the plurality of historical longest common subsequences, where the prefix tree includes a plurality of paths, the paths correspond one-to-one to the historical longest common subsequences, and the nodes on each path represent the words in the corresponding historical longest common subsequence;
matching the target word segmentation sequence against the nodes on the plurality of paths to obtain matched target nodes;
and determining the longest common subsequence based on the words represented by the target nodes.
Optionally, when matching the target word segmentation sequence against the nodes on the plurality of paths to obtain matched target nodes, the determining unit 420 may perform the following steps: traversing the words in the target word segmentation sequence;
matching the current word reached in the traversal against the nodes on the candidate paths corresponding to the current word to obtain the target node matched with the current word, where the candidate paths corresponding to the first word in the target word segmentation sequence include the plurality of paths;
determining the path on which the target node matched with the current word is located as the candidate path corresponding to the word following the current word;
and continuing the traversal until the target word segmentation sequence has been fully traversed or the target node matched with the current word is a leaf node.
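A minimal sketch of this traversal is given below. It assumes the prefix tree is stored as nested dictionaries (each key is a word, each value is the subtree below it, and an empty dictionary is a leaf), and it assumes that a word with no matching node is simply skipped; the text does not specify either point, and the function name is hypothetical.

```python
def match_prefix_tree(root, target_words):
    """Walk the prefix tree with the target words; return the matched words in order."""
    node = root   # for the first word, every path under the root is a candidate
    matched = []
    for word in target_words:
        child = node.get(word)
        if child is None:
            continue          # assumption: unmatched words are skipped
        matched.append(word)
        node = child          # only paths through this node stay candidates
        if not node:          # the matched node is a leaf: stop traversing
            break
    return matched


tree = {"disk": {"usage": {"exceeds": {}}}, "link": {"down": {}}}
print(match_prefix_tree(tree, "disk usage on host-a exceeds 90 percent".split()))
```

Each matched word narrows the candidate paths to those that pass through the matched node, so a lookup costs at most one dictionary access per target word instead of a full comparison with every historical sequence.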
Optionally, when generating the prefix tree based on the plurality of historical longest common subsequences, the determining unit 420 may perform the following steps: creating a root node, and creating corresponding nodes for the words in the plurality of historical longest common subsequences;
for each historical longest common subsequence, creating connecting edges between the nodes corresponding to the words in that historical longest common subsequence based on the order of the words in it, to obtain the path corresponding to that historical longest common subsequence;
and creating a connecting edge between the first node of each path and the root node to obtain the prefix tree.
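The construction steps above can be illustrated with a short sketch. It assumes a nested-dictionary representation in which the root is a dictionary, each key is a word node, and nesting plays the role of a connecting edge; the function name and sample sequences are hypothetical.

```python
def build_prefix_tree(sequences):
    """Build a prefix tree (nested dicts) from historical longest common subsequences."""
    root = {}                                  # the root node represents no word
    for sequence in sequences:
        node = root
        for word in sequence:
            # Reuse the node if a path already shares this prefix, else create it.
            node = node.setdefault(word, {})
    return root


tree = build_prefix_tree([
    ["disk", "usage", "exceeds"],
    ["disk", "io", "slow"],
    ["link", "down"],
])
print(tree)
```

Sequences that begin with the same words share their leading nodes, so the node for "disk" in this example is a branch node common to two paths. One limitation of the plain nested-dictionary form is that it cannot mark a sequence that is a strict prefix of another; an explicit end-of-sequence flag on each node would be needed for that case.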
Optionally, the determining unit 420 is further configured to perform the following steps after the similarity between the current alarm data and the historical alarm data is calculated based on the longest common subsequence: if the similarity is less than or equal to the preset threshold, taking the target word segmentation sequence as a new historical longest common subsequence;
creating, in the prefix tree, corresponding nodes for the words in the new historical longest common subsequence;
and creating connecting edges between the nodes corresponding to the words in the new historical longest common subsequence based on the order of the words in the new historical longest common subsequence.
Optionally, the determining unit 420 is further configured to perform the following steps after the calculating unit 430 calculates the similarity between the current alarm data and the historical alarm data based on the longest common subsequence: if the similarity exceeds the preset threshold, determining the longest common subsequence between the target word segmentation sequence and the historical longest common subsequence corresponding to the path on which the target node is located as a new historical longest common subsequence;
and updating the path on which the target node is located based on the new historical longest common subsequence.
Optionally, when updating the path on which the target node is located based on the new historical longest common subsequence, the determining unit 420 may perform the following steps: deleting the nodes on the path on which the target node is located one by one, in order from the leaf node towards the root node, until the next node after the currently deleted node is a branch node, where a branch node is a node shared by at least two paths;
creating corresponding nodes for the words in the new historical longest common subsequence;
and creating connecting edges between the nodes corresponding to the words in the new historical longest common subsequence based on the order of the words in the new historical longest common subsequence.
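The path update described above can be sketched as follows, again assuming a nested-dictionary prefix tree in which an empty dictionary is a leaf. The helper names `remove_path`, `insert_path`, and `update_path` are hypothetical, and a node is treated as a branch node when it still has other children after its child on the old path has been deleted.

```python
def remove_path(root, words):
    """Delete a path from the leaf towards the root, stopping at a branch node."""
    trail = []                      # (parent, word) pairs from the root to the leaf
    node = root
    for word in words:
        if word not in node:
            return                  # the path is not in the tree; nothing to delete
        trail.append((node, word))
        node = node[word]
    for parent, word in reversed(trail):
        if parent[word]:            # node still has children: shared with another path
            break
        del parent[word]


def insert_path(root, words):
    """Create nodes and connecting edges for the words, in order."""
    node = root
    for word in words:
        node = node.setdefault(word, {})


def update_path(root, old_sequence, new_sequence):
    """Replace the path of old_sequence with the path of new_sequence."""
    remove_path(root, old_sequence)
    insert_path(root, new_sequence)


tree = {"disk": {"usage": {"on": {"exceeds": {}}}, "io": {"slow": {}}}}
update_path(tree, ["disk", "usage", "on", "exceeds"], ["disk", "usage", "exceeds"])
print(tree)
```

Stopping at the branch node leaves the other path through "disk" untouched, so only the part of the old path that belonged to the updated class alone is rebuilt.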
Optionally, where the current alarm data includes field values of a plurality of fields, the processing unit 410 may perform the following steps when performing word segmentation on the current alarm data received in real time to obtain the target word segmentation sequence: selecting, from the plurality of fields, a target field corresponding to the service scenario to which the current alarm data belongs;
and performing word segmentation on the field value of the target field to obtain the target word segmentation sequence.
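As a rough illustration of these two steps, the sketch below picks a target field by service scenario and splits its value into words. The scenario names, field names, and delimiter set are assumptions made for the example; alarm text written without spaces (for example Chinese) would need a dedicated word segmenter instead of the regular expression used here.

```python
import re

# Hypothetical mapping from service scenario to the field that is segmented.
TARGET_FIELD_BY_SCENARIO = {
    "host_monitoring": "message",
    "network": "summary",
}


def segment_alarm(alarm, scenario):
    """Select the target field for the scenario and split its value into words."""
    field = TARGET_FIELD_BY_SCENARIO[scenario]
    value = alarm.get(field, "")
    # Split on whitespace and a few common delimiters; other characters are kept.
    return re.findall(r"[^\s,;:=]+", value)


alarm = {
    "id": "a-1",
    "level": "critical",
    "message": "disk usage on host-a exceeds 90 percent",
}
print(segment_alarm(alarm, "host_monitoring"))
```

Restricting segmentation to one field keeps identifiers and timestamps in the other fields out of the target word segmentation sequence, where they would otherwise lower the similarity between alarms of the same kind.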
It is obvious that the alarm data compression device provided in the embodiment of the present application can be used as an execution body of the alarm data compression method shown in fig. 1, for example, in the alarm data compression method shown in fig. 1, step S102 may be executed by the processing unit 410 in the alarm data compression device shown in fig. 4, step S104 may be executed by the determining unit 420 in the alarm data compression device shown in fig. 4, step S106 may be executed by the calculating unit 430 in the alarm data compression device shown in fig. 4, and step S108 may be executed by the compressing unit 440 in the alarm data compression device shown in fig. 4.
According to another embodiment of the present application, each unit in the alarm data compression device shown in fig. 4 may be separately or completely combined into one or several other units, or some unit(s) thereof may be further split into a plurality of units with smaller functions, which may achieve the same operation without affecting the implementation of the technical effects of the embodiments of the present application. The above units are divided based on logic functions, and in practical applications, the functions of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the present application, the alarm data compression device may also include other units, and in practical applications, these functions may also be implemented with assistance from other units, and may be implemented by cooperation of multiple units.
According to another embodiment of the present application, the alarm data compression device shown in fig. 4 may be constructed, and the alarm data compression method of the embodiment of the present application may be implemented, by running a computer program (including program code) capable of executing the steps of the method shown in fig. 1 on a general-purpose computing device, such as a computer that includes processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM), and a read-only memory (ROM). The computer program may be recorded on, for example, a computer-readable storage medium, loaded into an electronic device via the computer-readable storage medium, and run therein.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 5, at the hardware level, the electronic device includes a processor and, optionally, an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The buses may be classified as address buses, data buses, control buses, and so on. For ease of illustration, only one bidirectional arrow is shown in fig. 5, but this does not mean that there is only one bus or only one type of bus.
The memory is used for storing a program. Specifically, the program may include program code, and the program code includes computer operating instructions. The memory may include an internal memory and a non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the alarm data compression device at the logical level. The processor is used for executing the program stored in the memory, and is specifically used for executing the following operations:
performing word segmentation on the current alarm data received in real time to obtain a target word segmentation sequence;
determining the longest common subsequence between the target word segmentation sequence and a history word segmentation sequence, where the history word segmentation sequence is obtained by performing word segmentation on historical alarm data;
calculating the similarity between the current alarm data and the historical alarm data based on the longest common subsequence;
and if the similarity exceeds a preset threshold, compressing the current alarm data.
The method performed by the alarm data compression device disclosed in the embodiment shown in fig. 4 of the present application may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or as being executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
The electronic device may also execute the method of fig. 1 and implement the functions of the alarm data compression device in the embodiments shown in fig. 1 and fig. 4, which are not described again here.
Of course, the electronic device of the present application does not exclude other implementations, such as logic devices or a combination of hardware and software; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or a logic device.
The embodiment of the present application also provides a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment shown in fig. 1, and in particular to perform the following operations:
performing word segmentation on the current alarm data received in real time to obtain a target word segmentation sequence;
determining the longest common subsequence between the target word segmentation sequence and a history word segmentation sequence, where the history word segmentation sequence is obtained by performing word segmentation on historical alarm data;
calculating the similarity between the current alarm data and the historical alarm data based on the longest common subsequence;
and if the similarity exceeds a preset threshold, compressing the current alarm data.
In summary, the foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, see the corresponding parts of the description of the method embodiments.
Claims (10)
1. A method of compressing alarm data, comprising:
performing word segmentation on the current alarm data received in real time to obtain a target word segmentation sequence;
determining the longest common subsequence between the target word segmentation sequence and a history word segmentation sequence, wherein the history word segmentation sequence is obtained by performing word segmentation on historical alarm data;
calculating the similarity between the current alarm data and the historical alarm data based on the longest common subsequence;
and if the similarity exceeds a preset threshold, compressing the current alarm data.
2. The method of claim 1, wherein said determining the longest common subsequence between the target word segmentation sequence and the history word segmentation sequence comprises:
comparing the words in the target word segmentation sequence with the words in the history word segmentation sequence based on a dynamic programming algorithm, determining the value of each element in a two-dimensional array based on the comparison results, and determining the maximum value of the elements in the two-dimensional array as the length of the longest common subsequence, wherein the element in the ith row and jth column of the two-dimensional array represents the length of the longest common subsequence between the first i words in the target word segmentation sequence and the first j words in the history word segmentation sequence, and i and j are positive integers;
taking the last element of the last row of the two-dimensional array as a backtracking starting point, backtracking through the comparison results between the words in the target word segmentation sequence and the words in the history word segmentation sequence based on the backtracking starting point and the length of the longest common subsequence, and determining the common words of the target word segmentation sequence and the history word segmentation sequence based on the backtracked comparison results;
and determining, based on the common words, the longest common subsequence between the target word segmentation sequence and the history word segmentation sequence.
3. The method of claim 1, wherein there are a plurality of history word segmentation sequences, and the plurality of history word segmentation sequences are obtained by performing word segmentation on a plurality of pieces of historical alarm data;
the determining the longest common subsequence between the target word segmentation sequence and the history word segmentation sequence comprises:
acquiring a plurality of historical longest common subsequences based on the plurality of history word segmentation sequences, wherein each historical longest common subsequence is the longest common subsequence among the history word segmentation sequences of the same type;
generating a prefix tree based on the plurality of historical longest common subsequences, wherein the prefix tree comprises a plurality of paths, the paths correspond one-to-one to the historical longest common subsequences, and the nodes on each path represent the words in the corresponding historical longest common subsequence;
matching the target word segmentation sequence against the nodes on the plurality of paths to obtain matched target nodes;
and determining the longest common subsequence based on the words represented by the target nodes.
4. The method of claim 3, wherein the matching the target word segmentation sequence against the nodes on the plurality of paths to obtain matched target nodes comprises:
traversing the words in the target word segmentation sequence;
matching the current word reached in the traversal against the nodes on the candidate paths corresponding to the current word to obtain the target node matched with the current word, wherein the candidate paths corresponding to the first word in the target word segmentation sequence comprise the plurality of paths;
determining the path on which the target node matched with the current word is located as the candidate path corresponding to the word following the current word;
and continuing the traversal until the target word segmentation sequence has been fully traversed or the target node matched with the current word is a leaf node.
5. The method of claim 3, wherein said generating a prefix tree based on the plurality of historical longest common subsequences comprises:
creating a root node, and creating corresponding nodes for the words in the plurality of historical longest common subsequences;
for each historical longest common subsequence, creating connecting edges between the nodes corresponding to the words in that historical longest common subsequence based on the order of the words in it, to obtain the path corresponding to that historical longest common subsequence;
and creating a connecting edge between the first node of each path and the root node to obtain the prefix tree.
6. The method of claim 3, wherein after said calculating the similarity between the current alarm data and the historical alarm data based on the longest common subsequence, the method further comprises:
if the similarity is less than or equal to the preset threshold, taking the target word segmentation sequence as a new historical longest common subsequence;
creating, in the prefix tree, corresponding nodes for the words in the new historical longest common subsequence;
and creating connecting edges between the nodes corresponding to the words in the new historical longest common subsequence based on the order of the words in the new historical longest common subsequence.
7. The method of claim 3, wherein after said calculating the similarity between the current alarm data and the historical alarm data based on the longest common subsequence, the method further comprises:
if the similarity exceeds the preset threshold, determining the longest common subsequence between the target word segmentation sequence and the historical longest common subsequence corresponding to the path on which the target node is located as a new historical longest common subsequence;
and updating the path on which the target node is located based on the new historical longest common subsequence.
8. The method of claim 7, wherein the updating the path on which the target node is located based on the new historical longest common subsequence comprises:
deleting the nodes on the path on which the target node is located one by one, in order from the leaf node towards the root node, until the next node after the currently deleted node is a branch node, wherein a branch node is a node shared by at least two paths;
creating corresponding nodes for the words in the new historical longest common subsequence;
and creating connecting edges between the nodes corresponding to the words in the new historical longest common subsequence based on the order of the words in the new historical longest common subsequence.
9. The method of claim 1, wherein the current alarm data comprises field values of a plurality of fields;
the performing word segmentation on the current alarm data received in real time to obtain a target word segmentation sequence comprises:
selecting, from the plurality of fields, a target field corresponding to the service scenario to which the current alarm data belongs;
and performing word segmentation on the field value of the target field to obtain the target word segmentation sequence.
10. An alarm data compression device, comprising:
a processing unit, configured to perform word segmentation on the current alarm data received in real time to obtain a target word segmentation sequence;
a determining unit, configured to determine the longest common subsequence between the target word segmentation sequence and a history word segmentation sequence, wherein the history word segmentation sequence is obtained by performing word segmentation on historical alarm data;
a calculating unit, configured to calculate the similarity between the current alarm data and the historical alarm data based on the longest common subsequence;
and a compression unit, configured to compress the current alarm data if the similarity exceeds a preset threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410027474.9A CN117938173A (en) | 2024-01-08 | 2024-01-08 | Alarm data compression method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117938173A true CN117938173A (en) | 2024-04-26 |
Family
ID=90767863
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410027474.9A Pending CN117938173A (en) | 2024-01-08 | 2024-01-08 | Alarm data compression method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117938173A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||