CN116402136A - Rule extraction method based on offline data, storage medium and electronic equipment - Google Patents

Publication number
CN116402136A
Authority
CN
China
Prior art keywords
data
character
time sequence
event
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310288192.XA
Other languages
Chinese (zh)
Other versions
CN116402136B (en
Inventor
薄满辉
张凯伦
苏茹梅
马泽龙
邓翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Travelsky Mobile Technology Co Ltd
Original Assignee
China Travelsky Mobile Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Travelsky Mobile Technology Co Ltd filed Critical China Travelsky Mobile Technology Co Ltd
Priority to CN202310288192.XA priority Critical patent/CN116402136B/en
Publication of CN116402136A publication Critical patent/CN116402136A/en
Application granted granted Critical
Publication of CN116402136B publication Critical patent/CN116402136B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • G06N5/025Extracting rules from data

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to the field of data processing, and in particular to a rule extraction method based on offline data, a storage medium, and an electronic device. The method comprises the following steps: acquiring initial data sets of a plurality of description fields of a target event; performing character type conversion processing on each piece of field data to generate corresponding character data; generating a character confidence set for each initial data set according to the character data; and generating a data judgment rule for the corresponding description field according to the confidences in each character confidence set. By performing character type conversion processing on the description data, the invention obtains the character composition form of each piece of description data. The most likely data forms in each description field are then determined from the confidences. These data forms serve as the rule for judging whether data is abnormal: newly added description data of the description field is monitored and judged to identify abnormal data, thereby improving the accuracy of the description information of the target event.

Description

Rule extraction method based on offline data, storage medium and electronic equipment
Technical Field
The present invention relates to the field of data processing, and in particular, to a rule extraction method, a storage medium, and an electronic device based on offline data.
Background
With the development of internet technology, more and more industries store the description information of various aspects of an event in corresponding fields, forming a description data packet of the event for recording, storage or transmission. Take flight information in the aviation field as an example. The description data packet of a flight may include the following fields: flight number, departure place, arrival place, boarding start time, boarding end time, cabin door closing time, wheel withdrawal time, departure time, arrival time, baggage carousel number, and the like.
With so much field information, some of the data in a description field, or the information of the entire description data packet, may become abnormal because of equipment failure, parsing rule failure, and similar causes. The prior art lacks a method for effectively identifying such abnormal information, so the accuracy of the description information of the event is low.
Disclosure of Invention
Aiming at the technical problems, the invention adopts the following technical scheme:
according to one aspect of the present invention, there is provided a rule extraction method based on offline data, the method comprising the steps of:
acquiring initial data sets $A_1, A_2, \ldots, A_i, \ldots, A_z$ of a plurality of description fields of a target event; wherein $A_i$ is the initial data set corresponding to the $i$-th description field; $i = 1, 2, \ldots, z$; $z$ is the total number of description fields of the target event; each initial data set comprises at least one piece of corresponding field data;
performing character type conversion processing on each piece of field data to generate character data corresponding to each piece of field data; each initial data set comprises at least one kind of character data;
generating character confidence sets $B_1, B_2, \ldots, B_i, \ldots, B_z$ corresponding to the initial data sets according to the character data corresponding to the field data contained in each initial data set, with $B_i = \{A_{i1}, A_{i2}, \ldots, A_{in}, \ldots, A_{if(A_i)}\}$; wherein $B_i$ is the character confidence set corresponding to $A_i$; $A_{in}$ is the confidence corresponding to the $n$-th kind of character data in $A_i$; $n = 1, 2, \ldots, f(A_i)$; $f(A_i)$ is the total number of kinds of character data in $A_i$; $A_{in}$ satisfies the following condition:
$A_{in} = Y_{in} / Y_i$; wherein $Y_{in}$ is the total number of character data of the $n$-th kind in $A_i$; $Y_i$ is the total number of all character data in $A_i$;
generating a data judgment rule of a description field corresponding to each initial data set of the target event according to the confidence coefficient distribution condition in the character confidence coefficient set corresponding to each initial data set;
the character type conversion process includes:
splitting the field data into its constituent characters by using a split function, to generate a plurality of independent characters;
if an independent character is a digit, marking it with a first character identifier;
if an independent character is a letter, marking it with a second character identifier;
if an independent character is a Chinese character, marking it with a third character identifier;
splicing the character identifiers respectively corresponding to the plurality of independent characters into the character data corresponding to the field data; the character identifiers comprise the first character identifier, the second character identifier and the third character identifier.
According to a second aspect of the present invention, there is provided a non-transitory computer readable storage medium storing a computer program which when executed by a processor implements a rule extraction method based on offline data as described above.
According to a third aspect of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing a rule extraction method based on offline data as described above when executing the computer program.
The invention has at least the following beneficial effects:
By performing character type conversion processing on the description data, the invention obtains the character data corresponding to each piece of description data, and from it the character composition form of each piece of description data. The most likely data forms in each description field can then be determined from the confidence of each kind of character data in that description field. The determined data forms are taken as the standard data forms of the description field and used as the rule for judging whether data is abnormal; newly added description data of the description field is monitored and judged to identify abnormal data, thereby improving the accuracy of the description information of the target event.
In addition, during the character type conversion processing, the description data is split into individual characters and the type of each independent character is judged in turn, so the corresponding character data can be generated more quickly. The regularity of the data composition form in each description field is therefore exposed more readily, and the judgment rule for the data composition form can be formed more accurately.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a rule extraction method based on time-series offline data according to an embodiment of the present invention.
Fig. 2 is a flowchart of a rule extraction method based on offline data according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
As a possible embodiment of the present invention, as shown in fig. 1, there is provided a rule extraction method based on time-series offline data, the method including the steps of:
s100: acquiring a duration set G of each time sequence stage of a target event 1 ,G 2 ,…,G k ,…,G y . Wherein G is k Is the duration set of the kth timing phase. k=1, 2, …, y. y is the total number of sequential phases of the target event. Each set of duration includes at least one corresponding duration.
The data in the duration set of each timing phase may be existing offline data. This embodiment can be used in the aviation field. The following description takes as an example the extraction of a data rule for the timing phase running from the boarding start time to the boarding end time of airport flights.
The duration in this example is the difference between the boarding end time and the boarding start time of each flight at the airport.
To improve accuracy, the target event may be restricted to a single, specific event, such as the timing phase from the boarding start time to the boarding end time of one particular flight.
S200: and performing duration interval extraction processing on the duration interval set of each time sequence stage to generate a standard duration interval corresponding to each time sequence stage.
S300: and generating a judgment rule corresponding to the time sequence stage according to the standard duration corresponding to each time sequence stage.
If the standard duration determined after the processing is [10min,35min ], all duration included in the standard duration is used as the normal duration value of the time sequence stage. If the duration of the new data of the time sequence stage appears in the subsequent time sequence stage is not in the section, the data is considered to be abnormal. Of course, the duration of the corresponding timing phase may also be predicted by determining a standard duration.
The duration interval extraction processing comprises the following steps:
s201: and generating a time length proportion curve corresponding to the time sequence stage according to the time durations included in the time duration set, wherein the horizontal axis is a time duration value, and the vertical axis is the ratio of the number of each time duration to the total number of the time durations included in the time duration set.
S202: and generating a first credibility corresponding to each accumulated duration according to the duration duty ratio curve. The first confidence level satisfies the following condition:
Figure BDA0004140474610000041
wherein (1)>
Figure BDA0004140474610000042
And the first reliability corresponding to the a-th accumulated duration is obtained. f (x) is a function corresponding to the duration duty cycle curve. W (W) 0 The total area is formed by the duration occupying ratio curve and the transverse axis; g is g 1 The minimum value of duration is concentrated for duration. W (W) 0 In particular a time length duty ratio curve [ g ] 1 ,g 4 ]The horizontal axes of the parts enclose the combined total area.
In this step, the accumulated time period may be accumulated according to 1 minute. Thus, the a-th accumulated time length is a minutes, and the corresponding integral interval is [ g ] 1 ,g 1 +a]。
S203: when (when)
Figure BDA0004140474610000043
When the first time is larger than the first confidence threshold, will +.>
Figure BDA0004140474610000044
The corresponding accumulated duration is taken as the target interval length L.
The first confidence threshold may be 90%.
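Steps S201 to S203 can be sketched in discrete form as follows. This is a minimal illustration rather than the patented implementation: it assumes durations are whole minutes, replaces the integral by a count of durations falling in $[g_1, g_1 + a]$ (so that $W_0$ corresponds to the full count), and uses the hypothetical function names `duration_ratio_curve`, `first_credibility` and `target_interval_length`.

```python
from collections import Counter


def duration_ratio_curve(durations):
    """S201: map each duration value (minutes) to its share of all durations."""
    counts = Counter(durations)
    total = len(durations)
    return {value: count / total for value, count in sorted(counts.items())}


def first_credibility(durations, a):
    """S202: share of durations inside [g1, g1 + a], a discrete stand-in
    for the integral of the curve over that interval divided by W0."""
    g1 = min(durations)
    covered = sum(1 for d in durations if g1 <= d <= g1 + a)
    return covered / len(durations)


def target_interval_length(durations, threshold=0.9):
    """S203: smallest accumulated duration a (1-minute steps) whose first
    credibility exceeds the threshold; falls back to the full range."""
    g1, g4 = min(durations), max(durations)
    for a in range(1, g4 - g1 + 1):
        if first_credibility(durations, a) > threshold:
            return a
    return g4 - g1


sample = [10, 12, 12, 13, 13, 13, 14, 14, 15, 30]
print(target_interval_length(sample, threshold=0.8))
```

With the sample data, nine of the ten durations lie within 5 minutes of the minimum, so a threshold of 0.8 is first exceeded at an accumulated duration of 5.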
S204: and generating the skewness S of the duration occupying ratio curve according to the duration included in the duration set.
Further, S satisfies the following condition:
Figure BDA0004140474610000045
wherein X is u Is the u-th duration in the duration set. u is the total number of durations contained in the duration set. μ is the average of the duration in the duration set. σ is the standard deviation of the duration in the duration set.
In this step, when the duration distribution included in the duration set is a symmetric distribution (normal distribution), s=0.
When the duration distribution included in the duration set is a left-offset distribution, S <0.
When the duration distribution included in the duration set is a right-bias distribution, S >0.
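The sign behaviour described above can be checked with a short sketch. It assumes the standard moment-based skewness (the mean of the cubed standardized deviations, with the population standard deviation); the function name `skewness` is hypothetical.

```python
import math


def skewness(durations):
    """Moment-based skewness: mean of ((x - mu) / sigma) ** 3.

    Uses the population standard deviation (an assumption); returns 0.0
    when all durations are identical and sigma is zero.
    """
    n = len(durations)
    mu = sum(durations) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in durations) / n)
    if sigma == 0:
        return 0.0
    return sum(((x - mu) / sigma) ** 3 for x in durations) / n


print(skewness([1, 2, 3]))        # symmetric data
print(skewness([1, 1, 1, 10]))    # long tail to the right, S > 0
print(skewness([1, 10, 10, 10]))  # long tail to the left, S < 0
```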
S205: generating a second credibility corresponding to each duration interval to be selected according to the skewness and the target interval length, wherein the second credibility meets the following conditions:
Figure BDA0004140474610000046
wherein (1)>
Figure BDA00041404746100000411
And the second credibility corresponding to the b-th duration interval to be selected is obtained. />
Figure BDA0004140474610000047
And the first endpoint value of the b-th duration interval to be selected. />
Figure BDA0004140474610000048
And the second endpoint value of the b-th duration interval to be selected. g4 is the maximum value of the duration in the duration set.
Figure BDA0004140474610000049
The following conditions are satisfied: />
Figure BDA00041404746100000410
Figure BDA0004140474610000051
The following conditions are satisfied: />
Figure BDA0004140474610000052
In this step, the base start times of every two adjacent candidate duration intervals differ by 1 minute, and $g_1 + b - 1$ is the base start time of the $b$-th candidate duration interval. Thus the base start time of the first candidate duration interval is $g_1$, and that of the second candidate duration interval is $g_1 + 1$. In addition, for the two portions of $L$ that are allocated to the left and to the right, only the integer part of each value is kept as the final output.
Take $g_1 = 5$ min, $g_4 = 50$ min, $S = 0.353$ and $L = 36$ as an example:
the portion of $L$ allocated to the left is 21 after truncation to an integer, and the portion allocated to the right is 11 after truncation.
The corresponding 1st candidate duration interval is [5 min, 16 min], and the corresponding 30th candidate duration interval is [13 min, 45 min].
S206: when (when)
Figure BDA0004140474610000057
Greater than or equal to the second confidence threshold, will +.>
Figure BDA0004140474610000058
As a standard duration corresponding to the timing phase.
Thus, in this embodiment, a candidate duration interval is determined for every one-minute step, and the definite integral of the duration proportion curve over each candidate duration interval is obtained. In this embodiment, the maximum of the definite integrals over all candidate duration intervals may be used as the second confidence threshold.
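The candidate-interval scan can be sketched as below. Several points are assumptions rather than statements from the source: the skewness-dependent coefficients are not available in this text, so the truncated left and right portions of $L$ are passed in directly as the hypothetical parameters `left_len` and `right_len`; the clipping of endpoints to $[g_1, g_4]$ is inferred from the worked example (left 21, right 11 giving [5, 16] for the 1st interval and [13, 45] for the 30th); and the definite integral is replaced by the share of durations inside the interval.

```python
def candidate_interval(b, g1, g4, left_len, right_len):
    """Endpoints of the b-th candidate interval around base start g1 + b - 1,
    clipped to [g1, g4] (clipping inferred from the worked example)."""
    base = g1 + b - 1
    first = max(g1, base - left_len)
    second = min(g4, base + right_len)
    return first, second


def interval_mass(durations, first, second):
    """Discrete stand-in for the definite integral over [first, second]."""
    return sum(1 for d in durations if first <= d <= second) / len(durations)


def standard_interval(durations, left_len, right_len):
    """Scan one candidate per minute and keep the one with the largest mass,
    i.e. use the maximum over all candidates as the second threshold."""
    g1, g4 = min(durations), max(durations)
    candidates = [
        candidate_interval(b, g1, g4, left_len, right_len)
        for b in range(1, g4 - g1 + 2)
    ]
    return max(candidates, key=lambda iv: interval_mass(durations, *iv))


print(candidate_interval(1, 5, 50, 21, 11))   # (5, 16)
print(candidate_interval(30, 5, 50, 21, 11))  # (13, 45)
```

When several candidates tie for the largest mass, this sketch keeps the earliest one; the source does not say how ties are resolved.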
In summary, the invention uses two skewness-dependent coefficients to distribute the target interval length $L$ to the left and to the right, respectively. When the data distribution is symmetric, $S = 0$ and $L$ is distributed equally. When the data distribution is left-skewed, the bulk of the data lies closer to the right side, $S < 0$, and more of $L$ is assigned to the second endpoint value, located on the right. Similarly, when the data distribution is right-skewed, more of $L$ is assigned to the first endpoint value, located on the left. By introducing the skewness $S$ as an influence factor when determining the standard duration interval of each timing phase, the candidate duration intervals fit the shape of the data distribution more closely, so that the two endpoints of the standard duration interval can be determined more accurately and quickly. This embodiment is better suited to extracting data judgment rules for target events that have a plurality of timing phases, such as flight operation events and shopping flow events.
As a possible embodiment of the present invention, as shown in fig. 2, there is further provided a rule extraction method based on offline data, where the method further includes:
s400: acquiring an initial dataset A of multiple description fields of a target event 1 ,A 2 ,…,A i ,…,A z . Wherein A is i And the initial data set corresponding to the ith description field. i=1, 2, …, z. z is the total number of description fields for the target event. Each initial dataset includes at least one corresponding field data.
Specifically, taking the aviation field as an example, the description fields of a flight may include the flight number, departure place, arrival place, departure time, arrival time, baggage carousel number, and the like. The flight number field may include field data such as MU1234, 3U1234 and 中航1254.
S500: and carrying out character type conversion processing on each field of data to generate character data corresponding to each field of data. Each initial dataset includes at least one type of character data.
S600: generating a character confidence coefficient set B corresponding to each initial data set according to the character data corresponding to the field data contained in each initial data set 1 ,B 2 ,…,B i ,…,B z ,B i ={A i1 ,A i2 ,…,A in ,…,A f(Ai) }. Wherein B is i Is A i A corresponding set of character confidence levels. A is that in Is A i Confidence corresponding to the nth type of character data. n=1, 2, …, f (a i )。f(A i ) Is A i The total number of kinds of character data. A is that in The following conditions are satisfied:
A in =Y in /Y i . Wherein Y is in Is A i The total number of the nth type of character data. Y is Y i Is A i Is included in the total number of all character data.
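The confidence $A_{in} = Y_{in} / Y_i$ is a relative frequency, which the following sketch computes for one description field. The function name `character_confidences` and the sample character data are illustrative assumptions.

```python
from collections import Counter


def character_confidences(character_data):
    """Return {kind of character data: Y_in / Y_i} for one description field."""
    counts = Counter(character_data)   # Y_in for each kind n
    total = len(character_data)        # Y_i
    return {kind: count / total for kind, count in counts.items()}


# Character data of four flight numbers in one field (illustrative values).
field_patterns = ["AA1111", "AA1111", "1A1111", "CC1111"]
print(character_confidences(field_patterns))
```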
S700: and generating a data judgment rule of the description field corresponding to each initial data set of the target event according to the confidence coefficient distribution condition in the character confidence coefficient set corresponding to each initial data set.
Further, the method also comprises the following steps:
s800: and judging the newly added field data of the corresponding description field according to the data judgment rule corresponding to each description field.
If the character data of the newly added field data is different from any character data existing in the corresponding data judging rule, judging that the newly added field data is abnormal data.
The character type conversion process includes:
s501: and splitting each character in the composition field data by using a split function to generate a plurality of independent characters.
If the independent character is a number, the mark is a first character mark.
If the independent character is a letter, the mark is a second character mark.
If the independent character is a Chinese character, the mark is a third character mark.
S502: and respectively corresponding character identifiers of the plurality of independent characters are spliced into character data corresponding to the field data. The character identifiers comprise a first character identifier, a second character identifier and a third character identifier.
In this embodiment, the first character is identified as 1, the first character is identified as a, and the first character is identified as C.
Taking SC1234 as an example, the field data is first split by the split function into S, C, 1, 2, 3, 4, and each independent character is then examined to determine whether it is a digit, a letter, a Chinese character or another symbol. The resulting character data is AA1111. After this conversion, the proportion of description data of each composition form in each description field can be obtained, and the corresponding rule can be derived conveniently.
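The conversion in S501 and S502 can be sketched as follows. Two points are assumptions: Chinese characters are detected with the CJK Unified Ideographs range U+4E00 to U+9FFF, and any other symbol is copied through unchanged, which matches time patterns that keep their separators. The function name `to_character_data` is hypothetical, and iterating over the string plays the role of the split function.

```python
def to_character_data(field_data):
    """Replace digits by '1', ASCII letters by 'A', Chinese characters by 'C'."""
    marks = []
    for ch in field_data:  # iterating a str yields its independent characters
        if ch.isascii() and ch.isdigit():
            marks.append("1")            # first character identifier
        elif ch.isascii() and ch.isalpha():
            marks.append("A")            # second character identifier
        elif "\u4e00" <= ch <= "\u9fff":
            marks.append("C")            # third character identifier
        else:
            marks.append(ch)             # other symbols kept as-is (assumption)
    return "".join(marks)


print(to_character_data("SC1234"))               # AA1111
print(to_character_data("2023-03-22 09:30:00"))  # 1111-11-11 11:11:11
```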
Three formats occur in large numbers in the flight number field, namely AA, A1 and 1A. For a time field such as the take-off time field, there is only one format, 1111-11-11 11:11:11, i.e. yyyy-MM-dd hh:mm:ss. After the character type conversion processing of this embodiment, the regularity in the character composition form of the description data in each description field therefore stands out more clearly, and the rule for judging abnormal data forms in each description field can be determined more accurately from that regularity. This embodiment mainly judges the composition form of the data and is better suited to a preliminary judgment of whether description data is abnormal.
As a possible embodiment of the present invention, S700 (generating a data judgment rule for the description field corresponding to each initial data set of the target event according to the distribution of confidences in the corresponding character confidence set) comprises:
S701: Sort the confidences in the character confidence set in descending order to generate a confidence sequence.
S702: Take the character data corresponding to the first $m$ confidences in the confidence sequence as the target data formats, where $m$ is the number of confidences accumulated when the cumulative sum of the confidences in the confidence sequence first exceeds the first confidence threshold.
S703: Generate the data judgment rule of each description field of the target event according to the target data formats corresponding to that description field.
Take the flight number field as an example. Three formats occur in large numbers in this field, namely AA, A1 and 1A; very small amounts of 11 and C1 also occur. The confidences corresponding to the formats are AA = 0.38, A1 = 0.33, 1A = 0.21, 11 = 0.07 and C1 = 0.04, and the first confidence threshold is 0.9.
The cumulative sum first exceeds 0.9 at the third format (0.38 + 0.33 + 0.21 = 0.92), so $m = 3$. Accordingly, AA, A1 and 1A are the target data formats.
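Steps S701 and S702 amount to a cumulative cut-off over the sorted confidences. The sketch below applies it to the flight number example, reading the third listed confidence (0.21) as belonging to format 1A; the function name `target_formats` is hypothetical.

```python
def target_formats(confidences, threshold=0.9):
    """Return the first m formats, in descending confidence order, where m is
    reached when the cumulative confidence first exceeds the threshold."""
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for pattern, confidence in ranked:
        selected.append(pattern)
        running += confidence
        if running > threshold:
            break
    return selected


flight_number_conf = {"AA": 0.38, "A1": 0.33, "1A": 0.21, "11": 0.07, "C1": 0.04}
print(target_formats(flight_number_conf, threshold=0.9))  # ['AA', 'A1', '1A']
```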
In this embodiment, rare cases occurring in the description data are removed automatically. Some abnormal data also exists in the historical data, but it occurs only a few times, so the corresponding confidence is very small. This embodiment therefore filters out abnormal data, ensures that the acquired target data formats are normal data that match actual conditions and occur frequently, and improves the accuracy of the finally generated data judgment rule.
As a possible embodiment of the present invention, after S702 (taking the character data corresponding to the first $m$ confidences in the confidence sequence as the target data formats), the method further comprises:
S704: Generate supplementary data formats from the character data corresponding to the remaining confidences in the confidence sequence.
S705: Generate the data judgment rule of each description field of the target event according to the supplementary data formats corresponding to that description field.
This is because, in some fields, a data format with a very small confidence may still be normal data that simply occurs infrequently. For flight numbers, for example, the numbering rules for domestic flights differ from those for international flights, so their composition formats also differ; but because some airports handle few international flights, the confidence of the corresponding format is very small.
Thus, in this embodiment, re-identifying the remaining low-confidence formats in the confidence sequence allows further normal data formats to be determined as a supplement, which further improves the accuracy of the data judgment rule.
As a possible embodiment of the present invention, after S600, the method further includes:
s601: acquiring multiple historical timing vectors C of a target event 1 ,C 2 ,…,C p ,…,C q . Wherein C is p =(D 1 ,D 2 ,…,D r ),C p Is the p-th historical timing vector. p=1, 2, …, q. q is the total number of historical timing vectors for the target event. D (D) r Is the time interval between the (r) th and (r+1) th running nodes of the target event.
Specifically, take boarding start time 9:00, boarding end time 9:30, door closing time 9:40 and take-off time 9:56 as the operation nodes:
the timing vector corresponding to these operation nodes is (30, 10, 16). A large number of historical timing vectors can thus be derived from historically accumulated data.
S602: clustering the plurality of historical timing vectors to generate a plurality of timing groups.
The clustering can be performed using existing clustering methods, ultimately generating a plurality of time series groups. The number of clusters can be set by a person, e.g. 5.
S603: and obtaining a time sequence vector to be detected corresponding to the event to be detected. The event to be detected and the target event are the same type of event.
S604: if the time sequence vector to be detected belongs to any time sequence group, performing secondary judgment on the time sequence vector to be detected.
The data may be roughly divided by clustering based on the similarity of the data. A large number of historical time sequence vectors approximately determine each time sequence group, a preliminary abnormal judgment condition can be formed, and if a new time sequence vector to be detected belongs to any time sequence group, more strict secondary judgment is carried out. If the time sequence vector to be detected does not belong to the time sequence vector to be detected, the time sequence vector to be detected can be rapidly determined to belong to the abnormality.
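The preliminary judgment of S602 to S604 can be sketched with a small k-means routine. The source leaves both the clustering method and the membership test open, so everything here is an assumption: plain k-means initialised with the first `k` vectors, and a vector counted as belonging to a group when its distance to the group centroid does not exceed the distance of the group's farthest member. The names `kmeans` and `belongs_to_any_group` are hypothetical.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain k-means; centroids start at the first k vectors (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: math.dist(v, centroids[c]))
            groups[nearest].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a group becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, groups


def belongs_to_any_group(vector, centroids, groups):
    """True if the vector lies within the radius of at least one timing group."""
    for centroid, members in zip(centroids, groups):
        if not members:
            continue
        radius = max(math.dist(m, centroid) for m in members)
        if math.dist(vector, centroid) <= radius:
            return True
    return False


history = [(30, 10, 16), (31, 10, 15), (29, 11, 16),
           (60, 20, 30), (61, 19, 31), (59, 21, 30)]
centroids, groups = kmeans(history, k=2)
print(belongs_to_any_group((30, 11, 15), centroids, groups))  # inside a group
print(belongs_to_any_group((5, 90, 2), centroids, groups))    # far from both
```

A vector that passes this test would then go on to the secondary judgment; one that fails is treated as abnormal immediately.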
The secondary determination includes:
s614: and obtaining a standard time sequence vector corresponding to the time sequence vector to be detected. The correlation degree of the event corresponding to the standard time sequence vector and the event to be detected is larger than a correlation threshold value.
The relevance in this step may be determined by the same point between different flights. If the flight number, departure place and arrival place of the event corresponding to the standard time sequence vector are the same, the correlation degree between the event corresponding to the standard time sequence vector and the event to be detected is larger than the correlation threshold value. That is, the standard timing vector is the historical data of the event to be tested, and the similarity between the data of the same flight is higher, so that the referential property is also stronger.
S624: and generating the offset degree of the time sequence vector to be detected according to the time sequence vector to be detected and the corresponding standard time sequence vector. The degree of offset satisfies the following condition:
Figure BDA0004140474610000081
wherein E is s For the s-th time sequence direction to be measuredOffset of the amount. t is t sj And the time interval between the jth and the (j+1) th operation nodes in the event corresponding to the(s) th time sequence vector to be detected is set. T (T) sj And the time interval between the j and j+1th operation nodes in the event corresponding to the standard time sequence vector corresponding to the s time sequence vector to be detected is set. j=1, 2, …, r.
S634: if the offset of the time sequence vector to be detected is larger than the offset threshold value, determining the time sequence data of the event to be detected as abnormal data.
The offset threshold may be 0.8.
In this embodiment, the data to be tested can be primarily and rapidly determined through a plurality of time sequence groups, and the anomaly determination can be more accurately performed through the secondary determination. Thus, the judgment efficiency can be considered while the accuracy is ensured.
As a possible embodiment of the present invention, S603: obtaining a time sequence vector to be detected corresponding to an event to be detected, comprising:
s613: acquiring time sequence data F of event to be detected 1 ,F 2 ,…,F h ,…,F r+1 . Wherein F is h The time of the h operating node of the event to be detected. h=1, 2, …, r+1.r+1 is the total number of running nodes of the event to be tested.
S623: according to F 1 ,F 2 ,…,F h ,…,F r+1 Generating a time sequence vector (f) to be detected corresponding to the event to be detected 1 ,f 2 ,…,f h ,…,f r ). Wherein f h Is the h element of the timing vector to be measured. f (f) h The following conditions are satisfied: f (f) h =F h+1 -F h
Specifically, take as an example an event to be detected whose time sequence data comprises the operation nodes boarding start time 9:00, boarding end time 9:30, door closing time 9:40 and take-off time 9:56:
the time sequence vector corresponding to these operation nodes is (30, 10, 16), with each interval expressed in minutes.
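The relation f_h = F_{h+1} - F_h can be illustrated with a minimal Python sketch. The function name `timing_vector`, the "%H:%M" input format and the choice of minutes as the unit are assumptions made for illustration; the patent does not prescribe an implementation.

```python
from datetime import datetime


def timing_vector(times, fmt="%H:%M"):
    """Return the intervals, in minutes, between consecutive operation nodes.

    `times` holds the r+1 node times F_1..F_{r+1}; the result holds the
    r differences f_h = F_{h+1} - F_h. All times are assumed to fall on
    the same day (an assumption of this sketch).
    """
    stamps = [datetime.strptime(t, fmt) for t in times]
    return tuple(
        int((later - earlier).total_seconds() // 60)
        for earlier, later in zip(stamps, stamps[1:])
    )


# Boarding start, boarding end, door closing, take-off
print(timing_vector(["9:00", "9:30", "9:40", "9:56"]))  # (30, 10, 16)
```

Four node times yield a vector of three elements, matching r+1 nodes and r intervals.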
As a possible embodiment of the present invention, S601, acquiring a plurality of historical time sequence vectors of the target event, comprises:
S611: Determine the target time sequence fields from the plurality of description fields of the target event. There are a plurality of target time sequence fields.
S621: Take the field data in each target time sequence field that conforms to the corresponding target data format as the target time sequence data.
S631: Generate a plurality of historical time sequence vectors of the target event according to the target time sequence data.
In this embodiment, only field data that conforms to the target data format is selected when building the historical time sequence vectors used for rule extraction. In other words, the field data whose format accounts for a large proportion of each field is selected, so the selected data is essentially the common data of the corresponding description field, and interference from the small amount of abnormal data is removed. The common classification can therefore be obtained more accurately from this data.
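The selection of field data by target data format can be sketched in Python as follows. This is an illustrative sketch only: the marks "D", "L" and "C" standing for the first, second and third character identifiers, the rule of keeping other characters unchanged, the 0.8 confidence threshold and all function names are assumptions, not values taken from the patent.

```python
from collections import Counter


def char_pattern(value):
    """Convert field data to character data, one mark per character."""
    marks = []
    for ch in value:
        if ch.isdigit():
            marks.append("D")  # assumed first character identifier (number)
        elif "\u4e00" <= ch <= "\u9fff":
            marks.append("C")  # assumed third character identifier (Chinese character)
        elif ch.isalpha():
            marks.append("L")  # assumed second character identifier (letter)
        else:
            marks.append(ch)   # assumption: other characters are kept as-is
    return "".join(marks)


def target_formats(values, threshold=0.8):
    """Return the most frequent patterns whose cumulative confidence
    first exceeds `threshold` (confidence = pattern count / total count)."""
    counts = Counter(char_pattern(v) for v in values)
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for pattern, count in ranked:
        selected.append(pattern)
        cumulative += count / total
        if cumulative > threshold:
            break
    return selected


def conforming(values, formats):
    """Keep only the field data whose pattern is a target data format."""
    allowed = set(formats)
    return [v for v in values if char_pattern(v) in allowed]


sample = ["9:00", "9:30", "9:40", "9:56", "8:15",
          "7:05", "6:45", "5:10", "4:20", "n/a"]
formats = target_formats(sample)
print(formats)                      # ['D:DD']
print(conforming(sample, formats))  # every value except 'n/a'
```

In the sample, nine of ten values share the pattern "D:DD" (confidence 0.9), so that pattern alone exceeds the assumed threshold and the outlier "n/a" is filtered out.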
As a possible embodiment of the present invention, after the target time sequence data is obtained, the method further includes:
S700: Remove noise data from the target time sequence data. The noise data is record data that contains an empty set, that is, a target time sequence field without a value.
Specifically, take as an example a record in the target time sequence data with four target time sequence fields: boarding start time 9:00, boarding end time empty (rendered as an image in the original, Figure BDA0004140474610000091), door closing time 9:40 and take-off time 9:56. Because the boarding end time is empty (Figure BDA0004140474610000092 in the original), this record needs to be deleted.
In this embodiment, record data with a blank value in any target time sequence field is removed, which ensures that every target time sequence field of the finally obtained target time sequence data holds a value. Removing the empty-set data through this denoising step further improves the usability of the finally obtained target time sequence data.
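The denoising step can be sketched in Python as follows. The dictionary record layout, the field names and the treatment of `None` and blank strings as "empty" are assumptions made for illustration.

```python
# Hypothetical names for the four target time sequence fields.
TIMING_FIELDS = ("boarding_start", "boarding_end", "door_close", "takeoff")


def is_empty(value):
    """Treat None and blank strings as empty (an assumption of this sketch)."""
    return value is None or str(value).strip() == ""


def remove_noise(records, fields=TIMING_FIELDS):
    """Drop every record in which any target time sequence field is empty."""
    return [
        record for record in records
        if not any(is_empty(record.get(field)) for field in fields)
    ]


records = [
    {"boarding_start": "9:00", "boarding_end": "9:30",
     "door_close": "9:40", "takeoff": "9:56"},
    {"boarding_start": "9:00", "boarding_end": None,
     "door_close": "9:40", "takeoff": "9:56"},
]
print(len(remove_noise(records)))  # 1
```

Only the first record survives, because the second has no boarding end time.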
Embodiments of the present invention also provide a non-transitory computer readable storage medium that may be disposed in an electronic device to store at least one instruction or at least one program for implementing one of the method embodiments, the at least one instruction or the at least one program being loaded and executed by a processor to implement the methods provided by the embodiments described above.
Embodiments of the present invention also provide an electronic device comprising a processor and the aforementioned non-transitory computer-readable storage medium.
Embodiments of the present invention also provide a computer program product comprising program code for causing an electronic device to carry out the steps of the method according to the various exemplary embodiments of the invention described in the present specification when the program product is run on the electronic device.
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A rule extraction method based on offline data, the method comprising the steps of:
acquiring initial data sets A_1, A_2, …, A_i, …, A_z of a plurality of description fields of a target event; wherein A_i is the initial data set corresponding to the i-th description field; i = 1, 2, …, z; z is the total number of description fields of the target event; each of the initial data sets includes at least one corresponding field data;
performing character type conversion processing on each field data to generate character data corresponding to each field data; each of the initial data sets includes at least one type of character data;
generating, according to the character data corresponding to the field data contained in each initial data set, character confidence sets B_1, B_2, …, B_i, …, B_z corresponding to the initial data sets, B_i = {A_i1, A_i2, …, A_in, …, A_{i f(A_i)}}; wherein B_i is the character confidence set corresponding to A_i; A_in is the confidence corresponding to the n-th type of character data in A_i; n = 1, 2, …, f(A_i); f(A_i) is the total number of types of character data in A_i; A_in satisfies the following condition:
A_in = Y_in / Y_i; wherein Y_in is the total number of character data of the n-th type in A_i; Y_i is the total number of all character data in A_i;
generating a data judgment rule of a description field corresponding to each initial data set of the target event according to the confidence coefficient distribution condition in the character confidence coefficient set corresponding to each initial data set;
the character type conversion process includes:
splitting each character forming the field data by using a split function to generate a plurality of independent characters;
if the independent character is a number, marking the independent character as a first character mark;
if the independent character is a letter, marking as a second character mark;
if the independent character is a Chinese character, marking the independent character as a third character mark;
the character identifiers corresponding to the independent characters are spliced into character data corresponding to the field data; the character identifiers comprise a first character identifier, a second character identifier and a third character identifier.
2. The method of claim 1, wherein generating the data decision rule for the description field corresponding to each initial dataset of the target event according to the confidence distribution in the character confidence set corresponding to each initial dataset comprises:
the confidence degrees in the character confidence coefficient sets are ordered in a descending order, and a confidence coefficient sequence is generated;
taking character data corresponding to the first m confidence degrees in the confidence degree sequence as a target data format; m is the number of confidence coefficients when the confidence coefficient accumulation sum in the confidence coefficient sequence is larger than a first confidence threshold value for the first time;
and generating a data judgment rule of each description field of the target event according to the target data format corresponding to each description field.
3. The method according to claim 2, wherein after character data corresponding to the first m confidence levels in the confidence level sequence, respectively, is used as the target data format, the method further comprises:
generating a supplementary data format according to the character data respectively corresponding to the rest multiple confidence degrees in the confidence coefficient sequence;
and generating a data judgment rule of each description field of the target event according to the complementary data format corresponding to each description field.
4. A method according to claim 3, wherein after generating the data decision rule for the description field corresponding to each initial data set of the target event, the method further comprises:
judging the newly added field data of the corresponding description field according to the data judgment rule corresponding to each description field;
if the character data of the newly added field data is different from any character data existing in the corresponding data judging rule, judging that the newly added field data is abnormal data.
5. The method of claim 2, wherein after generating the character confidence set for each initial data set, the method further comprises:
acquiring a plurality of historical time sequence vectors C_1, C_2, …, C_p, …, C_q of the target event; wherein C_p = (D_1, D_2, …, D_r), C_p is the p-th historical time sequence vector; p = 1, 2, …, q; q is the total number of historical time sequence vectors of the target event; D_r is the time interval between the r-th and (r+1)-th operation nodes of the target event;
clustering the plurality of historical timing vectors to generate a plurality of timing groups;
acquiring a time sequence vector to be detected corresponding to an event to be detected; the event to be detected and the target event are the same type of event;
if the time sequence vector to be detected belongs to any time sequence group, performing secondary judgment on the time sequence vector to be detected;
the secondary determination includes:
obtaining a standard time sequence vector corresponding to the time sequence vector to be detected; the correlation degree of the event corresponding to the standard time sequence vector and the event to be detected is larger than a correlation threshold value;
generating the offset of the time sequence vector to be detected according to the time sequence vector to be detected and the corresponding standard time sequence vector; the degree of offset satisfies the following condition:
[Equation image in the original: Figure FDA0004140474590000021]
wherein E_s is the offset degree of the s-th time sequence vector to be detected; t_sj is the time interval between the j-th and (j+1)-th operation nodes in the event corresponding to the s-th time sequence vector to be detected; T_sj is the time interval between the j-th and (j+1)-th operation nodes in the event corresponding to the standard time sequence vector of the s-th time sequence vector to be detected; j = 1, 2, …, r;
and if the offset degree of the time sequence vector to be detected is greater than an offset threshold value, determining the time sequence data of the event to be detected as abnormal data.
6. The method of claim 5, wherein the obtaining the timing vector to be measured corresponding to the event to be measured comprises:
acquiring time sequence data F_1, F_2, …, F_h, …, F_{r+1} of the event to be detected; wherein F_h is the time of the h-th operation node of the event to be detected; h = 1, 2, …, r+1; r+1 is the total number of operation nodes of the event to be detected;
generating, according to F_1, F_2, …, F_h, …, F_{r+1}, the time sequence vector to be detected (f_1, f_2, …, f_h, …, f_r) corresponding to the event to be detected; wherein f_h is the h-th element of the time sequence vector to be detected; f_h satisfies the following condition: f_h = F_{h+1} - F_h.
7. The method of claim 5, wherein the obtaining a plurality of historical timing vectors for the target event comprises:
determining a target time sequence field from a plurality of description fields of the target event; the target time sequence field is a plurality of;
taking field data which accords with a corresponding target data format in each target time sequence field as target time sequence data;
a plurality of historical timing vectors for the target event are generated based on the target timing data.
8. The method of claim 7, wherein after obtaining the target timing data, the method further comprises:
removing noise data in the target time sequence data; the noise data is record data corresponding to the empty set.
9. A non-transitory computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements a method of offline data-based rule extraction according to any one of claims 1 to 8.
10. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements an offline data-based rule extraction method according to any one of claims 1 to 8 when executing the computer program.
CN202310288192.XA 2023-03-22 2023-03-22 Rule extraction method based on offline data, storage medium and electronic equipment Active CN116402136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310288192.XA CN116402136B (en) 2023-03-22 2023-03-22 Rule extraction method based on offline data, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN116402136A true CN116402136A (en) 2023-07-07
CN116402136B CN116402136B (en) 2023-11-17

Family

ID=87017045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310288192.XA Active CN116402136B (en) 2023-03-22 2023-03-22 Rule extraction method based on offline data, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116402136B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4060716A (en) * 1975-05-19 1977-11-29 Rockwell International Corporation Method and apparatus for automatic abnormal events monitor in operating plants
JP2000105707A (en) * 1998-09-29 2000-04-11 Hitachi Ltd System for detection of abnormal state of processing program
JP2019101895A (en) * 2017-12-06 2019-06-24 国立大学法人大阪大学 Abnormality detector, abnormality detection method and program
CN111756560A (en) * 2019-03-26 2020-10-09 中移(苏州)软件技术有限公司 Data processing method, device and storage medium
FR3103039A1 (en) * 2019-11-07 2021-05-14 Electricite De France Detecting attacks using hardware performance counters
CN113987190A (en) * 2021-11-16 2022-01-28 全球能源互联网研究院有限公司 Data quality check rule extraction method and system
CN114022161A (en) * 2021-10-22 2022-02-08 重庆市清泽水质检测有限公司 Independent source tracing system, method and device for original recording data of LIMS (laser induced mass spectrometry) system based on block chain and storage medium
CN114092851A (en) * 2021-10-12 2022-02-25 甘肃欧美亚信息科技有限公司 Monitoring video abnormal event detection method based on time sequence action detection
CN114610957A (en) * 2022-03-16 2022-06-10 深圳希施玛数据科技有限公司 Data processing method, device, equipment and computer storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
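CHUNYU CHEN et al.: "Detection of Anomalous Crowd Behavior Based on the Acceleration Feature", DOI: 10.1109/JSEN.2015.2472960 *
Yan Hong (严宏) et al.: "Outlier detection for time series data based on heteroscedastic Gaussian processes", Journal of Computer Applications (《计算机应用》) *
Zhang Bo (张博) et al.: "Application of association rules from data mining in intrusion detection systems", Aeronautical Computing Technique (《航空计算技术》) *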

Also Published As

Publication number Publication date
CN116402136B (en) 2023-11-17

Similar Documents

Publication Publication Date Title
Steck et al. Bayesian belief networks for data mining
CN116306937B (en) Rule extraction method, medium and device based on time sequence offline data
CN107145516B (en) Text clustering method and system
EP2963553B1 (en) System analysis device and system analysis method
EP3608802A1 (en) Model variable candidate generation device and method
CN110633371A (en) Log classification method and system
CN113341919B (en) Computing system fault prediction method based on time sequence data length optimization
CN116402136B (en) Rule extraction method based on offline data, storage medium and electronic equipment
EP2492826A1 (en) High-accuracy similarity search system
CN106844765B (en) Significant information detection method and device based on convolutional neural network
CN113468418A (en) Intelligent policy data recommendation method and system
WO2012133941A1 (en) Method for matching elements in schemas of databases using bayesian network
CN104573095B (en) Extensive object identifying method based on Hadoop frames
JPH0535484A (en) Fault diagnostic method
CN116522171A (en) Electric power field fault analysis method and system based on big data
CN116542320A (en) Small sample event detection method and system based on continuous learning
CN112101780A (en) Airport scene operation comprehensive evaluation method based on structure entropy weight method
CN113205215A (en) Knowledge-based battlefield situation prediction method
CN113313352A (en) Safety monitoring method for hydrogen station, electronic equipment and storage medium
CN116244106B (en) Data detection method of civil aviation data, storage medium and electronic equipment
WO2008117015A1 (en) Method of comparing data sequences
CN108154179B (en) Data error detection method and system
CN110968631A (en) Vehicle fault warning method based on TBOX
CN112069374B (en) Identification method and device for multiple customer numbers of bank
CN114780756B (en) Entity alignment method and device based on noise detection and noise perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant