WO2023040300A1 - Data processing method, electronic device, storage medium, and program product - Google Patents

Data processing method, electronic device, storage medium, and program product

Info

Publication number
WO2023040300A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
data information
data
time period
frequency
Prior art date
Application number
PCT/CN2022/091576
Other languages
English (en)
French (fr)
Inventor
戴新宇
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2023040300A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Definitions

  • The embodiments of the present application relate to the technical field of data processing, and in particular to a data processing method, an electronic device, a storage medium, and a program product.
  • In the various stages of power generation, transmission, transformation, distribution, and utilization, the industry generally adopts a power monitoring (Supervisory Control And Data Acquisition, SCADA) system for management.
  • The log files (such as operation logs, security logs, and system logs) generated during the operation of the power monitoring system are often used for fault location analysis.
  • The difficulty of fault location is increasing day by day.
  • The Bayesian algorithm is commonly used to calculate the correlation between specific logs and faults using time-period filtering, which requires high computing power and processing capacity for back-end training and incurs a relatively large resource overhead.
  • For alarm monitoring, an experience database that maps alarms to fault types must be established, and business personnel need to invest a great deal of experience to compile it. This manual method is not only costly, but also leads to omissions and errors in the experience database because of human subjectivity and randomness.
  • Embodiments of the present application provide a data processing method, an electronic device, a storage medium, and a program product.
  • The embodiment of the present application provides a data processing method, including: acquiring a plurality of first data information; preprocessing the plurality of first data information to obtain second data information; determining a plurality of candidate data information from the plurality of first data information; and screening the plurality of candidate data information, according to the second data information and the plurality of candidate data information, to obtain target data information, where the target data information has the same data type as the second data information.
  • The embodiment of the present application also provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and operable on the processor, where the processor implements the above data processing method when executing the computer program.
  • The embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, where the computer-executable instructions are used to execute the data processing method as described above.
  • The embodiment of the present application further provides a computer program product, including a computer program or computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer program or the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the data processing method as described above.
  • FIG. 1 is a schematic diagram of the architecture of a power monitoring system for performing a data processing method provided by an embodiment of the present application
  • Fig. 2 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a data processing method provided by an embodiment of the present application.
  • Fig. 4 is the flowchart of the method of step S320 in Fig. 3;
  • Fig. 5 is a flow chart of a method of step S321 in Fig. 4;
  • Fig. 6 is a histogram of the corresponding relationship between the alarm number and the alarm frequency provided by an example of the present application.
  • Fig. 7 is a flowchart of another method of step S321 in Fig. 4;
  • Fig. 8 is a flowchart of the method of step S3213 in Fig. 7;
  • Fig. 9 is a flowchart of the method of step S330 in Fig. 3;
  • Fig. 10 is a flowchart of a method of step S340 in Fig. 3;
  • Fig. 11 is a flowchart of the method of step S342 in Fig. 10;
  • Fig. 12 is a flowchart of another method of step S340 in Fig. 3;
  • Fig. 13 is a schematic diagram of dividing target time periods provided by an example of the present application.
  • Fig. 14 is a flowchart of the method of step S346 in Fig. 12;
  • Fig. 15 is a flowchart of the method of step S348 in Fig. 12;
  • Fig. 16 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Embodiments of the present application provide a data processing method, electronic equipment, storage medium, and program product.
  • In the embodiments of the present application, a plurality of first data information is preprocessed to obtain second data information from among the plurality of first data information; a plurality of candidate data information is then determined from the plurality of first data information; and target data information having the same data type as the second data information is then screened from the plurality of candidate data information according to the second data information and the plurality of candidate data information. The desired target data information can therefore be obtained quickly without increasing the resource allocation of the power monitoring system.
  • FIG. 1 is a schematic diagram of an architecture of a power monitoring system for implementing a data processing method provided by an embodiment of the present application.
  • The power monitoring system architecture includes a system layer 100, a communication layer 200, and a device layer 300. The communication layer 200 is arranged between the system layer 100 and the device layer 300 and is communicatively connected to each of them.
  • The device layer 300 may include equipment such as power meters 310, environmental sensors 320, and actuators 330. It can collect the power parameters of various power equipment, such as power meters, power protection equipment, or bus couplers. It can also collect real-time sampled values from the various environmental sensors 320, such as the temperature value measured by a temperature sensor, the humidity value measured by a humidity sensor, or the mechanical quantity detected by a vibration sensor. The collected power parameters and sampled values are sent through the device-side protocol to the communication management machine 210 in the communication layer 200.
  • The device-side protocol refers to a series of shared protocols or manufacturers' private protocols, such as the Modbus protocol, the PROFINET protocol, the EtherNet/IP protocol, or the HSE protocol, which are not listed exhaustively here.
  • The communication layer 200 includes a communication management machine 210, which can also be called a collection gateway.
  • The communication layer 200 can receive the various data sent by the device layer 300, aggregate the received data, and then report the aggregated data to the system layer 100 through the system-side protocol, so the communication layer 200 plays a connecting role in the power monitoring system.
  • The system-side protocols include common protocols of the electric power industry and private protocols of manufacturers, and no specific limitation is made here.
  • The system layer 100 includes SCADA software 110 and a data processing device 120.
  • The SCADA software 110 can receive the data sent by the communication layer 200 and map the data to the corresponding equipment and points. At the same time, it can react based on a series of control strategies to control the actuators 330 in the power system and ensure the smooth operation of the power system.
  • The power monitoring system can be applied not only to power systems, but also to water supply systems, the petroleum or chemical industries, and other fields.
  • The embodiment of the present application provides a data processing device 120. The data processing device 120 includes at least an information acquisition module 121, a data preprocessing module 122, and an association identification module 123, wherein the data preprocessing module 122 is connected to the information acquisition module 121 and the association identification module 123 respectively, and the information acquisition module 121 is connected to the association identification module 123.
  • The information acquisition module 121 receives a plurality of first data information generated by the power monitoring system, and sends the plurality of first data information to the data preprocessing module 122 and the association identification module 123.
  • The data preprocessing module 122 preprocesses the received first data information to obtain multiple pieces of second data information, and sends the second data information to the association identification module 123.
  • The association identification module 123 determines a plurality of candidate data information from the plurality of first data information, screens the target data information having the same data type as the second data information from the plurality of candidate data information according to the second data information, and then sends the target data information to the power monitoring system for display.
  • The data preprocessing module 122 may include a log gateway, an alarm gateway, and the like, which are not specifically limited here.
  • The log gateway receives multiple pieces of log information generated by the power monitoring system, preprocesses them to obtain multiple pieces of low-frequency log information, and then transmits the low-frequency log information to the association identification module 123. The log information includes operation information, system information, security information, and the like, which are not specifically limited here.
  • The alarm gateway receives multiple pieces of alarm information from the power monitoring system, preprocesses them to obtain multiple pieces of low-frequency alarm information, and transmits the low-frequency alarm information to the association identification module 123.
  • The association identification module 123 selects the target data information from the multiple log information and the multiple alarm information according to the low-frequency log information pushed by the log gateway and the low-frequency alarm information pushed by the alarm gateway, and sends the target data information to the power monitoring system so that it can be displayed, assisting operation and maintenance personnel in locating the root cause.
  • Low-frequency log information is log information whose frequency is lower than or equal to the preset frequency threshold.
  • Low-frequency alarm information is alarm information whose frequency is lower than or equal to the preset frequency threshold. The preset frequency threshold can be selected appropriately according to the actual application situation, and no specific limitation is made here.
  • Both the low-frequency log information and the low-frequency alarm information belong to the second data information, the log information and the alarm information both belong to the first data information, and the second data information and the target data information have the same data characteristics.
  • The architecture of the power monitoring system shown in FIG. 1 and the data processing device shown in FIG. 2 do not constitute a limitation on the embodiments of the application, which may include more or fewer components, combine certain components, or arrange the components differently.
  • FIG. 3 is a flowchart of a data processing method provided by an embodiment of the present application.
  • The data processing method can be applied to a data processing device, such as the data processing device in the power monitoring system architecture shown in FIG. 1.
  • The data processing method may include, but is not limited to, steps S310, S320, S330, and S340.
  • Step S310: Obtain a plurality of first data information.
  • The data processing device may obtain a plurality of first data information generated by the power monitoring system. The first data information may include log information, alarm information, other data information, and the like, and no specific limitation is made here.
  • For example, when the first data information is alarm information, it may include the alarm name, the cause of the alarm failure, the alarm level, and the like; for another example, when the first data information is log information, it may include operation logs, security logs, system logs, and the like.
  • The first data information may be obtained by reading related data files generated by the power monitoring system, or by calling a query interface exposed by the power monitoring system.
  • Step S320: Preprocess the plurality of first data information to obtain second data information among the plurality of first data information.
  • In this step, since a plurality of first data information including the second data information is acquired in step S310, the data processing device preprocesses the acquired first data information to obtain multiple pieces of second data information, so that the subsequent steps can screen a plurality of candidate data information according to the second data information.
  • This step greatly reduces the computing power consumption of data processing, thereby reducing processing time, and at the same time is conducive to the integration and miniaturization of the power monitoring system, reducing the overall cost of the solution, and improving market competitiveness.
  • Step S330: Determine a plurality of candidate data information from the plurality of first data information.
  • The data processing device filters the plurality of first data information to obtain a plurality of candidate data information, which reduces the amount of data to be processed and saves processing time.
  • The candidate data information may include log information, alarm information, other data information, and the like, which is not specifically limited here.
  • For example, when the candidate data information is alarm information, it may include the alarm name, the cause of the alarm failure, the alarm level, and the like; for another example, when the candidate data information is log information, it may include operation logs, security logs, and system logs.
  • Step S340: According to the second data information and the plurality of candidate data information, screen the plurality of candidate data information to obtain the target data information, where the target data information has the same data type as the second data information.
  • In this step, after the candidate data information is determined in step S330, the target data information having the same data type as the second data information is selected from the plurality of candidate data information according to the data type of the second data information obtained in step S320. This reduces the amount of data processing by 80-90%, which shortens the calculation time and lowers the resource consumption of the power monitoring system, so that fault root-cause location can be deployed without increasing the resource allocation of the power monitoring system.
  • The candidate data information may have multiple different data types, and the first data information and the candidate data information have the same data types. Therefore, the data type of the second data information obtained from the first data information may belong to one or more of the data types of the candidate data information.
  • The data types may include a date-and-time type, a low-frequency type, a high-frequency type, and the like, which are not specifically limited in this embodiment.
  • Depending on the data type of the second data information, the association identification module applies a correspondingly different screening method.
  • For example, when the second data information is of the low-frequency data type, the association identification module screens the low-frequency candidate data information out of the candidate data information to obtain the target data information, while the high-frequency candidate data information is left unprocessed or discarded. For another example, when the data type of the second data information is the date-and-time type, the association identification module screens out the candidate data information with the same date and time, or the candidate data information with the same date but a different time, or the candidate data information with the same time but a different date, and finally obtains the target data information, which helps operation and maintenance personnel locate the root cause of the failure.
  • The low-frequency type refers to a frequency lower than or equal to the frequency threshold, and the high-frequency type refers to a frequency higher than the frequency threshold. The frequency threshold can be selected appropriately according to the actual application situation, and no specific limitation is made here.
  • In this embodiment, the data processing device obtains a plurality of first data information, preprocesses it to obtain the second data information, determines a plurality of candidate data information from the plurality of first data information, and then screens the candidate data information according to the data type of the second data information to obtain target data information of the same data type as the second data information. Finally, information push processing is performed on the target data information so that it is displayed. This embodiment can therefore quickly obtain the desired target data information without increasing the resource allocation of the power monitoring system, which helps operation and maintenance personnel locate the root cause.
  • Through the preprocessing of the plurality of first data information in step S320, the data processing time has been reduced to less than 5% of the original.
  • Step S320 is further described below; it may include, but is not limited to, steps S321 and S322.
  • Step S321: Perform frequency-based clustering processing on the plurality of first data information to obtain multiple cluster sets, where different cluster sets have different center frequencies.
  • In this step, the data processing device obtains multiple cluster sets with different center frequencies, so that the subsequent steps can obtain the second data information from these cluster sets.
  • Clustering can be done without manual setting, which avoids the tedium and subjectivity of manual configuration.
  • Clustering refers to dividing a data set into different classes or clusters according to a certain standard (such as distance), so that the similarity of data objects within the same cluster is as large as possible, while the difference between data objects in different clusters is also as large as possible. That is to say, after clustering, data of the same class are gathered together as much as possible, and data of different classes are separated as much as possible.
  • Frequency-based clustering means dividing the plurality of first data information into different cluster sets by frequency, so that the frequencies of the first data information in the same cluster set are as close as possible, while the frequency difference between first data information in different cluster sets is as large as possible.
  • Step S322: Determine, from the multiple cluster sets, the target cluster set whose center frequency is less than or equal to the frequency threshold, to obtain the second data information.
  • In this step, since multiple cluster sets are obtained in step S321, the target cluster set whose center frequency is less than or equal to the frequency threshold can be determined from them and the second data information obtained, so that the subsequent steps can screen a plurality of candidate data information according to the second data information.
  • The frequency threshold may be set manually, or may be set automatically by the power monitoring system according to the center frequencies of the multiple cluster sets, which is not specifically limited in this embodiment.
  • The multiple cluster sets can be divided into two groups: the low-frequency cluster set, whose center frequency is less than or equal to the frequency threshold (that is, the target cluster set), and the high-frequency cluster set, whose center frequency is greater than the frequency threshold.
  • The second data information is obtained from the target cluster set, while the high-frequency cluster set may be left unprocessed or discarded; no specific limitation is set here. When the high-frequency cluster set is discarded, according to the Pareto rule, the amount of data processing can be reduced by more than 90%, which greatly reduces the calculation load and lowers the CPU and memory requirements. It should also be noted that discarding the high-frequency first data information does not affect the screening of target data information in subsequent steps.
  • The cluster sets are shown in Table 1, which lists the frequency classification, the number of data information in each cluster set, and the proportion of the data information of each frequency type in the total data information.
  • In this embodiment, the data processing device performs frequency-based clustering processing on a plurality of first data information to obtain multiple cluster sets with different center frequencies, and then determines, among these cluster sets, the target cluster set whose center frequency is less than or equal to the frequency threshold to obtain the second data information. The second data information is low-frequency data.
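  • The selection in step S322 can be sketched as follows. This is a minimal illustration rather than the method of the application: the ClusterSet type, the select_target_sets helper, the sample center frequencies, and the threshold of 1000 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSet:
    """Hypothetical container for one frequency-based cluster set."""
    center_frequency: float   # center frequency of the cluster set
    members: List[str]        # identifiers of the first data information in it


def select_target_sets(cluster_sets: List[ClusterSet],
                       frequency_threshold: float) -> List[ClusterSet]:
    """Keep the cluster sets whose center frequency is <= the threshold (step S322)."""
    return [c for c in cluster_sets if c.center_frequency <= frequency_threshold]


# Illustrative values only; they do not come from the application.
cluster_sets = [
    ClusterSet(center_frequency=50000.0, members=["A"]),
    ClusterSet(center_frequency=400.0, members=["B", "C"]),
    ClusterSet(center_frequency=5.0, members=["D", "E", "F"]),
]
target_sets = select_target_sets(cluster_sets, frequency_threshold=1000.0)

# The second data information is the union of the members of the target sets.
second_data_information = [m for c in target_sets for m in c.members]
```

  • In this sketch the high-frequency set is simply dropped, which corresponds to the "discarded" option described for the high-frequency cluster set.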
  • Step S321 is further described below; it may include, but is not limited to, steps S3211 and S3212.
  • Step S3211: Perform frequency statistics on the alarm information according to the alarm number to obtain the alarm frequency of the alarm information.
  • In this step, the frequency of the alarm information can be counted according to the alarm number to obtain the alarm frequency of the alarm information, so that the subsequent steps can use the alarm frequency to cluster the alarm information.
  • A two-dimensional table with the two fields of alarm number and frequency can be established from the correspondence between alarm number and alarm frequency, to facilitate the clustering of the alarm information in the subsequent steps.
  • For example, the value_counts() method of Python's pandas library is used to count occurrences by alarm number and sort them in descending order.
  • The two-dimensional table with the two fields of alarm number and frequency is shown in Table 2 below. This table includes alarm codes and alarm frequencies.
  • FIG. 6 is a histogram corresponding to Table 2.
  • The alarm codes include the first alarm information (2114060448), the second alarm information (2114322696), the third alarm information (12596994), and the fourth alarm information (12611841).
  • The alarm information corresponding to the fifth alarm information (2114060402), the sixth alarm information (12596992), the seventh alarm information (2114322678), and the eighth alarm information (2121662481) is low-frequency alarm information, and it can be clearly observed from Fig. 6 that the frequency distribution of different types of alarms is unbalanced.
  • Table 2
    Alarm code | Alarm frequency
    First alarm information (2114060448) | 72761
    Second alarm information (2114322696) | 7721
    Third alarm information (12596994) | 5141
    Fourth alarm information (12611841) | 2085
    Fifth alarm information (2114060402) | 918
    Sixth alarm information (12596992) | 646
    Seventh alarm information (2114322678) | 10
    Eighth alarm information (2121662481) | 1
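  • The frequency statistics of step S3211 can be sketched with the Python standard library. The text names the pandas value_counts() method; collections.Counter.most_common() is used here as a dependency-free stand-in that likewise returns counts in descending order. The alarm_numbers sample is invented for illustration; only the alarm codes themselves are taken from Table 2.

```python
from collections import Counter

# Hypothetical stream of alarm numbers, one entry per raised alarm.
alarm_numbers = [
    2114060448, 2114322696, 2114060448, 12596994,
    2114060448, 2114322696, 2114060448, 2121662481,
]

# Count occurrences per alarm number and sort by frequency, highest first.
alarm_frequency = Counter(alarm_numbers).most_common()

# Two-field table: (alarm number, alarm frequency).
for number, frequency in alarm_frequency:
    print(f"{number}\t{frequency}")
```

  • With pandas available, pd.Series(alarm_numbers).value_counts() would produce the same counts in the same descending order.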
  • Step S3212: Perform clustering processing on all alarm information according to alarm frequency to obtain multiple cluster sets.
  • In this step, since the alarm frequency of the alarm information is obtained in step S3211, all the alarm information can be clustered according to the alarm frequency to obtain multiple cluster sets, so that the subsequent steps can determine the target cluster set from them.
  • A clustering method is used on the alarm frequency of the alarm information, instead of a manually configured classification threshold, mainly to make the algorithm processing more end-to-end, to avoid the impact of human subjective judgment on the algorithm, and to reduce the configuration workload of operation and maintenance personnel.
  • There are many clustering methods that can be used for the clustering processing, such as the k-means algorithm, the K-means++ algorithm, or the bi-kmeans algorithm, which is not specifically limited in this embodiment.
  • In this embodiment, the data processing device performs frequency statistics on all alarm information according to the alarm numbers to obtain the alarm frequency corresponding to each piece of alarm information, and then performs clustering processing on all alarm information according to the alarm frequency to obtain multiple cluster sets of alarm information.
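  • As one possible reading of step S3212, the sketch below runs a plain two-cluster k-means on the raw alarm frequencies of Table 2. The text lists k-means only as one option and fixes neither the number of clusters nor the feature scaling, so k = 2, the use of raw (unscaled) frequencies, the min/max initialization, and the kmeans_1d helper are all assumptions of this example.

```python
from typing import List, Tuple


def kmeans_1d(values: List[float],
              iterations: int = 100) -> Tuple[List[float], List[List[float]]]:
    """Two-cluster k-means on scalar values; returns (centers, clusters)."""
    # Deterministic start: the smallest and largest values (an assumption of this sketch).
    centers = [float(min(values)), float(max(values))]
    clusters: List[List[float]] = [[], []]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[], []]
        for v in values:
            nearest = min(range(2), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Alarm frequencies taken from Table 2.
alarm_frequencies = [72761, 7721, 5141, 2085, 918, 646, 10, 1]
centers, clusters = kmeans_1d(alarm_frequencies)
low_center, high_center = centers
low_cluster, high_cluster = clusters
```

  • Under these assumptions only the most frequent alarm code lands in the high-frequency cluster; a different number of clusters or a scaled frequency would split the codes differently, which is why the choice is left open in the text.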
  • Step S321 is further described below; it may alternatively include, but is not limited to, steps S3213 and S3214.
  • Step S3213: Perform frequency statistics on the log information to obtain the log frequency of the log information.
  • In this step, when multiple cluster sets are to be obtained, frequency statistics can be performed on the log information to obtain its log frequency, so that the subsequent steps can use the log frequency to cluster the log information.
  • Step S3214: Perform clustering processing on all log information according to log frequency to obtain multiple cluster sets.
  • In this step, since the log frequency of the log information is obtained in step S3213, all the log information can be clustered according to the log frequency to obtain multiple cluster sets, so that the subsequent steps can determine the target cluster set from them.
  • A clustering method is used on the log frequency of the log information, instead of a manually configured classification threshold, mainly to make the algorithm processing more end-to-end, to avoid the influence of human subjective judgment on the algorithm, and to reduce the configuration workload of operation and maintenance personnel.
  • There are many methods for clustering log information, which are not specifically limited here, such as methods using MapReduce parallel technology, the LCS-based Chameleon real-time log clustering method, and the nearest-neighbor chain hierarchical clustering algorithm.
  • In this embodiment, the data processing device performs frequency statistics on all log information to obtain the log frequency of each piece of log information, and then performs clustering processing on all log information according to the log frequency to obtain multiple cluster sets of log information.
  • step S3213 is further described, and step S3213 may include but not limited to step S32131 , step S32132 and step S32133 .
  • Step S32131: Perform variable substitution processing on the log information to obtain alternative information.
  • In this step, variable substitution processing may be performed on the log information to obtain alternative information, so that subsequent steps can use the alternative information to obtain the corresponding mapping information.
  • It should be noted that variable substitution processing may be performed on the log information in many ways, which are not specifically limited here.
  • For example, variable substitution is performed on the log information based on regular expressions: specific IP addresses, port numbers and times in the log information are replaced with strings such as $IP, $IPPort and $DateTime to obtain the alternative information.
  • A regular expression describes a string-matching pattern, which can be used to check whether a string contains a certain substring, to replace a matched substring, or to extract from a string a substring that meets a certain condition.
  • Variables may refer to times, signed integers, floating-point numbers, special characters and so on, depending on actual conditions, and are not specifically limited here.
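The regular-expression substitution described in step S32131 can be sketched as follows. This is a minimal illustration, not the implementation of the application: the three patterns, the function name `substitute_variables`, and the reading of `$IPPort` as an "IP:port" pair are assumptions made for the example; only the placeholder strings `$IP`, `$IPPort` and `$DateTime` come from the text.

```python
import re

# Hypothetical patterns (assumptions for illustration). Order matters:
# an "IP:port" pair must be replaced before a bare IP address.
_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\b"), "$DateTime"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}\b"), "$IPPort"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "$IP"),
]


def substitute_variables(line: str) -> str:
    """Replace variable fields of a log line with fixed placeholder strings."""
    for pattern, placeholder in _PATTERNS:
        line = pattern.sub(placeholder, line)
    return line


if __name__ == "__main__":
    print(substitute_variables(
        "2021-09-14 10:00:01 connect from 10.1.2.3:8080 to 10.1.2.4 failed"))
```

Two log lines that differ only in their variable fields reduce to the same template string, which is what allows them to be counted together in the later frequency statistics.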
  • Step S32132: Perform mapping processing on the alternative information to obtain mapping information.
  • In this step, since the alternative information is obtained in step S32131, the mapping information corresponding to the alternative information is obtained through mapping processing, so that subsequent steps can use the mapping information to perform frequency statistics on the log information.
  • It should be noted that there are many methods for mapping the alternative information, which are not specifically limited here.
  • For example, a hash function is used to encode the alternative information into a fixed-length character string to obtain the mapping information.
  • As another example, a general-purpose string-matching function with a fixed-length encoding format is established in the power monitoring system, and the alternative information is mapped by calling this function.
  • A hash function is a commonly used fixed-length encoding function; it encodes quickly, has good collision resistance, and is widely used.
  • For example, the hexdigest() method of a hash object from Python's hashlib library is used for encoding to obtain a fixed-length string.
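The hash-based mapping of step S32132 can be sketched in a few lines. The text names only `hexdigest()` from `hashlib`; the choice of SHA-256 and the function name `encode_template` are assumptions made for this example.

```python
import hashlib


def encode_template(template: str) -> str:
    """Map a template string to a fixed-length hexadecimal code.

    SHA-256 is an assumption; the text only requires a fixed-length hash.
    """
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    code = encode_template("$DateTime connect from $IPPort to $IP failed")
    print(len(code), code)
```

Whatever the length of the input template, the resulting code has the same length (64 hexadecimal characters for SHA-256), so it can serve as a compact key for counting.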
  • Step S32133: Perform frequency statistics on the log information according to the mapping information to obtain the log frequency of the log information.
  • In this step, since the mapping information is obtained in step S32132, frequency statistics can be performed on the log information according to the mapping information to obtain the log frequency of the log information, so that the log information can be clustered in the subsequent step S3214.
  • In this embodiment, the data processing device performs variable substitution processing on the log information to obtain alternative information, then performs mapping processing on the alternative information to obtain mapping information, and then performs frequency statistics on the log information according to the mapping information to finally obtain the log frequency of the log information.
  • It should be noted that a mapping relationship between the mapping information and the log frequency can be established so that the subsequent step S3214 can cluster the log information; this can be chosen according to the actual situation.
  • For example, variable substitution is performed on the log information based on a regular expression to obtain alternative information; a hash function is used to encode the alternative information into a fixed-length string to obtain the mapping information; and the log information is then counted by mapping information to form a two-dimensional table with two fields, log code and frequency. The value_counts() method of Python's pandas library can be used to aggregate by log code and sort the result in descending order.
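The frequency statistics of step S32133 can be sketched as follows. The text names the pandas `value_counts()` method; to keep the example dependency-free, this sketch substitutes `collections.Counter` from the standard library, which also yields (code, frequency) rows in descending order of frequency. The function name `count_log_codes` and the sample codes are hypothetical.

```python
from collections import Counter


def count_log_codes(codes):
    """Return (log_code, frequency) rows sorted by descending frequency."""
    return Counter(codes).most_common()


if __name__ == "__main__":
    # Hypothetical log codes; in practice these would be the hash codes
    # obtained from the mapping step.
    codes = ["a1", "b2", "a1", "c3", "a1", "b2"]
    for code, freq in count_log_codes(codes):
        print(code, freq)
```

With pandas available, `pandas.Series(codes).value_counts()` would give an equivalent table, since it also sorts by descending count by default.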
  • In an embodiment, step S330 is further described. Step S330 may include, but is not limited to, step S331 and step S332.
  • Step S331: Determine the target time period.
  • In this step, the target time period can be determined first, so that subsequent steps can determine, from the plurality of first data information, a plurality of candidate data information within the target time period.
  • There are many ways to determine the target time period, which are not specifically limited here. For example, the time at which the fault occurs is set as the end time of the target time period, and the start time of the target time period is determined from the length of the overall filtering time period, thereby determining the target time period.
  • Step S332: Determine, from the plurality of first data information, a plurality of candidate data information within the target time period.
  • In this step, since the target time period is determined in step S331, the plurality of first data information can be screened by the target time period to obtain the candidate data information, which reduces the amount of data to be processed.
  • In this embodiment, the target time period is determined first, and then a plurality of candidate data information within the target time period is determined from the plurality of first data information, which greatly reduces the amount of data to be processed and shortens the wait before the target data information is pushed.
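Steps S331 and S332 can be sketched as follows, using the example given in the text (the fault time is the end of the target time period, and the start is derived from the overall filtering length). The record shape `(timestamp, data_type)`, the function names, and the inclusive boundaries are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def target_period(fault_time, total_length):
    """S331: the fault time is the end; the start is derived from the length."""
    return fault_time - total_length, fault_time


def select_candidates(records, start, end):
    """S332: keep records whose timestamp lies in [start, end].

    `records` is an iterable of (timestamp, data_type) tuples (hypothetical shape).
    """
    return [r for r in records if start <= r[0] <= end]


if __name__ == "__main__":
    fault = datetime(2021, 9, 14, 12, 0)
    start, end = target_period(fault, timedelta(hours=12))
    records = [
        (datetime(2021, 9, 13, 23, 0), "A"),  # before the window, dropped
        (datetime(2021, 9, 14, 6, 0), "B"),
        (fault, "C"),
    ]
    print(select_candidates(records, start, end))
```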
  • In an embodiment, step S340 is further described. Step S340 may include, but is not limited to, step S341, step S342, step S343 and step S344.
  • Step S341: Divide the target time period according to a preset time length to obtain two or more filtering time periods.
  • In this step, when the target data information needs to be obtained, the target time period can be divided according to the preset time length to obtain two or more filtering time periods, so that subsequent steps can screen the candidate data information by filtering time period.
  • It should be noted that multiple preset time lengths may be used, and they may differ, for example 2 hours, 6 hours or 12 hours.
  • Accordingly, the lengths of the two or more filtering time periods may also differ. Performing correlation filtering with different preset time lengths avoids the omissions that filtering with a single preset time length might cause.
  • With three filtering time periods, the correlation filtering efficiency is lower than with two filtering time periods but higher than with more than three filtering time periods.
  • Step S342: Screen the candidate data information in each of the two or more filtering time periods to obtain a first information set for each filtering time period.
  • In this step, since two or more filtering time periods are obtained in step S341, the candidate data information in each filtering time period can be screened separately to obtain the first information set of each filtering time period, so that subsequent steps can deduplicate the first information set of each filtering time period.
  • Step S343: Deduplicate the first information set of each filtering time period to obtain a second information set for each filtering time period.
  • In this step, since the first information sets are obtained in step S342, the first information set of each filtering time period is deduplicated to obtain the second information set of each filtering time period, so that subsequent steps can take the union of all the second information sets to obtain the target data information.
  • Step S344: Take the union of all the second information sets to obtain the target data information.
  • In this step, since the second information sets are obtained in step S343, the union of all the second information sets can be taken to obtain the target data information, so that subsequent steps can push the target data information for display and thereby assist operation and maintenance personnel in locating the root cause of the fault.
  • In this embodiment, by adopting the data processing method including steps S341 to S344, the target time period is first divided according to the preset time length to obtain two or more filtering time periods; then, according to the second data information, the candidate data information in each filtering time period is screened to obtain the first information set of each filtering time period; next, the first information set of each filtering time period is deduplicated to obtain the second information set of each filtering time period; finally, the union of all the second information sets is taken to obtain the target data information.
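Steps S341 to S344 can be sketched as follows. This is an illustration under stated assumptions: each candidate is modelled as a `(timestamp, data_type)` tuple, where `data_type` is an identifier such as an alarm number or log code; `reference_types` stands for the data types of the second data information; filtering time periods are treated as half-open intervals; and all function names are hypothetical.

```python
from datetime import datetime, timedelta


def split_period(start, end, length):
    """S341: divide [start, end) into consecutive periods of at most `length`."""
    periods = []
    cursor = start
    while cursor < end:
        nxt = min(cursor + length, end)
        periods.append((cursor, nxt))
        cursor = nxt
    return periods


def screen_by_periods(candidates, reference_types, periods):
    """S342 to S344: per-period screening, deduplication, then union."""
    second_sets = []
    for p_start, p_end in periods:
        # S342: first information set of this filtering time period
        first_set = [t for ts, t in candidates
                     if p_start <= ts < p_end and t in reference_types]
        # S343: deduplicate to obtain the second information set
        second_sets.append(set(first_set))
    # S344: union of all second information sets
    return set().union(*second_sets)


if __name__ == "__main__":
    start = datetime(2021, 9, 14, 0, 0)
    end = datetime(2021, 9, 14, 12, 0)
    periods = split_period(start, end, timedelta(hours=6))
    candidates = [
        (datetime(2021, 9, 14, 1, 0), "A"),
        (datetime(2021, 9, 14, 2, 0), "A"),
        (datetime(2021, 9, 14, 3, 0), "B"),
        (datetime(2021, 9, 14, 7, 0), "C"),
        (datetime(2021, 9, 14, 8, 0), "A"),
    ]
    print(screen_by_periods(candidates, {"A", "C"}, periods))
```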
  • It should be noted that the data information in the first information sets has the same data type as the second data information, and that screening the candidate data information in the two or more filtering time periods based only on the second data information does not affect the subsequent filtering of target data information that is strongly related to the fault.
  • Bayes' formula can be used to show this, as follows:
  • Let event B be the occurrence of high-frequency data information and event A be the occurrence of a fault; P(A|B) is then the probability of a fault given that high-frequency data information occurs.
  • Bayes' formula is P(A|B) = P(B|A) · P(A) / P(B). The formula shows that the smaller P(B) is, the larger P(A|B) is. High-frequency data information appears in essentially every filtering time period, that is, P(B) ≈ 1 (100%), so the corresponding value of P(A|B) is small.
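The effect of P(B) in Bayes' formula can be illustrated with a small numerical sketch. All probabilities below are invented for illustration and do not come from the application; the function name `posterior` is likewise hypothetical.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' formula: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


if __name__ == "__main__":
    p_fault = 0.01  # assumed prior probability of a fault, P(A)

    # High-frequency data information: present in essentially every
    # filtering time period, so P(B) is close to 1.
    print(posterior(p_b_given_a=1.0, p_a=p_fault, p_b=1.0))

    # Low-frequency data information: rare overall (P(B) = 0.02) but
    # usually present when a fault occurs (P(B|A) = 0.9).
    print(posterior(p_b_given_a=0.9, p_a=p_fault, p_b=0.02))
```

Under these assumed numbers the high-frequency event leaves the fault probability at its prior (about 1%), while the low-frequency event raises it to about 45%, which matches the qualitative argument for discarding high-frequency data information.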
  • In an embodiment, step S342 is further described. Step S342 may include, but is not limited to, step S3421 and step S3422.
  • Step S3421: Traverse the candidate data information in the two or more filtering time periods, and screen out, in each filtering time period, the first candidate data information having the same data type as the second data information.
  • In this step, all candidate data information in all filtering time periods can be traversed to screen out, in each filtering time period, the first candidate data information having the same data type as the second data information, so that subsequent steps can collect the first candidate data information of each filtering time period.
  • Step S3422: Collect the first candidate data information of each filtering time period to obtain the first information set of each filtering time period.
  • In this step, since the first candidate data information of each filtering time period is obtained in step S3421, the first candidate data information of each filtering time period can be collected separately to obtain the first information set of each filtering time period.
  • In an embodiment, step S340 is further described. Step S340 may include, but is not limited to, step S345, step S346, step S347, step S348 and step S349.
  • Step S345: Divide the target time period according to the preset time length to obtain a first filtering time period and a second filtering time period.
  • In this step, the target time period can be divided according to the preset time length to obtain the first filtering time period and the second filtering time period, so that subsequent steps can screen the candidate data information in each filtering time period.
  • Step S346: Screen the candidate data information in the first filtering time period to obtain a third information set of the first filtering time period.
  • In this step, since the first filtering time period is obtained in step S345, the candidate data information in the first filtering time period can be screened to obtain the third information set of the first filtering time period, so that subsequent steps can deduplicate the third information set.
  • Step S347: Deduplicate the third information set to obtain a fourth information set.
  • In this step, since the third information set is obtained in step S346, the third information set can be deduplicated to obtain the fourth information set, so that subsequent steps can screen the candidate data information in the second filtering time period according to the fourth information set.
  • Step S348: According to the fourth information set, screen the candidate data information in the second filtering time period to obtain a fifth information set of the second filtering time period.
  • In this step, since the fourth information set is obtained in step S347 and the second filtering time period is obtained in step S345, the candidate data information in the second filtering time period can be screened according to the fourth information set to obtain the fifth information set of the second filtering time period, so that subsequent steps can take the union of the fourth information set and the fifth information set.
  • Step S349: Take the union of the fourth information set and the fifth information set to obtain the target data information.
  • In this step, since the fourth information set is obtained in step S347 and the fifth information set is obtained in step S348, the union of the fourth information set and the fifth information set can be taken to obtain the target data information, so that subsequent steps can push the target data information for display and thereby assist operation and maintenance personnel in locating the root cause of the fault.
  • In this embodiment, by adopting the data processing method including steps S345 to S349, the target time period is first divided according to the preset time length to obtain the first filtering time period and the second filtering time period; then, according to the second data information, the candidate data information in the first filtering time period is screened to obtain the third information set of the first filtering time period; the third information set is then deduplicated to obtain the fourth information set; next, using the fourth information set and according to the second data information, the candidate data information in the second filtering time period is screened to obtain the fifth information set of the second filtering time period; finally, the union of the fourth information set and the fifth information set is taken to obtain the target data information.
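Steps S345 to S349 can be sketched as follows. As an assumption for illustration, each filtering time period is represented simply by the list of data types (for example alarm numbers or log codes) of its candidate data information, and `reference_types` stands for the data types of the second data information; the function name is hypothetical.

```python
def screen_two_periods(period1, period2, reference_types):
    """Sequential screening over a first and a second filtering time period."""
    # S346: third information set (may contain duplicates)
    third_set = [t for t in period1 if t in reference_types]
    # S347: fourth information set (deduplicated)
    fourth_set = set(third_set)
    # S348: fifth information set; types already in the fourth set are skipped
    fifth_set = {t for t in period2
                 if t in reference_types and t not in fourth_set}
    # S349: union of the fourth and fifth information sets
    return fourth_set | fifth_set


if __name__ == "__main__":
    t1 = ["A", "A", "B"]
    t2 = ["A", "C", "D", "C"]
    print(screen_two_periods(t1, t2, {"A", "C", "D"}))
```

Because items already captured in the fourth information set are skipped when the second filtering time period is traversed, the second pass adds only data types that have not been seen before.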
  • It should be noted that the data information in the third information set and the data information in the fifth information set both have the same data type as the second data information, and that screening the candidate data information in the first filtering time period and in the second filtering time period separately, based only on the second data information, does not affect the subsequent filtering of target data information that is strongly related to the fault.
  • Bayes' formula can be used to show this, as follows:
  • Let event B be the occurrence of high-frequency data information and event A be the occurrence of a fault; P(A|B) is then the probability of a fault given that high-frequency data information occurs.
  • Bayes' formula is P(A|B) = P(B|A) · P(A) / P(B). The formula shows that the smaller P(B) is, the larger P(A|B) is. High-frequency data information appears in essentially every filtering time period, that is, P(B) ≈ 1 (100%), so the corresponding value of P(A|B) is small.
  • The target time period is divided into two parts, T1 and T2, as shown in Figure 13.
  • Let all the first data information in T1 form one set and all the first data information in T2 form another set (the set symbols and the membership condition appear as formula images in the original publication and are not reproduced here). If first data information of a certain type satisfies that membership condition with respect to the two sets, then first data information of this type is strongly related to the fault.
  • The essence of this simplification is to discard the first data information with P(A|B) < 50%, so as to reduce the complexity of data processing and further reduce the processing time.
  • It should be noted that the number of filtering time periods can also be three (for example, T1 of the two filtering time periods above is itself divided into two); the filtering efficiency is then lower than that of the method with two filtering time periods but higher than that of the traditional method with N filtering time periods.
  • Similarly, the number of filtering time periods in the simplified filtering method can also be 4, 5, and so on up to N-1. The time filtering algorithm can therefore greatly reduce the number of traversal-and-judgment operations on the first data information without affecting filtering accuracy, thereby reducing the filtering time.
  • It should be noted that low frequency refers to a frequency lower than or equal to the preset frequency threshold, and high frequency refers to a frequency higher than the preset frequency threshold. The preset frequency threshold can be selected as appropriate for the actual application and is not specifically limited here.
  • In an embodiment, step S346 is further described. Step S346 may include, but is not limited to, step S3461 and step S3462.
  • Step S3461: Traverse the candidate data information in the first filtering time period, and screen out the second candidate data information having the same data type as the second data information.
  • In this step, the candidate data information in the first filtering time period can be traversed first to screen out the second candidate data information having the same data type as the second data information, so that subsequent steps can collect the second candidate data information.
  • Step S3462: Collect the second candidate data information to obtain the third information set of the first filtering time period.
  • In this step, since the second candidate data information having the same data type as the second data information is screened out in step S3461, the second candidate data information can be collected to obtain the third information set of the first filtering time period, so that subsequent steps can deduplicate the third information set.
  • In this embodiment, by adopting the data processing method including steps S3461 and S3462, the candidate data information in the first filtering time period is first traversed to screen out the second candidate data information having the same data type as the second data information, and the second candidate data information is then collected to obtain the third information set of the first filtering time period.
  • In an embodiment, step S348 is further described. Step S348 may include, but is not limited to, step S3481 and step S3482.
  • Step S3481: Traverse the candidate data information in the second filtering time period, and screen out the third candidate data information that has the same data type as the second data information and does not belong to the fourth information set.
  • Step S3482: Collect the third candidate data information to obtain the fifth information set of the second filtering time period.
  • In this embodiment, by adopting the data processing method including steps S3481 and S3482, the candidate data information in the second filtering time period is first traversed to screen out the third candidate data information that has the same data type as the second data information and does not belong to the fourth information set, and the third candidate data information is then collected to obtain the fifth information set of the second filtering time period.
  • In addition, an embodiment of the present application also provides an electronic device 400. As shown in FIG. 16, the electronic device 400 includes, but is not limited to:
  • a memory 420, configured to store programs;
  • a processor 410, configured to execute the programs stored in the memory 420; when the processor 410 executes the programs stored in the memory 420, the processor 410 is configured to execute the above data processing method.
  • The processor 410 and the memory 420 may be connected through a bus or in other ways.
  • As a non-transitory computer-readable storage medium, the memory 420 can be used to store non-transitory software programs and non-transitory computer-executable programs, such as the data processing method described in the embodiments of the present application.
  • The processor 410 implements the above data processing method by running the non-transitory software programs and instructions stored in the memory 420.
  • The memory 420 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data for executing the above data processing method.
  • In addition, the memory 420 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device.
  • In some embodiments, the memory 420 includes memory located remotely from the processor 410, and such remote memory may be connected to the processor 410 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The non-transitory software programs and instructions required to implement the above data processing method are stored in the memory 420 and, when executed by one or more processors 410, carry out the above data processing method, for example method steps S310 to S340 in FIG. 3 described above.
  • An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are executed by a processor 410 or a controller, for example by a processor 410 in the above device embodiment, the processor 410 is caused to execute the data processing method in the above embodiments, for example method steps S310 to S340 in FIG. 3, method steps S321 and S322 in FIG. 4, method steps S3211 and S3212 in FIG. 5, method steps S3213 and S3214 in FIG. 7, method steps S32131 to S32133 in FIG. 8, method steps S331 and S332 in FIG. 9, method steps S341 to S344 in FIG. 10, method steps S3421 and S3422 in FIG. 11, method steps S345 to S349 in FIG. 12, method steps S3461 and S3462 in FIG. 14, and method steps S3481 and S3482 in FIG. 15 described above.
  • In addition, an embodiment of the present application also provides a computer program product, including a computer program or computer instructions stored in a computer-readable storage medium. The processor 410 of a computer device reads the computer program or computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the data processing method in the above embodiments, for example method steps S310 to S340 in FIG. 3, method steps S321 and S322 in FIG. 4, method steps S3211 and S3212 in FIG. 5, method steps S3213 and S3214 in FIG. 7, method steps S32131 to S32133 in FIG. 8, method steps S331 and S332 in FIG. 9, method steps S341 to S344 in FIG. 10, method steps S3421 and S3422 in FIG. 11, method steps S345 to S349 in FIG. 12, method steps S3461 and S3462 in FIG. 14, and method steps S3481 and S3482 in FIG. 15 described above.
  • The embodiments of the present application include: acquiring a plurality of first data information; preprocessing the plurality of first data information to obtain second data information among the plurality of first data information; determining a plurality of candidate data information from the plurality of first data information; and screening, according to the second data information and the plurality of candidate data information, target data information from the plurality of candidate data information, where the target data information and the second data information have the same data type.
  • According to the solution of the embodiments of the present application, the plurality of first data information is preprocessed to obtain the second data information among it, so that the second data information can serve as a reference for the desired data information; a plurality of candidate data information is determined from the plurality of first data information; and then, according to the second data information and the plurality of candidate data information, target data information having the same data type as the second data information is screened out from the plurality of candidate data information. In other words, by determining in advance the data type of the desired data information and then screening out the target data information of that data type, the solution achieves the purpose of quickly obtaining the desired target data information without increasing the resource allocation of the power monitoring system.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • In addition, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Debugging And Monitoring (AREA)

Abstract

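A data processing method, an electronic device, a storage medium and a program product. The data processing method includes: acquiring a plurality of first data information (S310); preprocessing the plurality of first data information to obtain second data information among the plurality of first data information (S320); determining a plurality of candidate data information from the plurality of first data information (S330); and screening, according to the second data information and the plurality of candidate data information, target data information from the plurality of candidate data information, the target data information having the same data type as the second data information (S340).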

Description

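Data processing method, electronic device, storage medium and program product
Cross-Reference to Related Application
This application is based on, and claims priority to, Chinese patent application No. 202111083803.4 filed on September 14, 2021, the entire contents of which are incorporated herein by reference.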
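Technical Field
The embodiments of the present application relate to the technical field of data processing, and in particular to a data processing method, an electronic device, a storage medium and a program product.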
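Background
In every link of the power industry (generation, transmission, transformation, distribution and consumption), the industry generally uses a power monitoring (Supervisory Control And Data Acquisition, SCADA) system for management. At present, fault location analysis is often performed using the log files generated during operation of the power monitoring system (such as operation logs, security logs and system logs). However, as power networks grow increasingly complex and large, locating faults by analyzing log files becomes increasingly difficult.
In some cases, a Bayesian algorithm with filtering by filtering time periods is commonly used to compute the correlation between specific logs and faults, which places high demands on the computing power of back-end training and incurs considerable resource overhead. For alarm monitoring, an experience base mapping alarms to fault types needs to be established, which requires business personnel to invest a great deal of experience in compiling it. Besides being very costly, this manual approach also leads to omissions and errors in the experience base because of human subjectivity.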
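Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
The embodiments of the present application provide a data processing method, an electronic device, a storage medium and a program product.
In a first aspect, an embodiment of the present application provides a data processing method, including: acquiring a plurality of first data information; preprocessing the plurality of first data information to obtain second data information among the plurality of first data information; determining a plurality of candidate data information from the plurality of first data information; and screening, according to the second data information and the plurality of candidate data information, target data information from the plurality of candidate data information, the target data information having the same data type as the second data information.
In a second aspect, an embodiment of the present application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the data processing method described above when executing the computer program.
In a third aspect, an embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the data processing method described above.
In a fourth aspect, an embodiment of the present application also provides a computer program product, including a computer program or computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer program or the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the data processing method described above.
Other features and advantages of the present application will be set forth in the description that follows and will in part become apparent from the description or be understood by practicing the present application. The objectives and other advantages of the present application can be realized and obtained through the structures particularly pointed out in the description, the claims and the drawings.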
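Brief Description of the Drawings
The drawings are provided for a further understanding of the technical solution of the present application and constitute a part of the description. Together with the embodiments of the present application, they serve to explain the technical solution of the present application and do not limit it.
FIG. 1 is a schematic diagram of a power monitoring system architecture for executing a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a data processing device according to an embodiment of the present application;
FIG. 3 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 4 is a flowchart of the method of step S320 in FIG. 3;
FIG. 5 is a flowchart of one method of step S321 in FIG. 4;
FIG. 6 is a bar chart of the correspondence between alarm numbers and alarm frequencies according to an example of the present application;
FIG. 7 is a flowchart of another method of step S321 in FIG. 4;
FIG. 8 is a flowchart of the method of step S3213 in FIG. 7;
FIG. 9 is a flowchart of the method of step S330 in FIG. 3;
FIG. 10 is a flowchart of one method of step S340 in FIG. 3;
FIG. 11 is a flowchart of the method of step S342 in FIG. 10;
FIG. 12 is a flowchart of another method of step S340 in FIG. 3;
FIG. 13 is a schematic diagram of dividing the target time period according to an example of the present application;
FIG. 14 is a flowchart of the method of step S346 in FIG. 12;
FIG. 15 is a flowchart of the method of step S348 in FIG. 12;
FIG. 16 is a schematic structural diagram of an electronic device according to an embodiment of the present application.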
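Detailed Description
To make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
It should be noted that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that in the flowcharts. In the description, the claims and the above drawings, "multiple" (or "a plurality of") means two or more; "greater than", "less than", "exceeding" and the like are understood as excluding the stated number; and "above", "below", "within" and the like are understood as including the stated number. Terms such as "first" and "second" are used only to distinguish technical features and cannot be understood as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating the order of the indicated technical features.
The embodiments of the present application provide a data processing method, an electronic device, a storage medium and a program product. A plurality of first data information is first preprocessed to obtain second data information among the plurality of first data information; a plurality of candidate data information is then determined from the plurality of first data information; and, according to the second data information and the plurality of candidate data information, target data information having the same data type as the second data information is screened out from the plurality of candidate data information. The desired target data information can therefore be obtained quickly without increasing the resource allocation of the power monitoring system.
The embodiments of the present application are described below with reference to the drawings.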
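FIG. 1 is a schematic diagram of a power monitoring system architecture for executing a data processing method according to an embodiment of the present application. In the example of FIG. 1, the power monitoring system architecture includes a system layer 100, a communication layer 200 and a device layer 300. The communication layer 200 is arranged between the system layer 100 and the device layer 300 and is communicatively connected to each of them.
Taking a power system as an example, the device layer 300 may include devices such as power meters 310, environmental sensors 320 and actuators 330. It can collect the power parameters of all kinds of power equipment, such as electricity meters, power protection devices or bus couplers, and can also collect real-time sampled values from the various environmental sensors 320, such as the temperature measured by a temperature sensor, the humidity measured by a humidity sensor, or the mechanical quantity detected by a vibration sensor. It can send the collected power parameters and sampled values up to the communication management machine 210 in the communication layer 200 through a device-side protocol. It should be noted that device-side protocols are a series of public or vendor-proprietary protocols, such as the Modbus protocol, the PROFINET protocol, the EtherNet/IP protocol or the HSE protocol, which are not listed one by one here.
The communication layer 200 includes the communication management machine 210, which may also be called an acquisition gateway. The communication layer 200 can receive the data sent up by the device layer 300, aggregate the received data, and then report the aggregated data to the system layer 100 through a system-side protocol, so the communication layer 200 serves as a link between the upper and lower layers of the power monitoring system. It should be noted that system-side protocols include general protocols of the power industry and vendor-proprietary protocols, which are not specifically limited here.
The system layer 100 includes SCADA software 110 and a data processing device 120. The SCADA software 110 can receive the data sent up by the communication layer 200 and map the data to the corresponding devices and points, and it can also control the actuators 330 in the power system in the reverse direction based on a series of control strategies, so as to keep the power system running smoothly.
It should be noted that the power monitoring system can be applied not only to power systems but also to fields such as water supply, petroleum or chemical engineering.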
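Based on the power monitoring system architecture shown in FIG. 1, and as shown in FIG. 2, an embodiment of the present application provides a data processing device 120. The data processing device 120 includes at least an information acquisition module 121, a correlation identification module 123 and a data preprocessing module 122, where the data preprocessing module 122 is connected to both the information acquisition module 121 and the correlation identification module 123, and the information acquisition module 121 is connected to the correlation identification module 123.
The information acquisition module 121 receives a plurality of first data information generated by the power monitoring system and sends it to the data preprocessing module 122 and the correlation identification module 123. The data preprocessing module 122 preprocesses the received first data information to obtain a plurality of second data information and sends the second data information to the correlation identification module 123. The correlation identification module 123 determines a plurality of candidate data information from the plurality of first data information, screens out from the candidate data information, according to the second data information, the target data information having the same data type as the second data information, and then sends it to the power monitoring system for display.
It should be noted that the data preprocessing module 122 may include a log gateway, an alarm gateway and so on, which is not specifically limited here. For example, the log gateway receives a plurality of log information generated by the power monitoring system, preprocesses it to obtain a plurality of low-frequency log information, and then passes the low-frequency log information to the correlation identification module 123; the log information includes operation information, system information, security information and so on, which is not specifically limited here. Similarly, the alarm gateway receives a plurality of alarm information from the power monitoring system, preprocesses it to obtain a plurality of low-frequency alarm information, and passes the low-frequency alarm information to the correlation identification module 123. According to the low-frequency log information pushed by the log gateway and the low-frequency alarm information pushed by the alarm gateway, the correlation identification module 123 screens the target data information out of the plurality of log information and the plurality of alarm information and sends it to the power monitoring system, so that the target data information is displayed and assists operation and maintenance personnel in locating the root cause.
It should be noted that low-frequency log information is log information whose frequency is lower than or equal to a preset frequency threshold; likewise, low-frequency alarm information is alarm information whose frequency is lower than or equal to the preset frequency threshold. The preset frequency threshold can be selected as appropriate for the actual application and is not specifically limited here.
It should be noted that low-frequency log information and low-frequency alarm information both belong to the second data information, log information and alarm information both belong to the first data information, and the second data information and the target data information have the same data characteristics.
The power monitoring system architecture and application scenarios described in the embodiments of the present application are intended to explain the technical solutions of the embodiments more clearly and do not limit them. Those skilled in the art will appreciate that, as the power monitoring system architecture evolves and new application scenarios emerge, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
Those skilled in the art will understand that the power monitoring system architecture shown in FIG. 1 and the data processing device shown in FIG. 2 do not limit the embodiments of the present application, which may include more or fewer components than shown, combine certain components, or arrange the components differently.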
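Based on the above power monitoring system architecture, embodiments of the data processing method are presented below.
FIG. 3 is a flowchart of a data processing method according to an embodiment of the present application. The data processing method can be applied to a data processing device, for example the data processing device in the power monitoring system architecture shown in FIG. 1. The data processing method may include, but is not limited to, step S310, step S320, step S330 and step S340.
Step S310: Acquire a plurality of first data information.
In this step, when the power monitoring system fails, the data processing device can acquire a plurality of first data information generated by the power monitoring system. The first data information may include log information, alarm information, other data information and so on, which is not specifically limited here. For example, when the first data information is alarm information, it may include the alarm name, the alarm fault cause, the alarm level and so on; when the first data information is log information, it may include operation logs, security logs, system logs and so on.
It should be noted that the plurality of first data information can be acquired in different ways, which this embodiment does not specifically limit. For example, it may be obtained by reading the relevant data files generated by the power monitoring system, or by calling a query interface exposed by the power monitoring system.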
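Step S320: Preprocess the plurality of first data information to obtain second data information among the plurality of first data information.
In this step, since a plurality of first data information containing the second data information is acquired in step S310, the data processing device preprocesses the acquired first data information to obtain a plurality of second data information, so that subsequent steps can screen the plurality of candidate data information according to the second data information. This step greatly reduces the computing power consumed by data processing and thus the processing time; it also favors the integration and miniaturization of the power monitoring system, lowers the overall cost of the solution, and improves market competitiveness.
Step S330: Determine a plurality of candidate data information from the plurality of first data information.
In this step, after the plurality of first data information is acquired in step S310, the data processing device screens it to obtain a plurality of candidate data information, which reduces the amount of data to be processed and saves processing time.
It can be understood that the candidate data information may include log information, alarm information, other data information and so on, which is not specifically limited here. For example, when the candidate data information is alarm information, it may include the alarm name, the alarm fault cause, the alarm level and so on; when the candidate data information is log information, it may include operation logs, security logs and system logs.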
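Step S340: According to the second data information and the plurality of candidate data information, screen target data information from the plurality of candidate data information, the target data information having the same data type as the second data information.
In this step, after the candidate data information is determined in step S330, target data information having the same data type as the second data information is screened out from the plurality of candidate data information according to the data type of the second data information obtained in step S320. This step reduces the amount of data to be processed by 80-90% and thus the computation time, and it also reduces the resource consumption of the power monitoring system, so that fault root cause location can be deployed without increasing the resource allocation of the power monitoring system.
It should be noted that the candidate data information may be of several different data types, and the first data information and the candidate data information have the same data types. The data type of the second data information obtained from the first data information may therefore be one or more of the data types of the candidate data information. Data types may include a date-and-time type, a low-frequency type, a high-frequency type and so on, which this embodiment does not specifically limit. When the second data information is of different data types, the correlation identification module performs correspondingly different screening. For example, when the second data information is of the low-frequency data type, the correlation identification module screens out the low-frequency candidate data information from the candidate data information to obtain the target data information, while the high-frequency candidate data information is left unprocessed or discarded. As another example, when the data type of the second data information is the date-and-time type, the correlation identification module screens out the candidate data information with the same date and time, or the candidate data information with the same date but a different time, or the candidate data information with the same time but a different date, and finally obtains the target data information, which helps operation and maintenance personnel locate the root cause of the fault.
It should be noted that low frequency is a frequency lower than or equal to the frequency threshold and, likewise, high frequency is a frequency higher than the frequency threshold. The frequency threshold can be selected as appropriate for the actual application and is not specifically limited here.
In this embodiment, by adopting the data processing method including steps S310 to S340, the data processing device acquires a plurality of first data information, preprocesses it to obtain the second data information, determines a plurality of candidate data information from the plurality of first data information, screens the candidate data information according to the data type of the second data information to obtain target data information having the same data type as the second data information, and finally pushes the target data information so that it is displayed. This embodiment can therefore obtain the desired target data information quickly without increasing the resource allocation of the power monitoring system, which helps operation and maintenance personnel locate the root cause.
It should be noted that, through the preprocessing of the plurality of first data information in step S320, the data processing time is reduced to within 5% of the original.
It is worth noting that, as networks grow increasingly complex and large, the amount of data generated by a power monitoring system is often enormous. Locating the root cause of a fault manually across all the data is very costly and, because of human subjectivity, also leads to omissions and errors in the experience base. If instead the back end performs root cause location on all the data, the demands on back-end training compute are relatively high and a GPU server is usually required, which increases the resource overhead of the power monitoring system, works against making the system as a whole compact and resource-efficient, and does not help reduce cost. In this embodiment, because the first data information is preprocessed, the amount of data to be processed, the resource overhead of the power monitoring system and the cost can all be reduced.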
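In an embodiment, as shown in FIG. 4, step S320 is further described. Step S320 may include, but is not limited to, step S321 and step S322.
Step S321: Perform frequency-based clustering on the multiple pieces of first data information to obtain multiple cluster sets, where different cluster sets have different center frequencies.
In this step, when the data processing apparatus preprocesses the multiple pieces of first data information to obtain the second data information, it may first perform step S321 to obtain multiple cluster sets with different center frequencies, so that subsequent steps can obtain the second data information from these cluster sets. In addition, clustering does not require manual configuration, which avoids tedious manual setup and reduces subjective factors.
It should be noted that clustering means dividing a data set into different classes or clusters according to a specific criterion (such as distance), so that data objects in the same cluster are as similar as possible and data objects in different clusters are as different as possible. In other words, after clustering, data of the same class are gathered together as far as possible and data of different classes are separated as far as possible. The clustering method used for frequency-based clustering divides the multiple pieces of first data information into different cluster sets by frequency, so that the frequencies of the first data information within the same cluster set are as close as possible, while the frequencies of the first data information in different cluster sets differ as much as possible. It should also be noted that various clustering methods may be used, such as the kmeans algorithm, the K-means++ algorithm, or the bi-kmeans algorithm, which is not specifically limited in this embodiment.
Step S322: Determine, from the multiple cluster sets, the target cluster sets whose center frequency is less than or equal to the frequency threshold, to obtain the second data information.
In this step, because multiple cluster sets are obtained in step S321, the target cluster sets whose center frequency is less than or equal to the frequency threshold can be determined from them to obtain the second data information, so that subsequent steps can screen the candidate data information according to the second data information.
It should be noted that the frequency threshold may be set manually, or set automatically by the power monitoring system according to the center frequencies of the cluster sets, which is not specifically limited in this embodiment. The frequency threshold divides the cluster sets into two groups: low-frequency cluster sets, whose center frequency is less than or equal to the frequency threshold (the target cluster sets), and high-frequency cluster sets, whose center frequency is greater than the frequency threshold. The second data information is obtained from the target cluster sets, while the high-frequency cluster sets may be left unprocessed or discarded, which is not specifically limited here. According to the Pareto principle, discarding the high-frequency cluster sets can reduce the amount of data to be processed by more than 90%, which greatly reduces the computational load and the demands on CPU and memory. It should also be noted that discarding the high-frequency first data information does not affect the screening of the target data information in subsequent steps. For example, suppose the kmeans algorithm is used to cluster the multiple pieces of first data information by frequency, with the resulting cluster sets shown in Table 1 below. The table lists the frequency classes, the number of pieces of data information in each cluster set, and the proportion of each frequency class in the total data information. As Table 1 shows, discarding the high-frequency data classified automatically by the clustering algorithm reduces the amount of data to be processed by nearly 90%.
Table 1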
Figure PCTCN2022091576-appb-000001
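In this embodiment, by adopting steps S321 and S322 above, the data processing apparatus performs frequency-based clustering on the multiple pieces of first data information to obtain multiple cluster sets with different center frequencies, and then determines from them the target cluster sets whose center frequency is less than or equal to the frequency threshold, to obtain the second data information. It can be understood that the second data information is low-frequency data.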
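Steps S321 and S322 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the application: it uses a plain one-dimensional k-means written with the standard library, the function names `kmeans_1d` and `low_frequency_ids` are hypothetical, the initial centers are picked from the sorted frequencies for determinism, and the sample frequencies are invented for the example.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Cluster scalar frequencies into k groups with a simple 1-D k-means.

    Assumption: initial centers are taken at evenly spaced positions of the
    sorted values so that the result is deterministic.
    """
    if k < 2 or len(values) < k:
        raise ValueError("need k >= 2 and at least k values")
    ordered = sorted(values)
    n = len(ordered)
    centers = [float(ordered[i * (n - 1) // (k - 1)]) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each value joins the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Update step: each center becomes the mean of its members.
        new_centers = [
            sum(members) / len(members) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


def low_frequency_ids(freq_by_id, threshold, k=2):
    """Return the ids in cluster sets whose center frequency is <= threshold."""
    centers, clusters = kmeans_1d(list(freq_by_id.values()), k)
    target_values = set()
    for center, members in zip(centers, clusters):
        if members and center <= threshold:
            target_values.update(members)
    return [i for i, f in freq_by_id.items() if f in target_values]


# Hypothetical frequencies: two very frequent types and three rare ones.
sample = {"A": 5000, "B": 4800, "C": 12, "D": 8, "E": 3}
second_data_ids = low_frequency_ids(sample, threshold=1000)
```

The threshold is compared against the cluster center rather than against each individual frequency, which mirrors the wording of step S322.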
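In an embodiment, as shown in FIG. 5, step S321 is further described. Where the first data information includes alarm information and the alarm information has an alarm number, step S321 may include, but is not limited to, step S3211 and step S3212.
Step S3211: Count the frequency of the alarm information according to the alarm number to obtain the alarm frequency of the alarm information.
In this step, where the first data information includes alarm information and the alarm information has an alarm number, when multiple cluster sets are to be obtained, the alarm information can be counted by alarm number to obtain its alarm frequency, so that subsequent steps can use the alarm frequency to cluster the alarm information.
It should be noted that, after the alarm information is counted, a two-dimensional table with two fields, alarm number and frequency, can be built from the correspondence between alarm numbers and alarm frequencies, so that subsequent steps can cluster the alarm information. For example, suppose the value_counts() method of Python's pandas library is used to aggregate by alarm number and sort in descending order. The resulting two-field table is shown in Table 2 below, which lists the alarm code and the alarm frequency, and FIG. 6 is the bar chart corresponding to Table 2. Suppose the frequency threshold is 1000. Alarm information whose alarm frequency is lower than or equal to the threshold is low-frequency alarm information, and alarm information whose alarm frequency is higher than the threshold is high-frequency alarm information. As Table 2 shows, the alarm information with alarm codes for the first alarm information (2114060448), the second alarm information (2114322696), the third alarm information (12596994), and the fourth alarm information (12611841) is high-frequency alarm information, while the alarm information with alarm codes for the fifth alarm information (2114060402), the sixth alarm information (12596992), the seventh alarm information (2114322678), and the eighth alarm information (2121662481) is low-frequency alarm information. FIG. 6 also makes it clear that the alarm frequencies of the different types are unevenly distributed.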
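Table 2
Alarm code                                   Alarm frequency
First alarm information (2114060448)         72761
Second alarm information (2114322696)        7721
Third alarm information (12596994)           5141
Fourth alarm information (12611841)          2085
Fifth alarm information (2114060402)         918
Sixth alarm information (12596992)           646
Seventh alarm information (2114322678)       10
Eighth alarm information (2121662481)        1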
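The frequency counting of step S3211 and the threshold split described for Table 2 can be sketched as follows. The source mentions the pandas value_counts() method; this sketch substitutes `collections.Counter` from the standard library as a stand-in so that it runs without third-party packages. The function names are hypothetical, and the frequencies are the ones listed in Table 2.

```python
from collections import Counter


def alarm_frequency_table(alarm_numbers):
    """Count alarms per alarm number, sorted by descending frequency."""
    return Counter(alarm_numbers).most_common()


def split_by_threshold(table, threshold):
    """Split (alarm_number, frequency) rows into low- and high-frequency rows."""
    low = [(number, count) for number, count in table if count <= threshold]
    high = [(number, count) for number, count in table if count > threshold]
    return low, high


# Rows of Table 2 as (alarm number, alarm frequency).
table2 = [
    (2114060448, 72761),
    (2114322696, 7721),
    (12596994, 5141),
    (12611841, 2085),
    (2114060402, 918),
    (12596992, 646),
    (2114322678, 10),
    (2121662481, 1),
]
low_rows, high_rows = split_by_threshold(table2, threshold=1000)
```

With the threshold of 1000 used in the text, the first four alarm numbers fall in the high-frequency group and the last four in the low-frequency group.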
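Step S3212: Cluster all the alarm information according to the alarm frequency to obtain multiple cluster sets.
In this step, because the alarm frequency of the alarm information is obtained in step S3211, all the alarm information can be clustered according to the alarm frequency to obtain multiple cluster sets, so that subsequent steps can use the cluster sets to determine the target cluster sets. This step clusters the alarm frequencies with a clustering method instead of a manually configured classification threshold, mainly to make the algorithm more end-to-end, to avoid the influence of subjective human judgment on the algorithm, and to reduce the configuration workload of operation and maintenance personnel.
It should be noted that various clustering methods may be used, such as the kmeans algorithm, the K-means++ algorithm, or the bi-kmeans algorithm, which is not specifically limited in this embodiment.
In this embodiment, by adopting steps S3211 and S3212 above, the data processing apparatus counts the frequency of all the alarm information by alarm number to obtain the corresponding alarm frequencies, and then clusters all the alarm information according to the alarm frequencies to obtain multiple cluster sets of alarm information.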
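In an embodiment, as shown in FIG. 7, step S321 is further described. Where the first data information includes log information, step S321 may include, but is not limited to, step S3213 and step S3214.
Step S3213: Count the frequency of the log information to obtain the log frequency of the log information.
In this step, where the first data information includes log information, when multiple cluster sets are to be obtained, the log information can be counted to obtain its log frequency, so that subsequent steps can use the log frequency to cluster the log information.
Step S3214: Cluster all the log information according to the log frequency to obtain multiple cluster sets.
In this step, because the log frequency of the log information is obtained in step S3213, all the log information can be clustered according to the log frequency to obtain multiple cluster sets, so that subsequent steps can use the cluster sets to determine the target cluster sets. This step clusters the log frequencies with a clustering method instead of a manually configured classification threshold, mainly to make the algorithm more end-to-end, to avoid the influence of subjective human judgment on the algorithm, and to reduce the configuration workload of operation and maintenance personnel.
It should be noted that various methods may be used to cluster the log information, which is not specifically limited here, such as MapReduce parallel techniques, the LCS-based Chameleon real-time log clustering method, and the nearest-neighbor chain hierarchical clustering algorithm.
In this embodiment, by adopting steps S3213 and S3214 above, the data processing apparatus counts the frequency of all the log information to obtain the log frequencies, and then clusters all the log information according to the log frequencies to obtain multiple cluster sets of log information.
In an embodiment, as shown in FIG. 8, step S3213 is further described. Step S3213 may include, but is not limited to, step S32131, step S32132, and step S32133.
Step S32131: Perform variable substitution on the log information to obtain alternative information.
In this step, when the log frequency of the log information is needed, variable substitution can first be performed on the log information to obtain alternative information, so that subsequent steps can use the alternative information to obtain the corresponding mapping information.
It should be noted that different methods may be used for variable substitution on the log information, which is not specifically limited here. For example, variable substitution may be based on regular expressions: the specific IP addresses, port numbers, and times in the log information are replaced with strings such as $IP, $IPPort, and $DateTime to obtain the alternative information. A regular expression describes a string-matching pattern and can be used to check whether a string contains a certain substring, to replace a matched substring, or to extract from a string a substring that meets a certain condition.
It should also be noted that a variable may be a time, a signed integer, a floating-point number, a special character, and so on, depending on the actual situation, which is not specifically limited here.
Step S32132: Perform mapping processing on the alternative information to obtain mapping information.
In this step, because the alternative information is obtained in step S32131, the mapping information corresponding to the alternative information is obtained through mapping processing, so that subsequent steps can use the mapping information to count the frequency of the log information.
It should be noted that many methods may be used for the mapping processing, which is not specifically limited here. For example, a hash function may encode the alternative information as a fixed-length string to obtain the mapping information. As another example, a general string-matching function with a fixed-length encoding format may be built into the power monitoring system and called to map the alternative information. A hash function is a commonly used fixed-length encoding function; it encodes quickly, has good collision resistance, and is widely used. For example, the hexdigest() method of Python's hashlib library may be used to produce the fixed-length string.
Step S32133: Count the frequency of the log information according to the mapping information to obtain the log frequency of the log information.
In this step, because the mapping information is obtained in step S32132, the frequency of the log information can be counted according to the mapping information to obtain the log frequency, so that the subsequent step S3214 can cluster the log information.
In this embodiment, by adopting the data processing method including steps S32131 to S32133 above, the data processing apparatus performs variable substitution on the log information to obtain the alternative information, maps the alternative information to obtain the mapping information, and then counts the frequency of the log information according to the mapping information, finally obtaining the log frequency of the log information.
It should be noted that, after the log frequency of the log information is obtained, a mapping relationship may be established between the mapping information and the log frequency, so that the subsequent step S3214 can cluster the log information. This can be chosen according to the actual situation and is not specifically limited in this embodiment. For example, variable substitution based on regular expressions is performed on the log information to obtain the alternative information, a hash function encodes the alternative information as a fixed-length string to obtain the mapping information, and the log information is counted by mapping information to form a two-dimensional table with two fields, log code and frequency. The value_counts() method of Python's pandas library may be used to aggregate by log code and sort in descending order.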
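Steps S32131 to S32133 can be sketched as follows. This is an illustration under stated assumptions rather than the application's implementation: the regular expressions, the log line format, and the helper names are hypothetical; MD5 is an arbitrary choice of hash (the source only says that hashlib's hexdigest() may be used); and `collections.Counter` stands in for the pandas value_counts() call mentioned in the text.

```python
import hashlib
import re
from collections import Counter

# Assumed patterns for the variables named in the text. Order matters:
# "ip:port" is replaced before a bare IP address.
_SUBSTITUTIONS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"), "$DateTime"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}\b"), "$IP:$IPPort"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "$IP"),
]


def substitute_variables(line):
    """Step S32131: replace variable parts of a log line with placeholders."""
    for pattern, placeholder in _SUBSTITUTIONS:
        line = pattern.sub(placeholder, line)
    return line


def log_code(template):
    """Step S32132: map a template to a fixed-length string (MD5 hex digest)."""
    return hashlib.md5(template.encode("utf-8")).hexdigest()


def log_frequencies(lines):
    """Step S32133: count how often each log code occurs."""
    return Counter(log_code(substitute_variables(line)) for line in lines)


# Hypothetical log lines, invented for the example.
logs = [
    "2021-09-14 10:00:00 connect from 10.0.0.1:8080 failed",
    "2021-09-14 10:05:09 connect from 10.0.0.2:9090 failed",
    "2021-09-14 10:06:00 disk full on 10.0.0.3",
]
frequencies = log_frequencies(logs)
```

The first two lines differ only in their variable parts, so they collapse to the same template and share one log code; `frequencies.most_common()` then gives the descending two-field table of log code and frequency.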
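In an embodiment, as shown in FIG. 9, step S330 is further described. Step S330 may include, but is not limited to, step S331 and step S332.
Step S331: Determine a target time period.
In this step, when multiple pieces of candidate data information are to be determined, the target time period can be determined first, so that subsequent steps can determine, from the multiple pieces of first data information, the candidate data information that falls within the target time period.
It should be noted that the target time period can be determined in many ways, which is not specifically limited here. For example, the time at which the fault occurred may be taken as the end of the target time period, and the start of the target time period may then be determined from the overall length of the filtering time period, which fixes the target time period.
Step S332: Determine, from the multiple pieces of first data information, the multiple pieces of candidate data information that fall within the target time period.
In this step, because the target time period is determined in step S331, the multiple pieces of first data information can be screened within the target time period to obtain the candidate data information, which reduces the amount of data to be processed.
In this embodiment, by adopting the data processing method including steps S331 and S332 above, the target time period is determined, and the multiple pieces of candidate data information that fall within the target time period are then determined from the multiple pieces of first data information, which greatly reduces the amount of data to be processed and shortens the wait before the target data information is pushed.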
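Steps S331 and S332 can be sketched as follows, following the example in the text where the fault time is the end of the target time period. The record layout (a `(timestamp, type_id)` tuple), the function names, and the 12-hour overall length are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def target_period(fault_time, total_length):
    """Step S331: the period ends at the fault time and spans total_length."""
    return fault_time - total_length, fault_time


def select_candidates(records, start, end):
    """Step S332: keep (timestamp, type_id) records with start <= timestamp <= end."""
    return [record for record in records if start <= record[0] <= end]


# Hypothetical fault time and records.
fault_time = datetime(2021, 9, 14, 12, 0, 0)
start, end = target_period(fault_time, timedelta(hours=12))
records = [
    (datetime(2021, 9, 13, 23, 0, 0), "a"),  # before the period, dropped
    (datetime(2021, 9, 14, 6, 0, 0), "b"),
    (datetime(2021, 9, 14, 12, 0, 0), "c"),  # at the fault time, kept
]
candidates = select_candidates(records, start, end)
```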
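In an embodiment, as shown in FIG. 10, step S340 is further described. Step S340 may include, but is not limited to, step S341, step S342, step S343, and step S344.
Step S341: Divide the target time period according to a preset time length to obtain two or more filtering time periods.
In this step, when the target data information is to be obtained, the target time period can first be divided according to the preset time length into two or more filtering time periods, so that subsequent steps can use the filtering time periods to screen the candidate data information.
It should be noted that there may be several preset time lengths and they may differ, for example 2 hours, 6 hours, or 12 hours; correspondingly, the lengths of the two or more filtering time periods may also differ. Performing correlation filtering with different preset time lengths avoids the potential correlations that filtering with a single preset time length might miss.
It should be noted that, with three filtering time periods, the correlation filtering efficiency is lower than with two filtering time periods but higher than with more than three filtering time periods.
Step S342: Screen the candidate data information in each of the two or more filtering time periods to obtain a first information set for each filtering time period.
In this step, because two or more filtering time periods are obtained in step S341, the candidate data information in each of them can be screened separately to obtain the first information set of each filtering time period, so that subsequent steps can deduplicate the first information set of each filtering time period.
Step S343: Deduplicate the first information set of each filtering time period to obtain a second information set for each filtering time period.
In this step, because the first information sets are obtained in step S342, the first information set of each filtering time period is deduplicated to obtain the second information set of each filtering time period, so that subsequent steps can take the union of all the second information sets to obtain the target data information.
Step S344: Obtain the union of all the second information sets to obtain the target data information.
In this step, because the second information sets are obtained in step S343, the union of all the second information sets can be taken to obtain the target data information, so that subsequent steps can perform information push processing on the target data information, causing it to be displayed and helping operation and maintenance personnel locate the fault root cause.
In this embodiment, by adopting the data processing method including steps S341 to S344 above, the target time period is first divided according to the preset time length into two or more filtering time periods. The candidate data information in each filtering time period is then screened according to the second data information to obtain the first information set of each filtering time period. Next, the first information set of each filtering time period is deduplicated to obtain the second information set of each filtering time period. Finally, the union of all the second information sets is taken to obtain the target data information.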
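Steps S341 to S344 can be sketched as follows. This is a simplified illustration: each candidate is a `(timestamp, type_id)` tuple, "having the same data type as the second data information" is modeled as membership in a set of low-frequency type ids, and each filtering time period is treated as half-open, [start, end). These modeling choices and all names are assumptions.

```python
from datetime import datetime, timedelta


def split_period(start, end, length):
    """Step S341: divide [start, end) into consecutive windows of the given length."""
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + length, end)
        windows.append((cursor, upper))
        cursor = upper
    return windows


def filter_by_windows(candidates, start, end, length, low_freq_types):
    """Steps S342 to S344: screen per window, deduplicate, and take the union."""
    target = set()
    for w_start, w_end in split_period(start, end, length):
        # S342: first information set, duplicates still present.
        first_set = [
            type_id
            for timestamp, type_id in candidates
            if w_start <= timestamp < w_end and type_id in low_freq_types
        ]
        # S343: second information set, duplicates removed.
        second_set = set(first_set)
        # S344: accumulate the union over all windows.
        target |= second_set
    return target


# Hypothetical data: "x" and "y" are low-frequency types, "h" is high-frequency.
period_start = datetime(2021, 9, 14, 0, 0, 0)
period_end = datetime(2021, 9, 14, 12, 0, 0)
events = [
    (datetime(2021, 9, 14, 1, 0, 0), "x"),
    (datetime(2021, 9, 14, 2, 0, 0), "x"),
    (datetime(2021, 9, 14, 3, 0, 0), "h"),
    (datetime(2021, 9, 14, 7, 0, 0), "y"),
    (datetime(2021, 9, 14, 8, 0, 0), "x"),
]
target_types = filter_by_windows(
    events, period_start, period_end, timedelta(hours=6), {"x", "y"}
)
```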
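It should be noted that the data information in the first information sets has the same data type as the second data information, and that screening the candidate data information in the two or more filtering time periods according to the second data information alone does not affect the subsequent filtering of target data information strongly correlated with the fault. Bayes' formula can be used to show this, as follows:
Assume that the second data information is low-frequency data information, that event B is the occurrence of high-frequency data information, and that event A is the occurrence of a fault. P(A|B) is then the probability that a fault occurs when high-frequency data information appears. Bayes' formula is
P(A|B) = P(B|A) * P(A) / P(B)
From the formula, the smaller P(B) is, the larger P(A|B) is. High-frequency data information appears in essentially every filtering time period, that is, P(B) = 1 (100%), so the corresponding P(A|B) is small. Discarding this high-frequency data information therefore does not affect the subsequent filtering of target data information strongly correlated with the fault.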
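The argument above can be made slightly more explicit. The following worked step is a sketch that uses only Bayes' formula and the stated assumption P(B) = 1; reading the result as "observing B adds no information about A" is an interpretation, not wording from the source.

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}, \qquad
P(B) = 1 \;\Longrightarrow\; P(A \mid B) = P(B \mid A)\,P(A) \le P(A).
```

Because P(B|A) is at most 1, the probability of a fault given a high-frequency item cannot exceed the prior fault probability P(A), which is why such items can be dropped without losing items that point to the fault.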
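In an embodiment, as shown in FIG. 11, step S342 is further described. Step S342 may include, but is not limited to, step S3421 and step S3422.
Step S3421: Traverse the candidate data information in the two or more filtering time periods, and screen out, in each filtering time period, the first alternative data information that has the same data type as the second data information.
In this step, when the first information set of each filtering time period is to be obtained, all the candidate data information in all the filtering time periods can be traversed to screen out, in each filtering time period, the first alternative data information that has the same data type as the second data information, so that subsequent steps can collect the first alternative data information of each filtering time period.
Step S3422: Collect the first alternative data information of each filtering time period to obtain the first information set of each filtering time period.
In this step, because the first alternative data information of each filtering time period is obtained in step S3421, it can be collected separately for each filtering time period to obtain the first information set of each filtering time period.
In this embodiment, by adopting the data processing method including steps S3421 and S3422 above, the candidate data information in each filtering time period is first traversed to screen out the first alternative data information that has the same data type as the second data information, and the first alternative data information of each filtering time period is then collected to obtain the first information set of each filtering time period.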
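In an embodiment, as shown in FIG. 12, step S340 is further described. Step S340 may include, but is not limited to, step S345, step S346, step S347, step S348, and step S349.
Step S345: Divide the target time period according to the preset time length to obtain a first filtering time period and a second filtering time period.
In this step, when the target data information is to be obtained, the target time period can first be divided according to the preset time length into a first filtering time period and a second filtering time period, so that subsequent steps can screen the candidate data information of each filtering time period.
Step S346: Screen the candidate data information in the first filtering time period to obtain a third information set for the first filtering time period.
In this step, because the first filtering time period is obtained in step S345, the candidate data information in the first filtering time period can be screened to obtain the third information set of the first filtering time period, so that subsequent steps can deduplicate the third information set.
Step S347: Deduplicate the third information set to obtain a fourth information set.
In this step, because the third information set is obtained in step S346, it can be deduplicated to obtain the fourth information set, so that subsequent steps can screen the candidate data information in the second filtering time period according to the fourth information set.
Step S348: Screen the candidate data information in the second filtering time period according to the fourth information set to obtain a fifth information set for the second filtering time period.
In this step, because the fourth information set is obtained in step S347 and the second filtering time period is obtained in step S345, the candidate data information in the second filtering time period can be screened according to the fourth information set to obtain the fifth information set of the second filtering time period, so that subsequent steps can combine the fourth information set and the fifth information set.
Step S349: Obtain the union of the fourth information set and the fifth information set to obtain the target data information.
In this step, because the fourth information set is obtained in step S347 and the fifth information set is obtained in step S348, the union of the two sets can be taken to obtain the target data information, so that subsequent steps can perform information push processing on the target data information, causing it to be displayed and helping operation and maintenance personnel locate the fault root cause.
In this embodiment, by adopting the data processing method including steps S345 to S349 above, the target time period is first divided according to the preset time length into a first filtering time period and a second filtering time period. The candidate data information in the first filtering time period is then screened according to the second data information to obtain the third information set of the first filtering time period, and the third information set is deduplicated to obtain the fourth information set. Using the fourth information set, the candidate data information in the second filtering time period is screened according to the second data information to obtain the fifth information set of the second filtering time period. Finally, the union of the fourth information set and the fifth information set is taken to obtain the target data information.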
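The two-period variant of steps S345 to S349 can be sketched as follows. As a simplification, each filtering time period is given directly as a list of type ids, and "same data type as the second data information" is modeled as membership in a set of low-frequency type ids; the function name and the sample data are hypothetical.

```python
def filter_two_periods(first_period, second_period, low_freq_types):
    """Return (fourth_set, fifth_set, target) for two filtering time periods."""
    # S346: third information set, screened from the first period.
    third = [t for t in first_period if t in low_freq_types]
    # S347: fourth information set, the deduplicated third set.
    fourth = set(third)
    # S348: fifth information set, screened from the second period while
    # skipping anything already recorded in the fourth set.
    fifth = {
        t for t in second_period if t in low_freq_types and t not in fourth
    }
    # S349: the target data information is the union of the two sets.
    return fourth, fifth, fourth | fifth


# Hypothetical type ids: "x" and "y" are low-frequency, "h" is high-frequency.
fourth_set, fifth_set, target = filter_two_periods(
    ["x", "h", "x"], ["x", "y", "h"], {"x", "y"}
)
```

Checking membership in the fourth set while scanning the second period is what lets this variant avoid collecting the same type twice, while the final union is the same as the one produced by screening each period independently.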
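It should be noted that the data information in the third information set and in the fifth information set has the same data type as the second data information, and that screening the candidate data information in the first filtering time period and in the second filtering time period according to the second data information alone does not affect the subsequent filtering of target data information strongly correlated with the fault. Bayes' formula can be used to show this, as follows:
Assume that the second data information is low-frequency data information, that event B is the occurrence of high-frequency data information, and that event A is the occurrence of a fault. P(A|B) is then the probability that a fault occurs when high-frequency data information appears. Bayes' formula is
P(A|B) = P(B|A) * P(A) / P(B)
From the formula, the smaller P(B) is, the larger P(A|B) is. High-frequency data information appears in essentially every filtering time period, that is, P(B) = 1 (100%), so the corresponding P(A|B) is small. Discarding this high-frequency data information therefore does not affect the subsequent filtering of target data information strongly correlated with the fault.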
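Taking the time window filtering algorithm as an example, the target time period is divided into two parts, T1 and T2, according to the preset time length T, as shown in FIG. 13. Let all the first data information within T1 be
Figure PCTCN2022091576-appb-000004
and all the first data information within T2 be
Figure PCTCN2022091576-appb-000005
If a certain type of first data information
Figure PCTCN2022091576-appb-000006
and
Figure PCTCN2022091576-appb-000007
then this type of first data information is strongly correlated with the fault. In essence, this simplification discards the first data information with P(A|B) <= 50%, which reduces the complexity of data processing and therefore the processing time. If
Figure PCTCN2022091576-appb-000008
and
Figure PCTCN2022091576-appb-000009
From the perspective of Bayes' formula, assuming there are N filtering time periods, P(A) and P(B) are both 1/N, so
Figure PCTCN2022091576-appb-000010
Figure PCTCN2022091576-appb-000011
and
Figure PCTCN2022091576-appb-000012
then
Figure PCTCN2022091576-appb-000013
Figure PCTCN2022091576-appb-000014
Since P(B|A) <= 1,
Figure PCTCN2022091576-appb-000015
Similarly, as an extension of this method, there may be three filtering time periods (T1 of the two periods above is split in two again). Its filtering efficiency is lower than that of the method with two filtering time periods but higher than that of the conventional method with N filtering time periods. Likewise, the simplified filtering method may use 4, 5, or up to N-1 filtering time periods. This time filtering algorithm can therefore greatly reduce the number of traversal-and-judgment operations on the first data information without affecting filtering accuracy, which reduces the filtering time.
It should be noted that a low frequency is a frequency lower than or equal to the preset frequency threshold and, likewise, a high frequency is a frequency higher than the preset frequency threshold. The preset frequency threshold may be chosen appropriately according to the actual application, which is not specifically limited here.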
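In an embodiment, as shown in FIG. 14, step S346 is further described. Step S346 may include, but is not limited to, step S3461 and step S3462.
Step S3461: Traverse the candidate data information in the first filtering time period, and screen out the second alternative data information that has the same data type as the second data information.
In this step, because the third information set of the first filtering time period is needed, the candidate data information in the first filtering time period can first be traversed to screen out the second alternative data information that has the same data type as the second data information, so that subsequent steps can collect the second alternative data information.
Step S3462: Collect the second alternative data information to obtain the third information set of the first filtering time period.
In this step, because the second alternative data information that has the same data type as the second data information is screened out in step S3461, it can be collected to obtain the third information set of the first filtering time period, so that subsequent steps can deduplicate the third information set.
In this embodiment, by adopting the data processing method including steps S3461 and S3462 above, the candidate data information in the first filtering time period is first traversed to screen out the second alternative data information that has the same data type as the second data information, and the second alternative data information is then collected to obtain the third information set of the first filtering time period.
In an embodiment, as shown in FIG. 15, step S348 is further described. Step S348 may include, but is not limited to, step S3481 and step S3482.
Step S3481: Traverse the candidate data information in the second filtering time period, and screen out the third alternative data information that has the same data type as the second data information and does not belong to the fourth information set.
Step S3482: Collect the third alternative data information to obtain the fifth information set of the second filtering time period.
In this embodiment, by adopting the data processing method including steps S3481 and S3482 above, the candidate data information in the second filtering time period is first traversed to screen out the third alternative data information that has the same data type as the second data information and does not belong to the fourth information set, and the third alternative data information is then collected to obtain the fifth information set of the second filtering time period.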
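An embodiment of the present application further provides an electronic device 400. As shown in FIG. 16, the electronic device 400 includes, but is not limited to:
a memory 420, configured to store a program;
a processor 410, configured to execute the program stored in the memory 420, where, when the processor 410 executes the program stored in the memory 420, the processor 410 is configured to perform the data processing method described above.
The processor 410 and the memory 420 may be connected by a bus or in other ways.
As a non-transitory computer-readable storage medium, the memory 420 may be used to store non-transitory software programs and non-transitory computer-executable programs, such as the data processing method described in the embodiments of the present application. The processor 410 implements the data processing method described above by running the non-transitory software programs and instructions stored in the memory 420.
The memory 420 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store the data required to perform the data processing method described above. In addition, the memory 420 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some implementations, the memory 420 includes memory located remotely from the processor 410, and such remote memory may be connected to the processor 410 through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The non-transitory software programs and instructions required to implement the data processing method described above are stored in the memory 420 and, when executed by one or more processors 410, perform the data processing method described above, for example, method steps S310 to S340 in FIG. 3, method steps S321 and S322 in FIG. 4, method steps S3211 and S3212 in FIG. 5, method steps S3213 and S3214 in FIG. 7, method steps S32131 to S32133 in FIG. 8, method steps S331 and S332 in FIG. 9, method steps S341 to S344 in FIG. 10, method steps S3421 and S3422 in FIG. 11, method steps S345 to S349 in FIG. 12, method steps S3461 and S3462 in FIG. 14, and method steps S3481 and S3482 in FIG. 15.
The apparatus embodiments or system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
An embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions. When executed by a processor 410 or a controller, for example by a processor 410 in the apparatus embodiment above, the computer-executable instructions cause the processor 410 to perform the data processing method in the embodiments above, for example, method steps S310 to S340 in FIG. 3, method steps S321 and S322 in FIG. 4, method steps S3211 and S3212 in FIG. 5, method steps S3213 and S3214 in FIG. 7, method steps S32131 to S32133 in FIG. 8, method steps S331 and S332 in FIG. 9, method steps S341 to S344 in FIG. 10, method steps S3421 and S3422 in FIG. 11, method steps S345 to S349 in FIG. 12, method steps S3461 and S3462 in FIG. 14, and method steps S3481 and S3482 in FIG. 15.
In addition, an embodiment of the present application further provides a computer program product, including a computer program or computer instructions stored in a computer-readable storage medium. A processor 410 of a computer device reads the computer program or computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the data processing method in the embodiments above, for example, method steps S310 to S340 in FIG. 3, method steps S321 and S322 in FIG. 4, method steps S3211 and S3212 in FIG. 5, method steps S3213 and S3214 in FIG. 7, method steps S32131 to S32133 in FIG. 8, method steps S331 and S332 in FIG. 9, method steps S341 to S344 in FIG. 10, method steps S3421 and S3422 in FIG. 11, method steps S345 to S349 in FIG. 12, method steps S3461 and S3462 in FIG. 14, and method steps S3481 and S3482 in FIG. 15.
The embodiments of the present application include: obtaining multiple pieces of first data information; preprocessing the multiple pieces of first data information to obtain the second data information among them; determining multiple pieces of candidate data information from the multiple pieces of first data information; and, according to the second data information and the multiple pieces of candidate data information, screening the candidate data information to obtain target data information, where the target data information has the same data type as the second data information. According to the solution of the embodiments of the present application, preprocessing the multiple pieces of first data information yields the second data information, which can serve as the reference standard for the data information that is expected to be obtained. Multiple pieces of candidate data information are then determined from the first data information, and target data information having the same data type as the second data information is screened out of the candidate data information according to the second data information and the candidate data information. In other words, the solution of the embodiments of the present application determines in advance the data type of the data information that is expected to be obtained and then screens out the target data information corresponding to that data type, so that the desired target data information can be obtained quickly without increasing the resource configuration of the power monitoring system.
Those of ordinary skill in the art will understand that all or some of the steps and systems of the methods disclosed above may be implemented as software, firmware, hardware, and appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The embodiments of the present application have been described above, but the present application is not limited to these embodiments. Those skilled in the art may make various equivalent modifications or substitutions without departing from the spirit of the present application, and all such equivalent modifications or substitutions fall within the scope defined by the claims of the present application.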

Claims (15)

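  1. A data processing method, comprising:
    obtaining multiple pieces of first data information;
    preprocessing the multiple pieces of first data information to obtain second data information among the multiple pieces of first data information;
    determining multiple pieces of candidate data information from the multiple pieces of first data information; and
    screening, according to the second data information and the multiple pieces of candidate data information, the multiple pieces of candidate data information to obtain target data information, wherein the target data information has the same data type as the second data information.
  2. The data processing method according to claim 1, wherein the preprocessing the multiple pieces of first data information to obtain second data information among the multiple pieces of first data information comprises:
    performing frequency-based clustering on the multiple pieces of first data information to obtain multiple cluster sets, wherein different cluster sets have different center frequencies;
    determining, from the multiple cluster sets, target cluster sets whose center frequency is less than or equal to a frequency threshold, to obtain the second data information.
  3. The data processing method according to claim 2, wherein the first data information comprises alarm information, and the alarm information has an alarm number; and the performing frequency-based clustering on the multiple pieces of first data information to obtain multiple cluster sets comprises:
    counting the frequency of the alarm information according to the alarm number to obtain an alarm frequency of the alarm information;
    clustering all the alarm information according to the alarm frequency to obtain multiple cluster sets.
  4. The data processing method according to claim 2, wherein the first data information comprises log information; and the performing frequency-based clustering on the multiple pieces of first data information to obtain multiple cluster sets comprises:
    counting the frequency of the log information to obtain a log frequency of the log information;
    clustering all the log information according to the log frequency to obtain multiple cluster sets.
  5. The data processing method according to claim 4, wherein the counting the frequency of the log information to obtain a log frequency of the log information comprises:
    performing variable substitution on the log information to obtain alternative information;
    performing mapping processing on the alternative information to obtain mapping information; and
    counting the frequency of the log information according to the mapping information to obtain the log frequency of the log information.
  6. The data processing method according to claim 1, wherein the determining multiple pieces of candidate data information from the multiple pieces of first data information comprises:
    determining a target time period;
    determining, from the multiple pieces of first data information, multiple pieces of candidate data information that fall within the target time period.
  7. The data processing method according to claim 6, wherein the screening, according to the second data information and the multiple pieces of candidate data information, the multiple pieces of candidate data information to obtain target data information comprises:
    dividing the target time period according to a preset time length to obtain two or more filtering time periods;
    screening the candidate data information in each of the two or more filtering time periods to obtain a first information set for each filtering time period;
    deduplicating the first information set of each filtering time period to obtain a second information set for each filtering time period; and
    obtaining the union of all the second information sets to obtain the target data information.
  8. The data processing method according to claim 7, wherein the screening the candidate data information in each of the two or more filtering time periods to obtain a first information set for each filtering time period comprises:
    traversing the candidate data information in the two or more filtering time periods, and screening out, in each filtering time period, first alternative data information that has the same data type as the second data information;
    collecting the first alternative data information of each filtering time period to obtain the first information set of each filtering time period.
  9. The data processing method according to claim 6, wherein the screening, according to the second data information and the candidate data information, the candidate data information to obtain target data information comprises:
    dividing the target time period according to a preset time length to obtain a first filtering time period and a second filtering time period;
    screening the candidate data information in the first filtering time period to obtain a third information set for the first filtering time period;
    deduplicating the third information set to obtain a fourth information set;
    screening, according to the fourth information set, the candidate data information in the second filtering time period to obtain a fifth information set for the second filtering time period; and
    obtaining the union of the fourth information set and the fifth information set to obtain the target data information.
  10. The data processing method according to claim 9, wherein the screening the candidate data information in the first filtering time period to obtain a third information set for the first filtering time period comprises:
    traversing the candidate data information in the first filtering time period, and screening out second alternative data information that has the same data type as the second data information;
    collecting the second alternative data information to obtain the third information set of the first filtering time period.
  11. The data processing method according to claim 9, wherein the screening, according to the fourth information set, the candidate data information in the second filtering time period to obtain a fifth information set for the second filtering time period comprises:
    traversing the candidate data information in the second filtering time period, and screening out third alternative data information that has the same data type as the second data information and does not belong to the fourth information set;
    collecting the third alternative data information to obtain the fifth information set of the second filtering time period.
  12. The data processing method according to any one of claims 1 to 11, wherein the data processing method further comprises:
    performing information push processing on the target data information, so that the target data information is displayed.
  13. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the data processing method according to any one of claims 1 to 12.
  14. A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to perform the data processing method according to any one of claims 1 to 12.
  15. A computer program product, comprising a computer program or computer instructions, wherein the computer program or the computer instructions are stored in a computer-readable storage medium, a processor of a computer device reads the computer program or the computer instructions from the computer-readable storage medium, and the processor executes the computer program or the computer instructions, causing the computer device to perform the data processing method according to any one of claims 1 to 12.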
PCT/CN2022/091576 2021-09-14 2022-05-07 Data processing method, electronic device, storage medium, and program product WO2023040300A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111083803.4 2021-09-14
CN202111083803.4A CN115809160A (zh) 2021-09-14 2021-09-14 Data processing method, electronic device, storage medium, and program product

Publications (1)

Publication Number Publication Date
WO2023040300A1 true WO2023040300A1 (zh) 2023-03-23

Family

ID=85481075

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/091576 WO2023040300A1 (zh) 2021-09-14 2022-05-07 Data processing method, electronic device, storage medium, and program product

Country Status (2)

Country Link
CN (1) CN115809160A (zh)
WO (1) WO2023040300A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017197566A1 (zh) * 2016-05-16 2017-11-23 华为技术有限公司 一种日志显示的方法、设备及系统
CN109885456A (zh) * 2019-02-20 2019-06-14 武汉大学 一种基于系统日志聚类的多类型故障事件预测方法及装置
US20190354457A1 (en) * 2018-05-21 2019-11-21 Oracle International Corporation Anomaly detection based on events composed through unsupervised clustering of log messages
CN112448836A (zh) * 2019-09-04 2021-03-05 中兴通讯股份有限公司 故障根因确定方法、装置、服务器和计算机可读介质
CN112612887A (zh) * 2020-12-25 2021-04-06 北京天融信网络安全技术有限公司 日志处理方法、装置、设备和存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU MING, DONG SHAO-HUA: "Optimization of Hardware Inventory Based on Cluster-Algorithm Analysis", JOURNAL OF TONGLING UNIVERSITY, no. 1, 15 February 2013 (2013-02-15), pages 100 - 104, XP093048952, ISSN: 1672-0547, DOI: 10.16394/j.cnki.34-1258/z.2013.01.001 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116680752A (zh) * 2023-05-23 2023-09-01 杭州水立科技有限公司 一种基于数据处理的水利工程安全监测方法及系统
CN116680752B (zh) * 2023-05-23 2024-03-19 杭州水立科技有限公司 一种基于数据处理的水利工程安全监测方法及系统

Also Published As

Publication number Publication date
CN115809160A (zh) 2023-03-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22868678

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE