WO2022095434A1 - Auto-encoder-based data anomaly identification method and apparatus and computer device - Google Patents


Info

Publication number
WO2022095434A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
time series
autoencoder
autoencoders
specified
Application number
PCT/CN2021/097550
Other languages
French (fr)
Chinese (zh)
Inventor
邓悦
郑立颖
徐亮
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2022095434A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular to a method, device and computer equipment for identifying data anomalies based on an autoencoder.
  • most of the current anomaly detection methods are based on statistics, mainly including deviation-based methods, methods based on the distribution of specified recommendation scores, distance-based methods and density-based methods, etc.
  • these types of methods need to know the distribution of the data in advance.
  • Most statistics-based anomaly detection algorithms are only suitable for mining univariate numerical data and are not suitable for time series data. If they are applied directly to time series data, the effect is not ideal and the recognition accuracy for abnormal data is low.
  • The main purpose of this application is to provide an autoencoder-based data anomaly identification method, device, computer equipment and storage medium, aiming to solve the problem that existing anomaly detection methods are not applicable to time series data: if they are applied directly to time series data, the effect is not ideal and the recognition accuracy for abnormal data is low.
  • The present application proposes a method for identifying data anomalies based on an autoencoder, the method comprising the steps of:
  • receiving an input time series to be detected;
  • performing, based on the time series, integrated training on a pre-generated specified number of sparsely connected autoencoders according to preset rules, and generating a corresponding autoencoder integration framework, wherein the sparsely connected autoencoders are generated by performing unit connection deletion on the specified number of recurrent-neural-network-based autoencoders respectively;
  • calculating, through the autoencoder integration framework, the abnormal score value corresponding to each vector included in the time series;
  • identifying, according to the abnormal score values, whether there is an abnormal data value in the time series.
  • the present application also provides a device for identifying data anomalies based on an autoencoder, including:
  • a receiving module for receiving the input time series to be detected
  • a training module configured to perform, based on the time series, integrated training on a pre-generated specified number of sparsely connected autoencoders according to preset rules, and to generate a corresponding autoencoder integration framework, wherein the sparsely connected autoencoders are generated by performing unit connection deletion on a specified number of recurrent-neural-network-based autoencoders respectively;
  • a calculation module configured to calculate the abnormal score value corresponding to each vector included in the time series through the autoencoder integration framework
  • An identification module configured to identify whether there is an abnormal data value in the time series according to the abnormal score value.
  • The present application also provides a computer device, including a memory and a processor, wherein a computer program is stored in the memory, and the processor implements a method for identifying data anomalies based on an autoencoder when executing the computer program, wherein the method includes the following steps:
  • receiving an input time series to be detected;
  • performing, based on the time series, integrated training on a pre-generated specified number of sparsely connected autoencoders according to preset rules, and generating a corresponding autoencoder integration framework, wherein the sparsely connected autoencoders are generated by performing unit connection deletion on the specified number of recurrent-neural-network-based autoencoders respectively;
  • calculating, through the autoencoder integration framework, the abnormal score value corresponding to each vector included in the time series;
  • identifying, according to the abnormal score values, whether there is an abnormal data value in the time series.
  • The present application also provides a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements an autoencoder-based data anomaly identification method, wherein the method includes the following steps:
  • receiving an input time series to be detected;
  • performing, based on the time series, integrated training on a pre-generated specified number of sparsely connected autoencoders according to preset rules, and generating a corresponding autoencoder integration framework, wherein the sparsely connected autoencoders are generated by performing unit connection deletion on the specified number of recurrent-neural-network-based autoencoders respectively;
  • calculating, through the autoencoder integration framework, the abnormal score value corresponding to each vector included in the time series;
  • identifying, according to the abnormal score values, whether there is an abnormal data value in the time series.
  • The method, device, computer equipment and storage medium for autoencoder-based data anomaly identification provided in this application effectively improve the identification accuracy for abnormal data values in time series, and the identification processing efficiency for abnormal data values in time series is high.
  • FIG. 1 is a schematic flowchart of a method for identifying data anomalies based on an autoencoder according to an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of an apparatus for identifying data anomalies based on an autoencoder according to an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • RNN: Recurrent Neural Network.
  • The specific mechanism of an RNN: the network memorizes previous information and applies it to the calculation of the current output. That is, the nodes between hidden layers are no longer unconnected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
  • The functional characteristics of an RNN: 1. the hidden layer nodes can be interconnected or self-connected; 2. an output at every step is not necessary, and an input at every step is not necessary.
  • RNN uses: language model and text generation research, machine translation, speech recognition, image description generation.
  • Autoencoder: a kind of neural network that, after training, tries to copy its input to its output. The autoencoder contains a hidden layer h, which generates an encoded representation of the input.
  • Each vector $s_t$ in the time series is fed to an RNN unit in the encoder of the autoencoder, which performs the following computation: $h_t^{(E)} = f(s_t, h_{t-1}^{(E)})$, where $s_t$ is the vector at time step t in the time series, the hidden state $h_{t-1}^{(E)}$ is the output of the previous RNN unit at time step t-1 in the encoder, and $f(\cdot)$ is a nonlinear function.
  • In this way, the hidden state $h_t^{(E)}$ of the encoder's current RNN unit is obtained at time step t and is then fed into the next RNN unit at time step t+1.
  • In the decoder, the time series is reconstructed in reverse order, i.e. $(\hat{s}_C, \hat{s}_{C-1}, \ldots, \hat{s}_1)$, where C is the length of the time series. First, the last hidden state of the encoder is used as the first hidden state of the decoder. Based on the previous hidden state $h_{t+1}^{(D)}$ and the previously reconstructed vector $\hat{s}_{t+1}$, the decoder reconstructs the current vector $\hat{s}_t$ and calculates the current hidden state $h_t^{(D)} = g(\hat{s}_{t+1}, h_{t+1}^{(D)})$, where $g(\cdot)$ is a nonlinear function.
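The encoder and decoder recurrences described above can be illustrated with a minimal, untrained sketch. The tanh cell, the scalar weights `w_x`, `w_h`, `w_out`, the linear output layer, and the choice of a hidden size equal to the input size are all assumptions made for illustration; the source only states that f and g are nonlinear functions.

```python
import math

def rnn_cell(x, h, w_x, w_h):
    """Toy nonlinear cell (assumed): tanh(w_x * x + w_h * h), elementwise."""
    return [math.tanh(w_x * xi + w_h * hi) for xi, hi in zip(x, h)]

def encode(series, w_x=0.5, w_h=0.5):
    """Apply h_t = f(s_t, h_{t-1}) over the series; return the last hidden state."""
    h = [0.0] * len(series[0])
    for s_t in series:
        h = rnn_cell(s_t, h, w_x, w_h)
    return h

def decode(h_last, length, w_x=0.5, w_h=0.5, w_out=1.0):
    """Reconstruct the series in reverse order, starting from the encoder's
    last hidden state, using h_t = g(s_hat_{t+1}, h_{t+1})."""
    h = list(h_last)
    s_hat = [w_out * hi for hi in h]          # reconstruction of the last vector
    reversed_out = [s_hat]
    for _ in range(length - 1):
        h = rnn_cell(s_hat, h, w_x, w_h)      # g(.) shares the toy cell here
        s_hat = [w_out * hi for hi in h]
        reversed_out.append(s_hat)
    return reversed_out[::-1]                 # back to chronological order

series = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.4]]
reconstruction = decode(encode(series), len(series))
```

With untrained weights the reconstruction is not meaningful; the sketch only shows the direction of information flow in the two recurrences.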
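  • An autoencoder-based data anomaly identification method includes:
  • S1: receiving an input time series to be detected;
  • S2: performing, based on the time series, integrated training on a pre-generated specified number of sparsely connected autoencoders according to preset rules, and generating a corresponding autoencoder integration framework;
  • S3: calculating, through the autoencoder integration framework, the abnormal score value corresponding to each vector included in the time series;
  • S4: identifying, according to the abnormal score values, whether there is an abnormal data value in the time series.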
  • The execution subject of this method embodiment is an autoencoder-based data anomaly identification apparatus.
  • The apparatus can be implemented as a virtual device, such as software code, or as a physical device in which the relevant execution code is written or integrated, and can interact with users through a keyboard, mouse, remote control, touchpad or voice control device.
  • The apparatus in this embodiment can quickly and accurately identify abnormal data values in the time series to be detected. Specifically, the input time series to be detected is received first.
  • The time series to be detected is a time series that is to be checked for abnormal data values.
  • The time series may be a KPI (Key Performance Indicator) time series of a server, and the data included in the time series is in vector form.
  • Then, based on the time series, integrated training is performed on a pre-generated specified number of sparsely connected autoencoders according to preset rules to generate a corresponding autoencoder integration framework, wherein the sparsely connected autoencoders are generated by performing unit connection deletion on a specified number of RNN-based autoencoders respectively.
  • The generation process of the sparsely connected autoencoders may include: first obtaining a specified number of recurrent-neural-network-based autoencoders.
  • The recurrent-neural-network-based autoencoder may specifically be a recurrent neural network autoencoder using additional auxiliary connections (RSCN), which adds auxiliary connections between the RNN units. The specified number is not specifically limited and can be set according to actual needs. In this embodiment, the specified number may be taken as N.
  • Unit connection deletion is then performed on each of the recurrent-neural-network-based autoencoders to generate a corresponding number of sparsely connected autoencoders. Because the recurrent neural network autoencoder with additional auxiliary connections adds auxiliary connections between the RNN units, some of the auxiliary connections between RNN units can be cut off so that the resulting networks differ from one another.
  • The unit connection deletion performed on each recurrent-neural-network-based autoencoder may include: for an autoencoder based on a recurrent neural network using additional auxiliary connections, introducing a sparse weight vector that controls which auxiliary connections are removed at each time step t.
  • A sparsely connected autoencoder can be generated based on the sparse weight vector $w_t$, and the hidden state of each RNN unit in the resulting sparsely connected autoencoder is calculated as follows: $h_t = \dfrac{f(s_t, h_{t-1}) \cdot w_t^{(f)} + f'(s_t, h_{t-L}) \cdot w_t^{(f')}}{\lVert w_t \rVert_0}$, where $s_t$ is the vector at time step t in the input time series, $h_{t-1}$ is the hidden state at time step t-1 in the encoder of the sparsely connected recurrent autoencoder, $h_{t-L}$ is the hidden state at time step t-L in the same encoder, $f(\cdot)$ and $f'(\cdot)$ are nonlinear functions, $w_t = (w_t^{(f)}, w_t^{(f')})$ is the sparse weight vector whose elements select the connections from $h_{t-1}$ and $h_{t-L}$, and $\lVert w_t \rVert_0$ represents the number of non-zero elements in the vector $w_t$.
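The hidden-state update with a sparse weight vector can be sketched as follows. The two tanh cells `f_cell` and `f_prime_cell`, their scalar weights, and the requirement that at least one element of `w_t` be non-zero are assumptions for illustration; only the roles of $h_{t-1}$, $h_{t-L}$, $w_t$ and $\lVert w_t \rVert_0$ come from the text.

```python
import math

def f_cell(s_t, h_prev):
    """Assumed nonlinear function on the regular connection from h_{t-1}."""
    return [math.tanh(0.5 * s + 0.5 * h) for s, h in zip(s_t, h_prev)]

def f_prime_cell(s_t, h_skip):
    """Assumed nonlinear function on the auxiliary connection from h_{t-L}."""
    return [math.tanh(0.5 * s + 0.3 * h) for s, h in zip(s_t, h_skip)]

def sparse_hidden_state(s_t, h_prev, h_skip, w_t):
    """Combine both connections, weighted by w_t and normalised by the
    number of non-zero elements of w_t (its L0 'norm')."""
    w_f, w_fp = w_t
    nonzero = int(w_f != 0) + int(w_fp != 0)
    if nonzero == 0:
        raise ValueError("w_t must keep at least one connection")
    a = f_cell(s_t, h_prev)
    b = f_prime_cell(s_t, h_skip)
    return [(w_f * x + w_fp * y) / nonzero for x, y in zip(a, b)]

# w_t = (1, 0) removes the auxiliary connection at this time step.
h_t = sparse_hidden_state([0.2], [0.1], [0.4], (1, 0))
```

Setting `w_t = (1, 1)` keeps both connections and averages their contributions, while `(0, 1)` keeps only the auxiliary connection.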
  • The unit connection deletion can also be performed by randomly deleting connections according to actual needs. For each RNN-based autoencoder, the connections of some RNN units are randomly deleted to obtain a sparsely connected autoencoder. The reconstruction errors obtained by the different sparsely connected autoencoders after reconstructing the time series therefore differ from one another, which effectively expands the scope of application of the autoencoders and enhances their reliability, accuracy and generalization.
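Random connection deletion can be sketched as drawing, for each autoencoder and each time step, a weight pair that keeps at least one of the two connections. The function names, the seeding scheme and the uniform choice among the three admissible pairs are hypothetical; the text only says that connections are deleted at random.

```python
import random

# Admissible weight pairs: (regular, auxiliary); (0, 0) is excluded so that
# every RNN unit keeps at least one incoming connection (an assumption).
ADMISSIBLE = [(1, 0), (0, 1), (1, 1)]

def random_sparse_weights(num_steps, seed):
    """Draw one sparse weight pair per time step for a single autoencoder."""
    rng = random.Random(seed)
    return [rng.choice(ADMISSIBLE) for _ in range(num_steps)]

def build_ensemble_masks(n_autoencoders, num_steps, base_seed=0):
    """One independently drawn mask sequence per autoencoder in the ensemble."""
    return [random_sparse_weights(num_steps, base_seed + i)
            for i in range(n_autoencoders)]

masks = build_ensemble_masks(n_autoencoders=4, num_steps=10)
```

Using a different seed per autoencoder is one simple way to obtain networks with different structures, which is what makes their reconstruction errors differ.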
  • the above-mentioned autoencoder integration framework may include an independent framework and a shared framework.
  • For the independent framework, a corresponding first objective function can be generated based on all the vectors contained in the time series and the reconstruction vector, generated by the sparsely connected autoencoder, corresponding to each vector contained in the time series. Each sparsely connected autoencoder is then trained separately based on the first objective function to obtain the autoencoder integration framework.
  • For the shared framework, a corresponding second objective function can be generated based on all the vectors contained in the time series, the reconstruction vector, generated by the sparsely connected autoencoder, corresponding to each vector contained in the time series, and the preset shared hidden state. All sparsely connected autoencoders are then jointly trained based on the second objective function to obtain the autoencoder integration framework.
  • Next, the abnormal score value corresponding to each vector included in the time series is calculated by the autoencoder integration framework.
  • Each autoencoder included in the autoencoder integration framework can calculate a reconstruction error corresponding to each vector contained in the time series. For any specified vector in the time series, the median of all the reconstruction errors corresponding to that specified vector is then calculated, which gives the abnormal score value corresponding to the specified vector. Finally, whether there is an abnormal data value in the time series is identified according to the abnormal score values.
  • Whether there are abnormal data values in the time series can be identified according to a preset abnormal threshold. If the abnormal score value corresponding to any specified vector in the time series is greater than the abnormal threshold, the specified vector is determined to be an abnormal data value. If the abnormal score value corresponding to the specified vector is not greater than the abnormal threshold, the specified vector is determined to be a normal data value, that is, the specified vector is not an abnormal data value.
  • This embodiment adopts an autoencoder-based integration framework to perform data anomaly identification on time series.
  • The original recurrent neural network autoencoder is improved to generate sparsely connected autoencoders, and the pre-generated sparsely connected autoencoders are then integrated and trained based on the time series to generate an autoencoder integration framework that can be used for outlier identification in time series data.
  • The autoencoder integration framework can thus be used to calculate the abnormal score value corresponding to each vector included in the time series, and whether abnormal data values exist in the time series can then be identified quickly and accurately according to the abnormal score values. This effectively improves the identification accuracy for abnormal data values in the time series, and the identification processing efficiency is high.
  • step S2 includes:
  • S203 Train each of the sparsely connected autoencoders based on the first objective function to obtain trained first autoencoders, wherein the number of the first autoencoders is the same as the number of the sparsely connected autoencoders;
  • S204 Integrate all the first autoencoders to generate a corresponding independent framework, wherein the independent framework includes a specified number of the first autoencoders, and no interaction occurs among the first autoencoders;
  • The autoencoder integration framework may be an independent framework generated from all the sparsely connected autoencoders. The independent framework is trained by training each of the different sparsely connected autoencoders independently, so the sparsely connected autoencoders do not interact during the training phase, nor do the autoencoders contained in the generated independent framework interact.
  • The step of performing integrated training on the pre-generated specified number of sparsely connected autoencoders according to preset rules and generating the corresponding autoencoder integration framework may include: first obtaining all first vectors included in the time series.
  • The first objective function may be: $J_i = \sum_{t} \lVert s_t - \hat{s}_t^{(D_i)} \rVert_2^2$, where $J_i$ is the first objective function, $s_t$ is the vector at time step t in the time series, $\hat{s}_t^{(D_i)}$ is the reconstructed vector for $s_t$ generated at time step t by the decoder $D_i$ of the i-th sparsely connected autoencoder, and $\lVert \cdot \rVert_2$ is the L2-norm of a vector.
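As a worked illustration of the first objective function, the following sketch sums the squared L2 distance between each vector and its reconstruction. The function names are hypothetical, and the reconstruction is supplied by hand rather than produced by a trained decoder.

```python
def squared_l2(u, v):
    """Squared L2 distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def first_objective(series, reconstruction):
    """J_i: total squared reconstruction error of one autoencoder over all time steps."""
    return sum(squared_l2(s_t, s_hat_t)
               for s_t, s_hat_t in zip(series, reconstruction))

series = [[1.0, 2.0], [3.0, 4.0]]
reconstruction = [[1.0, 2.0], [3.0, 2.0]]   # only the last component is off by 2
loss = first_objective(series, reconstruction)
```

Here the only mismatch contributes (4 - 2)^2, so the loss is 4.0; a perfect reconstruction gives 0.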
  • Each of the sparsely connected autoencoders is trained based on the first objective function to obtain a trained first autoencoder, wherein the number of first autoencoders is the same as the number of sparsely connected autoencoders.
  • After the first autoencoders are obtained, all the first autoencoders are integrated to generate the corresponding independent framework.
  • The independent framework includes a specified number of the first autoencoders, and there is no interaction among the first autoencoders.
  • All the first autoencoders can be integrated into a preset integration framework to generate the independent framework.
  • Each decoder $D_i$ in the independent framework has its own independent hidden state, which is used as its initial hidden state and is obtained as a linear combination with the corresponding weight matrix.
  • Finally, the independent framework is determined as the autoencoder integration framework.
  • In this embodiment, an independent framework composed of a specified number of sparsely connected autoencoders with different network structures is generated through training. Because the reconstruction errors from multiple autoencoders are considered when the independent framework is used for anomaly detection, the variance of the overall reconstruction error is reduced. The abnormal score value corresponding to each vector included in the time series can therefore be calculated accurately with the independent framework, and whether abnormal data values exist in the time series can then be identified quickly and accurately according to the abnormal score values, which effectively improves the identification efficiency and accuracy for abnormal data values in the time series.
  • step S2 includes:
  • S210 Acquire a preset shared layer, wherein the shared layer includes a shared hidden state
  • S215 Generate a corresponding second objective function according to the processed shared hidden state, the second vector and the second reconstruction vector;
  • S216 Jointly train all the sparsely connected autoencoders based on the second objective function to obtain trained second autoencoders, wherein the number of the second autoencoders is the same as the number of the sparsely connected autoencoders;
  • S217 Integrate all the second autoencoders to generate a corresponding shared framework, wherein the shared framework includes a specified number of the second autoencoders, and there is interaction among the second autoencoders;
  • S218 Determine the shared framework as the autoencoder integration framework.
  • The autoencoder integration framework may be a shared framework, containing the different autoencoders, generated from all the sparsely connected autoencoders and a preset shared layer. Because the shared framework includes interaction between the different autoencoders, it can further improve the recognition accuracy for abnormal data values in time series compared with the independent framework.
  • The step of performing integrated training on the pre-generated specified number of sparsely connected autoencoders according to preset rules and generating a corresponding autoencoder integration framework may include: first obtaining a preset shared layer, and performing weight sharing on all the sparsely connected autoencoders through the shared layer, wherein the shared layer includes a shared hidden state.
  • The shared layer concatenates the linear combinations of the last hidden states of the encoders of all the sparsely connected autoencoders with the corresponding weight matrices. Specifically, the shared layer, i.e. the shared hidden state, is $h_C^{(E)} = \mathrm{concatenate}(h_C^{(E_1)} \cdot W^{(E_1)}, \ldots, h_C^{(E_N)} \cdot W^{(E_N)})$, where $h_C^{(E_i)}$ is the last hidden state of the i-th encoder and $W^{(E_i)}$ is the corresponding weight matrix. Then, L1 regularization is performed on the shared hidden state to obtain the processed shared hidden state.
  • With L1 regularization, the shared hidden state can be made sparse. This in turn prevents some encoders from overfitting the time series, making the decoders more robust and less susceptible to outlier data values.
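The construction of the shared hidden state can be sketched as mapping each encoder's last hidden state through its own weight matrix and concatenating the results; the L1 norm of that vector is the quantity that regularization penalizes. The helper names and the tiny hand-picked matrices are assumptions for illustration.

```python
def matvec(h, W):
    """Row vector h (length d) times matrix W (d rows, k columns)."""
    return [sum(h[r] * W[r][c] for r in range(len(h)))
            for c in range(len(W[0]))]

def shared_hidden_state(last_states, weights):
    """Concatenate h_C^(E_i) . W^(E_i) over all encoders i."""
    shared = []
    for h, W in zip(last_states, weights):
        shared.extend(matvec(h, W))
    return shared

def l1_norm(v):
    """Sum of absolute values; penalising it pushes entries towards zero."""
    return sum(abs(x) for x in v)

last_states = [[1.0, 2.0], [0.5, -1.0]]            # two encoders, hidden size 2
weights = [[[1.0], [0.0]], [[0.0], [2.0]]]         # each maps 2 -> 1 dimension
shared = shared_hidden_state(last_states, weights)
penalty = l1_norm(shared)
```

In this toy case the shared state is `[1.0, -2.0]` and its L1 norm is 3.0.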
  • Next, all second vectors included in the time series are obtained.
  • A reconstructed time series corresponding to the time series is generated, and the vectors contained in the reconstructed time series can be regarded as the second reconstruction vectors corresponding to the respective second vectors. Then, a corresponding second objective function is generated according to the processed shared hidden state, the second vectors, and the second reconstruction vectors.
  • The second objective function may specifically be: $J = \sum_{i=1}^{N} J_i + \lambda \lVert h_C^{(E)} \rVert_1 = \sum_{i=1}^{N} \sum_{t} \lVert s_t - \hat{s}_t^{(D_i)} \rVert_2^2 + \lambda \lVert h_C^{(E)} \rVert_1$, where $\lambda$ is the weight parameter controlling the importance of the L1 regularization term, $s_t$ is the vector at time step t in the time series, $\hat{s}_t^{(D_i)}$ is the reconstructed vector from the decoder $D_i$ at time step t, $h_C^{(E)}$ is the shared hidden state subject to L1 regularization, $\lVert \cdot \rVert_2$ is the L2-norm of a vector, N is the specified number of autoencoders, and $J_i$ is the first objective function.
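The second objective function can be illustrated numerically as the sum of the per-autoencoder reconstruction losses plus an L1 penalty on the shared hidden state. The function names, the hand-written reconstructions and the value of `lam` are assumptions; no training is performed here.

```python
def squared_l2(u, v):
    """Squared L2 distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def first_objective(series, reconstruction):
    """J_i for a single autoencoder."""
    return sum(squared_l2(s, r) for s, r in zip(series, reconstruction))

def second_objective(series, reconstructions, shared_state, lam):
    """J = sum_i J_i + lam * ||shared_state||_1, evaluated jointly."""
    reconstruction_loss = sum(first_objective(series, rec)
                              for rec in reconstructions)
    l1_penalty = sum(abs(x) for x in shared_state)
    return reconstruction_loss + lam * l1_penalty

series = [[1.0], [2.0]]
reconstructions = [
    [[1.0], [2.0]],   # autoencoder 1: perfect, J_1 = 0
    [[0.0], [2.0]],   # autoencoder 2: one error of size 1, J_2 = 1
]
joint_loss = second_objective(series, reconstructions,
                              shared_state=[1.0, -2.0], lam=0.5)
```

The joint loss here is 0 + 1 + 0.5 * 3 = 2.5. Because a single scalar is minimized, the gradients of all autoencoders are coupled through the shared term, which is the interaction the shared framework relies on.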
  • the shared framework includes a specified number of the second auto-encoders, and there is interaction between the second auto-encoders.
  • all of the above second autoencoders can be integrated into a preset integration framework to generate the above shared framework.
  • Finally, the shared framework is determined as the autoencoder integration framework. In this embodiment, a shared framework consisting of a specified number of sparsely connected autoencoders with different network structures is generated through training.
  • Because the reconstruction errors from multiple autoencoders are considered when the shared framework is used for anomaly detection, and the sparsely connected autoencoders can interact with one another, the variance of the overall reconstruction error is reduced further. The abnormal score value corresponding to each vector included in the time series can therefore be calculated accurately with the shared framework, and whether abnormal data values exist in the time series can then be identified quickly and accurately according to the abnormal score values, which effectively improves the identification efficiency and accuracy for abnormal data values in the time series.
  • step S3 includes:
  • S300 Calculate and generate a reconstruction error corresponding to a specified vector by each autoencoder included in the autoencoder integration framework, wherein the specified vector is any one of all vectors included in the time series;
  • S302 Determine the median as a specified abnormal score value corresponding to the specified vector in the time series.
  • The step of calculating the abnormal score value corresponding to each vector included in the time series through the autoencoder integration framework may specifically include: calculating, by each autoencoder included in the autoencoder integration framework, a reconstruction error corresponding to a specified vector, where the specified vector is any one of all the vectors included in the time series.
  • If the specified number is N, the N autoencoders included in the autoencoder integration framework can be used to generate N reconstruction errors $\{a_1, a_2, \ldots, a_N\}$ for a vector $s_k$.
  • The generation of the reconstruction errors may include: generating reconstructed time series corresponding to the time series by using the N autoencoders included in the autoencoder integration framework, extracting the reconstruction vector corresponding to the vector $s_k$ from each reconstructed time series, and then applying the calculation formula relating the vector $s_k$ to its reconstruction vector to obtain the reconstruction error corresponding to the vector $s_k$.
  • The median of all the reconstruction errors is then calculated and determined as the specified abnormal score value corresponding to the specified vector in the time series.
  • The median of the N reconstruction errors is therefore used as the final outlier score value of the vector $s_k$.
  • the above-mentioned independent framework and the above-mentioned shared framework use the same calculation formula to calculate the abnormal score value corresponding to each vector included in the above-mentioned time series.
  • In this embodiment, each autoencoder included in the autoencoder integration framework calculates a reconstruction error corresponding to the specified vector, and the median of all the reconstruction errors is taken as the specified abnormal score value corresponding to the specified vector in the time series. The abnormal score value corresponding to each vector included in the time series is thus calculated accurately, which helps to identify quickly and accurately, according to the abnormal score values, whether abnormal data values exist in the time series, and effectively improves the identification efficiency and accuracy for abnormal data values in time series.
  • step S300 includes:
  • S3000 Reconstruct the time series by using a specific autoencoder to obtain a specific reconstructed time series corresponding to the time series, where the specific autoencoder is any one of all the autoencoders included in the autoencoder integration framework;
  • S3002 Calculate a specific reconstruction error corresponding to the specified vector according to the specified vector and the specific reconstruction vector.
  • The step of calculating, by each autoencoder included in the autoencoder integration framework, the reconstruction error corresponding to the specified vector may specifically include: first reconstructing the time series by using a specific autoencoder to obtain a specific reconstructed time series corresponding to the time series, wherein the specific autoencoder is any one of all the autoencoders included in the autoencoder integration framework.
  • Then a specific reconstruction vector corresponding to the specified vector is extracted from the specific reconstructed time series.
  • For example, the specific reconstruction vector $\hat{s}_k^{(D_i)}$ corresponding to the specified vector $s_k$ can be extracted from the reconstructed time series generated by the specific autoencoder.
  • Finally, a specific reconstruction error corresponding to the specified vector is calculated according to the specified vector and the specific reconstruction vector.
  • The formula $a_i = \lVert s_k - \hat{s}_k^{(D_i)} \rVert_2^2$ can be used to calculate the specific reconstruction error corresponding to the specified vector. Further, the formula $OS(s_k) = \mathrm{median}(a_1, a_2, \ldots, a_N)$ is used to calculate the specified abnormal score value corresponding to the specified vector in the time series.
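The scoring step can be sketched end to end: each autoencoder contributes one reconstruction error per vector, and the median over the autoencoders is the abnormal score. The function names are hypothetical, the error is taken as the squared L2 distance (an assumption about the lost formula), and the reconstructed series are written by hand.

```python
from statistics import median

def reconstruction_error(s_k, s_hat_k):
    """Squared L2 distance between a vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(s_k, s_hat_k))

def anomaly_scores(series, reconstructions):
    """For each vector s_k, take the median of the N reconstruction errors,
    one per autoencoder in the ensemble."""
    scores = []
    for k, s_k in enumerate(series):
        errors = [reconstruction_error(s_k, rec[k]) for rec in reconstructions]
        scores.append(median(errors))
    return scores

series = [[0.0], [1.0]]
reconstructions = [          # one reconstructed series per autoencoder (N = 3)
    [[0.0], [1.0]],
    [[0.1], [3.0]],
    [[0.0], [2.0]],
]
scores = anomaly_scores(series, reconstructions)
```

For the second vector the three errors are 0, 4 and 1, so its score is the median, 1.0. Using the median rather than the mean keeps one badly behaved autoencoder from dominating the score.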
  • In this way, the reconstruction error corresponding to the specified vector can be calculated by each autoencoder included in the autoencoder integration framework, so that the abnormal score value corresponding to each vector included in the time series can be calculated quickly. This helps to identify quickly and accurately, according to the abnormal score values, whether abnormal data values exist in the time series, and effectively improves the identification efficiency and accuracy for abnormal data values in the time series.
  • step S4 includes:
  • S401 Determine whether there is a specified score value with a value greater than the abnormal threshold value among all the abnormal score values;
  • The step of identifying whether there is an abnormal data value in the time series according to the abnormal score values may specifically include: first obtaining a preset abnormal threshold.
  • The value of the abnormal threshold is not specifically limited; it can be generated by statistical calculation on historical time series data, or set according to actual needs. It is then judged whether any of the abnormal score values is a specified score value greater than the abnormal threshold. If such a specified score value exists, it is filtered out from all the abnormal score values.
  • The third vector corresponding to the specified score value is then found in the time series. Finally, once the third vector is obtained, it is determined as the abnormal data value.
  • In this embodiment, the autoencoder integration framework is used to calculate the abnormal score value corresponding to each vector included in the time series. By comparing the abnormal score values with the preset abnormal threshold, the specified score values that are greater than the abnormal threshold are found among all the abnormal score values, and the third vector corresponding to each specified score value in the time series is determined as an abnormal data value. This realizes accurate identification of the abnormal data values contained in the time series and effectively improves the identification efficiency for abnormal data in the time series.
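The thresholding step can be sketched as a strict comparison of each abnormal score with the threshold. The helper `threshold_from_history`, which uses the mean plus three population standard deviations of historical scores, is a hypothetical example of a "statistical calculation on historical data"; the source does not specify any particular rule.

```python
from statistics import mean, pstdev

def threshold_from_history(historical_scores, k=3.0):
    """Hypothetical rule: mean + k * standard deviation of past scores."""
    return mean(historical_scores) + k * pstdev(historical_scores)

def find_anomalies(series, scores, threshold):
    """Return (index, vector) pairs whose score is strictly greater than
    the threshold; all other vectors count as normal data values."""
    return [(k, s_k)
            for k, (s_k, score) in enumerate(zip(series, scores))
            if score > threshold]

series = [[0.0], [1.0], [9.0]]
scores = [0.1, 0.2, 5.0]
anomalies = find_anomalies(series, scores, threshold=1.0)
```

A score exactly equal to the threshold is treated as normal, matching the "not greater than" wording of the text.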
  • after the above step S404, the method includes:
  • S405 Screen out a fourth vector other than the third vector from the time series;
  • S409 Generate an abnormality analysis report corresponding to the time series according to the abnormal data value, the first quantity, the normal data, and the second quantity;
  • a corresponding abnormality analysis report may be further generated according to the abnormal data value and related data.
  • Specifically, after the above-mentioned third vector is determined as the abnormal data value, the method may further include the following. First, the fourth vectors, that is, the vectors other than the third vector, are screened out from the time series and marked as normal data values. Then the first quantity corresponding to the third vector is obtained, and at the same time the second quantity corresponding to the fourth vector is obtained.
  • Finally, an abnormality analysis report corresponding to the above-mentioned time series is generated according to the abnormal data value, the first quantity, the normal data, and the second quantity.
  • The abnormality analysis report includes at least the abnormal data value, the first quantity, the normal data, and the second quantity.
  • The abnormality analysis report is then displayed, so that the user can clearly understand from it the specific distribution and scale of the abnormal data values contained in the time series to be detected, as well as the specific distribution and scale of the normal data values.
  • The display method of the abnormality analysis report is not specifically limited, and can be set according to implementation requirements.
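As a rough illustration of the report described above, the sketch below splits a time series into abnormal and normal vectors and counts each group. It assumes that the "first quantity" and "second quantity" are simply the numbers of abnormal and normal vectors; the function name `build_report` and the dictionary keys are hypothetical.

```python
def build_report(time_series, abnormal_indices):
    """Split the series into abnormal and normal vectors and count each group."""
    abnormal_set = set(abnormal_indices)
    abnormal = [v for i, v in enumerate(time_series) if i in abnormal_set]
    normal = [v for i, v in enumerate(time_series) if i not in abnormal_set]
    return {
        "abnormal_data_values": abnormal,  # the "third vectors"
        "first_quantity": len(abnormal),   # assumed to be a simple count
        "normal_data_values": normal,      # the "fourth vectors"
        "second_quantity": len(normal),    # assumed to be a simple count
    }


# Hypothetical input: the vector at index 2 was flagged as abnormal.
report = build_report([[0.1], [0.2], [9.0]], abnormal_indices=[2])
```

How the report is rendered (table, chart, plain text) is left open, in line with the text above.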
  • the method for identifying data anomalies based on the autoencoder in the embodiments of the present application can also be applied to the blockchain field, for example, the data such as the above-mentioned autoencoder integration framework is stored on the blockchain.
  • By using the blockchain to store and manage the above-mentioned autoencoder integration framework, the security and immutability of the autoencoder integration framework can be effectively guaranteed.
  • the above-mentioned blockchain is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the underlying platform of the blockchain can include processing modules such as user management, basic services, smart contracts, and operation monitoring.
  • the user management module is responsible for the identity information management of all blockchain participants, including maintenance of public and private key generation (account management), key management, and maintenance of the corresponding relationship between the user's real identity and blockchain address (authority management), etc.
  • The basic service module is deployed on all blockchain node devices to verify the validity of business requests and, after consensus is reached on valid requests, to record them in storage.
  • For a new business request, the basic service first performs interface adaptation, parsing and authentication (interface adaptation), then encrypts the business information through the consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger after encryption (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, as well as contract triggering and contract execution.
  • Contract logic is defined in a programming language and published to the blockchain (contract registration); execution is triggered by a key or another event according to the logic of the contract terms, which completes the contract logic. The module also provides contract upgrade and cancellation functions.
  • The operation monitoring module is mainly responsible for deployment, configuration modification, contract settings and cloud adaptation during product release, and for the visual output of real-time status during product operation, such as alarms, monitoring of network conditions, and monitoring of the health status of node devices.
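The idea of data blocks linked by cryptographic methods can be illustrated with a toy hash chain. This is only a sketch under stated assumptions: it has no consensus, networking or smart contracts, and the block layout, the function names `make_block` and `chain_is_valid`, and the placeholder bytes standing in for a serialized autoencoder integration framework are all hypothetical.

```python
import hashlib
import json


def make_block(payload, previous_hash):
    """Create a block whose hash covers both its payload and the previous block's hash."""
    body = json.dumps({"payload": payload, "previous_hash": previous_hash}, sort_keys=True)
    return {
        "payload": payload,
        "previous_hash": previous_hash,
        "hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }


def chain_is_valid(chain):
    """Check every block's own hash and its link to the preceding block."""
    for i, block in enumerate(chain):
        expected = make_block(block["payload"], block["previous_hash"])["hash"]
        if block["hash"] != expected:
            return False
        if i > 0 and block["previous_hash"] != chain[i - 1]["hash"]:
            return False
    return True


# Hypothetical usage: record a digest of the serialized framework in a block.
genesis = make_block({"note": "genesis"}, previous_hash="0" * 64)
digest = hashlib.sha256(b"serialized framework bytes").hexdigest()
chain = [genesis, make_block({"framework_digest": digest}, genesis["hash"])]
```

Storing a digest rather than the full model is a design choice made for this sketch; any later change to a recorded payload makes `chain_is_valid` return `False`, which is the tamper-evidence property the text refers to.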
  • an embodiment of the present application also provides a device for identifying data anomalies based on an autoencoder, including:
  • the training module 2 is configured to perform, based on the time series and according to preset rules, integrated training on a pre-generated specified number of sparsely connected autoencoders and to generate a corresponding autoencoder integration framework, wherein the sparsely connected autoencoders are generated by separately deleting unit connections from a specified number of recurrent-neural-network-based autoencoders;
  • the calculation module 3 is configured to calculate, through the autoencoder integration framework, the abnormal score value corresponding to each vector included in the time series;
  • the identification module 4 is configured to identify whether there is an abnormal data value in the time series according to the abnormal score value.
  • the above-mentioned training module includes:
  • a first acquiring unit configured to acquire all the first vectors included in the time series
  • a second obtaining unit configured to obtain a one-to-one corresponding first reconstruction vector generated by each of the sparsely connected autoencoders based on each of the first vectors
  • a first generating unit for generating a corresponding first objective function based on the first vector and the first reconstruction vector
  • a first training unit configured to separately train each of the sparsely connected autoencoders based on the first objective function to obtain a trained first autoencoder, wherein the number of the first autoencoders is the same as the number of sparsely connected autoencoders;
  • a first processing unit configured to perform integrated processing on all the first autoencoders to generate a corresponding independent framework, wherein the independent framework includes a specified number of the first autoencoders, and there is no interaction between the first autoencoders;
  • a first determining unit configured to determine the independent framework as the autoencoder integration framework.
  • For the implementation of the functions and roles of the first acquiring unit, the second acquiring unit, the first generating unit, the first training unit, the first processing unit and the first determining unit in the above-mentioned autoencoder-based data abnormality identification device, refer to the implementation of steps S200 to S205 in the above-mentioned autoencoder-based data abnormality identification method, which will not be repeated here.
  • the above-mentioned training module includes:
  • a third acquiring unit configured to acquire a preset shared layer, wherein the shared layer includes a shared hidden state
  • a second processing unit configured to perform weight sharing processing on all the sparsely connected autoencoders through the sharing layer
  • a third processing unit configured to perform L1 regularization processing on the shared hidden state to obtain the processed shared hidden state
  • a fourth acquiring unit configured to acquire all the second vectors contained in the time series
  • a fifth obtaining unit configured to obtain a one-to-one corresponding second reconstruction vector generated by each of the sparsely connected autoencoders based on each of the second vectors;
  • a second generating unit configured to generate a corresponding second objective function according to the processed shared hidden state, the second vector and the second reconstruction vector;
  • a second training unit configured to jointly train all the sparsely connected autoencoders based on the second objective function to obtain trained second autoencoders, wherein the number of the second autoencoders is the same as the number of the sparsely connected autoencoders;
  • a fourth processing unit configured to perform integrated processing on all the second autoencoders to generate a corresponding shared framework, wherein the shared framework includes a specified number of the second autoencoders, and there is interaction between the second autoencoders;
  • a second determining unit configured to determine the shared framework as the autoencoder integration framework.
  • For the implementation of the functions and roles of the third acquiring unit, the second processing unit, the third processing unit, the fourth acquiring unit, the fifth acquiring unit, the second generating unit, the second training unit, the fourth processing unit and the second determining unit in the above-mentioned autoencoder-based data abnormality identification device, refer to the implementation of steps S210 to S218 in the above-mentioned autoencoder-based data abnormality identification method, which will not be repeated here.
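One plausible way to write a jointly trained objective of the kind the second generating unit produces is shown below. It is offered only as a sketch: the squared-error form of the reconstruction term, the symbols $N$ (number of sparsely connected autoencoders), $C$ (length of the time series), $\hat{s}_t^{(i)}$ (reconstruction of the second vector $s_t$ by the $i$-th autoencoder), $h^{\mathrm{shared}}$ (the shared hidden state) and the weight $\lambda$ are all assumptions rather than definitions taken from this text.

```latex
J \;=\; \sum_{i=1}^{N} \sum_{t=1}^{C} \bigl\lVert s_t - \hat{s}_t^{(i)} \bigr\rVert_2^2
\;+\; \lambda \,\bigl\lVert h^{\mathrm{shared}} \bigr\rVert_1
```

The first term rewards every autoencoder for reconstructing the series well, while the L1 term pushes the shared hidden state toward sparsity; because all autoencoders contribute to a single $J$, they are trained jointly rather than one by one.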
  • the above calculation module includes:
  • a first calculation unit configured to calculate, through each autoencoder included in the autoencoder integration framework, a reconstruction error corresponding to a specified vector, wherein the specified vector is any one of the vectors included in the time series;
  • a second computing unit for computing the median of all the reconstruction errors
  • a third determining unit configured to determine the median as a specified abnormal score value corresponding to the specified vector in the time series.
  • the above-mentioned first computing unit includes:
  • a processing subunit configured to perform reconstruction processing on the time series by using a specific autoencoder to obtain a specific reconstructed time series corresponding to the time series, wherein the specific autoencoder is any one of the autoencoders included in the autoencoder integration framework;
  • an extraction subunit configured to extract a specific reconstruction vector corresponding to the specified vector from the specific reconstructed time series;
  • a calculation subunit configured to calculate a specific reconstruction error corresponding to the specified vector according to the specified vector and the specific reconstruction vector.
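The calculation module described above (one reconstruction error per autoencoder, then the median as the abnormal score value) can be sketched as follows. The squared Euclidean distance used as the reconstruction error is an assumption, as are the function names and the toy reconstructions, which stand in for the outputs of three trained autoencoders.

```python
from statistics import median


def reconstruction_error(vector, reconstructed):
    """Squared Euclidean distance between a vector and its reconstruction (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(vector, reconstructed))


def abnormal_scores(time_series, reconstructions):
    """reconstructions[k][t] is autoencoder k's reconstruction of time_series[t]."""
    scores = []
    for t, vector in enumerate(time_series):
        # One error per autoencoder for the same specified vector.
        errors = [reconstruction_error(vector, rec[t]) for rec in reconstructions]
        # The median of those errors is the vector's abnormal score value.
        scores.append(median(errors))
    return scores


# Hypothetical reconstructions from an ensemble of three autoencoders.
series = [[1.0, 2.0], [3.0, 4.0]]
reconstructions = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 3.0], [3.0, 2.0]],
    [[2.0, 2.0], [0.0, 4.0]],
]
scores = abnormal_scores(series, reconstructions)
```

Using the median rather than the mean keeps a single badly fitted autoencoder from dominating the score of a vector.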
  • the above-mentioned identification module includes:
  • a sixth obtaining unit used for obtaining a preset abnormal threshold
  • a judging unit configured to judge whether, among all the abnormal score values, there is a specified score value greater than the abnormal threshold;
  • a first screening unit configured to filter out the specified score value from all the abnormal score values if so;
  • a search unit configured to search out a third vector corresponding to the specified score value from the time series
  • a fourth determination unit configured to determine the third vector as the abnormal data value.
  • For the implementation of the functions and roles of the sixth acquiring unit, the judging unit, the first screening unit, the search unit and the fourth determining unit in the above-mentioned autoencoder-based data abnormality identification device, refer to the implementation of steps S400 to S404 in the above-mentioned autoencoder-based data abnormality identification method, which will not be repeated here.
  • the above-mentioned identification module includes:
  • a second screening unit configured to screen out a fourth vector other than the third vector from the time series
  • a marking unit configured to mark the fourth vector as a normal data value;
  • a seventh obtaining unit for obtaining the first quantity corresponding to the third vector
  • an eighth obtaining unit configured to obtain a second quantity corresponding to the fourth vector
  • a third generating unit configured to generate an abnormality analysis report corresponding to the time series according to the abnormal data value, the first quantity, the normal data, and the second quantity;
  • the display unit is used to display the abnormality analysis report.
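  • For the implementation of the functions and roles of the second screening unit, the marking unit, the seventh acquiring unit, the eighth acquiring unit, the third generating unit and the display unit in the above-mentioned autoencoder-based data abnormality identification device, refer to the implementation of the corresponding steps in the above-mentioned autoencoder-based data abnormality identification method, which will not be repeated here.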
  • an embodiment of the present application further provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3 .
  • the computer device includes a processor, a memory, a network interface, a display screen, an input device and a database connected by a system bus. The processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium, an internal memory.
  • the nonvolatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer device is used to store data such as time series to be detected, sparsely connected autoencoders, autoencoder integration frameworks, abnormal score values, and abnormal data values.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the display screen of the computer equipment is an indispensable graphic and text output device in the computer, which is used to convert digital signals into optical signals, so that text and graphics can be displayed on the screen of the display screen.
  • the input device of the computer equipment is the main device for information exchange between the computer and the user or other devices, and is used to transmit data, instructions and certain flag information to the computer. When the computer program is executed by the processor, a method for identifying data anomalies based on an autoencoder is realized.
  • the above-mentioned processor performs the steps of the above-mentioned autoencoder-based data anomaly identification method:
  • integrated training is performed on a pre-generated specified number of sparsely connected autoencoders according to preset rules, and a corresponding autoencoder integration framework is generated, wherein the sparsely connected autoencoders are generated by separately deleting unit connections from a specified number of recurrent-neural-network-based autoencoders;
  • according to the abnormal score value, it is identified whether there is an abnormal data value in the time series.
  • FIG. 3 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the apparatus or computer equipment to which the solution of the present application is applied.
  • An embodiment of the present application further provides a computer-readable storage medium, which may be non-volatile or volatile and on which a computer program is stored. When executed by a processor, the computer program implements the autoencoder-based data abnormality identification method shown in any of the above-mentioned exemplary embodiments, the method comprising the following steps:
  • integrated training is performed on a pre-generated specified number of sparsely connected autoencoders according to preset rules, and a corresponding autoencoder integration framework is generated, wherein the sparsely connected autoencoders are generated by separately deleting unit connections from a specified number of recurrent-neural-network-based autoencoders;
  • according to the abnormal score value, it is identified whether there is an abnormal data value in the time series.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The present application relates to the technical field of artificial intelligence, and provided therein are an auto-encoder-based data anomaly identification method and apparatus, a computer device, and a storage medium. The method comprises: receiving an inputted time sequence to be detected; performing, on the basis of the time sequence and according to a preset rule, integration training on a specified quantity of pre-generated and sparsely connected auto-encoders so as to generate a corresponding auto-encoder integrated framework; calculating, by means of the auto-encoder integrated framework, an abnormal score value corresponding to each vector comprised in the time sequence; and identifying, according to the abnormal score value, whether an abnormal data value is present in the time sequence. By means of the present application, it can be accurately identified whether an abnormal data value is present in a time sequence, thus effectively improving the accuracy of identifying abnormal data values in time sequences. The present application further relates to the field of blockchains, and the auto-encoder integrated framework can be stored in a blockchain.

Description

Autoencoder-based data anomaly identification method, device and computer equipment
This application claims priority to Chinese patent application No. 202011242143.5, filed with the China Patent Office on November 09, 2020 and entitled "Autoencoder-based data anomaly identification method, device and computer equipment", the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to an autoencoder-based data anomaly identification method, device and computer equipment.
Background
With the advent of the era of big data, emerging topics such as cloud computing and the Internet of Things have appeared, and mining from massive data the potential data that people ultimately need has become more and more important. Traditional data mining focuses mainly on data models that contain large amounts of data and pays less attention to the detection of abnormal data. Analyzing and mining useful data is certainly important, but outliers that deviate significantly from the rest of the data also carry a great deal of useful information; they can distort the data so that correct results cannot be obtained. The detection of abnormal data therefore cannot be ignored either.
In the prior art, most current anomaly detection methods are built on statistics, mainly including deviation-based methods, methods based on the distribution of specified recommendation score values, distance-based methods and density-based methods. However, the inventors realized that these types of methods need to know the distribution of the data in advance. In addition, most statistics-based anomaly detection algorithms are only suitable for mining univariate numerical data and are not suitable for time series data; applied directly to time series data, they perform poorly and identify abnormal data with low accuracy.
Technical Problem
The main purpose of the present application is to provide an autoencoder-based data anomaly identification method, device, computer equipment and storage medium, aiming to solve the technical problem that existing anomaly detection methods are not suitable for time series data: applied directly to time series data they perform poorly, and their accuracy in identifying abnormal data is low.
Technical Solution
The present application proposes an autoencoder-based data anomaly identification method, the method comprising the steps of:
receiving an input time series to be detected;
based on the time series, performing integrated training on a pre-generated specified number of sparsely connected autoencoders according to preset rules to generate a corresponding autoencoder integration framework, wherein the sparsely connected autoencoders are generated by separately deleting unit connections from a specified number of recurrent-neural-network-based autoencoders;
calculating, through the autoencoder integration framework, the abnormal score value corresponding to each vector included in the time series;
identifying, according to the abnormal score value, whether there is an abnormal data value in the time series.
The present application also provides an autoencoder-based data anomaly identification device, including:
a receiving module, configured to receive an input time series to be detected;
a training module, configured to perform, based on the time series and according to preset rules, integrated training on a pre-generated specified number of sparsely connected autoencoders to generate a corresponding autoencoder integration framework, wherein the sparsely connected autoencoders are generated by separately deleting unit connections from a specified number of recurrent-neural-network-based autoencoders;
a calculation module, configured to calculate, through the autoencoder integration framework, the abnormal score value corresponding to each vector included in the time series;
an identification module, configured to identify, according to the abnormal score value, whether there is an abnormal data value in the time series.
The present application also provides a computer device, including a memory and a processor, wherein a computer program is stored in the memory, and the processor, when executing the computer program, implements an autoencoder-based data anomaly identification method comprising the following steps:
receiving an input time series to be detected;
based on the time series, performing integrated training on a pre-generated specified number of sparsely connected autoencoders according to preset rules to generate a corresponding autoencoder integration framework, wherein the sparsely connected autoencoders are generated by separately deleting unit connections from a specified number of recurrent-neural-network-based autoencoders;
calculating, through the autoencoder integration framework, the abnormal score value corresponding to each vector included in the time series;
identifying, according to the abnormal score value, whether there is an abnormal data value in the time series.
The present application also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements an autoencoder-based data anomaly identification method comprising the following steps:
receiving an input time series to be detected;
based on the time series, performing integrated training on a pre-generated specified number of sparsely connected autoencoders according to preset rules to generate a corresponding autoencoder integration framework, wherein the sparsely connected autoencoders are generated by separately deleting unit connections from a specified number of recurrent-neural-network-based autoencoders;
calculating, through the autoencoder integration framework, the abnormal score value corresponding to each vector included in the time series;
identifying, according to the abnormal score value, whether there is an abnormal data value in the time series.
Beneficial Effects
The autoencoder-based data anomaly identification method, device, computer equipment and storage medium provided in the present application effectively improve the accuracy of identifying abnormal data values in a time series, and process the identification of abnormal data values in a time series with high efficiency.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an autoencoder-based data anomaly identification method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an autoencoder-based data anomaly identification device according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Best Mode for Carrying Out the Invention
It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.
To make the embodiments of the present application easier to follow, some concepts are briefly introduced first:
Recurrent neural network (RNN): in essence, an RNN has a memory, much as a person does, so its output depends on both the current input and that memory. An RNN introduces directed cycles and can therefore handle problems in which successive inputs are related. It departs from the traditional neural network structure, in which adjacent layers are fully connected and the nodes within a layer are not connected to one another; it is no longer a plain input, hidden layer, output pipeline. Purpose of an RNN: processing sequence data. Content of an RNN: the current output of a sequence also depends on the earlier inputs. How an RNN works: the network memorizes earlier information and applies it when computing the current output; that is, the nodes of the hidden layer are connected to one another, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. Functional characteristics of an RNN: 1. hidden-layer nodes can be interconnected or self-connected; 2. in an RNN, an output is not required at every step, and an input is not required at every step either. Uses of RNNs: language modeling and text generation research, machine translation, speech recognition, and image caption generation.
Autoencoder: a kind of neural network that, after training, attempts to copy its input to its output. Internally it has a hidden layer $h$ that produces an encoded representation of the input. The network can be viewed as two parts: an encoder represented by a function $h = f(x)$ and a decoder that produces a reconstruction $r = g(h)$. A traditional autoencoder processes a time series as follows. For a time series $T = \langle s_1, s_2, \ldots, s_C \rangle$, each vector $s_t$ in the time series is fed to an RNN unit in the encoder of the autoencoder, which performs the following computation:

$$h_t^{(E)} = f\left(s_t, h_{t-1}^{(E)}\right)$$

where $s_t$ is the vector at time step $t$ in the time series, the hidden state $h_{t-1}^{(E)}$ is the output of the previous RNN unit in the encoder at time step $t-1$, and $f(\cdot)$ is a nonlinear function. The above formula yields the hidden state $h_t^{(E)}$ of the encoder's current RNN unit at time step $t$, which is then fed into the next RNN unit at time step $t+1$. The decoder of the autoencoder reconstructs the time series in reverse order, i.e. $\hat{T} = \langle \hat{s}_C, \hat{s}_{C-1}, \ldots, \hat{s}_1 \rangle$. First, the last hidden state of the encoder is used as the first hidden state of the decoder. Then, based on the decoder's previous hidden state $h_{t+1}^{(D)}$ and the previously reconstructed vector $\hat{s}_{t+1}$, the current vector $\hat{s}_t$ is reconstructed and the current hidden state is computed:

$$h_t^{(D)} = g\left(\hat{s}_{t+1}, h_{t+1}^{(D)}\right)$$

where $g(\cdot)$ is a nonlinear function.
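By way of illustration only, the encoder recurrence and the reverse-order decoding described above can be sketched in plain Python. The sketch assumes a simple tanh RNN cell for both $f(\cdot)$ and $g(\cdot)$, a hypothetical linear output layer `W_out` that maps a decoder hidden state to a reconstructed vector, and random untrained weights; none of these choices is specified by the present application.

```python
import math
import random

def rnn_cell(x, h_prev, W_x, W_h):
    """One tanh RNN step (assumed cell type): h = tanh(W_x x + W_h h_prev)."""
    return [
        math.tanh(
            sum(wx * xi for wx, xi in zip(W_x[j], x))
            + sum(wh * hi for wh, hi in zip(W_h[j], h_prev))
        )
        for j in range(len(W_h))
    ]

def linear(W, h):
    """Hypothetical output layer: maps a hidden state to a reconstructed vector."""
    return [sum(w * hi for w, hi in zip(row, h)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def encode(series, W_x, W_h):
    """Apply h_t = f(s_t, h_{t-1}) over the series; return the last hidden state."""
    h = [0.0] * len(W_h)
    for s_t in series:
        h = rnn_cell(s_t, h, W_x, W_h)
    return h

def decode(h_last, length, V_x, V_h, W_out):
    """Reconstruct in reverse order: s_hat_C first, s_hat_1 last."""
    h = h_last                     # decoder's first hidden state = encoder's last
    s_hat = linear(W_out, h)       # reconstruction of s_C
    reversed_out = [s_hat]
    for _ in range(length - 1):
        h = rnn_cell(s_hat, h, V_x, V_h)   # h_t = g(s_hat_{t+1}, h_{t+1})
        s_hat = linear(W_out, h)
        reversed_out.append(s_hat)
    return reversed_out[::-1]      # return in chronological order s_hat_1..s_hat_C

rng = random.Random(0)
dim, hidden = 2, 3
series = [[0.1 * t, math.sin(t)] for t in range(5)]   # toy time series T
W_x, W_h = rand_matrix(hidden, dim, rng), rand_matrix(hidden, hidden, rng)
V_x, V_h = rand_matrix(hidden, dim, rng), rand_matrix(hidden, hidden, rng)
W_out = rand_matrix(dim, hidden, rng)
recon = decode(encode(series, W_x, W_h), len(series), V_x, V_h, W_out)
```

Because the weights are untrained, `recon` is not a meaningful reconstruction; the sketch only shows the data flow, in which the encoder's last hidden state seeds the decoder and each decoder step consumes the vector reconstructed in the previous step.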
Referring to FIG. 1, an autoencoder-based data anomaly identification method according to an embodiment of the present application includes:
S1: receiving an input time series to be detected;
S2: based on the time series, performing integrated training on a pre-generated specified number of sparsely connected autoencoders according to a preset rule to generate a corresponding autoencoder integration framework, wherein the sparsely connected autoencoders are generated by performing unit-connection deletion on each of a specified number of recurrent-neural-network-based autoencoders;
S3: calculating, through the autoencoder integration framework, the anomaly score value corresponding to each vector contained in the time series;
S4: identifying, according to the anomaly score values, whether the time series contains an abnormal data value.
As described in steps S1 to S4 above, this method embodiment is executed by an autoencoder-based data anomaly identification apparatus. In practical applications, the apparatus may be implemented as a virtual apparatus, such as software code, or as a physical apparatus in which the relevant executable code is written or integrated, and it may interact with a user through a keyboard, mouse, remote control, touchpad, voice-control device, or similar means. The apparatus in this embodiment can quickly and accurately identify abnormal data values in the time series to be detected.

Specifically, the input time series to be detected is received first. The time series to be detected is a time series that is to be checked for abnormal data values; for example, it may be a KPI (Key Performance Indicator) time series of a server, and the data it contains are in vector form.

Then, based on the time series, the pre-generated specified number of sparsely connected autoencoders undergo integrated training according to a preset rule to generate the corresponding autoencoder integration framework, where the sparsely connected autoencoders are generated by performing unit-connection deletion on each of a specified number of recurrent-neural-network-based autoencoders. Specifically, the sparsely connected autoencoders may be generated as follows. First, a specified number of RNN-based autoencoders are obtained. Each of these may specifically be an autoencoder built on a recurrent neural network with additional auxiliary connections (RSCN), which adds auxiliary connections between the RNN units. The specified number is not specifically limited and can be set according to actual needs; in this embodiment it is denoted N. Unit-connection deletion is then performed on each RNN-based autoencoder to generate the same number of sparsely connected autoencoders. Because the RSCN-based autoencoder adds auxiliary connections between the RNN units, cutting some of those auxiliary connections introduces differences between the resulting networks. Specifically, for an autoencoder based on a recurrent neural network with additional auxiliary connections, a sparse weight vector is introduced to control which auxiliary connections are deleted at each time step $t$:

$$w_t = \left(w_t^{(f)}, w_t^{(f')}\right)$$

where $w_t$ denotes the sparse weight vector, and $w_t^{(f)}$ and $w_t^{(f')}$ denote the elements it contains. At least one element of $w_t$ is non-zero, which gives three cases: $w_t = (0,1)$, $(1,0)$, or $(1,1)$. A sparsely connected autoencoder can thus be generated from the sparse weight vector $w_t$, and the hidden state of each RNN unit in the resulting sparsely connected autoencoder is computed as:

$$h_t = \frac{w_t^{(f)} \cdot f\left(s_t, h_{t-1}\right) + w_t^{(f')} \cdot f'\left(s_t, h_{t-L}\right)}{\left\| w_t \right\|_0}$$

where $s_t$ is the vector at time step $t$ in the input time series data, $h_{t-1}$ is the hidden state at time step $t-1$ in the encoder of the sparsely connected recurrent autoencoder, $h_{t-L}$ is the hidden state at time step $t-L$ in that encoder, $f(\cdot)$ and $f'(\cdot)$ are the nonlinear functions of the ordinary recurrent connection and of the auxiliary connection respectively, $w_t$ is the sparse weight vector, and $\|w_t\|_0$ is the number of non-zero elements in $w_t$.

Further, depending on actual needs, unit-connection deletion may also be performed by randomly deleting connections: for each RNN-based autoencoder, the connections of some RNN units are randomly deleted to obtain a sparsely connected autoencoder. As a result, the reconstruction errors that the different sparsely connected autoencoders produce when reconstructing the time series are not all the same, which broadens the applicability of the autoencoders and improves their reliability, accuracy, and generalization. In addition, if the specified number is N, N sparsely connected recurrent autoencoders are obtained, each consisting of an encoder $E_i$ and a decoder $D_i$, $1 \le i \le N$, and each having its own sparse weight vectors.

The autoencoder integration framework may include an independent framework and a shared framework. Specifically, a first objective function may be generated from all vectors contained in the time series and the reconstruction vectors that a sparsely connected autoencoder generates for those vectors, and each sparsely connected autoencoder is then trained separately on the first objective function to obtain the autoencoder integration framework. Alternatively, a second objective function may be generated from all vectors contained in the time series, the reconstruction vectors that the sparsely connected autoencoders generate for those vectors, and a preset shared hidden state, and all the sparsely connected autoencoders are then trained jointly on the second objective function to obtain the autoencoder integration framework.

After the autoencoder integration framework is obtained, it is used to calculate the anomaly score value corresponding to each vector contained in the time series. Each autoencoder in the framework computes a reconstruction error for each vector in the time series, in one-to-one correspondence; then, for any specified vector in the time series, the median of all reconstruction errors corresponding to that vector is computed and taken as the anomaly score value of that vector. Finally, whether the time series contains abnormal data values is identified from the anomaly score values according to a preset anomaly threshold: if the anomaly score value of a specified vector is greater than the threshold, that vector is determined to be an abnormal data value; if it is not greater than the threshold, the vector is determined to be a normal data value, that is, it is not an abnormal data value.

Unlike existing anomaly detection methods, this embodiment identifies data anomalies in a time series by means of an autoencoder integration framework. When the input time series to be detected is received, sparsely connected autoencoders, generated by modifying the original RNN-based autoencoders, are first obtained; these pre-generated sparsely connected autoencoders then undergo integrated training based on the time series to generate an autoencoder integration framework suitable for identifying abnormal values in time series data. The framework is used to calculate the anomaly score value corresponding to each vector in the time series, and whether the time series contains abnormal data values can then be identified quickly and accurately from those scores. This improves the accuracy of identifying abnormal data values in the time series, and the identification is efficient.
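As a non-limiting illustration of the sparse weight vector, the following Python sketch draws one weight vector per time step from the three permitted cases and combines the two branch outputs of an RNN unit accordingly. It assumes that the hidden state is the weighted sum of the ordinary recurrent branch (fed with $h_{t-1}$) and the auxiliary branch (fed with $h_{t-L}$), divided by the number of non-zero weights; the function names and the sample values are hypothetical.

```python
import random

ALLOWED_WEIGHTS = [(0, 1), (1, 0), (1, 1)]   # at least one element is non-zero

def sample_sparse_weights(length, rng):
    """Randomly pick a sparse weight vector w_t for every time step."""
    return [rng.choice(ALLOWED_WEIGHTS) for _ in range(length)]

def sparse_hidden_state(w_t, recurrent_out, auxiliary_out):
    """Combine the two branch outputs of one RNN unit.

    recurrent_out : output of the ordinary branch, computed from h_{t-1}
    auxiliary_out : output of the auxiliary branch, computed from h_{t-L}
    A weight of 0 deletes the corresponding connection for this time step.
    """
    w_rec, w_aux = w_t
    nonzero = (w_rec != 0) + (w_aux != 0)          # ||w_t||_0
    return [
        (w_rec * a + w_aux * b) / nonzero
        for a, b in zip(recurrent_out, auxiliary_out)
    ]

rng = random.Random(0)
weights = sample_sparse_weights(6, rng)            # one w_t per time step

rec, aux = [0.5, -0.25], [0.25, 0.75]              # made-up branch outputs
both = sparse_hidden_state((1, 1), rec, aux)       # average of the two branches
only_rec = sparse_hidden_state((1, 0), rec, aux)   # auxiliary connection deleted
only_aux = sparse_hidden_state((0, 1), rec, aux)   # recurrent connection deleted
```

Giving each of the N autoencoders its own random generator, or its own seed, would give each a different connection pattern, which is what makes their reconstruction errors differ.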
Further, in an embodiment of the present application, step S2 above includes:
S200: obtaining all first vectors contained in the time series; and
S201: obtaining the first reconstruction vectors that each sparsely connected autoencoder generates in one-to-one correspondence with the first vectors;
S202: generating a corresponding first objective function based on the first vectors and the first reconstruction vectors;
S203: training each sparsely connected autoencoder separately based on the first objective function to obtain trained first autoencoders, wherein the number of first autoencoders is the same as the number of sparsely connected autoencoders;
S204: integrating all the first autoencoders to generate a corresponding independent framework, wherein the independent framework contains the specified number of first autoencoders and the first autoencoders do not interact with one another;
S205: determining the independent framework as the autoencoder integration framework.
As described in steps S200 to S205 above, the autoencoder integration framework may be an independent framework generated from all the sparsely connected autoencoders. The independent framework is trained by training each of the different sparsely connected autoencoders on its own, so the sparsely connected autoencoders do not interact during training, and the autoencoders contained in the resulting independent framework do not interact with one another either.

Specifically, the step of performing integrated training on the pre-generated specified number of sparsely connected autoencoders according to a preset rule, based on the time series, to generate the corresponding autoencoder integration framework may include the following. First, all first vectors contained in the time series are obtained. The input time series to be detected may be $T = \langle s_1, s_2, \ldots, s_C \rangle$, and the vectors $s_1, s_2, \ldots, s_C$ contained in $T$ are the first vectors. At the same time, the first reconstruction vectors that each sparsely connected autoencoder generates in one-to-one correspondence with the first vectors are obtained: after any one of the sparsely connected autoencoders reconstructs the time series, it generates a corresponding reconstructed time series $\hat{T}$, and the vectors $\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_C$ contained in $\hat{T}$ are the first reconstruction vectors corresponding to the respective first vectors.

Then, a corresponding first objective function is generated based on the first vectors and the first reconstruction vectors. The first objective function $J_i$ may be to minimize the difference between each input vector in the time series and the corresponding reconstruction vector generated by the sparsely connected autoencoder, and $J_i$ is used to train each sparsely connected autoencoder independently. Specifically, the first objective function may be:

$$J_i = \sum_{t=1}^{C} \left\| s_t - \hat{s}_t^{(D_i)} \right\|_2^2$$

where $J_i$ is the first objective function, $s_t$ is the vector at time step $t$ in the time series, $\hat{s}_t^{(D_i)}$ is the reconstruction of $s_t$ generated at time step $t$ by the decoder $D_i$ of the sparsely connected autoencoder, and $\|\cdot\|_2$ is the L2 norm of a vector.

After the first objective function is obtained, each sparsely connected autoencoder is trained separately on it to obtain the trained first autoencoders, whose number is the same as the number of sparsely connected autoencoders. All the first autoencoders are then integrated to generate the corresponding independent framework. The independent framework contains the specified number of first autoencoders, and the first autoencoders do not interact with one another. Specifically, all the first autoencoders may be integrated into a preset integration framework to generate the independent framework. In addition, each decoder $D_i$ in the independent framework uses the linear combination of its own independent hidden state $h_C^{(E_i)}$ and the corresponding weight matrix $W^{(E_i)}$ as its initial hidden state. Finally, once the independent framework is obtained, it is determined as the autoencoder integration framework.

In this embodiment, training produces an independent framework composed of a specified number of sparsely connected autoencoders with different network structures. Because anomaly detection with the independent framework takes into account the reconstruction errors from multiple autoencoders, the variance of the overall reconstruction error is reduced. The anomaly score value corresponding to each vector in the time series can therefore be calculated accurately from the independent framework, and whether the time series contains abnormal data values can be identified quickly and accurately from those scores, improving both the efficiency and the accuracy of identifying abnormal data values in the time series.
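The first objective function can be illustrated with a short Python sketch. It assumes that the objective is the sum, over all time steps, of the squared L2 distance between each input vector and its reconstruction; the reconstructions below are made-up numbers standing in for the outputs of two decoders, not results of any trained model.

```python
def squared_l2(u, v):
    """Squared L2 distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def first_objective(series, reconstruction):
    """J_i: summed squared reconstruction error of one autoencoder."""
    return sum(squared_l2(s_t, r_t) for s_t, r_t in zip(series, reconstruction))

series = [[1.0, 2.0], [3.0, 4.0]]        # first vectors s_1, s_2
reconstructions = [
    [[1.0, 2.0], [3.0, 3.0]],            # hypothetical output of decoder D_1
    [[0.0, 2.0], [3.0, 2.0]],            # hypothetical output of decoder D_2
]

# In the independent framework each autoencoder is scored, and trained,
# on its own objective J_i without reference to the others.
losses = [first_objective(series, r) for r in reconstructions]
```

Here `losses` evaluates to `[1.0, 5.0]`: the first reconstruction misses only one component by 1, while the second misses one component by 1 and another by 2.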
Further, in an embodiment of the present application, step S2 above includes:
S210: obtaining a preset shared layer, wherein the shared layer includes a shared hidden state;
S211: performing weight sharing on all the sparsely connected autoencoders through the shared layer;
S212: performing L1 regularization on the shared hidden state to obtain a processed shared hidden state;
S213: obtaining all second vectors contained in the time series; and
S214: obtaining the second reconstruction vectors that each sparsely connected autoencoder generates in one-to-one correspondence with the second vectors;
S215: generating a corresponding second objective function according to the processed shared hidden state, the second vectors, and the second reconstruction vectors;
S216: jointly training all the sparsely connected autoencoders based on the second objective function to obtain trained second autoencoders, wherein the number of second autoencoders is the same as the number of sparsely connected autoencoders;
S217: integrating all the second autoencoders to generate a corresponding shared framework, wherein the shared framework contains the specified number of second autoencoders and the second autoencoders interact with one another;
S218: determining the shared framework as the autoencoder integration framework.
As described in steps S210 to S218 above, the autoencoder integration framework may be a shared framework that is generated from all the sparsely connected autoencoders and a preset shared layer and that includes interaction between the different autoencoders. Because the shared framework includes this interaction, it can further improve the accuracy of identifying abnormal data values in the time series compared with the independent framework.

Specifically, the step of performing integrated training on the pre-generated specified number of sparsely connected autoencoders according to a preset rule, based on the time series, to generate the corresponding autoencoder integration framework may include the following. First, a preset shared layer is obtained, and weight sharing is performed on all the sparsely connected autoencoders through the shared layer, where the shared layer includes a shared hidden state. The shared layer concatenates the linear combinations of the last hidden states $h_C^{(E_i)}$ of all the sparsely connected encoders with the corresponding weight matrices $W^{(E_i)}$. Specifically, the shared layer, that is, the shared hidden state, is:

$$h_C^{(E)} = \mathrm{concatenate}\left(h_C^{(E_1)} \cdot W^{(E_1)}, \ldots, h_C^{(E_N)} \cdot W^{(E_N)}\right)$$

Then, L1 regularization is performed on the shared hidden state to obtain the processed shared hidden state. L1 regularization makes the shared hidden state $h_C^{(E)}$ sparse, which prevents some encoders from overfitting the time series, so that the decoders are more widely applicable and less susceptible to abnormal data values.

After the processed shared hidden state is obtained, all second vectors contained in the time series are obtained. The input time series to be detected may be $T = \langle s_1, s_2, \ldots, s_C \rangle$, and the vectors $s_1, s_2, \ldots, s_C$ contained in $T$ are the second vectors. At the same time, the second reconstruction vectors that each sparsely connected autoencoder generates in one-to-one correspondence with the second vectors are obtained: after each sparsely connected autoencoder reconstructs the time series, it generates a corresponding reconstructed time series $\hat{T}$, and the vectors $\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_C$ contained in $\hat{T}$ are the second reconstruction vectors corresponding to the respective second vectors.

Next, a corresponding second objective function is generated according to the processed shared hidden state, the second vectors, and the second reconstruction vectors. Specifically, the second objective function may be:

$$J = \sum_{i=1}^{N} J_i + \lambda \left\| h_C^{(E)} \right\|_1 = \sum_{i=1}^{N} \sum_{t=1}^{C} \left\| s_t - \hat{s}_t^{(D_i)} \right\|_2^2 + \lambda \left\| h_C^{(E)} \right\|_1$$

where $\lambda$ is a weight parameter that controls the importance of the L1 regularization term, $s_t$ is the vector at time step $t$ in the time series, $\hat{s}_t^{(D_i)}$ is the reconstruction vector from the decoder $D_i$ at time step $t$, $h_C^{(E)}$ is the shared hidden state after L1 regularization, $\|\cdot\|_2$ is the L2 norm of a vector, and $J_i$ is the first objective function described above.

After the second objective function is obtained, all the sparsely connected autoencoders are jointly trained on it to obtain the trained second autoencoders, whose number is the same as the number of sparsely connected autoencoders. All the second autoencoders are then integrated to generate the corresponding shared framework. The shared framework contains the specified number of second autoencoders, and the second autoencoders interact with one another. All the second autoencoders may be integrated into a preset integration framework to generate the shared framework. Finally, the shared framework is determined as the autoencoder integration framework.

In this embodiment, training produces a shared framework composed of a specified number of sparsely connected autoencoders with different network structures. Because anomaly detection with the shared framework takes into account the reconstruction errors from multiple autoencoders, and the sparsely connected autoencoders can interact with one another, the variance of the overall reconstruction error is reduced further. The anomaly score value corresponding to each vector in the time series can therefore be calculated accurately from the shared framework, and whether the time series contains abnormal data values can be identified quickly and accurately from those scores, improving both the efficiency and the accuracy of identifying abnormal data values in the time series.
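A minimal Python sketch of the shared layer and the second objective function follows. It assumes that the shared hidden state is formed by multiplying each encoder's last hidden state by its weight matrix and concatenating the results, and that the second objective adds an L1 penalty on that shared state, weighted by λ, to the sum of the per-autoencoder objectives. All numbers, including the per-autoencoder losses, are made up for illustration.

```python
def vec_mat(h, W):
    """Row vector h (length n) times matrix W (n rows, m columns)."""
    return [sum(h[k] * W[k][j] for k in range(len(h))) for j in range(len(W[0]))]

def shared_hidden_state(last_states, weight_matrices):
    """Concatenate h_C^(E_i) * W^(E_i) over all encoders E_1..E_N."""
    shared = []
    for h, W in zip(last_states, weight_matrices):
        shared.extend(vec_mat(h, W))
    return shared

def second_objective(per_model_losses, shared, lam):
    """J = sum_i J_i + lam * ||shared||_1."""
    l1_penalty = sum(abs(x) for x in shared)
    return sum(per_model_losses) + lam * l1_penalty

# Last encoder hidden states of N = 2 autoencoders (hypothetical values).
last_states = [[1.0, -1.0], [0.5, 0.0]]
# One 2x1 weight matrix per encoder (hypothetical values).
weight_matrices = [[[1.0], [2.0]], [[2.0], [4.0]]]

shared = shared_hidden_state(last_states, weight_matrices)
total = second_objective([1.0, 5.0], shared, lam=0.5)
```

With these values `shared` is `[-1.0, 1.0]` and `total` is `7.0`: the reconstruction terms contribute 6.0 and the L1 penalty contributes 0.5 × 2.0. Because the penalty is computed on a state that depends on every encoder, minimizing `total` couples the autoencoders during training, unlike the separate objectives of the independent framework.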
Further, in an embodiment of the present application, step S3 above includes:
S300: calculating, by each autoencoder contained in the autoencoder integration framework, a reconstruction error corresponding to a specified vector, wherein the specified vector is any one of the vectors contained in the time series;
S301: calculating the median of all the reconstruction errors;
S302: determining the median as the specified anomaly score value corresponding to the specified vector in the time series.
如上述步骤S300至S302所述,上述通过上述自编码器集成框架计算上述时间序列中包含的每一个向量所对应的异常分数值的步骤,具体可包括:首先通过上述自编码器集成框架中包含的每一个自编码器计算生成与指定向量对应的重构误差,其中,上述指定向量为上述时间序列包含的所有向量中的任意一个向量。具体的,假设上述指定数量为N,对于原始时间序列T=<s 1,s 2,…,s C>中的任意一个向量s k,可通过自编码器集成框架中包含的N个自编码器生成与该向量s k对应的N个重构误差{a 1,a 2,…,a N}。另外,重构误差的生成过程可包括:通过自编码器集成框架中包含的N个自编码器分别生成与上述时间序列对应的重构时间序列,然后从各个重构时间序列中分别提取出与向量s k对应的重构向量,从而调用向量s k,以及与其对应的重构向量相关的计算公式来计算出与量s k对应的重构误差。然后计算所有上述重构误差的中位数。其中,可通过公式OS(s k)=median{a 1,a 2,…,a N},来计算出上述中位数。最后将上述中位数确定为与上述时间序列中的上述指定向量对应的指定异常分数值。其中,为了降低来自自编码器的重构误差的影响,因此使用N个重构误差的中位数作为向量s k的最终异常分数值。需要说明的是,上述独立框架与上述共享框架计算上述时间序列中包含的每一个向量所对应的异常分数值所使用到的计算公式是相同的。本实施例通过使用自编码器集成框架中包含的每一个自编码器计算生成与指定向量对应的重构误差,进而所有上述重构误差的中位数为与上述时间序列中的上述指定向量对应的指定异常分数值, 以实现准确地计算出计算上述时间序列中包含的每一个向量所对应的异常分数值,进而有利于后续根据该异常分数值来快速准确地识别出上述时间序列中是否存在异常数据值,以有效提高对于时间序列中的异常数据值的识别效率与识别准确性。 As described in the above steps S300 to S302, the above-mentioned step of calculating the abnormal score value corresponding to each vector included in the above-mentioned time series through the above-mentioned autoencoder integration framework may specifically include: Each of the autoencoders calculates and generates a reconstruction error corresponding to a specified vector, where the specified vector is any one of all vectors included in the above time series. Specifically, assuming that the above specified number is N, for any vector sk in the original time series T=<s 1 , s 2 , ..., s C >, the N auto-encoders included in the auto-encoder integration framework can be used. The generator generates N reconstruction errors {a 1 , a 2 , . . . , a N } corresponding to the vector sk . In addition, the generating process of the reconstruction error may include: generating reconstructed time series corresponding to the above-mentioned time series by using N autoencoders included in the autoencoder integration framework, and then extracting the corresponding time series from the reconstructed time series, respectively. 
The reconstruction vector corresponding to the vector sk is called, so that the vector sk and the calculation formula related to the reconstruction vector corresponding to the vector sk are called to calculate the reconstruction error corresponding to the quantity sk . Then calculate the median of all the above reconstruction errors. Wherein, the above median can be calculated by the formula OS( sk )=median{a 1 , a 2 , . . . , a N }. Finally, the above median is determined as the specified anomaly score value corresponding to the above specified vector in the above time series. Among them, in order to reduce the influence of the reconstruction error from the autoencoder, the median of the N reconstruction errors is therefore used as the final outlier score value of the vector sk . It should be noted that the above-mentioned independent framework and the above-mentioned shared framework use the same calculation formula to calculate the abnormal score value corresponding to each vector included in the above-mentioned time series. This embodiment calculates and generates a reconstruction error corresponding to the specified vector by using each autoencoder included in the autoencoder integration framework, and the median of all the above reconstruction errors is corresponding to the above specified vector in the above time series. The specified abnormal score value of , so as to accurately calculate and calculate the abnormal score value corresponding to each vector included in the above time series, which is helpful to quickly and accurately identify whether the above time series exists in the above time series according to the abnormal score value. Abnormal data values to effectively improve the identification efficiency and accuracy of abnormal data values in time series.
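The median step in S301 and S302 can be illustrated with a minimal sketch. The function name `anomaly_score` and the sample error values are hypothetical and chosen only for illustration; the formula OS(s_k) = median{a_1, …, a_N} is the one stated above.

```python
from statistics import median


def anomaly_score(reconstruction_errors):
    """Return OS(s_k) = median{a_1, ..., a_N} for a single vector s_k."""
    if not reconstruction_errors:
        raise ValueError("at least one reconstruction error is required")
    return median(reconstruction_errors)


# Hypothetical errors a_1..a_5 produced by N = 5 autoencoders for the same vector.
# One autoencoder reconstructs the vector badly (3.50); the others agree.
errors = [0.12, 0.10, 3.50, 0.11, 0.09]
score = anomaly_score(errors)
```

With these sample values the single large error does not move the score, which is the robustness property the text attributes to the median; an arithmetic mean would be pulled toward the outlying error.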
Further, in an embodiment of the present application, step S300 includes:
S3000: Reconstruct the time series with a specific autoencoder to obtain a specific reconstructed time series corresponding to the time series, where the specific autoencoder is any one of the autoencoders in the autoencoder integration framework;
S3001: Extract, from the specific reconstructed time series, the specific reconstruction vector corresponding to the specified vector;
S3002: Calculate the specific reconstruction error corresponding to the specified vector according to the specified vector and the specific reconstruction vector.
As described in steps S3000 to S3002, the step of computing, with each autoencoder in the autoencoder integration framework, the reconstruction error corresponding to the specified vector may specifically include the following. First, a specific autoencoder reconstructs the time series to obtain a specific reconstructed time series corresponding to the time series, where the specific autoencoder is any one of the autoencoders in the autoencoder integration framework. The input time series to be detected may be T = <s_1, s_2, …, s_C>; after reconstructing it, the specific autoencoder generates the corresponding reconstructed time series
Figure PCTCN2021097550-appb-000034
where 1 ≤ i ≤ N. Next, the specific reconstruction vector corresponding to the specified vector is extracted from the specific reconstructed time series. For the specified vector s_k in the time series, from the reconstructed time series generated by the specific autoencoder,
Figure PCTCN2021097550-appb-000035
the specific reconstruction vector corresponding to s_k is extracted:
Figure PCTCN2021097550-appb-000036
Finally, the specific reconstruction error corresponding to the specified vector is calculated from the specified vector and the specific reconstruction vector, using the formula
Figure PCTCN2021097550-appb-000037
Further, the specified anomaly score value corresponding to the specified vector in the time series can be calculated by the formula
Figure PCTCN2021097550-appb-000038
The reconstruction errors computed by each autoencoder in the autoencoder integration framework can then be used to quickly calculate the anomaly score value of every vector in the time series, which in turn allows abnormal data values in the time series to be identified quickly and accurately, improving both the efficiency and the accuracy of identification.
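Steps S3000 to S3002, followed by the median, can be sketched as below. Two points are assumptions rather than statements of the application: the error measure (squared Euclidean distance) stands in for the formula that appears only as a figure above, and the three "autoencoders" are hypothetical placeholder functions that map a series to a series, not trained recurrent networks.

```python
from statistics import median


def squared_error(vector, reconstructed):
    """Assumed error measure: squared Euclidean distance between s_k and its reconstruction."""
    if len(vector) != len(reconstructed):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(vector, reconstructed))


def score_series(series, autoencoders):
    """Return one anomaly score per vector of the series."""
    # S3000: every autoencoder reconstructs the whole time series.
    reconstructions = [ae(series) for ae in autoencoders]
    scores = []
    for k, s_k in enumerate(series):
        # S3001 and S3002: take the k-th reconstructed vector and measure its error.
        errors = [squared_error(s_k, rec[k]) for rec in reconstructions]
        # Median over the N autoencoders gives the anomaly score of s_k.
        scores.append(median(errors))
    return scores


# Hypothetical stand-ins for trained autoencoders.
def identity_ae(series):
    return [list(v) for v in series]


def shrink_ae(series):
    return [[0.5 * x for x in v] for v in series]


def zero_ae(series):
    return [[0.0 for _ in v] for v in series]


series = [[1.0, 1.0], [2.0, 0.0], [4.0, 2.0]]
scores = score_series(series, [identity_ae, shrink_ae, zero_ae])
```

Each autoencoder reconstructs the series once and the per-vector errors are read from those reconstructions, which mirrors the order of steps S3000 to S3002.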
Further, in an embodiment of the present application, step S4 includes:
S400: Obtain a preset anomaly threshold;
S401: Determine whether, among all the anomaly score values, there is a specified score value greater than the anomaly threshold;
S402: If so, filter out the specified score value from all the anomaly score values;
S403: Find, in the time series, the third vector corresponding to the specified score value;
S404: Determine the third vector as the abnormal data value.
As described in steps S400 to S404, the step of identifying, according to the anomaly score values, whether an abnormal data value exists in the time series may specifically include the following. First, a preset anomaly threshold is obtained. The value of the anomaly threshold is not specifically limited: it may be derived from statistical calculations on historical time series data, or set according to actual needs. Next, it is determined whether any of the anomaly score values is a specified score value greater than the anomaly threshold. If so, the specified score value is filtered out from all the anomaly score values, and the third vector corresponding to the specified score value is located in the time series. Finally, the third vector is determined as the abnormal data value. In this embodiment, the autoencoder integration framework is used to calculate the anomaly score value of each vector in the time series. Each anomaly score value is compared with the preset anomaly threshold, the specified score values greater than the anomaly threshold are found, and the third vectors corresponding to those score values in the time series are determined as abnormal data values. This allows the abnormal data values contained in the time series to be identified accurately and improves the efficiency of identifying abnormal data in the time series.
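The thresholding in steps S400 to S404 can be sketched as follows. The function names are hypothetical, and `threshold_from_history` shows only one possible statistical rule (mean plus a multiple of the standard deviation of historical scores); the application leaves the choice of threshold open.

```python
from statistics import mean, pstdev


def find_anomalies(series, scores, threshold):
    """S401-S404: return (index, vector, score) for each score strictly above the threshold."""
    if len(series) != len(scores):
        raise ValueError("series and scores must have the same length")
    return [
        (k, vector, score)
        for k, (vector, score) in enumerate(zip(series, scores))
        if score > threshold
    ]


def threshold_from_history(historical_scores, num_std=3.0):
    """Hypothetical rule for S400: mean plus num_std population standard deviations."""
    return mean(historical_scores) + num_std * pstdev(historical_scores)


series = [[1.0, 1.0], [2.0, 0.0], [4.0, 2.0]]
scores = [0.5, 1.0, 5.0]
anomalies = find_anomalies(series, scores, threshold=2.0)
```

The comparison is strict, matching the "greater than" wording of S401, so a score exactly equal to the threshold is not flagged.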
Further, in an embodiment of the present application, after step S404, the method includes:
S405: Screen out, from the time series, the fourth vectors other than the third vector;
S406: Mark the fourth vectors as normal data values;
S407: Obtain a first quantity corresponding to the third vector; and,
S408: Obtain a second quantity corresponding to the fourth vectors;
S409: Generate an anomaly analysis report corresponding to the time series according to the abnormal data value, the first quantity, the normal data values, and the second quantity;
S410: Display the anomaly analysis report.
As described in steps S405 to S410, after the abnormal data value in the time series is obtained, a corresponding anomaly analysis report may further be generated from the abnormal data value and related data. Specifically, after the step of determining the third vector as the abnormal data value, the method may further include the following. First, the fourth vectors other than the third vector are screened out from the time series and marked as normal data values. Next, the first quantity corresponding to the third vector and the second quantity corresponding to the fourth vectors are obtained. An anomaly analysis report corresponding to the time series is then generated according to the abnormal data value, the first quantity, the normal data values, and the second quantity; the report includes at least these four items. Finally, the anomaly analysis report is displayed, so that the user can clearly see the distribution and number of abnormal data values in the time series to be detected, as well as the distribution and number of normal data values. The way the anomaly analysis report is displayed is not specifically limited and can be set according to implementation requirements.
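Steps S405 to S410 can be sketched as below. The report layout (a dictionary with four keys, rendered as plain text) is a hypothetical choice, since the application does not prescribe a report format or display method.

```python
def build_report(series, anomaly_indices):
    """S405-S409: split the series into abnormal and normal vectors and count both."""
    anomaly_set = set(anomaly_indices)
    abnormal = [v for k, v in enumerate(series) if k in anomaly_set]    # third vectors
    normal = [v for k, v in enumerate(series) if k not in anomaly_set]  # fourth vectors
    return {
        "abnormal_data_values": abnormal,
        "first_quantity": len(abnormal),
        "normal_data_values": normal,
        "second_quantity": len(normal),
    }


def render_report(report):
    """S410: format the report as plain text for display."""
    return "\n".join(
        [
            f"abnormal vectors ({report['first_quantity']}): {report['abnormal_data_values']}",
            f"normal vectors ({report['second_quantity']}): {report['normal_data_values']}",
        ]
    )


series = [[1.0, 1.0], [2.0, 0.0], [4.0, 2.0]]
report = build_report(series, anomaly_indices=[2])
text = render_report(report)
```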
The autoencoder-based data anomaly identification method in the embodiments of the present application can also be applied in the blockchain field, for example by storing data such as the autoencoder integration framework on a blockchain. Using a blockchain to store and manage the autoencoder integration framework can effectively guarantee its security and immutability.
A blockchain is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
The underlying blockchain platform may include processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for managing the identity information of all blockchain participants, including maintaining public and private key generation (account management), key management, and the mapping between users' real identities and blockchain addresses (authority management); where authorized, it also supervises and audits the transactions of certain real identities and provides rule configuration for risk control (risk control audit). The basic service module is deployed on all blockchain node devices to verify the validity of business requests and to record valid requests to storage after consensus is reached. For a new business request, the basic service first performs interface adaptation, parsing, and authentication (interface adaptation), then encrypts the business information through the consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger after encryption (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution; developers can define contract logic in a programming language and publish it to the blockchain (contract registration), and execution is triggered by a key or other event according to the logic of the contract terms to complete the contract logic. The module also provides functions for upgrading and cancelling contracts. The operation monitoring module is mainly responsible for deployment during product release, configuration modification, contract settings, cloud adaptation, and visual output of real-time status during product operation, such as alarms, network condition monitoring, and node device health monitoring.
Referring to FIG. 2, an embodiment of the present application further provides an autoencoder-based data anomaly identification apparatus, including:
a receiving module 1, configured to receive an input time series to be detected;
a training module 2, configured to perform, based on the time series and according to preset rules, integrated training on a pre-generated specified number of sparsely connected autoencoders to generate a corresponding autoencoder integration framework, where the sparsely connected autoencoders are generated by deleting unit connections from each of a specified number of recurrent-neural-network-based autoencoders;
a calculation module 3, configured to calculate, through the autoencoder integration framework, the anomaly score value corresponding to each vector in the time series;
an identification module 4, configured to identify, according to the anomaly score values, whether an abnormal data value exists in the time series.
In this embodiment, for the implementation of the functions of the receiving module, the training module, the calculation module, and the identification module in the autoencoder-based data anomaly identification apparatus, see the implementation of the corresponding steps S1 to S4 in the autoencoder-based data anomaly identification method above; details are not repeated here.
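The division into four modules can be sketched as a small class whose methods mirror modules 1 to 4. Everything here is a hypothetical illustration: the class and method names are invented, the "trainer" is a stub that returns constant-output functions rather than trained sparsely connected recurrent autoencoders, and the squared-error measure is an assumption.

```python
from statistics import median


class AnomalyDetector:
    """Hypothetical wiring of the receiving, training, calculation and identification modules."""

    def __init__(self, trainer, threshold):
        self.trainer = trainer      # assumed callable: series -> list of autoencoders
        self.threshold = threshold  # preset anomaly threshold

    def receive(self, series):
        # Module 1: accept the time series to be detected.
        if not series:
            raise ValueError("the time series must not be empty")
        return [list(v) for v in series]

    def train(self, series):
        # Module 2: produce the autoencoder ensemble (stubbed by self.trainer).
        return self.trainer(series)

    def calculate(self, series, ensemble):
        # Module 3: median reconstruction error per vector (squared error assumed).
        recs = [ae(series) for ae in ensemble]
        return [
            median(sum((x - y) ** 2 for x, y in zip(s_k, rec[k])) for rec in recs)
            for k, s_k in enumerate(series)
        ]

    def identify(self, series, scores):
        # Module 4: vectors whose score exceeds the threshold are abnormal.
        return [v for v, s in zip(series, scores) if s > self.threshold]

    def run(self, series):
        series = self.receive(series)
        scores = self.calculate(series, self.train(series))
        return scores, self.identify(series, scores)


def stub_trainer(series):
    """Placeholder for training: returns three constant-output 'autoencoders'."""
    def constant_ae(value):
        return lambda s: [[value for _ in v] for v in s]
    center = median(v[0] for v in series)
    return [constant_ae(center + offset) for offset in (-0.5, 0.0, 0.5)]


detector = AnomalyDetector(stub_trainer, threshold=10.0)
scores, abnormal = detector.run([[1.0], [1.0], [1.0], [9.0]])
```

Keeping training behind a callable lets either the independent framework or the shared framework be supplied without changing the calculation and identification steps, which the text says use the same scoring formula in both cases.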
Further, in an embodiment of the present application, the training module includes:
a first acquisition unit, configured to acquire all the first vectors contained in the time series; and,
a second acquisition unit, configured to acquire the first reconstruction vectors generated by each sparsely connected autoencoder in one-to-one correspondence with the first vectors;
a first generating unit, configured to generate a corresponding first objective function based on the first vectors and the first reconstruction vectors;
a first training unit, configured to train each sparsely connected autoencoder separately based on the first objective function to obtain trained first autoencoders, where the number of first autoencoders is the same as the number of sparsely connected autoencoders;
a first processing unit, configured to integrate all the first autoencoders to generate a corresponding independent framework, where the independent framework contains the specified number of first autoencoders and the first autoencoders do not interact with one another;
a first determining unit, configured to determine the independent framework as the autoencoder integration framework.
In this embodiment, for the implementation of the functions of the first acquisition unit, the second acquisition unit, the first generating unit, the first training unit, the first processing unit, and the first determining unit in the autoencoder-based data anomaly identification apparatus, see the implementation of the corresponding steps S200 to S205 in the autoencoder-based data anomaly identification method above; details are not repeated here.
Further, in an embodiment of the present application, the training module includes:
a third acquisition unit, configured to acquire a preset shared layer, where the shared layer includes a shared hidden state;
a second processing unit, configured to perform weight sharing across all the sparsely connected autoencoders through the shared layer;
a third processing unit, configured to apply L1 regularization to the shared hidden state to obtain a processed shared hidden state;
a fourth acquisition unit, configured to acquire all the second vectors contained in the time series; and,
a fifth acquisition unit, configured to acquire the second reconstruction vectors generated by each sparsely connected autoencoder in one-to-one correspondence with the second vectors;
a second generating unit, configured to generate a corresponding second objective function according to the processed shared hidden state, the second vectors, and the second reconstruction vectors;
a second training unit, configured to jointly train all the sparsely connected autoencoders based on the second objective function to obtain trained second autoencoders, where the number of second autoencoders is the same as the number of sparsely connected autoencoders;
a fourth processing unit, configured to integrate all the second autoencoders to generate a corresponding shared framework, where the shared framework contains the specified number of second autoencoders and the second autoencoders interact with one another;
a second determining unit, configured to determine the shared framework as the autoencoder integration framework.
In this embodiment, for the implementation of the functions of the third acquisition unit, the second processing unit, the third processing unit, the fourth acquisition unit, the fifth acquisition unit, the second generating unit, the second training unit, the fourth processing unit, and the second determining unit in the autoencoder-based data anomaly identification apparatus, see the implementation of the corresponding steps S210 to S218 in the autoencoder-based data anomaly identification method above; details are not repeated here.
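The role of the L1-regularized shared hidden state in the second objective function can be illustrated with a rough sketch. The exact objective is not reproduced in this passage, so the form used here (summed squared reconstruction errors over all autoencoders plus a weighted L1 norm of the shared hidden state) is an assumption, as are the function names and the `l1_weight` parameter.

```python
def l1_norm(values):
    """L1 norm: sum of absolute values of the shared hidden state entries."""
    return sum(abs(v) for v in values)


def reconstruction_loss(series, reconstructed):
    """Summed squared error between each vector and its reconstruction (assumed form)."""
    return sum(
        (x - y) ** 2
        for s_k, r_k in zip(series, reconstructed)
        for x, y in zip(s_k, r_k)
    )


def shared_objective(series, reconstructions, shared_hidden_state, l1_weight):
    """Assumed joint objective: all reconstruction losses plus an L1 penalty on the shared state."""
    total = sum(reconstruction_loss(series, rec) for rec in reconstructions)
    return total + l1_weight * l1_norm(shared_hidden_state)


series = [[1.0, 2.0], [3.0, 4.0]]
reconstructions = [
    [[1.0, 2.0], [3.0, 3.0]],  # hypothetical output of autoencoder 1
    [[0.0, 2.0], [3.0, 4.0]],  # hypothetical output of autoencoder 2
]
hidden = [0.5, -1.5, 0.0]
joint = shared_objective(series, reconstructions, hidden, l1_weight=0.5)
```

Under the same assumptions, the per-autoencoder objective of the independent framework corresponds to calling `reconstruction_loss` on one autoencoder at a time, with no shared term.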
Further, in an embodiment of the present application, the calculation module includes:
a first calculation unit, configured to compute, with each autoencoder in the autoencoder integration framework, a reconstruction error corresponding to a specified vector, where the specified vector is any one of the vectors contained in the time series;
a second calculation unit, configured to calculate the median of all the reconstruction errors;
a third determining unit, configured to determine the median as the specified anomaly score value corresponding to the specified vector in the time series.
In this embodiment, for the implementation of the functions of the first calculation unit, the second calculation unit, and the third determining unit in the autoencoder-based data anomaly identification apparatus, see the implementation of the corresponding steps S300 to S302 in the autoencoder-based data anomaly identification method above; details are not repeated here.
Further, in an embodiment of the present application, the first calculation unit includes:
a processing subunit, configured to reconstruct the time series with a specific autoencoder to obtain a specific reconstructed time series corresponding to the time series, where the specific autoencoder is any one of the autoencoders in the autoencoder integration framework;
an extraction subunit, configured to extract, from the specific reconstructed time series, the specific reconstruction vector corresponding to the specified vector;
a calculation subunit, configured to calculate the specific reconstruction error corresponding to the specified vector according to the specified vector and the specific reconstruction vector.
In this embodiment, for the implementation of the functions of the processing subunit, the extraction subunit, and the calculation subunit in the autoencoder-based data anomaly identification apparatus, see the implementation of the corresponding steps S3000 to S3002 in the autoencoder-based data anomaly identification method above; details are not repeated here.
Further, in an embodiment of the present application, the identification module includes:
a sixth acquisition unit, configured to acquire a preset anomaly threshold;
a judging unit, configured to determine whether, among all the anomaly score values, there is a specified score value greater than the anomaly threshold;
a first screening unit, configured to, if so, filter out the specified score value from all the anomaly score values;
a search unit, configured to find, in the time series, the third vector corresponding to the specified score value;
a fourth determining unit, configured to determine the third vector as the abnormal data value.
In this embodiment, for the implementation of the functions of the sixth acquisition unit, the judging unit, the first screening unit, the search unit, and the fourth determining unit in the autoencoder-based data anomaly identification apparatus, see the implementation of the corresponding steps S400 to S404 in the autoencoder-based data anomaly identification method above; details are not repeated here.
Further, in an embodiment of the present application, the identification module includes:
a second screening unit, configured to screen out, from the time series, the fourth vectors other than the third vector;
a marking unit, configured to mark the fourth vectors as normal data values;
a seventh acquisition unit, configured to acquire a first quantity corresponding to the third vector; and,
an eighth acquisition unit, configured to acquire a second quantity corresponding to the fourth vectors;
a third generating unit, configured to generate an anomaly analysis report corresponding to the time series according to the abnormal data value, the first quantity, the normal data values, and the second quantity;
a display unit, configured to display the anomaly analysis report.
In this embodiment, for the implementation of the functions of the second screening unit, the marking unit, the seventh acquisition unit, the eighth acquisition unit, the third generating unit, and the display unit in the autoencoder-based data anomaly identification apparatus, see the implementation of the corresponding steps S405 to S410 in the autoencoder-based data anomaly identification method above; details are not repeated here.
Referring to FIG. 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, a display screen, an input device, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data such as the time series to be detected, the sparsely connected autoencoders, the autoencoder integration framework, the anomaly score values, and the abnormal data values. The network interface of the computer device is used to communicate with external terminals over a network connection. The display screen of the computer device is a graphic and text output device that converts digital signals into optical signals so that text and graphics are shown on the screen. The input device of the computer device is the main means of exchanging information between the computer and the user or other devices, and is used to transmit data, instructions, and certain flag information to the computer. When executed by the processor, the computer program implements an autoencoder-based data anomaly identification method.
上述处理器执行上述基于自编码器的数据异常识别方法的步骤:The above-mentioned processor performs the steps of the above-mentioned self-encoder-based data anomaly identification method:
接收输入的待检测的时间序列;Receive the input time series to be detected;
基于所述时间序列,按照预设规则对预生成的指定数量的稀疏连接的自编码器进行集成训练处理,生成对应的自编码器集成框架,其中,所述稀疏连接的自编码器是通过分别对指定数量的基于循环神经网络的自编码器进行单元连接删除处理后生成的;Based on the time series, an integrated training process is performed on a pre-generated specified number of sparsely connected autoencoders according to preset rules, and a corresponding autoencoder integration framework is generated, wherein the sparsely connected autoencoders are obtained by separately Generated after the specified number of cyclic neural network-based autoencoders are processed by unit connection deletion;
通过所述自编码器集成框架计算出所述时间序列中包含的每一个向量所对应的异常分数值;Calculate the abnormal score value corresponding to each vector included in the time series through the autoencoder integration framework;
根据所述异常分数值,识别出所述时间序列中是否存在异常数据值。According to the abnormal score value, it is identified whether there is abnormal data value in the time series.
Those skilled in the art will understand that the structure shown in FIG. 3 is only a block diagram of the part of the structure relevant to the solution of the present application, and does not limit the apparatus or computer device to which the solution is applied.
An embodiment of the present application further provides a computer-readable storage medium, which may be non-volatile or volatile and on which a computer program is stored. When executed by a processor, the computer program implements the autoencoder-based data anomaly identification method of any of the above exemplary embodiments, the method comprising the following steps:
receiving an input time series to be detected;
based on the time series, performing ensemble training on a pre-generated specified number of sparsely connected autoencoders according to a preset rule to generate a corresponding autoencoder ensemble framework, wherein the sparsely connected autoencoders are generated by performing unit connection deletion on each of a specified number of recurrent-neural-network-based autoencoders;
calculating, through the autoencoder ensemble framework, an anomaly score value corresponding to each vector contained in the time series;
identifying, according to the anomaly score values, whether an abnormal data value exists in the time series.
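The "unit connection deletion" that produces the sparsely connected autoencoders can be illustrated with a short Python sketch. It assumes that connections are deleted at random with a fixed keep probability and that deletion is applied as a 0/1 mask on the recurrent weight matrix. Both choices are assumptions made for illustration, as are the names `make_sparse_masks`, `masked_rnn_step`, and `keep_prob`.

```python
import numpy as np

def make_sparse_masks(num_autoencoders, hidden_size, keep_prob, seed=0):
    """Draw one 0/1 mask per autoencoder; a 0 marks a deleted connection.

    Random deletion with probability (1 - keep_prob) is an assumption.
    """
    rng = np.random.default_rng(seed)
    shape = (num_autoencoders, hidden_size, hidden_size)
    return (rng.random(shape) < keep_prob).astype(np.float64)

def masked_rnn_step(x_t, h_prev, W_in, W_rec, mask):
    """One recurrent step in which masked-out recurrent weights are unused."""
    # Deleted connections contribute nothing: their weights are multiplied by 0.
    return np.tanh(x_t @ W_in + h_prev @ (W_rec * mask))

# Four sparsely connected variants of the same recurrent layer.
masks = make_sparse_masks(num_autoencoders=4, hidden_size=8, keep_prob=0.7)

rng = np.random.default_rng(1)
W_in = rng.normal(size=(3, 8))    # input-to-hidden weights
W_rec = rng.normal(size=(8, 8))   # hidden-to-hidden weights
x_t = rng.normal(size=3)          # one input vector of the time series
h_prev = rng.normal(size=8)       # previous hidden state

hidden_states = [masked_rnn_step(x_t, h_prev, W_in, W_rec, m) for m in masks]
```

Because each mask differs, the ensemble members see different connection structures, which is what makes their reconstruction errors worth aggregating.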
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments may be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, herein, the terms "comprising", "including", or any other variation thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, apparatus, article, or method that includes the element.
The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. An autoencoder-based data anomaly identification method, comprising:
    receiving an input time series to be detected;
    based on the time series, performing ensemble training on a pre-generated specified number of sparsely connected autoencoders according to a preset rule to generate a corresponding autoencoder ensemble framework, wherein the sparsely connected autoencoders are generated by performing unit connection deletion on each of a specified number of recurrent-neural-network-based autoencoders;
    calculating, through the autoencoder ensemble framework, an anomaly score value corresponding to each vector contained in the time series;
    identifying, according to the anomaly score values, whether an abnormal data value exists in the time series.
  2. The autoencoder-based data anomaly identification method according to claim 1, wherein the step of performing, based on the time series, ensemble training on the pre-generated specified number of sparsely connected autoencoders according to the preset rule to generate the corresponding autoencoder ensemble framework comprises:
    obtaining all first vectors contained in the time series; and
    obtaining first reconstruction vectors, in one-to-one correspondence, generated by each of the sparsely connected autoencoders based on each of the first vectors;
    generating a corresponding first objective function based on the first vectors and the first reconstruction vectors;
    training each of the sparsely connected autoencoders separately based on the first objective function to obtain trained first autoencoders, wherein the number of first autoencoders is the same as the number of sparsely connected autoencoders;
    performing ensemble processing on all the first autoencoders to generate a corresponding independent framework, wherein the independent framework contains the specified number of first autoencoders, and the first autoencoders do not interact with one another;
    determining the independent framework as the autoencoder ensemble framework.
  3. The autoencoder-based data anomaly identification method according to claim 1, wherein the step of performing, based on the time series, ensemble training on the pre-generated specified number of sparsely connected autoencoders according to the preset rule to generate the corresponding autoencoder ensemble framework comprises:
    obtaining a preset shared layer, wherein the shared layer includes a shared hidden state;
    performing weight sharing on all the sparsely connected autoencoders through the shared layer;
    performing L1 regularization on the shared hidden state to obtain a processed shared hidden state;
    obtaining all second vectors contained in the time series; and
    obtaining second reconstruction vectors, in one-to-one correspondence, generated by each of the sparsely connected autoencoders based on each of the second vectors;
    generating a corresponding second objective function according to the processed shared hidden state, the second vectors, and the second reconstruction vectors;
    jointly training all the sparsely connected autoencoders based on the second objective function to obtain trained second autoencoders, wherein the number of second autoencoders is the same as the number of sparsely connected autoencoders;
    performing ensemble processing on all the second autoencoders to generate a corresponding shared framework, wherein the shared framework contains the specified number of second autoencoders, and the second autoencoders interact with one another;
    determining the shared framework as the autoencoder ensemble framework.
  4. The autoencoder-based data anomaly identification method according to claim 1, wherein the step of calculating, through the autoencoder ensemble framework, the anomaly score value corresponding to each vector contained in the time series comprises:
    calculating, through each autoencoder contained in the autoencoder ensemble framework, a reconstruction error corresponding to a specified vector, wherein the specified vector is any one of the vectors contained in the time series;
    calculating the median of all the reconstruction errors;
    determining the median as the specified anomaly score value corresponding to the specified vector in the time series.
  5. The autoencoder-based data anomaly identification method according to claim 4, wherein the step of calculating, through each autoencoder contained in the autoencoder ensemble framework, the reconstruction error corresponding to the specified vector comprises:
    reconstructing the time series through a specific autoencoder to obtain a specific reconstructed time series corresponding to the time series, wherein the specific autoencoder is any one of the autoencoders contained in the autoencoder ensemble framework;
    extracting, from the specific reconstructed time series, a specific reconstruction vector corresponding to the specified vector;
    calculating, according to the specified vector and the specific reconstruction vector, a specific reconstruction error corresponding to the specified vector.
  6. The autoencoder-based data anomaly identification method according to claim 1, wherein the step of identifying, according to the anomaly score values, whether an abnormal data value exists in the time series comprises:
    obtaining a preset anomaly threshold;
    determining whether, among all the anomaly score values, there is a specified score value greater than the anomaly threshold;
    if so, selecting the specified score value from all the anomaly score values;
    finding, from the time series, a third vector corresponding to the specified score value;
    determining the third vector as the abnormal data value.
  7. The autoencoder-based data anomaly identification method according to claim 6, wherein, after the step of determining the third vector as the abnormal data value, the method comprises:
    selecting, from the time series, fourth vectors other than the third vector;
    marking the second vector as a normal data value;
    obtaining a first quantity corresponding to the third vector; and
    obtaining a second quantity corresponding to the fourth vectors;
    generating an anomaly analysis report corresponding to the time series according to the abnormal data value, the first quantity, the normal data value, and the second quantity;
    displaying the anomaly analysis report.
  8. An autoencoder-based data anomaly identification apparatus, comprising:
    a receiving module, configured to receive an input time series to be detected;
    a training module, configured to perform, based on the time series, ensemble training on a pre-generated specified number of sparsely connected autoencoders according to a preset rule to generate a corresponding autoencoder ensemble framework, wherein the sparsely connected autoencoders are generated by performing unit connection deletion on each of a specified number of recurrent-neural-network-based autoencoders;
    a calculation module, configured to calculate, through the autoencoder ensemble framework, an anomaly score value corresponding to each vector contained in the time series;
    an identification module, configured to identify, according to the anomaly score values, whether an abnormal data value exists in the time series.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements an autoencoder-based data anomaly identification method,
    wherein the autoencoder-based data anomaly identification method comprises:
    receiving an input time series to be detected;
    based on the time series, performing ensemble training on a pre-generated specified number of sparsely connected autoencoders according to a preset rule to generate a corresponding autoencoder ensemble framework, wherein the sparsely connected autoencoders are generated by performing unit connection deletion on each of a specified number of recurrent-neural-network-based autoencoders;
    calculating, through the autoencoder ensemble framework, an anomaly score value corresponding to each vector contained in the time series;
    identifying, according to the anomaly score values, whether an abnormal data value exists in the time series.
  10. The computer device according to claim 9, wherein the step of performing, based on the time series, ensemble training on the pre-generated specified number of sparsely connected autoencoders according to the preset rule to generate the corresponding autoencoder ensemble framework comprises:
    obtaining all first vectors contained in the time series; and
    obtaining first reconstruction vectors, in one-to-one correspondence, generated by each of the sparsely connected autoencoders based on each of the first vectors;
    generating a corresponding first objective function based on the first vectors and the first reconstruction vectors;
    training each of the sparsely connected autoencoders separately based on the first objective function to obtain trained first autoencoders, wherein the number of first autoencoders is the same as the number of sparsely connected autoencoders;
    performing ensemble processing on all the first autoencoders to generate a corresponding independent framework, wherein the independent framework contains the specified number of first autoencoders, and the first autoencoders do not interact with one another;
    determining the independent framework as the autoencoder ensemble framework.
  11. The computer device according to claim 9, wherein the step of performing, based on the time series, ensemble training on the pre-generated specified number of sparsely connected autoencoders according to the preset rule to generate the corresponding autoencoder ensemble framework comprises:
    obtaining a preset shared layer, wherein the shared layer includes a shared hidden state;
    performing weight sharing on all the sparsely connected autoencoders through the shared layer;
    performing L1 regularization on the shared hidden state to obtain a processed shared hidden state;
    obtaining all second vectors contained in the time series; and
    obtaining second reconstruction vectors, in one-to-one correspondence, generated by each of the sparsely connected autoencoders based on each of the second vectors;
    generating a corresponding second objective function according to the processed shared hidden state, the second vectors, and the second reconstruction vectors;
    jointly training all the sparsely connected autoencoders based on the second objective function to obtain trained second autoencoders, wherein the number of second autoencoders is the same as the number of sparsely connected autoencoders;
    performing ensemble processing on all the second autoencoders to generate a corresponding shared framework, wherein the shared framework contains the specified number of second autoencoders, and the second autoencoders interact with one another;
    determining the shared framework as the autoencoder ensemble framework.
  12. The computer device according to claim 9, wherein the step of calculating, through the autoencoder ensemble framework, the anomaly score value corresponding to each vector contained in the time series comprises:
    calculating, through each autoencoder contained in the autoencoder ensemble framework, a reconstruction error corresponding to a specified vector, wherein the specified vector is any one of the vectors contained in the time series;
    calculating the median of all the reconstruction errors;
    determining the median as the specified anomaly score value corresponding to the specified vector in the time series.
  13. The computer device according to claim 12, wherein the step of calculating, through each autoencoder contained in the autoencoder ensemble framework, the reconstruction error corresponding to the specified vector comprises:
    reconstructing the time series through a specific autoencoder to obtain a specific reconstructed time series corresponding to the time series, wherein the specific autoencoder is any one of the autoencoders contained in the autoencoder ensemble framework;
    extracting, from the specific reconstructed time series, a specific reconstruction vector corresponding to the specified vector;
    calculating, according to the specified vector and the specific reconstruction vector, a specific reconstruction error corresponding to the specified vector.
  14. The computer device according to claim 9, wherein the step of identifying, according to the anomaly score values, whether an abnormal data value exists in the time series comprises:
    obtaining a preset anomaly threshold;
    determining whether, among all the anomaly score values, there is a specified score value greater than the anomaly threshold;
    if so, selecting the specified score value from all the anomaly score values;
    finding, from the time series, a third vector corresponding to the specified score value;
    determining the third vector as the abnormal data value.
  15. The computer device according to claim 14, wherein, after the step of determining the third vector as the abnormal data value, the method comprises:
    selecting, from the time series, fourth vectors other than the third vector;
    marking the second vector as a normal data value;
    obtaining a first quantity corresponding to the third vector; and
    obtaining a second quantity corresponding to the fourth vectors;
    generating an anomaly analysis report corresponding to the time series according to the abnormal data value, the first quantity, the normal data value, and the second quantity;
    displaying the anomaly analysis report.
  16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements an autoencoder-based data anomaly identification method, the autoencoder-based data anomaly identification method comprising the following steps:
    receiving an input time series to be detected;
    based on the time series, performing ensemble training on a pre-generated specified number of sparsely connected autoencoders according to a preset rule to generate a corresponding autoencoder ensemble framework, wherein the sparsely connected autoencoders are generated by performing unit connection deletion on each of a specified number of recurrent-neural-network-based autoencoders;
    calculating, through the autoencoder ensemble framework, an anomaly score value corresponding to each vector contained in the time series;
    identifying, according to the anomaly score values, whether an abnormal data value exists in the time series.
  17. The computer-readable storage medium according to claim 16, wherein the step of performing, based on the time series, ensemble training on the pre-generated specified number of sparsely connected autoencoders according to the preset rule to generate the corresponding autoencoder ensemble framework comprises:
    obtaining all first vectors contained in the time series; and
    obtaining first reconstruction vectors, in one-to-one correspondence, generated by each of the sparsely connected autoencoders based on each of the first vectors;
    generating a corresponding first objective function based on the first vectors and the first reconstruction vectors;
    training each of the sparsely connected autoencoders separately based on the first objective function to obtain trained first autoencoders, wherein the number of first autoencoders is the same as the number of sparsely connected autoencoders;
    performing ensemble processing on all the first autoencoders to generate a corresponding independent framework, wherein the independent framework contains the specified number of first autoencoders, and the first autoencoders do not interact with one another;
    determining the independent framework as the autoencoder ensemble framework.
  18. The computer-readable storage medium according to claim 16, wherein the step of calculating, through the autoencoder ensemble framework, the anomaly score value corresponding to each vector contained in the time series comprises:
    calculating, through each autoencoder contained in the autoencoder ensemble framework, a reconstruction error corresponding to a specified vector, wherein the specified vector is any one of the vectors contained in the time series;
    calculating the median of all the reconstruction errors;
    determining the median as the specified anomaly score value corresponding to the specified vector in the time series.
  19. The computer-readable storage medium according to claim 18, wherein the step of calculating, through each autoencoder contained in the autoencoder ensemble framework, the reconstruction error corresponding to the specified vector comprises:
    reconstructing the time series through a specific autoencoder to obtain a specific reconstructed time series corresponding to the time series, wherein the specific autoencoder is any one of the autoencoders contained in the autoencoder ensemble framework;
    extracting, from the specific reconstructed time series, a specific reconstruction vector corresponding to the specified vector;
    calculating, according to the specified vector and the specific reconstruction vector, a specific reconstruction error corresponding to the specified vector.
  20. The computer-readable storage medium according to claim 16, wherein the step of identifying, according to the anomaly score values, whether an abnormal data value exists in the time series comprises:
    obtaining a preset anomaly threshold;
    determining whether, among all the anomaly score values, there is a specified score value greater than the anomaly threshold;
    if so, selecting the specified score value from all the anomaly score values;
    finding, from the time series, a third vector corresponding to the specified score value;
    determining the third vector as the abnormal data value.
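
For orientation, the two training objectives recited in the claims (the per-autoencoder objective of the independent framework and the joint, L1-regularized objective of the shared framework) might take a form like the following. This is only a sketch under stated assumptions: the claims do not fix the error measure, so the squared Euclidean norm is assumed; the weighting coefficient \lambda and the symbols x_t, \hat{x}_t^{(i)}, h^{\mathrm{shared}}, N, and T are notation introduced here, not taken from the application.

```latex
% Independent framework: one objective per sparsely connected autoencoder i,
% where x_t is the t-th vector of the time series and \hat{x}_t^{(i)} is its
% reconstruction by autoencoder i (squared L2 error is an assumption).
J_i = \sum_{t=1}^{T} \left\lVert x_t - \hat{x}_t^{(i)} \right\rVert_2^2,
\qquad i = 1, \dots, N

% Shared framework: a single joint objective over all N autoencoders, plus an
% L1 penalty on the shared hidden state (the weight \lambda is an assumption).
J = \sum_{i=1}^{N} \sum_{t=1}^{T} \left\lVert x_t - \hat{x}_t^{(i)} \right\rVert_2^2
    + \lambda \left\lVert h^{\mathrm{shared}} \right\rVert_1
```

In the first form each J_i is minimized separately, so the autoencoders never interact. In the second form the shared term couples them, which matches the claims' statement that the second autoencoders interact.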
PCT/CN2021/097550 2020-11-09 2021-05-31 Auto-encoder-based data anomaly identification method and apparatus and computer device WO2022095434A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011242143.5 2020-11-09
CN202011242143.5A CN112329865B (en) 2020-11-09 2020-11-09 Data anomaly identification method and device based on self-encoder and computer equipment

Publications (1)

Publication Number Publication Date
WO2022095434A1 true WO2022095434A1 (en) 2022-05-12

Family

ID=74316541

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097550 WO2022095434A1 (en) 2020-11-09 2021-05-31 Auto-encoder-based data anomaly identification method and apparatus and computer device

Country Status (2)

Country Link
CN (1) CN112329865B (en)
WO (1) WO2022095434A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116165353A (en) * 2023-04-26 2023-05-26 江西拓荒者科技有限公司 Industrial pollutant monitoring data processing method and system

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329865B (en) * 2020-11-09 2023-09-08 平安科技(深圳)有限公司 Data anomaly identification method and device based on self-encoder and computer equipment
CN112839059B (en) * 2021-02-22 2022-08-30 北京六方云信息技术有限公司 WEB intrusion detection self-adaptive alarm filtering processing method and device and electronic equipment
CN113114529B (en) * 2021-03-25 2022-05-24 清华大学 KPI (Key Performance indicator) anomaly detection method and device based on condition variation automatic encoder and computer storage medium
CN113671917B (en) * 2021-08-19 2022-08-02 中国科学院自动化研究所 Detection method, system and equipment for abnormal state of multi-modal industrial process
CN114066435A (en) * 2021-11-10 2022-02-18 广东工业大学 Block chain illegal address detection method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798340A (en) * 2017-09-29 2018-03-13 中国地质大学(武汉) Multiple Geochemical abnormality recognition method based on the more self-encoding encoders of space constraint
CN109902564A (en) * 2019-01-17 2019-06-18 杭州电子科技大学 A kind of accident detection method based on the sparse autoencoder network of structural similarity
US20200106795A1 (en) * 2017-06-09 2020-04-02 British Telecommunications Public Limited Company Anomaly detection in computer networks
CN111724074A (en) * 2020-06-23 2020-09-29 华中科技大学 Pavement lesion detection early warning method and system based on deep learning
CN112329865A (en) * 2020-11-09 2021-02-05 平安科技(深圳)有限公司 Data anomaly identification method and device based on self-encoder and computer equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9606167B2 (en) * 2011-08-03 2017-03-28 President And Fellows Of Harvard College System and method for detecting integrated circuit anomalies
CN107480777A (en) * 2017-08-28 2017-12-15 北京师范大学 Fast training method for sparse autoencoders based on pseudo-inverse learning
CN110119447B (en) * 2019-04-26 2023-06-16 平安科技(深圳)有限公司 Self-coding neural network processing method, device, computer equipment and storage medium
CN111178523B (en) * 2019-08-02 2023-06-06 腾讯科技(深圳)有限公司 Behavior detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112329865A (en) 2021-02-05
CN112329865B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
WO2022095434A1 (en) Auto-encoder-based data anomaly identification method and apparatus and computer device
CN109087079B (en) Digital currency transaction information analysis method
CN110797124A (en) Model multi-terminal collaborative training method, medical risk prediction method and device
WO2020220545A1 (en) Long short-term memory model-based disease prediction method and apparatus, and computer device
WO2023065632A1 (en) Data desensitization method, data desensitization apparatus, device, and storage medium
CN112464117A (en) Request processing method and device, computer equipment and storage medium
CN110875093A (en) Treatment scheme processing method, device, equipment and storage medium
CN112016318A (en) Triage information recommendation method, device, equipment and medium based on interpretation model
CN112132624A (en) Medical claims data prediction system
CN111831908A (en) Medical field knowledge graph construction method, device, equipment and storage medium
CN112036749A (en) Method and device for identifying risk user based on medical data and computer equipment
CN113672654B (en) Data query method, device, computer equipment and storage medium
CN113051372B (en) Material data processing method, device, computer equipment and storage medium
WO2021114613A1 (en) Artificial intelligence-based fault node identification method, device, apparatus, and medium
WO2021155684A1 (en) Gene-disease relationship knowledge base construction method and apparatus, and computer device
CN113986581A (en) Data aggregation processing method and device, computer equipment and storage medium
CN112200684B (en) Method, system and storage medium for detecting medical insurance fraud
US20210019518A1 (en) Enterprise Profile Management and Control System
CN112102311A (en) Thyroid nodule image processing method and device and computer equipment
CN111275059B (en) Image processing method and device and computer readable storage medium
CN117540336A (en) Time sequence prediction method and device and electronic equipment
CN116776857A (en) Customer call key information extraction method, device, computer equipment and medium
CN114021732B (en) Proportional risk regression model training method, device and system and storage medium
CN114360732B (en) Medical data analysis method, device, electronic equipment and storage medium
CN116151369A (en) Byzantine-robust federated learning system and method with public auditing

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21888137; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21888137; Country of ref document: EP; Kind code of ref document: A1)