CN116089788B

CN116089788B - Online missing data processing method and device, computer equipment and storage medium

Info

Publication number: CN116089788B
Application number: CN202310292552.3A
Authority: CN
Inventors: 余方晨; 李文烨
Original assignee: Shenzhen Research Institute of Big Data SRIBD
Current assignee: Shenzhen Research Institute of Big Data SRIBD
Priority date: 2023-03-23
Filing date: 2023-03-23
Publication date: 2023-08-22
Anticipated expiration: 2043-03-23
Also published as: CN116089788A

Abstract

The embodiments of the application belong to the technical field of computer networks and relate to an online missing data processing method and device based on a similarity matrix, a computer device and a storage medium. The method comprises: collecting a current monitoring data matrix of a target network in real time; when the current monitoring data matrix has missing data, calculating an initial similarity matrix of the current monitoring data matrix and all offline data; correcting the initial similarity matrix according to an optimization method to obtain a correction similarity matrix; and carrying out missing filling processing on the missing data of the current monitoring data matrix according to the correction similarity matrix. The application can effectively improve the processing efficiency of missing data streams and greatly improve the accuracy of similarity calculation, and has a firm theoretical guarantee as well as good expandability and universality.

Description

Online missing data processing method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of computer networks, and in particular, to a method and apparatus for processing online missing data based on a similarity matrix, a computer device, and a storage medium.

Background

In real-world scenarios, data are often missing or incomplete: for various reasons, data may be lost or unavailable during collection or processing. For an online data processing task, online missing data (Online Incomplete Data) refers to a data stream (Data Stream) that is generated in real time and contains missing values. This is a common problem in data analysis and mining. Equipment failure, network delay, information omission and the like may all produce online missing data, and the problem arises widely in fields such as the Internet of Things, sensor networks and social networks. Online missing data have a great influence on similarity calculation and machine learning tasks, because they reduce data quality and usability, increase the uncertainty of similarity matrix calculation, and affect the accuracy and stability of the model. If the missing data are not handled reasonably, the model may suffer from bias, variance, overfitting and similar problems. Efficiently identifying and processing online missing data in real time before performing similarity calculation is therefore a significant challenge for processing methods and systems.

Existing online missing data processing methods fill the missing values using existing data or additional information to obtain a complete data matrix, and then calculate the similarity matrix from the complete data matrix.

However, the applicant found that conventional online data processing methods are difficult to adapt when missing data are generated online in real time in the form of a data stream. Online missing data differ from offline missing data in that they must be processed within limited time and space, and the method cannot wait for all the data to arrive before completing, ignoring or modeling the missing values. Conventional online data processing methods therefore have low processing efficiency when processing missing data streams generated in real time.

Disclosure of Invention

The embodiment of the application aims to provide an online missing data processing method, device, computer equipment and storage medium based on a similarity matrix, so as to solve the problem that the processing efficiency is low when a missing data stream generated in real time is processed by the traditional online data processing method.

In order to solve the above technical problems, the embodiment of the present application provides an online missing data processing method based on a similarity matrix, which adopts the following technical scheme:

Collecting a current monitoring data matrix of a target network in real time;

when the current monitoring data matrix has missing data, calculating an initial similarity matrix $S^0$ of the current monitoring data matrix and all offline data, the initial similarity matrix $S^0$ being expressed as:

$$S^0 = \{S^0_{ij}\};$$

wherein $x_i$ represents any one data of the current monitoring data matrix, $x_j$ represents any one data of all the offline data, and $S^0_{ij}$ is expressed as:

$$S^0_{ij} = \frac{x_i(I)^\top x_j(I)}{\|x_i(I)\|\,\|x_j(I)\|};$$

wherein $I$ is the set of dimensions in which $x_i$ and $x_j$ both have non-NaN feature values;

correcting the initial similarity matrix $S^0$ according to an optimization method to obtain a correction similarity matrix;

and carrying out missing filling processing on missing data of the current monitoring data matrix according to the correction similarity matrix.

Further, correcting the initial similarity matrix $S^0$ according to the optimization method to obtain the correction similarity matrix specifically comprises the following steps:

correcting the initial similarity matrix $S^0$ according to a first optimization target to obtain the correction similarity matrix, wherein the first optimization target is expressed as:

$$\min_{S}\ \|S - S^0\|_F^2 \quad \text{s.t.}\quad S \succeq 0;$$

wherein $S^0$ is the initial similarity matrix of the current monitoring data matrix and all offline data, $\|\cdot\|_F^2$ represents the square of the F-norm (Frobenius Norm), and $S \succeq 0$ represents that the matrix $S$ is positive semi-definite.

Further, correcting the initial similarity matrix $S^0$ according to the optimization method to obtain the correction similarity matrix specifically comprises the following steps:

correcting the initial similarity matrix $S^0$ according to a second optimization target to obtain the correction similarity matrix, wherein the initial similarity matrix $S^0$ can be written as:

$$S^0 = \begin{bmatrix} S_n & v^0 \\ (v^0)^\top & c \end{bmatrix};$$

wherein $S_n$ is an n×n matrix calculated from the n data of all the offline data; $v^0$ is an n×1 vector representing the initial similarity vector between the missing data and all of the offline data; $(v^0)^\top$ is a 1×n vector, the transpose of the vector $v^0$; $c$ represents a similarity constant; the second optimization target is expressed as:

$$\min_{v}\ \|v - v^0\|^2 \quad \text{s.t.}\quad v^\top S_n^{-1} v \le c;$$

wherein $S_n^{-1}$ represents the inverse of the matrix $S_n$.

Further, after the step of correcting the initial similarity matrix $S^0$ according to the optimization method to obtain the correction similarity matrix, the method further comprises the following steps:

when the missing data appear in order in real time, repeating the calculation and correction of the initial similarity matrix $S^0$, and updating the correction similarity matrix in real time until the dimension of the correction similarity matrix is finally enlarged from (n+1)×(n+1) to (n+m)×(n+m).

In order to solve the above technical problems, the embodiment of the present application further provides an online missing data processing device based on a similarity matrix, which adopts the following technical scheme:

the monitoring data acquisition module is used for acquiring a current monitoring data matrix of the target network in real time;

the initial matrix calculation module is used for calculating an initial similarity matrix $S^0$ of the current monitoring data matrix and all offline data when missing data appear in the current monitoring data matrix, the initial similarity matrix $S^0$ being expressed as:

$$S^0 = \{S^0_{ij}\};$$

the correction matrix acquisition module is used for correcting the initial similarity matrix $S^0$ according to an optimization method to obtain a correction similarity matrix;

and the missing filling module is used for carrying out missing filling processing on missing data of the current monitoring data matrix according to the correction similarity matrix.

Further, the correction matrix acquisition module includes:

a first correction processing sub-module for correcting the initial similarity matrix $S^0$ according to a first optimization target to obtain the correction similarity matrix, wherein the first optimization target is expressed as:

$$\min_{S}\ \|S - S^0\|_F^2 \quad \text{s.t.}\quad S \succeq 0;$$

wherein $S^0$ is the initial similarity matrix of the current monitoring data matrix and all offline data, $\|\cdot\|_F^2$ represents the square of the F-norm (Frobenius Norm), and $S \succeq 0$ represents that the matrix $S$ is positive semi-definite.

Further, the correction matrix acquisition module includes:

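a second correction processing sub-module for correcting the initial similarity matrix $S^0$ according to a second optimization target to obtain the correction similarity matrix, wherein the initial similarity matrix $S^0$ can be written as:

$$S^0 = \begin{bmatrix} S_n & v^0 \\ (v^0)^\top & c \end{bmatrix};$$

wherein $S_n$ is the n×n matrix calculated from the n data of all the offline data, $v^0$ is the n×1 initial similarity vector between the missing data and all the offline data, $c$ represents a similarity constant, the second optimization target is to minimize $\|v - v^0\|^2$ over $v$ subject to $v^\top S_n^{-1} v \le c$, and $S_n^{-1}$ represents the inverse of the matrix $S_n$.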

Further, the device further comprises:

a matrix updating module for repeating the calculation and correction of the initial similarity matrix $S^0$ when the missing data appear in order in real time, and updating the correction similarity matrix in real time until the dimension of the correction similarity matrix is finally enlarged from (n+1)×(n+1) to (n+m)×(n+m).

In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:

the method comprises a memory and a processor, wherein the memory stores computer readable instructions, and the processor executes the computer readable instructions to realize the steps of the online missing data processing method based on the similarity matrix.

In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:

the computer readable storage medium has stored thereon computer readable instructions which when executed by a processor implement the steps of the similarity matrix based online missing data processing method described above.

The application provides an online missing data processing method based on a similarity matrix, which comprises the following steps: collecting a current monitoring data matrix of a target network in real time; when the current monitoring data matrix has missing data, calculating an initial similarity matrix $S^0$ of the current monitoring data matrix and all offline data, the initial similarity matrix being expressed as $S^0 = \{S^0_{ij}\}$, wherein $x_i$ represents any one data of the current monitoring data matrix, $x_j$ represents any one data of all the offline data, $S^0_{ij}$ is expressed as $S^0_{ij} = x_i(I)^\top x_j(I) / (\|x_i(I)\|\,\|x_j(I)\|)$, and $I$ is the set of dimensions in which $x_i$ and $x_j$ both have non-NaN feature values; correcting the initial similarity matrix $S^0$ according to an optimization method to obtain a correction similarity matrix; and carrying out missing filling processing on the missing data of the current monitoring data matrix according to the correction similarity matrix. Compared with the prior art, the application can effectively improve the processing efficiency of missing data streams and greatly improve the accuracy of similarity calculation, and has a firm theoretical guarantee as well as good expandability and universality.

Drawings

In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flowchart of an implementation of an online missing data processing method based on a similarity matrix according to an embodiment of the present application;

FIG. 3 is a flow chart of one embodiment of step S203 of FIG. 2;

FIG. 4 is a flow chart of another embodiment of step S203 in FIG. 2;

FIG. 5 is a flow chart of one embodiment after step S203 in FIG. 2;

FIG. 6 is a schematic diagram comparing the similarity matrix estimation errors of OnMC, RF and KFMC on the MNIST data set according to the first embodiment of the present application;

FIG. 7 is a schematic diagram comparing the similarity matrix estimation errors of OnMC, RF and KFMC on the PROTEIN data set according to an embodiment of the present application;

FIG. 8 is a schematic diagram comparing the missing data processing times of OnMC, RF and KFMC on the MNIST data set according to an embodiment of the present application;

FIG. 9 is a schematic diagram comparing the missing data processing times of OnMC, RF and KFMC on the PROTEIN data set according to an embodiment of the present application;

FIG. 10 is a schematic diagram of an online missing data processing device based on a similarity matrix according to a second embodiment of the present application;

FIG. 11 is a schematic structural view of one embodiment of a computer device according to the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.

As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that, the online missing data processing method based on the similarity matrix provided by the embodiment of the application is generally executed by a server/terminal device, and correspondingly, the online missing data processing device based on the similarity matrix is generally arranged in the server/terminal device.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Example 1

With continued reference to fig. 2, a flowchart of an implementation of the method for processing online missing data based on a similarity matrix according to the first embodiment of the present application is shown, and for convenience of explanation, only a portion relevant to the present application is shown.

The online missing data processing method based on the similarity matrix comprises the following steps of:

in step S201, a current monitoring data matrix of the target network is collected in real time;

in step S202, when missing data occur in the current monitoring data matrix, an initial similarity matrix $S^0$ of the current monitoring data matrix and all offline data is calculated, the initial similarity matrix $S^0$ being expressed as:

$$S^0 = \{S^0_{ij}\};$$

in step S203, the initial similarity matrix $S^0$ is corrected according to an optimization method to obtain a correction similarity matrix;

in step S204, the missing data of the current monitored data matrix is subjected to missing filling processing according to the correction similarity matrix.
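The text does not spell out how step S204 turns the correction similarity matrix into filled values. Purely as an illustration, the sketch below assumes one simple rule: each missing feature of an online sample is replaced by an average of the complete offline samples, weighted by the corrected similarities. The function name `fill_missing`, its arguments and the weighting rule are assumptions made here, not the method stated in the application.

```python
import numpy as np

def fill_missing(y, X_offline, sims):
    """Fill NaN entries of y with a similarity-weighted average (illustrative rule only).

    y:         (d,) online sample, NaN marks a missing feature
    X_offline: (d, n) complete offline samples, one per column
    sims:      (n,) corrected similarities between y and each offline sample
    """
    weights = np.clip(sims, 0.0, None)       # assumption: negative similarities get zero weight
    if weights.sum() == 0:
        weights = np.ones_like(weights)      # assumption: fall back to a plain average
    estimate = X_offline @ weights / weights.sum()
    filled = y.copy()
    missing = np.isnan(y)
    filled[missing] = estimate[missing]      # observed features are left untouched
    return filled
```

With `sims` taken from the last column of the correction similarity matrix, only the NaN positions of `y` change. Other fill rules, such as nearest-neighbour copying, would fit the same interface.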

In embodiments of the present application, the similarity matrix (Similarity Matrix) is a mathematical tool that describes the similarity between two or more objects according to a certain metric, and has wide application in many fields, such as linear algebra, image processing, machine learning, data mining and bioinformatics. Similarity computation is the basis of many data processing and analysis tasks. The similarity matrix may be used to measure relationships between different samples or features in a data set, thereby supporting machine learning tasks such as clustering, classification, noise reduction and feature selection. The similarity matrix may also be used to compare data of different types or sources, such as images, text, audio and video, to support downstream tasks such as cross-modal retrieval, matching and fusion. Common similarity measures include cosine similarity (Cosine Similarity), the Jaccard coefficient (Jaccard Coefficient), the Gaussian kernel function (Gaussian Kernel), and the like. These measures all require complete and accurate data for their calculation; otherwise, the quality and reliability of the similarity matrix are affected.

In the embodiment of the present application, the initial similarity matrix may be calculated by estimating the similarity between any two data samples containing missing values, so as to obtain the initial similarity matrix between all the data samples. The offline data form an offline sample set $X$, where $X$ is a $d \times n$ matrix comprising $n$ samples $x_1, x_2, \dots, x_n$, and each sample $x_i$ has feature values $x_{i1}, x_{i2}, \dots, x_{id}$ in $d$ dimensions. For any two samples $x_i$ and $x_j$, if a value is missing, the missing feature value is recorded as NaN. Let $I$ denote the set of dimensions in which both samples have non-NaN feature values; then $x_i(I)$ and $x_j(I)$ denote the vectors $x_i$ and $x_j$ restricted to the dimensions in $I$. Thus, for any two missing data samples $x_i$ and $x_j$, estimating their similarity is approximately equivalent to calculating the similarity of $x_i(I)$ and $x_j(I)$. Taking the calculation of cosine similarity as an example:

$$S^0_{ij} = \frac{x_i(I)^\top x_j(I)}{\|x_i(I)\|\,\|x_j(I)\|};$$

After the pairwise similarity $S^0_{ij}$ of two samples is obtained, the initial similarity matrix $S^0 = \{S^0_{ij}\}$ can be calculated; it is an $n \times n$ matrix whose element in the $i$-th row and $j$-th column is $S^0_{ij}$. The online missing data stream forms an online data sample set $Y$, which is a $d \times m$ matrix comprising $m$ missing data samples $y_1, y_2, \dots, y_m$. Each sample $y_t$ has feature values $y_{t1}, y_{t2}, \dots, y_{td}$ in $d$ dimensions, and because of the missing values the feature values of some dimensions are NaN. When the online missing data appear one by one in order from the online data set $Y$, the corresponding initial similarity matrix is updated accordingly: when $t$ online missing data have appeared, the dimension of $S^0$ is $(n+t) \times (n+t)$. That is, as the online data arrive, $S^0$ is gradually updated from $n \times n$ to $(n+1) \times (n+1)$, $(n+2) \times (n+2)$, and finally to an $(n+m) \times (n+m)$ matrix.
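As a minimal sketch of the estimate described above, the following NumPy code computes cosine similarity only over the dimensions that are non-NaN in both samples and assembles the pairwise values into a matrix. The function names, the column-per-sample layout and the fallback value of 0 when two samples share no observed dimension are assumptions made for this illustration.

```python
import numpy as np

def masked_cosine(x, y):
    """Cosine similarity restricted to the dimensions observed (non-NaN) in both vectors."""
    shared = ~np.isnan(x) & ~np.isnan(y)     # the index set I
    if not shared.any():
        return 0.0                           # assumption: no shared dimension -> similarity 0
    xs, ys = x[shared], y[shared]
    denom = np.linalg.norm(xs) * np.linalg.norm(ys)
    return float(xs @ ys / denom) if denom > 0 else 0.0

def initial_similarity_matrix(X):
    """X has shape (d, n): one sample per column, NaN marks a missing feature."""
    n = X.shape[1]
    S0 = np.eye(n)                           # assumption: self-similarity is 1 on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            S0[i, j] = S0[j, i] = masked_cosine(X[:, i], X[:, j])
    return S0
```

Because each pair of samples may use a different index set, the resulting matrix is symmetric but not guaranteed to be positive semi-definite, which is what motivates the later correction step.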

In the embodiment of the application, correcting the initial similarity matrix $S^0$ means using an optimization method to correct $S^0$ into a matrix $\hat{S}$ such that $\hat{S}$ is closer than $S^0$ to the true matrix $S^*$, where $S^*$ represents the similarity matrix calculated from the complete data.

In an embodiment of the present application, an online missing data processing method based on a similarity matrix is provided, including: collecting a current monitoring data matrix of a target network in real time; when missing data appear in the current monitoring data matrix, calculating an initial similarity matrix $S^0$ of the current monitoring data matrix and all offline data, the initial similarity matrix $S^0$ being expressed as:

$$S^0 = \{S^0_{ij}\};$$

wherein $x_i$ represents any one data of the current monitoring data matrix, $x_j$ represents any one data of all the offline data, and $S^0_{ij}$ is expressed as:

$$S^0_{ij} = \frac{x_i(I)^\top x_j(I)}{\|x_i(I)\|\,\|x_j(I)\|};$$

wherein $I$ is the set of dimensions in which $x_i$ and $x_j$ both have non-NaN feature values; correcting the initial similarity matrix $S^0$ according to an optimization method to obtain a correction similarity matrix; and carrying out missing filling processing on the missing data of the current monitoring data matrix according to the correction similarity matrix. Compared with the prior art, the method and the device can effectively improve the processing efficiency of missing data streams and greatly improve the accuracy of similarity calculation, and have a firm theoretical guarantee as well as good expandability and universality.

With continued reference to fig. 3, a flowchart of one embodiment of step S203 of fig. 2 is shown, only the portions relevant to the present application being shown for ease of illustration.

In some optional implementations of this embodiment, step S203 specifically includes: step S301.

In step S301, the initial similarity matrix $S^0$ is corrected according to a first optimization target to obtain the correction similarity matrix, wherein the first optimization target is expressed as:

$$\min_{S}\ \|S - S^0\|_F^2 \quad \text{s.t.}\quad S \succeq 0;$$

In the embodiment of the present application, since $S^0$ is a similarity matrix estimated from missing data, it generally does not satisfy the positive semi-definiteness that a similarity matrix should have. To simplify the problem, the present application first considers the case where only one online missing data sample occurs, and describes the recovery of the positive semi-definiteness of $S^0$ as the following first optimization target (P1):

$$\text{(P1)}\qquad \min_{S}\ \|S - S^0\|_F^2 \quad \text{s.t.}\quad S \succeq 0$$

wherein $S^0$ represents the initial similarity matrix estimated from the n offline samples $x_1, x_2, \dots, x_n$ and one online sample $y_1$, $\|\cdot\|_F^2$ represents the square of the F-norm (Frobenius Norm), and $S \succeq 0$ represents that the matrix $S$ is positive semi-definite (Positive Semi-Definite).
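For intuition about a problem of this form (minimize the squared Frobenius distance to a symmetric matrix subject to positive semi-definiteness), a standard result is that the minimizer is obtained by zeroing the negative eigenvalues. The sketch below shows that textbook projection. It is offered only as an illustration of the matrix-level problem and is not claimed to be the solver used in the application. The name `nearest_psd` is hypothetical.

```python
import numpy as np

def nearest_psd(S0):
    """Frobenius-nearest positive semi-definite matrix to a symmetric matrix S0."""
    S_sym = (S0 + S0.T) / 2.0                # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(S_sym)
    clipped = np.clip(eigvals, 0.0, None)    # set negative eigenvalues to zero
    return (eigvecs * clipped) @ eigvecs.T   # rebuild Q diag(clipped) Q^T
```

This projection needs a full eigendecomposition of the whole matrix each time it is applied, which illustrates why a formulation over a single new vector is attractive when samples arrive one at a time.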

With continued reference to fig. 4, a flowchart of another embodiment of step S203 in fig. 2 is shown, only the portions relevant to the present application being shown for ease of illustration.

In some optional implementations of this embodiment, step S203 specifically includes: step S401.

In step S401, the initial similarity matrix $S^0$ is corrected according to a second optimization target to obtain the correction similarity matrix, wherein the initial similarity matrix $S^0$ can be written as:

$$S^0 = \begin{bmatrix} S_n & v^0 \\ (v^0)^\top & c \end{bmatrix};$$

wherein $S_n$ is an n×n matrix calculated from the n data of all the offline data; $v^0$ is an n×1 vector representing the initial similarity vector between the missing data and all of the offline data; $(v^0)^\top$ is a 1×n vector, the transpose of the vector $v^0$; $c$ represents a similarity constant; the second optimization target is expressed as:

$$\min_{v}\ \|v - v^0\|^2 \quad \text{s.t.}\quad v^\top S_n^{-1} v \le c;$$

wherein $S_n^{-1}$ represents the inverse of the matrix $S_n$.

In the embodiment of the present application, since the optimization variable is the entire matrix $S$ and the matrix changes dynamically, it is difficult to solve the first optimization target of FIG. 3 directly. Thus, the application further describes the first optimization target as a vector optimization problem, so that the similarity vector between the online data and the other data can be updated in real time and the similarity matrix can be corrected quickly. Specifically, the (n+1)×(n+1) matrix $S^0$ can be expressed as:

$$S^0 = \begin{bmatrix} S_n & v^0 \\ (v^0)^\top & c \end{bmatrix}$$

wherein $S_n$ is an n×n matrix calculated from the n offline data samples; $v^0$ is an n×1 vector representing the initial similarity vector between the online missing data sample $y_1$ and all offline samples $x_1, x_2, \dots, x_n$, and because $y_1$ contains missing values, $v^0$ is an imprecise vector; $(v^0)^\top$ is a 1×n vector, the transpose of the vector $v^0$; $c$ is a constant representing the similarity of the online sample $y_1$ to itself, and $c = 1$ in most similarity measures.

In the correction of the initial similarity matrix, the offline data set $X$ is assumed to consist of complete data samples, so that $S_n$ is a positive semi-definite matrix. The application uses an equivalent condition from matrix theory for the positive semi-definiteness of the matrix $S$ to equivalently transform the above first optimization target P1 with respect to the matrix into the following second optimization target (P2) with respect to the vector:

$$\text{(P2)}\qquad \min_{v}\ \|v - v^0\|^2 \quad \text{s.t.}\quad v^\top S_n^{-1} v \le c$$

wherein $v$ represents the n×1 vector described above, $S_n^{-1}$ represents the inverse of the matrix $S_n$ (Inverse Matrix), and $c$ represents the similarity constant described above. After the matrix-valued first optimization target P1 is converted into the vector-valued second optimization target P2, the method solves P2 using the KKT (Karush-Kuhn-Tucker) conditions, so that the solution space is greatly reduced and the solving speed is greatly improved. For the second optimization target, the solution obtained from the KKT conditions is the optimal solution of the optimization problem and is denoted $\hat{v}$; it is the new similarity vector obtained by correcting the initial similarity vector $v^0$. Thus, the initial similarity matrix $S^0$ can be corrected as:

$$\hat{S} = \begin{bmatrix} S_n & \hat{v} \\ \hat{v}^\top & c \end{bmatrix}$$

This correction process is referred to herein as one-step correction (One-step Correction), meaning that the newly added initial similarity vector $v^0$ is corrected. Theoretically, the application can prove that the corrected similarity vector $\hat{v}$ is closer than $v^0$ to the true similarity vector $v^*$, i.e.

$$\|\hat{v} - v^*\| \le \|v^0 - v^*\|$$

Thus, the corrected similarity matrix $\hat{S}$ is closer than the initial similarity matrix $S^0$ to the true similarity matrix $S^*$, i.e.

$$\|\hat{S} - S^*\|_F \le \|S^0 - S^*\|_F$$

This inequality is the theoretical guarantee provided by the application for the matrix correction algorithm. When the online missing data stream $y_1, y_2, \dots, y_m$ appears in order, the application corrects the corresponding initial similarity vectors $v^0$ one by one, thereby obtaining a more accurate correction similarity matrix $\hat{S}$, and the dimension of the similarity matrix $\hat{S}$ is gradually updated from n×n to (n+m)×(n+m).
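A minimal numerical sketch of one-step correction follows, under stated assumptions. It treats the vector problem as "minimize the squared distance between v and the initial vector, subject to v^T S_n^{-1} v <= c", which is the standard Schur complement condition for the bordered matrix to be positive semi-definite when S_n is positive definite. The KKT stationarity condition then gives v = (I + lam * S_n^{-1})^{-1} v0 for a multiplier lam >= 0, and the code finds lam by bisection. The function name, the use of bisection, and the requirement that S_n be strictly positive definite with c > 0 are choices made for this illustration, not details confirmed by the text.

```python
import numpy as np

def one_step_correction(S_n, v0, c=1.0, tol=1e-10, max_iter=200):
    """Solve min ||v - v0||^2 s.t. v^T S_n^{-1} v <= c (S_n positive definite, c > 0 assumed)."""
    d, Q = np.linalg.eigh(S_n)               # S_n = Q diag(d) Q^T, all d > 0 assumed
    w = Q.T @ v0                             # v0 expressed in the eigenbasis of S_n

    def quad(lam):
        # For v(lam) = (I + lam * S_n^{-1})^{-1} v0:
        # v^T S_n^{-1} v = sum_i w_i^2 * d_i / (d_i + lam)^2
        return float(np.sum(w**2 * d / (d + lam) ** 2))

    if quad(0.0) <= c:                       # v0 already feasible: multiplier is zero
        return v0.copy()
    lo, hi = 0.0, 1.0
    while quad(hi) > c:                      # quad decreases in lam: expand until feasible
        hi *= 2.0
    for _ in range(max_iter):                # bisection on the KKT multiplier
        mid = (lo + hi) / 2.0
        if quad(mid) > c:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return Q @ (w * d / (d + hi))            # hi is feasible by construction
```

For example, with `S_n = [[1, 0.5], [0.5, 1]]` and `v0 = [0.9, -0.9]` the constraint is violated, and the routine shrinks the vector to approximately `[0.5, -0.5]`, where the constraint holds with equality. The only per-sample unknown is the scalar multiplier, which is consistent with the text's remark that the vector formulation greatly reduces the solution space.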

With continued reference to fig. 5, a flowchart of one embodiment of fig. 2 after step S203 is shown, and for ease of illustration, only the portions relevant to the present application are shown.

In some optional implementations of the present embodiment, after step S203, further includes: step S501.

In step S501, when the missing data appear in order in real time, the calculation and correction of the initial similarity matrix $S^0$ are repeated, and the correction similarity matrix is updated in real time until the dimension of the correction similarity matrix is finally enlarged from (n+1)×(n+1) to (n+m)×(n+m).
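The repeated update can be sketched as a loop that borders the current matrix with one corrected vector per arriving sample. In the code below, `append_sample`, `process_stream` and the `correct` callback are hypothetical names introduced for illustration; the callback stands in for whatever correction routine is used, and the self-similarity constant is assumed to be 1.

```python
import numpy as np

def append_sample(S, v, c=1.0):
    """Grow a k x k similarity matrix to (k+1) x (k+1) with vector v and self-similarity c."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    return np.block([[S, v], [v.T, np.array([[c]])]])

def process_stream(S_offline, initial_vectors, correct):
    """Fold arriving similarity vectors into the matrix one at a time.

    S_offline:       (n, n) similarity matrix of the offline samples
    initial_vectors: iterable whose t-th item (t = 0, 1, ...) has length n + t
    correct:         callable (S, v0) -> corrected vector (hypothetical hook)
    """
    S = np.array(S_offline, dtype=float)
    for v0 in initial_vectors:
        v = correct(S, np.asarray(v0, dtype=float))
        S = append_sample(S, v)              # dimension grows by one per sample
    return S
```

After m arrivals the matrix has dimension (n+m)×(n+m). Each step touches only the new row and column, so earlier entries are never recomputed.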

In the embodiment of the application, two data sets are selected, namely the handwritten digit data set MNIST and the protein data set PROTEIN. Specifically, each data set has 1000 complete offline data samples and 1000 online missing data samples. The online missing data samples appear in real time in sequence, and the proposed OnMC algorithm updates the similarity matrix in real time after each online missing data sample arrives. Cosine similarity (Cosine Similarity) is employed as the similarity measure in the embodiments of the present application.

The effectiveness of the model proposed by the application is demonstrated by experiments on the two data sets. The methods used for performance comparison are as follows:

1. RF (Random Forest) method: this method uses a random forest regression algorithm to learn a mapping from the non-missing values of samples to the missing values, so that the missing values of a new sample are predicted from its non-missing values, thereby completing the missing data.

2. KFMC (Kernelized Factorization Matrix Completion) method: this method uses a kernelized-factorization matrix completion algorithm. It factorizes the missing data matrix, further decomposes it using a kernel function, and obtains the factor matrices of the missing data matrix by learning the form of the kernel function, thereby recovering the complete data matrix and completing the missing data.

As shown in fig. 6 and 7, the present invention compares the quality comparison of similarity matrices estimated by different methods at different deletion ratios on both MNIST and PROTEIN datasets. Specifically, the quality or accuracy of the similarity matrix is measured by the Relative Mean-Square Error (RMSE), defined as

wherein ,representing an initial similarity matrix obtained from a preliminary estimate of missing data,/i>Representation ofCorrection similarity matrix obtained by various methods +.>，/>Representing a true similarity matrix. Therefore, the smaller the relative mean square error is, the correction similarity matrix is represented +>Distance real matrix->The smaller the error and thus the higher the quality or accuracy of the similarity matrix. When the relative mean square error is less than 1, the correction similarity matrix is described as +.>Quality of (2) is better than the initial similarity matrix +.>The method comprises the steps of carrying out a first treatment on the surface of the And vice versa. As shown in fig. 6 and 7, the OnMC significantly reduces the relative mean square error compared to the RF and KFMC methods and always maintains the relative mean square error less than 1, indicating the corrective similarity matrix +_ provided by the present invention>The quality of (2) is always better than the initial similarity matrix in each deletion ratio>. As shown in fig. 6, in particular, in the case where the missing proportion on the MNIST data set reaches 80%, the relative mean square error of OnMC is only 0.368, while the relative mean square error of RF and KFMC is 1.313 and 1.257, respectively, and the relative mean square error of OnMC is reduced by 72.0% and 70.7% compared to RF and KFMC, respectively; as shown in FIG. 7, in particular, in the case where the loss ratio on the PROTEIN dataset reaches 80%, the relative mean square error of OnMC is only 0.238, whereas the relative mean square error of RF and KFMC are 0.760 and 0.409, respectively, onMC is subtracted from RF and KFMC, respectively The relative mean square error is 68.7% and 41.8% less. 
It was further observed that the performance gap between OnMC and the comparison algorithms first increased and then decreased as the missing ratio rose from 20% to 80%; the advantage of OnMC was most pronounced at a missing ratio of around 50%, because the comparison algorithms performed relatively worst under this condition. In addition, the correction error of OnMC decreases as the missing ratio increases, which means that the larger the missing ratio, the greater the improvement of the correction similarity matrix obtained by OnMC over the initial similarity matrix. Therefore, the correction method is applicable over a wide range of missing ratios, with a small relative error and high correction accuracy, which demonstrates the generalization and robustness of the method.

Referring to fig. 8 and 9, the present invention measures the time each algorithm takes to process missing data at different missing ratios. Fig. 8 and 9 show the run times on the MNIST dataset and the PROTEIN dataset, respectively. As shown in fig. 8, OnMC has the shortest run time at every missing ratio on the MNIST dataset; at a missing ratio of 20%, OnMC runs in only 4 seconds, 604 times as fast as RF and 7 times as fast as KFMC. As shown in fig. 9, OnMC is also the fastest at every missing ratio on the PROTEIN dataset; at a missing ratio of 20%, OnMC runs in 32 seconds, 85 times as fast as RF and 9 times as fast as KFMC. Because a vector optimization method replaces the matrix optimization method, the size of the variable space is greatly reduced and the solving speed is significantly improved, which enables the similarity matrix to be updated and corrected in real time.

In summary, the online missing data processing method based on the similarity matrix processes online missing data streams and corrects the similarity matrix in real time, greatly reduces the complexity and running time of data processing while maintaining high correction accuracy, improves the real-time responsiveness of the system, and is applicable to various similarity measures, including cosine similarity, the Jaccard coefficient, the Gaussian kernel function and the like.

In correcting the similarity matrix, an optimization approach is adopted that restores the positive semi-definiteness of the similarity matrix while keeping the change to the matrix minimal. Experiments show that, for any missing data and missing ratio, the quality of the similarity matrix is clearly improved, bringing it closer to the true similarity matrix and demonstrating good generalization and robustness.

The method combines matrix theory with optimization methods and equivalently converts the matrix optimization problem into a vector optimization problem, which preserves the solution accuracy while greatly reducing the complexity of the problem, yielding higher correction accuracy and a faster convergence rate. The correction similarity matrix provided by the invention is suitable for a wide range of downstream tasks based on missing data, such as classification, clustering, similarity retrieval, ranking and recommendation, and is applicable to fields such as big data, recommendation systems, social networks and bioinformatics.

Those skilled in the art will appreciate that all or part of the processes of the methods of the above embodiments may be implemented by computer readable instructions stored on a computer readable storage medium, which, when executed, may carry out the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a Read-Only Memory (ROM), or may be a Random Access Memory (RAM).

It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.

Example two

With further reference to fig. 10, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an online missing data processing apparatus based on a similarity matrix, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.

As shown in fig. 10, the online missing data processing apparatus 200 based on the similarity matrix according to the present embodiment includes: a monitoring data acquisition module 210, an initial matrix calculation module 220, a correction matrix acquisition module 230 and a missing filling module 240, wherein:

a monitoring data acquisition module 210, configured to acquire a current monitoring data matrix of the target network in real time;

an initial matrix calculation module 220, configured to calculate an initial similarity matrix $S^{0}$ of the current monitoring data matrix and all offline data when missing data occurs in the current monitoring data matrix, the initial similarity matrix $S^{0}$ being expressed as:

；

a correction matrix acquisition module 230, configured to perform correction processing on the initial similarity matrix $S^{0}$ according to an optimization method to obtain a correction similarity matrix;

and the missing filling module 240 is configured to perform missing filling processing on missing data of the current monitored data matrix according to the correction similarity matrix.

In the embodiment of the application, the initial similarity matrix may be calculated by estimating the similarity between any two data samples with missing values, yielding an initial similarity matrix over all data samples. The offline data form an offline sample set X, where X is a d × n matrix comprising n samples, and each sample $x_i$ has d feature values $x_{i1}, x_{i2}, \dots, x_{id}$. For any two samples $x_i$ and $x_j$, a missing feature value is recorded as NaN. Let I denote the set of dimensions in which both samples have non-NaN feature values; then $x_i^{I}$ and $x_j^{I}$ denote the restrictions of $x_i$ and $x_j$ to the dimensions in I. The similarity of any two samples $x_i$ and $x_j$ with missing data is thus estimated approximately by the similarity of $x_i^{I}$ and $x_j^{I}$. Taking cosine similarity as an example:

$$\cos(x_i, x_j) \approx \frac{(x_i^{I})^{\top} x_j^{I}}{\lVert x_i^{I} \rVert \, \lVert x_j^{I} \rVert}$$
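
The estimate over the commonly observed dimensions can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names `masked_cosine` and `initial_similarity` are hypothetical, and returning 0 when two samples share no observed dimension (or when a restricted vector has zero norm) is an assumption that the text does not specify.

```python
import numpy as np


def masked_cosine(x, y):
    """Cosine similarity restricted to the dimensions observed in both x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    common = ~np.isnan(x) & ~np.isnan(y)      # the index set I
    if not common.any():
        return 0.0                            # assumption: no shared dimensions
    xi, yi = x[common], y[common]
    denom = np.linalg.norm(xi) * np.linalg.norm(yi)
    if denom == 0.0:
        return 0.0                            # assumption: zero restricted vector
    return float(xi @ yi / denom)


def initial_similarity(X):
    """X is a d x n matrix (one sample per column, NaN marks a missing value)."""
    n = X.shape[1]
    S = np.eye(n)                             # self-similarity set to 1
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = masked_cosine(X[:, i], X[:, j])
    return S


x = np.array([1.0, 1.0, np.nan])
y = np.array([1.0, 0.0, 3.0])
sim_xy = masked_cosine(x, y)                  # uses only the first two dimensions
S0 = initial_similarity(np.column_stack([x, y]))
```

Because each pair of samples may use a different index set I, a matrix assembled this way is symmetric but is not guaranteed to be positive semi-definite, which is why a correction step follows.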

In an embodiment of the present application, there is provided an online missing data processing apparatus 200 based on a similarity matrix, including: a monitoring data acquisition module 210, configured to collect a current monitoring data matrix of the target network in real time; an initial matrix calculation module 220, configured to calculate an initial similarity matrix of the current monitoring data matrix and all offline data when missing data occurs in the current monitoring data matrix, wherein each entry of the initial similarity matrix is the estimated similarity between any one data sample of the current monitoring data matrix and any one data sample of all the offline data, computed over I, the dimensions in which both samples simultaneously have non-NaN feature values; a correction matrix acquisition module 230, configured to perform correction processing on the initial similarity matrix according to an optimization method to obtain a correction similarity matrix; and a missing filling module 240, configured to perform missing filling processing on the missing data of the current monitoring data matrix according to the correction similarity matrix. Compared with the prior art, the application can effectively improve the processing efficiency of missing data streams and greatly improve the accuracy of similarity calculation, and has a firm theoretical guarantee as well as good scalability and universality.

In some optional implementations of this embodiment, the correction matrix obtaining module 230 includes: a first corrective processing sub-module, wherein:

a first correction processing sub-module, configured to perform correction processing on the initial similarity matrix $S^{0}$ according to a first optimization target to obtain a correction similarity matrix, wherein the first optimization target is expressed as:

$$\min_{S} \ \lVert S - S^{0} \rVert_F^{2} \quad \text{s.t.} \quad S \succeq 0$$
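
For a symmetric matrix, the Frobenius-norm projection onto the positive semi-definite cone has a well-known closed form: keep the eigenvectors and clip the negative eigenvalues to zero. The Python sketch below assumes that the first optimization target is this plain projection with no further constraints (for instance on the diagonal entries), which the text here does not confirm; `nearest_psd` is a hypothetical name.

```python
import numpy as np


def nearest_psd(s0):
    """Closest positive semi-definite matrix to s0 in the Frobenius norm."""
    s_sym = (s0 + s0.T) / 2.0                 # guard against slight asymmetry
    eigvals, eigvecs = np.linalg.eigh(s_sym)
    clipped = np.clip(eigvals, 0.0, None)     # drop the negative spectrum
    return (eigvecs * clipped) @ eigvecs.T    # V diag(clipped) V^T


# A symmetric matrix with unit diagonal that is not positive semi-definite
# (hypothetical values; its eigenvalues are -0.8, 1.9 and 1.9).
s_init = np.array([[1.0, 0.9, -0.9],
                   [0.9, 1.0, 0.9],
                   [-0.9, 0.9, 1.0]])
s_psd = nearest_psd(s_init)
```

This full eigendecomposition costs on the order of the cube of the matrix size, which gives some intuition for why a cheaper update is attractive when samples arrive one at a time.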

In some optional implementations of this embodiment, the correction matrix obtaining module 230 further includes: a second corrective processing sub-module, wherein:

a second correction processing sub-module, configured to perform correction processing on the initial similarity matrix $S^{0}$ according to a second optimization objective to obtain a correction similarity matrix, wherein the initial similarity matrix $S^{0}$ can be converted into:

$$S^{0} = \begin{bmatrix} S_{\mathrm{off}} & v^{0} \\ (v^{0})^{\top} & c \end{bmatrix}$$

wherein $S_{\mathrm{off}}$ is an n × n matrix calculated from all n offline data; $v^{0}$ is an n × 1 vector representing the initial similarity vector between the missing data and all offline data; $(v^{0})^{\top}$ is a 1 × n vector, the transpose of $v^{0}$; c represents a similarity constant; the second optimization objective is expressed as:

；

wherein $S_{\mathrm{off}}^{-1}$ represents the inverse of the matrix $S_{\mathrm{off}}$.
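
The formula of the second optimization objective is not reproduced in this text, so the following Python sketch rests on an assumption. When the n × n offline block is positive definite and is kept fixed together with the constant c, the bordered matrix is positive semi-definite exactly when the Schur complement, c minus the quadratic form of the vector with the inverse of the offline block, is non-negative. The sketch therefore projects the initial similarity vector onto that ellipsoid in the Euclidean norm. The function name, the variable names and the bisection on the Lagrange multiplier are illustrative choices, not the patented solver.

```python
import numpy as np


def correct_similarity_vector(s_off, v0, c=1.0, tol=1e-12, max_iter=200):
    """Project v0 onto {v : v^T s_off^{-1} v <= c} in the Euclidean norm.

    Assumes s_off is symmetric positive definite and c > 0.
    """
    v0 = np.asarray(v0, dtype=float)
    lam, u = np.linalg.eigh(s_off)            # s_off = u @ diag(lam) @ u.T
    if lam.min() <= 0.0:
        raise ValueError("this sketch requires a positive definite s_off")
    y0 = u.T @ v0                             # coordinates in the eigenbasis

    def quad(mu):
        # v^T s_off^{-1} v at the stationary point for multiplier mu
        return float(np.sum(y0 ** 2 * lam / (lam + mu) ** 2))

    if quad(0.0) <= c:                        # v0 is already feasible
        return v0.copy()

    lo, hi = 0.0, 1.0
    while quad(hi) > c:                       # bracket the multiplier
        hi *= 2.0
    for _ in range(max_iter):                 # bisection; quad is decreasing
        mid = 0.5 * (lo + hi)
        if quad(mid) > c:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, hi):
            break
    return u @ (y0 * lam / (lam + hi))        # hi keeps the constraint satisfied


# Hypothetical 2 x 2 offline block and an infeasible initial vector.
s_off = np.array([[1.0, 0.2], [0.2, 1.0]])
v_init = np.array([0.95, -0.9])
v_corr = correct_similarity_vector(s_off, v_init, c=1.0)
s_new = np.block([[s_off, v_corr[:, None]],
                  [v_corr[None, :], np.ones((1, 1))]])
```

Only an n-dimensional vector is optimized here instead of an (n+1) × (n+1) matrix, which matches the reduction in variable space that the description attributes to the vector formulation.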

In some optional implementations of this embodiment, the online missing data processing apparatus 200 based on a similarity matrix further includes: a matrix update module, wherein:

the matrix update module is configured to repeat the calculation and correction of the initial similarity matrix as missing data arrive sequentially in real time, and to update the correction similarity matrix in real time until the dimension of the correction similarity matrix is finally enlarged from (n+1) × (n+1) to (n+m) × (n+m).
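
The growth of the matrix as samples arrive can be sketched in Python as follows. The helper names `append_sample` and `online_update` are hypothetical, and the optional `correct` callable is a placeholder for whichever correction step is used; when it is omitted the incoming vector is appended unchanged, which is only for illustration.

```python
import numpy as np


def append_sample(s, v, c=1.0):
    """Border a k x k similarity matrix with a length-k vector v and constant c."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    return np.block([[s, v], [v.T, np.array([[c]])]])


def online_update(s_offline, vectors, correct=None, c=1.0):
    """Grow the matrix by one row and column per arriving sample."""
    s = np.array(s_offline, dtype=float)
    for v0 in vectors:
        v0 = np.asarray(v0, dtype=float)
        # Placeholder hook: correct(s, v0, c) should return the corrected vector.
        v = correct(s, v0, c) if correct is not None else v0
        s = append_sample(s, v, c)
    return s


# n = 3 offline samples, m = 2 arriving samples (hypothetical values).
s_start = np.eye(3)
stream = [np.array([0.1, 0.2, 0.3]), np.array([0.0, 0.1, 0.2, 0.3])]
s_final = online_update(s_start, stream)
print(s_final.shape)  # (5, 5), i.e. (n+m) x (n+m)
```

Each arriving vector has one more entry than the previous one because it also holds the similarity to the samples already appended.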

In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 11, fig. 11 is a basic structural block diagram of a computer device according to the present embodiment.

The computer device 300 includes a memory 310, a processor 320, and a network interface 330 communicatively coupled to each other via a system bus. It should be noted that only the computer device 300 having components 310-330 is shown in the figure, but it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculations and/or information processing in accordance with predetermined or stored instructions, the hardware of which includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.

The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.

The memory 310 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disc, etc. In some embodiments, the memory 310 may be an internal storage unit of the computer device 300, such as a hard disk or a memory of the computer device 300. In other embodiments, the memory 310 may also be an external storage device of the computer device 300, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card or the like provided on the computer device 300. Of course, the memory 310 may also include both an internal storage unit and an external storage device of the computer device 300. In this embodiment, the memory 310 is generally used to store an operating system and various application software installed on the computer device 300, such as the computer readable instructions of the online missing data processing method based on a similarity matrix. In addition, the memory 310 may also be used to temporarily store various types of data that have been output or are to be output.

The processor 320 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 320 is generally used to control the overall operation of the computer device 300. In this embodiment, the processor 320 is configured to execute computer readable instructions stored in the memory 310 or process data, for example, computer readable instructions for executing the online missing data processing method based on the similarity matrix.

The network interface 330 may include a wireless network interface or a wired network interface, the network interface 330 typically being used to establish communication connections between the computer device 300 and other electronic devices.

The computer equipment provided by the application can effectively improve the processing efficiency of the missing data stream, greatly improve the accuracy of similarity calculation, and has firm theoretical guarantee and good expandability and universality.

The present application also provides another embodiment, namely, a computer readable storage medium storing computer readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the online missing data processing method based on a similarity matrix as described above.

The computer readable storage medium provided by the application can effectively improve the processing efficiency of the missing data stream, greatly improve the accuracy of similarity calculation, and has firm theoretical guarantee and good expandability and universality.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.

It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application; the drawings show preferred embodiments of the application but do not limit the scope of the patent claims. The application may be embodied in many different forms; these embodiments are provided so that the disclosure will be understood thoroughly and completely. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the technical solutions described in the foregoing embodiments may be modified, or some of their technical features may be replaced by equivalents. All equivalent structures made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the application.

Claims

1. The online missing data processing method based on the similarity matrix is characterized by comprising the following steps of:

collecting a current monitoring data matrix of a target network in real time;

；

performing missing filling processing on missing data of the current monitoring data matrix according to the correction similarity matrix;

wherein the step of performing correction processing on the initial similarity matrix $S^{0}$ according to the optimization method to obtain the correction similarity matrix specifically comprises:

performing correction processing on the initial similarity matrix $S^{0}$ according to a second optimization objective to obtain the correction similarity matrix, wherein the initial similarity matrix $S^{0}$ can be converted into:

；

wherein $S_{\mathrm{off}}^{-1}$ represents the inverse of the matrix $S_{\mathrm{off}}$.

2. The online missing data processing method based on a similarity matrix according to claim 1, wherein the step of performing correction processing on the initial similarity matrix $S^{0}$ according to the optimization method to obtain the correction similarity matrix specifically comprises:

$$\min_{S} \ \lVert S - S^{0} \rVert_F^{2} \quad \text{s.t.} \quad S \succeq 0$$

wherein $S^{0}$ is the initial similarity matrix of the current monitoring data matrix and all the offline data, $\lVert \cdot \rVert_F^{2}$ represents the square of the F-norm, and $S \succeq 0$ represents that the matrix S is positive semi-definite.

3. The online missing data processing method based on a similarity matrix according to claim 1, wherein after the step of performing correction processing on the initial similarity matrix $S^{0}$ according to the optimization method to obtain the correction similarity matrix, the method further comprises:

repeating the calculation and correction of the initial similarity matrix $S^{0}$ as the missing data arrive sequentially in real time, and updating the correction similarity matrix in real time until the dimension of the correction similarity matrix is finally enlarged from (n+1) × (n+1) to (n+m) × (n+m), wherein m represents the final number of online missing data.

4. An online missing data processing device based on a similarity matrix, comprising:

an initial matrix calculation module, configured to calculate an initial similarity matrix $S^{0}$ of the current monitoring data matrix and all offline data when missing data occurs in the current monitoring data matrix, the initial similarity matrix $S^{0}$ being expressed as:

；

the missing filling module is used for carrying out missing filling processing on missing data of the current monitoring data matrix according to the correction similarity matrix;

the correction matrix acquisition module includes:

a second correction processing sub-module, configured to perform correction processing on the initial similarity matrix $S^{0}$ according to a second optimization objective to obtain the correction similarity matrix, wherein the initial similarity matrix $S^{0}$ can be converted into:

$$S^{0} = \begin{bmatrix} S_{\mathrm{off}} & v^{0} \\ (v^{0})^{\top} & c \end{bmatrix}$$

wherein $S_{\mathrm{off}}$ is an n × n matrix calculated from the n offline data; $v^{0}$ is an n × 1 vector representing the initial similarity vector between the missing data and all of the offline data; $(v^{0})^{\top}$ is a 1 × n vector, the transpose of $v^{0}$; c represents a similarity constant; the second optimization objective is expressed as:

；

wherein $S_{\mathrm{off}}^{-1}$ represents the inverse of the matrix $S_{\mathrm{off}}$.

5. The online missing data processing device based on a similarity matrix according to claim 4, wherein the correction matrix acquisition module includes:

a first correction processing sub-module, configured to perform correction processing on the initial similarity matrix $S^{0}$ according to a first optimization target to obtain the correction similarity matrix, wherein the first optimization target is expressed as:

$$\min_{S} \ \lVert S - S^{0} \rVert_F^{2} \quad \text{s.t.} \quad S \succeq 0$$

6. The similarity matrix-based online missing data processing apparatus of claim 4, further comprising:

a matrix update module, configured to repeat the calculation and correction of the initial similarity matrix as the missing data arrive sequentially in real time, and to update the correction similarity matrix in real time until the dimension of the correction similarity matrix is finally enlarged from (n+1) × (n+1) to (n+m) × (n+m), wherein m represents the final number of online missing data.

7. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the online missing data processing method based on a similarity matrix according to any one of claims 1 to 3.

8. A computer readable storage medium, characterized in that it has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the online missing data processing method based on a similarity matrix according to any one of claims 1 to 3.