Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that in the description of embodiments of the present invention, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. The orientation or positional relationship indicated by the terms "upper", "lower", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description and to simplify the description, and are not indicative or implying that the apparatus or elements in question must have a specific orientation, be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
There are also logical relationships among the individual items of construction data. For example, in the relationship between the compactness and the thickness of an asphalt pavement, each total thickness corresponds to one upper-layer thickness and one compactness value, so such relationships exist among data from the same construction scene. However, the currently adopted methods based on statistical tests cannot effectively utilize the association relationships among construction data. In view of this, the present invention provides a neural-network-based method for self-checking construction data. The construction data self-checking method and system provided by the embodiments of the present invention are described below with reference to Figs. 1 to 6.
Fig. 1 is a schematic flow chart of the construction data self-checking method provided by the present invention. As shown in Fig. 1, the method includes, but is not limited to, the following steps:
Step S1: acquiring all construction data in a construction scene, and constructing a construction data matrix;
Step S2: calculating the correlation degree among all data fields in the construction data matrix to obtain a signature matrix;
Step S3: extracting features of the signature matrix by using a pre-trained data anomaly detection model to obtain an output matrix;
Step S4: calculating a mean square error between the output matrix and the signature matrix;
Step S5: determining self-checking results of all construction data in the construction scene according to the mean square error.
For all the construction data collected in the construction scene to be detected, a construction data matrix $L \in \mathbb{R}^{n \times m}$ is constructed, where L is an n × m matrix, n is the number of fields of construction data recorded in the construction data matrix L, m is the field length of each construction data field recorded in L, and $\mathbb{R}^{n \times m}$ denotes the set of real matrices of dimension n × m.
It should be noted that, the construction data self-checking method provided by the invention takes all the construction data as a whole to comprehensively detect whether the acquired data is correct or not in the construction scene.
Furthermore, according to the construction data self-checking method provided by the invention, the correlation among all construction data is considered, so that the correlation among all construction data recorded in the construction data matrix is required to be calculated.
As an alternative embodiment, the correlation may be the Pearson correlation coefficient, and the correlation between the data fields in the construction data matrix is calculated as:
$$\rho_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \qquad \text{(Equation 1)}$$
where X is the recorded data of data field X; if the length of the recorded data is T, then $X \in \mathbb{R}^{1 \times T}$. Y is the recorded data of data field Y, whose length is also T, i.e., $Y \in \mathbb{R}^{1 \times T}$. $\sigma_X$ is the standard deviation of X, $\sigma_Y$ is the standard deviation of Y, and cov(X, Y) is the covariance of X and Y.
After the correlation between the construction data in the construction data matrix is calculated using Equation 1 above, the signature matrix $S_a \in \mathbb{R}^{n \times n}$ is obtained. Each element $S_{xy}$ of the signature matrix is expressed as:
$$S_{xy} = \operatorname{abs}(\rho_{XY}) \qquad \text{(Equation 2)}$$
where the function abs(·) takes the absolute value.
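The calculation of the signature matrix can be illustrated with a minimal sketch in plain Python. The function names `pearson` and `signature_matrix`, the sample values, and the treatment of a constant field (correlation set to 0) are assumptions made for illustration and are not part of the described embodiment.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    t = len(x)
    mean_x = sum(x) / t
    mean_y = sum(y) / t
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / t
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / t)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / t)
    if std_x == 0.0 or std_y == 0.0:
        # Assumption: a constant field is treated as uncorrelated.
        return 0.0
    return cov / (std_x * std_y)


def signature_matrix(fields):
    """Build the n x n matrix of absolute pairwise correlations.

    `fields` is a list of n records, each a list of T numeric values.
    """
    n = len(fields)
    return [[abs(pearson(fields[i], fields[j])) for j in range(n)]
            for i in range(n)]


# Illustrative values only: three fields, each of length T = 5.
total_thickness = [10.0, 12.0, 11.0, 13.0, 14.0]
upper_thickness = [4.0, 5.0, 4.5, 5.5, 6.0]    # rises with total thickness
compactness = [98.0, 96.0, 97.0, 95.0, 94.0]   # falls as total thickness rises

S_a = signature_matrix([total_thickness, upper_thickness, compactness])
```

Because the absolute value is taken, a strongly negative relationship (such as the third field here) produces an entry close to 1, just as a strongly positive one does.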
The signature matrix $S_a$ describes the correlation characteristics of the data fields. To better extract the latent features of the entered construction data, the construction data self-checking method provided by the invention uses the pre-trained data anomaly detection model to further extract features from the signature matrix $S_a$ and obtain the corresponding output matrix $\hat{S}_a$. To a certain extent, the output matrix $\hat{S}_a$ can be regarded as a reconstruction of the signature matrix $S_a$. After the signature matrix $S_a$ associated with the construction data to be detected is input to the data anomaly detection model, the corresponding output matrix $\hat{S}_a$ is obtained. The mean square error between the output matrix $\hat{S}_a$ and the signature matrix $S_a$ can then be calculated and is denoted LossE.
Finally, whether all the construction data in the construction scene are accurate is judged according to the mean square error LossE between the output matrix $\hat{S}_a$ and the signature matrix $S_a$.
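As a hedged illustration of the error calculation, the sketch below computes the mean square error as the average of the squared element-wise differences between two matrices of equal shape. The name `mean_square_error`, the variable `S_hat` for the output matrix, and the 2 × 2 values are hypothetical; the model that would produce the output matrix is not shown.

```python
def mean_square_error(output, signature):
    """Mean of squared element-wise differences of two equal-shape matrices."""
    rows = len(signature)
    cols = len(signature[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            diff = output[i][j] - signature[i][j]
            total += diff * diff
    return total / (rows * cols)


# Hypothetical signature matrix and a hypothetical reconstruction of it.
S_a = [[1.0, 0.8], [0.8, 1.0]]
S_hat = [[0.9, 0.7], [0.6, 1.0]]

loss_e = mean_square_error(S_hat, S_a)  # (0.01 + 0.01 + 0.04 + 0) / 4
```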
According to the construction data self-checking method and system, the association relation among the construction data under the same construction scene is extracted, the data anomaly detection model is constructed, the anomaly data is automatically detected according to the mean square error between the input and the output of the model, false data and invalid data are prevented, the self-checking precision of the construction data is effectively improved, and the self-checking efficiency is greatly improved.
Based on the foregoing embodiment, as an optional embodiment, before the features of the signature matrix are extracted by using the pre-trained data anomaly detection model, the method further includes pre-training the pre-constructed data anomaly detection model by unsupervised learning. The objective function of the unsupervised learning is:
$$\min_{\Theta} \sum_{i=1}^{M} \operatorname{MSE}\left(S_i, \hat{S}_i\right) \qquad \text{(Equation 3)}$$
where $S_i$ is a signature matrix sample, $\hat{S}_i$ is the corresponding output matrix sample, M is the number of training iterations, Θ denotes the parameters of the data anomaly detection model to be trained, and i is an intermediate parameter (the summation index).
Fig. 2 is a second flow chart of the construction data self-checking method provided by the invention. As shown in Fig. 2, before feature extraction is performed by using the data anomaly detection model, the data anomaly detection model must first be pre-trained.
As an alternative embodiment, the invention adopts unsupervised learning with the mean square error as the loss function: the error between the output matrix $\hat{S}_a$ and the signature matrix $S_a$ is calculated and made as small as possible during pre-training. The objective function adopted for pre-training is shown in Equation 3.
Optionally, stochastic gradient descent may be adopted throughout the pre-training process, with the learning rate set to 0.01 and the number of training iterations set to 500 (both may be adjusted according to the actual situation). Training continues until the training result converges, yielding the trained data anomaly detection model.
Then, the obtained signature matrix is input into a trained data anomaly detection model, and an output matrix output by the model is obtained.
And finally, calculating the mean square error between the signature matrix and the output matrix, and taking the mean square error as a self-checking basis of the construction data.
Fig. 3 is a schematic diagram of the data anomaly detection model provided by the present invention. As shown in Fig. 3, the data anomaly detection model is constructed based on a coding and decoding framework, which comprises a coding layer and a decoding layer. The coding layer comprises a convolution layer $L_1$, a pooling layer $L_2$, a fully connected layer $L_3$, a fully connected layer $L_4$, and a fully connected layer $L_5$. The convolution kernel size of the convolution layer $L_1$ is 3 × 3, with 6 output channels and 1 input channel; the kernel size of the pooling layer $L_2$ is 2 × 2; the input dimension of the fully connected layer $L_3$ is (n − 2)/2 rounded down and its output dimension is 64, where n is the number of fields of construction data recorded in the construction data matrix; the input dimension of the fully connected layer $L_4$ is 64 and its output dimension is 32; the input dimension of the fully connected layer $L_5$ is 32 and its output dimension is 8.
Correspondingly, the decoding layer comprises a fully connected layer $L_6$, a fully connected layer $L_7$, a fully connected layer $L_8$, a convolution layer $L_9$, and a pooling layer $L_{10}$. The input dimension of the fully connected layer $L_6$ is 8 and its output dimension is 32; the input dimension of the fully connected layer $L_7$ is 32 and its output dimension is 64; the input dimension of the fully connected layer $L_8$ is 64 and its output dimension is (n − 2)/2 rounded down; the kernel size of the convolution layer $L_9$ is 2 × 2; the kernel size of the pooling layer $L_{10}$ is 3 × 3, with 6 output channels and 1 input channel.
In general, the input of the data anomaly detection model is a signature matrix, a feature vector E is obtained through an encoding layer, then the feature vector E is used as the input of a decoding layer, and finally an output matrix is obtained.
The data operation process of the anomaly detection model specifically comprises the following steps:
First, the signature matrix $S_a \in \mathbb{R}^{n \times n}$ is obtained by calculating the association relationships of the construction data matrix L in the target construction scene.
Then, the signature matrix $S_a$ is used as the input of the convolution layer $L_1$ (convolution kernel 3 × 3). After passing through $L_1$, the output data dimension becomes (n − 3 + 1) × (n − 3 + 1), i.e., (n − 2) × (n − 2).
At the pooling layer $L_2$ (kernel size 2 × 2), the output data dimension is further reduced. A kernel size of 2 × 2 means that one pooling operation is performed on every block of 2 rows and 2 columns (maximum pooling may be employed here), so the data dimension is divided by 2. If the dimension is not divisible by 2, i.e., a row or column is left over that cannot be pooled, that row or column is by default not processed at this step. Therefore, after pooling, the data dimension becomes m′ × m′, where m′ is (n − 2)/2 rounded down.
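The dimension changes produced by the convolution and pooling steps can be checked with a small single-channel sketch in plain Python. This is a simplification under stated assumptions: the described convolution layer has 6 output channels with trained weights, whereas the example below uses one channel and a placeholder averaging kernel, and the function names are hypothetical.

```python
def conv2d_valid(matrix, kernel):
    """Unpadded 2-D sliding-window product sum (as conv layers usually compute)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    k = len(kernel)
    out = []
    for r in range(n_rows - k + 1):
        row = []
        for c in range(n_cols - k + 1):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += matrix[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def max_pool_2x2(matrix):
    """2 x 2 max pooling with stride 2; a leftover odd row or column is dropped."""
    rows = len(matrix) // 2
    cols = len(matrix[0]) // 2
    return [
        [
            max(
                matrix[2 * r][2 * c], matrix[2 * r][2 * c + 1],
                matrix[2 * r + 1][2 * c], matrix[2 * r + 1][2 * c + 1],
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


n = 7  # hypothetical number of fields, chosen so that n - 2 is odd
signature = [[float(r * n + c) for c in range(n)] for r in range(n)]
averaging_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]  # placeholder weights

conv_out = conv2d_valid(signature, averaging_kernel)  # (n - 2) x (n - 2) = 5 x 5
pooled = max_pool_2x2(conv_out)                       # 2 x 2, fifth row/column dropped
```

With n = 7 the convolution output is 5 × 5, and the pooling step keeps only the two complete 2 × 2 blocks along each axis, which matches m′ = (n − 2)/2 rounded down.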
Pooling can be regarded as normalizing the feature-map values within a pooling window and then sampling according to the normalized probability values, so that elements with larger values have a larger probability of being selected.
Then, the data pooled by $L_2$ is passed to the fully connected layer $L_3$, whose input dimension is (n − 2)/2 rounded down.
The input dimension and the output dimension refer to the length of one row of the data. For example, the input data of the fully connected layer $L_3$ is m′ × m′, and its input dimension is specifically the second m′. If data of size 3 × 4 (i.e., 3 rows and 4 columns) is input to a fully connected layer, the input dimension of that fully connected layer is 4.
As can be seen from the above data processing procedure, in the coding layer, after the data passes through the pooling layer, the output data dimension of each layer changes as follows: (n − 2)/2 rounded down → 64 → 32 → 8. That is, the data changes from high dimension to low dimension throughout the coding layer.
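The chain of dimensions in the coding layer can be traced with a few lines of Python. The function name `encoder_dimensions` and the layer labels are illustrative; the sizes follow the layer parameters given above, with the assumption that a leftover row or column is dropped during pooling.

```python
def encoder_dimensions(n):
    """Per-layer output sizes of the coding layer for an n x n signature matrix."""
    after_conv = n - 3 + 1        # 3 x 3 kernel without padding gives n - 2
    after_pool = after_conv // 2  # 2 x 2 pooling, leftover row/column dropped
    return [
        ("L1 convolution", after_conv),
        ("L2 pooling", after_pool),
        ("L3 fully connected", 64),
        ("L4 fully connected", 32),
        ("L5 fully connected", 8),
    ]


# For n = 10 fields the sizes are 8, 4, 64, 32, 8.
dims_for_10_fields = encoder_dimensions(10)
```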
Accordingly, after the feature vector E is input to the decoding layer, the whole data processing process and the data processing process in the encoding layer are symmetrical inverse processes, which is not described in detail herein.
It should be noted that, when convolution is performed, the input data is a single matrix (the number of channels is 1); to represent feature information with higher dimensions, a matrix is generally expanded into a tensor (with multiple channels). For example, if the current input is a 5 × 5 matrix, its dimension can be expressed as 1 × 5 × 5 once the number of channels is included. After convolution, the feature dimension becomes n × m × m, where, in this example, n is the number of output channels set for the convolution layer and the value of m depends on the convolution kernel size.
Convolution is a mathematical operator that generates a third function from two functions f and g; it characterizes the integral, over the overlap length, of the product of the overlapping function values of f and g after g has been flipped and translated.
In addition, the input dimension and the output dimension of each convolution layer in the coding layer and the decoding layer of the anomaly detection model can be set according to the construction scene or the data characteristics of the collected construction data, and are not specifically limited by the present invention.
According to the construction data self-checking method, the signature matrix containing the relevance among all construction data is subjected to feature extraction by constructing the anomaly detection model, so that the self-checking precision can be effectively improved.
Based on the content of the above embodiment, as an alternative embodiment, before determining the self-checking results of all the construction data in the construction scene, determining an error mean value of the normal construction data as a reference value; and determining self-checking results of all construction data in the construction scene according to the mean square error, wherein the self-checking results comprise: under the condition that the mean square error is larger than the reference value, determining the self-checking result as abnormal; and under the condition that the mean square error is not greater than the reference value, determining that the self-checking result is normal.
Specifically, after pre-training, the data anomaly detection model of the present invention is obtained. The invention considers that, after abnormal data is input into the model, the mean square error (denoted LossE) between the obtained output matrix $\hat{S}_a$ and the signature matrix $S_a$ is larger.
Assume that the construction data includes abnormal data $X_E$ of length K. After the corresponding signature matrix is input to the trained data anomaly detection model, the error LossE can be calculated, and LossE will be greater than LossM, where LossM is the error mean of normal construction data. Whether the data is abnormal can therefore be judged by whether LossE is greater than LossM: when LossE is greater than LossM, it can be determined that abnormal data exists in the construction data; when LossE is not greater than LossM, it can be judged that all the construction data are normal.
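The comparison against the reference value can be sketched as follows. The function names and the error values are hypothetical; in practice the errors of normal construction data would come from running the trained model on samples known to be normal.

```python
def reference_value(normal_errors):
    """LossM: the mean reconstruction error over samples known to be normal."""
    return sum(normal_errors) / len(normal_errors)


def self_check(loss_e, loss_m):
    """Return 'abnormal' when LossE exceeds LossM, otherwise 'normal'."""
    return "abnormal" if loss_e > loss_m else "normal"


# Hypothetical reconstruction errors measured on normal construction data.
normal_errors = [0.010, 0.012, 0.008, 0.010]
loss_m = reference_value(normal_errors)

result_a = self_check(0.035, loss_m)  # error well above the reference value
result_b = self_check(0.009, loss_m)  # error below the reference value
```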
Based on the foregoing embodiment, as an alternative embodiment, after pre-training the pre-built data anomaly detection model by using an unsupervised learning manner, the method further includes:
constructing a verification data set which consists of normal verification data generated by an autoregressive system and abnormal data after partial normal verification data are modified; determining the precision and recall of the pre-trained data anomaly detection model by using the verification data set; and determining the credibility of the pre-trained data anomaly detection model according to the precision and recall ratio.
In order to further ensure the precision of construction data self-checking, the construction data self-checking method provided by the invention further verifies the data anomaly detection model by creating a test data set after the pre-training of the model, and if the verification result is qualified, the method is applied to actual detection.
The verification indexes mainly comprise the precision and the recall. The precision is calculated as:
$$\text{Precision} = \frac{TP}{TP + FP}$$
and the recall is calculated as:
$$\text{Recall} = \frac{TP}{TP + FN}$$
where TP is the number of verification samples that are abnormal and are classified as abnormal by the model, FP is the number of verification samples that are normal but are classified as abnormal by the model, and FN is the number of verification samples that are abnormal but are classified as normal by the model.
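A minimal sketch of the two verification indexes, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN), is given below. The function name, the label lists, and the choice to return 0.0 when a denominator is zero are assumptions for illustration.

```python
def precision_recall(actual, predicted):
    """Compute precision and recall, treating True as 'abnormal'."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical verification labels: True means abnormal.
actual = [True, True, True, False, False, True]
predicted = [True, True, False, True, False, True]

# Here TP = 3, FP = 1, FN = 1.
precision, recall = precision_recall(actual, predicted)
```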
The verification data set of the present invention is divided into two types including normal data and abnormal data. Wherein the normal data is simulation data generated based on the following autoregressive system.
An autoregressive system is a system in which the current value of a variable is related to the values of variables at past moments.
As an alternative embodiment, the present invention provides a method for acquiring normal data:
$x_{1,t} = 0.5\,x_{1,t-1} + \varepsilon_{1,t}$;
$x_{2,t} = 0.6\cos(x_{2,t-1}) + \varepsilon_{2,t}$;
$x_{4,t} = 0.8\,x_{7,t-1} + \varepsilon_{4,t}$;
$x_{5,t} = 0.9\,x_{8,t-1} + \varepsilon_{5,t}$;
$x_{7,t} = 2\cos(x_{2,t-1}) + 0.6\sin(x_{10,t-1}) + \varepsilon_{7,t}$;
$x_{8,t} = 0.8\cos(x_{3,t-1}) + \cos(x_{6,t-1}) + 1 + \varepsilon_{8,t}$;
$x_{9,t} = \sin(x_{4,t-1}) - 0.8\,x_{7,t-1} + \varepsilon_{9,t}$;
In total, the values of the variables $x_1$ to $x_{10}$ at time t are obtained in the above embodiment, i.e., 10 normal data $x_{1,t}$ to $x_{10,t}$ are generated.
Further, the abnormal data may be obtained by randomly modifying part of the normal data. For example, modifying the values of any three variables, here $x_1$, $x_4$ and $x_9$, yields the following three abnormal values:
$x'_{1,t} = 0.5\,x_{1,t-1} + \epsilon_{1,t}$;
$x'_{4,t} = 0.8\,x_{7,t-1} + \epsilon_{4,t}$;
$x'_{9,t} = \sin(x_{4,t-1}) - 0.8\,x_{7,t-1} + \epsilon_{9,t}$;
where $\varepsilon$ is Gaussian white noise and $\epsilon$ is a random number between 0 and 1.
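The generation of normal and abnormal verification data can be sketched for the two listed variables that depend only on their own past values, x1 and x2. The remaining variables are omitted because they depend on variables whose update rules are not listed above. The noise standard deviation of 0.1, the initial value of 0, the seeds, and the function names are assumptions made for illustration.

```python
import math
import random


def generate_normal(length, noise_std=0.1, seed=0):
    """Simulate x1 and x2 with Gaussian white noise (assumed std and start value)."""
    rng = random.Random(seed)
    x1, x2 = [0.0], [0.0]
    for _ in range(1, length):
        x1.append(0.5 * x1[-1] + rng.gauss(0.0, noise_std))
        x2.append(0.6 * math.cos(x2[-1]) + rng.gauss(0.0, noise_std))
    return x1, x2


def make_abnormal_x1(x1, seed=1):
    """Replace the Gaussian noise in x1 with a uniform random number in [0, 1)."""
    rng = random.Random(seed)
    out = [x1[0]]
    for t in range(1, len(x1)):
        out.append(0.5 * x1[t - 1] + rng.random())
    return out


x1, x2 = generate_normal(1000)
x1_abnormal = make_abnormal_x1(x1)
```

The abnormal series keeps the same dependence on the previous normal value and differs only in the noise term, which mirrors the modified formula for x′1.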
And constructing a verification data set by using the normal data and the abnormal data, and verifying the trained data abnormal detection model, wherein the obtained verification result is that the precision rate can reach 0.88 and the recall ratio can reach 0.83.
A set of thresholds may be set, such as setting the precision threshold and the recall threshold to 0.8, and only when the precision and recall in the verification result are both greater than the corresponding thresholds, the trained data anomaly detection model is considered to be qualified.
According to the method, under a certain construction scene, the input data which are qualified in inspection are sorted, and an original construction data set is obtained. Assume that there are 10 fields in the dataset, each field being 1000 bytes in length.
Fig. 4 is a schematic diagram of sampling all construction data provided by the present invention. As shown in Fig. 4, the original construction data set is preprocessed as follows: each field of the original construction data set is sampled with length T (T = 45). When sampling, if the starting point of the previous sampling is denoted $s_p$, the starting point of the next sampling is $s_p + 5$ (the starting point of the first sampling is 0).
With this method, 192 construction data samples can be obtained. All samples may be partitioned at a ratio of 7:3 to construct a training sample set and a verification sample set, respectively.
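The sample count follows from the window arithmetic: with a field length of 1000, a window length of 45 and a step of 5, the starting points run from 0 to 955, giving (1000 − 45)/5 + 1 = 192 windows. The short sketch below reproduces this; the function name and the order-preserving 7:3 cut are assumptions, since the way samples are assigned to the two sets is not specified.

```python
def sliding_windows(length, window, stride):
    """Return the (start, end) index pairs of every complete window."""
    return [(s, s + window) for s in range(0, length - window + 1, stride)]


windows = sliding_windows(length=1000, window=45, stride=5)
sample_count = len(windows)  # 192

# 7:3 split using integer arithmetic; keeping the original order is an assumption.
cut = sample_count * 7 // 10
train, validation = windows[:cut], windows[cut:]  # 134 and 58 samples
```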
For the training sample set, a construction data matrix is generated, and the correlation degree among all data fields in the matrix is calculated using Equation 1 to obtain a signature matrix.
The signature matrix is input into the data anomaly detection model to be trained to obtain an output matrix. The data anomaly detection model is then pre-trained using the unsupervised training method corresponding to Equation 3. The training may use stochastic gradient descent, with a learning rate of 0.01 and 500 iterations.
Then, after training of the data anomaly detection model is achieved, it is verified using a verification sample set.
Once the verification result is determined to be qualified, the construction data to be detected can be sampled using the method in the above embodiment to generate a signature matrix, which is then input into the trained data anomaly detection model, and the error LossE is finally calculated. If the LossE value is greater than LossM, the construction data to be detected is abnormal; otherwise, the construction data to be detected is normal.
Fig. 5 is a schematic structural diagram of the construction data self-checking system provided by the present invention. As shown in Fig. 5, the system includes, but is not limited to, an initial data acquisition unit 501, a correlation calculation unit 502, a feature extraction unit 503, an error calculation unit 504, and a self-checking identification unit 505, wherein:
the initial data acquisition unit 501 is mainly used for acquiring all construction data in a construction scene and constructing a construction data matrix;
the correlation calculation unit 502 is mainly used for calculating the correlation degree between each data field in the construction data matrix to obtain a signature matrix;
the feature extraction unit 503 is mainly configured to perform feature extraction on the signature matrix by using a pre-trained data anomaly detection model, so as to obtain an output matrix;
the error calculation unit 504 is mainly configured to calculate a mean square error between the output matrix and the signature matrix;
the self-checking identification unit 505 is mainly configured to determine self-checking results of all construction data under the construction scene according to the mean square error.
It should be noted that, when the construction data self-checking system provided in the embodiment of the present invention is specifically executed, the construction data self-checking system may be implemented based on the construction data self-checking method described in any one of the foregoing embodiments, which is not described in detail in this embodiment.
According to the construction data self-checking system provided by the invention, the association relationships among the construction data in the same construction scene are extracted, a data anomaly detection model is constructed, and abnormal data is detected automatically according to the mean square error between the input and the output of the model. This prevents false data and invalid data from being retained, effectively improves the self-checking precision of the construction data, and greatly improves the self-checking efficiency.
Fig. 6 is a schematic structural diagram of an electronic device according to the present invention, and as shown in fig. 6, the electronic device may include: processor 610, communication interface (Communications Interface) 620, memory 630, and communication bus 640, wherein processor 610, communication interface 620, and memory 630 communicate with each other via communication bus 640. Processor 610 may invoke logic instructions in memory 630 to perform a construction data self-test method comprising: acquiring all construction data in a construction scene, and constructing a construction data matrix; calculating the correlation degree among all data fields in the construction data matrix to obtain a signature matrix; extracting features of the signature matrix by using a pre-trained data anomaly detection model to obtain an output matrix; calculating a mean square error between the output matrix and the signature matrix; and determining self-checking results of all construction data under the construction scene according to the mean square error.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the construction data self-checking method provided by the above methods, the method comprising: acquiring all construction data in a construction scene, and constructing a construction data matrix; calculating the correlation degree among all data fields in the construction data matrix to obtain a signature matrix; extracting features of the signature matrix by using a pre-trained data anomaly detection model to obtain an output matrix; calculating a mean square error between the output matrix and the signature matrix; and determining self-checking results of all construction data under the construction scene according to the mean square error.
In still another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the construction data self-checking method provided in the above embodiments, the method comprising: acquiring all construction data in a construction scene, and constructing a construction data matrix; calculating the correlation degree among all data fields in the construction data matrix to obtain a signature matrix; extracting features of the signature matrix by using a pre-trained data anomaly detection model to obtain an output matrix; calculating a mean square error between the output matrix and the signature matrix; and determining self-checking results of all construction data under the construction scene according to the mean square error.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.