CN110555051B

CN110555051B - Product test abnormal behavior detection system based on behavior sequence analysis

Info

Publication number: CN110555051B
Application number: CN201810456933.XA
Authority: CN
Inventors: 张贝格; 姜丽红; 蔡鸿明; 叶聪聪; 于晗
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2018-05-14
Filing date: 2018-05-14
Publication date: 2023-04-28
Anticipated expiration: 2038-05-14
Also published as: CN110555051A

Abstract

A system for detecting abnormal behavior in product testing based on behavior sequence analysis, comprising a data preprocessing module, a sequence model construction module, a storage module and a prediction module, wherein: the data preprocessing module acquires and analyzes the quality detection data record file, generates structured data, and outputs the structured data to the sequence model construction module and the storage module respectively; the sequence model construction module calculates the sequence similarity of each group of data, clusters according to the sequence similarity, and outputs the cluster center representing the conventional behavior cluster to the storage module and the prediction module as the conventional behavior model; and the prediction module calculates the offset between any batch of data and the conventional behavior model and detects abnormal behavior by comparing the offset. By analyzing similarity differences within data sequences, the invention establishes a model of conventional data recording behavior and detects abnormalities in the data recording process, thereby evaluating the reliability of product quality detection data.

Description

Product test abnormal behavior detection system based on behavior sequence analysis

Technical Field

The invention relates to a technology in the field of information processing, in particular to a product test abnormal behavior detection system based on behavior sequence analysis.

Background

In the manufacturing industry, an inspector may not actually test a product but instead forge data from a few genuine test results using some strategy, so that the forged data also fall within a reasonable error range. Such false data are difficult to detect, yet they make the quality inspection result unreliable. Existing abnormal behavior detection methods fall into two kinds: when a large number of labels are available, the characteristics of abnormal behaviors are learned and new data are checked for known abnormal behaviors according to those characteristics; when the abnormal pattern cannot be identified and represented by characteristics, a conventional behavior model is established and abnormal behaviors are found by detecting deviations from it. Different inspectors may adopt different strategies when forging data, and when the labeled data set is small it is difficult to build a model for each abnormal behavior.

Disclosure of Invention

In view of the above defects in the prior art, the invention provides a product test abnormal behavior detection system based on behavior sequence analysis, which establishes a model of conventional data recording behavior by analyzing similarity differences within data sequences and detects abnormalities in the data recording process, thereby evaluating the reliability of product quality detection data.

The invention is realized by the following technical scheme:

The invention comprises a data preprocessing module, a sequence model construction module, a storage module and a prediction module, wherein: the data preprocessing module acquires and analyzes the quality detection data record file, generates structured data, and outputs the structured data to the sequence model construction module and the storage module respectively; the sequence model construction module calculates the sequence similarity of each group of data, clusters according to the sequence similarity, and outputs the cluster center representing the conventional behavior cluster to the storage module and the prediction module as the conventional behavior model; and the prediction module calculates the offset between any batch of data and the conventional behavior model and detects abnormal behavior by comparing the offset.

Technical effects

Compared with the prior art, the invention models data recording behavior from quality test data, using the order information of the data and a behavior model analysis method to assess data reliability. Sequence similarity is calculated for different sub-data segments to obtain the highest-similarity subsequence score within each segment; for the same data sequence, differences between the highest sequence similarity scores of different segments can reflect strategy changes during the generation of the data sequence; the internal sequence similarity differences of the data sequences are clustered, and the largest cluster can represent conventional behavior; the reliability of a batch of data can then be predicted from its deviation from the conventional behavior model.

Drawings

FIG. 1 is a schematic diagram of the structure of the present invention;

FIG. 2 is a block diagram of an embodiment of the present invention.

Detailed Description

As shown in FIG. 2, the present embodiment comprises a data preprocessing module, a sequence model construction module, a storage module and a prediction module.

The data preprocessing module acquires quality detection data record files, each of which contains the product inspection results of one batch, analyzes each data file, extracts the data it contains, and converts the sequence of data records into an ordered list, specifically as follows:

each element in the list represents the test result of one product in the batch; data rows with missing values or obvious outliers are then removed through data cleaning; the list is evenly divided into a number of non-overlapping sub-data segments according to the set number of segments; and the structured data are stored in the storage module and simultaneously sent to the sequence model construction module for further processing.
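The cleaning and segmentation step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `clean_and_segment`, the use of a fixed valid range `[low, high]` to decide what counts as an "obvious outlier", and the choice to discard trailing records that do not fill a whole segment are all assumptions.

```python
import math
from typing import List, Optional


def clean_and_segment(records: List[Optional[float]], num_segments: int,
                      low: float, high: float) -> List[List[float]]:
    """Drop missing or out-of-range values, then split the ordered list
    into `num_segments` equal-length, non-overlapping segments."""
    # Assumption: an "obvious outlier" is any value outside [low, high].
    cleaned = [x for x in records
               if x is not None and not math.isnan(x) and low <= x <= high]
    seg_len = len(cleaned) // num_segments
    if seg_len == 0:
        raise ValueError("not enough records for the requested number of segments")
    # Assumption: trailing records that do not fill a whole segment are dropped.
    return [cleaned[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]
```

The original record order is preserved inside every segment, which matters because the later similarity analysis depends on the order in which results were recorded.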

The sequence model construction module receives the structured segment data D, where D denotes the data list obtained after processing a data file and comprises n equal-length sequences d_1, d_2, …, d_n. The sequence model construction module divides each segmented data sequence d_i into subsequences. For each of the four groups of divisions, the two resulting subsequences are locally aligned using a dynamic programming algorithm; after the sequence similarity score matrix has been calculated, the maximum value in the matrix is taken as the maximum-similarity subsequence score s_i of the segment, i.e. the highest sequence similarity within it.

The division takes into account that a data segment pasted after modification may be adjacent to the original data segment or separated from it by an interval, and is therefore carried out in two ways: in the first, the sequence is divided directly in the middle into two sequences a and b; in the second, where the length of the sequence to be divided is L and an interval value gap_length is chosen, the blocks [0, gap_length), [2 × gap_length, 3 × gap_length), … are spliced into one subsequence a_g and the remaining part is spliced into another subsequence b_g.

Considering that the pasted data segment may also be in reverse order relative to the original data segment, one subsequence from each of the two divisions is additionally reversed, giving the reversed versions of b and b_g. Together with the two unreversed divisions, this yields the four groups of subsequence pairs.
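The two divisions and their reversed variants can be sketched as below. This is a hedged illustration: the function names are hypothetical, and it assumes that the second subsequence of each division (b and b_g) is the one that gets reversed and that blocks are indexed from zero in units of `gap_length`.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[List[float], List[float]]


def split_middle(seq: Sequence[float]) -> Pair:
    """Divide the sequence directly in the middle into a and b."""
    mid = len(seq) // 2
    return list(seq[:mid]), list(seq[mid:])


def split_interleaved(seq: Sequence[float], gap_length: int) -> Pair:
    """Splice blocks [0,g), [2g,3g), ... into a_g and the rest into b_g."""
    a_g: List[float] = []
    b_g: List[float] = []
    for start in range(0, len(seq), gap_length):
        block = seq[start:start + gap_length]
        # Even-numbered blocks go to a_g, odd-numbered blocks to b_g.
        (a_g if (start // gap_length) % 2 == 0 else b_g).extend(block)
    return a_g, b_g


def four_pairs(seq: Sequence[float], gap_length: int) -> List[Pair]:
    """Return (a, b), (a_g, b_g) and the same pairs with b, b_g reversed."""
    a, b = split_middle(seq)
    a_g, b_g = split_interleaved(seq, gap_length)
    return [(a, b), (a_g, b_g), (a, b[::-1]), (a_g, b_g[::-1])]
```

Each of the four pairs is then passed to the local alignment step, so a copied block is still found whether it was pasted next to its source, at a distance, or in reverse order.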

The dynamic programming algorithm calculates the sequence similarity score matrix in the following manner

Wherein: A is the sequence similarity score matrix; a and b respectively denote the two subsequences to be compared; s(a_i, b_j) denotes the similarity between the i-th element of sequence a and the j-th element of sequence b; A_ij denotes the highest subsequence similarity score when the two sequences are aligned from front to back up to elements a_i and b_j; c denotes the gap penalty; and n and m are the lengths of sequences a and b respectively. In the calculation, the matrix is first initialized with A_{i,0} = A_{0,j} = 0 (0 ≤ i ≤ n, 0 ≤ j ≤ m). Each subsequent A_{i,j} is then calculated from element scores that have already been computed.
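A recurrence consistent with these symbol definitions can be written as below. It assumes the standard Smith-Waterman form of local alignment with a linear gap penalty c ≥ 0 subtracted per gap position; the zero floor and the sign convention for c are assumptions and are not taken from the patent text.

```latex
A_{i,j} = \max\left\{
  0,\;
  A_{i-1,j-1} + s(a_i, b_j),\;
  A_{i-1,j} - c,\;
  A_{i,j-1} - c
\right\},
\qquad 1 \le i \le n,\; 1 \le j \le m,
\qquad
s_{\max} = \max_{i,j} A_{i,j}
```

Under this form, the zero initialization of the first row and column lets a matching region start anywhere in either subsequence, and the maximum entry of A is the score taken for the segment.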

By introducing a gap penalty mechanism, this algorithm can also find sequences with high similarity in addition to matching identical sequences.

The gap penalty mechanism means the following: for two sequence segments, if one matches the other only after skipping an interval or repeating several elements, that interval is penalized. In the algorithm, each interval of length 1 incurs a penalty of c.
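A local alignment with a gap penalty can be sketched as follows. This is an illustrative Smith-Waterman-style implementation under stated assumptions, not the patent's exact scoring: the element similarity (`match` when two values differ by at most `tol`, otherwise `mismatch`), the default values, and the function name `local_similarity` are all hypothetical.

```python
from typing import Sequence


def local_similarity(a: Sequence[float], b: Sequence[float], c: float = 1.0,
                     tol: float = 0.0, match: float = 1.0,
                     mismatch: float = -1.0) -> float:
    """Return the maximum entry of the local-alignment score matrix A."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at zero: A[i][0] = A[0][j] = 0.
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed element similarity s(a_i, b_j).
            s = match if abs(a[i - 1] - b[j - 1]) <= tol else mismatch
            A[i][j] = max(0.0,
                          A[i - 1][j - 1] + s,  # align a_i with b_j
                          A[i - 1][j] - c,      # skip an element of a
                          A[i][j - 1] - c)      # skip an element of b
            best = max(best, A[i][j])
    return best
```

For example, aligning `[1, 2, 3, 4, 5]` with `[1, 2, 9, 3, 4, 5]` scores five matches minus one gap penalty, so an inserted value lowers the score but does not hide the copied run.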

The clustering is carried out as follows: for each of the four groups of divisions of the data list D obtained after processing a data file, the coefficient of variation and the relative range of all s_i are calculated, namely: coefficient of variation CV = σ/μ and relative range RR = (max - min)/μ, where σ and μ are the standard deviation and mean of the s_i. The four (CV, RR) pairs represent the differences in sequence similarity within each batch of data D. Based on a large amount of test data from different batches, the (CV, RR) values are clustered; the largest cluster represents conventional behavior, and its cluster center is selected as the conventional behavior model and stored in the storage module.
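The two statistics and the selection of the largest cluster can be sketched as below. Several points are assumptions: the population standard deviation is used for σ, the mean of the scores is taken to be positive, and the cluster labels are supplied by whatever clustering algorithm is chosen (the text does not name one). The function names are hypothetical.

```python
import statistics
from collections import Counter
from typing import List, Sequence, Tuple


def cv_rr(scores: Sequence[float]) -> Tuple[float, float]:
    """Coefficient of variation and relative range of the s_i scores."""
    mu = statistics.fmean(scores)       # assumed to be > 0
    sigma = statistics.pstdev(scores)   # assumption: population std. dev.
    return sigma / mu, (max(scores) - min(scores)) / mu


def largest_cluster_center(points: List[Tuple[float, float]],
                           labels: List[int]) -> Tuple[float, float]:
    """Centroid of the most populous cluster, used as the conventional model."""
    biggest, _ = Counter(labels).most_common(1)[0]
    members = [p for p, lab in zip(points, labels) if lab == biggest]
    return (statistics.fmean([p[0] for p in members]),
            statistics.fmean([p[1] for p in members]))
```

A batch whose segments all contain similar amounts of internal repetition gives a small CV and RR, while a batch in which only some segments contain copied runs gives larger values.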

The prediction module receives the four value pairs, calculated by the sequence model construction module, that represent the internal sequence similarity differences of any batch of data, retrieves the conventional behavior model from the storage module, and calculates the distance between each value pair and the model. The distance is mapped into the range 0 to 1 with the tanh function, giving the offset of the batch data from the conventional behavior model expressed as a percentage. The maximum of the four offsets of a batch is compared with a set threshold: if it exceeds the threshold, the quality data of the batch of products are judged to have been falsified during recording; otherwise, they are judged to be free of non-compliant behavior.
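The prediction step can be sketched as follows. The use of Euclidean distance is an assumption (the text only says "distance"), and the function names and the threshold value used in any call are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def offset(pair: Point, model: Point) -> float:
    """Map the distance between a (CV, RR) pair and the model into [0, 1)."""
    # Assumption: Euclidean distance; tanh of a non-negative value is in [0, 1).
    return math.tanh(math.dist(pair, model))


def is_suspicious(pairs: Sequence[Point], model: Point, threshold: float) -> bool:
    """Flag the batch when the largest of its offsets exceeds the threshold."""
    return max(offset(p, model) for p in pairs) > threshold
```

Taking the maximum over the four offsets means a batch is flagged as soon as any one of the four divisions deviates strongly from the conventional behavior model.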

The storage module has a database and a file system. The database stores the processed structured data so that the model can be recalculated if parameters are adjusted; the file system stores the model files obtained during modeling, from which the prediction module retrieves the model to determine the category of a given batch of data and analyze its reliability.

The system detects abnormal behavior as follows: the data preprocessing module reads the product quality data file, analyzes it to obtain the raw test data of a batch of products, transmits the structured data obtained through data cleaning and segmentation to the sequence model construction module, and at the same time stores a copy of the structured data in the storage module; the sequence model construction module calculates the highest internal sequence similarity of each data segment, represents the sequence similarity differences between segments using the coefficient of variation and the relative range, finds through clustering the largest cluster, which represents conventional behavior, takes its cluster center as the conventional behavior model, and outputs the model to the storage module and the prediction module; and the prediction module obtains the reliability of the batch of product data from the deviation between the conventional behavior model and the internal sequence similarity differences of the data under test.

A comparison of the technical indicators and effects of the invention with those of similar products at home and abroad is shown in Table 1.

TABLE 1 comparison of inventive effects

Compared with the prior art, the invention does not need to collect any auxiliary information while product quality data are being gathered; it analyzes the product quality data directly. In the analysis, the data characteristics that a behavior sequence would produce are inferred from the possible irregular behaviors of the data recorder, and a conventional behavior model is established from unlabeled data and compared with the behavior of a new data sequence, so that data falsification is found and the reliability of the data is analyzed. Because the method does not depend on expert opinion and requires no additional information, it addresses the problem of verifying product quality data and offers an approach to reliability analysis for other types of data.

The foregoing embodiments may be partially modified in numerous ways by those skilled in the art without departing from the principles and spirit of the invention, the scope of which is defined in the claims and not by the foregoing embodiments, and all such implementations are within the scope of the invention.

Claims

1. A system for detecting abnormal behavior in product testing based on behavior sequence analysis, comprising a data preprocessing module, a sequence model construction module, a storage module and a prediction module, wherein: the data preprocessing module acquires and analyzes the quality detection data record file, generates structured data, and outputs the structured data to the sequence model construction module and the storage module respectively; the sequence model construction module calculates the sequence similarity of each group of data, clusters according to the sequence similarity, and outputs the cluster center representing the conventional behavior cluster to the storage module and the prediction module as the conventional behavior model; and the prediction module calculates the offset between any batch of data and the conventional behavior model and detects abnormal behavior by comparing the offset;

the analysis refers to: converting the sequence of the data records into an ordered list representation, each element in the list representing a test result of a product in the batch; removing data lines with missing values or obvious abnormal values in the sequence table through data cleaning; uniformly dividing the sequence table into a plurality of non-repeated sub-data segments, namely structured data;

the sequence similarity of each group of data refers to: dividing each segmented data sequence d_i of the data list D obtained after processing the data file into subsequences, performing local sequence alignment on the two subsequences obtained after division using a dynamic programming algorithm, and, after the sequence similarity score matrix has been calculated, taking the maximum value in the matrix as the maximum-similarity subsequence score s_i of the segment, i.e. the highest sequence similarity within it;

the sequence similarity scoring matrix is as follows:

wherein: A is the sequence similarity score matrix; a and b respectively denote the two subsequences to be compared; s(a_i, b_j) denotes the similarity between the i-th element of sequence a and the j-th element of sequence b; A_ij denotes the highest subsequence similarity score when the two sequences are aligned from front to back up to elements a_i and b_j; and c denotes the gap penalty;

the clustering refers to: for each of the four groups of divisions of the data list D obtained after processing the data file, calculating the coefficient of variation and the relative range of all s_i, namely: coefficient of variation CV = σ/μ and relative range RR = (max - min)/μ; using the four (CV, RR) pairs to represent the differences in sequence similarity within each batch of data D; clustering the (CV, RR) values based on a large amount of test data from different batches, wherein the largest cluster represents conventional behavior; and selecting the cluster center of that cluster as the conventional behavior model and storing it in the storage module;

the dividing comprises the following steps:

(1) dividing the sequence directly from the middle into two sequences a, b;

(2) when the length of the sequence to be divided is L, taking an interval value gap_length and splicing [0, gap_length), [2 × gap_length, 3 × gap_length), … into a subsequence a_g, with the remaining part spliced into another subsequence b_g; for each of the two divisions, one subsequence is reversed, giving the reversed versions of b and b_g;

the gap penalty refers to: for two sequence segments, if one matches the other only after skipping an interval or repeating several elements, a penalty is applied to that interval;

the comparing of the offset refers to: the prediction module receives the four value pairs, calculated by the sequence model construction module, that represent the internal sequence similarity differences of any batch of data, retrieves the conventional behavior model from the storage module, calculates the distance between the two, maps the distance into the range 0 to 1 using a tanh function to obtain the offset of the batch of data from the conventional behavior model expressed as a percentage, and selects the maximum of the four offsets of a batch of data for comparison with a set threshold.

2. The system of claim 1, wherein the storage module has a database for storing the processed structured data.

3. An abnormal behavior detection method based on the system of claim 1 or 2, characterized in that a product quality data file is read by the data preprocessing module and analyzed to obtain the raw test data of a batch of products, and the structured data obtained through data cleaning and segmentation are transmitted to the sequence model construction module and stored in the storage module; the sequence model construction module calculates the highest internal sequence similarity of each data segment, represents the sequence similarity differences between segments using the coefficient of variation and the relative range, finds through clustering the largest cluster, which represents conventional behavior, takes its cluster center as the conventional behavior model, and outputs the model to the storage module and the prediction module; and the prediction module obtains the reliability of the batch of product data from the deviation between the conventional behavior model and the internal sequence similarity differences of the data under test.