CN112085045B

CN112085045B - Linear trace similarity matching algorithm based on improved longest common substring

Info

Publication number: CN112085045B
Application number: CN202010265484.8A
Authority: CN
Inventors: 潘楠; 沈鑫; 钱俊兵; 赵成俊; 夏丰领; 魏举伦
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2020-04-07
Filing date: 2020-04-07
Publication date: 2022-12-27
Anticipated expiration: 2040-04-07
Also published as: CN112085045A

Abstract

The invention discloses a linear trace similarity matching algorithm based on an improved longest common substring, belonging to the field of trace matching. The similarity matching algorithm comprises comparison calculation and result generation. The comparison calculation is mainly a variance calculation optimized by dynamic programming. The result generation covers two cases: case 1, the N-th longest partial match is output; case 2, all matches meeting the conditions are output, taking the N-th longest match as a reference. The invention focuses on comparing the fine features of traces, is suited to helping users discriminate traces quickly, improves the working efficiency of criminal investigators, and speeds up judging the degree of similarity between two traces or broken ends.

Description

Linear trace similarity matching algorithm based on improved longest common substring

Technical Field

The invention belongs to the field of trace detection, and particularly relates to a linear trace similarity matching algorithm based on an improved longest common substring.

Background

Railways carry a large number of optical cables, leaky cables, signal cables, cables for rail transit vehicles and through ground wires, and the surroundings of airports likewise carry many navigation-aid lighting cables, communication optical cables and the like. Because the inner conductors of most of these cables are made of copper and have a high economic value, they have gradually become a target coveted by criminals. In recent years, cables along high-speed railways and at airports have frequently been cut and stolen, which not only causes losses of tens of millions of yuan to national property but also easily interrupts the power supply, signalling and communication equipment, and can in turn cause more serious accidents.

According to statistics, the linear traces left on the cut surface when criminals sever a cable with pliers-type cutting tools such as wire clippers, cable shears and breaking pliers are the most common traces at crime scenes. Such a linear trace reflects the external morphology of the contacting part of the cutting tool, is hard to damage, occurs frequently and has high identification value, so it is very important for investigators in determining the nature of a case, identifying the tool used and further verifying the suspect.

Compared with the traditional approach of observing under a microscope and manually comparing morphological features, the image recognition and three-dimensional scanning technologies that have emerged in recent years offer new solutions for the non-destructive quantitative examination of linear traces.

These methods achieve automatic matching of line marks to some extent, but still have the following problems.

1) The comparison method based on the picture has higher requirements on the photographing equipment, and the inconsistency of light reflection, photographing angle and focusing directly causes distortion of original data, so that the robustness of an analysis result is easily reduced;

2) Although three-dimensional scanning is stable and reflects the detailed features of linear traces more faithfully, the detection hardware is expensive, and the computational load grows geometrically because the resulting files are very large;

3) Existing matching methods must distinguish class characteristics and subclass characteristics while also distinguishing the individual characteristics of a tool. The linear traces they have been tested on, such as prying traces and cartridge-case firing traces, are regular in form and have highly individual, measurable features, so their accuracy on pincer-shear linear traces, which are complex in form and highly random, is limited.

Disclosure of Invention

The state matrix generated by dynamic programming contains multiple matching results. Besides outputting the basic longest substring, all candidate results in a matching region can therefore be obtained for different minimum matching lengths from a single computation, which makes the output more complete.

To achieve this purpose, the invention adopts the following technical scheme. The matching algorithm is applied to criminal investigation, bullet trace detection and other scenarios requiring trace comparison, and comprises (1) model training, (2) comparison calculation and (3) result generation. (1) The model training uses a graph convolutional neural network. (2) The comparison calculation is mainly a variance calculation optimized by dynamic programming. (3) The result generation covers two cases: case 1, the N-th longest partial match is output; case 2, all matches meeting the conditions are output, taking the N-th longest match as a reference.

Preferably, step (1) comprises: 1) establishing a training set; 2) adjusting parameters and establishing a graph convolutional neural network model; and 3) feeding in the data to be tested to obtain a similarity calculation result;

2) The specific method for adjusting parameters and establishing the graph convolution neural network model is that G = (V, E). V represents a node set, namely

E represents a set of edges, i.e.

The training model consists of two parts: 1) a GCN component, responsible for sampling the information of all nodes in the K-order neighborhood; and 2) an autoencoder (AE) component, used to extract hidden features from the activation value matrix A learned by the GCN component and, in combination with Laplacian Eigenmaps (LE), to preserve the cluster structure of the nodes. In the training model, the GCN component uses a graph convolutional neural network, centered on a node v_i, to sample the structure and feature information of all nodes within K steps, that is, to encode the K-order neighborhood information, and, trained with the node labels, generates the activation value matrix A used as input to the autoencoder component. Through supervised learning based on node labels, the GCN encodes the local structure and the feature information of the network at the same time, and omits the secondary structure information outside the K-order neighborhood that has little influence on the generated low-dimensional node vectors. The autoencoder takes the activation value matrix A learned by the GCN as input, further extracts feature information from A by unsupervised learning, and, combined with the Laplacian eigenmap, maps the original network to a lower-dimensional space.
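The idea of a K-order neighborhood can be illustrated with a minimal sketch. It only shows which nodes lie within K steps of a center node, using a breadth-first search over a hypothetical adjacency list. The function name `k_hop_neighborhood` and the toy graph are assumptions made for illustration and are not part of the patent, and the sketch does not reproduce the GCN component itself.

```python
from collections import deque


def k_hop_neighborhood(adjacency, center, k):
    """Return {node: hop distance} for every node within k steps of center."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond the K-order neighborhood
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical toy graph: a path 0-1-2-3-4 plus an extra edge 1-5.
graph = {0: [1], 1: [0, 2, 5], 2: [1, 3], 3: [2, 4], 4: [3], 5: [1]}
neighborhood = k_hop_neighborhood(graph, center=0, k=2)
```

With K = 2 and node 0 as the center, nodes 3 and 4 fall outside the neighborhood and would be ignored when encoding node 0.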

Preferably, the comparison calculation uses a dynamic programming algorithm that combines and improves the longest common substring algorithm (LC-Substring) and the longest common subsequence algorithm (LC-Subsequence). Dynamic programming is chosen as the basic method because the state matrix it generates contains multiple matching results: besides the basic longest substring/subsequence, all candidate results in a matching region can be obtained for different minimum matching lengths from a single computation, which makes the output more complete.

Preferably, the dynamic programming used in the comparison calculation follows the same steps as LC-Substring and LC-Subsequence (hereinafter both abbreviated as LCS) and proceeds in row or column order; the main improvement of the algorithm is that the decision rule for state transitions adopts a local-minimum-difference criterion.
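For reference, the unmodified LC-Substring dynamic programme that the method builds on can be sketched as follows. This is the textbook exact-match version, not the patented variant. It is included only to show how a single state matrix yields every candidate match above a chosen minimum length. The names `lc_substring_table` and `candidates` are illustrative assumptions.

```python
def lc_substring_table(s, t):
    """Classic longest-common-substring table with an extra row and column 0."""
    n, m = len(s), len(t)
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                # Extend the run that ended at the diagonal predecessor.
                length[i][j] = length[i - 1][j - 1] + 1
    return length


def candidates(s, t, min_len):
    """All maximal common substrings of at least min_len, read from one table."""
    table = lc_substring_table(s, t)
    n, m = len(s), len(t)
    found = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            run = table[i][j]
            # A run is maximal when it is not continued on the diagonal.
            extended = i < n and j < m and table[i + 1][j + 1] > 0
            if run >= min_len and not extended:
                found.append((run, s[i - run:i]))
    return sorted(found, reverse=True)


matches = candidates("abcdxyz", "xyzabcd", min_len=3)
```

The table is filled once; changing `min_len` only changes which entries are read out, which is the property the text relies on.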

Preferably, in the variance calculation optimized by dynamic programming, once the two-dimensional state matrix has been established, the locally optimal solution at the current position is computed with the dynamic equation in a fixed order (by rows or by columns). Instead of recomputing the mean at each position and then solving for the variance, the following formula, which fits dynamic programming well, is used:

$D = E(X^2) - E(X)^2$  (3)

With this approach, only the following matrices, sized according to the sequence lengths, need to be added:

D(0..N, 0..M), which stores the variance of the current optimal result;

SumEx(0..N, 0..M), which stores the sum of the difference values of the current optimal result;

SumExPower2(0..N, 0..M), which stores the sum of the squares of the difference values of the current optimal result;

len(0..N, 0..M), which stores the length of the current optimal result, where X is the difference amplitude between S and T at the corresponding position;

Way(0..N, 0..M), which stores the relation between the current position and the previous match;

the variance of each (i, j) location can thus be calculated as follows:

The SumEx, SumExPower2 and len matrices need to be updated only once per state transition according to the transfer equation, which avoids a large amount of repeated calculation.
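Read together with formula (3), the matrix definitions suggest that the variance at a position can be obtained from the running sums as SumExPower2/len - (SumEx/len)^2. The extracted text does not preserve the per-position formula, so this reading is an assumption. The sketch below shows the constant-time update for a single chain of matched positions. The function name `extend_match` and the sample values are hypothetical.

```python
import statistics


def extend_match(sum_ex, sum_ex_power2, length, x):
    """Extend a running match by one difference value x in O(1)."""
    sum_ex += x
    sum_ex_power2 += x * x
    length += 1
    mean = sum_ex / length
    # D = E(X^2) - E(X)^2, evaluated from the running sums only.
    variance = sum_ex_power2 / length - mean * mean
    return sum_ex, sum_ex_power2, length, variance


# Hypothetical difference amplitudes along one diagonal of the state matrix.
differences = [1.0, 2.0, 3.0, 4.0]
sum_ex, sum_ex_power2, length = 0.0, 0.0, 0
for x in differences:
    sum_ex, sum_ex_power2, length, variance = extend_match(
        sum_ex, sum_ex_power2, length, x
    )
```

The result agrees with the population variance computed from scratch, for example by `statistics.pvariance(differences)`, without revisiting earlier positions.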

Preferably, the comparison calculation has a strict mode, a non-strict mode and an adaptive mode. The overall flow of the algorithm in the LC-Substring strict mode is as follows:

Step 1: according to sequence S and sequence T, initialize the five matrices D, SumEx, SumExPower2, len and Way and set all their data to null. The extra row 0 and column 0 hold the initial state and must be present. At the same time, set a threshold maxDifference according to the user's expected matching error tolerance.

Step 2: traverse by rows or by columns and perform the dynamic programming calculation. Traversing by rows, start from the first row and compute position (1, 1) (row 1, column 1), then (1, 2) (row 1, column 2), and so on; traversing by columns is analogous. Execute step 3 at each position. Once every position has been processed, the comparison calculation ends.

Step 3: for the current position (i, j), first look up the predecessor variance D(i-1, j-1) and compare it with maxDifference. If D(i-1, j-1) is smaller than maxDifference, continue with step 4; otherwise leave the state equation unchanged and go directly to step 5.

Step 4: the variance at the predecessor position is below the tolerance, so update the state equation. If the previous position is part of a match, i.e. len(i-1, j-1) > 0, perform the following update:

The main purpose of this update is to obtain the minimum difference (in other words, the maximum similarity) achievable at the current position (i, j); len and Way are updated to facilitate the subsequent backtracking.

If, however, the previous position is not part of a match, i.e. len(i-1, j-1) = 0, the current position is taken as a starting point, matching begins, and the following update is performed:

In either case, whether or not the previous position was matched, proceed to step 5.

Step 5: update the degree of difference at the current position, then return to step 2:
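The five steps above can be sketched as follows. The update equations of steps 4 and 5 are not preserved in the extracted text, so the ones used here are assumptions inferred from the matrix descriptions: extending a match adds the current difference to the running sums, starting a match resets them, and step 5 evaluates formula (3). The difference amplitude is taken as the absolute difference, `Way = 1` is taken to mean "continues the diagonal predecessor", and zeros stand in for the null initial state. The name `strict_mode_tables` and the parameter `max_difference` (the user-set tolerance threshold) are illustrative.

```python
def strict_mode_tables(s, t, max_difference):
    """Fill D, SumEx, SumExPower2, len and Way row by row (assumed updates)."""
    n, m = len(s), len(t)

    def table(fill):
        # Row 0 and column 0 are the extra initial-state entries (step 1).
        return [[fill] * (m + 1) for _ in range(n + 1)]

    d, sum_ex, sum_ex2 = table(0.0), table(0.0), table(0.0)
    length, way = table(0), table(0)

    for i in range(1, n + 1):                      # step 2: traverse by rows
        for j in range(1, m + 1):
            x = abs(s[i - 1] - t[j - 1])           # assumed difference amplitude
            if d[i - 1][j - 1] < max_difference:   # step 3: predecessor check
                if length[i - 1][j - 1] > 0:       # step 4: extend the match
                    sum_ex[i][j] = sum_ex[i - 1][j - 1] + x
                    sum_ex2[i][j] = sum_ex2[i - 1][j - 1] + x * x
                    length[i][j] = length[i - 1][j - 1] + 1
                    way[i][j] = 1
                else:                              # step 4: start a new match
                    sum_ex[i][j] = x
                    sum_ex2[i][j] = x * x
                    length[i][j] = 1
            if length[i][j] > 0:                   # step 5: update the variance
                mean = sum_ex[i][j] / length[i][j]
                d[i][j] = sum_ex2[i][j] / length[i][j] - mean * mean
    return d, sum_ex, sum_ex2, length, way


# Hypothetical profile compared with itself: the main diagonal matches fully.
profile = [3.0, 7.0, 1.0, 9.0]
d, sum_ex, sum_ex2, length, way = strict_mode_tables(profile, profile, 1.0)
```

Because every running sum at (i, j) depends only on (i-1, j-1), each cell is computed once, which is the saving the text attributes to the SumEx, SumExPower2 and len matrices.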

Preferably, the steps of the non-strict mode coincide with steps 1, 2 and 5 of the LC-Substring strict mode, while the dynamic equations and transfer rules of steps 3 and 4 differ slightly, as follows:

(1) According to sequence S and sequence T, initialize the five matrices D, SumEx, SumExPower2, len and Way and set all their data to null. To keep the data valid, the extra row 0 and column 0 hold the initial state and must be present. At the same time, set a threshold maxDifference according to the user's expected matching error tolerance;

(2) Traverse by rows or by columns and perform the dynamic programming calculation. Traversing by rows, start from the first row and compute position (1, 1) (row 1, column 1), then (1, 2) (row 1, column 2), and so on; traversing by columns is analogous. Execute step 3 at each position. Once every position has been processed, the comparison calculation ends.

(3) Step 5 mainly updates the degree of difference at the current position and returns to step 2 after the calculation:

Preferably, in the result generation, case 1 rests on the fact that the N-th longest result obtained by the algorithm is in general accurate and continuous, so the operator can see the approximate maximum overlapping part of the two traces; in case 2, besides outputting the N-th longest match, the matches that meet the conditions are also output, which alleviates the broken identification caused by abrupt changes in a trace in some situations.

The invention has the beneficial effects that:

The patterns obtained by laser scanning of the sheared sample of the tool trace were compared to determine whether there was a coincidence between the two, here identified as 1.

In the case where no tool type is available, the map is compared to the part of the database that has yet to be run in parallel to determine the degree of similarity, providing clues, here 1.

Drawings

FIG. 1 is a basic flow chart of a comparative calculation;

FIG. 2 is a flow chart of result generation;

FIG. 3 is a diagram of simulation data matching test results;

FIG. 4 is a graph of a 60% overlap ratio actual test data match test;

FIG. 5 is a graph showing the data matching test for the actual detection of 30% overlap ratio;

FIG. 6 is a graph showing the data match test of the actual detection of 45% overlap ratio.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings and examples, which are not intended to limit the present invention.

As shown in figs. 1-3, the matching algorithm is applied to criminal investigation, bullet trace detection and other scenarios requiring trace comparison. The similarity matching algorithm comprises (1) model training, (2) comparison calculation and (3) result generation. (1) The model training uses a graph convolutional neural network. (2) The comparison calculation is mainly a variance calculation optimized by dynamic programming. (3) The result generation covers two cases: case 1, the N-th longest partial match is output; case 2, all matches meeting the conditions are output, taking the N-th longest match as a reference.

E represents a set of edges, i.e.

The GCN component, centered on a node v_i, samples the structure and feature information of all nodes within K steps, that is, it encodes the K-order neighborhood information, and, trained with the node labels, generates the activation value matrix A used as input to the autoencoder component. Through supervised learning based on node labels, the GCN encodes the local structure and the feature information of the network at the same time, and omits the secondary structure information outside the K-order neighborhood that has little influence on the generated low-dimensional node vectors. The autoencoder takes the activation value matrix A learned by the GCN as input, further extracts feature information from A by unsupervised learning, and, combined with the Laplacian eigenmap, maps the original network to a lower-dimensional space.

The GCN component and the AE component are linearly combined using the stacking method from ensemble learning and trained with the training set, so that the low-dimensional node vectors produced by the whole model retain both the feature information and the structure of the nodes. The loss functions of the two components are controlled by two hyper-parameters, α and β,

wherein, the loss function of the node sampling component is as follows:

α is the weight of the node sampling component loss function.

The loss function of the self-encoder component AE is:

β is the weight of the loss function of the autoencoder component AE.

Finally, the loss function of the training model is defined as:

where y_i is the true label of the node,

is the predicted label from the GCN,

is the activation value matrix, K is the neighborhood order of node v_i,

is the reconstructed activation value matrix,

is the hidden-layer representation of the l-th layer of the autoencoder AE, and L is the number of hidden layers of AE.

The model is trained with the TensorFlow framework, accelerated on a graphics card (GPU). For model optimization, the model parameters are updated with the AdamOptimizer provided by TensorFlow, which improves on traditional gradient descent by using momentum (moving averages of the gradients) and facilitates dynamic adjustment of the hyper-parameters, allowing the model to be trained quickly and efficiently. The model parameters are updated on only one batch at a time, which further reduces memory usage during model training.
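To make the role of the moving averages concrete, the following is a minimal pure-Python sketch of the standard Adam update rule applied to a one-parameter quadratic. It is not TensorFlow's implementation and not the training code of the invention. The function name `adam_minimize`, the toy objective and all default values are assumptions chosen for illustration.

```python
import math


def adam_minimize(grad, theta, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise a scalar function given its gradient, using the Adam rule."""
    m = 0.0  # moving average of the gradient (first moment)
    v = 0.0  # moving average of the squared gradient (second moment)
    for step in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** step)  # bias correction for early steps
        v_hat = v / (1 - beta2 ** step)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0, steps=500)
```

In a mini-batch setting, `grad` would be evaluated on one batch per step, which matches the description of updating the parameters on only one batch at a time.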

The comparison calculation uses a dynamic programming algorithm that combines and improves the longest common substring algorithm (LC-Substring) and the longest common subsequence algorithm (LC-Subsequence). Dynamic programming is chosen as the basic method because the state matrix it generates contains multiple matching results: besides the basic longest substring/subsequence, all candidate results in a matching region can be obtained for different minimum matching lengths from a single computation, which makes the output more complete.

Preferably, the dynamic programming used in the comparison calculation follows the same steps as LC-Substring and LC-Subsequence (hereinafter both abbreviated as LCS) and proceeds in row or column order; the main improvement of the algorithm is that the decision rule for state transitions adopts a local-minimum-difference criterion.

Preferably, in the variance calculation optimized by dynamic programming, once the two-dimensional state matrix has been established, the locally optimal solution at the current position is computed with the dynamic equation in a fixed order (by rows or by columns). Instead of recomputing the mean at each position and then solving for the variance, the following formula, which fits dynamic programming well, is used:

$D = E(X^2) - E(X)^2$  (3)

D(0..N, 0..M), which stores the variance of the current optimal result;

SumExPower2(0..N, 0..M), which stores the sum of the squares of the difference values of the current optimal result;

Way(0..N, 0..M), which stores the relation between the current position and the previous match;

the variance of each (i, j) position can thus be calculated as follows:

The SumEx, SumExPower2 and len matrices need to be updated only once per state transition according to the transfer equation, which avoids a large amount of repeated calculation.

Preferably, the comparison calculation has a strict mode, a non-strict mode and an adaptive mode. The overall flow of the algorithm in the LC-Substring strict mode is as follows:

Step 1: according to sequence S and sequence T, initialize the five matrices D, SumEx, SumExPower2, len and Way and set all their data to null. The extra row 0 and column 0 hold the initial state and must be present. At the same time, set a threshold maxDifference according to the user's expected matching error tolerance.

Step 4: the variance at the predecessor position is below the tolerance, so update the state equation. If the previous position is part of a match, i.e. len(i-1, j-1) > 0, perform the following update:

If, however, the previous position is not part of a match, i.e. len(i-1, j-1) = 0, the current position is taken as a starting point, matching begins, and the following update is performed:

Preferably, the steps of the non-strict mode coincide with steps 1, 2 and 5 of the LC-Substring strict mode, while the dynamic equations and transfer rules of steps 3 and 4 differ slightly, as follows:

(1) According to sequence S and sequence T, initialize the five matrices D, SumEx, SumExPower2, len and Way and set all their data to null. To keep the data valid, the extra row 0 and column 0 hold the initial state and must be present. At the same time, set a threshold maxDifference according to the user's expected matching error tolerance;

(2) Traverse by rows or by columns and perform the dynamic programming calculation. Traversing by rows, start from the first row and compute position (1, 1) (row 1, column 1), then (1, 2) (row 1, column 2), and so on; traversing by columns is analogous. Execute step 3 at each position. Once every position has been processed, the comparison calculation ends.

Preferably, in the result generation, case 1 rests on the fact that the N-th longest result obtained by the algorithm is in general accurate and continuous, so the operator can see the approximate maximum overlapping part of the two traces; in case 2, besides outputting the N-th longest match, the matches that meet the conditions are also output, which alleviates the broken identification caused by abrupt changes in a trace in some situations.

Example 1:

Outputting the N-th longest partial match is typically used for fast matching. In general, the N-th longest result obtained by the algorithm is accurate and continuous, and it shows the approximate (strictest) maximum overlapping part of the two traces. The implementation steps are as follows:

The position holding the N-th largest value in the len matrix represents the N-th longest match, because len stores the length of the longest continuous match of S and T at the corresponding position.

First, mark position (i, j) as the end position, so that position i of sequence S and position j of sequence T are the end of the match, and then proceed to step 3.

Each time the backtracking reaches a position (x, y), output and move according to the state of Way(x, y):

Step 3 passes through the state Way(x, y) = 1 a total of len(i, j) - 1 times and outputs the corresponding elements of S and T on each pass; this is compatible with all modes of the comparison algorithm.

The coordinate at which step 3 ends is marked as the start position. The end position and this start position are then returned together with all the output elements of S and T, which form the longest matching sequence of the two.
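The backtracking described in this example can be sketched as follows, under stated assumptions: the Way states are not preserved in the extracted text, so `Way = 1` is taken to mean "continue to the diagonal predecessor" and any other value to mean "the match starts here". Only positions where a run ends are ranked, so that a shorter prefix of the same run is not reported as a separate match. The function name `nth_longest_match` and the hand-built toy tables are hypothetical.

```python
def nth_longest_match(length, way, s, t, n_rank=1):
    """Trace back the n_rank-th longest match from the len and Way tables."""
    rows, cols = len(length), len(length[0])
    ends = []
    for i in range(1, rows):
        for j in range(1, cols):
            continued = i + 1 < rows and j + 1 < cols and way[i + 1][j + 1] == 1
            if length[i][j] > 0 and not continued:
                ends.append((length[i][j], i, j))
    ends.sort(reverse=True)

    _, i, j = ends[n_rank - 1]
    end = (i, j)                       # end position of the match
    s_out, t_out = [s[i - 1]], [t[j - 1]]
    while way[i][j] == 1:              # executed len(i, j) - 1 times
        i, j = i - 1, j - 1
        s_out.append(s[i - 1])
        t_out.append(t[j - 1])
    start = (i, j)                     # start position of the match
    return start, end, s_out[::-1], t_out[::-1]


# Hand-built toy tables: S[1..3] matches T[2..4] along one diagonal.
S, T = [5, 6, 7], [9, 5, 6, 7]
len_table = [[0, 0, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 2, 0], [0, 0, 0, 0, 3]]
way_table = [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1]]
start, end, s_match, t_match = nth_longest_match(len_table, way_table, S, T)
```

The function returns the start position, the end position and the matched elements of both sequences, mirroring the outputs listed in the final step of the example.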

Simulation test

Eight samples were used in total: data0 is the reference sample, and data 1-100 coincide with data0 at different positions, with a coincidence amplitude of about 30%. The tolerance value used in the test is 1, and the test results are shown in fig. 3:

actual testing

Different broken ends produced by actual shearing were examined separately. The coinciding positions are difficult to distinguish with the naked eye, and the corresponding positions also differ considerably. The actual test was run by program with a test tolerance of 4. The broken ends were divided into 100 groups, all tested for coincidence in the same way, with the maximum coincidence of each group being about 30% or 60%. The matching results are shown in figs. 4 and 5, and the matching accuracy reaches 85%.

The shearing difficulty was then increased further: the broken ends were again produced by actual shearing, but with greater randomness and larger variation, the overlap ranging from 30% to 60%, with the other conditions unchanged. The matching result is shown in fig. 6, and the matching accuracy reaches 80%.

Finally, it should be noted that the above examples are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present disclosure has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that modifications to the embodiments of the disclosure, or equivalent substitutions of some of the technical features, may still be made; provided they do not depart from the spirit of the present disclosure, they shall fall within the scope of the present disclosure as claimed.

Claims

1. A linear trace similarity matching algorithm based on an improved longest common substring, characterized in that: the matching algorithm is applied to criminal investigation, bullet trace detection and other scenarios requiring trace comparison, and comprises (1) model training, (2) comparison calculation and (3) result generation; (1) the model training uses a graph convolutional neural network; (2) the comparison calculation is mainly a variance calculation optimized by dynamic programming; and (3) the result generation covers two cases: case 1, the N-th longest partial match is output; case 2, all matches meeting the conditions are output, taking the N-th longest match as a reference;

in the variance calculation optimized by dynamic programming, after the two-dimensional state matrix is established, the locally optimal solution at the current position is computed with the dynamic equation in a fixed order, by rows or by columns; instead of recomputing the mean at each position and then solving for the variance, the following formula, which fits dynamic programming well, is used:

$D = E(X^2) - E(X)^2$  (3)

D(0..N, 0..M), which stores the variance of the current optimal result;

the variance of each (i, j) location can thus be calculated as follows:

the SumEx, SumExPower2 and len matrices need to be updated only once per state transition according to the transfer equation, which avoids a large amount of repeated calculation;

the comparison calculation comprises a strict mode, a non-strict mode and an adaptive mode; the overall flow of the algorithm in the LC-Substring strict mode is as follows:

step 2, traversing by rows or by columns and performing the dynamic programming calculation, wherein, traversing by rows, the calculation starts from the first row with position (1, 1) (row 1, column 1), then (1, 2) (row 1, column 2), and so on, and traversing by columns is analogous; step 3 is executed at each position, and once all positions have been processed the comparison calculation ends;

the main purpose of the update is to obtain the minimum difference achievable at the current position (i, j), and len and Way are updated to facilitate the subsequent backtracking;

the steps of the non-strict mode coincide with steps 1, 2 and 5 of the LC-Substring strict mode, while the dynamic equations and transfer rules of steps 3 and 4 differ slightly, as follows:

(2) traversing by rows or by columns and performing the dynamic programming calculation, wherein, traversing by rows, the calculation starts from the first row with position (1, 1) (row 1, column 1), then (1, 2) (row 1, column 2), and so on, and traversing by columns is analogous; step 3 is executed at each position, and once all positions have been processed the comparison calculation ends; (3) step 5 mainly updates the degree of difference at the current position and returns to step 2 after the calculation:

2. The linear trace similarity matching algorithm based on the improved longest common substring of claim 1, wherein: step (1) comprises 1) establishing a training set, 2) adjusting parameters and establishing a graph convolutional neural network model, and 3) feeding in the data to be tested to obtain a similarity calculation result;

2) The specific method for adjusting parameters and establishing the graph convolution neural network model is that G = (V, E), V represents a node set, namely

E represents a set of edges, i.e.

The training model consists of two parts: 1) a GCN component, responsible for sampling the information of all nodes in the K-order neighborhood; and 2) an autoencoder AE component, used to extract hidden features from the activation value matrix A learned by the GCN component and, in combination with the Laplacian eigenmap LE, to preserve the cluster structure of the nodes; in the training model, the GCN component uses a graph convolutional neural network, centered on a node v_i, to sample the structure and feature information of all nodes within K steps, that is, to encode the K-order neighborhood information, and, trained with the node labels, generates the activation value matrix A used as input to the autoencoder component; through supervised learning based on node labels, the GCN encodes the local structure and the feature information of the network at the same time, and omits the secondary structure information outside the K-order neighborhood that has little influence on the generated low-dimensional node vectors; the autoencoder takes the activation value matrix A learned by the GCN as input, further extracts feature information from A by unsupervised learning, and, combined with the Laplacian eigenmap, maps the original network to a low-dimensional space.

3. The algorithm for matching line-type trace similarity based on the improved longest common substring of claim 1, wherein: the comparison calculation adopts a dynamic programming algorithm combining and improving the longest common Substring algorithm LC-Substring and the longest common Subsequence LC-Subsequence, and the algorithm adopts dynamic programming as a basic method because a state matrix generated by the dynamic programming method contains a plurality of matching results, so that the output of the basic longest Substring or Subsequence can be realized, and simultaneously, all candidate results can be obtained in a matching area according to different minimum matching lengths under the condition of only once calculation, thereby being beneficial to perfecting the output of the results.

4. The linear trace similarity matching algorithm based on the improved longest common substring of claim 1, wherein: the dynamic programming used in the comparison calculation follows the same steps as LC-Substring and LC-Subsequence and proceeds in row or column order, and the main improvement of the algorithm is that the decision rule for state transitions adopts a local-minimum-difference criterion.

5. The linear trace similarity matching algorithm based on the improved longest common substring of claim 1, wherein: in the result generation, case 1 rests on the fact that the N-th longest result obtained by the algorithm is in general accurate and continuous, so that the operator can see the approximate maximum overlapping part of the two traces; and in case 2, besides outputting the N-th longest match, the matches that meet the conditions are also output, which alleviates the broken identification caused by abrupt changes in a trace in some situations.