CN112085045B - Linear trace similarity matching algorithm based on improved longest common substring - Google Patents

Info

Publication number
CN112085045B
CN112085045B CN202010265484.8A CN202010265484A CN112085045B CN 112085045 B CN112085045 B CN 112085045B CN 202010265484 A CN202010265484 A CN 202010265484A CN 112085045 B CN112085045 B CN 112085045B
Authority
CN
China
Prior art keywords
matching
calculation
data
row
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010265484.8A
Other languages
Chinese (zh)
Other versions
CN112085045A (en)
Inventor
潘楠
沈鑫
钱俊兵
赵成俊
夏丰领
魏举伦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202010265484.8A
Publication of CN112085045A
Application granted
Publication of CN112085045B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a linear trace similarity matching algorithm based on an improved longest common substring, belonging to the field of trace matching. The similarity matching algorithm comprises comparison calculation and result generation. The comparison calculation is mainly a variance calculation optimized for dynamic programming, and result generation covers two cases. Case 1: the N-th longest local match is output. Case 2: all matches that meet the condition are output, taking the N-th longest match as the reference. The invention focuses on comparing the fine features of traces, is suitable for assisting users in rapid trace discrimination, improves the working efficiency of criminal investigators, and speeds up judging the degree of similarity between two traces or broken ends.

Description

Linear trace similarity matching algorithm based on improved longest common substring
Technical Field
The invention belongs to the field of trace detection, and particularly relates to a linear trace similarity matching algorithm based on an improved longest common substring.
Background
Railways carry a large number of optical cables, leaky cables, signal cables, cables for rail transit vehicles and through ground wires, and the areas around airports likewise contain many navigation-aid lighting cables, communication optical cables and the like. Because the inner conductors of most of these cables are made of copper and therefore have high economic value, the cables have gradually become a target coveted by criminals. In recent years, cables along high-speed railways and at airports have frequently been cut and stolen, which not only causes losses of tens of millions of yuan to national property but also easily interrupts the power supply and signals of communication equipment, leading to more serious accidents.
According to statistics, linear traces on the cut surface, formed when criminals sever cables with shearing tools such as wire cutters, cable shears and bolt cutters, are the most common traces at crime scenes. A linear trace reflects the external morphological structure of the contact part of the shearing tool, is hard to destroy, occurs frequently and has high identification value. It is therefore very important for investigators in determining the nature of a case, identifying the tool used and further verifying the criminal suspect.
Compared with the traditional approach of observing traces through a microscope and comparing morphological features manually, the image recognition and three-dimensional scanning technologies that have emerged in recent years offer new solutions for the non-destructive quantitative examination of linear traces.
These methods achieve automatic matching of line marks to some extent, but still have the following problems.
1) Image-based comparison methods place high demands on the photographic equipment; inconsistencies in light reflection, shooting angle and focus directly distort the raw data, which easily reduces the robustness of the analysis results;
2) Three-dimensional scanning is stable and reflects the detailed features of linear traces more faithfully, but the detection hardware is expensive, and the resulting files are so large that the amount of computation grows geometrically;
3) Existing matching methods need to distinguish class characteristics and subclass characteristics in addition to the individual characteristics of a tool. The linear traces they have been tested on, such as prying traces and cartridge case firing traces, are regular in form and have highly individual, measurable features; their accuracy is limited for shearing-tool linear traces, which are complex in form and highly random.
Disclosure of Invention
The state matrix generated by the dynamic programming method contains a plurality of matching results, so that the output of the basic longest substring can be realized, and simultaneously, all candidate results can be obtained in a matching area according to different minimum matching lengths under the condition of only once calculation, thereby being beneficial to perfecting the output of the results.
To achieve the above purpose, the invention adopts the following technical scheme: the matching algorithm is applied to criminal investigation, bullet trace detection and other scenarios requiring trace comparison, and comprises (1) model training, (2) comparison calculation and (3) result generation. (1) Model training uses a graph convolutional neural network. (2) The comparison calculation is mainly a variance calculation optimized for dynamic programming. (3) Result generation covers two cases. Case 1: the N-th longest local match is output. Case 2: all matches that meet the condition are output, taking the N-th longest match as the reference.
Preferably, step (1) comprises: 1) establishing a training set; 2) tuning parameters and building the graph convolutional neural network model; 3) feeding in the data to be tested to obtain the similarity calculation result.
2) The specific method for tuning parameters and building the graph convolutional neural network model is as follows. Let the graph be G = (V, E), where V denotes the node set, namely
Figure GDA0002649376810000021
E represents a set of edges, i.e.
Figure GDA0002649376810000022
The training model consists of two parts: 1) the GCN component, which is responsible for sampling the information of all nodes in the K-order neighborhood; 2) the autoencoder (AE) component, which extracts hidden features from the activation value matrix A learned by the GCN component and preserves the node cluster structure in combination with Laplacian Eigenmaps (LE). In the training model, the GCN component uses a graph convolutional neural network that takes the node
Figure GDA0002649376810000023
as the center and samples the structure and feature information of all nodes within K steps, i.e. it encodes the K-order neighborhood information. Combined with training on the node labels, this generates the activation value matrix A that serves as input to the autoencoder component. Through supervised learning based on node labels, the GCN simultaneously encodes the local structure and feature information of the network, while secondary structure information outside the K-order neighborhood, which has little influence on the generated low-dimensional node vectors, is omitted. The autoencoder takes the activation value matrix A learned by the GCN as input, further extracts feature information from A in an unsupervised manner, and, combined with Laplacian eigenmaps, maps the original network to a lower-dimensional space.
Preferably, the comparison calculation adopts a dynamic programming algorithm combining and improving a longest common Substring algorithm (LC-Substring) and a longest common Subsequence (LC-Subsequence), and the reason that the algorithm adopts dynamic programming as a basic method is that a state matrix generated by the dynamic programming method contains a plurality of matching results, so that not only can the output of a basic longest Substring/Subsequence be realized, but also all candidate results can be obtained in a matching area according to different minimum matching lengths under the condition of only once calculation, and the output of results can be improved.
Preferably, the dynamic programming method adopted by the comparison calculation is consistent with the dynamic programming steps of LC-Substring and LC-Subsequence (hereinafter, referred to as LCS for short) and is calculated in the sequence of rows or columns, and the main improvement of the algorithm lies in that a mode of local minimum difference is adopted on the decision rule of state transition.
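For orientation, the following is a minimal sketch of the classic (unimproved) longest common substring dynamic programming, not the patented variant. It illustrates the point made above: one pass fills a state matrix, and every candidate at or above a chosen minimum length can then be read from it. The function names and the exact-equality match rule are assumptions made for illustration.

```python
def lcs_substring_table(s, t):
    """Classic LC-Substring table: dp[i][j] is the length of the
    common substring ending at s[i-1] and t[j-1] (0 if they differ)."""
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 / column 0 = initial state
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
    return dp


def candidates(s, t, min_len):
    """Read all maximal common substrings of length >= min_len
    from a single state matrix, longest first."""
    dp = lcs_substring_table(s, t)
    n, m = len(s), len(t)
    found = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k = dp[i][j]
            # A run is maximal if it cannot be extended along the diagonal.
            is_maximal = i == n or j == m or dp[i + 1][j + 1] == 0
            if k >= min_len and is_maximal:
                found.append(s[i - k:i])
    return sorted(found, key=len, reverse=True)
```

Changing `min_len` changes which candidates are reported without recomputing the table, which is the property the improved algorithm builds on.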
Preferably, in the variance calculation optimized for dynamic programming, after the two-dimensional state matrix is established, the locally optimal solution at the current position is computed with the dynamic equation in a fixed order (by rows or by columns). To avoid recomputing the mean at every position before solving for the variance, the following formula, which fits the characteristics of dynamic programming well, is used:
$D = E(X^2) - [E(X)]^2 \quad (3)$
With this approach, only the following matrices need to be added, sized according to the sequence lengths:
D(0..N, 0..M), which stores the variance of the current optimal result;
SumEx(0..N, 0..M), which stores the sum of the difference values x of the current optimal result, where x is the difference amplitude between S and T at the corresponding position;
SumExPower2(0..N, 0..M), which stores the sum of the squares of the difference values of the current optimal result;
len(0..N, 0..M), which stores the length of the current optimal result;
Way(0..N, 0..M), which stores the relation between the current position and the previous match;
the variance of each (i, j) location can thus be calculated as follows:
Figure GDA0002649376810000031
The SumEx, SumExPower2 and len matrices need to be updated only once per state transition according to the transfer equation, which avoids a large amount of repeated calculation.
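The per-position variance formula survives here only as an image reference. From formula (3) and the matrix definitions, a plausible form is D(i, j) = SumExPower2(i, j) / len(i, j) - (SumEx(i, j) / len(i, j))^2; this is an inference, not the patent's confirmed equation. The sketch below checks the underlying identity on a plain list: keeping a running sum, a running sum of squares and a length gives the variance of every prefix without recomputing the mean. The function name is hypothetical.

```python
from statistics import pvariance


def running_variances(xs):
    """Population variance of xs[:k] for k = 1..len(xs),
    using D = E(X^2) - E(X)^2 with running sums only."""
    sum_ex = 0.0         # plays the role of SumEx
    sum_ex_power2 = 0.0  # plays the role of SumExPower2
    out = []
    for length, x in enumerate(xs, start=1):  # length plays the role of len
        sum_ex += x
        sum_ex_power2 += x * x
        mean = sum_ex / length
        out.append(sum_ex_power2 / length - mean * mean)
    return out
```

Each step costs constant time, whereas recomputing the mean and variance from scratch at every position would cost time proportional to the current match length.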
Preferably, the comparison calculation comprises a strict mode, a non-strict mode and an adaptive mode. The overall flow of the LC-Substring strict mode is as follows:
Step 1: according to sequence S and sequence T, initialize the five matrices D, SumEx, SumExPower2, len and Way and set all of their data to null. The extra row 0 and column 0 hold the initial state and must be present. At the same time, set the threshold maxDifference according to the matching error tolerance expected by the user.
Step 2: traverse the matrices by rows or by columns and perform the dynamic programming calculation. When traversing by rows, start from the first row and compute position (1, 1) (row 1, column 1), then (1, 2) (row 1, column 2), and so on; traversal by columns is analogous. Execute step 3 at each position; once all positions have been processed, the comparison calculation ends.
Step 3: for the current position (i, j), first look up the predecessor variance D(i-1, j-1) and compare it with maxDifference. If D(i-1, j-1) is smaller than maxDifference, continue to step 4; otherwise leave the state unchanged and go directly to step 5.
Step 4: the variance at the predecessor position is below the tolerance, so the state is updated for the current position. If the previous position is already part of a match, i.e. len(i-1, j-1) > 0, perform the following update:
Figure GDA0002649376810000041
The main purpose of this update is to record, for the current position (i, j), the minimum difference (in other words, the maximum similarity) that can be achieved; len and Way are updated to facilitate subsequent backtracking.
If, however, the previous position is not part of a match, i.e. len(i-1, j-1) = 0, the current position is taken as a starting point, a new match is started, and the following update is performed:
Figure GDA0002649376810000042
In either case, whether or not the previous position was matched, proceed to step 5 after the corresponding update.
Step 5: update the degree of difference at the current position, then return to step 2:
Figure GDA0002649376810000051
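The transfer equations for steps 4 and 5 survive only as image references, so the following is a sketch of the strict-mode flow under stated assumptions rather than the patented equations. Assumed here: x is the absolute difference of the two samples; extending a match adds x and x² to the running sums and increments len; Way = 1 marks a diagonal continuation and Way = 0 a start; empty cells are initialised to 0 instead of null. The function name `strict_mode` is hypothetical.

```python
def strict_mode(S, T, max_difference):
    """Sketch of steps 1-5: fill D, len and Way for sequences S and T."""
    n, m = len(S), len(T)

    def grid(value):
        return [[value] * (m + 1) for _ in range(n + 1)]

    # Step 1: five matrices, with the extra row 0 and column 0 as initial state.
    D, sum_ex, sum_ex2 = grid(0.0), grid(0.0), grid(0.0)
    length, way = grid(0), grid(0)

    # Step 2: row-by-row traversal.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Step 3: only continue if the predecessor variance is tolerable.
            if D[i - 1][j - 1] < max_difference:
                x = abs(S[i - 1] - T[j - 1])  # assumed difference amplitude
                if length[i - 1][j - 1] > 0:
                    # Step 4a: extend the match ending at the predecessor.
                    sum_ex[i][j] = sum_ex[i - 1][j - 1] + x
                    sum_ex2[i][j] = sum_ex2[i - 1][j - 1] + x * x
                    length[i][j] = length[i - 1][j - 1] + 1
                    way[i][j] = 1
                else:
                    # Step 4b: start a new match at the current position.
                    sum_ex[i][j] = x
                    sum_ex2[i][j] = x * x
                    length[i][j] = 1
                    way[i][j] = 0
            # Step 5: update the variance from the running sums.
            k = length[i][j]
            if k > 0:
                mean = sum_ex[i][j] / k
                D[i][j] = sum_ex2[i][j] / k - mean * mean
    return D, length, way
```

Note that the threshold is tested on the predecessor, as step 3 describes, so under this reading the first out-of-tolerance sample is still appended to the match and the break only takes effect at the following position.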
Preferably, steps 1, 2 and 5 of the non-strict mode coincide with those of the LC-Substring strict mode, while the dynamic equations and transfer rules of steps 3 and 4 differ slightly. Specifically:
(1) According to sequence S and sequence T, initialize the five matrices D, SumEx, SumExPower2, len and Way and set all of their data to null. To guarantee the validity of the data, the extra row 0 and column 0 hold the initial state and must be present. At the same time, set the threshold maxDifference according to the matching error tolerance expected by the user.
(2) Traverse the matrices by rows or by columns and perform the dynamic programming calculation. When traversing by rows, start from the first row and compute position (1, 1) (row 1, column 1), then (1, 2) (row 1, column 2), and so on; traversal by columns is analogous. Step 3 is executed at each position; once all positions have been processed, the comparison calculation ends.
(3) Step 5 updates the degree of difference at the current position and returns to step 2 after the calculation:
Figure GDA0002649376810000052
Preferably, in result generation, case 1 relies on the fact that the N-th longest result obtained by the algorithm is accurate and continuous, so the operator can see the maximum overlapping part of the two traces; in case 2, in addition to outputting the N-th longest result, all further matches that meet the condition are output, which alleviates the problem of a match being broken by an abrupt change in the trace.
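Case 2 can be sketched as a threshold query on the len matrix. The text does not define how ties are ranked, so this sketch assumes that the "N-th longest" match means the N-th largest distinct value in len, and that "meeting the condition" means having a length at least that large. The function name is hypothetical.

```python
def matches_from_nth(len_matrix, n):
    """Return (length, i, j) for every end position whose match length
    is at least the n-th largest distinct length, longest first."""
    lengths = sorted(
        {v for row in len_matrix for v in row if v > 0}, reverse=True
    )
    if n < 1 or n > len(lengths):
        return []  # fewer than n distinct match lengths exist
    threshold = lengths[n - 1]
    hits = [
        (v, i, j)
        for i, row in enumerate(len_matrix)
        for j, v in enumerate(row)
        if v >= threshold
    ]
    return sorted(hits, reverse=True)
```

Because the query only reads the finished matrix, different values of N can be tried without repeating the comparison calculation.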
The invention has the beneficial effects that:
the state matrix generated by the dynamic programming method contains a plurality of matching results, so that the output of the basic longest substring can be realized, and simultaneously, all candidate results can be obtained in a matching area according to different minimum matching lengths under the condition of only once calculation, thereby being beneficial to perfecting the output of the results.
The patterns obtained by laser scanning of the sheared sample of the tool trace were compared to determine whether there was a coincidence between the two, here identified as 1.
In the case where no tool type is available, the map is compared to the part of the database that has yet to be run in parallel to determine the degree of similarity, providing clues, here 1.
Drawings
FIG. 1 is a basic flow chart of a comparative calculation;
FIG. 2 is a flow chart of result generation;
FIG. 3 is a diagram of simulation data matching test results;
FIG. 4 is a graph of a 60% overlap ratio actual test data match test;
FIG. 5 is a graph showing the data matching test for the actual detection of 30% overlap ratio;
FIG. 6 is a graph showing the data match test of the actual detection of 45% overlap ratio.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings and examples, which are not intended to limit the present invention.
As shown in fig. 1-3, the matching algorithm is applied to criminal investigation, bullet trace detection and other scenarios requiring trace comparison. The similarity matching algorithm comprises (1) model training, (2) comparison calculation and (3) result generation. (1) Model training uses a graph convolutional neural network. (2) The comparison calculation is mainly a variance calculation optimized for dynamic programming. (3) Result generation covers two cases. Case 1: the N-th longest local match is output. Case 2: all matches that meet the condition are output, taking the N-th longest match as the reference.
Preferably, step (1) comprises: 1) establishing a training set; 2) tuning parameters and building the graph convolutional neural network model; 3) feeding in the data to be tested to obtain the similarity calculation result.
2) The specific method for tuning parameters and building the graph convolutional neural network model is as follows. Let the graph be G = (V, E), where V denotes the node set, namely
Figure GDA0002649376810000061
E represents a set of edges, i.e.
Figure GDA0002649376810000062
The training model consists of two parts: 1) the GCN component, which is responsible for sampling the information of all nodes in the K-order neighborhood; 2) the autoencoder (AE) component, which extracts hidden features from the activation value matrix A learned by the GCN component and preserves the node cluster structure in combination with Laplacian Eigenmaps (LE). In the training model, the GCN component uses a graph convolutional neural network that takes the node
Figure GDA0002649376810000063
as the center and samples the structure and feature information of all nodes within K steps, i.e. it encodes the K-order neighborhood information. Combined with training on the node labels, this generates the activation value matrix A that serves as input to the autoencoder component. Through supervised learning based on node labels, the GCN simultaneously encodes the local structure and feature information of the network, while secondary structure information outside the K-order neighborhood, which has little influence on the generated low-dimensional node vectors, is omitted. The autoencoder takes the activation value matrix A learned by the GCN as input, further extracts feature information from A in an unsupervised manner, and, combined with Laplacian eigenmaps, maps the original network to a lower-dimensional space.
The two components are linearly combined and trained together with the training set using the stacking method (Stacking) from ensemble learning, so that the low-dimensional node vectors obtained by the whole model retain both the feature information and the structural information of the nodes. The GCN component and the AE component are linearly combined by stacking, and the weights of their loss functions are controlled by two hyper-parameters α and β,
where the loss function of the node sampling component is:
Figure GDA0002649376810000071
α is the weight of the node sampling component loss function.
The loss function of the autoencoder component AE is:
Figure GDA0002649376810000072
β is the weight of the loss function of the autoencoder component AE.
Finally, the loss function of the training model is defined as:
Figure GDA0002649376810000073
where y_i is the true label of the node,
Figure GDA0002649376810000074
is the label predicted by the GCN,
Figure GDA0002649376810000075
is the activation value matrix, K is the neighborhood order of node v_i,
Figure GDA0002649376810000076
is the reconstructed activation value matrix,
Figure GDA0002649376810000077
is the hidden-layer representation of the l-th layer of the autoencoder, and L is the number of hidden layers of the AE.
The TensorFlow framework is used to accelerate model training on a graphics card (GPU). In the model optimization stage, the model parameters are updated with the AdamOptimizer provided by TensorFlow, which improves on traditional gradient descent by using momentum (i.e., moving averages of the gradients) and facilitates dynamic adjustment of hyper-parameters, allowing the model to be trained quickly and efficiently. The model parameters are updated on only one batch at a time, which further reduces memory usage during training.
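To make the momentum idea concrete, the following is a framework-free sketch of the standard Adam update rule applied to a single scalar parameter. It is not the TensorFlow implementation and does not reproduce the patent's training setup; the function name, the default step size and the toy objective in the usage note are assumptions for illustration.

```python
import math


def adam_minimize(grad, w, steps=1000, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise a scalar function given its gradient, using Adam."""
    m = 0.0  # moving average of the gradient (momentum term)
    v = 0.0  # moving average of the squared gradient
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w
```

For example, `adam_minimize(lambda w: 2 * (w - 3), 0.0)` minimises (w - 3)² starting from 0 and should end close to 3. In a real model, one such update would be applied per mini-batch to every parameter.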
The comparison calculation adopts a dynamic programming algorithm combining and improving the longest common Substring algorithm (LC-Substring) and the longest common Subsequence (LC-Subsequence), and the algorithm adopts dynamic programming as a basic method because a state matrix generated by the dynamic programming method contains a plurality of matching results, so that the output of the basic longest Substring/Subsequence can be realized, and all candidate results can be obtained in a matching area according to different minimum matching lengths under the condition of only once calculation, thereby being beneficial to perfecting the output of results.
Preferably, the dynamic programming method adopted by the comparison calculation follows the same steps as the dynamic programming of LC-Substring and LC-Subsequence (hereinafter both abbreviated as LCS), with the calculation carried out in row or column order. The main improvement of the algorithm is that the decision rule for state transition uses a local minimum difference.
Preferably, in the variance calculation optimized for dynamic programming, after the two-dimensional state matrix is established, the locally optimal solution at the current position is computed with the dynamic equation in a fixed order (by rows or by columns). To avoid recomputing the mean at every position before solving for the variance, the following formula, which fits the characteristics of dynamic programming well, is used:
$D = E(X^2) - [E(X)]^2 \quad (3)$
With this approach, only the following matrices need to be added, sized according to the sequence lengths:
D(0..N, 0..M), which stores the variance of the current optimal result;
SumEx(0..N, 0..M), which stores the sum of the difference values x of the current optimal result, where x is the difference amplitude between S and T at the corresponding position;
SumExPower2(0..N, 0..M), which stores the sum of the squares of the difference values of the current optimal result;
len(0..N, 0..M), which stores the length of the current optimal result;
Way(0..N, 0..M), which stores the relation between the current position and the previous match;
the variance of each (i, j) position can thus be calculated as follows:
Figure GDA0002649376810000081
The SumEx, SumExPower2 and len matrices need to be updated only once per state transition according to the transfer equation, which avoids a large amount of repeated calculation.
Preferably, the comparison calculation comprises a strict mode, a non-strict mode and an adaptive mode. The overall flow of the LC-Substring strict mode is as follows:
Step 1: according to sequence S and sequence T, initialize the five matrices D, SumEx, SumExPower2, len and Way and set all of their data to null. The extra row 0 and column 0 hold the initial state and must be present. At the same time, set the threshold maxDifference according to the matching error tolerance expected by the user.
Step 2: traverse the matrices by rows or by columns and perform the dynamic programming calculation. When traversing by rows, start from the first row and compute position (1, 1) (row 1, column 1), then (1, 2) (row 1, column 2), and so on; traversal by columns is analogous. Execute step 3 at each position; once all positions have been processed, the comparison calculation ends.
Step 3: for the current position (i, j), first look up the predecessor variance D(i-1, j-1) and compare it with maxDifference. If D(i-1, j-1) is smaller than maxDifference, continue to step 4; otherwise leave the state unchanged and go directly to step 5.
Step 4: the variance at the predecessor position is below the tolerance, so the state is updated for the current position. If the previous position is already part of a match, i.e. len(i-1, j-1) > 0, perform the following update:
Figure GDA0002649376810000091
The main purpose of this update is to record, for the current position (i, j), the minimum difference (in other words, the maximum similarity) that can be achieved; len and Way are updated to facilitate subsequent backtracking.
If, however, the previous position is not part of a match, i.e. len(i-1, j-1) = 0, the current position is taken as a starting point, a new match is started, and the following update is performed:
Figure GDA0002649376810000092
In either case, whether or not the previous position was matched, proceed to step 5 after the corresponding update.
Step 5: update the degree of difference at the current position, then return to step 2:
Figure GDA0002649376810000101
Preferably, steps 1, 2 and 5 of the non-strict mode coincide with those of the LC-Substring strict mode, while the dynamic equations and transfer rules of steps 3 and 4 differ slightly. Specifically:
(1) According to sequence S and sequence T, initialize the five matrices D, SumEx, SumExPower2, len and Way and set all of their data to null. To guarantee the validity of the data, the extra row 0 and column 0 hold the initial state and must be present. At the same time, set the threshold maxDifference according to the matching error tolerance expected by the user.
(2) Traverse the matrices by rows or by columns and perform the dynamic programming calculation. When traversing by rows, start from the first row and compute position (1, 1) (row 1, column 1), then (1, 2) (row 1, column 2), and so on; traversal by columns is analogous. Step 3 is executed at each position; once all positions have been processed, the comparison calculation ends.
(3) Step 5 updates the degree of difference at the current position and returns to step 2 after the calculation:
Figure GDA0002649376810000102
Preferably, in result generation, case 1 relies on the fact that the N-th longest result obtained by the algorithm is accurate and continuous, so the operator can see the maximum overlapping part of the two traces; in case 2, in addition to outputting the N-th longest result, all further matches that meet the condition are output, which alleviates the problem of a match being broken by an abrupt change in the trace.
Example 1:
The N-th longest local match is output. This case is typically used for fast matching: in general, the N-th longest result obtained by the algorithm is accurate and continuous, and it shows the approximate (most strict) maximum overlapping part of the two traces. The implementation steps are as follows:
Step 1: find the position (i, j) holding the N-th largest value in the len matrix; it represents the N-th longest match, because len stores the length of the longest continuous match of S and T ending at the corresponding position.
Step 2: mark position (i, j) as the end position, with position i of sequence S and position j of sequence T as the end of the match, and then proceed to step 3.
Step 3: each time the backtracking reaches a position (x, y), output and move according to the state of Way(x, y):
Figure GDA0002649376810000111
Step 3 is repeated len(i, j) - 1 times while Way(x, y) = 1, outputting the corresponding elements of S and T on each pass; this procedure is compatible with all modes of the comparison algorithm.
The coordinate position at which step 3 ends is marked as the start position. The previously marked end position and this start position are then returned, together with all the output elements of S and T, which form the longest matching sequence of the two.
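The backtracking above can be sketched as follows. The movement rule is given only as an image reference, so this sketch assumes that Way(x, y) = 1 means "continue diagonally to (x-1, y-1)" and that any other value marks the start of the match. Positions are 1-based as in the text, so position x of S is `S[x - 1]` in Python. The function name is hypothetical.

```python
def backtrack(S, T, len_matrix, way, end):
    """Trace one match back from its end position.

    Returns (start, end, matched part of S, matched part of T)."""
    i, j = end
    x, y = i, j
    s_out, t_out = [S[x - 1]], [T[y - 1]]
    # At most len(i, j) - 1 diagonal moves are needed.
    for _ in range(len_matrix[i][j] - 1):
        if way[x][y] != 1:
            break  # reached the starting point of the match
        x, y = x - 1, y - 1
        s_out.append(S[x - 1])
        t_out.append(T[y - 1])
    # Elements were collected from the end backwards; restore the order.
    s_out.reverse()
    t_out.reverse()
    return (x, y), (i, j), s_out, t_out
```

The end position passed in would normally come from the len matrix, for example the position holding the N-th largest value.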
Simulation test
There are 8 samples in total, where data0 is the reference sample and data 1-100 coincide with data0 at different positions, with an overlap of about 30%. The tolerance used in the test is 1, and the test result is shown in fig. 3:
Actual test
Different broken ends produced by actual shearing were tested. The coincident positions are difficult to distinguish with the naked eye, and the corresponding positions also differ considerably. The test was run by program with a tolerance of 4; the broken ends were divided into 100 groups, all subjected to the same coincidence test, with the maximum overlap of each group varying between about 30% and 60%. The matching results are shown in Figs. 4 and 5; the matching accuracy reaches 85%.
The shearing difficulty was then increased further: the broken ends were again produced by actual shearing, but with greater randomness and larger variation; the overlap ranges from 30% to 60%, and the other conditions are unchanged. The matching results are shown in Fig. 6; the matching accuracy reaches 80%.
Finally, it should be noted that the above examples are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present disclosure has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that the embodiments may still be modified, or some of their technical features replaced by equivalents; such modifications and substitutions, made without departing from the spirit of the present disclosure, shall fall within the scope claimed by the present disclosure.

Claims (5)

1. A linear trace similarity matching algorithm based on improved longest common substrings, characterized in that: the matching algorithm is applied to criminal investigation, bullet trace detection and other scenarios requiring trace comparison, and comprises (1) model training, (2) comparison calculation and (3) result generation; the model training (1) uses a graph convolutional neural network; the comparison calculation (2) is mainly a variance calculation optimized by dynamic programming; and the result generation (3) covers two cases, case 1: outputting the Nth-longest partial match, and case 2: outputting all matches that meet the conditions, taking the Nth-longest match as a reference;
after the two-dimensional state matrix is established, the dynamic-programming-optimized variance calculation uses the dynamic equation to compute the locally optimal solution at the current position, proceeding in order by rows or columns; instead of the formula that recomputes the mean at every position before solving for the variance, the following identity is used, which fits the characteristics of dynamic programming well:
$$D = E(X^{2}) - E(X)^{2} \tag{3}$$
with this approach, only the following matrices, sized according to the sequence lengths, need to be added:
D(0..N, 0..M), which stores the variance of the current optimal result;
SumEx(0..N, 0..M), which stores the sum of the difference values of the current optimal result;
SumExPower2(0..N, 0..M), which stores the sum of the squared difference values of the current optimal result;
len(0..N, 0..M), which stores the length of the current optimal result, where X in formula (3) is the difference amplitude between S and T at the corresponding position;
Way(0..N, 0..M), which stores the relation between the current position and the previous match;
the variance at each position (i, j) can thus be calculated as follows:
Figure FDA0003938903290000011
the SumEx, SumExPower2 and len matrices need to be updated only once per state transition according to the transfer equation, which avoids a large amount of repeated calculation;
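To illustrate why the running sums avoid repeated work, the following minimal sketch applies formula (3) to a sum, a sum of squares and a length. It is derived only from formula (3) and the matrix definitions above, not transcribed from the figure; the function names `extend_sums` and `variance_from_sums` are hypothetical, and returning 0 for an empty match is an assumption.

```python
def extend_sums(sum_ex, sum_ex_power2, length, x):
    """Fold one more difference value x into the running sums in O(1)."""
    return sum_ex + x, sum_ex_power2 + x * x, length + 1


def variance_from_sums(sum_ex, sum_ex_power2, length):
    """Apply D = E(X^2) - E(X)^2 using the stored sums.

    Assumption: an empty match (length 0) is given a variance of 0.
    """
    if length == 0:
        return 0.0
    mean = sum_ex / length
    return sum_ex_power2 / length - mean * mean


# Usage: three difference values 1, 2, 3 are folded in one at a time.
state = (0.0, 0.0, 0)
for x in [1.0, 2.0, 3.0]:
    state = extend_sums(*state, x)
variance = variance_from_sums(*state)  # population variance of 1, 2, 3
```

Each extension costs a constant number of operations, whereas recomputing the mean and the variance from scratch would cost time proportional to the current match length at every position.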
the comparison calculation comprises a strict mode, a relaxed mode and an adaptive mode; the overall flow of the LC-Substring strict mode is as follows:
step 1, initializing the five matrices D, SumEx, SumExPower2, len and Way according to sequence S and sequence T and setting all their data to null; the extra data in row 0 and column 0 are initial-state data whose existence must be guaranteed; a threshold maxDifference is set according to the user's expected matching error tolerance;
step 2, performing the dynamic programming calculation by row-wise or column-wise traversal; in the row-wise case, the calculation starts from the first row with (1, 1), the data in row 1, column 1, followed by (1, 2), the data in row 1, column 2, and the column-wise case is analogous; step 3 is executed at each position, and once all positions have been processed the entire comparison calculation ends;
step 3, for the current position (i, j), first finding the predecessor variance value D(i-1, j-1) and comparing it with maxDifference; if D(i-1, j-1) is smaller than maxDifference, continuing with step 4; otherwise leaving the state equation unchanged and going directly to step 5;
step 4, since the variance at the predecessor position is smaller than the tolerance value, updating the state equation for the current position; if the previous position is in a matched state, i.e. len(i-1, j-1) > 0, performing the following update:
Figure FDA0003938903290000021
the main purpose of the update is to obtain the minimum difference achievable at the current position (i, j), and to update len and Way to facilitate the subsequent backtracking;
however, if the previous position is in an unmatched state, i.e. len(i-1, j-1) = 0, the current position is selected as a starting point, matching starts, and the following update is performed:
Figure FDA0003938903290000022
in either case, whether or not the previous position was matched, once the corresponding update is done, going to step 5;
step 5, mainly updating the difference degree of the current position, and returning to step 2 after the calculation is finished:
Figure FDA0003938903290000023
the execution steps of the relaxed mode coincide with steps 1, 2 and 5 of the LC-Substring strict mode, while the dynamic equations and transfer modes of steps 3 and 4 differ slightly, specifically as follows:
(1) Initializing the five matrices D, SumEx, SumExPower2, len and Way according to sequence S and sequence T and setting all their data to null; the validity of the data must be ensured: the extra data in row 0 and column 0 are initial-state data whose existence must be guaranteed; a threshold maxDifference is set according to the user's expected matching error tolerance;
(2) Performing the dynamic programming calculation by row-wise or column-wise traversal; in the row-wise case, the calculation starts from the first row with (1, 1), the data in row 1, column 1, followed by (1, 2), the data in row 1, column 2, and the column-wise case is analogous; step 3 is executed at each position, and once all positions have been processed the entire comparison calculation ends;
(3) Step 5 mainly updates the difference degree of the current position; after the calculation is finished, return to step 2:
Figure FDA0003938903290000031
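The strict-mode flow of steps 1 to 5 can be sketched roughly as below. This is a hedged illustration, not the claimed update equations, which are given in figures not reproduced here. The assumptions are: the difference amplitude is taken as the signed difference S[i-1] - T[j-1]; "null" is represented by 0; extending a match adds the new difference to the predecessor's running sums; an unmatched cell keeps a variance of 0; and Way = 1 marks a diagonal continuation. The name `strict_mode` is hypothetical.

```python
def strict_mode(s, t, max_difference):
    """Row-wise dynamic programming pass over sequences s and t.

    Returns the matrices (D, length, way). All update rules are
    assumptions made for illustration; see the lead-in above.
    """
    n, m = len(s), len(t)

    def grid(value):
        # Step 1: (n + 1) x (m + 1) matrix; row 0 and column 0 are initial state.
        return [[value] * (m + 1) for _ in range(n + 1)]

    D, sum_ex, sum_ex2 = grid(0.0), grid(0.0), grid(0.0)
    length, way = grid(0), grid(0)

    for i in range(1, n + 1):          # Step 2: traverse row by row
        for j in range(1, m + 1):
            x = s[i - 1] - t[j - 1]    # assumed difference amplitude
            if D[i - 1][j - 1] < max_difference:      # Step 3
                if length[i - 1][j - 1] > 0:          # Step 4: extend the match
                    sum_ex[i][j] = sum_ex[i - 1][j - 1] + x
                    sum_ex2[i][j] = sum_ex2[i - 1][j - 1] + x * x
                    length[i][j] = length[i - 1][j - 1] + 1
                    way[i][j] = 1
                else:                                 # Step 4: start a new match
                    sum_ex[i][j] = x
                    sum_ex2[i][j] = x * x
                    length[i][j] = 1
            k = length[i][j]                          # Step 5: update the variance
            if k > 0:
                mean = sum_ex[i][j] / k
                D[i][j] = sum_ex2[i][j] / k - mean * mean
    return D, length, way
```

Because the threshold is checked against the predecessor, a match is cut one position after its variance first reaches `max_difference`; the cell after that starts from the empty state again. Note also that, under the signed-difference assumption, a constant offset between the two sequences has zero variance and therefore still counts as a match.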
2. The linear trace similarity matching algorithm based on improved longest common substrings according to claim 1, characterized in that: the method comprises the following steps: 1) establishing a training set, 2) adjusting parameters and establishing a graph convolutional neural network model, and 3) importing the data to be detected to obtain a similarity calculation result;
2) The specific method for adjusting parameters and establishing the graph convolutional neural network model is as follows: let G = (V, E), where V represents the node set, i.e.
Figure FDA0003938903290000032
E represents a set of edges, i.e.
Figure FDA0003938903290000033
The training model consists of two parts: 1) a GCN component, responsible for sampling the information of all nodes in the K-order neighborhood, and 2) an autoencoder (AE) component, used to extract hidden features from the activation value matrix A learned by the GCN component and to preserve the node cluster structure in combination with Laplacian eigenmaps (LE); in the training model, the GCN component uses a graph convolutional neural network, taking
Figure FDA0003938903290000034
as the center, to sample the structure and feature information of all nodes within K steps, i.e. to encode the K-order neighborhood information, and, combined with training on the node labels, generates the activation value matrix A used as input to the autoencoder component; through supervised learning based on node labels, the GCN can simultaneously encode the local structure and feature information of the network while omitting structural information outside the K-order neighborhood that is of secondary importance for generating the low-dimensional node vectors; the activation value matrix A learned by the GCN is used as the input of the autoencoder, which further extracts feature information from A by unsupervised learning and, in combination with Laplacian eigenmaps, maps the original network to a low-dimensional space.
3. The linear trace similarity matching algorithm based on improved longest common substrings according to claim 1, characterized in that: the comparison calculation uses a dynamic programming algorithm that combines and improves the longest common substring algorithm (LC-Substring) and the longest common subsequence algorithm (LC-Subsequence); dynamic programming is adopted as the basic method because the state matrix it generates contains multiple matching results, so that, in addition to outputting the basic longest substring or subsequence, all candidate results within the matching region can be obtained for different minimum matching lengths from a single calculation, which helps make the output of results more complete.
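As a rough illustration of how one state matrix can yield all candidates for a given minimum matching length, the sketch below scans a len matrix once and lists every maximal diagonal run. It is an assumption-laden example, not part of the claim: the function name `candidates` is hypothetical, and a run is treated as maximal when the next diagonal cell does not continue it.

```python
def candidates(length, min_len):
    """List (start, end, size) for every maximal match of size >= min_len.

    Assumption: length[i][j] is the size of the consecutive match ending
    at (i, j), and a match continues only along the diagonal.
    """
    rows, cols = len(length), len(length[0])
    found = []
    for i in range(1, rows):
        for j in range(1, cols):
            size = length[i][j]
            inside = i + 1 < rows and j + 1 < cols
            nxt = length[i + 1][j + 1] if inside else 0
            # Keep the cell only if the run ends here and is long enough.
            if size >= min_len and nxt != size + 1:
                start = (i - size + 1, j - size + 1)
                found.append((start, (i, j), size))
    found.sort(key=lambda c: c[2], reverse=True)  # longest candidates first
    return found
```

Changing `min_len` only changes the filter, so the dynamic programming pass itself does not have to be repeated.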
4. The linear trace similarity matching algorithm based on improved longest common substrings according to claim 1, characterized in that: the dynamic programming method used in the comparison calculation follows the same execution steps as the dynamic programming of LC-Substring and LC-Subsequence, computing in order by rows or columns; the main improvement of the algorithm is that a local-minimum-difference rule is adopted as the decision rule for state transitions.
5. The linear trace similarity matching algorithm based on improved longest common substrings according to claim 1, characterized in that: in the result generation, under case 1 the Nth-longest result obtained by the algorithm is, in general, accurate and continuous, allowing a worker to see the approximate maximum overlapping part of the two traces; under case 2, after the Nth-longest match has been output, matching continues for all matches that meet the conditions, which can alleviate the problem of broken identification caused by abrupt changes in the trace in certain cases.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010265484.8A CN112085045B (en) 2020-04-07 2020-04-07 Linear trace similarity matching algorithm based on improved longest common substring

Publications (2)

Publication Number Publication Date
CN112085045A CN112085045A (en) 2020-12-15
CN112085045B true CN112085045B (en) 2022-12-27

Family

ID=73734671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010265484.8A Active CN112085045B (en) 2020-04-07 2020-04-07 Linear trace similarity matching algorithm based on improved longest common substring

Country Status (1)

Country Link
CN (1) CN112085045B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109191502A (en) * 2018-08-14 2019-01-11 南京工业大学 A kind of method of automatic identification shell case trace
CN110555389A (en) * 2019-08-09 2019-12-10 南京工业大学 bullet line-bore trace identification method based on ridgelet transformation and rotation matching

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120254333A1 (en) * 2010-01-07 2012-10-04 Rajarathnam Chandramouli Automated detection of deception in short and multilingual electronic messages
CN104123546A (en) * 2014-07-25 2014-10-29 黑龙江省科学院自动化研究所 Multi-dimensional feature extraction based bullet trace comparison method
CN105589838B (en) * 2015-12-24 2018-06-12 中国电子科技集团公司第三十三研究所 A kind of electronic government documents trace reservation method based on Documents Comparison
US20190184561A1 (en) * 2017-12-15 2019-06-20 The Regents Of The University Of California Machine Learning based Fixed-Time Optimal Path Generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant