CN110795530A

CN110795530A - Context-based value feature extraction system and method

Info

Publication number: CN110795530A
Application number: CN201910857258.6A
Authority: CN
Inventors: 程华; 袁洋
Original assignee: Wuxi Jiangnan Computing Technology Institute
Current assignee: Wuxi Jiangnan Computing Technology Institute
Priority date: 2019-09-11
Filing date: 2019-09-11
Publication date: 2020-02-14
Anticipated expiration: 2039-09-11
Also published as: CN110795530B

Abstract

The invention relates to the technical field of computer applications, and in particular to a context-based value feature extraction system and method. The context-based value feature extraction system comprises: a variable extraction module for automatically extracting variables from code; a variable context counting module for counting the occurrences of each variable in different contexts; a feature matrix generation module for generating feature matrices, in which each row corresponds to a variable and each element in the row represents the word frequency of that variable in a particular context; and a feature comparison module. The invention aims to provide a context-based value feature extraction system and method that combine value features with context sensitivity, retaining the efficiency of value feature matching while improving matching precision by taking context-sensitive information into account.

Description

Context-based value feature extraction system and method

Technical Field

The invention relates to the technical field of computer applications, and in particular to a context-based value feature extraction system and method.

Background

Code similarity detection is currently used mainly to detect code plagiarism, and is an important task in software development and maintenance. It is widely applied in fields such as software copyright and intellectual property protection, source code plagiarism detection, software component library queries, and program comprehension. It helps determine whether software code is copied or original, and is of practical significance for upholding software copyright.

For example, Chinese patent document No. CN109918218 discloses a code similarity detection method and system based on a relation variable graph, comprising an identifier determination module, a similarity calculation module and the like, which are used to determine matching query results for code similarity among different documents.

Other solutions for detecting code plagiarism exist both in China and abroad, such as the MOSS system of Stanford University in the United States, the SIM system of Wichita State University, and the YAP3 system of the University of Sydney in Australia. However, both in the cited document and in the other prior art, code segment features must be extracted from the document under test and the comparison document to serve as the data source for the subsequent identification step. This process generally takes one of two approaches: value feature extraction, or tree feature or graph feature extraction.

In the first approach, value features at the code segment scale are used. Because such value features contain no context information, the precision is not high.

In the second approach, the tree features and graph features retain all the context information, so the detection precision is high, but the complexity is also high, the calculation is time-consuming, and the demand on system resources is heavy.

Disclosure of Invention

The invention aims to provide a context-based value feature extraction system and method that combine value features with context sensitivity, retaining the efficiency of value feature matching while improving matching precision by taking context-sensitive information into account.

The technical purpose of the invention is achieved by the following technical solution:

a context-based value feature extraction system, comprising:

a variable extraction module for automatically extracting variables from code;

a variable context counting module for counting the occurrences of each variable in different contexts;

a feature matrix generation module for generating feature matrices, in which each row corresponds to a variable and each element in the row represents the word frequency of that variable in a particular context;

and a feature comparison module for comparing the similarity of two feature matrices.

Preferably, the variable context counting module counts definition and use contexts: a variable is defined when it is located on the left side of "=", "+=" or "-=", and used when it is located on the right side of "=", "+=" or "-=".

Preferably, the variable context counting module counts common-statement contexts, where a common-statement context is a conditional statement and/or a calculation statement and/or an array access and/or a constant assignment.

Preferably, the variable context counting module counts nested-statement contexts, where the nested statement is an outermost loop, a second-outermost loop, a third-level loop, or a deeper inner loop.

Preferably, the invention further comprises a matrix merging module: after the feature matrix generation module generates the feature matrices of a plurality of sub-code segments, the matrix merging module merges them into the feature matrix of the original code segment.

Preferably, when comparing the similarity of the two feature matrices, the feature comparison module uses a row-by-row comparison scan, in which a row of feature matrix A is compared against the rows of feature matrix B.

Preferably, the row-by-row comparison scan calculates the cosine of the angle between vectors.

Preferably, after the cosine value is calculated, it is multiplied by a length scaling coefficient, which is the ratio of the length of the shorter vector to the length of the longer vector.

A context-based value feature extraction method, using the context-based value feature extraction system described above, comprises the following steps:

S01, a variable identification step:

the system identifies variables in the code;

S02, a frequency statistics step:

the system counts the word frequency of each variable in different contexts;

S03, a matrix generation step:

the generated data matrix comprises n rows, each row corresponding to one variable and containing m data elements, each of which is the number of times the variable appears in the corresponding context;

S04, a matrix matching step:

the generated feature matrices are compared pairwise, a similarity value is calculated, and the similarity value is compared with a preset similarity interval.

Preferably, the method further comprises a matrix merging step after the matrix generation step S03, in which the system merges the feature matrices corresponding to the sub-code segments into the feature matrix of the original code segment.

In summary, the invention has the following beneficial effects:

1. The method extracts and matches value features while including the word frequency information of contexts, so that matching balances accuracy and efficiency.

2. After the feature matrix is formed, comparison is carried out row by row, which targets the degree of variable matching more directly.

3. The matrices of the sub-code segments are merged rather than simply added.

4. Matching is based on cosine similarity, and a length scaling coefficient is introduced to further increase matching accuracy.

Drawings

FIG. 1 is a schematic diagram of the code of embodiment 1;

FIG. 2 is a schematic diagram of a generated feature matrix;

FIG. 3 is a schematic diagram of matrix matching.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

This embodiment only explains the invention and does not limit it. After reading this specification, those skilled in the art may modify the embodiment as needed without inventive contribution, and all such modifications are protected by patent law within the scope of the claims of the invention.

Embodiment 1: a feature extraction method of the context-based value feature extraction system. The first step is S01, the variable identification step. FIG. 1 shows a piece of code. Taking this code as an example, s, pi, i, n, r and b are its variables, which the system can recognize automatically. Automatic identification of variables is a conventional technique in the prior art and is not described in detail here. After the variables are identified, the key steps S02 and S03 follow. In S02, the frequency statistics step, the system counts the word frequency of each variable in different contexts. The contexts here generally include the following three cases:

(1) Definition and use. Count the number of times a variable is defined (assigned) and used throughout the code segment. A definition is located on the left side of "=" (or "+=", "-=", etc.), and a use is located on the right side of such an operator.

(2) Common statements. Count the number of times the variable occurs in common statements. The common-statement contexts of interest include conditional statements, calculation (addition, subtraction, multiplication, division) statements, array accesses (with the variable as a subscript), constant assignments, and the like.

(3) Nested statements. Count the number of times a variable appears in nested statements, such as an outermost loop, a second-outermost loop, a third-level loop, or a deeper inner loop.
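The three kinds of context above can be illustrated with a minimal sketch. It is not the patented implementation: it assumes the input is Python source (the code in FIG. 1 is not reproduced here), uses Python's standard `ast` module, and covers only definition, use, condition and loop-depth contexts. The context names in `CONTEXTS` and the helper `count_contexts` are hypothetical, and the loop header is treated as part of the loop it opens.

```python
import ast
from collections import defaultdict

# Hypothetical context names; the patent does not fix a particular set.
CONTEXTS = ["use", "def", "cond", "loop1", "loop2", "loop3", "loop_inner"]


class ContextCounter(ast.NodeVisitor):
    """Count how often each name occurs in each context (illustrative only)."""

    def __init__(self):
        self.counts = defaultdict(lambda: dict.fromkeys(CONTEXTS, 0))
        self.depth = 0  # current loop nesting depth

    def visit_Name(self, node):
        row = self.counts[node.id]
        if isinstance(node.ctx, ast.Store):    # left side of =, +=, -=
            row["def"] += 1
        elif isinstance(node.ctx, ast.Load):   # any read of the variable
            row["use"] += 1
        if self.depth:
            key = f"loop{self.depth}" if self.depth <= 3 else "loop_inner"
            row[key] += 1

    def _count_condition(self, test):
        # Every name inside a condition expression counts once as "cond".
        for sub in ast.walk(test):
            if isinstance(sub, ast.Name):
                self.counts[sub.id]["cond"] += 1

    def visit_If(self, node):
        self._count_condition(node.test)
        self.generic_visit(node)

    def _visit_loop(self, node):
        self.depth += 1
        self.generic_visit(node)
        self.depth -= 1

    def visit_While(self, node):
        self._count_condition(node.test)
        self._visit_loop(node)

    def visit_For(self, node):
        self._visit_loop(node)


def count_contexts(source):
    counter = ContextCounter()
    counter.visit(ast.parse(source))
    return {name: dict(row) for name, row in counter.counts.items()}


sample = """
s = 0
i = 0
while i < n:
    s += i
    i += 1
"""
counts = count_contexts(sample)
```

In this sample, `s` is defined twice and never read, while `i` is defined twice, read twice, and appears once in a condition. Calculation statements, array subscripts and constant assignments could be counted in the same way by adding further `visit_*` methods.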

After the statistics are gathered, the method proceeds to the matrix generation step S03, where the system generates a two-dimensional feature matrix. The matrix contains a plurality of rows, each corresponding to a separate variable. FIG. 2 shows a feature matrix generated from the statistics of a certain code segment. For example, the first row is the variable s, the second row is the variable pi, and the third row is the variable i. Each row contains a plurality of data elements, each representing the word frequency of the variable in a different context.

For example, if the first two data elements in the first row of FIG. 2 are 4 and 2, this indicates that the variable s occurs 4 times in a use context and 2 times in a definition context.

The number and type of data elements in each row may vary between embodiments, depending on the choices of the software designer. In this embodiment, the 4 and 2 in the first row, corresponding to s, are the word frequencies of use and definition; the following 1, 0 and 1 are word frequencies in different common statements; and the following 2 and 0 are word frequencies in nested loop statements.
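As a rough illustration of the row layout described above, the sketch below turns per-variable context counts into a feature matrix. The column names in `COLUMNS` are assumptions (the text does not say which common statements the values 1, 0 and 1 belong to), `build_matrix` is a hypothetical helper, and the counts for `pi` are made up; only the row for `s` follows the values quoted in the text.

```python
# Hypothetical column order: use, definition, three common-statement
# contexts, then two loop levels. The patent leaves this to the designer.
COLUMNS = ["use", "def", "cond", "arith", "index", "loop1", "loop2"]


def build_matrix(counts, columns=COLUMNS):
    """Return (variable names, matrix) with one row per variable."""
    names = list(counts)  # dict order, i.e. order of first appearance
    matrix = [[counts[name].get(col, 0) for col in columns] for name in names]
    return names, matrix


counts = {
    # Row for s as quoted in the text: 4, 2, 1, 0, 1, 2, 0.
    "s": {"use": 4, "def": 2, "cond": 1, "arith": 0, "index": 1,
          "loop1": 2, "loop2": 0},
    # Made-up counts for pi, for illustration only.
    "pi": {"use": 1, "def": 1},
}
names, matrix = build_matrix(counts)
```

Contexts that were never observed for a variable default to 0, so every row has the same length m regardless of which contexts the variable actually appeared in.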

Generating the feature matrix is efficient. If the matrix has size n × m, the code segment contains n variables and m contexts are analyzed. The matrix can be computed in O(L + knm) time, where L is the length of the code segment and k is the number of sub-code segments in the code segment.

This process usually also includes a matrix merging step. Specifically, when calculating the feature matrix of a code segment, the feature matrices of its sub-code segments are calculated first and then merged to obtain the feature matrix of the original code segment. Merging is not a simple addition of matrices, because contexts such as the loop nesting level must be recalculated at merge time: the loop level seen from the perspective of a sub-code segment may differ from the loop level seen from the perspective of the original code segment. In addition, if a variable is newly declared in a sub-code segment, the size of the merged feature matrix increases accordingly.

After merging is complete, the method proceeds to the matrix matching step S04. By this point, both the code in the database and the code to be compared have corresponding feature matrices, which are now compared to judge whether they are similar.
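The following is one possible way to sketch such a merge, not the method fixed by the patent. It assumes a row layout of use, definition, then four loop-depth columns (depth 1, 2, 3, and deeper), that every occurrence is either a use or a definition, and that equal names in different sub-code segments denote the same variable. `lift_row` and `merge` are hypothetical helpers.

```python
LOOP_COLS = 4  # assumed: the last 4 columns are loop depth 1, 2, 3 and deeper


def lift_row(row, extra_depth):
    """Re-express a sub-segment row as seen from a parent segment that
    wraps the sub-segment in `extra_depth` additional loops."""
    head, loops = list(row[:-LOOP_COLS]), list(row[-LOOP_COLS:])
    if extra_depth == 0:
        return head + loops
    total = row[0] + row[1]        # assumed: each occurrence is a use or a def
    outside = total - sum(loops)   # occurrences outside any loop locally
    by_depth = [outside] + loops   # index d holds the count at local depth d
    lifted = [0] * LOOP_COLS
    for local_depth, n in enumerate(by_depth):
        new_depth = local_depth + extra_depth  # depth seen from the parent
        lifted[min(new_depth, LOOP_COLS) - 1] += n
    return head + lifted


def merge(parts):
    """parts: list of (names, matrix, extra_depth) for each sub-segment."""
    merged = {}
    for names, matrix, extra_depth in parts:
        for name, row in zip(names, matrix):
            lifted = lift_row(row, extra_depth)
            if name in merged:
                merged[name] = [a + b for a, b in zip(merged[name], lifted)]
            else:
                merged[name] = lifted  # a newly declared variable adds a row
    return list(merged), list(merged.values())


# Sub-segment A sits at top level and defines s once.
part_a = (["s"], [[0, 1, 0, 0, 0, 0]], 0)
# Sub-segment B is a loop body: locally it contains no loop, but the parent
# places it inside one loop. It uses and defines s and declares t.
part_b = (["s", "t"], [[1, 1, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]], 1)
names, matrix = merge([part_a, part_b])
```

Plain matrix addition would leave the loop columns of `s` at zero; after the depth shift, the two occurrences of `s` in the loop body are attributed to loop depth 1, and the merged matrix gains a row for `t`.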

Specifically, FIG. 3 is a schematic diagram of calculating the degree of matching by comparing two feature matrices. In the figure, two matrices CM1 and CM2 are matched, and this embodiment uses a two-step matching design: the rows are first sorted by weight, and then the cosine of the angle between vectors is calculated.

First, the rows are sorted by weight. For example, if the first column carries the most important weight, the rows of the matrix are reordered by it. In FIG. 3 the first column of CM1 is 3, 1, 4, 8, so after sorting it is 1, 3, 4, 8 or 8, 4, 3, 1. Likewise, the first column of CM2 after sorting is 3, 4, 5, 8 or 8, 5, 4, 3.

After the sorting of the first step is finished, row-by-row comparison begins. To improve efficiency and save resources, a given row is usually compared only with the rows at the same and adjacent positions in the other matrix. For example, the second row of CM2 is compared with the first, second and third rows of CM1.
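The sorting and windowed comparison just described might look like the sketch below. It assumes that the first column is the weight column, that rows are sorted in descending order, and that a window of one row on either side is used; `sort_rows` and `candidate_pairs` are hypothetical helpers. The first-column values follow the text, and all other numbers are made up for illustration.

```python
def sort_rows(matrix, key_col=0):
    """Sort rows in descending order of the weight column (column 0 assumed)."""
    return sorted(matrix, key=lambda row: row[key_col], reverse=True)


def candidate_pairs(rows_a, rows_b, window=1):
    """Yield (i, j): row j of B is paired with rows j-window..j+window of A."""
    for j in range(rows_b):
        for i in range(max(0, j - window), min(rows_a, j + window + 1)):
            yield i, j


# First columns follow the text (3, 1, 4, 8 for CM1; 3, 4, 5, 8 in some
# order for CM2). Every other number is invented for this example.
cm1 = sort_rows([[3, 0, 2], [1, 1, 0], [4, 2, 1], [8, 3, 3]])
cm2 = sort_rows([[5, 1, 1], [3, 0, 2], [8, 2, 3], [4, 2, 0]])
pairs = list(candidate_pairs(len(cm1), len(cm2)))
```

With four rows on each side, the window limits the work to 10 row comparisons instead of 16, and the second row of CM2 (index 1) is paired with the first three rows of CM1 (indices 0, 1 and 2), matching the example in the text.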

As described above, each row corresponds to a variable, so this operation is essentially a one-by-one comparison of the variables of the two code segments. Hereinafter the two rows being compared are abbreviated as V1 and V2. The length of each variable vector V may be calculated as the square root of the sum of the squares of its data. For example, if the first row of CM1 is 3, 0, 2, ..., the sum of squares is 9 + 0 + 4 + ..., and the square root of this sum is the length of V. The length of a row in CM2 is calculated in the same way.

Subsequently, V1 and V2 are matched by the cosine of the angle between them, i.e., the COS value of V1 and V2 in FIG. 3. Because word frequencies are non-negative, this value is between 0 and 1. When it is close to 0, the directions of the two vectors are far apart and the similarity is low; when it is close to 1, the two vectors point in nearly the same direction and the similarity is high.

To further improve matching accuracy, this scheme also adds a length scaling coefficient K. That is, the COS value of V1 and V2 is not the final matching result; it is multiplied by K, which is calculated as the shorter of the two vector lengths divided by the longer. For example, if the length of V1 is 40 and the length of V2 is 80, the COS value is multiplied by 0.5 to obtain the final matching value.

The system can preset a matching value interval. When the calculated matching value falls within the interval, the system judges a match; otherwise it judges no match.
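The length-scaled cosine value can be sketched as follows. The function names `length` and `match_value` are hypothetical, and returning 0.0 for an all-zero row is an assumption, since the text does not address that case.

```python
import math


def length(v):
    """Vector length: square root of the sum of squares of the row."""
    return math.sqrt(sum(x * x for x in v))


def match_value(v1, v2):
    """Cosine of the angle between v1 and v2, multiplied by K, where
    K = shorter length / longer length."""
    len1, len2 = length(v1), length(v2)
    if len1 == 0 or len2 == 0:
        return 0.0  # assumption: an all-zero row matches nothing
    cos = sum(a * b for a, b in zip(v1, v2)) / (len1 * len2)
    k = min(len1, len2) / max(len1, len2)
    return cos * k


# Same direction, but one vector is twice as long: cos = 1 and K = 0.5.
value = match_value([3, 0, 4], [6, 0, 8])
```

Here the two rows point in exactly the same direction, so plain cosine similarity would report a perfect match; the coefficient K lowers the result to 0.5 because one variable occurs twice as often as the other, mirroring the 40 versus 80 example in the text.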

Claims

1. A context-based value feature extraction system, comprising: a variable extraction module for automatically extracting variables from code; a variable context counting module for counting the occurrences of each variable in different contexts; a feature matrix generation module for generating feature matrices, in which each row corresponds to a variable and each element in the row represents the word frequency of that variable in a particular context; and a feature comparison module for comparing the similarity of two feature matrices.

2. The context-based value feature extraction system of claim 1, wherein: the variable context counting module counts definition and use contexts; a variable is defined when it is located on the left side of "=", "+=" or "-=", and used when it is located on the right side of "=", "+=" or "-=".

3. The context-based value feature extraction system of claim 1, wherein: the variable context counting module counts common-statement contexts, where a common-statement context is a conditional statement and/or a calculation statement and/or an array access and/or a constant assignment.

4. The context-based value feature extraction system of claim 1, wherein: the variable context counting module counts nested-statement contexts, where the nested statement is an outermost loop, a second-outermost loop, a third-level loop, or a deeper inner loop.

5. The context-based value feature extraction system according to claim 1, 2, 3 or 4, wherein: the system further comprises a matrix merging module, and after the feature matrix generation module generates the feature matrices of a plurality of sub-code segments, the matrix merging module merges them into the feature matrix of the original code segment.

6. The context-based value feature extraction system according to claim 1, 2, 3 or 4, wherein: when comparing the similarity of the two feature matrices, the feature comparison module uses a row-by-row comparison scan, in which a row of feature matrix A is compared against the rows of feature matrix B.

7. The context-based value feature extraction system of claim 6, wherein: the row-by-row comparison scan calculates the cosine of the angle between vectors.

8. The context-based value feature extraction system of claim 7, wherein: after the cosine value is calculated, it is multiplied by a length scaling coefficient, which is the ratio of the length of the shorter vector to the length of the longer vector.

9. A context-based value feature extraction method using the context-based value feature extraction system according to any one of claims 1 to 8, characterized by comprising the following steps: S01, a variable identification step, in which the system identifies variables in the code; S02, a frequency statistics step, in which the system counts the word frequency of each variable in different contexts; S03, a matrix generation step, in which the generated data matrix comprises n rows, each row corresponding to one variable and containing m data elements, each of which is the number of times the variable appears in the corresponding context; and S04, a matrix matching step, in which the generated feature matrices are compared pairwise, a similarity value is calculated, and the similarity value is compared with a preset similarity interval.

10. The context-based value feature extraction method of claim 9, wherein: after the matrix generation step S03, the method further comprises a matrix merging step, in which the system merges the feature matrices corresponding to the sub-code segments into the feature matrix of the original code segment.