CN110795530A - Context-based value feature extraction system and method - Google Patents

Context-based value feature extraction system and method

Info

Publication number
CN110795530A
Authority
CN
China
Prior art keywords
context
variable
matrix
feature extraction
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910857258.6A
Other languages
Chinese (zh)
Other versions
CN110795530B (en)
Inventor
程华
袁洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Jiangnan Computing Technology Institute filed Critical Wuxi Jiangnan Computing Technology Institute
Priority to CN201910857258.6A priority Critical patent/CN110795530B/en
Publication of CN110795530A publication Critical patent/CN110795530A/en
Application granted granted Critical
Publication of CN110795530B publication Critical patent/CN110795530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Pure & Applied Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Mathematics (AREA)
  • Operations Research (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)

Abstract

The invention relates to the technical field of computer applications, and in particular to a context-based value feature extraction system and method. The system comprises: a variable extraction module for automatically extracting variables from code; a variable context quantity counting module for counting how often each variable occurs in each context; a feature matrix generation module for generating feature matrices, in which each row corresponds to one variable and each element of the row is the word frequency (occurrence count) of that variable in one context; and a feature comparison module. The invention aims to provide a context-based value feature extraction system and method that combine value features with context sensitivity, retaining the efficiency of value-feature matching while improving matching precision by taking context-sensitive information into account.

Description

Context-based value feature extraction system and method
Technical Field
The invention relates to the technical field of computer application, in particular to a value feature extraction system and a value feature extraction method based on context.
Background
Code similarity detection is currently used mainly to detect code plagiarism and is an important task in software development and maintenance. It is widely applied in fields such as software copyright and intellectual property protection, source code plagiarism detection, software component library queries, and program understanding. It helps determine whether software code is copied or original, and is of practical significance for protecting software copyright.
For example, Chinese patent document No. CN109918218 discloses a code similarity detection method and system based on a relation variable graph, comprising an identifier determination module, a similarity calculation module and the like, which are used to determine matching query results for code similarity between different documents.
Other solutions for detecting code plagiarism exist in China and abroad, such as the MOSS system of Stanford University in the United States, the SIM system of Wichita State University, and the YAP3 system of the University of Sydney in Australia. However, both in the cited document and in the other prior art, code segment features must be extracted from the document under test and from the comparison document to serve as the data source for the subsequent comparison step. This extraction usually takes one of two approaches: value feature extraction, or tree feature or graph feature extraction.
In the first approach, value features are computed at the scale of the code segment. Because value features contain no context information, the detection precision is not high.
In the second approach, tree features and graph features retain all the context information, so the detection precision is high, but the complexity is also high: the calculation is time-consuming and demands substantial system resources.
Disclosure of Invention
The invention aims to provide a context-based value feature extraction system and method that combine value features with context sensitivity, retaining the efficiency of value-feature matching while improving matching precision by taking context-sensitive information into account.
The technical purpose of the invention is realized by the following technical scheme:
A context-based value feature extraction system, comprising:
a variable extraction module for automatically extracting variables from code;
a variable context quantity counting module for counting how often each variable occurs in each context;
a feature matrix generation module for generating feature matrices, in which each row corresponds to one variable and each element of the row is the word frequency of that variable in one context;
and a feature comparison module for comparing the similarity of two feature matrices.
Preferably, the variable context quantity counting module counts definition (value-assignment) contexts and use contexts: a variable is defined when it is on the left side of "=", "+=" or "-=", and used when it is on the right side of "=", "+=" or "-=".
Preferably, the variable context quantity counting module counts common-statement contexts, a common-statement context being a conditional statement and/or a calculation statement and/or an array access and/or a constant assignment.
Preferably, the variable context quantity counting module counts nested-statement contexts, a nested statement being an outermost loop, a second-outermost loop, a third-level loop, or a deeper inner loop.
Preferably, the invention further comprises a matrix merging module: after the feature matrix generation module generates the feature matrices of a plurality of sub-code segments, the matrix merging module merges them into the feature matrix of the original code segment.
Preferably, when comparing the similarity of two feature matrices, the feature comparison module scans row by row, comparing one row of feature matrix A with the rows of feature matrix B.
Preferably, the row-by-row comparison computes the cosine of the angle between the row vectors.
Preferably, after the cosine value is calculated, it is multiplied by a length proportion coefficient, which is the ratio of the length of the shorter vector to the length of the longer vector.
A context-based value feature extraction method, using the context-based value feature extraction system described above, comprises the following steps:
S01, a variable identification step:
the system identifies the variables in the code;
S02, a frequency statistics step:
the system counts the word frequency with which each variable appears in different contexts;
S03, a matrix generation step:
the generated data matrix comprises n rows, each row corresponding to one variable and comprising m data elements, each of which is the number of times the variable appears in the corresponding context;
S04, a matrix matching step:
the generated feature matrices are compared pairwise, a similarity value is calculated, and the similarity value is compared with a preset similarity interval.
Preferably, the method further comprises a matrix merging step after the matrix generation step S03, in which the system merges the feature matrices corresponding to the sub-code segments into the feature matrix of the original code segment.
In conclusion, the invention has the following beneficial effects:
1. The method extracts and matches value features while including context word-frequency information, so that matching balances accuracy and efficiency.
2. After the feature matrices are formed, they are compared row by row, so the comparison targets the degree of match between individual variables.
3. The matrices of the sub-code segments are merged rather than simply added.
4. Matching is based on cosine similarity, and a length proportion coefficient is introduced to further increase the matching accuracy.
Drawings
FIG. 1 is a schematic diagram of the code of embodiment 1;
FIG. 2 is a schematic diagram of a generated feature matrix;
FIG. 3 is a schematic diagram of matrix matching.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
This embodiment only explains the present invention and does not limit it. After reading this specification, those skilled in the art may modify the embodiment as needed without inventive contribution, and all such modifications are protected by patent law within the scope of the claims of the present invention.
Embodiment 1 is a feature extraction method of a context-based value feature extraction system. The first step is S01, the variable identification step. FIG. 1 shows a piece of code. Taking that code as an example, s, pi, i, n, r and b are its variables, and the system can identify them automatically. Automatic identification of variables is a conventional technique in prior-art software programming and is not described in detail here. After the variables are identified, the key steps S02 and S03 follow. In S02, the frequency statistics step, the system calculates the word frequency with which each variable appears in different contexts. The contexts considered here usually include the following three cases:
(1) Definition and use. Count the number of times a variable is defined (assigned a value) and used throughout the code segment. A definition is on the left side of "=" (or "+=", "-=", etc.), and a use is on the right side of such an operator.
(2) Common statements. Count the number of times a variable occurs in common statements. The common-statement contexts of interest include conditional statements, calculation (add, subtract, multiply, divide) statements, array accesses (the variable as a subscript), constant assignments, and the like.
(3) Nested statements. Count the number of times a variable appears in nested statements, such as an outermost loop, a second-outermost loop, a third-level loop, or a deeper inner loop.
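The definition and use counting of case (1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name count_def_use, the regular expressions, and the sample statements are hypothetical, and only simple single-line assignments of the form `name = expr;` are handled. Common-statement and loop-depth contexts are left out of this sketch.

```python
import re
from collections import Counter

# Hypothetical pattern: "<name> <op> <expr>;" where <op> is =, += or -=.
# The (?!=) lookahead keeps "==" comparisons from being read as assignments.
ASSIGN = re.compile(r"^\s*([A-Za-z_]\w*)\s*(=|\+=|-=)(?!=)\s*(.+?);?\s*$")
IDENT = re.compile(r"[A-Za-z_]\w*")


def count_def_use(statements, variables):
    """Count definitions (left of the operator) and uses (right of it)."""
    counts = {v: Counter() for v in variables}
    for stmt in statements:
        match = ASSIGN.match(stmt)
        if not match:
            continue  # not a simple assignment; ignored in this sketch
        target, _op, rhs = match.groups()
        if target in counts:
            counts[target]["def"] += 1
        for name in IDENT.findall(rhs):
            if name in counts:
                counts[name]["use"] += 1
    return counts


# Illustrative statements (not the code of FIG. 1, which is not reproduced here).
statements = ["s = 0;", "s += pi * r * r;", "b = s - n;"]
counts = count_def_use(statements, ["s", "pi", "r", "b", "n"])
```

In this sketch a compound assignment such as `s += ...` counts only as a definition of s, following the left-side/right-side rule stated above.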
After the statistics are collected, the process proceeds to S03, the matrix generation step, in which the system generates a two-dimensional feature matrix. The matrix contains a plurality of rows, each corresponding to a separate variable. FIG. 2 shows a feature matrix generated from the statistics of a code segment: for example, the first row is the variable s, the second row is the variable pi, and the third row is the variable i. Each row contains a plurality of data elements, each of which is the word frequency of the variable in one context.
If the first two data elements in the first row of FIG. 2 are 4 and 2, this indicates that the variable s is used 4 times and defined 2 times.
The number and type of data elements in each row may vary between embodiments, depending on the choices of the software designer. For example, in this embodiment, the 4 and 2 in the first row, corresponding to s, are the word frequencies of use and definition, the following 1, 0 and 1 are word frequencies in different common statements, and the following 2 and 0 are word frequencies in nested loop statements.
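As a rough illustration of the matrix generation step, the sketch below turns per-variable context counts into rows of a feature matrix. The column names in CONTEXTS and the function name build_feature_matrix are assumptions made for this example; the text only states that the seven values for s are use, definition, three common-statement contexts and two loop contexts. The counts for pi are invented for illustration.

```python
# Hypothetical column order: use, definition, three common-statement
# contexts, then two nested-loop contexts (seven columns, as for row s).
CONTEXTS = ["use", "def", "cond", "calc", "index", "loop1", "loop2"]


def build_feature_matrix(counts, contexts=CONTEXTS):
    """Return (variables, matrix): one row per variable, one column per context."""
    variables = list(counts)
    matrix = [[counts[v].get(c, 0) for c in contexts] for v in variables]
    return variables, matrix


counts = {
    "s": {"use": 4, "def": 2, "cond": 1, "index": 1, "loop1": 2},
    "pi": {"use": 1, "def": 1},  # illustrative values only
}
variables, matrix = build_feature_matrix(counts)
```

Contexts in which a variable never occurs are filled with 0, so every row has the same width m.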
Generating the feature matrix is relatively efficient. If its size is n × m, the code segment contains n variables and m contexts are analyzed. The matrix can be calculated in O(L + knm) time, where L is the length of the code segment and k is the number of sub-code segments in the code segment.
In this process, a matrix merging step is also often included. Specifically, in the process of calculating the feature matrix of a code segment, the feature matrix of the sub-code segments is calculated first, and then the feature matrices of the sub-code segments are combined to obtain the feature matrix of the original code segment. The process of combining is not simply adding the matrices because the context like the number of loop layers needs to be recalculated at the time of combining. This is because the number of loop layers seen from the angle of the sub-code segment may not be the same as the number of loop layers seen from the angle of the original code segment. In addition, if a variable is newly declared in the subcode segment, the size of the feature matrix after merging will increase accordingly.
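One possible way to merge a sub-code segment's matrix into its parent is sketched below. The text does not specify the merge rule in detail, so everything here is an assumption: the function name merge_matrices, the dict-of-rows representation, and in particular the use of a "depth 0" (not inside any loop) column so that occurrences can be re-based when the sub-code segment sits inside loops of the parent.

```python
def merge_matrices(parent, child, depth_offset, loop_cols, n_cols):
    """Merge child rows into a copy of parent, re-basing loop-depth columns.

    loop_cols lists the column indices for loop depth 0, 1, 2, ... (assumed
    layout); depth_offset is how many parent loops enclose the child segment.
    """
    merged = {var: list(row) for var, row in parent.items()}
    for var, row in child.items():
        # A variable first declared in the child adds a new row.
        target = merged.setdefault(var, [0] * n_cols)
        for col, value in enumerate(row):
            if col in loop_cols:
                level = loop_cols.index(col) + depth_offset
                # Depths beyond the last bucket are clamped to the innermost one.
                col = loop_cols[min(level, len(loop_cols) - 1)]
            target[col] += value
    return merged


# Assumed columns: use, def, depth0, depth1, depth2, depth3+
parent = {"s": [1, 1, 2, 0, 0, 0]}
child = {"s": [2, 1, 1, 2, 0, 0], "i": [3, 1, 0, 4, 0, 0]}
merged = merge_matrices(parent, child, depth_offset=1,
                        loop_cols=[2, 3, 4, 5], n_cols=6)
```

Non-loop columns are simply added, while loop-depth counts shift by depth_offset, which is why the merge differs from plain matrix addition.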
After the merging is completed, the next processing step, S04, the matrix matching step, begins. After the above steps, a corresponding feature matrix has been formed both for the code in the database and for the code to be compared, and the two are now compared to judge whether they are similar.
FIG. 3 is a schematic diagram of calculating the matching degree by comparing two feature matrices. In the figure, two matrices CM1 and CM2 are matched. This embodiment uses a two-step matching design: the rows are first sorted by weight, and then the cosine of the angle between vectors is calculated.
First, as shown in FIG. 3, the rows are sorted by weight. For example, if the data in the first column carries the most important weight, the rows of the matrix are reordered by that column. The first-column values of CM1 in FIG. 3 are 3, 1, 4, 8, so the sorted order is 1, 3, 4, 8 or 8, 4, 3, 1. CM2 is sorted in the same way, giving a first-column order of 3, 4, 5, 8 or 8, 5, 4, 3.
After the first-step sorting is finished, the row-by-row comparison begins. To improve efficiency and save resources, a row is usually compared only with the rows at the same position and at adjacent positions in the other matrix. For example, the second row of CM2 is compared with the first, second and third rows of CM1.
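The sorting and neighbour-limited comparison can be sketched as follows. The function names sort_rows and candidate_pairs and the window parameter are assumptions for illustration; only the first-column values 3, 1, 4, 8 come from the text, and the remaining matrix entries are made up.

```python
def sort_rows(matrix, key_col=0):
    """Reorder rows by the weight column, largest first."""
    return sorted(matrix, key=lambda row: row[key_col], reverse=True)


def candidate_pairs(n_a, n_b, window=1):
    """Yield (i, j): row i of one matrix against rows j of the other
    that sit at the same position or within `window` rows of it."""
    for i in range(n_a):
        for j in range(max(0, i - window), min(n_b, i + window + 1)):
            yield i, j


cm1 = sort_rows([[3, 0, 2], [1, 1, 0], [4, 2, 2], [8, 0, 1]])
pairs = list(candidate_pairs(len(cm1), 4))
```

With a window of one row, the second row is paired with the first, second and third rows of the other matrix, and the number of comparisons grows linearly with the row count instead of quadratically.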
As described above, each row corresponds to a variable, so this operation is essentially a one-by-one comparison of the variables of the two pieces of code. In the following, the two rows being compared are abbreviated V1 and V2. The length of each variable vector V can be calculated as the square root of the sum of the squares of its data. For example, for the first row 3, 0, 2, ... of CM1, the length is $\sqrt{3^2 + 0^2 + 2^2 + \cdots} = \sqrt{9 + 0 + 4 + \cdots}$. The length of a row of CM2 is calculated in the same way.
Subsequently, V1 and V2 are matched by the cosine of the angle between them, i.e. the COS value of V1 and V2 in FIG. 3. This value is between 0 and 1. When it is close to 0, the directions of the two vectors are far apart and the similarity is low. When it is close to 1, the two vectors point in nearly the same direction and the similarity is high.
To further improve the matching accuracy, this scheme also introduces a length proportion coefficient K. That is, the COS value of V1 and V2 is not the final matching result; it is multiplied by K, which is calculated as the shorter of the two vector lengths divided by the longer. If the length of V1 is 40 and the length of V2 is 80, the COS value is multiplied by 0.5 to obtain the final matching value.
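The row comparison described above, cosine of the angle multiplied by the length proportion coefficient K, can be written as a short Python sketch. The function name row_similarity, the zero-vector handling, and the clamp against floating-point rounding are assumptions added for this example.

```python
import math


def row_similarity(v1, v2):
    """Cosine of the angle between v1 and v2, scaled by K = shorter / longer."""
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0 or n2 == 0:
        return 0.0  # assumption: an all-zero row matches nothing
    cos = sum(a * b for a, b in zip(v1, v2)) / (n1 * n2)
    cos = min(1.0, cos)  # guard against rounding slightly above 1
    k = min(n1, n2) / max(n1, n2)
    return cos * k


# Same direction, lengths 5 and 10: cosine is 1 and K is 0.5.
score = row_similarity([3, 0, 4], [6, 0, 8])
```

Without K, the two rows in the example would score a perfect 1 even though one variable occurs twice as often as the other; the coefficient lets the difference in magnitude lower the score.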
The system can preset a matching-value interval. When the calculated matching value falls within the interval, the system judges the code to be a match; if it falls outside the interval, the system judges that there is no match.
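The final decision can be sketched as below. The text does not say how per-row matching values are combined into one value for the whole matrix, nor what interval is used, so the mean in matrix_score and the default interval (0.8, 1.0) are purely illustrative assumptions, as are both function names.

```python
def matrix_score(row_scores):
    """Combine per-row matching values into one score (mean; an assumption)."""
    return sum(row_scores) / len(row_scores) if row_scores else 0.0


def is_match(score, interval=(0.8, 1.0)):
    """Report a match when the score lies inside the preset interval."""
    low, high = interval
    return low <= score <= high


score = matrix_score([1.0, 0.5])
matched = is_match(score)
```

Keeping the interval as a parameter lets the threshold be tuned per code base without touching the comparison logic.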

Claims (10)

1. A context-based value feature extraction system, comprising: a variable extraction module for automatically extracting variables from code; a variable context quantity counting module for counting quantity information of each variable in different contexts; a feature matrix generation module for generating feature matrices, wherein each row of a feature matrix corresponds to one variable and each element of the row represents the word frequency of that variable in a different context; and a feature comparison module for comparing the similarity of two feature matrices.
2. The context-based value feature extraction system of claim 1, wherein: in the counting process of the variable context quantity counting module, definition (value-assignment) contexts and use contexts are counted; a variable is defined when it is located on the left side of "=", "+=" or "-=", and used when it is located on the right side of "=", "-=" or "+=".
3. The context-based value feature extraction system of claim 1, wherein: in the counting process of the variable context quantity counting module, common-statement contexts are counted, a common-statement context being a conditional statement and/or a calculation statement and/or an array access and/or a constant assignment.
4. The context-based value feature extraction system of claim 1, wherein: in the counting process of the variable context quantity counting module, nested-statement contexts are counted, a nested statement being an outermost loop, a second-outermost loop, a third-level loop, or a deeper inner loop.
5. The context-based value feature extraction system according to claim 1, 2, 3 or 4, further comprising a matrix merging module, wherein after the feature matrix generation module generates the feature matrices of a plurality of sub-code segments, the matrix merging module merges them into the feature matrix of the original code segment.
6. The context-based value feature extraction system according to claim 1, 2, 3 or 4, wherein: when comparing the similarity of two feature matrices, the feature comparison module scans row by row, comparing one row of data of feature matrix A with the rows of feature matrix B.
7. The context-based value feature extraction system of claim 6, wherein: the row-by-row comparison scan computes the cosine of the angle between vectors.
8. The context-based value feature extraction system of claim 7, wherein: after the cosine value is calculated, it is multiplied by a length proportion coefficient; the length proportion coefficient is the ratio of the length of the shorter vector to the length of the longer vector.
9. A context-based value feature extraction method using a context-based value feature extraction system according to any one of claims 1 to 8, characterized by comprising the following steps: S01, a variable identification step, wherein the system identifies variables in the code; S02, a frequency statistics step, wherein the system counts the word frequency of each variable in different contexts; S03, a matrix generation step, wherein the generated data matrix comprises n rows, each row corresponds to one variable, each row comprises m data elements, and each data element is the number of times the variable appears in the corresponding context; and S04, a matrix matching step, wherein the generated feature matrices are compared pairwise, a similarity value is calculated, and the similarity value is compared with a preset similarity interval.
10. The context-based value feature extraction method of claim 9, wherein: after the matrix generation step S03, the method further comprises a matrix merging step, in which the system merges the feature matrices corresponding to the sub-code segments into the feature matrix of the original code segment.
CN201910857258.6A 2019-09-11 2019-09-11 Context-based value feature extraction system and method Active CN110795530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910857258.6A CN110795530B (en) 2019-09-11 2019-09-11 Context-based value feature extraction system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910857258.6A CN110795530B (en) 2019-09-11 2019-09-11 Context-based value feature extraction system and method

Publications (2)

Publication Number Publication Date
CN110795530A (en) 2020-02-14
CN110795530B (en) 2022-10-04

Family

ID=69427128

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910857258.6A Active CN110795530B (en) 2019-09-11 2019-09-11 Context-based value feature extraction system and method

Country Status (1)

Country Link
CN (1) CN110795530B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473104A (en) * 2013-09-24 2013-12-25 北京大学 Method for discriminating re-package of application based on keyword context frequency matrix
CN105824756A (en) * 2016-03-17 2016-08-03 南京大学 Automatic detection method and system of outmoded demand on basis of code dependency relationship
CN108345468A (en) * 2018-01-29 2018-07-31 华侨大学 Programming language code duplicate checking method based on tree and sequence similarity
US20190065443A1 (en) * 2017-08-29 2019-02-28 Fujitsu Limited Matrix generation program, matrix generation apparatus, and plagiarism detection program
CN109634594A (en) * 2018-11-05 2019-04-16 南京航空航天大学 A kind of code snippet recommended method considering code statement order information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
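Yang Chao (杨超): "A hybrid program code plagiarism detection method based on multiple techniques" (基于多种技术的混合式程序代码抄袭检测方法), Computer Engineering and Applications (《计算机工程与应用》) *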

Also Published As

Publication number Publication date
CN110795530B (en) 2022-10-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant