CN114201406B

CN114201406B - Code detection method, system, equipment and storage medium based on open source component

Info

Publication number: CN114201406B
Application number: CN202111543638.6A
Authority: CN
Inventors: 高思雨; 闻剑峰; 殷铭
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2021-12-16
Filing date: 2021-12-16
Publication date: 2024-02-02
Anticipated expiration: 2041-12-16
Also published as: CN114201406A

Abstract

The invention provides a code detection method, system, device and storage medium based on open source components. The method comprises: acquiring a first source code of a code to be tested; segmenting the first source code of the code to be tested and a second source code of an open source component set, respectively, to obtain a first code segment and a second code segment, and converting the first code segment and the second code segment into abstract syntax trees; converting each statement tree of each abstract syntax tree into a statement vector through a traversal algorithm, and extracting the statement vectors of the statement trees in a preset order to obtain a group of ordered statement vectors corresponding to each code segment; generating a representative vector from each group of statement vectors through node encoding; and obtaining a similarity from the representative vector corresponding to the first code segment and the representative vector corresponding to the second code segment. By generating the similarity between the open source components and the code to be tested through a discrimination model, the invention automatically detects open source components and open source fragments and improves network security.

Description

Code detection method, system, equipment and storage medium based on open source component

Technical Field

The present invention relates to the field of code detection, and in particular, to a method, a system, an apparatus, and a storage medium for detecting a code based on an open source component.

Background

Open source code, also described as source code disclosure, refers to a mode of releasing software. Ordinary software is distributed only as a compiled binary executable, and typically only the author or copyright owner holds the original code of the program. Some software authors publish the original code, referred to here as "source code disclosure", but this does not necessarily satisfy the definition and conditions of "open source code", because the author may attach restrictions to the disclosure, such as limiting who may read the original code or limiting derivative works. Open source code is a foundation for building software applications, and an enterprise that uses it without an effective way to track and manage it is exposed to security, license compliance and code quality risks. An effective tool for managing open source code is therefore critical for resisting attacks, protecting sensitive data and gaining customer trust.

Current methods for detecting open source components in source code mainly include AST subtree matching, PDG matching and code hash searching, and these methods suffer from low detection efficiency or low accuracy. An abstract syntax tree (Abstract Syntax Tree, AST) is a tree representation of the abstract syntactic structure of source code, in which each node represents a construct in the source code. The syntax is "abstract" because the tree does not represent every detail of the concrete syntax; for example, nested brackets are implicit in the tree structure and are not presented as nodes. A program dependence graph (Program Dependence Graph, PDG) is a graphical representation of a program: a labeled directed multigraph that can represent the control dependencies and data dependencies of the program, and one of the graph models used for source code. Graph models of source code include the control flow graph (Control Flow Graph), the control dependence graph (Control Dependence Graph), the data dependence graph (Data Dependence Graph) and the program dependence graph (Program Dependence Graph).
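As a minimal illustration of why such a tree is "abstract", the following sketch uses Python's standard `ast` module (chosen here only for convenience; the invention is not tied to Python) to show that redundant brackets leave no node in the tree, while brackets that change grouping change the shape of the tree. The helper name `ast_shape` is hypothetical.

```python
import ast


def ast_shape(expression: str) -> str:
    """Return a textual dump of the AST of one expression (positions omitted)."""
    return ast.dump(ast.parse(expression, mode="eval"))


plain = ast_shape("a + b * c")
bracketed = ast_shape("(a + ((b * c)))")   # redundant brackets only
regrouped = ast_shape("(a + b) * c")       # brackets that change grouping

# Redundant brackets are implicit in the tree structure: the dumps are identical.
same_tree = plain == bracketed
# Brackets that change grouping are reflected in the nesting of the nodes.
different_tree = plain != regrouped
```

Here both `same_tree` and `different_tree` evaluate to true: the brackets themselves never appear as nodes, only their effect on nesting does.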

In view of the above, the invention provides a code detection method, a system, a device and a storage medium based on an open source component.

It should be noted that the information disclosed in the foregoing background section is only for enhancement of understanding of the background of the invention and thus may include information that does not form the prior art that is already known to those of ordinary skill in the art.

Disclosure of Invention

Aiming at the problems in the prior art, the invention aims to provide a code detection method, a system, equipment and a storage medium based on an open source component, which overcome the difficulty in the prior art, can generate the similarity between the open source component and a code to be detected through a discrimination model, automatically realize the detection of the open source component and the detection of an open source fragment, and improve the network security.

The embodiment of the invention provides a code detection method based on an open source component, which comprises the following steps:

acquiring a first source code of a code to be tested;

segmenting the first source code of the code to be tested and a second source code of an open source component set, respectively, to obtain a first code segment and a second code segment, and converting the first code segment and the second code segment into abstract syntax trees, respectively;

converting each statement tree of each abstract syntax tree into a statement vector through a traversal algorithm, and extracting the statement vectors of the statement trees in a preset order to obtain a group of ordered statement vectors corresponding to each code segment;

generating a representative vector from each group of statement vectors through node encoding; and

obtaining the similarity according to the representative vector corresponding to the first code segment and the representative vector corresponding to the second code segment.

Preferably, acquiring the first source code of the code to be tested includes:

decompiling the code to be tested to obtain the first source code;

preprocessing the first source code;

obtaining a first component code pattern from the character strings, files and directory information of each code file in the first source code;

obtaining the similarity between the first component code pattern of the first source code and second component code patterns preset in an open source component library;

and collecting the open source components corresponding to the second component code patterns whose similarity is greater than or equal to a preset threshold, to obtain the open source component set.
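The screening described above can be sketched as follows. The text does not specify how a component code pattern is represented or how pattern similarity is measured, so this sketch assumes a set of tagged features and Jaccard similarity; the `file:`, `dir:` and `str:` prefixes, the function names and the sample library are all hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, List, Set

STRING_LITERAL = re.compile(r'"([^"\\]*)"')  # simple double-quoted literals only


def code_pattern(files: Dict[str, str]) -> Set[str]:
    """Build a pattern from string literals, file names and directory names."""
    pattern: Set[str] = set()
    for path, text in files.items():
        p = PurePosixPath(path)
        pattern.add("file:" + p.name)
        pattern.update("dir:" + part for part in p.parent.parts)
        pattern.update("str:" + s for s in STRING_LITERAL.findall(text))
    return pattern


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity (an assumed measure, not one named in the text)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def screen_components(first: Set[str], library: Dict[str, Set[str]],
                      threshold: float) -> List[str]:
    """Keep components whose preset pattern is similar enough to the first pattern."""
    return sorted(name for name, second in library.items()
                  if jaccard(first, second) >= threshold)


first_pattern = code_pattern({"src/util/Log.java": 'System.out.println("log ready");'})
library = {
    "liblog": {"file:Log.java", "dir:src", "dir:util", "str:log ready"},
    "libzip": {"file:Zip.java", "dir:zip"},
}
component_set = screen_components(first_pattern, library, threshold=0.5)
```

With these sample inputs only `liblog` survives the screening, so the later, more expensive vector comparison would run against one component instead of the whole library.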

Preferably, preprocessing the first source code includes:

deleting preset invalid text from the first source code and the second source code, wherein the preset invalid text includes at least package declarations, import statements, comments and blank lines;

unifying the access keywords of functions to a preset keyword; and

normalizing the source code to lowercase.
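A minimal sketch of these preprocessing steps is given below, assuming Java-like source text and regular-expression matching. The choice of `public` as the preset keyword follows the implementation described later in the text; the function name `preprocess` is hypothetical, and comment markers inside string literals are not handled by this simplified version.

```python
import re

BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
LINE_COMMENT = re.compile(r"//[^\n]*")
PACKAGE_OR_IMPORT = re.compile(
    r"^[ \t]*(?:package|import)\b[^;\n]*;[ \t]*$", re.MULTILINE)
ACCESS_KEYWORD = re.compile(r"\b(?:private|protected)\b")


def preprocess(source: str) -> str:
    """Remove invalid text, unify access keywords, and lowercase the source."""
    text = BLOCK_COMMENT.sub("", source)         # delete block comments
    text = LINE_COMMENT.sub("", text)            # delete line comments
    text = PACKAGE_OR_IMPORT.sub("", text)       # delete package/import statements
    text = ACCESS_KEYWORD.sub("public", text)    # unify access keywords
    text = text.lower()                          # normalize to lowercase
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line.strip())  # drop blank lines


sample = """package com.example;

import java.util.List;

/* Helper class */
class Greeter {
    // say hello
    private String Greet() { return "Hi"; }
}
"""
cleaned = preprocess(sample)
```

For this sample, `cleaned` keeps three lines: the class header, the method (now `public` and lowercase) and the closing brace.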

Preferably, segmenting the first source code of the code to be tested and the second source code of the open source component set to obtain a first code segment and a second code segment, and converting the first code segment and the second code segment into abstract syntax trees, respectively, includes:

segmenting the first source code of the code to be tested and the second source code of the open source component set, respectively, to obtain the first code segment and the second code segment;

decomposing the code of the first code segment and of the second code segment in order, at the granularity of natural statements, to form respective statement tree sequences;

and generating an abstract syntax tree based on the statement tree sequence corresponding to the same code segment.

Preferably, decomposing the code of the first code segment and of the second code segment at the granularity of natural statements to form respective statement tree sequences includes:

when the constructor encounters a tree node representing a programming statement in the first code segment and the second code segment, the constructor generates a statement tree rooted at the node and adds the statement tree to the sequence of statement trees.
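The constructor behaviour described above can be sketched with Python's standard `ast` module as a stand-in parser (the text does not name a parser or target language, so this is an assumption). The scan is a pre-order traversal, as in the implementation described later in the text; whenever a node representing a statement is met, the node is recorded as the root of one statement tree. The function name `statement_tree_sequence` is hypothetical, and in this simplified version a compound statement's tree still contains its nested statements.

```python
import ast
from typing import List


def statement_tree_sequence(source: str) -> List[ast.stmt]:
    """Return the roots of the statement trees in pre-order."""
    sequence: List[ast.stmt] = []

    def visit(node: ast.AST) -> None:
        if isinstance(node, ast.stmt):
            sequence.append(node)              # statement tree rooted at this node
        for child in ast.iter_child_nodes(node):
            visit(child)                       # parent before children: pre-order

    visit(ast.parse(source))
    return sequence


fragment = """
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
"""
roots = statement_tree_sequence(fragment)
labels = [type(root).__name__ for root in roots]
```

For this fragment, `labels` is `['FunctionDef', 'Assign', 'For', 'AugAssign', 'Return']`, that is, one statement tree per programming statement in source order.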

Preferably, converting each statement tree of each abstract syntax tree into a statement vector through a traversal algorithm, and extracting the statement vectors of the statement trees in a preset order to obtain a group of ordered statement vectors corresponding to each code segment, includes:

converting all code segments in a training set into abstract syntax trees;

generating a feature dictionary by extracting all nodes from the abstract syntax trees;

performing unsupervised training of a node encoder based on the feature dictionary;

encoding all internal nodes with the node encoder;

traversing a statement tree of the abstract syntax tree with a breadth-first algorithm to obtain the node vectors of the statement tree;

placing all node vectors V_n sequentially into an ordered set V;

inputting all node vectors V_n sequentially into a neural network encoder, and outputting a group of ordered statement vectors corresponding to one statement tree.
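The following sketch walks through these steps on a toy example, under several stated assumptions: Python's `ast` module stands in for the parser, node type names serve as dictionary entries, the node encoder is replaced by random untrained vectors (the text trains it by unsupervised learning), and the neural network encoder is a plain untrained recurrent cell whose final hidden state is taken as the encoding of the statement tree. The size `DIM` and all function names are hypothetical.

```python
import ast
import math
import random
from collections import deque
from typing import Dict, List

DIM = 4  # hypothetical embedding and hidden-state size


def build_dictionary(trees: List[ast.AST]) -> Dict[str, int]:
    """Feature dictionary: one entry per distinct node type in the training ASTs."""
    vocab: Dict[str, int] = {}
    for tree in trees:
        for node in ast.walk(tree):
            vocab.setdefault(type(node).__name__, len(vocab))
    return vocab


def init_node_encoder(vocab: Dict[str, int], seed: int = 0) -> Dict[str, List[float]]:
    """Stand-in for the trained node encoder: random, untrained vectors."""
    rng = random.Random(seed)
    return {token: [rng.uniform(-1.0, 1.0) for _ in range(DIM)] for token in vocab}


def bfs_tokens(root: ast.AST) -> List[str]:
    """Breadth-first traversal of one statement tree."""
    order: List[str] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(type(node).__name__)
        queue.extend(ast.iter_child_nodes(node))
    return order


def rnn_encode(vectors: List[List[float]], w_in: List[List[float]],
               w_rec: List[List[float]]) -> List[float]:
    """Feed the ordered node vectors to a simple recurrent cell; return the last state."""
    h = [0.0] * DIM
    for v in vectors:
        h = [math.tanh(sum(w_in[i][j] * v[j] for j in range(DIM))
                       + sum(w_rec[i][j] * h[j] for j in range(DIM)))
             for i in range(DIM)]
    return h


training_set = [ast.parse("y = f(x) + 1"), ast.parse("z = y")]
node_encoder = init_node_encoder(build_dictionary(training_set))

rng = random.Random(1)
w_in = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(DIM)]
w_rec = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(DIM)]

statement_tree = training_set[0].body[0]                 # the tree of "y = f(x) + 1"
tokens = bfs_tokens(statement_tree)
ordered_set_v = [node_encoder[token] for token in tokens]  # ordered set V
statement_vector = rnn_encode(ordered_set_v, w_in, w_rec)
```

Because the vectors and weights here are untrained, `statement_vector` carries no semantic meaning; the sketch only shows how the dictionary, the breadth-first order and the sequential encoding fit together.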

Preferably, generating a representative vector from each group of statement vectors through node encoding includes:

performing node encoding with a neural network encoder, and generating the representative vector, by a bidirectional neural network encoder, from the group of ordered statement vectors corresponding to one statement tree.

Preferably, the bi-directional neural network encoder includes a self-attention layer.

Preferably, obtaining the similarity according to the representative vector corresponding to the first code segment and the representative vector corresponding to the second code segment includes:

obtaining the cosine similarity between the representative vector corresponding to each first code segment and the representative vectors corresponding to all second code segments;

judging whether the cosine similarity between the representative vector corresponding to each first code segment and the representative vector corresponding to at least one second code segment is greater than a preset threshold; if so, the first code segment is similar to that second code segment; if not, the first code segment is dissimilar to the open source component set.
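The judgment above can be sketched as follows. Cosine similarity is the standard formula, that is, the dot product divided by the product of the vector norms; the function names, the sample vectors and the threshold value of 0.9 are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def similar_segments(first: Sequence[float],
                     seconds: List[Sequence[float]],
                     threshold: float) -> List[int]:
    """Indices of the second code segments whose similarity exceeds the threshold."""
    return [i for i, second in enumerate(seconds)
            if cosine_similarity(first, second) > threshold]


first_vector = [1.0, 0.0]
second_vectors = [[0.0, 1.0], [2.0, 0.0]]
matches = similar_segments(first_vector, second_vectors, threshold=0.9)
is_similar_to_component_set = bool(matches)
```

Returning the indices rather than a bare yes/no keeps track of which second code segment matched, which is what is needed to locate the similar fragment in the open source component.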

The embodiment of the invention also provides a code detection system based on open source components, used to implement the above code detection method based on open source components, the system comprising:

a source code acquisition module, which acquires a first source code of a code to be tested;

a syntax tree conversion module, which segments the first source code of the code to be tested and the second source code of the open source component set, respectively, to obtain a first code segment and a second code segment, and converts the first code segment and the second code segment into abstract syntax trees, respectively;

a statement vector generation module, which converts each statement tree of each abstract syntax tree into a statement vector through a traversal algorithm, and extracts the statement vectors of the statement trees in a preset order to obtain a group of ordered statement vectors corresponding to each code segment;

a representative vector generation module, which generates a representative vector from each group of statement vectors through node encoding; and

a similarity judging module, which obtains the similarity according to the representative vector corresponding to the first code segment and the representative vector corresponding to the second code segment.

The embodiment of the invention also provides a code detection device based on open source components, comprising:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the steps of the open source component based code detection method described above via execution of the executable instructions.

Embodiments of the present invention also provide a computer-readable storage medium storing a program that when executed implements the steps of the open source component-based code detection method described above.

The invention aims to provide a code detection method, system, device and storage medium based on open source components, which generate the similarity between the open source components and the code to be tested through a discrimination model, thereby automatically detecting open source components and open source fragments and improving network security.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings.

FIG. 1 is a flow chart of one embodiment of a method of open source component based code detection of the present invention.

Fig. 2 is a flowchart of step S110 in the open source component-based code detection method of the present invention.

Fig. 3 is a flowchart of step S120 in the open source component-based code detection method of the present invention.

Fig. 4 is a flowchart of step S130 in the open source component-based code detection method of the present invention.

FIG. 5 is a flow chart of an implementation process of the open source component based code detection method of the present invention.

FIG. 6 is a schematic diagram of an implementation process of the open source component based code detection method of the present invention.

Fig. 7 is a schematic diagram of a bi-directional neural network encoder in the implementation of the open source component based code detection method of the present invention.

FIG. 8 is a block diagram of one embodiment of an open source component based code detection system of the present invention.

FIG. 9 is a block diagram of a source code acquisition module in the open source component based code detection system of the present invention.

Fig. 10 is a schematic block diagram of a syntax tree conversion module in the open source component based code detection system of the present invention.

FIG. 11 is a block diagram of a statement vector generation module in an open source component-based code detection system of the present invention.

FIG. 12 is a schematic diagram of an open source component based code detection apparatus of the present invention.

Detailed Description

Other advantages and effects of the present application will be readily apparent to those skilled in the art from the present disclosure, by describing embodiments of the present application with specific examples. The present application may be embodied or applied in other specific forms and details, and various modifications and alterations may be made to the details of the present application from different points of view and application without departing from the spirit of the present application. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.

The embodiments of the present application will be described in detail below with reference to the drawings so that those skilled in the art to which the present application pertains can easily implement the same. This application may be embodied in many different forms and is not limited to the embodiments described herein.

In the description of the present application, reference to the terms "one embodiment," "some embodiments," "examples," "particular examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples presented herein, and the features of those embodiments or examples, may be joined and combined by those skilled in the art provided they do not conflict.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the context of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

For the purpose of clarity of the description of the present application, components that are not related to the description are omitted, and the same or similar components are given the same reference numerals throughout the description.

Throughout the specification, when a device is said to be "connected" to another device, this includes not only the case of "direct connection" but also the case of "indirect connection" with other elements interposed therebetween. In addition, when a certain component is said to be "included" in a certain device, unless otherwise stated, other components are not excluded, but it means that other components may be included.

When a device is said to be "on" another device, this may be directly on the other device, but may also be accompanied by other devices therebetween. When a device is said to be "directly on" another device in contrast, there is no other device in between.

Although the terms first, second, etc. are used herein in some instances to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another, for example a first interface from a second interface. Furthermore, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" specify the presence of stated features, steps, operations, elements, components, items, categories, and/or groups, but do not preclude the presence, occurrence or addition of one or more other features, steps, operations, elements, components, items, categories, and/or groups. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition will occur only when a combination of elements, functions, steps or operations is in some way inherently mutually exclusive.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the language clearly indicates the contrary. The meaning of "comprising" in the specification is to specify the presence of stated features, regions, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of other features, regions, integers, steps, operations, elements, and/or components.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. Terms defined in commonly used dictionaries are further interpreted as having meanings consistent with the related technical literature and the present disclosure, and, unless expressly so defined, are not to be interpreted in an idealized or overly formal sense.

FIG. 1 is a flow chart of one embodiment of a method of open source component based code detection of the present invention. As shown in FIG. 1, the invention relates to the field of network configuration, and discloses a code detection method based on an open source component. The flow of the invention comprises the following steps:

S110, acquiring a first source code of the code to be tested. The invention obtains the first source code by decompiling the code to be tested. Computer software reverse engineering (Reverse Engineering), also called computer software restoration engineering, refers to the reverse analysis and study of the target program (for example, an executable program) of another party's software in order to deduce the design elements used by that software product, such as its ideas, principles, structure, algorithms, processing procedures and operating methods, and, under certain conditions, possibly its source code. The result of decompilation may be used as a reference when developing one's own software, or used directly in one's own software product.

S120, segmenting the first source code of the code to be tested and the second source code of the open source component set, respectively, to obtain a first code segment and a second code segment, and converting the first code segment and the second code segment into abstract syntax trees, respectively. An abstract syntax tree (Abstract Syntax Tree, AST) is a tree representation of the abstract syntactic structure of source code, in which each node represents a construct in the source code. The syntax is "abstract" because the tree does not represent every detail of the concrete syntax; for example, nested brackets are implicit in the tree structure and are not presented as nodes. The abstract syntax tree does not depend on the grammar of the source language, that is, on the context-free grammar used in the parsing stage. When a grammar is written, it is often subjected to equivalent transformations (elimination of left recursion, backtracking, ambiguity, etc.), which introduce redundant components into parsing, adversely affect the subsequent stages and can even make those stages confusing. For this reason, many compilers construct the syntax tree independently, creating a clear interface between the front end and the back end.

S130, converting each statement tree of each abstract syntax tree into a statement vector through a traversal algorithm, and extracting the statement vectors of the statement trees in a preset order to obtain a group of ordered statement vectors corresponding to each code segment.

S140, generating a representative vector from each group of statement vectors through node encoding.

S150, obtaining the similarity according to the representative vector corresponding to the first code segment and the representative vector corresponding to the second code segment. In this embodiment, the cosine similarity is calculated between the representative vector corresponding to each first code segment and the representative vector corresponding to each second code segment. If the cosine similarity exceeds the preset threshold, the first code segment is considered similar to that second code segment of the open source component set, and, based on the position of the first code segment in the code to be tested, the similar code segments in the code to be tested and in the open source component are located. By generating the similarity between the open source components and the code to be tested through a discrimination model, the invention automatically detects open source components and open source fragments and improves network security.

Fig. 2 is a flowchart of step S110 in the open source component-based code detection method of the present invention. Fig. 3 is a flowchart of step S120 in the open source component-based code detection method of the present invention. Fig. 4 is a flowchart of step S130 in the open source component-based code detection method of the present invention. As shown in Figs. 2, 3 and 4, on the basis of the embodiment of Fig. 1, the open source component-based code detection method replaces step S110 with S111, S112, S113, S114 and S115, step S120 with S121, S122 and S123, step S130 with S131, S132, S133, S134, S135, S136 and S137, step S140 with S141, and step S150 with S151 and S152. Each step is described below.

S111, decompiling the code to be tested to obtain the first source code.

S112, preprocessing the first source code, including: deleting preset invalid text from the first source code and the second source code, wherein the preset invalid text includes at least package declarations, import statements, comments and blank lines; unifying the access keywords of functions to a preset keyword; and normalizing the source code to lowercase, but not limited thereto.

S113, obtaining a first component code pattern from the character strings, files and directory information of each code file in the first source code.

S114, obtaining the similarity between the first component code pattern of the first source code and the second component code patterns preset in the open source component library.

S115, collecting the open source components corresponding to the second component code patterns whose similarity is greater than or equal to a preset threshold, to obtain the open source component set. This performs a preliminary screening of the massive open source component library, retaining as the open source component set only those open source components whose second component code patterns are similar to the first component code pattern, so that the similarity between the code to be tested and the entire open source component library need not be computed, which saves computation and speeds up obtaining the code detection result.

S121, segmenting the first source code of the code to be tested and the second source code of the open source component set to obtain a first code segment and a second code segment.

S122, decomposing the code of the first code segment and of the second code segment in order, at the granularity of natural statements, to form respective statement tree sequences. When the constructor encounters a tree node representing a programming statement in the first code segment or the second code segment, the constructor generates a statement tree rooted at that node and adds the statement tree to the sequence of statement trees, but is not limited thereto.

S123, generating an abstract syntax tree based on the statement tree sequence corresponding to the same code segment.

S131, all code segments in the training set are converted into abstract syntax trees.

S132, a feature dictionary is then generated by extracting all nodes from the abstract syntax tree.

S133, performing unsupervised learning training on the node encoder based on the feature dictionary.

S134, all internal nodes are encoded through a node encoder.

S135, traversing the statement tree of the abstract syntax tree by using a breadth-first algorithm to obtain the node vector of the statement tree.

S136, all node vectors V_n are placed sequentially into an ordered set V.

S137, all node vectors V_n are input sequentially into the neural network encoder, which outputs a group of ordered statement vectors corresponding to one statement tree.

S141, performing node encoding with a neural network encoder, and generating a representative vector, by a bidirectional neural network encoder, from the group of ordered statement vectors corresponding to one statement tree, wherein the bidirectional neural network encoder includes a self-attention layer.

S151, obtaining cosine similarity between the representative vector corresponding to each first code segment and the representative vectors corresponding to all second code segments. In this embodiment, the cosine similarity between the representative vector corresponding to the first code segment and the representative vectors corresponding to all the second code segments is obtained by using the cosine similarity calculation method in the prior art, which is not described herein.

S152, judging whether the cosine similarity between the representative vector corresponding to each first code segment and the representative vector corresponding to at least one second code segment is greater than a preset threshold; if so, the first code segment is similar to that second code segment; if not, the first code segment is dissimilar to the open source component set.

This patent discloses an open source component detection method based on code pattern analysis and bidirectional RNN encoding. First, code pattern analysis is performed on the code to be tested to detect open source components. The source code is then preprocessed and segmented into code fragments, an abstract syntax tree and a sequence of statement trees (subtrees) are generated, and node encoding is performed with a bidirectional RNN encoder. A final representative vector is generated from the group of statement vectors by a specific bidirectional RNN encoder with a self-attention layer, and a discrimination model generates the similarity between the open source components and the code to be tested, thereby realizing the detection of open source components and of open source fragments.

A recurrent neural network (Recurrent Neural Network, RNN) is a type of recursive neural network that takes sequence data as input, performs recursion along the direction in which the sequence evolves, and has all of its nodes (recurrent units) connected in a chain. A recurrent neural network has memory, shares parameters and is Turing complete, which gives it certain advantages in learning the nonlinear characteristics of sequences. Recurrent neural networks are applied in natural language processing (Natural Language Processing, NLP), for example in speech recognition, language modeling and machine translation, and are also used for various kinds of time series prediction. A recurrent neural network combined with a convolutional neural network (Convolutional Neural Network, CNN) can handle computer vision problems involving sequence inputs. The bidirectional RNN in this embodiment is composed of two RNNs stacked one on top of the other, and its output is determined jointly by the states of both RNNs. In other words, the main structure of a bidirectional RNN is a combination of two unidirectional RNNs: at each time t, the input is provided simultaneously to the two RNNs, which run in opposite directions, and the output is determined jointly by the two unidirectional RNNs (for example by concatenation or summation).
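The bidirectional structure just described can be sketched in plain Python as follows. This is an untrained toy with a simple tanh recurrent cell and random weights, intended only to show the two opposite passes and the concatenation of their states at each time t; the sizes `HIDDEN` and `INPUT` and all function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def rnn_states(inputs: List[Vector], w_in: Matrix, w_rec: Matrix) -> List[Vector]:
    """Run one unidirectional RNN and return its hidden state at every step."""
    h = [0.0] * len(w_rec)
    states: List[Vector] = []
    for x in inputs:
        h = [math.tanh(sum(w * xi for w, xi in zip(w_in[i], x))
                       + sum(w * hj for w, hj in zip(w_rec[i], h)))
             for i in range(len(w_rec))]
        states.append(h)
    return states


def bidirectional_states(inputs: List[Vector],
                         forward_weights: Tuple[Matrix, Matrix],
                         backward_weights: Tuple[Matrix, Matrix]) -> List[Vector]:
    """Concatenate, at each time t, the forward state and the backward state."""
    forward = rnn_states(inputs, *forward_weights)
    backward = rnn_states(inputs[::-1], *backward_weights)[::-1]  # realign to time t
    return [f + b for f, b in zip(forward, backward)]             # list concatenation


HIDDEN, INPUT = 3, 2
rng = random.Random(0)
fwd = (random_matrix(HIDDEN, INPUT, rng), random_matrix(HIDDEN, HIDDEN, rng))
bwd = (random_matrix(HIDDEN, INPUT, rng), random_matrix(HIDDEN, HIDDEN, rng))

sequence = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
outputs = bidirectional_states(sequence, fwd, bwd)
```

Each element of `outputs` has `2 * HIDDEN` components. The forward half at time t depends only on the inputs up to t, while the backward half depends on the inputs from t to the end, so every output sees context from both directions. Replacing the concatenation `f + b` with an element-wise sum gives the summation variant.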

FIG. 5 is a flow chart of an implementation process of the open source component based code detection method of the present invention. FIG. 6 is a schematic diagram of an implementation process of the open source component based code detection method of the present invention. Fig. 7 is a schematic diagram of a bi-directional neural network encoder in the implementation of the open source component based code detection method of the present invention. As shown in fig. 5, 6 and 7, the implementation process of the open source component-based code detection method of the present invention is as follows:

Step 101: Decompile the code to be tested and the open source components to obtain source code; delete package and import statements, comments and blank lines; normalize function access keywords to PUBLIC; normalize the source code to lowercase; extract string, file and directory information; generate an inverted index library of component code patterns; and analyze the component code patterns to judge whether a corresponding open source component is present.
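The normalization part of step 101 can be sketched as follows, assuming Java-like source text after decompilation. The regular expressions are a deliberately naive illustration (they do not protect comment markers that appear inside string literals), and the function name `preprocess` is hypothetical.

```python
import re

# Access keywords that are all normalized to a single keyword.
ACCESS_KEYWORDS = re.compile(r"\b(private|protected|public)\b")


def preprocess(source: str) -> str:
    """Strip comments, package/import lines and blank lines, then normalize."""
    # Remove block comments first, then line comments.
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if re.match(r"(package|import)\b", stripped):
            continue  # package or import statement
        kept.append(stripped)
    text = ACCESS_KEYWORDS.sub("public", "\n".join(kept))
    return text.lower()


sample = """package com.example;
import java.util.List;

/* Utility class */
public class Greeter {
    // say hello
    private String greet(String name) {
        return "Hello, " + name;
    }
}
"""
cleaned = preprocess(sample)
```

For the sample above, `cleaned` keeps only the class and method bodies, with `private` rewritten to `public` and all text in lowercase.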

Step 102: The code is segmented into fragments, and the source code fragments to be detected are converted into abstract syntax trees (ASTs) by an existing AST tool. Each AST is decomposed at the granularity of language statements, and the sequence of statement trees is extracted through pre-order traversal. Subtree segmentation process: the AST is first scanned using a pre-order traversal and decomposed at the granularity of natural statements. When the constructor encounters a tree node representing a programming statement, it generates a statement tree rooted at that node and adds that tree to the subtree queue.
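The subtree segmentation of step 102 can be sketched with Python's standard `ast` module standing in for the "existing AST tool"; this is an assumption made only for illustration, since the embodiment does not name a tool or a source language. The function name `split_statement_trees` is hypothetical, and in this simplified sketch a compound statement keeps its nested statements inside its own subtree.

```python
import ast


def split_statement_trees(source: str) -> list:
    """Scan the AST in pre-order and queue every node that represents a statement."""
    tree = ast.parse(source)
    queue = []

    def visit(node):
        # Pre-order: handle the node itself before descending into its children.
        if isinstance(node, ast.stmt):
            queue.append(node)  # statement tree rooted at this node
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return queue


code = "def f(x):\n    y = x + 1\n    if y > 2:\n        return y\n    return 0\n"
kinds = [type(node).__name__ for node in split_statement_trees(code)]
# kinds == ['FunctionDef', 'Assign', 'If', 'Return', 'Return']
```

The order of `kinds` follows the order in which the statements appear in the source, which is what makes the resulting sequence of statement trees ordered.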

Step 103: Each statement tree is converted into a statement vector by the RNN encoder using a breadth-first traversal algorithm. At this point, the entire AST representing the source code fragment has been converted into an ordered set of statement vectors. The process of encoding a statement tree is as follows: all code fragments in the training set are converted into ASTs, and a token dictionary is then generated by extracting all nodes from the ASTs. Finally, the node encoder E is trained by unsupervised learning on the token dictionary using word2vec. For a given statement tree ST, all internal nodes are encoded using E, ST is traversed using a breadth-first algorithm, and all node vectors are then placed sequentially into an ordered set V. All vectors in V are input sequentially into the RNN encoder, and the output of the RNN is the vector representation of ST.
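The inputs to the encoding in step 103 can be sketched as follows, again assuming Python's `ast` module as a stand-in AST tool and using node type names as tokens (both are illustrative assumptions). The sketch builds the token dictionary and the breadth-first token order of one statement tree; the word2vec training of the node encoder E is not shown. The names `build_token_dictionary` and `bfs_tokens` are hypothetical.

```python
import ast
from collections import deque


def build_token_dictionary(sources):
    """Assign an integer id to every distinct node type seen in the training set."""
    vocab = {}
    for src in sources:
        for node in ast.walk(ast.parse(src)):
            vocab.setdefault(type(node).__name__, len(vocab))
    return vocab


def bfs_tokens(root):
    """Return the node tokens of one statement tree in breadth-first order."""
    tokens, pending = [], deque([root])
    while pending:
        node = pending.popleft()
        tokens.append(type(node).__name__)
        pending.extend(ast.iter_child_nodes(node))
    return tokens


vocab = build_token_dictionary(["y = x + 1"])
statement = ast.parse("y = x + 1").body[0]
ordered_tokens = bfs_tokens(statement)
ordered_ids = [vocab[token] for token in ordered_tokens]
```

The list `ordered_ids` is what a trained node encoder would turn into the ordered set V of node vectors before they are fed to the RNN encoder.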

Step 104: the model generates a final representative vector from a set of statement vectors by a specific bi-directional RNN encoder with a self-attention layer.

Step 105: The similarity of the two vectors is calculated, and it is judged whether the similarity is greater than the corresponding threshold.

In the invention, source code is obtained by decompiling the code to be tested and the open source components. Package and import statements, comments and blank lines are deleted from the source code, function access keywords are normalized, and the source code is normalized to lowercase. The string, file and directory information in the code to be tested and in the open source components is then extracted to form code patterns, and the code patterns are compared to detect open source components.

In the invention, the source code of the code to be tested and of the open source components is segmented into code fragments, and the code fragments are converted into abstract syntax trees (ASTs) by an existing AST tool. Each AST is decomposed at the granularity of language statements, and the sequence of statement trees is extracted through pre-order traversal. Subtree segmentation process: the AST is first scanned using a pre-order traversal and decomposed at the granularity of natural statements. When the constructor encounters a tree node representing a programming statement, it generates a statement tree rooted at that node and adds that tree to the subtree queue.

In the invention, each statement tree is converted into a statement vector by an RNN encoder using a breadth-first traversal algorithm. At this point, the entire AST representing the source code fragment has been converted into an ordered set of statement vectors. The process of encoding a statement tree is as follows: all code fragments in the training set are converted into ASTs, and a token dictionary (feature dictionary) is then generated by extracting all nodes from the ASTs. Finally, the node encoder E is trained by unsupervised learning on the token dictionary using word2vec. Word2vec is a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a word as input and predicts the words in adjacent positions; under the bag-of-words assumption in word2vec, the order of those words is unimportant. After training is completed, the word2vec model can be used to map each word to a vector, taken from the hidden layer of the neural network, that can be used to represent word-to-word relationships. An extension of word2vec that builds embeddings of whole documents (rather than of independent words) has been proposed, called paragraph2vec or doc2vec, and has been implemented as tools in C, Python and Java/Scala. The Java and Python versions also support inferring embeddings for new, unseen documents. Conceptually, word embedding refers to embedding a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension, with each word or phrase mapped to a vector over the real numbers.
Word embedding methods include artificial neural networks, dimensionality reduction of the word co-occurrence matrix, probabilistic models, and explicit representation of the context in which a word appears. Using word embeddings to represent phrases as the underlying input has greatly improved the performance of syntactic parsers, text sentiment analysis and other NLP tasks. For a given statement tree ST, all internal nodes are encoded using E, ST is traversed using a breadth-first algorithm, and all node vectors are then placed sequentially into an ordered set V. All vectors in V are input sequentially into the RNN encoder, and the output of the RNN is the vector representation of ST.
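How word2vec derives its training signal from adjacent positions can be illustrated with a small sketch that generates (target, context) pairs from a token sequence. The skip-gram style pairing and the window size are assumptions made for illustration; the embodiment does not state which word2vec variant or hyperparameters are used, and the function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every token with the tokens at most `window` positions away from it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Position within the window is not recorded, only co-occurrence.
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


node_tokens = ["Assign", "Name", "BinOp", "Name"]
pairs = skipgram_pairs(node_tokens, window=1)
# pairs[0] == ("Assign", "Name"); six pairs in total for this sequence
```

A shallow two-layer network trained to predict the second element of each pair from the first yields, in its hidden layer, the node vectors used by the node encoder E.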

The model in the present invention generates the final representative vector from a set of statement vectors by a specific bi-directional RNN encoder with a self-attention layer (see FIG. 7). The ordered set of statement vectors is input sequentially into the RNN network. To obtain sequence information and to represent the different contribution of each statement to the system functionality, a bi-directional RNN network is used. The network structure is shown in FIG. 7.
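One way to let each statement contribute differently to the representative vector is attention-weighted pooling over the encoder states, sketched below. The dot-product scoring against a single query vector is an assumption chosen for brevity; the embodiment does not specify the form of its self-attention layer, and the names `softmax` and `attention_pool` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states, query):
    """Weight each statement state by its score against `query` and sum them."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    pooled = [sum(w * state[k] for w, state in zip(weights, states)) for k in range(dim)]
    return pooled, weights


states = [[1.0, 0.0], [0.0, 1.0]]
uniform_vector, uniform_weights = attention_pool(states, [0.0, 0.0])
focused_vector, focused_weights = attention_pool(states, [10.0, 0.0])
```

With a zero query every statement receives the same weight, while a query aligned with the first state shifts almost all of the weight onto the first statement.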

FIG. 8 is a block diagram of one embodiment of an open source component based code detection system of the present invention. The open source component based code detection system of the present invention, as shown in FIG. 8, includes, but is not limited to:

The source code obtaining module 51 obtains a first source code of a code to be tested.

The syntax tree conversion module 52 divides the first source code of the code to be tested and the second source code of the open source component set to obtain a first code segment and a second code segment, and converts the first code segment and the second code segment into abstract syntax trees.

The sentence vector generating module 53 converts each sentence tree of each abstract syntax tree into a sentence vector by a traversal algorithm, extracts the sentence vectors of the sentence tree according to a preset sequence, and obtains a set of ordered sentence vectors corresponding to each code segment.

The representative vector generation module 54 generates a representative vector from each set of statement vectors by node encoding.

The similarity judging module 55 obtains the similarity according to the representative vector corresponding to the first code segment and the representative vector corresponding to the second code segment.

The implementation principle of the above module is referred to the related description in the code detection method based on the open source component, and will not be repeated here.

The code detection system based on the open source component can generate the similarity between the open source component and the code to be detected through the discrimination model, automatically realize the detection of the open source component and the detection of the open source fragment, and improve the network security.

FIG. 9 is a block diagram of a source code acquisition module in the open source component based code detection system of the present invention. Fig. 10 is a schematic block diagram of a syntax tree conversion module in the open source component based code detection system of the present invention. FIG. 11 is a block diagram of a statement vector generation module in an open source component-based code detection system of the present invention. Referring to fig. 9, 10 and 11, based on the embodiment of the apparatus of fig. 8, the open source component-based code detection system of the present invention replaces the source code acquisition module 51 with a first source code module 511, a preprocessing module 512, a component code pattern module 513, a similarity module 514 and a component set module 515. The syntax tree conversion module 52 is replaced by a code segmentation module 521, a sentence tree sequence module 522, and an abstract syntax tree module 523. The sentence vector generation module 53 is replaced by a transcoding module 531, a feature dictionary module 532, a learning training module 533, an internal node module 534, a node vector module 535, an ordered set module 536, and a sentence vector module 537. The representative vector generation module 54 is replaced by a representative vector module 541. The similarity determination module 55 is replaced by a similarity calculation module 551 and a similarity determination module 552. Each module is described below:

The first source code module 511 obtains the first source code by decompiling the code to be tested.

The preprocessing module 512 preprocesses the first source code.

The component code pattern module 513 obtains the first component code pattern according to the character strings, files and directory information of each code file in the first source code.

The similarity module 514 obtains the similarity between the first component code pattern of the first source code and the second component code patterns preset in the open source component library.

The component set module 515 collects open source components corresponding to the second component code patterns with the similarity greater than or equal to the preset threshold to obtain an open source component set.
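The work of modules 513 to 515 can be sketched as a lookup against an inverted index of component code patterns. The embodiment does not state how the similarity between code patterns is computed; the Jaccard coefficient used here, the feature strings and the names `build_inverted_index` and `candidate_components` are all assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_index(library):
    """Map each feature (string, file or directory name) to the components containing it."""
    index = defaultdict(set)
    for component, features in library.items():
        for feature in features:
            index[feature].add(component)
    return index


def candidate_components(pattern, library, index, threshold=0.5):
    """Collect components whose code pattern is similar enough to `pattern`."""
    candidates = set()
    for feature in pattern:
        candidates |= index.get(feature, set())
    matched = {}
    for name in candidates:
        other = library[name]
        score = len(pattern & other) / len(pattern | other)  # Jaccard (assumed metric)
        if score >= threshold:
            matched[name] = score
    return matched


# Hypothetical second component code patterns preset in the component library.
library = {
    "libfoo": {"foo/util.java", "foo init failed", "foo"},
    "libbar": {"bar/core.java", "bar"},
}
index = build_inverted_index(library)
first_pattern = {"foo/util.java", "foo", "app/main.java"}
component_set = candidate_components(first_pattern, library, index)
```

Only components that share at least one feature with the first component code pattern are scored, which is the purpose of the inverted index.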

The code segmentation module 521 segments the first source code of the code to be tested and the second source code of the open source component set to obtain a first code segment and a second code segment.

The sentence tree sequence module 522 sequentially decomposes the codes according to the granularity of the natural sentence for the first code segment and the second code segment, respectively, to form sentence tree sequences, respectively.

The abstract syntax tree module 523 generates an abstract syntax tree based on the sentence tree sequence corresponding to the same code fragment.

The transcoding module 531 converts all the code segments in the training set into an abstract syntax tree.

The feature dictionary module 532 then generates a feature dictionary by extracting all the nodes from the abstract syntax tree.

The learning training module 533 performs unsupervised learning training on the feature dictionary-based node encoders.

The internal node module 534 encodes all internal nodes through a node encoder.

The node vector module 535 traverses the statement tree of the abstract syntax tree using a breadth-first algorithm to obtain a node vector for the statement tree.

The ordered set module 536 places all node vectors V_n sequentially into the ordered set V.

The statement vector module 537 sequentially inputs all node vectors V_n into the neural network encoder, which outputs a set of ordered sentence vectors corresponding to a sentence tree.

The representative vector module 541 performs node encoding using a bi-directional neural network encoder, which generates a representative vector from the set of ordered sentence vectors corresponding to a sentence tree.

The similarity calculation module 551 obtains cosine similarity between the representative vector corresponding to each first code segment and the representative vectors corresponding to all the second code segments.

The similarity determining module 552 determines whether the cosine similarity between the representative vector corresponding to each first code segment and the representative vector corresponding to at least one second code segment is greater than a preset threshold; if so, the first code segment is similar to that second code segment; if not, the first code segment is dissimilar to the open source component set.
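The comparison performed by modules 551 and 552 can be sketched as follows. The threshold value of 0.8 and the function names `cosine_similarity` and `is_similar_to_set` are illustrative assumptions; the embodiment only requires a preset threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_similar_to_set(first_vector, second_vectors, threshold=0.8):
    """True if at least one second code segment exceeds the similarity threshold."""
    return any(cosine_similarity(first_vector, v) > threshold for v in second_vectors)


second_vectors = [[0.0, 1.0], [1.0, 0.0]]
hit = is_similar_to_set([1.0, 0.1], second_vectors)   # close to [1.0, 0.0]
miss = is_similar_to_set([1.0, 0.0], [[0.0, 1.0]])    # orthogonal vectors
```

Because cosine similarity ignores vector length, two code segments are compared only by the direction of their representative vectors.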

The embodiment of the invention also provides a code detection device based on the open source component, which comprises a processor and a memory in which executable instructions of the processor are stored, wherein the processor is configured to perform the steps of the open source component based code detection method via execution of the executable instructions.

As shown above, the code detection system based on the open source component can generate the similarity between the open source component and the code to be detected through the discrimination model, automatically realize the detection of the open source component and the detection of the open source fragment, and improve the network security.

Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module" or "platform."

Fig. 12 is a schematic structural diagram of an open source component-based code detection apparatus of the present invention. An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 12. The electronic device 600 shown in fig. 12 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.

As shown in fig. 12, the electronic device 600 is in the form of a general purpose computing device. Components of electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one memory unit 620, a bus 630 connecting the different platform components (including memory unit 620 and processing unit 610), a display unit 640, etc.

The storage unit stores program code executable by the processing unit 610, such that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention described in the above open source component-based code detection method section of the present specification. For example, the processing unit 610 may perform the steps as shown in fig. 1.

The storage unit 620 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 6201 and/or cache memory unit 6202, and may further include Read Only Memory (ROM) 6203.

The storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.

Bus 630 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 600, and/or any device (e.g., router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 650. Also, electronic device 600 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 600, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage platforms, and the like.

The embodiment of the invention also provides a computer readable storage medium for storing a program, and the steps of the open source component-based code detection method are realized when the program is executed. In some possible embodiments, the aspects of the present invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the open source component-based code detection method section of this specification, when the program product is run on the terminal device.

The program product 800 for implementing the above-described method according to an embodiment of the present invention may employ a portable compact disc read-only memory (CD-ROM) and include program code and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out processes of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).

In summary, the invention aims to provide a code detection method, a system, equipment and a storage medium based on an open source component, which can automatically realize the detection of the open source component and the detection of an open source fragment by generating the similarity between the open source component and a code to be detected through a discrimination model and improve the network security.

The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims

1. A code detection method based on an open source component, comprising:

acquiring a first source code of a code to be tested;

converting all code segments in the training set into an abstract syntax tree; then generating a feature dictionary by extracting all nodes from the abstract syntax tree; performing unsupervised learning training on the node encoder based on the feature dictionary; encoding all internal nodes by the node encoder; traversing a statement tree of the abstract syntax tree by using a breadth-first algorithm to obtain node vectors of the statement tree; placing all node vectors V_n sequentially into an ordered set V; sequentially inputting all node vectors V_n into a neural network encoder, and outputting a group of ordered sentence vectors corresponding to a sentence tree;

2. The method for detecting a code based on an open source component according to claim 1, wherein the obtaining the first source code of the code to be detected comprises:

decompiling the code to be tested to obtain a first source code;

preprocessing the first source code;

obtaining the similarity between a first component code pattern of the first source code and a second component code pattern preset in an open source component library;

3. The open source component based code detection method of claim 2, wherein the preprocessing the first source code comprises:

the source code is normalized to lowercase.

4. The method for detecting codes based on open source components according to claim 1, wherein the dividing the first source code of the code to be detected and the second source code of the open source component set to obtain a first code segment and a second code segment, respectively, and converting the first code segment and the second code segment into abstract syntax trees, respectively, comprises:

5. The open source component-based code detection method of claim 4, wherein the sequentially decomposing the codes for the first code segment and the second code segment according to the granularity of natural sentences, respectively, forms sentence tree sequences, respectively, comprises:

6. The open source component based code detection method of claim 1, wherein generating a representative vector from each set of the statement vectors by node encoding comprises:

7. The open source component based code detection method of claim 6, wherein the bi-directional neural network encoder comprises a self-attention layer.

8. The open source component based code detection method of claim 1, wherein the obtaining the similarity from the representative vector corresponding to the first code segment and the representative vector corresponding to the second code segment comprises:

9. A code detection system based on an open source component, comprising:

the sentence vector generation module is used for converting all code segments in the training set into an abstract syntax tree; then generating a feature dictionary by extracting all nodes from the abstract syntax tree; performing unsupervised learning training on the node encoder based on the feature dictionary; encoding all internal nodes by the node encoder; traversing a statement tree of the abstract syntax tree by using a breadth-first algorithm to obtain node vectors of the statement tree; placing all node vectors V_n sequentially into an ordered set V; sequentially inputting all node vectors V_n into a neural network encoder, and outputting a group of ordered sentence vectors corresponding to a sentence tree;

10. A code detection device based on an open source component, comprising:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the steps of the open source component based code detection method of any one of claims 1 to 8 via execution of the executable instructions.

11. A computer-readable storage medium storing a program, wherein the program when executed by a processor implements the steps of the open source component-based code detection method of any one of claims 1 to 8.