CN114841124A - Third-party component document fine-grained automatic extraction method and system based on question-answer model

Info

Publication number: CN114841124A
Application number: CN202210331439.7A
Authority: CN
Inventors: 纪守领; 赵彬彬; 王琴应; 张旭鸿; 邓水光; 王文海; 祝羽艳; 杨星
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2022-03-30
Filing date: 2022-03-30
Publication date: 2022-08-02

Abstract

The invention discloses a method and system, based on a question-answer model, for fine-grained automatic extraction from third-party component documents, and belongs to the technical field of third-party component testing. The system comprises three modules. The third-party component document preprocessing module performs a preliminary filtering of third-party component documents to obtain coarse-grained usage rules. The document question-answer tree construction module analyzes the misuse types of third-party components in depth, designs query questions for each misuse type, and has the documents to be tested manually annotated according to those questions. The question-answer-based third-party component usage rule extraction module uses a natural language processing model based on RoBERTa to perform question-answer style information extraction on the documents, yielding fine-grained usage rules for the third-party components. The system solves the problem of coarse-grained refinement of third-party component documents that lack a uniform format, and can automatically extract the usage rules in such documents at fine granularity.

Description

Third-party component document fine-grained automatic extraction method and system based on question-answer model

Technical Field

The invention relates to the technical field of third-party component testing, in particular to a question-answer model-based third-party component document fine-grained automatic extraction method and system.

Background

With the continuous growth of open source communities, third-party components of all kinds have flourished and are now widely used in software development across industries. However, recent research and real-world incidents show that, while third-party components are convenient for software developers, the security problems that arise from their use are alarming. Because there is no effective way to hold developers to strict specifications when they use third-party components, software built on such components may pose serious security threats. For example, some third-party component functions require resources to be released after they are called; developers may overlook or miss usage rules of this kind, which leads to serious security threats. At the mild end, such threats compromise user privacy; at the severe end, they endanger the security of nationally critical equipment.

To detect whether developers call third-party components strictly according to their usage rules, researchers have proposed various detection systems for mining misuse of third-party components. What these detection systems have in common is that they need accurate usage rules for the third-party components. At present, researchers mainly obtain usage rules from third-party component documents by manual reading, regular expression matching, syntax dependency trees and similar methods. First, obtaining usage rules by manually reading third-party component documents is time-consuming and laborious: a third-party component often contains hundreds of functions, each with multiple usage rules, so a single component may have thousands of usage rules. Second, usage rules obtained by regular expression matching often include a large number of false positives and do not cover the usage rules of a component comprehensively, which reduces the accuracy of subsequent misuse detection. In addition, mining rules with the syntax dependency tree method requires a large amount of document preprocessing, performs poorly on loosely structured third-party component documents, and is difficult to apply to large-scale detection.

Designing an effective method for fine-grained automatic extraction from third-party component documents faces the following challenges: (1) Adapting to third-party component documents of different formats. Third-party component documents are currently not written in a uniform format, and any two of them may differ significantly. A method that relies only on regular expression matching cannot be applied to different classes of third-party component documents. (2) Comprehensively acquiring the usage rules of third-party components. Third-party component documents contain a large number of distracting sentences, such as descriptions of what a function does, which make it much harder to acquire usage rules comprehensively and cause a large number of false positives. Furthermore, some usage rule descriptions are ambiguous, so that even a human reader sometimes cannot judge whether a sentence states a real usage rule.

Because third-party component documents have no uniform format and their usage rules are varied, there is currently no effective method for automatically obtaining third-party component usage rules at fine granularity. Designing a method that automatically extracts usage rules from third-party component documents at fine granularity is therefore both important and necessary for the subsequent detection of vulnerabilities caused by third-party component misuse.

Disclosure of Invention

To address the shortcomings of existing work on fine-grained automatic extraction of third-party component usage rules, the invention provides a question-answer-model-based method and system for fine-grained automatic extraction from third-party component documents.

The specific technical scheme of the invention is as follows:

The first objective of the invention is to provide a third-party component document fine-grained automatic extraction method based on a question-answer model, which comprises the following steps:

Step 1: collecting documents of a plurality of different third-party components, preprocessing the documents and constructing a document warehouse; performing sentence refinement on the documents to be tested in the document warehouse by using an attention model, and obtaining coarse-grained usage rules of the third-party components;

Step 2: designing corresponding questions for the question-answer model according to the misuse types of third-party components; selecting some of the documents to be tested in the document warehouse, and annotating answers according to the designed questions;

Step 3: dividing the annotated documents into a training set and a validation set, and training the natural language processing model on the training set until its accuracy on the validation set meets a preset requirement; and using the trained natural language processing model to perform fine-grained mining on the coarse-grained usage rules of the remaining documents in the document warehouse whose answers have not been annotated.

The second objective of the invention is to provide a question-answer-model-based third-party component document fine-grained automatic extraction system for implementing the above method. The extraction system comprises:

a third-party component document preprocessing module, used for collecting documents of a plurality of different third-party components, preprocessing the documents and constructing a document warehouse; and for performing sentence refinement on the documents to be tested in the document warehouse by using an attention model to obtain coarse-grained usage rules of the third-party components;

a document question-answer tree construction module, used for designing corresponding questions for the question-answer model according to the misuse types of third-party components; and for selecting some of the documents to be tested in the document warehouse and annotating answers according to the designed questions;

a question-answer-based third-party component usage rule extraction module, used for dividing the annotated documents into a training set and a validation set, and training the natural language processing model on the training set until its accuracy on the validation set meets a preset requirement; and for using the trained natural language processing model to perform fine-grained mining on the coarse-grained usage rules of the remaining documents in the document warehouse whose answers have not been annotated.

Compared with the prior art, the invention has the following beneficial effects:

(1) The invention provides a third-party component document fine-grained automatic extraction system together with a question-answer-model-based extraction method, which solves the problem of fine-grained automatic extraction of third-party component usage rules, can effectively extract the usage rules of different types of third-party components, and is practical;

(2) The invention provides an attention-model-based document preprocessing method, which solves the problem of coarse-grained refinement of third-party component documents that lack a uniform format. It also provides a question-answer-model-based method for extracting document content, which lays the basis for constructing document question-answer trees and for fine-grained extraction of usage rules from third-party component documents that lack a uniform format.

Drawings

FIG. 1 is a schematic diagram of an overall module structure of a third-party component document fine-grained automatic extraction system based on a question-answering model;

FIG. 2 is a schematic flow chart of a third-party component document fine-grained automatic extraction method based on a question-answering model;

FIG. 3 is a schematic diagram of a third party component document preprocessing method;

FIG. 4 is a schematic diagram of a third-party component document question-answer tree construction method;

FIG. 5 is a schematic diagram of a third-party component usage rule extraction method based on question answering.

Detailed Description

The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.

As shown in FIG. 1, the question-answer model-based third-party component document fine-grained automatic extraction system of the invention comprises a third-party component document preprocessing module, a document question-answer tree construction module and a question-answer-based third-party component usage rule extraction module.

The workflow of the whole third-party component document fine-grained automatic extraction system is shown in FIG. 2 and comprises the following steps:

(1) Collect documents of a plurality of different third-party components, such as OpenSSL, SQLite and uClibc, preprocess them, and form a large-scale document warehouse. Perform sentence refinement on the documents to be tested in the document warehouse by using an attention model to obtain coarse-grained usage rules of the third-party components;

(2) Design corresponding questions for the subsequent question-answer model according to the misuse types of third-party components; select at least 5 documents from the documents to be tested in the document warehouse, and manually annotate answers in them according to the designed questions;

(3) Divide the annotated documents into a training set and a validation set, feed the training set into a natural language processing model for training, and test its accuracy on the validation set. Apply the trained model to the unannotated documents to be tested in the document warehouse, and perform fine-grained mining of the usage rules of each third-party component involved.

In the invention, the core of step (1) is to capture the characteristics of third-party component usage rules and to use the attention model to coarsely filter out content unrelated to usage rules. Empirically, even in documents with different writing styles, content describing usage rules carries an emphasized tone and uses modal words such as can (could), may (might), must, need, ought to, dare (dared), shall (should), and will (would). Therefore, to coarsely filter irrelevant content from third-party component documents, the invention provides a filtering method based on an attention-based natural language processing model, which mainly comprises the following steps:

(1-1) Collect the documents of the third-party components recorded in the Linux help documents. Filter out third-party components whose documentation is unclear or seriously incomplete;

(1-2) According to the characteristics of third-party component usage rules, use the attention model to perform coarse-grained filtering of the third-party component document content, retaining the sentences with an emphasized tone, to obtain the coarse-grained filtered usage rules.
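The effect of step (1-2) can be illustrated with a much simpler stand-in. The sketch below is not the attention model described here; it is a plain keyword filter that keeps sentences containing one of the listed modal words. The function names, the naive sentence splitter and the sample text are assumptions made for illustration only.

```python
import re

# Modal words listed in the description as markers of an emphasized tone.
MODAL_WORDS = [
    "can", "could", "may", "might", "must", "need", "ought to",
    "dare", "dared", "shall", "should", "will", "would",
]

# One case-insensitive pattern with word boundaries, so that a word such as
# "candidate" does not match "can".
_MODAL_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in MODAL_WORDS) + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter (hypothetical helper): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def coarse_filter(text):
    """Keep only the sentences that contain at least one modal word."""
    return [s for s in split_sentences(text) if _MODAL_RE.search(s)]


# Made-up sample text in the style of a component manual page.
doc = (
    "SSL_new() creates a new SSL structure. "
    "The returned pointer must be freed with SSL_free(). "
    "This page describes the API."
)
kept = coarse_filter(doc)
```

A keyword filter like this keeps every sentence with a modal word, including ones that are not usage rules, which is one reason a learned model may be preferable for this stage.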

In the invention, the core of step (2) is to manually analyze the misuse types of third-party components in depth and to design corresponding questions for those misuse types, thereby constructing a question-answer tree. It mainly comprises the following steps:

(2-1) Based on public third-party component misuse datasets, manually analyze the misuse types of third-party components in depth. The resulting misuse types fall into four categories: obsolete function misuse, return value misuse, calling sequence misuse and parameter misuse;

(2-2) Design corresponding query questions for each misuse type, to be used in constructing the question-answer model. The query questions fall into seven categories: whether the function is outdated, whether the function has a return value, under which conditions the function returns a value, what the return value is under each of those conditions, whether other functions need to be called beforehand, whether other functions need to be called afterwards, and what the parameter type is;

(2-3) Select documents to be tested from the document warehouse, manually annotate them according to the designed questions, and construct a question-answer tree structure for each third-party component document.
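One possible in-memory shape for such a question-answer tree is sketched below. The four misuse types and seven questions come from steps (2-1) and (2-2). The grouping of questions under misuse types, the `QANode` class and the `build_tree` helper are assumptions for illustration, since the text does not specify the tree layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed grouping of the seven questions under the four misuse types.
QUESTIONS_BY_MISUSE = {
    "obsolete function misuse": ["Is the function outdated?"],
    "return value misuse": [
        "Does the function have a return value?",
        "Under which conditions does the function return a value?",
        "What is the return value under each of these conditions?",
    ],
    "calling sequence misuse": [
        "Do other functions need to be called beforehand?",
        "Do other functions need to be called afterwards?",
    ],
    "parameter misuse": ["What is the parameter type?"],
}


@dataclass
class QANode:
    """A tree node: a label plus either an answer (leaf) or child nodes."""
    label: str
    answer: Optional[str] = None
    children: List["QANode"] = field(default_factory=list)


def build_tree(function_name: str, answers: Dict[str, str]) -> QANode:
    """Build root -> misuse type -> question leaves; unanswered leaves stay None."""
    root = QANode(label=function_name)
    for misuse_type, questions in QUESTIONS_BY_MISUSE.items():
        branch = QANode(label=misuse_type)
        for question in questions:
            branch.children.append(QANode(label=question, answer=answers.get(question)))
        root.children.append(branch)
    return root


# Hypothetical annotation for a single function.
tree = build_tree("foo_open", {"Is the function outdated?": "no"})
```

Keeping unanswered questions as leaves with `answer=None` makes it easy to see which parts of a document still need annotation.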

In the invention, the core of step (3) is to use a question-answer model to extract usage rules at fine granularity. Empirically, the large number of loosely structured and stylistically diverse third-party component documents cannot be processed by methods such as regular expressions. Therefore, to process a large number of third-party component documents of different styles automatically, the invention provides a question-answer-model-based fine-grained extraction method for third-party component documents, which mainly comprises the following steps:

(3-1) Split the documents annotated in step (2-3) into a training set and a validation set at a ratio of 8:2, iteratively train a natural language processing model developed on the basis of RoBERTa on the training set, and validate it on the validation set until the loss function converges;

(3-2) Use the trained model to perform fine-grained extraction on the remaining third-party component documents to be tested in the document warehouse. For each question, the model takes the answer with the highest confidence probability as the correct answer, and it generates a question-answer tree for each document to be tested. Fine-grained usage rules are then generated for each third-party component from the output of the question-answer model.
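The data handling in step (3-1) can be sketched without committing to any particular training framework. The 8:2 ratio is from the text; the shuffling, the fixed seed and the convergence criterion (recent loss changes all below a tolerance) are assumptions, as the text only says that training continues until the loss function converges.

```python
import random


def split_documents(docs, train_ratio=0.8, seed=0):
    """Shuffle the annotated documents and split them 8:2 by default."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def has_converged(losses, tol=1e-3, patience=3):
    """Assumed criterion: the last `patience` loss changes are all below `tol`."""
    if len(losses) <= patience:
        return False
    recent = losses[-(patience + 1):]
    return all(abs(a - b) < tol for a, b in zip(recent, recent[1:]))


# Hypothetical document identifiers standing in for annotated documents.
train_set, validation_set = split_documents([f"doc_{i}" for i in range(10)])
```

In a training loop, `has_converged` would be called on the per-epoch validation losses to decide when to stop.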

The following describes each module:

1. The third-party component document preprocessing module performs coarse-grained filtering of third-party component usage rules by using an attention model. As shown in FIG. 3, the process is as follows:

First, the documents of the third-party components recorded in the Linux help documents are collected. Third-party components whose documentation is unclear or seriously incomplete are filtered out;

Then, according to the characteristics of third-party component usage rules, the attention model is used to perform coarse-grained filtering of the third-party component document content, retaining the sentences with an emphasized tone, to obtain the coarse-grained filtered usage rules.

2. The document question-answer tree construction module obtains the misuse types of third-party components, designs query questions for each type, and has the answers to those questions manually annotated in the documents to be tested, so as to construct document question-answer trees. As shown in FIG. 4, the process is as follows:

First, based on public third-party component misuse datasets, the misuse types of third-party components are manually analyzed in depth. The resulting misuse types fall into four categories: obsolete function misuse, return value misuse, calling sequence misuse and parameter misuse;

Then, corresponding query questions are designed for each misuse type, to be used in constructing the question-answer model. The query questions are: whether the function is outdated, whether the function has a return value, under which conditions the function returns a value, what the return value is under each of those conditions, whether other functions need to be called beforehand, whether other functions need to be called afterwards, and what the parameter type is;

Finally, documents to be tested are selected from the document warehouse and manually annotated according to the designed questions, and a question-answer tree structure is constructed for each third-party component document.

3. The question-answer-based third-party component usage rule extraction module obtains fine-grained usage rules for third-party components by combining the questions designed for the third-party component misuse types with a natural language processing model based on RoBERTa. As shown in FIG. 5, the process is as follows:

First, the documents annotated by the document question-answer tree construction module are split into a training set and a validation set at a ratio of 8:2; the RoBERTa model is iteratively trained on the training set and validated on the validation set until the loss function converges;

Then, the trained model is used to perform fine-grained extraction on the remaining third-party component documents to be tested in the document warehouse. For each question, the model takes the answer with the highest confidence probability as the correct answer, and it generates a question-answer tree for each document to be tested. Fine-grained usage rules are generated for each third-party component from the output of the question-answer model.
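The answer-selection step can be sketched as follows: for each question, keep the candidate answer with the highest confidence, then bundle the selected answers into one rule record per function. The input layout, the record layout, the function names and the confidence values are all hypothetical; the text only states that the highest-confidence answer is taken as correct.

```python
def pick_best_answers(candidates):
    """Map each question to its highest-confidence answer.

    candidates: {question: [(answer_text, confidence), ...]}
    """
    return {
        question: max(scored, key=lambda pair: pair[1])[0]
        for question, scored in candidates.items()
        if scored  # skip questions for which no candidate was produced
    }


def to_usage_rule(function_name, best_answers):
    """Bundle the selected answers into one fine-grained rule record."""
    return {"function": function_name, "answers": dict(best_answers)}


# Made-up model output for a hypothetical function `foo_open`.
candidates = {
    "Does the function have a return value?": [("yes", 0.93), ("no", 0.07)],
    "Do other functions need to be called afterwards?": [
        ("foo_close", 0.81),
        ("foo_reset", 0.12),
    ],
}
rule = to_usage_rule("foo_open", pick_best_answers(candidates))
```

A record of this kind could then be handed to a downstream misuse detector, for example to check that every call to `foo_open` is eventually followed by `foo_close`.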

The above embodiments describe the technical solutions and advantages of the invention in detail. It should be understood that they are only specific examples of the invention and are not intended to limit it; any modification, addition or equivalent substitution made within the scope of the principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A third-party component document fine-grained automatic extraction method based on a question-answer model is characterized by comprising the following steps:

2. The third-party component document fine-grained automatic extraction method based on a question-answer model according to claim 1, wherein, when the third-party component documents are preprocessed, third-party components whose documentation is unclear or seriously incomplete are filtered out.

3. The third-party component document fine-grained automatic extraction method based on a question-answer model according to claim 1, wherein the attention model works as follows: sentences containing emphatic modal words are retained in the document, and content irrelevant to usage rules is filtered out.

4. The third-party component document fine-grained automatic extraction method based on a question-answer model according to claim 3, wherein the emphatic modal words are at least one of can, could, may, might, must, need, ought to, dare, dared, shall, should, will, and would.

5. The method for fine-grained automatic extraction of a third-party component document based on a question-and-answer model according to claim 1, wherein the types of misuse of the third-party component include obsolete function misuse, return value misuse, call order misuse and parameter misuse.

6. The third-party component document fine-grained automatic extraction method based on a question-answer model according to claim 5, wherein the corresponding questions of the question-answer model comprise: a. whether the function is outdated; b. whether the function has a return value; c. under which conditions the function returns a value; d. what the return value of the function is under each of the above conditions; e. whether other functions need to be called beforehand; f. whether other functions need to be called afterwards; g. what the parameter type is; each question is provided with optional answers.

7. The third-party component document fine-grained automatic extraction method based on a question-answer model according to claim 1, wherein the natural language processing model in step 3 is a RoBERTa model; in the training process, the coarse-grained usage rules of a document are used as the input of the model, the model outputs a confidence for each candidate answer to each question, the answer with the highest confidence is taken as the result, and a question-answer tree is generated for each document.

8. A question-answering model-based third-party component document fine-grained automatic extraction system for implementing the method of claim 1, wherein the extraction system comprises:

a question-answer-based third-party component usage rule extraction module, used for dividing the annotated documents into a training set and a validation set, and training the natural language processing model on the training set until its accuracy on the validation set meets a preset requirement; and for using the trained natural language processing model to perform fine-grained mining on the coarse-grained usage rules of the remaining documents in the document warehouse whose answers have not been annotated.