CN115964719A - Method and system for identifying safety defect report - Google Patents

Publication number
CN115964719A
Authority
CN
China
Prior art keywords
defect report
report
text
unit
defect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211705508.2A
Other languages
Chinese (zh)
Inventor
张贺
谭睿
毛润丰
周鑫
荣国平
邵栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Kuangji Information Technology Co ltd
Original Assignee
Nanjing Kuangji Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Kuangji Information Technology Co ltd
Priority to CN202211705508.2A
Publication of CN115964719A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Abstract

The invention discloses a method and a system for identifying security defect reports. A defect report is uploaded as a file, its description and summary are parsed, and a deep learning network judges whether it is a security defect report; the security defect report results within a project are counted to obtain the security defect report data of the project; and submitted defect reports are managed according to their identified security status. The technical scheme of the embodiment of the invention accurately identifies the security state of the software from the defect reports submitted during software development and maintenance, and supports identification management of historical defect reports and data statistics of security defect reports, so that practitioners can identify and manage security defect reports efficiently, security vulnerabilities can be repaired earlier, and the security of the software is effectively protected.

Description

Method and system for identifying safety defect report
Technical Field
The invention relates to the technical field of software development, in particular to a method and a system for identifying a safety defect report.
Background
Software security concerns everyone's interests: as software is applied across industries, people's personal information and property security become closely tied to software security. The main source of software security problems is security vulnerabilities in the software itself. A defect report is the textual carrier of a software defect and one of the information sources for identifying security vulnerabilities. A security defect report is a defect report that describes a software security defect; it gives direction for repairing the security vulnerability and carries a higher repair priority and stricter confidentiality. Identifying security defect reports early helps developers repair security vulnerabilities more quickly and reduces the risk of the software being attacked. However, personnel who lack software security knowledge find it difficult to distinguish security defect reports from other defect reports. Even professional security engineers need a significant amount of time to identify security defect reports, because in actual development projects the number of defect reports is large and security defect reports make up only a small share of them.
In the prior art, identification methods are mostly based on machine learning. Machine learning methods require part of the work to be completed manually and have an extremely high false alarm rate. Existing deep learning methods do not address the effect of unbalanced data sets. At present there is no system for identifying security defect reports that helps practitioners improve identification efficiency, so security defect reports are often repaired late or disclosed late.
Disclosure of Invention
The invention aims to provide a method and a system for identifying security defect reports that solve the problems described in the background, help practitioners identify security defect reports more efficiently, allow security vulnerabilities to be repaired as early as possible, and reduce the risk of the software being attacked. Based on an analysis of the requirements of actual software security maintenance, the invention designs and discloses, for the first time, a system that automatically identifies security defect reports, manages them effectively, and visualizes the identification results. The invention also provides the security defect report identification method used in the system. The method draws on the strengths of deep learning, addresses the difficulty posed by unbalanced data sets, and uses data set rebalancing techniques to improve the efficiency with which a deep learning model identifies security defect reports.
The invention provides the following technical scheme: a system for identifying security defect reports comprises a security defect report identification module, a historical defect report management module and a defect report statistics module;
the security defect report identification module is used for extracting the valid text information of a defect report from a file uploaded by a user, preprocessing the text, and feeding it into a trained deep learning model for security defect report identification;
the historical defect report management module is used for managing the defect reports uploaded and identified by the user, including searching, filtering, re-identifying, viewing the details of, and deleting historical defect reports;
the defect report statistics module is used for producing security defect report statistics according to the project, time range and defect tracking system selected by the user, such as the daily trend in the number of security defect reports and the proportion of security defect reports;
Further, the security defect report identification module comprises a user input receiving unit, a text parsing unit, a text preprocessing unit, a text feature extraction unit, a trained model loading unit and a result prediction unit;
the user input receiving unit is used for receiving the defect report file uploaded by the user, the selected identification model and the entered project name;
the text parsing unit is used for parsing, from the file uploaded by the user, the text segments required for analyzing a security defect report, such as the summary and description of the defect report;
the text preprocessing unit is used for deleting useless information from the text and unifying letter case;
the text feature extraction unit is used for vector-encoding the text and converting it into a data representation that the model can take as input;
the trained model loading unit is used for loading a deep learning model trained on the original corpus, such as TextCNN, or TextRNN based on BiLSTM and LSTM;
the result prediction unit is used for feeding the encoded vector representation into the model for security defect report identification and then returning the result.
Further, the historical defect report management module comprises a historical defect report list unit, a single defect report detail unit, a defect report deletion unit and a single defect report identification record update unit;
the historical defect report list unit is used for querying and displaying the list of historical defect reports that meet the filter conditions;
the single defect report detail unit is used for querying the information of a single defect report together with all of its identification records, and supports exporting the identification records;
the defect report deletion unit is used for deleting defect reports that the user no longer wants to keep, together with their identification records;
the single defect report identification record update unit is used for re-identifying a given historical defect report and then storing the new identification record.
Further, the defect report statistics module comprises a security defect report chart display unit and a statistical result export unit;
the security defect report chart display unit is used for computing and displaying a trend chart of the number of security defect reports and a proportion chart for a given project or time period;
the statistical result export unit is used for exporting all identification records.
A security defect report identification method comprises the following steps:
S1: Text preprocessing. First, format parsing is performed to extract the required valid text. The extracted text is then preprocessed, including removing invalid information, converting to lower case, and word segmentation;
S2: Text feature extraction. The preprocessed text is converted into a vector representation;
S3: Data set balancing. To reduce the influence of unbalanced data, a data balancing strategy is applied to the training set before model training to balance the positive and negative samples;
S4: Classification model training and evaluation. The balanced training set is fed into a classification model for training, and the performance of the classification model is then evaluated.
Wherein, the step S1 specifically includes:
S101: First, format parsing is performed to extract the required valid text, such as the summary and description of a defect report;
S102: The text is processed by removing carriage returns, punctuation marks, overly long or overly short words, and stop words, converting the text to lower case, segmenting it into words, and so on.
Wherein, the step S2 specifically includes:
S201: A Word2Vec model is trained on the corpus;
S202: The vector representation of each word is obtained from the Word2Vec training result, a word embedding matrix is built, and the matrix is fed into the classification model;
S203: A pre-trained BERT model is loaded, the hidden-layer vector encoding of each word is obtained and concatenated, and the resulting word vector representation is fed into the model for training.
Wherein, the step S3 specifically includes:
S301: One of the following balancing algorithms is selected to balance the defect report training set: random oversampling, random undersampling, synthetic minority oversampling, or whole-set replication combined with random oversampling.
Wherein, the step S4 specifically includes:
S401: TextCNN and TextRNN models based on BiLSTM and LSTM are constructed.
S402: The processed vector representation of the training set is fed into the model for training, and the relevant hyperparameters are tuned.
S403: The effect of the security defect report classifier is verified using the test set data.
In step S403, the effect of the security defect report classifier is measured by Accuracy, Recall, Precision and F-Measure.
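The text names the four metrics without giving formulas. The following is a minimal sketch that assumes the standard definitions over a binary confusion matrix, with "security defect report" as the positive class and F-Measure taken as the balanced F1 score; the function name and the example counts are illustrative only and do not come from the patent.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/tn/fn are true/false positives/negatives, where the positive
    class is "security defect report" (assumed convention).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts for a heavily imbalanced test set (made-up numbers).
metrics = classification_metrics(tp=2, fp=5, tn=90, fn=1)
```

On imbalanced data, accuracy can stay high while precision is low, which is why the text reports all four metrics rather than accuracy alone.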
Compared with the prior art, the invention has the following beneficial effects:
1. The invention designs and provides, for the first time, a security defect report identification system that helps practitioners improve identification efficiency. The system parses the text submitted in a defect report to obtain the valid text information, predicts whether the report is security related from its summary and description, aggregates the prediction results of historical defect reports into statistical charts for display, and provides effective management functions for historical defect reports. As a result, during day-to-day continuous development and maintenance, security defect reports are identified more efficiently from the submitted defect reports, security vulnerabilities are repaired earlier, and overall maintenance time is reduced;
2. A security defect report identification method based on deep learning and data rebalancing is provided. The method preprocesses the text in a defect report, addresses the small proportion of security defect reports with four data set balancing strategies, obtains the features of the defect report with word embedding based on Word2Vec and a BERT model, feeds the features into a TextCNN network and TextRNN networks based on BiLSTM and LSTM for classification, and verifies the trained security defect report classifier with four metrics. This improves the efficiency of identifying security defect reports and helps practitioners strengthen the security of the software.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is an overall framework diagram of the security defect report identification method;
FIG. 2 is a structural diagram of the TextCNN in the security defect report identification method;
FIG. 3 is a structural diagram of the TextRNN in the security defect report identification method;
FIG. 4 is a block diagram of the security defect report identification system;
FIG. 5 is a technical framework diagram of the security defect report identification system;
FIG. 6 is a pseudocode diagram of the algorithm that combines whole-set replication with random oversampling.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1: The invention provides a system and a method for identifying security defect reports. The system, designed and implemented for the first time to help practitioners identify security defect reports more efficiently, comprises three modules: security defect report identification, historical defect report management, and defect report data statistics. It adopts a front-end/back-end separated architecture, is developed with the Flask and Vue frameworks, and uses a MySQL database for persistent data storage. The security defect report identification method is based on deep learning, with the following flow: in the data preprocessing stage, a rebalancing technique is applied to the data set to mitigate its severe imbalance; then, considering the influence of text vector sparsity on the classification result, Word2Vec and BERT models are used to extract features from the text. For classification, the method adopts a classic TextCNN network and TextRNN networks based on BiLSTM and LSTM. The specific details are as follows:
A security defect report identification system comprises a security defect report identification module, a historical defect report management module and a defect report statistics module;
1. The security defect report identification module is used for extracting the valid text information of a defect report from a file uploaded by a user, preprocessing the text, and feeding it into a trained deep learning model for security defect report identification;
the security defect report identification module comprises a user input receiving unit, a text parsing unit, a text preprocessing unit, a text feature extraction unit, a trained model loading unit and a result prediction unit;
1.1 The user input receiving unit is used for receiving the defect report file uploaded by the user, the selected identification model and the entered project name.
1.2 The text parsing unit is used for parsing, from the file uploaded by the user, the text segments required for analyzing a security defect report, such as the summary and description of the defect report. The specific steps are as follows:
Step 1.2.1: At the front end, the XML files exported by the Bugzilla and JIRA defect tracking systems are parsed with an XML DOM tool; the items of the XML DOM object are traversed, the XML is converted into JSON, and the required text is then read from the JSON. Wherein the JIRA system's defect report name, summary, and description are under the rss. The defect report name and defect summary of the Bugzilla system are under the bug.
1.3 The text preprocessing unit is used for deleting useless information from the text and unifying letter case. The specific steps are as follows:
Step 1.3.1: At the back end, the Handler method that receives the request obtains the file or text sent in the request and then calls the functions encapsulated in the Api layer. First, a preprocessing function is called to remove conjunctions and punctuation, convert the text to lower case, segment it into words, and so on.
1.4 The text feature extraction unit is used for vector-encoding the text and converting it into a data representation that the model can take as input. The specific steps are as follows:
Step 1.4.1: The back end then branches on the content of the request: if the user selected TextCNN or TextRNN, the word vectors are obtained with Word2Vec; if the user selected BERT, the word vectors are obtained with BERT.
1.5 The trained model loading unit is used for loading a deep learning model trained on the original corpus, such as TextCNN, or TextRNN based on BiLSTM and LSTM.
1.6 The result prediction unit is used for feeding the encoded vector representation into the model for security defect report identification and then returning the result.
2. The historical defect report management module is used for managing the defect reports uploaded and identified by the user, including searching, filtering, re-identifying, viewing the details of, and deleting historical defect reports. For each defect report, the module may store the defect report ID, defect report name, defect summary, defect description, source project, source defect tracking system, last identification time, last identification result, upload time, and update time. For each identification record, the module may also store the identification ID, the corresponding defect report ID, the selected identification network, the identification result, and the identification time.
The historical defect report management module comprises a historical defect report list unit, a single defect report detail unit, a defect report deletion unit and a single defect report identification record update unit;
2.1 The historical defect report list unit is used for querying and displaying the list of historical defect reports that meet the filter conditions. The specific steps are as follows:
Step 2.1.1: The filter conditions entered by the user at the front end are obtained, such as the report number, report file name, defect summary, whether the defect is a security defect, and upload time.
Step 2.1.2: The defect report table is filtered according to the conditions from the front end, the result is packaged and returned to the front end, and the front end presents it as a list.
2.2 The single defect report detail unit is used for querying the information of a single defect report together with all of its identification records, and supports exporting the identification records. The specific steps are as follows:
Step 2.2.1: When the details of a defect report are queried, the defect report ID is used to look up both the information in the defect report table and all of the report's records in the identification record table.
Step 2.2.2: The query result is packaged into a file, which is then exported and downloaded through the browser.
2.3 The defect report deletion unit is used for deleting defect reports that the user no longer wants to keep, together with their identification records;
2.4 The single defect report identification record update unit is used for re-identifying a given historical defect report and then storing the new identification record. The specific steps are as follows:
Step 2.4.1: The user selects a defect report, and its detailed information is read.
Step 2.4.2: The detailed information of the report is re-identified by following the steps of the identification method.
Step 2.4.3: The new identification record is saved to the identification record table, and the identification result is returned to the front end for display to the user.
3. The defect report statistics module is used for producing security defect report statistics according to the project, time range and defect tracking system selected by the user, such as the daily trend in the number of security defect reports and the proportion of security defect reports;
the defect report statistics module comprises a security defect report chart display unit and a statistical result export unit;
3.1 The security defect report chart display unit is used for computing and displaying a trend chart of the number of security defect reports and a proportion chart for a given project or time period. The specific steps are as follows:
Step 3.1.1: First, the counts are queried from the database, and the statistics are then computed in the business logic layer: the result of the user's most recent identification is taken as the final result for each defect report, and the number of security defect reports is then counted.
Step 3.1.2: the Echarts component is used for the graphical presentation in the page. The front end uses the base line graph and the fillet annular pie graph in Echarts to show the trend of the number of safety defect reports per day and the total safety defect ratio. The yAxis parameter in the base line graph is a list of the number of safety defect reports per day obtained from the back end, and xAxis is a list of dates. The data in the rounded annular pie chart are the total number of safe defect reports and the number of non-safe defect reports obtained from the back end.
3.2 the statistic result deriving unit is used for deriving all identification records.
Example 2: a safety defect report-oriented identification method comprises the following execution steps:
step 1: text preprocessing, firstly, format analysis is carried out to extract a required effective information text. After the text is obtained, the text is preprocessed, including removing invalid information, converting to lower case, word segmentation, etc.
Step 11: the derived defect report is obtained by the defect tracking system, and the specific defect report derived by the defect tracking system can be in an XML format or a JSON format.
Step 12: the obtained file, such as an XML format, adopts a DOM parsing tool to extract the text content under the description tag and the metadata tag in the file, the description tag, i.e., the detailed description of the defect in the defect report, contains a lot of effective information, and the summary of the defect in the metadata tag, i.e., the defect report, is highly summarized information.
Step 13: the description and summary of the defect report are concatenated.
Step 14: removing punctuation, carriage return and numbers in the text, then removing stop words contained in a stop word set provided in the nltk packet, then removing overlength or overlength words, then unifying letters into lowercase, and then segmenting words in the text;
Step 2: Text feature extraction. The preprocessed text is converted into a vector representation;
Step 21: A Word2Vec model is trained on the corpus. A Word2Vec-based model must be trained on the corpus of the task to learn a vector representation for each word; the default setting is used here, in which a single word is represented as a 100-dimensional vector. Word2Vec offers two model variants, CBOW and Skip-gram; the Word2Vec implementation packaged in gensim uses CBOW by default and can be switched to Skip-gram for training by setting the sg parameter to 1.
Step 22: A Tokenizer is fitted to assign a number to each word; the numbers are ordered by word frequency, so the more frequent a word, the smaller its number. The Tokenizer is then used to convert the words of every sentence into numbers.
Step 23: The fixed sentence length is set to 2 times the average of all sentence lengths. A word embedding matrix is built from the number of words and the vector dimension of a single word.
Step 24: loading a BERT-Base-Uncast pre-training model, obtaining the vector code of the hidden layer corresponding to each word, splicing, and obtaining the vector representation of the word sent to the model for training.
And 3, step 3: data set balancing. In order to relieve the influence of unbalanced data, before model training is carried out, a data balance strategy is adopted for carrying out positive and negative sample balance on a training set;
step 31: and selecting a balance algorithm such as random oversampling, random undersampling, synthesis of few oversampling and integral replication to combine a random oversampling method to balance the defect report training set.
Step 311: the random oversampling method is characterized in that in a defect report training set, one of safety defect reports is randomly extracted and copied every time, so that the number of the safety defect reports and the number of the non-safety defect reports reach a certain proportion.
Step 312: the random undersampling method randomly selects one defect report from the non-safety defect report data set each time to delete, and repeats for multiple times.
Step 313: and synthesizing a few oversampling methods, firstly selecting a central point, and calculating K sample points which are most adjacent to the central point according to Euclidean distance. The connection of the central point and each of its adjacent points defines a straight line and a distance, and then a random function is used to generate a decimal between 0 and 1, and the randomly generated decimal defines a position on the connecting line, i.e. the position of the new sample point.
Step 314: and (3) calculating the final required number N _ new of the safety defect reports according to the specified balanced positive and negative sample proportion by combining integral replication with a random oversampling method, and then calculating that the new sample number is K times of the original sample number N, namely N _ new = K × N. If K is an integer, copying each original security defect report sample by K-1 times, and randomly selecting one sample for copying each time for N times by random oversampling because N few samples are needed to meet the balance requirement. If K is a decimal, expressing K _ int as an integer part of K, expressing K _ decimal as a decimal part of K, copying all safety defect reports by K _ int times, and obtaining the number of the remaining required safety defect reports by random oversampling, wherein the number of the remaining safety defect reports is K _ decimal multiplied by N.
Step 4: Classification model training and evaluation. The balanced training set is fed into a classification model for training, and the performance of the model is then evaluated;
Step 41: TextCNN and TextRNN models based on BiLSTM and LSTM are constructed.
Step 411: The word embedding matrix built with Word2Vec is fed into the TextCNN model and the TextRNN model to construct the embedding layer; passing through the embedding layer, each training sample yields a two-dimensional matrix of maxLen × EmbeddingLen, where maxLen is the fixed sentence length.
Step 412: Next, the convolutional layer is applied. The method uses convolution kernels of three sizes, with lengths 1, 2 and 3 and width EmbeddingLen, which consider single words, two-word groups and three-word groups respectively; there are 64 kernels of each size.
Step 413: next, the feature size is unified using the largest pooling layer, setting the size of the pooling kernel to
[ maxLen-fLen +1,1], the unified feature size is [1,1, filterNum ]. Where fLen is the length of the convolution kernel and filterNum is the number of each convolution kernel.
Step 414: and splicing the characteristic graphs by using the connecting layer. The multidimensional data input is then converted into a one-dimensional vector using the Flatten layer.
Step 415: in order to avoid the over-fitting phenomenon, a Dropout layer is connected behind the Flatten layer, the number of intermediate features is reduced, and the discard rate can be adjusted and optimized in experiments.
Step 416: finally, the Dense layer is used as a mapping classification, and softmax is used as an activation function.
Step 417: and for the TextRNN, sending the obtained word vector to the next layer of BiLSTM, wherein the layer comprises a forward word input sequence and a reverse word input sequence, the fixed output dimension of the LSTM of each layer is set to be 64 dimensions, and the return sequence state of the LSTM of each layer is set to be true so that the two layers of LSTMs can be spliced according to the output information of each time sequence to obtain the 128-dimensional output vector of maxLen positions.
Step 418: followed by a Dropout layer that can be adjusted in experiments to avoid overfitting. Then a forward LSTM layer is accessed, each position output dimension being 64 dimensions, which changes the variable length input sentence length to a one-dimensional fixed length.
Step 419: and finally, mapping an output sample by adopting a Dense layer to serve as the probability of a safety defect report, wherein the activation function of the output sample is softmax.
Step 42: and transmitting the processed vector representation of the training set into a model for training, adjusting the learning rate by adopting a gradient optimization algorithm, performing back propagation, updating parameters, and manually performing parameter adjustment on related parameters such as drop _ rate, batch _ size and learning _ rate.
Step 43: the effect of the security flaw report classifier is verified using the test set data.
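As an illustration of Steps 412 to 416, the following sketch traces the TextCNN feature path in plain Python rather than in a deep learning framework. It is a minimal sketch under stated assumptions: the toy sizes max_len = 10 and embedding_len = 8, the random weights, and the function names conv_max_pool, textcnn_features and softmax are hypothetical; only the kernel lengths 1, 2 and 3 and the 64 kernels per size come from the text. Bias terms, the nonlinearity after convolution and the Dropout layer are omitted for brevity.

```python
import math
import random


def conv_max_pool(embedded, kernel):
    """Slide one kernel of shape [fLen, embeddingLen] over the sentence and
    max-pool over all maxLen - fLen + 1 positions, giving a single feature."""
    max_len, f_len = len(embedded), len(kernel)
    responses = []
    for start in range(max_len - f_len + 1):
        window = embedded[start:start + f_len]
        responses.append(sum(w * x
                             for k_row, x_row in zip(kernel, window)
                             for w, x in zip(k_row, x_row)))
    return max(responses)


def textcnn_features(embedded, kernels_by_len):
    """Concatenate the pooled feature of every kernel into one flat vector."""
    features = []
    for f_len in sorted(kernels_by_len):
        features.extend(conv_max_pool(embedded, k) for k in kernels_by_len[f_len])
    return features


def softmax(logits):
    """Numerically stable softmax, the activation of the final Dense layer."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy sizes; only filter_num = 64 and lengths 1, 2, 3 are from the text.
rng = random.Random(0)
max_len, embedding_len, filter_num = 10, 8, 64
sentence = [[rng.uniform(-1, 1) for _ in range(embedding_len)]
            for _ in range(max_len)]
kernels = {
    f_len: [[[rng.uniform(-1, 1) for _ in range(embedding_len)]
             for _ in range(f_len)]
            for _ in range(filter_num)]
    for f_len in (1, 2, 3)
}

features = textcnn_features(sentence, kernels)  # flattened vector of 3 * 64 values
dense = [[rng.uniform(-0.1, 0.1) for _ in features] for _ in range(2)]
probs = softmax([sum(w * f for w, f in zip(row, features)) for row in dense])
```

With three kernel sizes and 64 kernels of each size, the flattened vector always has 3 * 64 = 192 entries regardless of maxLen, which is what allows a single Dense layer to follow.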
Example 3: This embodiment uses 4 Apache open source projects and 1 large project. The defect reports of the 4 Apache projects were collected from the JIRA defect tracking system and labeled as security defect reports or non-security defect reports by Ohira et al. The Chromium dataset, from the Google open source project, was collected by the Google breakpa system and comprises 41940 defect reports. The Chromium dataset was labeled with security defect reports and non-security defect reports by the organizers of the Mining Challenge of the 2011 Mining Software Repositories (MSR) conference.
First, the Summary and Description fields in the datasets of this example are preprocessed. Features are then extracted with Word2Vec and BERT, the data is divided into a training set and a test set, and the training set is balanced. The TextCNN and TextRNN models are then trained and verified on the test set. The verification results are shown in the following table:
Project    Accuracy(%)  Precision(%)  Recall(%)  F1(%)
Ambari     97.17        28.57         66.66      40.00
Wicket     96.00        12.50         55.00      21.00
Camel      92.90        21.43         53.00      31.00
Derby      80.10        26.08         66.66      37.50
Chromium   99.98        37.32         74.00      49.12
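The preprocessing and training-set balancing steps used in this example can be sketched as follows. This is an illustrative sketch only: the stop-word list, the word-length bounds MIN_LEN and MAX_LEN, the sample reports and the function names preprocess and random_oversample are all hypothetical, and random oversampling is just one of the balancing strategies the method allows. Only the training set is balanced; the test set keeps its original class distribution.

```python
import random
import re

# Hypothetical stop-word list and length bounds; the source does not specify them.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "and", "or", "it"}
MIN_LEN, MAX_LEN = 2, 20


def preprocess(text):
    """Lower-case, strip line breaks and punctuation, tokenize, filter words."""
    text = text.lower().replace("\r", " ").replace("\n", " ")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [tok for tok in text.split()
            if MIN_LEN <= len(tok) <= MAX_LEN and tok not in STOP_WORDS]


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until classes match."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical training reports: 1 = security defect report, 0 = other.
reports = [
    "Buffer overflow in parser!\r\nCrash when the input is too long.",
    "Typo in the settings dialog",
    "Button label is misaligned",
    "Docs link returns 404",
]
labels = [1, 0, 0, 0]

tokens = [preprocess(report) for report in reports]
balanced_x, balanced_y = random_oversample(tokens, labels)
```

After balancing, both classes contain three samples, so the classifier no longer sees the security class as a small minority during training.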

Claims (10)

1. A security defect report identification system, characterized in that: the identification system comprises a security defect report identification module, a historical defect report management module and a defect report statistics module;
the security defect report identification module is used for extracting the effective text information of a defect report from a file uploaded by a user, preprocessing the text, and feeding it into a trained deep learning model for security defect report identification;
the historical defect report management module is used for managing the defect reports uploaded and identified by the user, including searching, filtering, re-identifying, viewing the details of and deleting historical defect reports;
the defect report statistics module is used for compiling statistics on security defect reports according to the project, time period and defect tracking system selected by the user, such as the daily trend in the number of security defect reports and the proportion of security defect reports.
2. The security defect report identification system according to claim 1, characterized in that: the security defect report identification module comprises a user input receiving unit, a text parsing unit, a text preprocessing unit, a text feature extraction unit, a trained model loading unit and a result prediction unit;
the user input receiving unit is used for receiving the defect report file uploaded by the user, the selected identification model and the entered project name;
the text parsing unit is used for parsing, from the file uploaded by the user, the text fields required for security defect report analysis, such as the summary and description of the defect report;
the text preprocessing unit is used for deleting useless information from the text and unifying letter case;
the text feature extraction unit is used for vector-encoding the text, converting it into a data representation that the model can take as input;
the trained model loading unit is used for loading a deep learning model trained on the original corpus, such as TextCNN, or TextRNN based on BiLSTM and LSTM;
the result prediction unit is used for feeding the encoded vector representation into the model for security defect report identification and then returning the result.
3. The security defect report identification system according to claim 2, characterized in that: the historical defect report management module comprises a historical defect report list unit, a single defect report detail unit, a defect report deletion unit and a single defect report identification record update unit;
the historical defect report list unit is used for querying and displaying the list of historical defect reports that meet the filter conditions;
the single defect report detail unit is used for querying the information of a single defect report and all of its identification records, and supports exporting the identification records;
the defect report deletion unit is used for deleting defect reports that no longer need to be stored, together with their identification records;
the single defect report identification record update unit is used for re-identifying a given historical defect report and then storing the new identification record.
4. The security defect report identification system according to claim 3, characterized in that: the defect report statistics module comprises a security defect report chart display unit and a statistical result export unit;
the security defect report chart display unit is used for computing and displaying a trend chart of the number of security defect reports and a proportion chart for a given project or time period;
the statistical result export unit is used for exporting all identification records.
5. A method for identifying a security defect report, characterized in that the security defect report identification system of any one of claims 1-4 is used, and the method comprises the following steps:
S1: text preprocessing: the format is first parsed to extract the required effective text, and the extracted text is then preprocessed, including removing invalid information, converting to lower case, word segmentation and other operations;
S2: text feature extraction: the preprocessed text is converted into a vector representation;
S3: dataset balancing: to mitigate the influence of imbalanced data, a data balancing strategy is applied to the training set to balance positive and negative samples before model training;
S4: classification model training and evaluation: the balanced training set is fed into the classification model for training, and the performance of the classification model is then evaluated.
6. The method for identifying a security defect report according to claim 5, characterized in that step S1 specifically comprises:
S101: first, the format is parsed to extract the required effective text, such as the summary and description of a defect report;
S102: the text is processed by removing carriage returns, punctuation marks, overly long and overly short words, and stop words, converting to lower case, word segmentation and the like.
7. The method for identifying a security defect report according to claim 5, characterized in that step S2 specifically comprises:
S201: training a Word2Vec model on the corpus;
S202: obtaining the vector representation of each word from the trained Word2Vec model, building a word embedding matrix, and feeding the word embedding matrix into the classification model;
S203: loading a pre-trained BERT model, obtaining the hidden-layer vector encoding of each word, and concatenating these encodings to obtain the word vector representation that is fed into the model for training.
8. The method for identifying a security defect report according to claim 5, characterized in that step S3 specifically comprises:
S301: selecting a balancing algorithm, such as random oversampling, random undersampling, synthetic minority oversampling, or whole-set replication combined with random oversampling, to balance the defect report training set.
9. The method for identifying a security defect report according to claim 5, characterized in that step S4 specifically comprises:
S401: constructing a TextCNN model and a TextRNN model based on BiLSTM and LSTM;
S402: feeding the vector representation of the processed training set into the model for training, and tuning the relevant hyperparameters;
S403: verifying the performance of the security defect report classifier using the test set data.
10. The method for identifying a security defect report according to claim 9, characterized in that: in step S403, the performance of the security defect report classifier is measured by the following four formulas:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy is the proportion of correctly predicted samples among all test samples; TP is the number of samples whose actual value is positive and whose predicted value is also positive; FP is the number of samples whose actual value is negative but whose predicted value is positive; FN is the number of samples whose actual value is positive but whose predicted value is negative; TN is the number of samples whose actual value is negative and whose predicted value is also negative;
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Precision is the proportion of correctly predicted samples among all samples predicted as positive; TP is the number of samples whose actual value is positive and whose predicted value is also positive; FP is the number of samples whose actual value is negative but whose predicted value is positive;
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Recall is the proportion of correctly predicted samples among all samples that are actually positive; TP is the number of samples whose actual value is positive and whose predicted value is also positive; FN is the number of samples whose actual value is positive but whose predicted value is negative;
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
F1 combines Precision and Recall to reflect the overall performance of the security defect report classifier.
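The four metrics can be computed directly from confusion-matrix counts, as in the sketch below. It uses the standard definitions, with FN counted as reports that are actually positive but predicted negative. The function name classification_metrics and the example counts are hypothetical and are not taken from the experiments; returning 0.0 for an empty denominator is an assumption made here for robustness.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 as fractions in [0, 1]."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for an imbalanced test set of 200 reports,
# only 3 of which are actual security defect reports.
m = classification_metrics(tp=2, fp=5, fn=1, tn=192)
```

On imbalanced data such as this, accuracy stays high even when precision is low, which is why precision, recall and F1 are reported alongside it.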
CN202211705508.2A 2022-12-29 2022-12-29 Method and system for identifying safety defect report Pending CN115964719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211705508.2A CN115964719A (en) 2022-12-29 2022-12-29 Method and system for identifying safety defect report

Publications (1)

Publication Number Publication Date
CN115964719A (en) 2023-04-14

Family

ID=87353915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211705508.2A Pending CN115964719A (en) 2022-12-29 2022-12-29 Method and system for identifying safety defect report

Country Status (1)

Country Link
CN (1) CN115964719A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination