CN114943229B - Multi-level feature fusion-based software defect named entity identification method - Google Patents


Info

Publication number
CN114943229B
CN114943229B
Authority
CN
China
Prior art keywords
defect
software
word
software defect
BiLSTM
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210396841.3A
Other languages
Chinese (zh)
Other versions
CN114943229A (en)
Inventor
郑炜
廖慧玲
王晓龙
吴潇雪
成婧源
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202210396841.3A priority Critical patent/CN114943229B/en
Publication of CN114943229A publication Critical patent/CN114943229A/en
Application granted granted Critical
Publication of CN114943229B publication Critical patent/CN114943229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3696 Methods or tools to render software testable
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing


Abstract

The invention discloses a software defect named entity recognition method based on multi-level feature fusion. The method first acquires and preprocesses a target text data set of defect reports whose named entities are to be recognized. More word features are extracted through word embeddings at different levels, and the features of the different levels are fused to obtain the final word embedding, which serves as the input of a BiLSTM network. Features are learned by the BiLSTM with an added attention mechanism, which reduces the inconsistency of entity labels in longer defect documents, and the predicted label sequence is finally obtained through a CRF layer. The invention improves the accuracy of software defect named entity recognition, and thus recognizes the software defect entities in defect reports.

Description

Multi-level feature fusion-based software defect named entity identification method
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a method for identifying a software defect named entity.
Background
Named entity recognition (Named Entity Recognition, NER) is an important basic tool in application fields such as information extraction, question-answering systems, syntactic analysis, machine translation and knowledge graphs, and plays an important role in bringing natural language processing technology into practical use. Named entity recognition refers to recognizing entities with specific meanings in text, mainly including personal names, place names, organization names and proper nouns. With the explosive growth of data in various application fields and the maturation of NER technology, NER applications have penetrated into vertical fields such as business, finance, electronic medical records, network security, biomedicine, military, ecological management and agriculture. When NER processes unstructured text data, it can solve problems such as varied entity forms and fuzzy semantics, and extract key information from the text. NER has therefore gained widespread attention from researchers at home and abroad. In the field of software defects, defect reports are important defect description documents that contain large amounts of unstructured data. Researchers have been working on extracting key information from defect reports to address specific tasks in the software defect domain, such as defect prediction, defect localization and defect repair. Classifying the defect entities in defect reports as named entities allows more key information to be extracted from the reports.
In recent years, deep learning technology has achieved notable results in the field of software defect named entities, owing to the capability for representation learning and semantic composition conferred by vector representations and neural processing. BiLSTM, one of the main sequence feature extractors, can successfully extract word-level context features. However, BiLSTM requires the memory information from previous steps and the embedding of the current word as inputs, and therefore cannot extract global information.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a software defect named entity recognition method based on multi-level feature fusion. The method first acquires and preprocesses a target text data set of defect reports whose named entities are to be recognized. More word features are extracted through word embeddings at different levels, and the features of the different levels are fused to obtain the final word embedding, which serves as the input of a BiLSTM network. Features are learned by the BiLSTM with an added attention mechanism, which reduces the inconsistency of entity labels in longer defect documents, and the predicted label sequence is finally obtained through a CRF layer. The invention improves the accuracy of software defect named entity recognition, and thus recognizes the software defect entities in defect reports.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
step 1: selecting a data source: selecting a defect report from the open source project, wherein the selected defect report is a repaired defect;
step 2: software defect named entity class definition: the software defect named entities are divided into 7 classes: program language, application program interface, environment, user interface, platform, security, and software standard;
step 3: processing a data set;
preprocessing the text information of the data source obtained in step 1, the preprocessing comprising three parts: word segmentation, part-of-speech tagging and sequence labeling; word segmentation and part-of-speech tagging are implemented with python's natural language toolkit NLTK; for sequence labeling, the words in the data source are labeled manually: the software-defect-related vocabulary and phrases defined in step 2 are found in the text information, completing the software defect named entity labeling task;
step 4: verifying the marking accuracy based on a card classification method;
labeling entities in the samples using a card sorting method, and measuring the difference in results between different annotators with the Fleiss Kappa coefficient;
in the card sorting method, two or more members extract the entities of a sample, and the final entity type is decided from each member's labeling result: if the labeling results of the members are consistent, that label is the final result for the sample; if the labeling results between members are inconsistent, the members discuss with each other to reach a final result;
step 5: an input layer;
using word embeddings of different levels as input: word-level word embeddings obtained by Word2Vec, character n-gram word embeddings obtained by FastText, morpheme-level word embeddings obtained by Morph2Vec, character-level word embeddings, and orthographic character-level word embeddings; the outputs of these five word embedding models are fused and input into the BiLSTM model, capturing the orthographic, morphological and contextual information of the words in the defect report;
step 6: a feature encoding layer;
based on the context information extracted by the BiLSTM, extracting the attention distribution between words using a self-attention mechanism, and normalizing the attention distribution with a softmax function;
step 7: label prediction layer;
the CRF layer is used to sequentially label software defect named entities.
Further, the defect reports come from 4 open source projects hosted on Bugzilla and Jira, namely Mozilla, Spark, Eclipse and Hadoop.
Further, the 7 classes of the software defect naming entity are respectively: program language, application program interface, environment, user interface, platform, security, and software standard;
(1) Programming language: defects related to the development language to which the current defect belongs, including mainstream object-oriented languages, procedural languages and structured query languages;
(2) Application program interface: API elements that refer to libraries and frameworks that developers use to program;
(3) Environment: comprising 4 subcategories: software tools, software libraries, development frameworks, and general-purpose software tools;
(4) User interface: refers to defects associated with the graphical user interface;
(5) And (3) a platform: refers to a software or hardware platform;
(6) Safety: refers to a flaw related to code security or software security;
(7) Software standard: including standard specifications in the field of software engineering.
Further, the labeling of the sequence labeling label is divided into 3 stages, wherein the first stage maintains a dictionary of software defect entities, and the dictionary is a corresponding table of the software defect entities and entity categories; the second stage carries out manual inspection through double verification, namely each defect report is independently inspected at least twice by two participants, each participant marks the data of four different projects, and the respective marking results can be rapidly inspected by utilizing a software defect entity dictionary; if the marked results are inconsistent, the participants discuss and agree, and finally unify the software defect entity dictionary; the third stage is to verify the accuracy of the mark based on card classification.
Further, the Fleiss Kappa coefficient is an index for verifying the consistency of the labeled result data of the experiment, specifically as follows:
Let N be the total number of evaluated objects, n the number of evaluators, T the number of evaluation categories, and n_ij the number of evaluators who assign the i-th evaluated object to the j-th category. The Fleiss Kappa coefficient is calculated as:

κ = (P̄ − P̄_e) / (1 − P̄_e), with P_i = (1/(n(n−1))) (Σ_{j=1}^{T} n_ij² − n), P̄ = (1/N) Σ_{i=1}^{N} P_i, P̄_e = Σ_{j=1}^{T} ((1/(Nn)) Σ_{i=1}^{N} n_ij)²

wherein P̄ indicates the relative observed agreement between evaluators, P̄_e represents the hypothetical probability of chance agreement, and P_i indicates the degree to which the evaluators agree on the i-th object;
the Fleiss Kappa coefficient calculations were divided into 5 groups to represent different levels of consistency: 0 to 0.20 shows extremely low uniformity, 0.21 to 0.40 shows general uniformity, 0.41 to 0.60 shows moderate uniformity, 0.61 to 0.80 shows high uniformity, and 0.81 to 1 shows almost complete uniformity.
Further, FastText uses the subword information of words: each word w_i is represented by a set of character n-grams; boundary symbols < and > are added at the beginning and end of each word, allowing prefixes and suffixes to be distinguished from other character sequences; in addition, the word w_i itself is included in its n-gram set so that a representation of each word is learned;
Assuming the n-gram dictionary has size G, given a word w, let 𝒢_w ⊂ {1, …, G} denote the set of n-grams appearing in w; each n-gram g is represented as a vector z_g, and a word is represented by the sum of the vector representations of its n-grams, so the scoring function is calculated as shown in formula (4):

s(w, c) = Σ_{g∈𝒢_w} z_gᵀ v_c (4)

wherein v_c represents the context vector;
Finally, FastText can generate character n-gram word embeddings from the vectors of the character n-grams.
Further, the feature encoding layer is composed of BiLSTM and attention; BiLSTM is composed of a forward LSTM and a backward LSTM, and the two LSTM hidden vectors are concatenated into a context vector h_t = [h_t^→ ; h_t^←]; the vector values of the fused word embeddings of different levels are passed to the BiLSTM; the LSTM gate functions are represented by equation (5), the cell state update equation by equation (6), and the output X_t of the BiLSTM layer is calculated by equation (7):

f_t = σ(W_f[X_{t−1}, E_t] + b_f), i_t = σ(W_i[X_{t−1}, E_t] + b_i), o_t = σ(W_o[X_{t−1}, E_t] + b_o) (5)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c[X_{t−1}, E_t] + b_c) (6)
X_t = o_t ⊙ tanh(c_t) (7)

wherein h_t^→ represents the forward LSTM hidden vector, h_t^← the backward LSTM hidden vector, f_t, o_t and i_t respectively represent the forget gate, output gate and input gate (the degree to which information is retained), σ represents the logistic function sigmoid, W and b are parameters of the affine transformations, E_t is the input word at time t, X_{t−1} is the output of the BiLSTM layer at time t−1, and c_t, c_{t−1} are the cell states at times t and t−1, respectively. Context-dependent semantic features are obtained through the BiLSTM, and the attention distribution between words is then calculated in each feature space. The self-attention mechanism comprises two parts, scaled dot-product attention and multi-head attention; the dot-product attention comprises a query matrix (Q, Query), a key matrix (K, Key) and a value matrix (V, Value), whose weights are updated automatically during network training, with the calculation formula shown as formula (8):

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V (8)

wherein d_k represents the dimension of the word vector/hidden layer.
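Formula (8) can be illustrated with a minimal numpy sketch (shapes and variable names are illustrative assumptions; in the method, Q, K and V are derived from the BiLSTM outputs):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Formula (8): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise attention logits
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V, weights

# Toy self-attention over 4 tokens with d_k = 8 (Q = K = V)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))        # stand-in for BiLSTM output vectors
out, w = scaled_dot_product_attention(H, H, H)
```

Each row of `w` is a normalized attention distribution over the tokens, matching the softmax normalization described for the feature encoding layer.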
Further, the input of the CRF layer is the output of the self-attention layer, expressed by formula (9):

h^{attn} = [h_1^{attn}, h_2^{attn}, …, h_N^{attn}] (9)

wherein h^{attn} represents the output of the self-attention layer and h_i^{attn} the self-attention output for each query vector;
Given a tag sequence Y = {y_1, y_2, …, y_N}, the joint conditional probability of the tag sequence Y and the input h^{attn} is calculated as shown in formula (10):

p(Y | h^{attn}; W, T) = Π_{t=1}^{N} ψ(y_{t−1}, y_t, h_t^{attn}) / Σ_{Y′∈Y(s)} Π_{t=1}^{N} ψ(y′_{t−1}, y′_t, h_t^{attn}) (10)

where Y(s) is the set of all possible tag sequences for sentence s, ψ is the score function, W and T represent the parameters in the CRF layer, and θ denotes the model parameters; in training the model, the negative log-likelihood function L(θ) = −Σ log p(Y | h^{attn}; W, T) is used as the objective function.
The beneficial effects of the invention are as follows:
the method for identifying the software defect named entity based on multi-level feature fusion can improve the accuracy of identifying the software defect named entity, so that the software defect entity in the defect report is identified, and the classification standard of the software defect named entity is provided. Compared with the BiLSTM-CRF model for identifying the software defect named entity, the evaluation index F1-Score for predicting the effectiveness of the software defect named entity based on the multi-level feature fusion method improves the effectiveness by 5.02%.
Drawings
FIG. 1 is a diagram of a method of the present invention.
FIG. 2 is a schematic diagram of a summary of the number of various types of entities in a data set according to an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
As shown in fig. 1, a method for identifying a software defect named entity based on multi-level feature fusion includes the following steps:
step 1: and selecting a data source. The defect report chosen in this example was from 4 open source projects in Bugzilla and Jira, mozilla, spark, eclipse and Hadoop, respectively. As a first screening condition, the selected defect reports are all repaired defects, because the unrepaired defect reports do not guarantee the validity of the defects.
The total number of defect reports selected was 94000, and the extracted defect report text included header and description information. From these defect reports, a random sampling is made to form 4 sets of defect report samples, including 3000 defect reports from Mozilla, 2000 defect reports from Spark, 1500 defect reports from Eclipse and 1500 defect reports from Hadoop, which will be used to construct the corpus. Each word or symbol in the text content of the defect report is a token, and the entity is the smallest semantic unit of a single token or multiple related tokens.
Step 2: software defect named entity class definition. As shown in FIG. 2, the invention classifies software defect named entities into 7 classes: programming language, application program interface, environment, user interface, platform, security, and software standard.
The programming language category refers to defects related to the development language to which the current defect belongs; different development languages handle defects, and ways of resolving them, differently. Entities of this class include mainstream object-oriented languages (e.g., Java), procedural languages (e.g., C), the structured query language SQL, and the like.
Application program interface class, which refers to API elements of libraries and frameworks that a developer can use to program, such as packages, classes, methods, functions, etc., by identifying API classes, code-level errors can be located.
The environment category contains 4 subcategories: software tools, such as the development tool Eclipse; software libraries, which are program collections integrating general functions, such as NumPy; development frameworks, which are code collections serving software development, such as ASP.NET; and other software application types, namely generic software tools that may be used in the software development process, such as Markdown.
The user interface category refers to defects associated with a graphical user interface. With the increasing popularity and acceptance of smart devices, a large amount of user-facing software has developed simple, easy-to-operate user interfaces. Problems and defects of the interface encountered during user operation belong to the user interface category.
The platform category mainly refers to software or hardware platforms, such as central processing unit instruction sets (e.g., x86), hardware architectures (e.g., Mac), operating systems, and system kernels (e.g., Android, iOS).
Security refers to defects associated with code security or software security. Security-related defects may cause significant losses during software development and maintenance, so they typically have high priority and importance. Identifying security-class named entities in defect reports facilitates the preferential repair of such defects and benefits software development.
Software standards, including standard specifications in the field of software engineering, such as data formats, network protocols, software design patterns, etc., and acronyms for standard software techniques, such as AJAX.
Table 1. Software defect domain defect named entity classification criteria
Step 3: data set processing. The text information obtained in step 1 is preprocessed, comprising three parts: word segmentation, part-of-speech tagging and sequence labeling. The words in the data source are then manually sequence-labeled: the software-defect-related vocabulary and phrases defined in step 2 are found in the text information to complete the software defect named entity labeling task.
For the word segmentation and part-of-speech tagging stages, python's natural language toolkit NLTK is used. To better preserve entity characteristics, the regular expressions in NLTK are modified to better match entities. For example, 'onDestroy()' is a function; NLTK splits the 'onDestroy()' entity into 3 tokens: 'onDestroy', '(' and ')'. After modifying NLTK's regular matching rules, the 'onDestroy()' entity is treated as a whole during word segmentation. By rewriting the regular matching rules, the entity representation is kept more complete.
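As an illustrative sketch of this idea, a token pattern in the spirit of NLTK's RegexpTokenizer can be expressed with Python's built-in re module (the pattern and function name here are assumptions, not the patent's actual rules): an identifier immediately followed by "()" is matched as one token, so an API entity is not split apart.

```python
import re

# Illustrative pattern: match "name()" as a single token before falling
# back to plain word characters and single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+\(\)|\w+|[^\w\s]")

def tokenize(text):
    """Split text into tokens, keeping function-call entities whole."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("The app crashes when onDestroy() is called twice."))
```

Without the leading `\w+\(\)` alternative, the same input would split the call into three tokens, losing the entity boundary.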
In the sequence labeling stage, the corpus labeling work proceeds in 3 stages. The first stage maintains a software defect entity dictionary; the dictionary is a correspondence table between software defect entities and entity categories, which helps improve labeling efficiency. The second stage performs manual inspection through double verification: each defect report is independently inspected at least twice by two participants, each of whom annotates the data of the four different projects; the software defect entity dictionary can be used to rapidly check the respective labeling results. If the labeling results are inconsistent, the participants discuss and reach agreement, finally unifying the software defect entity dictionary. The third stage verifies labeling accuracy based on the card sorting method.
Step 4: verifying labeling accuracy based on the card sorting method. The card sorting method is used to label the entities in the samples, and the difference in results between different annotators is measured by the Fleiss Kappa coefficient. In the card sorting method, two or more members extract the entities of a sample, and the final entity type is decided from each member's labeling result: if the labeling results of several members are consistent, that label is the final result for the sample; if the labeling results between members are inconsistent, the members discuss with each other to reach the final result.
The Fleiss Kappa coefficient is an important index for verifying the consistency of the labeled result data of this experiment, and is a great aid in assessing the degree of agreement between raters. Let N be the total number of evaluated objects, n the number of evaluators, T the number of rating categories, and n_ij the number of evaluators who assign the i-th evaluated object to the j-th category. The coefficient is calculated as:

P_i = (1/(n(n−1))) (Σ_{j=1}^{T} n_ij² − n) (1)
P̄ = (1/N) Σ_{i=1}^{N} P_i, P̄_e = Σ_{j=1}^{T} ((1/(Nn)) Σ_{i=1}^{N} n_ij)² (2)
κ = (P̄ − P̄_e) / (1 − P̄_e) (3)

Applying formulas (1), (2) and (3) to the labeled data yields κ = 0.84. The Fleiss Kappa coefficient can range from −1 to 1 but typically falls between 0 and 1, and is usually divided into 5 groups to represent different levels of consistency: 0 to 0.20 (extremely low consistency), 0.21 to 0.40 (general consistency), 0.41 to 0.60 (moderate consistency), 0.61 to 0.80 (high consistency), 0.81 to 1 (almost complete consistency).
Step 5: input layer. Word embeddings of different levels are used as the method's inputs, including word-level word embeddings from Word2Vec, character n-gram word embeddings from FastText, morpheme-level word embeddings from Morph2Vec, character-level word embeddings, and orthographic character-level word embeddings. By using these word embedding models, the goal is to capture the orthographic, morphological and contextual information of the words in the defect report.
The word embedding at each level is described as follows:
word2Vec Word embedding uses the Skip-gram model to construct Word vectors with parameters set as follows: the word vector dimension size is 100; the context window size is 5; the total number of iterations is 40.
FastText is an extension of Word2Vec that is comparatively better at capturing representations of professional and rare vocabulary in the software defect field. The purpose of FastText is to utilize subword information: each word w_i is represented by a set of character n-grams. Special boundary symbols < and > are added at the beginning and end of the word, allowing prefixes and suffixes to be distinguished from other character sequences. In addition, the word w_i itself is included in its n-gram set so that a representation of each word is learned (in addition to the character n-grams). Taking w_i = where and n = 3 as an example, w_i is represented by the character n-grams:
<wh, whe, her, ere, re>, and also includes the special sequence of the word itself: <where>.
Assuming the n-gram dictionary has size G, given a word w, let 𝒢_w ⊂ {1, …, G} denote the set of n-grams appearing in w. Each n-gram g is represented as a vector z_g, and a word is represented by the sum of the vector representations of its n-grams. The scoring function is thus obtained as shown in formula (4):

s(w, c) = Σ_{g∈𝒢_w} z_gᵀ v_c (4)

where v_c is the context vector. FastText is therefore able to form the vector representation of a word from the vectors of its character n-grams, so that word embeddings can be generated from n-grams even for out-of-vocabulary words.
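The n-gram decomposition and the scoring function of formula (4) can be sketched as follows (a minimal illustration with made-up vectors; the function names are assumptions, not FastText's actual API):

```python
import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams of a word with boundary symbols < and >, plus the
    special sequence of the whole word itself, as described above."""
    w = "<" + word + ">"
    grams = [w[i:i + n] for i in range(len(w) - n + 1)]
    grams.append(w)                      # the whole-word sequence <word>
    return grams

print(char_ngrams("where"))              # reproduces the example in the text

def score(word, z, v_c, n=3):
    """Formula (4): s(w, c) = sum over n-grams g in the word of z_g . v_c,
    where z maps each n-gram to its vector and v_c is the context vector."""
    return sum(np.dot(z[g], v_c) for g in char_ngrams(word, n))
```

Because the score depends only on n-gram vectors, an unseen word still receives a representation from its character n-grams.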
Morph2Vec is another learning model that utilizes subword information to learn word embeddings. The algorithm uses a list of candidate morphological segmentations, produced by an unsupervised morphological segmentation system, for all words in the training data. Assuming each word has multiple candidate morphological segmentation sequences, the final word representation is a weighted sum of the morpheme-level embeddings of all segmentations of the word. An attention mechanism on top of the model learns the weights, assigning more weight to the correct segmentation of the word. The model proposed by the invention incorporates morpheme-level word embeddings obtained from pre-trained Morph2Vec embeddings.
Orthographic character encoder. Alphabetic characters are encoded as "c" ("C" if the character is uppercase), numeric characters are encoded as "n", punctuation is encoded as "p", and all other characters as "x". For example, the word "navigator.cookieEnabled" is encoded as "cccccccccpccccccCcccccc". Each orthographic code is also padded with 0s to the length of the longest word in the dataset, so that all words have a fixed orthographic embedding length. This reduces sparsity and captures the shapes and orthographic patterns in words. A BiLSTM, which is simply a combination of two different LSTMs (a forward and a backward LSTM, one processing the sequence in order and the other in reverse), is used to train the orthographic character-level embedding. The outputs of the forward and backward LSTMs are concatenated to obtain the final orthographic character-level word embedding.
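A minimal sketch of this orthographic encoder (the function name and padding parameter are illustrative):

```python
import string

def ortho_encode(word, width=None):
    """Orthographic character encoder: letters -> 'c' ('C' if uppercase),
    digits -> 'n', punctuation -> 'p', anything else -> 'x';
    optionally 0-padded to a fixed width (the longest word's length)."""
    out = []
    for ch in word:
        if ch.isalpha():
            out.append("C" if ch.isupper() else "c")
        elif ch.isdigit():
            out.append("n")
        elif ch in string.punctuation:
            out.append("p")
        else:
            out.append("x")
    code = "".join(out)
    if width is not None:
        code = code.ljust(width, "0")    # pad so all codes share one length
    return code

print(ortho_encode("navigator.cookieEnabled"))
```

The resulting code strings are then fed, character by character, to the BiLSTM that learns the orthographic character-level embedding.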
Similar to the orthographic word embedding, another BiLSTM is used to learn character-level word embeddings. To this end, the BiLSTM is fed with the character embeddings of each word. By concatenating the LSTM output vectors of the forward and backward directions, the character-level word embedding is finally obtained.
The final word embedding of the model is obtained by concatenating the Word2Vec, FastText, Morph2Vec, orthographic character-level and character-level word embeddings, as shown in formula (4.1):

E_i = [E_i^{w2v}; E_i^{ft}; E_i^{m2v}; E_i^{orth}; E_i^{char}] (4.1)

After the different levels of word embedding are concatenated, dropout is applied to the final word embedding E_i. This prevents the model from relying on only one type of word embedding; therefore, to ensure better generalization ability, the dropout rate is set to r = 0.5 during training.
Step 6: feature encoding layer. Based on the context information extracted by the BiLSTM, a self-attention mechanism is used to extract the attention distribution between words, and a softmax function is used to normalize the attention distribution.
The feature encoding layer consists of BiLSTM and attention. BiLSTM consists of a forward LSTM and a backward LSTM, and the two LSTM hidden vectors are concatenated into a context vector h_t = [h_t^→ ; h_t^←]. In this way, bidirectional context information is obtained efficiently and more hidden features can be mined. The vector values obtained by fusing the different levels of word embedding in the first layer are passed to the BiLSTM unit; the LSTM gate functions can then be represented by equation (5), the cell state update by equation (6), and the output X_t of the BiLSTM layer can be calculated from equation (7):

f_t = σ(W_f[X_{t−1}, E_t] + b_f), i_t = σ(W_i[X_{t−1}, E_t] + b_i), o_t = σ(W_o[X_{t−1}, E_t] + b_o) (5)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c[X_{t−1}, E_t] + b_c) (6)
X_t = o_t * tanh(c_t)   (7)
In equation (5), σ denotes the sigmoid logistic function, and W and b are the parameters of the affine transformation. In equations (5), (6) and (7), f_t, o_t and i_t denote the forget gate, output gate and input gate, respectively, and tanh denotes the hyperbolic tangent function. Context-dependent semantic features are thus obtained through the BiLSTM. The attention distribution among words is then calculated in each feature space; the self-attention mechanism is better at capturing the internal correlations of the features and reducing the model's dependence on external features. The self-attention mechanism comprises two parts, scaled dot-product attention and multi-head attention. The dot-product attention comprises a query matrix (Q, Query), a key matrix (K, Key) and a value matrix (V, Value), whose weights are updated automatically during network training. The calculation formula is shown in equation (8).
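Equation (8) is the standard scaled dot-product attention. A minimal numpy sketch of the self-attention step described above follows; the sequence length and hidden size are illustrative assumptions (the patent does not fix them):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax, as used to normalise the attention scores
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # attention distribution (eq. 8)
    weights = softmax(scores, axis=-1)     # normalised with softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 4, 8                              # 4 words, hidden size 8 (illustrative)
H = rng.normal(size=(n, d_k))              # BiLSTM outputs X_t stacked per word
out, w = scaled_dot_product_attention(H, H, H)  # self-attention: Q = K = V = H
```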
Step 7: the prediction layer is marked. The present invention uses the CRF layer to sequentially label software defect naming entities in view of the dependencies between successive labels.
In view of the dependencies between successive tags, the CRF layer is used here for sequence labeling. The input to the CRF layer is the output of the self-attention layer and can be expressed by equation (9).
Given a tag sequence Y = {y_1, y_2, …, y_N}, the joint conditional probability of the tag sequence Y and the input h_attn is calculated as shown in equation (10).
Y(s) is the set of all possible tag sequences for sentence s, and W and T represent parameters of the CRF layer in the score function. When training the model, the negative log-likelihood function is used as the objective function.
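At prediction time the CRF layer outputs the tag sequence in Y(s) with the highest score; this maximization is conventionally performed with the Viterbi algorithm. A minimal numpy sketch under assumed emission and transition score matrices follows (the function and variable names are illustrative, not taken from the patent):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.
    emissions: (N, T) per-position tag scores from the previous layer;
    transitions: (T, T) scores, transitions[i, j] = score of tag i -> tag j."""
    N, T = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((N, T), dtype=int)
    for t in range(1, N):
        # total[i, j]: best score ending at position t with transition i -> j
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for t in range(N - 1, 0, -1):          # follow back-pointers
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

# Toy example: 3 words, 2 tags, no transition preference
emissions = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
transitions = np.zeros((2, 2))
print(viterbi_decode(emissions, transitions))  # → [0, 1, 0]
```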
The software defect named entity recognition algorithm based on multi-level feature fusion is summarized as follows.
Specific examples:
The effectiveness of the MNER method designed by the invention is verified by comparing it with the baseline CRF and BiLSTM-CRF methods on four open-source project datasets.
1. Model evaluation datasets. The validity of the software defect domain named entity recognition method is verified on datasets from four projects: Mozilla, Spark, Eclipse and Hadoop. The data samples are first randomly shuffled and then divided into a training set, a validation set and a test set. As shown in Table 2, the dataset with entity labels is divided into training, validation and test sets. The Mozilla dataset contains 1494 defect reports for training, 427 for validation and 214 for testing; the Spark dataset contains 1288 defect reports for training, 368 for validation and 185 for testing; the Eclipse dataset contains 791 defect reports for training, 226 for validation and 113 for testing; the Hadoop dataset contains 737 defect reports for training, 210 for validation and 107 for testing.
Table 2 data set partitioning case
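The shuffle-then-split procedure described above can be sketched as follows. The 7:2:1 proportions are inferred from the counts in Table 2 (e.g. Mozilla: 1494/427/214 out of 2135 reports) and the seed value is an illustrative assumption:

```python
import random

def split_dataset(reports, train_frac=0.7, val_frac=0.2, seed=42):
    """Randomly shuffle defect reports, then split into train/val/test.
    The 7:2:1 proportions are an assumption inferred from Table 2."""
    reports = list(reports)
    random.Random(seed).shuffle(reports)   # randomly shuffle the samples
    n = len(reports)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (reports[:n_train],
            reports[n_train:n_train + n_val],
            reports[n_train + n_val:])

# Mozilla: 2135 defect reports in total
train, val, test = split_dataset(range(2135))
print(len(train), len(val), len(test))  # → 1494 427 214
```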
2. Evaluation metrics. To effectively evaluate the performance of the different classifiers, a multi-dimensional evaluation is performed using recall, precision, F1-score, accuracy and statistical analysis methods. These metrics are described below.
For one sample, the classification result has four possible outcomes:
-TP (True Positive): a positive sample predicted as positive;
-FP (False Positive): a negative sample predicted as positive;
-TN (True Negative): a negative sample predicted as negative;
-FN (False Negative): a positive sample predicted as negative.
According to these outcomes, each evaluation metric and its calculation formula are as follows:
Recall: the proportion of all positive samples that the model correctly predicts as positive; the specific calculation formula is: Recall = TP / (TP + FN).
Precision: the proportion of samples predicted as positive that are truly positive; the specific calculation formula is: Precision = TP / (TP + FP).
F1-score (F1 value): the harmonic mean of precision and recall; a higher F1 value means more accurate predictions of the model. The specific calculation formula is: F1 = 2 × Precision × Recall / (Precision + Recall).
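The three metrics follow directly from the TP/FP/FN counts; a minimal sketch with made-up counts (the numbers are illustrative, not experimental results):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from classification counts."""
    precision = tp / (tp + fp)             # predicted positives that are correct
    recall = tp / (tp + fn)                # actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Illustrative counts only
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, round(f1, 4))  # → 0.9 0.75 0.8182
```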
experimental results:
RQ1: the influence of features at different levels on the performance of the MNER software defect named entity recognition method;
The MNER method uses different word embedding methods to fuse multi-level features. To verify the effectiveness of the multi-level features in the MNER method, comparison experiments are carried out with the features of different levels as experimental variables. Table 3 shows the experimental results of the MNER method with different levels of features on the Mozilla, Spark, Eclipse and Hadoop datasets, where w2v denotes Word2Vec, ft denotes the word n-gram level embedding method FastText, m2v denotes Morph2Vec, char denotes the BiLSTM-trained character embedding, and orth denotes the orthographic character-level embedding.
TABLE 3 Performance index of different levels of features in MNER method
/>
Among the single word embedding methods, the word-level method Word2Vec performs best, with an F1 value of up to 88.97% and an average of 86.71% across the four datasets, showing that word-level embedding is far more effective than the character-based embeddings used alone. Adding FastText embedding, character embedding, Morph2Vec embedding and orthographic character-level embedding to Word2Vec in turn, the results show that combining Word2Vec with orthographic character-level embedding yields the best improvement: the average precision on the four datasets reaches 89.63%, the average recall 88.43%, and the average F1 value 89.02%; the other embeddings also improve recognition to some extent. Experiments show that the MNER method fusing all word embedding methods achieves the highest average precision of 91.19%, average recall of 89.93% and average F1 value of 90.55%.
RQ2: performance comparison of the MNER method with the baseline CRF and BiLSTM-CRF methods;
The F1 values of the MNER method and of the baseline CRF and BiLSTM-CRF methods on the four datasets are shown in Table 4. As can be seen from Table 4, among all methods the MNER method achieves the highest F1 values, with an average F1 value of 90.55%, and its F1 results are stable across the four datasets.
Table 4 comparison of different method performance indicators
/>

Claims (8)

1. A method for identifying a software defect named entity based on multi-level feature fusion is characterized by comprising the following steps:
step 1: selecting a data source: selecting a defect report from the open source project, wherein the selected defect report is a repaired defect;
step 2: software defect named entity class definition: the software defect named entities are divided into 7 classes: program language, application program interface, environment, user interface, platform, security, and software standard;
step 3: processing a data set;
preprocessing the text information of the data source obtained in the step 1, wherein the preprocessing comprises three parts of word segmentation, part-of-speech tagging and sequence tagging; for word segmentation and part-of-speech tagging, using the natural language toolkit NLTK of python for implementation; for the sequence labeling label, manually labeling the sequence of words in the data source, finding out the vocabulary and the phrase related to the software defect defined in the step 2 in the text information, and completing the task of labeling the software defect naming entity;
step 4: verifying the marking accuracy based on a card classification method;
marking entities in the sample by adopting a card classification method, and measuring the difference of results among different marking personnel by using a Fleiss Kappa coefficient;
the card classification method is characterized in that two or more members extract the entity of the sample, the final entity type is judged according to the marked result of each member sample, and if the marked results of a plurality of members are consistent, the label is the final result of the sample; if the marked results among the members are inconsistent, the members discuss each other to obtain a final result;
step 5: an input layer;
using different levels of word embedding as input: word-level word embedding obtained by Word2Vec, character n-gram word embedding obtained by FastText, morphology-level word embedding obtained by Morph2Vec, character-level word embedding and orthographic character-level word embedding; these five word embeddings are input into a BiLSTM model to capture the glyph, morphology and context information of the words in the defect report;
step 6: a feature encoding layer;
based on the context information extracted by BiLSTM, extracting the attention distribution among words by using a self-attention mechanism, and normalizing the attention distribution by using a SOFTMAX function;
step 7: marking a prediction layer;
the CRF layer is used to sequentially label software defect named entities.
2. The method for identifying a software defect named entity based on multi-level feature fusion according to claim 1, wherein the defect reports come from 4 open-source projects hosted on Bugzilla and Jira, namely Mozilla, Spark, Eclipse and Hadoop.
3. The method for identifying a software defect named entity based on multi-level feature fusion according to claim 1, wherein the 7 classes of the software defect named entity are respectively: program language, application program interface, environment, user interface, platform, security, and software standard;
(1) Programming language: defects related to the development language to which the current defect belongs, including mainstream object-oriented languages, procedural languages and structured query languages;
(2) Application program interface: refers to API elements of the libraries and frameworks that developers use when programming;
(3) Environment: comprising 4 subcategories: software tools, software libraries, development frameworks, and general-purpose software tools;
(4) User interface: refers to defects associated with the graphical user interface;
(5) Platform: refers to a software or hardware platform;
(6) Security: refers to defects related to code security or software security;
(7) Software standard: including standard specifications in the field of software engineering.
4. The method for identifying a software defect named entity based on multi-level feature fusion according to claim 1, wherein the labeling of the sequence is divided into 3 stages, and the first stage maintains a dictionary of software defect entities, which is a table of correspondence between software defect entities and entity categories; the second stage carries out manual inspection through double verification, namely each defect report is independently inspected at least twice by two participants, each participant marks the data of four different projects, and the respective marking results can be rapidly inspected by utilizing a software defect entity dictionary; if the marked results are inconsistent, the participants discuss and agree, and finally unify the software defect entity dictionary; the third stage is to verify the accuracy of the mark based on card classification.
5. The method for identifying the software defect named entity based on multi-level feature fusion according to claim 1, wherein the Fleiss Kappa coefficient is an index for verifying consistency of experimental labeling result data, and is specifically as follows:
let N be the total number of objects to be evaluated, n be the number of evaluators, T be the number of rating categories, and n_ij be the number of evaluators who classified the i-th evaluated object into the j-th category; the calculation formula of the Fleiss Kappa coefficient is:
κ = (P̄ − P̄_e) / (1 − P̄_e)
wherein P̄ indicates the relative observed agreement between evaluators, P̄_e represents the hypothetical probability of chance agreement, and P_i indicates the degree to which the evaluators agree on the i-th evaluated object;
the Fleiss Kappa values are divided into 5 groups representing different levels of agreement: 0 to 0.20 indicates slight agreement, 0.21 to 0.40 fair agreement, 0.41 to 0.60 moderate agreement, 0.61 to 0.80 substantial agreement, and 0.81 to 1 almost perfect agreement.
6. The method for identifying a software defect named entity based on multi-level feature fusion of claim 1, wherein FastText uses the sub-word information of words, each word w_i being represented by a bag of character n-grams; boundary symbols < and > are added at the beginning and end of each word, allowing prefixes and suffixes to be distinguished from other character sequences; in addition, the word w_i itself is included in its n-gram set so that a representation of each word is learned;
assuming a dictionary of n-grams of size G, given a word w, let G_w denote the set of n-grams appearing in w; each n-gram g is represented by a vector z_g, and a word is represented by the sum of the vector representations of its n-grams, so that the scoring function is calculated as shown in formula (4):
s(w, c) = Σ_{g ∈ G_w} z_g · v_c
wherein v_c represents the context vector;
finally, FastText generates the character n-gram word embedding of each word from the vectors of its character n-grams.
7. The method for identifying a software defect named entity based on multi-level feature fusion of claim 1, wherein the feature encoding layer consists of a BiLSTM and an attention mechanism, the BiLSTM consists of a forward LSTM and a backward LSTM, and the two hidden vectors of the LSTMs are concatenated into a context vector; the vector values of the different levels of word embedding are passed to the BiLSTM, the LSTM gate functions are represented by equation (5), the cell state update equation by equation (6), and the output X_t of the BiLSTM layer is calculated by equation (7);
X_t = o_t * tanh(c_t)   (7)
wherein the forward and backward LSTM hidden vectors are concatenated, the sigmoid outputs indicate the degree to which information is retained, f_t, o_t and i_t denote the forget gate, output gate and input gate respectively, σ denotes the sigmoid logistic function, W and b are parameters of the affine transformation, E_t denotes the input word at time t, X_{t-1} denotes the output of the BiLSTM layer at time t-1, and c_t and c_{t-1} denote the cell states at time t and time t-1, respectively; context-dependent semantic features are obtained through the BiLSTM, and the attention distribution among words is calculated separately in each feature space; the self-attention mechanism comprises two parts, scaled dot-product attention and multi-head attention; the dot-product attention comprises a query matrix (Q, Query), a key matrix (K, Key) and a value matrix (V, Value), whose weights are updated automatically during network training, and the calculation formula is shown in formula (8):
Attention(Q, K, V) = softmax(QK^T / √d_k) V
wherein d_k represents the dimension of the word vector/hidden layer.
8. The method for identifying a software defect named entity based on multi-level feature fusion according to claim 1, wherein the input of the CRF layer is the output of the self-attention layer, represented by formula (9):
wherein h_attn represents the output of the self-attention layer, and each of its components represents the self-attention output for one query vector;
given tag sequence y= { Y 1 、y 2 、…、y N Obtaining tag sequence Y and input h attn The calculation formula of the joint conditional probability of (2) is shown as formula (10):
where Y(s) is the set of all possible tag sequences for sentence s, s(·) is the score function, W and T represent parameters in the CRF layer, and θ represents the model parameters; when training the model, the negative log-likelihood function is used as the objective function.
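The Fleiss Kappa calculation in claim 5 can be sketched as follows; the function name and the toy rating matrix are illustrative assumptions, and the formula follows the standard definition of the coefficient:

```python
import numpy as np

def fleiss_kappa(n_ij):
    """n_ij: (N, T) matrix where n_ij[i, j] is the number of evaluators who
    assigned object i to category j; every row sums to the rater count n."""
    n_ij = np.asarray(n_ij, dtype=float)
    N, T = n_ij.shape
    n = n_ij[0].sum()                                        # evaluators per object
    P_i = (np.square(n_ij).sum(axis=1) - n) / (n * (n - 1))  # per-object agreement
    P_bar = P_i.mean()                                       # observed agreement
    p_j = n_ij.sum(axis=0) / (N * n)                         # category proportions
    Pe_bar = np.square(p_j).sum()                            # chance agreement
    return (P_bar - Pe_bar) / (1 - Pe_bar)

# Perfect agreement: all 3 evaluators pick the same category for each object
perfect = [[3, 0], [0, 3], [3, 0]]
print(round(fleiss_kappa(perfect), 2))  # → 1.0
```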
CN202210396841.3A 2022-04-15 2022-04-15 Multi-level feature fusion-based software defect named entity identification method Active CN114943229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210396841.3A CN114943229B (en) 2022-04-15 2022-04-15 Multi-level feature fusion-based software defect named entity identification method

Publications (2)

Publication Number Publication Date
CN114943229A CN114943229A (en) 2022-08-26
CN114943229B true CN114943229B (en) 2024-03-12

Family

ID=82907168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210396841.3A Active CN114943229B (en) 2022-04-15 2022-04-15 Multi-level feature fusion-based software defect named entity identification method

Country Status (1)

Country Link
CN (1) CN114943229B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832781A (en) * 2017-10-18 2018-03-23 扬州大学 A kind of software defect towards multi-source data represents learning method
CN110442860A (en) * 2019-07-05 2019-11-12 大连大学 Name entity recognition method based on time convolutional network
CN111783462A (en) * 2020-06-30 2020-10-16 大连民族大学 Chinese named entity recognition model and method based on dual neural network fusion
WO2020215456A1 (en) * 2019-04-26 2020-10-29 网宿科技股份有限公司 Text labeling method and device based on teacher forcing
WO2020232882A1 (en) * 2019-05-20 2020-11-26 平安科技(深圳)有限公司 Named entity recognition method and apparatus, device, and computer readable storage medium
CN114169330A (en) * 2021-11-24 2022-03-11 匀熵教育科技(无锡)有限公司 Chinese named entity identification method fusing time sequence convolution and Transformer encoder


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Huali; Kang Xiaodong; Li Bo; Wang Yage; Liu Hanqing; Bai Fang. Named entity recognition in Chinese electronic medical records using Bi-LSTM-CRF combined with an attention mechanism. Journal of Computer Applications. 2020, (S1), full text. *
Zheng Wei; Chen Junzheng; Wu Xiaoxue; Chen Xiang; Xia Xin. An empirical study of deep-learning-based security bug report prediction methods. Journal of Software. 2020, (05), full text. *

Also Published As

Publication number Publication date
CN114943229A (en) 2022-08-26

Similar Documents

Publication Publication Date Title
CN110196906B (en) Deep learning text similarity detection method oriented to financial industry
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN113221567A (en) Judicial domain named entity and relationship combined extraction method
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN112016313B (en) Spoken language element recognition method and device and warning analysis system
Jiang et al. Learning numeral embedding
CN113383316B (en) Method and apparatus for learning program semantics
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN117076653A (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
Logeswaran et al. Sentence ordering using recurrent neural networks
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN113836896A (en) Patent text abstract generation method and device based on deep learning
CN115481635A (en) Address element analysis method and system
CN114416991A (en) Method and system for analyzing text emotion reason based on prompt
Alshahrani et al. Hunter Prey Optimization with Hybrid Deep Learning for Fake News Detection on Arabic Corpus.
CN114943229B (en) Multi-level feature fusion-based software defect named entity identification method
CN110807096A (en) Information pair matching method and system on small sample set
Liu et al. Learning conditional random fields with latent sparse features for acronym expansion finding
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system
CN113821571A (en) Food safety relation extraction method based on BERT and improved PCNN
Zhou et al. A hybrid approach to Chinese word segmentation around CRFs
Tang et al. Software Knowledge Entity Relation Extraction with Entity‐Aware and Syntactic Dependency Structure Information
Misal et al. Transfer Learning for Marathi Named Entity Recognition
Xu et al. Incorporating forward and backward instances in a bi-lstm-cnn model for relation classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant