CN111782807A - Self-admitted technical debt detection and classification method based on multi-method ensemble learning - Google Patents

Self-admitted technical debt detection and classification method based on multi-method ensemble learning Download PDF

Info

Publication number
CN111782807A
CN111782807A (application CN202010568813.6A)
Authority
CN
China
Prior art keywords
annotation
self
feature
classifier
acceptance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010568813.6A
Other languages
Chinese (zh)
Other versions
CN111782807B (en)
Inventor
殷茗
徐悦然
田嘉毅
朱奎宇
马怀宇
张小港
薛禹坤
吴瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010568813.6A priority Critical patent/CN111782807B/en
Publication of CN111782807A publication Critical patent/CN111782807A/en
Application granted granted Critical
Publication of CN111782807B publication Critical patent/CN111782807B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a self-admitted technical debt detection and classification method based on multi-method ensemble learning, which comprises five steps: preprocessing the feature words; selecting the top k most useful features to train the classifier; training the corresponding sub-classifiers with the Naïve Bayes Multinomial and linear logistic regression (Simple Logistic) methods; integrating the sub-classifiers' predictions through a voting rule to obtain precision and recall, and combining the two into an F1 value as the subsequent evaluation criterion; and finally clustering the features that appear frequently during the experiments and have high information gain values, thereby further classifying the detected technical debt.

Description

Self-admitted technical debt detection and classification method based on multi-method ensemble learning
Technical Field
The invention belongs to the technical field of software development, and particularly relates to a self-admitted technical debt detection and classification method based on multi-method ensemble learning.
Background
The document "Huang Q, Shihab E, Xia X, et al.identifying self-adapting technical ebt in open source project using minimum [ J]Empirical software engineering,2017 discloses a method for automatically detecting self-acceptance technical debts using an integrated classifier. The method uses source code annotations from different software items to analyze the annotations to be detected. The method comprises preprocessing a source file, selecting and screening features by using the features, and using a naive Bayes polynomial (A)
Figure BDA0002548744720000011
Bayesian multinomial) trains each classifier, and finally an integrated classifier composed of a plurality of classifiers predicts according to voting rules to determine whether the statement has self-acceptance technical debt. The method is verified to have a result which is greatly improved compared with a text mode-based method and an NLP classifier method, and the method has excellent operation performance. However, the training method of the classifier is single, and the accuracy of the individual classifier is low, so that the final result is not accurate.
Self-admitted technical debt (SATD) is a term proposed to describe debt intentionally introduced during software development, usually so that a project can be developed more quickly in the short term at the expense of maintenance needed in the future; the focus of the present invention is to detect SATD accurately in order to help mitigate the cost of maintaining it. However, some existing methods still depend heavily on manual detection, and many advanced methods identify SATD automatically with a single natural-language detection model. The drawbacks are obvious: manual detection is inefficient, and a single natural-language pattern classifier has low performance and poor flexibility. Although previous detection work has achieved good results, SATD in real projects is diverse and semantically variable, which makes detecting it a significant challenge. Therefore, in order to improve the accuracy and flexibility of SATD detection, the invention proposes a multi-method ensemble-learning SATD detection method. Taking 8 open-source projects as the data set, it first preprocesses the comment text, extracts features with a feature selection method, then trains the sub-classifiers with the Naïve Bayes Multinomial and Simple Logistic methods, and finally integrates the sub-classifiers into an ensemble classifier that assigns each comment a classification label according to a voting rule, thereby accurately identifying self-admitted technical debt. The experimental results are compared with three baselines (pattern-based detection, single-method ensemble learning, and NLP-based detection); the comparison shows that the proposed SATD detection method is highly accurate, improves recall markedly, achieves a better detection effect, and clearly outperforms previous detection methods.
Disclosure of Invention
Technical problem to be solved
The key technical difficulty of detecting self-admitted technical debt during software development is as follows: the debt lives at the source-code level, but it generally has to be analyzed by examining source code comments and decomposing sentences into feature words, and when the classifier is trained with a single model the resulting error is large. The invention detects self-admitted technical debt through multi-method ensemble learning. First, the feature words are preprocessed: stop words and punctuation marks are removed so that only meaningful words enter feature selection, and invalid features are filtered out to reduce noise. Word similarity is also considered; for example, words sharing a stem, such as "happy" and "happiness", are unified to their stem with the Porter stemmer. Then the top k most useful features are selected to train the classifier. Next, the Naïve Bayes Multinomial and linear logistic regression (Simple Logistic) methods are used to train the corresponding sub-classifiers so that the F1 value of each prediction is as high as possible, improving precision while detecting as much self-admitted technical debt as possible. Finally, the predictions of the trained sub-classifiers are integrated through the sub-classifier voting rule, which determines the final prediction of the ensemble classifier in each round.
Technical scheme
A self-admitted technical debt detection and classification method based on multi-method ensemble learning, characterized by comprising the following steps:
Step 1: preprocessing the feature words
Process the raw comment data using heuristic rules:
(1) delete the fixed-format license description comments generated automatically by the compiler;
(2) merge multi-line comments into one sentence;
(3) delete code present in comment statements;
(4) delete Javadoc that contains no reserved words, and keep the comment sentences that contain reserved words;
Step 2: selecting the top k most useful features to train the classifier
After text preprocessing of the source-project comments, the invention uses a vector space model (VSM) to process the words that have been divided into features. In this model each comment is represented by a word vector, the divided word features can be regarded as dimensions, and each comment can be regarded as a data point in a high-dimensional space. The invention uses a HashMap as the mapping of the VSM model, in which the string key is a divided feature and the double-precision value is the term frequency, i.e. the number of times the feature appears in the current comment, normalized.
Information gain, a widely used feature selection method, is employed to select useful features. Let the comment data set be denoted $C = \{(C_1, L_1), (C_2, L_2), \ldots, (C_N, L_N)\}$, where $C_i$ is the $i$-th comment and $L_i$ is its classification label, i.e. whether ($t$) or not ($\bar{t}$) self-admitted technical debt is present. Let $C_i = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of features in comment $C_i$ and $w_i$ is the $i$-th feature in the comment. Between a feature $w$ and a comment $C_i$ there are 4 possible relationships:
- $(w, t)$: comment $C_i$ contains the feature $w$, and self-admitted technical debt is present in the comment (i.e. $t$);
- $(w, \bar{t})$: comment $C_i$ contains the feature $w$, but no self-admitted technical debt is present in the comment (i.e. $\bar{t}$);
- $(\bar{w}, t)$: comment $C_i$ does not contain the feature $w$, but self-admitted technical debt is present in the comment (i.e. $t$);
- $(\bar{w}, \bar{t})$: comment $C_i$ does not contain the feature $w$, and no self-admitted technical debt is present in the comment (i.e. $\bar{t}$).
Based on the above 4 possible relationships, the information gain of the feature $w$ with respect to the label $t$ is calculated as:
$$IG(w, t) = \sum_{w' \in \{w, \bar{w}\}} \sum_{t' \in \{t, \bar{t}\}} p(w', t') \log \frac{p(w', t')}{p(w')\, p(t')} \qquad (1)$$
where $p(w', t')$ is the probability that the feature $w'$ appears in a comment with label $t'$, $p(w')$ is the probability that the feature $w'$ appears in a comment, and $p(t')$ is the probability of a comment having label $t'$.
After the information gain value corresponding to each feature has been calculated, the features are sorted from largest to smallest information gain value; the higher the score, the more important the feature is for predicting the classification label. The method keeps the features whose information gain values are in the top k% and discards the other features.
Step 3: training sub-classifiers using Naïve Bayes Multinomial and linear logistic regression
(1) Naïve Bayes Multinomial
The invention sets six classifiers, namely classifiers No. 2, 3, 4, 5, 6, and 8, as polynomial naive Bayes classifiers (NBM) and trains them with the NBM method. Let a comment be $C_i = \{w_1, w_2, \ldots, w_n\}$ with classification label $L_i$; then:
$$p(C_i \mid L_i) = p(w_1, w_2, \ldots, w_n \mid L_i) = \prod_{j=1}^{n} p(w_j \mid L_i) \qquad (3)$$
Applying Bayes' theorem to equation (3) yields:
$$p(L_i \mid C_i) = \frac{p(L_i)\, p(C_i \mid L_i)}{p(C_i)} = \frac{p(L_i) \prod_{j=1}^{n} p(w_j \mid L_i)}{p(C_i)} \qquad (4)$$
The classification label of the comment is identified by equation (4).
(2) Simple Logistic
In the experiments, two classifiers, namely classifier No. 1 and classifier No. 7, are set as linear logistic regression classifiers (Simple Logistic). Let the comment data set be denoted $C = \{(C_1, L_1), (C_2, L_2), \ldots, (C_N, L_N)\}$, where $C_i$ is the $i$-th comment and $L_i$ is its classification label, i.e. whether self-admitted technical debt is present. Let $C_i = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of features in comment $C_i$ and $w_i$ is the $i$-th feature in the comment. By the linear logistic regression theorem:
$$z = \theta_1 w_1 + \theta_2 w_2 + \cdots + \theta_n w_n + \theta_0 = \theta^T C_i \qquad (5)$$
Substituting $z$ into the sigmoid function, which is expressed as:
$$h(z) = \frac{1}{1 + e^{-z}} \qquad (6)$$
According to the final value of the sigmoid function, the comments to be detected are divided into two categories; a label value of 1 marks a comment statement containing self-admitted technical debt.
Step 4: sub-classifier voting rule
A voting rule is adopted: the classification label predicted by the majority of the sub-classifiers is taken as the final prediction of the ensemble classifier.
Step 5: clustering for self-admitted technical debt classification
Based on the features selected by information gain in the preceding steps, the invention re-screens the original data according to how often each feature appears, where it appears, and the habits of the developers, and finally classifies the feature words with a clustering method.
The k% in step 2 is 10%.
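The five steps above can be prototyped end to end with standard tooling. The following is a minimal sketch, assuming scikit-learn is available; CountVectorizer plays the role of the HashMap-based VSM of step 2, mutual_info_classif (mutual information) stands in for the information-gain ranking, and LogisticRegression stands in for the Simple Logistic method (whose LogitBoost training is sketched later in the detailed description). All function names and parameters here are illustrative assumptions, not the patented implementation.

```python
# Minimal sketch of the multi-method ensemble (illustrative assumptions throughout).
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_sub_classifiers(projects):
    """projects: list of 8 (comments, labels) pairs, one per source project.
    Sub-classifiers No. 1 and No. 7 use logistic regression (a stand-in for
    Simple Logistic); the other six use Naive Bayes Multinomial (step 3)."""
    subs = []
    for i, (comments, labels) in enumerate(projects, start=1):
        model = LogisticRegression(max_iter=1000) if i in (1, 7) else MultinomialNB()
        pipe = make_pipeline(
            CountVectorizer(lowercase=True),                       # step 2: term-frequency VSM
            SelectPercentile(mutual_info_classif, percentile=10),  # step 2: keep top 10% of features
            model,                                                 # step 3: NBM or logistic regression
        )
        subs.append(pipe.fit(comments, labels))
    return subs

def predict_by_vote(subs, comment):
    """Step 4: the label predicted by the majority of the sub-classifiers wins."""
    votes = [clf.predict([comment])[0] for clf in subs]
    return Counter(votes).most_common(1)[0][0]
```

Each sub-classifier here is fitted on its own source project, matching the leave-one-project-out setup described in the detailed description below.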
Advantageous effects
The self-admitted technical debt detection and classification method based on multi-method ensemble learning provided by the invention solves the optimization problem of self-admitted technical debt detection in the software development process. The invention fully considers the characteristics of self-admitted technical debt detection: after text preprocessing, it innovatively selects features by evaluating their information gain values, improves training performance by applying a different training method to different classifiers, and integrates the sub-classifiers into an ensemble classifier. The final detection prediction is thereby optimized, detection precision and coverage are improved, the classification indicators become balanced, and the detection metrics improve markedly. Finally, a clustering method classifies the features, and the clustering result is used to analyze the type of self-admitted technical debt to which each feature belongs, achieving both detection and classification of self-admitted technical debt.
As a self-admitted technical debt detection and classification technique based on multi-method ensemble learning, the invention fully considers the attributes of self-admitted technical debt in the software development process, with the aim of improving software quality as much as possible and reducing hidden risks during development. It analyzes the characteristics of self-admitted technical debt carefully, quantifies the influence of features with information gain, and refines the classification and detection process. During feature training it innovatively combines Naïve Bayes Multinomial and linear logistic regression for self-admitted technical debt classification and detection, and finally classifies the features with a clustering method to achieve both detection and classification. The experimental results are compared with four other self-admitted technical debt detection methods: the pattern-based method of Potdar and Shihab, single-method ensemble learning, the best single sub-classifier, and a natural-language (NLP) maximum-entropy classifier. Precision, recall, and F1 all improve to different degrees over the four methods; in particular, the overall F1 value improves by 51.87%, 16.22%, 28.76%, and 32.12%, respectively. Finally, the detected features carrying self-admitted technical debt are classified to obtain the final result.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is the curve of information entropy versus probability;
FIG. 3 is the sigmoid function curve.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the invention discloses a self-supporting technology detection classification method based on multi-method ensemble learning. The method mainly comprises five core steps: preprocessing the characteristic words; selecting the first k most useful features to train a classifier; using a naive Bayesian polynomial (
Figure BDA0002548744720000061
Bayes Multinomial) and linear Logistic regression (Simple Logistic) to train corresponding sub-classifiers; and integrating and predicting the prediction result through the sub-classifier voting rules to obtain precision (precision) and recall rate (call), and finally calculating an F1 value (F1-score) as a subsequent evaluation standard by integrating the precision and the recall rate. Finally, clustering the characteristics which frequently appear in the experimental process and have high information gain values by a clustering method, and further classifying the detected technical debts.
Step 1: preprocessing the characteristic words
The invention considers that some characteristic words are invalid, such as stop words, punctuation marks and the like, and also considers the similarity of words, for example: happy, happenses, happier, etc. have similar contents with the same stem, so the words are unified into the stem using the Porter stem algorithm.
The invention processes raw annotation data using heuristic rules:
(1) and deleting license description class comments with a fixed format automatically generated by a compiler, such as comments of functions such as automatically generated constructors and the like and catch code comment blocks automatically generated. The comments before the class declaration also do not usually contain self-acceptance technical debt, so the comments before the class declaration are also deleted.
(2) Developers sometimes write a long annotation in multiple lines rather than directly in the form of an annotation block. This annotation writing can cause a one-sentence-length annotation to be mistaken for a multi-sentence annotation. So the multiple line annotations in this case are merged into one sentence.
(3) In a software project, there is a lot of source code in the form of annotations. These codes are annotated out may be due to the code not being used on the one hand and to the code being used only for debug on the other hand. These codes present in the annotation statement typically do not contain self-acceptance technical liability and can therefore be deleted.
(4) Javadoc annotations generally do not have a self-acceptance technical debt, while some Javadoc that contain a self-acceptance technical debt generally contain reserved words such as "todo", "fixme" or "XXX". Therefore, the invention deletes the Javadoc which does not contain the reserved words and reserves the comment sentences containing the reserved words.
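As a rough illustration of rules (1)-(4), the sketch below filters one raw Java comment; the regular expression and keyword tests are illustrative assumptions, not the patent's exact heuristics.

```python
# Sketch of the heuristic comment preprocessing; the rules shown are assumptions.
import re

RESERVED = ("todo", "fixme", "xxx")        # reserved words that keep a Javadoc (rule 4)
CODE_LIKE = re.compile(r"[;{}()=]\s*$")    # crude signature of commented-out code

def preprocess_comment(raw, is_javadoc=False):
    """Apply rules (1)-(4) to one raw comment; return cleaned text or None."""
    lines = [ln.strip(" \t/*") for ln in raw.splitlines()]
    lines = [ln for ln in lines if ln and not CODE_LIKE.search(ln)]  # rule 3: drop code lines
    text = " ".join(lines)                                           # rule 2: merge into one sentence
    low = text.lower()
    if "license" in low or "copyright" in low:                       # rule 1: license descriptions
        return None
    if is_javadoc and not any(w in low for w in RESERVED):           # rule 4: Javadoc w/o reserved words
        return None
    return text
```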
Step 2: selecting the top k most useful features to train the classifier
After text preprocessing of the source-project comments, the invention uses a vector space model (VSM) to process the words that have been divided into features. In this model each comment is represented by a word vector: the divided word features can be regarded as dimensions, and each comment as a data point in a high-dimensional space. The invention uses a HashMap as the mapping of the VSM model, in which the string key is a divided feature and the double-precision value is the term frequency, i.e. the number of times the feature appears in the current comment, normalized.
Reading the source-project comments shows that text preprocessing leaves each source project with a large number of features; the ArgoUML project, for example, has 3661 features. With a vector space model such high dimensionality degrades experimental performance and can even affect the final result. Simple manual analysis also shows that comments with self-admitted technical debt are a small minority, i.e. the classes are imbalanced, which adds to the difficulty of detecting self-admitted technical debt.
To address these issues, the invention uses feature selection to extract the subset of features most useful for classification when detecting self-admitted technical debt. Previous research and practice show that feature selection can significantly improve the classification performance of a classifier. The invention employs information gain, a widely used feature selection method, to select useful features.
Let the comment data set be denoted $C = \{(C_1, L_1), (C_2, L_2), \ldots, (C_N, L_N)\}$, where $C_i$ is the $i$-th comment and $L_i$ is its classification label, i.e. whether ($t$) or not ($\bar{t}$) self-admitted technical debt is present. Let $C_i = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of features in comment $C_i$ and $w_i$ is the $i$-th feature in the comment. Between a feature $w$ and a comment $C_i$ there are 4 possible relationships:
- $(w, t)$: comment $C_i$ contains the feature $w$, and self-admitted technical debt is present in the comment (i.e. $t$);
- $(w, \bar{t})$: comment $C_i$ contains the feature $w$, but no self-admitted technical debt is present in the comment (i.e. $\bar{t}$);
- $(\bar{w}, t)$: comment $C_i$ does not contain the feature $w$, but self-admitted technical debt is present in the comment (i.e. $t$);
- $(\bar{w}, \bar{t})$: comment $C_i$ does not contain the feature $w$, and no self-admitted technical debt is present in the comment (i.e. $\bar{t}$).
Based on the above 4 possible relationships, the information gain of the feature $w$ with respect to the label $t$ is calculated as:
$$IG(w, t) = \sum_{w' \in \{w, \bar{w}\}} \sum_{t' \in \{t, \bar{t}\}} p(w', t') \log \frac{p(w', t')}{p(w')\, p(t')} \qquad (1)$$
where $p(w', t')$ is the probability that the feature $w'$ appears in a comment with label $t'$, $p(w')$ is the probability that the feature $w'$ appears in a comment, and $p(t')$ is the probability of a comment having label $t'$.
The information gain measures how much knowing whether a feature occurs in the current comment to be tested contributes to predicting the classification label. After the information gain value of each feature is calculated, the features are sorted from largest to smallest; the higher the score, the more important the feature is for predicting the classification label. The invention keeps the features whose information gain values lie in the top k% and discards the rest. This reduces the number of features both in the model-construction stage and in the prediction stage, which greatly improves the efficiency of the experiments. By default, the invention empirically selects the top 10% of the total number of features, which makes the experimental results nearly optimal.
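For concreteness, this selection step can be sketched as below: the information gain of equation (1) is computed from presence counts and the features are ranked. The helper names and the binary (presence/absence) simplification are illustrative assumptions.

```python
# Sketch: rank features by the information gain of equation (1), keep the top k%.
import math
from collections import Counter

def information_gain(comments, labels, feature):
    """comments: list of token sets; labels: list of 0/1 (1 = SATD present)."""
    n = len(comments)
    joint = Counter((feature in c, bool(l)) for c, l in zip(comments, labels))
    p_t = sum(labels) / n                          # p(t): probability a comment is SATD
    ig = 0.0
    for w in (True, False):
        p_w = (joint[(w, True)] + joint[(w, False)]) / n
        for t in (True, False):
            p_wt = joint[(w, t)] / n
            pt = p_t if t else 1 - p_t
            if p_wt > 0:
                ig += p_wt * math.log2(p_wt / (p_w * pt))
    return ig

def select_top_features(comments, labels, k_percent=10):
    vocab = set().union(*comments)
    ranked = sorted(vocab, key=lambda f: information_gain(comments, labels, f), reverse=True)
    return ranked[: max(1, len(ranked) * k_percent // 100)]
```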
Step 3: training sub-classifiers using Naïve Bayes Multinomial and linear logistic regression
(1) Naïve Bayes Multinomial
In the experiments, classifiers No. 2, 3, 4, 5, 6, and 8 are set as polynomial naive Bayes classifiers (Naïve Bayes Multinomial, NBM) and trained with the NBM method. NBM adds the assumption of a multinomial distribution on top of Naive Bayes (NB); its principle is similar to NB, and both belong to the Bayesian family. The main advantages of training a classifier with a Bayesian method are its short computation time and high training performance, since it assumes that the given labels and features are conditionally independent. Thus, let a comment be $C_i = \{w_1, w_2, \ldots, w_n\}$ with classification label $L_i$; then:
$$p(C_i \mid L_i) = p(w_1, w_2, \ldots, w_n \mid L_i) = \prod_{j=1}^{n} p(w_j \mid L_i) \qquad (3)$$
Applying Bayes' theorem to equation (3) yields:
$$p(L_i \mid C_i) = \frac{p(L_i)\, p(C_i \mid L_i)}{p(C_i)} = \frac{p(L_i) \prod_{j=1}^{n} p(w_j \mid L_i)}{p(C_i)} \qquad (4)$$
The classification label of the comment can be identified by equation (4). Note that NB considers only whether a feature is present in the current comment to be tested; NBM is similar, but determines the classification label from the number of times each feature appears in the comment. Analysis and experiments show that NBM performs better than NB when certain specific features occur many times in the comment set.
(2) Simple Logistic
In this experiment, two classifiers, namely classifier No. 1 and classifier No. 7, are set as linear logistic regression classifiers (Simple Logistic). The Simple Logistic method builds on simple logistic regression and iterates with the LogitBoost algorithm; each iteration optimizes the parameters of the basic weak classifier, finally forming a high-precision model. By default LogitBoost runs 10 iterations; if the iterations work poorly, the optimal number of iterations can be obtained with K-fold cross-validation.
Let the comment data set be denoted $C = \{(C_1, L_1), (C_2, L_2), \ldots, (C_N, L_N)\}$, where $C_i$ is the $i$-th comment and $L_i$ is its classification label, i.e. whether self-admitted technical debt is present. Let $C_i = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of features in comment $C_i$ and $w_i$ is the $i$-th feature in the comment. By the linear logistic regression theorem:
$$z = \theta_1 w_1 + \theta_2 w_2 + \cdots + \theta_n w_n + \theta_0 = \theta^T C_i \qquad (5)$$
Substituting $z$ into the sigmoid function, which is expressed as:
$$h(z) = \frac{1}{1 + e^{-z}} \qquad (6)$$
According to the final value of the sigmoid function, the comments to be detected are divided into two categories; a label value of 1 marks a comment statement containing self-admitted technical debt.
Step 4: sub-classifier voting rule
Because the data-training process is divided among several sub-classifiers that each predict a classification label, the final accuracy of the ensemble classifier improves markedly provided each sub-classifier is itself accurate. The invention takes the classification label predicted by the majority of the sub-classifiers as the final prediction of the ensemble classifier.
Step 5: clustering for self-admitted technical debt classification
Based on the features selected by information gain in the preceding steps, the invention re-counts and processes them according to how often each feature appears, where it appears, and the habits of the developers, and finally classifies the feature words with a clustering method.
First, the feature frequencies in the source code are counted and the high-frequency feature words are selected. Then, considering personal preference factors, some features that are merely referential and do not indicate technical debt are deleted. Words with emotional color, which rarely appear in comments without self-admitted technical debt, are kept as needed. Finally, some modal verbs are also deleted. The selection standard is to keep, for the final classification and detection, the features with large influence weights, and to screen out the features whose influence the analysis shows to be small.
The research framework of the invention is divided into two stages: a model-building stage and a prediction stage. In the model-building stage, source projects whose comments have known classification labels are input as the training data set, and a sub-classifier is built for each individual source project. In the prediction stage, all the sub-classifiers are integrated to jointly predict whether a comment in the target project contains self-admitted technical debt. To make the result as accurate as possible, only one project at a time is selected as the target project for prediction, and the other n-1 projects are input into the model as source projects used to train the sub-classifiers.
The method comprises the following steps:
Step one: text preprocessing
Before the feature values are selected, the original project comments are text-preprocessed. This is done because the desired features are core words, while the source-project comments contain a large number of punctuation marks, stop words, and the like. Moreover, many words share the same stem; in a classification problem these can be simplified to one representative word without affecting the classification result, while improving classification efficiency. Text processing is therefore divided into 3 steps (a code sketch follows the third step):
(1) Tokenization: the source-project comment text is divided into words, phrases, symbols, or other meaningful elements. The experiments keep only the features consisting of English letters, i.e. all punctuation marks are deleted first; in addition, some word features come with punctuation or digits attached, and these characters must also be deleted so that only the word remains, for example: "TODO:" is tokenized as "TODO". Finally, all word features are converted to lower case.
(2) Stop-word removal: stop words are words used frequently when writing comments but of little use for the self-admitted technical debt detection problem the invention addresses, because they carry no practical signal for identifying such debt. Common stop words include "I", "should", "to", "the", and so on. Although much text-mining work provides a standard stop-word list, some standard stop words are actually useful for classification in the self-admitted technical debt detection problem: in a comment with self-admitted technical debt that begins with "TODO", for instance, some of the accompanying words carry information useful for classification even though they are treated as stop words by default. Therefore the invention builds a stop-word list specific to the self-admitted technical debt detection problem, containing only a few prepositions useless for classification (such as "the", "to", and "is"). Words no longer than 2 characters or longer than 20 characters are also treated as stop words.
(3) Word stemming: stemming is a process that unifies words (sometimes derivatives) into their stem, root, or base form. For example, the words "stems" and "stemmed" are both reduced to "stem". The invention uses the well-known Porter stemmer to implement stemming, thereby reducing redundant synonyms.
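The three text-processing steps can be sketched as follows, assuming the NLTK library for the Porter stemmer; the small stop-word list and the length thresholds follow the description above, and all names are illustrative.

```python
# Sketch of tokenization, stop-word removal, and Porter stemming (steps (1)-(3)).
import re
from nltk.stem import PorterStemmer  # assumes nltk is installed

STOP_WORDS = {"the", "to", "is"}     # small task-specific list, as described above
stemmer = PorterStemmer()

def text_preprocess(comment):
    tokens = re.findall(r"[A-Za-z]+", comment.lower())       # (1) keep letter runs only, lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS
              and 2 < len(t) <= 20]                           # (2) stop words and length limits
    return [stemmer.stem(t) for t in tokens]                  # (3) Porter stemming

print(text_preprocess("TODO: the stemmed features should be implemented"))
# -> ['todo', 'stem', 'featur', 'should', 'implement']
```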
Step two: feature value selection
Not all of the preprocessed features are used to train the classifier: with too many features the classifier's efficiency is too low and the noise increases, so the top k most useful features are selected with a feature selection method.
Shannon first formalized the amount of information; the invention selects useful features with information gain, a widely used feature selection method built on that foundation.
The amount of information is defined with a logarithmic function: for an event $x$ with probability of occurrence $p(x)$, the amount of information of the event $x$ is defined as:
$$I(x) = -\log p(x) \qquad (7)$$
As equation (7) shows, the amount of information reflects the uncertainty of the event's occurrence: the smaller the uncertainty, the smaller the amount of information; conversely, the greater the uncertainty, the greater the amount of information.
Shannon later proposed the definition of information entropy. Information entropy is a measure of the amount of information needed to remove uncertainty; it is the expectation of the amount of information an event may produce, taken over all possible outcomes of the event. In general, information entropy is a measure of the amount of information: for a variable $X = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}$, its information entropy is defined as:
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) \qquad (8)$$
where $p(x_i)$ is the probability that the random variable $X$ takes the value $x_i$. The information entropy depends only on the distribution of the random variable $X$, not on its values; its magnitude indicates the uncertainty of the random variable taking a given value. The curve of information entropy versus probability is shown in FIG. 2.
As FIG. 2 shows, when $p(x_i) = 0$ or $1$, $H = 0$, i.e. the uncertainty is 0; when $p(x_i) = 0.5$ the entropy reaches its maximum, i.e. the uncertainty about whether the random variable $X$ takes $x_i$ is greatest.
If a precondition is attached to the occurrence of the event, the conditional entropy $H(X \mid Y)$ is obtained; it is the entropy of the random variable $X$ averaged over all possible values of the random variable $Y$:
$$H(X \mid Y) = \sum_{y} p(y)\, H(X \mid Y = y) \qquad (9)$$
$$H(X \mid Y = y) = -\sum_{i=1}^{n} p(x_i \mid y) \log p(x_i \mid y) \qquad (10)$$
Substituting equation (10) into equation (9) gives the final form of the conditional entropy:
$$H(X \mid Y) = -\sum_{y} p(y) \sum_{i=1}^{n} p(x_i \mid y) \log p(x_i \mid y) \qquad (11)$$
With information entropy and conditional entropy in hand, information gain can be defined from the two. The information gain is defined as the difference between the entropy of the information set to be classified and its conditional entropy after a certain feature has been selected:
$$IG(X \mid Y) = H(X) - H(X \mid Y) \qquad (12)$$
The formula above and the related concepts make the information gain easy to understand. For a certain feature set $X = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}$, $H(X \mid Y)$ is definite given the condition $Y$; the smaller the remaining uncertainty of the feature set $X$, the larger the information gain, which indicates that the feature performs better. Feature selection is performed by calculating the information gain of the features; usually the top k features with high IG values are selected, or a threshold is set for screening.
In the data set, the data are represented as features and comments, and analysis shows that four relationships may exist: the comment contains the feature and self-admitted technical debt exists; it contains the feature and no self-admitted technical debt exists; it does not contain the feature but self-admitted technical debt exists; and it contains neither the feature nor self-admitted technical debt.
Based on the above four possible relationships, the information gain of the feature $w$ with respect to the label $t$ is calculated as:
$$IG(w, t) = \sum_{w' \in \{w, \bar{w}\}} \sum_{t' \in \{t, \bar{t}\}} p(w', t') \log \frac{p(w', t')}{p(w')\, p(t')} \qquad (13)$$
where $p(w', t')$ is the probability that the feature $w'$ appears in a comment with label $t'$, $p(w')$ is the probability that the feature $w'$ appears in a comment, and $p(t')$ is the probability of a comment having label $t'$.
After the information gain corresponding to each feature has been calculated with this method, the top k% of features, ordered from largest to smallest information gain, are selected and the other features are discarded. In the experiments of the invention, the top 10% of the total number of features is selected empirically.
Step three: training the sub-classifiers
On the basis of the selected feature values, the Naïve Bayes Multinomial and linear logistic regression (Simple Logistic) methods are used to train the corresponding sub-classifiers. The sub-classifiers are finally combined into an ensemble classifier that predicts the data to be detected.
Some sub-classifiers are trained with the naive Bayes method: on a comment set $C_i = \{w_1, w_2, \ldots, w_n\}$ with classification label $L_i$, the labels and features can be assumed conditionally independent, and Bayes' theorem is then used on that conditional basis to obtain the classification label representing the comment.
The polynomial naive Bayes classifier (Naïve Bayes Multinomial) is a specific instance of the naive Bayes classifier. The naive Bayes classifier emphasizes the independence of events under given conditions, while the polynomial naive Bayes classifier additionally assumes that events follow a multinomial distribution; the two are similar in principle.
The naive Bayes classification algorithm is based on Bayes' rule and has the following precondition: given $Y$, the components of the event $X = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}$ are mutually independent. This assumption greatly simplifies the representation of $P(X \mid Y)$ and simplifies the problems encountered in evaluating the data set. For the event $X = \{x_1, x_2, \ldots, x_n\}$ with the $x_i$ independent given the condition $Y$, it follows that:
$$P(X \mid Y) = \prod_{i=1}^{n} P(x_i \mid Y) \qquad (14)$$
It is generally assumed that $Y$ is an arbitrary discrete variable and that the event $X = \{x_1, x_2, \ldots, x_n\}$ consists of arbitrary discrete or real-valued variables. In training the classifier, the goal is to output, for each instance that needs to be classified, the probability distribution over the possible values of $Y$. According to Bayes' rule, the probability that $Y$ takes its $k$-th possible value is:
$$P(Y = y_k \mid x_1, \ldots, x_n) = \frac{P(Y = y_k)\, P(x_1, \ldots, x_n \mid Y = y_k)}{\sum_{j} P(Y = y_j)\, P(x_1, \ldots, x_n \mid Y = y_j)} \qquad (15)$$
Assuming now that the $x_i$ are independent given the condition $Y$, equation (15) can be rewritten as:
$$P(Y = y_k \mid x_1, \ldots, x_n) = \frac{P(Y = y_k) \prod_i P(x_i \mid Y = y_k)}{\sum_{j} P(Y = y_j) \prod_i P(x_i \mid Y = y_j)} \qquad (16)$$
Equation (16) is the basic equation of the polynomial naive Bayes classifier; $P(Y = y_k \mid x_1, x_2, \ldots, x_n)$ is called the posterior probability.
Now, given a new instance $X^{new} = \{x_1, x_2, \ldots, x_n\}$ from the data set, the prior probabilities $P(Y)$, and the conditional probabilities $P(x_i \mid Y)$, the most likely value of $Y$ (i.e. the classification label) is given by the naive Bayes classification rule:
$$Y \leftarrow \arg\max_{y_k} \frac{P(Y = y_k) \prod_i P(x_i \mid Y = y_k)}{\sum_{j} P(Y = y_j) \prod_i P(x_i \mid Y = y_j)} \qquad (17)$$
Equation (17) is generally simplified as:
$$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(x_i \mid Y = y_k) \qquad (18)$$
That is, the $y_k$ that maximizes equation (18) is the final classification result.
The other sub-classifiers are trained with linear logistic regression: on the basis of simple logistic, the LogitBoost algorithm is added for iteration, and the optimal number of iterations is obtained with K-fold cross-validation. In the prediction process, the comment data set is put into linear logistic regression form and substituted into the sigmoid function, and the final result of the sigmoid function is given a 0/1 label to predict whether the comment is a comment sentence containing self-admitted technical debt.
First, for a given data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, $(x_i, y_i)$ denotes the $i$-th sample, where $x_i = \{x_{i1}, x_{i2}, \ldots, x_{in}\}$ holds the $n$ features of each datum and the classification label is $y_i \in \{0, 1\}$. Assume the $n$ features of $x_i$ combine linearly, i.e.:
$$z = \theta x_i + b = \theta_1 x_{i1} + \theta_2 x_{i2} + \cdots + \theta_n x_{in} + b \qquad (19)$$
For simplicity, writing $b$ in equation (19) as $\theta_0$ gives:
$$z = \theta_1 x_{i1} + \theta_2 x_{i2} + \cdots + \theta_n x_{in} + \theta_0 = \theta^T X \qquad (20)$$
The invention aims at a binary classification problem and wants the final function to display the classification result intuitively, so the sigmoid function is adopted, expressed as:
$$g(z) = \frac{1}{1 + e^{-z}} \qquad (21)$$
Substituting equation (20) into equation (21) yields:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T X}} \qquad (22)$$
The function curve is shown in FIG. 3.
As the figure shows, the range of the sigmoid function is [0, 1]. It can be stipulated that the final result is judged to be 1 when $y$ is greater than 0.5 and to be 0 when $y$ is less than 0.5, which realizes the binary classification clearly. If we further let:
$$P(y = 1 \mid x; \theta) = h_\theta(x) \qquad (23)$$
$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x) \qquad (24)$$
then the loss function of logistic regression can be obtained:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \bigl[ y_i \log h_\theta(x_i) + (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \bigr] \qquad (25)$$
Minimizing the result of equation (25) over $\theta$, the parameter in equation (20), gives the functional representation of the final logistic regression classifier.
The Simple Logistic classifier adopted by the invention applies the LogitBoost algorithm on top of weak classifiers trained with logistic regression. LogitBoost is one of the boosting algorithms developed in recent years, following the boosting line of Schapire and Singer. Boosting was originally designed to combine several weak classifiers to improve classification performance; later, Freund and Schapire proposed the more practical boosting algorithm AdaBoost, but that algorithm suffers from overfitting when processing noisy data. For this case, Friedman et al. proposed the LogitBoost algorithm to reduce the training error linearly.
The LogitBoost algorithm proceeds as follows:
◆ Input the data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in X$ and $y_i \in Y = \{-1, 1\}$ (with $y_i^* = (y_i + 1)/2 \in \{0, 1\}$), and input the number of iterations $T$.
◆ Initialize the weights $w_i = 1/N$ $(i = 1, \ldots, N)$, the additive model $F(x) = 0$, and the probabilities $p(x_i) = 1/2$.
◆ Repeat for iterations $t = 1, \ldots, T$:
a. Calculate the weights and working responses:
$$w_i = p(x_i)\,[1 - p(x_i)] \qquad (26)$$
$$z_i = \frac{y_i^* - p(x_i)}{p(x_i)\,[1 - p(x_i)]} \qquad (27)$$
b. With the $w_i$ as weights, fit the weak classifier function $f_t(x)$ to the working responses $z_i$ by weighted least squares, where $f_t(x_i)$ denotes the output of the weak classifier; the invention trains the weak classifiers with logistic regression functions:
$$f_t = \arg\min_{f} \sum_{i} w_i \bigl(z_i - f(x_i)\bigr)^2 \qquad (28)$$
c. Update $F(x)$ and $p(x)$ for this round of iteration:
$$F(x) \leftarrow F(x) + \frac{1}{2} f_t(x) \qquad (29)$$
$$p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}} \qquad (30)$$
◆ Output the final classification $LF(x) = \operatorname{sign}[F(x)]$.
Step four: sub-classifier voting rule
In the prediction stage, the classifiers trained on the source projects must predict the classification labels of the comments to be predicted in the target project. Since each project has comments of a different style, with different feature distributions, the experiment builds the ensemble classifier from the individual sub-classifiers. Each sub-classifier is trained, according to the characteristics of its own source project, with the method suited to its own data, and the sub-classifiers are independent: they cannot interfere with each other's prediction process. Therefore, provided each sub-classifier is accurate, the final accuracy of the ensemble classifier also improves markedly. The invention takes the classification label predicted by the majority of the sub-classifiers as the prediction of the final ensemble classifier. The prediction process thus resembles an election: each sub-classifier "votes" to determine the final "winner" (i.e. the comment's classification label).
Table 1 gives the voting process of the sub-classifiers used to predict the classification label. The columns correspond to the set of sub-classifiers and the prediction of each sub-classifier, and the last row aggregates them into the final output of the ensemble classifier. In the example there are 7 sub-classifiers in total; the data used to train these 7 sub-classifiers are assumed to come from 7 different source projects. The prediction of 3 sub-classifiers is "no self-admitted technical debt" (Negative) and the predictions of the other four are "self-admitted technical debt" (Positive), so the final output of the ensemble classifier is that the comment contains self-admitted technical debt.
TABLE 1 Example of sub-classifier voting
[Table 1: each of the seven sub-classifiers casts one vote on the comment's label (three Negative, four Positive), and the aggregated output of the ensemble classifier is Positive.]
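In code, the aggregation of Table 1 is a one-line majority count over the sub-classifier outputs; the vote order below is an invented illustration consistent with the table's three Negative and four Positive votes.

```python
# Sketch: majority vote over the seven sub-classifier predictions of Table 1.
from collections import Counter

votes = ["Negative", "Positive", "Negative", "Positive",
         "Positive", "Negative", "Positive"]       # illustrative order
final = Counter(votes).most_common(1)[0][0]
print(final)  # -> Positive: four of seven sub-classifiers predict SATD
```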
Step five: clustering for self-admitted technical debt classification
The features obtained from the detection are analyzed according to the information gain values observed in the detection process. The analysis shows that different projects may share some common features representing self-admitted technical debt, such as "todo", "fixme", "workaround", "implement", and "hack", but the frequency with which these features appear may vary from project to project. For example, some developers prefer to use the word "hack" for temporarily fixed problems, while others prefer the word "workaround". Further, these words may also appear in comments where no self-admitted technical debt exists. For example, when the word "implement" appears in a comment without self-admitted technical debt, it means the developer has written code that implements some functionality (e.g. "implements backspace function"), for indicative purposes. However, in comments with self-admitted technical debt, the word usually means the developer needs to implement some functionality but has not yet finished (e.g. "Bunch of methods still not implemented").
Some developers prefer to use words with emotional color (e.g. yuck, ugly, stupid, ill) when writing self-admitted technical debt comments. These words rarely appear in comments without self-admitted technical debt, but sometimes developers want to remind themselves to avoid writing low-quality code, and in such cases these words occasionally do appear. In addition, developers prefer to use modal verbs, question words, and comparatives when addressing self-admitted technical debt: the modal verbs include "should", "need", "can", "would", etc.; the question words include "what", "how", "where", etc.; and the comparatives include "better", "most", "fast", etc.
Reading the comments containing self-admitted technical debt leads to the conclusion that in many cases developers had to repair a program or implement a function quickly, in a short time. That is, much self-admitted technical debt is incurred while the developer is under time pressure or emotional tension.
The invention processes and counts the original data set again, combines the result with the 5 types of self-admitted technical debt proposed by Maldonado and Shihab during their research, and finally classifies the counted feature values with a clustering method; the results obtained are shown in Table 2:
TABLE 2 Typical feature classification of self-admitted technical debt
[Table 2: typical feature words grouped by the type of self-admitted technical debt they indicate; the original table image is not reproduced here.]
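As an illustration of this final step, frequent high-information-gain feature words can be embedded by simple usage statistics and grouped with k-means; the statistics, word list, and cluster count below are invented for illustration and are not the patent's measured data.

```python
# Sketch: cluster candidate SATD feature words by usage statistics (step five).
import numpy as np
from sklearn.cluster import KMeans

# Invented per-word statistics: [count in SATD comments, count in non-SATD
# comments, mean relative position within the comment].
words = ["todo", "fixme", "hack", "workaround", "ugly", "should", "how", "better"]
stats = np.array([
    [120, 10, 0.1], [80, 5, 0.1], [40, 8, 0.4], [35, 6, 0.4],
    [20, 2, 0.5], [90, 60, 0.3], [25, 20, 0.2], [15, 12, 0.6],
], dtype=float)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(stats)
for word, cluster in zip(words, kmeans.labels_):
    print(f"{word}: cluster {cluster}")   # words sharing a cluster suggest one debt type
```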

Claims (2)

1. A self-admitted technical debt detection and classification method based on multi-method ensemble learning, characterized by comprising the following steps:
step 1: preprocessing the feature words
process the raw comment data using heuristic rules:
(1) delete the fixed-format license description comments generated automatically by the compiler;
(2) merge multi-line comments into one sentence;
(3) delete code present in comment statements;
(4) delete Javadoc that contains no reserved words, and keep the comment sentences that contain reserved words;
step 2: selecting the top k most useful features to train the classifier
after text preprocessing of the source-project comments, a vector space model (VSM) is used to process the words that have been divided into features; in this model each comment is represented by a word vector, the divided word features can be regarded as dimensions, and each comment can be regarded as a data point in a high-dimensional space; a HashMap is used as the mapping of the VSM model, in which the string key is a divided feature and the double-precision value is the term frequency, i.e. the number of times the feature appears in the current comment, normalized;
information gain, a widely used feature selection method, is employed to select useful features: let the comment data set be denoted $C = \{(C_1, L_1), (C_2, L_2), \ldots, (C_N, L_N)\}$, where $C_i$ is the $i$-th comment and $L_i$ is its classification label, i.e. whether ($t$) or not ($\bar{t}$) self-admitted technical debt is present; let $C_i = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of features in comment $C_i$ and $w_i$ is the $i$-th feature in the comment; between a feature $w$ and a comment $C_i$ there are 4 possible relationships:
- $(w, t)$: comment $C_i$ contains the feature $w$, and self-admitted technical debt is present in the comment (i.e. $t$);
- $(w, \bar{t})$: comment $C_i$ contains the feature $w$, but no self-admitted technical debt is present in the comment (i.e. $\bar{t}$);
- $(\bar{w}, t)$: comment $C_i$ does not contain the feature $w$, but self-admitted technical debt is present in the comment (i.e. $t$);
- $(\bar{w}, \bar{t})$: comment $C_i$ does not contain the feature $w$, and no self-admitted technical debt is present in the comment (i.e. $\bar{t}$);
based on the above 4 possible relationships, the information gain of the feature $w$ with respect to the label $t$ is calculated as:
$$IG(w, t) = \sum_{w' \in \{w, \bar{w}\}} \sum_{t' \in \{t, \bar{t}\}} p(w', t') \log \frac{p(w', t')}{p(w')\, p(t')} \qquad (1)$$
where $p(w', t')$ represents the probability that the feature $w'$ appears in a comment with label $t'$, $p(w')$ the probability that the feature $w'$ appears in a comment, and $p(t')$ the probability of a comment having label $t'$;
after the information gain value corresponding to each feature has been calculated, the features are sorted from largest to smallest information gain value; the higher the score, the more important the feature is for predicting the classification label; the method keeps the features whose information gain values are in the top k% and discards the other features;
step 3: training sub-classifiers using Naïve Bayes Multinomial and linear logistic regression
(1) Naïve Bayes Multinomial
six classifiers, namely classifiers No. 2, 3, 4, 5, 6, and 8, are set as polynomial naive Bayes classifiers (NBM) and trained with the NBM method; let a comment be $C_i = \{w_1, w_2, \ldots, w_n\}$ with classification label $L_i$; then:
$$p(C_i \mid L_i) = p(w_1, w_2, \ldots, w_n \mid L_i) = \prod_{j=1}^{n} p(w_j \mid L_i) \qquad (3)$$
applying Bayes' theorem to equation (3) yields:
$$p(L_i \mid C_i) = \frac{p(L_i)\, p(C_i \mid L_i)}{p(C_i)} = \frac{p(L_i) \prod_{j=1}^{n} p(w_j \mid L_i)}{p(C_i)} \qquad (4)$$
the classification label of the comment is identified by equation (4);
(2) Simple Logistic
two classifiers, namely classifier No. 1 and classifier No. 7, are set as linear logistic regression classifiers (Simple Logistic); let the comment data set be denoted $C = \{(C_1, L_1), (C_2, L_2), \ldots, (C_N, L_N)\}$, where $C_i$ is the $i$-th comment and $L_i$ is its classification label, i.e. whether self-admitted technical debt is present; let $C_i = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of features in comment $C_i$ and $w_i$ is the $i$-th feature in the comment; by the linear logistic regression theorem:
$$z = \theta_1 w_1 + \theta_2 w_2 + \cdots + \theta_n w_n + \theta_0 = \theta^T C_i \qquad (5)$$
substituting $z$ into the sigmoid function, expressed as:
$$h(z) = \frac{1}{1 + e^{-z}} \qquad (6)$$
according to the final value of the sigmoid function, the comments to be detected are divided into two categories, a label value of 1 marking a comment statement containing self-admitted technical debt;
step 4: sub-classifier voting rule
a voting rule is adopted in which the classification label predicted by the majority of the sub-classifiers is taken as the final prediction of the ensemble classifier;
step 5: clustering for self-admitted technical debt classification
based on the features selected by information gain in the preceding steps, the original data are re-screened according to how often each feature appears, where it appears, and the habits of the developers, and the feature words are finally classified with a clustering method.
2. The self-admitted technical debt detection and classification method based on multi-method ensemble learning of claim 1, wherein k% in step 2 is 10%.
CN202010568813.6A 2020-06-19 2020-06-19 Self-admitted technical debt detection and classification method based on multi-method ensemble learning Active CN111782807B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010568813.6A CN111782807B (en) 2020-06-19 2020-06-19 Self-admitted technical debt detection and classification method based on multi-method ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010568813.6A CN111782807B (en) 2020-06-19 2020-06-19 Self-admitted technical debt detection and classification method based on multi-method ensemble learning

Publications (2)

Publication Number Publication Date
CN111782807A true CN111782807A (en) 2020-10-16
CN111782807B CN111782807B (en) 2024-05-24

Family

ID=72756715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010568813.6A Active CN111782807B (en) 2020-06-19 2020-06-19 Self-admitted technical debt detection and classification method based on multi-method ensemble learning

Country Status (1)

Country Link
CN (1) CN111782807B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112748951A (en) * 2021-01-21 2021-05-04 杭州电子科技大学 XGBoost-based self-admitted technical debt multi-classification method
CN112800232A (en) * 2021-04-01 2021-05-14 南京视察者智能科技有限公司 Big data based case automatic classification and optimization method and training set correction method
CN113313184A (en) * 2021-06-07 2021-08-27 西北工业大学 Heterogeneous-ensemble-based automatic self-admitted technical debt detection method
CN113377422A (en) * 2021-06-09 2021-09-10 大连海事大学 Deep-learning-based self-admitted technical debt identification method
CN113407439A (en) * 2021-05-24 2021-09-17 西北工业大学 Detection method for software self-recognition type technical debt
US11971804B1 (en) 2021-06-15 2024-04-30 Allstate Insurance Company Methods and systems for an intelligent technical debt helper bot

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090171871A1 (en) * 2007-03-23 2009-07-02 Three Palm Software Combination machine learning algorithms for computer-aided detection, review and diagnosis
US7814040B1 (en) * 2006-01-31 2010-10-12 The Research Foundation Of State University Of New York System and method for image annotation and multi-modal image retrieval using probabilistic semantic models
CN107111842A (en) * 2014-12-16 2017-08-29 具珉秀 Asset management device and its operating method
CN110069252A (en) * 2019-04-11 2019-07-30 浙江网新恒天软件有限公司 A kind of source code file multi-service label mechanized classification method
WO2019217323A1 (en) * 2018-05-06 2019-11-14 Strong Force TX Portfolio 2018, LLC Methods and systems for improving machines and systems that automate execution of distributed ledger and other transactions in spot and forward markets for energy, compute, storage and other resources
CN111000553A (en) * 2019-12-30 2020-04-14 山东省计算中心(国家超级计算济南中心) Intelligent classification method for electrocardiogram data based on voting ensemble learning
CN111242191A (en) * 2020-01-06 2020-06-05 中国建设银行股份有限公司 Credit rating method and device based on multi-classifier integration
CN111273911A (en) * 2020-01-14 2020-06-12 杭州电子科技大学 Software technology debt identification method based on bidirectional LSTM and attention mechanism

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7814040B1 (en) * 2006-01-31 2010-10-12 The Research Foundation Of State University Of New York System and method for image annotation and multi-modal image retrieval using probabilistic semantic models
US20090171871A1 (en) * 2007-03-23 2009-07-02 Three Palm Software Combination machine learning algorithms for computer-aided detection, review and diagnosis
CN107111842A (en) * 2014-12-16 2017-08-29 具珉秀 Asset management device and its operating method
WO2019217323A1 (en) * 2018-05-06 2019-11-14 Strong Force TX Portfolio 2018, LLC Methods and systems for improving machines and systems that automate execution of distributed ledger and other transactions in spot and forward markets for energy, compute, storage and other resources
CN110069252A (en) * 2019-04-11 2019-07-30 浙江网新恒天软件有限公司 A kind of source code file multi-service label mechanized classification method
CN111000553A (en) * 2019-12-30 2020-04-14 山东省计算中心(国家超级计算济南中心) Intelligent classification method for electrocardiogram data based on voting ensemble learning
CN111242191A (en) * 2020-01-06 2020-06-05 中国建设银行股份有限公司 Credit rating method and device based on multi-classifier integration
CN111273911A (en) * 2020-01-14 2020-06-12 杭州电子科技大学 Software technology debt identification method based on bidirectional LSTM and attention mechanism

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GABRIELE BAVOTA ET AL.: "A large scale empirical study on self-admitted technical debt", 2016 IEEE/ACM 13th Working Conference, 30 November 2016 (2016-11-30) *
POTDAR A. ET AL.: "An exploratory study on self-admitted technical debt", IEEE International Conference on Software Maintenance and Evolution *
LIU YAJUN ET AL.: "Research on Technical Debt Management in Software Integrated Development Environments", Computer Science, vol. 44, no. 11 *
CHEN SONGFENG; FAN MING: "Building a Bayes-based Combined Classifier Using PCA and AdaBoost", Computer Science, no. 08, 15 August 2010 (2010-08-15) *
HAN SUQING; CHENG HUIWEN; WANG BAOLI: "Research on a Three-way Decision Naive Bayes Incremental Learning Algorithm", Computer Engineering and Applications, no. 18 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112748951A (en) * 2021-01-21 2021-05-04 杭州电子科技大学 XGboost-based self-acceptance technology debt multi-classification method
CN112748951B (en) * 2021-01-21 2022-04-22 杭州电子科技大学 XGboost-based self-acceptance technology debt multi-classification method
CN112800232A (en) * 2021-04-01 2021-05-14 南京视察者智能科技有限公司 Big data based case automatic classification and optimization method and training set correction method
CN113407439A (en) * 2021-05-24 2021-09-17 西北工业大学 Detection method for software self-recognition type technical debt
CN113407439B (en) * 2021-05-24 2024-02-27 Detection method for software self-recognition type technical debt
CN113313184A (en) * 2021-06-07 2021-08-27 西北工业大学 Heterogeneous integrated self-acceptance technology debt automatic detection method
CN113313184B (en) * 2021-06-07 2024-05-24 Heterogeneous integrated self-acceptance technology debt automatic detection method
CN113377422A (en) * 2021-06-09 2021-09-10 大连海事大学 Method for identifying self-recognition technology debt based on deep learning
CN113377422B (en) * 2021-06-09 2024-04-05 Method for identifying self-recognition technology debt based on deep learning
US11971804B1 (en) 2021-06-15 2024-04-30 Allstate Insurance Company Methods and systems for an intelligent technical debt helper bot

Also Published As

Publication number Publication date
CN111782807B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
CN111782807A (en) Self-acceptance technology debt detection and classification method based on multi-method ensemble learning
CN110209806B (en) Text classification method, text classification device and computer readable storage medium
CN107992597B (en) Text structuring method for power grid fault case
Weiss et al. Structured prediction cascades
US7606784B2 (en) Uncertainty management in a decision-making system
CN112364638B (en) Personality identification method based on social text
CN107193804A (en) A text feature selection method for spam messages oriented to words and compound words
CN112966068A (en) Resume identification method and device based on webpage information
CN111522953B (en) Marginal attack method and device for naive Bayes classifier and storage medium
Ababneh Investigating the relevance of Arabic text classification datasets based on supervised learning
dos Reis et al. One-class quantification
Gunaseelan et al. Automatic extraction of segments from resumes using machine learning
Jeyakarthic et al. Optimal bidirectional long short term memory based sentiment analysis with sarcasm detection and classification on twitter data
Heid et al. Reliable part-of-speech tagging of historical corpora through set-valued prediction
CN115841105B (en) Event extraction method, system and medium based on event type hierarchical relationship
Sheng et al. A paper quality and comment consistency detection model based on feature dimensionality reduction
CN115496630A (en) Patent writing quality checking method and system based on natural language algorithm
Sherrod Predictive modelling software
CN115129890A (en) Feedback data map generation method and generation device, question answering device and refrigerator
Klinger et al. Feature subset selection in conditional random fields for named entity recognition
CN114443840A (en) Text classification method, device and equipment
Do Van et al. Classification and variable selection using the mining of positive and negative association rules
Hamdy et al. Deep embedding of open source software bug repositories for severity prediction
MacNamara et al. Neural networks for language identification: a comparative study
Dang et al. Unsupervised threshold autoencoder to analyze and understand sentence elements

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant