CN112748951B - XGboost-based self-acceptance technology debt multi-classification method - Google Patents

XGboost-based self-acceptance technology debt multi-classification method

Info

Publication number
CN112748951B
Authority
CN
China
Prior art keywords
code
code annotation
annotation
text information
technical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110081268.2A
Other languages
Chinese (zh)
Other versions
CN112748951A (en)
Inventor
陈信
俞东进
范旭麟
王琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110081268.2A priority Critical patent/CN112748951B/en
Publication of CN112748951A publication Critical patent/CN112748951A/en
Application granted granted Critical
Publication of CN112748951B publication Critical patent/CN112748951B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/70Software maintenance or management
    • G06F8/73Program documentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a self-acceptance technical debt multi-classification method based on XGboost. By constructing an XGboost-based self-acceptance technical debt classifier, the method can classify self-acceptance technical debt effectively. The method also augments the data with the random swap and random shuffle strategies of the EDA method and uses a class-spacing metric to evaluate the quality of the generated data, which effectively mitigates the sample imbalance problem. In addition, the method extracts features with CHI and selects the s highest-scoring words (s is 10% of the total number of distinct features), which accelerates model training and improves model performance. The method can effectively classify the technical debt in software, reduce the cost of software maintenance, and is of great significance to software maintenance.

Description

XGboost-based self-acceptance technology debt multi-classification method
Technical Field
The invention relates to the field of software maintenance, in particular to a self-acceptance technology debt multi-classification method based on XGboost.
Background
Technical Debt (TD) is a metaphor for the irregular code produced when software developers adopt compromise development solutions to meet immediate business requirements or budget and time constraints. Research has shown that technical debt can significantly degrade software quality and poses many challenges to software maintenance. Its influence on software falls mainly into three aspects: maintainability, evolvability and visibility. First, code carrying technical debt is hard to read and to understand by others and may contain code smells, which hinders the extension and enhancement of the software and makes maintenance difficult. Second, technical debt reduces a software system's ability to adapt to change, making it hard for the system to support rapid iteration and evolution of functionality, so that its usability, scalability and flexibility struggle to meet actual requirements. Third, for end users, technical debt causes defects in functionality, design and user experience that prevent them from completing established business processes, turning an invisible code problem into a visible quality problem; for developers, a bloated technical architecture and scattered business logic prevent the product from responding quickly to changing requirements and delay delivery, turning a hard-to-understand invisible architecture problem into a visible software delivery risk.
Because technical debt is invisible and can persist in a project for a long time, detecting and eliminating it has become an urgent problem and has drawn considerable attention from researchers. As research progressed, researchers found that developers sometimes intentionally introduce incomplete or defective code or solutions into a project, owing to factors such as tight development schedules, limited budgets or commercial interests, and record this in code annotations; this type of technical debt is known as self-acceptance (self-admitted) technical debt (SATD). Prior work classifies self-acceptance technical debt into 5 categories: design technical debt (design TD), requirement technical debt (requirement TD), defect technical debt (defect TD), test technical debt (test TD) and documentation technical debt (documentation TD). Different categories of technical debt may need to be handled by different personnel (for example, test technical debt is resolved by testers, while defect technical debt is resolved by developers), so correctly classifying technical debt helps development teams improve their work efficiency.
Currently, researchers focus mainly on detecting technical debt in software, i.e., analyzing source code or code annotations and devising automated or semi-automated methods to identify whether technical debt exists. Very little research, however, has addressed the multi-class classification of self-acceptance technical debt. Experience shows that different types of technical debt affect software development differently: design debt often means there are significant problems in the code and therefore a high maintenance cost, while defect debt indicates that the software may contain defects or crashes and needs to be removed promptly. Identifying the different categories of technical debt therefore helps developers better understand the technical debt in their software and improves the efficiency with which they repair it.
Disclosure of Invention
In order to effectively identify different types of technical debt, the invention provides an XGboost-based multi-classification method for self-acceptance technical debt that can classify technical debt effectively.
The technical scheme adopted by the invention is as follows:
Step (1): acquire a code annotation set S = (S1, S2, …, Sn) from the dataset, where n is the number of code annotations over all classes and each sample is denoted Si = <id, comment, LB>, i = 1, 2, …, n; here id represents the number of the code annotation, comment represents the text of the code annotation, and LB represents the label of the code annotation, namely the type of technical debt.
Step (2): preprocess the comment field of each sample Si.
First, completely identical samples in the original dataset are filtered out using exact string matching and cosine similarity calculation;
then, the historical version records contained in the code annotations are deleted;
finally, noise information in the code annotations is deleted, including numbers, punctuation, URLs, source code and stop words, and all words are converted to lower case.
After preprocessing, each sample is Si = <id, preComment, LB>, where preComment denotes the text of the preprocessed code annotation.
Step (3): perform data augmentation on the text of the preprocessed code annotations. Because design technical debt annotations are the most numerous in the dataset while requirement technical debt and defect technical debt annotations are relatively scarce, the class imbalance harms the classifier model. The code annotation text of these two classes is therefore augmented with the random swap and random shuffle strategies of the EDA (Easy Data Augmentation) method.
Step (4): compute the weight of each feature in the samples with the chi-square statistic, sort the features by weight in descending order, and select the s features with the largest weights.
Step (5): use the CountVectorizer method to represent all code annotation texts as an n×s word frequency matrix FM, where element FM[i][j] indicates the number of occurrences of the j-th word in the i-th code annotation, i = 1, 2, …, n, j = 1, 2, …, s.
Step (6): construct the XGboost-based classifier model.
First, according to the word frequency matrix FM, each sample Si in the code annotation set is represented as Si = (xi, yi), where xi = {FM[i][1], FM[i][2], …, FM[i][s]} and yi is the corresponding class label.
Then, the predicted values of all code annotations are calculated.
Finally, the classifier model is trained additively: at each step, the currently best tree model is added to the classifier.
Step (7): train the classifier model using leave-one-out cross-validation.
Assuming the dataset contains p projects, the code annotations of p−1 projects are selected as the training set and the code annotations of the remaining project are used as the test set; in the training set, design technical debt annotations are labeled 0, requirement technical debt annotations 1, and defect technical debt annotations 2. The trained classifier model is finally obtained through continuous iteration and optimization of the model.
Step (8): classification prediction.
For a new code annotation, first preprocess its text, then perform feature selection on the preprocessed text, and finally represent the text of each code annotation as a vector over the selected features. Each vector is fed into the classifier model, which computes a predicted score of the code annotation for every class; the class label with the largest score is the predicted label of the code annotation.
Compared with traditional classification methods, the invention has the following beneficial effects:
1. An XGboost-based self-acceptance technical debt classifier is constructed, which classifies self-acceptance technical debt effectively and improves classification accuracy.
2. The invention augments the data with the random swap and random shuffle strategies and uses a class-spacing metric to evaluate the quality of the generated data, effectively overcoming the sample imbalance problem.
3. Feature extraction is carried out with CHI and the s highest-scoring words are selected, which accelerates model training and improves model performance.
Drawings
Fig. 1 is a flowchart of a self-acceptance technology debt multi-classification method based on XGBoost according to the present invention.
Detailed Description
Data source: the raw data used in this embodiment come from the public dataset compiled by Maldonado and Shihab. The data cover 10 open source projects: Ant, ArgoUML, Columba, EMF, Hibernate, JEdit, JFreeChart, JMeter, JRuby, and Squirrel. When constructing the dataset, Maldonado and Shihab used JDeodorant to extract the code annotations of these ten projects and applied existing heuristics to remove irrelevant annotations (e.g., annotations automatically generated by tools, partial code segments, etc.). Each code annotation in the dataset is labeled with its debt type. Since the present invention primarily identifies the design, requirement and defect technical debt among self-acceptance technical debt annotations, only the data related to design technical debt, requirement technical debt and defect technical debt are used.
In order to make the purpose, technical scheme and advantages of the present invention clearer, the XGBoost-based self-acceptance technical debt multi-classification method provided by the invention is described in detail below with reference to Fig. 1; it includes the following steps:
Step (1): acquire a code annotation set S = (S1, S2, …, Sn) from the dataset, where n is the number of code annotations over all classes and each sample is denoted Si = <id, comment, LB>, i = 1, 2, …, n; here id represents the number of the code annotation, comment represents the text of the code annotation, and LB represents the label of the code annotation, namely the type of technical debt.
Step (2): preprocess the comment field of each sample Si:
2-1, filter out completely identical samples in the original dataset using exact string matching and cosine similarity calculation (similarity equal to 1);
2-2, delete the historical version records contained in the code annotations (such a record usually takes the form "xx-xx-xx: text", where "xx-xx-xx" is a date and "text" is the history entry);
2-3, delete noise information such as numbers, punctuation, URLs (uniform resource locators) and source code contained in the code annotations, and convert all words to lower case;
2-4, build a stop-word list that not only contains words such as "the", "an", "for" and "a" but also treats words shorter than 3 characters or longer than 20 characters as stop words, mainly because single English words are usually shorter than 20 characters, while words of fewer than 3 characters are generally articles or similar function words that provide little useful information for classification;
and 2-5, delete the stop words contained in the code annotation text according to the stop-word list.
Each sample after processing is denoted Si = <id, preComment, LB>, where preComment denotes the text of the preprocessed code annotation.
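As an illustration only, a minimal Python sketch of this preprocessing is given below; the stop-word list, the regular expressions and the example comment are simplified assumptions, not the patented implementation.

import re

# Illustrative stop-word list; the patent's full list also covers other common English function words.
STOP_WORDS = {"the", "an", "a", "for", "and", "of", "to", "in", "is"}

def preprocess_comment(comment):
    """Roughly mirrors steps 2-2 to 2-5: drop date-like history markers, URLs,
    digits and punctuation; lower-case; remove stop words and words whose
    length is below 3 or above 20 characters."""
    text = comment.lower()
    text = re.sub(r"\d{2,4}-\d{1,2}-\d{1,2}", " ", text)   # date part of "xx-xx-xx: text" records
    text = re.sub(r"https?://\S+", " ", text)              # URLs
    text = re.sub(r"[^a-z\s]", " ", text)                  # digits, punctuation, code symbols
    words = [w for w in text.split()
             if w not in STOP_WORDS and 3 <= len(w) <= 20]
    return " ".join(words)

print(preprocess_comment("TODO 2020-05-01: fix the URL parser, see https://example.com"))
# -> "todo fix url parser see"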
Step (3): perform data augmentation on the text of the preprocessed code annotations. Considering that design technical debt annotations in the dataset far outnumber requirement technical debt and defect technical debt annotations, data augmentation is applied to the latter two classes:
3-1, generate new requirement technical debt and defect technical debt samples using the random swap and random shuffle strategies of the EDA method:
random exchange: random selection of the same class of LB from the datasetrFor each sample, a random position is generated, and each sample is divided into two segments according to the generated random position. The code annotated pieces of text information of the two samples are then exchanged to form two new samples.
Random shuffle: a sample is randomly selected, and the word order of its code annotation text is randomly shuffled to form a new sample.
3-2, executing the random swap strategy 25 times and the random shuffle strategy 50 times generates 100 new samples. Because the generated samples may affect the classifier negatively, the class spacing is used to evaluate them, and the sample with the greatest average distance to all samples in all classes is selected:
d = (1/c) · Σ_{i=1..c} (1/n_i) · Σ_{j=1..n_i} dist(y, x_ij)
where c is the number of classes, n_i is the number of samples in the i-th class, y represents a generated sample, x_ij represents the j-th sample in the i-th class, and d represents the average distance from the generated sample to all samples in all classes.
This process is performed 1000 times for each minority class in the test set, yielding 1000 new samples for each minority class.
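A Python sketch of the two augmentation strategies and of the distance-based selection follows; the Euclidean distance over toy vectors and the nested averaging are illustrative assumptions, since the text above only fixes the quantities c, n_i, y and x_ij.

import random
import numpy as np

def random_swap(a, b):
    """Random swap (3-1): split two same-class annotation texts at random
    positions and exchange the tails to form two new samples."""
    wa, wb = a.split(), b.split()
    i = random.randint(1, max(1, len(wa) - 1))
    j = random.randint(1, max(1, len(wb) - 1))
    return " ".join(wa[:i] + wb[j:]), " ".join(wb[:j] + wa[i:])

def random_shuffle(a):
    """Random shuffle (3-1): randomly scramble the word order of one annotation."""
    words = a.split()
    random.shuffle(words)
    return " ".join(words)

def mean_class_distance(y_vec, class_vectors):
    """Average distance from a generated sample's vector y_vec to the samples
    of each class, averaged over the classes."""
    return float(np.mean([np.mean(np.linalg.norm(np.asarray(c) - y_vec, axis=1))
                          for c in class_vectors]))

print(random_swap("todo implement retry logic", "fixme missing null check"))
print(random_shuffle("todo implement retry logic"))
toy_classes = [np.array([[1.0, 0.0], [0.8, 0.1]]), np.array([[0.0, 1.0]])]
print(mean_class_distance(np.array([0.5, 0.5]), toy_classes))    # larger is better per step 3-2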
Step (4): use the chi-square statistic (CHI) to select the most representative features in the annotation text. CHI measures the dependency between a class LBr and a feature word wj and is calculated as follows:
CHI(wj, LBr) = N · (A·D − B·C)² / [ (A+B) · (C+D) · (A+C) · (B+D) ],  where N = A + B + C + D
where A is the number of code annotations that belong to class LBr and contain the word wj, B is the number that do not belong to class LBr but contain wj, C is the number that belong to class LBr but do not contain wj, and D is the number that neither belong to class LBr nor contain wj.
In this way the CHI score of each word is obtained, and the words are then ranked from high to low. Finally, the s words with the highest CHI scores are selected (s is 10% of the total number of distinct features), and the unselected feature words are removed from all annotations in turn.
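For illustration, the CHI score above can be computed directly from the four counts; the numbers in the example call are made-up toy values.

def chi_square(A, B, C, D):
    """CHI score of a word w_j for a class LB_r, from the counts defined above:
    A: in class and contains w_j;  B: not in class and contains w_j;
    C: in class and lacks w_j;     D: not in class and lacks w_j."""
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return 0.0 if denom == 0 else N * (A * D - B * C) ** 2 / denom

print(chi_square(A=30, B=10, C=20, D=140))   # toy counts for one word against one debt class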
Step (5): use the CountVectorizer method to represent the text of all code annotations as an n×s word frequency matrix FM, where element FM[i][j] indicates the number of occurrences of the j-th word in the text of the i-th code annotation, i = 1, 2, …, n, j = 1, 2, …, s.
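A minimal sketch of building the word frequency matrix with scikit-learn's CountVectorizer; scikit-learn is one possible implementation of the CountVectorizer method named above, and the comments are toy examples.

from sklearn.feature_extraction.text import CountVectorizer

comments = ["todo refactor this ugly workaround",      # toy, already preprocessed texts
            "fixme feature not implemented yet",
            "hack crashes when input is null"]

vectorizer = CountVectorizer()
fm = vectorizer.fit_transform(comments)        # sparse n x s word frequency matrix FM
print(vectorizer.get_feature_names_out())      # the feature words (columns of FM)
print(fm.toarray())                            # FM[i][j]: count of word j in annotation i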
Step (6): construct the XGboost-based classifier model. The main idea of XGboost is to combine many decision tree models of low individual classification accuracy into one high-accuracy classifier; the classifier model is built in a distributed manner and is continuously optimized along the direction of gradient descent during the iterations so that the final prediction result is optimal. XGboost is fast and robust. The specific steps are as follows:
6-1, according to the word frequency matrix FM, represent each sample Si in the code annotation set as Si = (xi, yi), where xi = {FM[i][1], FM[i][2], …, FM[i][s]} and yi is the corresponding class label.
6-2, calculating the predicted value of all code annotations, wherein the predicted value of the ith annotation can be calculated according to the following formula:
ŷ_i = Σ_{k=1..K} f_k(x_i),  f_k ∈ F
where F = {f(x) = ω_{q(x)}} (q: R^s → T, ω ∈ R^T) is the function space of regression trees, i.e., all possible regression trees; T represents the total number of leaf nodes of a regression tree; ω represents the weight of each leaf; q represents the structure of each tree, which maps each annotation x_i to its corresponding leaf node; and K represents the number of regression trees.
6-3, an additive training scheme is adopted when training the XGboost classifier model, i.e., the best tree model is added to the classifier at each step. The predicted value of the i-th sample at the t-th iteration is ŷ_i^(t) = ŷ_i^(t−1) + f_t(x_i).
The objective function at the t-th iteration is calculated as follows:
Obj^(t) = Σ_{i=1..n} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t)
where Ω represents the regularization term, Ω(f_t) = γT + (1/2) · λ · Σ_{j=1..T} ω_j², and l denotes the loss function; the squared loss l(y_i, ŷ_i) = (y_i − ŷ_i)² is used in the present invention.
and (7) training a classifier model, selecting the code annotations of 9 items as a training set, using the annotations of the remaining 1 item as a test set, and representing the label of the code annotation indicating the design technical liability in the training set by 0, the label of the code annotation indicating the demand technical liability by 1 and the label of the code annotation indicating the defect technical liability by 2. And finally obtaining a trained classifier model through continuous iteration and optimization of the model.
Step (8): classification prediction. For a new code annotation, first preprocess its text, then perform feature selection on the preprocessed text, and finally represent the text as a vector over the selected features. Each vector is fed into the classifier model, which computes a predicted score of the code annotation for every class; the class label with the largest score is the predicted label of the code annotation.
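Continuing the sketch above, prediction for a brand-new annotation reuses the fitted vectorizer, selector and model; the class-name mapping is illustrative.

# Assumes vectorizer, selector and model from the previous sketch are in scope.
label_names = {0: "design TD", 1: "requirement TD", 2: "defect TD"}

new_comment = ["todo this whole module needs a redesign"]   # assumed already preprocessed
x_new = selector.transform(vectorizer.transform(new_comment))
scores = model.predict_proba(x_new)[0]                      # predicted score for every class
print(label_names[int(scores.argmax())], scores)            # the largest score wins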

Claims (3)

1. The self-acceptance technology debt multi-classification method based on XGboost is characterized by comprising the following steps:
step (1): acquire a code annotation set S = (S1, S2, …, Sn) from the dataset, where n is the number of code annotations over all classes and each sample is denoted Si = <id, comment, LB>, i = 1, 2, …, n, wherein id represents the number of the code annotation, comment represents the text of the code annotation, and LB represents the label of the code annotation, i.e. the type of technical debt;
step (2): preprocess the comment field of each sample Si;
firstly, completely same samples in an original data set are filtered by utilizing a character string full matching and cosine similarity calculation method;
then, deleting the historical version record contained in the code annotation;
finally, deleting noise information in the code annotation, wherein the noise information comprises numbers, punctuations, URLs, source codes and stop words; converting all words into a lower case letter form;
after preprocessing, each sample is Si = <id, preComment, LB>, wherein preComment represents the text of the preprocessed code annotation;
step (3) data enhancement is carried out on the text information of the preprocessed code annotation;
augmenting the code annotation text of requirement technical debt and defect technical debt by adopting the random swap and random shuffle strategies of the EDA method;
step (4) calculating the weight of each feature in the sample by using a chi-square statistical method, sequencing the features from large to small according to the weight values, and selecting s features with the largest weight;
step (5): use the CountVectorizer method to represent the text of all code annotations as an n×s word frequency matrix FM, wherein element FM[i][j] indicates the number of occurrences of the j-th word in the text of the i-th code annotation, i = 1, 2, …, n, j = 1, 2, …, s;
step (6): construct the XGboost-based classifier model;
firstly, according to the word frequency matrix FM, each sample Si in the code annotation set is represented as Si = (xi, yi), wherein xi = {FM[i][1], FM[i][2], …, FM[i][s]} and yi is the corresponding class label;
then, calculating the predicted values of all code annotations;
finally, training the classifier model additively, adding the best tree model to the classifier at each step;
step (7): train the classifier model by adopting leave-one-out cross-validation;
assuming the dataset contains p projects, the code annotations of p−1 projects are selected as the training set and the code annotations of the remaining project are used as the test set; in the training set, the label of a design technical debt annotation is represented by 0, the label of a requirement technical debt annotation by 1, and the label of a defect technical debt annotation by 2; through continuous iteration and optimization of the classifier model, the trained classifier model is finally obtained;
step (8): classification prediction:
for a new code annotation, first preprocess its text, then perform feature selection on the preprocessed text, and finally represent the text of each code annotation as a vector over the selected features; each vector is input into the classifier model for prediction, and the classifier model outputs a predicted score of the code annotation for every class; the class label with the largest score is the predicted label of the code annotation.
2. The XGboost-based self-acceptance technology debt multi-classification method according to claim 1, wherein: step (2) also comprises establishing a stop word list, wherein the list not only comprises words of 'the', 'an', 'for' and 'a', but also takes words with the length less than 3 or more than 20 as stop words, and the stop words contained in the text information of the code annotation are deleted according to the stop word list.
3. The XGboost-based self-acceptance technology debt multi-classification method according to claim 1, wherein: in step (3), considering that the generated samples may have negative influence on the classifier, the class interval is used to evaluate the generated samples, and the sample with the largest distance to all the samples in all the classes is selected.
CN202110081268.2A 2021-01-21 2021-01-21 XGboost-based self-acceptance technology debt multi-classification method Active CN112748951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110081268.2A CN112748951B (en) 2021-01-21 2021-01-21 XGboost-based self-acceptance technology debt multi-classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110081268.2A CN112748951B (en) 2021-01-21 2021-01-21 XGboost-based self-acceptance technology debt multi-classification method

Publications (2)

Publication Number Publication Date
CN112748951A CN112748951A (en) 2021-05-04
CN112748951B true CN112748951B (en) 2022-04-22

Family

ID=75652763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110081268.2A Active CN112748951B (en) 2021-01-21 2021-01-21 XGboost-based self-acceptance technology debt multi-classification method

Country Status (1)

Country Link
CN (1) CN112748951B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069252A (en) * 2019-04-11 2019-07-30 浙江网新恒天软件有限公司 A kind of source code file multi-service label mechanized classification method
CN111273911A (en) * 2020-01-14 2020-06-12 杭州电子科技大学 Software technology debt identification method based on bidirectional LSTM and attention mechanism
CN111782807A (en) * 2020-06-19 2020-10-16 西北工业大学 Self-acceptance technology debt detection and classification method based on multi-method ensemble learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on Classification Model of Equipment Support Personnel Based on Collaborative Filtering and Xgboost Algorithm; Jianqiao Sun et al.; 2017 International Conference on Computer Systems, Electronics and Control (ICCSEC); 2017-12-27; full text *
A malicious HTTP request identification method based on XGBoost; Xu Di; 《电信工程技术与标准化》; 2018-12-15 (Issue 12); full text *
Research on technical debt management in software integrated development environments; Liu Ya et al.; 《计算机科学》; 2017-11-15 (Issue 11); full text *

Also Published As

Publication number Publication date
CN112748951A (en) 2021-05-04

Similar Documents

Publication Publication Date Title
CN107729468B (en) answer extraction method and system based on deep learning
CN106776538A The information extracting method of enterprise's noncanonical format document
CN110188197B (en) Active learning method and device for labeling platform
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN111104526A (en) Financial label extraction method and system based on keyword semantics
JP2005222532A5 (en)
CN108550054B (en) Content quality evaluation method, device, equipment and medium
CN115062148B (en) Risk control method based on database
CN112070138A (en) Multi-label mixed classification model construction method, news classification method and system
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN110110087A (en) A kind of Feature Engineering method for Law Text classification based on two classifiers
CN111462752A Client intention identification method based on attention mechanism, feature embedding and BI-LSTM
CN109710725A (en) A kind of Chinese table column label restoration methods and system based on text classification
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN110245234A (en) A kind of multi-source data sample correlating method based on ontology and semantic similarity
CN111666748B (en) Construction method of automatic classifier and decision recognition method
CN112286799A (en) Software defect positioning method combining sentence embedding and particle swarm optimization algorithm
CN112748951B (en) XGboost-based self-acceptance technology debt multi-classification method
Tang et al. Enriching feature engineering for short text samples by language time series analysis
CN117235253A (en) Truck user implicit demand mining method based on natural language processing technology
WO2018220688A1 (en) Dictionary generator, dictionary generation method, and program
CN112069322B (en) Text multi-label analysis method and device, electronic equipment and storage medium
CN114239576A (en) Issue label classification method based on topic model and convolutional neural network
CN113505117A (en) Data quality evaluation method, device, equipment and medium based on data indexes
CN115481240A (en) Data asset quality detection method and detection device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant