CN112560410A - User demand labeling process management method based on active learning - Google Patents

Info

Publication number
CN112560410A
CN112560410A CN202110045602.9A CN202110045602A CN112560410A CN 112560410 A CN112560410 A CN 112560410A CN 202110045602 A CN202110045602 A CN 202110045602A CN 112560410 A CN112560410 A CN 112560410A
Authority
CN
China
Prior art keywords
user
requirement
vector
text
requirements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110045602.9A
Other languages
Chinese (zh)
Inventor
李传艺
张晟宇
骆斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202110045602.9A
Publication of CN112560410A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a user requirement labeling process management method based on active learning, comprising the following steps: extracting user requirement features, including extracting keyword frequency vectors, extracting heuristic attribute vectors, and calculating TF-IDF vectors; ordering user requirements, including clustering user requirements, preprocessing texts, adjusting clusters according to text similarity, and ordering according to the clustering results; training a classification model, including embedding word vectors, processing the user requirement text matrix with a convolutional neural network, integrating the features of the user requirements, and predicting the class with a neural network; and reordering user requirements, including calculating differences, preprocessing user requirements, calculating uncertainty, and reordering the user requirements. The case category information is used for assisting the law recommendation, and the attention mechanism is used, so that more attention can be paid to key words, and the accuracy of the law recommendation is improved.

Description

User demand labeling process management method based on active learning
Technical Field
The invention relates to a user requirement labeling method, in particular to a user requirement labeling process management method based on active learning, and belongs to the technical fields of requirements engineering and natural language processing.
Background
In recent years, as software has grown in scale and under the influence of factors such as the diversity of software requirements, the complexity of the digital environment, consistency, changeability, and invisibility, the efficiency and quality of software development have been unable to meet the objective needs of the software industry. Software technology is therefore constantly being updated, and the reuse of user requirements has attracted particular attention. Requirements engineering is the earliest stage of the software development process; reusing user requirements saves requirements analysis time and also helps to quickly locate other reusable software assets.
To reuse user requirements more effectively, informal, unstructured user requirements need to be transcribed into a well-structured requirements specification, which requires dividing them into different categories. User requirements are numerous, and classifying them all manually is impractical. There has been a great deal of research on automated requirement classification, particularly methods using natural language processing techniques. With the rapid development of machine learning, supervised learning algorithms are widely applied and perform well on text classification problems. On this basis, classification methods based on machine learning and deep learning have been widely applied to user requirement classification and achieve good classification performance. For classification systems based on supervised learning to perform well, a large number of user requirement instances must be manually labeled and the model trained on them. However, manually labeling the types of user requirements is labor-intensive and time-consuming, and for large software projects it is also error-prone. Therefore, to reduce the manual labeling workload and the cost of software development, the training set must be selected effectively so that labeling effort is not wasted.
Active learning is a well-known machine learning approach that uses a query strategy to select which unlabeled data instances to label, which reduces labeling cost and helps overcome the labeling bottleneck. In this way, a supervised learning algorithm can reach higher accuracy with as few labeled data instances as possible.
Among the different modes of active learning, pool-based active learning is widely applied to text classification, with query strategies based mainly on uncertainty sampling. This mode is also widely used for requirement classification. The algorithm uses the existing classification model to predict the types of the collected, unlabeled user requirements, selects from all candidate user requirements those whose types the model is least certain about according to some strategy, such as the least-confidence strategy, and has them labeled. This effectively improves model accuracy and saves the labeling cost required by models based on supervised learning. However, for the automatic classification of user requirements, existing active learning algorithms are limited in two respects: the seed selection strategy and the query strategy. For seed selection, most existing active learning algorithms randomly select a number of unlabeled data instances to form the initial seed data set. Because user requirements have many types and an unbalanced type distribution, a randomly selected seed set may contain no data instances of some less common types. Such a seed data set may cause subsequent query strategies to completely ignore data instances of the missing types, greatly reducing the classifier's performance on those types and slowing active learning; this is the missing class effect. For query strategies, uncertainty-based sampling is an exploratory strategy that selects instances based on the model's uncertainty about unlabeled data, which can make the active learner erroneously overconfident about data instances belonging to the missing types; this is sampling bias.
After each iteration of a software update, a large number of user requirements belonging to new classes are proposed. Uncertainty-based sampling has difficulty selecting data instances that belong to a "new class", so the classifier performs poorly on them. Even if a well-designed seed selection strategy produces a seed data set containing instances of all existing types, a query strategy is still needed to select new user requirements after each iterative update, and especially to find data instances belonging to a "new class". These two problems are more serious in active learning algorithms applied to user requirement classification, because the type distribution of user requirements is very unbalanced. The invention therefore provides a seed selection strategy based on domain knowledge of user requirements and a query strategy based on difference and uncertainty to optimize the process management of user requirement type labeling.
Disclosure of Invention
The invention relates to a user requirement labeling process management method based on active learning. First, features of the user requirements are extracted, the user requirements are ordered according to their keyword frequency features and text similarity, and the user requirements at the front of the ordering are provided to annotators for labeling. Second, a classification model based on a convolutional neural network is trained on the labeled user requirements. Then the unlabeled user requirements are reordered according to the uncertainty of the model and the difference between the unlabeled and labeled data. By drawing on domain knowledge of user requirements, the method selects the training set of the model effectively and reduces the manual labeling cost.
1. A user requirement labeling management method based on active learning, characterized by comprising the following steps:
acquiring user requirement descriptions, and extracting a word-level feature vector for each requirement description;
applying a clustering algorithm, based on the requirement feature vectors and the text content, to order the user requirements by type;
selecting a portion of the user requirements, labeling them manually, and training a classification model according to the feature vectors;
and performing type prediction on all remaining unlabeled data with the classification model to obtain type probability distributions, reordering the unlabeled requirements, and repeating the previous step until the performance of the model reaches expectations.
2. The method of claim 1, wherein acquiring user requirement descriptions and extracting a word-level feature vector for each requirement description comprises:
defining non-project-specific keywords for seven categories according to the existing classification standard for user requirements, selecting project-specific keywords from the user requirement data set of each software project by using word similarity, merging the two kinds of keywords into the keyword list of each category, counting the number of keywords of each category in each user requirement, and concatenating the seven counts into a 1 × 7 vector that serves as the keyword frequency vector of the requirement text;
defining 18 questions that help classify user requirements, providing a general expression for each question, and mapping each requirement to a 1 × 18 heuristic attribute vector according to how the user requirement text matches the general expressions;
and calculating a TF-IDF (Term Frequency-Inverse Document Frequency) feature vector for each user requirement.
3. The method of claim 1, wherein applying a clustering algorithm, based on the requirement feature vectors and the text content, to order the user requirements by type comprises:
clustering the unlabeled user requirements with a Gaussian mixture model and recording the clustering probabilities, wherein the input of the clustering model is the keyword frequency vector of each user requirement;
performing text preprocessing on all unlabeled user requirements, including removing misspelled words and stop words, and counting the frequency of every word appearing in each cluster;
calculating the text similarity between each user requirement and each cluster by using the word frequencies counted in the previous step, adjusting the clustering result according to the text similarity, and repeating this process until the clustering result no longer changes;
ordering all unlabeled user requirements according to the final clustering result by the following rules: the clusters are ordered by the number of user requirements they contain, from smallest to largest, and the user requirements within each cluster are ordered by their cluster posterior probability and text similarity.
4. The method of claim 1, wherein selecting a portion of the user requirements, labeling them manually, and training the classification model according to the feature vectors comprises:
selecting a given number of user requirements according to the ordering result and manually labeling their types;
representing every word appearing in the requirement text as a vector by using the Word2Vec method, so that each requirement text is represented by a len × emb matrix, where len is the number of words in the requirement text and emb is the embedding dimension of Word2Vec;
training a classification model with a convolutional neural network, including convolving the matrix with a plurality of convolution kernels to obtain a smaller matrix, pooling that matrix to obtain a still smaller matrix, and finally flattening the result into a one-dimensional vector;
concatenating the vector produced by the convolutional neural network from the Word2Vec matrix with the keyword vector, the heuristic attribute vector and the TF-IDF vector obtained in claim 1 or claim 2 to obtain the complete feature vector of the user requirement text;
and inputting the resulting feature vector into a neural network and, after processing by a hidden layer, outputting the probability vector of the classes to which the requirement may belong by using a Softmax function, so as to train the classification model.
5. The method of claim 1, wherein performing type prediction on all remaining unlabeled data with the classification model to obtain type probability distributions, reordering the unlabeled requirements, and repeating the previous step until the performance of the model reaches expectations comprises:
calculating the keyword vector, heuristic attribute vector and TF-IDF vector of the unlabeled user requirements, and calculating the difference between each unlabeled user requirement and the labeled user requirements from the first two features;
preprocessing the unlabeled user requirements so that they meet the input format of the classification model;
predicting the types of the unlabeled user requirements with the existing classification model, and calculating the model's uncertainty for each user requirement, the uncertainty being represented by the difference between the highest and second-highest predicted type probabilities;
ordering all unlabeled user requirements according to the differences and uncertainties, and repeating the steps defined in claim 4 until the performance of the classification model reaches a preset condition.
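The uncertainty measure named in claim 5 can be illustrated with a short Python sketch. It is an illustration under assumptions, not the claimed implementation: the function names `margin` and `rank_by_uncertainty` are hypothetical, the class probabilities are assumed to come from the Softmax output of the classifier, and only the uncertainty part of the ordering is shown (how it is combined with the difference measure is not sketched here).

```python
from typing import List, Sequence


def margin(probabilities: Sequence[float]) -> float:
    """Difference between the highest and second-highest class probability.

    A small margin means the model hesitates between two types, i.e. it is
    highly uncertain about this user requirement.
    """
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]


def rank_by_uncertainty(predictions: List[Sequence[float]]) -> List[int]:
    """Return requirement indices ordered from most to least uncertain."""
    return sorted(range(len(predictions)), key=lambda i: margin(predictions[i]))


if __name__ == "__main__":
    # Hypothetical Softmax outputs for three unlabeled requirements.
    example = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.60, 0.30, 0.10]]
    print(rank_by_uncertainty(example))
```

Sorting by ascending margin places the requirements the model can least distinguish at the front of the queue for annotators.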
Drawings
FIG. 1 is a flowchart of the user requirement labeling process management method based on active learning.
FIG. 2 is the list of non-project-specific keywords.
FIG. 3 is an example of project-specific keywords.
FIG. 4 is the list of 11 questions corresponding to the heuristic attributes.
FIG. 5 shows the user requirement classification model based on a convolutional neural network.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The invention relates to a user requirement labeling process management method based on active learning. First, features of the user requirements are extracted, the user requirements are ordered according to their keyword frequency features and text similarity, and the user requirements at the front of the ordering are provided to annotators for labeling. Second, a classification model based on a convolutional neural network is trained on the labeled user requirements. Then the unlabeled user requirements are reordered according to the uncertainty of the model and the difference between the unlabeled and labeled data. By drawing on domain knowledge of user requirements, the method selects the training set of the model effectively and reduces the manual labeling cost. The invention mainly comprises the following steps:
Step (1): extracting user requirement features;
Step (2): ordering the user requirements;
Step (3): training a classification model;
Step (4): reordering the user requirements.
The detailed workflow of the user requirement labeling process management method based on active learning is shown in FIG. 1. The steps above are described in detail below.
1. To convert user requirements into a form a computer can process, the requirement text must be converted into vectors; that is, features that help classification are extracted from the requirement text and vectorized. The invention mainly uses two types of feature vectors: vectors derived from each word in the requirement text, and feature vectors closely related to the category. The category-related feature vectors are the keyword frequency vector and the heuristic attribute vector; the word-derived feature vector is the TF-IDF vector. The specific steps are:
(1.1) Extract the keyword frequency vector of each user requirement. The keywords of user requirements consist of two parts: non-project-specific keywords and project-specific keywords. By analyzing the definitions of the various types of user requirements, the invention defines a list of non-project-specific keywords for each type, as shown in FIG. 2. On this basis, the invention uses word similarity to select project-specific keywords, as shown in FIG. 3, from the user requirement data set of each software project. The keyword frequency vector of each user requirement is a zero-initialized 1 × 7 vector in which each entry is the number of times keywords of the corresponding type occur in the requirement. All words in the user requirement are traversed; if a word is a keyword of some type, the entry for that type is incremented by 1.
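The counting procedure of step (1.1) can be sketched in Python as follows. This is a minimal sketch under assumptions: the keyword sets used in the example are hypothetical placeholders (the actual lists are those of FIG. 2 and FIG. 3), and tokenization by whitespace with punctuation stripping is an assumption rather than the method's stated preprocessing.

```python
from typing import List, Set


def keyword_frequency_vector(text: str, keywords_by_type: List[Set[str]]) -> List[int]:
    """Count, for each requirement type, how many words of the text are keywords of that type."""
    vector = [0] * len(keywords_by_type)  # zero-initialized, one entry per type
    for raw_word in text.lower().split():
        word = raw_word.strip(".,;:!?()\"'")  # assumed tokenization: strip punctuation
        for type_index, keywords in enumerate(keywords_by_type):
            if word in keywords:
                vector[type_index] += 1
    return vector


if __name__ == "__main__":
    # Hypothetical keyword sets for seven types; most are left empty for brevity.
    keyword_sets = [{"login", "password"}, {"fast", "seconds"},
                    set(), set(), set(), set(), set()]
    print(keyword_frequency_vector(
        "The login page must load fast, within two seconds.", keyword_sets))
```

With seven keyword sets the result is the 1 × 7 vector described in the text; a word that is a keyword of several types increments each matching entry.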
(1.2) Extract the heuristic attribute vector of each user requirement. Each requirement text can be divided into different parts, and some parts strongly suggest the categories to which the requirement may belong. The invention calls these different parts of a user requirement heuristic attributes (HP). Since a user requirement may have multiple HPs of different request types, and a single HP may indicate more than one request type, the relationship between HPs and requirement categories is used as a feature of the requirement text. First, the concepts related to heuristic attributes are defined:
System Capability (System Capability): the core functions of the system that the requirement provider wishes to have implemented.
Rationale (Rationale): the benefits of realizing this requirement, or the unexpected or negative events that will occur if it is not implemented.
Related Existing Capabilities (Related Existing Capabilities): existing functions or components associated with this requirement.
Expected or Unexpected Behavior (Expected or Unexpected Behavior): the behavior that the system should or should not exhibit according to this requirement.
Context (Context): the conditions under which the expected or unexpected behavior should or should not occur.
Implementation Instructions (Implementation Instructions): how the requested functionality is to be implemented, e.g., the steps that a developer should follow to develop it.
Example (Example): an example illustrating anything set forth in the requirement.
Seven heuristic attributes are manually defined from these concepts, together with a corresponding question for each (questions 1-7 in FIG. 4). The answer to each of these seven questions is "yes" or "no". The seven questions are answered for the requirement text; if the answer is "yes", the corresponding entry of the heuristic attribute vector is set to 1, otherwise it is set to 0. In addition to these seven questions, four further questions are added (questions 8-11 in FIG. 4). Question 8 concerns sentences describing system performance in the requirement text: if the requirement contains no such sentence, the value of the heuristic attribute is set to 0; if the first sentence of the requirement text describes system performance, the value is set to 1; otherwise, the value is set to 2. Question 9 concerns personal pronouns: the value of this heuristic attribute equals the number of person-related terms (e.g., I, you, we, user) in the requirement text, i.e., the answer to question 9. Question 10 concerns the system and its components: the value of this heuristic attribute equals the number of occurrences in the requirement text of words about the system or its components (e.g., system name, UI component), i.e., the answer to question 10. Question 11 is a summarizing question: the value of this heuristic attribute equals the number of sentences in the requirement text that do not correspond to any previous heuristic attribute, i.e., the answer to question 11. The 11 heuristic attribute values are concatenated into a 1 × 11 vector, so each requirement corresponds to a heuristic attribute vector of dimension 1 × 11.
In addition to the HP values, four other values are computed as features of the requirement: indices of words or sentences in the requirement text and the number of sentences. These four values supplement the heuristic attributes and, together with the seven heuristic attribute values, form the complete heuristic attribute of the requirement text.
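The count-valued attributes of questions 9 and 10 can be sketched in Python as follows. This is a minimal sketch under assumptions: the person-term set reuses only the examples given in the text, the system-term set is a hypothetical placeholder, and the yes/no questions 1-7 and questions 8 and 11 are not sketched because their matching expressions are defined in FIG. 4.

```python
import re
from typing import Set

# Examples taken from the text (I, you, we, user); a real list may be longer.
PERSON_TERMS: Set[str] = {"i", "you", "we", "user"}
# Hypothetical system and component words, for illustration only.
SYSTEM_TERMS: Set[str] = {"system", "dialog", "button"}


def count_terms(text: str, terms: Set[str]) -> int:
    """Count how many word tokens of the text belong to the given term set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in terms)


if __name__ == "__main__":
    requirement = "As a user, I want the system to show a dialog when we log in."
    question_9 = count_terms(requirement, PERSON_TERMS)   # person-related terms
    question_10 = count_terms(requirement, SYSTEM_TERMS)  # system or component words
    print(question_9, question_10)
```

Each count would fill one entry of the 1 × 11 heuristic attribute vector.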
(1.3) Calculate the TF-IDF vector. The TF-IDF value of each word is the product of two terms. The first is the normalized term frequency (TF): the number of times the word appears in the requirement text divided by the total number of words in that text. The second is the inverse document frequency (IDF), which measures the importance of a word: the total number of requirements is divided by the number of requirements containing the word (the denominator is usually increased by 1 to avoid a zero denominator), and the logarithm of the quotient is taken. That is, the TF-IDF value of a word w in a requirement r is calculated as:
$$\text{TF-IDF}(w, r) = \mathrm{TF}(w, r) \times \mathrm{IDF}(w)$$
where
$$\mathrm{TF}(w, r) = \frac{\text{number of occurrences of } w \text{ in } r}{\text{total number of words in } r}$$
$$\mathrm{IDF}(w) = \log_e\left(\frac{\text{total number of requirements}}{\text{number of requirements containing } w}\right)$$
Thus every word appearing in the requirement documents has a TF-IDF value in each requirement text, and the TF-IDF vector of a requirement text is formed by concatenating these TF-IDF values.
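The TF-IDF calculation of step (1.3) can be sketched in Python as follows. This is a minimal sketch under assumptions: requirements are assumed to be already tokenized into word lists, the natural logarithm is used, the optional "+1" in the denominator is omitted (every vocabulary word occurs in at least one requirement, so the denominator is never zero), and the function name `tf_idf_vectors` is hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def tf_idf_vectors(requirements: List[List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Return the sorted vocabulary and one TF-IDF vector per requirement."""
    vocabulary = sorted({word for req in requirements for word in req})
    total = len(requirements)
    # Number of requirements that contain each word at least once.
    doc_freq = Counter(word for req in requirements for word in set(req))
    idf = {word: math.log(total / doc_freq[word]) for word in vocabulary}

    vectors = []
    for req in requirements:
        counts = Counter(req)
        length = len(req)
        vectors.append([
            (counts[word] / length) * idf[word] if length else 0.0
            for word in vocabulary
        ])
    return vocabulary, vectors


if __name__ == "__main__":
    reqs = [["system", "shall", "respond"],
            ["system", "shall", "encrypt", "data"]]
    vocab, vecs = tf_idf_vectors(reqs)
    print(vocab)
    print(vecs[0])
```

Words that occur in every requirement, such as "system" in the example, receive an IDF of zero and therefore contribute nothing to the vector.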
2. In the software design stage, the large number of collected user requirements carry no type labels, so they must be prioritized for labeling. To construct a seed data set that contains data instances of all types, the requirements are clustered with an unsupervised clustering algorithm. The invention uses a Gaussian mixture model to cluster the unlabeled user requirements. To obtain a better clustering effect, the method characterizes each user requirement by its keyword frequency vector. The keyword frequency vectors are input into the Gaussian mixture model to obtain the clustering result, and the requirements are ordered according to the number of data instances in each cluster and the posterior probability of the data instances within each cluster. The specific steps are as follows:
(2.1) Cluster the user requirements. The keyword frequency vector of a user requirement has 7 entries, each of which can be approximately regarded as following a Gaussian distribution. The Gaussian mixture model therefore has seven Gaussian components, corresponding to seven clusters in the clustering result. From the clustering result of the Gaussian mixture model, the probability vector of each user requirement UR_i over the seven clusters is obtained, i.e., p(UR_i) = (p_{i1}, p_{i2}, ..., p_{i7}). Finally, the cluster assigned to UR_i is the one with the highest probability:
$$\mathrm{cluster}(UR_i) = \arg\max_{j \in \{1, \dots, 7\}} p_{ij}$$
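The assignment step of (2.1) can be sketched in Python as follows. This is a minimal sketch that assumes the posterior probability vectors have already been produced by a fitted Gaussian mixture model (for example, scikit-learn's `GaussianMixture.predict_proba` would be one way to obtain them; that choice is an assumption, not something the text specifies). The function name `assign_clusters` is hypothetical, and clusters are indexed from 0.

```python
from typing import List, Sequence


def assign_clusters(posteriors: List[Sequence[float]]) -> List[int]:
    """Assign each requirement to the cluster with the highest posterior probability.

    Ties are resolved in favor of the lowest cluster index.
    """
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]


if __name__ == "__main__":
    # Hypothetical posterior vectors p(UR_i) over seven clusters.
    probabilities = [
        [0.10, 0.60, 0.05, 0.05, 0.10, 0.05, 0.05],
        [0.70, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    ]
    print(assign_clusters(probabilities))
```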
(2.2) Adjust the clusters according to text similarity. The goal of the adjustment is to place each user requirement in the cluster that contains the most user requirements of its own type. Because the frequency in each cluster of every word of the user requirement must be calculated, the user requirement text is preprocessed first. Preprocessing comprises cleaning (punctuation removal and stop word removal) and normalization (stemming and lemmatization). The original user requirements contain punctuation marks, which are meaningless when calculating text similarity. Stop words are words that a sentence does not need, such as the articles "a" and "the"; they carry no meaning for the text similarity adjustment and needlessly increase the amount of computation. Stemming and lemmatization aim to recover the original form of a word.
To adjust the distribution of all user requirements across the clusters globally according to text similarity, the adjustment is performed in the following steps.
(1) A word frequency table is constructed for each cluster, and a word list is constructed for each user requirement, i.e., the user requirement is represented as
$$UR_i = (w_{i1}, w_{i2}, \dots, w_{in_i})$$
where n_i is the number of words in UR_i. The text similarity between the user requirement and each cluster is calculated by the following formula:
$$TS(UR_i, cluster_j) = \frac{1}{n_i} \sum_{k=1}^{n_i} \log f(w_{ik}, cluster_j)$$
Here TS(UR_i, cluster_j) denotes the text similarity between the user requirement and the cluster, and f(w_{ik}, cluster_j) denotes the frequency with which the word w_{ik} of the user requirement occurs in cluster_j. To avoid underflow when multiplying word frequencies, their logarithms are taken instead. Because user requirements differ in length, i.e., in the number of words they contain, the result is normalized by dividing the sum of the log frequencies by the length of the user requirement. When calculating the text similarity between a user requirement and each cluster, the user requirement must first be removed from the cluster to which it currently belongs, and that cluster's word frequency table updated.
(2) The clustering result is adjusted according to the text similarity, i.e., UR_i is moved to the cluster with the highest text similarity:
$$\mathrm{cluster}(UR_i) = \arg\max_{j} TS(UR_i, cluster_j)$$
(3) Step (2) is performed for every user requirement in the user requirement data set, and this adjustment pass over all user requirements is repeated k times. In the present invention, k = 5 gives a good clustering effect.
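The adjustment loop of step (2.2) can be sketched in Python as follows. This is a minimal sketch under several assumptions that the text does not specify: word frequency is taken as the raw count of a word in a cluster, add-one smoothing is applied inside the logarithm so that words absent from a cluster do not produce log(0), the loop stops early if a pass changes nothing, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def text_similarity(words: List[str], freq: Counter, smoothing: float = 1.0) -> float:
    """Length-normalized sum of log word frequencies (smoothing is an assumption)."""
    return sum(math.log(freq[w] + smoothing) for w in words) / len(words)


def adjust_clusters(requirements: List[List[str]], assignment: List[int],
                    n_clusters: int, rounds: int = 5) -> List[int]:
    """Move each requirement to its most similar cluster, repeated for up to `rounds` passes."""
    assignment = list(assignment)
    freq = [Counter() for _ in range(n_clusters)]  # one word frequency table per cluster
    for words, cluster in zip(requirements, assignment):
        freq[cluster].update(words)

    for _ in range(rounds):
        changed = False
        for i, words in enumerate(requirements):
            if not words:
                continue  # nothing to compare for an empty requirement
            old = assignment[i]
            freq[old].subtract(words)  # remove the requirement from its current cluster
            scores = [text_similarity(words, freq[j]) for j in range(n_clusters)]
            new = max(range(n_clusters), key=scores.__getitem__)
            freq[new].update(words)    # add it to the best-matching cluster
            assignment[i] = new
            changed = changed or new != old
        if not changed:
            break
    return assignment


if __name__ == "__main__":
    reqs = [["login", "password"], ["login", "password", "account"],
            ["response", "time", "seconds"], ["response", "time"],
            ["login", "session"]]
    print(adjust_clusters(reqs, [0, 0, 1, 1, 1], n_clusters=2))
```

In the example, the last requirement starts in the "response time" cluster and is moved to the "login" cluster because its words occur more often there.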
(2.3) Sorting according to the clustering result. The ordering takes into account both the posterior probability of the Gaussian mixture model and the text similarity. The specific steps are as follows:
(1) The seven clusters obtained after the text similarity adjustment are sorted in ascending order of the number of user requirements each cluster contains.
(2) Each cluster is traversed in the order of step (1), and the most representative user requirement is selected from it. The representativeness of a user requirement is the product of its cluster posterior probability and its text similarity. Before user requirements are selected from the clusters, the two quantities must be normalized to eliminate the difference in scale between them, so that they are comparable, that is, mapped into the same value space. First, the cluster posterior probability and the text similarity between every unlabeled user requirement and the cluster to which it belongs are collected, and each of the two features is normalized over all user requirements. The invention adopts linear function normalization (Min-Max Scaling), which applies a linear transformation to the original data to map the result into the range [0, 1], scaling the data proportionally. Finally, the selected user requirement is deleted from the cluster.
(3) Step (2) is repeated, skipping any cluster that has become empty, until all clusters are empty.
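The ordering procedure of step (2.3) can be sketched as a round-robin over clusters. The function names and the list-based inputs (`cluster_of`, `posterior`, `similarity`, indexed by requirement) are hypothetical, and the sketch assumes that the cluster order is fixed once from the initial cluster sizes, which is one reading of step (1).

```python
def min_max(values):
    """Linearly rescale values into [0, 1]; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_requirements(cluster_of, posterior, similarity):
    """Return requirement indices in the order they are offered for labeling."""
    post_n, sim_n = min_max(posterior), min_max(similarity)
    # Representativeness = normalized posterior * normalized text similarity.
    rep = [p * s for p, s in zip(post_n, sim_n)]
    clusters = {}
    for idx, c in enumerate(cluster_of):
        clusters.setdefault(c, []).append(idx)
    # Within each cluster, most representative first.
    for members in clusters.values():
        members.sort(key=lambda i: rep[i], reverse=True)
    # Visit clusters from smallest to largest.
    order = sorted(clusters, key=lambda c: len(clusters[c]))
    ranked = []
    while any(clusters[c] for c in order):
        for c in order:
            if clusters[c]:  # skip clusters that are already empty
                ranked.append(clusters[c].pop(0))
    return ranked
```

Visiting the smallest cluster first means that rare requirement types reach the annotator early, which matches the stated aim of covering every type in the seed set.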
3. After the user requirements have been selected and manually labeled, the classifier can be trained on the user requirements that carry type labels. The invention uses a user requirement classification model based on a convolutional neural network; the model framework is shown in FIG. 5. The specific sub-steps comprise:
(3.1) Vector embedding. All requirement texts are tokenized (because the requirement texts are in English, tokens are split directly on spaces), and each word is represented as a one-dimensional vector (whose length is the embedding dimension) using the Word2Vec function of the gensim library in Python. To simplify the model input while preserving accuracy, the mode of the number of words in the requirement texts is calculated and used as the length of every requirement text. Each requirement text can then be represented by a len × emb matrix, where len is the number of words in the requirement text and emb is the embedding dimension of Word2Vec.
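The fixed-length representation can be sketched without any external library. Here a plain dictionary `embed` stands in for the trained Word2Vec vectors, and the function name `to_fixed_length` is hypothetical. Zero-padding short texts, truncating long ones, and mapping unknown words to a zero vector are assumptions of this sketch; the description only states that the mode of the word counts is used as the common length.

```python
from statistics import mode


def to_fixed_length(token_lists, embed, emb_dim):
    """Pad or truncate each requirement to the modal token count."""
    target_len = mode([len(tokens) for tokens in token_lists])
    # Assumed padding vector, also used for out-of-vocabulary words.
    zero = [0.0] * emb_dim
    matrices = []
    for tokens in token_lists:
        # Truncate to target_len and look up each word's vector.
        rows = [embed.get(tok, zero) for tok in tokens[:target_len]]
        # Pad short texts with zero rows up to target_len.
        rows += [zero] * (target_len - len(rows))
        matrices.append(rows)
    return target_len, matrices
```

Every returned matrix has `target_len` rows and `emb_dim` columns, which corresponds to the len × emb shape described above.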
(3.2) Processing the Word2Vec matrix of the requirement text with a convolutional neural network. A convolution operation is performed on the Word2Vec matrix using 128 convolution kernels of sizes 2, 3 and 4 respectively, giving a three-dimensional vector; a pooling operation with a window size of 3 and a stride of 3 then yields a smaller matrix, and a final flatten operation yields a one-dimensional vector. The kernel sizes in the convolutional layer act much like N-grams and make better use of word-order features, while the subsequent pooling and flatten operations reduce the number of parameters in the model, greatly improving its efficiency without affecting classification accuracy.
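The effect of the convolution, pooling and flatten operations on the size of the feature vector can be checked with simple arithmetic. The sketch below assumes one-dimensional convolution along the word axis with stride 1 and no padding, pooling without padding, and 128 filters for each of the three kernel sizes; the description leaves these details open, so the numbers are illustrative only, and the function names are hypothetical.

```python
def conv_output_len(seq_len, kernel_size):
    """Length after an unpadded, stride-1 convolution along the word axis."""
    return seq_len - kernel_size + 1


def pool_output_len(length, window=3, stride=3):
    """Length after pooling without padding."""
    if length < window:
        return 0
    return (length - window) // stride + 1


def flattened_size(seq_len, kernel_sizes=(2, 3, 4), n_filters=128):
    """Total length of the flattened CNN feature vector."""
    total = 0
    for k in kernel_sizes:
        pooled = pool_output_len(conv_output_len(seq_len, k))
        # Each filter contributes one value per pooled position.
        total += pooled * n_filters
    return total
```

Under these assumptions, a requirement text of 20 words gives pooled lengths of 6, 6 and 5 for the kernel sizes 2, 3 and 4, so the flattened vector has (6 + 6 + 5) × 128 = 2176 entries.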
(3.3) Integrating the features of the user requirement. The Word2Vec matrix processed by the convolutional neural network is concatenated with the keyword vector, the heuristic attribute vector and the TF-IDF vector obtained in steps (1.1), (1.2) and (1.3) to obtain the complete feature vector of the user requirement text.
(3.4) Prediction and classification by the neural network. The input of the neural network comprises the feature vector extracted by the convolutional neural network and the feature vectors based on domain knowledge of user requirements, that is, the input vectors of the support vector machine classification model: keyword frequency, heuristic attributes, and TF-IDF. The four vectors are concatenated and fed into the fully connected part of the network, which consists of two layers. The first layer fully connects the feature vector to 128 neurons; the activation function of each neuron is the ReLU function, and the Dropout rate is 0.5. The output of this layer is passed to the output layer, called the softmax layer. Since there are 7 types of user requirement in total, the softmax layer has 7 neurons, each with the softmax function as its activation function. The softmax function, a normalization function, is used to output the category to which the requirement belongs. Given an array V whose i-th element is Vi, the softmax value of that element is:
Figure BSA0000230353960000081
In the present invention, requirements are divided into 7 categories, so the softmax function yields 7 values, and the index of the largest of these values is the category to which the requirement belongs.
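The final classification step can be sketched with the standard softmax definition, in which the value for element Vi is exp(Vi) divided by the sum of exp(Vj) over all elements. The function names are hypothetical, and subtracting the maximum before exponentiating is a common numerical safeguard added here, not something the description mentions.

```python
import math


def softmax(values):
    """Return softmax probabilities; subtracting the max avoids overflow."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits):
    """Index of the largest softmax value, i.e. the predicted category."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)
```

With seven output neurons, `predict_class` returns an index from 0 to 6 that identifies the predicted requirement type.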
4. After the user performs case description, the text input by the user is also preprocessed.
(4.1) Calculating the difference. All keywords are extracted from each unlabeled user requirement. A window of length 3, defined as a "keyword block", is built around each keyword in the user requirement: the keyword block takes one neighboring word from each side of the keyword. If the keyword appears at the beginning or end of the user requirement, only one neighboring word is added, so the window length becomes 2. The query strategy defines the difference using the keyword blocks. For any keyword in a user requirement, the difference of the corresponding keyword block is as follows:
Figure BSA0000230353960000082
wherein
Figure BSA0000230353960000083
Because keyword blocks (KCs) differ in length, the keyword blocks need to be length-normalized. In addition, to avoid overflow when word frequencies are multiplied, the logarithm of the word frequency is taken. Here f(wij) represents the frequency with which the word wij occurs in the labeled user requirement data set. The frequency of keywords is defined as follows:
Figure BSA0000230353960000091
The difference of each keyword block in a user requirement can be obtained by combining the three formulas above. The keyword difference of the user requirement is the sum of the differences of all its keyword blocks:
Figure BSA0000230353960000092
Similarly to the definition for the keyword vector of a user requirement, the difference of an unlabeled user requirement with respect to the heuristic attributes is defined as:
Figure BSA0000230353960000093
The query strategy based on keywords and heuristic attributes uses the two features together to measure the difference:
div(URi)=div(kwi)·div(hpi)
As before, to eliminate the difference in scale between the two, linear function normalization is applied before the calculation.
(4.2) Preprocessing the user requirements. The unlabeled user requirements are preprocessed so that they match the input of the classification model.
(4.3) Calculating the uncertainty. The unlabeled user requirements are predicted with the existing classification model, and the model's uncertainty is calculated for each user requirement. Based on the softmax output of the user requirement classification model, the probabilistic classifier assigns the user requirement r to the type with the highest posterior probability, namely:
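The construction of keyword blocks in step (4.1) can be sketched as follows. The function name `keyword_blocks` and the representation of a requirement as a token list are assumptions of this illustration. The difference formulas themselves are given only in the figures and are not reproduced here.

```python
def keyword_blocks(tokens, keywords):
    """Return the window around each keyword occurrence in a token list."""
    blocks = []
    for i, tok in enumerate(tokens):
        if tok in keywords:
            # One neighbor on each side, clipped at the text boundaries.
            start = max(i - 1, 0)
            end = min(i + 2, len(tokens))  # exclusive upper bound
            blocks.append(tokens[start:end])
    return blocks
```

A keyword in the middle of the text yields a block of three words, while a keyword at the first or last position yields a block of two, as the description specifies.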
Figure BSA0000230353960000094
According to the uncertainty sampling strategy, the uncertainty of the classification model with respect to the user requirement r is defined as follows:
Figure BSA0000230353960000095
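Claim 5 characterizes this uncertainty by the difference between the highest and the second-highest predicted probabilities. A minimal sketch of such a margin-based measure is given below. The function name is hypothetical, and the convention of returning one minus the margin, so that larger values mean greater uncertainty, is an assumption of this sketch rather than the formula shown in the figure.

```python
def margin_uncertainty(probs):
    """Uncertainty from the gap between the two most probable classes."""
    top, second = sorted(probs, reverse=True)[:2]
    # Assumed convention: 1 - margin, so larger means more uncertain.
    return 1.0 - (top - second)
```

A prediction split evenly between two classes gets the maximum value of 1, while a prediction that puts all probability on one class gets 0.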
(4.4) Re-ordering the user requirements. The difference-based query strategy ensures the diversity of the user requirements in the training set, so it is an exploitation strategy, that is, it searches the sample space as widely as possible. A suitable query strategy should balance exploration against exploitation. Uncertainty sampling works well for exploration problems, so the diversity-based query strategy is combined with uncertainty sampling to address both exploration and exploitation. The combined query strategy assigns each unlabeled user requirement a score according to its difference and uncertainty:
score(URi)=γ·div(URi)+(1-γ)·pUS
In the initial stage of the query strategy, the classification model should focus on searching the whole sample space so that the classifier obtains a rough decision boundary. As the data in the training set grows, the query strategy should focus on giving the classifier a more accurate decision boundary. To achieve this, γ is a parameter that decreases linearly from 1 to 0 as data selection proceeds. At the start of selection, γ = 1. Each time the query strategy selects a user requirement, 1/k is subtracted from γ. When the labeling budget is exhausted, γ has decreased to 0.
The active-learning-based user requirement labeling process management method according to the present invention has been described in detail above with reference to the accompanying drawings. The invention has the following advantages: a seed selection strategy is provided for the design stage of software, so that user requirements of all types can be selected and missing classes are avoided; and uncertainty sampling is extended with domain knowledge of user requirements, so that a more accurate classification model can be obtained at lower labeling cost.
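The combined score and the linear decay of γ can be sketched as a greedy selection loop. The function names are hypothetical, and the sketch assumes that k equals the labeling budget and that the difference and uncertainty values are already normalized to [0, 1]. In the actual process the classifier would be retrained and the uncertainties recomputed between selections; they are held fixed here to keep the example short.

```python
def score(div_value, unc_value, gamma):
    """Combined score: gamma * difference + (1 - gamma) * uncertainty."""
    return gamma * div_value + (1 - gamma) * unc_value


def select_in_order(div, unc, budget):
    """Greedily pick up to `budget` items while gamma decays from 1 to 0."""
    remaining = set(range(len(div)))
    gamma, chosen = 1.0, []
    for _ in range(min(budget, len(div))):
        best = max(sorted(remaining),
                   key=lambda i: score(div[i], unc[i], gamma))
        chosen.append(best)
        remaining.remove(best)
        # Linear decay: subtract 1/k per selection, with k taken as the budget.
        gamma = max(gamma - 1.0 / budget, 0.0)
    return chosen
```

With a budget of 2, the first pick is driven entirely by the difference term (γ = 1) and the second weighs both terms equally (γ = 0.5).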
It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. Also, a detailed description of known process techniques is omitted herein for the sake of brevity. The present embodiments are to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (5)

1. A user demand marking management method based on active learning is characterized by comprising the following steps:
acquiring user requirement description, and extracting a feature vector of a word level for each requirement description;
based on the demand characteristic vector and the text content, applying a clustering algorithm to sequence the user demands according to types;
screening out a part of user requirements, carrying out manual marking on the user requirements, and training a classification model according to the feature vectors;
and performing type prediction on all the remaining unmarked data by using the classification model to obtain type probability distribution, sequencing the unmarked requirements again, and circularly executing the previous step until the performance of the model reaches the expectation.
2. The method of claim 1, wherein obtaining user requirement descriptions, extracting a word-level feature vector for each requirement description comprises:
giving seven categories of non-project-specific keywords according to the existing classification standard for user requirements, selecting project-specific keywords from the user requirement data set of each software project by using word similarity, combining the two kinds of keywords into the keywords of each category, counting the number of keywords of each category in each user requirement, and concatenating the seven values into a 1 × 7 vector as the keyword frequency vector of the requirement text;
defining 18 problems which are helpful for classification aiming at user requirements, providing a general expression of each problem, and corresponding each requirement to a 1 multiplied by 18 heuristic attribute vector according to the matching condition of a user requirement text and the general expression;
and calculating a TF-IDF (Term Frequency-Inverse Document Frequency) feature vector for the user requirement.
3. The method of claim 1, wherein applying a clustering algorithm to rank user requirements by type based on requirement eigenvectors and textual content comprises:
clustering unlabeled user requirements by using a Gaussian mixture model, and recording the probability result of clustering, wherein the input of the clustering model is a keyword frequency vector required by the user;
performing text preprocessing on all the unmarked user requirements, including removing misspelled words and stop words, and counting the frequency of all the appearing words in each cluster in the cluster;
calculating the text similarity between each user requirement and each cluster by using the word frequency counted in the last step, adjusting the clustering result according to the text similarity, and repeating the process for a plurality of times until the clustering result is not changed;
sorting all the unlabeled user requirements according to the final clustering result by the following rules: the clusters are sorted in ascending order of the number of user requirements they contain, and the user requirements within each cluster are sorted according to their cluster posterior probability and text similarity.
4. The method of claim 1, wherein a portion of the user requirements are filtered and manually labeled, and training the classification model based on the feature vectors comprises:
screening out user requirements with a certain data volume according to the sorting result, and manually marking the types of the user requirements;
using a Word2Vec method to represent all words appearing in the requirement text as vectors, namely, each requirement text is represented by a len × emb matrix, len is the number of words in the requirement text, emb is the embedding dimension of Word2Vec, and finally each requirement is represented as a matrix;
training a classification model by using a convolutional neural network, which comprises performing a convolution operation on the matrix with a plurality of convolution kernels to obtain a smaller matrix, performing a pooling operation on that matrix to obtain a still smaller matrix, and finally performing a flatten operation to reduce the matrix to a one-dimensional vector;
splicing a Word2Vec matrix processed by using a convolutional neural network with the keyword vector, the heuristic attribute vector and the TF-IDF vector obtained in the claim 1 or the claim 2 to obtain a complete feature vector of the user required text;
and inputting the obtained feature vector into a neural network, and outputting the probability vector of the class to which the requirement belongs by using a Softmax function after the processing of a hidden layer so as to train a classification model.
5. The method of claim 1, wherein performing type prediction on all remaining unlabeled data using the classification model to obtain a type probability distribution, re-ordering the unlabeled requirements, and looping through the previous step until the model performance meets expectations comprises:
calculating the keyword vector, the heuristic attribute vector and the TF-IDF vector of the unlabeled user requirements, and calculating the difference between each unlabeled user requirement and the labeled user requirements according to the first two features;
preprocessing the unmarked user requirements to enable the unmarked user requirements to meet the input of a classification model;
predicting the unlabeled user requirements with the existing classification model, and calculating the model's uncertainty for each user requirement, the uncertainty being represented by the difference between the highest and the second-highest predicted type probabilities;
ordering all unlabelled user demands according to the differences and uncertainties, and repeating the steps defined in claim 4 until the performance of the classification model reaches a preset condition.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110045602.9A CN112560410A (en) 2021-01-14 2021-01-14 User demand labeling process management method based on active learning

Publications (1)

Publication Number Publication Date
CN112560410A true CN112560410A (en) 2021-03-26

Family

ID=75035573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110045602.9A Pending CN112560410A (en) 2021-01-14 2021-01-14 User demand labeling process management method based on active learning

Country Status (1)

Country Link
CN (1) CN112560410A (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination