WO2022095573A1 - A method and system for ranking answers on community question-answering websites combined with active learning - Google Patents
A method and system for ranking answers on community question-answering websites combined with active learning
- Publication number
- WO2022095573A1 (PCT/CN2021/116051, CN2021116051W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- question
- answer
- answers
- candidate
- community
- Prior art date
Classifications
- G06F40/216—Parsing using statistical methods
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/3344—Query execution using natural language analysis
- G06F16/3346—Query execution using probabilistic model
- G06F16/3347—Query execution using vector based model
- G06F16/338—Presentation of query results
- G06F16/9536—Search customisation based on social or collaborative filtering
- G06F16/9538—Presentation of query results
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/22—Matching criteria, e.g. proximity measures
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30—Semantic analysis
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
- G06N3/045—Combinations of networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/091—Active learning
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- The invention relates to Internet technology, and in particular to a method and system for ranking answers on community question-answering websites in combination with active learning.
- CQA is the abbreviation of Community Question Answering.
- The retrieval function of mainstream CQA websites generally returns a list of similar questions to the user, and ranks the answers under each similar question according to data such as likes and comments.
- This approach helps users select answers to a certain extent, but problems remain: browsing a large amount of question-and-answer data and judging the relative quality of answers drawn from different similar questions causes cognitive overload and degrades the user experience. It is therefore necessary to rank the answers of all similar questions uniformly and return a single sorted answer list matching the user's retrieval target.
- This is the CQA website answer ranking task, also called the community question-answering answer ranking task.
- The characteristics of CQA website question-and-answer data make answer ranking for CQA websites difficult to study.
- Question and answer texts on CQA websites differ considerably in length, with few co-occurring words and a sparse distribution.
- The answer text contains much redundancy, noise, and even incorrect information, which exacerbates the semantic gap between question and answer texts.
- Related work on CQA websites generally introduces community features computed from community data, such as an answer's share of approvals relative to the total approvals of all answers under the question, or a user's average number of likes per answer based on the user's answer count.
- Such calculations reflect community characteristics accurately only when the community data is large enough.
- Community data on CQA websites follows a long-tailed distribution: a large amount of question-and-answer data has very little associated community data, so the answer ranking model is biased toward the data with abundant community features, while the sparse community features of the long-tail data cannot accurately reflect its real quality.
- the correct answer is not unique.
- A user's evaluation of an answer is based on comparison with the other candidate answers, so a pairwise (answer-pair-based) ranking method is more suitable: the answer ranking problem is converted into a series of binary classification problems that predict the ordering relation between any two candidate answers under the target question.
- Compared with pointwise ranking methods that predict the relevance between a question and a single answer, the pairwise method must label the ordering relation between every pair of candidate answers when building the training set, which enlarges the training set and increases the labeling difficulty.
- The object of the present invention is to provide a method and system for ranking answers on community question-answering websites combined with active learning. It addresses the problems caused in prior-art CQA answer ranking by the semantic gap between question-and-answer texts and by the long-tail distribution of community data, reduces interference in the answer ranking process, and lowers both the difficulty of text modeling and the cost of sample labeling.
- To this end, the present invention adopts the following technical solutions:
- A community Q&A website answer ranking method combined with active learning, comprising the following steps:
- In the question-and-answer data, the title of the target question, the content of each candidate answer, and the title of the original question corresponding to each candidate answer are first word-segmented and stripped of stop words; the text is then represented as a word-vector matrix by word2vec.
- The community data related to the question and answer in step S1 includes the question's answer count, the answer's share of approvals, the user's answer count, the user's average number of approvals, average number of likes, average number of answer favorites, and follower count;
- the number of answers to the question refers to the total number of answers under the question
- the number of user answers refers to the total number of answers provided by the user on the website
- the number of user followers refers to the total number of people followed by the user
- The share of approvals for an answer is calculated as the number of approvals the answer receives divided by the total number of approvals of all answers under the question.
- The user's average number of approvals, average number of likes, and average number of answer favorites refer to the approvals, likes, and favorites the user obtains on average per answer; each average is the corresponding total divided by the user's number of answers, where:
- uac i denotes the number of answers of user u i ;
- uvc i denotes the total approvals obtained by all answers of user u i , i.e., the user's approval count;
- ula i denotes the average number of likes of user u i , i.e., ula i = ulc i / uac i ;
- ulc i denotes the total likes obtained by all answers of user u i .
- The question long-tail factor and the user long-tail factor in step S1 are computed from the following quantities, each mapped into the interval (0,1):
- qv i denotes the sum of the approval counts of all answers under question q i ;
- m i is the total number of answers under the question;
- uac i denotes the number of answers of user u i .
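The mapping itself is not reproduced above; a plausible saturating form consistent with the stated requirement (monotone, bounded in (0,1), down-weighting small counts) is sketched below. The functional form and the constants $c_q$, $c_u$ are assumptions, not the patent's verbatim formula:

```latex
\lambda_{q_i} = \frac{qv_i}{qv_i + c_q}, \qquad
\lambda_{u_i} = \frac{uac_i}{uac_i + c_u}, \qquad c_q, c_u > 0
```

Each question-level community feature would then be multiplied by $\lambda_{q_i}$ and each user-level feature by $\lambda_{u_i}$, as described in the surrounding text.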
- The structure of the QQA-CNN model in step S1 includes a deep network for the target question, a deep network for the candidate answer, attention mechanism modules between the two deep networks, and a feature connection layer. The deep network for the target question consists of two convolutional layers and two pooling layers.
- The deep network for the candidate answer consists of three convolutional layers and three pooling layers.
- Two attention mechanism modules are introduced: one in front of the two deep networks and one between their first pooling layers.
- Finally, in the feature connection layer, the learned high-level semantic features of the target question and the candidate answer, the community features, and the similarity feature between the target question and the original question corresponding to the candidate answer are concatenated to obtain the representation of the CQA website question-and-answer data.
- In the convolutional layers, the QQA-CNN model uses wide convolution to extract the semantic features of several consecutive words; in the pooling layers, QQA-CNN adopts two pooling strategies.
- For all but the last pooling layer, the QQA-CNN model adopts partial pooling, i.e., average pooling of the features within a fixed-length window; for the last pooling layer in the network, it adopts full pooling, i.e., average pooling of the convolution results over the sentence-length dimension.
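As an illustration of the two pooling strategies just described, a minimal sketch in plain Python follows (the stride-1 window is an assumption; the patent does not preserve the window parameters):

```python
def partial_pooling(feature_map, w):
    """Average-pool each length-w window (stride 1) along the sentence axis.
    feature_map: list of d filter rows, each a list of n per-position values."""
    return [[sum(row[i:i + w]) / w for i in range(len(row) - w + 1)]
            for row in feature_map]

def full_pooling(feature_map):
    """Average-pool over the whole sentence-length dimension, collapsing
    each filter row to a single value (a length-d feature vector)."""
    return [sum(row) / len(row) for row in feature_map]

fm = [[1.0, 2.0, 3.0, 4.0],
      [0.0, 2.0, 4.0, 6.0]]
print(partial_pooling(fm, 2))  # [[1.5, 2.5, 3.5], [1.0, 3.0, 5.0]]
print(full_pooling(fm))        # [2.5, 3.0]
```

Partial pooling keeps positional structure for the next convolutional layer, while full pooling produces the fixed-size vector needed by the connection layer.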
- The attention mechanism module computes attention weights from the feature maps output by the convolutional layers of the two deep networks, and applies the result in the pooling layers as weighted pooling.
- Let the feature maps of the target question text and the candidate answer text obtained from the convolutional layers be given; the attention matrix A is computed from the pairwise match scores between their positions.
- The sum of the elements in each row (or column) of A is taken as the attention weight of the corresponding word.
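The exact expression for A is not preserved in this text; the sketch below uses the ABCNN-style match score 1/(1 + Euclidean distance) as a stand-in assumption, and derives the per-word pooling weights as row and column sums, as stated above:

```python
import math

def attention_matrix(Fq, Fa):
    """A[i][j] = match score between question position i and answer position j.
    Fq, Fa: lists of per-position feature vectors. The match function
    (1 / (1 + Euclidean distance), as in ABCNN) is an assumption; the
    patent's exact expression is not reproduced here."""
    return [[1.0 / (1.0 + math.dist(qi, aj)) for aj in Fa] for qi in Fq]

def pooling_weights(A):
    """Row sums weight the question words, column sums the answer words."""
    wq = [sum(row) for row in A]
    wa = [sum(col) for col in zip(*A)]
    return wq, wa
```

These weights would then scale the convolution outputs before the average pooling described in the previous bullets.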
- The feature connection layer concatenates features, including the high-level semantic features of the target question text, the high-level semantic features of the candidate answer text, the community features of the question-and-answer data, and the cosine similarity between the text feature matrices of the target question and the original question corresponding to the candidate answer.
- Finally, the question-and-answer data is represented as a distributed vector by the QQA-CNN model.
- Through statistical analysis, three rules are obtained and formalized. First, on a CQA website, under the same question, the best answer ranks higher than any non-best answer. Second, under the same question, there is no ordering difference among the non-best answers. Third, under the target question, answers whose original questions are in the same domain as the target question rank higher than answers whose original questions are in a different domain.
- The symbol > denotes that, for the target question q i , one candidate question-answer pair ranks higher than the other, i.e., the ordering label is 1.
- A second symbol denotes that two candidate question-answer pairs have no ordering difference for the target question q i , i.e., the ordering label is 0.
- Based on these rules, a program automatically constructs the labeled training set L.
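The three rules can be sketched as an automatic labeling routine. The candidate-pair fields (`qid`, `is_best`, `domain`) are illustrative names, not identifiers from the patent:

```python
from itertools import combinations

def auto_label(pairs, target_domain):
    """Build the labeled training set L from the three statistical rules.
    Each candidate question-answer pair is a dict with 'qid', 'is_best',
    'domain'. Returns (pair_a, pair_b, label): 1 if a ranks above b,
    -1 if below, 0 for no ordering difference; pairs the rules do not
    cover are left for active learning."""
    L = []
    for a, b in combinations(pairs, 2):
        label = None
        if a['qid'] == b['qid']:                      # same original question
            if a['is_best'] and not b['is_best']:
                label = 1                             # rule 1: best > non-best
            elif b['is_best'] and not a['is_best']:
                label = -1
            elif not a['is_best'] and not b['is_best']:
                label = 0                             # rule 2: non-best tie
        elif (a['domain'] == target_domain) != (b['domain'] == target_domain):
            label = 1 if a['domain'] == target_domain else -1  # rule 3
        if label is not None:
            L.append((a, b, label))
    return L
```

Pairs falling outside all three rules stay unlabeled, which is exactly the pool the active-learning query function draws from later.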
- The answer ranking model in step S2 is built from two QQA-CNN models with shared parameters and a fully connected layer; the input comprises the text features and community features of the target question and two candidate question-answer pairs.
- The model combines the target question with each of the two candidate question-answer pairs to form two question-answer triples, and feeds the text and community features of each triple into one of the two parameter-sharing QQA-CNN models to obtain the feature representation of each triple.
- The triple feature representations learned by the QQA-CNN models are input into the fully connected layer, and the relevance score between the target question and each candidate question-answer pair is obtained through a nonlinear mapping.
- The final ranking label is output according to the magnitudes of the relevance scores: an output of 1 means the first candidate question-answer pair ranks higher than the second in the final ranking; an output of -1 means the opposite.
- The loss function of the answer ranking model consists of a hinge loss term, a parameter regularization term, and a penalty term; in it:
- t i and t' i represent the set of related features of question and answer triples with sorting labels 1 and -1; u j and u' j represent the set of related features of question and answer triples with sorting labels of 0;
- F(t i ) represents the correlation score obtained by inputting the fully connected layer after t i is characterized by QQA-CNN;
- y i represents the expected sequence label of the candidate question and answer pair;
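The exact formula is not preserved in this text; one standard instantiation of "hinge loss + tie penalty + regularization", consistent with the symbols just defined, is sketched below. The margin and the coefficients `lam`, `mu` are assumptions:

```python
def ranking_loss(scores_pos, scores_tie, weights, margin=1.0, lam=1e-3, mu=0.1):
    """Sketch of the three-term loss, not the patent's verbatim formula.

    scores_pos: list of (F(t_i), F(t'_i), y_i) with y_i in {1, -1};
                y_i * (F(t_i) - F(t'_i)) should exceed the margin.
    scores_tie: list of (F(u_j), F(u'_j)) whose scores should coincide
                (ordering label 0).
    weights:    flat list of model parameters for the L2 term."""
    hinge = sum(max(0.0, margin - y * (s - s2)) for s, s2, y in scores_pos)
    tie = mu * sum((s - s2) ** 2 for s, s2 in scores_tie)
    reg = lam * sum(w * w for w in weights)
    return hinge + tie + reg
```

A correctly ordered pair with a comfortable score gap contributes zero hinge loss, while tied triples are pushed toward equal scores by the penalty term.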
- The unlabeled training set is constructed according to the actual research target. For the target question, the k-NN algorithm, implemented on the open-source graph computing framework GraphLab, retrieves several similar questions from the data set; the similar questions and all answers under them form the set of candidate question-answer pairs for the target question. Two candidate question-answer pairs are then drawn each time, without repetition, from this set, and the target question is combined with each of them to form two triples; each such triple pair is one sample of the unlabeled training set. In addition to the automatic construction of the labeled training set, active learning is applied to the answer ranking algorithm: a query function selects targeted samples from the unlabeled training set.
- The query function first measures, based on information entropy, the gap between the relevance scores of the two candidate question-answer pairs.
- When selecting samples, the query function also combines the similarity between candidate answers.
- In the final query function:
- a i and a' i denote the text feature matrices of the two candidate answers;
- sim denotes cosine similarity;
- m i denotes the number of question-answer triple pairs under the target question query i .
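The combined expression is not preserved in this text. The sketch below keeps the stated ingredients: binary entropy over the two relevance scores, discounted by answer similarity, averaged over the m_i triple pairs. The discounting form (1 - similarity) is an assumption:

```python
import math

def pair_entropy(s1, s2):
    """Binary entropy of the softmax-normalized relevance scores; maximal
    when the two scores are equal, i.e. when the model is most uncertain."""
    p = math.exp(s1) / (math.exp(s1) + math.exp(s2))
    return -sum(x * math.log(x) for x in (p, 1.0 - p) if x > 0.0)

def query_score(triple_pairs):
    """Informativeness of one target question's m_i triple pairs.
    Each entry is (score1, score2, answer_cosine_sim); discounting by
    (1 - similarity) stands in for the patent's exact combination,
    since near-duplicate answers carry little labeling value."""
    m = len(triple_pairs)
    return sum(pair_entropy(s1, s2) * (1.0 - c)
               for s1, s2, c in triple_pairs) / m
```

Samples with the highest query score would be sent to the annotator first.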
- The present invention also provides a community question-answering website answer ranking system combined with active learning, comprising:
- a question-and-answer data representation module, used to extract the text features of the question-and-answer data and to represent the question titles and answer contents, after word segmentation and stop-word removal, as word-vector matrices;
- in this module, the question long-tail factor and the user long-tail factor map the total number of approvals under a question and a user's number of answers into (0,1); each community feature of the question-and-answer data is replaced by its product with the corresponding long-tail factor, and the question-and-answer data is represented as a distributed vector by feeding the community features into the QQA-CNN model;
- a training set construction and answer ranking module, used to perform statistical analysis on the question-and-answer data set and formalize the results into rules, automatically construct a preliminary labeled training set from the rules, build an answer ranking model based on the QQA-CNN model to predict the ordering relation between any two candidate answers, construct an unlabeled training set, select additional samples from it for manual labeling in combination with active learning, merge the labeling results into the preliminary labeled training set, retrain the answer ranking model, and use the retrained model to rank answers on the community question-answering website.
- Compared with the prior art, the present invention has the following beneficial effects: it first represents and models the question-and-answer data of the CQA website, resolves the interference that the long-tail distribution of community data brings to answer ranking through the long-tail factors, and uses a convolutional neural network with an attention mechanism to alleviate the semantic gap between question and answer texts.
- Active learning is combined with answer ranking: besides the automatically constructed labeled training set, an unlabeled training set is built, additional samples are selected from it for labeling, and the labeling results are merged, achieving the highest possible model performance at the lowest possible annotation cost.
- The invention can uniformly rank the candidate answers under a target question on a CQA website.
- Fig. 1 is a schematic structural diagram of the QQA-CNN model of the present invention;
- FIG. 2 is a schematic structural diagram of an answer ranking model of the present invention.
- The community question-answering website answer ranking method combined with active learning of the present invention can be divided into the following two processes:
- Step 1: In the question-and-answer data, the title of the target question, the content of each candidate answer, and the title of the original question corresponding to each candidate answer are first word-segmented and stripped of stop words; the text is then represented as a word-vector matrix by word2vec.
- Step 2: Extract the question's answer count, the answer's share of approvals, the user's answer count, the user's average number of approvals, average number of likes, average number of answer favorites, and follower count as the community features of the question-and-answer data.
- the number of answers to the question refers to the total number of answers under the question; the number of user answers refers to the total number of answers provided by the user on the website; the number of user followers refers to the total number of people followed by the user.
- The share of approvals for an answer refers to the proportion of the approvals obtained by the answer to the total approvals obtained by all answers under the question; it is calculated as the answer's approval count divided by that total.
- The user's average number of approvals, average number of likes, and average number of answer favorites refer to the approvals, likes, and favorites the user obtains on average per answer; each average is the corresponding total divided by the user's number of answers, where:
- uac i denotes the number of answers of user u i ;
- uvc i denotes the total approvals obtained by all answers of user u i , i.e., the user's approval count;
- ula i denotes the average number of likes of user u i , i.e., ula i = ulc i / uac i ;
- ulc i denotes the total likes obtained by all answers of user u i .
- The question long-tail factor and the user long-tail factor map the total number of approvals under a question and a user's number of answers into (0,1); each community feature is replaced by its product with the corresponding long-tail factor to balance the long-tailed data.
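Steps 2 and the long-tail scaling can be sketched together. The saturating form of the long-tail factor and all dictionary key names are illustrative assumptions; the patent's exact mapping is not reproduced here:

```python
def long_tail_factor(count, c=10.0):
    """Map a raw count into (0,1); the form count/(count + c) is a
    hypothetical reconstruction of the long-tail factor."""
    return count / (count + c)

def weighted_features(answer_vc, question_vcs, user):
    """Community features of one answer, scaled by the question and user
    long-tail factors as described above.
    answer_vc:    approvals of this answer.
    question_vcs: approvals of every answer under the question.
    user:         dict of raw totals for the answering user."""
    qv = sum(question_vcs)                        # total approvals under q_i
    share = answer_vc / qv if qv else 0.0         # answer's approval share
    uac = user['answer_count']
    per_answer = {k: user[k] / uac if uac else 0.0
                  for k in ('approvals', 'likes', 'favorites')}
    lam_q, lam_u = long_tail_factor(qv), long_tail_factor(uac)
    feats = {'approval_share': share * lam_q}     # question-level feature
    feats.update({f'avg_{k}': v * lam_u for k, v in per_answer.items()})
    return feats
```

For a question or user with few data points the factor is small, shrinking the feature toward zero so the ranking model does not over-trust unreliable community statistics.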
- Step 3 Input the text features of the target question, the candidate answers and the candidate answers corresponding to the original question and the community features related to the question and answer data into the QQA-CNN model to obtain the distributed representation of the question and answer data.
- the structure of the QQA-CNN model includes a deep network for the target question, a deep network for the candidate answer, an attention mechanism module and a feature connection layer between the two deep networks.
- the deep network for the target question consists of two convolutional layers and two pooling layers
- the deep network for the candidate answer consists of three convolutional layers and three pooling layers
- QQA-CNN introduces two attention mechanism modules: one in front of the two deep networks and one between their first pooling layers.
- Finally, in the connection layer, the learned high-level semantic features of the target question and the candidate answer, the community features, and the similarity feature between the target question and the original question corresponding to the candidate answer are concatenated to obtain the representation of the CQA website question-and-answer data.
- the QQA-CNN model uses wide convolution to extract the semantic features of several consecutive words.
- In the pooling layers, the QQA-CNN model adopts two pooling strategies: for all but the last pooling layer it adopts partial pooling, i.e., average pooling of the features within a fixed-length window; for the last pooling layer in the network it adopts full pooling, i.e., average pooling of the convolution results over the sentence-length dimension.
- the attention mechanism module calculates the attention weight based on the feature maps output by the convolutional layers of the two deep models, and applies the results to the pooling layer for weighted pooling.
- The attention matrix A is computed from the feature maps of the target question and the candidate answer obtained from the convolutional layers; the sum of the elements in each row or column of A gives the attention weight of the corresponding word.
- QQA-CNN adds a connection layer after the two deep neural networks to combine features, including the high-level semantic features of the target question text, the high-level semantic features of the candidate answer text, the community features of the question-and-answer data, and the cosine similarity between the text feature matrices of the target question and the original question corresponding to the candidate answer. Finally, the question-and-answer data is represented as a distributed vector by the QQA-CNN model.
- Step 1 Statistically analyze the question and answer data set of the community question and answer website, and formalize the obtained results into rules to automatically construct an annotation training set.
- The symbol > denotes that, for the target question q i , one candidate question-answer pair ranks higher than the other, i.e., the ordering label is 1.
- A second symbol denotes that two candidate question-answer pairs have no ordering difference for the target question q i , i.e., the ordering label is 0.
- Based on these rules, a program automatically constructs the labeled training set L.
- Step 2 Build an answer ranking model based on the QQA-CNN model and train to predict the ranking relationship between any two candidate answers.
- the answer ranking model is built based on two QQA-CNN models with shared parameters and fully connected layers, and the input includes the target question and the text features and community features related to the two candidate question-answer pairs.
- The model combines the target question with each of the two candidate question-answer pairs to form two question-answer triples, and feeds the text and community features of each triple into one of the two parameter-sharing QQA-CNN models to obtain the feature representation of each triple.
- The triple feature representations learned by the QQA-CNN models are then input into the fully connected layer, and the relevance score between the target question and each candidate question-answer pair is obtained through a nonlinear mapping.
- The final ranking label is output according to the magnitudes of the two relevance scores: an output of 1 means the first candidate question-answer pair should rank higher than the second in the final ranking; an output of -1 means the opposite.
- The loss function of the answer ranking model consists of a hinge loss term, a parameter regularization term, and a penalty term; in it:
- t i and t' i represent the set of related features of question and answer triples with sorting labels 1 and -1; u j and u' j represent the set of related features of question and answer triples with sorting labels of 0;
- F(t i ) represents the correlation score obtained by inputting the fully connected layer after t i is characterized by QQA-CNN;
- y i represents the expected sequence label of the candidate question and answer pair;
- Step 3: Construct an unlabeled sample set, select additional samples from it for manual labeling in combination with active learning, and merge them into the labeled training set to further train the answer ranking model.
- The unlabeled training set U is constructed according to the actual research goal: for a target question, the k-NN algorithm, implemented on the open-source graph computing framework GraphLab, retrieves several similar questions from the data set; the candidate question-answer pair set of the target question is then built from the similar questions and all answers under them.
- Each time, two candidate question-answer pairs are selected without repetition from the candidate question-answer pair set of the target question and combined with the target question into two triples; the resulting triple pair is one sample of the unlabeled training set.
- From the unlabeled training set, the unlabeled samples most conducive to improving the performance of the answer ranking model are selected, labeled, and used to train the model.
- The query function first measures, based on information entropy, the gap between the relevance scores of the two candidate question-answer pairs: the smaller the gap, the larger the information entropy and the greater the uncertainty of the model's prediction.
- The query function also considers the similarity between the candidate answers when selecting samples; in the final query function, m_i denotes the number of question-answer triple pairs under the target question query_i.
- The present invention also provides a community question-answering website answer ranking system combined with active learning, comprising:
- a question-answer data representation module, configured to extract the text features of the question-answer data, represent the question titles and answer contents after word segmentation and stop-word removal as word vector matrices, map the total approvals under a question and the user's number of answers into (0,1) through the question long-tail factor and the user long-tail factor, replace the original community features with the obtained community features multiplied by the two long-tail factors, and represent the question-answer data as distributed vectors by inputting the community features into the QQA-CNN model;
- a training set construction and answer ranking module, configured to perform statistical analysis on the question-answer data set, formalize the statistical results into rules, automatically construct a preliminary labeled training set based on the rules, build an answer ranking model based on the QQA-CNN model to predict the ranking relationship between any two candidate answers, construct an unlabeled training set, select additional samples from it for manual labeling in combination with active learning, merge the labeling results into the preliminary labeled training set, retrain the answer ranking model, and use the retrained answer ranking model to rank answers on the community question-answering website.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Human Computer Interaction (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A community question-answering (CQA) website answer ranking method and system combined with active learning. The ranking method comprises: step S1, question-answer data representation and modeling; and step S2, training set construction combined with active learning and prediction of the ranking relationship between candidate question-answer pairs. A CQA website answer ranking system combined with active learning is also provided. First, the question-answer data of the CQA website is represented and modeled: long-tail factors resolve the interference that the long-tail distribution of community data brings to answer ranking, and an attention mechanism is introduced into the convolutional neural network to alleviate the semantic gap between question and answer texts. Then, active learning is combined with answer ranking: in addition to the labeled training set automatically constructed from rules, an unlabeled training set is built, additional samples are selected from it for labeling, and the answer ranking model is retrained after merging the labeling results, thereby trading the lowest possible labeling cost for the highest possible model performance.
Description
The present invention relates to Internet technology, and in particular to a community question-answering website answer ranking method and system combined with active learning.
Since the beginning of the 21st century, user-centered Web 2.0 technology has developed rapidly, and Internet users have taken on the dual role of content consumers and content producers. The Internet's support for user-generated content (UGC) allows users to share richer and more diverse information with each other, and community question answering (CQA) websites emerged on this basis. A CQA website is an open platform for exchanging knowledge and information: through natural-language questions and answers, it connects users who need information with users willing to share their experience and knowledge, enables accurate and direct transfer of knowledge, and lets users express their attitude toward question-answer data through operations such as upvoting and commenting. From the appearance of the first CQA website, "Yahoo! Answers", in 2005 to today's many Chinese and English CQA websites such as "StackExchange", "Quora", "Zhihu" and "Baidu Knows", these platforms have attracted a large number of users and become an important channel for obtaining information and sharing knowledge.
As CQA websites have developed, the way users obtain information has shifted from asking directly to searching first. In the early days of CQA websites, no question-answer data had accumulated, so users with information needs usually posted a question and waited for other users to answer. This yields exactly the information needed, but the wait is often long, and sometimes no answer ever arrives. In recent years CQA websites have accumulated large amounts of question-answer data, including many similar or even identical questions, so before asking, most users first search the website's historical question-answer data with their own question and only post a new question when the retrieved data cannot meet their needs, which shortens the wait and improves the experience. The search function of today's mainstream CQA websites generally returns a list of similar questions and ranks the answers under each similar question separately according to upvotes, comments and other data. This helps users choose answers to some extent, but problems remain: browsing large amounts of question-answer data and judging the relative quality of answers from different similar questions causes cognitive overload and degrades the user experience. Therefore, ranking the answers of all similar questions in a unified way and directly returning an already ranked answer list for the user's search target has become a research hotspot, known as the CQA website answer ranking task, or the community question answering task. However, the characteristics of CQA question-answer data make this task difficult. First, the lengths of question and answer texts differ greatly, co-occurring words are few and sparsely distributed, and, being user-generated, answer texts contain much redundancy, noise and even errors, which widens the semantic gap between question and answer texts and complicates text modeling. Second, CQA research usually introduces community features computed from community data, such as an answer's share of the total approvals under its question, or the average number of approvals a user receives per answer; such computations are only reliable when the underlying community data is large enough. In reality, CQA community data follows a long-tail distribution, and the community feature data of much question-answer data is very small, biasing the answer ranking model toward question-answer data whose community features do not accurately reflect their true quality. Finally, since the correct answer under a question on a CQA website is not unique and users evaluate an answer by comparing it with other candidate answers, a pairwise ranking approach is more suitable, i.e., converting answer ranking into a series of binary classification problems that predict the ranking relationship between any two candidate answers under the target question. Compared with pointwise methods that predict the relevance between a question and a single answer, pairwise methods require labeling the ranking relationship between every two candidate answers, so the training set grows and labeling becomes harder.
To reduce labeling cost, much existing work on CQA answer ranking at home and abroad adopts pointwise methods that model the question and each candidate answer directly and predict question-answer relevance, ignoring the ranking relationships among answers; moreover, when representing CQA question-answer data, it considers neither the obvious semantic gap between question and answer texts nor the interference that the long-tail distribution of community data brings to the research.
Summary of the Invention
The purpose of the present invention is to address the problems caused by the semantic gap between question and answer texts and by the long-tail distribution of community data in prior-art CQA website answer ranking, by providing a community question-answering website answer ranking method and system combined with active learning that reduces interference during answer ranking and lowers both the difficulty of text modeling and the cost of sample labeling.
To achieve the above purpose, the present invention adopts the following technical solutions:
A community question-answering website answer ranking method combined with active learning, comprising the following steps:
S1, question-answer data representation and modeling: first, extract the text features of the question-answer data and represent the question titles and answer contents, after word segmentation and stop-word removal, as word vector matrices; then compute the community features of the question-answer data from the related community data, map the total number of approvals under a question and the user's number of answers into (0,1) through a question long-tail factor and a user long-tail factor, and replace the original community features with the community features multiplied by the two long-tail factors; finally, input the question-answer data community features into the QQA-CNN model to represent the question-answer data as distributed vectors;
S2, training set construction combined with active learning and prediction of the ranking relationship between candidate question-answer pairs: first, perform statistical analysis on the question-answer data set, formalize the statistical results into rules, and automatically construct a preliminary labeled training set based on the rules; then build an answer ranking model based on the QQA-CNN model and predict the ranking relationship between any two candidate answers; finally, construct an unlabeled training set, select additional samples from it for manual labeling in combination with active learning, merge the labeling results into the preliminary labeled training set, retrain the answer ranking model, and use the retrained model to rank answers on the community question-answering website.
Preferably, in step S1, word segmentation and stop-word removal are first performed on the title of the target question, the content of each candidate answer and the title of the original question of the candidate answer in the question-answer data, and word2vec is then used to represent each text as a word vector matrix.
Preferably, the question-answer related community data in step S1 comprises the number of answers under a question, the answer approval share, the user's number of answers, the user's average number of approvals, the user's average number of likes, the average number of times the user's answers are favorited, and the user's number of followers.
The number of answers under a question is the total number of answers under the question; the user's number of answers is the total number of answers the user has provided on the website; the user's number of followers is the total number of times the user has been followed. The answer approval share is the proportion of the approvals an answer receives in the total approvals received by all answers under the question, computed as follows:
The user's average number of approvals, average number of likes and average number of times the user's answers are favorited refer respectively to the average number of approvals, likes and favorites per answer of the user, computed as follows:
where uac_i denotes the number of answers of user u_i; uvc_i denotes the total number of approvals received by all answers of user u_i, i.e., the user approval count; ula_i denotes the average number of likes of user u_i; ulc_i denotes the total number of likes received by all answers of user u_i, i.e., the user like count.
Preferably, the question long-tail factor and the user long-tail factor in step S1 are computed as follows:
where m_i is the total number of answers under the question; ω_q = 0.1 and φ_q = 0.6 are the question long-tail factor parameters; uac_i denotes the number of answers of user u_i; ω_u = 0.1 and φ_u = 1 are the user long-tail factor parameters.
Preferably, the structure of the QQA-CNN model in step S1 comprises a deep network for the target question, a deep network for the candidate answer, and attention mechanism modules and a feature connection layer between the two deep networks; the deep network for the target question contains two convolutional layers and two pooling layers, the deep network for the candidate answer contains three convolutional layers and three pooling layers, and the QQA-CNN model introduces two attention mechanism modules, one before the two deep networks and one between two pooling layers; finally, in the feature connection layer, four parts are concatenated: the learned high-level semantic features of the target question, the high-level semantic features of the candidate answer, the community features, and the similarity feature between the target question and the original question of the candidate answer, yielding the representation of the CQA website question-answer data.
Preferably, in the convolutional layers, the QQA-CNN model uses wide convolution to extract the semantic features of several consecutive words; in the pooling layers, QQA-CNN uses two pooling strategies: for the intermediate pooling layers the model performs partial pooling, i.e., average pooling over the features within a window of fixed length, and for the last pooling layer in the network it performs full pooling, i.e., average pooling of the convolution results over the sentence-length dimension. The attention mechanism module computes attention weights from the feature maps output by the convolutional layers of the two deep networks and applies the result to weighted pooling in the pooling layers. For the feature maps obtained from the target question and candidate answer text features through the convolutional layers, the attention matrix A is computed as follows:
where |·| denotes the Euclidean distance; in the attention matrix A, summing the elements over each row and each column gives the weight of each word.
The feature connection layer merges features, including the high-level semantic features of the target question text, the high-level semantic features of the candidate answer text, the related community features of the question-answer data, and the cosine similarity between the text feature matrices of the target question and the original question of the candidate answer; finally, the QQA-CNN model represents the question-answer data as distributed vectors.
Preferably, after statistical analysis of the question-answer data set in step S2, three rules are obtained and formalized: first, on a CQA website, under the same question, the best answer ranks higher than non-best answers; second, on a CQA website, under the same question, there is no ranking distinction among non-best answers; third, on a CQA website, under the target question, answers to questions in the same domain as the target question rank higher than answers to questions in a different domain.
The three rules are formalized as follows:
Based on the three formalized rules, a program is designed to automatically construct the labeled training set L.
Preferably, in step S2 the answer ranking model is built from two parameter-sharing QQA-CNN models and a fully connected layer, and its input comprises the target question and the text features and community features related to two candidate question-answer pairs.
First, the model combines the input target question with each of the two candidate question-answer pairs to form two question-answer triples, and feeds the text and community features of each triple into one of the two parameter-sharing QQA-CNN models to obtain the feature representations of the two triples.
Then, the triple feature representations learned by the QQA-CNN models are fed into the fully connected layer, the relevance score between the target question and each candidate question-answer pair is obtained through a nonlinear mapping, and the final ranking label is output according to the relative magnitude of the two relevance scores; when the output is 1, the first candidate question-answer pair ranks higher than the second in the final ranking, and when the output is -1, the opposite holds.
The loss function of the answer ranking model consists of a hinge loss, a parameter regularization term and a penalty term, as follows:
where t_i and t'_i denote the feature sets of question-answer triples with ranking labels 1 and -1; u_j and u'_j denote the feature sets of question-answer triples with ranking label 0; F(t_i) denotes the relevance score obtained by feeding the QQA-CNN representation of t_i into the fully connected layer; y_i denotes the expected ranking label of the candidate question-answer pair; Φ denotes all parameters of the answer ranking model, including those of the QQA-CNN models and the fully connected layer; λ and μ are hyperparameters of the answer ranking algorithm, λ = 0.05, μ = 0.01.
Preferably, in step S2 the unlabeled training set is constructed according to the actual research goal: for a target question, the k-NN algorithm, implemented on the open-source graph computing framework GraphLab, retrieves several similar questions from the data set; then the candidate question-answer pair set of the target question is built from the similar questions and all answers under them; finally, two candidate question-answer pairs are selected each time without repetition from the candidate question-answer pair set of the target question and combined with the target question into two triples, and the resulting triple pair is one sample of the unlabeled training set. In addition to automatically constructing the labeled training set, active learning is applied to the answer ranking algorithm: according to a query function, the unlabeled samples most conducive to improving the performance of the answer ranking model are selected from the unlabeled training set, labeled, and used to train the model. The query function first measures, based on information entropy, the gap between the relevance scores of the two candidate question-answer pairs; the smaller the gap, the larger the information entropy and the greater the uncertainty of the model's prediction. The specific calculation formula is as follows:
The query function also takes the similarity between candidate answers into account when selecting samples; the final query function is:
q(TT'_i) = e(TT'_i) + β·sim(a_i, a'_i)
where a_i and a'_i denote the text feature matrices of the two candidate answers; sim denotes cosine similarity; the parameter β governs the influence of candidate answer similarity on the final query score, β = 0.1.
The sum of the labeling scores of all samples sharing the same target question is taken as the labeling score of that target question, computed as follows:
where m_i denotes the number of question-answer triple pairs under the target question query_i.
The present invention also provides a community question-answering website answer ranking system combined with active learning, comprising:
a question-answer data representation module, configured to extract the text features of the question-answer data and represent the question titles and answer contents, after word segmentation and stop-word removal, as word vector matrices; compute the community features of the question-answer data from the related community data, map the total approvals under a question and the user's number of answers into (0,1) through the question long-tail factor and the user long-tail factor, and replace the original community features with the obtained community features multiplied by the two long-tail factors; and represent the question-answer data as distributed vectors by inputting the community features into the QQA-CNN model;
a training set construction and answer ranking module, configured to perform statistical analysis on the question-answer data set, formalize the statistical results into rules, automatically construct a preliminary labeled training set based on the rules, build an answer ranking model based on the QQA-CNN model to predict the ranking relationship between any two candidate answers, construct an unlabeled training set, select additional samples from it for manual labeling in combination with active learning, merge the labeling results into the preliminary labeled training set, retrain the answer ranking model, and use the retrained answer ranking model to rank answers on the community question-answering website.
Compared with the prior art, the present invention has the following beneficial effects. First, the question-answer data of the CQA website is represented and modeled: long-tail factors resolve the interference that the long-tail distribution of community data brings to answer ranking, and an attention mechanism is introduced into the convolutional neural network to alleviate the semantic gap between question and answer texts. Then, active learning is combined with answer ranking: in addition to the labeled training set automatically constructed from rules, an unlabeled training set is built, additional samples are selected from it for labeling, and the answer ranking model is retrained after merging the labeling results, thereby trading the lowest possible labeling cost for the highest possible model performance. The present invention can rank the candidate answers under a target question on a CQA website in a unified way.
FIG. 1 is a schematic structural diagram of the QQA-CNN model of the present invention;
FIG. 2 is a schematic structural diagram of the answer ranking model of the present invention.
The present invention is described in further detail below with reference to the accompanying drawings.
The community question-answering website answer ranking method combined with active learning of the present invention can be divided into the following two processes:
(1) Question-answer data representation and modeling, comprising 3 steps.
Step 1: First, perform word segmentation and stop-word removal on the title of the target question, the content of each candidate answer and the title of the original question of the candidate answer in the question-answer data, and then use word2vec to represent each text as a word vector matrix.
Step 2: Extract the number of answers under a question, the answer approval share, the user's number of answers, the user's average number of approvals, the user's average number of likes, the average number of times the user's answers are favorited, and the user's number of followers as the community features of the question-answer data.
The number of answers under a question is the total number of answers under the question; the user's number of answers is the total number of answers the user has provided on the website; the user's number of followers is the total number of times the user has been followed.
The answer approval share is the proportion of the approvals an answer receives in the total approvals received by all answers under the question, computed as follows:
The user's average number of approvals, average number of likes and average number of times the user's answers are favorited refer respectively to the average number of approvals, likes and favorites per answer of the user, computed as follows:
where uac_i denotes the number of answers of user u_i; uvc_i denotes the total number of approvals received by all answers of user u_i, i.e., the user approval count; ula_i denotes the average number of likes of user u_i; ulc_i denotes the total number of likes received by all answers of user u_i, i.e., the user like count.
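As a minimal illustration of the community feature computations above (the function and parameter names are hypothetical, not taken from the patent), the answer approval share and the per-answer user averages can be sketched as:

```python
def answer_approval_share(answer_votes, all_votes_under_question):
    """Share of an answer's approvals in the total approvals received
    by all answers under its question."""
    total = sum(all_votes_under_question)
    return answer_votes / total if total > 0 else 0.0

def user_averages(uac_i, uvc_i, ulc_i, ufc_i):
    """Average approvals, likes and favorites per answer for user u_i.

    uac_i: number of answers by the user
    uvc_i: total approvals over all the user's answers
    ulc_i: total likes over all the user's answers
    ufc_i: total favorites over all the user's answers
    """
    if uac_i == 0:
        return 0.0, 0.0, 0.0
    return uvc_i / uac_i, ulc_i / uac_i, ufc_i / uac_i
```

These ratios are exactly the quantities that become unreliable when the denominators are tiny, which motivates the long-tail factors introduced next.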
Considering that on a community question-answering website the number of answers per user and the total number of approvals under a question follow long-tail distributions (the total approvals under most questions are very small, and most users have very few answers), a question long-tail factor and a user long-tail factor are proposed to reflect the differences in the calculation bases of community features such as the answer approval share of answers under different questions and the average number of approvals of different users, namely the total approvals under a question and the user's number of answers. They are computed as follows:
where qv_i denotes the sum of the approvals received by all answers under question q_i; m_i is the total number of answers under the question; ω_q = 0.1 and φ_q = 0.6 are the question long-tail factor parameters; uac_i denotes the number of answers of user u_i; ω_u = 0.1 and φ_u = 1 are the user long-tail factor parameters.
The question long-tail factor and the user long-tail factor map the total approvals under a question and the user's number of answers into (0,1); replacing each community feature with its product with the corresponding long-tail factor balances the influence of the long-tail data distribution on the research.
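The exact long-tail factor formulas appear only as images in the source and are not reproduced in the text. As a loudly hedged sketch, one plausible form that maps a raw count into (0,1) using the stated parameters (ω_q = 0.1, φ_q = 0.6; ω_u = 0.1, φ_u = 1) is a logistic function; the specific expression below is an illustrative assumption, not the patent's formula:

```python
import math

def long_tail_factor(x, omega, phi):
    """Map a raw community count x into (0, 1).

    ASSUMED logistic form: the patent's exact formula is not available
    in the extracted text; only the parameter values are.
    """
    return 1.0 / (1.0 + math.exp(-omega * (x - phi)))

def rescale_feature(feature, raw_count, omega=0.1, phi=0.6):
    """Replace a community feature by its product with the factor,
    as the description prescribes."""
    return feature * long_tail_factor(raw_count, omega, phi)
```

The key property, which any candidate formula must share, is that the factor stays strictly inside (0,1) and grows monotonically with the count, so features backed by little community data are shrunk.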
Step 3: Input the text features of the target question, the candidate answer and the original question of the candidate answer, together with the related community features of the question-answer data, into the QQA-CNN model to obtain the distributed representation of the question-answer data.
The structure of the QQA-CNN model comprises a deep network for the target question, a deep network for the candidate answer, and attention mechanism modules and a feature connection layer between the two deep networks. The deep network for the target question contains two convolutional layers and two pooling layers, and the deep network for the candidate answer contains three convolutional layers and three pooling layers; QQA-CNN introduces two attention mechanism modules, one before the two deep networks and one between two pooling layers, and finally concatenates four parts in the connection layer: the learned high-level semantic features of the target question, the high-level semantic features of the candidate answer, the community features, and the similarity feature between the target question and the original question of the candidate answer, yielding the representation of the CQA website question-answer data.
In the convolutional layers, the QQA-CNN model uses wide convolution to extract the semantic features of several consecutive words. In the pooling layers, the QQA-CNN model uses two pooling strategies: for the intermediate pooling layers it performs partial pooling, i.e., average pooling over the features within a window of fixed length; for the last pooling layer in the network it performs full pooling, i.e., average pooling of the convolution results over the sentence-length dimension. The attention mechanism module computes attention weights from the feature maps output by the convolutional layers of the two deep networks and applies the result to weighted pooling in the pooling layers. For the feature maps obtained from the target question and candidate answer text features through the convolutional layers, the attention matrix A is computed as follows:
where |·| denotes the Euclidean distance; in the attention matrix A, summing the elements over each row and each column gives the weight of each word.
QQA-CNN adds a connection layer after the two deep neural networks to merge features, including the high-level semantic features of the target question text, the high-level semantic features of the candidate answer text, the related community features of the question-answer data, and the cosine similarity between the text feature matrices of the target question and the original question of the candidate answer. Finally, the QQA-CNN model represents the question-answer data as distributed vectors.
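The attention matrix expression itself is lost to image extraction; the sketch below assumes an ABCNN-style form A[i, j] = 1/(1 + Euclidean distance between feature columns), which is consistent with the Euclidean distance and row/column-sum weighting described, with `Fq` and `Fa` as hypothetical names for the two feature maps:

```python
import numpy as np

def attention_weights(Fq, Fa):
    """Compute attention weights for weighted pooling.

    Fq: (d, Lq) question feature map; Fa: (d, La) answer feature map
    (column i is the feature vector of word i). The 1/(1 + distance)
    form is an assumption borrowed from ABCNN-style attention; summing
    A over rows/columns yields one weight per word, as described.
    """
    # Pairwise Euclidean distances between columns of Fq and Fa.
    diff = Fq[:, :, None] - Fa[:, None, :]   # shape (d, Lq, La)
    dist = np.sqrt((diff ** 2).sum(axis=0))  # shape (Lq, La)
    A = 1.0 / (1.0 + dist)
    q_weights = A.sum(axis=1)  # row sums: weight of each question word
    a_weights = A.sum(axis=0)  # column sums: weight of each answer word
    return A, q_weights, a_weights
```

The per-word weights would then scale the corresponding columns before the average-pooling steps of the two networks.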
(2) Training set construction combined with active learning and prediction of the ranking relationships between candidate question-answer pairs, comprising 3 steps.
Step 1: Perform statistical analysis on the question-answer data set of the community question-answering website and formalize the results into rules so as to automatically construct the labeled training set.
After statistical analysis of the question-answer data set, three rules are obtained and formalized. First, on a CQA website, under the same question, the best answer usually ranks higher than non-best answers. Second, on a CQA website, under the same question, there is no obvious ranking distinction among non-best answers. Third, on a CQA website, under the target question, answers under questions in the same domain as the target question rank higher than answers under questions in a different domain.
The three rules are formalized as follows:
Based on the three formalized rules, a program is designed to automatically construct the labeled training set L.
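The three rules above can be sketched as a labeling routine (the dictionary keys are hypothetical placeholders; the patent's symbolic formalization is not reproduced in the extracted text):

```python
def rule_label(pair_a, pair_b, target_domain):
    """Label a pair of candidate question-answer pairs under one target
    question: 1 if pair_a should rank higher, -1 if pair_b should,
    0 if the rules impose no order, None if no rule applies.
    """
    # Rules 1 and 2 apply to answers under the same original question.
    if pair_a['question_id'] == pair_b['question_id']:
        if pair_a['is_best'] and not pair_b['is_best']:
            return 1
        if pair_b['is_best'] and not pair_a['is_best']:
            return -1
        return 0  # non-best answers under the same question are unordered
    # Rule 3: same-domain answers rank above different-domain answers.
    a_same = pair_a['domain'] == target_domain
    b_same = pair_b['domain'] == target_domain
    if a_same and not b_same:
        return 1
    if b_same and not a_same:
        return -1
    return None  # no rule applies; left for the model / active learning
```

Pairs labeled `None` are exactly the cases the rules cannot decide, which is where the active-learning query of step 3 becomes useful.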
Step 2: Build an answer ranking model based on the QQA-CNN model and train it to predict the ranking relationship between any two candidate answers.
The answer ranking model is built from two parameter-sharing QQA-CNN models and a fully connected layer; its input comprises the target question and the text features and community features related to two candidate question-answer pairs. First, the model combines the input target question with each of the two candidate question-answer pairs to form two question-answer triples, and feeds the text and community features of each triple into one of the two parameter-sharing QQA-CNN models to obtain the feature representations of the two triples; then the triple feature representations learned by the QQA-CNN models are fed into the fully connected layer, the relevance score between the target question and each candidate question-answer pair is obtained through a nonlinear mapping, and the final ranking label is output according to the relative magnitude of the two relevance scores. When the output is 1, the first candidate question-answer pair should rank higher than the second in the final ranking; when the output is -1, the opposite holds.
The loss function of the answer ranking model consists of a hinge loss, a parameter regularization term and a penalty term, as follows:
where t_i and t'_i denote the feature sets of question-answer triples with ranking labels 1 and -1; u_j and u'_j denote the feature sets of question-answer triples with ranking label 0; F(t_i) denotes the relevance score obtained by feeding the QQA-CNN representation of t_i into the fully connected layer; y_i denotes the expected ranking label of the candidate question-answer pair; Φ denotes all parameters of the answer ranking model, including those of the QQA-CNN networks and the fully connected layer; λ and μ are hyperparameters of the answer ranking algorithm, λ = 0.05, μ = 0.01.
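Since the loss formula itself survives only as its verbal description, the following sketch assumes a standard pairwise hinge term over label-1/-1 triple pairs plus L2 regularization; the squared-difference penalty pulling together the scores of label-0 pairs is an illustrative guess at the penalty term:

```python
import numpy as np

def ranking_loss(scores_pos, scores_neg, labels,
                 scores_tie_a, scores_tie_b, params,
                 lam=0.05, mu=0.01):
    """Assumed form of the pairwise loss.

    scores_pos/scores_neg: arrays of F(t_i), F(t'_i) for triple pairs
        labeled 1 or -1; labels: expected labels y_i in {1, -1};
    scores_tie_a/scores_tie_b: F(u_j), F(u'_j) for label-0 pairs;
    params: list of model parameter arrays (QQA-CNN + dense layer).
    """
    # Hinge term: penalize when the score gap disagrees with y_i.
    hinge = np.maximum(0.0, 1.0 - labels * (scores_pos - scores_neg)).sum()
    # L2 regularization over all parameters (lambda = 0.05).
    reg = lam * sum((p ** 2).sum() for p in params)
    # Penalty pushing tied triples toward equal scores (mu = 0.01);
    # the squared difference is an assumption.
    penalty = mu * ((scores_tie_a - scores_tie_b) ** 2).sum()
    return hinge + reg + penalty
```

The output label of the model then follows from the sign of the score gap: 1 when the first pair's score is higher, -1 otherwise.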
Step 3: Construct an unlabeled sample set, select additional samples from it for manual labeling in combination with active learning, and merge them into the labeled training set to further train the answer ranking model.
The unlabeled training set U is constructed according to the actual research goal: for a target question, the k-NN algorithm, implemented on the open-source graph computing framework GraphLab, retrieves several similar questions from the data set; then the candidate question-answer pair set of the target question is built from the similar questions and all answers under them; finally, two candidate question-answer pairs are selected each time without repetition from the candidate question-answer pair set of the target question and combined with the target question into two triples, and the resulting triple pair is one sample of the unlabeled training set.
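A stand-in for the similar-question retrieval step can be sketched as follows. The patent runs k-NN on the GraphLab framework; this sketch substitutes plain cosine-similarity k-NN over question vectors and uses hypothetical data structures:

```python
import numpy as np

def knn_similar_questions(target_vec, question_vecs, k=5):
    """Indices of the k questions most cosine-similar to the target
    (a simple replacement for the GraphLab k-NN used in the patent)."""
    q = target_vec / (np.linalg.norm(target_vec) + 1e-12)
    M = question_vecs / (np.linalg.norm(question_vecs, axis=1,
                                        keepdims=True) + 1e-12)
    sims = M @ q
    return np.argsort(-sims)[:k]

def build_candidate_pairs(similar_qids, answers_by_question):
    """Candidate question-answer pair set: every (similar question,
    answer under that question) combination."""
    return [(qid, ans) for qid in similar_qids
            for ans in answers_by_question.get(qid, [])]
```

Unlabeled samples are then formed by drawing, without repetition, two pairs at a time from this candidate set and attaching the target question to each.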
To reduce the labeling cost of the training set, in addition to automatically constructing the labeled training set, active learning is applied to the answer ranking algorithm: according to a query function, the unlabeled samples most conducive to improving the performance of the answer ranking model are selected from the unlabeled training set, labeled, and used to train the model.
The query function first measures, based on information entropy, the gap between the relevance scores of the two candidate question-answer pairs: the smaller the gap, the larger the information entropy and the greater the uncertainty of the model's prediction. The specific calculation formula is as follows:
In addition, considering that the high-quality answers of similar questions on a community question-answering website tend to resemble one another, the query function also takes the similarity between candidate answers into account when selecting samples. The final query function is:
q(TT'_i) = e(TT'_i) + β·sim(a_i, a'_i)    (14)
where a_i and a'_i denote the text feature matrices of the two candidate answers; sim denotes cosine similarity; the parameter β governs the influence of candidate answer similarity on the final query score, β = 0.1.
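The entropy term's exact formula is elided in the source. The sketch below assumes a probability derived from the score gap via a logistic function, whose binary entropy is maximal when the two relevance scores are equal, matching the verbal description, and then applies the stated final query function and per-question aggregation:

```python
import math

def entropy_term(score_a, score_b):
    """Model uncertainty on a triple pair (assumed form: logistic
    probability of the score gap, then binary entropy; largest when
    the two relevance scores are equal)."""
    p = 1.0 / (1.0 + math.exp(-(score_a - score_b)))
    eps = 1e-12  # guard against log(0)
    return -(p * math.log(p + eps) + (1 - p) * math.log(1 - p + eps))

def query_score(score_a, score_b, sim_answers, beta=0.1):
    """q(TT'_i) = e(TT'_i) + beta * sim(a_i, a'_i), with beta = 0.1."""
    return entropy_term(score_a, score_b) + beta * sim_answers

def question_label_score(sample_scores):
    """Labeling score of a target question: sum of the query scores of
    all triple-pair samples sharing that target question."""
    return sum(sample_scores)
```

Samples (or target questions) with the highest scores are handed to the annotator first, so labeling effort concentrates where the model is least certain.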
The sum of the labeling scores of all samples sharing the same target question is taken as the labeling score of that target question, computed as follows:
where m_i denotes the number of question-answer triple pairs under the target question query_i.
The present invention also provides a community question-answering website answer ranking system combined with active learning, comprising:
a question-answer data representation module, configured to extract the text features of the question-answer data and represent the question titles and answer contents, after word segmentation and stop-word removal, as word vector matrices; compute the community features of the question-answer data from the related community data, map the total approvals under a question and the user's number of answers into (0,1) through the question long-tail factor and the user long-tail factor, and replace the original community features with the obtained community features multiplied by the two long-tail factors; and represent the question-answer data as distributed vectors by inputting the community features into the QQA-CNN model;
a training set construction and answer ranking module, configured to perform statistical analysis on the question-answer data set, formalize the statistical results into rules, automatically construct a preliminary labeled training set based on the rules, build an answer ranking model based on the QQA-CNN model to predict the ranking relationship between any two candidate answers, construct an unlabeled training set, select additional samples from it for manual labeling in combination with active learning, merge the labeling results into the preliminary labeled training set, retrain the answer ranking model, and use the retrained answer ranking model to rank answers on the community question-answering website.
The above are merely preferred embodiments of the present invention and are not intended to limit its technical solutions in any way. Those skilled in the art should understand that, without departing from the spirit and principle of the present invention, the technical solutions may still undergo a number of simple modifications and substitutions, and such modifications and substitutions also fall within the scope of protection covered by the claims.
Claims (10)
- A community question-answering website answer ranking method combined with active learning, characterized by comprising the following steps: S1, question-answer data representation and modeling: first, extract the text features of the question-answer data and represent the question titles and answer contents, after word segmentation and stop-word removal, as word vector matrices; then compute the community features of the question-answer data from the related community data, map the total number of approvals under a question and the user's number of answers into (0,1) through a question long-tail factor and a user long-tail factor, and replace the original community features with the community features multiplied by the two long-tail factors; finally, input the question-answer data community features into the QQA-CNN model to represent the question-answer data as distributed vectors; S2, training set construction combined with active learning and prediction of the ranking relationship between candidate question-answer pairs: first, perform statistical analysis on the question-answer data set, formalize the statistical results into rules, and automatically construct a preliminary labeled training set based on the rules; then build an answer ranking model based on the QQA-CNN model and predict the ranking relationship between any two candidate answers; finally, construct an unlabeled training set, select additional samples from it for manual labeling in combination with active learning, merge the labeling results into the preliminary labeled training set, retrain the answer ranking model, and use the retrained model to rank answers on the community question-answering website.
- The community question-answering website answer ranking method combined with active learning according to claim 1, characterized in that in step S1, word segmentation and stop-word removal are first performed on the title of the target question, the content of each candidate answer and the title of the original question of the candidate answer in the question-answer data, and word2vec is then used to represent each text as a word vector matrix.
- The community question-answering website answer ranking method combined with active learning according to claim 1, characterized in that the question-answer related community data in step S1 comprises the number of answers under a question, the answer approval share, the user's number of answers, the user's average number of approvals, the user's average number of likes, the average number of times the user's answers are favorited, and the user's number of followers; the number of answers under a question is the total number of answers under the question, the user's number of answers is the total number of answers the user has provided on the website, and the user's number of followers is the total number of times the user has been followed; the answer approval share is the proportion of the approvals an answer receives in the total approvals received by all answers under the question, computed as follows; the user's average number of approvals, average number of likes and average number of times the user's answers are favorited refer respectively to the average number of approvals, likes and favorites per answer of the user, computed as follows, where uac_i denotes the number of answers of user u_i, uvc_i denotes the total approvals received by all answers of user u_i, i.e., the user approval count, ula_i denotes the average number of likes of user u_i, and ulc_i denotes the total likes received by all answers of user u_i, i.e., the user like count.
- The community question-answering website answer ranking method combined with active learning according to claim 1, characterized in that the structure of the QQA-CNN model in step S1 comprises a deep network for the target question, a deep network for the candidate answer, and attention mechanism modules and a feature connection layer between the two deep networks; the deep network for the target question contains two convolutional layers and two pooling layers, the deep network for the candidate answer contains three convolutional layers and three pooling layers, and the QQA-CNN model introduces two attention mechanism modules, one before the two deep networks and one between two pooling layers; finally, in the feature connection layer, the learned high-level semantic features of the target question, the high-level semantic features of the candidate answer, the community features, and the similarity feature between the target question and the original question of the candidate answer are concatenated to obtain the representation of the CQA website question-answer data.
- The community question-answering website answer ranking method combined with active learning according to claim 5, characterized in that in the convolutional layers the QQA-CNN model uses wide convolution to extract the semantic features of several consecutive words; in the pooling layers the QQA-CNN model uses two pooling strategies: for the intermediate pooling layers it performs partial pooling, i.e., average pooling over the features within a window of fixed length, and for the last pooling layer in the network it performs full pooling, i.e., average pooling of the convolution results over the sentence-length dimension; the attention mechanism module computes attention weights from the feature maps output by the convolutional layers of the two deep networks and applies the result to weighted pooling in the pooling layers; for the feature maps of the target question and candidate answer text features obtained through the convolutional layers, the attention matrix A is computed as follows, where |·| denotes the Euclidean distance, and in the attention matrix A the sums of the elements over each row and each column give the word weights; the feature connection layer merges the high-level semantic features of the target question text, the high-level semantic features of the candidate answer text, the related community features of the question-answer data, and the cosine similarity between the text feature matrices of the target question and the original question of the candidate answer, and the QQA-CNN model finally represents the question-answer data as distributed vectors.
- The community question-answering website answer ranking method combined with active learning according to claim 1, characterized in that after statistical analysis of the question-answer data set in step S2, three rules are obtained and formalized: first, on a CQA website, under the same question, the best answer ranks higher than non-best answers; second, on a CQA website, under the same question, there is no ranking distinction among non-best answers; third, on a CQA website, under the target question, answers to questions in the same domain as the target question rank higher than answers to questions in a different domain; the three rules are formalized as follows; based on the three formalized rules, a program is designed to automatically construct the labeled training set L.
- The community question-answering website answer ranking method combined with active learning according to claim 1, characterized in that in step S2 the answer ranking model is built from two parameter-sharing QQA-CNN models and a fully connected layer, and its input comprises the target question and the text features and community features related to two candidate question-answer pairs; first, the model combines the input target question with each of the two candidate question-answer pairs to form two question-answer triples and feeds the text and community features of each triple into one of the two parameter-sharing QQA-CNN models to obtain the feature representations of the two triples; then the triple feature representations learned by the QQA-CNN models are fed into the fully connected layer, the relevance score between the target question and each candidate question-answer pair is obtained through a nonlinear mapping, and the final ranking label is output according to the relative magnitude of the two relevance scores; when the output is 1, the first candidate question-answer pair ranks higher than the second in the final ranking, and when the output is -1, the opposite holds; the loss function of the answer ranking model consists of a hinge loss, a parameter regularization term and a penalty term, where t_i and t'_i denote the feature sets of question-answer triples with ranking labels 1 and -1, u_j and u'_j denote the feature sets of question-answer triples with ranking label 0, F(t_i) denotes the relevance score obtained by feeding the QQA-CNN representation of t_i into the fully connected layer, y_i denotes the expected ranking label of the candidate question-answer pair, Φ denotes all parameters of the answer ranking model including those of the QQA-CNN models and the fully connected layer, and λ and μ are hyperparameters of the answer ranking algorithm, λ = 0.05, μ = 0.01.
- The community question-answering website answer ranking method combined with active learning according to claim 1, characterized in that in step S2 the unlabeled training set is constructed according to the actual research goal: for a target question, the k-NN algorithm, implemented on the open-source graph computing framework GraphLab, retrieves several similar questions from the data set; the candidate question-answer pair set of the target question is then built from the similar questions and all answers under them; finally, two candidate question-answer pairs are selected each time without repetition from the candidate question-answer pair set of the target question and combined with the target question into two triples, and the resulting triple pair is one sample of the unlabeled training set; in addition to automatically constructing the labeled training set, active learning is applied to the answer ranking algorithm: according to a query function, the unlabeled samples most conducive to improving the performance of the answer ranking model are selected from the unlabeled training set, labeled, and used to train the model; the query function first measures, based on information entropy, the gap between the relevance scores of the two candidate question-answer pairs, where the smaller the gap, the larger the information entropy and the greater the uncertainty of the model's prediction; the query function also takes the similarity between candidate answers into account when selecting samples, the final query function being q(TT'_i) = e(TT'_i) + β·sim(a_i, a'_i), where a_i and a'_i denote the text feature matrices of the two candidate answers, sim denotes cosine similarity, and the parameter β governs the influence of candidate answer similarity on the final query score, β = 0.1; the sum of the labeling scores of all samples sharing the same target question is taken as the labeling score of that target question, where m_i denotes the number of question-answer triple pairs under the target question query_i.
- A community question-answering website answer ranking system combined with active learning, characterized by comprising: a question-answer data representation module, configured to extract the text features of the question-answer data and represent the question titles and answer contents, after word segmentation and stop-word removal, as word vector matrices; compute the community features of the question-answer data from the related community data, map the total approvals under a question and the user's number of answers into (0,1) through the question long-tail factor and the user long-tail factor, and replace the original community features with the obtained community features multiplied by the two long-tail factors; and represent the question-answer data as distributed vectors by inputting the community features into the QQA-CNN model; and a training set construction and answer ranking module, configured to perform statistical analysis on the question-answer data set, formalize the statistical results into rules, automatically construct a preliminary labeled training set based on the rules, build an answer ranking model based on the QQA-CNN model to predict the ranking relationship between any two candidate answers, construct an unlabeled training set, select additional samples from it for manual labeling in combination with active learning, merge the labeling results into the preliminary labeled training set, retrain the answer ranking model, and use the retrained answer ranking model to rank answers on the community question-answering website.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/955,584 US11874862B2 (en) | 2020-11-09 | 2022-09-29 | Community question-answer website answer sorting method and system combined with active learning |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011240697.1 | 2020-11-09 | ||
CN202011240697.1A CN112434517B (zh) | 2020-11-09 | 2020-11-09 | 一种结合主动学习的社区问答网站答案排序方法及系统 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/955,584 Continuation US11874862B2 (en) | 2020-11-09 | 2022-09-29 | Community question-answer website answer sorting method and system combined with active learning |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022095573A1 true WO2022095573A1 (zh) | 2022-05-12 |
Family
ID=74700021
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/116051 WO2022095573A1 (zh) | 2020-11-09 | 2021-09-01 | 一种结合主动学习的社区问答网站答案排序方法及系统 |
Country Status (3)
Country | Link |
---|---|
US (1) | US11874862B2 (zh) |
CN (1) | CN112434517B (zh) |
WO (1) | WO2022095573A1 (zh) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110377713A (zh) * | 2019-07-16 | 2019-10-25 | 杭州微洱网络科技有限公司 | 一种基于概率转移改善问答系统上下文的方法 |
CN115098664A (zh) * | 2022-08-24 | 2022-09-23 | 中关村科学城城市大脑股份有限公司 | 智能问答方法、装置、电子设备和计算机可读介质 |
CN116070884A (zh) * | 2023-03-30 | 2023-05-05 | 深圳奥雅设计股份有限公司 | 高密度城市社区和微气候监控与管理系统 |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112434517B (zh) | 2020-11-09 | 2023-08-04 | 西安交通大学 | 一种结合主动学习的社区问答网站答案排序方法及系统 |
US12111826B1 (en) * | 2023-03-31 | 2024-10-08 | Amazon Technologies, Inc. | Neural search for programming-related query answering |
CN116450796B (zh) * | 2023-05-17 | 2023-10-17 | 中国兵器工业计算机应用技术研究所 | 一种智能问答模型构建方法及设备 |
CN116701609B (zh) * | 2023-07-27 | 2023-09-29 | 四川邕合科技有限公司 | 基于深度学习的智能客服问答方法、系统、终端及介质 |
CN116953653B (zh) * | 2023-09-19 | 2023-12-26 | 成都远望科技有限责任公司 | 一种基于多波段天气雷达组网回波外推方法 |
CN118016314B (zh) * | 2024-04-08 | 2024-06-18 | 北京大学第三医院(北京大学第三临床医学院) | 一种医疗数据输入的优化方法、装置及电子设备 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180067922A1 (en) * | 2015-03-06 | 2018-03-08 | National Institute Of Information And Communications Technology | Entailment pair extension apparatus, computer program therefor and question-answering system |
CN110321421A (zh) * | 2019-07-04 | 2019-10-11 | 南京邮电大学 | 用于网站知识社区系统的专家推荐方法及计算机存储介质 |
CN112434517A (zh) * | 2020-11-09 | 2021-03-02 | 西安交通大学 | 一种结合主动学习的社区问答网站答案排序方法及系统 |
Family Cites Families (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9817897B1 (en) * | 2010-11-17 | 2017-11-14 | Intuit Inc. | Content-dependent processing of questions and answers |
US11914674B2 (en) * | 2011-09-24 | 2024-02-27 | Z Advanced Computing, Inc. | System and method for extremely efficient image and pattern recognition and artificial intelligence platform |
US9378647B2 (en) * | 2013-08-20 | 2016-06-28 | Chegg, Inc. | Automated course deconstruction into learning units in digital education platforms |
US11204929B2 (en) * | 2014-11-18 | 2021-12-21 | International Business Machines Corporation | Evidence aggregation across heterogeneous links for intelligence gathering using a question answering system |
US20170161364A1 (en) * | 2015-12-07 | 2017-06-08 | International Business Machines Corporation | Generating messages using keywords |
CN107992554A (zh) * | 2017-11-28 | 2018-05-04 | 北京百度网讯科技有限公司 | 提供问答信息的聚合结果的搜索方法和装置 |
US11055355B1 (en) * | 2018-06-25 | 2021-07-06 | Amazon Technologies, Inc. | Query paraphrasing |
CN109710741A (zh) * | 2018-12-27 | 2019-05-03 | 中山大学 | 一种面向在线问答平台的基于深度强化学习的问题标注方法 |
US11380305B2 (en) * | 2019-01-14 | 2022-07-05 | Accenture Global Solutions Limited | System and method for using a question and answer engine |
US20230036072A1 (en) * | 2019-06-24 | 2023-02-02 | Zeyu GAO | AI-Based Method and System for Testing Chatbots |
US11366855B2 (en) * | 2019-11-27 | 2022-06-21 | Amazon Technologies, Inc. | Systems, apparatuses, and methods for document querying |
US11210341B1 (en) * | 2019-12-09 | 2021-12-28 | A9.Com, Inc. | Weighted behavioral signal association graphing for search engines |
US12014284B2 (en) * | 2019-12-27 | 2024-06-18 | Industrial Technology Research Institute | Question-answering learning method and question-answering learning system using the same and computer program product thereof |
US11709873B2 (en) * | 2020-01-13 | 2023-07-25 | Adobe Inc. | Reader-retriever approach for question answering |
US20210240775A1 (en) * | 2020-02-03 | 2021-08-05 | Intuit Inc. | System and method for providing automated and unsupervised inline question answering |
US20210365837A1 (en) * | 2020-05-19 | 2021-11-25 | Arizona Board Of Regents On Behalf Of Arizona State University | Systems and methods for social structure construction of forums using interaction coherence |
US20210365500A1 (en) * | 2020-05-19 | 2021-11-25 | Miso Technologies Inc. | System and method for question-based content answering |
CN111738340B (zh) * | 2020-06-24 | 2022-05-20 | 西安交通大学 | 一种分布式K-means电力用户分类方法、存储介质及分类设备 |
US11321329B1 (en) * | 2020-06-24 | 2022-05-03 | Amazon Technologies, Inc. | Systems, apparatuses, and methods for document querying |
US20220391595A1 (en) * | 2021-06-02 | 2022-12-08 | Oracle International Corporation | User discussion environment interaction and curation via system-generated responses |
US20230023958A1 (en) * | 2021-07-23 | 2023-01-26 | International Business Machines Corporation | Online question answering, using reading comprehension with an ensemble of models |
US11654371B2 (en) * | 2021-07-30 | 2023-05-23 | Sony Interactive Entertainment LLC | Classification of gaming styles |
US20230186161A1 (en) * | 2021-12-14 | 2023-06-15 | Oracle International Corporation | Data manufacturing frameworks for synthesizing synthetic training data to facilitate training a natural language to logical form model |
US20230205824A1 (en) * | 2021-12-23 | 2023-06-29 | Pryon Incorporated | Contextual Clarification and Disambiguation for Question Answering Processes |
US11893070B2 (en) * | 2022-02-08 | 2024-02-06 | My Job Matcher, Inc. | Apparatus and methods for expanding contacts for a social networking platform |
- 2020-11-09: CN CN202011240697.1A patent/CN112434517B/zh (Active)
- 2021-09-01: WO PCT/CN2021/116051 patent/WO2022095573A1/zh (Application Filing)
- 2022-09-29: US US17/955,584 patent/US11874862B2/en (Active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180067922A1 (en) * | 2015-03-06 | 2018-03-08 | National Institute Of Information And Communications Technology | Entailment pair extension apparatus, computer program therefor and question-answering system |
CN110321421A (zh) * | 2019-07-04 | 2019-10-11 | 南京邮电大学 | 用于网站知识社区系统的专家推荐方法及计算机存储介质 |
CN112434517A (zh) * | 2020-11-09 | 2021-03-02 | 西安交通大学 | 一种结合主动学习的社区问答网站答案排序方法及系统 |
Non-Patent Citations (1)
Title |
---|
YA TIAN, MINGCHUN ZHENG, HONG QIAO: "Re-ranking model for implicit spam answers in CQA", APPLICATION RESEARCH OF COMPUTERS, Chengdu, CN, vol. 34, no. 8, 31 August 2017 (2017-08-31), pages 2315-2371, XP055928420, ISSN: 1001-3695, DOI: 10.3969/j.issn.1001-3695.2017.08.018 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110377713A (zh) * | 2019-07-16 | 2019-10-25 | 杭州微洱网络科技有限公司 | 一种基于概率转移改善问答系统上下文的方法 |
CN110377713B (zh) * | 2019-07-16 | 2023-09-15 | 广州探域科技有限公司 | 一种基于概率转移改善问答系统上下文的方法 |
CN115098664A (zh) * | 2022-08-24 | 2022-09-23 | 中关村科学城城市大脑股份有限公司 | 智能问答方法、装置、电子设备和计算机可读介质 |
CN115098664B (zh) * | 2022-08-24 | 2022-11-29 | 中关村科学城城市大脑股份有限公司 | 智能问答方法、装置、电子设备和计算机可读介质 |
CN116070884A (zh) * | 2023-03-30 | 2023-05-05 | 深圳奥雅设计股份有限公司 | 高密度城市社区和微气候监控与管理系统 |
CN116070884B (zh) * | 2023-03-30 | 2023-06-30 | 深圳奥雅设计股份有限公司 | 高密度城市社区和微气候监控与管理系统 |
Also Published As
Publication number | Publication date |
---|---|
US20230035338A1 (en) | 2023-02-02 |
CN112434517B (zh) | 2023-08-04 |
US11874862B2 (en) | 2024-01-16 |
CN112434517A (zh) | 2021-03-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022095573A1 (zh) | 一种结合主动学习的社区问答网站答案排序方法及系统 | |
CN105893523A (zh) | 利用答案相关性排序的评估度量来计算问题相似度的方法 | |
WO2023225858A1 (zh) | 一种基于常识推理的阅读型考题生成系统及方法 | |
Cai et al. | Intelligent question answering in restricted domains using deep learning and question pair matching | |
CN112307182B (zh) | 一种基于问答系统的伪相关反馈的扩展查询方法 | |
CN115795018B (zh) | 一种面向电网领域的多策略智能搜索问答方法及系统 | |
CN114429143A (zh) | 一种基于强化蒸馏的跨语言属性级情感分类方法 | |
CN113220864A (zh) | 智能问答数据处理系统 | |
Shanshan et al. | An improved hybrid ontology-based approach for online learning resource recommendations | |
CN111666374A (zh) | 一种在深度语言模型中融入额外知识信息的方法 | |
CN111581326B (zh) | 一种基于异构外部知识源图结构抽取答案信息的方法 | |
Avogadro et al. | Estimating Link Confidence for Human-in-the-loop Table Annotation | |
Li et al. | Approach of intelligence question-answering system based on physical fitness knowledge graph | |
Alwaneen et al. | Stacked dynamic memory-coattention network for answering why-questions in Arabic | |
CN116628146A (zh) | 一种金融领域的faq智能问答方法及系统 | |
Liang et al. | Knowledge extraction experiment based on tourism knowledge graph Q & A data set | |
Hu | Application of top-n rule-based optimal recommendation system for language education content based on parallel computing | |
Pang et al. | Query expansion and query fuzzy with large-scale click-through data for microblog retrieval | |
Zhang et al. | Research on answer selection based on LSTM | |
Wang et al. | Sentiment classification based on weak tagging information and imbalanced data | |
Liu et al. | Overview of Knowledge Reasoning for Knowledge Graph | |
Wang et al. | [Retracted] Construction of a Knowledge Map Based on Text CNN Algorithm for Maritime English Subjects | |
Sun | Design of intelligent question answering system for hospital online triage based on knowledge graph | |
Kumar et al. | Building conversational Question Answer Machine and comparison of BERT and its different variants | |
Zhang | Accuracy Recommendation Algorithm of Preschool Education Distance Teaching Course Based on Improved K-Means |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21888272 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21888272 Country of ref document: EP Kind code of ref document: A1 |