CN111581382A - Method and system for predicting hot questions in question-and-answer community - Google Patents

Publication number
CN111581382A
Authority
CN
China
Prior art keywords
question
questions
attribute data
sample
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010357802.3A
Other languages
Chinese (zh)
Other versions
CN111581382B (en)
Inventor
张莉
赵丽娴
蒋竞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202010357802.3A priority Critical patent/CN111581382B/en
Publication of CN111581382A publication Critical patent/CN111581382A/en
Application granted granted Critical
Publication of CN111581382B publication Critical patent/CN111581382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9536Search customisation based on social or collaborative filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a method for predicting hot questions in a question-and-answer community. It belongs to the field of hot content prediction and addresses the problem that prior-art methods use little relevant information, which leads to a low recognition rate and low recognition accuracy for hot questions. The prediction method comprises the following steps: obtaining a plurality of sample question data from a question-and-answer community, wherein the plurality of sample question data comprises user attribute data of the question, text attribute data of the question, metadata attribute data of the question, and time attribute data of the question; converting the plurality of sample question data into an input matrix; constructing a fully convolutional neural network according to the input matrix, and training the fully convolutional neural network to obtain a fully convolutional neural network prediction model; and predicting a question to be identified by using the fully convolutional neural network prediction model so as to determine the category of the question to be identified, wherein the category is either hot question or cold question. Using comprehensive information improves the recognition rate and recognition accuracy for hot questions.

Description

Method and system for predicting hot questions in question-and-answer community
Technical Field
The invention relates to the field of hot content prediction, and in particular to a method and a system for predicting hot questions in a software question-and-answer community.
Background
Software engineers often use online tools (called software information sites) to solve problems they encounter during software development and maintenance. Among these tools, software question-and-answer communities, where users can ask and answer programming-related questions, are very popular. Stack Overflow is a representative software question-and-answer community. It has been in use since 2008 and currently has more than 10 million users. A large number of questions are posted in these communities every day, so a new question is quickly buried by other newly posted questions. This makes it harder for questions to be answered and harder for users to find suitable questions to follow and contribute to. Over time, some questions become popular and receive high ratings. Posts that attract more user attention show relatively higher user activity, such as browsing volume and number of answers. Finding these posts makes it possible to learn which questions are of wide interest to users. Identifying, early after posting, a question that is going to become hot may reduce the time needed to obtain an answer and increase the likelihood of obtaining one, thereby solving a problem that many users may be facing. Therefore, it is necessary to study methods for predicting hot questions in software question-and-answer communities.
Existing methods for hot content prediction are generally based on social platforms (e.g., Weibo, Twitter, Facebook) and accomplish the prediction task in one of three ways: modeling based on temporal processes, modeling based on text features, and modeling based on social network structure. The time-based modeling method constructs a prediction model from information such as the browsing volume, the number of reposts and the number of likes, and distinguishes hot content from cold content. The text-feature modeling method constructs a prediction model from information such as the title, text and images of the content, and identifies hot content. The social-network modeling method predicts hot content by means of diffusion models, propagation simulation and similar techniques.
The existing prediction method has the following defects:
1. in the prior art, little relevant information is selected for identifying hot questions in the software question-and-answer community, so the recognition rate for hot questions is low; and
2. the social network model in a software question-and-answer community is ill-defined, so a prediction model built on the social network model cannot accurately identify hot questions.
Disclosure of Invention
In view of the foregoing analysis, embodiments of the present invention provide a method and a system for predicting hot questions in a software question-and-answer community, so as to solve the problem that existing methods, which mainly target social networks, use little relevant information and therefore have a low recognition rate and low recognition accuracy for hot questions.
In one aspect, an embodiment of the invention provides a method for predicting hot questions in a question-and-answer community, comprising: obtaining a plurality of sample question data from the question-and-answer community, wherein the plurality of sample question data comprises: user attribute data of the question, text attribute data of the question, metadata attribute data of the question, and time attribute data of the question; converting the plurality of sample question data into an input matrix; constructing a fully convolutional neural network according to the input matrix and the identifiers of the categories of the sample questions, and training the fully convolutional neural network to obtain a fully convolutional neural network prediction model; and predicting a question to be identified by using the fully convolutional neural network prediction model so as to determine the category of the question to be identified, wherein the category is either hot question or cold question.
The beneficial effects of the above technical scheme are as follows: because the plurality of sample question data comprises the user attribute data, the text attribute data, the metadata attribute data and the time attribute data of the question, the sample question data is complete and comprehensive, which improves the recognition rate for hot questions. The fully convolutional neural network prediction model constructed from this complete and comprehensive sample question data can improve the accuracy of identifying hot questions.
As a further improvement of the above method, the user attribute data of the question includes: a user reputation value of the user who posted the question, an upward vote count of the user who posted the question, and a downward vote count of the user who posted the question, wherein the user reputation value, the upward vote count, and the downward vote count are related to the quality of the questions and the quality of the answers posted by the user.
As a further improvement of the above method, the text attribute data of the question includes: a title of the question, a body of the question, and a tag of the question, wherein the text attribute data of the question is text attribute data in the form of word vectors.
As a further improvement of the above method, the metadata attribute data of the question includes: a tag popularity level of the question and a ratio of code segment length to non-code segment length in the question body, wherein the tag popularity level of the question is an assessment of how often all tags contained in the question appear in all recently posted questions.
As a further improvement of the above method, the time attribute data of the question includes: the change in the browsing volume of the question per hour, the change in the score of the question per hour, and the change in the number of answers to the question per hour.
As a further improvement of the above method, converting the plurality of sample question data into an input matrix further comprises: taking the user attribute data, the text attribute data in the form of word vectors, the metadata attribute data and the time attribute data of each sample question as elements of a feature vector, so as to generate one column feature vector per sample question; and constructing the input matrix from the column feature vectors of the plurality of sample questions, wherein the input matrix has m rows and n columns, m corresponds to the number of sample questions, and n corresponds to the number of elements in one column feature vector.
As a further improvement of the above method, constructing a fully convolutional neural network according to the input matrix, and training the fully convolutional neural network to obtain a fully convolutional neural network prediction model further comprises: determining the category of each sample question according to its final browsing volume and marking it with an identifier, wherein sample questions with a higher browsing volume are determined to be hot sample questions and marked as 1, and sample questions with a lower browsing volume are determined to be cold sample questions and marked as 0; and feeding the input matrix and the identifiers into the fully convolutional neural network and training it to obtain the fully convolutional neural network prediction model.
As a further improvement of the above method, training the fully convolutional neural network to obtain a fully convolutional neural network prediction model further comprises: inputting the input matrix into the convolutional layers and performing three convolutions, wherein the convolutional layers use the rectified linear unit (ReLU) activation function and batch normalization is performed after each convolution; after the third convolution, entering a pooling layer and performing global average pooling; and building the prediction model with a multinomial logistic regression (Softmax) classifier from the vectors obtained by convolving and pooling the sample questions, together with their identifiers.
In another aspect, an embodiment of the invention provides a system for predicting hot questions in a question-and-answer community, comprising: a sample question data acquisition module, configured to obtain a plurality of sample question data from the question-and-answer community, wherein the plurality of sample question data includes: user attribute data of the question, text attribute data of the question, metadata attribute data of the question, and time attribute data of the question; a conversion module, configured to convert the plurality of sample question data into an input matrix; a prediction model acquisition module, configured to construct a fully convolutional neural network according to the input matrix and the identifiers of the categories of the sample questions, and to train the fully convolutional neural network to obtain a fully convolutional neural network prediction model; and a module for predicting the question to be identified, configured to predict the question to be identified by using the fully convolutional neural network prediction model so as to determine its category, wherein the category is either hot question or cold question.
As a further improvement of the above system, the user attribute data of the question includes: a user reputation value of the user who posted the question, an upward vote count of the user who posted the question, and a downward vote count of the user who posted the question, wherein the user reputation value, the upward vote count, and the downward vote count are related to the quality of the questions and the quality of the answers posted by the user; the text attribute data of the question includes: a title of the question, a body of the question, and a tag of the question, wherein the text attribute data of the question is text attribute data converted into the form of word vectors; the metadata attribute data of the question includes: a tag popularity level of the question and a ratio of code segment length to non-code segment length in the question body, wherein the tag popularity level of the question is an assessment of how often all tags contained in the question appear in all recently posted questions; and the time attribute data of the question includes: the change in the browsing volume of the question per hour, the change in the score of the question per hour, and the change in the number of answers to the question per hour.
Compared with the prior art, the invention can realize at least one of the following beneficial effects:
1. the attribute data acquired from the software question-and-answer community for judging whether a question is hot is comprehensive: it comprises user attribute data, text attribute data, metadata attribute data and time attribute data, so the complete and comprehensive sample question data improves the recognition rate for hot questions;
2. hot questions and cold questions differ markedly in their time attribute data, so the time attribute data of a question to be identified allows a hot question to be recognized as such more quickly; and
3. the fully convolutional neural network prediction model constructed from the attribute data can improve the accuracy of identifying hot questions.
In the invention, the technical schemes can be combined with each other to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a flow diagram of a method for predicting hot questions in a question-and-answer community according to an embodiment of the present invention;
FIG. 2 is a flow diagram of training a fully convolutional neural network to obtain a fully convolutional neural network prediction model, according to an embodiment of the present invention; and
FIG. 3 is a block diagram of a system for predicting hot questions in a question-and-answer community, according to an embodiment of the present invention.
Reference numerals:
302-sample question data acquisition module; 304-conversion module; 306-prediction model acquisition module; and 308-module for predicting the question to be identified
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
One embodiment of the present invention discloses a method for predicting hot questions in a question-and-answer community, as shown in FIG. 1.
Referring to FIG. 1, the method for predicting hot questions in the question-and-answer community includes: step S102, obtaining a plurality of sample question data from a question-and-answer community, wherein the plurality of sample question data comprises: user attribute data of the question, text attribute data of the question, metadata attribute data of the question, and time attribute data of the question; step S104, converting the plurality of sample question data into an input matrix; step S106, constructing a fully convolutional neural network according to the input matrix and the identifiers of the categories of the sample questions, and training the fully convolutional neural network to obtain a fully convolutional neural network prediction model; and step S108, predicting a question to be identified by using the fully convolutional neural network prediction model so as to determine the category of the question to be identified, wherein the category is either hot question or cold question.
Compared with the prior art, the prediction method provided by this embodiment improves the recognition rate for hot questions, because the plurality of sample question data comprises the user attribute data, the text attribute data, the metadata attribute data and the time attribute data of the question and is therefore complete and comprehensive. In addition, the fully convolutional neural network prediction model constructed from this complete and comprehensive sample question data can improve the accuracy of identifying hot questions.
The method for predicting hot questions in the question-and-answer community comprises step S102 of obtaining a plurality of sample question data from the question-and-answer community, wherein the plurality of sample question data comprises: user attribute data of the question, text attribute data of the question, metadata attribute data of the question, and time attribute data of the question. Specifically, the user attribute data of the question includes: a user reputation value of the user who posted the question, an upward vote count of the user who posted the question, and a downward vote count of the user who posted the question, wherein the user reputation value, the upward vote count, and the downward vote count are related to the quality of the questions and the quality of the answers posted by the user. For example, when the questions and answers posted by the user are of high quality, the user obtains a higher reputation value, more upward votes, and fewer downward votes; conversely, when the questions and answers posted by the user are of low quality, the user obtains a lower reputation value, fewer upward votes, and more downward votes. The text attribute data of the question includes: a title of the question, a body of the question, and a tag of the question, wherein the text attribute data of the question is text attribute data in the form of word vectors. The metadata attribute data of the question includes: the tag popularity level of the question, which is an assessment of how often all tags contained in the question appear in all recently posted questions, and the ratio of code segment length to non-code segment length in the question body.
The time attribute data of the question includes: the change in the browsing volume of the question per hour, the change in the score of the question per hour, and the change in the number of answers to the question per hour.
After step S102, the process proceeds to step S104, where the plurality of sample question data is converted into an input matrix. Specifically, converting the plurality of sample question data into the input matrix further comprises: taking the user attribute data, the text attribute data in the form of word vectors, the metadata attribute data and the time attribute data of each sample question as elements of a feature vector, so as to generate one column feature vector per sample question; and constructing the input matrix from the column feature vectors of the plurality of sample questions, wherein the input matrix has m rows and n columns, m corresponds to the number of sample questions, and n corresponds to the number of elements in one column feature vector.
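The conversion in step S104 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the dictionary keys (`user`, `text`, `meta`, `time`), the helper names and the sample values are hypothetical, and the text attributes are assumed to have already been turned into fixed-length numeric word-vector components.

```python
from typing import Dict, List, Sequence


def build_feature_vector(user: Sequence[float], text: Sequence[float],
                         meta: Sequence[float], time: Sequence[float]) -> List[float]:
    """Concatenate the four attribute groups of one sample question."""
    return [float(x) for group in (user, text, meta, time) for x in group]


def build_input_matrix(samples: List[Dict[str, Sequence[float]]]) -> List[List[float]]:
    """Stack one feature vector per sample question into an m x n matrix."""
    matrix = [build_feature_vector(s["user"], s["text"], s["meta"], s["time"])
              for s in samples]
    n = len(matrix[0]) if matrix else 0
    if any(len(row) != n for row in matrix):
        raise ValueError("all feature vectors must have the same length n")
    return matrix


# Hypothetical values: user = (reputation, upward votes, downward votes),
# text = word-vector components, meta = (tag popularity, code ratio),
# time = hourly changes in (browsing volume, score, number of answers).
samples = [
    {"user": [1500, 120, 3], "text": [0.12, -0.40], "meta": [0.08, 0.5], "time": [35, 2, 1]},
    {"user": [15, 0, 1], "text": [0.55, 0.10], "meta": [0.01, 2.0], "time": [4, 0, 0]},
]
X = build_input_matrix(samples)
print(len(X), len(X[0]))  # m rows (samples), n columns (elements per feature vector)
```

Here each sample question contributes one row, so m is the number of sample questions and n is the length of a feature vector, matching the matrix shape described above.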
After step S104, the process proceeds to step S106. Constructing a fully convolutional neural network according to the input matrix, and training the fully convolutional neural network to obtain a fully convolutional neural network prediction model further includes: determining the category of each sample question according to its final browsing volume and marking it with an identifier, wherein sample questions with a higher browsing volume are determined to be hot sample questions and marked as 1, and sample questions with a lower browsing volume are determined to be cold sample questions and marked as 0; and feeding the input matrix and the identifiers into the fully convolutional neural network and training it to obtain the fully convolutional neural network prediction model.
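The labeling rule can be sketched as follows. The text does not state where the boundary between "higher" and "lower" browsing volume lies, so the threshold here is an explicit parameter, and the median split used as the default is purely an assumption for illustration; the function name is also hypothetical.

```python
from statistics import median
from typing import List, Optional


def label_by_views(final_views: List[int], threshold: Optional[float] = None) -> List[int]:
    """Return 1 (hot) for browsing volumes above the threshold, else 0 (cold)."""
    if threshold is None:
        # Assumption: split at the median; the source does not give a threshold.
        threshold = median(final_views)
    return [1 if v > threshold else 0 for v in final_views]


labels = label_by_views([12, 950, 40, 3100])
print(labels)  # [0, 1, 0, 1]
```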
Specifically, training the fully convolutional neural network to obtain the fully convolutional neural network prediction model further comprises: step S202, inputting the input matrix into the convolutional layers and performing three convolutions, wherein the convolutional layers use the rectified linear unit (ReLU) activation function and batch normalization is performed after each convolution; step S204, after the three convolutions, entering a pooling layer and performing global average pooling; and step S206, building the prediction model with a multinomial logistic regression (Softmax) classifier from the vectors obtained by convolving and pooling the sample questions, together with their identifiers.
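To make the data flow of steps S202 to S206 concrete, the following pure-Python sketch runs one forward pass: three convolutions, each followed by batch normalization and ReLU, then global average pooling and a Softmax output. Everything beyond that sequence is an assumption for illustration: the one-dimensional convolution over each feature vector, the kernel size 3, the four channels, the order of batch normalization before ReLU, the random untrained weights, and all function names. Training (learning the weights from the identifiers) is omitted.

```python
import math
import random
from typing import List

Signal = List[List[float]]  # one sample: [channels][length]


def conv1d(x: Signal, w: List[List[List[float]]]) -> Signal:
    """Valid 1-D convolution; w is indexed [out_channel][in_channel][kernel]."""
    k = len(w[0][0])
    length = len(x[0]) - k + 1
    return [[sum(w_o[c][j] * x[c][t + j] for c in range(len(x)) for j in range(k))
             for t in range(length)] for w_o in w]


def batch_norm(batch: List[Signal], eps: float = 1e-5) -> List[Signal]:
    """Normalize each channel over batch and length (no learned scale or shift)."""
    out = [[list(ch) for ch in s] for s in batch]
    for c in range(len(batch[0])):
        vals = [v for s in batch for v in s[c]]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        for s in out:
            s[c] = [(v - mean) / math.sqrt(var + eps) for v in s[c]]
    return out


def relu(batch: List[Signal]) -> List[Signal]:
    return [[[max(0.0, v) for v in ch] for ch in s] for s in batch]


def global_average_pool(batch: List[Signal]) -> List[List[float]]:
    """Average each channel over its length, giving one value per channel."""
    return [[sum(ch) / len(ch) for ch in s] for s in batch]


def softmax(z: List[float]) -> List[float]:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def random_weights(rng: random.Random, out_c: int, in_c: int, k: int):
    return [[[rng.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(in_c)]
            for _ in range(out_c)]


def forward(matrix: List[List[float]], conv_ws, dense_w) -> List[List[float]]:
    """Return [P(cold), P(hot)] for every row of the input matrix."""
    batch = [[row] for row in matrix]  # each feature vector becomes one input channel
    for w in conv_ws:  # three convolution blocks
        batch = relu(batch_norm([conv1d(s, w) for s in batch]))
    pooled = global_average_pool(batch)
    return [softmax([sum(wi * p for wi, p in zip(w_row, vec)) for w_row in dense_w])
            for vec in pooled]


rng = random.Random(0)
conv_ws = [random_weights(rng, 4, 1, 3), random_weights(rng, 4, 4, 3),
           random_weights(rng, 4, 4, 3)]
dense_w = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
matrix = [[rng.random() for _ in range(10)] for _ in range(3)]  # m = 3, n = 10
probs = forward(matrix, conv_ws, dense_w)
```

In practice a deep-learning framework would supply these layers and the training loop; the sketch only shows how a row of the input matrix is reduced by global average pooling to one value per channel before the two-class Softmax.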
Hereinafter, the method for predicting hot questions in the software question-and-answer community is described in detail by way of a specific example.
The prediction method comprises the following steps: crawling the sample question data using the Scrapy crawler framework; obtaining, from the sample question data, the user attribute data, the text attribute data, the metadata attribute data and the time attribute data of the question; constructing a fully convolutional neural network model according to the sample question data; and predicting a question to be identified by using the fully convolutional neural network model to determine the question category, wherein the question category is either hot question or cold question.
Within a certain range, the more sample question data is obtained, the more comprehensively the characteristics of hot and cold questions can be captured, and the higher the accuracy of the trained fully convolutional neural network prediction model. For example, with 3200 sets of sample question data, the trained fully convolutional neural network prediction model reaches a prediction accuracy of over 80 percent on a question to be identified one hour after the question is posted.
Specifically, the user attribute data of the question includes: a reputation value of the user who posted the question, an upward vote count (upvotes) of the user who posted the question, and a downward vote count (downvotes) of the user who posted the question. Users who post hot questions and users who post cold questions differ in their user attribute data; the user reputation value is related to how long the user has been registered on the website and to the user's experience; in addition, the user reputation value, the number of upward votes, and the number of downward votes are related to the quality of the questions and the quality of the answers posted by the user. Specifically, when the questions and answers posted by the user are of high quality, the user obtains a higher reputation value, more upward votes and fewer downward votes; conversely, when the questions and answers posted by the user are of low quality, the user obtains a lower reputation value, fewer upward votes and more downward votes.
Specifically, the text attribute data of the question includes: the title of the question, the body of the question, and the tag of the question. The title and body of the question determine the novelty, readability, reducibility and the like of the question; the tag of the question represents the knowledge field the question relates to, and different knowledge fields attract different numbers of interested users; thus, the title, body, and tag of the question can affect whether the question becomes hot.
Specifically, the metadata attribute data of the question includes: the tag popularity level of the question and the ratio of code segment length to non-code segment length in the question body.
The tag popularity level of the question takes all tags of the question into account and is an assessment of how often the tags contained in the question appear in all recently posted questions; a high tag frequency indicates that more users in the software question-and-answer community are involved in that knowledge field; in general, hot questions have a high tag popularity level.
The ratio of code segment length to non-code segment length in the question body reflects how much the user who posted the question explains the code; when non-code segments account for a large share of the question, the question is more readable and its browsing volume is generally higher.
To obtain the tag popularity level of the question, it is first necessary to obtain the popularity of each tag, which is the frequency with which the tag appears in the question sample.
Taking the tag java as an example, let the size of the question sample be $n$ and the number of questions containing the tag java be $n_{java}$. The popularity $F_{java}$ of java is calculated as shown in the following formula (1):
$$F_{java} = \frac{n_{java}}{n} \tag{1}$$
A question usually contains multiple tags. If a question contains $m$ tags, denoted $tag_1, tag_2, \ldots, tag_m$, then the tag popularity level $FL$ of the question is calculated as shown in the following formula (2):
[Formula (2): equation image BDA0002474073560000101, not reproduced in the extracted text]
assume that the length of a code segment contained in the question body is lengthcodeContaining a length of non-code segmenttextThen the ratio of the code segment length to the non-code segment length, ratio, in the problem bodyThe calculation method is shown in the following formula (3):
Figure BDA0002474073560000102
According to formulas (1) and (2), the tag popularity level of every question can be obtained, and according to formula (3), the ratio of code segment length to non-code segment length in the question body can be obtained.
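As an illustration only, formulas (1) and (3) can be sketched as follows. The function names, the sample data, the use of character counts as "length", and the assumption that code segments are wrapped in `<pre><code>` markup (as in Stack Overflow style HTML) are hypothetical choices made for this sketch and are not specified by the present description. Formula (2) is not sketched because its combination rule is not reproduced in this text.

```python
import re
from collections import Counter

# Assumption: code segments are wrapped in <pre><code>...</code></pre>;
# other markup would need a different pattern.
CODE_BLOCK = re.compile(r"<pre><code>(.*?)</code></pre>", re.DOTALL)
HTML_TAG = re.compile(r"<[^>]+>")


def tag_popularity(sample_tags):
    """Formula (1): share of the sample questions that contain each tag."""
    n = len(sample_tags)
    counts = Counter(tag for tags in sample_tags for tag in set(tags))
    return {tag: count / n for tag, count in counts.items()}


def code_text_ratio(body_html):
    """Formula (3): code segment length divided by non-code segment length."""
    # Assumption: "length" is measured in characters.
    length_code = sum(len(code) for code in CODE_BLOCK.findall(body_html))
    text = HTML_TAG.sub("", CODE_BLOCK.sub("", body_html))
    length_text = len(text.strip())
    if length_text == 0:
        raise ValueError("question body has no non-code text")
    return length_code / length_text


# Four hypothetical sample questions, each given as its list of tags.
sample_tags = [["java", "spring"], ["java"], ["python"], ["java", "python"]]
popularity = tag_popularity(sample_tags)   # java appears in 3 of 4 questions
ratio = code_text_ratio("<p>abcd</p><pre><code>xy</code></pre>")  # 2 / 4
```

Here the tag java has popularity 0.75 because it occurs in three of the four sample questions, and the example body yields a ratio of 0.5.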
Specifically, the time attribute data of the question includes: the change in the browsing volume of the question per hour, the change in the score of the question per hour, and the change in the number of answers to the question per hour.
Taking the hourly change in browsing volume as an example, the browsing volume data in the sample question data is the browsing volume $v_i$ at i hours after the question was posted. The change $vc_i$ in browsing volume during the i-th hour is the difference between the current browsing volume and the browsing volume one hour earlier, and is calculated as shown in the following formula (4):
$$vc_i = v_i - v_{i-1} \qquad (4)$$
According to formula (4), the change in the browsing volume of the question per hour can be obtained.
Using the same calculation method, the change in the question score per hour and the change in the number of answers to the question per hour can be obtained.
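A minimal sketch of formula (4) is given below. The function name and the sample values are hypothetical, and the sketch assumes that the value at the moment of posting is zero (v_0 = 0), which the description does not state explicitly.

```python
def hourly_changes(cumulative):
    """Formula (4): vc_i = v_i - v_{i-1} for each recorded hour i."""
    changes = []
    previous = 0  # assumption: v_0 = 0 at the moment the question is posted
    for value in cumulative:
        changes.append(value - previous)
        previous = value
    return changes


# Hypothetical cumulative values recorded 1, 2 and 3 hours after posting.
view_changes = hourly_changes([5, 12, 20])    # [5, 7, 8]
score_changes = hourly_changes([1, 0, -2])    # [1, -1, -2]
```

The same function applies unchanged to scores and answer counts; scores can fall, so negative changes are possible.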
Hot questions have higher browsing volumes than cold questions, and this difference may already appear in the initial period after a question is posted; the present invention therefore uses the change in the browsing volume of the question in each hour after the question is posted.
After a question is posted, users can vote it up or down, and the score of the question is the number of upward votes minus the number of downward votes. High-scoring questions are often clearer and of better quality and therefore attract more attention, so the present invention uses the change in the question score in each hour after the question is posted.
The present invention also uses the change in the number of answers in each hour after the question is posted, because many users enter the community to find answers to questions they are facing, and a question that has answers is more helpful to those users and therefore receives more views.
In a specific implementation, the full convolution neural network prediction model is determined through the following process:
obtaining the user attribute data of the question directly;
removing punctuation and stop words from the text attribute data of the question;
converting the processed text attribute data of the question into word vector form using the TF-IDF (term frequency-inverse document frequency) method;
obtaining the tag popularity level in the metadata attribute data of the question from formulas (1) and (2);
obtaining the ratio of code segment length to non-code segment length in the question body, in the metadata attribute data of the question, from formula (3);
obtaining the change in the browsing volume of the question per hour, in the time attribute data of the question, from formula (4), and obtaining the change in the score of the question per hour and the change in the number of answers to the question per hour by the same method;
taking the user attribute data of the question, the text attribute data converted into word vectors, the metadata attribute data and the time attribute data as elements of a feature vector to generate the feature vector; and
importing the feature vectors into a full convolution neural network algorithm model, and training it to obtain the full convolution neural network prediction model.
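The preprocessing and feature-vector steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the stop-word list is a tiny hypothetical one, the TF-IDF weighting shown (term frequency times log(N / document frequency)) is one common variant and may differ from the variant used in practice, and all function names and numeric values are invented for the example.

```python
import math
import string
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real list would be longer.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "how", "do", "i", "of"}
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Remove punctuation and stop words, returning lower-case tokens."""
    tokens = text.lower().translate(_PUNCT_TO_SPACE).split()
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(token_lists):
    """One TF-IDF vector per document: tf * log(N / df) for each vocabulary term."""
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append(
            [counts[t] / total * math.log(n_docs / doc_freq[t]) for t in vocabulary]
        )
    return vocabulary, vectors


def build_feature_vector(user, text_vector, metadata, time_changes):
    """Concatenate the four groups of attribute data into one feature vector."""
    return list(user) + list(text_vector) + list(metadata) + list(time_changes)


texts = ["How to sort a list in Java?", "Java list is empty!"]
vocabulary, text_vectors = tfidf_vectors([preprocess(t) for t in texts])
feature_vector = build_feature_vector(
    user=[1520, 34, 2],          # hypothetical reputation, up votes, down votes
    text_vector=text_vectors[0],
    metadata=[0.75, 0.5],        # hypothetical tag popularity level and code ratio
    time_changes=[5, 7, 8],      # hypothetical hourly changes (one series shown)
)
```

Terms that occur in every document, such as java and list here, receive a weight of zero under this variant, while a term unique to one document receives a positive weight.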
Specifically, the full convolution neural network prediction model is obtained by training through the following method:
forming an input matrix from the feature vectors of a plurality of questions, inputting it into the convolution layers, and performing convolution three times, wherein the convolution process is shown in the following formulas (5) and (6), where Q represents the input feature matrix, W represents a convolution kernel, * represents the convolution operation, i and j respectively represent the row and column of the matrix, and output represents the output:
$$output(i,j) = (Q * W)(i,j) \qquad (5)$$
$$(Q * W)(i,j) = \sum_{m}\sum_{n} Q(i-m,\, j-n)\, W(m,n) \qquad (6)$$
the convolution layers adopt the rectified linear unit (ReLU) activation function, and batch normalization is performed after each convolution. The batch normalization method is shown in formulas (7), (8) and (9), where B represents a batch of data, s represents the number of data items in the batch, $input_i$ represents one sample after convolution, $\mu_B$ represents the mean of the batch, and $\sigma_B^2$ represents the variance of the batch. Formula (9) normalizes the convolved sample $input_i$ to $input'_i$, which is used in subsequent processing:
$$\mu_B = \frac{1}{s}\sum_{i=1}^{s} input_i \qquad (7)$$
$$\sigma_B^2 = \frac{1}{s}\sum_{i=1}^{s}\left(input_i - \mu_B\right)^2 \qquad (8)$$
$$input'_i = \frac{input_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad (9)$$
where $\epsilon$ is a small constant that keeps the denominator non-zero, as in the standard batch normalization formulation;
after the three convolutions, the data enters a pooling layer, where global average pooling is performed; and
then, establishing a prediction model using a multinomial logistic regression (Softmax) classifier, based on the vectors obtained after convolution and pooling of the questions and their identifiers, to obtain the probability that each question belongs to the hot class and to the cold class.
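The individual operations named in the steps above (convolution, ReLU, batch normalization, global average pooling and Softmax) can be illustrated in isolation as follows. This is a plain-Python sketch of each operation, not a trained network: kernel sizes, the handling of matrix borders (only positions where the kernel fully overlaps the input are computed), the constant eps, and all names and values are assumptions of this sketch. How the pooled values are mapped to the two class scores is likewise not specified here, so the Softmax call uses placeholder scores.

```python
import math


def conv2d(Q, W):
    """Convolution sum over m, n of Q(i-m, j-n) * W(m, n), full-overlap positions only."""
    kh, kw = len(W), len(W[0])
    out = []
    for i in range(kh - 1, len(Q)):
        row = []
        for j in range(kw - 1, len(Q[0])):
            row.append(sum(Q[i - m][j - n] * W[m][n]
                           for m in range(kh) for n in range(kw)))
        out.append(row)
    return out


def relu(x):
    """Rectified linear unit activation."""
    return max(0.0, x)


def batch_norm(inputs, eps=1e-5):
    """Batch mean, batch variance, then normalization; eps is an assumed constant."""
    s = len(inputs)
    mu = sum(inputs) / s
    var = sum((x - mu) ** 2 for x in inputs) / s
    return [(x - mu) / math.sqrt(var + eps) for x in inputs]


def global_average_pool(feature_map):
    """Average of all values in one feature map."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def softmax(scores):
    """Convert class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


Q = [[1.0, 2.0], [3.0, 4.0]]
conv_out = conv2d(Q, [[0.0, 0.0], [0.0, 1.0]])   # W(1,1) multiplies Q(0,0)
normalized = batch_norm([1.0, 2.0, 3.0])
pooled = global_average_pool([[relu(v) for v in row] for row in Q])
probabilities = softmax([0.0, 0.0])              # placeholder hot/cold scores
```

Because the convolution sum indexes Q at (i-m, j-n), the kernel is effectively flipped relative to the input, which is why the kernel entry W(1,1) selects Q(0,0) in the example.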
Further, to predict a question to be identified, the text attribute data of the question to be identified is vectorized, and the user attribute data, the text attribute data converted into word vectors, the metadata attribute data and the time attribute data of the question to be identified are taken as elements of a feature vector to generate the feature vector; the feature vector of the question to be identified is then input into the full convolution neural network prediction model to predict whether the question to be identified is a hot question or a cold question.
Hereinafter, a system for predicting hot questions in the question-and-answer community will be described in detail with reference to FIG. 3.
In another embodiment of the present invention, a system for predicting hot questions in a question-and-answer community is disclosed, comprising: a sample question data obtaining module 302, configured to obtain a plurality of sample question data from the question-and-answer community, where the plurality of sample question data includes: user attribute data of the question, text attribute data of the question, metadata attribute data of the question, and time attribute data of the question; a conversion module 304, configured to convert the plurality of sample question data into an input matrix; a prediction model obtaining module 306, configured to construct a full convolution neural network according to the input matrix and the identifiers of the categories of the sample questions, and to train the full convolution neural network to obtain a full convolution neural network prediction model; and a to-be-identified question prediction module 308, configured to predict the question to be identified using the full convolution neural network prediction model, so as to determine the category of the question to be identified, where the category of the question to be identified is either a hot question or a cold question.
Specifically, the user attribute data of the question includes: the reputation value of the user who posted the question, the upward vote count of the user who posted the question, and the downward vote count of the user who posted the question, wherein the user reputation value, the upward vote count, and the downward vote count are related to the quality of the questions and answers posted by the user. Specifically, when the questions and answers posted by the user are of high quality, the user obtains a higher reputation value, more upward votes and fewer downward votes; conversely, when the questions and answers posted by the user are of low quality, the user obtains a lower reputation value, fewer upward votes and more downward votes.
Specifically, the text attribute data of the question includes: the title of the question, the body of the question, and the tags of the question, wherein the text attribute data of the question is converted into text attribute data in word vector form.
Specifically, the metadata attribute data of the question includes: the tag popularity level of the question, which is an evaluation of the frequency with which all tags contained in the question appear in all recently posted questions, and the ratio of code segment length to non-code segment length in the question body.
Specifically, the time attribute data of the question includes: the change in the browsing volume of the question per hour, the change in the score of the question per hour, and the change in the number of answers to the question per hour.
The system for predicting hot questions in the question-and-answer community corresponds to the prediction method described above, so the system also includes a plurality of other modules; to avoid repetition, detailed description of these other modules is omitted.
The system for predicting hot questions in the software question-and-answer community operates on the same principle as the method for predicting hot questions in the software question-and-answer community, so the system also has the technical effects corresponding to the prediction method.
Compared with the prior art, the prediction method and the prediction system can realize at least one of the following beneficial effects:
1. The attribute data acquired from the software question-and-answer community for judging whether a question is hot is comprehensive, comprising user attribute data, text attribute data, metadata attribute data and time attribute data; such complete and comprehensive sample question data can improve the recognition rate of hot questions;
2. Hot questions and cold questions differ markedly in their time attribute data, so when judging whether a question to be identified is hot, its time attribute data allows a hot question to be identified more quickly; and
3. The full convolution neural network prediction model constructed from the attribute data can improve the accuracy of identifying hot questions.
Those skilled in the art will appreciate that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program, which is stored in a computer readable storage medium, to instruct related hardware. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. A method for predicting hot questions in a question-and-answer community is characterized by comprising the following steps:
obtaining a plurality of sample question data from the question-answering community, wherein the plurality of sample question data comprises: user attribute data of the question, text attribute data of the question, metadata attribute data of the question, and time attribute data of the question;
converting the plurality of sample problem data into an input matrix;
constructing a full convolution neural network according to the input matrix and the identification of the category of the sample problem, and training the full convolution neural network to obtain a full convolution neural network prediction model; and
predicting the problem to be identified by using the full convolution neural network prediction model so as to determine the category of the problem to be identified, wherein the category of the problem to be identified comprises a hot problem and a cold problem.
2. The method for predicting hot questions in the question-and-answer community according to claim 1, wherein the user attribute data of the question comprises: a user reputation value of the user who issued the question, an upward vote count of the user who issued the question, and a downward vote count of the user who issued the question, wherein the user reputation value, the upward vote count, and the downward vote count are related to the quality of the questions and the quality of the answers issued by the user.
3. The method for predicting hot questions in the question-and-answer community according to claim 1, wherein the text attribute data of the question comprises: a title of the question, a body of the question, and a label of the question, wherein the text attribute data of the question is text attribute data in the form of a word vector.
4. The method for predicting hot questions in the question-and-answer community according to claim 1, wherein the metadata attribute data of the question comprises: the tag popularity rating of the question and the ratio of code segment length to non-code segment length in the question body, wherein
the tag popularity rating of the question is an assessment of the frequency with which all tags contained in the question appear in all recently issued questions.
5. The method for predicting hot questions in the question-and-answer community according to claim 1, wherein the time attribute data of the question comprises: the amount of change in the browsing volume of the question per hour, the amount of change in the score of the question per hour, and the amount of change in the number of answers to the question per hour.
6. The method for predicting hot questions in the question-and-answer community according to any one of claims 1 to 5, wherein converting the plurality of sample question data into the input matrix further comprises:
taking the user attribute data, the text attribute data in the form of the word vector, the metadata attribute data and the time attribute data of each sample question as elements of a feature vector to generate the feature vector of that sample question; and
constructing the input matrix from the feature vectors of a plurality of sample questions, the input matrix comprising m rows and n columns, wherein m corresponds to the number of sample questions and n corresponds to the number of elements in each feature vector.
7. The method for predicting hot questions in the question-and-answer community according to claim 6, wherein constructing a full convolution neural network according to the input matrix, and training the full convolution neural network to obtain a full convolution neural network prediction model further comprises:
determining the category of each sample question according to the final browsing amount of the sample question and marking it with an identifier, wherein sample questions with a higher browsing amount are determined as hot sample questions and marked as 1, and sample questions with a lower browsing amount are determined as cold sample questions and marked as 0; and
importing the input matrix and the identifiers into the full convolution neural network and training the full convolution neural network to obtain the full convolution neural network prediction model.
8. The method for predicting hot questions in the question-and-answer community according to claim 1, wherein training the full convolution neural network to obtain a full convolution neural network prediction model further comprises:
inputting the input matrix into a convolution layer and performing convolution three times, wherein the convolution layer adopts a rectified linear unit activation function, and batch normalization is performed after each convolution;
after the third convolution, entering a pooling layer and performing global average pooling; and
establishing a prediction model using a multinomial logistic regression Softmax classifier, according to the vectors obtained by convolving and pooling the sample questions and their identifiers.
9. A system for predicting hot questions in a question-and-answer community, comprising:
a sample question data obtaining module, configured to obtain a plurality of sample question data from the question-and-answer community, where the plurality of sample question data includes: user attribute data of the question, text attribute data of the question, metadata attribute data of the question, and time attribute data of the question;
a conversion module, configured to convert the plurality of sample question data into an input matrix;
a prediction model obtaining module, configured to construct a full convolution neural network according to the input matrix and the identifiers of the categories of the sample questions, and to train the full convolution neural network to obtain a full convolution neural network prediction model; and
a to-be-identified question prediction module, configured to predict the question to be identified using the full convolution neural network prediction model so as to determine the category of the question to be identified, wherein the category of the question to be identified comprises a hot question and a cold question.
10. The system for predicting hot questions in a question-and-answer community according to claim 9, wherein
the user attribute data of the question includes: a user reputation value of the user who issued the question, an upward vote count of the user who issued the question, and a downward vote count of the user who issued the question, wherein the user reputation value, the upward vote count, and the downward vote count are related to the quality of the questions and the quality of the answers issued by the user;
the text attribute data of the question includes: a title of the question, a body of the question, and a label of the question, wherein the text attribute data of the question is text attribute data converted into the form of a word vector;
the metadata attribute data of the question includes: the label popularity level of the question and the ratio of code segment length to non-code segment length in the question body, wherein the label popularity level of the question is an evaluation of the frequency with which all labels contained in the question appear in all recently issued questions; and
the time attribute data of the question includes: the amount of change in the browsing volume of the question per hour, the amount of change in the score of the question per hour, and the amount of change in the number of answers to the question per hour.
CN202010357802.3A 2020-04-29 2020-04-29 Method and system for predicting hot questions in question-answering community Active CN111581382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010357802.3A CN111581382B (en) 2020-04-29 2020-04-29 Method and system for predicting hot questions in question-answering community

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010357802.3A CN111581382B (en) 2020-04-29 2020-04-29 Method and system for predicting hot questions in question-answering community

Publications (2)

Publication Number Publication Date
CN111581382A true CN111581382A (en) 2020-08-25
CN111581382B CN111581382B (en) 2023-06-30

Family

ID=72113239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010357802.3A Active CN111581382B (en) 2020-04-29 2020-04-29 Method and system for predicting hot questions in question-answering community

Country Status (1)

Country Link
CN (1) CN111581382B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100235343A1 (en) * 2009-03-13 2010-09-16 Microsoft Corporation Predicting Interestingness of Questions in Community Question Answering
CN107909016A (en) * 2017-11-03 2018-04-13 车智互联(北京)科技有限公司 A kind of convolutional neural networks generation method and the recognition methods of car system
CN109165289A (en) * 2018-08-31 2019-01-08 西安交通大学 A method of the prediction of community's question and answer website problem quality is carried out by depth convolutional neural networks
CN109871439A (en) * 2019-02-18 2019-06-11 华南理工大学 A kind of Ask-Answer Community problem method for routing based on deep learning
CN110909254A (en) * 2019-10-31 2020-03-24 中山大学 Method and system for predicting question popularity of question-answering community based on deep learning model

Also Published As

Publication number Publication date
CN111581382B (en) 2023-06-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant