CN103425635B - An answer recommendation method and apparatus - Google Patents

An answer recommendation method and apparatus

Info

Publication number
CN103425635B
CN103425635B CN201210151044.5A
Authority
CN
China
Prior art keywords
weight
answer
semantic
question
category
Prior art date
Legal status
Active
Application number
CN201210151044.5A
Other languages
Chinese (zh)
Other versions
CN103425635A (en)
Inventor
陈庆轩
梁丰
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210151044.5A priority Critical patent/CN103425635B/en
Publication of CN103425635A publication Critical patent/CN103425635A/en
Application granted granted Critical
Publication of CN103425635B publication Critical patent/CN103425635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an answer recommendation method and apparatus. The method includes: obtaining the text content of a question and of the answers corresponding to the question, and performing word segmentation to obtain the semantic units of the question and of each answer; using a pre-established question domain dictionary to look up the weight of each semantic unit of the question in each category and calculating the topic weight of the question in each category; using a pre-established answer domain dictionary to look up the weight of each semantic unit of each answer in each category and calculating the topic weight of each answer in each category; and using the obtained topic weights of the question and of each answer to calculate the topic similarity between each answer and the question, and recommending answers according to the topic similarity results. Compared with the prior art, the invention generates a question domain dictionary and an answer domain dictionary separately, which effectively improves the accuracy of question-answer semantic similarity and increases the recall rate.

Description

Answer recommendation method and device
[ technical field ]
The invention relates to the technical field of internet information processing, in particular to an answer recommendation method and device.
[ background of the invention ]
With the continuous development of information and network technologies, interactive online question-answering communities such as Baidu Knows, Sina iAsk, Google Answers, Soso Wenwen, and Yahoo! Answers are attracting increasing attention. These communities provide a platform for internet users to interact: users can freely ask questions, browse questions, answer questions, help each other, and share knowledge. As the number of participating users grows, so does the number of candidate answers, so question-answering communities generally rank the answers automatically in order to recommend preferred answers to users.
In the automatic ranking of answers, text topic analysis is currently the most common technique: the semantic relevance of a question-answer pair is analyzed to determine how well the answer satisfies the question, and the answers are then ranked automatically. Text topic analysis is mainly based on a topic model: texts are mapped into topic vectors, and each topic is represented by a word distribution, so that topic similarity between texts can be computed as similarity between topic vectors, measured for example by cosine similarity.
Most existing text topic analysis methods rest on an assumption: all texts belong to the same topic space, and each topic follows the same word distribution. However, the question and the answers in a question-answer pair may be described differently, i.e., with inconsistent wording. For example, in the computer field, the word distribution on the question side is dominated by common or colloquial computer terms such as "computer" and "operating system", while the word distribution on the answer side is dominated by professional vocabulary such as "PC" and "win7". As another example, a user may ask about a technique in a game, but the answer describes the specific technique without using the words of the question. In such cases, the semantic relevance between answer and question computed by existing methods is low, so the answer that actually matches the question either cannot be recalled or is ranked low; this reduces the accuracy of question-answer quality judgment, and the user cannot find the preferred answer.
[ summary of the invention ]
In view of the above, the present invention provides an answer recommendation method and apparatus, which respectively generate a question domain dictionary and an answer domain dictionary to expand domain mapping expressions of questions and answers in a question-answer pair, thereby effectively improving the accuracy of semantic similarity determination between the questions and the answers and increasing the recall rate.
The specific technical scheme is as follows:
an answer recommendation method, comprising the steps of:
s1, obtaining a question and text content of an answer corresponding to the question, and segmenting words to obtain a semantic unit of the question and a semantic unit of the answer;
s2, searching the weight of the semantic unit of the problem in each category by using a pre-established problem field dictionary, and calculating the theme weight of the problem in each category;
and
searching the weight of the semantic unit of each answer in each category by using a pre-established answer field dictionary, and respectively calculating the theme weight of each answer in each category;
and S3, respectively calculating the topic similarity of each answer and the question by using the obtained topic weight of the question and the topic weight of each answer, and recommending the answers according to the calculation result of the topic similarity.
According to a preferred embodiment of the present invention, the method for establishing a dictionary of problem domains specifically includes:
acquiring the content of a question in a question-answer corpus, and segmenting words to obtain a semantic unit of the question;
respectively calculating the weight of each semantic unit of the problem in each category;
and forming a problem field dictionary by the semantic units and the weights of the semantic units in the categories.
According to a preferred embodiment of the present invention, the method for establishing a dictionary in an answer field specifically includes:
acquiring the content of an answer in a question and answer pair corpus, and performing word segmentation to obtain a semantic unit of the answer;
respectively calculating the weight of each semantic unit of the answer in each category;
and forming an answer field dictionary by the semantic units and the weights of the semantic units in the categories.
According to a preferred embodiment of the present invention, after obtaining the semantic unit of the question or the semantic unit of the answer, the method further comprises:
filtering semantic units with word frequency lower than a preset word frequency threshold;
and respectively calculating the weight in each category only for the residual semantic units after filtering.
According to a preferred embodiment of the present invention, the weight of the semantic unit in each category is calculated according to one or any combination of the following:
the difference of the word frequency of the semantic unit among all the categories, the word frequency of the semantic unit appearing in all the categories or the inverse word frequency of the semantic unit.
According to a preferred embodiment of the present invention, the method for calculating the weight of the semantic unit in each category is:
wherein w(token_i, C_j) = D(token_i) × tf_n(token_i, C_j) × log(N / N(token_i)), in which w(token_i, C_j) denotes the weight of the semantic unit token_i in category C_j, and D(token_i) denotes the difference of token_i's word frequency among the categories;
p_ij = T_ij / L_j, where L_j denotes the total number of occurrences of all semantic units contained in category C_j, and T_ij denotes the number of occurrences of token_i in category C_j;
wherein m is the number of categories;
tf_n(token_i, C_j) denotes the word frequency of token_i in category C_j, where n is a word-frequency influence factor;
N denotes the total number of occurrences of all semantic units in the corpus, and N(token_i) denotes the number of occurrences of token_i, so that log(N / N(token_i)) is the inverse word frequency of token_i.
According to a preferred embodiment of the present invention, before forming the question domain dictionary or the answer domain dictionary from the semantic units and their weights in the categories, the method further comprises:
carrying out similar weight filtering on the weight of each semantic unit among each category, and filtering the weight of which the occurrence frequency in the same weight interval is greater than a preset threshold value aiming at the same semantic unit;
only the weight of the semantic unit in the remaining categories is used to form a question domain dictionary or an answer domain dictionary.
According to a preferred embodiment of the present invention, the weight interval is set according to the weight of the semantic unit in each category.
According to a preferred embodiment of the present invention, before forming the problem domain dictionary by the semantic units and their weights in the categories, the method further comprises:
filtering out semantic units with the length of single characters, repeated numeric strings or numeric strings exceeding a preset length threshold;
and only the semantic units remaining after filtering are used for forming a question domain dictionary or an answer domain dictionary.
According to a preferred embodiment of the present invention, the method for calculating the topic similarity between the answer and the question includes:
respectively calculating the subject similarity of the answers and the questions under each category;
and selecting the maximum value of the calculated theme similarity as the theme similarity of the answer and the question.
According to a preferred embodiment of the present invention, the method for calculating the topic similarity between the answer and the question comprises:
sim(query,ans)=Max j {weight(query,C j )×weight(ans,C j )}
where sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
An answer recommendation apparatus, the apparatus comprising:
the text acquisition module is used for acquiring a question and text content of an answer corresponding to the question, and performing word segmentation to obtain a semantic unit of the question and a semantic unit of the answer;
the theme weight calculation module is used for searching the weight of the semantic unit of the problem in each category by utilizing a pre-established problem field dictionary and calculating the theme weight of the problem in each category;
and
the system comprises a database, a semantic unit, a topic weight calculation module and a topic weight calculation module, wherein the database is used for storing a plurality of classes of the answers;
and the similarity calculation module is used for calculating the similarity of the topics of the questions and the answers respectively by using the topic weight of the questions and the topic weight of the answers obtained by the topic weight calculation module, and recommending the answers according to the calculation result of the topic similarity.
According to a preferred embodiment of the present invention, the problem domain dictionary is created in advance by a problem dictionary creating module, and the problem dictionary creating module specifically includes:
the question acquisition submodule is used for acquiring the content of a question in the question-answer corpus and segmenting words to obtain a semantic unit of the question;
the first weight calculation submodule is used for calculating the weight of each semantic unit of the problem in each category respectively;
and the first integration submodule is used for forming a problem field dictionary by the semantic units and the weights of the semantic units in the categories.
According to a preferred embodiment of the present invention, the answer domain dictionary is established in advance by an answer dictionary establishing module, and the answer dictionary establishing module specifically includes:
the answer obtaining submodule is used for obtaining the content of the answers in the question and answer pair corpus and obtaining the semantic units of the answers by word segmentation;
the second weight calculation submodule is used for calculating the weight of each semantic unit of the answer in each category respectively;
and the second integration submodule is used for forming an answer field dictionary by the semantic units and the weights of the semantic units in the categories.
According to a preferred embodiment of the present invention, the question dictionary creating module or the answer dictionary creating module further includes:
the word frequency filtering submodule is used for filtering the semantic units with the word frequency lower than a preset word frequency threshold;
and providing the filtered residual semantic units to the first weight calculation submodule or the second weight calculation submodule.
According to a preferred embodiment of the present invention, the first weight calculating submodule or the second weight calculating submodule calculates the weight of the semantic unit in each category according to one or any combination of the following list:
the difference of the word frequency of the semantic unit among all categories, the word frequency of the semantic unit appearing in all categories or the inverse word frequency of the semantic unit.
According to a preferred embodiment of the present invention, the method for calculating the weight of the semantic unit in each category by the first weight calculation submodule or the second weight calculation submodule is as follows:
wherein w(token_i, C_j) = D(token_i) × tf_n(token_i, C_j) × log(N / N(token_i)), in which w(token_i, C_j) denotes the weight of the semantic unit token_i in category C_j, and D(token_i) denotes the difference of token_i's word frequency among the categories;
p_ij = T_ij / L_j, where L_j denotes the total number of occurrences of all semantic units contained in category C_j, and T_ij denotes the number of occurrences of token_i in category C_j;
wherein m is the number of categories;
tf_n(token_i, C_j) denotes the word frequency of token_i in category C_j, where n is a word-frequency influence factor;
N denotes the total number of occurrences of all semantic units in the corpus, and N(token_i) denotes the number of occurrences of token_i, so that log(N / N(token_i)) is the inverse word frequency of token_i.
According to a preferred embodiment of the present invention, the question dictionary creating module or the answer dictionary creating module further includes:
the weight filtering submodule is used for carrying out similar weight filtering on the weight of each semantic unit among all categories, and filtering the weight of which the occurrence frequency is greater than a preset threshold value in the same weight interval aiming at the same semantic unit;
providing only the weight of the semantic unit in the remaining categories to the first integration submodule or the second integration submodule for forming a question domain dictionary or an answer domain dictionary.
According to a preferred embodiment of the present invention, the weight interval is set according to the weight of the semantic unit in each category.
According to a preferred embodiment of the present invention, the question dictionary creating module or the answer dictionary creating module further includes:
the semantic unit filtering submodule is used for filtering out the semantic units of single characters, repeated numeric strings or numeric strings with the length exceeding a preset length threshold;
and providing the residual semantic units after filtering to the first integration submodule or the second integration submodule to form a question domain dictionary or an answer domain dictionary.
According to a preferred embodiment of the present invention, the similarity calculation module calculates the topic similarity of the answer and the question in each category, and selects the maximum value of the calculated topic similarity as the topic similarity of the answer and the question.
According to a preferred embodiment of the present invention, the method for calculating the topic similarity between the answer and the question by the similarity calculation module comprises:
sim(query,ans)=Max j {weight(query,C j )×weight(ans,C j )}
where sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
According to the technical scheme, the question field dictionary and the answer field dictionary are respectively generated by using the question-answer material, so that the field mapping expression of the question-answer pair is expanded, the accuracy of the question-answer pair semantic similarity is effectively improved, the problem of inaccurate matching under the condition that the word pairs describing the same theme are inconsistent is solved, and the recall rate is improved.
[ description of the drawings ]
FIG. 1 is a flowchart of an answer recommendation method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for creating a problem domain dictionary according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for creating a dictionary of answer fields according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an answer recommending apparatus according to a second embodiment of the present invention;
FIG. 5 is a diagram of a problem dictionary creating module according to a second embodiment of the present invention;
fig. 6 is a schematic diagram of an answer dictionary establishing module according to a second embodiment of the present invention.
[ detailed description ] embodiments
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
During the question answering process of the network interactive question answering community, expressions of the same subject in questions and answers are different according to different knowledge backgrounds of the questioners and the answerers, such as < compressed software, winrar >, < slide, PPT >, < system software, win7> and the like, and the expressions have high semantic similarity under the specific domain background although the words are different.
Exploiting this characteristic, the invention establishes a question domain dictionary and an answer domain dictionary for the words used in questions and in answers under different categories, calculates the semantic similarity between questions and answers across these different domains, and recommends answers according to the similarity results.
The first embodiment,
Fig. 1 is a flowchart of an answer recommendation method provided in this embodiment, and as shown in fig. 1, the method includes:
and S10, obtaining a question and text content of an answer corresponding to the question, and performing word segmentation to obtain a semantic unit of the question and a semantic unit of the answer.
One question may include a plurality of corresponding answers, and the text content of the question and each answer is subjected to word segmentation filtering and the like to obtain semantic units contained in the obtained question and each answer.
The text content of the question or answer can be segmented by an existing word segmentation method, such as the N-gram method, the forward maximum matching method, or the reverse maximum matching method. Taking the N-gram method as an example: unigram division yields unigram semantic units such as "text", "data", "table"; bigram division yields bigram semantic units such as "text box", "data packet", "new table"; trigram division yields trigram semantic units such as "multi-line text box", "data packet interception", "new table download"; and so on for N-gram semantic units. An N-gram semantic unit is N adjacent terms in the question or answer, i.e., N consecutive terms with no separators (characters, punctuation, or spaces) between them.
A question or answer may include content in multiple fields. For example, a question may include three fields: a title, a body, and a supplementary description; the text content of each field is extracted and segmented separately to obtain the corresponding N-gram semantic units.
For example, a user may ask the following question:
Title: "teaching computer high hand" (i.e., asking computer experts for help)
Body: "My computer restarted before the download finished; was the download wasted?"
This question includes the title "teaching computer high hand" and the body text above. Taking the title as an example, the word segmentation result includes: the unigram semantic units "teaching", "computer", "high hand"; the bigram semantic units "teaching computer", "computer high hand"; and the trigram semantic unit "teaching computer high hand".
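The N-gram division described above can be sketched as a short routine. This is an illustrative implementation, not the patent's actual code; the `ngrams` and `semantic_units` helper names and the pre-segmented term list are assumptions for demonstration.

```python
def ngrams(terms, n):
    """Return all N-element semantic units: n consecutive terms."""
    return [" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)]

def semantic_units(terms, max_n=3):
    """Collect unigram through max_n-gram semantic units of a term list."""
    units = []
    for n in range(1, max_n + 1):
        units.extend(ngrams(terms, n))
    return units

# Title example from the text, already segmented into terms
title_terms = ["teaching", "computer", "high hand"]
units = semantic_units(title_terms)
```

For a three-term title this yields three unigrams, two bigrams, and one trigram, matching the enumeration above.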
Step S20, searching the weight of the semantic unit of the question in each category by using a pre-established question domain dictionary, and calculating the theme weight of the question in each category; and searching the weight of the semantic unit of each answer in each category by using a pre-established answer field dictionary, and respectively calculating the theme weight of each answer in each category.
The question domain dictionary or the answer domain dictionary comprises semantic units and weights of the semantic units in all categories. The categories are a plurality of preset domain categories, and encyclopedic categories can be adopted, such as computer, medicine, education, maps, songs, movies and the like.
The specific process of establishing the question domain dictionary and the answer domain dictionary in advance by using the existing question-answer pair corpus will be described in detail in the following paragraphs.
And searching the weight of each semantic unit of the problem in each category by using the problem field dictionary, and summing the weights of all semantic units contained in the problem according to each category to obtain the theme weight of the problem in each category. For example, a semantic element "computer" is searched for in a problem area dictionary, and the semantic element "computer" is found to have a weight of 15 in the computer category, a weight of 30 in the education category, and a weight of 10 in the medicine category. And sequentially finding the weight of each semantic unit of the problem obtained in the step S10 in each category.
According to each category, the weights of the semantic units under that category are summed (with weighting) to obtain the topic weight of the question under each category. If the weight of a semantic unit under a category cannot be found, its weight under that category is zero. For example, if among the semantic units obtained by segmenting the question only "computer" and "high hand" have weights in the medicine category, then the sum of these weights is the topic weight of the question in the medicine category.
And similarly, searching the weight of the semantic unit of each answer in each category by using the dictionary in the answer field, and summing the weights of all the semantic units contained in the answer according to each category to obtain the theme weight of the answer in each category.
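A minimal sketch of the topic-weight lookup and summation, assuming the domain dictionary maps each semantic unit to its per-category weights; the dictionary layout and the toy values (taken from the "computer" example above) are illustrative assumptions, not the patent's data format.

```python
def topic_weights(units, domain_dict):
    """Sum, per category, the dictionary weights of every semantic unit;
    units absent from the dictionary contribute nothing (weight zero)."""
    totals = {}
    for unit in units:
        for category, w in domain_dict.get(unit, {}).items():
            totals[category] = totals.get(category, 0) + w
    return totals

# Toy question-domain dictionary using the example weights from the text
question_dict = {
    "computer": {"computer": 15, "education": 30, "medicine": 10},
    "high hand": {"medicine": 2},
}
weights = topic_weights(["teaching", "computer", "high hand"], question_dict)
```

Here "teaching" is absent from the dictionary, so it contributes nothing; the medicine topic weight is the sum of the "computer" and "high hand" weights in that category.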
And S30, respectively calculating the topic similarity of each answer and the question by using the obtained topic weight of the question and the topic weight of each answer, and recommending the answer according to the calculation result of the topic similarity.
Using the topic weights of the questions in the respective categories and the topic weights of the answers in the respective categories calculated in step S20, the topic similarity of the answers to the questions is calculated.
The topic similarity between an answer and the question may be calculated, but is not limited to, as the product of the topic weight of the question and the topic weight of the answer. Specifically, the topic similarity of the answer and the question is calculated under each category, and the maximum of these values is selected as the topic similarity of the answer and the question, that is:
sim(query, ans) = Max_j{weight(query, C_j) × weight(ans, C_j)}
where sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
After the topic weights of the question and of each answer in each category are calculated, the similarity calculation may be restricted to the topic weights of only the top 5 categories of the question and of each answer.
If the highest topic weight of the question is 0, the topic of the question cannot be determined and the topic similarity between the question and its answers cannot be calculated; in this case the existing semantic relevance measure is used to evaluate the relevance of the question-answer pair. Likewise, if the highest topic weight of an answer is 0, the topic of the answer cannot be determined, and the existing semantic relevance measure is used to evaluate the relevance of the question-answer pair.
The weights of the question and the answer in each corresponding category are multiplied to give the topic relevance for that category, and the maximum of these products is selected as the topic relevance of the answer and the question.
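The max-over-categories similarity, the top-5 restriction, and the zero-weight fallback described above can be sketched as follows; the function name and the use of `None` as the fall-back-to-semantic-relevance signal are assumptions for illustration.

```python
def topic_similarity(q_weights, a_weights, top_k=5):
    """sim(query, ans) = Max_j{weight(query, C_j) * weight(ans, C_j)},
    restricted to the question's top_k categories by topic weight.
    Returns None when the highest topic weight is 0, signalling a fall
    back to the ordinary semantic-relevance measure."""
    if not q_weights or max(q_weights.values()) == 0:
        return None
    if not a_weights or max(a_weights.values()) == 0:
        return None
    top = sorted(q_weights, key=q_weights.get, reverse=True)[:top_k]
    return max(q_weights[c] * a_weights.get(c, 0) for c in top)

q = {"computer": 45, "education": 30, "medicine": 12}
a = {"computer": 50, "medicine": 3}
sim = topic_similarity(q, a)
```

With these toy weights the maximum product comes from the shared "computer" category, as the method intends for question-answer pairs on the same topic.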
By the calculating method, the topic relevance of the question-answer pair can be calculated. As shown in table 1 below:
TABLE 1
According to the topic relevance of question-answer pairs, pairs on the same topic can be well identified and given a higher topic-similarity score. This provides an effective means of judging question-answer quality from the perspective of textual content relevance, so that more accurate answers can be recommended.
A method for creating a dictionary for question fields and a dictionary for answer fields that are created in advance will be described with reference to fig. 2 and 3.
Fig. 2 is a flowchart of a method for establishing a problem domain dictionary according to this embodiment, and as shown in fig. 2, the method specifically includes:
and S401, obtaining the contents of the questions in the question and answer corpus, and performing word segmentation to obtain semantic units of the questions.
Acquire the text content of the questions in the entire question-answer pair corpus, perform word segmentation, and filter the resulting terms (for example by removing stop words and punctuation) to obtain the semantic units of the questions. The specific processing is similar to step S10 and is not repeated here.
And S402, filtering out semantic units with word frequency lower than a preset word frequency threshold value.
In order to improve the efficiency, the semantic units are filtered based on the word frequency, and the semantic units with the word frequency lower than a preset word frequency threshold are filtered. For example, semantic units with a word frequency lower than 5 are removed.
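The word-frequency filtering step can be sketched as a small helper; this is a minimal illustration, and the helper name and default threshold are assumptions.

```python
from collections import Counter

def filter_by_frequency(units, min_freq=5):
    """Keep only semantic units whose word frequency reaches min_freq;
    lower-frequency units are filtered out to improve efficiency."""
    freq = Counter(units)
    return {u for u, c in freq.items() if c >= min_freq}

corpus_units = ["pc"] * 6 + ["rare term"] * 2
kept = filter_by_frequency(corpus_units)
```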
Of course, this step is not essential, and may not be executed when the requirement on the processing efficiency is not high.
And step S403, respectively calculating the weight of each semantic unit of the problem in each category.
The weight of the semantic unit in each category is calculated according to one or any combination of the following:
the difference of the word frequency of the semantic unit among all the categories, the word frequency of the semantic unit appearing in all the categories or the inverse word frequency of the semantic unit.
Taking the combination of all three as an example, the weight of the semantic unit in each category can be calculated, but is not limited to, as the product of the difference of the semantic unit's word frequency among the categories, its word frequency in the category, and its inverse word frequency, that is:
wherein, w (token) i ,C j ) Representing semantic Unit tokens i In class C j The weight in (1).
p ij =T ij /L j ,L j Represents a class C j The sum of the times of all semantic units contained therein, T ij Representing semantic units token i In class C j The number of occurrences in (c).
Wherein m is the number of classes.
Representing token i The word frequency of (c) is different between classes.
Representing in semantic Unit token i In class C j The word frequency in (1), n is the influence factor of the word frequency. The word frequency influence factor n can be set according to actual conditions, and the influence degree of the word frequency is adjusted, for example, n =5 is selected.
N represents the sum of the number of occurrences of all semantic units in the corpus, N (token) i ) Representing semantic units token i Number of occurrences, log (N/N (token) i ) Represent semantic Unit tokens i The inverse word frequency of (c). The inverse word frequency may also be directly used to process the inverse document rate in the corpus.
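A sketch of this weight calculation in Python. The exact forms of the inter-category difference and of the word-frequency term are not reproduced in this text, so this sketch makes two labeled assumptions: the difference is taken as max − min of the per-category frequencies, and the frequency term is a saturating T/(T + n):

```python
import math

def unit_weight(T_i, cat, class_sizes, N_total, N_i, n=5):
    """Sketch of w(token_i, C_j) = diff * tf_term * idf.

    T_i:         dict category -> T_ij, occurrences of token_i per category
    class_sizes: dict category -> L_j, total unit occurrences per category
    N_total:     N, total occurrences of all units in the corpus
    N_i:         N(token_i), total occurrences of token_i
    n:           word-frequency influence factor (the text suggests n = 5)
    """
    # p_ij = T_ij / L_j, the word frequency of token_i in each category
    p = {c: T_i.get(c, 0) / class_sizes[c] for c in class_sizes}
    diff = max(p.values()) - min(p.values())   # ASSUMED dispersion measure
    T = T_i.get(cat, 0)
    tf_term = T / (T + n)                      # ASSUMED saturating frequency term
    idf = math.log(N_total / N_i)              # inverse word frequency
    return diff * tf_term * idf
```

Under these assumptions, a unit concentrated in one category scores highest in that category, as the text intends.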
And S404, carrying out similar weight filtering on the weight of each semantic unit among each category.
To distinguish the importance of a semantic unit among the categories, after its weights in each category have been calculated, weights that fall into the same weight interval many times need to be filtered out. That is, for the same semantic unit, the weights whose number of occurrences in the same weight interval exceeds a preset threshold are filtered out.
The weight interval (e.g., [0, 10 ] interval) is set according to the weight of the semantic unit in each category. Specifically, the following methods may be employed, but are not limited to:
and determining each weight interval of the semantic units to be calculated by dividing the difference between the maximum value and the minimum value of the weight of the semantic units to be calculated in all the categories by the number of the weight intervals.
For example, a heuristic rule may be used to determine the weight intervals: if the highest weight of a semantic unit across the categories is Score_max and the lowest is Score_min, the interval length can be defined as (Score_max − Score_min) / L, where L is the preset number of weight intervals; in this embodiment L = 6. The similar-weight count threshold is set to M/2, where M is the number of categories in which the semantic unit has a weight score.
For example, suppose the weight distribution of the semantic unit "stock" over the categories is: 1: 1.65, 2: 2.32, 3: 58.62, 4: 3.12, 5: 3.62, 7: 14.82, 8: 24.31, 11: 14.85. First, the interval length is determined as (58.62 − 0) / 6 ≈ 10, so the weight intervals are [0, 10), [10, 20), and so on. "Stock" has weight scores in 8 categories, so the similar-weight count threshold is 4. The weights of "stock" in categories 1, 2, 4 and 5 all fall in the interval [0, 10), so these four weights are filtered out, leaving 3: 58.62, 7: 14.82, 8: 24.31 and 11: 14.85.
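The interval-based filtering just described can be sketched as follows (rounding the interval length and triggering at a bucket count of at least M/2 are inferred from the worked example; names are illustrative):

```python
def filter_similar_weights(weights, L=6):
    """weights: dict category -> weight of one semantic unit.

    Drops weights that crowd into a single interval at least M/2 times,
    where M is the number of categories carrying a score."""
    length = round(max(weights.values()) / L)   # example: round(58.62 / 6) = 10
    threshold = len(weights) / 2                # M / 2
    buckets = {}
    for cat, w in weights.items():
        buckets.setdefault(int(w // length), []).append(cat)
    drop = {c for cats in buckets.values() if len(cats) >= threshold for c in cats}
    return {c: w for c, w in weights.items() if c not in drop}
```

On the "stock" example this keeps exactly categories 3, 7, 8 and 11.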
It should be noted that this step may not be executed when the requirements on processing efficiency and accuracy are not high.
Step S405, filtering out semantic units that are single characters, repeated numeric strings, or numeric strings whose length exceeds a preset length threshold.
After calculating the weight of the semantic unit in each category, filtering the semantic unit, including:
the semantic unit of the single character, namely the Chinese character or the word with the length of 1, is filtered.
Semantic units with the numeric character string length exceeding a preset length threshold are filtered, for example, numeric character strings with the length larger than 10 are meaningless and are filtered.
Semantic units of the repeated numeric strings are filtered out. For example, a numeric character string with a large repetition degree (e.g., a numeric string with a repetition length of more than 4, such as 00001) is meaningless and is filtered.
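The three noise filters above can be sketched together (the thresholds follow the examples in the text; treating a run of 4 identical digits, as in 00001, as the repetition trigger is an assumption):

```python
import re

def is_noise_unit(unit, max_digits=10, min_run=4):
    """True for single characters, over-long numeric strings, or numeric
    strings containing a long run of identical digits (e.g. '00001')."""
    if len(unit) <= 1:                         # single character / word of length 1
        return True
    if unit.isdigit():
        if len(unit) > max_digits:             # numeric string longer than 10
            return True
        if re.search(r"(\d)\1{%d,}" % (min_run - 1), unit):
            return True                        # repeated run of min_run or more digits
    return False
```

Units for which is_noise_unit returns True are dropped from the dictionary.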
It should be noted that the filtering process in this step may also be performed before calculating the weight of the semantic unit in each category, and specifically may be performed before or after step S402.
And step S406, forming a question domain dictionary from the semantic units and their weights in each category.
That is, the question domain dictionary includes at least the semantic units and the weights of the semantic units in each category.
Similarly, fig. 3 is a flowchart of the method for establishing the answer domain dictionary according to this embodiment. As shown in fig. 3, the method specifically includes:
step S501, obtaining the content of the answer in the question-answer corpus, and obtaining the semantic unit of the answer by word segmentation.
And step S502, filtering out semantic units with word frequency lower than a preset word frequency threshold value.
And step S503, respectively calculating the weight of each semantic unit of the answer in each category.
Step S504, similar weight filtering is carried out on the weight of each semantic unit among all categories, and the weight with the occurrence frequency larger than a preset threshold value in the same weight interval is filtered aiming at the same semantic unit.
And step S505, filtering out the single character, the repeated number string or the semantic unit with the length of the number string exceeding a preset length threshold value.
And S506, forming an answer field dictionary by the semantic units and the weights of the semantic units in the categories.
The processing method from step S501 to step S506 is similar to that from step S401 to step S406, and is not repeated herein.
Through the establishing method, the question domain dictionary and the answer domain dictionary of each category are formed. As shown in tables 2 and 3 below.
TABLE 2
Question domain binary semantic unit | Weight  | Answer domain binary semantic unit | Weight
Text box               | 45.226  | Control terminal   | 51.5122
Shared internet access | 45.2149 | Mitnick            | 51.3074
Default gateway        | 45.1803 | Stop message       | 50.968
Data packet            | 45.1551 | Click cancellation | 50.8755
In java                | 45.1044 | Partition table    | 50.8634
Excel table            | 45.0597 | Machine dog        | 50.7862
Entering DOS           | 45.004  | Gray pigeon        | 50.533
Table 2 shows the distribution of binary semantic units of the computer category in the question domain and the answer domain. As can be seen from table 2, the question domain mainly contains binary semantic units about functions to realize or effects to achieve, while the answer domain mainly contains binary semantic units about actions to execute or technologies to apply.
TABLE 3
Question domain binary semantic unit | Weight  | Answer domain binary semantic unit | Weight
Normal value        | 45.4417 | Hepatitis B antibody           | 46.8926
Each menstruation   | 45.4238 | Coarse and shallow suggestions | 46.6657
Ovarian cyst        | 45.4168 | Liver function test            | 46.468
Pleurisy            | 45.3994 | Vaccine boosting               | 46.3076
Core of hepatitis B | 45.3889 | The fish contain               | 46.2249
Table 3 shows the distribution of binary semantic units of the medicine category in the question domain and the answer domain. As can be seen from table 3, the question domain mainly contains binary semantic units querying about disorders, while the answer domain mainly contains binary semantic units about treatment methods and advice.
By computing dictionaries for questions and answers separately, the invention better captures the domain-specific semantic units common to questions and answers. At the same time, it fully accounts for the unbalanced distribution of N-gram semantic units across the categories, and thus achieves the expected goal.
The above is a detailed description of the method provided by the present invention, and the answer recommending apparatus provided by the present invention is described in detail below.
Example two
Fig. 4 is a schematic diagram of an answer recommending apparatus provided in this embodiment. As shown in fig. 4, the apparatus includes:
the text obtaining module 10 is configured to obtain a question and text content of an answer corresponding to the question, and perform word segmentation to obtain a semantic unit of the question and a semantic unit of the answer.
One question may have a plurality of corresponding answers. The text content of the question and of each answer is segmented and filtered to obtain the semantic units contained in the question and in each answer.
The text obtaining module 10 may segment the text content of the question or answer using an existing word segmentation method, such as N-gram segmentation, forward maximum matching, or reverse maximum matching. Taking N-gram segmentation as an example: unary division yields unigram semantic units such as "text", "data", "table"; binary division yields bigram semantic units such as "text box", "data packet", "new table"; ternary division yields trigram semantic units such as "multi-line text box", "data packet interception", "new table download"; and so on up to N-gram semantic units. An N-gram semantic unit consists of N adjacent terms in the question or answer text, i.e. N consecutively occurring terms with no intervening separators such as punctuation or spaces.
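The N-gram division described above can be sketched as follows (joining adjacent terms with a space is our convention for illustration):

```python
def ngram_units(terms, n_max=3):
    """Build 1- to n_max-gram semantic units from adjacent terms."""
    units = []
    for n in range(1, n_max + 1):            # unary, binary, ternary, ...
        for i in range(len(terms) - n + 1):  # every run of n adjacent terms
            units.append(" ".join(terms[i:i + n]))
    return units
```

For example, the terms ["text", "box", "data"] yield the unigrams plus the bigrams "text box" and "box data".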
A question or an answer may comprise content in several fields. For example, a question may include three fields: a title, a body, and a supplementary description. The text content of each field is extracted and segmented separately to obtain the corresponding N-gram semantic units.
The topic weight calculation module 20 is configured to find the weight of the semantic unit of the question obtained by the text acquisition module 10 in each category by using a pre-established question domain dictionary, and calculate the topic weight of the question in each category.
The topic weight calculation module 20 is also configured to find, by using a pre-established answer domain dictionary, the weight in each category of the semantic units of each answer obtained by the text acquisition module 10, and to calculate the topic weight of each answer in each category respectively.
The question domain dictionary or the answer domain dictionary comprises semantic units and weights of the semantic units in all categories. The categories are a plurality of preset domain categories, and encyclopedic categories can be adopted, such as computer, medicine, education, maps, songs, movies and the like.
The following paragraphs will describe the device for establishing the question domain dictionary and the answer domain dictionary in advance by using the existing question and answer pair corpus.
The question domain dictionary is used to look up the weight of each semantic unit of the question in each category, and the weights of all semantic units contained in the question are summed per category to obtain the topic weight of the question in each category. For example, the semantic unit "computer" is looked up in the question domain dictionary and is found to have a weight of 15 in the computer category, a weight of 30 in the education category, and a weight of 10 in the medicine category. The weights in each category of all the semantic units obtained by the text acquisition module 10 are looked up in turn.
Then, for each category, the weights of the semantic units under that category are summed to obtain the topic weight of the question under each category. If the weight of a semantic unit under a certain category cannot be found, its weight under that category is zero. For example, if among the semantic units obtained by segmenting the question only "computer" and "high hand" have weights in the medicine category, the weights of "computer" and "high hand" are added to give the topic weight of the question in the medicine category.
Similarly, the weights of the semantic units of the answers in all the categories are found out by utilizing the answer field dictionary, and the weights of all the semantic units contained in the answers are summed according to all the categories to obtain the theme weights of the answers in all the categories.
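The per-category summation performed by the topic weight calculation module can be sketched as follows (the dictionary layout, unit -> {category: weight}, is our assumption for illustration):

```python
def topic_weights(units, domain_dict):
    """Sum the category weights of all semantic units of a question or answer.

    domain_dict maps unit -> {category: weight}; units or categories absent
    from the dictionary contribute a weight of zero."""
    totals = {}
    for u in units:
        for cat, w in domain_dict.get(u, {}).items():
            totals[cat] = totals.get(cat, 0.0) + w
    return totals
```

With the "computer" example above plus a hypothetical "high hand" entry weighted 5 in medicine, the question's medicine topic weight is 10 + 5 = 15.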
The similarity calculation module 30 is configured to calculate topic similarity between each answer and the question by using the topic weight of the question and the topic weight of each answer obtained by the topic weight calculation module 20, and recommend an answer according to a calculation result of the topic similarity.
The topic weight of the question in each category and the topic weight of the answer in each category calculated in the topic weight calculation module 20 are used to calculate the topic similarity of the answer and the question.
The topic similarity between an answer and a question may be calculated by, but is not limited to, taking the product of the topic weight of the question and the topic weight of the answer. Specifically, the topic similarity of the answer and the question is calculated for each category, and the maximum of these values is selected as the topic similarity of the answer and the question, that is:
sim(query, ans) = Max_j { weight(query, C_j) × weight(ans, C_j) }
where sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
The similarity calculation module 30 may select only the topic weights of the top 5 categories of the question and of each answer, as calculated by the topic weight calculation module 20, for the similarity calculation.
If the highest topic weight of the question is 0, the topic of the question cannot be clearly determined, and the topic similarity between the question and the answers in the question-answer pair cannot be calculated; in this case, an existing semantic relevance measure is used to evaluate the relevance of the question-answer pair.
Likewise, if the highest topic weight of an answer is 0, the topic of the answer cannot be clearly determined and its topic similarity with the question cannot be calculated; an existing semantic relevance measure is then used to evaluate the relevance of the question-answer pair.
The weights of the question and the answer in each corresponding category are multiplied to give the topic relevance for that category, and the maximum of these products is selected as the topic relevance between the answer and the question.
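The max-product rule, together with the top-5 restriction and the zero-weight fallback, can be sketched as follows (returning None to signal the fallback to a semantic-relevance measure is our convention):

```python
def topic_similarity(q_weights, a_weights, top_k=5):
    """sim(query, ans) = max_j weight(query, C_j) * weight(ans, C_j),
    restricted to the top_k categories of each side.

    Returns None when either side has no positive topic weight, signalling
    a fallback to an existing semantic-relevance measure."""
    if not q_weights or max(q_weights.values()) == 0:
        return None                       # question topic cannot be determined
    if not a_weights or max(a_weights.values()) == 0:
        return None                       # answer topic cannot be determined
    top = lambda w: dict(sorted(w.items(), key=lambda kv: -kv[1])[:top_k])
    q, a = top(q_weights), top(a_weights)
    shared = set(q) & set(a)              # categories where both have weight
    return max((q[c] * a[c] for c in shared), default=0.0)
```

For example, a question weighted {computer: 20, medicine: 5} and an answer weighted {computer: 10, education: 3} give a topic similarity of 20 × 10 = 200.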
Based on the topic relevance between questions and answers, question-answer pairs sharing the same topic can be reliably identified and given higher topic-similarity judgments. This provides an effective means of judging question-answer quality from the perspective of textual content relevance, so that more accurate answers can be recommended.
Next, a device for creating a dictionary for question fields and a dictionary for answer fields created in advance will be described with reference to fig. 5 and 6.
Fig. 5 is a schematic diagram of an apparatus for creating a problem domain dictionary provided in this embodiment, and as shown in fig. 5, the apparatus specifically includes:
the question acquisition submodule 401 is configured to acquire content of a question in a question-answer corpus, and perform word segmentation to obtain a semantic unit of the question.
Acquire the text content of the questions in the entire question-answer pair corpus, perform word segmentation, and filter the resulting terms (removing stop words, punctuation, and the like) to obtain the semantic units of the questions. The specific processing procedure is similar to that of the text acquisition module 10 and is not described again here.
And the word frequency filtering submodule 402 is configured to filter out semantic units with word frequencies lower than a preset word frequency threshold.
In order to improve efficiency, semantic units are filtered based on word frequency, and semantic units with word frequency lower than a preset word frequency threshold are filtered. For example, semantic units with a word frequency lower than 5 are removed.
Certainly, the sub-module is not a necessary sub-module, and may not be included when the requirement on the processing efficiency is not high.
A first weight calculating submodule 403, configured to calculate weights of semantic units of the problem in the categories, respectively.
The weight of the semantic unit in each category is calculated according to one or any combination of the following:
the difference of the word frequency of the semantic unit among all categories, the word frequency of the semantic unit appearing in all categories or the inverse word frequency of the semantic unit.
Combining the difference of the word frequency of the semantic unit among the categories, the word frequency of the semantic unit within the category, and the inverse word frequency of the semantic unit, the weight of the semantic unit in each category can be calculated by, but is not limited to, the product of the three factors, namely:

w(token_i, C_j) = diff(token_i) × tf(token_i, C_j) × log(N / N(token_i))

where w(token_i, C_j) denotes the weight of semantic unit token_i in category C_j.

p_ij = T_ij / L_j, where L_j denotes the total number of occurrences of all semantic units contained in category C_j, and T_ij denotes the number of occurrences of token_i in category C_j. m denotes the number of categories.

diff(token_i) denotes the difference of the word frequency of token_i among the categories.

tf(token_i, C_j) denotes the word frequency of token_i in category C_j, where n is the word-frequency influence factor. n can be set according to actual conditions to adjust the influence of the word frequency; for example, n = 5.

N denotes the total number of occurrences of all semantic units in the corpus, N(token_i) denotes the number of occurrences of token_i, and log(N / N(token_i)) is the inverse word frequency of token_i. The inverse document frequency over the corpus may also be used directly in place of the inverse word frequency.
The weight filtering sub-module 404 is configured to perform similar weight filtering on the weights of the semantic units between the categories.
To distinguish the importance of a semantic unit among the categories, after its weights in each category have been calculated, weights that fall into the same weight interval many times need to be filtered out. That is, for the same semantic unit, the weights whose number of occurrences in the same weight interval exceeds a preset threshold are filtered out.
The weight interval (e.g., [0, 10 ] interval) is set according to the weight of the semantic unit in each category. Specifically, the following methods may be employed, but are not limited to:
and determining each weight interval of the semantic units to be calculated by dividing the difference between the maximum value and the minimum value of the weight of the semantic units to be calculated in all the categories by the number of the weight intervals.
For example, a heuristic rule may be used to determine the weight intervals: if the highest weight of a semantic unit across the categories is Score_max and the lowest is Score_min, the interval length can be defined as (Score_max − Score_min) / L, where L is the preset number of weight intervals; in this embodiment L = 6. The similar-weight count threshold is set to M/2, where M is the number of categories in which the semantic unit has a weight score.
For example, suppose the weight distribution of the semantic unit "stock" over the categories is: 1: 1.65, 2: 2.32, 3: 58.62, 4: 3.12, 5: 3.62, 7: 14.82, 8: 24.31, 11: 14.85. First, the interval length is determined as (58.62 − 0) / 6 ≈ 10, so the weight intervals are [0, 10), [10, 20), and so on. "Stock" has weight scores in 8 categories, so the similar-weight count threshold is 4. The weights of "stock" in categories 1, 2, 4 and 5 all fall in the interval [0, 10), so these four weights are filtered out, leaving 3: 58.62, 7: 14.82, 8: 24.31 and 11: 14.85.
It should be noted that the sub-module may not be included when the requirements on the processing efficiency and the precision are not high.
The semantic unit filtering submodule 405 is configured to filter out a single word, a repeated number string, or a semantic unit in which the length of the number string exceeds a preset length threshold.
The semantic unit filtering submodule 405 performs filtering processing on semantic units, including:
the semantic unit of the single character, namely the Chinese character or the word with the length of 1, is filtered.
Semantic units with the numeric character string length exceeding a preset length threshold value are filtered, for example, numeric character strings with the length larger than 10 are meaningless and are filtered.
Semantic units of the repeated numeric strings are filtered out. For example, a numeric character string with a large repetition degree (e.g., a numeric string with a repetition length of more than 4, such as 00001) is meaningless and filtered.
It should be noted that the sub-module may also be disposed before the first weight calculating sub-module 403, specifically before or after the word frequency filtering sub-module 402.
And a first integration submodule 406, configured to form the question domain dictionary from the semantic units and their weights in each category. That is, the question domain dictionary includes at least the semantic units and the weights of the semantic units in each category.
Similarly, fig. 6 is a schematic diagram of an apparatus for creating a dictionary of answer fields according to this embodiment, and as shown in fig. 6, the apparatus specifically includes:
the answer obtaining sub-module 501 is configured to obtain content of an answer in a query-answer corpus, and perform word segmentation to obtain a semantic unit of the answer.
The word frequency filtering submodule 502 is configured to filter out semantic units whose word frequency is lower than a preset word frequency threshold.
The second weight calculating submodule 503 is configured to calculate the weight of each semantic unit of the answer in each category.
The weight filtering sub-module 504 is configured to perform similar weight filtering on the weights of the semantic units in the categories, and filter, for the same semantic unit, a weight whose occurrence frequency in the same weight interval is greater than a preset threshold.
And the semantic unit filtering submodule 505 is used for filtering out the semantic units of the single characters, the repeated numeric strings or the numeric strings with the length exceeding a preset length threshold value.
And a second integration submodule 506, configured to form an answer field dictionary from the semantic units and their weights in the categories.
The arrangement of the sub-modules 501 to 506 is similar to that of the sub-modules 401 to 406, and thus is not described herein again.
By the above-described creation means, a question domain dictionary and an answer domain dictionary covering each category are formed, as shown in tables 2 and 3 above.
With the answer recommendation method and apparatus provided by the invention, a question domain dictionary and an answer domain dictionary covering each category are established separately from the question-answer pair corpus. This expands the domain mapping representation of question-answer pairs, effectively improves the accuracy of question-answer semantic similarity, resolves inaccurate matching when the word pairs describing the same topic differ, and improves the recall rate. The method can be applied to answer recommendation, domain-based relevance content recommendation, and search result recommendation for all kinds of online interactive question-answering communities.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (28)

1. An answer recommendation method, comprising:
s1, obtaining a question and text content of an answer corresponding to the question, and segmenting words to obtain a semantic unit of the question and a semantic unit of the answer;
s2, searching the weight of the semantic unit of the problem in each category by using a pre-established problem domain dictionary, and calculating the theme weight of the problem in each category;
and
searching the weight of the semantic unit of each answer in each category by utilizing a pre-established answer field dictionary, and respectively calculating the theme weight of each answer in each category;
and S3, respectively calculating the topic similarity of each answer and the question by using the obtained topic weight of the question and the topic weight of each answer, and recommending the answers according to the calculation result of the topic similarity.
2. The method according to claim 1, wherein the method for establishing the problem domain dictionary specifically comprises:
acquiring the content of a question in a question-answer corpus, and segmenting words to obtain a semantic unit of the question;
respectively calculating the weight of each semantic unit of the problem in each category;
and forming a problem field dictionary by the semantic units and the weights of the semantic units in the categories.
3. The method according to claim 1, wherein the method for establishing the dictionary of the answer field specifically comprises:
acquiring the content of an answer in a question and answer pair corpus, and performing word segmentation to obtain a semantic unit of the answer;
respectively calculating the weight of each semantic unit of the answer in each category;
and forming an answer field dictionary by the semantic units and the weights of the semantic units in the categories.
4. The method according to claim 2 or 3, wherein after obtaining the semantic unit of the question or the semantic unit of the answer, further comprising:
filtering semantic units with word frequency lower than a preset word frequency threshold;
and respectively calculating the weight in each category only for the residual semantic units after filtering.
5. The method according to claim 2 or 3, wherein the weight of the semantic unit in each category is calculated according to one or any combination of the following:
the difference of the word frequency of the semantic unit among all the categories, the word frequency of the semantic unit appearing in all the categories or the inverse word frequency of the semantic unit.
6. The method of claim 5, wherein the weight of the semantic unit in each category is calculated by:
wherein w(token_i, C_j) denotes the weight of semantic unit token_i in category C_j;
p_ij = T_ij / L_j, where L_j denotes the total number of occurrences of all semantic units contained in category C_j, and T_ij denotes the number of occurrences of token_i in category C_j;
wherein m is the number of categories;
the word frequency of semantic unit token_i in category C_j, where n is the word-frequency influence factor;
N denotes the total number of occurrences of all semantic units in the corpus, and N(token_i) denotes the number of occurrences of token_i.
7. The method of claim 2, further comprising, prior to said forming each semantic unit and its weight in each category into a problem domain dictionary:
carrying out similar weight filtering on the weight of each semantic unit among each category, and filtering the weight of which the occurrence frequency in the same weight interval is greater than a preset threshold value aiming at the same semantic unit;
only the weight of the semantic units in the remaining categories is used to form the problem domain dictionary.
8. The method according to claim 3, wherein before forming the semantic units and their weights in the categories into an answer domain dictionary, further comprising:
carrying out similar weight filtering on the weight of each semantic unit among each category, and filtering the weight of which the occurrence frequency in the same weight interval is greater than a preset threshold value aiming at the same semantic unit;
only the weight of the semantic units in the remaining categories are used to form the answer field dictionary.
9. The method according to claim 7 or 8, wherein the weight interval is set according to the weight of the semantic unit in each category.
10. The method of claim 2, further comprising, prior to forming the problem domain dictionary from semantic units and their weights in categories,:
filtering out semantic units with the length of single characters, repeated numeric strings or numeric strings exceeding a preset length threshold;
and only the semantic units remaining after filtering are used for forming the problem domain dictionary.
11. The method of claim 3, further comprising, before forming the answer domain dictionary from the semantic units and their weights in the categories:
filtering out the single character, the repeated number string or the semantic unit with the length of the number string exceeding a preset length threshold; and only the semantic units remaining after filtering are used for forming an answer field dictionary.
12. The method of claim 1, wherein the calculating of the topic similarity of the answer to the question comprises:
respectively calculating the subject similarity of the answers and the questions under each category;
and selecting the maximum value of the topic similarity obtained by calculation as the topic similarity of the answer and the question.
13. The method of claim 12, wherein the similarity of the answers to the topics of the questions is calculated by:
sim(query,ans)=Max j {weight(query,C j )×weight(ans,C j )}
where sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
14. An answer recommending apparatus, comprising:
the text acquisition module is used for acquiring a question and text content of an answer corresponding to the question, and performing word segmentation to obtain a semantic unit of the question and a semantic unit of the answer;
the topic weight calculation module is used for finding out the weight of the semantic unit of the question in each category by utilizing a pre-established question domain dictionary and calculating the topic weight of the question in each category, and for finding out the weight of the semantic unit of each answer in each category by utilizing a pre-established answer domain dictionary and calculating the topic weight of each answer in each category respectively;
and the similarity calculation module is used for calculating the topic similarity of each answer and the question respectively by using the topic weight of the question and the topic weight of each answer obtained by the topic weight calculation module, and recommending the answer according to the calculation result of the topic similarity.
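Taken together, the modules of claim 14 describe a pipeline: segment the text, look up per-category weights in the two domain dictionaries, aggregate them into topic weights, and rank answers by topic similarity. A sketch under the assumption that segmentation has already produced semantic units and that each dictionary maps a unit to its per-category weights (all names here are illustrative, not the patent's):

```python
def topic_weight(units, domain_dict, category):
    """Aggregate the category weights of the semantic units found in the dictionary."""
    return sum(domain_dict.get(u, {}).get(category, 0.0) for u in units)

def recommend(question_units, answers_units, q_dict, a_dict, categories):
    """Score each answer by the max over categories of the product of its topic
    weight and the question's topic weight; return answer indices best-first."""
    q_w = {c: topic_weight(question_units, q_dict, c) for c in categories}
    scored = []
    for i, ans_units in enumerate(answers_units):
        a_w = {c: topic_weight(ans_units, a_dict, c) for c in categories}
        sim = max(q_w[c] * a_w[c] for c in categories)
        scored.append((sim, i))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [i for _, i in scored]
```

A caller would then surface the top-ranked answers as the recommendations.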
15. The apparatus according to claim 14, wherein the question domain dictionary is created in advance by a question dictionary establishing module, and the question dictionary establishing module specifically includes:
the question acquisition submodule, used for acquiring the content of questions in the question-and-answer pair corpus and performing word segmentation to obtain the semantic units of the questions;
the first weight calculation submodule, used for calculating the weight of each semantic unit of the questions in each category respectively;
and the first integration submodule, used for forming the question domain dictionary from the semantic units and their weights in the categories.
16. The apparatus according to claim 14, wherein the answer domain dictionary is established in advance by an answer dictionary establishing module, and the answer dictionary establishing module specifically includes:
the answer acquisition submodule, used for acquiring the content of answers in the question-and-answer pair corpus and performing word segmentation to obtain the semantic units of the answers;
the second weight calculation submodule, used for calculating the weight of each semantic unit of the answers in each category respectively;
and the second integration submodule, used for forming the answer domain dictionary from the semantic units and their weights in the categories.
17. The apparatus of claim 15, wherein the question dictionary establishing module further comprises:
the word frequency filtering submodule, used for filtering out semantic units whose word frequency is lower than a preset word frequency threshold;
and providing the semantic units remaining after filtering to the first weight calculation submodule.
18. The apparatus of claim 16, wherein the answer dictionary establishing module further comprises:
the word frequency filtering submodule, used for filtering out semantic units whose word frequency is lower than a preset word frequency threshold;
and providing the semantic units remaining after filtering to the second weight calculation submodule.
19. The apparatus of claim 15, wherein the first weight calculation submodule calculates the weight of a semantic unit in each category according to one of, or any combination of:
the difference in the semantic unit's word frequency across the categories, the word frequency of the semantic unit in each category, or the inverse word frequency of the semantic unit.
20. The apparatus of claim 16, wherein the second weight calculation submodule calculates the weight of a semantic unit in each category according to one of, or any combination of:
the difference in the semantic unit's word frequency across the categories, the word frequency of the semantic unit in each category, or the inverse word frequency of the semantic unit.
21. The apparatus according to claim 19 or 20, wherein the weight of a semantic unit in each category is calculated from the following quantities:
w(token_i, C_j) represents the weight of semantic unit token_i in category C_j;
p_ij = T_ij / L_j, where L_j represents the total number of occurrences of all semantic units contained in category C_j, and T_ij represents the number of occurrences of semantic unit token_i in category C_j;
m is the number of categories;
the word frequency of semantic unit token_i in category C_j, with n a word frequency influence factor;
N represents the total number of occurrences of all semantic units in the corpus, and N(token_i) represents the number of occurrences of semantic unit token_i.
22. The apparatus of claim 15, wherein the question dictionary establishing module further comprises:
the weight filtering submodule, used for performing similar-weight filtering on each semantic unit's weights across the categories: for a given semantic unit, when the number of its weights falling in the same weight interval exceeds a preset threshold, those weights are filtered out;
and only the weights of the semantic units in the remaining categories are provided to the first integration submodule for forming the question domain dictionary.
23. The apparatus of claim 16, wherein the answer dictionary establishing module further comprises:
the weight filtering submodule, used for performing similar-weight filtering on each semantic unit's weights across the categories: for a given semantic unit, when the number of its weights falling in the same weight interval exceeds a preset threshold, those weights are filtered out;
and only the weights of the semantic units in the remaining categories are provided to the second integration submodule for forming the answer domain dictionary.
24. The apparatus according to claim 22 or 23, wherein the weight interval is set according to the weight of the semantic units in each category.
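The similar-weight filtering of claims 22–24 can be sketched as binning each semantic unit's category weights into intervals and dropping overcrowded bins. The fixed interval width and the threshold below are illustrative assumptions; per claim 24 the intervals are set according to the weights themselves.

```python
def filter_similar_weights(weights, interval=0.1, max_per_interval=3):
    """weights[token][category] = weight. For each token, if more than
    max_per_interval of its category weights fall into the same weight
    interval, drop those weights: a unit weighted alike in many
    categories does not discriminate between topics."""
    filtered = {}
    for token, per_cat in weights.items():
        bins = {}
        for cat, w in per_cat.items():
            bins.setdefault(int(w / interval), []).append(cat)
        kept = {}
        for cats in bins.values():
            if len(cats) <= max_per_interval:   # interval not overcrowded
                for cat in cats:
                    kept[cat] = per_cat[cat]
        if kept:
            filtered[token] = kept
    return filtered
```

A stop-word-like unit with near-identical weights in many categories is pruned, while its genuinely distinctive weights (if any) survive.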
25. The apparatus of claim 15, wherein the question dictionary establishing module further comprises:
the semantic unit filtering submodule, used for filtering out semantic units that are single characters, repeated numeric strings, or numeric strings whose length exceeds a preset length threshold;
and providing the semantic units remaining after filtering to the first integration submodule to form the question domain dictionary.
26. The apparatus of claim 16, wherein the answer dictionary establishing module further comprises:
the semantic unit filtering submodule, used for filtering out semantic units that are single characters, repeated numeric strings, or numeric strings whose length exceeds a preset length threshold;
and providing the semantic units remaining after filtering to the second integration submodule to form the answer domain dictionary.
27. The apparatus according to claim 14, wherein the similarity calculation module calculates topic similarity of the answer and the question in each category, respectively, and selects a maximum value of the calculated topic similarity as the topic similarity of the answer and the question.
28. The apparatus according to claim 27, wherein the similarity calculation module calculates the topic similarity of the answer and the question as:
sim(query, ans) = Max_j {weight(query, C_j) × weight(ans, C_j)}
where sim(query, ans) represents the topic similarity of the answer and the question, weight(query, C_j) represents the topic weight of the question in category C_j, and weight(ans, C_j) represents the topic weight of the answer in category C_j.
CN201210151044.5A 2012-05-15 2012-05-15 Answer recommendation method and apparatus Active CN103425635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210151044.5A CN103425635B (en) 2012-05-15 2012-05-15 Answer recommendation method and apparatus


Publications (2)

Publication Number Publication Date
CN103425635A CN103425635A (en) 2013-12-04
CN103425635B true CN103425635B (en) 2018-02-02

Family

ID=49650400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210151044.5A Active CN103425635B (en) 2012-05-15 2012-05-15 Answer recommendation method and apparatus

Country Status (1)

Country Link
CN (1) CN103425635B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714488A (en) * 2014-01-03 2014-04-09 无锡清华信息科学与技术国家实验室物联网技术中心 Method for optimizing question answering platform in social network
CN105005564B (en) * 2014-04-17 2019-09-03 北京搜狗科技发展有限公司 A kind of data processing method and device based on answer platform
CN104298735B (en) * 2014-09-30 2018-06-05 北京金山安全软件有限公司 Method and device for identifying application program type
CN105786874A (en) * 2014-12-23 2016-07-20 北京奇虎科技有限公司 Method and device for constructing question-answer knowledge base data items based on encyclopedic entries
CN106294505B (en) * 2015-06-10 2020-07-07 华中师范大学 Answer feedback method and device
CN106610932A (en) * 2015-10-27 2017-05-03 中兴通讯股份有限公司 Corpus processing method and device and corpus analyzing method and device
CN105653840B (en) * 2015-12-21 2019-01-04 青岛中科慧康科技有限公司 The similar case recommender system and corresponding method shown based on words and phrases distribution table
CN105740310B (en) * 2015-12-21 2019-08-02 哈尔滨工业大学 A kind of automatic answer method of abstracting and system in question answering system
CN105786793B (en) * 2015-12-23 2019-05-28 百度在线网络技术(北京)有限公司 Parse the semantic method and apparatus of spoken language text information
CN107168967B (en) * 2016-03-07 2020-12-04 创新先进技术有限公司 Target knowledge point acquisition method and device
CN106844686A (en) * 2017-01-26 2017-06-13 武汉奇米网络科技有限公司 Intelligent customer service question and answer robot and its implementation based on SOLR
CN106997375B (en) * 2017-02-28 2020-08-18 浙江大学 Customer service reply recommendation method based on deep learning
CN106997342B (en) * 2017-03-27 2020-08-18 上海奔影网络科技有限公司 Intention identification method and device based on multi-round interaction
CN107145573A (en) * 2017-05-05 2017-09-08 上海携程国际旅行社有限公司 The problem of artificial intelligence customer service robot, answers method and system
CN107329995B (en) * 2017-06-08 2018-03-23 北京神州泰岳软件股份有限公司 A kind of controlled answer generation method of semanteme, apparatus and system
CN107844531B (en) * 2017-10-17 2020-05-22 东软集团股份有限公司 Answer output method and device and computer equipment
CN108345672A (en) * 2018-02-09 2018-07-31 平安科技(深圳)有限公司 Intelligent response method, electronic device and storage medium
CN108446320A (en) * 2018-02-09 2018-08-24 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing
CN109033318B (en) * 2018-07-18 2020-11-27 北京市农林科学院 Intelligent question and answer method and device
CN110852094B (en) * 2018-08-01 2023-11-03 北京京东尚科信息技术有限公司 Method, apparatus and computer readable storage medium for searching target
CN109299478A (en) * 2018-12-05 2019-02-01 长春理工大学 Intelligent automatic question-answering method and system based on two-way shot and long term Memory Neural Networks
CN113342950B (en) * 2021-06-04 2023-04-21 北京信息科技大学 Answer selection method and system based on semantic association

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1489089A (en) * 2002-08-19 2004-04-14 松下电器产业株式会社 Document search system and question answer system
CN1790332A (en) * 2005-12-28 2006-06-21 刘文印 Display method and system for reading and browsing problem answers
CN1928864A (en) * 2006-09-22 2007-03-14 浙江大学 FAQ based Chinese natural language ask and answer method
CN101174259A (en) * 2007-09-17 2008-05-07 张琰亮 Intelligent interactive request-answering system
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN101520802A (en) * 2009-04-13 2009-09-02 腾讯科技(深圳)有限公司 Question-answer pair quality evaluation method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080126319A1 (en) * 2006-08-25 2008-05-29 Ohad Lisral Bukai Automated short free-text scoring method and system
US20090089876A1 (en) * 2007-09-28 2009-04-02 Jamie Lynn Finamore Apparatus system and method for validating users based on fuzzy logic


Also Published As

Publication number Publication date
CN103425635A (en) 2013-12-04

Similar Documents

Publication Publication Date Title
CN103425635B (en) Answer recommendation method and apparatus
US10831769B2 (en) Search method and device for asking type query based on deep question and answer
CN106709040B (en) Application search method and server
CN110427463B (en) Search statement response method and device, server and storage medium
CN108280155B (en) Short video-based problem retrieval feedback method, device and equipment
CN105989040B (en) Intelligent question and answer method, device and system
CN108628833B (en) Method and device for determining summary of original content and method and device for recommending original content
CN107885745B (en) Song recommendation method and device
US9020805B2 (en) Context-based disambiguation of acronyms and abbreviations
CN108073568A (en) keyword extracting method and device
CN109783631B (en) Community question-answer data verification method and device, computer equipment and storage medium
CN105808590B (en) Search engine implementation method, searching method and device
CN109947952B (en) Retrieval method, device, equipment and storage medium based on English knowledge graph
CN110162768B (en) Method and device for acquiring entity relationship, computer readable medium and electronic equipment
CN110297893B (en) Natural language question-answering method, device, computer device and storage medium
JP2005122533A (en) Question-answering system and question-answering processing method
CN102332025A (en) Intelligent vertical search method and system
CN103886034A (en) Method and equipment for building indexes and matching inquiry input information of user
CN109255012B (en) Method and device for machine reading understanding and candidate data set size reduction
CN108241613A (en) Method and apparatus for extracting keywords
CN107688616A (en) Show unique fact of entity
CN109766547B (en) Sentence similarity calculation method
CN113342958B (en) Question-answer matching method, text matching model training method and related equipment
CN114416929A (en) Sample generation method, device, equipment and storage medium of entity recall model
CN110633410A (en) Information processing method and device, storage medium, and electronic device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant