CN103425635B - An answer recommendation method and apparatus - Google Patents

An answer recommendation method and apparatus

Info

Publication number
CN103425635B
CN103425635B CN201210151044.5A
Authority
CN
China
Prior art keywords
weight
answer
semantic
question
category
Prior art date
Legal status
Active
Application number
CN201210151044.5A
Other languages
Chinese (zh)
Other versions
CN103425635A (en)
Inventor
陈庆轩
梁丰
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210151044.5A priority Critical patent/CN103425635B/en
Publication of CN103425635A publication Critical patent/CN103425635A/en
Application granted granted Critical
Publication of CN103425635B publication Critical patent/CN103425635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an answer recommendation method and apparatus. The method includes: obtaining the text content of a question and of the answers corresponding to the question, and performing word segmentation to obtain the semantic units of the question and of each answer; using a pre-established question domain dictionary to look up the weight of each semantic unit of the question in each category and calculating the topic weight of the question in each category; using a pre-established answer domain dictionary to look up the weight of each semantic unit of each answer in each category and calculating the topic weight of each answer in each category; and using the obtained topic weights of the question and of each answer to calculate the topic similarity between each answer and the question, and recommending answers according to the topic similarity results. Compared with the prior art, the invention generates a question domain dictionary and an answer domain dictionary separately, which effectively improves the accuracy of question-answer semantic similarity and increases the recall rate.

Description

Answer recommendation method and device
[ technical field ]
The invention relates to the technical field of internet information processing, in particular to an answer recommendation method and device.
[ background of the invention ]
With the continuous development of information and network technologies, interactive online question-answering communities such as Baidu Knows, Sina iAsk, Google Answers, Soso Wenwen, and Yahoo! Answers are attracting increasing attention. These communities provide a platform for internet users to interact: users can freely ask questions, browse questions, answer questions, help each other, and share knowledge. As the number of participating users grows, so does the number of candidate answers, so question-answering communities generally rank the answers automatically in order to recommend preferred answers to users.
In the automatic ranking of answers, text topic analysis is currently the most common technique: the semantic relevance of a question-answer pair is analyzed to determine how well the answer satisfies the question, and the answers are then ranked automatically. Text topic analysis is mainly based on a topic model: texts are mapped into topic vectors, and each topic is represented by a word distribution, so that topic similarity between texts can be computed as similarity between topic vectors, measured for example by cosine similarity.
Most existing text topic analysis methods rest on an assumption: all texts belong to the same topic space, and each topic follows the same word distribution. However, the question and the answers in a question-answer pair may be described differently, i.e., with inconsistent wording. For example, in the computer field, the word distribution on the question side is dominated by common or colloquial computer terms such as "computer" and "operating system", while the word distribution on the answer side is dominated by professional vocabulary such as "PC" and "win7". As another example, a user may ask about a technique in a game, but the answer describes the specific technique without using the words of the question. In such cases, the semantic relevance between answer and question computed by existing methods is low, so the answer that actually matches the question either cannot be recalled or is ranked low; this reduces the accuracy of question-answer quality judgment, and the user cannot find the preferred answer.
[ summary of the invention ]
In view of the above, the present invention provides an answer recommendation method and apparatus, which respectively generate a question domain dictionary and an answer domain dictionary to expand domain mapping expressions of questions and answers in a question-answer pair, thereby effectively improving the accuracy of semantic similarity determination between the questions and the answers and increasing the recall rate.
The specific technical scheme is as follows:
an answer recommendation method, comprising the steps of:
s1, obtaining a question and text content of an answer corresponding to the question, and segmenting words to obtain a semantic unit of the question and a semantic unit of the answer;
s2, searching the weight of the semantic unit of the problem in each category by using a pre-established problem field dictionary, and calculating the theme weight of the problem in each category;
and
searching the weight of the semantic unit of each answer in each category by using a pre-established answer field dictionary, and respectively calculating the theme weight of each answer in each category;
and S3, respectively calculating the topic similarity of each answer and the question by using the obtained topic weight of the question and the topic weight of each answer, and recommending the answers according to the calculation result of the topic similarity.
According to a preferred embodiment of the present invention, the method for establishing a dictionary of problem domains specifically includes:
acquiring the content of a question in a question-answer corpus, and segmenting words to obtain a semantic unit of the question;
respectively calculating the weight of each semantic unit of the problem in each category;
and forming a problem field dictionary by the semantic units and the weights of the semantic units in the categories.
According to a preferred embodiment of the present invention, the method for establishing a dictionary in an answer field specifically includes:
acquiring the content of an answer in a question and answer pair corpus, and performing word segmentation to obtain a semantic unit of the answer;
respectively calculating the weight of each semantic unit of the answer in each category;
and forming an answer field dictionary by the semantic units and the weights of the semantic units in the categories.
According to a preferred embodiment of the present invention, after obtaining the semantic unit of the question or the semantic unit of the answer, the method further comprises:
filtering semantic units with word frequency lower than a preset word frequency threshold;
and respectively calculating the weight in each category only for the residual semantic units after filtering.
According to a preferred embodiment of the present invention, the weight of the semantic unit in each category is calculated according to one or any combination of the following:
the difference of the word frequency of the semantic unit among all the categories, the word frequency of the semantic unit appearing in all the categories or the inverse word frequency of the semantic unit.
According to a preferred embodiment of the present invention, the method for calculating the weight of the semantic unit in each category is:
wherein w(token_i, C_j) = D(token_i) × tf_n(token_i, C_j) × log(N / N(token_i)), in which w(token_i, C_j) denotes the weight of the semantic unit token_i in category C_j, and D(token_i) denotes the difference of token_i's word frequency among the categories;
p_ij = T_ij / L_j, where L_j denotes the total number of occurrences of all semantic units contained in category C_j, and T_ij denotes the number of occurrences of token_i in category C_j;
wherein m is the number of categories;
tf_n(token_i, C_j) denotes the word frequency of token_i in category C_j, where n is a word-frequency influence factor;
N denotes the total number of occurrences of all semantic units in the corpus, and N(token_i) denotes the number of occurrences of token_i, so that log(N / N(token_i)) is the inverse word frequency of token_i.
According to a preferred embodiment of the present invention, before forming the question domain dictionary or the answer domain dictionary from the semantic units and their weights in the categories, the method further comprises:
carrying out similar weight filtering on the weight of each semantic unit among each category, and filtering the weight of which the occurrence frequency in the same weight interval is greater than a preset threshold value aiming at the same semantic unit;
only the weight of the semantic unit in the remaining categories is used to form a question domain dictionary or an answer domain dictionary.
According to a preferred embodiment of the present invention, the weight interval is set according to the weight of the semantic unit in each category.
According to a preferred embodiment of the present invention, before forming the problem domain dictionary by the semantic units and their weights in the categories, the method further comprises:
filtering out semantic units with the length of single characters, repeated numeric strings or numeric strings exceeding a preset length threshold;
and only the semantic units remaining after filtering are used for forming a question domain dictionary or an answer domain dictionary.
According to a preferred embodiment of the present invention, the method for calculating the topic similarity between the answer and the question includes:
respectively calculating the subject similarity of the answers and the questions under each category;
and selecting the maximum value of the calculated theme similarity as the theme similarity of the answer and the question.
According to a preferred embodiment of the present invention, the method for calculating the topic similarity between the answer and the question comprises:
sim(query,ans)=Max j {weight(query,C j )×weight(ans,C j )}
where sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
An answer recommendation apparatus, the apparatus comprising:
the text acquisition module is used for acquiring a question and text content of an answer corresponding to the question, and performing word segmentation to obtain a semantic unit of the question and a semantic unit of the answer;
the theme weight calculation module is used for searching the weight of the semantic unit of the problem in each category by utilizing a pre-established problem field dictionary and calculating the theme weight of the problem in each category;
and
the system comprises a database, a semantic unit, a topic weight calculation module and a topic weight calculation module, wherein the database is used for storing a plurality of classes of the answers;
and the similarity calculation module is used for calculating the similarity of the topics of the questions and the answers respectively by using the topic weight of the questions and the topic weight of the answers obtained by the topic weight calculation module, and recommending the answers according to the calculation result of the topic similarity.
According to a preferred embodiment of the present invention, the problem domain dictionary is created in advance by a problem dictionary creating module, and the problem dictionary creating module specifically includes:
the question acquisition submodule is used for acquiring the content of a question in the question-answer corpus and segmenting words to obtain a semantic unit of the question;
the first weight calculation submodule is used for calculating the weight of each semantic unit of the problem in each category respectively;
and the first integration submodule is used for forming a problem field dictionary by the semantic units and the weights of the semantic units in the categories.
According to a preferred embodiment of the present invention, the answer domain dictionary is established in advance by an answer dictionary establishing module, and the answer dictionary establishing module specifically includes:
the answer obtaining submodule is used for obtaining the content of the answers in the question and answer pair corpus and obtaining the semantic units of the answers by word segmentation;
the second weight calculation submodule is used for calculating the weight of each semantic unit of the answer in each category respectively;
and the second integration submodule is used for forming an answer field dictionary by the semantic units and the weights of the semantic units in the categories.
According to a preferred embodiment of the present invention, the question dictionary creating module or the answer dictionary creating module further includes:
the word frequency filtering submodule is used for filtering the semantic units with the word frequency lower than a preset word frequency threshold;
and providing the filtered residual semantic units to the first weight calculation submodule or the second weight calculation submodule.
According to a preferred embodiment of the present invention, the first weight calculating submodule or the second weight calculating submodule calculates the weight of the semantic unit in each category according to one or any combination of the following list:
the difference of the word frequency of the semantic unit among all categories, the word frequency of the semantic unit appearing in all categories or the inverse word frequency of the semantic unit.
According to a preferred embodiment of the present invention, the method for calculating the weight of the semantic unit in each category by the first weight calculation submodule or the second weight calculation submodule is as follows:
wherein w(token_i, C_j) = D(token_i) × tf_n(token_i, C_j) × log(N / N(token_i)), in which w(token_i, C_j) denotes the weight of the semantic unit token_i in category C_j, and D(token_i) denotes the difference of token_i's word frequency among the categories;
p_ij = T_ij / L_j, where L_j denotes the total number of occurrences of all semantic units contained in category C_j, and T_ij denotes the number of occurrences of token_i in category C_j;
wherein m is the number of categories;
tf_n(token_i, C_j) denotes the word frequency of token_i in category C_j, where n is a word-frequency influence factor;
N denotes the total number of occurrences of all semantic units in the corpus, and N(token_i) denotes the number of occurrences of token_i, so that log(N / N(token_i)) is the inverse word frequency of token_i.
According to a preferred embodiment of the present invention, the question dictionary creating module or the answer dictionary creating module further includes:
the weight filtering submodule is used for carrying out similar weight filtering on the weight of each semantic unit among all categories, and filtering the weight of which the occurrence frequency is greater than a preset threshold value in the same weight interval aiming at the same semantic unit;
providing only the weight of the semantic unit in the remaining categories to the first integration submodule or the second integration submodule for forming a question domain dictionary or an answer domain dictionary.
According to a preferred embodiment of the present invention, the weight interval is set according to the weight of the semantic unit in each category.
According to a preferred embodiment of the present invention, the question dictionary creating module or the answer dictionary creating module further includes:
the semantic unit filtering submodule is used for filtering out the semantic units of single characters, repeated numeric strings or numeric strings with the length exceeding a preset length threshold;
and providing the residual semantic units after filtering to the first integration submodule or the second integration submodule to form a question domain dictionary or an answer domain dictionary.
According to a preferred embodiment of the present invention, the similarity calculation module calculates the topic similarity of the answer and the question in each category, and selects the maximum value of the calculated topic similarity as the topic similarity of the answer and the question.
According to a preferred embodiment of the present invention, the method for calculating the topic similarity between the answer and the question by the similarity calculation module comprises:
sim(query,ans)=Max j {weight(query,C j )×weight(ans,C j )}
where sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
According to the technical scheme, the question field dictionary and the answer field dictionary are respectively generated by using the question-answer material, so that the field mapping expression of the question-answer pair is expanded, the accuracy of the question-answer pair semantic similarity is effectively improved, the problem of inaccurate matching under the condition that the word pairs describing the same theme are inconsistent is solved, and the recall rate is improved.
[ description of the drawings ]
FIG. 1 is a flowchart of an answer recommendation method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for creating a problem domain dictionary according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for creating a dictionary of answer fields according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an answer recommending apparatus according to a second embodiment of the present invention;
FIG. 5 is a diagram of a problem dictionary creating module according to a second embodiment of the present invention;
fig. 6 is a schematic diagram of an answer dictionary establishing module according to a second embodiment of the present invention.
[ detailed description ] embodiments
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
During the question answering process of the network interactive question answering community, expressions of the same subject in questions and answers are different according to different knowledge backgrounds of the questioners and the answerers, such as < compressed software, winrar >, < slide, PPT >, < system software, win7> and the like, and the expressions have high semantic similarity under the specific domain background although the words are different.
Exploiting this characteristic, the invention establishes a question domain dictionary and an answer domain dictionary for the words used in questions and in answers under different categories, calculates the semantic similarity between questions and answers across these different domains, and recommends answers according to the similarity results.
The first embodiment,
Fig. 1 is a flowchart of an answer recommendation method provided in this embodiment, and as shown in fig. 1, the method includes:
and S10, obtaining a question and text content of an answer corresponding to the question, and performing word segmentation to obtain a semantic unit of the question and a semantic unit of the answer.
One question may include a plurality of corresponding answers, and the text content of the question and each answer is subjected to word segmentation filtering and the like to obtain semantic units contained in the obtained question and each answer.
The text content of the question or answer can be segmented by an existing word segmentation method, such as the N-gram method, the forward maximum matching method, or the reverse maximum matching method. Taking the N-gram method as an example: unigram division yields unigram semantic units such as "text", "data", "table"; bigram division yields bigram semantic units such as "text box", "data packet", "new table"; trigram division yields trigram semantic units such as "multi-line text box", "data packet interception", "new table download"; and so on for N-gram semantic units. An N-gram semantic unit is N adjacent terms in the question or answer, i.e., N consecutive terms with no separators (characters, punctuation, or spaces) between them.
A question or answer may include content in multiple fields. For example, a question may include three fields: a title, a body, and a supplementary description; the text content of each field is extracted and segmented separately to obtain the corresponding N-gram semantic units.
For example, a user may ask the following question:
Title: "teaching computer high hand" (i.e., asking computer experts for help)
Body: "My computer restarted before the download finished; was the download wasted?"
This question includes the title "teaching computer high hand" and the body text above. Taking the title as an example, the word segmentation result includes: the unigram semantic units "teaching", "computer", "high hand"; the bigram semantic units "teaching computer", "computer high hand"; and the trigram semantic unit "teaching computer high hand".
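The N-gram division described above can be sketched as a short routine. This is an illustrative implementation, not the patent's actual code; the `ngrams` and `semantic_units` helper names and the pre-segmented term list are assumptions for demonstration.

```python
def ngrams(terms, n):
    """Return all N-element semantic units: n consecutive terms."""
    return [" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)]

def semantic_units(terms, max_n=3):
    """Collect unigram through max_n-gram semantic units of a term list."""
    units = []
    for n in range(1, max_n + 1):
        units.extend(ngrams(terms, n))
    return units

# Title example from the text, already segmented into terms
title_terms = ["teaching", "computer", "high hand"]
units = semantic_units(title_terms)
```

For a three-term title this yields three unigrams, two bigrams, and one trigram, matching the enumeration above.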
Step S20, searching the weight of the semantic unit of the question in each category by using a pre-established question domain dictionary, and calculating the theme weight of the question in each category; and searching the weight of the semantic unit of each answer in each category by using a pre-established answer field dictionary, and respectively calculating the theme weight of each answer in each category.
The question domain dictionary or the answer domain dictionary comprises semantic units and weights of the semantic units in all categories. The categories are a plurality of preset domain categories, and encyclopedic categories can be adopted, such as computer, medicine, education, maps, songs, movies and the like.
The specific process of establishing the question domain dictionary and the answer domain dictionary in advance by using the existing question-answer pair corpus will be described in detail in the following paragraphs.
And searching the weight of each semantic unit of the problem in each category by using the problem field dictionary, and summing the weights of all semantic units contained in the problem according to each category to obtain the theme weight of the problem in each category. For example, a semantic element "computer" is searched for in a problem area dictionary, and the semantic element "computer" is found to have a weight of 15 in the computer category, a weight of 30 in the education category, and a weight of 10 in the medicine category. And sequentially finding the weight of each semantic unit of the problem obtained in the step S10 in each category.
According to each category, the weights of the semantic units under that category are summed (with weighting) to obtain the topic weight of the question under each category. If the weight of a semantic unit under a category cannot be found, its weight under that category is zero. For example, if among the semantic units obtained by segmenting the question only "computer" and "high hand" have weights in the medicine category, then the sum of these weights is the topic weight of the question in the medicine category.
And similarly, searching the weight of the semantic unit of each answer in each category by using the dictionary in the answer field, and summing the weights of all the semantic units contained in the answer according to each category to obtain the theme weight of the answer in each category.
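A minimal sketch of the topic-weight lookup and summation, assuming the domain dictionary maps each semantic unit to its per-category weights; the dictionary layout and the toy values (taken from the "computer" example above) are illustrative assumptions, not the patent's data format.

```python
def topic_weights(units, domain_dict):
    """Sum, per category, the dictionary weights of every semantic unit;
    units absent from the dictionary contribute nothing (weight zero)."""
    totals = {}
    for unit in units:
        for category, w in domain_dict.get(unit, {}).items():
            totals[category] = totals.get(category, 0) + w
    return totals

# Toy question-domain dictionary using the example weights from the text
question_dict = {
    "computer": {"computer": 15, "education": 30, "medicine": 10},
    "high hand": {"medicine": 2},
}
weights = topic_weights(["teaching", "computer", "high hand"], question_dict)
```

Here "teaching" is absent from the dictionary, so it contributes nothing; the medicine topic weight is the sum of the "computer" and "high hand" weights in that category.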
And S30, respectively calculating the topic similarity of each answer and the question by using the obtained topic weight of the question and the topic weight of each answer, and recommending the answer according to the calculation result of the topic similarity.
Using the topic weights of the questions in the respective categories and the topic weights of the answers in the respective categories calculated in step S20, the topic similarity of the answers to the questions is calculated.
The topic similarity between an answer and the question may be calculated, but is not limited to, as the product of the topic weight of the question and the topic weight of the answer. Specifically, the topic similarity of the answer and the question is calculated under each category, and the maximum of these values is selected as the topic similarity of the answer and the question, that is:
sim(query, ans) = Max_j{weight(query, C_j) × weight(ans, C_j)}
where sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
After the topic weights of the question and of each answer in each category are calculated, the similarity calculation may be restricted to the topic weights of only the top 5 categories of the question and of each answer.
If the highest topic weight of the question is 0, the topic of the question cannot be determined and the topic similarity between the question and its answers cannot be calculated; in this case the existing semantic relevance measure is used to evaluate the relevance of the question-answer pair. Likewise, if the highest topic weight of an answer is 0, the topic of the answer cannot be determined, and the existing semantic relevance measure is used to evaluate the relevance of the question-answer pair.
The weights of the question and the answer in each corresponding category are multiplied to give the topic relevance for that category, and the maximum of these products is selected as the topic relevance of the answer and the question.
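The max-over-categories similarity, the top-5 restriction, and the zero-weight fallback described above can be sketched as follows; the function name and the use of `None` as the fall-back-to-semantic-relevance signal are assumptions for illustration.

```python
def topic_similarity(q_weights, a_weights, top_k=5):
    """sim(query, ans) = Max_j{weight(query, C_j) * weight(ans, C_j)},
    restricted to the question's top_k categories by topic weight.
    Returns None when the highest topic weight is 0, signalling a fall
    back to the ordinary semantic-relevance measure."""
    if not q_weights or max(q_weights.values()) == 0:
        return None
    if not a_weights or max(a_weights.values()) == 0:
        return None
    top = sorted(q_weights, key=q_weights.get, reverse=True)[:top_k]
    return max(q_weights[c] * a_weights.get(c, 0) for c in top)

q = {"computer": 45, "education": 30, "medicine": 12}
a = {"computer": 50, "medicine": 3}
sim = topic_similarity(q, a)
```

With these toy weights the maximum product comes from the shared "computer" category, as the method intends for question-answer pairs on the same topic.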
By the calculating method, the topic relevance of the question-answer pair can be calculated. As shown in table 1 below:
TABLE 1
According to the topic relevance of question-answer pairs, pairs on the same topic can be well identified and given a higher topic-similarity score. This provides an effective means of judging question-answer quality from the perspective of textual content relevance, so that more accurate answers can be recommended.
A method for creating a dictionary for question fields and a dictionary for answer fields that are created in advance will be described with reference to fig. 2 and 3.
Fig. 2 is a flowchart of a method for establishing a problem domain dictionary according to this embodiment, and as shown in fig. 2, the method specifically includes:
and S401, obtaining the contents of the questions in the question and answer corpus, and performing word segmentation to obtain semantic units of the questions.
Acquire the text content of the questions in the entire question-answer pair corpus, perform word segmentation, and filter the resulting terms (for example by removing stop words and punctuation) to obtain the semantic units of the questions. The specific processing is similar to step S10 and is not repeated here.
And S402, filtering out semantic units with word frequency lower than a preset word frequency threshold value.
In order to improve the efficiency, the semantic units are filtered based on the word frequency, and the semantic units with the word frequency lower than a preset word frequency threshold are filtered. For example, semantic units with a word frequency lower than 5 are removed.
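The word-frequency filtering step can be sketched as a small helper; this is a minimal illustration, and the helper name and default threshold are assumptions.

```python
from collections import Counter

def filter_by_frequency(units, min_freq=5):
    """Keep only semantic units whose word frequency reaches min_freq;
    lower-frequency units are filtered out to improve efficiency."""
    freq = Counter(units)
    return {u for u, c in freq.items() if c >= min_freq}

corpus_units = ["pc"] * 6 + ["rare term"] * 2
kept = filter_by_frequency(corpus_units)
```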
Of course, this step is not essential, and may not be executed when the requirement on the processing efficiency is not high.
And step S403, respectively calculating the weight of each semantic unit of the problem in each category.
The weight of the semantic unit in each category is calculated according to one or any combination of the following:
the difference of the word frequency of the semantic unit among all the categories, the word frequency of the semantic unit appearing in all the categories or the inverse word frequency of the semantic unit.
Taking the combination of all three as an example, the weight of the semantic unit in each category can be calculated, but is not limited to, as the product of the difference of the semantic unit's word frequency among the categories, its word frequency in the category, and its inverse word frequency, that is:
wherein, w (token) i ,C j ) Representing semantic Unit tokens i In class C j The weight in (1).
p ij =T ij /L j ,L j Represents a class C j The sum of the times of all semantic units contained therein, T ij Representing semantic units token i In class C j The number of occurrences in (c).
Wherein m is the number of classes.
Representing token i The word frequency of (c) is different between classes.
Representing in semantic Unit token i In class C j The word frequency in (1), n is the influence factor of the word frequency. The word frequency influence factor n can be set according to actual conditions, and the influence degree of the word frequency is adjusted, for example, n =5 is selected.
N represents the sum of the number of occurrences of all semantic units in the corpus, N (token) i ) Representing semantic units token i Number of occurrences, log (N/N (token) i ) Represent semantic Unit tokens i The inverse word frequency of (c). The inverse word frequency may also be directly used to process the inverse document rate in the corpus.
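A sketch of this weight calculation in Python. The exact forms of the inter-category difference and of the word-frequency term are not reproduced in this text, so this sketch makes two labeled assumptions: the difference is taken as max − min of the per-category frequencies, and the frequency term is a saturating T/(T + n):

```python
import math

def unit_weight(T_i, cat, class_sizes, N_total, N_i, n=5):
    """Sketch of w(token_i, C_j) = diff * tf_term * idf.

    T_i:         dict category -> T_ij, occurrences of token_i per category
    class_sizes: dict category -> L_j, total unit occurrences per category
    N_total:     N, total occurrences of all units in the corpus
    N_i:         N(token_i), total occurrences of token_i
    n:           word-frequency influence factor (the text suggests n = 5)
    """
    # p_ij = T_ij / L_j, the word frequency of token_i in each category
    p = {c: T_i.get(c, 0) / class_sizes[c] for c in class_sizes}
    diff = max(p.values()) - min(p.values())   # ASSUMED dispersion measure
    T = T_i.get(cat, 0)
    tf_term = T / (T + n)                      # ASSUMED saturating frequency term
    idf = math.log(N_total / N_i)              # inverse word frequency
    return diff * tf_term * idf
```

Under these assumptions, a unit concentrated in one category scores highest in that category, as the text intends.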
And S404, carrying out similar weight filtering on the weight of each semantic unit among each category.
To distinguish the importance of a semantic unit among the categories, after its weights in each category have been calculated, weights that fall into the same weight interval many times need to be filtered out. That is, for the same semantic unit, the weights whose number of occurrences in the same weight interval exceeds a preset threshold are filtered out.
The weight interval (e.g., [0, 10 ] interval) is set according to the weight of the semantic unit in each category. Specifically, the following methods may be employed, but are not limited to:
and determining each weight interval of the semantic units to be calculated by dividing the difference between the maximum value and the minimum value of the weight of the semantic units to be calculated in all the categories by the number of the weight intervals.
For example, a heuristic rule may be used to determine the weight intervals: if the highest weight of a semantic unit across the categories is Score_max and the lowest is Score_min, the interval length can be defined as (Score_max − Score_min) / L, where L is the preset number of weight intervals; in this embodiment L = 6. The similar-weight count threshold is set to M/2, where M is the number of categories in which the semantic unit has a weight score.
For example, suppose the weight distribution of the semantic unit "stock" over the categories is: 1: 1.65, 2: 2.32, 3: 58.62, 4: 3.12, 5: 3.62, 7: 14.82, 8: 24.31, 11: 14.85. First, the interval length is determined as (58.62 − 0) / 6 ≈ 10, so the weight intervals are [0, 10), [10, 20), and so on. "Stock" has weight scores in 8 categories, so the similar-weight count threshold is 4. The weights of "stock" in categories 1, 2, 4 and 5 all fall in the interval [0, 10), so these four weights are filtered out, leaving 3: 58.62, 7: 14.82, 8: 24.31 and 11: 14.85.
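The interval-based filtering just described can be sketched as follows (rounding the interval length and triggering at a bucket count of at least M/2 are inferred from the worked example; names are illustrative):

```python
def filter_similar_weights(weights, L=6):
    """weights: dict category -> weight of one semantic unit.

    Drops weights that crowd into a single interval at least M/2 times,
    where M is the number of categories carrying a score."""
    length = round(max(weights.values()) / L)   # example: round(58.62 / 6) = 10
    threshold = len(weights) / 2                # M / 2
    buckets = {}
    for cat, w in weights.items():
        buckets.setdefault(int(w // length), []).append(cat)
    drop = {c for cats in buckets.values() if len(cats) >= threshold for c in cats}
    return {c: w for c, w in weights.items() if c not in drop}
```

On the "stock" example this keeps exactly categories 3, 7, 8 and 11.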
It should be noted that this step may not be executed when the requirements on processing efficiency and accuracy are not high.
Step S405, filtering out semantic units that are single characters, repeated numeric strings, or numeric strings whose length exceeds a preset length threshold.
After calculating the weight of the semantic unit in each category, filtering the semantic unit, including:
the semantic unit of the single character, namely the Chinese character or the word with the length of 1, is filtered.
Semantic units with the numeric character string length exceeding a preset length threshold are filtered, for example, numeric character strings with the length larger than 10 are meaningless and are filtered.
Semantic units of the repeated numeric strings are filtered out. For example, a numeric character string with a large repetition degree (e.g., a numeric string with a repetition length of more than 4, such as 00001) is meaningless and is filtered.
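The three noise filters above can be sketched together (the thresholds follow the examples in the text; treating a run of 4 identical digits, as in 00001, as the repetition trigger is an assumption):

```python
import re

def is_noise_unit(unit, max_digits=10, min_run=4):
    """True for single characters, over-long numeric strings, or numeric
    strings containing a long run of identical digits (e.g. '00001')."""
    if len(unit) <= 1:                         # single character / word of length 1
        return True
    if unit.isdigit():
        if len(unit) > max_digits:             # numeric string longer than 10
            return True
        if re.search(r"(\d)\1{%d,}" % (min_run - 1), unit):
            return True                        # repeated run of min_run or more digits
    return False
```

Units for which is_noise_unit returns True are dropped from the dictionary.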
It should be noted that the filtering process in this step may also be performed before calculating the weight of the semantic unit in each category, and specifically may be performed before or after step S402.
And step S406, forming a question domain dictionary from the semantic units and their weights in each category.
That is, the question domain dictionary includes at least the semantic units and the weights of the semantic units in each category.
Similarly, fig. 3 is a flowchart of the method for establishing the answer domain dictionary according to this embodiment. As shown in fig. 3, the method specifically includes:
step S501, obtaining the content of the answer in the question-answer corpus, and obtaining the semantic unit of the answer by word segmentation.
And step S502, filtering out semantic units with word frequency lower than a preset word frequency threshold value.
And step S503, respectively calculating the weight of each semantic unit of the answer in each category.
Step S504, similar weight filtering is carried out on the weight of each semantic unit among all categories, and the weight with the occurrence frequency larger than a preset threshold value in the same weight interval is filtered aiming at the same semantic unit.
And step S505, filtering out the single character, the repeated number string or the semantic unit with the length of the number string exceeding a preset length threshold value.
And S506, forming an answer field dictionary by the semantic units and the weights of the semantic units in the categories.
The processing method from step S501 to step S506 is similar to that from step S401 to step S406, and is not repeated herein.
Through the establishing method, the question domain dictionary and the answer domain dictionary of each category are formed. As shown in tables 2 and 3 below.
TABLE 2
Question domain binary semantic unit | Weight  | Answer domain binary semantic unit | Weight
Text box               | 45.226  | Control terminal   | 51.5122
Shared internet access | 45.2149 | Mitnick            | 51.3074
Default gateway        | 45.1803 | Stop message       | 50.968
Data packet            | 45.1551 | Click cancellation | 50.8755
In java                | 45.1044 | Partition table    | 50.8634
Excel table            | 45.0597 | Machine dog        | 50.7862
Entering DOS           | 45.004  | Gray pigeon        | 50.533
Table 2 shows the distribution of binary semantic units of the computer category in the question domain and the answer domain. As can be seen from table 2, the question domain mainly contains binary semantic units about functions to realize or effects to achieve, while the answer domain mainly contains binary semantic units about actions to execute or technologies to apply.
TABLE 3
Question domain binary semantic unit | Weight  | Answer domain binary semantic unit | Weight
Normal value        | 45.4417 | Hepatitis B antibody           | 46.8926
Each menstruation   | 45.4238 | Coarse and shallow suggestions | 46.6657
Ovarian cyst        | 45.4168 | Liver function test            | 46.468
Pleurisy            | 45.3994 | Vaccine boosting               | 46.3076
Core of hepatitis B | 45.3889 | The fish contain               | 46.2249
Table 3 shows the distribution of binary semantic units of the medicine category in the question domain and the answer domain. As can be seen from table 3, the question domain mainly contains binary semantic units querying about disorders, while the answer domain mainly contains binary semantic units about treatment methods and advice.
By computing dictionaries for questions and answers separately, the invention better captures the domain-specific semantic units common to questions and answers. At the same time, it fully accounts for the unbalanced distribution of N-gram semantic units across the categories, and thus achieves the expected goal.
The above is a detailed description of the method provided by the present invention, and the answer recommending apparatus provided by the present invention is described in detail below.
Example two
Fig. 4 is a schematic diagram of an answer recommending apparatus provided in this embodiment. As shown in fig. 4, the apparatus includes:
the text obtaining module 10 is configured to obtain a question and text content of an answer corresponding to the question, and perform word segmentation to obtain a semantic unit of the question and a semantic unit of the answer.
One question may have a plurality of corresponding answers. The text content of the question and of each answer is segmented and filtered to obtain the semantic units contained in the question and in each answer.
The text obtaining module 10 may segment the text content of the question or answer using an existing word segmentation method, such as N-gram segmentation, forward maximum matching, or reverse maximum matching. Taking N-gram segmentation as an example: unary division yields unigram semantic units such as "text", "data", "table"; binary division yields bigram semantic units such as "text box", "data packet", "new table"; ternary division yields trigram semantic units such as "multi-line text box", "data packet interception", "new table download"; and so on up to N-gram semantic units. An N-gram semantic unit consists of N adjacent terms in the question or answer text, i.e. N consecutively occurring terms with no intervening separators such as punctuation or spaces.
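The N-gram division described above can be sketched as follows (joining adjacent terms with a space is our convention for illustration):

```python
def ngram_units(terms, n_max=3):
    """Build 1- to n_max-gram semantic units from adjacent terms."""
    units = []
    for n in range(1, n_max + 1):            # unary, binary, ternary, ...
        for i in range(len(terms) - n + 1):  # every run of n adjacent terms
            units.append(" ".join(terms[i:i + n]))
    return units
```

For example, the terms ["text", "box", "data"] yield the unigrams plus the bigrams "text box" and "box data".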
A question or an answer may comprise content in several fields. For example, a question may include three fields: a title, a body, and a supplementary description. The text content of each field is extracted and segmented separately to obtain the corresponding N-gram semantic units.
The topic weight calculation module 20 is configured to find the weight of the semantic unit of the question obtained by the text acquisition module 10 in each category by using a pre-established question domain dictionary, and calculate the topic weight of the question in each category.
The topic weight calculation module 20 is also configured to find, by using a pre-established answer domain dictionary, the weight in each category of the semantic units of each answer obtained by the text acquisition module 10, and to calculate the topic weight of each answer in each category respectively.
The question domain dictionary or the answer domain dictionary comprises semantic units and weights of the semantic units in all categories. The categories are a plurality of preset domain categories, and encyclopedic categories can be adopted, such as computer, medicine, education, maps, songs, movies and the like.
The following paragraphs will describe the device for establishing the question domain dictionary and the answer domain dictionary in advance by using the existing question and answer pair corpus.
The question domain dictionary is used to look up the weight of each semantic unit of the question in each category, and the weights of all semantic units contained in the question are summed per category to obtain the topic weight of the question in each category. For example, the semantic unit "computer" is looked up in the question domain dictionary and is found to have a weight of 15 in the computer category, a weight of 30 in the education category, and a weight of 10 in the medicine category. The weights in each category of all the semantic units obtained by the text acquisition module 10 are looked up in turn.
Then, for each category, the weights of the semantic units under that category are summed to obtain the topic weight of the question under each category. If the weight of a semantic unit under a certain category cannot be found, its weight under that category is zero. For example, if among the semantic units obtained by segmenting the question only "computer" and "high hand" have weights in the medicine category, the weights of "computer" and "high hand" are added to give the topic weight of the question in the medicine category.
Similarly, the weights of the semantic units of the answers in all the categories are found out by utilizing the answer field dictionary, and the weights of all the semantic units contained in the answers are summed according to all the categories to obtain the theme weights of the answers in all the categories.
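The per-category summation performed by the topic weight calculation module can be sketched as follows (the dictionary layout, unit -> {category: weight}, is our assumption for illustration):

```python
def topic_weights(units, domain_dict):
    """Sum the category weights of all semantic units of a question or answer.

    domain_dict maps unit -> {category: weight}; units or categories absent
    from the dictionary contribute a weight of zero."""
    totals = {}
    for u in units:
        for cat, w in domain_dict.get(u, {}).items():
            totals[cat] = totals.get(cat, 0.0) + w
    return totals
```

With the "computer" example above plus a hypothetical "high hand" entry weighted 5 in medicine, the question's medicine topic weight is 10 + 5 = 15.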
The similarity calculation module 30 is configured to calculate topic similarity between each answer and the question by using the topic weight of the question and the topic weight of each answer obtained by the topic weight calculation module 20, and recommend an answer according to a calculation result of the topic similarity.
The topic weight of the question in each category and the topic weight of the answer in each category calculated in the topic weight calculation module 20 are used to calculate the topic similarity of the answer and the question.
The topic similarity between an answer and a question may be calculated by, but is not limited to, taking the product of the topic weight of the question and the topic weight of the answer. Specifically, the topic similarity of the answer and the question is calculated for each category, and the maximum of these values is selected as the topic similarity of the answer and the question, that is:
sim(query, ans) = Max_j { weight(query, C_j) × weight(ans, C_j) }
where sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
The similarity calculation module 30 may select only the topic weights of the top 5 categories of the question and of each answer, as calculated by the topic weight calculation module 20, for the similarity calculation.
If the highest topic weight of the question is 0, the topic of the question cannot be clearly determined, and the topic similarity between the question and the answers in the question-answer pair cannot be calculated; in this case, an existing semantic relevance measure is used to evaluate the relevance of the question-answer pair.
Likewise, if the highest topic weight of an answer is 0, the topic of the answer cannot be clearly determined and its topic similarity with the question cannot be calculated; an existing semantic relevance measure is then used to evaluate the relevance of the question-answer pair.
The weights of the question and the answer in each corresponding category are multiplied to give the topic relevance for that category, and the maximum of these products is selected as the topic relevance between the answer and the question.
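The max-product rule, together with the top-5 restriction and the zero-weight fallback, can be sketched as follows (returning None to signal the fallback to a semantic-relevance measure is our convention):

```python
def topic_similarity(q_weights, a_weights, top_k=5):
    """sim(query, ans) = max_j weight(query, C_j) * weight(ans, C_j),
    restricted to the top_k categories of each side.

    Returns None when either side has no positive topic weight, signalling
    a fallback to an existing semantic-relevance measure."""
    if not q_weights or max(q_weights.values()) == 0:
        return None                       # question topic cannot be determined
    if not a_weights or max(a_weights.values()) == 0:
        return None                       # answer topic cannot be determined
    top = lambda w: dict(sorted(w.items(), key=lambda kv: -kv[1])[:top_k])
    q, a = top(q_weights), top(a_weights)
    shared = set(q) & set(a)              # categories where both have weight
    return max((q[c] * a[c] for c in shared), default=0.0)
```

For example, a question weighted {computer: 20, medicine: 5} and an answer weighted {computer: 10, education: 3} give a topic similarity of 20 × 10 = 200.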
Based on the topic relevance between questions and answers, question-answer pairs sharing the same topic can be reliably identified and given higher topic-similarity judgments. This provides an effective means of judging question-answer quality from the perspective of textual content relevance, so that more accurate answers can be recommended.
Next, a device for creating a dictionary for question fields and a dictionary for answer fields created in advance will be described with reference to fig. 5 and 6.
Fig. 5 is a schematic diagram of an apparatus for creating a problem domain dictionary provided in this embodiment, and as shown in fig. 5, the apparatus specifically includes:
the question acquisition submodule 401 is configured to acquire content of a question in a question-answer corpus, and perform word segmentation to obtain a semantic unit of the question.
Acquire the text content of the questions in the entire question-answer pair corpus, perform word segmentation, and filter the resulting terms (removing stop words, punctuation, and the like) to obtain the semantic units of the questions. The specific processing procedure is similar to that of the text acquisition module 10 and is not described again here.
And the word frequency filtering submodule 402 is configured to filter out semantic units with word frequencies lower than a preset word frequency threshold.
In order to improve efficiency, semantic units are filtered based on word frequency, and semantic units with word frequency lower than a preset word frequency threshold are filtered. For example, semantic units with a word frequency lower than 5 are removed.
Certainly, the sub-module is not a necessary sub-module, and may not be included when the requirement on the processing efficiency is not high.
A first weight calculating submodule 403, configured to calculate weights of semantic units of the problem in the categories, respectively.
The weight of the semantic unit in each category is calculated according to one or any combination of the following:
the difference of the word frequency of the semantic unit among all categories, the word frequency of the semantic unit appearing in all categories or the inverse word frequency of the semantic unit.
Combining the difference of the word frequency of the semantic unit among the categories, the word frequency of the semantic unit within the category, and the inverse word frequency of the semantic unit, the weight of the semantic unit in each category can be calculated by, but is not limited to, the product of the three factors, namely:

w(token_i, C_j) = diff(token_i) × tf(token_i, C_j) × log(N / N(token_i))

where w(token_i, C_j) denotes the weight of semantic unit token_i in category C_j.

p_ij = T_ij / L_j, where L_j denotes the total number of occurrences of all semantic units contained in category C_j, and T_ij denotes the number of occurrences of token_i in category C_j. m denotes the number of categories.

diff(token_i) denotes the difference of the word frequency of token_i among the categories.

tf(token_i, C_j) denotes the word frequency of token_i in category C_j, where n is the word-frequency influence factor. n can be set according to actual conditions to adjust the influence of the word frequency; for example, n = 5.

N denotes the total number of occurrences of all semantic units in the corpus, N(token_i) denotes the number of occurrences of token_i, and log(N / N(token_i)) is the inverse word frequency of token_i. The inverse document frequency over the corpus may also be used directly in place of the inverse word frequency.
The weight filtering sub-module 404 is configured to perform similar weight filtering on the weights of the semantic units between the categories.
To distinguish the importance of a semantic unit among the categories, after its weights in each category have been calculated, weights that fall into the same weight interval many times need to be filtered out. That is, for the same semantic unit, the weights whose number of occurrences in the same weight interval exceeds a preset threshold are filtered out.
The weight interval (e.g., [0, 10 ] interval) is set according to the weight of the semantic unit in each category. Specifically, the following methods may be employed, but are not limited to:
and determining each weight interval of the semantic units to be calculated by dividing the difference between the maximum value and the minimum value of the weight of the semantic units to be calculated in all the categories by the number of the weight intervals.
For example, a heuristic rule may be used to determine the weight intervals: if the highest weight of a semantic unit across the categories is Score_max and the lowest is Score_min, the interval length can be defined as (Score_max − Score_min) / L, where L is the preset number of weight intervals; in this embodiment L = 6. The similar-weight count threshold is set to M/2, where M is the number of categories in which the semantic unit has a weight score.
For example, suppose the weight distribution of the semantic unit "stock" over the categories is: 1: 1.65, 2: 2.32, 3: 58.62, 4: 3.12, 5: 3.62, 7: 14.82, 8: 24.31, 11: 14.85. First, the interval length is determined as (58.62 − 0) / 6 ≈ 10, so the weight intervals are [0, 10), [10, 20), and so on. "Stock" has weight scores in 8 categories, so the similar-weight count threshold is 4. The weights of "stock" in categories 1, 2, 4 and 5 all fall in the interval [0, 10), so these four weights are filtered out, leaving 3: 58.62, 7: 14.82, 8: 24.31 and 11: 14.85.
It should be noted that the sub-module may not be included when the requirements on the processing efficiency and the precision are not high.
The semantic unit filtering submodule 405 is configured to filter out a single word, a repeated number string, or a semantic unit in which the length of the number string exceeds a preset length threshold.
The semantic unit filtering submodule 405 performs filtering processing on semantic units, including:
the semantic unit of the single character, namely the Chinese character or the word with the length of 1, is filtered.
Semantic units with the numeric character string length exceeding a preset length threshold value are filtered, for example, numeric character strings with the length larger than 10 are meaningless and are filtered.
Semantic units of the repeated numeric strings are filtered out. For example, a numeric character string with a large repetition degree (e.g., a numeric string with a repetition length of more than 4, such as 00001) is meaningless and filtered.
It should be noted that the sub-module may also be disposed before the first weight calculating sub-module 403, specifically before or after the word frequency filtering sub-module 402.
And a first integration submodule 406, configured to form the question domain dictionary from the semantic units and their weights in each category. That is, the question domain dictionary includes at least the semantic units and the weights of the semantic units in each category.
Similarly, fig. 6 is a schematic diagram of an apparatus for creating a dictionary of answer fields according to this embodiment, and as shown in fig. 6, the apparatus specifically includes:
the answer obtaining sub-module 501 is configured to obtain content of an answer in a query-answer corpus, and perform word segmentation to obtain a semantic unit of the answer.
The word frequency filtering submodule 502 is configured to filter out semantic units whose word frequency is lower than a preset word frequency threshold.
The second weight calculating submodule 503 is configured to calculate the weight of each semantic unit of the answer in each category.
The weight filtering sub-module 504 is configured to perform similar weight filtering on the weights of the semantic units in the categories, and filter, for the same semantic unit, a weight whose occurrence frequency in the same weight interval is greater than a preset threshold.
And the semantic unit filtering submodule 505 is used for filtering out the semantic units of the single characters, the repeated numeric strings or the numeric strings with the length exceeding a preset length threshold value.
And a second integration submodule 506, configured to form an answer field dictionary from the semantic units and their weights in the categories.
The arrangement of the sub-modules 501 to 506 is similar to that of the sub-modules 401 to 406, and thus is not described herein again.
By the above-described creation means, a question domain dictionary and an answer domain dictionary covering each category are formed, as shown in tables 2 and 3 above.
With the answer recommendation method and apparatus provided by the invention, a question domain dictionary and an answer domain dictionary covering each category are established separately from the question-answer pair corpus. This expands the domain mapping representation of question-answer pairs, effectively improves the accuracy of question-answer semantic similarity, resolves inaccurate matching when the word pairs describing the same topic differ, and improves the recall rate. The method can be applied to answer recommendation, domain-based relevance content recommendation, and search result recommendation for all kinds of online interactive question-answering communities.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (28)

1. An answer recommendation method, comprising:
s1, obtaining a question and text content of an answer corresponding to the question, and segmenting words to obtain a semantic unit of the question and a semantic unit of the answer;
s2, searching the weight of the semantic unit of the problem in each category by using a pre-established problem domain dictionary, and calculating the theme weight of the problem in each category;
and
searching the weight of the semantic unit of each answer in each category by utilizing a pre-established answer field dictionary, and respectively calculating the theme weight of each answer in each category;
and S3, respectively calculating the topic similarity of each answer and the question by using the obtained topic weight of the question and the topic weight of each answer, and recommending the answers according to the calculation result of the topic similarity.
2. The method according to claim 1, wherein the method for establishing the problem domain dictionary specifically comprises:
acquiring the content of a question in a question-answer corpus, and segmenting words to obtain a semantic unit of the question;
respectively calculating the weight of each semantic unit of the problem in each category;
and forming a problem field dictionary by the semantic units and the weights of the semantic units in the categories.
3. The method according to claim 1, wherein the method for establishing the dictionary of the answer field specifically comprises:
acquiring the content of an answer in a question and answer pair corpus, and performing word segmentation to obtain a semantic unit of the answer;
respectively calculating the weight of each semantic unit of the answer in each category;
and forming an answer field dictionary by the semantic units and the weights of the semantic units in the categories.
4. The method according to claim 2 or 3, wherein after obtaining the semantic unit of the question or the semantic unit of the answer, further comprising:
filtering semantic units with word frequency lower than a preset word frequency threshold;
and respectively calculating the weight in each category only for the residual semantic units after filtering.
5. The method according to claim 2 or 3, wherein the weight of the semantic unit in each category is calculated according to one or any combination of the following:
the difference of the word frequency of the semantic unit among all the categories, the word frequency of the semantic unit appearing in all the categories or the inverse word frequency of the semantic unit.
6. The method of claim 5, wherein the weight of the semantic unit in each category is calculated by:
wherein w(token_i, C_j) denotes the weight of semantic unit token_i in category C_j;
p_ij = T_ij / L_j, where L_j denotes the total number of occurrences of all semantic units contained in category C_j, and T_ij denotes the number of occurrences of token_i in category C_j;
wherein m is the number of categories;
the word frequency of semantic unit token_i in category C_j, where n is the word-frequency influence factor;
N denotes the total number of occurrences of all semantic units in the corpus, and N(token_i) denotes the number of occurrences of token_i.
7. The method of claim 2, further comprising, prior to said forming each semantic unit and its weight in each category into a problem domain dictionary:
carrying out similar weight filtering on the weight of each semantic unit among each category, and filtering the weight of which the occurrence frequency in the same weight interval is greater than a preset threshold value aiming at the same semantic unit;
only the weight of the semantic units in the remaining categories is used to form the problem domain dictionary.
8. The method according to claim 3, wherein before forming the semantic units and their weights in the categories into an answer domain dictionary, further comprising:
carrying out similar weight filtering on the weight of each semantic unit among each category, and filtering the weight of which the occurrence frequency in the same weight interval is greater than a preset threshold value aiming at the same semantic unit;
only the weight of the semantic units in the remaining categories are used to form the answer field dictionary.
9. The method according to claim 7 or 8, wherein the weight interval is set according to the weight of the semantic unit in each category.
10. The method of claim 2, further comprising, prior to forming the problem domain dictionary from semantic units and their weights in categories,:
filtering out semantic units with the length of single characters, repeated numeric strings or numeric strings exceeding a preset length threshold;
and only the semantic units remaining after filtering are used for forming the problem domain dictionary.
11. The method of claim 3, further comprising, before forming the answer domain dictionary from the semantic units and their weights in the categories:
filtering out the single character, the repeated number string or the semantic unit with the length of the number string exceeding a preset length threshold; and only the semantic units remaining after filtering are used for forming an answer field dictionary.
12. The method of claim 1, wherein the calculating of the topic similarity of the answer to the question comprises:
respectively calculating the subject similarity of the answers and the questions under each category;
and selecting the maximum value of the topic similarity obtained by calculation as the topic similarity of the answer and the question.
13. The method of claim 12, wherein the similarity of the answers to the topics of the questions is calculated by:
sim(query,ans)=Max j {weight(query,C j )×weight(ans,C j )}
where sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
14. An answer recommending apparatus, comprising:
the text acquisition module is used for acquiring a question and text content of an answer corresponding to the question, and performing word segmentation to obtain a semantic unit of the question and a semantic unit of the answer;
the topic weight calculation module is used for finding out the weight of the semantic unit of the question in each category by utilizing a pre-established question domain dictionary and calculating the topic weight of the question in each category, and for finding out the weight of the semantic unit of each answer in each category by utilizing a pre-established answer domain dictionary and calculating the topic weight of each answer in each category respectively;
and the similarity calculation module is used for calculating the topic similarity of each answer and the question respectively by using the topic weight of the question and the topic weight of each answer obtained by the topic weight calculation module, and recommending the answer according to the calculation result of the topic similarity.
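Taken together, the modules of claim 14 describe a pipeline: segment the text, look up per-category weights in the two domain dictionaries, aggregate them into topic weights, and rank answers by topic similarity. A sketch under the assumption that segmentation has already produced semantic units and that each dictionary maps a unit to its per-category weights (all names here are illustrative, not the patent's):

```python
def topic_weight(units, domain_dict, category):
    """Aggregate the category weights of the semantic units found in the dictionary."""
    return sum(domain_dict.get(u, {}).get(category, 0.0) for u in units)

def recommend(question_units, answers_units, q_dict, a_dict, categories):
    """Score each answer by the max over categories of the product of its topic
    weight and the question's topic weight; return answer indices best-first."""
    q_w = {c: topic_weight(question_units, q_dict, c) for c in categories}
    scored = []
    for i, ans_units in enumerate(answers_units):
        a_w = {c: topic_weight(ans_units, a_dict, c) for c in categories}
        sim = max(q_w[c] * a_w[c] for c in categories)
        scored.append((sim, i))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [i for _, i in scored]
```

A caller would then surface the top-ranked answers as the recommendations.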
15. The apparatus according to claim 14, wherein the question domain dictionary is created in advance by a question dictionary establishing module, and the question dictionary establishing module specifically includes:
the question acquisition submodule, used for acquiring the content of questions in the question-and-answer pair corpus and performing word segmentation to obtain the semantic units of the questions;
the first weight calculation submodule, used for calculating the weight of each semantic unit of the questions in each category respectively;
and the first integration submodule, used for forming the question domain dictionary from the semantic units and their weights in the categories.
16. The apparatus according to claim 14, wherein the answer domain dictionary is established in advance by an answer dictionary establishing module, and the answer dictionary establishing module specifically includes:
the answer acquisition submodule, used for acquiring the content of answers in the question-and-answer pair corpus and performing word segmentation to obtain the semantic units of the answers;
the second weight calculation submodule, used for calculating the weight of each semantic unit of the answers in each category respectively;
and the second integration submodule, used for forming the answer domain dictionary from the semantic units and their weights in the categories.
17. The apparatus of claim 15, wherein the question dictionary establishing module further comprises:
the word frequency filtering submodule, used for filtering out semantic units whose word frequency is lower than a preset word frequency threshold;
and providing the semantic units remaining after filtering to the first weight calculation submodule.
18. The apparatus of claim 16, wherein the answer dictionary establishing module further comprises:
the word frequency filtering submodule, used for filtering out semantic units whose word frequency is lower than a preset word frequency threshold;
and providing the semantic units remaining after filtering to the second weight calculation submodule.
19. The apparatus of claim 15, wherein the first weight calculation submodule calculates the weight of a semantic unit in each category according to one of, or any combination of:
the difference in the semantic unit's word frequency across the categories, the word frequency of the semantic unit in each category, or the inverse word frequency of the semantic unit.
20. The apparatus of claim 16, wherein the second weight calculation submodule calculates the weight of a semantic unit in each category according to one of, or any combination of:
the difference in the semantic unit's word frequency across the categories, the word frequency of the semantic unit in each category, or the inverse word frequency of the semantic unit.
21. The apparatus according to claim 19 or 20, wherein the weight of a semantic unit in each category is calculated from the following quantities:
w(token_i, C_j) represents the weight of semantic unit token_i in category C_j;
p_ij = T_ij / L_j, where L_j represents the total number of occurrences of all semantic units contained in category C_j, and T_ij represents the number of occurrences of semantic unit token_i in category C_j;
m is the number of categories;
the word frequency of semantic unit token_i in category C_j, with n a word frequency influence factor;
N represents the total number of occurrences of all semantic units in the corpus, and N(token_i) represents the number of occurrences of semantic unit token_i.
22. The apparatus of claim 15, wherein the question dictionary establishing module further comprises:
the weight filtering submodule, used for performing similar-weight filtering on each semantic unit's weights across the categories: for a given semantic unit, when the number of its weights falling in the same weight interval exceeds a preset threshold, those weights are filtered out;
and only the weights of the semantic units in the remaining categories are provided to the first integration submodule for forming the question domain dictionary.
23. The apparatus of claim 16, wherein the answer dictionary establishing module further comprises:
the weight filtering submodule, used for performing similar-weight filtering on each semantic unit's weights across the categories: for a given semantic unit, when the number of its weights falling in the same weight interval exceeds a preset threshold, those weights are filtered out;
and only the weights of the semantic units in the remaining categories are provided to the second integration submodule for forming the answer domain dictionary.
24. The apparatus according to claim 22 or 23, wherein the weight interval is set according to the weight of the semantic units in each category.
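The similar-weight filtering of claims 22–24 can be sketched as binning each semantic unit's category weights into intervals and dropping overcrowded bins. The fixed interval width and the threshold below are illustrative assumptions; per claim 24 the intervals are set according to the weights themselves.

```python
def filter_similar_weights(weights, interval=0.1, max_per_interval=3):
    """weights[token][category] = weight. For each token, if more than
    max_per_interval of its category weights fall into the same weight
    interval, drop those weights: a unit weighted alike in many
    categories does not discriminate between topics."""
    filtered = {}
    for token, per_cat in weights.items():
        bins = {}
        for cat, w in per_cat.items():
            bins.setdefault(int(w / interval), []).append(cat)
        kept = {}
        for cats in bins.values():
            if len(cats) <= max_per_interval:   # interval not overcrowded
                for cat in cats:
                    kept[cat] = per_cat[cat]
        if kept:
            filtered[token] = kept
    return filtered
```

A stop-word-like unit with near-identical weights in many categories is pruned, while its genuinely distinctive weights (if any) survive.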
25. The apparatus of claim 15, wherein the question dictionary establishing module further comprises:
the semantic unit filtering submodule, used for filtering out semantic units that are single characters, repeated numeric strings, or numeric strings whose length exceeds a preset length threshold;
and providing the semantic units remaining after filtering to the first integration submodule to form the question domain dictionary.
26. The apparatus of claim 16, wherein the answer dictionary establishing module further comprises:
the semantic unit filtering submodule, used for filtering out semantic units that are single characters, repeated numeric strings, or numeric strings whose length exceeds a preset length threshold;
and providing the semantic units remaining after filtering to the second integration submodule to form the answer domain dictionary.
27. The apparatus according to claim 14, wherein the similarity calculation module calculates topic similarity of the answer and the question in each category, respectively, and selects a maximum value of the calculated topic similarity as the topic similarity of the answer and the question.
28. The apparatus according to claim 27, wherein the similarity calculation module calculates the topic similarity of the answer and the question as:
sim(query, ans) = Max_j {weight(query, C_j) × weight(ans, C_j)}
where sim(query, ans) represents the topic similarity of the answer and the question, weight(query, C_j) represents the topic weight of the question in category C_j, and weight(ans, C_j) represents the topic weight of the answer in category C_j.
CN201210151044.5A 2012-05-15 2012-05-15 Answer recommendation method and apparatus Active CN103425635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210151044.5A CN103425635B (en) 2012-05-15 2012-05-15 Answer recommendation method and apparatus


Publications (2)

Publication Number Publication Date
CN103425635A CN103425635A (en) 2013-12-04
CN103425635B true CN103425635B (en) 2018-02-02

Family

ID=49650400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210151044.5A Active CN103425635B (en) 2012-05-15 2012-05-15 Answer recommendation method and apparatus

Country Status (1)

Country Link
CN (1) CN103425635B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714488A (en) * 2014-01-03 2014-04-09 无锡清华信息科学与技术国家实验室物联网技术中心 Method for optimizing question answering platform in social network
CN105005564B (en) * 2014-04-17 2019-09-03 北京搜狗科技发展有限公司 A kind of data processing method and device based on answer platform
CN104298735B (en) * 2014-09-30 2018-06-05 北京金山安全软件有限公司 Method and device for identifying application program type
CN105786874A (en) * 2014-12-23 2016-07-20 北京奇虎科技有限公司 Method and device for constructing question-answer knowledge base data items based on encyclopedic entries
CN106294505B (en) * 2015-06-10 2020-07-07 华中师范大学 Answer feedback method and device
CN106610932A (en) * 2015-10-27 2017-05-03 中兴通讯股份有限公司 Corpus processing method and device and corpus analyzing method and device
CN105653840B (en) * 2015-12-21 2019-01-04 青岛中科慧康科技有限公司 The similar case recommender system and corresponding method shown based on words and phrases distribution table
CN105740310B (en) * 2015-12-21 2019-08-02 哈尔滨工业大学 A kind of automatic answer method of abstracting and system in question answering system
CN105786793B (en) * 2015-12-23 2019-05-28 百度在线网络技术(北京)有限公司 Parse the semantic method and apparatus of spoken language text information
CN107168967B (en) * 2016-03-07 2020-12-04 创新先进技术有限公司 Target knowledge point acquisition method and device
CN106844686A (en) * 2017-01-26 2017-06-13 武汉奇米网络科技有限公司 Intelligent customer service question and answer robot and its implementation based on SOLR
CN106997375B (en) * 2017-02-28 2020-08-18 浙江大学 Customer service reply recommendation method based on deep learning
CN106997342B (en) * 2017-03-27 2020-08-18 上海奔影网络科技有限公司 Intention identification method and device based on multi-round interaction
CN107145573A (en) * 2017-05-05 2017-09-08 上海携程国际旅行社有限公司 The problem of artificial intelligence customer service robot, answers method and system
CN107329995B (en) * 2017-06-08 2018-03-23 北京神州泰岳软件股份有限公司 A kind of controlled answer generation method of semanteme, apparatus and system
CN107844531B (en) * 2017-10-17 2020-05-22 东软集团股份有限公司 Answer output method and device and computer equipment
CN108345672A (en) * 2018-02-09 2018-07-31 平安科技(深圳)有限公司 Intelligent response method, electronic device and storage medium
CN108446320A (en) * 2018-02-09 2018-08-24 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing
CN109033318B (en) * 2018-07-18 2020-11-27 北京市农林科学院 Intelligent question and answer method and device
CN110852094B (en) * 2018-08-01 2023-11-03 北京京东尚科信息技术有限公司 Method, apparatus and computer readable storage medium for searching target
CN109299478A (en) * 2018-12-05 2019-02-01 长春理工大学 Intelligent automatic question-answering method and system based on two-way shot and long term Memory Neural Networks
CN113342950B (en) * 2021-06-04 2023-04-21 北京信息科技大学 Answer selection method and system based on semantic association

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1489089A (en) * 2002-08-19 2004-04-14 松下电器产业株式会社 Document search system and question answer system
CN1790332A (en) * 2005-12-28 2006-06-21 刘文印 Display method and system for reading and browsing problem answers
CN1928864A (en) * 2006-09-22 2007-03-14 浙江大学 FAQ based Chinese natural language ask and answer method
CN101174259A (en) * 2007-09-17 2008-05-07 张琰亮 Intelligent interactive request-answering system
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN101520802A (en) * 2009-04-13 2009-09-02 腾讯科技(深圳)有限公司 Question-answer pair quality evaluation method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080126319A1 (en) * 2006-08-25 2008-05-29 Ohad Lisral Bukai Automated short free-text scoring method and system
US20090089876A1 (en) * 2007-09-28 2009-04-02 Jamie Lynn Finamore Apparatus system and method for validating users based on fuzzy logic


Also Published As

Publication number Publication date
CN103425635A (en) 2013-12-04

Similar Documents

Publication Publication Date Title
CN103425635B (en) Answer recommendation method and apparatus
US10831769B2 (en) Search method and device for asking type query based on deep question and answer
CN106709040B (en) Application search method and server
CN110427463B (en) Search statement response method and device, server and storage medium
CN108280155B (en) Short video-based problem retrieval feedback method, device and equipment
CN105989040B (en) Intelligent question and answer method, device and system
CN108628833B (en) Method and device for determining summary of original content and method and device for recommending original content
CN107885745B (en) Song recommendation method and device
US9020805B2 (en) Context-based disambiguation of acronyms and abbreviations
CN108073568A (en) keyword extracting method and device
CN109783631B (en) Community question-answer data verification method and device, computer equipment and storage medium
CN105808590B (en) Search engine implementation method, searching method and device
CN109947952B (en) Retrieval method, device, equipment and storage medium based on English knowledge graph
CN110162768B (en) Method and device for acquiring entity relationship, computer readable medium and electronic equipment
CN110297893B (en) Natural language question-answering method, device, computer device and storage medium
JP2005122533A (en) Question-answering system and question-answering processing method
CN102332025A (en) Intelligent vertical search method and system
CN103886034A (en) Method and equipment for building indexes and matching inquiry input information of user
CN109255012B (en) Method and device for machine reading understanding and candidate data set size reduction
CN108241613A (en) Method and apparatus for extracting keywords
CN107688616A (en) Show unique fact of entity
CN109766547B (en) Sentence similarity calculation method
CN113342958B (en) Question-answer matching method, text matching model training method and related equipment
CN114416929A (en) Sample generation method, device, equipment and storage medium of entity recall model
CN110633410A (en) Information processing method and device, storage medium, and electronic device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant