WO2015058604A1 - Apparatus and method for obtaining degree of association of question and answer pair and for search ranking optimization - Google Patents
- Publication number
- WO2015058604A1 (PCT/CN2014/086838)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- question
- answer
- word
- analyzed
- category
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
Definitions
- The present invention relates to the field of network data communication technologies, and in particular to an apparatus and method for obtaining the degree of association of a question and answer pair, an apparatus and method for optimizing the search ranking of question and answer pairs, and an apparatus and method for determining the crawl frequency of a network resource point.
- A Q&A community is a web application whose content is generated by its users.
- In its basic form, users ask questions according to their own needs and other users give answers. This form provides a new channel for users to access information on the web.
- The quality of information in Q&A communities varies widely, and they contain a large number of low-quality Q&A pairs. This makes it inconvenient for users to find information and lowers the overall quality of the Q&A community.
- Prior-art methods for judging the quality of a question and answer pair rely mainly on non-text features of the pair, which limits their versatility.
- Prior-art methods for setting the crawl frequency of a network resource point rely mainly on link analysis of websites. When such methods are used for question-and-answer search, they cannot analyze question and answer pairs semantically and cannot adjust the crawl frequency (or crawl granularity) according to the quality of the network resource point, which reduces the accuracy and versatility of search results.
- The present invention has been made to provide an apparatus and method for obtaining the degree of association of a question and answer pair, an apparatus and method for optimizing the search ranking of question and answer pairs, and an apparatus and method for determining the crawl frequency of a network resource point, which overcome or at least partially solve the above problems.
- An apparatus for obtaining the degree of association of a question and answer pair comprises: a question and answer knowledge base adapted to store a plurality of question and answer knowledge records; a word extraction unit adapted to perform a word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed; and an association degree calculation unit adapted to select at least one question and answer knowledge record from the question and answer knowledge base according to the question words and answer words to be analyzed, and to calculate the degree of association of the question and answer pair to be analyzed based on the selected question and answer knowledge records.
- An apparatus for optimizing the search ranking of question and answer pairs comprises: a question and answer knowledge base adapted to store a plurality of question and answer knowledge records; a search unit adapted to receive a user's search request and to obtain, according to that request, a plurality of question and answer pairs to be analyzed that match it; a calculation unit adapted to obtain the degree of association of each question and answer pair to be analyzed according to the question and answer knowledge base; and a search ranking unit adapted to optimize the search ranking of the question and answer pairs to be analyzed according to their degrees of association.
- An apparatus for determining the crawl frequency of a network resource point comprises: a question and answer knowledge base adapted to store a plurality of question and answer knowledge records; a resource analysis unit adapted to crawl a plurality of question and answer pairs to be analyzed from the network resource point; a calculation unit adapted to obtain the degree of association of each question and answer pair to be analyzed according to the question and answer knowledge base; and a crawl frequency determination unit adapted to determine the crawl frequency of the network resource point according to the degrees of association of the question and answer pairs to be analyzed.
- A method for obtaining the degree of association of a question and answer pair comprises the steps of: performing a word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed; and selecting at least one question and answer knowledge record from a question and answer knowledge base containing a plurality of question and answer knowledge records according to the question words and answer words to be analyzed, and calculating the degree of association of the question and answer pair to be analyzed according to the selected question and answer knowledge records.
- A method for optimizing the search ranking of question and answer pairs comprises the steps of: receiving a user's search request and obtaining, according to that request, a plurality of question and answer pairs to be analyzed that match it; obtaining the degree of association of each question and answer pair to be analyzed according to a question and answer knowledge base containing a plurality of question and answer knowledge records; and optimizing the search ranking of the question and answer pairs to be analyzed according to their degrees of association.
- A method for determining the crawl frequency of a network resource point comprises the steps of: crawling a plurality of question and answer pairs to be analyzed from the network resource point; obtaining the degree of association of each question and answer pair to be analyzed according to a question and answer knowledge base containing a plurality of question and answer knowledge records; and determining the crawl frequency of the network resource point according to the degrees of association of the question and answer pairs to be analyzed.
- Multiple question and answer pairs are extracted from webpages containing question and answer pairs, and a question and answer knowledge base containing multiple question and answer knowledge records is constructed from the extracted pairs. A word extraction operation is performed on the question content and the answer content of the question and answer pair to be analyzed to obtain at least one question word to be analyzed and at least one answer word to be analyzed. At least one question and answer knowledge record is then selected from the knowledge base according to these words, and the degree of association of the pair is calculated from the selected records. The quality of a question and answer pair can thus be evaluated at the semantic level, which addresses the prior art's limitation of evaluating only at the lexical level.
- Further, by obtaining the degree of association of each question and answer pair to be analyzed according to the question and answer knowledge base, and optimizing the search ranking of the pairs according to those degrees of association, the quality of the pairs can be evaluated at the semantic level, which addresses the poor ranking results of the prior art.
- Further, by crawling a plurality of question and answer pairs to be analyzed from a network resource point, obtaining the degree of association of each pair according to the question and answer knowledge base, and determining the crawl frequency of the network resource point according to those degrees of association, the crawl frequency can be set by evaluating the quality of the network resource point. This solves the problem that the prior art cannot select a crawl frequency according to the quality of the network resource point.
- The solution of the present application is easy to implement and highly versatile.
- FIG. 1 shows a flow chart of a method of obtaining the degree of association of a question and answer pair, in accordance with one embodiment of the present invention;
- FIG. 2 shows a detailed flow chart for building a Q&A knowledge base;
- FIG. 3 shows a schematic diagram of an explanatory model of the question and answer knowledge base obtained using the steps shown in FIG. 2;
- FIG. 4 shows a detailed flow chart of step S200 of FIG. 1;
- FIG. 5 shows a block diagram of an apparatus for obtaining the degree of association of a question and answer pair, in accordance with one embodiment of the present invention;
- FIG. 6 shows a flow chart of a method for optimizing the search ranking of question and answer pairs, in accordance with one embodiment of the present invention;
- FIG. 7 shows a block diagram of an apparatus for optimizing the search ranking of question and answer pairs, in accordance with one embodiment of the present invention;
- FIG. 8 shows a flow chart of a method of determining the crawl frequency of a network resource point, in accordance with one embodiment of the present invention;
- FIG. 9 shows a block diagram of an apparatus for determining the crawl frequency of a network resource point, in accordance with one embodiment of the present invention;
- FIG. 10 shows a block diagram of an application server for performing the method according to the invention; and
- FIG. 11 shows a storage unit for holding or carrying program code implementing the method according to the invention.
- The existing method of obtaining the degree of association of a question and answer pair uses text features and non-text features to describe the question and the answer of the pair.
- The existing method for ranking question and answer pairs in search likewise uses text features and non-text features to describe and rank the pairs, or ranks them based on the question and the answer.
- Text features mainly include visual text features (such as punctuation density, average word length, and text entropy) and text content features (such as the proportion of content words, question word density, and related-word coverage), together with features extracted by widely used automatic Chinese error detection.
- Non-text features include user weight indicators, question status, answer time, user interaction features, and so on.
- A question quality prediction model and an answer quality prediction model are learned separately on a training set, and the outputs of the two models are used to evaluate the quality of the question and answer pair.
- Of these features, only related-word coverage describes how well the question and the answer match, and it does so only at the lexical level. The semantic matching of question and answer is not considered.
- Yet the semantic matching of question and answer is precisely the core of a question and answer pair.
- For example, suppose the question is “Where is the capital of China?”, answer 1 is “Beijing”, and answer 2 is “China's capital is Shanghai”. After word segmentation and stop-word removal, the question becomes “where is the capital of China”, answer 1 becomes “Beijing”, and answer 2 becomes “China Capital Shanghai”. Measured by word overlap with the question, answer 2 covers more of the question's words than answer 1, even though answer 1 is the correct one.
- FIG. 1 shows a flow chart of a method of obtaining the degree of association of a question and answer pair, in accordance with one embodiment of the present invention.
- The method of obtaining the degree of association of a question and answer pair comprises the following steps S100 and S200:
- S100: Perform a word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed.
- The word extraction operation specifically includes: segmenting the question content and the answer content of the question and answer pair to be analyzed into words, removing stop words, merging words (word join), and extracting entity words (such as nouns and verbs). At least one question word to be analyzed is then obtained from the question content, and at least one answer word to be analyzed is obtained from the answer content.
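- As a rough illustration of step S100, the following minimal Python sketch segments text and removes stop words. The stop-word list, the regular-expression tokenizer, and the function names are assumptions made for illustration; word merging and entity-word extraction, which the text also lists, are omitted, and Chinese text would need a real word segmenter.

```python
import re

# Hypothetical stop-word list; a real system would use a language-specific one.
STOP_WORDS = {"the", "is", "of", "a", "an", "to", "and", "in"}


def extract_words(text, stop_words=STOP_WORDS):
    """Split text into lowercase tokens, drop stop words, keep first-seen order."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    words = []
    for token in tokens:
        if token not in stop_words and token not in words:
            words.append(token)
    return words


def extract_pair(question, answer):
    """Return (question words to be analyzed, answer words to be analyzed)."""
    return extract_words(question), extract_words(answer)


question_words, answer_words = extract_pair("Where is the capital of China?", "Beijing")
```

- The two returned lists play the role of the question words to be analyzed and the answer words to be analyzed that are passed on to step S200.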
- S200: Select at least one question and answer knowledge record from the question and answer knowledge base containing a plurality of question and answer knowledge records according to the question words and answer words to be analyzed, and calculate the degree of association of the question and answer pair to be analyzed according to the selected question and answer knowledge records.
- By using the question and answer knowledge base, the question content and the answer content of the pair to be analyzed can be analyzed at the semantic level to obtain the degree of association of the pair; the evaluation is more effective and easy to implement.
- The question and answer knowledge base containing a plurality of question and answer knowledge records is built in advance by extracting a plurality of question and answer pairs from webpages containing question and answer pairs and constructing records from the extracted pairs.
- When the question and answer pairs are extracted, the category corresponding to each pair is also captured.
- Each question and answer knowledge record is constructed according to a question and answer pair and the category corresponding to that pair.
- Each question and answer knowledge record in the resulting knowledge base corresponds to one category and includes a question word (QW), an answer word (AW), and the semantic relevance between the question word and the answer word.
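- One possible in-memory shape for such a record is sketched below in Python. The class name, field names, and the sample values are hypothetical; the text only specifies that a record carries a category, a question word, an answer word, and their semantic relevance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAKnowledgeRecord:
    """One knowledge record: a (QW, AW) pair scored within a single category."""
    category: str         # category the record corresponds to
    question_word: str    # QW
    answer_word: str      # AW
    relevance: float      # semantic relevance between QW and AW


# Illustrative values only, not taken from the source.
record = QAKnowledgeRecord("medical health", "cold", "cough", 0.8)
```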
- FIG. 2 shows a detailed flow chart for building a Q&A knowledge base. Specifically, it includes the following steps S310, S320, and S330:
- S310: Data may be crawled from webpages on the Internet that contain high-quality question and answer pairs, and question and answer pairs extracted from it, which helps ensure the quality of the extracted pairs;
- Webpages containing high-quality question and answer pairs include cQA (community question answering) communities, major professional forums, and the like.
- Since such webpages include the category information corresponding to each question and answer pair, the category of a pair can be captured together with the pair itself.
- S320: The word extraction operation is performed on the question content and the answer content of each question and answer pair extracted in step S310, specifically including word segmentation, removal of stop words, word merging, and extraction of entity words.
- At least one question word is obtained from the question content of each question and answer pair, at least one answer word is obtained from the answer content of each pair, and the category set $\langle C_1, \ldots, C_k, \ldots, C_p \rangle$ of the pair can be obtained.
- Step S330 in this embodiment may be performed on the massive information records obtained by applying the word extraction operation of step S320 to the massive number of question and answer pairs crawled from webpages.
- Semantic relevance obtained from massive information records is more accurate.
- Calculating the probability that the answer word belongs to the category includes:
- Calculating the degree of specificity with which each answer word explains the question word on the category includes:
- Calculating the strength with which the question word is explained by each answer word on the category specifically comprises:
- The semantic relevance of $QW_i$ and $AW_j$ on category $C_k$ is the product of the above probability, the above degree of specificity, and the strength $\mathrm{interpret}(QW_i, AW_j \mid C_k)$;
- $P(C_k)$ represents the probability of occurrence of the category $C_k$;
- $P(AW_j)$ represents the probability that the answer word is $AW_j$;
- $P(AW_j \mid C_k)$ represents the probability that the answer word is $AW_j$ within the category $C_k$;
- $\#(QW_i, AW_j)$ indicates the number of times the question word is $QW_i$ and the answer word is $AW_j$;
- $\#(AW_j)$ indicates the number of times the answer word is $AW_j$.
- In this way, question and answer knowledge records are obtained, from which the question and answer knowledge base is constructed.
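- The exact formulas are not recoverable from this text, so the Python sketch below is only one plausible reading of the three factors, built from the counts and probabilities defined above. It assumes that the category probability is estimated as the share of an answer word's occurrences falling in the category (equivalent to applying Bayes' rule to $P(AW_j \mid C_k)$, $P(C_k)$, and $P(AW_j)$), that the strength is estimated as $\#(QW_i, AW_j) / \#(AW_j)$ within the category, and that the degree of specificity is supplied by the caller (defaulting to 1.0). All function and parameter names are hypothetical.

```python
from collections import Counter


def build_relevance(info_records, specificity=lambda answer_word, category: 1.0):
    """Score each (category, QW, AW) information record.

    info_records: iterable of (category, question_word, answer_word) tuples.
    specificity:  caller-supplied stand-in for the degree of specificity
                  (an assumption; the source formula is not available).
    """
    records = list(info_records)
    aw_total = Counter(aw for _, _, aw in records)           # #(AW_j) over all categories
    aw_in_cat = Counter((c, aw) for c, _, aw in records)     # #(AW_j) within C_k
    pair_in_cat = Counter(records)                           # #(QW_i, AW_j) within C_k

    relevance = {}
    for (category, qw, aw), n_pair in pair_in_cat.items():
        # Assumed estimate of the probability that AW_j belongs to C_k.
        p_cat_given_aw = aw_in_cat[(category, aw)] / aw_total[aw]
        # Assumed estimate of the strength with which AW_j explains QW_i in C_k.
        interpret = n_pair / aw_in_cat[(category, aw)]
        relevance[(category, qw, aw)] = p_cat_given_aw * specificity(aw, category) * interpret
    return relevance


# Toy information records (illustrative only).
toy = [
    ("health", "cold", "cough"),
    ("health", "cold", "cough"),
    ("health", "fever", "cough"),
    ("sports", "cold", "cough"),
]
scores = build_relevance(toy)
```

- Each key of the returned dictionary, together with its score, corresponds to one question and answer knowledge record.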
- FIG. 3 shows a schematic diagram of an explanatory model of the question and answer knowledge base obtained using the steps shown in FIG. 2. It can be seen that for each question word $QW_i$, n question and answer knowledge records can be obtained for each category in the category set $\langle C_1, \ldots, C_k, \ldots, C_p \rangle$.
- If the calculated semantic relevance is 0, the corresponding question and answer knowledge record can be deleted. Further, if the number of question and answer knowledge records in the knowledge base is so large that the overhead of storing the records and of calculating the degree of association of the question and answer pairs to be analyzed becomes excessive, a threshold can be preset and the records whose semantic relevance is below the threshold deleted to reduce the overhead.
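- The pruning rule can be sketched in a few lines of Python. The dictionary layout (record key mapped to semantic relevance) and the function name are assumptions for illustration.

```python
def prune_records(relevance, threshold=0.0):
    """Drop records whose semantic relevance is 0 or below the preset threshold."""
    return {
        key: score
        for key, score in relevance.items()
        if score > 0 and score >= threshold
    }


# Illustrative scores only.
sample = {"r1": 0.0, "r2": 0.05, "r3": 0.4}
kept_default = prune_records(sample)
kept_strict = prune_records(sample, threshold=0.1)
```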
- FIG. 4 shows a detailed flowchart of step S200 in FIG. 1.
- Step S200 specifically includes the following steps S210, S220, and S230:
- S210: Select the question and answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed.
- A question word matches a question word to be analyzed when the two are identical or the question word to be analyzed is a substring of the question word; an answer word matches an answer word to be analyzed when the two are identical or the answer word to be analyzed is a substring of the answer word.
- A field matching or field search method can be used to select, from the question and answer knowledge base, the subset of question and answer knowledge records related to the question and answer pair to be analyzed.
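- The matching rule of step S210 can be illustrated with the following Python sketch. Records are represented as plain `(category, question_word, answer_word, relevance)` tuples, which is an assumed layout, and a linear scan stands in for whatever field search a real knowledge base would use.

```python
def word_matches(record_word, analyzed_word):
    """True if the analyzed word equals the record word or is a substring of it."""
    return bool(analyzed_word) and analyzed_word in record_word


def select_records(records, question_words, answer_words):
    """Keep records whose QW and AW both match some word to be analyzed."""
    return [
        record
        for record in records
        if any(word_matches(record[1], qw) for qw in question_words)
        and any(word_matches(record[2], aw) for aw in answer_words)
    ]


# Illustrative knowledge records: (category, QW, AW, semantic relevance).
knowledge_base = [
    ("medical health", "cold", "flu symptoms", 0.6),
    ("medical health", "cold", "surgery", 0.3),
    ("sports", "running", "cough", 0.2),
]
selected = select_records(knowledge_base, ["cold"], ["symptoms", "cough"])
```

- Here "symptoms" matches the record answer word "flu symptoms" because it is a substring of it, while the third record is rejected because its question word does not match.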
- S220: According to the selected question and answer knowledge records that correspond to the same category, obtain the degree of association of the question and answer pair to be analyzed for each category. Specifically, the semantic relevance values of the selected records corresponding to the same category are weighted and summed to obtain the degree of association of the pair for that category.
- That is, the question and answer knowledge records selected in step S210 are grouped by category, so that records corresponding to the same category fall in one group; the semantic relevance values of each group are weighted (for example, with a weight of 1 or 100) and summed to obtain the degree of association of the pair for that category. At least one degree of association is thereby obtained (in this embodiment, the number of degrees of association equals the number of categories corresponding to the question and answer pair to be analyzed).
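- The grouping and weighted summation can be sketched as follows in Python. A single uniform weight is assumed, in line with the example weights mentioned in the text; the tuple layout and function name are hypothetical.

```python
from collections import defaultdict


def association_by_category(selected_records, weight=1.0):
    """Sum weighted semantic relevance per category.

    selected_records: iterable of (category, QW, AW, relevance) tuples.
    """
    totals = defaultdict(float)
    for category, _question_word, _answer_word, relevance in selected_records:
        totals[category] += weight * relevance
    return dict(totals)


# Illustrative selected records.
selected_records = [
    ("medical health", "cold", "cough", 0.5),
    ("medical health", "cold", "flu symptoms", 0.25),
    ("sports", "cold", "cough", 0.125),
]
per_category = association_by_category(selected_records)
```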
- Figure 5 illustrates a block diagram of an apparatus for obtaining the degree of association of a question and answer pair, in accordance with one embodiment of the present invention.
- The apparatus includes a question and answer knowledge base 100, a word extraction unit 200, and an association degree calculation unit 300.
- The question and answer knowledge base 100 is adapted to store a plurality of question and answer knowledge records; in this embodiment it can be constructed by crawling a large number of question and answer pairs from webpages.
- The word extraction unit 200 is adapted to perform a word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed.
- Specifically, the word extraction unit 200 performs word segmentation, stop-word removal, word merging (word join), and extraction of entity words (for example, nouns and verbs) on the question content and the answer content of the question and answer pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
- The association degree calculation unit 300 is adapted to select at least one question and answer knowledge record from the question and answer knowledge base according to the question words and answer words to be analyzed, and to calculate the degree of association of the question and answer pair to be analyzed according to the selected records.
- Specifically, the association degree calculation unit 300 selects the question and answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed.
- A question word matches a question word to be analyzed when the two are identical or the question word to be analyzed is a substring of the question word; an answer word matches an answer word to be analyzed when the two are identical or the answer word to be analyzed is a substring of the answer word. According to the selected records that correspond to the same category, the degree of association of the pair for each category is obtained; more specifically, the semantic relevance values of the selected records of the same category are weighted (for example, with a weight of 1 or 100) and summed, giving at least one degree of association (in this embodiment, the number of degrees of association equals the number of categories corresponding to the question and answer pair to be analyzed).
- From the degrees of association obtained for the respective categories, the maximum is selected as the degree of association of the question and answer pair to be analyzed.
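- Selecting the maximum over categories is a one-line operation; the Python sketch below also returns the winning category. Returning `(None, 0.0)` when no category matched is an assumption of this sketch, as the text does not say what happens in that case.

```python
def overall_association(per_category):
    """Return (category, degree) for the largest per-category degree of association."""
    if not per_category:
        return None, 0.0  # assumed behaviour when no record was selected
    best_category = max(per_category, key=per_category.get)
    return best_category, per_category[best_category]


# Illustrative per-category degrees of association.
best = overall_association({"medical health": 0.75, "sports": 0.125})
```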
- Using the question and answer knowledge base 100, the word extraction unit 200, and the association degree calculation unit 300, at least one question and answer knowledge record is selected from the knowledge base according to the question words and answer words to be analyzed, and the degree of association of the question and answer pair to be analyzed is calculated from the selected records. The question content and the answer content of the pair are thereby analyzed at the semantic level; the evaluation is more effective and easier to implement, and the approach is more widely applicable and more versatile.
- The apparatus further includes a question and answer knowledge base construction unit 400, adapted to extract in advance a plurality of question and answer pairs from webpages containing question and answer pairs and to construct, from the extracted pairs, the question and answer knowledge base containing a plurality of question and answer knowledge records.
- The Q&A knowledge base may be pre-existing. Because the amount of information on the network keeps growing and its content changes rapidly, the content of the knowledge base often needs to be updated; adding a Q&A knowledge base construction unit 400 to build (or update) the knowledge base keeps its content timely and reliable.
- While extracting a question and answer pair, the question and answer knowledge base construction unit 400 also captures the category corresponding to the pair.
- Data may be crawled from webpages on the Internet that contain high-quality question and answer pairs, and question and answer pairs extracted from it, which helps ensure the quality of the extracted pairs; such webpages include cQA communities, major professional forums, and the like. Since these webpages include the category information corresponding to each question and answer pair, the construction unit 400 can capture the category of a pair together with the pair itself.
- The question and answer knowledge base construction unit 400 is adapted to perform the following on each question and answer pair: perform a word extraction operation on the question content and the answer content of the pair to obtain a question word set and an answer word set. Specifically, the construction unit 400 performs word segmentation, stop-word removal, word merging, and entity-word extraction on the question content and the answer content of each extracted pair to obtain the question words and the answer words.
- Each question word in the question word set and each answer word in the answer word set form an information record on each of the categories corresponding to the question and answer pair.
- The question and answer knowledge base construction unit 400 is adapted to perform the following for each information record: calculate the probability that the answer word belongs to the category, the degree of specificity with which the answer word explains the question word on the category, and the strength with which the question word is explained by the answer word on the category; then multiply the probability, the degree of specificity, and the strength, the product being the semantic relevance of the answer word and the question word.
- The question word, the answer word, and their semantic relevance form a question and answer knowledge record corresponding to the category.
- The question and answer knowledge base construction unit 400 is adapted to calculate the probability that the answer word belongs to the category according to the following method:
- The question and answer knowledge base construction unit 400 is adapted to calculate the degree of specificity with which each answer word explains the question word on the category according to the following method:
- The question and answer knowledge base construction unit 400 is adapted to calculate the strength with which the question word is explained by each answer word on the category according to the following method:
- The question and answer knowledge base construction unit 400 is adapted to multiply the above probability, degree of specificity, and strength according to the following method:
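- The semantic relevance of $QW_i$ and $AW_j$ on category $C_k$ is the product of the above probability, the above degree of specificity, and the strength $\mathrm{interpret}(QW_i, AW_j \mid C_k)$;
- $P(C_k)$ represents the probability of occurrence of the category $C_k$;
- $P(AW_j)$ represents the probability that the answer word is $AW_j$;
- $P(AW_j \mid C_k)$ represents the probability that the answer word is $AW_j$ within the category $C_k$;
- $\#(QW_i, AW_j)$ indicates the number of times the question word is $QW_i$ and the answer word is $AW_j$;
- $\#(AW_j)$ indicates the number of times the answer word is $AW_j$.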
- An example of calculating the degree of association of a question and answer pair to be analyzed is as follows:
- First, an existing Q&A knowledge base may be retrieved, or a Q&A knowledge base may be constructed by crawling Q&A pairs from cQA communities and major professional forums;
- Second, the word extraction operation is performed on the question and answer pair to be analyzed.
- The question word set to be analyzed is obtained.
- The answer word set to be analyzed is <symptoms, drugs, treatment, anti-virus, pediatric cold particles, description, dosage, cough, Chinese medicine, granules, antibiotics, amoxicillin, amoxicillin granules, granules, oral, roxithromycin, efficacy>, and the category of the question and answer pair to be analyzed is “medical health”;
- Third, a plurality of question and answer knowledge records matching the question words and answer words to be analyzed are selected from the question and answer knowledge base, yielding the following answer words and semantic relevance values (for readability, the semantic relevance values in the table have been normalized):
- The Q&A knowledge records whose answer words match the answer words to be analyzed are selected, and the semantic relevance of the selected records is obtained.
- In this example, the answer words to be analyzed that match answer words in the Q&A knowledge records include: <Oral, Kechuan, Pediatric cold particles, examination, cough, treatment, flu symptoms, cold particles>.
- Finally, the degree of association of the question and answer pair to be analyzed is calculated; here it reaches 0.9 (on a scale in which the degree of association ranges from 0 to 1).
- FIG. 6 shows a flow chart of a method of optimizing a search ranking of a question and answer pair, in accordance with one embodiment of the present invention.
- the method includes the following steps S610, S620, and S630:
- S610 Receive a search request of the user, and obtain a plurality of question and answer pairs to be analyzed that match the search request according to the search request of the user.
- the network search technology may be used, for example, using a question and answer pair search engine to obtain a question and answer pair to be analyzed according to the user's search request.
- S620 Obtain an association degree of each question and answer pair to be analyzed according to a Q&A knowledge base including a plurality of Q&A knowledge records.
- the question content and the answer content of each question and answer pair may be analyzed semantically by using the question and answer knowledge base to obtain the degree of association of the question and answer pair to be analyzed; the evaluation is more effective and easy to implement.
- step S620 of this embodiment is substantially the same as the method of obtaining the degree of association of a question and answer pair as shown in FIG. and is not repeated here.
- the question and answer knowledge base including a plurality of question and answer knowledge records is obtained by extracting a plurality of question and answer pairs from a webpage having a question and answer pair in advance, and constructing according to the extracted question and answer pairs.
- the category corresponding to the question and answer pair is captured.
- the question and answer knowledge record is constructed according to the question and answer pair and the category corresponding to the question and answer pair.
- Each question and answer knowledge record in the obtained question and answer knowledge base corresponds to a category, which includes a question word (QW), an answer word (AW), and a semantic relevance between the question word and the answer word.
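The record layout described above can be sketched as a small data structure. This is only an illustration: the class name, field names, and the sample values are assumptions, not taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAKnowledgeRecord:
    """One question and answer knowledge record (hypothetical layout)."""
    category: str              # category the record corresponds to
    question_word: str         # QW
    answer_word: str           # AW
    semantic_relevance: float  # relevance between QW and AW


# Illustrative values only; the source does not give concrete records.
record = QAKnowledgeRecord("medical health", "cold", "cough", 0.8)
```

Keeping the category on every record is what later allows relevance values to be accumulated per category.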
- the semantic relevance between the question words and the answer words of the question and answer knowledge records can be obtained by learning from massive information; by using information extracted from web pages to build the question and answer knowledge base, the scope of application is broader and the method is more versatile.
- the method of the embodiment further includes the step of constructing the question and answer knowledge base, and the process of constructing the question and answer knowledge base is substantially the same as the process shown in FIG. 2; the interpretation model of the question and answer knowledge base of the present embodiment is substantially the same as the interpretation model shown in FIG. These are not repeated here.
- the search ranking of the question and answer pair to be analyzed can be optimized by using the degree of association, and the ranking effect is better.
- the specific method may be to rank the question and answer pairs to be analyzed in descending order of their degree of association, that is, a question and answer pair with a higher degree of association is ranked earlier; alternatively, the websites to which the question and answer pairs to be analyzed belong may first be preliminarily ranked using a search ranking technique, and the search ranking of each question and answer pair to be analyzed is then calculated from the sequence number of the preliminary ranking and the degree of association of that pair, for example, by multiplying the degree of association of the question and answer pair to be analyzed by the preliminary ranking of the website to which it belongs and using the order of the products as the search ranking of the question and answer pairs to be analyzed. By combining the quality of each question and answer pair with the ranking of the website to which it belongs when sorting the question and answer pairs to be analyzed, users obtain better-sorted results when searching for questions and answers.
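The two ranking options above can be sketched as follows. The function names and input shapes are hypothetical. The source does not say how the preliminary sequence number is scaled before multiplication, so the second function assumes it has already been converted into a site score where a higher value means a better-ranked website.

```python
def rank_by_association(pairs):
    """Order (pair_id, degree) tuples by degree of association, highest first."""
    return sorted(pairs, key=lambda p: p[1], reverse=True)


def rank_with_site_score(pairs, site_scores):
    """Order (pair_id, site, degree) tuples by degree * site score.

    site_scores maps a website to an assumed preliminary score
    (higher is better); this scaling is not specified by the source.
    """
    scored = [(pair_id, degree * site_scores[site])
              for pair_id, site, degree in pairs]
    return sorted(scored, key=lambda s: s[1], reverse=True)


by_degree = rank_by_association([("b", 0.5), ("a", 0.9)])
combined = rank_with_site_score(
    [("a", "s1", 0.9), ("b", "s2", 0.5), ("c", "s1", 0.4)],
    {"s1": 0.5, "s2": 1.0},
)
```

With these sample inputs, the combined ranking places "b" ahead of "a" because its website score outweighs its lower degree of association.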
- the device includes a question and answer knowledge base 710, a search unit 720, a calculation unit 730, and a search ranking unit 740.
- the question and answer knowledge base 710 is adapted to store a plurality of question and answer knowledge records.
- the question and answer knowledge base 710 of the present embodiment can be constructed by crawling a massive question and answer pair in a web page.
- the searching unit 720 is adapted to receive a search request of the user, and obtain a plurality of question and answer pairs to be analyzed that match the search request according to the search request of the user.
- the search unit 720 may be a question and answer pair search engine that obtains the question and answer pairs to be analyzed according to the user's search request; for example, the search unit 720 is a web search engine for question and answer search, which receives the search request entered by the user through a browser and obtains the question and answer pairs to be analyzed.
- the calculating unit 730 is adapted to obtain the degree of association of each question and answer pair to be analyzed according to the question and answer knowledge base 710.
- the calculation unit 730 of the present invention can analyze the question content and the answer content of the question and answer pair to be analyzed semantically by using the question and answer knowledge base to obtain the degree of association of the question and answer pair to be analyzed; the evaluation is more effective and easy to implement.
- the question and answer knowledge base 710 is constructed from a large number of high-quality question and answer pairs extracted from web pages and includes a plurality of question and answer knowledge records, so that the semantic relevance between the question words and the answer words of the question and answer knowledge records can be obtained by learning from massive information.
- the search ranking unit 740 is adapted to optimize the search ranking of the question and answer pair to be analyzed according to the degree of association of the question and answer pairs to be analyzed.
- the specific method may be to rank the question and answer pairs to be analyzed in descending order of their degree of association, that is, a question and answer pair with a higher degree of association is ranked earlier; alternatively, the websites to which the question and answer pairs to be analyzed belong may first be preliminarily ranked using a search ranking technique, and the search ranking of each question and answer pair to be analyzed is then calculated from the sequence number of the preliminary ranking and the degree of association of that pair, for example, by multiplying the degree of association of the question and answer pair to be analyzed by the preliminary ranking of the website to which it belongs and using the order of the products as the search ranking of the question and answer pairs to be analyzed.
- the apparatus further includes a question and answer knowledge base construction unit 750, which is adapted to extract a plurality of question and answer pairs in advance from web pages containing question and answer pairs, and to construct, according to the extracted question and answer pairs, a question and answer knowledge base including a plurality of question and answer knowledge records.
- in the device shown in FIG. 7, the question and answer knowledge base 710 already exists. Because the volume of information on the actual network keeps growing and its content changes rapidly, the content of the question and answer knowledge base 710 often needs to be updated.
- having the question and answer knowledge base construction unit 750 construct (or update) the question and answer knowledge base 710 ensures that its content is timely and reliable.
- the question and answer knowledge base construction unit 750 of the present embodiment is the same as the question and answer knowledge base construction unit 400 shown in FIG. 5, and the description thereof will not be repeated here.
- the calculation unit 730 in FIG. 7 specifically includes a word extraction subunit and an association degree calculation subunit (not shown).
- the word extraction subunit is adapted to perform the word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, and obtain at least one question word to be analyzed and at least one answer word to be analyzed.
- the word extraction subunit is adapted to perform word segmentation, stop-word removal, word joining, and entity-word extraction (e.g., nouns, verbs, etc.) on the question content and the answer content of the question and answer pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
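The word extraction steps can be illustrated with a minimal sketch for English text. Everything here is an assumption made for illustration: the stop-word set, the phrase lexicon used for word joining, and the regular-expression tokenizer are hypothetical, and the entity-word step (keeping nouns and verbs) is omitted because it would need a part-of-speech tagger that the source does not name.

```python
import re

# Hypothetical resources; a real system would use proper lexicons.
STOP_WORDS = {"the", "a", "is", "my", "has", "should", "i", "take", "to"}
PHRASES = {("cold", "granules"): "cold granules"}  # word-join lexicon


def extract_words(text):
    """Segment text, drop stop words, then join known adjacent word pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())          # segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    words = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:                               # word joining
            words.append(PHRASES[pair])
            i += 2
        else:
            words.append(tokens[i])
            i += 1
    return words


question_words = extract_words("My child has a cold, should I take cold granules?")
```

The same function would be applied separately to the question content and the answer content to obtain the two word sets.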
- the association degree calculation subunit is adapted to select at least one question and answer knowledge record from the question and answer knowledge base according to the question word to be analyzed and the answer word to be analyzed, and to calculate the degree of association of the question and answer pair to be analyzed according to the selected question and answer knowledge record.
- the association degree calculation subunit is adapted to select the question and answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed.
- a question word matches a question word to be analyzed when the question word to be analyzed is the same as the question word or is a substring of the question word; an answer word matches an answer word to be analyzed when the answer word to be analyzed is the same as the answer word or is a substring of the answer word.
- according to the selected question and answer knowledge records that correspond to the same category, the degree of association of the question and answer pair to be analyzed is obtained for each category; more specifically, the semantic relevance values of the selected question and answer knowledge records corresponding to the same category are added with weights (for example, a weight of 1 or 100) to obtain the degree of association of the question and answer pair to be analyzed for each category, thereby obtaining at least one degree of association (in this embodiment, the number of degrees of association equals the number of categories of the question and answer pair to be analyzed).
- the maximum value among the degrees of association of the question and answer pair to be analyzed for the respective categories is selected and taken as the degree of association of the question and answer pair to be analyzed.
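The matching, per-category weighted addition, and maximum selection described above can be sketched as follows. The tuple layout of a record, the function names, and the sample data are assumptions for illustration; normalization of the result to the range 0 to 1 is not shown.

```python
from collections import defaultdict


def matches(record_word, analyzed_word):
    """True if the analyzed word equals the record word or is a substring of it."""
    return analyzed_word in record_word


def association_degree(records, question_words, answer_words, weight=1.0):
    """records: iterable of (category, question_word, answer_word, relevance)."""
    per_category = defaultdict(float)
    for category, qw, aw, relevance in records:
        if (any(matches(qw, q) for q in question_words)
                and any(matches(aw, a) for a in answer_words)):
            per_category[category] += weight * relevance  # weighted addition
    # The largest per-category sum is the degree of association of the pair.
    return max(per_category.values(), default=0.0)


sample_records = [
    ("medical health", "cold", "cough", 0.3),
    ("medical health", "cold", "cold granules", 0.4),
    ("parenting", "cold", "blanket", 0.2),
    ("medical health", "fever", "cough", 0.5),
]
degree = association_degree(sample_records, ["cold"], ["cough", "granules", "blanket"])
```

In this sample, two "medical health" records match and are summed, one "parenting" record matches, and the "fever" record is skipped because its question word does not match.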
- FIG. 8 illustrates a flow chart of a method of determining a crawl frequency of a network resource point, in accordance with one embodiment of the present invention.
- the method includes the following steps S810, S820, and S830:
- a plurality of question and answer pairs to be analyzed are captured from the network resource point.
- specifically, for a network resource point whose crawl frequency needs to be determined, for example a Q&A community, a floor identification technique may be used: the post of the original poster (i.e., the user who first posts about a question) is taken as the question, and the replies on the 1st floor, the 2nd floor, and so on (i.e., from the users who reply to the post in order) are taken as the answers, so as to extract the question and answer pairs to be analyzed.
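A minimal sketch of the floor-based extraction, assuming a thread has already been fetched and is represented as a list of post texts in floor order (this representation and the function name are hypothetical):

```python
def extract_qa_pairs(thread):
    """Pair the original poster's question with each later reply.

    thread: list of post texts in floor order; the first entry is the
    original poster's question and the rest are replies.
    """
    if len(thread) < 2:
        return []  # no replies, so no question and answer pair
    question, *replies = thread
    return [(question, reply) for reply in replies]


qa_pairs = extract_qa_pairs(["What helps a cold?", "Rest.", "Cold granules."])
```

Each reply yields its own question and answer pair, so a thread with many floors contributes many pairs to be analyzed.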
- the question content and the answer content of each question and answer pair may be analyzed semantically by using the question and answer knowledge base to obtain the degree of association of the question and answer pair to be analyzed; the evaluation is more effective and easier to implement.
- step S820 of this embodiment is substantially the same as the method of obtaining the degree of association of a question and answer pair as shown in FIG. and is not repeated here.
- the question and answer knowledge base including a plurality of question and answer knowledge records is obtained by extracting a plurality of question and answer pairs from a webpage having a question and answer pair in advance, and constructing according to the extracted question and answer pairs.
- the category corresponding to the question and answer pair is captured.
- the question and answer knowledge record is constructed according to the question and answer pair and the category corresponding to the question and answer pair.
- Each question and answer knowledge record in the obtained question and answer knowledge base corresponds to a category, which includes a question word (QW), an answer word (AW), and a semantic relevance between the question word and the answer word.
- the semantic relevance between the question words and the answer words of the question and answer knowledge records can be obtained by learning from massive information; by using information extracted from web pages to build the question and answer knowledge base, the scope of application is broader and the method is more versatile.
- the method of the embodiment further includes the step of constructing a question and answer knowledge base, wherein the process of constructing the question and answer knowledge base is substantially the same as the process shown in FIG. 2; the interpretation model of the question and answer knowledge base of the present embodiment is substantially the same as the interpretation model shown in FIG. 3. These are not repeated here.
- the quality of the network resource point can be determined by using the degrees of association of the plurality of question and answer pairs to be analyzed, and the crawl frequency of the network resource point is determined accordingly.
- the specific method may be to use the average of the degrees of association of the question and answer pairs to be analyzed as the crawl frequency of the network resource point, that is, the larger the average degree of association of a network resource point (i.e., the better its quality), the higher its crawl frequency (for example, the more often a spider crawler crawls it); alternatively, a spider crawler may be used to obtain an initial crawl frequency of the network resource point, the average of the degrees of association of the question and answer pairs to be analyzed is calculated, and the average is used to adjust the initial crawl frequency to determine the crawl frequency of the network resource point. For example, the initial crawl frequency obtained by the spider crawler is weighted by the average degree of association (by multiplication, normalization, or the like) to determine the crawl frequency of the network resource point, so that the crawl frequency of high-quality network resource points is increased and search quality is thereby improved.
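The second option above (adjusting an initial crawl frequency by the average degree of association) can be sketched as below. The function name, the choice of plain multiplication as the weighting, and the unit of the frequency (for example, crawls per day) are assumptions; the source only says the weighting may include multiplication, normalization, and the like.

```python
def crawl_frequency(degrees, initial_frequency):
    """Scale an initial crawl frequency by the mean degree of association.

    degrees: degrees of association (assumed to lie in 0..1) of the
    question and answer pairs captured from one network resource point.
    """
    if not degrees:
        return initial_frequency  # nothing to judge quality by
    mean_degree = sum(degrees) / len(degrees)
    return initial_frequency * mean_degree


adjusted = crawl_frequency([0.5, 1.0], 8.0)
```

Because the degrees are assumed to lie between 0 and 1, plain multiplication only lowers the frequency of weaker resource points; a normalization step would be needed to raise it above the initial value.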
- by analyzing the degree of association of the question and answer pairs to be analyzed that are captured from the network resource point, and determining the crawl frequency of the network resource point according to that degree of association, the accuracy of the crawl results can be improved.
- the apparatus includes a question and answer knowledge base 910, a resource analysis unit 920, a calculation unit 930, and a capture frequency determining unit 940.
- the Q&A knowledge base 910 is adapted to store a plurality of Q&A knowledge records.
- the question and answer knowledge base 910 of the present embodiment can be constructed by crawling a large number of question and answer pairs in a web page.
- the resource analysis unit 920 is adapted to capture a plurality of question and answer pairs to be analyzed from the network resource point.
- specifically, for a network resource point whose crawl frequency needs to be determined, for example a Q&A community, the resource analysis unit 920 may use a floor identification technique: the post of the original poster (i.e., the user who first posts about a question) is taken as the question, and the replies on the 1st floor, the 2nd floor, and so on (i.e., from the users who reply to the post in order) are taken as the answers, so as to extract the question and answer pairs to be analyzed.
- the calculating unit 930 is adapted to obtain the degree of association of each question and answer pair to be analyzed according to the question and answer knowledge base.
- the calculation unit 930 of the present invention can analyze the question content and the answer content of the question and answer pair to be analyzed semantically by using the question and answer knowledge base to obtain the degree of association of the question and answer pair to be analyzed; the evaluation is more effective and easy to implement.
- the question and answer knowledge base 910 is constructed from a large number of high-quality question and answer pairs extracted from web pages and includes a plurality of question and answer knowledge records, so that the semantic relevance between the question words and the answer words of the question and answer knowledge records can be obtained by learning from massive information.
- the capture frequency determining unit 940 is adapted to determine a crawling frequency of the network resource point according to the correlation degree of the question and answer pair to be analyzed.
- the quality of the network resource point can be determined by using the degrees of association of the plurality of question and answer pairs to be analyzed, and the crawl frequency of the network resource point is determined accordingly.
- the specific method may be to use the average of the degrees of association of the question and answer pairs to be analyzed as the crawl frequency of the network resource point, that is, the larger the average degree of association of a network resource point (i.e., the better its quality), the higher its crawl frequency (for example, the more often a spider crawler crawls it); alternatively, a spider crawler may be used to obtain an initial crawl frequency of the network resource point, the average of the degrees of association of the question and answer pairs to be analyzed is calculated, and the average is used to adjust the initial crawl frequency to determine the crawl frequency of the network resource point. For example, the initial crawl frequency obtained by the spider crawler is weighted by the average degree of association (by multiplication, normalization, or the like) to determine the crawl frequency of the network resource point, so that the crawl frequency of high-quality network resource points is increased and search quality is thereby improved.
- the apparatus further includes a question and answer knowledge base construction unit 950, which is adapted to extract a plurality of question and answer pairs in advance from web pages containing question and answer pairs, and to construct, according to the extracted question and answer pairs, a question and answer knowledge base including a plurality of question and answer knowledge records.
- the question and answer knowledge base 910 already exists. Because the volume of information on the actual network keeps growing and its content changes rapidly, the content of the question and answer knowledge base 910 often needs to be updated.
- having the question and answer knowledge base construction unit 950 construct (or update) the question and answer knowledge base 910 ensures that its content is timely and reliable.
- the question and answer knowledge base construction unit 950 of the present embodiment is the same as the question and answer knowledge base construction unit 400 shown in FIG. 5, and the description thereof will not be repeated here.
- the calculation unit 930 in FIG. 9 specifically includes a word extraction subunit and an association degree calculation subunit (not shown).
- the word extraction subunit is adapted to perform the word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, and obtain at least one question word to be analyzed and at least one answer word to be analyzed.
- the word extraction subunit is adapted to perform word segmentation, stop-word removal, word joining, and entity-word extraction (e.g., nouns, verbs, etc.) on the question content and the answer content of the question and answer pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
- the association degree calculation subunit is adapted to select at least one question and answer knowledge record from the question and answer knowledge base according to the question word to be analyzed and the answer word to be analyzed, and to calculate the degree of association of the question and answer pair to be analyzed according to the selected question and answer knowledge record.
- the association degree calculation subunit is adapted to select the question and answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed.
- a question word matches a question word to be analyzed when the question word to be analyzed is the same as the question word or is a substring of the question word; an answer word matches an answer word to be analyzed when the answer word to be analyzed is the same as the answer word or is a substring of the answer word.
- according to the selected question and answer knowledge records that correspond to the same category, the degree of association of the question and answer pair to be analyzed is obtained for each category; more specifically, the semantic relevance values of the selected question and answer knowledge records corresponding to the same category are added with weights (for example, a weight of 1 or 100) to obtain the degree of association of the question and answer pair to be analyzed for each category, thereby obtaining at least one degree of association (in this embodiment, the number of degrees of association equals the number of categories of the question and answer pair to be analyzed).
- the maximum value among the degrees of association of the question and answer pair to be analyzed for the respective categories is selected and taken as the degree of association of the question and answer pair to be analyzed.
- the various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or digital signal processor (DSP) can be used in practice to implement some or all of the functions of some or all of the components of the device for obtaining the degree of association of a question and answer pair, the device for optimizing the search ranking of question and answer pairs, and the device for determining the crawl frequency of a network resource point according to embodiments of the present invention.
- the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein. Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
- FIG. 10 illustrates a server, such as an application server, that can perform the method of obtaining the degree of association of a question and answer pair, the method of optimizing the search ranking of question and answer pairs, and the method of determining the crawl frequency of a network resource point according to the present invention.
- the application server conventionally includes a processor 1010 and a computer program product or computer readable medium in the form of a memory 1020.
- the memory 1020 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
- the memory 1020 has a storage space 1030 for program code 1031 for performing any of the above method steps.
- storage space 1030 for program code may include various program code 1031 for implementing various steps in the above methods, respectively.
- the program code can be read from or written to one or more computer program products.
- These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
- Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG.
- the storage unit may have a storage section, a storage space, and the like arranged similarly to the storage 1020 in the application server of FIG.
- the program code can be compressed, for example, in an appropriate form.
- the storage unit includes computer readable code 1131', i.e., code that can be read by a processor such as the processor 1010, which, when executed by a server, causes the server to perform each step of the methods described above.
Abstract
An apparatus and method for obtaining the degree of association of a question and answer pair, a method for the search ranking optimization of the question and answer pair, and an apparatus and method for determining the crawling frequency of a network resource point. The method for obtaining the degree of association of the question and answer pair comprises the following steps: performing a word extraction operation on the question content and answer content of a question and answer pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed; selecting at least one question and answer knowledge record from a question and answer knowledge library including a plurality of question and answer knowledge records according to the question word to be analyzed and the answer word to be analyzed, and calculating the degree of association of the question and answer pair to be analyzed according to the selected question and answer knowledge record. With the apparatus and method for obtaining the degree of association of the question and answer pair, the quality of the question and answer pair can be evaluated semantically, and the evaluation effect is better; in addition, the apparatus and method are easy to implement and excellent in universality.
Description
The present invention relates to the field of network data communication technologies, and in particular to an apparatus and method for obtaining the degree of association of a question and answer pair, an apparatus and method for optimizing the search ranking of question and answer pairs, and an apparatus and method for determining the crawl frequency of a network resource point.
A Q&A community is a web application built on user-generated content. Its basic form is that users ask questions according to their own needs and other users provide answers. This form gives users a new channel for obtaining information on the web. However, because any user can create content freely, the quality of information in Q&A communities varies greatly, so that a large number of low-quality question and answer pairs appear in them. This not only makes it inconvenient for users to find information but also lowers the quality of the Q&A community. Moreover, prior-art methods for judging the quality of question and answer pairs rely mostly on non-text features of the pairs, which limits their versatility.
In addition, when existing search technology is used for question and answer search, the search results contain some low-quality question and answer pairs, and prior-art methods for sorting search results rely mostly on the website to which a question and answer pair belongs and on non-text features of the pair, which affects the accuracy and versatility of the search results.
Likewise, when existing search technology is used for question and answer search, it is difficult to judge the quality of a Q&A community as a network resource point, and prior-art methods (for example, crawler spiders) for setting the crawl frequency of a network resource point rely mostly on analysis of the links of the Q&A website. When such methods are used for question and answer search, they can neither analyze question and answer pairs semantically nor adjust the crawl frequency (or crawl granularity, crawling frequency) according to the quality of the network resource point, which affects the accuracy and versatility of the search results.
Summary of the invention
鉴于上述问题,提出了本发明以便提供一种克服上述问题或者至少部分地解决上述问题的一种获取问答对的相关联程度的装置和方法,一种优化问答对的搜索排名的装置和方法,以及一种确定网络资源点的抓取频率的装置和方法。In view of the above problems, the present invention has been made in order to provide an apparatus and method for obtaining the degree of association of a question and answer pair that overcomes the above problems or at least partially solves the above problems, and an apparatus and method for optimizing a search ranking of a question and answer pair, And an apparatus and method for determining a crawl frequency of a network resource point.
According to one aspect of the present invention, an apparatus for obtaining the degree of association of a question and answer pair is provided. The apparatus comprises: a question and answer knowledge base adapted to store a plurality of question and answer knowledge records; a word extraction unit adapted to perform a word extraction operation on the question content and the answer content of a question and answer pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed; and a degree-of-association calculation unit adapted to select at least one question and answer knowledge record from the question and answer knowledge base according to the question word to be analyzed and the answer word to be analyzed, and to calculate the degree of association of the question and answer pair to be analyzed according to the selected question and answer knowledge record.
According to another aspect of the present invention, an apparatus for optimizing the search ranking of question and answer pairs is provided. The apparatus comprises: a question and answer knowledge base adapted to store a plurality of question and answer knowledge records; a search unit adapted to receive a user's search request and to obtain, according to the search request, a plurality of question and answer pairs to be analyzed that match the search request; a calculation unit adapted to obtain the degree of association of each question and answer pair to be analyzed according to the question and answer knowledge base; and a search ranking unit adapted to optimize the search ranking of the question and answer pairs to be analyzed according to their degrees of association.
According to still another aspect of the present invention, an apparatus for determining the crawl frequency of a network resource point is provided. The apparatus comprises: a question and answer knowledge base adapted to store a plurality of question and answer knowledge records; a resource analysis unit adapted to crawl a plurality of question and answer pairs to be analyzed from the network resource point; a calculation unit adapted to obtain the degree of association of each question and answer pair to be analyzed according to the question and answer knowledge base; and a crawl frequency determination unit adapted to determine the crawl frequency of the network resource point according to the degrees of association of the question and answer pairs to be analyzed.
According to another aspect of the present invention, a method for obtaining the degree of association of a question and answer pair is provided. The method comprises the following steps: performing a word extraction operation on the question content and the answer content of a question and answer pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed; and selecting, according to the question word to be analyzed and the answer word to be analyzed, at least one question and answer knowledge record from a question and answer knowledge base that includes a plurality of question and answer knowledge records, and calculating the degree of association of the question and answer pair to be analyzed according to the selected question and answer knowledge record.
According to still another aspect of the present invention, a method for optimizing the search ranking of question and answer pairs is provided. The method comprises the following steps: receiving a user's search request and obtaining, according to the search request, a plurality of question and answer pairs to be analyzed that match the search request; obtaining the degree of association of each question and answer pair to be analyzed according to a question and answer knowledge base that includes a plurality of question and answer knowledge records; and optimizing the search ranking of the question and answer pairs to be analyzed according to their degrees of association.
According to yet another aspect of the present invention, a method for determining the crawl frequency of a network resource point is provided. The method comprises the following steps: crawling a plurality of question and answer pairs to be analyzed from the network resource point; obtaining the degree of association of each question and answer pair to be analyzed according to a question and answer knowledge base that includes a plurality of question and answer knowledge records; and determining the crawl frequency of the network resource point according to the degrees of association of the question and answer pairs to be analyzed.
According to the technical solution of the present invention, a plurality of question and answer pairs are extracted from web pages containing question and answer pairs, and a question and answer knowledge base including a plurality of question and answer knowledge records is constructed from the extracted pairs. A word extraction operation is performed on the question content and the answer content of a question and answer pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed. At least one question and answer knowledge record is then selected from the knowledge base according to those words, and the degree of association of the question and answer pair to be analyzed is calculated according to the selected record. The quality of a question and answer pair can thus be evaluated semantically, which solves the problem of poor evaluation in the prior art, where quality is evaluated only at the lexical level. Likewise, where a plurality of question and answer pairs to be analyzed are obtained in response to a user's search request, the degree of association of each pair is obtained according to the knowledge base and the search ranking of the pairs is optimized according to those degrees of association. The quality of the pairs can thus be evaluated semantically, which solves the problem of poor ranking in the prior art, where ranking depends on the web page to which a pair belongs and on the non-text features of the pair. Further, by crawling a plurality of question and answer pairs to be analyzed from a network resource point, obtaining the degree of association of each pair according to the knowledge base, and determining the crawl frequency of the network resource point according to those degrees of association, the crawl frequency can be determined by evaluating the quality of the network resource point. This solves the problem of poor search performance in the prior art, which cannot adjust the crawl frequency according to the quality of the network resource point. Moreover, the solution of the present application is easy to implement and highly versatile.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set forth below.
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be construed as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
FIG. 1 shows a flow chart of a method for obtaining the degree of association of a question and answer pair according to one embodiment of the present invention;
FIG. 2 shows a detailed flow chart for constructing the question and answer knowledge base;
FIG. 3 shows a schematic diagram of an explanation model of the question and answer knowledge base obtained using the steps shown in FIG. 2;
FIG. 4 shows a detailed flow chart of step S200 in FIG. 1;
FIG. 5 shows a block diagram of an apparatus for obtaining the degree of association of a question and answer pair according to one embodiment of the present invention;
FIG. 6 shows a flow chart of a method for optimizing the search ranking of question and answer pairs according to one embodiment of the present invention;
FIG. 7 shows a block diagram of an apparatus for optimizing the search ranking of question and answer pairs according to one embodiment of the present invention;
FIG. 8 shows a flow chart of a method for determining the crawl frequency of a network resource point according to one embodiment of the present invention;
FIG. 9 shows a block diagram of an apparatus for determining the crawl frequency of a network resource point according to one embodiment of the present invention;
FIG. 10 shows a block diagram of an application server for performing the method according to the present invention; and
FIG. 11 shows a storage unit for holding or carrying program code that implements the method according to the present invention.
Detailed description of the embodiments
Existing methods for obtaining the degree of association of a question and answer pair describe the question and the answer using text features and non-text features. Similarly, existing methods for obtaining the search ranking of question and answer pairs either describe the question and the answer using text features and non-text features and rank the pairs accordingly, or rank the pairs according to the ranking of the website to which they belong. Text features mainly include visual text features (for example punctuation density, average word length, and text entropy) and text content features (for example the proportion of content words, question word density, and related-word coverage), together with features widely used in automatic Chinese error detection (for example single-character density). Non-text features include user authority indicators, the status of the question and answer, the answer time, and user interaction features. After features are extracted separately from the question and the answer, a question quality prediction model and an answer quality prediction model are each learned on a training set, and the outputs of the two models are used to evaluate the quality of the question and answer pair.
However, when answer quality is evaluated with these existing methods, only the related-word coverage feature is used to describe the semantic match between the question and the answer. This measure stays at the lexical level and does not actually take the semantic match between question and answer into account, yet that semantic match is precisely the core of question and answer pair quality. For example, suppose the question is "Where is the capital of China?", answer 1 is "Beijing", and answer 2 is "The capital of China is Shanghai". After word segmentation and stop word removal, the question becomes "China capital where", answer 1 becomes "Beijing", and answer 2 becomes "China capital Shanghai". In the prior art, the semantic matching degree can be defined as the number of words that occur in both the question and the answer divided by the number of all words in the question and the answer. The semantic matching degree between the question and answer 1 is then 0/4 = 0, and that between the question and answer 2 is 2/4 = 0.5. The prior art would therefore conclude that answer 2 matches the question better, which is clearly wrong.
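The prior-art measure in the example above can be reproduced with a minimal Python sketch. It assumes that "all words in the question and the answer" means the set of distinct words, which is consistent with the denominators of 4 in the example; the function name `lexical_match` and the English tokens are illustrative only.

```python
def lexical_match(question_words, answer_words):
    """Shared words divided by all distinct words in question and answer."""
    q, a = set(question_words), set(answer_words)
    union = q | a
    # Guard against two empty word lists.
    return len(q & a) / len(union) if union else 0.0


question = ["China", "capital", "where"]
answer_1 = ["Beijing"]
answer_2 = ["China", "capital", "Shanghai"]

print(lexical_match(question, answer_1))  # 0.0
print(lexical_match(question, answer_2))  # 0.5
```

The wrong answer scores higher simply because it repeats words from the question, which is the weakness the passage points out.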
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
FIG. 1 shows a flow chart of a method for obtaining the degree of association of a question and answer pair according to one embodiment of the present invention. According to another aspect of the present invention, a method for obtaining the degree of association of a question and answer pair is provided, the method comprising the following steps S100 and S200:
S100: Perform a word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
In one embodiment of the present invention, the word extraction operation on the question content and the answer content of the question and answer pair to be analyzed specifically includes word segmentation, stop word removal, word merging (word join), and extraction of entity words (for example nouns and verbs). At least one question word to be analyzed is thereby obtained from the question content, and at least one answer word to be analyzed is obtained from the answer content.
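As a rough illustration of the word extraction in step S100, the sketch below covers only tokenization and stop word removal on English text. The stop word list and the function name `extract_words` are assumptions for illustration; Chinese word segmentation, word merging, and entity word extraction would need a segmenter and a part-of-speech tagger and are not shown.

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"is", "the", "of", "a", "an"}


def extract_words(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]


question_words = extract_words("Where is the capital of China?")
answer_words = extract_words("Beijing")
print(question_words)  # ['where', 'capital', 'china']
print(answer_words)    # ['beijing']
```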
S200: According to the question word to be analyzed and the answer word to be analyzed, select at least one question and answer knowledge record from a question and answer knowledge base that includes a plurality of question and answer knowledge records, and calculate the degree of association of the question and answer pair to be analyzed according to the selected question and answer knowledge record.
In step S200 of this embodiment, the question and answer knowledge base is used to analyze the question content and the answer content of the pair semantically in order to obtain the degree of association of the question and answer pair to be analyzed; the evaluation is more effective and is easy to implement.
Further, the question and answer knowledge base including a plurality of question and answer knowledge records is obtained by extracting, in advance, a plurality of question and answer pairs from web pages containing question and answer pairs and constructing the knowledge base from the extracted pairs. In one embodiment of the present invention, when the question and answer pairs are extracted from such web pages, the categories corresponding to the pairs are crawled as well. When the knowledge base is constructed from the extracted pairs, the question and answer knowledge records are built from the pairs and their corresponding categories. Each question and answer knowledge record in the resulting knowledge base corresponds to one category and includes a question word (QW), an answer word (AW), and the semantic relevance between the question word and the answer word.
By constructing the knowledge base from a massive number of high-quality question and answer pairs extracted from web pages, the semantic relevance between the question word and the answer word of each knowledge record can be learned from massive information. Moreover, because the knowledge base is built from information extracted from web pages, the method has a wider scope of application and greater versatility.
FIG. 2 shows a detailed flow chart for constructing the question and answer knowledge base. The construction specifically includes the following steps S310, S320, and S330:
S310: Extract, in advance, a plurality of question and answer pairs from web pages containing question and answer pairs, and crawl the categories corresponding to the pairs.
In this embodiment, a web crawler can be used to crawl data from web pages on the Internet that contain high-quality question and answer pairs and to extract the pairs, which ensures the quality of the extracted pairs. Such web pages include cQA (Customer Quality Assurance) communities, major professional forums, and the like. Floor identification technology can then be used to extract question and answer pairs, treating the post of the original poster (the user who posted first on a topic) as the question and the replies on floor 1, floor 2, and so on (the users who reply in sequence) as answers. Because such web pages include category information for each question and answer pair, the category corresponding to a pair can be crawled together with the pair.
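The floor-based extraction described above might look like the following sketch. The thread layout (a dict with hypothetical `category` and `posts` keys, where floor 0 is the original post) is an assumption made for illustration; real forum pages would first have to be crawled and parsed into such a structure.

```python
def extract_qa_pairs(thread):
    """Pair the original post (lowest floor) with every later reply."""
    posts = sorted(thread["posts"], key=lambda post: post["floor"])
    if not posts:
        return []
    question = posts[0]["text"]
    # Each reply becomes one answer to the same question, tagged with the category.
    return [(question, post["text"], thread["category"]) for post in posts[1:]]


thread = {
    "category": "Geography",
    "posts": [
        {"floor": 1, "text": "Beijing"},
        {"floor": 0, "text": "Where is the capital of China?"},
        {"floor": 2, "text": "The capital of China is Shanghai"},
    ],
}
for pair in extract_qa_pairs(thread):
    print(pair)
```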
S320: For each question and answer pair, perform a word extraction operation on the question content and the answer content of the pair to obtain a question word set and an answer word set; each question word in the question word set and each answer word in the answer word set then form one information record for each category corresponding to the pair.
In one embodiment of the present invention, the word extraction operation performed on the question content and the answer content of each question and answer pair extracted in step S310 specifically includes word segmentation, stop word removal, word merging, and extraction of entity words.
At least one question word is thereby obtained from the question content of each pair, and at least one answer word is obtained from the answer content of each pair. For a given pair this yields a category set <C_1, …, C_k, …, C_p>, a question word set <QW_1, …, QW_i, …, QW_m>, and an answer word set <AW_1, …, AW_j, …, AW_n>.
Each question word (QW_i) in the question word set and each answer word (AW_j) in the answer word set form one information record, for example <QW_i, AW_j, C_k>, for each category (C_k) corresponding to the pair, so that m*n*p information records are formed.
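The m*n*p information records of step S320 are the Cartesian product of the three sets. A minimal sketch, with the function name and the sample words chosen only for illustration:

```python
from itertools import product


def build_information_records(question_words, answer_words, categories):
    """Return one (QW_i, AW_j, C_k) tuple per combination of the three sets."""
    return list(product(question_words, answer_words, categories))


records = build_information_records(
    ["China", "capital"],            # m = 2 question words
    ["Beijing", "city", "north"],    # n = 3 answer words
    ["Geography", "Travel"],         # p = 2 categories
)
print(len(records))  # 12, that is m * n * p
print(records[0])    # ('China', 'Beijing', 'Geography')
```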
S330: For each information record, perform the following operations: calculate the probability that the answer word belongs to the category; calculate the specificity with which the answer word explains the question word in the category; calculate the strength with which the question word is explained by the answer word in the category; multiply the probability, the specificity, and the strength, the product being the semantic relevance between the answer word and the question word; and form, from the question word, the answer word, and their semantic relevance, a question and answer knowledge record corresponding to the category: <QW_i, AW_j, weight(QW_i, AW_j)> or <QW_i, AW_j, C_k, weight(QW_i, AW_j)>. Step S330 of this embodiment may be performed on the massive set of information records obtained after the word extraction operation of step S320 has been applied to the massive number of question and answer pairs crawled from web pages; semantic relevance obtained from a massive set of information records is more accurate.
Preferably, calculating the probability that the answer word belongs to the category specifically includes:
Calculating the specificity with which each answer word explains the question word in the category specifically includes:
Calculating the strength with which the question word is explained by each answer word in the category specifically includes:
Multiplying the probability, the specificity, and the strength specifically includes:
weight(QW_i, AW_j | C = C_k) = P(C_k | AW_j) * specific(QW_i, AW_j | C = C_k) * interpret(QW_i, AW_j | C = C_k);
where P(C_k) denotes the probability that category C_k occurs; P(AW_j) denotes the probability that the answer word is AW_j; P(AW_j | C_k) denotes the probability of the answer word AW_j given category C_k;
#(QW_i, AW_j) denotes the number of times the question word is QW_i and the answer word is AW_j;
#(AW_j) denotes the number of times the answer word is AW_j.
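The product in the weight formula can be sketched in Python as follows. Two points are assumptions rather than statements of the source: the component formulas are not reproduced in this text, so `specific` and `interpret` are passed in as already computed numbers, and P(C_k | AW_j) is obtained with Bayes' rule only because the three probabilities defined above are exactly its inputs. All function names are illustrative.

```python
def category_posterior(p_aw_given_c, p_c, p_aw):
    """P(C_k | AW_j) from P(AW_j | C_k), P(C_k), P(AW_j) (Bayes' rule, assumed)."""
    if p_aw == 0:
        return 0.0  # an answer word that never occurs carries no evidence
    return p_aw_given_c * p_c / p_aw


def semantic_relevance(p_c_given_aw, specific, interpret):
    """weight(QW_i, AW_j | C = C_k) as the product of the three factors."""
    return p_c_given_aw * specific * interpret


# Made-up input values, chosen only to show the arithmetic.
posterior = category_posterior(p_aw_given_c=0.5, p_c=0.25, p_aw=0.25)
print(posterior)                                 # 0.5
print(semantic_relevance(posterior, 0.5, 0.25))  # 0.0625
```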
Through steps S310, S320, and S330, question and answer knowledge records are obtained and the question and answer knowledge base is constructed. FIG. 3 shows a schematic diagram of an explanation model of the knowledge base obtained using the steps shown in FIG. 2. As can be seen, for each question word QW_i, n question and answer knowledge records can be obtained for each category in the category set <C_1, …, C_k, …, C_p>. Of course, those skilled in the art will appreciate that if a calculated semantic relevance is 0, the corresponding question and answer knowledge record can be deleted. Furthermore, if the number of records in the knowledge base is so large that storing the records and calculating the degree of association of the question and answer pairs to be analyzed becomes too costly, a threshold can be preset and the records whose semantic relevance is less than the threshold can be deleted to reduce the cost.
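The pruning described above can be sketched as a simple filter. The record layout, a tuple of (question word, answer word, category, semantic relevance), and the function name are assumptions for illustration.

```python
def prune_records(records, threshold=0.0):
    """Drop records with zero relevance or relevance below the threshold."""
    return [rec for rec in records if rec[3] > 0 and rec[3] >= threshold]


knowledge_base = [
    ("capital", "Beijing", "Geography", 0.5),
    ("capital", "Shanghai", "Geography", 0.0),
    ("capital", "city", "Travel", 0.125),
]
print(prune_records(knowledge_base))                  # the zero-relevance record is removed
print(prune_records(knowledge_base, threshold=0.25))  # only the 0.5 record remains
```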
FIG. 4 shows a detailed flow chart of step S200 in FIG. 1. After at least one question word to be analyzed and at least one answer word to be analyzed have been obtained in step S100, step S200 specifically includes the following steps S210, S220, and S230:
S210: Select the question and answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed. In this embodiment, a question word matches a question word to be analyzed when the question word to be analyzed is identical to the question word or is a substring of it; an answer word matches an answer word to be analyzed when the answer word to be analyzed is identical to the answer word or is a substring of it. In step S210 of this embodiment, field matching or field search is used to select from the knowledge base the subset of question and answer knowledge records related to the question and answer pair to be analyzed.
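The matching rule of step S210 (identical, or the analyzed word is a substring of the stored word) can be sketched as below. The record layout of (question word, answer word, category, semantic relevance) and the function names are assumptions, and a linear scan stands in for whatever field matching or field search an actual system would use.

```python
def word_matches(analyzed_word, stored_word):
    """True when the analyzed word equals, or is a substring of, the stored word."""
    # Python's `in` on strings covers both equality and substring containment.
    return analyzed_word in stored_word


def select_records(knowledge_base, question_words, answer_words):
    """Keep records whose question word and answer word both match."""
    return [
        rec for rec in knowledge_base
        if any(word_matches(q, rec[0]) for q in question_words)
        and any(word_matches(a, rec[1]) for a in answer_words)
    ]


knowledge_base = [
    ("capital city", "Beijing", "Geography", 0.5),
    ("capital", "Shanghai", "Geography", 0.125),
    ("river", "Beijing", "Geography", 0.25),
]
print(select_records(knowledge_base, ["capital"], ["Beijing"]))
# [('capital city', 'Beijing', 'Geography', 0.5)]
```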
S220: From the selected question and answer knowledge records that correspond to the same category, obtain the degree of association of the question and answer pair to be analyzed for each category. Specifically, the semantic relevance values of the selected records that correspond to the same category are weighted and summed to obtain the degree of association of the pair for that category.
In this embodiment, the question and answer knowledge records selected in step S210 are grouped by their corresponding categories, with the records corresponding to the same category forming one group. The semantic relevance values of the records in each group are weighted (for example, with a weight of 1 or 100) and summed to obtain the degree of association of the pair for that category. At least one degree of association is thereby obtained (in this embodiment, the number of degrees of association equals the number of categories corresponding to the question and answer pair to be analyzed).
S230: Take the maximum of the degrees of association of the question and answer pair to be analyzed over the categories, and use this maximum as the degree of association of the question and answer pair to be analyzed.
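Steps S220 and S230 amount to a per-category weighted sum followed by a maximum. A minimal sketch, assuming records are tuples of (question word, answer word, category, semantic relevance) and that a single weight applies to every record; returning 0.0 when no record was selected is also an assumption.

```python
from collections import defaultdict


def association_degree(selected_records, weight_factor=1.0):
    """Sum weighted relevance per category (S220), then take the maximum (S230)."""
    per_category = defaultdict(float)
    for _question_word, _answer_word, category, relevance in selected_records:
        per_category[category] += weight_factor * relevance
    return max(per_category.values(), default=0.0)


selected = [
    ("capital", "Beijing", "Geography", 0.5),
    ("China", "Beijing", "Geography", 0.25),
    ("capital", "Beijing", "Travel", 0.5),
]
print(association_degree(selected))  # 0.75 (Geography: 0.5 + 0.25, Travel: 0.5)
```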
FIG. 5 shows a block diagram of an apparatus for obtaining the degree of association of a question and answer pair according to one embodiment of the present invention. The apparatus includes a question and answer knowledge base 100, a word extraction unit 200, and a degree-of-association calculation unit 300.
The question and answer knowledge base 100 is adapted to store a plurality of question and answer knowledge records. The knowledge base 100 of this embodiment can be constructed by crawling a massive number of question and answer pairs from web pages.
The word extraction unit 200 is adapted to perform a word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
In one embodiment of the present invention, the word extraction unit 200 is adapted to perform word segmentation, stop word removal, word merging (word join), and extraction of entity words (for example nouns and verbs) on the question content and the answer content of the question and answer pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
The degree-of-association calculation unit 300 is adapted to select at least one question and answer knowledge record from the question and answer knowledge base according to the question word to be analyzed and the answer word to be analyzed, and to calculate the degree of association of the question and answer pair to be analyzed according to the selected question and answer knowledge record.
In an embodiment of the present invention, the association degree calculation unit 300 is adapted to select the question and answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed. In this embodiment, a question word matches a question word to be analyzed when the question word to be analyzed is identical to the question word or is a substring of it; likewise, an answer word matches an answer word to be analyzed when the answer word to be analyzed is identical to the answer word or is a substring of it. From the selected question and answer knowledge records that correspond to the same category, the degree of association of the question and answer pair to be analyzed is obtained for each category. More specifically, the semantic relevance values of the selected records corresponding to the same category are weighted (for example, with a weight of 1 or 100) and summed to give the degree of association for that category, so that at least one degree of association is obtained (in this embodiment, the number of degrees of association equals the number of categories to which the question and answer pair to be analyzed corresponds). The maximum of these per-category degrees of association is taken as the degree of association of the question and answer pair to be analyzed.
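The selection and aggregation just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `KnowledgeRecord` type, the function names, the default weight of 1.0, and the return value of 0.0 when no record matches are all hypothetical choices made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class KnowledgeRecord(NamedTuple):
    """One question and answer knowledge record (hypothetical layout)."""
    category: str
    question_word: str
    answer_word: str
    relevance: float  # semantic relevance stored in the knowledge base


def matches(word_to_analyze: str, record_word: str) -> bool:
    # A word to be analyzed matches when it equals the record word
    # or is a substring of it; `in` covers both cases.
    return word_to_analyze in record_word


def association_degree(question_words, answer_words, records, weight=1.0):
    """Sum weighted relevance per category, then take the maximum."""
    per_category = defaultdict(float)
    for rec in records:
        q_hit = any(matches(qw, rec.question_word) for qw in question_words)
        a_hit = any(matches(aw, rec.answer_word) for aw in answer_words)
        if q_hit and a_hit:
            per_category[rec.category] += weight * rec.relevance
    # Assumption: a pair with no matching record gets a degree of 0.0.
    return max(per_category.values(), default=0.0)
```

With a weight of 1 the result is simply the largest per-category sum of the matched relevance values; a different weight rescales every category equally and does not change which category wins.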
With the question and answer knowledge base 100, the word extraction unit 200, and the association degree calculation unit 300, at least one question and answer knowledge record is selected from the question and answer knowledge base using the question words to be analyzed and the answer words to be analyzed, and the degree of association of the question and answer pair to be analyzed is calculated from the selected records. The pair is thereby analyzed semantically, which gives a better evaluation and is easy to implement. Because the question and answer knowledge base is built from information extracted from web pages, the approach has a wider scope of application and is more general.
In this embodiment, the apparatus further includes a question and answer knowledge base construction unit 400, which is adapted to extract a plurality of question and answer pairs in advance from web pages containing question and answer pairs, and to construct, from the extracted pairs, a question and answer knowledge base that includes a plurality of question and answer knowledge records. In the apparatus shown in FIG. 5, the question and answer knowledge base already exists. Because the amount of information on the network keeps growing and its content changes quickly, the knowledge base often needs to be updated. Adding the question and answer knowledge base construction unit 400 to construct (or update) the knowledge base keeps its content current and reliable.
Preferably, when extracting a plurality of question and answer pairs from web pages containing question and answer pairs, the question and answer knowledge base construction unit 400 also captures the category corresponding to each pair. In this embodiment, a web crawler may be used to fetch data from web pages on the Internet that contain high-quality question and answer pairs and to extract the pairs, so as to ensure the quality of the extracted pairs; such web pages include cQA communities, major professional forums, and the like. Because these web pages include category information for each question and answer pair, the question and answer knowledge base construction unit 400 can capture the corresponding category at the same time as it captures the pair.
In this embodiment, the question and answer knowledge base construction unit 400 is adapted to perform the following operations on each question and answer pair: perform a word extraction operation on the question content and the answer content of the pair to obtain a question word set and an answer word set (specifically, the unit 400 performs word segmentation, stop-word removal, word joining, and entity-word extraction on the question content and answer content of each extracted pair to obtain the question words and answer words); and have each question word in the question word set and each answer word in the answer word set form one information record for each category corresponding to the pair. The question and answer knowledge base construction unit 400 is further adapted to perform the following operations on each information record: calculate the probability that the answer word belongs to the category; calculate the degree of specificity with which the answer word explains the question word in that category; calculate the strength with which the question word is explained by the answer word in that category; multiply the probability, the degree of specificity, and the strength, the product being the semantic relevance between the answer word and the question word; and have the question word, the answer word, and their semantic relevance form one question and answer knowledge record corresponding to the category.
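The way information records are formed from one question and answer pair can be sketched in Python as below. This is only an illustration: the function name `information_records` and the tuple layout `(category, question_word, answer_word)` are assumptions made for the example, and the word extraction itself is taken as already done.

```python
from itertools import product


def information_records(question_words, answer_words, categories):
    """Pair every question word with every answer word in every category.

    Returns a list of (category, question_word, answer_word) tuples,
    one tuple per information record.
    """
    return [
        (category, qw, aw)
        for qw, aw, category in product(question_words, answer_words, categories)
    ]
```

A pair with m question words, n answer words, and c categories therefore yields m * n * c information records, each of which later receives its own semantic relevance.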
More specifically, the question and answer knowledge base construction unit 400 is adapted to calculate the probability that the answer word belongs to the category according to the following method:
More specifically, the question and answer knowledge base construction unit 400 is adapted to calculate, according to the following method, the degree of specificity with which each answer word explains the question word in the category:
More specifically, the question and answer knowledge base construction unit 400 is adapted to calculate, according to the following method, the strength with which the question word is explained by each answer word in the category:
More specifically, the question and answer knowledge base construction unit 400 is adapted to multiply the above probability, degree of specificity, and strength according to the following method:
weight(QWi, AWj | C=Ck) = P(Ck | AWj) * specific(QWi, AWj | C=Ck) * interpret(QWi, AWj | C=Ck);
where P(Ck) denotes the probability that the category Ck occurs; P(AWj) denotes the probability that the answer word is AWj; P(AWj | Ck) denotes the probability of the answer word AWj given the category Ck;
#(QWi, AWj) denotes the number of times the question word is QWi and the answer word is AWj;
#(AWj) denotes the number of times the answer word is AWj.
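The product that defines the semantic relevance can be sketched in Python as follows. The function `semantic_weight` restates the weight formula given in the text. The function `posterior_category_given_answer` is an assumption: it applies Bayes' rule to the three probabilities P(Ck), P(AWj), and P(AWj | Ck) defined above, which is one natural way to obtain P(Ck | AWj) but is not confirmed by the surviving text. The specificity and strength factors are taken as given inputs, and both function names are hypothetical.

```python
def posterior_category_given_answer(p_aw_given_c, p_c, p_aw):
    """Assumed Bayes' rule: P(Ck | AWj) = P(AWj | Ck) * P(Ck) / P(AWj)."""
    if p_aw <= 0:
        raise ValueError("P(AWj) must be positive")
    return p_aw_given_c * p_c / p_aw


def semantic_weight(p_c_given_aw, specific, interpret):
    """weight = P(Ck | AWj) * specific(...) * interpret(...), as in the text."""
    return p_c_given_aw * specific * interpret
```

Because the three factors are multiplied, a record scores highly only when the answer word is typical of the category, explains the question word specifically, and explains it strongly; a value near zero in any one factor suppresses the whole weight.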
The effect achievable with the embodiments of the present invention is illustrated below with an example. Consider the following question and answer pair, whose category is "medical health":
After word segmentation, the question words to be analyzed and the answer words to be analyzed are as follows:
As the segmentation results show, the question and the answer share no words, so the prior art would readily judge the pair to have a low degree of association and low quality. A human reader, however, can plainly see that it is a high-quality question and answer pair.
If the method and apparatus of the present invention are used to process the above question and answer pair, first, an existing question and answer knowledge base may be retrieved, or a question and answer knowledge base may be constructed by crawling question and answer pairs from cQA communities and major professional forums;
In the second step, the word extraction operation is applied to the question and answer pair to be analyzed, giving the set of question words to be analyzed <child, cough, runny nose> and the set of answer words to be analyzed <symptom, drug, treatment, antiviral, pediatric cold granules, instructions, dosage, cough relief, traditional Chinese medicine, infusion granules, antibiotic, amoxicillin, amoxicillin granules, granules, oral administration, roxithromycin, efficacy>, and the category of the pair to be analyzed is obtained as "medical health";
In the third step, according to each question word to be analyzed and the category, the question and answer knowledge records whose question word matches a question word to be analyzed are selected from the question and answer knowledge base, giving the following answer words and semantic relevance values (for ease of reading, the semantic relevance values in the table below have been suitably normalized):
In the fourth step, according to the answer words to be analyzed in the set of answer words to be analyzed, the records selected in the third step are filtered down to those whose answer word matches an answer word to be analyzed, and the semantic relevance values of the filtered records are obtained. Analysis shows that in this example the answer words to be analyzed that match answer words in the question and answer knowledge records include: <oral administration, cough and asthma, pediatric cold granules, examination, cough relief, treatment, flu symptoms, cold granules>.
Calculating the degree of association of the question and answer pair to be analyzed then gives a value of 0.9 (where the degree of association ranges from 0 to 1).
FIG. 6 shows a flowchart of a method for optimizing the search ranking of question and answer pairs according to an embodiment of the present invention. The method includes the following steps S610, S620, and S630:
S610. Receive a search request from a user and, according to the search request, obtain a plurality of question and answer pairs to be analyzed that match the search request.
In an embodiment of the present invention, a network search technology, for example a question and answer pair search engine, may be used to obtain the question and answer pairs to be analyzed according to the user's search request.
S620. Obtain the degree of association of each question and answer pair to be analyzed according to a question and answer knowledge base that includes a plurality of question and answer knowledge records.
In step S620 of this embodiment, the question and answer knowledge base is used to analyze semantically the question content and the answer content of each question and answer pair to be analyzed, so as to obtain its degree of association; the evaluation is better and easy to implement.
More specifically, the way step S620 of this embodiment obtains the degree of association of a question and answer pair to be analyzed is substantially the same as the method for obtaining the degree of association of a question and answer pair shown in FIGS. 1 and 4, and is not repeated here.
Further, the question and answer knowledge base that includes a plurality of question and answer knowledge records is obtained by extracting a plurality of question and answer pairs in advance from web pages containing question and answer pairs and constructing the knowledge base from the extracted pairs. In an embodiment of the present invention, when the pairs are extracted from the web pages, the category corresponding to each pair is also captured. When the knowledge base is constructed from the extracted pairs, the question and answer knowledge records are then built from the pairs and their corresponding categories. Each question and answer knowledge record in the resulting knowledge base corresponds to one category and includes one question word (QW), one answer word (AW), and the semantic relevance between the question word and the answer word. Constructing the knowledge base from a massive number of high-quality question and answer pairs extracted from web pages makes it possible to learn, from a large volume of information, the semantic relevance between the question word and the answer word of each record; and because the knowledge base is built from information extracted from web pages, the method has a wider scope of application and is more general.
More specifically, the method of this embodiment further includes a step of constructing the question and answer knowledge base. The construction process is substantially the same as the process shown in FIG. 2, and the interpretation model of the question and answer knowledge base of this embodiment is substantially the same as the interpretation model shown in FIG. 3. They are not repeated here.
S630. Optimize the search ranking of the question and answer pairs to be analyzed according to their degrees of association.
Because the degree of association of a question and answer pair to be analyzed reflects its quality, the degree of association can be used to optimize the search ranking of the pairs, giving a better ranking.
Specifically, the order of the degrees of association of the question and answer pairs to be analyzed may be used as their search ranking, that is, pairs with a high degree of association are ranked first. Alternatively, the websites to which the pairs belong may first be ordered preliminarily using a search ranking technique, and the search ranking of the pairs may then be calculated from the sequence number of that preliminary ordering and the degree of association of each pair. For example, the sequence number of the preliminary ordering of the website to which a pair belongs may be multiplied by the degree of association of the pair, and the order of the products used as the search ranking of the pairs. Ordering the pairs by combining their quality with the ranking of the websites to which they belong gives users a better-ordered set of results when they search for question and answer pairs.
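The two ranking variants described above can be sketched in Python as below. This is a hedged illustration only. The function names and tuple layouts are hypothetical, and the second function assumes a preliminary per-site score in which a larger value means a better site; the text speaks of a sequence number and does not state how it is oriented before the multiplication, so that orientation is an assumption of this sketch.

```python
def rank_by_association(pairs):
    """pairs: list of (pair_id, association_degree).

    Pairs with a higher degree of association are ranked first.
    """
    ordered = sorted(pairs, key=lambda p: p[1], reverse=True)
    return [pair_id for pair_id, _ in ordered]


def rank_by_site_and_association(pairs):
    """pairs: list of (pair_id, site_score, association_degree).

    Assumption: site_score is larger for better sites, so the product
    site_score * association_degree is sorted in descending order.
    """
    ordered = sorted(pairs, key=lambda p: p[1] * p[2], reverse=True)
    return [pair_id for pair_id, *_ in ordered]
```

The first variant ignores the hosting site entirely; the second lets a strong site lift a moderately relevant pair above a highly relevant pair from a weak site, which is the trade-off the combined ranking introduces.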
FIG. 7 shows a block diagram of an apparatus for optimizing the search ranking of question and answer pairs according to an embodiment of the present invention. The apparatus includes a question and answer knowledge base 710, a search unit 720, a calculation unit 730, and a search ranking unit 740.
The question and answer knowledge base 710 is adapted to store a plurality of question and answer knowledge records. The question and answer knowledge base 710 of this embodiment can be constructed by crawling a massive number of question and answer pairs from web pages.
The search unit 720 is adapted to receive a search request from a user and, according to the search request, obtain a plurality of question and answer pairs to be analyzed that match the search request.
In an embodiment of the present invention, the search unit 720 may be a question and answer pair search engine that obtains the question and answer pairs to be analyzed according to the user's search request; for example, the search unit 720 is a web search engine for question and answer pair search that receives a search request entered by the user through a browser and obtains the question and answer pairs to be analyzed.
The calculation unit 730 is adapted to obtain the degree of association of each question and answer pair to be analyzed according to the question and answer knowledge base 710.
The calculation unit 730 of the present invention can use the question and answer knowledge base to analyze semantically the question content and the answer content of each question and answer pair to be analyzed, so as to obtain its degree of association; the evaluation is better and easy to implement. The question and answer knowledge base 710 is constructed from a massive number of high-quality question and answer pairs extracted from web pages and includes a plurality of question and answer knowledge records, so the semantic relevance between the question word and the answer word of each record can be learned from a large volume of information.
The search ranking unit 740 is adapted to optimize the search ranking of the question and answer pairs to be analyzed according to their degrees of association.
Because the degree of association of a question and answer pair to be analyzed reflects its quality, the degree of association can be used to optimize the search ranking of the pairs, giving a better ranking. Specifically, the order of the degrees of association of the pairs may be used as their search ranking, that is, pairs with a high degree of association are ranked first. Alternatively, the websites to which the pairs belong may first be ordered preliminarily using a search ranking technique, and the search ranking of the pairs may then be calculated from the sequence number of that preliminary ordering and the degree of association of each pair; for example, the sequence number of the preliminary ordering of the website to which a pair belongs may be multiplied by the degree of association of the pair, and the order of the products used as the search ranking of the pairs.
In this embodiment, the apparatus further includes a question and answer knowledge base construction unit 750, which is adapted to extract a plurality of question and answer pairs in advance from web pages containing question and answer pairs, and to construct, from the extracted pairs, a question and answer knowledge base that includes a plurality of question and answer knowledge records. In the apparatus shown in FIG. 7, the question and answer knowledge base 710 already exists. Because the amount of information on the network keeps growing and its content changes quickly, the content of the knowledge base 710 often needs to be updated. In this embodiment, adding the question and answer knowledge base construction unit 750 to construct (or update) the knowledge base 710 keeps its content current and reliable. The question and answer knowledge base construction unit 750 of this embodiment is the same as the question and answer knowledge base construction unit 400 shown in FIG. 5, and its description is not repeated here.
The calculation unit 730 in FIG. 7 specifically includes a word extraction subunit and an association degree calculation subunit (not shown).
The word extraction subunit is adapted to perform a word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed.
In an embodiment of the present invention, the word extraction subunit is adapted to perform word segmentation, stop-word removal, word joining, and entity-word extraction (for example, of nouns, verbs, and the like) on the question content and the answer content of the question and answer pair to be analyzed, so as to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
The association degree calculation subunit is adapted to select at least one question and answer knowledge record from the question and answer knowledge base according to the question words to be analyzed and the answer words to be analyzed, and to calculate the degree of association of the question and answer pair to be analyzed according to the selected question and answer knowledge records.
In an embodiment of the present invention, the association degree calculation subunit is adapted to select the question and answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed. In this embodiment, a question word matches a question word to be analyzed when the question word to be analyzed is identical to the question word or is a substring of it; likewise, an answer word matches an answer word to be analyzed when the answer word to be analyzed is identical to the answer word or is a substring of it. From the selected question and answer knowledge records that correspond to the same category, the degree of association of the question and answer pair to be analyzed is obtained for each category. More specifically, the semantic relevance values of the selected records corresponding to the same category are weighted (for example, with a weight of 1 or 100) and summed to give the degree of association for that category, so that at least one degree of association is obtained (in this embodiment, the number of degrees of association equals the number of categories to which the question and answer pair to be analyzed corresponds). The maximum of these per-category degrees of association is taken as the degree of association of the question and answer pair to be analyzed.
FIG. 8 shows a flowchart of a method for determining the crawl frequency of a network resource point according to an embodiment of the present invention. The method includes the following steps S810, S820, and S830:
S810. Capture a plurality of question and answer pairs to be analyzed from the network resource point.
In an embodiment of the present invention, for a particular network resource point whose crawl frequency needs to be determined, for example a question and answer community, floor identification may be used to extract the question and answer pairs to be analyzed: the content posted by the original poster (the user who first posts about a question) is taken as the question, and the content of the replies on floor 1, floor 2, and so on (from the users who reply to the post in order) is taken as the answers.
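The floor-based extraction can be sketched in Python as follows. This is a simplified illustration under assumptions: a thread is represented as a list of post texts already in floor order, with the original poster's post first, and the function name `extract_pairs` is hypothetical. Recognizing floors in real page markup is outside the scope of the sketch.

```python
def extract_pairs(thread_posts):
    """Turn one thread into (question, answer) pairs.

    thread_posts: post texts in floor order; the first entry is the
    original poster's question and every later entry is a reply.
    """
    if len(thread_posts) < 2:
        # A thread with no replies yields no question and answer pair.
        return []
    question = thread_posts[0]
    return [(question, reply) for reply in thread_posts[1:]]
```

Each reply becomes its own pair with the same question, so a thread with n replies contributes n pairs to the set whose degrees of association are later evaluated.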
S820. Obtain the degree of association of each question and answer pair to be analyzed according to a question and answer knowledge base that includes a plurality of question and answer knowledge records.
In step S820 of this embodiment, the question and answer knowledge base is used to analyze semantically the question content and the answer content of each question and answer pair to be analyzed, so as to obtain its degree of association; the evaluation is better and easy to implement.
More specifically, the way step S820 of this embodiment obtains the degree of association of a question and answer pair to be analyzed is substantially the same as the method for obtaining the degree of association of a question and answer pair shown in FIGS. 1 and 4, and is not repeated here.
Further, the question and answer knowledge base that includes a plurality of question and answer knowledge records is obtained by extracting a plurality of question and answer pairs in advance from web pages containing question and answer pairs and constructing the knowledge base from the extracted pairs. In an embodiment of the present invention, when the pairs are extracted from the web pages, the category corresponding to each pair is also captured. When the knowledge base is constructed from the extracted pairs, the question and answer knowledge records are then built from the pairs and their corresponding categories. Each question and answer knowledge record in the resulting knowledge base corresponds to one category and includes one question word (QW), one answer word (AW), and the semantic relevance between the question word and the answer word. Constructing the knowledge base from a massive number of high-quality question and answer pairs extracted from web pages makes it possible to learn, from a large volume of information, the semantic relevance between the question word and the answer word of each record; and because the knowledge base is built from information extracted from web pages, the method has a wider scope of application and is more general.
More specifically, the method of this embodiment further includes a step of constructing the question and answer knowledge base. The construction process is substantially the same as the process shown in FIG. 2, and the interpretation model of the question and answer knowledge base of this embodiment is substantially the same as the interpretation model shown in FIG. 3. They are not repeated here.
S830. Determine the crawl frequency of the network resource point according to the degrees of association of the question and answer pairs to be analyzed.
Because the degree of association of a question and answer pair to be analyzed reflects its quality, the degrees of association of the plurality of pairs can be used to determine the quality of the network resource point and, in turn, its crawl frequency.
Specifically, the average of the degrees of association of the question and answer pairs to be analyzed may be used as the crawl frequency of the network resource point, that is, the larger the average degree of association (the better the quality), the higher the crawl frequency of the network resource point (for example, the more often a spider crawler crawls it). Alternatively, a spider crawler may be used to obtain an initial crawl frequency for the network resource point, the average of the degrees of association of the pairs may be calculated, and the average may be used to adjust the initial crawl frequency and so determine the crawl frequency of the network resource point. For example, the initial crawl frequency may be weighted by the average degree of association (including by multiplication, normalization, and the like), so that high-quality network resource points are crawled more often and search quality is improved.
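Both ways of deriving the crawl frequency can be sketched in Python as below. This is a minimal sketch under assumptions: the function names are hypothetical, the adjustment is implemented as a plain multiplication (one of the weighting options the text mentions), and the unit of the initial frequency is left to the caller.

```python
from statistics import mean


def crawl_frequency_from_mean(degrees):
    """Use the average degree of association directly as the frequency."""
    return mean(degrees)


def adjusted_crawl_frequency(initial_frequency, degrees):
    """Scale an initial crawl frequency by the average degree of association.

    Assumption: weighting is done by multiplication; with degrees in the
    range 0 to 1 the result never exceeds the initial frequency.
    """
    return initial_frequency * mean(degrees)
```

Note that `statistics.mean` raises `StatisticsError` on an empty list, so a resource point from which no question and answer pair could be extracted needs separate handling by the caller.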
By analyzing the degree of association of the Q&A pairs to be analyzed that are crawled from a network resource point, and determining the crawl frequency of the network resource point according to that degree of association, this embodiment can improve the accuracy of the crawl results.
FIG. 9 shows a block diagram of an apparatus for determining the crawl frequency of a network resource point according to one embodiment of the present invention. The apparatus includes a Q&A knowledge base 910, a resource analysis unit 920, a calculation unit 930, and a crawl frequency determining unit 940.
The Q&A knowledge base 910 is adapted to store a plurality of Q&A knowledge records. The Q&A knowledge base 910 of this embodiment can be built by crawling a massive number of Q&A pairs from web pages.
The resource analysis unit 920 is adapted to crawl a plurality of Q&A pairs to be analyzed from the network resource point.
In one embodiment of the present invention, for a specific network resource point whose crawl frequency needs to be determined, for example a Q&A community, the resource analysis unit 920 may use floor identification technology to extract the Q&A pairs to be analyzed, treating the post of the original poster (the user who publishes the first post on a question) as the question and the content of the replies on floor 1, floor 2, and so on (the users who reply to the post in order) as the answers.
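The floor-based extraction can be sketched as follows. This is an illustration only: the `Post` type, the convention that floor 0 is the original poster, and the sample thread are assumptions, and real floor identification on crawled pages would first need page parsing, which is not shown.

```python
from dataclasses import dataclass


@dataclass
class Post:
    floor: int  # assumed convention: 0 = original poster, 1, 2, ... = replies in order
    text: str


def extract_qa_pairs(thread):
    """Pair the original poster's question with the content of each reply."""
    ordered = sorted(thread, key=lambda post: post.floor)
    if not ordered:
        return []
    question = ordered[0].text
    return [(question, reply.text) for reply in ordered[1:]]


# Hypothetical thread: one question and two replies.
thread = [
    Post(1, "Hold the reset button for ten seconds."),
    Post(0, "How do I reset the router?"),
    Post(2, "Unplug it and plug it back in."),
]
for question, answer in extract_qa_pairs(thread):
    print(question, "->", answer)
```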
The calculation unit 930 is adapted to obtain the degree of association of each Q&A pair to be analyzed according to the Q&A knowledge base.
The calculation unit 930 of the present invention can use the Q&A knowledge base to analyze, from a semantic perspective, the question content and the answer content of a Q&A pair to be analyzed and thereby obtain its degree of association; the evaluation is more effective and easy to implement. The Q&A knowledge base 910 is built from a massive number of high-quality Q&A pairs extracted from web pages and includes a plurality of Q&A knowledge records; the semantic relevance between the question word and the answer word of each Q&A knowledge record can be obtained by learning from this massive information.
The crawl frequency determining unit 940 is adapted to determine the crawl frequency of the network resource point according to the degree of association of the Q&A pairs to be analyzed.
Because the degree of association of a Q&A pair to be analyzed reflects its quality, the degrees of association of a plurality of Q&A pairs to be analyzed can be used to determine the quality of the network resource point, and from that the crawl frequency of the network resource point. Specifically, the average of the degrees of association of the Q&A pairs to be analyzed may be used as the crawl frequency of the network resource point, so that a network resource point with a larger average degree of association (that is, better quality) has a higher crawl frequency (for example, a spider crawler crawls that network resource point more often). Alternatively, a spider crawler may be used to obtain an initial crawl frequency of the network resource point, the average of the degrees of association of the Q&A pairs to be analyzed is calculated, and that average is used to adjust the initial crawl frequency and thereby determine the crawl frequency of the network resource point. For example, the initial crawl frequency may be weighted by the average degree of association (including by multiplication, normalization, and the like), so that high-quality network resource points are crawled more frequently and search quality can be optimized.
In this embodiment, the apparatus further includes a Q&A knowledge base construction unit 950, which is adapted to extract a plurality of Q&A pairs in advance from web pages containing Q&A pairs and to build, from the extracted Q&A pairs, a Q&A knowledge base that includes a plurality of Q&A knowledge records. In the apparatus shown in FIG. 9, the Q&A knowledge base 910 already exists. Because the amount of information on the actual network keeps growing and its content changes quickly, the content of the Q&A knowledge base 910 often needs to be updated. By adding the Q&A knowledge base construction unit 950 to build (or, in other words, update) the Q&A knowledge base, this embodiment can keep the content of the Q&A knowledge base timely and reliable. The Q&A knowledge base construction unit 950 of this embodiment is the same as the Q&A knowledge base construction unit 400 shown in FIG. 5, and its description is not repeated here.
The calculation unit 930 in FIG. 9 specifically includes a word extraction subunit and a degree-of-association calculation subunit (not shown).
The word extraction subunit is adapted to perform a word extraction operation on the question content and the answer content of the Q&A pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed.
In one embodiment of the present invention, the word extraction subunit is adapted to perform word segmentation, stop-word removal, word joining, and entity-word extraction (for example, nouns and verbs) on the question content and the answer content of the Q&A pair to be analyzed, so as to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
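A simplified sketch of the word extraction operation is given below. It covers only segmentation and stop-word removal, and it assumes English text split on non-alphanumeric characters; word joining and entity-word extraction would need a segmenter and part-of-speech tagger that the specification does not name, so they are omitted. The stop-word list and function name are hypothetical.

```python
import re

# Illustrative stop-word list only; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "how", "do", "i"}


def extract_words(text):
    """Segment text into lowercase tokens, drop stop words, and remove duplicates."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    seen = set()
    words = []
    for token in tokens:
        if token not in STOP_WORDS and token not in seen:
            seen.add(token)
            words.append(token)
    return words


print(extract_words("How do I reset the router?"))  # ['reset', 'router']
```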
The degree-of-association calculation subunit is adapted to select at least one Q&A knowledge record from the Q&A knowledge base according to the question words to be analyzed and the answer words to be analyzed, and to calculate the degree of association of the Q&A pair to be analyzed according to the selected Q&A knowledge records.
In one embodiment of the present invention, the degree-of-association calculation subunit is adapted to select the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed. In this embodiment, a question word matches a question word to be analyzed when the question word to be analyzed is identical to the question word or is a substring of it; likewise, an answer word matches an answer word to be analyzed when the answer word to be analyzed is identical to the answer word or is a substring of it. From the selected Q&A knowledge records that correspond to the same category, the degree of association of the Q&A pair to be analyzed for each category is obtained. More specifically, the semantic relevance values of the selected Q&A knowledge records that correspond to the same category are weighted (for example, with a weight of 1 or 100) and summed to give the degree of association of the Q&A pair to be analyzed for that category, which yields at least one degree of association (in this embodiment, the number of degrees of association equals the number of categories to which the Q&A pair to be analyzed corresponds). The maximum of these per-category degrees of association is taken as the degree of association of the Q&A pair to be analyzed.
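The selection, per-category summation, and maximum described above can be sketched as follows. The tuple layout of a knowledge record, the function names, the sample records, and the return value of 0.0 when no record matches are all assumptions made for illustration; the specification does not prescribe a data format or a no-match behavior.

```python
from collections import defaultdict


def matches(analyzed_word, record_word):
    """True when the analyzed word equals the record's word or is a substring of it."""
    return analyzed_word in record_word


def association_degree(question_words, answer_words, records, weight=1.0):
    """Sum weighted semantic relevance per category, then take the maximum.

    Each record is assumed to be (category, question_word, answer_word, relevance).
    """
    per_category = defaultdict(float)
    for category, q_word, a_word, relevance in records:
        question_hit = any(matches(q, q_word) for q in question_words)
        answer_hit = any(matches(a, a_word) for a in answer_words)
        if question_hit and answer_hit:
            per_category[category] += weight * relevance
    # Assumption: no matching record means a degree of association of 0.0.
    return max(per_category.values(), default=0.0)


# Hypothetical knowledge records.
records = [
    ("networking", "router", "reboot", 0.5),
    ("networking", "router reset", "power", 0.25),
    ("cooking", "router", "reboot", 0.125),
]
print(association_degree(["router"], ["reboot", "power"], records))  # 0.75
```

In this example the "networking" category accumulates 0.5 + 0.25 = 0.75 and "cooking" accumulates 0.125, so the maximum, 0.75, is reported.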
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus for obtaining the degree of association of a Q&A pair, the apparatus for optimizing the search ranking of Q&A pairs, and the apparatus for determining the crawl frequency of a network resource point according to embodiments of the present invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 10 shows a block diagram of a server, such as an application server, for carrying out the method for obtaining the degree of association of a Q&A pair, the method for optimizing the search ranking of Q&A pairs, and the method for determining the crawl frequency of a network resource point according to the present invention. The application server conventionally includes a processor 1010 and a computer program product or computer-readable medium in the form of a memory 1020. The memory 1020 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 1020 has a storage space 1030 for program code 1031 for performing any of the method steps described above. For example, the storage space 1030 for program code may include individual program codes 1031, each implementing one of the steps of the methods above. The program code can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 11. The storage unit may have storage segments, storage space, and the like arranged similarly to the memory 1020 in the application server of FIG. 10. The program code may, for example, be compressed in an appropriate form. Typically, the storage unit includes computer-readable code 1131', that is, code that can be read by a processor such as the processor 1010; when run by a server, this code causes the server to perform the steps of the methods described above.
References herein to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. In addition, note that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
In addition, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.
Claims (52)
- An apparatus for obtaining the degree of association of a Q&A pair, the apparatus comprising: a Q&A knowledge base, adapted to store a plurality of Q&A knowledge records; a word extraction unit, adapted to perform a word extraction operation on the question content and the answer content of a Q&A pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed; and a degree-of-association calculation unit, adapted to select at least one Q&A knowledge record from the Q&A knowledge base according to the question words to be analyzed and the answer words to be analyzed, and to calculate the degree of association of the Q&A pair to be analyzed according to the selected Q&A knowledge records.
- The apparatus according to claim 1, wherein the apparatus further comprises a Q&A knowledge base construction unit; the Q&A knowledge base construction unit is adapted to extract a plurality of Q&A pairs in advance from web pages containing Q&A pairs and to build, from the extracted Q&A pairs, a Q&A knowledge base that includes a plurality of Q&A knowledge records; the Q&A knowledge base construction unit is further adapted, when extracting the plurality of Q&A pairs from the web pages containing Q&A pairs, to crawl the categories corresponding to the Q&A pairs; the Q&A knowledge base construction unit is further adapted, when building the Q&A knowledge base from the extracted Q&A pairs, to build Q&A knowledge records according to the Q&A pairs and the categories corresponding to the Q&A pairs; each Q&A knowledge record corresponds to one category and includes one question word, one answer word, and the semantic relevance between the question word and the answer word.
- The apparatus according to claim 1 or 2, wherein the degree-of-association calculation unit is adapted to: select the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed; obtain, from the selected Q&A knowledge records that correspond to the same category, the degree of association of the Q&A pair to be analyzed for each category; and select the maximum of the degrees of association of the Q&A pair to be analyzed for the respective categories, taking that maximum as the degree of association of the Q&A pair to be analyzed.
- The apparatus according to claim 2, wherein the Q&A knowledge base construction unit is adapted to perform the following operations on each Q&A pair: performing a word extraction operation on the question content and the answer content of the Q&A pair to obtain a question word set and an answer word set; and forming, for each question word in the question word set and each answer word in the answer word set, one information record on each category corresponding to the Q&A pair; and the Q&A knowledge base construction unit is adapted to perform the following operations on each information record: calculating the probability that the answer word belongs to the category, calculating the degree of specificity with which the answer word explains the question word in the category, and calculating the strength with which the question word is explained by the answer word in the category; multiplying the probability, the degree of specificity, and the strength, the resulting product being the semantic relevance of the answer word and the question word; and forming, from the question word, the answer word, and their semantic relevance, one Q&A knowledge record corresponding to the category.
- The apparatus according to any one of claims 1 to 4, wherein the degree-of-association calculation unit is adapted to weight and sum the semantic relevance values of the selected Q&A knowledge records that correspond to the same category, obtaining the degree of association of the Q&A pair to be analyzed for each category.
- The apparatus according to any one of claims 1 to 5, wherein, optionally, the word extraction unit is adapted to perform word segmentation, stop-word removal, word joining, and entity-word extraction on the question content and the answer content of the Q&A pair to be analyzed.
- The apparatus according to any one of claims 1 to 6, wherein the Q&A knowledge base construction unit is adapted to calculate the probability that the answer word belongs to the category according to the following method: [formula not reproduced in this text]; the Q&A knowledge base construction unit is adapted to calculate the degree of specificity with which each answer word explains the question word in the category according to the following method: [formula not reproduced in this text]; the Q&A knowledge base construction unit is adapted to calculate the strength with which the question word is explained by each answer word in the category according to the following method: [formula not reproduced in this text]; the Q&A knowledge base construction unit is adapted to multiply the probability, the degree of specificity, and the strength according to the following method: weight(QW_i, AW_j | C = C_k) = P(C_k | AW_j) * specific(QW_i, AW_j | C = C_k) * interpret(QW_i, AW_j | C = C_k); where P(C_k) denotes the probability that category C_k occurs; P(AW_j) denotes the probability that the answer word is AW_j; P(AW_j | C_k) denotes the probability of answer word AW_j within category C_k; #(QW_i, AW_j) denotes the number of times the question word is QW_i and the answer word is AW_j; and #(AW_j) denotes the number of times the answer word is AW_j.
- An apparatus for optimizing the search ranking of Q&A pairs, the apparatus comprising: a Q&A knowledge base, adapted to store a plurality of Q&A knowledge records; a search unit, adapted to receive a search request from a user and to obtain, according to the search request, a plurality of Q&A pairs to be analyzed that match the search request; a calculation unit, adapted to obtain the degree of association of each Q&A pair to be analyzed according to the Q&A knowledge base; and a search ranking unit, adapted to optimize the search ranking of the Q&A pairs to be analyzed according to their degrees of association.
- The apparatus according to claim 8, wherein the calculation unit comprises: a word extraction subunit, adapted to perform a word extraction operation on the question content and the answer content of a Q&A pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed; and a degree-of-association calculation subunit, adapted to select at least one Q&A knowledge record from the Q&A knowledge base according to the question words to be analyzed and the answer words to be analyzed, and to calculate the degree of association of the Q&A pair to be analyzed according to the selected Q&A knowledge records.
- The apparatus according to claim 8 or 9, wherein the search ranking unit is adapted to use the order of the degrees of association of the Q&A pairs to be analyzed as the search ranking of the Q&A pairs to be analyzed.
- The apparatus according to any one of claims 8 to 10, wherein the apparatus further comprises a Q&A knowledge base construction unit; the Q&A knowledge base construction unit is adapted to extract a plurality of Q&A pairs in advance from web pages containing Q&A pairs and to build, from the extracted Q&A pairs, a Q&A knowledge base that includes a plurality of Q&A knowledge records; the Q&A knowledge base construction unit is further adapted, when extracting the plurality of Q&A pairs from the web pages containing Q&A pairs, to crawl the categories corresponding to the Q&A pairs; the Q&A knowledge base construction unit is further adapted, when building the Q&A knowledge base from the extracted Q&A pairs, to build Q&A knowledge records according to the Q&A pairs and the categories corresponding to the Q&A pairs; each Q&A knowledge record corresponds to one category and includes one question word, one answer word, and the semantic relevance between the question word and the answer word.
- The apparatus according to any one of claims 8 to 11, wherein the degree-of-association calculation subunit is adapted to: select the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed; obtain, from the selected Q&A knowledge records that correspond to the same category, the degree of association of the Q&A pair to be analyzed for each category; and select the maximum of the degrees of association of the Q&A pair to be analyzed for the respective categories, taking that maximum as the degree of association of the Q&A pair to be analyzed.
- The apparatus according to any one of claims 8 to 12, wherein the degree-of-association calculation subunit is adapted to weight and sum the semantic relevance values of the selected Q&A knowledge records that correspond to the same category, obtaining the degree of association of the Q&A pair to be analyzed for each category.
- The apparatus according to any one of claims 8 to 13, wherein the word extraction subunit is adapted to perform word segmentation, stop-word removal, word joining, and entity-word extraction on the question content and the answer content of the Q&A pair to be analyzed.
- The apparatus according to any one of claims 8 to 14, wherein the Q&A knowledge base construction unit is adapted to perform the following operations on each Q&A pair: performing a word extraction operation on the question content and the answer content of the Q&A pair to obtain a question word set and an answer word set; and forming, for each question word in the question word set and each answer word in the answer word set, one information record on each category corresponding to the Q&A pair; and the Q&A knowledge base construction unit is adapted to perform the following operations on each information record: calculating the probability that the answer word belongs to the category, calculating the degree of specificity with which the answer word explains the question word in the category, and calculating the strength with which the question word is explained by the answer word in the category; multiplying the probability, the degree of specificity, and the strength, the resulting product being the semantic relevance of the answer word and the question word; and forming, from the question word, the answer word, and their semantic relevance, one Q&A knowledge record corresponding to the category.
- The apparatus according to any one of claims 8 to 15, wherein the Q&A knowledge base construction unit is adapted to calculate the probability that the answer word belongs to the category according to the following method: [formula not reproduced in this text]; the Q&A knowledge base construction unit is adapted to calculate the degree of specificity with which each answer word explains the question word in the category according to the following method: [formula not reproduced in this text]; the Q&A knowledge base construction unit is adapted to calculate the strength with which the question word is explained by each answer word in the category according to the following method: [formula not reproduced in this text]; the Q&A knowledge base construction unit is adapted to multiply the probability, the degree of specificity, and the strength according to the following method: weight(QW_i, AW_j | C = C_k) = P(C_k | AW_j) * specific(QW_i, AW_j | C = C_k) * interpret(QW_i, AW_j | C = C_k); where P(C_k) denotes the probability that category C_k occurs; P(AW_j) denotes the probability that the answer word is AW_j; P(AW_j | C_k) denotes the probability of answer word AW_j within category C_k; #(QW_i, AW_j) denotes the number of times the question word is QW_i and the answer word is AW_j; and #(AW_j) denotes the number of times the answer word is AW_j.
- A device for determining a crawl frequency of a network resource point, the device comprising: a question-and-answer knowledge base adapted to store a plurality of question-and-answer knowledge records; a resource analysis unit adapted to crawl a plurality of question-answer pairs to be analyzed from the network resource point; a calculation unit adapted to obtain, according to the question-and-answer knowledge base, the degree of association of each question-answer pair to be analyzed; and a crawl frequency determining unit that determines the crawl frequency of the network resource point according to the degrees of association of the question-answer pairs to be analyzed.
- The device according to claim 17, wherein the calculation unit comprises: a word extraction subunit adapted to perform a word extraction operation on the question content and the answer content of a question-answer pair to be analyzed to obtain at least one question word to be analyzed and at least one answer word to be analyzed; and an association degree calculation subunit adapted to select, according to the question words to be analyzed and the answer words to be analyzed, at least one question-and-answer knowledge record from the question-and-answer knowledge base, and to calculate the degree of association of the question-answer pair to be analyzed according to the selected records.
- The device according to claim 17 or 18, wherein the crawl frequency determining unit is adapted to use the average of the degrees of association of the question-answer pairs to be analyzed as the crawl frequency of the network resource point; or to obtain an initial crawl frequency of the network resource point using a spider crawler, calculate the average of the degrees of association of the question-answer pairs to be analyzed, and adjust the initial crawl frequency using the average to determine the crawl frequency of the network resource point.
- The device according to any one of claims 17 to 19, further comprising a question-and-answer knowledge base construction unit, wherein the construction unit is adapted to extract in advance a plurality of question-answer pairs from web pages containing question-answer pairs and to construct, from the extracted pairs, the question-and-answer knowledge base comprising a plurality of question-and-answer knowledge records; the construction unit is further adapted to crawl, when extracting the question-answer pairs from the web pages, the categories corresponding to the question-answer pairs; and the construction unit is further adapted to construct, when building the knowledge base from the extracted pairs, the question-and-answer knowledge records according to the question-answer pairs and their corresponding categories; each question-and-answer knowledge record corresponds to one category and comprises one question word, one answer word, and the semantic relevance between the question word and the answer word.
- The device according to any one of claims 17 to 20, wherein the association degree calculation subunit is adapted to: select the question-and-answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed; obtain, from those of the selected records that correspond to the same category, the degree of association of the question-answer pair to be analyzed for each category; and take the maximum of the per-category degrees of association as the degree of association of the question-answer pair to be analyzed.
- The device according to any one of claims 17 to 21, wherein the association degree calculation subunit is adapted to compute a weighted sum of the semantic relevances of those selected question-and-answer knowledge records that correspond to the same category, to obtain the degree of association of the question-answer pair to be analyzed for each category.
- The device according to any one of claims 17 to 22, wherein the word extraction subunit is adapted to perform word segmentation, stop-word removal, word merging, and entity-word extraction on the question content and the answer content of the question-answer pair to be analyzed.
- The device according to any one of claims 17 to 23, wherein the question-and-answer knowledge base construction unit is adapted to perform the following operations for each question-answer pair: performing a word extraction operation on the question content and the answer content of the pair to obtain a question word set and an answer word set; and forming one information record, for each category corresponding to the pair, from each question word in the question word set and each answer word in the answer word set; and the construction unit is further adapted to perform the following operations for each information record: calculating the probability that the answer word belongs to the category, the degree of specificity with which the answer word explains the question word in the category, and the strength with which the question word is explained by the answer word in the category; multiplying the probability, the degree of specificity, and the strength, the resulting product being the semantic relevance between the answer word and the question word; and forming, from the question word, the answer word, and their semantic relevance, one question-and-answer knowledge record corresponding to the category.
- The device according to any one of claims 17 to 24, wherein the question-and-answer knowledge base construction unit is adapted to calculate, each by a formula not reproduced in this text, the probability that the answer word belongs to the category, the degree of specificity with which each answer word explains the question word in the category, and the strength with which the question word is explained by each answer word in the category; and to multiply the probability, the degree of specificity, and the strength as follows: weight(QW_i, AW_j | C = C_k) = P(C_k | AW_j) * specific(QW_i, AW_j | C = C_k) * interpret(QW_i, AW_j | C = C_k); where P(C_k) denotes the probability that category C_k occurs; P(AW_j) denotes the probability that the answer is AW_j; P(AW_j | C_k) denotes the probability that category C_k belongs to AW_j; #(QW_i, AW_j) denotes the number of times the question word is QW_i and the answer word is AW_j; and #(AW_j) denotes the number of times the answer word is AW_j.
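The two alternatives for determining a crawl frequency in the device claims above can be sketched as follows. This is a hedged illustration only: the function name `crawl_frequency` is hypothetical, and because the claims do not say how the average is used to adjust the initial frequency, simple multiplicative scaling is assumed here.

```python
from statistics import mean
from typing import Optional, Sequence


def crawl_frequency(
    association_degrees: Sequence[float],
    initial_frequency: Optional[float] = None,
) -> float:
    """Derive a crawl frequency from per-pair degrees of association.

    With no initial frequency, the average degree of association is used
    directly (first claimed alternative). Otherwise the initial frequency,
    for example one reported by a spider crawler, is scaled by the average.
    The scaling rule is an assumption; the source does not specify it.
    """
    if not association_degrees:
        raise ValueError("at least one question-answer pair is required")
    average = mean(association_degrees)
    if initial_frequency is None:
        return average
    return initial_frequency * average  # assumed adjustment rule


if __name__ == "__main__":
    print(crawl_frequency([0.2, 0.4, 0.6]))
    print(crawl_frequency([0.5, 0.5], initial_frequency=10.0))
```

Under this assumed rule, a resource point whose question-answer pairs are weakly associated is visited less often than its initial schedule suggests.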
- A method of obtaining the degree of association of a question-answer pair, the method comprising the following steps: performing a word extraction operation on the question content and the answer content of a question-answer pair to be analyzed to obtain at least one question word to be analyzed and at least one answer word to be analyzed; and selecting, according to the question words to be analyzed and the answer words to be analyzed, at least one question-and-answer knowledge record from a question-and-answer knowledge base comprising a plurality of question-and-answer knowledge records, and calculating the degree of association of the question-answer pair to be analyzed according to the selected records.
- The method according to claim 26, further comprising: extracting in advance a plurality of question-answer pairs from web pages containing question-answer pairs, and constructing, from the extracted pairs, the question-and-answer knowledge base comprising a plurality of question-and-answer knowledge records; crawling, when extracting the question-answer pairs from the web pages, the categories corresponding to the question-answer pairs; and constructing, when building the knowledge base from the extracted pairs, the question-and-answer knowledge records according to the question-answer pairs and their corresponding categories; wherein each question-and-answer knowledge record corresponds to one category and comprises one question word, one answer word, and the semantic relevance between the question word and the answer word.
- The method according to claim 26 or 27, wherein selecting at least one question-and-answer knowledge record from the question-and-answer knowledge base according to the question words to be analyzed and the answer words to be analyzed, and calculating the degree of association of the question-answer pair to be analyzed according to the selected records, comprises: selecting the question-and-answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed; obtaining, from those of the selected records that correspond to the same category, the degree of association of the question-answer pair to be analyzed for each category; and taking the maximum of the per-category degrees of association as the degree of association of the question-answer pair to be analyzed.
- The method according to any one of claims 26 to 28, wherein obtaining the degree of association of the question-answer pair to be analyzed for each category from those of the selected records that correspond to the same category comprises: computing a weighted sum of the semantic relevances of the selected question-and-answer knowledge records that correspond to the same category, to obtain the degree of association of the question-answer pair to be analyzed for each category.
- The method according to any one of claims 26 to 29, wherein constructing the question-and-answer knowledge base according to the question-answer pairs and their corresponding categories comprises: for each question-answer pair, performing a word extraction operation on the question content and the answer content of the pair to obtain a question word set and an answer word set; forming one information record, for each category corresponding to the pair, from each question word in the question word set and each answer word in the answer word set; and, for each information record: calculating the probability that the answer word belongs to the category, the degree of specificity with which the answer word explains the question word in the category, and the strength with which the question word is explained by the answer word in the category; multiplying the probability, the degree of specificity, and the strength, the resulting product being the semantic relevance between the answer word and the question word; and forming, from the question word, the answer word, and their semantic relevance, one question-and-answer knowledge record corresponding to the category.
- The method according to any one of claims 26 to 30, wherein the probability that the answer word belongs to the category, the degree of specificity with which each answer word explains the question word in the category, and the strength with which the question word is explained by each answer word in the category are each calculated by a formula not reproduced in this text; and multiplying the probability, the degree of specificity, and the strength comprises: weight(QW_i, AW_j | C = C_k) = P(C_k | AW_j) * specific(QW_i, AW_j | C = C_k) * interpret(QW_i, AW_j | C = C_k); where P(C_k) denotes the probability that category C_k occurs; P(AW_j) denotes the probability that the answer is AW_j; P(AW_j | C_k) denotes the probability that category C_k belongs to AW_j; #(QW_i, AW_j) denotes the number of times the question word is QW_i and the answer word is AW_j; and #(AW_j) denotes the number of times the answer word is AW_j.
- The method according to any one of claims 26 to 31, wherein performing the word extraction operation on the question content and the answer content of the question-answer pair to be analyzed comprises: performing word segmentation, stop-word removal, word merging, and entity-word extraction on the question content and the answer content of the question-answer pair to be analyzed.
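The selection, per-category summation, and maximum steps of the method claims above might look like the following in Python. This is a sketch under assumptions rather than the claimed implementation: `KnowledgeRecord` and `association_degree` are hypothetical names, word extraction is assumed to have been done already, matching is assumed to be exact string equality, and every coefficient of the weighted sum is assumed to be 1.0 because the claims do not give the coefficients.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class KnowledgeRecord(NamedTuple):
    """One question-and-answer knowledge record (hypothetical layout)."""
    question_word: str
    answer_word: str
    category: str
    relevance: float  # semantic relevance of the word pair in this category


def association_degree(
    question_words: Set[str],
    answer_words: Set[str],
    knowledge_base: Iterable[KnowledgeRecord],
) -> float:
    """Return the maximum per-category sum of matching relevances."""
    per_category: Dict[str, float] = defaultdict(float)
    for record in knowledge_base:
        # Select records whose question word and answer word both match.
        if record.question_word in question_words and record.answer_word in answer_words:
            # Weighted sum per category; coefficient 1.0 is an assumption.
            per_category[record.category] += record.relevance
    # The pair's degree of association is the best category score,
    # or 0.0 when no record matches (also an assumption).
    return max(per_category.values(), default=0.0)
```

For example, with records for ("fever", "aspirin") and ("fever", "rest") in a health category and a weaker ("fever", "aspirin") record in another category, the health total is the one returned.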
- A method for optimizing the search ranking of question-answer pairs, the method comprising the following steps: receiving a search request from a user and obtaining, according to the search request, a plurality of question-answer pairs to be analyzed that match the search request; obtaining the degree of association of each question-answer pair to be analyzed according to a question-and-answer knowledge base comprising a plurality of question-and-answer knowledge records; and optimizing the search ranking of the question-answer pairs to be analyzed according to their degrees of association.
- The method according to claim 33, wherein obtaining the degree of association of each question-answer pair to be analyzed according to the question-and-answer knowledge base comprising a plurality of question-and-answer knowledge records comprises performing the following operations for each question-answer pair to be analyzed: performing a word extraction operation on the question content and the answer content of the pair to obtain at least one question word to be analyzed and at least one answer word to be analyzed; and selecting, according to the question words to be analyzed and the answer words to be analyzed, at least one question-and-answer knowledge record from the question-and-answer knowledge base, and calculating the degree of association of the pair according to the selected records.
- The method according to claim 33 or 34, wherein adjusting the search ranking of the question-answer pairs to be analyzed according to their degrees of association comprises: using the order of the degrees of association of the question-answer pairs to be analyzed as their search ranking.
- The method according to any one of claims 33 to 35, further comprising: extracting in advance a plurality of question-answer pairs from web pages containing question-answer pairs, and constructing, from the extracted pairs, the question-and-answer knowledge base comprising a plurality of question-and-answer knowledge records; crawling, when extracting the question-answer pairs from the web pages, the categories corresponding to the question-answer pairs; and constructing, when building the knowledge base from the extracted pairs, the question-and-answer knowledge records according to the question-answer pairs and their corresponding categories; wherein each question-and-answer knowledge record corresponds to one category and comprises one question word, one answer word, and the semantic relevance between the question word and the answer word.
- The method according to any one of claims 33 to 36, wherein selecting at least one question-and-answer knowledge record from the question-and-answer knowledge base according to the question words to be analyzed and the answer words to be analyzed, and calculating the degree of association of the question-answer pair to be analyzed according to the selected records, comprises: selecting the question-and-answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed; obtaining, from those of the selected records that correspond to the same category, the degree of association of the question-answer pair to be analyzed for each category; and taking the maximum of the per-category degrees of association as the degree of association of the question-answer pair to be analyzed.
- The method according to any one of claims 33 to 37, wherein obtaining the degree of association of the question-answer pair to be analyzed for each category from those of the selected records that correspond to the same category comprises: computing a weighted sum of the semantic relevances of the selected question-and-answer knowledge records that correspond to the same category, to obtain the degree of association of the question-answer pair to be analyzed for each category.
- The method according to any one of claims 33 to 38, wherein performing the word extraction operation on the question content and the answer content of the question-answer pair to be analyzed comprises: performing word segmentation, stop-word removal, word merging, and entity-word extraction on the question content and the answer content of the question-answer pair to be analyzed.
- The method according to any one of claims 33 to 39, wherein constructing the question-and-answer knowledge base according to the question-answer pairs and their corresponding categories comprises: for each question-answer pair, performing a word extraction operation on the question content and the answer content of the pair to obtain a question word set and an answer word set; forming one information record, for each category corresponding to the pair, from each question word in the question word set and each answer word in the answer word set; and, for each information record: calculating the probability that the answer word belongs to the category, the degree of specificity with which the answer word explains the question word in the category, and the strength with which the question word is explained by the answer word in the category; multiplying the probability, the degree of specificity, and the strength, the resulting product being the semantic relevance between the answer word and the question word; and forming, from the question word, the answer word, and their semantic relevance, one question-and-answer knowledge record corresponding to the category.
- The method according to any one of claims 33 to 40, wherein the probability that the answer word belongs to the category, the degree of specificity with which each answer word explains the question word in the category, and the strength with which the question word is explained by each answer word in the category are each calculated by a formula not reproduced in this text; and multiplying the probability, the degree of specificity, and the strength comprises: weight(QW_i, AW_j | C = C_k) = P(C_k | AW_j) * specific(QW_i, AW_j | C = C_k) * interpret(QW_i, AW_j | C = C_k); where P(C_k) denotes the probability that category C_k occurs; P(AW_j) denotes the probability that the answer is AW_j; P(AW_j | C_k) denotes the probability that category C_k belongs to AW_j; #(QW_i, AW_j) denotes the number of times the question word is QW_i and the answer word is AW_j; and #(AW_j) denotes the number of times the answer word is AW_j.
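Ranking matched question-answer pairs by their degree of association, as in the search-ranking method claims above, reduces to a sort. The sketch below assumes that a higher degree of association should rank first, which the claims imply but do not state, and uses a hypothetical `rank_by_association` helper that accepts any scoring callable.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def rank_by_association(pairs: Sequence[T], degree_of: Callable[[T], float]) -> List[T]:
    """Order matched question-answer pairs by descending degree of association.

    `degree_of` is a caller-supplied scoring function (hypothetical here).
    Python's sort is stable, so pairs with equal scores keep the order in
    which the search backend returned them.
    """
    return sorted(pairs, key=degree_of, reverse=True)


if __name__ == "__main__":
    # Illustrative scores only; they are not taken from the patent.
    scores = {"pair-a": 0.2, "pair-b": 0.9, "pair-c": 0.5}
    print(rank_by_association(list(scores), scores.__getitem__))
```

Passing the scorer as a parameter keeps the ranking step independent of how the knowledge base computes each degree of association.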
- A method for determining a crawl frequency of a network resource point, the method comprising the following steps: crawling a plurality of question-answer pairs to be analyzed from the network resource point; obtaining the degree of association of each question-answer pair to be analyzed according to a question-and-answer knowledge base comprising a plurality of question-and-answer knowledge records; and determining the crawl frequency of the network resource point according to the degrees of association of the question-answer pairs to be analyzed.
- The method according to claim 42, wherein obtaining the degree of association of each question-answer pair to be analyzed according to the question-and-answer knowledge base comprising a plurality of question-and-answer knowledge records comprises performing the following operations for each question-answer pair to be analyzed: performing a word extraction operation on the question content and the answer content of the pair to obtain at least one question word to be analyzed and at least one answer word to be analyzed; and selecting, according to the question words to be analyzed and the answer words to be analyzed, at least one question-and-answer knowledge record from the question-and-answer knowledge base, and calculating the degree of association of the pair according to the selected records.
- The method according to claim 42 or 43, wherein determining the crawl frequency of the network resource point according to the degrees of association of the question-answer pairs to be analyzed comprises: using the average of the degrees of association of the question-answer pairs to be analyzed as the crawl frequency of the network resource point; or obtaining an initial crawl frequency of the network resource point using a spider crawler, calculating the average of the degrees of association of the question-answer pairs to be analyzed, and adjusting the initial crawl frequency using the average to determine the crawl frequency of the network resource point.
- The method according to any one of claims 42 to 44, further comprising: extracting in advance a plurality of question-answer pairs from web pages containing question-answer pairs, and constructing, from the extracted pairs, the question-and-answer knowledge base comprising a plurality of question-and-answer knowledge records; crawling, when extracting the question-answer pairs from the web pages, the categories corresponding to the question-answer pairs; and constructing, when building the knowledge base from the extracted pairs, the question-and-answer knowledge records according to the question-answer pairs and their corresponding categories; wherein each question-and-answer knowledge record corresponds to one category and comprises one question word, one answer word, and the semantic relevance between the question word and the answer word.
- The method according to any one of claims 42 to 45, wherein selecting at least one question-and-answer knowledge record from the question-and-answer knowledge base according to the question words to be analyzed and the answer words to be analyzed, and calculating the degree of association of the question-answer pair to be analyzed according to the selected records, comprises: selecting the question-and-answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed; obtaining, from those of the selected records that correspond to the same category, the degree of association of the question-answer pair to be analyzed for each category; and taking the maximum of the per-category degrees of association as the degree of association of the question-answer pair to be analyzed.
- The method according to any one of claims 42 to 46, wherein obtaining the degree of association of the question-answer pair to be analyzed for each category from those of the selected records that correspond to the same category comprises: computing a weighted sum of the semantic relevances of the selected question-and-answer knowledge records that correspond to the same category, to obtain the degree of association of the question-answer pair to be analyzed for each category.
- The method according to any one of claims 42 to 47, wherein performing the word extraction operation on the question content and the answer content of the question-answer pair to be analyzed comprises: performing word segmentation, stop-word removal, word merging, and entity-word extraction on the question content and the answer content of the question-answer pair to be analyzed.
- The method according to any one of claims 42 to 48, wherein constructing the question-and-answer knowledge base according to the question-answer pairs and their corresponding categories comprises: for each question-answer pair, performing a word extraction operation on the question content and the answer content of the pair to obtain a question word set and an answer word set; forming one information record, for each category corresponding to the pair, from each question word in the question word set and each answer word in the answer word set; and, for each information record: calculating the probability that the answer word belongs to the category, the degree of specificity with which the answer word explains the question word in the category, and the strength with which the question word is explained by the answer word in the category; multiplying the probability, the degree of specificity, and the strength, the resulting product being the semantic relevance between the answer word and the question word; and forming, from the question word, the answer word, and their semantic relevance, one question-and-answer knowledge record corresponding to the category.
- The method according to any one of claims 42 to 49, wherein
  calculating the probability that the answer word belongs to the category specifically comprises:
  calculating, in the category, the degree of specificity with which each answer word explains the question word specifically comprises:
  calculating, in the category, the strength with which the question word is explained by each answer word specifically comprises:
  multiplying the probability, the degree of specificity, and the strength specifically comprises:
  $\mathrm{weight}(QW_i, AW_j \mid C = C_k) = P(C_k \mid AW_j) \cdot \mathrm{specific}(QW_i, AW_j \mid C = C_k) \cdot \mathrm{interpret}(QW_i, AW_j \mid C = C_k)$;
  where $P(C_k)$ denotes the probability that category $C_k$ occurs; $P(AW_j)$ denotes the probability that the answer word is $AW_j$; $P(AW_j \mid C_k)$ denotes the probability that the answer word is $AW_j$ within category $C_k$;
  $\#(QW_i, AW_j)$ denotes the number of times the question word is $QW_i$ and the answer word is $AW_j$;
  $\#(AW_j)$ denotes the number of times the answer word is $AW_j$.
- A computer program comprising computer-readable code which, when run on a computing device, causes the computing device to perform the method according to any one of claims 26 to 50.
- A computer-readable medium storing the computer program according to claim 51.
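The knowledge-base construction recited in the claims above (one information record per question word, answer word, and category, weighted by the product of probability, specificity, and strength) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function name `build_knowledge_base`, the input layout, and the three scoring callables `probability`, `specificity`, and `strength` are hypothetical names introduced for illustration. The claim text does not reproduce the formulas for the three factors, so they are passed in as parameters rather than defined.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, Iterable, Set, Tuple

# Hypothetical input layout: (question word set, answer word set, category set).
QAPair = Tuple[Set[str], Set[str], Set[str]]


def build_knowledge_base(
    qa_pairs: Iterable[QAPair],
    probability: Callable[[str, str], float],       # P(C_k | AW_j); args: (category, answer word)
    specificity: Callable[[str, str, str], float],  # specific(QW_i, AW_j | C = C_k)
    strength: Callable[[str, str, str], float],     # interpret(QW_i, AW_j | C = C_k)
) -> Dict[str, Dict[Tuple[str, str], float]]:
    """Return {category: {(question word, answer word): semantic relevance}}."""
    # One information record per (question word, answer word, category) triple.
    records = set()
    for question_words, answer_words, categories in qa_pairs:
        records.update(product(question_words, answer_words, categories))

    knowledge_base = defaultdict(dict)
    for qw, aw, category in records:
        # Semantic relevance is the product of the three factors.
        weight = (
            probability(category, aw)
            * specificity(qw, aw, category)
            * strength(qw, aw, category)
        )
        knowledge_base[category][(qw, aw)] = weight
    return dict(knowledge_base)
```

Passing the three factors as callables keeps the sketch independent of how they are estimated from corpus counts such as $\#(QW_i, AW_j)$ and $\#(AW_j)$; any estimator with the same arguments can be plugged in.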
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310495881.4 | 2013-10-21 | ||
CN201310495881.4A CN103577558B (en) | 2013-10-21 | 2013-10-21 | Device and method for optimizing search ranking of frequently asked question and answer pairs |
CN201310495641.4A CN103577556B (en) | 2013-10-21 | 2013-10-21 | Device and method for obtaining association degree of question and answer pair |
CN201310495641.4 | 2013-10-21 | ||
CN201310495856.6 | 2013-10-21 | ||
CN201310495856.6A CN103577557B (en) | 2013-10-21 | 2013-10-21 | Device and method for determining capturing frequency of network resource point |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015058604A1 (en) | 2015-04-30 |
Family
ID=52992233
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2014/086838 WO2015058604A1 (en) | 2013-10-21 | 2014-09-18 | Apparatus and method for obtaining degree of association of question and answer pair and for search ranking optimization |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2015058604A1 (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9760627B1 (en) | 2016-05-13 | 2017-09-12 | International Business Machines Corporation | Private-public context analysis for natural language content disambiguation |
CN108717433A (en) * | 2018-05-14 | 2018-10-30 | 南京邮电大学 | Knowledge base construction method and device for a programming-domain question answering system |
CN109934347A (en) * | 2017-12-18 | 2019-06-25 | 上海智臻智能网络科技股份有限公司 | Device for expanding question-answer knowledge base |
CN110019729A (en) * | 2017-12-25 | 2019-07-16 | 上海智臻智能网络科技股份有限公司 | Intelligent answer method and storage medium, terminal |
CN110019838A (en) * | 2017-12-25 | 2019-07-16 | 上海智臻智能网络科技股份有限公司 | Intelligent Answer System and intelligent terminal |
US10361981B2 (en) | 2015-05-15 | 2019-07-23 | Microsoft Technology Licensing, Llc | Automatic extraction of commitments and requests from communications and content |
CN110334272A (en) * | 2019-05-29 | 2019-10-15 | 平安科技(深圳)有限公司 | Intelligent question-answering method and device based on knowledge graph and computer storage medium |
CN110580313A (en) * | 2018-06-08 | 2019-12-17 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN111382235A (en) * | 2018-12-27 | 2020-07-07 | 上海智臻智能网络科技股份有限公司 | Question-answer knowledge base optimization method and device |
CN111552789A (en) * | 2020-04-27 | 2020-08-18 | 中国银行股份有限公司 | Self-learning method and device for customer service knowledge base |
CN111984768A (en) * | 2019-05-24 | 2020-11-24 | 北京京东尚科信息技术有限公司 | Corpus processing and question-answer interaction method and device, computer equipment and storage medium |
US10984387B2 (en) | 2011-06-28 | 2021-04-20 | Microsoft Technology Licensing, Llc | Automatic task extraction and calendar entry |
CN113239164A (en) * | 2021-05-13 | 2021-08-10 | 杭州摸象大数据科技有限公司 | Multi-round conversation process construction method and device, computer equipment and storage medium |
CN113807512A (en) * | 2020-06-12 | 2021-12-17 | 株式会社理光 | Training method and device of machine reading understanding model and readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101520802A (en) * | 2009-04-13 | 2009-09-02 | 腾讯科技(深圳)有限公司 | Question-answer pair quality evaluation method and system |
CN101986293A (en) * | 2010-09-03 | 2011-03-16 | 百度在线网络技术(北京)有限公司 | Method and equipment for displaying search answer information on search interface |
US20120078826A1 (en) * | 2010-09-29 | 2012-03-29 | International Business Machines Corporation | Fact checking using and aiding probabilistic question answering |
US8346701B2 (en) * | 2009-01-23 | 2013-01-01 | Microsoft Corporation | Answer ranking in community question-answering sites |
CN102884527A (en) * | 2010-04-06 | 2013-01-16 | 新加坡国立大学 | Automatic frequently asked question compilation from community-based question answering archive |
- 2014-09-18: WO application PCT/CN2014/086838 filed (WO2015058604A1); status: active, Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8346701B2 (en) * | 2009-01-23 | 2013-01-01 | Microsoft Corporation | Answer ranking in community question-answering sites |
CN101520802A (en) * | 2009-04-13 | 2009-09-02 | 腾讯科技(深圳)有限公司 | Question-answer pair quality evaluation method and system |
CN102884527A (en) * | 2010-04-06 | 2013-01-16 | 新加坡国立大学 | Automatic frequently asked question compilation from community-based question answering archive |
CN101986293A (en) * | 2010-09-03 | 2011-03-16 | 百度在线网络技术(北京)有限公司 | Method and equipment for displaying search answer information on search interface |
US20120078826A1 (en) * | 2010-09-29 | 2012-03-29 | International Business Machines Corporation | Fact checking using and aiding probabilistic question answering |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10984387B2 (en) | 2011-06-28 | 2021-04-20 | Microsoft Technology Licensing, Llc | Automatic task extraction and calendar entry |
US10361981B2 (en) | 2015-05-15 | 2019-07-23 | Microsoft Technology Licensing, Llc | Automatic extraction of commitments and requests from communications and content |
US9760627B1 (en) | 2016-05-13 | 2017-09-12 | International Business Machines Corporation | Private-public context analysis for natural language content disambiguation |
CN109934347A (en) * | 2017-12-18 | 2019-06-25 | 上海智臻智能网络科技股份有限公司 | Device for expanding question-answer knowledge base |
CN109934347B (en) * | 2017-12-18 | 2024-02-02 | 上海智臻智能网络科技股份有限公司 | Device for expanding question-answer knowledge base |
CN110019729A (en) * | 2017-12-25 | 2019-07-16 | 上海智臻智能网络科技股份有限公司 | Intelligent answer method and storage medium, terminal |
CN110019838A (en) * | 2017-12-25 | 2019-07-16 | 上海智臻智能网络科技股份有限公司 | Intelligent Answer System and intelligent terminal |
CN110019729B (en) * | 2017-12-25 | 2024-03-15 | 上海智臻智能网络科技股份有限公司 | Intelligent question-answering method, storage medium and terminal |
CN108717433A (en) * | 2018-05-14 | 2018-10-30 | 南京邮电大学 | Knowledge base construction method and device for a programming-domain question answering system |
CN110580313B (en) * | 2018-06-08 | 2024-02-02 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN110580313A (en) * | 2018-06-08 | 2019-12-17 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN111382235A (en) * | 2018-12-27 | 2020-07-07 | 上海智臻智能网络科技股份有限公司 | Question-answer knowledge base optimization method and device |
CN111984768A (en) * | 2019-05-24 | 2020-11-24 | 北京京东尚科信息技术有限公司 | Corpus processing and question-answer interaction method and device, computer equipment and storage medium |
CN110334272B (en) * | 2019-05-29 | 2022-04-12 | 平安科技(深圳)有限公司 | Intelligent question-answering method and device based on knowledge graph and computer storage medium |
CN110334272A (en) * | 2019-05-29 | 2019-10-15 | 平安科技(深圳)有限公司 | Intelligent question-answering method and device based on knowledge graph and computer storage medium |
CN111552789A (en) * | 2020-04-27 | 2020-08-18 | 中国银行股份有限公司 | Self-learning method and device for customer service knowledge base |
CN113807512A (en) * | 2020-06-12 | 2021-12-17 | 株式会社理光 | Training method and device of machine reading understanding model and readable storage medium |
CN113807512B (en) * | 2020-06-12 | 2024-01-23 | 株式会社理光 | Training method and device for machine reading understanding model and readable storage medium |
CN113239164B (en) * | 2021-05-13 | 2023-07-04 | 杭州摸象大数据科技有限公司 | Multi-round dialogue flow construction method and device, computer equipment and storage medium |
CN113239164A (en) * | 2021-05-13 | 2021-08-10 | 杭州摸象大数据科技有限公司 | Multi-round conversation process construction method and device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2015058604A1 (en) | Apparatus and method for obtaining degree of association of question and answer pair and for search ranking optimization | |
US10831769B2 (en) | Search method and device for asking type query based on deep question and answer | |
US9558264B2 (en) | Identifying and displaying relationships between candidate answers | |
CN103577558B (en) | Device and method for optimizing search ranking of frequently asked question and answer pairs | |
US9740769B2 (en) | Interpreting and distinguishing lack of an answer in a question answering system | |
JP7153004B2 (en) | COMMUNITY Q&A DATA VERIFICATION METHOD, APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM | |
US8255414B2 (en) | Search assist powered by session analysis | |
Hartawan et al. | Using vector space model in question answering system | |
CN107193796B (en) | Public opinion event detection method and device | |
US8825620B1 (en) | Behavioral word segmentation for use in processing search queries | |
US20180204106A1 (en) | System and method for personalized deep text analysis | |
CN104376115B (en) | A kind of fuzzy word based on global search determines method and device | |
US20150206101A1 (en) | System for determining infringement of copyright based on the text reference point and method thereof | |
WO2020074017A1 (en) | Deep learning-based method and device for screening for keywords in medical document | |
CN107784069B (en) | Method for intelligently diagnosing knowledge ability of students | |
CN108280081B (en) | Method and device for generating webpage | |
US20190294705A1 (en) | Image annotation | |
CN103577557A (en) | Device and method for determining capturing frequency of network resource point | |
WO2017000659A1 (en) | Enriched uniform resource locator (url) identification method and apparatus | |
CN109033318A (en) | Intelligent answer method and device | |
US10783140B2 (en) | System and method for augmenting answers from a QA system with additional temporal and geographic information | |
CN113010639A (en) | Commodity analysis method and device based on E-commerce platform | |
WO2019192122A1 (en) | Document topic parameter extraction method, product recommendation method and device, and storage medium | |
CN113569044B (en) | Method for classifying webpage text content based on natural language processing technology | |
CN104933097A (en) | Data processing method and device for retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 14856111; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | EP: PCT application non-entry in European phase | Ref document number: 14856111; Country of ref document: EP; Kind code of ref document: A1 |