WO2019218527A1 - Natural language processing method and apparatus combining multiple systems - Google Patents
Natural language processing method and apparatus combining multiple systems
- Publication number
- WO2019218527A1 (PCT/CN2018/102875; CN2018102875W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- group
- weight value
- text information
- value
- search result
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
Definitions
- the present application relates to the field of insurance and finance, and in particular to a natural language processing method and apparatus combining multiple systems.
- existing man-machine dialogue solutions usually return results through steps such as word segmentation, substitution, and matching; examples include natural language processing systems such as iFLYTEK (Keda Xunfei) and Turing Robot.
- the usual implementation is to maintain a term knowledge base in advance; in the query phase, synonyms and stop words in the question are first replaced, the text is then segmented and matched, and the entry with the highest matching degree in the database is returned. Some systems can also learn new words and sentence patterns from the question-and-answer process with users and update them in the knowledge base.
- the inventor is aware that in man-machine dialogue realized this way, the robot can only match results from the existing knowledge base.
- the quality of the answers depends largely on the size of the knowledge base, and it is very easy for the answer not to match the question. Therefore, in existing technical solutions, the processing results are single, the knowledge coverage is insufficient, the question-answer matching degree is too low, and the processing results are not accurate enough.
- the present application provides a multi-system natural language processing method and a corresponding apparatus, computer device and readable storage medium, whose main purpose is to combine a plurality of single natural language processing systems and have the multiple systems vote to produce the final output result, so that the results returned by the system become more and more accurate.
- the present application also provides a computer device and a readable storage medium for performing the multi-system natural language processing method of the present application.
- in a first aspect, the present application provides a multi-system natural language processing method, the method comprising: extracting feature words from received text information; calculating, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords; obtaining, from a plurality of third-party systems, search results corresponding to the text information, and calculating the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the first weight value corresponding to each group; the search result with the highest score is taken as the output result corresponding to the text information.
- in a second aspect, the present application further provides a multi-system natural language processing apparatus, including: an extraction module, configured to extract feature words from received text information; a matching module, configured to calculate, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords; and a calculation module, configured to obtain, from a plurality of third-party systems, search results corresponding to the text information, and to calculate the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight value; the search result with the highest score is taken as the output result corresponding to the text information.
- in a third aspect, the present application further provides a computer device comprising a memory and a processor, wherein the memory stores computer readable instructions which, when executed by the processor, cause the processor to execute a multi-system natural language processing method comprising the steps of: extracting feature words from received text information; calculating, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords; obtaining, from a plurality of third-party systems, search results corresponding to the text information, and calculating the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the first weight value corresponding to each group; the search result with the highest score is the output result corresponding to the text information.
- in a fourth aspect, the present application further provides a computer readable non-volatile storage medium, the computer readable storage medium including a prompting program for online payment; when the prompting program is executed by a processor, it implements a multi-system natural language processing method comprising the steps of: extracting feature words from received text information; calculating, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords; obtaining, from a plurality of third-party systems, search results corresponding to the text information, and calculating the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the first weight value corresponding to each group; the search result with the highest score is the output result corresponding to the text information.
- the present application provides a multi-system natural language processing method that combines multiple independent natural language processing systems: the retrieval results of multiple third-party systems are comprehensively calculated to produce the final output result, so that the results returned by the system are more precise and the knowledge coverage is more complete.
- FIG. 1 is a flow chart of an embodiment of the multi-system combined natural language processing method of the present application;
- FIG. 2 is a block diagram of an embodiment of the multi-system combined natural language processing apparatus of the present application;
- FIG. 3 is a block diagram showing the internal structure of a computer device in an embodiment.
- referring to FIG. 1, a multi-system combined natural language processing method provided by the present application, in a specific embodiment, includes the following steps:
- S11: extract feature words from the received text information.
- the text information may be a statement, such as a question input by a user, or a piece of text including multiple questions.
- the feature word is a word with a relatively high degree of importance in the text information.
- the present application preferably extracts the feature words of the received text information by the following scheme:
- first, the text information is segmented; available tools include the Harbin Institute of Technology (HIT) word segmentation tool and the iFLYTEK voice cloud. The specific word segmentation method is a conventional technique in the field and is not described here.
- second, the segmented content is processed against pre-stored dictionaries of synonyms, stop words and the like, to filter out stop words and replace synonyms.
- after word segmentation, each word needs pre-processing such as screening or replacement. First, the segmented words are counted: if a word or phrase appears frequently in one text and rarely appears in other texts, the word or phrase is considered to have good class-distinguishing ability and is suitable for classification.
- in general, the words with the most occurrences are likely to be "的", "是", "在" and similar most commonly used function words; such words are "stop words", meaning words that are of no help in finding results and must be filtered out. Further, if synonyms such as "开心" and "高兴" ("happy") appear in the text information, one word can be used to replace the corresponding other synonyms.
- third, using training corpora such as Wikipedia, algorithms such as term frequency-inverse document frequency (TF-IDF) are used to calculate the importance of each segmented word in the text information, and a preset number of words with the highest importance are selected as the feature words of the text information.
- TF-IDF is a statistical method used to assess the importance of a word to one of the files in a file set or corpus. The importance of a word increases proportionally with the number of times it appears in the file, but decreases inversely with the frequency with which it appears in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the degree of relevance between a file and a user query.
- in a given file, term frequency (TF) refers to the number of times a given word appears in that file. This number is usually normalized to prevent it from being biased toward long files (in TF the numerator is generally smaller than the denominator, which distinguishes it from IDF): the same word may have a higher raw count in a long file than in a short file, regardless of whether the word is important or not.
- the inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word can be obtained by dividing the total number of files by the number of files containing the word and then taking the logarithm of the obtained quotient.
- the main idea of TF-IDF is: if a word or phrase appears in an article with a high term frequency TF and rarely appears in other articles, the word or phrase is considered to have good class-distinguishing ability and is suitable for classification. TF-IDF is in fact TF*IDF, where TF is the term frequency and IDF is the inverse document frequency.
- specifically, in a given file, the term frequency (TF) is the frequency with which a given word appears in that file; this number is the word count normalized to prevent bias toward long files.
- for a word $t_i$ in a particular file $d_j$, its importance can be expressed as formula (1-1): $tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$, where $n_{i,j}$ is the number of occurrences of the word $t_i$ in the file $d_j$, and the denominator is the sum of the occurrences of all the words in the file $d_j$.
- the inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of files by the number of files containing the word and then taking the logarithm of the quotient, as in formula (1-2): $idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$, where the numerator $|D|$ is the total number of files in the corpus and the denominator is the number of files containing the word.
- the main idea of IDF is: the fewer documents contain a term, the larger its IDF and the better the class-distinguishing ability of the term. In practice, however, if a term appears frequently in the documents of one class, it means the term represents the characteristics of the text of that class well; such terms should be given a higher weight and be chosen as feature words of that class of text to distinguish it from documents of other classes.
- combining formulas (1-1) and (1-2), the TF-IDF value of a word in a specific file is given by formula (1-3): $tfidf_{i,j} = tf_{i,j} \times idf_i$. A high term frequency within a particular file, together with a low document frequency of the word across the whole file collection, produces a high-weight TF-IDF.
- TF-IDF tends to filter out common words and retain important words.
- based on the principle of the TF-IDF algorithm above, the importance of each segmented word in the text information can be calculated, and the words whose TF-IDF exceeds a certain threshold are selected as the feature words of the text information.
- for example, the input text information is "你喜欢看电影还是看电视?" ("Do you like watching movies or watching TV?"). It is first segmented as 你\喜欢\看\电影\还是\电视 and the occurrences are counted: "你" (you), "电视" (TV), "电影" (movie), "还是" (or), and "喜欢" (like) each appear once, and "看" (watch) appears twice. The stop words "你", "还是", and "看" are removed, and the TF of each remaining word is calculated: TF("电视") = 1/7, TF("电影") = 1/7, TF("喜欢") = 1/7.
- assuming the word "电视" has appeared in 1,000 files out of a total of 10,000,000, its inverse document frequency is log(10,000,000/1,000) = 4, so its final TF-IDF value is 1/7*4. Assuming "电影" has appeared in 10,000 files, its IDF is log(10,000,000/10,000) = 3 and its TF-IDF is 1/7*3. Assuming "喜欢" has appeared in 100,000 files, its IDF is log(10,000,000/100,000) = 2 and its TF-IDF is 1/7*2.
- the TF-IDF values are therefore ordered "电视" > "电影" > "喜欢"; if words whose TF-IDF value is greater than 2/7 are preset to be selected as feature words of the text information, the feature words of this text information are "电视" and "电影".
- S12: calculate, according to the degree of matching between the feature words and the pre-stored keywords, the first weight value of the text information in each group classified according to the keywords.
- in the system construction phase, the keyword lists contained in the different groups need to be summarized. Assuming the text information is a single question sentence, a group may be a question category into which question sentences are classified; for example, the groups may be bank questions, insurance questions, and chat questions. In other embodiments of the present application, the groups may also be other classification topics, which are not specifically limited here.
- taking the bank question group as an example, the keyword "bank" is first searched in a search engine, a crawler tool is used to recursively visit the results returned by the search engine, and methods such as TF-IDF are used to summarize the keywords and their importance in the result webpages; the 100 words with the highest importance are selected as the keywords under the bank question group. In subsequent matching, the feature words in the received text information can be matched against these keywords to calculate the first weight value of the text information in the group.
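- by way of illustration only, a short sketch of the group keyword-list construction described above. Fetching and segmenting the crawled pages is assumed to have already happened, and the text does not specify how per-page importance is aggregated, so taking the maximum per-page TF-IDF is an assumption.

```python
import math
from collections import Counter

def build_group_keywords(pages, top_n=100):
    """Summarize the top-N keywords and their importance for one group.

    `pages` is a list of token lists, one per crawled search-result page
    (crawling and segmentation are assumed done). Aggregating per-page
    TF-IDF by max is an illustrative assumption."""
    doc_freq = Counter()
    for tokens in pages:
        doc_freq.update(set(tokens))            # pages containing each word
    importance = Counter()
    for tokens in pages:
        counts = Counter(tokens)
        total = sum(counts.values())
        for word, n in counts.items():
            tfidf = (n / total) * math.log10(len(pages) / doc_freq[word])
            importance[word] = max(importance[word], tfidf)
    return dict(importance.most_common(top_n))  # top-100 keywords per group
```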
- in one possible implementation, the present application preferably calculates the first weight value of the text information in a group by the following scheme: calculate the first importance of each feature word in the text information according to TF-IDF; calculate the second importance of each feature word in a specified group according to TF-IDF;
- the first weight value of the text information in the specified group is then equal to the sum, over the feature words in the text information, of the products of the first importance and the second importance in the specified group.
- the formula for calculating the first weight value of the text information in a group is formula (2-1): $S_{category_k} = \sum_{j} TI_{a_{j\text{-}k}} \times I_{a_{j\text{-}k}}$, where $S_{category_k}$ is the first weight value of the current text information in the k-th group, $a_{j\text{-}k}$ is the j-th feature word under the k-th group, $TI_{a_{j\text{-}k}}$ is the first importance of $a_{j\text{-}k}$ (in the text information), and $I_{a_{j\text{-}k}}$ is the second importance of $a_{j\text{-}k}$ (in the group). The values of $TI$ and $I$ are computed from formulas (1-1), (1-2), and (1-3) and substituted into (2-1).
- further, because the first weight value is obtained as an accumulated sum and participates in subsequent calculations many times, values that are too large, too small, or too far apart would affect the calculation results. The present application therefore further provides a step of normalizing the weights of the text information in the corresponding groups, so that the normalized first weight values all fall within a preset threshold range, e.g. normalizing all weights to between (0, 1). The calculation is given by formula (2-2): $S'_{category_k} = \frac{S_{category_k} - \min(S_{category})}{\max(S_{category}) - \min(S_{category})}$, where $S'_{category_k}$ is the normalized first weight value of the current text information in the k-th group, $\max(S_{category})$ is the maximum of the first weight values of the text information over all groups, and $\min(S_{category})$ is the minimum of the first weight values of the text information over all groups.
- thus, one first weight value can be calculated for the text information under each group in each third-party system. Table 1 shows one possible first weight value table of the text information in each group.
- as shown in the table, the system contains three groups: bank questions, insurance questions, and chat questions. The first weight values of the currently input text information in the corresponding three groups in system 1 are S1, S2, and S3 respectively.
- the value of S1 equals the accumulated sum, over the feature words contained in the text information, of the product of each feature word's importance in the text information and its importance in the group; S2 and S3 are calculated in the same way as S1.
- for example, suppose the text information contains feature words M1, M2, and M3, where the first importance of M1 in the text information is A1 and its second importance in group K is A2, the first importance of M2 in the text information is A3 and its second importance in group K is A4, and the first importance of M3 in the text information is A5 and its second importance in group K is A6. Then the first weight value of the text information in group K is S = A1*A2 + A3*A4 + A5*A6, and correspondingly, the first weight value of the text information in each of the other groups can be calculated, as in the sketch below.
- further, since in different systems each feature word contained in the text information has the same importance in the text information, and each feature word has the same importance in a group, the first weight value of the text information in the same group is the same in different systems.
- S13: obtain, from a plurality of third-party systems, the search results corresponding to the text information, calculate the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight value, and take the search result with the highest score as the output result corresponding to the text information. The present application preferably calculates the output result for the current text information by the following scheme.
- first, the sum of the similarities between each search result and the other search results is calculated to obtain a first intermediate quantity. After the text information is received, the search result corresponding to the text information is obtained from each third-party system, the similarity $C_{i\text{-}j}$ between each search result and each other search result is calculated using algorithms such as word-overlap calculation and word-vector distance calculation, and the similarities are summed to obtain the first intermediate quantity; a sketch of one such similarity sum follows. The specific algorithms are conventional techniques in the art and are not described further here.
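- as a sketch of the first intermediate quantity, the snippet below uses Jaccard word overlap as a stand-in for the word-overlap similarity the text names; word-vector distance would be an equally valid choice.

```python
def overlap_similarity(a_tokens, b_tokens):
    """Word-overlap similarity C_{i-j}; Jaccard overlap is an assumed stand-in."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def similarity_sums(results):
    """r_i: the sum of similarities between result i and every other result."""
    return [sum(overlap_similarity(r_i, r_j)
                for j, r_j in enumerate(results) if j != i)
            for i, r_i in enumerate(results)]
```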
- second, the second weight value of each group in the third-party system to which it belongs is calculated. The second weight value is the voting weight each third-party system gives each group; it depends on the scores of each round of retrieval and on its initial value.
- the present application preferably makes the initial values of the second weight values of the groups equal across the systems: assuming there are Q systems in total, the second weight value of each group in the third-party system to which it belongs is initially equal to 1/Q.
- thus, in the first retrieval the second weight value corresponding to each group is 1/Q; after the first retrieval is completed, the second weight values corresponding to each group are recalculated from the first weight values of the text information in each group during the first retrieval and the scores of the search results, to generate the second weight values of the corresponding groups in the third-party systems to which they belong for the second retrieval.
- specifically, the present application preferably calculates the current second weight value of each group in each third-party system by the following scheme:
- detect the scores of the search results of the text information in each third-party system produced in the previous round of retrieval, and the first weight values of the text information in each group;
- when, in the previous round of retrieval, the text information has its largest first weight value in the i-th group and the score of its search result is highest in the k-th third-party system, calculate the current-round second weight value of the i-th group in the k-th third-party system from the second weight value corresponding to the i-th group in the previous round of retrieval, the first weight value of the text information in the i-th group, the score of the text information's search result in the k-th third-party system, and the learning rate, the learning rate being the amplitude of the second weight value adjustment;
- the current-round second weight values of the other groups in the other third-party systems remain the same as their previous-round second weight values, the other groups being the groups other than the i-th group.
- the above method adjusts each second weight value in every round according to the result of that round's retrieval process, where the initial value of each second weight value is preset; the second weight values used in each retrieval are set depending on the result of the previous retrieval.
- the learning rate is the amplitude of the second weight value adjustment and is a very small number: since answering a single question should not adjust the corresponding parameters by a large amount, the value needs to be tested during use, and a value less than 0.001 is usually taken.
- from the calculation principle of the second weight value it can be seen that after the first retrieval is completed, the second weight value of one of the groups in one of the third-party systems needs to be adjusted, while the second weight values of the other groups in the other third-party systems remain unchanged, i.e. equal to the value 1/Q used in the first retrieval. By analogy, every time a retrieval is completed, the second weight value of each group in the corresponding third-party system can be updated once according to the scores of that round.
- Table 2 shows one possible adjustment data table of the second weight values of each group in each third-party system.
- as shown in Table 2, the system includes two third-party systems, system 1 and system 2, and each system includes three groups: group 1, group 2, and group 3. At the initial stage of system construction, the second weight values of the groups in each third-party system are all equal to 1/2.
- after one retrieval, the score of group 3's search result in system 1 is the highest and the first weight value of the text information in group 3 is the largest, so the condition for adjusting the second weight value of group 3 in system 1 is satisfied. The second weight value of group 3 in system 1 is therefore adjusted: a new second weight value 1/2+M is calculated from the score of group 3's search result in system 1 and the first weight value of the text information in group 3.
- specifically, the formulas for calculating the second weight value of each group in each third-party system are as follows. The previous retrieval must satisfy the conditions: (a) the first weight value of the text information in group k' equals the maximum of the first weight values over all groups; and (b) the score of the text information's search result in system i' equals the maximum of the scores over all search results.
- then the second weight value of system i' for group k' may be increased, and the result serves as the input value of the second weight value of system i' for group k' in the current retrieval. A temporary value is first calculated by formula (3-1) from the old value, the first weight value of the text information in group k', the score of the search result in system i', and the learning rate, where $E'_{i'\text{-}k'(new)}$ is the temporary value after adjusting the second weight value of third-party system i' for group k', $E'_{i'\text{-}k'(old)}$ is the value before the adjustment, and $\eta$ is the learning rate.
- formula (3-2) states that when the second weight value of system i' for group k' is adjusted, the second weight values of the other third-party systems for the other groups do not change: $E'_{i\text{-}k(new)} = E_{i\text{-}k(old)}$ for $i \neq i'$ and $k \neq k'$.
- the softmax function is then applied to normalize the output temporary values into probability values, ensuring that for group k' the second weight values over all third-party systems sum to 1, as in formula (3-3); $E_{i\text{-}k(new)}$ is the value of the second weight value after adjustment. Hence, after the previous retrieval is completed, the second weight value of third-party system i' for group k' is adjusted to $E_{i\text{-}k(new)}$, the second weight values of the other groups in the other third-party systems are unchanged, and each second weight value serves as the input value of the second weight value of the corresponding group in the corresponding third-party system for the current retrieval.
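- the following sketch ties the update rules together. The body of formula (3-1) is not reproduced in this text, so the additive bump `E + eta * S' * score` is only an assumed placeholder built from the inputs the text lists; formulas (3-2) and (3-3) are implemented as described.

```python
import math

def update_second_weights(E, scores, s_norm, eta=0.0005):
    """One round of second-weight adjustment.

    E[i][k]   second weight of group k in third-party system i
    scores[i] score of system i's search result in the previous round
    s_norm[k] normalized first weight of the text in group k
    eta       learning rate (the text suggests values below 0.001)"""
    i_w = max(range(len(scores)), key=scores.__getitem__)   # system i'
    k_w = max(range(len(s_norm)), key=s_norm.__getitem__)   # group k'
    temp = [row[:] for row in E]                            # formula (3-2)
    # Assumed placeholder for formula (3-1): bump E[i'][k'] using the listed
    # inputs (old value, first weight value, score, learning rate).
    temp[i_w][k_w] += eta * s_norm[k_w] * scores[i_w]
    # Formula (3-3): softmax over systems so the weights for group k' sum to 1.
    z = sum(math.exp(temp[i][k_w]) for i in range(len(temp)))
    for i in range(len(temp)):
        temp[i][k_w] = math.exp(temp[i][k_w]) / z
    return temp
```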
- third, the sum of the products of the second weight value corresponding to each group in the specified system and the normalized first weight value of the text information in that group is calculated to obtain a second intermediate quantity, as in formula (3-4): $\sum_{k=1}^{K} S'_{category_k} \times E_{i\text{-}k}$, where K is the total number of groups, $S'_{category_k}$ is the normalized first weight value of the current text information in the k-th group, and $E_{i\text{-}k}$ is the second weight value of the k-th group in the i-th system.
- fourth, the product of the first intermediate quantity and the second intermediate quantity is calculated to obtain the score of each search result in the third-party system corresponding to that search result, and finally the search result with the highest score is taken as the output result corresponding to the text information. Specifically, the score of each search result is calculated by formula (3-5): $R_i = r_i \times \sum_{k=1}^{K} E_{i\text{-}k} \times S'_{category_k}$, where $r_i$ is the sum of the similarities between the search result and the other search results, K is the total number of groups, $E_{i\text{-}k}$ is the second weight value of group k in third-party system i, and $S'_{category_k}$ is the normalized first weight value of the text information in the group.
- in the embodiment of the present application, after the final output result is obtained, it is broadcast by TTS (text-to-speech).
- Table 3 shows one possible score table of the search results corresponding to the text information.
- as shown in Table 3, after receiving the text information, the system obtains the corresponding search results F1, F2, and F3 from system 1, system 2, and system 3 respectively, and calculates the similarity between each search result and the other two; the sums of similarities corresponding to the search results are r1, r2, and r3 respectively. From formulas (3-4) and (3-5): the score of search result F1 is R1 = r1*(S1*E1 + S2*E2 + S3*E3); the score of F2 is R2 = r2*(S1*E4 + S2*E5 + S3*E6); and the score of F3 is R3 = r3*(S1*E7 + S2*E8 + S3*E9).
- the search result corresponding to the largest of R1, R2, and R3 is the output result corresponding to the text information, as in the sketch below.
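- a sketch of the final scoring step, formula (3-5), using the Table 3 layout; the numeric r and E values are assumed for illustration.

```python
def score_results(r, E, s_norm):
    """Formula (3-5): R_i = r_i * sum_k E[i][k] * S'_k.
    Returns the index of the best result and all scores."""
    R = [r[i] * sum(E[i][k] * s_norm[k] for k in range(len(s_norm)))
         for i in range(len(r))]
    return max(range(len(R)), key=R.__getitem__), R

# Assumed values: three systems and three groups, as in Table 3.
r = [0.8, 1.1, 0.9]                                      # r1, r2, r3
E = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]  # E1..E9 per system
s_norm = [1.0, 0.14, 0.0]                                # S'1, S'2, S'3
best, scores = score_results(r, E, s_norm)
print(best, scores)   # index of F{best+1}, the output result
```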
- the present application further provides an online learning optimization strategy, and the scores of the final results obtained are used to optimize the second weight value of the system to finally optimize the accuracy of the output of the system.
- the present application adjusts the second weight value of each group in the third-party system to which the group belongs after each search is completed by introducing a parameter of the learning rate.
- the learning rate is used to characterize the magnitude of the adjustment of the second weight value.
- specifically, in the initial stage after the system goes online, the second weight values of system i for all groups are equal and are 1/Q (assuming there are Q systems); that is, for the questions corresponding to each group, every system outputs an accurate answer with equal probability. After one round of retrieval, assuming the text information has its largest first weight value in group K and the highest search-result score in system i, the new second weight value corresponding to group K in system i can be calculated according to formulas (3-1) and (3-2).
- referring to FIG. 2, an embodiment of the present application further provides a multi-system combined natural language processing apparatus, which in this embodiment includes an extraction module 11, a matching module 12, and a calculation module 13, wherein:
- the extraction module 11 is configured to extract feature words from the received text information.
- the text information may be a statement, such as a question input by a user, or a piece of text including multiple questions.
- the feature word is a word with relatively high importance in the text information or, put simply, a word that appears relatively often in the text information.
- the present application preferably extracts the feature words of the received text information by the following scheme:
- first, the text information is segmented; available tools include the Harbin Institute of Technology (HIT) word segmentation tool and the iFLYTEK voice cloud. The specific word segmentation method is a conventional technique in the field and is not described here.
- second, the segmented content is processed against pre-stored dictionaries of synonyms, stop words and the like, to filter out stop words and replace synonyms.
- after word segmentation, each word needs pre-processing such as screening or replacement. First, the segmented words are counted: if a word or phrase appears frequently in one text and rarely appears in other texts, the word or phrase is considered to have good class-distinguishing ability and is suitable for classification.
- in general, the words with the most occurrences are likely to be "的", "是", "在" and similar most commonly used function words; such words are "stop words", meaning words that are of no help in finding results and must be filtered out. Further, if synonyms such as "开心" and "高兴" ("happy") appear in the text information, one word can be used to replace the corresponding other synonyms.
- third, using training corpora such as Wikipedia, algorithms such as TF-IDF are used to calculate the importance of each segmented word in the text information, and a preset number of words with the highest importance are selected as the feature words of the text information.
- TF-IDF is a statistical method used to assess the importance of a word to one of the files in a file set or corpus. The importance of a word increases proportionally with the number of times it appears in the file, but decreases inversely with the frequency with which it appears in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the degree of relevance between a file and a user query.
- in a given file, term frequency (TF) refers to the number of times a given word appears in that file. This number is usually normalized to prevent it from being biased toward long files (in TF the numerator is generally smaller than the denominator, which distinguishes it from IDF): the same word may have a higher raw count in a long file than in a short file, regardless of whether the word is important or not.
- the inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word can be obtained by dividing the total number of files by the number of files containing the word and then taking the logarithm of the obtained quotient.
- the main idea of TF-IDF is: if a word or phrase appears in an article with a high term frequency TF and rarely appears in other articles, the word or phrase is considered to have good class-distinguishing ability and is suitable for classification. TF-IDF is in fact TF*IDF, where TF is the term frequency and IDF is the inverse document frequency.
- specifically, in a given file, the term frequency (TF) is the frequency with which a given word appears in that file; this number is the word count normalized to prevent bias toward long files.
- for a word $t_i$ in a particular file $d_j$, its importance can be expressed as formula (1-1): $tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$, where $n_{i,j}$ is the number of occurrences of the word $t_i$ in the file $d_j$, and the denominator is the sum of the occurrences of all the words in the file $d_j$.
- the inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of files by the number of files containing the word and then taking the logarithm of the quotient, as in formula (1-2): $idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$, where the numerator $|D|$ is the total number of files in the corpus and the denominator is the number of files containing the word.
- the main idea of IDF is: the fewer documents contain a term, the larger its IDF and the better the class-distinguishing ability of the term. In practice, however, if a term appears frequently in the documents of one class, it means the term represents the characteristics of the text of that class well; such terms should be given a higher weight and be chosen as feature words of that class of text to distinguish it from documents of other classes.
- combining formulas (1-1) and (1-2), the TF-IDF value of a word in a specific file is given by formula (1-3): $tfidf_{i,j} = tf_{i,j} \times idf_i$. A high term frequency within a particular file, together with a low document frequency of the word across the whole file collection, produces a high-weight TF-IDF.
- TF-IDF tends to filter out common words and retain important words.
- based on the principle of the TF-IDF algorithm above, the importance of each segmented word in the text information can be calculated, and the words whose TF-IDF exceeds a certain threshold are selected as the feature words of the text information.
- the matching module 12 is configured to calculate, according to the degree of matching between the feature words and the pre-stored keywords, the first weight value of the text information in each group classified according to the keywords.
- in the system construction phase, the keyword lists contained in the different groups need to be summarized. Assuming the text information is a single question sentence, a group may be a question category into which question sentences are classified; for example, the groups may be bank questions, insurance questions, and chat questions. In other embodiments, the groups may also be other classification topics, which are not specifically limited here.
- taking the bank question group as an example, the keyword "bank" is first searched in a search engine, a crawler tool is used to recursively visit the results returned by the search engine, and methods such as TF-IDF are used to summarize the keywords and their importance in the result webpages; the 100 words with the highest importance are selected as the keywords under the bank question group. In subsequent matching, the feature words in the received text information can be matched against these keywords to calculate the first weight value of the text information in the group.
- in one possible implementation, the present application preferably calculates the first weight value of the text information in a group by the following scheme: calculate the first importance of each feature word in the text information according to TF-IDF; calculate the second importance of each feature word in a specified group according to TF-IDF;
- the first weight value of the text information in the specified group is then equal to the sum, over the feature words in the text information, of the products of the first importance and the second importance in the specified group.
- the formula for calculating the first weight value of the text information in a group is formula (2-1): $S_{category_k} = \sum_{j} TI_{a_{j\text{-}k}} \times I_{a_{j\text{-}k}}$, where $S_{category_k}$ is the first weight value of the current text information in the k-th group, $a_{j\text{-}k}$ is the j-th feature word under the k-th group, $TI_{a_{j\text{-}k}}$ is the first importance of $a_{j\text{-}k}$ (in the text information), and $I_{a_{j\text{-}k}}$ is the second importance of $a_{j\text{-}k}$ (in the group).
- further, the present application also provides a step of normalizing the weights of the text information in the corresponding groups, normalizing all weights to between (0, 1). The calculation is given by formula (2-2): $S'_{category_k} = \frac{S_{category_k} - \min(S_{category})}{\max(S_{category}) - \min(S_{category})}$, where $S'_{category_k}$ is the normalized first weight value of the current text information in the k-th group, $\max(S_{category})$ is the maximum of the first weight values of the text information over all groups, and $\min(S_{category})$ is the minimum of the first weight values of the text information over all groups.
- thus, one first weight value can be calculated for the text information under each group in each third-party system. Table 1 shows one possible first weight value table of the text information in each group.
- as shown in the table, the system contains three groups: bank questions, insurance questions, and chat questions. The first weight values of the currently input text information in the corresponding three groups in system 1 are S1, S2, and S3 respectively.
- the value of S1 equals the accumulated sum, over the feature words contained in the text information, of the product of each feature word's importance in the text information and its importance in the group; S2 and S3 are calculated in the same way as S1.
- for example, suppose the text information contains feature words M1, M2, and M3, where the first importance of M1 in the text information is A1 and its second importance in group K is A2, the first importance of M2 in the text information is A3 and its second importance in group K is A4, and the first importance of M3 in the text information is A5 and its second importance in group K is A6. Then the first weight value of the text information in group K is S = A1*A2 + A3*A4 + A5*A6, and correspondingly, the first weight value of the text information in each of the other groups can be calculated.
- further, since in different systems each feature word contained in the text information has the same importance in the text information, and each feature word has the same importance in a group, the first weight value of the text information in the same group is the same in different systems.
- the calculation module 13 is configured to obtain, from a plurality of third-party systems, the search results corresponding to the text information, and to calculate the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight value; the search result with the highest score is the output result corresponding to the text information. A structural sketch of such an apparatus follows.
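- purely as a structural sketch of the apparatus of FIG. 2, the three modules can be modeled as callables wired together; all names here are illustrative, not part of the disclosure.

```python
class MultiSystemNLPApparatus:
    """Structural sketch: extraction module 11, matching module 12,
    calculation module 13 (illustrative names and wiring)."""

    def __init__(self, extract, match, calculate):
        self.extraction_module = extract      # text -> feature words
        self.matching_module = match          # feature words -> first weights
        self.calculation_module = calculate   # weights + retrievals -> output

    def process(self, text_information):
        feature_words = self.extraction_module(text_information)
        first_weights = self.matching_module(feature_words)
        return self.calculation_module(text_information, first_weights)
```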
- the present application preferably calculates the output result for the current text information by the following scheme:
- first, the sum of the similarities between each search result and the other search results is calculated to obtain a first intermediate quantity. After the text information is received, the search result corresponding to the text information is obtained from each third-party system, the similarity $C_{i\text{-}j}$ between each search result and each other search result is calculated using algorithms such as word-overlap calculation and word-vector distance calculation, and the similarities are summed. The specific algorithms are conventional techniques in the art and are not described further here.
- second, the second weight value of each group in the third-party system to which it belongs is calculated. The second weight value is the voting weight each third-party system gives each group; it depends on the scores of each round of retrieval and on its initial value.
- the present application preferably makes the initial values of the second weight values of the groups equal across the systems: assuming there are Q systems in total, the second weight value of each group in the third-party system to which it belongs is initially equal to 1/Q.
- thus, in the first retrieval the second weight value corresponding to each group is 1/Q; after the first retrieval is completed, the second weight values corresponding to each group are recalculated from the first weight values of the text information in each group during the first retrieval and the scores of the search results, to generate the second weight values of the corresponding groups in the third-party systems to which they belong for the second retrieval.
- specifically, the present application preferably calculates the current second weight value of each group in each third-party system by the following scheme:
- detect the scores of the search results of the text information in each third-party system produced in the previous round of retrieval, and the first weight values of the text information in each group;
- when, in the previous round of retrieval, the text information has its largest first weight value in the i-th group and the score of its search result is highest in the k-th third-party system, calculate the current-round second weight value of the i-th group in the k-th third-party system from the second weight value corresponding to the i-th group in the previous round of retrieval, the first weight value of the text information in the i-th group, the score of the text information's search result in the k-th third-party system, and the learning rate, the learning rate being the amplitude of the second weight value adjustment;
- the current-round second weight values of the other groups in the other third-party systems remain the same as their previous-round second weight values, the other groups being the groups other than the i-th group among the groups in the specified system.
- the learning rate is the amplitude of the second weight value adjustment and is a very small number: since answering a single question should not adjust the corresponding parameters by a large amount, the value needs to be tested during use, and a value less than 0.001 is usually taken.
- from the calculation principle of the second weight value it can be seen that after the first retrieval is completed, the second weight value of one of the groups in one of the third-party systems needs to be adjusted, while the second weight values of the other groups in the other third-party systems remain unchanged, i.e. equal to the value 1/Q used in the first retrieval. By analogy, every time a retrieval is completed, the second weight value of each group in the corresponding third-party system can be updated once according to the scores of that round.
- Table 2 is a possible adjustment data table of the second weight value of each group in each third-party system.
- as shown in Table 2, the system includes two third-party systems, system 1 and system 2, and each system includes three groups: group 1, group 2, and group 3. At the initial stage of system construction, the second weight values of the groups in each third-party system are all equal to 1/2.
- after one retrieval, the score of group 3's search result in system 1 is the highest and the first weight value of the text information in group 3 is the largest, so the condition for adjusting the second weight value of group 3 in system 1 is satisfied. The second weight value of group 3 in system 1 is therefore adjusted: a new second weight value 1/2+M is calculated from the score of group 3's search result in system 1 and the first weight value of the text information in group 3.
- specifically, the formulas for calculating the second weight value of each group in each third-party system are as follows. The previous retrieval must satisfy the conditions: (a) the first weight value of the text information in group k' equals the maximum of the first weight values over all groups; and (b) the score of the text information's search result in system i' equals the maximum of the scores over all search results.
- then the second weight value of system i' for group k' may be increased, and the result serves as the input value of the second weight value of system i' for group k' in the current retrieval. A temporary value is first calculated by formula (3-1) from the old value, the first weight value of the text information in group k', the score of the search result in system i', and the learning rate, where $E'_{i'\text{-}k'(new)}$ is the temporary value after adjusting the second weight value of third-party system i' for group k', $E'_{i'\text{-}k'(old)}$ is the value before the adjustment, and $\eta$ is the learning rate.
- formula (3-2) states that when the second weight value of system i' for group k' is adjusted, the second weight values of the other third-party systems for the other groups do not change: $E'_{i\text{-}k(new)} = E_{i\text{-}k(old)}$ for $i \neq i'$ and $k \neq k'$.
- the softmax function is then applied to normalize the output temporary values into probability values, ensuring that for group k' the second weight values over all third-party systems sum to 1, as in formula (3-3); $E_{i\text{-}k(new)}$ is the value of the second weight value after adjustment. Hence, after the previous retrieval is completed, the second weight value of third-party system i' for group k' is adjusted to $E_{i\text{-}k(new)}$, the second weight values of the other groups in the other third-party systems are unchanged, and each second weight value serves as the input value of the second weight value of the corresponding group in the corresponding third-party system for the current retrieval.
- third, the sum of the products of the second weight value corresponding to each group in the specified system and the normalized first weight value of the text information in that group is calculated to obtain a second intermediate quantity, as in formula (3-4): $\sum_{k=1}^{K} S'_{category_k} \times E_{i\text{-}k}$, where K is the total number of groups, $S'_{category_k}$ is the normalized first weight value of the current text information in the k-th group, and $E_{i\text{-}k}$ is the second weight value of the k-th group in the i-th system.
- fourth, the product of the first intermediate quantity and the second intermediate quantity is calculated to obtain the score of each search result in the third-party system corresponding to that search result, and finally the search result with the highest score is taken as the output result corresponding to the text information. Specifically, the score of each search result is calculated by formula (3-5): $R_i = r_i \times \sum_{k=1}^{K} E_{i\text{-}k} \times S'_{category_k}$, where $r_i$ is the sum of the similarities between the search result and the other search results, K is the total number of groups, $E_{i\text{-}k}$ is the second weight value of group k in third-party system i, and $S'_{category_k}$ is the normalized first weight value of the text information in the group.
- after the final output result is obtained, it is broadcast by TTS (text-to-speech).
- Table 3 is a possible score table of each search result corresponding to the text information.
- as shown in Table 3, after receiving the text information, the system obtains the corresponding search results F1, F2, and F3 from system 1, system 2, and system 3 respectively, and calculates the similarity between each search result and the other two; the sums of similarities corresponding to the search results are r1, r2, and r3 respectively. From formulas (3-4) and (3-5), the score of F1 is R1 = r1*(S1*E1 + S2*E2 + S3*E3), the score of F2 is R2 = r2*(S1*E4 + S2*E5 + S3*E6), and the score of F3 is R3 = r3*(S1*E7 + S2*E8 + S3*E9); the search result corresponding to the largest of R1, R2, and R3 is the output result corresponding to the text information.
- the present application further provides an online learning optimization strategy, and the scores of the final results obtained are used to optimize the second weight value of the system to finally optimize the accuracy of the output of the system.
- the present application adjusts the second weight value of each group in the third-party system to which the group belongs after each search is completed by introducing a parameter of the learning rate.
- the learning rate is used to characterize the magnitude of the adjustment of the second weight value.
- specifically, in the initial stage after the system goes online, the second weight values of system i for all groups are equal and are 1/Q (assuming there are Q systems); that is, for the questions corresponding to each group, every system outputs an accurate answer with equal probability.
- the present application also provides a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor; when executing the computer program, the processor implements the steps of: extracting feature words from the received text information; calculating, according to the degree of matching between the feature words and the pre-stored keywords, the first weight value of the text information in each group classified according to the keywords; obtaining, from a plurality of third-party systems, the search results corresponding to the text information, and calculating the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight value; the search result with the highest score is the output result corresponding to the text information.
- the step, executed by the processor, of calculating the first weight value of the text information in each group classified according to the keywords according to the degree of matching between the feature words and the pre-stored keywords includes: calculating the first importance of each feature word in the text information according to the TF-IDF algorithm; calculating the second importance of each feature word in a specified group according to the TF-IDF algorithm; and taking the first weight value of the text information in the specified group to be the sum, over the feature words in the text information, of the products of the first importance and the second importance in the specified group.
- the processor when executing the computer readable instructions, further performs the step of normalizing the first weight value.
- FIG. 3 is a schematic diagram showing the internal structure of a computer device in an embodiment.
- the computer device includes a processor 1, a storage medium 2, a memory 3, and a network interface 4 connected by a system bus.
- the storage medium 2 of the computer device stores an operating system, a database, and computer readable instructions, and the database may store a sequence of control information. When the computer readable instructions are executed by the processor 1, the processor 1 may implement the multi-system combined natural language processing method; in particular, the processor 1 can implement the functions of the extraction module, the matching module, and the calculation module of the multi-system natural language processing apparatus in the embodiment shown in FIG. 2.
- the processor 1 of the computer device is used to provide computing and control capabilities to support the operation of the entire computer device.
- computer readable instructions may be stored in the memory 3 of the computer device; when executed by the processor 1, they may cause the processor 1 to perform the multi-system natural language processing method.
- the network interface 4 of the computer device is used to communicate with connected terminals. It will be understood by those skilled in the art that the structure shown in FIG. 3 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
- the present application also provides a non-volatile storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to execute the following steps: extracting feature words from the received text information; calculating, according to the degree of matching between the feature words and the pre-stored keywords, the first weight value of the text information in each group classified according to the keywords; obtaining, from a plurality of third-party systems, the search results corresponding to the text information, and calculating the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight value; the search result with the highest score is the output result corresponding to the text information.
- in summary, to address the problems of existing single natural language processing systems, namely single results, insufficient knowledge coverage, and low question matching, the present application is designed to combine multiple single natural language processing systems and have the multiple systems vote to produce the final output result. Through continuous adjustment of the relevant parameters during use, each system is given different weight values for different groups, such as chat questions, weather questions, business questions, and news, so that the returned results become more and more accurate.
- specifically, the present application provides a scoring mechanism that scores each of the obtained search results so as to finally select the optimal output result. Correspondingly, the present application further provides an adjustment mechanism that, according to the score of each search result and the first weight value of the text information in the corresponding group, adjusts the second weight value corresponding to that group in real time. That is, the present application determines an adjustment value for the second weight value according to the score of each search result and the first weight value of the text information in the corresponding group, and applies it to the second weight value in real time; by continuously adjusting each third-party system's second weight values for the different groups, online learning optimization of the system is implemented, so that the output results finally become more and more accurate.
- the present application thus combines a plurality of single natural language processing systems so that multiple third-party systems jointly generate the final output result, solving the problems of single output, insufficient knowledge coverage, and too low a degree of question-answer matching in prior art solutions.
- the storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A natural language processing method combining multiple systems, the method comprising: extracting feature words from received text information (S11); calculating, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords (S12); obtaining, from a plurality of third-party systems, search results corresponding to the text information, and calculating the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the first weight value corresponding to each group; the search result with the highest score is taken as the output result corresponding to the text information (S13). The method can combine multiple single natural language processing systems, with multiple systems voting to produce the final answer, so that the results returned by the system are more precise, solving the problems of existing natural language processing systems: single results, insufficient knowledge coverage, and too low a degree of matching between questions and answers.
Description
This application claims priority to the Chinese patent application filed with the China Patent Office on May 14, 2018, with application number 201810455437.2 and invention title "Natural language processing method and apparatus combining multiple systems", the entire contents of which are incorporated herein by reference.
The present application relates to the field of insurance and finance, and in particular to a natural language processing method and apparatus combining multiple systems.
Existing man-machine dialogue solutions usually return results through steps such as word segmentation, substitution, and matching. Natural language processing systems such as iFLYTEK (Keda Xunfei) and Turing Robot typically maintain a term knowledge base in advance; in the query phase, synonyms and stop words in the question are first replaced, the text is then segmented and matched, and finally the entry with the highest matching degree in the database is returned. Some systems can also learn new words and sentence patterns from the question-and-answer process with users and update them in the knowledge base.
The inventor has realized that in man-machine dialogue implemented this way, the robot can only match results from the existing knowledge base, the quality of the answers depends largely on the size of the knowledge base, and it is very easy for the answer not to match the question. Therefore, in existing technical solutions, the processing results are single, the knowledge coverage is insufficient, the question-answer matching degree is too low, and the processing results are not accurate enough.
SUMMARY OF THE INVENTION
The present application provides a natural language processing method combining multiple systems and a corresponding apparatus, computer device, and readable storage medium, whose main purpose is to combine a plurality of single natural language processing systems and have the multiple systems vote to produce the final output result, so that the results returned by the system become more and more accurate.
The present application also provides a computer device and a readable storage medium for executing the multi-system combined natural language processing method of the present application.
To solve the above problems, the present application adopts the technical solutions of the following aspects:
In a first aspect, the present application provides a natural language processing method combining multiple systems, the method comprising: extracting feature words from received text information; calculating, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords; obtaining, from a plurality of third-party systems, search results corresponding to the text information, and calculating the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the first weight value corresponding to each group; the search result with the highest score is taken as the output result corresponding to the text information.
In a second aspect, the present application further provides a natural language processing apparatus combining multiple systems, including: an extraction module, configured to extract feature words from received text information; a matching module, configured to calculate, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords; and a calculation module, configured to obtain, from a plurality of third-party systems, search results corresponding to the text information, and to calculate the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight value; the search result with the highest score is taken as the output result corresponding to the text information.
In a third aspect, the present application further provides a computer device comprising a memory and a processor, wherein the memory stores computer readable instructions which, when executed by the processor, cause the processor to execute a natural language processing method combining multiple systems, the method comprising the steps of: extracting feature words from received text information; calculating, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords; obtaining, from a plurality of third-party systems, search results corresponding to the text information, and calculating the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the first weight value corresponding to each group; the search result with the highest score is taken as the output result corresponding to the text information.
In a fourth aspect, the present application further provides a computer readable non-volatile storage medium, the computer readable storage medium including a prompting program for online payment; when the prompting program is executed by a processor, it implements a natural language processing method combining multiple systems, the method comprising the steps of: extracting feature words from received text information; calculating, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords; obtaining, from a plurality of third-party systems, search results corresponding to the text information, and calculating the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the first weight value corresponding to each group; the search result with the highest score is taken as the output result corresponding to the text information.
The present application provides a natural language processing method combining multiple systems, in which multiple independent natural language processing systems are used in combination and the retrieval results of multiple third-party systems are comprehensively calculated to produce the final output result, so that the results returned by the system are more precise and the knowledge coverage is more complete.
FIG. 1 is a flow chart of an embodiment of the natural language processing method combining multiple systems of the present application;
FIG. 2 is a block diagram of an embodiment of the natural language processing apparatus combining multiple systems of the present application;
FIG. 3 is a block diagram of the internal structure of a computer device in an embodiment.
Referring to FIG. 1, a natural language processing method combining multiple systems provided by the present application, in a specific embodiment, includes the following steps:
S11. Extract feature words from the received text information.
In the embodiment of the present application, the text information may be a sentence, such as a question input by a user, or a piece of text including multiple questions. The feature words are words of relatively high importance in the text information.
In one possible implementation, the present application preferably extracts the feature words of the received text information by the following scheme:
First, the text information is segmented; available tools include the Harbin Institute of Technology (HIT) word segmentation tool and the iFLYTEK voice cloud. The specific word segmentation method is a conventional technique in the field and is not described here.
Second, the segmented content is processed against pre-stored dictionaries of synonyms, stop words and the like, to filter out stop words and replace synonyms.
After word segmentation, each word needs pre-processing such as screening or replacement. First, the segmented words are counted: if a word or phrase appears frequently in one text and rarely appears in other texts, the word or phrase is considered to have good class-distinguishing ability and is suitable for classification.
In general, the words with the most occurrences are likely to be "的", "是", "在" and similar most commonly used function words; such words are "stop words", meaning words that are of no help in finding results and must be filtered out.
Further, if synonyms such as "开心" and "高兴" ("happy") appear in the text information, one word can be used to replace the corresponding other synonyms.
Third, using training corpora such as Wikipedia, algorithms such as term frequency-inverse document frequency (TF-IDF) are used to calculate the importance of each segmented word in the text information, and a preset number of words with the highest importance are selected as the feature words of the text information.
TF-IDF is a statistical method used to assess the importance of a word to one of the files in a file set or corpus. The importance of a word increases proportionally with the number of times it appears in the file, but decreases inversely with the frequency with which it appears in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the degree of relevance between a file and a user query.
In a given file, term frequency (TF) refers to the number of times a given word appears in that file. This number is usually normalized to prevent it from being biased toward long files (in TF the numerator is generally smaller than the denominator, which distinguishes it from IDF): the same word may have a higher raw count in a long file than in a short file, regardless of whether the word is important or not.
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word can be obtained by dividing the total number of files by the number of files containing the word and then taking the logarithm of the obtained quotient.
The main idea of TF-IDF is: if a word or phrase appears in an article with a high term frequency TF and rarely appears in other articles, the word or phrase is considered to have good class-distinguishing ability and is suitable for classification. TF-IDF is in fact TF*IDF, where TF is the term frequency and IDF is the inverse document frequency.
Specifically, in a given file, the term frequency (TF) is the frequency with which a given word appears in that file; this number is the word count normalized to prevent bias toward long files. For a word $t_i$ in a particular file $d_j$, its importance can be expressed as formula (1-1): $tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$, where $n_{i,j}$ is the number of occurrences of the word $t_i$ in the file $d_j$, and the denominator is the sum of the occurrences of all the words in the file $d_j$.
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of files by the number of files containing the word and then taking the logarithm of the quotient, as in formula (1-2): $idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$, where the numerator $|D|$ is the total number of files in the corpus and the denominator is the number of files containing the word.
The main idea of IDF is: the fewer documents contain the term t, i.e. the smaller n is, the larger the IDF, indicating that the term t has good class-distinguishing ability. If the number of documents of a class C containing term t is m, and the total number of documents of other classes containing t is k, then clearly the total number of documents containing t is n = m + k; when m is large, n is also large, and the IDF value obtained by the IDF formula is small, indicating that the class-distinguishing ability of term t is not strong. Put another way, the fewer documents contain a term, the larger its IDF and the better its class-distinguishing ability. In practice, however, if a term appears frequently in the documents of one class, it means the term represents the characteristics of the text of that class well; such terms should be given a higher weight and be chosen as feature words of that class of text to distinguish it from documents of other classes.
From formulas (1-1) and (1-2), the TF-IDF value of a word in a specific file is given by formula (1-3): $tfidf_{i,j} = tf_{i,j} \times idf_i$. Therefore, a high term frequency within a particular file, together with a low document frequency of the word across the whole file collection, produces a high-weight TF-IDF; TF-IDF thus tends to filter out common words and retain important words.
Based on the principle of the TF-IDF algorithm above, the importance of each segmented word in the text information can be calculated, and the words whose TF-IDF exceeds a certain threshold are selected as the feature words of the text information.
For example, the input text information is "你喜欢看电影还是看电视?" ("Do you like watching movies or watching TV?"). It is first segmented as 你\喜欢\看\电影\还是\电视 and the occurrences are counted: "你", "电视", "电影", "还是", and "喜欢" each appear once, and "看" appears twice. The stop words "你", "还是", and "看" are removed, and the TF of each remaining word is calculated: TF("电视") = 1/7, TF("电影") = 1/7, TF("喜欢") = 1/7.
Assuming the word "电视" has appeared in 1,000 files out of a total of 10,000,000, its inverse document frequency is log(10,000,000/1,000) = 4, and its final TF-IDF value is 1/7*4. Assuming "电影" has appeared in 10,000 files, its IDF is log(10,000,000/10,000) = 3 and its TF-IDF is 1/7*3. Assuming "喜欢" has appeared in 100,000 files, its IDF is log(10,000,000/100,000) = 2 and its TF-IDF is 1/7*2.
The TF-IDF values are therefore ordered "电视" > "电影" > "喜欢"; if words whose TF-IDF value is greater than 2/7 are preset to be selected as feature words of the text information, the feature words of this text information are "电视" and "电影".
S12. Calculate, according to the degree of matching between the feature words and the pre-stored keywords, the first weight value of the text information in each group classified according to the keywords.
In the embodiment of the present application, in the system construction phase, the keyword lists contained in the different groups need to be summarized. Assuming the text information is a single question sentence, a group may be a question category into which question sentences are classified; for example, the groups may be bank questions, insurance questions, and chat questions. In other embodiments of the present application, the groups may also be other classification topics, which are not specifically limited here.
Specifically, taking the bank question group as an example, the keyword "bank" is first searched in a search engine, a crawler tool is used to recursively visit the results returned by the search engine, and methods such as TF-IDF are used to summarize the keywords and their importance in the result webpages; the 100 words with the highest importance are selected as the keywords under the bank question group. In subsequent matching, the feature words in the received text information can be matched against these keywords to calculate the first weight value of the text information in the group.
In one possible implementation, the present application preferably calculates the first weight value of the text information in a group by the following scheme:
calculate the first importance of each feature word in the text information according to TF-IDF;
calculate the second importance of each feature word in a specified group according to TF-IDF;
the first weight value of the text information in the specified group is equal to the sum, over the feature words in the text information, of the products of the first importance and the second importance in the specified group.
In the embodiment of the present application, the formula for calculating the first weight value of the text information in a group is formula (2-1): $S_{category_k} = \sum_{j} TI_{a_{j\text{-}k}} \times I_{a_{j\text{-}k}}$, where $S_{category_k}$ is the first weight value of the current text information in the k-th group, $a_{j\text{-}k}$ is the j-th feature word under the k-th group, $TI_{a_{j\text{-}k}}$ is the first importance of $a_{j\text{-}k}$, and $I_{a_{j\text{-}k}}$ is the second importance of $a_{j\text{-}k}$.
According to formulas (1-1), (1-2), and (1-3) of the TF-IDF algorithm above, the value of the first importance $TI_{a_{j\text{-}k}}$ and the value of the second importance $I_{a_{j\text{-}k}}$ can be calculated; substituting them into formula (2-1) yields the first weight value of the current text information in the k-th group.
Further, because the first weight value is obtained as an accumulated sum and participates in subsequent calculations many times, values that are too large, too small, or too far apart would affect the calculation results. The present application therefore further provides a step of normalizing the weights of the question information in the corresponding groups, so that the normalized first weight values all fall within a preset threshold range, e.g. normalizing all weights to between (0, 1). The calculation is given by formula (2-2): $S'_{category_k} = \frac{S_{category_k} - \min(S_{category})}{\max(S_{category}) - \min(S_{category})}$, where $S'_{category_k}$ is the normalized first weight value of the current text information in the k-th group, $\max(S_{category})$ is the maximum of the first weight values of the text information over all groups, and $\min(S_{category})$ is the minimum of the first weight values of the text information over all groups.
Thus, one first weight value can be calculated for the text information under each group in each third-party system. Table 1 shows one possible first weight value table of the text information in each group.
As shown in the table, the system contains three groups: bank questions, insurance questions, and chat questions. The first weight values of the currently input text information in the corresponding three groups in system 1 are S1, S2, and S3 respectively. The value of S1 equals the accumulated sum, over the feature words contained in the text information, of the product of each feature word's importance in the text information and its importance in the group; S2 and S3 are calculated in the same way as S1. For example, suppose the text information contains feature words M1, M2, and M3, where the first importance of M1 in the text information is A1 and its second importance in group K is A2, the first importance of M2 in the text information is A3 and its second importance in group K is A4, and the first importance of M3 in the text information is A5 and its second importance in group K is A6. Then the first weight value of the text information in group K is S = A1*A2 + A3*A4 + A5*A6, and correspondingly, the first weight values of the text information in the other groups can be calculated.
Further, since in different systems each feature word contained in the text information has the same importance in the text information, and each feature word has the same importance in a group, the first weight value of the text information in the same group is the same in different systems.
S13. Obtain, from a plurality of third-party systems, the search results corresponding to the text information, calculate the score of each search result according to each search result, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight value, and take the search result with the highest score as the output result corresponding to the text information.
In the embodiment of the present application, the output result for the current text information is preferably calculated by the following scheme:
First, the sum of the similarities between each search result and the other search results is calculated to obtain a first intermediate quantity. After the text information is received, the search result corresponding to the text information is obtained from each of the third-party systems, the similarity $C_{i\text{-}j}$ between each search result and each other search result is calculated using algorithms such as word-overlap calculation and word-vector distance calculation, and the similarities are summed to obtain the first intermediate quantity. The specific algorithms are conventional techniques in the art and are not described further here.
Second, the second weight value of each group in the specified system, in the third-party system to which it belongs, is calculated.
In the embodiment of the present application, the second weight value is the voting weight each third-party system gives each group; it depends on the scores of each round of retrieval and on its initial value. The present application preferably makes the initial values of the second weight values of the groups equal across the systems: assuming there are Q systems in total, the second weight value of each group in the third-party system to which it belongs is initially equal to 1/Q. Then, in the first retrieval, the second weight value corresponding to each group is 1/Q; after the first retrieval is completed, the second weight values corresponding to each group are recalculated from the first weight values of the text information in each group during the first retrieval and the scores of the search results, to generate the second weight values of the corresponding groups in the third-party systems to which they belong for the second retrieval.
Specifically, the present application preferably calculates the current second weight value of each group in each third-party system by the following scheme:
detect the scores of the search results of the text information in each third-party system produced in the previous round of retrieval, and the first weight values of the text information in each group;
when, in the previous round of retrieval, the text information has its largest first weight value in the i-th group and the score of its search result is highest in the k-th third-party system, calculate the current-round second weight value of the i-th group in the k-th third-party system from the second weight value corresponding to the i-th group in the previous round of retrieval, the first weight value of the text information in the i-th group, the score of the text information's search result in the k-th third-party system, and the learning rate, the learning rate being the amplitude of the second weight value adjustment;
the current-round second weight values of the other groups in the other third-party systems to which they belong remain the same as their previous-round second weight values, the other groups being the groups other than the i-th group among the groups in the specified system.
The present application preferably uses the above method to adjust each second weight value in every round according to the result of that round's retrieval process, where the initial value of each second weight value is preset; the second weight values used in each retrieval are set depending on the result of the previous retrieval.
The learning rate is the amplitude of the second weight value adjustment and is a very small number: since answering a single question should not adjust the corresponding parameters by a large amount, the value needs to be tested during use, and a value less than 0.001 is usually taken.
From the calculation principle of the second weight value it can be seen that after the first retrieval is completed, the second weight value of one of the groups in one of the third-party systems needs to be adjusted, while the second weight values of the other groups in the other third-party systems remain unchanged, i.e. equal to the value 1/Q used in the first retrieval. By analogy, every time a retrieval is completed, the second weight value of each group in the corresponding third-party system can be updated once according to the scores of that round.
Please refer to Table 2, a possible adjustment data table of the second weight values of the groups in the third-party systems.
Table 2: A possible adjustment data table of the second weight values of the groups in the third-party systems
As shown in Table 2, the system includes two third-party systems, system 1 and system 2, and each system includes three groups: group 1, group 2 and group 3. At the initial construction stage the second weight value of each group in each third-party system is equal, at 1/2. After one retrieval, the score of the retrieval result of group 3 in system 1 is the highest and the first weight value of the text information in group 3 is the largest, so the condition for adjusting the second weight value of group 3 in system 1 is satisfied; that second weight value is adjusted, the new second weight value 1/2 + M being computed from the score of the retrieval result of group 3 in system 1 and the first weight value of the text information in group 3.
In the embodiment of the present application, the formulas for computing the second weight value of each group in each third-party system are as follows:
When the previous retrieval satisfies the conditions:
a) the weight of the text information in group k′ = the maximum of the weights over all groups; and
b) the score of the retrieval result of the text information in system i′ = the maximum of the scores of all retrieval results,
then the second weight value of system i′ for group k′ can be increased and used as the input value of the second weight value of system i′ for group k′ in the current retrieval. Specifically, a temporary value is first computed with formula (3-1); consistent with the description and with the 1/2 + M example above, the old weight is increased by the learning rate times the product of the normalized first weight value and the score:
E′_{i′-k′}(new) = E_{i′-k′}(old) + η × S′_category(k′) × R_{i′}    (3-1)
E′_{i-k}(new) = E_{i-k}(old)  (i ≠ i′ or k ≠ k′)    (3-2)
where E′_{i′-k′}(new) is the temporary value after adjusting the second weight value of third-party system i′ for group k′, E_{i′-k′}(old) is the value of that second weight value before adjustment, and η is the learning rate.
Further, formula (3-2) indicates that while the second weight value of system i′ for group k′ is being adjusted, the second weight values of the other third-party systems for the other groups remain unchanged.
The softmax function is then applied to normalize the output temporary values into probability values, ensuring that for group k′ the second weight values of all third-party systems sum to 1, as in formula (3-3):
E_{i-k′}(new) = exp(E′_{i-k′}(new)) / Σ_{q=1..Q} exp(E′_{q-k′}(new))    (3-3)
where E_{i-k′}(new) is the adjusted second weight value. After the previous retrieval is completed, the second weight value of third-party system i′ for group k′ is thus adjusted to E_{i′-k′}(new), the second weight values of the other groups in the other third-party systems remain unchanged, and these second weight values serve as the input values of the second weight values of the groups in the corresponding third-party systems for the current retrieval.
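The following sketch ties formulas (3-1) to (3-3) together under the reconstruction given above; the additive update in (3-1) is an assumption consistent with the description, and the 0.0005 default learning rate merely respects the "smaller than 0.001" guidance.

```python
import math

def update_second_weights(E, i_star, k_star, s_norm, score, lr=0.0005):
    """E[i][k]: system i's voting weight for group k.
    i_star/k_star: the system and group that won the previous round;
    s_norm: the text's normalized first weight in group k_star;
    score: the winning retrieval result's score; lr: the learning rate."""
    temp = [row[:] for row in E]                 # (3-2): others unchanged
    temp[i_star][k_star] += lr * s_norm * score  # (3-1), reconstructed form
    col = [temp[i][k_star] for i in range(len(temp))]
    z = sum(math.exp(v) for v in col)            # (3-3): softmax over the
    for i in range(len(temp)):                   # systems for group k_star
        temp[i][k_star] = math.exp(col[i]) / z
    return temp
```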
Third, computing the sum, over the groups of the specified system, of the product of each group's second weight value and the normalized first weight value of the text information in that group, to obtain the second intermediate quantity as in formula (3-4):
Σ_{k=1..K} E_{i-k} × S′_category(k)    (3-4)
where K is the total number of groups, S′_category(k) is the normalized first weight value of the current text information in the k-th group, and E_{i-k} is the second weight value of the k-th group in the i-th system.
Fourth, computing the product of the first intermediate quantity and the second intermediate quantity to obtain the score of each retrieval result in its corresponding third-party system, and finally taking the retrieval result with the highest score as the output result corresponding to the text information.
Specifically, the present application computes the score of each retrieval result by formula (3-5):
R_i = r_i × Σ_{k=1..K} E_{i-k} × S′_category(k)    (3-5)
where r_i is the sum of the similarities between retrieval result i and the other retrieval results, K is the total number of groups, E_{i-k} is the second weight value of the k-th group in third-party system i, and S′_category(k) is the normalized first weight value of the text information in the k-th group.
In the embodiment of the present application, after the final output result is obtained, the output result is broadcast by TTS (text-to-speech).
Please refer to Table 3 below, a possible score table of the retrieval results corresponding to the text information.
Table 3: A possible score table of the retrieval results corresponding to the text information
As shown in Table 3, after receiving the text information, the system obtains the corresponding retrieval results F1, F2 and F3 from system 1, system 2 and system 3 respectively, and computes for each retrieval result the sum of its similarities with the other two results, obtaining r1, r2 and r3. From formula (3-5):
Score of retrieval result F1: R1 = r1 × (S1×E1 + S2×E2 + S3×E3);
Score of retrieval result F2: R2 = r2 × (S1×E4 + S2×E5 + S3×E6);
Score of retrieval result F3: R3 = r3 × (S1×E7 + S2×E8 + S3×E9).
The retrieval result corresponding to the largest of R1, R2 and R3 is taken as the output result corresponding to the text information.
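A sketch of the final scoring of formula (3-5) applied to the Table 3 example; the function is ours, and the placeholder values in the usage comment stand in for the table's symbols.

```python
def score_results(sim_sums, E, s_norm):
    """Formula (3-5). sim_sums[i]: r_i for system i's result;
    E[i][k]: second weight of group k in system i;
    s_norm[k]: normalized first weight of the text in group k."""
    scores = [r * sum(e * s for e, s in zip(E[i], s_norm))
              for i, r in enumerate(sim_sums)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores, best  # best indexes the winning retrieval result

# Usage with Table 3's placeholders (values hypothetical):
# scores, best = score_results([r1, r2, r3],
#                              [[E1, E2, E3], [E4, E5, E6], [E7, E8, E9]],
#                              [S1, S2, S3])
```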
It should be noted that the present application also provides an online learning optimization strategy: the final scores of the results are used to optimize the system's second weight values, so as to ultimately optimize the accuracy of the results output by the system.
Specifically, by introducing the learning-rate parameter, the present application adjusts the second weight value of each group in the third-party system to which it belongs once after every retrieval. The learning rate characterizes the adjustment magnitude of the second weight value. When adjusting the second weight values for the current retrieval round, the current-round second weight values are computed from the scores of the third-party systems' retrieval results produced in the previous retrieval, the second weight values of the groups, the first weight values of the text information in the groups, and the learning rate. Through this mechanism the present application continually adjusts the second weight value of each group in the corresponding third-party system, so that each system is given a higher weight for the question groups it is good at, making question matching increasingly accurate.
Specifically, when the system first goes online, the second weight values of system i for the groups are all equal, at 1/Q (assuming Q systems); that is, for the questions of each group, every system has an equal probability of outputting the accurate answer. After one retrieval round, assuming the first weight value of the text information in group K is the largest and the score of the retrieval result of the text information in system i is the highest, the new second weight value of group K in system i can be computed by formulas (3-1) to (3-3).
Please refer to FIG. 2. An embodiment of the present application further provides a multi-system combined natural language processing apparatus which, in this embodiment, includes an extraction module 11, a matching module 12 and a calculation module 13. Among these:
The extraction module 11 is configured to extract feature words from received text information.
In the embodiment of the present application, the text information may be a single sentence, such as a question input by a user, or a passage of text containing multiple questions. The feature words are the words of relatively high importance in the text information or, put plainly, the words that occur relatively often in it.
In a possible implementation, the present application preferably extracts the feature words of the received text information with the following scheme:
First, segmenting the text information into words; available tools include the Harbin Institute of Technology (HIT) word segmenter and the iFLYTEK speech cloud, among others. The specific segmentation methods are conventional means in the art and are not repeated here.
Second, processing the segmented content against pre-stored lexicons of synonyms, stop words and the like, so as to filter out stop words and replace synonyms.
After segmentation, the words need to be pre-processed by filtering, replacement and the like. The segmented words are counted first: if a word or phrase occurs frequently in one text but rarely in other texts, it is considered to have good category-discriminating ability and to be suitable for classification.
Generally, the words with the highest counts are the most common ones, such as "的", "是" and "在". Such words are "stop words": they contribute nothing to finding results and must be filtered out.
Further, if synonyms such as "开心" and "高兴" (both meaning "happy") appear in the text information, a single word can be used to replace the corresponding other synonyms, as sketched below.
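A sketch of this preprocessing step; jieba is one commonly used segmenter (the description names the HIT segmenter and the iFLYTEK speech cloud instead), and the stop-word and synonym lists are illustrative stand-ins for the pre-stored lexicons.

```python
import jieba  # one common Chinese segmenter; the text names other tools

STOP_WORDS = {"的", "是", "在"}  # illustrative stop-word lexicon
SYNONYMS = {"高兴": "开心"}      # synonym -> canonical word

def preprocess(text):
    tokens = jieba.lcut(text)                      # word segmentation
    tokens = [SYNONYMS.get(t, t) for t in tokens]  # synonym replacement
    return [t for t in tokens if t not in STOP_WORDS]
```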
Third, using training corpora such as Wikipedia and algorithms such as TF-IDF to compute the importance of each segmented word of the text information, and selecting a preset number of the highest-ranked words as the feature words of the text information.
TF-IDF is a statistical method for assessing how important a word is to one document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the relevance between a document and a user query.
In a given document, term frequency (TF) is the number of times a given word appears in that document. This count is usually normalized (the numerator is generally smaller than the denominator, which distinguishes TF from IDF) to prevent a bias toward long documents: the same word is likely to appear more often in a long document than in a short one, regardless of its importance.
Inverse document frequency (IDF) is a measure of a word's general importance. The IDF of a particular word can be obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient.
The main idea of TF-IDF is: if a word or phrase appears with a high frequency (TF) in one article and rarely in other articles, it is considered to have good category-discriminating ability and to be suitable for classification. TF-IDF is in fact TF × IDF, where TF is the term frequency and IDF is the inverse document frequency.
Specifically, in a given document the term frequency (TF) is the frequency with which a given word appears in that document, normalized over the document's word count to prevent a bias toward long documents. For a word t_i in document d_j, its importance can be expressed as formula (1-1):
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1-1)
where n_{i,j} is the number of occurrences of word t_i in document d_j, and the denominator is the sum of the occurrence counts of all words in document d_j.
Inverse document frequency (IDF) is a measure of a word's general importance. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient, as in formula (1-2):
idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (1-2)
where the numerator |D| is the total number of documents in the corpus and the denominator is the number of documents containing the word. The main idea of IDF is: the fewer the documents containing term t, that is, the smaller n is, the larger the IDF, indicating that term t discriminates categories well. If the number of documents of some category C that contain term t is m, and the total number of documents of other categories containing t is k, then the number of all documents containing t is n = m + k. When m is large, n is also large, and the IDF value given by the IDF formula is small, indicating that term t does not discriminate categories well. Put another way, the fewer the documents that contain a term, the larger its IDF and the better its category-discriminating ability. In practice, however, if a term appears frequently in the documents of one class, it represents the features of that class of texts well; such terms should be given higher weights and selected as feature words of that class to distinguish it from documents of other classes.
From formulas (1-1) and (1-2), the TF-IDF value of a word in a particular document is given by formula (1-3):
TI_{i,j} = tf_{i,j} × idf_i    (1-3)
Accordingly, a high term frequency within a given document, combined with a low document frequency for the term across the whole document collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common terms and retain important ones.
Based on the above principle of the TF-IDF algorithm, the importance of each word obtained by segmenting the text information can be computed, and the words whose TF-IDF exceeds a certain threshold are selected by these importance values as the feature words of the text information.
The matching module 12 is configured to calculate, according to the degree of matching between the feature words and pre-stored keywords, the first weight value of the text information in each group classified according to the keywords.
In the embodiment of the present application, the keyword list contained in each group needs to be compiled during the system construction stage. Assuming the text information is a single question sentence, the groups may be question categories into which question sentences are classified; for example, the groups may be banking questions, insurance questions, small-talk questions and the like. In other embodiments of the present application, the groups may also be other classification topics, which are not specifically limited here.
Specifically, taking the banking-question group as an example, the keyword "银行" (bank) is first searched in a search engine, the results returned by the search engine are recursively visited with a crawler tool, and the keywords in the result pages together with their importance are derived with methods such as TF-IDF. The 100 words ranked highest by importance are selected as the keywords of the banking-question group. In subsequent matching, the feature words of the received text information can be matched against these keywords to calculate the first weight value of the text information in this group.
In a possible implementation, the present application preferably calculates the first weight value of the text information in a group with the following scheme:
calculating, by TF-IDF, the first importance of each feature word in the text information;
calculating, by TF-IDF, the second importance of each feature word in a specified group;
the first weight value of the text information in the specified group equals the sum, over the feature words of the text information, of the product of the first importance and the second importance in the specified group.
In the embodiment of the present application, the formula for calculating the first weight value of the text information in a group is formula (2-1):
S_category(k) = Σ_j TI_{a_{j-k}} × I_{a_{j-k}}    (2-1)
where S_category(k) is the first weight value of the current text information in the k-th group, a_{j-k} is the j-th feature word of the text information matched under the k-th group, TI_{a_{j-k}} is the first importance of a_{j-k}, and I_{a_{j-k}} is the second importance of a_{j-k}.
The value of the first importance TI_{a_{j-k}} and the value of the second importance I_{a_{j-k}} can be computed with formulas (1-1), (1-2) and (1-3) of the TF-IDF algorithm above; substituting them into formula (2-1) yields the first weight value of the current text information in the k-th group.
Further, since the first weight value is obtained as a cumulative sum and enters subsequent calculations many times, values that are too large, too small, or too far apart from one another would all distort the results. The present application therefore also provides a step of normalizing the weights of the text information in the corresponding groups, normalizing all weights into (0, 1). The calculation formula is formula (2-2):
S′_category(k) = (S_category(k) − min(S_category)) / (max(S_category) − min(S_category))    (2-2)
where S′_category(k) is the normalized first weight value of the current text information in the k-th group, max(S_category) is the maximum of the first weight values of the text information over all groups, and min(S_category) is the minimum of those first weight values.
It follows that one first weight value can be computed for the text information under each group in each third-party system.
Please refer to Table 1 above, a possible table of the first weight values of the text information in each group.
As shown in Table 1, the system contains three groups: banking questions, insurance questions and small-talk questions. The first weight values of the currently input text information in the three groups of system 1 are S1, S2 and S3 respectively. The value of S1 equals the accumulated products of each feature word's importance within the text information and that feature word's importance within the group; S2 and S3 are computed in the same way as S1. For example, the text information contains feature words M1, M2 and M3, where the first importance of M1 in the text information is A1 and its second importance in group K is A2; the first importance of M2 in the text information is A3 and its second importance in group K is A4; and the first importance of M3 in the text information is A5 and its second importance in group K is A6. The first weight value of the text information in group K is then S = A1*A2 + A3*A4 + A5*A6. Correspondingly, the first weight values of the text information in the other groups can be computed.
Further, since the importance of each feature word of the text information within the text information is the same across different systems, and each feature word's importance within a group is also the same, the first weight value of the text information in the same group is identical across different systems.
The calculation module 13 is configured to obtain, from a plurality of third-party systems, retrieval results corresponding to the text information, calculate a score for each retrieval result according to the retrieval results, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight values, and take the retrieval result with the highest score as the output result corresponding to the text information.
In the embodiment of the present application, the output result for the current text information is preferably computed with the following scheme:
First, computing the sum of the similarities between each retrieval result and the other retrieval results to obtain a first intermediate quantity.
After the text information is received, the retrieval results corresponding to the text information are obtained from the third-party systems, and the similarity C_{i-j} between each retrieval result and each of the other retrieval results is computed with algorithms such as word-overlap computation or word-vector distance computation. The similarities are then summed,
r_i = Σ_{j≠i} C_{i-j},
to obtain the first intermediate quantity. The specific algorithms are conventional means in the art and are not repeated here.
Second, computing the second weight value of each group of the specified system in the third-party system to which it belongs.
In the embodiment of the present application, the second weight value is the voting weight of each third-party system for each group; it depends on the score of each retrieval round and on its initial value. The present application preferably sets equal initial second weight values for every group in every system: assuming there are Q systems in total, the second weight value of each group in the third-party system to which it belongs is equal, at 1/Q. In the first retrieval, the second weight value corresponding to each group is thus 1/Q. After the first retrieval is completed, the second weight values of the groups are recomputed from the first weight values of the first retrieval's text information in the groups and the scores of the retrieval results, generating the second weight values of the groups in their third-party systems for the second retrieval.
Specifically, the present application preferably computes the current second weight value of each group in each third-party system with the following scheme:
detecting the scores of the retrieval results of the text information in the third-party systems produced in the previous retrieval round, and the first weight values of the text information in the groups;
when, in the previous retrieval round, the first weight value of the text information in the i-th group is the largest and the score of the retrieval result of the text information in the k-th third-party system is the largest, computing the current-round second weight value of the i-th group in the k-th third-party system from the second weight value of the i-th group in the previous round, the first weight value of the text information in the i-th group, the score of the retrieval result of the text information in the k-th third-party system, and a learning rate, the learning rate being the magnitude by which the second weight value is adjusted;
the current-round second weight values of the other groups in the other third-party systems to which they belong being the same as in the previous round, the other groups being the groups of the specified system other than the i-th group.
The learning rate is the magnitude by which the second weight value is adjusted. It is a very small number: since answering a single question should not greatly adjust the corresponding parameters, its value needs to be tuned through testing during use, and a value smaller than 0.001 can usually be taken.
From the calculation principle of the second weight value, after the first retrieval is completed the second weight value of one group in one third-party system needs to be adjusted, while the second weight values of the other groups in the other third-party systems remain unchanged, i.e., equal to the first-retrieval value 1/Q. By analogy, each time a retrieval is completed, the second weight value of each group in the corresponding third-party system can be updated once according to the scores of that round.
Please refer to Table 2 above, a possible adjustment data table of the second weight values of the groups in the third-party systems.
As shown in Table 2, the system includes two third-party systems, system 1 and system 2, and each system includes three groups: group 1, group 2 and group 3. At the initial construction stage the second weight value of each group in each third-party system is equal, at 1/2. After one retrieval, the score of the retrieval result of group 3 in system 1 is the highest and the first weight value of the text information in group 3 is the largest, so the condition for adjusting the second weight value of group 3 in system 1 is satisfied; that second weight value is adjusted, the new second weight value 1/2 + M being computed from the score of the retrieval result of group 3 in system 1 and the first weight value of the text information in group 3.
In the embodiment of the present application, the formulas for computing the second weight value of each group in each third-party system are as follows:
When the previous retrieval satisfies the conditions:
a) the weight of the text information in group k′ = the maximum of the weights over all groups; and
b) the score of the retrieval result of the text information in system i′ = the maximum of the scores of all retrieval results,
then the second weight value of system i′ for group k′ can be increased and used as the input value of the second weight value of system i′ for group k′ in the current retrieval. Specifically, a temporary value is first computed with formula (3-1); consistent with the description and with the 1/2 + M example above, the old weight is increased by the learning rate times the product of the normalized first weight value and the score:
E′_{i′-k′}(new) = E_{i′-k′}(old) + η × S′_category(k′) × R_{i′}    (3-1)
E′_{i-k}(new) = E_{i-k}(old)  (i ≠ i′ or k ≠ k′)    (3-2)
where E′_{i′-k′}(new) is the temporary value after adjusting the second weight value of third-party system i′ for group k′, E_{i′-k′}(old) is the value of that second weight value before adjustment, and η is the learning rate.
Specifically, η, the learning rate, is a very small number: since answering one question should not greatly adjust the corresponding parameters, its value needs to be tuned through testing during use and is usually smaller than 0.001.
Further, formula (3-2) indicates that while the second weight value of system i′ for group k′ is being adjusted, the second weight values of the other third-party systems for the other groups remain unchanged.
The softmax function is then applied, ensuring that for group k′ the second weight values of all third-party systems sum to 1, as in formula (3-3):
E_{i-k′}(new) = exp(E′_{i-k′}(new)) / Σ_{q=1..Q} exp(E′_{q-k′}(new))    (3-3)
where E_{i-k′}(new) is the adjusted second weight value.
Third, computing the sum, over the groups of the specified system, of the product of each group's second weight value and the normalized first weight value of the text information in that group, to obtain the second intermediate quantity as in formula (3-4):
Σ_{k=1..K} E_{i-k} × S′_category(k)    (3-4)
where K is the total number of groups, S′_category(k) is the normalized first weight value of the current text information in the k-th group, and E_{i-k} is the second weight value of the k-th group in the i-th system.
Fourth, computing the product of the first intermediate quantity and the second intermediate quantity to obtain the score of each retrieval result in its corresponding third-party system, and finally taking the retrieval result with the highest score as the output result corresponding to the text information.
Specifically, the present application computes the score of each retrieval result by formula (3-5):
R_i = r_i × Σ_{k=1..K} E_{i-k} × S′_category(k)    (3-5)
where r_i is the sum of the similarities between retrieval result i and the other retrieval results, K is the total number of groups, E_{i-k} is the second weight value of the k-th group in third-party system i, and S′_category(k) is the normalized first weight value of the text information in the k-th group.
In the embodiment of the present application, after the final output result is obtained, the output result is broadcast by TTS (text-to-speech).
Please refer to Table 3 above, a possible score table of the retrieval results corresponding to the text information.
As shown in Table 3, after receiving the text information, the system obtains the corresponding retrieval results F1, F2 and F3 from system 1, system 2 and system 3 respectively, and computes for each retrieval result the sum of its similarities with the other two results, obtaining r1, r2 and r3. From formula (3-5):
Score of retrieval result F1: R1 = r1 × (S1×E1 + S2×E2 + S3×E3);
Score of retrieval result F2: R2 = r2 × (S1×E4 + S2×E5 + S3×E6);
Score of retrieval result F3: R3 = r3 × (S1×E7 + S2×E8 + S3×E9).
The retrieval result corresponding to the largest of R1, R2 and R3 is taken as the output result corresponding to the text information.
It should be noted that the present application also provides an online learning optimization strategy: the final scores of the results are used to optimize the system's second weight values, so as to ultimately optimize the accuracy of the results output by the system.
Specifically, by introducing the learning-rate parameter, the present application adjusts the second weight value of each group in the third-party system to which it belongs once after every retrieval. The learning rate characterizes the adjustment magnitude of the second weight value. When adjusting the second weight values for the current retrieval round, the current-round second weight values are computed from the scores of the third-party systems' retrieval results produced in the previous retrieval, the second weight values of the groups, the first weight values of the text information in the groups, and the learning rate. Through this mechanism the present application continually adjusts the second weight value of each group in the corresponding third-party system, so that each system is given a higher weight for the question groups it is good at, making question matching increasingly accurate.
Specifically, when the system first goes online, the second weight values of system i for the groups are all equal, at 1/Q (assuming Q systems); that is, for the questions of each group, every system has an equal probability of outputting the accurate answer. After one retrieval round, assuming the first weight value of the text information in group K is the largest and the score of the retrieval result of the text information in system i is the highest, the new second weight value of group K in system i can be computed by formulas (3-1) to (3-3).
In one embodiment, the present application further provides a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the following steps: extracting feature words from received text information; calculating, according to the degree of matching between the feature words and pre-stored keywords, the first weight value of the text information in each group classified according to the keywords; and obtaining, from a plurality of third-party systems, retrieval results corresponding to the text information, calculating a score for each retrieval result according to the retrieval results, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight values, and taking the retrieval result with the highest score as the output result corresponding to the text information.
The step, executed by the processor, of calculating, according to the degree of matching between the feature words and pre-stored keywords, the first weight value of the text information in each group classified according to the keywords includes: computing, by the TF-IDF algorithm, the first importance of each feature word in the text information; computing, by the TF-IDF algorithm, the second importance of each feature word in a specified group; the first weight value of the text information in the specified group equaling the sum, over the feature words of the text information, of the product of the first importance and the second importance in the specified group.
In one embodiment, when executing the computer-readable instructions, the processor further performs the following step: normalizing the first weight values.
The step, executed by the processor, of obtaining, from a plurality of third-party systems, retrieval results corresponding to the text information and calculating the score of each retrieval result according to the retrieval results, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight values includes: computing the sum of the similarities between each retrieval result and the other retrieval results to obtain a first intermediate quantity; computing the second weight value of each group of the specified system in the third-party system to which it belongs; computing the sum, over the groups of the specified system, of the product of each group's second weight value and the normalized first weight value of the text information in that group, to obtain a second intermediate quantity; and computing the product of the first intermediate quantity and the second intermediate quantity to obtain the score of each retrieval result in the third-party system corresponding to that retrieval result.
Please refer to FIG. 3, a schematic diagram of the internal structure of the computer device in one embodiment. As shown in FIG. 3, the computer device includes a processor 1, a storage medium 2, a memory 3 and a network interface 4 connected through a system bus. The storage medium 2 of the computer device stores an operating system, a database and computer-readable instructions; the database may store a sequence of control information, and the computer-readable instructions, when executed by the processor 1, can cause the processor 1 to implement a multi-system combined natural language processing method, with the processor 1 able to implement the functions of the extraction module, the matching module and the calculation module of the multi-system combined natural language processing apparatus in the embodiment shown in FIG. 2. The processor 1 of the computer device provides computing and control capabilities and supports the operation of the whole computer device. The memory 3 of the computer device may store computer-readable instructions which, when executed by the processor 1, can cause the processor 1 to perform a multi-system combined natural language processing method. The network interface 4 of the computer device is used to connect and communicate with terminals. A person skilled in the art will understand that the structure shown in FIG. 3 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, the present application further provides a non-volatile storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: extracting feature words from received text information; calculating, according to the degree of matching between the feature words and pre-stored keywords, the first weight value of the text information in each group classified according to the keywords; and obtaining, from a plurality of third-party systems, retrieval results corresponding to the text information, calculating a score for each retrieval result according to the retrieval results, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight values, and taking the retrieval result with the highest score as the output result corresponding to the text information.
Summarizing the above embodiments, the greatest beneficial effects of the present application are as follows:
Addressing the problems of existing single natural language processing systems, namely single-faceted results, insufficient knowledge coverage and too low a degree of matching between question and answer, the present application designs a solution in which multiple single natural language processing systems are combined and multiple systems vote to produce the final output result. The relevant parameters are continually adjusted through learning, so the solution keeps optimizing itself during use: for different groups, such as small-talk questions, weather questions, business questions and news, each system is given a different second weight value, making the returned results increasingly accurate.
The present application provides a scoring mechanism that scores the obtained retrieval results to ultimately select the best output result. Correspondingly, the present application also provides an adjustment mechanism based on these scores, which adjusts in real time the second weight value corresponding to a group according to the score of each retrieval result and the first weight value of the text information in the corresponding group. Specifically, the present application determines the adjustment value for the second weight value from the score of each retrieval result and the first weight value of the text information in the corresponding group, and applies the adjustment value to the second weight value in real time; by continually adjusting each third-party system's second weight value for the different groups, online learning optimization of the system is achieved, so that the accuracy of the output results grows ever higher.
In summary, by combining multiple single natural language processing systems and having multiple third-party systems vote to produce the final output result, the present application solves the problems of the prior art: single-faceted output results, insufficient knowledge coverage, and too low a degree of matching between question and answer.
A person of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM).
Claims (20)
- A multi-system combined natural language processing method, the method comprising: extracting feature words from received text information; calculating, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords; and obtaining, from a plurality of third-party systems, retrieval results corresponding to the text information, calculating a score for each retrieval result according to the retrieval results, the second weight value of each group in the third-party system to which it belongs, and the first weight value corresponding to each group, and taking the retrieval result with the highest score as the output result corresponding to the text information.
- The multi-system combined natural language processing method according to claim 1, wherein calculating, according to the degree of matching between the feature words and pre-stored keywords, the first weight value of the text information in each group classified according to the keywords specifically comprises: computing, by the term frequency-inverse document frequency (TF-IDF) algorithm, a first importance of each feature word in the text information; computing, by the TF-IDF algorithm, a second importance of each feature word in a specified group; and computing the sum, over the feature words of the text information, of the product of the first importance and the second importance in the specified group, to obtain the first weight value of the text information in the specified group.
- The multi-system combined natural language processing method according to claim 1, wherein obtaining, from a plurality of third-party systems, retrieval results corresponding to the text information and calculating the score of each retrieval result according to the retrieval results, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight values specifically comprises: computing the sum of the similarities between each retrieval result and the other retrieval results to obtain a first intermediate quantity; computing the second weight value of each group of a specified system in the third-party system to which it belongs; computing the sum, over the groups of the specified system, of the product of each group's second weight value and the normalized first weight value of the text information in that group, to obtain a second intermediate quantity; and computing the product of the first intermediate quantity and the second intermediate quantity to obtain the score of each retrieval result in the third-party system corresponding to that retrieval result.
- The multi-system combined natural language processing method according to claim 5, wherein computing the second weight value of each group of the specified system in the third-party system to which it belongs specifically comprises: when, in the previous retrieval round, the first weight value of the text information in the i-th group is the largest and the score of the retrieval result of the text information in the k-th third-party system is the largest, computing the current-round second weight value of the i-th group in the k-th third-party system from the second weight value of the i-th group in the previous round, the first weight value of the text information in the i-th group, the retrieval result of the text information in the k-th third-party system, and a learning rate, the learning rate being the magnitude by which the second weight value is adjusted; the current-round second weight values of the other groups in the other third-party systems to which they belong being the same as in the previous round, the other groups being the groups of the specified system other than the i-th group.
- A multi-system combined natural language processing apparatus, comprising: an extraction module configured to extract feature words from received text information; a matching module configured to calculate, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords; and a calculation module configured to obtain, from a plurality of third-party systems, retrieval results corresponding to the text information, calculate a score for each retrieval result according to the retrieval results, the second weight value of each group in the third-party system to which it belongs, and the first weight value corresponding to each group, and take the retrieval result with the highest score as the output result corresponding to the text information.
- A computer device comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform a multi-system combined natural language processing method comprising the following steps: extracting feature words from received text information; calculating, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords; and obtaining, from a plurality of third-party systems, retrieval results corresponding to the text information, calculating a score for each retrieval result according to the retrieval results, the second weight value of each group in the third-party system to which it belongs, and the first weight value corresponding to each group, and taking the retrieval result with the highest score as the output result corresponding to the text information.
- The computer device according to claim 9, wherein calculating, according to the degree of matching between the feature words and pre-stored keywords, the first weight value of the text information in each group classified according to the keywords specifically comprises: computing, by the term frequency-inverse document frequency (TF-IDF) algorithm, a first importance of each feature word in the text information; computing, by the TF-IDF algorithm, a second importance of each feature word in a specified group; and computing the sum, over the feature words of the text information, of the product of the first importance and the second importance in the specified group, to obtain the first weight value of the text information in the specified group.
- The computer device according to claim 9, wherein obtaining, from a plurality of third-party systems, retrieval results corresponding to the text information and calculating the score of each retrieval result according to the retrieval results, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight values specifically comprises: computing the sum of the similarities between each retrieval result and the other retrieval results to obtain a first intermediate quantity; computing the second weight value of each group of a specified system in the third-party system to which it belongs; computing the sum, over the groups of the specified system, of the product of each group's second weight value and the normalized first weight value of the text information in that group, to obtain a second intermediate quantity; and computing the product of the first intermediate quantity and the second intermediate quantity to obtain the score of each retrieval result in the third-party system corresponding to that retrieval result.
- A computer-readable non-volatile storage medium storing a program which, when executed by a processor, implements a multi-system combined natural language processing method comprising the following steps: extracting feature words from received text information; calculating, according to the degree of matching between the feature words and pre-stored keywords, a first weight value of the text information in each group classified according to the keywords; and obtaining, from a plurality of third-party systems, retrieval results corresponding to the text information, calculating a score for each retrieval result according to the retrieval results, the second weight value of each group in the third-party system to which it belongs, and the first weight value corresponding to each group, and taking the retrieval result with the highest score as the output result corresponding to the text information.
- The computer-readable non-volatile storage medium according to claim 14, wherein calculating, according to the degree of matching between the feature words and pre-stored keywords, the first weight value of the text information in each group classified according to the keywords specifically comprises: computing, by the term frequency-inverse document frequency (TF-IDF) algorithm, a first importance of each feature word in the text information; computing, by the TF-IDF algorithm, a second importance of each feature word in a specified group; and computing the sum, over the feature words of the text information, of the product of the first importance and the second importance in the specified group, to obtain the first weight value of the text information in the specified group.
- The computer-readable non-volatile storage medium according to claim 14, wherein obtaining, from a plurality of third-party systems, retrieval results corresponding to the text information and calculating the score of each retrieval result according to the retrieval results, the second weight value of each group in the third-party system to which it belongs, and the corresponding first weight values specifically comprises: computing the sum of the similarities between each retrieval result and the other retrieval results to obtain a first intermediate quantity; computing the second weight value of each group of a specified system in the third-party system to which it belongs; computing the sum, over the groups of the specified system, of the product of each group's second weight value and the normalized first weight value of the text information in that group, to obtain a second intermediate quantity; and computing the product of the first intermediate quantity and the second intermediate quantity to obtain the score of each retrieval result in the third-party system corresponding to that retrieval result.
- The computer-readable non-volatile storage medium according to claim 18, wherein computing the second weight value of each group of the specified system in the third-party system to which it belongs specifically comprises: when, in the previous retrieval round, the first weight value of the text information in the i-th group is the largest and the score of the retrieval result of the text information in the k-th third-party system is the largest, computing the current-round second weight value of the i-th group in the k-th third-party system from the second weight value of the i-th group in the previous round, the first weight value of the text information in the i-th group, the retrieval result of the text information in the k-th third-party system, and a learning rate, the learning rate being the magnitude by which the second weight value is adjusted; the current-round second weight values of the other groups in the other third-party systems to which they belong being the same as in the previous round, the other groups being the groups of the specified system other than the i-th group.