CN111401039A

CN111401039A - Word retrieval method, device, equipment and storage medium based on binary mutual information

Info

Publication number: CN111401039A
Application number: CN202010146242.7A
Authority: CN
Inventors: 梁志成
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2020-03-05
Filing date: 2020-03-05
Publication date: 2020-07-10

Abstract

The invention relates to the field of artificial intelligence, and discloses a word retrieval method, a word retrieval device, word retrieval equipment and a storage medium based on binary mutual information, which are used for increasing the accuracy of extracting keywords and further retrieving effective message responses. The method comprises the following steps: acquiring a target problem text sent by a target user; segmenting a target problem text to obtain a plurality of candidate words, wherein each candidate word has uniqueness; calling a preset corpus to determine word frequency TF and inverse file frequency IDF of a plurality of target words in a plurality of candidate words; calculating the binary mutual information of each target word in the plurality of target words according to a preset formula; obtaining an adjusting factor of each target word in a plurality of target words according to a preset algorithm; calculating to obtain a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word; and determining keywords of the target question text according to the weight value of each target word, and retrieving corresponding answers according to the keywords.

Description

Word retrieval method, device, equipment and storage medium based on binary mutual information

Technical Field

The invention relates to the technical field of keyword matching, in particular to a word retrieval method, a word retrieval device, word retrieval equipment and a storage medium based on binary mutual information.

Background

With the development of information technology, people can search and acquire required information from a network more and more conveniently. However, how to quickly acquire required information from massive network information is very important, so that information retrieval technologies come along, wherein one of the important support technologies is a keyword extraction technology. At present, the most widely used keyword extraction technology is the term frequency-inverse document frequency (TF-IDF) algorithm, and the basic principle of the TF-IDF algorithm is to calculate the ranking of words by the occurrence times and the term frequency weight, and select the top words as keywords.

In the existing scheme, because the calculation of the TF-IDF algorithm weight excessively depends on word frequency and the influence of the distribution condition of words in different documents on the weight is not considered, the existing product has inaccurate information extraction and cannot retrieve effective message response.

Disclosure of Invention

The invention provides a word retrieval method, a word retrieval device, word retrieval equipment and a storage medium based on binary mutual information, which are used for adding an adjusting factor to adjust the weight value of word frequency, eliminating the influence of different document distribution conditions on the weight value by adopting a mutual information factor, reducing the dependence degree of a TF-IDF algorithm on the word frequency, improving the accuracy of the weight value, increasing the accuracy of extracting key words and further retrieving effective message responses.

The first aspect of the embodiments of the present invention provides a term search method based on binary mutual information, including: acquiring a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text; segmenting the target problem text to obtain a plurality of candidate words, wherein each candidate word has uniqueness; calling a preset corpus to determine word frequency TF and inverse file frequency IDF of a plurality of target words in the candidate words; calculating the binary mutual information of each target word in the plurality of target words according to a preset formula; obtaining an adjustment factor of each target word in the plurality of target words according to a preset algorithm; calculating to obtain a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word; and determining keywords of the target question text according to the weight value of each target word, and retrieving corresponding answers according to the keywords.

Optionally, in a first implementation manner of the first aspect of the embodiment of the present invention, the invoking a preset corpus to determine word frequencies TF and inverse document frequencies IDF of a plurality of target words in the plurality of candidate words includes: performing stop word filtering processing on the candidate words to obtain a plurality of target words; calling a preset corpus to determine the word frequency TF of each target word in a plurality of target words; and calling a preset corpus to determine the inverse file frequency IDF of each target word in the plurality of target words.

Optionally, in a second implementation manner of the first aspect of the embodiment of the present invention, the invoking a preset corpus to determine a word frequency TF of each target word in a plurality of target words includes: acquiring a preset corpus and determining a target corpus document in the preset corpus; and determining the occurrence frequency T of each target word in the target corpus document, and generating the word frequency TF of each target word.

Optionally, in a third implementation manner of the first aspect of the embodiment of the present invention, the invoking a preset corpus to determine a reverse document frequency IDF of each target term in a plurality of target terms includes: acquiring a preset corpus and determining the total number M of corpus documents in the preset corpus; determining the document quantity Wi including a first target term in M corpus documents, wherein i is a positive integer, and the first target term is any one of the target terms; calling a first preset formula, the document number Wi and the total number M to generate a reverse file frequency IDF of the first target term, wherein the first preset formula is that the IDF is log₂(M/Wi + 1); and generating the inverse file frequency IDF of each target word.

Alternatively, in the first aspect of the embodiments of the present inventionIn a fourth implementation manner of the present invention, the calculating the binary mutual information of each target term in the plurality of target terms according to a preset formula includes: selecting any one target word from the plurality of target words as a candidate target word; determining a count of occurrences of the candidate target term in two consecutive corpus documents

Obtaining a first count; determining a count of occurrences of the candidate target term in two sequential corpus documents

As a second count; calculating the ratio of the first count and the second count to obtain a first ratio p (X | w)_i,w_i+k,w_i+1) (ii) a According to the first ratio p (X | w)_i,w_i+k,w_i+1) Determining binary mutual information mi (x, w) of the candidate target words_i,w_i+k,w_i+1)＝log₂p(X|w_i,w_i+k,w_i+1) (ii) a And generating the binary mutual information of other target words in the target words to obtain the binary mutual information of each target word.

Optionally, in a fifth implementation manner of the first aspect of the embodiment of the present invention, the obtaining, according to a preset algorithm, an adjustment factor of each target term in the plurality of target terms includes: determining a current business scene based on the target question text; dividing the plurality of target words into saliva words and key words based on the current business scenario; setting the adjustment factor corresponding to the saliva word as a negative number, and setting the adjustment factor corresponding to the key word as a positive number; an adjustment factor for each target word is generated.

Optionally, in a sixth implementation manner of the first aspect of the embodiment of the present invention, the calculating a weight value of each target term according to the adjustment factor of each target term and the binary mutual information of each target term includes:

selecting one target word from the plurality of target words as a second target word,determining an adjustment factor mu corresponding to the second target word_xAnd binary mutual information mi (x); calculating the weight value f (x) of the second target word according to a preset formula, wherein the preset formula is as follows: f (x) mi (x) TF IDF + μ_x(ii) a And calculating to obtain the weight values of other target words in the target words, and obtaining the weight value of each target word.

A second aspect of the embodiments of the present invention provides a word search apparatus based on binary mutual information, including: the first obtaining unit is used for obtaining a target question text sent by a target user, and the target question text is used for indicating to obtain an answer corresponding to the target question text; the word segmentation unit is used for segmenting the target problem text to obtain a plurality of candidate words, and each candidate word has uniqueness; the calling and determining unit is used for calling a preset corpus to determine the word frequency TF and the inverse file frequency IDF of a plurality of target words in the candidate words; the first calculation unit is used for calculating the binary mutual information of each target word in the plurality of target words according to a preset formula; the second acquisition unit is used for acquiring the adjustment factor of each target word in the plurality of target words according to a preset algorithm; the second calculation unit is used for calculating the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word; and the determining and retrieving unit is used for determining the keywords of the target question text according to the weight value of each target word and retrieving the corresponding answers according to the keywords.

Optionally, in a first implementation manner of the second aspect of the embodiment of the present invention, the call determining unit includes: the filtering module is used for filtering stop words of the candidate words to obtain a plurality of target words; the first determining module is used for calling a preset corpus to determine the word frequency TF of each target word in a plurality of target words; and the second determining module is used for calling a preset corpus to determine the inverse file frequency IDF of each target word in the plurality of target words.

Optionally, in a second implementation manner of the second aspect of the embodiment of the present invention, the first determining module is specifically configured to: acquiring a preset corpus and determining a target corpus document in the preset corpus; and determining the occurrence frequency T of each target word in the target corpus document, and generating the word frequency TF of each target word.

Optionally, in a third implementation manner of the second aspect of the embodiment of the present invention, the second determining module is specifically configured to: acquiring a preset corpus and determining the total number M of corpus documents in the preset corpus; determining the document quantity Wi including a first target term in M corpus documents, wherein i is a positive integer, and the first target term is any one of the target terms; calling a first preset formula, the document number Wi and the total number M to generate a reverse file frequency IDF of the first target term, wherein the first preset formula is that the IDF is log₂(M/Wi + 1); and generating the inverse file frequency IDF of each target word.

Optionally, in a fourth implementation manner of the second aspect of the embodiment of the present invention, the first calculating unit is specifically configured to: selecting any one target word from the plurality of target words as a candidate target word; determining a count of occurrences of the candidate target term in two consecutive corpus documents

Optionally, in a fifth implementation manner of the second aspect of the embodiment of the present invention, the second obtaining unit is specifically configured to: determining a current business scene based on the target question text; dividing the plurality of target words into saliva words and key words based on the current business scenario; setting the adjustment factor corresponding to the saliva word as a negative number, and setting the adjustment factor corresponding to the key word as a positive number; an adjustment factor for each target word is generated.

Optionally, in a sixth implementation manner of the second aspect of the embodiment of the present invention, the second calculating unit is specifically configured to:

selecting one target word from a plurality of target words as a second target word, and determining an adjustment factor mu corresponding to the second target word_xAnd binary mutual information mi (x); calculating the weight value f (x) of the second target word according to a preset formula, wherein the preset formula is as follows: f (x) mi (x) TF IDF + μ_x(ii) a And calculating to obtain the weight values of other target words in the target words, and obtaining the weight value of each target word.

A third aspect of the embodiments of the present invention provides a word retrieval device based on binary mutual information, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the word retrieval method based on binary mutual information according to any one of the above embodiments when executing the computer program.

A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the method for word retrieval based on binary mutual information according to any one of the above embodiments.

According to the technical scheme provided by the embodiment of the invention, a target question text sent by a target user is obtained, and the target question text is used for indicating to obtain an answer corresponding to the target question text; segmenting a target problem text to obtain a plurality of candidate words, wherein each candidate word has uniqueness; calling a preset corpus to determine word frequency TF and inverse file frequency IDF of a plurality of target words in a plurality of candidate words; calculating the binary mutual information of each target word in the plurality of target words according to a preset formula; obtaining an adjusting factor of each target word in a plurality of target words according to a preset algorithm; calculating to obtain a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word; and determining keywords of the target question text according to the weight value of each target word, and retrieving corresponding answers according to the keywords. According to the embodiment of the invention, an adjusting factor is added to adjust the weighted value of the word frequency, and mutual information factors are adopted to eliminate the influence of different document distribution conditions on the weighted value, so that the dependence degree of the TF-IDF algorithm on the word frequency is reduced, the accuracy of the weighted value is improved, the accuracy of extracting the key words is improved, and effective message response is retrieved.

Drawings

FIG. 1 is a schematic diagram of an embodiment of a term retrieval method based on binary mutual information according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of another embodiment of a term retrieval method based on binary mutual information according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an embodiment of a word retrieval apparatus based on binary mutual information according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of another embodiment of a word retrieval device based on binary mutual information according to an embodiment of the present invention;

fig. 5 is a schematic diagram of an embodiment of a word retrieval device based on binary mutual information in the embodiment of the present invention.

Detailed Description

In order to make the technical field of the invention better understand the scheme of the invention, the embodiment of the invention will be described in conjunction with the attached drawings in the embodiment of the invention.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Referring to fig. 1, a flowchart of a term retrieval method based on binary mutual information according to an embodiment of the present invention specifically includes:

101. and acquiring a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text.

The server acquires a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text.

For example, in a chat customer service robot, a target user presents a question, "what interests do the insurance have? ", the server needs to retrieve the associated rights for" foupy insurance "and to use the retrieved rights as an answer to the question.

It is understood that the execution subject of the present invention may be a word retrieval device based on binary mutual information, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.

102. And performing word segmentation on the target problem text to obtain a plurality of candidate words, wherein each candidate word has uniqueness.

The server performs word segmentation on the target problem text to obtain a plurality of candidate words, and each candidate word has uniqueness. In this embodiment, the server performs word segmentation on the target problem text by using a preset word segmentation algorithm. The preset word segmentation algorithm comprises a word segmentation algorithm based on character string matching, a word segmentation algorithm based on understanding and a word segmentation algorithm based on statistics.

It should be noted that (1) for the word segmentation algorithm based on character string matching, it matches the chinese character string to be analyzed with the entry in a "sufficiently large" machine dictionary according to a certain strategy, and if a certain character string is found in the dictionary, the matching is successful (a word is recognized). According to different scanning directions, the string matching word segmentation method can be divided into forward matching and reverse matching; according to the condition of preferential matching of different lengths, the method can be divided into maximum (longest) matching and minimum (shortest) matching; whether the method is combined with the part-of-speech tagging process or not can be divided into a simple word segmentation method and an integrated method combining word segmentation and tagging. (2) For the word segmentation algorithm based on understanding, the effect of recognizing words is achieved by enabling a computer to simulate the understanding of sentences by a human. The basic idea is to analyze syntax and semantics while segmenting words, and to process ambiguity phenomenon by using syntax information and semantic information. It generally comprises three parts: word segmentation subsystem, syntax semantic subsystem, and master control part. Under the coordination of the master control part, the word segmentation subsystem can obtain syntactic and semantic information of related words, sentences and the like to judge word segmentation ambiguity, namely the word segmentation subsystem simulates the process of understanding sentences by people. (3) For the statistical-based word segmentation algorithm, a word is a stable combination of words in form, so that the more times adjacent words appear simultaneously in the context, the more likely it is to constitute a word. Therefore, the frequency or probability of the co-occurrence of the characters and the adjacent characters can better reflect the credibility of the words. The frequency of the combination of adjacent co-occurring words in the material can be counted to calculate their co-occurrence information. The mutual-occurrence information embodies the closeness of the combination relationship between the Chinese characters. When the degree of closeness is above a certain threshold, it is considered that the word group may constitute a word. It is understood that other word segmentation methods can be adopted, and the details are not limited herein.

For example, in a chat customer service robot, a target user presents a question, "what interests do the insurance have? After word segmentation, words such as 'good fortune, insurance, rights and interests', 'which', 'have' and the like are obtained.

103. And calling a preset corpus to determine the word frequency TF and the inverse file frequency IDF of a plurality of target words in the candidate words.

The server calls a preset corpus to determine word frequency TF and inverse file frequency IDF of a plurality of target words in the candidate words. Specifically, the server performs stop word filtering processing on a plurality of candidate words to obtain a plurality of target words; the server calls a preset corpus to determine the word frequency TF of each target word in the plurality of target words; the server calls a preset corpus to determine the inverse file frequency IDF of each target word in the plurality of target words.

104. And calculating the binary mutual information of each target word in the plurality of target words according to a preset formula.

And the server calculates the binary mutual information of each target word in the plurality of target words according to a preset formula. The method specifically comprises the following steps:

(1) the server selects any one target word from the plurality of target words as a candidate target word.

(2) The server determines a count of occurrences of the candidate target term in two consecutive corpus documents

A first count is obtained.

Wherein the content of the first and second substances,

in two consecutive documents w for target term X_i,w_i+1Counts of occurrences, for example: document w₁,w₂,w₃All contain the word A, then

(3) The server determines a count of occurrences of the candidate target term in the two sequential corpus documents

As the second count.

Wherein the content of the first and second substances,

in two sequential documents w for target term X_i,w_i+kCounts of occurrences, for example: document w₁,w₂,w₃All contain the word A, then

(4) The server calculates the ratio of the first count to the second count to obtain a first ratio p (X | w)_i,w_i+k,w_i+1)；

Wherein the content of the first and second substances,

the ratio of the count of the occurrence of a word X in two documents that are consecutively adjacent to each other to the count of the occurrence of A in two documents in order, e.g., w₁,w₂For a continuous document, w₁,w₃Is a non-continuous document; w is a₁,w₂、w₁,w₃For sequential documents, w₂,w₁Is a non-sequential document.

(5) The server is based on the first ratio p (X | w)_i,w_i+k,w_i+1) Determining binary mutual information mi (x, w) of candidate target words_i,w_i+k,w_i+1)＝log₂p(X|w_i,w_i+k,w_i+1)；

For example, for keywords A, B, there are 10 corpus documents w₁,w₂,w₃,...,w₁₀Wherein the corpus document w₁Containing x₁A key word A, y₁A key word B; corpus document w₂Containing x₂A key word A, y₂A key word B; corpus document w₁₀Containing y₃And (4) a keyword B. Then the binary mutual information factors of the obtained keyword A, B are calculated as mi (a) and mi (b), and the specific calculation process is as follows:

(6) the server generates the binary mutual information of other target words in the target words to obtain the binary mutual information of each target word.

105. And obtaining the adjustment factor of each target word in the plurality of target words according to a preset algorithm.

And the server acquires the adjustment factor of each target word in the plurality of target words according to a preset algorithm. Specifically, the server determines a current service scene based on a target problem text; the server divides a plurality of target words into a saliva word and a key word based on the current service scene; the server sets the adjustment factors corresponding to the word of the saliva to be negative numbers and sets the adjustment factors corresponding to the key words to be positive numbers; the server generates an adjustment factor for each target term.

For example, the weight of the saliva word "hello" needs to be reduced, giving the saliva word "hello" a μ value less than 0 (default value of-1) at initialization; the keyword "foguarantee" needs to be weighted up, and is given a μ value greater than 0 (default value of 1) at initialization. Namely, initializing a weighted bag of words: [ "hello": -1, "blessing and insurance": 1].

It should be noted that μmay also be adjusted. For example, the keywords A extracted by the word weight calculated by the BOOST algorithm are compared with the preset keywords B, and if A is related to B words, the words are compared with the preset keywords B_aFine tuning is performed to obtain mu_a＝μ_a+ w, where w may be set to μ_a1/100 for initial value, if A is not related to B, then for μ_aFine tuning is performed to obtain mu_a＝μ_a-w。

106. And calculating to obtain the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word.

And the server calculates the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word. Specifically, the server selects one target word from the plurality of target words as a second target word, and determines an adjustment factor mu corresponding to the second target word_xAnd binary mutual information mi (x); the server calculates a weight value f (x) of the second target word according to a preset formula, wherein the preset formula is as follows: f (x) mi (x) TF IDF + μ_x(ii) a And the server calculates and obtains the weight values of other target words in the target words to obtain the weight value of each target word.

For example, setting "Fu Bao" as word A and "Yi" as word B, if calculated according to the existing TF-IDF algorithm, it can be seen that in corpus w₁Middle TF IDF_A＜TF*IDF_BThen the weight of the right is greater than that of the good insurance, but the word of the good insurance is more representative, but the weight of the good insurance is weakened due to the intensive distribution in a certain corpus and the strong dependence of the TF-IDF algorithm on the word frequency, so that the weight value of the word needs to be adjusted.

For example, for "fubao", assuming "fubao" as word a, to increase the weight of the a word, the adjustment would need to be set accordingly to be greater than 0, where μ is set_AIf 1, the weighted value is calculated as follows:

for example, for "interest", assuming "interest" as the word B, to lower the weight value of the B word, the adjustment would need to be set to less than 0 accordingly, where μ is set_AWhen the value is-1, the weight value is calculated as follows:

therefore, f (A) > f (B), i.e., f (Fu Bao) > f (right interest).

107. And determining keywords of the target question text according to the weight value of each target word, and retrieving corresponding answers according to the keywords.

And the server determines keywords of the target question text according to the weight value of each target word, and retrieves corresponding answers according to the keywords.

It is understood that a keyword corresponds to one or more answers, for example, if the keyword is "rate of return", the corresponding answer may be: "5%", "10%" or "20%", but other values are also possible. If the keyword is "scene", the corresponding answer may be: "loan," "periodic financing," or "mortgage," among other scene types.

According to the embodiment of the invention, an adjusting factor is added to adjust the weighted value of the word frequency, and mutual information factors are adopted to eliminate the influence of different document distribution conditions on the weighted value, so that the dependence degree of the TF-IDF algorithm on the word frequency is reduced, the accuracy of the weighted value is improved, the accuracy of extracting the key words is improved, and effective message response is retrieved.

Referring to fig. 2, another flowchart of the term searching method based on binary mutual information according to the embodiment of the present invention specifically includes:

201. and acquiring a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text.

202. And performing word segmentation on the target problem text to obtain a plurality of candidate words, wherein each candidate word has uniqueness.

203. And performing stop word filtering processing on the candidate words to obtain a plurality of target words.

And the server performs stop word filtering processing on the candidate words to obtain a plurality of target words. Stop words may include functional words such as Chinese words such as "that", "those", "and", etc., or English words such as "the", "a", "an", "that", and "that".

204. And calling a preset corpus to determine the word frequency TF of each target word in the plurality of target words.

The server calls a preset corpus to determine the word frequency TF of each target word in the plurality of target words. Specifically, the server acquires a preset corpus and determines a target corpus document in the preset corpus; and the server determines the occurrence frequency T of each target word in the target corpus document and generates the word frequency TF of each target word.

For example, an existing corpus w₁,w₂,w₃,...,w₁₀Corpus w₁,w₂,w₃,...,w₁₀The number of words in (1) is 10, wherein the "blessing insurance" is intensively distributed in the corpus document w₁,w₂In, w₁And w₂If the occurrence frequency of the "fubao" is 1, the frequency TF of the "fubao" is 1; the rights are distributed in w₁,w₂,w₄In the corpus documents, the number of occurrences of the "interest" in each corpus document is 10, and the word frequency TF of the "interest" is 10.

205. And calling a preset corpus to determine the inverse file frequency IDF of each target word in the plurality of target words.

The server calls a preset corpus to determine the inverse file frequency IDF of each target word in the plurality of target words. Specifically, the server acquires a preset corpus and determines the total number M of corpus documents in the preset corpus; the server determines the document quantity Wi containing a first target word in the M corpus documents, wherein i is a positive integer, and the first target word is any one of a plurality of target words; the server calls a first preset formula, the document number Wi and the total number M to generate a reverse file frequency IDF of the first target term, wherein the first preset formula is that the IDF is log₂(M/Wi + 1); the server generates the inverse file frequency IDF of each target term.

For example, an existing corpus w₁,w₂,w₃,...,w₁₀Corpus w₁,w₂,w₃,...,w₁₀The number of words in (1) is 10, wherein the "blessing insurance" is intensively distributed in the corpus document w₁,w₂In, w₁And w₂If the occurrence frequency of the middle "Fubao" is 1 time, the inverse file frequency IDF of the "Fubao" takes the value of

The rights are distributed in w₁,w₂,w₄In the corpus documents, the number of occurrences of the 'rights' in each corpus document is 10, and the inverse file frequency IDF of the 'rights' takes on the value

206. And calculating the binary mutual information of each target word in the plurality of target words according to a preset formula.

(2) Server determines candidate target words in two continuous corpus documentsCounting of occurrences

A first count is obtained.

Wherein the content of the first and second substances,

As the second count.

Wherein the content of the first and second substances,

Wherein the content of the first and second substances,

207. And obtaining the adjustment factor of each target word in the plurality of target words according to a preset algorithm.

208. And calculating to obtain the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word.

For example, for "fuguabao", assuming that "fuguabao" is the word a, to increase the weight value of the a word, the adjustment would need to be set accordingly to be greater than 0,where set μ_AIf 1, the weighted value is calculated as follows:

therefore, f (A) > f (B), i.e., f (Fu Bao) > f (right interest).

209. And determining keywords of the target question text according to the weight value of each target word, and retrieving corresponding answers according to the keywords.

In the above description of the word retrieval method based on binary mutual information in the embodiment of the present invention, the following description of the word retrieval device based on binary mutual information in the embodiment of the present invention refers to fig. 3, and an embodiment of the word retrieval device based on binary mutual information in the embodiment of the present invention includes:

a first obtaining unit 301, configured to obtain a target question text sent by a target user, where the target question text is used to instruct to obtain an answer corresponding to the target question text;

a word segmentation unit 302, configured to perform word segmentation on the target problem text to obtain multiple candidate words, where each candidate word has uniqueness;

a calling determining unit 303, configured to call a preset corpus to determine word frequencies TF and inverse document frequencies IDF of multiple target words in the multiple candidate words;

a first calculating unit 304, configured to calculate binary mutual information of each target term in the multiple target terms according to a preset formula;

a second obtaining unit 305, configured to obtain an adjustment factor of each target term in the plurality of target terms according to a preset algorithm;

the second calculating unit 306 is configured to calculate a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word;

the determining and retrieving unit 307 is configured to determine a keyword of the target question text according to the weight value of each target word, and retrieve a corresponding answer according to the keyword.

According to the embodiment of the invention, an adjusting factor is added to adjust the weighted value of the word frequency, and mutual information factors are adopted to eliminate the influence of different document distribution conditions on the weighted value, so that the dependence degree on the word frequency is reduced, the accuracy of the weighted value is improved, the accuracy of extracting the keywords is improved, and further effective message response is retrieved.

Referring to fig. 4, another embodiment of the term searching apparatus based on binary mutual information according to the embodiment of the present invention includes:

Optionally, the call determining unit 303 includes:

a filtering module 3031, configured to perform stop word filtering processing on the multiple candidate words to obtain multiple target words;

a first determining module 3032, configured to invoke a preset corpus to determine a word frequency TF of each target word in a plurality of target words;

a second determining module 3033, configured to invoke a preset corpus to determine a reverse document frequency IDF of each target term in the plurality of target terms.

Optionally, the first determining module 3032 is specifically configured to:

acquiring a preset corpus and determining a target corpus document in the preset corpus; and determining the occurrence frequency T of each target word in the target corpus document, and generating the word frequency TF of each target word.

Optionally, the second determining module 3033 is specifically configured to:

acquiring a preset corpus and determining the total number M of corpus documents in the preset corpus; determining the document quantity Wi including a first target term in M corpus documents, wherein i is a positive integer, and the first target term is any one of the target terms; calling a first preset formula, the document number Wi and the total number M to generate a reverse file frequency IDF of the first target term, wherein the first preset formula is that the IDF is log₂(M/Wi + 1); and generating the inverse file frequency IDF of each target word.

Optionally, the first calculating unit 304 is specifically configured to:

selecting any one target word from the plurality of target words as a candidate target word; determining a count of occurrences of the candidate target term in two consecutive corpus documents

As a second count; calculating the ratio of the first count and the second count to obtain a first ratio p (X | w)_i,w_i+k,w_i+1) (ii) a According to the first ratio p (X | w)_i,w_i+k,w_i+1) Determining binary mutual information mi (x, w) of the candidate target words_i,w_i+k,w_i+1)＝log₂p(X|w_i,w_i+k,w_i+1)；And generating the binary mutual information of other target words in the target words to obtain the binary mutual information of each target word.

Optionally, the second obtaining unit 305 is specifically configured to:

determining a current business scene based on the target question text; dividing the plurality of target words into saliva words and key words based on the current business scenario; setting the adjustment factor corresponding to the saliva word as a negative number, and setting the adjustment factor corresponding to the key word as a positive number; an adjustment factor for each target word is generated.

Optionally, the second calculating unit 306 is specifically configured to:

According to the embodiment of the invention, a target question text sent by a target user is obtained, and the target question text is used for indicating to obtain an answer corresponding to the target question text; segmenting a target problem text to obtain a plurality of candidate words, wherein each candidate word has uniqueness; calling a preset corpus to determine word frequency TF and inverse file frequency IDF of a plurality of target words in a plurality of candidate words; calculating the binary mutual information of each target word in the plurality of target words according to a preset formula; obtaining an adjusting factor of each target word in a plurality of target words according to a preset algorithm; calculating to obtain a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word; and determining keywords of the target question text according to the weight value of each target word, and retrieving corresponding answers according to the keywords. According to the embodiment of the invention, an adjusting factor is added to adjust the weighted value of the word frequency, and mutual information factors are adopted to eliminate the influence of different document distribution conditions on the weighted value, so that the dependence degree on the word frequency is reduced, the accuracy of the weighted value is improved, the accuracy of extracting the keywords is improved, and further effective message response is retrieved.

Fig. 3 to 4 describe the term retrieval device based on binary mutual information in the embodiment of the present invention in detail from the perspective of a modular functional entity, and the term retrieval device based on binary mutual information in the embodiment of the present invention in detail from the perspective of hardware processing.

Fig. 5 is a schematic structural diagram of a term searching apparatus based on binary mutual information according to an embodiment of the present invention, where the term searching apparatus 500 based on binary mutual information may generate relatively large differences due to different configurations or performances, and may include one or more processors (CPUs) 501 (e.g., one or more processors) and a memory 509, and one or more storage media 508 (e.g., one or more mass storage devices) storing applications 507 or data 506. Memory 509 and storage medium 508 may be, among other things, transient storage or persistent storage. The program stored on the storage medium 508 may include one or more modules (not shown), each of which may include a series of instruction operations for a term retrieval device based on binary mutual information. Still further, the processor 501 may be configured to communicate with the storage medium 508 to execute a series of instruction operations in the storage medium 508 on the binary mutual information based word retrieval device 500.

The binary mutual information based word retrieval device 500 may further include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input-output interfaces 504, and/or one or more operating systems 505, such as Windows Server, Mac OS X, Unix, L inux, FreeBSD, etc. it will be understood by those skilled in the art that the binary mutual information based word retrieval device architecture shown in FIG. 5 does not constitute a limitation of the binary mutual information based word retrieval device, may include more or less components than those shown, or may combine certain components, or a different arrangement of components.

The following specifically describes each component of the term search device based on binary mutual information with reference to fig. 5:

the processor 501 is a control center of the word retrieval device based on binary mutual information, and can perform processing according to a set word retrieval method based on binary mutual information. The processor 501 connects the various parts of the whole dual mutual information-based word retrieval device by using various interfaces and lines, and executes various functions and processing data of the dual mutual information-based word retrieval device by running or executing software programs and/or modules stored in the memory 509 and calling data stored in the memory 509, thereby reducing the dependence degree of the TF-IDF algorithm on word frequency, improving the accuracy of weight values, increasing the accuracy of extracting keywords, and further retrieving effective message responses. The storage medium 508 and the memory 509 are carriers for storing data, in the embodiment of the present invention, the storage medium 508 may be an internal memory with a small storage capacity but a high speed, and the memory 509 may be an external memory with a large storage capacity but a low storage speed.

The memory 509 may be used to store software programs and modules, and the processor 501 executes various functional applications and data processing of the word retrieval device 500 based on binary mutual information by operating the software programs and modules stored in the memory 509. The memory 509 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (for example, performing word segmentation on a target question text to obtain a plurality of candidate words, each candidate word having uniqueness), and the like; the storage data area may store data created from use of the word search apparatus based on binary mutual information (such as a weight value of each target word, etc.), and the like. Further, the memory 509 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. The word retrieval method program based on binary mutual information provided in the embodiment of the present invention and the received data stream are stored in a memory, and when they are needed to be used, the processor 501 calls from the memory 509.

When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, twisted pair) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that a computer can store or a data storage device, such as a server, a data center, etc., that is integrated with one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., compact disk), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A word retrieval method based on binary mutual information is characterized by comprising the following steps:

acquiring a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text;

segmenting the target problem text to obtain a plurality of candidate words, wherein each candidate word has uniqueness;

calling a preset corpus to determine word frequency TF and inverse file frequency IDF of a plurality of target words in the candidate words;

calculating the binary mutual information of each target word in the plurality of target words according to a preset formula;

obtaining an adjustment factor of each target word in the plurality of target words according to a preset algorithm;

calculating to obtain a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word;

and determining keywords of the target question text according to the weight value of each target word, and retrieving corresponding answers according to the keywords.

2. The method for retrieving words based on binary mutual information as claimed in claim 1, wherein said invoking a preset corpus to determine the word frequency TF and the inverse document frequency IDF of a plurality of target words in said plurality of candidate words comprises:

performing stop word filtering processing on the candidate words to obtain a plurality of target words;

calling a preset corpus to determine the word frequency TF of each target word in a plurality of target words;

and calling a preset corpus to determine the inverse file frequency IDF of each target word in the plurality of target words.

3. The method for retrieving words based on binary mutual information as claimed in claim 2, wherein said invoking a preset corpus to determine a word frequency TF of each target word of a plurality of target words comprises:

acquiring a preset corpus and determining a target corpus document in the preset corpus;

and determining the occurrence frequency T of each target word in the target corpus document, and generating the word frequency TF of each target word.

4. The method for retrieving words based on binary mutual information as claimed in claim 2, wherein said invoking a preset corpus to determine a reverse document frequency IDF of each target word of a plurality of target words comprises:

acquiring a preset corpus and determining the total number M of corpus documents in the preset corpus;

determining the document quantity Wi including a first target term in M corpus documents, wherein i is a positive integer, and the first target term is any one of the target terms;

calling a first preset formula, the document number Wi and the total number M to generate a reverse file frequency IDF of the first target term, wherein the first preset formula is that the IDF is log₂(M/Wi+1)；

And generating the inverse file frequency IDF of each target word.

5. The method for word retrieval based on binary mutual information according to claim 1, wherein said calculating binary mutual information of each target word in said plurality of target words according to a preset formula comprises:

selecting any one target word from the plurality of target words as a candidate target word;

determining a count of occurrences of the candidate target term in two consecutive corpus documents

Obtaining a first count;

determining a count of occurrences of the candidate target term in two sequential corpus documents

As a second count;

calculating the ratio of the first count and the second count to obtain a first ratio p (X | w)_i,w_i+k,w_i+1)；

According to the first ratio p (X | w)_i,w_i+k,w_i+1) Determining binary mutual information mi (x, w) of the candidate target words_i,w_i+k,w_i+1)＝log₂p(X|w_i,w_i+k,w_i+1)；

And generating the binary mutual information of other target words in the target words to obtain the binary mutual information of each target word.

6. The method for retrieving words based on binary mutual information as claimed in claim 1, wherein said obtaining the adjustment factor of each target word in the plurality of target words according to a preset algorithm comprises:

determining a current business scene based on the target question text;

dividing the plurality of target words into saliva words and key words based on the current business scenario;

setting the adjustment factor corresponding to the saliva word as a negative number, and setting the adjustment factor corresponding to the key word as a positive number;

an adjustment factor for each target word is generated.

7. The method for word retrieval based on binary mutual information according to any one of claims 1-6, wherein the calculating a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word comprises:

at a plurality ofSelecting one target word from the target words as a second target word, and determining an adjustment factor mu corresponding to the second target word_xAnd binary mutual information mi (x);

calculating the weight value f (x) of the second target word according to a preset formula, wherein the preset formula is as follows: f (x) mi (x) TF IDF + μ_x；

And calculating to obtain the weight values of other target words in the target words, and obtaining the weight value of each target word.

8. A word retrieval device based on binary mutual information is characterized by comprising:

the first obtaining unit is used for obtaining a target question text sent by a target user, and the target question text is used for indicating to obtain an answer corresponding to the target question text;

the word segmentation unit is used for segmenting the target problem text to obtain a plurality of candidate words, and each candidate word has uniqueness;

the calling and determining unit is used for calling a preset corpus to determine the word frequency TF and the inverse file frequency IDF of a plurality of target words in the candidate words;

the first calculation unit is used for calculating the binary mutual information of each target word in the plurality of target words according to a preset formula;

the second acquisition unit is used for acquiring the adjustment factor of each target word in the plurality of target words according to a preset algorithm;

the second calculation unit is used for calculating the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word;

and the determining and retrieving unit is used for determining the keywords of the target question text according to the weight value of each target word and retrieving the corresponding answers according to the keywords.

9. A word retrieval device based on binary mutual information, comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements the word retrieval method based on binary mutual information according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the binary mutual information based word retrieval method according to any one of claims 1 to 7.