CN111401039A - Word retrieval method, device, equipment and storage medium based on binary mutual information - Google Patents


Info

Publication number
CN111401039A
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010146242.7A
Other languages
Chinese (zh)
Inventor
Liang Zhicheng (梁志成)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202010146242.7A
Publication of CN111401039A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Abstract

The invention relates to the field of artificial intelligence, and discloses a word retrieval method, device, equipment and storage medium based on binary mutual information, used for increasing the accuracy of keyword extraction and thereby retrieving effective message responses. The method comprises the following steps: acquiring a target question text sent by a target user; segmenting the target question text to obtain a plurality of candidate words, wherein each candidate word is unique; calling a preset corpus to determine the word frequency TF and inverse document frequency IDF of a plurality of target words among the candidate words; calculating the binary mutual information of each target word according to a preset formula; obtaining an adjustment factor of each target word according to a preset algorithm; calculating a weight value of each target word according to its adjustment factor and binary mutual information; and determining keywords of the target question text according to the weight value of each target word, and retrieving corresponding answers according to the keywords.

Description

Word retrieval method, device, equipment and storage medium based on binary mutual information
Technical Field
The invention relates to the technical field of keyword matching, and in particular to a word retrieval method, device, equipment and storage medium based on binary mutual information.
Background
With the development of information technology, people can search for and acquire needed information from the network ever more conveniently. However, quickly acquiring the required information from massive network information is the hard part, and information retrieval technologies have emerged to address it; one of their key supporting techniques is keyword extraction. At present the most widely used keyword extraction technique is the term frequency-inverse document frequency (TF-IDF) algorithm, whose basic principle is to rank words by a weight derived from their occurrence counts and term frequencies, selecting the top-ranked words as keywords.
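As a reference point, the classical TF-IDF ranking described above can be sketched in a few lines of Python. This is a minimal illustration only; the function name, the length-normalised TF, and the +1 smoothing in the IDF denominator are choices of this sketch, not details fixed by the patent:

```python
import math
from collections import Counter

def tf_idf_rank(doc_tokens, corpus):
    """Rank the tokens of one document by a plain TF-IDF score.

    doc_tokens: list of tokens from the target document.
    corpus: list of token lists, one per corpus document.
    Returns the tokens sorted by descending TF-IDF weight.
    """
    tf = Counter(doc_tokens)                       # raw term counts
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # document frequency: how many corpus documents contain the term
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log2(n_docs / (df + 1))         # +1 avoids division by zero
        scores[term] = (count / len(doc_tokens)) * idf
    return sorted(scores, key=scores.get, reverse=True)
```

A term that is frequent in the target document but rare in the corpus floats to the top; a term present in most corpus documents gets an IDF near (or below) zero and sinks.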
In the existing scheme, because the TF-IDF weight calculation depends excessively on word frequency and ignores the influence of how words are distributed across different documents, existing products extract information inaccurately and cannot retrieve effective message responses.
Disclosure of Invention
The invention provides a word retrieval method, device, equipment and storage medium based on binary mutual information. An adjustment factor is added to adjust the word-frequency weight value, and a mutual information factor is used to eliminate the influence of different document distributions on the weight value, which reduces the TF-IDF algorithm's dependence on word frequency, improves the accuracy of the weight value, increases the accuracy of keyword extraction, and thereby retrieves effective message responses.
The first aspect of the embodiments of the present invention provides a word retrieval method based on binary mutual information, including: acquiring a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text; segmenting the target question text to obtain a plurality of candidate words, wherein each candidate word is unique; calling a preset corpus to determine the word frequency TF and inverse document frequency IDF of a plurality of target words among the candidate words; calculating the binary mutual information of each target word according to a preset formula; obtaining an adjustment factor of each target word according to a preset algorithm; calculating a weight value of each target word according to its adjustment factor and binary mutual information; and determining keywords of the target question text according to the weight value of each target word, and retrieving corresponding answers according to the keywords.
Optionally, in a first implementation manner of the first aspect of the embodiment of the present invention, the invoking a preset corpus to determine the word frequency TF and inverse document frequency IDF of a plurality of target words among the plurality of candidate words includes: performing stop-word filtering on the candidate words to obtain a plurality of target words; calling a preset corpus to determine the word frequency TF of each target word; and calling a preset corpus to determine the inverse document frequency IDF of each target word.
Optionally, in a second implementation manner of the first aspect of the embodiment of the present invention, the invoking a preset corpus to determine the word frequency TF of each target word includes: acquiring a preset corpus and determining a target corpus document in the preset corpus; and determining the occurrence count T of each target word in the target corpus document to generate the word frequency TF of each target word.
Optionally, in a third implementation manner of the first aspect of the embodiment of the present invention, the invoking a preset corpus to determine the inverse document frequency IDF of each target word includes: acquiring a preset corpus and determining the total number M of corpus documents in it; determining the number of documents W_i containing a first target word among the M corpus documents, where i is a positive integer and the first target word is any one of the target words; invoking a first preset formula with the document number W_i and the total number M to generate the inverse document frequency of the first target word, the first preset formula being IDF = log2(M / (W_i + 1)); and generating the inverse document frequency IDF of each target word.
Optionally, in a fourth implementation manner of the first aspect of the embodiment of the present invention, the calculating the binary mutual information of each target word in the plurality of target words according to a preset formula includes: selecting any one target word from the plurality of target words as a candidate target word; determining the count of occurrences of the candidate target word in two consecutive corpus documents, c(X | w_i, w_{i+1}), to obtain a first count; determining the count of occurrences of the candidate target word in two sequential corpus documents, c(X | w_i, w_{i+k}), as a second count; calculating the ratio of the first count to the second count to obtain a first ratio p(X | w_i, w_{i+k}, w_{i+1}); determining, according to the first ratio, the binary mutual information of the candidate target word as mi(X, w_i, w_{i+k}, w_{i+1}) = log2 p(X | w_i, w_{i+k}, w_{i+1}); and generating the binary mutual information of the other target words to obtain the binary mutual information of each target word.
Optionally, in a fifth implementation manner of the first aspect of the embodiment of the present invention, the obtaining, according to a preset algorithm, the adjustment factor of each target word includes: determining the current business scenario based on the target question text; dividing the plurality of target words into filler words and key words based on the current business scenario; setting the adjustment factor of each filler word to a negative number and the adjustment factor of each key word to a positive number; and generating the adjustment factor of each target word.
Optionally, in a sixth implementation manner of the first aspect of the embodiment of the present invention, the calculating a weight value of each target term according to the adjustment factor of each target term and the binary mutual information of each target term includes:
selecting one target word from the plurality of target words as a second target word, and determining the adjustment factor μ_x and the binary mutual information mi(x) corresponding to the second target word; calculating the weight value f(x) of the second target word according to a preset formula, the preset formula being f(x) = mi(x) * TF * IDF + μ_x; and calculating the weight values of the other target words to obtain the weight value of each target word.
A second aspect of the embodiments of the present invention provides a word retrieval apparatus based on binary mutual information, including: a first obtaining unit, used for obtaining a target question text sent by a target user, the target question text being used for indicating to obtain the corresponding answer; a word segmentation unit, used for segmenting the target question text to obtain a plurality of candidate words, each candidate word being unique; a calling and determining unit, used for calling a preset corpus to determine the word frequency TF and inverse document frequency IDF of a plurality of target words among the candidate words; a first calculation unit, used for calculating the binary mutual information of each target word according to a preset formula; a second obtaining unit, used for obtaining the adjustment factor of each target word according to a preset algorithm; a second calculation unit, used for calculating the weight value of each target word according to its adjustment factor and binary mutual information; and a determining and retrieving unit, used for determining the keywords of the target question text according to the weight value of each target word and retrieving the corresponding answers according to the keywords.
Optionally, in a first implementation manner of the second aspect of the embodiment of the present invention, the calling and determining unit includes: a filtering module, used for performing stop-word filtering on the candidate words to obtain a plurality of target words; a first determining module, used for calling a preset corpus to determine the word frequency TF of each target word; and a second determining module, used for calling a preset corpus to determine the inverse document frequency IDF of each target word.
Optionally, in a second implementation manner of the second aspect of the embodiment of the present invention, the first determining module is specifically configured to: acquire a preset corpus and determine a target corpus document in the preset corpus; and determine the occurrence count T of each target word in the target corpus document to generate the word frequency TF of each target word.
Optionally, in a third implementation manner of the second aspect of the embodiment of the present invention, the second determining module is specifically configured to: acquire a preset corpus and determine the total number M of corpus documents in it; determine the number of documents W_i containing a first target word among the M corpus documents, where i is a positive integer and the first target word is any one of the target words; call a first preset formula with the document number W_i and the total number M to generate the inverse document frequency of the first target word, the first preset formula being IDF = log2(M / (W_i + 1)); and generate the inverse document frequency IDF of each target word.
Optionally, in a fourth implementation manner of the second aspect of the embodiment of the present invention, the first calculation unit is specifically configured to: select any one target word from the plurality of target words as a candidate target word; determine the count of occurrences of the candidate target word in two consecutive corpus documents, c(X | w_i, w_{i+1}), to obtain a first count; determine the count of occurrences of the candidate target word in two sequential corpus documents, c(X | w_i, w_{i+k}), as a second count; calculate the ratio of the first count to the second count to obtain a first ratio p(X | w_i, w_{i+k}, w_{i+1}); determine, according to the first ratio, the binary mutual information of the candidate target word as mi(X, w_i, w_{i+k}, w_{i+1}) = log2 p(X | w_i, w_{i+k}, w_{i+1}); and generate the binary mutual information of the other target words to obtain the binary mutual information of each target word.
Optionally, in a fifth implementation manner of the second aspect of the embodiment of the present invention, the second obtaining unit is specifically configured to: determine the current business scenario based on the target question text; divide the plurality of target words into filler words and key words based on the current business scenario; set the adjustment factor of each filler word to a negative number and the adjustment factor of each key word to a positive number; and generate the adjustment factor of each target word.
Optionally, in a sixth implementation manner of the second aspect of the embodiment of the present invention, the second calculating unit is specifically configured to:
selecting one target word from the plurality of target words as a second target word, and determining the adjustment factor μ_x and the binary mutual information mi(x) corresponding to the second target word; calculating the weight value f(x) of the second target word according to a preset formula, the preset formula being f(x) = mi(x) * TF * IDF + μ_x; and calculating the weight values of the other target words to obtain the weight value of each target word.
A third aspect of the embodiments of the present invention provides a word retrieval device based on binary mutual information, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the word retrieval method based on binary mutual information according to any one of the above embodiments when executing the computer program.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the method for word retrieval based on binary mutual information according to any one of the above embodiments.
According to the technical scheme provided by the embodiment of the invention, a target question text sent by a target user is obtained, the target question text being used for indicating to obtain the corresponding answer; the target question text is segmented to obtain a plurality of candidate words, each candidate word being unique; a preset corpus is called to determine the word frequency TF and inverse document frequency IDF of a plurality of target words among the candidate words; the binary mutual information of each target word is calculated according to a preset formula; an adjustment factor of each target word is obtained according to a preset algorithm; a weight value of each target word is calculated from its adjustment factor and binary mutual information; and keywords of the target question text are determined according to the weight values, with the corresponding answers retrieved according to the keywords. By adding an adjustment factor to adjust the word-frequency weight value and using a mutual information factor to eliminate the influence of different document distributions on the weight value, the embodiment of the invention reduces the TF-IDF algorithm's dependence on word frequency, improves the accuracy of the weight value, increases the accuracy of keyword extraction, and retrieves effective message responses.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a term retrieval method based on binary mutual information according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of another embodiment of a term retrieval method based on binary mutual information according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a word retrieval apparatus based on binary mutual information according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of a word retrieval device based on binary mutual information according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of a word retrieval device based on binary mutual information in the embodiment of the present invention.
Detailed Description
The invention provides a word retrieval method, device, equipment and storage medium based on binary mutual information. An adjustment factor is added to adjust the word-frequency weight value, and a mutual information factor is used to eliminate the influence of different document distributions on the weight value, which reduces the TF-IDF algorithm's dependence on word frequency, improves the accuracy of the weight value, increases the accuracy of keyword extraction, and thereby retrieves effective message responses.
To help those skilled in the art better understand the solution of the invention, the embodiments of the invention are described below in conjunction with the accompanying drawings.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, a flowchart of a term retrieval method based on binary mutual information according to an embodiment of the present invention specifically includes:
101. And acquiring a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text.
The server acquires a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text.
For example, in a customer-service chat robot, the target user asks the question "What rights and interests does Fubao insurance offer?". The server needs to retrieve the rights and interests associated with "Fubao insurance" and use the retrieved rights and interests as the answer to the question.
It is understood that the execution subject of the present invention may be a word retrieval device based on binary mutual information, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
102. And performing word segmentation on the target question text to obtain a plurality of candidate words, wherein each candidate word is unique.
The server performs word segmentation on the target question text to obtain a plurality of candidate words, each of which is unique. In this embodiment, the server segments the target question text using a preset word segmentation algorithm. Preset word segmentation algorithms include string-matching-based, understanding-based, and statistics-based word segmentation algorithms.
It should be noted that: (1) A string-matching word segmentation algorithm matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). By scanning direction, string-matching segmentation divides into forward matching and reverse matching; by length preference, into maximum (longest) matching and minimum (shortest) matching; and by whether it is combined with part-of-speech tagging, into simple segmentation methods and integrated segmentation-and-tagging methods. (2) An understanding-based word segmentation algorithm recognizes words by having the computer simulate a human's understanding of the sentence. The basic idea is to analyze syntax and semantics while segmenting, using syntactic and semantic information to resolve ambiguity. It generally comprises three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the word segmentation subsystem obtains syntactic and semantic information about the relevant words and sentences to judge segmentation ambiguity; that is, it simulates the human process of understanding a sentence. (3) For a statistics-based word segmentation algorithm, a word is a stable combination of characters, so the more often adjacent characters co-occur in context, the more likely they are to constitute a word.
The frequency or probability of characters co-occurring with their neighbors therefore reflects the credibility of a word. The frequencies of combinations of adjacent co-occurring characters in the corpus can be counted to compute their co-occurrence information, which captures how tightly Chinese characters combine with each other. When this tightness exceeds a certain threshold, the character group is judged to possibly constitute a word. Other word segmentation methods can also be adopted, and the details are not limited here.
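The string-matching family described in (1) can be illustrated with a minimal forward maximum matching sketch. The dictionary, the `max_len` window, and the single-character fallback are assumptions of this example, not details fixed by the text; reverse matching would scan from the end of the string instead:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary entry that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        # try the longest candidate first, shrinking down to one character
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words
```

The understanding-based and statistics-based families are not sketched here; in practice a statistical segmenter would replace the dictionary test with a co-occurrence score threshold, as the text describes.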
For example, in the customer-service chat robot, the target user asks "What rights and interests does Fubao insurance offer?". After word segmentation, words such as "Fubao", "insurance", "rights and interests", "which" and "have" are obtained.
103. And calling a preset corpus to determine the word frequency TF and the inverse document frequency IDF of a plurality of target words among the candidate words.
The server calls a preset corpus to determine the word frequency TF and inverse document frequency IDF of a plurality of target words among the candidate words. Specifically, the server performs stop-word filtering on the candidate words to obtain a plurality of target words; the server calls a preset corpus to determine the word frequency TF of each target word; and the server calls a preset corpus to determine the inverse document frequency IDF of each target word.
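The TF and IDF computations of this step might look as follows in Python. Two renderings here are choices of this sketch rather than the patent's: the TF is length-normalised (the text only fixes the occurrence count T), and the IDF formula is read as the smoothed log2(M / (W_i + 1)):

```python
import math

def word_frequency(term, doc_tokens):
    """TF from the occurrence count T of the term in the target corpus
    document; dividing by document length is a choice of this sketch."""
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, corpus):
    """IDF = log2(M / (Wi + 1)): M corpus documents in total, Wi of them
    containing the term; the +1 guards against terms in no document."""
    m = len(corpus)
    wi = sum(1 for doc in corpus if term in doc)
    return math.log2(m / (wi + 1))
```

With the +1 smoothing, a term present in every document gets a slightly negative IDF rather than zero, which is one common variant of the formula.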
104. And calculating the binary mutual information of each target word in the plurality of target words according to a preset formula.
And the server calculates the binary mutual information of each target word in the plurality of target words according to a preset formula. The method specifically comprises the following steps:
(1) the server selects any one target word from the plurality of target words as a candidate target word.
(2) The server determines the count of occurrences of the candidate target word in two consecutive corpus documents, denoted c(X | w_i, w_{i+1}), and obtains a first count.
Here c(X | w_i, w_{i+1}) is the count of occurrences of the target word X in two consecutive documents w_i, w_{i+1}. For example, if documents w_1, w_2, w_3 all contain the word A, the first count of A is 2, from the consecutive pairs (w_1, w_2) and (w_2, w_3).
(3) The server determines the count of occurrences of the candidate target word in two sequential corpus documents, denoted c(X | w_i, w_{i+k}), as the second count.
Here c(X | w_i, w_{i+k}) is the count of occurrences of the target word X in two sequential documents w_i, w_{i+k}. For example, if documents w_1, w_2, w_3 all contain the word A, the second count of A is 3, from the sequential pairs (w_1, w_2), (w_1, w_3) and (w_2, w_3).
(4) The server calculates the ratio of the first count to the second count to obtain a first ratio p(X | w_i, w_{i+k}, w_{i+1}).
Here p(X | w_i, w_{i+k}, w_{i+1}) is the ratio of the count of occurrences of a word X in pairs of consecutively adjacent documents to the count of its occurrences in pairs of sequential documents. For example, w_1, w_2 is a consecutive pair while w_1, w_3 is not; w_1, w_2 and w_1, w_3 are sequential pairs while w_2, w_1 is not.
(5) The server determines, according to the first ratio p(X | w_i, w_{i+k}, w_{i+1}), the binary mutual information of the candidate target word: mi(X, w_i, w_{i+k}, w_{i+1}) = log2 p(X | w_i, w_{i+k}, w_{i+1}).
For example, for keywords A and B there are 10 corpus documents w_1, w_2, w_3, ..., w_10, where corpus document w_1 contains x_1 occurrences of keyword A and y_1 occurrences of keyword B, corpus document w_2 contains x_2 occurrences of A and y_2 occurrences of B, and corpus document w_10 contains y_3 occurrences of B. The binary mutual information factors mi(A) and mi(B) of keywords A and B are then obtained by applying mi(X) = log2 p(X | w_i, w_{i+k}, w_{i+1}) to these counts.
(6) the server generates the binary mutual information of other target words in the target words to obtain the binary mutual information of each target word.
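Steps (1) through (6) can be sketched as follows, treating "consecutive" as adjacent document pairs (w_i, w_{i+1}) and "sequential" as any ordered pair (w_i, w_{i+k}), as in the examples above. The zero-count fallback is an assumption of this sketch, since the text does not say what happens when a word never co-occurs:

```python
import math

def binary_mutual_information(term, corpus):
    """mi(X) = log2(first count / second count): the first count is how
    often the term occurs in both documents of a consecutive pair
    (w_i, w_{i+1}); the second, in both documents of any ordered pair
    (w_i, w_{i+k}), k >= 1."""
    present = [term in doc for doc in corpus]
    # first count: the term appears in both documents of an adjacent pair
    consecutive = sum(1 for i in range(len(corpus) - 1)
                      if present[i] and present[i + 1])
    # second count: the term appears in both documents of any ordered pair
    sequential = sum(1 for i in range(len(corpus))
                     for j in range(i + 1, len(corpus))
                     if present[i] and present[j])
    if consecutive == 0 or sequential == 0:
        return 0.0  # no co-occurrence evidence; neutral fallback (an assumption)
    return math.log2(consecutive / sequential)
```

For the worked example in the text (A present in w_1, w_2, w_3), this yields log2(2/3): the consecutive count 2 over the sequential count 3.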
105. And obtaining the adjustment factor of each target word in the plurality of target words according to a preset algorithm.
The server obtains the adjustment factor of each target word according to a preset algorithm. Specifically, the server determines the current business scenario based on the target question text; the server divides the plurality of target words into filler words and key words based on the current business scenario; the server sets the adjustment factor of each filler word to a negative number and the adjustment factor of each key word to a positive number; and the server generates the adjustment factor of each target word.
For example, the weight of the filler word "hello" needs to be reduced, so "hello" is given a μ value less than 0 (default value -1) at initialization; the keyword "Fubao" needs a higher weight, so it is given a μ value greater than 0 (default value 1) at initialization. That is, the weighted bag of words is initialized as ["hello": -1, "Fubao": 1].
It should be noted that μ may also be adjusted afterwards. For example, a keyword A extracted using the word weights calculated by the BOOST algorithm is compared with a preset keyword B. If A is related to B, μ_A is fine-tuned to μ_A = μ_A + w, where w may be set to 1/100 of the initial value of μ_A; if A is not related to B, μ_A is fine-tuned to μ_A = μ_A - w.
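The initialisation of the weighted bag of words and the fine-tuning rule just described might be sketched as follows. The function names and the fallback step size are hypothetical; only the default values of -1 and +1 and the 1/100 step come from the text:

```python
def init_adjustment_factors(filler_words, key_words):
    """Weighted bag of words: filler words get a negative mu (default -1),
    business keywords a positive mu (default +1)."""
    mu = {w: -1.0 for w in filler_words}
    mu.update({w: 1.0 for w in key_words})
    return mu

def fine_tune(mu, word, related, step=None):
    """Nudge mu after comparing an extracted keyword with the preset list:
    +step if the words are related, -step otherwise. The default step of
    1/100 of |mu| follows the w = mu/100 suggestion in the text."""
    w = step if step is not None else abs(mu[word]) / 100
    mu[word] += w if related else -w
    return mu[word]
```

Repeated fine-tuning nudges each word's μ toward the behaviour observed against the preset keyword list without overriding the initial sign.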
106. And calculating to obtain the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word.
The server calculates the weight value of each target word according to its adjustment factor and its binary mutual information. Specifically, the server selects one target word from the plurality of target words as a second target word, and determines the corresponding adjustment factor μ_x and binary mutual information mi(x); the server calculates the weight value f(x) of the second target word according to a preset formula, the preset formula being f(x) = mi(x) * TF * IDF + μ_x; and the server calculates the weight values of the other target words to obtain the weight value of each target word.
For example, let "Fubao" be word A and "rights" be word B. If calculated according to the existing TF-IDF algorithm, in corpus document w1 we have TF×IDF_A < TF×IDF_B, so the weight of "rights" is greater than that of "Fubao". However, "Fubao" is the more representative word; its weight is weakened because it is densely distributed in a few corpus documents and the TF-IDF algorithm depends strongly on word frequency. The weight values therefore need to be adjusted.
For example, for "Fubao", assume "Fubao" is word A. To increase the weight value of word A, the adjustment factor needs to be set greater than 0; here μ_A = 1, and the weight value is calculated as:

f(A) = mi(A) × TF_A × IDF_A + μ_A = 0 × 1 × log2(6) + 1 = 1
For example, for "rights", assume "rights" is word B. To lower the weight value of word B, the adjustment factor needs to be set less than 0; here μ_B = −1, and the weight value is calculated as:

f(B) = mi(B) × TF_B × IDF_B + μ_B = log2(1/3) × 10 × log2(13/3) − 1 ≈ −34.5
Therefore, f(A) > f(B), i.e., f("Fubao") > f("rights").
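The comparison above can be checked numerically with a short sketch. The function name is illustrative; the TF and IDF values come from the 10-document example in this specification, and the mi values are derived by applying the binary mutual information definition to that same example data.

```python
import math

def weight(mi: float, tf: float, idf: float, mu: float) -> float:
    """Preset formula from the text: f(x) = mi(x) * TF * IDF + mu_x."""
    return mi * tf * idf + mu

# Values from the running example (M = 10 corpus documents):
idf_a = math.log2(10 / 2 + 1)    # "Fubao" occurs in 2 documents -> log2(6)
idf_b = math.log2(10 / 3 + 1)    # "rights" occurs in 3 documents -> log2(13/3)
f_a = weight(mi=0.0, tf=1, idf=idf_a, mu=1)                  # mi(A) = log2(1/1) = 0
f_b = weight(mi=math.log2(1 / 3), tf=10, idf=idf_b, mu=-1)   # mi(B) = log2(1/3)
```

With these inputs f_a is positive and f_b is strongly negative, so f(A) > f(B) as stated.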
107. And determining keywords of the target question text according to the weight value of each target word, and retrieving corresponding answers according to the keywords.
And the server determines keywords of the target question text according to the weight value of each target word, and retrieves corresponding answers according to the keywords.
It is understood that a keyword corresponds to one or more answers. For example, if the keyword is "rate of return", the corresponding answer may be "5%", "10%", "20%", or another value. If the keyword is "scene", the corresponding answer may be "loan", "periodic financing", "mortgage", or another scene type.
In the embodiment of the present invention, an adjustment factor is added to adjust the word-frequency weight value, and a mutual information factor is used to eliminate the influence of different document distributions on the weight value. This reduces the dependence of the TF-IDF algorithm on word frequency, improves the accuracy of the weight values and of keyword extraction, and thus retrieves an effective message response.
Referring to fig. 2, another flowchart of the term searching method based on binary mutual information according to the embodiment of the present invention specifically includes:
201. and acquiring a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text.
The server acquires a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text.
For example, in a chat customer service robot, a target user asks the question "What rights does Fubao have?", and the server needs to retrieve the rights associated with "Fubao" and use the retrieved rights as the answer to the question.
It is understood that the execution subject of the present invention may be a word retrieval device based on binary mutual information, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
202. And performing word segmentation on the target problem text to obtain a plurality of candidate words, wherein each candidate word has uniqueness.
The server performs word segmentation on the target problem text to obtain a plurality of candidate words, and each candidate word has uniqueness. In this embodiment, the server performs word segmentation on the target problem text by using a preset word segmentation algorithm. The preset word segmentation algorithm comprises a word segmentation algorithm based on character string matching, a word segmentation algorithm based on understanding and a word segmentation algorithm based on statistics.
It should be noted that:

(1) The string-matching word segmentation algorithm matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). By scanning direction, string-matching segmentation divides into forward matching and reverse matching; by length priority, into maximum (longest) matching and minimum (shortest) matching; and by whether it is combined with part-of-speech tagging, into simple segmentation methods and integrated methods combining segmentation and tagging.

(2) The understanding-based word segmentation algorithm recognizes words by having the computer simulate a human's understanding of the sentence. The basic idea is to analyze syntax and semantics while segmenting, using syntactic and semantic information to resolve ambiguity. It generally comprises three parts: a word segmentation subsystem, a syntax-semantics subsystem, and a master control part. Under the coordination of the master control part, the word segmentation subsystem obtains syntactic and semantic information about the relevant words and sentences to judge segmentation ambiguity; that is, it simulates the human process of understanding a sentence.

(3) The statistics-based word segmentation algorithm relies on the fact that a word is a stable combination of characters: the more often adjacent characters co-occur in context, the more likely they are to form a word.
Therefore, the frequency or probability of co-occurrence of adjacent characters reflects the credibility of a word. The frequency of adjacent co-occurring character combinations in the corpus can be counted to compute their co-occurrence information, which reflects how tightly the Chinese characters are bound together. When this tightness exceeds a certain threshold, the character group is considered to possibly constitute a word. It is understood that other word segmentation methods may also be used, which is not limited here.
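The adjacent-character co-occurrence counting behind the statistics-based approach can be sketched as follows. This is a toy illustration of the idea, not the patent's implementation; the function name and threshold are assumptions.

```python
from collections import Counter

def adjacent_pair_counts(text: str) -> Counter:
    """Count how often each adjacent character pair co-occurs in the text;
    pairs whose count (tightness) exceeds a threshold become word candidates."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

pairs = adjacent_pair_counts("abxabyab")
word_candidates = [p for p, c in pairs.items() if c >= 3]  # "ab" co-occurs 3 times
```

In this toy corpus only the pair "ab" clears the threshold and is treated as a word candidate.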
For example, in a chat customer service robot, a target user asks "What rights does Fubao have?" After word segmentation, candidate words such as "Fubao", "rights", "which", and "have" are obtained.
203. And performing stop word filtering processing on the candidate words to obtain a plurality of target words.
And the server performs stop-word filtering on the plurality of candidate words to obtain a plurality of target words. Stop words may include function words, such as the Chinese equivalents of "that", "those", and "and", or English words such as "the", "a", and "an".
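The filtering step can be sketched as a simple set-membership check; the stop list below is an illustrative assumption, not the patent's actual list.

```python
# Illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "that", "those", "and", "which", "have"}

def filter_stop_words(candidates: list) -> list:
    """Drop stop words from the candidate words, preserving order."""
    return [w for w in candidates if w.lower() not in STOP_WORDS]

targets = filter_stop_words(["Fubao", "rights", "which", "have"])
```

Applied to the segmented question above, only the content words "Fubao" and "rights" survive as target words.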
204. And calling a preset corpus to determine the word frequency TF of each target word in the plurality of target words.
The server calls a preset corpus to determine the word frequency TF of each target word in the plurality of target words. Specifically, the server acquires a preset corpus and determines a target corpus document in the preset corpus; and the server determines the occurrence frequency T of each target word in the target corpus document and generates the word frequency TF of each target word.
For example, suppose there are 10 corpus documents w1, w2, ..., w10. "Fubao" is concentrated in the corpus documents w1 and w2, appearing once in each of w1 and w2, so the word frequency TF of "Fubao" is 1. "Rights" is distributed in the corpus documents w1, w2, and w4, appearing 10 times in each, so the word frequency TF of "rights" is 10.
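The TF determination above can be sketched as follows. The function name is ours; as in the example, the term is assumed to appear the same number of times in each containing document.

```python
def term_frequency(term: str, docs: list) -> int:
    """TF as used in the example above: the number of occurrences of the
    term in a containing corpus document (assumed equal across documents)."""
    counts = [doc.count(term) for doc in docs if term in doc]
    return counts[0] if counts else 0

# w1, w2 each contain "Fubao" once; w1, w2, w4 each contain "rights" 10 times
docs = [["Fubao"] + ["rights"] * 10, ["Fubao"] + ["rights"] * 10,
        [], ["rights"] * 10] + [[]] * 6
```

On this corpus the sketch reproduces the example values: TF("Fubao") = 1 and TF("rights") = 10.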
205. And calling a preset corpus to determine the inverse file frequency IDF of each target word in the plurality of target words.
The server calls a preset corpus to determine the inverse file frequency IDF of each target word in the plurality of target words. Specifically, the server acquires the preset corpus and determines the total number M of corpus documents in it; the server determines the number Wi of documents among the M corpus documents that contain a first target word, where i is a positive integer and the first target word is any one of the plurality of target words; the server applies a first preset formula, IDF = log2(M/Wi + 1), to the document count Wi and the total M to generate the inverse file frequency IDF of the first target word; the server then generates the inverse file frequency IDF of each target word.
For example, with the 10 existing corpus documents w1, w2, ..., w10, "Fubao" is concentrated in the corpus documents w1 and w2 and appears once in each, so the inverse file frequency IDF of "Fubao" takes the value

IDF_A = log2(10/2 + 1) = log2(6) ≈ 2.58
"Rights" is distributed in the corpus documents w1, w2, and w4, with 10 occurrences in each, so the inverse file frequency IDF of "rights" takes the value

IDF_B = log2(10/3 + 1) = log2(13/3) ≈ 2.12
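The first preset formula can be sketched directly from the definition above; the function name mirrors the patent's term and is otherwise an assumption.

```python
import math

def inverse_file_frequency(term: str, docs: list) -> float:
    """First preset formula: IDF = log2(M / Wi + 1), where M is the total
    number of corpus documents and Wi the number of documents containing the term."""
    m = len(docs)
    wi = sum(1 for doc in docs if term in doc)
    return math.log2(m / wi + 1)

docs = [{"Fubao", "rights"}, {"Fubao", "rights"}, set(), {"rights"}] + [set()] * 6
idf_a = inverse_file_frequency("Fubao", docs)    # log2(10/2 + 1) = log2(6)
idf_b = inverse_file_frequency("rights", docs)   # log2(10/3 + 1)
```

With M = 10, the sketch reproduces the example values IDF_A = log2(6) ≈ 2.58 and IDF_B = log2(13/3) ≈ 2.12.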
206. And calculating the binary mutual information of each target word in the plurality of target words according to a preset formula.
And the server calculates the binary mutual information of each target word in the plurality of target words according to a preset formula. The method specifically comprises the following steps:
(1) the server selects any one target word from the plurality of target words as a candidate target word.
(2) The server determines the count C(X, w_i, w_{i+1}) of occurrences of the candidate target word X in pairs of consecutive corpus documents, obtaining a first count. Here C(X, w_i, w_{i+1}) is the number of consecutive document pairs (w_i, w_{i+1}) that both contain X. For example, if the documents w1, w2, w3 all contain the word A, then C(A, w_i, w_{i+1}) = 2 (the pairs (w1, w2) and (w2, w3)).
(3) The server determines the count C(X, w_i, w_{i+k}) of occurrences of the candidate target word X in pairs of sequential corpus documents, obtaining a second count. Here C(X, w_i, w_{i+k}) is the number of ordered document pairs (w_i, w_{i+k}), k ≥ 1, that both contain X. For example, if the documents w1, w2, w3 all contain the word A, then C(A, w_i, w_{i+k}) = 3 (the pairs (w1, w2), (w1, w3), and (w2, w3)).
(4) The server calculates the ratio of the first count to the second count to obtain a first ratio p(X|w_i, w_{i+k}, w_{i+1}) = C(X, w_i, w_{i+1}) / C(X, w_i, w_{i+k}), i.e., the ratio of the count of occurrences of X in consecutive document pairs to the count in sequential document pairs. For example, w1, w2 is a consecutive pair while w1, w3 is not; w1, w2 and w1, w3 are sequential pairs while w2, w1 is not.
(5) The server determines the binary mutual information of the candidate target word from the first ratio: mi(X, w_i, w_{i+k}, w_{i+1}) = log2 p(X|w_i, w_{i+k}, w_{i+1}).

For example, for keywords A and B there are 10 corpus documents w1, w2, ..., w10, where corpus document w1 contains x1 occurrences of keyword A and y1 of keyword B, corpus document w2 contains x2 of A and y2 of B, and corpus document w10 contains y3 of B. The binary mutual information factors of keywords A and B are then mi(A) and mi(B), calculated as:

mi(A) = log2( C(A, w_i, w_{i+1}) / C(A, w_i, w_{i+k}) ) = log2(1/1) = 0

mi(B) = log2( C(B, w_i, w_{i+1}) / C(B, w_i, w_{i+k}) ) = log2(1/3) ≈ −1.58
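Steps (1) to (5) can be sketched together as follows. The function name is ours; the pair-counting interpretation follows the definitions of the first and second counts above, and the example data assumes A appears in w1 and w2 while B appears in w1, w2, and w10.

```python
import math

def binary_mutual_information(term: str, docs: list) -> float:
    """mi(X) = log2(first count / second count): the count of consecutive
    document pairs (w_i, w_{i+1}) containing X over the count of all
    sequential pairs (w_i, w_{i+k}), k >= 1, containing X."""
    n = len(docs)
    consecutive = sum(1 for i in range(n - 1)
                      if term in docs[i] and term in docs[i + 1])
    sequential = sum(1 for i in range(n) for j in range(i + 1, n)
                     if term in docs[i] and term in docs[j])
    return math.log2(consecutive / sequential)

# A in w1, w2; B in w1, w2, w10 (as in the example above)
docs = [{"A", "B"}, {"A", "B"}] + [set()] * 7 + [{"B"}]
mi_a = binary_mutual_information("A", docs)   # 1 consecutive / 1 sequential pair
mi_b = binary_mutual_information("B", docs)   # 1 consecutive / 3 sequential pairs
```

The densely clustered word A gets mi(A) = 0, while the spread-out word B gets a negative factor, which is what lets the weight formula counteract TF-IDF's frequency bias.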
(6) the server generates the binary mutual information of other target words in the target words to obtain the binary mutual information of each target word.
207. And obtaining the adjustment factor of each target word in the plurality of target words according to a preset algorithm.
And the server acquires the adjustment factor of each target word in the plurality of target words according to a preset algorithm. Specifically, the server determines the current service scene based on the target question text; the server divides the plurality of target words into filler words and key words based on the current service scene; the server sets the adjustment factor corresponding to each filler word to a negative number and the adjustment factor corresponding to each key word to a positive number; the server then generates an adjustment factor for each target word.
For example, the weight of the filler word "hello" needs to be reduced, so "hello" is given a μ value less than 0 (default value −1) at initialization; the keyword "Fubao" needs a higher weight, so it is given a μ value greater than 0 (default value 1) at initialization. That is, the weighted bag of words is initialized as ["hello": −1, "Fubao": 1].
It should be noted that μ may also be adjusted afterwards. For example, a keyword A extracted using the word weights calculated by the BOOST algorithm is compared with a preset keyword B. If A is related to B, μ_a is fine-tuned to μ_a = μ_a + w, where w may be set to 1/100 of the initial value of μ_a; if A is not related to B, μ_a is fine-tuned to μ_a = μ_a − w.
208. And calculating to obtain the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word.
And the server calculates the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word. Specifically, the server selects one target word from the plurality of target words as a second target word, and determines the adjustment factor μ_x and the binary mutual information mi(x) corresponding to the second target word; the server calculates the weight value f(x) of the second target word according to a preset formula, where the preset formula is f(x) = mi(x) × TF × IDF + μ_x; the server then calculates the weight values of the other target words among the plurality of target words to obtain the weight value of each target word.
For example, let "Fubao" be word A and "rights" be word B. If calculated according to the existing TF-IDF algorithm, in corpus document w1 we have TF×IDF_A < TF×IDF_B, so the weight of "rights" is greater than that of "Fubao". However, "Fubao" is the more representative word; its weight is weakened because it is densely distributed in a few corpus documents and the TF-IDF algorithm depends strongly on word frequency. The weight values therefore need to be adjusted.
For example, for "Fubao", assume "Fubao" is word A. To increase the weight value of word A, the adjustment factor needs to be set greater than 0; here μ_A = 1, and the weight value is calculated as:

f(A) = mi(A) × TF_A × IDF_A + μ_A = 0 × 1 × log2(6) + 1 = 1
For example, for "rights", assume "rights" is word B. To lower the weight value of word B, the adjustment factor needs to be set less than 0; here μ_B = −1, and the weight value is calculated as:

f(B) = mi(B) × TF_B × IDF_B + μ_B = log2(1/3) × 10 × log2(13/3) − 1 ≈ −34.5
Therefore, f(A) > f(B), i.e., f("Fubao") > f("rights").
209. And determining keywords of the target question text according to the weight value of each target word, and retrieving corresponding answers according to the keywords.
And the server determines keywords of the target question text according to the weight value of each target word, and retrieves corresponding answers according to the keywords.
It is understood that a keyword corresponds to one or more answers. For example, if the keyword is "rate of return", the corresponding answer may be "5%", "10%", "20%", or another value. If the keyword is "scene", the corresponding answer may be "loan", "periodic financing", "mortgage", or another scene type.
According to the technical solution provided by the embodiment of the present invention, a target question text sent by a target user is obtained, where the target question text is used to indicate obtaining an answer corresponding to it; the target question text is segmented to obtain a plurality of candidate words, each candidate word being unique; a preset corpus is called to determine the word frequency TF and inverse file frequency IDF of a plurality of target words among the candidate words; the binary mutual information of each target word is calculated according to a preset formula; an adjustment factor for each target word is obtained according to a preset algorithm; the weight value of each target word is calculated from its adjustment factor and binary mutual information; and keywords of the target question text are determined according to the weight values, with corresponding answers retrieved according to the keywords. In the embodiment of the present invention, an adjustment factor is added to adjust the word-frequency weight value, and a mutual information factor is used to eliminate the influence of different document distributions on the weight value. This reduces the dependence of the TF-IDF algorithm on word frequency, improves the accuracy of the weight values and of keyword extraction, and thus retrieves an effective message response.
In the above description of the word retrieval method based on binary mutual information in the embodiment of the present invention, the following description of the word retrieval device based on binary mutual information in the embodiment of the present invention refers to fig. 3, and an embodiment of the word retrieval device based on binary mutual information in the embodiment of the present invention includes:
a first obtaining unit 301, configured to obtain a target question text sent by a target user, where the target question text is used to instruct to obtain an answer corresponding to the target question text;
a word segmentation unit 302, configured to perform word segmentation on the target problem text to obtain multiple candidate words, where each candidate word has uniqueness;
a calling determining unit 303, configured to call a preset corpus to determine word frequencies TF and inverse document frequencies IDF of multiple target words in the multiple candidate words;
a first calculating unit 304, configured to calculate binary mutual information of each target term in the multiple target terms according to a preset formula;
a second obtaining unit 305, configured to obtain an adjustment factor of each target term in the plurality of target terms according to a preset algorithm;
the second calculating unit 306 is configured to calculate a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word;
the determining and retrieving unit 307 is configured to determine a keyword of the target question text according to the weight value of each target word, and retrieve a corresponding answer according to the keyword.
In the embodiment of the present invention, an adjustment factor is added to adjust the word-frequency weight value, and a mutual information factor is used to eliminate the influence of different document distributions on the weight value, reducing the dependence on word frequency, improving the accuracy of the weight values, increasing the accuracy of keyword extraction, and thus retrieving effective message responses.
Referring to fig. 4, another embodiment of the term searching apparatus based on binary mutual information according to the embodiment of the present invention includes:
a first obtaining unit 301, configured to obtain a target question text sent by a target user, where the target question text is used to instruct to obtain an answer corresponding to the target question text;
a word segmentation unit 302, configured to perform word segmentation on the target problem text to obtain multiple candidate words, where each candidate word has uniqueness;
a calling determining unit 303, configured to call a preset corpus to determine word frequencies TF and inverse document frequencies IDF of multiple target words in the multiple candidate words;
a first calculating unit 304, configured to calculate binary mutual information of each target term in the multiple target terms according to a preset formula;
a second obtaining unit 305, configured to obtain an adjustment factor of each target term in the plurality of target terms according to a preset algorithm;
the second calculating unit 306 is configured to calculate a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word;
the determining and retrieving unit 307 is configured to determine a keyword of the target question text according to the weight value of each target word, and retrieve a corresponding answer according to the keyword.
Optionally, the call determining unit 303 includes:
a filtering module 3031, configured to perform stop word filtering processing on the multiple candidate words to obtain multiple target words;
a first determining module 3032, configured to invoke a preset corpus to determine a word frequency TF of each target word in a plurality of target words;
a second determining module 3033, configured to invoke a preset corpus to determine a reverse document frequency IDF of each target term in the plurality of target terms.
Optionally, the first determining module 3032 is specifically configured to:
acquiring a preset corpus and determining a target corpus document in the preset corpus; and determining the occurrence frequency T of each target word in the target corpus document, and generating the word frequency TF of each target word.
Optionally, the second determining module 3033 is specifically configured to:
acquiring a preset corpus and determining the total number M of corpus documents in the preset corpus; determining the number Wi of documents among the M corpus documents that contain a first target word, where i is a positive integer and the first target word is any one of the plurality of target words; applying a first preset formula, IDF = log2(M/Wi + 1), to the document count Wi and the total M to generate the inverse file frequency IDF of the first target word; and generating the inverse file frequency IDF of each target word.
Optionally, the first calculating unit 304 is specifically configured to:
selecting any one target word from the plurality of target words as a candidate target word; determining the count C(X, w_i, w_{i+1}) of occurrences of the candidate target word in pairs of consecutive corpus documents to obtain a first count; determining the count C(X, w_i, w_{i+k}) of occurrences of the candidate target word in pairs of sequential corpus documents to obtain a second count; calculating the ratio of the first count to the second count to obtain a first ratio p(X|w_i, w_{i+k}, w_{i+1}); determining the binary mutual information of the candidate target word from the first ratio as mi(X, w_i, w_{i+k}, w_{i+1}) = log2 p(X|w_i, w_{i+k}, w_{i+1}); and generating the binary mutual information of the other target words to obtain the binary mutual information of each target word.
Optionally, the second obtaining unit 305 is specifically configured to:
determining the current business scene based on the target question text; dividing the plurality of target words into filler words and key words based on the current business scene; setting the adjustment factor corresponding to each filler word to a negative number and the adjustment factor corresponding to each key word to a positive number; and generating an adjustment factor for each target word.
Optionally, the second calculating unit 306 is specifically configured to:
selecting one target word from the plurality of target words as a second target word, and determining the adjustment factor μ_x and binary mutual information mi(x) corresponding to the second target word; calculating the weight value f(x) of the second target word according to a preset formula, where the preset formula is f(x) = mi(x) × TF × IDF + μ_x; and calculating the weight values of the other target words to obtain the weight value of each target word.
According to the embodiment of the present invention, a target question text sent by a target user is obtained, where the target question text is used to indicate obtaining an answer corresponding to it; the target question text is segmented to obtain a plurality of candidate words, each candidate word being unique; a preset corpus is called to determine the word frequency TF and inverse file frequency IDF of a plurality of target words among the candidate words; the binary mutual information of each target word is calculated according to a preset formula; an adjustment factor for each target word is obtained according to a preset algorithm; the weight value of each target word is calculated from its adjustment factor and binary mutual information; and keywords of the target question text are determined according to the weight values, with corresponding answers retrieved according to the keywords. In the embodiment of the present invention, an adjustment factor is added to adjust the word-frequency weight value, and a mutual information factor is used to eliminate the influence of different document distributions on the weight value, reducing the dependence on word frequency, improving the accuracy of the weight values, increasing the accuracy of keyword extraction, and thus retrieving effective message responses.
Fig. 3 to 4 describe the term retrieval device based on binary mutual information in the embodiment of the present invention in detail from the perspective of a modular functional entity, and the term retrieval device based on binary mutual information in the embodiment of the present invention in detail from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a term searching apparatus based on binary mutual information according to an embodiment of the present invention, where the term searching apparatus 500 based on binary mutual information may generate relatively large differences due to different configurations or performances, and may include one or more processors (CPUs) 501 (e.g., one or more processors) and a memory 509, and one or more storage media 508 (e.g., one or more mass storage devices) storing applications 507 or data 506. Memory 509 and storage medium 508 may be, among other things, transient storage or persistent storage. The program stored on the storage medium 508 may include one or more modules (not shown), each of which may include a series of instruction operations for a term retrieval device based on binary mutual information. Still further, the processor 501 may be configured to communicate with the storage medium 508 to execute a series of instruction operations in the storage medium 508 on the binary mutual information based word retrieval device 500.
The binary mutual information based word retrieval device 500 may further include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input-output interfaces 504, and/or one or more operating systems 505, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. It will be understood by those skilled in the art that the structure shown in FIG. 5 does not constitute a limitation of the word retrieval device based on binary mutual information, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The following specifically describes each component of the term search device based on binary mutual information with reference to fig. 5:
The processor 501 is the control center of the word retrieval device based on binary mutual information and performs processing according to the configured word retrieval method based on binary mutual information. The processor 501 connects the various parts of the whole word retrieval device based on binary mutual information by using various interfaces and lines, and executes the various functions of the device and processes its data by running or executing the software programs and/or modules stored in the memory 509 and calling the data stored in the memory 509, thereby reducing the dependence of the TF-IDF algorithm on word frequency, improving the accuracy of the weight values, increasing the accuracy of keyword extraction, and thus retrieving effective message responses. The storage medium 508 and the memory 509 are carriers for storing data; in the embodiment of the present invention, the storage medium 508 may be an internal memory with a small storage capacity but high speed, and the memory 509 may be an external memory with a large storage capacity but lower speed.
The memory 509 may be used to store software programs and modules, and the processor 501 executes various functional applications and data processing of the word retrieval device 500 based on binary mutual information by operating the software programs and modules stored in the memory 509. The memory 509 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (for example, performing word segmentation on a target question text to obtain a plurality of candidate words, each candidate word having uniqueness), and the like; the storage data area may store data created from use of the word search apparatus based on binary mutual information (such as a weight value of each target word, etc.), and the like. Further, the memory 509 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. The word retrieval method program based on binary mutual information provided in the embodiment of the present invention and the received data stream are stored in a memory, and when they are needed to be used, the processor 501 calls from the memory 509.
When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, twisted pair) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., compact disc), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided by the present invention, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a logical function division, and other divisions may be used in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A word retrieval method based on binary mutual information is characterized by comprising the following steps:
acquiring a target question text sent by a target user, wherein the target question text is used for indicating to acquire an answer corresponding to the target question text;
segmenting the target question text to obtain a plurality of candidate words, wherein each candidate word is unique;
calling a preset corpus to determine the word frequency TF and the inverse document frequency IDF of a plurality of target words in the candidate words;
calculating the binary mutual information of each target word in the plurality of target words according to a preset formula;
obtaining an adjustment factor of each target word in the plurality of target words according to a preset algorithm;
calculating to obtain a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word;
and determining keywords of the target question text according to the weight value of each target word, and retrieving corresponding answers according to the keywords.
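Read as a pipeline, the method of claim 1 can be sketched roughly as follows. This is an illustration only, not the patented implementation: the TF/IDF, mutual-information, and adjustment-factor formulas are defined in the dependent claims, and the whitespace tokenizer and caller-supplied weight function here are assumptions.

```python
def retrieve_keyword(question_text, weight_fn):
    """Segment the question text into unique candidate words, score each with
    a caller-supplied weight function (standing in for the claimed
    TF * IDF * mutual-information weighting), and return the top-weighted
    word as the keyword used for answer retrieval."""
    candidates = set(question_text.split())  # segmentation + dedup (uniqueness)
    weights = {w: weight_fn(w) for w in candidates}
    return max(weights, key=weights.get)

# Toy run with word length as the stand-in weight function:
kw = retrieve_keyword("how to claim insurance insurance payout", len)  # -> "insurance"
```

In the claimed method the weight function would be f(x) as defined in claim 7 rather than this toy stand-in.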
2. The method for retrieving words based on binary mutual information as claimed in claim 1, wherein said invoking a preset corpus to determine the word frequency TF and the inverse document frequency IDF of a plurality of target words in said plurality of candidate words comprises:
performing stop word filtering processing on the candidate words to obtain a plurality of target words;
calling a preset corpus to determine the word frequency TF of each target word in a plurality of target words;
and calling a preset corpus to determine the inverse document frequency IDF of each target word in the plurality of target words.
3. The method for retrieving words based on binary mutual information as claimed in claim 2, wherein said invoking a preset corpus to determine a word frequency TF of each target word of a plurality of target words comprises:
acquiring a preset corpus and determining a target corpus document in the preset corpus;
and determining the number of occurrences T of each target word in the target corpus document, and generating the word frequency TF of each target word.
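The word-frequency step of claim 3 can be sketched as below. The claim only specifies counting occurrences T and generating TF from them; the length normalization shown is a common TF variant and an assumption here, as is the whitespace tokenization.

```python
from collections import Counter

def term_frequency(target_words, corpus_document):
    """Count the occurrences T of each target word in the target corpus
    document and derive a length-normalized term frequency TF."""
    tokens = corpus_document.split()  # assumes pre-tokenized, space-separated text
    counts = Counter(tokens)
    total = len(tokens)
    # TF of each target word: occurrence count T divided by document length
    return {w: counts[w] / total for w in target_words}

tf = term_frequency(["mutual", "information"],
                    "binary mutual information ranks words by mutual dependence")
# "mutual" occurs 2 times out of 8 tokens -> TF 0.25
```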
4. The method for retrieving words based on binary mutual information as claimed in claim 2, wherein said invoking a preset corpus to determine the inverse document frequency IDF of each target word of a plurality of target words comprises:
acquiring a preset corpus and determining the total number M of corpus documents in the preset corpus;
determining the number of documents Wi that include a first target word among the M corpus documents, wherein i is a positive integer, and the first target word is any one of the plurality of target words;
calling a first preset formula with the document number Wi and the total number M to generate the inverse document frequency IDF of the first target word, wherein the first preset formula is IDF = log2(M/(Wi + 1));
and generating the inverse document frequency IDF of each target word.
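The first preset formula of claim 4, IDF = log2(M/(Wi + 1)), translates directly into code; the set-based document representation is an assumption for illustration.

```python
import math

def inverse_document_frequency(word, corpus_documents):
    """IDF per the first preset formula in claim 4: IDF = log2(M / (Wi + 1)),
    where M is the total number of corpus documents and Wi is the number of
    documents containing the word (the +1 avoids division by zero for
    unseen words)."""
    M = len(corpus_documents)
    Wi = sum(1 for doc in corpus_documents if word in doc)
    return math.log2(M / (Wi + 1))

docs = [{"binary", "mutual"}, {"mutual", "information"}, {"retrieval"}]
idf = inverse_document_frequency("mutual", docs)  # log2(3 / (2 + 1)) = 0.0
```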
5. The method for word retrieval based on binary mutual information according to claim 1, wherein said calculating binary mutual information of each target word in said plurality of target words according to a preset formula comprises:
selecting any one target word from the plurality of target words as a candidate target word;
determining the number of occurrences of the candidate target word in two consecutive corpus documents (formula image not reproduced) as a first count;
determining the number of occurrences of the candidate target word in two sequential corpus documents (formula image not reproduced) as a second count;
calculating the ratio of the first count to the second count to obtain a first ratio p(X|w_i, w_{i+k}, w_{i+1});
determining, according to the first ratio p(X|w_i, w_{i+k}, w_{i+1}), the binary mutual information of the candidate target word as mi(x, w_i, w_{i+k}, w_{i+1}) = log2 p(X|w_i, w_{i+k}, w_{i+1});
And generating the binary mutual information of other target words in the target words to obtain the binary mutual information of each target word.
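Claim 5 reduces to the base-2 logarithm of a ratio of two co-occurrence counts. The exact counting windows are given by formula images not reproduced in this text, so the sketch below takes the two counts as inputs rather than computing them.

```python
import math

def binary_mutual_information(first_count, second_count):
    """mi(x) = log2(p), where p is the first ratio of claim 5: the first
    count (occurrences in two consecutive corpus documents) divided by the
    second count (occurrences in two sequential corpus documents). How the
    two counts are obtained is defined by formulas not reproduced here."""
    p = first_count / second_count
    return math.log2(p)

mi = binary_mutual_information(4, 8)  # log2(4/8) = -1.0
```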
6. The method for retrieving words based on binary mutual information as claimed in claim 1, wherein said obtaining the adjustment factor of each target word in the plurality of target words according to a preset algorithm comprises:
determining a current business scene based on the target question text;
dividing the plurality of target words into filler words and key words based on the current business scenario;
setting the adjustment factor corresponding to each filler word to a negative number, and setting the adjustment factor corresponding to each key word to a positive number;
and generating the adjustment factor of each target word.
7. The method for word retrieval based on binary mutual information according to any one of claims 1-6, wherein the calculating a weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word comprises:
selecting one target word from the plurality of target words as a second target word, and determining the adjustment factor μx and the binary mutual information mi(x) corresponding to the second target word;
calculating the weight value f(x) of the second target word according to a preset formula, wherein the preset formula is: f(x) = mi(x) × TF × IDF + μx;
And calculating to obtain the weight values of other target words in the target words, and obtaining the weight value of each target word.
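The preset formula of claim 7, f(x) = mi(x) × TF × IDF + μx, combines the pieces from claims 3–6; the numeric inputs in the example are arbitrary illustrative values.

```python
def weight_value(mi_x, tf, idf, mu_x):
    """Weight value per the preset formula in claim 7:
    f(x) = mi(x) * TF * IDF + mu_x, where mu_x is the adjustment factor
    (negative for filler words, positive for key words, per claim 6)."""
    return mi_x * tf * idf + mu_x

w = weight_value(mi_x=1.5, tf=0.25, idf=2.0, mu_x=0.1)  # 1.5*0.25*2.0 + 0.1 = 0.85
```

The word with the largest f(x) is then taken as the keyword of the target question text.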
8. A word retrieval device based on binary mutual information is characterized by comprising:
the first obtaining unit is used for obtaining a target question text sent by a target user, and the target question text is used for indicating to obtain an answer corresponding to the target question text;
the word segmentation unit is used for segmenting the target question text to obtain a plurality of candidate words, each candidate word being unique;
the calling and determining unit is used for calling a preset corpus to determine the word frequency TF and the inverse document frequency IDF of a plurality of target words in the candidate words;
the first calculation unit is used for calculating the binary mutual information of each target word in the plurality of target words according to a preset formula;
the second acquisition unit is used for acquiring the adjustment factor of each target word in the plurality of target words according to a preset algorithm;
the second calculation unit is used for calculating the weight value of each target word according to the adjustment factor of each target word and the binary mutual information of each target word;
and the determining and retrieving unit is used for determining the keywords of the target question text according to the weight value of each target word and retrieving the corresponding answers according to the keywords.
9. A word retrieval device based on binary mutual information, comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements the word retrieval method based on binary mutual information according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the binary mutual information based word retrieval method according to any one of claims 1 to 7.
CN202010146242.7A 2020-03-05 2020-03-05 Word retrieval method, device, equipment and storage medium based on binary mutual information Pending CN111401039A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010146242.7A CN111401039A (en) 2020-03-05 2020-03-05 Word retrieval method, device, equipment and storage medium based on binary mutual information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010146242.7A CN111401039A (en) 2020-03-05 2020-03-05 Word retrieval method, device, equipment and storage medium based on binary mutual information

Publications (1)

Publication Number Publication Date
CN111401039A true CN111401039A (en) 2020-07-10

Family

ID=71430502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010146242.7A Pending CN111401039A (en) 2020-03-05 2020-03-05 Word retrieval method, device, equipment and storage medium based on binary mutual information

Country Status (1)

Country Link
CN (1) CN111401039A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036159A (en) * 2020-09-01 2020-12-04 北京金堤征信服务有限公司 Word cloud data generation method and device
CN112036159B (en) * 2020-09-01 2023-11-03 北京金堤征信服务有限公司 Word cloud data generation method and device
CN112184027A (en) * 2020-09-29 2021-01-05 壹链盟生态科技有限公司 Task progress updating method and device and storage medium
CN112184027B (en) * 2020-09-29 2023-12-26 壹链盟生态科技有限公司 Task progress updating method, device and storage medium
CN113609248A (en) * 2021-08-20 2021-11-05 北京金山数字娱乐科技有限公司 Word weight generation model training method and device and word weight generation method and device

Similar Documents

Publication Publication Date Title
US10452691B2 (en) Method and apparatus for generating search results using inverted index
US10169449B2 (en) Method, apparatus, and server for acquiring recommended topic
CA2851772C (en) Method and apparatus for automatically summarizing the contents of electronic documents
US8666984B2 (en) Unsupervised message clustering
US11176453B2 (en) System and method for detangling of interleaved conversations in communication platforms
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN109657053B (en) Multi-text abstract generation method, device, server and storage medium
US20230177360A1 (en) Surfacing unique facts for entities
CN111401039A (en) Word retrieval method, device, equipment and storage medium based on binary mutual information
EP3345118B1 (en) Identifying query patterns and associated aggregate statistics among search queries
KR101423549B1 (en) Sentiment-based query processing system and method
US20180268071A1 (en) Systems and methods of de-duplicating similar news feed items
US8825620B1 (en) Behavioral word segmentation for use in processing search queries
US9407589B2 (en) System and method for following topics in an electronic textual conversation
CN109918656B (en) Live broadcast hotspot acquisition method and device, server and storage medium
CN113326420B (en) Question retrieval method, device, electronic equipment and medium
US20100306214A1 (en) Identifying modifiers in web queries over structured data
WO2008144457A2 (en) Efficient retrieval algorithm by query term discrimination
CN111767393A (en) Text core content extraction method and device
JP5538185B2 (en) Text data summarization device, text data summarization method, and text data summarization program
CN111309916A (en) Abstract extraction method and device, storage medium and electronic device
CN109460499A (en) Target search word generation method and device, electronic equipment, storage medium
CN110245357B (en) Main entity identification method and device
US9454568B2 (en) Method, apparatus and computer storage medium for acquiring hot content
CN105653553B (en) Word weight generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination