CN105302790B - Method and apparatus for processing text - Google Patents


Info

Publication number
CN105302790B
CN105302790B
Authority
CN
China
Prior art keywords
text
word
replied
stop
topic
Prior art date
Legal status
Active
Application number
CN201410371035.6A
Other languages
Chinese (zh)
Other versions
CN105302790A (en)
Inventor
吉宗诚
吕正东
李航
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN201410371035.6A
Publication of CN105302790A
Application granted
Publication of CN105302790B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the present invention provides a method for processing text. The method includes: obtaining a text to be replied and N candidate reply texts; determining the probability that each of M_i non-stop words is a topic word of the text to be replied; determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text; determining the topic word similarity between the i-th candidate reply text and the text to be replied; and using the candidate reply text, among the N candidate reply texts, with the highest topic word similarity to the text to be replied as the reply text of the text to be replied. The above technical solution enables the topic of the reply text to closely follow the topic of the text to be replied, and substantially reduces the probability that the topic of the reply text is unrelated to the topic of the text to be replied.

Description

Method and apparatus for processing text
Technical Field
Embodiments of the present invention relate to the field of information technology, and in particular to a method and a device for processing text.
Background
Automatic dialogue is a popular research topic in the field of information technology. With automatic dialogue techniques, a user can carry out a man-machine dialogue. In the prior art, automatic dialogue is realized by information retrieval technology: after receiving the text to be replied sent by the user, a device such as a computer or a server retrieves a sentence from a dialogue database to serve as the reply text. However, when retrieving the reply text, the prior art only judges whether the reply text contains the same terms as the text to be replied. Although the reply text may include terms from the text to be replied, its content is not necessarily suitable.
Disclosure of Invention
The embodiment of the invention provides a method and a device for processing text, which can select a suitable text as the reply text according to the topic of the text to be replied input by a user, so that the reply text closely follows the topic of the text to be replied.
In a first aspect, an embodiment of the present invention provides a method for processing text, where the method includes: acquiring a text to be replied and N candidate reply texts, wherein each candidate reply text includes at least one non-stop word of the text to be replied; determining the probability that each of M_i non-stop words is a topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts has in common with the text to be replied; determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text; determining the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text; and using the candidate reply text, among the N candidate reply texts, with the highest topic word similarity to the text to be replied as the reply text of the text to be replied.
With reference to the first aspect, in a first possible implementation manner of the first aspect, determining the probability that each of the M_i non-stop words is a topic word of the text to be replied includes: determining the feature vector of each of the M_i non-stop words in the text to be replied; and determining, according to the feature vector of each of the M_i non-stop words in the text to be replied and topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the text to be replied.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the probability that any one of the M_i non-stop words is a topic word of the text to be replied is determined by the following formula:

p(w1) = 1 / (1 + exp(-(v · x_q(w1) + c)))

wherein x_q(w1) denotes the feature vector of a first term w1 in the text to be replied, the first term being any one of the M_i non-stop words, v denotes a weight vector, c denotes a constant, v and c are the topic word prediction parameters, and p(w1) denotes the probability that the first term is a topic word of the text to be replied.
With reference to the first aspect, in a third possible implementation manner of the first aspect, determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text includes: determining the feature vector of each of the M_i non-stop words in the i-th candidate reply text; and determining, according to the feature vector of each of the M_i non-stop words in the i-th candidate reply text and the topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, the probability that any one of the M_i non-stop words is a topic word of the i-th candidate reply text is determined by the following formula:

p(w2) = 1 / (1 + exp(-(v · x_d(w2) + c)))

wherein x_d(w2) denotes the feature vector of a second term w2 in the i-th candidate reply text, the second term being any one of the M_i non-stop words, v denotes a weight vector, c denotes a constant, v and c are the topic word prediction parameters, and p(w2) denotes the probability that the second term is a topic word of the i-th candidate reply text.
In a fifth possible implementation manner of the first aspect, the topic word prediction parameter is determined by: obtaining P training texts, wherein a topic word characteristic value corresponding to each non-stop word in each training text is determined, wherein the topic word characteristic value is used for indicating whether the corresponding non-stop word belongs to a topic word of the training text in which the corresponding non-stop word is located; determining a feature vector of each non-stop word in each training text; performing logistic regression model learning according to the feature vector of each non-stop word in each training text and the topic word feature value of each non-stop word in each training text to determine the topic word prediction parameter, wherein the feature vector of each non-stop word in each training text is an input item of the logistic regression model, and the topic word feature value of each non-stop word in each training text is an output item of the logistic regression model.
With reference to the first aspect or any one of the foregoing possible implementations of the first aspect, in a sixth possible implementation of the first aspect, the topic word similarity between the i-th candidate reply text and the text to be replied is determined from the vectors W_q and W_d,
wherein W_q denotes the vector formed by the probabilities that the M_i non-stop words are topic words of the text to be replied, W_d denotes the vector formed by the probabilities that the M_i non-stop words are topic words of the i-th candidate reply text, and score(q, d) denotes the topic word similarity between the i-th candidate reply text and the text to be replied.
In a second aspect, an embodiment of the present invention provides a device, including: an acquiring unit, configured to acquire a text to be replied and N candidate reply texts, wherein each candidate reply text includes at least one non-stop word of the text to be replied; a first determining unit, configured to determine the probability that each of M_i non-stop words is a topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts has in common with the text to be replied; a second determining unit, configured to determine the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text; a similarity determining unit, configured to determine the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text; and a third determining unit, configured to use the candidate reply text, among the N candidate reply texts, with the highest topic word similarity to the text to be replied as the reply text of the text to be replied.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the first determining unit is specifically configured to: determine the feature vector of each of the M_i non-stop words in the text to be replied; and determine, according to the feature vector of each of the M_i non-stop words in the text to be replied and topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the text to be replied.
With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the first determining unit is specifically configured to determine, using the following formula, the probability that any one of the M_i non-stop words is a topic word of the text to be replied:

p(w1) = 1 / (1 + exp(-(v · x_q(w1) + c)))

wherein x_q(w1) denotes the feature vector of a first term w1 in the text to be replied, the first term being any one of the M_i non-stop words, v denotes a weight vector, c denotes a constant, v and c are the topic word prediction parameters, and p(w1) denotes the probability that the first term is a topic word of the text to be replied.
With reference to the second aspect, in a third possible implementation manner of the second aspect, the second determining unit is specifically configured to: determine the feature vector of each of the M_i non-stop words in the i-th candidate reply text; and determine, according to the feature vector of each of the M_i non-stop words in the i-th candidate reply text and the topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
With reference to the third possible implementation manner of the second aspect, in a fourth possible implementation manner of the second aspect, the second determining unit is specifically configured to determine, using the following formula, the probability that any one of the M_i non-stop words is a topic word of the i-th candidate reply text:

p(w2) = 1 / (1 + exp(-(v · x_d(w2) + c)))

wherein x_d(w2) denotes the feature vector of a second term w2 in the i-th candidate reply text, the second term being any one of the M_i non-stop words, v denotes a weight vector, c denotes a constant, v and c are the topic word prediction parameters, and p(w2) denotes the probability that the second term is a topic word of the i-th candidate reply text.
With reference to any one possible implementation manner of the first possible implementation manner of the second aspect to the fourth possible implementation manner of the second aspect, in a fifth possible implementation manner of the second aspect, the apparatus further includes: a training text acquisition unit, configured to acquire P training texts, where a topic word feature value corresponding to each non-stop word in each training text is determined, where the topic word feature value is used to indicate whether the corresponding non-stop word belongs to a topic word of the training text in which the corresponding non-stop word is located; a training text determining unit, configured to determine a feature vector of each non-stop word in each training text; and the prediction parameter determining unit is used for performing logistic regression model learning according to the feature vector of each non-stop word in each training text and the topic word feature value of each non-stop word in each training text to determine the topic word prediction parameter, wherein the feature vector of each non-stop word in each training text is an input item of the logistic regression model, and the topic word feature value of each non-stop word in each training text is an output item of the logistic regression model.
With reference to the second aspect or any one of the foregoing possible implementation manners of the second aspect, in a sixth possible implementation manner of the second aspect, the similarity determining unit is specifically configured to determine the topic word similarity between the i-th candidate reply text and the text to be replied from the vectors W_q and W_d,
wherein W_q denotes the vector formed by the probabilities that the M_i non-stop words are topic words of the text to be replied, W_d denotes the vector formed by the probabilities that the M_i non-stop words are topic words of the i-th candidate reply text, and score(q, d) denotes the topic word similarity between the i-th candidate reply text and the text to be replied.
The above technical solution predicts, for the non-stop words shared by the text to be replied and each candidate reply text, the probability that each such word is a topic word, determines the topic word similarity between the text to be replied and each candidate reply text according to the predicted probabilities, and then selects the candidate reply text with the highest similarity to reply to the text to be replied. In this way, the topic of the reply text closely follows the topic of the text to be replied, and the probability that the topic of the reply text is unrelated to the topic of the text to be replied is greatly reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for processing text provided according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of another method for processing text provided by an embodiment of the invention.
Fig. 3 is a block diagram of a device provided according to an embodiment of the present invention.
Fig. 4 is a block diagram of another apparatus provided in accordance with an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Fig. 1 is a schematic flow chart of a method for processing text provided according to an embodiment of the present invention.
101, obtaining a text to be replied and N candidate reply texts, wherein each candidate reply text comprises at least one non-stop word in the text to be replied.
102, determining the probability that each of M_i non-stop words is a topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts has in common with the text to be replied.
103, determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
104, determining the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
105, taking the candidate reply text with the highest similarity with the topic word of the text to be replied in the N candidate reply texts as the reply text of the text to be replied.
The method shown in fig. 1 predicts, for the non-stop words shared by the text to be replied and each candidate reply text, the probability that each such word is a topic word, determines the topic word similarity between the text to be replied and each candidate reply text according to the predicted probabilities, and then selects the candidate reply text with the highest similarity to reply to the text to be replied. In this way, the topic of the reply text closely follows the topic of the text to be replied, and the probability that the topic of the reply text is unrelated to the topic of the text to be replied is greatly reduced.
Fig. 2 is a schematic flow chart of another method for processing text provided by an embodiment of the invention.
And 201, acquiring a text to be replied and N candidate reply texts, wherein each candidate reply text comprises at least one non-stop word in the text to be replied.
Specifically, the text to be replied is the text input by the user. After the text to be replied is obtained, it is preprocessed; the preprocessing includes word segmentation and stop word removal. Word segmentation is a common technique in the prior art and identifies the terms in a sentence; it may be implemented with existing tools (e.g., the open-source segmentation tool ICTCLAS). After segmenting the text to be replied, stop word removal is also performed on the segmented text. Stop word removal refers to removing certain terms from the text; the removed terms are called stop words and are not essential for understanding the meaning of a sentence (for example, common function words). In general, a stop word list may be stored, and when stop words are removed, the words belonging to the stop word list are deleted. After preprocessing, the text to be replied is a text containing one or more non-stop words. Then, N candidate reply texts are selected from a dialogue database according to the non-stop words in the text to be replied, wherein each candidate reply text contains at least one non-stop word of the text to be replied. The retrieval of candidate reply texts may be implemented with existing tools (e.g., the open-source tool Lucene). Word segmentation and stop word removal also need to be performed on the N retrieved candidate reply texts, in the same way as for the text to be replied, so the description is not repeated here.
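For illustration only, the following is a minimal sketch of this preprocessing and candidate-retrieval step. The tokenizer, the stop word list, and the toy dialogue database are assumptions made for the sketch; the embodiment itself relies on existing tools such as ICTCLAS for word segmentation and Lucene for retrieval.

```python
# Minimal sketch of preprocessing and candidate retrieval (illustrative only).
# A toy whitespace tokenizer and stop word list stand in for a real word segmenter
# (e.g. ICTCLAS) and a real retrieval engine (e.g. Lucene).

STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in", "not"}   # assumed toy stop word list

def preprocess(text):
    """Segment a text into terms and remove stop words."""
    terms = [t.strip(".,!?").lower() for t in text.split()]
    return [t for t in terms if t and t not in STOP_WORDS]

def retrieve_candidates(query_terms, dialogue_db, n):
    """Return up to n texts from the dialogue database that share at least one
    non-stop word with the text to be replied, ranked by simple term overlap."""
    scored = []
    for text in dialogue_db:
        overlap = len(set(query_terms) & set(preprocess(text)))
        if overlap > 0:
            scored.append((overlap, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:n]]

dialogue_db = ["A version control tool.", "Me too, many users are rookies.", "Nice weather today."]
query_terms = preprocess("Still use SVN as the code control tool, not GIT.")
print(retrieve_candidates(query_terms, dialogue_db, n=2))
```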
202, determining the probability that each of M_i non-stop words is a topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts has in common with the text to be replied.
It is to be understood that the ith candidate reply text is any one of the N candidate reply texts. Different candidate reply texts have different numbers of the same non-stop words from the text to be replied, but should contain at least one of the same non-stop words.
Specifically, determining the probability that each of the M_i non-stop words is a topic word of the text to be replied includes: determining the feature vector of each of the M_i non-stop words in the text to be replied; and determining, according to the feature vector of each of the M_i non-stop words in the text to be replied and the topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the text to be replied. Suppose the term w is any one of the M_i non-stop words; the following features of the term w in the text to be replied are extracted: the frequency with which the term w appears in the current text (i.e., the text to be replied); the inverse document frequency of the term w in the whole dialogue database; the number of sentences in the current text containing the term w; whether the term w appears in the first sentence of the current text; whether the term w appears in the last sentence of the current text; whether the term w is a named entity (i.e., a person name, place name, organization name, number word, or time word); whether the term w is a named entity in the first sentence of the current text; whether the term w is a named entity in the last sentence of the current text; and the part of speech of the term w. The feature vector of each of the M_i non-stop words in the text to be replied is composed of these features of that non-stop word in the text to be replied.
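As an illustration of how such a feature vector might be assembled, the sketch below builds the nine features listed above for one term. The helper inputs (the sentence list, document frequencies, named entity set, and part-of-speech tags) and the one-hot encoding of the part of speech are assumptions made for the sketch, not requirements of the embodiment.

```python
import math

def extract_features(term, sentences, db_size, doc_freq, named_entities, pos_tags):
    """Build the feature vector of `term` in one text. `sentences` is the text split
    into sentences, each sentence being a list of terms; `doc_freq`, `named_entities`
    and `pos_tags` stand in for resources (document frequencies, a named entity
    recognizer, a part-of-speech tagger) that the embodiment assumes to exist."""
    all_terms = [t for sent in sentences for t in sent]
    tf = all_terms.count(term)                              # frequency in the current text
    idf = math.log(db_size / doc_freq[term])                # inverse document frequency
    sf = sum(1 for sent in sentences if term in sent)       # sentences containing the term
    first = 1 if term in sentences[0] else 0                # appears in the first sentence
    last = 1 if term in sentences[-1] else 0                # appears in the last sentence
    ne = 1 if term in named_entities else 0                 # term is a named entity
    ne_first = 1 if ne and first else 0                     # named entity in the first sentence
    ne_last = 1 if ne and last else 0                       # named entity in the last sentence
    pos = pos_tags.get(term, "o")                           # part of speech: n / v / a / o
    pos_onehot = {"n": [1, 0, 0, 0], "v": [0, 1, 0, 0], "a": [0, 0, 1, 0]}.get(pos, [0, 0, 0, 1])
    return [tf, idf, sf, first, last, ne, ne_first, ne_last] + pos_onehot

sentences = [["svn", "code", "control", "tool"], ["svn", "simple"]]
print(extract_features("svn", sentences, db_size=1000,
                       doc_freq={"svn": 50}, named_entities=set(), pos_tags={"svn": "n"}))
```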
The probability that any one of the M_i non-stop words is a topic word of the text to be replied can be determined by the following formula:

p(w1) = 1 / (1 + exp(-(v · x_q(w1) + c)))    (formula 1.1)

wherein x_q(w1) denotes the feature vector of the first term w1 in the text to be replied, the first term being any one of the M_i non-stop words; v denotes a weight vector; c denotes a constant; v and c are the topic word prediction parameters; and p(w1) denotes the probability that the first term is a topic word of the text to be replied.
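A direct reading of formula 1.1 (and of formula 1.2 below, which has the same form) is the standard logistic-regression probability. The sketch below evaluates it for one feature vector; the weight vector and constant are placeholder values, since the real parameters come from the training procedure described later.

```python
import math

def topic_word_probability(x, v, c):
    """Formulas 1.1 / 1.2 read as a logistic regression: the probability that the term
    whose feature vector is x is a topic word of the text it was extracted from."""
    z = sum(vi * xi for vi, xi in zip(v, x)) + c
    return 1.0 / (1.0 + math.exp(-z))

x = [2, 3.0, 2, 1, 1, 0, 0, 0, 1, 0, 0, 0]                          # a feature vector as built above
v = [0.4, 0.3, 0.2, 0.5, 0.1, 0.6, 0.2, 0.1, 0.3, 0.0, -0.1, -0.2]  # placeholder weights
print(topic_word_probability(x, v, c=-3.0))                         # placeholder constant
```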
203, determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
Specifically, determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text includes: determining the feature vector of each of the M_i non-stop words in the i-th candidate reply text; and determining, according to the feature vector of each of the M_i non-stop words in the i-th candidate reply text and the topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text. Suppose the term w is any one of the M_i non-stop words; the following features of the term w in the i-th candidate reply text are extracted: the frequency with which the term w appears in the current text (i.e., the i-th candidate reply text); the inverse document frequency of the term w in the whole dialogue database; the number of sentences in the current text containing the term w; whether the term w appears in the first sentence of the current text; whether the term w appears in the last sentence of the current text; whether the term w is a named entity (i.e., a person name, place name, organization name, number word, or time word); whether the term w is a named entity in the first sentence of the current text; whether the term w is a named entity in the last sentence of the current text; and the part of speech of the term w. The feature vector of each of the M_i non-stop words in the i-th candidate reply text is composed of these features of that non-stop word in the i-th candidate reply text.
The probability that any one of the M_i non-stop words is a topic word of the i-th candidate reply text can be determined by the following formula:

p(w2) = 1 / (1 + exp(-(v · x_d(w2) + c)))    (formula 1.2)

wherein x_d(w2) denotes the feature vector of the second term w2 in the i-th candidate reply text, the second term being any one of the M_i non-stop words; v denotes a weight vector; c denotes a constant; v and c are the topic word prediction parameters; and p(w2) denotes the probability that the second term is a topic word of the i-th candidate reply text.
Optionally, as another embodiment, a feature vector of each non-stop word in the text to be replied may be calculated, and the probability that each non-stop word in the text to be replied is a topic word of the text to be replied may be calculated according to that feature vector and the topic word prediction parameters. Likewise, a feature vector of each non-stop word in the i-th candidate reply text may be calculated, and the probability that each non-stop word in the i-th candidate reply text is a topic word of the i-th candidate reply text may be calculated according to that feature vector and the topic word prediction parameters. Then, the M_i non-stop words that belong to both the text to be replied and the i-th candidate reply text are found, together with their topic word probabilities for the text to be replied and for the i-th candidate reply text.
Alternatively, as another embodiment, the probability that each non-stop word of each text in the dialogue database is a topic word of the text it belongs to may be pre-calculated and stored in a topic word probability database. The probability that each non-stop word in the i-th candidate reply text is a topic word of the i-th candidate reply text can then be retrieved directly from this topic word probability database.
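A sketch of this optional precomputation is given below. The helper functions passed in (for listing the non-stop words of a text, building its feature vectors, and applying the trained predictor) are illustrative names and are not part of the embodiment.

```python
def build_topic_word_probability_db(dialogue_db, non_stop_words_of, feature_vector_of, predict):
    """Precompute, for every text in the dialogue database, the topic word probability
    of each of its non-stop words (formula 1.2), so that at reply time the values can
    simply be looked up instead of recomputed."""
    table = {}
    for idx, text in enumerate(dialogue_db):
        for word in non_stop_words_of(text):
            table[(idx, word)] = predict(feature_vector_of(word, text))
    return table
```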
The topic word prediction parameters are determined by the following method: obtaining P training texts, wherein a topic word characteristic value corresponding to each non-stop word in each training text is determined, wherein the topic word characteristic value is used for indicating whether the corresponding non-stop word belongs to a topic word of the training text in which the corresponding non-stop word is located; determining a feature vector of each non-stop word in each training text; performing logistic regression model learning according to the feature vector of each non-stop word in each training text and the topic word feature value of each non-stop word in each training text to determine the topic word prediction parameter, wherein the feature vector of each non-stop word in each training text is an input item of the logistic regression model, and the topic word feature value of each non-stop word in each training text is an output item of the logistic regression model.
The P training texts are selected from the dialogue database, the topic word feature value of each non-stop word of the P training texts that is a topic word is marked as 1, the topic word feature value of each non-stop word that is not a topic word is marked as 0, and logistic regression model training is performed using the following formula:

f(x_w) = 1 / (1 + exp(-(v · x_w + c)))    (formula 1.3)

wherein f(x_w) denotes the topic word feature value of the non-stop word w in the text in which it is located, x_w denotes the feature vector of the non-stop word w in the text in which it is located, v denotes a weight vector, c denotes a constant, and v and c are the topic word prediction parameters obtained by training.
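As an illustration, the following sketch fits the logistic regression model of formula 1.3 with scikit-learn; the embodiment only requires logistic regression model learning and does not name a library, and the training rows below are made-up placeholders.

```python
from sklearn.linear_model import LogisticRegression

X = [  # one row per non-stop word in the P training texts: its feature vector
    [3, 2.1, 2, 1, 0, 1, 1, 0, 1, 0, 0, 0],
    [1, 0.4, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [2, 1.7, 2, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    [1, 0.2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1],
]
y = [1, 0, 1, 0]  # annotated topic word feature values (1 = topic word, 0 = not)

model = LogisticRegression()
model.fit(X, y)

v = model.coef_[0]       # weight vector of formula 1.3
c = model.intercept_[0]  # constant of formula 1.3
print(v, c)
print(model.predict_proba([X[0]])[0][1])  # probability that the first word is a topic word
```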
204, determining the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
Specifically, the topic word similarity between the text to be replied and the i-th candidate reply text can be determined with formula 1.4, which computes a score from the following two vectors:

W_q = (w_q,1, ..., w_q,t),  W_d = (w_d,1, ..., w_d,t)

wherein W_q denotes the vector formed by the probabilities that the M_i non-stop words are topic words of the text to be replied, W_d denotes the vector formed by the probabilities that the M_i non-stop words are topic words of the i-th candidate reply text, w_q,t denotes the probability that the t-th of the M_i non-stop words is a topic word of the text to be replied, w_d,t denotes the probability that the t-th of the M_i non-stop words is a topic word of the i-th candidate reply text, and score(q, d) denotes the topic word similarity between the text to be replied and the i-th candidate reply text.
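The exact expression of formula 1.4 is not reproduced in this text. Purely as an assumption for illustration, the sketch below uses the cosine similarity of W_q and W_d as the score; any other similarity between the two probability vectors could be substituted.

```python
import math

def topic_word_similarity(w_q, w_d):
    """Assumed instantiation of formula 1.4: cosine similarity between the topic word
    probability vector of the text to be replied (w_q) and that of the i-th candidate
    reply text (w_d)."""
    dot = sum(a * b for a, b in zip(w_q, w_d))
    norm_q = math.sqrt(sum(a * a for a in w_q))
    norm_d = math.sqrt(sum(b * b for b in w_d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)

print(topic_word_similarity([0.8, 0.3], [0.7, 0.6]))
```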
205, performing the process shown in steps 202 to 204 on the remaining candidate reply texts among the N candidate reply texts, and determining the topic word similarity between each of the N candidate reply texts and the text to be replied.
And 206, taking the candidate reply text with the highest similarity with the topic word of the text to be replied in the N candidate reply texts as the reply text of the text to be replied.
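Taken together, steps 202 to 206 amount to scoring every candidate and taking the one with the highest score, as in the sketch below. The callables passed in stand for formulas 1.1, 1.2 and 1.4 and for the computation of the shared non-stop words; their names are illustrative and not part of the embodiment.

```python
def choose_reply(text_to_reply, candidates, shared_words, prob_in_query, prob_in_candidate, similarity):
    """Score every candidate reply text by its topic word similarity to the text to be
    replied (steps 202-205) and return the highest-scoring one (step 206).
    `prob_in_query`, `prob_in_candidate` and `similarity` stand for formulas 1.1, 1.2
    and 1.4; `shared_words(i)` returns the M_i non-stop words shared by the text to be
    replied and the i-th candidate."""
    best_score, best_reply = float("-inf"), None
    for i, candidate in enumerate(candidates):
        words = shared_words(i)
        w_q = [prob_in_query(w, text_to_reply) for w in words]
        w_d = [prob_in_candidate(w, candidate) for w in words]
        score = similarity(w_q, w_d)
        if score > best_score:
            best_score, best_reply = score, candidate
    return best_reply
```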
The method shown in fig. 2 predicts, for the non-stop words shared by the text to be replied and each candidate reply text, the probability that each such word is a topic word, determines the topic word similarity between the text to be replied and each candidate reply text according to the predicted probabilities, and then selects the candidate reply text with the highest similarity to reply to the text to be replied. In this way, the topic of the reply text closely follows the topic of the text to be replied, and the probability that the topic of the reply text is unrelated to the topic of the text to be replied is greatly reduced.
The present invention will be further described below with reference to a specific example. This example is a specific embodiment of the methods shown in fig. 1 and fig. 2; it is intended to aid understanding of the invention and is not intended to be limiting.
Assume that the text input by the user is: "When choosing a code control tool, still use SVN and do not use GIT, even though GIT is much stronger than SVN. If there is a vegetable bird (i.e., a rookie) in the team, he may annoy you to death with all kinds of GIT problems. The advantage of SVN is that it is extremely simple, and vegetable birds can also master it quickly." This text is the text to be replied, which needs to be replied to.
The result after word segmentation of the text to be replied is: "select" "code" "control" "tool" "when" "still" "use" "SVN" "not" "want" "use" "GIT" "though" "GIT" "stronger" "than" "SVN" "a lot" "because" "if" "team" "interior" "has" "one" "vegetable bird" "he" "various" "about" "GIT" "problem" "will" "annoy" "dead" "you" "SVN" "advantage" "is" "extremely" "simple" "vegetable bird" "also" "can" "quickly" "master".
For the segmented text to be replied, the result after stop word removal is: "select" "code" "control" "SVN" "GIT" "strong" "team" "interior" "one" "vegetable bird" "he" "various" "problems" "troublesome" "dead" "advantages" "simple" "fast" "grasp". As can be seen, the preprocessed text to be replied contains K non-stop words in total; in this example, K is 19.
N candidate reply texts are then retrieved from the dialogue database according to the non-stop words in the text to be replied. Specifically, texts that contain non-stop words of the text to be replied can be selected from the dialogue database as candidate reply texts. For example, text 1 is "I am a secretor, a vegetable bird; born after 90, striving not to become an old vegetable bird." Since text 1 contains the term "vegetable bird", text 1 may be one of the N candidate reply texts. When the dialogue database contains too many texts that share terms with the preprocessed text to be replied, the N candidate reply texts may be selected according to the weights of the terms; for example, the value of N may be 30. The retrieval of candidate reply texts may be implemented with existing tools (e.g., the open-source tool Lucene).
Table 1 shows a portion of the retrieved candidate reply texts.
Number  Content
1       I am a secretor, a vegetable bird; born after 90, striving not to become an old vegetable bird.
2       Me too, many users are still vegetable birds.
3       Take good care, go slowly; this vegetable bird is myself.
4       A version control tool.
TABLE 1
As can be seen, each candidate reply text in Table 1 contains at least one non-stop word of the text to be replied.
Word segmentation and stop word removal are then performed on each of the N candidate reply texts.
For each of the M_i non-stop words, features are extracted in the text to be replied to obtain the feature vector of that non-stop word in the text to be replied, wherein the M_i non-stop words are the non-stop words shared by the text to be replied and the i-th candidate reply text. Suppose the term w is any one of the M_i non-stop words; the following features of the term w in the text to be replied are extracted: the frequency with which the term w appears in the current text (i.e., the text to be replied); the inverse document frequency of the term w in the whole dialogue database; the number of sentences in the current text containing the term w; whether the term w appears in the first sentence of the current text; whether the term w appears in the last sentence of the current text; whether the term w is a named entity (i.e., a person name, place name, organization name, number word, or time word); whether the term w is a named entity in the first sentence of the current text; whether the term w is a named entity in the last sentence of the current text; and the part of speech of the term w.
The inverse document frequency of the term w in the whole dialogue database can be determined by the following formula:

idf(w) = log(E / df)

wherein E denotes the number of texts in the whole dialogue database, df denotes the number of texts in the whole dialogue database that contain the term w, and idf(w) denotes the inverse document frequency of the term w in the whole dialogue database.
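A toy computation of this quantity, with an assumed in-memory dialogue database, might look as follows:

```python
import math

def inverse_document_frequency(term, dialogue_db):
    """idf(w) = log(E / df): E texts in the dialogue database, df of them contain `term`."""
    E = len(dialogue_db)
    df = sum(1 for text in dialogue_db if term in text.split())
    return math.log(E / df) if df > 0 else 0.0

dialogue_db = ["a version control tool", "svn is simple", "many users are rookies", "svn or git"]
print(inverse_document_frequency("svn", dialogue_db))  # log(4 / 2)
```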
Assume that the i-th candidate reply text is the 4th candidate reply text in Table 1. Then the M_i non-stop words are "control" and "tool". The feature vectors of these two non-stop words in the text to be replied are shown in Table 2.
TABLE 2
The meanings of TF, IDF, SF, First, Last, NE_First, NE_Last and POS are shown in Table 3.

TF        frequency of the term in the current text
IDF       inverse document frequency of the term in the whole dialogue database
SF        number of sentences in the current text containing the term
First     whether the term appears in the first sentence of the current text
Last      whether the term appears in the last sentence of the current text
NE_First  whether the term is a named entity in the first sentence of the current text
NE_Last   whether the term is a named entity in the last sentence of the current text
POS       part of speech of the term
TABLE 3

In First, Last, NE_First and NE_Last, a value of 1 means yes and a value of 0 means no. In POS, "n" denotes a noun, "v" denotes a verb, "a" denotes an adjective, and "o" denotes other parts of speech. In a specific implementation, four numerical values may be used to represent the four parts of speech in POS; for example, nouns, verbs, adjectives, and other words may be represented by 1000, 0100, 0010, and 0001, respectively.
Similarly, the feature vectors of "control" and "tool" in the 4 th candidate reply text in Table 1 may be determined.
After determining the feature vectors of "control" and "tool" in the text to be replied and in the 4th candidate reply text, the probabilities that "control" and "tool" are topic words of the text to be replied can be determined using formula 1.1, and the probabilities that "control" and "tool" are topic words of the 4th candidate reply text can be determined using formula 1.2. Then the topic word similarity between the 4th candidate reply text and the text to be replied is determined using formula 1.4. Specifically, when formula 1.4 is used, W_d = (w_d,1, w_d,2) and W_q = (w_q,1, w_q,2), wherein w_d,1 denotes the probability that "control" is a topic word of the 4th candidate reply text, w_d,2 denotes the probability that "tool" is a topic word of the 4th candidate reply text, w_q,1 denotes the probability that "control" is a topic word of the text to be replied, and w_q,2 denotes the probability that "tool" is a topic word of the text to be replied. It can be seen that the topic word probabilities of the same non-stop word occupy the same position in the two vectors.
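For concreteness, a worked numeric example is given below. The probabilities are hypothetical values, not values from the embodiment, and the cosine-similarity reading of formula 1.4 from the earlier sketch is again an assumption.

```python
# Hypothetical topic word probabilities for "control" and "tool" (not values from the
# embodiment), combined with the cosine-similarity reading of formula 1.4 sketched earlier.
w_q = [0.62, 0.35]   # probabilities in the text to be replied (formula 1.1)
w_d = [0.71, 0.80]   # probabilities in the 4th candidate reply text (formula 1.2)

dot = sum(a * b for a, b in zip(w_q, w_d))
norm = (sum(a * a for a in w_q) ** 0.5) * (sum(b * b for b in w_d) ** 0.5)
print(dot / norm)    # assumed score(q, d) for the 4th candidate, roughly 0.95
```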
After determining the topic word similarity between the preprocessed text to be replied and each of the N candidate reply texts, the candidate reply text with the highest topic word similarity is selected as the reply text.
How to obtain the topic word prediction parameters will be further described below with reference to specific embodiments.
200 texts are selected from the dialogue database as training texts; the 200 texts contain 2008 non-stop words in total. Among these 2008 non-stop words, the topic word feature value of a non-stop word that is a topic word of its training text is marked as 1, and the topic word feature value of a non-stop word that is not a topic word of its training text is marked as 0. The feature vector of each of the 2008 terms in the training text in which it is located is determined. Logistic regression model learning is then performed according to the determined feature vectors and topic word feature values to obtain the topic word prediction parameters. Those skilled in the art will appreciate that the number of training texts and the number of non-stop words are related to the accuracy of the topic word prediction parameters: if fewer training texts and non-stop words are used to determine the topic word prediction parameters, the accuracy of topic word prediction for the text to be replied and the candidate reply texts is lower; if more training texts and non-stop words are used, the accuracy is higher.
Fig. 3 is a block diagram of a device provided according to an embodiment of the present invention. The apparatus shown in fig. 3 is capable of performing the steps of the method shown in fig. 1. As shown in fig. 3, the apparatus 300 includes: an acquisition unit 301, a first determination unit 302, a second determination unit 303, a similarity determination unit 304, and a third determination unit 305.
An obtaining unit 301, configured to obtain a to-be-replied text and N candidate reply texts, where each candidate reply text includes at least one non-stop word in the to-be-replied text.
A first determination unit 302, configured to determine the probability that each of M_i non-stop words is a topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts has in common with the text to be replied.
A second determination unit 303, configured to determine the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
A similarity determination unit 304, configured to determine the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
A third determining unit 305, configured to use, as the reply text of the text to be replied, a candidate reply text with the highest similarity to the topic word of the text to be replied among the N candidate reply texts.
The device shown in fig. 3 predicts, for the non-stop words shared by the text to be replied and each candidate reply text, the probability that each such word is a topic word, determines the topic word similarity between the text to be replied and each candidate reply text according to the predicted probabilities, and then selects the candidate reply text with the highest similarity to reply to the text to be replied. In this way, the topic of the reply text closely follows the topic of the text to be replied, and the probability that the topic of the reply text is unrelated to the topic of the text to be replied is greatly reduced.
Fig. 4 is a block diagram of another apparatus provided in accordance with an embodiment of the present invention. The apparatus shown in fig. 4 is capable of performing the steps of the method shown in fig. 1 or fig. 2. As shown in fig. 4, the apparatus 400 includes an acquisition unit 401, a first determination unit 402, a second determination unit 403, a similarity determination unit 404, and a third determination unit 405.
The obtaining unit 401 is configured to obtain a to-be-replied text and N candidate reply texts, where each candidate reply text includes at least one non-stop word in the to-be-replied text.
Specifically, the text input by the user and acquired by the obtaining unit 401 is the text to be replied. After acquiring the text to be replied, the obtaining unit 401 may preprocess it; the preprocessing includes word segmentation and stop word removal. Word segmentation is a common technique in the prior art and identifies the terms in a sentence. After segmenting the text to be replied, the obtaining unit 401 also performs stop word removal on the segmented text. Stop word removal refers to removing certain terms from the text; the removed terms are called stop words and are not essential for understanding the meaning of a sentence (for example, common function words). After preprocessing the text to be replied, the obtaining unit 401 obtains a text to be replied containing one or more non-stop words. Then, the obtaining unit 401 selects N candidate reply texts from the dialogue database according to the non-stop words in the text to be replied, wherein each candidate reply text contains at least one non-stop word of the text to be replied. The obtaining unit 401 also needs to perform word segmentation and stop word removal on the N retrieved candidate reply texts, in the same way as for the text to be replied, so the description is not repeated here. After performing word segmentation and stop word removal on the N candidate reply texts, the obtaining unit 401 obtains N candidate reply texts each containing one or more non-stop words, wherein each candidate reply text contains at least one non-stop word of the text to be replied.
A first determination unit 402, configured to determine the probability that each of M_i non-stop words is a topic word of the text to be replied obtained by the obtaining unit 401, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts obtained by the obtaining unit 401 has in common with the text to be replied.
It is to be understood that the ith candidate reply text is any one of the N candidate reply texts. Different candidate reply texts have different numbers of the same non-stop words from the text to be replied, but should contain at least one of the same non-stop words.
The first determination unit 402 is specifically configured to: determine the feature vector of each of the M_i non-stop words in the text to be replied; and determine, according to the feature vector of each of the M_i non-stop words in the text to be replied and the topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the text to be replied. More specifically, the first determination unit 402 may determine, using formula 1.1, the probability that any one of the M_i non-stop words is a topic word of the text to be replied.
A second determination unit 403, configured to determine the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
The second determination unit 403 is specifically configured to: determine the feature vector of each of the M_i non-stop words in the i-th candidate reply text; and determine, according to the feature vector of each of the M_i non-stop words in the i-th candidate reply text and the topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text. More specifically, the second determination unit 403 may determine, using formula 1.2, the probability that any one of the M_i non-stop words is a topic word of the i-th candidate reply text.
A similarity determination unit 404, configured to determine the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
The similarity determination unit 404 is specifically configured to determine, using formula 1.4, the topic word similarity between the i-th candidate reply text and the text to be replied.
The first determination unit 402, the second determination unit 403, and the similarity determination unit 404 may also be configured to perform the above operations on the remaining candidate reply texts among the N candidate reply texts, so as to obtain the topic word similarity between each of the N candidate reply texts and the text to be replied.
A third determining unit 405, configured to use, as the reply text of the text to be replied, a candidate reply text with the highest similarity to the topic word of the text to be replied among the N candidate reply texts.
Optionally, as an embodiment, the apparatus 400 may further include a training text obtaining unit 406, a training text determining unit 407, and a prediction parameter determining unit 408.
A training text obtaining unit 406, configured to obtain P training texts, where a topic word feature value corresponding to each non-stop word in each training text is determined, where the topic word feature value is used to indicate whether the corresponding non-stop word belongs to a topic word of the training text where the corresponding non-stop word is located.
A training text determining unit 407, configured to determine a feature vector of each non-stop word in each training text.
A prediction parameter determining unit 408, configured to perform logistic regression model learning according to the feature vector of each non-stop word in each training text and the topic word feature value of each non-stop word in each training text, and determine the topic word prediction parameter, where the feature vector of each non-stop word in each training text is an input item of the logistic regression model, and the topic word feature value of each non-stop word in each training text is an output item of the logistic regression model.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention, and therefore, the scope of the present invention shall be subject to the protection scope of the claims.

Claims (14)

1. A method of processing text, the method comprising:
acquiring a text to be replied and N candidate reply texts, wherein each candidate reply text comprises at least one non-stop word in the text to be replied;
determining the probability that each of M_i non-stop words is a topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts has in common with the text to be replied;
determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text;
determining the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text;
and taking the candidate reply text with the highest similarity with the topic word of the text to be replied in the N candidate reply texts as the reply text of the text to be replied.
2. The method of claim 1, wherein determining the probability that each of the M_i non-stop words is a topic word of the text to be replied comprises:
determining a feature vector of each of the M_i non-stop words in the text to be replied;
determining, according to the feature vector of each of the M_i non-stop words in the text to be replied and a topic word prediction parameter, the probability that each of the M_i non-stop words is a topic word of the text to be replied.
3. The method of claim 2, wherein the probability that any one of the M_i non-stop words is a topic word of the text to be replied is determined by the following formula:

p(w1) = 1 / (1 + exp(-(v · x_q(w1) + c)))

wherein x_q(w1) denotes the feature vector of a first term w1 in the text to be replied, the first term being any one of the M_i non-stop words, v denotes a weight vector, c denotes a constant, v and c are the topic word prediction parameters, and p(w1) denotes the probability that the first term is a topic word of the text to be replied.
4. The method of claim 1, wherein determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text comprises:
determining a feature vector of each of the M_i non-stop words in the i-th candidate reply text;
determining, according to the feature vector of each of the M_i non-stop words in the i-th candidate reply text and a topic word prediction parameter, the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
5. The method of claim 4, wherein the probability that any one of the M_i non-stop words is a topic word of the i-th candidate reply text is determined by the following formula:

p(w2) = 1 / (1 + exp(-(v · x_d(w2) + c)))

wherein x_d(w2) denotes the feature vector of a second term w2 in the i-th candidate reply text, the second term being any one of the M_i non-stop words, v denotes a weight vector, c denotes a constant, v and c are the topic word prediction parameters, and p(w2) denotes the probability that the second term is a topic word of the i-th candidate reply text.
6. The method of any one of claims 2-5, wherein the topic word prediction parameter is determined by:
obtaining P training texts, wherein a topic word characteristic value corresponding to each non-stop word in each training text is determined, wherein the topic word characteristic value is used for indicating whether the corresponding non-stop word belongs to a topic word of the training text in which the corresponding non-stop word is located;
determining a feature vector for the each non-stop word in the each training text;
performing logistic regression model learning according to the feature vector of each non-stop word in each training text and the topic word feature value of each non-stop word in each training text to determine the topic word prediction parameters, wherein the feature vector of each non-stop word in each training text is an input item of the logistic regression model, and the topic word feature value of each non-stop word in each training text is an output item of the logistic regression model.
7. The method of any one of claims 1-5, wherein the topic word similarity between the ith candidate reply text and the text to be replied is determined according to the following formula:
wherein W_q represents a vector formed by the probabilities that the M_i non-stop words respectively belong to the topic word of the text to be replied, W_d represents a vector formed by the probabilities that the M_i non-stop words respectively belong to the topic word of the ith candidate reply text, and score(q, d) represents the topic word similarity between the ith candidate reply text and the text to be replied.
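The formula image of claim 7 is not reproduced in this text. The sketch below therefore uses cosine similarity between the two probability vectors as one plausible instantiation of score(q, d); this is an assumption for illustration, not the formula claimed by the patent.

import math

def topic_word_similarity(w_q, w_d):
    # w_q: topic word probabilities of the M_i non-stop words in the text to be replied
    # w_d: topic word probabilities of the same words in the ith candidate reply text
    dot = sum(a * b for a, b in zip(w_q, w_d))
    norm_q = math.sqrt(sum(a * a for a in w_q))
    norm_d = math.sqrt(sum(b * b for b in w_d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)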
8. An apparatus for processing text, the apparatus comprising:
an acquisition unit, configured to acquire a text to be replied and N candidate reply texts, wherein each candidate reply text comprises at least one non-stop word of the text to be replied;
a first determination unit, configured to determine the probability that each of M_i non-stop words belongs to the topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words, in the ith candidate reply text of the N candidate reply texts, that are the same as non-stop words in the text to be replied;
a second determination unit, configured to determine the probability that each of the M_i non-stop words belongs to the topic word of the ith candidate reply text;
a similarity determination unit, configured to determine, according to the probability that each of the M_i non-stop words belongs to the topic word of the text to be replied and the probability that each of the M_i non-stop words belongs to the topic word of the ith candidate reply text, the topic word similarity between the ith candidate reply text and the text to be replied;
and a third determination unit, configured to take the candidate reply text, among the N candidate reply texts, with the highest topic word similarity to the text to be replied as the reply text of the text to be replied.
9. The device according to claim 8, wherein the first determination unit is specifically configured to: determine a feature vector of each of the M_i non-stop words in the text to be replied; and determine, according to the feature vector of each of the M_i non-stop words in the text to be replied and topic word prediction parameters, the probability that each of the M_i non-stop words belongs to the topic word of the text to be replied.
10. The device according to claim 9, wherein the first determination unit is specifically configured to determine, using the following formula, the probability that any one of the M_i non-stop words belongs to the topic word of the text to be replied:
p_l = 1 / (1 + exp(-(w · x_l + c)))
wherein x_l represents the feature vector of a first term in the text to be replied, the first term being any one of the M_i non-stop words, w represents a weight vector, c represents a constant, w and c are the topic word prediction parameters, and p_l represents the probability that the first term belongs to the topic word of the text to be replied.
11. The device according to claim 8, wherein the second determination unit is specifically configured to: determine a feature vector of each of the M_i non-stop words in the ith candidate reply text; and determine, according to the feature vector of each of the M_i non-stop words in the ith candidate reply text and the topic word prediction parameters, the probability that each of the M_i non-stop words belongs to the topic word of the ith candidate reply text.
12. The device according to claim 11, wherein the second determination unit is specifically configured to determine, using the following formula, the probability that any one of the M_i non-stop words belongs to the topic word of the ith candidate reply text:
p_r = 1 / (1 + exp(-(w · x_r + c)))
wherein x_r represents the feature vector of a second term in the ith candidate reply text, the second term being any one of the M_i non-stop words, w represents a weight vector, c represents a constant, w and c are the topic word prediction parameters, and p_r represents the probability that the second term belongs to the topic word of the ith candidate reply text.
13. The apparatus of any one of claims 9-12, wherein the apparatus further comprises:
a training text acquisition unit, configured to acquire P training texts, where a topic word feature value corresponding to each non-stop word in each training text is determined, where the topic word feature value is used to indicate whether the corresponding non-stop word belongs to a topic word of a training text in which the corresponding non-stop word is located;
a training text determining unit, configured to determine a feature vector of each non-stop word in each training text;
a prediction parameter determination unit, configured to perform logistic regression model learning according to the feature vector of each non-stop word in each training text and the topic word feature value of each non-stop word in each training text, and determine the topic word prediction parameter, where the feature vector of each non-stop word in each training text is an input item of the logistic regression model, and the topic word feature value of each non-stop word in each training text is an output item of the logistic regression model.
14. The device according to any one of claims 8 to 12, wherein the similarity determination unit is specifically configured to determine the topic word similarity between the ith candidate reply text and the text to be replied using the following formula:
wherein W_q represents a vector formed by the probabilities that the M_i non-stop words respectively belong to the topic word of the text to be replied, W_d represents a vector formed by the probabilities that the M_i non-stop words respectively belong to the topic word of the ith candidate reply text, and score(q, d) represents the topic word similarity between the ith candidate reply text and the text to be replied.
CN201410371035.6A 2014-07-31 2014-07-31 The method and apparatus for handling text Active CN105302790B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410371035.6A CN105302790B (en) 2014-07-31 2014-07-31 The method and apparatus for handling text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410371035.6A CN105302790B (en) 2014-07-31 2014-07-31 The method and apparatus for handling text

Publications (2)

Publication Number Publication Date
CN105302790A CN105302790A (en) 2016-02-03
CN105302790B true CN105302790B (en) 2018-06-26

Family

ID=55200069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410371035.6A Active CN105302790B (en) 2014-07-31 2014-07-31 The method and apparatus for handling text

Country Status (1)

Country Link
CN (1) CN105302790B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291754B (en) * 2016-04-01 2020-12-04 北京大学 News comment prediction method and news comment prediction system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8346701B2 (en) * 2009-01-23 2013-01-01 Microsoft Corporation Answer ranking in community question-answering sites
CN103229223A (en) * 2010-09-28 2013-07-31 国际商业机器公司 Providing answers to questions using multiple models to score candidate answers
US20120303614A1 (en) * 2011-05-23 2012-11-29 Microsoft Corporation Automating responses to information queries
CN103425640A (en) * 2012-05-14 2013-12-04 华为技术有限公司 Multimedia questioning-answering system and method
CN103577558B (en) * 2013-10-21 2017-04-26 北京奇虎科技有限公司 Device and method for optimizing search ranking of frequently asked question and answer pairs

Also Published As

Publication number Publication date
CN105302790A (en) 2016-02-03

Similar Documents

Publication Publication Date Title
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
US11669698B2 (en) Method and system for automatic formality classification
CN106528845B (en) Retrieval error correction method and device based on artificial intelligence
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
JP6398510B2 (en) Entity linking method and entity linking apparatus
EP2581843B1 (en) Bigram Suggestions
CN111767716B (en) Method and device for determining enterprise multi-level industry information and computer equipment
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
US20140351228A1 (en) Dialog system, redundant message removal method and redundant message removal program
CN109299228B (en) Computer-implemented text risk prediction method and device
WO2017112417A1 (en) Method and system for automatic formality transformation
CN104484380A (en) Personalized search method and personalized search device
CN103823849A (en) Method and device for acquiring entries
CN104133855A (en) Smart association method and device for input method
CN110347833B (en) Classification method for multi-round conversations
CN104778283A (en) User occupation classification method and system based on microblog
CN105512122A (en) Ordering method and ordering device for information retrieval system
Suryaningrum Comparison of the TF-IDF method with the count vectorizer to classify hate speech
CN105302790B (en) The method and apparatus for handling text
CN108763258B (en) Document theme parameter extraction method, product recommendation method, device and storage medium
EP3660699A1 (en) Method and system to extract domain concepts to create domain dictionaries and ontologies
KR102078541B1 (en) Issue interest based news value evaluation apparatus and method, storage media storing the same
CN113808709B (en) Psychological elasticity prediction method and system based on text analysis
CN109189893A (en) A kind of method and apparatus of automatically retrieval
CN110162614B (en) Question information extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant