CN105302790B - Method and apparatus for processing text - Google Patents


Info

Publication number
CN105302790B
CN105302790B
Authority
CN
China
Prior art keywords
text
word
replied
stop
topic
Prior art date
Legal status
Active
Application number
CN201410371035.6A
Other languages
Chinese (zh)
Other versions
CN105302790A (en)
Inventor
吉宗诚
吕正东
李航
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN201410371035.6A
Publication of CN105302790A
Application granted
Publication of CN105302790B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the present invention provides a method for processing text. The method includes: obtaining a text to be replied and N candidate reply texts; determining the probability that each of M_i non-stop words is a topic word of the text to be replied; determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text; determining the topic word similarity between the i-th candidate reply text and the text to be replied; and using the candidate reply text, among the N candidate reply texts, with the highest topic word similarity to the text to be replied as the reply text of the text to be replied. The above technical solution enables the topic of the reply text to closely follow the topic of the text to be replied, and substantially reduces the probability that the topic of the reply text is unrelated to the topic of the text to be replied.

Description

Method and apparatus for processing text
Technical Field
Embodiments of the present invention relate to the field of information technology, and in particular to a method and a device for processing text.
Background
Automatic dialogue is a popular research topic in the field of information technology. With automatic dialogue techniques, a user can carry out a man-machine dialogue. In the prior art, automatic dialogue is realized by information retrieval technology: after receiving the text to be replied sent by the user, a device such as a computer or a server retrieves a sentence from a dialogue database to serve as the reply text. However, when retrieving the reply text, the prior art only judges whether the reply text contains the same terms as the text to be replied. Although the reply text may include terms from the text to be replied, its content is not necessarily suitable.
Disclosure of Invention
The embodiment of the invention provides a method and a device for processing text, which can select a suitable text as the reply text according to the topic of the text to be replied input by a user, so that the reply text closely follows the topic of the text to be replied.
In a first aspect, an embodiment of the present invention provides a method for processing text, where the method includes: acquiring a text to be replied and N candidate reply texts, wherein each candidate reply text includes at least one non-stop word of the text to be replied; determining the probability that each of M_i non-stop words is a topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts has in common with the text to be replied; determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text; determining the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text; and using the candidate reply text, among the N candidate reply texts, with the highest topic word similarity to the text to be replied as the reply text of the text to be replied.
With reference to the first aspect, in a first possible implementation manner of the first aspect, determining the probability that each of the M_i non-stop words is a topic word of the text to be replied includes: determining the feature vector of each of the M_i non-stop words in the text to be replied; and determining, according to the feature vector of each of the M_i non-stop words in the text to be replied and topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the text to be replied.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the probability that any one of the M_i non-stop words is a topic word of the text to be replied is determined by the following formula:

p(w1) = 1 / (1 + exp(-(v · x_q(w1) + c)))

wherein x_q(w1) denotes the feature vector of a first term w1 in the text to be replied, the first term being any one of the M_i non-stop words, v denotes a weight vector, c denotes a constant, v and c are the topic word prediction parameters, and p(w1) denotes the probability that the first term is a topic word of the text to be replied.
With reference to the first aspect, in a third possible implementation manner of the first aspect, determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text includes: determining the feature vector of each of the M_i non-stop words in the i-th candidate reply text; and determining, according to the feature vector of each of the M_i non-stop words in the i-th candidate reply text and the topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, the probability that any one of the M_i non-stop words is a topic word of the i-th candidate reply text is determined by the following formula:

p(w2) = 1 / (1 + exp(-(v · x_d(w2) + c)))

wherein x_d(w2) denotes the feature vector of a second term w2 in the i-th candidate reply text, the second term being any one of the M_i non-stop words, v denotes a weight vector, c denotes a constant, v and c are the topic word prediction parameters, and p(w2) denotes the probability that the second term is a topic word of the i-th candidate reply text.
In a fifth possible implementation manner of the first aspect, the topic word prediction parameter is determined by: obtaining P training texts, wherein a topic word characteristic value corresponding to each non-stop word in each training text is determined, wherein the topic word characteristic value is used for indicating whether the corresponding non-stop word belongs to a topic word of the training text in which the corresponding non-stop word is located; determining a feature vector of each non-stop word in each training text; performing logistic regression model learning according to the feature vector of each non-stop word in each training text and the topic word feature value of each non-stop word in each training text to determine the topic word prediction parameter, wherein the feature vector of each non-stop word in each training text is an input item of the logistic regression model, and the topic word feature value of each non-stop word in each training text is an output item of the logistic regression model.
With reference to the first aspect or any one of the foregoing possible implementations of the first aspect, in a sixth possible implementation of the first aspect, the topic word similarity between the i-th candidate reply text and the text to be replied is determined from the vectors W_q and W_d,
wherein W_q denotes the vector formed by the probabilities that the M_i non-stop words are topic words of the text to be replied, W_d denotes the vector formed by the probabilities that the M_i non-stop words are topic words of the i-th candidate reply text, and score(q, d) denotes the topic word similarity between the i-th candidate reply text and the text to be replied.
In a second aspect, an embodiment of the present invention provides a device, including: an acquiring unit, configured to acquire a text to be replied and N candidate reply texts, wherein each candidate reply text includes at least one non-stop word of the text to be replied; a first determining unit, configured to determine the probability that each of M_i non-stop words is a topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts has in common with the text to be replied; a second determining unit, configured to determine the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text; a similarity determining unit, configured to determine the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text; and a third determining unit, configured to use the candidate reply text, among the N candidate reply texts, with the highest topic word similarity to the text to be replied as the reply text of the text to be replied.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the first determining unit is specifically configured to: determine the feature vector of each of the M_i non-stop words in the text to be replied; and determine, according to the feature vector of each of the M_i non-stop words in the text to be replied and topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the text to be replied.
With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the first determining unit is specifically configured to determine, using the following formula, the probability that any one of the M_i non-stop words is a topic word of the text to be replied:

p(w1) = 1 / (1 + exp(-(v · x_q(w1) + c)))

wherein x_q(w1) denotes the feature vector of a first term w1 in the text to be replied, the first term being any one of the M_i non-stop words, v denotes a weight vector, c denotes a constant, v and c are the topic word prediction parameters, and p(w1) denotes the probability that the first term is a topic word of the text to be replied.
With reference to the second aspect, in a third possible implementation manner of the second aspect, the second determining unit is specifically configured to: determine the feature vector of each of the M_i non-stop words in the i-th candidate reply text; and determine, according to the feature vector of each of the M_i non-stop words in the i-th candidate reply text and the topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
With reference to the third possible implementation manner of the second aspect, in a fourth possible implementation manner of the second aspect, the second determining unit is specifically configured to determine, using the following formula, the probability that any one of the M_i non-stop words is a topic word of the i-th candidate reply text:

p(w2) = 1 / (1 + exp(-(v · x_d(w2) + c)))

wherein x_d(w2) denotes the feature vector of a second term w2 in the i-th candidate reply text, the second term being any one of the M_i non-stop words, v denotes a weight vector, c denotes a constant, v and c are the topic word prediction parameters, and p(w2) denotes the probability that the second term is a topic word of the i-th candidate reply text.
With reference to any one possible implementation manner of the first possible implementation manner of the second aspect to the fourth possible implementation manner of the second aspect, in a fifth possible implementation manner of the second aspect, the apparatus further includes: a training text acquisition unit, configured to acquire P training texts, where a topic word feature value corresponding to each non-stop word in each training text is determined, where the topic word feature value is used to indicate whether the corresponding non-stop word belongs to a topic word of the training text in which the corresponding non-stop word is located; a training text determining unit, configured to determine a feature vector of each non-stop word in each training text; and the prediction parameter determining unit is used for performing logistic regression model learning according to the feature vector of each non-stop word in each training text and the topic word feature value of each non-stop word in each training text to determine the topic word prediction parameter, wherein the feature vector of each non-stop word in each training text is an input item of the logistic regression model, and the topic word feature value of each non-stop word in each training text is an output item of the logistic regression model.
With reference to the second aspect or any one of the foregoing possible implementation manners of the second aspect, in a sixth possible implementation manner of the second aspect, the similarity determining unit is specifically configured to determine the topic word similarity between the i-th candidate reply text and the text to be replied from the vectors W_q and W_d,
wherein W_q denotes the vector formed by the probabilities that the M_i non-stop words are topic words of the text to be replied, W_d denotes the vector formed by the probabilities that the M_i non-stop words are topic words of the i-th candidate reply text, and score(q, d) denotes the topic word similarity between the i-th candidate reply text and the text to be replied.
The above technical solution predicts, for the non-stop words shared by the text to be replied and each candidate reply text, the probability that each such word is a topic word, determines the topic word similarity between the text to be replied and each candidate reply text according to the predicted probabilities, and then selects the candidate reply text with the highest similarity to reply to the text to be replied. In this way, the topic of the reply text closely follows the topic of the text to be replied, and the probability that the topic of the reply text is unrelated to the topic of the text to be replied is greatly reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for processing text provided according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of another method for processing text provided by an embodiment of the invention.
Fig. 3 is a block diagram of a device provided according to an embodiment of the present invention.
Fig. 4 is a block diagram of another apparatus provided in accordance with an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Fig. 1 is a schematic flow chart of a method for processing text provided according to an embodiment of the present invention.
101, obtaining a text to be replied and N candidate reply texts, wherein each candidate reply text comprises at least one non-stop word in the text to be replied.
102, determining the probability that each of M_i non-stop words is a topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts has in common with the text to be replied.
103, determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
104, determining the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
105, taking the candidate reply text with the highest similarity with the topic word of the text to be replied in the N candidate reply texts as the reply text of the text to be replied.
The method shown in fig. 1 predicts, for the non-stop words shared by the text to be replied and each candidate reply text, the probability that each such word is a topic word, determines the topic word similarity between the text to be replied and each candidate reply text according to the predicted probabilities, and then selects the candidate reply text with the highest similarity to reply to the text to be replied. In this way, the topic of the reply text closely follows the topic of the text to be replied, and the probability that the topic of the reply text is unrelated to the topic of the text to be replied is greatly reduced.
Fig. 2 is a schematic flow chart of another method for processing text provided by an embodiment of the invention.
And 201, acquiring a text to be replied and N candidate reply texts, wherein each candidate reply text comprises at least one non-stop word in the text to be replied.
Specifically, the text to be replied is the text input by the user. After the text to be replied is obtained, it is preprocessed; the preprocessing includes word segmentation and stop word removal. Word segmentation is a common technique in the prior art and identifies the terms in a sentence; it may be implemented with existing tools (e.g., the open-source segmentation tool ICTCLAS). After segmenting the text to be replied, stop word removal is also performed on the segmented text. Stop word removal refers to removing certain terms from the text; the removed terms are called stop words and are not essential for understanding the meaning of a sentence (for example, common function words). In general, a stop word list may be stored, and when stop words are removed, the words belonging to the stop word list are deleted. After preprocessing, the text to be replied is a text containing one or more non-stop words. Then, N candidate reply texts are selected from a dialogue database according to the non-stop words in the text to be replied, wherein each candidate reply text contains at least one non-stop word of the text to be replied. The retrieval of candidate reply texts may be implemented with existing tools (e.g., the open-source tool Lucene). Word segmentation and stop word removal also need to be performed on the N retrieved candidate reply texts, in the same way as for the text to be replied, so the description is not repeated here.
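For illustration only, the following is a minimal sketch of this preprocessing and candidate-retrieval step. The tokenizer, the stop word list, and the toy dialogue database are assumptions made for the sketch; the embodiment itself relies on existing tools such as ICTCLAS for word segmentation and Lucene for retrieval.

```python
# Minimal sketch of preprocessing and candidate retrieval (illustrative only).
# A toy whitespace tokenizer and stop word list stand in for a real word segmenter
# (e.g. ICTCLAS) and a real retrieval engine (e.g. Lucene).

STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in", "not"}   # assumed toy stop word list

def preprocess(text):
    """Segment a text into terms and remove stop words."""
    terms = [t.strip(".,!?").lower() for t in text.split()]
    return [t for t in terms if t and t not in STOP_WORDS]

def retrieve_candidates(query_terms, dialogue_db, n):
    """Return up to n texts from the dialogue database that share at least one
    non-stop word with the text to be replied, ranked by simple term overlap."""
    scored = []
    for text in dialogue_db:
        overlap = len(set(query_terms) & set(preprocess(text)))
        if overlap > 0:
            scored.append((overlap, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:n]]

dialogue_db = ["A version control tool.", "Me too, many users are rookies.", "Nice weather today."]
query_terms = preprocess("Still use SVN as the code control tool, not GIT.")
print(retrieve_candidates(query_terms, dialogue_db, n=2))
```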
202, determining the probability that each of M_i non-stop words is a topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts has in common with the text to be replied.
It is to be understood that the ith candidate reply text is any one of the N candidate reply texts. Different candidate reply texts have different numbers of the same non-stop words from the text to be replied, but should contain at least one of the same non-stop words.
Specifically, determining the probability that each of the M_i non-stop words is a topic word of the text to be replied includes: determining the feature vector of each of the M_i non-stop words in the text to be replied; and determining, according to the feature vector of each of the M_i non-stop words in the text to be replied and the topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the text to be replied. Suppose the term w is any one of the M_i non-stop words; the following features of the term w in the text to be replied are extracted: the frequency with which the term w appears in the current text (i.e., the text to be replied); the inverse document frequency of the term w in the whole dialogue database; the number of sentences in the current text containing the term w; whether the term w appears in the first sentence of the current text; whether the term w appears in the last sentence of the current text; whether the term w is a named entity (i.e., a person name, place name, organization name, number word, or time word); whether the term w is a named entity in the first sentence of the current text; whether the term w is a named entity in the last sentence of the current text; and the part of speech of the term w. The feature vector of each of the M_i non-stop words in the text to be replied is composed of these features of that non-stop word in the text to be replied.
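As an illustration of how such a feature vector might be assembled, the sketch below builds the nine features listed above for one term. The helper inputs (the sentence list, document frequencies, named entity set, and part-of-speech tags) and the one-hot encoding of the part of speech are assumptions made for the sketch, not requirements of the embodiment.

```python
import math

def extract_features(term, sentences, db_size, doc_freq, named_entities, pos_tags):
    """Build the feature vector of `term` in one text. `sentences` is the text split
    into sentences, each sentence being a list of terms; `doc_freq`, `named_entities`
    and `pos_tags` stand in for resources (document frequencies, a named entity
    recognizer, a part-of-speech tagger) that the embodiment assumes to exist."""
    all_terms = [t for sent in sentences for t in sent]
    tf = all_terms.count(term)                              # frequency in the current text
    idf = math.log(db_size / doc_freq[term])                # inverse document frequency
    sf = sum(1 for sent in sentences if term in sent)       # sentences containing the term
    first = 1 if term in sentences[0] else 0                # appears in the first sentence
    last = 1 if term in sentences[-1] else 0                # appears in the last sentence
    ne = 1 if term in named_entities else 0                 # term is a named entity
    ne_first = 1 if ne and first else 0                     # named entity in the first sentence
    ne_last = 1 if ne and last else 0                       # named entity in the last sentence
    pos = pos_tags.get(term, "o")                           # part of speech: n / v / a / o
    pos_onehot = {"n": [1, 0, 0, 0], "v": [0, 1, 0, 0], "a": [0, 0, 1, 0]}.get(pos, [0, 0, 0, 1])
    return [tf, idf, sf, first, last, ne, ne_first, ne_last] + pos_onehot

sentences = [["svn", "code", "control", "tool"], ["svn", "simple"]]
print(extract_features("svn", sentences, db_size=1000,
                       doc_freq={"svn": 50}, named_entities=set(), pos_tags={"svn": "n"}))
```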
The probability that any one of the M_i non-stop words is a topic word of the text to be replied can be determined by the following formula:

p(w1) = 1 / (1 + exp(-(v · x_q(w1) + c)))    (formula 1.1)

wherein x_q(w1) denotes the feature vector of the first term w1 in the text to be replied, the first term being any one of the M_i non-stop words; v denotes a weight vector; c denotes a constant; v and c are the topic word prediction parameters; and p(w1) denotes the probability that the first term is a topic word of the text to be replied.
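A direct reading of formula 1.1 (and of formula 1.2 below, which has the same form) is the standard logistic-regression probability. The sketch below evaluates it for one feature vector; the weight vector and constant are placeholder values, since the real parameters come from the training procedure described later.

```python
import math

def topic_word_probability(x, v, c):
    """Formulas 1.1 / 1.2 read as a logistic regression: the probability that the term
    whose feature vector is x is a topic word of the text it was extracted from."""
    z = sum(vi * xi for vi, xi in zip(v, x)) + c
    return 1.0 / (1.0 + math.exp(-z))

x = [2, 3.0, 2, 1, 1, 0, 0, 0, 1, 0, 0, 0]                          # a feature vector as built above
v = [0.4, 0.3, 0.2, 0.5, 0.1, 0.6, 0.2, 0.1, 0.3, 0.0, -0.1, -0.2]  # placeholder weights
print(topic_word_probability(x, v, c=-3.0))                         # placeholder constant
```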
203, determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
Specifically, determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text includes: determining the feature vector of each of the M_i non-stop words in the i-th candidate reply text; and determining, according to the feature vector of each of the M_i non-stop words in the i-th candidate reply text and the topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text. Suppose the term w is any one of the M_i non-stop words; the following features of the term w in the i-th candidate reply text are extracted: the frequency with which the term w appears in the current text (i.e., the i-th candidate reply text); the inverse document frequency of the term w in the whole dialogue database; the number of sentences in the current text containing the term w; whether the term w appears in the first sentence of the current text; whether the term w appears in the last sentence of the current text; whether the term w is a named entity (i.e., a person name, place name, organization name, number word, or time word); whether the term w is a named entity in the first sentence of the current text; whether the term w is a named entity in the last sentence of the current text; and the part of speech of the term w. The feature vector of each of the M_i non-stop words in the i-th candidate reply text is composed of these features of that non-stop word in the i-th candidate reply text.
The probability that any one of the M_i non-stop words is a topic word of the i-th candidate reply text can be determined by the following formula:

p(w2) = 1 / (1 + exp(-(v · x_d(w2) + c)))    (formula 1.2)

wherein x_d(w2) denotes the feature vector of the second term w2 in the i-th candidate reply text, the second term being any one of the M_i non-stop words; v denotes a weight vector; c denotes a constant; v and c are the topic word prediction parameters; and p(w2) denotes the probability that the second term is a topic word of the i-th candidate reply text.
Optionally, as another embodiment, a feature vector of each non-stop word in the text to be replied may be calculated, and the probability that each non-stop word in the text to be replied is a topic word of the text to be replied may be calculated according to that feature vector and the topic word prediction parameters. Likewise, a feature vector of each non-stop word in the i-th candidate reply text may be calculated, and the probability that each non-stop word in the i-th candidate reply text is a topic word of the i-th candidate reply text may be calculated according to that feature vector and the topic word prediction parameters. Then, the M_i non-stop words that belong to both the text to be replied and the i-th candidate reply text are found, together with their topic word probabilities for the text to be replied and for the i-th candidate reply text.
Alternatively, as another embodiment, the probability that each non-stop word of each text in the dialogue database is a topic word of the text it belongs to may be pre-calculated and stored in a topic word probability database. The probability that each non-stop word in the i-th candidate reply text is a topic word of the i-th candidate reply text can then be retrieved directly from this topic word probability database.
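A sketch of this optional precomputation is given below. The helper functions passed in (for listing the non-stop words of a text, building its feature vectors, and applying the trained predictor) are illustrative names and are not part of the embodiment.

```python
def build_topic_word_probability_db(dialogue_db, non_stop_words_of, feature_vector_of, predict):
    """Precompute, for every text in the dialogue database, the topic word probability
    of each of its non-stop words (formula 1.2), so that at reply time the values can
    simply be looked up instead of recomputed."""
    table = {}
    for idx, text in enumerate(dialogue_db):
        for word in non_stop_words_of(text):
            table[(idx, word)] = predict(feature_vector_of(word, text))
    return table
```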
The topic word prediction parameters are determined by the following method: obtaining P training texts, wherein a topic word characteristic value corresponding to each non-stop word in each training text is determined, wherein the topic word characteristic value is used for indicating whether the corresponding non-stop word belongs to a topic word of the training text in which the corresponding non-stop word is located; determining a feature vector of each non-stop word in each training text; performing logistic regression model learning according to the feature vector of each non-stop word in each training text and the topic word feature value of each non-stop word in each training text to determine the topic word prediction parameter, wherein the feature vector of each non-stop word in each training text is an input item of the logistic regression model, and the topic word feature value of each non-stop word in each training text is an output item of the logistic regression model.
The P training texts are selected from the dialogue database, the topic word feature value of each non-stop word of the P training texts that is a topic word is marked as 1, the topic word feature value of each non-stop word that is not a topic word is marked as 0, and logistic regression model training is performed using the following formula:

f(x_w) = 1 / (1 + exp(-(v · x_w + c)))    (formula 1.3)

wherein f(x_w) denotes the topic word feature value of the non-stop word w in the text in which it is located, x_w denotes the feature vector of the non-stop word w in the text in which it is located, v denotes a weight vector, c denotes a constant, and v and c are the topic word prediction parameters obtained by training.
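As an illustration, the following sketch fits the logistic regression model of formula 1.3 with scikit-learn; the embodiment only requires logistic regression model learning and does not name a library, and the training rows below are made-up placeholders.

```python
from sklearn.linear_model import LogisticRegression

X = [  # one row per non-stop word in the P training texts: its feature vector
    [3, 2.1, 2, 1, 0, 1, 1, 0, 1, 0, 0, 0],
    [1, 0.4, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [2, 1.7, 2, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    [1, 0.2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1],
]
y = [1, 0, 1, 0]  # annotated topic word feature values (1 = topic word, 0 = not)

model = LogisticRegression()
model.fit(X, y)

v = model.coef_[0]       # weight vector of formula 1.3
c = model.intercept_[0]  # constant of formula 1.3
print(v, c)
print(model.predict_proba([X[0]])[0][1])  # probability that the first word is a topic word
```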
204, determining the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
Specifically, the topic word similarity between the text to be replied and the i-th candidate reply text can be determined with formula 1.4, which computes a score from the following two vectors:

W_q = (w_q,1, ..., w_q,t),  W_d = (w_d,1, ..., w_d,t)

wherein W_q denotes the vector formed by the probabilities that the M_i non-stop words are topic words of the text to be replied, W_d denotes the vector formed by the probabilities that the M_i non-stop words are topic words of the i-th candidate reply text, w_q,t denotes the probability that the t-th of the M_i non-stop words is a topic word of the text to be replied, w_d,t denotes the probability that the t-th of the M_i non-stop words is a topic word of the i-th candidate reply text, and score(q, d) denotes the topic word similarity between the text to be replied and the i-th candidate reply text.
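The exact expression of formula 1.4 is not reproduced in this text. Purely as an assumption for illustration, the sketch below uses the cosine similarity of W_q and W_d as the score; any other similarity between the two probability vectors could be substituted.

```python
import math

def topic_word_similarity(w_q, w_d):
    """Assumed instantiation of formula 1.4: cosine similarity between the topic word
    probability vector of the text to be replied (w_q) and that of the i-th candidate
    reply text (w_d)."""
    dot = sum(a * b for a, b in zip(w_q, w_d))
    norm_q = math.sqrt(sum(a * a for a in w_q))
    norm_d = math.sqrt(sum(b * b for b in w_d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)

print(topic_word_similarity([0.8, 0.3], [0.7, 0.6]))
```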
205, performing the process shown in steps 202 to 204 on the remaining candidate reply texts among the N candidate reply texts, and determining the topic word similarity between each of the N candidate reply texts and the text to be replied.
And 206, taking the candidate reply text with the highest similarity with the topic word of the text to be replied in the N candidate reply texts as the reply text of the text to be replied.
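Taken together, steps 202 to 206 amount to scoring every candidate and taking the one with the highest score, as in the sketch below. The callables passed in stand for formulas 1.1, 1.2 and 1.4 and for the computation of the shared non-stop words; their names are illustrative and not part of the embodiment.

```python
def choose_reply(text_to_reply, candidates, shared_words, prob_in_query, prob_in_candidate, similarity):
    """Score every candidate reply text by its topic word similarity to the text to be
    replied (steps 202-205) and return the highest-scoring one (step 206).
    `prob_in_query`, `prob_in_candidate` and `similarity` stand for formulas 1.1, 1.2
    and 1.4; `shared_words(i)` returns the M_i non-stop words shared by the text to be
    replied and the i-th candidate."""
    best_score, best_reply = float("-inf"), None
    for i, candidate in enumerate(candidates):
        words = shared_words(i)
        w_q = [prob_in_query(w, text_to_reply) for w in words]
        w_d = [prob_in_candidate(w, candidate) for w in words]
        score = similarity(w_q, w_d)
        if score > best_score:
            best_score, best_reply = score, candidate
    return best_reply
```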
The method shown in fig. 2 predicts, for the non-stop words shared by the text to be replied and each candidate reply text, the probability that each such word is a topic word, determines the topic word similarity between the text to be replied and each candidate reply text according to the predicted probabilities, and then selects the candidate reply text with the highest similarity to reply to the text to be replied. In this way, the topic of the reply text closely follows the topic of the text to be replied, and the probability that the topic of the reply text is unrelated to the topic of the text to be replied is greatly reduced.
The present invention will be further described below with reference to a specific example. This example is a specific embodiment of the methods shown in fig. 1 and fig. 2; it is intended to aid understanding of the invention and is not intended to be limiting.
Assume that the text input by the user is: "When choosing a code control tool, still use SVN and do not use GIT, even though GIT is much stronger than SVN. If there is a vegetable bird (i.e., a rookie) in the team, he may annoy you to death with all kinds of GIT problems. The advantage of SVN is that it is extremely simple, and vegetable birds can also master it quickly." This text is the text to be replied, which needs to be replied to.
The result after word segmentation of the text to be replied is: "select" "code" "control" "tool" "when" "still" "use" "SVN" "not" "want" "use" "GIT" "though" "GIT" "stronger" "than" "SVN" "a lot" "because" "if" "team" "interior" "has" "one" "vegetable bird" "he" "various" "about" "GIT" "problem" "will" "annoy" "dead" "you" "SVN" "advantage" "is" "extremely" "simple" "vegetable bird" "also" "can" "quickly" "master".
For the segmented text to be replied, the result after stop word removal is: "select" "code" "control" "SVN" "GIT" "strong" "team" "interior" "one" "vegetable bird" "he" "various" "problems" "troublesome" "dead" "advantages" "simple" "fast" "grasp". As can be seen, the preprocessed text to be replied contains K non-stop words in total; in this example, K is 19.
N candidate reply texts are then retrieved from the dialogue database according to the non-stop words in the text to be replied. Specifically, texts that contain non-stop words of the text to be replied can be selected from the dialogue database as candidate reply texts. For example, text 1 is "I am a secretor, a vegetable bird; born after 90, striving not to become an old vegetable bird." Since text 1 contains the term "vegetable bird", text 1 may be one of the N candidate reply texts. When the dialogue database contains too many texts that share terms with the preprocessed text to be replied, the N candidate reply texts may be selected according to the weights of the terms; for example, the value of N may be 30. The retrieval of candidate reply texts may be implemented with existing tools (e.g., the open-source tool Lucene).
Table 1 shows a portion of the retrieved candidate reply texts.
Number  Content
1       I am a secretor, a vegetable bird; born after 90, striving not to become an old vegetable bird.
2       Me too, many users are still vegetable birds.
3       Take good care, go slowly; this vegetable bird is myself.
4       A version control tool.
TABLE 1
As can be seen, each candidate reply text in Table 1 contains at least one non-stop word of the text to be replied.
Word segmentation and stop word removal are then performed on each of the N candidate reply texts.
For each of the M_i non-stop words, features are extracted in the text to be replied to obtain the feature vector of that non-stop word in the text to be replied, wherein the M_i non-stop words are the non-stop words shared by the text to be replied and the i-th candidate reply text. Suppose the term w is any one of the M_i non-stop words; the following features of the term w in the text to be replied are extracted: the frequency with which the term w appears in the current text (i.e., the text to be replied); the inverse document frequency of the term w in the whole dialogue database; the number of sentences in the current text containing the term w; whether the term w appears in the first sentence of the current text; whether the term w appears in the last sentence of the current text; whether the term w is a named entity (i.e., a person name, place name, organization name, number word, or time word); whether the term w is a named entity in the first sentence of the current text; whether the term w is a named entity in the last sentence of the current text; and the part of speech of the term w.
The inverse document frequency of the term w in the whole dialogue database can be determined by the following formula:

idf(w) = log(E / df)

wherein E denotes the number of texts in the whole dialogue database, df denotes the number of texts in the whole dialogue database that contain the term w, and idf(w) denotes the inverse document frequency of the term w in the whole dialogue database.
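A toy computation of this quantity, with an assumed in-memory dialogue database, might look as follows:

```python
import math

def inverse_document_frequency(term, dialogue_db):
    """idf(w) = log(E / df): E texts in the dialogue database, df of them contain `term`."""
    E = len(dialogue_db)
    df = sum(1 for text in dialogue_db if term in text.split())
    return math.log(E / df) if df > 0 else 0.0

dialogue_db = ["a version control tool", "svn is simple", "many users are rookies", "svn or git"]
print(inverse_document_frequency("svn", dialogue_db))  # log(4 / 2)
```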
Assume that the i-th candidate reply text is the 4th candidate reply text in Table 1. Then the M_i non-stop words are "control" and "tool". The feature vectors of these two non-stop words in the text to be replied are shown in Table 2.
TABLE 2
The meanings of TF, IDF, SF, First, Last, NE_First, NE_Last and POS are shown in Table 3.

TF        frequency of the term in the current text
IDF       inverse document frequency of the term in the whole dialogue database
SF        number of sentences in the current text containing the term
First     whether the term appears in the first sentence of the current text
Last      whether the term appears in the last sentence of the current text
NE_First  whether the term is a named entity in the first sentence of the current text
NE_Last   whether the term is a named entity in the last sentence of the current text
POS       part of speech of the term
TABLE 3

In First, Last, NE_First and NE_Last, a value of 1 means yes and a value of 0 means no. In POS, "n" denotes a noun, "v" denotes a verb, "a" denotes an adjective, and "o" denotes other parts of speech. In a specific implementation, four numerical values may be used to represent the four parts of speech in POS; for example, nouns, verbs, adjectives, and other words may be represented by 1000, 0100, 0010, and 0001, respectively.
Similarly, the feature vectors of "control" and "tool" in the 4 th candidate reply text in Table 1 may be determined.
After determining the feature vectors of "control" and "tool" in the text to be replied and in the 4th candidate reply text, the probabilities that "control" and "tool" are topic words of the text to be replied can be determined using formula 1.1, and the probabilities that "control" and "tool" are topic words of the 4th candidate reply text can be determined using formula 1.2. Then the topic word similarity between the 4th candidate reply text and the text to be replied is determined using formula 1.4. Specifically, when formula 1.4 is used, W_d = (w_d,1, w_d,2) and W_q = (w_q,1, w_q,2), wherein w_d,1 denotes the probability that "control" is a topic word of the 4th candidate reply text, w_d,2 denotes the probability that "tool" is a topic word of the 4th candidate reply text, w_q,1 denotes the probability that "control" is a topic word of the text to be replied, and w_q,2 denotes the probability that "tool" is a topic word of the text to be replied. It can be seen that the topic word probabilities of the same non-stop word occupy the same position in the two vectors.
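For concreteness, a worked numeric example is given below. The probabilities are hypothetical values, not values from the embodiment, and the cosine-similarity reading of formula 1.4 from the earlier sketch is again an assumption.

```python
# Hypothetical topic word probabilities for "control" and "tool" (not values from the
# embodiment), combined with the cosine-similarity reading of formula 1.4 sketched earlier.
w_q = [0.62, 0.35]   # probabilities in the text to be replied (formula 1.1)
w_d = [0.71, 0.80]   # probabilities in the 4th candidate reply text (formula 1.2)

dot = sum(a * b for a, b in zip(w_q, w_d))
norm = (sum(a * a for a in w_q) ** 0.5) * (sum(b * b for b in w_d) ** 0.5)
print(dot / norm)    # assumed score(q, d) for the 4th candidate, roughly 0.95
```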
After determining the topic word similarity between the preprocessed text to be replied and each of the N candidate reply texts, the candidate reply text with the highest topic word similarity is selected as the reply text.
How to obtain the topic word prediction parameters will be further described below with reference to specific embodiments.
200 texts are selected from the dialogue database as training texts; the 200 texts contain 2008 non-stop words in total. Among these 2008 non-stop words, the topic word feature value of a non-stop word that is a topic word of its training text is marked as 1, and the topic word feature value of a non-stop word that is not a topic word of its training text is marked as 0. The feature vector of each of the 2008 terms in the training text in which it is located is determined. Logistic regression model learning is then performed according to the determined feature vectors and topic word feature values to obtain the topic word prediction parameters. Those skilled in the art will appreciate that the number of training texts and the number of non-stop words are related to the accuracy of the topic word prediction parameters: if fewer training texts and non-stop words are used to determine the topic word prediction parameters, the accuracy of topic word prediction for the text to be replied and the candidate reply texts is lower; if more training texts and non-stop words are used, the accuracy is higher.
Fig. 3 is a block diagram of a device provided according to an embodiment of the present invention. The apparatus shown in fig. 3 is capable of performing the steps of the method shown in fig. 1. As shown in fig. 3, the apparatus 300 includes: an acquisition unit 301, a first determination unit 302, a second determination unit 303, a similarity determination unit 304, and a third determination unit 305.
An obtaining unit 301, configured to obtain a to-be-replied text and N candidate reply texts, where each candidate reply text includes at least one non-stop word in the to-be-replied text.
A first determination unit 302, configured to determine the probability that each of M_i non-stop words is a topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts has in common with the text to be replied.
A second determination unit 303, configured to determine the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
A similarity determination unit 304, configured to determine the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
A third determining unit 305, configured to use, as the reply text of the text to be replied, a candidate reply text with the highest similarity to the topic word of the text to be replied among the N candidate reply texts.
The device shown in fig. 3 predicts, for the non-stop words shared by the text to be replied and each candidate reply text, the probability that each such word is a topic word, determines the topic word similarity between the text to be replied and each candidate reply text according to the predicted probabilities, and then selects the candidate reply text with the highest similarity to reply to the text to be replied. In this way, the topic of the reply text closely follows the topic of the text to be replied, and the probability that the topic of the reply text is unrelated to the topic of the text to be replied is greatly reduced.
Fig. 4 is a block diagram of another apparatus provided in accordance with an embodiment of the present invention. The apparatus shown in fig. 4 is capable of performing the steps of the method shown in fig. 1 or fig. 2. As shown in fig. 4, the apparatus 400 includes an acquisition unit 401, a first determination unit 402, a second determination unit 403, a similarity determination unit 404, and a third determination unit 405.
The obtaining unit 401 is configured to obtain a to-be-replied text and N candidate reply texts, where each candidate reply text includes at least one non-stop word in the to-be-replied text.
Specifically, the text input by the user and acquired by the obtaining unit 401 is the text to be replied. After acquiring the text to be replied, the obtaining unit 401 may preprocess it; the preprocessing includes word segmentation and stop word removal. Word segmentation is a common technique in the prior art and identifies the terms in a sentence. After segmenting the text to be replied, the obtaining unit 401 also performs stop word removal on the segmented text. Stop word removal refers to removing certain terms from the text; the removed terms are called stop words and are not essential for understanding the meaning of a sentence (for example, common function words). After preprocessing the text to be replied, the obtaining unit 401 obtains a text to be replied containing one or more non-stop words. Then, the obtaining unit 401 selects N candidate reply texts from the dialogue database according to the non-stop words in the text to be replied, wherein each candidate reply text contains at least one non-stop word of the text to be replied. The obtaining unit 401 also needs to perform word segmentation and stop word removal on the N retrieved candidate reply texts, in the same way as for the text to be replied, so the description is not repeated here. After performing word segmentation and stop word removal on the N candidate reply texts, the obtaining unit 401 obtains N candidate reply texts each containing one or more non-stop words, wherein each candidate reply text contains at least one non-stop word of the text to be replied.
A first determination unit 402, configured to determine the probability that each of M_i non-stop words is a topic word of the text to be replied obtained by the obtaining unit 401, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts obtained by the obtaining unit 401 has in common with the text to be replied.
It is to be understood that the ith candidate reply text is any one of the N candidate reply texts. Different candidate reply texts have different numbers of the same non-stop words from the text to be replied, but should contain at least one of the same non-stop words.
The first determination unit 402 is specifically configured to: determine the feature vector of each of the M_i non-stop words in the text to be replied; and determine, according to the feature vector of each of the M_i non-stop words in the text to be replied and the topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the text to be replied. More specifically, the first determination unit 402 may determine, using formula 1.1, the probability that any one of the M_i non-stop words is a topic word of the text to be replied.
A second determination unit 403, configured to determine the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
The second determination unit 403 is specifically configured to: determine the feature vector of each of the M_i non-stop words in the i-th candidate reply text; and determine, according to the feature vector of each of the M_i non-stop words in the i-th candidate reply text and the topic word prediction parameters, the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text. More specifically, the second determination unit 403 may determine, using formula 1.2, the probability that any one of the M_i non-stop words is a topic word of the i-th candidate reply text.
A similarity determination unit 404, configured to determine the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
The similarity determination unit 404 is specifically configured to determine, using formula 1.4, the topic word similarity between the i-th candidate reply text and the text to be replied.
The first determination unit 402, the second determination unit 403, and the similarity determination unit 404 may also be configured to perform the above operations on the remaining candidate reply texts among the N candidate reply texts, so as to obtain the topic word similarity between each of the N candidate reply texts and the text to be replied.
A third determining unit 405, configured to use, as the reply text of the text to be replied, a candidate reply text with the highest similarity to the topic word of the text to be replied among the N candidate reply texts.
Optionally, as an embodiment, the apparatus 400 may further include a training text obtaining unit 406, a training text determining unit 407, and a prediction parameter determining unit 408.
A training text obtaining unit 406, configured to obtain P training texts, where a topic word feature value corresponding to each non-stop word in each training text is determined, where the topic word feature value is used to indicate whether the corresponding non-stop word belongs to a topic word of the training text where the corresponding non-stop word is located.
A training text determining unit 407, configured to determine a feature vector of each non-stop word in each training text.
A prediction parameter determining unit 408, configured to perform logistic regression model learning according to the feature vector of each non-stop word in each training text and the topic word feature value of each non-stop word in each training text, and determine the topic word prediction parameter, where the feature vector of each non-stop word in each training text is an input item of the logistic regression model, and the topic word feature value of each non-stop word in each training text is an output item of the logistic regression model.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention, and therefore, the scope of the present invention shall be subject to the protection scope of the claims.

Claims (14)

1. A method of processing text, the method comprising:
acquiring a text to be replied and N candidate reply texts, wherein each candidate reply text comprises at least one non-stop word in the text to be replied;
determining the probability that each of M_i non-stop words is a topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words that the i-th candidate reply text among the N candidate reply texts has in common with the text to be replied;
determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text;
determining the topic word similarity between the i-th candidate reply text and the text to be replied according to the probability that each of the M_i non-stop words is a topic word of the text to be replied and the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text;
and taking the candidate reply text with the highest similarity with the topic word of the text to be replied in the N candidate reply texts as the reply text of the text to be replied.
2. The method of claim 1, wherein determining the probability that each of the M_i non-stop words is a topic word of the text to be replied comprises:
determining a feature vector of each of the M_i non-stop words in the text to be replied;
determining, according to the feature vector of each of the M_i non-stop words in the text to be replied and a topic word prediction parameter, the probability that each of the M_i non-stop words is a topic word of the text to be replied.
3. The method of claim 2, wherein the probability that any one of the M_i non-stop words is a topic word of the text to be replied is determined by the following formula:

p(w1) = 1 / (1 + exp(-(v · x_q(w1) + c)))

wherein x_q(w1) denotes the feature vector of a first term w1 in the text to be replied, the first term being any one of the M_i non-stop words, v denotes a weight vector, c denotes a constant, v and c are the topic word prediction parameters, and p(w1) denotes the probability that the first term is a topic word of the text to be replied.
4. The method of claim 1, wherein determining the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text comprises:
determining a feature vector of each of the M_i non-stop words in the i-th candidate reply text;
determining, according to the feature vector of each of the M_i non-stop words in the i-th candidate reply text and a topic word prediction parameter, the probability that each of the M_i non-stop words is a topic word of the i-th candidate reply text.
5. The method of claim 4, wherein the probability that any one of the M_i non-stop words is a topic word of the i-th candidate reply text is determined by the following formula:

p(w2) = 1 / (1 + exp(-(v · x_d(w2) + c)))

wherein x_d(w2) denotes the feature vector of a second term w2 in the i-th candidate reply text, the second term being any one of the M_i non-stop words, v denotes a weight vector, c denotes a constant, v and c are the topic word prediction parameters, and p(w2) denotes the probability that the second term is a topic word of the i-th candidate reply text.
6. The method of any one of claims 2-5, wherein the topic word prediction parameter is determined by:
obtaining P training texts, wherein a topic word characteristic value corresponding to each non-stop word in each training text is determined, wherein the topic word characteristic value is used for indicating whether the corresponding non-stop word belongs to a topic word of the training text in which the corresponding non-stop word is located;
determining a feature vector for the each non-stop word in the each training text;
performing logistic regression model learning according to the feature vector of each non-stop word in each training text and the topic word feature value of each non-stop word in each training text to determine the topic word prediction parameters, wherein the feature vector of each non-stop word in each training text is an input item of the logistic regression model, and the topic word feature value of each non-stop word in each training text is an output item of the logistic regression model.
7. The method of any one of claims 1-5, wherein the topic word similarity between the ith candidate reply text and the text to be replied is determined according to the following formula:
wherein W_q represents a vector formed by the probabilities that the M_i non-stop words respectively belong to the topic word of the text to be replied, W_d represents a vector formed by the probabilities that the M_i non-stop words respectively belong to the topic word of the ith candidate reply text, and score(q, d) represents the topic word similarity between the ith candidate reply text and the text to be replied.
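The formula image of claim 7 is not reproduced in this text. The sketch below therefore uses cosine similarity between the two probability vectors as one plausible instantiation of score(q, d); this is an assumption for illustration, not the formula claimed by the patent.

import math

def topic_word_similarity(w_q, w_d):
    # w_q: topic word probabilities of the M_i non-stop words in the text to be replied
    # w_d: topic word probabilities of the same words in the ith candidate reply text
    dot = sum(a * b for a, b in zip(w_q, w_d))
    norm_q = math.sqrt(sum(a * a for a in w_q))
    norm_d = math.sqrt(sum(b * b for b in w_d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)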
8. An apparatus for processing text, the apparatus comprising:
an acquisition unit, configured to acquire a text to be replied and N candidate reply texts, wherein each candidate reply text comprises at least one non-stop word of the text to be replied;
a first determination unit, configured to determine the probability that each of M_i non-stop words belongs to the topic word of the text to be replied, wherein the M_i non-stop words are the non-stop words, in the ith candidate reply text of the N candidate reply texts, that are the same as non-stop words in the text to be replied;
a second determination unit, configured to determine the probability that each of the M_i non-stop words belongs to the topic word of the ith candidate reply text;
a similarity determination unit, configured to determine, according to the probability that each of the M_i non-stop words belongs to the topic word of the text to be replied and the probability that each of the M_i non-stop words belongs to the topic word of the ith candidate reply text, the topic word similarity between the ith candidate reply text and the text to be replied;
and a third determination unit, configured to take the candidate reply text, among the N candidate reply texts, with the highest topic word similarity to the text to be replied as the reply text of the text to be replied.
9. The device according to claim 8, wherein the first determination unit is specifically configured to: determine a feature vector of each of the M_i non-stop words in the text to be replied; and determine, according to the feature vector of each of the M_i non-stop words in the text to be replied and topic word prediction parameters, the probability that each of the M_i non-stop words belongs to the topic word of the text to be replied.
10. The device according to claim 9, wherein the first determination unit is specifically configured to determine, using the following formula, the probability that any one of the M_i non-stop words belongs to the topic word of the text to be replied:
p_l = 1 / (1 + exp(-(w · x_l + c)))
wherein x_l represents the feature vector of a first term in the text to be replied, the first term being any one of the M_i non-stop words, w represents a weight vector, c represents a constant, w and c are the topic word prediction parameters, and p_l represents the probability that the first term belongs to the topic word of the text to be replied.
11. The device according to claim 8, wherein the second determination unit is specifically configured to: determine a feature vector of each of the M_i non-stop words in the ith candidate reply text; and determine, according to the feature vector of each of the M_i non-stop words in the ith candidate reply text and the topic word prediction parameters, the probability that each of the M_i non-stop words belongs to the topic word of the ith candidate reply text.
12. The device according to claim 11, wherein the second determination unit is specifically configured to determine, using the following formula, the probability that any one of the M_i non-stop words belongs to the topic word of the ith candidate reply text:
p_r = 1 / (1 + exp(-(w · x_r + c)))
wherein x_r represents the feature vector of a second term in the ith candidate reply text, the second term being any one of the M_i non-stop words, w represents a weight vector, c represents a constant, w and c are the topic word prediction parameters, and p_r represents the probability that the second term belongs to the topic word of the ith candidate reply text.
13. The apparatus of any one of claims 9-12, wherein the apparatus further comprises:
a training text acquisition unit, configured to acquire P training texts, where a topic word feature value corresponding to each non-stop word in each training text is determined, where the topic word feature value is used to indicate whether the corresponding non-stop word belongs to a topic word of a training text in which the corresponding non-stop word is located;
a training text determining unit, configured to determine a feature vector of each non-stop word in each training text;
a prediction parameter determination unit, configured to perform logistic regression model learning according to the feature vector of each non-stop word in each training text and the topic word feature value of each non-stop word in each training text, and determine the topic word prediction parameter, where the feature vector of each non-stop word in each training text is an input item of the logistic regression model, and the topic word feature value of each non-stop word in each training text is an output item of the logistic regression model.
14. The device according to any one of claims 8 to 12, wherein the similarity determination unit is specifically configured to determine the topic word similarity between the ith candidate reply text and the text to be replied using the following formula:
wherein W_q represents a vector formed by the probabilities that the M_i non-stop words respectively belong to the topic word of the text to be replied, W_d represents a vector formed by the probabilities that the M_i non-stop words respectively belong to the topic word of the ith candidate reply text, and score(q, d) represents the topic word similarity between the ith candidate reply text and the text to be replied.
CN201410371035.6A 2014-07-31 2014-07-31 The method and apparatus for handling text Active CN105302790B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410371035.6A CN105302790B (en) 2014-07-31 2014-07-31 The method and apparatus for handling text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410371035.6A CN105302790B (en) 2014-07-31 2014-07-31 The method and apparatus for handling text

Publications (2)

Publication Number Publication Date
CN105302790A CN105302790A (en) 2016-02-03
CN105302790B true CN105302790B (en) 2018-06-26

Family

ID=55200069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410371035.6A Active CN105302790B (en) 2014-07-31 2014-07-31 The method and apparatus for handling text

Country Status (1)

Country Link
CN (1) CN105302790B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291754B (en) * 2016-04-01 2020-12-04 北京大学 News comment prediction method and news comment prediction system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8346701B2 (en) * 2009-01-23 2013-01-01 Microsoft Corporation Answer ranking in community question-answering sites
CN103229223A (en) * 2010-09-28 2013-07-31 国际商业机器公司 Providing answers to questions using multiple models to score candidate answers
US20120303614A1 (en) * 2011-05-23 2012-11-29 Microsoft Corporation Automating responses to information queries
CN103425640A (en) * 2012-05-14 2013-12-04 华为技术有限公司 Multimedia questioning-answering system and method
CN103577558B (en) * 2013-10-21 2017-04-26 北京奇虎科技有限公司 Device and method for optimizing search ranking of frequently asked question and answer pairs

Also Published As

Publication number Publication date
CN105302790A (en) 2016-02-03

Similar Documents

Publication Publication Date Title
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
US11669698B2 (en) Method and system for automatic formality classification
CN106528845B (en) Retrieval error correction method and device based on artificial intelligence
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
JP6398510B2 (en) Entity linking method and entity linking apparatus
EP2581843B1 (en) Bigram Suggestions
CN111767716B (en) Method and device for determining enterprise multi-level industry information and computer equipment
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
US20140351228A1 (en) Dialog system, redundant message removal method and redundant message removal program
CN109299228B (en) Computer-implemented text risk prediction method and device
WO2017112417A1 (en) Method and system for automatic formality transformation
CN104484380A (en) Personalized search method and personalized search device
CN103823849A (en) Method and device for acquiring entries
CN104133855A (en) Smart association method and device for input method
CN110347833B (en) Classification method for multi-round conversations
CN104778283A (en) User occupation classification method and system based on microblog
CN105512122A (en) Ordering method and ordering device for information retrieval system
Suryaningrum Comparison of the TF-IDF method with the count vectorizer to classify hate speech
CN105302790B (en) The method and apparatus for handling text
CN108763258B (en) Document theme parameter extraction method, product recommendation method, device and storage medium
EP3660699A1 (en) Method and system to extract domain concepts to create domain dictionaries and ontologies
KR102078541B1 (en) Issue interest based news value evaluation apparatus and method, storage media storing the same
CN113808709B (en) Psychological elasticity prediction method and system based on text analysis
CN109189893A (en) A kind of method and apparatus of automatically retrieval
CN110162614B (en) Question information extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant