CN106980624A

CN106980624A - Method and apparatus for processing text data

Info

Publication number: CN106980624A
Application number: CN201610031796.6A
Authority: CN
Inventors: 江会星; 孙健; 初敏
Original assignee: Alibaba Group Holding Ltd
Current assignee: Taobao China Software Co Ltd
Priority date: 2016-01-18
Filing date: 2016-01-18
Publication date: 2017-07-25
Anticipated expiration: 2036-01-18
Also published as: EP3405912A1; WO2017127296A1; US10176804B2; EP3405912A4; CN106980624B; US20170206897A1

Abstract

An embodiment of the present application provides a method and apparatus for processing text data. The method includes: obtaining first text data; judging whether the first text data is suitable for analogy; if so, extracting a first entity word from the first text data; performing analogy on the first entity word to obtain a second entity word; and generating second text data according to the second entity word. The embodiment builds word vectors directly from a large amount of unlabeled text to produce analogy answers, so no knowledge base needs to be built, which reduces the consumption of manpower and material resources and lowers cost. Instead of replying directly with the exact relation between the two entities, it replies by analogy, which improves coverage and the success rate of replies to analogy questions.

Description

A method and apparatus for processing text data

Technical field

The present application relates to the technical field of text processing, and in particular to a method for processing text data and an apparatus for processing text data.

Background

With the development of science and technology, the demand for computers to give intelligent voice or text responses is becoming increasingly widespread, and many intelligent chat robots have appeared one after another.

In voice or text responses, analogy questions are relatively common, for example, "What is the relationship between Xiao Ming and Xiao Hong?"

At present, intelligent chat robots generally derive the similarity or analogy relation between two entities based on RDF (Resource Description Framework), so as to answer analogy questions.

Answering a question about the relation between two entities based on an RDF knowledge base requires building a complete RDF knowledge base in advance.

Building an RDF knowledge base generally requires iterating over three steps: mining relation templates, cleaning encyclopedia-type data, and extracting relations. This consumes a large amount of manpower and material resources and is costly, yet coverage remains low, so the success rate of replies to analogy questions is low.

For example, if some crawled gossip news states that "Liu Dehua and Cheng Long are close buddies", then information such as Liu Dehua, Cheng Long, and the relation "close buddies" is recorded in the RDF knowledge base.

If the question "What is the relationship between Liu Dehua and Cheng Long?" is received from a user, the relation "close buddies" is found in the RDF knowledge base, and the answer "close buddies" is given.

If the gossip news was not crawled beforehand, the question cannot be answered, and the system may sidestep it with a reply such as "What relationship is it?"

In addition, RDF-based replies are in a question-and-answer style. In a chat system, they may fail to produce an answer, and they sometimes lack humanlike, humorous expressiveness.

Summary of the invention

In view of the above problems, the embodiments of the present application are proposed to provide a method for processing text data, and a corresponding apparatus for processing text data, that overcome the above problems or at least partially solve them.

To solve the above problems, an embodiment of the present application discloses a method for processing text data, including:

Obtaining first text data;

Judging whether the first text data is suitable for analogy; if so, extracting a first entity word from the first text data;

Performing analogy on the first entity word to obtain a second entity word;

Generating second text data according to the second entity word.

Preferably, the step of judging whether the first text data is suitable for analogy includes:

Performing word segmentation on the first text data to obtain a plurality of first text segmented words;

Matching the plurality of first text segmented words of the first text data against preset analogy question templates;

When the match succeeds, determining that the first text data is suitable for analogy.

Preferably, the step of performing analogy on the first entity word to obtain a second entity word includes:

When there is one first entity word, searching for one or more first candidate entity words similar to the first entity word;

Screening, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;

Selecting one or more second entity words from the one or more second candidate entity words.

Preferably, the step of searching for one or more first candidate entity words similar to the first entity word includes:

Querying a first word vector of the first entity word and one or more second word vectors of one or more first candidate entity words;

Computing one or more first similarities based on the first word vector and the one or more second word vectors;

Extracting the one or more first candidate entity words with the highest first similarity as the one or more first candidate entity words similar to the first entity word.

Preferably, the step of performing analogy on the first entity word to obtain a second entity word includes:

When the first entity word includes a first sub-entity word and a second sub-entity word, searching for one or more third candidate entity words similar to the first sub-entity word;

Screening, from the one or more third candidate entity words, one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;

Computing one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words;

Screening, from the one or more fifth candidate entity words, one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;

Choosing the second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words.

Preferably, the step of searching for one or more third candidate entity words similar to the first sub-entity word includes:

Querying a third word vector of the first sub-entity word and one or more fourth word vectors of one or more third candidate entity words;

Computing one or more second similarities based on the third word vector and the one or more fourth word vectors;

Extracting the one or more third candidate entity words with the highest second similarity as the one or more third candidate entity words similar to the first sub-entity word.

Preferably, the step of computing one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words includes:

Querying the third word vector of the first sub-entity word, the one or more fourth word vectors of the one or more fourth candidate entity words, and a fifth word vector of the second sub-entity word;

Starting from the third word vector, subtracting the fifth word vector and adding the fourth word vector to obtain a sixth word vector;

When a seventh word vector of some entity word is nearest to the sixth word vector, confirming that entity word as a fifth candidate entity word.
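As a rough illustration of the vector step described above, the following Python sketch computes the sixth word vector and picks the nearest entity word. The toy two-dimensional vectors, the function names, and the use of cosine similarity as the measure of nearness are assumptions made for this example; the text only states that the nearest vector is chosen.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def sixth_word_vector(third, fifth, fourth):
    """third - fifth + fourth, in the order of operations stated in the text."""
    return [t - f + c for t, f, c in zip(third, fifth, fourth)]

def nearest_entity_word(target, vectors, exclude=()):
    """Return the entity word whose vector is most similar to the target vector."""
    best_word, best_sim = None, -2.0
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = cosine_similarity(target, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Hypothetical toy vectors; real word vectors would come from a trained model.
TOY_VECTORS = {
    "a": [1.0, 0.0],   # first sub-entity word (third word vector)
    "b": [0.0, 1.0],   # second sub-entity word (fifth word vector)
    "c": [2.0, 0.5],   # a fourth candidate entity word (fourth word vector)
    "x": [3.0, -0.4],
    "y": [-1.0, 2.0],
}

v6 = sixth_word_vector(TOY_VECTORS["a"], TOY_VECTORS["b"], TOY_VECTORS["c"])
fifth_candidate = nearest_entity_word(v6, TOY_VECTORS, exclude=("a", "b", "c"))
```

Excluding the input words from the search is a design choice of this sketch, so that the result is a new entity word and not one of the three inputs.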

Preferably, the step of choosing the second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words includes:

Computing a first distance based on the third word vector of the first sub-entity word and the fourth word vector of the fourth candidate entity word;

Computing a second distance based on the seventh word vector and the sixth word vector of the sixth candidate entity word;

Computing a score for the fourth candidate entity word and the sixth candidate entity word using the first distance and the second distance;

Choosing the fourth candidate entity word and the sixth candidate entity word with the highest score as the second entity word.

Preferably, the step of generating second text data according to the second entity word includes:

Searching for an analogy answer template that belongs to the same relation type as the analogy question template;

Embedding the second entity word into the analogy answer template to obtain the second text data.
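The template step might look like the following minimal sketch. The template strings, the relation type names, and the function `generate_second_text` are hypothetical; the actual answer templates are not given in this section.

```python
# Hypothetical analogy answer templates, keyed by relation type.
# Curly-brace slots mark the positions reserved for entity words.
ANSWER_TEMPLATES = {
    "friend": "{arg1}'s good friend is {answer}.",
    "relationship": "{arg1} and {arg2} are like {answer1} and {answer2}.",
}

def generate_second_text(relation_type, **entity_words):
    """Look up the answer template for the relation type and fill its slots."""
    template = ANSWER_TEMPLATES.get(relation_type)
    if template is None:
        return None  # no answer template is paired with this relation type
    return template.format(**entity_words)

reply = generate_second_text("friend", arg1="desk lamp", answer="wall sticker")
```

Keying the templates by relation type mirrors the pairing of question and answer templates described above: the relation type of the matched question template selects the answer template.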

Preferably, the method further includes:

When first speech data sent by a client is received, converting the first speech data into the first text data;

Converting the second text data into second speech data;

Returning the second speech data to the client.

An embodiment of the present application also discloses an apparatus for processing text data, including:

A first text data acquisition module, configured to obtain first text data;

An analogy intent judgment module, configured to judge whether the first text data is suitable for analogy, and if so, to call an entity word extraction module;

The entity word extraction module, configured to extract a first entity word from the first text data;

An entity word analogy module, configured to perform analogy on the first entity word to obtain a second entity word;

A second text data generation module, configured to generate second text data according to the second entity word.

Preferably, the analogy intent judgment module includes:

A word segmentation submodule, configured to perform word segmentation on the first text data to obtain a plurality of first text segmented words;

An analogy question template matching submodule, configured to match the plurality of first text segmented words of the first text data against preset analogy question templates;

An analogy intent determination submodule, configured to determine, when the match succeeds, that the first text data is suitable for analogy.

Preferably, the entity word analogy module includes:

A first candidate entity word search submodule, configured to search, when there is one first entity word, for one or more first candidate entity words similar to the first entity word;

A second candidate entity word screening submodule, configured to screen, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;

A second entity word selection submodule, configured to select one or more second entity words from the one or more second candidate entity words.

Preferably, the first candidate entity word search submodule includes:

A first vector query unit, configured to query a first word vector of the first entity word and one or more second word vectors of one or more first candidate entity words;

A first similarity calculation unit, configured to compute one or more first similarities based on the first word vector and the one or more second word vectors;

A first candidate entity word extraction unit, configured to extract the one or more first candidate entity words with the highest first similarity as the one or more first candidate entity words similar to the first entity word.

Preferably, the entity word analogy module includes:

A third candidate entity word search submodule, configured to search, when the first entity word includes a first sub-entity word and a second sub-entity word, for one or more third candidate entity words similar to the first sub-entity word;

A fourth candidate entity word screening submodule, configured to screen, from the one or more third candidate entity words, one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;

A fifth candidate entity word calculation submodule, configured to compute one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words;

A sixth candidate entity word screening submodule, configured to screen, from the one or more fifth candidate entity words, one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;

A second entity word choosing submodule, configured to choose the second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words.

Preferably, the third candidate entity word search submodule includes:

A second vector query unit, configured to query a third word vector of the first sub-entity word and one or more fourth word vectors of one or more third candidate entity words;

A second similarity calculation unit, configured to compute one or more second similarities based on the third word vector and the one or more fourth word vectors;

A third candidate entity word extraction unit, configured to extract the one or more third candidate entity words with the highest second similarity as the one or more third candidate entity words similar to the first sub-entity word.

Preferably, the fifth candidate entity word calculation submodule includes:

A third vector query unit, configured to query the third word vector of the first sub-entity word, the one or more fourth word vectors of the one or more fourth candidate entity words, and a fifth word vector of the second sub-entity word;

A vector calculation unit, configured to start from the third word vector, subtract the fifth word vector, and add the fourth word vector to obtain a sixth word vector;

A fifth candidate entity word determination unit, configured to confirm, when a seventh word vector of some entity word is nearest to the sixth word vector, that entity word as a fifth candidate entity word.

Preferably, the second entity word choosing submodule includes:

A first distance calculation unit, configured to compute a first distance based on the third word vector of the first sub-entity word and the fourth word vector of the fourth candidate entity word;

A second distance calculation unit, configured to compute a second distance based on the seventh word vector and the sixth word vector of the sixth candidate entity word;

A score calculation unit, configured to compute a score for the fourth candidate entity word and the sixth candidate entity word using the first distance and the second distance;

A choosing unit, configured to choose the fourth candidate entity word and the sixth candidate entity word with the highest score as the second entity word.

Preferably, the second text data generation module includes:

An analogy answer template search submodule, configured to search for an analogy answer template that belongs to the same relation type as the analogy question template;

An analogy answer template embedding submodule, configured to embed the second entity word into the analogy answer template to obtain the second text data.

Preferably, the apparatus further includes:

A text conversion module, configured to convert, when first speech data sent by a client is received, the first speech data into the first text data;

A speech conversion module, configured to convert the second text data into second speech data;

A speech return module, configured to return the second speech data to the client.

The embodiments of the present application have the following advantages:

When an embodiment of the present application confirms that the first text data has analogy intent, it performs analogy on the first entity word of the first text data to obtain a second entity word, and then generates second text data. Word vectors are built directly from a large amount of unlabeled text to produce analogy answers, so no knowledge base needs to be built, which reduces the consumption of manpower and material resources and lowers cost. Instead of replying directly with the exact relation between the two entities, the reply is given by analogy, which improves coverage and the success rate of replies to analogy questions.

Brief description of the drawings

Fig. 1 is a flowchart of the steps of an embodiment of a method for processing text data according to the present application;

Fig. 2A and Fig. 2B are example diagrams of analogy question templates according to an embodiment of the present application;

Fig. 3 is a structural diagram of a CBOW model according to an embodiment of the present application;

Fig. 4 is a structural block diagram of an embodiment of an apparatus for processing text data according to the present application.

Detailed description of the embodiments

To make the above objects, features, and advantages of the present application more apparent and easier to understand, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.

Referring to Fig. 1, a flowchart of the steps of an embodiment of a method for processing text data according to the present application is shown. The method may specifically include the following steps:

Step 101, obtain first text data;

It should be noted that the embodiments of the present application can be applied in artificial intelligence applications such as chat robots and voice assistants.

The artificial intelligence application can be deployed locally on a terminal, for example, a mobile phone, a tablet computer, or a smart wearable device (such as a bracelet, watch, or glasses). It can also be deployed in the cloud or on a server, for example, in a distributed system. The embodiments of the present application place no limitation on this.

If the application is deployed in the cloud, the first text data sent by a client can be received directly.

Alternatively,

When first speech data sent by a client is received, automatic speech recognition (ASR) can be performed on the first speech data to convert it into the first text data.

In a specific implementation, a speech recognition system generally consists of the following basic modules:

1. Signal processing and feature extraction module. The main task of this module is to extract features from the speech data for the acoustic model to process. It generally also includes some signal processing techniques that reduce, as far as possible, the effect on the features of factors such as environmental noise, channel, and speaker.

2. Acoustic model. Speech recognition systems mostly use models based on first-order hidden Markov models (HMMs).

3. Pronunciation dictionary. The pronunciation dictionary contains the vocabulary that the speech recognition system can handle, together with its pronunciations. In effect, the pronunciation dictionary provides the mapping between the acoustic model and the language model.

4. Language model. The language model models the language targeted by the speech recognition system. In theory, various language models, including regular languages and context-free grammars, can serve as the language model, but what most systems currently use is still the statistics-based N-gram and its variants.

5. Decoder. The decoder is one of the cores of the speech recognition system. Its task is, for an input signal, to search for the word string that can output the signal with maximum probability, according to the acoustic model, the language model, and the dictionary.
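The decoder's task described above is often written in the following standard form. This formulation is not stated in the source and is offered only as a reading aid; X denotes the sequence of extracted features and W a candidate word string.

```latex
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)
```

Under this reading, P(X | W) would come from the acoustic model together with the pronunciation dictionary, and P(W) from the language model.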

Step 102, judge whether the first text data is suitable for analogy; if so, perform step 103;

Analogy means comparing two different objects (or two classes of objects): given that the two objects (or classes) are similar in a series of attributes, and that one of them is known to have certain other attributes, it is inferred that the other object also has similar other attributes.

In the embodiments of the present application, the first text data can be a question, such as "Who is the desk lamp's good friend?" or "What is the relationship between Liu Dehua and Cheng Long?", which can be answered by analogy.

In one embodiment of the present application, step 102 may include the following sub-steps:

Sub-step S11, perform word segmentation on the first text data to obtain a plurality of first text segmented words;

In the embodiments of the present application, word segmentation can be performed in one or more of the following ways:

1. Segmentation based on string matching: the Chinese character string to be analyzed is matched, according to a certain strategy, against the entries in a preset machine dictionary. If a string is found in the dictionary, the match succeeds (a word is identified).

2. Segmentation based on feature scanning or marker cutting: some words with obvious features are identified and cut out of the string to be analyzed first. With these words as breakpoints, the original string can be divided into smaller strings that then undergo mechanical segmentation, which reduces the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: rich part-of-speech information helps segmentation decisions, and the segmentation result is in turn checked and adjusted during tagging, which improves segmentation accuracy.

3. Segmentation based on understanding: the computer simulates a person's understanding of a sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis at the same time as segmentation, and to use syntactic and semantic information to resolve ambiguity. It generally includes three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem can obtain syntactic and semantic information about words, sentences, and so on to resolve segmentation ambiguity; that is, it simulates how a person understands a sentence.

4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters co-occur adjacently reflects well how likely they are to form a word. The frequency of each combination of adjacently co-occurring characters in a corpus can therefore be counted to compute their mutual information, along with the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects how tightly the characters are bound together. When this tightness exceeds a certain threshold, the character group can be considered to possibly form a word.
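Two of the approaches listed above, dictionary-based string matching (item 1) and the statistical approach (item 4), can be sketched briefly in Python. The forward maximum matching strategy, the pointwise mutual information estimate, the toy dictionary and corpus, and the Chinese rendering of the desk lamp question are all assumptions chosen for illustration; the text does not prescribe a particular strategy or formula.

```python
import math
from collections import Counter

def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary segmentation: take the longest dictionary entry
    starting at each position, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

def pointwise_mutual_information(corpus, x, y):
    """Estimate how strongly characters x and y are bound when adjacent,
    using character and adjacent-pair counts over a list of strings."""
    chars = Counter(ch for line in corpus for ch in line)
    pairs = Counter(line[i:i + 2] for line in corpus for i in range(len(line) - 1))
    if pairs[x + y] == 0:
        return float("-inf")  # the pair never co-occurs adjacently
    p_xy = pairs[x + y] / sum(pairs.values())
    p_x = chars[x] / sum(chars.values())
    p_y = chars[y] / sum(chars.values())
    return math.log(p_xy / (p_x * p_y))

# Hypothetical dictionary and an assumed Chinese form of the desk lamp question.
TOY_DICTIONARY = {"台灯", "好朋友", "朋友"}
segmented = forward_max_match("台灯的好朋友是谁", TOY_DICTIONARY)

# Hypothetical four-line corpus for the statistical estimate.
TOY_CORPUS = ["台灯", "台灯", "台风", "灯光"]
score = pointwise_mutual_information(TOY_CORPUS, "台", "灯")
```

In the statistical variant, a character pair whose score exceeds a chosen threshold would be treated as a likely word; the threshold itself is a tuning choice that the text leaves open.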

Of course, the above word segmentation methods are only examples. When implementing the embodiments of the present application, other word segmentation methods can be set according to actual conditions, and the embodiments of the present application place no limitation on this. In addition to the above methods, those skilled in the art can also adopt other word segmentation methods according to actual needs, and the embodiments of the present application place no limitation on this either.

Sub-step S12, match the plurality of first text segmented words of the first text data against preset analogy question templates;

Sub-step S13, when the match succeeds, determine that the first text data is suitable for analogy.

With the embodiments of the present application, paired analogy question templates and analogy answer templates can be provided for one or more relation types (i.e., analogy pattern frames).

An analogy question template contains the basic structure of a question (text) suitable for analogy.

An analogy answer template has the basic structure of an answer to the question, with positions reserved for entity words.

The analogy question templates and analogy answer templates are persisted in a text file with a custom structure and are loaded into memory for matching.

In a specific implementation, a context-free grammar (CFG) parser can be used to match the analogy question templates.

A formal grammar G = (N, Σ, P, S) is called context-free if all of its production rules take the form V -> w, where V ∈ N and w ∈ (N ∪ Σ)*.

A context-free grammar is so named because the symbol V can always be freely replaced by the string w, regardless of the context in which V appears.

A formal language is context-free if it is generated by a context-free grammar.

If the first text segmented words obtained by segmentation match a preset analogy question template, the first text data can be considered suitable for analogy.

Taking the inanimate-object relation as an example of a relation type, in the analogy question template shown in Fig. 2A, arg1 represents an entity word, and the template has the basic question structure "'s" (the possessive particle), "good", "friend/buddy", "is", "who".

For "Who is the desk lamp's good friend?", segmentation yields "desk lamp", "'s", "good friend", "is", "who", which matches the analogy question template shown in Fig. 2A, so the question can be considered to have analogy intent.

Taking the gossip relation as an example of a relation type, in the analogy question template shown in Fig. 2B, arg1 and arg2 represent entity words, and the template has the basic question structure "and", "is", "what", "relationship".

For "What is the relationship between Liu Dehua and Cheng Long?", segmentation yields "Liu Dehua", "and", "Cheng Long", "is", "what", "relationship", which matches the analogy question template shown in Fig. 2B, so the question can be considered to have analogy intent.
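The two matching examples above can be reproduced with a small Python sketch. It uses plain slot-by-slot list matching as a stand-in for the CFG parser mentioned earlier, which is a deliberate simplification. The template contents, the relation type names, the English token "'s" standing for the possessive particle, and the function names are assumptions made for illustration.

```python
# Hypothetical analogy question templates, keyed by relation type.
# Items starting with "arg" are entity word slots; the rest are fixed words.
QUESTION_TEMPLATES = {
    "friend": ["arg1", "'s", "good friend", "is", "who"],
    "relationship": ["arg1", "and", "arg2", "is", "what", "relationship"],
}

def match_template(tokens, template):
    """Return the slot bindings if the tokens fit the template, else None."""
    if len(tokens) != len(template):
        return None
    slots = {}
    for token, item in zip(tokens, template):
        if item.startswith("arg"):
            slots[item] = token      # entity word position
        elif token != item:
            return None              # fixed word does not match
    return slots

def detect_analogy_intent(tokens):
    """Try every template; return (relation_type, slots) on the first match."""
    for relation_type, template in QUESTION_TEMPLATES.items():
        slots = match_template(tokens, template)
        if slots is not None:
            return relation_type, slots
    return None

result = detect_analogy_intent(
    ["Liu Dehua", "and", "Cheng Long", "is", "what", "relationship"]
)
```

A successful match yields both the relation type, which later selects the answer template, and the slot contents, which are the candidate entity words for step 103.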

Step 103, extract a first entity word from the first text data;

An entity word can correspond to a specific individual.

It should be noted that the first entity word, second entity word, first sub-entity word, second sub-entity word, first candidate entity word, second candidate entity word, third candidate entity word, fourth candidate entity word, fifth candidate entity word, and sixth candidate entity word are names for different processing states; in essence, they are all entity words.

In the celebrity category, entity words can be Liu Dehua, Zhang Baizhi, Lin Qingxia, and so on.

In addition, entity words can also include some broad terms that represent a category, such as person, film star, singer, and so on.

For example, for "Who is the desk lamp's good friend?", the entity word is "desk lamp".

As another example, for "What is the relationship between Liu Dehua and Cheng Long?", the entity words are "Liu Dehua" and "Cheng Long".

Step 104, perform analogy on the first entity word to obtain a second entity word;

In the embodiments of the present application, other entity words with similar attributes are derived from certain attributes of an entity word; for example, a similar second entity word is derived from the first entity word.

In a specific implementation, data can be crawled in advance to train a word2vec (word to vector) model, and analogy is performed on the first entity word through the word2vec model to obtain the second entity word.

The word2vec model is a tool that converts the words in the training data into vector form. It can convert a word into a 200-dimensional word vector, and the words (including entity words) can be stored in a hash table.

Through this conversion, the processing of text content can be reduced to vector operations in a vector space, and similarity in the vector space is computed to represent semantic similarity between texts.

The training data can be obtained by crawling web pages with a spider and cleaning the data to obtain clean titles and body content.

In practical applications, the data can include two parts:

1. Web data;

This data is largely stable. Accumulated data was used (all encyclopedia data and about one year of other web pages that have detail pages), in the form of text data;

2. News data;

A window covering roughly the most recent half year is maintained and updated daily; it can include all news data with titles and body text.

This part of the data is mainly for handling "relations" that change dynamically in the world, such as friendships and marriages between people. Therefore, the word2vec model needs to be trained with up-to-date news corpora that reflect these changes.

The CBOW (Continuous Bag-of-Words) model of word2vec is used. As shown in Fig. 3, the CBOW model consists of an input layer, a projection layer, and an output layer. The vector representation of the current word w(t) is predicted from the n words before w(t) and the n words after it (n = 4). This approach brings the vector representations of words with the same semantics or the same patterns closer together.
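The context window used by CBOW can be illustrated with a short Python sketch that builds (context, target) training pairs with n = 4 words on each side. It covers only the windowing step, not the network or its training, and the function name and toy token list are assumptions made for this example.

```python
def cbow_training_pairs(tokens, n=4):
    """For each position t, pair the up-to-n words before and after w(t)
    with w(t) itself. Windows are truncated at the sequence boundaries."""
    pairs = []
    for t, target in enumerate(tokens):
        before = tokens[max(0, t - n):t]
        after = tokens[t + 1:t + 1 + n]
        pairs.append((before + after, target))
    return pairs

# Hypothetical token sequence standing in for a segmented sentence.
TOY_TOKENS = list("abcdefghij")
pairs = cbow_training_pairs(TOY_TOKENS)
```

Each pair corresponds to one prediction task: the model sees the context words on the input side and is trained to predict the target word, which is how words that occur in similar contexts end up with nearby vectors.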

In one embodiment of the present application, step 104 may include the following sub-steps:

Sub-step S21, when there is one first entity word, search for one or more first candidate entity words similar to the first entity word;

In a specific implementation, for the case where the question has only one entity word, a first word vector of the first entity word and one or more second word vectors of one or more first candidate entity words can be queried;

One or more first similarities are computed based on the first word vector and the one or more second word vectors;

The one or more first candidate entity words with the highest first similarity are extracted as the one or more first candidate entity words similar to the first entity word.

Specifically, the distance tool of word2vec can compute the cosine distance between the converted vectors to represent the similarity of the vectors (words).

For example, given the input "france", the distance tool computes and displays the words closest to "france", as follows:

Word	Cosine distance
spain	0.678515
belgium	0.665923
netherlands	0.652428
italy	0.633130
switzerland	0.622323
luxembourg	0.610033
portugal	0.577154
russia	0.571507
germany	0.563291
catalonia	0.534176
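A ranking like the one in the table can be produced with a few lines of Python. The sketch below stores word vectors in a dictionary (a hash table) and ranks the other words by cosine similarity to the query. The three-dimensional toy vectors and function names are assumptions; real word2vec vectors would have around 200 dimensions, and the scores here are unrelated to the values in the table.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors, top_k=3):
    """Rank every other word by cosine similarity to the query word."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy vectors keyed by word.
TOY_VECTORS = {
    "france": [1.0, 0.2, 0.0],
    "spain": [0.9, 0.3, 0.1],
    "belgium": [0.8, 0.1, 0.3],
    "banana": [0.0, 0.1, 1.0],
}

top_two = [word for word, _ in most_similar("france", TOY_VECTORS, top_k=2)]
```

Note that in the table above larger values indicate closer words, so the column behaves like a cosine similarity even though it is labeled as a distance; the sketch follows the same convention.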

Sub-step S22, screen, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;

In the embodiments of the present application, the answer is produced by analogy to the question, and in general the type of the entity word in the question is consistent with the type of the entity word in the answer.

For example, for "desk lamp", entity words of the same entity word type include "wall sticker", "LED", "TV cabinet", and so on.

Sub-step S23, select one or more second entity words from the one or more second candidate entity words.

In a specific implementation, one or more second entity words can be selected for the answer from the entity words that remain after screening by entity word type.
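Sub-steps S22 and S23 can be sketched as follows. The entity type table, the type names, and the random choice used for the final selection are assumptions made for this example; the text does not say how the type of an entity word is looked up or how the final selection is made.

```python
import random

# Hypothetical lookup from entity word to entity word type.
ENTITY_TYPES = {
    "desk lamp": "home furnishing",
    "wall sticker": "home furnishing",
    "LED": "home furnishing",
    "TV cabinet": "home furnishing",
    "Ice Rain": "song",
}

def screen_by_type(first_entity, candidates, entity_types):
    """Keep only candidates whose type equals that of the first entity word."""
    wanted = entity_types.get(first_entity)
    if wanted is None:
        return []  # unknown type: nothing can be confirmed as the same type
    return [c for c in candidates if entity_types.get(c) == wanted]

def select_second_entity_words(candidates, count=1, rng=None):
    """Pick up to `count` second entity words (random choice is an assumption)."""
    rng = rng or random.Random()
    return rng.sample(candidates, min(count, len(candidates)))

first_candidates = ["wall sticker", "LED", "Ice Rain", "TV cabinet"]
second_candidates = screen_by_type("desk lamp", first_candidates, ENTITY_TYPES)
chosen = select_second_entity_words(second_candidates, count=1, rng=random.Random(0))
```

Choosing at random among the screened candidates would let the same question receive different playful answers over time; picking the most similar candidate is an equally plausible alternative.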

In another embodiment of the present application, step 104 may include the following sub-steps:

Sub-step S31: when the first entity word includes a first sub-entity word and a second sub-entity word, search for one or more third candidate entity words similar to the first sub-entity word;

When the question contains multiple first entity words, for example two, then for ease of expression the embodiment of the present application refers to them, in order of appearance, as the first sub-entity word, the second sub-entity word, and so on.

For example, for "What is the relationship between Liu Dehua and Cheng Long", the first sub-entity word is "Liu Dehua" and the second sub-entity word is "Cheng Long".

In a specific implementation, the third word vector of the first sub-entity word and the one or more fourth word vectors of one or more third candidate entity words can be looked up in the word2vec model;

One or more second similarities are calculated from the third word vector and the one or more fourth word vectors, for example by cosine similarity;

The one or more third candidate entity words with the highest second similarity are extracted as the one or more third candidate entity words similar to the first sub-entity word.

Conversely, third candidate entity words with a lower second similarity are screened out.

For example, for "What is the relationship between Liu Dehua and Cheng Long", N (N is a positive integer) third candidate entity words similar to the first sub-entity word "Liu Dehua" can be calculated, e.g. "Huang Rihua", "Miao Qiaowei", "Wang Lihong", "Lost and Love" and "Ice Rain". The most similar one or more third candidate entity words are then extracted from these N words, e.g. "Miao Qiaowei", "Huang Rihua", "Wang Lihong" and "Ice Rain", and "Lost and Love" is screened out.

Sub-step S32: from the one or more third candidate entity words, screen out one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;

To make the result of screening by entity word type easy to refer to, the entity words screened from the third candidate entity words are called fourth candidate entity words.

For example, for "Liu Dehua" the entity word type is star; therefore "Ice Rain", whose entity word type is song, can be screened out of "Miao Qiaowei", "Huang Rihua", "Wang Lihong" and "Ice Rain", retaining "Miao Qiaowei", "Huang Rihua" and "Wang Lihong", whose entity word type is likewise star.

Sub-step S33: calculate one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word and the one or more fourth candidate entity words;

In a specific implementation, the entity word can be calculated as D = A - B + C, where A is the first sub-entity word, B is the second sub-entity word, C is a fourth candidate entity word, and D is a fifth candidate entity word.

Specifically, the third word vector of the first sub-entity word, the one or more fourth word vectors of the one or more fourth candidate entity words, and the fifth word vector of the second sub-entity word can be looked up.

The fifth word vector is subtracted from the third word vector and the fourth word vector is added, giving the sixth word vector.

When the seventh word vector of some entity word is nearest to the sixth word vector, that entity word is confirmed as a fifth candidate entity word.

For example, suppose the first sub-entity word is "Liu Dehua", the second sub-entity word is "Cheng Long", and the fourth candidate entity words are "Miao Qiaowei", "Huang Rihua" and "Wang Lihong".

In one case, the fifth word vector of "Cheng Long" can be subtracted from the third word vector of "Liu Dehua" and the fourth word vector of "Miao Qiaowei" added, giving a sixth word vector; if the seventh word vector of "wireless" is nearest to this sixth word vector, "wireless" can be confirmed as a fifth candidate entity word.

In another case, the fifth word vector of "Cheng Long" can be subtracted from the third word vector of "Liu Dehua" and the fourth word vector of "Huang Rihua" added, giving a sixth word vector; if the seventh word vector of "Liang Chaowei" is nearest to this sixth word vector, "Liang Chaowei" can be confirmed as a fifth candidate entity word.

In yet another case, the fifth word vector of "Cheng Long" can be subtracted from the third word vector of "Liu Dehua" and the fourth word vector of "Wang Lihong" added, giving a sixth word vector; if the seventh word vector of "Zhou Jielun" is nearest to this sixth word vector, "Zhou Jielun" can be confirmed as a fifth candidate entity word.
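The calculation D = A - B + C of sub-step S33 can be sketched as follows. The two-dimensional vectors are invented so that the arithmetic is easy to follow, and excluding the three input words from the search is an assumption of this sketch rather than something the source states.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Return the word whose vector is nearest to vec(a) - vec(b) + vec(c)."""
    # Sixth word vector: third word vector minus fifth plus fourth.
    sixth = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Assumption: the three input words themselves are not eligible answers.
    others = [w for w in vectors if w not in (a, b, c)]
    return max(others, key=lambda w: cosine(vectors[w], sixth))

# Toy vectors, invented for illustration only.
TOY = {
    "Liu Dehua": [1.0, 1.0],
    "Cheng Long": [1.0, 0.2],
    "Wang Lihong": [0.2, 1.0],
    "Zhou Jielun": [0.3, 1.7],
    "Liang Chaowei": [1.5, 0.5],
}

fifth_candidate = analogy("Liu Dehua", "Cheng Long", "Wang Lihong", TOY)
```

With these toy values the sixth word vector is [0.2, 1.8], which lies closest to the vector chosen for "Zhou Jielun".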

Sub-step S34: from the one or more fifth candidate entity words, screen out one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;

For example, for "Cheng Long" the entity word type is star; therefore "wireless", whose entity word type is company, can be screened out of "wireless", "Liang Chaowei", "Wang Lihong" and "Zhou Jielun", retaining "Liang Chaowei" and "Zhou Jielun", whose entity word type is likewise star.

It should be noted that because the fourth candidate entity words and the fifth candidate entity words are associated with each other, when a fifth candidate entity word is screened out, the corresponding fourth candidate entity word can be screened out as well.

For example, because "wireless" is screened out, "Miao Qiaowei", which is associated with "wireless", is also screened out, leaving "Huang Rihua" and "Wang Lihong".
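Keeping each fourth candidate together with the fifth candidate derived from it makes the linked screening of sub-step S34 straightforward. The pair representation and the `TYPE_OF` table below are assumptions of this sketch; the source only states that the two kinds of candidates are associated.

```python
# Hypothetical type table for the example entity words.
TYPE_OF = {
    "Cheng Long": "star",
    "wireless": "company",
    "Liang Chaowei": "star",
    "Zhou Jielun": "star",
}

def filter_pairs_by_type(second_sub_entity, pairs, type_of=TYPE_OF):
    """Drop every (fourth, fifth) pair whose fifth candidate has the wrong type.

    Removing the pair as a whole also removes the associated fourth candidate.
    """
    wanted = type_of[second_sub_entity]
    return [(c, d) for c, d in pairs if type_of.get(d) == wanted]

pairs = [
    ("Miao Qiaowei", "wireless"),
    ("Huang Rihua", "Liang Chaowei"),
    ("Wang Lihong", "Zhou Jielun"),
]
kept = filter_pairs_by_type("Cheng Long", pairs)
```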

Sub-step S35: choose the second entity words from the one or more fourth candidate entity words and the one or more sixth candidate entity words.

In the embodiment of the present application, the second entity words can be chosen by the following formula:

Here, A and B are the first entity words, C and D are the second entity words, score(C, D) is the score of C and D, c_i is the i-th fourth candidate entity word, d_j is the j-th sixth candidate entity word, and λ is a constant.

Specifically, a first distance can be calculated from the third word vector of the first sub-entity word and the fourth word vector of the fourth candidate entity word;

A second distance is calculated from the seventh word vector of the sixth candidate entity word and the sixth word vector, where the sixth word vector is the word vector obtained by subtracting the fifth word vector from the third word vector and adding the fourth word vector;

The score of the fourth candidate entity word and the sixth candidate entity word is calculated from the first distance and the second distance;

The fourth candidate entity word and the sixth candidate entity word with the highest score are chosen as the second entity words. That is, for ease of expression, the embodiment of the present application refers to the second entity words, in order, as the fourth candidate entity word, the sixth candidate entity word, and so on.

For example, according to the above formula, substituting "Liu Dehua", "Cheng Long", "Huang Rihua" and "Liang Chaowei" gives a score of 0.85, and substituting "Liu Dehua", "Cheng Long", "Wang Lihong" and "Zhou Jielun" gives a score of 0.93. Since 0.93 > 0.85, "Wang Lihong" and "Zhou Jielun" are determined to be the second entity words.
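The scoring formula itself is not reproduced in this text, so the following is only a sketch under an assumed form: a weighted sum λ · sim(A, c_i) + (1 − λ) · sim(A − B + c_i, d_j), with cosine similarity standing in for both "distances" so that a higher score means a better pair. The weighting, the toy vectors and the function names are all assumptions, and the resulting numbers are unrelated to the 0.85 and 0.93 quoted above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def score(a, b, c, d, vectors, lam=0.5):
    """ASSUMED form: lam * sim(A, c) + (1 - lam) * sim(A - B + c, d)."""
    first = cosine(vectors[a], vectors[c])  # first "distance"
    sixth = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    second = cosine(sixth, vectors[d])  # second "distance"
    return lam * first + (1 - lam) * second

def choose_second_entity_words(a, b, pairs, vectors, lam=0.5):
    """Return the (fourth, sixth) candidate pair with the highest score."""
    return max(pairs, key=lambda pair: score(a, b, pair[0], pair[1], vectors, lam))

# Toy vectors, invented for illustration only.
TOY = {
    "Liu Dehua": [1.0, 1.0],
    "Cheng Long": [1.0, 0.2],
    "Huang Rihua": [1.0, 0.5],
    "Liang Chaowei": [1.5, 0.5],
    "Wang Lihong": [0.2, 1.0],
    "Zhou Jielun": [0.3, 1.7],
}

pairs = [("Huang Rihua", "Liang Chaowei"), ("Wang Lihong", "Zhou Jielun")]
best = choose_second_entity_words("Liu Dehua", "Cheng Long", pairs, TOY)
```

The constant `lam` plays the role of λ: values near 1 favour candidates close to the first sub-entity word, and values near 0 favour pairs that best reproduce the relation between the two sub-entity words.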

Step 105: generate second text data according to the second entity word.

In the embodiment of the present application, an analogy answer template belonging to the same relationship type as the analogy question template is looked up.

The second entity word is embedded into the analogy answer template to obtain the second text data.

It should be noted that because there are many analogy answer templates, they can be stored in a form similar to key-set<value>, where key is the relationship type, that is, the analogy pattern frame, such as a gossip relation or a still-life relation, and set<value> is a group of answer templates.

When a key is hit, one answer template is selected from the corresponding set<value>. The selection strategy may be random, may follow a probability distribution, or, without limitation, may provide different answer templates according to entity type.
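A key-set<value> store of this kind might look like the following sketch. Which templates belong to which relationship type is an assumption made here, as are the key strings and the `{A}`-style placeholders; tuples are used instead of a Python set because `random.choice` needs an indexable sequence.

```python
import random

# Hypothetical store: relationship type -> group of answer templates.
ANSWER_TEMPLATES = {
    "gossip relation": (
        "The relationship between {A} and {B} is like the relationship between {C} and {D}.",
        "{A} and {B} are just like {C} and {D}.",
    ),
    "still-life relation": (
        "{A}'s good friend should be {B}.",
        "I think {A}'s good friend is {B}.",
    ),
}

def choose_template(relation_type, rng=random):
    """Return one template for the relationship type, or None if the key is not hit."""
    templates = ANSWER_TEMPLATES.get(relation_type)
    if not templates:
        return None
    return rng.choice(templates)  # random strategy; a weighted choice would also fit

template = choose_template("gossip relation")
```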

For example, for the analogy question template shown in Fig. 2A, the following analogy answer templates can be used:

1. A's good friend should be B.

2. I think A's good friend is B.

3. A's good friends are the likes of B.

4. A and B should be able to become friends happily.

Here, A is the first entity word and B is the second entity word.

For "Who is the desk lamp's good friend", applying the third template, the answer can be "The desk lamp's good friends are the likes of wall sticker, LED and TV cabinet".

As another example, for the analogy question template shown in Fig. 2B, the following analogy answer templates can be used:

1. However complicated their relationship is, it is just like the relationship between C and D.

2. Just like C and D, you know.

3. Their relationship, in fact, is exactly the same as the relationship between C and D.

4. Speaking of this, I feel it is like the relationship between C and D.

5. If they are compared to C and D, do you think that is apt?

6. The relationship between A and B is like the relationship between C and D.

7. A and B are similar to C and D.

8. A and B are just like C and D.

9. The relationship between A and B feels just like the relationship between C and D.

10. The relationship between A and B reminds me of the relationship between C and D.

Here, A and B are the first entity words, and C and D are the second entity words.

For "What is the relationship between Liu Dehua and Cheng Long", applying the sixth template, the answer can be "The relationship between Liu Dehua and Cheng Long is like the relationship between Wang Lihong and Zhou Jielun".
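Embedding entity words into a chosen template can be sketched with Python string formatting. The `{A}`-style placeholder syntax and the helper name are assumptions of this sketch; the source only names the slots A, B, C and D.

```python
def fill_template(template, **entity_words):
    """Substitute entity words for the named slots of an answer template."""
    return template.format(**entity_words)

# Two-entity question: A, B are first entity words, C, D are second entity words.
pair_answer = fill_template(
    "The relationship between {A} and {B} is like the relationship between {C} and {D}.",
    A="Liu Dehua", B="Cheng Long", C="Wang Lihong", D="Zhou Jielun",
)

# Single-entity question: several second entity words are joined into slot B.
single_answer = fill_template(
    "{A}'s good friend should be {B}.",
    A="desk lamp", B=", ".join(["LED", "TV cabinet"]),
)
```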

If what was previously received is first text data sent by a client, the second text data can be returned directly to the client for display.

If what was previously received is first speech data sent by a client, the second text data can be converted into second speech data and the second speech data returned to the client for playback; or the second text data can be returned to the client for display; or the second speech data can be returned to the client for playback while the second text data is returned to the client for display.
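The return options above can be sketched as a small dispatch function. The payload shape, the `reply_mode` values and the `synthesize` callable are hypothetical; the source does not name a speech synthesis component.

```python
def build_response(input_kind, second_text, synthesize, reply_mode="speech"):
    """Build the payload returned to the client.

    input_kind: "text" or "speech", i.e. what the client originally sent.
    reply_mode: for speech input only, one of "speech", "text" or "both".
    synthesize: hypothetical text-to-speech callable (str -> bytes).
    """
    if input_kind == "text" or reply_mode == "text":
        return {"text": second_text}
    payload = {"speech": synthesize(second_text)}
    if reply_mode == "both":
        payload["text"] = second_text
    return payload

def fake_tts(text):
    # Stand-in for a real synthesizer, used only to make the sketch runnable.
    return text.encode("utf-8")

text_reply = build_response("text", "hello", fake_tts)
both_reply = build_response("speech", "hello", fake_tts, reply_mode="both")
```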

It should be noted that, for brevity, the method embodiments are described as a series of combined actions, but those skilled in the art should know that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application some steps may be performed in another order or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.

Referring to Fig. 4, a structural block diagram of an embodiment of a text data processing device of the present application is shown, which may specifically include the following modules:

A first text data acquisition module 401, for obtaining first text data;

An analogy intention judgment module 402, for judging whether the first text data is suitable for analogy, and if so, calling the entity word extraction module 403;

An entity word extraction module 403, for extracting a first entity word from the first text data;

An entity word analogy module 404, for performing analogy on the first entity word to obtain a second entity word;

A second text data generation module 405, for generating second text data according to the second entity word.
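The way the five modules hand data to one another can be sketched as a simple pipeline. The callables passed in are stand-ins invented for this sketch, and passing the first entity words to the generation step is an assumption made so that templates with slots A and B can be filled.

```python
def process_text(first_text, is_suitable_for_analogy, extract_entity_words,
                 analogize, generate_text):
    """Chain modules 401 to 405; return None when the text is not suitable for analogy."""
    if not is_suitable_for_analogy(first_text):           # module 402
        return None
    first_entities = extract_entity_words(first_text)     # module 403
    second_entities = analogize(first_entities)           # module 404
    return generate_text(first_entities, second_entities)  # module 405

# Trivial stand-in implementations, for illustration only.
answer = process_text(
    "What is the relationship between Liu Dehua and Cheng Long",
    is_suitable_for_analogy=lambda text: "relationship" in text,
    extract_entity_words=lambda text: ["Liu Dehua", "Cheng Long"],
    analogize=lambda words: ["Wang Lihong", "Zhou Jielun"],
    generate_text=lambda first, second: "{} and {} are just like {} and {}.".format(*first, *second),
)
```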

In an embodiment of the present application, the analogy intention judgment module 402 may include the following sub-modules:

In an embodiment of the present application, the entity word analogy module 404 may include the following sub-modules:

In an embodiment of the present application, the first candidate entity word search sub-module may include the following units:

In an embodiment of the present application, the third candidate entity word search sub-module may include the following units:

In an embodiment of the present application, the fifth candidate entity word calculation sub-module may include the following units:

In an embodiment of the present application, the second entity word selection sub-module may include the following units:

In an embodiment of the present application, the second text data generation module 405 may include the following sub-modules:

In an embodiment of the present application, the device may further include the following modules:

Because the device embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, refer to the description of the method embodiment.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a device or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

In a typical configuration, the computer device includes one or more processors (CPU), an input/output interface, a network interface and memory. The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the method, terminal device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, and the instruction means implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operation steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, and the instructions executed on the computer or other programmable terminal device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Although preferred embodiments of the present application have been described, those skilled in the art, once they know the basic inventive concept, can make further changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present application.

Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or terminal device. Without further restriction, an element qualified by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or terminal device that includes the element.

The text data processing method and the text data processing device provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, for those of ordinary skill in the art, the specific implementation and the scope of application may change according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A text data processing method, characterized by comprising:

obtaining first text data;

generating second text data according to the second entity word.

2. The method according to claim 1, characterized in that the step of judging whether the first text data is suitable for analogy comprises:

3. The method according to claim 1 or 2, characterized in that the step of performing analogy on the first entity word to obtain the second entity word comprises:

4. The method according to claim 3, characterized in that the step of searching for one or more first candidate entity words similar to the first entity word comprises:

5. The method according to claim 1 or 2 or 4, characterized in that the step of performing analogy on the first entity word to obtain the second entity word comprises:

6. The method according to claim 5, characterized in that the step of searching for one or more third candidate entity words similar to the first sub-entity word comprises:

7. The method according to claim 5, characterized in that the step of calculating one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word and the one or more fourth candidate entity words comprises:

8. The method according to claim 7, characterized in that the step of choosing the second entity words from the one or more fourth candidate entity words and the one or more sixth candidate entity words comprises:

9. The method according to claim 2, characterized in that the step of generating second text data according to the second entity word comprises:

10. The method according to claim 1 or 2 or 4 or 6 or 7 or 8 or 9, characterized by further comprising:

converting the second text data into second speech data;

returning the second speech data to the client.

11. A text data processing device, characterized by comprising:

a first text data acquisition module, for obtaining first text data;

12. The device according to claim 11, characterized in that the analogy intention judgment module comprises:

13. The device according to claim 11 or 12, characterized in that the entity word analogy module comprises:

14. The device according to claim 11 or 12, characterized in that the entity word analogy module comprises:

15. The device according to claim 12, characterized in that the second text data generation module comprises: