CN106980624A - Method and apparatus for processing text data - Google Patents
Method and apparatus for processing text data
- Publication number
- CN106980624A CN201610031796.6A
- Authority
- CN
- China
- Prior art keywords
- word
- candidate
- entity
- instance
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Mathematical Physics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present application provide a method and apparatus for processing text data. The method includes: obtaining first text data; judging whether the first text data is suitable for analogy; if so, extracting a first entity word from the first text data; performing analogy on the first entity word to obtain a second entity word; and generating second text data according to the second entity word. The embodiments build word vectors directly from a large amount of unlabeled text in order to answer by analogy, so no knowledge base needs to be built, which reduces the manpower and material resources consumed and lowers cost. Rather than stating the exact relationship between the two entities directly, the reply is given by analogy, which improves coverage and the success rate of replies to analogy questions.
Description
Technical field
The present application relates to the technical field of text processing, and in particular to a method for processing text data and an apparatus for processing text data.
Background
With the development of science and technology, demand for computers to respond intelligently by voice or text is growing, and many intelligent chat robots have appeared.
In voice or text responses, analogy questions are relatively common, for example "What is the relationship between Xiao Ming and Xiao Hong?"
At present, intelligent chat robots generally derive the similarity or analogy relationship between two entities from RDF (Resource Description Framework) in order to answer analogy questions.
Answering a question about the relationship between two entities from an RDF knowledge base requires a complete RDF knowledge base to be built in advance.
Building an RDF knowledge base generally requires iterating over three steps: mining relationship templates, cleaning encyclopedia data, and extracting relations. This consumes a great deal of manpower and material resources and is costly, yet coverage remains low, so the success rate of replies to analogy questions is low.
For example, if a crawled gossip news item states that "Liu Dehua and Cheng Long are close buddies", then Liu Dehua, Cheng Long, and the relation "close buddies" are recorded in the RDF knowledge base.
If a user then sends the question "What is the relationship between Liu Dehua and Cheng Long?", the relation "close buddies" is found in the RDF knowledge base and the answer is "close buddies".
If the gossip news item was not crawled beforehand, the question cannot be answered, and the system may sidestep it with a reply such as "What relationship?"
In addition, RDF-based replies are question-and-answer in style. In a chat system they may fail to produce an answer, and they often lack humanlike humor.
Summary of the invention
In view of the above problems, the embodiments of the present application are proposed to provide a method for processing text data, and a corresponding apparatus for processing text data, that overcome or at least partially solve the above problems.
In order to solve the above problems, an embodiment of the present application discloses a method for processing text data, including:
obtaining first text data;
judging whether the first text data is suitable for analogy; if so, extracting a first entity word from the first text data;
performing analogy on the first entity word to obtain a second entity word;
generating second text data according to the second entity word.
Preferably, the step of judging whether the first text data is suitable for analogy includes:
performing word segmentation on the first text data to obtain a plurality of first text segments;
matching the plurality of first text segments of the first text data against preset analogy question templates;
when the matching succeeds, determining that the first text data is suitable for analogy.
Preferably, the step of performing analogy on the first entity word to obtain a second entity word includes:
when there is one first entity word, searching for one or more first candidate entity words similar to the first entity word;
screening, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;
selecting one or more second entity words from the one or more second candidate entity words.
Preferably, the step of searching for one or more first candidate entity words similar to the first entity word includes:
querying a first word vector of the first entity word and one or more second word vectors of one or more first candidate entity words;
calculating one or more first similarities based on the first word vector and the one or more second word vectors;
extracting the one or more first candidate entity words with the highest first similarity as the one or more first candidate entity words similar to the first entity word.
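The similarity search described in these steps can be sketched as follows. This is a minimal illustration, not the claimed implementation: the text does not say which similarity measure is used, so cosine similarity is an assumption here, and the function names and toy vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, top_k=3):
    """Return up to top_k (candidate, similarity) pairs, most similar first."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical 3-dimensional word vectors, for illustration only.
TOY_VECTORS = {
    "desk lamp": [0.9, 0.1, 0.0],
    "floor lamp": [0.8, 0.2, 0.1],
    "candle": [0.6, 0.3, 0.2],
    "bicycle": [0.0, 0.1, 0.9],
}

first_candidates = most_similar("desk lamp", TOY_VECTORS, top_k=2)
```

In a real system the vectors would come from a trained word embedding model, and the candidates returned here would then be screened by entity word type as the preceding steps describe.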
Preferably, the step of performing analogy on the first entity word to obtain a second entity word includes:
when the first entity word includes a first sub-entity word and a second sub-entity word, searching for one or more third candidate entity words similar to the first sub-entity word;
screening, from the one or more third candidate entity words, one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;
calculating one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words;
screening, from the one or more fifth candidate entity words, one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;
choosing the second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words.
Preferably, the step of searching for one or more third candidate entity words similar to the first sub-entity word includes:
querying a third word vector of the first sub-entity word and one or more fourth word vectors of one or more third candidate entity words;
calculating one or more second similarities based on the third word vector and the one or more fourth word vectors;
extracting the one or more third candidate entity words with the highest second similarity as the one or more third candidate entity words similar to the first sub-entity word.
Preferably, the step of calculating one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words includes:
querying the third word vector of the first sub-entity word, the one or more fourth word vectors of the one or more fourth candidate entity words, and a fifth word vector of the second sub-entity word;
starting from the third word vector, subtracting the fifth word vector and adding the fourth word vector to obtain a sixth word vector;
when a seventh word vector of an entity word is nearest to the sixth word vector, confirming that entity word as a fifth candidate entity word.
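The vector arithmetic in this step can be sketched as follows. This is an illustration under stated assumptions: the operand order follows the step as written above (third minus fifth plus fourth), Euclidean distance stands in for "nearest" because the text does not name a distance measure, and the entity names and two-dimensional vectors are hypothetical.

```python
import math

def analogy_target(third, fifth, fourth):
    """Sixth vector = third - fifth + fourth, computed element-wise."""
    return [a - b + c for a, b, c in zip(third, fifth, fourth)]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_entity(target, vectors, exclude=()):
    """Return (word, distance) for the vector closest to target."""
    best_word, best_dist = None, float("inf")
    for word, vec in vectors.items():
        if word in exclude:
            continue
        dist = euclidean(target, vec)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word, best_dist

# Hypothetical vectors: A and B are the two sub-entity words,
# C is a fourth candidate entity word (similar to A).
VECTORS = {
    "A": [1.0, 0.0],
    "B": [1.0, 1.0],
    "C": [3.0, 0.0],
    "D": [3.0, -1.0],
    "E": [0.0, 5.0],
}

sixth = analogy_target(VECTORS["A"], VECTORS["B"], VECTORS["C"])
fifth_candidate, distance = nearest_entity(sixth, VECTORS, exclude={"A", "B", "C"})
```

Excluding the input words from the search is a design choice made for this sketch, since otherwise one of the inputs can end up being its own nearest neighbor.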
Preferably, the step of choosing the second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words includes:
calculating a first distance based on the third word vector of the first sub-entity word and the fourth word vector of the fourth candidate entity word;
calculating a second distance based on the seventh word vector and the sixth word vector associated with the sixth candidate entity word;
calculating a score for the fourth candidate entity word and the sixth candidate entity word using the first distance and the second distance;
choosing the fourth candidate entity word and the sixth candidate entity word with the highest score as the second entity word.
Preferably, the step of generating second text data according to the second entity word includes:
searching for an analogy answer template that belongs to the same relationship type as the analogy question template;
embedding the second entity word in the analogy answer template to obtain the second text data.
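Template lookup and embedding can be sketched as follows. The relationship type keys, the template wording, and the slot names are all hypothetical, since the actual answer templates are not given in the text; the sketch only shows the mechanism of choosing a template by relationship type and filling its reserved positions.

```python
# Hypothetical answer templates keyed by relationship type.
ANSWER_TEMPLATES = {
    "object_friendship": "{first} gets along best with {second}.",
    "person_relation": "{first_a} and {first_b} are like {second_a} and {second_b}.",
}

def generate_reply(relation_type, **entity_words):
    """Look up the answer template for a relationship type and fill its slots."""
    template = ANSWER_TEMPLATES[relation_type]
    return template.format(**entity_words)

reply = generate_reply("object_friendship", first="desk lamp", second="floor lamp")
```

Keeping question and answer templates keyed by the same relationship type means a successful question match directly identifies which answer template to fill.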
Preferably, the method further includes:
when first speech data sent by a client is received, converting the first speech data into the first text data;
converting the second text data into second speech data;
returning the second speech data to the client.
An embodiment of the present application also discloses an apparatus for processing text data, including:
a first text data acquisition module, configured to obtain first text data;
an analogy intention judging module, configured to judge whether the first text data is suitable for analogy and, if so, to call an entity word extraction module;
the entity word extraction module, configured to extract a first entity word from the first text data;
an entity word analogy module, configured to perform analogy on the first entity word to obtain a second entity word;
a second text data generation module, configured to generate second text data according to the second entity word.
Preferably, the analogy intention judging module includes:
a word segmentation submodule, configured to perform word segmentation on the first text data to obtain a plurality of first text segments;
an analogy question template matching submodule, configured to match the plurality of first text segments of the first text data against preset analogy question templates;
an analogy intention determination submodule, configured to determine, when the matching succeeds, that the first text data is suitable for analogy.
Preferably, the entity word analogy module includes:
a first candidate entity word search submodule, configured to search, when there is one first entity word, for one or more first candidate entity words similar to the first entity word;
a second candidate entity word screening submodule, configured to screen, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;
a second entity word selection submodule, configured to select one or more second entity words from the one or more second candidate entity words.
Preferably, the first candidate entity word search submodule includes:
a first word vector query unit, configured to query a first word vector of the first entity word and one or more second word vectors of one or more first candidate entity words;
a first similarity calculation unit, configured to calculate one or more first similarities based on the first word vector and the one or more second word vectors;
a first candidate entity word extraction unit, configured to extract the one or more first candidate entity words with the highest first similarity as the one or more first candidate entity words similar to the first entity word.
Preferably, the entity word analogy module includes:
a third candidate entity word search submodule, configured to search, when the first entity word includes a first sub-entity word and a second sub-entity word, for one or more third candidate entity words similar to the first sub-entity word;
a fourth candidate entity word screening submodule, configured to screen, from the one or more third candidate entity words, one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;
a fifth candidate entity word calculation submodule, configured to calculate one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words;
a sixth candidate entity word screening submodule, configured to screen, from the one or more fifth candidate entity words, one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;
a second entity word choosing submodule, configured to choose the second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words.
Preferably, the third candidate entity word search submodule includes:
a second word vector query unit, configured to query a third word vector of the first sub-entity word and one or more fourth word vectors of one or more third candidate entity words;
a second similarity calculation unit, configured to calculate one or more second similarities based on the third word vector and the one or more fourth word vectors;
a third candidate entity word extraction unit, configured to extract the one or more third candidate entity words with the highest second similarity as the one or more third candidate entity words similar to the first sub-entity word.
Preferably, the fifth candidate entity word calculation submodule includes:
a third word vector query unit, configured to query the third word vector of the first sub-entity word, the one or more fourth word vectors of the one or more fourth candidate entity words, and a fifth word vector of the second sub-entity word;
a vector calculation unit, configured to start from the third word vector, subtract the fifth word vector, and add the fourth word vector to obtain a sixth word vector;
a fifth candidate entity word determination unit, configured to confirm, when a seventh word vector of an entity word is nearest to the sixth word vector, that entity word as a fifth candidate entity word.
Preferably, the second entity word choosing submodule includes:
a first distance calculation unit, configured to calculate a first distance based on the third word vector of the first sub-entity word and the fourth word vector of the fourth candidate entity word;
a second distance calculation unit, configured to calculate a second distance based on the seventh word vector and the sixth word vector associated with the sixth candidate entity word;
a score calculation unit, configured to calculate a score for the fourth candidate entity word and the sixth candidate entity word using the first distance and the second distance;
a choosing unit, configured to choose the fourth candidate entity word and the sixth candidate entity word with the highest score as the second entity word.
Preferably, the second text data generation module includes:
an analogy answer template search submodule, configured to search for an analogy answer template that belongs to the same relationship type as the analogy question template;
an analogy answer template embedding submodule, configured to embed the second entity word in the analogy answer template to obtain the second text data.
Preferably, the apparatus further includes:
a text conversion module, configured to convert, when first speech data sent by a client is received, the first speech data into the first text data;
a voice conversion module, configured to convert the second text data into second speech data;
a voice return module, configured to return the second speech data to the client.
The embodiments of the present application have the following advantages:
When it is confirmed that the first text data has an analogy intention, the embodiments perform analogy on the first entity word of the first text data to obtain a second entity word and then generate second text data. Word vectors are built directly from a large amount of unlabeled text in order to answer by analogy, so no knowledge base needs to be built, which reduces the manpower and material resources consumed and lowers cost. Rather than stating the exact relationship between the two entities directly, the reply is given by analogy, which improves coverage and the success rate of replies to analogy questions.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of an embodiment of a method for processing text data of the present application;
Fig. 2A and Fig. 2B are example diagrams of analogy question templates of an embodiment of the present application;
Fig. 3 is a structural diagram of a CBOW model of an embodiment of the present application;
Fig. 4 is a structural block diagram of an embodiment of an apparatus for processing text data of the present application.
Detailed description
To make the above objects, features, and advantages of the present application clearer and easier to understand, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, a flow chart of the steps of an embodiment of a method for processing text data of the present application is shown, which may specifically include the following steps:
Step 101, obtaining first text data;
It should be noted that the embodiments of the present application can be applied in artificial intelligence applications such as chat robots and voice assistants.
The artificial intelligence application can be deployed locally on a terminal, such as a mobile phone, tablet computer, or smart wearable device (e.g. a bracelet, watch, or glasses), or it can be deployed in the cloud or on a server, such as a distributed system; the embodiments of the present application place no limit on this.
If deployed in the cloud, the first text data sent by a client can be received directly.
Alternatively,
when first speech data sent by a client is received, automatic speech recognition (ASR) can be performed on the first speech data to convert it into the first text data.
In a specific implementation, a speech recognition system generally consists of the following basic modules:
1. Signal processing and feature extraction module: the main task of this module is to extract features from the speech data for the acoustic model to process. It typically also includes signal processing techniques that reduce, as far as possible, the influence of factors such as environmental noise, the channel, and the speaker on the features.
2. Acoustic model: speech recognition systems mostly use models based on the first-order hidden Markov model (HMM).
3. Pronunciation dictionary: the pronunciation dictionary contains the vocabulary that the speech recognition system can handle, together with its pronunciations. In effect it provides the mapping between the acoustic model and the language model.
4. Language model: the language model models the language targeted by the speech recognition system. In theory, various language models, including regular languages and context-free grammars, can serve as the language model, but most systems currently use statistical N-grams and their variants.
5. Decoder: the decoder is one of the core components of a speech recognition system. Its task is, for an input signal, to search for the word string that outputs that signal with the highest probability, according to the acoustic model, the language model, and the dictionary.
Step 102, judging whether the first text data is suitable for analogy; if so, performing step 103;
Analogy means comparing two different objects (or two classes of objects): given that the two objects are similar in a series of attributes, and that one of them is known to have some further attribute, it is inferred that the other object also has a similar further attribute.
In the embodiments of the present application, the first text data can be a question that can be answered by analogy, such as "Who is the desk lamp's good friend?" or "What is the relationship between Liu Dehua and Cheng Long?"
In one embodiment of the present application, step 102 can include the following sub-steps:
Sub-step S11, performing word segmentation on the first text data to obtain a plurality of first text segments;
In the embodiments of the present application, word segmentation can be performed in one or more of the following ways:
1. Segmentation based on string matching: the Chinese character string to be analyzed is matched against the entries of a preset machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is identified).
2. Segmentation based on feature scanning or mark cutting: words with obvious features are identified and cut out of the string to be analyzed first. Using these words as breakpoints, the original string can be divided into smaller strings that then undergo mechanical segmentation, which reduces the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: the rich part-of-speech information helps the segmentation decisions, and the tagging process in turn checks and adjusts the segmentation result, which improves segmentation accuracy.
3. Segmentation based on understanding: the computer simulates a person's understanding of a sentence in order to identify words. The basic idea is to perform syntactic and semantic analysis during segmentation and to use syntactic and semantic information to resolve ambiguity. It generally includes three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem can obtain syntactic and semantic information about words, sentences, and so on to resolve segmentation ambiguity; that is, it simulates how a person understands a sentence.
4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters co-occur adjacently reflects well how likely they are to form a word. The frequency of each combination of adjacently co-occurring characters in a corpus can therefore be counted to calculate their mutual information and the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects how tightly the characters are bound together; when it exceeds a certain threshold, the character group can be considered to possibly form a word.
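The statistical approach in item 4 can be sketched with pointwise mutual information over adjacent character pairs. This is a minimal illustration, not the method prescribed by the application: the exact formula, the toy corpus of Latin letters standing in for Chinese characters, and the threshold value are all assumptions.

```python
import math
from collections import Counter

def pointwise_mutual_information(corpus, x, y):
    """PMI of the adjacent pair (x, y): log(P(xy) / (P(x) * P(y)))."""
    chars = Counter()
    pairs = Counter()
    for sentence in corpus:
        chars.update(sentence)                      # single-character counts
        pairs.update(zip(sentence, sentence[1:]))   # adjacent-pair counts
    if pairs[(x, y)] == 0:
        return float("-inf")  # the pair never co-occurs adjacently
    p_xy = pairs[(x, y)] / sum(pairs.values())
    p_x = chars[x] / sum(chars.values())
    p_y = chars[y] / sum(chars.values())
    return math.log(p_xy / (p_x * p_y))

# Hypothetical toy corpus; each letter stands in for one character.
CORPUS = ["abab", "abc", "cab"]
THRESHOLD = 0.0  # assumed cut-off for treating a pair as a likely word

pmi_ab = pointwise_mutual_information(CORPUS, "a", "b")
pmi_ba = pointwise_mutual_information(CORPUS, "b", "a")
ab_is_likely_word = pmi_ab > THRESHOLD
```

In this toy corpus "ab" occurs far more often than chance would predict, so it scores above the threshold, while "ba" scores below it.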
Of course, the above segmentation methods are only examples. When implementing the embodiments of the present application, other segmentation methods can be set according to actual conditions, or used by those skilled in the art according to actual needs, and the embodiments of the present application place no limit on this.
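As an illustration of the string-matching approach in item 1, the following sketch uses forward maximum matching, one common matching strategy. The text only says "a certain strategy", so this choice is an assumption, as are the function name, the dictionary contents, and the unspaced English input standing in for a Chinese string.

```python
def forward_maximum_matching(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation against a fixed dictionary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until an entry matches;
        # a single character is always accepted as a fallback.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical dictionary; unspaced English stands in for a Chinese string.
DICTIONARY = {"desk", "lamp", "desklamp", "is", "on"}
segments = forward_maximum_matching("desklampison", DICTIONARY, max_word_len=8)
```

Preferring the longest match keeps "desklamp" together as one entry instead of splitting it into "desk" and "lamp".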
Sub-step S12, matching the plurality of first text segments of the first text data against preset analogy question templates;
Sub-step S13, when the matching succeeds, determining that the first text data is suitable for analogy.
In the embodiments of the present application, a paired analogy question template and analogy answer template can be provided for each of one or more relationship types (i.e. analogy pattern frames).
An analogy question template contains the basic structure of a question (text) that is suitable for analogy.
An analogy answer template contains the basic structure of an answer to the question, with positions reserved for entity words.
Analogy question templates and analogy answer templates are persisted in text files in a custom structure and loaded into memory for matching.
In a specific implementation, a context-free grammar (CFG) parser can be used to match analogy question templates.
A formal grammar G = (N, Σ, P, S) is called context-free if all of its production rules take the form V -> w, where V ∈ N and w ∈ (N ∪ Σ)*.
The grammar is called "context-free" because the symbol V can always be replaced by the string w, regardless of the context in which V occurs.
A formal language is context-free if it is generated by a context-free grammar.
If the first text segments obtained by word segmentation match a preset analogy question template, the first text data can be considered suitable for analogy.
Taking the still-life relation as an example of a relationship type, in the analogy question template shown in Fig. 2A, arg1 denotes an entity word, and the basic structure of the question consists of "的" (the possessive particle), "good", "friend/buddy", "is", and "who".
For "who is the desk lamp's good friend", word segmentation yields "desk lamp", "的", "good friend", "is", and "who", which match the analogy question template shown in Fig. 2A, so the question can be considered to have an analogy intent.
Taking the gossip relation as an example of a relationship type, in the analogy question template shown in Fig. 2B, arg1 and arg2 denote entity words, and the basic structure of the question consists of "and", "is", "what", and "relation".
For "what is the relationship between Liu Dehua and Cheng Long", word segmentation yields "Liu Dehua", "and", "Cheng Long", "is", "what", and "relation", which match the analogy question template shown in Fig. 2B, so the question can be considered to have an analogy intent.
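The matching of segmented text against a question template can be sketched as follows. This is a simplified slot matcher that stands in for the CFG parser named above, not a CFG parser itself; the template encoding, the name `match_template`, and the English tokens are assumptions made for illustration.

```python
def match_template(tokens, template):
    """Match segmented tokens against one question template.

    Each template item is a slot name starting with "arg" (matches any
    token), a set of alternative literals, or a single literal string.
    Returns the slot bindings on success and None on failure.
    """
    if len(tokens) != len(template):
        return None
    slots = {}
    for token, item in zip(tokens, template):
        if isinstance(item, set):
            if token not in item:
                return None
        elif item.startswith("arg"):
            slots[item] = token
        elif token != item:
            return None
    return slots


# Hypothetical encoding of a two-entity relationship question template.
RELATION_TEMPLATE = ["arg1", "and", "arg2", "is", "what", "relation"]
```

Calling `match_template(["Liu Dehua", "and", "Cheng Long", "is", "what", "relation"], RELATION_TEMPLATE)` binds arg1 to "Liu Dehua" and arg2 to "Cheng Long", while a token sequence that does not fit the template yields None.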
Step 103, extracting a first entity word from the first text data;
An entity word corresponds to a specific individual.
It should be noted that the first entity word, second entity word, first sub-entity word, second sub-entity word, first candidate entity word, second candidate entity word, third candidate entity word, fourth candidate entity word, fifth candidate entity word, and sixth candidate entity word are names for different processing states; in essence they are all entity words.
In the star category, entity words can be Liu Dehua, Zhang Baizhi, Lin Qingxia, and so on.
In addition, entity words can also include individuals that stand for broad categories, such as person, film star, singer, and so on.
For example, for "who is the desk lamp's good friend", the entity word is "desk lamp".
As another example, for "what is the relationship between Liu Dehua and Cheng Long", the entity words are "Liu Dehua" and "Cheng Long".
Step 104, performing analogy on the first entity word to obtain a second entity word;
In the embodiment of the present application, other entity words with similar attributes are derived from certain attributes of an entity word; for example, a similar second entity word is derived from the first entity word.
In a specific implementation, data can be crawled in advance to train a word2vec (word to vector) model, and analogy is performed on the first entity word through the word2vec model to obtain the second entity word.
word2vec is a tool that converts the words in the training data into vector form. It can convert each word into a 200-dimensional word vector, and the words (including entity words) can be stored in a hash table.
After conversion, the processing of text content is reduced to vector operations in a vector space, and similarity in the vector space is computed to represent semantic similarity between texts.
The training data can be obtained by crawling web pages with a crawler (spider) and, after data cleaning, keeping clean page titles and body text.
In practical applications, the data can include two parts:
1. Web data;
This is largely stable, accumulated data (all encyclopedia data plus about one year of other web pages that have detail pages), in plain text;
2. News data;
A window covering roughly the most recent half year is maintained and updated daily; it can include all news data with titles and body text.
This part of the data is mainly used to handle "relations" that change dynamically in the world, such as friendships and marital relations between people. Therefore, the word2vec model needs to be trained on a news corpus that keeps pace with the times.
The CBOW (Continuous Bag-of-Words) model of word2vec is used. As shown in Fig. 3, the CBOW model consists of an input layer, a projection layer, and an output layer. It predicts the vector representation of the current word w(t) from the n (n = 4) words before w(t) and the n (n = 4) words after it. This brings the vector representations of words with the same semantics or the same pattern closer together.
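To make the CBOW context window concrete, the following sketch builds the (context, target) pairs that such a model would be trained on. It covers only the windowing step, not the network or its training; the function name `cbow_pairs` is hypothetical, and the default n = 4 follows the value given above.

```python
def cbow_pairs(tokens, n=4):
    """Build (context, target) training pairs for a CBOW-style model.

    For each position t, the context is up to n tokens before and up to n
    tokens after tokens[t]; positions near the edges get shorter contexts.
    """
    pairs = []
    for t, target in enumerate(tokens):
        before = tokens[max(0, t - n):t]
        after = tokens[t + 1:t + 1 + n]
        pairs.append((before + after, target))
    return pairs
```

With n = 1 and the tokens a, b, c, d, e, the pair for "c" has the context ["b", "d"].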
In one embodiment of the present application, step 104 can include the following sub-steps:
Sub-step S21, when there is one first entity word, searching for one or more first candidate entity words similar to the first entity word;
In a specific implementation, when the question contains only one entity word, the first word vector of the first entity word and one or more second word vectors of one or more first candidate entity words can be queried;
One or more first similarities are calculated based on the first word vector and the one or more second word vectors;
The one or more first candidate entity words with the highest first similarity are extracted as the one or more first candidate entity words similar to the first entity word.
Specifically, word2vec can use its distance tool to calculate the cosine distance between the converted vectors, which represents the similarity of the vectors (words).
For example, given the input "france", the distance tool calculates and displays the words closest to "france", as follows:
Word | Cosine distance |
spain | 0.678515 |
belgium | 0.665923 |
netherlands | 0.652428 |
italy | 0.633130 |
switzerland | 0.622323 |
luxembourg | 0.610033 |
portugal | 0.577154 |
russia | 0.571507 |
germany | 0.563291 |
catalonia | 0.534176 |
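A ranking of this kind can be sketched in a few lines. The sketch ranks by cosine similarity, so larger values mean closer words, which is consistent with the values listed above. The vectors in `TOY_VECTORS` are invented 2-dimensional values for illustration, not output of a trained model, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


# Toy 2-dimensional vectors, invented for illustration only.
TOY_VECTORS = {
    "france": [1.0, 0.1],
    "spain": [0.9, 0.2],
    "belgium": [0.8, 0.3],
    "apple": [0.0, 1.0],
}
```

With these toy vectors, `most_similar("france", TOY_VECTORS, topn=2)` ranks "spain" first and "belgium" second.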
Sub-step S22, filtering, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;
In the embodiment of the present application, the answer is produced by analogy with the question, so in general the type of the entity word in the question is consistent with the type of the entity word in the answer.
For example, for "desk lamp", entity words of the same entity word type include "wall sticker", "LED", "TV cabinet", and so on.
Sub-step S23, selecting one or more second entity words from the one or more second candidate entity words.
In a specific implementation, one or more second entity words can be selected for the answer from the entity words that remain after filtering by entity word type.
In another embodiment of the present application, step 104 can include the following sub-steps:
Sub-step S31, when the first entity word includes a first sub-entity word and a second sub-entity word, searching for one or more third candidate entity words similar to the first sub-entity word;
When the question contains multiple first entity words, for example two, the first entity words are referred to in the embodiment of the present application, for ease of expression and according to their order, as the first sub-entity word, the second sub-entity word, and so on.
For example, for "what is the relationship between Liu Dehua and Cheng Long", the first sub-entity word is "Liu Dehua" and the second sub-entity word is "Cheng Long".
In a specific implementation, the third word vector of the first sub-entity word and one or more fourth word vectors of one or more third candidate entity words can be queried in the word2vec model;
One or more second similarities are calculated based on the third word vector and the one or more fourth word vectors, for example by cosine similarity;
The one or more third candidate entity words with the highest second similarity are extracted as the one or more third candidate entity words similar to the first sub-entity word.
Conversely, third candidate entity words with a lower second similarity are filtered out.
For example, for "what is the relationship between Liu Dehua and Cheng Long", N (N is a positive integer) third candidate entity words similar to the first sub-entity word "Liu Dehua" can be calculated, such as "Huang Rihua", "Miao Qiaowei", "Wang Lihong", "Shi Gu", and "Ice Rain". The most similar third candidate entity words are then extracted from these N words, such as "Miao Qiaowei", "Huang Rihua", "Wang Lihong", and "Ice Rain", while "Shi Gu" is filtered out.
Sub-step S32, filtering, from the one or more third candidate entity words, one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;
In the embodiment of the present application, the answer is produced by analogy with the question, so in general the type of the entity word in the question is consistent with the type of the entity word in the answer.
To indicate the state after filtering by entity word type, the entity words retained from the third candidate entity words are referred to as fourth candidate entity words.
For example, the entity word type of "Liu Dehua" is star. Therefore, "Ice Rain", whose entity word type is song, can be filtered out of "Miao Qiaowei", "Huang Rihua", "Wang Lihong", and "Ice Rain", while "Miao Qiaowei", "Huang Rihua", and "Wang Lihong", whose entity word type is likewise star, are retained.
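Filtering by entity word type amounts to a lookup and a comparison, as the sketch below shows. The type table `ENTITY_TYPES` and the function name `filter_by_type` are hypothetical; the application does not say where entity word types come from, so a plain dictionary is assumed here.

```python
def filter_by_type(reference, candidates, entity_types):
    """Keep the candidates whose type equals the type of the reference word."""
    wanted = entity_types[reference]
    return [c for c in candidates if entity_types.get(c) == wanted]


# Hypothetical type lookup; a real system would need its own source of types.
ENTITY_TYPES = {
    "Liu Dehua": "star",
    "Miao Qiaowei": "star",
    "Wang Lihong": "star",
    "Ice Rain": "song",
}
```

Filtering the candidates "Miao Qiaowei", "Wang Lihong", and "Ice Rain" against "Liu Dehua" keeps the two stars and drops the song. Candidates missing from the table are dropped as well, which is a design choice of this sketch.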
Sub-step S33, calculating one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words;
In a specific implementation, the entity word can be calculated as D = A - B + C, where A is the first sub-entity word, B is the second sub-entity word, C is a fourth candidate entity word, and D is a fifth candidate entity word.
Specifically, the third word vector of the first sub-entity word, the one or more fourth word vectors of the one or more fourth candidate entity words, and the fifth word vector of the second sub-entity word can be queried.
The fifth word vector is subtracted from the third word vector and the fourth word vector is added, which yields a sixth word vector.
When the seventh word vector of some entity word is nearest to the sixth word vector, that entity word is confirmed as a fifth candidate entity word.
For example, suppose the first sub-entity word is "Liu Dehua", the second sub-entity word is "Cheng Long", and the fourth candidate entity words are "Miao Qiaowei", "Huang Rihua", and "Wang Lihong".
In one case, the fifth word vector of "Cheng Long" is subtracted from the third word vector of "Liu Dehua" and the fourth word vector of "Miao Qiaowei" is added, which yields a sixth word vector; if the seventh word vector of "wireless" is nearest to this sixth word vector, "wireless" can be confirmed as a fifth candidate entity word.
In another case, the fifth word vector of "Cheng Long" is subtracted from the third word vector of "Liu Dehua" and the fourth word vector of "Huang Rihua" is added, which yields a sixth word vector; if the seventh word vector of "Liang Chaowei" is nearest to this sixth word vector, "Liang Chaowei" can be confirmed as a fifth candidate entity word.
In yet another case, the fifth word vector of "Cheng Long" is subtracted from the third word vector of "Liu Dehua" and the fourth word vector of "Wang Lihong" is added, which yields a sixth word vector; if the seventh word vector of "Zhou Jielun" is nearest to this sixth word vector, "Zhou Jielun" can be confirmed as a fifth candidate entity word.
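The vector arithmetic D = A - B + C followed by a nearest-neighbour lookup can be sketched as below. Two points are assumptions rather than statements of the application: "nearest" is measured by cosine similarity, and the three input words are excluded from the search. The function names and the toy vectors in the usage note are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy_candidate(a, b, c, vectors):
    """Return the word nearest to vectors[a] - vectors[b] + vectors[c].

    The three input words are skipped, which is an assumption; the text does
    not say whether they are excluded.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    others = [w for w in vectors if w not in (a, b, c)]
    return max(others, key=lambda w: cosine_similarity(vectors[w], target))
```

For toy vectors A = [1, 1], B = [1, 0], C = [0, 1], the target is [0, 2], so a word at [0.1, 2.0] is chosen over a word at [2.0, 0.0].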
Sub-step S34, filtering, from the one or more fifth candidate entity words, one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;
In the embodiment of the present application, the answer is produced by analogy with the question, so in general the type of the entity word in the question is consistent with the type of the entity word in the answer.
For example, the entity word type of "Cheng Long" is star. Therefore, "wireless", whose entity word type is company, can be filtered out of "wireless", "Liang Chaowei", and "Zhou Jielun", while "Liang Chaowei" and "Zhou Jielun", whose entity word type is likewise star, are retained.
It should be noted that because the fourth candidate entity words and the fifth candidate entity words are associated with each other, once a fifth candidate entity word is filtered out, the corresponding fourth candidate entity word can be filtered out as well.
For example, because "wireless" is filtered out, "Miao Qiaowei", which is associated with "wireless", is filtered out too, leaving "Huang Rihua" and "Wang Lihong".
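Because each fifth candidate entity word is derived from one fourth candidate entity word, the two can be kept together as a pair and filtered in one pass. The sketch below assumes that pairing and a dictionary of entity word types; `prune_pairs` and the type table are hypothetical names.

```python
def prune_pairs(pairs, reference, entity_types):
    """Drop every (fourth, fifth) candidate pair whose fifth candidate has a
    different type from the reference word; the linked fourth candidate is
    dropped along with it."""
    wanted = entity_types[reference]
    return [(c, d) for c, d in pairs if entity_types.get(d) == wanted]


# Hypothetical type lookup for the example in the text.
PAIR_TYPES = {
    "Cheng Long": "star",
    "wireless": "company",
    "Liang Chaowei": "star",
    "Zhou Jielun": "star",
}
```

Applied to the pairs ("Miao Qiaowei", "wireless"), ("Huang Rihua", "Liang Chaowei"), and ("Wang Lihong", "Zhou Jielun") with "Cheng Long" as the reference, only the last two pairs remain.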
Sub-step S35, choosing second entity words from the one or more fourth candidate entity words and the one or more sixth candidate entity words.
In the embodiment of the present application, the second entity words can be chosen by the following formula:
where A and B are the first entity words, C and D are the second entity words, score(C, D) is the score of C and D, c_i is the i-th fourth candidate entity word, d_j is the j-th sixth candidate entity word, and λ is a constant.
Specifically, a first distance can be calculated based on the third word vector of the first sub-entity word and the fourth word vector of a fourth candidate entity word;
A second distance is calculated based on the seventh word vector of a sixth candidate entity word and the sixth word vector, where the sixth word vector is the word vector obtained by subtracting the fifth word vector from the third word vector and adding the fourth word vector;
The score of the fourth candidate entity word and the sixth candidate entity word is calculated using the first distance and the second distance;
The fourth candidate entity word and the sixth candidate entity word with the highest score are chosen as the second entity words. That is, for ease of expression, the second entity words are referred to in the embodiment of the present application, according to their order, as the fourth candidate entity word, the sixth candidate entity word, and so on.
For example, according to the above formula, the score calculated for "Liu Dehua", "Cheng Long", "Huang Rihua", and "Liang Chaowei" is 0.85, and the score calculated for "Liu Dehua", "Cheng Long", "Wang Lihong", and "Zhou Jielun" is 0.93. Since 0.93 > 0.85, "Wang Lihong" and "Zhou Jielun" can be determined to be the second entity words.
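The scoring formula itself is not reproduced in this text, so the following sketch does not claim to be it. It assumes one plausible form that uses the same ingredients the description names: a first term comparing A with C, a second term comparing D with A - B + C, and a constant λ that weights the two. Cosine similarity is used for both terms, and all names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score(a, b, c, d, vectors, lam=0.5):
    """Assumed score: lam * sim(A, C) + (1 - lam) * sim(D, A - B + C)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    first = cosine_similarity(vectors[a], vectors[c])
    second = cosine_similarity(vectors[d], target)
    return lam * first + (1.0 - lam) * second


def choose_pair(a, b, pairs, vectors, lam=0.5):
    """Return the (C, D) pair with the highest assumed score."""
    return max(pairs, key=lambda p: score(a, b, p[0], p[1], vectors, lam))
```

Whatever the exact formula is, the selection step stays the same: score every remaining (C, D) pair and keep the highest-scoring one.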
Step 105, generating second text data according to the second entity word.
In the embodiment of the present application, the analogy answer template that belongs to the same relationship type as the analogy question template is looked up.
The second entity word is embedded into the analogy answer template to obtain the second text data.
It should be noted that because there are many analogy answer templates, they can be stored in a manner similar to key-set<value>, where the key is the relationship type, that is, the analogy pattern frame, such as the gossip relation or the still-life relation, and set<value> is a group of answer templates.
When a key is hit, one answer template is selected from the corresponding set<value>. The selection strategy can be random or probability-based, and of course different answer templates can also be provided according to entity type.
For example, for the analogy question template shown in Fig. 2A, the following analogy answer templates can be used:
1. A's good friend should be B.
2. I think A's good friend is B.
3. A's good friend is the likes of B.
4. A and B should be able to become friends happily.
Here, A is the first entity word and B is the second entity word.
For "who is the desk lamp's good friend", applying the third template, the answer can be "the desk lamp's good friend is the likes of wall sticker, LED, and TV cabinet".
As another example, for the analogy question template shown in Fig. 2B, the following analogy answer templates can be used:
1. However complicated their relationship is, it is just like the relationship between C and D.
2. Just like C and D, you know what I mean.
3. Their relationship is, in fact, just the same as the relationship between C and D.
4. Speaking of this, I feel it is like the relationship between C and D.
5. If they are compared to C and D, do you think that is apt?
6. The relationship between A and B is like the relationship between C and D.
7. A and B are similar to C and D.
8. A and B are just like C and D.
9. The relationship between A and B feels just like the relationship between C and D.
10. The relationship between A and B reminds me of the relationship between C and D.
Here, A and B are the first entity words, and C and D are the second entity words.
For "what is the relationship between Liu Dehua and Cheng Long", applying the sixth template, the answer can be "the relationship between Liu Dehua and Cheng Long is like the relationship between Wang Lihong and Zhou Jielun".
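The template lookup and filling described for step 105 can be sketched as follows. The dictionary keys "friend" and "relationship", the English template strings, and the function name `generate_answer` are hypothetical; the sketch uses random selection, which is one of the strategies the text mentions.

```python
import random

# Relationship type -> group of answer templates (key -> set of values).
# Keys and template wording are illustrative assumptions.
ANSWER_TEMPLATES = {
    "friend": [
        "{A}'s good friend should be {B}.",
        "I think {A}'s good friend is {B}.",
    ],
    "relationship": [
        "The relationship between {A} and {B} is like the relationship between {C} and {D}.",
        "{A} and {B} are just like {C} and {D}.",
    ],
}


def generate_answer(relation_type, entities, rng=random):
    """Pick one answer template for the relation type and fill in the entities.

    Returns None when no template group exists for the relation type.
    """
    templates = ANSWER_TEMPLATES.get(relation_type)
    if not templates:
        return None
    return rng.choice(templates).format(**entities)
```

Passing a seeded `random.Random` instance as `rng` makes the selection repeatable, which is convenient for testing; a probability-based or entity-type-based strategy would replace the `rng.choice` call.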
If what was previously received is first text data sent by a client, the second text data can be returned directly to the client for display.
If what was previously received is first voice data sent by a client, the second text data can be converted into second voice data and the second voice data returned to the client for playback; or the second text data can be returned to the client for display; or the second voice data can be returned to the client for playback while the second text data is returned to the client for display.
When the embodiment of the present application confirms that the first text data has an analogy intent, it performs analogy on the first entity word of the first text data to obtain a second entity word and then generates second text data. Word vectors are built directly from a large amount of unlabeled text to produce analogy answers, so no knowledge base needs to be built, which reduces the consumption of manpower and material resources and lowers cost. The exact relationship between the two entities is not stated directly; replying by analogy instead improves coverage and raises the success rate of replies to analogy questions.
It should be noted that, for brevity, the method embodiments are described as a series of combined actions, but those skilled in the art should understand that the embodiment of the present application is not limited by the described order of actions, because according to the embodiment of the present application some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions involved are not necessarily required by the embodiment of the present application.
Referring to Fig. 4, a structural block diagram of an embodiment of a text data processing apparatus of the present application is shown, which can specifically include the following modules:
A first text data acquisition module 401, configured to obtain first text data;
An analogy intent judgment module 402, configured to judge whether the first text data is suitable for analogy and, if so, to call the entity word extraction module 403;
An entity word extraction module 403, configured to extract a first entity word from the first text data;
An entity word analogy module 404, configured to perform analogy on the first entity word to obtain a second entity word;
A second text data generation module 405, configured to generate second text data according to the second entity word.
In one embodiment of the present application, the analogy intent judgment module 402 can include the following submodules:
A word segmentation submodule, configured to perform word segmentation on the first text data to obtain multiple first text segments;
An analogy question template matching submodule, configured to match the multiple first text segments of the first text data against a preset analogy question template;
An analogy intent determination submodule, configured to determine, when the match succeeds, that the first text data is suitable for analogy.
In one embodiment of the present application, the entity word analogy module 404 can include the following submodules:
A first candidate entity word search submodule, configured to search, when there is one first entity word, for one or more first candidate entity words similar to the first entity word;
A second candidate entity word filtering submodule, configured to filter, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;
A second entity word selection submodule, configured to select one or more second entity words from the one or more second candidate entity words.
In one embodiment of the present application, the first candidate entity word search submodule can include the following units:
A first word vector query unit, configured to query the first word vector of the first entity word and one or more second word vectors of one or more first candidate entity words;
A first similarity calculation unit, configured to calculate one or more first similarities based on the first word vector and the one or more second word vectors;
A first candidate entity word extraction unit, configured to extract the one or more first candidate entity words with the highest first similarity as the one or more first candidate entity words similar to the first entity word.
In one embodiment of the present application, the entity word analogy module 404 can include the following submodules:
A third candidate entity word search submodule, configured to search, when the first entity word includes a first sub-entity word and a second sub-entity word, for one or more third candidate entity words similar to the first sub-entity word;
A fourth candidate entity word filtering submodule, configured to filter, from the one or more third candidate entity words, one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;
A fifth candidate entity word calculation submodule, configured to calculate one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words;
A sixth candidate entity word filtering submodule, configured to filter, from the one or more fifth candidate entity words, one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;
A second entity word choosing submodule, configured to choose second entity words from the one or more fourth candidate entity words and the one or more sixth candidate entity words.
In one embodiment of the present application, the third candidate entity word search submodule can include the following units:
A second word vector query unit, configured to query the third word vector of the first sub-entity word and one or more fourth word vectors of one or more third candidate entity words;
A second similarity calculation unit, configured to calculate one or more second similarities based on the third word vector and the one or more fourth word vectors;
A third candidate entity word extraction unit, configured to extract the one or more third candidate entity words with the highest second similarity as the one or more third candidate entity words similar to the first sub-entity word.
In one embodiment of the present application, the fifth candidate entity word calculation submodule can include the following units:
A third word vector query unit, configured to query the third word vector of the first sub-entity word, the one or more fourth word vectors of the one or more fourth candidate entity words, and the fifth word vector of the second sub-entity word;
A vector calculation unit, configured to subtract the fifth word vector from the third word vector and add the fourth word vector to obtain a sixth word vector;
A fifth candidate entity word determination unit, configured to confirm, when the seventh word vector of some entity word is nearest to the sixth word vector, that the entity word is a fifth candidate entity word.
In one embodiment of the present application, the second entity word choosing submodule can include the following units:
A first distance calculation unit, configured to calculate a first distance based on the third word vector of the first sub-entity word and the fourth word vector of the fourth candidate entity word;
A second distance calculation unit, configured to calculate a second distance based on the seventh word vector of the sixth candidate entity word and the sixth word vector;
A score calculation unit, configured to calculate the score of the fourth candidate entity word and the sixth candidate entity word using the first distance and the second distance;
A choosing unit, configured to choose the fourth candidate entity word and the sixth candidate entity word with the highest score as the second entity words.
In one embodiment of the present application, the second text data generation module 405 can include the following submodules:
An analogy answer template search submodule, configured to look up the analogy answer template that belongs to the same relationship type as the analogy question template;
An analogy answer template embedding submodule, configured to embed the second entity word into the analogy answer template to obtain the second text data.
In one embodiment of the present application, the apparatus can also include the following modules:
A text conversion module, configured to convert, when first voice data sent by a client is received, the first voice data into first text data;
A voice conversion module, configured to convert the second text data into second voice data;
A voice return module, configured to return the second voice data to the client.
Because the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly; for related details, refer to the corresponding parts of the description of the method embodiment.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar among the embodiments, the embodiments can be referred to one another.
Those skilled in the art should understand that the embodiments of the present application can be provided as a method, an apparatus, or a computer program product. Therefore, the embodiment of the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiment of the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and so on) that contain computer-usable program code.
In a typical configuration, the computer device includes one or more processors (CPUs), an input/output interface, a network interface, and memory. The memory may include volatile memory in a computer-readable medium, in forms such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The embodiment of the present application is described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiment of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device, so that a series of operation steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, and the instructions executed on the computer or other programmable terminal device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the embodiment of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiment of the present application.
Finally, it should also be noted that, in this document, relational terms such as "first" and
"second" are used only to distinguish one entity or operation from another entity or
operation, and do not necessarily require or imply any actual relationship or order between
these entities or operations. Moreover, the terms "include", "comprise" and any other variants
thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or
terminal device that includes a series of elements includes not only those elements but also
other elements that are not expressly listed, or elements that are inherent to the process,
method, article or terminal device. Without further limitation, an element defined by the
phrase "including a ..." does not exclude the presence of additional identical elements in the
process, method, article or terminal device that includes the element.
The method for processing text data and the apparatus for processing text data provided by the
present application have been described in detail above. Specific examples are used herein to
explain the principles and implementations of the present application, and the description of
the above embodiments is intended only to help in understanding the method of the present
application and its core idea. Meanwhile, for those of ordinary skill in the art, changes may
be made to the specific implementations and the scope of application in accordance with the
idea of the present application. In summary, the content of this specification should not be
construed as limiting the present application.
Claims (15)
1. A method for processing text data, characterized by comprising:
Obtaining first text data;
Determining whether the first text data is suitable for analogy; if so, extracting a first
entity word from the first text data;
Performing analogy on the first entity word to obtain a second entity word;
Generating second text data according to the second entity word.
2. The method according to claim 1, characterized in that the step of determining whether the
first text data is suitable for analogy comprises:
Performing word segmentation on the first text data to obtain a plurality of first text
segments;
Matching the plurality of first text segments of the first text data against preset analogy
question templates;
When the matching succeeds, determining that the first text data is suitable for analogy.
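As an illustration of the template matching in claim 2, the following is a minimal sketch. The template representation (token lists with an `{E}` slot for each entity word), the relation-type labels, the function names and the whitespace tokenizer are all assumptions made for this example; the claim does not specify them, and Chinese text would need a real word segmenter in place of `segment`.

```python
from typing import List, Optional, Tuple

# Hypothetical preset analogy question templates, keyed by relation type.
# "{E}" marks a slot that captures an entity word.
QUESTION_TEMPLATES = {
    "similar_to": ["what", "is", "similar", "to", "{E}"],
    "a_is_to_b": ["{E}", "is", "to", "{E}", "as", "what", "is", "to", "what"],
}


def segment(text: str) -> List[str]:
    """Stand-in for word segmentation: lowercase, drop '?', split on spaces."""
    return text.lower().replace("?", " ").split()


def match_template(tokens: List[str], template: List[str]) -> Optional[List[str]]:
    """Return the tokens captured by '{E}' slots, or None when nothing matches."""
    if len(tokens) != len(template):
        return None
    captured = []
    for token, slot in zip(tokens, template):
        if slot == "{E}":
            captured.append(token)
        elif token != slot:
            return None
    return captured


def is_suitable_for_analogy(text: str) -> Optional[Tuple[str, List[str]]]:
    """Return (relation type, entity words) for the first matching template."""
    tokens = segment(text)
    for relation, template in QUESTION_TEMPLATES.items():
        entities = match_template(tokens, template)
        if entities is not None:
            return relation, entities
    return None
```

A return value of `None` corresponds to text that is not suitable for analogy; otherwise the captured tokens play the role of the first entity word or words.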
3. The method according to claim 1 or 2, characterized in that the step of performing analogy
on the first entity word to obtain a second entity word comprises:
When there is one first entity word, searching for one or more first candidate entity words
similar to the first entity word;
Screening, from the one or more first candidate entity words, one or more second candidate
entity words whose entity word type is the same as that of the first entity word;
Selecting one or more second entity words from the one or more second candidate entity words.
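The screening and selection steps of claim 3 could be sketched as below. The `ENTITY_TYPES` lookup table, its labels and the `max_results` cutoff are hypothetical; the claim does not state how an entity word type is obtained or how many second entity words are selected.

```python
# Hypothetical entity-type dictionary; a real system might use a knowledge base.
ENTITY_TYPES = {
    "apple": "fruit",
    "pear": "fruit",
    "peach": "fruit",
    "juice": "drink",
    "orchard": "place",
}


def screen_by_type(entity, ranked_candidates, max_results=2):
    """Keep candidates whose type equals the type of `entity`, preserving rank.

    `ranked_candidates` is assumed to be ordered from most to least similar.
    """
    wanted = ENTITY_TYPES.get(entity)
    if wanted is None:
        return []  # unknown type: nothing can be confirmed as the same type
    same_type = [c for c in ranked_candidates if ENTITY_TYPES.get(c) == wanted]
    return same_type[:max_results]
```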
4. The method according to claim 3, characterized in that the step of searching for one or
more first candidate entity words similar to the first entity word comprises:
Querying a first word vector of the first entity word and one or more second word vectors of
one or more first candidate entity words;
Calculating one or more first similarities based on the first word vector and the one or more
second word vectors;
Extracting the one or more first candidate entity words with the highest first similarity as
the one or more first candidate entity words similar to the first entity word.
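The similarity lookup of claim 4 can be sketched as follows. Cosine similarity is an assumption here, since the claim only says that a similarity is calculated from the word vectors; the two-dimensional `TOY_VECTORS` are made-up values for illustration, not trained embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, top_k=2):
    """Return the top_k (candidate, similarity) pairs for `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up vectors for illustration only.
TOY_VECTORS = {
    "apple": [1.0, 0.0],
    "pear": [0.9, 0.1],
    "juice": [0.6, 0.8],
    "car": [0.0, 1.0],
}
```

For example, `most_similar("apple", TOY_VECTORS)` ranks "pear" ahead of "juice" with these toy values.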
5. The method according to claim 1, 2 or 4, characterized in that the step of performing
analogy on the first entity word to obtain a second entity word comprises:
When the first entity word includes a first sub-entity word and a second sub-entity word,
searching for one or more third candidate entity words similar to the first sub-entity word;
Screening, from the one or more third candidate entity words, one or more fourth candidate
entity words whose entity word type is the same as that of the first sub-entity word;
Calculating one or more fifth candidate entity words based on the first sub-entity word, the
second sub-entity word and the one or more fourth candidate entity words;
Screening, from the one or more fifth candidate entity words, one or more sixth candidate
entity words whose entity word type is the same as that of the second sub-entity word;
Selecting a second entity word from the one or more fourth candidate entity words and the one
or more sixth candidate entity words.
6. The method according to claim 5, characterized in that the step of searching for one or
more third candidate entity words similar to the first sub-entity word comprises:
Querying a third word vector of the first sub-entity word and one or more fourth word vectors
of one or more third candidate entity words;
Calculating one or more second similarities based on the third word vector and the one or more
fourth word vectors;
Extracting the one or more third candidate entity words with the highest second similarity as
the one or more third candidate entity words similar to the first sub-entity word.
7. The method according to claim 5, characterized in that the step of calculating one or more
fifth candidate entity words based on the first sub-entity word, the second sub-entity word and
the one or more fourth candidate entity words comprises:
Querying the third word vector of the first sub-entity word, one or more fourth word vectors of
the one or more fourth candidate entity words, and a fifth word vector of the second sub-entity
word;
Taking the third word vector as a base, subtracting the fifth word vector and adding the fourth
word vector to obtain a sixth word vector;
When a seventh word vector of an entity word is nearest to the sixth word vector, confirming
that entity word as a fifth candidate entity word.
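The vector arithmetic of claim 7 can be sketched as below. The order of subtraction and addition follows the translated claim text (third minus fifth plus fourth) and is worth checking against the original-language claim. Euclidean distance is an assumption, since the claim only says "nearest", and the function names are hypothetical.

```python
import math


def target_vector(third, fifth, fourth):
    """Sixth word vector = third - fifth + fourth, in the order given in claim 7."""
    return [t - f + c for t, f, c in zip(third, fifth, fourth)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_entity(target, vectors, exclude=()):
    """Return (entity word, distance) for the vector closest to `target`.

    `exclude` lists words to skip, for example the input words themselves.
    Returns (None, inf) when no word is eligible.
    """
    best_word, best_distance = None, float("inf")
    for word, vec in vectors.items():
        if word in exclude:
            continue
        distance = euclidean(target, vec)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance
```

The word returned by `nearest_entity` plays the role of a fifth candidate entity word, and the returned distance can be reused when candidates are scored later.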
8. The method according to claim 7, characterized in that the step of selecting a second
entity word from the one or more fourth candidate entity words and the one or more sixth
candidate entity words comprises:
Calculating a first distance based on the third word vector of the first sub-entity word and
the fourth word vector of the fourth candidate entity word;
Calculating a second distance based on the seventh word vector and the sixth word vector of the
sixth candidate entity word;
Calculating scores for the fourth candidate entity word and the sixth candidate entity word
using the first distance and the second distance;
Selecting the fourth candidate entity word and the sixth candidate entity word with the highest
score as the second entity word.
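Claim 8 does not give a scoring formula, so the sketch below assumes one: the score is `1 / (1 + d1 + d2)`, which rises as either distance shrinks. Both the formula and the tuple layout are assumptions for illustration only.

```python
def pair_score(first_distance, second_distance):
    """Assumed scoring rule: smaller combined distance gives a higher score."""
    return 1.0 / (1.0 + first_distance + second_distance)


def choose_pair(pairs):
    """Pick the (fourth candidate, sixth candidate) pair with the highest score.

    `pairs` is a non-empty iterable of tuples
    (fourth_candidate, sixth_candidate, first_distance, second_distance).
    """
    best = max(pairs, key=lambda p: pair_score(p[2], p[3]))
    return best[0], best[1]
```

Any monotonically decreasing function of the two distances would produce the same ranking as long as it combines them through their sum; a weighted combination would be an equally plausible reading of the claim.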
9. The method according to claim 2, characterized in that the step of generating second text
data according to the second entity word comprises:
Searching for an analogy answer template that belongs to the same relation type as the analogy
question template;
Embedding the second entity word into the analogy answer template to obtain the second text
data.
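The answer generation of claim 9 could look like the sketch below. The relation-type keys and the wording of the answer templates are hypothetical; the claim only requires that the answer template share a relation type with the matched question template.

```python
# Hypothetical analogy answer templates, keyed by the same relation types
# that the question templates would use.
ANSWER_TEMPLATES = {
    "similar_to": "Something similar would be {0}.",
    "a_is_to_b": "Perhaps {0} and {1}.",
}


def generate_answer(relation_type, second_entity_words):
    """Embed the second entity word(s) into the answer template, if one exists."""
    template = ANSWER_TEMPLATES.get(relation_type)
    if template is None:
        return None  # no answer template for this relation type
    return template.format(*second_entity_words)
```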
10. The method according to claim 1, 2, 4, 6, 7, 8 or 9, characterized by further comprising:
When first speech data sent by a client is received, converting the first speech data into the
first text data;
Converting the second text data into second speech data;
Returning the second speech data to the client.
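The speech wrapper of claim 10 can be sketched with injected components. The three callables below are placeholders for a speech recognizer, the text processing method and a speech synthesizer; none of them refers to a real library API.

```python
from typing import Callable


def handle_speech_request(
    first_speech: bytes,
    speech_to_text: Callable[[bytes], str],
    process_text: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> bytes:
    """Convert speech to text, run the text method, and return synthesized speech."""
    first_text = speech_to_text(first_speech)   # first speech data -> first text data
    second_text = process_text(first_text)      # first text data -> second text data
    return text_to_speech(second_text)          # second text data -> second speech data
```

Passing the components in as arguments keeps the text processing independent of any particular recognition or synthesis engine, which also makes the flow easy to exercise with stand-in functions.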
11. An apparatus for processing text data, characterized by comprising:
A first text data acquisition module, configured to obtain first text data;
An analogy intent determination module, configured to determine whether the first text data is
suitable for analogy and, if so, to call an entity word extraction module;
The entity word extraction module, configured to extract a first entity word from the first
text data;
An entity word analogy module, configured to perform analogy on the first entity word to obtain
a second entity word;
A second text data generation module, configured to generate second text data according to the
second entity word.
12. The apparatus according to claim 11, characterized in that the analogy intent determination
module comprises:
A word segmentation submodule, configured to perform word segmentation on the first text data
to obtain a plurality of first text segments;
An analogy question template matching submodule, configured to match the plurality of first
text segments of the first text data against preset analogy question templates;
An analogy intent determination submodule, configured to determine, when the matching succeeds,
that the first text data is suitable for analogy.
13. The apparatus according to claim 11 or 12, characterized in that the entity word analogy
module comprises:
A first candidate entity word search submodule, configured to search, when there is one first
entity word, for one or more first candidate entity words similar to the first entity word;
A second candidate entity word screening submodule, configured to screen, from the one or more
first candidate entity words, one or more second candidate entity words whose entity word type
is the same as that of the first entity word;
A second entity word selection submodule, configured to select one or more second entity words
from the one or more second candidate entity words.
14. The apparatus according to claim 11 or 12, characterized in that the entity word analogy
module comprises:
A third candidate entity word search submodule, configured to search, when the first entity
word includes a first sub-entity word and a second sub-entity word, for one or more third
candidate entity words similar to the first sub-entity word;
A fourth candidate entity word screening submodule, configured to screen, from the one or more
third candidate entity words, one or more fourth candidate entity words whose entity word type
is the same as that of the first sub-entity word;
A fifth candidate entity word calculation submodule, configured to calculate one or more fifth
candidate entity words based on the first sub-entity word, the second sub-entity word and the
one or more fourth candidate entity words;
A sixth candidate entity word screening submodule, configured to screen, from the one or more
fifth candidate entity words, one or more sixth candidate entity words whose entity word type
is the same as that of the second sub-entity word;
A second entity word selection submodule, configured to select a second entity word from the
one or more fourth candidate entity words and the one or more sixth candidate entity words.
15. The apparatus according to claim 12, characterized in that the second text data generation
module comprises:
An analogy answer template search submodule, configured to search for an analogy answer
template that belongs to the same relation type as the analogy question template;
An analogy answer template embedding submodule, configured to embed the second entity word into
the analogy answer template to obtain the second text data.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610031796.6A CN106980624B (en) | 2016-01-18 | 2016-01-18 | Text data processing method and device |
US15/404,855 US10176804B2 (en) | 2016-01-18 | 2017-01-12 | Analyzing textual data |
PCT/US2017/013388 WO2017127296A1 (en) | 2016-01-18 | 2017-01-13 | Analyzing textual data |
EP17741788.8A EP3405912A4 (en) | 2016-01-18 | 2017-01-13 | Analyzing textual data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610031796.6A CN106980624B (en) | 2016-01-18 | 2016-01-18 | Text data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106980624A (en) | 2017-07-25 |
CN106980624B (en) | 2021-03-26 |
Family
ID=59314671
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610031796.6A Active CN106980624B (en) | 2016-01-18 | 2016-01-18 | Text data processing method and device |
Country Status (4)
Country | Link |
---|---|
US (1) | US10176804B2 (en) |
EP (1) | EP3405912A4 (en) |
CN (1) | CN106980624B (en) |
WO (1) | WO2017127296A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109190102A (en) * | 2018-09-12 | 2019-01-11 | 张连祥 | The system and method that project of inviting outside investment negotiation scheme automatically generates |
CN110750627A (en) * | 2018-07-19 | 2020-02-04 | 上海谦问万答吧云计算科技有限公司 | Material retrieval method and device, electronic equipment and storage medium |
CN112861533A (en) * | 2019-11-26 | 2021-05-28 | 阿里巴巴集团控股有限公司 | Entity word recognition method and device |
Families Citing this family (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10878338B2 (en) * | 2016-10-06 | 2020-12-29 | International Business Machines Corporation | Machine learning of analogic patterns |
US10331718B2 (en) | 2016-10-06 | 2019-06-25 | International Business Machines Corporation | Analogy outcome determination |
WO2018081020A1 (en) | 2016-10-24 | 2018-05-03 | Carlabs Inc. | Computerized domain expert |
US10325024B2 (en) * | 2016-11-30 | 2019-06-18 | International Business Machines Corporation | Contextual analogy response |
US10325025B2 (en) * | 2016-11-30 | 2019-06-18 | International Business Machines Corporation | Contextual analogy representation |
US10901992B2 (en) * | 2017-06-12 | 2021-01-26 | KMS Lighthouse Ltd. | System and method for efficiently handling queries |
US11182706B2 (en) * | 2017-11-13 | 2021-11-23 | International Business Machines Corporation | Providing suitable strategies to resolve work items to participants of collaboration system |
CN109801090A (en) * | 2017-11-16 | 2019-05-24 | 国家新闻出版广电总局广播科学研究院 | The cross-selling method and server of networking products data |
US11586655B2 (en) | 2017-12-19 | 2023-02-21 | Visa International Service Association | Hyper-graph learner for natural language comprehension |
US10572586B2 (en) * | 2018-02-27 | 2020-02-25 | International Business Machines Corporation | Technique for automatically splitting words |
CN108563468B (en) * | 2018-03-30 | 2021-09-21 | 深圳市冠旭电子股份有限公司 | Bluetooth sound box data processing method and device and Bluetooth sound box |
CN110263318B (en) * | 2018-04-23 | 2022-10-28 | 腾讯科技(深圳)有限公司 | Entity name processing method and device, computer readable medium and electronic equipment |
CN108829777A (en) * | 2018-05-30 | 2018-11-16 | 出门问问信息科技有限公司 | A kind of the problem of chat robots, replies method and device |
CN108959256B (en) * | 2018-06-29 | 2023-04-07 | 北京百度网讯科技有限公司 | Short text generation method and device, storage medium and terminal equipment |
JP6988715B2 (en) * | 2018-06-29 | 2022-01-05 | 日本電信電話株式会社 | Answer text selection device, method, and program |
WO2020031242A1 (en) * | 2018-08-06 | 2020-02-13 | 富士通株式会社 | Assessment program, assessment method, and information processing device |
CN109460503B (en) * | 2018-09-14 | 2022-01-14 | 阿里巴巴(中国)有限公司 | Answer input method, answer input device, storage medium and electronic equipment |
JP7159780B2 (en) * | 2018-10-17 | 2022-10-25 | 富士通株式会社 | Correction Content Identification Program and Report Correction Content Identification Device |
CN109635277B (en) * | 2018-11-13 | 2023-05-26 | 北京合享智慧科技有限公司 | Method and related device for acquiring entity information |
CN109783624A (en) * | 2018-12-27 | 2019-05-21 | 联想(北京)有限公司 | Answer generation method, device and the intelligent conversational system in knowledge based library |
CN109902286B (en) * | 2019-01-09 | 2023-12-12 | 千城数智(北京)网络科技有限公司 | Entity identification method and device and electronic equipment |
CN110263167B (en) * | 2019-06-20 | 2022-07-29 | 北京百度网讯科技有限公司 | Medical entity classification model generation method, device, equipment and readable storage medium |
US11741305B2 (en) | 2019-10-07 | 2023-08-29 | The Toronto-Dominion Bank | Systems and methods for automatically assessing fault in relation to motor vehicle collisions |
CN111222317B (en) * | 2019-10-16 | 2022-04-29 | 平安科技(深圳)有限公司 | Sequence labeling method, system and computer equipment |
CN112700203B (en) * | 2019-10-23 | 2022-11-01 | 北京易真学思教育科技有限公司 | Intelligent marking method and device |
CN111738596B (en) * | 2020-06-22 | 2024-03-22 | 中国银行股份有限公司 | Work order dispatching method and device |
CN112115212B (en) * | 2020-09-29 | 2023-10-03 | 中国工商银行股份有限公司 | Parameter identification method and device and electronic equipment |
US11941000B2 (en) | 2021-04-16 | 2024-03-26 | International Business Machines Corporation | Cognitive generation of tailored analogies |
CN113223532B (en) * | 2021-04-30 | 2024-03-05 | 平安科技(深圳)有限公司 | Quality inspection method and device for customer service call, computer equipment and storage medium |
CN113254620B (en) * | 2021-06-21 | 2022-08-30 | 中国平安人寿保险股份有限公司 | Response method, device and equipment based on graph neural network and storage medium |
CN113707131B (en) * | 2021-08-30 | 2024-04-16 | 中国科学技术大学 | Speech recognition method, device, equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6101490A (en) * | 1991-07-19 | 2000-08-08 | Hatton; Charles Malcolm | Computer system program for creating new ideas and solving problems |
CN1647069A (en) * | 2002-04-11 | 2005-07-27 | 株式会社PtoPA | Conversation control system and conversation control method |
CN1752966A (en) * | 2004-09-24 | 2006-03-29 | 北京亿维讯科技有限公司 | Method of solving problem using wikipedia and user inquiry treatment technology |
CN1794233A (en) * | 2005-12-28 | 2006-06-28 | 刘文印 | Network user interactive asking answering method and its system |
US20070209069A1 (en) * | 2006-03-03 | 2007-09-06 | Motorola, Inc. | Push-to-ask protocol layer provisioning and usage method |
US20090282114A1 (en) * | 2008-05-08 | 2009-11-12 | Junlan Feng | System and method for generating suggested responses to an email |
CN103902652A (en) * | 2014-02-27 | 2014-07-02 | 深圳市智搜信息技术有限公司 | Automatic question-answering system |
US20140358890A1 (en) * | 2013-06-04 | 2014-12-04 | Sap Ag | Question answering framework |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5519608A (en) | 1993-06-24 | 1996-05-21 | Xerox Corporation | Method for extracting from a text corpus answers to questions stated in natural language by using linguistic analysis and hypothesis generation |
US6778970B2 (en) * | 1998-05-28 | 2004-08-17 | Lawrence Au | Topological methods to organize semantic network data flows for conversational applications |
US7072883B2 (en) * | 2001-12-21 | 2006-07-04 | Ut-Battelle Llc | System for gathering and summarizing internet information |
US20060224569A1 (en) | 2005-03-31 | 2006-10-05 | Desanto John A | Natural language based search engine and methods of use therefor |
FR2906049A1 (en) * | 2006-09-19 | 2008-03-21 | Alcatel Sa | COMPUTER-IMPLEMENTED METHOD OF DEVELOPING ONTOLOGY FROM NATURAL LANGUAGE TEXT |
US7415409B2 (en) * | 2006-12-01 | 2008-08-19 | Coveo Solutions Inc. | Method to train the language model of a speech recognition system to convert and index voicemails on a search engine |
JP4355772B2 (en) * | 2007-02-19 | 2009-11-04 | パナソニック株式会社 | Force conversion device, speech conversion device, speech synthesis device, speech conversion method, speech synthesis method, and program |
JP5119700B2 (en) * | 2007-03-20 | 2013-01-16 | 富士通株式会社 | Prosody modification device, prosody modification method, and prosody modification program |
US8737975B2 (en) * | 2009-12-11 | 2014-05-27 | At&T Mobility Ii Llc | Audio-based text messaging |
US20140006012A1 (en) | 2012-07-02 | 2014-01-02 | Microsoft Corporation | Learning-Based Processing of Natural Language Questions |
US9798799B2 (en) | 2012-11-15 | 2017-10-24 | Sri International | Vehicle personal assistant that interprets spoken natural language input based upon vehicle context |
US9251474B2 (en) | 2013-03-13 | 2016-02-02 | International Business Machines Corporation | Reward based ranker array for question answer system |
US9189742B2 (en) * | 2013-11-20 | 2015-11-17 | Justin London | Adaptive virtual intelligent agent |
US9483582B2 (en) * | 2014-09-12 | 2016-11-01 | International Business Machines Corporation | Identification and verification of factual assertions in natural language |
Application Status
Filing Date | Application Number | Publication | Status |
---|---|---|---|
2016-01-18 | CN201610031796.6A | CN106980624B (en) | Active |
2017-01-12 | US15/404,855 | US10176804B2 (en) | Active |
2017-01-13 | PCT/US2017/013388 | WO2017127296A1 (en) | Unknown |
2017-01-13 | EP17741788.8A | EP3405912A4 (en) | Not active (withdrawn) |
Also Published As
Publication number | Publication date |
---|---|
EP3405912A1 (en) | 2018-11-28 |
WO2017127296A1 (en) | 2017-07-27 |
US10176804B2 (en) | 2019-01-08 |
EP3405912A4 (en) | 2019-06-26 |
CN106980624B (en) | 2021-03-26 |
US20170206897A1 (en) | 2017-07-20 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN106980624A (en) | A kind for the treatment of method and apparatus of text data | |
CN104915340B (en) | Natural language question-answering method and device | |
JP6222821B2 (en) | Error correction model learning device and program | |
CN107818164A (en) | A kind of intelligent answer method and its system | |
CN108287858A (en) | The semantic extracting method and device of natural language | |
CN108288468A (en) | Audio recognition method and device | |
CN105631468A (en) | RNN-based automatic picture description generation method | |
CN108711421A (en) | A kind of voice recognition acoustic model method for building up and device and electronic equipment | |
CN108304375A (en) | A kind of information identifying method and its equipment, storage medium, terminal | |
CN110619050B (en) | Intention recognition method and device | |
CN104572631B (en) | The training method and system of a kind of language model | |
CN109976702A (en) | A kind of audio recognition method, device and terminal | |
CN113095080B (en) | Theme-based semantic recognition method and device, electronic equipment and storage medium | |
CN109582954A (en) | Method and apparatus for output information | |
CN107943940A (en) | Data processing method, medium, system and electronic equipment | |
CN108345612A (en) | A kind of question processing method and device, a kind of device for issue handling | |
CN109710732A (en) | Information query method, device, storage medium and electronic equipment | |
CN112650842A (en) | Human-computer interaction based customer service robot intention recognition method and related equipment | |
CN114218488A (en) | Information recommendation method and device based on multi-modal feature fusion and processor | |
CN115099239B (en) | Resource identification method, device, equipment and storage medium | |
CN112883182A (en) | Question-answer matching method and device based on machine reading | |
CN113342948A (en) | Intelligent question and answer method and device | |
CN116091836A (en) | Multi-mode visual language understanding and positioning method, device, terminal and medium | |
CN113609264B (en) | Data query method and device for power system nodes | |
CN113297387B (en) | News detection method for image-text mismatching based on NKD-GNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |
 | TR01 | Transfer of patent right | Effective date of registration: 20211116. Address after: Room 554, floor 5, building 3, No. 969, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province. Patentee after: Taobao (China) Software Co., Ltd. Address before: P.O. Box 847, 4th floor, Grand Cayman capital building, British Cayman Islands. Patentee before: Alibaba Group Holdings Limited |