CN106980624A - Method and apparatus for processing text data - Google Patents

Method and apparatus for processing text data

Info

Publication number
CN106980624A
CN106980624A CN201610031796.6A CN201610031796A CN106980624A CN 106980624 A CN106980624 A CN 106980624A CN 201610031796 A CN201610031796 A CN 201610031796A CN 106980624 A CN106980624 A CN 106980624A
Authority
CN
China
Prior art keywords
word
candidate
entity
instance
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610031796.6A
Other languages
Chinese (zh)
Other versions
CN106980624B (en)
Inventor
江会星
孙健
初敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taobao China Software Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201610031796.6A (CN106980624B)
Priority to US15/404,855 (US10176804B2)
Priority to PCT/US2017/013388 (WO2017127296A1)
Priority to EP17741788.8A (EP3405912A4)
Publication of CN106980624A
Application granted
Publication of CN106980624B
Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application provides a method and apparatus for processing text data. The method includes: obtaining first text data; judging whether the first text data is suitable for analogy; if so, extracting a first entity word from the first text data; performing analogy on the first entity word to obtain a second entity word; and generating second text data according to the second entity word. The embodiment builds word vectors directly from a large amount of unlabeled text to produce analogy answers, so no knowledge base needs to be built, which reduces the consumption of manpower and material resources and lowers cost. Rather than replying directly with the exact relation between the two entities, it replies by way of analogy, which improves coverage and the success rate of replies to analogy questions.

Description

Method and apparatus for processing text data
Technical field
The present application relates to the technical field of text processing, and in particular to a method for processing text data and an apparatus for processing text data.
Background
With the development of science and technology, demand for computers that respond intelligently to speech or text is becoming increasingly widespread, and many intelligent chat robots have appeared in succession.
In speech or text responses, analogy questions are relatively common, for example "What is the relationship between Xiao Ming and Xiao Hong?".
At present, intelligent chat robots generally derive the similarity or analogy relation between two entities from RDF (Resource Description Framework) data, and thereby answer analogy questions.
Answering a question about the relation between two entities from an RDF knowledge base requires that a complete RDF knowledge base be built in advance.
Building an RDF knowledge base generally requires iterating over three steps: mining relation templates, cleaning encyclopedia-type data, and extracting relations. This consumes substantial manpower and material resources and is costly, yet coverage remains low, so the success rate of replies to analogy questions is low.
For example, if a crawled piece of gossip news states that "Liu Dehua and Cheng Long are close buddies", then information such as Liu Dehua, Cheng Long, and the relation "close buddies" is recorded in the RDF knowledge base.
If a user sends the question "What is the relationship between Liu Dehua and Cheng Long?", the relation "close buddies" is found in the RDF knowledge base and the answer is "close buddies".
If that gossip news was not crawled beforehand, the question cannot be answered, and the robot may sidestep it with a reply such as "What relationship is it?".
In addition, RDF-based replies are in question-and-answer form; in a chat system they may fail to produce an answer, and they sometimes lack humanlike, humorous expressiveness.
Summary of the invention
In view of the above problems, embodiments of the present application are proposed to provide a method for processing text data and a corresponding apparatus for processing text data that overcome the above problems or at least partially solve them.
To solve the above problems, an embodiment of the present application discloses a method for processing text data, including:
obtaining first text data;
judging whether the first text data is suitable for analogy; if so, extracting a first entity word from the first text data;
performing analogy on the first entity word to obtain a second entity word;
generating second text data according to the second entity word.
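The four steps above can be sketched as a small pipeline. This is only an illustration of the control flow, not the patented implementation: the function name `answer_by_analogy`, its callable parameters, and the toy data are all assumptions introduced here.

```python
from typing import Callable, List, Optional


def answer_by_analogy(
    first_text: str,
    is_suitable: Callable[[str], bool],
    extract_entities: Callable[[str], List[str]],
    analogize: Callable[[List[str]], List[str]],
    generate: Callable[[List[str]], str],
) -> Optional[str]:
    """Chain the four steps; return None when the text is not suitable for analogy."""
    if not is_suitable(first_text):
        return None
    first_entities = extract_entities(first_text)
    second_entities = analogize(first_entities)
    return generate(second_entities)


# Toy stand-ins, invented purely to exercise the control flow.
PREFIX = "who is the good friend of "
TOY_ANALOGIES = {"desk lamp": "bookshelf"}


def demo(text: str) -> Optional[str]:
    return answer_by_analogy(
        text,
        is_suitable=lambda t: t.startswith(PREFIX),
        extract_entities=lambda t: [t[len(PREFIX):]],
        analogize=lambda words: [TOY_ANALOGIES[w] for w in words],
        generate=lambda words: "Perhaps it is the " + words[0] + ".",
    )
```

Passing each step in as a callable keeps the sketch independent of any particular segmentation, template, or word-vector component.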
Preferably, the step of judging whether the first text data is suitable for analogy includes:
performing word segmentation on the first text data to obtain a plurality of first text word segments;
matching the plurality of first text word segments of the first text data against a preset analogy question template;
when the match succeeds, determining that the first text data is suitable for analogy.
Preferably, the step of performing analogy on the first entity word to obtain a second entity word includes:
when there is one first entity word, searching for one or more first candidate entity words similar to the first entity word;
screening, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;
selecting one or more second entity words from the one or more second candidate entity words.
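A minimal sketch of the screening and selection steps for a single entity word follows. The entity-type table, the similarity values, and the rule of keeping the most similar survivors are assumptions made for illustration; the text only states that candidates of the same type are kept and that one or more of them are selected.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, float]  # (candidate entity word, similarity to the first entity word)


def screen_by_type(
    first_entity: str,
    candidates: List[Candidate],
    entity_types: Dict[str, str],
) -> List[Candidate]:
    """Keep only candidates whose entity type equals that of the first entity word."""
    wanted = entity_types.get(first_entity)
    if wanted is None:
        return []
    return [(word, sim) for word, sim in candidates if entity_types.get(word) == wanted]


def select_second_entities(screened: List[Candidate], limit: int = 1) -> List[str]:
    """Assumed selection rule: take the `limit` most similar screened candidates."""
    ranked = sorted(screened, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:limit]]


# Hypothetical type table and similarity scores.
TYPES = {"desk lamp": "object", "bookshelf": "object", "wall lamp": "object", "Liu Dehua": "person"}
FIRST_CANDIDATES = [("Liu Dehua", 0.9), ("bookshelf", 0.8), ("wall lamp", 0.7)]

second_candidates = screen_by_type("desk lamp", FIRST_CANDIDATES, TYPES)
second_entities = select_second_entities(second_candidates)
```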
Preferably, the step of searching for one or more first candidate entity words similar to the first entity word includes:
looking up a first word vector of the first entity word and one or more second word vectors of one or more first candidate entity words;
calculating one or more first similarities based on the first word vector and the one or more second word vectors;
extracting the one or more first candidate entity words with the highest first similarity as the one or more first candidate entity words similar to the first entity word.
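The lookup, similarity, and extraction steps might look like the sketch below. Cosine similarity is assumed here because it is a common choice for word vectors; the text does not name a specific similarity measure. The three-dimensional vectors are invented toy values.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(word: str, vectors: Dict[str, Vector], top_k: int = 2) -> List[Tuple[str, float]]:
    """Return the top_k words most similar to `word`, excluding the word itself."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical word-vector table.
TOY_VECTORS = {
    "desk lamp": [1.0, 0.0, 0.0],
    "wall lamp": [0.9, 0.1, 0.0],
    "bookshelf": [0.6, 0.8, 0.0],
    "singer": [0.0, 0.0, 1.0],
}

first_candidates = most_similar("desk lamp", TOY_VECTORS)
```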
Preferably, the step of performing analogy on the first entity word to obtain a second entity word includes:
when the first entity word includes a first sub-entity word and a second sub-entity word, searching for one or more third candidate entity words similar to the first sub-entity word;
screening, from the one or more third candidate entity words, one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;
calculating one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words;
screening, from the one or more fifth candidate entity words, one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;
choosing the second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words.
Preferably, the step of searching for one or more third candidate entity words similar to the first sub-entity word includes:
looking up a third word vector of the first sub-entity word and one or more fourth word vectors of one or more third candidate entity words;
calculating one or more second similarities based on the third word vector and the one or more fourth word vectors;
extracting the one or more third candidate entity words with the highest second similarity as the one or more third candidate entity words similar to the first sub-entity word.
Preferably, the step of calculating one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words includes:
looking up the third word vector of the first sub-entity word, the one or more fourth word vectors of the one or more fourth candidate entity words, and a fifth word vector of the second sub-entity word;
subtracting the fifth word vector from the third word vector and adding the fourth word vector to obtain a sixth word vector;
when a seventh word vector of an entity word is nearest to the sixth word vector, confirming that entity word as a fifth candidate entity word.
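The vector arithmetic and nearest-neighbor step can be sketched as follows. The operand order (third minus fifth plus fourth) follows the text as written. Euclidean distance, the `exclude` parameter, and the two-dimensional toy vectors are assumptions made for illustration.

```python
import math
from typing import Dict, Iterable, List, Optional

Vector = List[float]


def sixth_vector(third: Vector, fifth: Vector, fourth: Vector) -> Vector:
    """Compute third - fifth + fourth, component by component."""
    return [a - b + c for a, b, c in zip(third, fifth, fourth)]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_entity(
    target: Vector,
    vectors: Dict[str, Vector],
    exclude: Iterable[str] = (),
) -> Optional[str]:
    """Return the entity word whose vector is nearest to `target`, skipping excluded words."""
    skipped = set(exclude)
    best_word, best_distance = None, math.inf
    for word, vec in vectors.items():
        if word in skipped:
            continue
        distance = euclidean(target, vec)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word


# Hypothetical vectors; the keys name each word's role, not real entities.
TOY = {
    "first_sub": [1.0, 1.0],
    "second_sub": [3.0, 1.0],
    "fourth_candidate": [1.0, 2.0],
    "other_a": [-1.0, 2.0],
    "other_b": [5.0, 5.0],
}

v6 = sixth_vector(TOY["first_sub"], TOY["second_sub"], TOY["fourth_candidate"])
fifth_candidate = nearest_entity(v6, TOY, exclude=("first_sub", "second_sub", "fourth_candidate"))
```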
Preferably, the step of choosing the second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words includes:
calculating a first distance based on the third word vector of the first sub-entity word and the fourth word vector of the fourth candidate entity word;
calculating a second distance based on the seventh word vector and the sixth word vector of the sixth candidate entity word;
calculating a score for the fourth candidate entity word and the sixth candidate entity word using the first distance and the second distance;
choosing the highest-scoring fourth candidate entity word and sixth candidate entity word as the second entity word.
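The text does not give the scoring formula, so the sketch below assumes one for illustration: the score is the negated sum of the two distances, so that closer pairs score higher. Euclidean distance, the `CandidatePair` container, and the toy values are likewise hypothetical.

```python
import math
from typing import List, NamedTuple, Tuple

Vector = List[float]


class CandidatePair(NamedTuple):
    fourth_word: str
    sixth_word: str
    fourth_vector: Vector   # vector of the fourth candidate entity word
    sixth_vector: Vector    # vector produced by the add/subtract step
    seventh_vector: Vector  # vector of the entity word found nearest to it


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def score(third_vector: Vector, pair: CandidatePair) -> float:
    """Assumed scoring rule: negated sum of the first and second distances."""
    first_distance = euclidean(third_vector, pair.fourth_vector)
    second_distance = euclidean(pair.seventh_vector, pair.sixth_vector)
    return -(first_distance + second_distance)


def choose_second_entity(third_vector: Vector, pairs: List[CandidatePair]) -> Tuple[str, str]:
    """Return the (fourth, sixth) candidate words of the highest-scoring pair."""
    best = max(pairs, key=lambda pair: score(third_vector, pair))
    return best.fourth_word, best.sixth_word


# Hypothetical data.
THIRD = [0.0, 0.0]
PAIRS = [
    CandidatePair("p", "q", [1.0, 0.0], [2.0, 2.0], [2.0, 3.0]),
    CandidatePair("r", "s", [3.0, 4.0], [0.0, 0.0], [0.0, 0.5]),
]

chosen = choose_second_entity(THIRD, PAIRS)
```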
Preferably, the step of generating second text data according to the second entity word includes:
searching for an analogy answer template that belongs to the same relation type as the analogy question template;
embedding the second entity word in the analogy answer template to obtain the second text data.
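A minimal sketch of the answer-generation step is shown below. The relation-type keys, the wording of the answer templates, and the `{arg1}`/`{arg2}` placeholder syntax are assumptions; the text only says that an answer template of the same relation type is found and the entity words are embedded in it.

```python
from typing import Dict, List, Optional

# Hypothetical answer templates keyed by relation type.
ANSWER_TEMPLATES: Dict[str, str] = {
    "still_life": "Its good friend is probably the {arg1}.",
    "gossip": "They are just like {arg1} and {arg2}.",
}


def generate_second_text(relation_type: str, second_entities: List[str]) -> Optional[str]:
    """Fill the reserved entity positions of the matching answer template."""
    template = ANSWER_TEMPLATES.get(relation_type)
    if template is None:
        return None
    slots = {"arg" + str(i): word for i, word in enumerate(second_entities, start=1)}
    try:
        return template.format(**slots)
    except KeyError:
        # Fewer entity words were supplied than the template reserves positions for.
        return None


reply = generate_second_text("gossip", ["X", "Y"])
```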
Preferably, the method further includes:
when first speech data sent by a client is received, converting the first speech data into the first text data;
converting the second text data into second speech data;
returning the second speech data to the client.
An embodiment of the present application also discloses an apparatus for processing text data, including:
a first text data acquisition module, configured to obtain first text data;
an analogy intention judgment module, configured to judge whether the first text data is suitable for analogy and, if so, to call an entity word extraction module;
the entity word extraction module, configured to extract a first entity word from the first text data;
an entity word analogy module, configured to perform analogy on the first entity word to obtain a second entity word;
a second text data generation module, configured to generate second text data according to the second entity word.
Preferably, the analogy intention judgment module includes:
a word segmentation submodule, configured to perform word segmentation on the first text data to obtain a plurality of first text word segments;
an analogy question template matching submodule, configured to match the plurality of first text word segments of the first text data against a preset analogy question template;
an analogy intention determination submodule, configured to determine, when the match succeeds, that the first text data is suitable for analogy.
Preferably, the entity word analogy module includes:
a first candidate entity word search submodule, configured to search, when there is one first entity word, for one or more first candidate entity words similar to the first entity word;
a second candidate entity word screening submodule, configured to screen, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;
a second entity word selection submodule, configured to select one or more second entity words from the one or more second candidate entity words.
Preferably, the first candidate entity word search submodule includes:
a first word vector query unit, configured to look up a first word vector of the first entity word and one or more second word vectors of one or more first candidate entity words;
a first similarity calculation unit, configured to calculate one or more first similarities based on the first word vector and the one or more second word vectors;
a first candidate entity word extraction unit, configured to extract the one or more first candidate entity words with the highest first similarity as the one or more first candidate entity words similar to the first entity word.
Preferably, the entity word analogy module includes:
a third candidate entity word search submodule, configured to search, when the first entity word includes a first sub-entity word and a second sub-entity word, for one or more third candidate entity words similar to the first sub-entity word;
a fourth candidate entity word screening submodule, configured to screen, from the one or more third candidate entity words, one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;
a fifth candidate entity word calculation submodule, configured to calculate one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words;
a sixth candidate entity word screening submodule, configured to screen, from the one or more fifth candidate entity words, one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;
a second entity word choosing submodule, configured to choose the second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words.
Preferably, the third candidate entity word search submodule includes:
a second word vector query unit, configured to look up a third word vector of the first sub-entity word and one or more fourth word vectors of one or more third candidate entity words;
a second similarity calculation unit, configured to calculate one or more second similarities based on the third word vector and the one or more fourth word vectors;
a third candidate entity word extraction unit, configured to extract the one or more third candidate entity words with the highest second similarity as the one or more third candidate entity words similar to the first sub-entity word.
Preferably, the fifth candidate entity word calculation submodule includes:
a third word vector query unit, configured to look up the third word vector of the first sub-entity word, the one or more fourth word vectors of the one or more fourth candidate entity words, and a fifth word vector of the second sub-entity word;
a vector calculation unit, configured to subtract the fifth word vector from the third word vector and add the fourth word vector to obtain a sixth word vector;
a fifth candidate entity word determination unit, configured to confirm, when a seventh word vector of an entity word is nearest to the sixth word vector, that entity word as a fifth candidate entity word.
Preferably, the second entity word choosing submodule includes:
a first distance calculation unit, configured to calculate a first distance based on the third word vector of the first sub-entity word and the fourth word vector of the fourth candidate entity word;
a second distance calculation unit, configured to calculate a second distance based on the seventh word vector and the sixth word vector of the sixth candidate entity word;
a score calculation unit, configured to calculate a score for the fourth candidate entity word and the sixth candidate entity word using the first distance and the second distance;
a choosing unit, configured to choose the highest-scoring fourth candidate entity word and sixth candidate entity word as the second entity word.
Preferably, the second text data generation module includes:
an analogy answer template search submodule, configured to search for an analogy answer template that belongs to the same relation type as the analogy question template;
an analogy answer template embedding submodule, configured to embed the second entity word in the analogy answer template to obtain the second text data.
Preferably, the apparatus further includes:
a text conversion module, configured to convert, when first speech data sent by a client is received, the first speech data into the first text data;
a speech conversion module, configured to convert the second text data into second speech data;
a speech return module, configured to return the second speech data to the client.
Embodiments of the present application include the following advantages:
When an embodiment of the present application confirms that the first text data has an analogy intention, it performs analogy on the first entity word of the first text data to obtain a second entity word and then generates second text data. Word vectors are built directly from a large amount of unlabeled text to produce analogy answers, so no knowledge base needs to be built, which reduces the consumption of manpower and material resources and lowers cost. Rather than replying directly with the exact relation between the two entities, the embodiment replies by way of analogy, which improves coverage and the success rate of replies to analogy questions.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of an embodiment of a method for processing text data of the present application;
Fig. 2A and Fig. 2B are example diagrams of analogy question templates of an embodiment of the present application;
Fig. 3 is a structure diagram of a CBOW model of an embodiment of the present application;
Fig. 4 is a structural block diagram of an embodiment of an apparatus for processing text data of the present application.
Detailed description
To make the above objects, features, and advantages of the present application more apparent and easier to understand, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, a flow chart of the steps of an embodiment of a method for processing text data of the present application is shown, which may specifically include the following steps:
Step 101, obtain first text data;
It should be noted that embodiments of the present application can be applied in artificial intelligence applications such as chat robots and voice assistants.
The artificial intelligence application can be deployed locally on a terminal, for example a mobile phone, a tablet computer, or a smart wearable device (such as a bracelet, watch, or glasses), or it can be deployed in the cloud or on a server, for example in a distributed system; the embodiments of the present application place no limitation on this.
If deployed in the cloud, the application can directly receive the first text data sent by a client.
Or,
when first speech data sent by a client is received, automatic speech recognition (ASR) can be performed on the first speech data to convert it into the first text data.
In a specific implementation, a speech recognition system generally consists of the following basic modules:
1. Signal processing and feature extraction module. The main task of this module is to extract features from the speech data for the acoustic model to process. It typically also includes some signal processing techniques that reduce, as far as possible, the effect of factors such as environmental noise, channel, and speaker on the features.
2. Acoustic model. Speech recognition systems mostly use modeling based on first-order hidden Markov models (HMMs).
3. Pronunciation dictionary. The pronunciation dictionary contains the vocabulary that the speech recognition system can handle, together with its pronunciations. In effect, it provides the mapping between the acoustic model and the language model.
4. Language model. The language model models the language targeted by the speech recognition system. In theory, any language model, including regular languages and context-free grammars, can serve as the language model, but at present most systems still use statistical N-grams and their variants.
5. Decoder. The decoder is one of the cores of a speech recognition system. Its task is, for an input signal, to search for the word string most likely to have produced that signal, according to the acoustic model, the language model, and the dictionary.
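The decoder's search is commonly written as the maximization below. The symbols are introduced here for illustration and do not appear in the original text: O is the sequence of acoustic features extracted from the input signal, W is a candidate word string, P(O | W) is supplied by the acoustic model (through the pronunciation dictionary), and P(W) by the language model.

```latex
\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} P(O \mid W)\, P(W)
```

The second equality follows from Bayes' rule, since P(O) does not depend on W and can be dropped from the maximization.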
Step 102, judge whether the first text data is suitable for analogy; if so, perform step 103;
Analogy means comparing two different objects (or two classes of objects): given that the two are similar in a series of attributes and that one of them is known to have certain other attributes, it is inferred that the other object also has similar other attributes.
In embodiments of the present invention, the first text data can be a question, such as "Who is the desk lamp's good friend?" or "What is the relationship between Liu Dehua and Cheng Long?", which can be answered by analogy.
In one embodiment of the present application, step 102 can include the following sub-steps:
Sub-step S11, perform word segmentation on the first text data to obtain a plurality of first text word segments;
In embodiments of the present application, word segmentation can be performed in one or more of the following ways:
1. Segmentation based on string matching: the Chinese character string to be analyzed is matched, according to a certain strategy, against the entries of a preset machine dictionary; if a string is found in the dictionary, the match succeeds (a word is recognized).
2. Segmentation based on feature scanning or mark cutting: words with obvious features are recognized and cut out of the string first. Using these words as breakpoints, the original string can be divided into smaller strings that then undergo mechanical segmentation, which reduces the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: rich part-of-speech information helps segmentation decisions, and the segmentation result is in turn checked and adjusted during tagging, which improves segmentation accuracy.
3. Segmentation based on understanding: the computer simulates a person's understanding of the sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis while segmenting and to use syntactic and semantic information to resolve ambiguity. It generally has three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words, sentences, and so on to judge segmentation ambiguities; that is, it simulates the process by which a person understands a sentence.
4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters co-occur adjacently reflects well how credible it is that they form a word. The frequency of each adjacent character combination in a corpus can therefore be counted to compute their mutual information, and the adjacent co-occurrence probability of two Chinese characters X and Y can be computed. Mutual information reflects how tightly characters are bound together. When the tightness exceeds some threshold, the character group can be considered to possibly form a word.
Of course, the above word segmentation methods are only examples. When implementing the embodiments of the present application, other word segmentation methods can be set according to the actual situation or used by those skilled in the art according to actual needs; the embodiments of the present application place no limitation on this.
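Two of the methods listed above can be sketched briefly. The first function is forward maximum matching, one common strategy for dictionary-based segmentation; the text does not say which matching strategy is used, so this choice is an assumption. The second computes the pointwise mutual information of two adjacent characters from a corpus, as one way to realize the statistical method. The dictionary, the corpus, and the Chinese question string (a reconstruction of the desk lamp example) are illustrative assumptions.

```python
import math
from collections import Counter
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedily take the longest dictionary entry at each position; fall back to one character."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def adjacent_mutual_information(corpus: List[str], x: str, y: str) -> float:
    """Pointwise mutual information (base 2) of character x immediately followed by y."""
    chars: Counter = Counter()
    pairs: Counter = Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(zip(sentence, sentence[1:]))
    if pairs[(x, y)] == 0:
        return float("-inf")
    p_xy = pairs[(x, y)] / sum(pairs.values())
    p_x = chars[x] / sum(chars.values())
    p_y = chars[y] / sum(chars.values())
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical dictionary and corpus.
segments = forward_max_match("台灯的好朋友是谁", {"台灯", "好朋友", "朋友"})
TOY_CORPUS = ["台灯亮", "台灯好", "好灯"]
```

With this dictionary the question is segmented into 台灯 / 的 / 好朋友 / 是 / 谁. In the toy corpus the pair 台灯 has higher mutual information than 好灯, so a threshold between the two values would accept only the former as a word.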
Sub-step S12, match the plurality of first text word segments of the first text data against a preset analogy question template;
Sub-step S13, when the match succeeds, determine that the first text data is suitable for analogy.
With the embodiments of the present application, a paired analogy question template and analogy answer template can be provided for each of one or more relation types (i.e. analogy pattern frames).
The analogy question template contains the basic structure of a question (text) suitable for analogy.
The analogy answer template contains the basic structure of an answer to the question, with positions reserved for entity words.
Analogy question templates and analogy answer templates are stored persistently in text form with a custom structure and are loaded into memory when matching is performed.
In a specific implementation, a context-free grammar (CFG) parser can be used to match analogy question templates.
If every production rule of a formal grammar G = (N, Σ, P, S) takes the form V -> w, where V ∈ N and w ∈ (N ∪ Σ)*, the grammar is called context-free.
A context-free grammar is so named because the symbol V can always be freely replaced by the string w, regardless of the context in which V appears.
A formal language is context-free if it is generated by a context-free grammar (see the entry on context-free languages).
If the word segments obtained by segmenting the first text data match a preset analogy question template, the first text data can be considered suitable for analogy.
Taking the still-life relation as an example of a relation type, in the analogy question template shown in Fig. 2A, arg1 denotes an entity word, and the basic structure of the question is "'s" (the possessive particle), "good", "friend/buddy", "is", "who".
For "Who is the desk lamp's good friend?", segmentation yields "desk lamp", "'s" (the possessive particle), "good friend", "is", "who", which matches the analogy question template shown in Fig. 2A, so the question can be considered to have an analogy intention.
Taking the gossip relation as an example of a relation type, in the analogy question template shown in Fig. 2B, arg1 and arg2 denote entity words, and the basic structure of the question is "and", "is", "what", "relationship".
For "What is the relationship between Liu Dehua and Cheng Long?", segmentation yields "Liu Dehua", "and", "Cheng Long", "is", "what", "relationship", which matches the analogy question template shown in Fig. 2B, so the question can be considered to have an analogy intention.
Step 103: extracting a first entity word from the first text data;
An entity word may correspond to a specific individual.
It should be noted that the first entity word, second entity word, first sub-entity word, second sub-entity word, first candidate entity word, second candidate entity word, third candidate entity word, fourth candidate entity word, fifth candidate entity word and sixth candidate entity word are named according to different processing states; in essence, all of them are entity words.
In the star category, entity words may be Liu Dehua, Zhang Baizhi, Lin Qingxia and so on.
In addition, entity words may also include individuals that stand for broad categories, such as person, film star, singer and so on.
For example, for "who is the desk lamp's good friend", the entity word is "desk lamp".
As another example, for "what is the relation between Liu Dehua and Cheng Long", the entity words are "Liu Dehua" and "Cheng Long".
Step 104: performing analogy on the first entity word to obtain a second entity word;
In the embodiments of the present application, other entity words with similar attributes are derived from certain attributes of an entity word; for example, a similar second entity word is derived from the first entity word.
In a specific implementation, data can be crawled in advance to train a word2vec (word to vector) model, and analogy is performed on the first entity word by the word2vec model to obtain the second entity word.
Here, word2vec is a tool that converts the words in the training data into vector form; a word can be converted into a 200-dimensional word vector, and the words (including entity words) can be stored in a hash table.
Through this conversion, the processing of text content can be reduced to vector operations in a vector space, and similarity in the vector space is computed to represent semantic similarity between texts.
The training data can be obtained by crawling web pages with a spider and, after data cleaning, keeping clean titles and body text.
In practical applications, the data may include two parts:
1. Web data;
This data is essentially stable; accumulated plain-text data is used (all encyclopedia data plus about one year of other web pages that have detail pages).
2. News data;
A window of roughly the most recent half year is maintained and updated daily; it may include all news data with titles and body text.
This part of the data is mainly used to handle "relations" that change dynamically in the world, such as friendships and marriages between people; therefore, the news corpus used when training the word2vec model needs to keep up with the times.
The CBOW (Continuous Bag-of-Words) model of word2vec is used. As shown in Fig. 3, the CBOW model consists of an input layer, a projection layer and an output layer, and predicts the vector representation of the current word w(t) from the n (n = 4) words before w(t) and the n (n = 4) words after it; this brings the vector representations of words with the same semantics or the same pattern closer together.
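How the CBOW context window is formed can be sketched as follows. This only illustrates the pairing of a target word w(t) with up to n preceding and n following words; it does not train a model. The function name `cbow_pairs` and the truncation of the window at the sequence boundaries are assumptions of this sketch.

```python
def cbow_pairs(tokens, n=4):
    """Yield (context, target) pairs: up to n words before and n words after each target."""
    for t, target in enumerate(tokens):
        before = tokens[max(0, t - n):t]   # at most n words preceding w(t)
        after = tokens[t + 1:t + 1 + n]    # at most n words following w(t)
        yield before + after, target


for context, target in cbow_pairs(["a", "b", "c", "d", "e", "f"], n=2):
    print(target, context)
```

In an actual CBOW model the vectors of the context words would then be combined in the projection layer to predict the target word.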
In one embodiment of the present application, step 104 may include the following sub-steps:
Sub-step S21: when there is one first entity word, searching for one or more first candidate entity words similar to the first entity word;
In a specific implementation, for the case where the question has only one entity word, a first word vector of the first entity word and one or more second word vectors of one or more first candidate entity words can be queried;
One or more first similarities are calculated based on the first word vector and the one or more second word vectors;
The one or more first candidate entity words with the highest first similarity are extracted as the one or more first candidate entity words similar to the first entity word.
Specifically, the distance tool of word2vec can compute the cosine distance from the converted vectors to represent the similarity of the vectors (words).
For example, given the input "france", the distance tool computes and displays the words closest to "france", as follows:
Word Cosine distance
spain 0.678515
belgium 0.665923
netherlands 0.652428
italy 0.633130
switzerland 0.622323
luxembourg 0.610033
portugal 0.577154
russia 0.571507
germany 0.563291
catalonia 0.534176
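The nearest-word lookup can be sketched in a few lines. The values in the table above behave like cosine similarities (larger means closer), so the sketch ranks by cosine similarity. The three-dimensional toy vectors and the names `cosine`, `most_similar` and `TOY_VECTORS` are hypothetical; they are not taken from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, top_n=3):
    """Return the top_n (word, similarity) pairs closest to the given word."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Toy vectors for illustration only; a real model would use e.g. 200 dimensions.
TOY_VECTORS = {
    "france": [1.0, 0.0, 0.0],
    "spain": [0.9, 0.1, 0.0],
    "belgium": [0.8, 0.3, 0.0],
    "guitar": [0.0, 0.0, 1.0],
}

for neighbour, similarity in most_similar("france", TOY_VECTORS, top_n=2):
    print(neighbour, round(similarity, 6))
```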
Sub-step S22: screening, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;
In the embodiments of the present application, the answer is obtained by analogy to the question, so the type of the entity word in the question is generally consistent with the type of the entity word in the answer.
For example, for "desk lamp", entity words of the same entity word type include "wall sticker", "LED", "TV cabinet" and so on.
Sub-step S23: selecting one or more second entity words from the one or more second candidate entity words.
In a specific implementation, one or more second entity words can be selected for the answer from the entity words that remain after screening by entity word type.
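Sub-steps S21 to S23 can be put together in a minimal sketch. The similarity scores, the type labels ("household", "activity"), the extra word "reading" and the function name `analogize_single` are all invented for illustration; only the desk lamp example words come from the description.

```python
def analogize_single(entity, similarity, entity_type, top_n=5):
    """Sub-steps S21-S23 for a question that contains a single entity word."""
    scores = similarity[entity]
    # S21: keep the top_n most similar words as first candidate entity words.
    first_candidates = sorted(scores, key=scores.get, reverse=True)[:top_n]
    # S22: keep only candidates whose type equals the type of the query entity.
    wanted = entity_type[entity]
    second_candidates = [w for w in first_candidates if entity_type.get(w) == wanted]
    # S23: this sketch returns every survivor; a real system may pick a subset.
    return second_candidates


# Hypothetical precomputed similarities and entity word types.
SIMILARITY = {"desk lamp": {"wall sticker": 0.71, "reading": 0.69, "LED": 0.66, "TV cabinet": 0.62}}
ENTITY_TYPE = {
    "desk lamp": "household",
    "wall sticker": "household",
    "LED": "household",
    "TV cabinet": "household",
    "reading": "activity",
}

print(analogize_single("desk lamp", SIMILARITY, ENTITY_TYPE))
```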
In another embodiment of the present application, step 104 may include the following sub-steps:
Sub-step S31: when the first entity word includes a first sub-entity word and a second sub-entity word, searching for one or more third candidate entity words similar to the first sub-entity word;
For the case where the question has multiple first entity words, for example two, the embodiments of the present application refer to them, in order of appearance, as the first sub-entity word, the second sub-entity word and so on, for ease of expression.
For example, for "what is the relation between Liu Dehua and Cheng Long", the first sub-entity word is "Liu Dehua" and the second sub-entity word is "Cheng Long".
In a specific implementation, a third word vector of the first sub-entity word and one or more fourth word vectors of one or more third candidate entity words can be queried in the word2vec model;
One or more second similarities are calculated from the third word vector and the one or more fourth word vectors, for example by cosine similarity;
The one or more third candidate entity words with the highest second similarity are extracted as the one or more third candidate entity words similar to the first sub-entity word.
Conversely, third candidate entity words with a lower second similarity are filtered out.
For example, for "what is the relation between Liu Dehua and Cheng Long", N (N being a positive integer) third candidate entity words similar to the first sub-entity word "Liu Dehua" can be computed, e.g. "Huang Rihua", "Miao Qiaowei", "Wang Lihong", "Lost and Love", "Ice Rain"; the most similar third candidate entity words are then extracted from these N words, e.g. "Miao Qiaowei", "Huang Rihua", "Wang Lihong", "Ice Rain", and "Lost and Love" is filtered out.
Sub-step S32: screening, from the one or more third candidate entity words, one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;
In the embodiments of the present application, the answer is obtained by analogy to the question, so the type of the entity word in the question is generally consistent with the type of the entity word in the answer.
To indicate that screening by entity word type has taken place, the entity words screened from the third candidate entity words are referred to as fourth candidate entity words.
For example, for "Liu Dehua", the entity word type is star; therefore, "Ice Rain", whose entity word type is song, can be filtered out of "Miao Qiaowei", "Huang Rihua", "Wang Lihong", "Ice Rain", while "Miao Qiaowei", "Huang Rihua" and "Wang Lihong", whose entity word type is likewise star, are retained.
Sub-step S33: calculating one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word and the one or more fourth candidate entity words;
In a specific implementation, the entity word can be computed as D = A - B + C, where A is the first sub-entity word, B is the second sub-entity word, C is a fourth candidate entity word, and D is a fifth candidate entity word.
Specifically, the third word vector of the first sub-entity word, the one or more fourth word vectors of the one or more fourth candidate entity words, and a fifth word vector of the second sub-entity word can be queried.
The fifth word vector is subtracted from the third word vector and the fourth word vector is added, yielding a sixth word vector.
When the seventh word vector of some entity word is the nearest to the sixth word vector, that entity word is confirmed as a fifth candidate entity word.
For example, suppose the first sub-entity word is "Liu Dehua", the second sub-entity word is "Cheng Long", and the fourth candidate entity words are "Miao Qiaowei", "Huang Rihua" and "Wang Lihong".
In one case, the fifth word vector of "Cheng Long" is subtracted from the third word vector of "Liu Dehua" and the fourth word vector of "Miao Qiaowei" is added, yielding a sixth word vector; if the seventh word vector of "Wireless" is the nearest to this sixth word vector, "Wireless" can be confirmed as a fifth candidate entity word.
In another case, the fifth word vector of "Cheng Long" is subtracted from the third word vector of "Liu Dehua" and the fourth word vector of "Huang Rihua" is added, yielding a sixth word vector; if the seventh word vector of "Liang Chaowei" is the nearest to this sixth word vector, "Liang Chaowei" can be confirmed as a fifth candidate entity word.
In yet another case, the fifth word vector of "Cheng Long" is subtracted from the third word vector of "Liu Dehua" and the fourth word vector of "Wang Lihong" is added, yielding a sixth word vector; if the seventh word vector of "Zhou Jielun" is the nearest to this sixth word vector, "Zhou Jielun" can be confirmed as a fifth candidate entity word.
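The vector arithmetic D = A - B + C can be sketched as follows. The two-dimensional toy vectors and the names `analogy` and `VECTORS` are hypothetical, and "nearest" is interpreted here as highest cosine similarity, which is an assumption; the description does not fix the distance measure for this step.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Return the word whose vector is nearest to vectors[a] - vectors[b] + vectors[c]."""
    # Sixth word vector: third vector (a) minus fifth vector (b) plus fourth vector (c).
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words themselves are excluded from the search.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors for illustration only: A - B + C = [4.0, 0.0], which is closest to D.
VECTORS = {
    "A": [2.0, 1.0],
    "B": [1.0, 2.0],
    "C": [3.0, 1.0],
    "D": [4.0, 0.1],
    "E": [0.0, 3.0],
}

print(analogy("A", "B", "C", VECTORS))
```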
Sub-step S34: screening, from the one or more fifth candidate entity words, one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;
In the embodiments of the present application, the answer is obtained by analogy to the question, so the type of the entity word in the question is generally consistent with the type of the entity word in the answer.
For example, for "Cheng Long", the entity word type is star; therefore, "Wireless", whose entity word type is company, can be filtered out of "Wireless", "Liang Chaowei", "Wang Lihong", "Zhou Jielun", while "Liang Chaowei" and "Zhou Jielun", whose entity word type is likewise star, are retained.
It should be noted that, because each fourth candidate entity word is associated with a fifth candidate entity word, when a fifth candidate entity word is filtered out, the corresponding fourth candidate entity word can also be filtered out.
For example, because "Wireless" is filtered out, "Miao Qiaowei", which is associated with "Wireless", is also filtered out, leaving "Huang Rihua" and "Wang Lihong".
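Because each fourth candidate entity word is tied to the fifth candidate entity word computed from it, one simple way to keep the two in step is to filter them as pairs. The sketch below assumes such a pair representation; the names `filter_pairs`, `PAIRS` and `ENTITY_TYPE` are hypothetical, and only two of the example pairs from the description are shown.

```python
def filter_pairs(pairs, entity_type, wanted_type):
    """Keep (fourth, fifth) candidate pairs whose fifth candidate has the wanted type."""
    return [(c, d) for c, d in pairs if entity_type.get(d) == wanted_type]


# (fourth candidate, fifth candidate) pairs from the example in the description.
PAIRS = [("Miao Qiaowei", "Wireless"), ("Wang Lihong", "Zhou Jielun")]
ENTITY_TYPE = {"Wireless": "company", "Zhou Jielun": "star"}

# The second sub-entity word "Cheng Long" is a star, so only star-typed results survive,
# and the fourth candidate paired with a dropped result is dropped with it.
print(filter_pairs(PAIRS, ENTITY_TYPE, "star"))
```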
Sub-step S35: choosing the second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words.
In the embodiments of the present application, the second entity word can be chosen by the following formula:
[formula not reproduced in the source text]
where A and B are the first entity words, C and D are the second entity words, score(C, D) is the score of C and D, c_i is the i-th fourth candidate entity word, d_j is the j-th sixth candidate entity word, and λ is a constant.
Specifically, a first distance can be calculated based on the third word vector of the first sub-entity word and the fourth word vector of the fourth candidate entity word;
A second distance is calculated based on the seventh word vector of the sixth candidate entity word and the sixth word vector, where the sixth word vector is the word vector obtained by subtracting the fifth word vector from the third word vector and adding the fourth word vector;
The score of the fourth candidate entity word and the sixth candidate entity word is calculated using the first distance and the second distance;
The fourth candidate entity word and the sixth candidate entity word with the highest score are chosen as the second entity words; that is, for ease of expression, in the embodiments of the present application the second entity words can be referred to, in entity word order, as the fourth candidate entity word, the sixth candidate entity word and so on.
For example, according to the above formula, substituting "Liu Dehua", "Cheng Long", "Huang Rihua", "Liang Chaowei" gives a score of 0.85, and substituting "Liu Dehua", "Cheng Long", "Wang Lihong", "Zhou Jielun" gives a score of 0.93; since 0.93 > 0.85, "Wang Lihong" and "Zhou Jielun" can be determined to be the second entity words.
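The selection of the highest-scoring pair can be sketched as follows. The scoring formula itself is not available in this text, so the sketch assumes one plausible form, a λ-weighted combination of the two similarity values; this form, the names `score` and `choose_second_entities`, and the numeric inputs are assumptions for illustration and should not be read as the formula of the application.

```python
def score(first_similarity, second_similarity, lam=0.5):
    """ASSUMED scoring form: linear interpolation of the two similarities with constant lam."""
    return lam * first_similarity + (1.0 - lam) * second_similarity


def choose_second_entities(candidates, lam=0.5):
    """candidates: list of (c, d, first_similarity, second_similarity); return the best (c, d)."""
    best = max(candidates, key=lambda item: score(item[2], item[3], lam))
    return best[0], best[1]


# Hypothetical candidate pairs with made-up similarity values.
CANDIDATES = [
    ("c1", "d1", 0.80, 0.70),
    ("c2", "d2", 0.75, 0.95),
]

print(choose_second_entities(CANDIDATES))
```

Whatever the exact formula, the selection step is an argmax over the surviving (fourth candidate, sixth candidate) pairs, which is the part this sketch is meant to show.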
Step 105: generating second text data according to the second entity word.
In the embodiments of the present application, an analogy answer template that belongs to the same relation type as the analogy question template is looked up.
The second entity word is embedded in the analogy answer template to obtain the second text data.
It should be noted that, because there are many analogy answer templates, they can be stored in a key-set<value> manner, where key is the relation type, i.e. the analogy pattern frame, such as the gossip relation or the still-object relation, and set<value> is a group of answer templates.
When a key is hit, an answer template is selected from the corresponding set<value>; the selection strategy may be random or probability-based and, without limitation, different answer templates may also be given according to the entity type.
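The key-set<value> storage and the random selection strategy can be sketched with a dictionary. The key names, the `{A}`-style placeholders, the template wording and the function name `pick_template` are hypothetical choices for this sketch; only the idea of grouping answer templates under a relation type comes from the description.

```python
import random

# Hypothetical store: relation type (key) -> group of answer templates (set<value>).
ANSWER_TEMPLATES = {
    "still_object_relation": [
        "{A}'s good friend should be {B}.",
        "I think {A}'s good friend is {B}.",
    ],
    "gossip_relation": [
        "The relation between {A} and {B} is like the relation between {C} and {D}.",
        "{A} and {B} are just like {C} and {D}.",
    ],
}


def pick_template(relation_type, rng=random):
    """Return a randomly chosen answer template for the relation type, or None on a miss."""
    templates = ANSWER_TEMPLATES.get(relation_type)
    if not templates:
        return None
    return rng.choice(templates)


print(pick_template("gossip_relation"))
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible; a probability-based strategy could use `rng.choices` with weights instead.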
For example, for the analogy question template shown in Fig. 2A, the following analogy answer templates can be used:
1. A's good friend should be B.
2. I think A's good friend is B.
3. A's good friends are the likes of B.
4. A and B should be able to become friends happily.
where A is the first entity word and B is the second entity word.
For "who is the desk lamp's good friend", applying the 3rd template, the answer may be "the desk lamp's good friends are the likes of wall sticker, LED and TV cabinet".
As another example, for the analogy question template shown in Fig. 2B, the following analogy answer templates can be used:
1. However complicated the relation between the two of them is, it is just like the relation between C and D.
2. Just like C and D, you know.
3. Their relation, in fact, is just the same as the relation between C and D.
4. Speaking of this, I feel it is like the relation between C and D.
5. If they are compared to C and D, do you think that is apt?
6. The relation between A and B is like the relation between C and D.
7. A and B are similar to C and D.
8. A and B are just like C and D.
9. The relation between A and B feels just like the relation between C and D.
10. The relation between A and B reminds me of the relation between C and D.
where A and B are the first entity words, and C and D are the second entity words.
For "what is the relation between Liu Dehua and Cheng Long", applying the 6th template, the answer may be "the relation between Liu Dehua and Cheng Long is like the relation between Wang Lihong and Zhou Jielun".
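Embedding entity words into an answer template can be sketched with Python string formatting. The `{A}`-style placeholders, the helper names `join_entities` and `render_answer`, and the rule for joining several second entity words into one phrase are assumptions of this sketch.

```python
def join_entities(words):
    """Join several entity words into one phrase, e.g. ['a', 'b', 'c'] -> 'a, b and c'."""
    if len(words) <= 1:
        return "".join(words)
    return ", ".join(words[:-1]) + " and " + words[-1]


def render_answer(template, **slots):
    """Fill named slots; a slot that holds a list of entity words is joined first."""
    filled = {
        name: join_entities(list(value)) if isinstance(value, (list, tuple)) else value
        for name, value in slots.items()
    }
    return template.format(**filled)


print(render_answer(
    "The relation between {A} and {B} is like the relation between {C} and {D}.",
    A="Liu Dehua", B="Cheng Long", C="Wang Lihong", D="Zhou Jielun",
))
```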
If what was previously received was first text data sent by a client, the second text data can be returned directly to the client for display.
If what was previously received was first speech data sent by a client, the second text data can be converted into second speech data, and the second speech data is returned to the client for playback; or the second text data is returned to the client for display; or the second speech data is returned to the client for playback and, at the same time, the second text data is returned to the client for display.
When the first text data is confirmed to have an analogy intention, the embodiments of the present application perform analogy on the first entity word of the first text data to obtain a second entity word and then generate second text data. Word vectors are built directly from a large amount of unlabeled text to produce analogy answers, so no knowledge base needs to be built, which reduces the consumption of manpower and material resources and lowers cost. The exact relation between the two entities is not answered directly; replying by analogy improves coverage and the reply success rate for analogy questions.
It should be noted that, for brevity, the method embodiments are described as a series of combined actions, but those skilled in the art should understand that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.
Referring to Fig. 4, a structural block diagram of an embodiment of an apparatus for processing text data according to the present application is shown; the apparatus may specifically include the following modules:
A first text data acquisition module 401, configured to obtain first text data;
An analogy intention judging module 402, configured to judge whether the first text data is suitable for analogy and, if so, to call the entity word extraction module 403;
An entity word extraction module 403, configured to extract a first entity word from the first text data;
An entity word analogy module 404, configured to perform analogy on the first entity word to obtain a second entity word;
A second text data generation module 405, configured to generate second text data according to the second entity word.
In one embodiment of the present application, the analogy intention judging module 402 may include the following sub-modules:
A word segmentation sub-module, configured to perform word segmentation on the first text data to obtain a plurality of first text segments;
An analogy question template matching sub-module, configured to match the plurality of first text segments of the first text data against a preset analogy question template;
An analogy intention determining sub-module, configured to determine, when the match succeeds, that the first text data is suitable for analogy.
In one embodiment of the present application, the entity word analogy module 404 may include the following sub-modules:
A first candidate entity word searching sub-module, configured to search, when there is one first entity word, for one or more first candidate entity words similar to the first entity word;
A second candidate entity word screening sub-module, configured to screen, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;
A second entity word selecting sub-module, configured to select one or more second entity words from the one or more second candidate entity words.
In one embodiment of the present application, the first candidate entity word searching sub-module may include the following units:
A first vector query unit, configured to query a first word vector of the first entity word and one or more second word vectors of one or more first candidate entity words;
A first similarity calculation unit, configured to calculate one or more first similarities based on the first word vector and the one or more second word vectors;
A first candidate entity word extraction unit, configured to extract the one or more first candidate entity words with the highest first similarity as the one or more first candidate entity words similar to the first entity word.
In one embodiment of the present application, the entity word analogy module 404 may include the following sub-modules:
A third candidate entity word searching sub-module, configured to search, when the first entity word includes a first sub-entity word and a second sub-entity word, for one or more third candidate entity words similar to the first sub-entity word;
A fourth candidate entity word screening sub-module, configured to screen, from the one or more third candidate entity words, one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;
A fifth candidate entity word calculation sub-module, configured to calculate one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word and the one or more fourth candidate entity words;
A sixth candidate entity word screening sub-module, configured to screen, from the one or more fifth candidate entity words, one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;
A second entity word choosing sub-module, configured to choose the second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words.
In one embodiment of the present application, the third candidate entity word searching sub-module may include the following units:
A second word vector query unit, configured to query a third word vector of the first sub-entity word and one or more fourth word vectors of one or more third candidate entity words;
A second similarity calculation unit, configured to calculate one or more second similarities based on the third word vector and the one or more fourth word vectors;
A third candidate entity word extraction unit, configured to extract the one or more third candidate entity words with the highest second similarity as the one or more third candidate entity words similar to the first sub-entity word.
In one embodiment of the present application, the fifth candidate entity word calculation sub-module may include the following units:
A third vector query unit, configured to query the third word vector of the first sub-entity word, the one or more fourth word vectors of the one or more fourth candidate entity words, and a fifth word vector of the second sub-entity word;
A vector calculation unit, configured to subtract the fifth word vector from the third word vector and add the fourth word vector to obtain a sixth word vector;
A fifth candidate entity word determining unit, configured to confirm, when the seventh word vector of some entity word is the nearest to the sixth word vector, that the entity word is a fifth candidate entity word.
In one embodiment of the present application, the second entity word choosing sub-module may include the following units:
A first distance calculation unit, configured to calculate a first distance based on the third word vector of the first sub-entity word and the fourth word vector of the fourth candidate entity word;
A second distance calculation unit, configured to calculate a second distance based on the seventh word vector of the sixth candidate entity word and the sixth word vector;
A score calculation unit, configured to calculate the score of the fourth candidate entity word and the sixth candidate entity word using the first distance and the second distance;
A choosing unit, configured to choose the fourth candidate entity word and the sixth candidate entity word with the highest score as the second entity words.
In one embodiment of the present application, the second text data generation module 405 may include the following sub-modules:
An analogy answer template searching sub-module, configured to look up an analogy answer template that belongs to the same relation type as the analogy question template;
An analogy answer template embedding sub-module, configured to embed the second entity word in the analogy answer template to obtain the second text data.
In one embodiment of the present application, the apparatus may further include the following modules:
A text conversion module, configured to convert, when first speech data sent by a client is received, the first speech data into the first text data;
A speech conversion module, configured to convert the second text data into second speech data;
A speech returning module, configured to return the second speech data to the client.
Since the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, an apparatus or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.
In a typical configuration, the computer device includes one or more processors (CPUs), an input/output interface, a network interface and memory. The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the method, terminal device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operation steps is performed on the computer or other programmable terminal device to produce computer-implemented processing, and the instructions executed on the computer or other programmable terminal device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relation or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or terminal device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or terminal device that includes the element.
The method for processing text data and the apparatus for processing text data provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help in understanding the method of the present application and its core idea. At the same time, for those of ordinary skill in the art, the specific implementations and the scope of application may change according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (15)

1. A method for processing text data, characterized by comprising:
obtaining first text data;
determining whether the first text data is suitable for analogy and, if so, extracting a first entity word from the first text data;
performing analogy on the first entity word to obtain a second entity word;
generating second text data according to the second entity word.
2. The method according to claim 1, characterized in that the step of determining whether the first text data is suitable for analogy comprises:
performing word segmentation on the first text data to obtain a plurality of first text segments;
matching the plurality of first text segments of the first text data against a preset analogy question template;
when the match succeeds, determining that the first text data is suitable for analogy.
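The segmentation and template-matching steps of claim 2 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration only: the whitespace-based `segment` function stands in for a real word segmenter, and the templates, the `<E>` slot marker, and all function names are hypothetical rather than taken from the patent.

```python
from typing import List, Optional

# Hypothetical analogy question templates; "<E>" marks a slot for an entity word.
QUESTION_TEMPLATES: List[List[str]] = [
    ["what", "is", "similar", "to", "<E>"],
    ["<E>", "is", "to", "<E>", "as", "<E>", "is", "to", "what"],
]


def segment(text: str) -> List[str]:
    """Stand-in for word segmentation: lowercase, drop '?', split on whitespace."""
    return text.lower().replace("?", " ").split()


def match_template(tokens: List[str], template: List[str]) -> Optional[List[str]]:
    """Return the slot fillers if tokens match the template position by position."""
    if len(tokens) != len(template):
        return None
    slots = []
    for token, pattern in zip(tokens, template):
        if pattern == "<E>":
            slots.append(token)      # slot position: capture the entity word
        elif token != pattern:
            return None              # literal position: must match exactly
    return slots


def is_suitable_for_analogy(text: str) -> bool:
    """The text is treated as suitable for analogy if any template matches."""
    tokens = segment(text)
    return any(match_template(tokens, t) is not None for t in QUESTION_TEMPLATES)
```

In this sketch the captured slot fillers double as the extracted entity words, which is one possible design; the claims treat entity word extraction as a separate step.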
3. The method according to claim 1 or 2, characterized in that the step of performing analogy on the first entity word to obtain a second entity word comprises:
when there is one first entity word, searching for one or more first candidate entity words similar to the first entity word;
screening, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;
selecting one or more second entity words from the one or more second candidate entity words.
4. The method according to claim 3, characterized in that the step of searching for one or more first candidate entity words similar to the first entity word comprises:
looking up a first word vector of the first entity word and one or more second word vectors of one or more first candidate entity words;
calculating one or more first similarities based on the first word vector and the one or more second word vectors;
extracting the one or more first candidate entity words with the highest first similarity as the one or more first candidate entity words similar to the first entity word.
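As a rough illustration of the similarity search in claim 4, the sketch below ranks candidates by cosine similarity between word vectors. The claim does not name a similarity measure, so cosine similarity is an assumption, and the three-dimensional toy vectors, the `WORD_VECTORS` table, and the function names are hypothetical.

```python
import math
from typing import Dict, List

# Toy word vectors with made-up values, kept to 3 dimensions for readability.
WORD_VECTORS: Dict[str, List[float]] = {
    "apple": [0.9, 0.1, 0.0],
    "pear": [0.8, 0.2, 0.0],
    "banana": [0.7, 0.3, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word: str, k: int = 2) -> List[str]:
    """Return the k words whose vectors are most similar to the query word."""
    query = WORD_VECTORS[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in WORD_VECTORS.items()
        if other != word                      # never return the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]
```

A type filter such as the one in claim 3 would then be applied to the returned list before any second entity word is selected.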
5. The method according to claim 1, 2, or 4, characterized in that the step of performing analogy on the first entity word to obtain a second entity word comprises:
when the first entity word includes a first sub-entity word and a second sub-entity word, searching for one or more third candidate entity words similar to the first sub-entity word;
screening, from the one or more third candidate entity words, one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;
calculating one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words;
screening, from the one or more fifth candidate entity words, one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;
choosing a second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words.
6. The method according to claim 5, characterized in that the step of searching for one or more third candidate entity words similar to the first sub-entity word comprises:
looking up a third word vector of the first sub-entity word and one or more fourth word vectors of one or more third candidate entity words;
calculating one or more second similarities based on the third word vector and the one or more fourth word vectors;
extracting the one or more third candidate entity words with the highest second similarity as the one or more third candidate entity words similar to the first sub-entity word.
7. The method according to claim 5, characterized in that the step of calculating one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words comprises:
looking up the third word vector of the first sub-entity word, the one or more fourth word vectors of the one or more fourth candidate entity words, and a fifth word vector of the second sub-entity word;
subtracting the fifth word vector from the third word vector and adding the fourth word vector to obtain a sixth word vector;
when a seventh word vector of an entity word is nearest to the sixth word vector, determining that entity word to be a fifth candidate entity word.
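The vector arithmetic of claim 7 can be sketched as follows. The sign convention (third vector minus fifth vector plus fourth vector) follows the wording of the claim as translated here; other formulations of word-vector analogy order the terms differently. Euclidean distance as the measure of "nearest", the placeholder words `w1` to `w5`, and their vector values are all assumptions for illustration.

```python
import math
from typing import Dict, Iterable, List, Optional


def analogy_vector(third: List[float], fifth: List[float], fourth: List[float]) -> List[float]:
    """Sixth vector = third - fifth + fourth, as the claim is worded."""
    return [a - b + c for a, b, c in zip(third, fifth, fourth)]


def euclidean(u: List[float], v: List[float]) -> float:
    """Euclidean distance between two vectors (assumed distance measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_word(
    target: List[float],
    vectors: Dict[str, List[float]],
    exclude: Iterable[str] = (),
) -> Optional[str]:
    """Return the word whose vector lies closest to the target vector."""
    skipped = set(exclude)
    best_word, best_distance = None, math.inf
    for word, vec in vectors.items():
        if word in skipped:
            continue
        distance = euclidean(target, vec)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word


# Placeholder vocabulary with made-up 2-dimensional vectors.
VECTORS: Dict[str, List[float]] = {
    "w1": [2.0, 1.0],  # first sub-entity word (third word vector)
    "w2": [1.0, 1.0],  # second sub-entity word (fifth word vector)
    "w3": [2.0, 3.0],  # a fourth candidate entity word (fourth word vector)
    "w4": [3.0, 3.0],
    "w5": [0.0, 0.0],
}

sixth = analogy_vector(VECTORS["w1"], VECTORS["w2"], VECTORS["w3"])
fifth_candidate = nearest_word(sixth, VECTORS, exclude=("w1", "w2", "w3"))
```

Excluding the three input words from the search is a common practical choice, not something the claim requires.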
8. The method according to claim 7, characterized in that the step of choosing a second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words comprises:
calculating a first distance based on the third word vector of the first sub-entity word and the fourth word vector of the fourth candidate entity word;
calculating a second distance based on the seventh word vector of the sixth candidate entity word and the sixth word vector;
calculating a score for the fourth candidate entity word and the sixth candidate entity word using the first distance and the second distance;
choosing the fourth candidate entity word and the sixth candidate entity word with the highest score as the second entity word.
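Claim 8 does not state how the two distances are combined into a score, so the sketch below assumes one simple rule: the score falls as the sum of the two distances grows. The scoring formula, Euclidean distance, the tuple layout, and all names and values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (fourth candidate word, its fourth word vector,
#  sixth candidate word, its seventh word vector, the computed sixth word vector)
Candidate = Tuple[str, Sequence[float], str, Sequence[float], Sequence[float]]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors (assumed distance measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pair_score(first_distance: float, second_distance: float) -> float:
    """Assumed scoring rule: smaller distances yield a higher score."""
    return 1.0 / (1.0 + first_distance + second_distance)


def choose_pair(third_vector: Sequence[float], candidates: List[Candidate]) -> Tuple[str, str]:
    """Return the (fourth, sixth) candidate pair with the highest score."""
    def score(candidate: Candidate) -> float:
        _, fourth_vec, _, seventh_vec, sixth_vec = candidate
        first_distance = euclidean(third_vector, fourth_vec)
        second_distance = euclidean(seventh_vec, sixth_vec)
        return pair_score(first_distance, second_distance)

    best = max(candidates, key=score)
    return best[0], best[2]


# Made-up example: pair A has distances (1, 0); pair B has distances (5, 1).
third = [0.0, 0.0]
pairs: List[Candidate] = [
    ("a4", [1.0, 0.0], "a6", [2.0, 2.0], [2.0, 2.0]),
    ("b4", [3.0, 4.0], "b6", [1.0, 1.0], [1.0, 2.0]),
]
best_pair = choose_pair(third, pairs)
```

Any monotonically decreasing function of the distances would select the same pair in this example; a weighted combination is an equally plausible reading of the claim.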
9. The method according to claim 2, characterized in that the step of generating second text data according to the second entity word comprises:
looking up an analogy answer template that belongs to the same relation type as the analogy question template;
embedding the second entity word into the analogy answer template to obtain the second text data.
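The template lookup and embedding of claim 9 can be sketched as a dictionary keyed by relation type plus string formatting. The relation type names, the template wording, and the `{answer}` slot are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical answer templates, keyed by the relation type they share
# with the matched analogy question template.
ANSWER_TEMPLATES: Dict[str, str] = {
    "similar_to": "You might also think of {answer}.",
    "pairing": "The counterpart would be {answer}.",
}


def generate_answer(relation_type: str, second_entity_words: List[str]) -> str:
    """Look up the answer template for the relation type and fill its slot."""
    template = ANSWER_TEMPLATES[relation_type]
    return template.format(answer=" and ".join(second_entity_words))
```

For example, `generate_answer("similar_to", ["pear"])` yields the second text data for a single second entity word; joining several words with "and" is just one way to handle multiple results.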
10. The method according to claim 1, 2, 4, 6, 7, 8, or 9, characterized by further comprising:
when first speech data sent by a client is received, converting the first speech data into the first text data;
converting the second text data into second speech data;
returning the second speech data to the client.
11. A device for processing text data, characterized by comprising:
a first text data acquisition module, configured to obtain first text data;
an analogy intent determination module, configured to determine whether the first text data is suitable for analogy and, if so, to call an entity word extraction module;
the entity word extraction module, configured to extract a first entity word from the first text data;
an entity word analogy module, configured to perform analogy on the first entity word to obtain a second entity word;
a second text data generation module, configured to generate second text data according to the second entity word.
12. The device according to claim 11, characterized in that the analogy intent determination module comprises:
a word segmentation submodule, configured to perform word segmentation on the first text data to obtain a plurality of first text segments;
an analogy question template matching submodule, configured to match the plurality of first text segments of the first text data against a preset analogy question template;
an analogy intent determination submodule, configured to determine, when the match succeeds, that the first text data is suitable for analogy.
13. The device according to claim 11 or 12, characterized in that the entity word analogy module comprises:
a first candidate entity word lookup submodule, configured to search, when there is one first entity word, for one or more first candidate entity words similar to the first entity word;
a second candidate entity word screening submodule, configured to screen, from the one or more first candidate entity words, one or more second candidate entity words whose entity word type is the same as that of the first entity word;
a second entity word selection submodule, configured to select one or more second entity words from the one or more second candidate entity words.
14. The device according to claim 11 or 12, characterized in that the entity word analogy module comprises:
a third candidate entity word lookup submodule, configured to search, when the first entity word includes a first sub-entity word and a second sub-entity word, for one or more third candidate entity words similar to the first sub-entity word;
a fourth candidate entity word screening submodule, configured to screen, from the one or more third candidate entity words, one or more fourth candidate entity words whose entity word type is the same as that of the first sub-entity word;
a fifth candidate entity word calculation submodule, configured to calculate one or more fifth candidate entity words based on the first sub-entity word, the second sub-entity word, and the one or more fourth candidate entity words;
a sixth candidate entity word screening submodule, configured to screen, from the one or more fifth candidate entity words, one or more sixth candidate entity words whose entity word type is the same as that of the second sub-entity word;
a second entity word choosing submodule, configured to choose a second entity word from the one or more fourth candidate entity words and the one or more sixth candidate entity words.
15. The device according to claim 12, characterized in that the second text data generation module comprises:
an analogy answer template lookup submodule, configured to look up an analogy answer template that belongs to the same relation type as the analogy question template;
an analogy answer template embedding submodule, configured to embed the second entity word into the analogy answer template to obtain the second text data.
CN201610031796.6A 2016-01-18 2016-01-18 Text data processing method and device Active CN106980624B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201610031796.6A CN106980624B (en) 2016-01-18 2016-01-18 Text data processing method and device
US15/404,855 US10176804B2 (en) 2016-01-18 2017-01-12 Analyzing textual data
PCT/US2017/013388 WO2017127296A1 (en) 2016-01-18 2017-01-13 Analyzing textual data
EP17741788.8A EP3405912A4 (en) 2016-01-18 2017-01-13 Analyzing textual data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610031796.6A CN106980624B (en) 2016-01-18 2016-01-18 Text data processing method and device

Publications (2)

Publication Number Publication Date
CN106980624A true CN106980624A (en) 2017-07-25
CN106980624B CN106980624B (en) 2021-03-26

Family

ID=59314671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610031796.6A Active CN106980624B (en) 2016-01-18 2016-01-18 Text data processing method and device

Country Status (4)

Country Link
US (1) US10176804B2 (en)
EP (1) EP3405912A4 (en)
CN (1) CN106980624B (en)
WO (1) WO2017127296A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190102A (en) * 2018-09-12 2019-01-11 张连祥 The system and method that project of inviting outside investment negotiation scheme automatically generates
CN110750627A (en) * 2018-07-19 2020-02-04 上海谦问万答吧云计算科技有限公司 Material retrieval method and device, electronic equipment and storage medium
CN112861533A (en) * 2019-11-26 2021-05-28 阿里巴巴集团控股有限公司 Entity word recognition method and device

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10878338B2 (en) * 2016-10-06 2020-12-29 International Business Machines Corporation Machine learning of analogic patterns
US10331718B2 (en) 2016-10-06 2019-06-25 International Business Machines Corporation Analogy outcome determination
WO2018081020A1 (en) 2016-10-24 2018-05-03 Carlabs Inc. Computerized domain expert
US10325024B2 (en) * 2016-11-30 2019-06-18 International Business Machines Corporation Contextual analogy response
US10325025B2 (en) * 2016-11-30 2019-06-18 International Business Machines Corporation Contextual analogy representation
US10901992B2 (en) * 2017-06-12 2021-01-26 KMS Lighthouse Ltd. System and method for efficiently handling queries
US11182706B2 (en) * 2017-11-13 2021-11-23 International Business Machines Corporation Providing suitable strategies to resolve work items to participants of collaboration system
CN109801090A (en) * 2017-11-16 2019-05-24 国家新闻出版广电总局广播科学研究院 The cross-selling method and server of networking products data
US11586655B2 (en) 2017-12-19 2023-02-21 Visa International Service Association Hyper-graph learner for natural language comprehension
US10572586B2 (en) * 2018-02-27 2020-02-25 International Business Machines Corporation Technique for automatically splitting words
CN108563468B (en) * 2018-03-30 2021-09-21 深圳市冠旭电子股份有限公司 Bluetooth sound box data processing method and device and Bluetooth sound box
CN110263318B (en) * 2018-04-23 2022-10-28 腾讯科技(深圳)有限公司 Entity name processing method and device, computer readable medium and electronic equipment
CN108829777A (en) * 2018-05-30 2018-11-16 出门问问信息科技有限公司 A kind of the problem of chat robots, replies method and device
CN108959256B (en) * 2018-06-29 2023-04-07 北京百度网讯科技有限公司 Short text generation method and device, storage medium and terminal equipment
JP6988715B2 (en) * 2018-06-29 2022-01-05 日本電信電話株式会社 Answer text selection device, method, and program
WO2020031242A1 (en) * 2018-08-06 2020-02-13 富士通株式会社 Assessment program, assessment method, and information processing device
CN109460503B (en) * 2018-09-14 2022-01-14 阿里巴巴(中国)有限公司 Answer input method, answer input device, storage medium and electronic equipment
JP7159780B2 (en) * 2018-10-17 2022-10-25 富士通株式会社 Correction Content Identification Program and Report Correction Content Identification Device
CN109635277B (en) * 2018-11-13 2023-05-26 北京合享智慧科技有限公司 Method and related device for acquiring entity information
CN109783624A (en) * 2018-12-27 2019-05-21 联想(北京)有限公司 Answer generation method, device and the intelligent conversational system in knowledge based library
CN109902286B (en) * 2019-01-09 2023-12-12 千城数智(北京)网络科技有限公司 Entity identification method and device and electronic equipment
CN110263167B (en) * 2019-06-20 2022-07-29 北京百度网讯科技有限公司 Medical entity classification model generation method, device, equipment and readable storage medium
US11741305B2 (en) 2019-10-07 2023-08-29 The Toronto-Dominion Bank Systems and methods for automatically assessing fault in relation to motor vehicle collisions
CN111222317B (en) * 2019-10-16 2022-04-29 平安科技(深圳)有限公司 Sequence labeling method, system and computer equipment
CN112700203B (en) * 2019-10-23 2022-11-01 北京易真学思教育科技有限公司 Intelligent marking method and device
CN111738596B (en) * 2020-06-22 2024-03-22 中国银行股份有限公司 Work order dispatching method and device
CN112115212B (en) * 2020-09-29 2023-10-03 中国工商银行股份有限公司 Parameter identification method and device and electronic equipment
US11941000B2 (en) 2021-04-16 2024-03-26 International Business Machines Corporation Cognitive generation of tailored analogies
CN113223532B (en) * 2021-04-30 2024-03-05 平安科技(深圳)有限公司 Quality inspection method and device for customer service call, computer equipment and storage medium
CN113254620B (en) * 2021-06-21 2022-08-30 中国平安人寿保险股份有限公司 Response method, device and equipment based on graph neural network and storage medium
CN113707131B (en) * 2021-08-30 2024-04-16 中国科学技术大学 Speech recognition method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6101490A (en) * 1991-07-19 2000-08-08 Hatton; Charles Malcolm Computer system program for creating new ideas and solving problems
CN1647069A (en) * 2002-04-11 2005-07-27 株式会社PtoPA Conversation control system and conversation control method
CN1752966A (en) * 2004-09-24 2006-03-29 北京亿维讯科技有限公司 Method of solving problem using wikipedia and user inquiry treatment technology
CN1794233A (en) * 2005-12-28 2006-06-28 刘文印 Network user interactive asking answering method and its system
US20070209069A1 (en) * 2006-03-03 2007-09-06 Motorola, Inc. Push-to-ask protocol layer provisioning and usage method
US20090282114A1 (en) * 2008-05-08 2009-11-12 Junlan Feng System and method for generating suggested responses to an email
CN103902652A (en) * 2014-02-27 2014-07-02 深圳市智搜信息技术有限公司 Automatic question-answering system
US20140358890A1 (en) * 2013-06-04 2014-12-04 Sap Ag Question answering framework

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5519608A (en) 1993-06-24 1996-05-21 Xerox Corporation Method for extracting from a text corpus answers to questions stated in natural language by using linguistic analysis and hypothesis generation
US6778970B2 (en) * 1998-05-28 2004-08-17 Lawrence Au Topological methods to organize semantic network data flows for conversational applications
US7072883B2 (en) * 2001-12-21 2006-07-04 Ut-Battelle Llc System for gathering and summarizing internet information
US20060224569A1 (en) 2005-03-31 2006-10-05 Desanto John A Natural language based search engine and methods of use therefor
FR2906049A1 (en) * 2006-09-19 2008-03-21 Alcatel Sa COMPUTER-IMPLEMENTED METHOD OF DEVELOPING ONTOLOGY FROM NATURAL LANGUAGE TEXT
US7415409B2 (en) * 2006-12-01 2008-08-19 Coveo Solutions Inc. Method to train the language model of a speech recognition system to convert and index voicemails on a search engine
JP4355772B2 (en) * 2007-02-19 2009-11-04 パナソニック株式会社 Force conversion device, speech conversion device, speech synthesis device, speech conversion method, speech synthesis method, and program
JP5119700B2 (en) * 2007-03-20 2013-01-16 富士通株式会社 Prosody modification device, prosody modification method, and prosody modification program
US8737975B2 (en) * 2009-12-11 2014-05-27 At&T Mobility Ii Llc Audio-based text messaging
US20140006012A1 (en) 2012-07-02 2014-01-02 Microsoft Corporation Learning-Based Processing of Natural Language Questions
US9798799B2 (en) 2012-11-15 2017-10-24 Sri International Vehicle personal assistant that interprets spoken natural language input based upon vehicle context
US9251474B2 (en) 2013-03-13 2016-02-02 International Business Machines Corporation Reward based ranker array for question answer system
US9189742B2 (en) * 2013-11-20 2015-11-17 Justin London Adaptive virtual intelligent agent
US9483582B2 (en) * 2014-09-12 2016-11-01 International Business Machines Corporation Identification and verification of factual assertions in natural language

Also Published As

Publication number Publication date
EP3405912A1 (en) 2018-11-28
WO2017127296A1 (en) 2017-07-27
US10176804B2 (en) 2019-01-08
EP3405912A4 (en) 2019-06-26
CN106980624B (en) 2021-03-26
US20170206897A1 (en) 2017-07-20

Similar Documents

Publication Publication Date Title
CN106980624A (en) Text data processing method and device
CN104915340B (en) Natural language question-answering method and device
JP6222821B2 (en) Error correction model learning device and program
CN107818164A (en) A kind of intelligent answer method and its system
CN108287858A (en) The semantic extracting method and device of natural language
CN108288468A (en) Audio recognition method and device
CN105631468A (en) RNN-based automatic picture description generation method
CN108711421A (en) A kind of voice recognition acoustic model method for building up and device and electronic equipment
CN108304375A (en) A kind of information identifying method and its equipment, storage medium, terminal
CN110619050B (en) Intention recognition method and device
CN104572631B (en) The training method and system of a kind of language model
CN109976702A (en) A kind of audio recognition method, device and terminal
CN113095080B (en) Theme-based semantic recognition method and device, electronic equipment and storage medium
CN109582954A (en) Method and apparatus for output information
CN107943940A (en) Data processing method, medium, system and electronic equipment
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN109710732A (en) Information query method, device, storage medium and electronic equipment
CN112650842A (en) Human-computer interaction based customer service robot intention recognition method and related equipment
CN114218488A (en) Information recommendation method and device based on multi-modal feature fusion and processor
CN115099239B (en) Resource identification method, device, equipment and storage medium
CN112883182A (en) Question-answer matching method and device based on machine reading
CN113342948A (en) Intelligent question and answer method and device
CN116091836A (en) Multi-mode visual language understanding and positioning method, device, terminal and medium
CN113609264B (en) Data query method and device for power system nodes
CN113297387B (en) News detection method for image-text mismatching based on NKD-GNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211116

Address after: Room 554, floor 5, building 3, No. 969, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee after: Taobao (China) Software Co., Ltd

Address before: P.O. Box 847, 4th floor, Grand Cayman capital building, British Cayman Islands

Patentee before: Alibaba Group Holdings Limited