CN107273503A - Method and apparatus for generating same-language parallel text - Google Patents

Method and apparatus for generating same-language parallel text

Info

Publication number
CN107273503A
CN107273503A (application CN201710464118.3A)
Authority
CN
China
Prior art keywords
word sequence
sequence
word vector
vector
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710464118.3A
Other languages
Chinese (zh)
Other versions
CN107273503B (en)
Inventor
李朋凯
何径舟
付志宏
信贤卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201710464118.3A priority Critical patent/CN107273503B/en
Publication of CN107273503A publication Critical patent/CN107273503A/en
Priority to US15/900,166 priority patent/US10650102B2/en
Application granted granted Critical
Publication of CN107273503B publication Critical patent/CN107273503B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3322 Query formulation using system suggestions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation
    • G06F40/56 Natural language generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a method and apparatus for generating same-language parallel text. One embodiment of the method includes: acquiring a source segmented-word sequence and a pre-trained word vector table; determining, from the word vector table, a source word vector sequence corresponding to the source segmented-word sequence; importing the source word vector sequence into a pre-trained first recurrent neural network model to generate an intermediate vector of a preset dimension that characterizes the semantics of the source segmented-word sequence; importing the intermediate vector into a pre-trained second recurrent neural network model to generate a target word vector sequence corresponding to the intermediate vector; and determining, from the word vector table, a target segmented-word sequence corresponding to the target word vector sequence, and determining the target segmented-word sequence as same-language parallel text corresponding to the source segmented-word sequence. This embodiment reduces the algorithmic complexity of generating same-language parallel text and reduces the required storage space.

Description

Method and apparatus for generating same-language parallel text
Technical field
The present application relates to the field of computer technology, specifically to the field of Internet technology, and more particularly to a method and apparatus for generating same-language parallel text.
Background art
Artificial intelligence (AI) is a new technological science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. As a branch of computer science, artificial intelligence attempts to understand the essence of intelligence and to produce new intelligent machines that can respond in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, expert systems, and the like. Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Generating, for a text, same-language parallel text (text in the same language as, and with the same semantics as, that text) is an important component of natural language processing. Same-language parallel text has many application scenarios. As an example, when a search engine currently retrieves results for a query statement (query) entered by a user, the randomness of user input means that retrieval with the raw query statement often performs poorly; to obtain better retrieval results, same-language parallel text is generally generated for the query statement, and the generated same-language parallel text is then used for retrieval.
At present, however, when generating same-language parallel text for a text, a replacement dictionary is typically generated in advance from a parallel corpus using statistical alignment algorithms or rule-based alignment algorithms; the replaced same-language parallel text is then generated according to prior knowledge and the replacement dictionary. In this existing method of generating same-language parallel text, the alignment algorithms are complex and require considerable manual intervention, the accuracy of the generated replacement dictionary is low, and the replacement dictionary must be stored; since the storage space required for a replacement dictionary is generally several gigabytes, the existing method has the problem of requiring a large amount of storage space.
Summary of the invention
The purpose of the present application is to propose an improved method and apparatus for generating same-language parallel text, to solve the technical problems mentioned in the Background section above.
In a first aspect, an embodiment of the present application provides a method for generating same-language parallel text, the method including: acquiring a source segmented-word sequence and a pre-trained word vector table, where the word vector table is used to characterize the correspondence between words and word vectors; determining, from the word vector table, a source word vector sequence corresponding to the source segmented-word sequence; importing the source word vector sequence into a pre-trained first recurrent neural network model to generate an intermediate vector of a preset dimension that characterizes the semantics of the source segmented-word sequence, where the first recurrent neural network model is used to characterize the correspondence between word vector sequences and vectors of the preset dimension; importing the intermediate vector into a pre-trained second recurrent neural network model to generate a target word vector sequence corresponding to the intermediate vector, where the second recurrent neural network model is used to characterize the correspondence between vectors of the preset dimension and word vector sequences; and determining, from the word vector table, a target segmented-word sequence corresponding to the target word vector sequence, and determining the target segmented-word sequence as same-language parallel text corresponding to the source segmented-word sequence.
In some embodiments, before acquiring the source segmented-word sequence and the pre-trained word vector table, the method further includes: receiving a query request sent by a user through a terminal, the query request including a query statement; preprocessing the query statement to obtain a segmented-word sequence of the query statement, the preprocessing including word segmentation and removal of special characters; and determining the resulting segmented-word sequence as the source segmented-word sequence.
In some embodiments, after determining the target segmented-word sequence as same-language parallel text corresponding to the source segmented-word sequence, the method further includes: performing a search according to the same-language parallel text to obtain search results; and sending the search results to the terminal.
In some embodiments, before acquiring the source segmented-word sequence and the pre-trained word vector table, the method further includes a training step, the training step including: acquiring at least one pair of same-language parallel segmented-word sequences, where each pair of same-language parallel segmented-word sequences includes a first segmented-word sequence and a second segmented-word sequence that are in the same language and semantically identical; acquiring a preset word vector table, a preset first recurrent neural network model, and a preset second recurrent neural network model; for each pair of same-language parallel segmented-word sequences in the at least one pair: determining, from the preset word vector table, a first segmented-word vector sequence corresponding to the first segmented-word sequence of the pair; importing the first segmented-word vector sequence into the preset first recurrent neural network model to obtain a vector of the preset dimension corresponding to the first segmented-word vector sequence; importing the obtained vector into the preset second recurrent neural network model to obtain a second segmented-word vector sequence corresponding to the obtained vector; determining, from the preset word vector table, a word sequence corresponding to the second segmented-word vector sequence; and adjusting the preset word vector table, the preset first recurrent neural network model, and the preset second recurrent neural network model according to the difference between the obtained word sequence and the second segmented-word sequence of the pair; and determining the adjusted preset word vector table, preset first recurrent neural network model, and preset second recurrent neural network model as the trained word vector table, first recurrent neural network model, and second recurrent neural network model.
In some embodiments, the first recurrent neural network model and the second recurrent neural network model are time-recurrent neural network models.
In some embodiments, determining, from the word vector table, the source word vector sequence corresponding to the source segmented-word sequence includes: for each segmented word in the source segmented-word sequence, looking up in the word vector table the word vector matching that segmented word, and determining the found word vector as the source word vector at the position in the source word vector sequence identical to the position of that segmented word in the source segmented-word sequence.
In some embodiments, determining, from the word vector table, the target segmented-word sequence corresponding to the target word vector sequence includes: for each target word vector in the target word vector sequence, selecting from the word vector table the word corresponding to the word vector with the highest similarity to that target word vector, and determining the selected word as the target segmented word at the position in the target segmented-word sequence identical to the position of that target word vector in the target word vector sequence.
In a second aspect, an embodiment of the present application provides an apparatus for generating same-language parallel text, the apparatus including: an acquiring unit configured to acquire a source segmented-word sequence and a pre-trained word vector table, where the word vector table is used to characterize the correspondence between words and word vectors; a first determining unit configured to determine, from the word vector table, a source word vector sequence corresponding to the source segmented-word sequence; a first generating unit configured to import the source word vector sequence into a pre-trained first recurrent neural network model to generate an intermediate vector of a preset dimension that characterizes the semantics of the source segmented-word sequence, where the first recurrent neural network model is used to characterize the correspondence between word vector sequences and vectors of the preset dimension; a second generating unit configured to import the intermediate vector into a pre-trained second recurrent neural network model to generate a target word vector sequence corresponding to the intermediate vector, where the second recurrent neural network model is used to characterize the correspondence between vectors of the preset dimension and word vector sequences; and a second determining unit configured to determine, from the word vector table, a target segmented-word sequence corresponding to the target word vector sequence, and to determine the target segmented-word sequence as same-language parallel text corresponding to the source segmented-word sequence.
In some embodiments, the apparatus further includes: a receiving unit configured to receive a query request sent by a user through a terminal, the query request including a query statement; a preprocessing unit configured to preprocess the query statement to obtain a segmented-word sequence of the query statement, the preprocessing including word segmentation and removal of special characters; and a third determining unit configured to determine the resulting segmented-word sequence as the source segmented-word sequence.
In some embodiments, the apparatus further includes: a search unit configured to perform a search according to the same-language parallel text to obtain search results; and a sending unit configured to send the search results to the terminal.
In some embodiments, the apparatus further includes a training unit, the training unit including: a first acquiring module configured to acquire at least one pair of same-language parallel segmented-word sequences, where each pair of same-language parallel segmented-word sequences includes a first segmented-word sequence and a second segmented-word sequence that are in the same language and semantically identical; a second acquiring module configured to acquire a preset word vector table, a preset first recurrent neural network model, and a preset second recurrent neural network model; an adjusting module configured to, for each pair of same-language parallel segmented-word sequences in the at least one pair: determine, from the preset word vector table, a first segmented-word vector sequence corresponding to the first segmented-word sequence of the pair; import the first segmented-word vector sequence into the preset first recurrent neural network model to obtain a vector of the preset dimension corresponding to the first segmented-word vector sequence; import the obtained vector into the preset second recurrent neural network model to obtain a second segmented-word vector sequence corresponding to the obtained vector; determine, from the preset word vector table, a word sequence corresponding to the second segmented-word vector sequence; and adjust the preset word vector table, the preset first recurrent neural network model, and the preset second recurrent neural network model according to the difference between the obtained word sequence and the second segmented-word sequence of the pair; and a determining module configured to determine the adjusted preset word vector table, preset first recurrent neural network model, and preset second recurrent neural network model as the trained word vector table, first recurrent neural network model, and second recurrent neural network model.
In some embodiments, the first recurrent neural network model and the second recurrent neural network model are time-recurrent neural network models.
In some embodiments, the first determining unit is further configured to: for each segmented word in the source segmented-word sequence, look up in the word vector table the word vector matching that segmented word, and determine the found word vector as the source word vector at the position in the source word vector sequence identical to the position of that segmented word in the source segmented-word sequence.
In some embodiments, the second determining unit is further configured to: for each target word vector in the target word vector sequence, select from the word vector table the word corresponding to the word vector with the highest similarity to that target word vector, and determine the selected word as the target segmented word at the position in the target segmented-word sequence identical to the position of that target word vector in the target word vector sequence.
In a third aspect, an embodiment of the present application provides an electronic device, the electronic device including: one or more processors; and a storage device for storing one or more programs, the one or more programs, when executed by the one or more processors, causing the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method described in any implementation of the first aspect.
The method and apparatus for generating same-language parallel text provided by the embodiments of the present application determine the corresponding source word vector sequence according to the word vector table, import the source word vector sequence into the pre-trained first recurrent neural network model to generate the intermediate vector of the preset dimension that characterizes the semantics of the source segmented-word sequence, import the intermediate vector into the pre-trained second recurrent neural network model to generate the target word vector sequence corresponding to the intermediate vector, determine, according to the word vector table, the target segmented-word sequence corresponding to the target word vector sequence, and finally determine the target segmented-word sequence as same-language parallel text corresponding to the source segmented-word sequence. The generation process thus requires no manual intervention, which reduces the algorithmic complexity of generating same-language parallel text; and there is no need to store a replacement dictionary occupying a large amount of space (generally several gigabytes in size), only the word vector table and the parameters of the first and second recurrent neural network models (together occupying on the order of tens of megabytes), which reduces the required storage space.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings:
Fig. 1 is an exemplary system architecture diagram to which the present application may be applied;
Fig. 2 is a flowchart of one embodiment of a method for generating same-language parallel text according to the present application;
Fig. 3 is a schematic diagram of an application scenario of the method for generating same-language parallel text according to the present application;
Fig. 4 is a flowchart of another embodiment of the method for generating same-language parallel text according to the present application;
Fig. 5 is a structural schematic diagram of one embodiment of an apparatus for generating same-language parallel text according to the present application;
Fig. 6 is a structural schematic diagram of a computer system adapted to implement an electronic device of embodiments of the present application.
Detailed description of embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention, not to limit it. It should also be noted that, for ease of description, only the parts related to the invention are shown in the drawings.
It should be noted that in the case where not conflicting, the feature in embodiment and embodiment in the application can phase Mutually combination.Describe the application in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method or apparatus for generating same-language parallel text of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages. Various client applications may be installed on the terminal devices 101, 102, 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, 103 may be various electronic devices with a display screen, including but not limited to smartphones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, for example a background search server providing support for search websites displayed on the terminal devices 101, 102, 103. The background search server may analyze and otherwise process received data such as search requests, and feed the results (such as web page link data) back to the terminal devices.
It should be noted that the method for generating same-language parallel text provided by the embodiments of the present application is generally performed by the server 105; correspondingly, the apparatus for generating same-language parallel text is generally disposed in the server 105. In some cases, the method for generating same-language parallel text provided by the embodiments of the present application may also be performed by the server 105 alone, without the terminal devices 101, 102, 103; in that case, the server 105 may be either a server with server functions or a general electronic device without server functions but with computing functions.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.
With continued reference to Fig. 2, a flow 200 of one embodiment of the method for generating same-language parallel text according to the present application is shown. The method for generating same-language parallel text includes the following steps:
Step 201: acquire a source segmented-word sequence and a pre-trained word vector table.
In this embodiment, the electronic device on which the method for generating same-language parallel text runs (for example the server shown in Fig. 1) may acquire the source segmented-word sequence and the pre-trained word vector table locally or remotely from other electronic devices connected to the electronic device over a network.
In this embodiment, a segmented word is a word or phrase without special characters or punctuation marks. A segmented-word sequence is a sequence composed of at least one segmented word arranged in order. The source segmented-word sequence may be the segmented-word sequence, stored locally on the electronic device in advance, of a text for which same-language parallel text is to be generated. The source segmented-word sequence may also be the segmented-word sequence, specified by a user, of a text for which same-language parallel text is to be generated. The source segmented-word sequence may also be the segmented-word sequence of a text for which same-language parallel text is to be generated, received by the electronic device from other electronic devices connected to it over a network (for example, the terminal devices shown in Fig. 1). A parallel text of a text is a text semantically similar to that text. Same-language parallel text of a text is a text in the same language as, and semantically similar to, that text. For example, 'ordering a through train' is same-language parallel text of 'through-train ticket booking', and 'does rice contain protein' is same-language parallel text of 'is there protein in rice'.
In this embodiment, the word vector table is used to map words or phrases to real-valued vectors; the mapped real-valued vector is the word vector. By using a word vector table, features in natural language can be reduced from a high-dimensional space of vocabulary size to a relatively low-dimensional space. The principle for evaluating a word vector table is that the similarity between the word vectors of two semantically similar words should be high, and otherwise low. As an example, a word vector may be a real-valued vector in a distributed representation (or distributional representation). The word vector table here may be trained in advance. For example, one record in the word vector table may be the word 'Beijing' with the corresponding word vector '-0.1654, 0.8764, 0.5364, -0.6354, 0.1645'; the word vector here has 5 dimensions, but in practice it may have any number of dimensions, which this application does not specifically limit.
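To make the table concrete, here is a minimal Python sketch (not part of the patent; all words and values are illustrative) that represents a word vector table as a dictionary and performs the position-by-position lookup described in step 202 below:

```python
import numpy as np

# A toy word vector table: each word maps to a 5-dimensional real-valued
# vector, mirroring the "Beijing" example above. Values are illustrative.
word_vector_table = {
    "Beijing": np.array([-0.1654, 0.8764, 0.5364, -0.6354, 0.1645]),
    "ticket":  np.array([0.2311, -0.4420, 0.7102, 0.0853, -0.3317]),
    "train":   np.array([0.5198, 0.0644, -0.2279, 0.6410, 0.1122]),
}

# Mapping a segmented-word sequence to a word vector sequence is a
# one-to-one, position-preserving lookup.
source_sequence = ["train", "ticket"]
source_vectors = [word_vector_table[word] for word in source_sequence]
```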
It should be noted that how to train a word vector table is prior art that has been widely studied and applied, and will not be repeated here.
As an example, a statement library including a large number of sentences, together with the words included in each sentence, may first be obtained; then, for each word in the word library, the sentences in the statement library containing that word are obtained, and within those sentences the context words adjacent to the word are obtained; the word vector of each word is then computed based on the principle of maximizing the sum of the degrees of association between words and their context words.
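As a toy illustration of this association-maximizing principle (a sketch only, not the patent's procedure; real word-embedding training adds normalization and negative examples, which are omitted here):

```python
import numpy as np

def train_word_vectors(sentences, dim=5, lr=0.01, epochs=10):
    """Nudge the vectors of adjacent words toward a higher dot-product
    association, a crude form of maximizing the association sum."""
    vocab = {w for s in sentences for w in s}
    rng = np.random.default_rng(0)
    table = {w: rng.normal(scale=0.1, size=dim) for w in vocab}
    for _ in range(epochs):
        for s in sentences:
            for a, b in zip(s, s[1:]):  # adjacent context pairs
                # gradient of (v_a . v_b) w.r.t. v_a is v_b, and vice versa
                table[a], table[b] = (table[a] + lr * table[b],
                                      table[b] + lr * table[a])
    return table

table = train_word_vectors([["book", "train", "ticket"],
                            ["order", "train", "ticket"]])
```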
As another example, the preset type, within the statement library, of each sentence to which each word to be analyzed belongs may be obtained, yielding a type set corresponding to each word to be analyzed; the word vector of each word to be analyzed is set as a training variable, and a computation model of the sum of the degrees of association between the words to be analyzed is established from the type sets and word vectors corresponding to the words, serving as the training model; according to the training model, based on the principle of maximizing the sum of the degrees of association, the training variables are trained to obtain the word vector of each word to be analyzed.
Step 202: determine, from the word vector table, a source word vector sequence corresponding to the source segmented-word sequence.
In this embodiment, according to the word vector table obtained in step 201, the electronic device (such as the server shown in Fig. 1) may determine the source word vector sequence corresponding to the source segmented-word sequence acquired in step 201. Here, the source word vector sequence is the word vector sequence used to generate the same-language parallel text of the source segmented-word sequence. The source word vector sequence is composed of at least one source word vector arranged in order. Each source word vector in the source word vector sequence corresponds one-to-one to a source segmented word in the source segmented-word sequence, and each source word vector in the source word vector sequence is obtained by looking up, in the word vector table, the source segmented word corresponding to that source word vector in the source segmented-word sequence.
In some optional implementations of this embodiment, step 202 may be carried out as follows: for each segmented word in the source segmented-word sequence, look up in the word vector table the word vector matching that segmented word, and determine the found word vector as the source word vector at the position in the source word vector sequence identical to the position of that segmented word in the source segmented-word sequence.
Step 203: import the source word vector sequence into the pre-trained first recurrent neural network model to generate an intermediate vector of the preset dimension that characterizes the semantics of the source segmented-word sequence.
In this embodiment, the electronic device on which the method for generating same-language parallel text runs may import the source word vector sequence into the pre-trained first recurrent neural network model to generate the intermediate vector of the preset dimension that characterizes the semantics of the source segmented-word sequence. Here, the first recurrent neural network model is used to characterize the correspondence between word vector sequences and vectors of the preset dimension.
In practice, recurrent neural network (RNN) models differ from traditional FNNs (feed-forward neural networks): RNNs introduce directed cycles and can handle problems in which the inputs are correlated over time. In a traditional neural network model, data flows from the input layer to the hidden layer and then to the output layer; the layers are fully connected to each other, but the nodes within each layer are unconnected. Such an ordinary neural network is helpless for problems involving sequences. In a recurrent neural network model, by contrast, the current output of a sequence is also related to the preceding outputs. Concretely, the network memorizes preceding information and applies it to the computation of the current output: the nodes in the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, a recurrent neural network can process sequence data of any length. In practice, however, to reduce complexity it is often assumed that the current state is related only to the previous few states.
As an example, the first recurrent neural network model may be obtained by training an initial first recurrent neural network model, using a large number of word vector sequences and corresponding vectors of the preset dimension as training data, with an arbitrary nonlinear activation function (for example the Sigmoid function, the Softplus function, or the bipolar Sigmoid function) as the neuron activation function of the preset first recurrent neural network model, computing on the input word vector sequence, and taking the vector of the preset dimension corresponding to the input word vector sequence as the output.
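A minimal sketch of such a first model, assuming a plain Elman-style recurrent cell with a sigmoid activation and using the final hidden state as the intermediate vector; the matrix shapes and the choice of cell are assumptions for illustration, not the patent's prescription:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(source_vectors, W_in, W_hidden, preset_dim):
    """Fold a source word vector sequence into a single intermediate
    vector of the preset dimension (here: the final hidden state)."""
    h = np.zeros(preset_dim)
    for v in source_vectors:                  # one recurrence step per word vector
        h = sigmoid(W_in @ v + W_hidden @ h)  # hidden state carries context forward
    return h                                  # intermediate semantic vector

# Illustrative usage: 5-dimensional word vectors, preset dimension 4.
rng = np.random.default_rng(0)
W_in, W_hidden = rng.normal(size=(4, 5)), rng.normal(size=(4, 4))
intermediate = encode([rng.normal(size=5) for _ in range(3)], W_in, W_hidden, 4)
```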
Step 204: import the intermediate vector into the pre-trained second recurrent neural network model to generate a target word vector sequence corresponding to the intermediate vector.
In this embodiment, the electronic device may import the intermediate vector generated in step 203 into the pre-trained second recurrent neural network model to generate the target word vector sequence corresponding to the intermediate vector. Here, the second recurrent neural network model is used to characterize the correspondence between vectors of the preset dimension and word vector sequences. The target word vector sequence is composed of at least one target word vector arranged in order; the number of target word vectors in the target word vector sequence may be the same as, or different from, the number of source word vectors in the source word vector sequence, i.e. the number of target word vectors in the target word vector sequence is not fixed.
As an example, the second recurrent neural network model may be obtained by training an initial second recurrent neural network model, using a large number of vectors of the preset dimension and corresponding word vector sequences as training data, with an arbitrary nonlinear activation function (for example the Sigmoid function, the Softplus function, or the bipolar Sigmoid function) as the neuron activation function of the preset second recurrent neural network model, computing on the input vector of the preset dimension, and taking the word vector sequence corresponding to the input vector of the preset dimension as the output.
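Correspondingly, a sketch of the second model unrolls from the intermediate vector and emits one target word vector per step. The fixed output length and the fact that only the hidden state is fed forward are simplifying assumptions; a real decoder usually also feeds the previous output back in and learns an end-of-sequence marker, which matches the remark above that the output length is not fixed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(intermediate, W_hidden, W_out, num_steps):
    """Unroll the second recurrent model from the intermediate vector,
    emitting one target word vector per step. The output length need not
    equal the source length (fixed to num_steps here for simplicity)."""
    h, target_vectors = intermediate, []
    for _ in range(num_steps):
        h = sigmoid(W_hidden @ h)          # advance the hidden state
        target_vectors.append(W_out @ h)   # emit one target word vector
    return target_vectors
```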
Step 205: determine, from the word vector table, a target segmented-word sequence corresponding to the target word vector sequence, and determine the target segmented-word sequence as same-language parallel text corresponding to the source segmented-word sequence.
In this embodiment, the electronic device may determine, according to the word vector table obtained in step 201, the target segmented-word sequence corresponding to the target word vector sequence generated in step 204, and determine the target segmented-word sequence as the same-language parallel text corresponding to the source segmented-word sequence. Here, the target segmented-word sequence is composed of at least one target segmented word arranged in order. Each target segmented word in the target segmented-word sequence corresponds one-to-one to a target word vector in the target word vector sequence, and each target segmented word in the target segmented-word sequence is obtained by looking up, in the word vector table, the target word vector corresponding to that target segmented word in the target word vector sequence.
In some optional implementations of this embodiment, step 205 may be carried out as follows: for each target word vector in the target word vector sequence, select from the word vector table the word corresponding to the word vector with the highest similarity to that target word vector, and determine the selected word as the target segmented word at the position in the target segmented-word sequence identical to the position of that target word vector in the target word vector sequence.
As an example, the cosine similarity between two word vectors may be computed as the similarity between the two word vectors.
As another example, the Euclidean distance between two word vectors may be computed; the closer the Euclidean distance, the higher the similarity between the two word vectors, and vice versa.
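Both measures are straightforward to compute; the illustrative NumPy-based sketch below also shows the highest-similarity word selection described above (function names are ours, not the patent's):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors; higher is more similar."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_similarity(a, b):
    """Turn Euclidean distance into a similarity: smaller distance, higher score."""
    return 1.0 / (1.0 + float(np.linalg.norm(a - b)))

def nearest_word(target_vector, word_vector_table, similarity=cosine_similarity):
    """Select the table word whose vector has the highest similarity
    to the given target word vector (the selection rule of step 205)."""
    return max(word_vector_table,
               key=lambda w: similarity(word_vector_table[w], target_vector))
```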
Because the word vector table obtained in step 201 can map source segmented words to source word vectors, and the word vector table used in step 205 when mapping target word vectors to target segmented words is the same word vector table obtained in step 201, i.e. the same table used when mapping source segmented words to source word vectors, the target segmented-word sequence obtained by mapping the target word vector sequence according to the word vector table obtained in step 201 is in the same language as, and semantically similar to, the source segmented-word sequence; that is, the obtained target segmented-word sequence is same-language parallel text corresponding to the source segmented-word sequence.
In some optional implementations of this embodiment, the word vector table, the first recurrent neural network model, and the second recurrent neural network model may be obtained through the following training steps:
First, acquire at least one pair of same-language parallel segmented-word sequences.
Here, each pair of same-language parallel segmented-word sequences includes a first segmented-word sequence and a second segmented-word sequence that are in the same language and semantically identical. As an example, each acquired pair of same-language parallel segmented-word sequences may consist of a first segmented-word sequence and a second segmented-word sequence, identical in language and semantics, manually annotated by technicians.
Then, acquire a preset word vector table, a preset first recurrent neural network model, and a preset second recurrent neural network model.
Next, for each pair of same-language parallel segmented-word sequences in the at least one pair: determine, from the preset word vector table, the first segmented-word vector sequence corresponding to the first segmented-word sequence of the pair; import the first segmented-word vector sequence into the preset first recurrent neural network model to obtain a vector of the preset dimension corresponding to the first segmented-word vector sequence; import the obtained vector into the preset second recurrent neural network model to obtain a second segmented-word vector sequence corresponding to the obtained vector; determine, from the preset word vector table, a word sequence corresponding to the second segmented-word vector sequence; and adjust the preset word vector table, the preset first recurrent neural network model, and the preset second recurrent neural network model according to the difference between the obtained word sequence and the second segmented-word sequence of the pair. As an example, adjusting the word vector table may mean adjusting the values of each dimension of the word vectors corresponding to the words in the table; adjusting the first recurrent neural network model may mean adjusting its input matrix, hidden-layer matrix, and output matrix; and adjusting the second recurrent neural network model may likewise mean adjusting its input matrix, hidden-layer matrix, and output matrix.
Finally, determine the preset word vector table, the preset first recurrent neural network model, and the preset second recurrent neural network model as the trained word vector table, first recurrent neural network model, and second recurrent neural network model. Here, the parameters in the preset word vector table and the preset first and second recurrent neural network models have been adjusted and optimized during training, and can achieve better results in use.
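As an illustration of one such adjustment, the sketch below assumes a squared-error measure of the difference and a plain gradient-descent step on the output matrix only; the patent does not prescribe a particular loss or optimizer, and a full implementation would backpropagate the same error signal into the input matrices, hidden-layer matrices, and the word vector table as well.

```python
import numpy as np

def adjust_output_matrix(W_out, h, reference_vector, lr=0.01):
    """One illustrative parameter adjustment. For squared error
    L = ||W_out @ h - reference_vector||^2, the gradient with respect
    to W_out is 2 * error * h^T; move against it."""
    error = W_out @ h - reference_vector      # the "difference information"
    W_out -= lr * 2.0 * np.outer(error, h)    # gradient-descent update
    return W_out
```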
In some optional implementations of this embodiment, the first recurrent neural network model and the second recurrent neural network model may be time-recurrent neural network models, such as LSTM (Long Short-Term Memory) time-recurrent neural network models.
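For reference, a compact sketch of a single LSTM cell step in its standard textbook formulation (not taken from the patent; the gate ordering in the stacked parameters is a convention):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W, U, b hold the stacked input, forget, output, and
    candidate parameters; h is the hidden state, c the cell state."""
    z = W @ x + U @ h + b
    n = len(h)
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2 * n]), sigmoid(z[2 * n:3 * n])
    g = np.tanh(z[3 * n:])
    c = f * c + i * g       # cell state: gated long-term memory
    h = o * np.tanh(c)      # hidden state output
    return h, c
```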
With continued reference to Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the method for generating same-language parallel text according to this embodiment. In the application scenario of Fig. 3, the electronic device first obtains the source segmented-word sequence 301 'ordering a through train' and the word vector table 302; it then determines the source word vector sequence 303 of the source segmented-word sequence 301 through the word vector table 302, imports the source word vector sequence 303 into the pre-trained first recurrent neural network model 304 to generate the intermediate vector 305, imports the intermediate vector 305 into the pre-trained second recurrent neural network model 306 to generate the target word vector sequence 307, and finally determines the target segmented-word sequence 308 'through-train ticket booking' from the target word vector sequence 307 through the word vector table 302, thereby generating 'through-train ticket booking', the same-language parallel text corresponding to the source segmented-word sequence 301 'ordering a through train'.
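Putting the stages of Fig. 3 together, a hedged end-to-end sketch (the encoder and decoder are assumed to be already-trained callables like the ones sketched above; all names are illustrative):

```python
import numpy as np

def generate_parallel_text(source_words, table, encoder, decoder):
    """Source segmented words -> source word vectors -> intermediate vector
    -> target word vectors -> target segmented words (same-language parallel text)."""
    source_vectors = [table[w] for w in source_words]   # steps 201-202
    intermediate = encoder(source_vectors)              # step 203
    target_vectors = decoder(intermediate)              # step 204

    def nearest(v):  # step 205: word with the highest cosine similarity
        return max(table, key=lambda w: float(table[w] @ v)
                   / (np.linalg.norm(table[w]) * np.linalg.norm(v)))

    return [nearest(v) for v in target_vectors]
```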
The method provided by the above embodiment of the present application determines the corresponding source word vector sequence according to the word vector table, imports the source word vector sequence into the pre-trained first recurrent neural network model to generate the intermediate vector of the preset dimension that characterizes the semantics of the source segmented-word sequence, imports the intermediate vector into the pre-trained second recurrent neural network model to generate the target word vector sequence corresponding to the intermediate vector, determines, according to the word vector table, the target segmented-word sequence corresponding to the target word vector sequence, and finally determines the target segmented-word sequence as same-language parallel text corresponding to the source segmented-word sequence. This reduces the algorithmic complexity of generating same-language parallel text and reduces the required storage space.
With further reference to Fig. 4, a flow 400 of another embodiment of the method for generating same-language parallel text is shown. The flow 400 of the method for generating same-language parallel text includes the following steps:
Step 401: receive a query request sent by a user through a terminal.
In this embodiment, the electronic device on which the method for generating same-language parallel text runs (for example the server shown in Fig. 1) may receive, through a wired or wireless connection, the query request sent by the user through the terminal. Here, the query request may include a query statement. As an example, the user may use a browser installed on the terminal to access a search website and enter a query statement, and the terminal sends a query request including the query statement to the electronic device providing support for the search website, so that the electronic device receives the query request.
Step 402: preprocess the query statement to obtain a segmented-word sequence of the query statement.
In this embodiment, the electronic device may preprocess the query statement to obtain the segmented-word sequence of the query statement. Here, the preprocessing may include word segmentation and removal of special characters.
Word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain specifications. It should be noted that word segmentation is prior art widely studied and applied by those skilled in the art, and will not be repeated here. As an example, word segmentation may use string-matching-based methods, understanding-based methods, or statistics-based methods.
Here, special characters are symbols that, relative to traditional or conventional symbols, are used less frequently and are difficult to input directly, such as mathematical symbols, unit symbols, and tab characters. Removing special characters is the process of taking a text from which special characters are to be removed, removing the special characters contained in it, and retaining the portion without special characters.
After the above preprocessing, the segmented-word sequence of the query statement can be obtained. Here, the segmented-word sequence is composed of at least one segmented word arranged in order.
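A minimal sketch of such preprocessing for whitespace-delimited text (a Chinese query would instead go through a dedicated segmenter; the character classes kept by the regular expression are an illustrative choice, not the patent's definition of special characters):

```python
import re

def preprocess(query):
    """Remove special characters (math symbols, tabs, punctuation, ...)
    and split the query into a segmented-word sequence."""
    cleaned = re.sub(r"[^0-9A-Za-z\u4e00-\u9fff ]+", " ", query)
    return cleaned.split()

print(preprocess("order a\tthrough-train ticket!"))
# -> ['order', 'a', 'through', 'train', 'ticket']
```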
Step 403: determine the resulting segmented-word sequence as the source segmented-word sequence.
In this embodiment, the electronic device may determine the segmented-word sequence obtained in step 402 as the source segmented-word sequence, for use in subsequent steps.
Step 404: acquire the source segmented-word sequence and the pre-trained word vector table.
Step 405: determine, from the word vector table, a source word vector sequence corresponding to the source segmented-word sequence.
Step 406: import the source word vector sequence into the pre-trained first recurrent neural network model to generate an intermediate vector of the preset dimension that characterizes the semantics of the source segmented-word sequence.
Step 407: import the intermediate vector into the pre-trained second recurrent neural network model to generate a target word vector sequence corresponding to the intermediate vector.
Step 408: determine, from the word vector table, a target segmented-word sequence corresponding to the target word vector sequence, and determine the target segmented-word sequence as same-language parallel text corresponding to the source segmented-word sequence.
In this embodiment, the specific operations of step 404, step 405, step 406, step 407, and step 408 are substantially the same as those of step 201, step 202, step 203, step 204, and step 205 in the embodiment shown in Fig. 2, and will not be repeated here.
Step 409: perform a search according to the same-language parallel text to obtain search results.
In this embodiment, after determining in step 408 the same-language parallel text corresponding to the source segmented-word sequence, the electronic device may perform a search according to the same-language parallel text to obtain search results. As an example, the search results may include web page links of web pages related to the same-language parallel text. Because users enter query statements into the terminal rather arbitrarily, searching directly with the user's input yields a relatively low recall rate. With the operations of steps 404 to 408, after the same-language parallel text corresponding to the segmented-word sequence of the query statement is generated, the generated same-language parallel text is semantically similar to the query statement but better suited to search, thereby improving the recall rate of the search.
Step 410: send the search results to the terminal.
In this embodiment, the electronic device may send the search results obtained in step 409 to the terminal from which the query request was received in step 401.
As can be seen from Fig. 4, compared with the embodiment corresponding to Fig. 2, the flow 400 of the method for generating same-language parallel text in this embodiment adds the steps of receiving a query request from a terminal, preprocessing the query statement in the query request, performing a search according to the determined same-language parallel text, and returning the search results to the terminal. The scheme described in this embodiment can thus improve the recall rate of search engine retrieval.
With further reference to Fig. 5, as an implementation of the methods shown in the above figures, the present application provides one embodiment of an apparatus for generating same-language parallel text. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be specifically applied in various electronic devices.
As shown in Fig. 5, the apparatus 500 for generating same-language parallel text of this embodiment includes: an acquiring unit 501, a first determining unit 502, a first generating unit 503, a second generating unit 504, and a second determining unit 505. The acquiring unit 501 is configured to acquire a source segmented-word sequence and a pre-trained word vector table, where the word vector table is used to characterize the correspondence between words and word vectors; the first determining unit 502 is configured to determine, from the word vector table, a source word vector sequence corresponding to the source segmented-word sequence; the first generating unit 503 is configured to import the source word vector sequence into a pre-trained first recurrent neural network model to generate an intermediate vector of a preset dimension that characterizes the semantics of the source segmented-word sequence, where the first recurrent neural network model is used to characterize the correspondence between word vector sequences and vectors of the preset dimension; the second generating unit 504 is configured to import the intermediate vector into a pre-trained second recurrent neural network model to generate a target word vector sequence corresponding to the intermediate vector, where the second recurrent neural network model is used to characterize the correspondence between vectors of the preset dimension and word vector sequences; and the second determining unit 505 is configured to determine, from the word vector table, a target segmented-word sequence corresponding to the target word vector sequence, and to determine the target segmented-word sequence as same-language parallel text corresponding to the source segmented-word sequence.
In the present embodiment, the specific processing of the acquiring unit 501, the first determining unit 502, the first generating unit 503, the second generating unit 504 and the second determining unit 505 of the apparatus 500, and the technical effects brought about thereby, may refer to the related descriptions of steps 201, 202, 203, 204 and 205 in the embodiment corresponding to Fig. 2, respectively, and are not repeated here.
In some optional implementations of the present embodiment, the apparatus 500 may further include: a receiving unit 506, configured to receive a query request sent by a user via a terminal, the query request including a query sentence; a preprocessing unit 507, configured to preprocess the query sentence to obtain a segmented word sequence of the query sentence, the preprocessing including word segmentation and removal of special characters; and a third determining unit 508, configured to determine the obtained word sequence as the source word sequence. The specific processing of the receiving unit 506, the preprocessing unit 507 and the third determining unit 508, and the technical effects brought about thereby, may refer to the related descriptions of steps 401, 402 and 403 in the embodiment corresponding to Fig. 4, respectively, and are not repeated here.
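As a rough illustration of the preprocessing unit 507, the sketch below removes special characters and segments a query, here assumed to be Chinese, with the jieba segmenter. The patent does not name a segmentation tool or a character filter, so both the regular expression and the library are assumptions of the sketch.

```python
import re
import jieba  # a common Chinese word segmenter; the patent does not prescribe one

def preprocess(query):
    """Remove special characters, then segment the query into a word sequence."""
    # Keep CJK characters, letters and digits; replace everything else with spaces.
    cleaned = re.sub(r"[^\w\u4e00-\u9fff]+", " ", query)
    return [w for w in jieba.lcut(cleaned) if w.strip()]

# preprocess("怎么做西红柿炒蛋？")  ->  e.g. ["怎么", "做", "西红柿", "炒蛋"]
```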
In some optional implementations of the present embodiment, the apparatus 500 may further include: a searching unit 509, configured to search according to the same-language parallel text to obtain a search result; and a sending unit 510, configured to send the search result to the terminal. The specific processing of the searching unit 509 and the sending unit 510, and the technical effects brought about thereby, may refer to the related descriptions of steps 409 and 410 in the embodiment corresponding to Fig. 4, respectively, and are not repeated here.
In some optional implementations of the present embodiment, the apparatus 500 may further include a training unit 511, and the training unit 511 may include: a first acquiring module 5111, configured to acquire at least one pair of same-language parallel word sequences, each pair of same-language parallel word sequences including a first word sequence and a second word sequence that are in the same language and semantically identical; a second acquiring module 5112, configured to acquire a preset word vector table, a preset first recurrent neural network model and a preset second recurrent neural network model; an adjusting module 5113, configured, for each pair of same-language parallel word sequences in the at least one pair, to: determine, according to the preset word vector table, a first word vector sequence corresponding to the first word sequence of the pair; import the first word vector sequence into the preset first recurrent neural network model to obtain a vector of the preset dimension corresponding to the first word vector sequence; import the obtained vector into the preset second recurrent neural network model to obtain a second word vector sequence corresponding to the obtained vector; determine, according to the preset word vector table, a word sequence corresponding to the second word vector sequence; and adjust the preset word vector table, the preset first recurrent neural network model and the preset second recurrent neural network model according to difference information between the obtained word sequence and the second word sequence of the pair; and a determining module 5114, configured to determine the preset word vector table, the preset first recurrent neural network model and the preset second recurrent neural network model as the trained word vector table, first recurrent neural network model and second recurrent neural network model. The specific processing of the training unit 511 and the technical effects brought about thereby may refer to the related descriptions in the embodiment corresponding to Fig. 2, and are not repeated here.
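The adjusting module 5113 amounts to one gradient step per parallel pair. The following hedged sketch reuses the ParaphraseGenerator class from the earlier sketch and assumes cross-entropy loss with teacher forcing as the concrete form of the "difference information"; the patent leaves that form open.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, first_ids, second_ids, bos_id=1):
    """One adjustment on a pair; first_ids/second_ids are (batch, len) word-id tensors."""
    # First word sequence -> first word vector sequence -> preset-dimension vector.
    _, state = model.encoder(model.embedding(first_ids))
    # Decode with teacher forcing: <bos> plus the second sequence shifted right.
    bos = torch.full((second_ids.size(0), 1), bos_id, dtype=torch.long)
    dec_in = torch.cat([bos, second_ids[:, :-1]], dim=1)
    dec_out, _ = model.decoder(model.embedding(dec_in), state)
    logits = model.project(dec_out)                       # (batch, len, vocab)
    # "Difference information" here: cross-entropy between the generated words
    # and the second word sequence of the parallel pair (an assumed choice).
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, logits.size(-1)),
                                 second_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # jointly adjusts the word vector table and both RNN models
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = train_step(model, optimizer, first_ids, second_ids)
```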
In some optional implementations of the present embodiment, the first recurrent neural network model and the second recurrent neural network model may be time recurrent neural network models, i.e., long short-term memory (LSTM) models.
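The fixed "preset dimension" of the intermediate vector is exactly what a recurrent network provides: its final hidden state has the same size regardless of the input length. A quick self-contained check, with arbitrary sizes:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
for length in (3, 7, 12):
    word_vecs = torch.randn(1, length, 128)  # a word vector sequence of this length
    _, (h, _) = lstm(word_vecs)
    print(h.shape)                           # torch.Size([1, 1, 256]) every time
```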
In some optional implementations of the present embodiment, the first determining unit 502 may be further configured to: for each segmented word in the source word sequence, query the word vector table for the word vector matching the word, and determine the found word vector as the source word vector at the position in the source word vector sequence identical to the position of the word in the source word sequence. The specific processing of the first determining unit 502 and the technical effects brought about thereby may refer to the related description of step 202 in the embodiment corresponding to Fig. 2, and are not repeated here.
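A literal reading of this position-preserving lookup, as a sketch; the dict-based table and the zero-vector fallback for out-of-table words are assumptions of the sketch, not the patent.

```python
import numpy as np

def to_vector_sequence(words, vector_table, dim=128):
    """Position i of the output holds the vector for word i of the input."""
    unk = np.zeros(dim)  # fallback for words missing from the table (an assumption)
    return [vector_table.get(w, unk) for w in words]

# table = {"西红柿": np.random.rand(128), "炒蛋": np.random.rand(128)}
# vecs = to_vector_sequence(["西红柿", "炒蛋"], table)
```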
In some optional implementations of the present embodiment, the second determining unit 505 may be further configured to: for each target word vector in the target word vector sequence, select from the word vector table the word corresponding to the word vector with the highest similarity to the target word vector, and determine the selected word as the target word at the position in the target word sequence identical to the position of the target word vector in the target word vector sequence. The specific processing of the second determining unit 505 and the technical effects brought about thereby may refer to the related description of step 205 in the embodiment corresponding to Fig. 2, and are not repeated here.
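And the reverse direction, sketched with cosine similarity; the patent only requires "highest similarity" without fixing the measure, so cosine is an assumption here.

```python
import numpy as np

def nearest_word(target_vec, vector_table):
    """Return the table word whose vector is most similar to target_vec."""
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(vector_table, key=lambda w: cosine(vector_table[w], target_vec))

def to_word_sequence(target_vecs, vector_table):
    # Position i of the word sequence corresponds to target vector i.
    return [nearest_word(v, vector_table) for v in target_vecs]
```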
Referring now to Fig. 6, a schematic structural diagram of a computer system 600 suitable for implementing the electronic device of the embodiments of the present application is shown. The electronic device shown in Fig. 6 is only an example and should not impose any limitation on the functions and the scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 606 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, the ROM 602 and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: a storage portion 606 including a hard disk or the like; and a communication portion 607 including a network interface card such as a LAN (local area network) card or a modem. The communication portion 607 performs communication processing via a network such as the Internet. A driver 608 is also connected to the I/O interface 605 as needed. A removable medium 609, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the driver 608 as needed, so that a computer program read therefrom may be installed into the storage portion 606 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 607, and/or installed from the removable medium 609. When the computer program is executed by the central processing unit (CPU) 601, the above-mentioned functions defined in the method of the present application are performed. It should be noted that the computer-readable medium described in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples of the computer-readable storage medium include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in combination with an instruction execution system, apparatus or device. In the present application, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination thereof.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions and operations that may be implemented by the systems, methods and computer program products according to the various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment or a portion of code, which contains one or more executable instructions for implementing specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending upon the functions involved. It should further be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks therein, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by means of software or by means of hardware. The described units may also be provided in a processor; for example, a processor may be described as comprising an acquiring unit, a first determining unit, a first generating unit, a second generating unit and a second determining unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the first determining unit may also be described as "a unit for determining a source word vector sequence".
As another aspect, the present application further provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire a source word sequence and a pre-trained word vector table, the word vector table being used to characterize the correspondence between words and word vectors; determine, according to the word vector table, a source word vector sequence corresponding to the source word sequence; import the source word vector sequence into a pre-trained first recurrent neural network model to generate an intermediate vector of a preset dimension characterizing the semantics of the source word sequence, the first recurrent neural network model being used to characterize the correspondence between word vector sequences and vectors of the preset dimension; import the intermediate vector into a pre-trained second recurrent neural network model to generate a target word vector sequence corresponding to the intermediate vector, the second recurrent neural network model being used to characterize the correspondence between vectors of the preset dimension and word vector sequences; and determine, according to the word vector table, a target word sequence corresponding to the target word vector sequence, and determine the target word sequence as the same-language parallel text corresponding to the source word sequence.
The above description is only a preferred embodiment of the present application and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (16)

1. A method for generating a same-language parallel text, characterized in that the method comprises:
acquiring a source word sequence and a pre-trained word vector table, wherein the word vector table is used to characterize the correspondence between words and word vectors;
determining, according to the word vector table, a source word vector sequence corresponding to the source word sequence;
importing the source word vector sequence into a pre-trained first recurrent neural network model to generate an intermediate vector of a preset dimension for characterizing the semantics of the source word sequence, wherein the first recurrent neural network model is used to characterize the correspondence between word vector sequences and vectors of the preset dimension;
importing the intermediate vector into a pre-trained second recurrent neural network model to generate a target word vector sequence corresponding to the intermediate vector, wherein the second recurrent neural network model is used to characterize the correspondence between vectors of the preset dimension and word vector sequences; and
determining, according to the word vector table, a target word sequence corresponding to the target word vector sequence, and determining the target word sequence as the same-language parallel text corresponding to the source word sequence.
2. The method according to claim 1, characterized in that, before acquiring the source word sequence and the pre-trained word vector table, the method further comprises:
receiving a query request sent by a user via a terminal, the query request including a query sentence;
preprocessing the query sentence to obtain a segmented word sequence of the query sentence, the preprocessing including word segmentation and removal of special characters; and
determining the obtained word sequence as the source word sequence.
3. The method according to claim 2, characterized in that, after determining the target word sequence as the same-language parallel text corresponding to the source word sequence, the method further comprises:
searching according to the same-language parallel text to obtain a search result; and
sending the search result to the terminal.
4. The method according to any one of claims 1-3, characterized in that, before acquiring the source word sequence and the pre-trained word vector table, the method further comprises a training step, the training step comprising:
acquiring at least one pair of same-language parallel word sequences, wherein each pair of same-language parallel word sequences includes a first word sequence and a second word sequence that are in the same language and semantically identical;
acquiring a preset word vector table, a preset first recurrent neural network model and a preset second recurrent neural network model;
for each pair of same-language parallel word sequences in the at least one pair: determining, according to the preset word vector table, a first word vector sequence corresponding to the first word sequence of the pair; importing the first word vector sequence into the preset first recurrent neural network model to obtain a vector of the preset dimension corresponding to the first word vector sequence; importing the obtained vector into the preset second recurrent neural network model to obtain a second word vector sequence corresponding to the obtained vector; determining, according to the preset word vector table, a word sequence corresponding to the second word vector sequence; and adjusting the preset word vector table, the preset first recurrent neural network model and the preset second recurrent neural network model according to difference information between the obtained word sequence and the second word sequence of the pair; and
determining the preset word vector table, the preset first recurrent neural network model and the preset second recurrent neural network model as the trained word vector table, first recurrent neural network model and second recurrent neural network model.
5. The method according to claim 4, characterized in that the first recurrent neural network model and the second recurrent neural network model are time recurrent neural network models.
6. The method according to claim 5, characterized in that determining, according to the word vector table, the source word vector sequence corresponding to the source word sequence comprises:
for each segmented word in the source word sequence, querying the word vector table for the word vector matching the word, and determining the found word vector as the source word vector at the position in the source word vector sequence identical to the position of the word in the source word sequence.
7. The method according to claim 6, characterized in that determining, according to the word vector table, the target word sequence corresponding to the target word vector sequence comprises:
for each target word vector in the target word vector sequence, selecting from the word vector table the word corresponding to the word vector with the highest similarity to the target word vector, and determining the selected word as the target word at the position in the target word sequence identical to the position of the target word vector in the target word vector sequence.
8. An apparatus for generating a same-language parallel text, characterized in that the apparatus comprises:
an acquiring unit, configured to acquire a source word sequence and a pre-trained word vector table, wherein the word vector table is used to characterize the correspondence between words and word vectors;
a first determining unit, configured to determine, according to the word vector table, a source word vector sequence corresponding to the source word sequence;
a first generating unit, configured to import the source word vector sequence into a pre-trained first recurrent neural network model to generate an intermediate vector of a preset dimension for characterizing the semantics of the source word sequence, wherein the first recurrent neural network model is used to characterize the correspondence between word vector sequences and vectors of the preset dimension;
a second generating unit, configured to import the intermediate vector into a pre-trained second recurrent neural network model to generate a target word vector sequence corresponding to the intermediate vector, wherein the second recurrent neural network model is used to characterize the correspondence between vectors of the preset dimension and word vector sequences; and
a second determining unit, configured to determine, according to the word vector table, a target word sequence corresponding to the target word vector sequence, and to determine the target word sequence as the same-language parallel text corresponding to the source word sequence.
9. The apparatus according to claim 8, characterized in that the apparatus further comprises:
a receiving unit, configured to receive a query request sent by a user via a terminal, the query request including a query sentence;
a preprocessing unit, configured to preprocess the query sentence to obtain a segmented word sequence of the query sentence, the preprocessing including word segmentation and removal of special characters; and
a third determining unit, configured to determine the obtained word sequence as the source word sequence.
10. The apparatus according to claim 9, characterized in that the apparatus further comprises:
a searching unit, configured to search according to the same-language parallel text to obtain a search result; and
a sending unit, configured to send the search result to the terminal.
11. The apparatus according to any one of claims 8-10, characterized in that the apparatus further comprises a training unit, the training unit comprising:
a first acquiring module, configured to acquire at least one pair of same-language parallel word sequences, wherein each pair of same-language parallel word sequences includes a first word sequence and a second word sequence that are in the same language and semantically identical;
a second acquiring module, configured to acquire a preset word vector table, a preset first recurrent neural network model and a preset second recurrent neural network model;
an adjusting module, configured, for each pair of same-language parallel word sequences in the at least one pair, to: determine, according to the preset word vector table, a first word vector sequence corresponding to the first word sequence of the pair; import the first word vector sequence into the preset first recurrent neural network model to obtain a vector of the preset dimension corresponding to the first word vector sequence; import the obtained vector into the preset second recurrent neural network model to obtain a second word vector sequence corresponding to the obtained vector; determine, according to the preset word vector table, a word sequence corresponding to the second word vector sequence; and adjust the preset word vector table, the preset first recurrent neural network model and the preset second recurrent neural network model according to difference information between the obtained word sequence and the second word sequence of the pair; and
a determining module, configured to determine the preset word vector table, the preset first recurrent neural network model and the preset second recurrent neural network model as the trained word vector table, first recurrent neural network model and second recurrent neural network model.
12. The apparatus according to claim 11, characterized in that the first recurrent neural network model and the second recurrent neural network model are time recurrent neural network models.
13. The apparatus according to claim 12, characterized in that the first determining unit is further configured to:
for each segmented word in the source word sequence, query the word vector table for the word vector matching the word, and determine the found word vector as the source word vector at the position in the source word vector sequence identical to the position of the word in the source word sequence.
14. The apparatus according to claim 13, characterized in that the second determining unit is further configured to:
for each target word vector in the target word vector sequence, select from the word vector table the word corresponding to the word vector with the highest similarity to the target word vector, and determine the selected word as the target word at the position in the target word sequence identical to the position of the target word vector in the target word vector sequence.
15. An electronic device, comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-7.
16. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-7.
CN201710464118.3A 2017-06-19 2017-06-19 Method and device for generating parallel text in same language Active CN107273503B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710464118.3A CN107273503B (en) 2017-06-19 2017-06-19 Method and device for generating parallel text in same language
US15/900,166 US10650102B2 (en) 2017-06-19 2018-02-20 Method and apparatus for generating parallel text in same language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710464118.3A CN107273503B (en) 2017-06-19 2017-06-19 Method and device for generating parallel text in same language

Publications (2)

Publication Number Publication Date
CN107273503A true CN107273503A (en) 2017-10-20
CN107273503B CN107273503B (en) 2020-07-10

Family

ID=60068971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710464118.3A Active CN107273503B (en) 2017-06-19 2017-06-19 Method and device for generating parallel text in same language

Country Status (2)

Country Link
US (1) US10650102B2 (en)
CN (1) CN107273503B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170676A (en) * 2017-12-27 2018-06-15 百度在线网络技术(北京)有限公司 Method, system and the terminal of story creation
CN108268442A (en) * 2017-12-19 2018-07-10 芋头科技(杭州)有限公司 A kind of sentence Intention Anticipation method and system
CN108763277A (en) * 2018-04-10 2018-11-06 平安科技(深圳)有限公司 A kind of data analysing method, computer readable storage medium and terminal device
CN108959467A (en) * 2018-06-20 2018-12-07 华东师范大学 A kind of calculation method of question sentence and the Answer Sentence degree of correlation based on intensified learning
WO2019080648A1 (en) * 2017-10-26 2019-05-02 华为技术有限公司 Retelling sentence generation method and apparatus
CN109858004A (en) * 2019-02-12 2019-06-07 四川无声信息技术有限公司 Text Improvement, device and electronic equipment
WO2019149076A1 (en) * 2018-02-05 2019-08-08 阿里巴巴集团控股有限公司 Word vector generation method, apparatus and device
CN110472251A (en) * 2018-05-10 2019-11-19 腾讯科技(深圳)有限公司 Method, the method for statement translation, equipment and the storage medium of translation model training
CN111291563A (en) * 2020-01-20 2020-06-16 腾讯科技(深圳)有限公司 Word vector alignment method and training method of word vector alignment model
CN111353039A (en) * 2018-12-05 2020-06-30 北京京东尚科信息技术有限公司 File class detection method and device
CN111950272A (en) * 2020-06-23 2020-11-17 北京百度网讯科技有限公司 Text similarity generation method and device and electronic equipment
CN112883295A (en) * 2019-11-29 2021-06-01 北京搜狗科技发展有限公司 Data processing method, device and medium
CN113449515A (en) * 2021-01-27 2021-09-28 心医国际数字医疗系统(大连)有限公司 Medical text prediction method and device and electronic equipment

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133202A (en) * 2017-06-01 2017-09-05 北京百度网讯科技有限公司 Text method of calibration and device based on artificial intelligence
CN109614492B (en) * 2018-12-29 2024-06-18 平安科技(深圳)有限公司 Text data enhancement method, device, equipment and storage medium based on artificial intelligence
CN110321537B (en) * 2019-06-11 2023-04-07 创新先进技术有限公司 Method and device for generating file
CN111797622B (en) * 2019-06-20 2024-04-09 北京沃东天骏信息技术有限公司 Method and device for generating attribute information
CN110442874B (en) * 2019-08-09 2023-06-13 南京邮电大学 Chinese word sense prediction method based on word vector
CN110866395B (en) * 2019-10-30 2023-05-05 语联网(武汉)信息技术有限公司 Word vector generation method and device based on translator editing behaviors
CN110866404B (en) * 2019-10-30 2023-05-05 语联网(武汉)信息技术有限公司 Word vector generation method and device based on LSTM neural network
CN113627135B (en) * 2020-05-08 2023-09-29 百度在线网络技术(北京)有限公司 Recruitment post description text generation method, device, equipment and medium
CN111753551B (en) * 2020-06-29 2022-06-14 北京字节跳动网络技术有限公司 Information generation method and device based on word vector generation model
CN113836950B (en) * 2021-09-22 2024-04-02 广州华多网络科技有限公司 Commodity title text translation method and device, equipment and medium thereof

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1672149A (en) * 2002-05-31 2005-09-21 埃里·阿博 Word association method and apparatus
CN1720524A (en) * 2002-10-29 2006-01-11 埃里·阿博 Knowledge system method and apparatus
US20110202512A1 (en) * 2010-02-14 2011-08-18 Georges Pierre Pantanelli Method to obtain a better understanding and/or translation of texts by using semantic analysis and/or artificial intelligence and/or connotations and/or rating
CN104598611A (en) * 2015-01-29 2015-05-06 百度在线网络技术(北京)有限公司 Method and system for sequencing search entries
CN104699763A (en) * 2015-02-11 2015-06-10 中国科学院新疆理化技术研究所 Text similarity measuring system based on multi-feature fusion
CN105930314A (en) * 2016-04-14 2016-09-07 清华大学 Text summarization generation system and method based on coding-decoding deep neural networks
CN106407381A (en) * 2016-09-13 2017-02-15 北京百度网讯科技有限公司 Method and device for pushing information based on artificial intelligence

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10360901B2 (en) * 2013-12-06 2019-07-23 Nuance Communications, Inc. Learning front-end speech recognition parameters within neural network training
CN105701120B (en) * 2014-11-28 2019-05-03 华为技术有限公司 The method and apparatus for determining semantic matching degree
KR102167719B1 (en) * 2014-12-08 2020-10-19 삼성전자주식회사 Method and apparatus for training language model, method and apparatus for recognizing speech
CN106610972A (en) * 2015-10-21 2017-05-03 阿里巴巴集团控股有限公司 Query rewriting method and apparatus
CN106844368B (en) * 2015-12-03 2020-06-16 华为技术有限公司 Method for man-machine conversation, neural network system and user equipment
US10453074B2 (en) * 2016-07-08 2019-10-22 Asapp, Inc. Automatically suggesting resources for responding to a request

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1672149A (en) * 2002-05-31 2005-09-21 埃里·阿博 Word association method and apparatus
CN1720524A (en) * 2002-10-29 2006-01-11 埃里·阿博 Knowledge system method and apparatus
US20110202512A1 (en) * 2010-02-14 2011-08-18 Georges Pierre Pantanelli Method to obtain a better understanding and/or translation of texts by using semantic analysis and/or artificial intelligence and/or connotations and/or rating
CN104598611A (en) * 2015-01-29 2015-05-06 百度在线网络技术(北京)有限公司 Method and system for sequencing search entries
CN104699763A (en) * 2015-02-11 2015-06-10 中国科学院新疆理化技术研究所 Text similarity measuring system based on multi-feature fusion
CN105930314A (en) * 2016-04-14 2016-09-07 清华大学 Text summarization generation system and method based on coding-decoding deep neural networks
CN106407381A (en) * 2016-09-13 2017-02-15 北京百度网讯科技有限公司 Method and device for pushing information based on artificial intelligence

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710915A (en) * 2017-10-26 2019-05-03 华为技术有限公司 Repeat sentence generation method and device
US11586814B2 (en) 2017-10-26 2023-02-21 Huawei Technologies Co., Ltd. Paraphrase sentence generation method and apparatus
CN109710915B (en) * 2017-10-26 2021-02-23 华为技术有限公司 Method and device for generating repeated statement
WO2019080648A1 (en) * 2017-10-26 2019-05-02 华为技术有限公司 Retelling sentence generation method and apparatus
CN108268442A (en) * 2017-12-19 2018-07-10 芋头科技(杭州)有限公司 A kind of sentence Intention Anticipation method and system
CN108170676B (en) * 2017-12-27 2019-05-10 百度在线网络技术(北京)有限公司 Method, system and the terminal of story creation
CN108170676A (en) * 2017-12-27 2018-06-15 百度在线网络技术(北京)有限公司 Method, system and the terminal of story creation
WO2019149076A1 (en) * 2018-02-05 2019-08-08 阿里巴巴集团控股有限公司 Word vector generation method, apparatus and device
US10824819B2 (en) 2018-02-05 2020-11-03 Alibaba Group Holding Limited Generating word vectors by recurrent neural networks based on n-ary characters
CN108763277B (en) * 2018-04-10 2023-04-18 平安科技(深圳)有限公司 Data analysis method, computer readable storage medium and terminal device
CN108763277A (en) * 2018-04-10 2018-11-06 平安科技(深圳)有限公司 A kind of data analysing method, computer readable storage medium and terminal device
CN110472251B (en) * 2018-05-10 2023-05-30 腾讯科技(深圳)有限公司 Translation model training method, sentence translation equipment and storage medium
CN110472251A (en) * 2018-05-10 2019-11-19 腾讯科技(深圳)有限公司 Method, the method for statement translation, equipment and the storage medium of translation model training
CN108959467A (en) * 2018-06-20 2018-12-07 华东师范大学 A kind of calculation method of question sentence and the Answer Sentence degree of correlation based on intensified learning
CN108959467B (en) * 2018-06-20 2021-10-15 华东师范大学 Method for calculating correlation degree of question sentences and answer sentences based on reinforcement learning
CN111353039A (en) * 2018-12-05 2020-06-30 北京京东尚科信息技术有限公司 File class detection method and device
CN111353039B (en) * 2018-12-05 2024-05-17 北京京东尚科信息技术有限公司 File category detection method and device
CN109858004A (en) * 2019-02-12 2019-06-07 四川无声信息技术有限公司 Text Improvement, device and electronic equipment
CN112883295A (en) * 2019-11-29 2021-06-01 北京搜狗科技发展有限公司 Data processing method, device and medium
CN112883295B (en) * 2019-11-29 2024-02-23 北京搜狗科技发展有限公司 Data processing method, device and medium
CN111291563A (en) * 2020-01-20 2020-06-16 腾讯科技(深圳)有限公司 Word vector alignment method and training method of word vector alignment model
CN111291563B (en) * 2020-01-20 2023-09-01 腾讯科技(深圳)有限公司 Word vector alignment method and word vector alignment model training method
CN111950272A (en) * 2020-06-23 2020-11-17 北京百度网讯科技有限公司 Text similarity generation method and device and electronic equipment
CN111950272B (en) * 2020-06-23 2023-06-27 北京百度网讯科技有限公司 Text similarity generation method and device and electronic equipment
CN113449515A (en) * 2021-01-27 2021-09-28 心医国际数字医疗系统(大连)有限公司 Medical text prediction method and device and electronic equipment

Also Published As

Publication number Publication date
CN107273503B (en) 2020-07-10
US20180365231A1 (en) 2018-12-20
US10650102B2 (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN107273503A (en) Method and apparatus for generating the parallel text of same language
CN107729319B (en) Method and apparatus for outputting information
US11151177B2 (en) Search method and apparatus based on artificial intelligence
CN107168952A (en) Information generating method and device based on artificial intelligence
CN107491534B (en) Information processing method and device
US11501182B2 (en) Method and apparatus for generating model
CN107783960A (en) Method, apparatus and equipment for Extracting Information
CN107066449A (en) Information-pushing method and device
CN107679039A (en) The method and apparatus being intended to for determining sentence
CN105677931B (en) Information search method and device
US10755048B2 (en) Artificial intelligence based method and apparatus for segmenting sentence
CN107832305A (en) Method and apparatus for generating information
CN108038469A (en) Method and apparatus for detecting human body
CN107680579A (en) Text regularization model training method and device, text regularization method and device
CN109933662A (en) Model training method, information generating method, device, electronic equipment and computer-readable medium
CN107908789A (en) Method and apparatus for generating information
CN110555714A (en) method and apparatus for outputting information
CN108804450A (en) The method and apparatus of information push
CN109766418B (en) Method and apparatus for outputting information
CN107832468A (en) Demand recognition methods and device
CN109190124A (en) Method and apparatus for participle
CN107958247A (en) Method and apparatus for facial image identification
CN107506434A (en) Method and apparatus based on artificial intelligence classification phonetic entry text
CN109740167A (en) Method and apparatus for generating information
CN107742128A (en) Method and apparatus for output information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant