CN105786782A - Word vector training method and device - Google Patents

Word vector training method and device

Info

Publication number
CN105786782A
Authority
CN
China
Prior art keywords
word
word vector
corpus
training
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610179115.0A
Other languages
Chinese (zh)
Other versions
CN105786782B (en)
Inventor
邢宁
刘明荣
许静芳
常晓夫
王晓伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Information Service Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN201610179115.0A priority Critical patent/CN105786782B/en
Publication of CN105786782A publication Critical patent/CN105786782A/en
Application granted granted Critical
Publication of CN105786782B publication Critical patent/CN105786782B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a word vector training method and device. The method comprises the following steps: crawling internet web pages to acquire training corpora and saving them in a corpus database; performing word segmentation on each training corpus in the corpus database to obtain an ordered word set corresponding to each training corpus; building a vocabulary from pre-collected user query logs; distributing the training corpora saved in the corpus database to the nodes of a distributed word vector learning model; and configuring the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary. With the word vector training method and device, the trained word vectors are well suited to search services, and high-quality word vectors can be trained with fast iteration.

Description

Word vector training method and device
Technical field
The present invention relates to the field of internet technology, and in particular to a word vector training method and device.
Background
In internet applications, an important problem is how to convert natural language into a data representation that a computer can understand, and the most important step in solving it is to find a method of turning natural-language symbols into numerical data. A common approach at present is deep learning (DL), which adopts the "distributed representation" scheme: each word is represented as a low-dimensional real-valued vector, and this vector is the word vector corresponding to the word. This is how word vectors came about; a word vector can be understood as a vector used to express a word in natural language, suitable for internet applications. For example, word vectors can be used in much natural language processing (NLP) related work, such as clustering and semantic analysis.
At present, people use the word2vec tool with a DL method on a single machine, training on collected corpora to obtain the vector-space vectors corresponding to the words in a vocabulary. This word vector training method runs on a single machine, so its training speed is low, and it is especially hard to apply to business scenarios with very large data volumes. Moreover, it is a general-purpose training method that does not consider the particularities of a specific business scenario, so its training effect under a specific business scenario is poor.
Summary of the invention
To solve the above technical problems, the invention provides a word vector training method and device, so that the trained word vectors are well suited to search services and high-quality word vectors can be trained with fast iteration.
The embodiments of the invention disclose the following technical solutions.
A first aspect of the invention provides a word vector training method, the method comprising:
crawling internet web pages to acquire training corpora, and saving them in a corpus database;
performing word segmentation on each training corpus in the corpus database to obtain an ordered word set corresponding to each training corpus;
building a vocabulary from pre-collected user query logs;
distributing the training corpora saved in the corpus database to the nodes of a distributed word vector learning model;
configuring the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary;
wherein the word vector training comprises: each node training, on its assigned corpora, the words in each corpus's ordered word set that match the vocabulary; and, after the trained word vectors of the vocabulary words obtained by the nodes are synchronized, triggering the next training cycle.
Optionally, performing word segmentation on each training corpus in the corpus database comprises:
segmenting each training corpus using a segmentation tool and a pre-built segmentation dictionary; the segmentation dictionary is built from pre-collected user query logs and an input-method lexicon.
Optionally, building a vocabulary from pre-collected user query logs comprises:
extracting the words contained in the pre-collected user query logs and counting the frequency of each word;
taking the high-frequency words and building the vocabulary.
Optionally, after the high-frequency words are taken, the method further comprises:
merging the high-frequency words using a named-entity dictionary, and then performing the step of building the vocabulary.
Optionally, each node training, on its assigned corpora, the words in each corpus's ordered word set that match the vocabulary, and triggering the next training cycle after the trained word vectors of the vocabulary words obtained by the nodes are synchronized, comprises:
step 1: configuring, for each node in the distributed word vector learning model, the word vector of each word in the node's copy of the vocabulary as an initialization word vector;
step 2: each node training the initialization vectors of the words in its assigned corpora's ordered word sets that match the vocabulary, obtaining the trained word vectors of those words for this cycle;
step 3: judging whether a preset decision condition is met; if yes, entering step 5; if no, entering step 4;
step 4: according to the trained word vectors obtained by the nodes in this cycle, synchronously updating in parallel the initialization vector of each word for this cycle as its initialization vector for the next cycle, and entering step 2;
step 5: obtaining, from the trained word vectors, the word vectors of the corresponding words in the vocabulary.
Optionally, training the words in each assigned corpus's ordered word set that match the vocabulary comprises:
for each assigned corpus, traversing all words in the corpus's ordered word set and matching each word against the vocabulary; if the current word matches an identical word in the vocabulary, training that word to obtain its corresponding word vector.
Optionally, synchronously updating in parallel, according to the word vectors of the vocabulary words obtained by each node's training in the current cycle, the initialization vector of each word for the current cycle as its initialization vector for the next cycle comprises:
performing the synchronized update with the following formula:
w' = w - η · Σ_{i=1}^{N} Δw_i
where w' is the initialization vector of a given vocabulary word for the next cycle on a node; w is the word's initialization vector for the current cycle in the node's vocabulary copy; η is a predetermined coefficient; Δw_i is the difference between the word vector obtained by node i's training of the word in the current cycle and the word's initialization vector for the current cycle; and N is the number of nodes of the learning model.
Optionally, the method further comprises:
configuring each node in the distributed word vector learning model so that each node also trains, on its assigned corpora, the words in each corpus's ordered word set that do not match the vocabulary; synchronizing the trained word vectors of the unmatched words obtained by the nodes and triggering the next training cycle, so that cyclic training yields the word vectors of the unmatched words, which are then saved into the vocabulary; wherein the unmatched words belong to preset categories.
A second aspect of the invention provides a word vector training device, the device comprising:
a corpus building unit, configured to crawl internet web pages, acquire training corpora and save them in a corpus database;
a word segmentation unit, configured to perform word segmentation on each training corpus in the corpus database, obtaining the ordered word set corresponding to each training corpus;
a vocabulary building unit, configured to build a vocabulary from pre-collected user query logs;
a corpus distribution unit, configured to distribute the training corpora saved in the corpus database to the nodes of a distributed word vector learning model;
a first configuration unit, configured to configure the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary; wherein the word vector training comprises: each node training, on its assigned corpora, the words in each corpus's ordered word set that match the vocabulary, and, after the trained word vectors of the vocabulary words obtained by the nodes are synchronized, triggering the next training cycle.
A third aspect of the invention provides a word vector training device, the device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs containing instructions for:
crawling internet web pages to acquire training corpora, and saving them in a corpus database;
performing word segmentation on each training corpus in the corpus database to obtain an ordered word set corresponding to each training corpus;
building a vocabulary from pre-collected user query logs;
distributing the training corpora saved in the corpus database to the nodes of a distributed word vector learning model;
configuring the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary;
wherein the word vector training comprises: each node training, on its assigned corpora, the words in each corpus's ordered word set that match the vocabulary; and, after the trained word vectors of the vocabulary words obtained by the nodes are synchronized, triggering the next training cycle.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
segmenting each training corpus using a segmentation tool and a pre-built segmentation dictionary; the segmentation dictionary is built from pre-collected user query logs and an input-method lexicon.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
extracting the words contained in the pre-collected user query logs and counting the frequency of each word;
taking the high-frequency words and building the vocabulary.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
merging the high-frequency words using a named-entity dictionary, and then executing the instruction of building the vocabulary.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
instruction 1: configuring, for each node in the distributed word vector learning model, the word vector of each word in the node's copy of the vocabulary as an initialization word vector;
instruction 2: each node training the initialization vectors of the words in its assigned corpora's ordered word sets that match the vocabulary, obtaining the trained word vectors of those words for this cycle;
instruction 3: judging whether a preset decision condition is met; if yes, entering instruction 5; if no, entering instruction 4;
instruction 4: according to the trained word vectors obtained by the nodes in this cycle, synchronously updating in parallel the initialization vector of each word for this cycle as its initialization vector for the next cycle, and entering instruction 2;
instruction 5: obtaining, from the trained word vectors, the word vectors of the corresponding words in the vocabulary.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
for each assigned corpus, traversing all words in the corpus's ordered word set and matching each word against the vocabulary; if the current word matches an identical word in the vocabulary, training that word to obtain its corresponding word vector.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
performing the synchronized update with the following formula:
w' = w - η · Σ_{i=1}^{N} Δw_i
where w' is the initialization vector of a given vocabulary word for the next cycle on a node; w is the word's initialization vector for the current cycle in the node's vocabulary copy; η is a predetermined coefficient; Δw_i is the difference between the word vector obtained by node i's training of the word in the current cycle and the word's initialization vector for the current cycle; and N is the number of nodes of the learning model.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
configuring each node in the distributed word vector learning model so that each node cyclically trains, on its assigned corpora, the words in each corpus's ordered word set that do not match the vocabulary, obtaining the trained word vectors of the unmatched words, wherein the unmatched words belong to preset categories;
synchronizing the trained word vectors of the unmatched words obtained by the nodes to obtain the word vectors corresponding to the unmatched words, and saving them into the vocabulary.
Compared with the prior art, the technical solutions provided by the invention have the following beneficial effects.
First, internet web pages are crawled to acquire training corpora, which are saved in a corpus database. This way of building the corpus takes good advantage of internet web resources' high timeliness, representativeness, abundance and wide coverage, making it possible to obtain massive-scale corpora with wide coverage.
Then, word segmentation is performed on each training corpus in the corpus database to obtain the corresponding ordered word sets, and a vocabulary is built from pre-collected user query logs. The invention abandons the traditional approach of building the vocabulary from the corpus and instead builds it from user query logs; since user query logs characterize users' actual search needs, a vocabulary built from the query words contained in them is well suited to search services.
Finally, the training corpora saved in the corpus database are distributed to the nodes of a distributed word vector learning model, which is configured to perform periodic word vector training on each word in the vocabulary, obtaining the corresponding word vectors. To solve the problem of slow training on large-scale corpora, the invention abandons the traditional single-machine multithreaded training and adopts a distributed word vector learning model, improving training speed through multi-node parallel training so that high-quality word vectors can be iterated quickly.
Brief description of the drawings
To describe the technical solutions in the embodiments of the invention or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a word vector training method provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of distributed word vector training provided by an embodiment of the invention;
Fig. 3 is a structural diagram of a word vector training device provided by an embodiment of the invention;
Fig. 4 is a hardware structure diagram of a word vector training device provided by an embodiment of the invention;
Fig. 5 is a schematic structural diagram of a server provided by an embodiment of the invention.
Detailed description of the invention
To make the objects, technical solutions and advantages of the invention clearer, the technical solutions in the embodiments of the invention are described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
The invention provides a word vector training method and device. Having analyzed the specific application background of word vectors, the invention proposes the technical idea of building a dedicated vocabulary from user query logs, so that the trained word vectors are well suited to search services; furthermore, it abandons the traditional single-machine multithreaded training and adopts distributed word vector learning, enabling high-quality word vectors to be trained with fast iteration.
Refer to Fig. 1, which is a flowchart of a word vector training method provided by an embodiment of the invention. As shown in Fig. 1, the method comprises steps 101-105.
Step 101: crawl internet web pages, acquire training corpora, and save them in a corpus database.
Specifically, internet web pages are crawled, and each piece of crawled web page content is saved in the corpus database as one training corpus.
A corpus entry is language data that has actually occurred in real language use. Corpora are usually stored in a corpus database, a computer-hosted database that carries the corpus data; raw corpora generally need processing (analysis and annotation) before they become a useful resource.
At present, China has four types of corpus: the general Modern Chinese corpus, the People's Daily tagged corpus, the Modern Chinese corpus for language teaching and research, and the Modern Chinese corpus oriented to speech signal processing. Accordingly, when corpora are needed, they can be obtained directly from these well-established corpus databases.
However, the content of these corpora is relatively fixed and updated slowly, whereas the openness and novelty of the internet make the language data produced in this field grow multiply every day. If corpora were still obtained only from these existing databases, the acquired corpora would be small in quantity and narrow in coverage, and could not characterize actual language use in the internet domain.
For this reason, so that the acquired corpora can be better applied to the internet domain, and especially to the search domain and search engines, the embodiments of the invention acquire training corpora by crawling internet web pages.
More specifically, the invention also provides the following possible implementation:
use a search engine crawler to crawl internet news pages, web community pages and/or blog pages, and treat the crawled page content as training corpora.
Since internet news pages, web community pages and blog pages are pages with credibility certification, the information they carry is relatively trustworthy, so obtaining training corpora directly from such pages can improve corpus quality.
Of course, the implementation of the invention is not limited to news pages, web community pages and blog pages; popular-science pages, academic-paper site pages and other credibility-certified pages can also be used. Moreover, to further expand the corpus, corpora can also be obtained from the aforementioned well-established corpus databases. This way of acquiring corpora, provided by the embodiments of the invention, takes good advantage of internet web resources' high timeliness, representativeness, abundance and wide coverage, making it possible to obtain massive-scale corpora with wide coverage.
Step 102: perform word segmentation on each training corpus in the corpus database to obtain the ordered word set corresponding to each training corpus, where an ordered word set is a set composed of sequentially ordered words.
In the embodiments of the invention, the corpora crawled from internet web pages are generally sentences or articles. Since word vector training takes words as its training data, after the corpora are acquired each corpus must still be segmented into its corresponding set of ordered words. Specifically, suppose a corpus is an article composed of at least one sentence: each sentence is segmented in turn into a set of words in order of appearance, and the resulting words are further arranged according to the order of the sentences in the original article. For example, for the corpus "I love Beijing. Beijing is the political, economic and cultural center of China", the ordered word set obtained by segmentation could be "I / love / Beijing / Beijing / is / China / 's / political / economic / cultural / center".
Word segmentation depends mainly on a segmentation dictionary, whose quality directly determines the quality of segmentation. A segmentation dictionary is also called a dictionary for word segmentation; for ease of description, this document uses the term segmentation dictionary. At present, commonly adopted segmentation dictionaries are built from the Xinhua Dictionary or other similar published books. However, the fast-developing Chinese internet produces new words and new things every day, and such dictionaries cannot include newly coined internet words in time; if they are used directly to segment corpora obtained from the internet, the segmentation quality is therefore poor.
For this reason, the invention builds a segmentation dictionary particularly suited to internet scenarios, generated mainly from the words in user query logs and an input-method lexicon. As long as users use the internet, user query logs recording query words are generally produced, virtually every minute and second. Meanwhile, an input method maintains a corresponding lexicon that records commonly used words, and the input method itself regularly updates this lexicon by collecting new words produced during user input. Both user query logs and input-method lexicons thus closely follow users' real network behavior and stay up to date; a segmentation dictionary built from them keeps pace with internet development, reflects actual network language use, and adapts well to internet application scenarios.
Specifically, step 102 can be accomplished as follows:
use a segmentation tool and the pre-built segmentation dictionary to segment each training corpus, obtaining the ordered word set corresponding to each corpus; the ordered word set is a set composed of sequentially ordered words, and the segmentation dictionary is built from pre-collected user query logs and an input-method lexicon.
In the embodiments of the invention, after word segmentation each training corpus corresponds to one ordered word set. An ordered word set is the set of words, with a fixed order relation, obtained by segmenting the text recorded in the corpus according to its word order. For example, if a corpus is a blog post, the text is segmented in turn according to the order of its paragraphs, the order of the sentences in each paragraph, and the order of the words in each sentence. As an example, if the text recorded in a corpus is "I love Beijing Tiananmen", the ordered word set obtained by segmenting it in text order is (I / love / Beijing / Tiananmen).
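For illustration only, the following is a minimal sketch of this segmentation step in Python, using the open-source jieba segmenter as a stand-in for the unspecified segmentation tool; the file name user_dict.txt is a hypothetical placeholder for the query-log and input-method dictionary described above:

```python
import jieba

# Hypothetical dictionary file standing in for the segmentation dictionary
# built from user query logs and an input-method lexicon
# (one word per line, as jieba's user-dictionary format expects).
jieba.load_userdict("user_dict.txt")

def to_ordered_word_set(corpus_text: str) -> list:
    """Segment one corpus entry into its ordered word set (order preserved)."""
    return [w for w in jieba.lcut(corpus_text) if w.strip()]

# e.g. to_ordered_word_set("我爱北京天安门") -> ['我', '爱', '北京', '天安门']
```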
Step 103: build a vocabulary from the pre-collected user query logs.
The traditional way of building a vocabulary is to select words from the corpus itself; such vocabularies are general-purpose rather than representative, and cannot be targeted at search scenarios. Considering the application scenario of the word vectors, and so that they can be applied specifically in search scenarios (for example a search engine), the invention proposes building the vocabulary from user query logs. Compared with the general vocabularies of the prior art, the invention extracts from user query logs a dedicated vocabulary that covers most search needs, and the word vectors trained on this vocabulary better match the demands of the search scenario.
Furthermore, considering that the number of Chinese words in the internet era is at massive scale, no training can cover all words, and from the perspective of training time cost it is also unnecessary to do so. Therefore, to build a vocabulary of appropriate size that still covers most query demands, the invention proposes the following construction: extract the words contained in the pre-collected user query logs, count the frequency of each word, take the high-frequency words, and build the vocabulary from them.
Taking the high-frequency words may specifically comprise: filtering out the words whose frequency is below a preset threshold; the remaining words are the high-frequency words.
With this construction, based on the user query logs and the preset threshold, a suitably sized vocabulary can be built, and the words selected by frequency cover most query demands. In this way the invention appropriately reduces the amount of training data while ensuring training quality.
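A minimal sketch of this frequency-threshold construction, assuming the query logs have already been segmented into whitespace-separated words; the threshold value below is an arbitrary assumption:

```python
from collections import Counter

def build_vocabulary(query_logs: list, min_count: int = 100) -> set:
    """Build the vocabulary from segmented user query logs.

    min_count plays the role of the preset frequency threshold;
    its value here is an arbitrary assumption.
    """
    counts = Counter(word for query in query_logs
                     for word in query.split())  # assumes pre-segmented queries
    return {word for word, freq in counts.items() if freq >= min_count}
```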
On top of this construction, the invention further considers that, in generating the vocabulary, a proper noun may be split into several words. For example, the place name "Mudanjiang" may be segmented into "mudan" (peony) and "jiang" (river); likewise, the organization name "iQiyi" may be segmented into "ai" (love) and "qiyi". To solve the problem of proper nouns being mis-segmented, the invention further proposes a preferred scheme for vocabulary construction: after the high-frequency words are obtained and before the vocabulary is built, the high-frequency words are merged using a named-entity dictionary, and the vocabulary-building step is then performed.
The named-entity dictionary contains various proper nouns, such as person names, place names and organization names, i.e. words that describe entity names.
The named-entity dictionary is used to perform entity merging on the words. For example, for the segmented fragments "ai" and "qiyi", matching them against the entity word "iQiyi" merges the two fragments, making the vocabulary's words more faithful and more accurate.
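A sketch of this entity-merging step; the greedy left-to-right pairwise merge is an assumption, since the patent specifies only that fragments found in the named-entity dictionary should be recombined:

```python
def merge_entities(words: list, entity_dict: set) -> list:
    """Re-join adjacent word fragments that together form a named entity.

    entity_dict stands in for the named-entity dictionary; the greedy
    pairwise strategy is an assumption, as the merging algorithm is
    not specified.
    """
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and words[i] + words[i + 1] in entity_dict:
            merged.append(words[i] + words[i + 1])  # e.g. '爱' + '奇艺' -> '爱奇艺'
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged
```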
The training data required for model training is obtained through steps 101-103; step 104 is then performed.
Step 104: distribute the training corpora saved in the corpus database to the nodes of the distributed word vector learning model.
Existing conventional training methods usually use a single-machine model, but in the training method of the invention the corpus data volume for word vector training is very large, and a conventional single-machine model cannot meet the training demands. The invention therefore proposes a distributed word vector learning model, using distributed computing to improve training speed and support fast iteration of model training.
Step 104 mainly distributes the training data to the nodes of the distributed word vector learning model so that the nodes jointly carry all the training data. Specifically, the corpora contained in the corpus database are distributed to the nodes of the distributed word vector learning model; the corpora may be distributed evenly, so that each node carries an equal share of the training data, or randomly, so that the shares differ. Refer to the schematic diagram of distributed word vector training shown in Fig. 2: the distributed word vector learning model includes N nodes, each of which may be a device capable of carrying out model training independently, for instance a computer.
In implementation, step 104 may take the ordered word set corresponding to each corpus as one independent piece of training data and randomly assign each corpus in the corpus database to a node of the distributed word vector learning model. For example, suppose the model includes three nodes and the corpus database currently holds about 30,000 corpora; after word segmentation, the ordered word set of each corpus is obtained, and all or part of the corpora are randomly distributed to the three nodes for processing, which in effect feeds the ordered word sets randomly to the nodes. Each node then learns from the ordered word sets of its assigned corpora as its training data. Of course, the training data may be distributed evenly, or distributed adaptively to each node according to the actual situation.
Alternatively, step 104 may distribute the training data according to a preset allocation rule; for example, the preset rule may distribute all corpora to the nodes adaptively in corpus order, so that the shares of training data assigned to the nodes are essentially equal in size. The concrete allocation rule can be set according to actual demands and is not limited to the above examples.
The main purpose of step 104 is to allocate the corpora in the corpus database appropriately to the nodes of the distributed word vector learning model, so that multiple nodes work in parallel and jointly train on all the corpora.
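As an illustrative sketch of the random per-corpus allocation described above (one of the allocation strategies the patent allows; the seed is only for reproducibility):

```python
import random

def distribute_corpora(word_sets: list, n_nodes: int, seed: int = 0) -> list:
    """Randomly assign each corpus's ordered word set to one of n_nodes shards."""
    rng = random.Random(seed)
    shards = [[] for _ in range(n_nodes)]
    for ws in word_sets:
        shards[rng.randrange(n_nodes)].append(ws)
    return shards
```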
Step 105: configure the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary;
wherein the word vector training comprises: each node training, on its assigned corpora, the words in each corpus's ordered word set that match the vocabulary; and, after the trained word vectors of the vocabulary words obtained by the nodes are synchronized, triggering the next training cycle.
Specifically, when the distributed word vector learning model starts working, an initial configuration operation is performed first: an initialization word vector is set for each word in each node's copy of the vocabulary. At the start of training, the initialization vectors of the vocabulary words are identical across all nodes. Each node then trains the vocabulary words from these initialization vectors, obtaining trained word vectors; the trained vectors of each word across the nodes' vocabulary copies are synchronized, and the next cycle of training begins, until training yields the word vector of each word in the vocabulary.
The synchronization process is as follows: for each word in the vocabulary, compute, for each node, the difference between the word vector obtained by that node's training and the initialization vector; from the differences of all nodes, obtain the vector adjustment for the word (generally the mean of the nodes' differences); adjust the word's initialization vector by this adjustment; and take the adjusted vector as the word's initial value for the next training cycle. That is, in each subsequent cycle, the nodes train from the adjusted vectors of the vocabulary words obtained at the end of the preceding cycle.
In implementation, step 105 can be realized in the following way, which comprises steps 1051-1055.
Step 1051: configure the word vector of each word in the vocabulary copy of each node in the distributed word vector learning model as an initialization word vector.
The word vectors of the vocabulary words on every node of the distributed word vector learning model are initialized, so that each word in every node's vocabulary copy starts training from a unified initial word vector.
It should be noted that the vocabulary copy on every node of the distributed word vector learning model is identical, namely the vocabulary generated in step 103.
In implementation, step 1051 can use either of two initialization modes:
one is to randomly initialize the word vectors on any one node and then synchronize the initialization vectors to every node in parallel; the other is to set the word vectors of every node in parallel to the zero vector as the initialization vector.
In implementation, the parallel synchronization can be realized through MPI (Message Passing Interface). MPI is a fairly general parallel programming interface that provides an efficient, scalable and unified parallel programming environment. Of course, other interfaces can also be used to realize the parallel synchronization between the nodes of the distributed word vector learning model.
Specifically, the MPI interface is used to synchronize the initialization word vectors to every node over a 10-gigabit network. Through the parallel synchronization of step 1051, all nodes are configured with identical initialization word vectors.
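Purely as an illustration of this initialization step (the mpi4py binding and the sizes below are assumptions; the patent names MPI only as one usable interface), one node can draw the random initialization and broadcast it to the others:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
VOCAB_SIZE, DIM = 100_000, 5   # illustrative sizes only

if comm.Get_rank() == 0:
    # Mode one: random initialization on a single node...
    init_vectors = np.random.uniform(-0.5, 0.5, (VOCAB_SIZE, DIM))
else:
    init_vectors = np.empty((VOCAB_SIZE, DIM))

# ...then parallel synchronization to every node, so all nodes
# start training from identical initialization word vectors.
comm.Bcast(init_vectors, root=0)
```

Run under e.g. `mpiexec -n 4`, every rank ends up holding the same initialization matrix.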
Step 1052: each node trains, for the words in the ordered word sets of its assigned corpora that match the vocabulary, the initialization vectors of those words, obtaining the trained word vectors of the current cycle.
Step 1053: according to the trained word vectors obtained by the nodes' training in the current cycle, synchronously update in parallel the initialization vector of each word for the current cycle as its initialization vector for the next cycle.
The nodes' initialization vectors are periodically updated in this way, thereby carrying out the periodic training. At the end of each training cycle, it is judged whether the preset decision condition is met, to determine whether to stop training.
Step 1054: judge whether the preset decision condition is met; if yes, enter step 1055; if no, enter step 1052.
Step 1055: obtain, from the initialization vectors of the words for the next cycle, the word vectors of the corresponding words in the vocabulary.
In implementation, each node takes the ordered word sets of the corpora it receives as training data; for each corpus, it traverses all the words in the corpus's ordered word set and trains only the words contained in the vocabulary. At the end of each cycle, the initial word vectors of the words in every node's vocabulary copy are synchronously updated, and all nodes begin the next cycle of training.
First, traversing all the words in an ordered word set and training only the vocabulary words means: traverse each word in the corpus's ordered word set in turn and match it against the vocabulary; if the current word matches an identical word in the vocabulary, train that word to obtain its corresponding word vector; if it does not, discard it and match the next word; continue until every word in the corpus's ordered word set has been matched.
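For illustration, a minimal sketch of this traversal-and-match rule; train_word is a hypothetical callback standing in for one word's vector update, which the patent leaves to the node's training algorithm:

```python
def train_on_corpus(word_set: list, vocabulary: set, train_word) -> None:
    """Traverse a corpus's ordered word set, training only vocabulary words."""
    for word in word_set:
        if word in vocabulary:   # matched: train this word's vector
            train_word(word)
        # unmatched words are discarded and the next word is matched
```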
Second, the parallel synchronization works as follows. Each node accumulates a gradient update Δw via the SGD (stochastic gradient descent) algorithm; for a given vocabulary word, this accumulated update Δw is the difference between the word vector obtained by the node's training of the word in the current cycle and the word's initial vector for the current cycle. The word's initialization vector for the next cycle is computed from the accumulated gradient updates of the word on all nodes, and this vector is then updated on every node in parallel as the word's initialization vector for the next cycle.
In implementation, the initial word vector w' for the next cycle can be computed according to the following formula 1:
w' = w - η · Σ_{i=1}^{N} Δw_i    (formula 1)
where w' is the initial vector of a given vocabulary word for the next cycle; w is the word's initialization vector for the current cycle; η is a predetermined coefficient; Δw_i is the word's accumulated gradient update on node i in the current cycle, obtainable as the difference between the word vector produced by that node's training in the current cycle and the word's initialization vector for the current cycle; and N is the number of nodes of the learning model.
The size of η determines the nodes' update rate; η is generally a value smaller than 1, for example 1/N or 1/2N. Preferably η = 1/N, in which case the initialization vector for the next cycle equals the current cycle's initialization vector minus the mean of the accumulated gradient updates of the N nodes.
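As a non-authoritative sketch of this synchronization (assuming an mpi4py interconnect, which the patent names only as one possible interface), the update of formula 1 with the preferred η = 1/N can be written as:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
N = comm.Get_size()
eta = 1.0 / N   # the preferred setting: averaging the nodes' updates

def synchronize(w_init: np.ndarray, w_trained: np.ndarray) -> np.ndarray:
    """Formula 1: w' = w - eta * sum_i(dw_i), with dw = trained - init per node."""
    local_delta = w_trained - w_init            # this node's accumulated update
    total_delta = np.empty_like(local_delta)
    comm.Allreduce(local_delta, total_delta, op=MPI.SUM)  # sum over all N nodes
    return w_init - eta * total_delta           # identical result on every node
```

Run across the nodes, every rank computes the same w' and proceeds to the next cycle from it; the sign of the delta follows the patent's definition of Δw as trained minus initial.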
The nodes periodically update the initialization vectors in this way, and after each update start the next cycle of training of the vocabulary words from the updated vectors, until all nodes meet the preset training condition.
The preset decision condition may be that the number of training iterations reaches a preset count; it may also be that the accumulated gradient updates of more than a threshold number of vocabulary words are all smaller than a preset update value. Of course, other decision conditions can be set; the purpose of the preset decision condition in the invention is to measure whether the training results of all nodes have converged and the training objective can be reached.
When the training results of all nodes meet the preset decision condition, all nodes of the distributed word vector learning model end training; then, from the trained word vectors of the last training cycle of all nodes, the initialization vectors of the next cycle are computed according to formula 1 above, and these computed vectors are taken as the word vectors of the vocabulary words.
In the invention, the initialization vectors trained by the nodes are updated through parallel synchronization, while the SGD process inside each node runs asynchronously; this update mode can be called semi-asynchronous. The semi-asynchronous distributed word vector learning model proposed by the invention can, while ensuring the convergence of the algorithm, reduce the network communication time cost that frequent synchronization would bring, thereby accelerating model training.
Beyond the learning and training of the vocabulary words described above, the invention makes fuller use of the corpora and also proposes training words of particular categories, such as numeric words, English words and person names, so as to obtain the corresponding word vectors; the word vectors of these categories of words can further optimize the search business.
Specifically, the invention provides an optional scheme that adds the following step to the method shown in Fig. 1:
configure each node in the distributed word vector learning model so that each node also trains, on its assigned corpora, the words in each corpus's ordered word set that do not match the vocabulary; synchronize the trained word vectors of the unmatched words obtained by the nodes and trigger the next training cycle, so that cyclic training yields the word vectors of the unmatched words, which are then saved into the vocabulary; wherein the unmatched words belong to preset categories.
In implementation, when a node of the distributed word vector learning model, while traversing the ordered word set of a corpus, finds that a word does not belong to the vocabulary but does belong to a preset category, it trains that word. In this way the words in the corpora are fully utilized, and training mines the word vectors of words valuable to the search business.
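As a hedged illustration only: the patent names numeric words, English words and person names as examples of preset categories without defining the tests, so the checks below (regexes plus a hypothetical name dictionary) are assumptions:

```python
import re

def in_preset_category(word: str, name_dict: set) -> bool:
    """Decide whether an out-of-vocabulary word still deserves training."""
    if re.fullmatch(r"\d+(\.\d+)?", word):   # numeric words
        return True
    if re.fullmatch(r"[A-Za-z]+", word):     # English words
        return True
    return word in name_dict                 # person names, via a dictionary
```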
It should be noted here that, although the nodes in the above examples adopt the SGD algorithm, in implementation each node may adopt SGD or be trained with other algorithms such as support vector machines, logistic regression or neural networks.
The implementation of the above method is illustrated below with a concrete example.
For example, a user query log contains "world's top ten comedies";
the vocabulary built from this query log includes: "world", "comedy", "top ten";
according to the method provided by the invention, a large-scale corpus is input into the distributed word vector learning model and trained, yielding the word vectors (5-dimensional real-valued vectors) corresponding to the vocabulary words:
"world" (0.004003, 0.004419, -0.003830, -0.003278, 0.001367)
"comedy" (-0.043665, 0.018578, 0.138403, 0.004431, -0.139117)
"top ten" (-0.337518, 0.224568, 0.018613, 0.222294, -0.057880)
In actual training, different vector dimensions can be set according to different demands; the dimension of 5 in the above example is only illustrative, and the implementation of the invention is not limited to this dimension.
It can be seen from the above embodiments that, in the word vector training method provided by the invention, first, internet web pages are crawled to acquire training corpora, which are saved in a corpus database; this way of building the corpus takes good advantage of internet web resources' high timeliness, representativeness, abundance and wide coverage, making it possible to obtain massive-scale corpora with wide coverage.
Then, word segmentation is performed on each training corpus to obtain the corresponding ordered word sets, and a vocabulary is built from pre-collected user query logs; the invention abandons the traditional approach of building the vocabulary from the corpus and instead builds it from user query logs. Since user query logs characterize users' actual search needs, a vocabulary built from the query words contained in them is well suited to search services.
Finally, the training corpora saved in the corpus database are distributed to the nodes of a distributed word vector learning model, which is configured to perform periodic word vector training on each word in the vocabulary, obtaining the corresponding word vectors. To solve the problem of slow training on large-scale corpora, the invention abandons the traditional single-machine multithreaded training and adopts a distributed word vector learning model, improving training speed through multi-node parallel training so that high-quality word vectors can be iterated quickly.
Corresponding to the above method, the invention also provides a corresponding device. Refer to Fig. 3, which is a structural diagram of a word vector training device provided by an embodiment of the invention. As shown in Fig. 3, the device may include: a corpus building unit 201, a word segmentation unit 202, a vocabulary building unit 203, a corpus distribution unit 204 and a first configuration unit 205; the connections and concrete functions of the units are explained below in combination with the working principle of the device.
The corpus building unit 201 is configured to crawl internet web pages, acquire training corpora and save them in a corpus database;
the word segmentation unit 202 is configured to perform word segmentation on each training corpus in the corpus database, obtaining the ordered word set corresponding to each training corpus;
the vocabulary building unit 203 is configured to build a vocabulary from pre-collected user query logs;
the corpus distribution unit 204 is configured to distribute the training corpora saved in the corpus database to the nodes of a distributed word vector learning model;
the first configuration unit 205 is configured to configure the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary; wherein the word vector training comprises: each node training, on its assigned corpora, the words in each corpus's ordered word set that match the vocabulary, and, after the trained word vectors of the vocabulary words obtained by the nodes are synchronized, triggering the next training cycle.
In implementation, the word segmentation unit 202 may include a segmentation processing subunit.
The segmentation processing subunit is configured to segment each training corpus using a segmentation tool and the pre-built segmentation dictionary; the segmentation dictionary is built from pre-collected user query logs and an input-method lexicon.
In implementation, the vocabulary building unit 203 may include a first extraction subunit and a first building subunit.
The first extraction subunit is configured to extract the words contained in the pre-collected user query logs and count the frequency of each word;
the first building subunit is configured to take the high-frequency words and build the vocabulary.
In implementation, the vocabulary building unit 203 may instead include a second extraction subunit, a merging subunit and a second building subunit.
The second extraction subunit is configured to extract the words contained in the pre-collected user query logs and count the frequency of each word;
the merging subunit is configured to merge the high-frequency words using the named-entity dictionary;
the second building subunit is configured to build the vocabulary from the merged high-frequency words.
In implementation, the first configuration unit 205 includes: a configuration sub-unit, a training sub-unit, a judgment sub-unit, a synchronized update sub-unit and a word vector computation sub-unit.
The configuration sub-unit is configured to set the word vector of each word in the word list corresponding to each node in the distributed word vector learning model as an initialization word vector;
The training sub-unit is configured so that each node trains the initialization word vectors of the words that appear in the ordered word sets corresponding to its assigned training corpora and that match the word list, obtaining the trained word vectors of those words for the current cycle;
The judgment sub-unit is configured to judge whether a preset decision condition is met; if so, the word vector computation sub-unit is executed; if not, the synchronized update sub-unit is executed;
The synchronized update sub-unit is configured to update in parallel, according to the trained word vectors obtained by each node in the current training cycle, the current-cycle initialization word vectors of the words, which then serve as the next-cycle initialization word vectors, after which control returns to the training sub-unit;
The word vector computation sub-unit is configured to obtain, according to the trained word vectors, the word vector of each corresponding word in the word list.
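Taken together, the five sub-units amount to the synchronous training loop sketched below. The helpers train_shard and synchronize are hypothetical stand-ins for the per-node training and the parallel update (both sketched further on), and the cycle budget is only one example of a preset decision condition:

```python
import numpy as np

def train_word_vectors(shards, word_list, dim=100, max_cycles=10):
    """One possible rendering of the periodic distributed training cycle."""
    # Configuration sub-unit: a shared initialization word vector per word.
    vectors = {w: np.random.uniform(-0.5, 0.5, dim) for w in word_list}
    for cycle in range(1, max_cycles + 1):
        # Training sub-unit: each node trains a private copy on its own shard.
        node_results = [
            train_shard(shard, {w: v.copy() for w, v in vectors.items()})
            for shard in shards
        ]
        # Synchronized update sub-unit: merge the per-node results in one step.
        vectors = synchronize(vectors, node_results)
        # Judgment sub-unit: the preset decision condition is a cycle budget here.
        if cycle == max_cycles:
            break
    # Word vector computation sub-unit: the synchronized vectors are the result.
    return vectors
```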
Optionally, the training sub-unit includes: a traversal matching sub-unit and a word vector training sub-unit.
The traversal matching sub-unit is configured to, for each assigned training corpus, traverse all the words in the ordered word set corresponding to the corpus and match each word against the word list;
The word vector training sub-unit is configured to, when the matching result of the traversal matching sub-unit is positive, train that word, obtaining the word vector corresponding to that word.
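A sketch of these two sub-units as a single per-node routine. The patent leaves the training objective itself open, so the update below (pulling a word's vector toward its in-window neighbors) is a deliberately simplified placeholder rather than the claimed method:

```python
def train_shard(shard, vectors, window=2, lr=0.025):
    """Train the word-list words found in one node's corpus shard.

    shard:   a list of ordered word sets, one per assigned training corpus.
    vectors: word -> current-cycle initialization vector (numpy arrays).
    """
    for word_set in shard:
        for i, word in enumerate(word_set):
            if word not in vectors:  # traversal matching: skip out-of-list words
                continue
            # Placeholder objective: move the word toward its in-window neighbors.
            lo, hi = max(0, i - window), min(len(word_set), i + window + 1)
            for j in range(lo, hi):
                context = word_set[j]
                if j != i and context in vectors:
                    vectors[word] += lr * (vectors[context] - vectors[word])
    return vectors
```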
Optionally, the synchronized update sub-unit is configured to carry out the synchronized update using the following formula:
w′ = w − η( Σ_{n=1}^{N} Δw_n );
Wherein w′ denotes the next-cycle initialization word vector of a given word in the word list corresponding to a given node; w denotes the current-cycle initialization word vector of that word in the node's word list; η denotes a preset coefficient; Δw_n is obtained by taking the difference between the word vector obtained by training that word at node n in the current cycle and the current-cycle initialization word vector of that word; and N is the number of nodes in the learning model.
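A direct numpy-style rendering of this update, assuming each node reports back its trained vectors so that Δw_n can be formed as the difference against the shared current-cycle vector; the value of η below is arbitrary, standing in for the preset coefficient:

```python
def synchronize(vectors, node_results, eta=0.5):
    """Parallel synchronized update: w' = w - eta * sum_n(delta_w_n).

    vectors:      word -> shared current-cycle initialization vector.
    node_results: per-node dicts of trained vectors (numpy arrays).
    """
    updated = {}
    for word, w in vectors.items():
        # delta_w_n: the node's trained vector minus the shared current vector.
        delta_sum = sum(result[word] - w for result in node_results)
        updated[word] = w - eta * delta_sum  # next-cycle initialization vector
    return updated
```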
Optionally, the device may also include:
A second configuration unit, configured to set up each node in the distributed word vector learning model so that each node, according to the training corpora assigned to it, performs word vector training on the words in the corresponding ordered word sets that do not match the word list; the trained word vectors of the unmatched words obtained by each node's training are synchronized and the next training cycle is triggered, so that each node cyclically trains the word vectors of the unmatched words, and the word vectors corresponding to the unmatched words are saved into the word list; wherein the unmatched words belong to preset categories.
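Under the same cycle, such out-of-list words only need to be admitted into the vector table before training. The sketch below assumes a hypothetical is_preset_category predicate, since the patent does not define the categories themselves:

```python
import numpy as np

def admit_unmatched_words(shards, vectors, is_preset_category, dim=100):
    """Give qualifying out-of-list words initial vectors so that the same
    cyclic training and synchronization can learn them too."""
    for shard in shards:
        for word_set in shard:
            for word in word_set:
                if word not in vectors and is_preset_category(word):
                    # New entry: random initialization, trained from the next cycle on;
                    # its learned vector is then folded into the word list.
                    vectors[word] = np.random.uniform(-0.5, 0.5, dim)
    return vectors
```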
Regarding the device in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the method embodiment, and will not be elaborated here.
In addition, the present invention also provides another word vector training device, explained below in conjunction with Fig. 4.
Fig. 4 is a hardware structure diagram of a word vector training device provided by an embodiment of the present invention. The device 300 shown in Fig. 4 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, etc.
Referring to Fig. 4, the device 300 may include one or more of the following components: a processing component 302, a memory 304, a power component 306, a multimedia component 308, an audio component 310, an input/output (I/O) interface 312, a sensor component 314, and a communication component 316.
The processing component 302 generally controls the overall operation of the device 300, such as operations associated with display, telephone calls, data communication, camera operation and recording operation. The processing component 302 may include one or more processors 320 to execute instructions, so as to complete all or part of the steps of the above method. In addition, the processing component 302 may include one or more modules to facilitate interaction between the processing component 302 and other components. For example, the processing component 302 may include a multimedia module to facilitate interaction between the multimedia component 308 and the processing component 302.
The memory 304 is configured to store various types of data to support the operation of the device 300. Examples of such data include instructions for any application or method operated on the device 300, contact data, phonebook data, messages, pictures, videos, etc. The memory 304 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
The power component 306 provides power to the various components of the device 300. The power component 306 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the device 300.
The multimedia component 308 includes a screen providing an output interface between the device 300 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe action. In some embodiments, the multimedia component 308 includes a front camera and/or a rear camera. When the device 300 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 310 is configured to output and/or input audio signals. For example, the audio component 310 includes a microphone (MIC); when the device 300 is in an operating mode, such as a call mode, a recording mode or a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 304 or sent via the communication component 316. In some embodiments, the audio component 310 also includes a speaker for outputting audio signals.
The I/O interface 312 provides an interface between the processing component 302 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, etc. These buttons may include but are not limited to: a home button, volume buttons, a start button and a lock button.
The sensor component 314 includes one or more sensors for providing state assessments of various aspects of the device 300. For example, the sensor component 314 may detect the open/closed state of the device 300 and the relative positioning of components, such as the display and keypad of the device 300; the sensor component 314 may also detect a change in position of the device 300 or a component of the device 300, the presence or absence of contact between the user and the device 300, the orientation or acceleration/deceleration of the device 300, and a temperature change of the device 300. The sensor component 314 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 314 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 316 is configured to facilitate wired or wireless communication between the device 300 and other devices. The device 300 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 316 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 316 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the device 300 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 304 including instructions, where the instructions can be executed by the processor 320 of the device 300 to perform the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
A non-transitory computer-readable storage medium, wherein when the instructions in the storage medium are executed by the processor of an electronic device, the electronic device is enabled to perform a word vector training method, the processor executing instructions for the following operations:
Crawl internet web pages, acquire training corpora, and save them in a corpus;
Perform word segmentation on each training corpus in the corpus, obtaining an ordered word set corresponding to each training corpus;
Build a word list according to pre-collected user query logs;
Distribute the training corpora saved in the corpus to the nodes of a distributed word vector learning model;
Configure the distributed word vector learning model to perform periodic word vector training on each word in the word list, obtaining the word vector corresponding to each word in the word list;
Wherein the word vector training includes: each node, according to the training corpora assigned to it, trains the words in the corresponding ordered word sets that match the word list, and after the trained word vectors of the words in each node's word list are synchronized, the next training cycle is triggered.
Fig. 5 is a structural diagram of a server in an embodiment of the present invention. The server 1900 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server(TM), Mac OS X(TM), Unix(TM), Linux(TM), FreeBSD(TM), etc.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the methods of the above embodiments can be implemented by software plus a general hardware platform. Based on this understanding, the technical solution of the present invention, or in other words the part that contributes over the prior art, can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to perform the methods described in the embodiments, or in some parts of the embodiments, of the present invention.
It should be noted that the embodiments in this specification are described in a progressive manner; identical or similar parts among the embodiments can be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the device and system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments. The device and system embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
Those skilled in the art, after considering the specification and practicing the invention disclosed herein, will readily conceive of other embodiments of the present invention. The present application is intended to cover any variations, uses or adaptations of the present invention that follow the general principles of the present invention and include common knowledge or conventional technical means in the art not disclosed in this disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (10)

1. A word vector training method, characterized in that the method comprises:
Crawling internet web pages, acquiring training corpora, and saving them in a corpus;
Performing word segmentation on each training corpus in the corpus, obtaining an ordered word set corresponding to each training corpus;
Building a word list according to pre-collected user query logs;
Distributing the training corpora saved in the corpus to the nodes of a distributed word vector learning model;
Configuring the distributed word vector learning model to perform periodic word vector training on each word in the word list, obtaining the word vector corresponding to each word in the word list;
Wherein the word vector training comprises: each node, according to the training corpora assigned to it, training the words in the corresponding ordered word sets that match the word list, and, after the trained word vectors of the words in each node's word list are synchronized, triggering the next training cycle.
2. The method according to claim 1, characterized in that performing word segmentation on each training corpus in the corpus comprises:
Performing word segmentation on each training corpus using a segmentation tool and a pre-built segmentation dictionary; the segmentation dictionary is built from the pre-collected user query logs and an input method lexicon.
3. The method according to claim 1, characterized in that building a word list according to the pre-collected user query logs comprises:
Extracting the words contained in the pre-collected user query logs, and counting the frequency of each word;
Obtaining the high-frequency words, and building and generating the word list from them.
4. The method according to claim 3, characterized in that after obtaining the high-frequency words, the method further comprises:
Merging the high-frequency words using a named entity dictionary, and then performing the step of building and generating the word list.
5. The method according to claim 1, characterized in that configuring the distributed word vector learning model to perform periodic word vector training on each word in the word list, obtaining the word vector corresponding to each word in the word list, comprises:
Step 1: configuring the word vector of each word in the word list corresponding to each node in the distributed word vector learning model as an initialization word vector;
Step 2: each node training the initialization word vectors of the words that appear in the ordered word sets corresponding to its assigned training corpora and that match the word list, obtaining the trained word vectors of those words for the current cycle;
Step 3: according to the trained word vectors obtained by each node in the current cycle, updating in parallel the current-cycle initialization word vectors of the words, which serve as the next-cycle initialization word vectors;
Step 4: judging whether a preset decision condition is met; if so, proceeding to step 5; if not, returning to step 2;
Step 5: obtaining, according to the next-cycle initialization word vectors of the words, the word vector of each corresponding word in the word list.
6. The method according to claim 5, characterized in that training the words that appear in the ordered word sets corresponding to the assigned training corpora and that match the word list comprises:
For each assigned training corpus, traversing all the words in the ordered word set corresponding to the corpus and matching each word against the word list; if the current word matches an identical word in the word list, training that word, obtaining the word vector corresponding to that word.
7. The method according to claim 5, characterized in that updating in parallel, according to the trained word vectors obtained by each node in the current cycle, the current-cycle initialization word vectors of the words as the next-cycle initialization word vectors comprises:
Carrying out the synchronized update using the following formula:
w′ = w − η( Σ_{n=1}^{N} Δw_n );
Wherein w′ denotes the next-cycle initialization word vector of a given word in the word list corresponding to a given node; w denotes the current-cycle initialization word vector of that word in the node's word list; η denotes a preset coefficient; Δw_n is obtained by taking the difference between the word vector obtained by training that word at node n in the current cycle and the current-cycle initialization word vector of that word; and N is the number of nodes in the learning model.
8. The method according to claim 1, characterized in that the method further comprises:
Configuring each node in the distributed word vector learning model so that each node, according to the training corpora assigned to it, performs word vector training on the words in the corresponding ordered word sets that do not match the word list; synchronizing the trained word vectors of the unmatched words obtained by each node's training and triggering the next training cycle, so that each node cyclically trains the word vectors of the unmatched words; and saving the word vectors corresponding to the unmatched words into the word list; wherein the unmatched words belong to preset categories.
9. A word vector training device, characterized in that the device comprises:
A corpus establishing unit, configured to crawl internet web pages, acquire training corpora, and save them in a corpus;
A word segmentation unit, configured to perform word segmentation on each training corpus in the corpus, obtaining an ordered word set corresponding to each training corpus;
A word list construction unit, configured to build a word list according to pre-collected user query logs;
A corpus distribution unit, configured to distribute the training corpora saved in the corpus to the nodes of a distributed word vector learning model;
A first configuration unit, configured to set up the distributed word vector learning model to perform periodic word vector training on each word in the word list, obtaining the word vector corresponding to each word in the word list; wherein the word vector training comprises: each node, according to the training corpora assigned to it, training the words in the corresponding ordered word sets that match the word list, and, after the trained word vectors of the words in each node's word list are synchronized, triggering the next training cycle.
10. A word vector training device, characterized in that it comprises a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs containing instructions for the following operations:
Crawling internet web pages, acquiring training corpora, and saving them in a corpus;
Performing word segmentation on each training corpus in the corpus, obtaining an ordered word set corresponding to each training corpus;
Building a word list according to pre-collected user query logs;
Distributing the training corpora saved in the corpus to the nodes of a distributed word vector learning model;
Configuring the distributed word vector learning model to perform periodic word vector training on each word in the word list, obtaining the word vector corresponding to each word in the word list;
Wherein the word vector training comprises: each node, according to the training corpora assigned to it, training the words in the corresponding ordered word sets that match the word list, and, after the trained word vectors of the words in each node's word list are synchronized, triggering the next training cycle.
CN201610179115.0A 2016-03-25 2016-03-25 A kind of training method and device of term vector Active CN105786782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610179115.0A CN105786782B (en) 2016-03-25 2016-03-25 A kind of training method and device of term vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610179115.0A CN105786782B (en) 2016-03-25 2016-03-25 A kind of training method and device of term vector

Publications (2)

Publication Number Publication Date
CN105786782A true CN105786782A (en) 2016-07-20
CN105786782B CN105786782B (en) 2018-10-19

Family

ID=56390898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610179115.0A Active CN105786782B (en) 2016-03-25 2016-03-25 A kind of training method and device of term vector

Country Status (1)

Country Link
CN (1) CN105786782B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020164A (en) * 2012-11-26 2013-04-03 华北电力大学 Semantic search method based on multi-semantic analysis and personalized sequencing
CN103218444A (en) * 2013-04-22 2013-07-24 中央民族大学 Method of Tibetan language webpage text classification based on semanteme
CN104462051A (en) * 2013-09-12 2015-03-25 腾讯科技(深圳)有限公司 Word segmentation method and device
US20150088875A1 (en) * 2009-04-15 2015-03-26 Lexisnexis, A Division Of Reed Elsevier Inc. System and Method For Ranking Search Results Within Citation Intensive Document Collections
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device
CN104699763A (en) * 2015-02-11 2015-06-10 中国科学院新疆理化技术研究所 Text similarity measuring system based on multi-feature fusion
CN104933183A (en) * 2015-07-03 2015-09-23 重庆邮电大学 Inquiring term rewriting method merging term vector model and naive Bayes
US20160070748A1 (en) * 2014-09-04 2016-03-10 Crimson Hexagon, Inc. Method and apparatus for improved searching of digital content

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108024005B (en) * 2016-11-04 2020-08-21 北京搜狗科技发展有限公司 Information processing method and device, intelligent terminal, server and system
CN108024005A (en) * 2016-11-04 2018-05-11 北京搜狗科技发展有限公司 Information processing method, device, intelligent terminal, server and system
CN106776534A (en) * 2016-11-11 2017-05-31 北京工商大学 The incremental learning method of term vector model
CN106776534B (en) * 2016-11-11 2020-02-11 北京工商大学 Incremental learning method of word vector model
CN110023930B (en) * 2016-11-29 2023-06-23 微软技术许可有限责任公司 Language data prediction using neural networks and online learning
CN110023930A (en) * 2016-11-29 2019-07-16 微软技术许可有限责任公司 It is predicted using neural network and the language data of on-line study
CN106874643A (en) * 2016-12-27 2017-06-20 中国科学院自动化研究所 Build the method and system that knowledge base realizes assisting in diagnosis and treatment automatically based on term vector
CN106874643B (en) * 2016-12-27 2020-02-28 中国科学院自动化研究所 Method and system for automatically constructing knowledge base to realize auxiliary diagnosis and treatment based on word vectors
CN108345580B (en) * 2017-01-22 2020-05-15 创新先进技术有限公司 Word vector processing method and device
US10878199B2 (en) 2017-01-22 2020-12-29 Advanced New Technologies Co., Ltd. Word vector processing for foreign languages
CN108345580A (en) * 2017-01-22 2018-07-31 阿里巴巴集团控股有限公司 A kind of term vector processing method and processing device
CN108628813A (en) * 2017-03-17 2018-10-09 北京搜狗科技发展有限公司 Treating method and apparatus, the device for processing
CN108628813B (en) * 2017-03-17 2022-09-23 北京搜狗科技发展有限公司 Processing method and device for processing
CN107239443A (en) * 2017-05-09 2017-10-10 清华大学 The training method and server of a kind of term vector learning model
CN107015969A (en) * 2017-05-19 2017-08-04 四川长虹电器股份有限公司 Can self-renewing semantic understanding System and method for
CN107577659A (en) * 2017-07-18 2018-01-12 阿里巴巴集团控股有限公司 Term vector processing method, device and electronic equipment
CN107577658A (en) * 2017-07-18 2018-01-12 阿里巴巴集团控股有限公司 Term vector processing method, device and electronic equipment
CN109388689A (en) * 2017-08-08 2019-02-26 中国电信股份有限公司 Word stock generating method and device
CN107451295B (en) * 2017-08-17 2020-06-30 四川长虹电器股份有限公司 Method for obtaining deep learning training data based on grammar network
CN107451295A (en) * 2017-08-17 2017-12-08 四川长虹电器股份有限公司 A kind of method that deep learning training data is obtained based on grammer networks
CN110019830B (en) * 2017-09-20 2022-09-23 腾讯科技(深圳)有限公司 Corpus processing method, corpus processing device, word vector obtaining method, word vector obtaining device, storage medium and equipment
CN110019830A (en) * 2017-09-20 2019-07-16 腾讯科技(深圳)有限公司 Corpus processing, term vector acquisition methods and device, storage medium and equipment
US10769383B2 (en) 2017-10-23 2020-09-08 Alibaba Group Holding Limited Cluster-based word vector processing method, device, and apparatus
WO2019080615A1 (en) * 2017-10-23 2019-05-02 阿里巴巴集团控股有限公司 Cluster-based word vector processing method, device, and apparatus
TWI721310B (en) * 2017-10-23 2021-03-11 開曼群島商創新先進技術有限公司 Cluster-based word vector processing method, device and equipment
CN109726386A (en) * 2017-10-30 2019-05-07 中国移动通信有限公司研究院 A kind of term vector model generating method, device and computer readable storage medium
CN109726386B (en) * 2017-10-30 2023-05-09 中国移动通信有限公司研究院 Word vector model generation method, device and computer readable storage medium
CN107766565A (en) * 2017-11-06 2018-03-06 广州杰赛科技股份有限公司 Conversational character differentiating method and system
CN108111478A (en) * 2017-11-07 2018-06-01 中国互联网络信息中心 A kind of phishing recognition methods and device based on semantic understanding
WO2019095836A1 (en) * 2017-11-14 2019-05-23 阿里巴巴集团控股有限公司 Method, device, and apparatus for word vector processing based on clusters
US10846483B2 (en) 2017-11-14 2020-11-24 Advanced New Technologies Co., Ltd. Method, device, and apparatus for word vector processing based on clusters
CN108170663A (en) * 2017-11-14 2018-06-15 阿里巴巴集团控股有限公司 Term vector processing method, device and equipment based on cluster
WO2019105134A1 (en) * 2017-11-30 2019-06-06 阿里巴巴集团控股有限公司 Word vector processing method, apparatus and device
CN108170667B (en) * 2017-11-30 2020-06-23 阿里巴巴集团控股有限公司 Word vector processing method, device and equipment
TWI701588B (en) * 2017-11-30 2020-08-11 香港商阿里巴巴集團服務有限公司 Word vector processing method, device and equipment
CN108170667A (en) * 2017-11-30 2018-06-15 阿里巴巴集团控股有限公司 Term vector processing method, device and equipment
CN108231146A (en) * 2017-12-01 2018-06-29 华南师范大学 A kind of medical records model building method, system and device based on deep learning
CN109933778B (en) * 2017-12-18 2024-03-05 北京京东尚科信息技术有限公司 Word segmentation method, word segmentation device and computer readable storage medium
CN109933778A (en) * 2017-12-18 2019-06-25 北京京东尚科信息技术有限公司 Segmenting method, device and computer readable storage medium
CN110197188A (en) * 2018-02-26 2019-09-03 北京京东尚科信息技术有限公司 Method, system, equipment and the storage medium of business scenario prediction, classification
CN108520018A (en) * 2018-03-22 2018-09-11 大连理工大学 A kind of literary works creation age determination method based on term vector
CN108509422B (en) * 2018-04-04 2020-01-24 广州荔支网络技术有限公司 Incremental learning method and device for word vectors and electronic equipment
CN108509422A (en) * 2018-04-04 2018-09-07 广州荔支网络技术有限公司 A kind of Increment Learning Algorithm of term vector, device and electronic equipment
CN110633352A (en) * 2018-06-01 2019-12-31 北京嘀嘀无限科技发展有限公司 Semantic retrieval method and device
CN109543175A (en) * 2018-10-11 2019-03-29 北京诺道认知医学科技有限公司 A kind of method and device for searching synonym
CN109587019A (en) * 2018-12-12 2019-04-05 珠海格力电器股份有限公司 A kind of sound control method of household appliance, device, storage medium and system
CN110266675A (en) * 2019-06-12 2019-09-20 成都积微物联集团股份有限公司 A kind of xss attack automated detection method based on deep learning
CN110191005A (en) * 2019-06-25 2019-08-30 北京九章云极科技有限公司 A kind of alarm log processing method and system
CN113961664A (en) * 2020-07-15 2022-01-21 上海乐言信息科技有限公司 Deep learning-based numerical word processing method, system, terminal and medium
CN112256517B (en) * 2020-08-28 2022-07-08 苏州浪潮智能科技有限公司 Log analysis method and device of virtualization platform based on LSTM-DSSM
CN112256517A (en) * 2020-08-28 2021-01-22 苏州浪潮智能科技有限公司 Log analysis method and device of virtualization platform based on LSTM-DSSM

Also Published As

Publication number Publication date
CN105786782B (en) 2018-10-19

Similar Documents

Publication Publication Date Title
CN105786782A (en) Word vector training method and device
CN104090890B (en) Keyword similarity acquisition methods, device and server
CN110175223A (en) A kind of method and device that problem of implementation generates
KR20160124182A (en) Method and apparatus for grouping contacts
CN107729815A (en) Image processing method, device, mobile terminal and computer-readable recording medium
CN111914113A (en) Image retrieval method and related device
WO2018204076A1 (en) Personalized user-categorized recommendations
CN108304376B (en) Text vector determination method and device, storage medium and electronic device
CN111738010B (en) Method and device for generating semantic matching model
CN115114395B (en) Content retrieval and model training method and device, electronic equipment and storage medium
US20170212886A1 (en) Configurable Generic Language Understanding Models
CN110275962B (en) Method and apparatus for outputting information
CN107291772A (en) One kind search access method, device and electronic equipment
CN111428522B (en) Translation corpus generation method, device, computer equipment and storage medium
WO2023197872A1 (en) Book searching method and apparatus, and device and storage medium
WO2019101099A1 (en) Video program identification method and device, terminal, system, and storage medium
CN110032616A (en) A kind of acquisition method and device of document reading conditions
CN105631404A (en) Method and device for clustering pictures
CN112995757B (en) Video clipping method and device
CN112862021B (en) Content labeling method and related device
CN109871128B (en) Question type identification method and device
CN107707759A (en) Terminal control method, device and system, storage medium
CN111723273A (en) Smart cloud retrieval system and method
CN110968246A (en) Intelligent Chinese handwriting input recognition method and device
CN104376030B (en) The method and apparatus of browser bookmark intelligent packet

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20170821

Address after: 100084. Room 9, floor 02, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant after: Beijing Sogou Information Service Co., Ltd.

Address before: 100084 Beijing, Zhongguancun East Road, building 1, No. 9, Sohu cyber building, room 9, room, room 01

Applicant before: Sogo Science-Technology Development Co., Ltd., Beijing

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant