CN105786782A - Word vector training method and device - Google Patents

Word vector training method and device

Info

Publication number
CN105786782A
Authority
CN
China
Prior art keywords
word
word vector
corpus
training
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610179115.0A
Other languages
Chinese (zh)
Other versions
CN105786782B (en)
Inventor
邢宁
刘明荣
许静芳
常晓夫
王晓伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Information Service Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN201610179115.0A priority Critical patent/CN105786782B/en
Publication of CN105786782A publication Critical patent/CN105786782A/en
Application granted granted Critical
Publication of CN105786782B publication Critical patent/CN105786782B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a word vector training method and device. The method comprises the following steps: crawling internet web pages to acquire training corpora and saving them in a corpus database; performing word segmentation on each training corpus in the corpus database to obtain an ordered word set corresponding to each training corpus; building a vocabulary from pre-collected user query logs; distributing the training corpora saved in the corpus database to the nodes of a distributed word vector learning model; and configuring the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary. With the word vector training method and device, the trained word vectors are well suited to search services, and high-quality word vectors can be trained with fast iteration.

Description

Word vector training method and device
Technical field
The present invention relates to the field of internet technology, and in particular to a word vector training method and device.
Background
In internet applications, an important problem is how to convert natural language into a data representation that a computer can understand, and the most important step in solving it is to find a method of turning natural-language symbols into numerical data. A common approach at present is deep learning (DL), which adopts the "distributed representation" scheme: each word is represented as a low-dimensional real-valued vector, and this vector is the word vector corresponding to the word. This is how word vectors came about; a word vector can be understood as a vector used to express a word in natural language, suitable for internet applications. For example, word vectors can be used in much natural language processing (NLP) related work, such as clustering and semantic analysis.
At present, people use the word2vec tool with a DL method on a single machine, training on collected corpora to obtain the vector-space vectors corresponding to the words in a vocabulary. This word vector training method runs on a single machine, so its training speed is low, and it is especially hard to apply to business scenarios with very large data volumes. Moreover, it is a general-purpose training method that does not consider the particularities of a specific business scenario, so its training effect under a specific business scenario is poor.
Summary of the invention
To solve the above technical problems, the invention provides a word vector training method and device, so that the trained word vectors are well suited to search services and high-quality word vectors can be trained with fast iteration.
The embodiments of the invention disclose the following technical solutions.
A first aspect of the invention provides a word vector training method, the method comprising:
crawling internet web pages to acquire training corpora, and saving them in a corpus database;
performing word segmentation on each training corpus in the corpus database to obtain an ordered word set corresponding to each training corpus;
building a vocabulary from pre-collected user query logs;
distributing the training corpora saved in the corpus database to the nodes of a distributed word vector learning model;
configuring the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary;
wherein the word vector training comprises: each node training, on its assigned corpora, the words in each corpus's ordered word set that match the vocabulary; and, after the trained word vectors of the vocabulary words obtained by the nodes are synchronized, triggering the next training cycle.
Optionally, performing word segmentation on each training corpus in the corpus database comprises:
segmenting each training corpus using a segmentation tool and a pre-built segmentation dictionary; the segmentation dictionary is built from pre-collected user query logs and an input-method lexicon.
Optionally, building a vocabulary from pre-collected user query logs comprises:
extracting the words contained in the pre-collected user query logs and counting the frequency of each word;
taking the high-frequency words and building the vocabulary.
Optionally, after the high-frequency words are taken, the method further comprises:
merging the high-frequency words using a named-entity dictionary, and then performing the step of building the vocabulary.
Optionally, each node training, on its assigned corpora, the words in each corpus's ordered word set that match the vocabulary, and triggering the next training cycle after the trained word vectors of the vocabulary words obtained by the nodes are synchronized, comprises:
step 1: configuring, for each node in the distributed word vector learning model, the word vector of each word in the node's copy of the vocabulary as an initialization word vector;
step 2: each node training the initialization vectors of the words in its assigned corpora's ordered word sets that match the vocabulary, obtaining the trained word vectors of those words for this cycle;
step 3: judging whether a preset decision condition is met; if yes, entering step 5; if no, entering step 4;
step 4: according to the trained word vectors obtained by the nodes in this cycle, synchronously updating in parallel the initialization vector of each word for this cycle as its initialization vector for the next cycle, and entering step 2;
step 5: obtaining, from the trained word vectors, the word vectors of the corresponding words in the vocabulary.
Optionally, training the words in each assigned corpus's ordered word set that match the vocabulary comprises:
for each assigned corpus, traversing all words in the corpus's ordered word set and matching each word against the vocabulary; if the current word matches an identical word in the vocabulary, training that word to obtain its corresponding word vector.
Optionally, synchronously updating in parallel, according to the word vectors of the vocabulary words obtained by each node's training in the current cycle, the initialization vector of each word for the current cycle as its initialization vector for the next cycle comprises:
performing the synchronized update with the following formula:
w' = w - η · Σ_{i=1}^{N} Δw_i
where w' is the initialization vector of a given vocabulary word for the next cycle on a node; w is the word's initialization vector for the current cycle in the node's vocabulary copy; η is a predetermined coefficient; Δw_i is the difference between the word vector obtained by node i's training of the word in the current cycle and the word's initialization vector for the current cycle; and N is the number of nodes of the learning model.
Optionally, the method further comprises:
configuring each node in the distributed word vector learning model so that each node also trains, on its assigned corpora, the words in each corpus's ordered word set that do not match the vocabulary; synchronizing the trained word vectors of the unmatched words obtained by the nodes and triggering the next training cycle, so that cyclic training yields the word vectors of the unmatched words, which are then saved into the vocabulary; wherein the unmatched words belong to preset categories.
A second aspect of the invention provides a word vector training device, the device comprising:
a corpus building unit, configured to crawl internet web pages, acquire training corpora and save them in a corpus database;
a word segmentation unit, configured to perform word segmentation on each training corpus in the corpus database, obtaining the ordered word set corresponding to each training corpus;
a vocabulary building unit, configured to build a vocabulary from pre-collected user query logs;
a corpus distribution unit, configured to distribute the training corpora saved in the corpus database to the nodes of a distributed word vector learning model;
a first configuration unit, configured to configure the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary; wherein the word vector training comprises: each node training, on its assigned corpora, the words in each corpus's ordered word set that match the vocabulary, and, after the trained word vectors of the vocabulary words obtained by the nodes are synchronized, triggering the next training cycle.
A third aspect of the invention provides a word vector training device, the device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs containing instructions for:
crawling internet web pages to acquire training corpora, and saving them in a corpus database;
performing word segmentation on each training corpus in the corpus database to obtain an ordered word set corresponding to each training corpus;
building a vocabulary from pre-collected user query logs;
distributing the training corpora saved in the corpus database to the nodes of a distributed word vector learning model;
configuring the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary;
wherein the word vector training comprises: each node training, on its assigned corpora, the words in each corpus's ordered word set that match the vocabulary; and, after the trained word vectors of the vocabulary words obtained by the nodes are synchronized, triggering the next training cycle.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
segmenting each training corpus using a segmentation tool and a pre-built segmentation dictionary; the segmentation dictionary is built from pre-collected user query logs and an input-method lexicon.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
extracting the words contained in the pre-collected user query logs and counting the frequency of each word;
taking the high-frequency words and building the vocabulary.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
merging the high-frequency words using a named-entity dictionary, and then executing the instruction of building the vocabulary.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
instruction 1: configuring, for each node in the distributed word vector learning model, the word vector of each word in the node's copy of the vocabulary as an initialization word vector;
instruction 2: each node training the initialization vectors of the words in its assigned corpora's ordered word sets that match the vocabulary, obtaining the trained word vectors of those words for this cycle;
instruction 3: judging whether a preset decision condition is met; if yes, entering instruction 5; if no, entering instruction 4;
instruction 4: according to the trained word vectors obtained by the nodes in this cycle, synchronously updating in parallel the initialization vector of each word for this cycle as its initialization vector for the next cycle, and entering instruction 2;
instruction 5: obtaining, from the trained word vectors, the word vectors of the corresponding words in the vocabulary.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
for each assigned corpus, traversing all words in the corpus's ordered word set and matching each word against the vocabulary; if the current word matches an identical word in the vocabulary, training that word to obtain its corresponding word vector.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
performing the synchronized update with the following formula:
w' = w - η · Σ_{i=1}^{N} Δw_i
where w' is the initialization vector of a given vocabulary word for the next cycle on a node; w is the word's initialization vector for the current cycle in the node's vocabulary copy; η is a predetermined coefficient; Δw_i is the difference between the word vector obtained by node i's training of the word in the current cycle and the word's initialization vector for the current cycle; and N is the number of nodes of the learning model.
Optionally, the processor is further configured to execute the one or more programs containing instructions for:
configuring each node in the distributed word vector learning model so that each node cyclically trains, on its assigned corpora, the words in each corpus's ordered word set that do not match the vocabulary, obtaining the trained word vectors of the unmatched words, wherein the unmatched words belong to preset categories;
synchronizing the trained word vectors of the unmatched words obtained by the nodes to obtain the word vectors corresponding to the unmatched words, and saving them into the vocabulary.
Compared with the prior art, the technical solutions provided by the invention have the following beneficial effects.
First, internet web pages are crawled to acquire training corpora, which are saved in a corpus database. This way of building the corpus takes good advantage of internet web resources' high timeliness, representativeness, abundance and wide coverage, making it possible to obtain massive-scale corpora with wide coverage.
Then, word segmentation is performed on each training corpus in the corpus database to obtain the corresponding ordered word sets, and a vocabulary is built from pre-collected user query logs. The invention abandons the traditional approach of building the vocabulary from the corpus and instead builds it from user query logs; since user query logs characterize users' actual search needs, a vocabulary built from the query words contained in them is well suited to search services.
Finally, the training corpora saved in the corpus database are distributed to the nodes of a distributed word vector learning model, which is configured to perform periodic word vector training on each word in the vocabulary, obtaining the corresponding word vectors. To solve the problem of slow training on large-scale corpora, the invention abandons the traditional single-machine multithreaded training and adopts a distributed word vector learning model, improving training speed through multi-node parallel training so that high-quality word vectors can be iterated quickly.
Brief description of the drawings
To describe the technical solutions in the embodiments of the invention or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a word vector training method provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of distributed word vector training provided by an embodiment of the invention;
Fig. 3 is a structural diagram of a word vector training device provided by an embodiment of the invention;
Fig. 4 is a hardware structure diagram of a word vector training device provided by an embodiment of the invention;
Fig. 5 is a schematic structural diagram of a server provided by an embodiment of the invention.
Detailed description of the invention
To make the objects, technical solutions and advantages of the invention clearer, the technical solutions in the embodiments of the invention are described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
The invention provides a word vector training method and device. Having analyzed the specific application background of word vectors, the invention proposes the technical idea of building a dedicated vocabulary from user query logs, so that the trained word vectors are well suited to search services; furthermore, it abandons the traditional single-machine multithreaded training and adopts distributed word vector learning, enabling high-quality word vectors to be trained with fast iteration.
Refer to Fig. 1, which is a flowchart of a word vector training method provided by an embodiment of the invention. As shown in Fig. 1, the method comprises steps 101-105.
Step 101: crawl internet web pages, acquire training corpora, and save them in a corpus database.
Specifically, internet web pages are crawled, and each piece of crawled web page content is saved in the corpus database as one training corpus.
A corpus entry is language data that has actually occurred in real language use. Corpora are usually stored in a corpus database, a computer-hosted database that carries the corpus data; raw corpora generally need processing (analysis and annotation) before they become a useful resource.
At present, China has four types of corpus: the general Modern Chinese corpus, the People's Daily tagged corpus, the Modern Chinese corpus for language teaching and research, and the Modern Chinese corpus oriented to speech signal processing. Accordingly, when corpora are needed, they can be obtained directly from these well-established corpus databases.
However, the content of these corpora is relatively fixed and updated slowly, whereas the openness and novelty of the internet make the language data produced in this field grow multiply every day. If corpora were still obtained only from these existing databases, the acquired corpora would be small in quantity and narrow in coverage, and could not characterize actual language use in the internet domain.
For this reason, so that the acquired corpora can be better applied to the internet domain, and especially to the search domain and search engines, the embodiments of the invention acquire training corpora by crawling internet web pages.
More specifically, the invention also provides the following possible implementation:
use a search engine crawler to crawl internet news pages, web community pages and/or blog pages, and treat the crawled page content as training corpora.
Since internet news pages, web community pages and blog pages are pages with credibility certification, the information they carry is relatively trustworthy, so obtaining training corpora directly from such pages can improve corpus quality.
Of course, the implementation of the invention is not limited to news pages, web community pages and blog pages; popular-science pages, academic-paper site pages and other credibility-certified pages can also be used. Moreover, to further expand the corpus, corpora can also be obtained from the aforementioned well-established corpus databases. This way of acquiring corpora, provided by the embodiments of the invention, takes good advantage of internet web resources' high timeliness, representativeness, abundance and wide coverage, making it possible to obtain massive-scale corpora with wide coverage.
Step 102: perform word segmentation on each training corpus in the corpus database to obtain the ordered word set corresponding to each training corpus, where an ordered word set is a set composed of sequentially ordered words.
In the embodiments of the invention, the corpora crawled from internet web pages are generally sentences or articles. Since word vector training takes words as its training data, after the corpora are acquired each corpus must still be segmented into its corresponding set of ordered words. Specifically, suppose a corpus is an article composed of at least one sentence: each sentence is segmented in turn into a set of words in order of appearance, and the resulting words are further arranged according to the order of the sentences in the original article. For example, for the corpus "I love Beijing. Beijing is the political, economic and cultural center of China", the ordered word set obtained by segmentation could be "I / love / Beijing / Beijing / is / China / 's / political / economic / cultural / center".
Word segmentation depends mainly on a segmentation dictionary, whose quality directly determines the quality of segmentation. A segmentation dictionary is also called a dictionary for word segmentation; for ease of description, this document uses the term segmentation dictionary. At present, commonly adopted segmentation dictionaries are built from the Xinhua Dictionary or other similar published books. However, the fast-developing Chinese internet produces new words and new things every day, and such dictionaries cannot include newly coined internet words in time; if they are used directly to segment corpora obtained from the internet, the segmentation quality is therefore poor.
For this reason, the invention builds a segmentation dictionary particularly suited to internet scenarios, generated mainly from the words in user query logs and an input-method lexicon. As long as users use the internet, user query logs recording query words are generally produced, virtually every minute and second. Meanwhile, an input method maintains a corresponding lexicon that records commonly used words, and the input method itself regularly updates this lexicon by collecting new words produced during user input. Both user query logs and input-method lexicons thus closely follow users' real network behavior and stay up to date; a segmentation dictionary built from them keeps pace with internet development, reflects actual network language use, and adapts well to internet application scenarios.
Specifically, step 102 can be accomplished as follows:
use a segmentation tool and the pre-built segmentation dictionary to segment each training corpus, obtaining the ordered word set corresponding to each corpus; the ordered word set is a set composed of sequentially ordered words, and the segmentation dictionary is built from pre-collected user query logs and an input-method lexicon.
In the embodiments of the invention, after word segmentation each training corpus corresponds to one ordered word set. An ordered word set is the set of words, with a fixed order relation, obtained by segmenting the text recorded in the corpus according to its word order. For example, if a corpus is a blog post, the text is segmented in turn according to the order of its paragraphs, the order of the sentences in each paragraph, and the order of the words in each sentence. As an example, if the text recorded in a corpus is "I love Beijing Tiananmen", the ordered word set obtained by segmenting it in text order is (I / love / Beijing / Tiananmen).
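For illustration only, the following is a minimal sketch of this segmentation step in Python, using the open-source jieba segmenter as a stand-in for the unspecified segmentation tool; the file name user_dict.txt is a hypothetical placeholder for the query-log and input-method dictionary described above:

```python
import jieba

# Hypothetical dictionary file standing in for the segmentation dictionary
# built from user query logs and an input-method lexicon
# (one word per line, as jieba's user-dictionary format expects).
jieba.load_userdict("user_dict.txt")

def to_ordered_word_set(corpus_text: str) -> list:
    """Segment one corpus entry into its ordered word set (order preserved)."""
    return [w for w in jieba.lcut(corpus_text) if w.strip()]

# e.g. to_ordered_word_set("我爱北京天安门") -> ['我', '爱', '北京', '天安门']
```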
Step 103: build a vocabulary from the pre-collected user query logs.
The traditional way of building a vocabulary is to select words from the corpus itself; such vocabularies are general-purpose rather than representative, and cannot be targeted at search scenarios. Considering the application scenario of the word vectors, and so that they can be applied specifically in search scenarios (for example a search engine), the invention proposes building the vocabulary from user query logs. Compared with the general vocabularies of the prior art, the invention extracts from user query logs a dedicated vocabulary that covers most search needs, and the word vectors trained on this vocabulary better match the demands of the search scenario.
Furthermore, considering that the number of Chinese words in the internet era is at massive scale, no training can cover all words, and from the perspective of training time cost it is also unnecessary to do so. Therefore, to build a vocabulary of appropriate size that still covers most query demands, the invention proposes the following construction: extract the words contained in the pre-collected user query logs, count the frequency of each word, take the high-frequency words, and build the vocabulary from them.
Taking the high-frequency words may specifically comprise: filtering out the words whose frequency is below a preset threshold; the remaining words are the high-frequency words.
With this construction, based on the user query logs and the preset threshold, a suitably sized vocabulary can be built, and the words selected by frequency cover most query demands. In this way the invention appropriately reduces the amount of training data while ensuring training quality.
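A minimal sketch of this frequency-threshold construction, assuming the query logs have already been segmented into whitespace-separated words; the threshold value below is an arbitrary assumption:

```python
from collections import Counter

def build_vocabulary(query_logs: list, min_count: int = 100) -> set:
    """Build the vocabulary from segmented user query logs.

    min_count plays the role of the preset frequency threshold;
    its value here is an arbitrary assumption.
    """
    counts = Counter(word for query in query_logs
                     for word in query.split())  # assumes pre-segmented queries
    return {word for word, freq in counts.items() if freq >= min_count}
```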
On top of this construction, the invention further considers that, in generating the vocabulary, a proper noun may be split into several words. For example, the place name "Mudanjiang" may be segmented into "mudan" (peony) and "jiang" (river); likewise, the organization name "iQiyi" may be segmented into "ai" (love) and "qiyi". To solve the problem of proper nouns being mis-segmented, the invention further proposes a preferred scheme for vocabulary construction: after the high-frequency words are obtained and before the vocabulary is built, the high-frequency words are merged using a named-entity dictionary, and the vocabulary-building step is then performed.
The named-entity dictionary contains various proper nouns, such as person names, place names and organization names, i.e. words that describe entity names.
The named-entity dictionary is used to perform entity merging on the words. For example, for the segmented fragments "ai" and "qiyi", matching them against the entity word "iQiyi" merges the two fragments, making the vocabulary's words more faithful and more accurate.
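A sketch of this entity-merging step; the greedy left-to-right pairwise merge is an assumption, since the patent specifies only that fragments found in the named-entity dictionary should be recombined:

```python
def merge_entities(words: list, entity_dict: set) -> list:
    """Re-join adjacent word fragments that together form a named entity.

    entity_dict stands in for the named-entity dictionary; the greedy
    pairwise strategy is an assumption, as the merging algorithm is
    not specified.
    """
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and words[i] + words[i + 1] in entity_dict:
            merged.append(words[i] + words[i + 1])  # e.g. '爱' + '奇艺' -> '爱奇艺'
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged
```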
The training data required for model training is obtained through steps 101-103; step 104 is then performed.
Step 104: distribute the training corpora saved in the corpus database to the nodes of the distributed word vector learning model.
Existing conventional training methods usually use a single-machine model, but in the training method of the invention the corpus data volume for word vector training is very large, and a conventional single-machine model cannot meet the training demands. The invention therefore proposes a distributed word vector learning model, using distributed computing to improve training speed and support fast iteration of model training.
Step 104 mainly distributes the training data to the nodes of the distributed word vector learning model so that the nodes jointly carry all the training data. Specifically, the corpora contained in the corpus database are distributed to the nodes of the distributed word vector learning model; the corpora may be distributed evenly, so that each node carries an equal share of the training data, or randomly, so that the shares differ. Refer to the schematic diagram of distributed word vector training shown in Fig. 2: the distributed word vector learning model includes N nodes, each of which may be a device capable of carrying out model training independently, for instance a computer.
In implementation, step 104 may take the ordered word set corresponding to each corpus as one independent piece of training data and randomly assign each corpus in the corpus database to a node of the distributed word vector learning model. For example, suppose the model includes three nodes and the corpus database currently holds about 30,000 corpora; after word segmentation, the ordered word set of each corpus is obtained, and all or part of the corpora are randomly distributed to the three nodes for processing, which in effect feeds the ordered word sets randomly to the nodes. Each node then learns from the ordered word sets of its assigned corpora as its training data. Of course, the training data may be distributed evenly, or distributed adaptively to each node according to the actual situation.
Alternatively, step 104 may distribute the training data according to a preset allocation rule; for example, the preset rule may distribute all corpora to the nodes adaptively in corpus order, so that the shares of training data assigned to the nodes are essentially equal in size. The concrete allocation rule can be set according to actual demands and is not limited to the above examples.
The main purpose of step 104 is to allocate the corpora in the corpus database appropriately to the nodes of the distributed word vector learning model, so that multiple nodes work in parallel and jointly train on all the corpora.
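As an illustrative sketch of the random per-corpus allocation described above (one of the allocation strategies the patent allows; the seed is only for reproducibility):

```python
import random

def distribute_corpora(word_sets: list, n_nodes: int, seed: int = 0) -> list:
    """Randomly assign each corpus's ordered word set to one of n_nodes shards."""
    rng = random.Random(seed)
    shards = [[] for _ in range(n_nodes)]
    for ws in word_sets:
        shards[rng.randrange(n_nodes)].append(ws)
    return shards
```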
Step 105: configure the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary;
wherein the word vector training comprises: each node training, on its assigned corpora, the words in each corpus's ordered word set that match the vocabulary; and, after the trained word vectors of the vocabulary words obtained by the nodes are synchronized, triggering the next training cycle.
Specifically, when the distributed word vector learning model starts working, an initial configuration operation is performed first: an initialization word vector is set for each word in each node's copy of the vocabulary. At the start of training, the initialization vectors of the vocabulary words are identical across all nodes. Each node then trains the vocabulary words from these initialization vectors, obtaining trained word vectors; the trained vectors of each word across the nodes' vocabulary copies are synchronized, and the next cycle of training begins, until training yields the word vector of each word in the vocabulary.
The synchronization process is as follows: for each word in the vocabulary, compute, for each node, the difference between the word vector obtained by that node's training and the initialization vector; from the differences of all nodes, obtain the vector adjustment for the word (generally the mean of the nodes' differences); adjust the word's initialization vector by this adjustment; and take the adjusted vector as the word's initial value for the next training cycle. That is, in each subsequent cycle, the nodes train from the adjusted vectors of the vocabulary words obtained at the end of the preceding cycle.
In implementation, step 105 can be realized in the following way, which comprises steps 1051-1055.
Step 1051: configure the word vector of each word in the vocabulary copy of each node in the distributed word vector learning model as an initialization word vector.
The word vectors of the vocabulary words on every node of the distributed word vector learning model are initialized, so that each word in every node's vocabulary copy starts training from a unified initial word vector.
It should be noted that the vocabulary copy on every node of the distributed word vector learning model is identical, namely the vocabulary generated in step 103.
In implementation, step 1051 can use either of two initialization modes:
one is to randomly initialize the word vectors on any one node and then synchronize the initialization vectors to every node in parallel; the other is to set the word vectors of every node in parallel to the zero vector as the initialization vector.
In implementation, the parallel synchronization can be realized through MPI (Message Passing Interface). MPI is a fairly general parallel programming interface that provides an efficient, scalable and unified parallel programming environment. Of course, other interfaces can also be used to realize the parallel synchronization between the nodes of the distributed word vector learning model.
Specifically, the MPI interface is used to synchronize the initialization word vectors to every node over a 10-gigabit network. Through the parallel synchronization of step 1051, all nodes are configured with identical initialization word vectors.
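Purely as an illustration of this initialization step (the mpi4py binding and the sizes below are assumptions; the patent names MPI only as one usable interface), one node can draw the random initialization and broadcast it to the others:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
VOCAB_SIZE, DIM = 100_000, 5   # illustrative sizes only

if comm.Get_rank() == 0:
    # Mode one: random initialization on a single node...
    init_vectors = np.random.uniform(-0.5, 0.5, (VOCAB_SIZE, DIM))
else:
    init_vectors = np.empty((VOCAB_SIZE, DIM))

# ...then parallel synchronization to every node, so all nodes
# start training from identical initialization word vectors.
comm.Bcast(init_vectors, root=0)
```

Run under e.g. `mpiexec -n 4`, every rank ends up holding the same initialization matrix.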
Step 1052: each node trains, for the words in the ordered word sets of its assigned corpora that match the vocabulary, the initialization vectors of those words, obtaining the trained word vectors of the current cycle.
Step 1053: according to the trained word vectors obtained by the nodes' training in the current cycle, synchronously update in parallel the initialization vector of each word for the current cycle as its initialization vector for the next cycle.
The nodes' initialization vectors are periodically updated in this way, thereby carrying out the periodic training. At the end of each training cycle, it is judged whether the preset decision condition is met, to determine whether to stop training.
Step 1054: judge whether the preset decision condition is met; if yes, enter step 1055; if no, enter step 1052.
Step 1055: obtain, from the initialization vectors of the words for the next cycle, the word vectors of the corresponding words in the vocabulary.
In implementation, each node takes the ordered word sets of the corpora it receives as training data; for each corpus, it traverses all the words in the corpus's ordered word set and trains only the words contained in the vocabulary. At the end of each cycle, the initial word vectors of the words in every node's vocabulary copy are synchronously updated, and all nodes begin the next cycle of training.
First, traversing all the words in an ordered word set and training only the vocabulary words means: traverse each word in the corpus's ordered word set in turn and match it against the vocabulary; if the current word matches an identical word in the vocabulary, train that word to obtain its corresponding word vector; if it does not, discard it and match the next word; continue until every word in the corpus's ordered word set has been matched.
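For illustration, a minimal sketch of this traversal-and-match rule; train_word is a hypothetical callback standing in for one word's vector update, which the patent leaves to the node's training algorithm:

```python
def train_on_corpus(word_set: list, vocabulary: set, train_word) -> None:
    """Traverse a corpus's ordered word set, training only vocabulary words."""
    for word in word_set:
        if word in vocabulary:   # matched: train this word's vector
            train_word(word)
        # unmatched words are discarded and the next word is matched
```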
Second, the parallel synchronization works as follows. Each node accumulates a gradient update Δw via the SGD (stochastic gradient descent) algorithm; for a given vocabulary word, this accumulated update Δw is the difference between the word vector obtained by the node's training of the word in the current cycle and the word's initial vector for the current cycle. The word's initialization vector for the next cycle is computed from the accumulated gradient updates of the word on all nodes, and this vector is then updated on every node in parallel as the word's initialization vector for the next cycle.
In implementation, the initial word vector w' for the next cycle can be computed according to the following formula 1:
w' = w - η · Σ_{i=1}^{N} Δw_i    (formula 1)
where w' is the initial vector of a given vocabulary word for the next cycle; w is the word's initialization vector for the current cycle; η is a predetermined coefficient; Δw_i is the word's accumulated gradient update on node i in the current cycle, obtainable as the difference between the word vector produced by that node's training in the current cycle and the word's initialization vector for the current cycle; and N is the number of nodes of the learning model.
The size of η determines the nodes' update rate; η is generally a value smaller than 1, for example 1/N or 1/2N. Preferably η = 1/N, in which case the initialization vector for the next cycle equals the current cycle's initialization vector minus the mean of the accumulated gradient updates of the N nodes.
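As a non-authoritative sketch of this synchronization (assuming an mpi4py interconnect, which the patent names only as one possible interface), the update of formula 1 with the preferred η = 1/N can be written as:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
N = comm.Get_size()
eta = 1.0 / N   # the preferred setting: averaging the nodes' updates

def synchronize(w_init: np.ndarray, w_trained: np.ndarray) -> np.ndarray:
    """Formula 1: w' = w - eta * sum_i(dw_i), with dw = trained - init per node."""
    local_delta = w_trained - w_init            # this node's accumulated update
    total_delta = np.empty_like(local_delta)
    comm.Allreduce(local_delta, total_delta, op=MPI.SUM)  # sum over all N nodes
    return w_init - eta * total_delta           # identical result on every node
```

Run across the nodes, every rank computes the same w' and proceeds to the next cycle from it; the sign of the delta follows the patent's definition of Δw as trained minus initial.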
The nodes periodically update the initialization vectors in this way, and after each update start the next cycle of training of the vocabulary words from the updated vectors, until all nodes meet the preset training condition.
The preset decision condition may be that the number of training iterations reaches a preset count; it may also be that the accumulated gradient updates of more than a threshold number of vocabulary words are all smaller than a preset update value. Of course, other decision conditions can be set; the purpose of the preset decision condition in the invention is to measure whether the training results of all nodes have converged and the training objective can be reached.
When the training results of all nodes meet the preset decision condition, all nodes of the distributed word vector learning model end training; then, from the trained word vectors of the last training cycle of all nodes, the initialization vectors of the next cycle are computed according to formula 1 above, and these computed vectors are taken as the word vectors of the vocabulary words.
In the invention, the initialization vectors trained by the nodes are updated through parallel synchronization, while the SGD process inside each node runs asynchronously; this update mode can be called semi-asynchronous. The semi-asynchronous distributed word vector learning model proposed by the invention can, while ensuring the convergence of the algorithm, reduce the network communication time cost that frequent synchronization would bring, thereby accelerating model training.
Beyond the learning and training of the vocabulary words described above, the invention makes fuller use of the corpora and also proposes training words of particular categories, such as numeric words, English words and person names, so as to obtain the corresponding word vectors; the word vectors of these categories of words can further optimize the search business.
Specifically, the invention provides an optional scheme that adds the following step to the method shown in Fig. 1:
configure each node in the distributed word vector learning model so that each node also trains, on its assigned corpora, the words in each corpus's ordered word set that do not match the vocabulary; synchronize the trained word vectors of the unmatched words obtained by the nodes and trigger the next training cycle, so that cyclic training yields the word vectors of the unmatched words, which are then saved into the vocabulary; wherein the unmatched words belong to preset categories.
In implementation, when a node of the distributed word vector learning model, while traversing the ordered word set of a corpus, finds that a word does not belong to the vocabulary but does belong to a preset category, it trains that word. In this way the words in the corpora are fully utilized, and training mines the word vectors of words valuable to the search business.
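As a hedged illustration only: the patent names numeric words, English words and person names as examples of preset categories without defining the tests, so the checks below (regexes plus a hypothetical name dictionary) are assumptions:

```python
import re

def in_preset_category(word: str, name_dict: set) -> bool:
    """Decide whether an out-of-vocabulary word still deserves training."""
    if re.fullmatch(r"\d+(\.\d+)?", word):   # numeric words
        return True
    if re.fullmatch(r"[A-Za-z]+", word):     # English words
        return True
    return word in name_dict                 # person names, via a dictionary
```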
It should be noted here that, although the nodes in the above examples adopt the SGD algorithm, in implementation each node may adopt SGD or be trained with other algorithms such as support vector machines, logistic regression or neural networks.
The implementation of the above method is illustrated below with a concrete example.
For example, a user query log contains "world's top ten comedies";
the vocabulary built from this query log includes: "world", "comedy", "top ten";
according to the method provided by the invention, a large-scale corpus is input into the distributed word vector learning model and trained, yielding the word vectors (5-dimensional real-valued vectors) corresponding to the vocabulary words:
"world" (0.004003, 0.004419, -0.003830, -0.003278, 0.001367)
"comedy" (-0.043665, 0.018578, 0.138403, 0.004431, -0.139117)
"top ten" (-0.337518, 0.224568, 0.018613, 0.222294, -0.057880)
In actual training, different vector dimensions can be set according to different demands; the dimension of 5 in the above example is only illustrative, and the implementation of the invention is not limited to this dimension.
It can be seen from the above embodiments that, in the word vector training method provided by the invention, first, internet web pages are crawled to acquire training corpora, which are saved in a corpus database; this way of building the corpus takes good advantage of internet web resources' high timeliness, representativeness, abundance and wide coverage, making it possible to obtain massive-scale corpora with wide coverage.
Then, word segmentation is performed on each training corpus to obtain the corresponding ordered word sets, and a vocabulary is built from pre-collected user query logs; the invention abandons the traditional approach of building the vocabulary from the corpus and instead builds it from user query logs. Since user query logs characterize users' actual search needs, a vocabulary built from the query words contained in them is well suited to search services.
Finally, the training corpora saved in the corpus database are distributed to the nodes of a distributed word vector learning model, which is configured to perform periodic word vector training on each word in the vocabulary, obtaining the corresponding word vectors. To solve the problem of slow training on large-scale corpora, the invention abandons the traditional single-machine multithreaded training and adopts a distributed word vector learning model, improving training speed through multi-node parallel training so that high-quality word vectors can be iterated quickly.
Corresponding to the above method, the invention also provides a corresponding device. Refer to Fig. 3, which is a structural diagram of a word vector training device provided by an embodiment of the invention. As shown in Fig. 3, the device may include: a corpus building unit 201, a word segmentation unit 202, a vocabulary building unit 203, a corpus distribution unit 204 and a first configuration unit 205; the connections and concrete functions of the units are explained below in combination with the working principle of the device.
The corpus building unit 201 is configured to crawl internet web pages, acquire training corpora and save them in a corpus database;
the word segmentation unit 202 is configured to perform word segmentation on each training corpus in the corpus database, obtaining the ordered word set corresponding to each training corpus;
the vocabulary building unit 203 is configured to build a vocabulary from pre-collected user query logs;
the corpus distribution unit 204 is configured to distribute the training corpora saved in the corpus database to the nodes of a distributed word vector learning model;
the first configuration unit 205 is configured to configure the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary; wherein the word vector training comprises: each node training, on its assigned corpora, the words in each corpus's ordered word set that match the vocabulary, and, after the trained word vectors of the vocabulary words obtained by the nodes are synchronized, triggering the next training cycle.
In implementation, the word segmentation unit 202 may include a segmentation processing subunit.
The segmentation processing subunit is configured to segment each training corpus using a segmentation tool and the pre-built segmentation dictionary; the segmentation dictionary is built from pre-collected user query logs and an input-method lexicon.
In implementation, the vocabulary building unit 203 may include a first extraction subunit and a first building subunit.
The first extraction subunit is configured to extract the words contained in the pre-collected user query logs and count the frequency of each word;
the first building subunit is configured to take the high-frequency words and build the vocabulary.
In implementation, the vocabulary building unit 203 may instead include a second extraction subunit, a merging subunit and a second building subunit.
The second extraction subunit is configured to extract the words contained in the pre-collected user query logs and count the frequency of each word;
the merging subunit is configured to merge the high-frequency words using the named-entity dictionary;
the second building subunit is configured to build the vocabulary from the merged high-frequency words.
In implementation, the first configuration unit 205 includes: a configuration sub-unit, a training sub-unit, a judgment sub-unit, a synchronized update sub-unit and a word vector computation sub-unit.
The configuration sub-unit is configured to set the word vector of each word in the word list corresponding to each node in the distributed word vector learning model as an initialization word vector;
The training sub-unit is configured so that each node trains the initialization word vectors of the words that appear in the ordered word sets corresponding to its assigned training corpora and that match the word list, obtaining the trained word vectors of those words for the current cycle;
The judgment sub-unit is configured to judge whether a preset decision condition is met; if so, the word vector computation sub-unit is executed; if not, the synchronized update sub-unit is executed;
The synchronized update sub-unit is configured to update in parallel, according to the trained word vectors obtained by each node in the current training cycle, the current-cycle initialization word vectors of the words, which then serve as the next-cycle initialization word vectors, after which control returns to the training sub-unit;
The word vector computation sub-unit is configured to obtain, according to the trained word vectors, the word vector of each corresponding word in the word list.
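Taken together, the five sub-units amount to the synchronous training loop sketched below. The helpers train_shard and synchronize are hypothetical stand-ins for the per-node training and the parallel update (both sketched further on), and the cycle budget is only one example of a preset decision condition:

```python
import numpy as np

def train_word_vectors(shards, word_list, dim=100, max_cycles=10):
    """One possible rendering of the periodic distributed training cycle."""
    # Configuration sub-unit: a shared initialization word vector per word.
    vectors = {w: np.random.uniform(-0.5, 0.5, dim) for w in word_list}
    for cycle in range(1, max_cycles + 1):
        # Training sub-unit: each node trains a private copy on its own shard.
        node_results = [
            train_shard(shard, {w: v.copy() for w, v in vectors.items()})
            for shard in shards
        ]
        # Synchronized update sub-unit: merge the per-node results in one step.
        vectors = synchronize(vectors, node_results)
        # Judgment sub-unit: the preset decision condition is a cycle budget here.
        if cycle == max_cycles:
            break
    # Word vector computation sub-unit: the synchronized vectors are the result.
    return vectors
```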
Optionally, the training sub-unit includes: a traversal matching sub-unit and a word vector training sub-unit.
The traversal matching sub-unit is configured to, for each assigned training corpus, traverse all the words in the ordered word set corresponding to the corpus and match each word against the word list;
The word vector training sub-unit is configured to, when the matching result of the traversal matching sub-unit is positive, train that word, obtaining the word vector corresponding to that word.
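A sketch of these two sub-units as a single per-node routine. The patent leaves the training objective itself open, so the update below (pulling a word's vector toward its in-window neighbors) is a deliberately simplified placeholder rather than the claimed method:

```python
def train_shard(shard, vectors, window=2, lr=0.025):
    """Train the word-list words found in one node's corpus shard.

    shard:   a list of ordered word sets, one per assigned training corpus.
    vectors: word -> current-cycle initialization vector (numpy arrays).
    """
    for word_set in shard:
        for i, word in enumerate(word_set):
            if word not in vectors:  # traversal matching: skip out-of-list words
                continue
            # Placeholder objective: move the word toward its in-window neighbors.
            lo, hi = max(0, i - window), min(len(word_set), i + window + 1)
            for j in range(lo, hi):
                context = word_set[j]
                if j != i and context in vectors:
                    vectors[word] += lr * (vectors[context] - vectors[word])
    return vectors
```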
Optionally, the synchronized update sub-unit is configured to carry out the synchronized update using the following formula:
w′ = w − η( Σ_{n=1}^{N} Δw_n );
Wherein w′ denotes the next-cycle initialization word vector of a given word in the word list corresponding to a given node; w denotes the current-cycle initialization word vector of that word in the node's word list; η denotes a preset coefficient; Δw_n is obtained by taking the difference between the word vector obtained by training that word at node n in the current cycle and the current-cycle initialization word vector of that word; and N is the number of nodes in the learning model.
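A direct numpy-style rendering of this update, assuming each node reports back its trained vectors so that Δw_n can be formed as the difference against the shared current-cycle vector; the value of η below is arbitrary, standing in for the preset coefficient:

```python
def synchronize(vectors, node_results, eta=0.5):
    """Parallel synchronized update: w' = w - eta * sum_n(delta_w_n).

    vectors:      word -> shared current-cycle initialization vector.
    node_results: per-node dicts of trained vectors (numpy arrays).
    """
    updated = {}
    for word, w in vectors.items():
        # delta_w_n: the node's trained vector minus the shared current vector.
        delta_sum = sum(result[word] - w for result in node_results)
        updated[word] = w - eta * delta_sum  # next-cycle initialization vector
    return updated
```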
Optionally, the device may also include:
A second configuration unit, configured to set up each node in the distributed word vector learning model so that each node, according to the training corpora assigned to it, performs word vector training on the words in the corresponding ordered word sets that do not match the word list; the trained word vectors of the unmatched words obtained by each node's training are synchronized and the next training cycle is triggered, so that each node cyclically trains the word vectors of the unmatched words, and the word vectors corresponding to the unmatched words are saved into the word list; wherein the unmatched words belong to preset categories.
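Under the same cycle, such out-of-list words only need to be admitted into the vector table before training. The sketch below assumes a hypothetical is_preset_category predicate, since the patent does not define the categories themselves:

```python
import numpy as np

def admit_unmatched_words(shards, vectors, is_preset_category, dim=100):
    """Give qualifying out-of-list words initial vectors so that the same
    cyclic training and synchronization can learn them too."""
    for shard in shards:
        for word_set in shard:
            for word in word_set:
                if word not in vectors and is_preset_category(word):
                    # New entry: random initialization, trained from the next cycle on;
                    # its learned vector is then folded into the word list.
                    vectors[word] = np.random.uniform(-0.5, 0.5, dim)
    return vectors
```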
Regarding the device in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the method embodiment, and will not be elaborated here.
In addition, the present invention also provides another word vector training device, explained below in conjunction with Fig. 4.
Fig. 4 is a hardware structure diagram of a word vector training device provided by an embodiment of the present invention. The device 300 shown in Fig. 4 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, etc.
Referring to Fig. 4, the device 300 may include one or more of the following components: a processing component 302, a memory 304, a power component 306, a multimedia component 308, an audio component 310, an input/output (I/O) interface 312, a sensor component 314, and a communication component 316.
The processing component 302 generally controls the overall operation of the device 300, such as operations associated with display, telephone calls, data communication, camera operation and recording operation. The processing component 302 may include one or more processors 320 to execute instructions, so as to complete all or part of the steps of the above method. In addition, the processing component 302 may include one or more modules to facilitate interaction between the processing component 302 and other components. For example, the processing component 302 may include a multimedia module to facilitate interaction between the multimedia component 308 and the processing component 302.
The memory 304 is configured to store various types of data to support the operation of the device 300. Examples of such data include instructions for any application or method operated on the device 300, contact data, phonebook data, messages, pictures, videos, etc. The memory 304 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
The power component 306 provides power to the various components of the device 300. The power component 306 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the device 300.
The multimedia component 308 includes a screen providing an output interface between the device 300 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe action. In some embodiments, the multimedia component 308 includes a front camera and/or a rear camera. When the device 300 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 310 is configured to output and/or input audio signals. For example, the audio component 310 includes a microphone (MIC); when the device 300 is in an operating mode, such as a call mode, a recording mode or a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 304 or sent via the communication component 316. In some embodiments, the audio component 310 also includes a speaker for outputting audio signals.
The I/O interface 312 provides an interface between the processing component 302 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, etc. These buttons may include but are not limited to: a home button, volume buttons, a start button and a lock button.
The sensor component 314 includes one or more sensors for providing state assessments of various aspects of the device 300. For example, the sensor component 314 may detect the open/closed state of the device 300 and the relative positioning of components, such as the display and keypad of the device 300; the sensor component 314 may also detect a change in position of the device 300 or a component of the device 300, the presence or absence of contact between the user and the device 300, the orientation or acceleration/deceleration of the device 300, and a temperature change of the device 300. The sensor component 314 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 314 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 316 is configured to facilitate wired or wireless communication between the device 300 and other devices. The device 300 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 316 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 316 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the device 300 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 304 including instructions, where the instructions can be executed by the processor 320 of the device 300 to perform the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
A non-transitory computer-readable storage medium, wherein when the instructions in the storage medium are executed by the processor of an electronic device, the electronic device is enabled to perform a word vector training method, the processor executing instructions for the following operations:
Crawl internet web pages, acquire training corpora, and save them in a corpus;
Perform word segmentation on each training corpus in the corpus, obtaining an ordered word set corresponding to each training corpus;
Build a word list according to pre-collected user query logs;
Distribute the training corpora saved in the corpus to the nodes of a distributed word vector learning model;
Configure the distributed word vector learning model to perform periodic word vector training on each word in the word list, obtaining the word vector corresponding to each word in the word list;
Wherein the word vector training includes: each node, according to the training corpora assigned to it, trains the words in the corresponding ordered word sets that match the word list, and after the trained word vectors of the words in each node's word list are synchronized, the next training cycle is triggered.
Fig. 5 is a structural diagram of a server in an embodiment of the present invention. The server 1900 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server(TM), Mac OS X(TM), Unix(TM), Linux(TM), FreeBSD(TM), etc.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the methods of the above embodiments can be implemented by software plus a general hardware platform. Based on this understanding, the technical solution of the present invention, or in other words the part that contributes over the prior art, can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to perform the methods described in the embodiments, or in some parts of the embodiments, of the present invention.
It should be noted that the embodiments in this specification are described in a progressive manner; identical or similar parts among the embodiments can be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the device and system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments. The device and system embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
Those skilled in the art, after considering the specification and practicing the invention disclosed herein, will readily conceive of other embodiments of the present invention. The present application is intended to cover any variations, uses or adaptations of the present invention that follow the general principles of the present invention and include common knowledge or conventional technical means in the art not disclosed in this disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (10)

1. A word vector training method, characterized in that the method comprises:
Crawling internet web pages, acquiring training corpora, and saving them in a corpus;
Performing word segmentation on each training corpus in the corpus, obtaining an ordered word set corresponding to each training corpus;
Building a word list according to pre-collected user query logs;
Distributing the training corpora saved in the corpus to the nodes of a distributed word vector learning model;
Configuring the distributed word vector learning model to perform periodic word vector training on each word in the word list, obtaining the word vector corresponding to each word in the word list;
Wherein the word vector training comprises: each node, according to the training corpora assigned to it, training the words in the corresponding ordered word sets that match the word list, and, after the trained word vectors of the words in each node's word list are synchronized, triggering the next training cycle.
2. The method according to claim 1, characterized in that performing word segmentation on each training corpus in the corpus comprises:
Performing word segmentation on each training corpus using a segmentation tool and a pre-built segmentation dictionary; the segmentation dictionary is built from the pre-collected user query logs and an input method lexicon.
3. The method according to claim 1, characterized in that building a word list according to the pre-collected user query logs comprises:
Extracting the words contained in the pre-collected user query logs, and counting the frequency of each word;
Obtaining the high-frequency words, and building and generating the word list from them.
4. The method according to claim 3, characterized in that after obtaining the high-frequency words, the method further comprises:
Merging the high-frequency words using a named entity dictionary, and then performing the step of building and generating the word list.
5. The method according to claim 1, characterized in that configuring the distributed word vector learning model to perform periodic word vector training on each word in the word list, obtaining the word vector corresponding to each word in the word list, comprises:
Step 1: configuring the word vector of each word in the word list corresponding to each node in the distributed word vector learning model as an initialization word vector;
Step 2: each node training the initialization word vectors of the words that appear in the ordered word sets corresponding to its assigned training corpora and that match the word list, obtaining the trained word vectors of those words for the current cycle;
Step 3: according to the trained word vectors obtained by each node in the current cycle, updating in parallel the current-cycle initialization word vectors of the words, which serve as the next-cycle initialization word vectors;
Step 4: judging whether a preset decision condition is met; if so, proceeding to step 5; if not, returning to step 2;
Step 5: obtaining, according to the next-cycle initialization word vectors of the words, the word vector of each corresponding word in the word list.
6. The method according to claim 5, characterized in that training the words that appear in the ordered word sets corresponding to the assigned training corpora and that match the word list comprises:
For each assigned training corpus, traversing all the words in the ordered word set corresponding to the corpus and matching each word against the word list; if the current word matches an identical word in the word list, training that word, obtaining the word vector corresponding to that word.
7. The method according to claim 5, characterized in that updating in parallel, according to the trained word vectors obtained by each node in the current cycle, the current-cycle initialization word vectors of the words as the next-cycle initialization word vectors comprises:
Carrying out the synchronized update using the following formula:
w′ = w − η( Σ_{n=1}^{N} Δw_n );
Wherein w′ denotes the next-cycle initialization word vector of a given word in the word list corresponding to a given node; w denotes the current-cycle initialization word vector of that word in the node's word list; η denotes a preset coefficient; Δw_n is obtained by taking the difference between the word vector obtained by training that word at node n in the current cycle and the current-cycle initialization word vector of that word; and N is the number of nodes in the learning model.
8. The method according to claim 1, characterized in that the method further comprises:
Configuring each node in the distributed word vector learning model so that each node, according to the training corpora assigned to it, performs word vector training on the words in the corresponding ordered word sets that do not match the word list; synchronizing the trained word vectors of the unmatched words obtained by each node's training and triggering the next training cycle, so that each node cyclically trains the word vectors of the unmatched words; and saving the word vectors corresponding to the unmatched words into the word list; wherein the unmatched words belong to preset categories.
9. A word vector training device, characterized in that the device comprises:
A corpus establishing unit, configured to crawl internet web pages, acquire training corpora, and save them in a corpus;
A word segmentation unit, configured to perform word segmentation on each training corpus in the corpus, obtaining an ordered word set corresponding to each training corpus;
A word list construction unit, configured to build a word list according to pre-collected user query logs;
A corpus distribution unit, configured to distribute the training corpora saved in the corpus to the nodes of a distributed word vector learning model;
A first configuration unit, configured to set up the distributed word vector learning model to perform periodic word vector training on each word in the word list, obtaining the word vector corresponding to each word in the word list; wherein the word vector training comprises: each node, according to the training corpora assigned to it, training the words in the corresponding ordered word sets that match the word list, and, after the trained word vectors of the words in each node's word list are synchronized, triggering the next training cycle.
10. A word vector training device, characterized in that it comprises a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs containing instructions for the following operations:
Crawling internet web pages, acquiring training corpora, and saving them in a corpus;
Performing word segmentation on each training corpus in the corpus, obtaining an ordered word set corresponding to each training corpus;
Building a word list according to pre-collected user query logs;
Distributing the training corpora saved in the corpus to the nodes of a distributed word vector learning model;
Configuring the distributed word vector learning model to perform periodic word vector training on each word in the word list, obtaining the word vector corresponding to each word in the word list;
Wherein the word vector training comprises: each node, according to the training corpora assigned to it, training the words in the corresponding ordered word sets that match the word list, and, after the trained word vectors of the words in each node's word list are synchronized, triggering the next training cycle.
CN201610179115.0A 2016-03-25 2016-03-25 A kind of training method and device of term vector Active CN105786782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610179115.0A CN105786782B (en) 2016-03-25 2016-03-25 A kind of training method and device of term vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610179115.0A CN105786782B (en) 2016-03-25 2016-03-25 A kind of training method and device of term vector

Publications (2)

Publication Number Publication Date
CN105786782A true CN105786782A (en) 2016-07-20
CN105786782B CN105786782B (en) 2018-10-19

Family

ID=56390898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610179115.0A Active CN105786782B (en) 2016-03-25 2016-03-25 A kind of training method and device of term vector

Country Status (1)

Country Link
CN (1) CN105786782B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020164A (en) * 2012-11-26 2013-04-03 华北电力大学 Semantic search method based on multi-semantic analysis and personalized sequencing
CN103218444A (en) * 2013-04-22 2013-07-24 中央民族大学 Method of Tibetan language webpage text classification based on semanteme
CN104462051A (en) * 2013-09-12 2015-03-25 腾讯科技(深圳)有限公司 Word segmentation method and device
US20150088875A1 (en) * 2009-04-15 2015-03-26 Lexisnexis, A Division Of Reed Elsevier Inc. System and Method For Ranking Search Results Within Citation Intensive Document Collections
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device
CN104699763A (en) * 2015-02-11 2015-06-10 中国科学院新疆理化技术研究所 Text similarity measuring system based on multi-feature fusion
CN104933183A (en) * 2015-07-03 2015-09-23 重庆邮电大学 Inquiring term rewriting method merging term vector model and naive Bayes
US20160070748A1 (en) * 2014-09-04 2016-03-10 Crimson Hexagon, Inc. Method and apparatus for improved searching of digital content

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108024005B (en) * 2016-11-04 2020-08-21 北京搜狗科技发展有限公司 Information processing method and device, intelligent terminal, server and system
CN108024005A (en) * 2016-11-04 2018-05-11 北京搜狗科技发展有限公司 Information processing method, device, intelligent terminal, server and system
CN106776534A (en) * 2016-11-11 2017-05-31 北京工商大学 The incremental learning method of term vector model
CN106776534B (en) * 2016-11-11 2020-02-11 北京工商大学 Incremental learning method of word vector model
CN110023930B (en) * 2016-11-29 2023-06-23 微软技术许可有限责任公司 Language data prediction using neural networks and online learning
CN110023930A (en) * 2016-11-29 2019-07-16 微软技术许可有限责任公司 It is predicted using neural network and the language data of on-line study
CN106874643A (en) * 2016-12-27 2017-06-20 中国科学院自动化研究所 Build the method and system that knowledge base realizes assisting in diagnosis and treatment automatically based on term vector
CN106874643B (en) * 2016-12-27 2020-02-28 中国科学院自动化研究所 Method and system for automatically constructing knowledge base to realize auxiliary diagnosis and treatment based on word vectors
CN108345580B (en) * 2017-01-22 2020-05-15 创新先进技术有限公司 Word vector processing method and device
US10878199B2 (en) 2017-01-22 2020-12-29 Advanced New Technologies Co., Ltd. Word vector processing for foreign languages
CN108345580A (en) * 2017-01-22 2018-07-31 阿里巴巴集团控股有限公司 A kind of term vector processing method and processing device
CN108628813A (en) * 2017-03-17 2018-10-09 北京搜狗科技发展有限公司 Treating method and apparatus, the device for processing
CN108628813B (en) * 2017-03-17 2022-09-23 北京搜狗科技发展有限公司 Processing method and device for processing
CN107239443A (en) * 2017-05-09 2017-10-10 清华大学 The training method and server of a kind of term vector learning model
CN107015969A (en) * 2017-05-19 2017-08-04 四川长虹电器股份有限公司 Can self-renewing semantic understanding System and method for
CN107577659A (en) * 2017-07-18 2018-01-12 阿里巴巴集团控股有限公司 Term vector processing method, device and electronic equipment
CN107577658A (en) * 2017-07-18 2018-01-12 阿里巴巴集团控股有限公司 Term vector processing method, device and electronic equipment
CN109388689A (en) * 2017-08-08 2019-02-26 中国电信股份有限公司 Word stock generating method and device
CN107451295B (en) * 2017-08-17 2020-06-30 四川长虹电器股份有限公司 Method for obtaining deep learning training data based on grammar network
CN107451295A (en) * 2017-08-17 2017-12-08 四川长虹电器股份有限公司 A kind of method that deep learning training data is obtained based on grammer networks
CN110019830B (en) * 2017-09-20 2022-09-23 腾讯科技(深圳)有限公司 Corpus processing method, corpus processing device, word vector obtaining method, word vector obtaining device, storage medium and equipment
CN110019830A (en) * 2017-09-20 2019-07-16 腾讯科技(深圳)有限公司 Corpus processing, term vector acquisition methods and device, storage medium and equipment
US10769383B2 (en) 2017-10-23 2020-09-08 Alibaba Group Holding Limited Cluster-based word vector processing method, device, and apparatus
WO2019080615A1 (en) * 2017-10-23 2019-05-02 阿里巴巴集团控股有限公司 Cluster-based word vector processing method, device, and apparatus
TWI721310B (en) * 2017-10-23 2021-03-11 開曼群島商創新先進技術有限公司 Cluster-based word vector processing method, device and equipment
CN109726386A (en) * 2017-10-30 2019-05-07 中国移动通信有限公司研究院 A kind of term vector model generating method, device and computer readable storage medium
CN109726386B (en) * 2017-10-30 2023-05-09 中国移动通信有限公司研究院 Word vector model generation method, device and computer readable storage medium
CN107766565A (en) * 2017-11-06 2018-03-06 广州杰赛科技股份有限公司 Conversational character differentiating method and system
CN108111478A (en) * 2017-11-07 2018-06-01 中国互联网络信息中心 A kind of phishing recognition methods and device based on semantic understanding
WO2019095836A1 (en) * 2017-11-14 2019-05-23 阿里巴巴集团控股有限公司 Method, device, and apparatus for word vector processing based on clusters
US10846483B2 (en) 2017-11-14 2020-11-24 Advanced New Technologies Co., Ltd. Method, device, and apparatus for word vector processing based on clusters
CN108170663A (en) * 2017-11-14 2018-06-15 阿里巴巴集团控股有限公司 Term vector processing method, device and equipment based on cluster
WO2019105134A1 (en) * 2017-11-30 2019-06-06 阿里巴巴集团控股有限公司 Word vector processing method, apparatus and device
CN108170667B (en) * 2017-11-30 2020-06-23 阿里巴巴集团控股有限公司 Word vector processing method, device and equipment
TWI701588B (en) * 2017-11-30 2020-08-11 香港商阿里巴巴集團服務有限公司 Word vector processing method, device and equipment
CN108170667A (en) * 2017-11-30 2018-06-15 阿里巴巴集团控股有限公司 Term vector processing method, device and equipment
CN108231146A (en) * 2017-12-01 2018-06-29 华南师范大学 A kind of medical records model building method, system and device based on deep learning
CN109933778B (en) * 2017-12-18 2024-03-05 北京京东尚科信息技术有限公司 Word segmentation method, word segmentation device and computer readable storage medium
CN109933778A (en) * 2017-12-18 2019-06-25 北京京东尚科信息技术有限公司 Segmenting method, device and computer readable storage medium
CN110197188A (en) * 2018-02-26 2019-09-03 北京京东尚科信息技术有限公司 Method, system, equipment and the storage medium of business scenario prediction, classification
CN108520018A (en) * 2018-03-22 2018-09-11 大连理工大学 A kind of literary works creation age determination method based on term vector
CN108509422B (en) * 2018-04-04 2020-01-24 广州荔支网络技术有限公司 Incremental learning method and device for word vectors and electronic equipment
CN108509422A (en) * 2018-04-04 2018-09-07 广州荔支网络技术有限公司 A kind of Increment Learning Algorithm of term vector, device and electronic equipment
CN110633352A (en) * 2018-06-01 2019-12-31 北京嘀嘀无限科技发展有限公司 Semantic retrieval method and device
CN109543175A (en) * 2018-10-11 2019-03-29 北京诺道认知医学科技有限公司 A kind of method and device for searching synonym
CN109587019A (en) * 2018-12-12 2019-04-05 珠海格力电器股份有限公司 A kind of sound control method of household appliance, device, storage medium and system
CN110266675A (en) * 2019-06-12 2019-09-20 成都积微物联集团股份有限公司 A kind of xss attack automated detection method based on deep learning
CN110191005A (en) * 2019-06-25 2019-08-30 北京九章云极科技有限公司 A kind of alarm log processing method and system
CN113961664A (en) * 2020-07-15 2022-01-21 上海乐言信息科技有限公司 Deep learning-based numerical word processing method, system, terminal and medium
CN112256517B (en) * 2020-08-28 2022-07-08 苏州浪潮智能科技有限公司 Log analysis method and device of virtualization platform based on LSTM-DSSM
CN112256517A (en) * 2020-08-28 2021-01-22 苏州浪潮智能科技有限公司 Log analysis method and device of virtualization platform based on LSTM-DSSM

Also Published As

Publication number Publication date
CN105786782B (en) 2018-10-19

Similar Documents

Publication Publication Date Title
CN105786782A (en) Word vector training method and device
CN104090890B (en) Keyword similarity acquisition methods, device and server
CN110175223A (en) A kind of method and device that problem of implementation generates
KR20160124182A (en) Method and apparatus for grouping contacts
CN107729815A (en) Image processing method, device, mobile terminal and computer-readable recording medium
CN111914113A (en) Image retrieval method and related device
WO2018204076A1 (en) Personalized user-categorized recommendations
CN108304376B (en) Text vector determination method and device, storage medium and electronic device
CN111738010B (en) Method and device for generating semantic matching model
CN115114395B (en) Content retrieval and model training method and device, electronic equipment and storage medium
US20170212886A1 (en) Configurable Generic Language Understanding Models
CN110275962B (en) Method and apparatus for outputting information
CN107291772A (en) One kind search access method, device and electronic equipment
CN111428522B (en) Translation corpus generation method, device, computer equipment and storage medium
WO2023197872A1 (en) Book searching method and apparatus, and device and storage medium
WO2019101099A1 (en) Video program identification method and device, terminal, system, and storage medium
CN110032616A (en) A kind of acquisition method and device of document reading conditions
CN105631404A (en) Method and device for clustering pictures
CN112995757B (en) Video clipping method and device
CN112862021B (en) Content labeling method and related device
CN109871128B (en) Question type identification method and device
CN107707759A (en) Terminal control method, device and system, storage medium
CN111723273A (en) Smart cloud retrieval system and method
CN110968246A (en) Intelligent Chinese handwriting input recognition method and device
CN104376030B (en) The method and apparatus of browser bookmark intelligent packet

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20170821

Address after: 100084. Room 9, floor 02, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant after: Beijing Sogou Information Service Co., Ltd.

Address before: 100084 Beijing, Zhongguancun East Road, building 1, No. 9, Sohu cyber building, room 9, room, room 01

Applicant before: Sogo Science-Technology Development Co., Ltd., Beijing

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant