CN105786782B - Word vector training method and apparatus - Google Patents

Word vector training method and apparatus

Info

Publication number
CN105786782B
CN105786782B (grant) · CN201610179115.0A (application) · CN105786782A (publication)
Authority
CN
China
Prior art keywords
word
word vector
training
vocabulary
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610179115.0A
Other languages
Chinese (zh)
Other versions
CN105786782A (en)
Inventor
邢宁
刘明荣
许静芳
常晓夫
王晓伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Information Service Co Ltd
Original Assignee
Beijing Sogou Information Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Information Service Co Ltd filed Critical Beijing Sogou Information Service Co Ltd
Priority to CN201610179115.0A
Publication of CN105786782A
Application granted
Publication of CN105786782B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a word vector training method and apparatus. The method includes: crawling internet web pages to obtain training corpora, which are stored in a corpus; performing word segmentation on each training corpus in the corpus to obtain the ordered word set corresponding to each training corpus; building a vocabulary from user query logs collected in advance; distributing the training corpora stored in the corpus to the nodes of a distributed word vector learning model; and configuring the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary. With the method and apparatus of the present invention, the trained word vectors are well suited to search services, and high-quality word vectors can be trained with fast iteration.

Description

Word vector training method and apparatus
Technical field
The present invention relates to the field of internet technology, and in particular to a word vector training method and apparatus.
Background
In internet applications, an important problem is how to convert natural language into a data representation that a computer can understand, and a key step in solving it is finding a method to turn natural language symbols into numeric data. The most widely used approach at present is deep learning (Deep Learning, DL), in which the "distributed representation" method represents each word as a low-dimensional real-valued vector; that vector is the word's word vector. Word vectors were created for exactly this purpose: a word vector expresses a word of natural language as a vector, in a form suitable for internet applications. For example, word vectors are used in much natural language processing (NLP) related work, such as clustering and semantic analysis.
At present, people use the Word2vec tool in a single-machine mode, applying DL methods to train on the collected corpora and thereby obtain, for each word in a vocabulary, a vector in the corresponding vector space. Because this training method runs on a single machine, its training speed is low, and it is especially ill-suited to business scenarios with very large data volumes. In addition, it is a general-purpose training method that does not take the particularities of a specific business scenario into account, so its training results in such scenarios are poor.
Summary of the invention
To solve the above technical problem, the present invention provides a word vector training method and apparatus, so that the trained word vectors are well suited to search services, and high-quality word vectors can be trained with fast iteration.
The embodiments of the invention disclose the following technical solutions:
A first aspect of the present invention provides a word vector training method, the method including:
crawling internet web pages to obtain training corpora, which are stored in a corpus;
performing word segmentation on each training corpus in the corpus to obtain the ordered word set corresponding to each training corpus;
building a vocabulary from user query logs collected in advance;
distributing the training corpora stored in the corpus to the nodes of a distributed word vector learning model;
configuring the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary;
wherein the word vector training includes: each node, according to the training corpora assigned to it, training those words in the ordered word sets of its corpora that match the vocabulary; and, after the trained word vectors of the words in the vocabulary are synchronized across the nodes, triggering the next training cycle.
Optionally, performing word segmentation on each training corpus in the corpus includes:
performing word segmentation on each training corpus using a segmentation tool and a pre-built segmentation dictionary, the segmentation dictionary being built from user query logs and an input-method lexicon collected in advance.
Optionally, building the vocabulary from the user query logs collected in advance includes:
extracting the words contained in the user query logs collected in advance, and counting the frequency of each word;
obtaining the high-frequency words, and building the vocabulary from them.
Optionally, after obtaining the high-frequency words, the method further includes:
merging the high-frequency words using a named-entity dictionary, and then executing the step of building the vocabulary.
Optionally, each node training, according to its assigned training corpora, the words in the corresponding ordered word sets that match the vocabulary, and triggering the next training cycle after the trained word vectors of the words in the vocabulary are synchronized across the nodes, includes:
Step 1: configuring the word vector of each word in the vocabulary at each node of the distributed word vector learning model as an initialization word vector;
Step 2: each node training, for the words in the ordered word sets of its assigned training corpora that match the vocabulary, the initialization word vectors of those words, obtaining the trained word vectors of the words for the current cycle;
Step 3: judging whether a preset decision condition is met; if so, going to step 5; if not, going to step 4;
Step 4: according to the trained word vectors obtained by each node in the current cycle, updating in parallel and synchronously the current cycle's initialization word vector of each word, taking the result as the word's initialization word vector for the next cycle, and going to step 2;
Step 5: obtaining the word vector of each word in the vocabulary from the trained word vectors.
Optionally, training the words in the ordered word sets of the assigned training corpora that match the vocabulary includes:
for each assigned training corpus, traversing all the words in its ordered word set and matching each word against the vocabulary; if the current word matches a word in the vocabulary, training that word to obtain its word vector.
Optionally, updating in parallel and synchronously, according to the word vectors trained by each node in the current cycle, the current cycle's initialization word vector of each word in the vocabulary, as the initialization word vector of each word for the next cycle, includes:
realizing the synchronized update with the following formula:
w' = w - η · Σ_{i=1}^{N} Δw_i
where w' is the next cycle's initialization word vector of a word in a node's vocabulary; w is the current cycle's initialization word vector of that word; η is a predetermined coefficient; Δw_i is the word's accumulated gradient update at node i in the current cycle, obtained as the difference between the word's current-cycle initialization word vector and the word vector trained at that node; and N is the number of nodes in the learning model.
Optionally, the method further includes:
configuring each node in the distributed word vector learning model so that each node, according to its assigned training corpora, performs word vector training on the words in the corresponding ordered word sets that do not match the vocabulary; after the trained word vectors of the unmatched words are synchronized across the nodes, the next training cycle is triggered; the nodes thus cyclically train to obtain the word vectors of the unmatched words, and the word vectors of the unmatched words are saved into the vocabulary; wherein the unmatched words belong to preset categories.
A second aspect of the present invention provides a word vector training apparatus, the apparatus including:
a corpus building unit, configured to crawl internet web pages to obtain training corpora, which are stored in a corpus;
a segmentation unit, configured to perform word segmentation on each training corpus in the corpus to obtain the ordered word set corresponding to each training corpus;
a vocabulary building unit, configured to build a vocabulary from user query logs collected in advance;
a corpus distribution unit, configured to distribute the training corpora stored in the corpus to the nodes of a distributed word vector learning model;
a first configuration unit, configured to configure the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary; wherein the word vector training includes: each node, according to the training corpora assigned to it, training those words in the ordered word sets of its corpora that match the vocabulary; and, after the trained word vectors of the words in the vocabulary are synchronized across the nodes, triggering the next training cycle.
A third aspect of the present invention provides a word vector training apparatus, the apparatus including a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for:
crawling internet web pages to obtain training corpora, which are stored in a corpus;
performing word segmentation on each training corpus in the corpus to obtain the ordered word set corresponding to each training corpus;
building a vocabulary from user query logs collected in advance;
distributing the training corpora stored in the corpus to the nodes of a distributed word vector learning model;
configuring the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary;
wherein the word vector training includes: each node, according to the training corpora assigned to it, training those words in the ordered word sets of its corpora that match the vocabulary; and, after the trained word vectors of the words in the vocabulary are synchronized across the nodes, triggering the next training cycle.
Optionally, the processor is further configured to execute the one or more programs including instructions for:
performing word segmentation on each training corpus using a segmentation tool and a pre-built segmentation dictionary, the segmentation dictionary being built from user query logs and an input-method lexicon collected in advance.
Optionally, the processor is further configured to execute the one or more programs including instructions for:
extracting the words contained in the user query logs collected in advance, and counting the frequency of each word;
obtaining the high-frequency words, and building the vocabulary from them.
Optionally, the processor is further configured to execute the one or more programs including instructions for:
merging the high-frequency words using a named-entity dictionary, and then executing the instruction of building the vocabulary.
Optionally, the processor is further configured to execute the one or more programs including instructions for:
Instruction 1: configuring the word vector of each word in the vocabulary at each node of the distributed word vector learning model as an initialization word vector;
Instruction 2: each node training, for the words in the ordered word sets of its assigned training corpora that match the vocabulary, the initialization word vectors of those words, obtaining the trained word vectors of the words for the current cycle;
Instruction 3: judging whether a preset decision condition is met; if so, going to instruction 5; if not, going to instruction 4;
Instruction 4: according to the trained word vectors obtained by each node in the current cycle, updating in parallel and synchronously the current cycle's initialization word vector of each word, taking the result as the word's initialization word vector for the next cycle, and going to instruction 2;
Instruction 5: obtaining the word vector of each word in the vocabulary from the trained word vectors.
Optionally, the processor is further configured to execute the one or more programs including instructions for:
for each assigned training corpus, traversing all the words in its ordered word set and matching each word against the vocabulary; if the current word matches a word in the vocabulary, training that word to obtain its word vector.
Optionally, the processor is further configured to execute the one or more programs including instructions for:
realizing the synchronized update with the following formula:
w' = w - η · Σ_{i=1}^{N} Δw_i
where w' is the next cycle's initialization word vector of a word in a node's vocabulary; w is the current cycle's initialization word vector of that word; η is a predetermined coefficient; Δw_i is the word's accumulated gradient update at node i in the current cycle, obtained as the difference between the word's current-cycle initialization word vector and the word vector trained at that node; and N is the number of nodes in the learning model.
Optionally, the processor is further configured to execute the one or more programs including instructions for:
configuring each node in the distributed word vector learning model so that each node, according to its assigned training corpora, performs periodic training on the words in the corresponding ordered word sets that do not match the vocabulary, obtaining the trained word vectors of the unmatched words, wherein the unmatched words belong to preset categories;
synchronizing the trained word vectors of the unmatched words across the nodes, obtaining the word vectors corresponding to the unmatched words, and saving them into the vocabulary.
Compared with the prior art, the technical solution provided by the present invention has the following advantageous effects:
The above technical solution first crawls internet web pages to obtain training corpora, which are stored in a corpus. This way of building the corpus makes good use of the high timeliness, strong representativeness, rich resources and wide coverage of internet web page resources, so that massive corpora with wide coverage can be obtained.
Then, word segmentation is performed on each training corpus in the corpus to obtain the ordered word set corresponding to each training corpus, and a vocabulary is built from user query logs collected in advance. The present invention abandons the traditional way of building a vocabulary from the training corpora and proposes building it from user query logs; because user query logs reflect users' actual search needs, a vocabulary built from the query words they contain adapts well to search services.
Finally, the present invention distributes the training corpora stored in the corpus to the nodes of a distributed word vector learning model and configures the model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word. To solve the problem of slow training on large-scale corpora, the present invention abandons the traditional single-machine multi-threaded training mode and adopts a distributed word vector learning model, training in parallel on multiple nodes; this raises the training speed, so high-quality word vectors can be iterated quickly.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a word vector training method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of distributed word vector training provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a word vector training apparatus provided by an embodiment of the present invention;
Fig. 4 is a hardware structural diagram of a word vector training apparatus provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a server provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The present invention provides a word vector training method and apparatus. Having analyzed the factors in the specific application background of word vectors, the present invention proposes the technical idea of building a dedicated vocabulary from user query logs, so that the trained word vectors are well suited to search services. Moreover, the present invention abandons the traditional single-machine multi-threaded training mode and proposes a distributed word vector learning method, which makes it possible to train high-quality word vectors with fast iteration.
Referring to Fig. 1, Fig. 1 is a flow chart of a word vector training method provided by an embodiment of the present invention. As shown in Fig. 1, the method includes steps 101 to 105.
Step 101: crawl internet web pages to obtain training corpora, which are stored in a corpus.
Specifically, internet web pages are crawled, and the content of each captured web page is stored in the corpus as one training corpus.
A corpus text is linguistic material that has actually occurred in real language use. Such material is usually stored in a corpus, a database that carries linguistic material with an electronic computer as its medium; raw material generally needs to be processed (analyzed and cleaned) before it becomes a useful resource.
At present there are four types of Chinese corpora: the general Modern Chinese corpus, the tagged People's Daily corpus, the Modern Chinese corpus for language teaching and research, and the Modern Chinese corpus for speech signal processing. When people need linguistic material, they therefore usually obtain it directly from these well-established corpora.
However, the content of these corpora is relatively fixed and updated slowly, while the openness and novelty of the internet mean that the linguistic material produced in this domain grows by multiples every day. If material is still obtained only from these existing corpora, the amount acquired is small, its coverage is narrow, and it cannot reflect the actual use of language on the internet.
For this reason, so that the acquired material better fits the internet domain, and especially search engines in the search domain, the embodiments of the present invention obtain training corpora by crawling internet web pages.
More specifically, the present invention also provides the following implementation:
using a search engine to collect internet news web pages, online community web pages and/or blog web pages, and taking the captured web page content as training corpora.
Because internet news pages, community pages and blog pages have all passed some form of credibility certification, the information they carry is relatively trustworthy; obtaining training corpora directly from such pages therefore improves the quality of the training corpora.
Of course, the implementation of the present invention is not limited to news pages, community pages and blog pages; other credibility-certified pages, such as popular-science pages and pages of paper repositories, may also be used. Furthermore, to enlarge the training corpora, material may also be obtained from the well-established corpora mentioned above. This way of acquiring corpora makes good use of the high timeliness, strong representativeness, rich resources and wide coverage of internet web page resources, so that massive training corpora with wide coverage can be obtained.
Step 102: perform word segmentation on each training corpus in the corpus to obtain the ordered word set corresponding to each training corpus, where an ordered word set is a set of sequentially ordered words.
In the embodiments of the present invention, the training corpora captured from web pages are generally sentences or articles, while word vector training takes words as its training data. Therefore, after the training corpora are acquired, word segmentation must be performed on them to obtain the set of ordered words corresponding to each corpus. Specifically, suppose a training corpus is an article consisting of at least one sentence: word segmentation is performed on each sentence in turn, dividing each sentence into a string of ordered words, and the resulting words are then arranged according to the order of the sentences in the original article. For example, if a training corpus reads "I love Beijing. Beijing is the political, economic and cultural center of China", the ordered word set obtained after segmentation could be "I / love / Beijing / Beijing / is / China / 's / political / economic / cultural / center".
Word segmentation depends primarily on a segmentation dictionary, and the quality of the segmentation dictionary directly determines the quality of segmentation. A segmentation dictionary is also called a word-segmentation lexicon; for ease of description, the term segmentation dictionary is used herein. The segmentation dictionaries commonly used at present are built from the Xinhua Dictionary or from other similar published books. But on the fast-developing Chinese internet, new words and new things appear every day, and these dictionaries cannot include newly coined internet words in time; if they are used directly to segment material obtained from the internet, the segmentation results are poor.
For this reason, the present invention constructs a segmentation dictionary especially suited to internet scenarios, built mainly from the words in user query logs and an input-method lexicon. It can be understood that whenever users use the internet, user query logs are generated; query words are thus recorded in user query logs almost every day, every minute, even every second. Meanwhile, an input method comes with its own input-method lexicon recording common words, and the input method also periodically updates this lexicon with new words collected during user input. Both user query logs and input-method lexicons are therefore updated in step with users' real network behavior; a segmentation dictionary built from them keeps pace with the development of the internet, reflects actual language use on the network, and adapts well to internet application scenarios.
Specifically, step 102 can be implemented as follows:
using a segmentation tool and the pre-built segmentation dictionary, performing word segmentation on each training corpus to obtain the ordered word set corresponding to it, where the ordered word set is a set of sequentially ordered words, and the segmentation dictionary is built from the user query logs and input-method lexicon collected in advance.
In the embodiments of the present invention, one training corpus corresponds to one ordered word set after segmentation. An ordered word set is the set of words, with a fixed order relation, obtained by segmenting the text of a training corpus according to its word order. For example, if the training corpus is a blog article, segmentation is performed on its text following the order of the paragraphs, the order of the sentences in each paragraph, and the order of the words in each sentence, yielding the set of ordered words. For instance, if the text of a training corpus is "I love Beijing Tiananmen", segmenting it according to its word order gives the corpus's ordered word set: (I / love / Beijing / Tiananmen).
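As an illustration only (not part of the claimed method), the following is a minimal Python sketch of this segmentation step, assuming the open-source jieba tokenizer as the segmentation tool; the file name user_dict.txt, standing in for a dictionary assembled from query logs and an input-method lexicon, is hypothetical:

```python
import jieba  # a widely used open-source Chinese segmentation tool

# Hypothetical segmentation dictionary built from user query logs and an
# input-method lexicon: one word per line, optionally followed by a frequency.
jieba.load_userdict("user_dict.txt")

def to_ordered_word_set(training_corpus):
    """Segment one training corpus into its ordered word set,
    preserving the word order of the original text."""
    return jieba.lcut(training_corpus)
```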
Step 103: build a vocabulary from the user query logs collected in advance.
The traditional way to build a vocabulary is to select some words from the training corpora and build the vocabulary from the selected words; such vocabularies are generic, not representative, and cannot be adapted specifically to search scenarios. Considering the application scenario of the word vectors, and so that they can be adapted specifically to search scenarios (such as a search engine), the present invention proposes building the vocabulary from user query logs. Compared with the generic vocabularies of the prior art, the present invention extracts from user query logs a dedicated vocabulary that covers most search needs, and the word vectors trained on this vocabulary better meet the demands of the search scenario.
Furthermore, considering that the number of Chinese words in the internet era is of massive scale, no training can cover all words, and, from the standpoint of training time cost, there is no need to. Therefore, to build a vocabulary of appropriate size that still covers most query needs, the present invention proposes the following construction: extract the words contained in the user query logs collected in advance, count the frequency of each word, obtain the high-frequency words, and build the vocabulary from them.
Obtaining the high-frequency words may specifically include: filtering out the words whose frequency is below a preset threshold, and taking the remaining words as the high-frequency words.
With this construction, based on the user query logs and the preset threshold, a vocabulary of appropriate size can be built, and the words selected by frequency can cover most query needs. In this way the present invention appropriately reduces the amount of training data while guaranteeing training quality.
On top of this construction, the present invention further considers that, while generating the vocabulary, a proper noun may be split into several words before being added. For example, the place name "Mudanjiang" may be segmented into "Mudan" and "jiang" ("peony" and "river"), or the organization name "iqiyi.com" into "i" and "qiyi". To solve the problem of proper nouns being broken by segmentation, the present invention also provides a preferred scheme for building the vocabulary: after obtaining the high-frequency words and before building the vocabulary, the method may further include merging the high-frequency words using a named-entity dictionary, and then executing the step of building the vocabulary.
A named-entity dictionary contains various proper nouns, such as person names, place names and organization names, i.e. words that describe the names of entities.
Entity merging is performed on the words using the named-entity dictionary. For example, for the segmented fragments "i" and "qiyi", the entity noun "iqiyi.com" is used to merge the two fragments, so that the words in the vocabulary are truer and more accurate.
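A minimal sketch of this vocabulary-building step, an illustration under assumptions rather than the patented implementation: word frequencies are counted over segmented query logs, words below a threshold are filtered out, and adjacent fragments whose concatenation appears in a named-entity dictionary are merged. The default threshold and the restriction to two-word merges are simplifications:

```python
from collections import Counter

def build_vocabulary(segmented_queries, entity_dict, min_freq=100):
    """segmented_queries: list of queries, each already segmented into words.
    entity_dict: set of named-entity strings (person/place/organization names)."""
    freq = Counter(word for query in segmented_queries for word in query)
    # Filter out words whose frequency is below the preset threshold.
    vocab = {word for word, count in freq.items() if count >= min_freq}

    # Merge adjacent fragments that segmentation split apart when their
    # concatenation is a known named entity (e.g. "Mudan" + "jiang").
    for query in segmented_queries:
        for left, right in zip(query, query[1:]):
            entity = left + right
            if entity in entity_dict:
                vocab.discard(left)
                vocab.discard(right)
                vocab.add(entity)
    return vocab
```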
The training data required for model training is obtained through steps 101 to 103; step 104 is then executed.
Step 104: distribute the training corpora stored in the corpus to the nodes of the distributed word vector learning model.
Existing conventional training methods usually use a single-machine mode, but in the training method of the present invention the volume of training corpora used for word vector training is very large, and a conventional single-machine mode cannot meet the training demand. The present invention therefore proposes a distributed word vector learning model, using distributed computing to raise the training speed and satisfy fast iteration of model training.
Step 104 mainly allocates training data to the nodes of the distributed word vector learning model, so that all the training data is shared among the nodes. Specifically, the training corpora in the corpus are distributed to the nodes of the distributed word vector learning model. The corpora may be distributed evenly, so that each node in the learning model bears an equal share of the training data, or they may be distributed randomly, so that each node bears a roughly comparable share. Fig. 2 shows a schematic diagram of distributed word vector training provided by an embodiment of the present invention; the distributed word vector learning model includes N nodes, where each node may be an independent device, such as a computer, that performs model training.
In implementation, step 104 may treat the ordered word set corresponding to each training corpus as an independent piece of training data and randomly assign the training corpora in the corpus to the nodes of the distributed word vector learning model. For example, suppose the model includes 3 nodes and the current corpus contains about 30,000 training corpora, each with its ordered word set obtained by segmentation; all or part of the corpora are then randomly distributed to the 3 nodes for processing, which in practice means that the ordered word sets are randomly input to the nodes. Each node then learns from the ordered word sets assigned to it as its training data. The training data may be distributed evenly, or allocated adaptively to the nodes according to the actual situation.
Alternatively, step 104 may distribute the training data according to a preset allocation rule, for example a rule that distributes all the corpora to all the nodes in their original order, so that the amounts of training data assigned to all nodes are nearly the same. The specific allocation rule can be set according to actual needs and is not limited to the above examples.
The main purpose of step 104 is to allocate the training corpora of the acquired corpus appropriately to the nodes of the distributed word vector learning model, so that the multiple nodes of the model work in parallel and jointly complete the training over all the training corpora.
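A sketch of both allocation rules mentioned above (illustrative only; a production system would distribute corpora over a network rather than into in-memory lists):

```python
import random

def distribute_randomly(ordered_word_sets, num_nodes):
    """Randomly assign each training corpus (one ordered word set) to a node."""
    assignments = [[] for _ in range(num_nodes)]
    for word_set in ordered_word_sets:
        assignments[random.randrange(num_nodes)].append(word_set)
    return assignments

def distribute_in_order(ordered_word_sets, num_nodes):
    """Preset rule: deal corpora out in their original order, so every
    node receives an almost equal share of the training data."""
    assignments = [[] for _ in range(num_nodes)]
    for i, word_set in enumerate(ordered_word_sets):
        assignments[i % num_nodes].append(word_set)
    return assignments
```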
Step 105: configure the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary;
wherein the word vector training includes: each node, according to the training corpora assigned to it, training those words in the ordered word sets of its corpora that match the vocabulary; and, after the trained word vectors of the words in the vocabulary are synchronized across the nodes, triggering the next training cycle.
Specifically, when the distributed word vector learning model starts, an initial configuration is performed first, setting an initialization word vector for each word of the vocabulary at each node. At the start of training, the initialization word vectors of the vocabulary words are identical across all nodes. Each node then trains the words of the vocabulary starting from these initialization word vectors, obtaining trained word vectors; the trained word vectors of the vocabulary words are then synchronized across the nodes, and the next cycle of training is carried out, until training yields the word vector corresponding to each word in the vocabulary.
The synchronization process is as follows: for each word in the vocabulary, compute the difference between the word vector obtained after training at each node and the initialization word vector, and from the per-node differences compute the word's vector adjustment value (generally the average of the differences across all nodes); for each word, adjust its initialization vector by its adjustment value, and take the adjusted vector as the initial value for the word's next training cycle. Thus in each subsequent cycle, every node trains starting from the adjusted vectors of the vocabulary words obtained after the previous cycle.
In implementation, step 105 can be realized in the following way, which includes steps 1051 to 1055.
Step 1051: configure the word vector of each word in the vocabulary at each node of the distributed word vector learning model as an initialization word vector.
The word vectors of the vocabulary words at each node of the model are initialized: the word vector of each word in each node's vocabulary is configured as the initialization word vector, so that all nodes start training every word of the vocabulary from the same initial word vector.
It should be noted that the vocabulary is identical at every node of the distributed word vector learning model; it is the vocabulary generated in step 103.
In implementation, step 1051 can use either of two initialization modes: one is to randomly initialize the word vectors on any one node and then synchronize the initialization word vectors to every node in parallel; the other is to initialize the word vectors at every node of the model to the zero vector, synchronized in parallel.
In implementation, the parallel synchronization can be realized with an MPI (Message Passing Interface) interface. MPI is a fairly general parallel programming interface that provides an efficient, scalable, unified parallel programming environment. Of course, the present invention may also use other interfaces to realize the parallel synchronization among the nodes of the distributed word vector learning model.
Specifically, the initialization word vectors are synchronized to every node over a 10-gigabit network using MPI. Through the parallel synchronization of step 1051, all nodes are configured with identical initialization word vectors.
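A minimal mpi4py sketch of this parallel-synchronous initialization; the library choice, vector dimension and example vocabulary are illustrative assumptions, not mandated by the method:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

DIM = 100                                 # illustrative vector dimension
vocab = ["world", "comedy", "top ten"]    # the shared vocabulary from step 103

if rank == 0:
    # Mode one: random initialization on a single node...
    init_vectors = {w: np.random.uniform(-0.5, 0.5, DIM) for w in vocab}
else:
    init_vectors = None
# ...then broadcast in parallel so every node starts from identical vectors.
# (Mode two would simply initialize zero vectors everywhere instead.)
init_vectors = comm.bcast(init_vectors, root=0)
```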
Step 1052: each node trains, for the words in the ordered word sets of its assigned training corpora that match the vocabulary, the initialization word vectors of those words, obtaining the trained word vectors of the words for the current cycle.
Step 1053: according to the trained word vectors obtained by each node in the current cycle, update in parallel and synchronously the current cycle's initialization word vector of each word, as the word's initialization word vector for the next cycle.
The nodes' initialization word vectors are thus updated periodically, realizing periodic training. At the end of each cycle, whether a preset decision condition is met is judged, to decide whether to stop training.
Step 1054: judge whether the preset decision condition is met; if so, go to step 1055; if not, go to step 1052.
Step 1055: obtain the word vector of each word in the vocabulary from the next-cycle initialization word vectors of the words.
In implementation, each node takes the ordered word sets of its received training corpora as its training data; for each training corpus it traverses all the words in the corpus's ordered word set and trains only the words contained in the vocabulary. At the end of each cycle, the initial word vectors of the vocabulary words at every node are synchronously updated, and all nodes begin the next cycle of training.
First, traversing all the words of an ordered word set and training only the words contained in the vocabulary specifically means: traverse each word in the ordered word set corresponding to a training corpus and match it against the vocabulary; if the current word matches a word in the vocabulary, train that word and obtain its word vector; if it does not match any word in the vocabulary, discard it and move on to matching the next word; continue until every word in the corpus's ordered word set has been matched.
Second, the parallel synchronization works as follows: each node accumulates a gradient update amount Δw computed by the SGD (Stochastic Gradient Descent) algorithm; for a given word of a node's vocabulary, this amount is the difference between the word's initial word vector for the current cycle and the word vector trained in the current cycle. From the per-node accumulated gradient updates of the word, the word's initialization word vector for the next cycle is computed; the word's initialization word vector at every node is then updated in parallel and synchronously to this next-cycle initialization word vector.
In implementation, the next cycle's initial word vector w' can be computed according to the following formula 1:
w' = w - η · Σ_{i=1}^{N} Δw_i    (Formula 1)
where w' is the next cycle's initial word vector of a word in the vocabulary; w is the word's initialization word vector for the current cycle; η is a predetermined coefficient; Δw_i is the word's accumulated gradient update at node i in the current cycle, obtained as the difference between the word's current-cycle initialization word vector and the word vector trained at node i; and N is the number of nodes in the learning model.
The size of η determines the nodes' update rate during training; η generally takes a value less than 1, for example 1/N or 1/2N. Preferably η = 1/N, in which case the next cycle's initialization word vector equals the current cycle's initialization word vector minus the average of the N nodes' accumulated gradient updates.
Each node periodically updates its initialization word vectors and, starting from the updated initialization word vectors, begins one more cycle of training the vocabulary words, and so on, until all nodes meet the preset training condition.
The preset decision condition may be that the number of training iterations reaches a preset number; it may also be that the accumulated gradient updates of more than a threshold number of vocabulary words are all smaller than a preset update value. Of course, the preset decision condition may be set to other content; its purpose in the present invention is to measure whether the training results of all nodes have converged and the training goal can be reached.
When the training results of all nodes meet the preset decision condition, all nodes of the distributed word vector learning model end training; then, from the trained word vectors obtained by all nodes in the last cycle, the next cycle's initialization word vector is computed according to Formula 1 above, and this computed initialization word vector is taken as the word vector corresponding to the word in the vocabulary.
In the present invention, the initialization word vectors used by the nodes are updated in parallel and synchronously, while the SGD process within each node is updated asynchronously; this update mode may be called semi-asynchronous. The semi-asynchronous distributed word vector learning model proposed by the present invention reduces the network communication time cost of frequent synchronization while guaranteeing the convergence of the algorithm, thereby accelerating model training.
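To make the cycle concrete, here is a single-process Python sketch that simulates the periodic, semi-asynchronous scheme described above: per-node SGD between synchronization points, then the Formula 1 update with the preferred η = 1/N. The word2vec-style gradient is abstracted behind a stub, and all names and the convergence tolerance are illustrative assumptions:

```python
import numpy as np

def sgd_gradient(word, word_set, vectors):
    """Stub for one word's SGD gradient (e.g. a word2vec skip-gram step)."""
    return np.zeros_like(vectors[word])   # placeholder only

def converged(deltas, tol=1e-4):
    """Preset decision condition: every accumulated update is small."""
    return all(np.linalg.norm(v) < tol for d in deltas for v in d.values())

def train_periodically(node_corpora, vocab, dim=100, max_cycles=20):
    """node_corpora: one list of ordered word sets per node."""
    N = len(node_corpora)
    eta = 1.0 / N                          # preferred value of η
    rng = np.random.default_rng(0)
    w = {word: rng.uniform(-0.5, 0.5, dim) for word in vocab}  # shared init

    for _ in range(max_cycles):
        deltas = []
        for corpora in node_corpora:       # in reality: N nodes in parallel
            local = {word: vec.copy() for word, vec in w.items()}
            for word_set in corpora:
                for word in word_set:
                    if word in vocab:      # train vocabulary matches only
                        local[word] -= sgd_gradient(word, word_set, local)
            # Accumulated update of this node: Δw_i = w - w_i(trained).
            deltas.append({word: w[word] - local[word] for word in vocab})
        # Formula 1: w' = w - η · Σ_i Δw_i (parallel synchronous update).
        for word in vocab:
            w[word] -= eta * sum(d[word] for d in deltas)
        if converged(deltas):              # preset decision condition met
            break
    return w
```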
Besides the learning and training of the vocabulary words above, the present invention makes fuller use of the corpora and also proposes training words of particular categories, such as numbers, English words and person names, to obtain their word vectors; the word vectors of these categories can further optimize search services.
Specifically, the present invention provides an optional scheme that adds the following step to the method shown in Fig. 1:
configuring each node in the distributed word vector learning model so that each node, according to its assigned training corpora, performs word vector training on the words in the corresponding ordered word sets that do not match the vocabulary; after the trained word vectors of the unmatched words are synchronized across the nodes, the next training cycle is triggered; the nodes thus cyclically train to obtain the word vectors of the unmatched words, and the word vectors of the unmatched words are saved into the vocabulary; wherein the unmatched words belong to preset categories.
In implementation, when a node of the distributed word vector learning model, while traversing the ordered word set corresponding to a corpus, finds a word that does not belong to the vocabulary but does belong to one of the preset categories, it trains that word. In this way the words in the corpora are fully used, and word vectors are mined by training for the words valuable to the search service.
It should be explained here that each node in the above examples uses the SGD algorithm, but in implementation each node may use the SGD algorithm or be trained with other algorithms, such as support vector machines, logistic regression or neural networks.
The implementation of the above method is illustrated below with an example.
For example, suppose a user query log is "world comedy top ten";
the vocabulary built from the user query log includes: "world", "comedy", "top ten";
according to the above method provided by the present invention, a large-scale corpus is input into the distributed word vector learning model, and training yields the result, i.e. the word vector corresponding to each word in the vocabulary (a 5-dimensional real vector):
"world" (0.004003, 0.004419, -0.003830, -0.003278, 0.001367)
"comedy" (-0.043665, 0.018578, 0.138403, 0.004431, -0.139117)
"top ten" (-0.337518, 0.224568, 0.018613, 0.222294, -0.057880).
In actual training, different vector dimensions can be set according to different needs; the above example uses dimension 5 only for illustration, and the implementation of the present invention is not limited to this dimension.
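For instance, a downstream search component could compare the words above by cosine similarity, a standard use of word vectors; the code is an illustration only, reusing the 5-dimensional example values from the training result above:

```python
import numpy as np

vectors = {
    "world":   np.array([0.004003, 0.004419, -0.003830, -0.003278, 0.001367]),
    "comedy":  np.array([-0.043665, 0.018578, 0.138403, 0.004431, -0.139117]),
    "top ten": np.array([-0.337518, 0.224568, 0.018613, 0.222294, -0.057880]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["comedy"], vectors["top ten"]))
```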
It can be seen from the above embodiments that the word vector training method provided by the present invention first crawls internet web pages to obtain training corpora, which are stored in a corpus; this way of building the corpus makes good use of the high timeliness, strong representativeness, rich resources and wide coverage of internet web page resources, so that massive corpora with wide coverage can be obtained.
Then, word segmentation is performed on each training corpus in the corpus to obtain the ordered word set corresponding to each training corpus, and a vocabulary is built from user query logs collected in advance. The present invention abandons the traditional way of building a vocabulary from the training corpora and proposes building it from user query logs; because user query logs reflect users' actual search needs, a vocabulary built from the query words they contain adapts well to search services.
Finally, the present invention distributes the training corpora stored in the corpus to the nodes of a distributed word vector learning model and configures the model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word. To solve the problem of slow training on large-scale corpora, the present invention abandons the traditional single-machine multi-threaded training mode and adopts a distributed word vector learning model, training in parallel on multiple nodes; this raises the training speed, so high-quality word vectors can be iterated quickly.
Corresponding to the above method, the present invention also provides a corresponding apparatus. Referring to Fig. 3, Fig. 3 is a structural diagram of a word vector training apparatus provided by an embodiment of the present invention. As shown in Fig. 3, the apparatus may include: a corpus building unit 201, a segmentation unit 202, a vocabulary building unit 203, a corpus distribution unit 204 and a first configuration unit 205. The connection relations and specific functions of the units are explained below with reference to the working principle of the apparatus.
The corpus building unit 201 is configured to crawl internet web pages to obtain training corpora, which are stored in a corpus;
the segmentation unit 202 is configured to perform word segmentation on each training corpus in the corpus to obtain the ordered word set corresponding to each training corpus;
the vocabulary building unit 203 is configured to build a vocabulary from user query logs collected in advance;
the corpus distribution unit 204 is configured to distribute the training corpora stored in the corpus to the nodes of a distributed word vector learning model;
the first configuration unit 205 is configured to configure the distributed word vector learning model to perform periodic word vector training on each word in the vocabulary, obtaining the word vector corresponding to each word in the vocabulary; wherein the word vector training includes: each node, according to the training corpora assigned to it, training those words in the ordered word sets of its corpora that match the vocabulary; and, after the trained word vectors of the words in the vocabulary are synchronized across the nodes, triggering the next training cycle.
In implementation, the segmentation unit 202 may include a segmentation subunit.
The segmentation subunit is configured to perform word segmentation on each training corpus using a segmentation tool and a pre-built segmentation dictionary, the segmentation dictionary being built from user query logs and an input-method lexicon collected in advance.
In implementation, the vocabulary building unit 203 may include a first extraction subunit and a first building subunit.
The first extraction subunit is configured to extract the words contained in the user query logs collected in advance and count the frequency of each word;
the first building subunit is configured to obtain the high-frequency words and build the vocabulary from them.
In implementation, the vocabulary building unit 203 may also include a second extraction subunit, a merging subunit and a second building subunit.
The second extraction subunit is configured to extract the words contained in the user query logs collected in advance and count the frequency of each word;
the merging subunit is configured to merge the high-frequency words using a named-entity dictionary;
the second building subunit is configured to build the vocabulary from the merged high-frequency words.
In implementation, the first configuration unit 205 may include a configuration subunit, a training subunit, a judgment subunit, a synchronized update subunit and a term vector computation subunit.
The configuration subunit is configured to set the term vector of each word in the vocabulary copy held by each node of the distributed term vector learning model to an initialization term vector;
the training subunit is configured so that each node, for the words in the ordered word sets of its assigned training corpora that match the vocabulary, trains the initialization term vectors of said words, obtaining the training term vectors of said words for the current cycle;
the judgment subunit is configured to judge whether a preset decision condition is met: if so, the term vector computation subunit is executed; if not, the synchronized update subunit is executed (a sketch of one possible decision condition follows this list);
the synchronized update subunit is configured to synchronously update in parallel, according to the training term vectors of said words obtained by each node in the current cycle, the current cycle's initialization term vectors of said words; the result serves as the next cycle's initialization term vectors of said words and is passed back to the training subunit;
the term vector computation subunit is configured to obtain, according to the training term vectors, the term vector of each word in the vocabulary.
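The preset decision condition is left open; the following sketch assumes one reasonable choice, a fixed cycle budget combined with a convergence test on the synchronized vectors:

```python
import numpy as np

def should_stop(cycle, w_prev, w_next, max_cycles=20, tol=1e-4):
    """Assumed decision condition: a cycle budget, or the synchronized update
    no longer moving any vector by more than tol."""
    if cycle + 1 >= max_cycles:
        return True
    drift = max(np.linalg.norm(w_next[k] - w_prev[k]) for k in w_next)
    return drift < tol
```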
Optionally, the training subunit includes a traversal matching subunit and a term vector training subunit.
The traversal matching subunit is configured to traverse, for each assigned training corpus, all words in the corpus's ordered word set and match each word against the vocabulary;
the term vector training subunit is configured to train a word when the traversal matching subunit reports a match, obtaining the term vector corresponding to that word, as sketched below.
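A sketch of this traverse-and-match step, where the vocabulary is held as a dict from word to vector and `update_word` is a hypothetical per-word training callback:

```python
def train_matched_words(ordered_words, vocab_vectors, update_word):
    """Traverse one corpus's ordered word set; only vocabulary matches are trained."""
    for pos, word in enumerate(ordered_words):
        if word in vocab_vectors:                  # matched against the vocabulary
            update_word(word, pos, ordered_words)  # e.g. a skip-gram style update
```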
Optionally, the synchronized update subunit is configured to realize the synchronized update using the following formula:

w′ = w + (η / N) · ΣΔw

where w′ denotes the next cycle's initialization term vector of a given word in a node's vocabulary copy; w denotes the current cycle's initialization term vector of that word in the node's vocabulary copy; η is a predetermined coefficient; Δw is obtained, for each node, by taking the difference between the term vector of the word trained by that node in the current cycle and the word's current-cycle initialization term vector, with the sum Σ running over all nodes; and N is the number of nodes in the learning model.
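In code, this is the parameter-averaging step already used inline in the earlier orchestration sketch; a direct NumPy rendering:

```python
import numpy as np

def synchronized_update(w, trained_per_node, eta):
    """w' = w + (eta / N) * sum_i(trained_i - w), per word, across all N nodes."""
    n = len(trained_per_node)
    return {word: vec + (eta / n) * sum(t[word] - vec for t in trained_per_node)
            for word, vec in w.items()}
```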
Optionally, the device may further include a second configuration unit.
The second configuration unit is configured to configure each node in the distributed term vector learning model so that each node, according to the training corpora assigned to it, performs term vector training on the words in each corpus's ordered word set that do not match the vocabulary; after the training term vectors of said unmatched words trained by the nodes are synchronized, the next cycle of training is triggered, so that cyclic training across the nodes yields the term vectors of said unmatched words, whose term vectors are then saved into the vocabulary. Said unmatched words belong to preset categories.
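A sketch of how such a unit might pick out the extra words to train, with `in_preset_category` a hypothetical predicate (for example, recognizing product names or model numbers):

```python
def collect_unmatched_words(ordered_words, vocab_vectors, in_preset_category):
    """Words absent from the vocabulary but belonging to a preset category are
    also trained; their vectors are later saved into the vocabulary."""
    return [w for w in ordered_words
            if w not in vocab_vectors and in_preset_category(w)]
```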
Regarding the device in the above embodiment, the concrete manner in which each module performs its operations has been described in detail in the embodiments of the related method, and will not be elaborated here.
In addition, the present invention provides another term vector training device, explained below with reference to Fig. 4.
Fig. 4 is a hardware structure diagram of a term vector training device provided by an embodiment of the present invention. The device 300 shown in Fig. 4 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to Fig. 4, the device 300 may include one or more of the following components: a processing component 302, a memory 304, a power component 306, a multimedia component 308, an audio component 310, an input/output (I/O) interface 312, a sensor component 314, and a communication component 316.
The processing component 302 typically controls the overall operations of the device 300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 302 may include one or more processors 320 to execute instructions to perform all or part of the steps of the above methods. In addition, the processing component 302 may include one or more modules that facilitate interaction between the processing component 302 and other components. For example, the processing component 302 may include a multimedia module to facilitate interaction between the multimedia component 308 and the processing component 302.
The memory 304 is configured to store various types of data to support operation on the device 300. Examples of such data include instructions for any application or method operated on the device 300, contact data, phone book data, messages, pictures, video, etc. The memory 304 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
The power component 306 supplies power to the various components of the device 300. The power component 306 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 300.
The multimedia component 308 includes a screen providing an output interface between the device 300 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe. In some embodiments, the multimedia component 308 includes a front camera and/or a rear camera. When the device 300 is in an operating mode, such as a photographing mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front or rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 310 is configured to output and/or input audio signals. For example, the audio component 310 includes a microphone (MIC) configured to receive external audio signals when the device 300 is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in the memory 304 or transmitted via the communication component 316. In some embodiments, the audio component 310 further includes a speaker for outputting audio signals.
The I/O interface 312 provides an interface between the processing component 302 and peripheral interface modules, such as a keyboard, a click wheel, or buttons. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 314 includes one or more sensors for providing status assessments of various aspects of the device 300. For example, the sensor component 314 may detect the open/closed state of the device 300 and the relative positioning of components (e.g., the display and keypad of the device 300), and may also detect a change in position of the device 300 or a component thereof, the presence or absence of user contact with the device 300, the orientation or acceleration/deceleration of the device 300, and a change in temperature of the device 300. The sensor component 314 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 314 may also include an accelerometer, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 316 is configured to facilitate wired or wireless communication between the device 300 and other devices. The device 300 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 316 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 316 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In exemplary embodiments, the device 300 may be implemented with one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
In exemplary embodiments, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 304 including instructions, executable by the processor 320 of the device 300 to perform the above methods. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
There is also provided a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device can perform a term vector training method, the processor executing instructions for the following operations:
crawling internet web pages to obtain training corpora and storing them in a corpus;
performing word segmentation on each training corpus in the corpus to obtain the ordered word set corresponding to each training corpus;
building a vocabulary according to pre-collected user query logs;
distributing each training corpus preserved in the corpus to each node in a distributed term vector learning model;
configuring the distributed term vector learning model to perform periodic term vector training on each word in the vocabulary, obtaining the term vector corresponding to each word in the vocabulary;
wherein the term vector training includes: each node, according to the training corpora assigned to it, trains the words in each corpus's ordered word set that match the vocabulary; after the training term vectors of the words in the vocabulary trained by the nodes are synchronized, the next cycle of training is triggered.
Fig. 5 is a structural schematic diagram of a server in an embodiment of the present invention. The server 1900 may vary considerably by configuration or performance, and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), a memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Furthermore, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
From the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods may be implemented by software plus a general hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product; the software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to execute the methods described in the embodiments of the present invention or certain parts thereof.
It should be noted that the embodiments in this specification are described in a progressive manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. Since the device and system embodiments are substantially similar to the method embodiments, they are described more simply, and the relevant parts may refer to the explanations in the method embodiments. The device and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
Those skilled in the art will readily conceive of other embodiments of the present invention after considering the specification and practicing the invention disclosed herein. The present invention is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or conventional techniques in the art not disclosed herein. The specification and examples are to be considered illustrative only, with the true scope and spirit of the invention indicated by the following claims.
It should be understood that the invention is not limited to the precise constructions described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (17)

1. A term vector training method, characterized in that the method includes:
crawling internet web pages to obtain training corpora and storing them in a corpus;
performing word segmentation on each training corpus in the corpus to obtain an ordered word set corresponding to each training corpus;
building a vocabulary according to pre-collected user query logs;
distributing each training corpus preserved in the corpus to each node in a distributed term vector learning model;
configuring the distributed term vector learning model to perform periodic term vector training on each word in the vocabulary, obtaining a term vector corresponding to each word in the vocabulary;
wherein the term vector training includes: each node, according to the training corpora assigned to it, training the words in each corpus's ordered word set that match the vocabulary; and, after the training term vectors of the words in the vocabulary trained by the nodes are synchronized, triggering the next cycle of training.
2. The method according to claim 1, characterized in that performing word segmentation on each training corpus in the corpus includes:
performing word segmentation on each training corpus using a word segmentation tool and a pre-established segmentation dictionary, the segmentation dictionary being built from pre-collected user query logs and an input method lexicon.
3. The method according to claim 1, characterized in that building the vocabulary according to the pre-collected user query logs includes:
extracting the words contained in the pre-collected user query logs and counting the word frequency of each word;
obtaining high-frequency words and building the vocabulary.
4. The method according to claim 3, characterized in that after obtaining the high-frequency words, the method further includes:
merging the high-frequency words using a named entity dictionary, and then executing the step of building the vocabulary.
5. The method according to claim 1, characterized in that configuring the distributed term vector learning model to perform periodic term vector training on each word in the vocabulary, obtaining the term vector corresponding to each word in the vocabulary, includes:
step 1, configuring the term vector of each word in the vocabulary copy held by each node of the distributed term vector learning model to be an initialization term vector;
step 2, each node, for the words in the ordered word sets of its assigned training corpora that match the vocabulary, training the initialization term vectors of said words, obtaining the training term vectors of said words for the current cycle;
step 3, according to the training term vectors of said words obtained by each node in the current cycle, synchronously updating in parallel the current cycle's initialization term vectors of said words, the result serving as the next cycle's initialization term vectors of said words;
step 4, judging whether a preset decision condition is met; if so, proceeding to step 5; if not, returning to step 2;
step 5, obtaining, according to the next cycle's initialization term vectors of said words, the term vector of each word in the vocabulary.
6. The method according to claim 5, characterized in that training the words in the ordered word sets of the assigned training corpora that match the vocabulary includes:
for each assigned training corpus, traversing all words in the corpus's ordered word set and matching each word against the vocabulary; if a current word matches an identical word in the vocabulary, training the word to obtain the term vector corresponding to the word.
7. The method according to claim 5, characterized in that synchronously updating in parallel, according to the training term vectors of said words obtained by each node in the current cycle, the current cycle's initialization term vectors of said words as the next cycle's initialization term vectors of said words includes:
realizing the synchronized update using the following formula:

w′ = w + (η / N) · ΣΔw

where w′ denotes the next cycle's initialization term vector of a word in a node's vocabulary copy; w denotes the current cycle's initialization term vector of the word in the node's vocabulary copy; η is a predetermined coefficient; Δw is obtained, for each node, by taking the difference between the term vector of the word trained by that node in the current cycle and the word's current-cycle initialization term vector, with the sum Σ running over all nodes; and N is the number of nodes in the learning model.
8. The method according to claim 1, characterized in that the method further includes:
configuring each node in the distributed term vector learning model so that each node, according to the training corpora assigned to it, performs term vector training on the words in each corpus's ordered word set that do not match the vocabulary; after the training term vectors of said unmatched words trained by the nodes are synchronized, triggering the next cycle of training, so that cyclic training across the nodes yields the term vectors of said unmatched words, the term vectors corresponding to said unmatched words then being saved into the vocabulary; wherein said unmatched words belong to preset categories.
9. A term vector training device, characterized in that the device includes:
a corpus establishing unit, configured to crawl internet web pages to obtain training corpora and store them in a corpus;
a word segmentation unit, configured to perform word segmentation on each training corpus in the corpus to obtain an ordered word set corresponding to each training corpus;
a vocabulary construction unit, configured to build a vocabulary according to pre-collected user query logs;
a corpus distribution unit, configured to distribute each training corpus preserved in the corpus to each node in a distributed term vector learning model;
a first configuration unit, configured to configure the distributed term vector learning model to perform periodic term vector training on each word in the vocabulary, obtaining a term vector corresponding to each word in the vocabulary; wherein the term vector training includes: each node, according to the training corpora assigned to it, training the words in each corpus's ordered word set that match the vocabulary; and, after the training term vectors of the words in the vocabulary trained by the nodes are synchronized, triggering the next cycle of training.
10. The device according to claim 9, characterized in that the word segmentation unit includes:
a word segmentation processing subunit, configured to perform word segmentation on each training corpus using a word segmentation tool and a pre-established segmentation dictionary, the segmentation dictionary being built from pre-collected user query logs and an input method lexicon.
11. The device according to claim 9, characterized in that the vocabulary construction unit includes:
a first extraction subunit, configured to extract the words contained in the pre-collected user query logs and count the word frequency of each word;
a first construction subunit, configured to obtain high-frequency words and build the vocabulary.
12. The device according to claim 9, characterized in that the vocabulary construction unit includes:
a second extraction subunit, configured to extract the words contained in the pre-collected user query logs and count the word frequency of each word;
a merging subunit, configured to merge the high-frequency words using a named entity dictionary;
a second construction subunit, configured to build the vocabulary from the merged high-frequency words.
13. The device according to claim 9, characterized in that the first configuration unit includes:
a configuration subunit, configured to set the term vector of each word in the vocabulary copy held by each node of the distributed term vector learning model to an initialization term vector;
a training subunit, configured so that each node, for the words in the ordered word sets of its assigned training corpora that match the vocabulary, trains the initialization term vectors of said words, obtaining the training term vectors of said words for the current cycle;
a judgment subunit, configured to judge whether a preset decision condition is met; if so, the term vector computation subunit is executed; if not, the synchronized update subunit is executed;
a synchronized update subunit, configured to synchronously update in parallel, according to the training term vectors of said words obtained by each node in the current cycle, the current cycle's initialization term vectors of said words, the result serving as the next cycle's initialization term vectors of said words and being passed back to the training subunit;
a term vector computation subunit, configured to obtain, according to the training term vectors, the term vector of each word in the vocabulary.
14. The device according to claim 13, characterized in that the training subunit includes:
a traversal matching subunit, configured to traverse, for each assigned training corpus, all words in the corpus's ordered word set and match each word against the vocabulary;
a term vector training subunit, configured to train a word when the matching result of the traversal matching subunit is positive, obtaining the term vector corresponding to the word.
15. The device according to claim 13, characterized in that the synchronized update subunit is configured to realize the synchronized update using the following formula:

w′ = w + (η / N) · ΣΔw

where w′ denotes the next cycle's initialization term vector of a word in a node's vocabulary copy; w denotes the current cycle's initialization term vector of the word in the node's vocabulary copy; η is a predetermined coefficient; Δw is obtained, for each node, by taking the difference between the term vector of the word trained by that node in the current cycle and the word's current-cycle initialization term vector, with the sum Σ running over all nodes; and N is the number of nodes in the learning model.
16. The device according to claim 9, characterized in that the device further includes:
a second configuration unit, configured to configure each node in the distributed term vector learning model so that each node, according to the training corpora assigned to it, performs term vector training on the words in each corpus's ordered word set that do not match the vocabulary; after the training term vectors of said unmatched words trained by the nodes are synchronized, the next cycle of training is triggered, so that cyclic training across the nodes yields the term vectors of said unmatched words, the term vectors corresponding to said unmatched words then being saved into the vocabulary; wherein said unmatched words belong to preset categories.
17. A term vector training device, characterized in that it includes a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for the following operations:
crawling internet web pages to obtain training corpora and storing them in a corpus;
performing word segmentation on each training corpus in the corpus to obtain an ordered word set corresponding to each training corpus;
building a vocabulary according to pre-collected user query logs;
distributing each training corpus preserved in the corpus to each node in a distributed term vector learning model;
configuring the distributed term vector learning model to perform periodic term vector training on each word in the vocabulary, obtaining a term vector corresponding to each word in the vocabulary;
wherein the term vector training includes: each node, according to the training corpora assigned to it, training the words in each corpus's ordered word set that match the vocabulary; and, after the training term vectors of the words in the vocabulary trained by the nodes are synchronized, triggering the next cycle of training.