CN107577658A - Word vector processing method, apparatus and electronic device

Info

Publication number: CN107577658A
Authority: CN (China)
Prior art keywords: word, vector, character n-gram, character, Japanese word
Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201710583670.4A
Other languages: Chinese (zh)
Other versions: CN107577658B (en)
Inventors: 曹绍升, 周俊
Current Assignee: Advanced New Technologies Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Alibaba Group Holding Ltd

Abstract

Embodiments of this specification disclose a word vector processing method, apparatus and electronic device. The method includes: dividing a word into multiple character n-grams, and training the word vector of the word based on the character vectors of those n-grams, where a character n-gram represents n consecutive characters of its corresponding word. Taking Japanese words as an example, the characters are kana and/or kanji.

Description

Word vector processing method, apparatus and electronic device
Technical field
This specification relates to the field of computer software technology, and in particular to a word vector processing method, apparatus and electronic device.
Background
Most of today's natural language processing solutions use neural-network-based architectures, and an important basic technology in such architectures is the word vector. A word vector maps a word to a vector of fixed dimension, and that vector represents the semantic information of the word.
In the prior art, the algorithms commonly used to generate word vectors are designed specifically for English, for example Google's word vector algorithm and Microsoft's deep neural network algorithm.
Given the prior art, a word vector generation scheme for Japanese is needed.
Summary of the invention
Embodiments of this specification provide a word vector processing method, apparatus and electronic device to solve the following technical problem: a word vector generation scheme for Japanese is needed.
To solve the above technical problem, the embodiments of this specification are implemented as follows:
A word vector processing method provided by an embodiment of this specification includes:
segmenting a corpus to obtain words;
determining the character n-grams corresponding to each word, where a character n-gram represents n consecutive characters of its corresponding word;
establishing and initializing a word vector for each word, and a character vector for each character n-gram corresponding to each word;
training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
where the words are Japanese words, and the characters are kana and/or kanji.
A word vector processing apparatus provided by an embodiment of this specification includes:
a word segmentation module, which segments a corpus to obtain words;
a determining module, which determines the character n-grams corresponding to each word, where a character n-gram represents n consecutive characters of its corresponding word;
an initialization module, which establishes and initializes a word vector for each word, and a character vector for each character n-gram corresponding to each word;
a training module, which trains the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
where the words are Japanese words, and the characters are kana and/or kanji.
Another word vector processing method provided by an embodiment of this specification includes:
Step 1: segment the corpus and establish a vocabulary made up of the words obtained by segmentation, where the vocabulary does not include words that occur in the corpus fewer than a set number of times; go to step 2;
Step 2: according to the vocabulary, establish a character n-gram mapping table that contains the mapping relations between each word and its character n-grams, where a character n-gram represents n consecutive characters of the word it is mapped from; go to step 3;
Step 3: according to the n-gram mapping table, establish and initialize a word vector for each word, and a character vector for each character n-gram mapped from each word; go to step 4;
Step 4: traverse the segmented corpus, take each traversed word in turn as the current word w and perform step 5 on it; end when the traversal is complete, otherwise continue traversing;
Step 5: with the current word w as the center, slide at most k words to each side to establish a window; traverse all words in the window other than the current word w, take each traversed word in turn as the current context word c of w and perform step 6 on it; when the traversal is complete, continue with step 4, otherwise continue traversing;
Step 6: compute the similarity between the current word w and the current context word c according to the following formula:
$$\mathrm{sim}(w, c) = \beta_1 \cdot \sum_{q \in S(w)} \vec{q} \odot \vec{c} + \beta_2 \cdot \vec{w} \odot \vec{c}$$
where S(w) is the set of at least some of the character n-grams that the current word w maps to in the n-gram mapping table, q is a character n-gram in S(w), and sim(w, c) is the similarity between the current word w and the current context word c; $\vec{q}$ is the character vector of q, $\vec{w}$ is the word vector of w, $\vec{c}$ is the word vector of c, and ⊙ is a specific operation on two vectors, namely the dot product, the cosine of the included angle, or the Euclidean distance; $\beta_1$ and $\beta_2$ are weight parameters; go to step 7;
Step 7: randomly select λ words as negative sample words, and compute the corresponding loss characterization value l(w, c) according to the following loss function:
$$l(w, c) = \log \sigma(\mathrm{sim}(w, c)) + \lambda \cdot E_{c' \in p(V)}\left[\log \sigma(-\mathrm{sim}(w, c'))\right]$$
where c' is a randomly selected negative sample word, $E_{c' \in p(V)}[x]$ is the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, defined as
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The gradient corresponding to the loss function is computed from the computed loss characterization value l(w, c), and according to the gradient, the character vector $\vec{q}$ of q and the word vector $\vec{c}$ of the current context word c are updated.
An electronic device provided by an embodiment of this specification includes:
at least one processor; and
a memory communicatively connected to the at least one processor; where
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
segment a corpus to obtain words;
determine the character n-grams corresponding to each word, where a character n-gram represents n consecutive characters of its corresponding word;
establish and initialize a word vector for each word, and a character vector for each character n-gram corresponding to each word;
train the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
where the words are Japanese words, and the characters are kana and/or kanji.
At least one of the above technical schemes adopted by the embodiments of this specification can achieve the following beneficial effect: the character n-grams corresponding to a word express the features of the word at a finer granularity, which helps improve the accuracy of the word vectors generated for Japanese words and gives good practical results. The above technical problem can therefore be partly or wholly solved.
Brief description of the drawings
To describe the technical schemes in the embodiments of this specification or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are only some of the embodiments recorded in this specification, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an overall architecture involved in the scheme of this specification in a practical application scenario;
Fig. 2 is a schematic flowchart of a word vector processing method provided by an embodiment of this specification;
Fig. 3 is a schematic flowchart of a specific implementation of the word vector processing method in a practical application scenario, provided by an embodiment of this specification;
Fig. 4 is a schematic diagram of the processing performed on part of the corpus used in the flow in Fig. 3, provided by an embodiment of this specification;
Fig. 5 is a schematic flowchart of another word vector processing method provided by an embodiment of this specification;
Fig. 6 is a schematic structural diagram of a word vector processing apparatus corresponding to Fig. 2, provided by an embodiment of this specification.
Detailed description
Embodiments of this specification provide a word vector processing method, apparatus and electronic device.
To enable those skilled in the art to better understand the technical schemes in this specification, the technical schemes in the embodiments of this specification are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this specification without creative effort shall fall within the protection scope of this application.
Fig. 1 is a schematic diagram of an overall architecture involved in the scheme of this specification in a practical application scenario. The architecture mainly involves four parts: the words in the corpus, the character n-grams corresponding to the words, the word vectors of the words together with the character vectors of the n-grams, and the vector training server. Character n-grams express the features of their corresponding words at a finer granularity; the vector training server trains the word vectors of the words and the character vectors of the n-grams, so that more accurate word vectors can be obtained. In practice, the actions relating to the first three parts may be performed by corresponding software and/or hardware functional modules.
The scheme of this specification applies to word vectors of Japanese words, and also to word vectors of words in some other languages, such as Chinese, Korean and English. For words that are not Japanese, the characters are correspondingly the characters that make up those words, for example Chinese characters, Korean characters or English letters.
For ease of description, the following embodiments explain the scheme of this specification mainly for the Japanese-word scenario.
Fig. 2 is a schematic flowchart of a word vector processing method provided by an embodiment of this specification. From the program perspective, the flow may be executed by a program with word vector generation and/or training functions; from the device perspective, the flow may be executed by, but is not limited to, at least one of the following devices that can carry such a program: a personal computer, a large or medium-sized computer, a computer cluster, a mobile phone, a tablet computer, a smart wearable device, an in-vehicle device, and so on.
The flow in Fig. 2 may include the following steps:
S202: segment the corpus to obtain words.
In the embodiments of this specification, the words may specifically be at least some of the words that occur at least once in the corpus. To facilitate subsequent processing, the words may be stored in a vocabulary and read from the vocabulary when needed.
S204: determine the character n-grams corresponding to each word, where a character n-gram represents n consecutive characters of its corresponding word; the words are Japanese words, and the characters are kana and/or kanji.
For ease of understanding, kana and kanji are explained as follows:
Kana: the phonetic characters of Japanese are called kana (かな), of which there are two kinds, hiragana (ひらがな) and katakana (カタカナ). Kanji: the Chinese characters used when writing Japanese. Most Japanese words have both a kana spelling and a kanji spelling.
For example, take the Japanese word "さくら". Its 1-grams are "さ", "く" and "ら"; its 2-grams are "さく" (characters 1 to 2) and "くら" (characters 2 to 3); its 3-gram is "さくら" (characters 1 to 3).
"さくら" is the kana spelling; the corresponding kanji spelling is the Japanese word "桜", and the 1-gram corresponding to "桜" is "桜".
In the embodiments of this specification, the value of n may be dynamically adjustable. For a given word, when determining its character n-grams, n may take a single value (for example, only the 3-grams of the word are determined) or several values (for example, both the 3-grams and the 4-grams of the word are determined). When n equals the total number of characters in the word, the n-gram is the word itself.
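By way of non-limiting illustration, the determination of character n-grams for one or several values of n may be sketched in Python as follows. The function name `char_ngrams` is hypothetical and not part of this specification, and the sketch assumes that each kana or kanji is a single Unicode code point in the input string.

```python
from typing import Iterable, List


def char_ngrams(word: str, sizes: Iterable[int]) -> List[str]:
    """Return the runs of n consecutive characters of word, for each n in sizes."""
    grams = []
    for n in sizes:
        if n < 1:
            raise ValueError("n must be a positive integer")
        # A word of length L has L - n + 1 n-grams (none when n > L).
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


print(char_ngrams("さくら", [2]))        # a single value of n
print(char_ngrams("さくら", [1, 2, 3]))  # several values of n
```

When n equals the length of the word, the only n-gram returned is the word itself, and when n exceeds the length no n-gram is returned.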
In the embodiments of this specification, to facilitate computer processing, character n-grams may also be represented numerically. For example, each distinct character may be represented by a number, so that a character n-gram is correspondingly represented as a string of numbers, with a mapping relation between each number or number string and the characters it stands for.
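One possible numeric representation is sketched below, purely as an assumption for illustration: characters receive integer identifiers in order of first appearance, and an n-gram becomes a tuple of identifiers. The names `build_char_ids` and `encode_ngram` are hypothetical.

```python
def build_char_ids(words):
    """Assign each distinct character an integer ID in order of first appearance."""
    char_to_id = {}
    for word in words:
        for ch in word:
            char_to_id.setdefault(ch, len(char_to_id))
    return char_to_id


def encode_ngram(ngram, char_to_id):
    """Represent an n-gram as a tuple of character IDs."""
    return tuple(char_to_id[ch] for ch in ngram)


char_to_id = build_char_ids(["さくら", "くらい"])
print(char_to_id)
print(encode_ngram("くら", char_to_id))
```

A tuple is used rather than a concatenated digit string so that the encoding stays unambiguous once identifiers have more than one digit.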
In step S204, at least some of the character n-grams corresponding to a word may be determined for use in the subsequent processing.
S206: establish and initialize a word vector for each word, and a character vector for each character n-gram corresponding to each word.
In the embodiments of this specification, a character vector is a vector that represents a character n-gram. Each character n-gram may be represented by its own character vector, just as each word may be represented by its own word vector.
In the embodiments of this specification, to ensure the effectiveness of the scheme, some restrictions may apply when initializing the word vectors and character vectors. For example, the word vectors and character vectors must not all be initialized to the same vector; as another example, the elements of a given word vector or character vector must not all be 0; and so on.
In the embodiments of this specification, the word vector of each word and the character vector of each character n-gram corresponding to each word may be initialized randomly or according to a specified probability distribution, where identical character n-grams have identical character vectors. For example, the specified probability distribution may be the 0-1 distribution.
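As a minimal sketch of such an initialization, under assumptions that are not taken from this specification (a small uniform range scaled by the dimension, and the hypothetical helper name `init_vectors`), one vector may be created per distinct word and per distinct n-gram, so that an n-gram shared by several words has a single character vector:

```python
import random


def init_vectors(keys, dim, rng):
    """Give every distinct key one random vector with entries in [-0.5/dim, 0.5/dim)."""
    return {key: [(rng.random() - 0.5) / dim for _ in range(dim)] for key in keys}


rng = random.Random(0)  # fixed seed only to make the illustration repeatable
word_ngrams = {"さくら": ["さく", "くら"], "くらい": ["くら", "らい"]}

word_vecs = init_vectors(word_ngrams, dim=4, rng=rng)
# Collect distinct n-grams first, so the shared n-gram "くら" gets exactly one vector.
distinct_ngrams = sorted({q for grams in word_ngrams.values() for q in grams})
char_vecs = init_vectors(distinct_ngrams, dim=4, rng=rng)

print(len(word_vecs), len(char_vecs))
```

Random initialization of this kind makes it overwhelmingly likely that the vectors differ from one another and are not all zero, which is the restriction described above.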
In addition, if the word vectors and character vectors corresponding to some words have already been trained on other corpora, then when those vectors are further trained on the corpus in Fig. 2, they need not be re-established and re-initialized; instead, training may continue from the corpus in Fig. 2 and the earlier training results.
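Continuing from earlier training results may be sketched as follows, assuming the earlier vectors are available as a dictionary. The helper name `init_missing` and the random range are assumptions made for illustration only.

```python
import random


def init_missing(keys, existing, dim, rng):
    """Keep vectors that already exist; create random ones only for new keys."""
    vectors = dict(existing)
    for key in keys:
        if key not in vectors:
            vectors[key] = [(rng.random() - 0.5) / dim for _ in range(dim)]
    return vectors


pretrained = {"さくら": [0.1, -0.2, 0.3]}  # hypothetical result of earlier training
vectors = init_missing(["さくら", "くらい"], pretrained, dim=3, rng=random.Random(0))
print(vectors["さくら"])
```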
S208: train the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus.
In the embodiments of this specification, the training may be implemented with a neural network, which may be a shallow neural network, a deep neural network, or the like. This specification does not limit the specific structure of the neural network used.
With the method of Fig. 2, the character n-grams corresponding to a word express the features of the word at a finer granularity, in particular the way the word is composed of characters, which helps improve the accuracy of the word vectors of Japanese words and gives good practical results.
Based on the method of Fig. 2, the embodiments of this specification further provide some specific implementations and extensions of the method, which are described below.
In the embodiments of this specification, for step S204, determining the character n-grams corresponding to each word may specifically include: determining, according to the result of segmenting the corpus, the words that occur in the corpus;
and, for each of the mutually different words so determined, performing:
determining the character n-grams corresponding to the word, where a character n-gram corresponding to the word represents n consecutive characters of the word, and n is one positive integer or several different positive integers.
In the embodiments of this specification, identical words have identical character n-grams. The step in the preceding paragraph is therefore performed once for each of the mutually different words determined; for a repeated word, the existing result can be reused directly without repeating the step, which saves resources.
Further, if a word occurs only rarely in the corpus, the training samples and training iterations available for it when training on that corpus are correspondingly few, which can harm the reliability of the training result. Such words may therefore be screened out and not trained for the time being; they can be trained later on other corpora.
Based on this idea, determining the words that occur in the corpus according to the result of segmenting the corpus may specifically include: determining, according to the result of segmenting the corpus, the words that occur in the corpus no fewer than a set number of times. The set number can be determined according to the actual situation.
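A frequency filter of this kind may be sketched as below. The function name `build_vocab`, the parameter name `min_count` and the toy corpus are hypothetical; the corpus is assumed to be already segmented into lists of words.

```python
from collections import Counter


def build_vocab(segmented_corpus, min_count):
    """Return the set of words occurring at least min_count times in the corpus."""
    counts = Counter(word for sentence in segmented_corpus for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


corpus = [["さくら", "が", "さく"], ["さくら", "は", "はな"]]
print(build_vocab(corpus, min_count=2))
```

Counting once over the whole corpus also yields the set of mutually different words, so the n-grams of each word need to be determined only once.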
In the embodiments of this specification, for step S208, several specific training methods are possible, such as training based on context words, or training based on designated near-synonyms or synonyms. For ease of understanding, the former is described in detail as an example.
Training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus may specifically include: determining a designated word in the segmented corpus, and one or more context words of the designated word in the segmented corpus; determining the similarity between the designated word and a context word according to the character vectors of the character n-grams corresponding to the designated word and the word vector of the context word; and updating the word vector of the context word and the character vectors of the character n-grams corresponding to the designated word according to the similarity between the designated word and the context word.
This specification does not limit the specific way in which the similarity is determined. For example, the similarity may be computed based on the cosine of the angle between vectors, or based on a sum-of-squares operation on the vectors, and so on.
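Two such vector operations may be sketched as follows. The function names are hypothetical, and reading the sum-of-squares operation as the sum of squared element-wise differences is an assumption made for illustration only.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def squared_distance(u, v):
    """Sum of squared element-wise differences; smaller means more alike."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


print(cosine([1.0, 2.0], [2.0, 4.0]))
print(squared_distance([1.0, 2.0], [4.0, 6.0]))
```

The cosine is undefined for an all-zero vector, which is one reason the initialization should not produce vectors whose elements are all 0.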
There may be multiple designated words, and a designated word may occur repeatedly at different positions in the corpus; the processing in the preceding paragraph may be performed for each designated word. Preferably, each word contained in the segmented corpus is taken in turn as a designated word.
In the embodiments of this specification, the training in step S208 may make the similarity between a designated word and its context words relatively higher (here, the similarity can reflect the degree of association: a word is relatively highly associated with its context words, and words with the same or similar meanings tend to have the same or similar context words), and make the similarity between the designated word and non-context words relatively lower. Non-context words can serve as the negative sample words described below, and context words correspondingly as positive sample words.
It follows that some negative sample words need to be determined as a contrast during training. One or more words may be selected at random from the segmented corpus as negative sample words, or non-context words may be selected strictly as negative sample words. Taking the former as an example, updating the word vector of the context word and the character vectors of the character n-grams corresponding to the designated word according to the similarity between the designated word and the context word may specifically include: selecting one or more words from the words as negative sample words; determining the similarity between the designated word and each negative sample word; determining the loss characterization value corresponding to the designated word according to a specified loss function, the similarity between the designated word and the context word, and the similarity between the designated word and each negative sample word; and updating the word vector of the context word and the character vectors of the character n-grams corresponding to the designated word according to the loss characterization value.
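Both ways of choosing negative sample words may be sketched with one helper. The name `sample_negatives`, the `exclude` parameter and the uniform sampling distribution are assumptions made for illustration; this specification does not fix the distribution.

```python
import random


def sample_negatives(vocab, count, rng, exclude=()):
    """Draw `count` words uniformly at random from vocab, skipping excluded words."""
    candidates = [word for word in vocab if word not in exclude]
    # rng.choice raises IndexError if every word is excluded.
    return [rng.choice(candidates) for _ in range(count)]


vocab = ["さくら", "はな", "そら", "うみ"]
rng = random.Random(0)
print(sample_negatives(vocab, 2, rng))                            # purely random selection
print(sample_negatives(vocab, 2, rng, exclude={"さくら", "はな"}))  # strictly non-context words
```

Passing the designated word and its context words as `exclude` gives the strict variant; leaving `exclude` empty gives the purely random variant.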
The loss characterization value measures the degree of error between the current vector values and the training target. The loss function may take the above similarities as parameters; this specification does not limit the specific expression of the loss function, which is explained in more detail later.
In the embodiments of this specification, updating the word vectors and character vectors is in effect a correction of that degree of error. When the scheme of this specification is implemented with a neural network, the correction can be based on backpropagation and gradient descent. In this case, the gradient is the gradient corresponding to the loss function.
Updating the word vector of the context word and the character vectors of the character n-grams corresponding to the designated word according to the loss characterization value may then specifically include: determining the gradient corresponding to the loss function according to the loss characterization value; and updating the word vector of the context word and the character vectors of the character n-grams corresponding to the designated word according to the gradient.
In the embodiments of this specification, the training of the word vectors and character vectors may be carried out iteratively over at least some of the words in the segmented corpus, so that the word vectors and character vectors gradually converge until training is complete.
Take training on all the words in the segmented corpus as an example. For step S208, training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus may specifically include:
traversing the segmented corpus, and performing the following for each word in the segmented corpus:
determining one or more context words of the word in the segmented corpus;
and, for each of the context words, performing:
determining the similarity between the word and the context word according to the character vectors of the character n-grams corresponding to the word and the word vector of the context word;
updating the word vector of the context word and the character vectors of the character n-grams corresponding to the word according to the similarity between the word and the context word.
How the update is performed has been described above and is not repeated here.
In the embodiments of this specification, when determining the similarity between the word and the context word, the word vector of the word itself may also be taken into account, in addition to the character vectors of the word's character n-grams and the word vector of the context word. Based on this idea, determining the similarity between the word and the context word according to the character vectors of the character n-grams corresponding to the word and the word vector of the context word may specifically include: determining the similarity between the word and the context word according to the character vectors of the character n-grams corresponding to the word, the word vector of the word, and the word vector of the context word.
Further, to facilitate computer processing, the traversal above can be implemented with a window.
For example, determining one or more context words of the word in the segmented corpus may specifically include: in the segmented corpus, establishing a window by sliding a specified number of words to the left and/or right with the word as the center; and determining the words in the window other than the word itself as the context words of the word.
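A window centered on a word may be sketched as follows, sliding the same number of words to both sides. The function name `context_words` and the parameter `k` (the number of words slid to each side) are hypothetical names for illustration.

```python
def context_words(sentence, index, k):
    """Return up to k words on each side of sentence[index], excluding the word itself."""
    left = sentence[max(0, index - k):index]
    right = sentence[index + 1:index + 1 + k]
    return left + right


sentence = ["a", "b", "c", "d", "e"]
print(context_words(sentence, 2, 1))
print(context_words(sentence, 0, 2))
```

Near the start or end of the corpus the window is simply truncated, so fewer than 2k context words are returned.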
Alternatively, a window of a set length may be established starting at the first word of the segmented corpus, containing the first word and a set number of consecutive words after it; once every word in the window has been processed, the window slides onward to process the next batch of words in the corpus, until the whole corpus has been traversed.
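This alternative may be sketched as a generator that yields successive batches of words. The name `window_batches` is hypothetical, and `size` is assumed here to be the total window length, that is, the first word plus the set number of words after it.

```python
def window_batches(words, size):
    """Yield consecutive, non-overlapping windows of at most `size` words."""
    for start in range(0, len(words), size):
        yield words[start:start + size]


for batch in window_batches(["a", "b", "c", "d", "e"], 2):
    print(batch)
```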
A word vector processing method provided by an embodiment of this specification has been described above. For ease of understanding, and based on the above description, an embodiment of this specification further provides a schematic flowchart of a specific implementation of the word vector processing method in a practical application scenario, as shown in Fig. 3.
The flow in Fig. 3 mainly includes the following steps:
Step 1: segment the Chinese corpus with a word segmentation tool, scan the segmented Chinese corpus, count all the words that occur in it to establish a vocabulary, and delete the words that occur fewer than b times (b being the set number mentioned above); go to step 2;
Step 2: scan the vocabulary word by word, extract the character n-grams corresponding to each word, and establish a character n-gram table and a mapping table between each word and its character n-grams; go to step 3;
Step 3: establish a word vector of dimension d for each word in the vocabulary and a character vector, also of dimension d, for each character n-gram in the n-gram table, and randomly initialize all the vectors established; go to step 4;
Step 4: starting from the first word of the segmented Chinese corpus, slide word by word, each time selecting one word as the "current word w" (that is, the designated word mentioned above); if w has traversed all the words of the whole corpus, end; otherwise go to step 5;
Step 5: with the current word w as the center, slide k words to each side to establish a window; from the first word to the last word in the window (excluding the current word w), select one word at a time as the "context word c"; if c has traversed all the words in the window, go to step 4; otherwise go to step 6;
Step 6: for the current word w, look up the character n-grams corresponding to w in the mapping table between words and character n-grams from step 2, and compute the similarity between the current word w and the context word c according to formula (1):
$$\mathrm{sim}(w, c) = \beta_1 \cdot \sum_{q \in S(w)} \vec{q} \odot \vec{c} + \beta_2 \cdot \vec{w} \odot \vec{c} \qquad (1)$$
where S denotes the character n-gram table established in step 2, S(w) is the set of at least some of the character n-grams corresponding to the current word w in the mapping table of step 2, q is a character n-gram in S(w), and sim(w, c) is the similarity between the current word w and the current context word c; $\vec{q}$ is the character vector of q, $\vec{w}$ is the word vector of w, $\vec{c}$ is the word vector of c, and ⊙ is a specific operation on two vectors, namely the dot product, the cosine of the included angle, or the Euclidean distance; $\beta_1$ and $\beta_2$ are weight parameters that generally take values between 0 and 1, for example both non-negative with $\beta_1 + \beta_2 = 1$; go to step 7;
Step 7: randomly select λ words as negative sample words, and compute the loss score l(w, c) according to formula (2) (that is, the loss function mentioned above); the loss score serves as the loss characterization value mentioned above:
$$l(w, c) = \log \sigma(\mathrm{sim}(w, c)) + \lambda \cdot E_{c' \in p(V)}\left[\log \sigma(-\mathrm{sim}(w, c'))\right] \qquad (2)$$
where log is the logarithm function, c' is a randomly selected negative sample word, $E_{c' \in p(V)}[x]$ is the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, given by formula (3):
$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (3)$$
where, if x is a real number, σ(x) is also a real number; compute the gradient from the value of l(w, c), and update the character vector $\vec{q}$ of q and the vector $\vec{c}$ of the context word c; go to step 5.
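Steps 6 and 7 may be sketched in Python as below. This is a sketch under stated assumptions rather than the implementation of this specification: ⊙ is taken to be the dot product; the score is taken to be log σ(sim(w, c)) plus the sum of log σ(−sim(w, c')) over the sampled negative words, which estimates the λ-weighted expectation; that score is increased by a gradient step (equivalently, its negative is decreased); and, as in step 7, only the character vectors of w and the vector of c are updated. All function names and the learning rate `lr` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hidden(w_vec, q_vecs, beta1, beta2):
    """h = beta1 * sum(q) + beta2 * w, so that sim(w, c) = h . c for the dot product."""
    return [beta1 * sum(q[i] for q in q_vecs) + beta2 * w_vec[i]
            for i in range(len(w_vec))]


def score(w_vec, q_vecs, c_vec, neg_vecs, beta1, beta2):
    """log sigma(sim(w, c)) plus the sum of log sigma(-sim(w, c')) over the negatives."""
    h = hidden(w_vec, q_vecs, beta1, beta2)
    value = math.log(sigmoid(dot(h, c_vec)))
    for n_vec in neg_vecs:
        value += math.log(sigmoid(-dot(h, n_vec)))
    return value


def update(w_vec, q_vecs, c_vec, neg_vecs, beta1, beta2, lr):
    """One gradient step, in place, on the character vectors of w and the vector of c."""
    h = hidden(w_vec, q_vecs, beta1, beta2)
    g_pos = 1.0 - sigmoid(dot(h, c_vec))  # derivative of log sigma(s) with respect to s
    # Gradient of the score with respect to h, computed before any vector is modified.
    grad_h = [g_pos * c_vec[i] - sum(sigmoid(dot(h, n)) * n[i] for n in neg_vecs)
              for i in range(len(h))]
    for i in range(len(c_vec)):
        c_vec[i] += lr * g_pos * h[i]
    for q in q_vecs:
        for i in range(len(q)):
            q[i] += lr * beta1 * grad_h[i]


w = [0.1, 0.2]
qs = [[0.3, -0.1], [0.0, 0.2]]
c = [0.2, 0.1]
negs = [[-0.1, 0.3]]
before = score(w, qs, c, negs, 0.5, 0.5)
update(w, qs, c, negs, 0.5, 0.5, lr=0.1)
after = score(w, qs, c, negs, 0.5, 0.5)
print(before, after)
```

For a small enough learning rate the score after the update is higher than before, meaning that w has become more similar to c and less similar to the negative sample words. Very large similarity values would overflow `math.exp` in this naive `sigmoid`, so a practical implementation would clamp its input.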
Among steps 1 to 7 above, steps 6 and 7 are the key steps. For ease of understanding, they are illustrated with reference to Fig. 4.
Fig. 4 is a schematic diagram of the processing actions on part of the corpus used in the flow of Fig. 3, provided by an embodiment of this specification.
As shown in Fig. 4, suppose the corpus contains a sentence that word segmentation splits into three words: "なつやすみ", " ", and "い".
Suppose "なつやすみ" is now selected as the current word w and " " as the context word c. Extract the set S(w) of at least some of the n-gram characters mapped by the current word w; for example, the 3-gram characters mapped by "なつやすみ" include "なつや", "つやす", and "やすみ", and the 4-gram characters include "なつやす" and "つやすみ". Calculate the loss score l(w, c) according to formulas (1), (2), and (3), then calculate the gradient to update the word vector of c and the character vectors corresponding to w. These calculations are represented by the gray boxes in Fig. 4.
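The n-gram extraction in this example can be sketched in a few lines. The function name `char_ngrams` and its `sizes` parameter are hypothetical, and the five-kana word "なつやすみ" is assumed as the example word:

```python
def char_ngrams(word, sizes=(3, 4)):
    """Return the n-gram characters of `word` for each n in `sizes`, in order."""
    grams = []
    for n in sizes:
        # A word shorter than n yields no n-gram of that size (empty range).
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


# Assumed example word; 3-grams come first, then 4-grams.
example = char_ngrams("なつやすみ")
```

A word with fewer than n characters simply contributes no n-gram character of that size, which is one possible way to handle short words.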
Based on the same idea as Fig. 2 and the embodiment in Fig. 3, the embodiments of this specification provide another word vector processing method.
Fig. 5 is a schematic flowchart of another word vector processing method provided by an embodiment of this specification.
The flow in Fig. 5 may include the following steps:
Step 1: perform word segmentation on the corpus and build a vocabulary of the words obtained by segmentation, where the words exclude those whose number of occurrences in the corpus is less than a set number; go to step 2;
Step 2: according to the vocabulary, build an n-gram character mapping table containing the mapping relationships between each word and its n-gram characters, where an n-gram character represents n consecutive characters of the word it is mapped from; go to step 3;
Step 3: according to the n-gram character mapping table, establish and initialize the word vector of each word and the character vector of each n-gram character mapped by each word; go to step 4;
Step 4: traverse the segmented corpus, taking each traversed word in turn as the current word w and performing step 5 on it; end when the traversal is complete, otherwise continue traversing;
Step 5: with the current word w as the center, slide at most k words to each side to establish a window, and traverse all the words in the window other than the current word w, taking each traversed word in turn as the current context word c of the current word w and performing step 6 on it; when the traversal is complete, continue with step 4, otherwise continue traversing;
Step 6: calculate the similarity between the current word w and the current context word c according to the following formula:
where S(w) denotes the set of at least some of the n-gram characters mapped by the current word w in the n-gram character mapping table, q denotes each n-gram character in S(w), and sim(w, c) denotes the similarity between the current word w and the current context word c; $\vec{q}$ denotes the character vector of q, $\vec{w}$ the word vector of w, and $\vec{c}$ the word vector of c; ⊙ denotes a specific operation on two vectors, namely a dot product, an included-angle cosine, or a Euclidean distance; β1 and β2 are weight parameters; go to step 7;
Step 7: randomly select λ words as negative sample words, and calculate the corresponding loss characterization value l(w, c) according to the following loss function:
where c' is a randomly selected negative sample word, $E_{c' \in p(V)}[x]$ denotes the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, defined as:
Calculate the gradient corresponding to the loss function according to the calculated loss characterization value l(w, c), and update the character vector $\vec{q}$ of q and the word vector $\vec{c}$ of the current context word c according to the gradient.
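Steps 6 and 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the dot product is chosen for ⊙, σ is assumed to be the sigmoid function, the loss function is assumed to be log σ(sim(w, c)) plus the sum of log σ(−sim(w, c')) over the sampled negative words, and all function and variable names are hypothetical. Because this assumed l(w, c) grows as the fit improves, the update moves along the gradient.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def similarity(w, c, word_vec, gram_vec, grams_of, beta1=0.5, beta2=0.5):
    """Step 6, with the dot product as the vector operation (assumed choice)."""
    gram_part = sum(dot(gram_vec[q], word_vec[c]) for q in grams_of[w])
    return beta1 * gram_part + beta2 * dot(word_vec[w], word_vec[c])


def loss_value(w, c, negatives, word_vec, gram_vec, grams_of, beta1=0.5, beta2=0.5):
    """Step 7: the sum over sampled negatives stands in for the expectation term."""
    value = math.log(sigmoid(similarity(w, c, word_vec, gram_vec, grams_of, beta1, beta2)))
    for neg in negatives:
        value += math.log(sigmoid(-similarity(w, neg, word_vec, gram_vec, grams_of, beta1, beta2)))
    return value


def update(w, c, negatives, word_vec, gram_vec, grams_of, lr=0.1, beta1=0.5, beta2=0.5):
    """One gradient step on l(w, c); only the q vectors and the c vector change."""
    dim = len(word_vec[c])
    pos = 1.0 - sigmoid(similarity(w, c, word_vec, gram_vec, grams_of, beta1, beta2))
    neg_weights = [sigmoid(similarity(w, n, word_vec, gram_vec, grams_of, beta1, beta2))
                   for n in negatives]
    # d sim(w, c) / d c = beta1 * (sum of n-gram vectors) + beta2 * w
    mix = [beta1 * sum(gram_vec[q][i] for q in grams_of[w]) + beta2 * word_vec[w][i]
           for i in range(dim)]
    # d l / d q is identical for every q in S(w)
    grad_q = [beta1 * (pos * word_vec[c][i]
                       - sum(s * word_vec[n][i] for s, n in zip(neg_weights, negatives)))
              for i in range(dim)]
    # All gradients were computed from the old values; now apply them.
    for q in grams_of[w]:
        gram_vec[q] = [v + lr * g for v, g in zip(gram_vec[q], grad_q)]
    word_vec[c] = [v + lr * pos * m for v, m in zip(word_vec[c], mix)]


# Toy data: one current word "w" with two n-gram characters, one context
# word "c", and one negative sample word "n".
word_vec = {"w": [0.1, 0.2], "c": [0.3, -0.1], "n": [0.2, 0.1]}
gram_vec = {"g1": [0.1, 0.0], "g2": [0.0, 0.1]}
grams_of = {"w": ["g1", "g2"]}
before = loss_value("w", "c", ["n"], word_vec, gram_vec, grams_of)
update("w", "c", ["n"], word_vec, gram_vec, grams_of)
after = loss_value("w", "c", ["n"], word_vec, gram_vec, grams_of)
```

As in step 7, the sketch leaves the word vector of w and the vectors of the negative sample words unchanged.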
The steps of this other word vector processing method may be performed by the same module or by different modules; this specification does not specifically limit this.
The foregoing describes the word vector processing methods provided by the embodiments of this specification. Based on the same idea, the embodiments of this specification further provide a corresponding apparatus, as shown in Fig. 6.
Fig. 6 is a schematic structural diagram of a word vector processing apparatus corresponding to Fig. 2, provided by an embodiment of this specification. The apparatus may be located at the execution body of the flow in Fig. 2 and includes:
a word segmentation module 601, which performs word segmentation on a corpus to obtain words;
a determining module 602, which determines each n-gram character corresponding to each word, where an n-gram character represents n consecutive characters of its corresponding word;
an initialization module 603, which establishes and initializes the word vector of each word and the character vector of each n-gram character corresponding to each word;
a training module 604, which trains the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
where the words are Japanese words, and the characters are kana and/or kanji.
Optionally, the determining module 602 determining each n-gram character corresponding to each word specifically includes:
the determining module 602 determining, according to the result of word segmentation on the corpus, the words that have appeared in the corpus;
and, for each of the determined words, performing:
determining each n-gram character corresponding to the word, where an n-gram character corresponding to the word represents n consecutive characters of the word, and n is one positive integer or several different positive integers.
Optionally, the determining module 602 determining, according to the result of word segmentation on the corpus, the words that have appeared in the corpus specifically includes:
the determining module 602 determining, according to the result of word segmentation on the corpus, the words that have appeared in the corpus and whose number of occurrences is no less than a set number.
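The occurrence threshold can be illustrated with a short sketch. The function name `build_vocabulary`, the parameter `min_count`, and the toy segmented sentences are hypothetical; the segmentation itself is assumed to have been done already:

```python
from collections import Counter


def build_vocabulary(segmented_sentences, min_count=2):
    """Keep the words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in segmented_sentences for word in sentence)
    return sorted(word for word, count in counts.items() if count >= min_count)


# Toy segmented corpus: "なつ" and "は" occur twice, the other words once.
toy_corpus = [["なつ", "は"], ["なつ", "も"], ["ふゆ", "は"]]
vocabulary = build_vocabulary(toy_corpus, min_count=2)
```

Filtering out rare words keeps the vocabulary and the n-gram character mapping table smaller, at the cost of having no word vector for the dropped words.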
Optionally, the initialization module 603 initializing the word vector of each word and the character vector of each n-gram character corresponding to each word specifically includes:
the initialization module 603 initializing the word vector of each word and the character vector of each n-gram character corresponding to each word by random initialization or by initialization according to a specified probability distribution, where identical n-gram characters have the same character vector.
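A minimal sketch of such an initialization follows. The uniform distribution, the scaling by the dimension, and the names `init_vectors`, `dim`, and `seed` are assumptions made for illustration. Storing character vectors in a dictionary keyed by the n-gram character is one way to ensure that identical n-gram characters share a single vector:

```python
import random


def init_vectors(grams_of, dim=4, seed=0):
    """Create one word vector per word and one character vector per distinct n-gram."""
    rng = random.Random(seed)

    def new_vec():
        # Small random values; the uniform distribution is an assumed choice.
        return [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]

    word_vec = {word: new_vec() for word in grams_of}
    gram_vec = {}
    for grams in grams_of.values():
        for gram in grams:
            if gram not in gram_vec:  # identical n-grams share one vector
                gram_vec[gram] = new_vec()
    return word_vec, gram_vec


# "なつや" is mapped by both words and therefore gets only one vector.
toy_mapping = {"なつやすみ": ["なつや", "つやす", "やすみ"], "なつやま": ["なつや", "つやま"]}
toy_word_vec, toy_gram_vec = init_vectors(toy_mapping)
```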
Optionally, the training module 604 training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus specifically includes:
the training module 604 determining a specified word in the segmented corpus, and one or more context words of the specified word in the segmented corpus;
determining the similarity between the specified word and the context word according to the character vector of each n-gram character corresponding to the specified word and the word vector of the context word;
and updating the word vector of the context word and the character vector of each n-gram character corresponding to the specified word according to the similarity between the specified word and the context word.
Optionally, the training module 604 updating the word vector of the context word and the character vector of each n-gram character corresponding to the specified word according to the similarity between the specified word and the context word specifically includes:
the training module 604 selecting one or more words from the words as negative sample words;
determining the similarity between the specified word and each negative sample word;
determining the loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word;
and updating the word vector of the context word and the character vector of each n-gram character corresponding to the specified word according to the loss characterization value.
Optionally, the training module 604 updating the word vector of the context word and the character vector of each n-gram character corresponding to the specified word according to the loss characterization value specifically includes:
the training module 604 determining the gradient corresponding to the loss function according to the loss characterization value;
and updating the word vector of the context word and the character vector of each n-gram character corresponding to the specified word according to the gradient.
Optionally, the training module 604 selecting one or more words from the words as negative sample words specifically includes:
the training module 604 randomly selecting one or more words from the words as negative sample words.
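Random selection of negative sample words could look like the sketch below. Uniform sampling with replacement and the exclusion of the current word and the context word from the candidates are assumptions, as is the function name `sample_negatives`; the text above only requires that the words be chosen at random:

```python
import random


def sample_negatives(vocabulary, current_word, context_word, count, rng=random):
    """Draw `count` negative sample words uniformly, with replacement."""
    # Excluding the current word and the context word is an assumed refinement.
    candidates = [w for w in vocabulary if w not in (current_word, context_word)]
    return [rng.choice(candidates) for _ in range(count)]


toy_negatives = sample_negatives(["a", "b", "c", "d"], "a", "b", 3, random.Random(0))
```

The sketch raises an `IndexError` when no candidate remains, so a real vocabulary needs at least one word besides the current word and the context word.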
Optionally, the training module 604 training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus specifically includes:
the training module 604 traversing the segmented corpus and, for each word in the segmented corpus, performing:
determining one or more context words of the word in the segmented corpus;
and, for each of the context words, performing:
determining the similarity between the word and the context word according to the character vector of each n-gram character corresponding to the word and the word vector of the context word;
and updating the word vector of the context word and the character vector of each n-gram character corresponding to the word according to the similarity between the word and the context word.
Optionally, the training module 604 determining the similarity between the word and the context word according to the character vector of each n-gram character corresponding to the word and the word vector of the context word specifically includes:
the training module 604 determining the similarity between the word and the context word according to the character vector of each n-gram character corresponding to the word, the word vector of the word, and the word vector of the context word.
Optionally, the training module 604 determining one or more context words of the word in the segmented corpus specifically includes:
the training module 604 establishing, in the segmented corpus, a window centered on the word by sliding a distance of a specified number of words to the left and/or right;
and determining the words in the window other than the word as the context words of the word.
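The window construction can be sketched as follows, assuming the segmented sentence is a list of words and the window is clipped at the sentence boundaries. The function name `context_words` and its parameters are hypothetical:

```python
def context_words(sentence, index, k):
    """Return the words within k positions of sentence[index], excluding that word."""
    left = max(0, index - k)                   # clip the window at the start
    right = min(len(sentence), index + k + 1)  # clip the window at the end
    return sentence[left:index] + sentence[index + 1:right]


toy_sentence = ["a", "b", "c", "d", "e"]
middle = context_words(toy_sentence, 2, 1)
first = context_words(toy_sentence, 0, 2)
last = context_words(toy_sentence, 4, 3)
```

Near the start or end of a sentence the window holds fewer than 2k context words, which matches sliding "at most" the specified number of words to each side.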
Based on the same idea, the embodiments of this specification further provide a corresponding electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; where
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
perform word segmentation on a corpus to obtain words;
determine each n-gram character corresponding to each word, where an n-gram character represents n consecutive characters of its corresponding word;
establish and initialize the word vector of each word and the character vector of each n-gram character corresponding to each word;
train the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
where the words are Japanese words, and the characters are kana and/or kanji.
Based on the same idea, the embodiments of this specification further provide a corresponding non-volatile computer storage medium storing computer-executable instructions, where the computer-executable instructions are configured to:
perform word segmentation on a corpus to obtain words;
determine each n-gram character corresponding to each word, where an n-gram character represents n consecutive characters of its corresponding word;
establish and initialize the word vector of each word and the character vector of each n-gram character corresponding to each word;
train the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
where the words are Japanese words, and the characters are kana and/or kanji.
The foregoing describes specific embodiments of this specification. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments in this specification are described in a progressive manner. For identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus, electronic device, and non-volatile computer storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The apparatus, electronic device, and non-volatile computer storage medium provided by the embodiments of this specification correspond to the method, and therefore also have beneficial technical effects similar to those of the corresponding method. Since the beneficial technical effects of the method have been described in detail above, those of the corresponding apparatus, electronic device, and non-volatile computer storage medium are not repeated here.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is such an integrated circuit, whose logic function is determined by the user programming the device. Designers program a digital system to "integrate" it onto a single PLD themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually making integrated circuit chips, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code before compilation must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can be readily obtained merely by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller may be implemented in any suitable manner. For example, a controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing a controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for implementing various functions may also be regarded as structures within the hardware component. The means for implementing various functions may even be regarded as both software modules implementing the method and structures within the hardware component.
The systems, apparatuses, modules, or units described in the foregoing embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementing device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For ease of description, the above apparatus is described with its functions divided into various units. Of course, when implementing this specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that the embodiments of this specification may be provided as a method, a system, or a computer program product. Therefore, the embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This specification is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of this specification. It should be understood that computer program instructions can implement each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
Memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory among computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art should understand that the embodiments of this specification may be provided as a method, a system, or a computer program product. Therefore, this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. This specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner. For identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described relatively briefly because it is substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment.
The foregoing is merely an embodiment of this specification and is not intended to limit this application. For those skilled in the art, this application may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of this application shall fall within the scope of the claims of this application.

Claims (24)

1. A word vector processing method, comprising:
performing word segmentation on a corpus to obtain words;
determining each n-gram character corresponding to each word, wherein an n-gram character represents n consecutive characters of its corresponding word;
establishing and initializing the word vector of each word and the character vector of each n-gram character corresponding to each word;
training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
wherein the words are Japanese words, and the characters are kana and/or kanji.
2. The method according to claim 1, wherein determining each n-gram character corresponding to each word specifically comprises:
determining, according to the result of word segmentation on the corpus, the words that have appeared in the corpus;
and, for each of the determined, mutually distinct words, performing:
determining each n-gram character corresponding to the word, wherein an n-gram character corresponding to the word represents n consecutive characters of the word, and n is one positive integer or several different positive integers.
3. The method according to claim 2, wherein determining, according to the result of word segmentation on the corpus, the words that have appeared in the corpus specifically comprises:
determining, according to the result of word segmentation on the corpus, the words that have appeared in the corpus and whose number of occurrences is no less than a set number.
4. The method according to claim 1, wherein initializing the word vector of each word and the character vector of each n-gram character corresponding to each word specifically comprises:
initializing the word vector of each word and the character vector of each n-gram character corresponding to each word by random initialization or by initialization according to a specified probability distribution, wherein identical n-gram characters have the same character vector.
5. The method according to claim 1, wherein training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus specifically comprises:
determining a specified word in the segmented corpus, and one or more context words of the specified word in the segmented corpus;
determining the similarity between the specified word and the context word according to the character vector of each n-gram character corresponding to the specified word and the word vector of the context word;
and updating the word vector of the context word and the character vector of each n-gram character corresponding to the specified word according to the similarity between the specified word and the context word.
6. The method according to claim 5, wherein updating the word vector of the context word and the character vector of each n-gram character corresponding to the specified word according to the similarity between the specified word and the context word specifically comprises:
selecting one or more words from the words as negative sample words;
determining the similarity between the specified word and each negative sample word;
determining the loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word;
and updating the word vector of the context word and the character vector of each n-gram character corresponding to the specified word according to the loss characterization value.
7. The method according to claim 6, wherein updating the word vector of the context word and the character vector of each n-gram character corresponding to the specified word according to the loss characterization value specifically comprises:
determining the gradient corresponding to the loss function according to the loss characterization value;
and updating the word vector of the context word and the character vector of each n-gram character corresponding to the specified word according to the gradient.
8. The method according to claim 6, wherein selecting one or more words from the words as negative sample words specifically comprises:
randomly selecting one or more words from the words as negative sample words.
9. The method as claimed in claim 1, wherein training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus specifically comprises:
Traversing the segmented corpus, and performing the following for each word in the segmented corpus:
Determining one or more context words of the word in the segmented corpus;
Performing the following for each of the context words:
Determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word and the word vector of the context word;
Updating, according to the similarity between the word and the context word, the word vector of the context word and the character vectors of the n-gram characters corresponding to the word.
10. The method as claimed in claim 9, wherein determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word and the word vector of the context word specifically comprises:
Determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word, the word vector of the word, and the word vector of the context word.
11. The method as claimed in claim 9, wherein determining one or more context words of the word in the segmented corpus specifically comprises:
Establishing a window in the segmented corpus by taking the word as the center and sliding a distance of a specified number of words to the left and/or to the right;
Determining the words in the window other than the word itself as the context words of the word.
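The window construction in claim 11 can be sketched as follows. The function name `context_words` and the sample sentence are hypothetical, and the sketch assumes a symmetric window that is simply truncated at the sentence boundaries.

```python
def context_words(tokens, index, window):
    """Return the words within `window` positions of tokens[index].

    The center word is excluded by position, so a repeated occurrence of
    the same word elsewhere in the window is still returned.
    """
    start = max(0, index - window)
    end = min(len(tokens), index + window + 1)
    return tokens[start:index] + tokens[index + 1:end]


# Hypothetical segmented sentence; the center word is "東京" at position 2.
tokens = ["私", "は", "東京", "に", "住んで", "います"]
print(context_words(tokens, 2, 2))  # ['私', 'は', 'に', '住んで']
```

Calling the function for every position in every segmented sentence yields the (word, context word) pairs that the training traversal operates on.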
12. A word vector processing device, comprising:
A word segmentation module, configured to perform word segmentation on a corpus to obtain words;
A determining module, configured to determine the n-gram characters corresponding to each of the words, where an n-gram character represents n consecutive characters of its corresponding word;
An initialization module, configured to establish and initialize a word vector for each of the words, and a character vector for each n-gram character corresponding to each of the words;
A training module, configured to train the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
Wherein the words are Japanese words, and the characters are kana and/or kanji.
13. The device as claimed in claim 12, wherein the determining module determining the n-gram characters corresponding to each of the words specifically comprises:
The determining module determining, according to the result of the word segmentation of the corpus, the words that occur in the corpus;
Performing the following for each distinct word so determined:
Determining the n-gram characters corresponding to the word, where each n-gram character corresponding to the word represents n consecutive characters of the word, and n is one positive integer or multiple different positive integers.
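As an illustration of claim 13, the n-gram characters of a single word could be extracted as below. The function name `char_ngrams` and the default values of n are assumptions. The sketch treats each Python string element (a Unicode code point) as one character, which presumes the text is normalized so that each kana or kanji is a single code point.

```python
def char_ngrams(word, ns=(1, 2)):
    """Return the distinct character n-grams of `word` for each n in `ns`.

    N-grams are listed in order of first appearance; values of n larger
    than the word length contribute nothing.
    """
    seen = {}
    for n in ns:
        for i in range(len(word) - n + 1):
            seen.setdefault(word[i:i + n], None)
    return list(seen)


# "食べる" mixes one kanji with two kana; take 2-grams and 3-grams.
print(char_ngrams("食べる", ns=(2, 3)))  # ['食べ', 'べる', '食べる']
```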
14. The device as claimed in claim 13, wherein the determining module determining, according to the result of the word segmentation of the corpus, the words that occur in the corpus specifically comprises:
The determining module determining, according to the result of the word segmentation of the corpus, the words that occur in the corpus at least a set number of times.
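The frequency threshold in claim 14 can be sketched with a simple counter. The name `build_vocab`, the default threshold, and the toy corpus are hypothetical; the corpus is assumed to be already segmented into lists of words.

```python
from collections import Counter


def build_vocab(segmented_sentences, min_count=2):
    """Keep the words that occur at least `min_count` times in the corpus."""
    counts = Counter(w for sentence in segmented_sentences for w in sentence)
    return sorted(w for w, c in counts.items() if c >= min_count)


# Hypothetical segmented corpus: "が" and "走る" occur twice, the rest once.
corpus = [["猫", "が", "走る"], ["犬", "が", "走る"]]
vocab = build_vocab(corpus, min_count=2)
```

Dropping rare words keeps the vocabulary, and therefore the number of word vectors to train, bounded.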
15. The device as claimed in claim 12, wherein the initialization module initializing the word vector of each of the words and the character vector of each n-gram character corresponding to each of the words specifically comprises:
The initialization module initializing the word vector of each of the words and the character vector of each n-gram character corresponding to each of the words by random initialization or by initialization according to a specified probability distribution, wherein identical n-gram characters have identical character vectors.
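A minimal sketch of the initialization in claim 15 follows. The function name `init_vectors`, the dimension, and the uniform range are assumptions; the claim only requires random initialization or a specified probability distribution. The point illustrated is that an n-gram character shared by several words gets a single shared vector.

```python
import random


def init_vectors(vocab, ngram_table, dim=8, seed=0):
    """Create one random vector per word and one per distinct n-gram.

    `ngram_table` maps each word to its n-gram characters. The uniform
    range below is an assumed choice, not taken from the claims.
    """
    rng = random.Random(seed)

    def rand_vec():
        return [rng.uniform(-0.5 / dim, 0.5 / dim) for _ in range(dim)]

    word_vecs = {w: rand_vec() for w in vocab}
    char_vecs = {}
    for w in vocab:
        for q in ngram_table[w]:
            if q not in char_vecs:  # identical n-grams share one vector
                char_vecs[q] = rand_vec()
    return word_vecs, char_vecs


# Hypothetical table of 2-grams: "東京" occurs in both words.
ngram_table = {"東京": ["東京"], "東京都": ["東京", "京都"]}
word_vecs, char_vecs = init_vectors(["東京", "東京都"], ngram_table)
```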
16. The device as claimed in claim 12, wherein the training module training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus specifically comprises:
The training module determining a specified word in the segmented corpus, and one or more context words of the specified word in the segmented corpus;
Determining the similarity between the specified word and the context word according to the character vectors of the n-gram characters corresponding to the specified word and the word vector of the context word;
Updating, according to the similarity between the specified word and the context word, the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word.
17. The device as claimed in claim 16, wherein the training module updating, according to the similarity between the specified word and the context word, the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word specifically comprises:
The training module selecting one or more words from the words as negative sample words;
Determining the similarity between the specified word and each of the negative sample words;
Determining a loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each of the negative sample words;
Updating, according to the loss characterization value, the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word.
18. The device as claimed in claim 17, wherein the training module updating, according to the loss characterization value, the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word specifically comprises:
The training module determining a gradient corresponding to the loss function according to the loss characterization value;
Updating, according to the gradient, the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word.
19. The device as claimed in claim 17, wherein the training module selecting one or more words from the words as negative sample words specifically comprises:
The training module randomly selecting one or more words from the words as negative sample words.
20. The device as claimed in claim 12, wherein the training module training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus specifically comprises:
The training module traversing the segmented corpus, and performing the following for each word in the segmented corpus:
Determining one or more context words of the word in the segmented corpus;
Performing the following for each of the context words:
Determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word and the word vector of the context word;
Updating, according to the similarity between the word and the context word, the word vector of the context word and the character vectors of the n-gram characters corresponding to the word.
21. The device as claimed in claim 20, wherein the training module determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word and the word vector of the context word specifically comprises:
The training module determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word, the word vector of the word, and the word vector of the context word.
22. The device as claimed in claim 20, wherein the training module determining one or more context words of the word in the segmented corpus specifically comprises:
The training module establishing a window in the segmented corpus by taking the word as the center and sliding a distance of a specified number of words to the left and/or to the right;
Determining the words in the window other than the word itself as the context words of the word.
23. A word vector processing method, comprising:
Step 1: performing word segmentation on a corpus, and establishing a vocabulary composed of the words obtained by the segmentation, wherein the words do not include any word that occurs in the corpus fewer than a set number of times; go to step 2;
Step 2: establishing an n-gram character mapping table according to the vocabulary, the mapping table containing the mapping relations between each word and its n-gram characters, where an n-gram character represents n consecutive characters of the word it is mapped from; go to step 3;
Step 3: establishing and initializing, according to the n-gram character mapping table, a word vector for each word and a character vector for each n-gram character to which each word is mapped; go to step 4;
Step 4: traversing the segmented corpus, taking each traversed word in turn as the current word w and performing step 5 on the current word w; ending if the traversal is complete, otherwise continuing the traversal;
Step 5: establishing a window centered on the current word w by sliding up to k words to each side, and traversing all words in the window other than the current word w, taking each traversed word in turn as the current context word c of the current word w and performing step 6 on the current context word c; continuing the execution of step 4 if the traversal is complete, otherwise continuing the traversal;
Step 6: calculating the similarity between the current word w and the current context word c according to the following formula:
Wherein S(w) denotes the set of at least some of the n-gram characters to which the current word w is mapped in the n-gram character mapping table, q denotes each n-gram character in S(w), sim(w, c) denotes the similarity between the current word w and the current context word c, $\vec{q}$ denotes the character vector of q, $\vec{w}$ denotes the word vector of w, $\vec{c}$ denotes the word vector of c, ⊙ denotes a specified operation on two vectors, the specified operation being a dot product, a cosine of the included angle, or a Euclidean distance, and β1 and β2 are weight parameters; go to step 7;
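The formula referred to in step 6 is not reproduced in this text. One form consistent with the symbols defined above would be $\mathrm{sim}(w, c) = \beta_1 \sum_{q \in S(w)} \vec{q} \odot \vec{c} + \beta_2\, \vec{w} \odot \vec{c}$, but this is an assumption made for illustration and not a quotation of the claim. The sketch below implements that assumed form with the dot product as the specified operation; the function names and sample values are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity(ngram_vecs, word_vec, context_vec, beta1=1.0, beta2=1.0):
    """Assumed form: beta1 * sum_q (q . c) + beta2 * (w . c)."""
    ngram_term = sum(dot(q, context_vec) for q in ngram_vecs)
    return beta1 * ngram_term + beta2 * dot(word_vec, context_vec)


# Hypothetical 2-dimensional vectors for two n-grams, the word, and the context word.
sim = similarity(
    ngram_vecs=[[1.0, 0.0], [0.0, 1.0]],
    word_vec=[1.0, 1.0],
    context_vec=[2.0, 3.0],
    beta1=0.5,
    beta2=0.25,
)
print(sim)  # 3.75
```

Setting `beta2=0.0` gives a similarity that depends only on the n-gram character vectors, which matches the variant in which the word's own vector is not used.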
Step 7: randomly selecting λ words as negative sample words, and calculating the corresponding loss characterization value l(w, c) according to the following loss function:
$$l(w, c) = \log \sigma(\mathrm{sim}(w, c)) + \sum_{i=1}^{\lambda} E_{c' \in p(V)}\left[\log \sigma(-\mathrm{sim}(w, c'))\right];$$
Wherein c' is a randomly selected negative sample word, $E_{c' \in p(V)}[x]$ denotes the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$;
Calculating the gradient corresponding to the loss function according to the calculated loss characterization value l(w, c), and updating, according to the gradient, the character vector $\vec{q}$ of q and the word vector $\vec{c}$ of the current context word c.
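Step 7 can be sketched as a single update on one (w, c) pair. Several things here are assumptions rather than details from the claim: the similarity is taken as a dot product between $\vec{c}$ and a combined vector $h = \beta_1 \sum_q \vec{q} + \beta_2 \vec{w}$, the expectation is replaced by the λ sampled negatives, l(w, c) is treated as a quantity to increase by gradient ascent, and the names `train_step` and `lr` are hypothetical. As in the claim, only the n-gram character vectors and the context word vector are updated.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_step(ngram_vecs, word_vec, context_vec, negative_vecs,
               lr=0.1, beta1=1.0, beta2=1.0):
    """Update the n-gram and context vectors in place; return l(w, c) before the update."""
    dim = len(context_vec)
    # Combined representation of w, so that sim(w, x) = h . x (assumed form).
    h = [beta1 * sum(q[j] for q in ngram_vecs) + beta2 * word_vec[j]
         for j in range(dim)]
    value = math.log(sigmoid(dot(h, context_vec)))
    value += sum(math.log(sigmoid(-dot(h, n))) for n in negative_vecs)

    # d/ds log(sigmoid(s)) = 1 - sigmoid(s); d/ds log(sigmoid(-s)) = -sigmoid(s).
    g_pos = 1.0 - sigmoid(dot(h, context_vec))
    grad_h = [g_pos * context_vec[j] for j in range(dim)]
    for n in negative_vecs:
        g_neg = -sigmoid(dot(h, n))
        for j in range(dim):
            grad_h[j] += g_neg * n[j]
    grad_c = [g_pos * h[j] for j in range(dim)]

    for q in ngram_vecs:  # each n-gram vector enters h with weight beta1
        for j in range(dim):
            q[j] += lr * beta1 * grad_h[j]
    for j in range(dim):
        context_vec[j] += lr * grad_c[j]
    return value


# Hypothetical vectors: two n-grams of w, w itself, context word c, one negative sample.
qs = [[0.1, -0.2], [0.05, 0.1]]
w_vec, c_vec, negs = [0.0, 0.1], [0.2, 0.1], [[-0.1, 0.3]]
before = train_step(qs, w_vec, c_vec, negs)
after = train_step(qs, w_vec, c_vec, negs)
```

With a small learning rate the returned value rises from one call to the next, meaning the context word is scored higher and the negative sample lower. The plain `sigmoid` above can overflow for large negative inputs, so a numerically stable variant would be needed for real training.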
24. An electronic device, comprising:
At least one processor; and
A memory communicatively connected to the at least one processor; wherein
The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
Perform word segmentation on a corpus to obtain words;
Determine the n-gram characters corresponding to each of the words, where an n-gram character represents n consecutive characters of its corresponding word;
Establish and initialize a word vector for each of the words, and a character vector for each n-gram character corresponding to each of the words;
Train the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
Wherein the words are Japanese words, and the characters are kana and/or kanji.
CN201710583670.4A 2017-07-18 2017-07-18 Word vector processing method and device and electronic equipment Active CN107577658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710583670.4A CN107577658B (en) 2017-07-18 2017-07-18 Word vector processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN107577658A (en) 2018-01-12
CN107577658B (en) 2021-01-29

Family

ID=61049856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710583670.4A Active CN107577658B (en) 2017-07-18 2017-07-18 Word vector processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN107577658B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786782A (en) * 2016-03-25 2016-07-20 北京搜狗科技发展有限公司 Word vector training method and device
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MACHINELEARNING: "FastText分析与实践", 《简书HTTPS://WWW.JIANSHU.COM/P/9EA0D69DD55E》 *
PIOTR BOJANOWSKI ET AL.: "Enriching Word Vectors with Subword Information", 《ARXIV:1607.04606V1》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119507A (en) * 2018-02-05 2019-08-13 阿里巴巴集团控股有限公司 Term vector generation method, device and equipment
CN110929508A (en) * 2018-09-20 2020-03-27 阿里巴巴集团控股有限公司 Method, device and system for generating word vector
CN110929508B (en) * 2018-09-20 2023-05-02 阿里巴巴集团控股有限公司 Word vector generation method, device and system

Also Published As

Publication number Publication date
CN107577658B (en) 2021-01-29

Similar Documents

Publication Publication Date Title
TWI685761B (en) Word vector processing method and device
CN108170667A (en) Term vector processing method, device and equipment
WO2019192261A1 (en) Payment mode recommendation method and device and equipment
CN108874765A (en) Term vector processing method and processing device
CN107957989A (en) Term vector processing method, device and equipment based on cluster
US11030411B2 (en) Methods, apparatuses, and devices for generating word vectors
CN110119507A (en) Term vector generation method, device and equipment
CN107423269A (en) Term vector processing method and processing device
CN107402945A (en) Word stock generating method and device, short text detection method and device
CN109325508A (en) The representation of knowledge, machine learning model training, prediction technique, device and electronic equipment
CN107562716A (en) Term vector processing method, device and electronic equipment
CN107247704A (en) Term vector processing method, device and electronic equipment
CN107577658A (en) Term vector processing method, device and electronic equipment
CN110502614A (en) Text hold-up interception method, device, system and equipment
CN107562715A (en) Term vector processing method, device and electronic equipment
CN107329964A (en) A kind of text handling method and device
CN110119381A (en) A kind of index updating method, device, equipment and medium
CN108170663A (en) Term vector processing method, device and equipment based on cluster
CN107590739A (en) A kind of method and device of information displaying
CN107844472A (en) Term vector processing method, device and electronic equipment
CN107577659A (en) Term vector processing method, device and electronic equipment
CN115861995B (en) Visual question-answering method and device, electronic equipment and storage medium
CN108681490A (en) For the vector processing method, device and equipment of RPC information
CN107679547A (en) A kind of data processing method for being directed to two disaggregated models, device and electronic equipment
CN110321433A (en) Determine the method and device of text categories

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1248352

Country of ref document: HK

TA01 Transfer of patent application right

Effective date of registration: 20191204

Address after: P.O. Box 31119, grand exhibition hall, hibiscus street, 802 West Bay Road, Grand Cayman, ky1-1205, Cayman Islands

Applicant after: Innovative advanced technology Co., Ltd

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Co., Ltd.

GR01 Patent grant