CN107562716A - Word vector processing method, apparatus and electronic device - Google Patents

Word vector processing method, apparatus and electronic device

Info

Publication number
CN107562716A
Authority
CN
China
Prior art keywords
word
metacharacters
vector
character
cliction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710583773.0A
Other languages
Chinese (zh)
Inventor
曹绍升
周俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201710583773.0A
Publication of CN107562716A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiments of this specification disclose a word vector processing method, apparatus, and electronic device. The method includes: extracting multiple n-gram letters from a word, mapping each n-gram letter to an n-gram character, and training the word vector of the word based on the character vectors of the n-gram characters. An n-gram letter consists of n consecutive letters of its corresponding word, and the word is an Arabic word, a Malay word, or an Indonesian word.

Description

Word vector processing method, apparatus and electronic device
Technical field
This specification relates to the field of computer software technology, and in particular to a word vector processing method, apparatus, and electronic device.
Background
Most of today's natural language processing solutions use neural-network-based architectures, and an important basic technology under such architectures is the word vector. A word vector maps a word to a vector of fixed dimension, and the vector represents the semantic information of the word.
In the prior art, the algorithms commonly used to generate word vectors are designed specifically for English, for example Google's word vector algorithm and Microsoft's deep neural network algorithm.
Based on the prior art, a word vector generation scheme for Arabic, Malay, and Indonesian is needed.
Summary of the invention
The embodiments of this specification provide a word vector processing method, apparatus, and electronic device to solve the following technical problem: a word vector generation scheme for Arabic, Malay, and Indonesian is needed.
To solve the above technical problem, the embodiments of this specification are implemented as follows:
A word vector processing method provided by an embodiment of this specification includes:
performing word segmentation on a corpus to obtain words;
determining the n-gram characters corresponding to each word, where an n-gram character is the character string obtained by mapping n consecutive letters of its corresponding word;
establishing and initializing a word vector for each word, and a character vector for each n-gram character corresponding to each word;
training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
where each word is an Arabic word, a Malay word, or an Indonesian word.
A word vector processing apparatus provided by an embodiment of this specification includes:
a word segmentation module, which performs word segmentation on a corpus to obtain words;
a determining module, which determines the n-gram characters corresponding to each word, where an n-gram character is the character string obtained by mapping n consecutive letters of its corresponding word;
an initialization module, which establishes and initializes a word vector for each word, and a character vector for each n-gram character corresponding to each word;
a training module, which trains the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
where each word is an Arabic word, a Malay word, or an Indonesian word.
Another word vector processing method provided by an embodiment of this specification includes:
Step 1: perform word segmentation on a corpus and build a vocabulary from the words obtained, where the vocabulary excludes words that occur in the corpus fewer than a set number of times; go to step 2.
Step 2: according to the vocabulary, build an n-gram character mapping table that contains the mapping between each word and its n-gram characters, where an n-gram character is the character string obtained by mapping n consecutive letters of the word mapped to it; go to step 3.
Step 3: according to the n-gram character mapping table, establish and initialize a word vector for each word and a character vector for each n-gram character mapped to each word; go to step 4.
Step 4: traverse the segmented corpus, taking each traversed word in turn as the current word w and performing step 5 on it; end when the traversal is complete, otherwise continue traversing.
Step 5: with the current word w as the center, slide at most k words to each side to establish a window, and traverse all words in the window other than w, taking each traversed word in turn as the current context word c of w and performing step 6 on it; when the traversal is complete, return to step 4, otherwise continue traversing.
Step 6: calculate the similarity between the current word w and the current context word c according to the following formula:
where S(w) denotes the set of at least some of the n-gram characters that the current word w maps to in the n-gram character mapping table, q denotes each n-gram character in S(w), and sim(w, c) denotes the similarity between the current word w and the current context word c; $\vec{q}$ denotes the character vector of q, $\vec{w}$ the word vector of w, and $\vec{c}$ the word vector of c; ⊙ denotes a specified operation on two vectors, namely a dot product, an included-angle cosine, or a Euclidean distance operation; β1 and β2 are weight parameters; go to step 7.
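The symbol definitions above are consistent with a weighted sum of two terms. As a sketch only, and assuming that β1 weights the n-gram character term and β2 weights the word vector term (this assignment is an assumption, not stated in the text), the similarity could be written as:

```latex
\mathrm{sim}(w, c) = \beta_1 \sum_{q \in S(w)} \vec{q} \odot \vec{c} \;+\; \beta_2 \, \vec{w} \odot \vec{c}
```

With β2 = 0 this reduces to a similarity that depends only on the n-gram character vectors of w and the word vector of c.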
Step 7: randomly select λ words as negative sample words, and calculate the corresponding loss characterization value l(w, c) according to the following loss function:
where c' is a randomly selected negative sample word, $E_{c' \in p(V)}[x]$ denotes the expected value of expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$;
calculate the gradient of the loss function from the calculated loss characterization value l(w, c), and update the character vector $\vec{q}$ of q and the word vector $\vec{c}$ of the current context word c according to the gradient.
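The terms defined in step 7 (λ, the expectation over negative sample words c', and the sigmoid σ) match the negative-sampling objective commonly used for word vector training. The following form is a reconstruction under that assumption; the sign convention (an objective to be maximized) is also an assumption:

```latex
l(w, c) = \log \sigma\big(\mathrm{sim}(w, c)\big)
        + \lambda \, E_{c' \in p(V)}\Big[\log \sigma\big(-\mathrm{sim}(w, c')\big)\Big],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

Under this form, a high similarity to the context word c and a low similarity to the negative sample words c' both increase l(w, c).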
An electronic device provided by an embodiment of this specification includes:
at least one processor; and
a memory communicatively connected to the at least one processor; where
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
perform word segmentation on a corpus to obtain words;
determine the n-gram characters corresponding to each word, where an n-gram character is the character string obtained by mapping n consecutive letters of its corresponding word;
establish and initialize a word vector for each word, and a character vector for each n-gram character corresponding to each word;
train the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
where each word is an Arabic word, a Malay word, or an Indonesian word.
At least one of the above technical solutions adopted by the embodiments of this specification can achieve the following beneficial effect: the n-gram characters corresponding to a word express the features of the word at a finer granularity, which helps improve the accuracy of the word vectors generated for words in languages such as Arabic, Malay, or Indonesian. The practical effect is good, so the above technical problem can be partly or fully solved.
Brief description of the drawings
To describe the technical solutions in the embodiments of this specification or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some of the embodiments recorded in this specification, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an overall architecture involved in the solution of this specification in a practical application scenario;
Fig. 2 is a schematic flowchart of a word vector processing method provided by an embodiment of this specification;
Fig. 3 is a schematic flowchart of a specific implementation of the word vector processing method in a practical application scenario, provided by an embodiment of this specification;
Fig. 4 is a schematic diagram of the processing actions applied to part of the corpus used in the flow of Fig. 3, provided by an embodiment of this specification;
Fig. 5 is a schematic flowchart of another word vector processing method provided by an embodiment of this specification;
Fig. 6 is a schematic structural diagram of a word vector processing apparatus corresponding to Fig. 2, provided by an embodiment of this specification.
Detailed description
The embodiments of this specification provide a word vector processing method, apparatus, and electronic device.
To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this specification without creative effort shall fall within the scope of protection of this application.
Fig. 1 is a schematic diagram of an overall architecture involved in the solution of this specification in a practical application scenario. The architecture mainly involves five parts: the words in the corpus, the n-gram letters corresponding to the words, the n-gram characters corresponding to the words, the word vectors of the words together with the character vectors of the n-gram characters, and a vector training server. N-gram letters express the features of their corresponding word at a finer granularity; n-gram characters are obtained by further mapping the n-gram letters; and the vector training server trains the word vectors of the words and the character vectors of the n-gram characters to obtain more accurate word vectors. In practical applications, the actions relating to the first four parts can be performed by corresponding software and/or hardware functional modules.
The solution of this specification applies to word vectors of words in Arabic, Malay, or Indonesian, and also to word vectors of words in other languages whose letters are not encoded in ASCII.
For ease of description, the following embodiments mainly describe the solution of this specification for Arabic, Malay, or Indonesian scenarios.
Fig. 2 is a schematic flowchart of a word vector processing method provided by an embodiment of this specification. From a program perspective, the flow may be executed by a program with word vector generation and/or training functions. From a device perspective, the flow may be executed by, but is not limited to, at least one of the following devices that can run the program: a personal computer, a medium or large computer, a computer cluster, a mobile phone, a tablet computer, a smart wearable device, an in-vehicle device, and so on.
The flow in Fig. 2 may include the following steps:
S202: Perform word segmentation on a corpus to obtain words, where each word may be an Arabic word, a Malay word, or an Indonesian word.
In the embodiments of this specification, the words may specifically be at least some of the words that occur at least once in the corpus. For ease of subsequent processing, the words can be stored in a vocabulary and read from the vocabulary when needed.
S204: Determine the n-gram characters corresponding to each word, where an n-gram character is the character string obtained by mapping n consecutive letters of its corresponding word.
In the embodiments of this specification, n consecutive letters of a word may be referred to as an n-gram letter corresponding to the word.
For ease of understanding, Arabic, Malay, and Indonesian are described in turn.
Arabic has 28 consonant letters and is currently the official language of 18 countries: Saudi Arabia, Yemen, the United Arab Emirates, Oman, Kuwait, Bahrain, Qatar, Iraq, Syria, Jordan, Lebanon, Palestine, Egypt, Sudan, Libya, Tunisia, Mauritania, Algeria and Morocco.
Arabic letters are rather special and are generally stored in Unicode encoding. To reduce memory consumption and increase processing speed in subsequent processing, Arabic letters can first be mapped to characters that are not Unicode-encoded, such as Latin letters or digits, and then processed further. Latin letters and digits belong to ASCII.
The embodiments of this specification provide, as an example, a mapping table from Arabic letters to Latin letters or digits, shown in Table 1 below.
Table 1
In Table 1, the original letter is an Arabic letter, and the mapped character is the Latin letter or digit that the Arabic letter maps to.
Based on the mapping rules in Table 1, for an example Arabic word, the 1st to 3rd letters (3-gram letters) map to the character string "mca" (a 3-gram character); the 2nd to 4th letters (3-gram letters) map to the character string "cag" (a 3-gram character); and the 1st to 4th letters (4-gram letters) map to the character string "mcag" (a 4-gram character).
Malay and Indonesian use the same letters, some of which are also stored in Unicode encoding. For example, the letter "kh" is a single smallest letter unit and is stored in Unicode encoding. A problem similar to that of Arabic therefore exists. Following the same idea, the embodiments of this specification also provide, as an example, a mapping table from Malay and Indonesian letters to Latin letters or digits, shown in Table 2 below.
Table 2
Original letter  Mapped character | Original letter  Mapped character | Original letter  Mapped character | Original letter  Mapped character
a                a                | b                b                | k                k                | p                p
e                e                | c                c                | kh               2                | q                q
i                i                | d                d                | l                l                | r                r
o                o                | f                f                | m                m                | s                s
u                u                | g                g                | n                n                | sy               5
ai               0                | h                h                | ny               3                | t                t
oi               1                | j                j                | ng               4                | v                v
w                w                | x                x                | y                y                | z                z
In Table 2, the original letter is a Malay or Indonesian letter, and the mapped character is the Latin letter or digit that the letter maps to.
Based on the mapping rules in Table 2, take the Malay word "mengerikan" as an example. Its 1st to 3rd letters (3-gram letters) are "m e ng", which map to the character string "me4" (a 3-gram character); its 2nd to 4th letters (3-gram letters) are "e ng e", which map to the character string "e4e" (a 3-gram character), and so on; its 1st to 4th letters (4-gram letters) are "m e ng e", which map to the character string "me4e" (a 4-gram character), and so on.
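The "mengerikan" example can be reproduced with a short sketch. The mapping values come from Table 2; the function names and the greedy rule that always prefers a two-letter unit (for example "ng" over "n" followed by "g") are assumptions made for illustration, since the text does not say how ambiguous letter sequences are split.

```python
# Multi-letter units from Table 2; every other letter maps to itself.
MULTI_LETTER_UNITS = {"ai": "0", "oi": "1", "kh": "2", "ny": "3", "ng": "4", "sy": "5"}


def split_letters(word):
    """Split a word into letter units, preferring two-letter units (assumed greedy rule)."""
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in MULTI_LETTER_UNITS:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def ngram_characters(word, n_values=(3,)):
    """Return the n-gram characters of a word for every n in n_values."""
    units = split_letters(word.lower())
    mapped = [MULTI_LETTER_UNITS.get(unit, unit) for unit in units]
    grams = []
    for n in n_values:
        for start in range(len(mapped) - n + 1):
            grams.append("".join(mapped[start:start + n]))
    return grams


print(split_letters("mengerikan"))
print(ngram_characters("mengerikan", n_values=(3, 4)))
```

Passing several values in `n_values` corresponds to taking multiple values of n for the same word.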
In the embodiments of this specification, the value of n can be adjusted dynamically. For the same word, when determining its n-gram characters, n may take a single value (for example, only the 3-gram characters of the word are determined) or multiple values (for example, both the 3-gram and the 4-gram characters of the word are determined).
In step S204, at least some of the n-gram characters corresponding to a word may be determined for use in subsequent processing.
S206: Establish and initialize a word vector for each word and a character vector for each n-gram character corresponding to each word.
In the embodiments of this specification, a character vector is a vector that represents an n-gram character. Each n-gram character can be represented by its own character vector, just as each word can be represented by its own word vector.
To ensure that the solution works well, some restrictions may apply when initializing the word vectors and character vectors. For example, the word vectors and character vectors cannot all be initialized to the same vector; as another example, the elements of a word vector or character vector cannot all be 0; and so on.
In the embodiments of this specification, the word vector of each word and the character vector of each n-gram character corresponding to each word can be initialized randomly or according to a specified probability distribution, where identical n-gram characters share the same character vector. For example, the specified probability distribution may be a 0-1 distribution.
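A minimal sketch of this initialization follows. The function name `init_vectors`, the example words, and the uniform range are assumptions chosen for illustration; the text only requires that vectors are not all identical, not all zero, and that identical n-gram characters share one character vector.

```python
import random


def init_vectors(word_to_ngrams, dim, seed=0):
    """Create one vector per word and one shared vector per distinct n-gram character."""
    rng = random.Random(seed)

    def random_vector():
        # Small random values; the range is an illustrative choice.
        return [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]

    word_vectors = {word: random_vector() for word in word_to_ngrams}
    char_vectors = {}
    for grams in word_to_ngrams.values():
        for gram in grams:
            # Identical n-gram characters share a single character vector.
            if gram not in char_vectors:
                char_vectors[gram] = random_vector()
    return word_vectors, char_vectors


# Hypothetical word-to-3-gram mapping for two example words.
word_vectors, char_vectors = init_vectors(
    {"makan": ["mak", "aka", "kan"], "ikan": ["ika", "kan"]}, dim=4
)
print(len(word_vectors), len(char_vectors))
```

The 3-gram "kan" occurs in both words but receives only one character vector, so training on either word updates the same parameters.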
In addition, if the word vectors and character vectors corresponding to some words have already been trained on other corpora, then when they are further trained on the corpus in Fig. 2, they need not be re-established and re-initialized. Instead, training can continue from the previous training results using the corpus in Fig. 2.
S208: Train the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus.
In the embodiments of this specification, the training can be implemented by a neural network, which may be a shallow neural network, a deep neural network, or the like. This specification does not limit the specific structure of the neural network used.
With the method of Fig. 2, the n-gram characters corresponding to a word express the features of the word at a finer granularity, in particular the character composition of the word, which helps improve the accuracy of the word vectors generated for words in languages such as Arabic, Malay, or Indonesian. The practical effect is good.
Based on the method of Fig. 2, the embodiments of this specification also provide some specific implementations and extensions of the method, described below.
In the embodiments of this specification, for step S204, determining the n-gram characters corresponding to each word, where an n-gram character is the character string obtained by mapping n consecutive letters of its corresponding word, may specifically include:
obtaining an established letter-character mapping, which is a mapping between the letters of the language to which the words belong and specified characters; determining the n-gram letters corresponding to each word, where an n-gram letter consists of n consecutive letters of its corresponding word; and mapping each n-gram letter according to the letter-character mapping to obtain the n-gram characters corresponding to each word, where an n-gram character is the character string obtained by mapping n consecutive letters of its corresponding word.
The letter-character mapping can be established and stored in advance, for example as the mapping tables in the previous examples. In those examples, letters and characters are mapped one to one; in practical applications, other mapping schemes can also be used, for example mapping multiple letters as a whole to a single character.
Further, the purpose of the mapping is to reduce memory consumption and increase processing speed in subsequent processing. For example, if at least some letters of the language of the words are stored in a relatively complex encoding such as Unicode, the mapping is preferably performed, and in this case the characters may use a relatively simple encoding such as ASCII.
In the embodiments of this specification, for step S204, determining the n-gram characters corresponding to each word may specifically include: determining, according to the result of word segmentation of the corpus, the words that occur in the corpus;
and performing the following for each distinct word so determined:
determining the n-gram characters corresponding to the word, where an n-gram character corresponding to the word is the character string obtained by mapping n consecutive letters of the word, and n is one positive integer or multiple different positive integers.
In the embodiments of this specification, identical words have identical n-gram characters. The step in the preceding paragraph is therefore performed once for each distinct word; for repeated words, the existing result can be reused directly without repeating the step, which saves resources.
Further, if a word occurs too few times in the corpus, there are correspondingly few training samples and training iterations for it when training on that corpus, which harms the reliability of the training result. Such words can therefore be filtered out and left untrained for now; they can be trained later on other corpora.
Based on this idea, determining the words that occur in the corpus according to the result of word segmentation may specifically include: determining, according to the result of word segmentation of the corpus, the words that occur in the corpus no fewer than a set number of times. The set number can be determined according to actual conditions.
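The frequency filter can be sketched as follows. The function name `build_vocabulary`, the whitespace-based segmentation, and the example sentence are assumptions for illustration; a real system would use a proper word segmentation tool.

```python
from collections import Counter


def build_vocabulary(segmented_corpus, min_count):
    """Keep the distinct words that occur at least min_count times in the segmented corpus."""
    counts = Counter(segmented_corpus)
    return {word: count for word, count in counts.items() if count >= min_count}


# Whitespace splitting stands in for a real word segmentation tool.
corpus = "saya makan ikan dan dia makan nasi".split()
vocabulary = build_vocabulary(corpus, min_count=2)
print(vocabulary)
```

Because the vocabulary holds each distinct word once, n-gram characters only need to be computed once per entry, which matches the reuse of results for repeated words described above.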
In the embodiments of this specification, for step S208, there can be multiple specific training methods, such as training based on context words or training based on specified near-synonyms or synonyms. For ease of understanding, the former is described in detail as an example.
Training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus may specifically include: determining a specified word in the segmented corpus and one or more context words of the specified word in the segmented corpus; determining the similarity between the specified word and a context word according to the character vectors of the n-gram characters corresponding to the specified word and the word vector of the context word; and updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the similarity between the specified word and the context word.
This specification does not limit the specific way of determining the similarity. For example, the similarity can be calculated based on the included-angle cosine of the vectors, based on a sum-of-squares operation on the vectors, and so on.
There may be multiple specified words, and a specified word may recur at different positions in the corpus; the processing in the preceding paragraph can be performed for each specified word. Preferably, each word in the segmented corpus can be taken in turn as a specified word.
In the embodiments of this specification, the training in step S208 can make the similarity between the specified word and its context words relatively higher (here, similarity reflects the degree of association: a word is relatively strongly associated with its context words, and words with the same or similar meanings tend to have the same or similar context words), and the similarity between the specified word and non-context words relatively lower. Non-context words can serve as the negative sample words described below, and context words, by contrast, as positive sample words.
It follows that some negative sample words need to be determined as a contrast during training. One or more words can be selected at random from the segmented corpus as negative sample words, or non-context words can be strictly selected as negative sample words. Taking the former as an example, updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the similarity between the specified word and the context word may specifically include: selecting one or more words from the words as negative sample words; determining the similarity between the specified word and each negative sample word; determining the loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word; and updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the loss characterization value.
The loss characterization value measures the degree of error between the current vector values and the training objective. The loss function can take the above similarities as parameters; this specification does not limit the specific expression of the loss function, which is described in more detail later.
In the embodiments of this specification, updating the word vectors and character vectors is in effect a correction of this degree of error. When the solution of this specification is implemented with a neural network, the correction can be based on backpropagation and gradient descent. In this case, the gradient is the gradient corresponding to the loss function.
Then, updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the loss characterization value may specifically include: determining the gradient corresponding to the loss function according to the loss characterization value; and updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the gradient.
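The gradient-based update can be illustrated with one stochastic gradient step. This is a sketch under stated assumptions, not the patent's exact procedure: the similarity is taken to be the dot product of the summed n-gram character vectors with a word vector, the loss is the negative log-likelihood form of negative sampling, the vectors of the negative sample words are also updated (as is common practice), and the function name `sgd_step`, the learning rate, and the example vectors are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgd_step(char_vecs, ngrams, word_vecs, context, negatives, lr=0.05):
    """One gradient step for a (specified word, context word) pair.

    Minimizes -log sigmoid(s(c)) - sum over negatives of log sigmoid(-s(c')),
    where s(x) is the dot product of the summed n-gram vectors with word_vecs[x].
    """
    dim = len(word_vecs[context])
    hidden = [sum(char_vecs[g][i] for g in ngrams) for i in range(dim)]
    grad_hidden = [0.0] * dim
    loss = 0.0
    for target, label in [(context, 1.0)] + [(neg, 0.0) for neg in negatives]:
        vec = word_vecs[target]
        score = sigmoid(dot(hidden, vec))
        loss -= math.log(score if label == 1.0 else 1.0 - score)
        g = score - label  # derivative of the loss with respect to the similarity
        for i in range(dim):
            grad_hidden[i] += g * vec[i]   # uses the value before it is updated
            vec[i] -= lr * g * hidden[i]   # update the context or negative word vector
    for gram in ngrams:
        for i in range(dim):
            char_vecs[gram][i] -= lr * grad_hidden[i]  # update each n-gram character vector
    return loss


# Hypothetical two-dimensional vectors for a tiny example.
char_vecs = {"ika": [0.1, -0.2], "kan": [0.05, 0.1]}
word_vecs = {"makan": [0.2, 0.1], "dan": [-0.1, 0.3]}
losses = [sgd_step(char_vecs, ["ika", "kan"], word_vecs, "makan", ["dan"]) for _ in range(50)]
print(losses[0], losses[-1])
```

Repeating the step on the same pair lowers the loss, which reflects the similarity to the context word rising and the similarity to the negative sample word falling.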
In the embodiments of this specification, the training of the word vectors and character vectors can be carried out iteratively over at least some of the words in the segmented corpus, so that the word vectors and character vectors gradually converge until training is complete.
Take training on all words in the segmented corpus as an example. For step S208, training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus may specifically include:
traversing the segmented corpus, and performing the following for each word in the segmented corpus:
determining one or more context words of the word in the segmented corpus;
and performing the following for each context word:
determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word and the word vector of the context word;
updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the word according to the similarity between the word and the context word.
How the update is performed has been described above and is not repeated here.
In the embodiments of this specification, the similarity between the word and the context word can be determined not only from the character vectors of the n-gram characters corresponding to the word and the word vector of the context word, but also in combination with the word vector of the word itself. Based on this idea, determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word and the word vector of the context word may specifically include: determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word, the word vector of the word, and the word vector of the context word.
Further, for ease of computer processing, the traversal above can be implemented with a window.
For example, determining one or more context words of the word in the segmented corpus may specifically include: establishing a window in the segmented corpus by sliding a distance of a specified number of words to the left and/or right, centered on the word; and determining the words in the window other than the word as the context words of the word.
Alternatively, a window of a set length can be established starting from the first word of the segmented corpus, containing the first word and a set number of consecutive words after it; after the words in the window have been processed, the window slides backward to process the next batch of words in the corpus, until the corpus has been traversed.
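The centered window described above can be sketched in a few lines. The function name `context_words` and the example sentence are assumptions for illustration; the window is simply truncated at the corpus boundaries.

```python
def context_words(tokens, index, k):
    """Return the words within k positions of tokens[index], excluding the word itself."""
    start = max(0, index - k)
    end = min(len(tokens), index + k + 1)
    return [tokens[i] for i in range(start, end) if i != index]


# Whitespace splitting stands in for a real word segmentation tool.
tokens = "saya makan ikan dan nasi".split()
for i, word in enumerate(tokens):
    print(word, context_words(tokens, i, k=2))
```

Iterating `index` over the whole token list corresponds to taking each word in the segmented corpus in turn as the specified word.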
A kind of term vector processing method provided above this specification embodiment is illustrated.In order to make it easy to understand, Based on described above, this specification embodiment is additionally provided under practical application scene, a kind of tool of the term vector processing method The schematic flow sheet of body embodiment, as shown in Figure 3.
The flow in Fig. 3 mainly includes the following steps:
Step 1: segment the Chinese corpus using a word segmentation tool, scan the segmented Chinese corpus, count all words that have appeared to build a vocabulary, and delete the words that appear fewer than b times (that is, the set number of times mentioned above); go to step 2;
Step 2: scan the vocabulary word by word, extract the n-gram characters corresponding to each word, and build an n-gram character table and a mapping table between words and their corresponding n-gram characters; go to step 3;
Step 3: establish a term vector of dimension d for each word in the vocabulary, establish a character vector, also of dimension d, for each n-gram character in the n-gram character table, and randomly initialize all the established vectors; go to step 4;
Step 4: starting from the first word of the segmented Chinese corpus, slide word by word, each time selecting one word as the "current word w" (that is, the specified word mentioned above); if w has traversed all words of the whole corpus, end; otherwise, go to step 5;
Step 5: with the current word w as the center, slide k words to each side to establish a window; from the first word to the last word in the window (excluding the current word w), select one word at a time as the "context word c"; if c has traversed all words in the window, go to step 4; otherwise, go to step 6;
Step 6: for the current word w, look up at least some of the n-gram characters corresponding to w in the mapping table between words and their corresponding n-gram characters from step 2, and calculate the similarity between the current word w and the context word c according to formula (1):
Where S denotes the n-gram character table established in step 2, S(w) denotes the set of at least some of the n-gram characters corresponding to the current word w in the mapping table from step 2, q denotes an n-gram character in S(w), and sim(w, c) denotes the similarity between the current word w and the current context word c; $\vec{q}$ denotes the character vector of q, $\vec{w}$ denotes the term vector of w, and $\vec{c}$ denotes the term vector of c; ⊙ denotes a specific operation on two vectors, the specific operation being a dot product, an included-angle cosine, or a Euclidean distance operation; $\beta_1$ and $\beta_2$ are weight parameters that typically take values between 0 and 1, for example, $\beta_1$ and $\beta_2$ are both non-negative and their sum equals 1; go to step 7;
Step 7: randomly select λ words as negative sample words, and calculate the loss score l(w, c) according to formula (2) (that is, the loss function mentioned above); the loss score may serve as the loss characterization value mentioned above:
Where log is the logarithm function, c' is a randomly selected negative sample word, $E_{c' \in p(V)}[x]$ denotes the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function; for details, see formula (3):
Where, if x is a real number, σ(x) is also a real number; calculate the gradient according to the value of l(w, c), and update the character vector $\vec{q}$ of q and the term vector $\vec{c}$ of the context word; go to step 5.
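Formulas (1) to (3) themselves are not reproduced in this text. The block below gives one form that is consistent with the symbol definitions in steps 6 and 7; it is a reconstruction under that assumption, not necessarily the exact expressions of the original. In particular, the placement of $\beta_1$ and $\beta_2$ and the use of the logistic function for σ are assumptions.

```latex
\begin{align}
\mathrm{sim}(w, c) &= \beta_1 \sum_{q \in S(w)} \vec{q} \odot \vec{c}
                      \;+\; \beta_2\, \vec{w} \odot \vec{c} \tag{1} \\
l(w, c) &= \log \sigma\bigl(\mathrm{sim}(w, c)\bigr)
           + \lambda\, E_{c' \in p(V)}\bigl[\log \sigma\bigl(-\mathrm{sim}(w, c')\bigr)\bigr] \tag{2} \\
\sigma(x) &= \frac{1}{1 + e^{-x}} \tag{3}
\end{align}
```

Under this reading, l(w, c) grows as w becomes more similar to c and less similar to the negative sample words c', so the gradient step would move $\vec{q}$ and $\vec{c}$ in the direction that increases it (or, equivalently, decreases its negation).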
Among steps 1 to 7 above, steps 6 and 7 are the more critical ones. For ease of understanding, they are explained with reference to Fig. 4.
Fig. 4 is a schematic diagram, provided by the embodiments of this specification, of the processing actions applied to part of the corpus in the flow of Fig. 3.
As shown in Fig. 4, assume that the corpus contains a sentence, and that segmenting the sentence yields three words.
Assume that one of these words is now selected as the current word w and another as the context word c. The set S(w) of at least some of the n-gram characters mapped from the current word w is extracted; for example, based on Table 1 above, the 2-gram characters mapped from the current word include "mc", "ca", and "ag", and the 3-gram characters include "mca" and "cag". The loss score l(w, c) is calculated according to formula (1), formula (2), and formula (3), and the gradient is then calculated to update the term vector of c and the character vectors corresponding to w; these calculations are represented by the grey boxes in Fig. 4.
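The n-gram characters listed in this example can be reproduced with a short sketch. It assumes that the current word has already been mapped to the character string "mcag"; that string is inferred here from the listed 2-gram and 3-gram characters and is not stated in the text. The function name `ngram_characters` is likewise illustrative.

```python
def ngram_characters(mapped_word, sizes=(2, 3)):
    """Slice an already-mapped character string into its n-gram characters for each n."""
    return {
        n: [mapped_word[i:i + n] for i in range(len(mapped_word) - n + 1)]
        for n in sizes
    }


# "mcag" is an assumed mapped string, chosen to match the n-grams in the example.
print(ngram_characters("mcag"))
# {2: ['mc', 'ca', 'ag'], 3: ['mca', 'cag']}
```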
Based on the same idea as Fig. 2 and on the implementation in Fig. 3, the embodiments of this specification provide another term vector processing method.
Fig. 5 is a schematic flowchart of another term vector processing method provided by the embodiments of this specification.
The flow in Fig. 5 may include the following steps:
Step 1: segment the corpus and build a vocabulary composed of the words obtained by segmentation, where the words do not include words that appear in the corpus fewer than a set number of times; go to step 2;
Step 2: according to the vocabulary, build an n-gram character mapping table, the mapping table containing the mapping relations between each word and n-gram characters, an n-gram character characterizing the character string obtained by mapping n consecutive letters of the word it is mapped from; go to step 3;
Step 3: according to the n-gram character mapping table, establish and initialize the term vector of each word and the character vector of each n-gram character mapped from each word; go to step 4;
Step 4: traverse the segmented corpus, taking each traversed word in turn as the current word w and performing step 5 on the current word w; end if the traversal is complete, otherwise continue traversing;
Step 5: with the current word w as the center, slide up to k words to each side to establish a window, and traverse all words in the window other than the current word w, taking each traversed word in turn as the current context word c of the current word w and performing step 6 on the current context word c; if the traversal is complete, continue with step 4, otherwise continue traversing;
Step 6: calculate the similarity between the current word w and the current context word c according to the following formula:
Where S(w) denotes the set of at least some of the n-gram characters mapped from the current word w in the n-gram character mapping table, q denotes an n-gram character in S(w), and sim(w, c) denotes the similarity between the current word w and the current context word c; $\vec{q}$ denotes the character vector of q, $\vec{w}$ denotes the term vector of w, and $\vec{c}$ denotes the term vector of c; ⊙ denotes a specific operation on two vectors, the specific operation being a dot product, an included-angle cosine, or a Euclidean distance operation; $\beta_1$ and $\beta_2$ are weight parameters; go to step 7;
Step 7: randomly select λ words as negative sample words, and calculate the corresponding loss characterization value l(w, c) according to the following loss function:
Where c' is a randomly selected negative sample word, $E_{c' \in p(V)}[x]$ denotes the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, defined as:
The gradient corresponding to the loss function is calculated according to the calculated loss characterization value l(w, c), and the character vector $\vec{q}$ of q and the term vector $\vec{c}$ of the current context word c are updated according to the gradient.
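Steps 6 and 7 can be sketched in plain Python as follows. This is a hedged illustration under several assumptions that are not confirmed by the text: ⊙ is taken to be the dot product; the similarity is taken as $\beta_1 \sum_{q \in S(w)} \vec{q} \cdot \vec{c} + \beta_2\, \vec{w} \cdot \vec{c}$; σ is taken to be the logistic function; the expectation over negative sample words is replaced by a sum over the sampled words; and the sketch minimises the negated log-likelihood. All function and variable names (`similarity`, `loss`, `train_pair`, `term_vec`, `char_vec`, `ngrams`, `lr`) are hypothetical. Following the text, only the character vectors of q and the term vector of c are updated.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def similarity(w, c, term_vec, char_vec, ngrams, beta1=0.5, beta2=0.5):
    # Assumed form: beta1 * sum over q in S(w) of (q . c) + beta2 * (w . c).
    char_part = sum(dot(char_vec[q], term_vec[c]) for q in ngrams[w])
    return beta1 * char_part + beta2 * dot(term_vec[w], term_vec[c])


def loss(w, c, negatives, term_vec, char_vec, ngrams):
    # Negated log-likelihood, so that a smaller value means a better fit.
    value = -math.log(sigmoid(similarity(w, c, term_vec, char_vec, ngrams)))
    for neg in negatives:
        value -= math.log(sigmoid(-similarity(w, neg, term_vec, char_vec, ngrams)))
    return value


def train_pair(w, c, negatives, term_vec, char_vec, ngrams,
               lr=0.05, beta1=0.5, beta2=0.5):
    """One gradient-descent step on the character vectors of w and the term vector of c."""
    dim = len(term_vec[c])
    s_pos = similarity(w, c, term_vec, char_vec, ngrams, beta1, beta2)
    g_pos = sigmoid(s_pos) - 1.0  # derivative of the loss with respect to sim(w, c)
    sum_q = [sum(char_vec[q][i] for q in ngrams[w]) for i in range(dim)]
    grad_c = [g_pos * (beta1 * sum_q[i] + beta2 * term_vec[w][i]) for i in range(dim)]
    grad_q = [g_pos * beta1 * term_vec[c][i] for i in range(dim)]
    for neg in negatives:
        # derivative of the loss with respect to sim(w, neg)
        g_neg = sigmoid(similarity(w, neg, term_vec, char_vec, ngrams, beta1, beta2))
        for i in range(dim):
            grad_q[i] += g_neg * beta1 * term_vec[neg][i]
    for q in ngrams[w]:
        char_vec[q] = [x - lr * g for x, g in zip(char_vec[q], grad_q)]
    term_vec[c] = [x - lr * g for x, g in zip(term_vec[c], grad_c)]


# Toy data: the words "w", "c", "n" and the vectors are illustrative only.
term_vec = {"w": [0.1, -0.2], "c": [0.3, 0.1], "n": [-0.1, 0.2]}
char_vec = {"mc": [0.05, 0.1], "ca": [-0.1, 0.0]}
ngrams = {"w": ["mc", "ca"]}
before = loss("w", "c", ["n"], term_vec, char_vec, ngrams)
train_pair("w", "c", ["n"], term_vec, char_vec, ngrams)
after = loss("w", "c", ["n"], term_vec, char_vec, ngrams)
print(after < before)
```

Because the similarity is linear in each character vector, every q in S(w) receives the same gradient, which is why a single `grad_q` is computed and applied to all of them.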
The steps of this other term vector processing method may be performed by the same module or by different modules, which this specification does not specifically limit.
The foregoing is the term vector processing method provided by the embodiments of this specification. Based on the same idea, the embodiments of this specification further provide a corresponding device, as shown in Fig. 6.
Fig. 6 is a schematic structural diagram of a term vector processing device corresponding to Fig. 2, provided by the embodiments of this specification. The device may be located in the entity that executes the flow in Fig. 2, and includes:
a word segmentation module 601, which segments a corpus to obtain words;
a determining module 602, which determines the n-gram characters corresponding to each word, an n-gram character characterizing the character string obtained by mapping n consecutive letters of its corresponding word;
an initialization module 603, which establishes and initializes the term vector of each word and the character vector of each n-gram character corresponding to each word;
a training module 604, which trains the term vectors and the character vectors according to the term vectors, the character vectors, and the segmented corpus;
where the word is an Arabic word, a Malay word, or an Indonesian word.
Optionally, the determining module 602 determining the n-gram characters corresponding to each word, an n-gram character characterizing the character string obtained by mapping n consecutive letters of its corresponding word, specifically includes:
the determining module 602 obtains an established letter-character mapping relation, the letter-character mapping relation being a mapping relation between the letters of the language to which the word belongs and specified characters; and
determines the n-gram letters corresponding to each word, an n-gram letter characterizing n consecutive letters of its corresponding word;
maps each n-gram letter according to the letter-character mapping relation to obtain the n-gram characters corresponding to each word, an n-gram character characterizing the character string obtained by mapping n consecutive letters of its corresponding word.
Optionally, at least some letters of the language corresponding to the word are stored in Unicode encoding, and the characters are ASCII characters.
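The letter-to-character mapping can be illustrated with a small sketch. The table below is hypothetical and covers only four Arabic letters; it is not the mapping table of the specification, and the function names are assumptions. The sketch first takes the n-gram letters of a word and then maps each n-gram letter to an ASCII n-gram character, in the order described above.

```python
# Hypothetical letter-to-character table for illustration only.
LETTER_TO_CHAR = {
    "\u0643": "k",  # ARABIC LETTER KAF
    "\u062a": "t",  # ARABIC LETTER TEH
    "\u0627": "a",  # ARABIC LETTER ALEF
    "\u0628": "b",  # ARABIC LETTER BEH
}


def ngram_letters(word, n):
    """Return every run of n consecutive letters of the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def map_ngram(ngram, table):
    """Map one n-gram letter to its n-gram character, letter by letter."""
    return "".join(table[letter] for letter in ngram)


def ngram_characters(word, n, table):
    return [map_ngram(gram, table) for gram in ngram_letters(word, n)]


word = "\u0643\u062a\u0627\u0628"  # a four-letter Arabic word stored as Unicode
print(ngram_characters(word, 2, LETTER_TO_CHAR))  # ['kt', 'ta', 'ab']
```

Keying the later character vectors by the ASCII strings rather than by the original Unicode letters is what makes the n-gram characters easy to store and compare.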
Optionally, the determining module 602 determining the n-gram characters corresponding to each word specifically includes:
the determining module 602 determines, according to the result of segmenting the corpus, the words that have appeared in the corpus;
for each of the determined words, performs:
determining the n-gram characters corresponding to the word, an n-gram character corresponding to the word characterizing the character string obtained by mapping n consecutive letters of the word, where n is one positive integer or multiple different positive integers.
Optionally, the determining module 602 determining, according to the result of segmenting the corpus, the words that have appeared in the corpus specifically includes:
the determining module 602 determines, according to the result of segmenting the corpus, the words that have appeared in the corpus and whose number of occurrences is not less than a set number.
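Keeping only the words whose number of occurrences is not less than a set number can be sketched with the standard library. The function name `build_vocabulary` and the parameter `min_count` are illustrative, not taken from the specification.

```python
from collections import Counter


def build_vocabulary(tokens, min_count):
    """Return the distinct words that occur at least min_count times, in sorted order."""
    counts = Counter(tokens)
    return sorted(word for word, n in counts.items() if n >= min_count)


tokens = ["a", "b", "a", "c", "b", "a"]
print(build_vocabulary(tokens, 2))  # ['a', 'b']
```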
Optionally, the initialization module 603 initializing the term vector of each word and the character vector of each n-gram character corresponding to each word specifically includes:
the initialization module 603 initializes the term vector of each word and the character vector of each n-gram character corresponding to each word by random initialization or by initialization according to a specified probability distribution, where identical n-gram characters have identical character vectors.
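A minimal initialization sketch follows. The uniform range of ±0.5/d is an assumption made here for illustration; the text only requires random initialization or a specified probability distribution. Storing the character vectors in a dictionary keyed by n-gram character is one simple way to ensure that identical n-gram characters share a single vector. The mapped words "mcag" and "cab" are illustrative.

```python
import random


def init_vectors(keys, dim, rng):
    """Create one randomly initialized vector of the given dimension per key."""
    bound = 0.5 / dim  # assumed range; any specified distribution could be used instead
    return {key: [rng.uniform(-bound, bound) for _ in range(dim)] for key in keys}


rng = random.Random(0)
word_ngrams = {"mcag": ["mc", "ca", "ag"], "cab": ["ca", "ab"]}
term_vec = init_vectors(word_ngrams, 4, rng)
# Deduplicate so that "ca", shared by both words, gets exactly one character vector.
all_ngrams = sorted({q for grams in word_ngrams.values() for q in grams})
char_vec = init_vectors(all_ngrams, 4, rng)
print(len(term_vec), len(char_vec))  # 2 4
```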
Optionally, the training module 604 training the term vectors and the character vectors according to the term vectors, the character vectors, and the segmented corpus specifically includes:
the training module 604 determines a specified word in the segmented corpus, and one or more context words of the specified word in the segmented corpus;
determines the similarity between the specified word and the context word according to the character vectors of the n-gram characters corresponding to the specified word and the term vector of the context word;
updates the term vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the similarity between the specified word and the context word.
Optionally, the training module 604 updating the term vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the similarity between the specified word and the context word specifically includes:
the training module 604 selects one or more words from the words as negative sample words;
determines the similarity between the specified word and each negative sample word;
determines the loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word;
updates the term vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the loss characterization value.
Optionally, the training module 604 updating the term vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the loss characterization value specifically includes:
the training module 604 determines the gradient corresponding to the loss function according to the loss characterization value;
updates the term vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the gradient.
Optionally, the training module 604 selecting one or more words from the words as negative sample words specifically includes:
the training module 604 randomly selects one or more words from the words as negative sample words.
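Random selection of negative sample words can be sketched as below. Two choices in the sketch are assumptions rather than requirements of the text: the words are drawn uniformly with replacement, and the current word and its context word are excluded from the candidates. The function name `sample_negatives` is hypothetical.

```python
import random


def sample_negatives(vocabulary, count, exclude, rng):
    """Draw `count` negative sample words uniformly at random, skipping excluded words."""
    candidates = [word for word in vocabulary if word not in exclude]
    return [rng.choice(candidates) for _ in range(count)]


rng = random.Random(1)
negatives = sample_negatives(["a", "b", "c", "d"], 3, {"a", "b"}, rng)
print(len(negatives), set(negatives) <= {"c", "d"})  # 3 True
```

A frequency-based distribution p(V) could be substituted by passing weights to `random.Random.choices`, if the application calls for it.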
Optionally, the training module 604 training the term vectors and the character vectors according to the term vectors, the character vectors, and the segmented corpus specifically includes:
the training module 604 traverses the segmented corpus and performs the following for each word in the segmented corpus:
determining one or more context words of the word in the segmented corpus;
performing the following for each of the context words:
determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word and the term vector of the context word;
updating the term vector of the context word and the character vectors of the n-gram characters corresponding to the word according to the similarity between the word and the context word.
Optionally, the training module 604 determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word and the term vector of the context word specifically includes:
the training module 604 determines the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word, the term vector of the word, and the term vector of the context word.
Optionally, the training module 604 determining one or more context words of the word in the segmented corpus specifically includes:
the training module 604, in the segmented corpus, establishes a window by taking the word as the center and sliding a distance of a specified number of words to the left and/or to the right;
determines the words in the window other than the word as the context words of the word.
Optionally, the word is an Arabic word, a Malay word, or an Indonesian word.
Based on the same idea, the embodiments of this specification further provide a corresponding electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; where
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
segment a corpus to obtain words;
determine the n-gram characters corresponding to each word, an n-gram character characterizing the character string obtained by mapping n consecutive letters of its corresponding word;
establish and initialize the term vector of each word and the character vector of each n-gram character corresponding to each word;
train the term vectors and the character vectors according to the term vectors, the character vectors, and the segmented corpus;
where the word is an Arabic word, a Malay word, or an Indonesian word.
Based on the same idea, the embodiments of this specification further provide a corresponding non-volatile computer storage medium storing computer-executable instructions, the computer-executable instructions being configured to:
segment a corpus to obtain words;
determine the n-gram characters corresponding to each word, an n-gram character characterizing the character string obtained by mapping n consecutive letters of its corresponding word;
establish and initialize the term vector of each word and the character vector of each n-gram character corresponding to each word;
train the term vectors and the character vectors according to the term vectors, the character vectors, and the segmented corpus;
where the word is an Arabic word, a Malay word, or an Indonesian word.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device, electronic device, and non-volatile computer storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding description of the method embodiments.
The device, electronic device, and non-volatile computer storage medium provided by the embodiments of this specification correspond to the method, and therefore have beneficial technical effects similar to those of the corresponding method. Since the beneficial technical effects of the method have been described in detail above, those of the corresponding device, electronic device, and non-volatile computer storage medium are not repeated here.
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or a software improvement (an improvement to a method flow). With the development of technology, however, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be implemented with a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user programming the device. Designers program a digital system onto a single PLD themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of making integrated circuit chips by hand, this programming is nowadays mostly done with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by lightly programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller may be implemented in any suitable manner. For example, a controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for implementing various functions may also be regarded as structures within the hardware component. Alternatively, the means for implementing various functions may even be regarded both as software modules implementing a method and as structures within a hardware component.
The systems, devices, modules, or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementing device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above device is described by dividing it into various units according to function. Of course, when implementing this specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that the embodiments of this specification may be provided as a method, a system, or a computer program product. Therefore, the embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This specification is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of this specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art should understand that the embodiments of this specification may be provided as a method, a system, or a computer program product. Therefore, this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. This specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described relatively briefly because it is substantially similar to the method embodiment; for relevant details, refer to the corresponding description of the method embodiment.
The foregoing is merely the embodiments of this specification and is not intended to limit this application. For those skilled in the art, this application may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of this application shall fall within the scope of the claims of this application.

Claims (28)

1. A term vector processing method, comprising:
segmenting a corpus to obtain words;
determining the n-gram characters corresponding to each word, an n-gram character characterizing the character string obtained by mapping n consecutive letters of its corresponding word;
establishing and initializing the term vector of each word and the character vector of each n-gram character corresponding to each word;
training the term vectors and the character vectors according to the term vectors, the character vectors, and the segmented corpus;
wherein the word is an Arabic word, a Malay word, or an Indonesian word.
2. The method of claim 1, wherein determining the n-gram characters corresponding to each word, an n-gram character representing a character string obtained by mapping n consecutive letters of its corresponding word, specifically comprises:
obtaining an established letter-character mapping relationship, the letter-character mapping relationship being a mapping between the letters of the language to which the words belong and specified characters; and
determining n-gram letters corresponding to each word, wherein an n-gram letter represents n consecutive letters of its corresponding word;
mapping each n-gram letter according to the letter-character mapping relationship to obtain the n-gram characters corresponding to each word, wherein an n-gram character represents a character string obtained by mapping n consecutive letters of its corresponding word.
3. The method of claim 2, wherein at least some letters of the language to which the words belong are stored in Unicode encoding, and the characters are ASCII characters.
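The mapping described in claims 2 and 3 can be illustrated with a minimal Python sketch. The table `LETTER_TO_CHAR` and the function names are hypothetical; the claims do not fix any particular letter-to-character assignment, so the four Arabic letters and their ASCII stand-ins below are assumptions chosen only for illustration.

```python
# Hypothetical letter-to-character table (assumption): four Arabic letters
# stored as Unicode code points, each mapped to one ASCII character.
LETTER_TO_CHAR = {
    "\u0643": "k",  # kaf
    "\u062a": "t",  # ta
    "\u0627": "a",  # alif
    "\u0628": "b",  # ba
}


def ngram_letters(word, n):
    """Return every run of n consecutive letters of `word`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_characters(word, n, table=LETTER_TO_CHAR):
    """Map each n-gram of letters to an n-gram of ASCII characters."""
    return ["".join(table[letter] for letter in gram)
            for gram in ngram_letters(word, n)]


word = "\u0643\u062a\u0627\u0628"  # the four letters kaf, ta, alif, ba
print(ngram_characters(word, 3))
```

A letter that is missing from the table raises `KeyError` in this sketch; a fuller version would need a policy for unmapped letters.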
4. The method of claim 1, wherein determining the n-gram characters corresponding to each word specifically comprises:
determining, according to the result of segmenting the corpus, the words that appear in the corpus;
for each distinct word so determined, performing:
determining the n-gram characters corresponding to the word, wherein an n-gram character corresponding to the word represents a character string obtained by mapping n consecutive letters of the word, and n is one positive integer or a plurality of different positive integers.
5. The method of claim 4, wherein determining, according to the result of segmenting the corpus, the words that appear in the corpus specifically comprises:
determining, according to the result of segmenting the corpus, the words that appear in the corpus at least a set number of times.
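Claims 4 and 5 can be sketched as follows: keep only the distinct words that occur at least a set number of times, then list the n-grams of each word for several values of n. The names `build_vocab` and `ngram_table` are hypothetical, whitespace splitting stands in for real word segmentation, and the letter-to-character mapping is taken to be the identity (an assumption made for brevity, using Latin-script sample words).

```python
from collections import Counter


def build_vocab(tokens, min_count):
    """Distinct words that occur at least `min_count` times."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count >= min_count)


def ngram_table(vocab, sizes):
    """For each word, list its n-grams for every n in `sizes`."""
    return {
        word: [word[i:i + n] for n in sizes for i in range(len(word) - n + 1)]
        for word in vocab
    }


# Whitespace splitting is a stand-in for the segmentation step (assumption).
tokens = "saya makan nasi saya minum air saya makan".split()
vocab = build_vocab(tokens, min_count=2)
table = ngram_table(vocab, sizes=(3, 4))
print(vocab)
print(table)
```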
6. The method of claim 1, wherein initializing the word vector of each word and the character vector of each n-gram character corresponding to the words specifically comprises:
initializing the word vector of each word and the character vector of each n-gram character corresponding to the words by random initialization or by initialization according to a specified probability distribution, wherein identical n-gram characters have identical character vectors.
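One way to satisfy the condition in claim 6 that identical n-gram characters share a character vector is to key the vectors by the n-gram string itself. The sketch below is an assumption-laden illustration: `init_vectors`, the uniform range, and the dimension are all hypothetical choices, since the claim only requires random initialization or a specified probability distribution.

```python
import random


def init_vectors(keys, dim, seed=0):
    """Create one randomly initialized vector per distinct key."""
    rng = random.Random(seed)
    bound = 0.5 / dim  # assumed range; the claim does not specify one
    return {
        key: [rng.uniform(-bound, bound) for _ in range(dim)]
        for key in sorted(set(keys))
    }


# Two words that share the n-grams "aka" and "kan".
word_ngrams = {"makan": ["mak", "aka", "kan"], "akan": ["aka", "kan"]}
all_grams = [gram for grams in word_ngrams.values() for gram in grams]

char_vectors = init_vectors(all_grams, dim=4)            # one vector per distinct n-gram
word_vectors = init_vectors(word_ngrams, dim=4, seed=1)  # one vector per word
print(sorted(char_vectors))
```

Because the dictionary holds a single entry per distinct n-gram, both words read and update the same vector for "aka" and for "kan".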
7. The method of claim 1, wherein training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus specifically comprises:
determining a specified word in the segmented corpus, and one or more context words of the specified word in the segmented corpus;
determining a similarity between the specified word and the context word according to the character vectors of the n-gram characters corresponding to the specified word and the word vector of the context word;
updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the similarity between the specified word and the context word.
8. The method of claim 7, wherein updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the similarity between the specified word and the context word specifically comprises:
selecting one or more words from the words as negative sample words;
determining a similarity between the specified word and each negative sample word;
determining a loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word;
updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the loss characterization value.
9. The method of claim 8, wherein updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the loss characterization value specifically comprises:
determining a gradient corresponding to the loss function according to the loss characterization value;
updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the gradient.
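The update described in claims 7 to 9 can be sketched in plain Python under stated assumptions. The similarity is assumed to be the sum of dot products between each n-gram character vector and the other word's vector; the loss is assumed to be the log-sigmoid expression with sampled negative words; and the learning rate `lr` and all function names are hypothetical. That expression grows as the fit improves, so the sketch steps in the direction that increases it.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def similarity(grams, other, char_vectors, word_vectors):
    # Assumed form: sum over n-grams of (n-gram vector . other word's vector).
    return sum(dot(char_vectors[g], word_vectors[other]) for g in grams)


def loss_value(grams, context, negatives, char_vectors, word_vectors):
    # log sigma(sim(w, c)) plus log sigma(-sim(w, c')) for each sampled negative.
    value = math.log(sigmoid(similarity(grams, context, char_vectors, word_vectors)))
    for neg in negatives:
        value += math.log(sigmoid(-similarity(grams, neg, char_vectors, word_vectors)))
    return value


def update(grams, context, negatives, char_vectors, word_vectors, lr=0.1):
    """One gradient step on the context/negative word vectors and n-gram vectors."""
    dim = len(word_vectors[context])
    gram_sum = [sum(char_vectors[g][k] for g in grams) for k in range(dim)]
    gram_grad = [0.0] * dim
    for other, label in [(context, 1.0)] + [(neg, 0.0) for neg in negatives]:
        vec = word_vectors[other]
        # Derivative of the loss term with respect to the similarity.
        g = label - sigmoid(dot(gram_sum, vec))
        for k in range(dim):
            gram_grad[k] += g * vec[k]
        word_vectors[other] = [vec[k] + lr * g * gram_sum[k] for k in range(dim)]
    for gram in grams:
        char_vectors[gram] = [char_vectors[gram][k] + lr * gram_grad[k]
                              for k in range(dim)]


# Toy vectors (assumed values) for one specified word with two n-grams.
char_vectors = {"mak": [0.1, 0.2], "aka": [0.0, -0.1]}
word_vectors = {"nasi": [0.3, 0.1], "air": [-0.2, 0.4]}
grams = ["mak", "aka"]

before = loss_value(grams, "nasi", ["air"], char_vectors, word_vectors)
update(grams, "nasi", ["air"], char_vectors, word_vectors)
after = loss_value(grams, "nasi", ["air"], char_vectors, word_vectors)
print(before, after)
```

In this sketch the negative words' vectors are updated together with the context word's vector, which is a common design choice but is not stated in the claims.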
10. The method of claim 8, wherein selecting one or more words from the words as negative sample words specifically comprises:
randomly selecting one or more words from the words as negative sample words.
11. The method of claim 1, wherein training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus specifically comprises:
traversing the segmented corpus, and performing the following for each word in the segmented corpus:
determining one or more context words of the word in the segmented corpus;
for each of the context words, performing:
determining a similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word and the word vector of the context word;
updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the word according to the similarity between the word and the context word.
12. The method of claim 11, wherein determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word and the word vector of the context word specifically comprises:
determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word, the word vector of the word, and the word vector of the context word.
13. The method of claim 11, wherein determining one or more context words of the word in the segmented corpus specifically comprises:
establishing a window in the segmented corpus by taking the word as the center and sliding a distance of a specified number of words to the left and/or to the right;
determining the words in the window other than the word as the context words of the word.
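The window of claim 13 can be sketched as a slice around the current position. The function name `context_words` and the parameter `k` are hypothetical, and the sketch assumes a symmetric window that is clipped at the corpus boundaries. It excludes the center word by position, so another occurrence of the same word inside the window is kept; that is an interpretive choice rather than something the claim states.

```python
def context_words(tokens, index, k):
    """Words within k positions of tokens[index], excluding that position."""
    lo = max(0, index - k)
    hi = min(len(tokens), index + k + 1)
    return [tokens[j] for j in range(lo, hi) if j != index]


tokens = ["a", "b", "c", "d", "e"]
print(context_words(tokens, 2, 1))
print(context_words(tokens, 0, 2))
```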
14. A word vector processing device, comprising:
a word segmentation module, configured to segment a corpus to obtain words;
a determining module, configured to determine n-gram characters corresponding to each word, wherein an n-gram character represents a character string obtained by mapping n consecutive letters of its corresponding word;
an initialization module, configured to establish and initialize a word vector for each word and a character vector for each n-gram character corresponding to the words;
a training module, configured to train the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
wherein the words are Arabic words, Malay words, or Indonesian words.
15. The device of claim 14, wherein the determining module determining the n-gram characters corresponding to each word, an n-gram character representing a character string obtained by mapping n consecutive letters of its corresponding word, specifically comprises:
the determining module obtains an established letter-character mapping relationship, the letter-character mapping relationship being a mapping between the letters of the language to which the words belong and specified characters; and
determines n-gram letters corresponding to each word, wherein an n-gram letter represents n consecutive letters of its corresponding word;
maps each n-gram letter according to the letter-character mapping relationship to obtain the n-gram characters corresponding to each word, wherein an n-gram character represents a character string obtained by mapping n consecutive letters of its corresponding word.
16. The device of claim 15, wherein at least some letters of the language to which the words belong are stored in Unicode encoding, and the characters are ASCII characters.
17. The device of claim 14, wherein the determining module determining the n-gram characters corresponding to each word specifically comprises:
the determining module determines, according to the result of segmenting the corpus, the words that appear in the corpus;
for each distinct word so determined, performs:
determining the n-gram characters corresponding to the word, wherein an n-gram character corresponding to the word represents a character string obtained by mapping n consecutive letters of the word, and n is one positive integer or a plurality of different positive integers.
18. The device of claim 17, wherein the determining module determining, according to the result of segmenting the corpus, the words that appear in the corpus specifically comprises:
the determining module determines, according to the result of segmenting the corpus, the words that appear in the corpus at least a set number of times.
19. The device of claim 14, wherein the initialization module initializing the word vector of each word and the character vector of each n-gram character corresponding to the words specifically comprises:
the initialization module initializes the word vector of each word and the character vector of each n-gram character corresponding to the words by random initialization or by initialization according to a specified probability distribution, wherein identical n-gram characters have identical character vectors.
20. The device of claim 14, wherein the training module training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus specifically comprises:
the training module determines a specified word in the segmented corpus, and one or more context words of the specified word in the segmented corpus;
determines a similarity between the specified word and the context word according to the character vectors of the n-gram characters corresponding to the specified word and the word vector of the context word;
updates the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the similarity between the specified word and the context word.
21. The device of claim 20, wherein the training module updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the similarity between the specified word and the context word specifically comprises:
the training module selects one or more words from the words as negative sample words;
determines a similarity between the specified word and each negative sample word;
determines a loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word;
updates the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the loss characterization value.
22. The device of claim 21, wherein the training module updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the loss characterization value specifically comprises:
the training module determines a gradient corresponding to the loss function according to the loss characterization value;
updates the word vector of the context word and the character vectors of the n-gram characters corresponding to the specified word according to the gradient.
23. The device of claim 21, wherein the training module selecting one or more words from the words as negative sample words specifically comprises:
the training module randomly selects one or more words from the words as negative sample words.
24. The device of claim 14, wherein the training module training the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus specifically comprises:
the training module traverses the segmented corpus, and performs the following for each word in the segmented corpus:
determining one or more context words of the word in the segmented corpus;
for each of the context words, performing:
determining a similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word and the word vector of the context word;
updating the word vector of the context word and the character vectors of the n-gram characters corresponding to the word according to the similarity between the word and the context word.
25. The device of claim 24, wherein the training module determining the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word and the word vector of the context word specifically comprises:
the training module determines the similarity between the word and the context word according to the character vectors of the n-gram characters corresponding to the word, the word vector of the word, and the word vector of the context word.
26. The device of claim 24, wherein the training module determining one or more context words of the word in the segmented corpus specifically comprises:
the training module establishes a window in the segmented corpus by taking the word as the center and sliding a distance of a specified number of words to the left and/or to the right;
determines the words in the window other than the word as the context words of the word.
27. A word vector processing method, comprising:
Step 1: segment a corpus and establish a vocabulary consisting of the words obtained by segmentation, wherein the words exclude any word that appears in the corpus fewer than a set number of times; go to step 2;
Step 2: establish an n-gram character mapping table according to the vocabulary, the mapping table containing the mapping relationships between the words and n-gram characters, wherein an n-gram character represents a character string obtained by mapping n consecutive letters of the word mapped to it; go to step 3;
Step 3: according to the n-gram character mapping table, establish and initialize the word vector of each word and the character vector of each n-gram character to which the words are mapped; go to step 4;
Step 4: traverse the segmented corpus, take each traversed word in turn as the current word w and perform step 5 on the current word w; end when the traversal is complete, otherwise continue traversing;
Step 5: with the current word w as the center, slide at most k words to each side to establish a window; traverse all words in the window other than the current word w, take each traversed word in turn as the current context word c of the current word w and perform step 6 on the current context word c; when this traversal is complete, continue with step 4, otherwise continue traversing;
Step 6: calculate the similarity between the current word w and the current context word c according to the following formula:
where S(w) denotes the set of at least some of the n-gram characters to which the current word w is mapped in the n-gram character mapping table; q denotes an n-gram character in S(w); sim(w, c) denotes the similarity between the current word w and the current context word c; $\vec{q}$ denotes the character vector of q; $\vec{w}$ denotes the word vector of w; $\vec{c}$ denotes the word vector of c; ⊙ denotes a specified operation on two vectors, the specified operation being a dot product, an included-angle cosine, or a Euclidean distance operation; and β1 and β2 are weight parameters; go to step 7;
Step 7: randomly select λ words as negative sample words, and calculate the corresponding loss characterization value l(w, c) according to the following loss function:
$$l(w,c) = \log\sigma\bigl(sim(w,c)\bigr) + \sum_{i=1}^{\lambda} E_{c' \in p(V)}\bigl[\log\sigma\bigl(-sim(w,c')\bigr)\bigr];$$
where c' is a randomly selected negative sample word; $E_{c' \in p(V)}[x]$ denotes the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V); and σ(·) is the neural network activation function, defined as $\sigma(x) = \frac{1}{1+e^{-x}}$;
calculate the gradient corresponding to the loss function according to the calculated loss characterization value l(w, c), and update the character vector $\vec{q}$ of q and the word vector $\vec{c}$ of the current context word c according to the gradient.
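As an illustration of the gradient computation at the end of step 7, the dependence of l(w, c) on the similarities can be written out. This is a sketch under stated assumptions, not the claimed formula: σ is taken to be the logistic sigmoid, the expectation is approximated by the λ sampled negative words, written here as c'_1, ..., c'_λ (hypothetical notation), and each c'_i is taken to differ from c. No particular form of sim(w, c) is assumed, so the result applies to whichever specified operation ⊙ is used.

```latex
\begin{align}
\frac{d}{dx}\log\sigma(x) &= 1-\sigma(x), &
\frac{d}{dx}\log\sigma(-x) &= -\sigma(x), \\
\frac{\partial l(w,c)}{\partial\, sim(w,c)} &= 1-\sigma\bigl(sim(w,c)\bigr), &
\frac{\partial l(w,c)}{\partial\, sim(w,c'_i)} &= -\sigma\bigl(sim(w,c'_i)\bigr), \\
\frac{\partial l(w,c)}{\partial \vec{q}} &=
  \frac{\partial l(w,c)}{\partial\, sim(w,c)}\,
  \frac{\partial\, sim(w,c)}{\partial \vec{q}}
  + \sum_{i=1}^{\lambda}
  \frac{\partial l(w,c)}{\partial\, sim(w,c'_i)}\,
  \frac{\partial\, sim(w,c'_i)}{\partial \vec{q}}. &&
\end{align}
```

The gradient with respect to the context word vector follows in the same way by the chain rule through sim(w, c).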
28. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
segment a corpus to obtain words;
determine n-gram characters corresponding to each word, wherein an n-gram character represents a character string obtained by mapping n consecutive letters of its corresponding word;
establish and initialize a word vector for each word and a character vector for each n-gram character corresponding to the words;
train the word vectors and the character vectors according to the word vectors, the character vectors, and the segmented corpus;
wherein the words are Arabic words, Malay words, or Indonesian words.
CN201710583773.0A 2017-07-18 2017-07-18 Term vector processing method, device and electronic equipment Pending CN107562716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710583773.0A CN107562716A (en) 2017-07-18 2017-07-18 Term vector processing method, device and electronic equipment

Publications (1)

Publication Number Publication Date
CN107562716A 2018-01-09

Family

ID=60973561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710583773.0A Pending CN107562716A (en) 2017-07-18 2017-07-18 Term vector processing method, device and electronic equipment

Country Status (1)

Country Link
CN (1) CN107562716A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120239397A1 (en) * 2001-10-15 2012-09-20 Silverbrook Research Pty Ltd Digital Ink Database Searching Using Handwriting Feature Synthesis
CN104090890A (en) * 2013-12-12 2014-10-08 深圳市腾讯计算机系统有限公司 Method, device and server for obtaining similarity of key words
CN105528411A (en) * 2015-12-03 2016-04-27 中国人民解放军海军工程大学 Full-text retrieval device and method for interactive electronic technical manual of shipping equipment
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MACHINELEARNING: "FastText Analysis and Practice", 《Jianshu HTTPS://WWW.JIANSHU.COM/P/9EA0D69DD55E》 *
PIOTR BOJANOWSKI ET AL.: "Enriching Word Vectors with Subword Information", 《ARXIV:1607.04606V1》 *
悟乙己: "Analysis of Three Representative Works by Word2Vec Author Thomas Mikolov", 《CSDN Blog HTTPS://BLOG.CSDN.NET/SINAT_26917383/ARTICLE/DETAILS/52577551》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815478A (en) * 2018-12-11 2019-05-28 北京大学 Medicine entity recognition method and system based on convolutional neural networks
US11836174B2 (en) 2020-04-24 2023-12-05 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus of establishing similarity model for retrieving geographic location
CN111539228A (en) * 2020-04-29 2020-08-14 支付宝(杭州)信息技术有限公司 Vector model training method and device, and similarity determining method and device
CN111539228B (en) * 2020-04-29 2023-08-08 支付宝(杭州)信息技术有限公司 Vector model training method and device and similarity determining method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 1248845
Country of ref document: HK
TA01 Transfer of patent application right
Effective date of registration: 20191209
Address after: P.O. Box 31119, grand exhibition hall, hibiscus street, 802 West Bay Road, Grand Cayman, KY1-1205, Cayman Islands
Applicant after: Innovative advanced technology Co., Ltd
Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands
Applicant before: Alibaba Group Holding Co., Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20180109