CN107247704A - Word vector processing method, apparatus, and electronic device

Publication number
CN107247704A
Authority
CN
China
Prior art keywords
word
cangjie
code character
vector
members
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710430490.2A
Other languages
Chinese (zh)
Other versions
CN107247704B (en
Inventor
曹绍升
周俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201710430490.2A
Publication of CN107247704A
Application granted
Publication of CN107247704B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Abstract

Embodiments of the present application disclose a word vector processing method, apparatus, and electronic device. The method includes: segmenting a corpus to obtain words; determining the Cangjie code n-grams corresponding to each word, where a Cangjie code n-gram represents n consecutive Cangjie code characters of its corresponding word; creating and initializing a word vector for each word and a Cangjie code character vector for each Cangjie code n-gram corresponding to each word; and training the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus. With the embodiments of the present application, the Cangjie code n-grams of a word can express the features of the word at a finer granularity, in particular its glyph shape features, which helps improve the accuracy of word vectors for Chinese words and gives good practical results.

Description

Word vector processing method, apparatus, and electronic device
Technical field
The present application relates to the field of computer software technology, and in particular to a word vector processing method, apparatus, and electronic device.
Background
Most of today's natural language processing solutions use architectures based on neural networks, and an important basic technology in such architectures is the word vector. A word vector maps a word to a vector of fixed dimension, and the vector represents the semantic information of the word.
In the prior art, the algorithms commonly used to generate word vectors were designed specifically for English, for example Google's word vector algorithm and Microsoft's deep neural network algorithm.
However, these prior-art algorithms either cannot be applied to Chinese or, when they can, produce word vectors for Chinese words that work poorly in practice.
Summary
Embodiments of the present application provide a word vector processing method, apparatus, and electronic device, to solve the prior-art problem that algorithms for generating word vectors either cannot be applied to Chinese or, when they can, produce word vectors for Chinese words that work poorly in practice.
To solve the above technical problem, the embodiments of the present application are implemented as follows:
A word vector processing method provided by an embodiment of the present application includes:
segmenting a corpus to obtain words;
determining the Cangjie code n-grams corresponding to each word, where a Cangjie code n-gram represents n consecutive Cangjie code characters of its corresponding word;
creating and initializing a word vector for each word, and a Cangjie code character vector for each Cangjie code n-gram corresponding to each word;
training the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus.
A word vector processing apparatus provided by an embodiment of the present application includes:
a word segmentation module, which segments a corpus to obtain words;
a determining module, which determines the Cangjie code n-grams corresponding to each word, where a Cangjie code n-gram represents n consecutive Cangjie code characters of its corresponding word;
an initialization module, which creates and initializes a word vector for each word, and a Cangjie code character vector for each Cangjie code n-gram corresponding to each word;
a training module, which trains the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus.
Another word vector processing method provided by an embodiment of the present application includes:
Step 1, segment the corpus and build a vocabulary of the words obtained by segmentation, where the vocabulary excludes words that occur in the corpus fewer than a set number of times; go to step 2;
Step 2, build a Cangjie code n-gram mapping table from the vocabulary, where the mapping table contains the mapping between each word and its Cangjie code n-grams, and a Cangjie code n-gram represents n consecutive Cangjie code characters of the word it is mapped from; go to step 3;
Step 3, according to the Cangjie code n-gram mapping table, create and initialize a word vector for each word and a Cangjie code character vector for each Cangjie code n-gram mapped from each word; go to step 4;
Step 4, traverse the segmented corpus, take each traversed word in turn as the current word w, and perform step 5 on the current word w; end when the traversal is complete, otherwise continue traversing;
Step 5, with the current word w as the center, slide at most k words to each side to establish a window, and traverse all words in the window other than the current word w; take each traversed word in turn as the current context word c of the current word w and perform step 6 on the current context word c; when the traversal is complete, continue with step 4, otherwise continue traversing;
Step 6, calculate the similarity between the current word w and the current context word c according to the following formula:
$$\mathrm{sim}(w, c) = \sum_{q \in S(w)} \vec{q} \cdot \vec{c}$$
where S(w) denotes the set of Cangjie code n-grams mapped from the current word w in the Cangjie code n-gram mapping table, q denotes a Cangjie code n-gram in S(w), sim(w, c) denotes the similarity between the current word w and the current context word c, and $\vec{q} \cdot \vec{c}$ denotes the dot product of the Cangjie code character vector of q and the word vector of the current context word c; go to step 7;
Step 7, randomly select λ words as negative sample words, and calculate the corresponding loss characterization value l(w, c) according to the following loss function:
$$l(w, c) = \log \sigma(\mathrm{sim}(w, c)) + \lambda \, E_{c' \in p(V)}\left[\log \sigma(-\mathrm{sim}(w, c'))\right]$$
where c' is a randomly selected negative sample word, $E_{c' \in p(V)}[x]$ is the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, defined as
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Calculate the gradient of the loss function from the calculated loss characterization value l(w, c), and according to the gradient, update the Cangjie code character vector $\vec{q}$ of each q and the word vector $\vec{c}$ of the current context word c.
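As a worked illustration of the update in step 7, the gradients can be written out under two assumptions that are not stated in the text: the loss has the usual negative-sampling form, and the expectation over p(V) is replaced by a sum over the λ sampled negative words c'. The derivation uses only the identity d/dx log σ(x) = 1 − σ(x).

```latex
% Assumed form of the loss for one (w, c) pair, with a sum over sampled negatives c'
l(w,c) = \log\sigma(s) + \sum_{c'} \log\sigma(-s'), \qquad
s  = \sum_{q \in S(w)} \vec{q}\cdot\vec{c}, \qquad
s' = \sum_{q \in S(w)} \vec{q}\cdot\vec{c}\,'

% Gradient with respect to the context word vector
\frac{\partial l}{\partial \vec{c}} = \bigl(1-\sigma(s)\bigr) \sum_{q \in S(w)} \vec{q}

% Gradient with respect to each negative sample word vector
\frac{\partial l}{\partial \vec{c}\,'} = -\,\sigma(s') \sum_{q \in S(w)} \vec{q}

% Gradient with respect to each Cangjie code character vector q in S(w)
\frac{\partial l}{\partial \vec{q}} = \bigl(1-\sigma(s)\bigr)\,\vec{c} \;-\; \sum_{c'} \sigma(s')\,\vec{c}\,'
```

Under this form l grows as the vectors fit the corpus better, so an update would move each vector along its gradient, which is equivalent to gradient descent on −l.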
An electronic device provided by an embodiment of the present application includes:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
segment a corpus to obtain words;
determine the Cangjie code n-grams corresponding to each word, where a Cangjie code n-gram represents n consecutive Cangjie code characters of its corresponding word;
create and initialize a word vector for each word, and a Cangjie code character vector for each Cangjie code n-gram corresponding to each word;
train the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus.
At least one of the above technical solutions adopted by the embodiments of the present application can achieve the following beneficial effects: the Cangjie code n-grams of a word can express the features of the word at a finer granularity, in particular its glyph shape features, which helps improve the accuracy of word vectors for Chinese words and gives good practical results; the problems in the prior art can therefore be partly or entirely solved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a word vector processing method provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of a specific implementation of the word vector processing method in a practical application scenario, provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of the processing actions performed on part of the corpus used in the flow of Fig. 2, provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of another word vector processing method provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a word vector processing apparatus corresponding to Fig. 1, provided by an embodiment of the present application.
Detailed description
Embodiments of the present application provide a word vector processing method, apparatus, and electronic device.
To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
The solution of the present application applies to word vectors of Chinese words, and also to word vectors of words in certain other languages similar to Chinese, for example languages with pronounced glyph shape features such as Korean and Japanese. For non-Chinese text, the words need to be encoded accordingly following the encoding rules of the Cangjie code.
For ease of description, the following embodiments explain the solution mainly in the scenario of Chinese words.
Fig. 1 is a schematic flowchart of a word vector processing method provided by an embodiment of the present application. From a program perspective, the flow may be executed by a program with a word vector generation function and/or a training function. From a device perspective, the flow may be executed by, but is not limited to, at least one of the following devices that can run the program: a personal computer, a medium or large computer, a computer cluster, a mobile phone, a tablet computer, a smart wearable device, an in-vehicle device, and so on.
The flow in Fig. 1 may include the following steps:
S101: Segment a corpus to obtain words.
In the embodiments of the present application, the words may specifically be at least some of the words that occur at least once in the corpus. For ease of subsequent processing, the words may be stored in a vocabulary and read from the vocabulary when needed.
S102: Determine the Cangjie code n-grams corresponding to each word, where a Cangjie code n-gram represents n consecutive Cangjie code characters of its corresponding word.
The Cangjie code is an encoding (and also an encoding method) of Chinese characters. Based on the principle of visual recognition, it reflects the fine features of Chinese characters: almost all characters with different shapes, including variant forms, have different Cangjie codes, and its rate of duplicate codes is among the lowest of the current Chinese character encoding schemes. In essence, the Cangjie code encodes the glyph shape information of Chinese characters. It should be emphasized that the Cangjie code splits a character according to its structural complexity, so the number of parts is not fixed; this dynamic decomposition principle is what gives the Cangjie code its low duplicate-code rate.
The Cangjie code typically uses 25 or 26 distinct characters. Following principles such as "top to bottom, left to right, outside to inside", any Chinese character is decomposed into glyph components, and the characters corresponding to those components are concatenated into a string that serves as the Cangjie code of the Chinese character. This string may be called a Cangjie code character sequence, and each character in it a Cangjie code character.
For example, the Cangjie code of the character "吃" ("eat") is "RON", which contains 3 Cangjie code characters, and the Cangjie code of the character "饭" ("meal") is "NVHE", which contains 4 Cangjie code characters.
Further, a Chinese word generally consists of several Chinese characters, so concatenating the Cangjie codes of these characters in order gives the Cangjie code character sequence of the word. A Cangjie code n-gram of a word is then n consecutive Cangjie code characters in the word's Cangjie code character sequence.
Continuing the example, the Cangjie code character sequence of the word "吃饭" ("have a meal") is "RONNVHE". Accordingly, its Cangjie code 3-grams include "RON" (1st to 3rd Cangjie code characters), "ONN" (2nd to 4th), "VHE" (5th to 7th), and so on; its 4-grams include "RONN" (1st to 4th), "ONNV" (2nd to 5th), "NNVH" (3rd to 6th), and so on; and its 5-grams are "RONNV" (1st to 5th), "ONNVH" (2nd to 6th), and "NNVHE" (3rd to 7th).
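The n-gram extraction in this example can be sketched in a few lines of Python. This is an illustration only: the function name `cangjie_ngrams` is hypothetical and does not come from the application, and the code sequence is the one given in the example.

```python
def cangjie_ngrams(code_sequence, n):
    """Return every run of n consecutive Cangjie code characters, left to right."""
    return [code_sequence[i:i + n] for i in range(len(code_sequence) - n + 1)]


# Cangjie codes of the two characters in the example, concatenated in order.
sequence = "RON" + "NVHE"  # "RONNVHE"

three_grams = cangjie_ngrams(sequence, 3)
four_grams = cangjie_ngrams(sequence, 4)
five_grams = cangjie_ngrams(sequence, 5)
```

When n equals the length of the sequence the function returns the sequence itself as the only n-gram, and when n is larger it returns an empty list.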
In the embodiments of the present application, the value of n is dynamically adjustable. When determining the Cangjie code n-grams of a given word, n may take a single value (for example, only the 3-grams of the word are determined) or several values (for example, both the 3-grams and the 4-grams of the word are determined). When n equals the total number of Cangjie code characters in the Cangjie code character sequence of a character or word, the n-gram is the Cangjie code character sequence itself.
In the embodiments of the present application, for ease of computer processing, Cangjie code n-grams may also be represented numerically. For example, each distinct Cangjie code character may be represented by a number, so that a Cangjie code n-gram is represented by a string of numbers, with a mapping between the numbers or number strings and the Cangjie code characters.
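One possible numeric representation is sketched below. The numbering scheme (A is 1, B is 2, and so on up to Z is 26) and the helper names are assumptions made for illustration; the application does not prescribe a particular mapping.

```python
import string

# Hypothetical numbering: 'A' -> 1, 'B' -> 2, ..., 'Z' -> 26.
CHAR_TO_ID = {ch: i for i, ch in enumerate(string.ascii_uppercase, start=1)}
ID_TO_CHAR = {i: ch for ch, i in CHAR_TO_ID.items()}


def encode_ngram(ngram):
    """Map an n-gram such as "RON" to a tuple of numbers."""
    return tuple(CHAR_TO_ID[ch] for ch in ngram)


def decode_ngram(ids):
    """Invert encode_ngram using the reverse mapping."""
    return "".join(ID_TO_CHAR[i] for i in ids)


encoded = encode_ngram("RON")
decoded = decode_ngram(encoded)
```

A tuple is used so that the encoded n-gram can serve directly as a dictionary key when looking up its vector.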
S103: Create and initialize a word vector for each word, and a Cangjie code character vector for each Cangjie code n-gram corresponding to each word.
In the embodiments of the present application, a Cangjie code character vector is a vector that represents a Cangjie code n-gram. Each Cangjie code n-gram can be represented by its own Cangjie code character vector, just as each word can be represented by its own word vector.
To ensure the effectiveness of the solution, some constraints may apply when initializing the word vectors and Cangjie code character vectors. For example, the word vectors and Cangjie code character vectors must not all be initialized to the same vector; as another example, the elements of a word vector or Cangjie code character vector must not all be 0; and so on.
The word vector of each word and the Cangjie code character vector of each Cangjie code n-gram may be initialized randomly or according to a specified probability distribution, where identical Cangjie code n-grams share the same Cangjie code character vector. The specified probability distribution may be, for example, the 0-1 distribution.
In addition, if word vectors and Cangjie code character vectors for some words have already been trained on other corpora, then when these words are trained further on the corpus in Fig. 1, the vectors need not be created and initialized again; training can instead continue from the previous training results using the corpus in Fig. 1.
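A minimal sketch of such an initialization follows. The function name, the uniform range scaled by the dimension, and the fixed seed are all assumptions chosen for illustration; the point is that each distinct n-gram gets exactly one vector, shared by every word that contains it, and that the vectors are neither identical nor all zero (almost surely, given random draws).

```python
import random


def init_vectors(word_to_ngrams, dim, seed=0):
    """Create one random vector per word and one per distinct n-gram."""
    rng = random.Random(seed)

    def new_vector():
        # Small uniform values (an assumed convention, not from the application).
        return [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]

    word_vectors = {word: new_vector() for word in word_to_ngrams}
    ngram_vectors = {}
    for ngrams in word_to_ngrams.values():
        for q in ngrams:
            if q not in ngram_vectors:  # identical n-grams share one vector
                ngram_vectors[q] = new_vector()
    return word_vectors, ngram_vectors


# Hypothetical mapping table: two words that share the n-gram "RON".
mapping = {"w1": ["RON", "ONN"], "w2": ["RON"]}
word_vectors, ngram_vectors = init_vectors(mapping, dim=4)
```

To continue training from earlier results, one could pass in previously trained dictionaries and only create vectors for keys that are missing.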
S104: Train the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus.
In the embodiments of the present application, the training may be implemented with a neural network, which may be a shallow neural network, a deep neural network, or the like. The present application does not limit the specific structure of the neural network used.
With the method of Fig. 1, the Cangjie code n-grams of a word can express the features of the word at a finer granularity, in particular its glyph shape features, which helps improve the accuracy of word vectors for Chinese words and gives good practical results; the problems in the prior art can therefore be partly or entirely solved.
Based on the method of Fig. 1, the embodiments of the present application also provide some specific implementations and extensions of the method, which are described below.
In the embodiments of the present application, for step S102, determining the Cangjie code n-grams corresponding to each word may specifically include: determining the words that occur in the corpus according to the result of segmenting the corpus;
for each distinct word so determined, performing:
determining the Cangjie code n-grams corresponding to the word, where a Cangjie code n-gram of the word represents n consecutive Cangjie code characters of the word, and n is one positive integer or several different positive integers.
In the embodiments of the present application, identical words have identical Cangjie code n-grams. The step in the preceding paragraph is therefore performed once for each distinct word; for repeated occurrences of a word, the existing result can be reused without recomputation, which saves resources.
Further, if a word occurs very few times in the corpus, it contributes few training samples and training iterations when training on that corpus, which harms the reliability of the training result. Such words can therefore be filtered out and left untrained for now; they can be trained later on other corpora.
Following this idea, determining the words that occur in the corpus according to the result of segmenting the corpus may specifically include: determining, according to the segmentation result, the words that occur in the corpus at least a set number of times. The set number can be chosen according to the actual situation.
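The frequency filter can be sketched as follows. The function name `build_vocabulary` and the parameter name `min_count` are hypothetical; `min_count` plays the role of the set number of times, and a word is kept when it occurs at least that often.

```python
from collections import Counter


def build_vocabulary(tokens, min_count):
    """Return the distinct words that occur at least min_count times, sorted."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count >= min_count)


# Hypothetical segmented corpus, already split into words.
tokens = ["a", "b", "a", "c", "a", "b"]
vocabulary = build_vocabulary(tokens, min_count=2)
```

Because the vocabulary holds each word once, the n-grams of a word need to be computed only once, however often the word recurs in the corpus.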
In the embodiments of the present application, for step S104, many specific training approaches are possible, such as training based on context words, or training based on specified near-synonyms or synonyms. For ease of understanding, the former is described in detail as an example.
Training the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus may specifically include: determining a specified word in the segmented corpus, and one or more context words of the specified word in the segmented corpus; determining the similarity between the specified word and a context word according to the Cangjie code character vectors of the Cangjie code n-grams corresponding to the specified word and the word vector of the context word; and updating the word vector of the context word and the Cangjie code character vectors of the Cangjie code n-grams corresponding to the specified word according to that similarity.
The present application does not limit the specific way in which the similarity is determined. For example, the similarity may be calculated from the cosine of the angle between vectors, from a sum-of-squares operation on the vectors, and so on.
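Two possible similarity measures are sketched below. The first sums the dot products between each n-gram vector of the specified word and the context word vector, which is the form used in the step-by-step variant of the method. The second is one way to apply the cosine option mentioned above; comparing the sum of the n-gram vectors with the context vector is an assumption of this sketch. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sim_dot(ngram_vecs, context_vec):
    """Sum of dot products between each n-gram vector and the context vector."""
    return sum(dot(q, context_vec) for q in ngram_vecs)


def sim_cosine(ngram_vecs, context_vec):
    """Cosine of the angle between the summed n-gram vectors and the context vector."""
    summed = [sum(column) for column in zip(*ngram_vecs)]
    norm = math.sqrt(dot(summed, summed)) * math.sqrt(dot(context_vec, context_vec))
    return dot(summed, context_vec) / norm if norm else 0.0


# Hypothetical 2-dimensional vectors for a word with two n-grams.
ngram_vecs = [[1.0, 0.0], [0.0, 1.0]]
context_vec = [2.0, 2.0]
score_dot = sim_dot(ngram_vecs, context_vec)
score_cos = sim_cosine(ngram_vecs, context_vec)
```

The dot-product form is unbounded and cheap to differentiate, while the cosine form is bounded between -1 and 1 and ignores vector length.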
There may be multiple specified words, and the same word may recur as a specified word at different positions in the corpus; the processing in the preceding paragraph can be performed for each specified word. Preferably, every word contained in the segmented corpus is taken as a specified word.
In the embodiments of the present application, the training in step S104 makes the similarity between a specified word and its context words relatively higher (here, similarity reflects the degree of association: a word is relatively strongly associated with its context words, and words with the same or similar meanings tend to have the same or similar context words), and the similarity between the specified word and non-context words relatively lower. Non-context words can serve as the negative sample words described below, and context words correspondingly as positive sample words.
Thus, some negative sample words need to be determined during training as a contrast. One or more words may be selected at random from the segmented corpus as negative sample words, or non-context words may be strictly selected as negative sample words. Taking the former as an example, updating the word vector of the context word and the Cangjie code character vectors of the Cangjie code n-grams corresponding to the specified word according to the similarity between the specified word and the context word may specifically include: selecting one or more words from the words as negative sample words; determining the similarity between the specified word and each negative sample word; determining the loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word; and updating the word vector of the context word and the Cangjie code character vectors of the Cangjie code n-grams corresponding to the specified word according to the loss characterization value.
The loss characterization value measures the degree of error between the current vector values and the training target. The loss function may take the above similarities as its parameters; the present application does not limit the specific expression of the loss function, which is explained in more detail later.
In the embodiments of the present application, updating the word vectors and Cangjie code character vectors in effect corrects this error. When the solution of the present application is implemented with a neural network, the correction can be based on backpropagation and gradient descent, in which case the gradient is the gradient of the loss function.
Accordingly, updating the word vector of the context word and the Cangjie code character vectors of the Cangjie code n-grams corresponding to the specified word according to the loss characterization value may specifically include: determining the gradient of the loss function according to the loss characterization value; and updating the word vector of the context word and the Cangjie code character vectors of the Cangjie code n-grams corresponding to the specified word according to the gradient.
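One update for a single pair of specified word and context word might look like the sketch below. It rests on several assumptions that the text leaves open: the similarity is the sum of dot products between the n-gram vectors and a word vector, the objective is the negative-sampling form log σ(sim(w, c)) plus the sum of log σ(−sim(w, c')) over the sampled negative words, and the vectors of the negative sample words are updated as well. The function name `update_pair` and the learning rate `lr` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def update_pair(ngram_vecs, context_vec, negative_vecs, lr):
    """Update all vectors in place; return the objective value before the update."""
    dim = len(context_vec)
    summed = [sum(column) for column in zip(*ngram_vecs)]  # sim is linear in the n-gram vectors
    grad_q = [0.0] * dim
    value = 0.0
    samples = [(context_vec, 1.0)] + [(neg, 0.0) for neg in negative_vecs]
    for vec, label in samples:
        p = sigmoid(dot(summed, vec))
        value += math.log(p if label == 1.0 else 1.0 - p)
        g = label - p  # derivative of this term with respect to its similarity
        for i in range(dim):
            grad_q[i] += g * vec[i]       # uses the value of vec[i] before its update
            vec[i] += lr * g * summed[i]  # move the word vector along its gradient
    for q in ngram_vecs:
        for i in range(dim):
            q[i] += lr * grad_q[i]        # every n-gram vector receives the same gradient
    return value


# Hypothetical 2-dimensional vectors: two n-grams, one context word, one negative sample.
ngram_vecs = [[0.1, 0.2], [0.3, -0.1]]
context_vec = [0.2, 0.1]
negative_vecs = [[-0.1, 0.4]]
before = update_pair(ngram_vecs, context_vec, negative_vecs, lr=0.1)
after = update_pair(ngram_vecs, context_vec, negative_vecs, lr=0.0)  # lr=0 only evaluates
```

The step moves along the gradient of the objective, which is the same as gradient descent on its negative. After one step the similarity to the context word rises and the similarity to the negative sample falls, so the objective value increases.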
In the embodiments of the present application, the training of the word vectors and Cangjie code character vectors may be carried out iteratively over at least some of the words in the segmented corpus, so that the word vectors and Cangjie code character vectors gradually converge until training is complete.
Take training on all words in the segmented corpus as an example. For step S104, training the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus may specifically include:
traversing the segmented corpus, and for each word in the segmented corpus performing:
determining one or more context words of the word in the segmented corpus;
for each context word, performing:
determining the similarity between the word and the context word according to the Cangjie code character vectors of the Cangjie code n-grams corresponding to the word and the word vector of the context word;
updating the word vector of the context word and the Cangjie code character vectors of the Cangjie code n-grams corresponding to the word according to the similarity between the word and the context word.
How the update is performed has been described above and is not repeated here.
Further, for ease of computer processing, the above traversal can be implemented with a window.
For example, determining one or more context words of the word in the segmented corpus may specifically include: in the segmented corpus, establishing a window centered on the word by sliding a specified number of words to the left and/or right; and taking the words in the window other than the word itself as the context words of the word.
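The centered window can be sketched as a slice around the position of the word. The function name `context_words` and the parameter `k` (the number of words taken on each side) are hypothetical; near the start or end of the corpus the window is simply shorter.

```python
def context_words(tokens, center, k):
    """Return up to k words on each side of tokens[center], excluding the center word."""
    left = max(0, center - k)
    right = min(len(tokens), center + k + 1)
    return tokens[left:center] + tokens[center + 1:right]


# Hypothetical segmented corpus.
tokens = ["a", "b", "c", "d", "e"]
middle = context_words(tokens, center=2, k=1)
start = context_words(tokens, center=0, k=2)
end = context_words(tokens, center=4, k=2)
```

Indexing by position rather than by word keeps repeated occurrences of the same word distinct, so each occurrence gets its own context.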
Alternatively, a window of a set length may be established starting at the first word of the segmented corpus, so that the window contains the first word and a set number of consecutive words after it; once the words in the window have been processed, the window slides backward to process the next batch of words in the corpus, until the whole corpus has been traversed.
The above describes a word vector processing method provided by the embodiments of the present application. For ease of understanding, and based on the above description, the embodiments of the present application also provide a schematic flowchart of a specific implementation of the word vector processing method in a practical application scenario, as shown in Fig. 2.
The flow in Fig. 2 mainly includes the following steps:
Step 1, segment the Chinese corpus with a word segmentation tool, scan the segmented Chinese corpus, count all words that occur to build a vocabulary, and delete the words that occur fewer than b times (b being the set number of times mentioned above); go to step 2;
Step 2, scan the vocabulary entry by entry, extract the Cangjie code n-grams of each word, and build a Cangjie code n-gram table and a mapping table between words and their Cangjie code n-grams; go to step 3;
Step 3, create a word vector of dimension d for each word in the vocabulary, create a Cangjie code character vector, also of dimension d, for each Cangjie code n-gram in the Cangjie code n-gram table, and randomly initialize all the vectors created; go to step 4;
Step 4, starting from the first word of the segmented Chinese corpus, slide word by word, each time selecting one word as the "current word w" (the specified word mentioned above); if w has traversed all words of the whole corpus, end; otherwise go to step 5;
Step 5, with the current word w as the center, slide k words to each side to establish a window; from the first word to the last word in the window (excluding the current word w), select one word at a time as the "context word c"; if c has traversed all words in the window, go to step 4; otherwise go to step 6;
Step 6: for the current word w, look up the n-gram Cangjie code characters corresponding to w in the mapping table between words and n-gram Cangjie code characters built in step 2, and compute the similarity between the current word w and the context word c according to formula (1):
$$\mathrm{sim}(w, c) = \sum_{q \in S(w)} \vec{q} \cdot \vec{c} \qquad (1)$$
In the formula, S denotes the n-gram Cangjie code character table built in step 2, S(w) denotes the set of n-gram Cangjie code characters corresponding to the current word w in the mapping table of step 2, and q denotes an element of the set S(w) (that is, one n-gram Cangjie code character). sim(w, c) denotes the similarity score of the current word w and the context word c; $\vec{q} \cdot \vec{c}$ denotes the dot product of the Cangjie code character vector of q and the word vector of the context word c; go to step 7;
Step 7: randomly select λ words as negative sample words, and compute the loss score l(w, c) according to formula (2) (that is, the loss function mentioned above); the loss score can serve as the loss characterization value mentioned above:
$$l(w, c) = \log \sigma(\mathrm{sim}(w, c)) + \lambda \, \mathbb{E}_{c' \in p(V)}\left[\log \sigma(-\mathrm{sim}(w, c'))\right] \qquad (2)$$
Here log is the logarithm function, c' is a randomly selected negative sample word, $\mathbb{E}_{c' \in p(V)}[x]$ denotes the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, given by formula (3):
$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (3)$$
If x is a real number, σ(x) is also a real number. Compute the gradient from the value of l(w, c), and update the n-gram Cangjie code character vectors $\vec{q}$ and the context word vector $\vec{c}$; go to step 5.
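Steps 6 and 7 can be illustrated with a short numerical sketch. It assumes that the similarity is the sum of the dot products between each n-gram Cangjie code character vector of w and the word vector of c, that the loss takes the common negative-sampling form log σ(sim(w, c)) plus the sum of log σ(−sim(w, c')) over the sampled negative words, and that the expectation is estimated by the λ sampled words themselves. All function names are hypothetical, and vectors are plain Python lists.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigma(x)), with sigma(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def similarity(ngram_vecs, word_vec):
    """Sum over q in S(w) of the dot product of q's vector and the word vector."""
    return sum(dot(q, word_vec) for q in ngram_vecs)


def loss_score(ngram_vecs, context_vec, negative_vecs):
    """Assumed negative-sampling score for one (w, c) pair.

    The expectation over p(V), scaled by lambda, is approximated here by
    summing over the lambda sampled negative word vectors.
    """
    positive = log_sigmoid(similarity(ngram_vecs, context_vec))
    negative = sum(log_sigmoid(-similarity(ngram_vecs, neg))
                   for neg in negative_vecs)
    return positive + negative


# Toy 2-dimensional vectors, chosen only for illustration.
s_w = [[1.0, 0.0], [0.0, 1.0]]        # vectors of the n-grams in S(w)
c = [2.0, 3.0]                        # word vector of the context word
negatives = [[0.0, 0.0]]              # one sampled negative word (lambda = 1)
print(similarity(s_w, c), loss_score(s_w, c, negatives))
```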
Among steps 1 to 7 above, steps 6 and 7 are the more critical ones. For ease of understanding, they are illustrated with reference to Fig. 3.
Fig. 3 is a schematic diagram of the processing actions performed on part of the corpus used in the flow of Fig. 2, provided by the embodiments of the present application.
As shown in Fig. 3, assume the corpus contains the sentence "controlling haze is urgent", and that word segmentation yields the three words "controlling", "haze", and "urgent" from the sentence.
Assume that "haze" is now selected as the current word w and "controlling" as the context word c. All n-gram Cangjie code characters S(w) mapped to the current word w are extracted; for example, the 5-gram Cangjie code characters mapped to "haze" include "MBHES", "BHESM", "HESMB", "ESMBB", and so on, and the 5-gram Cangjie code characters mapped to "controlling" include "EIRMG", "IRMGW", and "RMGWG". Then the loss score l(w, c) is computed according to formulas (1), (2), and (3), and the gradient is computed in order to update the word vector of c and all the Cangjie code character vectors corresponding to w.
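The n-gram extraction in this example can be reproduced with a simple sliding slice. In the sketch below, the function name `cangjie_ngrams` is hypothetical; the code string "EIRMGWG" and the prefix "MBHESMBB" are inferred from the 5-grams listed in the text, and the remainder of the second code sequence is left out because the text elides it.

```python
def cangjie_ngrams(code, n):
    """Return every run of n consecutive Cangjie code characters in `code`."""
    return [code[i:i + n] for i in range(len(code) - n + 1)]


# Code sequence implied by the three 5-grams listed for the context word.
print(cangjie_ngrams("EIRMGWG", 5))
# Prefix implied by the first four 5-grams listed for the current word.
print(cangjie_ngrams("MBHESMBB", 5))
```

When n is given as several different positive integers, the same function can be called once per value of n and the results combined.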
Based on the same inventive concept as the embodiments of Fig. 1 and Fig. 2, the embodiments of the present application provide another word vector processing method.
Fig. 4 is a schematic flowchart of another word vector processing method provided by the embodiments of the present application.
The flow in Fig. 4 may include the following steps:
Step 1: segment the corpus into words, and build a vocabulary made up of the words obtained by the segmentation, where the words do not include any word that occurs in the corpus fewer than a set number of times; go to step 2;
Step 2: build an n-gram Cangjie code character mapping table according to the vocabulary, the mapping table containing the mapping relations between the words and n-gram Cangjie code characters, where each n-gram Cangjie code character represents n consecutive Cangjie code characters of the word it maps to; go to step 3;
Step 3: according to the n-gram Cangjie code character mapping table, create and initialize the word vector of each word and the Cangjie code character vector of each n-gram Cangjie code character mapped to each word; go to step 4;
Step 4: traverse the segmented corpus, take each traversed word in turn as the current word w and perform step 5 on it; end when the traversal is complete, otherwise continue traversing;
Step 5: with the current word w as the center, slide k words to each side to establish a window, and traverse all words in the window other than the current word w; take each traversed word in turn as the current context word c of the current word w and perform step 6 on it; when the traversal is complete, continue with step 4, otherwise continue traversing;
Step 6: compute the similarity between the current word w and the current context word c according to the following formula:
$$\mathrm{sim}(w, c) = \sum_{q \in S(w)} \vec{q} \cdot \vec{c}$$
Here S(w) denotes the set of n-gram Cangjie code characters to which the current word w maps in the n-gram Cangjie code character mapping table, q denotes an n-gram Cangjie code character in S(w), and sim(w, c) denotes the similarity between the current word w and the current context word c; $\vec{q} \cdot \vec{c}$ denotes the dot product of the Cangjie code character vector of q and the word vector of the current context word c; go to step 7;
Step 7: randomly select λ words as negative sample words, and compute the corresponding loss characterization value l(w, c) according to the following loss function:
$$l(w, c) = \log \sigma(\mathrm{sim}(w, c)) + \lambda \, \mathbb{E}_{c' \in p(V)}\left[\log \sigma(-\mathrm{sim}(w, c'))\right]$$
Here c' is a randomly selected negative sample word, $\mathbb{E}_{c' \in p(V)}[x]$ denotes the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$.
The gradient corresponding to the loss function is computed from the computed loss characterization value l(w, c), and, according to the gradient, the Cangjie code character vector $\vec{q}$ of q and the word vector $\vec{c}$ of the current context word c are updated.
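The gradient-based update in step 7 can be sketched as below. This sketch makes several assumptions that the text does not state: the loss is taken to be log σ(sim(w, c)) plus the sum of log σ(−sim(w, c')) over the sampled negative words, it is treated as a quantity to be increased (gradient ascent), and a fixed learning rate `lr` is used. Only the vectors named in the text (the n-gram vectors of w and the word vector of c) are updated; the function name `update_step` is hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def update_step(ngram_vecs, context_vec, negative_vecs, lr=0.05):
    """Update the n-gram vectors of w and the word vector of c in place."""
    # sim(w, x) = (sum of q over S(w)) . x, so the summed vector is reused.
    sum_q = [sum(col) for col in zip(*ngram_vecs)]
    g_pos = 1.0 - sigmoid(dot(sum_q, context_vec))   # d log(sigma(s)) / ds
    # Gradient with respect to every q: g_pos * c - sum_c' sigma(sim(w, c')) * c'.
    grad_q = [g_pos * x for x in context_vec]
    for neg in negative_vecs:
        g_neg = sigmoid(dot(sum_q, neg))
        grad_q = [g - g_neg * x for g, x in zip(grad_q, neg)]
    # Gradient with respect to c: g_pos * (sum of q over S(w)).
    grad_c = [g_pos * x for x in sum_q]
    for q in ngram_vecs:
        for i in range(len(q)):
            q[i] += lr * grad_q[i]
    for i in range(len(context_vec)):
        context_vec[i] += lr * grad_c[i]


# Toy vectors chosen only for illustration.
s_w = [[0.1, 0.2], [0.3, -0.1]]
c = [0.5, 0.5]
update_step(s_w, c, [[-0.2, 0.4]], lr=0.1)
print(s_w, c)
```

Whether the negative sample word vectors are also updated, and how the learning rate is scheduled, are design choices that the text leaves open.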
The steps of this other word vector processing method may be performed by the same module or by different modules; the present application does not specifically limit this.
It should be noted that, besides the Cangjie code, other encodings of characters or words that reflect the glyph and morphological features of the word are equally applicable to the solutions of the present application, for example the Zheng code of Chinese characters; such an encoding may replace the Cangjie code in each of the solutions above.
The above describes the word vector processing methods provided by the embodiments of the present application. Based on the same inventive concept, the embodiments of the present application further provide a corresponding device, as shown in Fig. 5.
Fig. 5 is a schematic structural diagram of a word vector processing device corresponding to Fig. 1, provided by the embodiments of the present application. The device may be located in the execution body of the flow in Fig. 1, and includes:
a word segmentation module 501, which segments a corpus to obtain words;
a determining module 502, which determines the n-gram Cangjie code characters corresponding to each word, where an n-gram Cangjie code character represents n consecutive Cangjie code characters of its corresponding word;
an initialization module 503, which creates and initializes the word vector of each word and the Cangjie code character vector of each n-gram Cangjie code character corresponding to each word;
a training module 504, which trains the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus.
Optionally, the determining module 502 determining the n-gram Cangjie code characters corresponding to each word specifically includes:
the determining module 502 determining, according to the result of segmenting the corpus, the words that occur in the corpus;
and performing the following for each of the determined words:
determining the n-gram Cangjie code characters corresponding to the word, where an n-gram Cangjie code character corresponding to the word represents n consecutive Cangjie code characters of the word, and n is one positive integer or multiple different positive integers.
Optionally, the determining module 502 determining, according to the result of segmenting the corpus, the words that occur in the corpus specifically includes:
the determining module 502 determining, according to the result of segmenting the corpus, the words that occur in the corpus and whose number of occurrences is not less than a set number of times.
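Filtering the vocabulary by occurrence count can be sketched with a counter over the segmented corpus. The function name `build_vocab` and the parameter `min_count` (standing for the set number of times) are assumptions made for this illustration.

```python
from collections import Counter


def build_vocab(segmented_sentences, min_count):
    """Keep the words whose occurrence count is not less than min_count."""
    counts = Counter(w for sentence in segmented_sentences for w in sentence)
    return {w: n for w, n in counts.items() if n >= min_count}


# Hypothetical segmented corpus: each sentence is a list of words.
sentences = [["a", "b", "a"], ["b", "c"], ["a"]]
print(build_vocab(sentences, 2))
```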
Optionally, the initialization module 503 initializing the word vector of each word and the Cangjie code character vector of each n-gram Cangjie code character corresponding to each word specifically includes:
the initialization module 503 initializing the word vector of each word and the Cangjie code character vector of each n-gram Cangjie code character corresponding to each word by random initialization or by initialization according to a specified probability distribution, where identical n-gram Cangjie code characters have the same Cangjie code character vector.
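The requirement that identical n-gram Cangjie code characters share one vector can be met by keying the vectors on the n-gram itself. The sketch below is illustrative only: the function name `init_vectors`, the uniform range, and the seed are assumptions, since the text only requires random initialization or initialization according to a specified probability distribution.

```python
import random


def init_vectors(word_to_ngrams, dim, seed=0):
    """Create one random vector per word and one per distinct n-gram."""
    rng = random.Random(seed)

    def rand_vec():
        # Small uniform values; the range is an illustrative choice.
        return [rng.uniform(-0.5 / dim, 0.5 / dim) for _ in range(dim)]

    word_vecs = {w: rand_vec() for w in word_to_ngrams}
    ngram_vecs = {}
    for ngrams in word_to_ngrams.values():
        for q in ngrams:
            if q not in ngram_vecs:     # identical n-grams share one vector
                ngram_vecs[q] = rand_vec()
    return word_vecs, ngram_vecs


# Hypothetical mapping table in which two words share the n-gram "EIRMG".
mapping = {"word_a": ["EIRMG", "IRMGW"], "word_b": ["EIRMG"]}
word_vecs, ngram_vecs = init_vectors(mapping, dim=4)
print(len(word_vecs), len(ngram_vecs))
```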
Optionally, the training module 504 training the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus specifically includes:
the training module 504 determining a specified word in the segmented corpus, and one or more context words of the specified word in the segmented corpus;
determining the similarity between the specified word and the context word according to the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word and the word vector of the context word;
updating, according to the similarity between the specified word and the context word, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word.
Optionally, the training module 504 updating, according to the similarity between the specified word and the context word, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word specifically includes:
the training module 504 selecting one or more words from the words as negative sample words;
determining the similarity between the specified word and each negative sample word;
determining the loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word;
updating, according to the loss characterization value, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word.
Optionally, the training module 504 updating, according to the loss characterization value, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word specifically includes:
the training module 504 determining the gradient corresponding to the loss function according to the loss characterization value;
updating, according to the gradient, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word.
Optionally, the training module 504 selecting one or more words from the words as negative sample words specifically includes:
the training module 504 randomly selecting one or more words from the words as negative sample words.
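Random selection of negative sample words can be sketched as follows. The sketch assumes uniform sampling with replacement and excludes the specified word and its context word from the candidates; neither choice is stated in the text, and the function name `sample_negatives` is hypothetical.

```python
import random


def sample_negatives(vocab, num, exclude, rng):
    """Draw `num` negative sample words uniformly at random, with replacement."""
    candidates = [w for w in vocab if w not in exclude]
    return [rng.choice(candidates) for _ in range(num)]


# Hypothetical vocabulary; "w1" and "w2" play the roles of w and c.
vocab = ["w1", "w2", "w3", "w4", "w5"]
rng = random.Random(42)
print(sample_negatives(vocab, 3, {"w1", "w2"}, rng))
```

A frequency-based distribution over the vocabulary could be substituted for the uniform choice by replacing `rng.choice` with `rng.choices` and a list of weights.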
Optionally, the training module 504 training the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus specifically includes:
the training module 504 traversing the segmented corpus and performing the following for each word in the segmented corpus:
determining one or more context words of the word in the segmented corpus;
and performing the following for each context word:
determining the similarity between the word and the context word according to the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the word and the word vector of the context word;
updating, according to the similarity between the word and the context word, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the word.
Optionally, the training module 504 determining one or more context words of the word in the segmented corpus specifically includes:
the training module 504 establishing, in the segmented corpus, a window centered on the word by sliding a distance of a specified number of words to the left and/or to the right;
determining the words in the window other than the word itself as the context words of the word.
Optionally, the words are Chinese words, the word vectors are word vectors of Chinese words, and the Cangjie code characters are Cangjie code characters.
Based on the same inventive concept, the embodiments of the present application further provide a corresponding electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
segment a corpus to obtain words;
determine the n-gram Cangjie code characters corresponding to each word, where an n-gram Cangjie code character represents n consecutive Cangjie code characters of its corresponding word;
create and initialize the word vector of each word and the Cangjie code character vector of each n-gram Cangjie code character corresponding to each word;
train the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus.
Based on the same inventive concept, the embodiments of the present application further provide a corresponding non-volatile computer storage medium storing computer-executable instructions, the computer-executable instructions being configured to:
segment a corpus to obtain words;
determine the n-gram Cangjie code characters corresponding to each word, where an n-gram Cangjie code character represents n consecutive Cangjie code characters of its corresponding word;
create and initialize the word vector of each word and the Cangjie code character vector of each n-gram Cangjie code character corresponding to each word;
train the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the other embodiments. In particular, the device, electronic device, and non-volatile computer storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding description of the method embodiments.
The device, electronic device, and non-volatile computer storage medium provided by the embodiments of the present application correspond to the method, and therefore also have advantageous technical effects similar to those of the corresponding method. Since the advantageous technical effects of the method have been described in detail above, those of the corresponding device, electronic device, and non-volatile computer storage medium are not repeated here.
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). However, with the development of technology, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is such an integrated circuit, whose logic function is determined by the user programming the device. Designers program a digital system onto a single PLD themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, nowadays, instead of making integrated circuit chips by hand, this programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing the logical method flow can easily be obtained merely by describing the method flow with a little logic programming in one of the above hardware description languages and programming it into an integrated circuit.
A controller may be implemented in any suitable manner. For example, a controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, the method steps can be logic-programmed so that the controller achieves the same function in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for implementing various functions may also be regarded as structures within the hardware component. Indeed, the means for implementing various functions may be regarded both as software modules implementing a method and as structures within a hardware component.
The systems, devices, modules, or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementing device is a computer. Specifically, the computer may be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, e-mail device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above device is described with its functions divided into various units. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiment is described relatively briefly because it is substantially similar to the method embodiment; for relevant details, refer to the corresponding description of the method embodiment.
The foregoing describes merely embodiments of the present application and is not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (24)

1. A word vector processing method, including:
segmenting a corpus to obtain words;
determining the n-gram Cangjie code characters corresponding to each word, where an n-gram Cangjie code character represents n consecutive Cangjie code characters of its corresponding word;
creating and initializing the word vector of each word and the Cangjie code character vector of each n-gram Cangjie code character corresponding to each word;
training the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus.
2. The method according to claim 1, wherein determining the n-gram Cangjie code characters corresponding to each word specifically includes:
determining, according to the result of segmenting the corpus, the words that occur in the corpus;
and performing the following for each of the determined, mutually different words:
determining the n-gram Cangjie code characters corresponding to the word, where an n-gram Cangjie code character corresponding to the word represents n consecutive Cangjie code characters of the word, and n is one positive integer or multiple different positive integers.
3. The method according to claim 2, wherein determining, according to the result of segmenting the corpus, the words that occur in the corpus specifically includes:
determining, according to the result of segmenting the corpus, the words that occur in the corpus and whose number of occurrences is not less than a set number of times.
4. The method according to claim 1, wherein initializing the word vector of each word and the Cangjie code character vector of each n-gram Cangjie code character corresponding to each word specifically includes:
initializing the word vector of each word and the Cangjie code character vector of each n-gram Cangjie code character corresponding to each word by random initialization or by initialization according to a specified probability distribution, where identical n-gram Cangjie code characters have the same Cangjie code character vector.
5. The method of claim 1, wherein training the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus specifically comprises:
determining a specified word in the segmented corpus, and one or more context words of the specified word in the segmented corpus;
determining a similarity between the specified word and the context word according to the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word and the word vector of the context word; and
updating, according to the similarity between the specified word and the context word, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word.
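As a sketch of the similarity step, the function below sums the dot products between each n-gram vector of the specified word and the word vector of the context word, which is the form the later step-by-step claim states. The function and parameter names are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def similarity(
    ngrams: Sequence[str], gram_vecs: Dict[str, Vector], context_vec: Vector
) -> float:
    """sim(w, c): sum over the n-grams q of w of (vector of q) . (vector of c)."""
    return sum(dot(gram_vecs[q], context_vec) for q in ngrams)
```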
6. The method of claim 5, wherein updating, according to the similarity between the specified word and the context word, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word specifically comprises:
selecting one or more words from the words as negative sample words;
determining a similarity between the specified word and each negative sample word;
determining a loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word; and
updating, according to the loss characterization value, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word.
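The loss characterization value can be illustrated with the log-sigmoid negative-sampling form that the step-by-step method later in these claims uses. This sketch assumes the expectation is replaced by the sampled negative words themselves; `log_sigmoid` and `loss_value` are hypothetical helper names.

```python
import math
from typing import Iterable


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigma(x)) with sigma(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def loss_value(sim_pos: float, sims_neg: Iterable[float]) -> float:
    """log sigma(sim(w, c)) plus the sum of log sigma(-sim(w, c')) over negatives."""
    return log_sigmoid(sim_pos) + sum(log_sigmoid(-s) for s in sims_neg)
```

With this sign convention a larger value means a better fit, so training under this assumption would increase the value rather than decrease it.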
7. The method of claim 6, wherein updating, according to the loss characterization value, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word specifically comprises:
determining a gradient corresponding to the loss function according to the loss characterization value; and
updating, according to the gradient, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word.
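For reference, the gradients can be worked out for the log-sigmoid loss that the step-by-step method later in these claims uses. The derivation below is a sketch under two assumptions: the expectation is approximated by the λ sampled negative words c'_1, ..., c'_λ, and the shorthand s and s_i is introduced here only for readability.

```latex
% Shorthand (introduced for this derivation only):
%   s   = sim(w, c)    = \sum_{q \in S(w)} \vec{q} \cdot \vec{c}
%   s_i = sim(w, c'_i) = \sum_{q \in S(w)} \vec{q} \cdot \vec{c'_i}
\begin{align*}
l(w, c) &\approx \log \sigma(s) + \sum_{i=1}^{\lambda} \log \sigma(-s_i), \\
\frac{d}{dx} \log \sigma(x) &= 1 - \sigma(x), \qquad
\frac{d}{dx} \log \sigma(-x) = -\sigma(x), \\
\frac{\partial l}{\partial \vec{c}} &= \bigl(1 - \sigma(s)\bigr) \sum_{q \in S(w)} \vec{q}, \\
\frac{\partial l}{\partial \vec{q}} &= \bigl(1 - \sigma(s)\bigr)\, \vec{c}
  - \sum_{i=1}^{\lambda} \sigma(s_i)\, \vec{c'_i}
  \qquad \text{for each } q \in S(w).
\end{align*}
```

Because every n-gram vector enters each similarity linearly, all q in S(w) receive the same gradient under this approximation.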
8. The method of claim 6, wherein selecting one or more words from the words as negative sample words specifically comprises:
randomly selecting one or more words from the words as negative sample words.
9. The method of claim 1, wherein training the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus specifically comprises:
traversing the segmented corpus, and performing, for each word in the segmented corpus:
determining one or more context words of the word in the segmented corpus; and
performing, for each of the context words:
determining a similarity between the word and the context word according to the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the word and the word vector of the context word; and
updating, according to the similarity between the word and the context word, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the word.
10. The method of claim 9, wherein determining one or more context words of the word in the segmented corpus specifically comprises:
establishing a window in the segmented corpus by taking the word as the center and sliding a distance of a specified number of words to the left and/or to the right; and
determining the words in the window other than the word itself as the context words of the word.
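The window construction can be sketched as a simple slice around the center position. This example assumes a symmetric window of k words on each side, truncated at the sentence boundaries; the claim also allows sliding to one side only, which this sketch does not cover.

```python
from typing import List


def context_words(tokens: List[str], index: int, k: int = 2) -> List[str]:
    """Words within k positions of tokens[index], excluding the center word."""
    left = max(0, index - k)
    right = min(len(tokens), index + k + 1)
    return tokens[left:index] + tokens[index + 1:right]
```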
11. The method of any one of claims 1 to 10, wherein the word is a Chinese word, the word vector is a word vector of a Chinese word, and the Cangjie code character is a Cangjie code character.
12. A word vector processing apparatus, comprising:
a word segmentation module, which segments a corpus to obtain words;
a determining module, which determines the n-gram Cangjie code characters corresponding to each word, wherein an n-gram Cangjie code character represents n consecutive Cangjie code characters of its corresponding word;
an initialization module, which establishes and initializes a word vector for each word, and a Cangjie code character vector for each n-gram Cangjie code character corresponding to each word; and
a training module, which trains the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus.
13. The apparatus of claim 12, wherein the determining module determining the n-gram Cangjie code characters corresponding to each word specifically comprises:
the determining module determining, according to the result of segmenting the corpus, the words that occur in the corpus; and
for each distinct word so determined, performing:
determining the n-gram Cangjie code characters corresponding to the word, wherein an n-gram Cangjie code character corresponding to the word represents n consecutive Cangjie code characters of the word, and n is one positive integer or multiple different positive integers.
14. The apparatus of claim 13, wherein the determining module determining, according to the result of segmenting the corpus, the words that occur in the corpus specifically comprises:
the determining module determining, according to the result of segmenting the corpus, the words that occur in the corpus and whose number of occurrences is not less than a set number.
15. The apparatus of claim 12, wherein the initialization module initializing the word vector of each word and the Cangjie code character vector of each n-gram Cangjie code character corresponding to each word specifically comprises:
the initialization module initializing, either randomly or according to a specified probability distribution, the word vector of each word and the Cangjie code character vector of each n-gram Cangjie code character corresponding to each word, wherein identical n-gram Cangjie code characters have the same Cangjie code character vector.
16. The apparatus of claim 12, wherein the training module training the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus specifically comprises:
the training module determining a specified word in the segmented corpus, and one or more context words of the specified word in the segmented corpus;
determining a similarity between the specified word and the context word according to the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word and the word vector of the context word; and
updating, according to the similarity between the specified word and the context word, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word.
17. The apparatus of claim 16, wherein the training module updating, according to the similarity between the specified word and the context word, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word specifically comprises:
the training module selecting one or more words from the words as negative sample words;
determining a similarity between the specified word and each negative sample word;
determining a loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word; and
updating, according to the loss characterization value, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word.
18. The apparatus of claim 17, wherein the training module updating, according to the loss characterization value, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word specifically comprises:
the training module determining a gradient corresponding to the loss function according to the loss characterization value; and
updating, according to the gradient, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the specified word.
19. The apparatus of claim 17, wherein the training module selecting one or more words from the words as negative sample words specifically comprises:
the training module randomly selecting one or more words from the words as negative sample words.
20. The apparatus of claim 12, wherein the training module training the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus specifically comprises:
the training module traversing the segmented corpus, and performing, for each word in the segmented corpus:
determining one or more context words of the word in the segmented corpus; and
performing, for each of the context words:
determining a similarity between the word and the context word according to the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the word and the word vector of the context word; and
updating, according to the similarity between the word and the context word, the word vector of the context word and the Cangjie code character vectors of the n-gram Cangjie code characters corresponding to the word.
21. The apparatus of claim 20, wherein the training module determining one or more context words of the word in the segmented corpus specifically comprises:
the training module establishing a window in the segmented corpus by taking the word as the center and sliding a distance of a specified number of words to the left and/or to the right; and
determining the words in the window other than the word itself as the context words of the word.
22. The apparatus of any one of claims 12 to 21, wherein the word is a Chinese word, the word vector is a word vector of a Chinese word, and the Cangjie code character is a Cangjie code character.
23. A word vector processing method, comprising:
Step 1: segmenting a corpus, and establishing a vocabulary consisting of the words obtained by the segmentation, wherein the words do not include any word whose number of occurrences in the corpus is less than a set number; go to step 2;
Step 2: establishing, according to the vocabulary, an n-gram Cangjie code character mapping table, wherein the mapping table contains the mapping relationships between each word and n-gram Cangjie code characters, and an n-gram Cangjie code character represents n consecutive Cangjie code characters of the word to which it is mapped; go to step 3;
Step 3: establishing and initializing, according to the n-gram Cangjie code character mapping table, the word vector of each word and the Cangjie code character vector of each n-gram Cangjie code character to which each word maps; go to step 4;
Step 4: traversing the segmented corpus, taking each traversed word in turn as the current word w and performing step 5 on the current word w; if the traversal is complete, ending; otherwise, continuing the traversal;
Step 5: establishing a window by taking the current word w as the center and sliding at most k words to each side, traversing all words in the window other than the current word w, taking each traversed word in turn as the current context word c of the current word w and performing step 6 on the current context word c; if the traversal is complete, continuing the execution of step 4; otherwise, continuing the traversal;
Step 6: calculating the similarity between the current word w and the current context word c according to the following formula:
$$\mathrm{sim}(w, c) = \sum_{q \in S(w)} \vec{q} \cdot \vec{c};$$
wherein S(w) denotes the set of n-gram Cangjie code characters to which the current word w maps in the n-gram Cangjie code character mapping table; q denotes an n-gram Cangjie code character in S(w); sim(w, c) denotes the similarity between the current word w and the current context word c; and $\vec{q} \cdot \vec{c}$ denotes the dot product of the Cangjie code character vector of q and the word vector of the current context word c; go to step 7;
Step 7: randomly selecting λ words as negative sample words, and calculating the corresponding loss characterization value l(w, c) according to the following loss function:
$$l(w, c) = \log \sigma(\mathrm{sim}(w, c)) + \sum_{i=1}^{\lambda} E_{c' \in p(V)}\left[\log \sigma(-\mathrm{sim}(w, c'))\right];$$
wherein c' is a randomly selected negative sample word; $E_{c' \in p(V)}[x]$ denotes the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V); and σ(·) is the neural network activation function, defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$;
calculating the gradient corresponding to the loss function according to the calculated loss characterization value l(w, c), and updating, according to the gradient, the Cangjie code character vector $\vec{q}$ of q and the word vector $\vec{c}$ of the current context word c.
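Steps 6 and 7 can be put together in a single update for one (w, c) pair, sketched below. Several choices here are assumptions rather than statements of the claimed method: the expectation is replaced by the sampled negative words, l(w, c) is treated as a quantity to increase (gradient ascent), the learning rate `lr` is hypothetical, and the negative-sample word vectors are left unchanged because the text names only the vectors of q and c as being updated.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Numerically stable sigma(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def train_pair(
    grams: Sequence[str],          # S(w): n-grams mapped to the current word w
    c: str,                        # current context word
    negatives: Sequence[str],      # sampled negative words c'
    gram_vecs: Dict[str, Vector],
    word_vecs: Dict[str, Vector],
    lr: float = 0.05,              # hypothetical learning rate
) -> None:
    """Update the n-gram vectors of w and the word vector of c in place."""
    c_vec = word_vecs[c]
    dim = len(c_vec)
    # Sum of the n-gram vectors, so that sim(w, x) = dot(w_sum, vector of x).
    w_sum = [sum(gram_vecs[q][d] for q in grams) for d in range(dim)]

    # d l / d sim(w, c) = 1 - sigma(sim(w, c))
    g_pos = 1.0 - sigmoid(dot(w_sum, c_vec))
    grad_q = [g_pos * x for x in c_vec]
    grad_c = [g_pos * x for x in w_sum]

    # d l / d sim(w, c') = -sigma(sim(w, c')) for each negative sample
    for neg in negatives:
        n_vec = word_vecs[neg]
        g_neg = sigmoid(dot(w_sum, n_vec))
        for d in range(dim):
            grad_q[d] -= g_neg * n_vec[d]

    # Gradient ascent on l(w, c); every q in S(w) shares the same gradient.
    for d in range(dim):
        c_vec[d] += lr * grad_c[d]
    for q in grams:
        vec = gram_vecs[q]
        for d in range(dim):
            vec[d] += lr * grad_q[d]
```

Calling `train_pair` inside the nested traversal of steps 4 and 5 would give one pass over the corpus; how many passes to run and how to draw negatives from p(V) are left open here.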
24. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
segment a corpus to obtain words;
determine the n-gram Cangjie code characters corresponding to each word, wherein an n-gram Cangjie code character represents n consecutive Cangjie code characters of its corresponding word;
establish and initialize a word vector for each word, and a Cangjie code character vector for each n-gram Cangjie code character corresponding to each word; and
train the word vectors and the Cangjie code character vectors according to the word vectors, the Cangjie code character vectors, and the segmented corpus.
CN201710430490.2A 2017-06-09 2017-06-09 Word vector processing method and device and electronic equipment Active CN107247704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710430490.2A CN107247704B (en) 2017-06-09 2017-06-09 Word vector processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710430490.2A CN107247704B (en) 2017-06-09 2017-06-09 Word vector processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN107247704A true CN107247704A (en) 2017-10-13
CN107247704B CN107247704B (en) 2020-09-08

Family

ID=60018874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710430490.2A Active CN107247704B (en) 2017-06-09 2017-06-09 Word vector processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN107247704B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107957989A (en) * 2017-10-23 2018-04-24 阿里巴巴集团控股有限公司 Term vector processing method, device and equipment based on cluster
CN108170663A (en) * 2017-11-14 2018-06-15 阿里巴巴集团控股有限公司 Term vector processing method, device and equipment based on cluster
CN111226223A (en) * 2017-10-26 2020-06-02 三菱电机株式会社 Word semantic relation estimation device and word semantic relation estimation method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW446901B (en) * 1999-06-17 2001-07-21 Inventec Corp Teaching method for breakdown input of Chinese characters and system thereof
CN101916159A (en) * 2010-07-30 2010-12-15 凌阳科技股份有限公司 Virtual input system utilizing remote controller
US8788508B2 (en) * 2011-03-28 2014-07-22 Microth, Inc. Object access system based upon hierarchical extraction tree and related methods
CN104778158A (en) * 2015-03-04 2015-07-15 新浪网技术(中国)有限公司 Method and device for representing text
WO2016008512A1 (en) * 2014-07-15 2016-01-21 Ibeezi Sprl Input of characters of a symbol-based written language
CN105631468A (en) * 2015-12-18 2016-06-01 华南理工大学 RNN-based automatic picture description generation method
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BLUCHE T 等: "Faster Segmentation-Free Handwritten Chinese Text Recognition with Character Decompositions", 《2016 15TH INTERNATIONAL CONFERENCE ON FRONTIERS IN HANDWRITING RECOGNITION(ICFHR)》 *
YEH JUI-FENG 等: "Chinese word spelling correction based on n-gram ranked inverted index list", 《PROCEEDINGS OF THE SEVENTH SIGHAN WORKSHOP ON CHINESE LANGUAGE PROCESSING》 *
周建政 等: "问答系统中问题模式分类与相似度计算方法", 《计算机工程与应用》 *
杨露: "基于云计算的嵌入式系统输入法研究与设计", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107957989A (en) * 2017-10-23 2018-04-24 阿里巴巴集团控股有限公司 Term vector processing method, device and equipment based on cluster
WO2019080615A1 (en) * 2017-10-23 2019-05-02 阿里巴巴集团控股有限公司 Cluster-based word vector processing method, device, and apparatus
US10769383B2 (en) 2017-10-23 2020-09-08 Alibaba Group Holding Limited Cluster-based word vector processing method, device, and apparatus
CN107957989B (en) * 2017-10-23 2020-11-17 创新先进技术有限公司 Cluster-based word vector processing method, device and equipment
CN107957989B9 (en) * 2017-10-23 2021-01-12 创新先进技术有限公司 Cluster-based word vector processing method, device and equipment
TWI721310B (en) * 2017-10-23 2021-03-11 開曼群島商創新先進技術有限公司 Cluster-based word vector processing method, device and equipment
CN111226223A (en) * 2017-10-26 2020-06-02 三菱电机株式会社 Word semantic relation estimation device and word semantic relation estimation method
CN111226223B (en) * 2017-10-26 2023-10-20 三菱电机株式会社 Word semantic relation estimation device and word semantic relation estimation method
CN108170663A (en) * 2017-11-14 2018-06-15 阿里巴巴集团控股有限公司 Term vector processing method, device and equipment based on cluster
WO2019095836A1 (en) * 2017-11-14 2019-05-23 阿里巴巴集团控股有限公司 Method, device, and apparatus for word vector processing based on clusters
US10846483B2 (en) 2017-11-14 2020-11-24 Advanced New Technologies Co., Ltd. Method, device, and apparatus for word vector processing based on clusters

Also Published As

Publication number Publication date
CN107247704B (en) 2020-09-08

Similar Documents

Publication Publication Date Title
CN108345580A (en) A kind of term vector processing method and processing device
TWI701588B (en) Word vector processing method, device and equipment
CN108874765A (en) Term vector processing method and processing device
CN107957989A (en) Term vector processing method, device and equipment based on cluster
TWI689831B (en) Word vector generating method, device and equipment
US20190197154A1 (en) Question answering for data visualizations
TWI686713B (en) Word vector generating method, device and equipment
CN109389038A (en) A kind of detection method of information, device and equipment
CN107423269A (en) Term vector processing method and processing device
CN109034183A (en) A kind of object detection method, device and equipment
CN107247704A (en) Term vector processing method, device and electronic equipment
JP7435951B2 (en) Floating point number generation method, apparatus, electronic device, storage medium and computer program for integrated circuit chip verification
CN107291222A (en) Interaction processing method, device, system and the virtual reality device of virtual reality device
CN107402945A (en) Word stock generating method and device, short text detection method and device
CN107562716A (en) Term vector processing method, device and electronic equipment
CN109934253A (en) A kind of confrontation sample generating method and device
CN110502614A (en) Text hold-up interception method, device, system and equipment
CN107577658A (en) Term vector processing method, device and electronic equipment
CN107562715A (en) Term vector processing method, device and electronic equipment
CN108170663A (en) Term vector processing method, device and equipment based on cluster
CN115861995B (en) Visual question-answering method and device, electronic equipment and storage medium
CN112949433A (en) Method, device and equipment for generating video classification model and storage medium
CN110471835A (en) A kind of similarity detection method and system based on power information system code file
CN107844472A (en) Term vector processing method, device and electronic equipment
CN107577659A (en) Term vector processing method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200925

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200925

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Patentee before: Alibaba Group Holding Ltd.