CN107423269A - Word vector processing method and apparatus - Google Patents

Word vector processing method and apparatus

Info

Publication number
CN107423269A
Authority
CN
China
Prior art keywords
word
code character
corner code
vector
members
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710383929.0A
Other languages
Chinese (zh)
Other versions
CN107423269B (en
Inventor
曹绍升
程远
周俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201710383929.0A
Publication of CN107423269A
Application granted
Publication of CN107423269B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present application disclose a word vector processing method and apparatus. The method includes: segmenting a corpus to obtain words; determining the n-gram four-corner code characters corresponding to each word, where an n-gram four-corner code character represents n consecutive four-corner code characters of its corresponding word; establishing and initializing a word vector for each word and a four-corner code character vector for each n-gram four-corner code character corresponding to the words; and training the word vectors and the four-corner code character vectors according to the word vectors, the four-corner code character vectors, and the segmented corpus. With the embodiments of the present application, the features of a word can be expressed at a finer granularity through its n-gram four-corner code characters, in particular the spatial orientation of the word's strokes, which helps improve the accuracy of word vectors for Chinese words and gives good practical results.

Description

Word vector processing method and apparatus
Technical field
The present application relates to the field of computer software technology, and in particular to a word vector processing method and apparatus.
Background
Most of today's natural language processing solutions use neural-network-based architectures, and an important underlying technology in such architectures is the word vector. A word vector maps a word to a vector of fixed dimension, and the vector represents the semantic information of the word.
In the prior art, the algorithms commonly used to generate word vectors were designed specifically for English, for example Google's word vector algorithm and Microsoft's deep neural network algorithm.
However, these prior-art algorithms either cannot be used for Chinese, or can be used for Chinese but generate word vectors for Chinese words that perform poorly in practice.
Summary of the invention
Embodiments of the present application provide a word vector processing method and apparatus, to solve the problem that prior-art algorithms for generating word vectors either cannot be used for Chinese or, if they can, generate word vectors for Chinese words that perform poorly in practice.
To solve the above technical problem, the embodiments of the present application are implemented as follows:
A word vector processing method provided by an embodiment of the present application includes:
segmenting a corpus to obtain words;
determining the n-gram four-corner code characters corresponding to each word, where an n-gram four-corner code character represents n consecutive four-corner code characters of its corresponding word;
establishing and initializing a word vector for each word, and a four-corner code character vector for each n-gram four-corner code character corresponding to the words;
training the word vectors and the four-corner code character vectors according to the word vectors, the four-corner code character vectors, and the segmented corpus.
A word vector processing apparatus provided by an embodiment of the present application includes:
a word segmentation module, configured to segment a corpus to obtain words;
a determining module, configured to determine the n-gram four-corner code characters corresponding to each word, where an n-gram four-corner code character represents n consecutive four-corner code characters of its corresponding word;
an initialization module, configured to establish and initialize a word vector for each word, and a four-corner code character vector for each n-gram four-corner code character corresponding to the words;
a training module, configured to train the word vectors and the four-corner code character vectors according to the word vectors, the four-corner code character vectors, and the segmented corpus.
Another word vector processing method provided by an embodiment of the present application includes:
Step 1: segment a corpus and establish a vocabulary consisting of the words obtained by segmentation, where the words do not include words that appear in the corpus fewer than a set number of times; go to step 2;
Step 2: according to the vocabulary, establish an n-gram four-corner code character mapping table, where the mapping table contains the mapping between each word and its n-gram four-corner code characters, and an n-gram four-corner code character represents n consecutive four-corner code characters of the word it is mapped from; go to step 3;
Step 3: according to the n-gram four-corner code character mapping table, establish and initialize a word vector for each word, and a four-corner code character vector for each n-gram four-corner code character mapped from the words; go to step 4;
Step 4: traverse the segmented corpus, taking each traversed word in turn as the current word w and performing step 5 on it; end when the traversal is complete, otherwise continue traversing;
Step 5: centered on the current word w, slide at most k words to each side to establish a window, and traverse all words in the window other than the current word w, taking each traversed word in turn as the current context word c of the current word w and performing step 6 on it; when the traversal is complete, continue with step 4, otherwise continue traversing;
Step 6: calculate the similarity between the current word w and the current context word c according to the following formula:
$$\mathrm{sim}(w, c) = \sum_{q \in S(w)} \vec{q} \cdot \vec{c}$$
where S(w) denotes the set of n-gram four-corner code characters that the current word w maps to in the n-gram four-corner code character mapping table, q denotes an n-gram four-corner code character in S(w), sim(w, c) denotes the similarity between the current word w and the current context word c, and $\vec{q} \cdot \vec{c}$ denotes the dot product of the four-corner code character vector of q and the word vector of the current context word c; go to step 7;
Step 7: randomly select λ words as negative sample words, and calculate the corresponding loss characterization value l(w, c) according to the following loss function:
$$l(w, c) = \log \sigma(\mathrm{sim}(w, c)) + \lambda \, \mathbb{E}_{c' \in p(V)}\left[\log \sigma(-\mathrm{sim}(w, c'))\right]$$
where c' is a randomly selected negative sample word, $\mathbb{E}_{c' \in p(V)}[x]$ denotes the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, defined as
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The gradient corresponding to the loss function is calculated from the calculated loss characterization value l(w, c), and, according to the gradient, the four-corner code character vector $\vec{q}$ of q and the word vector $\vec{c}$ of the current context word c are updated.
At least one of the above technical solutions adopted by the embodiments of the present application can achieve the following beneficial effects: the features of a word can be expressed at a finer granularity through its n-gram four-corner code characters, in particular the spatial orientation of the word's strokes, which helps improve the accuracy of word vectors for Chinese words and gives good practical results. The problems of the prior art can therefore be solved in part or in whole.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some of the embodiments recorded in the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a word vector processing method provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of a specific implementation of the word vector processing method in a practical application scenario, provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of the processing performed on part of the corpus used in the flow of Fig. 2, provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of another word vector processing method provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a word vector processing apparatus corresponding to Fig. 1, provided by an embodiment of the present application.
Detailed description
Embodiments of the present application provide a word vector processing method and apparatus.
To help a person skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort shall fall within the protection scope of the present application.
The solution of the present application applies to word vectors of Chinese words and also to word vectors of words in some other languages similar to Chinese, for example languages whose strokes have distinct spatial orientation features, such as Korean and Japanese. For non-Chinese cases, the words need to be encoded correspondingly according to the encoding rules of the four-corner code.
For ease of description, the following embodiments describe the solution of the present application mainly for the scenario of Chinese words.
Fig. 1 is a schematic flowchart of a word vector processing method provided by an embodiment of the present application. From a program perspective, the flow may be executed by a program having a word vector generation function and/or a training function. From a device perspective, the flow may be executed by, without limitation, at least one of the following devices that can run the program: a personal computer, a medium or large computer, a computer cluster, a mobile phone, a tablet computer, a smart wearable device, an in-vehicle device, and so on.
The flow in Fig. 1 may include the following steps:
S101: Segment a corpus to obtain words.
In the embodiments of the present application, the words may specifically be at least some of the words that appear at least once in the corpus. To facilitate subsequent processing, the words may be stored in a vocabulary and read from the vocabulary when needed.
S102: Determine the n-gram four-corner code characters corresponding to each word, where an n-gram four-corner code character represents n consecutive four-corner code characters of its corresponding word.
The four-corner code is an encoding of Chinese characters. Its principle is to divide the basic strokes of Chinese characters into 10 kinds, represented by the ten digits 0 to 9, and to take, in order, the digits corresponding to the strokes at the four corners of a Chinese character to form a character sequence that serves as the four-corner code of that character. Each character in the sequence is a digit; the sequence may be called a four-corner code character sequence, and each character in it a four-corner code character. Generally, the four-corner code of a Chinese character contains 5 digits, and sometimes a supplementary digit is also appended at the end.
For example, the four-corner code of the character "人" is "80000", and the four-corner code of the character "民" is "77747".
Further, a Chinese word usually consists of multiple Chinese characters, so concatenating the four-corner codes of these characters in order yields the four-corner code character sequence of the word. An n-gram four-corner code character corresponding to a word is then n consecutive four-corner code characters in the word's four-corner code character sequence.
Continuing the example above, the word "人民" ("people") has the four-corner code character sequence "8000077747". Accordingly, its 3-gram four-corner code characters are "800" (the 1st to 3rd four-corner code characters), "000" (the 2nd to 4th), "747" (the 8th to 10th), and so on; its 4-gram four-corner code characters are "8000" (the 1st to 4th), "0000" (the 2nd to 5th), "0007" (the 3rd to 6th), and so on; and its 5-gram four-corner code characters are "80000" (the 1st to 5th), "00007" (the 2nd to 6th), "00077" (the 3rd to 7th), and so on.
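The extraction in this example can be sketched as a sliding window over the code sequence. This is only an illustration: the function name `four_corner_ngrams` is hypothetical, and the sketch keeps repeated n-grams in order rather than deduplicating them.

```python
def four_corner_ngrams(code_sequence: str, n: int) -> list[str]:
    """Return every run of n consecutive characters, left to right."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return [code_sequence[i:i + n] for i in range(len(code_sequence) - n + 1)]


# Code sequence of the example word: the codes "80000" and "77747" joined.
sequence = "80000" + "77747"

trigrams = four_corner_ngrams(sequence, 3)
# ['800', '000', '000', '007', '077', '777', '774', '747']
five_grams = four_corner_ngrams(sequence, 5)
```

When n equals the length of the sequence, the function returns the whole sequence as the single n-gram.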
In the embodiments of the present application, the value of n may be dynamically adjustable. For the same word, when determining its n-gram four-corner code characters, n may take a single value (for example, only the 3-gram four-corner code characters of the word are determined) or multiple values (for example, both the 3-gram and the 4-gram four-corner code characters of the word are determined). When n equals the total number of four-corner code characters in the four-corner code character sequence of a character or word, the n-gram four-corner code character is exactly that sequence.
In the embodiments of the present application, for ease of computer processing, n-gram four-corner code characters may also be represented by designated codes other than digits. For example, each distinct four-corner code character may be represented by a distinct non-numeric code, so that an n-gram four-corner code character is correspondingly represented as a non-numeric code string, where there is a mapping between the codes and the digits or digit sequences.
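One way such a mapping could look is sketched below. The particular choice of letters is an assumption made for illustration; the source only requires that the codes be non-numeric and map back to the digits.

```python
import string

# Hypothetical mapping: digit "0" -> "a", "1" -> "b", ..., "9" -> "j".
DIGIT_TO_CODE = {str(d): string.ascii_lowercase[d] for d in range(10)}
CODE_TO_DIGIT = {code: digit for digit, code in DIGIT_TO_CODE.items()}


def encode(ngram: str) -> str:
    """Rewrite a digit n-gram as a non-numeric code string."""
    return "".join(DIGIT_TO_CODE[ch] for ch in ngram)


def decode(coded: str) -> str:
    """Recover the digit n-gram from its non-numeric code string."""
    return "".join(CODE_TO_DIGIT[ch] for ch in coded)


coded = encode("800")  # "iaa" under the mapping assumed above
```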
S103: Establish and initialize a word vector for each word, and a four-corner code character vector for each n-gram four-corner code character corresponding to the words.
In the embodiments of the present application, a four-corner code character vector is a vector that represents an n-gram four-corner code character. Each n-gram four-corner code character can be represented by one four-corner code character vector, just as each word can be represented by one word vector.
In the embodiments of the present application, to ensure the effect of the solution, some constraints may apply when initializing the word vectors and four-corner code character vectors. For example, the word vectors and four-corner code character vectors must not all be initialized to the same vector; as another example, the elements of a word vector or four-corner code character vector must not all be 0; and so on.
In the embodiments of the present application, the word vector of each word and the four-corner code character vector of each n-gram four-corner code character corresponding to the words may be initialized randomly or according to a specified probability distribution, where identical n-gram four-corner code characters share the same four-corner code character vector. For example, the specified probability distribution may be a 0-1 distribution.
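A minimal sketch of such an initialization follows. The function name `init_vectors`, the uniform range, and the scaling by the dimension are assumptions for illustration, not values given in the source. The point shown is that one vector is stored per distinct n-gram, so words sharing an n-gram share its vector.

```python
import random


def init_vectors(word_to_ngrams, dim, seed=0):
    """Randomly initialize one vector per word and one per distinct n-gram."""
    rng = random.Random(seed)

    def random_vector():
        # Small random values; the range is an assumption, not from the source.
        return [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]

    word_vectors = {word: random_vector() for word in word_to_ngrams}
    ngram_vectors = {}
    for ngrams in word_to_ngrams.values():
        for q in ngrams:
            # Identical n-grams map to the same key, hence the same vector.
            if q not in ngram_vectors:
                ngram_vectors[q] = random_vector()
    return word_vectors, ngram_vectors


# Hypothetical mapping table: two words that share the n-gram "000".
table = {"word_a": ["800", "000"], "word_b": ["000", "007"]}
word_vectors, ngram_vectors = init_vectors(table, dim=4)
```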
In addition, if the word vectors and four-corner code character vectors corresponding to some words have already been trained on other corpora, then when they are further trained on the corpus in Fig. 1, they need not be re-established and re-initialized; training can instead continue from the previous training result using the corpus in Fig. 1.
S104: Train the word vectors and the four-corner code character vectors according to the word vectors, the four-corner code character vectors, and the segmented corpus.
In the embodiments of the present application, the training may be implemented by a neural network, which may be a shallow neural network, a deep neural network, or the like. The present application does not limit the specific structure of the neural network used.
With the method of Fig. 1, the features of a word can be expressed at a finer granularity through its n-gram four-corner code characters, in particular the spatial orientation of the word's strokes, which helps improve the accuracy of word vectors for Chinese words and gives good practical results. The problems of the prior art can therefore be solved in part or in whole.
Based on the method of Fig. 1, the embodiments of the present application further provide some specific implementations and extensions of the method, described below.
For step S102, in the embodiments of the present application, determining the n-gram four-corner code characters corresponding to each word may specifically include: determining, according to the result of segmenting the corpus, the words that appear in the corpus;
and performing the following for each of the determined, mutually distinct words:
determining the n-gram four-corner code characters corresponding to the word, where an n-gram four-corner code character corresponding to the word represents n consecutive four-corner code characters of the word, and n is one positive integer or multiple different positive integers.
In the embodiments of the present application, identical words have identical n-gram four-corner code characters. The step in the preceding paragraph is therefore performed only for the mutually distinct words; for a repeated word, the existing result can be reused directly without repeating the step, which saves resources.
Further, if a word appears too few times in the corpus, the training samples and training iterations available for it on that corpus are also few, which adversely affects the credibility of the training result. Such words may therefore be filtered out and not trained for the time being; they can be trained later on other corpora.
Based on this idea, determining the words that appear in the corpus according to the segmentation result may specifically include: determining, according to the result of segmenting the corpus, the words that appear in the corpus no fewer than a set number of times. The specific set number can be determined according to the actual situation.
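The frequency filter can be sketched as a count followed by a threshold. The names `build_vocabulary` and `min_count` are hypothetical, and the toy corpus is made up for illustration.

```python
from collections import Counter


def build_vocabulary(segmented_corpus, min_count):
    """Keep the distinct words that occur at least min_count times."""
    counts = Counter(word for sentence in segmented_corpus for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


# Toy segmented corpus: each inner list is one segmented sentence.
corpus = [["a", "b", "a"], ["b", "c"]]
vocabulary = build_vocabulary(corpus, min_count=2)  # "c" occurs once and is dropped
```

Because the result is a set of distinct words, the n-gram lookup for each word needs to run only once, however often the word repeats in the corpus.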
In the embodiments of the present application, there may be multiple specific training approaches for step S104, such as training based on context words, or training based on specified near-synonyms or synonyms. For ease of understanding, the former is described in detail as an example.
Training the word vectors and the four-corner code character vectors according to the word vectors, the four-corner code character vectors, and the segmented corpus may specifically include: determining a specified word in the segmented corpus, and one or more context words of the specified word in the segmented corpus; determining the similarity between the specified word and a context word according to the four-corner code character vectors of the n-gram four-corner code characters corresponding to the specified word and the word vector of the context word; and updating the word vector of the context word and the four-corner code character vectors of the n-gram four-corner code characters corresponding to the specified word according to the similarity between the specified word and the context word.
The present application does not limit the specific way of determining the similarity. For example, the similarity may be calculated based on the cosine of the angle between vectors, or based on the sum of squares of vectors, and so on.
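As one of the options mentioned, a cosine-based similarity between two vectors can be sketched as follows. How the several n-gram vectors of a word would be combined before this comparison is left open here, and the zero-vector convention is an assumption of the sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Convention assumed here: similarity with a zero vector is 0.
    return dot / norm if norm else 0.0


orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 0.0
```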
There may be multiple specified words, and a specified word may occur repeatedly at different positions in the corpus; the processing described above may be performed for each specified word. Preferably, each word contained in the segmented corpus may be taken as a specified word.
In the embodiments of the present application, the training in step S104 may make the similarity between a specified word and its context words relatively high (here, similarity can reflect the degree of association: a word is relatively strongly associated with its context words, and words with the same or similar meanings also tend to have the same or similar context words), and the similarity between the specified word and non-context words relatively low. Non-context words can serve as the negative sample words described below, and context words, correspondingly, as positive sample words.
It follows that some negative sample words need to be determined as a contrast during training. One or more words may be selected at random from the segmented corpus as negative sample words, or non-context words may be strictly selected as negative sample words. Taking the former as an example, updating the word vector of the context word and the four-corner code character vectors of the n-gram four-corner code characters corresponding to the specified word according to the similarity between the specified word and the context word may specifically include: selecting one or more words from the words as negative sample words; determining the similarity between the specified word and each negative sample word; determining the loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word; and updating the word vector of the context word and the four-corner code character vectors of the n-gram four-corner code characters corresponding to the specified word according to the loss characterization value.
The loss characterization value measures the degree of error between the current vector values and the training target. The loss function may take the above similarities as parameters; the present application does not limit the specific expression of the loss function, which is described in more detail later.
In the embodiments of the present application, updating the word vectors and four-corner code character vectors in effect corrects this error. When the solution of the present application is implemented with a neural network, the correction may be based on backpropagation and gradient descent. In this case, the gradient is the gradient corresponding to the loss function.
Updating the word vector of the context word and the four-corner code character vectors of the n-gram four-corner code characters corresponding to the specified word according to the loss characterization value may then specifically include: determining the gradient corresponding to the loss function according to the loss characterization value; and updating the word vector of the context word and the four-corner code character vectors of the n-gram four-corner code characters corresponding to the specified word according to the gradient.
In the embodiments of the present application, the training of the word vectors and four-corner code character vectors may be carried out iteratively over at least some of the words in the segmented corpus, so that the word vectors and four-corner code character vectors gradually converge until training is complete.
Take training on all the words in the segmented corpus as an example. For step S104, training the word vectors and the four-corner code character vectors according to the word vectors, the four-corner code character vectors, and the segmented corpus may specifically include:
traversing the segmented corpus, and performing the following for each word in the segmented corpus:
determining one or more context words of the word in the segmented corpus;
and performing the following for each context word:
determining the similarity between the word and the context word according to the four-corner code character vectors of the n-gram four-corner code characters corresponding to the word and the word vector of the context word;
updating the word vector of the context word and the four-corner code character vectors of the n-gram four-corner code characters corresponding to the word according to the similarity between the word and the context word.
How the update is performed has been described above and is not repeated here.
Further, for ease of computer processing, the above traversal can be implemented with a window.
For example, determining one or more context words of the word in the segmented corpus may specifically include: in the segmented corpus, establishing a window by sliding a distance of a specified number of words to the left and/or right, centered on the word; and determining the words in the window other than the word itself as the context words of the word.
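The window can be sketched with list slicing. The function name `context_words` is hypothetical, and the window here is symmetric (k words on each side) and clipped at the ends of the token list.

```python
def context_words(tokens, index, k):
    """Return the words within k positions of tokens[index], excluding it."""
    left = max(0, index - k)
    right = min(len(tokens), index + k + 1)
    return tokens[left:index] + tokens[index + 1:right]


tokens = ["a", "b", "c", "d", "e"]
middle = context_words(tokens, 2, 1)  # ['b', 'd']
edge = context_words(tokens, 0, 2)    # ['b', 'c']
```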
Alternatively, a window of a set length may be established starting at the first word of the segmented corpus, containing the first word and a set number of consecutive words after it; after the words in the window have been processed, the window is slid backward to process the next batch of words in the corpus, until the corpus has been traversed.
The word vector processing method provided by the embodiments of the present application has been described above. For ease of understanding, based on the above description, the embodiments of the present application further provide a schematic flowchart of a specific implementation of the word vector processing method in a practical application scenario, as shown in Fig. 2.
Flow in Fig. 2 mainly includes the following steps that:
Step 1, Chinese language material is segmented using participle instrument, scanning participle after Chinese language material, statistics it is all go out The word now crossed deletes the word that occurrence number is less than b times (that is, above-mentioned setting number) to establish vocabulary;Jump procedure 2;
Step 2, vocabulary is scanned one by one, extracts n members corner code character corresponding to each word, establishes n members corner code word Accord with table, and the mapping table of word and corresponding n members corner code character;Jump procedure 3;
Step 3, the term vector that a dimension is d is established for each word in vocabulary, in the code character table of n members corner Each n members corner code character establish a dimension also for d corner code character vector, the institute that random initializtion is established is oriented Amount;Jump procedure 4;
Step 4, from the Chinese language material for completing participle, slided one by one since first word, one word of selection is made every time For " current word w (that is, above-mentioned specified word) ", if all words of the traversed whole language materials of w, terminate;Otherwise jump procedure 5;
Step 5, centered on current word w, window is established to k word of two Slideslips, first out of window word is to most The latter word (in addition to current word w), one word of selection is as " upper and lower cliction c ", if all in the traversed windows of c every time Word, then jump procedure 4;Otherwise, jump procedure 6;
Step 6: for the current word w, look up each n-gram four-corner-code character corresponding to w in the mapping table between words and n-gram four-corner-code characters built in step 2, and calculate the similarity between the current word w and the context word c according to formula (1):
where S denotes the n-gram four-corner-code character table built in step 2, S(w) denotes the set of n-gram four-corner-code characters corresponding to the current word w in the mapping table of step 2, and q denotes an element of the set S(w) (that is, one n-gram four-corner-code character); sim(w, c) denotes the similarity score between the current word w and the context word c; $\vec{q}\cdot\vec{c}$ denotes the dot product of the vector of the n-gram four-corner-code character q and the vector of the context word c; go to step 7;
Step 7: randomly select λ words as negative sample words, and calculate the loss score l(w, c) according to formula (2) (that is, the loss function described above); the loss score can serve as the loss characterization value described above:
where log is the logarithm function, c' is a randomly selected negative sample word, $E_{c'\in p(V)}[x]$ denotes the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function; see formula (3) for details:
where, if x is a real number, σ(x) is also a real number; the gradient is calculated from the value of l(w, c), and the n-gram four-corner-code character vector $\vec{q}$ and the context word vector $\vec{c}$ are updated; go to step 5.
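The formula images referenced in steps 6 and 7 are not reproduced in the text above. Under the definitions given there, forms consistent with the description might be written as follows. Formula (1) follows directly from the stated definitions; the shape of formula (2) (the sign inside the expectation and the factor λ) and the choice of the sigmoid for formula (3) are assumptions based on the usual negative-sampling objective, not something the surviving text confirms.

```latex
\begin{align}
\mathrm{sim}(w,c) &= \sum_{q \in S(w)} \vec{q}\cdot\vec{c} \tag{1}\\
l(w,c) &= \log\sigma\bigl(\mathrm{sim}(w,c)\bigr)
          + \lambda\,\mathbb{E}_{c'\in p(V)}\bigl[\log\sigma\bigl(-\mathrm{sim}(w,c')\bigr)\bigr] \tag{2}\\
\sigma(x) &= \frac{1}{1+e^{-x}} \tag{3}
\end{align}
```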
Among steps 1 to 7 above, steps 6 and 7 are the key steps. For ease of understanding, they are illustrated with reference to Fig. 3.
Fig. 3 is a schematic diagram of the processing performed on part of the corpus used in the flow of Fig. 2, as provided by an embodiment of the present application.
As shown in Fig. 3, suppose the corpus contains the sentence "controlling haze is very urgent", and word segmentation yields the three words "control", "haze" and "very urgent" in that sentence.
Suppose "haze" is now selected as the current word w and "control" as the context word c. All n-gram four-corner-code characters S(w) mapped by the current word w are extracted; for example, the 5-gram four-corner-code characters mapped by "haze" include "10427", "04271", "42710", "27102" and so on, and the 5-gram four-corner-code characters mapped by "control" include "33160", "31601", "16016", "60161" and so on. The loss score l(w, c) is then calculated according to formulas (1), (2) and (3), and the gradient is calculated in turn, so as to update the word vector of c and all four-corner-code character vectors corresponding to w.
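The 5-gram extraction in this example can be sketched as a sliding window over the four-corner code string of a word. The function names `code_ngrams` and `build_mapping` are hypothetical, and the code string "10427102" is only the prefix that can be read off the four 5-grams listed for "haze" (the text says "and so on", so the full string is longer). How a word's code string is obtained from its characters is assumed to be handled elsewhere.

```python
def code_ngrams(code_string, n):
    """Return every run of n consecutive four-corner code characters."""
    return [code_string[i:i + n] for i in range(len(code_string) - n + 1)]


def build_mapping(word_codes, n):
    """Build the n-gram table and the word -> n-grams mapping table.

    word_codes maps each word to its four-corner code string
    (assumed to be supplied by the caller).
    """
    mapping = {word: code_ngrams(code, n) for word, code in word_codes.items()}
    table = sorted({gram for grams in mapping.values() for gram in grams})
    return table, mapping


# Prefix of the code string for "haze", inferred from the 5-grams in the text.
print(code_ngrams("10427102", 5))
# -> ['10427', '04271', '42710', '27102']
```

The same call on the prefix "33160161" yields the four 5-grams listed for the context word in the example.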
Based on the same inventive concept as the embodiments of Fig. 1 and Fig. 2, an embodiment of the present application provides another word vector processing method.
Fig. 4 is a schematic flowchart of another word vector processing method provided by an embodiment of the present application.
The flow in Fig. 4 may include the following steps:
Step 1: perform word segmentation on the corpus, and build a vocabulary consisting of the words obtained by segmentation, where the words do not include any word that occurs in the corpus fewer than a set number of times; go to step 2;
Step 2: according to the vocabulary, build an n-gram four-corner-code character mapping table, the mapping table containing the mapping relations between each word and n-gram four-corner-code characters, where an n-gram four-corner-code character represents n consecutive four-corner-code characters of the word it maps to; go to step 3;
Step 3: according to the n-gram four-corner-code character mapping table, create and initialize the word vector of each word and the four-corner-code character vector of each n-gram four-corner-code character mapped by each word; go to step 4;
Step 4: traverse the word-segmented corpus, take each traversed word in turn as the current word w and perform step 5 on the current word w; end when the traversal is complete, otherwise continue traversing;
Step 5: with the current word w as the center, slide at most k words to each side to establish a window, traverse all words in the window other than the current word w, take each traversed word in turn as the current context word c of the current word w and perform step 6 on the current context word c; when the traversal is complete, continue with step 4, otherwise continue traversing;
Step 6: calculate the similarity between the current word w and the current context word c according to the following formula:
where S(w) denotes the set of n-gram four-corner-code characters mapped by the current word w in the n-gram four-corner-code character mapping table, q denotes an n-gram four-corner-code character in S(w), and sim(w, c) denotes the similarity between the current word w and the current context word c; $\vec{q}\cdot\vec{c}$ denotes the dot product of the four-corner-code character vector of q and the word vector of the current context word c; go to step 7;
Step 7: randomly select λ words as negative sample words, and calculate the corresponding loss characterization value l(w, c) according to the following loss function:
where c' is a randomly selected negative sample word, $E_{c'\in p(V)}[x]$ denotes the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, defined as:
The gradient corresponding to the loss function is calculated from the calculated loss characterization value l(w, c), and, according to the gradient, the four-corner-code character vector $\vec{q}$ of q and the word vector $\vec{c}$ of the current context word c are updated.
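Steps 6 and 7 can be sketched in plain Python as a single update for one (w, c) pair. This is a minimal sketch under stated assumptions, not the patented implementation: the formula images are missing from this text, so the sigmoid activation, the negated log-likelihood used as the loss, the replacement of the expectation by a sum over the sampled negative words, and the learning rate `lr` are all assumptions. As in the text, only the n-gram vectors of w and the word vector of c are updated.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def similarity(gram_vecs, word_vec):
    """sim(w, c): sum over q in S(w) of the dot product q . c."""
    return sum(dot(q, word_vec) for q in gram_vecs)


def loss(gram_vecs, ctx_vec, neg_vecs):
    """Assumed loss: negated log-likelihood with sampled negative words."""
    pos = math.log(sigmoid(similarity(gram_vecs, ctx_vec)))
    neg = sum(math.log(sigmoid(-similarity(gram_vecs, v))) for v in neg_vecs)
    return -(pos + neg)


def sgd_step(gram_vecs, ctx_vec, neg_vecs, lr=0.05):
    """One gradient-descent update of the n-gram vectors and the context vector."""
    d = len(ctx_vec)
    gram_sum = [sum(q[i] for q in gram_vecs) for i in range(d)]
    # Derivative of the loss with respect to sim(w, c) and each sim(w, c').
    g_pos = sigmoid(similarity(gram_vecs, ctx_vec)) - 1.0
    g_negs = [sigmoid(similarity(gram_vecs, v)) for v in neg_vecs]
    # Every q in S(w) receives the same gradient: g_pos * c + sum(g_neg * c').
    grad_q = [
        g_pos * ctx_vec[i] + sum(g * v[i] for g, v in zip(g_negs, neg_vecs))
        for i in range(d)
    ]
    grad_c = [g_pos * gram_sum[i] for i in range(d)]
    for q in gram_vecs:
        for i in range(d):
            q[i] -= lr * grad_q[i]
    for i in range(d):
        ctx_vec[i] -= lr * grad_c[i]
```

Calling `sgd_step` repeatedly on the same pair pushes sim(w, c) up and sim(w, c') down, which is the intended effect of training against negative samples.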
In this other word vector processing method, the steps may be performed by the same module or by different modules; the present application does not specifically limit this.
It should be noted that, besides the four-corner code, other encodings that can represent the spatial orientation features of the strokes of a character or word are equally applicable to the solution of the present application, for example the Wubi (five-stroke) encoding of Chinese characters; in that case the encoding replaces the four-corner code in each of the above solutions.
The above describes the word vector processing methods provided by the embodiments of the present application. Based on the same inventive concept, the embodiments of the present application also provide a corresponding apparatus, as shown in Fig. 5.
Fig. 5 is a schematic structural diagram of a word vector processing apparatus corresponding to Fig. 1, provided by an embodiment of the present application. The apparatus may be located at the execution body of the flow in Fig. 1, and includes:
a word segmentation module 501, which performs word segmentation on the corpus to obtain the words;
a determining module 502, which determines the n-gram four-corner-code characters corresponding to each word, an n-gram four-corner-code character representing n consecutive four-corner-code characters of its corresponding word;
an initialization module 503, which creates and initializes the word vector of each word and the four-corner-code character vector of each n-gram four-corner-code character corresponding to each word;
a training module 504, which trains the word vectors and the four-corner-code character vectors according to the word vectors, the four-corner-code character vectors and the word-segmented corpus.
Optionally, the determining module 502 determining the n-gram four-corner-code characters corresponding to each word specifically includes:
the determining module 502 determining, according to the result of the word segmentation of the corpus, the words that occur in the corpus;
and, for each of the words so determined, performing:
determining the n-gram four-corner-code characters corresponding to the word, where an n-gram four-corner-code character corresponding to the word represents n consecutive four-corner-code characters of the word, and n is one positive integer or several different positive integers.
Optionally, the determining module 502 determining, according to the result of the word segmentation of the corpus, the words that occur in the corpus specifically includes:
the determining module 502 determining, according to the result of the word segmentation of the corpus, the words that occur in the corpus and whose number of occurrences is not less than a set number.
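The occurrence threshold can be sketched with a frequency count over the segmented corpus. The function name `build_vocabulary` and the parameter `min_count` (standing in for the "set number") are hypothetical, and the corpus is assumed to be a list of already segmented sentences.

```python
from collections import Counter


def build_vocabulary(segmented_sentences, min_count):
    """Keep the words whose number of occurrences is not less than min_count."""
    counts = Counter(word for sentence in segmented_sentences for word in sentence)
    return sorted(word for word, count in counts.items() if count >= min_count)
```

Sorting is only there to make the result deterministic; the text does not prescribe any ordering of the vocabulary.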
Optionally, the initialization module 503 initializing the word vector of each word and the four-corner-code character vector of each n-gram four-corner-code character corresponding to each word specifically includes:
the initialization module 503 initializing, either by random initialization or by initialization according to a specified probability distribution, the word vector of each word and the four-corner-code character vector of each n-gram four-corner-code character corresponding to each word, where identical n-gram four-corner-code characters have identical four-corner-code character vectors.
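One way to guarantee that identical n-gram four-corner-code characters share one vector is to key the vectors by the n-gram itself rather than by word. The sketch below makes that choice; the function name `init_vectors`, the `seed` parameter and the uniform range are assumptions, since the text only requires random initialization or a specified probability distribution.

```python
import random


def init_vectors(vocabulary, mapping, d, seed=0):
    """Randomly initialize one d-dimensional vector per word and per distinct n-gram."""
    rng = random.Random(seed)

    def random_vector():
        # Small uniform values; the exact distribution is an assumption.
        return [rng.uniform(-0.5 / d, 0.5 / d) for _ in range(d)]

    word_vectors = {word: random_vector() for word in vocabulary}
    gram_vectors = {}
    for word in vocabulary:
        for gram in mapping[word]:
            # An n-gram shared by several words gets exactly one vector.
            if gram not in gram_vectors:
                gram_vectors[gram] = random_vector()
    return word_vectors, gram_vectors
```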
Optionally, the training module 504 training the word vectors and the four-corner-code character vectors according to the word vectors, the four-corner-code character vectors and the word-segmented corpus specifically includes:
the training module 504 determining a specified word in the word-segmented corpus, and one or more context words of the specified word in the word-segmented corpus;
determining the similarity between the specified word and the context word according to the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the specified word and the word vector of the context word;
updating, according to the similarity between the specified word and the context word, the word vector of the context word and the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the specified word.
Optionally, the training module 504 updating, according to the similarity between the specified word and the context word, the word vector of the context word and the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the specified word specifically includes:
the training module 504 selecting one or more words from the words as negative sample words;
determining the similarity between the specified word and each negative sample word;
determining the loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word;
updating, according to the loss characterization value, the word vector of the context word and the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the specified word.
Optionally, the training module 504 updating, according to the loss characterization value, the word vector of the context word and the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the specified word specifically includes:
the training module 504 determining the gradient corresponding to the loss function according to the loss characterization value;
updating, according to the gradient, the word vector of the context word and the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the specified word.
Optionally, the training module 504 selecting one or more words from the words as negative sample words specifically includes:
the training module 504 randomly selecting one or more words from the words as negative sample words.
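Random selection of negative sample words can be sketched as independent draws from the vocabulary. Uniform sampling with replacement is an assumption here, as the text only says the words are selected randomly; the function name `sample_negatives` and its parameters are hypothetical.

```python
import random


def sample_negatives(vocabulary, num_negatives, rng=None):
    """Draw num_negatives words uniformly at random (with replacement)."""
    rng = rng or random.Random()
    return [rng.choice(vocabulary) for _ in range(num_negatives)]
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which is convenient when debugging a training run.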
Optionally, the training module 504 training the word vectors and the four-corner-code character vectors according to the word vectors, the four-corner-code character vectors and the word-segmented corpus specifically includes:
the training module 504 traversing the word-segmented corpus and performing the following for each word in the word-segmented corpus:
determining one or more context words of the word in the word-segmented corpus;
and, for each context word, performing:
determining the similarity between the word and the context word according to the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the word and the word vector of the context word;
updating, according to the similarity between the word and the context word, the word vector of the context word and the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the word.
Optionally, the training module 504 determining one or more context words of the word in the word-segmented corpus specifically includes:
the training module 504 establishing a window in the word-segmented corpus by taking the word as the center and sliding a distance of a specified number of words to the left and/or to the right;
determining the words in the window other than the word itself as the context words of the word.
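The window construction can be sketched as a slice around the position of the word, clipped at the ends of the token sequence. The function name `context_words` and the parameter `k` (the specified number of words on each side) are hypothetical, and a symmetric window is assumed even though the text also allows sliding to one side only.

```python
def context_words(tokens, index, k):
    """Return the words within k positions of tokens[index], excluding that word."""
    lo = max(0, index - k)
    hi = min(len(tokens), index + k + 1)
    return [tokens[j] for j in range(lo, hi) if j != index]


print(context_words(["control", "haze", "very urgent"], 1, 1))
# -> ['control', 'very urgent']
```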
Optionally, the word is a Chinese word, the word vector is the word vector of a Chinese word, and the four-corner-code character is a four-corner-code character.
The apparatus and the method provided by the embodiments of the present application correspond one to one; the apparatus therefore has beneficial technical effects similar to those of the corresponding method. Since the beneficial technical effects of the method have been described in detail above, those of the corresponding apparatus are not repeated here.
In the 1990s, an improvement to a technology could be clearly classified as either a hardware improvement (for example, an improvement to circuit structures such as diodes, transistors and switches) or a software improvement (an improvement to a method flow). With the development of technology, however, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be implemented with a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user programming the device. A designer programs a digital system onto a single PLD without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly carried out with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller can be implemented in any suitable manner. For example, a controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (ASIC), a programmable logic controller or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same function in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component. Indeed, the means for implementing various functions can be regarded both as software modules implementing a method and as structures within a hardware component.
The systems, apparatuses, modules or units described in the above embodiments can be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementing device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described by dividing it into various units according to function. Of course, when implementing the present application, the functions of the units can be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that embodiments of the present invention can be provided as a method, a system or a computer program product. The present invention can therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or another programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface and memory.
The memory may include non-permanent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, commodity or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity or device that includes the element.
Those skilled in the art should understand that embodiments of the present application can be provided as a method, a system or a computer program product. The present application can therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical storage) containing computer-usable program code.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for the relevant parts, refer to the description of the method embodiment.
The foregoing is merely embodiments of the present application and is not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (23)

1. A word vector processing method, comprising:
performing word segmentation on a corpus to obtain words;
determining the n-gram four-corner-code characters corresponding to each word, an n-gram four-corner-code character representing n consecutive four-corner-code characters of its corresponding word;
creating and initializing the word vector of each word and the four-corner-code character vector of each n-gram four-corner-code character corresponding to each word;
training the word vectors and the four-corner-code character vectors according to the word vectors, the four-corner-code character vectors and the word-segmented corpus.
2. The method of claim 1, wherein determining the n-gram four-corner-code characters corresponding to each word specifically comprises:
determining, according to the result of the word segmentation of the corpus, the words that occur in the corpus;
for each distinct word so determined, performing:
determining the n-gram four-corner-code characters corresponding to the word, an n-gram four-corner-code character corresponding to the word representing n consecutive four-corner-code characters of the word, n being one positive integer or several different positive integers.
3. The method of claim 2, wherein determining, according to the result of the word segmentation of the corpus, the words that occur in the corpus specifically comprises:
determining, according to the result of the word segmentation of the corpus, the words that occur in the corpus and whose number of occurrences is not less than a set number.
4. The method of claim 1, wherein initializing the word vector of each word and the four-corner-code character vector of each n-gram four-corner-code character corresponding to each word specifically comprises:
initializing, either by random initialization or by initialization according to a specified probability distribution, the word vector of each word and the four-corner-code character vector of each n-gram four-corner-code character corresponding to each word, wherein identical n-gram four-corner-code characters have identical four-corner-code character vectors.
5. The method of claim 1, wherein training the word vectors and the four-corner-code character vectors according to the word vectors, the four-corner-code character vectors and the word-segmented corpus specifically comprises:
determining a specified word in the word-segmented corpus, and one or more context words of the specified word in the word-segmented corpus;
determining the similarity between the specified word and the context word according to the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the specified word and the word vector of the context word;
updating, according to the similarity between the specified word and the context word, the word vector of the context word and the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the specified word.
6. The method of claim 5, wherein updating, according to the similarity between the specified word and the context word, the word vector of the context word and the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the specified word specifically comprises:
selecting one or more words from the words as negative sample words;
determining the similarity between the specified word and each negative sample word;
determining the loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word;
updating, according to the loss characterization value, the word vector of the context word and the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the specified word.
7. The method of claim 6, wherein updating, according to the loss characterization value, the word vector of the context word and the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the specified word specifically comprises:
determining the gradient corresponding to the loss function according to the loss characterization value;
updating, according to the gradient, the word vector of the context word and the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the specified word.
8. The method of claim 6, wherein selecting one or more words from the words as negative sample words specifically comprises:
randomly selecting one or more words from the words as negative sample words.
9. The method of claim 1, wherein training the word vectors and the four-corner-code character vectors according to the word vectors, the four-corner-code character vectors and the word-segmented corpus specifically comprises:
traversing the word-segmented corpus and performing the following for each word in the word-segmented corpus:
determining one or more context words of the word in the word-segmented corpus;
for each context word, performing:
determining the similarity between the word and the context word according to the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the word and the word vector of the context word;
updating, according to the similarity between the word and the context word, the word vector of the context word and the four-corner-code character vectors of the n-gram four-corner-code characters corresponding to the word.
10. The method of claim 9, wherein determining one or more context words of the word in the word-segmented corpus specifically comprises:
establishing a window in the word-segmented corpus by taking the word as the center and sliding a distance of a specified number of words to the left and/or to the right;
determining the words in the window other than the word itself as the context words of the word.
11. The method of any one of claims 1 to 10, wherein the word is a Chinese word, the word vector is the word vector of a Chinese word, and the four-corner-code character is a four-corner-code character.
12. A word vector processing apparatus, comprising:
a word segmentation module, which performs word segmentation on a corpus to obtain words;
a determining module, which determines the n-gram four-corner-code characters corresponding to each word, an n-gram four-corner-code character representing n consecutive four-corner-code characters of its corresponding word;
an initialization module, which creates and initializes the word vector of each word and the four-corner-code character vector of each n-gram four-corner-code character corresponding to each word;
a training module, which trains the word vectors and the four-corner-code character vectors according to the word vectors, the four-corner-code character vectors and the word-segmented corpus.
13. The device as claimed in claim 12, wherein the determining module determining the n-gram corner-code characters corresponding to each word specifically includes:
The determining module determines, according to the result of the word segmentation of the corpus, the words occurring in the corpus;
For each distinct word so determined, respectively performing:
Determining the n-gram corner-code characters corresponding to the word, where an n-gram corner-code character corresponding to the word characterizes n consecutive corner-code characters of the word, and n is one positive integer or multiple different positive integers.
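The n-gram extraction of claim 13 can be illustrated with a short sketch. It assumes that the corner-code characters of a word are already available as one string (how a word is converted to its code string is outside this sketch), and the function name `corner_code_ngrams` and the sample code string are hypothetical.

```python
def corner_code_ngrams(code_string, ns=(3,)):
    """Return every run of n consecutive corner-code characters, for each n in ns."""
    grams = []
    for n in ns:  # n may be one positive integer or several different ones
        for start in range(len(code_string) - n + 1):
            grams.append(code_string[start:start + n])
    return grams


# "40227" is a made-up code string used only for illustration.
print(corner_code_ngrams("40227", ns=(3, 4)))
# prints ['402', '022', '227', '4022', '0227']
```

A code string shorter than n yields no n-grams for that n.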
14. The device as claimed in claim 13, wherein the determining module determining, according to the result of the word segmentation of the corpus, the words occurring in the corpus specifically includes:
The determining module determines, according to the result of the word segmentation of the corpus, the words that occur in the corpus and whose number of occurrences is no less than a set number.
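The frequency filter of claim 14 can be sketched with the standard library's `collections.Counter`. The sketch assumes the segmented corpus is a list of word lists; the name `build_vocabulary` and the parameter `min_count` (standing for the "set number") are hypothetical.

```python
from collections import Counter


def build_vocabulary(segmented_sentences, min_count):
    """Keep the distinct words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in segmented_sentences for word in sentence)
    return sorted(word for word, count in counts.items() if count >= min_count)


# Placeholder tokens stand in for segmented words.
print(build_vocabulary([["a", "b", "a"], ["b", "c"]], min_count=2))
# prints ['a', 'b']
```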
15. The device as claimed in claim 12, wherein the initialization module initializing the word vector of each word and the corner-code character vector of each n-gram corner-code character corresponding to each word specifically includes:
The initialization module initializes, by random initialization or by initialization according to a specified probability distribution, the word vector of each word and the corner-code character vector of each n-gram corner-code character corresponding to each word, where identical n-gram corner-code characters have identical corner-code character vectors.
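One way to satisfy the requirement that identical n-gram corner-code characters share a vector is to key the vectors by the n-gram itself, so that an n-gram occurring in several words is initialized only once. The sketch below assumes a uniform distribution on a small interval, which is only one possible choice of "specified probability distribution"; the names `init_vectors`, `dim`, and `seed` are hypothetical.

```python
import random


def init_vectors(word_to_ngrams, dim, seed=0):
    """Randomly initialize one vector per word and one vector per distinct n-gram."""
    rng = random.Random(seed)

    def new_vector():
        # Assumed distribution: uniform on [-0.5/dim, 0.5/dim].
        return [rng.uniform(-0.5 / dim, 0.5 / dim) for _ in range(dim)]

    word_vectors = {word: new_vector() for word in word_to_ngrams}
    ngram_vectors = {}
    for grams in word_to_ngrams.values():
        for gram in grams:
            if gram not in ngram_vectors:  # shared n-grams get a single vector
                ngram_vectors[gram] = new_vector()
    return word_vectors, ngram_vectors


# Made-up words and code n-grams; "234" is shared by both words.
mapping = {"w1": ["123", "234"], "w2": ["234", "345"]}
word_vectors, ngram_vectors = init_vectors(mapping, dim=4)
print(len(word_vectors), len(ngram_vectors))
# prints 2 3
```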
16. The device as claimed in claim 12, wherein the training module training the word vectors and the corner-code character vectors according to the word vectors, the corner-code character vectors, and the corpus after word segmentation specifically includes:
The training module determines a specified word in the corpus after word segmentation, and one or more context words of the specified word in the corpus after word segmentation;
Determining the similarity between the specified word and the context word according to the corner-code character vectors of the n-gram corner-code characters corresponding to the specified word and the word vector of the context word;
Updating, according to the similarity between the specified word and the context word, the word vector of the context word and the corner-code character vectors of the n-gram corner-code characters corresponding to the specified word.
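Claim 16 does not fix a similarity formula. A minimal sketch, assuming the summed dot product that step 6 of claim 23 writes out, is given below; the function name `similarity` and the sample vectors are hypothetical.

```python
def similarity(grams, ngram_vectors, context_vector):
    """Sum, over the word's n-grams, of the dot product with the context word vector."""
    total = 0.0
    for gram in grams:
        total += sum(a * b for a, b in zip(ngram_vectors[gram], context_vector))
    return total


# Made-up two-dimensional vectors for two n-grams of one word.
ngram_vectors = {"123": [1.0, 0.0], "234": [0.0, 2.0]}
print(similarity(["123", "234"], ngram_vectors, [3.0, 4.0]))
# prints 11.0
```

Note that the specified word contributes only through its n-gram vectors here, while the context word contributes through its word vector, which matches the wording of the claim.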
17. The device as claimed in claim 16, wherein the training module updating, according to the similarity between the specified word and the context word, the word vector of the context word and the corner-code character vectors of the n-gram corner-code characters corresponding to the specified word specifically includes:
The training module selects one or more words from the words as negative sample words;
Determining the similarity between the specified word and each negative sample word;
Determining a loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word;
Updating, according to the loss characterization value, the word vector of the context word and the corner-code character vectors of the n-gram corner-code characters corresponding to the specified word.
18. The device as claimed in claim 17, wherein the training module updating, according to the loss characterization value, the word vector of the context word and the corner-code character vectors of the n-gram corner-code characters corresponding to the specified word specifically includes:
The training module determines the gradient corresponding to the loss function according to the loss characterization value;
Updating, according to the gradient, the word vector of the context word and the corner-code character vectors of the n-gram corner-code characters corresponding to the specified word.
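The update of claims 17 and 18 can be sketched for a single (specified word, context word) pair. The sketch assumes the loss function of step 7 of claim 23 with the expectation replaced by the sampled negative words, assumes that the value is increased by gradient ascent (the claims do not state a sign convention), and uses a hypothetical learning rate `lr`. As in the claims, only the context word vector and the n-gram vectors of the specified word are updated.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def update_step(grams, ngram_vectors, context_vec, negative_vecs, lr=0.05):
    """One ascent step on log s(sim(w,c)) + sum of log s(-sim(w,c')); returns the value before the step."""
    dim = len(context_vec)
    # sim(w, x) is the dot product of the summed n-gram vectors with the vector of x.
    gram_sum = [sum(ngram_vectors[g][d] for g in grams) for d in range(dim)]
    pos_sim = dot(gram_sum, context_vec)
    neg_sims = [dot(gram_sum, neg) for neg in negative_vecs]
    value = math.log(sigmoid(pos_sim)) + sum(math.log(sigmoid(-s)) for s in neg_sims)

    pos_coef = 1.0 - sigmoid(pos_sim)
    # Gradient with respect to each n-gram vector (the same for every n-gram of the word).
    grad_q = [
        pos_coef * context_vec[d]
        - sum(sigmoid(s) * neg[d] for s, neg in zip(neg_sims, negative_vecs))
        for d in range(dim)
    ]
    # Gradient with respect to the context word vector.
    grad_c = [pos_coef * gram_sum[d] for d in range(dim)]

    for g in grams:
        for d in range(dim):
            ngram_vectors[g][d] += lr * grad_q[d]
    for d in range(dim):
        context_vec[d] += lr * grad_c[d]
    return value


# Made-up vectors: two n-grams of the specified word, one context word, one negative sample.
ngram_vectors = {"123": [0.1, -0.2], "234": [0.0, 0.3]}
context_vec = [0.2, 0.1]
negatives = [[-0.1, 0.4]]
before = update_step(["123", "234"], ngram_vectors, context_vec, negatives)
after = update_step(["123", "234"], ngram_vectors, context_vec, negatives)
print(after > before)
```

With a small step size the returned value rises from one call to the next, which is the behaviour expected of an ascent step on this function.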
19. The device as claimed in claim 17, wherein the training module selecting one or more words from the words as negative sample words specifically includes:
The training module randomly selects one or more words from the words as negative sample words.
20. The device as claimed in claim 12, wherein the training module training the word vectors and the corner-code character vectors according to the word vectors, the corner-code character vectors, and the corpus after word segmentation specifically includes:
The training module traverses the corpus after word segmentation and, for each word in the corpus after word segmentation, respectively performs:
Determining one or more context words of the word in the corpus after word segmentation;
For each context word, respectively performing:
Determining the similarity between the word and the context word according to the corner-code character vectors of the n-gram corner-code characters corresponding to the word and the word vector of the context word;
Updating, according to the similarity between the word and the context word, the word vector of the context word and the corner-code character vectors of the n-gram corner-code characters corresponding to the word.
21. The device as claimed in claim 20, wherein the training module determining one or more context words of the word in the corpus after word segmentation specifically includes:
In the corpus after word segmentation, the training module establishes a window centered on the word by sliding a distance of a specified number of words to the left and/or to the right;
Determining the words in the window other than the word itself as the context words of the word.
22. The device as claimed in any one of claims 12 to 21, wherein the word is a Chinese word, the word vector is the word vector of a Chinese word, and the corner-code character is a four-corner code character.
23. A word vector processing method, including:
Step 1, perform word segmentation on a corpus and establish a vocabulary composed of the words obtained by the segmentation, where the words do not include words whose number of occurrences in the corpus is less than a set number; go to step 2;
Step 2, according to the vocabulary, establish an n-gram corner-code character mapping table, where the mapping table includes the mapping relations between each word and its n-gram corner-code characters, and an n-gram corner-code character characterizes n consecutive corner-code characters of the word it is mapped from; go to step 3;
Step 3, according to the n-gram corner-code character mapping table, establish and initialize the word vector of each word and the corner-code character vector of each n-gram corner-code character mapped by each word; go to step 4;
Step 4, traverse the corpus after word segmentation, take each traversed word in turn as the current word w, and perform step 5 on the current word w; end if the traversal is complete, otherwise continue traversing;
Step 5, centered on the current word w, slide at most k words to each side to establish a window, traverse all words in the window other than the current word w, take each traversed word in turn as the current context word c of the current word w, and perform step 6 on the current context word c; if the traversal is complete, continue the execution of step 4, otherwise continue traversing;
Step 6, calculate the similarity between the current word w and the current context word c according to the following formula:
$$\mathrm{sim}(w, c) = \sum_{q \in S(w)} \vec{q} \cdot \vec{c};$$
Wherein, S(w) represents the set of n-gram corner-code characters that the current word w maps to in the n-gram corner-code character mapping table, q represents an n-gram corner-code character in S(w), sim(w, c) represents the similarity between the current word w and the current context word c, and $\vec{q} \cdot \vec{c}$ represents the dot product of the corner-code character vector of q and the word vector of the current context word c; go to step 7;
Step 7, randomly select λ words as negative sample words, and calculate the corresponding loss characterization value l(w, c) according to the following loss function:
$$l(w, c) = \log \sigma(\mathrm{sim}(w, c)) + \sum_{i=1}^{\lambda} E_{c' \in p(V)}\left[\log \sigma(-\mathrm{sim}(w, c'))\right];$$
Wherein, c' is a randomly selected negative sample word, $E_{c' \in p(V)}[x]$ denotes the expected value of the expression x when the randomly selected negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$;
Calculate the gradient corresponding to the loss function according to the calculated loss characterization value l(w, c), and update, according to the gradient, the corner-code character vector $\vec{q}$ of q and the word vector $\vec{c}$ of the current context word c.
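The gradient used in step 7 is not written out in the claim. The following derivation is a sketch under two assumptions that go beyond the source: the expectation is replaced by the λ sampled negative words $c'_1, \dots, c'_\lambda$, and σ is the logistic function, so that $\frac{d}{dx}\log\sigma(x) = 1 - \sigma(x)$ and $\frac{d}{dx}\log\sigma(-x) = -\sigma(x)$. The indexed symbols $c'_i$ are introduced here for illustration only.

```latex
\begin{align}
l(w, c) &\approx \log \sigma(\mathrm{sim}(w, c))
          + \sum_{i=1}^{\lambda} \log \sigma(-\mathrm{sim}(w, c'_i)), \\
\frac{\partial\, \mathrm{sim}(w, c)}{\partial \vec{q}} &= \vec{c}, \qquad
\frac{\partial\, \mathrm{sim}(w, c)}{\partial \vec{c}} = \sum_{q \in S(w)} \vec{q}, \\
\frac{\partial l}{\partial \vec{q}} &= \bigl(1 - \sigma(\mathrm{sim}(w, c))\bigr)\, \vec{c}
          - \sum_{i=1}^{\lambda} \sigma(\mathrm{sim}(w, c'_i))\, \vec{c'_i}, \\
\frac{\partial l}{\partial \vec{c}} &= \bigl(1 - \sigma(\mathrm{sim}(w, c))\bigr) \sum_{q \in S(w)} \vec{q}.
\end{align}
```

Because sim(w, c) is a sum over S(w), every n-gram corner-code character q of the current word w receives the same gradient, while the word vector of c receives a gradient proportional to the sum of the n-gram vectors. Whether the vectors are moved along or against this gradient depends on the sign convention adopted for the loss, which the claim leaves open.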
CN201710383929.0A 2017-05-26 2017-05-26 Word vector processing method and device Active CN107423269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710383929.0A CN107423269B (en) 2017-05-26 2017-05-26 Word vector processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710383929.0A CN107423269B (en) 2017-05-26 2017-05-26 Word vector processing method and device

Publications (2)

Publication Number Publication Date
CN107423269A true CN107423269A (en) 2017-12-01
CN107423269B CN107423269B (en) 2020-12-18

Family

ID=60429134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710383929.0A Active CN107423269B (en) 2017-05-26 2017-05-26 Word vector processing method and device

Country Status (1)

Country Link
CN (1) CN107423269B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119507A (en) * 2018-02-05 2019-08-13 阿里巴巴集团控股有限公司 Term vector generation method, device and equipment
CN108595592A (en) * 2018-04-19 2018-09-28 成都睿码科技有限责任公司 A kind of text emotion analysis method based on five-stroke form code character level language model
CN111274793A (en) * 2018-11-19 2020-06-12 阿里巴巴集团控股有限公司 Text processing method and device and computing equipment
CN111274793B (en) * 2018-11-19 2023-04-28 阿里巴巴集团控股有限公司 Text processing method and device and computing equipment
CN113220865A (en) * 2021-04-15 2021-08-06 山东师范大学 Text similar vocabulary retrieval method, system, medium and electronic equipment
CN113220865B (en) * 2021-04-15 2022-06-24 山东师范大学 Text similar vocabulary retrieval method, system, medium and electronic equipment

Also Published As

Publication number Publication date
CN107423269B (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN108345580A (en) A kind of term vector processing method and processing device
TWI701588B (en) Word vector processing method, device and equipment
AU2018247340B2 (en) Dvqa: understanding data visualizations through question answering
CN108874765A (en) Term vector processing method and processing device
TWI689831B (en) Word vector generating method, device and equipment
CN107423269A (en) Term vector processing method and processing device
TWI686713B (en) Word vector generating method, device and equipment
CN107402945A (en) Word stock generating method and device, short text detection method and device
CN109271587A (en) A kind of page generation method and device
CN106155540B (en) Electronic brush pen pen shape treating method and apparatus
CN110471835A (en) A kind of similarity detection method and system based on power information system code file
CN112070040A (en) Text line detection method for video subtitles
KR20220034083A (en) Method and apparatus of generating font database, and method and apparatus of training neural network model, electronic device, recording medium and computer program
CN110134852A (en) A kind of De-weight method of document, equipment and readable medium
CN110019952B (en) Video description method, system and device
JP2023092442A (en) Integrated circuit chip verification method, device, electronic device and storage medium
CN110119754A (en) Image generates description method, apparatus and model
CN111091001B (en) Method, device and equipment for generating word vector of word
CN107368281A (en) A kind of data processing method and device
CN107247704A (en) Term vector processing method, device and electronic equipment
CN107562716A (en) Term vector processing method, device and electronic equipment
WO2023221407A1 (en) Model generation method and apparatus and electronic device
CN105260741B (en) A kind of digital picture labeling method based on high-order graph structure p Laplacian sparse coding
CN107577658A (en) Term vector processing method, device and electronic equipment
CN110321433A (en) Determine the method and device of text categories

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK; Ref legal event code: DE; Ref document number: 1247338; Country of ref document: HK
TA01 Transfer of patent application right
Effective date of registration: 20200923; Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands; Applicant after: Innovative advanced technology Co.,Ltd.; Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands; Applicant before: Advanced innovation technology Co.,Ltd.
Effective date of registration: 20200923; Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands; Applicant after: Advanced innovation technology Co.,Ltd.; Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands; Applicant before: Alibaba Group Holding Ltd.
GR01 Patent grant