CN107273355A - Chinese word vector generation method based on word-character joint training - Google Patents
Chinese word vector generation method based on word-character joint training
- Publication number: CN107273355A (application CN201710435279.XA)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06F40/30 Semantic analysis (G Physics; G06 Computing; G06F Electric digital data processing; G06F40/00 Handling natural language data)
- G06F40/284 Lexical analysis, e.g. tokenisation or collocates (G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities)
Abstract
The invention discloses a Chinese word vector generation method based on word-character joint training, belonging to the technical field of natural language processing. Taking the Chinese characters inside words as key features, the method jointly trains Chinese word vector representations from context words and context characters. On the basis of a word-only word vector model, we introduce the characters that compose each word, predicting the target word from the context words and, at the same time, from the context characters. Applying the word-only model and the word-character joint training model separately and comparing the validity and robustness of the trained word vectors, we find that the Chinese word vectors generated by the joint training model better capture the semantic characteristics of Chinese and are also more robust. The invention provides a new method for generating Chinese word vectors, and offers a new solution for the generation and application of Chinese word vectors.
Description
Technical field
The invention belongs to the technical field of natural language processing and relates to a Chinese word vector generation method based on word-character joint training.
Background art
In recent years natural language processing technology has been applied in many aspects of everyday life, and text representation is one of the most fundamental research topics in the field. Word representation is one kind of data representation, and data representation is an early preparatory step in machine learning whose quality has a great impact on the performance of machine learning models. For the problems faced in Chinese natural language processing, we hope that computers can automatically learn text representations directly from large-scale unlabeled text data, and that the semantic information of words and texts is embodied in these representations. Conventional word embedding models such as Word2Vec and GloVe do not fit the linguistic characteristics of Chinese; word vector models that perform better on Chinese and capture semantic information more accurately still await further exploration by researchers.
Summary of the invention
The purpose of the present invention is mainly to address shortcomings of existing research by proposing a Chinese word vector generation method based on word-character joint training, i.e., the ECWE model, which combines the characters inside words with the external context words and characters to obtain high-quality Chinese word embeddings. ECWE learns Chinese word vectors by combining internal characters with external words through a simple but general method. Through the internal characters and the external context characters, we create more connections between otherwise isolated words; by modeling Chinese characters effectively, the model strengthens the relations between characters and between characters and words, and at the same time enriches the contextual information of words, so that the word representations contain more semantic information and the quality of the word representations is improved.
Technical solution:
A Chinese word vector generation method based on word-character joint training, with the following steps:
(1) Chinese text data processing stage
Generating word representation vectors requires the support of a large corpus. The corpus can be built in-house or purchased. Once the corpus is available, it is first segmented into words. Many segmentation tools are available at present; this step is not claimed as a distinguishing feature of the method.
(2) Chinese word representation vector generation stage
In Chinese, a word usually consists of several characters and carries rich internal meaning; the meaning of a word is usually related to the characters that compose it. For example, the meaning of the Chinese word 科技 (science and technology) can be learned from its surrounding context in the corpus, and it can also be inferred from its component characters 科 and 技. This motivates the idea of using character information to improve the Chinese word embedding model and learn Chinese word representation vectors.
In the initial stage, we randomly generate the vector representations w of words and c of characters; the dimension is 100 and each dimension value is a random decimal between 0 and 1.
2.1) Predicting the target word from context words
Given a sentence D = {x_1, …, x_M}, where M is the sentence length and x_j is the j-th word in the sentence, the target word is predicted from the context words within a fixed window of size K. Considering the characteristics of Chinese, this step takes the sum of each context word's vector and the averaged vectors of its internal characters as the vector representation of the context of the target word w. For each character, depending on its position, there are three different vector representations (c^B, c^M, c^E), corresponding to the beginning, middle, and end of a word. The context-word vector is computed as:

$$x_w = \frac{1}{2K}\sum_{j}\Big(w_j + \frac{1}{N_j}\big(c_1^{B} + \sum_{k=2}^{N_j-1} c_k^{M} + c_{N_j}^{E}\big)\Big),\qquad j = w-K,\dots,w-1,\,w+1,\dots,w+K$$

where w_j is the word vector of x_j, N_j is the number of characters in x_j, and c_k is the vector of the k-th character of x_j.

From the above formula we obtain the context-word representation x_w, from which the target word x_i is predicted; the objective is to maximize the conditional probability of the target word given its context words:

$$L(D) = \frac{1}{M}\sum_{i=K}^{M-K}\log P(x_i \mid x_w)$$

where M is the sentence length and K the window size.
2.2) Predicting the target word from context characters
For a sentence D = {x_1, …, x_M}, the sentence is first traversed and each character in every word is mapped to its vector by table lookup, excluding the target word. The target word is predicted from the context within a fixed window; considering the characteristics of Chinese, this step uses the average of the vectors of the internal characters of the context words as the context-character representation. For each character, depending on its position, there are three different vector representations (c^B, c^M, c^E), corresponding to the beginning, middle, and end of a word. The context-character vector is computed as:

$$c_w = \frac{1}{2K}\sum_{j}\frac{1}{N_j}\Big(c_1^{B} + \sum_{k=2}^{N_j-1} c_k^{M} + c_{N_j}^{E}\Big),\qquad j = w-K,\dots,w-1,\,w+1,\dots,w+K$$

From the above formula we obtain the context-character representation c_w, from which the target word x_i is predicted; the objective is to maximize the conditional probability of the target word given its context characters:

$$L(D) = \frac{1}{M}\sum_{i=K}^{M-K}\log P(x_i \mid c_w)$$

where M is the sentence length and K the window size.
2.3) Predicting the target word jointly from words and characters
The previous steps define objectives for predicting the target word from words and from characters. In this step, for a sentence D = {x_1, …, x_M}, the two objectives are combined so that words and characters are trained jointly; that is, while optimizing the conditional probability of the target word given its context words, the conditional probability of the target word given each character in the context is optimized at the same time:

$$L(\theta) = \frac{1}{M}\sum_{w\in W}\Big[(1-\beta)\log P\big(w \mid \mathrm{Context}(w)\big) + \beta\log P\big(w \mid \mathrm{Circum}(w)\big)\Big]$$

where M is the sentence length, W is the word vocabulary, w is the target word (x_i above), Context(w) is the word context of w (x_w above), Circum(w) is the character context of w (c_w above), and β is a decimal between 0 and 1 that controls the proportion of character-based modeling.
2.4) Iterative updating
To reduce computational complexity, this step optimizes the objective with negative sampling; specifically, the conditional probabilities are computed as:

$$P\big(w \mid \mathrm{Context}(w)\big) = \prod_{u\in\{w\}\cup NEG(w)} \big[\sigma(x_w^{\top}\theta^{u})\big]^{L^{w}(u)}\cdot\big[1-\sigma(x_w^{\top}\theta^{u})\big]^{1-L^{w}(u)}$$

$$P\big(w \mid \mathrm{Circum}(w)\big) = \prod_{u\in\{w\}\cup NEG(w)} \big[\sigma(c_w^{\top}\theta^{u})\big]^{L^{w}(u)}\cdot\big[1-\sigma(c_w^{\top}\theta^{u})\big]^{1-L^{w}(u)}$$

where NEG(w) is the negative sample set (the number of negative samples is set to 5); L^w(u) is the label of a sample u, with L^w(u) = 1 when u is the target word w and L^w(u) = 0 otherwise; x_w is the context-word representation of the target word w, c_w is its context-character representation, and θ^u is the parameter vector.

Finally the objective is solved with the stochastic gradient descent algorithm; the update formulas are:

$$v(\tilde{w}) := v(\tilde{w}) + \eta\sum_{u\in\{w\}\cup NEG(w)}\frac{\partial L(w,u)}{\partial x_w},\qquad \tilde{w}\in \mathrm{Context}(w)$$

$$v(\tilde{c}) := v(\tilde{c}) + \eta\sum_{u\in\{w\}\cup NEG(w)}\frac{\partial L(w,u)}{\partial c_w},\qquad \tilde{c}\in \mathrm{Circum}(w)$$

After iterative training ends, the word vector parameter set w is the Chinese word vector representation generated by our model.
The beneficial effect of the present invention is that it discloses a Chinese word vector generation method based on word-character joint training, which takes the characters inside words as key features and jointly trains Chinese word vector representations from context words and context characters. On the basis of the word-only word vector model, we introduce the characters that compose each word, predicting the target word from the context words and, at the same time, from the context characters. Applying the word-only model and the word-character joint training model separately and comparing the validity and robustness of the trained word vectors, we find that the Chinese word vectors generated by the joint training model better capture the semantic characteristics of Chinese and are also more robust. The invention provides a new method for generating Chinese word vectors, and offers a new solution for the generation and application of Chinese word vectors.
Brief description of the drawings
Fig. 1 is the overall architecture diagram of the method of the invention.
Fig. 2 shows the evaluation results of the method on the semantic similarity task (ECWE is the abbreviation of the model of the invention); the figure shows that the Chinese word vectors generated by the invention contain more accurate semantic information.
Fig. 3 shows the evaluation results on the analogical reasoning task; the figure shows that the Chinese word vectors generated by the invention contain more accurate semantic information.
Fig. 4 shows the evaluation results on the text classification task; the figure shows that the Chinese word vectors generated by the invention are better suited to Chinese natural language processing tasks.
Fig. 5 shows the evaluation results under different corpus sizes, demonstrating that the invention is more robust.
Fig. 6 shows the evaluation results under different character-modeling ratios, demonstrating that the invention is more robust.
Embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the specific embodiments of the present invention are described in further detail below.
The invention provides a Chinese word vector generation method based on word-character joint training, which includes:
(1) Chinese text data processing stage
Generating word representation vectors requires the support of a large corpus, which can be built in-house or purchased; here we take the Chinese Wikipedia dataset as an example.
1.1) The Chinese Wikipedia dataset is chosen as the training corpus. It covers a wide range of domains; the corpus contains 182 million Chinese words, the word vocabulary size is 457,000, and the character vocabulary size is 9,000.
The Chinese Wikipedia dataset is then preprocessed. Wikipedia's Chinese data mixes traditional and simplified script: it contains mainland simplified, Taiwan traditional, and Hong Kong/Macau traditional text, and different paragraphs of the same article sometimes use different variants. If simplified and traditional forms of the same word coexist they conflict, so for normalization the traditional characters must be converted: we use the open source project opencc to convert the traditional characters in the corpus to simplified characters.
1.2) Once the corpus is ready, it is segmented into words. There are many segmentation methods; here we introduce a character-tagging-based Chinese word segmentation method.
The basic assumption of character-tagging-based segmentation is that the characters inside a word are highly cohesive, while the coupling across word boundaries is low. Word boundaries are learned with statistical machine learning, performing BMES tagging with a sequence labeling model. A single-character word is labeled S; for a multi-character word, the first character is labeled B, the last character E, and the middle characters M. After every character of the training data has been labeled, a 3-layer neural network is trained for the per-character tagging task: for each character in a sentence, a total of win characters, the current character plus its context window with (win-1)/2 characters on each side, are used as features. The raw text of the win characters is first converted to character vector representations e(w), and the win vectors are concatenated into a win*|e|-dimensional vector x, which forms the input layer of the network. The hidden layer h is designed as in a common feedforward network, with every input node connected to each of the |h| hidden nodes; the hidden layer uses tanh as the activation function.
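As an illustration, the following is a minimal sketch of the forward pass of such a window-based tagger in Python/NumPy. The layer sizes, the character vocabulary size, and the scoring of the four BMES labels are hypothetical details chosen for the sketch, not values fixed by the description above:

```python
import numpy as np

# Hypothetical sizes: |e| = 50-dim character vectors, win = 5, |h| = 100 hidden units.
EMB_DIM, WIN, HIDDEN, N_LABELS = 50, 5, 100, 4   # labels: B, M, E, S

rng = np.random.default_rng(0)
char_emb = rng.uniform(0, 1, (9000, EMB_DIM))     # character lookup table e(w)
W1 = rng.normal(0, 0.1, (WIN * EMB_DIM, HIDDEN))  # input layer -> hidden layer, fully connected
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.1, (HIDDEN, N_LABELS))       # hidden layer -> BMES scores
b2 = np.zeros(N_LABELS)

def bmes_scores(char_ids, pos):
    """Score the BMES labels of the character at `pos` from a window of WIN characters."""
    half = (WIN - 1) // 2                         # (win-1)/2 characters on each side
    window = [char_ids[max(0, min(len(char_ids) - 1, pos + d))]
              for d in range(-half, half + 1)]    # clamp indices at the sentence edges
    x = char_emb[window].reshape(-1)              # concatenate into a win*|e|-dim vector
    h = np.tanh(x @ W1 + b1)                      # tanh hidden layer
    return h @ W2 + b2                            # unnormalized scores for B, M, E, S
```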
For example, suppose the training corpus before segmentation contains the passage: "On April 6-7, with the close cooperation of multiple departments of the campus district, the Development Zone campus successfully completed the 2017 voluntary blood donation drive. In the early registration, 5 faculty members, 9 graduate students, and 463 undergraduates signed up to donate blood; in the end 420 people donated successfully, including 4 faculty members, 6 graduate students, and 410 undergraduates." After word segmentation the same passage is emitted with delimiters inserted between the recognized words, splitting off dates, institution names, and counts as separate tokens.
1.3) Finally, stop words and punctuation marks are removed.
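To make the pipeline concrete, here is a minimal preprocessing sketch assuming the opencc Python bindings mentioned above and the jieba segmenter; jieba and the stop word list are illustrative choices, since the description does not prescribe a particular segmentation tool:

```python
import re
import jieba                    # illustrative segmenter; the method does not depend on a specific tool
from opencc import OpenCC       # open source traditional-to-simplified converter used above

cc = OpenCC('t2s')              # t2s: traditional Chinese to simplified Chinese
STOP_WORDS = {'的', '了', '在'}  # hypothetical stop word list

def preprocess(line: str) -> list[str]:
    """Convert to simplified Chinese, segment into words, drop stop words and punctuation."""
    simplified = cc.convert(line)
    tokens = jieba.lcut(simplified)
    return [t for t in tokens
            if t not in STOP_WORDS and not re.fullmatch(r'[\W_]+', t)]

# corpus = [preprocess(line) for line in open('zhwiki.txt', encoding='utf-8')]
```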
(2) Chinese word representation vector generation stage
In Chinese, a word usually consists of several characters and carries rich internal meaning; the meaning of a word is usually related to the characters that compose it. For example, the meaning of the Chinese word 科技 (science and technology) can be learned from its surrounding context in the corpus, and it can also be inferred from its component characters 科 and 技. This motivates using character information to improve the Chinese word embedding model and learn Chinese word representation vectors. Fig. 1 is the framework diagram of our model. Word embeddings (light grey boxes in the figure) combine with character embeddings (white boxes) to form new vectors (grey boxes), and these new vectors are summed to obtain the vector that predicts the target word (left dark grey box). At the same time, the character embeddings are also summed into a new vector (right dark grey box) that predicts the target word.
In the initial stage, we traverse the corpus, add each word to a vocabulary, and sort the vocabulary by word frequency; words with frequency below 5 are deleted. Then the vector representations of words, characters, and parameters, w, c, and θ, are randomly generated (the dimension is typically set to 100). Next, we design an objective function and iteratively optimize the parameters with the stochastic gradient descent algorithm.
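A minimal sketch of this initialization, assuming a pre-segmented corpus as produced in stage (1); the variable names follow the description (word vectors w, character vectors c, parameters θ; dimension 100; frequency threshold 5), while the zero initialization of θ is word2vec convention rather than something the text specifies:

```python
from collections import Counter
import numpy as np

DIM, MIN_COUNT = 100, 5
rng = np.random.default_rng(0)

def build_vocab_and_init(corpus):
    """corpus: list of sentences, each a list of word strings."""
    freq = Counter(w for sent in corpus for w in sent)
    words = sorted((w for w, n in freq.items() if n >= MIN_COUNT),
                   key=freq.get, reverse=True)            # sort by frequency, drop rare words
    chars = sorted({ch for w in words for ch in w})
    w_vec = {w: rng.uniform(0, 1, DIM) for w in words}    # word vectors w, values in (0, 1)
    # three position-dependent vectors per character: beginning / middle / end of a word
    c_vec = {(ch, pos): rng.uniform(0, 1, DIM)
             for ch in chars for pos in ('B', 'M', 'E')}
    theta = {w: np.zeros(DIM) for w in words}             # output parameters θ
    return w_vec, c_vec, theta
```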
2.1) Predicting the target word from context words
Given a sentence D = {x_1, …, x_M}, where M is the sentence length and x_j is the j-th word in the sentence, the target word is predicted from the context words within a fixed window of size K. Considering the characteristics of Chinese, this step takes the sum of each context word's vector and the averaged vectors of its internal characters as the vector representation of the context of the target word w. For each character, depending on its position, there are three different vector representations (c^B, c^M, c^E), corresponding to the beginning, middle, and end of a word. The context-word vector is computed as:

$$x_w = \frac{1}{2K}\sum_{j}\Big(w_j + \frac{1}{N_j}\big(c_1^{B} + \sum_{k=2}^{N_j-1} c_k^{M} + c_{N_j}^{E}\big)\Big),\qquad j = w-K,\dots,w-1,\,w+1,\dots,w+K$$

where w_j is the word vector of x_j, N_j is the number of characters in x_j, and c_k is the vector of the k-th character of x_j.

From the above formula we obtain the context-word representation x_w, from which the target word x_i is predicted; the objective is to maximize the conditional probability of the target word given its context words:

$$L(D) = \frac{1}{M}\sum_{i=K}^{M-K}\log P(x_i \mid x_w)$$

where M is the sentence length and K the window size.
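A sketch of this context representation in Python/NumPy, reusing the `w_vec` and `c_vec` tables initialized above; the handling of single-character and out-of-vocabulary words is a simplification not covered by the formula:

```python
import numpy as np

def char_avg(word, c_vec, dim=100):
    """Average the position-dependent character vectors of a word:
    first character -> c^B, interior characters -> c^M, last character -> c^E."""
    if len(word) == 1:
        return c_vec.get((word, 'B'), np.zeros(dim))   # single-character word: a simplification
    vecs = [c_vec[(word[0], 'B')]]
    vecs += [c_vec[(ch, 'M')] for ch in word[1:-1]]
    vecs.append(c_vec[(word[-1], 'E')])
    return np.mean(vecs, axis=0)

def context_word_vector(sent, i, K, w_vec, c_vec, dim=100):
    """x_w: average over the 2K window positions of (word vector + averaged character vectors)."""
    total = np.zeros(dim)
    for j in range(i - K, i + K + 1):
        if j == i or j < 0 or j >= len(sent):
            continue
        word = sent[j]
        total += w_vec.get(word, np.zeros(dim)) + char_avg(word, c_vec, dim)
    return total / (2 * K)
```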
2.2) Predicting the target word from context characters
Our invention predicts the target word not only from the context words but also from the context characters. Similarly, for a sentence D = {x_1, …, x_M}, the sentence is first traversed and each character in every word is mapped to its vector by table lookup, excluding the target word. The target word is predicted from the context within a fixed window; considering the characteristics of Chinese, this step uses the average of the vectors of the internal characters of the context words as the context-character representation. For each character, depending on its position, there are three different vector representations (c^B, c^M, c^E), corresponding to the beginning, middle, and end of a word. The context-character vector is computed as:

$$c_w = \frac{1}{2K}\sum_{j}\frac{1}{N_j}\Big(c_1^{B} + \sum_{k=2}^{N_j-1} c_k^{M} + c_{N_j}^{E}\Big),\qquad j = w-K,\dots,w-1,\,w+1,\dots,w+K$$

From the above formula we obtain the context-character representation c_w, from which the target word x_i is predicted; the objective is to maximize the conditional probability of the target word given its context characters:

$$L(D) = \frac{1}{M}\sum_{i=K}^{M-K}\log P(x_i \mid c_w)$$

where M is the sentence length and K the window size.
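The context-character representation c_w drops the word vectors and keeps only the averaged character vectors; a short sketch under the same assumptions, reusing `char_avg` from above:

```python
import numpy as np

def context_char_vector(sent, i, K, c_vec, dim=100):
    """c_w: average over the 2K window positions of the words' averaged character vectors."""
    total = np.zeros(dim)
    for j in range(i - K, i + K + 1):
        if j == i or j < 0 or j >= len(sent):
            continue
        total += char_avg(sent[j], c_vec, dim)
    return total / (2 * K)
```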
2.3) Predicting the target word jointly from words and characters
In predicting the target word from context words, the meanings of the characters are added directly into the word meaning, so for words containing the same character the model tends to produce similar word vectors. We therefore add context information to weaken the negative influence of internal characters on the word meaning, and, to give words richer semantic information, we introduce the character information of the external context. In this way each character in a word's context serves as part of the distributed representation of that word, placing the word in the semantic space of characters and modeling it more effectively. For a sentence D = {x_1, …, x_M}, this step combines the objective of predicting the target word from context words with the objective of predicting it from context characters, training words and characters jointly; that is, while optimizing the conditional probability of the target word given its context words, the conditional probability of the target word given each character in the context is optimized at the same time:

$$L(\theta) = \frac{1}{M}\sum_{w\in W}\Big[(1-\beta)\log P\big(w \mid \mathrm{Context}(w)\big) + \beta\log P\big(w \mid \mathrm{Circum}(w)\big)\Big]$$

where M is the sentence length, W is the word vocabulary, w is the target word (x_i above), Context(w) is the word context of w (x_w above), Circum(w) is the character context of w (c_w above), and β is a decimal between 0 and 1 that controls the proportion of character-based modeling.
2.4) Iterative updating
To reduce computational complexity, this step optimizes the objective with negative sampling; specifically, the conditional probabilities are computed as:

$$P\big(w \mid \mathrm{Context}(w)\big) = \prod_{u\in\{w\}\cup NEG(w)} \big[\sigma(x_w^{\top}\theta^{u})\big]^{L^{w}(u)}\cdot\big[1-\sigma(x_w^{\top}\theta^{u})\big]^{1-L^{w}(u)}$$

$$P\big(w \mid \mathrm{Circum}(w)\big) = \prod_{u\in\{w\}\cup NEG(w)} \big[\sigma(c_w^{\top}\theta^{u})\big]^{L^{w}(u)}\cdot\big[1-\sigma(c_w^{\top}\theta^{u})\big]^{1-L^{w}(u)}$$

where NEG(w) is the negative sample set (the number of negative samples is set to 5); L^w(u) is the label of a sample u, with L^w(u) = 1 when u is the target word w and L^w(u) = 0 otherwise; x_w is the context-word representation of the target word w, c_w is its context-character representation, and θ^u is the parameter vector.

Finally the objective is solved with the stochastic gradient descent algorithm; the update formulas are:

$$v(\tilde{w}) := v(\tilde{w}) + \eta\sum_{u\in\{w\}\cup NEG(w)}\frac{\partial L(w,u)}{\partial x_w},\qquad \tilde{w}\in \mathrm{Context}(w)$$

$$v(\tilde{c}) := v(\tilde{c}) + \eta\sum_{u\in\{w\}\cup NEG(w)}\frac{\partial L(w,u)}{\partial c_w},\qquad \tilde{c}\in \mathrm{Circum}(w)$$

Representing the vectors of words and characters with different expressions yields better word vector representations and further drives the model to obtain more effective word embeddings.
After iterative training ends, the word vector parameter set w is the Chinese word vector representation generated by our model.
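Putting the pieces together, the following sketch performs one negative-sampling SGD step for a single target word. The uniform negative sampler and the fixed learning rate are illustrative simplifications (word2vec-style implementations usually sample negatives from a smoothed unigram distribution), and the back-propagation into the individual character vectors is only indicated by a comment:

```python
import random
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(sent, i, K, w_vec, c_vec, theta, vocab,
               beta=0.5, lr=0.025, n_neg=5, dim=100):
    """One joint update: predict sent[i] from x_w (weight 1-beta) and from c_w (weight beta)."""
    target = sent[i]
    x_w = context_word_vector(sent, i, K, w_vec, c_vec, dim)
    c_w = context_char_vector(sent, i, K, c_vec, dim)
    samples = [(target, 1.0)] + [(random.choice(vocab), 0.0) for _ in range(n_neg)]

    for ctx, weight in ((x_w, 1.0 - beta), (c_w, beta)):
        grad_ctx = np.zeros(dim)
        for u, label in samples:                 # L^w(u) = 1 for the target, 0 for negatives
            g = weight * lr * (label - sigmoid(ctx @ theta[u]))
            grad_ctx += g * theta[u]             # accumulate gradient w.r.t. the context vector
            theta[u] += g * ctx                  # update parameter vector θ^u
        # propagate the accumulated gradient back to the vectors that produced x_w / c_w
        for j in range(i - K, i + K + 1):
            if j == i or j < 0 or j >= len(sent):
                continue
            if ctx is x_w and sent[j] in w_vec:
                w_vec[sent[j]] += grad_ctx / (2 * K)   # v(w~) := v(w~) + η ∂L/∂x_w
            # the position-dependent character vectors c^B, c^M, c^E are updated analogously
```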
(3) Experimental results
The linguistic properties of the word vectors are evaluated by semantic similarity computation and analogical reasoning; the results in Fig. 2 and Fig. 3 show that the word vectors generated by the invention (the ECWE model) outperform the other models on both tasks. Fig. 4 scores the word vectors on a text classification task and shows that using the word vectors generated by the invention as features in Chinese natural language processing tasks yields better results. Fig. 5 shows performance on semantic similarity as the training corpus is gradually enlarged: the invention still performs well when the corpus is small, because introducing external context characters expands the contextual information of words, so the model can train words effectively even on a smaller corpus; as the corpus grows from small to large, the ECWE model consistently outperforms the other models and quickly reaches good performance. Fig. 6 shows performance on semantic similarity as the character-modeling ratio is gradually increased; the invention performs better across the different settings. This demonstrates that the invention is indeed a Chinese word vector model with better performance, more accurate capture of semantic information, and greater stability; the evaluation results on each task also confirm the feasibility of the proposed Chinese word vector generation method based on word-character joint training.
The above describes a specific embodiment of the present invention and the technical principles it employs. Any change conceived under the present invention, so long as the function it produces does not go beyond the spirit covered by the specification and drawings, shall fall within the protection scope of the present invention.
Claims (1)
1. A Chinese word vector generation method based on word-character joint training, characterized in that the character information inside Chinese words is taken as a key feature and Chinese word vector representations are jointly trained from context words and context characters, with the following steps:
(1) Chinese text data processing stage
The generation of word representation vectors is based on a corpus; the corpus is first segmented into words;
(2) Chinese word representation vector generation stage
In Chinese, a word consists of several characters, and the meaning of a word is related to the characters that compose it; the method uses character information to improve the Chinese word embedding model and learn Chinese word representation vectors;
In the initial stage, the vector representations w of words and c of characters are randomly generated; the dimension is 100 and each dimension value is a random decimal between 0 and 1;
2.1) Predicting the target word from context words
Given a sentence D = {x_1, …, x_M}, where M is the sentence length and x_j is the j-th word in the sentence, the target word is predicted from the context words within a fixed window of size K; considering the characteristics of Chinese, the sum of each context word's vector and the averaged vectors of its internal characters is taken as the vector representation of the context of the target word w; for each character, depending on its position, there are three different vector representations (c^B, c^M, c^E), corresponding to the beginning, middle, and end of a word; the context-word vector is computed as:
$$x_w = \frac{1}{2K}\sum_{j}\Big(w_j + \frac{1}{N_j}\big(c_1^{B} + \sum_{k=2}^{N_j-1} c_k^{M} + c_{N_j}^{E}\big)\Big),\qquad j = w-K,\dots,w-1,\,w+1,\dots,w+K$$
where w_j is the word vector of x_j, N_j is the number of characters in x_j, and c_k is the vector of the k-th character of x_j;
From the above formula the context-word representation x_w is obtained, from which the target word x_i is predicted; the objective is to maximize the conditional probability of the target word given its context words:
$$L(D) = \frac{1}{M}\sum_{i=K}^{M-K}\log P(x_i \mid x_w)$$
where M is the sentence length and K the window size;
2.2) Predicting the target word from context characters
For a sentence D = {x_1, …, x_M}, the sentence is first traversed and each character in every word is mapped to its vector by table lookup, excluding the target word; the target word is predicted from the context within a fixed window, using the average of the vectors of the internal characters of the context words as the context-character representation; for each character, depending on its position, there are three different vector representations (c^B, c^M, c^E), corresponding to the beginning, middle, and end of a word; the context-character vector is computed as:
$$c_w = \frac{1}{2K}\sum_{j}\frac{1}{N_j}\Big(c_1^{B} + \sum_{k=2}^{N_j-1} c_k^{M} + c_{N_j}^{E}\Big),\qquad j = w-K,\dots,w-1,\,w+1,\dots,w+K$$
From the above formula the context-character representation c_w is obtained, from which the target word x_i is predicted; the objective is to maximize the conditional probability of the target word given its context characters:
$$L(D) = \frac{1}{M}\sum_{i=K}^{M-K}\log P(x_i \mid c_w)$$
where M is the sentence length and K the window size;
2.3) Predicting the target word jointly from words and characters
For a sentence D = {x_1, …, x_M}, the objective of predicting the target word from context words is combined with the objective of predicting it from context characters, training words and characters jointly; while optimizing the conditional probability of the target word given its context words, the conditional probability of the target word given each character in the context is optimized at the same time:
$$L(\theta) = \frac{1}{M}\sum_{w\in W}\Big[(1-\beta)\log P\big(w \mid \mathrm{Context}(w)\big) + \beta\log P\big(w \mid \mathrm{Circum}(w)\big)\Big]$$
where M is the sentence length, W is the word vocabulary, w is the target word (x_i above), Context(w) is the word context of w (x_w above), Circum(w) is the character context of w (c_w above), and β is a decimal between 0 and 1 that controls the proportion of character-based modeling;
2.4) Iterative updating
The objective is optimized with negative sampling; the conditional probabilities are computed as:
$$P\big(w \mid \mathrm{Context}(w)\big) = \prod_{u\in\{w\}\cup NEG(w)} \big[\sigma(x_w^{\top}\theta^{u})\big]^{L^{w}(u)}\cdot\big[1-\sigma(x_w^{\top}\theta^{u})\big]^{1-L^{w}(u)}$$

$$P\big(w \mid \mathrm{Circum}(w)\big) = \prod_{u\in\{w\}\cup NEG(w)} \big[\sigma(c_w^{\top}\theta^{u})\big]^{L^{w}(u)}\cdot\big[1-\sigma(c_w^{\top}\theta^{u})\big]^{1-L^{w}(u)}$$
where NEG(w) is the negative sample set (the number of negative samples is set to 5); L^w(u) is the label of a sample u, with L^w(u) = 1 when u is the target word w and L^w(u) = 0 otherwise; x_w is the context-word representation of the target word w, c_w is its context-character representation, and θ^u is the parameter vector;
Finally the objective is solved with the stochastic gradient descent algorithm; the update formulas are:
$$v(\tilde{w}) := v(\tilde{w}) + \eta\sum_{u\in\{w\}\cup NEG(w)}\frac{\partial L(w,u)}{\partial x_w},\qquad \tilde{w}\in \mathrm{Context}(w)$$

$$v(\tilde{c}) := v(\tilde{c}) + \eta\sum_{u\in\{w\}\cup NEG(w)}\frac{\partial L(w,u)}{\partial c_w},\qquad \tilde{c}\in \mathrm{Circum}(w)$$
After iterative training ends, the word vector parameter set w is the Chinese word vector representation generated by the model.
Priority application
- CN201710435279.XA, filed 2017-06-12: Chinese word vector generation method based on word and phrase joint training (granted as CN107273355B)
Publications
- CN107273355A, published 2017-10-20
- CN107273355B, published 2020-07-14