CN107273355A - Chinese word vector generation method based on character-word joint training - Google Patents

Chinese word vector generation method based on character-word joint training

Info

Publication number
CN107273355A
Authority
CN
China
Prior art keywords
word
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710435279.XA
Other languages
Chinese (zh)
Other versions
CN107273355B (en)
Inventor
张宪超
刘世柯
梁文新
刘馨月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN201710435279.XA
Publication of CN107273355A
Application granted
Publication of CN107273355B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese word vector generation method based on character-word joint training, belonging to the technical field of natural language processing. The method treats the Chinese characters inside a word as key features and jointly trains Chinese word vector representations from both context words and context characters. On the basis of a purely word-based word vector model, the characters composing each word are introduced, and the target word is predicted from context characters at the same time as from context words. Applying the word-only model and the joint character-word model separately and comparing the effectiveness and robustness of the trained word vectors shows that the Chinese word vectors generated by the joint model better match the semantic characteristics of Chinese and are also more robust. The invention provides a new method for generating Chinese word vectors and a new solution for the generation and application of Chinese word vectors.

Description

Chinese word vector generation method based on character-word joint training
Technical field
The invention belongs to the technical field of natural language processing and relates to a Chinese word vector generation method based on character-word joint training.
Background technology
In recent years, natural language processing technology has reached many aspects of everyday life, and text representation is one of the most fundamental research problems in the natural language processing field. Word representation is a form of data representation, and data representation is an early preparatory step of machine learning whose quality has a great impact on the performance of machine learning models. For the problems faced in Chinese natural language processing, we want computers to learn text representations directly and automatically from large-scale unlabeled text data, and we want the semantic information of words and texts to be embodied in these representations. Conventional word embedding models such as Word2Vec and GloVe cannot accommodate the linguistic characteristics of Chinese, and word vector models that perform better on Chinese and capture its semantic information more accurately remain to be explored by researchers.
Summary of the invention
The purpose of the present invention is to address shortcomings of existing research by proposing a Chinese word vector generation method based on character-word joint training, i.e. the ECWE model, which combines the characters inside words with the external context words and characters to obtain high-quality Chinese word embeddings. ECWE unites internal characters with external words in a simple but general way to learn Chinese word vectors. Internal characters and external context words create more connections between words that would otherwise be isolated; by modeling Chinese characters effectively, the model strengthens the relations between characters and between characters and words, and at the same time enriches the contextual information of each word, so that word representations contain more semantic information and the quality of the word representations is improved.
Technical scheme:
A Chinese word vector generation method based on character-word joint training, with steps as follows:
(1) Chinese text data processing stage
Generating word representation vectors requires the support of a large corpus. The corpus can be built in-house or purchased. Once the corpus is available, it is first segmented into words. Many segmentation tools are available at present; this step is not claimed as a distinguishing feature of the method.
(2) Chinese word representation vector generation stage
In Chinese, a word is usually composed of several characters and carries rich internal meaning, and the sense of a word is usually related to the characters that compose it. For example, the sense of the Chinese word 科技 ('science and technology') can be learned from its contexts in the corpus, but it can also be inferred from its component characters 科 and 技. This leads to the idea of using character information to improve the Chinese word embedding model and learn Chinese word representation vectors.
In the initial stage, the vector representations w of words and c of characters are generated at random; the dimension is 100 and each component is a random decimal between 0 and 1.
2.1) Predicting the target word from context words
For giving sentence D={ x1,…,xM, M represents sentence length, xjJ-th of word in sentence is represented, passes through one The cliction up and down of (window size is K) predicts target word in individual stationary window, it is contemplated that Chinese characteristic, the steps characteristic exists In, using term vector and composition word internal word vector vector add and be averaging the vector table as target word w cliction up and down Show;It is characterised by, for each Chinese character, different according to position, he can there are three different vector representation (cB,cM,cE), The beginning that they are located among word, middle and ending are represented respectively.The vector representation formula of cliction is as follows up and down:
Where j=w-K ... w-1, w+1 ... w+K
where w_j is the word vector of x_j itself, N_j is the number of characters in x_j, and c_k is the vector representation of the k-th character of x_j;
From this formula the context-word representation x_w is obtained and used to predict the target word x_i; the objective is to maximize the conditional probability of the target word given the context words:

L(D) = \frac{1}{M}\sum_{i=K}^{M-K}\log P(x_i \mid x_w)

where M is the sentence length and K is the window size.
2.2) Predicting the target word from context characters
For sentence D={ x1,…,xM, the sentence is traveled through first, tables look-up and the Chinese character in each word is mapped to vector, remove Go target word;Target word is predicted by the cliction up and down in a stationary window, it is contemplated that Chinese characteristic, the steps characteristic exists The vector representation of internal word adds the vector representation as word up and down with average value in the cliction using above and below;It is characterised by, for Each Chinese character, different according to position, he can have three different vector representation (cB,cM,cE), represent that they are located at respectively Beginning among word, middle and ending.The vector representation formula of word is as follows up and down:
Where j=w-K ... w-1, w+1 ... w+K
From this formula the context-character representation c_w is obtained and used to predict the target word x_i; the objective is to maximize the conditional probability of the target word given the context characters:

L(D) = \frac{1}{M}\sum_{i=K}^{M-K}\log P(x_i \mid c_w)

where M is the sentence length and K is the window size.
2.3) Predicting the target word from words and characters jointly
The objectives for predicting the target word from context words and from context characters were obtained in the steps above. In this step, for a sentence D = {x_1, ..., x_M}, the method is characterized in that the objective of predicting the target word from context words and the objective of predicting it from context characters are combined, training words and characters jointly; that is, while the conditional probability of the target word given its context is optimized, the conditional probability of the target word given each character in the context is optimized at the same time:

L(\theta) = \frac{1}{M}\sum_{w\in W}\left[(1-\beta)\log P(w \mid \mathrm{Context}(w)) + \beta\log P(w \mid \mathrm{Circum}(w))\right]

where M is the sentence length, W is the word dictionary, w is the target word (x_i above), Context(w) denotes the context words of w (x_w above), Circum(w) denotes the characters in the context of w (c_w above), and β is a decimal between 0 and 1 giving the proportion of character-based modeling;
2.4) Iterative updates
To reduce computational complexity, this step is characterized in that the optimization uses negative sampling; specifically, the conditional probabilities are computed as:

P(w \mid \mathrm{Context}(w)) = \prod_{u\in\{w\}\cup\mathrm{NEG}(w)}\left[\sigma(x_w^\top\theta^u)\right]^{L^w(u)}\cdot\left[1-\sigma(x_w^\top\theta^u)\right]^{1-L^w(u)}

P(w \mid \mathrm{Circum}(w)) = \prod_{u\in\{w\}\cup\mathrm{NEG}(w)}\left[\sigma(c_w^\top\theta^u)\right]^{L^w(u)}\cdot\left[1-\sigma(c_w^\top\theta^u)\right]^{1-L^w(u)}

where NEG(w) is the negative sample set, whose size is set to 5; L^w(u) is the label of sample u, with L^w(u) = 1 when u is the target word w and L^w(u) = 0 otherwise; x_w is the context-word representation of target word w; c_w is the context-character representation of target word w; and θ^u is a parameter vector;
Finally, the objective function is solved with the stochastic gradient descent algorithm; the specific update formulas are:

v(\tilde{w}) := v(\tilde{w}) + \eta\sum_{u\in\{w\}\cup\mathrm{NEG}(w)}\frac{\partial L(w,u)}{\partial x_w}, \quad \tilde{w}\in\mathrm{Context}(w)

v(\tilde{c}) := v(\tilde{c}) + \eta\sum_{u\in\{w\}\cup\mathrm{NEG}(w)}\frac{\partial L(w,u)}{\partial c_w}, \quad \tilde{c}\in\mathrm{Circum}(w)
After the iterative training of the model ends, the parameter set of word vector representations w is exactly the Chinese word vector representation generated by the model.
The beneficial effects of the present invention are that it discloses a Chinese word vector generation method based on character-word joint training which treats the Chinese characters inside words as key features and trains Chinese word vector representations jointly from context words and context characters. On the basis of a purely word-based word vector model, the characters composing each word are introduced, and the target word is predicted from context characters at the same time as from context words. Applying the word-only model and the joint character-word model separately and comparing the effectiveness and robustness of the trained word vectors shows that the Chinese word vectors generated by the joint model better match the semantic characteristics of Chinese and are also more robust. The invention provides a new method for generating Chinese word vectors and a new solution for the generation and application of Chinese word vectors.
Brief description of the drawings
Fig. 1 is the overall architecture of the method of the invention.
Fig. 2 shows the evaluation results of the method on the semantic similarity task; ECWE is the abbreviation of the model of the invention. The figure shows that the Chinese word vectors generated by the invention contain more accurate semantic information.
Fig. 3 shows the evaluation results on the analogy reasoning task; the figure likewise shows that the generated Chinese word vectors contain more accurate semantic information.
Fig. 4 shows the evaluation results on the text classification task; the figure shows that the generated Chinese word vectors are better suited to Chinese natural language processing tasks.
Fig. 5 shows the evaluation results under different corpus sizes, demonstrating that the invention is more robust.
Fig. 6 shows the evaluation results under different character-modeling ratios, likewise demonstrating that the invention is more robust.
Detailed description of the embodiments
In order to make the purpose, technical scheme, and advantages of the present invention clearer, the specific embodiments of the present invention are described in further detail below.
The invention provides a Chinese word vector generation method based on character-word joint training, the method comprising:
(1) Chinese text data processing stage
Generating word representation vectors requires the support of a large corpus; the corpus can be built in-house or purchased. Here the Chinese Wikipedia dataset is taken as an example.
1.1) The Chinese Wikipedia dataset is chosen as the training corpus. It covers a wide range of domains and contains 182 million Chinese words, with a word dictionary of 457,000 entries and a character dictionary of 9,000 entries.
The Chinese Wikipedia dataset is then preprocessed. Wikipedia's Chinese text is a mixture of simplified and traditional scripts: mainland simplified, Taiwan traditional, Hong Kong/Macau traditional, and other variants coexist, and different paragraphs of a single article may even use different scripts. The open-source project opencc is used to convert the traditional characters in the corpus to simplified characters. The traditional characters must be removed to normalize the text: if simplified and traditional forms of the same word coexist, they conflict with each other.
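As a concrete illustration of this normalization step, the following minimal sketch uses the Python bindings of the opencc project mentioned above; the exact package and configuration name ('t2s' for traditional-to-simplified) depend on which opencc binding is installed, so treat these details as assumptions rather than part of the method.

    # Traditional-to-simplified normalization of the raw Wikipedia dump with opencc.
    from opencc import OpenCC

    converter = OpenCC('t2s')  # 't2s': traditional Chinese -> simplified Chinese

    def normalize_corpus(lines):
        # Convert every line of the corpus to simplified characters.
        return [converter.convert(line) for line in lines]

    print(converter.convert('漢字'))  # -> '汉字'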
1.2) Once the corpus is ready, it is segmented into words. There are many segmentation methods; here a Chinese word segmentation method based on character tagging is introduced.
The basic assumption of character-tagging-based Chinese word segmentation is that the characters inside a word are highly cohesive while characters across a word boundary are loosely coupled. Word boundaries are learned by statistical machine learning, using a sequence labeling model that performs BMES tagging. A single-character word is tagged S; in a multi-character word, the first character is tagged B, the last character is tagged E, and the characters in between are tagged M. After each character of the training data is labeled, a 3-layer neural network is trained per character. For the tagging task at each character of a sentence, the current character and the characters in its context window, win characters in total, are used as features, i.e., (win-1)/2 characters on each side. The raw text of the win characters is first converted to their character vector representations e(w), and the win vectors are concatenated into a vector x of dimension win*|e|, which forms the input layer of the neural network. The hidden layer h is designed as in an ordinary feed-forward network: every input node is connected to each of the |h| hidden nodes. The hidden layer uses the tanh activation function.
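The BMES labeling itself is mechanical once a segmented training corpus is available. The sketch below (an illustrative helper, not part of the patent) derives the per-character tags described above from a list of segmented words.

    # Derive BMES training labels from an already-segmented sentence:
    # S for single-character words; B/M/E for the first, middle, and
    # last characters of multi-character words.
    def bmes_labels(words):
        pairs = []
        for w in words:
            if len(w) == 1:
                pairs.append((w, 'S'))
            else:
                pairs.append((w[0], 'B'))
                pairs.extend((ch, 'M') for ch in w[1:-1])
                pairs.append((w[-1], 'E'))
        return pairs

    print(bmes_labels(['无偿', '献血', '工作']))
    # [('无', 'B'), ('偿', 'E'), ('献', 'B'), ('血', 'E'), ('工', 'B'), ('作', 'E')]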
Assume the training corpus before segmentation contains the sentence (given in Chinese in the original): "On April 6-7, with the close cooperation of multiple departments of the campus, the Development Zone campus successfully completed the 2017 voluntary blood donation drive. 5 staff members, 9 graduate students, and 463 undergraduates registered to donate in advance; 420 people ultimately donated successfully, among them 4 staff members, 6 graduate students, and 410 undergraduates." After word segmentation, the same sentence is output with every pair of adjacent Chinese words separated by a delimiter.
1.3) Finally, stop words and punctuation marks are removed.
(2) Chinese word representation vector generation stage
In Chinese, a word is usually composed of several characters and carries rich internal meaning, and the sense of a word is usually related to the characters that compose it. For example, the sense of the Chinese word 科技 ('science and technology') can be learned from its contexts in the corpus, but it can also be inferred from its component characters 科 and 技; this leads to the idea of using character information to improve the Chinese word embedding model and learn Chinese word representation vectors. Fig. 1 is the schematic framework of our model. Character embeddings (light grey boxes in the figure) and word embeddings (white boxes) are combined into new vectors (grey boxes); these new vectors are summed to form the vector that predicts the target word (dark grey box on the left). Meanwhile, the character embeddings are also summed into a new vector (dark grey box on the right) to predict the target word.
In the initial stage, the corpus is traversed and every word is added to a vocabulary, and the vocabulary is sorted by word frequency; words with frequency below 5 are deleted. The vector representations of words, characters, and parameters, w, c, and θ, are then generated at random (the dimension is typically set to 100). Next, an objective function is designed and its parameters are iteratively optimized with the stochastic gradient descent algorithm.
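A minimal sketch of this initialization stage follows, under the stated settings (dimension 100, minimum frequency 5, random values in [0, 1)); the helper names and the zero initialization of the parameter vectors θ are illustrative assumptions, not specified by the patent.

    import numpy as np
    from collections import Counter

    DIM, MIN_COUNT = 100, 5

    def build_vocab(segmented_sentences):
        freq = Counter(w for sent in segmented_sentences for w in sent)
        words = sorted((w for w, c in freq.items() if c >= MIN_COUNT),
                       key=freq.get, reverse=True)        # sort by word frequency
        chars = {ch for w in words for ch in w}
        rng = np.random.default_rng(0)
        w_vec = {w: rng.random(DIM) for w in words}        # word vectors w
        # three position-dependent vectors per character: Begin / Middle / End
        c_vec = {(ch, pos): rng.random(DIM) for ch in chars for pos in 'BME'}
        theta = {w: np.zeros(DIM) for w in words}          # parameter vectors θ
        return w_vec, c_vec, theta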
2.1) Predicting the target word from context words
For a given sentence D = {x_1, ..., x_M}, where M is the sentence length and x_j denotes the j-th word, the target word is predicted from the context words inside a fixed window of size K. To account for the characteristics of Chinese, this step is characterized in that the word vector of each context word and the vectors of its internal characters are added and averaged to form the vector representation of the context of target word w. It is further characterized in that each character, depending on its position, has three distinct vector representations (c^B, c^M, c^E), corresponding respectively to the beginning, middle, and end of a word. The vector representation of the context words is:

x_w = \frac{1}{2K}\sum_{j}\left(w_j + \frac{1}{N_j}\left(c_1^B + \sum_{k=2}^{N_j-1}c_k^M + c_{N_j}^E\right)\right), \quad j = w-K,\ldots,w-1,\,w+1,\ldots,w+K

where w_j is the word vector of x_j itself, N_j is the number of characters in x_j, and c_k is the vector representation of the k-th character of x_j;
From this formula the context-word representation x_w is obtained and used to predict the target word x_i; the objective is to maximize the conditional probability of the target word given the context words:

L(D) = \frac{1}{M}\sum_{i=K}^{M-K}\log P(x_i \mid x_w)

where M is the sentence length and K is the window size.
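The formula for x_w can be read off almost directly as code. The sketch below is an illustrative implementation under the data layout of the initialization sketch above; the treatment of single-character context words (using their B-position vector) is an assumption the patent does not spell out.

    import numpy as np

    def char_average(word, c_vec, dim=100):
        # Average of the position-dependent (B/M/E) character vectors of one word.
        n = len(word)
        if n == 1:
            return c_vec[(word, 'B')]          # assumption for 1-character words
        total = c_vec[(word[0], 'B')] + c_vec[(word[-1], 'E')]
        total += sum((c_vec[(ch, 'M')] for ch in word[1:-1]), np.zeros(dim))
        return total / n

    def context_word_vector(sentence, i, K, w_vec, c_vec, dim=100):
        # x_w: average over the 2K window words of (word vector + character average).
        # Assumes K <= i < len(sentence) - K, as in the objective above.
        window = sentence[i - K:i] + sentence[i + 1:i + 1 + K]
        parts = [w_vec[x] + char_average(x, c_vec, dim) for x in window]
        return sum(parts, np.zeros(dim)) / (2 * K)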
2.2) Predicting the target word from context characters
The invention predicts the target word not only from context words but also from context characters. Likewise, for a sentence D = {x_1, ..., x_M}, the sentence is first traversed and each character of every word is mapped to its vector by table lookup, excluding the target word; the target word is then predicted from the context inside a fixed window. To account for the characteristics of Chinese, this step is characterized in that the vector representations of the internal characters of the context words are added and averaged to form the representation of the context characters. It is further characterized in that each character, depending on its position, has three distinct vector representations (c^B, c^M, c^E), corresponding respectively to the beginning, middle, and end of a word. The vector representation of the context characters is:

c_w = \frac{1}{2K}\sum_{j}\frac{1}{N_j}\left(c_1^B + \sum_{k=2}^{N_j-1}c_k^M + c_{N_j}^E\right), \quad j = w-K,\ldots,w-1,\,w+1,\ldots,w+K
From this formula the context-character representation c_w is obtained and used to predict the target word x_i; the objective is to maximize the conditional probability of the target word given the context characters:

L(D) = \frac{1}{M}\sum_{i=K}^{M-K}\log P(x_i \mid c_w)

where M is the sentence length and K is the window size.
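A companion sketch for c_w, reusing char_average from the sketch in step 2.1: the same window, but only the averaged character vectors of the context words contribute, exactly as in the formula above.

    import numpy as np

    def context_char_vector(sentence, i, K, c_vec, dim=100):
        # c_w: average over the 2K window words of each word's character average.
        window = sentence[i - K:i] + sentence[i + 1:i + 1 + K]
        parts = [char_average(x, c_vec, dim) for x in window]
        return sum(parts, np.zeros(dim)) / (2 * K)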
2.3) Predicting the target word from words and characters jointly
When the target word is predicted from context words alone, the meaning of each character is folded directly into the word sense, so the model tends to give similar word vectors to words that share a character; context information is therefore added to weaken this negative influence of internal characters on the word sense. To give words richer semantic information, the character information of the external context is also introduced: the distributed representation of each character in a word's context is used as the representation of that context element, placing characters into the semantic space of words and modeling words more effectively. For a sentence D = {x_1, ..., x_M}, this step is characterized in that the objective of predicting the target word from context words and the objective of predicting it from context characters are combined, training words and characters jointly; that is, while the conditional probability of the target word given its context is optimized, the conditional probability of the target word given each character in the context is optimized at the same time:

L(\theta) = \frac{1}{M}\sum_{w\in W}\left[(1-\beta)\log P(w \mid \mathrm{Context}(w)) + \beta\log P(w \mid \mathrm{Circum}(w))\right]

where M is the sentence length, W is the word dictionary, w is the target word (x_i above), Context(w) denotes the context words of w (x_w above), Circum(w) denotes the characters in the context of w (c_w above), and β is a decimal between 0 and 1 giving the proportion of character-based modeling;
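The sketch below shows the shape of the joint objective L(θ): a β-weighted mixture of the word-context and character-context log-probabilities. For brevity it scores only the positive pair with a sigmoid; the negative-sampling terms that complete each log-probability are added in step 2.4, and the function names follow the earlier sketches.

    import numpy as np

    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    def joint_log_likelihood(sentence, K, beta, w_vec, c_vec, theta, dim=100):
        total, count = 0.0, 0
        for i in range(K, len(sentence) - K):
            w = sentence[i]
            x_w = context_word_vector(sentence, i, K, w_vec, c_vec, dim)
            c_w = context_char_vector(sentence, i, K, c_vec, dim)
            # (1 - β) log P(w | Context(w)) + β log P(w | Circum(w))
            total += (1 - beta) * np.log(sigma(x_w @ theta[w]))
            total += beta * np.log(sigma(c_w @ theta[w]))
            count += 1
        return total / max(count, 1)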
2.4) Iterative updates
To reduce computational complexity, this step is characterized in that the optimization uses negative sampling; specifically, the conditional probabilities are computed as:

P(w \mid \mathrm{Context}(w)) = \prod_{u\in\{w\}\cup\mathrm{NEG}(w)}\left[\sigma(x_w^\top\theta^u)\right]^{L^w(u)}\cdot\left[1-\sigma(x_w^\top\theta^u)\right]^{1-L^w(u)}

P(w \mid \mathrm{Circum}(w)) = \prod_{u\in\{w\}\cup\mathrm{NEG}(w)}\left[\sigma(c_w^\top\theta^u)\right]^{L^w(u)}\cdot\left[1-\sigma(c_w^\top\theta^u)\right]^{1-L^w(u)}

where NEG(w) is the negative sample set, whose size is set to 5; L^w(u) is the label of sample u, with L^w(u) = 1 when u is the target word w and L^w(u) = 0 otherwise; x_w is the context-word representation of target word w; c_w is the context-character representation of target word w; and θ^u is a parameter vector;
Finally, the objective function is solved with the stochastic gradient descent algorithm; the specific update formulas are:

v(\tilde{w}) := v(\tilde{w}) + \eta\sum_{u\in\{w\}\cup\mathrm{NEG}(w)}\frac{\partial L(w,u)}{\partial x_w}, \quad \tilde{w}\in\mathrm{Context}(w)

v(\tilde{c}) := v(\tilde{c}) + \eta\sum_{u\in\{w\}\cup\mathrm{NEG}(w)}\frac{\partial L(w,u)}{\partial c_w}, \quad \tilde{c}\in\mathrm{Circum}(w)
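A sketch of one negative-sampling gradient step for a single target word w follows, in the style of the update formulas above. The draw_negatives helper (sampling 5 words by frequency) and the exact placement of the β weights in the update are assumptions for illustration; the patent only fixes the objective and the update rules for v(w~) and v(c~).

    import numpy as np

    def sgd_step(w, x_w, c_w, window_words, window_chars,
                 w_vec, c_vec, theta, draw_negatives, eta=0.025, beta=0.5):
        sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
        grad_x = np.zeros_like(x_w)                    # accumulates η Σ ∂L/∂x_w
        grad_c = np.zeros_like(c_w)                    # accumulates η Σ ∂L/∂c_w
        for u in [w] + draw_negatives(w, 5):           # {w} ∪ NEG(w), |NEG(w)| = 5
            label = 1.0 if u == w else 0.0             # L^w(u)
            gx = eta * (label - sigma(x_w @ theta[u])) # word-channel error
            gc = eta * (label - sigma(c_w @ theta[u])) # character-channel error
            grad_x += gx * theta[u]
            grad_c += gc * theta[u]
            theta[u] += (1 - beta) * gx * x_w + beta * gc * c_w  # update θ^u
        for x in window_words:                         # v(w~) := v(w~) + η Σ ∂L/∂x_w
            w_vec[x] += (1 - beta) * grad_x
        for key in window_chars:                       # v(c~) := v(c~) + η Σ ∂L/∂c_w
            c_vec[key] += beta * grad_c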
It is further characterized in that words and characters are given separate vector representations with different expressions; this yields better word vectors and further drives the model to obtain more effective word embeddings.
After the iterative training of the model ends, the parameter set of word vector representations w is exactly the Chinese word vector representation generated by the model.
(3) Experimental results
The linguistic properties of the word vectors are evaluated with semantic similarity and analogy reasoning tasks; the results in Fig. 2 and Fig. 3 show that the word vectors generated by the present invention (the ECWE model) outperform the other models on both tasks. The word vectors are also scored through a text classification task, and the results in Fig. 4 show that using the word vectors generated by the invention as features in Chinese natural language processing tasks yields better results. Fig. 5 shows the performance on semantic similarity as the training corpus is gradually enlarged: the invention still performs well when the corpus is small, because introducing external context characters expands the contextual information of each word, so that words are trained effectively even with a smaller training corpus; and as the corpus grows from small to large, the performance of the ECWE model is consistently better than the other models and reaches a good level quickly. Fig. 6 shows the performance on semantic similarity as the character-modeling ratio is gradually increased: the invention performs better in all cases. This shows that the invention is indeed a Chinese word vector model with better performance, more accurate semantic capture, and greater stability; the evaluation on each task also demonstrates the feasibility of the proposed Chinese word vector generation method based on character-word joint training.
The above describes a specific embodiment of the present invention and the technical principles employed. Any changes made under the conception of the invention, as long as the functions they produce do not depart from the spirit covered by the specification and drawings, shall fall within the protection scope of the present invention.

Claims (1)

1. A Chinese word vector generation method based on character-word joint training, characterized in that the character information inside Chinese words is used as a key feature and Chinese word vector representations are trained jointly with context words and context characters, the steps being as follows:
(1) Chinese text data processing stage
Word representation vectors are generated on the basis of a corpus; the corpus is first segmented into words;
(2) Chinese word representation vector generation stage
In Chinese, a word is composed of several characters, and the word sense is related to the characters composing it; the method uses character information to improve a Chinese word embedding model and learn Chinese word representation vectors;
In the initial stage, the vector representations w of words and c of characters are generated at random; the dimension is 100 and each component is a random decimal between 0 and 1;
2.1) Predicting the target word from context words
For a given sentence D = {x_1, ..., x_M}, where M is the sentence length and x_j denotes the j-th word, the target word is predicted from the context words inside a fixed window whose size is K; to account for the characteristics of Chinese, the word vector of each context word and the vectors of its internal characters are added and averaged to form the vector representation of the context of target word w; each character, depending on its position, has three distinct vector representations (c^B, c^M, c^E), corresponding respectively to the beginning, middle, and end of a word; the vector representation of the context words is:

x_w = \frac{1}{2K}\sum_{j}\left(w_j + \frac{1}{N_j}\left(c_1^B + \sum_{k=2}^{N_j-1}c_k^M + c_{N_j}^E\right)\right), \quad j = w-K,\ldots,w-1,\,w+1,\ldots,w+K
where w_j is the word vector of x_j itself, N_j is the number of characters in x_j, and c_k is the vector representation of the k-th character of x_j;
From this formula the context-word representation x_w is obtained and used to predict the target word x_i; the objective is to maximize the conditional probability of the target word given the context words:

L(D) = \frac{1}{M}\sum_{i=K}^{M-K}\log P(x_i \mid x_w)

where M is the sentence length and K is the window size;
2.2) Predicting the target word from context characters
For a sentence D = {x_1, ..., x_M}, the sentence is first traversed and each character of every word is mapped to its vector by table lookup, excluding the target word; the target word is predicted from the context inside a fixed window, and the vector representations of the internal characters of the context words are added and averaged as the representation of the context characters; each character, depending on its position, has three distinct vector representations (c^B, c^M, c^E), corresponding respectively to the beginning, middle, and end of a word; the vector representation of the context characters is:

c_w = \frac{1}{2K}\sum_{j}\frac{1}{N_j}\left(c_1^B + \sum_{k=2}^{N_j-1}c_k^M + c_{N_j}^E\right), \quad j = w-K,\ldots,w-1,\,w+1,\ldots,w+K
From this formula the context-character representation c_w is obtained and used to predict the target word x_i; the objective is to maximize the conditional probability of the target word given the context characters:

L(D) = \frac{1}{M}\sum_{i=K}^{M-K}\log P(x_i \mid c_w)

where M is the sentence length and K is the window size;
2.3) Predicting the target word from words and characters jointly
For a sentence D = {x_1, ..., x_M}, the objective of predicting the target word from context words and the objective of predicting it from context characters are combined, training words and characters jointly; while the conditional probability of the target word given its context is optimized, the conditional probability of the target word given each character in the context is optimized at the same time:

L(\theta) = \frac{1}{M}\sum_{w\in W}\left[(1-\beta)\log P(w \mid \mathrm{Context}(w)) + \beta\log P(w \mid \mathrm{Circum}(w))\right]

where M is the sentence length, W is the word dictionary, w is the target word (x_i above), Context(w) denotes the context words of w (x_w above), Circum(w) denotes the characters in the context of w (c_w above), and β is a decimal between 0 and 1 giving the proportion of character-based modeling;
2.4) Iterative updates
The computation is optimized with negative sampling; the conditional probabilities are computed as:

P(w \mid \mathrm{Context}(w)) = \prod_{u\in\{w\}\cup\mathrm{NEG}(w)}\left[\sigma(x_w^\top\theta^u)\right]^{L^w(u)}\cdot\left[1-\sigma(x_w^\top\theta^u)\right]^{1-L^w(u)}

P(w \mid \mathrm{Circum}(w)) = \prod_{u\in\{w\}\cup\mathrm{NEG}(w)}\left[\sigma(c_w^\top\theta^u)\right]^{L^w(u)}\cdot\left[1-\sigma(c_w^\top\theta^u)\right]^{1-L^w(u)}

where NEG(w) is the negative sample set, whose size is set to 5; L^w(u) is the label of sample u, with L^w(u) = 1 when u is the target word w and L^w(u) = 0 otherwise; x_w is the context-word representation of target word w, c_w is the context-character representation of target word w, and θ^u is a parameter vector;
Finally, the objective function is solved with the stochastic gradient descent algorithm; the specific update formulas are:

v(\tilde{w}) := v(\tilde{w}) + \eta\sum_{u\in\{w\}\cup\mathrm{NEG}(w)}\frac{\partial L(w,u)}{\partial x_w}, \quad \tilde{w}\in\mathrm{Context}(w)

v(\tilde{c}) := v(\tilde{c}) + \eta\sum_{u\in\{w\}\cup\mathrm{NEG}(w)}\frac{\partial L(w,u)}{\partial c_w}, \quad \tilde{c}\in\mathrm{Circum}(w)
After the iterative training of the model ends, the parameter set of word vector representations w is exactly the Chinese word vector representation generated by the model.
CN201710435279.XA 2017-06-12 2017-06-12 Chinese word vector generation method based on word and phrase joint training Active CN107273355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710435279.XA CN107273355B (en) 2017-06-12 2017-06-12 Chinese word vector generation method based on word and phrase joint training

Publications (2)

Publication Number Publication Date
CN107273355A true CN107273355A (en) 2017-10-20
CN107273355B CN107273355B (en) 2020-07-14

Family

ID=60066039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710435279.XA Active CN107273355B (en) 2017-06-12 2017-06-12 Chinese word vector generation method based on word and phrase joint training

Country Status (1)

Country Link
CN (1) CN107273355B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0622742A1 (en) * 1993-04-28 1994-11-02 International Business Machines Corporation Language processing system
EP1335301A2 (en) * 2002-02-07 2003-08-13 Matsushita Electric Industrial Co., Ltd. Context-aware linear time tokenizer
CN104573054A (en) * 2015-01-21 2015-04-29 杭州朗和科技有限公司 Information pushing method and equipment
CN106055673A (en) * 2016-06-06 2016-10-26 中国人民解放军国防科学技术大学 Chinese short-text sentiment classification method based on text characteristic insertion
CN106547737A (en) * 2016-10-25 2017-03-29 复旦大学 Based on the sequence labelling method in the natural language processing of deep learning

Cited By (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10769383B2 (en) 2017-10-23 2020-09-08 Alibaba Group Holding Limited Cluster-based word vector processing method, device, and apparatus
WO2019095836A1 (en) * 2017-11-14 2019-05-23 阿里巴巴集团控股有限公司 Method, device, and apparatus for word vector processing based on clusters
US10846483B2 (en) 2017-11-14 2020-11-24 Advanced New Technologies Co., Ltd. Method, device, and apparatus for word vector processing based on clusters
CN108154167B (en) * 2017-12-04 2021-08-20 昆明理工大学 Chinese character font similarity calculation method
CN108154167A (en) * 2017-12-04 2018-06-12 昆明理工大学 A kind of Chinese character pattern similarity calculating method
CN108304376B (en) * 2017-12-15 2021-09-10 腾讯科技(深圳)有限公司 Text vector determination method and device, storage medium and electronic device
CN108304376A (en) * 2017-12-15 2018-07-20 腾讯科技(深圳)有限公司 Determination method, apparatus, storage medium and the electronic device of text vector
CN110162766A (en) * 2018-02-12 2019-08-23 深圳市腾讯计算机系统有限公司 Term vector update method and device
US11586817B2 (en) * 2018-02-12 2023-02-21 Tencent Technology (Shenzhen) Company Limited Word vector retrofitting method and apparatus
CN110348001A (en) * 2018-04-04 2019-10-18 腾讯科技(深圳)有限公司 A kind of term vector training method and server
CN110348001B (en) * 2018-04-04 2022-11-25 腾讯科技(深圳)有限公司 Word vector training method and server
CN108595426A (en) * 2018-04-23 2018-09-28 北京交通大学 Term vector optimization method based on Chinese character pattern structural information
CN108595426B (en) * 2018-04-23 2021-07-20 北京交通大学 Word vector optimization method based on Chinese character font structural information
CN109189825B (en) * 2018-08-10 2022-03-15 深圳前海微众银行股份有限公司 Federated learning modeling method, server and medium for horizontal data segmentation
CN109189825A (en) * 2018-08-10 2019-01-11 深圳前海微众银行股份有限公司 Lateral data cutting federation learning model building method, server and medium
CN109308353B (en) * 2018-09-17 2023-08-15 鼎富智能科技有限公司 Training method and device for word embedding model
CN109308353A (en) * 2018-09-17 2019-02-05 北京神州泰岳软件股份有限公司 The training method and device of word incorporation model
CN109508455B (en) * 2018-10-18 2021-11-19 山西大学 GloVe super-parameter tuning method
CN109508455A (en) * 2018-10-18 2019-03-22 山西大学 A kind of GloVe hyper parameter tuning method
CN111199153A (en) * 2018-10-31 2020-05-26 北京国双科技有限公司 Word vector generation method and related equipment
CN111199153B (en) * 2018-10-31 2023-08-25 北京国双科技有限公司 Word vector generation method and related equipment
CN109543191B (en) * 2018-11-30 2022-12-27 重庆邮电大学 Word vector learning method based on word relation energy maximization
CN109543191A (en) * 2018-11-30 2019-03-29 重庆邮电大学 One kind being based on the maximized term vector learning method of word relationship energy
CN109815476A (en) * 2018-12-03 2019-05-28 国网浙江省电力有限公司杭州供电公司 A kind of term vector representation method based on Chinese morpheme and phonetic joint statistics
CN109948159A (en) * 2019-03-15 2019-06-28 合肥讯飞数码科技有限公司 A kind of text data generation method, device, equipment and readable storage medium storing program for executing
CN109948159B (en) * 2019-03-15 2023-05-30 合肥讯飞数码科技有限公司 Text data generation method, device, equipment and readable storage medium
WO2020224219A1 (en) * 2019-05-06 2020-11-12 平安科技(深圳)有限公司 Chinese word segmentation method and apparatus, electronic device and readable storage medium
CN110287961B (en) * 2019-05-06 2024-04-09 平安科技(深圳)有限公司 Chinese word segmentation method, electronic device and readable storage medium
CN110287961A (en) * 2019-05-06 2019-09-27 平安科技(深圳)有限公司 Chinese word cutting method, electronic device and readable storage medium storing program for executing
WO2020244065A1 (en) * 2019-06-04 2020-12-10 平安科技(深圳)有限公司 Character vector definition method, apparatus and device based on artificial intelligence, and storage medium
CN110427608A (en) * 2019-06-24 2019-11-08 浙江大学 A kind of Chinese word vector table dendrography learning method introducing layering ideophone feature
CN110427608B (en) * 2019-06-24 2021-06-08 浙江大学 Chinese word vector representation learning method introducing layered shape-sound characteristics
CN110442874B (en) * 2019-08-09 2023-06-13 南京邮电大学 Chinese word sense prediction method based on word vector
CN110442874A (en) * 2019-08-09 2019-11-12 南京邮电大学 A kind of Chinese meaning of a word prediction technique based on term vector
CN110610006A (en) * 2019-09-18 2019-12-24 中国科学技术大学 Morphological double-channel Chinese word embedding method based on strokes and glyphs
CN110610006B (en) * 2019-09-18 2023-06-20 中国科学技术大学 Morphological double-channel Chinese word embedding method based on strokes and fonts
CN110781678A (en) * 2019-10-14 2020-02-11 大连理工大学 Text representation method based on matrix form
CN110781678B (en) * 2019-10-14 2022-09-20 大连理工大学 Text representation method based on matrix form
CN112686035A (en) * 2019-10-18 2021-04-20 北京沃东天骏信息技术有限公司 Method and device for vectorizing unknown words
CN111008283B (en) * 2019-10-31 2023-06-20 中电药明数据科技(成都)有限公司 Sequence labeling method and system based on composite boundary information
CN111008283A (en) * 2019-10-31 2020-04-14 中电药明数据科技(成都)有限公司 Sequence labeling method and system based on composite boundary information
CN111241819A (en) * 2020-01-07 2020-06-05 北京百度网讯科技有限公司 Word vector generation method and device and electronic equipment
CN111581970A (en) * 2020-05-12 2020-08-25 厦门市美亚柏科信息股份有限公司 Text recognition method, device and storage medium for network context
CN111581970B (en) * 2020-05-12 2023-01-24 厦门市美亚柏科信息股份有限公司 Text recognition method, device and storage medium for network context
CN111832301A (en) * 2020-07-28 2020-10-27 电子科技大学 Chinese word vector generation method based on adaptive component n-tuple
CN113190602B (en) * 2021-04-09 2022-03-25 桂林电子科技大学 Event joint extraction method integrating word features and deep learning
CN113190602A (en) * 2021-04-09 2021-07-30 桂林电子科技大学 Event joint extraction method integrating word features and deep learning
CN113326693A (en) * 2021-05-28 2021-08-31 智者四海(北京)技术有限公司 Natural language model training method and system based on word granularity
CN113326693B (en) * 2021-05-28 2024-04-16 智者四海(北京)技术有限公司 Training method and system of natural language model based on word granularity
CN113095065A (en) * 2021-06-10 2021-07-09 北京明略软件系统有限公司 Chinese character vector learning method and device
CN113627176A (en) * 2021-08-17 2021-11-09 北京计算机技术及应用研究所 Method for calculating Chinese word vector by using principal component analysis
CN113627176B (en) * 2021-08-17 2024-04-19 北京计算机技术及应用研究所 Method for calculating Chinese word vector by principal component analysis

Also Published As

Publication number Publication date
CN107273355B (en) 2020-07-14

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant