CN107608953A - Word vector generation method based on indefinite-length context - Google Patents
- Publication number
- CN107608953A CN107608953A CN201710609471.6A CN201710609471A CN107608953A CN 107608953 A CN107608953 A CN 107608953A CN 201710609471 A CN201710609471 A CN 201710609471A CN 107608953 A CN107608953 A CN 107608953A
- Authority
- CN
- China
- Prior art keywords
- context
- word
- word vector
- corpus
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
A word vector generation method based on indefinite-length context. The present invention relates to the field of natural language processing, and specifically to word vector generation based on indefinite-length contexts. The technical scheme proposes an indefinite-length context partition strategy and a word vector generation method based on such contexts. The strategy uses punctuation marks to divide the corpus into contexts that are of indefinite length but semantically complete. Because the context length is not fixed, traditional language models cannot use these contexts to generate word vectors. To address this, a language model able to handle indefinite-length contexts, F-Model, is designed here by combining convolutional and recurrent neural networks. Analysis of the implementation results shows that dividing the corpus into semantically complete contexts at punctuation marks improves word vector quality. F-Model has good learning ability, and the word vectors it produces contain rich semantics and exhibit good linear relationships.
Description
Technical field
The present invention relates to the field of natural language processing, and more specifically to a word vector generation method based on indefinite-length contexts.
Background technology
Most common natural language processing tasks are implemented on top of word vectors, and the final processing result often depends to a large degree on word vector quality. In general, the higher the quality of a word vector, the richer and more accurate the semantics it contains, and the easier it is for a computer to understand the semantics of natural language; this in turn fundamentally improves the results of other natural language processing tasks. Generating high-quality word vectors is therefore a basic and important task in natural language processing, one that directly and greatly influences downstream tasks such as machine translation and part-of-speech tagging.
In conventional word vector generation methods, to simplify the problem and reduce computational complexity, the corpus is divided into context units of fixed length. Such fixed-length contexts are not complete semantic units, which causes semantic loss and semantic confusion within the context. This loss and confusion is passed on to the word vectors, directly resulting in word vectors with missing or confused semantics.
To solve the semantic loss and confusion brought by fixed contexts, the present work makes full use of the original corpus information and uses punctuation marks to divide the corpus into relatively complete semantic context units. The length of such units is not fixed, so traditional word vector generation methods based on fixed contexts are no longer applicable.
The present invention therefore proposes a word vector generation method based on indefinite-length contexts. The method builds on convolutional and recurrent neural networks and strengthens long-range dependency information between words. Implementation results show that the word vectors generated by this method contain richer semantics and exhibit better linear relationships.
The content of the invention
The technical problem to be solved by the present invention is to provide an indefinite-length context partition strategy and a word vector generation method based on such contexts. The strategy uses punctuation marks to divide the corpus into contexts that are of indefinite length but semantically complete, solving the semantic loss and confusion caused by fixed-length contexts in traditional language models. The word vector generation method built on this partition exploits the characteristics and advantages of convolutional and recurrent neural networks to strengthen long-range dependency information between words and ultimately improve the quality of the generated word vectors.
To achieve the above object, the present invention proposes a word vector generation method based on indefinite-length contexts, characterised in that punctuation marks, probability statistics, and the characteristics and advantages of convolutional and recurrent neural networks are used to preserve the semantic integrity of contexts, strengthen long-range dependencies between words, and improve the semantic capacity of the word vectors.
The present invention first pre-processes the corpus and then divides contexts at punctuation marks, splitting the corpus into context units of varying length that are semantically complete. A convolutional neural network then learns the weight of each word in a context; this weight is combined with the global distribution of the corpus to produce the final weight of each word. The final weights and the word vectors are used to compute the vector representation of the context, which in turn is used to build the one-to-many mapping between the context and each word in it. The model is finally trained with a stochastic gradient algorithm to obtain the word vectors.
The present invention is achieved through the following technical solutions:
(1) Document pre-processing to obtain the training corpus. Given a group of document collections in a certain professional domain, pre-processing techniques such as removing stop words and low-frequency words extract the useful information in the corpus and form the training corpus.
(2) Word frequency statistics to estimate the corpus distribution. Based on word occurrence counts in the documents, the dictionary of the corpus is generated; the dictionary contains each word, its index, and its frequency in the corpus.
(3) Building the training set. Using the punctuation marks in the training corpus, the corpus is divided into contexts of unequal length, forming the training set.
(4) Computing the weight of each word vector in a context. The word vectors of the words in a context form the context matrix. A convolutional neural network obtains the weight of each word by a convolution over the context matrix; this weight is then combined with the word's frequency in the corpus to form the final weight.
(5) Computing the distributed representation of the context. Combining the word weights obtained in step (4) yields the distributed representation of the current context. The historical context information held by a recurrent neural network is then used to generate the latest context representation, while the network's historical state is updated.
(6) Model inference. Using the context representation obtained in step (5), the one-to-many mapping between a context and the words in it is built, and the loss function of the model is constructed.
(7) Training the model to obtain word vectors. The mapping relations built in step (6) are optimised on the training set; training uses negative sampling and stochastic gradient descent.
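The pre-processing and counting of steps (1)-(2) can be sketched as follows. This is a minimal illustration, not the patent's implementation; the `min_count` threshold and the dictionary entry layout are assumptions (the embodiment later mentions a frequency cut-off of 5):

```python
from collections import Counter

def build_dictionary(tokens, min_count=5):
    """Build the corpus dictionary of step (2): each surviving word is
    mapped to its index and its relative frequency in the corpus.
    Words seen fewer than `min_count` times are dropped, as in step (1)."""
    counts = Counter(tokens)
    # Keep frequent words, most frequent first, and index them densely.
    kept = [(w, c) for w, c in sorted(counts.items(), key=lambda x: -x[1])
            if c >= min_count]
    total = sum(c for _, c in kept)
    return {w: {"index": i, "freq": c / total}
            for i, (w, c) in enumerate(kept)}
```

The `freq` field plays the role of the global distribution vector Wt1 used later, and `index` the role of the index vector I.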
In the above method, punctuation marks are used in step (3); the punctuation marks used refer to those that segment relatively complete semantics, such as the comma, full stop, question mark, and exclamation mark.
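A minimal sketch of the step (3) partition: split the raw text into variable-length context units at the punctuation marks just listed. The exact punctuation set below is an assumption:

```python
import re

# Punctuation treated as segmenting "relatively complete semantics";
# the concrete set here is an assumption based on the marks listed above.
SPLIT_PUNCT = r"[,.;!?]"

def split_contexts(text):
    """Divide a corpus into indefinite-length, semantically complete
    context units at punctuation marks (step (3))."""
    parts = re.split(SPLIT_PUNCT, text)
    return [p.split() for p in parts if p.strip()]
```

Each returned list of tokens is one training context, of whatever length the punctuation dictates.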
In the above method, step (4) uses a convolutional neural network whose kernel has shape [1, 3, m, 1], where m is the dimensionality of the word vectors. With this network, a context matrix of shape [k, m] is convolved into a weight vector of shape [k, 1], where k is the number of words in the context. This weight is then combined with the corpus distribution to compute the final weight.
In the above method, step (4) specifically includes the following steps:
a) Building the context matrix. The context matrix is built from the F-Model input. The input In of the model takes a context as its unit and has two parts: the index vector I of the words in the context, and the global distribution vector Wt1 of those words. Each element of I is a word's index in the dictionary. The length of Wt1 equals the length of the context, and each of its elements is the frequency in the whole corpus of the corresponding word in I. A gather() operation looks up the word vectors in the dictionary table according to the input index vector I and stacks them in order into the context matrix C.
Wt1 = (d0, d1, …, dk)  (1)
I = (i0, i1, …, ik)  (2)
In = (I, Wt1)  (3)
C = gather(D, I)  (4)
where k is the context length, ik is the index of word Wk in the dictionary, and dk is the global distribution of word Wk.
b) The context matrix C is convolved by the convolutional layer into a weight vector Wt2 of shape [k, 1].
Wt2 = f(C) + B  (5)
where D is the dictionary, f(·) is the convolution operation with a kernel of shape [1, 3, m, 1], and B is the convolution bias.
c) Computing the final weight. The final weight Wt is computed by combining with the corpus distribution:
Wt = Wt2 · Wt1  (6)
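Steps a)-c) can be illustrated with plain NumPy. The zero-padding scheme and the elementwise product in eq. (6) are assumptions; the patent does not spell either out:

```python
import numpy as np

def context_weights(C, Wt1, kernel, bias=0.0):
    """Sketch of step (4): slide a kernel spanning 3 neighbouring word
    vectors over the context matrix C ([k, m]) to get Wt2 ([k], eq. (5)),
    then combine with the corpus distribution Wt1 to get Wt (eq. (6))."""
    k, m = C.shape
    # Zero-pad one row on each side so the output length stays k.
    padded = np.vstack([np.zeros((1, m)), C, np.zeros((1, m))])
    Wt2 = np.array([np.sum(padded[i:i + 3] * kernel) + bias
                    for i in range(k)])
    return Wt2 * Wt1  # final weight Wt, taken here as elementwise
```

`kernel` has shape [3, m], the 2-D slice of the [1, 3, m, 1] kernel the patent describes.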
In the above method, step (5) computes the context semantic vector Ct from the context matrix C and the weight Wt obtained in step (4). The recurrent neural network then combines Ct with the historical context state it has learned to compute the latest context semantic vector Ct, so that Ct contains the historical context semantics learned by the recurrent network.
In the above method, step (5) specifically includes the following steps:
a) Computing the distributed representation of the current context, which adds word weights on top of the bag-of-words model:
Ct = C × Wt  (7)
where C is the context matrix and × denotes matrix multiplication.
b) Using the recurrent neural network to add historical information:
Ct = rnn(Ct)  (8)
where rnn(·) denotes a pass through the recurrent neural network.
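Equations (7)-(8) amount to a weighted bag-of-words followed by one recurrent update. In the sketch below the tanh cell and the weight names are assumptions; the patent only says the vector passes through the recurrent network:

```python
import numpy as np

def context_vector(C, Wt):
    """Eq. (7): the context's distributed representation is the
    weight-scaled combination of its word vectors ([k] x [k, m] -> [m])."""
    return Wt @ C

def rnn_step(ct, h_prev, W_in, W_h, b):
    """Eq. (8): fold the new context vector into the historical state
    kept by the recurrent network (a plain tanh cell as a stand-in)."""
    return np.tanh(ct @ W_in + h_prev @ W_h + b)
```

The updated state returned by `rnn_step` serves both as the latest context semantic vector and as the history carried into the next context.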
In the above method, step (6) predicts each word in the context from the context semantic vector Ct, computing the conditional probability of the target word given the context as:
P(Wi | C) = sigmoid(Ct · θi)  (9)
where θi is the parameter vector of word Wi.
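The conditional probability can be written directly; θi is a per-word parameter vector with the same dimensionality as the context vector:

```python
import numpy as np

def word_probability(ct, theta_i):
    """P(Wi | C) = sigmoid(ct . theta_i): conditional probability of
    target word Wi given the context semantic vector ct."""
    return 1.0 / (1.0 + np.exp(-np.dot(ct, theta_i)))
```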
In the above method, during the model training of step (7), the loss function uses the model perplexity, representing the inverse geometric mean probability of predicting the target words of a context:
loss = 1 / (P(W1 | C) · P(W2 | C) · … · P(Wk | C))^(1/k)  (10)
where k is the length of the context and P(Wi | C) is the conditional probability of word Wi given context C.
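As a sanity check, this perplexity loss is the reciprocal of the geometric mean of the per-word probabilities P(Wi | C); computing it in log space avoids underflow for long contexts:

```python
import numpy as np

def perplexity_loss(probs):
    """loss = 1 / (prod_i P(Wi | C))^(1/k): the reciprocal geometric
    mean of the conditional probabilities, i.e. the model perplexity."""
    return float(np.exp(-np.mean(np.log(probs))))
```

A perfectly confident model (all probabilities 1) gives a loss of 1; anything less confident gives a larger loss.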
The effects and advantages of the present invention are as follows: the corpus is divided at punctuation marks, forming semantically complete contexts; F-Model exploits the characteristics and advantages of convolutional and recurrent neural networks, strengthens long-range dependencies between words, and improves word vector quality.
Brief description of the drawings
Fig. 1 is the structure chart of F-Model.
Fig. 2 is the implementation module diagram of the present invention.
Fig. 3 shows the loss curves of F-Model under different learning rates.
Fig. 4: Table 1, similarity analysis of words.
Fig. 5: Table 2, F-Model linear-relationship analysis data (part).
Fig. 6: Table 3, test-set accuracy on questions-words under different dimensionalities and iteration counts.
Embodiment
To make the purpose, technical scheme, and advantages of the present invention clearer, the method according to the embodiments of the invention is further described below with reference to the accompanying drawings. It should be understood that the specific embodiments described here only explain the invention and do not limit it; that is, the protection scope of the invention is not limited to the following embodiments. On the contrary, those of ordinary skill in the art may make appropriate changes according to the inventive concept, and such changes fall within the scope of the invention as defined by the claims.
As shown in the structural block diagram of Fig. 1, the word vector generation method based on indefinite-length contexts implemented according to the present invention comprises the following steps:
1) pretreatment module:
The corpus is pre-processed: stop words and low-frequency words are removed, digits are replaced by a fixed mark, and all words are lowercased, finally forming an effective training corpus.
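A minimal sketch of the pre-processing module, under the conventions the implementation section describes (lowercase everything, digits to a fixed mark, drop stop words and words seen fewer than 5 times); the function signature is an assumption:

```python
import re

def preprocess(tokens, stopwords, counts, min_count=5):
    """Pre-processing module: lowercase each word, replace digit strings
    with the fixed mark 'N', and drop stop words and low-frequency words."""
    out = []
    for w in tokens:
        w = w.lower()
        if re.fullmatch(r"\d+", w):
            w = "N"                      # digits become a fixed mark
        if w in stopwords or counts.get(w, 0) < min_count:
            continue                     # drop stop/low-frequency words
        out.append(w)
    return out
```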
2) model construction module:
The word frequency distribution of the training corpus is counted and the dictionary generated; the dictionary contains each word, its index, and its frequency. Punctuation marks divide the corpus into indefinite-length, semantically complete context units, forming the training set. The input of the F-Model takes a context as its unit and has two parts: the index vector I of words in the context, and the global distribution vector Wt1 of those words. The index vector I builds the context matrix C, and the convolutional neural network learns the word weights Wt2 from the context matrix. Wt2 and the corpus distribution Wt1 form the final weight Wt. From Wt and the context matrix C, F-Model computes the distributed representation Ct of the context. F-Model then builds the one-to-many mapping between the context and its words and constructs the model loss function using negative sampling, providing the optimisation target for subsequent training.
3) model training module:
The model is trained on the training set with a stochastic gradient algorithm by continuously tuning the loss function; the model hyper-parameters are adjusted to improve the quality of the obtained word vectors. The hyper-parameters include the vector dimensionality, the number of iterations, and the learning rate. Three vector dimensionalities were used in the implementation: 50, 100, and 200. The number of iterations refers to how many times the training set passes through the model; 10 and 20 iterations were used. Several fixed learning rates were used: 0.1, 0.01, and 0.001.
Implementation description and results:
Four data sets were used in the implementation: the English corpus news.2011.en.shuffled from the Billion-Words corpus, the ptb.train.txt corpus, the Wordsim353 data set, and the questions-words.txt data set. The news.2011.en.shuffled corpus served as the training set, and the ptb.train.txt corpus was used to test model functionality. The Wordsim353 and questions-words.txt data sets were used as test sets: Wordsim353 for comparing word vector similarity, and questions-words.txt for assessing the linear relationships of word vectors.
The news.2011.en.shuffled corpus is plain text crawled from 2011 news; it is large, complete, and retains raw information such as punctuation. When processing this corpus, words occurring fewer than 5 times and words in the stop-word list were ignored, all words were lowercased, and all digits were converted to the capital letter N. Compared with Billion-Words, ptb.train.txt is a smaller corpus and is used in the implementation to test the model. Wordsim353 is a small, manually built data set containing many word pairs and their similarities; the similarity is produced by human scoring, from a minimum of 0 to a maximum of 10. Wordsim353 contains three sub-sets: set1, set2, and combined. set1 contains similarity scores for word pairs from 13 annotators, set2 contains scores from 16 annotators, and combined averages the similarity scores of set1 and set2. In the implementation, Wordsim353 serves as the test set and benchmark for similarity assessment. The questions-words.txt data set is the data set used by word2vec in TensorFlow; it is a manually built test set for the linear relationships of word vectors. Each test case in it contains 4 words exhibiting a linear relationship of the form King - Man + Woman = Queen.
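The linear-relationship test on questions-words.txt can be reproduced with a nearest-neighbour search over the embedding; the toy vectors in the test below are illustrative only, real embeddings come from the trained model:

```python
import numpy as np

def analogy(vecs, a, b, c):
    """Return the vocabulary word whose vector is closest (by cosine)
    to vec(a) - vec(b) + vec(c), excluding the three query words; a
    good embedding should answer analogy(v, 'king', 'man', 'woman')
    with 'queen'."""
    target = vecs[a] - vecs[b] + vecs[c]
    best, best_sim = None, -np.inf
    for w, v in vecs.items():
        if w in (a, b, c):
            continue
        sim = np.dot(target, v) / (
            np.linalg.norm(target) * np.linalg.norm(v) + 1e-12)
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```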
Analysis and testing of word vector similarity and linear relationships (see Table 1, Table 2, Table 3, and Fig. 3) show that word vector quality is influenced simultaneously by the model, the context, and the dimensionality. Dividing the corpus into indefinite-length contexts at punctuation marks preserves more complete semantics and improves word vector quality. F-Model has good learning ability: the word vectors obtained by training contain rich semantics and exhibit good linear relationships. A larger word vector dimensionality gives greater semantic capacity, but also brings the curse of dimensionality and over-fitting, causing the word vectors to perform poorly when transferred.
It should be noted and understood that, without departing from the spirit and scope of the present invention as required by the appended claims, various modifications and improvements can be made to the invention described in detail above. The scope of the technical scheme claimed is therefore not limited by any particular exemplary teaching given.
Claims (8)
1. A word vector generation method based on indefinite-length contexts, characterised in that the corpus is first pre-processed and then divided into contexts at punctuation marks, splitting the corpus into context units of varying length that are semantically complete. A convolutional neural network then learns the weight of each word in a context; this weight is combined with the global distribution of the corpus to produce the final weight of each word. The final weights and the word vectors are used to compute the vector representation of the context, which in turn is used to build the one-to-many mapping between the context and each word in it. The model is then trained with a stochastic gradient algorithm, finally obtaining the word vectors.
2. The word vector generation method based on indefinite-length contexts of claim 1, characterised by specifically including the following steps:
(1) Document pre-processing to obtain the training corpus. Given a group of document collections in a certain professional domain, pre-processing techniques such as removing stop words and low-frequency words extract the useful information in the corpus and form the training corpus.
(2) Word frequency statistics to estimate the corpus distribution. Based on word occurrence counts in the documents, the dictionary of the corpus is generated; the dictionary contains each word, its index, and its frequency in the corpus.
(3) Building the training set. Using the punctuation marks in the training corpus, the corpus is divided into contexts of unequal length, forming the training set.
(4) Computing the weight of each word vector in a context. The word vectors of the words in a context form the context matrix. A convolutional neural network obtains the weight of each word by a convolution over the context matrix; this weight is then combined with the word's frequency in the corpus to form the final weight.
(5) Computing the distributed representation of the context. Combining the word weights obtained in step (4) yields the distributed representation of the current context. The historical context information held by a recurrent neural network is then used to generate the latest context representation, while the network's historical state is updated.
(6) Model inference. Using the context representation obtained in step (5), the one-to-many mapping between a context and the words in it is built, and the loss function of the model is constructed.
(7) Training the model to obtain word vectors. The mapping relations built in step (6) are optimised on the training set; training uses negative sampling and stochastic gradient descent.
In the above method, punctuation marks are used in step (3); the punctuation marks used refer to those that segment relatively complete semantics, such as the comma, full stop, question mark, and exclamation mark.
3. The word vector generation method based on indefinite-length contexts of claim 2, characterised in that step (4) uses a convolutional neural network whose kernel has shape [1, 3, m, 1], where m is the dimensionality of the word vectors. With this network, a context matrix of shape [k, m] is convolved into a weight vector of shape [k, 1], where k is the number of words in the context. This weight is combined with the corpus distribution to compute the final weight.
4. The word vector generation method based on indefinite-length contexts of claim 2, characterised in that step (4) specifically includes the following steps:
a) Building the context matrix. The context matrix is built from the F-Model input. The input In of the model takes a context as its unit and has two parts: the index vector I of the words in the context, and the global distribution vector Wt1 of those words. Each element of I is a word's index in the dictionary. The length of Wt1 equals the length of the context, and each of its elements is the frequency in the whole corpus of the corresponding word in I. A gather() operation looks up the word vectors in the dictionary table according to the input index vector I and stacks them in order into the context matrix C.
Wt1 = (d0, d1, …, dk)  (1)
I = (i0, i1, …, ik)  (2)
In = (I, Wt1)  (3)
C = gather(D, I)  (4)
where k is the context length, ik is the index of word Wk in the dictionary, and dk is the global distribution of word Wk.
b) The context matrix C is convolved by the convolutional layer into a weight vector Wt2 of shape [k, 1].
Wt2 = f(C) + B  (5)
where D is the dictionary, f(·) is the convolution operation with a kernel of shape [1, 3, m, 1], and B is the convolution bias.
c) Computing the final weight. The final weight Wt is computed by combining with the corpus distribution:
Wt = Wt2 · Wt1  (6)
5. The word vector generation method based on indefinite-length contexts of claim 2, characterised in that step (5) computes the context semantic vector Ct from the context matrix C and the weight Wt obtained in step (4). The recurrent neural network combines Ct with the historical context state it has learned to compute the latest context semantic vector Ct, which thereby contains the historical context semantics learned by the recurrent network.
6. The word vector generation method based on indefinite-length contexts of claim 5, characterised in that step (5) specifically includes the following steps:
a) Computing the distributed representation of the current context, which adds word weights on top of the bag-of-words model:
Ct = C × Wt  (7)
where C is the context matrix and × denotes matrix multiplication.
b) Using the recurrent neural network to add historical information:
Ct = rnn(Ct)  (8)
where rnn(·) denotes a pass through the recurrent neural network.
7. The word vector generation method based on indefinite-length contexts of claim 2, characterised in that step (6) predicts each word in the context from the context semantic vector Ct, computing the conditional probability of the target word given the context as:
P(Wi | C) = sigmoid(Ct · θi)  (7)
where θi is the parameter vector of word Wi.
8. The word vector generation method based on indefinite-length contexts of claim 1, characterised in that, during the model training of step (7), the loss function uses the model perplexity, representing the inverse geometric mean probability of predicting the target words of a context.
loss = 1 / (P(W1 | C) · P(W2 | C) · … · P(Wk | C))^(1/k)  (8)
where k is the length of the context and P(Wi | C) is the conditional probability of word Wi given context C.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710609471.6A CN107608953B (en) | 2017-07-25 | 2017-07-25 | Word vector generation method based on indefinite-length context |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710609471.6A CN107608953B (en) | 2017-07-25 | 2017-07-25 | Word vector generation method based on indefinite-length context |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107608953A true CN107608953A (en) | 2018-01-19 |
CN107608953B CN107608953B (en) | 2020-08-14 |
Family
ID=61059516
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710609471.6A Active CN107608953B (en) | 2017-07-25 | 2017-07-25 | Word vector generation method based on indefinite-length context |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107608953B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108733647A (en) * | 2018-04-13 | 2018-11-02 | 中山大学 | Word vector generation method based on Gaussian distribution |
CN109582771A (en) * | 2018-11-26 | 2019-04-05 | 国网湖南省电力有限公司 | Mobile-application-based intelligent customer interaction method for the electric power domain |
CN110096697A (en) * | 2019-03-15 | 2019-08-06 | 华为技术有限公司 | Word vector matrix compression method and apparatus, and method and apparatus for obtaining word vectors |
CN110119507A (en) * | 2018-02-05 | 2019-08-13 | 阿里巴巴集团控股有限公司 | Word vector generation method, apparatus and device |
WO2019154411A1 (en) * | 2018-02-12 | 2019-08-15 | 腾讯科技(深圳)有限公司 | Word vector retrofitting method and device |
CN110287337A (en) * | 2019-06-19 | 2019-09-27 | 上海交通大学 | System and method for obtaining medical synonyms based on deep learning and knowledge graph |
CN110781678A (en) * | 2019-10-14 | 2020-02-11 | 大连理工大学 | Text representation method based on matrix form |
CN111241819A (en) * | 2020-01-07 | 2020-06-05 | 北京百度网讯科技有限公司 | Word vector generation method and device and electronic equipment |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101482860A (en) * | 2008-01-09 | 2009-07-15 | 中国科学院自动化研究所 | Automatic extraction and filtration method for Chinese-English phrase translation pairs |
US20090210218A1 (en) * | 2008-02-07 | 2009-08-20 | Nec Laboratories America, Inc. | Deep Neural Networks and Methods for Using Same |
CN101685441A (en) * | 2008-09-24 | 2010-03-31 | 中国科学院自动化研究所 | Generalized reordering statistic translation method and device based on non-continuous phrase |
CN102231278A (en) * | 2011-06-10 | 2011-11-02 | 安徽科大讯飞信息科技股份有限公司 | Method and system for realizing automatic addition of punctuation marks in speech recognition |
CN102930055A (en) * | 2012-11-18 | 2013-02-13 | 浙江大学 | New network word discovery method in combination with internal polymerization degree and external discrete information entropy |
CN103324612A (en) * | 2012-03-22 | 2013-09-25 | 北京百度网讯科技有限公司 | Method and device for segmenting word |
CN103530284A (en) * | 2013-09-22 | 2014-01-22 | 中国专利信息中心 | Short sentence segmenting device, machine translation system and corresponding segmenting method and translation method |
CN104317965A (en) * | 2014-11-14 | 2015-01-28 | 南京理工大学 | Establishment method of emotion dictionary based on linguistic data |
JP2017059205A (en) * | 2015-09-17 | 2017-03-23 | パナソニックIpマネジメント株式会社 | Subject estimation system, subject estimation method, and program |
CN106682220A (en) * | 2017-01-04 | 2017-05-17 | 华南理工大学 | Online traditional Chinese medicine text named entity identifying method based on deep learning |
- 2017-07-25 CN CN201710609471.6A patent/CN107608953B/en active Active
Non-Patent Citations (1)
Title |
---|
Li Qiong, "Automatic sentence segmentation using punctuation marks", Journal of West Anhui University (皖西学院学报) * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110119507A (en) * | 2018-02-05 | 2019-08-13 | Alibaba Group Holding Ltd. | Word vector generation method, device and equipment |
WO2019154411A1 (en) * | 2018-02-12 | 2019-08-15 | Tencent Technology (Shenzhen) Co., Ltd. | Word vector retrofitting method and device |
US11586817B2 (en) | 2018-02-12 | 2023-02-21 | Tencent Technology (Shenzhen) Company Limited | Word vector retrofitting method and apparatus |
CN108733647A (en) * | 2018-04-13 | 2018-11-02 | Sun Yat-sen University | Word vector generation method based on Gaussian distribution |
CN108733647B (en) * | 2018-04-13 | 2022-03-25 | Sun Yat-sen University | Word vector generation method based on Gaussian distribution |
CN109582771A (en) * | 2018-11-26 | 2019-04-05 | State Grid Hunan Electric Power Co., Ltd. | Intelligent customer service interaction method for the electric power domain based on mobile applications |
CN110096697A (en) * | 2019-03-15 | 2019-08-06 | Huawei Technologies Co., Ltd. | Word vector matrix compression method and apparatus, and method and apparatus for obtaining word vectors |
CN110096697B (en) * | 2019-03-15 | 2022-04-12 | Huawei Technologies Co., Ltd. | Word vector matrix compression method and device, and method and device for obtaining word vectors |
CN110287337A (en) * | 2019-06-19 | 2019-09-27 | Shanghai Jiao Tong University | System and method for obtaining medical synonyms based on deep learning and knowledge graphs |
CN110781678A (en) * | 2019-10-14 | 2020-02-11 | Dalian University of Technology | Text representation method based on matrix form |
CN110781678B (en) * | 2019-10-14 | 2022-09-20 | Dalian University of Technology | Text representation method based on matrix form |
CN111241819A (en) * | 2020-01-07 | 2020-06-05 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Word vector generation method, device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN107608953B (en) | 2020-08-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107608953A (en) | A kind of term vector generation method based on random length context | |
US20200012953A1 (en) | Method and apparatus for generating model | |
CN107967255A (en) | Method and system for determining text similarity | |
CN106897559B (en) | Symptom and sign entity recognition method and device for multiple data sources | |
CN106980609A (en) | Named entity recognition method using conditional random fields based on word vector representation | |
CN105955962B (en) | Topic similarity calculation method and device | |
CN110516245A (en) | Fine-grained sentiment analysis method, apparatus, computer equipment and storage medium | |
CN107038480A (en) | Text sentiment classification method based on convolutional neural networks | |
CN108090510A (en) | Ensemble learning method and device based on interval optimization | |
CN108563703A (en) | Charge determination method and device, computer equipment, and storage medium | |
CN105893609A (en) | Mobile app recommendation method based on weighted hybrid approach | |
CN109739995B (en) | Information processing method and device | |
CN109214004B (en) | Big data processing method based on machine learning | |
CN107203600A (en) | Evaluation method for enhancing answer quality ranking by modeling causal dependencies and temporal influence mechanisms | |
CN104794222B (en) | Semantic recovery method for web tables | |
CN111581364B (en) | Short-text similarity calculation method for Chinese intelligent question answering in the medical field | |
JP7347179B2 (en) | Methods, devices and computer programs for extracting web page content | |
CN112035629B (en) | Method for implementing a question answering model based on symbolic knowledge and neural networks | |
CN105893363A (en) | A method and a system for acquiring relevant knowledge points of a knowledge point | |
CN109829054A (en) | Text classification method and system | |
CN114692615A (en) | Few-shot semantic graph recognition method for low-resource languages | |
CN104361077B (en) | Method and device for creating a webpage scoring model | |
CN113254612A (en) | Knowledge question-answering processing method, device, equipment and storage medium | |
Renner et al. | Mapping common data elements to a domain model using an artificial neural network | |
CN105468657B (en) | Method and system for acquiring important knowledge points in a domain | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||