CN107608953A - Word vector generation method based on indefinite-length context - Google Patents

Word vector generation method based on indefinite-length context

Info

Publication number
CN107608953A
Authority
CN
China
Prior art keywords
context
word
word vector
corpus
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710609471.6A
Other languages
Chinese (zh)
Other versions
CN107608953B (en)
Inventor
王俊丽
王小敏
杨亚星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN201710609471.6A priority Critical patent/CN107608953B/en
Publication of CN107608953A publication Critical patent/CN107608953A/en
Application granted granted Critical
Publication of CN107608953B publication Critical patent/CN107608953B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

A word vector generation method based on indefinite-length contexts. The present invention relates to the field of natural language processing, and specifically to word vector generation methods based on indefinite-length contexts. The technical scheme proposes an indefinite-length context partitioning strategy and a word vector generation method built on it. The strategy uses punctuation marks to divide the corpus into contexts that vary in length but are semantically complete. Because the context length is not fixed, traditional language models cannot use such contexts to generate word vectors. To address this problem, a language model that can handle indefinite-length contexts, F-Model, is designed by combining a convolutional neural network and a recurrent neural network. Analysis of the implementation results shows that dividing the corpus into semantically complete contexts using punctuation improves the quality of the word vectors. F-Model has good learning ability, and the word vectors obtained in the implementation contain rich semantics and exhibit good linear relationships.

Description

Word vector generation method based on indefinite-length context
Technical field
The present invention relates to the field of natural language processing, and specifically to word vector generation methods based on indefinite-length contexts.
Background technology
Most common natural language processing tasks are realized on the basis of word vectors, and the final processing results often depend heavily on word vector quality. In general, the higher the quality of the word vectors, the richer and more accurate the semantics they contain, and the easier it is for a computer to understand the semantics of natural language; this fundamentally improves the results of other natural language processing tasks. How to generate high-quality word vectors is therefore a basic and important task in the field of natural language processing, one that directly and greatly influences follow-up tasks such as machine translation and part-of-speech tagging.
In conventional word vector generation methods, in order to simplify the problem and reduce computational complexity, the corpus is divided into context units of fixed length. Such fixed-length contexts, however, are not complete semantic units, which causes semantic loss and semantic confusion in the contexts. This loss and confusion are passed on to the word vectors, directly producing semantically deficient and confused word vectors.
To solve the semantic loss and confusion that fixed contexts bring to word vectors, this work makes full use of the original corpus information and uses punctuation marks to divide the corpus into semantically relatively complete context units. The length of such context units is not fixed, so traditional word vector generation methods based on fixed contexts no longer apply.
Therefore, the present invention proposes a word vector generation method based on indefinite-length contexts. The method is built on convolutional and recurrent neural networks and strengthens the long-range dependency information between words. The implementation results show that the word vectors generated by this method contain richer semantics and exhibit better linear relationships between word vectors.
Content of the invention
The technical problem to be solved by the present invention is to provide an indefinite-length context partitioning strategy and a word vector generation method based on indefinite-length contexts. The strategy uses punctuation marks to divide the corpus into contexts that vary in length but are semantically complete, solving the semantic loss and confusion caused by fixed-length contexts in traditional language models. The word vector generation method built on this partitioning exploits the characteristics and advantages of convolutional and recurrent neural networks to strengthen the long-range dependency information between words and ultimately improve the quality of the generated word vectors.
To achieve the above object, the present invention proposes a word vector generation method based on indefinite-length contexts, characterized in that it uses punctuation marks, probability statistics, and the characteristics and advantages of convolutional and recurrent neural networks to preserve the semantic integrity of contexts, strengthen the long-range dependencies between words, and improve the semantic capacity of the word vectors.
The present invention first preprocesses the corpus and then uses punctuation marks to divide it into context units of varying length that are semantically complete. A convolutional neural network then learns the weight of each word in a context, and this weight is combined with the global distribution of the corpus to generate the final weight of each word in the context. The final weights and the word vectors are used to compute the vector representation of the context, which in turn is used to build the one-to-many mapping between the context and each word in it. The model is then trained with stochastic gradient descent, finally yielding the word vectors.
The present invention is achieved through the following technical solutions:
(1) Document preprocessing, obtaining the training corpus. Given a set of documents on a certain professional domain, preprocessing techniques such as removing stop words and low-frequency words extract the useful information in the corpus, which then forms the training corpus.
(2) Word frequency statistics, counting the corpus distribution. Based on the word occurrence frequencies in the documents, a dictionary of the corpus is generated containing each word, its index, and its frequency in the corpus.
(3) Building the training set. Using the punctuation marks in the training corpus, the corpus is divided into contexts of unequal length, forming the training set.
(4) Computing the word weights in the context. The word vectors of the words in a context form the context matrix. A convolutional neural network obtains the weight of each word through a convolution over the context matrix, and this weight is combined with the word's frequency in the corpus to form the final weight.
(5) Computing the distributed representation of the context. Combining the word weights obtained in step (4) yields the distributed representation of the current context. The historical context information in the recurrent neural network is then used to generate the newest distributed representation of the context, while the historical state of the recurrent neural network is updated.
(6) Model inference. Using the distributed context information obtained in step (5), the one-to-many mapping between the context and the words in it is built, and the loss function of the model is constructed.
(7) Training the model, obtaining the word vectors. Optimization is performed on the training set according to the mapping built in step (6), using negative sampling and stochastic gradient descent.
In the above method, step (3) uses punctuation marks; the punctuation marks referred to here are those that delimit relatively complete semantic segments, such as "，", "。", "？", and "！".
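For illustration only, a minimal Python sketch of this punctuation-based segmentation, assuming a plain-text English corpus and an ASCII boundary set (the function name split_contexts and the exact boundary set are illustrative, not prescribed by the patent):

```python
import re

# Punctuation treated as boundaries of relatively complete semantic
# segments; the patent cites commas, periods, question marks, and
# exclamation marks, so an ASCII approximation is used here.
BOUNDARIES = re.compile(r"[,.?!]")

def split_contexts(text):
    """Split raw text into variable-length, semantically complete contexts."""
    segments = BOUNDARIES.split(text)
    return [seg.split() for seg in segments if seg.strip()]

contexts = split_contexts("the cat sat on the mat, then it slept. the end!")
# -> [['the', 'cat', 'sat', 'on', 'the', 'mat'],
#     ['then', 'it', 'slept'], ['the', 'end']]
```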
In the above method, step (4) uses a convolutional neural network whose kernel has size [1, 3, m, 1], where m is the dimensionality of the word vectors. With this network, a context matrix of shape [k, m] is convolved into a weight vector of shape [k, 1], where k is the number of words in the context. This weight is then combined with the corpus distribution to compute the final weight.
In the above method, step (4) specifically comprises the following steps:
A) Building the context matrix. The context matrix is built from the input of F-Model. The model's input In takes a context as its unit and comprises two parts: the index vector I of the words in the context, and the global distribution vector Wt1 of those words. Each element of vector I is the index of a word in the dictionary. The length of vector Wt1 equals the length of the context, and each of its elements is the frequency of the corresponding word of I over the whole corpus. A gather() operation looks up the word vectors in the dictionary table according to the input index vector I and arranges them in order to form the context matrix C.
Wt1 = (d_0, d_1, …, d_k)  (1)
I = (i_0, i_1, …, i_k)  (2)
In = (I, Wt1)  (3)
C = gather(D, I)  (4)
where k is the length of the context, i_k denotes the index of word W_k in the dictionary, and d_k denotes the global distribution of word W_k.
B) The context matrix C is convolved by the convolutional layer into a weight vector Wt2 of shape [k, 1].
Wt2 = f(C) + B  (5)
where D is the dictionary, f(·) is the convolution operation, the convolution kernel has shape [1, 3, m, 1], and B is the bias term of the convolution.
C) Computing the final weight. The final weight Wt is computed by combining the convolutional weight with the corpus distribution:
Wt = Wt2 · Wt1  (6)
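A minimal numpy sketch of sub-steps A)–C), under the assumption that the [1, 3, m, 1] kernel acts as a window over three consecutive word vectors producing one scalar weight per word; all names beyond those in the equations are illustrative:

```python
import numpy as np

k, m, V = 6, 50, 10_000          # context length, vector dimension, vocabulary size
rng = np.random.default_rng(0)

D = rng.normal(size=(V, m))      # dictionary table of word vectors
I = rng.integers(0, V, size=k)   # index vector of the context words           (2)
Wt1 = np.full(k, 1.0 / V)        # global distribution of those words          (1)

C = D[I]                         # C = gather(D, I), shape [k, m]              (4)

# Convolve C into a [k]-shaped weight vector: each word's weight comes
# from a window of three consecutive word vectors (zero-padded at the ends).
F = rng.normal(size=(3, m))      # convolution kernel; B is its bias
B = 0.0
Cp = np.pad(C, ((1, 1), (0, 0)))
Wt2 = np.array([np.sum(Cp[i:i + 3] * F) for i in range(k)]) + B  #             (5)

Wt = Wt2 * Wt1                   # final weight, combined with the corpus distribution (6)
```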
In the above method, step (5) computes the semantic vector Ct of the context from the context matrix C and the weight Wt obtained in step (4). The recurrent neural network then combines the current context semantic vector Ct with the historical context state it has learned to compute the newest context semantic vector, so that the context semantic vector incorporates the historical context semantics learned by the recurrent neural network.
In the above method, step (5) specifically comprises the following steps:
A) Computing the distributed representation of the current context. On top of the bag-of-words model, the weight information of each word is added.
The distributed representation of the current context is computed as:
Ct = C × Wt  (7)
where C is the context matrix and × denotes matrix multiplication.
B) Adding historical information with the recurrent neural network:
Ct = rnn(Ct)  (8)
where rnn(·) denotes a pass through the recurrent neural network.
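Continuing the numpy sketch above, step (5) might look as follows, with a single-layer tanh recurrence standing in for the recurrent neural network (the update rule and the parameter names Wx, Wh, h are assumptions, not specified by the patent):

```python
# Distributed representation of the current context: a weighted
# combination of the context word vectors on top of bag-of-words     (7)
Ct = C.T @ Wt                    # shape [m]

# Recurrent layer: fold the historical context state h into Ct       (8)
Wx = rng.normal(size=(m, m))
Wh = rng.normal(size=(m, m))
h = np.zeros(m)                  # historical context state
h = np.tanh(Wx @ Ct + Wh @ h)    # newest context semantic vector
Ct = h                           # Ct now carries the learned history
```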
In the above method, step (6) predicts each word in the context from the context semantic vector Ct, computing the conditional probability of each target word given the context:
P(W_i | C) = sigmoid(C θ_i)  (9)
where θ_i is the parameter vector of word W_i.
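Under the same assumptions, the prediction of step (6) can be sketched as (Theta is the per-word parameter table θ):

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Theta = rng.normal(size=(V, m))      # one parameter vector per word
p = sigmoid(Ct @ Theta[I].T)         # P(W_i | C) for every word of the context (9)
```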
In the above method, during the training of step (7), the loss function adopts the model perplexity, i.e., the inverse geometric mean of the predicted probabilities of the target words over the context:
loss = (∏_{i=1}^{k} P(W_i | C))^{-1/k}  (10)
where k is the length of the context and P(W_i | C) is the conditional probability of word W_i given context C.
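A direct transcription of this loss in the running sketch; the second form is the numerically stabler log-space equivalent:

```python
# Perplexity-style loss over the k context words                     (10)
loss = 1.0 / np.prod(p) ** (1.0 / k)
loss_log = np.exp(-np.mean(np.log(p)))   # same value, computed in log space
```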
The effects and advantages of the present invention are as follows: punctuation marks divide the corpus into semantically complete contexts; F-Model exploits the characteristics and advantages of convolutional and recurrent neural networks to strengthen the long-range dependencies between words and improve the quality of the word vectors.
Brief description of the drawings
Fig. 1 is the structure diagram of F-Model.
Fig. 2 is the implementation module diagram of the present invention.
Fig. 3 shows the loss curves of F-Model under different learning rates.
Fig. 4: Table 1, similarity analysis of words.
Fig. 5: Table 2, F-Model linear relationship analysis data (partial).
Fig. 6: Table 3, accuracy on the questions test set under different dimensionalities and iteration counts.
Embodiment
To make the purpose, technical scheme, and advantages of the present invention clearer, the word vector generation method according to the embodiments of the invention is further described below with reference to the accompanying drawings. It should be understood that the specific embodiments described here only explain the present invention and do not limit it; that is, the protection scope of the invention is not limited to the following embodiments. On the contrary, those of ordinary skill in the art may make appropriate changes according to the inventive concept, and such changes fall within the scope of the invention as defined by the claims.
As shown in the structure diagram of Fig. 1, the word vector generation method based on indefinite-length contexts according to an embodiment of the present invention comprises the following steps:
1) pretreatment module:
The corpus is preprocessed: stop words and low-frequency words are removed, digits are replaced with a fixed token, all words are lowercased, and so on, eventually forming an effective training corpus.
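A minimal sketch of this preprocessing (the stop-word list is illustrative; the frequency threshold of 5 and the capital-N digit token follow the implementation notes below):

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of"}      # illustrative stop-word list
MIN_FREQ = 5                          # words rarer than this are dropped

def preprocess(tokens):
    """Lowercase, drop stop words and low-frequency words, map digits to N."""
    counts = Counter(t.lower() for t in tokens)
    out = []
    for t in tokens:
        t = t.lower()
        if t in STOP_WORDS or counts[t] < MIN_FREQ:
            continue
        out.append("N" if t.isdigit() else t)
    return out
```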
2) model construction module:
The word frequency distribution of the training corpus is counted and the dictionary is generated; the dictionary content includes words, indices, and frequencies. Punctuation marks divide the corpus into indefinite-length, semantically complete context units that form the training set. The input of the F-Model takes a context as its unit and comprises two parts: the index vector I of the words in the context and the global distribution vector Wt1 of those words. The index vector I builds the context matrix C, and the convolutional neural network learns the weights Wt2 of the words in the context matrix. Wt2 and the corpus distribution Wt1 form the final weight Wt. From Wt and the context matrix C, F-Model computes the distributed representation Ct of the context. F-Model then builds the one-to-many mapping between the context and its words and constructs the loss function with negative sampling, providing the optimization target for subsequent training.
3) model training module:
The model is trained on the training set with stochastic gradient descent by continually reducing the loss function, and the quality of the obtained word vectors is improved by tuning the model hyperparameters. The hyperparameters include the vector dimensionality, the number of iterations, and the learning rate. Three vector dimensionalities were used in the implementation: 50, 100, and 200. The number of iterations, i.e., how many times each training set passes through the model, was 10 or 20. Several fixed learning rates were used: 0.1, 0.01, and 0.001.
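Continuing the earlier numpy sketch, a toy negative-sampling SGD update for one context, updating only the per-word parameters Theta (the negative-sample count and the restriction to Theta are simplifications for illustration; the full procedure also updates the convolutional and recurrent parameters):

```python
LEARNING_RATE = 0.01                 # one of the fixed rates tried: 0.1, 0.01, 0.001
NUM_NEG = 5                          # negative samples per context (assumed)

for epoch in range(10):              # 10 or 20 passes, as in the implementation
    positives = [(w, 1.0) for w in I]                         # words of the context
    negatives = [(w, 0.0) for w in rng.integers(0, V, NUM_NEG)]
    for w, label in positives + negatives:
        p_w = sigmoid(Ct @ Theta[w])
        Theta[w] -= LEARNING_RATE * (p_w - label) * Ct        # SGD step
```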
Implementation notes and results:
Four data sets were used in the implementation: the English corpus news.2011.en.shuffled from the Billion-Words corpus, the ptb.train.txt corpus, the Wordsim353 data set, and the questions-words.txt data set. The news.2011.en.shuffled corpus serves as the training set, and the ptb.train.txt corpus is used to test model functionality. Wordsim353 and questions-words.txt serve as test sets: Wordsim353 is used to compare the similarity of word vectors, and questions-words.txt is used to assess the linear relationships of word vectors.
The news.2011.en.shuffled corpus is plain-text data crawled from 2011 news; it is large, complete, and retains raw information such as punctuation. When processing this corpus, words with frequency below 5 and words in the stop-word list are ignored, all words are lowercased, and all digits are converted to a capital N. Compared with Billion-Words, ptb.train.txt is a smaller corpus and is used in the implementation to test the model. The Wordsim353 data set is a small, manually built corpus containing word pairs and their similarities; the similarity is produced by human scoring, with a minimum score of 0 and a maximum of 10. Wordsim353 contains three subsets: set1, set2, and combined. set1 contains similarity scores for word pairs from 13 different human raters, set2 contains scores from 16 raters, and combined averages the similarity scores of set1 and set2. Wordsim353 is used as a test set in the implementation, serving as the benchmark for similarity evaluation. The questions-words.txt data set is the one used by word2vec in TensorFlow; it is a manually constructed test set for probing the linear relationships of word vectors. Each test case contains 4 words that stand in a linear relationship of the form King - Man + Woman = Queen.
Analysis of the word vector similarity and linear relationship tests (see Table 1, Table 2, Table 3, and Fig. 3) shows that word vector quality is influenced simultaneously by the model, the context, and the dimensionality. Dividing indefinite-length contexts with punctuation preserves more complete semantics and improves word vector quality. F-Model has good learning ability, and the trained word vectors contain rich semantics and exhibit good linear relationships. A larger word vector dimensionality increases semantic capacity, but it also brings the curse of dimensionality and overfitting, causing the word vectors to perform worse when transferred.
It should be noted and understood that various modifications and improvements may be made to the invention described above without departing from the spirit and scope of the invention as claimed in the appended claims. The scope of the claimed technical scheme is therefore not limited by any particular exemplary teachings given.

Claims (8)

1. A word vector generation method based on indefinite-length contexts, characterized in that the corpus is first preprocessed and then divided, using punctuation marks, into context units of varying length that are semantically complete; a convolutional neural network then learns the weight of each word in a context, and this weight is combined with the global distribution of the corpus to generate the final weight of each word in the context; the final weights and the word vectors are used to compute the vector representation of the context; this vector representation is used to build the one-to-many mapping between the context and each word in it; the model is then trained by stochastic gradient descent, finally obtaining the word vectors.
2. The word vector generation method based on indefinite-length contexts according to claim 1, characterized in that it specifically comprises the following steps:
(1) Document preprocessing, obtaining the training corpus: given a set of documents on a certain professional domain, preprocessing techniques such as removing stop words and low-frequency words extract the useful information in the corpus, which then forms the training corpus.
(2) Word frequency statistics, counting the corpus distribution: based on the word occurrence frequencies in the documents, a dictionary of the corpus is generated containing each word, its index, and its frequency in the corpus.
(3) Building the training set: using the punctuation marks in the training corpus, the corpus is divided into contexts of unequal length, forming the training set.
(4) Computing the word weights in the context: the word vectors of the words in a context form the context matrix; a convolutional neural network obtains the weight of each word through a convolution over the context matrix, and this weight is combined with the word's frequency in the corpus to form the final weight.
(5) Computing the distributed representation of the context: combining the word weights obtained in step (4) yields the distributed representation of the current context; the historical context information in the recurrent neural network is then used to generate the newest distributed representation of the context, while the historical state of the recurrent neural network is updated.
(6) Model inference: using the distributed context information obtained in step (5), the one-to-many mapping between the context and the words in it is built, and the loss function of the model is constructed.
(7) Training the model, obtaining the word vectors: optimization is performed on the training set according to the mapping built in step (6), using negative sampling and stochastic gradient descent.
In the above method, step (3) uses punctuation marks; the punctuation marks referred to here are those that delimit relatively complete semantic segments, such as "，", "。", "？", and "！".
3. The word vector generation method based on indefinite-length contexts according to claim 2, characterized in that step (4) uses a convolutional neural network whose kernel has size [1, 3, m, 1], where m is the dimensionality of the word vectors; with this network, a context matrix of shape [k, m] is convolved into a weight vector of shape [k, 1], where k is the number of words in the context; this weight is combined with the corpus distribution to compute the final weight.
4. The word vector generation method based on indefinite-length contexts according to claim 2, characterized in that step (4) specifically comprises the following steps:
A) Building the context matrix. The context matrix is built from the input of F-Model. The model's input In takes a context as its unit and comprises two parts: the index vector I of the words in the context, and the global distribution vector Wt1 of those words. Each element of vector I is the index of a word in the dictionary. The length of vector Wt1 equals the length of the context, and each of its elements is the frequency of the corresponding word of I over the whole corpus. A gather() operation looks up the word vectors in the dictionary table according to the input index vector I and arranges them in order to form the context matrix C.
Wt1 = (d_0, d_1, …, d_k)  (1)
I = (i_0, i_1, …, i_k)  (2)
In = (I, Wt1)  (3)
C = gather(D, I)  (4)
where k is the length of the context, i_k denotes the index of word W_k in the dictionary, and d_k denotes the global distribution of word W_k. B) The context matrix C is convolved by the convolutional layer into a weight vector Wt2 of shape [k, 1].
Wt2 = f(C) + B  (5)
where D is the dictionary, f(·) is the convolution operation, the convolution kernel has shape [1, 3, m, 1], and B is the bias term of the convolution.
C) Computing the final weight. The final weight Wt is computed by combining the convolutional weight with the corpus distribution:
Wt = Wt2 · Wt1  (6)
5. The word vector generation method based on indefinite-length contexts according to claim 2, characterized in that step (5) computes the context semantic vector Ct from the context matrix C and the weight Wt obtained in step (4); the recurrent neural network combines the current context semantic vector Ct with the historical context state it has learned to compute the newest context semantic vector, so that the context semantic vector incorporates the historical context semantics learned by the recurrent neural network.
6. The word vector generation method based on indefinite-length contexts according to claim 5, characterized in that step (5) specifically comprises the following steps:
A) Computing the distributed representation of the current context. On top of the bag-of-words model, the weight information of each word is added.
The distributed representation of the current context is computed as:
Ct = C × Wt  (7)
where C is the context matrix and × denotes matrix multiplication.
B) Adding historical information with the recurrent neural network:
Ct = rnn(Ct)  (8)
where rnn(·) denotes a pass through the recurrent neural network.
7. The word vector generation method based on indefinite-length contexts according to claim 2, characterized in that step (6) predicts each word in the context from the context semantic vector Ct, computing the conditional probability of each target word given the context as:
P(W_i | C) = sigmoid(C θ_i)  (7)
where θ_i is the parameter vector of word W_i.
8. The word vector generation method based on indefinite-length contexts according to claim 1, characterized in that during the training of step (7) the loss function adopts the model perplexity, i.e., the inverse geometric mean of the predicted probabilities of the target words over the context:
loss = (∏_{i=1}^{k} P(W_i | C))^{-1/k}  (8)
where k is the length of the context and P(W_i | C) is the conditional probability of word W_i given context C.
CN201710609471.6A 2017-07-25 2017-07-25 Word vector generation method based on indefinite-length context Active CN107608953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710609471.6A CN107608953B (en) 2017-07-25 2017-07-25 Word vector generation method based on indefinite-length context

Publications (2)

Publication Number Publication Date
CN107608953A true CN107608953A (en) 2018-01-19
CN107608953B CN107608953B (en) 2020-08-14

Family

ID=61059516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710609471.6A Active CN107608953B (en) 2017-07-25 2017-07-25 Word vector generation method based on indefinite-length context

Country Status (1)

Country Link
CN (1) CN107608953B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101482860A (en) * 2008-01-09 2009-07-15 中国科学院自动化研究所 Automatic extraction and filtration method for Chinese-English phrase translation pairs
US20090210218A1 (en) * 2008-02-07 2009-08-20 Nec Laboratories America, Inc. Deep Neural Networks and Methods for Using Same
CN101685441A (en) * 2008-09-24 2010-03-31 中国科学院自动化研究所 Generalized reordering statistic translation method and device based on non-continuous phrase
CN102231278A (en) * 2011-06-10 2011-11-02 安徽科大讯飞信息科技股份有限公司 Method and system for realizing automatic addition of punctuation marks in speech recognition
CN103324612A (en) * 2012-03-22 2013-09-25 北京百度网讯科技有限公司 Method and device for segmenting word
CN102930055A (en) * 2012-11-18 2013-02-13 浙江大学 New network word discovery method in combination with internal polymerization degree and external discrete information entropy
CN103530284A (en) * 2013-09-22 2014-01-22 中国专利信息中心 Short sentence segmenting device, machine translation system and corresponding segmenting method and translation method
CN104317965A (en) * 2014-11-14 2015-01-28 南京理工大学 Establishment method of emotion dictionary based on linguistic data
JP2017059205A (en) * 2015-09-17 2017-03-23 パナソニックIpマネジメント株式会社 Subject estimation system, subject estimation method, and program
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李琼 (Li Qiong), "利用标点符号自动识别分句" (Automatic identification of clauses using punctuation marks), 《皖西学院学报》 (Journal of West Anhui University) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119507A (en) * 2018-02-05 2019-08-13 阿里巴巴集团控股有限公司 Term vector generation method, device and equipment
WO2019154411A1 (en) * 2018-02-12 2019-08-15 腾讯科技(深圳)有限公司 Word vector retrofitting method and device
US11586817B2 (en) 2018-02-12 2023-02-21 Tencent Technology (Shenzhen) Company Limited Word vector retrofitting method and apparatus
CN108733647A (en) * 2018-04-13 2018-11-02 中山大学 A kind of term vector generation method based on Gaussian Profile
CN108733647B (en) * 2018-04-13 2022-03-25 中山大学 Word vector generation method based on Gaussian distribution
CN109582771A (en) * 2018-11-26 2019-04-05 国网湖南省电力有限公司 Smart client exchange method towards power domain based on mobile application
CN110096697A (en) * 2019-03-15 2019-08-06 华为技术有限公司 Term vector matrix compression method and apparatus and the method and apparatus for obtaining term vector
CN110096697B (en) * 2019-03-15 2022-04-12 华为技术有限公司 Word vector matrix compression method and device, and method and device for obtaining word vectors
CN110287337A (en) * 2019-06-19 2019-09-27 上海交通大学 The system and method for medicine synonym is obtained based on deep learning and knowledge mapping
CN110781678A (en) * 2019-10-14 2020-02-11 大连理工大学 Text representation method based on matrix form
CN110781678B (en) * 2019-10-14 2022-09-20 大连理工大学 Text representation method based on matrix form
CN111241819A (en) * 2020-01-07 2020-06-05 北京百度网讯科技有限公司 Word vector generation method and device and electronic equipment

Also Published As

Publication number Publication date
CN107608953B (en) 2020-08-14

Similar Documents

Publication Publication Date Title
CN107608953A (en) A kind of term vector generation method based on random length context
US20200012953A1 (en) Method and apparatus for generating model
CN107967255A (en) A kind of method and system for judging text similarity
CN106897559B (en) A kind of symptom and sign class entity recognition method and device towards multi-data source
CN106980609A (en) A kind of name entity recognition method of the condition random field of word-based vector representation
CN105955962B (en) The calculation method and device of topic similarity
CN110516245A (en) Fine granularity sentiment analysis method, apparatus, computer equipment and storage medium
CN107038480A (en) A kind of text sentiment classification method based on convolutional neural networks
CN108090510A (en) A kind of integrated learning approach and device based on interval optimization
CN108563703A (en) A kind of determination method of charge, device and computer equipment, storage medium
CN105893609A (en) Mobile APP recommendation method based on weighted mixing
CN109739995B (en) Information processing method and device
CN109214004B (en) Big data processing method based on machine learning
CN107203600A (en) It is a kind of to utilize the evaluation method for portraying cause and effect dependence and sequential influencing mechanism enhancing answer quality-ordered
CN104794222B (en) Network form semanteme restoration methods
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
JP7347179B2 (en) Methods, devices and computer programs for extracting web page content
CN112035629B (en) Method for implementing question-answer model based on symbolized knowledge and neural network
CN105893363A (en) A method and a system for acquiring relevant knowledge points of a knowledge point
CN109829054A (en) A kind of file classification method and system
CN114692615A (en) Small sample semantic graph recognition method for small languages
CN104361077B (en) The creation method and device of webpage scoring model
CN113254612A (en) Knowledge question-answering processing method, device, equipment and storage medium
Renner et al. Mapping common data elements to a domain model using an artificial neural network
CN105468657B (en) A kind of method and system of the important knowledge point in acquisition field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant