CN110309317A - Word-vector generation method, system, electronic device and medium for Chinese corpus - Google Patents
- Publication number
- CN110309317A CN110309317A CN201910429450.5A CN201910429450A CN110309317A CN 110309317 A CN110309317 A CN 110309317A CN 201910429450 A CN201910429450 A CN 201910429450A CN 110309317 A CN110309317 A CN 110309317A
- Authority
- CN
- China
- Prior art keywords
- word
- meaning
- vector
- similarity
- synonym
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The present invention provides a word-vector generation method, system, electronic device and medium for Chinese corpora, comprising: constructing a database that stores an independent-word set, synonym sets and related-word sets; collecting a Chinese corpus and segmenting it to obtain a word set; encoding the independent-word set, synonym sets and related-word sets of the words in the word set; inputting the coding vectors into a word representation model to obtain the first vector of each word; judging whether a word belongs to the independent-word set; if it does, taking its first vector as the output vector; if it does not, inputting each first vector of the word into a first and a second probabilistic model to obtain, respectively, the first probability that the word belongs to each meaning and the second probability of its context words; inputting the first and second probabilities into a third probabilistic model to obtain the third probability that the word belongs to each meaning; and taking the first vector of the meaning corresponding to the largest third probability as the output vector of the word.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a word-vector generation method, system, electronic device and medium for Chinese corpora.
Background technique
With the continuous development of Internet technology, more and more people comment and express opinions on the network. To mine the data value hidden behind user behavior, natural language processing has gradually been applied to a variety of downstream tasks, such as sentiment analysis, entity recognition, machine translation and text summarization. In any natural language processing task, the first problem to consider is how words are to be represented in a computer.
Distributed-vector methods are currently the most widely used word representations. Models such as Word2Vec and GloVe obtain a single vector per word by training on a corpus; a single vector cannot express the ambiguity of a word, a problem that is especially pronounced for semantically rich Chinese corpora. To address ambiguous word representation, some researchers have proposed multi-prototype word representation models built by clustering corpus contexts: a maximum number of clusters is fixed first, then a clustering algorithm groups the contexts of a word by vector similarity into multiple concept items, each cluster representing one concept of the word. This approach, however, has non-negligible problems, such as the uncertainty of the cluster number K and the fact that the cluster results cannot be mapped to word meanings. Moreover, relying on the corpus alone for model training requires large amounts of corpus data to reach acceptable accuracy, which aggravates the computational burden and cannot avoid inaccurate word representations caused by poor corpus quality. Studying multi-prototype representation of Chinese corpora is therefore very necessary.
Summary of the invention
In view of the above problems, the present invention provides a word-vector generation method, system, electronic device and medium for Chinese corpora based on multi-prototype word representation.
According to one aspect of the present invention, a word-vector generation method for Chinese corpora is provided, comprising:
constructing a database, the database storing independent words as an independent-word set, storing the synonyms of each meaning of a word as synonym sets, and storing the related terms of each meaning of a word as related-word sets;
collecting a Chinese corpus and segmenting it to obtain the word set W = [w1, w2, …, wb] formed by the words of the corpus, b being the total number of words in the word set;
encoding, in a first set order, the independent-word set, synonym sets and related-word sets of the words in the word set, obtaining the coding vector of each independent word, the coding vector of each synonym of each meaning of each word, the coding vector of each related term of each meaning of each word, and the coding vector of each meaning of each word;
inputting the coding vectors into a word representation model, converting the independent words in the word set, each synonym of each meaning of each word, each related term of each meaning of each word, and each meaning of each word into first vectors; the first vectors of the synonyms of each meaning of a word form the first-vector set of that meaning's synonym set, the first vectors of the related terms of each meaning of a word form the first-vector set of that meaning's related-word set, the first-vector sets of a meaning's synonym set and related-word set together form the first-vector set of that meaning, and the first-vector sets of all meanings of a word form the first-vector set of the word;
judging, in a second set order, whether each word in the word set belongs to the independent-word set;
if the word belongs to the independent-word set, taking the first vector of the word as the output vector of the word;
if the word does not belong to the independent-word set, performing the following steps:
inputting each first vector in the first-vector set of each meaning of the word into the first probabilistic model to obtain the first probability that the word belongs to each meaning, the first probabilistic model being constructed by the following formula (1), where wt is the t-th word in the word set, cj is the j-th meaning of word wt, the formula's output is the first probability that wt belongs to meaning cj, and its inputs are the first vectors of the i-th synonym or i-th related term in the j-th meaning cj of word wt;
inputting the first vector of each meaning of the word into the second probabilistic model to obtain the second probability of the context words of the word under each meaning, the second probabilistic model being constructed by the following formula (2), where wt+k denotes a context word of word wt; the first vector of meaning cj of word wt and the coding vector of meaning cj of word wt enter the formula, the first vectors of the words other than wt in the word set being obtained by inputting their coding vectors into the word representation model as independent words; the formula's output is the second probability of the context words of wt under meaning cj;
inputting the first probability that the word belongs to each meaning and the second probability of the context words under each meaning into the third probabilistic model to obtain the third probability that the word belongs to each meaning, the third probabilistic model being constructed by the following formula (3), where p(cj|wt) denotes the third probability that word wt belongs to meaning cj;
taking the first vector of the meaning corresponding to the largest third probability as the output vector of the word.
According to another aspect of the present invention, a word-vector generation system for Chinese corpora is provided, comprising:
a database, which stores independent words as an independent-word set, stores the synonyms of each meaning of a word as synonym sets, and stores the related terms of each meaning of a word as related-word sets;
an acquisition module, which collects a Chinese corpus;
a word segmentation module, which segments the collected Chinese corpus to obtain the word set W = [w1, w2, …, wb] formed by the words of the corpus, b being the total number of words in the word set;
a coding module, which encodes, in a first set order, the independent-word set, synonym sets and related-word sets of the words in the word set, obtaining the coding vector of each independent word, the coding vector of each synonym of each meaning of each word, the coding vector of each related term of each meaning of each word, and the coding vector of each meaning of each word;
a first-vector set construction module, which inputs the coding vectors into a word representation model and converts the independent words in the word set, each synonym of each meaning of each word, each related term of each meaning of each word, and each meaning of each word into first vectors; the first vectors of the synonyms of each meaning of a word form the first-vector set of that meaning's synonym set, the first vectors of the related terms of each meaning of a word form the first-vector set of that meaning's related-word set, the first-vector sets of a meaning's synonym set and related-word set together form the first-vector set of that meaning, and the first-vector sets of all meanings of a word form the first-vector set of the word;
a judgment module, which judges, in a second set order, whether each word in the word set belongs to the independent-word set; if the word belongs to the independent-word set it sends a signal to a vector output module, and if not it sends a signal to a first probabilistic model module and a second probabilistic model module;
a first probabilistic model module, comprising a first probabilistic model construction unit and a first data processing unit; the first probabilistic model construction unit constructs the first probabilistic model, and the first data processing unit inputs each first vector in the first-vector set of each meaning of the word into the first probabilistic model to obtain the first probability that the word belongs to each meaning, the first probabilistic model being constructed by the following formula (1), where wt is the t-th word in the word set, cj is the j-th meaning of word wt, the formula's output is the first probability that wt belongs to meaning cj, and its inputs are the first vectors of the i-th synonym or i-th related term in the j-th meaning cj of word wt;
a second probabilistic model module, comprising a second probabilistic model construction unit and a second data processing unit; the second probabilistic model construction unit constructs the second probabilistic model, and the second data processing unit inputs the first vector of each meaning of the word into the second probabilistic model to obtain the second probability of the context words of the word under each meaning, the second probabilistic model being constructed by the following formula (2), where wt+k denotes a context word of word wt; the first vector of meaning cj of word wt and the coding vector of meaning cj enter the formula, the first vectors of the words other than wt in the word set being obtained by inputting their coding vectors into the word representation model as independent words; the formula's output is the second probability of the context words of wt under meaning cj;
a third probabilistic module, comprising a third probabilistic model construction unit and a third data processing unit; the third probabilistic model construction unit constructs the third probabilistic model, and the third data processing unit inputs the first probability that the word belongs to each meaning and the second probability of the context words under each meaning into the third probabilistic model to obtain the third probability that the word belongs to each meaning, the third probabilistic model being constructed by the following formula (3), where p(cj|wt) denotes the third probability that word wt belongs to meaning cj;
a vector output module, which takes the first vector of a word belonging to the independent-word set as the output vector of that word, and takes the first vector of the meaning corresponding to the largest third probability of a word not belonging to the independent-word set as the output vector of that word.
In addition, the present invention also provides an electronic device comprising a memory and a processor; a word-vector generation program for Chinese corpora is stored in the memory, and when executed by the processor the program implements the steps of the above word-vector generation method for Chinese corpora.
In addition, the present invention also provides a computer-readable storage medium containing a word-vector generation program for Chinese corpora; when executed by a processor, the program implements the steps of the above word-vector generation method for Chinese corpora.
The above word-vector generation method, system, electronic device and medium look up the synonym set and related-word set of each meaning of a word, so multiple meaning-expressing vectors can be obtained for a word; this better resolves the representation of polysemous words and benefits semantic disambiguation in word representation. At the same time, the introduction of an expert knowledge base supplements the relationships between words and remedies inaccurate pre-training results caused by insufficient or poor-quality training corpora.
Detailed description of the invention
Fig. 1 is a flowchart of the word-vector generation method for Chinese corpora of the present invention;
Fig. 2 is a schematic block diagram of the word-vector generation system for Chinese corpora of the present invention.
Specific embodiment
In the following description, for purposes of illustration, many specific details are set forth in order to provide a thorough understanding of one or more embodiments. It will be apparent, however, that these embodiments may also be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form to facilitate describing one or more embodiments. Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the word-vector generation method for Chinese corpora of the present invention. As shown in Fig. 1, the word-vector generation method comprises:
Step S1: constructing a database, the database storing independent words as an independent-word set, storing the synonyms of each meaning of a word as synonym sets, and storing the related terms of each meaning of a word as related-word sets;
Step S2: collecting a Chinese corpus and segmenting it to obtain the word set W = [w1, w2, …, wb] formed by the words of the corpus, b being the total number of words in the word set;
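As a concrete illustration of step S2, the construction of the word set from a segmented corpus can be sketched as follows. The patent does not name a segmentation tool (jieba is a common choice), so the token list below stands in for the segmenter's output and is purely illustrative; whether duplicate tokens are kept is not specified, and they are removed here.

```python
# Sketch of step S2: build the word set W = [w1, w2, ..., wb] from a
# segmented Chinese corpus. The token list stands in for the output of a
# word segmenter, which the patent does not specify.
segmented_corpus = ["苹果", "公司", "发布", "新", "手机", "苹果", "是", "水果"]

# Deduplicate while preserving first-occurrence order (an assumption).
W = list(dict.fromkeys(segmented_corpus))
b = len(W)  # total number of words in the word set
```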
Step S3: encoding, in a first set order, the independent-word set, synonym sets and related-word sets of the words in the word set, obtaining the coding vector of each independent word, the coding vector of each synonym of each meaning of each word, the coding vector of each related term of each meaning of each word, and the coding vector of each meaning of each word;
Step S4: inputting the coding vectors into the word representation model, converting the independent words in the word set, each synonym of each meaning of each word, each related term of each meaning of each word, and each meaning of each word into first vectors; the first vectors of the synonyms of each meaning of a word form the first-vector set of that meaning's synonym set, the first vectors of the related terms of each meaning of a word form the first-vector set of that meaning's related-word set, the first-vector sets of a meaning's synonym set and related-word set together form the first-vector set of that meaning, and the first-vector sets of all meanings of a word form the first-vector set of the word. Preferably, the word representation model is the Skip-gram model, the CWE model (character-enhanced word embedding) or the SAT model (sememe attention over target);
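The mapping from coding vectors to first vectors in steps S3 and S4 can be sketched with a toy stand-in for the trained model. A real implementation would use a trained Skip-gram, CWE or SAT model; the index-based one-hot coding and random projection below are assumptions for illustration only.

```python
# Sketch of steps S3-S4: index-based one-hot coding vectors are mapped to
# "first vectors" by a projection matrix standing in for a trained word
# representation model (Skip-gram / CWE / SAT in the patent).
import random

vocab = ["苹果", "水果", "梨", "公司", "企业"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = [0.0] * len(vocab)
    v[index[word]] = 1.0
    return v

random.seed(0)
dim = 4
# Stand-in for the trained projection of the word representation model.
projection = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in vocab]

def first_vector(word):
    code = one_hot(word)
    return [sum(code[r] * projection[r][d] for r in range(len(vocab)))
            for d in range(dim)]
```

With a one-hot code, `first_vector` simply selects the word's row of the projection, which is exactly how an embedding lookup behaves.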
Step S5: judging, in a second set order, whether each word in the word set belongs to the independent-word set;
Step S6: if the word belongs to the independent-word set, taking the first vector of the word as the output vector of the word;
Step S7: if the word does not belong to the independent-word set, performing the following steps:
Step S71: inputting each first vector in the first-vector set of each meaning of the word into the first probabilistic model to obtain the first probability that the word belongs to each meaning, the first probabilistic model being constructed by the following formula (1), where wt is the t-th word in the word set, cj is the j-th meaning of word wt, the formula's output is the first probability that wt belongs to meaning cj, and its inputs are the first vectors of the i-th synonym or i-th related term in the j-th meaning cj of word wt;
Step S72: inputting the first vector of each meaning of the word into the second probabilistic model to obtain the second probability of the context words of the word under each meaning, the second probabilistic model being constructed by the following formula (2), where wt+k denotes a context word of word wt; the first vector of meaning cj of word wt and the coding vector of meaning cj enter the formula, the first vectors of the words other than wt in the word set being obtained by inputting their coding vectors into the word representation model as independent words; the formula's output is the second probability of the context words of wt under meaning cj;
Step S73: inputting the first probability that the word belongs to each meaning and the second probability of the context words under each meaning into the third probabilistic model to obtain the third probability that the word belongs to each meaning, the third probabilistic model being constructed by the following formula (3), where p(cj|wt) denotes the third probability that word wt belongs to meaning cj;
Step S74: taking the first vector of the meaning corresponding to the largest third probability as the output vector of the word.
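Formulas (1)-(3) appear only as figures in the source text, so the combination of the three probabilities in steps S71-S74 can only be sketched under an assumption: here the third probability is taken proportional to the product of the first and second probabilities, normalized over the meanings. That is one plausible reading, not the patent's exact formula.

```python
# Sketch of steps S71-S74 under the stated assumption: third probability
# proportional to (first probability x second probability), normalized over
# meanings; the meaning with the largest third probability supplies the
# output vector. All numbers are toy values.
first_prob = {"fruit": 0.7, "company": 0.3}    # word belongs to each meaning
second_prob = {"fruit": 0.2, "company": 0.8}   # context words under meaning
sense_vectors = {"fruit": [0.1, 0.9], "company": [0.8, 0.2]}

joint = {c: first_prob[c] * second_prob[c] for c in first_prob}
z = sum(joint.values())
third_prob = {c: p / z for c, p in joint.items()}

best_meaning = max(third_prob, key=third_prob.get)
output_vector = sense_vectors[best_meaning]
```

Note how the context evidence overrides the sense prior here: the "company" meaning wins even though "fruit" had the larger first probability.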
The above word-vector generation method proposes a Chinese multi-prototype word representation learning model (the combination of an existing word representation model with the first, second and third probabilistic models), realizes the representation of Chinese polysemous words, solves the semantic ambiguity problem in Chinese word representation, and is finally applied to Chinese word similarity, improving the accuracy of word-similarity detection.
Step S71 further comprises a step of correcting the first probabilistic model through an attention mechanism, the correction comprising:
obtaining, from the coding vectors of any two synonyms or related terms in each meaning of the word, the prior similarity of those two synonyms or related terms according to the following formula (4), which draws on the synonym set and the related-word set of meaning cj of word wt, where wit and wnt are two synonyms in the synonym set or two related terms in the related-word set, fp(wit, wnt) is the prior similarity of wit and wnt, and SIM(wit, wnt) is the prior similarity obtained by a similarity method from the coding vectors of wit and wnt when both belong to the related-word set;
obtaining the vector similarity of any two synonyms or related terms in each meaning of the word by the following formula (5)

fv(wit, wnt) = <wit, wnt> = Σd vd(wit) · vd(wnt)   (5)

where fv(wit, wnt) is the vector similarity of wit and wnt, vd(wit) and vd(wnt) denote the components of the first vectors of wit and wnt, and d indexes the vector dimension;
correcting, through the prior similarity and the vector similarity according to the following formulas (6) and (7), the first vector of each synonym or each related term in each meaning of word wt, where the correction factor of a first vector represents the attention score of the i-th synonym or i-th related term in the j-th meaning cj of word wt, and the result is the modification vector of the i-th synonym or i-th related term in the j-th meaning cj of word wt;
the modification vectors are then substituted for the first vectors and input into the first or/and second probabilistic model.
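Of the attention-correction formulas, only the vector similarity (5) is legible in this text as a dot product; formulas (4), (6) and (7) appear as figures. The sketch below therefore assumes one plausible shape: attention scores obtained by a softmax over the product of prior and vector similarity, each first vector then scaled by its score to give the modification vector.

```python
# Sketch of the attention correction in step S71. The dot product implements
# formula (5); combining it with the prior similarity via a softmax, and
# scaling each first vector by its attention score, are assumptions.
import math

def dot(u, v):  # formula (5): vector similarity as an inner product
    return sum(a * b for a, b in zip(u, v))

first_vectors = {"苹果": [0.9, 0.1], "梨": [0.7, 0.3], "香蕉": [0.6, 0.4]}
prior_sim = {"苹果": 1.0, "梨": 0.8, "香蕉": 0.6}  # knowledge-base priors (toy)
reference = [0.8, 0.2]                             # reference vector (assumed)

scores = {w: prior_sim[w] * dot(v, reference) for w, v in first_vectors.items()}
z = sum(math.exp(s) for s in scores.values())
attention = {w: math.exp(s) / z for w, s in scores.items()}

# Modification vectors: first vectors scaled by their attention scores.
corrected = {w: [attention[w] * x for x in v] for w, v in first_vectors.items()}
```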
Preferably, in step S71, the prior similarity of any two synonyms or related terms in each meaning of the word is obtained from their coding vectors according to the following formula (8), where β is the harmonic coefficient weighing the influence of synonyms and related terms, β ≤ 1.
In an alternative embodiment, step S71 comprises a step of optimizing the harmonic coefficient β, the step comprising:
adding the similarity values of word pairs, obtained from the expert knowledge base or by manual annotation, to a first sequence in a set order;
obtaining the vector similarities of the first-vector sets of the word pairs by a similarity method and adding them to a second sequence in the same set order;
obtaining the overall similarity of the first sequence and the second sequence by a similarity method; and
taking the harmonic coefficient β corresponding to the maximum overall similarity as the best harmonic coefficient β.
Preferably, the method of obtaining the vector similarity of the first-vector sets of a word pair by the similarity method comprises the formula below, where sim(w1, w2) denotes the vector similarity of the first-vector sets of the word pair w1 and w2; the first-vector sets of the synonym set and of the related-word set of the k1-th meaning of w1, and those of the synonym set and of the related-word set of the k2-th meaning of w2, enter the formula; and β is the harmonic coefficient weighing the influence of synonyms and related terms.
The above word-vector generation method gives a Chinese multi-prototype word representation learning model based on an expert knowledge base.
Further preferably, the method of obtaining the overall similarity of the first sequence and the second sequence by the similarity method comprises:
obtaining the overall similarity of the first sequence and the second sequence through Pearson's coefficient according to the following formula (10), where s1 denotes the first sequence, s2 denotes the second sequence, N denotes the total number of word pairs, and the Pearson coefficient of s1 and s2 represents their overall similarity.
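Putting the two preceding steps together, the β search can be sketched as a grid search that maximizes Pearson's coefficient (formula (10)) between the annotated sequence and the model's similarity sequence. The word-pair scores and the way β blends synonym and related-term similarity below are illustrative assumptions.

```python
# Sketch of harmonic-coefficient optimization: grid-search beta in [0, 1]
# and keep the value whose model-similarity sequence best correlates
# (Pearson's coefficient, formula (10)) with human annotations. The pair
# scores and the beta-weighted blend are toy assumptions.
import statistics

def pearson(s1, s2):
    m1, m2 = statistics.mean(s1), statistics.mean(s2)
    num = sum((a - m1) * (b - m2) for a, b in zip(s1, s2))
    den = (sum((a - m1) ** 2 for a in s1)
           * sum((b - m2) ** 2 for b in s2)) ** 0.5
    return num / den

human = [0.9, 0.4, 0.1]        # first sequence: annotated pair similarities
syn_sim = [0.8, 0.5, 0.3]      # synonym-set similarities per pair (toy)
rel_sim = [0.6, 0.1, 0.4]      # related-word-set similarities per pair (toy)

best_beta, best_r = None, -2.0
for i in range(11):            # beta in {0.0, 0.1, ..., 1.0}, beta <= 1
    beta = i / 10
    model = [s + beta * r for s, r in zip(syn_sim, rel_sim)]  # assumed blend
    r = pearson(human, model)
    if r > best_r:
        best_beta, best_r = beta, r
```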
In another alternative embodiment, the step of optimizing the harmonic coefficient β comprises:
obtaining the output vectors of a word and of several other words through the third probabilistic model;
pairing the word with each of the other words;
obtaining the vector similarity of the output vectors of each pair by a similarity method;
sorting the vector similarities in descending order;
extracting the top set number of other words to form the nearest-neighbour set of the word; and
adjusting the harmonic coefficient β until the nearest-neighbour set meets a requirement, the requirement being that the output nearest-neighbour set is judged correct by a human, or that the number of words in the nearest-neighbour set reaches a set proportion of the total number of words in the word's synonym set and related-word set.
In step S5, the method of judging, in the second set order, whether each word in the word set belongs to the independent-word set comprises:
constructing a window and inputting the word set into the window to obtain the centre word of the window;
judging whether the centre word belongs to the independent-word set; and
sliding the window to successively judge whether each word in the word set belongs to the independent-word set.
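The window-based judgment of step S5 can be sketched directly; the window size, word set and independent-word set below are illustrative assumptions.

```python
# Sketch of step S5: slide a window over the word set, take its centre word,
# and test membership in the independent-word set.
independent_words = {"的", "是"}           # toy independent-word set
W = ["苹果", "是", "水果", "的", "一种"]   # toy word set

def sliding_judgments(word_set, window=3):
    half = window // 2
    for t, centre in enumerate(word_set):
        # Context words on either side of the centre word, clipped at edges.
        context = word_set[max(0, t - half):t] + word_set[t + 1:t + 1 + half]
        yield centre, context, centre in independent_words

judgments = [(w, flag) for w, _, flag in sliding_judgments(W)]
```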
The above word-vector generation system addresses the current ambiguity problem in Chinese word representation by proposing a Chinese multi-prototype word representation method, where "multi-prototype" assumes that the meaning of a word is not unique but is composed of multiple meanings, each represented by its own vector. An expert knowledge base refers to dictionary resources, compiled over a long period by domain experts, that contain relatively comprehensive word relationships and word annotations, such as HowNet and a Chinese synonym thesaurus. The present invention introduces the Chinese synonym thesaurus into Chinese word representation; compared with HowNet, it can more directly correct the poor model performance caused by inaccurate pre-training results, providing a new line of research for representations based on pre-trained word vectors. Through the synonym set or related-word set of each meaning of each word in the thesaurus, words can be given multi-prototype representations, realizing semantic disambiguation in Chinese word representation. The introduction of external resources can also, to some extent, compensate for insufficient data, which otherwise leaves words absent from the training data unrepresentable, indirectly realizing word representation in low-resource settings. Experiments show that the proposed method clearly outperforms existing models.
Fig. 2 is a schematic block diagram of the word-vector generation system for Chinese corpora of the present invention. As shown in Fig. 2, the word-vector generation system 1 for Chinese corpora comprises:
a database 10, which stores independent words as the independent-word set, stores the synonyms of each meaning of a word as synonym sets, and stores the related terms of each meaning of a word as related-word sets;
an acquisition module 20, which collects the Chinese corpus;
a word segmentation module 30, which segments the collected Chinese corpus to obtain the word set W = [w1, w2, …, wb] formed by the words of the corpus, b being the total number of words in the word set;
a coding module 40, which encodes, in the first set order, the independent-word set, synonym sets and related-word sets of the words in the word set, obtaining the coding vector of each independent word, the coding vector of each synonym of each meaning of each word, the coding vector of each related term of each meaning of each word, and the coding vector of each meaning of each word;
a first-vector set construction module 50, which inputs the coding vectors into the word representation model and converts the independent words in the word set, each synonym of each meaning of each word, each related term of each meaning of each word, and each meaning of each word into first vectors; the first vectors of the synonyms of each meaning of a word form the first-vector set of that meaning's synonym set, the first vectors of the related terms of each meaning of a word form the first-vector set of that meaning's related-word set, the first-vector sets of a meaning's synonym set and related-word set together form the first-vector set of that meaning, and the first-vector sets of all meanings of a word form the first-vector set of the word;
a judgment module 60, which judges, in the second set order, whether each word in the word set belongs to the independent-word set; if the word belongs to the independent-word set it sends a signal to the vector output module, and if not it sends a signal to the first probabilistic model module and the second probabilistic model module;
a first probabilistic model module 70, comprising a first probabilistic model construction unit 71 and a first data processing unit 72; the first probabilistic model construction unit 71 constructs the first probabilistic model, and the first data processing unit 72 inputs each first vector in the first-vector set of each meaning of the word into the first probabilistic model to obtain the first probability that the word belongs to each meaning, the first probabilistic model being constructed by formula (1) above, where wt is the t-th word in the word set and cj is the j-th meaning of word wt;
Second probabilistic model module 80 includes a second probabilistic model construction unit 81 and a second data processing unit 82. The second probabilistic model construction unit constructs the second probabilistic model, and the second data processing unit inputs the primary vector of each meaning of the word into the second probabilistic model, obtaining the second probability that the word belongs to the context words of each meaning, wherein the second probabilistic model is constructed by the following formula (2)
where wt+k denotes a context word of word wt; the remaining terms denote, respectively, the primary vector of meaning cj of word wt, the primary vector obtained by inputting the words of the word set other than wt into the word representation model as independent-word coding vectors, the coding vector of meaning cj of word wt, and the second probability that word wt belongs to the context words of meaning cj;
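Formula (2) is likewise an image in the patent. Since the description says the word representation model is Skip-gram, a plausible reading (an assumption, not the patented formula) is the standard Skip-gram output layer: the probability of a context word given a meaning vector is a softmax over the vocabulary.

```python
import numpy as np

def second_probability(meaning_vec, context_index, vocab_vectors):
    """Hedged Skip-gram-style sketch of a second probabilistic model:
    score every vocabulary word by dot product with the meaning's primary
    vector, softmax-normalize, and return the probability assigned to the
    given context word w_{t+k}."""
    logits = np.asarray(vocab_vectors) @ np.asarray(meaning_vec)
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[context_index]
```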
Third probabilistic model module 90 includes a third probabilistic model construction unit 91 and a third data processing unit 92. The third probabilistic model construction unit constructs the third probabilistic model, and the third data processing unit inputs the first probability that the word belongs to each meaning and the second probability that the word belongs to the context words of each meaning into the third probabilistic model, obtaining the third probability that the word belongs to each meaning, wherein the third probabilistic model is constructed by the following formula (3)
where p(cj|wt) denotes the third probability that word wt belongs to meaning cj;
Vector output module 100 takes the primary vector of a word that belongs to the independent word set as the output vector of that word, and, for a word that does not belong to the independent word set, takes the primary vector of the meaning corresponding to the word's maximum third probability as the output vector of the word.
Compared with the traditional Skip-gram model, the word vector generation system for Chinese corpus described above looks up the synonym set and related word set of each meaning of a word and can therefore obtain multiple meaning vectors for the word, which better resolves the representation of polysemous words and aids semantic disambiguation in word representation. At the same time, by introducing an expert knowledge base, the relationships between words can be supplemented, improving inaccurate pre-training results caused by insufficient or low-quality training corpora.
Preferably, the system further includes a correction module 200 that corrects the first probabilistic model through an attention mechanism, comprising:
a priori similarity obtaining unit 210, which obtains, from the coding vectors of any two synonyms or related words in each meaning of a word, the priori similarity of those two synonyms or related words according to the following formula (4)
where the first set is the synonym set of meaning cj of word wt and the second set is the related word set of meaning cj of word wt; wit and wnt are two synonyms in the synonym set or two related words in the related word set; fp(wit, wnt) is the priori similarity of wit and wnt, and SIM(wit, wnt) is the priori similarity obtained by a similarity method from the coding vectors of wit and wnt when both are in the related word set;
a vector similarity obtaining unit 220, which obtains the vector similarity of any two synonyms or related words in each meaning of the word by the following formula (5)
fv(wit, wnt) = <wit, wnt> = ∑d vd(wit)*vd(wnt) (5)
where fv(wit, wnt) is the vector similarity of wit and wnt, vd(wit) and vd(wnt) denote the primary vectors of words wit and wnt, and d indexes the vector dimension;
an amending unit 230, which corrects the primary vector of each synonym or related word in each meaning of word wt by the priori similarity and the vector similarity according to the following formulas (6) and (7)
where the correction factor of the primary vector denotes the attention score of the i-th synonym or i-th related word in the j-th meaning cj of word wt, and the modification vector is the corrected vector of that i-th synonym or related word;
wherein the amending unit 230 substitutes the corrected modification vector for the primary vector as input to the first probabilistic model and/or the second probabilistic model.
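Formulas (6) and (7) are images in the patent, so the correction step below is a hedged sketch rather than the patented computation: it derives an attention score for each synonym or related word from its priori and vector similarities to the other members of the meaning (the exact combination is an assumption), then scales each primary vector by its score to obtain the modification vector.

```python
import numpy as np

def attention_correct(vectors, prior_sims):
    """Hedged sketch of an attention-based correction: score each member of
    a meaning by summing its priori similarity and vector (dot-product)
    similarity to every other member, softmax the scores into attention
    weights, and scale each primary vector by its weight."""
    V = np.asarray(vectors, dtype=float)
    n = len(V)
    scores = np.zeros(n)
    for i in range(n):
        for k in range(n):
            if i != k:
                scores[i] += prior_sims[i][k] + float(V[i] @ V[k])
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                 # attention scores
    return alphas[:, None] * V             # modification vectors
```

Members that agree with the rest of the meaning receive larger attention scores, so outlier synonyms contribute less to the meaning's representation.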
Preferably, the priori similarity obtaining unit 210 obtains, from the coding vectors of any two synonyms or related words in each meaning of a word, their priori similarity according to the following formula (8)
where β is a harmonic coefficient balancing the influence of synonyms and related words, β ≤ 1.
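Formula (8) is also an image in the patent. Given the surrounding description (synonyms and related words influence the priori similarity differently, with β ≤ 1 weighting the related-word side), one plausible piecewise form is sketched below; the exact rule and the value 1.0 for synonym pairs are assumptions.

```python
def prior_similarity(w1, w2, synset, relset, sim, beta=0.8):
    """Hedged sketch of formula (8): synonym pairs get a full priori
    similarity, related-word pairs get their coding-vector similarity
    SIM scaled by the harmonic coefficient beta <= 1, and all other
    pairs get zero. `sim` is any similarity method over coding vectors."""
    if w1 in synset and w2 in synset:
        return 1.0
    if w1 in relset and w2 in relset:
        return beta * sim(w1, w2)
    return 0.0
```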
Preferably, the system further includes an optimization module 300 for optimizing the harmonic coefficient β.
In one alternative embodiment of the present invention, the optimization module 300 includes:
a first sequence construction unit 310, which obtains similarity values of paired words from the expert knowledge base or from manual annotation and adds them to a first sequence in a set order;
a second sequence construction unit 320, which obtains the vector similarity of the primary vector sets of the paired words by a similarity method and adds the vector similarities to a second sequence in the same set order;
an overall similarity obtaining unit 330, which obtains the overall similarity of the first sequence and the second sequence using a similarity method;
a best harmonic coefficient obtaining unit 340, which takes the harmonic coefficient β corresponding to the maximum of the overall similarity as the best harmonic coefficient β.
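The four units above amount to a grid search over β: for each candidate, compare the model's similarity sequence against the expert or hand-labelled sequence and keep the β that agrees best. The sketch below uses cosine similarity as the "overall similarity" method, which is one possible choice, not necessarily the patented one; all names are hypothetical.

```python
import math

def overall_similarity(seq_a, seq_b):
    """One possible 'similarity method' for whole sequences: cosine
    similarity between the two similarity sequences."""
    dot = sum(a * b for a, b in zip(seq_a, seq_b))
    na = math.sqrt(sum(a * a for a in seq_a))
    nb = math.sqrt(sum(b * b for b in seq_b))
    return dot / (na * nb)

def best_beta(first_sequence, second_sequence_for, candidates):
    """Grid-search the harmonic coefficient: keep the beta whose model
    similarity sequence best matches the labelled first sequence."""
    return max(candidates,
               key=lambda b: overall_similarity(first_sequence,
                                                second_sequence_for(b)))
```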
In another alternative embodiment of the present invention, the optimization module 300 includes:
a grouping unit 310', which groups a word with multiple other words and obtains the output vectors of the word and the other words through the third probabilistic model;
an output vector processing unit 320', which obtains the vector similarity of the output vectors of the multiple word pairs by a similarity method and sorts the vector similarities in descending order;
a nearest neighbor set obtaining unit 330', which extracts the top-ranked set number of other words as the nearest neighbor set of the word;
an optimization unit 340', which adjusts the harmonic coefficient β until the nearest neighbor set meets a requirement, the requirement being either that the output nearest neighbor set is judged correct manually, or that the number of words in the nearest neighbor set of the word reaches a set proportion of the total number of words in its synonym set and related word set.
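The nearest-neighbour step of this second variant can be sketched directly: rank the other words by the similarity of their output vectors to the target word's output vector and keep the top-k. Cosine similarity is assumed as the similarity method; the function name is hypothetical.

```python
import math

def nearest_neighbors(target_vec, other_vectors, k):
    """Rank the other words by cosine similarity of their output vectors
    to the target word's output vector; the top-k form the word's
    nearest-neighbour set."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))
    ranked = sorted(other_vectors,
                    key=lambda w: cos(target_vec, other_vectors[w]),
                    reverse=True)
    return ranked[:k]
```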
In one particular embodiment of the present invention, the window size is set to 5, the number of iterations to 5, the minimum word frequency to 10, and the vector dimension to 300. The database uses an expert knowledge base, the Chinese synonym thesaurus (Tongyici Cilin), which has been encoded; the words on each line of the thesaurus are in one of three states: synonym, related word, or independent word. Thus, by matching any word against the thesaurus, the word may have multiple meanings, each of which in turn contains multiple synonyms or related words.
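The thesaurus lookup the embodiment describes (word to meanings, each meaning to a synonym set and a related word set, with independent words absent) can be sketched as a plain mapping. The sample entries and glosses below are invented for illustration, not taken from Cilin.

```python
# Hedged sketch of the Cilin-backed database: each word maps to a list of
# meanings; each meaning carries a synonym set and a related-word set.
# Words not in the mapping are treated as independent words.
thesaurus = {
    "骄傲": [  # "pride" (sample entry, invented for illustration)
        {"synonyms": {"自豪", "高傲"}, "related": {"尊严"}},
        {"synonyms": {"自满", "自大"}, "related": {"傲慢"}},
    ],
}

def lookup(word):
    """Return the list of meanings of a word, or None for an
    independent word (one not present in the thesaurus)."""
    return thesaurus.get(word)
```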
The Chinese corpus mainly comes from Baidu Baike, Wikipedia, People's Daily, Sogou News, Zhihu Q&A, Weibo, and literary works.
The word representation model uses the Skip-gram model; the training data comes from Sogou Labs (SogouCA for short) and totals 1.51 GB.
WordSim-240 and WordSim-297 are used as evaluation data; these are data sets of hand-labelled word-pair similarities. WordSim-240 contains 240 pairs of Chinese words, most of which are related words, such as "Li Bai" and "poem". WordSim-297 contains 297 pairs of Chinese words, most of which are similar words, such as two different Chinese words both translated as "admission ticket". The word pairs in both data sets are ranked by their similarity values.
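The ranking step of this evaluation can be sketched as follows: score each labelled word pair by the cosine similarity of its two output vectors and sort the pairs by that score. This is an assumed reading of the evaluation procedure; the vectors and names below are illustrative.

```python
import math

def rank_pairs(pairs, vectors):
    """Score each labelled word pair by the cosine similarity of its two
    output vectors and return the pairs sorted from most to least
    similar, mirroring the similarity ranking in WordSim-style data."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))
    return sorted(pairs,
                  key=lambda p: cos(vectors[p[0]], vectors[p[1]]),
                  reverse=True)
```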
The evaluation data were input into the prior-art Skip-gram, GloVe, and SAT models and into the word vector generation system trained by the present invention (MP-CWR); using accuracy as the evaluation index, the evaluation results are shown in Table 1 below.
Table 1
Model | WordSim-240 | WordSim-297 |
---|---|---|
Skip-gram (2013) | 0.5703 | 0.5851 |
GloVe (2014) | 0.5241 | 0.5277 |
SAT (2017) | 0.5220 | 0.6150 |
MP-CWR | 0.5724 | 0.6170 |
As can be seen from the table above, in the similarity task the model of the present invention, trained on a small amount of data, improves the accuracy on both the WordSim-240 and WordSim-297 evaluation data sets, reaching 0.5724 and 0.6170 respectively.
To verify the model's performance on the nearest neighbor detection task, some polysemous words and monosemous words were selected, the vector similarity to other words was computed for each, and the results were compared with the existing Skip-gram, CWE, and SAT models. For example, the polysemous words "pride" and "troop" were selected for nearest neighbor detection, with results as shown in Table 2 below:
Table 2
As can be seen from Table 2, due to the limitations of existing methods, the Skip-gram, CWE, and SAT models each end up with only a single meaning. Although CWE and SAT both build multi-prototype representation models, they ultimately integrate the multiple meaning vectors of a word into a single vector representation. With the model constructed by the present invention, however, "pride" and "troop" each have two meaning vectors: the two meanings of "pride" are pride and complacency, and the two meanings of "troop" are team and army. And whereas the nearest words found by the other models are mostly related words, the words obtained by the present invention are mainly similar words.
In addition, nearest neighbor detection was also performed with monosemous words; for example, "frog" and "pregnancy" were selected for nearest neighbor detection, with results as shown in Table 3 below:
Table 3
As can be seen from Table 3, the words obtained by existing methods are mostly related words rather than similar words, whereas the model proposed by the present invention mostly obtains similar words. The method proposed by the present invention can therefore better find the nearest neighbor words of a word.
Addressing the problems that existing word representation models cannot represent polysemous words, produce ambiguous word representations, and require large corpora, the word vector generation method for Chinese corpus of the present invention proposes a multi-prototype Chinese word representation model. By introducing the Chinese thesaurus, the different meanings of each word and the synonym set and related word set under each meaning can be queried; based on an attention mechanism, different weights are assigned to each word of the synonym set and related word set under each meaning, and these are finally combined to obtain different vector representations for the different meanings of the word, realizing polysemous word representation. In the similarity task, the model of the present invention, trained on a small amount of data, improves the accuracy on both the WordSim-240 and WordSim-297 evaluation data sets, reaching 0.5724 and 0.6170 respectively. In the nearest neighbor detection task, compared with existing methods, whose nearest neighbor words are semantically ambiguous and mostly related or other words, the nearest neighbor results of the proposed model clearly distinguish the different meanings of a word, and the results consist mainly of words with similar meanings.
In addition, the present invention also provides an electronic device including a memory and a processor; the memory stores a word vector generation program for Chinese corpus, and when executed by the processor, the program implements the steps of the word vector generation method for Chinese corpus described above.
In the present embodiment, the memory includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as flash memory, a hard disk, a multimedia card, or a card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device, such as the electronic device's hard disk. In other embodiments, the readable storage medium may also be an external memory of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the electronic device. The memory may also be used to temporarily store data that has been or will be output.
In some embodiments, the processor may be a Central Processing Unit (CPU), microprocessor, or other data processing chip, used to run the program code stored in the memory or to process data.
Preferably, the electronic device further includes a network interface, which may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), typically used to establish communication connections between the electronic device and other electronic equipment; a communication bus is used to realize communication connections between these components.
In addition, an embodiment of the present invention also proposes a computer-readable storage medium containing a word vector generation program for Chinese corpus; when executed by a processor, the program implements the steps of the word vector generation method for Chinese corpus described above.
The specific embodiments of the electronic device and the computer-readable storage medium of the present invention are substantially the same as the specific embodiments of the word vector generation method and system for Chinese corpus described above, and are not repeated here.
Although the content disclosed above shows exemplary embodiments of the present invention, it should be noted that many modifications and variations are possible without departing from the scope defined by the claims. The functions, steps and/or actions of the method claims according to the inventive embodiments described herein need not be performed in any particular order. In addition, although elements of the invention may be described or claimed in the singular, plural elements are also contemplated unless limitation to a single element is explicitly stated.
Claims (10)
1. A word vector generation method for Chinese corpus, characterized by comprising:
constructing a database, the database storing independent words as an independent word set, storing the synonyms of each meaning of a word as a synonym set, and storing the related words of each meaning of a word as a related word set;
acquiring a Chinese corpus and segmenting it, obtaining the word set W = [w1, w2, ..., wb] constituted by the words of the Chinese corpus, b being the total number of words in the word set;
encoding the independent word set, synonym sets and related word sets of the word set in a first set order, obtaining the coding vector of each independent word, the coding vector of each synonym of each meaning of each word, the coding vector of each related word of each meaning of each word, and the coding vector of each meaning of each word;
inputting the coding vectors into a word representation model, converting each independent word in the word set, each synonym of each meaning of each word, each related word of each meaning of each word, and each meaning of each word into primary vectors; the primary vectors of the synonyms of each meaning of each word constitute the primary vector set of the synonym set of that meaning, the primary vectors of the related words of each meaning of each word constitute the primary vector set of the related word set of that meaning, the primary vector sets of the synonym set and the related word set of each meaning of each word constitute the primary vector set of that meaning, and the primary vector sets of each meaning of each word constitute the primary vector set of the word;
judging in a second set order whether each word in the word set belongs to the independent word set;
if the word belongs to the independent word set, taking the primary vector of the word as the output vector of the word;
if the word does not belong to the independent word set, executing the following steps:
inputting each primary vector in the primary vector set of each meaning of the word into a first probabilistic model, obtaining the first probability that the word belongs to each meaning, wherein the first probabilistic model is constructed by the following formula (1)
where wt is the t-th word in the word set and cj is the j-th meaning of word wt; the probability term denotes the first probability that word wt belongs to meaning cj, and the vector term denotes the primary vector of the i-th synonym or i-th related word in the j-th meaning cj of word wt;
inputting the primary vector of each meaning of the word into a second probabilistic model, obtaining the second probability that the word belongs to the context words of each meaning, wherein the second probabilistic model is constructed by the following formula (2)
where wt+k denotes a context word of word wt; the remaining terms denote, respectively, the primary vector of meaning cj of word wt, the primary vector obtained by inputting the words of the word set other than wt into the word representation model as independent-word coding vectors, the coding vector of meaning cj of word wt, and the second probability that word wt belongs to the context words of meaning cj;
inputting the first probability that the word belongs to each meaning and the second probability that the word belongs to the context words of each meaning into a third probabilistic model, obtaining the third probability that the word belongs to each meaning, wherein the third probabilistic model is constructed by the following formula (3)
where p(cj|wt) denotes the third probability that word wt belongs to meaning cj;
taking the primary vector of the meaning corresponding to the maximum third probability as the output vector of the word.
2. The word vector generation method for Chinese corpus according to claim 1, characterized by further comprising a step of correcting the first probabilistic model through an attention mechanism, the step comprising:
obtaining, from the coding vectors of any two synonyms or related words in each meaning of the word, the priori similarity of those two synonyms or related words according to the following formula (4)
where the first set is the synonym set of meaning cj of word wt and the second set is the related word set of meaning cj of word wt; wit and wnt are two synonyms in the synonym set or two related words in the related word set; fp(wit, wnt) is the priori similarity of wit and wnt, and SIM(wit, wnt) is the priori similarity obtained by a similarity method from the coding vectors of wit and wnt when both are in the related word set;
obtaining the vector similarity of any two synonyms or related words in each meaning of the word by the following formula (5)
fv(wit, wnt) = <wit, wnt> = ∑d vd(wit)*vd(wnt) (5)
where fv(wit, wnt) is the vector similarity of wit and wnt, vd(wit) and vd(wnt) denote the primary vectors of words wit and wnt, and d indexes the vector dimension;
correcting the primary vector of each synonym or related word in each meaning of word wt by the priori similarity and the vector similarity according to the following formulas (6) and (7)
where the correction factor of the primary vector denotes the attention score of the i-th synonym or i-th related word in the j-th meaning cj of word wt, and the modification vector is the corrected vector of that i-th synonym or related word;
substituting the modification vector for the primary vector as input to the first probabilistic model and/or the second probabilistic model.
3. The word vector generation method for Chinese corpus according to claim 1, characterized in that the method of judging in a second set order whether each word in the word set belongs to the independent word set comprises:
constructing a window and inputting the word set into the window to obtain the center word of the window;
judging whether the center word belongs to the independent word set;
sliding the window to judge in turn whether each word in the word set belongs to the independent word set.
4. The word vector generation method for Chinese corpus according to claim 2, characterized in that the priori similarity of any two synonyms or related words in each meaning of the word is obtained from their coding vectors according to the following formula (8)
where β is a harmonic coefficient balancing the influence of synonyms and related words, β ≤ 1.
5. The word vector generation method for Chinese corpus according to claim 4, characterized by further comprising a step of optimizing the harmonic coefficient β, the step comprising:
obtaining similarity values of paired words from the expert knowledge base or from manual annotation, and adding them to a first sequence in a set order;
obtaining the vector similarity of the primary vector sets of the paired words by a similarity method, and adding the vector similarities to a second sequence in the same set order;
obtaining the overall similarity of the first sequence and the second sequence using a similarity method;
taking the harmonic coefficient β corresponding to the maximum of the overall similarity as the best harmonic coefficient β.
6. A word vector generation system for Chinese corpus, characterized by comprising:
a database, which stores independent words as an independent word set, stores the synonyms of each meaning of a word as a synonym set, and stores the related words of each meaning of a word as a related word set;
an acquisition module, which acquires a Chinese corpus;
a word segmentation module, which segments the acquired Chinese corpus, obtaining the word set W = [w1, w2, ..., wb] constituted by the words of the Chinese corpus, b being the total number of words in the word set;
a coding module, which encodes the independent word set, synonym sets and related word sets of the word set in a first set order, obtaining the coding vector of each independent word, the coding vector of each synonym of each meaning of each word, the coding vector of each related word of each meaning of each word, and the coding vector of each meaning of each word;
a primary vector set construction module, which inputs the coding vectors into a word representation model and converts each independent word in the word set, each synonym of each meaning of each word, each related word of each meaning of each word, and each meaning of each word into primary vectors; the primary vectors of the synonyms of each meaning of each word constitute the primary vector set of the synonym set of that meaning, the primary vectors of the related words of each meaning of each word constitute the primary vector set of the related word set of that meaning, the primary vector sets of the synonym set and the related word set of each meaning of each word constitute the primary vector set of that meaning, and the primary vector sets of each meaning of each word constitute the primary vector set of the word;
a judgment module, which judges in a second set order whether each word in the word set belongs to the independent word set; if the word belongs to the independent word set, it sends a signal to a vector output module, and if the word does not belong to the independent word set, it sends a signal to a first probabilistic model module and a second probabilistic model module;
a first probabilistic model module, including a first probabilistic model construction unit and a first data processing unit, the first probabilistic model construction unit constructing the first probabilistic model and the first data processing unit inputting each primary vector in the primary vector set of each meaning of the word into the first probabilistic model to obtain the first probability that the word belongs to each meaning, wherein the first probabilistic model is constructed by the following formula (1)
where wt is the t-th word in the word set and cj is the j-th meaning of word wt; the probability term denotes the first probability that word wt belongs to meaning cj, and the vector term denotes the primary vector of the i-th synonym or i-th related word in the j-th meaning cj of word wt;
a second probabilistic model module, including a second probabilistic model construction unit and a second data processing unit, the second probabilistic model construction unit constructing the second probabilistic model and the second data processing unit inputting the primary vector of each meaning of the word into the second probabilistic model to obtain the second probability that the word belongs to the context words of each meaning, wherein the second probabilistic model is constructed by the following formula (2)
where wt+k denotes a context word of word wt; the remaining terms denote, respectively, the primary vector of meaning cj of word wt, the primary vector obtained by inputting the words of the word set other than wt into the word representation model as independent-word coding vectors, the coding vector of meaning cj of word wt, and the second probability that word wt belongs to the context words of meaning cj;
a third probabilistic model module, including a third probabilistic model construction unit and a third data processing unit, the third probabilistic model construction unit constructing the third probabilistic model and the third data processing unit inputting the first probability that the word belongs to each meaning and the second probability that the word belongs to the context words of each meaning into the third probabilistic model to obtain the third probability that the word belongs to each meaning, wherein the third probabilistic model is constructed by the following formula (3)
where p(cj|wt) denotes the third probability that word wt belongs to meaning cj;
a vector output module, which takes the primary vector of a word that belongs to the independent word set as the output vector of that word, and, for a word that does not belong to the independent word set, takes the primary vector of the meaning corresponding to the word's maximum third probability as the output vector of the word.
7. The word vector generation system for Chinese corpus according to claim 6, characterized by further comprising a correction module that corrects the first probabilistic model through an attention mechanism, comprising:
a priori similarity obtaining unit, which obtains, from the coding vectors of any two synonyms or related words in each meaning of the word, the priori similarity of those two synonyms or related words according to the following formula (4)
where the first set is the synonym set of meaning cj of word wt and the second set is the related word set of meaning cj of word wt; wit and wnt are two synonyms in the synonym set or two related words in the related word set; fp(wit, wnt) is the priori similarity of wit and wnt, and SIM(wit, wnt) is the priori similarity obtained by a similarity method from the coding vectors of wit and wnt when both are in the related word set;
a vector similarity obtaining unit, which obtains the vector similarity of any two synonyms or related words in each meaning of the word by the following formula (5)
fv(wit, wnt) = <wit, wnt> = ∑d vd(wit)*vd(wnt) (5)
where fv(wit, wnt) is the vector similarity of wit and wnt, vd(wit) and vd(wnt) denote the primary vectors of words wit and wnt, and d indexes the vector dimension;
an amending unit, which corrects the primary vector of each synonym or related word in each meaning of word wt by the priori similarity and the vector similarity according to the following formulas (6) and (7)
where the correction factor of the primary vector denotes the attention score of the i-th synonym or i-th related word in the j-th meaning cj of word wt, and the modification vector is the corrected vector of that i-th synonym or related word;
wherein the amending unit substitutes the corrected modification vector for the primary vector as input to the first probabilistic model and/or the second probabilistic model.
8. The word vector generation system for Chinese corpus according to claim 6, characterized in that the judgment module includes a window construction unit, a judging unit and a sliding unit, wherein the window construction unit constructs a window and inputs the word set into the window to obtain the center word of the window; the sliding unit slides the window; and the judging unit judges whether the center word belongs to the independent word set.
9. The word vector generation system for Chinese corpus according to claim 7, characterized in that the priori similarity obtaining unit obtains, from the coding vectors of any two synonyms or related words in each meaning of the word, their priori similarity according to the following formula (8)
where β is a harmonic coefficient balancing the influence of synonyms and related words, β ≤ 1.
10. The word vector generation system for Chinese corpus according to claim 9, characterized by further comprising an optimization module that optimizes the harmonic coefficient β, comprising:
a first sequence construction unit, which obtains similarity values of paired words from the expert knowledge base or from manual annotation and adds them to a first sequence in a set order;
a second sequence construction unit, which obtains the vector similarity of the primary vector sets of the paired words by a similarity method and adds the vector similarities to a second sequence in the same set order;
an overall similarity obtaining unit, which obtains the overall similarity of the first sequence and the second sequence using a similarity method;
a best harmonic coefficient obtaining unit, which takes the harmonic coefficient β corresponding to the maximum of the overall similarity as the best harmonic coefficient β.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910429450.5A CN110309317B (en) | 2019-05-22 | 2019-05-22 | Method, system, electronic device and medium for generating word vector of Chinese corpus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110309317A true CN110309317A (en) | 2019-10-08 |
CN110309317B CN110309317B (en) | 2021-07-23 |
Family
ID=68075495
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910429450.5A Active CN110309317B (en) | 2019-05-22 | 2019-05-22 | Method, system, electronic device and medium for generating word vector of Chinese corpus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110309317B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106156004A (en) * | 2016-07-04 | 2016-11-23 | 中国传媒大学 | The sentiment analysis system and method for film comment information based on term vector |
US20170053646A1 (en) * | 2015-08-17 | 2017-02-23 | Mitsubishi Electric Research Laboratories, Inc. | Method for using a Multi-Scale Recurrent Neural Network with Pretraining for Spoken Language Understanding Tasks |
CN106649250A (en) * | 2015-10-29 | 2017-05-10 | 北京国双科技有限公司 | Method and device for identifying emotional new words |
CN107102981A (en) * | 2016-02-19 | 2017-08-29 | 腾讯科技(深圳)有限公司 | Term vector generation method and device |
CN109325114A (en) * | 2018-07-24 | 2019-02-12 | 武汉理工大学 | A kind of text classification algorithm merging statistical nature and Attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN110309317B (en) | 2021-07-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111310438B (en) | Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model | |
CN110633409B (en) | Automobile news event extraction method integrating rules and deep learning | |
CN104391942B (en) | Short text feature extension method based on semantic graphs | |
CN111241294B (en) | Relationship extraction method of graph convolution network based on dependency analysis and keywords | |
CN110019843B (en) | Knowledge graph processing method and device | |
CN110427623A (en) | Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium | |
CN106599032B (en) | Text event extraction method combining sparse coding and a structured perceptron | |
CN109871538A (en) | Chinese electronic health record named entity recognition method | |
CN105243129A (en) | Commodity attribute feature word clustering method | |
CN102663129A (en) | Deep question answering method and medical retrieval system for the medical domain | |
CN104573028A (en) | Intelligent question-answer implementing method and system | |
CN101539907A (en) | Part-of-speech tagging model training device and part-of-speech tagging system and method thereof | |
CN106708929B (en) | Video program searching method and device | |
CN104484380A (en) | Personalized search method and personalized search device | |
CN111444704B (en) | Network security keyword extraction method based on deep neural network | |
CN117076653B (en) | Knowledge base question-answering method based on thinking chain and visual lifting context learning | |
CN113761890A (en) | BERT context sensing-based multi-level semantic information retrieval method | |
CN106886565B (en) | Automatic aggregation method for basic floor plans | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN114936277A (en) | Similar question matching method and user similar question matching system | |
CN111680131A (en) | Document clustering method and system based on semantics and computer equipment | |
CN114997288A (en) | Design resource association method | |
CN111222330A (en) | Chinese event detection method and system | |
CN110516145A (en) | Information searching method based on sentence vector coding | |
CN112883199A (en) | Collaborative disambiguation method based on deep semantic neighbor and multi-entity association |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||