CN106909537B - One-word polysemous analysis method based on topic model and vector space - Google Patents

One-word polysemous analysis method based on topic model and vector space

Info

Publication number
CN106909537B
Authority
CN
China
Prior art keywords
word
topic
vector
representing
document
Prior art date
Legal status
Active
Application number
CN201710067919.6A
Other languages
Chinese (zh)
Other versions
CN106909537A (en)
Inventor
罗嘉文
卓汉逵
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201710067919.6A
Publication of CN106909537A
Application granted
Publication of CN106909537B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a word-polysemy analysis method based on a topic model and a vector space, which comprises the following steps: S1, establishing a word-polysemy topic model with formula (1) as the objective function; S2, reading the data of the whole document set D; S3, initializing the topic-word distribution φ; S4, topic sampling; S5, updating the topic vectors; S6, training the word vectors; S7, executing S4-S6 cyclically for several iterations; S8, outputting and storing the obtained word vectors and topic vectors; S9, judging whether a word is polysemous. The word-polysemy analysis method based on the topic model and the vector space can train higher-quality word vectors and topic vectors, so that polysemy can be analyzed and explained more reasonably, and the performance of the topic model is clearly superior to that of the original LDA model. Through cross-learning of the topic model, the word vectors and the topic vectors, the method improves similarity measurement and can be effectively applied to tasks such as similarity evaluation, document classification and topic relevance.

Description

One-word polysemous analysis method based on topic model and vector space
Technical Field
The invention relates to the field of natural language processing, in particular to a word polysemous analysis method based on a topic model and a vector space.
Background
With the vigorous development of artificial intelligence technology, natural language processing has become an innovative mode of language research that combines computer science, linguistics and mathematics, and it is widely applied to machine translation, question-answering systems, information retrieval, document processing and other areas. Since most words carry more than one meaning, that is, the phenomenon of word polysemy exists, representing each word by a single word vector cannot resolve this ambiguity. To solve the problem, context information or topic vectors have been used to assist the study of word polysemy, but such studies isolate the topic model, the word vectors and the topic vectors from one another and simply use existing results as prior knowledge to assist model training.
The topic model mines the latent topic information of a document set: each topic represents a related concept, embodied as a series of related words, and is realized as a topic-word distribution. The word vector model maps each word to a low-dimensional real-valued space by using context information in the text; the resulting vectors contain syntactic and semantic information, so the similarity of word vectors can be measured by Euclidean distance or by the cosine of the angle between them. The topic vector maps a topic directly into the vector space and approximately represents the semantic center of that topic.
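For example, a minimal sketch in Python (with made-up 4-dimensional vectors, for illustration only) of how the closeness of two word vectors can be measured by Euclidean distance or by the cosine of their angle:

```python
import numpy as np

# Hypothetical 4-dimensional word vectors; real word vectors are learned from text.
v_bank  = np.array([0.9, 0.1, 0.0, 0.2])
v_money = np.array([0.8, 0.2, 0.1, 0.1])

euclidean = np.linalg.norm(v_bank - v_money)   # smaller value = more similar
cosine = np.dot(v_bank, v_money) / (np.linalg.norm(v_bank) * np.linalg.norm(v_money))
print(euclidean, cosine)                        # cosine close to 1 indicates similar meaning
```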
The topic model, word vectors and topic vectors can all be used for document representation and are mainly applied to tasks such as document clustering and document classification. Each of the three has its own strengths in text mining, and research has shown that combining the global information of the topic model with the local information of word vectors helps improve the original models. However, this research has significant limitations: most of it treats the three components in isolation, or trains one or two of them separately and then uses the training result to improve the remaining one, or directly uses the training result of a larger training set as external knowledge to assist model training on other small data sets.
Disclosure of Invention
Aiming at the problems in the prior art, the invention models a text document set and draws on the advantages of the topic model, word vectors and topic vectors to provide a word-polysemy analysis method based on a topic model and a vector space, so as to better mine the latent topic information of the document set.
In order to achieve the purpose, the invention adopts the following technical scheme:
a word-polysemous analysis method based on a topic model and a vector space comprises the following steps:
S1, taking formula (1) as the objective function, establish the word-polysemy topic model:
[formula (1) appears as an image in the original document]
where D is the text document set, M is the number of documents in the set, N_m is the number of words in the m-th document, c is the size of the context window, w_{m,n} denotes the n-th word of the m-th document, K denotes the number of topics, t_k denotes the k-th topic vector, φ denotes the topic-word distribution of the topic model, and z_{m,n} denotes the topic number assigned to w_{m,n};
S2, reading the data of the whole text document set D;
S3, topic-word distribution φ initialization: first, the GibbsLDA algorithm is used to perform topic sampling on each word in the text document set D; then an initial estimate of the topic-word distribution φ of the topic model is computed;
S4, topic sampling: for each word w_{m,n} in each document, calculate the probability that the word belongs to each topic, and sample the corresponding topic number z_{m,n} ∈ [1, K] by the cumulative-distribution method;
S5, topic vector updating: for each topic vector t_k, k ∈ [1, K], recalculate its vector representation according to formula (5):
t_k = ( Σ_{m=1}^{M} Σ_{n=1}^{N_m} I(z_{m,n} = k) · v_{w_{m,n}} ) / ( Σ_{w=1}^{W} n_{k,w} )        (5)
where I(x) is the indicator function, equal to 1 when x is true and 0 otherwise, v_{w_{m,n}} is the word vector corresponding to w_{m,n}, W is the vocabulary size of the document set, and n_{k,w} is the number of times word w is assigned to topic k;
S6, training word vectors: construct a Huffman tree whose leaf nodes are the words w of the vocabulary and whose non-leaf nodes serve as auxiliary vectors u, and solve the objective function shown in formula (1) by stochastic gradient descent;
s7, circularly executing S4-S6 for a plurality of times to perform a plurality of iterations;
S8, outputting and storing the obtained word vectors and topic vectors;
S9, judging whether a word is polysemous: concatenate the word vector and the topic vector of the word to be analyzed to form a new vector that represents the whole context; then calculate the cosine value between the new vectors obtained from different contexts of the word; when the cosine value is smaller than a set threshold, the word is judged to exhibit polysemy; otherwise, the word is judged not to exhibit polysemy.
Further, in S3, the update rule used in the topic sampling process is as shown in equation (2):
p(z_{m,n} = k | z_{-(m,n)}, w) ∝ (n_{k,w_{m,n}}^{-(m,n)} + β) / (n_k^{-(m,n)} + W·β) · (n_{m,k}^{-(m,n)} + α)        (2)
where -(m,n) denotes that the current word is removed from the statistics, W denotes the vocabulary size of the text document set D, n_{m,k} denotes the number of words in the m-th document that belong to topic k, z_{m,n} denotes the topic number assigned to w_{m,n}, n_{k,w_{m,n}}^{-(m,n)} denotes the number of times w_{m,n} is assigned to topic k, n_k denotes the number of all words assigned to topic k, and α is the symmetric Dirichlet hyperparameter;
The formula used for the initialization estimate of the topic-word distribution φ of the topic model is formula (3):
φ_{k,w} = (n_{k,w} + β) / (n_k + W·β)        (3)
where φ_{k,w} here denotes the initialization estimate of the topic-word distribution, n_k denotes the number of all words assigned to topic k, and β denotes the symmetric Dirichlet hyperparameter.
Further, in S4, the probability that the word belongs to each topic is calculated according to equation (4):
p(z_{m,n} = k | z_{-(m,n)}, w) ∝ φ_{k,w_{m,n}} · (n_{m,k}^{-(m,n)} + α) / ( Σ_{k'=1}^{K} n_{m,k'}^{-(m,n)} + K·α )        (4)
further, in S6, the method specifically includes the following steps:
S601, updating the topic-word distribution φ: calculate the gradient of each component of φ according to formula (6); for each component, define its constraint as Σ_{w=1}^{W} φ_{k,w} = 1;
[formula (6) appears as an image in the original document]
where L(w_{m,n+j}) denotes the length of the path from the root node of the Huffman tree to the leaf node w_{m,n+j} (the number of nodes, including the root and the leaf node); formula (6) further uses the Huffman coding of the step i → i+1 on that path, the sigmoid function σ(x) = 1/(1 + e^{-x}), and the i-th non-leaf node vector on the path;
s602, updating a word vector w: calculating the gradient of each word according to the formula (7) and updating by using the auxiliary vector;
[formula (7) appears as an image in the original document]
s603, updating the auxiliary vector u of the non-leaf node of the Huffman tree: calculating a non-leaf node vector u on a Huffman tree path according to the formula (8) so that the non-leaf node vector u can influence the training quality of a word vector w;
[formula (8) appears as an image in the original document]
further, in S9, the set threshold value is 0.6.
The word-polysemy analysis method based on the topic model and the vector space can train higher-quality word vectors and topic vectors, so that polysemy can be analyzed and explained more reasonably, and the performance of the topic model is clearly superior to that of the original LDA model. Through cross-learning of the topic model, the word vectors and the topic vectors, the method improves similarity measurement and can be effectively applied to tasks such as similarity evaluation, document classification and topic relevance.
Drawings
Fig. 1 is a schematic flowchart of a word-polysemous analysis method based on a topic model and a vector space according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
In order to fully utilize the intrinsic characteristics of the topic model, word vectors and topic vectors, to account for the pervasiveness of word polysemy in text data, and to better mine the latent topic information of a document set while training higher-quality word vectors and topic vectors, the invention provides a word-polysemy analysis method based on a topic model and a vector space.
Specifically, the present invention makes the following reasonable assumptions according to the basic rules of natural language processing:
1. In the topic-word distribution φ of the topic model, a series of words with higher probability can represent a specific concept; the numerical value is the probability that a certain word appears under that topic, and the quality of the mined topics can be evaluated through topic relevance.
2. Each word in the text can be mapped into a low-dimensional real-valued vector space, i.e. a word vector, which contains syntactic and semantic information about the word, and the differences between word vectors can be evaluated by mathematical means such as Euclidean distance or cosine.
3. The topic vectors and the topic-word distribution φ of the topic model are not completely isolated from each other: a topic vector can be viewed as the semantic center of the probability distribution mapped into the word vector space, and is closely associated with the word vectors.
Based on the above assumptions, the present invention provides a word-polysemy analysis method based on a topic model and a vector space; as shown in fig. 1, the method includes the following steps:
S1, taking formula (1) as the objective function, establish the word-polysemy topic model:
[formula (1) appears as an image in the original document]
where D is the text document set, M is the number of documents in the set, N_m is the number of words in the m-th document, c is the size of the context window, w_{m,n} denotes the n-th word of the m-th document, K denotes the number of topics, t_k denotes the k-th topic vector, φ denotes the topic-word distribution of the topic model, and z_{m,n} denotes the topic number assigned to w_{m,n};
S2, reading the data of the whole text document set D;
S3, topic-word distribution φ initialization: first, the GibbsLDA algorithm is used to perform topic sampling on each word in the text document set D; then an initial estimate of the topic-word distribution φ of the topic model is computed;
The update rule used in the topic sampling process is shown in formula (2):
p(z_{m,n} = k | z_{-(m,n)}, w) ∝ (n_{k,w_{m,n}}^{-(m,n)} + β) / (n_k^{-(m,n)} + W·β) · (n_{m,k}^{-(m,n)} + α)        (2)
where -(m,n) denotes that the current word is removed from the statistics, W denotes the vocabulary size of the text document set D, n_{m,k} denotes the number of words in the m-th document that belong to topic k, z_{m,n} denotes the topic number assigned to w_{m,n}, n_{k,w_{m,n}}^{-(m,n)} denotes the number of times w_{m,n} is assigned to topic k, n_k denotes the number of all words assigned to topic k, and α is the symmetric Dirichlet hyperparameter;
The formula used for the initialization estimate of the topic-word distribution φ of the topic model is formula (3):
φ_{k,w} = (n_{k,w} + β) / (n_k + W·β)        (3)
where φ_{k,w} here denotes the initialization estimate of the topic-word distribution, n_k denotes the number of all words assigned to topic k, and β denotes the symmetric Dirichlet hyperparameter;
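A minimal Python sketch of this initialization step (S3) is given below, assuming the standard GibbsLDA update and the estimate of formula (3); the function name, arguments and hyperparameter defaults are illustrative only.

```python
import numpy as np

def gibbs_lda_init(docs, K, W, alpha=0.1, beta=0.01, sweeps=20, rng=None):
    """S3 sketch: collapsed Gibbs sampling over the corpus, followed by an initial
    estimate of the topic-word distribution phi (formula (3)).
    `docs` is a list of documents, each a list of word ids in [0, W)."""
    rng = rng or np.random.default_rng(0)
    n_mk = np.zeros((len(docs), K))            # words in document m assigned to topic k
    n_kw = np.zeros((K, W))                    # times word w assigned to topic k
    n_k = np.zeros(K)                          # total words assigned to topic k
    z = [rng.integers(0, K, size=len(d)) for d in docs]   # random initial assignments
    for m, d in enumerate(docs):
        for n, w in enumerate(d):
            k = z[m][n]
            n_mk[m, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(sweeps):
        for m, d in enumerate(docs):
            for n, w in enumerate(d):
                k_old = z[m][n]
                n_mk[m, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
                # standard GibbsLDA update (assumed form of formula (2))
                p = (n_kw[:, w] + beta) / (n_k + W * beta) * (n_mk[m] + alpha)
                k_new = rng.choice(K, p=p / p.sum())
                z[m][n] = k_new
                n_mk[m, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1
    phi = (n_kw + beta) / (n_k[:, None] + W * beta)        # formula (3)
    return phi, z, n_mk, n_kw, n_k
```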
S4, topic sampling: for each word w_{m,n} in each document, calculate the probability that the word belongs to each topic according to formula (4), and then sample the corresponding topic number z_{m,n} ∈ [1, K] by the cumulative-distribution method;
p(z_{m,n} = k | z_{-(m,n)}, w) ∝ φ_{k,w_{m,n}} · (n_{m,k}^{-(m,n)} + α) / ( Σ_{k'=1}^{K} n_{m,k'}^{-(m,n)} + K·α )        (4)
In the topic model, the latent variable z is an indispensable intermediate bridge in the Gibbs sampling procedure and directly influences the quality of the topic-word distribution φ and the document-topic distribution θ that the topic model ultimately needs to obtain. In contrast to the original Gibbs update rule, the present invention adopts formula (4) as the Gibbs update rule, which is characterized in that the topic-word distribution φ is used directly in the update. The beneficial effect is that the statistical and practical meaning of the distribution φ can be fully exploited, the computation speed is increased, and the method is better suited to large-scale data sets.
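A sketch of this sampling step for a single word occurrence, assuming the per-topic weight of formula (4) is proportional to φ[k, w] times the smoothed document-topic count, could look like the following; the function and argument names are illustrative.

```python
import numpy as np

def sample_topic(w, m, phi, n_mk, alpha, rng):
    """One application of S4: unnormalized probability of each topic for word w in
    document m (assumed form of formula (4)), followed by inverse-CDF sampling."""
    p = phi[:, w] * (n_mk[m] + alpha)     # per-topic weight, shape (K,)
    cdf = np.cumsum(p)                    # cumulative distribution over topics
    u = rng.random() * cdf[-1]            # uniform draw on [0, total mass)
    return int(np.searchsorted(cdf, u))   # smallest k with cdf[k] >= u
```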
S5, topic vector updating: for each topic vector t_k, k ∈ [1, K], recalculate its vector representation according to formula (5):
t_k = ( Σ_{m=1}^{M} Σ_{n=1}^{N_m} I(z_{m,n} = k) · v_{w_{m,n}} ) / ( Σ_{w=1}^{W} n_{k,w} )        (5)
where I(x) is the indicator function, equal to 1 when x is true and 0 otherwise, v_{w_{m,n}} is the word vector corresponding to w_{m,n}, W is the vocabulary size of the text document set, and n_{k,w} is the number of times word w is assigned to topic k. This is a subtle manifestation of word polysemy: one word may belong to different topics. Furthermore, the calculation and updating of the topic vectors can be carried out simultaneously without interfering with each other.
The primary purpose of the topic vector is to represent the latent topic information of a document set in vector space, rather than as a multinomial distribution such as φ, so that topics gain spatial-geometric meaning and combine more closely with the word vectors. Unlike the TWE model, which trains topic vectors with a Skip-Gram-like approach, the present invention uses formula (5) to directly update the vector representation corresponding to each topic; it is characterized in that the vector representation of each topic is related only to the words under that topic. The advantages are that the calculation of the topic vector is simple, easy to understand, fast and efficient; by using a mean calculation the vector lies close to the geometric center of the word vectors and, according to assumption 2, can be approximately regarded as the semantic center of the topic concept.
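A short sketch of this update, reading formula (5) as the mean of the word vectors of the word occurrences currently assigned to each topic, could be written as follows; names and the data layout are assumptions for illustration.

```python
import numpy as np

def update_topic_vectors(docs, z, word_vecs, K):
    """S5 sketch: each topic vector t_k is recomputed as the mean of the word
    vectors of all word occurrences currently assigned to topic k."""
    dim = word_vecs.shape[1]
    t = np.zeros((K, dim))
    count = np.zeros(K)
    for m, d in enumerate(docs):
        for n, w in enumerate(d):
            k = z[m][n]
            t[k] += word_vecs[w]
            count[k] += 1
    nonzero = count > 0
    t[nonzero] /= count[nonzero, None]    # mean over the tokens of each topic
    return t
```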
S6, training word vectors: construct a Huffman tree whose leaf nodes are the words w of the vocabulary and whose non-leaf nodes serve as auxiliary vectors u, and solve the objective function shown in formula (1) by stochastic gradient descent; specifically, S6 includes the following steps:
S601, updating the topic-word distribution φ: calculate the gradient of each component of φ according to formula (6); note that, since the distribution has a probabilistic meaning, each component is subject to the constraint Σ_{w=1}^{W} φ_{k,w} = 1;
[formula (6) appears as an image in the original document]
where L(w_{m,n+j}) denotes the length of the path from the root node of the Huffman tree to the leaf node w_{m,n+j} (the number of nodes, including the root and the leaf node); formula (6) further uses the Huffman coding of the step i → i+1 on that path, the sigmoid function σ(x) = 1/(1 + e^{-x}), and the i-th non-leaf node vector on the path;
s602, updating a word vector w: calculating the gradient of each word according to the formula (7) and updating by using the auxiliary vector;
[formula (7) appears as an image in the original document]
s603, updating the auxiliary vector u of the non-leaf node of the Huffman tree. The step is mainly to calculate a non-leaf node vector u on the path of the Huffman tree according to the formula (8) so as to influence the training quality of leaf nodes (namely word vectors w);
[formula (8) appears as an image in the original document]
The computational complexity of the softmax function is linear in the vocabulary size W, which is unfavorable for training on large-scale data sets. The invention therefore follows the approximate calculation method of Skip-Gram and adopts the idea of hierarchical softmax: a Huffman tree is constructed in which the leaf nodes are the words w of the vocabulary and the non-leaf nodes serve as auxiliary vectors u. In the word vector training stage, the invention uses stochastic gradient descent to solve the objective function shown in formula (1); as shown in formula (6), the gradient of the topic-word distribution φ directly uses the topic vector t_k and the non-leaf node vectors u of the Huffman tree. The beneficial effect is that the topic-word distribution φ continuously absorbs information from the topic vectors t_k during the iterative updates and exchanges information with the word vectors through the auxiliary vectors u, so that the update of φ directly or indirectly exploits the spatial characteristics of the topic vectors and the word vectors.
Further, the update gradients of the node vectors in the Huffman tree are given by formula (7) and formula (8) respectively; they are characterized in that the update of the non-leaf vectors u directly uses the topic vectors and the topic-word distribution φ, while the update of the leaf nodes w directly uses the non-leaf vectors. The advantage is that the topic vectors and the topic-word distribution φ permeate the non-leaf nodes (i.e. the branches) of the whole Huffman tree, so that the leaf nodes of the Huffman tree are influenced at a deeper level by both the leaf and the branch nodes, achieving a mutually reinforcing effect.
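The following Python sketch illustrates the two building blocks named here: constructing a Huffman tree over the vocabulary and performing one hierarchical-softmax stochastic gradient step along a leaf's path. The patent's exact gradients (formulas (6) to (8)) are shown only as images in the original, so the update below uses the standard word2vec hierarchical-softmax form; adding the topic vector to the center word's representation, and all function and variable names, are assumptions made for illustration.

```python
import heapq
import itertools
import numpy as np

def build_huffman(freqs):
    """Huffman tree over the vocabulary. For each word id, returns the list of
    inner-node ids on its root-to-leaf path and the binary code of each step."""
    counter = itertools.count()
    inner_id = itertools.count()
    heap = [(f, next(counter), ("leaf", w)) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        node = ("inner", next(inner_id), left, right)
        heapq.heappush(heap, (f1 + f2, next(counter), node))
    paths, codes = {}, {}

    def walk(node, path, code):
        if node[0] == "leaf":
            paths[node[1]], codes[node[1]] = path, code
        else:
            _, nid, left, right = node
            walk(left, path + [nid], code + [0])
            walk(right, path + [nid], code + [1])

    walk(heap[0][2], [], [])
    return paths, codes

def hs_sgd_step(center_vec, topic_vec, context_word, paths, codes, U, lr=0.025):
    """One hierarchical-softmax SGD step predicting `context_word` from the center
    word. U holds the auxiliary vectors of the non-leaf nodes. Adding the topic
    vector to the center representation is only an assumed way of letting t_k
    enter the objective; the patent's formulas (6)-(8) are not reproduced here."""
    h = center_vec + topic_vec                        # combined input representation
    grad_h = np.zeros_like(h)
    for nid, d in zip(paths[context_word], codes[context_word]):
        f = 1.0 / (1.0 + np.exp(-np.dot(h, U[nid])))  # sigma(h . u) for this node
        g = (1.0 - d - f) * lr                        # gradient scale at this node
        grad_h += g * U[nid]                          # accumulate gradient for the input side
        U[nid] += g * h                               # update auxiliary vector (cf. (8))
    center_vec += grad_h                              # update the word vector (cf. (7))
```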
S7, circularly executing S4-S6 for a plurality of times to perform a plurality of iterations; the iteration is mainly performed to further improve the topic model, the word vector and the topic vector by cross learning.
S8, outputting and storing the obtained word vectors and topic vectors;
S9, judging whether a word is polysemous: concatenate the word vector and the topic vector of the word to be analyzed to form a new vector representing the whole context; then compute the cosine value between such new vectors obtained from different contexts of the word. When the cosine value is smaller than the set threshold (e.g. 0.60), the word exhibits polysemy; otherwise, the meaning of the word is consistent across the given contexts and the word exhibits no polysemy.
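A minimal sketch of this judgment, with hypothetical inputs (the word's vector and the topic vectors sampled for it in two different contexts), could be:

```python
import numpy as np

def is_polysemous(word_vec, topic_vec_ctx1, topic_vec_ctx2, threshold=0.6):
    """S9 sketch: concatenate the word vector with each context's topic vector,
    compare the two context representations by cosine, and flag polysemy when
    the cosine falls below the threshold (0.6 in the embodiment)."""
    a = np.concatenate([word_vec, topic_vec_ctx1])
    b = np.concatenate([word_vec, topic_vec_ctx2])
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos < threshold
```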
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (5)

1. A word-polysemous analysis method based on a topic model and a vector space is characterized by comprising the following steps:
S1, taking formula (1) as the objective function, establish the word-polysemy topic model:
[formula (1) appears as an image in the original document]
where D is the text document set, M is the number of documents in the set, N_m is the number of words in the m-th document, c is the size of the context window, w_{m,n} denotes the n-th word of the m-th document, K denotes the number of topics, t_k denotes the k-th topic vector, φ denotes the topic-word distribution of the topic model, and z_{m,n} denotes the topic number assigned to w_{m,n};
S2, reading the data of the whole text document set D;
S3, topic-word distribution φ initialization: first, the GibbsLDA algorithm is used to perform topic sampling on each word in the text document set D; then an initial estimate of the topic-word distribution φ of the topic model is computed;
S4, topic sampling: for each word w_{m,n} in each document, calculate the probability that the word belongs to each topic, and then sample the corresponding topic number z_{m,n} ∈ [1, K] by the cumulative-distribution method;
S5, topic vector updating: for each topic vector t_k, k ∈ [1, K], recalculate its vector representation according to formula (5):
t_k = ( Σ_{m=1}^{M} Σ_{n=1}^{N_m} I(z_{m,n} = k) · v_{w_{m,n}} ) / ( Σ_{w=1}^{W} n_{k,w} )        (5)
where I(x) is the indicator function, equal to 1 when x is true and 0 otherwise, v_{w_{m,n}} is the word vector corresponding to w_{m,n}, W is the vocabulary size of the document set, and n_{k,w} is the number of times word w is assigned to topic k;
S6, training word vectors: construct a Huffman tree whose leaf nodes are the words w of the vocabulary and whose non-leaf nodes serve as auxiliary vectors u, and solve the objective function shown in formula (1) by stochastic gradient descent;
s7, circularly executing S4-S6 for a plurality of times to perform a plurality of iterations;
S8, outputting and storing the obtained word vectors and topic vectors;
S9, judging whether a word is polysemous: concatenate the word vector and the topic vector of the word to be analyzed to form a new vector that represents the whole context; then calculate the cosine value between the new vectors obtained from different contexts of the word; when the cosine value is smaller than a set threshold, the word is judged to exhibit polysemy; otherwise, the word is judged not to exhibit polysemy.
2. The analysis method according to claim 1, wherein in S3, the update rule used in the topic sampling process is as shown in equation (2):
p(z_{m,n} = k | z_{-(m,n)}, w) ∝ (n_{k,w_{m,n}}^{-(m,n)} + β) / (n_k^{-(m,n)} + W·β) · (n_{m,k}^{-(m,n)} + α)        (2)
wherein -(m,n) means that the n-th word of the m-th document is removed from the statistics, W represents the vocabulary size of the text document set D, w denotes the vector of all the words in the text document set D, n_{m,k} represents the number of words belonging to topic k in the m-th document, z_{m,n} denotes the topic number assigned to w_{m,n}, z_{-(m,n)} represents the vector of all topic numbers of the document set D after the n-th word of the m-th document is removed, n_{k,w_{m,n}}^{-(m,n)} denotes the number of times w_{m,n} is assigned to topic k, n_k represents the number of all words assigned to topic k, and α is the symmetric Dirichlet hyperparameter;
The formula used for the initialization estimate of the topic-word distribution φ of the topic model is formula (3):
φ_{k,w} = (n_{k,w} + β) / (n_k + W·β)        (3)
where φ_{k,w} here denotes the initialization estimate of the topic-word distribution, n_k denotes the number of all words assigned to topic k, and β represents the symmetric Dirichlet hyperparameter.
3. The analysis method according to claim 1, wherein in S4, the probability that the word belongs to each topic is calculated according to equation (4):
p(z_{m,n} = k | z_{-(m,n)}, w) ∝ φ_{k,w_{m,n}} · (n_{m,k}^{-(m,n)} + α) / ( Σ_{k'=1}^{K} n_{m,k'}^{-(m,n)} + K·α )        (4)
where -(m,n) denotes that the n-th word of the m-th document is removed from the statistics, w denotes the vector of all the words in the text document set D, z_{-(m,n)} denotes the vector of all topic numbers of the document set D after the n-th word of the m-th document is removed, n_{m,k}^{-(m,n)} denotes the number of words belonging to topic k in the m-th document after the n-th word is removed, n_{m,k'}^{-(m,n)} denotes the number of words belonging to topic k' in the m-th document after the n-th word is removed, K denotes the number of topics, φ denotes the topic-word distribution of the topic model, and α is the symmetric Dirichlet hyperparameter.
4. The analysis method according to claim 1, wherein in S6, the method specifically comprises the steps of:
S601, updating the topic-word distribution φ: calculate the gradient of each component of φ according to formula (6); for each component, define its constraint as Σ_{w=1}^{W} φ_{k,w} = 1;
[formula (6) appears as an image in the original document]
where L(w_{m,n+j}) denotes the length of the path from the root node to the leaf node w_{m,n+j} in the Huffman tree, including the root node and the leaf node; formula (6) further uses the Huffman coding of the step i → i+1 on that path, the sigmoid function σ(x) = 1/(1 + e^{-x}), and the i-th non-leaf node vector on the path;
S602, updating the word vectors: calculate the gradient of each word according to formula (7) and update it using the auxiliary vectors;
[formula (7) appears as an image in the original document]
S603, updating the auxiliary vectors u of the non-leaf nodes of the Huffman tree: calculate the auxiliary vectors u of the non-leaf nodes of the Huffman tree according to formula (8) so that they influence the training quality of the word vectors;
[formula (8) appears as an image in the original document]
5. the analysis method according to claim 1, wherein in S9, the set threshold value is 0.6.
CN201710067919.6A 2017-02-07 2017-02-07 One-word polysemous analysis method based on topic model and vector space Active CN106909537B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710067919.6A CN106909537B (en) 2017-02-07 2017-02-07 One-word polysemous analysis method based on topic model and vector space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710067919.6A CN106909537B (en) 2017-02-07 2017-02-07 One-word polysemous analysis method based on topic model and vector space

Publications (2)

Publication Number Publication Date
CN106909537A CN106909537A (en) 2017-06-30
CN106909537B true CN106909537B (en) 2020-04-07

Family

ID=59208107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710067919.6A Active CN106909537B (en) 2017-02-07 2017-02-07 One-word polysemous analysis method based on topic model and vector space

Country Status (1)

Country Link
CN (1) CN106909537B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108629009A (en) * 2018-05-04 2018-10-09 南京信息工程大学 Topic relativity modeling method based on FrankCopula functions
CN108984526B (en) * 2018-07-10 2021-05-07 北京理工大学 Document theme vector extraction method based on deep learning
CN108920467B (en) * 2018-08-01 2021-04-27 北京三快在线科技有限公司 Method and device for learning word meaning of polysemous word and search result display method
CN109376226A (en) * 2018-11-08 2019-02-22 合肥工业大学 Complain disaggregated model, construction method, system, classification method and the system of text
CN110134786B (en) * 2019-05-14 2021-09-10 南京大学 Short text classification method based on subject word vector and convolutional neural network
CN110705304B (en) * 2019-08-09 2020-11-06 华南师范大学 Attribute word extraction method
CN110717015B (en) * 2019-10-10 2021-03-26 大连理工大学 Neural network-based polysemous word recognition method
CN112052334B (en) * 2020-09-02 2024-04-05 广州极天信息技术股份有限公司 Text interpretation method, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207899A (en) * 2013-03-19 2013-07-17 新浪网技术(中国)有限公司 Method and system for recommending text files
CN103970730A (en) * 2014-04-29 2014-08-06 河海大学 Method for extracting multiple subject terms from single Chinese text
JP5754018B2 (en) * 2011-07-11 2015-07-22 日本電気株式会社 Polysemy extraction system, polysemy extraction method, and program

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6510406B1 (en) * 1999-03-23 2003-01-21 Mathsoft, Inc. Inverse inference engine for high performance web search
WO2001090921A2 (en) * 2000-05-25 2001-11-29 Kanisa, Inc. System and method for automatically classifying text

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5754018B2 (en) * 2011-07-11 2015-07-22 日本電気株式会社 Polysemy extraction system, polysemy extraction method, and program
CN103207899A (en) * 2013-03-19 2013-07-17 新浪网技术(中国)有限公司 Method and system for recommending text files
CN103970730A (en) * 2014-04-29 2014-08-06 河海大学 Method for extracting multiple subject terms from single Chinese text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
The Representation of Polysemous Words; Devorah E. Klein et al.; Journal of Memory and Language; 2001-12-31; pp. 259-282 *
一种多义词词向量计算方法 (A word-vector computation method for polysemous words); 曾琦 (Zeng Qi) et al.; 小型微型计算机系统 (Journal of Chinese Computer Systems); 2016-07-31; Vol. 37, No. 7; pp. 1417-1421 *

Also Published As

Publication number Publication date
CN106909537A (en) 2017-06-30

Similar Documents

Publication Publication Date Title
CN106909537B (en) One-word polysemous analysis method based on topic model and vector space
CN113792818B (en) Intention classification method and device, electronic equipment and computer readable storage medium
CN112214995B (en) Hierarchical multitasking term embedded learning for synonym prediction
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN108733742B (en) Global normalized reader system and method
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN110377903B (en) Sentence-level entity and relation combined extraction method
CN107085581B (en) Short text classification method and device
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN108628823A (en) In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN104657350A (en) Hash learning method for short text integrated with implicit semantic features
CN110688489B (en) Knowledge graph deduction method and device based on interactive attention and storage medium
CN111881677A (en) Address matching algorithm based on deep learning model
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN114358007A (en) Multi-label identification method and device, electronic equipment and storage medium
CN113051399B (en) Small sample fine-grained entity classification method based on relational graph convolutional network
WO2023004528A1 (en) Distributed system-based parallel named entity recognition method and apparatus
CN104699797A (en) Webpage data structured analytic method and device
Sun et al. Probabilistic Chinese word segmentation with non-local information and stochastic training
CN114742069A (en) Code similarity detection method and device
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN114358020A (en) Disease part identification method and device, electronic device and storage medium
CN116720519B (en) Seedling medicine named entity identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant