CN106909537A - A polysemy analysis method based on a topic model and vector space - Google Patents
A polysemy analysis method based on a topic model and vector space
- Publication number
- CN106909537A CN106909537A CN201710067919.6A CN201710067919A CN106909537A CN 106909537 A CN106909537 A CN 106909537A CN 201710067919 A CN201710067919 A CN 201710067919A CN 106909537 A CN106909537 A CN 106909537A
- Authority
- CN
- China
- Prior art date: 2017-02-07
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/216 — Handling natural language data; natural language analysis; parsing; parsing using statistical methods
- G06F16/313 — Information retrieval of unstructured textual data; indexing; selection or weighting of terms for indexing
- G06F40/30 — Handling natural language data; semantic analysis
Abstract
The invention provides a polysemy analysis method based on a topic model and vector space, comprising: S1, taking formula (1) as the objective function, establish the polysemy topic model; S2, read the data of the whole document collection D; S3, initialize the topic-word distribution φ; S4, topic sampling; S5, topic vector updating; S6, word vector training; S7, execute S4 to S6 in a loop several times, so as to perform several iterations; S8, output and store the resulting word vectors and topic vectors; S9, judge whether a word is polysemous. The method can train higher-quality word vectors and topic vectors, so that they admit a more reasonable interpretation in the research and analysis of polysemy, and the performance of the topic model is also clearly superior to that of the original LDA model. Through cross-learning among the topic model, the word vectors and the topic vectors, the three mutually improve one another, and the method can be effectively applied to tasks such as similarity evaluation, document classification and topic relevance.
Description
Technical Field
The invention relates to the field of natural language processing, and in particular to a polysemy analysis method based on a topic model and a vector space.
Background
With the vigorous development of artificial intelligence technology, natural language processing has emerged as an innovative mode of language research that combines computer science, linguistics and mathematics into a single intelligent science, and it is widely applied in machine translation, question-answering systems, information retrieval, document processing and the like. Most words have more than one meaning, a phenomenon known as polysemy, so representing each word by a single word vector cannot resolve the ambiguity. To address this problem, context information or topic vectors have been used to assist the study of word ambiguity, but such work isolates the topic model, the word vectors and the topic vectors from one another, simply using existing results as prior knowledge to assist in training the model.
The topic model is used for mining the hidden topic information of a document collection; each topic represents a related concept, embodied as a series of related words, and is realized as a topic-word distribution. The word vector model maps each word to a low-dimensional real-valued space by using the context information in the text, and the vectors contain information such as syntax and semantics, so that the similarity of word vectors can be measured by Euclidean distance or the cosine of the included angle. The topic vector directly maps a topic into the vector space, approximately representing the semantic center of the topic.
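As a small illustration of measuring word-vector similarity with Euclidean distance or the cosine of the included angle, the following Python sketch compares two hypothetical embeddings; the vectors, their dimensionality and the "bank" example are illustrative assumptions, not trained values.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

# hypothetical 3-dimensional embeddings of "bank" in two different senses
v_bank_finance = np.array([0.8, 0.1, 0.3])
v_bank_river = np.array([0.1, 0.9, 0.2])
print(cosine_similarity(v_bank_finance, v_bank_river))  # low cosine: dissimilar
print(euclidean_distance(v_bank_finance, v_bank_river))
```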
The topic model, word vectors and topic vectors can all be used for document representation and are mainly applied to tasks such as document clustering and document classification. Each of the three has its own strengths in text mining, and research has shown that combining the global information of a topic model with the local information of word vectors helps improve on the original models. Such research nevertheless has serious limitations: most of it treats the three in isolation, or trains one or two of them independently and then uses the training result to improve the remaining one; or it directly uses the training result from a larger training set as external knowledge to assist model training on other, smaller data sets.
Disclosure of Invention
Aiming at the problems in the prior art, the invention models a text document collection and draws on the respective advantages of the topic model, word vectors and topic vectors to provide a polysemy analysis method based on a topic model and vector space, so as to better mine the hidden topic information of the document collection.
In order to achieve the purpose, the invention adopts the following technical scheme:
A polysemy analysis method based on a topic model and a vector space comprises the following steps:
S1, taking formula (1) as the objective function, establish the polysemy topic model:
where $D$ is the collection of text documents, $M$ is the number of documents in the collection, $N_m$ is the number of words in the $m$-th document, $c$ is the size of the context window, $w_{m,n}$ denotes the $n$-th word of the $m$-th document, $K$ denotes the number of topics, $t_k$ denotes the $k$-th topic vector, $\varphi$ denotes the topic-word distribution of the topic model, and $z_{m,n}$ denotes the topic number of $w_{m,n}$;
S2, read the data of the whole document collection $D$;
S3, topic-word distribution $\varphi$ initialization: first, use the GibbsLDA algorithm to perform topic sampling on each word in the text document collection $D$; then make an initialization estimate of the topic-word distribution $\varphi$ of the topic model;
S4, topic sampling: for each word $w_{m,n}$ in each document, calculate the probability that the word belongs to each topic, and sample the corresponding topic number $z_{m,n}\in[1,K]$ by the cumulative-distribution method;
S5, topic vector updating: for each topic vector $t_k$, $k\in[1,K]$, recompute its vector representation according to formula (5):

$$t_k = \frac{\sum_{m=1}^{M}\sum_{n=1}^{N_m}\mathbb{1}(z_{m,n}=k)\,v_{w_{m,n}}}{\sum_{w=1}^{W} n_{k,w}} \qquad (5)$$

where $\mathbb{1}(x)$ is the indicator function, equal to 1 when $x$ is true and 0 otherwise, $v_{w_{m,n}}$ denotes the word vector corresponding to $w_{m,n}$, $W$ denotes the vocabulary size of the document collection, and $n_{k,w}$ denotes the number of times word $w$ is assigned to topic $k$;
S6, word vector training: construct a Huffman tree whose leaf nodes are the words $w$ of the vocabulary and whose non-leaf nodes serve as auxiliary vectors $u$, and solve the objective function shown in formula (1) by stochastic gradient descent;
s7, circularly executing S4-S6 for a plurality of times to perform a plurality of iterations;
s8, outputting and storing the obtained word vector and the obtained theme vector;
s9, judging whether the word is ambiguous: splicing the word vector and the subject vector of the word to be analyzed to form a new vector representing the whole context environment, then calculating the cosine value of the new vector, and when the cosine value is smaller than a set threshold value, determining that the word has a word ambiguity phenomenon; otherwise, the word is determined not to have a word ambiguity phenomenon.
Further, in S3, the update rule used in the topic sampling process is as shown in formula (2):

$$p(z_{m,n}=k \mid \mathbf{z}_{-(m,n)}, \mathbf{w}) \propto \left(n_{m,k}^{-(m,n)} + \alpha\right)\cdot\frac{n_{k,w_{m,n}}^{-(m,n)} + \beta}{n_k^{-(m,n)} + W\beta} \qquad (2)$$

where $-(m,n)$ denotes excluding the current word from the counts, $W$ denotes the number of words in the text document collection $D$, $n_{m,k}$ denotes the number of words in the $m$-th document belonging to topic $k$, $z_{m,n}$ denotes the topic number assigned to $w_{m,n}$, $n_{k,w_{m,n}}$ denotes the number of times $w_{m,n}$ is assigned to topic $k$, $n_k$ denotes the number of all words assigned to topic $k$, and $\alpha$ is the symmetric Dirichlet hyperparameter;
the formula used for the initialization estimate of the topic-word distribution $\varphi$ of the topic model is formula (3):

$$\hat{\varphi}_{k,w} = \frac{n_{k,w} + \beta}{n_k + W\beta} \qquad (3)$$

where $\hat{\varphi}$ denotes the topic-word distribution of the initialization estimate and $\beta$ denotes the symmetric Dirichlet hyperparameter.
Further, in S4, the probability that the word belongs to each topic is calculated according to equation (4):
Further, in S6, the method specifically comprises the following steps:
S601, update the topic-word distribution $\varphi$: compute the gradient of each component of $\varphi$ according to formula (6); for each component, define the probability constraints $\varphi_{k,w}\ge 0$ and $\sum_{w=1}^{W}\varphi_{k,w}=1$;
where $L(w_{m,n+j})$ denotes the length of the path from the root node of the Huffman tree to the leaf node $w_{m,n+j}$ (the number of nodes, including the root node and the leaf node), $d_i$ denotes the Huffman code of the step $i\to i+1$ on the path, $\sigma(x)=1/(1+e^{-x})$, and $u_i$ denotes the $i$-th non-leaf node on the path;
S602, update the word vector $w$: calculate the gradient of each word according to formula (7) and update it using the auxiliary vectors;
S603, update the auxiliary vectors $u$ of the non-leaf nodes of the Huffman tree: calculate the non-leaf node vectors $u$ on the Huffman tree paths according to formula (8), so that they can influence the training quality of the word vectors $w$.
Further, in S9, the set threshold value is 0.6.
The polysemy analysis method based on a topic model and vector space provided by the invention can train higher-quality word vectors and topic vectors, so that they admit a more reasonable interpretation in the research and analysis of polysemy, and the performance of the topic model is clearly superior to that of the original LDA model. Through cross-learning among the topic model, the word vectors and the topic vectors, the three mutually improve one another, and the method can be effectively applied to tasks such as similarity evaluation, document classification and topic relevance.
Drawings
Fig. 1 is a schematic flowchart of the polysemy analysis method based on a topic model and vector space according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
In order to make full use of the intrinsic characteristics of the topic model, word vectors and topic vectors, to account for the pervasiveness of polysemy in text data, and to better mine the latent topic information of a document collection while training higher-quality word vectors and topic vectors, the invention provides a polysemy analysis method based on a topic model and vector space.
Specifically, the present invention makes the following reasonable assumptions according to the basic rules of natural language processing:
1. In the topic-word distribution $\varphi$ of the topic model, a series of words with higher probability can be used to represent a specific concept; the numerical meaning is the probability that a certain word appears under the topic, and the quality of the mined topics can be evaluated through topic relevance.
2. Each word in a text can be mapped into a low-dimensional real-valued vector space, i.e., a word vector, which contains information such as the syntax and semantics of the word; the differences between word vectors can be evaluated by mathematical means such as Euclidean distance or cosine.
3. The topic vectors and the topic-word distribution $\varphi$ of the topic model are not completely isolated from each other: a topic vector can be viewed as a mapping of the semantic center of the probability distribution into the word vector space, and it is closely associated with the word vectors.
Based on the above assumptions, the invention provides a polysemy analysis method based on a topic model and vector space; as shown in Fig. 1, the method comprises the following steps:
S1, taking formula (1) as the objective function, establish the polysemy topic model:
where $D$ is the collection of text documents, $M$ is the number of documents in the collection, $N_m$ is the number of words in the $m$-th document, $c$ is the size of the context window, $w_{m,n}$ denotes the $n$-th word of the $m$-th document, $K$ denotes the number of topics, $t_k$ denotes the $k$-th topic vector, $\varphi$ denotes the topic-word distribution of the topic model, and $z_{m,n}$ denotes the topic number of $w_{m,n}$;
S2, read the data of the whole text document collection $D$;
S3, topic-word distribution $\varphi$ initialization: first, use the GibbsLDA algorithm to perform topic sampling on each word in the text document collection $D$; then make an initialization estimate of the topic-word distribution $\varphi$ of the topic model;
The update rule used in the topic sampling process is shown in formula (2):

$$p(z_{m,n}=k \mid \mathbf{z}_{-(m,n)}, \mathbf{w}) \propto \left(n_{m,k}^{-(m,n)} + \alpha\right)\cdot\frac{n_{k,w_{m,n}}^{-(m,n)} + \beta}{n_k^{-(m,n)} + W\beta} \qquad (2)$$

where $-(m,n)$ denotes excluding the current word from the counts, $W$ denotes the number of words in the text document collection $D$, $n_{m,k}$ denotes the number of words in the $m$-th document belonging to topic $k$, $z_{m,n}$ denotes the topic number assigned to $w_{m,n}$, $n_{k,w_{m,n}}$ denotes the number of times $w_{m,n}$ is assigned to topic $k$, $n_k$ denotes the number of all words assigned to topic $k$, and $\alpha$ is the symmetric Dirichlet hyperparameter;
The formula used for the initialization estimate of the topic-word distribution $\varphi$ of the topic model is formula (3):

$$\hat{\varphi}_{k,w} = \frac{n_{k,w} + \beta}{n_k + W\beta} \qquad (3)$$

where $\hat{\varphi}$ denotes the topic-word distribution of the initialization estimate and $\beta$ denotes the symmetric Dirichlet hyperparameter;
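A minimal Python sketch of the S3 initialization under the count conventions defined around formulas (2) and (3) may look as follows; the corpus format (documents as lists of integer word ids), the hyperparameter defaults and the random initial assignment are illustrative assumptions, not the patent's exact implementation.

```python
import numpy as np

def init_topic_word(docs, K, W, alpha=0.1, beta=0.01, rng=None):
    """S3: random topic assignment followed by the initialization estimate of
    the topic-word distribution (formula (3))."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_mk = np.zeros((len(docs), K))   # words of document m assigned to topic k
    n_kw = np.zeros((K, W))           # word w assigned to topic k
    n_k = np.zeros(K)                 # all words assigned to topic k
    z = []                            # topic assignment z_{m,n} of every word
    for m, doc in enumerate(docs):
        z_m = []
        for w in doc:
            k = int(rng.integers(K))  # random initial topic for this occurrence
            z_m.append(k)
            n_mk[m, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
        z.append(z_m)
    # initialization estimate: phi_{k,w} = (n_{k,w} + beta) / (n_k + W * beta)
    phi = (n_kw + beta) / (n_k[:, None] + W * beta)
    return z, n_mk, n_kw, n_k, phi
```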
S4, topic sampling: for each word $w_{m,n}$ in each document, calculate the probability that the word belongs to each topic according to formula (4), and sample the corresponding topic number $z_{m,n}\in[1,K]$ by the cumulative-distribution method:

$$p(z_{m,n}=k \mid \mathbf{z}_{-(m,n)}, \mathbf{w}) \propto (n_{m,k} + \alpha)\,\varphi_{k,w_{m,n}} \qquad (4)$$
In the topic model, the latent variable $z$ is an indispensable intermediate bridge in the Gibbs sampling solution process, and it directly influences the topic-word distribution $\varphi$ that the topic model ultimately needs to obtain, as well as the document-topic distribution $\theta$. Unlike the original Gibbs update rule, the invention adopts formula (4) as the Gibbs update rule, which is characterized by using the topic-word distribution $\varphi$ directly in the update rule. The beneficial effect is that the statistical and practical meaning of the distribution $\varphi$ can be fully exploited, the calculation is faster, and the method is better suited to applications on large-scale data sets.
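The cumulative-distribution sampling of S4 can be sketched as below, using the reconstructed form of formula (4) in which $\varphi$ enters the update rule directly; the function name and signature are illustrative assumptions.

```python
import numpy as np

def sample_topic(n_mk_row, phi_col, alpha, rng):
    """S4: draw a topic number in [0, K) by inverting the cumulative
    distribution of the unnormalized topic probabilities."""
    p = (n_mk_row + alpha) * phi_col   # reconstructed formula (4), see lead-in
    cdf = np.cumsum(p)
    u = rng.random() * cdf[-1]         # uniform draw scaled to the total mass
    return int(np.searchsorted(cdf, u))
```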
S5, topic vector updating: for each topic vector $t_k$, $k\in[1,K]$, recompute its vector representation according to formula (5):

$$t_k = \frac{\sum_{m=1}^{M}\sum_{n=1}^{N_m}\mathbb{1}(z_{m,n}=k)\,v_{w_{m,n}}}{\sum_{w=1}^{W} n_{k,w}} \qquad (5)$$

where $\mathbb{1}(x)$ is the indicator function, equal to 1 when $x$ is true and 0 otherwise, $v_{w_{m,n}}$ denotes the word vector corresponding to $w_{m,n}$, $W$ denotes the vocabulary size of the text document collection, and $n_{k,w}$ denotes the number of times word $w$ is assigned to topic $k$. That a word may belong to different topics is itself a subtle manifestation of polysemy. Furthermore, the calculation and updating of the topic vectors can be carried out simultaneously without mutual interference.
The primary purpose of the topic vector is to represent the latent topic information of the document collection in vector space, rather than as a multinomial distribution like $\varphi$, so that a topic acquires more spatial-geometric meaning and combines more closely with the word vectors. Unlike the TWE model, which trains topic vectors in a Skip-Gram-like manner, the invention directly updates the vector representation of each topic by formula (5). This update is characterized by the fact that the vector representation of a topic is related only to the words under that topic; its advantages are that the calculation is simple, easy to understand, fast and efficient, and that taking the mean brings the vector close to the geometric center of the word vectors, which by assumption 2 can be regarded approximately as the semantic center of the topic concept.
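A sketch of the S5 update as the mean over the word vectors of the occurrences assigned to each topic, i.e. the geometric-center reading of formula (5); the data layout follows the assumptions of the earlier sketches.

```python
import numpy as np

def update_topic_vectors(docs, z, word_vecs, K):
    """S5: recompute each topic vector t_k as the mean of the word vectors of
    all word occurrences currently assigned to topic k (formula (5))."""
    sums = np.zeros((K, word_vecs.shape[1]))
    counts = np.zeros(K)
    for m, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[m][n]               # current topic assignment of w_{m,n}
            sums[k] += word_vecs[w]   # accumulate the word vector v_{w_{m,n}}
            counts[k] += 1
    counts[counts == 0] = 1           # guard against topics with no words
    return sums / counts[:, None]     # the mean approximates the semantic center
```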
S6, word vector training: construct a Huffman tree whose leaf nodes are the words $w$ of the vocabulary and whose non-leaf nodes serve as auxiliary vectors $u$, and solve the objective function shown in formula (1) by stochastic gradient descent; specifically, S6 includes the following steps:
S601, update the topic-word distribution $\varphi$: compute the gradient of each component of $\varphi$ according to formula (6); note that since the distribution has a probabilistic meaning, each component must satisfy the constraints $\varphi_{k,w}\ge 0$ and $\sum_{w=1}^{W}\varphi_{k,w}=1$;
where $L(w_{m,n+j})$ denotes the length of the path from the root node of the Huffman tree to the leaf node $w_{m,n+j}$ (the number of nodes, including the root node and the leaf node), $d_i$ denotes the Huffman code of the step $i\to i+1$ on the path, $\sigma(x)=1/(1+e^{-x})$, and $u_i$ denotes the $i$-th non-leaf node on the path;
S602, update the word vector $w$: calculate the gradient of each word according to formula (7) and update it using the auxiliary vectors;
S603, update the auxiliary vectors $u$ of the non-leaf nodes of the Huffman tree. This step mainly calculates the non-leaf node vectors $u$ on the Huffman tree paths according to formula (8), so that they influence the training quality of the leaf nodes (i.e., the word vectors $w$);
The computational complexity of the softmax function is linear in the vocabulary size $W$, which is unfavorable for training on large-scale data sets. The invention therefore follows the approximate calculation method of Skip-Gram and adopts the idea of hierarchical softmax to construct a Huffman tree, whose leaf nodes are the words $w$ of the vocabulary and whose non-leaf nodes serve as auxiliary vectors $u$. In the word vector training stage, the invention solves the objective function shown in formula (1) by stochastic gradient descent; the characteristic is that the topic-word distribution $\varphi$ shown in formula (6) directly uses the topic vectors $t_k$ and the non-leaf node vectors $u$ of the Huffman tree. The beneficial effect is that the topic-word distribution $\varphi$ continuously absorbs the information of the topic vectors $t_k$ in its iterative updates and exchanges information with the word vectors through the auxiliary vectors $u$, so that the update of $\varphi$ directly or indirectly exploits the spatial characteristics of the topic vectors and word vectors.
Further, the calculation of the update gradients of the node vectors in the Huffman tree is shown in formulas (7) and (8), respectively; the characteristic is that the update of the non-leaf vectors $u$ directly uses the topic vectors and the topic distribution $\varphi$, while the update of the leaf nodes $w$ directly uses the non-leaf vectors. The advantage is that the topic vectors and the topic distribution $\varphi$ permeate the non-leaf nodes (i.e., the branches) of the whole Huffman tree, so that the leaf nodes are influenced at a deeper level by both leaf and branch nodes, achieving a mutually reinforcing effect.
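Since formulas (6)-(8) are not reproduced in this text, the following sketch shows only the general shape of one hierarchical-softmax gradient step along a Huffman path; the way the topic vector is combined with the word vector (here simply added into the hidden representation) is an assumption, not the patent's exact equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_update(v_w, t_k, path_nodes, path_codes, U, lr=0.025):
    """One stochastic-gradient step for a (word, context word) pair along the
    context word's Huffman path. U holds the non-leaf auxiliary vectors and is
    updated in place; the returned vector is the accumulated gradient the
    caller adds to the word (and, under this sketch's assumption, topic)
    representation."""
    h = v_w + t_k                      # ASSUMED combination of word and topic vector
    e = np.zeros_like(h)
    for i, d in zip(path_nodes, path_codes):
        f = sigmoid(np.dot(U[i], h))   # branch probability at this non-leaf node
        g = lr * (1.0 - d - f)         # standard hierarchical-softmax gradient scale
        e += g * U[i]                  # gradient flowing back to h (formula (7) analogue)
        U[i] += g * h                  # auxiliary vector update (formula (8) analogue)
    return e
```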
S7, execute S4 to S6 in a loop several times to perform several iterations; the main purpose of the iterations is to further improve the topic model, the word vectors and the topic vectors through cross-learning.
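Tying the steps together, a driver loop for S2-S8 might look as follows, reusing the illustrative helpers from the earlier sketches (init_topic_word, sample_topic, update_topic_vectors, hs_update); the schedule, dimensions and output file names are assumptions.

```python
import numpy as np

def train(docs, K, W, dim=100, iters=10, alpha=0.1, beta=0.01):
    rng = np.random.default_rng(0)
    z, n_mk, n_kw, n_k, phi = init_topic_word(docs, K, W, alpha, beta, rng)  # S3
    word_vecs = (rng.random((W, dim)) - 0.5) / dim    # small random word vectors
    topic_vecs = np.zeros((K, dim))
    for _ in range(iters):                            # S7: cross-learning iterations
        for m, doc in enumerate(docs):                # S4: resample every z_{m,n}
            for n, w in enumerate(doc):
                n_mk[m, z[m][n]] -= 1                 # remove the current assignment
                z[m][n] = sample_topic(n_mk[m], phi[:, w], alpha, rng)
                n_mk[m, z[m][n]] += 1
        topic_vecs = update_topic_vectors(docs, z, word_vecs, K)   # S5
        # S6 would run the Huffman-tree stochastic-gradient pass here (see
        # hs_update above) and refresh phi from its gradient step.
    np.save("word_vectors.npy", word_vecs)            # S8: persist the results
    np.save("topic_vectors.npy", topic_vecs)
    return word_vecs, topic_vecs
```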
S8, output and store the resulting word vectors and topic vectors;
S9, judging whether a word is polysemous: concatenate the word vector and the topic vector of the word to be analyzed to form a new vector representing the whole context; then calculate the cosine value between the new vectors obtained in different contexts; when the cosine value is smaller than the set threshold (e.g., 0.60), the word exhibits polysemy; otherwise, the meaning of the word is consistent across the given contexts and the word exhibits no polysemy.
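A sketch of the S9 decision: the same word's concatenated word-plus-topic representation is built in two different contexts and the two representations are compared by cosine against the 0.6 threshold from the text; the two-context framing and names are illustrative assumptions.

```python
import numpy as np

def is_polysemous(v_w, t_context1, t_context2, threshold=0.6):
    """S9: concatenate the word vector with the topic vector of each context,
    then compare the two context representations by cosine."""
    a = np.concatenate([v_w, t_context1])
    b = np.concatenate([v_w, t_context2])
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos < threshold             # below the threshold: the senses diverge
```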
The above-described embodiments express only several embodiments of the present invention, and their description is comparatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and all of these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (5)
1. A polysemy analysis method based on a topic model and a vector space, characterized by comprising the following steps:
S1, taking formula (1) as the objective function, establish the polysemy topic model:
where $D$ is the collection of text documents, $M$ is the number of documents in the collection, $N_m$ is the number of words in the $m$-th document, $c$ is the size of the context window, $w_{m,n}$ denotes the $n$-th word of the $m$-th document, $K$ denotes the number of topics, $t_k$ denotes the $k$-th topic vector, $\varphi$ denotes the topic-word distribution of the topic model, and $z_{m,n}$ denotes the topic number of $w_{m,n}$;
S2, read the data of the whole document collection $D$;
S3, topic-word distribution $\varphi$ initialization: first, use the GibbsLDA algorithm to perform topic sampling on each word in the text document collection $D$; then make an initialization estimate of the topic-word distribution $\varphi$ of the topic model;
S4, topic sampling: for each word $w_{m,n}$ in each document, calculate the probability that the word belongs to each topic, and sample the corresponding topic number $z_{m,n}\in[1,K]$ by the cumulative-distribution method;
S5, topic vector updating: for each topic vector $t_k$, $k\in[1,K]$, recompute its vector representation according to formula (5):

$$t_k = \frac{\sum_{m=1}^{M}\sum_{n=1}^{N_m}\mathbb{1}(z_{m,n}=k)\,v_{w_{m,n}}}{\sum_{w=1}^{W} n_{k,w}} \qquad (5)$$

where $\mathbb{1}(x)$ is the indicator function, equal to 1 when $x$ is true and 0 otherwise, $v_{w_{m,n}}$ denotes the word vector corresponding to $w_{m,n}$, $W$ denotes the vocabulary size of the document collection, and $n_{k,w}$ denotes the number of times word $w$ is assigned to topic $k$;
S6, word vector training: construct a Huffman tree whose leaf nodes are the words $w$ of the vocabulary and whose non-leaf nodes serve as auxiliary vectors $u$, and solve the objective function shown in formula (1) by stochastic gradient descent;
S7, execute S4 to S6 in a loop several times, so as to perform several iterations;
S8, output and store the resulting word vectors and topic vectors;
S9, judging whether a word is polysemous: concatenate the word vector and the topic vector of the word to be analyzed to form a new vector representing the whole context, then calculate the cosine value between the new vectors; when the cosine value is smaller than a set threshold, the word is judged to exhibit polysemy; otherwise, the word is judged not to exhibit polysemy.
2. The analysis method according to claim 1, characterized in that in S3, the update rule used in the topic sampling process is as shown in formula (2):

$$p(z_{m,n}=k \mid \mathbf{z}_{-(m,n)}, \mathbf{w}) \propto \left(n_{m,k}^{-(m,n)} + \alpha\right)\cdot\frac{n_{k,w_{m,n}}^{-(m,n)} + \beta}{n_k^{-(m,n)} + W\beta} \qquad (2)$$

where $-(m,n)$ denotes excluding the current word from the counts, $W$ denotes the number of words in the text document collection $D$, $n_{m,k}$ denotes the number of words in the $m$-th document belonging to topic $k$, $z_{m,n}$ denotes the topic number assigned to $w_{m,n}$, $n_{k,w_{m,n}}$ denotes the number of times $w_{m,n}$ is assigned to topic $k$, $n_k$ denotes the number of all words assigned to topic $k$, and $\alpha$ is the symmetric Dirichlet hyperparameter;
the formula used for the initialization estimate of the topic-word distribution $\varphi$ of the topic model is formula (3):

$$\hat{\varphi}_{k,w} = \frac{n_{k,w} + \beta}{n_k + W\beta} \qquad (3)$$

where $\hat{\varphi}$ denotes the topic-word distribution of the initialization estimate and $\beta$ denotes the symmetric Dirichlet hyperparameter.
3. The analysis method according to claim 1, characterized in that in S4, the probability that the word belongs to each topic is calculated according to formula (4):

$$p(z_{m,n}=k \mid \mathbf{z}_{-(m,n)}, \mathbf{w}) \propto (n_{m,k} + \alpha)\,\varphi_{k,w_{m,n}} \qquad (4)$$
4. The analysis method according to claim 1, characterized in that S6 specifically comprises the following steps:
S601, update the topic-word distribution $\varphi$: compute the gradient of each component of $\varphi$ according to formula (6); for each component, define the probability constraints $\varphi_{k,w}\ge 0$ and $\sum_{w=1}^{W}\varphi_{k,w}=1$;
where $L(w_{m,n+j})$ denotes the length of the path from the root node of the Huffman tree to the leaf node $w_{m,n+j}$ (the number of nodes, including the root node and the leaf node), $d_i$ denotes the Huffman code of the step $i\to i+1$ on the path, $\sigma(x)=1/(1+e^{-x})$, and $u_i$ denotes the $i$-th non-leaf node on the path;
S602, update the word vector $w$: calculate the gradient of each word according to formula (7) and update it using the auxiliary vectors;
S603, update the auxiliary vectors $u$ of the non-leaf nodes of the Huffman tree: calculate the non-leaf node vectors $u$ on the Huffman tree paths according to formula (8), so that they can influence the training quality of the word vectors $w$.
5. the analysis method according to claim 1, wherein in S9, the set threshold value is 0.6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710067919.6A CN106909537B (en) | 2017-02-07 | 2017-02-07 | One-word polysemous analysis method based on topic model and vector space |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106909537A true CN106909537A (en) | 2017-06-30 |
CN106909537B CN106909537B (en) | 2020-04-07 |
Family
ID=59208107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710067919.6A Active CN106909537B (en) | 2017-02-07 | 2017-02-07 | One-word polysemous analysis method based on topic model and vector space |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106909537B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030217047A1 (en) * | 1999-03-23 | 2003-11-20 | Insightful Corporation | Inverse inference engine for high performance web search |
US20020022956A1 (en) * | 2000-05-25 | 2002-02-21 | Igor Ukrainczyk | System and method for automatically classifying text |
JP5754018B2 (en) * | 2011-07-11 | 2015-07-22 | 日本電気株式会社 | Polysemy extraction system, polysemy extraction method, and program |
CN103207899A (en) * | 2013-03-19 | 2013-07-17 | 新浪网技术(中国)有限公司 | Method and system for recommending text files |
CN103970730A (en) * | 2014-04-29 | 2014-08-06 | 河海大学 | Method for extracting multiple subject terms from single Chinese text |
Non-Patent Citations (2)
Title |
---|
Devorah E. Klein et al.: "The Representation of Polysemous Words", Journal of Memory and Language |
Zeng Qi et al.: "A Polysemous Word Vector Computation Method" (一种多义词词向量计算方法), Journal of Chinese Computer Systems (小型微型计算机系统) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108629009A (en) * | 2018-05-04 | 2018-10-09 | 南京信息工程大学 | Topic relativity modeling method based on FrankCopula functions |
CN108984526A (en) * | 2018-07-10 | 2018-12-11 | 北京理工大学 | A kind of document subject matter vector abstracting method based on deep learning |
CN108920467A (en) * | 2018-08-01 | 2018-11-30 | 北京三快在线科技有限公司 | Polysemant lexical study method and device, search result display methods |
CN109376226A (en) * | 2018-11-08 | 2019-02-22 | 合肥工业大学 | Complain disaggregated model, construction method, system, classification method and the system of text |
CN110134786A (en) * | 2019-05-14 | 2019-08-16 | 南京大学 | A kind of short text classification method based on theme term vector and convolutional neural networks |
CN110705304A (en) * | 2019-08-09 | 2020-01-17 | 华南师范大学 | Attribute word extraction method |
CN110705304B (en) * | 2019-08-09 | 2020-11-06 | 华南师范大学 | Attribute word extraction method |
CN110717015A (en) * | 2019-10-10 | 2020-01-21 | 大连理工大学 | Neural network-based polysemous word recognition method |
CN112052334A (en) * | 2020-09-02 | 2020-12-08 | 广州极天信息技术股份有限公司 | Text paraphrasing method, text paraphrasing device and storage medium |
CN112052334B (en) * | 2020-09-02 | 2024-04-05 | 广州极天信息技术股份有限公司 | Text interpretation method, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106909537B (en) | 2020-04-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |