CN106372061A - Short text similarity calculation method based on semantics - Google Patents
- Publication number: CN106372061A
- Application number: CN201610817910.8A
- Authority
- CN
- China
- Prior art keywords
- semantic
- short text
- word
- similarity
- tree
- Prior art date
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a semantics-based short text similarity calculation method, comprising the following steps: preprocessing corpus data, building word embeddings, constructing a word semantic tree, calculating the semantic similarity between words in the short texts, and calculating the semantic similarity between the short texts. On the basis of deep-learning word embeddings, a hierarchical clustering method is used to create the word semantic tree and calculate the similarity between words in the short texts; on this basis, various characteristics of the short texts are combined to calculate the semantic similarity between the short texts. This effectively overcomes the prior-art defect that word semantic trees cannot describe the semantic relationship between new words and known words.
Description
Technical field
The invention belongs to the technical field of short text similarity measurement, and more particularly relates to a semantics-based short text similarity calculation method.
Background technology
Semantic similarity measurement between short texts has theoretical research value and application background in fields such as artificial intelligence, natural language processing, cognitive science, semantics, psychology, and bioinformatics. Short text similarity can effectively reduce information redundancy in a corpus. At present, many studies show that short text similarity calculation can support numerous natural language processing tasks, such as event detection, information retrieval, text normalization, automatic text summarization, and text classification and clustering. The applications of short text similarity calculation are extensive, and a good semantic similarity calculation method can considerably improve the performance of many existing systems.
At present, there are many methods for calculating short text similarity, which can be broadly divided into the following classes: methods based on semantic dictionaries, methods based on corpora, methods based on features, and methods based on Internet resources. Methods based on semantic dictionaries use a semantic dictionary, such as WordNet, PPDB, or FrameNet, to calculate the semantic similarity between words, and then integrate the word-level similarities to obtain the text-level semantic similarity. Methods based on corpora perform statistical analysis on a large-scale text collection; typical methods include LSA (Latent Semantic Analysis) and HAL (Hyperspace Analogue to Language). Feature-based methods attempt to represent a short text with features defined in advance, and then obtain the semantic similarity of short texts through a classifier. Most methods based on Internet resources use the results returned by a search engine to enrich the contextual information of a short text, or to calculate the similarity of words or entities, and thereby calculate the semantic similarity of short texts.
Methods based on semantic dictionaries depend heavily on the completeness of the dictionary being queried, because a short text may contain words that do not exist in the dictionary, making it impossible to calculate the semantic similarity between that short text and others. Moreover, the polysemy of words in the dictionary affects the accuracy of semantic similarity calculation. The difficulty of feature-based methods lies in how to define effective features and obtain their values automatically; in addition, features are easy to define for concrete concepts but relatively difficult for abstract ones. Methods based on Internet resources are very sensitive to the results returned by the search engine and cannot obtain stable semantic similarities. Furthermore, the co-occurrence information in search engine results can only partially reflect the relation between two words, and the precision of syntactic templates automatically extracted from result snippets is difficult to guarantee. The drawback of HAL is that the word-word matrix it constructs cannot capture the meaning of the whole text well. LSA may not handle new words appearing in a short text; moreover, the LSA vector representation of a short text is very sparse, which affects the precision of similarity calculation, and it cannot represent some of the syntactic information in the short text.
With the rise of neural networks and deep learning, the traditional word vector space can be converted into a word embedding vector space, which compensates for the sparsity and noise of short texts in the term vector space and allows unsupervised and supervised learning processes to be combined seamlessly. This opens a new direction for short text semantic similarity calculation and has become a development trend.
Short texts differ from long texts such as news articles and magazines: because of their short length, individual noise words seriously interfere with parsing the semantics of the whole short text. Therefore, semantic similarity measurement models and methods designed for regular text may not be effective for short texts.
Summary of the invention
The purpose of the invention is to solve the problem that the prior art cannot effectively handle the serious semantic interference that individual noise words, caused by the short length of short texts, introduce into parsing the semantics of a whole short text. To this end, the invention proposes a semantics-based short text similarity calculation method.
The technical scheme of the invention is a semantics-based short text similarity calculation method, comprising the following steps:
A. Preprocess the corpus data and build word embeddings according to word2vec hyperparameters;
B. Build a word semantic tree of the corpus using hierarchical clustering;
C. Calculate the semantic similarity between words in the short texts according to the inconsistency rate of each link in the word semantic tree of step B;
D. Calculate the semantic similarity between short texts according to the word-level semantic similarities of step C.
Further, the preprocessing of the corpus data in step A is as follows: convert all words in the corpus to lowercase and perform tokenization; at the same time, select the words that occur more than n times in the corpus to build the vocabulary of the corpus, where n is a preset occurrence threshold.
Further, building word embeddings according to word2vec hyperparameters in step A is as follows: train the CBOW and SG models of word2vec with different hyperparameters; use cosine distance as the semantic similarity of the word embeddings and take the three most similar words of each word as its synonyms; then, using WordNet as the synonym knowledge base, determine by precision, recall, and F1 score the word2vec hyperparameters that best fit the word semantics of this corpus, and build the word embeddings accordingly. Here, the precision p is the ratio of the number of synonym pairs correctly predicted by the word embeddings to the total number of predicted pairs, the recall r is the ratio of the number of synonym pairs correctly predicted by the word embeddings to the number of synonym pairs appearing in WordNet, and the F1 score is

f1 = 2pr / (p + r)
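This hyperparameter-selection criterion can be sketched as follows. The tiny embedding matrix and gold synonym set below are illustrative stand-ins, not the patent's data; the real pipeline would use trained word2vec vectors and WordNet pairs:

```python
import numpy as np

def top3_synonyms(words, vectors):
    """For each word, predict its 3 most cosine-similar words as synonyms."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v.T
    np.fill_diagonal(sims, -np.inf)          # a word is not its own synonym
    pred = {}
    for i, w in enumerate(words):
        top = np.argsort(sims[i])[::-1][:3]
        pred[w] = {words[j] for j in top}
    return pred

def prf1(pred, gold):
    """Precision, recall, and F1 of predicted synonym pairs against gold pairs."""
    pred_pairs = {(w, s) for w, syns in pred.items() for s in syns}
    tp = len(pred_pairs & gold)
    p = tp / len(pred_pairs) if pred_pairs else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Each candidate hyperparameter setting would be trained, scored with `prf1`, and the setting with the best F1 kept.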
Further, in step B the word semantic tree of the corpus is built by hierarchical clustering, as follows: use the SimLex-999 data set to determine the distance metric and the linkage metric, and build the word semantic tree of the corpus by hierarchical clustering with the determined distance and linkage metrics.
Further, the formula used in step C to calculate the semantic similarity between words in the short texts is

sim(w1, w2) = 1 - inconsistent(link(w1, w2)) / inconsistent(tree)_threshold

where w1 and w2 denote words, link(w1, w2) denotes the lowest common ancestor node of the two words, inconsistent(tree)_threshold denotes the threshold on the inconsistency rate of links in the hierarchical clustering tree, and inconsistent denotes the inconsistency rate of a link.
Further, calculating the semantic similarity between short texts according to the word-level semantic similarities in step D comprises the following sub-steps:
D1. Preprocess the short texts t1 and t2: remove punctuation marks and special symbols, and convert every word in the short texts to lowercase;
D2. Calculate the semantic similarity sij between each word wi of short text t1 and each word wj of short text t2;
D3. Calculate the semantic similarity of t1 and t2 by the formula

sim(t1, t2) = (sum(rows) + sum(columns)) / (||s_row≠0|| + ||s_column≠0||)

where sum(rows) is the sum of the maxima of the rows of the semantic similarity matrix s of t1 and t2 whose elements are not all zero, sum(columns) is the sum of the maxima of the columns of s whose elements are not all zero, ||s_row≠0|| is the number of non-zero rows of s, and ||s_column≠0|| is the number of non-zero columns of s.
The invention has the following beneficial effects:
1. The word semantic tree of the invention is obtained by reasonable hierarchical clustering of word embeddings based on a deep neural network, and is more easily extensible than existing word semantic trees; for a different corpus, the corresponding word semantic tree can be built rapidly and covers a larger vocabulary, which overcomes the shortcoming that word semantic trees such as WordNet and Chinese synonym thesauri cannot describe the semantic relationship between new words and known words;
2. The semantic similarity calculation method proposed by the invention uses a manually annotated synonym data set to determine the link inconsistency-rate threshold of the hierarchical clustering word semantic tree, which reduces the disproportionate differentiation of semantic similarity caused by extreme inconsistency-rate values and improves the precision of semantic similarity calculation;
3. The method proposed by the invention for calculating short text semantic similarity from word semantic similarity is simple and effective; any short text data set can be processed by adjusting the training corpus, and different parts of speech of similar words can be recognized, so part-of-speech matching between words need not be considered, and short texts that are similar under varied sentence patterns are identified more concisely.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the semantics-based short text similarity calculation method of the invention.
Fig. 2 is a schematic diagram of the hierarchical clustering word semantic tree in an embodiment of the invention.
Specific embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is further elaborated below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
Fig. 1 is a schematic flow chart of the semantics-based short text similarity calculation method of the invention. The method comprises the following steps:
A. Preprocess the corpus data and build word embeddings according to word2vec hyperparameters;
B. Build a word semantic tree of the corpus using hierarchical clustering;
C. Calculate the semantic similarity between words in the short texts according to the inconsistency rate of each link in the word semantic tree of step B;
D. Calculate the semantic similarity between short texts according to the word-level semantic similarities of step C.
In step A, the corpus data must first be preprocessed: all words in the corpus are converted to lowercase and tokenized. To ensure the quality of the word embeddings, the invention selects the words that occur more than n times in the corpus to build the vocabulary, where n is a preset occurrence threshold; preferably, n is set to 10, i.e., the words occurring more than 10 times in the corpus form the vocabulary.
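The preprocessing step above can be sketched as follows; the whitespace tokenizer is an illustrative stand-in for whatever tokenizer the corpus actually requires:

```python
from collections import Counter

def preprocess(corpus, n=10):
    """Lowercase and tokenize a corpus, then keep words occurring more than n times."""
    tokenized = [doc.lower().split() for doc in corpus]
    counts = Counter(w for doc in tokenized for w in doc)
    vocab = {w for w, c in counts.items() if c > n}
    return tokenized, vocab
```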
The invention uses the word2vec tool to train the word embeddings according to the word2vec hyperparameters: the CBOW and SG models of word2vec are trained with different hyperparameters (here, the context window size, the vector dimensionality, etc.); cosine distance is then used as the semantic similarity of the word embeddings, and the three most similar words of each word are taken as its synonyms; using WordNet as the synonym knowledge base, the word2vec hyperparameters that best fit the word semantics of this corpus are determined by precision, recall, and F1 score, and the word embeddings are built accordingly. Here, the precision p is the ratio of the number of synonym pairs correctly predicted by the word embeddings to the total number of predicted pairs, the recall r is the ratio of the number of synonym pairs correctly predicted by the word embeddings to the number of synonym pairs appearing in WordNet, and the F1 score is f1 = 2pr / (p + r). Preferably, the word2vec hyperparameters determined by the invention are: dimensionality 300, context window size 32, 5 iterations, and a negative sampling rate of 5.
In step B, as shown in Fig. 2, a schematic diagram of the hierarchical clustering word semantic tree in an embodiment of the invention, the invention dynamically builds the word semantic tree of the corpus by hierarchical clustering, so that semantically similar words are adjacent in the tree. Agglomerative hierarchical clustering uses a bottom-up strategy: typically, it starts with each object forming its own cluster and iteratively merges clusters into larger and larger ones, until all objects are in a single cluster or some termination condition is met; this single cluster becomes the root of the hierarchy. In each merging step, the two closest clusters according to some similarity measure are found and merged into one cluster. Because each iteration merges two clusters, and each cluster contains at least one object, the agglomerative method needs at most n iterations.
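The bottom-up procedure described above is what SciPy's agglomerative clustering implements; a minimal sketch on toy word vectors (the four 2-D vectors are illustrative stand-ins for 300-dimensional word2vec embeddings):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

# Toy 2-D "embeddings" for four words.
words = ["car", "auto", "cat", "dog"]
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# Each row of z encodes one merge: the indices of the two merged clusters,
# the merge distance, and the size of the new cluster.
z = linkage(vecs, method="average", metric="euclidean")

root = to_tree(z)   # root node of the word semantic tree
print(root.count)   # number of leaf words under the root -> 4
```

With n words there are exactly n - 1 merges, matching the "at most n iterations" bound above.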
When building the word semantic tree of the corpus, the distance metric between word embeddings must be determined; here the invention considers Euclidean distance, cosine distance, and Manhattan distance. Likewise, in the agglomerative method the measure of inter-cluster distance is the linkage metric; here, for the different word vector distance metrics, the invention analyzes the influence of linkage metrics such as average linkage, centroid linkage, and complete (maximum) linkage on the quality of the agglomerative hierarchical clustering tree. The invention uses the SimLex-999 data set to assess the quality of the agglomerative hierarchical clustering trees generated by the different distance and linkage metrics, and thereby builds a high-quality word semantic tree. The SimLex-999 data set contains 999 pairs of English words whose synonymous similarity and semantic relatedness are manually annotated. Based on this data set, the invention determines suitable distance and linkage metrics according to Spearman rank correlation analysis, so as to build a high-quality hierarchical clustering tree.
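One plausible sketch of this metric selection scores each candidate (distance metric, linkage metric) pair by the Spearman correlation between human similarity judgments and distances in the resulting tree; the cophenetic distance is used here as the tree distance, and the toy annotated pairs below stand in for SimLex-999, which this sketch does not ship:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.stats import spearmanr

def tree_quality(vecs, pair_idx, gold_scores, metric, method):
    """Spearman correlation between human similarity scores and (negated)
    cophenetic distances in the tree built with the given metrics."""
    z = linkage(vecs, method=method, metric=metric)
    coph = cophenet(z)                 # condensed cophenetic distance matrix
    n = vecs.shape[0]
    def cidx(i, j):                    # index of pair (i, j) in the condensed matrix
        i, j = min(i, j), max(i, j)
        return n * i + j - ((i + 2) * (i + 1)) // 2
    tree_sim = [-coph[cidx(i, j)] for i, j in pair_idx]
    return spearmanr(tree_sim, gold_scores).correlation

# Toy stand-ins: four word vectors and three human-annotated pairs.
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
pairs = [(0, 1), (0, 2), (2, 3)]
gold = [9.0, 1.0, 8.5]
rho = tree_quality(vecs, pairs, gold, "euclidean", "average")
```

The candidate with the highest correlation would be kept.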
The word semantic tree of the invention is obtained by reasonable hierarchical clustering of the word embeddings based on a deep neural network, and is more easily extensible than existing word semantic trees; for a different corpus, the corresponding word semantic tree can be built rapidly and covers a larger vocabulary, overcoming the shortcoming that word semantic trees such as WordNet and Chinese synonym thesauri cannot describe the semantic relationship between new words and known words.
In step C, the invention designs a new word semantic similarity calculation method using the semantic tree built by hierarchical clustering. In the hierarchical clustering word semantic tree, the leaf nodes represent words, and each internal node (including the root) represents a link; the inconsistency rate of each link indicates the degree of consistency among the members of that link. The invention calculates the semantic similarity between words according to the inconsistency rate of each link in the hierarchical clustering tree, by the formula

sim(w1, w2) = 1 - inconsistent(link(w1, w2)) / inconsistent(tree)_threshold

where w1 and w2 denote words, link(w1, w2) denotes the lowest common ancestor node of the two words, inconsistent(tree)_threshold denotes the threshold on the inconsistency rate of links in the hierarchical clustering tree, and inconsistent denotes the inconsistency rate of a link; if the inconsistency rate exceeds the given threshold, it is set equal to the threshold. To improve the precision of the semantic similarity, the invention considers two words completely unrelated when the inconsistency rate of their link exceeds a certain value, and therefore sets an inconsistency-rate threshold; because the maximum of the inconsistency rate may be very large, this truncation effectively improves the precision of the semantic similarity. The WS-353 data set contains 353 pairs of English words whose semantic similarity is manually annotated; based on this data set, the invention determines the inconsistency-rate threshold of the hierarchical clustering tree according to Spearman rank correlation analysis.
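A sketch of this word-level similarity, assuming (as one plausible reading of the truncation described above) that the similarity is one minus the link's inconsistency rate normalized by the threshold; the vectors and the threshold value 2.0 are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent

def word_similarity(z, i, j, incons, threshold):
    """Similarity of leaf words i and j from the inconsistency rate of the
    link at their lowest common ancestor in the hierarchy z."""
    n = z.shape[0] + 1
    parent = {}
    for k, row in enumerate(z):           # record each cluster's parent link
        parent[int(row[0])] = n + k
        parent[int(row[1])] = n + k
    def ancestors(x):
        path = []
        while x in parent:
            x = parent[x]
            path.append(x)
        return path
    anc_i = set(ancestors(i))
    lca = next(a for a in ancestors(j) if a in anc_i)   # lowest common ancestor link
    rate = min(incons[lca - n, 3], threshold)           # column 3: inconsistency coefficient
    return 1.0 - rate / threshold

# Toy 2-D stand-ins for word embeddings.
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
z = linkage(vecs, method="average")
incons = inconsistent(z)   # default depth d = 2
```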
The semantic similarity calculation method of the invention uses a manually annotated synonym data set to determine the link inconsistency-rate threshold of the hierarchical clustering word semantic tree, which reduces the disproportionate differentiation of semantic similarity caused by extreme inconsistency-rate values and improves the precision of semantic similarity calculation.
In step D, let the two short texts be t1 and t2. The semantic similarity between words in the short texts is calculated according to step C, and the semantic similarity between the short texts is then calculated through the following sub-steps:
D1. Preprocess the short texts t1 and t2: remove punctuation marks and special symbols, and convert every word in the short texts to lowercase;
D2. Calculate the semantic similarity sij between each word wi of short text t1 and each word wj of short text t2;
D3. Calculate the semantic similarity of t1 and t2 by the formula

sim(t1, t2) = (sum(rows) + sum(columns)) / (||s_row≠0|| + ||s_column≠0||)

where sum(rows) is the sum of the maxima of the rows of the semantic similarity matrix s of t1 and t2 whose elements are not all zero, sum(columns) is the sum of the maxima of the columns of s whose elements are not all zero, ||s_row≠0|| is the number of non-zero rows of s, and ||s_column≠0|| is the number of non-zero columns of s.
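Sub-step D3 can be sketched directly on a word-to-word similarity matrix; the matrix below is a made-up example, not Table 1 from the patent:

```python
import numpy as np

def text_similarity(s):
    """Average of row maxima (over non-zero rows) and column maxima (over
    non-zero columns) of a word-to-word semantic similarity matrix s."""
    s = np.asarray(s, dtype=float)
    row_nz = ~np.all(s == 0, axis=1)   # rows with at least one non-zero entry
    col_nz = ~np.all(s == 0, axis=0)   # columns with at least one non-zero entry
    num = s[row_nz].max(axis=1).sum() + s[:, col_nz].max(axis=0).sum()
    den = row_nz.sum() + col_nz.sum()
    return num / den if den else 0.0
```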
In sub-step D1, the invention illustrates with the two short texts t1 and t2:
text 1: my phone is annoying me with these amber alerts.
text 2: that amber alert was getting annoying.
The short texts t1 and t2 are preprocessed: the punctuation marks and special symbols are removed, and every word in the short texts is converted to lowercase.
In sub-step D2, for each word wi in short text t1, the invention selects the word wj in short text t2 with the maximum semantic similarity to it; conversely, for each word wj in short text t2, it selects the word wi in short text t1 with the maximum semantic similarity. The semantic similarity matrix s of t1 and t2 is thus obtained, as shown in Table 1.
Table 1: Semantic similarity matrix of short texts t1 and t2
In sub-step D3, the invention sums and averages the semantic similarities calculated in sub-step D2 to obtain the semantic similarity between the short texts; according to the formula, the semantic similarity of short texts t1 and t2 is calculated to be 0.855.
The method of the invention for calculating short text semantic similarity from word semantic similarity is simple and effective; any short text data set can be processed by adjusting the training corpus, and different parts of speech of similar words can be recognized, so part-of-speech matching between words need not be considered, and short texts that are similar under varied sentence patterns are identified more concisely.
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principle of the invention, and it should be understood that the protection scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific variations and combinations according to the technical teachings disclosed by the invention without departing from its essence, and these variations and combinations remain within the protection scope of the invention.
Claims (6)
1. A semantics-based short text similarity calculation method, characterized by comprising the following steps:
A. Preprocessing the corpus data and building word embeddings according to word2vec hyperparameters;
B. Building a word semantic tree of the corpus using hierarchical clustering;
C. Calculating the semantic similarity between words in the short texts according to the inconsistency rate of each link in the word semantic tree of step B;
D. Calculating the semantic similarity between short texts according to the word-level semantic similarities of step C.
2. The semantics-based short text similarity calculation method of claim 1, characterized in that the preprocessing of the corpus data in step A is as follows: converting all words in the corpus to lowercase and performing tokenization; at the same time, selecting the words that occur more than n times in the corpus to build the vocabulary of the corpus, where n is a preset occurrence threshold.
3. The semantics-based short text similarity calculation method of claim 2, characterized in that building word embeddings according to word2vec hyperparameters in step A is as follows: training the CBOW and SG models of word2vec with different hyperparameters; using cosine distance as the semantic similarity of the word embeddings and taking the three most similar words of each word as its synonyms; then, using WordNet as the synonym knowledge base, determining by precision, recall, and F1 score the word2vec hyperparameters that best fit the word semantics of this corpus, and building the word embeddings accordingly; wherein the precision p is the ratio of the number of synonym pairs correctly predicted by the word embeddings to the total number of predicted pairs, the recall r is the ratio of the number of synonym pairs correctly predicted by the word embeddings to the number of synonym pairs appearing in WordNet, and the F1 score is f1 = 2pr / (p + r).
4. The semantics-based short text similarity calculation method of claim 3, characterized in that in step B the word semantic tree of the corpus is built by hierarchical clustering, as follows: using the SimLex-999 data set to determine the distance metric and the linkage metric, and building the word semantic tree of the corpus by hierarchical clustering with the determined distance and linkage metrics.
5. The semantics-based short text similarity calculation method of claim 4, characterized in that the formula used in step C to calculate the semantic similarity between words in the short texts is

sim(w1, w2) = 1 - inconsistent(link(w1, w2)) / inconsistent(tree)_threshold

wherein w1 and w2 denote words, link(w1, w2) denotes the lowest common ancestor node of the two words, inconsistent(tree)_threshold denotes the threshold on the inconsistency rate of links in the hierarchical clustering tree, and inconsistent denotes the inconsistency rate of a link.
6. The semantics-based short text similarity calculation method of claim 5, characterized in that calculating the semantic similarity between short texts according to the word-level semantic similarities in step D comprises the following sub-steps:
D1. Preprocessing the short texts t1 and t2: removing punctuation marks and special symbols, and converting every word in the short texts to lowercase;
D2. Calculating the semantic similarity sij between each word wi of short text t1 and each word wj of short text t2;
D3. Calculating the semantic similarity of t1 and t2 by the formula

sim(t1, t2) = (sum(rows) + sum(columns)) / (||s_row≠0|| + ||s_column≠0||)

wherein sum(rows) is the sum of the maxima of the rows of the semantic similarity matrix s of t1 and t2 whose elements are not all zero, sum(columns) is the sum of the maxima of the columns of s whose elements are not all zero, ||s_row≠0|| is the number of non-zero rows of s, and ||s_column≠0|| is the number of non-zero columns of s.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610817910.8A CN106372061B (en) | 2016-09-12 | 2016-09-12 | Short text similarity calculation method based on semantics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106372061A true CN106372061A (en) | 2017-02-01 |
CN106372061B CN106372061B (en) | 2020-11-24 |
Family
ID=57896767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610817910.8A Active CN106372061B (en) | 2016-09-12 | 2016-09-12 | Short text similarity calculation method based on semantics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106372061B (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463705A (en) * | 2017-08-17 | 2017-12-12 | 陕西优百信息技术有限公司 | A kind of data cleaning method |
CN107958061A (en) * | 2017-12-01 | 2018-04-24 | 厦门快商通信息技术有限公司 | The computational methods and computer-readable recording medium of a kind of text similarity |
CN108509410A (en) * | 2017-02-27 | 2018-09-07 | 广东神马搜索科技有限公司 | Text semantic similarity calculating method, device and user terminal |
CN108509407A (en) * | 2017-02-27 | 2018-09-07 | 广东神马搜索科技有限公司 | Text semantic similarity calculating method, device and user terminal |
- 2016-09-12 CN CN201610817910.8A patent/CN106372061B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011078186A1 (en) * | 2009-12-22 | 2011-06-30 | 日本電気株式会社 | Document clustering system, document clustering method, and recording medium |
CN103377239A (en) * | 2012-04-26 | 2013-10-30 | 腾讯科技(深圳)有限公司 | Method and device for calculating inter-textual similarity |
CN103177125A (en) * | 2013-04-17 | 2013-06-26 | 镇江诺尼基智能技术有限公司 | Method for fast short-text bi-clustering |
CN104182388A (en) * | 2014-07-21 | 2014-12-03 | 安徽华贞信息科技有限公司 | Semantic analysis based text clustering system and method |
Non-Patent Citations (1)
Title |
---|
Peng Jing, Yang Dongqing, Tang Shiwei, Fu Yan, Jiang Hankui: "A Text Clustering Algorithm Based on a Semantic Inner Product Space Model", Chinese Journal of Computers * |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108509410B (en) * | 2017-02-27 | 2022-08-05 | 阿里巴巴(中国)有限公司 | Text semantic similarity calculation method, device, and user terminal |
CN108509410A (en) * | 2017-02-27 | 2018-09-07 | 广东神马搜索科技有限公司 | Text semantic similarity calculation method, device, and user terminal |
CN108509407A (en) * | 2017-02-27 | 2018-09-07 | 广东神马搜索科技有限公司 | Text semantic similarity calculation method, device, and user terminal |
CN108509407B (en) * | 2017-02-27 | 2022-03-18 | 阿里巴巴(中国)有限公司 | Text semantic similarity calculation method, device, and user terminal |
CN107463705A (en) * | 2017-08-17 | 2017-12-12 | 陕西优百信息技术有限公司 | Data cleaning method |
CN110019832A (en) * | 2017-09-29 | 2019-07-16 | 阿里巴巴集团控股有限公司 | Language model acquisition method and device |
CN110019832B (en) * | 2017-09-29 | 2023-02-24 | 阿里巴巴集团控股有限公司 | Method and device for acquiring language model |
US11334608B2 (en) | 2017-11-23 | 2022-05-17 | Infosys Limited | Method and system for key phrase extraction and generation from text |
CN107958061A (en) * | 2017-12-01 | 2018-04-24 | 厦门快商通信息技术有限公司 | Text similarity calculation method and computer-readable recording medium |
CN109086756A (en) * | 2018-06-15 | 2018-12-25 | 众安信息技术服务有限公司 | Text detection and analysis method, device, and equipment based on a deep neural network |
CN109086756B (en) * | 2018-06-15 | 2021-08-03 | 众安信息技术服务有限公司 | Text detection analysis method, device and equipment based on deep neural network |
CN109472019A (en) * | 2018-10-11 | 2019-03-15 | 厦门快商通信息技术有限公司 | Short text similarity matching method and system based on a thesaurus |
CN109472019B (en) * | 2018-10-11 | 2023-02-10 | 厦门快商通信息技术有限公司 | Short text similarity matching method and system based on synonymy dictionary |
CN109657210A (en) * | 2018-11-13 | 2019-04-19 | 平安科技(深圳)有限公司 | Text accuracy calculation method, device, and computer equipment based on semantic parsing |
CN109657210B (en) * | 2018-11-13 | 2023-10-10 | 平安科技(深圳)有限公司 | Text accuracy rate calculation method and device based on semantic analysis and computer equipment |
CN110413986B (en) * | 2019-04-12 | 2023-08-29 | 上海晏鼠计算机技术股份有限公司 | Text clustering multi-document automatic summarization method and system for improving word vector model |
CN110413986A (en) * | 2019-04-12 | 2019-11-05 | 上海晏鼠计算机技术股份有限公司 | Text clustering multi-document automatic summarization method and system with an improved word vector model |
CN110348007A (en) * | 2019-06-14 | 2019-10-18 | 北京奇艺世纪科技有限公司 | Text similarity determination method and device |
CN110348007B (en) * | 2019-06-14 | 2023-04-07 | 北京奇艺世纪科技有限公司 | Text similarity determination method and device |
CN110263347A (en) * | 2019-06-26 | 2019-09-20 | 腾讯科技(深圳)有限公司 | Synonym construction method and related apparatus |
CN110442863B (en) * | 2019-07-16 | 2023-05-05 | 深圳供电局有限公司 | Short text semantic similarity calculation method, system and medium thereof |
CN110442863A (en) * | 2019-07-16 | 2019-11-12 | 深圳供电局有限公司 | Short text semantic similarity calculation method, system, and medium |
CN111199154B (en) * | 2019-12-20 | 2022-12-27 | 重庆邮电大学 | Fault-tolerant rough set-based polysemous word expression method, system and medium |
CN111199154A (en) * | 2019-12-20 | 2020-05-26 | 重庆邮电大学 | Fault-tolerant rough set-based polysemous word expression method, system and medium |
CN112131341A (en) * | 2020-08-24 | 2020-12-25 | 博锐尚格科技股份有限公司 | Text similarity calculation method and device, electronic equipment and storage medium |
CN112784046A (en) * | 2021-01-20 | 2021-05-11 | 北京百度网讯科技有限公司 | Text clustering method, device and equipment and storage medium |
CN112784046B (en) * | 2021-01-20 | 2024-05-28 | 北京百度网讯科技有限公司 | Text clustering method, device, equipment and storage medium |
CN113590763A (en) * | 2021-09-27 | 2021-11-02 | 湖南大学 | Similar text retrieval method and device based on deep learning and storage medium |
CN114169651A (en) * | 2022-02-14 | 2022-03-11 | 中国空气动力研究与发展中心计算空气动力研究所 | Active prediction method for supercomputer operation failure based on application similarity |
CN114169651B (en) * | 2022-02-14 | 2022-04-19 | 中国空气动力研究与发展中心计算空气动力研究所 | Active prediction method for supercomputer operation failure based on application similarity |
Also Published As
Publication number | Publication date |
---|---|
CN106372061B (en) | 2020-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106372061A (en) | Short text similarity calculation method based on semantics | |
CN103049435B (en) | Fine-grained text sentiment analysis method and device | |
CN103207913B (en) | Method and system for acquiring fine-grained commercial semantic relations | |
Stein et al. | Intrinsic plagiarism analysis | |
CN104794169B (en) | Subject terminology extraction method and system based on a sequence labeling model | |
CN106599032B (en) | Text event extraction method combining sparse coding and a structured perceptron | |
CN107992542A (en) | Similar article recommendation method based on topic models | |
CN110020189A (en) | Article recommendation method based on Chinese similarity measures | |
Liu et al. | Measuring similarity of academic articles with semantic profile and joint word embedding | |
CN109670039A (en) | Semi-supervised e-commerce review sentiment analysis method based on tripartite graphs and clustering | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN101295294A (en) | Improved Bayesian word sense disambiguation method based on information gain | |
CN103646112A (en) | Domain adaptation method for dependency parsing based on web search | |
Reiplinger et al. | Extracting glossary sentences from scholarly articles: A comparative evaluation of pattern bootstrapping and deep analysis | |
CN111680131B (en) | Document clustering method and system based on semantics and computer equipment | |
Li et al. | A large probabilistic semantic network based approach to compute term similarity | |
Alsallal et al. | Intrinsic plagiarism detection using latent semantic indexing and stylometry | |
CN109472022A (en) | New word identification method and terminal device based on machine learning | |
Dascalu et al. | Age of exposure: A model of word learning | |
CN108073571A (en) | Multilingual text quality evaluation method and system, and intelligent text processing system | |
Sadr et al. | Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms | |
CN114997288A (en) | Design resource association method | |
CN114722176A (en) | Intelligent question answering method, device, medium and electronic equipment | |
CN104881400A (en) | Semantic dependency calculation method based on associative networks | |
Thielmann et al. | Coherence based document clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||