CN106372061A - Short text similarity calculation method based on semantics - Google Patents

Short text similarity calculation method based on semantics

Info

Publication number
CN106372061A
CN106372061A
Authority
CN
China
Prior art keywords
semantic
short text
word
similarity
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610817910.8A
Other languages
Chinese (zh)
Other versions
CN106372061B (en)
Inventor
费高雷
胡馨月
胡光岷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201610817910.8A priority Critical patent/CN106372061B/en
Publication of CN106372061A publication Critical patent/CN106372061A/en
Application granted granted Critical
Publication of CN106372061B publication Critical patent/CN106372061B/en
Legal status: Active

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files


Abstract

The invention discloses a semantics-based short text similarity calculation method comprising the following steps: preprocessing corpus data, building word embeddings, constructing a word semantic tree, calculating the semantic similarity between words in short texts, and calculating the semantic similarity between short texts. On the basis of deep-learning word embeddings, a hierarchical clustering method is used to create the word semantic tree and calculate the similarity between words in short texts; on this basis, several features of the short texts are combined to calculate the semantic similarity between the short texts. This effectively remedies the defect of prior-art word semantic trees, which cannot describe the semantic relationship between a new word and a known word.

Description

Semantics-based short text similarity calculation method
Technical field
The invention belongs to the technical field of short text similarity measurement, and more particularly relates to a semantics-based short text similarity calculation method.
Background technology
Semantic similarity measurement between short texts has theoretical research value and practical application background in fields such as artificial intelligence, natural language processing, cognitive science, semantics, psychology, and bioinformatics. Short text similarity can effectively reduce information redundancy in a corpus. At present, many studies show that short text similarity calculation can advance many natural language processing tasks, such as event detection, information retrieval, text normalization, automatic text summarization, and text classification and clustering. Short text similarity calculation is widely applied, and a good semantic similarity calculation method can considerably improve the performance of many existing systems.
At present, there are many methods for calculating short text similarity, which can be roughly divided into the following classes: methods based on semantic dictionaries, methods based on corpora, feature-based methods, and methods based on Internet resources. Methods based on semantic dictionaries use a semantic dictionary, such as WordNet, PPDB, or FrameNet, to calculate the semantic similarity between pairs of words, and finally integrate these word similarities to obtain the text semantic similarity. Methods based on corpora perform statistical analysis on large-scale text collections; typical methods are LSA (latent semantic analysis) and HAL (hyperspace analogue to language). Feature-based methods attempt to represent short texts with predefined features and then obtain the semantic similarity of short texts through a classifier. Methods based on Internet resources mostly use the results returned by search engines to enrich the contextual information of short texts, or to calculate the similarity of words or entities and thereby the semantic similarity of short texts.
Methods based on semantic dictionaries depend heavily on the completeness of the dictionary being queried, because a short text may contain words that do not exist in the dictionary, making it impossible to calculate the semantic similarity between that short text and others. In addition, the polysemy of words in the dictionary affects the accuracy of semantic similarity measurement. The difficulty of feature-based methods lies in defining effective features and automatically obtaining their values; moreover, features are easy to define for concrete concepts but relatively difficult for abstract ones. Methods based on Internet resources are very sensitive to the results returned by search engines and cannot obtain stable semantic similarities. Furthermore, the co-occurrence information in search engine results can only reflect the relationship between two words to a limited extent, and the precision of grammar templates automatically extracted from snippets is difficult to guarantee. The shortcoming of HAL is that the word-word matrix it constructs cannot capture the meaning of a whole text well. LSA may not handle new words appearing in short texts; moreover, the LSA vector representation of a short text is very sparse, which affects the precision of similarity calculation, and it cannot represent certain syntactic information in the short text.
With the rise of neural networks and deep learning, the traditional word vector space can be converted into a word embedding vector space, which compensates for the sparsity and high noise of short texts in the term vector space and allows unsupervised and supervised learning processes to be combined seamlessly. This opens a new direction for the calculation of short text semantic similarity and represents a future development trend.
Short texts differ from long texts such as ordinary news articles and magazines: because they are short, individual noise words interfere severely with parsing the semantics of the whole short text. Therefore, conventional models and methods for processing regular text may not be effective for semantic similarity measurement of short texts.
Content of the invention
The objective of the invention is to solve a problem unsolved in the prior art: the short length of a short text causes individual noise words to interfere severely with parsing the semantics of the whole text. To this end, the invention proposes a semantics-based short text similarity calculation method.
The technical scheme of the invention is a semantics-based short text similarity calculation method comprising the following steps:
a. Preprocess the corpus data and build word embeddings according to word2vec hyperparameters;
b. Build the word semantic tree of the corpus using a hierarchical clustering method;
c. Calculate the semantic similarity between words in the short texts according to the inconsistency coefficient of each link in the word semantic tree of step b;
d. Calculate the semantic similarity between the short texts according to the semantic similarity between words obtained in step c.
Further, in step a, preprocessing the corpus data specifically comprises: converting all words in the corpus to lowercase and performing word segmentation; and selecting the words that occur more than n times in the corpus to build the vocabulary of the corpus, where n is a preset occurrence threshold.
Further, in step a, building word embeddings according to word2vec hyperparameters specifically comprises: training the CBOW and SG models of word2vec with different hyperparameters; using cosine distance as the semantic similarity of the word embeddings and selecting the three words with the highest similarity as synonyms; using WordNet as the synonym knowledge base; and determining, via precision, recall, and F1 score, the word2vec hyperparameters that best fit the word semantics of the corpus, thereby building the word embeddings. Here the precision P is the ratio of correctly predicted synonym pairs to all predicted pairs, the recall R is the ratio of correctly predicted synonym pairs to the synonym pairs appearing in WordNet, and the F1 score is F1 = 2PR / (P + R).
Further, step b builds the word semantic tree of the corpus with a hierarchical clustering method, specifically: the SimLex-999 data set is used to determine the distance metric and the linkage metric, and the word semantic tree of the corpus is built by hierarchical clustering according to the determined distance metric and linkage metric.
Further, the formula for calculating the semantic similarity between words in short texts in step c is:
syn(w1, w2) = 1 - inconsistent(link(w1, w2)) / inconsistent(tree)_threshold
where w1 and w2 are words, link(w1, w2) is the lowest common ancestor node of the two words, inconsistent(·) is the inconsistency coefficient of a link, and inconsistent(tree)_threshold is the threshold on the inconsistency coefficient of links in the hierarchical clustering tree.
Further, step d, calculating the semantic similarity between short texts according to the semantic similarity between their words, comprises the following sub-steps:
d1. Preprocess short texts t1 and t2: remove punctuation marks and special symbols, and convert each word in the short texts to lowercase;
d2. Calculate the semantic similarity s_ij between each word w_i in short text t1 and each word w_j in short text t2;
d3. Calculate the semantic similarity of short texts t1 and t2 with the formula:
sim(t1, t2) = (1/2) × ( sum(rows) / ||s_row≠0|| + sum(columns) / ||s_column≠0|| )
where sum(rows) is the sum of the maxima of the rows of the semantic similarity matrix s of t1 and t2 whose elements are not all zero, sum(columns) is the sum of the maxima of the columns of s whose elements are not all zero, ||s_row≠0|| is the number of nonzero rows of s, and ||s_column≠0|| is the number of nonzero columns of s.
The invention has the following beneficial effects:
1. The word semantic tree of the invention is obtained by reasonable hierarchical clustering of word embeddings produced by a deep neural network. Compared with existing word semantic trees it is easier to extend, a corresponding word semantic tree can be built rapidly for different corpora, it contains more words, and it overcomes the shortcoming of word semantic trees such as WordNet and Chinese synonym thesauri, which cannot describe the semantic relationship between a new word and a known word;
2. The semantic similarity calculation method proposed by the invention uses a manually annotated synonym data set to determine the threshold on the link inconsistency coefficient of the hierarchically clustered word semantic tree, thereby reducing the disproportionate discrimination of semantic similarity caused by extreme values of the link inconsistency coefficient and improving the precision of the semantic similarity calculation;
3. The method proposed by the invention for calculating the semantic similarity of short texts from the semantic similarity of words is simple and effective. It can process any short text data set by adjusting the training corpus, and it can recognize different parts of speech of similar words, so the part-of-speech matching problem of words does not need to be considered, and it identifies similar short texts with varying sentence structures more concisely.
Brief description
Fig. 1 is a flow diagram of the semantics-based short text similarity calculation method of the invention.
Fig. 2 is a schematic diagram of the hierarchically clustered word semantic tree in an embodiment of the invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention and are not intended to limit it.
As shown in Fig. 1, the flow diagram of the semantics-based short text similarity calculation method of the invention, the method comprises the following steps:
a. Preprocess the corpus data and build word embeddings according to word2vec hyperparameters;
b. Build the word semantic tree of the corpus using a hierarchical clustering method;
c. Calculate the semantic similarity between words in the short texts according to the inconsistency coefficient of each link in the word semantic tree of step b;
d. Calculate the semantic similarity between the short texts according to the semantic similarity between words obtained in step c.
In step a, the corpus data must first be preprocessed, specifically: all words in the corpus are converted to lowercase and word segmentation is performed. To guarantee the quality of the word embeddings, the invention selects the words that occur more than n times in the corpus to build the vocabulary of the corpus, where n is a preset occurrence threshold. Preferably, the invention sets n to 10, i.e. the words occurring more than 10 times in the corpus are selected to build the vocabulary.
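As a concrete illustration, the preprocessing of step a can be sketched in Python; the whitespace tokenizer, the function name, and the toy corpus are our assumptions for illustration only:

```python
from collections import Counter

def preprocess(corpus_lines, n=10):
    """Step-a preprocessing sketch: lowercase and tokenize each line,
    then keep only the words occurring more than n times in the corpus
    (the patent's preferred threshold is n = 10)."""
    tokenized = [line.lower().split() for line in corpus_lines]
    counts = Counter(w for line in tokenized for w in line)
    vocab = {w for w, c in counts.items() if c > n}
    return tokenized, vocab

corpus = ["The cat sat", "the cat ran", "a dog ran"]
tokens, vocab = preprocess(corpus, n=1)
# "the" (3x), "cat" (2x), and "ran" (2x) survive the threshold n = 1
```

For real corpora the whitespace split would be replaced by a proper word segmenter, as the patent requires for Chinese text.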
The invention uses the word2vec tool to train word embeddings and builds them according to the word2vec hyperparameters, specifically: the CBOW and SG models of word2vec are trained with different hyperparameters (here, context window size, dimensionality, and so on); cosine distance is then used as the semantic similarity of the word embeddings, and the three words with the highest similarity are selected as synonyms; WordNet is used as the synonym knowledge base; and the word2vec hyperparameters that best fit the word semantics of the corpus are determined via precision, recall, and F1 score, thereby building the word embeddings. Here the precision P is the ratio of correctly predicted synonym pairs to all predicted pairs, the recall R is the ratio of correctly predicted synonym pairs to the synonym pairs appearing in WordNet, and the F1 score is F1 = 2PR / (P + R). Preferably, the word2vec hyperparameters determined by the invention are a dimensionality of 300, a context window size of 32, 5 iterations, and a negative sampling rate of 5.
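The precision/recall/F1 selection criterion just described can be sketched in Python; the pair representation, the function name `synonym_prf`, and the toy word pairs are our assumptions, not data from the patent:

```python
def synonym_prf(predicted, gold):
    """Score predicted synonym pairs against a gold set (e.g. WordNet
    pairs), as used to select word2vec hyperparameters in step a:
    P = correct/predicted, R = correct/gold, F1 = 2PR/(P+R)."""
    predicted = {frozenset(p) for p in predicted}  # unordered pairs
    gold = {frozenset(p) for p in gold}
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One of two predicted pairs is in the gold set, and the gold set
# also has two pairs, so P = R = F1 = 0.5.
p, r, f1 = synonym_prf([("car", "auto"), ("car", "cat")],
                       [("car", "auto"), ("big", "large")])
```

In the patent's procedure, `predicted` would be the top-3 cosine neighbours of each word under a candidate hyperparameter setting, and the setting with the best F1 would be kept.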
In step b, as shown in Fig. 2, the schematic diagram of the hierarchically clustered word semantic tree in the embodiment of the invention, the invention dynamically builds the word semantic tree of the corpus with a hierarchical clustering method, so that semantically similar words are adjacent in the tree. Agglomerative hierarchical clustering uses a bottom-up strategy: typically, it starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all objects are in a single cluster or a termination condition is met. This single cluster becomes the root of the hierarchy. In a merge step, the two closest clusters according to some similarity measure are found and merged into one cluster. Since each iteration merges two clusters, each of which contains at least one object, the agglomerative method needs at most n iterations.
When building the word semantic tree of the corpus, the distance metric between word embeddings must be determined; here the invention considers Euclidean distance, cosine distance, and Manhattan distance. Likewise, the inter-cluster distance measure of the agglomerative method is the linkage metric; for each word vector distance metric, the invention analyzes the effect of linkage metrics such as single, centroid, complete, and average linkage on the quality of the agglomerative hierarchical clustering tree. The invention uses the SimLex-999 data set to assess the quality of the agglomerative hierarchical clustering trees generated with the different distance and linkage metrics, and thereby builds a high-quality word semantic tree. The SimLex-999 data set contains 999 pairs of English words whose synonym similarity and semantic relatedness are manually annotated. Based on this data set, the invention determines suitable distance and linkage metrics according to a Spearman rank correlation analysis of the results, in order to build a high-quality hierarchical clustering tree.
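The tree construction can be sketched with SciPy's agglomerative clustering, an assumption on our part since the patent names no toolkit; real word embeddings are replaced here by toy 2-D vectors:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent

# Toy 2-D stand-ins for word embeddings: two tight word groups.
embeddings = np.array([[0.0, 0.0], [0.0, 0.1],   # e.g. "car", "auto"
                       [5.0, 5.0], [5.0, 5.1]])  # e.g. "cat", "kitten"

# Average linkage over Euclidean distance, two of the metric choices
# the patent considers; other combinations would be compared on
# SimLex-999 via Spearman rank correlation.
Z = linkage(embeddings, method="average", metric="euclidean")

# Per-link statistics [mean, std, count, inconsistency coefficient];
# the coefficient is the "inconsistency" used in step c.
R = inconsistent(Z)

# The last merge joins the two distant groups, so it has the largest
# merge height and the largest inconsistency coefficient.
```

Each row of `Z` is one link of the semantic tree; leaves are words and internal nodes are merges, matching the tree structure of Fig. 2.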
The word semantic tree of the invention is obtained by reasonable hierarchical clustering of the word embeddings produced by a deep neural network. Compared with existing word semantic trees it is easier to extend, a corresponding word semantic tree can be built rapidly for different corpora, it contains more words, and it overcomes the shortcoming of word semantic trees such as WordNet and Chinese synonym thesauri, which cannot describe the semantic relationship between a new word and a known word.
In step c, the invention designs a new word semantic similarity calculation method on the semantic tree built by hierarchical clustering. In the hierarchically clustered word semantic tree, leaf nodes represent words, and parent nodes and the root node represent links; each link has an inconsistency coefficient indicating the degree of consistency among the members joined by that link. The invention calculates the semantic similarity between words according to the inconsistency coefficient of each link in the hierarchical clustering tree, with the formula:
syn(w1, w2) = 1 - inconsistent(link(w1, w2)) / inconsistent(tree)_threshold
where w1 and w2 are words, link(w1, w2) is the lowest common ancestor node of the two words, inconsistent(·) is the inconsistency coefficient of a link, and inconsistent(tree)_threshold is the threshold on the inconsistency coefficient of links in the hierarchical clustering tree; if the inconsistency coefficient exceeds the given threshold, it is set equal to the threshold. To improve the precision of the semantic similarity, the invention considers two words completely unrelated when the inconsistency coefficient of their link is above a certain value, and therefore sets this inconsistency threshold. Because the maximum of the inconsistency coefficient may be very large, this truncation effectively improves the precision of the semantic similarity. The WS-353 data set contains 353 pairs of English words whose semantic similarity is manually annotated; based on this data set, the invention determines the inconsistency threshold of the hierarchical clustering tree according to a Spearman rank correlation analysis of the results.
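A sketch of the syn(w1, w2) formula on such a linkage tree, assuming SciPy's `linkage`/`inconsistent` and an illustrative threshold of 1.0; the helper `word_syn` and its leaf-set traversal are ours, not code from the patent:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent

def word_syn(Z, R, i, j, threshold):
    """syn(w1, w2) = 1 - inconsistent(link(w1, w2)) / threshold, where
    link(w1, w2) is the lowest merge joining leaves i and j and the
    inconsistency coefficient is capped at the threshold (step c)."""
    n = Z.shape[0] + 1
    members = {k: {k} for k in range(n)}  # leaf sets of cluster ids
    for t in range(Z.shape[0]):
        a, b = int(Z[t, 0]), int(Z[t, 1])
        members[n + t] = members[a] | members[b]
        if i in members[n + t] and j in members[n + t]:
            return 1.0 - min(R[t, 3], threshold) / threshold
    return 0.0

emb = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
Z = linkage(emb, method="average", metric="euclidean")
R = inconsistent(Z)
close = word_syn(Z, R, 0, 1, threshold=1.0)  # within a tight group
far = word_syn(Z, R, 0, 2, threshold=1.0)    # across the two groups
```

Words merged early by a perfectly consistent link score 1.0, while the high-inconsistency root link is capped at the threshold, driving the cross-group similarity toward 0, which is exactly the truncation behaviour the paragraph above describes.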
The semantic similarity calculation method of the invention uses a manually annotated synonym data set to determine the threshold on the link inconsistency coefficient of the hierarchically clustered word semantic tree, thereby reducing the disproportionate discrimination of semantic similarity caused by extreme values of the link inconsistency coefficient and improving the precision of the semantic similarity calculation.
In step d, the two short texts are denoted t1 and t2, and the semantic similarity between their words is calculated according to step c; the calculation of the semantic similarity between the short texts then comprises the following sub-steps:
d1. Preprocess short texts t1 and t2: remove punctuation marks and special symbols, and convert each word in the short texts to lowercase;
d2. Calculate the semantic similarity s_ij between each word w_i in short text t1 and each word w_j in short text t2;
d3. Calculate the semantic similarity of short texts t1 and t2 with the formula:
sim(t1, t2) = (1/2) × ( sum(rows) / ||s_row≠0|| + sum(columns) / ||s_column≠0|| )
where sum(rows) is the sum of the maxima of the rows of the semantic similarity matrix s of t1 and t2 whose elements are not all zero, sum(columns) is the sum of the maxima of the columns of s whose elements are not all zero, ||s_row≠0|| is the number of nonzero rows of s, and ||s_column≠0|| is the number of nonzero columns of s.
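The d3 formula can be sketched directly in NumPy (a minimal reading of the formula; the function name `text_sim` is ours):

```python
import numpy as np

def text_sim(S):
    """sim(t1, t2) from step d3: the average of (sum of row maxima /
    number of nonzero rows) and (sum of column maxima / number of
    nonzero columns) of the word similarity matrix S."""
    S = np.asarray(S, dtype=float)
    rows = np.any(S != 0, axis=1)  # rows that are not all zero
    cols = np.any(S != 0, axis=0)  # columns that are not all zero
    row_term = S[rows].max(axis=1).sum() / rows.sum() if rows.any() else 0.0
    col_term = S[:, cols].max(axis=0).sum() / cols.sum() if cols.any() else 0.0
    return 0.5 * (row_term + col_term)

# Three t1 words against two t2 words; the all-zero third row
# (a word matching nothing) is excluded from the row average.
sim = text_sim([[1.0, 0.0],
                [0.0, 0.5],
                [0.0, 0.0]])
```

Excluding all-zero rows and columns keeps noise words that match nothing in the other text from dragging the average down, which is the point of the normalization terms.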
In step d1, the invention illustrates with short texts t1 and t2, namely:
text 1: my phone is annoying me with these amber alerts.
text 2: that amber alert was getting annoying.
Short texts t1 and t2 are preprocessed, i.e. the punctuation marks and special symbols are removed and each word in the short texts is converted to its lowercase form.
In step d2, for each word w_i in short text t1, the invention selects the word w_j in short text t2 with the maximum semantic similarity to it; then, for each word w_j in short text t2, it selects the word w_i in short text t1 with the maximum semantic similarity to it. This yields the semantic similarity matrix s of short texts t1 and t2, shown in Table 1.
Table 1. Semantic similarity matrix of short texts t1 and t2
In step d3, the invention sums and averages the semantic similarities calculated in step d2 to obtain the semantic similarity between the short texts; according to the formula, the semantic similarity of short texts t1 and t2 is calculated to be 0.855.
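The mechanics of step d2 on the example texts can be sketched as follows, with a toy exact-match word similarity standing in for the tree-based syn, so the resulting matrix only illustrates the construction, not the patent's score of 0.855:

```python
import numpy as np

def sim_matrix(t1_words, t2_words, syn):
    """Step d2: S[i][j] = syn(w_i, w_j) for w_i in t1, w_j in t2."""
    return np.array([[syn(a, b) for b in t2_words] for a in t1_words])

t1 = "my phone is annoying me with these amber alerts".split()
t2 = "that amber alert was getting annoying".split()

# Toy word similarity: 1 for identical words, else 0. The patent's
# tree-based syn would also relate near-synonyms and inflected forms
# such as "alerts"/"alert".
S = sim_matrix(t1, t2, lambda a, b: 1.0 if a == b else 0.0)
# Only "annoying" and "amber" match exactly under this toy measure.
```

Feeding such a matrix into the d3 formula then yields the short text similarity; with the real tree-based word similarity the patent reports 0.855 for this pair.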
The method of the invention for calculating the semantic similarity of short texts from the semantic similarity of words is simple and effective. It can process any short text data set by adjusting the training corpus, and it can recognize different parts of speech of similar words, so the part-of-speech matching problem of words does not need to be considered, and it identifies similar short texts with varying sentence structures more concisely.
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principle of the invention, and it should be understood that the protection scope of the invention is not limited to such special statements and embodiments. Those of ordinary skill in the art can make various other specific variations and combinations according to the technical teachings disclosed by the invention without departing from its essence, and these variations and combinations remain within the protection scope of the invention.

Claims (6)

1. A semantics-based short text similarity calculation method, characterized by comprising the following steps:
a. preprocessing the corpus data and building word embeddings according to word2vec hyperparameters;
b. building the word semantic tree of the corpus using a hierarchical clustering method;
c. calculating the semantic similarity between words in the short texts according to the inconsistency coefficient of each link in the word semantic tree of step b;
d. calculating the semantic similarity between the short texts according to the semantic similarity between words obtained in step c.
2. The semantics-based short text similarity calculation method of claim 1, characterized in that in step a the corpus data is preprocessed by: converting all words in the corpus to lowercase and performing word segmentation; and selecting the words that occur more than n times in the corpus to build the vocabulary of the corpus, where n is a preset occurrence threshold.
3. The semantics-based short text similarity calculation method of claim 2, characterized in that in step a the word embeddings are built according to word2vec hyperparameters by: training the CBOW and SG models of word2vec with different hyperparameters; using cosine distance as the semantic similarity of the word embeddings and selecting the three words with the highest similarity as synonyms; using WordNet as the synonym knowledge base; and determining, via precision, recall, and F1 score, the word2vec hyperparameters that best fit the word semantics of the corpus, thereby building the word embeddings; where the precision P is the ratio of correctly predicted synonym pairs to all predicted pairs, the recall R is the ratio of correctly predicted synonym pairs to the synonym pairs appearing in WordNet, and the F1 score is F1 = 2PR / (P + R).
4. The semantics-based short text similarity calculation method of claim 3, characterized in that step b builds the word semantic tree of the corpus with a hierarchical clustering method by: determining the distance metric and the linkage metric using the SimLex-999 data set, and building the word semantic tree of the corpus by hierarchical clustering according to the determined distance metric and linkage metric.
5. The semantics-based short text similarity calculation method of claim 4, characterized in that the formula for calculating the semantic similarity between words in short texts in step c is:
syn(w1, w2) = 1 - inconsistent(link(w1, w2)) / inconsistent(tree)_threshold
where w1 and w2 are words, link(w1, w2) is the lowest common ancestor node of the two words, inconsistent(·) is the inconsistency coefficient of a link, and inconsistent(tree)_threshold is the threshold on the inconsistency coefficient of links in the hierarchical clustering tree.
6. The semantics-based short text similarity calculation method of claim 5, characterized in that step d, calculating the semantic similarity between short texts according to the semantic similarity between their words, comprises the following sub-steps:
d1. preprocessing short texts t1 and t2: removing punctuation marks and special symbols, and converting each word in the short texts to lowercase;
d2. calculating the semantic similarity s_ij between each word w_i in short text t1 and each word w_j in short text t2;
d3. calculating the semantic similarity of short texts t1 and t2 with the formula:
sim(t1, t2) = (1/2) × ( sum(rows) / ||s_row≠0|| + sum(columns) / ||s_column≠0|| )
where sum(rows) is the sum of the maxima of the rows of the semantic similarity matrix s of t1 and t2 whose elements are not all zero, sum(columns) is the sum of the maxima of the columns of s whose elements are not all zero, ||s_row≠0|| is the number of nonzero rows of s, and ||s_column≠0|| is the number of nonzero columns of s.
CN201610817910.8A 2016-09-12 2016-09-12 Short text similarity calculation method based on semantics Active CN106372061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610817910.8A CN106372061B (en) 2016-09-12 2016-09-12 Short text similarity calculation method based on semantics


Publications (2)

Publication Number Publication Date
CN106372061A true CN106372061A (en) 2017-02-01
CN106372061B CN106372061B (en) 2020-11-24

Family

ID=57896767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610817910.8A Active CN106372061B (en) 2016-09-12 2016-09-12 Short text similarity calculation method based on semantics

Country Status (1)

Country Link
CN (1) CN106372061B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011078186A1 (en) * 2009-12-22 2011-06-30 日本電気株式会社 Document clustering system, document clustering method, and recording medium
CN103177125A (en) * 2013-04-17 2013-06-26 镇江诺尼基智能技术有限公司 Method for realizing fast-speed short text bi-cluster
CN103377239A (en) * 2012-04-26 2013-10-30 腾讯科技(深圳)有限公司 Method and device for calculating inter-textual similarity
CN104182388A (en) * 2014-07-21 2014-12-03 安徽华贞信息科技有限公司 Semantic analysis based text clustering system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PENG Jing, YANG Dongqing, TANG Shiwei, FU Yan, JIANG Hankui: "A Text Clustering Algorithm Based on Semantic Inner Product Space Model", Chinese Journal of Computers *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509410B (en) * 2017-02-27 2022-08-05 阿里巴巴(中国)有限公司 Text semantic similarity calculation method and device and user terminal
CN108509410A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN108509407A (en) * 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity calculating method, device and user terminal
CN108509407B (en) * 2017-02-27 2022-03-18 阿里巴巴(中国)有限公司 Text semantic similarity calculation method and device and user terminal
CN107463705A (en) * 2017-08-17 2017-12-12 陕西优百信息技术有限公司 A kind of data cleaning method
CN110019832A (en) * 2017-09-29 2019-07-16 阿里巴巴集团控股有限公司 The acquisition methods and device of language model
CN110019832B (en) * 2017-09-29 2023-02-24 阿里巴巴集团控股有限公司 Method and device for acquiring language model
US11334608B2 (en) 2017-11-23 2022-05-17 Infosys Limited Method and system for key phrase extraction and generation from text
CN107958061A (en) * 2017-12-01 2018-04-24 厦门快商通信息技术有限公司 The computational methods and computer-readable recording medium of a kind of text similarity
CN109086756A (en) * 2018-06-15 2018-12-25 众安信息技术服务有限公司 A kind of text detection analysis method, device and equipment based on deep neural network
CN109086756B (en) * 2018-06-15 2021-08-03 众安信息技术服务有限公司 Text detection analysis method, device and equipment based on deep neural network
CN109472019A (en) * 2018-10-11 2019-03-15 厦门快商通信息技术有限公司 A kind of short text Similarity Match Method and system based on thesaurus
CN109472019B (en) * 2018-10-11 2023-02-10 厦门快商通信息技术有限公司 Short text similarity matching method and system based on synonymy dictionary
CN109657210A (en) * 2018-11-13 2019-04-19 平安科技(深圳)有限公司 Text accuracy rate calculation method, device, computer equipment based on semanteme parsing
CN109657210B (en) * 2018-11-13 2023-10-10 平安科技(深圳)有限公司 Text accuracy rate calculation method and device based on semantic analysis and computer equipment
CN110413986B (en) * 2019-04-12 2023-08-29 上海晏鼠计算机技术股份有限公司 Text clustering multi-document automatic summarization method and system for improving word vector model
CN110413986A (en) * 2019-04-12 2019-11-05 上海晏鼠计算机技术股份有限公司 A kind of text cluster multi-document auto-abstracting method and system improving term vector model
CN110348007A (en) * 2019-06-14 2019-10-18 北京奇艺世纪科技有限公司 A kind of text similarity determines method and device
CN110348007B (en) * 2019-06-14 2023-04-07 北京奇艺世纪科技有限公司 Text similarity determination method and device
CN110263347A (en) * 2019-06-26 2019-09-20 腾讯科技(深圳)有限公司 A kind of construction method and relevant apparatus of synonym
CN110442863B (en) * 2019-07-16 2023-05-05 深圳供电局有限公司 Short text semantic similarity calculation method, system and medium thereof
CN110442863A (en) * 2019-07-16 2019-11-12 深圳供电局有限公司 A kind of short text semantic similarity calculation method and its system, medium
CN111199154B (en) * 2019-12-20 2022-12-27 重庆邮电大学 Fault-tolerant rough set-based polysemous word expression method, system and medium
CN111199154A (en) * 2019-12-20 2020-05-26 重庆邮电大学 Fault-tolerant rough set-based polysemous word expression method, system and medium
CN112131341A (en) * 2020-08-24 2020-12-25 博锐尚格科技股份有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN112784046A (en) * 2021-01-20 2021-05-11 北京百度网讯科技有限公司 Text clustering method, device and equipment and storage medium
CN112784046B (en) * 2021-01-20 2024-05-28 北京百度网讯科技有限公司 Text clustering method, device, equipment and storage medium
CN113590763A (en) * 2021-09-27 2021-11-02 湖南大学 Similar text retrieval method and device based on deep learning and storage medium
CN114169651A (en) * 2022-02-14 2022-03-11 中国空气动力研究与发展中心计算空气动力研究所 Active prediction method for supercomputer operation failure based on application similarity
CN114169651B (en) * 2022-02-14 2022-04-19 中国空气动力研究与发展中心计算空气动力研究所 Active prediction method for supercomputer operation failure based on application similarity

Also Published As

Publication number Publication date
CN106372061B (en) 2020-11-24

Similar Documents

Publication Publication Date Title
CN106372061A (en) Short text similarity calculation method based on semantics
CN103049435B (en) Text fine granularity sentiment analysis method and device
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
Stein et al. Intrinsic plagiarism analysis
CN104794169B (en) A kind of subject terminology extraction method and system based on sequence labelling model
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN107992542A (en) A kind of similar article based on topic model recommends method
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
Liu et al. Measuring similarity of academic articles with semantic profile and joint word embedding
CN109670039A (en) Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN101295294A (en) Improved Bayes acceptation disambiguation method based on information gain
CN103646112A (en) Dependency parsing field self-adaption method based on web search
Reiplinger et al. Extracting glossary sentences from scholarly articles: A comparative evaluation of pattern bootstrapping and deep analysis
CN111680131B (en) Document clustering method and system based on semantics and computer equipment
Li et al. A large probabilistic semantic network based approach to compute term similarity
Alsallal et al. Intrinsic plagiarism detection using latent semantic indexing and stylometry
CN109472022A (en) New word identification method and terminal device based on machine learning
Dascalu et al. Age of exposure: A model of word learning
CN108073571A (en) A kind of multi-language text method for evaluating quality and system, intelligent text processing system
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
CN114997288A (en) Design resource association method
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
CN104881400A (en) Semantic dependency calculating method based on associative network
Thielmann et al. Coherence based document clustering

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant