CN108334495A - Short text similarity calculating method and system - Google Patents


Info

Publication number
CN108334495A
CN108334495A
Authority
CN
China
Prior art keywords
term vector
short text
word
vector
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810090296.9A
Other languages
Chinese (zh)
Inventor
王慧
汪立东
王博
刘春阳
张旭
王萌
李雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center filed Critical National Computer Network and Information Security Management Center
Priority to CN201810090296.9A priority Critical patent/CN108334495A/en
Publication of CN108334495A publication Critical patent/CN108334495A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a short text similarity calculation method comprising the following steps: S1, segment a training corpus into words, obtain the term vector of each word with the word2vec algorithm, and combine the term vectors into a term vector set; S2, segment each short text to be compared, look up the term vector of each of its words in the term vector set, and combine them into a short text vector set; S3, compute the cosine similarity between each term vector in the term vector set and each term vector in the short text vector set, and combine the maximum similarity value of each term vector into a short text sentence vector; S4, compute the similarity between the two short text sentence vectors, which gives the similarity between the two short texts. The present invention also provides a short text similarity calculation system. By representing each short text sentence as a sentence vector, the similarity algorithm effectively captures the semantic similarity between short text sentences and achieves high accuracy.

Description

Short text similarity calculating method and system
Technical field
The invention belongs to the technical field of short text similarity, and in particular relates to a short text similarity calculation method and system.
Background technology
With the rapid development of computer science and the internet, more and more data appear online in the form of short texts, such as Twitter messages, news headlines, and forum posts. Applying machine learning techniques such as classification and clustering to internet short text data, so as to mine valuable information from it and serve people's everyday needs, has become a very popular topic in current big data applications. However, Chinese short texts are characterized by sparse features, discrete semantics, and casual wording, which makes research on Chinese short text extremely challenging. Mining short text data and accurately recognizing its inherent meaning has therefore become an urgent task of significant theoretical importance.
Current methods for Chinese short text similarity calculation mainly represent text with the vector space model (VSM) and then compute text similarity on that representation. In this model, text content is formalized as a point in a high-dimensional space and expressed as a vector, reducing the processing of text content to vector operations and thereby lowering the complexity of the problem. Analyzing short texts this way raises two main problems. First, because the feature words of a short text are sparse, the text vectors produced by algorithms designed for ordinary text are overly sparse, which leads to poor clustering results that cannot match those obtained on long texts. Second, the vector space model considers only the statistical properties of words in context, under the assumption that keywords are linearly independent, and ignores the semantics of the words themselves; it therefore has significant limitations and cannot accurately express the semantic meaning of a sentence.
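The first problem can be seen in a toy sketch (the example texts below are invented for illustration, not taken from the patent): two semantically related short texts that share no surface words produce orthogonal bag-of-words vectors, so a VSM-based similarity sees nothing in common.

```python
# Toy illustration of VSM sparsity on short texts (example words are invented).
# Two related short texts that share no surface words become orthogonal
# bag-of-words vectors, so their dot-product similarity is zero.
doc1 = ["北京", "天气", "很", "好"]      # "the weather in Beijing is nice"
doc2 = ["首都", "气候", "不错"]          # "the capital's climate is pleasant"

vocab = sorted(set(doc1) | set(doc2))   # the joint vocabulary defines the VSM axes
v1 = [doc1.count(w) for w in vocab]     # term-frequency vector of doc1
v2 = [doc2.count(w) for w in vocab]     # term-frequency vector of doc2

dot = sum(a * b for a, b in zip(v1, v2))
print(dot)  # 0: the model assigns these texts no similarity at all
```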
Invention content
An object of the present invention is to solve the above problems and to provide the advantages described below.
A further object of the present invention is to provide a short text similarity calculation method that uses the deep learning word2vec algorithm to train the term vector of each word in a training corpus, computes the cosine similarity between each term vector in the term vector set and each term vector in the short text vector set, obtains the maximum similarity value of each term vector in the term vector set, combines these maxima into a short text sentence vector, and then uses the cosine similarity algorithm to compute the similarity between short text sentence vectors, effectively capturing the semantic similarity between short text sentences.
To achieve these objects and other advantages, the present invention provides a short text similarity calculation method comprising the following steps:
S1, obtain a training corpus, segment it into words, train it with the deep learning word2vec algorithm to obtain the term vector (a_1i, a_2i, a_3i, …) of each word in the training corpus, and combine the term vectors into a term vector set S;
S = ((a_11, a_21, a_31, …), (a_12, a_22, a_32, …), (a_13, a_23, a_33, …), …, (a_1i, a_2i, a_3i, …), …, (a_1N, a_2N, a_3N, …))
S2, segment each short text to be compared, look up in the term vector set the term vector word_i corresponding to each word of the segmented short text, and combine them into a short text vector set sen;
sen = (word_1, word_2, word_3, …, word_i, …, word_M)
S3, use the cosine similarity formula to compute the cosine similarity between each term vector in the term vector set and each term vector in the short text vector set, obtain the maximum similarity value max_i of each term vector in the term vector set, and combine the max_i into a short text sentence vector senVec;
senVec = (max_1, max_2, max_3, …, max_i, …, max_N)
S4, use the cosine similarity formula to compute the similarity between two short text sentence vectors; this is the similarity between the two short texts.
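Steps S2–S4 can be sketched as follows, assuming the term vector set from step S1 is already available as a word-to-vector mapping; the vectors and tokenized inputs below are invented placeholders, not trained values.

```python
import numpy as np

# Hypothetical term vector set S from step S1 (in practice trained with
# word2vec); the values below are invented placeholders.
S = {
    "I":      np.array([0.9, 0.1, 0.0]),
    "come":   np.array([0.1, 0.9, 0.2]),
    "China":  np.array([0.2, 0.1, 0.9]),
    "am":     np.array([0.7, 0.3, 0.1]),
    "person": np.array([0.3, 0.8, 0.2]),
}
vocab = list(S)  # a fixed order of the N vocabulary words defines senVec's dimensions

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sentence_vector(tokens):
    # S2: look up the term vector of each segmented word of the short text;
    # S3: for every word in the term vector set, keep its maximum cosine
    #     similarity to any term vector of the short text.
    vecs = [S[t] for t in tokens if t in S]
    return np.array([max(cos(S[w], v) for v in vecs) for w in vocab])

def short_text_similarity(tokens1, tokens2):
    # S4: cosine similarity between the two short text sentence vectors.
    return cos(sentence_vector(tokens1), sentence_vector(tokens2))

sim = short_text_similarity(["I", "am", "China", "person"], ["I", "come", "China"])
```

Because every entry of senVec is a maximum over shared vocabulary comparisons, two texts with overlapping or related words yield similar sentence vectors even when their lengths differ.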
Preferably, in the short text similarity calculation method, the training corpus in S1 is obtained as follows: acquire corpus data and remove non-textual data from it to obtain the training corpus. After the term vector of each word in the training corpus is obtained, remove the term vectors of stop words and of words whose frequency is below a preset threshold; the remaining term vectors are combined into the term vector set S. The preset threshold is between 5 and 10.
Preferably, in the short text similarity calculation method, in S2 the short texts to be compared and the training corpus are segmented with an HMM model and the Viterbi algorithm.
Preferably, in the short text similarity calculation method, the corpus data is obtained with web crawler technology.
Preferably, in the short text similarity calculation method, when the similarity value in S4 exceeds 0.7, the two short text sentences are considered semantically similar.
The present invention also provides a short text similarity calculation system comprising:
a training corpus word segmentation module, configured to obtain the training corpus and segment it into words;
a term vector training module, connected to the training corpus word segmentation module and configured to train on the training corpus to obtain the term vector of each word, then combine the term vectors into a term vector set;
a short text word segmentation module, configured to segment a short text to be compared into words;
a short text vector generation module, connected to the term vector training module and configured to look up in the term vector set the term vector of each segmented word and combine these term vectors into a short text vector set;
a first similarity calculation module, connected to the term vector training module and the short text vector generation module and configured to compute the cosine similarity between each term vector in the term vector set and each term vector in the short text vector set;
a comparison module, connected to the term vector training module and configured to compare the similarity values of each term vector in the term vector set and obtain the maximum similarity value of each term vector;
a short text sentence vector generation module, connected to the comparison module and configured to combine the maximum similarity values of the term vectors in the term vector set into a short text sentence vector;
a second similarity calculation module, connected to the short text sentence vector generation module and configured to compute the similarity between short text sentence vectors with the cosine similarity formula.
Preferably, in the short text similarity calculation system, the training corpus word segmentation module comprises:
an acquisition unit, configured to obtain the training corpus;
a segmentation unit, connected to the acquisition unit and configured to segment the training corpus into words.
Preferably, in the short text similarity calculation system, the term vector training module comprises:
a term vector training unit, configured to train on the training corpus to obtain the term vector of each word;
a term vector combination unit, connected to the term vector training unit and configured to combine the term vectors into the term vector set.
Preferably, in the short text similarity calculation system, the short text vector generation module comprises:
a lookup unit, configured to find in the term vector set the term vector of each segmented word;
a short text vector combination unit, connected to the lookup unit and configured to combine the term vectors of the segmented words into the short text vector set.
The present invention provides at least the following advantageous effects:
1. The present invention trains the training corpus with the deep learning word2vec algorithm to obtain the term vector of each word, so that words are represented as vectors that effectively express their true inherent meaning. The term vectors are combined into a term vector set, the short texts to be compared are segmented, and the cosine similarity between each term vector in the term vector set and each term vector in the short text vector set is computed with the cosine similarity formula; the maximum similarity value of each term vector is obtained, and these maxima are combined into a short text sentence vector. A short text sentence is thereby represented as a real-valued vector, i.e., the short text is algorithmically converted into a mathematical vector representation, and the resulting sentence vector fully accounts for the inherent semantic meaning of the words in the sentence. The cosine similarity algorithm is then used to compute the similarity between short text sentence vectors, which effectively captures the semantic similarity between short text sentences, achieves high accuracy, and provides strong technical support for downstream natural language processing tasks such as short text clustering and classification.
Further advantages, objects, and features of the present invention will be partly embodied in the following description and partly understood by those skilled in the art through study and practice of the invention.
Description of the drawings
Fig. 1 is a flow diagram of the short text similarity calculation method of the present invention;
Fig. 2 is a schematic diagram of the short text similarity calculation system of the present invention.
Specific implementation mode
The present invention is described in further detail below with reference to the accompanying drawings, so that those skilled in the art can implement it with reference to the description.
It should be understood that terms used herein such as "having", "comprising" and "including" do not preclude the presence or addition of one or more other elements or combinations thereof.
As shown in Fig. 1, a short text similarity calculation method includes the following steps:
S1, obtain a training corpus, segment it into words, train it with the deep learning word2vec algorithm to obtain the term vector (a_1i, a_2i, a_3i, …) of each word in the training corpus, and combine the term vectors into a term vector set S;
S = ((a_11, a_21, a_31, …), (a_12, a_22, a_32, …), (a_13, a_23, a_33, …), …, (a_1i, a_2i, a_3i, …), …, (a_1N, a_2N, a_3N, …))
S2, segment each short text to be compared, look up in the term vector set the term vector word_i corresponding to each word of the segmented short text, and combine them into a short text vector set sen;
sen = (word_1, word_2, word_3, …, word_i, …, word_M)
S3, use the cosine similarity formula to compute the cosine similarity between each term vector in the term vector set and each term vector in the short text vector set, obtain the maximum similarity value max_i of each term vector in the term vector set, and combine the max_i into a short text sentence vector senVec;
senVec = (max_1, max_2, max_3, …, max_i, …, max_N)
S4, use the cosine similarity formula to compute the similarity between two short text sentence vectors; this is the similarity between the two short texts.
In the short text similarity calculation method of the present invention, a training corpus is first obtained from the internet and segmented with a word segmentation tool, producing a set containing a large number of words; in practice the number of words exceeds 10,000, but for ease of explanation suppose the training corpus contains just the seven words "I", "come from", "the U.S.", "Europe", "people", "am", "China". The term vector of each word is then computed with the deep learning word2vec algorithm: for example the term vector of "I" is (a_11, a_21, a_31), that of "come from" is (a_12, a_22, a_32), that of "the U.S." is (a_13, a_23, a_33), that of "Europe" is (a_14, a_24, a_34), that of "people" is (a_15, a_25, a_35), that of "am" is (a_16, a_26, a_36), and that of "China" is (a_17, a_27, a_37). The term vectors here are three-dimensional; in practice they can have more dimensions. These term vectors are combined into the term vector set. Now suppose there are two short text sentences to compare, "I am Chinese" and "I come from China". Each is first segmented with the same method used for the training corpus: "I am Chinese" segments into "I", "am", "China", "people", and "I come from China" segments into "I", "come from", "China". Every segmented word can be found in the training corpus, so its term vector can be looked up in the term vector set; the term vectors are combined into the short text vector set. The cosine similarity between each term vector in the term vector set and each term vector in the short text vector set is then computed. Taking "I am Chinese" as an example: the term vector of "I" in the term vector set is compared by cosine similarity against the term vectors of the four sentence words "I", "am", "China", "people", yielding four values; the largest is recorded as a_1. Likewise, the term vector of "come from" in the term vector set is compared against the term vectors of the same four sentence words, and the largest of the four cosine similarities is recorded as a_2, and so on. The maximum cosine similarities of the seven term vectors "I", "come from", "the U.S.", "Europe", "people", "am", "China" are a_1, a_2, a_3, a_4, a_5, a_6, a_7, and combining them gives the short text sentence vector of "I am Chinese": senVec1 = (a_1, a_2, a_3, a_4, a_5, a_6, a_7). Similarly, for "I come from China", the term vector of "I" in the term vector set is compared against the term vectors of the three words "I", "come from", "China", yielding three values whose maximum is recorded as b_1; the term vector of "come from" is compared against the same three term vectors and the maximum is recorded as b_2, and so on, giving maxima b_1, b_2, b_3, b_4, b_5, b_6, b_7 and the short text sentence vector senVec2 = (b_1, b_2, b_3, b_4, b_5, b_6, b_7). The similarity between "I am Chinese" and "I come from China" is then computed with the cosine similarity formula:
similarity = (senVec1 · senVec2) / (‖senVec1‖ ‖senVec2‖)
The similarity value lies between 0 and 1; the closer it is to 1, the more similar the two short text sentences are.
In another technical solution of the short text similarity calculation method, the training corpus in S1 is obtained as follows: acquire corpus data and remove non-textual data to obtain the training corpus; after the term vector of each word in the training corpus is obtained, remove the term vectors of stop words and of words whose frequency is below a preset threshold, and combine the remaining term vectors into the term vector set S, with the preset threshold between 5 and 10. In this technical solution, corpus data such as forum posts, comments, and professional journal articles is obtained from the internet, and non-textual information such as links and emoticons is removed to obtain the training corpus. Before the training corpus is trained with the word2vec algorithm, its words are classified, and the frequency of each word and the stop words are counted. Stop words include modal particles, adverbs, prepositions, conjunctions, and the like; such words carry no specific meaning on their own and only serve a function within a complete sentence. After training, the term vector of each word in the training corpus is obtained, and the term vectors of stop words and of words whose frequency is below the preset threshold are removed. A threshold can be set in advance; for example, with a threshold of 8, the term vectors of all words occurring fewer than 8 times are removed, since a word of very low frequency rarely forms sentences and can essentially be ignored. This reduces the number of term vectors in the term vector set and speeds up computation. If a short text sentence to be compared contains a removed word, for example if it segments into the four words A1, A2, A3, A4 and A3 is a removed low-frequency word, then in S2 the term vectors of A1, A2, and A4 are looked up in the term vector set and combined into the short text vector set, after which steps S3 and S4 compute the similarity of the short text sentence.
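The filtering step above can be sketched as follows; the stop-word list and toy corpus are invented for illustration (the patent names the stop-word categories but not a concrete list):

```python
from collections import Counter

# Illustrative stop-word list; the patent removes modal particles, adverbs,
# prepositions, conjunctions, etc. These particular entries are assumptions.
STOPWORDS = {"的", "了", "在", "和"}

def build_term_vector_vocab(tokenized_corpus, min_count=8):
    """Return the words whose term vectors are kept: frequency >= min_count
    (the patent suggests a threshold between 5 and 10) and not a stop word."""
    freq = Counter(w for sentence in tokenized_corpus for w in sentence)
    return {w for w, c in freq.items() if c >= min_count and w not in STOPWORDS}

corpus = [["中国", "的"]] * 8 + [["人"]] * 3   # toy segmented corpus
kept = build_term_vector_vocab(corpus)        # "的" is a stop word, "人" too rare
```

With the toy corpus, only 中国 (frequency 8, not a stop word) survives the filter.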
In another technical solution of the short text similarity calculation method, in S2 the short texts to be compared and the training corpus are segmented with an HMM model and the Viterbi algorithm. Using the same segmentation method for the short texts and the training corpus ensures that every word of a segmented short text can be found in the training corpus.
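The HMM-plus-Viterbi segmentation can be illustrated with a toy character-tagging model over the states B/M/E/S (begin/middle/end/single); all probabilities below are invented for the demonstration, whereas a real segmenter estimates them from an annotated corpus:

```python
import math

# Toy HMM for character-based word segmentation with states
# B(egin), M(iddle), E(nd of a multi-character word), S(ingle-character word).
# All probabilities are invented for illustration.
STATES = "BMES"
START = {"B": 0.6, "M": 0.0, "E": 0.0, "S": 0.4}
TRANS = {
    "B": {"M": 0.3, "E": 0.7},
    "M": {"M": 0.4, "E": 0.6},
    "E": {"B": 0.5, "S": 0.5},
    "S": {"B": 0.5, "S": 0.5},
}
EMIT = {  # P(character | state) over a tiny vocabulary
    "B": {"中": 0.9},
    "M": {"国": 0.9},
    "E": {"人": 0.9},
    "S": {"我": 0.5, "是": 0.5},
}
FLOOR = 1e-12  # stand-in probability for unseen transitions/emissions

def viterbi(chars):
    """Return the most probable B/M/E/S tag sequence for `chars`."""
    logp = lambda x: math.log(x if x > 0 else FLOOR)
    V = {s: logp(START[s]) + logp(EMIT[s].get(chars[0], FLOOR)) for s in STATES}
    paths = {s: [s] for s in STATES}
    for ch in chars[1:]:
        V2, paths2 = {}, {}
        for s in STATES:
            score, prev = max(
                (V[p] + logp(TRANS[p].get(s, FLOOR)) + logp(EMIT[s].get(ch, FLOOR)), p)
                for p in STATES
            )
            V2[s], paths2[s] = score, paths[prev] + [s]
        V, paths = V2, paths2
    return paths[max(STATES, key=lambda s: V[s])]

def segment(text):
    """Cut `text` into words at E and S tags."""
    words, cur = [], ""
    for ch, tag in zip(text, viterbi(text)):
        cur += ch
        if tag in "ES":
            words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words

print(segment("我是中国人"))  # ['我', '是', '中国人']
```

Running the same `segment` function over both the training corpus and the short texts to be compared is what guarantees the lookup in S2 succeeds.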
In another technical solution of the short text similarity calculation method, the corpus data is obtained with web crawler technology.
In another technical solution of the short text similarity calculation method, when the similarity value in S4 exceeds 0.7, the two short text sentences are considered semantically similar: the larger the similarity value, the closer the two short text sentences are, and when the value exceeds 0.7 the two short text sentences are taken to have the same semantics.
As shown in Fig. 2, the present invention also provides a short text similarity calculation system comprising:
a training corpus word segmentation module, configured to obtain the training corpus and segment it into words;
a term vector training module, connected to the training corpus word segmentation module and configured to train on the training corpus to obtain the term vector of each word, then combine the term vectors into a term vector set;
a short text word segmentation module, configured to segment a short text to be compared into words;
a short text vector generation module, connected to the term vector training module and configured to look up in the term vector set the term vector of each segmented word and combine these term vectors into a short text vector set;
a first similarity calculation module, connected to the term vector training module and the short text vector generation module and configured to compute the cosine similarity between each term vector in the term vector set and each term vector in the short text vector set;
a comparison module, connected to the term vector training module and configured to compare the similarity values of each term vector in the term vector set and obtain the maximum similarity value of each term vector;
a short text sentence vector generation module, connected to the comparison module and configured to combine the maximum similarity values of the term vectors in the term vector set into a short text sentence vector;
a second similarity calculation module, connected to the short text sentence vector generation module and configured to compute the similarity between short text sentence vectors with the cosine similarity formula.
In the short text similarity calculation system of the present invention, the training corpus word segmentation module obtains the training corpus and segments it into words; the term vector training module then trains on the training corpus to obtain the term vector of each word and combines the term vectors into a term vector set; the short text word segmentation module segments a short text to be compared into words; the short text vector generation module looks up in the term vector set the term vector of each segmented word and combines them into a short text vector set; the first similarity calculation module computes the cosine similarity between each term vector in the term vector set and each term vector in the short text vector set; the comparison module compares the similarity values of each term vector in the term vector set and obtains the maximum similarity value of each; the short text sentence vector generation module combines these maxima into a short text sentence vector; and the second similarity calculation module computes the similarity between short text sentence vectors with the cosine similarity formula.
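The module wiring described above can be sketched as a single class whose methods play the roles of the modules; the class, its names, and its inputs are illustrative only, with the term vector set (the term vector training module's output) and the segmenter (the word segmentation modules) injected from outside:

```python
class ShortTextSimilaritySystem:
    """Illustrative sketch of the system's modules (names are invented)."""

    def __init__(self, term_vectors, segment):
        self.term_vectors = term_vectors  # dict: word -> vector (list of floats)
        self.vocab = list(term_vectors)   # fixed order defines senVec's dimensions
        self.segment = segment            # callable: text -> list of words

    @staticmethod
    def _cos(a, b):
        # cosine similarity, used by both similarity calculation modules
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    def _sentence_vector(self, text):
        # short text word segmentation + vector generation modules, then the
        # first similarity calculation, comparison, and sentence vector
        # generation modules (max over cosine similarities per vocab word)
        vecs = [self.term_vectors[w] for w in self.segment(text)
                if w in self.term_vectors]
        return [max(self._cos(self.term_vectors[w], v) for v in vecs)
                for w in self.vocab]

    def similarity(self, text1, text2):
        # second similarity calculation module
        return self._cos(self._sentence_vector(text1),
                         self._sentence_vector(text2))

# Usage with invented two-dimensional term vectors and whitespace "segmentation":
system = ShortTextSimilaritySystem(
    {"I": [1.0, 0.0], "China": [0.0, 1.0]}, str.split)
same = system.similarity("I China", "I China")   # identical texts -> 1.0
```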
In another technical solution of the short text similarity calculation system, the training corpus word segmentation module comprises:
an acquisition unit, configured to obtain the training corpus;
a segmentation unit, connected to the acquisition unit and configured to segment the training corpus into words.
In another technical solution of the short text similarity calculation system, the term vector training module comprises:
a term vector training unit, configured to train on the training corpus to obtain the term vector of each word;
a term vector combination unit, connected to the term vector training unit and configured to combine the term vectors into the term vector set.
In another technical solution of the short text similarity calculation system, the short text vector generation module comprises:
a lookup unit, configured to find in the term vector set the term vector of each segmented word;
a short text vector combination unit, connected to the lookup unit and configured to combine the term vectors of the segmented words into the short text vector set.
Although embodiments of the present invention have been disclosed above, they are not limited to the applications listed in the description and the embodiments; the invention can be applied to various fields to which it is suited, and those skilled in the art can easily realize further modifications. Therefore, without departing from the general concept defined by the claims and their equivalent scope, the present invention is not limited to the specific details and illustrations shown and described herein.

Claims (9)

1. A short text similarity calculation method, characterized by comprising the following steps:
S1, obtain a training corpus, segment it into words, train it with the deep learning word2vec algorithm to obtain the term vector (a_1i, a_2i, a_3i, …) of each word in the training corpus, and combine the term vectors into a term vector set S;
S = ((a_11, a_21, a_31, …), (a_12, a_22, a_32, …), (a_13, a_23, a_33, …), …, (a_1i, a_2i, a_3i, …), …, (a_1N, a_2N, a_3N, …))
S2, segment each short text to be compared, look up in the term vector set the term vector word_i corresponding to each word of the segmented short text, and combine them into a short text vector set sen;
sen = (word_1, word_2, word_3, …, word_i, …, word_M)
S3, use the cosine similarity formula to compute the cosine similarity between each term vector in the term vector set and each term vector in the short text vector set, obtain the maximum similarity value max_i of each term vector in the term vector set, and combine the max_i into a short text sentence vector senVec;
senVec = (max_1, max_2, max_3, …, max_i, …, max_N)
S4, use the cosine similarity formula to compute the similarity between two short text sentence vectors; this is the similarity between the two short texts.
2. The short text similarity calculation method as claimed in claim 1, characterized in that the training corpus in S1 is obtained as follows: acquire corpus data and remove non-textual data from it to obtain the training corpus; after the term vector of each word in the training corpus is obtained, remove the term vectors of stop words and of words whose frequency is below a preset threshold, and combine the remaining term vectors into the term vector set S, the preset threshold being between 5 and 10.
3. The short text similarity calculation method as claimed in claim 1, characterized in that in S2 the short texts to be compared and the training corpus are segmented with an HMM model and the Viterbi algorithm.
4. The short text similarity calculation method as claimed in claim 2, characterized in that the corpus data is obtained with web crawler technology.
5. The short text similarity calculation method as claimed in claim 1, characterized in that when the similarity value in S4 exceeds 0.7, the two short text sentences are considered semantically similar.
6. A short text similarity calculation system as described in claim 1, characterized by comprising:
Training corpus word-dividing mode is used to obtain training corpus and is segmented to training corpus;
Term vector training module is connect with training corpus word-dividing mode, and the term vector training module is used for training corpus It is trained to obtain the term vector of each word in training corpus, then combines each term vector to form term vector set;
Short text word-dividing mode is used to be segmented to obtain multiple words to short text to be calculated;
Short text vector generation module is connect with term vector training module, and short text vector generation module is used in term vector The corresponding term vector of each word after participle is found in set, and the term vector of each word after participle combined to be formed it is short Text vector set;
First similarity calculation module is connect with term vector training module, short text vector generation module, and described first is similar Degree computing module is used to calculate the cosine phase of each term vector and each term vector in short text vector set in term vector set Like degree;
Comparison module is connect with term vector training module, and the comparison module is used for each term vector of term vector set Similarity value is compared, and obtains the maximum value of the similarity of each term vector in term vector set;
Short text sentence vector generation module, connect with comparison module, and the short text sentence vector generation module is used for will The maximum similarity value of each term vector combines to obtain short text sentence vector in term vector set;
Second similarity calculation module is connect with short text sentence vector generation module, second similarity calculation module Similarity between short text sentence vector is calculated using cosine similarity formula.
7. The short text similarity calculation system according to claim 6, wherein the training-corpus word segmentation module comprises:
an acquiring unit, used to obtain the training corpus;
a word segmentation unit, connected to the acquiring unit and used to segment the training corpus.
8. The short text similarity calculation system according to claim 6, wherein the word-vector training module comprises:
a word-vector training unit, used to train on the training corpus to obtain the word vector of each word in it;
a word-vector combination unit, connected to the word-vector training unit and used to combine the word vectors into the word-vector set.
9. The short text similarity calculation system according to claim 6, wherein the short-text vector generation module comprises:
a lookup unit, used to find in the word-vector set the word vector corresponding to each segmented word;
a short-text vector combination unit, connected to the lookup unit and used to combine the word vectors of the segmented words into the short-text vector set.
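The module decomposition of claims 6–9 can be mirrored in code roughly as follows. This is a sketch under the assumption that a segmenter and a trained word-vector table are supplied externally; the class and method names are illustrative, not from the patent.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity; returns 0.0 for a zero vector.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

class ShortTextSimilaritySystem:
    def __init__(self, word_vectors, segment):
        self.word_vectors = word_vectors  # output of the word-vector training module
        self.segment = segment            # the word segmentation module

    def text_vectors(self, text):
        # Short-text vector generation module: look up each segmented word.
        return [self.word_vectors[w] for w in self.segment(text)
                if w in self.word_vectors]

    def sentence_vector(self, text):
        # First similarity calculation, comparison, and sentence-vector
        # generation modules: one maximum similarity per vocabulary vector.
        tv = self.text_vectors(text)
        return np.array([max(cosine(v, w) for w in tv)
                         for v in self.word_vectors.values()])

    def similarity(self, a, b):
        # Second similarity calculation module.
        return cosine(self.sentence_vector(a), self.sentence_vector(b))
```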
CN201810090296.9A 2018-01-30 2018-01-30 Short text similarity calculating method and system Pending CN108334495A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810090296.9A CN108334495A (en) 2018-01-30 2018-01-30 Short text similarity calculating method and system

Publications (1)

Publication Number Publication Date
CN108334495A true CN108334495A (en) 2018-07-27

Family

ID=62926328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810090296.9A Pending CN108334495A (en) 2018-01-30 2018-01-30 Short text similarity calculating method and system

Country Status (1)

Country Link
CN (1) CN108334495A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389307A (en) * 2015-12-02 2016-03-09 上海智臻智能网络科技股份有限公司 Statement intention category identification method and apparatus
CN106021223A (en) * 2016-05-09 2016-10-12 Tcl集团股份有限公司 Sentence similarity calculation method and system
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity
CN107436864A (en) * 2017-08-04 2017-12-05 逸途(北京)科技有限公司 A kind of Chinese question and answer semantic similarity calculation method based on Word2Vec

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
段旭磊, 张仰森, 孙祎卓: "Research on sentence-vector representation and similarity calculation methods for microblog text", Computer Engineering (《计算机工程》) *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165382B (en) * 2018-08-03 2022-08-23 南京工业大学 Similar defect report recommendation method combining weighted word vector and potential semantic analysis
CN109165382A (en) * 2018-08-03 2019-01-08 南京工业大学 Similar defect report recommendation method combining weighted word vector and potential semantic analysis
CN110874528B (en) * 2018-08-10 2020-11-10 珠海格力电器股份有限公司 Text similarity obtaining method and device
CN110874528A (en) * 2018-08-10 2020-03-10 珠海格力电器股份有限公司 Text similarity obtaining method and device
CN109271514A (en) * 2018-09-14 2019-01-25 华南师范大学 Generation method, classification method, device and the storage medium of short text disaggregated model
WO2020062770A1 (en) * 2018-09-27 2020-04-02 深圳大学 Method and apparatus for constructing domain dictionary, and device and storage medium
CN109597992B (en) * 2018-11-27 2023-06-27 浪潮金融信息技术有限公司 Question similarity calculation method combining synonym dictionary and word embedding vector
CN109597992A (en) * 2018-11-27 2019-04-09 苏州浪潮智能软件有限公司 A kind of Question sentence parsing calculation method of combination synonymicon and word insertion vector
CN109871437A (en) * 2018-11-30 2019-06-11 阿里巴巴集团控股有限公司 Method and device for the processing of customer problem sentence
CN109871437B (en) * 2018-11-30 2023-04-21 阿里巴巴集团控股有限公司 Method and device for processing user problem statement
CN110059155A (en) * 2018-12-18 2019-07-26 阿里巴巴集团控股有限公司 The calculating of text similarity, intelligent customer service system implementation method and device
CN111368061B (en) * 2018-12-25 2024-04-12 深圳市优必选科技有限公司 Short text filtering method, device, medium and computer equipment
CN111368061A (en) * 2018-12-25 2020-07-03 深圳市优必选科技有限公司 Short text filtering method, device, medium and computer equipment
CN110113228A (en) * 2019-04-25 2019-08-09 新华三信息安全技术有限公司 A kind of network connection detection method and device
CN110096705B (en) * 2019-04-29 2023-09-08 扬州大学 Unsupervised English sentence automatic simplification algorithm
CN110096705A (en) * 2019-04-29 2019-08-06 扬州大学 A kind of unsupervised english sentence simplifies algorithm automatically
CN110009064A (en) * 2019-04-30 2019-07-12 广东电网有限责任公司 A kind of semantic model training method and device based on electrical network field
CN110287312A (en) * 2019-05-10 2019-09-27 平安科技(深圳)有限公司 Calculation method, device, computer equipment and the computer storage medium of text similarity
CN110287312B (en) * 2019-05-10 2023-08-25 平安科技(深圳)有限公司 Text similarity calculation method, device, computer equipment and computer storage medium
CN110266675A (en) * 2019-06-12 2019-09-20 成都积微物联集团股份有限公司 A kind of xss attack automated detection method based on deep learning
CN110674251A (en) * 2019-08-21 2020-01-10 杭州电子科技大学 Computer-assisted secret point annotation method based on semantic information
CN110781277A (en) * 2019-09-23 2020-02-11 厦门快商通科技股份有限公司 Text recognition model similarity training method, system, recognition method and terminal
CN110781687A (en) * 2019-11-06 2020-02-11 三角兽(北京)科技有限公司 Same intention statement acquisition method and device
CN110781687B (en) * 2019-11-06 2021-07-06 腾讯科技(深圳)有限公司 Same intention statement acquisition method and device
CN110956033A (en) * 2019-12-04 2020-04-03 北京中电普华信息技术有限公司 Text similarity calculation method and device
CN111178059A (en) * 2019-12-07 2020-05-19 武汉光谷信息技术股份有限公司 Similarity comparison method and device based on word2vec technology
CN111178059B (en) * 2019-12-07 2023-08-25 武汉光谷信息技术股份有限公司 Similarity comparison method and device based on word2vec technology
CN111191469B (en) * 2019-12-17 2023-09-19 语联网(武汉)信息技术有限公司 Large-scale corpus cleaning and aligning method and device
CN111191469A (en) * 2019-12-17 2020-05-22 语联网(武汉)信息技术有限公司 Large-scale corpus cleaning and aligning method and device
CN111259649A (en) * 2020-01-19 2020-06-09 深圳壹账通智能科技有限公司 Interactive data classification method and device of information interaction platform and storage medium
CN111523301A (en) * 2020-06-05 2020-08-11 泰康保险集团股份有限公司 Contract document compliance checking method and device
CN112115715A (en) * 2020-09-04 2020-12-22 北京嘀嘀无限科技发展有限公司 Natural language text processing method and device, storage medium and electronic equipment
CN111986007A (en) * 2020-10-26 2020-11-24 北京值得买科技股份有限公司 Method for commodity aggregation and similarity calculation
CN112257431A (en) * 2020-10-30 2021-01-22 中电万维信息技术有限责任公司 NLP-based short text data processing method
CN113342968A (en) * 2021-05-21 2021-09-03 中国石油天然气股份有限公司 Text abstract extraction method and device
CN113342968B (en) * 2021-05-21 2024-07-30 中国石油天然气股份有限公司 Text abstract extraction method and device
CN116932702A (en) * 2023-09-19 2023-10-24 湖南正宇软件技术开发有限公司 Method, system, device and storage medium for proposal and proposal

Similar Documents

Publication Publication Date Title
CN108334495A (en) Short text similarity calculating method and system
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
CN107193801B (en) Short text feature optimization and emotion analysis method based on deep belief network
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN107729392B (en) Text structuring method, device and system and non-volatile storage medium
CN108363725B (en) Method for extracting user comment opinions and generating opinion labels
WO2019080863A1 (en) Text sentiment classification method, storage medium and computer
CN106844346A (en) Short text Semantic Similarity method of discrimination and system based on deep learning model Word2Vec
WO2018153215A1 (en) Method for automatically generating sentence sample with similar semantics
CN105701084A (en) Characteristic extraction method of text classification on the basis of mutual information
CN103678275A (en) Two-level text similarity calculation method based on subjective and objective semantics
CN110879834B (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN111291177A (en) Information processing method and device and computer storage medium
CN110188359B (en) Text entity extraction method
CN108710611A (en) A kind of short text topic model generation method of word-based network and term vector
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
CN105956158A (en) Automatic extraction method of network neologism on the basis of mass microblog texts and use information
Gong et al. A semantic similarity language model to improve automatic image annotation
CN113761104A (en) Method and device for detecting entity relationship in knowledge graph and electronic equipment
CN110413985B (en) Related text segment searching method and device
CN111104508A (en) Method, system and medium for representing word bag model text based on fault-tolerant rough set
CN114722774B (en) Data compression method, device, electronic equipment and storage medium
CN110597982A (en) Short text topic clustering algorithm based on word co-occurrence network
CN114169325B (en) Webpage new word discovery and analysis method based on word vector representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180727