CN108334495A - Short text similarity calculation method and system - Google Patents
- Publication number
- CN108334495A CN108334495A CN201810090296.9A CN201810090296A CN108334495A CN 108334495 A CN108334495 A CN 108334495A CN 201810090296 A CN201810090296 A CN 201810090296A CN 108334495 A CN108334495 A CN 108334495A
- Authority
- CN
- China
- Prior art keywords
- term vector
- short text
- word
- vector
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a short text similarity calculation method comprising the following steps: S1, segment a training corpus into words, obtain the word vector of each word with the word2vec algorithm, and combine the word vectors into a word vector set; S2, segment each short text to be compared, look up the word vector of each of its words in the word vector set, and combine them into a short text vector set; S3, compute the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set, and combine the maximum similarity value of each word vector into a short text sentence vector; S4, compute the similarity between the two short text sentence vectors, which is the similarity between the two short texts. The present invention also provides a short text similarity calculation system. By representing each short text sentence as a sentence vector, the similarity algorithm of the present invention effectively captures the semantic similarity between short text sentences and achieves high accuracy.
Description
Technical field
The invention belongs to the technical field of short text similarity, and in particular relates to a short text similarity calculation method and system.
Background art
With the rapid development of computer science and the internet, more and more data appear on the internet in the form of short text, such as Twitter messages, news headlines, and forum posts. Applying machine learning techniques such as classification and clustering to internet short text data, so as to mine valuable information from it and serve people's daily needs, has become a very popular topic in current big data applications. However, Chinese short texts have sparse features, discrete semantics, and casual wording, which makes research on them extremely challenging. Mining short text data and accurately recognizing its inherent meaning has therefore become an urgent task of considerable theoretical significance.
Current methods for Chinese short text similarity calculation mainly represent text with the vector space model (VSM) and then compute text similarity within that model. In the VSM, a text is formalized as a point in a high-dimensional space and given in the form of a vector, so that processing text content reduces to operations on vectors, which lowers the complexity of the problem. Analyzing short texts this way raises two main problems. First, because the feature words of a short text are sparse, its text vector is far sparser than that of ordinary text, so algorithms designed for ordinary text cluster short texts poorly and cannot reach the performance achieved on long texts. Second, representing text with the vector space model only considers the statistical properties of words in context and assumes linear independence between keywords, without considering the semantic information of the words themselves; it therefore has significant limitations and cannot accurately express the semantic meaning of a sentence.
Summary of the invention
An object of the present invention is to solve the above problems and to provide at least the advantages described later.

A further object of the present invention is to provide a short text similarity calculation method that uses the deep learning word2vec algorithm to train a word vector for each word in a training corpus, computes the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set, obtains the maximum similarity value of each word vector in the word vector set, combines those maximum similarity values into a short text sentence vector, and then uses the cosine similarity algorithm to compute the similarity between short text sentence vectors, effectively capturing the semantic similarity between short text sentences.
To achieve these objects and other advantages in accordance with the present invention, a short text similarity calculation method is provided, comprising the following steps:

S1. Obtain a training corpus and segment it into words. Train on the corpus with the deep learning word2vec algorithm to obtain the word vector (a1i, a2i, a3i, …) of each word, then combine the word vectors into a word vector set S:

S = ((a11, a21, a31, …), (a12, a22, a32, …), (a13, a23, a33, …), …, (a1i, a2i, a3i, …), …, (a1N, a2N, a3N, …))

S2. Segment each short text to be compared, look up in the word vector set the word vector wordi corresponding to each word of the segmented short text, and combine them into a short text vector set sen:

sen = (word1, word2, word3, …, wordi, …, wordM)

S3. Using the cosine similarity formula, compute the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set, obtain the maximum similarity value maxi of each word vector in the word vector set, and combine the maximum similarity values maxi into a short text sentence vector senVec:

senVec = (max1, max2, max3, …, maxi, …, maxN)

S4. Compute the similarity between the two short text sentence vectors with the cosine similarity formula; this is the similarity between the two short texts.
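Steps S2–S4 can be sketched directly. The block below is a minimal pure-Python illustration in which the word vectors are small hand-written stand-ins for word2vec output; the vectors, dimensions, and texts are assumptions for illustration, not values from the patent:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_vector(word_vector_set, text_vectors):
    """S3: one entry per corpus word vector, holding its maximum cosine
    similarity against the word vectors of one short text."""
    return [max(cosine(w, t) for t in text_vectors) for w in word_vector_set]

def short_text_similarity(word_vector_set, text1_vectors, text2_vectors):
    """S4: cosine similarity of the two sentence vectors."""
    return cosine(sentence_vector(word_vector_set, text1_vectors),
                  sentence_vector(word_vector_set, text2_vectors))

# Toy three-dimensional word vectors (assumed values; real ones come from S1).
S = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.7, 0.7, 0.0)]
sen1 = [(1.0, 0.0, 0.0), (0.7, 0.7, 0.0)]   # word vectors of short text 1
sen2 = [(1.0, 0.1, 0.0), (0.0, 0.0, 1.0)]   # word vectors of short text 2
sim = short_text_similarity(S, sen1, sen2)
```

Note that the sentence vector always has one entry per word vector in the corpus set, so two texts of different lengths still map to vectors of equal dimension.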
Preferably, in the short text similarity calculation method, the training corpus in S1 is obtained as follows: corpus data is obtained, and the non-text data in it is removed to yield the training corpus. After the word vector of each word in the training corpus is obtained, the word vectors corresponding to stop words and to words whose frequency is below a preset threshold are removed, and the word vectors of the remaining words are combined into the word vector set S; the preset threshold lies between 5 and 10.

Preferably, in the short text similarity calculation method, the short texts to be compared and the training corpus are segmented in S2 with an HMM model and the Viterbi algorithm.

Preferably, in the short text similarity calculation method, the corpus data is obtained with web crawler technology.

Preferably, in the short text similarity calculation method, two short text sentences are considered semantically similar in S4 when their similarity value exceeds 0.7.
The present invention also provides a short text similarity calculation system, comprising:

a training corpus segmentation module, used to obtain the training corpus and segment it into words;

a word vector training module, connected to the training corpus segmentation module and used to train on the training corpus to obtain the word vector of each word in it and then combine the word vectors into a word vector set;

a short text segmentation module, used to segment a short text to be compared into words;

a short text vector generation module, connected to the word vector training module and used to look up in the word vector set the word vector corresponding to each segmented word and combine those word vectors into a short text vector set;

a first similarity calculation module, connected to the word vector training module and the short text vector generation module and used to compute the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set;

a comparison module, connected to the word vector training module and used to compare the similarity values of each word vector in the word vector set and obtain the maximum similarity value of each word vector;

a short text sentence vector generation module, connected to the comparison module and used to combine the maximum similarity values of the word vectors in the word vector set into a short text sentence vector;

a second similarity calculation module, connected to the short text sentence vector generation module, which computes the similarity between short text sentence vectors with the cosine similarity formula.

Preferably, in the short text similarity calculation system, the training corpus segmentation module comprises:

an acquisition unit, used to obtain the training corpus;

a segmentation unit, connected to the acquisition unit and used to segment the training corpus.

Preferably, in the short text similarity calculation system, the word vector training module comprises:

a word vector training unit, used to train on the training corpus to obtain the word vector of each word in it;

a word vector combination unit, connected to the word vector training unit and used to combine the word vectors into the word vector set.

Preferably, in the short text similarity calculation system, the short text vector generation module comprises:

a lookup unit, used to find in the word vector set the word vector corresponding to each segmented word;

a short text vector combination unit, connected to the lookup unit and used to combine the word vectors of the segmented words into the short text vector set.
The present invention includes at least the following advantageous effects:

1. The present invention trains on the training corpus with the deep learning word2vec algorithm and obtains the word vector of each word in it, so that words are represented as vectors that effectively express their real inherent meaning. The word vectors are then combined into a word vector set; each short text to be compared is segmented; the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set is computed with the cosine similarity formula; the maximum similarity value of each word vector in the word vector set is obtained; and those maximum values are combined into a short text sentence vector. A short text sentence is thus represented as a real-valued vector, i.e. the algorithm converts a short text into a mathematical vector representation, and the resulting short text sentence vector fully accounts for the inherent semantic meaning of the words in the sentence. The cosine similarity algorithm is then used to compute the similarity between short text sentence vectors, which effectively captures the semantic similarity between short text sentences, achieves high accuracy, and builds strong technical support for subsequent natural language processing tasks such as short text clustering and classification.

Further advantages, objects, and features of the present invention will in part be set forth in the description that follows, and in part will be understood by those skilled in the art through study and practice of the invention.
Description of the drawings
Fig. 1 is a flow diagram of the short text similarity calculation method of the present invention;

Fig. 2 is a schematic diagram of the composition of the short text similarity calculation system of the present invention.
Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings, so that those skilled in the art can implement it with reference to this specification.

It should be understood that terms used herein such as "having", "comprising", and "including" do not preclude the presence or addition of one or more other elements or combinations thereof.
As shown in Fig. 1, a short text similarity calculation method comprises the following steps:

S1. Obtain a training corpus and segment it into words. Train on the corpus with the deep learning word2vec algorithm to obtain the word vector (a1i, a2i, a3i, …) of each word, then combine the word vectors into a word vector set S:

S = ((a11, a21, a31, …), (a12, a22, a32, …), (a13, a23, a33, …), …, (a1i, a2i, a3i, …), …, (a1N, a2N, a3N, …))

S2. Segment each short text to be compared, look up in the word vector set the word vector wordi corresponding to each word of the segmented short text, and combine them into a short text vector set sen:

sen = (word1, word2, word3, …, wordi, …, wordM)

S3. Using the cosine similarity formula, compute the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set, obtain the maximum similarity value maxi of each word vector in the word vector set, and combine the maximum similarity values maxi into a short text sentence vector senVec:

senVec = (max1, max2, max3, …, maxi, …, maxN)

S4. Compute the similarity between the two short text sentence vectors with the cosine similarity formula; this is the similarity between the two short texts.
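Step S1 calls for word2vec training, which needs a real training run. As a self-contained stand-in that only illustrates the shape of S1's output (a map from each vocabulary word to a vector), the sketch below builds co-occurrence count vectors over a tiny made-up corpus; real word2vec learns dense vectors instead:

```python
from collections import Counter, defaultdict

def train_toy_vectors(sentences, window=2):
    """Stand-in for S1's word2vec step: represent each word by its
    co-occurrence counts with every vocabulary word. Real word2vec learns
    dense vectors; this only illustrates the word -> vector mapping."""
    vocab = sorted({w for s in sentences for w in s})
    cooc = defaultdict(Counter)
    for s in sentences:
        for i, w in enumerate(s):
            # count neighbors within the context window around position i
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    cooc[w][s[j]] += 1
    return {w: [cooc[w][v] for v in vocab] for w in vocab}, vocab

# Hypothetical segmented corpus (English glosses of the example's words).
corpus = [["I", "am", "China", "person"], ["I", "come-from", "China"]]
S, vocab = train_toy_vectors(corpus)
```

Whatever the training method, the output is the same shape the method needs downstream: every vocabulary word maps to a fixed-length vector.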
In the short text similarity calculation method of the present invention, a training corpus is first obtained from the internet and segmented with a word segmentation tool, so the training corpus becomes a set containing a large number of words; in practice the number of words exceeds 10,000. For ease of explanation, suppose the training corpus contains only the seven words "I", "come from", "America", "Europe", "person", "am", and "China". The word vector of each word in the training corpus is then computed with the deep learning word2vec algorithm; for example, the word vector of "I" is (a11, a21, a31), that of "come from" is (a12, a22, a32), that of "America" is (a13, a23, a33), that of "Europe" is (a14, a24, a34), that of "person" is (a15, a25, a35), that of "am" is (a16, a26, a36), and that of "China" is (a17, a27, a37). The word vectors here are three-dimensional; in practice they can have any number of dimensions. These word vectors are combined into the word vector set.

Now suppose there are two short text sentences to be compared, "I am Chinese" and "I come from China". Each is first segmented with the same method as the training corpus: "I am Chinese" segments into "I", "am", "China", "person", and "I come from China" segments into "I", "come from", "China". Every segmented word can be found in the training corpus, so the word vector corresponding to each word can be looked up in the word vector set, and the word vectors are combined into a short text vector set. The cosine similarity between each word vector of the word vector set and each word vector of the short text vector set is then computed. Taking "I am Chinese" as an example: the word vector of "I" in the word vector set is compared by cosine similarity against the word vectors of the four sentence words "I", "am", "China", "person", giving four values; the largest of them is denoted a1. Next the word vector of "come from" in the word vector set is compared against the same four word vectors, giving four values whose maximum is denoted a2, and so on, until the maximum cosine similarity of each of the seven word vectors "I", "come from", "America", "Europe", "person", "am", "China" is obtained as a1, a2, a3, a4, a5, a6, a7. Combining these maxima yields the short text sentence vector of "I am Chinese":

senVec1 = (a1, a2, a3, a4, a5, a6, a7)

The short text sentence vector of "I come from China" is computed in the same way: the word vector of "I" in the word vector set is compared against the word vectors of the three sentence words "I", "come from", "China", and the maximum of the three values is denoted b1; the word vector of "come from" is compared likewise to give b2; and so on, until the maximum cosine similarity of each of the seven word vectors is obtained as b1, b2, b3, b4, b5, b6, b7, giving the short text sentence vector of "I come from China":

senVec2 = (b1, b2, b3, b4, b5, b6, b7)

The similarity of "I am Chinese" and "I come from China" is then computed with the cosine similarity formula:

similarity = (senVec1 · senVec2) / (|senVec1| × |senVec2|)

The similarity value lies between 0 and 1; the closer it is to 1, the more similar the two short text sentences are.
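The worked example above can be traced in code. The word vectors below are invented three-dimensional stand-ins for the a_ij values (the patent gives no numeric vectors), so only the shapes and the construction of senVec1 and senVec2 match the example:

```python
import math

def cos(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Toy stand-ins for the word2vec vectors of the seven corpus words
# (assumed values for illustration; the real vectors come from training).
vectors = {
    "I":         (0.9, 0.1, 0.0),
    "come from": (0.1, 0.8, 0.3),
    "America":   (0.2, 0.3, 0.9),
    "Europe":    (0.1, 0.4, 0.8),
    "person":    (0.7, 0.3, 0.2),
    "am":        (0.6, 0.5, 0.1),
    "China":     (0.3, 0.2, 0.9),
}
corpus_words = list(vectors)

def sen_vec(sentence_words):
    """One max-similarity entry per corpus word (the a_i / b_i of the example)."""
    return [max(cos(vectors[w], vectors[s]) for s in sentence_words)
            for w in corpus_words]

senVec1 = sen_vec(["I", "am", "China", "person"])   # "I am Chinese"
senVec2 = sen_vec(["I", "come from", "China"])      # "I come from China"
similarity = cos(senVec1, senVec2)
```

Corpus words that occur in the sentence get an entry of 1 (their best match is themselves), while absent words contribute their closest sentence word's similarity, which is what lets sentences of different lengths share one vector space.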
In another technical solution of the short text similarity calculation method, the training corpus in S1 is obtained as follows: corpus data is obtained, and the non-text data in it is removed to yield the training corpus; after the word vector of each word in the training corpus is obtained, the word vectors corresponding to stop words and to words whose frequency is below a preset threshold are removed, and the word vectors of the remaining words are combined into the word vector set S, the preset threshold lying between 5 and 10. In this technical solution, the corpus data is obtained from the internet and includes articles from forums, comments, professional journals and magazines, etc.; non-text information such as links and emoticons is removed from the corpus data to obtain the training corpus. Before the training corpus is trained with the word2vec algorithm, its words are classified and the frequency of each word and the stop words are counted. Stop words include modal particles, adverbs, prepositions, conjunctions, and so on; these words have no specific meaning of their own and only serve a function inside a complete sentence, such as the common Chinese particles "的" and "了". After training yields the word vector of each word in the training corpus, the word vectors of stop words and of words whose frequency is below the preset threshold are removed. A threshold can be set in advance here, for example 8, in which case the word vectors of all words with frequency below 8 are removed: such a small frequency indicates that sentences composed of that word rarely occur, so the word can essentially be ignored. This reduces the number of word vectors in the word vector set and speeds up the calculation. If a short text sentence to be compared contains a removed word, for example if the segmented sentence contains the four words A1, A2, A3, A4 and A3 is a removed low-frequency word, then in S2 only the word vectors corresponding to A1, A2, A4 are looked up in the word vector set and combined into the short text vector set, after which the similarity of the short text sentence is computed in steps S3 and S4.
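A minimal sketch of this corpus preparation, assuming an illustrative stop word list, a threshold of 8, and simple regular expressions for the link/emoticon removal (none of these specifics are fixed by the patent):

```python
import re
from collections import Counter

# Hypothetical stop word list and threshold (illustrative assumptions).
STOP_WORDS = {"的", "了", "和", "在"}
FREQ_THRESHOLD = 8

def clean(text):
    """Remove non-text data such as links and emoticons (simplified)."""
    text = re.sub(r"https?://\S+", " ", text)          # strip links
    text = re.sub(r"[^\w\u4e00-\u9fff ]", " ", text)  # strip symbols/emoticons
    return text

def filter_vocab(tokens):
    """Keep only words that are not stop words and meet the frequency threshold."""
    freq = Counter(tokens)
    return {w for w, c in freq.items()
            if w not in STOP_WORDS and c >= FREQ_THRESHOLD}

# "中国" is frequent enough; "的" is a stop word; "稀有词" is below threshold.
tokens = ["中国"] * 10 + ["的"] * 20 + ["稀有词"] * 2
vocab = filter_vocab(tokens)
```

The filtering trades a small loss of coverage (removed words are simply skipped in S2, as described above for A3) for a smaller word vector set and faster similarity calculation.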
In another technical solution of the short text similarity calculation method, the short texts to be compared and the training corpus are segmented in S2 with an HMM model and the Viterbi algorithm. Using the same method to segment both the short texts to be compared and the training corpus ensures that each word of a segmented short text can be found in the training corpus.
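The patent names HMM plus Viterbi but gives no model parameters, so the following is a generic Viterbi decode over a toy BMES (begin/middle/end/single) tagging model with made-up probabilities; this is the standard way an HMM segmenter chooses word boundaries:

```python
import math

STATES = ["B", "M", "E", "S"]  # begin/middle/end of a word, single-char word
LOW = -1e9                     # log-probability stand-in for "impossible"

def viterbi(obs, start_p, trans_p, emit_p):
    """Most probable BMES state path for the character sequence obs
    (standard Viterbi dynamic programming over log probabilities)."""
    V = [{s: start_p[s] + emit_p[s].get(obs[0], LOW) for s in STATES}]
    path = {s: [s] for s in STATES}
    for ch in obs[1:]:
        V.append({})
        new_path = {}
        for s in STATES:
            prob, prev = max(
                (V[-2][p] + trans_p[p].get(s, LOW) + emit_p[s].get(ch, LOW), p)
                for p in STATES)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(STATES, key=lambda s: V[-1][s])
    return path[best]

# Made-up model parameters (not trained): a word starts at B or S; B/M are
# followed by M or E; E/S are followed by B or S.
start = {"B": math.log(0.6), "S": math.log(0.4), "M": LOW, "E": LOW}
trans = {"B": {"M": math.log(0.3), "E": math.log(0.7)},
         "M": {"M": math.log(0.3), "E": math.log(0.7)},
         "E": {"B": math.log(0.6), "S": math.log(0.4)},
         "S": {"B": math.log(0.6), "S": math.log(0.4)}}
emit = {"B": {"中": math.log(0.9), "人": math.log(0.1)},
        "M": {"国": math.log(0.2)},
        "E": {"国": math.log(0.8)},
        "S": {"人": math.log(0.9), "中": math.log(0.1)}}

tags = viterbi("中国人", start, trans, emit)  # tag each character with B/M/E/S
```

Under these toy parameters, "中国" decodes as one word (B, E) and "人" as a single-character word (S); a production segmenter would learn the transition and emission probabilities from a labeled corpus.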
In another technical solution of the short text similarity calculation method, the corpus data is obtained with web crawler technology.
In another technical solution of the short text similarity calculation method, two short text sentences are considered semantically similar in S4 when their similarity value exceeds 0.7. The larger the similarity value, the closer the two short text sentences are; when the similarity value exceeds 0.7, the two short text sentences are regarded as having the same semantics.
As shown in Fig. 2, the present invention also provides a short text similarity calculation system, comprising:

a training corpus segmentation module, used to obtain the training corpus and segment it into words;

a word vector training module, connected to the training corpus segmentation module and used to train on the training corpus to obtain the word vector of each word in it and then combine the word vectors into a word vector set;

a short text segmentation module, used to segment a short text to be compared into words;

a short text vector generation module, connected to the word vector training module and used to look up in the word vector set the word vector corresponding to each segmented word and combine those word vectors into a short text vector set;

a first similarity calculation module, connected to the word vector training module and the short text vector generation module and used to compute the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set;

a comparison module, connected to the word vector training module and used to compare the similarity values of each word vector in the word vector set and obtain the maximum similarity value of each word vector;

a short text sentence vector generation module, connected to the comparison module and used to combine the maximum similarity values of the word vectors in the word vector set into a short text sentence vector;

a second similarity calculation module, connected to the short text sentence vector generation module, which computes the similarity between short text sentence vectors with the cosine similarity formula.
In the short text similarity calculation system of the present invention, the training corpus segmentation module obtains the training corpus and segments it into words; the word vector training module then trains on the training corpus to obtain the word vector of each word in it and combines the word vectors into the word vector set; the short text segmentation module segments a short text to be compared into words; the short text vector generation module looks up in the word vector set the word vector corresponding to each segmented word and combines those word vectors into the short text vector set; the first similarity calculation module computes the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set; the comparison module compares the similarity values of each word vector of the word vector set and obtains the maximum similarity value of each word vector; the short text sentence vector generation module combines the maximum similarity values of the word vectors into a short text sentence vector; and the second similarity calculation module computes the similarity between short text sentence vectors with the cosine similarity formula.
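The module chain can be mirrored as a small class. The following is a hypothetical sketch of the wiring in which the segmenter and the trained word vectors are injected as a plain callable and dict rather than the patent's concrete modules:

```python
import math

class ShortTextSimilaritySystem:
    """Hypothetical sketch of the module chain: segmentation -> vector lookup
    -> max-similarity sentence vector -> cosine of sentence vectors."""

    def __init__(self, segment, word_vectors):
        self.segment = segment              # short text segmentation module
        self.word_vectors = word_vectors    # word vector training module output
        self.vector_set = list(word_vectors.values())

    @staticmethod
    def _cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def sentence_vector(self, text):
        # short text vector generation, first similarity, and comparison modules
        text_vecs = [self.word_vectors[w] for w in self.segment(text)
                     if w in self.word_vectors]
        return [max(self._cos(w, t) for t in text_vecs) for w in self.vector_set]

    def similarity(self, text1, text2):
        # second similarity calculation module
        return self._cos(self.sentence_vector(text1), self.sentence_vector(text2))

# Toy two-dimensional vectors and whitespace segmentation (assumptions).
vectors = {"I": (1.0, 0.0), "am": (0.8, 0.6), "China": (0.0, 1.0)}
system = ShortTextSimilaritySystem(str.split, vectors)
sim = system.similarity("I am China", "I China")
```

Injecting the segmenter and vectors keeps each stage swappable, which matches the patent's division into independent, connected modules.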
In another technical solution of the short text similarity calculation system, the training corpus segmentation module comprises:

an acquisition unit, used to obtain the training corpus;

a segmentation unit, connected to the acquisition unit and used to segment the training corpus.

In another technical solution of the short text similarity calculation system, the word vector training module comprises:

a word vector training unit, used to train on the training corpus to obtain the word vector of each word in it;

a word vector combination unit, connected to the word vector training unit and used to combine the word vectors into the word vector set.

In another technical solution of the short text similarity calculation system, the short text vector generation module comprises:

a lookup unit, used to find in the word vector set the word vector corresponding to each segmented word;

a short text vector combination unit, connected to the lookup unit and used to combine the word vectors of the segmented words into the short text vector set.
Although embodiments of the present invention have been disclosed above, they are not limited to the uses listed in the description and the embodiments; the invention can be fully applied in various fields suited to it, and those skilled in the art can easily realize further modifications. Therefore, without departing from the general concept defined by the claims and their equivalent scope, the present invention is not limited to the specific details and drawings shown and described herein.
Claims (9)
1. A short text similarity calculation method, characterized by comprising the following steps:
S1. Obtain a training corpus and segment it into words. Train on the corpus with the deep learning word2vec algorithm to obtain the word vector (a1i, a2i, a3i, …) of each word, then combine the word vectors into a word vector set S:
S = ((a11, a21, a31, …), (a12, a22, a32, …), (a13, a23, a33, …), …, (a1i, a2i, a3i, …), …, (a1N, a2N, a3N, …))
S2. Segment each short text to be compared, look up in the word vector set the word vector wordi corresponding to each word of the segmented short text, and combine them into a short text vector set sen:
sen = (word1, word2, word3, …, wordi, …, wordM)
S3. Using the cosine similarity formula, compute the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set, obtain the maximum similarity value maxi of each word vector in the word vector set, and combine the maximum similarity values maxi into a short text sentence vector senVec:
senVec = (max1, max2, max3, …, maxi, …, maxN)
S4. Compute the similarity between the two short text sentence vectors with the cosine similarity formula; this is the similarity between the two short texts.
2. The short text similarity calculation method of claim 1, characterized in that the training corpus in S1 is obtained as follows: corpus data is obtained, and the non-text data in it is removed to yield the training corpus; after the word vector of each word in the training corpus is obtained, the word vectors corresponding to stop words and to words whose frequency is below a preset threshold are removed, and the word vectors of the remaining words are combined into the word vector set S, the preset threshold lying between 5 and 10.
3. The short text similarity calculation method of claim 1, characterized in that the short texts to be compared and the training corpus are segmented in S2 with an HMM model and the Viterbi algorithm.
4. The short text similarity calculation method of claim 2, characterized in that the corpus data is obtained with web crawler technology.
5. The short text similarity calculation method of claim 1, characterized in that two short text sentences are considered semantically similar in S4 when their similarity value exceeds 0.7.
6. A short text similarity calculation system as described in claim 1, characterized by comprising:
a training corpus segmentation module, used to obtain the training corpus and segment it into words;
a word vector training module, connected to the training corpus segmentation module and used to train on the training corpus to obtain the word vector of each word in it and then combine the word vectors into a word vector set;
a short text segmentation module, used to segment a short text to be compared into words;
a short text vector generation module, connected to the word vector training module and used to look up in the word vector set the word vector corresponding to each segmented word and combine those word vectors into a short text vector set;
a first similarity calculation module, connected to the word vector training module and the short text vector generation module and used to compute the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set;
a comparison module, connected to the word vector training module and used to compare the similarity values of each word vector in the word vector set and obtain the maximum similarity value of each word vector;
a short text sentence vector generation module, connected to the comparison module and used to combine the maximum similarity values of the word vectors in the word vector set into a short text sentence vector;
a second similarity calculation module, connected to the short text sentence vector generation module, which computes the similarity between short text sentence vectors with the cosine similarity formula.
7. The short text similarity calculation system according to claim 6, characterized in that the training corpus word segmentation module comprises:
an acquiring unit, used to obtain the training corpus;
a word segmentation unit, connected to the acquiring unit, used to segment the training corpus into words.
8. The short text similarity calculation system according to claim 6, characterized in that the term vector training module comprises:
a term vector training unit, used to train on the training corpus to obtain the term vector of each word;
a term vector combination unit, connected to the term vector training unit, used to combine the term vectors into the term vector set.
9. The short text similarity calculation system according to claim 6, characterized in that the short text vector generation module comprises:
a searching unit, used to find in the term vector set the term vector corresponding to each segmented word;
a short text vector combination unit, connected to the searching unit, used to combine the term vectors of the segmented words into the short text vector set.
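The module chain of claims 6 through 9 can be sketched end to end: for each term vector in the trained set, take its maximum cosine similarity against the short text's word vectors, and use the vector of those maxima as the short text's sentence vector; two sentence vectors are then compared by cosine similarity. All function names and data below are illustrative, not the patent's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_vector(term_vectors, short_text_vectors):
    """One component per term vector in the set: the maximum cosine
    similarity between that term vector and any word vector of the
    short text (the comparison module's output in claim 6)."""
    return [max(cosine(t, w) for w in short_text_vectors)
            for t in term_vectors]

def short_text_similarity(term_vectors, text_a, text_b):
    """Second similarity calculation module: cosine similarity between
    the two short texts' sentence vectors."""
    return cosine(sentence_vector(term_vectors, text_a),
                  sentence_vector(term_vectors, text_b))
```

Because both sentence vectors are indexed by the same term vector set, they always have equal length, so the final cosine comparison is well defined regardless of the two texts' word counts.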
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810090296.9A CN108334495A (en) | 2018-01-30 | 2018-01-30 | Short text similarity calculating method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810090296.9A CN108334495A (en) | 2018-01-30 | 2018-01-30 | Short text similarity calculating method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108334495A true CN108334495A (en) | 2018-07-27 |
Family
ID=62926328
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810090296.9A Pending CN108334495A (en) | 2018-01-30 | 2018-01-30 | Short text similarity calculating method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108334495A (en) |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109165382A (en) * | 2018-08-03 | 2019-01-08 | 南京工业大学 | Similar defect report recommendation method combining weighted word vector and potential semantic analysis |
CN109271514A (en) * | 2018-09-14 | 2019-01-25 | 华南师范大学 | Generation method, classification method, device and the storage medium of short text disaggregated model |
CN109597992A (en) * | 2018-11-27 | 2019-04-09 | 苏州浪潮智能软件有限公司 | A kind of Question sentence parsing calculation method of combination synonymicon and word insertion vector |
CN109871437A (en) * | 2018-11-30 | 2019-06-11 | 阿里巴巴集团控股有限公司 | Method and device for the processing of customer problem sentence |
CN110009064A (en) * | 2019-04-30 | 2019-07-12 | 广东电网有限责任公司 | A kind of semantic model training method and device based on electrical network field |
CN110059155A (en) * | 2018-12-18 | 2019-07-26 | 阿里巴巴集团控股有限公司 | The calculating of text similarity, intelligent customer service system implementation method and device |
CN110096705A (en) * | 2019-04-29 | 2019-08-06 | 扬州大学 | A kind of unsupervised english sentence simplifies algorithm automatically |
CN110113228A (en) * | 2019-04-25 | 2019-08-09 | 新华三信息安全技术有限公司 | A kind of network connection detection method and device |
CN110266675A (en) * | 2019-06-12 | 2019-09-20 | 成都积微物联集团股份有限公司 | A kind of xss attack automated detection method based on deep learning |
CN110287312A (en) * | 2019-05-10 | 2019-09-27 | 平安科技(深圳)有限公司 | Calculation method, device, computer equipment and the computer storage medium of text similarity |
CN110674251A (en) * | 2019-08-21 | 2020-01-10 | 杭州电子科技大学 | Computer-assisted secret point annotation method based on semantic information |
CN110781277A (en) * | 2019-09-23 | 2020-02-11 | 厦门快商通科技股份有限公司 | Text recognition model similarity training method, system, recognition method and terminal |
CN110781687A (en) * | 2019-11-06 | 2020-02-11 | 三角兽(北京)科技有限公司 | Same intention statement acquisition method and device |
CN110874528A (en) * | 2018-08-10 | 2020-03-10 | 珠海格力电器股份有限公司 | Text similarity obtaining method and device |
WO2020062770A1 (en) * | 2018-09-27 | 2020-04-02 | 深圳大学 | Method and apparatus for constructing domain dictionary, and device and storage medium |
CN110956033A (en) * | 2019-12-04 | 2020-04-03 | 北京中电普华信息技术有限公司 | Text similarity calculation method and device |
CN111178059A (en) * | 2019-12-07 | 2020-05-19 | 武汉光谷信息技术股份有限公司 | Similarity comparison method and device based on word2vec technology |
CN111191469A (en) * | 2019-12-17 | 2020-05-22 | 语联网(武汉)信息技术有限公司 | Large-scale corpus cleaning and aligning method and device |
CN111259649A (en) * | 2020-01-19 | 2020-06-09 | 深圳壹账通智能科技有限公司 | Interactive data classification method and device of information interaction platform and storage medium |
CN111368061A (en) * | 2018-12-25 | 2020-07-03 | 深圳市优必选科技有限公司 | Short text filtering method, device, medium and computer equipment |
CN111523301A (en) * | 2020-06-05 | 2020-08-11 | 泰康保险集团股份有限公司 | Contract document compliance checking method and device |
CN111986007A (en) * | 2020-10-26 | 2020-11-24 | 北京值得买科技股份有限公司 | Method for commodity aggregation and similarity calculation |
CN112115715A (en) * | 2020-09-04 | 2020-12-22 | 北京嘀嘀无限科技发展有限公司 | Natural language text processing method and device, storage medium and electronic equipment |
CN112257431A (en) * | 2020-10-30 | 2021-01-22 | 中电万维信息技术有限责任公司 | NLP-based short text data processing method |
CN113342968A (en) * | 2021-05-21 | 2021-09-03 | 中国石油天然气股份有限公司 | Text abstract extraction method and device |
CN116932702A (en) * | 2023-09-19 | 2023-10-24 | 湖南正宇软件技术开发有限公司 | Method, system, device and storage medium for proposal and proposal |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105389307A (en) * | 2015-12-02 | 2016-03-09 | 上海智臻智能网络科技股份有限公司 | Statement intention category identification method and apparatus |
CN106021223A (en) * | 2016-05-09 | 2016-10-12 | Tcl集团股份有限公司 | Sentence similarity calculation method and system |
CN106484664A (en) * | 2016-10-21 | 2017-03-08 | 竹间智能科技(上海)有限公司 | Similarity calculating method between a kind of short text |
CN106844350A (en) * | 2017-02-15 | 2017-06-13 | 广州索答信息科技有限公司 | A kind of computational methods of short text semantic similarity |
CN107436864A (en) * | 2017-08-04 | 2017-12-05 | 逸途(北京)科技有限公司 | A kind of Chinese question and answer semantic similarity calculation method based on Word2Vec |
2018-01-30: CN patent application CN201810090296.9A filed (CN108334495A); status: Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105389307A (en) * | 2015-12-02 | 2016-03-09 | 上海智臻智能网络科技股份有限公司 | Statement intention category identification method and apparatus |
CN106021223A (en) * | 2016-05-09 | 2016-10-12 | Tcl集团股份有限公司 | Sentence similarity calculation method and system |
CN106484664A (en) * | 2016-10-21 | 2017-03-08 | 竹间智能科技(上海)有限公司 | Similarity calculating method between a kind of short text |
CN106844350A (en) * | 2017-02-15 | 2017-06-13 | 广州索答信息科技有限公司 | A kind of computational methods of short text semantic similarity |
CN107436864A (en) * | 2017-08-04 | 2017-12-05 | 逸途(北京)科技有限公司 | A kind of Chinese question and answer semantic similarity calculation method based on Word2Vec |
Non-Patent Citations (1)
Title |
---|
Duan Xulei, Zhang Yangsen, Sun Yizhuo: "Research on Sentence Vector Representation and Similarity Calculation of Microblog Texts", Computer Engineering (《计算机工程》) *
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109165382B (en) * | 2018-08-03 | 2022-08-23 | 南京工业大学 | Similar defect report recommendation method combining weighted word vector and potential semantic analysis |
CN109165382A (en) * | 2018-08-03 | 2019-01-08 | 南京工业大学 | Similar defect report recommendation method combining weighted word vector and potential semantic analysis |
CN110874528B (en) * | 2018-08-10 | 2020-11-10 | 珠海格力电器股份有限公司 | Text similarity obtaining method and device |
CN110874528A (en) * | 2018-08-10 | 2020-03-10 | 珠海格力电器股份有限公司 | Text similarity obtaining method and device |
CN109271514A (en) * | 2018-09-14 | 2019-01-25 | 华南师范大学 | Generation method, classification method, device and the storage medium of short text disaggregated model |
WO2020062770A1 (en) * | 2018-09-27 | 2020-04-02 | 深圳大学 | Method and apparatus for constructing domain dictionary, and device and storage medium |
CN109597992B (en) * | 2018-11-27 | 2023-06-27 | 浪潮金融信息技术有限公司 | Question similarity calculation method combining synonym dictionary and word embedding vector |
CN109597992A (en) * | 2018-11-27 | 2019-04-09 | 苏州浪潮智能软件有限公司 | A kind of Question sentence parsing calculation method of combination synonymicon and word insertion vector |
CN109871437A (en) * | 2018-11-30 | 2019-06-11 | 阿里巴巴集团控股有限公司 | Method and device for the processing of customer problem sentence |
CN109871437B (en) * | 2018-11-30 | 2023-04-21 | 阿里巴巴集团控股有限公司 | Method and device for processing user problem statement |
CN110059155A (en) * | 2018-12-18 | 2019-07-26 | 阿里巴巴集团控股有限公司 | The calculating of text similarity, intelligent customer service system implementation method and device |
CN111368061B (en) * | 2018-12-25 | 2024-04-12 | 深圳市优必选科技有限公司 | Short text filtering method, device, medium and computer equipment |
CN111368061A (en) * | 2018-12-25 | 2020-07-03 | 深圳市优必选科技有限公司 | Short text filtering method, device, medium and computer equipment |
CN110113228A (en) * | 2019-04-25 | 2019-08-09 | 新华三信息安全技术有限公司 | A kind of network connection detection method and device |
CN110096705B (en) * | 2019-04-29 | 2023-09-08 | 扬州大学 | Unsupervised English sentence automatic simplification algorithm |
CN110096705A (en) * | 2019-04-29 | 2019-08-06 | 扬州大学 | A kind of unsupervised english sentence simplifies algorithm automatically |
CN110009064A (en) * | 2019-04-30 | 2019-07-12 | 广东电网有限责任公司 | A kind of semantic model training method and device based on electrical network field |
CN110287312A (en) * | 2019-05-10 | 2019-09-27 | 平安科技(深圳)有限公司 | Calculation method, device, computer equipment and the computer storage medium of text similarity |
CN110287312B (en) * | 2019-05-10 | 2023-08-25 | 平安科技(深圳)有限公司 | Text similarity calculation method, device, computer equipment and computer storage medium |
CN110266675A (en) * | 2019-06-12 | 2019-09-20 | 成都积微物联集团股份有限公司 | A kind of xss attack automated detection method based on deep learning |
CN110674251A (en) * | 2019-08-21 | 2020-01-10 | 杭州电子科技大学 | Computer-assisted secret point annotation method based on semantic information |
CN110781277A (en) * | 2019-09-23 | 2020-02-11 | 厦门快商通科技股份有限公司 | Text recognition model similarity training method, system, recognition method and terminal |
CN110781687A (en) * | 2019-11-06 | 2020-02-11 | 三角兽(北京)科技有限公司 | Same intention statement acquisition method and device |
CN110781687B (en) * | 2019-11-06 | 2021-07-06 | 腾讯科技(深圳)有限公司 | Same intention statement acquisition method and device |
CN110956033A (en) * | 2019-12-04 | 2020-04-03 | 北京中电普华信息技术有限公司 | Text similarity calculation method and device |
CN111178059A (en) * | 2019-12-07 | 2020-05-19 | 武汉光谷信息技术股份有限公司 | Similarity comparison method and device based on word2vec technology |
CN111178059B (en) * | 2019-12-07 | 2023-08-25 | 武汉光谷信息技术股份有限公司 | Similarity comparison method and device based on word2vec technology |
CN111191469B (en) * | 2019-12-17 | 2023-09-19 | 语联网(武汉)信息技术有限公司 | Large-scale corpus cleaning and aligning method and device |
CN111191469A (en) * | 2019-12-17 | 2020-05-22 | 语联网(武汉)信息技术有限公司 | Large-scale corpus cleaning and aligning method and device |
CN111259649A (en) * | 2020-01-19 | 2020-06-09 | 深圳壹账通智能科技有限公司 | Interactive data classification method and device of information interaction platform and storage medium |
CN111523301A (en) * | 2020-06-05 | 2020-08-11 | 泰康保险集团股份有限公司 | Contract document compliance checking method and device |
CN112115715A (en) * | 2020-09-04 | 2020-12-22 | 北京嘀嘀无限科技发展有限公司 | Natural language text processing method and device, storage medium and electronic equipment |
CN111986007A (en) * | 2020-10-26 | 2020-11-24 | 北京值得买科技股份有限公司 | Method for commodity aggregation and similarity calculation |
CN112257431A (en) * | 2020-10-30 | 2021-01-22 | 中电万维信息技术有限责任公司 | NLP-based short text data processing method |
CN113342968A (en) * | 2021-05-21 | 2021-09-03 | 中国石油天然气股份有限公司 | Text abstract extraction method and device |
CN113342968B (en) * | 2021-05-21 | 2024-07-30 | 中国石油天然气股份有限公司 | Text abstract extraction method and device |
CN116932702A (en) * | 2023-09-19 | 2023-10-24 | 湖南正宇软件技术开发有限公司 | Method, system, device and storage medium for proposal and proposal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108334495A (en) | Short text similarity calculating method and system | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN110413986B (en) | Text clustering multi-document automatic summarization method and system for improving word vector model | |
CN107193801B (en) | Short text feature optimization and emotion analysis method based on deep belief network | |
CN104765769B (en) | The short text query expansion and search method of a kind of word-based vector | |
CN107729392B (en) | Text structuring method, device and system and non-volatile storage medium | |
CN108363725B (en) | Method for extracting user comment opinions and generating opinion labels | |
WO2019080863A1 (en) | Text sentiment classification method, storage medium and computer | |
CN106844346A (en) | Short text Semantic Similarity method of discrimination and system based on deep learning model Word2Vec | |
WO2018153215A1 (en) | Method for automatically generating sentence sample with similar semantics | |
CN105701084A (en) | Characteristic extraction method of text classification on the basis of mutual information | |
CN103678275A (en) | Two-level text similarity calculation method based on subjective and objective semantics | |
CN110879834B (en) | Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof | |
CN111291177A (en) | Information processing method and device and computer storage medium | |
CN110188359B (en) | Text entity extraction method | |
CN108710611A (en) | A kind of short text topic model generation method of word-based network and term vector | |
Chang et al. | A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING. | |
CN105956158A (en) | Automatic extraction method of network neologism on the basis of mass microblog texts and use information | |
Gong et al. | A semantic similarity language model to improve automatic image annotation | |
CN113761104A (en) | Method and device for detecting entity relationship in knowledge graph and electronic equipment | |
CN110413985B (en) | Related text segment searching method and device | |
CN111104508A (en) | Method, system and medium for representing word bag model text based on fault-tolerant rough set | |
CN114722774B (en) | Data compression method, device, electronic equipment and storage medium | |
CN110597982A (en) | Short text topic clustering algorithm based on word co-occurrence network | |
CN114169325B (en) | Webpage new word discovery and analysis method based on word vector representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 2018-07-27 |