CN109992772A - Text similarity computing method and device - Google Patents
Text similarity computing method and device
- Publication number
- CN109992772A (application CN201910191756.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- calculated
- similarity
- term vector
- advance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses a text similarity computing method and device. The method comprises: S1: vectorizing each text to be compared using a pre-trained word vector model to obtain the word vectors of the texts; S2: computing a first similarity between the texts to be compared; S3: obtaining a second similarity between the texts to be compared according to a pre-built prediction model, the word vectors, and the first similarity. On one hand, the invention applies supervised learning, fusing multiple natural language feature extraction techniques such as Chinese word segmentation, Tf-Idf, LSA, LDA, and Word2Vec with multiple text similarity (distance) measures such as Jaccard and WMD, which improves the accuracy of text similarity computation; on the other hand, it applies model fusion, combining deep learning with traditional feature learning, which further improves the accuracy of text similarity computation.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a text similarity computing method and device.
Background
Computing the similarity (distance) between texts has wide application in practice, for example in information retrieval, text clustering, and question answering systems. At present there are many methods for computing the degree of similarity between two texts.
One class of methods directly compares the character strings of the texts: for example, the ratio of the longest common subsequence (LCS) of two strings to the greater of the two text lengths, the edit distance (Levenshtein distance), or the Jaccard similarity. Such methods compare only the literal difference between texts; every word carries equal weight in the computation, and similarity at the semantic level is not considered, so their accuracy in practical applications is poor.
Another class of methods represents characters, words, or texts as numeric vectors and then computes the similarity between the vectors. The bag-of-words model BOW (Bag of Words), term frequency-inverse document frequency TF-IDF, latent semantic analysis LSI, and word vector models such as Word2vec, GloVe, and fastText are common text vectorization methods. After vectorization, the cosine or Euclidean distance between the two text vectors characterizes the degree of similarity of the texts. The distance between whole texts can also be computed directly from word-level distances, as in the Word Mover's Distance (WMD), which characterizes the distance between two texts by the minimum transport distance between their words. BM25 is another such method: after digitizing the words, it computes the degree of relevance between two texts with a formula derived from probability. All of these methods compute text similarity after digitizing the texts; because the digitized representation captures part of the semantic relationship, in practical applications they perform considerably better than the first class of methods.
The methods above are essentially unsupervised similarity computations: they cannot be optimized for a particular application scenario, their accuracy is limited, and they often fall far short of the requirements of practical applications. A similarity computation method built on the idea of supervised learning is therefore needed.
Summary of the invention
To solve the problems in the prior art, embodiments of the present invention provide a text similarity computing method and device, overcoming problems of the prior art such as the low accuracy of text similarity computation methods and the one-sidedness of semantic computation.
To solve one or more of the above technical problems, the technical solution adopted by the present invention is as follows:
In one aspect, a text similarity computing method is provided, the method comprising the following steps:
S1: vectorize each text to be compared using a pre-trained word vector model to obtain the word vectors of the texts to be compared;
S2: compute a first similarity between the texts to be compared;
S3: obtain a second similarity between the texts to be compared according to a pre-built prediction model, the word vectors of the texts, and the first similarity.
Further, step S1 specifically comprises:
S1.1: preprocess the training corpus, and use the preprocessed training corpus to train a word vector model in advance;
S1.2: perform word segmentation on each text to be compared to obtain the words of each text;
S1.3: vectorize the words of each text to be compared using the pre-trained word vector model to obtain the word vectors of the texts to be compared.
Further, preprocessing the training corpus comprises:
labeling the training corpus in text pairs: a pair of texts with the same meaning is labeled 1, and a pair of texts with different meanings is labeled 0.
Further, step S2 specifically comprises:
computing a first similarity between the texts to be compared based on their character strings; and/or
computing a first similarity between the texts to be compared based on their word vectors; and/or
computing a first similarity between the texts to be compared based on a bag-of-words model; and/or
computing a first similarity between the texts to be compared based on a Tf-idf characterization; and/or
computing a first similarity between the texts to be compared based on an LSI characterization.
Further, step S3 specifically comprises:
S3.1: apply a nonlinear transformation to the word vectors of the texts to be compared to obtain nonlinear transformation results;
S3.2: splice the nonlinear transformation results and perform the corresponding computations on the spliced result to obtain a computation result;
S3.3: connect the computation result and the first similarity into one long vector;
S3.4: compute the second similarity between the texts to be compared from the long vector.
In another aspect, a text similarity computing device is provided, the device comprising:
a vectorization module, for vectorizing each text to be compared using a pre-trained word vector model to obtain the word vectors of the texts to be compared;
a computing module, for computing a first similarity between the texts to be compared;
a prediction module, for obtaining a second similarity between the texts to be compared according to a pre-built prediction model, the word vectors of the texts, and the first similarity.
Further, the vectorization module comprises:
a training unit, for preprocessing the training corpus and using the preprocessed training corpus to train a word vector model in advance;
a word segmentation unit, for performing word segmentation on each text to be compared to obtain the words of each text;
a vectorization unit, for vectorizing the words of each text to be compared using the pre-trained word vector model to obtain the word vectors of the texts to be compared.
Further, the vectorization module further comprises:
a marking unit, for labeling the training corpus in text pairs: a pair of texts with the same meaning is labeled 1, and a pair of texts with different meanings is labeled 0.
Further, the computing module is specifically configured to:
compute a first similarity between the texts to be compared based on their character strings; and/or
compute a first similarity between the texts to be compared based on their word vectors; and/or
compute a first similarity between the texts to be compared based on a bag-of-words model; and/or
compute a first similarity between the texts to be compared based on a Tf-idf characterization; and/or
compute a first similarity between the texts to be compared based on an LSI characterization.
Further, the prediction module comprises:
a nonlinear transformation unit, for applying a nonlinear transformation to the word vectors of the texts to be compared to obtain nonlinear transformation results;
a first connection unit, for splicing the nonlinear transformation results;
a computing unit, for performing the corresponding computations on the spliced result to obtain a computation result;
a second connection unit, for connecting the computation result and the first similarity into one long vector;
a prediction unit, for computing the second similarity between the texts to be compared from the long vector.
The technical solutions provided by the embodiments of the present invention bring the following benefits:
1. The text similarity computing method and device provided by the embodiments apply supervised learning, fusing multiple natural language feature extraction techniques such as Chinese word segmentation, Tf-Idf, LSA, LDA, and Word2Vec with multiple text similarity (distance) measures such as Jaccard and WMD, which improves the accuracy of text similarity computation;
2. The text similarity computing method and device provided by the embodiments apply model fusion, combining deep learning with traditional feature learning, which further improves the accuracy of text similarity computation.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a text similarity computing method according to an exemplary embodiment;
Fig. 2 is a schematic diagram of a prediction model according to an exemplary embodiment;
Fig. 3 is a structural schematic diagram of a text similarity computing device according to an exemplary embodiment.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the invention.
Fig. 1 is a flowchart of a text similarity computing method according to an exemplary embodiment. Referring to Fig. 1, the method comprises the following steps:
S1: vectorize each text to be compared using a pre-trained word vector model to obtain the word vectors of the texts to be compared.
Specifically, the embodiments of the present invention compute the similarity between texts from vectors based on their word vectors. Therefore, a word vector model must first be trained in advance; each text to be compared is then vectorized with this model to obtain its word vectors.
S2: compute a first similarity between the texts to be compared.
Specifically, to improve the accuracy of text similarity computation, the embodiments fuse multiple natural language feature extraction techniques such as Chinese word segmentation, Tf-Idf, LSI, LCS, and Word2Vec with multiple traditional text similarity (distance) measures such as Jaccard and WMD. It should be noted that a first similarity here is a similarity between the texts computed by some traditional method, where the traditional methods include but are not limited to Tf-Idf, LSI, LCS, Jaccard, and WMD.
S3: obtain a second similarity between the texts to be compared according to the pre-built prediction model, the word vectors of the texts, and the first similarity.
Specifically, to further improve the accuracy of text similarity computation, the embodiments use model fusion to combine deep learning with traditional feature learning: the texts to be compared are first vectorized by the word vector model to obtain their word vectors, and the word vectors together with the first similarities computed by the traditional methods are then input to the pre-built prediction model, which outputs the second similarity between the texts. This second similarity is the text similarity computation result of the invention.
As a preferred embodiment, step S1 specifically comprises:
S1.1: preprocess the training corpus, and use the preprocessed training corpus to train a word vector model in advance.
Specifically, relevant corpora are collected as the training corpus for model training, and the training corpus is then preprocessed. It should be noted that in the embodiments the pre-built prediction model is obtained by training with a supervised learning method on the preprocessed training corpus.
S1.2: perform word segmentation on each text to be compared to obtain the words of each text.
Specifically, in the embodiments the texts are segmented with a Chinese word segmenter, and punctuation marks are removed from the segmentation result.
S1.3: vectorize the words of each text to be compared using the pre-trained word vector model to obtain the word vectors of the texts to be compared.
Specifically, a word vector model is trained with a technique such as word2vec, GloVe, or fastText on the training corpus or other Chinese corpora. The trained word vector model is then used to vectorize the words of each text to be compared, yielding the word vectors of the texts.
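The vectorization of S1.2 and S1.3 can be sketched as follows in pure Python. The whitespace tokenizer and the tiny three-dimensional embedding table are hypothetical stand-ins for a real Chinese word segmenter and a trained word2vec/GloVe/fastText model:

```python
# Sketch of S1.2-S1.3: segment a text, then look up a word vector per word.
# The tokenizer and the tiny embedding table below are illustrative stand-ins
# for a real Chinese word segmenter and a pre-trained word vector model.

import string

TOY_EMBEDDINGS = {            # hypothetical 3-dimensional word vectors
    "how": [0.1, 0.2, 0.3],
    "lose": [0.4, 0.1, 0.0],
    "weight": [0.2, 0.5, 0.1],
}
UNK = [0.0, 0.0, 0.0]         # vector for out-of-vocabulary words

def segment(text):
    """S1.2: split into words and drop punctuation (whitespace stand-in)."""
    return [w.strip(string.punctuation).lower()
            for w in text.split() if w.strip(string.punctuation)]

def vectorize(text):
    """S1.3: map each word of the segmented text to its word vector."""
    return [TOY_EMBEDDINGS.get(w, UNK) for w in segment(text)]

vectors = vectorize("How do I lose weight?")
print(len(vectors))  # one vector per word
```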
As a preferred embodiment, preprocessing the training corpus comprises:
labeling the training corpus in text pairs: a pair of texts with the same meaning is labeled 1, and a pair of texts with different meanings is labeled 0.
Specifically, before the model is trained, the training corpus must be labeled accordingly. During labeling, pairs of texts with the same meaning are marked 1 and pairs with different meanings are marked 0. For example, the pair "How do I lose weight?" / "How can I reduce my weight?" is labeled 1, and the pair "What is the best way to lose weight?" / "How to exercise effectively" is labeled 0.
As a preferred embodiment, step S2 specifically comprises:
computing a first similarity between the texts to be compared based on their character strings; and/or
computing a first similarity between the texts to be compared based on their word vectors; and/or
computing a first similarity between the texts to be compared based on a bag-of-words model; and/or
computing a first similarity between the texts to be compared based on a Tf-idf characterization; and/or
computing a first similarity between the texts to be compared based on an LSI characterization.
Specifically, let the two texts whose similarity is to be computed be q1 and q2. Based on the traditional methods, the first similarities computed between the texts include one or more of the following:
1. Features based on character strings:
1) Compute the number of words in each text after segmentation, denoted len_1 and len_2 respectively;
2) Compute the ratio of the longest common subsequence (LCS) of the two segmented word sequences to the greater of the two sequence lengths, denoted lcs:
lcs = LCS(segment(q1), segment(q2)) / max(len_1, len_2)
3) Compute the ratio of the longest common subsequence of the two text strings to the greater of the two string lengths, denoted olcs:
olcs = LCS(q1, q2) / max(Len(q1), Len(q2))
4) Compute the Jaccard distance of the two segmented texts, denoted jac:
jac = Jaccard(segment(q1), segment(q2))
5) Compute the Jaccard distance of the two text strings, denoted ojac:
ojac = Jaccard(q1, q2)
6) Compute the edit distance (Levenshtein distance) of the two segmented texts, denoted lev:
lev = Levenshtein(segment(q1), segment(q2))
7) Compute the edit distance (Levenshtein distance) of the two text strings, denoted olev:
olev = Levenshtein(q1, q2)
8) Compute the ratio of the Tf-Idf sum of the words common to the two segmented texts to the total Tf-Idf sum of the two texts:
ratio_tfidf = Σ_{w∈S} (TfIdf_q1(w) + TfIdf_q2(w)) / (Σ_w TfIdf_q1(w) + Σ_w TfIdf_q2(w))
where S is the set of common words.
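As an illustration, the string-based features above can be computed with the textbook definitions of LCS, Jaccard, and Levenshtein; the two English word lists are hypothetical stand-ins for segmented Chinese texts:

```python
# Sketch of the string-based features: LCS ratio, Jaccard similarity, and
# Levenshtein distance over two segmented texts (toy English word lists).

def lcs_len(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def jaccard_sim(a, b):
    """Jaccard similarity of the word sets of a and b."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def levenshtein(a, b):
    """Edit distance between sequences a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[len(b)]

q1 = ["how", "do", "i", "lose", "weight"]
q2 = ["how", "can", "i", "reduce", "weight"]
lcs = lcs_len(q1, q2) / max(len(q1), len(q2))   # the `lcs` feature above
jac = jaccard_sim(q1, q2)                        # Jaccard similarity
lev = levenshtein(q1, q2)                        # the `lev` feature above
```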
2. Features based on word vectors
1) Using the word vector model, compute the Word Mover's Distance (WMD). During the WMD computation, the distance between words can be taken as either the cosine or the Euclidean distance; the corresponding WMDs are denoted wmd_cosine and wmd_euc;
2) Using the word vector model, compute the vector of each text. It should be noted that the vector of a text is obtained by adding the word vectors in the text and averaging them; the results are denoted vec_1 = [u_1, u_2, ..., u_n] and vec_2 = [v_1, v_2, ..., v_n], where n is the length of a word vector;
3) Compute the cosine distance of the two text vectors, denoted cos_vec;
4) Compute the Euclidean distance of the two text vectors, denoted euc_vec:
euc_vec = ||vec_1 - vec_2||_2
5) Compute the Bray-Curtis distance of the two text vectors, denoted bray_vec:
bray_vec = Σ_i |u_i - v_i| / Σ_i |u_i + v_i|
6) Compute the Chebyshev distance of the two text vectors, denoted cheb_vec:
cheb_vec = max_i |u_i - v_i|
7) Compute the Canberra distance of the two text vectors, denoted canb_vec:
canb_vec = Σ_i |u_i - v_i| / (|u_i| + |v_i|)
8) Compute the cityblock (Manhattan) distance of the two text vectors, denoted city_vec:
city_vec = Σ_i |u_i - v_i|
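A minimal sketch of the word-vector features, assuming toy two-dimensional word vectors in place of a trained model: the word vectors of each text are averaged into a text vector, and several of the listed distances are computed from their standard definitions:

```python
# Sketch: average word vectors into a text vector, then compute distances.
# The two toy word-vector lists stand in for real model output.

import math

def text_vector(word_vectors):
    """Average the word vectors of a text into a single text vector."""
    n = len(word_vectors)
    return [sum(ws) / n for ws in zip(*word_vectors)]

def cosine_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (nu * nv)

def euclidean(u, v):      # euc_vec = ||vec_1 - vec_2||_2
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def chebyshev(u, v):      # cheb_vec = max_i |u_i - v_i|
    return max(abs(a - b) for a, b in zip(u, v))

def cityblock(u, v):      # city_vec = sum_i |u_i - v_i|
    return sum(abs(a - b) for a, b in zip(u, v))

vec1 = text_vector([[1.0, 0.0], [0.0, 1.0]])   # -> [0.5, 0.5]
vec2 = text_vector([[1.0, 1.0], [1.0, 0.0]])   # -> [1.0, 0.5]
```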
3. Features based on the bag-of-words (BOW) model
The bag-of-words model is in fact the one-hot characterization of a text; the distance between the one-hot characterizations of two texts can be used to represent the distance between the texts. In the formulas below, C_TT denotes the number of positions where u_i = 1 and v_i = 1, C_TF the number where u_i = 1 and v_i = 0, C_FT the number where u_i = 0 and v_i = 1, C_FF the number where u_i = 0 and v_i = 0, and n the length of the one-hot characterization.
1) Compute the Dice distance of the two one-hot characterizations, denoted dice_onehot:
dice_onehot = (C_TF + C_FT) / (2 C_TT + C_TF + C_FT)
2) Compute the Hamming distance of the two one-hot characterizations, denoted ham_onehot:
ham_onehot = (C_TF + C_FT) / n
3) Compute the Kulsinski distance of the two one-hot characterizations, denoted kul_onehot:
kul_onehot = (C_TF + C_FT - C_TT + n) / (C_TF + C_FT + n)
4) Compute the Rogers-Tanimoto distance of the two one-hot characterizations, denoted roger_onehot:
roger_onehot = 2 (C_TF + C_FT) / (C_TT + C_FF + 2 (C_TF + C_FT))
5) Compute the Yule distance of the two one-hot characterizations, denoted yule_onehot:
yule_onehot = 2 C_TF C_FT / (C_TT C_FF + C_TF C_FT)
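The one-hot distances all reduce to the counts C_TT, C_TF, C_FT, C_FF; a sketch of two of them (Dice and Hamming) over toy binary vectors:

```python
# Sketch of the bag-of-words (one-hot) distances, computed from the counts
# C_TT, C_TF, C_FT, C_FF over two binary vectors u and v.

def bow_counts(u, v):
    ctt = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    ctf = sum(1 for a, b in zip(u, v) if a == 1 and b == 0)
    cft = sum(1 for a, b in zip(u, v) if a == 0 and b == 1)
    cff = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    return ctt, ctf, cft, cff

def dice_dist(u, v):
    """(C_TF + C_FT) / (2 C_TT + C_TF + C_FT)"""
    ctt, ctf, cft, _ = bow_counts(u, v)
    return (ctf + cft) / (2 * ctt + ctf + cft)

def hamming_dist(u, v):
    """(C_TF + C_FT) / n"""
    _, ctf, cft, _ = bow_counts(u, v)
    return (ctf + cft) / len(u)

u = [1, 1, 0, 1, 0]   # one-hot characterization of text 1 (toy vocabulary)
v = [1, 0, 1, 1, 0]   # one-hot characterization of text 2
```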
4. Features based on Tf-Idf
Using the tf-idf formula, compute the tf-idf characterization of each text, and then use the tf-idf characterizations of the texts to compute the corresponding distances.
1) Compute the cosine and Euclidean distances of the two tf-idf characterizations, denoted cos_tfidf and euc_tfidf respectively;
2) Compute the Mahalanobis distance of the two tf-idf characterizations, denoted mah_tfidf:
mah_tfidf = sqrt((t_1 - t_2)^T V^{-1} (t_1 - t_2))
where t_1 and t_2 are the tf-idf characterizations of the two texts and V is the covariance matrix of the tf-idf matrix of the entire training corpus.
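A minimal sketch of the Tf-Idf features: build tf-idf characterizations of two texts over a toy corpus and compare them with the cosine distance. The corpus and the plain tf × idf weighting are illustrative assumptions; real pipelines typically use smoothed idf:

```python
# Sketch: tf-idf characterizations of two texts, compared by cosine distance.
# Toy three-document corpus; plain tf * idf weighting (no smoothing).

import math

corpus = [["how", "lose", "weight"],
          ["how", "reduce", "weight"],
          ["best", "diet"]]
vocab = sorted({w for doc in corpus for w in doc})

def idf(word):
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df)

def tfidf(doc):
    """tf-idf characterization of one document over the shared vocabulary."""
    return [doc.count(w) / len(doc) * idf(w) for w in vocab]

def cosine_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (nu * nv)

t1, t2 = tfidf(corpus[0]), tfidf(corpus[1])
cos_tfidf = cosine_dist(t1, t2)   # the cos_tfidf feature above
```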
5. Features based on LSI
1) Using the LSI algorithm, compute the LSI characterizations of the two texts, denoted lsi_q1 and lsi_q2 respectively.
2) Compute the cosine and Euclidean distances of the two LSI characterizations, denoted cos_lsi and euc_lsi respectively.
As a preferred embodiment, step S3 specifically comprises:
S3.1: apply a nonlinear transformation to the word vectors of the texts to be compared to obtain nonlinear transformation results.
Specifically, the prediction model in the embodiments includes a nonlinear transformation layer (a BiLSTM layer) containing bidirectional LSTM units. The advantage of a bidirectional LSTM over a unidirectional one is that it can observe, at each step, both past and future information about the current word. The nonlinear transformation layer transforms the word vectors of the texts to be compared to obtain the nonlinear transformation results. It should be noted that in the embodiments the nonlinear transformation layer serves to extract the context of each word, converting the information of an individual word (i.e., a word vector of a text to be compared) into information that combines its preceding and following context.
S3.2: splice the nonlinear transformation results and perform the corresponding computations on the spliced result to obtain a computation result.
Specifically, the prediction model in the embodiments further includes a connection layer, which follows the nonlinear transformation layer. The connection layer here is an Attention layer; it splices the transformation results of the nonlinear transformation layer.
The spliced output of the connection layer is then processed further to obtain the computation result. It should be noted that in this implementation the outputs of the Attention layer for the two texts to be compared are denoted u and v respectively, and uv and |u - v| are then computed. The Attention layer uses dot-product attention: the output of the nonlinear transformation layer is denoted Q, a trainable state vector c is randomly initialized, and
Attention(Q) = sum(softmax(Q c^T) Q, axis=0)
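The dot-product attention formula can be sketched in pure Python; Q is a toy stand-in for the BiLSTM output (one row per word) and c for the trained state vector:

```python
# Sketch of Attention(Q) = sum(softmax(Q c^T) Q, axis=0): a weighted sum of
# the rows of Q with weights softmax(Q c^T).

import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, c):
    """Weighted sum of the rows of Q with weights softmax(Q c^T)."""
    scores = [sum(qi * ci for qi, ci in zip(row, c)) for row in Q]  # Q c^T
    w = softmax(scores)
    dim = len(Q[0])
    return [sum(w[i] * Q[i][d] for i in range(len(Q))) for d in range(dim)]

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy BiLSTM outputs: 3 words x 2 dims
c = [0.5, 0.5]                              # stand-in for the trained state vector
u = attention(Q, c)                         # fixed-length summary of the text
```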
S3.3: connect the computation result and the first similarity into one long vector.
Specifically, the prediction model in the embodiments further includes a merge layer, which stitches together the uv and |u - v| computed in the steps above and the first similarity to generate one long vector. It should be noted that in the embodiments the merge layer is implemented with a Concatenate function.
S3.4: compute the second similarity between the texts to be compared from the long vector.
Specifically, the second similarity between the texts to be compared is finally computed from the long vector. It should be noted that in the embodiments the merge layer is followed by a fully connected layer and then a Softmax layer, which outputs the similarity computation result; the fully connected layer here is a Dense layer. It should also be noted that after vectors are merged, the scales of the features usually differ, and one nonlinear fully connected layer can extract the features effectively. Using the Softmax output directly, without the extra fully connected layer, usually performs worse; in a traditional shallow neural network with only an input layer and an output layer and no hidden layer, the effect is naturally poor. Adding the fully connected layer therefore improves the performance of the algorithm.
The similarity computation result output by the Softmax layer is a probability (the probability that the two texts are similar). For example, for two texts q1 and q2 it is expressed as a function of the form:
f(q1, q2) → [0, 1]
where 1 indicates that the two texts have the same meaning and 0 indicates that they have different meanings.
Fig. 2 is a schematic diagram of the prediction model according to an exemplary embodiment. Referring to Fig. 2, the model includes:
an input layer (Input layer), for inputting the texts to be compared, assumed to be q1 and q2;
an embedding layer (Embedding layer), for vectorizing the texts to be compared after word segmentation; in the embodiments the embedding layer of the prediction model is a fused word vector model;
a nonlinear transformation layer (BiLSTM layer), for applying a nonlinear transformation to the word vectors of the texts to obtain the nonlinear transformation results;
a connection layer (Attention layer), for splicing the nonlinear transformation results, with the outputs of the Attention layer denoted u and v respectively;
a merge layer (Concatenate layer), for stitching together uv, |u - v|, and the first similarity to generate one long vector;
a fully connected layer (Dense layer), for connecting the merge layer and the output layer;
an output layer (Softmax layer), for outputting the similarity computation result, which is a probability.
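A toy forward pass through the tail of the model in Fig. 2 (merge layer, Dense layer, Softmax layer); the weights, u, v, and the two first similarities are fixed illustrative values, not learned parameters:

```python
# Sketch: merge u*v, |u - v|, and the first similarities into one long
# vector, pass it through a small dense layer with tanh, and apply softmax.
# All weights below are fixed toy values; in the real model they are learned.

import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def merge(u, v, first_sims):
    """Concatenate layer: [u*v, |u-v|, first similarities] -> long vector."""
    return ([a * b for a, b in zip(u, v)]
            + [abs(a - b) for a, b in zip(u, v)]
            + first_sims)

def dense(x, weights, bias):
    """One fully connected layer with tanh nonlinearity."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

u, v = [0.7, 0.3], [0.6, 0.4]
long_vec = merge(u, v, first_sims=[0.6, 0.33])      # len 2+2+2 = 6
W = [[0.1] * 6, [-0.1] * 6]                          # toy 2-unit dense layer
h = dense(long_vec, W, bias=[0.0, 0.0])
p_similar, p_different = softmax(h)                  # probability output
```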
Fig. 3 is a structural schematic diagram of the text similarity computing device according to an exemplary embodiment. Referring to Fig. 3, the device includes:
a vectorization module, for vectorizing each text to be compared using the pre-trained word vector model to obtain the word vectors of the texts to be compared;
a computing module, for computing a first similarity between the texts to be compared;
a prediction module, for obtaining a second similarity between the texts to be compared according to the pre-built prediction model, the word vectors of the texts, and the first similarity.
As a preferred embodiment, the vectorization module includes:
a training unit, for preprocessing the training corpus and using the preprocessed training corpus to train the word vector model in advance;
a word segmentation unit, for performing word segmentation on each text to be compared to obtain the words of each text;
a vectorization unit, for vectorizing the words of each text to be compared using the pre-trained word vector model to obtain the word vectors of the texts; the vectorization unit is implemented by the fused word vector model.
As a preferred embodiment, the vectorization module further includes:
a marking unit, for labeling the training corpus in text pairs: a pair of texts with the same meaning is labeled 1, and a pair of texts with different meanings is labeled 0.
As a preferred embodiment, in the embodiment of the present invention, the computing module is specifically configured to:
calculate the first similarity between the texts to be calculated based on the character strings of the texts to be calculated; and/or
calculate the first similarity between the texts to be calculated based on the word vectors of the texts to be calculated; and/or
calculate the first similarity between the texts to be calculated based on a bag-of-words model; and/or
calculate the first similarity between the texts to be calculated based on Tf-idf features; and/or
calculate the first similarity between the texts to be calculated based on LSI features.
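Two of the listed first-similarity features can be sketched as follows, assuming character-level Jaccard for the character-string similarity and cosine over term counts for the bag-of-words similarity (concrete choices made for illustration; the Tf-idf and LSI variants are omitted):

```python
import math
from collections import Counter

def jaccard_chars(a, b):
    # Character-string similarity: Jaccard over the sets of characters
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def cosine_bow(a, b):
    # Bag-of-words similarity: cosine over term-count vectors
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# The first similarity is a small vector of such features
s1 = [jaccard_chars("how to pay", "how to make payment"),
      cosine_bow("how to pay", "how to make payment")]
print(s1)
```

Each feature lies in [0, 1]; the resulting feature vector is later spliced into the long vector of the prediction model.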
As a preferred embodiment, in the embodiment of the present invention, the prediction module includes:
Non-linear transformation unit, for performing non-linear transformation on the word vectors of the texts to be calculated to obtain non-linear transformation results;
wherein the non-linear transformation unit is implemented by the non-linear transformation layer (BiLSTM layer), and the non-linear transformation layer (BiLSTM layer) comprises bidirectional LSTM units.
First connection unit, for splicing the non-linear transformation results;
wherein the first connection unit is implemented by the attention layer (Attention layer).
Computing unit, for performing the corresponding calculation on the splicing result to obtain a calculated result;
Second connection unit, for connecting the calculated result and the first similarity into one long vector;
wherein the second connection unit is implemented by the merge layer (Concatenate layer).
Prediction unit, for calculating the similarity between the texts to be calculated according to the long vector;
wherein the prediction unit is implemented by the output layer (Softmax layer).
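The patent does not spell out the attention computation. One common reading, a learned softmax weighting over the BiLSTM hidden states pooled into a single sentence vector (the u or v above), can be sketched as follows; the hidden states and scoring vector are random stand-ins for illustration only:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(H, w):
    """Pool a sequence of BiLSTM hidden states into one sentence vector.

    H : (timesteps, hidden) matrix of BiLSTM outputs (simulated here)
    w : (hidden,) attention scoring vector (assumed learned; random here)
    """
    scores = H @ w            # one scalar score per timestep
    alpha = softmax(scores)   # attention weights, summing to 1
    return alpha @ H          # weighted sum over timesteps

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 16))  # stand-in for BiLSTM outputs of one text
w = rng.normal(size=16)
u = attention_pool(H, w)
print(u.shape)  # a single fixed-size sentence representation
```

Applying the same pooling to the second text yields v; the pair then feeds the merge layer described above.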
In conclusion technical solution provided in an embodiment of the present invention has the benefit that
1, Text similarity computing method and device provided in an embodiment of the present invention, using supervised learning technology, in fusion
Text participle, Tf-Idf, LSA, a variety of natural language Feature Extraction Technologies such as LDA, Word2Vec, a variety of texts such as Jaccard, WMD
The calculation method of this similarity (distance), improves the accuracy of Text similarity computing;
2, Text similarity computing method and device provided in an embodiment of the present invention, using Model Fusion technology, by depth
Study and traditional characteristic study combine, and further improve the accuracy of Text similarity computing.
It should be understood that when the text similarity calculation device provided by the above embodiments triggers the similarity calculation service, the division into the above functional modules is only an example; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the text similarity calculation device provided by the above embodiments belongs to the same concept as the text similarity calculation method embodiments, that is, the method is based on this device; its specific implementation process is detailed in the method embodiments and will not be repeated here.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A text similarity calculation method, characterized in that the method comprises the following steps:
S1: vectorizing texts to be calculated respectively using a pre-trained word vector model to obtain the word vectors of the texts to be calculated;
S2: calculating the first similarity between the texts to be calculated;
S3: obtaining the second similarity between the texts to be calculated according to a pre-built prediction model, the word vectors of the texts to be calculated, and the first similarity.
2. The text similarity calculation method according to claim 1, characterized in that step S1 specifically comprises:
S1.1: preprocessing the training corpus and pre-training the word vector model using the preprocessed training corpus;
S1.2: performing word segmentation on each text to be calculated to obtain the words corresponding to each text to be calculated;
S1.3: vectorizing the words corresponding to the texts to be calculated using the pre-trained word vector model to obtain the word vectors of the texts to be calculated.
3. The text similarity calculation method according to claim 2, characterized in that preprocessing the training corpus comprises:
labeling the training corpus with text pairs, wherein text pairs with the same meaning are labeled 1 and text pairs with different meanings are labeled 0.
4. The text similarity calculation method according to claim 1 or 2, characterized in that step S2 specifically comprises:
calculating the first similarity between the texts to be calculated based on the character strings of the texts to be calculated; and/or
calculating the first similarity between the texts to be calculated based on the word vectors of the texts to be calculated; and/or
calculating the first similarity between the texts to be calculated based on a bag-of-words model; and/or
calculating the first similarity between the texts to be calculated based on Tf-idf features; and/or
calculating the first similarity between the texts to be calculated based on LSI features.
5. The text similarity calculation method according to claim 1 or 2, characterized in that step S3 specifically comprises:
S3.1: performing non-linear transformation on the word vectors of the texts to be calculated to obtain non-linear transformation results;
S3.2: splicing the non-linear transformation results and performing the corresponding calculation on the splicing result to obtain a calculated result;
S3.3: connecting the calculated result and the first similarity into one long vector;
S3.4: calculating the second similarity between the texts to be calculated according to the long vector.
6. A text similarity calculation device, characterized in that the device comprises:
a vectorization module, for vectorizing texts to be calculated respectively using a pre-trained word vector model to obtain the word vectors of the texts to be calculated;
a computing module, for calculating the first similarity between the texts to be calculated;
a prediction module, for obtaining the second similarity between the texts to be calculated according to a pre-built prediction model, the word vectors of the texts to be calculated, and the first similarity.
7. The text similarity calculation device according to claim 6, characterized in that the vectorization module comprises:
a training unit, for preprocessing the training corpus and pre-training the word vector model using the preprocessed training corpus;
a word segmentation unit, for performing word segmentation on each text to be calculated to obtain the words corresponding to each text to be calculated;
a vectorization unit, for vectorizing the words corresponding to the texts to be calculated using the pre-trained word vector model to obtain the word vectors of the texts to be calculated.
8. The text similarity calculation device according to claim 7, characterized in that the vectorization module further comprises:
a labeling unit, for labeling the training corpus with text pairs, wherein text pairs with the same meaning are labeled 1 and text pairs with different meanings are labeled 0.
9. The text similarity calculation device according to claim 6 or 7, characterized in that the computing module is specifically configured to:
calculate the first similarity between the texts to be calculated based on the character strings of the texts to be calculated; and/or
calculate the first similarity between the texts to be calculated based on the word vectors of the texts to be calculated; and/or
calculate the first similarity between the texts to be calculated based on a bag-of-words model; and/or
calculate the first similarity between the texts to be calculated based on Tf-idf features; and/or
calculate the first similarity between the texts to be calculated based on LSI features.
10. The text similarity calculation device according to claim 6 or 7, characterized in that the prediction module comprises:
a non-linear transformation unit, for performing non-linear transformation on the word vectors of the texts to be calculated to obtain non-linear transformation results;
a first connection unit, for splicing the non-linear transformation results;
a computing unit, for performing the corresponding calculation on the splicing result to obtain a calculated result;
a second connection unit, for connecting the calculated result and the first similarity into one long vector;
a prediction unit, for calculating the second similarity between the texts to be calculated according to the long vector.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910191756.1A CN109992772A (en) | 2019-03-13 | 2019-03-13 | A kind of Text similarity computing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109992772A true CN109992772A (en) | 2019-07-09 |
Family
ID=67130655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910191756.1A Pending CN109992772A (en) | 2019-03-13 | 2019-03-13 | A kind of Text similarity computing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109992772A (en) |
2019-03-13: Application CN201910191756.1A filed in China (published as CN109992772A); status: Pending.
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291699A (en) * | 2017-07-04 | 2017-10-24 | 湖南星汉数智科技有限公司 | A kind of sentence semantic similarity computational methods |
CN107729300A (en) * | 2017-09-18 | 2018-02-23 | 百度在线网络技术(北京)有限公司 | Processing method, device, equipment and the computer-readable storage medium of text similarity |
CN109344399A (en) * | 2018-09-14 | 2019-02-15 | 重庆邂智科技有限公司 | A kind of Text similarity computing method based on the two-way lstm neural network of stacking |
CN109460549A (en) * | 2018-10-12 | 2019-03-12 | 北京奔影网络科技有限公司 | The processing method and processing device of semantic vector |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110489549A (en) * | 2019-07-16 | 2019-11-22 | 北京大米科技有限公司 | Teaching transcription comparison method, device, electronic equipment and medium |
CN111639661A (en) * | 2019-08-29 | 2020-09-08 | 上海卓繁信息技术股份有限公司 | Text similarity discrimination method |
CN111191464A (en) * | 2020-01-17 | 2020-05-22 | 珠海横琴极盛科技有限公司 | Semantic similarity calculation method based on combined distance |
CN113743077A (en) * | 2020-08-14 | 2021-12-03 | 北京京东振世信息技术有限公司 | Method and device for determining text similarity |
CN113743077B (en) * | 2020-08-14 | 2023-09-29 | 北京京东振世信息技术有限公司 | Method and device for determining text similarity |
CN112085091A (en) * | 2020-09-07 | 2020-12-15 | 中国平安财产保险股份有限公司 | Artificial intelligence-based short text matching method, device, equipment and storage medium |
CN112085091B (en) * | 2020-09-07 | 2024-04-26 | 中国平安财产保险股份有限公司 | Short text matching method, device, equipment and storage medium based on artificial intelligence |
CN112185573A (en) * | 2020-09-25 | 2021-01-05 | 志诺维思(北京)基因科技有限公司 | LCS and TF-IDF based similar character string determination method and device |
CN112185573B (en) * | 2020-09-25 | 2023-11-03 | 志诺维思(北京)基因科技有限公司 | Similar character string determining method and device based on LCS and TF-IDF |
CN112308464A (en) * | 2020-11-24 | 2021-02-02 | 中国人民公安大学 | Business process data processing method and device |
CN112308464B (en) * | 2020-11-24 | 2023-11-24 | 中国人民公安大学 | Business process data processing method and device |
CN115953130A (en) * | 2023-01-05 | 2023-04-11 | 深圳市坂云科技有限公司 | Intelligent analysis processing system for customs declaration data |
CN115953130B (en) * | 2023-01-05 | 2023-08-11 | 深圳市坂云科技有限公司 | Intelligent analysis processing system for gateway declaration data |
CN116010603A (en) * | 2023-01-31 | 2023-04-25 | 浙江中电远为科技有限公司 | Feature clustering dimension reduction method for commercial text classification |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109992772A (en) | A kind of Text similarity computing method and device | |
Liu et al. | Joint-modal distribution-based similarity hashing for large-scale unsupervised deep cross-modal retrieval | |
Shen et al. | Weakly supervised dense video captioning | |
Gao et al. | Deep label distribution learning with label ambiguity | |
Wang et al. | MoFAP: A multi-level representation for action recognition | |
Wang et al. | Metasearch: Incremental product search via deep meta-learning | |
Xiao et al. | Convolutional hierarchical attention network for query-focused video summarization | |
CN111985538A (en) | Small sample picture classification model and method based on semantic auxiliary attention mechanism | |
CN109726387A (en) | Man-machine interaction method and system | |
Han et al. | A unified perspective of classification-based loss and distance-based loss for cross-view gait recognition | |
CN112580362A (en) | Visual behavior recognition method and system based on text semantic supervision and computer readable medium | |
Zhao et al. | TUCH: Turning Cross-view Hashing into Single-view Hashing via Generative Adversarial Nets. | |
Li et al. | Combining local and global features into a Siamese network for sentence similarity | |
CN113806554A (en) | Knowledge graph construction method for massive conference texts | |
Ouyang et al. | Collaborative image relevance learning for visual re-ranking | |
Avgoustinakis et al. | Audio-based near-duplicate video retrieval with audio similarity learning | |
Xu et al. | Idhashgan: deep hashing with generative adversarial nets for incomplete data retrieval | |
CN113641854B (en) | Method and system for converting text into video | |
Zhu et al. | Learning clip guided visual-text fusion transformer for video-based pedestrian attribute recognition | |
Lin et al. | Region-based context enhanced network for robust multiple face alignment | |
Niu | Music Emotion Recognition Model Using Gated Recurrent Unit Networks and Multi-Feature Extraction | |
Du et al. | Captioning videos using large-scale image corpus | |
Li et al. | Deep unsupervised hashing for large-scale cross-modal retrieval using knowledge distillation model | |
Mi et al. | Dual-branch network with a subtle motion detector for microaction recognition in videos | |
CN111813927A (en) | Sentence similarity calculation method based on topic model and LSTM |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190709 |