CN109992772A - Text similarity calculation method and device - Google Patents

Text similarity calculation method and device

Info

Publication number
CN109992772A
CN109992772A
Authority
CN
China
Prior art keywords
text
calculated
similarity
term vector
advance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910191756.1A
Other languages
Chinese (zh)
Inventor
张永煦
倪博溢
冯璠
雷画雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongan Information Technology Service Co Ltd
Original Assignee
Zhongan Information Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongan Information Technology Service Co Ltd
Priority to CN201910191756.1A
Publication of CN109992772A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text similarity calculation method and device. The method comprises: S1: vectorizing the texts to be compared using a pre-trained word-vector model to obtain the word vectors of the texts; S2: calculating a first similarity between the texts; S3: obtaining a second similarity between the texts according to a pre-built prediction model, the word vectors of the texts, and the first similarity. On the one hand, the invention uses supervised learning to fuse multiple natural-language feature extraction techniques, such as Chinese word segmentation, Tf-Idf, LSA, LDA, and Word2Vec, with multiple text similarity (distance) measures, such as Jaccard and WMD, improving the accuracy of text similarity calculation; on the other hand, it uses model fusion to combine deep learning with traditional feature learning, further improving the accuracy of text similarity calculation.

Description

Text similarity calculation method and device
Technical field
The present invention relates to the field of natural language processing, and in particular to a text similarity calculation method and device.
Background art
The calculation of text similarity (distance) is widely used in practice, for example in information retrieval, text clustering, and question answering systems. Many methods currently exist for calculating the degree of similarity between two texts.
One class of methods directly compares the similarity between text strings, for example the ratio of the longest common subsequence (LCS) of two strings to the maximum of the two texts' lengths, the edit distance (Levenshtein distance), and the Jaccard similarity. These methods directly compare literal differences: every word carries equal weight in the calculation, and similarity at the semantic level is not considered, so their accuracy is poor in practical applications.
Another class of methods represents characters, words, or texts as numeric vectors and then calculates the similarity between the vectors. The bag-of-words model BOW (Bag of Words), term frequency-inverse document frequency TF-IDF, latent semantic analysis LSI, and word-vector models such as Word2vec, GloVe, and fastText are common text vectorization methods. After vectorization, the cosine or Euclidean distance between two texts characterizes their degree of similarity. The distance between entire texts can also be calculated directly from the distances between their words, as in the Word Mover's Distance (WMD), which characterizes the distance between two texts by the minimum transport distance between their words. BM25 is another method that, after digitizing the words, calculates a degree of relevance between two texts from a probabilistically derived formula. All of these methods digitize the texts before calculating similarity, and because the numeric representation captures part of the semantic relationship, their effect in practical applications is considerably better than that of the first class of methods.
The above methods are all essentially unsupervised similarity calculation methods: they cannot be optimized for a particular application scenario, their accuracy is limited, and they often fall far short of the requirements of practical application scenarios. A similarity calculation method built on the idea of supervised learning is therefore needed.
Summary of the invention
To solve the problems in the prior art, embodiments of the present invention provide a text similarity calculation method and device, to overcome problems in the prior art such as the low accuracy and semantic one-sidedness of existing text similarity calculation methods.
To solve one or more of the above technical problems, the present invention adopts the following technical solutions:
In one aspect, a text similarity calculation method is provided, the method comprising the following steps:
S1: vectorizing the texts to be compared using a pre-trained word-vector model to obtain the word vectors of the texts;
S2: calculating a first similarity between the texts;
S3: obtaining a second similarity between the texts according to a pre-built prediction model, the word vectors of the texts, and the first similarity.
Further, step S1 specifically includes:
S1.1: preprocessing a training corpus and pre-training a word-vector model on the preprocessed corpus;
S1.2: segmenting each text to be compared into words;
S1.3: vectorizing the words of each text with the pre-trained word-vector model to obtain the word vectors of the texts.
Further, preprocessing the training corpus includes:
labelling text pairs in the training corpus: pairs with the same meaning are labelled 1, and pairs with different meanings are labelled 0.
Further, step S2 specifically includes:
calculating the first similarity between the texts based on their character strings; and/or
calculating the first similarity between the texts based on their word vectors; and/or
calculating the first similarity between the texts based on a bag-of-words model; and/or
calculating the first similarity between the texts based on a Tf-Idf representation; and/or
calculating the first similarity between the texts based on an LSI representation.
Further, step S3 specifically includes:
S3.1: applying a non-linear transformation to the word vectors of the texts to obtain transformation results;
S3.2: splicing the transformation results and performing the corresponding calculations on the spliced result to obtain a calculation result;
S3.3: connecting the calculation result and the first similarity into one long vector;
S3.4: calculating the second similarity between the texts from the long vector.
In another aspect, a text similarity calculation device is provided, the device comprising:
a vectorization module, for vectorizing the texts to be compared using a pre-trained word-vector model to obtain the word vectors of the texts;
a calculation module, for calculating a first similarity between the texts;
a prediction module, for obtaining a second similarity between the texts according to a pre-built prediction model, the word vectors of the texts, and the first similarity.
Further, the vectorization module includes:
a training unit, for preprocessing a training corpus and pre-training a word-vector model on the preprocessed corpus;
a segmentation unit, for segmenting each text to be compared into words;
a vectorization unit, for vectorizing the words of each text with the pre-trained word-vector model to obtain the word vectors of the texts.
Further, the vectorization module also includes:
a labelling unit, for labelling text pairs in the training corpus: pairs with the same meaning are labelled 1, and pairs with different meanings are labelled 0.
Further, the calculation module is specifically used for:
calculating the first similarity between the texts based on their character strings; and/or
calculating the first similarity between the texts based on their word vectors; and/or
calculating the first similarity between the texts based on a bag-of-words model; and/or
calculating the first similarity between the texts based on a Tf-Idf representation; and/or
calculating the first similarity between the texts based on an LSI representation.
Further, the prediction module includes:
a non-linear transformation unit, for applying a non-linear transformation to the word vectors of the texts to obtain transformation results;
a first connection unit, for splicing the transformation results;
a calculation unit, for performing the corresponding calculations on the spliced result to obtain a calculation result;
a second connection unit, for connecting the calculation result and the first similarity into one long vector;
a prediction unit, for calculating the second similarity between the texts from the long vector.
The technical solutions provided by embodiments of the present invention have the following beneficial effects:
1. The text similarity calculation method and device provided by embodiments of the present invention use supervised learning to fuse multiple natural-language feature extraction techniques, such as Chinese word segmentation, Tf-Idf, LSA, LDA, and Word2Vec, with multiple text similarity (distance) measures, such as Jaccard and WMD, improving the accuracy of text similarity calculation;
2. The text similarity calculation method and device provided by embodiments of the present invention use model fusion to combine deep learning with traditional feature learning, further improving the accuracy of text similarity calculation.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a text similarity calculation method according to an exemplary embodiment;
Fig. 2 is a schematic diagram of the prediction model according to an exemplary embodiment;
Fig. 3 is a structural schematic diagram of a text similarity calculation device according to an exemplary embodiment.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flow chart of a text similarity calculation method according to an exemplary embodiment. Referring to Fig. 1, the method comprises the following steps:
S1: vectorizing the texts to be compared using a pre-trained word-vector model to obtain the word vectors of the texts.
Specifically, in the embodiments of the present invention the similarity between texts is calculated on the basis of vectors built from the texts' word vectors. A word-vector model therefore has to be trained in advance; the texts to be compared are then vectorized with this model to obtain their corresponding word vectors.
S2: calculating a first similarity between the texts.
Specifically, to improve the accuracy of text similarity calculation, the embodiments of the present invention fuse multiple natural-language feature extraction techniques, such as Chinese word segmentation, Tf-Idf, LSI, LCS, and Word2Vec, with multiple traditional text similarity (distance) measures, such as Jaccard and WMD. It should be explained here that the first similarity is a similarity between the texts computed by some traditional calculation method, including but not limited to Tf-Idf, LSI, LCS, Jaccard, and WMD.
S3: obtaining a second similarity between the texts according to a pre-built prediction model, the word vectors of the texts, and the first similarity.
Specifically, to further improve the accuracy of text similarity calculation, the embodiments of the present invention use model fusion to combine deep learning with traditional feature learning. The texts are first vectorized by the word-vector model to obtain their word vectors; the word vectors, together with the first similarity computed by the traditional methods, are then input to the pre-built prediction model, which outputs the second similarity between the texts. This second similarity is the text similarity calculation result of the invention.
As a preferred embodiment, in the embodiments of the present invention step S1 specifically includes:
S1.1: preprocessing a training corpus and pre-training a word-vector model on the preprocessed corpus.
Specifically, relevant corpora are collected as the training corpus for model training, and the training corpus is then preprocessed. Note that in the embodiments of the present invention the pre-built prediction model is obtained by training with supervised learning on the preprocessed training corpus.
S1.2: segmenting each text to be compared into words.
Specifically, in the embodiments of the present invention a Chinese word segmenter is used to segment the texts, and punctuation marks are removed from the segmentation results.
S1.3: vectorizing the words of each text with the pre-trained word-vector model to obtain the word vectors of the texts.
Specifically, a word-vector technique such as word2vec, GloVe, or fastText is used to train the word-vector model on the training corpus or other Chinese corpora; the trained model is then used to vectorize the words of the texts to be compared, yielding their word vectors.
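As an illustration only (the patent names the techniques but no toolkit), the following minimal sketch performs S1.1-S1.3 with gensim's Word2Vec and jieba for Chinese word segmentation; the corpus file name, the tokenize helper, and all hyperparameters are assumptions of this sketch:

```python
# Minimal sketch of S1.1-S1.3 (assumptions: gensim, jieba, and a hypothetical
# one-sentence-per-line file corpus.txt; hyperparameters are illustrative).
import jieba
from gensim.models import Word2Vec

PUNCT = set("，。！？、；：（）,.!?;:()")

def tokenize(text):
    # S1.2: segment with a Chinese word segmenter and drop punctuation.
    return [w for w in jieba.lcut(text) if w.strip() and w not in PUNCT]

with open("corpus.txt", encoding="utf-8") as f:
    sentences = [tokenize(line) for line in f]

# S1.1: pre-train the word-vector model on the preprocessed training corpus.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1)

# S1.3: vectorize the words of a text to be compared.
tokens = tokenize("我要如何减肥？")
word_vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
```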
As a preferred embodiment, preprocessing the training corpus includes:
labelling text pairs in the training corpus: pairs with the same meaning are labelled 1, and pairs with different meanings are labelled 0.
Specifically, before training the model, corresponding data labels have to be applied to the training corpus: text pairs with the same meaning are labelled 1 and text pairs with different meanings are labelled 0. For example, the pair "How do I lose weight?" / "How can I reduce my weight?" is labelled 1, while the pair "What is the best way to lose weight?" / "How can I exercise effectively?" is labelled 0.
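Represented as data, such labelled pairs could look like the following sketch; the Chinese strings are back-translations of the examples above and the variable name is this sketch's own:

```python
# Hypothetical labelled text pairs: 1 = same meaning, 0 = different meaning.
train_pairs = [
    ("我要如何减肥？", "怎样减轻我的体重？", 1),      # "How do I lose weight?" pair
    ("减肥最好的方式是什么？", "如何有效地锻炼", 0),  # different meanings
]
```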
As a preferred embodiment, in the embodiments of the present invention step S2 specifically includes:
calculating the first similarity between the texts based on their character strings; and/or
calculating the first similarity between the texts based on their word vectors; and/or
calculating the first similarity between the texts based on a bag-of-words model; and/or
calculating the first similarity between the texts based on a Tf-Idf representation; and/or
calculating the first similarity between the texts based on an LSI representation.
Specifically, let the two texts whose similarity is to be calculated be q1 and q2. Based on the traditional methods, calculating the first similarity between the texts includes one or more of the following:
1. Features based on character strings (see the sketch after this list):
1) Compute the number of words in each text after segmentation, denoted len_1 and len_2 respectively.
2) Compute the ratio of the longest common subsequence (LCS) of the two segmented word sequences to the maximum of their lengths, denoted lcs:
lcs = LCS(segment(q1), segment(q2)) / max(len_1, len_2)
3) Compute the ratio of the longest common subsequence of the two raw character strings to the maximum of their lengths, denoted olcs:
olcs = LCS(q1, q2) / max(Len(q1), Len(q2))
4) Compute the Jaccard distance between the two segmented texts, denoted jac:
jac = Jaccard(segment(q1), segment(q2))
5) Compute the Jaccard distance between the two raw character strings, denoted ojac:
ojac = Jaccard(q1, q2)
6) Compute the edit distance (Levenshtein distance) between the two segmented texts, denoted lev:
lev = Levenshtein(segment(q1), segment(q2))
7) Compute the edit distance (Levenshtein distance) between the two raw character strings, denoted olev:
olev = Levenshtein(q1, q2)
8) Compute the ratio of the total Tf-Idf of the words common to the two segmented texts to the total Tf-Idf of all words in the two texts:
ratio_tfidf = Σ_{w∈S} TfIdf(w) / Σ_{w∈q1∪q2} TfIdf(w)
where S is the set of common words.
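As an illustration, these string-based features could be computed as in the following sketch, which reuses the tokenize helper from the earlier sketch; the lcs_len and jaccard_dist helpers are this sketch's own, and the python-Levenshtein package is one assumed source of the edit distance:

```python
# Illustrative computation of features 1)-7); helper names are this sketch's own.
import Levenshtein  # assumption: the python-Levenshtein package

def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def jaccard_dist(a, b):
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)

q1, q2 = "我要如何减肥", "怎样减轻我的体重"
s1, s2 = tokenize(q1), tokenize(q2)              # word sequences; lengths len_1, len_2
lcs  = lcs_len(s1, s2) / max(len(s1), len(s2))   # feature 2)
olcs = lcs_len(q1, q2) / max(len(q1), len(q2))   # feature 3)
jac, ojac = jaccard_dist(s1, s2), jaccard_dist(q1, q2)   # features 4), 5)
lev  = Levenshtein.distance(" ".join(s1), " ".join(s2))  # feature 6), on joined words
olev = Levenshtein.distance(q1, q2)                      # feature 7)
```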
2. Features based on word vectors (see the sketch after this list):
1) Using the word-vector model, compute the Word Mover's Distance (WMD). When computing the WMD, either the cosine or the Euclidean distance can be used as the word-to-word distance; the corresponding WMDs are denoted wmd_cosine and wmd_euc.
2) Using the word-vector model, compute the vector of each text. Note that the vector of a text is obtained by adding up the word vectors of its words and averaging; the results are denoted vec_1 = [u_1, u_2, ..., u_n] and vec_2 = [v_1, v_2, ..., v_n], where n is the length of a word vector.
3) Compute the cosine distance between the two text vectors, denoted cos_vec:
cos_vec = 1 - (vec_1 · vec_2) / (||vec_1||_2 · ||vec_2||_2)
4) Compute the Euclidean distance between the two text vectors, denoted euc_vec:
euc_vec = ||vec_1 - vec_2||_2
5) Compute the Bray-Curtis distance between the two text vectors, denoted bray_vec:
bray_vec = Σ_i |u_i - v_i| / Σ_i |u_i + v_i|
6) Compute the Chebyshev distance between the two text vectors, denoted cheb_vec:
cheb_vec = max_i |u_i - v_i|
7) Compute the Canberra distance between the two text vectors, denoted canb_vec:
canb_vec = Σ_i |u_i - v_i| / (|u_i| + |v_i|)
8) Compute the cityblock distance between the two text vectors, denoted city_vec:
city_vec = Σ_i |u_i - v_i|
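An illustrative computation of these word-vector features, reusing w2v, s1, and s2 from the sketches above; gensim's wmdistance and scipy's distance functions are assumed implementations (note that gensim's WMD uses Euclidean word distances internally and needs an optimal-transport solver installed):

```python
# Illustrative computation of the word-vector features.
import numpy as np
from scipy.spatial import distance

# 1) Word Mover's Distance between the two segmented texts.
wmd = w2v.wv.wmdistance(s1, s2)

def text_vector(tokens):
    # 2) average the word vectors to obtain the text vector.
    return np.mean([w2v.wv[t] for t in tokens if t in w2v.wv], axis=0)

vec_1, vec_2 = text_vector(s1), text_vector(s2)
cos_vec  = distance.cosine(vec_1, vec_2)      # feature 3)
euc_vec  = distance.euclidean(vec_1, vec_2)   # feature 4)
bray_vec = distance.braycurtis(vec_1, vec_2)  # feature 5)
cheb_vec = distance.chebyshev(vec_1, vec_2)   # feature 6)
canb_vec = distance.canberra(vec_1, vec_2)    # feature 7)
city_vec = distance.cityblock(vec_1, vec_2)   # feature 8)
```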
3. Features based on the bag-of-words (BOW) model (see the sketch after this list):
The bag-of-words model is in fact the one-hot representation of a text, and the distance between the one-hot representations of two texts can be used to represent the distance between the texts. In the following, C_TT denotes the number of dimensions i with u_i = 1 and v_i = 1, C_TF the number with u_i = 1 and v_i = 0, C_FT the number with u_i = 0 and v_i = 1, C_FF the number with u_i = 0 and v_i = 0, and n the length of the one-hot representation.
1) Compute the Dice distance between the two texts' one-hot representations, denoted dice_onehot:
dice_onehot = (C_TF + C_FT) / (2·C_TT + C_TF + C_FT)
2) Compute the Hamming distance between the two texts' one-hot representations, denoted ham_onehot:
ham_onehot = (C_TF + C_FT) / n
3) Compute the Kulsinski distance between the two texts' one-hot representations, denoted kul_onehot:
kul_onehot = (C_TF + C_FT - C_TT + n) / (C_TF + C_FT + n)
4) Compute the Rogers-Tanimoto distance between the two texts' one-hot representations, denoted roger_onehot:
roger_onehot = 2·(C_TF + C_FT) / (C_TT + C_FF + 2·(C_TF + C_FT))
5) Compute the Yule distance between the two texts' one-hot representations, denoted yule_onehot:
yule_onehot = 2·C_TF·C_FT / (C_TT·C_FF + C_TF·C_FT)
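The formulas above match the boolean distance definitions in scipy.spatial.distance, so a sketch can delegate to them, reusing s1 and s2 from the earlier sketches:

```python
# Illustrative one-hot (boolean) distances over the joint vocabulary.
import numpy as np
from scipy.spatial import distance

vocab = sorted(set(s1) | set(s2))
u = np.array([w in s1 for w in vocab])
v = np.array([w in s2 for w in vocab])

dice_onehot  = distance.dice(u, v)            # feature 1)
ham_onehot   = distance.hamming(u, v)         # feature 2)
kul_onehot   = distance.kulsinski(u, v)       # feature 3); removed in SciPy >= 1.11
roger_onehot = distance.rogerstanimoto(u, v)  # feature 4)
yule_onehot  = distance.yule(u, v)            # feature 5)
```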
4. Features based on Tf-Idf (see the sketch after this list):
Using the tf-idf formula, compute the tf-idf representation of each text, then use the tf-idf representations to compute the corresponding distances.
1) Compute the cosine and Euclidean distances between the two texts' tf-idf representations, denoted cos_tfidf and euc_tfidf respectively.
2) Compute the Mahalanobis distance between the two texts' tf-idf representations, denoted mah_tfidf:
mah_tfidf = sqrt((t_1 - t_2)^T V^(-1) (t_1 - t_2))
where t_1 and t_2 are the two texts' tf-idf representations and V is the covariance matrix of the tf-idf matrix of the entire training corpus.
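A sketch of these Tf-Idf features with scikit-learn as an assumed toolkit, reusing tokenize, s1, s2, and the hypothetical corpus.txt from the sketches above:

```python
# Illustrative Tf-Idf distances; the training-corpus covariance drives the
# Mahalanobis distance described in 2) above.
import numpy as np
from scipy.spatial import distance
from sklearn.feature_extraction.text import TfidfVectorizer

with open("corpus.txt", encoding="utf-8") as f:
    docs = [" ".join(tokenize(line)) for line in f]

# token_pattern keeps single-character Chinese words that the default drops.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+").fit(docs)
t_1 = vectorizer.transform([" ".join(s1)]).toarray()[0]
t_2 = vectorizer.transform([" ".join(s2)]).toarray()[0]

cos_tfidf = distance.cosine(t_1, t_2)
euc_tfidf = distance.euclidean(t_1, t_2)

V = np.cov(vectorizer.transform(docs).toarray(), rowvar=False)
mah_tfidf = distance.mahalanobis(t_1, t_2, np.linalg.pinv(V))  # pinv: V may be singular
```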
5. Features based on LSI (see the sketch after this list):
1) Using the LSI algorithm, compute the LSI representations of the two texts, denoted lsi_q1 and lsi_q2 respectively.
2) Compute the cosine and Euclidean distances between the two texts' LSI representations, denoted cos_lsi and euc_lsi respectively.
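A sketch of the LSI features with gensim's LsiModel, again reusing tokenize, s1, and s2; the topic count is an illustrative assumption:

```python
# Illustrative LSI representations and distances.
import numpy as np
from scipy.spatial import distance
from gensim import corpora, models

with open("corpus.txt", encoding="utf-8") as f:
    texts = [tokenize(line) for line in f]

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
lsi = models.LsiModel(bow, id2word=dictionary, num_topics=100)

def lsi_vector(tokens):
    # Densify the (topic_id, value) pairs gensim returns for a document.
    dense = np.zeros(lsi.num_topics)
    for topic_id, value in lsi[dictionary.doc2bow(tokens)]:
        dense[topic_id] = value
    return dense

lsi_q1, lsi_q2 = lsi_vector(s1), lsi_vector(s2)
cos_lsi = distance.cosine(lsi_q1, lsi_q2)
euc_lsi = distance.euclidean(lsi_q1, lsi_q2)
```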
As a preferred embodiment, in the embodiments of the present invention step S3 specifically includes:
S3.1: applying a non-linear transformation to the word vectors of the texts to obtain transformation results.
Specifically, the prediction model in the embodiments of the present invention includes a non-linear transformation layer (a BiLSTM layer) consisting of bidirectional LSTM units. The advantage of a bidirectional LSTM over a unidirectional one is that it can observe both the past and the future information around the current word. The non-linear transformation layer (BiLSTM layer) applies a non-linear transformation to the word vectors of the texts, yielding the transformation results. Note that in the embodiments of the present invention the non-linear transformation layer serves to extract the contextual information of each word, converting the information of the individual word (i.e. the word vector of the text) into a representation that combines the preceding and following context.
S3.2: splicing the transformation results and performing the corresponding calculations on the spliced result to obtain a calculation result.
Specifically, the prediction model in the embodiments of the present invention further includes a connection layer, which follows the non-linear transformation layer. The connection layer here is an Attention layer, and it splices the results of the non-linear transformation layer.
The spliced result output by the connection layer is then processed to obtain the calculation result. Note that in the embodiments of the present invention the Attention-layer outputs for the two texts are denoted u and v respectively, and then u⊙v (the element-wise product) and |u - v| are computed. For the Attention layer, the dot-product attention method is used here: denoting the output of the non-linear transformation layer by Q and randomly initializing a trainable state vector c,
Attention(Q) = sum(softmax(Q·c^T)·Q, axis=0)
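As an illustration, this dot-product attention could be written as a small Keras layer (the framework is an assumption; the patent specifies only the formula): a trainable state vector c scores each timestep of the BiLSTM output Q, and the scores weight a sum over the time axis.

```python
# Illustrative Keras layer implementing Attention(Q) = sum(softmax(Q c^T) Q).
import tensorflow as tf

class DotAttention(tf.keras.layers.Layer):
    def build(self, input_shape):
        # Randomly initialized, trainable state vector c.
        self.c = self.add_weight(name="c", shape=(input_shape[-1], 1),
                                 initializer="glorot_uniform", trainable=True)

    def call(self, q):
        # q: (batch, time, dim). Softmax the scores over the time axis, then
        # collapse the time dimension with a score-weighted sum of Q.
        scores = tf.nn.softmax(tf.matmul(q, self.c), axis=1)  # (batch, time, 1)
        return tf.reduce_sum(scores * q, axis=1)              # (batch, dim)
```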
S3.3: connecting the calculation result and the first similarity into one long vector.
Specifically, the prediction model in the embodiments of the present invention further includes a merge layer, which splices the u⊙v and |u - v| computed in the step above together with the first similarity into one long vector. Note that the merge layer in the embodiments of the present invention is implemented by the Concatenate function.
S3.4: calculating the second similarity between the texts from the long vector.
Specifically, the second similarity between the texts is finally calculated from the long vector. Note that in the embodiments of the present invention the merge layer is followed by a fully connected layer and then a Softmax layer, which outputs the similarity result; the fully connected layer here is a Dense layer. The scales of the features after the vector merge usually differ, and one non-linear fully connected layer extracts features effectively; using the Softmax output directly, without the extra fully connected layer, is usually not as good. In a traditional shallow neural network with only an input layer and an output layer and no hidden layer, the effect is naturally poor; adding the fully connected layer therefore improves the performance of the algorithm.
The similarity result output by the Softmax layer is a probability (the probability that the two texts are similar). For example, for two texts q1 and q2 it is expressed as a function of the form:
f(q1, q2) → [0, 1]
where 1 indicates that the two texts have the same meaning and 0 indicates that their meanings differ.
Fig. 2 is a schematic diagram of the prediction model according to an exemplary embodiment. Referring to Fig. 2, the model includes:
an input layer (Input layer), for inputting the texts to be compared, the two texts being denoted q1 and q2;
an embedding layer (Embedding layer), for vectorizing the segmented texts; in the embodiments of the present invention the embedding layer of the prediction model is a fused word-vector model;
a non-linear transformation layer (BiLSTM layer), for applying a non-linear transformation to the word vectors of the texts to obtain the transformation results;
a connection layer (Attention layer), for splicing the transformation results, the outputs of the Attention layer being denoted u and v respectively;
a merge layer (Concatenate layer), for splicing u⊙v, |u - v|, and the first similarity into one long vector;
a fully connected layer (Dense layer), for connecting the merge layer and the output layer;
an output layer (Softmax layer), for outputting the similarity result, the output of the Softmax layer being a probability.
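Putting the layers of Fig. 2 together, an end-to-end sketch of the prediction model could look as follows; all sizes (sequence length, vocabulary, dimensions, number of first-similarity features) are illustrative assumptions, and DotAttention is the sketch layer defined above:

```python
# Illustrative assembly of the Fig. 2 prediction model (all sizes assumed).
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB_SIZE, EMB_DIM, N_FIRST_SIM = 30, 50000, 100, 40

q1_in  = layers.Input(shape=(MAX_LEN,), name="q1")
q2_in  = layers.Input(shape=(MAX_LEN,), name="q2")
sim_in = layers.Input(shape=(N_FIRST_SIM,), name="first_similarity")

# Embedding layer: in practice initialized from the fused word-vector model.
embed  = layers.Embedding(VOCAB_SIZE, EMB_DIM)
bilstm = layers.Bidirectional(layers.LSTM(128, return_sequences=True))
attn   = DotAttention()  # shared weights: a siamese encoder for q1 and q2

u = attn(bilstm(embed(q1_in)))
v = attn(bilstm(embed(q2_in)))

uv   = layers.Multiply()([u, v])                             # u ⊙ v
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([u, v])  # |u - v|
long_vector = layers.Concatenate()([uv, diff, sim_in])       # merge layer

hidden = layers.Dense(128, activation="relu")(long_vector)   # fully connected
output = layers.Dense(2, activation="softmax")(hidden)       # similar / not

model = Model([q1_in, q2_in, sim_in], output)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```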
Fig. 3 is a structural schematic diagram of a text similarity calculation device according to an exemplary embodiment. Referring to Fig. 3, the device includes:
a vectorization module, for vectorizing the texts to be compared using a pre-trained word-vector model to obtain the word vectors of the texts;
a calculation module, for calculating a first similarity between the texts;
a prediction module, for obtaining a second similarity between the texts according to a pre-built prediction model, the word vectors of the texts, and the first similarity.
As a preferred embodiment, in the embodiments of the present invention the vectorization module includes:
a training unit, for preprocessing a training corpus and pre-training a word-vector model on the preprocessed corpus;
a segmentation unit, for segmenting each text to be compared into words;
a vectorization unit, for vectorizing the words of each text with the pre-trained word-vector model to obtain the word vectors of the texts.
The vectorization unit is implemented by the fused word-vector model.
As a preferred embodiment, in the embodiments of the present invention the vectorization module further includes:
a labelling unit, for labelling text pairs in the training corpus: pairs with the same meaning are labelled 1, and pairs with different meanings are labelled 0.
As a preferred embodiment, in the embodiments of the present invention the calculation module is specifically used for:
calculating the first similarity between the texts based on their character strings; and/or
calculating the first similarity between the texts based on their word vectors; and/or
calculating the first similarity between the texts based on a bag-of-words model; and/or
calculating the first similarity between the texts based on a Tf-Idf representation; and/or
calculating the first similarity between the texts based on an LSI representation.
As a preferred embodiment, in the embodiments of the present invention the prediction module includes:
a non-linear transformation unit, for applying a non-linear transformation to the word vectors of the texts to obtain transformation results;
the non-linear transformation unit is implemented by the non-linear transformation layer (BiLSTM layer), which consists of bidirectional LSTM units;
a first connection unit, for splicing the transformation results;
the first connection unit is implemented by the connection layer (Attention layer);
a calculation unit, for performing the corresponding calculations on the spliced result to obtain a calculation result;
a second connection unit, for connecting the calculation result and the first similarity into one long vector;
the second connection unit is implemented by the merge layer (Concatenate layer);
a prediction unit, for calculating the second similarity between the texts from the long vector;
the prediction unit is implemented by the output layer (Softmax layer).
In conclusion technical solution provided in an embodiment of the present invention has the benefit that
1, Text similarity computing method and device provided in an embodiment of the present invention, using supervised learning technology, in fusion Text participle, Tf-Idf, LSA, a variety of natural language Feature Extraction Technologies such as LDA, Word2Vec, a variety of texts such as Jaccard, WMD The calculation method of this similarity (distance), improves the accuracy of Text similarity computing;
2, Text similarity computing method and device provided in an embodiment of the present invention, using Model Fusion technology, by depth Study and traditional characteristic study combine, and further improve the accuracy of Text similarity computing.
It should be understood that Text similarity computing device provided by the above embodiment is in triggering similarity calculation business When, only the example of the division of the above functional modules, in practical application, it can according to need and divide above-mentioned function With being completed by different functional modules, i.e., the internal structure of device is divided into different functional modules, to complete above description All or part of function.In addition, Text similarity computing device provided by the above embodiment and Text similarity computing side Method embodiment belongs to same design, i.e. this method is based on the system, and specific implementation process is detailed in embodiment of the method, here It repeats no more.
Those of ordinary skill in the art will appreciate that realizing that all or part of the steps of above-described embodiment can pass through hardware It completes, relevant hardware can also be instructed to complete by program, the program can store in a kind of computer-readable In storage medium, storage medium mentioned above can be read-only memory, disk or CD etc..
The foregoing is merely presently preferred embodiments of the present invention, is not intended to limit the invention, it is all in spirit of the invention and Within principle, any modification, equivalent replacement, improvement and so on be should all be included in the protection scope of the present invention.

Claims (10)

1. A text similarity calculation method, characterized in that the method comprises the following steps:
S1: vectorizing the texts to be compared using a pre-trained word-vector model to obtain the word vectors of the texts;
S2: calculating a first similarity between the texts;
S3: obtaining a second similarity between the texts according to a pre-built prediction model, the word vectors of the texts, and the first similarity.
2. The text similarity calculation method according to claim 1, characterized in that step S1 specifically includes:
S1.1: preprocessing a training corpus and pre-training a word-vector model on the preprocessed corpus;
S1.2: segmenting each text to be compared into words;
S1.3: vectorizing the words of each text with the pre-trained word-vector model to obtain the word vectors of the texts.
3. The text similarity calculation method according to claim 2, characterized in that preprocessing the training corpus includes:
labelling text pairs in the training corpus: pairs with the same meaning are labelled 1, and pairs with different meanings are labelled 0.
4. The text similarity calculation method according to claim 1 or 2, characterized in that step S2 specifically includes:
calculating the first similarity between the texts based on their character strings; and/or
calculating the first similarity between the texts based on their word vectors; and/or
calculating the first similarity between the texts based on a bag-of-words model; and/or
calculating the first similarity between the texts based on a Tf-Idf representation; and/or
calculating the first similarity between the texts based on an LSI representation.
5. The text similarity calculation method according to claim 1 or 2, characterized in that step S3 specifically includes:
S3.1: applying a non-linear transformation to the word vectors of the texts to obtain transformation results;
S3.2: splicing the transformation results and performing the corresponding calculations on the spliced result to obtain a calculation result;
S3.3: connecting the calculation result and the first similarity into one long vector;
S3.4: calculating the second similarity between the texts from the long vector.
6. A text similarity calculation device, characterized in that the device comprises:
a vectorization module, for vectorizing the texts to be compared using a pre-trained word-vector model to obtain the word vectors of the texts;
a calculation module, for calculating a first similarity between the texts;
a prediction module, for obtaining a second similarity between the texts according to a pre-built prediction model, the word vectors of the texts, and the first similarity.
7. The text similarity calculation device according to claim 6, characterized in that the vectorization module includes:
a training unit, for preprocessing a training corpus and pre-training a word-vector model on the preprocessed corpus;
a segmentation unit, for segmenting each text to be compared into words;
a vectorization unit, for vectorizing the words of each text with the pre-trained word-vector model to obtain the word vectors of the texts.
8. The text similarity calculation device according to claim 7, characterized in that the vectorization module further includes:
a labelling unit, for labelling text pairs in the training corpus: pairs with the same meaning are labelled 1, and pairs with different meanings are labelled 0.
9. The text similarity calculation device according to claim 6 or 7, characterized in that the calculation module is specifically used for:
calculating the first similarity between the texts based on their character strings; and/or
calculating the first similarity between the texts based on their word vectors; and/or
calculating the first similarity between the texts based on a bag-of-words model; and/or
calculating the first similarity between the texts based on a Tf-Idf representation; and/or
calculating the first similarity between the texts based on an LSI representation.
10. The text similarity calculation device according to claim 6 or 7, characterized in that the prediction module includes:
a non-linear transformation unit, for applying a non-linear transformation to the word vectors of the texts to obtain transformation results;
a first connection unit, for splicing the transformation results;
a calculation unit, for performing the corresponding calculations on the spliced result to obtain a calculation result;
a second connection unit, for connecting the calculation result and the first similarity into one long vector;
a prediction unit, for calculating the second similarity between the texts from the long vector.
CN201910191756.1A 2019-03-13 2019-03-13 Text similarity calculation method and device Pending CN109992772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910191756.1A CN109992772A (en) Text similarity calculation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910191756.1A CN109992772A (en) Text similarity calculation method and device

Publications (1)

Publication Number Publication Date
CN109992772A true CN109992772A (en) 2019-07-09

Family

ID=67130655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910191756.1A Pending CN109992772A (en) Text similarity calculation method and device

Country Status (1)

Country Link
CN (1) CN109992772A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291699A (en) * 2017-07-04 2017-10-24 湖南星汉数智科技有限公司 A kind of sentence semantic similarity computational methods
CN107729300A (en) * 2017-09-18 2018-02-23 百度在线网络技术(北京)有限公司 Processing method, device, equipment and the computer-readable storage medium of text similarity
CN109344399A (en) * 2018-09-14 2019-02-15 重庆邂智科技有限公司 A kind of Text similarity computing method based on the two-way lstm neural network of stacking
CN109460549A (en) * 2018-10-12 2019-03-12 北京奔影网络科技有限公司 The processing method and processing device of semantic vector

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489549A (en) * 2019-07-16 2019-11-22 北京大米科技有限公司 Teaching transcription comparison method, device, electronic equipment and medium
CN111639661A (en) * 2019-08-29 2020-09-08 上海卓繁信息技术股份有限公司 Text similarity discrimination method
CN111191464A (en) * 2020-01-17 2020-05-22 珠海横琴极盛科技有限公司 Semantic similarity calculation method based on combined distance
CN113743077A (en) * 2020-08-14 2021-12-03 北京京东振世信息技术有限公司 Method and device for determining text similarity
CN113743077B (en) * 2020-08-14 2023-09-29 北京京东振世信息技术有限公司 Method and device for determining text similarity
CN112085091A (en) * 2020-09-07 2020-12-15 中国平安财产保险股份有限公司 Artificial intelligence-based short text matching method, device, equipment and storage medium
CN112085091B (en) * 2020-09-07 2024-04-26 中国平安财产保险股份有限公司 Short text matching method, device, equipment and storage medium based on artificial intelligence
CN112185573A (en) * 2020-09-25 2021-01-05 志诺维思(北京)基因科技有限公司 LCS and TF-IDF based similar character string determination method and device
CN112185573B (en) * 2020-09-25 2023-11-03 志诺维思(北京)基因科技有限公司 Similar character string determining method and device based on LCS and TF-IDF
CN112308464A (en) * 2020-11-24 2021-02-02 中国人民公安大学 Business process data processing method and device
CN112308464B (en) * 2020-11-24 2023-11-24 中国人民公安大学 Business process data processing method and device
CN115953130A (en) * 2023-01-05 2023-04-11 深圳市坂云科技有限公司 Intelligent analysis processing system for customs declaration data
CN115953130B (en) * 2023-01-05 2023-08-11 深圳市坂云科技有限公司 Intelligent analysis processing system for gateway declaration data
CN116010603A (en) * 2023-01-31 2023-04-25 浙江中电远为科技有限公司 Feature clustering dimension reduction method for commercial text classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190709