CN107967255A

CN107967255A - Method and system for judging text similarity

Info

Publication number: CN107967255A
Application number: CN201711088831.9A
Authority: CN
Inventors: 冯素梅; 江国进; 孙永滨; 白涛; 杜乔瑞; 王晓燕; 张亚栋; 徐先柱
Original assignee: China General Nuclear Power Corp; China Techenergy Co Ltd
Current assignee: China General Nuclear Power Corp; China Techenergy Co Ltd
Priority date: 2017-11-08
Filing date: 2017-11-08
Publication date: 2018-04-27

Abstract

The invention belongs to the technical field of text classification. To overcome the respective shortcomings of three prior-art text similarity algorithms, it provides a method and system for judging text similarity. The method comprises: S1, building a vector space model so that text is quantized into a processable object; S2, building a text semantic similarity extraction model with a Siamese network, in which the semantic feature extraction network and the similarity discrimination network are connected in series and optimized jointly in the sample training stage; S3, based on the semantic feature representations of the training samples, constructing a text similarity function based on the cosine of the angle between feature vectors, together with the final loss function; S4, inputting two texts to be tested, extracting their semantic features with the Siamese network, computing the cosine-angle distance between the two vectors, and setting a threshold: when the cosine-angle distance exceeds the threshold the texts are judged similar, otherwise dissimilar.

Description

Method and system for judging text similarity

Technical field

The present invention relates to the technical field of text classification, and in particular to the verification and validation of nuclear safety-class software; more specifically, it relates to a method and system for judging text similarity.

Background art

During the verification and validation (V&V) of nuclear safety-class software, documents must be evaluated, traceability analyzed, hazards analyzed, and so on. As the volume of technical documentation keeps growing, repeating these activities at every stage of every project requires a great deal of manpower. V&V personnel therefore need to solve three problems: automatically identifying the items to be assessed during document evaluation, automatically judging the semantic relevance of upper-level and lower-level documents during traceability analysis, and automatically matching the failure modes of similar products during hazard analysis.

The methods currently used to judge text similarity are mainly cosine similarity, the SimHash algorithm, and latent semantic indexing (LSI). The cosine similarity method computes the cosine after preprocessing, selecting text feature items, weighting them, and generating a vector space model. SimHash is the text similarity method Google uses to process massive numbers of web pages; its main purpose is dimensionality reduction: a high-dimensional feature vector is mapped to an f-bit fingerprint, and the Hamming distance between two document fingerprints characterizes document duplication or similarity. Latent semantic indexing (LSI) uses singular value decomposition (SVD) from matrix theory: the term-frequency matrix is decomposed, the smaller singular values are discarded, and the resulting singular vectors and singular value matrix are used to map document vectors and query vectors into a subspace in which the semantic relations of the document matrix are preserved. The cosine similarity of the angle between vectors can then be computed from the normalized inner product, and the similarity between texts is compared on that basis. LSI thus removes the features that have little effect on document similarity; the features that remain are those that strongly influence the position of a document vector in m-dimensional space.
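To make the SimHash idea above concrete, the following is a minimal sketch, not Google's implementation. It assumes unweighted tokens, an MD5-derived 64-bit token hash, and the hypothetical helper names `simhash` and `hamming`.

```python
import hashlib

def simhash(tokens, bits=64):
    """Map a list of tokens to a `bits`-bit fingerprint (bits <= 64 here)."""
    counts = [0] * bits
    for tok in tokens:
        # Assumption: the first 8 bytes of MD5 serve as the per-token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

A small Hamming distance between two fingerprints suggests near-duplicate documents. With very short texts only a few tokens vote on each bit, which is consistent with the weakness on short texts noted in the next paragraph.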

However, in the course of making the present invention the inventors found the following. 1. The cosine similarity algorithm is mostly applied to merging web page titles and clustering titles, where its results are accurate; but it considers only the statistical properties of words in context and assumes that keywords are linearly independent, ignoring the semantics of the words themselves. It therefore cannot handle natural-language problems in text such as synonyms and polysemous words well, and has certain limitations. 2. SimHash is fast and well suited to similarity judgments over massive text collections; but because a short text provides little data for the hash computation, its recognition rate for short-text similarity is low. 3. Latent semantic indexing is more reliable than similarity measures based on the original text vectors; but for massive text data the singular value decomposition is computationally difficult, and an overly sparse corpus cannot reflect its latent semantics well.

Summary of the invention

To overcome the respective shortcomings of the three prior-art text similarity algorithms, the present invention provides a method and system for judging text similarity that can handle texts of many categories with a high recognition rate.

To achieve the above object, the technical solution provided by the invention includes:

In one aspect, the present invention provides a method for judging text similarity, characterized by comprising:

S1, building a vector space model so that text is quantized into a processable object;

S2, building a text semantic similarity extraction model with a Siamese network, in which the semantic feature extraction network and the similarity discrimination network are connected in series and optimized jointly in the sample training stage;

S3, based on the semantic feature representations of the training samples, constructing a text similarity function based on the cosine of the angle between feature vectors, together with the final loss function, so that the feature vectors of similar sample pairs are mapped to a region of the space;

S4, inputting two texts to be tested, extracting their semantic features with the Siamese network, computing the cosine-angle distance between the two vectors, and setting a threshold: when the cosine-angle distance exceeds the threshold the texts are judged similar, otherwise dissimilar.

Preferably, in step S2, the Siamese network may have two or more parallel paths, so that two or more text features can be input at the same time. The Siamese network extracts features through the nonlinear mapping of each path, and the features of the multiple texts can then be combined after the feature layer for the semantic relevance judgment.

Preferably, in step S2, the Siamese network is divided into a front half and a rear half. The front half extracts text semantic features and consists of fully connected layers. The rear half performs similarity measurement by integrating the features extracted by the front half: it either computes the distance between the features of a pair of samples extracted by different branches, or directly connects the corresponding elements with weights. In step S3 the similarity between samples is then discriminated by the cosine-angle metric function.

Preferably, the front half, which extracts text semantic features, consists of fully connected layers and forms a network structure of 5 hidden layers covering word, phrase, sentence, paragraph, article and semantics.

Preferably, the included-angle cosine in S3 and S4 is computed with a ternary metric function. The triplet samples $(x, x^+, x^-)$ share the same network parameters $W$ during feature extraction, giving the semantic feature representations of the three samples, denoted $G_w(x)$, $G_w(x^+)$ and $G_w(x^-)$. The included-angle-cosine text similarity function is: $D(x_i, x_j) = \cos(x_i, x_j) = \frac{x_i \cdot x_j}{|x_i|\,|x_j|}$.

Further preferably, the loss function corresponding to the included-angle-cosine text similarity function is:

where $\alpha$ is the minimum margin between the distance from $x$ to $x^+$ and the distance from $x$ to $x^-$, and is a preset fixed parameter.

Preferably, the included-angle cosine in S3 and S4 is computed with a ternary metric function. The triplet samples $(x', x, x')$ have different network parameters $W$ during feature extraction, giving the semantic feature representations of the three samples, denoted $G_{w1}(x')$, $G_{w2}(x)$ and $G_{w3}(x')$. When $D(G_{w1}(x'), G_{w2}(x)) - D(G_{w3}(x'), G_{w2}(x)) > \alpha$, the texts are judged similar, otherwise dissimilar; where $\alpha$ is the minimum margin between the distance from $x$ to $x^+$ and the distance from $x$ to $x^-$, and is a preset fixed parameter.

Further preferably, the sample $x$ is selected at random from the training data set, the sample $x^+$ belongs to the same class as $x$, and the sample $x^-$ belongs to a different class from $x$. The training principle is: the angle between the feature vectors of $x$ and $x^+$ should be as small as possible, and the angle between the semantic feature vectors of $x$ and $x^-$ as large as possible.

Preferably, in step S2 a pair of texts is selected as input, denoted $(x_i, x_j)$; the paragraph titles and the body of each text are split into two parts, and the bodies of the two texts and the titles of the two texts are respectively combined as inputs.

In another aspect, the present invention also provides a system for judging text similarity, characterized by comprising a controller, the controller being used to load and execute a program corresponding to any one of the above methods for judging text similarity.

By adopting the above technical solution, at least one of the following beneficial effects can be obtained:

1. The neural network used is a serial stack of linear transformations and simple nonlinear functions, and classical stochastic gradient descent is used in the training stage, so there is no computational difficulty in either training or testing.

2. Text is not processed merely at the level of words; document similarity is judged at the semantic level, which handles natural-language problems in text well and gives more accurate results.

3. The similarity measurement model trained with the Siamese network makes the distance between similar texts smaller and the distance between dissimilar texts larger. It places no requirements on the category, number or length of texts, so it handles well those decision problems with many categories and few samples in some categories.

4. Because the Siamese network is combined with a ternary loss function, the goal in the network training stage is to make the difference between features extracted from related texts as small as possible and the difference between features extracted from unrelated texts as large as possible. Since the training samples contain many data pairs, after training the network has learned to judge whether texts are related. Because this ability is built on many varied data pairs, it generalizes to categories that did not appear in training; this is the generalization ability of the neural network. Thus even for categories absent from the samples, the technical solution provided by this application can still judge the similarity of two texts.

5. The ternary loss function maps similar samples to a region of the space rather than to a single point, which reduces the complexity of the problem and substantially improves the generalization ability of the algorithm.

Other features and advantages of the invention will be set forth in the following description, and in part will become apparent from the description or be understood by implementing the technical solution of the invention. The objects and other advantages of the invention can be realized and obtained by the structures and/or processes particularly pointed out in the description, the claims and the drawings.

Brief description of the drawings

Fig. 1 is a flowchart of a method for judging text similarity provided by one embodiment of the invention.

Fig. 2 is a schematic diagram of a text similarity training network structure provided by one embodiment of the invention.

Fig. 3 is a schematic diagram of a fully connected layer structure provided by one embodiment of the invention.

Fig. 4 is a schematic diagram of a similarity prediction model structure provided by one embodiment of the invention.

Fig. 5 is a schematic structural block diagram of a system for judging text similarity provided by one embodiment of the invention.

Fig. 6 is a similarity measurement model structure provided by another embodiment of the invention.

Fig. 7 is a similarity measurement model structure provided by yet another embodiment of the invention.

Detailed description of the embodiments

The embodiments of the present invention are described in detail below with reference to the drawings and examples, so that how the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented. It should be noted that these specific descriptions are intended to help those of ordinary skill in the art understand the invention more easily and clearly, and do not limit it; provided there is no conflict, the embodiments and the features of the embodiments may be combined with one another, and the resulting technical solutions all fall within the protection scope of the invention.

In addition, the steps shown in the flowcharts of the drawings may be executed in a control system such as a set of controller-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

The technical solution of the invention is described in detail below with reference to the drawings and specific embodiments:

Embodiment one

As shown in Fig. 1, this embodiment provides a method for judging text similarity based on a Siamese network. First, a VSM model is built from the text data; this involves preprocessing, word segmentation and stop-word removal, and quantizes the text into a processable feature vector. Next, a Siamese network (also called a twin network) is built to extract semantic similarity features from sample pairs of feature vectors. Finally, a triplet loss function based on the cosine of the angle between high-dimensional vectors is constructed to discriminate the relevance of text pairs. The method specifically includes:

S1, building a vector space model so that text is quantized into a processable object. Preferably, text processing uses the vector space model (VSM) method, comprising: 1. Text preprocessing: garbled characters and non-text content are handled, the text is segmented into words, and stop words are removed; according to a stop-word list, words, symbols, punctuation and garbled characters that occur very frequently in the corpus but contribute little to identifying the text content are removed. 2. Feature vector computation: after high-frequency words such as common adverbs and particles are filtered out, a number of feature items are determined from the frequencies of the remaining words. Preferably, the TF-IDF (term frequency-inverse document frequency) method is used to compute the weights of the text feature items, which are then normalized to obtain the original text feature vector.
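The preprocessing and TF-IDF weighting described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the stop-word list, whitespace tokenization (Chinese text would need a real word segmenter), the smoothed IDF variant, and the names `tokenize` and `build_vsm` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would load a curated vocabulary.
STOP_WORDS = {"the", "a", "of", "and", "is"}

def tokenize(text):
    """Lower-case, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_vsm(documents):
    """Return (vocabulary, list of L2-normalised TF-IDF vectors)."""
    docs = [tokenize(d) for d in documents]
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1
        # Smoothed IDF (an assumption) keeps every weight finite and positive.
        vec = [(tf[w] / total) * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
               for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors
```

Each returned vector has one component per vocabulary word, which is the m-dimensional document representation that the later network takes as input.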

The basic idea of the vector space model is to reduce a document to an m-dimensional vector whose components are the weights of its feature items, that is, to convert the text into quantized features described by mathematical symbols. The quantization assumes that words are linearly independent of one another, so the model cannot judge semantic relevance directly; instead, the subsequent Siamese network performs further semantic feature extraction on the quantized features of the original text in order to judge the semantic similarity of texts.

S2, building a text semantic similarity extraction model with a Siamese network, in which the semantic feature extraction network and the similarity discrimination network are connected in series and optimized jointly in the sample training stage;

To judge the semantic similarity of two texts, after the quantized vector of each text is obtained, the similarity features of the sample pair must be extracted. This embodiment uses a Siamese network to perform further semantic feature extraction on the quantized features of the original text; the semantic feature extraction network and the similarity discrimination network are connected in series and optimized jointly in the sample training stage;

S3, based on the semantic feature representations of the training samples, constructing a text similarity function based on the cosine of the angle between feature vectors, together with the final loss function;

S4, inputting two texts to be tested, extracting their semantic features with the Siamese network, computing the cosine-angle distance between the two vectors, and setting a threshold: when the cosine-angle distance exceeds the threshold the texts are judged similar, otherwise dissimilar.

With reference to the training-stage diagram in Fig. 2 (three samples are input at once, two related and one unrelated, to train the network) and the test-stage diagram in Fig. 4 (only two samples need to be input to judge whether they are related), the technique of this embodiment is further explained:

S2 corresponds to the structure of the network and the physical meaning assigned to it. The network parameters are $W$; the front part of the network extracts semantic features and the rear part extracts similarity features. S3 corresponds to the implementation of the loss function described below, i.e. $D$ (further explained below with reference to Fig. 2). A neural network is linked by weight coefficients between nodes, and the purpose of training is to optimize these coefficients so that, guided by the training data, the network achieves the required function. Once the network is trained the coefficients are fixed; at test time the input is transformed by these coefficients to obtain a result, from which similarity is judged.

Preferably, in step S2, the Siamese network may have two or more parallel paths, so that two or more text features can be input at the same time. The Siamese network extracts features through the nonlinear mapping of each path, and the features of the multiple texts can then be combined after the feature layer for the semantic relevance judgment. So that those skilled in the art can understand the technical solution of this embodiment more clearly, the structure of the Siamese network is illustrated below with three channels, with reference to Fig. 2.

Further preferably, in step S2, the Siamese network is divided into a front half and a rear half. The front half extracts text semantic features and consists of fully connected layers. The rear half performs similarity measurement by integrating the features extracted by the front half and computing the distance between the features of a pair of samples extracted by different branches. In step S3 the similarity between samples is then discriminated by the cosine-angle metric function.

Preferably, the front half, which extracts text semantic features, consists of fully connected layers and forms a network structure of 5 hidden layers covering word, phrase, sentence, paragraph, article and semantics.

Preferably, the included-angle cosine in S3 and S4 is computed with a ternary metric function. The triplet samples $(x, x^+, x^-)$ share the same network parameters $W$ during feature extraction, giving the semantic feature representations of the three samples, denoted $G_w(x)$, $G_w(x^+)$ and $G_w(x^-)$. The included-angle-cosine text similarity function is: $D(x_i, x_j) = \cos(x_i, x_j) = \frac{x_i \cdot x_j}{|x_i|\,|x_j|}$.

Since a triplet consists of two related samples and one unrelated sample, the distance between the features of related samples and the distance between the features of unrelated samples are considered together; that is, both the intra-class distance and the inter-class distance are taken into account. The resulting classifier is more stable and generalizes better. Traditional methods consider only the distance between related samples and are prone to errors when judging the relevance of easily confused samples.

Preferably, the loss function corresponding to the included-angle-cosine text similarity function is:

Preferably, the sample $x$ is selected at random from the training data set, the sample $x^+$ belongs to the same class as $x$, and the sample $x^-$ belongs to a different class from $x$. The training principle is: the angle between the feature vectors of $x$ and $x^+$ should be as small as possible, and the angle between the semantic feature vectors of $x$ and $x^-$ as large as possible.

A more specific preferred solution: in step S2 above, use is made of a characteristic of the Siamese network structure (shown in Fig. 2), namely that the model may have two or more parallel paths. Two or more text features can therefore be input to the network at the same time, features are extracted through the nonlinear mapping of each path, and the features of the multiple texts can be combined after the feature layer for the semantic relevance judgment. This embodiment therefore introduces a Siamese network to extract the semantic similarity features of text pairs.

In the training stage the network has three branches (shown in Fig. 2) with the same network structure. To extract common features and facilitate the similarity judgment, the branches share weights, i.e. the three branches have the same network parameters $W$. The Siamese network is divided into a front half and a rear half. The front half extracts text semantic features and consists of fully connected layers with the ReLU activation function, $f(x) = \max(0, x)$. To extract the semantics of the text better, the front half follows the feature hierarchy of text itself, from shallow to deep: word, phrase, sentence, paragraph, article and semantics. It is built as a network of 5 hidden layers (Fig. 3 shows only two of them; the others are similar), each of which is a fully connected layer, so that text semantics are extracted while the vector dimension is reduced.
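A weight-shared feature extractor of this kind can be sketched in plain Python as below. The layer widths, the random initialization, and the names `init_layers`, `extract` and `forward_triplet` are assumptions made for illustration; only the structure (5 fully connected hidden layers, ReLU, one parameter set for all three branches) comes from the description.

```python
import random

def relu(values):
    """ReLU activation, f(x) = max(0, x), applied element-wise."""
    return [max(0.0, v) for v in values]

def init_layers(sizes, seed=0):
    """Create one (weights, biases) pair per consecutive pair of layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def extract(x, layers):
    """Forward pass through the fully connected layers, shrinking the dimension."""
    h = x
    for weights, biases in layers:
        h = relu([sum(w * v for w, v in zip(row, h)) + b
                  for row, b in zip(weights, biases)])
    return h

# Input width followed by 5 hidden layers; the widths are hypothetical.
LAYER_SIZES = [16, 12, 10, 8, 6, 4]
shared = init_layers(LAYER_SIZES)

def forward_triplet(anchor, positive, negative, layers=shared):
    """All three branches use the same parameters, i.e. the weights are shared."""
    return tuple(extract(v, layers) for v in (anchor, positive, negative))
```

Because every branch calls `extract` with the same `layers` object, identical inputs always yield identical features, which is what makes the later distance comparison meaningful.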

The rear half performs similarity measurement: it integrates the features extracted by the front half and discriminates the similarity between samples with a metric function, whose choice is elaborated below. Training samples are organized as follows. A sample is first selected at random from the training data set as the anchor, denoted $x$; then one sample of the same class and one sample of a different class are selected at random, called the positive (denoted $x^+$) and the negative (denoted $x^-$) respectively. These form one training triplet $(x, x^+, x^-)$, which is input to the Siamese network; the network parameters are optimized by the back-propagation (BP) algorithm so that the network output gradually approaches the true value, until the network converges.
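The triplet organization described above can be sketched as follows, assuming the training set is a list of `(sample, label)` pairs; the function name `sample_triplet` and that data layout are hypothetical.

```python
import random

def sample_triplet(dataset, rng=random):
    """Pick a random anchor, then a same-class positive and an other-class negative.

    dataset: list of (sample, label) pairs. Returns (anchor, positive, negative).
    """
    anchor_idx = rng.randrange(len(dataset))
    anchor, label = dataset[anchor_idx]
    # Positive candidates share the anchor's label but are not the anchor itself.
    positives = [s for i, (s, l) in enumerate(dataset)
                 if l == label and i != anchor_idx]
    negatives = [s for s, l in dataset if l != label]
    if not positives or not negatives:
        raise ValueError("anchor needs a same-class and an other-class sample")
    return anchor, rng.choice(positives), rng.choice(negatives)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is convenient when comparing training runs.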

In the above Siamese network structure, the semantic feature extraction network and the similarity discrimination network are connected in series and optimized simultaneously, forming an end-to-end network. The feature extraction part is influenced by the discrimination part, so that the extracted features help discriminate the similarity of samples. Compared with optimizing the parts separately, the end-to-end method optimizes globally and works better.

A more specific preferred solution: in step S3 above, feature extraction on the triplet yields the semantic feature representations of the three samples, denoted $G_w(x)$, $G_w(x^+)$ and $G_w(x^-)$. In the discrimination part of the network, a distance metric function between feature vectors is constructed to judge the similarity of samples. Because the extracted feature vectors are high-dimensional, traditional measures such as Euclidean distance do not reflect the distance between them well; the invention therefore constructs a text similarity measure based on the cosine of the angle between feature vectors:

$$D(x_i, x_j) = \cos(x_i, x_j) = \frac{x_i \cdot x_j}{|x_i|\,|x_j|} \qquad \text{(Formula 1)}$$
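Formula 1 translates directly into code. The sketch below uses only the standard library; returning 0.0 for a zero vector is an assumption, since the source does not say how that case is handled.

```python
import math

def cosine_similarity(u, v):
    """D(u, v) = (u . v) / (|u| |v|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_u * norm_v)
```

The value depends only on the direction of the vectors and not on their length, which is why it stays informative for high-dimensional features where Euclidean distances tend to be less discriminative.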

A binary loss function could instead be used to measure the similarity of a sample pair, but its effect is to map all similar samples to a single point in feature space. This requires the feature extraction function to respond identically to all similar samples, which is hard for the network to learn. This embodiment therefore uses a ternary loss function for the discrimination task. After the feature vectors of the triplet are obtained, the angle between the feature vectors of $x$ and $x^+$ is required to be as small as possible (i.e. the value of $D(G_w(x), G_w(x^+))$ as large as possible), the angle between the semantic feature vectors of $x$ and $x^-$ as large as possible (i.e. $D(G_w(x), G_w(x^-))$ as small as possible), and the distance between $x$ and $x^+$ and the distance between $x$ and $x^-$ are required to differ by at least a minimum margin $\alpha$, where $\alpha$ is a given hyperparameter. The final loss function is:

Here $\alpha$ is a small number given at the initial stage, such as 0.1; it is usually determined by the scale of the chosen distance metric. For the cosine used here the range is [0, 1], and the parameter is set to 0.1.
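The loss formula itself is not reproduced in this text. One common hinge form that matches the stated requirements (similarity to the positive should exceed similarity to the negative by at least $\alpha$) is $L = \max\bigl(0,\ \alpha - D(G_w(x), G_w(x^+)) + D(G_w(x), G_w(x^-))\bigr)$. This form is an assumption, not the patent's stated formula, and the function name `triplet_loss` below is hypothetical.

```python
def triplet_loss(sim_pos, sim_neg, alpha=0.1):
    """Hinge-style triplet loss on cosine similarities (assumed form).

    sim_pos: cosine similarity between the anchor and positive features.
    sim_neg: cosine similarity between the anchor and negative features.
    alpha:   minimum margin required between the two similarities.
    """
    # Zero loss once the positive is at least `alpha` more similar than the negative.
    return max(0.0, alpha - (sim_pos - sim_neg))
```

Under this form a triplet that already satisfies the margin contributes nothing to the gradient, so training effort concentrates on triplets that are still confused.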

Unlike a binary loss function, the ternary loss function aims to map similar samples to a region of the space so that the intra-class distance is smaller than the inter-class distance, with the boundary between classes controlled by the fixed parameter $\alpha$. Mapping the feature vectors of same-class sample pairs to a region rather than a single point greatly simplifies the problem without affecting the similarity judgment itself, so the effectiveness and generalization ability of the algorithm are greatly improved.

A more specific preferred solution: in step S4 above, prediction needs only two paths of the network (because all paths share weights, any two may be chosen). The quantized vectors of the two texts to be tested are first extracted as in step S1 and denoted input1 and input2. They are then input to the Siamese network model, whose structure is shown in Fig. 4. After the Siamese network extracts the semantic features of the sample pair, they are passed to the discrimination network, which computes the cosine-angle distance between the two vectors. A threshold $\zeta$ is set: when the computed value exceeds $\zeta$ the texts are judged similar, otherwise dissimilar.

The threshold $\zeta$ is given in advance. The distance output by the network can be regarded as the probability that the text pair is similar; $\zeta$ specifies the probability above which two texts are regarded as similar, and depends on the level of similarity the specific task demands: the stricter the demand, the larger $\zeta$ can be set.
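The decision step can be sketched as below, taking two already extracted feature vectors as input. The default threshold of 0.8 and the name `judge` are hypothetical; the source only says that $\zeta$ is chosen in advance according to the task.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two feature vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def judge(feat1, feat2, zeta=0.8):
    """Return (is_similar, score); similar when the score exceeds the threshold."""
    score = cosine(feat1, feat2)
    return score > zeta, score
```

Raising `zeta` makes the judgment stricter: a pair scoring about 0.71 counts as similar with `zeta=0.5` but not with the default of 0.8.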

As shown in Fig. 5, this embodiment also provides a system for judging text similarity, comprising a controller used to load and execute a program corresponding to any one of the above methods for judging text similarity.

Embodiment two

On the basis of embodiment one, and to make the trained model more flexible, this embodiment allows the three branches of the Siamese network to have different weights and different numbers of layers, i.e. the three functions are independent of one another; they are related only in the final distance computation. Content not repeated here is the same as in embodiment one. Specifically, as shown in Fig. 6, preferably, the included-angle cosine in S3 and S4 of Fig. 1 is computed with a ternary metric function; the triplet samples $(x', x, x')$ have different network parameters $W$ during feature extraction, giving the semantic feature representations of the three samples, denoted respectively

$G_{w1}(x')$, $G_{w2}(x)$, $G_{w3}(x')$;

When $D(G_{w1}(x'), G_{w2}(x)) - D(G_{w3}(x'), G_{w2}(x)) > \alpha$, the texts are judged similar, otherwise dissimilar; where $\alpha$ is the minimum margin between the distance from $x$ to $x^+$ and the distance from $x$ to $x^-$, and is a preset fixed parameter. For the value of $\alpha$, see the description in embodiment one; it is not repeated here.

Correspondingly, this embodiment also provides a system for judging text similarity, comprising a controller used to load and execute a program corresponding to the Siamese network structure in Fig. 6.

Embodiment three

As the technological document of software development, in particular for the software of nuclear power field, its file edit will meet standard Specification, title have the generality and similitude of height, if when vector space model is established to text, taking will be whole A text is uniformly processed, and can greatly lose the important information that title is brought to classification.Therefore, during training and test The contribution for answering appropriate consideration paragraph heading to measure text similarity.As shown in fig. 7, the present embodiment is preferably, in embodiment A pair of of text is selected to be denoted as (x as input in the one or correspondence step of embodiment two S2_i,x_j)；By the paragraph heading of text with And text is divided into two parts, meanwhile, the text of two texts and title are incorporated as inputting respectively.

Specifically, a pair of texts is first chosen as input and denoted (x_i, x_j); the pair may be similar or dissimilar. Each text is divided into its paragraph titles and its body text, and the body texts and the titles of the two texts are respectively merged as inputs. The structure of the training model is shown in Fig. 7, in which the decision-layer network uses a fully connected layer structure and the activation function is the Sigmoid function,

that is, σ(x) = 1/(1 + exp(-x)); A ∈ (0,1),

X_Text = (x_i_Text, x_j_Text), X_Title = (x_i_Title, x_j_Title).

During testing, the same model is used and a threshold is set; when the output is greater than the threshold, the texts are judged similar, and otherwise dissimilar. For the threshold, refer to the description of ζ in embodiment one; the details are not repeated here.

However, in the similarity judgment method of this embodiment, step S2 no longer "calculates the distance between the features of a pair of samples after extraction by the different branches"; instead, the corresponding elements are directly connected with weights.
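For illustration only, the weighted connection followed by a Sigmoid decision layer may be sketched in Python as follows. The exact layout of Fig. 7 is not reproduced here: the function names, the use of a single fully connected output unit, the simple concatenation of body-text and title features, and the example weights are all assumptions of this sketch.

```python
import math


def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)), evaluated in an overflow-safe form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decision_layer(text_feats, title_feats, weights, bias):
    """Fully connected decision unit over the joined body and title features.

    text_feats:  (f_i_text, f_j_text), features of the two body texts
    title_feats: (f_i_title, f_j_title), features of the two titles
    weights:     one hypothetical weight per joined feature element
    """
    joined = [*text_feats[0], *text_feats[1], *title_feats[0], *title_feats[1]]
    if len(joined) != len(weights):
        raise ValueError("weights must match the joined feature length")
    return sigmoid(sum(w * v for w, v in zip(weights, joined)) + bias)


def judge(score, threshold):
    """Judge the pair similar when the output is greater than the threshold."""
    return score > threshold


# Hypothetical features already extracted by the first half of the network.
x_text = ([0.2, 0.4], [0.1, 0.3])
x_title = ([1.0], [0.9])
score = decision_layer(x_text, x_title, weights=[0.5] * 6, bias=-1.0)
print(judge(score, threshold=0.5))
```

Giving the title elements their own weights lets the decision layer learn how much the titles should contribute relative to the body text, which is the motivation stated for this embodiment.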

Correspondingly, this embodiment also provides a system for judging text similarity, the system comprising a controller configured to load and execute a program corresponding to the Siamese network structure in Fig. 7.

1. The neural network used is a series composition of linear transformations and simple nonlinear functions, and the classical stochastic gradient descent method is used in the training stage; therefore, neither the training process nor testing presents any computational difficulty.

4. Because a Siamese network is combined with a ternary loss function, the goal in the network training stage of the similarity judgment of this technical solution is to make the difference between features extracted from related texts as small as possible and the difference between features extracted from unrelated texts as large as possible. Because the training samples contain many data pairs, the trained network has learned the ability to judge whether texts are related. Since this ability is built on a large number of varied data pairs, it can be generalized to categories that have not appeared; this is the generalization ability of the neural network, and this property results from it. Accordingly, even for categories absent from the samples, the technical solution provided by the application can still determine the similarity of two texts.

Finally, it should be noted that the above are only preferred embodiments of the present invention and do not limit the present invention in any form. Any person skilled in the art may, without departing from the scope of the present invention, use the methods and technical content disclosed above to make many possible variations and simple substitutions to the technical solution of the present invention, all of which fall within the scope of protection of the technical solution of the present invention.

Claims

1. A method for judging text similarity, characterized by comprising:

S1, constructing a vector space model so that text is quantized into a processable object;

S2, constructing a text semantic similarity extraction model using a Siamese network, wherein, in the Siamese network, the semantic feature extraction network and the similarity discrimination network are connected in series and optimized together in the sample training stage;

S3, based on the semantic feature representations of the training-stage samples, constructing a text similarity calculation function based on the included-angle cosine of the feature vectors, and a final loss function, so that the feature vectors of similar sample pairs are mapped to a certain region of the space;

S4, inputting two texts to be tested, performing semantic feature extraction on the texts to be tested based on the Siamese network, calculating the cosine angle distance of the two vectors, and setting a threshold; when the cosine angle distance of the two vectors is greater than the threshold, the texts are determined to be similar, and otherwise dissimilar.
2. The method according to claim 1, characterized in that, in step S2, the structure of the Siamese network may have two or more parallel paths, so that two or more text features can be input at the same time; the Siamese network extracts features through the nonlinear mapping of each path, and the features of the multiple texts can then be combined after the feature layer for semantic relevance judgment.
3. The method according to claim 2, characterized in that, in step S2, the Siamese network is divided into a first half and a second half; the first half is used for extracting text semantic features and consists of fully connected layers; the second half is used for similarity measurement and integrates the features extracted by the first half of the network, either by calculating the distance between the features of a pair of samples after extraction by the different branches or by directly connecting the corresponding elements with weights; the similarity of the samples is then determined by the cosine angle metric function in step S3.
4. The method according to claim 2 or 3, characterized in that the first half is used for extracting text semantic features, consists of fully connected layers, and covers words, phrases, sentences, paragraphs, articles and semantics, forming a network structure with 5 hidden layers.
5. The method according to claim 1, characterized in that the included-angle cosine in S3 and S4 is calculated with a ternary metric function; the triplet sample (x, x⁺, x⁻) passes through feature extraction with the same network parameters W, yielding the semantic feature representations of the three samples, denoted respectively G_w(x), G_w(x⁺), G_w(x⁻); the text similarity calculation function of the included-angle cosine is: D(x_i, x_j) = cos(x_i, x_j) = (x_i · x_j) / (|x_i| |x_j|).
6. The method according to claim 5, characterized in that the loss function corresponding to the text similarity calculation function of the included-angle cosine is:

l(x_i, x_i⁺, x_i⁻) = max{0, α - D(G_w(x_i), G_w(x_i⁺)) + D(G_w(x_i), G_w(x_i⁻))}.

where α is the minimum margin between the distance from x to x⁺ and the distance from x to x⁻, and is a preset fixed parameter.
7. The method according to claim 1, characterized in that the included-angle cosine in S3 and S4 is calculated with a ternary metric function; the triplet sample (x', x, x') passes through feature extraction with different network parameters W, yielding the semantic feature representations of the three samples, denoted respectively G_w1(x'), G_w2(x), G_w3(x'); when D(G_w1(x'), G_w2(x)) - D(G_w3(x'), G_w2(x)) > α, the texts are judged similar, and otherwise dissimilar; where α is the minimum margin between the distance from x to x⁺ and the distance from x to x⁻, and is a preset fixed parameter.
8. The method according to claim 7, characterized in that the sample x is a sample selected at random from the training data set, the sample x⁺ is a sample belonging to the same class as x, and the sample x⁻ is a sample of a different class from x; the principle of sample training is: the angle between the feature vectors of x and x⁺ is required to be as small as possible, and the angle between the semantic feature vectors of x and x⁻ as large as possible.
9. The method according to claim 1, characterized in that, in step S2, a pair of texts is selected as input and denoted (x_i, x_j); each text is divided into two parts, its paragraph titles and its body text, and the body texts of the two texts and the titles of the two texts are respectively merged as inputs.
10. A system for judging text similarity, characterized by comprising a controller, the controller being configured to load and execute a program corresponding to the method for judging text similarity according to any one of claims 1-9.
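
By way of a non-limiting worked example, the included-angle cosine of claim 5, the loss function of claim 6 and the threshold decision of step S4 may be sketched in Python as follows. The function names and the sample feature vectors are hypothetical; the feature vectors stand in for the outputs G_w(·) of a trained network, which is not modeled here.

```python
import math


def cosine(u, v):
    """D(u, v) = cos(u, v) = u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triplet_loss(f_x, f_pos, f_neg, alpha):
    """l = max{0, alpha - D(f_x, f_pos) + D(f_x, f_neg)}."""
    return max(0.0, alpha - cosine(f_x, f_pos) + cosine(f_x, f_neg))


def judge_similar(f_a, f_b, threshold):
    """Step S4: similar when the cosine value is greater than the threshold."""
    return cosine(f_a, f_b) > threshold


# Hypothetical feature vectors for a sample x, a same-class sample x+ and
# a different-class sample x-.
f_x, f_pos, f_neg = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]

# The margin is already satisfied here, so the loss is zero.
print(triplet_loss(f_x, f_pos, f_neg, alpha=0.5))
print(judge_similar(f_x, f_pos, threshold=0.8))
```

The loss is zero exactly when D(G_w(x), G_w(x⁺)) exceeds D(G_w(x), G_w(x⁻)) by at least α, so training pushes same-class pairs toward a small angle and different-class pairs toward a large angle.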