CN107967255A - Method and system for judging text similarity - Google Patents

Info

Publication number
CN107967255A
Authority
CN
China
Prior art keywords
text
sample
feature
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711088831.9A
Other languages
Chinese (zh)
Inventor
冯素梅
江国进
孙永滨
白涛
杜乔瑞
王晓燕
张亚栋
徐先柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China General Nuclear Power Corp
China Techenergy Co Ltd
Original Assignee
China General Nuclear Power Corp
China Techenergy Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China General Nuclear Power Corp and China Techenergy Co Ltd
Priority to CN201711088831.9A
Publication of CN107967255A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention belongs to the technical field of text classification. To overcome the respective shortcomings of the three text similarity algorithms used in the prior art, it provides a method and system for judging text similarity. The method comprises: S1, building a vector space model so that text is quantized into an object that can be processed; S2, building a text semantic similarity extraction model with a Siamese network, in which the semantic feature extraction network and the similarity discrimination network are connected in series and optimized simultaneously in the sample training stage; S3, based on the semantic feature representations of the training samples, constructing a text similarity function based on the cosine of the angle between feature vectors, together with the final loss function; S4, inputting two texts to be tested, extracting their semantic features with the Siamese network, computing the cosine of the angle between the two vectors, and setting a threshold: the texts are judged similar when the cosine value exceeds the threshold, and dissimilar otherwise.

Description

A method and system for judging text similarity
Technical field
The present invention relates to the technical field of text classification, in particular to the verification and validation of nuclear safety-class software, and more particularly to a method and system for judging text similarity.
Background
During verification and validation (V&V) of nuclear safety-class software, documents must be evaluated, traceability must be analyzed, and hazards must be analyzed. As the volume of technical documentation keeps growing, repeating these activities at every stage of every project requires a great deal of manpower. V&V personnel therefore need ways to automatically identify the items to be evaluated during document evaluation, to automatically judge the semantic correlation between upper-level and lower-level documents during traceability analysis, and to automatically match the failure modes of similar products during hazard analysis.
The methods currently used to judge text similarity are mainly cosine similarity, the SimHash algorithm, and latent semantic indexing (LSI). The cosine similarity method computes the cosine after preprocessing, selecting text feature items, weighting them, and generating a vector space model. SimHash is the similar-text detection method Google uses to process massive numbers of web pages. Its main purpose is dimensionality reduction: a high-dimensional feature vector is mapped to an f-bit fingerprint, and document duplication or similarity is characterized by the Hamming distance between two document fingerprints. Latent semantic indexing applies singular value decomposition (SVD) from matrix theory: the term frequency matrix is decomposed, the smaller singular values are discarded, and the resulting singular vectors and singular value matrix are used to map document vectors and query vectors into a subspace in which the semantic relations of the document matrix are preserved. The cosine similarity of the angle between vectors can then be computed through a normalized inner product, and the similarity between texts is compared according to the result. LSI thus discards the features that have little influence on document similarity and keeps those that strongly influence the position of a document vector in the m-dimensional space.
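The SimHash fingerprint comparison described above can be sketched in Python as follows. This only illustrates the general idea: the use of SHA-256 as the per-token hash, the uniform token weights, and the function names `simhash` and `hamming_distance` are assumptions made for this sketch, not details taken from this document or from any particular SimHash implementation.

```python
import hashlib

def simhash(tokens, bits=64):
    """Map a list of tokens to a `bits`-bit fingerprint (uniform token weights assumed)."""
    weights = [0] * bits
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is set when the votes for it are positive.
    return sum(2 ** i for i in range(bits) if weights[i] > 0)

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen bound. With only a handful of tokens, each bit is decided by very few votes, which is consistent with the weakness on short texts noted in this section.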
However, in implementing the present invention the inventors found the following: 1. The cosine similarity algorithm is mostly applied to merging and clustering web page titles, where its results are accurate. But the algorithm only considers the statistical properties of words in context and assumes that keywords are linearly independent; it ignores the semantic information of the words themselves, so it cannot properly handle natural language problems in text such as synonyms and polysemous words, and therefore has limitations. 2. SimHash is fast and well suited to similarity judgment over massive amounts of text, but because a short text provides little data for the hash computation, its recognition rate on short texts is low. 3. Latent semantic indexing is more reliable than similarity measures based on the original text vectors, but for massive text data the singular value decomposition is computationally difficult, and an overly sparse corpus does not reveal its latent semantics well.
Summary of the invention
To overcome the respective shortcomings of the three text similarity algorithms in the prior art, the present invention provides a method and system for judging text similarity that can handle texts of many categories with a high recognition rate.
To achieve the above object, the technical solution provided by the invention includes:
One aspect of the present invention provides a method for judging text similarity, characterized by comprising:
S1, building a vector space model so that text is quantized into an object that can be processed;
S2, building a text semantic similarity extraction model with a Siamese network, in which the semantic feature extraction network and the similarity discrimination network are connected in series and optimized simultaneously in the sample training stage;
S3, based on the semantic feature representations of the training samples, constructing a text similarity function based on the cosine of the angle between feature vectors, together with the final loss function, so that the feature vectors of similar sample pairs are mapped to a region of the space;
S4, inputting two texts to be tested, extracting their semantic features with the Siamese network, computing the cosine of the angle between the two vectors, and setting a threshold; the texts are judged similar when the cosine value exceeds the threshold, and dissimilar otherwise.
Preferably, in step S2 of this embodiment of the invention, the Siamese network may have two or more parallel paths, so two or more text features can be input at the same time. The Siamese network extracts features through the nonlinear mapping of each path, and the features of the multiple texts are then combined after the feature layer for the semantic correlation judgment.
Preferably, in step S2 the Siamese network is divided into a first half and a second half. The first half extracts the semantic features of the text and consists of fully connected layers. The second half measures similarity: it integrates the features extracted by the first half, either by computing the distance between the features of a pair of samples extracted by different branches or by directly concatenating the corresponding elements with weights. In step S3 the similarity between samples is then discriminated with the cosine metric function.
Preferably, the first half, which extracts the semantic features of the text, consists of fully connected layers and is built as a network with 5 hidden layers that follows the feature hierarchy of text: word, phrase, sentence, paragraph, article, and semantics.
Preferably, the cosine in S3 and S4 is computed with a ternary metric function. The samples of the triplet $(x, x^+, x^-)$ pass through feature extraction with the same network parameters $W$, giving the semantic feature representations of the three samples, denoted $G_W(x)$, $G_W(x^+)$, and $G_W(x^-)$. The text similarity function based on the cosine of the angle is $D(x_i, x_j) = \cos(x_i, x_j) = \dfrac{x_i \cdot x_j}{|x_i|\,|x_j|}$.
More preferably, the loss function corresponding to this cosine-based text similarity function is:
$$L(x, x^+, x^-) = \max\bigl(0,\ \alpha - D(G_W(x), G_W(x^+)) + D(G_W(x), G_W(x^-))\bigr)$$
where $\alpha$ is the minimum margin between the distance from $x$ to $x^+$ and the distance from $x$ to $x^-$, and is a preset fixed parameter.
Preferably, the cosine in S3 and S4 is computed with a ternary metric function in which the samples of the triplet $(x^+, x, x^-)$ pass through feature extraction with different network parameters $W$, giving the semantic feature representations of the three samples, denoted $G_{W_1}(x^+)$, $G_{W_2}(x)$, and $G_{W_3}(x^-)$. When $D(G_{W_1}(x^+), G_{W_2}(x)) - D(G_{W_3}(x^-), G_{W_2}(x)) > \alpha$, the texts are judged similar, and otherwise dissimilar. Here $\alpha$ is the minimum margin between the distance from $x$ to $x^+$ and the distance from $x$ to $x^-$, and is a preset fixed parameter.
More preferably, the sample $x$ is selected at random from the training data set, the sample $x^+$ belongs to the same class as $x$, and the sample $x^-$ belongs to a different class from $x$. The training principle is to require the angle between the feature vectors of $x$ and $x^+$ to be as small as possible and the angle between the semantic feature vectors of $x$ and $x^-$ to be as large as possible.
Preferably, in step S2 a pair of texts is selected as input, denoted $(x_i, x_j)$; the paragraph title and the body of each text are separated into two parts, and for each of the two texts the body and the title are combined as input.
Another aspect of the present invention provides a system for judging text similarity, characterized by comprising a controller that loads and executes a program corresponding to any one of the above methods for judging text similarity.
The technical solution provided by the present application can achieve at least one of the following beneficial effects:
1. The neural network used is a series of linear transformations and simple nonlinear functions, and the classic stochastic gradient descent method is used in the training stage, so there is no computational difficulty in either training or testing.
2. Text is processed not only at the level of words; document similarity is judged at the semantic level, which handles the natural language problems present in text well and makes the judgment more accurate.
3. The similarity measurement model trained with the Siamese network makes the distance between similar texts smaller and the distance between dissimilar texts larger. It places no requirement on the category, quantity, or length of the texts, so it is well suited to judgment problems with many categories and few samples in some categories.
4. Because the Siamese network is combined with a ternary loss function, the goal during training is to make the features extracted from related texts differ as little as possible and the features extracted from unrelated texts differ as much as possible. Because the training set contains many data pairs, the trained network learns the ability to judge whether texts are related. This ability is built on many varied data pairs and can be extended to categories that never appeared in training; this is the generalization ability of the neural network. Even for categories absent from the samples, the technical solution provided by the application can therefore still judge the similarity of two texts.
5. The ternary loss function maps similar samples to a region of the space rather than to a single point, which reduces the complexity of the problem and substantially improves the generalization ability of the algorithm.
Further features and advantages of the invention will be set forth in the following description, and will in part become apparent from the specification or be understood by implementing the technical solution of the invention. The objects and other advantages of the invention can be realized and obtained by the structures and/or processes particularly pointed out in the specification, the claims, and the drawings.
Brief description of the drawings
Fig. 1 is a flowchart of a method for judging text similarity provided by one embodiment of the invention.
Fig. 2 is a schematic diagram of a text similarity training network provided by one embodiment of the invention.
Fig. 3 is a schematic diagram of a fully connected layer structure provided by one embodiment of the invention.
Fig. 4 is a schematic diagram of a similarity prediction model provided by one embodiment of the invention.
Fig. 5 is a schematic block diagram of a system for judging text similarity provided by one embodiment of the invention.
Fig. 6 is a similarity measurement model structure provided by another embodiment of the invention.
Fig. 7 is a similarity measurement model structure provided by yet another embodiment of the invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below with reference to the drawings and examples, so that how the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and implemented. These specific descriptions are intended to help those of ordinary skill in the art understand the invention more easily and clearly, and do not limit the invention. Provided there is no conflict, the embodiments of the invention and the features within them can be combined with one another, and the resulting technical solutions all fall within the protection scope of the invention.
In addition, the steps shown in the flowcharts of the drawings can be executed in a control system such as a set of controller-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in a different order.
The technical solution of the invention is described in detail below through the drawings and specific embodiments:
Embodiment one
As shown in Fig. 1, this embodiment provides a method for judging text similarity based on a Siamese network. First, a vector space model (VSM) is built for the text data; the process includes preprocessing, word segmentation, and stop-word removal, and quantizes the text into feature vectors that can be processed. Next, a Siamese network (also called a twin network) is built to extract semantic similarity features from sample pairs represented by those feature vectors. Finally, a triplet loss function based on the cosine of the angle between high-dimensional vectors is constructed to discriminate the correlation of text pairs. The method specifically includes:
S1, building a vector space model so that text is quantized into an object that can be processed;
Text processing first requires quantizing the text into an object that can be processed. Preferably, the vector space model (VSM) method is used, which includes: 1. Text preprocessing: preprocessing handles garbled characters and non-text content, performs word segmentation, and removes stop words; using a stop-word list, it removes from the corpus words that contribute little to identifying the content but occur very frequently, along with symbols, punctuation, and garbled characters. 2. Feature vector calculation: after high-frequency words such as common adverbs and auxiliary words are filtered out, a number of feature items are determined from the frequencies of the remaining words. Preferably, the TF-IDF (term frequency-inverse document frequency) method is used to compute the weights of the text feature items, which are then normalized to obtain the original text feature vector.
The basic idea of the vector space model is to reduce a document to an m-dimensional vector whose components are the weights of the feature items, that is, to convert the text into quantized features described by mathematical symbols that can be processed. The quantization process assumes that words are linearly independent of one another, so the model cannot judge semantic correlation directly. The subsequent Siamese network therefore performs further semantic feature extraction on the quantized features of the original text in order to judge the semantic similarity of texts.
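Step S1 can be illustrated with a minimal Python sketch. It is written under several assumptions that are not in the text: whitespace tokenization of English text stands in for Chinese word segmentation, `STOP_WORDS` is a tiny hypothetical stop-word list, the IDF uses a common smoothed form, and normalization is taken to mean L2 normalization.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would load a full vocabulary.
STOP_WORDS = {"the", "a", "of", "and", "is", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]

def tfidf_vectors(documents):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(w for doc in tokenized for w in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1
        # Smoothed IDF keeps a nonzero weight for terms found in every document.
        vec = [(counts[w] / total) * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
               for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors
```

Each returned vector has one component per vocabulary term, which corresponds to the m-dimensional document vector described above; these vectors would be the inputs to the Siamese network in step S2.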
S2, building a text semantic similarity extraction model with a Siamese network, in which the semantic feature extraction network and the similarity discrimination network are connected in series and optimized simultaneously in the sample training stage;
After the quantized vector of each text segment is obtained, judging the semantic similarity of two text segments requires extracting the similarity features of the sample pair. This embodiment uses a Siamese network to perform further semantic feature extraction on the quantized features of the original text; the semantic feature extraction network and the similarity discrimination network are connected in series and optimized simultaneously in the sample training stage;
S3, based on the semantic feature representations of the training samples, constructing a text similarity function based on the cosine of the angle between feature vectors, together with the final loss function, so that the feature vectors of similar sample pairs are mapped to a region of the space;
S4, inputting two texts to be tested, extracting their semantic features with the Siamese network, computing the cosine of the angle between the two vectors, and setting a threshold; the texts are judged similar when the cosine value exceeds the threshold, and dissimilar otherwise.
The technique of this embodiment is further explained with reference to Fig. 2, the schematic of the training stage (three samples are input at the same time, two related and one unrelated, to train the network), and Fig. 4, the schematic of the test stage (only two samples need to be input to judge whether they are related):
S2 corresponds to the structure of the network and the physical meaning assigned to it. The network parameters are denoted $W$; the front part of the network extracts semantic features and the rear part extracts similarity features. S3 corresponds to the loss function described below, that is, $D$ (explained further below with reference to Fig. 2). A neural network is linked by the weight coefficients between its nodes, and the purpose of training is to optimize these coefficients so that, guided by the training data, the network achieves the required function. Once the network is trained, the coefficients are fixed; at test time, the input is transformed by these coefficients to produce a result, from which similarity is judged.
Preferably, in step S2 the Siamese network may have two or more parallel paths, so two or more text features can be input at the same time. The Siamese network extracts features through the nonlinear mapping of each path, and the features of the multiple texts are then combined after the feature layer for the semantic correlation judgment. To help those skilled in the art understand the technical solution more clearly, the structure of the Siamese network is described below with three paths, with reference to Fig. 2.
More preferably, in step S2 the Siamese network is divided into a first half and a second half. The first half extracts the semantic features of the text and consists of fully connected layers. The second half measures similarity: it integrates the features extracted by the first half and computes the distance between the features of a pair of samples extracted by different branches. In step S3 the similarity between samples is then discriminated with the cosine metric function.
Preferably, the first half, which extracts the semantic features of the text, consists of fully connected layers and is built as a network with 5 hidden layers that follows the feature hierarchy of text: word, phrase, sentence, paragraph, article, and semantics.
Preferably, the cosine in S3 and S4 is computed with a ternary metric function. The samples of the triplet $(x, x^+, x^-)$ pass through feature extraction with the same network parameters $W$, giving the semantic feature representations of the three samples, denoted $G_W(x)$, $G_W(x^+)$, and $G_W(x^-)$. The text similarity function based on the cosine of the angle is $D(x_i, x_j) = \cos(x_i, x_j) = \dfrac{x_i \cdot x_j}{|x_i|\,|x_j|}$.
A triplet consists of two related samples and one unrelated sample. The distance between the features of related samples and the distance between the features of unrelated samples are therefore both considered, that is, both the intra-class distance and the inter-class distance are taken into account, so the resulting classifier is more stable and generalizes better. Traditional methods consider only the distance between related samples and are prone to errors such as misjudging the correlation of easily confused samples.
Preferably, the loss function corresponding to this cosine-based text similarity function is:
$$L(x, x^+, x^-) = \max\bigl(0,\ \alpha - D(G_W(x), G_W(x^+)) + D(G_W(x), G_W(x^-))\bigr)$$
where $\alpha$ is the minimum margin between the distance from $x$ to $x^+$ and the distance from $x$ to $x^-$, and is a preset fixed parameter.
Preferably, the sample $x$ is selected at random from the training data set, the sample $x^+$ belongs to the same class as $x$, and the sample $x^-$ belongs to a different class from $x$. The training principle is to require the angle between the feature vectors of $x$ and $x^+$ to be as small as possible and the angle between the semantic feature vectors of $x$ and $x^-$ to be as large as possible.
A more specific preferred solution: in step S2, the Siamese network structure (shown in Fig. 2) can have two or more parallel paths, so two or more text features can be input into the network at the same time, features are extracted through the nonlinear mapping of each path, and the features of the multiple texts are combined after the feature layer for the semantic correlation judgment. This embodiment therefore introduces a Siamese network to extract the semantic similarity features of text pairs.
In the training stage, the network has three branches (shown in Fig. 2) with the same network structure. To extract common features and make similarity judgment easier, the branches share weights, that is, the three branches have the same network parameters $W$. The Siamese network is divided into a first half and a second half. The first half extracts the semantic features of the text and consists of fully connected layers with the ReLU activation function, $f(x) = \max(0, x)$. To better extract the semantics of the text, the first half follows the feature hierarchy of text from shallow to deep (word, phrase, sentence, paragraph, article, and semantics) and is built with 5 hidden layers (Fig. 3 shows only two of them; the others are similar). Every layer is fully connected, which extracts the text semantics while reducing the vector dimension.
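The weight-sharing branch structure can be sketched in plain Python as a forward pass only. The layer sizes in `LAYER_SIZES`, the random initialization, and the function names are hypothetical choices for illustration; the text specifies only 5 fully connected hidden layers with ReLU, shrinking dimensions, and parameters shared by all branches.

```python
import random

def relu(v):
    """Element-wise ReLU: f(x) = max(0, x)."""
    return [max(0.0, x) for x in v]

def init_layers(sizes, seed=0):
    """Create (weights, biases) for consecutive fully connected layers."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def branch_forward(x, layers):
    """Pass one input vector through every fully connected layer with ReLU."""
    h = x
    for weights, biases in layers:
        h = relu([sum(w * a for w, a in zip(row, h)) + b
                  for row, b in zip(weights, biases)])
    return h

# Hypothetical sizes: a 12-dimensional input reduced through 5 hidden layers.
LAYER_SIZES = [12, 10, 8, 6, 5, 4]
shared = init_layers(LAYER_SIZES)

def siamese_forward(anchor, positive, negative, layers=shared):
    """All three branches reuse the same parameters W (weight sharing)."""
    return tuple(branch_forward(v, layers) for v in (anchor, positive, negative))
```

Because every branch calls `branch_forward` with the same `layers` object, identical inputs always yield identical features, which is what makes the distance between branch outputs meaningful. Training by back-propagation is omitted from this sketch.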
The second half measures similarity: it integrates the features extracted by the first half and then discriminates the similarity between samples with a metric function, whose choice is discussed further below. Training samples are organized as follows. A sample is first selected at random from the training data set as the anchor, denoted $x$. Then a sample of the same class and a sample of a different class are selected at random; these are called the positive (denoted $x^+$) and the negative (denoted $x^-$). Together they form one training triplet $(x, x^+, x^-)$, which is input into the Siamese network. The network parameters are optimized by the back-propagation (BP) algorithm so that the network output gradually approaches the true value, until the network converges.
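The triplet organization described above can be sketched as follows. The representation of the training set as a list of `(vector, label)` pairs and the name `sample_triplet` are assumptions made for illustration.

```python
import random

def sample_triplet(dataset, rng=random):
    """dataset: list of (vector, label). Returns (anchor, positive, negative)."""
    anchor_idx = rng.randrange(len(dataset))
    anchor, label = dataset[anchor_idx]
    # Positive candidates: same class as the anchor, excluding the anchor itself.
    positives = [v for i, (v, l) in enumerate(dataset)
                 if l == label and i != anchor_idx]
    # Negative candidates: any sample from a different class.
    negatives = [v for v, l in dataset if l != label]
    if not positives or not negatives:
        raise ValueError("anchor class needs a second sample and another class must exist")
    return anchor, rng.choice(positives), rng.choice(negatives)
```

In this sketch a class with a single sample cannot supply a positive, so the function raises an error rather than silently reusing the anchor.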
In this Siamese network structure, the semantic feature extraction network and the similarity discrimination network are connected in series and optimized simultaneously, forming an end-to-end network. The feature extraction part is influenced by the discrimination part, so the extracted features help discriminate the similarity of samples. Compared with optimizing the parts separately, the end-to-end method is a global optimization and performs better.
A more specific preferred solution: in step S3, the triplet samples pass through feature extraction to give the semantic feature representations of the three samples, denoted $G_W(x)$, $G_W(x^+)$, and $G_W(x^-)$. In the discrimination part of the network, a metric function based on the distance between feature vectors is constructed to judge the similarity of samples. Because the extracted feature vectors are high-dimensional, traditional measures such as Euclidean distance do not reflect the distance between them well, so the invention constructs a text similarity measurement function based on the cosine of the angle between feature vectors:
$$D(x_i, x_j) = \cos(x_i, x_j) = \frac{x_i \cdot x_j}{|x_i|\,|x_j|} \qquad \text{(formula 1)}$$
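Formula 1 translates directly into a few lines of Python. Returning 0.0 when either vector is all zeros is a convention assumed here; the text does not say how that case is handled.

```python
import math

def cosine_similarity(u, v):
    """D(u, v) = (u . v) / (|u| |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

The value depends only on the direction of the two vectors, not on their lengths, so scaling either feature vector leaves the similarity unchanged.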
A binary loss function, used to measure the similarity of sample pairs, maps all similar samples to a single point in the feature space. This requires the feature extraction function to respond identically to all similar samples, which makes the network hard to train. This embodiment therefore uses a ternary loss function for the discrimination task. After the feature vectors of the triplet are obtained, the angle between the feature vectors of $x$ and $x^+$ is required to be as small as possible (that is, $D(G_W(x), G_W(x^+))$ as large as possible), the angle between the semantic feature vectors of $x$ and $x^-$ is required to be as large as possible (that is, $D(G_W(x), G_W(x^-))$ as small as possible), and a minimum margin $\alpha$ is required between the distance from $x$ to $x^+$ and the distance from $x$ to $x^-$, where $\alpha$ is a given hyperparameter. The final loss function is:
$$L(x, x^+, x^-) = \max\bigl(0,\ \alpha - D(G_W(x), G_W(x^+)) + D(G_W(x), G_W(x^-))\bigr)$$
Here $\alpha$ is a small value fixed at the outset, such as 0.1. It is usually determined by the scale of the chosen distance metric function; for the cosine used here the range is [0, 1], so 0.1 is chosen.
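The margin requirement can be sketched as a hinge-style triplet loss on cosine similarity. The exact expression used by the inventors should be checked against the original filing; the form below, $\max(0, \alpha - D(x, x^+) + D(x, x^-))$, is an assumption that follows the usual triplet loss and the stated requirements (similarity to the positive large, similarity to the negative small, margin $\alpha$ between them).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def triplet_loss(f_anchor, f_pos, f_neg, alpha=0.1):
    """Assumed hinge form: zero once D(anchor, pos) exceeds D(anchor, neg) by alpha."""
    d_pos = cosine_similarity(f_anchor, f_pos)
    d_neg = cosine_similarity(f_anchor, f_neg)
    return max(0.0, alpha - (d_pos - d_neg))
```

With this form, a triplet that already satisfies the margin contributes no gradient, so training effort concentrates on triplets that are still confused.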
Unlike a binary loss function, a ternary loss function aims to map similar samples to a region of the space so that the intra-class distance is smaller than the inter-class distance, with the boundary between intra-class and inter-class distances controlled by the fixed parameter $\alpha$. Mapping the feature vectors of same-class sample pairs to a region of the space rather than to a single point greatly reduces the complexity of the problem without impairing the similarity judgment itself, so both the performance and the generalization ability of the algorithm improve substantially.
A more specific preferred solution: in step S4, prediction needs only two paths of the network (because all paths share weights, any two can be chosen). The input samples are first processed according to step S1 to extract the quantized vectors of the two texts to be tested, denoted input1 and input2. These are then fed into the Siamese network model shown in Fig. 4. After the Siamese network extracts the semantic features of the sample pair, the features are passed to the discrimination network, which computes the cosine of the angle between the two vectors. A threshold ζ is set: when the computed value exceeds ζ, the texts are judged similar, and otherwise dissimilar.
The threshold ζ is given in advance. The value output by the network can be regarded as the probability that the text pair is similar, and ζ specifies the probability at which the two texts are considered similar. It depends on the level of similarity the specific task requires: the stricter the requirement, the larger ζ can be set.
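The decision rule of step S4 can be sketched as below. The inputs are assumed to be the semantic features already extracted by two weight-sharing branches, and the default threshold of 0.8 is a hypothetical value; the text leaves ζ to the requirements of the task.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def is_similar(feature1, feature2, threshold=0.8):
    """Judge two texts similar when the cosine of their features exceeds the threshold."""
    return cosine_similarity(feature1, feature2) > threshold
```

Raising `threshold` makes the judgment stricter: fewer pairs are reported as similar, and those that are reported are closer in direction.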
As shown in Fig. 5, this embodiment also provides a system for judging text similarity. The system includes a controller that loads and executes a program corresponding to any one of the above methods for judging text similarity.
Embodiment two
On the basis of Embodiment one, and in order to make the training model more flexible, this embodiment allows the three branches of the Siamese network to differ in weights and in number of layers; that is, the three mapping functions are independent of one another and are tied together only in the final distance calculation. Content not described again here is the same as in Embodiment one. Specifically, as shown in Fig. 6, in this embodiment the included-angle cosine in S3 and S4 of Fig. 1 is preferably calculated with a ternary metric function. The samples of the triple (x', x, x') pass through feature extraction with different network parameters W, yielding the semantic feature expressions of the three samples, denoted respectively
$G_{w1}(x')$, $G_{w2}(x)$, $G_{w3}(x')$;
When $D(G_{w1}(x'), G_{w2}(x)) - D(G_{w3}(x'), G_{w2}(x)) > \alpha$, the texts are judged similar; otherwise they are judged dissimilar. Here α is the minimum interval between the distance between $x$ and $x^+$ and the distance between $x$ and $x^-$, and is a preset fixed parameter. For the value of α, refer to the description in Embodiment one, which is not repeated here.
Correspondingly, in another aspect this embodiment also provides a system for judging text similarity. The system includes a controller, which is used to load and execute the program corresponding to the Siamese network structure in Fig. 6.
Embodiment three
Technical documents for software development, particularly for software in the nuclear power field, must be edited in conformance with standards and specifications, and their titles are highly general and similar. If the whole text is processed uniformly when the vector space model is built, much of the important information that the titles contribute to classification is lost. The contribution of paragraph titles to the text similarity measure should therefore be properly considered during training and testing. As shown in Fig. 7, in this embodiment, preferably, a pair of texts is selected as input in the corresponding step S2 of Embodiment one or Embodiment two and denoted $(x_i, x_j)$; each text is divided into two parts, its paragraph titles and its body, and the bodies of the two texts and the titles of the two texts are merged respectively as inputs.
Specifically, a pair of of text is chosen first as input, is denoted as, can is similar or dissimilar, by the section of text Fall title and text is divided into two parts, meanwhile, the text of two texts and title are incorporated as inputting respectively.Training pattern For structure as shown in fig. 7, wherein, decision-making layer network uses full connection Rotating fields, activation primitive uses Sigmoid functions,
That is σ (x)=1/ (1+exp (- x));A ∈ (0,1),
X_Text=(xi_Text,xj_ Text), x_Title=(xi_Title,xj_Title)。
During testing, the same model is used and a threshold is set; when the output is greater than the threshold, the texts are judged similar, otherwise dissimilar. For the threshold, refer to the description of ζ in Embodiment one, which is not repeated here.
However, in the similarity judgment method of this embodiment, step S2 no longer "calculates the distance between the features of a pair of samples extracted through different branches" during execution; instead, the corresponding elements are directly connected with weighting.
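The decision step of this embodiment can be sketched roughly as follows. The sketch assumes that the body and title branches have already produced feature vectors, and it interprets the "weighted connection" as concatenating those features and feeding them to a single fully connected Sigmoid unit; that interpretation, the function names, and the toy numbers are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + exp(-z)), with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def decision_layer(text_features, title_features, weights, bias):
    """Concatenate body and title features, then apply one weighted
    fully connected unit with a Sigmoid activation."""
    joined = list(text_features) + list(title_features)
    z = sum(w * f for w, f in zip(weights, joined)) + bias
    return sigmoid(z)


def judge(text_features, title_features, weights, bias, threshold):
    """Similar when the decision-layer output exceeds the preset threshold."""
    return decision_layer(text_features, title_features, weights, bias) > threshold


# Hypothetical features and weights; the title feature gets a larger weight here.
text_feat = [0.2, 0.4]
title_feat = [0.9]
weights = [1.0, 1.0, 3.0]

score = decision_layer(text_feat, title_feat, weights, bias=0.0)
similar = judge(text_feat, title_feat, weights, bias=0.0, threshold=0.9)
```

Giving the title features their own weights lets training decide how much the title contributes relative to the body, which is the motivation stated for this embodiment.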
Correspondingly, this embodiment also provides a system for judging text similarity. The system includes a controller, which is used to load and execute the program corresponding to the Siamese network structure in Fig. 7.
By adopting the above technical solution provided by this application, at least one of the following beneficial effects can be obtained:
1. The neural network used is a serial superposition of linear transformations and simple nonlinear functions, and classical stochastic gradient descent is used in the training stage; therefore, neither training nor testing poses any computational difficulty.
2. Text is not processed merely at the word level; document similarity is judged at the semantic level, which handles the natural-language problems present in the text well, so the judgment result is more accurate.
3. The similarity measurement model trained with the Siamese network makes the distance between similar texts smaller and the distance between dissimilar texts larger. It places no requirements on the category, number, or length of the texts, so it is well suited to decision problems with many categories in which some categories have few samples.
4. Because the Siamese network is combined with the ternary loss function, the goal in the network training stage is to make the difference between features extracted from related texts as small as possible and the difference between features extracted from unrelated texts as large as possible. Since the training samples include many data pairs, the trained network acquires the ability to judge whether texts are related. Because this ability is built on many varied data pairs, it can be generalized to categories that have not appeared; this is the generalization ability of the neural network. Thus, even for categories absent from the samples, the technical solution provided by this application can still determine the similarity of two texts.
5. The ternary loss function maps similar samples to a region of the space rather than to a single point, which simplifies the problem and substantially improves the generalization ability of the algorithm.
Finally, it should be noted that the above description is only the preferred embodiment of the present invention and does not limit the present invention in any form. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible variations and simple substitutions to the technical solution of the present invention, all of which fall within the scope of protection of the technical solution of the present invention.

Claims (10)

  1. A method for judging text similarity, characterized by comprising:
    S1, constructing a vector space model so that text is quantized into a processable object;
    S2, using a Siamese network to build a text semantic similarity extraction model, wherein the semantic feature extraction network and the similarity discrimination network in the Siamese network are connected in series and optimized together in the sample training stage;
    S3, based on the semantic feature expressions of the training-stage samples, constructing a text similarity calculation function based on the included-angle cosine of the feature vectors, and a final loss function, so that the feature vectors of same-class sample pairs are mapped to a region of the space;
    S4, inputting two texts to be tested; after semantic features are extracted from the texts to be tested by the Siamese network, calculating the cosine angle distance of the two vectors and setting a threshold; when the cosine angle distance of the two vectors is greater than the threshold, the texts are judged similar, otherwise dissimilar.
  2. The method according to claim 1, characterized in that, in step S2, the structure of the Siamese network may have two or more parallel paths and, correspondingly, two or more text features may be input at the same time; the Siamese network extracts features through the nonlinear mapping of the respective paths, and the features of the multiple texts may then be combined after the feature layer for semantic relevance judgment.
  3. The method according to claim 2, characterized in that, in step S2, the Siamese network is divided into a first half and a second half; the first half is used for extracting text semantic features and consists of fully connected layers; the second half is used for similarity measurement and integrates the features extracted by the first half of the network, either by calculating the distance between the features of a pair of samples extracted through different branches or by directly connecting the corresponding elements with weighting; the similarity of the samples is then discriminated in step S3 by the cosine angle metric function.
  4. The method according to claim 2 or 3, characterized in that the first half is used for extracting text semantic features, consists of fully connected layers, and is built as a network structure with 5 hidden layers covering words, phrases, sentences, paragraphs, articles and semantics.
  5. The method according to claim 1, characterized in that the included-angle cosine in S3 and S4 is calculated with a ternary metric function; the samples of the triple $(x, x^+, x^-)$ pass through feature extraction with the same network parameters W, yielding the semantic feature expressions of the three samples, denoted $G_w(x)$, $G_w(x^+)$ and $G_w(x^-)$ respectively; the text similarity calculation function of the included-angle cosine is: $D(x_i, x_j) = \cos(x_i, x_j) = \dfrac{x_i \cdot x_j}{|x_i|\,|x_j|}$.
  6. The method according to claim 5, characterized in that the loss function corresponding to the text similarity calculation function of the included-angle cosine is:
    $$l(x_i, x_i^+, x_i^-) = \max\{0,\ \alpha - D(G_w(x_i), G_w(x_i^+)) + D(G_w(x_i), G_w(x_i^-))\}.$$
    wherein α is the minimum interval between the distance between $x$ and $x^+$ and the distance between $x$ and $x^-$, and is a preset fixed parameter.
  7. The method according to claim 1, characterized in that the included-angle cosine in S3 and S4 is calculated with a ternary metric function; the samples of the triple (x', x, x') pass through feature extraction with different network parameters W, yielding the semantic feature expressions of the three samples, denoted $G_{w1}(x')$, $G_{w2}(x)$ and $G_{w3}(x')$ respectively; when $D(G_{w1}(x'), G_{w2}(x)) - D(G_{w3}(x'), G_{w2}(x)) > \alpha$, the texts are judged similar, otherwise dissimilar; wherein α is the minimum interval between the distance between $x$ and $x^+$ and the distance between $x$ and $x^-$, and is a preset fixed parameter.
  8. The method according to claim 7, characterized in that the sample $x$ is a sample selected at random from the training data set, the sample $x^+$ is a sample belonging to the same class as $x$, and the sample $x^-$ is a sample of a different class from $x$; the principle of sample training is that the angle between the feature vectors of $x$ and $x^+$ is required to be as small as possible, and the angle between the semantic feature vectors of $x$ and $x^-$ as large as possible.
  9. The method according to claim 1, characterized in that, in step S2, a pair of texts is selected as input and denoted $(x_i, x_j)$; each text is divided into two parts, its paragraph titles and its body, and the bodies of the two texts and the titles of the two texts are merged respectively as inputs.
  10. A system for judging text similarity, characterized by comprising: a controller, the controller being used to load and execute the program corresponding to the method for judging text similarity according to any one of claims 1-9.
CN201711088831.9A 2017-11-08 2017-11-08 A kind of method and system for judging text similarity Pending CN107967255A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711088831.9A CN107967255A (en) 2017-11-08 2017-11-08 A kind of method and system for judging text similarity

Publications (1)

Publication Number Publication Date
CN107967255A true CN107967255A (en) 2018-04-27

Family

ID=62000824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711088831.9A Pending CN107967255A (en) 2017-11-08 2017-11-08 A kind of method and system for judging text similarity

Country Status (1)

Country Link
CN (1) CN107967255A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180427