CN107967255A - Method and system for judging text similarity - Google Patents
- Publication number: CN107967255A
- Application number: CN201711088831.9A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
- G06F40/30—Semantic analysis
Abstract
The invention belongs to the technical field of text classification. To overcome the respective shortcomings of three existing text similarity algorithms, it provides a method and system for judging text similarity. The method comprises: S1, building a vector space model so that text is quantized into an object that can be processed; S2, building a text semantic similarity extraction model with a Siamese network, in which the semantic feature extraction network and the similarity discrimination network are connected in series and optimized together during sample training; S3, based on the semantic feature representations of the training samples, constructing a text similarity function based on the cosine of the angle between feature vectors, together with the final loss function; S4, inputting two texts to be tested, extracting their semantic features with the Siamese network, computing the cosine of the angle between the two feature vectors and comparing it with a preset threshold: when the cosine exceeds the threshold the texts are judged similar, otherwise dissimilar.
Description
Technical field
The present invention relates to the technical field of text classification, in particular to the verification and validation of nuclear safety-class software, and more particularly to a method and system for judging text similarity.
Background art
During verification and validation (V&V) of nuclear safety-class software, the execution documents must be assessed, traceability must be analysed, hazards must be analysed, and so on. As the volume of technical documentation keeps growing, repeating these activities in every phase of every project takes a great deal of manpower. V&V personnel therefore need to solve several problems: automatically identifying the items to be assessed during document assessment, automatically judging the semantic relevance between higher-level and lower-level documents during traceability analysis, and automatically matching the failure modes of similar products during hazard analysis.
The methods currently used to judge text similarity are mainly cosine similarity, the SimHash algorithm and latent semantic indexing (LSI). The cosine similarity method computes the cosine after preprocessing, feature item selection and weighting have produced a vector space model. SimHash is the text similarity method Google uses for processing massive numbers of web pages; its main purpose is dimensionality reduction: a high-dimensional feature vector is mapped to an f-bit fingerprint, and the Hamming distance between two document fingerprints characterises whether the documents are duplicates or similar. LSI uses singular value decomposition (SVD) from matrix theory: the term-frequency matrix is decomposed, the smaller singular values are discarded, and the resulting singular vectors and singular value matrix are used to map document vectors and query vectors into a subspace in which the semantic relations of the document matrix are preserved. The cosine similarity between vectors can then be computed from the normalised inner product, and texts are compared according to the result. LSI thus removes the features that have little influence on document similarity; the features that remain are those with a large influence on the position of the document vector in the m-dimensional space.
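The fingerprint comparison that SimHash relies on can be illustrated with a minimal sketch. It assumes unit token weights and uses SHA-256 from the Python standard library as the per-token hash; both are choices made for illustration rather than details given in the text, and the names `simhash` and `hamming_distance` are hypothetical.

```python
import hashlib

def simhash(tokens, f=64):
    """Map a list of tokens to an f-bit fingerprint (unit weights assumed)."""
    counts = [0] * f
    for token in tokens:
        # SHA-256 yields 256 bits; only the low f bits are used (f <= 256).
        h = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest(), "big")
        for i in range(f):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # bit i is set when more tokens voted 1 than 0
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen bound; a short text contributes few tokens to the vote, which is consistent with the weakness for short texts noted in this section.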
However, in the course of making the invention the inventors found the following. 1. The cosine similarity algorithm is mostly applied to merging web page titles and clustering titles, where its results are accurate; but it considers only the statistical properties of words in context and assumes that keywords are linearly independent. It ignores the semantic information of the words themselves and cannot handle natural language phenomena such as synonyms and polysemous words well, so it has certain limitations. 2. SimHash is fast and well suited to similarity judgments over massive amounts of text; but a short text provides little data for the hash computation, so the recognition rate for short-text similarity is low. 3. LSI is more reliable than similarity measures based on the raw text vectors; but for massive text data the singular value decomposition is computationally difficult, and an overly sparse corpus does not reflect its latent semantics well.
Summary of the invention
To overcome the respective shortcomings of the three prior-art text similarity algorithms, the present invention provides a method and system for judging text similarity that can handle texts of many categories with a high recognition rate.
To achieve the above object, the technical solution provided by the invention includes the following.
One aspect of the invention provides a method for judging text similarity, characterised by comprising:
S1, building a vector space model so that text is quantized into an object that can be processed;
S2, building a text semantic similarity extraction model with a Siamese network, in which the semantic feature extraction network and the similarity discrimination network are connected in series and optimized together during sample training;
S3, based on the semantic feature representations of the training samples, constructing a text similarity function based on the cosine of the angle between feature vectors, together with the final loss function, so that the feature vectors of similar sample pairs are mapped to a region of the space;
S4, inputting two texts to be tested, extracting their semantic features with the Siamese network, computing the cosine of the angle between the two vectors and setting a threshold; when the cosine exceeds the threshold the texts are judged similar, otherwise dissimilar.
Preferably, in step S2 the Siamese network may have two or more parallel paths, so that two or more text features can be input at the same time. The Siamese network extracts features through the nonlinear mapping of each path, and the features of the texts are then combined after the feature layer to judge semantic relevance.
Preferably, in step S2 the Siamese network is divided into a front half and a rear half. The front half extracts text semantic features and consists of fully connected layers. The rear half measures similarity: it integrates the features extracted by the front half, either by computing the distance between the features of a sample pair extracted by different branches or by directly connecting corresponding elements with weights. In step S3 the similarity between samples is then judged with the cosine metric function.
Preferably, the front half, which extracts text semantic features, consists of fully connected layers and is built as a network with 5 hidden layers, following the hierarchy of word, phrase, sentence, paragraph, article and semantics.
Preferably, the cosine in S3 and S4 is computed with a ternary (triplet) metric function. The samples of a triplet $(x, x^+, x^-)$ pass through feature extraction with the same network parameters $W$, giving the semantic feature representations of the three samples, denoted $G_w(x)$, $G_w(x^+)$ and $G_w(x^-)$. The text similarity function based on the cosine of the angle is:
$$D(x_i, x_j) = \cos(x_i, x_j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}$$
Further preferably, the loss function corresponding to the cosine-based text similarity function is:
$$L = \max\bigl(0,\; D(G_w(x), G_w(x^-)) - D(G_w(x), G_w(x^+)) + \alpha\bigr)$$
where $\alpha$ is the minimum margin between the distance from $x$ to $x^+$ and the distance from $x$ to $x^-$, and is a preset fixed parameter.
Preferably, the cosine in S3 and S4 is computed with a ternary metric function in which the samples of a triplet $(x', x, x')$ pass through feature extraction with different network parameters $W$, giving the semantic feature representations of the three samples, denoted $G_{w1}(x')$, $G_{w2}(x)$ and $G_{w3}(x')$. When $D(G_{w1}(x'), G_{w2}(x)) - D(G_{w3}(x'), G_{w2}(x)) > \alpha$ the texts are judged similar, otherwise dissimilar; here $\alpha$ is the minimum margin between the distance from $x$ to $x^+$ and the distance from $x$ to $x^-$, and is a preset fixed parameter.
Further preferably, the sample $x$ is chosen at random from the training data set, $x^+$ is a sample of the same class as $x$, and $x^-$ is a sample of a different class from $x$. The training principle is to require that the angle between the feature vectors of $x$ and $x^+$ be as small as possible and the angle between the semantic feature vectors of $x$ and $x^-$ be as large as possible.
Preferably, in step S2 a pair of texts is selected as input, denoted $(x_i, x_j)$; each text is split into its paragraph titles and its body, and the bodies of the two texts and the titles of the two texts are combined respectively as inputs.
Another aspect of the invention provides a system for judging text similarity, characterised by comprising a controller that loads and executes the program corresponding to any of the above methods for judging text similarity.
The technical solution provided by this application achieves at least one of the following beneficial effects:
1. The neural network used is a series composition of linear transformations and simple nonlinear functions, and the training phase uses classical stochastic gradient descent, so neither training nor testing is computationally difficult.
2. Text is not handled only at the level of words; document similarity is judged at the semantic level, which handles the natural language problems present in text well and makes the judgment more accurate.
3. The similarity measurement model trained with the Siamese network makes the distance between similar texts smaller and the distance between dissimilar texts larger. It imposes no requirements on the category, number or length of texts, so it handles well those judgment problems with many categories and few samples in some categories.
4. Because a Siamese network is combined with a triplet loss function, the goal during training is to make the features extracted from related texts differ as little as possible and the features extracted from unrelated texts differ as much as possible. Since the training set contains many data pairs, the trained network learns the ability to judge whether texts are related; because this ability is built on many varied data pairs, it extends to categories that did not appear in training, which is the generalization ability of neural networks. Even for categories absent from the samples, the technical solution of this application can therefore still judge the similarity of two texts.
5. The triplet loss function maps similar samples to a region of the space rather than to a single point, which reduces the complexity of the problem and substantially improves the generalization ability of the algorithm.
Other features and advantages of the invention are set out in the following description; in part they will be apparent from the description or can be understood by practising the technical solution of the invention. The objects and other advantages of the invention can be realised and obtained through the structures and/or procedures particularly pointed out in the description, the claims and the drawings.
Brief description of the drawings
Fig. 1 is a flowchart of a method for judging text similarity according to one embodiment of the invention.
Fig. 2 is a schematic diagram of a text similarity training network according to one embodiment of the invention.
Fig. 3 is a schematic diagram of a fully connected layer structure according to one embodiment of the invention.
Fig. 4 is a schematic diagram of a similarity prediction model according to one embodiment of the invention.
Fig. 5 is a schematic block diagram of a system for judging text similarity according to one embodiment of the invention.
Fig. 6 is a similarity measurement model structure according to another embodiment of the invention.
Fig. 7 is a similarity measurement model structure according to yet another embodiment of the invention.
Detailed description of the embodiments
The embodiments of the invention are described in detail below with reference to the drawings and examples, so that the way the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and reproduced. These specific descriptions are intended to help those of ordinary skill in the art understand the invention more easily and clearly; they do not limit the invention. Provided no conflict arises, the embodiments of the invention and the features of the embodiments may be combined with one another, and the resulting technical solutions all fall within the scope of protection of the invention.
In addition, the steps shown in the flowcharts of the drawings may be executed in a control system such as a set of controller-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
The technical solution of the invention is described in detail below through the drawings and specific embodiments:
Embodiment one
As shown in Fig. 1, this embodiment provides a method for judging text similarity based on a Siamese network. First, a vector space model (VSM) is built for the text data; this involves preprocessing, word segmentation and stop-word removal, and quantizes the text into a feature vector that can be processed. Then a Siamese network (also called a twin network) is built to extract semantic similarity features from pairs of feature vectors. Finally, a triplet loss function based on the cosine of the angle between high-dimensional vectors is constructed to judge the relevance of text pairs. The method specifically comprises:
S1, building a vector space model so that text is quantized into an object that can be processed;
Text processing first requires quantizing the text into an object that can be processed. Preferably, the vector space model (VSM) is built as follows. 1. Text preprocessing: garbled characters and non-text content are handled, the text is segmented into words and stop words are removed; according to a stop-word list, words, symbols, punctuation and garbled characters that occur very frequently in the corpus but contribute little to identifying the content are removed. 2. Feature vector computation: after high-frequency words such as common adverbs and particles have been filtered out, a number of feature items are determined from the frequencies of the remaining words. Preferably, the weights of the feature items are computed with TF-IDF (term frequency-inverse document frequency) and normalised, giving the original text feature vector.
The basic idea of the vector space model is to reduce a document to an m-dimensional vector whose components are the weights of the feature items, that is, to convert the text into quantized features described by mathematical symbols. The quantization assumes that words are linearly independent of one another, so the model cannot judge semantic relevance directly; instead, the subsequent Siamese network extracts further semantic features from the quantized features of the original text in order to judge the semantic similarity of texts.
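Step S1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: whitespace splitting stands in for word segmentation, `STOP_WORDS` is a tiny hypothetical stop-word list, and the smoothed IDF variant is one common choice that the text does not specify.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real one would come from a curated vocabulary.
STOP_WORDS = {"the", "a", "of", "and", "is", "to"}

def tokenize(text):
    """Whitespace split stands in for word segmentation; stop words are dropped."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks) or 1
        vec = [tf[w] / total * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors
```

Each document becomes one m-dimensional vector, where m is the vocabulary size; these vectors are what the Siamese network would receive as input.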
S2, building a text semantic similarity extraction model with a Siamese network, in which the semantic feature extraction network and the similarity discrimination network are connected in series and optimized together during sample training;
After the quantized vector of each text segment has been obtained, judging the semantic similarity of two text segments requires extracting the similarity features of the sample pair. This embodiment uses a Siamese network to extract further semantic features from the quantized features of the original text; the semantic feature extraction network and the similarity discrimination network are connected in series and optimized together during sample training;
S3, based on the semantic feature representations of the training samples, constructing a text similarity function based on the cosine of the angle between feature vectors, together with the final loss function, so that the feature vectors of similar sample pairs are mapped to a region of the space;
S4, inputting two texts to be tested, extracting their semantic features with the Siamese network, computing the cosine of the angle between the two vectors and setting a threshold; when the cosine exceeds the threshold the texts are judged similar, otherwise dissimilar.
The technique of this embodiment is further explained below with reference to Fig. 2, the schematic of the training phase (three samples are input at once, two related and one unrelated, to train the network), and Fig. 4, the schematic of the test phase (only two samples need to be input to judge whether they are related).
In this embodiment S2 corresponds to the structure of the network and the physical meaning given to it. The network parameters are W; the front part of the network extracts semantic features and the rear part extracts similarity features. S3 corresponds to the implementation of the loss function described below, namely D (explained further below with reference to Fig. 2). A neural network is linked by weight coefficients between nodes, and the purpose of training is to optimize these coefficients so that, guided by the training data, the network achieves the required function. Once the network has been trained, the coefficients are fixed; at test time the input is transformed by these coefficients to produce a result, from which similarity is judged.
Preferably, in step S2 the Siamese network may have two or more parallel paths, so that two or more text features can be input at the same time. The Siamese network extracts features through the nonlinear mapping of each path, and the features of the texts are then combined after the feature layer to judge semantic relevance. To make the technical solution of this embodiment clearer to those skilled in the art, the Siamese network structure is explained below with three channels, with reference to Fig. 2.
Further preferably, in step S2 the Siamese network is divided into a front half and a rear half. The front half extracts text semantic features and consists of fully connected layers; the rear half measures similarity by integrating the features extracted by the front half and computing the distance between the features of a sample pair extracted by different branches. In step S3 the similarity between samples is then judged with the cosine metric function.
Preferably, the front half, which extracts text semantic features, consists of fully connected layers and is built as a network with 5 hidden layers, following the hierarchy of word, phrase, sentence, paragraph, article and semantics.
Preferably, the cosine in S3 and S4 is computed with a ternary (triplet) metric function. The samples of a triplet $(x, x^+, x^-)$ pass through feature extraction with the same network parameters $W$, giving the semantic feature representations of the three samples, denoted $G_w(x)$, $G_w(x^+)$ and $G_w(x^-)$. The text similarity function based on the cosine of the angle is:
$$D(x_i, x_j) = \cos(x_i, x_j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}$$
A triplet consists of two related samples and one unrelated sample. The distance between the features of related samples and the distance between the features of unrelated samples are therefore both taken into account, that is, both the intra-class distance and the inter-class distance, so the resulting classifier is more stable and generalizes better. Conventional methods consider only the distance between related samples and are prone to errors when judging the relevance of easily confused samples.
Preferably, the loss function corresponding to the cosine-based text similarity function is:
$$L = \max\bigl(0,\; D(G_w(x), G_w(x^-)) - D(G_w(x), G_w(x^+)) + \alpha\bigr)$$
where $\alpha$ is the minimum margin between the distance from $x$ to $x^+$ and the distance from $x$ to $x^-$, and is a preset fixed parameter.
Preferably, the sample $x$ is chosen at random from the training data set, $x^+$ is a sample of the same class as $x$, and $x^-$ is a sample of a different class from $x$. The training principle is to require that the angle between the feature vectors of $x$ and $x^+$ be as small as possible and the angle between the semantic feature vectors of $x$ and $x^-$ be as large as possible.
A more specific preferred solution: step S2 takes advantage of the Siamese network structure (shown in Fig. 2), which may have two or more parallel paths. Two or more text features can therefore be input to the network at the same time, features are extracted through the nonlinear mapping of each path, and the features of the texts are combined after the feature layer to judge semantic relevance. This embodiment therefore introduces a Siamese network to extract the semantic similarity features of text pairs.
In the training phase the network has three branches (Fig. 2) with the same network structure. To extract common features and make the similarity judgment easier, the branches share weights, that is, the three branches have the same network parameters W. The Siamese network is divided into a front half and a rear half. The front half extracts text semantic features and consists of fully connected layers with the ReLU activation function, f(x) = max(0, x). To extract the semantics of the text better, the front half follows the feature hierarchy of the text itself from shallow to deep (word, phrase, sentence, paragraph, article and semantics) and is built as a network with 5 hidden layers (Fig. 3 shows only two of them; the others are similar). Every layer is fully connected, which both extracts text semantics and reduces the vector dimension.
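The weight-sharing front half described above can be sketched as follows. This is a minimal forward pass only, assuming hypothetical layer widths (`SIZES`) and random initial weights; the text fixes neither, and training by back-propagation is not shown.

```python
import random

def relu(v):
    """ReLU activation, f(x) = max(0, x), applied element-wise."""
    return [max(0.0, x) for x in v]

def init_layers(sizes, seed=0):
    """Create one (weights, bias) pair per fully connected layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((w, [0.0] * n_out))
    return layers

def forward(layers, x):
    """Pass a feature vector through every fully connected + ReLU layer."""
    h = x
    for w, b in layers:
        h = relu([sum(wi * hi for wi, hi in zip(row, h)) + bj
                  for row, bj in zip(w, b)])
    return h

# Hypothetical widths: a 32-dimensional input narrowed through 5 layers.
SIZES = [32, 24, 16, 12, 8, 4]
W = init_layers(SIZES)  # one parameter set W, shared by all branches

def embed_triplet(x, x_pos, x_neg):
    """The three branches apply the same parameters W to their own input."""
    return forward(W, x), forward(W, x_pos), forward(W, x_neg)
```

Because every branch calls `forward` with the same `W`, identical inputs always yield identical features, which is what makes distances between branch outputs comparable.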
The rear half measures similarity: it integrates the features extracted by the front half and then judges the similarity between samples with a metric function, whose choice is discussed below. Training samples are organised as follows. A sample is first chosen at random from the training data set as the anchor, denoted $x$; then a sample of the same class and a sample of a different class are chosen at random, called the positive (denoted $x^+$) and the negative (denoted $x^-$) respectively. These form one training triplet $(x, x^+, x^-)$, which is input to the Siamese network; the network parameters are optimized with the back-propagation (BP) algorithm so that the network output gradually approaches the true value, until the network converges.
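The way training triplets are organised can be sketched as below. The data layout (a list of `(sample, label)` pairs) and the function name `sample_triplet` are assumptions made for illustration.

```python
import random

def sample_triplet(data, rng=None):
    """Pick (anchor, positive, negative) from a list of (sample, label) pairs.

    The anchor is drawn at random; the positive shares its label and the
    negative has a different label. Anchors with no valid partner are skipped.
    """
    rng = rng or random
    order = list(range(len(data)))
    rng.shuffle(order)
    for i in order:
        anchor, label = data[i]
        positives = [s for j, (s, l) in enumerate(data) if l == label and j != i]
        negatives = [s for s, l in data if l != label]
        if positives and negatives:
            return anchor, rng.choice(positives), rng.choice(negatives)
    raise ValueError("need two classes, one of them with at least two samples")
```

Calling the function repeatedly yields the stream of triplets $(x, x^+, x^-)$ that would be fed to the three branches during training.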
In the Siamese network structure above, the semantic feature extraction network and the similarity discrimination network are connected in series and optimized at the same time, forming an end-to-end network. The feature extraction part is influenced by the discrimination part, so the extracted features help to discriminate the similarity of samples. Compared with optimizing the parts separately, the end-to-end method is a global optimization and performs better.
A more specific preferred solution: in step S3, after feature extraction the triplet yields the semantic feature representations of the three samples, denoted $G_w(x)$, $G_w(x^+)$ and $G_w(x^-)$. In the discrimination part of the network, a metric function based on the distance between feature vectors is constructed to judge the similarity of samples. Because the extracted feature vectors are high-dimensional, conventional measures such as the Euclidean distance do not reflect the distance between high-dimensional vectors well; the invention therefore constructs a text similarity measure based on the cosine of the angle between feature vectors:
$$D(x_i, x_j) = \cos(x_i, x_j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert} \quad \text{(Formula 1)}$$
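Formula 1 translates directly into code. The sketch below assumes plain Python lists as vectors and, as an added convention not stated in the text, returns 0.0 when either vector has zero length.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (Formula 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_u * norm_v)
```

A value near 1 means the two feature vectors point in nearly the same direction; a value near 0 means they are close to orthogonal.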
Measuring the similarity of a sample pair with a binary (pairwise) loss function maps all similar samples to a single point in the feature space, which requires the feature extraction function to respond identically to all similar samples and makes the network hard to train. This embodiment therefore uses a ternary (triplet) loss function for the discrimination task. After the feature vectors of the triplet have been obtained, the angle between the feature vectors of $x$ and $x^+$ is required to be as small as possible (that is, the value of $D(G_w(x), G_w(x^+))$ as large as possible), the angle between the semantic feature vectors of $x$ and $x^-$ is required to be as large as possible (that is, $D(G_w(x), G_w(x^-))$ as small as possible), and the distance between $x$ and $x^+$ must differ from the distance between $x$ and $x^-$ by at least a minimum margin $\alpha$, where $\alpha$ is a given hyperparameter. The final loss function is:
$$L = \max\bigl(0,\; D(G_w(x), G_w(x^-)) - D(G_w(x), G_w(x^+)) + \alpha\bigr)$$
Here $\alpha$ is a small number given at the start, such as 0.1. It is usually determined by the scale of the chosen distance metric; for the cosine used here, whose range is [0, 1] when the features are non-negative (as ReLU outputs are), 0.1 is chosen.
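A triplet loss with a cosine measure and margin can be sketched as below. The hinge form used here is the standard one implied by the requirements stated above (similarity to the positive should exceed similarity to the negative by at least α); it is an assumption as far as the exact expression in the original filing is concerned, and `triplet_loss` is a hypothetical name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def triplet_loss(f_anchor, f_pos, f_neg, alpha=0.1):
    """Hinge loss on cosine similarities with margin alpha.

    The loss is zero once cos(anchor, positive) exceeds
    cos(anchor, negative) by at least alpha.
    """
    d_pos = cosine_similarity(f_anchor, f_pos)
    d_neg = cosine_similarity(f_anchor, f_neg)
    return max(0.0, d_neg - d_pos + alpha)
```

Because the loss vanishes for any positive that is merely similar enough, rather than identical, similar samples are only pushed into a common region, which matches the motivation given for preferring the ternary loss over the binary one.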
Unlike the binary loss function, the ternary loss function aims to map similar samples to a region of the space, so that the intra-class distance is smaller than the inter-class distance, with the boundary between classes controlled by the fixed parameter $\alpha$. Mapping the feature vectors of same-class sample pairs to a region of the space rather than to a single point greatly reduces the complexity of the problem without impairing the similarity judgment itself, so both the performance and the generalization ability of the algorithm improve considerably.
A more specific preferred solution: in step S4, prediction needs only two paths of the network (the paths share weights, so it does not matter which two are chosen). The input samples are first processed according to S1 to extract the quantized vectors of the two texts under test, denoted input1 and input2; these are then input to the Siamese network model shown in Fig. 4. After the Siamese network has extracted the semantic features of the sample pair, they are passed to the discrimination network, which computes the cosine of the angle between the two vectors. A threshold ζ is set; when the computed value is greater than ζ the pair is judged similar, otherwise dissimilar.
The threshold ζ is given in advance. The value output by the network can be regarded as the probability that the text pair is similar; ζ specifies the probability above which the two texts are regarded as similar and depends on how strict a notion of similarity the specific task demands: the stricter the demand, the larger ζ can be set.
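The prediction step can be sketched as a single function. Here `embed` stands for the trained feature extraction path (any callable mapping a quantized vector to a feature vector), and the default `zeta=0.8` is a purely hypothetical threshold; the text leaves the value to the task at hand.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def judge_similar(embed, input1, input2, zeta=0.8):
    """Return True when the cosine of the two embedded inputs exceeds zeta."""
    score = cosine_similarity(embed(input1), embed(input2))
    return score > zeta
```

Both inputs go through the same `embed` callable, mirroring the use of two weight-sharing paths at test time.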
As shown in Fig. 5, this embodiment also provides a system for judging text similarity, comprising a controller that loads and executes the program corresponding to any of the above methods for judging text similarity.
Embodiment two
On the basis of Embodiment one, and in order to make the trained model more flexible, this embodiment allows
the three branches of the Siamese network to have different weights and different numbers of layers; that
is, the three mapping functions are independent of one another and are tied together only in the final
distance calculation. Content not described again here is the same as in Embodiment one. Specifically, as
shown in Fig. 6, in this embodiment the calculation of the included-angle cosine in S3 and S4 of Fig. 1
preferably uses a ternary metric function. The samples of the triplet (x+, x, x-) pass through feature
extraction with different network parameters W, giving the semantic feature expressions of the three
samples, denoted Gw1(x+), Gw2(x) and Gw3(x-) respectively.
When D(Gw1(x+),Gw2(x))-D(Gw3(x-),Gw2(x)) > α, the texts are judged to be similar; otherwise they are judged
to be dissimilar.
Here α is the minimum margin between the distance from x to x+ and the distance from x to x-, and is a
preset fixed parameter. For the value of α, refer to the description in Embodiment one; it is not repeated
here.
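The ternary judgment with independent branches can be sketched as follows. This is only an illustration under stated assumptions: the tanh activation, the list-of-matrices weight layout, and the names `dense_tanh`, `branch` and `triplet_is_similar` are hypothetical, and the only constraint taken from the text is that the three branches may differ in depth and weights while being combined in the final cosine calculation.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def dense_tanh(x: Sequence[float], weights: Matrix) -> List[float]:
    """One fully connected layer followed by tanh (the activation is an assumption)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def branch(x: Sequence[float], layers: List[Matrix]) -> List[float]:
    """Run one branch; each branch has its own depth and its own weights."""
    out = list(x)
    for weights in layers:
        out = dense_tanh(out, weights)
    return out


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Included-angle cosine D(u, v); returns 0.0 for a zero vector (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def triplet_is_similar(x_pos, x, x_neg, w1, w2, w3, alpha: float) -> bool:
    """Check D(Gw1(x+), Gw2(x)) - D(Gw3(x-), Gw2(x)) > alpha."""
    g_pos = branch(x_pos, w1)   # Gw1(x+)
    g = branch(x, w2)           # Gw2(x)
    g_neg = branch(x_neg, w3)   # Gw3(x-)
    return cosine(g_pos, g) - cosine(g_neg, g) > alpha
```

The branches only need to agree on the dimension of their outputs; everything before that point can differ between them.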
Correspondingly, in another aspect this embodiment also provides a system for judging text similarity. The
system includes a controller, which is used to load and execute the program corresponding to the Siamese
network structure in Fig. 6.
Embodiment three
Technical documents for software development, especially for software in the nuclear power field, must be
written in conformance with standards and specifications, so their titles are highly general and highly
similar to one another. If the whole text is processed uniformly when the vector space model is built, the
important information that titles contribute to classification is largely lost. The contribution of
paragraph titles to the text similarity measure should therefore be properly taken into account during
training and testing. As shown in Fig. 7, in this embodiment, preferably, a pair of texts is selected as
input in step S2 of Embodiment one or Embodiment two and denoted (xi,xj); each text is divided into two
parts, its paragraph titles and its body text, and the body texts and the titles of the two texts are
respectively merged as inputs.
Specifically, a pair of texts, which may be similar or dissimilar, is first chosen as input and denoted
(xi,xj). Each text is divided into its paragraph titles and its body text, and the body texts and the titles
of the two texts are respectively merged as inputs. The structure of the training model is shown in Fig. 7,
in which the decision layer network uses a fully connected layer structure and the activation function is
the Sigmoid function, that is, σ(x)=1/(1+exp(-x)); A ∈ (0,1),
x_Text=(xi_Text,xj_Text), x_Title=(xi_Title,xj_Title).
During testing, the same model is used and a threshold is set: when the output is greater than the
threshold, the texts are judged to be similar; otherwise they are judged to be dissimilar. For the
threshold, refer to the description of ζ in Embodiment one; it is not repeated here.
Note, however, that in the similarity judgment method of this embodiment, step S2 no longer "calculates the
distance between the features of a pair of samples extracted by the different branches"; instead, the
corresponding elements are directly connected with weights.
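One possible reading of the weighted connection followed by the Sigmoid decision layer is sketched below. The text does not specify how the weighting is applied, so the scheme here (scaling title features by `title_weight` and text features by `1 - title_weight` before concatenation), the single-output fully connected layer, and all function and parameter names are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable Sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def weighted_concat(text_feats: Sequence[float], title_feats: Sequence[float],
                    title_weight: float) -> List[float]:
    """Connect text and title features with weights (hypothetical weighting scheme)."""
    return ([(1.0 - title_weight) * t for t in text_feats]
            + [title_weight * t for t in title_feats])


def decision_layer(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Fully connected layer with one output unit and Sigmoid activation."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def judge(text_feats: Sequence[float], title_feats: Sequence[float],
          weights: Sequence[float], bias: float,
          title_weight: float = 0.5, threshold: float = 0.5) -> bool:
    """Judge the pair similar when the decision layer output exceeds the threshold."""
    features = weighted_concat(text_feats, title_feats, title_weight)
    return decision_layer(features, weights, bias) > threshold
```

In this sketch `text_feats` and `title_feats` stand for the features extracted from x_Text and x_Title; raising `title_weight` gives the paragraph titles more influence on the judgment.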
Correspondingly, this embodiment also provides a system for judging text similarity. The system includes a
controller, which is used to load and execute the program corresponding to the Siamese network structure in
Fig. 7.
The technical solution provided by this application can achieve at least one of the following beneficial
effects:
1. The neural network used is a series composition of linear transformations and simple nonlinear functions,
and the training stage uses classical stochastic gradient descent; therefore neither training nor testing
poses any computational difficulty.
2. Text is processed not merely at the level of words; document similarity is judged at the semantic level,
which handles the natural language problems present in text well and makes the judgment more accurate.
3. The similarity metric model trained with the Siamese network makes the distance between similar texts
smaller and the distance between dissimilar texts larger. It places no requirement on the category, number
or length of the texts, so it is well suited to decision problems with many categories and little sample
data for some of them.
4. Because the Siamese network is combined with a ternary loss function, the goal during network training is
to make the difference between features extracted from related texts as small as possible and the
difference between features extracted from unrelated texts as large as possible. Because the training
samples contain many data pairs, the trained network learns the ability to judge whether texts are related.
Since this ability is built on many varied data pairs, it generalizes to categories that did not appear in
training; this is the generalization ability of the neural network. Thus, even for categories absent from
the samples, the technical solution provided by this application can still judge the similarity of two
texts.
5. The ternary loss function maps similar samples to a region of the space rather than to a single point,
which reduces the complexity of the problem and substantially improves the generalization ability of the
algorithm.
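The ternary loss referred to in effects 4 and 5 has the form recited in claim 6, max{0, α - D(Gw(x), Gw(x+)) + D(Gw(x), Gw(x-))}, with D the included-angle cosine. A minimal sketch, assuming the three feature vectors are already available and using the hypothetical names `cosine` and `triplet_loss`:

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Included-angle cosine D(u, v); returns 0.0 for a zero vector (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], alpha: float) -> float:
    """Hinge-style ternary loss on cosine values.

    The loss is zero once the anchor is closer (in cosine) to the positive
    sample than to the negative sample by at least the margin alpha.
    """
    return max(0.0, alpha - cosine(anchor, positive) + cosine(anchor, negative))
```

Because the loss vanishes for every triplet that already satisfies the margin, similar samples are only pushed into a region around the anchor rather than onto a single point, which is the property described in effect 5.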
Finally, it should be noted that the above are only preferred embodiments of the present invention and do
not limit the invention in any form. Using the methods and technical content disclosed above, any person
skilled in the art may make many possible variations and simple substitutions to the technical solution of
the present invention without departing from its scope, and all of these fall within the scope of protection
of the technical solution of the present invention.
Claims (10)
- 1. A method for judging text similarity, characterized by comprising: S1, building a vector space model so that text is quantized into a processable object; S2, building a text semantic similarity extraction model using a Siamese network, wherein, in the Siamese network, the semantic feature extraction network and the similarity discrimination network are connected in series and optimized together in the sample training stage; S3, based on the semantic feature expressions of the training-stage samples, constructing a text similarity calculation function based on the included-angle cosine of the feature vectors, and a final loss function, so that the feature vectors of similar sample pairs are mapped to a certain region of the space; S4, inputting two texts to be tested, performing semantic feature extraction on the texts to be tested based on the Siamese network, calculating the cosine angle distance between the two vectors, and setting a threshold, wherein the texts are determined to be similar when the cosine angle distance between the two vectors is greater than the threshold, and dissimilar otherwise.
- 2. The method according to claim 1, characterized in that, in step S2, the structure of the Siamese network may have two or more parallel paths, so that two or more text features can be input at the same time; the Siamese network extracts features through the nonlinear mapping of each path, and the features of the multiple texts can then be combined after the feature layer for semantic relevance judgment.
- 3. The method according to claim 2, characterized in that, in step S2, the Siamese network is divided into a first half and a second half; the first half is used for extracting text semantic features and consists of fully connected layers; the second half is used for similarity measurement and integrates the features extracted by the first half of the network, either by calculating the distance between the features of a pair of samples extracted by the different branches or by directly connecting the corresponding elements with weights; the similarity of the samples is then discriminated by the cosine angle metric function in step S3.
- 4. The method according to claim 2 or 3, characterized in that the first half is used for extracting text semantic features, consists of fully connected layers, and covers words, phrases, sentences, paragraphs, articles and semantics, forming a network structure with 5 hidden layers.
- 5. The method according to claim 1, characterized in that the calculation of the included-angle cosine in S3 and S4 uses a ternary metric function; the samples of the triplet (x, x+, x-) pass through feature extraction with the same network parameters W, giving the semantic feature expressions of the three samples, denoted Gw(x), Gw(x+) and Gw(x-) respectively; the text similarity calculation function of the included-angle cosine is: D(xi,xj) = cos(xi,xj) = xi·xj/(|xi|·|xj|).
- 6. The method according to claim 5, characterized in that the loss function corresponding to the text similarity calculation function of the included-angle cosine is: $$l(x_i, x_i^+, x_i^-) = \max\{0,\ \alpha - D(G_w(x_i), G_w(x_i^+)) + D(G_w(x_i), G_w(x_i^-))\}.$$ Here α is the minimum margin between the distance from x to x+ and the distance from x to x-, and is a preset fixed parameter.
- 7. The method according to claim 1, characterized in that the calculation of the included-angle cosine in S3 and S4 uses a ternary metric function; the samples of the triplet (x+, x, x-) pass through feature extraction with different network parameters W, giving the semantic feature expressions of the three samples, denoted Gw1(x+), Gw2(x) and Gw3(x-) respectively; when D(Gw1(x+),Gw2(x))-D(Gw3(x-),Gw2(x)) > α, the texts are judged to be similar, and otherwise dissimilar; here α is the minimum margin between the distance from x to x+ and the distance from x to x-, and is a preset fixed parameter.
- 8. The method according to claim 7, characterized in that the sample x is a sample selected at random from the training data set, the sample x+ is a sample belonging to the same class as x, and the sample x- is a sample of a different class from x; the principle of sample training is: the angle between the feature vectors of x and x+ is required to be as small as possible, and the angle between the semantic feature vectors of x and x- as large as possible.
- 9. The method according to claim 1, characterized in that a pair of texts is selected as input in step S2 and denoted (xi,xj); each text is divided into two parts, its paragraph titles and its body text, and the body texts and the titles of the two texts are respectively merged as inputs.
- 10. A system for judging text similarity, characterized by comprising a controller, the controller being used to load and execute the program corresponding to the method for judging text similarity according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711088831.9A CN107967255A (en) | 2017-11-08 | 2017-11-08 | A kind of method and system for judging text similarity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107967255A (en) | 2018-04-27 |
Family
ID=62000824
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711088831.9A Pending CN107967255A (en) | 2017-11-08 | 2017-11-08 | A kind of method and system for judging text similarity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107967255A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105184778A (en) * | 2015-08-25 | 2015-12-23 | 广州视源电子科技股份有限公司 | Detection method and apparatus |
US20160350336A1 (en) * | 2015-05-31 | 2016-12-01 | Allyke, Inc. | Automated image searching, exploration and discovery |
CN106909625A (en) * | 2017-01-20 | 2017-06-30 | 清华大学 | A kind of image search method and system based on Siamese networks |
CN107292259A (en) * | 2017-06-15 | 2017-10-24 | 国家新闻出版广电总局广播科学研究院 | The integrated approach of depth characteristic and traditional characteristic based on AdaRank |
Non-Patent Citations (6)
Title |
---|
Florian Schroff et al.: "A unified embedding for face recognition and clustering", 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) * |
Po-Sen Huang: "Learning deep structured semantic models for web search using clickthrough data", Proceedings of the 22nd ACM International Conference on Information & Knowledge Management * |
Sumit Chopra et al.: "Learning a similarity metric discriminatively, with application to face verification", 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) * |
Liu Bo (刘博): "Research on subspace learning and its application in image set classification" (子空间学习及其在图像集分类中的应用研究), China Doctoral Dissertations Full-text Database, Information Science and Technology * |
Liu Yang (刘阳): "Research on LSTM-based recognition of English textual entailment" (基于LSTM的英文文本蕴含识别方法研究), China Master's Theses Full-text Database, Information Science and Technology * |
Pang Liang (庞亮) et al.: "A survey of deep text matching" (深度文本匹配综述), Chinese Journal of Computers (计算机学报) * |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110555198A (en) * | 2018-05-31 | 2019-12-10 | 北京百度网讯科技有限公司 | method, apparatus, device and computer-readable storage medium for generating article |
CN110555198B (en) * | 2018-05-31 | 2023-05-23 | 北京百度网讯科技有限公司 | Method, apparatus, device and computer readable storage medium for generating articles |
CN109214002A (en) * | 2018-08-27 | 2019-01-15 | 成都四方伟业软件股份有限公司 | A kind of transcription comparison method, device and its computer storage medium |
CN109145529A (en) * | 2018-09-12 | 2019-01-04 | 重庆工业职业技术学院 | A kind of text similarity analysis method and system for copyright authentication |
CN111723164A (en) * | 2019-03-18 | 2020-09-29 | 阿里巴巴集团控股有限公司 | Address information processing method and device |
CN111723164B (en) * | 2019-03-18 | 2023-12-12 | 阿里巴巴集团控股有限公司 | Address information processing method and device |
CN111738010B (en) * | 2019-03-20 | 2023-10-17 | 百度在线网络技术(北京)有限公司 | Method and device for generating semantic matching model |
CN111738010A (en) * | 2019-03-20 | 2020-10-02 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating semantic matching model |
CN110046240A (en) * | 2019-04-16 | 2019-07-23 | 浙江爱闻格环保科技有限公司 | In conjunction with the target domain question and answer method for pushing of keyword retrieval and twin neural network |
CN110309503A (en) * | 2019-05-21 | 2019-10-08 | 昆明理工大学 | A kind of subjective item Rating Model and methods of marking based on deep learning BERT--CNN |
CN110413988A (en) * | 2019-06-17 | 2019-11-05 | 平安科技(深圳)有限公司 | Method, apparatus, server and the storage medium of text information matching measurement |
CN110413988B (en) * | 2019-06-17 | 2023-01-31 | 平安科技(深圳)有限公司 | Text information matching measurement method, device, server and storage medium |
CN110348010A (en) * | 2019-06-21 | 2019-10-18 | 北京小米智能科技有限公司 | Synonymous phrase acquisition methods and device |
CN110348010B (en) * | 2019-06-21 | 2023-06-02 | 北京小米智能科技有限公司 | Synonymous phrase acquisition method and apparatus |
CN110597986A (en) * | 2019-08-16 | 2019-12-20 | 杭州微洱网络科技有限公司 | Text clustering system and method based on fine tuning characteristics |
CN110598066A (en) * | 2019-09-10 | 2019-12-20 | 民生科技有限责任公司 | Bank full-name rapid matching method based on word vector expression and cosine similarity |
CN110598066B (en) * | 2019-09-10 | 2022-05-10 | 民生科技有限责任公司 | Bank full-name rapid matching method based on word vector expression and cosine similarity |
CN110826338A (en) * | 2019-10-28 | 2020-02-21 | 桂林电子科技大学 | Fine-grained semantic similarity recognition method for single-choice gate and inter-class measurement |
CN110826338B (en) * | 2019-10-28 | 2022-06-17 | 桂林电子科技大学 | Fine-grained semantic similarity recognition method for single-selection gate and inter-class measurement |
CN110888920A (en) * | 2019-12-06 | 2020-03-17 | 北京中电普华信息技术有限公司 | Method and device for determining similarity of project functions |
CN111178084A (en) * | 2019-12-26 | 2020-05-19 | 厦门快商通科技股份有限公司 | Training method and device for improving semantic similarity |
CN111144129B (en) * | 2019-12-26 | 2023-06-06 | 成都航天科工大数据研究院有限公司 | Semantic similarity acquisition method based on autoregressive and autoencoding |
CN111144129A (en) * | 2019-12-26 | 2020-05-12 | 成都航天科工大数据研究院有限公司 | Semantic similarity obtaining method based on autoregression and self-coding |
CN111460401B (en) * | 2020-05-20 | 2023-08-22 | 南京大学 | Product automatic tracking method combining software product process information and text similarity |
CN111460401A (en) * | 2020-05-20 | 2020-07-28 | 南京大学 | Automatic product tracking method combining software product process information and text similarity |
CN111930929B (en) * | 2020-07-09 | 2023-11-10 | 车智互联(北京)科技有限公司 | Article title generation method and device and computing equipment |
CN111930929A (en) * | 2020-07-09 | 2020-11-13 | 车智互联(北京)科技有限公司 | Article title generation method and device and computing equipment |
CN112561904A (en) * | 2020-12-24 | 2021-03-26 | 凌云光技术股份有限公司 | Method and system for reducing false detection rate of AOI (argon oxygen decarburization) defects on display screen appearance |
CN112949319A (en) * | 2021-03-12 | 2021-06-11 | 江南大学 | Method, device, processor and storage medium for marking ambiguous words in text |
CN113221530A (en) * | 2021-04-19 | 2021-08-06 | 杭州火石数智科技有限公司 | Text similarity matching method and device based on circle loss, computer equipment and storage medium |
CN113221530B (en) * | 2021-04-19 | 2024-02-13 | 杭州火石数智科技有限公司 | Text similarity matching method and device, computer equipment and storage medium |
CN115630613A (en) * | 2022-12-19 | 2023-01-20 | 长沙冉星信息科技有限公司 | Automatic coding system and method for evaluation problems in questionnaire survey |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180427 |