CN106844331A - Sentence similarity calculation method and system - Google Patents

Sentence similarity calculation method and system

Info

Publication number
CN106844331A
CN106844331A
Authority
CN
China
Prior art keywords
text
shallow syntax tree
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611143723.2A
Other languages
Chinese (zh)
Inventor
杨萌
李培峰
朱巧明
周国栋
朱晓旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201611143723.2A
Publication of CN106844331A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention relates to a sentence similarity calculation method and system that use structural features to express the similarity of sentences. Starting from the shallow syntax tree, the invention obtains, through suitable modifications, structural features suited to sentence similarity calculation, and combines these structural features with flat features to calculate sentence similarity.

Description

Sentence similarity calculation method and system
Technical field
The present invention relates to the field of natural language processing, and in particular to a sentence similarity calculation method and system.
Background art
Similarity calculation is a fundamental task of natural language processing. Current sentence similarity calculation methods fall into four classes: word-overlap methods, corpus-based statistical methods, linguistics-based methods, and hybrid methods.
Word-overlap methods measure the similarity of two sentences through the vocabulary the sentences share. Jacob et al. [4] proposed the Jaccard similarity operator, which computes the similarity of two sentences as the ratio of the intersection of their word sets to the union of their word sets. Metzler et al. [5] improved the results by using inverse document frequency (IDF) to weight the words that occur in both sentences. Banerjee et al. [6] designed a phrase-based sentence similarity method based on phrase length and the Zipfian distribution of phrase usage frequencies.
Corpus-based methods take the set of words occurring in a sentence pair as the feature set and use the cosine of corpus-derived vectors as the similarity. Landauer et al. [7] form sentence semantic vectors from the TF-IDF values of keywords computed over a large natural-language database and compute sentence semantic similarity as the cosine of the vectors. Lund et al. [8] compute sentence or short-document similarity from a high-dimensional vector space obtained from statistics of lexical co-occurrence.
Linguistics-based methods determine the similarity of sentences from the semantic relations between words and their grammatical roles. Kashyap et al. [9] measure similarity between sentences based on word semantic similarity, taking into account the different discriminative power of words in a sentence-vector similarity calculation. Malik et al. [10] take the maximum of the summed similarities between the words composing a sentence pair and normalize the resulting value by sentence length to obtain the sentence similarity value.
Hybrid methods combine the approaches above; Chukfong et al. [11-14] realize sentence similarity calculation based on several of the methods above.
There is relatively little work on sentence similarity evaluation based on structured representations; Aliaksei [15] proposed a calculation method based on simple structural representations.
Existing sentence similarity patents:
A semantics-based similarity calculation method and device: this invention provides a semantics-based similarity calculation method and device. The method includes: obtaining the sentences S1 and S2 to be compared; segmenting S1 and S2 into words; mapping each segmented word that has a semantic mapping to a normalized expression; and computing the similarity Sim(S1, S2) between S1 and S2 after the processing of the preceding steps. By mapping the words with semantic mappings to normalized expressions and incorporating them into the similarity calculation, the invention captures the similarity between sentences at the semantic rather than the purely literal level, improving the accuracy of inter-sentence similarity calculation.
Sentence similarity calculation method and device: this invention provides a high-accuracy sentence similarity calculation method and device. The method includes: for a first sentence and a second sentence, determining the repeated words, the first isolated words, and the second isolated words, where a repeated word belongs to both sentences, a first isolated word belongs only to the first sentence, and a second isolated word belongs only to the second sentence; computing a total contribution value G_total of isolated-word similarity from all first and second isolated words, where G_total >= 0 and G_total grows with the degree of similarity between the first and second isolated words; and computing SIM(A, B) by formula, where SIM(A, B) denotes the similarity of the first and second sentences, computed from vectors corresponding to the two sentences.
A sentence similarity calculation method and system: this invention provides a sentence similarity calculation method and system. A pre-built corpus is trained with the word2vec algorithm to obtain vectors for all words in the corpus; the two sentences whose similarity is to be calculated are segmented into words, the vector of each segmented word of the first and second sentence is looked up in the corpus, and the similarity between each word of the first sentence and each word of the second sentence is computed in turn; the word pairs whose similarity exceeds a preset threshold are collected, and the contribution of each word pair to whole-sentence similarity is computed according to the offset between the words' positions in their sentences; the contributions of the word pairs of the two sentences are summed to obtain the similarity between the sentences.
Most existing sentence similarity calculation methods use a large number of flat features to represent the degree of similarity of a sentence pair. The problem with representing sentence-pair similarity using only a flat feature vector is its weak representational power.
Some recent similarity calculation methods rely on word collocations and on knowledge obtained from big data (Wikipedia, etc.), and do not consider structured information such as sentence syntax. Given two sentences S1 and S2, these methods typically proceed as follows: first, each word in S1 is paired with the word in S2 most similar to it; second, the similarities of all paired words are summed and normalized by the length of S1, yielding the similarity of S1 and S2.
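A minimal sketch of this pairing scheme (the trivial identical-word similarity function is an illustrative assumption):

```python
def naive_similarity(s1, s2, word_sim=lambda a, b: float(a == b)):
    """Greedy word-pairing similarity: pair each word of s1 with its most
    similar word in s2, sum the pair similarities, normalize by len(s1).
    Ignores word order and syntactic structure entirely."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    return sum(max(word_sim(a, b) for b in w2) for a in w1) / len(w1)

# Agent and patient are swapped, yet the score is a perfect 1.0.
print(naive_similarity("Tigers hit lions", "Lions hit tigers"))
```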
Now consider the sentence pair S1: "Tigers hit lions" and S2: "Lions hit tigers". With the method described above, every word in S1 finds a highly similar word in S2 to pair with (in this example, an identical word), so the similarity calculation concludes that the two sentences have the same meaning. As shown in Fig. 1, however, analyzing the dependency trees of S1 and S2 reveals that their agents and patients are reversed. Although the words occurring in the two sentences are identical, analyzing their dependency trees shows that their meanings differ.
Structured information such as syntactic structure is very important in natural language processing applications, but how to exploit structured information is a common problem across tasks. When structured features are represented with a flat feature vector, part of the effective information may be lost in converting the structured features into flat features.
In view of these defects, the inventors, through active research and innovation, propose a new structured representation method on the basis of calculation methods using simple structural representations; it is applied to sentence similarity calculation so as to capture sentence syntax, semantics, and dependency relations.
Explanation of terms:
Pearson correlation coefficient (Pearson Correlation Coefficient): measures the (linear) correlation between two variables X and Y; its value lies between -1 and 1. In the natural sciences, the coefficient is widely used to measure the degree of correlation between two variables.
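A minimal sketch of the coefficient as it is used here, correlating hypothetical system scores with hypothetical human judgments:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

system_scores = [4.2, 0.8, 3.1, 2.0]   # hypothetical system output
human_scores = [4.5, 1.0, 3.4, 1.8]    # hypothetical gold judgments
print(pearson(system_scores, human_scores))   # close to 1.0
```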
Support vector regression model (Support Vector Regression, SVR): after mapping the data to a higher-dimensional space, a linear decision function is constructed in that space to realize regression, its basis being the epsilon-insensitive loss function and kernel functions. If the fitted model is viewed as a curve in a high-dimensional space, the epsilon-insensitive loss yields an "epsilon tube" containing the curve and the training points; among all sample points, only the part lying on the tube wall determines the position of the tube, and these training samples are called support vectors. To accommodate non-linearity in the training set, traditional fitting methods typically add higher-order terms to a linear equation; this is effective, but the added adjustable parameters increase the risk of over-fitting. Support vector regression resolves this contradiction with kernel functions: replacing the linear term of the linear equation with a kernel function "non-linearizes" the originally linear algorithm, enabling non-linear regression, while simultaneously achieving the goal of "raising the dimension" and keeping the over-fitting risk of the added adjustable parameters under control.
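A minimal sketch of epsilon-insensitive regression using scikit-learn's SVR (an assumed tooling choice; the patent names no implementation), with toy feature vectors and scores:

```python
from sklearn.svm import SVR

# Toy flat-feature vectors for three sentence pairs and their human scores.
X_train = [[0.9, 0.8, 0.7], [0.1, 0.2, 0.1], [0.5, 0.4, 0.6]]
y_train = [4.8, 0.5, 2.7]

# epsilon sets the half-width of the insensitive tube: training points
# inside the tube incur no loss; points on its wall are support vectors.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)
print(model.predict([[0.8, 0.7, 0.9]]))   # predicted similarity score
```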
Kernel methods (Kernel Methods): implicitly define a mapping from a low-dimensional space to a high-dimensional space under which two classes of points that are linearly inseparable in the low-dimensional space become linearly separable; used in support vector machines.
Tree kernel methods (Tree Kernel Methods): compare similarity by directly computing the number of identical subtrees of two structured objects (i.e., syntax trees).
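A deliberately simplified sketch that counts only complete identical subtrees; practical tree kernels (e.g., the subset-tree kernel) also count partial productions and apply a decay factor:

```python
from itertools import product

def subtrees(tree):
    """Yield every node-rooted subtree of a tree encoded as nested tuples
    (label, child, child, ...); a leaf is a 1-tuple (label,)."""
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)

def common_subtrees(t1, t2):
    """Naive tree-kernel value: the number of identical subtree pairs."""
    return sum(a == b for a, b in product(subtrees(t1), subtrees(t2)))

t1 = ("ROOT", ("NN", ("tigers",)), ("VB", ("hit",)), ("NN", ("lions",)))
t2 = ("ROOT", ("NN", ("lions",)), ("VB", ("hit",)), ("NN", ("tigers",)))
print(common_subtrees(t1, t2))  # shared leaf and POS subtrees, not ROOT
```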
Named entity recognition (Named Entity Recognition, NER): also called "proper name recognition", refers to identifying entities with specific meaning in text, mainly person names, place names, organization names, proper nouns, etc.
WordNet: an English lexical database based on cognitive linguistics, co-designed by psychologists, linguists, and computer engineers at Princeton University. Rather than merely listing words alphabetically, it organizes words by meaning into a "network of words"; it is a broad-coverage English lexical semantic net. Nouns, verbs, adjectives, and adverbs are each organized into networks of synonyms; each synonym set represents a basic semantic concept, and the sets are connected to one another by various relations.
Tree: a tree is a data structure consisting of a set of n (n >= 1) finite nodes with a hierarchical relationship. It is called a "tree" because it looks like an upside-down tree, that is, with the root upward and the leaves downward. It has the following properties: each node has zero or more child nodes; the node without a parent is called the root node; every non-root node has exactly one parent; and apart from the root node, the remaining nodes can be partitioned into multiple disjoint subtrees.
N-gram model: the N-gram is a language model commonly used in large-vocabulary continuous recognition; using the collocation information between adjacent words in context, the model can compute the sentence with maximal probability.
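A minimal sketch of the idea behind an n-gram model: a maximum-likelihood bigram estimate over a toy corpus (corpus and probability are illustrative only):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = "the cat sat on the mat".split()
bigram_counts = Counter(ngrams(corpus, 2))
unigram_counts = Counter(ngrams(corpus, 1))

# Maximum-likelihood estimate of P(sat | cat) from adjacent-word counts.
print(bigram_counts[("cat", "sat")] / unigram_counts[("cat",)])
```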
Bibliography
[1] Culotta A, Sorensen J. Dependency tree kernels for relation extraction[C]//Meeting of the Association for Computational Linguistics, 21-26 July 2004, Barcelona, Spain: 423-429.
[2] Bunescu R C, Mooney R J. A shortest path dependency kernel for relation extraction[C]//Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005: 724-731.
[3] Zhang M, Zhang J, Su J, et al. A composite kernel to extract relations between entities with both flat and structured features[C]//International Conference on Computational Linguistics and Meeting of the Association for Computational Linguistics, 17-21 July 2006, Sydney, Australia.
[4] Jacob B, Benjamin C. Calculating the Jaccard similarity coefficient with MapReduce for entity pairs in Wikipedia[OL]. http://www.infosci.cornell.edu/weblab/papers/Bank2008.pdf, 2008.
[5] Metzler D, Bernstein Y, Croft W B, et al. Similarity measures for tracking information flow[C]//ACM CIKM International Conference on Information and Knowledge Management, Bremen, Germany, October-November 2005: 517-524.
[6] Banerjee S, Pedersen T. Extended gloss overlap as a measure of semantic relatedness[C]//Proceedings of the International Joint Conference on Artificial Intelligence, 2003: 805-810.
[7] Landauer T K, Foltz P W, Laham D. Introduction to latent semantic analysis[J]. Discourse Processes, 1998, 25(2/3): 259-284.
[8] Lund K, Burgess C. Producing high-dimensional semantic spaces from lexical co-occurrence[J]. Behavior Research Methods, Instruments & Computers, 1996, 28(2): 203-208.
[9] Kashyap A, Han L, Yus R, et al. Robust semantic text similarity using LSA, machine learning, and linguistic resources[J]. Language Resources & Evaluation, 2016, 50(1): 125-161.
[10] Malik R, Subramaniam L V, Kaushik S. Automatically selecting answer templates to respond to customer emails[C]//Proceedings of the International Joint Conference on Artificial Intelligence, Hyderabad, India, January 2007: 1659-1664.
[11] Jaffe E, Jin L, King D, et al. AZMAT: sentence similarity using associative matrices[C]//International Workshop on Semantic Evaluation, 2015.
[12] Haque R, Naskar S K, Way A, et al. Sentence similarity-based source context modelling in PBSMT[C]//International Conference on Asian Language Processing, IALP 2010, Harbin, Heilongjiang, China, 28-30 December 2010: 257-260.
[13] Li R, Li S, Zhang Z. The semantic computing model of sentence similarity based on Chinese FrameNet[C]//2009 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. IEEE Computer Society, 2009: 255-258.
[14] Liu Y, Liu Q. Chinese sentence similarity based on multi-feature combination[C]//WRI Global Congress on Intelligent Systems, GCIS 2009: 14-19.
[15] Severyn A, Nicosia M, Moschitti A. Learning semantic textual similarity with structural representations[C]//Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, 2013: 714-718.
Summary of the invention
To solve the above technical problems, the object of the present invention is to provide a sentence similarity calculation method and system which, on the basis of the shallow syntax tree, proposes a new syntax tree structure whose structured features better express information such as the syntax and semantics of a sentence, and applies this syntax tree to sentence similarity calculation to obtain good performance.
The sentence similarity calculation method of the present invention is characterized by comprising the steps of:
S10: for all sentences in the sentence-pair training text and the sentence-pair test text, invoke part-of-speech tagging, syntactic analysis, named entity recognition, and WordNet recognition tools to perform part-of-speech tagging, syntactic analysis, named entity recognition, and WordNet recognition respectively, obtaining the part-of-speech-tagged training text, phrase training text, named entity training text, and WordNet training text, and the part-of-speech-tagged test text, phrase test text, named entity test text, and WordNet test text,
where the sentence-pair training text and the sentence-pair test text are texts in which every line contains two sentences whose similarity is to be calculated;
S20: obtain the shallow syntax tree training text from the part-of-speech-tagged training text, phrase training text, named entity training text, and WordNet training text,
and obtain the shallow syntax tree test text from the part-of-speech-tagged test text, phrase test text, named entity test text, and WordNet test text;
S30: obtain multiple flat features for each line's sentence pair in the sentence-pair training text to produce the flat feature training text, and combine the flat feature training text and the shallow syntax tree training text with the sentence pairs' human-scored training text to obtain the shallow syntax tree feature training text,
and obtain multiple flat features for each line's sentence pair in the sentence-pair test text to produce the flat feature test text, combining the flat feature test text with the shallow syntax tree test text to obtain the shallow syntax tree feature test text;
S40: train an SVR model on the shallow syntax tree feature training text to obtain the trained model, and obtain the similarity calculation result text from the trained model and the shallow syntax tree feature test text.
Further, the detailed procedure of step S10 is as follows:
S101: apply a part-of-speech tagging tool (e.g., Stanford POS tagger) to all sentences in the sentence-pair training text to obtain the part of speech of each word in each sentence, producing the corresponding part-of-speech-tagged training text;
the same processing applied to the sentence-pair test text yields the part-of-speech-tagged test text;
S102: apply a syntactic analysis tool (e.g., Stanford parser) to all sentences in the sentence-pair training text to obtain the phrase each word belongs to, producing the phrase training text;
the same processing applied to the sentence-pair test text yields the phrase test text;
S103: apply a named entity recognition tool (e.g., SST-light tagger) to the sentence-pair training text to obtain the named entity label of each word, producing the named entity training text;
the same processing applied to the sentence-pair test text yields the named entity test text;
S104: apply a WordNet recognition tool (e.g., SST-light tagger) to the sentence-pair training text to obtain the WordNet supersense (WNSS) of each word, a word without a WordNet supersense being represented by a space, producing the WordNet training text;
the same processing applied to the sentence-pair test text yields the WordNet test text.
Further, the detailed procedure of step S20 is as follows:
S201: according to the part-of-speech-tagged training text, construct a shallow syntax tree for each sentence in the sentence-pair training text, obtaining the basic shallow syntax tree training text; obtain the basic shallow syntax tree test text from the sentence-pair test text and the part-of-speech-tagged test text;
here, a shallow syntax tree is a tree of depth 3, constructed as follows: the words of a sentence become the bottom leaf nodes; the part of speech of each leaf node becomes that leaf node's parent; finally, the parent of all part-of-speech nodes is set to the root node, as in the sketch below;
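A minimal sketch of the depth-3 construction; the nested tree encoding and the tagger output format are assumptions for illustration, and the sentence is Example 1 from the embodiment below:

```python
def basic_shallow_tree(tagged):
    """Depth-3 shallow syntax tree: ROOT -> POS parents -> word leaves.
    Trees are (label, [children]); a leaf is a bare word string.
    `tagged` is a list of (word, pos) pairs from a POS tagger."""
    return ("ROOT", [(pos, [word]) for word, pos in tagged])

tagged = [("a", "DT"), ("woman", "NN"), ("and", "CC"), ("man", "NN"),
          ("are", "VBP"), ("dancing", "VBG"), ("in", "IN"),
          ("the", "DT"), ("rain", "NN")]
print(basic_shallow_tree(tagged))
```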
S202: according to the phrase training text, construct a deeper shallow syntax tree for each sentence in the basic shallow syntax tree training text, obtaining the phrase shallow syntax tree training text; obtain the phrase shallow syntax tree test text from the phrase test text and the basic shallow syntax tree test text;
here, the deeper shallow syntax tree is a tree of depth 4, constructed as follows: from the sentence's phrase chunking result, determine which words belong to the same phrase; connect the part-of-speech parents of the leaf nodes of words belonging to the same phrase to a common chunker node; sever the links between the root node and the part-of-speech nodes and connect the chunker nodes to the corresponding part-of-speech nodes; finally, set the parent of all chunker nodes to the root node, as sketched below;
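A minimal sketch of the depth-4 construction, extending the encoding above; the (word, pos, chunk_id, chunk_label) input format is an assumption standing in for a chunker's output:

```python
from itertools import groupby

def phrase_shallow_tree(rows):
    """Depth-4 tree: ROOT -> chunker nodes -> POS nodes -> word leaves.
    `rows` is a list of (word, pos, chunk_id, chunk_label); consecutive
    rows with the same chunk_id belong to the same phrase."""
    chunks = []
    for (_, label), grp in groupby(rows, key=lambda r: (r[2], r[3])):
        chunks.append((label, [(pos, [word]) for word, pos, _, _ in grp]))
    return ("ROOT", chunks)

rows = [("a", "DT", 0, "NP"), ("woman", "NN", 0, "NP"),
        ("and", "CC", 0, "NP"), ("man", "NN", 0, "NP"),
        ("are", "VBP", 1, "VP"), ("dancing", "VBG", 1, "VP"),
        ("in", "IN", 2, "PP"), ("the", "DT", 3, "NP"), ("rain", "NN", 3, "NP")]
print(phrase_shallow_tree(rows))
```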
S203: obtain the semantic shallow syntax tree training text from the phrase shallow syntax tree training text, the named entity training text, and the WordNet training text; obtain the semantic shallow syntax tree test text from the phrase shallow syntax tree test text, the named entity test text, and the WordNet test text;
the semantic shallow syntax tree training text adds semantic information to the phrase shallow syntax tree training text as follows: if a word of the phrase shallow syntax tree training text has NER or WNSS information in the named entity training text or WordNet training text, the syntactic label of the chunker node containing that word is changed to the NER or WNSS information; if several words within one phrase node satisfy this condition, the NER and WNSS information of the last such word of the phrase is used (a sketch of this relabeling follows);
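A minimal sketch of the relabeling, reusing the tree encoding above; the word-to-label mapping is a hypothetical stand-in for the NER/WNSS texts:

```python
def add_semantics(tree, sem_info):
    """Relabel chunker nodes with NER/WNSS information: if any word under
    a chunker node appears in `sem_info` (word -> NER or WNSS label), the
    chunker label is replaced by the label of the last such word."""
    chunks = []
    for chunk_label, pos_nodes in tree[1]:
        for _, words in pos_nodes:
            for w in words:
                if w in sem_info:
                    chunk_label = sem_info[w]   # last match wins
        chunks.append((chunk_label, pos_nodes))
    return (tree[0], chunks)

# e.g. add_semantics(tree, {"woman": "E:PER_DESC", "man": "E:PER_DESC",
#                           "rain": "weather"})
```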
S204: delete the nodes related to definite articles and conjunctions from the semantic shallow syntax tree training text, obtaining the pruned shallow syntax tree training text; delete the definite-article- and conjunction-related nodes from the semantic shallow syntax tree test text, obtaining the pruned shallow syntax tree test text;
the present invention removes definite articles and conjunctions, together with their parent (part-of-speech) nodes, from the structured representation, reducing the influence of insignificant information on the calculation result: if the word represented by a leaf node of a shallow syntax tree is an article or a conjunction, that leaf node is deleted together with its parent (part-of-speech) node and grandparent (chunker) node, as in the sketch below;
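A minimal sketch of the pruning step under one conservative reading: the patent deletes the leaf, its part-of-speech parent, and its chunker grandparent; here the chunker node is dropped only when all of its children are removed, so that other words sharing the phrase survive:

```python
PRUNED_POS = {"DT", "CC"}  # definite articles and conjunctions

def prune(tree):
    """Remove article/conjunction leaves together with their POS parents,
    and drop any chunker node left without children."""
    root, chunks = tree
    kept_chunks = []
    for chunk_label, pos_nodes in chunks:
        kept = [(pos, words) for pos, words in pos_nodes
                if pos not in PRUNED_POS]
        if kept:  # chunker node survives only if some child survives
            kept_chunks.append((chunk_label, kept))
    return (root, kept_chunks)

# e.g. prune(phrase_shallow_tree(rows)) drops "a", "and", "the"
# and their part-of-speech parents.
```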
S205: based on the pruned shallow syntax tree training text, associate the related parts of the shallow syntax trees of each sentence pair to obtain the shallow syntax tree training text; based on the pruned shallow syntax tree test text, associate the related parts of the shallow syntax trees of each sentence pair to obtain the shallow syntax tree test text;
the corresponding shallow syntax trees of a sentence pair are associated as follows: if a word is identical in the two sentences (identical leaf nodes), take its parent (part-of-speech) node and grandparent node, which are non-terminal nodes, and mark REL on them, for example as sketched below.
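A minimal sketch of the association step; the "REL-" prefix is an assumed label convention (the patent only says the nodes are marked with REL):

```python
def leaf_words(tree):
    """All leaf words of a ROOT -> chunker -> POS -> word tree."""
    return {w for _, pos_nodes in tree[1] for _, ws in pos_nodes for w in ws}

def mark_rel(tree, shared):
    """Prefix 'REL-' to the POS parent and chunker grandparent of every
    leaf word that occurs in both sentences."""
    chunks = []
    for chunk_label, pos_nodes in tree[1]:
        hit, nodes = False, []
        for pos, words in pos_nodes:
            if any(w in shared for w in words):
                pos, hit = "REL-" + pos, True
            nodes.append((pos, words))
        chunks.append(("REL-" + chunk_label if hit else chunk_label, nodes))
    return (tree[0], chunks)

# shared = leaf_words(tree1) & leaf_words(tree2); apply mark_rel to both.
```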
Further, the detailed procedure of step S30 is as follows:
S301: obtain the flat feature training text from the sentence-pair training text, and the flat feature test text from the sentence-pair test text;
here, the flat feature training text and the flat feature test text contain, respectively, the flat similarity features of every line's sentence pair in the sentence-pair training text and in the sentence-pair test text;
the present invention provides 11 flat features (a minimal sketch of two of them follows the list):
1. Longest common substring: the length of the longest contiguous common character sequence;
2. Longest common subsequence: unlike the longest common substring, contiguity is not required, so similarity can be computed despite word insertions or deletions;
3. Greedy string tiling: allows out-of-order text when computing common similar substrings, taking the maximum length matched by each substring;
4-7. Sentence edit distances based on character n-grams: for a sentence s, its character n-grams are the segments obtained by cutting the string into pieces of length N, i.e., all substrings of s of length N; given two strings, their n-grams are extracted and the n-gram distance between them is defined from the number of substrings they share; N = 1, 2, 3, 4, giving four features;
8-11. Sentence edit distances based on word n-grams: analogous to features 4-7, with the word instead of the character as the segmentation unit;
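A sketch of feature 1 and of the character n-gram overlap underlying features 4-7; the union normalization is one plausible choice, since the patent does not fix a formula:

```python
def longest_common_substring(a, b):
    """Feature 1: length of the longest contiguous common character run."""
    best, prev = 0, [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ch == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def char_ngram_overlap(a, b, n):
    """Features 4-7 (n = 1..4): shared character n-grams of two strings,
    normalized by the n-gram union. Splitting on spaces first and using
    token tuples gives the word-n-gram variants, features 8-11."""
    ga = {a[i:i + n] for i in range(len(a) - n + 1)}
    gb = {b[i:i + n] for i in range(len(b) - n + 1)}
    return len(ga & gb) / len(ga | gb) if ga or gb else 0.0

s1, s2 = "tigers hit lions", "lions hit tigers"
print(longest_common_substring(s1, s2), char_ngram_overlap(s1, s2, 3))
```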
S302: obtain the shallow syntax tree feature training text from the flat feature training text and the shallow syntax tree training text; obtain the shallow syntax tree feature test text from the flat feature test text and the shallow syntax tree test text;
here, each line of the shallow syntax tree feature training text and the shallow syntax tree feature test text contains the corresponding sentence pair's eleven flat features and pair of shallow syntax tree features.
Further, the detailed procedure of step S40 is as follows:
S401: use SVR as the similarity calculation method, training on the shallow syntax tree feature training text in the SVR model to obtain the trained model;
S402: taking the trained model and the shallow syntax tree feature test text as input, obtain the similarity calculation result text using the SVR tool;
here, each line's value in the similarity calculation result text is the similarity calculation result of the corresponding line's sentence pair in the test text; a composite-kernel sketch follows.
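The patent does not specify how the tree features enter the SVR; one common route for combining flat features with tree kernels (as in composite-kernel work such as [3]) is a precomputed Gram matrix. A minimal sketch with scikit-learn and toy numbers (all values are stand-ins):

```python
import numpy as np
from sklearn.svm import SVR

# Composite-kernel sketch: a linear kernel over the 11 flat features plus
# a tree-kernel Gram matrix between the joint shallow syntax trees. The
# tree-kernel values below are placeholders; in a real system they would
# come from a tree-kernel toolkit.
rng = np.random.default_rng(0)
flat = rng.random((4, 11))              # flat features of 4 training pairs
K_flat = flat @ flat.T                  # linear kernel over flat features
K_tree = np.eye(4)                      # placeholder tree-kernel values
y = np.array([4.5, 1.0, 3.2, 0.4])      # human similarity scores

model = SVR(kernel="precomputed", C=1.0, epsilon=0.1)
model.fit(K_flat + K_tree, y)
# At test time the Gram matrix is n_test x n_train, computed the same way.
print(model.predict(K_flat + K_tree))
```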
The sentence similarity calculation system of the present invention comprises:
- a preprocessing module, which invokes part-of-speech tagging, syntactic analysis, named entity recognition, and WordNet recognition tools on all sentences of the sentence-pair training text and the sentence-pair test text to perform part-of-speech tagging, syntactic analysis, named entity recognition, and WordNet recognition respectively, obtaining the part-of-speech-tagged training text, phrase training text, named entity training text, and WordNet training text, and the part-of-speech-tagged test text, phrase test text, named entity test text, and WordNet test text;
- a structured feature module based on the shallow syntax tree, which obtains the shallow syntax tree training text from the part-of-speech-tagged training text, phrase training text, named entity training text, and WordNet training text,
and obtains the shallow syntax tree test text from the part-of-speech-tagged test text, phrase test text, named entity test text, and WordNet test text;
- a feature set module of the shallow-syntax-tree-based similarity calculation, which obtains multiple flat features for each line's sentence pair in the sentence-pair training text to produce the flat feature training text, and combines the flat feature training text and the shallow syntax tree training text with the sentence pairs' human-scored training text to obtain the shallow syntax tree feature training text;
it likewise obtains multiple flat features for each line's sentence pair in the sentence-pair test text to produce the flat feature test text, and combines the flat feature test text with the shallow syntax tree test text to obtain the shallow syntax tree feature test text;
- a similarity calculation module based on the shallow syntax tree, which trains an SVR model on the shallow syntax tree feature training text to obtain the trained model and obtains the similarity calculation result text from the trained model and the shallow syntax tree feature test text.
Further, the preprocessing module specifically comprises:
- a part-of-speech tagging unit, which applies a part-of-speech tagging tool (e.g., Stanford POS tagger) to all sentences of the sentence-pair training text to obtain the part of speech of each word, producing the corresponding part-of-speech-tagged training text;
the same processing applied to the sentence-pair test text yields the part-of-speech-tagged test text;
- a phrase tagging unit, which applies a syntactic analysis tool (e.g., Stanford parser) to all sentences of the sentence-pair training text to obtain the phrase each word belongs to, producing the phrase training text; the same processing applied to the sentence-pair test text yields the phrase test text;
- a named entity recognition unit, which applies a named entity recognition tool (e.g., SST-light tagger) to the sentence-pair training text to obtain the named entity label of each word, producing the named entity training text; the same processing applied to the sentence-pair test text yields the named entity test text;
- a WordNet recognition unit, which applies a WordNet recognition tool (e.g., SST-light tagger) to the sentence-pair training text to obtain the WordNet supersense (WNSS) of each word, a word without a supersense being represented by a space, producing the WordNet training text; the same processing applied to the sentence-pair test text yields the WordNet test text.
Further, the structured feature module based on the shallow syntax tree specifically comprises:
- a basic shallow syntax tree unit, which constructs a shallow syntax tree for each sentence of the sentence-pair training text according to the part-of-speech-tagged training text, obtaining the basic shallow syntax tree training text, and obtains the basic shallow syntax tree test text from the sentence-pair test text and the part-of-speech-tagged test text;
here, a shallow syntax tree is a tree of depth 3, constructed as follows: the words of a sentence become the bottom leaf nodes; the part of speech of each leaf node becomes that leaf's parent node; finally, the parent of all part-of-speech nodes is set to the root node;
- a phrase-level shallow syntax tree unit, which according to the phrase training text constructs a deeper shallow syntax tree for each sentence in the basic shallow syntax tree training text, obtaining the phrase shallow syntax tree training text, and obtains the phrase shallow syntax tree test text from the phrase test text and the basic shallow syntax tree test text;
here, the deeper shallow syntax tree is a tree of depth 4, constructed as follows: from the sentence's phrase chunking result, determine which words belong to the same phrase; connect the part-of-speech parents of the leaves of words belonging to the same phrase to a common chunker node; sever the links between the root node and the part-of-speech nodes and connect the chunker nodes to the corresponding part-of-speech nodes; finally, set the parent of all chunker nodes to the root node;
- a semantic shallow syntax tree unit, which obtains the semantic shallow syntax tree training text from the phrase shallow syntax tree training text, the named entity training text, and the WordNet training text, and the semantic shallow syntax tree test text from the phrase shallow syntax tree test text, the named entity test text, and the WordNet test text;
the semantic shallow syntax tree training text adds semantic information to the phrase shallow syntax tree training text as follows: if a word of the phrase shallow syntax tree training text has NER or WNSS information in the named entity training text or WordNet training text, the syntactic label of the chunker node containing that word is changed to the NER or WNSS information; if several words within one phrase node satisfy this condition, the NER and WNSS information of the last such word of the phrase is used;
- a pruning unit for deleting insignificant information, which deletes the nodes related to definite articles and conjunctions from the semantic shallow syntax tree training text, obtaining the pruned shallow syntax tree training text, and deletes them from the semantic shallow syntax tree test text, obtaining the pruned shallow syntax tree test text;
the present invention removes definite articles and conjunctions, together with their parent (part-of-speech) nodes, from the structured representation to reduce the influence of insignificant information on the calculation result: if the word represented by a leaf node of a shallow syntax tree is an article or a conjunction, that leaf is deleted together with its parent (part-of-speech) node and grandparent (chunker) node;
- a joint sentence-pair representation unit, which based on the pruned shallow syntax tree training text associates the related parts of the shallow syntax trees of each sentence pair to obtain the shallow syntax tree training text, and based on the pruned shallow syntax tree test text associates the related parts of the shallow syntax trees of each sentence pair to obtain the shallow syntax tree test text;
the corresponding shallow syntax trees of a sentence pair are associated as follows: if a word is identical in the two sentences, its parent and grandparent nodes (non-terminal nodes) in both trees are marked with REL.
Further, the feature set module of the shallow-syntax-tree-based similarity calculation specifically comprises:
- a flat feature set unit, which obtains the flat feature training text from the sentence-pair training text and the flat feature test text from the sentence-pair test text;
here, the flat feature training text and the flat feature test text contain, respectively, the flat similarity features of every line's sentence pair in the sentence-pair training text and in the sentence-pair test text;
the present invention provides 11 flat features:
1. Longest common substring: the length of the longest contiguous common character sequence;
2. Longest common subsequence: unlike the longest common substring, contiguity is not required, so similarity can be computed despite word insertions or deletions;
3. Greedy string tiling: allows out-of-order text when computing common similar substrings, taking the maximum length matched by each substring;
4-7. Sentence edit distances based on character n-grams: for a sentence s, its character n-grams are all substrings of s of length N obtained by cutting the string into pieces of length N; given two strings, their n-grams are extracted and the n-gram distance between them is defined from the number of substrings they share; N = 1, 2, 3, 4, giving four features;
8-11. Sentence edit distances based on word n-grams: analogous to features 4-7, with the word instead of the character as the segmentation unit;
- a shallow syntax tree feature set unit, which obtains the shallow syntax tree feature training text from the flat feature training text and the shallow syntax tree training text, and the shallow syntax tree feature test text from the flat feature test text and the shallow syntax tree test text;
here, each line of the shallow syntax tree feature training text and the shallow syntax tree feature test text contains the corresponding sentence pair's eleven flat features and pair of shallow syntax tree features.
Further, the similarity calculation module based on the shallow syntax tree specifically comprises:
- a training unit, which uses SVR as the similarity calculation method and trains on the shallow syntax tree feature training text in the SVR model to obtain the trained model;
- a test unit, which takes the trained model and the shallow syntax tree feature test text as input and obtains the similarity calculation result text using the SVR tool;
here, each line's value in the similarity calculation result text is the similarity calculation result of the corresponding line's sentence pair in the test text.
Through the above scheme, the present invention has at least the following advantages:
1. The structured-representation-based sentence similarity calculation method and system proposed by the present invention overcome the weak representational power of using only flat feature vectors to represent sentence-pair similarity;
2. Unlike feature-vector-based methods, the present invention compares similarity by directly computing, with a tree kernel function, the number of identical subtrees of two structured features (e.g., dependency trees). Tree-kernel-based methods need not construct a high-dimensional feature vector space: they operate on structure trees and classify by directly computing the similarity between two discrete objects (e.g., syntactic structure trees), so that in theory they can explore an implicit high-dimensional feature space and thereby effectively exploit structured information such as syntax trees;
3. On the basis of the shallow syntax tree, the present invention proposes a new syntax tree structure whose structured features better express information such as the syntax and semantics of a sentence, and applies this syntax tree to sentence similarity calculation, obtaining good performance.
The above is only an overview of the technical solution of the present invention. In order to understand the technical means of the invention more clearly so that it can be implemented according to the contents of the specification, preferred embodiments of the present invention are described in detail below together with the accompanying drawings.
Brief description of the drawings
Fig. 1 is a schematic diagram of the dependency trees of the sentences "Tigers hit lions" and "Lions hit tigers" in the background art;
Fig. 2 is the structure chart of the sentence similarity calculation system;
Fig. 3 is the structure chart of the preprocessing module;
Fig. 4 is the structure chart of the structured feature module based on the shallow syntax tree;
Fig. 5 is the structure chart of the feature set module of the shallow-syntax-tree-based similarity calculation;
Fig. 6 is the structure chart of the similarity calculation module based on the shallow syntax tree;
Fig. 7 is the flow chart of the sentence similarity calculation method;
Fig. 8 is the flow chart of the preprocessing module;
Fig. 9 is the flow chart of the structured feature module based on the shallow syntax tree;
Fig. 10 is the flow chart of the feature set module of the shallow-syntax-tree-based similarity calculation;
Fig. 11 is the flow chart of the similarity calculation module based on the shallow syntax tree;
Fig. 12 is an example of the basic shallow syntax tree;
Fig. 13 is an example of the phrase shallow syntax tree;
Fig. 14 is an example of the semantic shallow syntax tree;
Fig. 15 is an example of the pruned shallow syntax tree;
Fig. 16 is an example of the joint shallow syntax tree.
Specific embodiments
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples serve to illustrate the present invention, not to limit its scope.
The goal of the sentence similarity calculation of the present invention is to learn a scoring system: given a sentence pair, the system returns a similarity score in the range 0 to 5, where 0 means the two sentences are completely unrelated in meaning and 5 means they have identical meaning. System performance is assessed by the Pearson correlation coefficient between the system-computed scores and the human judgment scores.
For simplicity of description, the flow and implementation of the invention are illustrated below with the aid of the figures.
The sentence similarity calculation system based on structured features, as shown in Fig. 2, comprises:
- a preprocessing module 10, which invokes part-of-speech tagging, syntactic analysis, named entity recognition, and WordNet recognition tools on all sentences of the sentence-pair training text and the sentence-pair test text to perform part-of-speech tagging, syntactic analysis, named entity recognition, and WordNet recognition respectively, obtaining the part-of-speech-tagged training text, phrase training text, named entity training text, and WordNet training text, and the part-of-speech-tagged test text, phrase test text, named entity test text, and WordNet test text;
- a structured feature module 20 based on the shallow syntax tree, which obtains the shallow syntax tree training text from the part-of-speech-tagged training text, phrase training text, named entity training text, and WordNet training text,
and obtains the shallow syntax tree test text from the part-of-speech-tagged test text, phrase test text, named entity test text, and WordNet test text;
- a feature set module 30 of the shallow-syntax-tree-based similarity calculation, which obtains multiple flat features for each line's sentence pair in the sentence-pair training text to produce the flat feature training text, and combines the flat feature training text and the shallow syntax tree training text with the sentence pairs' human-scored training text to obtain the shallow syntax tree feature training text;
it likewise obtains multiple flat features for each line's sentence pair in the sentence-pair test text to produce the flat feature test text, and combines the flat feature test text with the shallow syntax tree test text to obtain the shallow syntax tree feature test text;
- a similarity calculation module 40 based on the shallow syntax tree, which trains an SVR model on the shallow syntax tree feature training text to obtain the trained model and obtains the similarity calculation result text from the trained model and the shallow syntax tree feature test text.
As shown in Fig. 3, the preprocessing module 10 specifically comprises:
- a part-of-speech tagging unit 101, which applies a part-of-speech tagging tool to all sentences of the sentence-pair training text to obtain the part of speech of each word, producing the corresponding part-of-speech-tagged training text;
the same processing applied to the sentence-pair test text yields the part-of-speech-tagged test text;
- a phrase tagging unit 102, which applies a syntactic analysis tool to all sentences of the sentence-pair training text to obtain the phrase each word belongs to, producing the phrase training text; the same processing applied to the sentence-pair test text yields the phrase test text;
- a named entity recognition unit 103, which applies a named entity recognition tool to the sentence-pair training text to obtain the named entity label of each word, producing the named entity training text; the same processing applied to the sentence-pair test text yields the named entity test text;
- a WordNet recognition unit 104, which applies a WordNet recognition tool to the sentence-pair training text to obtain the WordNet supersense of each word, a word without a supersense being represented by a space, producing the WordNet training text; the same processing applied to the sentence-pair test text yields the WordNet test text.
As shown in Fig. 4, the structured feature module 20 based on the shallow syntax tree specifically comprises:
- a basic shallow syntax tree unit 201, which constructs a shallow syntax tree for each sentence of the sentence-pair training text according to the part-of-speech-tagged training text, obtaining the basic shallow syntax tree training text, and obtains the basic shallow syntax tree test text from the sentence-pair test text and the part-of-speech-tagged test text;
here, the shallow syntax tree is constructed as follows: the words of a sentence become the bottom leaf nodes; the part of speech of each leaf node becomes that leaf's parent node; finally, the parent of all part-of-speech nodes is set to the root node;
- a phrase-level shallow syntax tree unit 202, which according to the phrase training text constructs a deeper shallow syntax tree for each sentence in the basic shallow syntax tree training text, obtaining the phrase shallow syntax tree training text, and obtains the phrase shallow syntax tree test text from the phrase test text and the basic shallow syntax tree test text;
here, the deeper shallow syntax tree is constructed as follows: from the sentence's phrase chunking result, determine which words belong to the same phrase; connect the part-of-speech parents of the leaves of words belonging to the same phrase to a common chunker node; sever the links between the root node and the part-of-speech nodes and connect the chunker nodes to the corresponding part-of-speech nodes; finally, set the parent of all chunker nodes to the root node;
- a semantic shallow syntax tree unit 203, which obtains the semantic shallow syntax tree training text from the phrase shallow syntax tree training text, the named entity training text, and the WordNet training text, and the semantic shallow syntax tree test text from the phrase shallow syntax tree test text, the named entity test text, and the WordNet test text;
the semantic shallow syntax tree training text adds semantic information to the phrase shallow syntax tree training text as follows: if a word of the phrase shallow syntax tree training text has NER or WNSS information in the named entity training text or WordNet training text, the syntactic label of the chunker node containing that word is changed to the NER or WNSS information; if several words within one phrase node satisfy this condition, the NER and WNSS information of the last such word of the phrase is used;
- a pruning unit 204 for deleting insignificant information, which deletes the nodes related to definite articles and conjunctions from the semantic shallow syntax tree training text, obtaining the pruned shallow syntax tree training text, and deletes them from the semantic shallow syntax tree test text, obtaining the pruned shallow syntax tree test text;
- a joint sentence-pair representation unit 205, which based on the pruned shallow syntax tree training text associates the related parts of the shallow syntax trees of each sentence pair to obtain the shallow syntax tree training text, and based on the pruned shallow syntax tree test text associates the related parts of the shallow syntax trees of each sentence pair to obtain the shallow syntax tree test text;
the corresponding shallow syntax trees of a sentence pair are associated as follows: if a word is identical in the two sentences, its parent and grandparent nodes (non-terminal nodes) in both trees are marked with REL.
As shown in Fig. 5, the feature set module 30 of the shallow-syntax-tree-based similarity calculation specifically comprises:
- a flat feature set unit 301, which obtains the flat feature training text from the sentence-pair training text and the flat feature test text from the sentence-pair test text;
here, the flat feature training text and the flat feature test text contain, respectively, the flat similarity features of every line's sentence pair in the sentence-pair training text and in the sentence-pair test text;
- a shallow syntax tree feature set unit 302, which obtains the shallow syntax tree feature training text from the flat feature training text and the shallow syntax tree training text, and the shallow syntax tree feature test text from the flat feature test text and the shallow syntax tree test text.
As shown in Fig. 6, the similarity calculation module 40 based on the shallow syntax tree specifically comprises:
- a training unit 401, which uses SVR as the similarity calculation method and trains on the shallow syntax tree feature training text in the SVR model to obtain the trained model;
- a test unit 402, which takes the trained model and the shallow syntax tree feature test text as input and obtains the similarity calculation result text using the SVR tool;
here, each line's value in the similarity calculation result text is the similarity calculation result of the corresponding line's sentence pair in the test text.
The sentence similarity calculation method based on structured features, as shown in Fig. 7, comprises:
S10: for all sentences in the sentence-pair training text and the sentence-pair test text, invoke part-of-speech tagging, syntactic analysis, named entity recognition, and WordNet recognition tools to perform part-of-speech tagging, syntactic analysis, named entity recognition, and WordNet recognition respectively, obtaining the part-of-speech-tagged training text, phrase training text, named entity training text, and WordNet training text, and the part-of-speech-tagged test text, phrase test text, named entity test text, and WordNet test text;
the sentence-pair training text and the sentence-pair test text are texts in which every line contains two sentences whose similarity is to be calculated.
S20: obtain the shallow syntax tree training text from the part-of-speech-tagged training text, phrase training text, named entity training text, and WordNet training text;
obtain the shallow syntax tree test text from the part-of-speech-tagged test text, phrase test text, named entity test text, and WordNet test text.
S30: obtain 11 flat features for each line's sentence pair in the sentence-pair training text to produce the flat feature training text, and combine the flat feature training text and the shallow syntax tree training text with the sentence pairs' human-scored training text to obtain the shallow syntax tree feature training text;
obtain 11 flat features for each line's sentence pair in the sentence-pair test text to produce the flat feature test text, and combine the flat feature test text with the shallow syntax tree test text to obtain the shallow syntax tree feature test text.
S40: train an SVR model on the shallow syntax tree feature training text to obtain the trained model, and obtain the similarity calculation result text from the trained model and the shallow syntax tree feature test text.
As shown in Fig. 8, the detailed procedure of S10 is as follows:
S101: apply a part-of-speech tagging tool (e.g., Stanford POS tagger) to all sentences in the sentence-pair training text to obtain the part of speech of each word in each sentence, producing the corresponding part-of-speech-tagged training text.
The same processing applied to the sentence-pair test text yields the part-of-speech-tagged test text.
Example 1: "A woman and man are dancing in the rain" yields, after part-of-speech tagging:
Example 2: "DT NN CC NN VBP VBG IN DT NN".
Here DT denotes a determiner (definite article), NN a noun, CC a coordinating conjunction, VBP a verb in the non-third-person-singular present, VBG a gerund or present participle, and IN a preposition or subordinating conjunction.
S102: apply a syntactic analysis tool (e.g., Stanford parser) to all sentences in the sentence-pair training text to obtain the phrase each word belongs to, producing the phrase training text; the same processing applied to the sentence-pair test text yields the phrase test text.
Example 1 after phrase chunking yields 4 phrases:
Example 3: "NP(a woman and man) VP(be dance) PP(in) NP(the rain)"
where NP denotes a noun phrase, VP a verb phrase, and PP a prepositional phrase.
S103: apply a named entity recognition tool (e.g., SST-light tagger) to the sentence-pair training text to obtain the named entity label of each word, producing the named entity training text; the same processing applied to the sentence-pair test text yields the named entity test text.
Example 1 after named entity recognition yields two entities:
Example 4: "E:PER_DESC woman E:PER_DESC man"
where E:PER_DESC indicates that the recognized entity type is person.
S104: apply a WordNet recognition tool (e.g., SST-light tagger) to the sentence-pair training text to obtain the WordNet supersense (WNSS) of each word; a word without a WordNet supersense is represented by a space; this yields the WordNet training text. The same processing applied to the sentence-pair test text yields the WordNet test text.
Example 1 after WNSS recognition yields:
Example 5: "female woman male man weather rain"
where female, male, and weather are the WordNet supersenses of woman, man, and rain respectively.
As shown in Figure 9, the detailed process of S20 is as follows:
S201: according to the part-of-speech tagging training text, a shallow syntax tree is constructed for each sentence in the sentence-pair training text, yielding the basic shallow syntax tree training text; the basic shallow syntax tree test text is obtained from the sentence-pair test text and the part-of-speech tagging test text.
The shallow syntax tree is a tree of depth 3, constructed as follows: the words of a sentence become the leaf nodes at the bottom level (the 9 words of Example 1 give 9 leaf nodes); the part of speech of each leaf node becomes that leaf node's parent; finally, the parent of all part-of-speech nodes is set to the root node.
The shallow syntax tree constructed for Example 1 is shown in Figure 12.
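A minimal sketch of this construction in Python, producing the tree in bracketed notation; the function name and the (word, tag) pair format are illustrative, not part of the patent.

    def basic_shallow_tree(tagged):
        # each word becomes a leaf, its POS tag becomes the leaf's parent,
        # and ROOT is the common parent of all POS nodes (depth 3)
        return "(ROOT " + " ".join(f"({tag} {word})" for word, tag in tagged) + ")"

    tagged = [("a", "DT"), ("woman", "NN"), ("and", "CC"), ("man", "NN"),
              ("are", "VBP"), ("dancing", "VBG"), ("in", "IN"),
              ("the", "DT"), ("rain", "NN")]
    print(basic_shallow_tree(tagged))
    # (ROOT (DT a) (NN woman) (CC and) (NN man) (VBP are) (VBG dancing) (IN in) (DT the) (NN rain))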
S202: according to the phrase training text, a deeper shallow syntax tree is constructed for each tree in the basic shallow syntax tree training text, yielding the phrase shallow syntax tree training text; the phrase shallow syntax tree test text is obtained from the phrase test text and the basic shallow syntax tree test text.
The deeper shallow syntax tree is a tree of depth 4, constructed as follows: from the sentence's phrase chunking result, the words belonging to the same phrase are identified, and the part-of-speech parents of those words' leaf nodes are attached to a common chunker node (the 9 words of Example 1 belong to 4 phrases, giving 4 chunker nodes); the link between the root node and the part-of-speech nodes is removed, the part-of-speech nodes are attached to their chunker nodes, and finally the parent of all chunker nodes is set to the root node.
Figure 13 shows Figure 12 after the layer of chunker nodes is added.
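Continuing the sketch above, the chunker layer can be inserted as follows; the (label, start, end) span format for chunks is an assumption made for illustration.

    def phrase_shallow_tree(tagged, chunks):
        # chunks: (label, start, end) spans over the tagged words,
        # e.g. ("NP", 0, 4) covers "a woman and man"
        parts = []
        for label, start, end in chunks:
            inner = " ".join(f"({tag} {word})" for word, tag in tagged[start:end])
            parts.append(f"({label} {inner})")
        return "(ROOT " + " ".join(parts) + ")"

    tagged = [("a", "DT"), ("woman", "NN"), ("and", "CC"), ("man", "NN"),
              ("are", "VBP"), ("dancing", "VBG"), ("in", "IN"),
              ("the", "DT"), ("rain", "NN")]
    chunks = [("NP", 0, 4), ("VP", 4, 6), ("PP", 6, 7), ("NP", 7, 9)]
    print(phrase_shallow_tree(tagged, chunks))
    # (ROOT (NP (DT a) (NN woman) (CC and) (NN man)) (VP (VBP are) (VBG dancing)) (PP (IN in)) (NP (DT the) (NN rain)))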
S203: the semantic shallow syntax tree training text is obtained from the phrase shallow syntax tree training text, the named entity training text and the WordNet training text; the semantic shallow syntax tree test text is obtained from the phrase shallow syntax tree test text, the named entity test text and the WordNet test text.
The semantic shallow syntax tree training text adds semantic information to the phrase shallow syntax tree training text, as follows: if a word in the phrase shallow syntax tree training text has NER or WNSS information in the named entity training text or the WordNet training text, the syntactic label of the chunker node containing that word is replaced by the NER or WNSS information; if several words in one phrase node meet this condition, the NER and WNSS information of the last such word in the phrase is used.
Figure 14 shows Figure 13 after the semantic information is added.
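The relabelling rule can be sketched as below; the dictionary mapping words to their NER/WNSS tags is an assumed input format.

    def semantic_label(label, words, sem):
        # replace a chunker node's label by the NER/WNSS tag of its words;
        # when several words are annotated, the last one wins (as described above)
        for w in words:
            label = sem.get(w, label)
        return label

    sem = {"woman": "E:PER_DESC", "man": "E:PER_DESC", "rain": "weather"}
    print(semantic_label("NP", ["a", "woman", "and", "man"], sem))  # E:PER_DESC
    print(semantic_label("NP", ["the", "rain"], sem))               # weather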
S204: determiner- and conjunction-related nodes are deleted from the semantic shallow syntax tree training text, yielding the pruned shallow syntax tree training text; determiner- and conjunction-related nodes are deleted from the semantic shallow syntax tree test text, yielding the pruned shallow syntax tree test text.
The present invention removes definite articles and conjunctions, together with their parent nodes (part-of-speech nodes), from the structural representation, reducing the influence of insignificant information on the calculation result. If a leaf node of the shallow syntax tree represents an article or a conjunction, that leaf node is deleted, together with its parent (the part-of-speech node) and the parent of that parent (the chunker node). For instance, 3 of the 9 words in Example 1 are definite articles or conjunctions; these 3 nodes, and the two nodes above each of them, are deleted.
Figure 15 shows Figure 14 after the non-key information is deleted.
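A sketch of the pruning step follows. The patent text deletes the chunker node above every pruned leaf; the sketch drops a chunker node only when all of its words are pruned, which keeps words such as "rain" in NP(the rain) attached. This is an interpretation of the rule, not the patent's literal wording.

    def prune_chunks(tree, drop_tags=("DT", "CC")):
        # tree: list of (label, [(word, tag), ...]) per chunker node
        pruned = []
        for label, words in tree:
            kept = [(w, t) for w, t in words if t not in drop_tags]
            if kept:  # a chunker node with no remaining children is removed
                pruned.append((label, kept))
        return pruned

    tree = [("E:PER_DESC", [("a", "DT"), ("woman", "NN"), ("and", "CC"), ("man", "NN")]),
            ("VP", [("are", "VBP"), ("dancing", "VBG")]),
            ("PP", [("in", "IN")]),
            ("weather", [("the", "DT"), ("rain", "NN")])]
    print(prune_chunks(tree))
    # [('E:PER_DESC', [('woman', 'NN'), ('man', 'NN')]),
    #  ('VP', [('are', 'VBP'), ('dancing', 'VBG')]),
    #  ('PP', [('in', 'IN')]),
    #  ('weather', [('rain', 'NN')])]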
S205: based on the pruned shallow syntax tree training text, the related parts of each sentence pair's shallow syntax trees are linked, yielding the shallow syntax tree training text; based on the pruned shallow syntax tree test text, the related parts of each sentence pair's shallow syntax trees are linked, yielding the shallow syntax tree test text.
The method for linking the two shallow syntax trees of a sentence pair is as follows: if a word occurs in both sentences (identical leaf nodes), its parent node (part-of-speech node) and grandparent node, both non-terminal nodes, are marked with "REL". In Example 8 below, there are 6 words after the two sentences' shallow syntax trees are pruned, and 4 words of the first sentence also occur in the second sentence; the prefix "REL-" is added before the syntactic and semantic labels of the parent and grandparent of each such leaf node (the nodes of the words shared by the two sentences). The linked shallow syntax trees of Example 8 are shown in Figure 16.
Example 8: “the girl sing into a microphone” “the girl sing into the phone”
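A sketch of the linking step; the chunk labels and the set of shared words are illustrative inputs, not the patent's exact data.

    def mark_rel(tree, shared):
        # tree: list of (label, [(word, tag), ...]); shared: words found in both sentences
        out = []
        for label, words in tree:
            lab = "REL-" + label if any(w in shared for w, _ in words) else label
            inner = " ".join(f"(REL-{t} {w})" if w in shared else f"({t} {w})"
                             for w, t in words)
            out.append(f"({lab} {inner})")
        return "(ROOT " + " ".join(out) + ")"

    tree1 = [("NP", [("girl", "NN")]), ("VP", [("sing", "VB")]),
             ("PP", [("into", "IN")]), ("NP", [("microphone", "NN")])]
    print(mark_rel(tree1, {"girl", "sing", "into"}))
    # (ROOT (REL-NP (REL-NN girl)) (REL-VP (REL-VB sing)) (REL-PP (REL-IN into)) (NP (NN microphone)))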
As shown in Figure 10, the detailed process of S30 is as follows:
S301: the plane feature training text is obtained from the sentence-pair training text, and the plane feature test text is obtained from the sentence-pair test text.
The plane feature training text and the plane feature test text contain, for the sentence pair in each row of the sentence-pair training text and the sentence-pair test text respectively, the plane features used in the similarity calculation. The present invention provides 11 plane features:
1. Longest common substring: the length of the longest contiguous character sequence shared by the two sentences;
2. Longest common subsequence: unlike the longest common substring, no contiguity is required, so the similarity can still be computed when words are inserted or missing;
3. Greedy string tiling: allows reordered parts of the texts to be compared, computing the common similar substrings and the maximum length of each substring match;
4, 5, 6, 7. Sentence edit-distance results based on character n-grams: for a string s, its character n-grams are all substrings of s of length n, obtained by cutting the string into segments of length n. Given two strings, their n-grams are computed separately, and the n-gram distance between the strings is defined from the number of n-grams they have in common; n = 1, 2, 3, 4, giving four features;
8, 9, 10, 11. Sentence edit-distance results based on word n-grams: analogous to features 4-7, but with the word rather than the character as the segmentation unit. Illustrative sketches of these plane features are given below.
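The following Python sketches illustrate several of the features above. The two dynamic programs implement features 1 and 2 (feature 3, greedy string tiling, follows the same pairwise pattern but needs tile bookkeeping and is omitted); the Jaccard-style overlap for the n-gram features is an assumed normalisation, since the patent obtains the actual values from an external tool (cf. Example 10).

    def longest_common_substring(a, b):
        # dp[i][j]: length of the common substring ending at a[i-1] and b[j-1]
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best = 0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                    best = max(best, dp[i][j])
        return best

    def longest_common_subsequence(a, b):
        # like the substring variant, but matched characters need not be adjacent
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                            else max(dp[i - 1][j], dp[i][j - 1]))
        return dp[-1][-1]

    def ngrams(seq, n):
        return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

    def ngram_overlap(s1, s2, n, unit="char"):
        # features 4-7 use unit="char", features 8-11 use unit="word"
        a = list(s1) if unit == "char" else s1.split()
        b = list(s2) if unit == "char" else s2.split()
        g1, g2 = ngrams(a, n), ngrams(b, n)
        return len(g1 & g2) / len(g1 | g2) if g1 | g2 else 0.0

    print(longest_common_substring("abcdef", "abdf"))    # 2 ("ab")
    print(longest_common_subsequence("abcdef", "abdf"))  # 4 ("abdf")
    s1 = "the girl sing into a microphone"
    s2 = "the girl sing into the phone"
    print([ngram_overlap(s1, s2, n, u) for u in ("char", "word") for n in (1, 2, 3, 4)])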
The 11 plane features of Example 8 are:
Example 10: 1:1.000000000000697 2:0.9391768400811316 3:0.9999999999999831 4:0.503555802360086 5:0.5677922252084567 6:0.520906295824116 7:0.9 8:0.3333333333333333 9:0.9 10:0.21428571428571427 11:0.07142857142857142
Taking 1:1.000000000000697 as an example, the number before the ":" is the feature index and the number after the ":" is the feature value; the values are obtained by calling a dedicated tool (e.g. UKP).
S302: the shallow syntax tree feature training text is obtained from the plane feature training text and the shallow syntax tree training text; the shallow syntax tree feature test text is obtained from the plane feature test text and the shallow syntax tree test text.
Each row of the shallow syntax tree feature training text and of the shallow syntax tree feature test text contains the eleven plane features and the pair of shallow syntax tree features of the corresponding sentence pair.
The feature set of Example 8 is:
Example 11: |BT| (ROOT(root(REL-dance(REL-nsubj(REL-woman(REL-det a)(REL-cc and)(REL-conj:and man)))(REL-nsubj man)(REL-cop be)(REL-nmod:in(REL-rain(REL-case in)(det the)))))) |BT| (ROOT(root(REL-dance(REL-nsubj(REL-man(REL-det a)(REL-cc and)(REL-conj:and woman)))(REL-nsubj woman)(REL-cop be)(REL-nmod:in(REL-rain(REL-case in)))))) |ET| 1:1.000000000000697 2:0.9391768400811316 3:0.9999999999999831 4:0.503555802360086 5:0.5677922252084567 6:0.520906295824116 7:0.9 8:0.3333333333333333 9:0.9 10:0.21428571428571427 11:0.07142857142857142
Here the string between the first and second |BT| markers and the string between the second |BT| and |ET| are the two shallow syntax trees (similar to Figure 14) proposed by the present invention for the two sentences; root denotes the root node, "(" marks the start of a child node, ")" marks its end, and the content between "(" and ")" is the node's content. The plane features of Example 10 are appended at the end.
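A sketch of assembling such a line; the |BT|/|ET| layout matches tree-kernel SVM tools of the SVM-light-TK family (an assumption, since the patent only exhibits the string format).

    def feature_line(tree1, tree2, plane_features):
        # "i:value" pairs, numbered from 1, appended after the two bracketed trees
        feats = " ".join(f"{i}:{v}" for i, v in enumerate(plane_features, start=1))
        return f"|BT| {tree1} |BT| {tree2} |ET| {feats}"

    print(feature_line("(ROOT (REL-NP (REL-NN girl)))",
                       "(ROOT (REL-NP (REL-NN girl)))",
                       [1.0, 0.94, 0.99]))
    # |BT| (ROOT (REL-NP (REL-NN girl))) |BT| (ROOT (REL-NP (REL-NN girl))) |ET| 1:1.0 2:0.94 3:0.99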
As shown in Figure 11, the detailed process of S40 is as follows:
S401: the similarity calculation model is obtained using SVR; the shallow syntax tree feature training text is given as input to the SVR tool for training, which yields the trained model.
S402: the trained model and the shallow syntax tree feature test text are given as input to the SVR tool, which yields the similarity calculation result text.
The value in each row of the similarity calculation result text is the similarity calculation result of the sentence pair in the corresponding row of the test text. For example, the similarity score of the two sentences of Example 8 is 3.8731014.
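An illustrative stand-in for S401/S402 in Python: the patent's SVR runs over the tree features plus the plane features, which in practice requires a tree-kernel SVM tool, so the scikit-learn sketch below fits the 11 plane features only and uses placeholder data. It is a simplification, not the full method.

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X_train = rng.random((100, 11))  # placeholder: 11 plane features per sentence pair
    y_train = rng.random(100) * 5.0  # placeholder: manual similarity scores (0-5 scale assumed)

    model = SVR(kernel="rbf").fit(X_train, y_train)  # S401: train the SVR model
    X_test = rng.random((10, 11))
    print(model.predict(X_test))                     # S402: one similarity score per test pair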
The above is only a preferred embodiment of the present invention and is not intended to limit the invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the technical principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A sentence similarity calculation method, characterised by comprising the steps of:
S10: applying part-of-speech tagging, syntactic analysis, named entity recognition and WordNet annotation tools to every sentence in the sentence-pair training text and the sentence-pair test text, so as to perform part-of-speech tagging, syntactic analysis, named entity recognition and WordNet annotation respectively, obtaining the part-of-speech tagging training text, phrase training text, named entity training text and WordNet training text, and the part-of-speech tagging test text, phrase test text, named entity test text and WordNet test text,
wherein the sentence-pair training text and the sentence-pair test text are texts in which every row contains the two sentences whose similarity is to be calculated;
S20: obtaining the shallow syntax tree training text from the part-of-speech tagging training text, phrase training text, named entity training text and WordNet training text,
and obtaining the shallow syntax tree test text from the part-of-speech tagging test text, phrase test text, named entity test text and WordNet test text;
S30: computing multiple plane features for the sentence pair in each row of the sentence-pair training text to obtain the plane feature training text, and combining the plane feature training text, the shallow syntax tree training text and the sentence-pair manual-score training text to obtain the shallow syntax tree feature training text,
computing multiple plane features for the sentence pair in each row of the sentence-pair test text to obtain the plane feature test text, and combining the plane feature test text with the shallow syntax tree test text to obtain the shallow syntax tree feature test text;
S40: training an SVR model on the shallow syntax tree feature training text to obtain a trained model, and obtaining the similarity calculation result text from the trained model and the shallow syntax tree feature test text.
2. The sentence similarity calculation method according to claim 1, characterised in that the detailed process of step S10 is as follows:
S101: applying a part-of-speech tagging tool to every sentence in the sentence-pair training text to obtain the part of speech of each word, yielding the corresponding part-of-speech tagging training text;
applying the same processing to the sentence-pair test text to obtain the part-of-speech tagging test text;
S102: applying a syntactic analysis tool to every sentence in the sentence-pair training text to obtain the phrase each word belongs to, yielding the phrase training text;
applying the same processing to the sentence-pair test text to obtain the phrase test text;
S103: applying a named entity recognition tool to the sentence-pair training text to obtain the named entity recognition result of each word, yielding the named entity training text;
applying the same processing to the sentence-pair test text to obtain the named entity test text;
S104: applying a WordNet annotation tool to the sentence-pair training text to obtain the WordNet supersense of each word, a word with no WordNet supersense being represented by a space, yielding the WordNet training text;
applying the same processing to the sentence-pair test text to obtain the WordNet test text.
3. The sentence similarity calculation method according to claim 1, characterised in that the detailed process of step S20 is as follows:
S201: according to the part-of-speech tagging training text, constructing a shallow syntax tree for each sentence in the sentence-pair training text to obtain the basic shallow syntax tree training text; obtaining the basic shallow syntax tree test text from the sentence-pair test text and the part-of-speech tagging test text;
wherein the shallow syntax tree is constructed as follows: the words of a sentence become the leaf nodes at the bottom level; the part of speech of each leaf node becomes that leaf node's parent; finally, the parent of all part-of-speech nodes is set to the root node;
S202: according to the phrase training text, constructing a deeper shallow syntax tree for each tree in the basic shallow syntax tree training text to obtain the phrase shallow syntax tree training text; obtaining the phrase shallow syntax tree test text from the phrase test text and the basic shallow syntax tree test text;
wherein the deeper shallow syntax tree is constructed as follows: from the sentence's phrase chunking result, the words belonging to the same phrase are identified; the part-of-speech parents of the leaf nodes of words belonging to the same phrase are attached to a common chunker node; the link between the root node and the part-of-speech nodes is removed and the part-of-speech nodes are attached to their chunker nodes; finally, the parent of all chunker nodes is set to the root node;
S203: obtaining the semantic shallow syntax tree training text from the phrase shallow syntax tree training text, the named entity training text and the WordNet training text; obtaining the semantic shallow syntax tree test text from the phrase shallow syntax tree test text, the named entity test text and the WordNet test text;
wherein the semantic shallow syntax tree training text adds semantic information to the phrase shallow syntax tree training text as follows: if a word in the phrase shallow syntax tree training text has NER or WNSS information in the named entity training text or the WordNet training text, the syntactic label of the chunker node containing that word is replaced by the NER or WNSS information; if several words in one phrase node meet this condition, the NER and WNSS information of the last such word in the phrase is used;
S204: deleting determiner- and conjunction-related nodes from the semantic shallow syntax tree training text to obtain the pruned shallow syntax tree training text; deleting determiner- and conjunction-related nodes from the semantic shallow syntax tree test text to obtain the pruned shallow syntax tree test text;
S205: based on the pruned shallow syntax tree training text, linking the related parts of each sentence pair's shallow syntax trees to obtain the shallow syntax tree training text; based on the pruned shallow syntax tree test text, linking the related parts of each sentence pair's shallow syntax trees to obtain the shallow syntax tree test text;
wherein the two shallow syntax trees of a sentence pair are linked as follows: if a word occurs in both sentences, its parent node and grandparent node, both non-terminal nodes, are marked with "REL".
4. The sentence similarity calculation method according to claim 1, characterised in that the detailed process of step S30 is as follows:
S301: obtaining the plane feature training text from the sentence-pair training text, and obtaining the plane feature test text from the sentence-pair test text;
wherein the plane feature training text and the plane feature test text contain, for the sentence pair in each row of the sentence-pair training text and the sentence-pair test text respectively, the plane features used in the similarity calculation;
S302: obtaining the shallow syntax tree feature training text from the plane feature training text and the shallow syntax tree training text; obtaining the shallow syntax tree feature test text from the plane feature test text and the shallow syntax tree test text.
5. The sentence similarity calculation method according to claim 1, characterised in that the detailed process of step S40 is as follows:
S401: obtaining the similarity calculation model using SVR, the shallow syntax tree feature training text being trained in the SVR model to obtain the trained model;
S402: taking the trained model and the shallow syntax tree feature test text as input and obtaining the similarity calculation result text using the SVR tool;
wherein the value in each row of the similarity calculation result text is the similarity calculation result of the sentence pair in the corresponding row of the test text.
6. A sentence similarity calculation system, characterised by comprising:
- a preprocessing module, which applies part-of-speech tagging, syntactic analysis, named entity recognition and WordNet annotation tools to every sentence in the sentence-pair training text and the sentence-pair test text, so as to perform part-of-speech tagging, syntactic analysis, named entity recognition and WordNet annotation respectively, obtaining the part-of-speech tagging training text, phrase training text, named entity training text and WordNet training text, and the part-of-speech tagging test text, phrase test text, named entity test text and WordNet test text;
- a shallow-syntax-tree-based structural feature module, which obtains the shallow syntax tree training text from the part-of-speech tagging training text, phrase training text, named entity training text and WordNet training text,
and obtains the shallow syntax tree test text from the part-of-speech tagging test text, phrase test text, named entity test text and WordNet test text;
- a shallow-syntax-tree-based similarity calculation feature set module, which computes multiple plane features for the sentence pair in each row of the sentence-pair training text to obtain the plane feature training text, and combines the plane feature training text, the shallow syntax tree training text and the sentence-pair manual-score training text to obtain the shallow syntax tree feature training text,
and which computes multiple plane features for the sentence pair in each row of the sentence-pair test text to obtain the plane feature test text, and combines the plane feature test text with the shallow syntax tree test text to obtain the shallow syntax tree feature test text;
- a shallow-syntax-tree-based similarity calculation module, which trains an SVR model on the shallow syntax tree feature training text to obtain a trained model, and obtains the similarity calculation result text from the trained model and the shallow syntax tree feature test text.
7. The sentence similarity calculation system according to claim 6, characterised in that the preprocessing module specifically comprises:
- a part-of-speech tagging unit, which applies a part-of-speech tagging tool to every sentence in the sentence-pair training text to obtain the part of speech of each word, yielding the corresponding part-of-speech tagging training text;
the same processing is applied to the sentence-pair test text to obtain the part-of-speech tagging test text;
- a phrase tagging unit, which applies a syntactic analysis tool to every sentence in the sentence-pair training text to obtain the phrase each word belongs to, yielding the phrase training text; the same processing is applied to the sentence-pair test text to obtain the phrase test text;
- a named entity recognition unit, which applies a named entity recognition tool to the sentence-pair training text to obtain the named entity recognition result of each word, yielding the named entity training text; the same processing is applied to the sentence-pair test text to obtain the named entity test text;
- a WordNet annotation unit, which applies a WordNet annotation tool to the sentence-pair training text to obtain the WordNet supersense of each word, a word with no WordNet supersense being represented by a space, yielding the WordNet training text; the same processing is applied to the sentence-pair test text to obtain the WordNet test text.
8. The sentence similarity calculation system according to claim 6, characterised in that the shallow-syntax-tree-based structural feature module specifically comprises:
- a basic shallow syntax tree unit, which, according to the part-of-speech tagging training text, constructs a shallow syntax tree for each sentence in the sentence-pair training text, obtaining the basic shallow syntax tree training text; the basic shallow syntax tree test text is obtained from the sentence-pair test text and the part-of-speech tagging test text;
wherein the shallow syntax tree is constructed as follows: the words of a sentence become the leaf nodes at the bottom level; the part of speech of each leaf node becomes that leaf node's parent; finally, the parent of all part-of-speech nodes is set to the root node;
- a phrase-level-structure shallow syntax tree unit, which, according to the phrase training text, constructs a deeper shallow syntax tree for each tree in the basic shallow syntax tree training text, obtaining the phrase shallow syntax tree training text; the phrase shallow syntax tree test text is obtained from the phrase test text and the basic shallow syntax tree test text;
wherein the deeper shallow syntax tree is constructed as follows: from the sentence's phrase chunking result, the words belonging to the same phrase are identified; the part-of-speech parents of the leaf nodes of words belonging to the same phrase are attached to a common chunker node; the link between the root node and the part-of-speech nodes is removed and the part-of-speech nodes are attached to their chunker nodes; finally, the parent of all chunker nodes is set to the root node;
- a semantic shallow syntax tree unit, which obtains the semantic shallow syntax tree training text from the phrase shallow syntax tree training text, the named entity training text and the WordNet training text, and obtains the semantic shallow syntax tree test text from the phrase shallow syntax tree test text, the named entity test text and the WordNet test text;
wherein the semantic shallow syntax tree training text adds semantic information to the phrase shallow syntax tree training text as follows: if a word in the phrase shallow syntax tree training text has NER or WNSS information in the named entity training text or the WordNet training text, the syntactic label of the chunker node containing that word is replaced by the NER or WNSS information; if several words in one phrase node meet this condition, the NER and WNSS information of the last such word in the phrase is used;
- a non-key-information deletion shallow syntax tree unit, which deletes determiner- and conjunction-related nodes from the semantic shallow syntax tree training text, obtaining the pruned shallow syntax tree training text, and deletes determiner- and conjunction-related nodes from the semantic shallow syntax tree test text, obtaining the pruned shallow syntax tree test text;
- a joint sentence-pair representation unit, which, based on the pruned shallow syntax tree training text, links the related parts of each sentence pair's shallow syntax trees, obtaining the shallow syntax tree training text, and, based on the pruned shallow syntax tree test text, links the related parts of each sentence pair's shallow syntax trees, obtaining the shallow syntax tree test text;
wherein the two shallow syntax trees of a sentence pair are linked as follows: if a word occurs in both sentences, its parent node and grandparent node, both non-terminal nodes, are marked with "REL".
9. The sentence similarity calculation system according to claim 6, characterised in that the shallow-syntax-tree-based similarity calculation feature set module specifically comprises:
- a plane feature set unit, which obtains the plane feature training text from the sentence-pair training text and the plane feature test text from the sentence-pair test text;
wherein the plane feature training text and the plane feature test text contain, for the sentence pair in each row of the sentence-pair training text and the sentence-pair test text respectively, the plane features used in the similarity calculation;
- a shallow syntax tree feature set unit, which obtains the shallow syntax tree feature training text from the plane feature training text and the shallow syntax tree training text, and obtains the shallow syntax tree feature test text from the plane feature test text and the shallow syntax tree test text.
10. The sentence similarity calculation system according to claim 6, characterised in that the shallow-syntax-tree-based similarity calculation module specifically comprises:
- a training unit, which obtains the similarity calculation model using SVR, the shallow syntax tree feature training text being trained in the SVR model to obtain the trained model;
- a test unit, which takes the trained model and the shallow syntax tree feature test text as input and obtains the similarity calculation result text using the SVR tool;
wherein the value in each row of the similarity calculation result text is the similarity calculation result of the sentence pair in the corresponding row of the test text.
CN201611143723.2A 2016-12-13 2016-12-13 Sentence similarity calculation method and system Pending CN106844331A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611143723.2A CN106844331A (en) 2016-12-13 2016-12-13 Sentence similarity calculation method and system

Publications (1)

Publication Number Publication Date
CN106844331A true CN106844331A (en) 2017-06-13

Family

ID=59139793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611143723.2A Pending CN106844331A (en) 2016-12-13 2016-12-13 Sentence similarity calculation method and system

Country Status (1)

Country Link
CN (1) CN106844331A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016197289A (en) * 2015-04-02 2016-11-24 日本電信電話株式会社 Parameter learning device, similarity calculation device and method, and program
CN105095188A (en) * 2015-08-14 2015-11-25 北京京东尚科信息技术有限公司 Sentence similarity computing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Daniel Bär et al., "UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures", First Joint Conference on Lexical and Computational Semantics *
Meng Yang et al., "Sentence Similarity on Structural Representations", ICCPOL 2016 / NLPCC 2016: Natural Language Understanding and Intelligent Applications *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008462A (en) * 2018-01-05 2019-07-12 阿里巴巴集团控股有限公司 A kind of command sequence detection method and command sequence processing method
CN110008462B (en) * 2018-01-05 2023-09-01 阿里巴巴集团控股有限公司 Command sequence detection method and command sequence processing method
CN108363692A (en) * 2018-02-13 2018-08-03 成都智库二八六信息技术有限公司 A kind of computational methods of sentence similarity and the public sentiment measure of supervision based on this method
CN108681535A (en) * 2018-04-11 2018-10-19 广州视源电子科技股份有限公司 Candidate word evaluation method and device, computer equipment and storage medium
CN108932320A (en) * 2018-06-27 2018-12-04 广州优视网络科技有限公司 Article search method, apparatus and electronic equipment
CN109063004A (en) * 2018-07-09 2018-12-21 深圳追科技有限公司 A kind of method and apparatus automatically generating the similar question sentence of FAQ
CN109298796A (en) * 2018-07-24 2019-02-01 北京捷通华声科技股份有限公司 A kind of Word association method and device
CN109298796B (en) * 2018-07-24 2022-05-24 北京捷通华声科技股份有限公司 Word association method and device
CN110209809A (en) * 2018-08-27 2019-09-06 腾讯科技(深圳)有限公司 Text Clustering Method and device, storage medium and electronic device
CN109284399B (en) * 2018-10-11 2022-03-15 深圳前海微众银行股份有限公司 Similarity prediction model training method and device and computer readable storage medium
CN109284399A (en) * 2018-10-11 2019-01-29 深圳前海微众银行股份有限公司 Similarity prediction model training method, equipment and computer readable storage medium
CN109766260A (en) * 2018-12-11 2019-05-17 平安科技(深圳)有限公司 Configure method, apparatus, electronic equipment and the storage medium of test action
CN110287282A (en) * 2019-05-20 2019-09-27 湖南大学 The Intelligent dialogue systems response method and Intelligent dialogue system of calculation are assessed based on tree
CN110175585B (en) * 2019-05-30 2024-01-23 北京林业大学 Automatic correcting system and method for simple answer questions
CN110175585A (en) * 2019-05-30 2019-08-27 北京林业大学 It is a kind of letter answer correct system and method automatically
CN110246098A (en) * 2019-05-31 2019-09-17 暨南大学 A kind of reconstruction of fragments method
CN110956033A (en) * 2019-12-04 2020-04-03 北京中电普华信息技术有限公司 Text similarity calculation method and device
CN110990537B (en) * 2019-12-11 2023-06-27 中山大学 Sentence similarity calculation method based on edge information and semantic information
CN110990537A (en) * 2019-12-11 2020-04-10 中山大学 Sentence similarity calculation method based on edge information and semantic information
CN111161730B (en) * 2019-12-27 2022-10-04 中国联合网络通信集团有限公司 Voice instruction matching method, device, equipment and storage medium
CN111161730A (en) * 2019-12-27 2020-05-15 中国联合网络通信集团有限公司 Voice instruction matching method, device, equipment and storage medium
CN111274809A (en) * 2020-02-20 2020-06-12 苏宁云计算有限公司 Method and device for processing linguistic data in knowledge base
CN113569570A (en) * 2020-04-29 2021-10-29 阿里巴巴集团控股有限公司 Named entity identification method and device and electronic equipment
CN114691362A (en) * 2022-03-22 2022-07-01 重庆邮电大学 Edge calculation method for compromising time delay and energy consumption
CN114691362B (en) * 2022-03-22 2024-04-30 重庆邮电大学 Edge computing method for time delay and energy consumption compromise

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170613)