CN108509409A - Method for automatically generating semantically similar sentence samples - Google Patents

Method for automatically generating semantically similar sentence samples

Info

Publication number
CN108509409A
CN108509409A (application CN201710109325.7A)
Authority
CN
China
Prior art keywords
sentence sample
semantic similarity
sentence
sample
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710109325.7A
Other languages
Chinese (zh)
Inventor
王昊
陈见耸
高鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yutou Technology Hangzhou Co Ltd
Original Assignee
Yutou Technology Hangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yutou Technology Hangzhou Co Ltd filed Critical Yutou Technology Hangzhou Co Ltd
Priority to CN201710109325.7A priority Critical patent/CN108509409A/en
Priority to PCT/CN2018/074325 priority patent/WO2018153215A1/en
Priority to TW107105170A priority patent/TWI662425B/en
Publication of CN108509409A publication Critical patent/CN108509409A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for automatically generating semantically similar sentence samples, belonging to the field of language processing technology. The method includes: obtaining a sentence sample and performing word segmentation on it; using a word vector model to obtain, for each word, a set of near-synonyms semantically similar to that word; selecting one near-synonym from each set to replace the corresponding word, thereby forming semantically similar sentence samples; using a language model to generate, for each semantically similar sentence sample, a probability value representing its semantic plausibility, and sorting all semantically similar sentence samples from high to low by that probability value; and selecting and retaining the top N semantically similar sentence samples so that subsequent processing steps can be carried out on the retained samples. The advantageous effect of this technical solution is that large batches of semantically similar sentence samples can be generated automatically without requiring a massive candidate sentence set, eliminating a large amount of manual work.

Description

Method for automatically generating semantically similar sentence samples
Technical field
The present invention relates to the field of natural language processing technology, and in particular to a method for automatically generating semantically similar sentence samples.
Background art
In the prior art, many natural language processing tasks require large sets of semantically similar sentences or clause patterns. These sets usually have to be written manually, which consumes a great deal of manpower and time.
With the development of automation technology, more and more of this compilation work can be realized automatically. At present, the main ways of obtaining large batches of semantically similar sentences are the following:
1) Obtaining large batches of semantically similar sentences by retrieval. The retrieval-based approach finds a set of semantically similar sentences among a massive collection of candidate sentences by means of a retrieval formula. The premise of this method is that a massive candidate sentence set must already exist, and the performance requirements on the semantic similarity retrieval module are very high: the precision of the semantically similar sentences obtained this way is determined by the performance of that retrieval module.
2) Obtaining large batches of semantically similar sentences with sequence-to-sequence models. This approach is currently very active in academic research, but in practical applications many of the sentences it generates are unreasonable; its performance is not good enough, so it lacks a certain practicability.
Summary of the invention
In view of the above problems in the prior art, a technical solution for a method of automatically generating semantically similar sentence samples is now provided, intended to effectively generate large batches of semantically similar sentence samples automatically and eliminate a large amount of manual work.
The above technical solution specifically includes:
A method of automatically generating semantically similar sentence samples, applicable in the course of natural language processing, wherein a word vector model for obtaining semantically similar words and a language model for judging the semantic plausibility of the generated semantically similar sentence samples are trained and formed in advance; the method further includes:
Step S1: obtaining an externally input sentence sample;
Step S2: performing word segmentation on the sentence sample, decomposing the sentence sample into a combination of multiple sequentially arranged words;
Step S3: using the word vector model, obtaining for each word included in the sentence sample a set of near-synonyms semantically similar to that word;
Step S4: selecting one near-synonym from the set corresponding to each word and replacing the word with it, so as to form a semantically similar sentence sample associated with the sentence sample;
Step S5: judging whether there remain near-synonyms in the sets that have not yet been selected:
if so, returning to step S4;
Step S6: using the language model, generating for each semantically similar sentence sample a probability value representing its semantic plausibility, and sorting all semantically similar sentence samples from high to low by probability value;
Step S7: selecting and retaining the top N semantically similar sentence samples, so as to carry out subsequent processing steps on the retained semantically similar sentence samples.
Preferably, in the method for automatically generating semantically similar sentence samples, the types of the sentence sample include:
a sentence type, where a sentence sample of the sentence type includes multiple sequentially arranged words;
a clause type, where a sentence sample of the clause type includes multiple sequentially arranged words together with part-of-speech labels for the words, or includes only multiple sequentially arranged part-of-speech labels;
Step S1 then specifically includes:
Step S11: obtaining the externally input sentence sample;
Step S12: judging the type of the sentence sample:
if the sentence sample is of the clause type, turning to step S13;
if the sentence sample is of the sentence type, proceeding directly to step S2;
Step S13: replacing each part-of-speech label in the sentence sample with a high-frequency word corresponding to that label so as to form a complete sentence sample, and then turning to step S2.
Preferably, in the method for automatically generating semantically similar sentence samples, the word vector model is trained and formed in advance using a preset word segmentation method;
then in step S2, word segmentation is performed on the sentence sample using the same preset word segmentation method.
Preferably, in step S4, the near-synonym selected for replacement has the same part of speech as the word it replaces.
Preferably, in step S6, the probability value of each semantically similar sentence sample is a semantic score representing the possibility that the sample holds as a complete sentence.
Preferably, the types of the semantically similar sentence sample include:
a sentence type, where a semantically similar sentence sample of the sentence type includes multiple sequentially arranged words;
a clause type, where a semantically similar sentence sample of the clause type includes multiple sequentially arranged words together with part-of-speech labels for the words, or includes only multiple sequentially arranged part-of-speech labels;
Step S7 then specifically includes:
Step S71: selecting and retaining the top N semantically similar sentence samples;
Step S72: judging whether semantically similar sentence samples of the clause type need to be output:
if so, turning to step S73;
if not, turning to step S74;
Step S73: replacing the words included in the semantically similar sentence samples with the corresponding part-of-speech labels so as to form complete semantically similar sentence samples, and then carrying out subsequent processing steps;
Step S74: carrying out subsequent processing steps on the retained semantically similar sentence samples.
The advantageous effect of the above technical solution is that it provides a method of automatically generating semantically similar sentence samples which can generate large batches of semantically similar sentence samples without requiring a massive candidate sentence set, eliminating a large amount of manual work.
Description of the drawings
Fig. 1 is an overall flow diagram of a method of automatically generating semantically similar sentence samples in a preferred embodiment of the present invention;
Fig. 2 is, on the basis of Fig. 1, a flow diagram of obtaining and processing an externally input sentence sample in a preferred embodiment of the present invention;
Fig. 3 is, on the basis of Fig. 1, a flow diagram of processing the semantically similar sentence samples for output while selecting and retaining them in a preferred embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
It should be noted that, in the absence of conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other.
The invention will be further described below with reference to the drawings and specific embodiments, but not as a limitation of the invention.
In view of the above problems in the prior art, a method of automatically generating semantically similar sentence samples is now provided, applicable in the course of natural language processing.
In the above method, a word vector model for obtaining semantically similar words and a language model for judging the semantic plausibility of the generated semantically similar sentence samples are trained and formed in advance.
The above method is shown in Fig. 1 and specifically includes:
Step S1: obtaining an externally input sentence sample;
Step S2: performing word segmentation on the sentence sample, decomposing it into a combination of multiple sequentially arranged words;
Step S3: using the word vector model, obtaining for each word included in the sentence sample a set of semantically similar near-synonyms;
Step S4: selecting one near-synonym from the set corresponding to each word and replacing the word, so as to form a semantically similar sentence sample associated with the sentence sample;
Step S5: judging whether there remain near-synonyms in the sets that have not yet been selected: if so, returning to step S4;
Step S6: using the language model, generating for each semantically similar sentence sample a probability value representing its semantic plausibility, and sorting all semantically similar sentence samples from high to low by probability value;
Step S7: selecting and retaining the top N semantically similar sentence samples, so as to carry out subsequent processing steps on them.
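As a minimal sketch of steps S4 through S7 (the synonym sets and the scoring function below are hypothetical stand-ins for the trained word vector model and language model; all names and data are for illustration only):

```python
from itertools import product

def generate_similar_samples(words, synonym_sets, score_fn, top_n):
    """Steps S4-S7: substitute near-synonyms, score every candidate
    with a language-model stand-in, and keep only the top N."""
    # Each word may be replaced by any entry from its synonym set;
    # the original word is also kept as a candidate (steps S4-S5 loop).
    choices = [[w] + synonym_sets.get(w, []) for w in words]
    candidates = {" ".join(c) for c in product(*choices)}
    candidates.discard(" ".join(words))          # drop the input itself
    ranked = sorted(candidates, key=score_fn, reverse=True)  # step S6
    return ranked[:top_n]                        # step S7

# Toy synonym sets (step S3 would obtain these from a word vector model).
synonyms = {"listen": ["hear"], "song": ["track", "tune"]}
# Toy scorer (step S6 would use a trained language model instead).
score = lambda s: -len(s)

top = generate_similar_samples(["listen", "song"], synonyms, score, 3)
```

With two words and synonym-set sizes of 2 and 3, the sketch enumerates 6 combinations, discards the input, and retains the 3 best-scoring candidates.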
In the present embodiment, the above word vector model may be formed using a tool that characterizes words as real-valued vectors, such as Word2vec. Such a tool uses ideas from deep learning to reduce the processing of text content to vector operations in a K-dimensional vector space, where similarity in the vector space can represent similarity in text semantics. The word vectors are obtained by modeling the language with a neural network, yielding a representation of each word in the vector space; processing a word with the word vector model then yields its near-synonyms according to the similarity between words.
Specifically, in the present embodiment, the training samples used to form the word vector model may be a large amount of text data, for example text data crawled from different forums, which needs to be segmented into words before being input.
After the word vector model has been trained, its output is a low-dimensional real-valued vector representing each word; every word in the training corpus corresponds to one such low-dimensional real-valued vector.
Such a real-valued vector can usually be expressed as [0.792, -0.177, -0.107, 0.109, -0.542, ...] or a similar form; dimensions of 50 or 100 are relatively common. The distance between the vectors of two words can then be measured with the traditional Euclidean distance, or with the cosine of the angle between them. Represented this way, the distance between "Mike" and "microphone" will be far smaller than the distance between "Mike" and "weather". For example, the cosine-angle measure may be used to compute similarity and thereby obtain the near-synonyms of a given word: when computing the similarity between other words and the given word, those with higher similarity are its near-synonyms.
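The cosine-angle measure just described can be illustrated with made-up low-dimensional vectors standing in for trained embeddings (real models use 50 or 100 dimensions; the words and values here are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings for three words.
vectors = {
    "mike":       [0.79, -0.18, -0.11],
    "microphone": [0.75, -0.20, -0.05],
    "weather":    [-0.40, 0.60, 0.30],
}

def nearest(word, k=1):
    """Near-synonyms of `word`: the k other words with highest cosine similarity."""
    others = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return [w for w, _ in sorted(others, key=lambda p: p[1], reverse=True)[:k]]
```

As the passage above states, "microphone" lands much closer to "mike" than "weather" does, so it is returned as the near-synonym.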
Correspondingly, in the present embodiment, the above language model can be a model that computes the probability that a sentence is well formed, for example expressed as P(W1, W2, ..., Wk). With a language model, one can determine which word sequence is more likely to be a sentence, or, given several words, predict the word most likely to occur next. Briefly, the language model judges whether a word sequence conforms to the way people speak, i.e. how likely the word sequence is to be a sentence. In a preferred embodiment of the present invention, the language model may be realized with an n-gram model.
Specifically, when the language model is trained, the input is each text sentence after word segmentation, and the output is the probability of each word collocation in those text sentences.
In the present embodiment, in step S1, the externally input sentence sample may be entered manually, or obtained from an external sentence sample database. The acquired sentence sample can be a purely random sentence sample; it only needs to follow the most basic semantic rules, i.e. meet the necessary conditions for constituting a sentence and be a fluent, coherent sentence.
In the present embodiment, in step S2, word segmentation is performed on each sentence sample, so a sentence sample can be decomposed into a combination of multiple sequentially arranged words. For example, the sentence sample "I want to listen to Jay Chou's Blue and White Porcelain" is segmented into "I + want + listen + Jay Chou + 's + Blue and White Porcelain", where the subsequent steps should focus on the words with concrete meaning, such as the nouns "Jay Chou" and "Blue and White Porcelain". Further, each word in the sentence sample has a corresponding part-of-speech label: for example, the label of "Jay Chou" is "singer" (which may be represented as "singer" during computer processing), and the label of "Blue and White Porcelain" is "song" (which may be represented as "song" during computer processing). In the present embodiment, this part-of-speech label may also simply be called the label of the word.
In the present embodiment, after word segmentation of the sentence sample, each word is processed with the word vector model to obtain the set of its near-synonyms. Specifically, a near-synonym is a word that is semantically similar to the word and consistent with its part of speech. For example, for "Jay Chou" with the label "singer", the near-synonyms under that label obtained by the word vector model may include "Wang Leehom", "David Tao", "Eason Chan" and "Na Ying"; the set of these near-synonyms obtained by the word vector model can then be output. Correspondingly, if the label of "Jay Chou" is "male singer" (which may be represented as "male-Singer" during computer processing), the near-synonyms under that label may include "Wang Leehom", "David Tao" and "Eason Chan". In other words, the label corresponding to a word determines the set of its near-synonyms.
In the present embodiment, in step S4, one near-synonym is selected from the set corresponding to each word to replace that word, so as to form a semantically similar sentence sample associated with the sentence sample. For example, suppose a sentence sample consists of a sequentially arranged words, and each word has one set of near-synonyms containing the b near-synonyms most semantically similar to it; then the sentence sample corresponds to b^a semantically similar sentence samples. That is, for one sentence sample there exists a set of semantically similar sentence samples, and for multiple sentence samples there exist multiple such sets, so large batches of semantically similar sentence samples can be generated automatically.
In the present embodiment, step S5 is the loop for selecting from the near-synonym sets; that is, steps S4-S5 realize the operation of generating large batches of semantically similar sentence samples for a batch of input sentence samples.
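The combinatorial count described above (b^a candidates for a sentence of a words with b near-synonyms each) can be checked by a small enumeration; the words and synonym sets here are hypothetical:

```python
from itertools import product

# a = 3 words, each with a near-synonym set of size b = 4.
synonym_sets = [
    ["i", "me", "myself", "we"],
    ["want", "wish", "hope", "like"],
    ["listen", "hear", "play", "enjoy"],
]

# Steps S4-S5 exhaust every choice of one near-synonym per position,
# yielding b ** a = 4 ** 3 = 64 semantically similar candidates.
candidates = [" ".join(combo) for combo in product(*synonym_sets)]
```

This is why even a single short input sample yields a large batch of candidates, and why the language-model filtering of step S6 is needed afterwards.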
In the present embodiment, when generating semantically similar sentence samples, some of them may be semantically incoherent simply because near-synonyms have been piled together, and therefore cannot enter subsequent processing as normal sentence samples. Therefore, in step S6, after the semantically similar sentence samples have been generated, the language model trained and generated in advance is used to analyze the semantic plausibility of each semantically similar sentence sample; finally, for each sample, a probability value representing the semantic plausibility of the sentence is generated, which can represent the semantic reasonableness of the sentence. The semantically similar sentence samples are then sorted from high to low by this probability value. Specifically, for a given sentence S = W1, W2, ..., Wk, S denotes the sentence and Wk (k = 1, 2, 3, ...) denotes the k-th word in the sentence.
The probability value of the sentence can then be expressed as: P(S) = P(W1, W2, ..., Wk) ~ P(W1)·P(W2|W1)·...·P(Wk|W1, W2, ..., Wk-1), where the probabilities P(W1), P(W2|W1), etc. in the formula are formed by training the above language model. Therefore, for each sentence S, its probability value P(S) can be obtained through the language model; this probability value can also be regarded as the semantic score of the sentence.
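A minimal n-gram language model of the kind referred to above can be sketched as a bigram model with add-one smoothing; the training sentences are invented, and the log-probability plays the role of the semantic score P(S):

```python
import math
from collections import Counter

class BigramLM:
    """P(S) ~ P(w1|<s>) * P(w2|w1) * ... estimated from bigram counts
    with add-one smoothing; score() returns the log-probability."""
    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for s in sentences:
            tokens = ["<s>"] + s
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab = len(self.unigrams)

    def score(self, sentence):
        tokens = ["<s>"] + sentence
        return sum(
            math.log((self.bigrams[(a, b)] + 1) /
                     (self.unigrams[a] + self.vocab))
            for a, b in zip(tokens, tokens[1:]))

# Toy segmented corpus; a real model would be trained on crawled text.
corpus = [["i", "want", "to", "hear", "a", "song"],
          ["i", "want", "to", "hear", "music"],
          ["play", "a", "song"]]
lm = BigramLM(corpus)

# Step S6: a fluent candidate outranks a scrambled pile of near-synonyms.
ranked = sorted([["i", "want", "to", "hear", "a", "song"],
                 ["song", "a", "hear", "i"]],
                key=lm.score, reverse=True)
```

The design choice matches the text: incoherent candidates produced by mere word piling receive low scores and fall to the bottom of the ranking before the top-N cut of step S7.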
Finally, in step S7, the top N semantically similar sentence samples are selected and retained, subsequent processing steps are carried out on the retained samples, and the other semantically similar sentence samples that were not retained are discarded. N is a natural number whose value can be freely set by the user according to actual conditions.
Specifically, for step S7, in a preferred embodiment of the present invention the top N semantically similar sentence samples may be retained for each input sentence sample. In another embodiment of the present invention, only the top N among all the semantically similar sentence samples formed may be retained. The scope of this selection can be set by the user as needed.
In a preferred embodiment of the present invention, the types of the input sentence sample include:
a sentence type, where a sentence sample of the sentence type includes multiple sequentially arranged words;
a clause type, where a sentence sample of the clause type includes multiple sequentially arranged words and part-of-speech labels, or includes only multiple sequentially arranged part-of-speech labels.
Step S1 is then shown in Fig. 2 and specifically includes:
Step S11: obtaining the externally input sentence sample;
Step S12: judging the type of the sentence sample:
if the sentence sample is of the clause type, turning to step S13;
if the sentence sample is of the sentence type, proceeding directly to step S2;
Step S13: replacing each part-of-speech label in the sentence sample with a high-frequency word corresponding to that label, so as to form a complete sentence sample, and then turning to step S2.
Specifically, in the present embodiment, the types of the sentence sample may include the sentence type and the clause type.
The sentence type refers to a sentence including multiple sequentially arranged words; for example, "I want to listen to Jay Chou's Blue and White Porcelain" is a sentence.
The clause type refers to a sentence including multiple sequentially arranged words and part-of-speech labels, or only multiple sequentially arranged part-of-speech labels; for example, "I want to listen to [singer]'s [song]" is a clause, where "singer" and "song" are part-of-speech labels.
Further, as soon as one part-of-speech label occurs in a sentence sample, that sentence sample is of the clause type. For example, "I want to listen to Jay Chou's [song]" is a sentence sample of the clause type.
In the present embodiment, a sentence sample of the sentence type can then enter the subsequent operations in step S2 without any processing.
For a clause sample, the part-of-speech labels in it need to be replaced with words corresponding to those labels, so as to form a complete sentence, which is then fed into step S2 for subsequent processing.
Specifically, in step S13, each part-of-speech label in a sentence sample judged to be of the clause type is replaced with a high-frequency word under that label, so as to form a complete sentence sample. A high-frequency word is a word that, according to statistics, occurs relatively often and is used relatively frequently under a given part-of-speech label; substituting such high-frequency words for the corresponding part-of-speech labels in a clause-type sentence sample yields a relatively reasonable and complete sentence sample.
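Step S13 as described above can be sketched as a simple template substitution, replacing each part-of-speech label with the most frequent word observed for that label; the label inventory and occurrence counts below are hypothetical:

```python
from collections import Counter

# Hypothetical occurrence counts per part-of-speech label.
label_counts = {
    "singer": Counter({"Jay Chou": 120, "Wang Leehom": 80}),
    "song":   Counter({"Blue and White Porcelain": 95, "Nunchucks": 40}),
}

def fill_clause(tokens):
    """Step S13: replace each label token with its highest-frequency word,
    leaving ordinary words untouched, to form a complete sentence sample."""
    return [label_counts[t].most_common(1)[0][0] if t in label_counts else t
            for t in tokens]

clause = ["I", "want", "to", "hear", "singer", "'s", "song"]
sentence = fill_clause(clause)
```

The filled sentence is then a complete, reasonable sample that can enter the word segmentation and substitution steps like any sentence-type input.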
In a preferred embodiment of the present invention, the word vector model is trained and formed in advance using a preset word segmentation method;
then in step S2, word segmentation is performed on the sentence sample using the same preset word segmentation method.
Specifically, in the present embodiment, segmenting the sentence samples with the same segmentation method used to train the word vector model reduces out-of-vocabulary words in the subsequent processing steps, and therefore helps to improve the final processing effect.
In a preferred embodiment of the present invention, the preset word segmentation method may be a dictionary-based forward maximum matching procedure: take the first m characters of the sentence to be segmented, from left to right, as the matching field, where m is the length of the longest word in the dictionary; look the field up in the dictionary; if the match succeeds, output the matched field as one segmented word; if the match fails, remove the last character of the matching field and match the remaining string as the new matching field; repeat the above process until all words have been segmented.
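The forward maximum matching procedure just described can be sketched directly; the dictionary here is a toy stand-in for the large compiled lexicon mentioned later:

```python
def forward_max_match(sentence, dictionary):
    """Dictionary-based forward maximum matching: repeatedly take the
    longest dictionary word that prefixes the remaining text."""
    max_len = max(len(w) for w in dictionary)
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking the field on failure.
        for m in range(min(max_len, len(sentence) - i), 0, -1):
            field = sentence[i:i + m]
            if m == 1 or field in dictionary:
                words.append(field)
                i += m
                break
    return words

lexicon = {"ab", "abc", "cd", "d"}
tokens = forward_max_match("abcd", lexicon)
```

Single characters are always accepted as a last resort, which keeps the loop from stalling on out-of-dictionary characters.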
In another preferred embodiment of the present invention, the preset word segmentation method may be a dictionary-based reverse maximum matching procedure, specifically: take the last m characters of the sentence to be segmented, from right to left, as the matching field, where m is the length of the longest word in the dictionary; look the field up in the dictionary; if the match succeeds, output the matched field as one segmented word; if the match fails, remove the first character of the matching field and match the remaining string as the new matching field; repeat the above process until all words have been segmented.
In yet another preferred embodiment of the present invention, the preset word segmentation method may be a dictionary-based bidirectional maximum matching procedure, which combines the above forward maximum matching and reverse maximum matching methods, specifically:
if the forward and reverse maximum matching results are identical, take either result as the output;
if the forward and reverse maximum matching results differ, first select the result with the smaller number of segmented words; if the numbers of words are the same, select the reverse maximum matching result.
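Combining the two directions with the selection rule above, a self-contained bidirectional maximum matching sketch (same toy-dictionary caveat as before) could look like:

```python
def _max_match(sentence, dictionary, reverse=False):
    """Forward (or, with reverse=True, backward) maximum matching."""
    max_len = max(len(w) for w in dictionary)
    words, text = [], sentence
    while text:
        for m in range(min(max_len, len(text)), 0, -1):
            field = text[-m:] if reverse else text[:m]
            if m == 1 or field in dictionary:
                if reverse:
                    words.insert(0, field)   # keep left-to-right order
                    text = text[:-m]
                else:
                    words.append(field)
                    text = text[m:]
                break
    return words

def bidirectional_max_match(sentence, dictionary):
    fwd = _max_match(sentence, dictionary)
    bwd = _max_match(sentence, dictionary, reverse=True)
    if fwd == bwd:
        return fwd                        # both directions agree
    if len(fwd) != len(bwd):
        return min(fwd, bwd, key=len)     # prefer fewer segmented words
    return bwd                            # same count: prefer reverse result

segmented = bidirectional_max_match("abcd", {"a", "bcd", "ab", "cd"})
```

Here the two directions disagree with equal word counts (forward gives ["ab", "cd"], backward gives ["a", "bcd"]), so per the rule the reverse result is selected.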
" dictionary " so-called in above-described embodiment refers to including a large amount of words by one formed after compiling Dictionary database.
In the other embodiment of the present invention, other segmenting methods are readily applicable in the present invention, have no effect on the present invention Protection domain.
In a preferred embodiment of the present invention, in step S4, the near-synonym selected for replacement has the same part of speech as the word it replaces, for example both are nouns or both are verbs; this ensures the accuracy of the replacement operation and avoids the replaced sentence being logically unreasonable.
In a preferred embodiment of the present invention, the types of the semantically similar sentence sample include:
a sentence type, where a semantically similar sentence sample of the sentence type includes multiple sequentially arranged words;
a clause type, where a semantically similar sentence sample of the clause type includes multiple sequentially arranged words and part-of-speech labels, or includes only multiple sequentially arranged part-of-speech labels.
As shown in Fig. 3, step S7 then specifically includes:
Step S71: selecting and retaining the top N semantically similar sentence samples;
Step S72: judging whether semantically similar sentence samples of the clause type need to be output:
if so, turning to step S73;
if not, turning to step S74;
Step S73: replacing the words included in the semantically similar sentence samples with the corresponding part-of-speech labels so as to form complete semantically similar sentence samples, and then carrying out subsequent processing steps;
Step S74: carrying out subsequent processing steps on the retained semantically similar sentence samples.
Specifically, similar to the above, the semantically similar sentence samples likewise include the sentence type and the clause type. In the present embodiment, the user can set the final output semantically similar sentence samples to be of the sentence type or the clause type:
if the user sets the final output to the sentence type, the semantically similar sentence samples screened by the language model are output directly and subsequent processing steps are carried out;
if the user sets the final output to the clause type, the words included in the semantically similar sentence samples need to be replaced with the corresponding part-of-speech labels so as to form complete clause-type semantically similar sentence samples, followed by subsequent processing steps.
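The word-to-label conversion of step S73 is the inverse of the step S13 substitution; with a hypothetical word-to-label mapping it can be sketched as:

```python
# Hypothetical mapping from concrete words back to part-of-speech labels.
word_labels = {
    "Jay Chou": "singer", "Wang Leehom": "singer",
    "Blue and White Porcelain": "song",
}

def to_clause(tokens):
    """Step S73: replace labeled words with their part-of-speech labels
    so the retained samples can be output as clause-type templates."""
    return [word_labels.get(t, t) for t in tokens]

template = to_clause(["I", "want", "to", "hear", "Wang Leehom", "'s",
                      "Blue and White Porcelain"])
```

Outputting clause-type templates this way lets one retained sample stand for every concrete word that shares its labels.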
In the preferred embodiment of the present invention, the above subsequent processing steps may include that basis automatically generates Large batch of semantic similarity sentence sample carry out the exploitation of semantic open platform, or carry out the calculating etc. of semantic similarity.
Specifically, in a preferred embodiment of the present invention, the function of the semantic open platform is to open semantic interfaces to other developers and to help them complete the development of specific programs. When a user inputs a sentence or clause, the method described herein can automatically generate a large number of similar sentences or clauses, thereby increasing semantic generalization ability, enhancing semantic understanding, reducing manual work, saving time, and improving efficiency.
Correspondingly, in a preferred embodiment of the present invention, the computation of semantic similarity requires a large number of semantically similar sentences or clauses; the above method can therefore be used to generate, in large quantities, sentence samples for training the semantic similarity computation.
In a preferred embodiment of the present invention, the above step S7 may finally output a set containing the retained semantic similarity sentence samples for subsequent processing.
The above are merely preferred embodiments of the present invention and are not intended to limit its embodiments or scope of protection. Those skilled in the art should appreciate that all schemes obtained through equivalent replacement or obvious variation of the description and drawings of the present invention fall within the scope of protection of the present invention.

Claims (7)

1. A method of automatically generating semantic similarity sentence samples, suitable for use in natural language processing, characterized in that a word-vector model for obtaining semantically similar words and a language model for judging the semantic plausibility of the generated semantic similarity sentence samples are trained and formed in advance, the method comprising:
step S1: acquiring an externally input sentence sample;
step S2: performing word segmentation on the sentence sample to decompose it into a combination of multiple sequentially ordered words;
step S3: using the word-vector model to obtain, for each word contained in the sentence sample, a set of words semantically similar to that word;
step S4: selecting, from the set corresponding to each word, one similar word to replace that word, thereby forming a semantic similarity sentence sample associated with the sentence sample;
step S5: judging whether the sets still contain similar words that have not yet been selected:
if so, returning to step S4;
step S6: using the language model to generate, for each semantic similarity sentence sample, a probability value indicating its semantic plausibility, and sorting all semantic similarity sentence samples by the probability value from high to low;
step S7: selecting and retaining the top N semantic similarity sentence samples, and performing subsequent processing steps on the retained semantic similarity sentence samples.
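By way of non-limiting illustration, steps S2 through S7 of the claimed method might be sketched end to end as follows; the toy word vectors, bigram counts, and whitespace segmentation are hypothetical stand-ins for the pre-trained word-vector model, the pre-trained language model, and the word-segmentation method:

```python
import itertools
import math

# Hypothetical stand-in for the pre-trained word-vector model.
VECTORS = {
    "play": (1.0, 0.1), "start": (0.9, 0.2),
    "song": (0.1, 1.0), "tune": (0.2, 0.9),
    "a": (-1.0, -0.9), "the": (-0.9, -1.0),
}
# Hypothetical bigram counts standing in for the language model.
BIGRAMS = {("play", "a"): 5, ("a", "song"): 4, ("start", "a"): 2,
           ("a", "tune"): 1, ("the", "song"): 3}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def similar_words(word, k=2):
    """Step S3: the k words whose vectors lie closest to `word`."""
    return sorted(VECTORS, key=lambda w: cosine(VECTORS[word], VECTORS[w]),
                  reverse=True)[:k]

def lm_score(words):
    """Step S6: add-one-smoothed bigram log-probability as a plausibility score."""
    return sum(math.log(BIGRAMS.get(p, 0) + 1) for p in zip(words, words[1:]))

def generate(sentence, top_n=3):
    words = sentence.split()                                      # step S2 (toy)
    candidates = [similar_words(w) for w in words]                # step S3
    variants = [list(v) for v in itertools.product(*candidates)]  # steps S4/S5
    variants.sort(key=lm_score, reverse=True)                     # step S6
    return variants[:top_n]                                       # step S7

print(generate("play a song"))
```

With these toy tables, the original sentence ranks first because its bigrams are the most frequent, and close variants such as "start a song" follow.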
2. The method of automatically generating semantic similarity sentence samples as described in claim 1, characterized in that the type of the sentence sample includes:
a sentence type, wherein a sentence sample of the sentence type contains multiple sequentially ordered words;
a clause type, wherein a sentence sample of the clause type contains multiple sequentially ordered words together with the part-of-speech labels of those words, or contains only multiple sequentially ordered part-of-speech labels;
and that step S1 specifically includes:
step S11: acquiring the externally input sentence sample;
step S12: judging the type of the sentence sample:
if the sentence sample is of the clause type, proceeding to step S13;
if the sentence sample is of the sentence type, proceeding directly to step S2;
step S13: replacing each part-of-speech label in the sentence sample with a high-frequency word corresponding to that label to form a complete sentence sample, and then proceeding to step S2.
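By way of non-limiting illustration, the label-to-word expansion of step S13 might be sketched as follows; the tag set and the high-frequency word table are hypothetical:

```python
# Sketch of step S13: expand a clause-type input into a complete
# sentence sample by replacing each part-of-speech label with a
# high-frequency word of that part of speech. Table is hypothetical.
HIGH_FREQ = {"v": "play", "det": "a", "n": "song"}

def expand_clause(tokens):
    """Replace POS labels with high-frequency words; keep literal words."""
    return [HIGH_FREQ.get(t, t) for t in tokens]

# A clause-type sample may mix literal words and labels:
print(expand_clause(["v", "a", "n"]))  # ['play', 'a', 'song']
```

Tokens that are not labels (here the literal word "a") pass through unchanged, matching the claim's mixed words-plus-labels clause type.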
3. The method of automatically generating semantic similarity sentence samples as described in claim 1, characterized in that the word-vector model is trained and formed in advance using a preset word-segmentation method;
and that in step S2, the word segmentation of the sentence sample is performed using the same preset word-segmentation method.
4. The method of automatically generating semantic similarity sentence samples as described in claim 1, characterized in that in step S4, the similar word selected for replacement has the same part of speech as the word it replaces.
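By way of non-limiting illustration, the part-of-speech constraint of this claim might be applied as a simple filter over the candidate set obtained in step S3; the POS lookup table is hypothetical:

```python
# Sketch of the claim-4 constraint: a candidate replacement is accepted
# only if it shares the part of speech of the word it replaces.
# The POS lookup table is hypothetical.
POS = {"play": "v", "start": "v", "song": "n", "a": "det"}

def same_pos_candidates(word, candidates):
    """Keep only candidates whose part of speech matches `word`."""
    return [c for c in candidates if POS.get(c) == POS.get(word)]

print(same_pos_candidates("play", ["start", "song", "a"]))  # ['start']
```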
5. The method of automatically generating semantic similarity sentence samples as described in claim 1, characterized in that in step S6, the probability value of each semantic similarity sentence sample is a semantic score indicating the plausibility that the sample holds as a complete sentence.
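By way of non-limiting illustration, the probability value of step S6 might be realized as a smoothed n-gram log-probability, so that well-formed word orders score higher; the bigram counts here are hypothetical:

```python
import math

# Sketch of the claim-5 score: a smoothed bigram log-probability used
# as the semantic plausibility of a candidate sample. Counts are
# hypothetical stand-ins for a pre-trained language model.
COUNTS = {("play", "a"): 5, ("a", "song"): 4}

def plausibility(words):
    """Sum of add-one-smoothed bigram log counts over the sample."""
    return sum(math.log(COUNTS.get(p, 0) + 1) for p in zip(words, words[1:]))

good = plausibility(["play", "a", "song"])
bad = plausibility(["song", "play", "a"])
assert good > bad  # the well-formed order scores higher
```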
6. The method of automatically generating semantic similarity sentence samples as described in claim 1, characterized in that the type of the semantic similarity sentence sample includes:
a sentence type, wherein a semantic similarity sentence sample of the sentence type contains multiple sequentially ordered words;
a clause type, wherein a semantic similarity sentence sample of the clause type contains multiple sequentially ordered words together with the part-of-speech labels of those words, or contains only multiple sequentially ordered part-of-speech labels;
and that step S7 specifically includes:
step S71: selecting and retaining the top N semantic similarity sentence samples;
step S72: judging whether semantic similarity sentence samples of the clause type need to be output:
if so, proceeding to step S73;
if not, proceeding to step S74;
step S73: replacing the words contained in each semantic similarity sentence sample with the corresponding part-of-speech labels to form complete semantic similarity sentence samples, and then performing the subsequent processing steps;
step S74: performing the subsequent processing steps on the retained semantic similarity sentence samples.
7. The method of automatically generating semantic similarity sentence samples as described in claim 1, characterized in that in step S7, after the top N semantic similarity sentence samples are selected and retained, a set containing the retained semantic similarity sentence samples is output for the subsequent processing steps.
CN201710109325.7A 2017-02-27 2017-02-27 A method of automatically generating semantic similarity sentence sample Pending CN108509409A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201710109325.7A CN108509409A (en) 2017-02-27 2017-02-27 A method of automatically generating semantic similarity sentence sample
PCT/CN2018/074325 WO2018153215A1 (en) 2017-02-27 2018-01-26 Method for automatically generating sentence sample with similar semantics
TW107105170A TWI662425B (en) 2017-02-27 2018-02-13 A method of automatically generating semantic similar sentence samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710109325.7A CN108509409A (en) 2017-02-27 2017-02-27 A method of automatically generating semantic similarity sentence sample

Publications (1)

Publication Number Publication Date
CN108509409A true CN108509409A (en) 2018-09-07

Family

ID=63254281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710109325.7A Pending CN108509409A (en) 2017-02-27 2017-02-27 A method of automatically generating semantic similarity sentence sample

Country Status (3)

Country Link
CN (1) CN108509409A (en)
TW (1) TWI662425B (en)
WO (1) WO2018153215A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657231A (en) * 2018-11-09 2019-04-19 广东电网有限责任公司 A kind of long SMS compressing method and system
CN110334197A (en) * 2019-06-28 2019-10-15 科大讯飞股份有限公司 Corpus processing method and relevant apparatus
CN110633359A (en) * 2019-09-04 2019-12-31 北京百分点信息科技有限公司 Sentence equivalence judgment method and device
CN111709234A (en) * 2020-05-28 2020-09-25 北京百度网讯科技有限公司 Training method and device of text processing model and electronic equipment
CN111950237A (en) * 2019-04-29 2020-11-17 深圳市优必选科技有限公司 Sentence rewriting method, sentence rewriting device and electronic equipment

Families Citing this family (8)

Publication number Priority date Publication date Assignee Title
CN110096572B (en) * 2019-04-12 2023-09-15 成都美满科技有限责任公司 Sample generation method, device and computer readable medium
CN110929522A (en) * 2019-08-19 2020-03-27 网娱互动科技(北京)股份有限公司 Intelligent synonym replacement method and system
CN110929526A (en) * 2019-10-28 2020-03-27 深圳绿米联创科技有限公司 Sample generation method and device and electronic equipment
CN111178059B (en) * 2019-12-07 2023-08-25 武汉光谷信息技术股份有限公司 Similarity comparison method and device based on word2vec technology
CN112395867B (en) * 2020-11-16 2023-08-08 中国平安人寿保险股份有限公司 Synonym mining method and device, storage medium and computer equipment
CN112883150B (en) * 2021-01-21 2023-07-25 平安科技(深圳)有限公司 Method, device, equipment and storage medium for distinguishing trademark words from general words
CN113688239B (en) * 2021-08-20 2024-04-16 平安国际智慧城市科技股份有限公司 Text classification method and device under small sample, electronic equipment and storage medium
US11741302B1 (en) 2022-05-18 2023-08-29 Microsoft Technology Licensing, Llc Automated artificial intelligence driven readability scoring techniques

Citations (5)

Publication number Priority date Publication date Assignee Title
JP2006059105A (en) * 2004-08-19 2006-03-02 Mitsubishi Electric Corp Apparatus, method and program for preparing language model
US20130018649A1 (en) * 2011-07-13 2013-01-17 Nuance Communications, Inc. System and a Method for Generating Semantically Similar Sentences for Building a Robust SLM
CN103218444A (en) * 2013-04-22 2013-07-24 中央民族大学 Method of Tibetan language webpage text classification based on semanteme
CN103823794A (en) * 2014-02-25 2014-05-28 浙江大学 Automatic question setting method about query type short answer question of English reading comprehension test
CN104281565A (en) * 2014-09-30 2015-01-14 百度在线网络技术(北京)有限公司 Semantic dictionary constructing method and device

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
TWI227417B (en) * 2003-12-02 2005-02-01 Inst Information Industry Digital resource recommendation system, method and machine-readable medium using semantic comparison of query sentence
GB201010545D0 (en) * 2010-06-23 2010-08-11 Rolls Royce Plc Entity recognition
CN106033416B (en) * 2015-03-09 2019-12-24 阿里巴巴集团控股有限公司 Character string processing method and device
CN105677637A (en) * 2015-12-31 2016-06-15 上海智臻智能网络科技股份有限公司 Method and device for updating abstract semantics database in intelligent question-answering system
CN106021223B (en) * 2016-05-09 2020-06-23 Tcl科技集团股份有限公司 Sentence similarity calculation method and system

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
JP2006059105A (en) * 2004-08-19 2006-03-02 Mitsubishi Electric Corp Apparatus, method and program for preparing language model
JP4245530B2 (en) * 2004-08-19 2009-03-25 三菱電機株式会社 Language model creation apparatus and method, and program
US20130018649A1 (en) * 2011-07-13 2013-01-17 Nuance Communications, Inc. System and a Method for Generating Semantically Similar Sentences for Building a Robust SLM
CN103218444A (en) * 2013-04-22 2013-07-24 中央民族大学 Method of Tibetan language webpage text classification based on semanteme
CN103823794A (en) * 2014-02-25 2014-05-28 浙江大学 Automatic question setting method about query type short answer question of English reading comprehension test
CN104281565A (en) * 2014-09-30 2015-01-14 百度在线网络技术(北京)有限公司 Semantic dictionary constructing method and device

Cited By (8)

Publication number Priority date Publication date Assignee Title
CN109657231A (en) * 2018-11-09 2019-04-19 广东电网有限责任公司 A kind of long SMS compressing method and system
CN109657231B (en) * 2018-11-09 2023-04-07 广东电网有限责任公司 Long short message simplifying method and system
CN111950237A (en) * 2019-04-29 2020-11-17 深圳市优必选科技有限公司 Sentence rewriting method, sentence rewriting device and electronic equipment
CN111950237B (en) * 2019-04-29 2023-06-09 深圳市优必选科技有限公司 Sentence rewriting method, sentence rewriting device and electronic equipment
CN110334197A (en) * 2019-06-28 2019-10-15 科大讯飞股份有限公司 Corpus processing method and relevant apparatus
CN110633359A (en) * 2019-09-04 2019-12-31 北京百分点信息科技有限公司 Sentence equivalence judgment method and device
CN110633359B (en) * 2019-09-04 2022-03-29 北京百分点科技集团股份有限公司 Sentence equivalence judgment method and device
CN111709234A (en) * 2020-05-28 2020-09-25 北京百度网讯科技有限公司 Training method and device of text processing model and electronic equipment

Also Published As

Publication number Publication date
TW201841121A (en) 2018-11-16
WO2018153215A1 (en) 2018-08-30
TWI662425B (en) 2019-06-11

Similar Documents

Publication Publication Date Title
CN108509409A (en) A method of automatically generating semantic similarity sentence sample
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
US10496749B2 (en) Unified semantics-focused language processing and zero base knowledge building system
CN101599071B (en) Automatic extraction method of conversation text topic
Constant et al. MWU-aware part-of-speech tagging with a CRF model and lexical resources
CN110489760A (en) Based on deep neural network text auto-collation and device
CN109408642A (en) A kind of domain entities relation on attributes abstracting method based on distance supervision
CN105138514B (en) It is a kind of based on dictionary it is positive gradually plus a word maximum matches Chinese word cutting method
CN106126620A (en) Method of Chinese Text Automatic Abstraction based on machine learning
Suleiman et al. The use of hidden Markov model in natural ARABIC language processing: a survey
CN112328800A (en) System and method for automatically generating programming specification question answers
CN106599032A (en) Text event extraction method in combination of sparse coding and structural perceptron
CN110879834B (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN108681574A (en) A kind of non-true class quiz answers selection method and system based on text snippet
CN113505209A (en) Intelligent question-answering system for automobile field
CN112364132A (en) Similarity calculation model and system based on dependency syntax and method for building system
CN110222344B (en) Composition element analysis algorithm for composition tutoring of pupils
CN113361252A (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
Fiser et al. Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources.
Sharma et al. Lexicon a linguistic approach for sentiment classification
CN107818078B (en) Semantic association and matching method for Chinese natural language dialogue
Srinivasagan et al. An automated system for tamil named entity recognition using hybrid approach
Li et al. Multilingual toxic text classification model based on deep learning
Basumatary et al. Deep Learning Based Bodo Parts of Speech Tagger
Filippova et al. Bilingual terminology extraction using neural word embeddings on comparable corpora

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1252738

Country of ref document: HK

RJ01 Rejection of invention patent application after publication

Application publication date: 20180907
