CN108509409A - Method for automatically generating semantically similar sentence samples - Google Patents

Method for automatically generating semantically similar sentence samples

Info

Publication number
CN108509409A
CN108509409A (application CN201710109325.7A)
Authority
CN
China
Prior art keywords
sentence sample
semantic similarity
sentence
sample
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710109325.7A
Other languages
Chinese (zh)
Inventor
王昊
陈见耸
高鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yutou Technology Hangzhou Co Ltd
Original Assignee
Yutou Technology Hangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yutou Technology Hangzhou Co Ltd filed Critical Yutou Technology Hangzhou Co Ltd
Priority to CN201710109325.7A priority Critical patent/CN108509409A/en
Priority to PCT/CN2018/074325 priority patent/WO2018153215A1/en
Priority to TW107105170A priority patent/TWI662425B/en
Publication of CN108509409A publication Critical patent/CN108509409A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for automatically generating semantically similar sentence samples, belonging to the field of language processing technology. The method includes: obtaining a sentence sample and performing word segmentation on it; using a word vector model to obtain, for each word, a set of near-synonyms semantically similar to that word; selecting one near-synonym from each set to replace the corresponding word, thereby forming semantically similar sentence samples; using a language model to generate, for each semantically similar sentence sample, a probability value representing its semantic plausibility, and sorting all semantically similar sentence samples from high to low by that probability value; and selecting and retaining the top N semantically similar sentence samples so that subsequent processing steps can be carried out on the retained samples. The advantageous effect of this technical solution is that large batches of semantically similar sentence samples can be generated automatically without requiring a massive candidate sentence set, eliminating a large amount of manual work.

Description

Method for automatically generating semantically similar sentence samples
Technical field
The present invention relates to the field of natural language processing technology, and in particular to a method for automatically generating semantically similar sentence samples.
Background art
In the prior art, many natural language processing tasks require large sets of semantically similar sentences or clause patterns. These sets usually have to be written manually, which consumes a great deal of manpower and time.
With the development of automation technology, more and more of this compilation work can be realized automatically. At present, the main ways of obtaining large batches of semantically similar sentences are the following:
1) Obtaining large batches of semantically similar sentences by retrieval. The retrieval-based approach finds a set of semantically similar sentences among a massive collection of candidate sentences by means of a retrieval formula. The premise of this method is that a massive candidate sentence set must already exist, and the performance requirements on the semantic similarity retrieval module are very high: the precision of the semantically similar sentences obtained this way is determined by the performance of that retrieval module.
2) Obtaining large batches of semantically similar sentences with sequence-to-sequence models. This approach is currently very active in academic research, but in practical applications many of the sentences it generates are unreasonable; its performance is not good enough, so it lacks a certain practicability.
Summary of the invention
In view of the above problems in the prior art, a technical solution for a method of automatically generating semantically similar sentence samples is now provided, intended to effectively generate large batches of semantically similar sentence samples automatically and eliminate a large amount of manual work.
The above technical solution specifically includes:
A method of automatically generating semantically similar sentence samples, applicable in the course of natural language processing, wherein a word vector model for obtaining semantically similar words and a language model for judging the semantic plausibility of the generated semantically similar sentence samples are trained and formed in advance; the method further includes:
Step S1: obtaining an externally input sentence sample;
Step S2: performing word segmentation on the sentence sample, decomposing the sentence sample into a combination of multiple sequentially arranged words;
Step S3: using the word vector model, obtaining for each word included in the sentence sample a set of near-synonyms semantically similar to that word;
Step S4: selecting one near-synonym from the set corresponding to each word and replacing the word with it, so as to form a semantically similar sentence sample associated with the sentence sample;
Step S5: judging whether there remain near-synonyms in the sets that have not yet been selected:
if so, returning to step S4;
Step S6: using the language model, generating for each semantically similar sentence sample a probability value representing its semantic plausibility, and sorting all semantically similar sentence samples from high to low by probability value;
Step S7: selecting and retaining the top N semantically similar sentence samples, so as to carry out subsequent processing steps on the retained semantically similar sentence samples.
Preferably, in the method for automatically generating semantically similar sentence samples, the types of the sentence sample include:
a sentence type, where a sentence sample of the sentence type includes multiple sequentially arranged words;
a clause type, where a sentence sample of the clause type includes multiple sequentially arranged words together with part-of-speech labels for the words, or includes only multiple sequentially arranged part-of-speech labels;
Step S1 then specifically includes:
Step S11: obtaining the externally input sentence sample;
Step S12: judging the type of the sentence sample:
if the sentence sample is of the clause type, turning to step S13;
if the sentence sample is of the sentence type, proceeding directly to step S2;
Step S13: replacing each part-of-speech label in the sentence sample with a high-frequency word corresponding to that label so as to form a complete sentence sample, and then turning to step S2.
Preferably, in the method for automatically generating semantically similar sentence samples, the word vector model is trained and formed in advance using a preset word segmentation method;
then in step S2, word segmentation is performed on the sentence sample using the same preset word segmentation method.
Preferably, in step S4, the near-synonym selected for replacement has the same part of speech as the word it replaces.
Preferably, in step S6, the probability value of each semantically similar sentence sample is a semantic score representing the possibility that the sample holds as a complete sentence.
Preferably, the types of the semantically similar sentence sample include:
a sentence type, where a semantically similar sentence sample of the sentence type includes multiple sequentially arranged words;
a clause type, where a semantically similar sentence sample of the clause type includes multiple sequentially arranged words together with part-of-speech labels for the words, or includes only multiple sequentially arranged part-of-speech labels;
Step S7 then specifically includes:
Step S71: selecting and retaining the top N semantically similar sentence samples;
Step S72: judging whether semantically similar sentence samples of the clause type need to be output:
if so, turning to step S73;
if not, turning to step S74;
Step S73: replacing the words included in the semantically similar sentence samples with the corresponding part-of-speech labels so as to form complete semantically similar sentence samples, and then carrying out subsequent processing steps;
Step S74: carrying out subsequent processing steps on the retained semantically similar sentence samples.
The advantageous effect of the above technical solution is that it provides a method of automatically generating semantically similar sentence samples which can generate large batches of semantically similar sentence samples without requiring a massive candidate sentence set, eliminating a large amount of manual work.
Description of the drawings
Fig. 1 is an overall flow diagram of a method of automatically generating semantically similar sentence samples in a preferred embodiment of the present invention;
Fig. 2 is, on the basis of Fig. 1, a flow diagram of obtaining and processing an externally input sentence sample in a preferred embodiment of the present invention;
Fig. 3 is, on the basis of Fig. 1, a flow diagram of processing the semantically similar sentence samples for output while selecting and retaining them in a preferred embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
It should be noted that, in the absence of conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other.
The invention will be further described below with reference to the drawings and specific embodiments, but not as a limitation of the invention.
In view of the above problems in the prior art, a method of automatically generating semantically similar sentence samples is now provided, applicable in the course of natural language processing.
In the above method, a word vector model for obtaining semantically similar words and a language model for judging the semantic plausibility of the generated semantically similar sentence samples are trained and formed in advance.
The above method is shown in Fig. 1 and specifically includes:
Step S1: obtaining an externally input sentence sample;
Step S2: performing word segmentation on the sentence sample, decomposing it into a combination of multiple sequentially arranged words;
Step S3: using the word vector model, obtaining for each word included in the sentence sample a set of semantically similar near-synonyms;
Step S4: selecting one near-synonym from the set corresponding to each word and replacing the word, so as to form a semantically similar sentence sample associated with the sentence sample;
Step S5: judging whether there remain near-synonyms in the sets that have not yet been selected: if so, returning to step S4;
Step S6: using the language model, generating for each semantically similar sentence sample a probability value representing its semantic plausibility, and sorting all semantically similar sentence samples from high to low by probability value;
Step S7: selecting and retaining the top N semantically similar sentence samples, so as to carry out subsequent processing steps on them.
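As a minimal sketch of steps S4 through S7 (the synonym sets and the scoring function below are hypothetical stand-ins for the trained word vector model and language model; all names and data are for illustration only):

```python
from itertools import product

def generate_similar_samples(words, synonym_sets, score_fn, top_n):
    """Steps S4-S7: substitute near-synonyms, score every candidate
    with a language-model stand-in, and keep only the top N."""
    # Each word may be replaced by any entry from its synonym set;
    # the original word is also kept as a candidate (steps S4-S5 loop).
    choices = [[w] + synonym_sets.get(w, []) for w in words]
    candidates = {" ".join(c) for c in product(*choices)}
    candidates.discard(" ".join(words))          # drop the input itself
    ranked = sorted(candidates, key=score_fn, reverse=True)  # step S6
    return ranked[:top_n]                        # step S7

# Toy synonym sets (step S3 would obtain these from a word vector model).
synonyms = {"listen": ["hear"], "song": ["track", "tune"]}
# Toy scorer (step S6 would use a trained language model instead).
score = lambda s: -len(s)

top = generate_similar_samples(["listen", "song"], synonyms, score, 3)
```

With two words and synonym-set sizes of 2 and 3, the sketch enumerates 6 combinations, discards the input, and retains the 3 best-scoring candidates.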
In the present embodiment, the above word vector model may be formed using a tool that characterizes words as real-valued vectors, such as Word2vec. Such a tool uses ideas from deep learning to reduce the processing of text content to vector operations in a K-dimensional vector space, where similarity in the vector space can represent similarity in text semantics. The word vectors are obtained by modeling the language with a neural network, yielding a representation of each word in the vector space; processing a word with the word vector model then yields its near-synonyms according to the similarity between words.
Specifically, in the present embodiment, the training samples used to form the word vector model may be a large amount of text data, for example text data crawled from different forums, which needs to be segmented into words before being input.
After the word vector model has been trained, its output is a low-dimensional real-valued vector representing each word; every word in the training corpus corresponds to one such low-dimensional real-valued vector.
Such a real-valued vector can usually be expressed as [0.792, -0.177, -0.107, 0.109, -0.542, ...] or a similar form; dimensions of 50 or 100 are relatively common. The distance between the vectors of two words can then be measured with the traditional Euclidean distance, or with the cosine of the angle between them. Represented this way, the distance between "Mike" and "microphone" will be far smaller than the distance between "Mike" and "weather". For example, the cosine-angle measure may be used to compute similarity and thereby obtain the near-synonyms of a given word: when computing the similarity between other words and the given word, those with higher similarity are its near-synonyms.
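The cosine-angle measure just described can be illustrated with made-up low-dimensional vectors standing in for trained embeddings (real models use 50 or 100 dimensions; the words and values here are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings for three words.
vectors = {
    "mike":       [0.79, -0.18, -0.11],
    "microphone": [0.75, -0.20, -0.05],
    "weather":    [-0.40, 0.60, 0.30],
}

def nearest(word, k=1):
    """Near-synonyms of `word`: the k other words with highest cosine similarity."""
    others = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return [w for w, _ in sorted(others, key=lambda p: p[1], reverse=True)[:k]]
```

As the passage above states, "microphone" lands much closer to "mike" than "weather" does, so it is returned as the near-synonym.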
Correspondingly, in the present embodiment, the above language model can be a model that computes the probability that a sentence is well formed, for example expressed as P(W1, W2, ..., Wk). With a language model, one can determine which word sequence is more likely to be a sentence, or, given several words, predict the word most likely to occur next. Briefly, the language model judges whether a word sequence conforms to the way people speak, i.e. how likely the word sequence is to be a sentence. In a preferred embodiment of the present invention, the language model may be realized with an n-gram model.
Specifically, when the language model is trained, the input is each text sentence after word segmentation, and the output is the probability of each word collocation in those text sentences.
In the present embodiment, in step S1, the externally input sentence sample may be entered manually, or obtained from an external sentence sample database. The acquired sentence sample can be a purely random sentence sample; it only needs to follow the most basic semantic rules, i.e. meet the necessary conditions for constituting a sentence and be a fluent, coherent sentence.
In the present embodiment, in step S2, word segmentation is performed on each sentence sample, so a sentence sample can be decomposed into a combination of multiple sequentially arranged words. For example, the sentence sample "I want to listen to Jay Chou's Blue and White Porcelain" is segmented into "I + want + listen + Jay Chou + 's + Blue and White Porcelain", where the subsequent steps should focus on the words with concrete meaning, such as the nouns "Jay Chou" and "Blue and White Porcelain". Further, each word in the sentence sample has a corresponding part-of-speech label: for example, the label of "Jay Chou" is "singer" (which may be represented as "singer" during computer processing), and the label of "Blue and White Porcelain" is "song" (which may be represented as "song" during computer processing). In the present embodiment, this part-of-speech label may also simply be called the label of the word.
In the present embodiment, after word segmentation of the sentence sample, each word is processed with the word vector model to obtain the set of its near-synonyms. Specifically, a near-synonym is a word that is semantically similar to the word and consistent with its part of speech. For example, for "Jay Chou" with the label "singer", the near-synonyms under that label obtained by the word vector model may include "Wang Leehom", "David Tao", "Eason Chan" and "Na Ying"; the set of these near-synonyms obtained by the word vector model can then be output. Correspondingly, if the label of "Jay Chou" is "male singer" (which may be represented as "male-Singer" during computer processing), the near-synonyms under that label may include "Wang Leehom", "David Tao" and "Eason Chan". In other words, the label corresponding to a word determines the set of its near-synonyms.
In the present embodiment, in step S4, one near-synonym is selected from the set corresponding to each word to replace that word, so as to form a semantically similar sentence sample associated with the sentence sample. For example, suppose a sentence sample consists of a sequentially arranged words, and each word has one set of near-synonyms containing the b near-synonyms most semantically similar to it; then the sentence sample corresponds to b^a semantically similar sentence samples. That is, for one sentence sample there exists a set of semantically similar sentence samples, and for multiple sentence samples there exist multiple such sets, so large batches of semantically similar sentence samples can be generated automatically.
In the present embodiment, step S5 is the loop for selecting from the near-synonym sets; that is, steps S4-S5 realize the operation of generating large batches of semantically similar sentence samples for a batch of input sentence samples.
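The combinatorial count described above (b^a candidates for a sentence of a words with b near-synonyms each) can be checked by a small enumeration; the words and synonym sets here are hypothetical:

```python
from itertools import product

# a = 3 words, each with a near-synonym set of size b = 4.
synonym_sets = [
    ["i", "me", "myself", "we"],
    ["want", "wish", "hope", "like"],
    ["listen", "hear", "play", "enjoy"],
]

# Steps S4-S5 exhaust every choice of one near-synonym per position,
# yielding b ** a = 4 ** 3 = 64 semantically similar candidates.
candidates = [" ".join(combo) for combo in product(*synonym_sets)]
```

This is why even a single short input sample yields a large batch of candidates, and why the language-model filtering of step S6 is needed afterwards.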
In the present embodiment, when generating semantically similar sentence samples, some of them may be semantically incoherent simply because near-synonyms have been piled together, and therefore cannot enter subsequent processing as normal sentence samples. Therefore, in step S6, after the semantically similar sentence samples have been generated, the language model trained and generated in advance is used to analyze the semantic plausibility of each semantically similar sentence sample; finally, for each sample, a probability value representing the semantic plausibility of the sentence is generated, which can represent the semantic reasonableness of the sentence. The semantically similar sentence samples are then sorted from high to low by this probability value. Specifically, for a given sentence S = W1, W2, ..., Wk, S denotes the sentence and Wk (k = 1, 2, 3, ...) denotes the k-th word in the sentence.
The probability value of the sentence can then be expressed as: P(S) = P(W1, W2, ..., Wk) ~ P(W1)·P(W2|W1)·...·P(Wk|W1, W2, ..., Wk-1), where the probabilities P(W1), P(W2|W1), etc. in the formula are formed by training the above language model. Therefore, for each sentence S, its probability value P(S) can be obtained through the language model; this probability value can also be regarded as the semantic score of the sentence.
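A minimal n-gram language model of the kind referred to above can be sketched as a bigram model with add-one smoothing; the training sentences are invented, and the log-probability plays the role of the semantic score P(S):

```python
import math
from collections import Counter

class BigramLM:
    """P(S) ~ P(w1|<s>) * P(w2|w1) * ... estimated from bigram counts
    with add-one smoothing; score() returns the log-probability."""
    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for s in sentences:
            tokens = ["<s>"] + s
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab = len(self.unigrams)

    def score(self, sentence):
        tokens = ["<s>"] + sentence
        return sum(
            math.log((self.bigrams[(a, b)] + 1) /
                     (self.unigrams[a] + self.vocab))
            for a, b in zip(tokens, tokens[1:]))

# Toy segmented corpus; a real model would be trained on crawled text.
corpus = [["i", "want", "to", "hear", "a", "song"],
          ["i", "want", "to", "hear", "music"],
          ["play", "a", "song"]]
lm = BigramLM(corpus)

# Step S6: a fluent candidate outranks a scrambled pile of near-synonyms.
ranked = sorted([["i", "want", "to", "hear", "a", "song"],
                 ["song", "a", "hear", "i"]],
                key=lm.score, reverse=True)
```

The design choice matches the text: incoherent candidates produced by mere word piling receive low scores and fall to the bottom of the ranking before the top-N cut of step S7.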
Finally, in step S7, the top N semantically similar sentence samples are selected and retained, subsequent processing steps are carried out on the retained samples, and the other semantically similar sentence samples that were not retained are discarded. N is a natural number whose value can be freely set by the user according to actual conditions.
Specifically, for step S7, in a preferred embodiment of the present invention the top N semantically similar sentence samples may be retained for each input sentence sample. In another embodiment of the present invention, only the top N among all the semantically similar sentence samples formed may be retained. The scope of this selection can be set by the user as needed.
In a preferred embodiment of the present invention, the types of the input sentence sample include:
a sentence type, where a sentence sample of the sentence type includes multiple sequentially arranged words;
a clause type, where a sentence sample of the clause type includes multiple sequentially arranged words and part-of-speech labels, or includes only multiple sequentially arranged part-of-speech labels.
Step S1 is then shown in Fig. 2 and specifically includes:
Step S11: obtaining the externally input sentence sample;
Step S12: judging the type of the sentence sample:
if the sentence sample is of the clause type, turning to step S13;
if the sentence sample is of the sentence type, proceeding directly to step S2;
Step S13: replacing each part-of-speech label in the sentence sample with a high-frequency word corresponding to that label, so as to form a complete sentence sample, and then turning to step S2.
Specifically, in the present embodiment, the types of the sentence sample may include the sentence type and the clause type.
The sentence type refers to a sentence including multiple sequentially arranged words; for example, "I want to listen to Jay Chou's Blue and White Porcelain" is a sentence.
The clause type refers to a sentence including multiple sequentially arranged words and part-of-speech labels, or only multiple sequentially arranged part-of-speech labels; for example, "I want to listen to [singer]'s [song]" is a clause, where "singer" and "song" are part-of-speech labels.
Further, as soon as one part-of-speech label occurs in a sentence sample, that sentence sample is of the clause type. For example, "I want to listen to Jay Chou's [song]" is a sentence sample of the clause type.
In the present embodiment, a sentence sample of the sentence type can then enter the subsequent operations in step S2 without any processing.
For a clause sample, the part-of-speech labels in it need to be replaced with words corresponding to those labels, so as to form a complete sentence, which is then fed into step S2 for subsequent processing.
Specifically, in step S13, each part-of-speech label in a sentence sample judged to be of the clause type is replaced with a high-frequency word under that label, so as to form a complete sentence sample. A high-frequency word is a word that, according to statistics, occurs relatively often and is used relatively frequently under a given part-of-speech label; substituting such high-frequency words for the corresponding part-of-speech labels in a clause-type sentence sample yields a relatively reasonable and complete sentence sample.
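Step S13 as described above can be sketched as a simple template substitution, replacing each part-of-speech label with the most frequent word observed for that label; the label inventory and occurrence counts below are hypothetical:

```python
from collections import Counter

# Hypothetical occurrence counts per part-of-speech label.
label_counts = {
    "singer": Counter({"Jay Chou": 120, "Wang Leehom": 80}),
    "song":   Counter({"Blue and White Porcelain": 95, "Nunchucks": 40}),
}

def fill_clause(tokens):
    """Step S13: replace each label token with its highest-frequency word,
    leaving ordinary words untouched, to form a complete sentence sample."""
    return [label_counts[t].most_common(1)[0][0] if t in label_counts else t
            for t in tokens]

clause = ["I", "want", "to", "hear", "singer", "'s", "song"]
sentence = fill_clause(clause)
```

The filled sentence is then a complete, reasonable sample that can enter the word segmentation and substitution steps like any sentence-type input.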
In a preferred embodiment of the present invention, the word vector model is trained and formed in advance using a preset word segmentation method;
then in step S2, word segmentation is performed on the sentence sample using the same preset word segmentation method.
Specifically, in the present embodiment, segmenting the sentence samples with the same segmentation method used to train the word vector model reduces out-of-vocabulary words in the subsequent processing steps, and therefore helps to improve the final processing effect.
In a preferred embodiment of the present invention, the preset word segmentation method may be a dictionary-based forward maximum matching procedure: take the first m characters of the sentence to be segmented, from left to right, as the matching field, where m is the length of the longest word in the dictionary; look the field up in the dictionary; if the match succeeds, output the matched field as one segmented word; if the match fails, remove the last character of the matching field and match the remaining string as the new matching field; repeat the above process until all words have been segmented.
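The forward maximum matching procedure just described can be sketched directly; the dictionary here is a toy stand-in for the large compiled lexicon mentioned later:

```python
def forward_max_match(sentence, dictionary):
    """Dictionary-based forward maximum matching: repeatedly take the
    longest dictionary word that prefixes the remaining text."""
    max_len = max(len(w) for w in dictionary)
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking the field on failure.
        for m in range(min(max_len, len(sentence) - i), 0, -1):
            field = sentence[i:i + m]
            if m == 1 or field in dictionary:
                words.append(field)
                i += m
                break
    return words

lexicon = {"ab", "abc", "cd", "d"}
tokens = forward_max_match("abcd", lexicon)
```

Single characters are always accepted as a last resort, which keeps the loop from stalling on out-of-dictionary characters.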
In another preferred embodiment of the present invention, the preset word segmentation method may be a dictionary-based reverse maximum matching procedure, specifically: take the last m characters of the sentence to be segmented, from right to left, as the matching field, where m is the length of the longest word in the dictionary; look the field up in the dictionary; if the match succeeds, output the matched field as one segmented word; if the match fails, remove the first character of the matching field and match the remaining string as the new matching field; repeat the above process until all words have been segmented.
In yet another preferred embodiment of the present invention, the preset word segmentation method may be a dictionary-based bidirectional maximum matching procedure, which combines the above forward maximum matching and reverse maximum matching methods, specifically:
if the forward and reverse maximum matching results are identical, take either result as the output;
if the forward and reverse maximum matching results differ, first select the result with the smaller number of segmented words; if the numbers of words are the same, select the reverse maximum matching result.
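Combining the two directions with the selection rule above, a self-contained bidirectional maximum matching sketch (same toy-dictionary caveat as before) could look like:

```python
def _max_match(sentence, dictionary, reverse=False):
    """Forward (or, with reverse=True, backward) maximum matching."""
    max_len = max(len(w) for w in dictionary)
    words, text = [], sentence
    while text:
        for m in range(min(max_len, len(text)), 0, -1):
            field = text[-m:] if reverse else text[:m]
            if m == 1 or field in dictionary:
                if reverse:
                    words.insert(0, field)   # keep left-to-right order
                    text = text[:-m]
                else:
                    words.append(field)
                    text = text[m:]
                break
    return words

def bidirectional_max_match(sentence, dictionary):
    fwd = _max_match(sentence, dictionary)
    bwd = _max_match(sentence, dictionary, reverse=True)
    if fwd == bwd:
        return fwd                        # both directions agree
    if len(fwd) != len(bwd):
        return min(fwd, bwd, key=len)     # prefer fewer segmented words
    return bwd                            # same count: prefer reverse result

segmented = bidirectional_max_match("abcd", {"a", "bcd", "ab", "cd"})
```

Here the two directions disagree with equal word counts (forward gives ["ab", "cd"], backward gives ["a", "bcd"]), so per the rule the reverse result is selected.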
" dictionary " so-called in above-described embodiment refers to including a large amount of words by one formed after compiling Dictionary database.
In the other embodiment of the present invention, other segmenting methods are readily applicable in the present invention, have no effect on the present invention Protection domain.
In a preferred embodiment of the present invention, in step S4, the near-synonym selected for replacement has the same part of speech as the word it replaces, for example both are nouns or both are verbs; this ensures the accuracy of the replacement operation and avoids the replaced sentence being logically unreasonable.
In a preferred embodiment of the present invention, the types of the semantically similar sentence sample include:
a sentence type, where a semantically similar sentence sample of the sentence type includes multiple sequentially arranged words;
a clause type, where a semantically similar sentence sample of the clause type includes multiple sequentially arranged words and part-of-speech labels, or includes only multiple sequentially arranged part-of-speech labels.
As shown in Fig. 3, step S7 then specifically includes:
Step S71: selecting and retaining the top N semantically similar sentence samples;
Step S72: judging whether semantically similar sentence samples of the clause type need to be output:
if so, turning to step S73;
if not, turning to step S74;
Step S73: replacing the words included in the semantically similar sentence samples with the corresponding part-of-speech labels so as to form complete semantically similar sentence samples, and then carrying out subsequent processing steps;
Step S74: carrying out subsequent processing steps on the retained semantically similar sentence samples.
Specifically, similar to the above, the semantically similar sentence samples likewise include the sentence type and the clause type. In the present embodiment, the user can set the final output semantically similar sentence samples to be of the sentence type or the clause type:
if the user sets the final output to the sentence type, the semantically similar sentence samples screened by the language model are output directly and subsequent processing steps are carried out;
if the user sets the final output to the clause type, the words included in the semantically similar sentence samples need to be replaced with the corresponding part-of-speech labels so as to form complete clause-type semantically similar sentence samples, followed by subsequent processing steps.
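The word-to-label conversion of step S73 is the inverse of the step S13 substitution; with a hypothetical word-to-label mapping it can be sketched as:

```python
# Hypothetical mapping from concrete words back to part-of-speech labels.
word_labels = {
    "Jay Chou": "singer", "Wang Leehom": "singer",
    "Blue and White Porcelain": "song",
}

def to_clause(tokens):
    """Step S73: replace labeled words with their part-of-speech labels
    so the retained samples can be output as clause-type templates."""
    return [word_labels.get(t, t) for t in tokens]

template = to_clause(["I", "want", "to", "hear", "Wang Leehom", "'s",
                      "Blue and White Porcelain"])
```

Outputting clause-type templates this way lets one retained sample stand for every concrete word that shares its labels.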
In the preferred embodiment of the present invention, the above subsequent processing steps may include that basis automatically generates Large batch of semantic similarity sentence sample carry out the exploitation of semantic open platform, or carry out the calculating etc. of semantic similarity.
Specifically, in a preferred embodiment of the present invention, the function of the semantic open platform is to open semantic interfaces to other developers and to help them complete the development of specific programs. When a user inputs a sentence or clause, the method described herein can automatically generate a large number of similar sentences or clauses, thereby increasing semantic generalization ability, enhancing semantic understanding, reducing manual work, saving time, and improving efficiency.
Correspondingly, in a preferred embodiment of the present invention, the computation of semantic similarity requires a large number of semantically similar sentences or clauses; the above method can therefore be used to generate, in large quantities, sentence samples for training the semantic similarity computation.
In a preferred embodiment of the present invention, the above step S7 may finally output a set containing the retained semantic similarity sentence samples for subsequent processing.
The above are merely preferred embodiments of the present invention and are not intended to limit its embodiments or scope of protection. Those skilled in the art should appreciate that all schemes obtained through equivalent replacement or obvious variation of the description and drawings of the present invention fall within the scope of protection of the present invention.

Claims (7)

1. A method of automatically generating semantic similarity sentence samples, suitable for use in natural language processing, characterized in that a word-vector model for obtaining semantically similar words and a language model for judging the semantic plausibility of the generated semantic similarity sentence samples are trained and formed in advance, the method comprising:
step S1: acquiring an externally input sentence sample;
step S2: performing word segmentation on the sentence sample to decompose it into a combination of multiple sequentially ordered words;
step S3: using the word-vector model to obtain, for each word contained in the sentence sample, a set of words semantically similar to that word;
step S4: selecting, from the set corresponding to each word, one similar word to replace that word, thereby forming a semantic similarity sentence sample associated with the sentence sample;
step S5: judging whether the sets still contain similar words that have not yet been selected:
if so, returning to step S4;
step S6: using the language model to generate, for each semantic similarity sentence sample, a probability value indicating its semantic plausibility, and sorting all semantic similarity sentence samples by the probability value from high to low;
step S7: selecting and retaining the top N semantic similarity sentence samples, and performing subsequent processing steps on the retained semantic similarity sentence samples.
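By way of non-limiting illustration, steps S2 through S7 of the claimed method might be sketched end to end as follows; the toy word vectors, bigram counts, and whitespace segmentation are hypothetical stand-ins for the pre-trained word-vector model, the pre-trained language model, and the word-segmentation method:

```python
import itertools
import math

# Hypothetical stand-in for the pre-trained word-vector model.
VECTORS = {
    "play": (1.0, 0.1), "start": (0.9, 0.2),
    "song": (0.1, 1.0), "tune": (0.2, 0.9),
    "a": (-1.0, -0.9), "the": (-0.9, -1.0),
}
# Hypothetical bigram counts standing in for the language model.
BIGRAMS = {("play", "a"): 5, ("a", "song"): 4, ("start", "a"): 2,
           ("a", "tune"): 1, ("the", "song"): 3}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def similar_words(word, k=2):
    """Step S3: the k words whose vectors lie closest to `word`."""
    return sorted(VECTORS, key=lambda w: cosine(VECTORS[word], VECTORS[w]),
                  reverse=True)[:k]

def lm_score(words):
    """Step S6: add-one-smoothed bigram log-probability as a plausibility score."""
    return sum(math.log(BIGRAMS.get(p, 0) + 1) for p in zip(words, words[1:]))

def generate(sentence, top_n=3):
    words = sentence.split()                                      # step S2 (toy)
    candidates = [similar_words(w) for w in words]                # step S3
    variants = [list(v) for v in itertools.product(*candidates)]  # steps S4/S5
    variants.sort(key=lm_score, reverse=True)                     # step S6
    return variants[:top_n]                                       # step S7

print(generate("play a song"))
```

With these toy tables, the original sentence ranks first because its bigrams are the most frequent, and close variants such as "start a song" follow.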
2. The method of automatically generating semantic similarity sentence samples as described in claim 1, characterized in that the type of the sentence sample includes:
a sentence type, wherein a sentence sample of the sentence type contains multiple sequentially ordered words;
a clause type, wherein a sentence sample of the clause type contains multiple sequentially ordered words together with the part-of-speech labels of those words, or contains only multiple sequentially ordered part-of-speech labels;
and that step S1 specifically includes:
step S11: acquiring the externally input sentence sample;
step S12: judging the type of the sentence sample:
if the sentence sample is of the clause type, proceeding to step S13;
if the sentence sample is of the sentence type, proceeding directly to step S2;
step S13: replacing each part-of-speech label in the sentence sample with a high-frequency word corresponding to that label to form a complete sentence sample, and then proceeding to step S2.
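By way of non-limiting illustration, the label-to-word expansion of step S13 might be sketched as follows; the tag set and the high-frequency word table are hypothetical:

```python
# Sketch of step S13: expand a clause-type input into a complete
# sentence sample by replacing each part-of-speech label with a
# high-frequency word of that part of speech. Table is hypothetical.
HIGH_FREQ = {"v": "play", "det": "a", "n": "song"}

def expand_clause(tokens):
    """Replace POS labels with high-frequency words; keep literal words."""
    return [HIGH_FREQ.get(t, t) for t in tokens]

# A clause-type sample may mix literal words and labels:
print(expand_clause(["v", "a", "n"]))  # ['play', 'a', 'song']
```

Tokens that are not labels (here the literal word "a") pass through unchanged, matching the claim's mixed words-plus-labels clause type.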
3. The method of automatically generating semantic similarity sentence samples as described in claim 1, characterized in that the word-vector model is trained and formed in advance using a preset word-segmentation method;
and that in step S2, the word segmentation of the sentence sample is performed using the same preset word-segmentation method.
4. The method of automatically generating semantic similarity sentence samples as described in claim 1, characterized in that in step S4, the similar word selected for replacement has the same part of speech as the word it replaces.
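By way of non-limiting illustration, the part-of-speech constraint of this claim might be applied as a simple filter over the candidate set obtained in step S3; the POS lookup table is hypothetical:

```python
# Sketch of the claim-4 constraint: a candidate replacement is accepted
# only if it shares the part of speech of the word it replaces.
# The POS lookup table is hypothetical.
POS = {"play": "v", "start": "v", "song": "n", "a": "det"}

def same_pos_candidates(word, candidates):
    """Keep only candidates whose part of speech matches `word`."""
    return [c for c in candidates if POS.get(c) == POS.get(word)]

print(same_pos_candidates("play", ["start", "song", "a"]))  # ['start']
```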
5. The method of automatically generating semantic similarity sentence samples as described in claim 1, characterized in that in step S6, the probability value of each semantic similarity sentence sample is a semantic score indicating the plausibility that the sample holds as a complete sentence.
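By way of non-limiting illustration, the probability value of step S6 might be realized as a smoothed n-gram log-probability, so that well-formed word orders score higher; the bigram counts here are hypothetical:

```python
import math

# Sketch of the claim-5 score: a smoothed bigram log-probability used
# as the semantic plausibility of a candidate sample. Counts are
# hypothetical stand-ins for a pre-trained language model.
COUNTS = {("play", "a"): 5, ("a", "song"): 4}

def plausibility(words):
    """Sum of add-one-smoothed bigram log counts over the sample."""
    return sum(math.log(COUNTS.get(p, 0) + 1) for p in zip(words, words[1:]))

good = plausibility(["play", "a", "song"])
bad = plausibility(["song", "play", "a"])
assert good > bad  # the well-formed order scores higher
```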
6. The method of automatically generating semantic similarity sentence samples as described in claim 1, characterized in that the type of the semantic similarity sentence sample includes:
a sentence type, wherein a semantic similarity sentence sample of the sentence type contains multiple sequentially ordered words;
a clause type, wherein a semantic similarity sentence sample of the clause type contains multiple sequentially ordered words together with the part-of-speech labels of those words, or contains only multiple sequentially ordered part-of-speech labels;
and that step S7 specifically includes:
step S71: selecting and retaining the top N semantic similarity sentence samples;
step S72: judging whether semantic similarity sentence samples of the clause type need to be output:
if so, proceeding to step S73;
if not, proceeding to step S74;
step S73: replacing the words contained in each semantic similarity sentence sample with the corresponding part-of-speech labels to form complete semantic similarity sentence samples, and then performing the subsequent processing steps;
step S74: performing the subsequent processing steps on the retained semantic similarity sentence samples.
7. The method of automatically generating semantic similarity sentence samples as described in claim 1, characterized in that in step S7, after the top N semantic similarity sentence samples are selected and retained, a set containing the retained semantic similarity sentence samples is output for the subsequent processing steps.
CN201710109325.7A 2017-02-27 2017-02-27 A method of automatically generating semantic similarity sentence sample Pending CN108509409A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201710109325.7A CN108509409A (en) 2017-02-27 2017-02-27 A method of automatically generating semantic similarity sentence sample
PCT/CN2018/074325 WO2018153215A1 (en) 2017-02-27 2018-01-26 Method for automatically generating sentence sample with similar semantics
TW107105170A TWI662425B (en) 2017-02-27 2018-02-13 A method of automatically generating semantic similar sentence samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710109325.7A CN108509409A (en) 2017-02-27 2017-02-27 A method of automatically generating semantic similarity sentence sample

Publications (1)

Publication Number Publication Date
CN108509409A true CN108509409A (en) 2018-09-07

Family

ID=63254281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710109325.7A Pending CN108509409A (en) 2017-02-27 2017-02-27 A method of automatically generating semantic similarity sentence sample

Country Status (3)

Country Link
CN (1) CN108509409A (en)
TW (1) TWI662425B (en)
WO (1) WO2018153215A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657231A (en) * 2018-11-09 2019-04-19 广东电网有限责任公司 A kind of long SMS compressing method and system
CN110334197A (en) * 2019-06-28 2019-10-15 科大讯飞股份有限公司 Corpus processing method and relevant apparatus
CN110633359A (en) * 2019-09-04 2019-12-31 北京百分点信息科技有限公司 Sentence equivalence judgment method and device
CN111709234A (en) * 2020-05-28 2020-09-25 北京百度网讯科技有限公司 Training method and device of text processing model and electronic equipment
CN111950237A (en) * 2019-04-29 2020-11-17 深圳市优必选科技有限公司 Sentence rewriting method, sentence rewriting device and electronic equipment

Families Citing this family (8)

Publication number Priority date Publication date Assignee Title
CN110096572B (en) * 2019-04-12 2023-09-15 成都美满科技有限责任公司 Sample generation method, device and computer readable medium
CN110929522A (en) * 2019-08-19 2020-03-27 网娱互动科技(北京)股份有限公司 Intelligent synonym replacement method and system
CN110929526A (en) * 2019-10-28 2020-03-27 深圳绿米联创科技有限公司 Sample generation method and device and electronic equipment
CN111178059B (en) * 2019-12-07 2023-08-25 武汉光谷信息技术股份有限公司 Similarity comparison method and device based on word2vec technology
CN112395867B (en) * 2020-11-16 2023-08-08 中国平安人寿保险股份有限公司 Synonym mining method and device, storage medium and computer equipment
CN112883150B (en) * 2021-01-21 2023-07-25 平安科技(深圳)有限公司 Method, device, equipment and storage medium for distinguishing trademark words from general words
CN113688239B (en) * 2021-08-20 2024-04-16 平安国际智慧城市科技股份有限公司 Text classification method and device under small sample, electronic equipment and storage medium
US11741302B1 (en) 2022-05-18 2023-08-29 Microsoft Technology Licensing, Llc Automated artificial intelligence driven readability scoring techniques

Citations (5)

Publication number Priority date Publication date Assignee Title
JP2006059105A (en) * 2004-08-19 2006-03-02 Mitsubishi Electric Corp Apparatus, method and program for preparing language model
US20130018649A1 (en) * 2011-07-13 2013-01-17 Nuance Communications, Inc. System and a Method for Generating Semantically Similar Sentences for Building a Robust SLM
CN103218444A (en) * 2013-04-22 2013-07-24 中央民族大学 Method of Tibetan language webpage text classification based on semanteme
CN103823794A (en) * 2014-02-25 2014-05-28 浙江大学 Automatic question setting method about query type short answer question of English reading comprehension test
CN104281565A (en) * 2014-09-30 2015-01-14 百度在线网络技术(北京)有限公司 Semantic dictionary constructing method and device

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
TWI227417B (en) * 2003-12-02 2005-02-01 Inst Information Industry Digital resource recommendation system, method and machine-readable medium using semantic comparison of query sentence
GB201010545D0 (en) * 2010-06-23 2010-08-11 Rolls Royce Plc Entity recognition
CN106033416B (en) * 2015-03-09 2019-12-24 阿里巴巴集团控股有限公司 Character string processing method and device
CN105677637A (en) * 2015-12-31 2016-06-15 上海智臻智能网络科技股份有限公司 Method and device for updating abstract semantics database in intelligent question-answering system
CN106021223B (en) * 2016-05-09 2020-06-23 Tcl科技集团股份有限公司 Sentence similarity calculation method and system

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
JP2006059105A (en) * 2004-08-19 2006-03-02 Mitsubishi Electric Corp Apparatus, method and program for preparing language model
JP4245530B2 (en) * 2004-08-19 2009-03-25 三菱電機株式会社 Language model creation apparatus and method, and program
US20130018649A1 (en) * 2011-07-13 2013-01-17 Nuance Communications, Inc. System and a Method for Generating Semantically Similar Sentences for Building a Robust SLM
CN103218444A (en) * 2013-04-22 2013-07-24 中央民族大学 Method of Tibetan language webpage text classification based on semanteme
CN103823794A (en) * 2014-02-25 2014-05-28 浙江大学 Automatic question setting method about query type short answer question of English reading comprehension test
CN104281565A (en) * 2014-09-30 2015-01-14 百度在线网络技术(北京)有限公司 Semantic dictionary constructing method and device

Cited By (8)

Publication number Priority date Publication date Assignee Title
CN109657231A (en) * 2018-11-09 2019-04-19 广东电网有限责任公司 A kind of long SMS compressing method and system
CN109657231B (en) * 2018-11-09 2023-04-07 广东电网有限责任公司 Long short message simplifying method and system
CN111950237A (en) * 2019-04-29 2020-11-17 深圳市优必选科技有限公司 Sentence rewriting method, sentence rewriting device and electronic equipment
CN111950237B (en) * 2019-04-29 2023-06-09 深圳市优必选科技有限公司 Sentence rewriting method, sentence rewriting device and electronic equipment
CN110334197A (en) * 2019-06-28 2019-10-15 科大讯飞股份有限公司 Corpus processing method and relevant apparatus
CN110633359A (en) * 2019-09-04 2019-12-31 北京百分点信息科技有限公司 Sentence equivalence judgment method and device
CN110633359B (en) * 2019-09-04 2022-03-29 北京百分点科技集团股份有限公司 Sentence equivalence judgment method and device
CN111709234A (en) * 2020-05-28 2020-09-25 北京百度网讯科技有限公司 Training method and device of text processing model and electronic equipment

Also Published As

Publication number Publication date
TW201841121A (en) 2018-11-16
WO2018153215A1 (en) 2018-08-30
TWI662425B (en) 2019-06-11

Similar Documents

Publication Publication Date Title
CN108509409A (en) A method of automatically generating semantic similarity sentence sample
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
US10496749B2 (en) Unified semantics-focused language processing and zero base knowledge building system
CN101599071B (en) Automatic extraction method of conversation text topic
Constant et al. MWU-aware part-of-speech tagging with a CRF model and lexical resources
CN110489760A (en) Based on deep neural network text auto-collation and device
CN109408642A (en) A kind of domain entities relation on attributes abstracting method based on distance supervision
CN105138514B (en) It is a kind of based on dictionary it is positive gradually plus a word maximum matches Chinese word cutting method
CN106126620A (en) Method of Chinese Text Automatic Abstraction based on machine learning
Suleiman et al. The use of hidden Markov model in natural ARABIC language processing: a survey
CN112328800A (en) System and method for automatically generating programming specification question answers
CN106599032A (en) Text event extraction method in combination of sparse coding and structural perceptron
CN110879834B (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN108681574A (en) A kind of non-true class quiz answers selection method and system based on text snippet
CN113505209A (en) Intelligent question-answering system for automobile field
CN112364132A (en) Similarity calculation model and system based on dependency syntax and method for building system
CN110222344B (en) Composition element analysis algorithm for composition tutoring of pupils
CN113361252A (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
Fiser et al. Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources.
Sharma et al. Lexicon a linguistic approach for sentiment classification
CN107818078B (en) Semantic association and matching method for Chinese natural language dialogue
Srinivasagan et al. An automated system for tamil named entity recognition using hybrid approach
Li et al. Multilingual toxic text classification model based on deep learning
Basumatary et al. Deep Learning Based Bodo Parts of Speech Tagger
Filippova et al. Bilingual terminology extraction using neural word embeddings on comparable corpora

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1252738

Country of ref document: HK

RJ01 Rejection of invention patent application after publication

Application publication date: 20180907
