CN108509409A - A method of automatically generating semantic similarity sentence sample - Google Patents
- Publication number
- CN108509409A CN108509409A CN201710109325.7A CN201710109325A CN108509409A CN 108509409 A CN108509409 A CN 108509409A CN 201710109325 A CN201710109325 A CN 201710109325A CN 108509409 A CN108509409 A CN 108509409A
- Authority
- CN
- China
- Prior art keywords
- sentence sample
- semantic similarity
- sentence
- sample
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method of automatically generating semantically similar sentence samples, belonging to the field of language processing technology. The method includes: obtaining a sentence sample and performing word segmentation on it; using a word-vector model to obtain, for each word, a set of close words semantically similar to that word; choosing one close word from each set to replace the corresponding word, thereby forming a semantically similar sentence sample; using a language model to generate, for each semantically similar sentence sample, a probability value indicating its semantic plausibility, and sorting all the semantically similar sentence samples from high to low by that value; and choosing and retaining the top N semantically similar sentence samples for subsequent processing. The advantageous effect of this technical solution is that large batches of semantically similar sentence samples can be generated automatically without requiring a massive candidate sentence collection, eliminating a large amount of manual work.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a method of automatically generating semantically similar sentence samples.
Background technology
In the prior art, many natural language processing tasks require large collections of semantically similar sentences or sentence patterns (clauses). These collections usually have to be written by hand, which consumes a great deal of manpower and time.
With the development of automation technology, more and more semantically similar sentences can be compiled automatically. At present, there are mainly the following ways to obtain semantically similar sentence sets in large batches:
1) Obtaining large batches of semantically similar sentences by retrieval. A retrieval-based method finds semantically similar sentences in a massive candidate sentence set using some retrieval technique. This approach presupposes that a massive candidate sentence set already exists, and the process of retrieving and generating semantically similar sentences places very high demands on the semantic similarity retrieval module; that is, the performance of the semantic similarity retrieval module determines the precision of the semantically similar sentences obtained by retrieval.
2) Obtaining large batches of semantically similar sentences with sequence-to-sequence models. This approach is currently very active in academic research, but in practical applications many of the sentences it generates are unreasonable; its performance is not good enough, so it lacks a certain degree of practicality.
Summary of the invention
In view of the above problems in the prior art, a technical solution for automatically generating semantically similar sentence samples is now provided, intended to automatically and effectively generate large batches of semantically similar sentence samples and eliminate a large amount of manual work.
The technical solution specifically includes:
A method of automatically generating semantically similar sentence samples, suitable for use in natural language processing; wherein a word-vector model for obtaining semantically similar words and a language model for judging the semantic plausibility of the generated semantically similar sentence samples are trained and formed in advance; the method further includes:
Step S1: obtaining an externally input sentence sample;
Step S2: performing word segmentation on the sentence sample so as to decompose it into a combination of multiple sequentially arranged words;
Step S3: using the word-vector model, obtaining for each word in the sentence sample a set of close words semantically similar to that word;
Step S4: choosing one close word from the set corresponding to each word and replacing the word with it, so as to form a semantically similar sentence sample associated with the sentence sample;
Step S5: judging whether any close word in the sets has not yet been selected: if so, returning to step S4;
Step S6: using the language model, generating for each semantically similar sentence sample a probability value indicating its semantic plausibility, and sorting all the semantically similar sentence samples from high to low by that value;
Step S7: choosing and retaining the top N semantically similar sentence samples, so as to carry out subsequent processing steps on the retained semantically similar sentence samples.
Preferably, in the method of automatically generating semantically similar sentence samples, the types of the sentence sample include:
a sentence type, where a sentence sample of the sentence type includes multiple sequentially arranged words;
a clause type, where a sentence sample of the clause type includes multiple sequentially arranged words together with part-of-speech labels of the words, or includes only multiple sequentially arranged part-of-speech labels.
Step S1 then specifically includes:
Step S11: obtaining the externally input sentence sample;
Step S12: judging the type of the sentence sample: if the sentence sample is of the clause type, turning to step S13; if the sentence sample is of the sentence type, proceeding directly to step S2;
Step S13: replacing each part-of-speech label in the sentence sample with a high-frequency word corresponding to that label so as to form a complete sentence sample, and then turning to step S2.
Preferably, in the method of automatically generating semantically similar sentence samples, the word-vector model is trained and formed in advance using a preset word segmentation method; in step S2, the same preset word segmentation method is then used to segment the sentence sample.
Preferably, in the method of automatically generating semantically similar sentence samples, in step S4, the close word selected for the replacement has the same part of speech as the word it replaces.
Preferably, in the method of automatically generating semantically similar sentence samples, in step S6, the probability value of each semantically similar sentence sample is a semantic score indicating how likely the semantically similar sentence sample is to hold as a complete sentence.
Preferably, in the method of automatically generating semantically similar sentence samples, the types of the semantically similar sentence sample include:
a sentence type, where a semantically similar sentence sample of the sentence type includes multiple sequentially arranged words;
a clause type, where a semantically similar sentence sample of the clause type includes multiple sequentially arranged words together with part-of-speech labels of the words, or includes only multiple sequentially arranged part-of-speech labels.
Step S7 then specifically includes:
Step S71: choosing and retaining the top N semantically similar sentence samples;
Step S72: judging whether semantically similar sentence samples of the clause type need to be output: if so, turning to step S73; if not, turning to step S74;
Step S73: replacing the words in each semantically similar sentence sample with the corresponding part-of-speech labels so as to form complete semantically similar sentence samples, and then carrying out subsequent processing steps;
Step S74: carrying out subsequent processing steps on the retained semantically similar sentence samples.
The advantageous effect of the above technical solution is that it provides a method of automatically generating semantically similar sentence samples that can generate large batches of semantically similar sentence samples without requiring a massive candidate sentence set, eliminating a large amount of manual work.
Description of the drawings
Fig. 1 is an overall flow diagram of a method of automatically generating semantically similar sentence samples in a preferred embodiment of the present invention;
Fig. 2 is, on the basis of Fig. 1, a flow diagram of obtaining and processing an externally input sentence sample in a preferred embodiment of the present invention;
Fig. 3 is, on the basis of Fig. 1, a flow diagram of processing the semantically similar sentence samples for output while choosing and retaining semantically similar sentence samples in a preferred embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
It should be noted that, in the absence of conflict, the embodiments of the present invention and the features in the embodiments may be combined with one another.
The invention is further described below with reference to the drawings and specific embodiments, but not as a limitation of the invention.
In view of the above problems in the prior art, a method of automatically generating semantically similar sentence samples is now provided, suitable for use in natural language processing.
In this method, a word-vector model for obtaining semantically similar words and a language model for judging the semantic plausibility of the generated semantically similar sentence samples are trained and formed in advance.
As shown in Fig. 1, the method specifically includes:
Step S1: obtaining an externally input sentence sample;
Step S2: performing word segmentation on the sentence sample so as to decompose it into a combination of multiple sequentially arranged words;
Step S3: using the word-vector model, obtaining for each word in the sentence sample a set of close words semantically similar to that word;
Step S4: choosing one close word from the set corresponding to each word and replacing the word with it, so as to form a semantically similar sentence sample associated with the sentence sample;
Step S5: judging whether any close word in the sets has not yet been selected: if so, returning to step S4;
Step S6: using the language model, generating for each semantically similar sentence sample a probability value indicating its semantic plausibility, and sorting all the semantically similar sentence samples from high to low by that value;
Step S7: choosing and retaining the top N semantically similar sentence samples, so as to carry out subsequent processing steps on the retained semantically similar sentence samples.
In this embodiment, the word-vector model may be formed using a tool that represents words as real-valued vectors, such as Word2vec. Such a tool draws on ideas from deep learning: through training, the processing of text content is reduced to vector operations in a K-dimensional vector space, and similarity in the vector space can be used to represent similarity in text semantics. "Word vector" here refers to modeling a language with a neural network while obtaining a representation of each word in the vector space; by processing words with their word vectors, the close words of a word can be obtained from the similarities between words.
Specifically, in this embodiment, the training samples used to form the word-vector model can be a large amount of text data, for example text crawled from different forums, which needs to be segmented into words before being input.
After the word-vector model has been trained, its output is a low-dimensional real-valued vector representing each word: every word in the training corpus corresponds to one low-dimensional real-valued vector.
Such a real-valued vector can typically be written as [0.792, -0.177, -0.107, 0.109, -0.542, ...] or a similar form, with dimensions of 50 and 100 being relatively common. The distance between the vectors of two words can be measured with the traditional Euclidean distance, or with the cosine of the angle between them. With vectors represented in this way, the distance between "Mike" and "microphone" will be far smaller than that between "Mike" and "weather". For example, cosine similarity may be computed to obtain the close words of a given word: when computing the similarity between other words and the given word, the words with higher similarity are its close words.
Correspondingly, in this embodiment, the language model can be a model that computes the probability that a sentence is well-formed, for example expressed as P(W1, W2, ..., Wk). With a language model, it can be determined which word sequence is more likely to be a sentence, or, given several words, the most likely next word can be predicted. Briefly, the language model judges whether a word sequence composed of several words conforms to the way people speak, i.e., how likely the word sequence is to be a sentence. In a preferred embodiment of the present invention, the language model may be realized with an n-gram model.
Specifically, when training the language model, the input is text sentences that have each been segmented into words, and the output is the probabilities with which word collocations combine in each text sentence.
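A minimal sketch of such training, assuming a bigram (n = 2) model and a tiny hand-segmented English corpus invented for the example, counts word pairs and estimates each conditional probability by maximum likelihood:

```python
from collections import Counter

def train_bigram(segmented_sentences):
    """Count unigrams and bigrams over pre-segmented training sentences."""
    uni, bi = Counter(), Counter()
    for words in segmented_sentences:
        padded = ["<s>"] + words + ["</s>"]  # sentence boundary markers
        uni.update(padded)
        bi.update(zip(padded, padded[1:]))
    return uni, bi

def cond_prob(uni, bi, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    return bi[(prev, word)] / uni[prev] if uni[prev] else 0.0

corpus = [["I", "want", "to", "hear", "a", "song"],
          ["I", "want", "to", "hear", "music"]]
uni, bi = train_bigram(corpus)
```

Here "want" is always followed by "to" in the corpus, so `cond_prob(uni, bi, "want", "to")` is 1.0, while "hear" splits evenly between its two continuations.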
In this embodiment, in step S1, the externally input sentence sample may be entered manually or obtained from an external sentence sample database. The acquired sentence sample can be an arbitrary sentence; it only needs to follow the most basic semantic rules, i.e., satisfy the necessary conditions for constituting a sentence and be a fluent sentence.
In this embodiment, in step S2, word segmentation is performed on each sentence sample, so a sentence sample can be decomposed into a combination of multiple sequentially arranged words. For example, the sentence sample "I want to listen to Zhou Jielun's Blue and White Porcelain" is segmented into "I + want to + listen to + Zhou Jielun + 's + Blue and White Porcelain", where what matters in the subsequent steps are the words with concrete meaning, such as the nouns "Zhou Jielun" and "Blue and White Porcelain". Further, each word in the sentence sample has a corresponding part-of-speech label; for example, the label of "Zhou Jielun" is "singer" (which may be represented as "singer" during computer processing) and the label of "Blue and White Porcelain" is "song" (which may be represented as "song" during computer processing). In this embodiment, the part-of-speech label may also be called the label of the word.
In this embodiment, after the sentence sample has been segmented, each word is processed with the word-vector model to obtain its corresponding set of close words. Specifically, a close word is a word that is semantically similar to, and consistent in part of speech with, the given word. For example, for "Zhou Jielun", whose label is "singer", the close words obtained for that label by word-vector processing may include "Wang Lihong", "Tao Zhe", "Chen Yixun" and "Na Ying"; the set of these close words obtained by word-vector processing can then be output. Correspondingly, if the label of "Zhou Jielun" is "male singer" (which may be represented as "male-singer" during computer processing), the close words for that label may include "Wang Lihong", "Tao Zhe" and "Chen Yixun". In other words, the label corresponding to a word determines its set of close words.
In this embodiment, in step S4, one close word is chosen from the set corresponding to each word and used to replace the word, thereby forming a semantically similar sentence sample associated with the sentence sample. For example, suppose a sentence sample contains a words, i.e., the sentence sample is formed by a sequentially arranged words, and each word has a set of close words, each set containing the b close words most semantically similar to that word. Then one sentence sample may correspond to b^a semantically similar sentence samples; that is, for one sentence sample there is a set of semantically similar sentence samples, and for multiple sentence samples there are multiple such sets. Large batches of semantically similar sentence samples can therefore be generated automatically.
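The combination step can be sketched as a Cartesian product over the per-word close-word sets. The sets below are invented for the example (reusing the document's own names under the "singer" label, and adding a hypothetical second song title); words without a set, such as function words, are kept unchanged:

```python
from itertools import product

# Hypothetical close-word sets keyed by the original word; each set also
# contains the word itself so the original sentence is among the candidates.
close_sets = {
    "Zhou Jielun": ["Zhou Jielun", "Wang Lihong", "Chen Yixun"],
    "Blue and White Porcelain": ["Blue and White Porcelain", "Nocturne"],
}

def generate_candidates(words):
    """Yield every sentence obtainable by swapping each word for a close word."""
    options = [close_sets.get(w, [w]) for w in words]
    for combo in product(*options):
        yield list(combo)

sentence = ["I", "want to listen to", "Zhou Jielun", "'s",
            "Blue and White Porcelain"]
candidates = list(generate_candidates(sentence))  # 3 x 2 = 6 candidates
```

With a = 2 replaceable words and set sizes 3 and 2, the sketch yields 6 candidates, matching the b^a count described above (here with unequal set sizes, the product of the sizes).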
In this embodiment, step S5 is the loop that iterates over the sets of close words; that is, steps S4-S5 realize the operation of generating large batches of semantically similar sentence samples for a batch of input sentence samples.
In this embodiment, when the semantically similar sentence samples are generated, some of them may be semantically incoherent due to the simple piling-up of close words, and thus cannot enter subsequent processing as normal sentence samples. Therefore, in step S6, after the semantically similar sentence samples have been generated, the language model trained and generated in advance is used to analyze the semantic plausibility of each semantically similar sentence sample, finally producing for each sample a probability value indicating its semantic plausibility; this probability value can be used to indicate the semantic reasonableness of the sentence. The semantically similar sentence samples are then sorted from high to low by this probability value. Specifically, for a given sentence S = W1, W2, ..., Wk, where S denotes the sentence and Wk (k = 1, 2, 3, ...) denotes the k-th word in the sentence,
the probability value of the sentence can be expressed as: P(S) = P(W1, W2, ..., Wk) ≈ P(W1)P(W2|W1)...P(Wk|W1, W2, ..., Wk-1), where the probabilities P(W1), P(W2|W1), etc., are formed by training the language model. For each sentence S, its probability value P(S) can therefore be obtained by processing it with the language model; this probability value can also be regarded as the semantic score of the sentence.
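The chain-rule formula above can be sketched with a bigram approximation, P(Wk | W1, ..., Wk-1) ≈ P(Wk | Wk-1); the counts below are invented stand-ins for a trained language model:

```python
from collections import Counter

def sentence_prob(words, uni, bi):
    """Bigram chain-rule approximation of P(S) = prod_k P(W_k | W_{k-1})."""
    padded = ["<s>"] + words  # start marker so the first word is conditioned too
    p = 1.0
    for prev, word in zip(padded, padded[1:]):
        if uni[prev] == 0:
            return 0.0
        p *= bi[(prev, word)] / uni[prev]
    return p

# Tiny hand-built counts standing in for a trained language model.
uni = Counter({"<s>": 2, "I": 2, "sing": 1, "songs": 1})
bi = Counter({("<s>", "I"): 2, ("I", "sing"): 1, ("sing", "songs"): 1})

score = sentence_prob(["I", "sing", "songs"], uni, bi)
# P(I|<s>) * P(sing|I) * P(songs|sing) = 1.0 * 0.5 * 1.0 = 0.5
```

Candidates whose close-word substitutions produce unseen collocations receive a score of zero here and would sort to the bottom, which is exactly the screening effect step S6 relies on.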
Finally, in step S7, the top N semantically similar sentence samples are chosen and retained, subsequent processing steps are carried out on the retained samples, and the other semantically similar sentence samples that were not retained are discarded. N is a natural number whose value can be freely set by the user according to the actual situation.
Specifically, for step S7, in a preferred embodiment of the present invention, the top N semantically similar sentence samples can be retained for each input sentence sample. In another embodiment of the present invention, only the top N of all the semantically similar sentence samples that were formed may be retained. The scope of this selection can be set by the user as needed.
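The retain-top-N step itself is straightforward; a sketch using the standard library's heap selection (the sample strings and scores are invented):

```python
import heapq

def keep_top_n(scored_samples, n):
    """Keep the n candidates with the highest probability values.

    `scored_samples` is a list of (probability, sample) pairs; the
    returned list is sorted from high to low, as step S6 requires.
    """
    return heapq.nlargest(n, scored_samples, key=lambda pair: pair[0])

scored = [(0.40, "candidate 1"), (0.05, "candidate 2"),
          (0.70, "candidate 3"), (0.20, "candidate 4")]
top = keep_top_n(scored, 2)
```

`heapq.nlargest` avoids fully sorting the (potentially b^a-sized) candidate list when only the top N are needed.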
In a preferred embodiment of the present invention, the types of the input sentence sample include:
a sentence type, which includes multiple sequentially arranged words;
a clause type, which includes multiple sequentially arranged words together with part-of-speech labels, or includes only multiple sequentially arranged part-of-speech labels.
As shown in Fig. 2, step S1 then specifically includes:
Step S11: obtaining the externally input sentence sample;
Step S12: judging the type of the sentence sample: if the sentence sample is of the clause type, turning to step S13; if the sentence sample is of the sentence type, proceeding directly to step S2;
Step S13: replacing each part-of-speech label in the sentence sample with a high-frequency word corresponding to that label so as to form a complete sentence sample, and then turning to step S2.
Specifically, in this embodiment, the types of the sentence sample may include the sentence type and the clause type.
A sentence of the sentence type includes multiple sequentially arranged words; for example, "I want to listen to Zhou Jielun's Blue and White Porcelain" is a sentence.
A sentence of the clause type includes multiple sequentially arranged words together with part-of-speech labels, or includes only multiple sequentially arranged part-of-speech labels; for example, "I want to listen to 'singer''s 'song'" is a clause, where "singer" and "song" are part-of-speech labels.
Further, as soon as one part-of-speech label appears in a sentence sample, the sentence sample is of the clause type. For example, "I want to listen to Zhou Jielun's 'song'" is a sentence sample of the clause type.
In this embodiment, a sentence sample of the sentence type requires no processing and can proceed to the subsequent operations in step S2.
A clause sample, however, requires its part-of-speech labels to be replaced with words corresponding to those labels so as to form a complete sentence, which is then fed into step S2 for subsequent processing.
Specifically, in step S13, the part-of-speech labels in a sentence sample judged to be of the clause type are replaced with high-frequency words for those labels so as to form a complete sentence sample. High-frequency words are words that, according to statistics, occur more often and are used more frequently under a given part-of-speech label; substituting them for the corresponding part-of-speech labels in a clause-type sentence sample yields a relatively reasonable and complete sentence sample.
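A minimal sketch of step S13, assuming a hypothetical table mapping each part-of-speech label to its statistically most frequent word (the table contents are invented, reusing the document's own example names):

```python
# Hypothetical highest-frequency word per part-of-speech label, as would
# be gathered from corpus statistics.
high_freq = {"singer": "Zhou Jielun", "song": "Blue and White Porcelain"}

def fill_clause(tokens):
    """Replace each part-of-speech label with its highest-frequency word;
    tokens that are not labels pass through unchanged."""
    return [high_freq.get(tok, tok) for tok in tokens]

clause = ["I", "want to listen to", "singer", "'s", "song"]
sentence = fill_clause(clause)
```

The resulting complete sentence can then be fed into step S2 like any sentence-type sample.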
In a preferred embodiment of the present invention, the word-vector model is trained and formed in advance using a preset word segmentation method; in step S2, the same preset word segmentation method is then used to segment the sentence sample.
Specifically, in this embodiment, segmenting the sentence sample with the same word segmentation method used to train the word-vector model reduces out-of-vocabulary words in the subsequent processing steps, and thus helps improve the final processing results.
In a preferred embodiment of the present invention, the preset word segmentation method may be a dictionary-based forward maximum matching procedure: from left to right, take m characters of the sentence to be segmented as the matching field, where m is the length of the longest word in the dictionary; look the field up in the dictionary; if the match succeeds, the matched field is cut out as one word; if the match fails, the last character of the matching field is removed and the remaining string is matched again as a new matching field. This process is repeated until all words have been cut out.
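A sketch of this forward maximum matching procedure; the dictionary is an invented toy, and an unmatched single character is cut out as-is, which is where the "remove the last character" fallback bottoms out. The example is the classic sentence 研究生命的起源 ("study the origin of life"), which also illustrates a known weakness of purely forward matching:

```python
def fmm_segment(text, dictionary, max_len):
    """Forward maximum matching: from the left, greedily cut the longest
    dictionary word; fall back to a single character if nothing matches."""
    words, i = [], 0
    while i < len(text):
        for m in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + m]
            if m == 1 or piece in dictionary:
                words.append(piece)
                i += m
                break
    return words

# Toy dictionary; max_len is the length of its longest word.
dictionary = {"研究", "研究生", "生命", "的", "起源"}
segmented = fmm_segment("研究生命的起源", dictionary, max_len=3)
# Greedy forward matching cuts "研究生" (graduate student) first instead
# of "研究" + "生命", so the result is ["研究生", "命", "的", "起源"].
```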
In another preferred embodiment of the present invention, the preset word segmentation method may be a dictionary-based backward (reverse) maximum matching procedure, specifically: from right to left, take m characters of the sentence to be segmented as the matching field, where m is the length of the longest word in the dictionary; look the field up in the dictionary; if the match succeeds, the matched field is cut out as one word; if the match fails, the first character of the matching field is removed and the remaining string is matched again as a new matching field. This process is repeated until all words have been cut out.
In yet another preferred embodiment of the present invention, the preset word segmentation method may be a dictionary-based bidirectional maximum matching procedure, which combines the forward and backward maximum matching methods above. Specifically:
if the forward and backward maximum matching results are identical, take either result as the output;
if the forward and backward maximum matching results differ, first select the result with fewer words after segmentation; if the word counts are the same, select the backward maximum matching result.
" dictionary " so-called in above-described embodiment refers to including a large amount of words by one formed after compiling
Dictionary database.
In the other embodiment of the present invention, other segmenting methods are readily applicable in the present invention, have no effect on the present invention
Protection domain.
In a preferred embodiment of the present invention, in step S4, the close word selected for the replacement has the same part of speech as the word it replaces, e.g., both are nouns or both are verbs. This ensures the accuracy of the replacement operation and prevents the replaced sentence from becoming logically unreasonable.
In a preferred embodiment of the present invention, the types of the semantically similar sentence sample include:
a sentence type, where a semantically similar sentence sample of the sentence type includes multiple sequentially arranged words;
a clause type, where a semantically similar sentence sample of the clause type includes multiple sequentially arranged words together with part-of-speech labels, or includes only multiple sequentially arranged part-of-speech labels.
As shown in Fig. 3, step S7 then specifically includes:
Step S71: choosing and retaining the top N semantically similar sentence samples;
Step S72: judging whether semantically similar sentence samples of the clause type need to be output: if so, turning to step S73; if not, turning to step S74;
Step S73: replacing the words in each semantically similar sentence sample with the corresponding part-of-speech labels so as to form complete semantically similar sentence samples, and then carrying out subsequent processing steps;
Step S74: carrying out subsequent processing steps on the retained semantically similar sentence samples.
Specifically, similar to the above, the semantically similar sentence samples likewise include the sentence type and the clause type. In this embodiment, the user can set whether the finally output semantically similar sentence samples are of the sentence type or the clause type:
if the user sets the final output to the sentence type, the semantically similar sentence samples screened by the language model are output directly and subsequent processing steps are carried out;
if the user sets the final output to the clause type, the words in the semantically similar sentence samples need to be replaced with the corresponding part-of-speech labels so as to form complete clause-type semantically similar sentence samples, followed by subsequent processing steps.
In a preferred embodiment of the present invention, the subsequent processing steps may include developing a semantic open platform, or computing semantic similarity, based on the automatically generated large batches of semantically similar sentence samples.
Specifically, in a preferred embodiment of the present invention, the function of the semantic open platform is to open semantic interfaces to other developers and help them complete the development of specific programs. When a user inputs a sentence or clause, the method described herein can automatically generate a large number of similar sentences or clauses, increasing semantic generalization ability, enhancing semantic understanding, reducing a large amount of manual work, saving time and improving efficiency.
Correspondingly, in a preferred embodiment of the present invention, the computation of semantic similarity requires a large number of semantically similar sentences or clauses; the above method can then be used to generate, in large batches, sentence samples for the training process of semantic similarity computation.
In a preferred embodiment of the present invention, step S7 may finally output a set including the retained semantically similar sentence samples for subsequent processing.
The above are merely preferred embodiments of the present invention and are not intended to limit its embodiments or protection scope. Those skilled in the art should appreciate that all schemes obtained by equivalent replacement or obvious variation based on the description and drawings of the present invention shall fall within the protection scope of the present invention.
Claims (7)
1. A method for automatically generating semantically similar sentence samples, suitable for use in natural language processing, characterized in that a term vector model for obtaining semantically similar words, and a language model for judging the semantic plausibility of the generated semantically similar sentence samples, are trained and formed in advance; the method further comprises:
Step S1: obtaining an externally input sentence sample;
Step S2: performing word segmentation on the sentence sample, so as to decompose the sentence sample into an ordered combination of a plurality of words;
Step S3: using the term vector model to obtain, for each word included in the sentence sample, a set of near-synonyms semantically similar to that word;
Step S4: selecting one near-synonym from the set corresponding to each word to replace that word, thereby forming a semantically similar sentence sample associated with the sentence sample;
Step S5: judging whether any near-synonym in the sets has not yet been selected:
if so, returning to step S4;
Step S6: using the language model to generate, for each semantically similar sentence sample, a probability value indicating its semantic plausibility, and sorting all semantically similar sentence samples from high to low by the probability value;
Step S7: selecting and retaining the top N semantically similar sentence samples, so as to carry out subsequent processing steps on the retained semantically similar sentence samples.
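Steps S1–S7 of claim 1 can be sketched end to end. The synonym sets and bigram scores below are toy stand-ins (assumptions); a real system would obtain the near-synonym sets from a trained term vector model and the scores from a trained language model. The S4/S5 loop over not-yet-selected near-synonyms is equivalently written here as a Cartesian product over the per-word candidate sets:

```python
from itertools import product

SYNONYMS = {              # S3: hypothetical per-word near-synonym sets
    "buy": ["buy", "purchase", "order"],
    "ticket": ["ticket", "fare"],
}
BIGRAM_SCORE = {          # S6: hypothetical language-model scores
    ("purchase", "ticket"): 0.9, ("buy", "ticket"): 0.8,
    ("order", "ticket"): 0.5, ("buy", "fare"): 0.3,
    ("purchase", "fare"): 0.2, ("order", "fare"): 0.1,
}

def generate_similar(words, top_n):
    # S4/S5: enumerate every combination of one near-synonym per word
    options = [SYNONYMS.get(w, [w]) for w in words]
    candidates = [list(c) for c in product(*options)]
    # S6: score each candidate and sort from high to low
    def score(sent):
        return sum(BIGRAM_SCORE.get(pair, 0.0)
                   for pair in zip(sent, sent[1:]))
    candidates.sort(key=score, reverse=True)
    # S7: retain only the top N semantically similar sentence samples
    return candidates[:top_n]

print(generate_similar(["buy", "ticket"], top_n=2))
# -> [['purchase', 'ticket'], ['buy', 'ticket']]
```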
2. The method for automatically generating semantically similar sentence samples according to claim 1, characterized in that the types of the sentence sample include:
a sentence type, wherein a sentence sample of the sentence type includes a plurality of ordered words; and
a clause type, wherein a sentence sample of the clause type includes a plurality of ordered words together with part-of-speech labels for the plurality of words, or includes a plurality of ordered part-of-speech labels;
and step S1 specifically includes:
Step S11: obtaining the externally input sentence sample;
Step S12: judging the type of the sentence sample:
if the sentence sample is of the clause type, proceeding to step S13;
if the sentence sample is of the sentence type, proceeding directly to step S2;
Step S13: replacing each part-of-speech label in the sentence sample with one high-frequency word corresponding to that part-of-speech label, so as to form a complete sentence sample, and then proceeding to step S2.
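Step S13 of claim 2 amounts to filling each part-of-speech label with one high-frequency word of that part of speech. The label set and word choices below are illustrative assumptions:

```python
# Sketch of step S13: each POS label in a clause-type sample is replaced
# by a (hypothetical) high-frequency word of that part of speech,
# yielding a complete sentence-type sample ready for segmentation (S2).
HIGH_FREQ = {"VERB": "buy", "DET": "a", "NOUN": "ticket"}
LABELS = set(HIGH_FREQ)

def fill_clause(tokens):
    """Turn a clause-type sample (mixed words and POS labels) into a
    sentence-type sample by substituting high-frequency words."""
    return [HIGH_FREQ[t] if t in LABELS else t for t in tokens]

print(fill_clause(["please", "VERB", "DET", "NOUN"]))
# -> ['please', 'buy', 'a', 'ticket']
```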
3. The method for automatically generating semantically similar sentence samples according to claim 1, characterized in that the term vector model is trained and formed in advance using a preset word-segmentation method;
and in step S2, word segmentation is performed on the sentence sample using the same preset word-segmentation method.
4. The method for automatically generating semantically similar sentence samples according to claim 1, characterized in that in step S4, the near-synonym selected for replacement has the same part of speech as the word it replaces.
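The constraint of claim 4 can be sketched as a filter over candidate near-synonyms. The tiny part-of-speech lexicon is an assumption for illustration:

```python
# Sketch of claim 4: a candidate near-synonym may replace a word only
# if both share the same part of speech (hypothetical POS lexicon).
POS = {"quick": "ADJ", "fast": "ADJ", "speed": "NOUN", "rapid": "ADJ"}

def same_pos_candidates(word, candidates):
    """Keep only candidates whose POS matches the word being replaced."""
    return [c for c in candidates if POS.get(c) == POS.get(word)]

print(same_pos_candidates("quick", ["fast", "speed", "rapid"]))
# -> ['fast', 'rapid']
```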
5. The method for automatically generating semantically similar sentence samples according to claim 1, characterized in that in step S6, the probability value of each semantically similar sentence sample is a semantic score indicating the likelihood that the semantically similar sentence sample holds as a complete sentence.
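The scoring of claim 5 can be sketched with a toy bigram language model in place of the trained one; the bigram probabilities below are invented for illustration only:

```python
import math

# Sketch of claim 5's scoring: the language model assigns each candidate
# a score for how plausible it is as a complete sentence. A toy bigram
# model with hypothetical probabilities plays that role here.
BIGRAM_P = {("i", "want"): 0.4, ("want", "tea"): 0.2,
            ("want", "sofa"): 0.01}

def sentence_log_prob(words, floor=1e-6):
    # Sum of log bigram probabilities; unseen bigrams get a small floor.
    return sum(math.log(BIGRAM_P.get(pair, floor))
               for pair in zip(words, words[1:]))

a = sentence_log_prob(["i", "want", "tea"])
b = sentence_log_prob(["i", "want", "sofa"])
print(a > b)  # the more plausible sentence scores higher
```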
6. The method for automatically generating semantically similar sentence samples according to claim 1, characterized in that the types of the semantically similar sentence sample include:
a sentence type, wherein a semantically similar sentence sample of the sentence type includes a plurality of ordered words; and
a clause type, wherein a semantically similar sentence sample of the clause type includes a plurality of ordered words together with part-of-speech labels for the plurality of words, or includes a plurality of ordered part-of-speech labels;
and step S7 specifically includes:
Step S71: selecting and retaining the top N semantically similar sentence samples;
Step S72: judging whether semantically similar sentence samples of the clause type need to be output:
if so, proceeding to step S73;
if not, proceeding to step S74;
Step S73: replacing the words included in each semantically similar sentence sample with the corresponding part-of-speech labels, so as to form complete clause-type semantically similar sentence samples, and then carrying out subsequent processing steps;
Step S74: carrying out subsequent processing steps on the retained semantically similar sentence samples.
7. The method for automatically generating semantically similar sentence samples according to claim 1, characterized in that in step S7, after the top N semantically similar sentence samples are selected and retained, a set comprising the retained semantically similar sentence samples is output for the subsequent processing steps.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710109325.7A CN108509409A (en) | 2017-02-27 | 2017-02-27 | A method of automatically generating semantic similarity sentence sample |
PCT/CN2018/074325 WO2018153215A1 (en) | 2017-02-27 | 2018-01-26 | Method for automatically generating sentence sample with similar semantics |
TW107105170A TWI662425B (en) | 2017-02-27 | 2018-02-13 | A method of automatically generating semantic similar sentence samples |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710109325.7A CN108509409A (en) | 2017-02-27 | 2017-02-27 | A method of automatically generating semantic similarity sentence sample |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108509409A true CN108509409A (en) | 2018-09-07 |
Family
ID=63254281
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710109325.7A Pending CN108509409A (en) | 2017-02-27 | 2017-02-27 | A method of automatically generating semantic similarity sentence sample |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN108509409A (en) |
TW (1) | TWI662425B (en) |
WO (1) | WO2018153215A1 (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110096572B (en) * | 2019-04-12 | 2023-09-15 | 成都美满科技有限责任公司 | Sample generation method, device and computer readable medium |
CN110929522A (en) * | 2019-08-19 | 2020-03-27 | 网娱互动科技(北京)股份有限公司 | Intelligent synonym replacement method and system |
CN110929526A (en) * | 2019-10-28 | 2020-03-27 | 深圳绿米联创科技有限公司 | Sample generation method and device and electronic equipment |
CN111178059B (en) * | 2019-12-07 | 2023-08-25 | 武汉光谷信息技术股份有限公司 | Similarity comparison method and device based on word2vec technology |
CN112395867B (en) * | 2020-11-16 | 2023-08-08 | 中国平安人寿保险股份有限公司 | Synonym mining method and device, storage medium and computer equipment |
CN112883150B (en) * | 2021-01-21 | 2023-07-25 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for distinguishing trademark words from general words |
CN113688239B (en) * | 2021-08-20 | 2024-04-16 | 平安国际智慧城市科技股份有限公司 | Text classification method and device under small sample, electronic equipment and storage medium |
US11741302B1 (en) | 2022-05-18 | 2023-08-29 | Microsoft Technology Licensing, Llc | Automated artificial intelligence driven readability scoring techniques |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI227417B (en) * | 2003-12-02 | 2005-02-01 | Inst Information Industry | Digital resource recommendation system, method and machine-readable medium using semantic comparison of query sentence |
GB201010545D0 (en) * | 2010-06-23 | 2010-08-11 | Rolls Royce Plc | Entity recognition |
CN106033416B (en) * | 2015-03-09 | 2019-12-24 | 阿里巴巴集团控股有限公司 | Character string processing method and device |
CN105677637A (en) * | 2015-12-31 | 2016-06-15 | 上海智臻智能网络科技股份有限公司 | Method and device for updating abstract semantics database in intelligent question-answering system |
CN106021223B (en) * | 2016-05-09 | 2020-06-23 | Tcl科技集团股份有限公司 | Sentence similarity calculation method and system |
- 2017-02-27: CN application CN201710109325.7A filed, published as CN108509409A (status: pending)
- 2018-01-26: PCT application PCT/CN2018/074325 filed, published as WO2018153215A1
- 2018-02-13: TW application TW107105170A filed, granted as TWI662425B (active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006059105A (en) * | 2004-08-19 | 2006-03-02 | Mitsubishi Electric Corp | Apparatus, method and program for preparing language model |
JP4245530B2 (en) * | 2004-08-19 | 2009-03-25 | 三菱電機株式会社 | Language model creation apparatus and method, and program |
US20130018649A1 (en) * | 2011-07-13 | 2013-01-17 | Nuance Communications, Inc. | System and a Method for Generating Semantically Similar Sentences for Building a Robust SLM |
CN103218444A (en) * | 2013-04-22 | 2013-07-24 | 中央民族大学 | Method of Tibetan language webpage text classification based on semanteme |
CN103823794A (en) * | 2014-02-25 | 2014-05-28 | 浙江大学 | Automatic question setting method about query type short answer question of English reading comprehension test |
CN104281565A (en) * | 2014-09-30 | 2015-01-14 | 百度在线网络技术(北京)有限公司 | Semantic dictionary constructing method and device |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109657231A (en) * | 2018-11-09 | 2019-04-19 | 广东电网有限责任公司 | A kind of long SMS compressing method and system |
CN109657231B (en) * | 2018-11-09 | 2023-04-07 | 广东电网有限责任公司 | Long short message simplifying method and system |
CN111950237A (en) * | 2019-04-29 | 2020-11-17 | 深圳市优必选科技有限公司 | Sentence rewriting method, sentence rewriting device and electronic equipment |
CN111950237B (en) * | 2019-04-29 | 2023-06-09 | 深圳市优必选科技有限公司 | Sentence rewriting method, sentence rewriting device and electronic equipment |
CN110334197A (en) * | 2019-06-28 | 2019-10-15 | 科大讯飞股份有限公司 | Corpus processing method and relevant apparatus |
CN110633359A (en) * | 2019-09-04 | 2019-12-31 | 北京百分点信息科技有限公司 | Sentence equivalence judgment method and device |
CN110633359B (en) * | 2019-09-04 | 2022-03-29 | 北京百分点科技集团股份有限公司 | Sentence equivalence judgment method and device |
CN111709234A (en) * | 2020-05-28 | 2020-09-25 | 北京百度网讯科技有限公司 | Training method and device of text processing model and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
TW201841121A (en) | 2018-11-16 |
WO2018153215A1 (en) | 2018-08-30 |
TWI662425B (en) | 2019-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108509409A (en) | A method of automatically generating semantic similarity sentence sample | |
CN110442760B (en) | Synonym mining method and device for question-answer retrieval system | |
US10496749B2 (en) | Unified semantics-focused language processing and zero base knowledge building system | |
CN101599071B (en) | Automatic extraction method of conversation text topic | |
Constant et al. | MWU-aware part-of-speech tagging with a CRF model and lexical resources | |
CN110489760A (en) | Based on deep neural network text auto-collation and device | |
CN109408642A (en) | A kind of domain entities relation on attributes abstracting method based on distance supervision | |
CN105138514B (en) | It is a kind of based on dictionary it is positive gradually plus a word maximum matches Chinese word cutting method | |
CN106126620A (en) | Method of Chinese Text Automatic Abstraction based on machine learning | |
Suleiman et al. | The use of hidden Markov model in natural ARABIC language processing: a survey | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
CN106599032A (en) | Text event extraction method in combination of sparse coding and structural perceptron | |
CN110879834B (en) | Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof | |
CN108681574A (en) | A kind of non-true class quiz answers selection method and system based on text snippet | |
CN113505209A (en) | Intelligent question-answering system for automobile field | |
CN112364132A (en) | Similarity calculation model and system based on dependency syntax and method for building system | |
CN110222344B (en) | Composition element analysis algorithm for composition tutoring of pupils | |
CN113361252A (en) | Text depression tendency detection system based on multi-modal features and emotion dictionary | |
Fiser et al. | Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources. | |
Sharma et al. | Lexicon a linguistic approach for sentiment classification | |
CN107818078B (en) | Semantic association and matching method for Chinese natural language dialogue | |
Srinivasagan et al. | An automated system for tamil named entity recognition using hybrid approach | |
Li et al. | Multilingual toxic text classification model based on deep learning | |
Basumatary et al. | Deep Learning Based Bodo Parts of Speech Tagger | |
Filippova et al. | Bilingual terminology extraction using neural word embeddings on comparable corpora |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 1252738; Country of ref document: HK |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180907 |