CN109783819A - Method and system for generating a regular expression
- Publication number: CN109783819A
- Application number: CN201910046964.2A
- Authority: CN (China)
- Prior art keywords
- corpus information
- regular expression
- current corpus
- main body
- clause
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention belongs to the technical field of data processing and discloses a method and system for generating a regular expression. The method includes: obtaining current corpus information; performing syntactic analysis on the current corpus information and extracting the clause main body of the current corpus information; obtaining the semantic slots of the words of the clause main body; and generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information. The invention generates regular expressions automatically from the sentence structure and the parts of speech of the words, without manual writing based on rules inferred from the meaning of the sentence, which saves labor cost and improves efficiency.
Description
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a method and system for generating a regular expression.
Background
With the rapid development of network technology, a large amount of information data is generated every day and needs to be processed. Traditionally, regular expressions are written by hand, following the steps "review the corpus → identify the keywords in the corpus → write the dictionary → write the regular expression". In other words, they are written manually according to rules inferred from the meaning of each sentence. The process is complicated, manually reviewing corpora and writing expressions is inefficient, and relying entirely on hand-written regular expressions makes it impossible to process the large volume of newly added data accurately and in time. Writing regular expressions by hand also places high demands on staff.
Therefore, there is an urgent need for a method by which a system can automatically write the regular expression corresponding to a corpus according to the corpus information.
Summary of the invention
The object of the present invention is to provide a method and system for generating a regular expression that generate regular expressions automatically, which saves labor cost and improves efficiency.
The technical solution provided by the invention is as follows:
In one aspect, a method for generating a regular expression is provided, comprising:
obtaining current corpus information;
performing syntactic analysis on the current corpus information and extracting the clause main body of the current corpus information;
obtaining the semantic slots of the words of the clause main body;
generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Further preferably, performing syntactic analysis on the current corpus information and extracting the clause main body of the current corpus information specifically comprises:
segmenting the current corpus information to obtain the words in the current corpus information and their parts of speech;
performing clause analysis on the current corpus information according to syntax rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;
extracting the clause main body of the current corpus information according to the sentence structure.
Further preferably, generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information specifically comprises:
replacing the words of the clause main body in the current corpus information with the corresponding semantic slots;
ordering the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression.
Further preferably, generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information specifically comprises:
replacing the words of the clause main body in the current corpus information with the corresponding semantic slots;
ordering the remaining non-main-body part of the segmented current corpus information and the semantic slots according to syntactic structure, to generate at least one regular expression that differs in order but has the same semantics.
Further preferably, generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information further comprises:
after the regular expression is generated, adding a conjunction to the generated regular expression to generate another regular expression with the same semantics.
In another aspect, a system for generating a regular expression is also provided, comprising:
a corpus information acquisition module, configured to obtain current corpus information;
a clause main body extraction module, configured to perform syntactic analysis on the current corpus information and extract the clause main body of the current corpus information;
a semantic slot acquisition module, configured to obtain the semantic slots of the words of the clause main body;
a regular expression generation module, configured to generate a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Further preferably, the clause main body extraction module comprises:
a word segmentation unit, configured to segment the current corpus information to obtain the words in the current corpus information and their parts of speech;
a clause analysis unit, configured to perform clause analysis on the current corpus information according to syntax rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;
a clause main body extraction unit, configured to extract the clause main body of the current corpus information according to the sentence structure.
Further preferably, the regular expression generation module comprises:
a replacement unit, configured to replace the words of the clause main body in the current corpus information with the corresponding semantic slots;
a regular expression generation unit, configured to order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression.
Further preferably, the regular expression generation module comprises:
a replacement unit, configured to replace the words of the clause main body in the current corpus information with the corresponding semantic slots;
a regular expression generation unit, configured to order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to syntactic structure, to generate at least one regular expression that differs in order but has the same semantics.
Further preferably, the regular expression generation unit is also configured to, after the regular expression is generated, add a conjunction to the generated regular expression to generate another regular expression with the same semantics.
Compared with the prior art, the method and system for generating a regular expression provided by the invention have the following beneficial effects:
1. After obtaining corpus information, the invention first performs clause analysis on it and extracts the clause main body (such as the subject, predicate, and object), then converts the words in the clause main body into the corresponding semantic slots, and finally generates a regular expression from those semantic slots and the remaining non-main-body part of the corpus information. The invention generates regular expressions automatically from the sentence structure and the parts of speech of the words, without manual writing based on rules inferred from the meaning of the sentence, which saves labor cost and improves efficiency.
2. In a preferred embodiment, by permuting and combining the match items of the regular expression, multiple regular expressions with the same semantics can be generated from a single piece of corpus information, which improves the efficiency of regular expression generation.
Brief description of the drawings
The above characteristics, technical features, advantages, and implementations of the method and system for generating a regular expression are further described below in a clear and understandable manner, with reference to the drawings and preferred embodiments.
Fig. 1 is a schematic flowchart of the first embodiment of the method for generating a regular expression according to the invention;
Fig. 2 is a schematic flowchart of the second embodiment of the method for generating a regular expression according to the invention;
Fig. 3 is a schematic flowchart of the third embodiment of the method for generating a regular expression according to the invention;
Fig. 4 is a schematic flowchart of the fourth embodiment of the method for generating a regular expression according to the invention;
Fig. 5 is a schematic flowchart of the fifth embodiment of the method for generating a regular expression according to the invention;
Fig. 6 is a schematic flowchart of the sixth embodiment of the method for generating a regular expression according to the invention;
Fig. 7 is a schematic structural block diagram of the system for generating a regular expression according to the invention.
Description of reference numerals
100, corpus information acquisition module; 200, clause main body extraction module;
210, word segmentation unit; 220, clause analysis unit;
230, clause main body extraction unit; 300, semantic slot acquisition module;
400, regular expression generation module; 410, replacement unit;
420, regular expression generation unit.
Detailed description of the embodiments
In order to explain the embodiments of the invention or the technical solutions in the prior art more clearly, specific embodiments of the invention are described below with reference to the drawings. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings and other embodiments from them without creative effort.
To keep the drawings concise, each figure only schematically shows the parts relevant to the invention, and they do not represent the actual structure of a product. In addition, to keep the drawings simple and easy to understand, where several components in a figure have the same structure or function, only one of them is drawn or labeled. Herein, "one" means not only "only one" but also "more than one".
According to the first embodiment provided by the invention, as shown in Fig. 1, a method for generating a regular expression comprises:
S100: obtain current corpus information;
S200: perform syntactic analysis on the current corpus information and extract the clause main body of the current corpus information;
S300: obtain the semantic slots of the words of the clause main body;
S400: generate a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Specifically, the invention obtains a large amount of corpus information and then generates a large number of regular expressions from it. A regular expression is used to describe or match a series of character strings that conform to a certain syntactic rule. This embodiment illustrates the generation method using a single piece of corpus information as an example.
The corpus information may be text information, such as a sentence typed by a user or a sentence from a book. It may also be voice information input by a user, recorded audio information, and so on. This embodiment is described using the acquired current corpus information as an example.
After the current corpus information is obtained, syntactic analysis is performed on it and its clause main body is extracted, for example the subject, predicate, object, and attributive of the current corpus information. For example, if the current corpus information is "why can whales spray water", the extracted clause main body is "whale spray water", where "whale" is the subject and "spray water" is the predicate.
After the clause main body is extracted, each word of the clause main body is converted into the corresponding semantic slot according to its part of speech. A semantic slot may be all words that share the word's part of speech, or the words that have the same meaning as the word. For example, in the clause main body "whale spray water", "whale" is a noun and "spray water" is a verb, so the semantic slot corresponding to "whale" may be the noun library and the semantic slot corresponding to "spray water" may be the verb library.
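The slot lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the lexicon contents, the names `POS_LEXICON`, `SLOT_BY_POS`, and `semantic_slot`, and the use of English glosses in place of the Chinese words are all hypothetical and do not come from the patent.

```python
# Toy part-of-speech lexicon; the entries are illustrative assumptions.
POS_LEXICON = {"whale": "noun", "spray water": "verb"}

# Each part of speech maps to the library (semantic slot) of words sharing it.
SLOT_BY_POS = {"noun": "noun library", "verb": "verb library"}


def semantic_slot(word: str) -> str:
    """Return the semantic slot name for one word of the clause main body."""
    pos = POS_LEXICON[word]      # look up the word's part of speech
    return SLOT_BY_POS[pos]      # the slot is the library for that part of speech


slots = [semantic_slot(w) for w in ["whale", "spray water"]]
print(slots)
```

A synonym-based slot, the other option the text mentions, would replace `SLOT_BY_POS` with a mapping from each word to its synonym set.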
After the clause main body and the semantic slots corresponding to its words are obtained, the regular expression corresponding to the current corpus information can be generated according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
For example, the current corpus information is "why can whales spray water", the extracted clause main body is "whale spray water", the semantic slot corresponding to "whale" is the noun library, the semantic slot corresponding to "spray water" is the verb library, and the remaining non-main-body part is "why can". The regular expression generated from this information is "##noun library##[why][can]##verb library##".
After obtaining corpus information, the invention first performs clause analysis on it and extracts the clause main body (such as the subject, predicate, and object), then converts the words in the clause main body into the corresponding semantic slots, and finally generates a regular expression from those semantic slots and the remaining non-main-body part of the corpus information. The invention generates regular expressions automatically from the sentence structure and the parts of speech of the words, without manual writing based on rules inferred from the meaning of the sentence, which saves labor cost and improves efficiency.
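The source does not say how a slot-based expression is executed against text. As one possible reading, offered as an assumption only, each semantic slot could be expanded into an alternation over the words in its library and each non-main-body word matched literally. The sketch below uses Python's standard `re` module; the library contents, the `to_python_regex` helper, the optional-whitespace separator, and the English word-by-word gloss of the example sentence are all hypothetical.

```python
import re

# Hypothetical slot contents; the patent does not list the words in each library.
LIBRARIES = {
    "noun library": ["whale", "dolphin"],
    "verb library": ["spray water", "swim"],
}


def to_python_regex(items):
    """items: ("slot", name) or ("literal", text) pairs in sentence order."""
    parts = []
    for kind, value in items:
        if kind == "slot":
            # Longest words first so a longer entry is tried before a shorter one.
            words = sorted(LIBRARIES[value], key=len, reverse=True)
            parts.append("(?:" + "|".join(re.escape(w) for w in words) + ")")
        else:
            parts.append(re.escape(value))
    # Assumption: items may be separated by optional whitespace.
    return re.compile(r"\s*".join(parts))


pattern = to_python_regex([
    ("slot", "noun library"), ("literal", "why"),
    ("literal", "can"), ("slot", "verb library"),
])
print(bool(pattern.fullmatch("whale why can spray water")))
print(bool(pattern.fullmatch("dolphin why can swim")))
print(bool(pattern.fullmatch("whale can spray water")))
```

The first two inputs fit the template and the third does not, because the literal "why" is missing.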
According to the second embodiment provided by the invention, as shown in Fig. 2, a method for generating a regular expression comprises:
S100: obtain current corpus information;
S210: segment the current corpus information to obtain the words in the current corpus information and their parts of speech;
S220: perform clause analysis on the current corpus information according to syntax rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;
S230: extract the clause main body of the current corpus information according to the sentence structure;
S300: obtain the semantic slots of the words of the clause main body;
S400: generate a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Specifically, in the first embodiment above, the clause main body of the current corpus information may be extracted as follows: first, segment the current corpus information and obtain the part of speech of each word in it; then, obtain the sentence structure of the current corpus information according to the syntax rules and the parts of speech of its words; finally, extract the clause main body of the current corpus information according to that sentence structure.
Word segmentation means dividing the current corpus information into individual characters or words. For example, "don't know what you are saying" is divided into its individual words, and "why can whales spray water" is divided into "whale", "why", "can", and "spray water".
After the current corpus information is segmented, the resulting words are analyzed to obtain their parts of speech. For example, segmenting "why can whales spray water" yields "whale" (noun), "why" (pronoun), "can" (auxiliary verb), and "spray water" (verb). Clause analysis is then performed on the current corpus information according to the syntax rules and the parts of speech of its words, which gives the sentence structure "subject + adverbial + predicate" for "why can whales spray water". Finally, the current corpus information is analyzed according to this sentence structure: "whale spray water" is a subject-predicate structure, while "why spray water" and "can spray water" are verbal endocentric phrases. The main structure of "why can whales spray water" is therefore the subject-predicate phrase "whale spray water", which is extracted as the clause main body.
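Steps S210 to S230 can be illustrated with a toy pipeline. This is a minimal sketch, not the patent's analyzer: it works on an English word-by-word gloss of the example sentence, uses greedy longest-match segmentation over a hand-written lexicon, and stores each word's grammatical role directly in the lexicon, whereas a real system would derive roles from syntax rules. The names `LEXICON`, `MAIN_ROLES`, `segment`, and `analyze` are hypothetical.

```python
# Hypothetical lexicon: word -> (part of speech, grammatical role).
LEXICON = {
    "whale": ("noun", "subject"),
    "why": ("pronoun", "adverbial"),
    "can": ("auxiliary verb", "adverbial"),
    "spray water": ("verb", "predicate"),
}
MAIN_ROLES = {"subject", "predicate", "object"}


def segment(text, lexicon):
    """Greedy longest-match segmentation over a whitespace-separated gloss."""
    tokens = text.split()
    words, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in lexicon:
                words.append(candidate)
                i = j
                break
        else:
            raise ValueError(f"unknown word starting at token {i}: {tokens[i]!r}")
    return words


def analyze(text):
    words = segment(text, LEXICON)                          # S210: words
    tagged = [(w, *LEXICON[w]) for w in words]              # S210: parts of speech
    structure = []                                          # S220: sentence structure
    for _, _, role in tagged:
        if not structure or structure[-1] != role:          # merge adjacent equal roles
            structure.append(role)
    main_body = [w for w, _, role in tagged if role in MAIN_ROLES]  # S230
    return tagged, "+".join(structure), main_body


tagged, structure, main_body = analyze("whale why can spray water")
print(structure)
print(main_body)
```

For the example sentence this yields the structure "subject+adverbial+predicate" and the main body "whale", "spray water", matching the analysis in the text.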
According to the third embodiment provided by the invention, as shown in Fig. 3, a method for generating a regular expression comprises:
S100: obtain current corpus information;
S210: segment the current corpus information to obtain the words in the current corpus information and their parts of speech;
S220: perform clause analysis on the current corpus information according to syntax rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;
S230: extract the clause main body of the current corpus information according to the sentence structure;
S300: obtain the semantic slots of the words of the clause main body;
S410: replace the words of the clause main body in the current corpus information with the corresponding semantic slots;
S420: order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression.
Specifically, after the clause main body of the current corpus information is extracted by the method of the above embodiments and the semantic slots corresponding to its words are obtained, the non-main-body part of the segmented current corpus information is retained. The words of the clause main body in the current corpus information are then replaced with the corresponding semantic slots. Finally, the semantic slots and the non-main-body part are ordered according to the sentence structure of the current corpus information itself, which produces the regular expression corresponding to the current corpus information.
For example, the current corpus information is "why can whales spray water" and the clause main body is "whale spray water". The semantic slot corresponding to the word "whale" in the clause main body is the noun library, and the semantic slot corresponding to the word "spray water" is the verb library. The remaining non-main-body part of the segmented current corpus information is "why" and "can". Ordering the non-main-body words "why" and "can" and the semantic slots "noun library" and "verb library" according to the sentence structure of the current corpus information gives "noun library", "why", "can", "verb library". These four are the match items of the regular expression. Adding the regular expression symbols between the match items produces the regular expression corresponding to the current corpus information: "##noun library##[why][can]##verb library##".
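Steps S410 and S420 can be sketched as a replacement pass followed by a join. The delimiter rule used here is inferred from the examples in this document rather than stated by the source: "##" appears at both ends and between match items, except between two adjacent bracketed literals. The names `SLOT_BY_POS`, `build_items`, and `render_pattern`, and the English glosses, are hypothetical.

```python
SLOT_BY_POS = {"noun": "noun library", "verb": "verb library"}


def build_items(tagged_words, main_body):
    """S410: replace main-body words with slots; keep other words as literals."""
    items = []
    for word, pos in tagged_words:
        if word in main_body:
            items.append(("slot", SLOT_BY_POS[pos]))
        else:
            items.append(("literal", word))
    return items


def render_pattern(items):
    """S420: join match items (already in sentence order) with '##' symbols."""
    out = ["##"]
    for idx, (kind, value) in enumerate(items):
        out.append(value if kind == "slot" else f"[{value}]")
        nxt = items[idx + 1] if idx + 1 < len(items) else None
        both_literals = kind == "literal" and nxt is not None and nxt[0] == "literal"
        if not both_literals:       # inferred rule: no '##' between two literals
            out.append("##")
    return "".join(out)


tagged_words = [("whale", "noun"), ("why", "pronoun"),
                ("can", "auxiliary verb"), ("spray water", "verb")]
items = build_items(tagged_words, main_body={"whale", "spray water"})
print(render_pattern(items))
```

With the example sentence the rendered string is the same expression as in the text above.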
According to the fourth embodiment provided by the invention, as shown in Fig. 4, a method for generating a regular expression comprises:
S100: obtain current corpus information;
S210: segment the current corpus information to obtain the words in the current corpus information and their parts of speech;
S220: perform clause analysis on the current corpus information according to syntax rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;
S230: extract the clause main body of the current corpus information according to the sentence structure;
S300: obtain the semantic slots of the words of the clause main body;
S410: replace the words of the clause main body in the current corpus information with the corresponding semantic slots;
S430: order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to syntactic structure, to generate at least one regular expression that differs in order but has the same semantics.
Specifically, this embodiment differs from the third embodiment above as follows. After the clause main body of the current corpus information is extracted by the method of the first or second embodiment and the semantic slots corresponding to its words are obtained, the non-main-body part of the segmented current corpus information is retained, and the words of the clause main body are replaced with the corresponding semantic slots. Finally, the semantic slots and the non-main-body part are ordered according to syntactic structure to generate at least one regular expression that differs in order but has the same semantics.
For example, the current corpus information is "why can whales spray water" and the clause main body is "whale spray water". The semantic slot corresponding to the word "whale" in the clause main body is the noun library, and the semantic slot corresponding to the word "spray water" is the verb library. The remaining non-main-body part of the segmented current corpus information is "why" and "can". Provided the semantics stay the same, ordering the non-main-body words "why" and "can" and the semantic slots "noun library" and "verb library" according to syntactic structure can give both "noun library", "why", "can", "verb library" and "why", "noun library", "can", "verb library".
The regular expression obtained from the order "noun library", "why", "can", "verb library" is "##noun library##[why][can]##verb library##". The regular expression obtained from the order "why", "noun library", "can", "verb library" is "##[why]##noun library##[can]##verb library##". By permuting and combining the match items of the regular expression, this embodiment can generate multiple regular expressions with the same semantics from a single piece of corpus information, which improves the efficiency of regular expression generation.
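One way to picture the permutation step is to enumerate every ordering of the match items and keep only those that pass a grammaticality check. The sketch below does this with `itertools.permutations`. The three ordering constraints in `is_grammatical` are hypothetical rules written for this single example; the source only says that ordering follows syntactic structure and does not give the rules.

```python
from itertools import permutations

ITEMS = ["noun library", "why", "can", "verb library"]


def is_grammatical(order):
    """Hypothetical syntax rules for this one example sentence."""
    pos = {item: i for i, item in enumerate(order)}
    return (
        pos["noun library"] < pos["verb library"]       # subject precedes predicate
        and pos["can"] + 1 == pos["verb library"]       # auxiliary sits right before the verb
        and pos["verb library"] == len(order) - 1       # predicate closes the sentence
    )


# Keep only the orderings that satisfy the assumed rules.
orders = [list(p) for p in permutations(ITEMS) if is_grammatical(p)]
for order in orders:
    print(order)
```

Under these assumed rules exactly two orderings survive, the same two that the text lists.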
According to the fifth embodiment provided by the invention, as shown in Fig. 5, a method for generating a regular expression comprises:
S100: obtain current corpus information;
S210: segment the current corpus information to obtain the words in the current corpus information and their parts of speech;
S220: perform clause analysis on the current corpus information according to syntax rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;
S230: extract the clause main body of the current corpus information according to the sentence structure;
S300: obtain the semantic slots of the words of the clause main body;
S410: replace the words of the clause main body in the current corpus information with the corresponding semantic slots;
S420: order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression;
S440: after the regular expression is generated, add a conjunction to the generated regular expression to generate another regular expression with the same semantics.
Specifically, in Chinese grammar, sentence patterns such as active and passive sentences can differ in form while having the same meaning. To fully account for this, and without changing the intent of the current corpus information, a conjunction (such as "by") is added to the generated regular expression, and the match items of the generated regular expression are then rearranged to generate another regular expression with the same semantics.
For example, the current corpus information is "give teacher dictionary". "Give" is a verb, and its semantic slot is the verb library; "teacher" is a noun, and its semantic slot is the noun library; "dictionary" is a noun, and its semantic slot is the noun library. The generated regular expression is "##verb library##noun library##noun library##". After the connective " " is added to the current corpus information "give teacher dictionary", the current corpus information becomes "dictionary give teacher". Therefore, after the conjunction " " is added to the generated regular expression "##verb library##noun library##noun library##", the other regular expression with the same semantics is "##noun library##verb library##noun library##".
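The rearrangement in this example amounts to moving the direct object in front of the verb. A minimal sketch follows, assuming each match item carries a grammatical role label; the role names and the `front_direct_object` and `render` helpers are hypothetical. As in the source's example, the connective itself does not appear in the resulting expression.

```python
def front_direct_object(items):
    """Move the direct object before the verb, keeping the other items in order."""
    direct = [it for it in items if it[0] == "direct object"]
    rest = [it for it in items if it[0] != "direct object"]
    return direct + rest


def render(items):
    """Join slot names with '##', as in the all-slot examples in the text."""
    return "##" + "##".join(slot for _, slot in items) + "##"


# "give teacher dictionary": verb, indirect object, direct object.
base = [("verb", "verb library"),
        ("indirect object", "noun library"),
        ("direct object", "noun library")]
variant = front_direct_object(base)

print(render(base))
print(render(variant))
```

The two printed strings correspond to the original expression and its reordered counterpart from the example.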
According to the sixth embodiment provided by the invention, as shown in Fig. 6, a method for generating a regular expression comprises:
S100: obtain current corpus information;
S210: segment the current corpus information to obtain the words in the current corpus information and their parts of speech;
S220: perform clause analysis on the current corpus information according to syntax rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;
S230: extract the clause main body of the current corpus information according to the sentence structure;
S300: obtain the semantic slots of the words of the clause main body;
S410: replace the words of the clause main body in the current corpus information with the corresponding semantic slots;
S430: order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to syntactic structure, to generate at least one regular expression that differs in order but has the same semantics;
S440: after the regular expression is generated, add a conjunction to the generated regular expression to generate another regular expression with the same semantics.
Specifically, in Chinese grammar, sentence patterns such as active and passive sentences can differ in form while having the same meaning. To fully account for this, and without changing the intent of the current corpus information, a conjunction (such as "by") is added to the generated regular expression, and the match items of the generated regular expression are then rearranged to generate another regular expression with the same semantics.
For example, the current corpus information is "give teacher dictionary". "Give" is a verb, and its semantic slot is the verb library; "teacher" is a noun, and its semantic slot is the noun library; "dictionary" is a noun, and its semantic slot is the noun library. The generated regular expression is "##verb library##noun library##noun library##". After the connective " " is added to the current corpus information "give teacher dictionary", the current corpus information becomes "dictionary give teacher". Therefore, after the conjunction " " is added to the generated regular expression "##verb library##noun library##noun library##", the other regular expression with the same semantics is "##noun library##verb library##noun library##".
According to the seventh embodiment provided by the invention, as shown in Fig. 7, a system for generating a regular expression comprises:
a corpus information acquisition module 100, configured to obtain current corpus information;
a clause main body extraction module 200, configured to perform syntactic analysis on the current corpus information and extract the clause main body of the current corpus information;
a semantic slot acquisition module 300, configured to obtain the semantic slots of the words of the clause main body;
a regular expression generation module 400, configured to generate a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
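The four modules can be pictured as pluggable components wired together in order. The sketch below is an illustration only: the class `RegexGenerationSystem`, its field names, the stub implementations, and the `join_items` delimiter rule (inferred from the examples in this document) are all hypothetical and are not defined by the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Tagged = List[Tuple[str, str, bool]]  # (word, part of speech, is_main_body)


def join_items(items: List[str]) -> str:
    """Join items with '##', except between two adjacent bracketed literals."""
    out = "##"
    for i, item in enumerate(items):
        out += item
        nxt = items[i + 1] if i + 1 < len(items) else ""
        if not (item.startswith("[") and nxt.startswith("[")):
            out += "##"
    return out


@dataclass
class RegexGenerationSystem:
    acquire: Callable[[], str]                   # module 100
    extract_main_body: Callable[[str], Tagged]   # module 200
    slot_for: Callable[[str, str], str]          # module 300: (word, pos) -> slot
    generate: Callable[[List[str]], str]         # module 400

    def run(self) -> str:
        text = self.acquire()
        tagged = self.extract_main_body(text)
        # Main-body words become slots; other words become bracketed literals.
        items = [self.slot_for(w, pos) if main else f"[{w}]"
                 for w, pos, main in tagged]
        return self.generate(items)


# Stub modules standing in for real segmentation and parsing.
system = RegexGenerationSystem(
    acquire=lambda: "whale why can spray water",
    extract_main_body=lambda text: [
        ("whale", "noun", True), ("why", "pronoun", False),
        ("can", "auxiliary verb", False), ("spray water", "verb", True),
    ],
    slot_for=lambda word, pos: f"{pos} library",
    generate=join_items,
)
result = system.run()
print(result)
```

Passing the modules in as callables keeps each one replaceable, for example swapping the stub extractor for a real segmenter and parser.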
Specifically, the invention obtains a large amount of corpus information and then generates a large number of regular expressions from it. A regular expression is used to describe or match a series of character strings that conform to a certain syntactic rule. This embodiment illustrates the generation method using a single piece of corpus information as an example.
The corpus information may be text information, such as a sentence typed by a user or a sentence from a book. It may also be voice information input by a user, recorded audio information, and so on. This embodiment is described using the acquired current corpus information as an example.
After the current corpus information is obtained, syntactic analysis is performed on it and its clause main body is extracted, for example the subject, predicate, object, and attributive of the current corpus information. For example, if the current corpus information is "why can whales spray water", the extracted clause main body is "whale spray water", where "whale" is the subject and "spray water" is the predicate.
After the clause main body is extracted, each of its words is converted into a corresponding semantic slot according to its part of speech. A semantic slot may be the set of all words with the same part of speech as the word, or the set of words with the same meaning as the word. For example, in the clause main body "whale spray water", "whale" is a noun and "spray water" is a verb, so the semantic slot for "whale" may be the noun library and the semantic slot for "spray water" may be the verb library.
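The conversion from part of speech to semantic slot can be sketched as a simple lookup. The table POS_TO_SLOT and the function name semantic_slots are hypothetical names chosen for this illustration; the text only states that nouns map to a noun library and verbs to a verb library.

```python
# Hypothetical mapping from part-of-speech tags to semantic slot names.
POS_TO_SLOT = {"noun": "noun library", "verb": "verb library"}


def semantic_slots(main_body):
    """Map each (word, part_of_speech) pair of the clause main body to a slot name."""
    slots = {}
    for word, pos in main_body:
        if pos not in POS_TO_SLOT:
            raise KeyError(f"no semantic slot defined for part of speech {pos!r}")
        slots[word] = POS_TO_SLOT[pos]
    return slots


slots = semantic_slots([("whale", "noun"), ("spray water", "verb")])
print(slots)
```

A slot defined by synonyms rather than by part of speech would need a different lookup (word to synonym set), which is not shown here.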
Once the clause main body and the semantic slots of its words have been obtained, the regular expression corresponding to the current corpus information can be generated from the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Illustratively, the current corpus information is "why whales can spray water", the extracted clause main body is "whale spray water", the semantic slot for "whale" is the noun library, the semantic slot for "spray water" is the verb library, and the remaining non-main-body part is "why" and "can". The regular expression generated from this information is "## noun library ## [why] [can] ## verb library ##".
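To make the example expression more concrete, the sketch below turns an expression of the form "## noun library ## [why] [can] ## verb library ##" into an executable Python pattern. It rests on assumptions that the text does not spell out: that "##" delimits segments, that a bare segment names a word library whose members may all appear at that position, that a bracketed word is a literal, and that words are separated by whitespace. The vocabularies in SLOT_WORDS are invented for the illustration.

```python
import re

# Hypothetical slot vocabularies; a real system would load these from its word libraries.
SLOT_WORDS = {
    "noun library": ["whale", "dolphin"],
    "verb library": ["spray water", "swim"],
}


def compile_slot_pattern(pattern):
    """Translate a '## slot ## [literal] ##' expression into a compiled Python regex."""
    parts = []
    for segment in pattern.split("##"):
        segment = segment.strip()
        if not segment:
            continue  # empty pieces before the first and after the last '##'
        if segment.startswith("["):
            # A run of bracketed literals, e.g. "[why] [can]".
            parts.extend(re.escape(w) for w in re.findall(r"\[(.*?)\]", segment))
        else:
            # A slot name: any word from the corresponding library may appear here.
            words = SLOT_WORDS[segment]
            parts.append("(?:" + "|".join(re.escape(w) for w in words) + ")")
    return re.compile(r"\s+".join(parts))


regex = compile_slot_pattern("## noun library ## [why] [can] ## verb library ##")
print(bool(regex.fullmatch("whale why can spray water")))  # True
print(bool(regex.fullmatch("dolphin why can swim")))       # True
print(bool(regex.fullmatch("whale can swim")))             # False
```

The test sentences keep the word order of the segmented example ("whale", "why", "can", "spray water").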
After acquiring corpus information, the present invention first performs clause analysis on it and extracts its clause main body (such as the subject, predicate, and object), then converts the words of the clause main body into the corresponding semantic slots, and finally generates a regular expression from those semantic slots and the remaining non-main-body part of the corpus information. Because the regular expression is generated automatically from the sentence structure and the parts of speech of the words, no one has to write rules by hand after working out the meaning of each sentence, which reduces labor cost and improves efficiency.
Preferably, the clause main body extraction module 200 comprises:
a word segmentation unit 210, configured to segment the current corpus information to obtain the words in the current corpus information and their parts of speech;
a clause analysis unit 220, configured to perform clause analysis on the current corpus information according to syntax rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure; and
a clause main body extraction unit 230, configured to extract the clause main body of the current corpus information according to the sentence structure.
Specifically, in the above embodiment, the clause main body of the current corpus information may be extracted as follows: the current corpus information is first segmented to obtain the parts of speech of its words; the sentence structure of the current corpus information is then derived from syntax rules and those parts of speech; finally, the clause main body is extracted according to that sentence structure.
Segmenting the current corpus information means dividing it into individual characters or words. For example, "don't know what you are saying" is divided into its individual words, and "why whales can spray water" is divided into "whale", "why", "can", and "spray water".
After segmentation, the resulting words are analyzed to obtain their parts of speech. For "why whales can spray water", the words are "whale" (noun), "why" (pronoun), "can" (auxiliary verb), and "spray water" (verb). Clause analysis is then performed on the current corpus information according to syntax rules and these parts of speech, which gives the sentence structure "subject + adverbial + predicate". Finally, analyzing the current corpus information according to this sentence structure shows that "whale spray water" is a subject-predicate structure, while "why spray water" and "can spray water" are verbal endocentric phrases. The main structure of "why whales can spray water" is therefore the subject-predicate phrase "whale spray water", and this phrase is extracted as the clause main body.
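The step from parts of speech to sentence structure to clause main body can be illustrated with a toy rule set. The positional rules in assign_roles are an assumption made for this sketch (last verb is the predicate, nouns before it are subjects, nouns after it are objects, everything else is adverbial); the text does not specify the syntax rules actually used.

```python
# Toy role assignment: these rules are an illustrative assumption, not the patent's grammar.
def assign_roles(tagged):
    """Label each (word, pos) pair with a grammatical role using simple positional rules."""
    verb_index = max(i for i, (_, pos) in enumerate(tagged) if pos == "verb")
    roles = []
    for i, (word, pos) in enumerate(tagged):
        if pos == "verb" and i == verb_index:
            roles.append("predicate")
        elif pos == "noun":
            roles.append("subject" if i < verb_index else "object")
        else:
            roles.append("adverbial")
    return roles


def sentence_structure(roles):
    """Collapse consecutive identical roles, e.g. two adverbials become one."""
    collapsed = [r for i, r in enumerate(roles) if i == 0 or r != roles[i - 1]]
    return " + ".join(collapsed)


def extract_main_body(tagged, roles):
    """Keep only the words that fill subject, predicate, or object positions."""
    core = {"subject", "predicate", "object"}
    return [word for (word, _), role in zip(tagged, roles) if role in core]


tagged = [("whale", "noun"), ("why", "pronoun"), ("can", "auxiliary verb"), ("spray water", "verb")]
roles = assign_roles(tagged)
structure = sentence_structure(roles)
main_body = extract_main_body(tagged, roles)
print(structure)  # subject + adverbial + predicate
print(main_body)  # ['whale', 'spray water']
```

A production system would replace assign_roles with a real parser; the sketch only reproduces the whale example.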
Preferably, the regular expression generation module 400 comprises:
a replacement unit 410, configured to replace the words of the clause main body in the current corpus information with the corresponding semantic slot; and
a regular expression generation unit 420, configured to order the remaining non-main-body part of the segmented current corpus information and the semantic slot according to the sentence structure of the current corpus information, to generate the regular expression.
Specifically, after the clause main body of the current corpus information has been extracted by the method of the above embodiment and the semantic slots of its words have been obtained, the non-main-body part of the segmented current corpus information is retained, the words of the clause main body are replaced with the corresponding semantic slots, and the semantic slots and the non-main-body part are then ordered according to the sentence structure of the current corpus information itself to produce the corresponding regular expression.
Illustratively, the current corpus information is "why whales can spray water" and the clause main body is "whale spray water". The semantic slot for the word "whale" is the noun library, the semantic slot for the word "spray water" is the verb library, and the remaining non-main-body part of the segmented current corpus information is "why" and "can". Ordering the non-main-body words "why" and "can" and the semantic slots "noun library" and "verb library" according to the sentence structure of the current corpus information gives "noun library", "why", "can", "verb library". These four items are the match items of the regular expression; adding the regular-expression symbols between them produces the regular expression for the current corpus information: "## noun library ## [why] [can] ## verb library ##".
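The replacement and assembly steps can be sketched as follows. The function name build_expression is hypothetical, and the formatting rule (a "##" delimiter around each slot, with adjacent bracketed literals kept together in one segment) is inferred from the example expressions rather than stated explicitly in the text.

```python
def build_expression(tokens, slots):
    """Replace main-body words with their slots and join everything in sentence order.

    tokens: words of the segmented sentence, in their original order.
    slots:  mapping from each main-body word to its semantic slot name.
    """
    segments = []
    literal_run = []
    for word in tokens:
        if word in slots:
            if literal_run:
                # Close the pending run of non-main-body literals.
                segments.append(" ".join(literal_run))
                literal_run = []
            segments.append(slots[word])
        else:
            literal_run.append(f"[{word}]")
    if literal_run:
        segments.append(" ".join(literal_run))
    return "## " + " ## ".join(segments) + " ##"


tokens = ["whale", "why", "can", "spray water"]
slots = {"whale": "noun library", "spray water": "verb library"}
expression = build_expression(tokens, slots)
print(expression)  # ## noun library ## [why] [can] ## verb library ##
```

Because the tokens are processed in their original order, the match items follow the sentence structure of the corpus information without a separate sorting step.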
Preferably, the regular expression generation unit 420 is further configured to, after generating the regular expression, add a conjunction to the generated regular expression to generate another regular expression with the same semantics.
Specifically, Chinese grammar also has sentence patterns, such as active and passive sentences, that differ in form but have the same meaning. To take this into account, a conjunction (such as "by") is added to the generated regular expression without changing the intent of the current corpus information, and the match items of the generated regular expression are then recombined and reordered to produce another regular expression with the same semantics.
Illustratively, the current corpus information is "give the teacher a dictionary": "give" is a verb, whose semantic slot is the verb library; "teacher" is a noun, whose semantic slot is the noun library; and "dictionary" is a noun, whose semantic slot is the noun library. The generated regular expression is "## verb library ## noun library ## noun library ##". Adding a conjunction to "give the teacher a dictionary" turns the current corpus information into "dictionary to teacher". Accordingly, adding the conjunction to the generated regular expression "## verb library ## noun library ## noun library ##" yields another regular expression with the same semantics: "## noun library ## verb library ## noun library ##".
According to an eighth embodiment of the present invention, as shown in Fig. 7, a system for generating a regular expression comprises:
a corpus information acquisition module 100, configured to acquire current corpus information;
a clause main body extraction module 200, configured to perform syntactic analysis on the current corpus information and extract the clause main body of the current corpus information;
a semantic slot acquisition module 300, configured to acquire the semantic slot of the words of the clause main body; and
a regular expression generation module 400, configured to generate a regular expression according to the clause main body, the semantic slot, and the remaining non-main-body part of the current corpus information.
The clause main body extraction module 200 comprises:
a word segmentation unit 210, configured to segment the current corpus information to obtain the words in the current corpus information and their parts of speech;
a clause analysis unit 220, configured to perform clause analysis on the current corpus information according to syntax rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure; and
a clause main body extraction unit 230, configured to extract the clause main body of the current corpus information according to the sentence structure.
The regular expression generation module 400 comprises:
a replacement unit 410, configured to replace the words of the clause main body in the current corpus information with the corresponding semantic slot; and
a regular expression generation unit 420, configured to order the remaining non-main-body part of the segmented current corpus information and the semantic slot according to syntactic structure, to generate at least one regular expression that differs in order but has the same semantics.
Specifically, after the current corpus information is acquired, it is first segmented to obtain the parts of speech of its words; the sentence structure of the current corpus information is then derived from syntax rules and those parts of speech; finally, the clause main body of the current corpus information is extracted according to that sentence structure.
Segmenting the current corpus information means dividing it into individual characters or words. For example, "don't know what you are saying" is divided into its individual words, and "why whales can spray water" is divided into "whale", "why", "can", and "spray water".
After segmentation, the resulting words are analyzed to obtain their parts of speech. For "why whales can spray water", the words are "whale" (noun), "why" (pronoun), "can" (auxiliary verb), and "spray water" (verb). Clause analysis is then performed on the current corpus information according to syntax rules and these parts of speech, which gives the sentence structure "subject + adverbial + predicate". Finally, analyzing the current corpus information according to this sentence structure shows that "whale spray water" is a subject-predicate structure, while "why spray water" and "can spray water" are verbal endocentric phrases. The main structure of "why whales can spray water" is therefore the subject-predicate phrase "whale spray water", and this phrase is extracted as the clause main body.
After the clause main body is extracted, each of its words is converted into a corresponding semantic slot according to its part of speech. A semantic slot may be the set of all words with the same part of speech as the word, or the set of words with the same meaning as the word. For example, in the clause main body "whale spray water", "whale" is a noun and "spray water" is a verb, so the semantic slot for "whale" may be the noun library and the semantic slot for "spray water" may be the verb library.
After the clause main body of the current corpus information has been extracted and the semantic slots of its words have been obtained, the non-main-body part of the segmented current corpus information is retained, the words of the clause main body are replaced with the corresponding semantic slots, and the semantic slots and the non-main-body part are then ordered according to syntactic structure to generate at least one regular expression that differs in order but has the same semantics.
Illustratively, the current corpus information is "why whales can spray water" and the clause main body is "whale spray water". The semantic slot for the word "whale" is the noun library, the semantic slot for the word "spray water" is the verb library, and the remaining non-main-body part of the segmented current corpus information is "why" and "can". Provided the semantics stay the same, ordering the non-main-body words "why" and "can" and the semantic slots "noun library" and "verb library" according to syntactic structure can give either "noun library", "why", "can", "verb library" or "why", "noun library", "can", "verb library".
The ordering "noun library", "why", "can", "verb library" gives the regular expression "## noun library ## [why] [can] ## verb library ##". The ordering "why", "noun library", "can", "verb library" gives the regular expression "## [why] ## noun library ## [can] ## verb library ##". By permuting and combining the match items of the regular expression, this embodiment generates multiple regular expressions with the same semantics from a single piece of corpus information, which improves the efficiency of regular expression generation.
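One way to picture the permutation step is to enumerate every ordering of the match items and keep only those that satisfy a set of word-order constraints. The constraints in keeps_meaning below are assumptions chosen so that the sketch reproduces the two orderings of the whale example; the text does not state how semantic equivalence is actually checked.

```python
from itertools import permutations

ITEMS = ["noun library", "why", "can", "verb library"]


def keeps_meaning(order):
    """Hypothetical word-order constraints for the whale example."""
    pos = {item: i for i, item in enumerate(order)}
    return (
        pos["noun library"] < pos["verb library"]   # subject precedes predicate
        and pos["can"] + 1 == pos["verb library"]   # auxiliary sits directly before the verb
        and pos["why"] < pos["can"]                 # "why" comes before the auxiliary
    )


orderings = [list(p) for p in permutations(ITEMS) if keeps_meaning(p)]
for order in orderings:
    print(order)
```

Of the 24 possible orderings, only two pass these constraints, and each of them corresponds to one of the two regular expressions given in this embodiment.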
Preferably, the regular expression generation unit 420 is further configured to, after generating the regular expression, add a conjunction to the generated regular expression to generate another regular expression with the same semantics.
Specifically, Chinese grammar also has sentence patterns, such as active and passive sentences, that differ in form but have the same meaning. To take this into account, a conjunction (such as "by") is added to the generated regular expression without changing the intent of the current corpus information, and the match items of the generated regular expression are then recombined and reordered to produce another regular expression with the same semantics.
Illustratively, the current corpus information is "give the teacher a dictionary": "give" is a verb, whose semantic slot is the verb library; "teacher" is a noun, whose semantic slot is the noun library; and "dictionary" is a noun, whose semantic slot is the noun library. The generated regular expression is "## verb library ## noun library ## noun library ##". Adding a conjunction to "give the teacher a dictionary" turns the current corpus information into "dictionary to teacher". Accordingly, adding the conjunction to the generated regular expression "## verb library ## noun library ## noun library ##" yields another regular expression with the same semantics: "## noun library ## verb library ## noun library ##".
It should be noted that the above embodiments may be freely combined as needed. The foregoing is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.
Claims (10)
1. A method for generating a regular expression, characterized by comprising:
acquiring current corpus information;
performing syntactic analysis on the current corpus information and extracting the clause main body of the current corpus information;
acquiring the semantic slot of the words of the clause main body; and
generating a regular expression according to the clause main body, the semantic slot, and the remaining non-main-body part of the current corpus information.
2. The method for generating a regular expression according to claim 1, characterized in that performing syntactic analysis on the current corpus information and extracting the clause main body of the current corpus information specifically comprises:
segmenting the current corpus information to obtain the words in the current corpus information and their parts of speech;
performing clause analysis on the current corpus information according to syntax rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure; and
extracting the clause main body of the current corpus information according to the sentence structure.
3. The method for generating a regular expression according to claim 2, characterized in that generating a regular expression according to the clause main body, the semantic slot, and the remaining non-main-body part of the current corpus information specifically comprises:
replacing the words of the clause main body in the current corpus information with the corresponding semantic slot; and
ordering the remaining non-main-body part of the segmented current corpus information and the semantic slot according to the sentence structure of the current corpus information, to generate the regular expression.
4. The method for generating a regular expression according to claim 2, characterized in that generating a regular expression according to the clause main body, the semantic slot, and the remaining non-main-body part of the current corpus information specifically comprises:
replacing the words of the clause main body in the current corpus information with the corresponding semantic slot; and
ordering the remaining non-main-body part of the segmented current corpus information and the semantic slot according to syntactic structure, to generate at least one regular expression that differs in order but has the same semantics.
5. The method for generating a regular expression according to claim 3 or 4, characterized in that generating a regular expression according to the clause main body, the semantic slot, and the remaining non-main-body part of the current corpus information further comprises:
after generating the regular expression, adding a conjunction to the generated regular expression to generate another regular expression with the same semantics.
6. A system for generating a regular expression, characterized by comprising:
a corpus information acquisition module, configured to acquire current corpus information;
a clause main body extraction module, configured to perform syntactic analysis on the current corpus information and extract the clause main body of the current corpus information;
a semantic slot acquisition module, configured to acquire the semantic slot of the words of the clause main body; and
a regular expression generation module, configured to generate a regular expression according to the clause main body, the semantic slot, and the remaining non-main-body part of the current corpus information.
7. The system for generating a regular expression according to claim 6, characterized in that the clause main body extraction module comprises:
a word segmentation unit, configured to segment the current corpus information to obtain the words in the current corpus information and their parts of speech;
a clause analysis unit, configured to perform clause analysis on the current corpus information according to syntax rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure; and
a clause main body extraction unit, configured to extract the clause main body of the current corpus information according to the sentence structure.
8. The system for generating a regular expression according to claim 7, characterized in that the regular expression generation module comprises:
a replacement unit, configured to replace the words of the clause main body in the current corpus information with the corresponding semantic slot; and
a regular expression generation unit, configured to order the remaining non-main-body part of the segmented current corpus information and the semantic slot according to the sentence structure of the current corpus information, to generate the regular expression.
9. The system for generating a regular expression according to claim 7, characterized in that the regular expression generation module comprises:
a replacement unit, configured to replace the words of the clause main body in the current corpus information with the corresponding semantic slot; and
a regular expression generation unit, configured to order the remaining non-main-body part of the segmented current corpus information and the semantic slot according to syntactic structure, to generate at least one regular expression that differs in order but has the same semantics.
10. The system for generating a regular expression according to claim 8 or 9, characterized in that
the regular expression generation unit is further configured to, after generating the regular expression, add a conjunction to the generated regular expression to generate another regular expression with the same semantics.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910046964.2A CN109783819B (en) | 2019-01-18 | 2019-01-18 | Regular expression generation method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109783819A (en) | 2019-05-21 |
CN109783819B CN109783819B (en) | 2023-10-20 |
Family
ID=66501654
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910046964.2A Active CN109783819B (en) | 2019-01-18 | 2019-01-18 | Regular expression generation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109783819B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||