CN109783819A - Method and system for generating regular expressions - Google Patents

Method and system for generating regular expressions

Info

Publication number
CN109783819A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910046964.2A
Other languages
Chinese (zh)
Other versions
CN109783819B (en)
Inventor
魏誉荧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Genius Technology Co Ltd
Original Assignee
Guangdong Genius Technology Co Ltd
Application filed by Guangdong Genius Technology Co Ltd
Priority to CN201910046964.2A
Publication of CN109783819A
Application granted
Publication of CN109783819B
Status: Active

Abstract

The invention belongs to the technical field of data processing and discloses a method and system for generating regular expressions. The method comprises: obtaining current corpus information; performing syntactic analysis on the current corpus information and extracting its clause main body; obtaining the semantic slots of the words in the clause main body; and generating a regular expression from the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information. The invention generates regular expressions automatically from the sentence structure and the parts of speech of the words, so rules no longer have to be written by hand from the inferred meaning of a sentence. This saves labor cost and improves efficiency.

Description

Method and system for generating regular expressions
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a method and system for generating regular expressions.
Background
With the rapid development of network technology, a large amount of information data is generated every day and needs to be processed. Traditional regular expressions are generally written by hand, following the steps "check the corpus → identify the keywords in the corpus → write a dictionary → write the regular expression". In other words, they are written manually according to rules inferred from the meaning of the sentence. The process is complicated, and manually checking the corpus and writing expressions is inefficient. Relying entirely on hand-written regular expressions makes it impossible to process the large volume of newly added daily data accurately and in time. Hand-writing regular expressions also places high demands on staff.
Therefore, a method is urgently needed by which a system can automatically write the regular expression corresponding to a corpus from the corpus information.
Summary of the invention
The object of the invention is to provide a method and system for generating regular expressions that produces them automatically, saving labor cost and improving efficiency.
The technical solution provided by the invention is as follows:
In one aspect, a method for generating a regular expression is provided, comprising:
Obtaining current corpus information;
Performing syntactic analysis on the current corpus information and extracting the clause main body of the current corpus information;
Obtaining the semantic slots of the words in the clause main body;
Generating a regular expression from the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Further preferably, performing syntactic analysis on the current corpus information and extracting its clause main body specifically comprises:
Segmenting the current corpus information to obtain its words and their parts of speech;
Analyzing the sentence pattern of the current corpus information according to grammar rules and the parts of speech of its words, to obtain the corresponding sentence structure;
Extracting the clause main body of the current corpus information according to the sentence structure.
Further preferably, generating the regular expression from the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information specifically comprises:
Replacing the words of the clause main body in the current corpus information with their corresponding semantic slots;
Ordering the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, and generating a regular expression.
Further preferably, generating the regular expression from the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information specifically comprises:
Replacing the words of the clause main body in the current corpus information with their corresponding semantic slots;
Ordering the remaining non-main-body part of the segmented current corpus information and the semantic slots according to grammatical structure, generating at least one regular expression that differs in order but is semantically identical.
Further preferably, generating the regular expression from the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information further comprises:
After the regular expression is generated, adding a connective to the generated regular expression to produce another semantically identical regular expression.
In another aspect, a system for generating a regular expression is also provided, comprising:
A corpus information acquisition module, for obtaining current corpus information;
A clause main body extraction module, for performing syntactic analysis on the current corpus information and extracting the clause main body of the current corpus information;
A semantic slot acquisition module, for obtaining the semantic slots of the words in the clause main body;
A regular expression generation module, for generating a regular expression from the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Further preferably, the clause main body extraction module comprises:
A segmentation unit, for segmenting the current corpus information to obtain its words and their parts of speech;
A sentence pattern analysis unit, for analyzing the sentence pattern of the current corpus information according to grammar rules and the parts of speech of its words, to obtain the corresponding sentence structure;
A clause main body extraction unit, for extracting the clause main body of the current corpus information according to the sentence structure.
Further preferably, the regular expression generation module comprises:
A replacement unit, for replacing the words of the clause main body in the current corpus information with their corresponding semantic slots;
A regular expression generation unit, for ordering the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, and generating a regular expression.
Further preferably, the regular expression generation module comprises:
A replacement unit, for replacing the words of the clause main body in the current corpus information with their corresponding semantic slots;
A regular expression generation unit, for ordering the remaining non-main-body part of the segmented current corpus information and the semantic slots according to grammatical structure, generating at least one regular expression that differs in order but is semantically identical.
Further preferably, the regular expression generation unit is also used, after the regular expression is generated, to add a connective to the generated regular expression and produce another semantically identical regular expression.
Compared with the prior art, the method and system for generating regular expressions provided by the invention have the following beneficial effects:
1. After obtaining corpus information, the invention first analyzes its sentence pattern and extracts the clause main body, such as the subject, predicate, and object. It then converts the words in the clause main body into their corresponding semantic slots, and finally generates a regular expression from those semantic slots and the remaining non-main-body part of the corpus information. Because the regular expression is generated automatically from the sentence structure and the parts of speech of the words, rules no longer have to be written by hand from the inferred meaning of a sentence, which saves labor cost and improves efficiency.
2. In a preferred embodiment, permuting the match items of a regular expression makes it possible to generate multiple semantically identical regular expressions from a single piece of corpus information, which improves the efficiency of regular expression generation.
Brief description of the drawings
The preferred embodiments are described below in a clear and understandable manner with reference to the drawings, further explaining the above characteristics, technical features, advantages, and implementation of the method and system for generating regular expressions.
Fig. 1 is a flow diagram of the first embodiment of the method for generating regular expressions of the invention;
Fig. 2 is a flow diagram of the second embodiment of the method for generating regular expressions of the invention;
Fig. 3 is a flow diagram of the third embodiment of the method for generating regular expressions of the invention;
Fig. 4 is a flow diagram of the fourth embodiment of the method for generating regular expressions of the invention;
Fig. 5 is a flow diagram of the fifth embodiment of the method for generating regular expressions of the invention;
Fig. 6 is a flow diagram of the sixth embodiment of the method for generating regular expressions of the invention;
Fig. 7 is a schematic structural block diagram of the system for generating regular expressions of the invention.
Description of reference numerals
100: corpus information acquisition module; 200: clause main body extraction module;
210: segmentation unit; 220: sentence pattern analysis unit;
230: clause main body extraction unit; 300: semantic slot acquisition module;
400: regular expression generation module; 410: replacement unit;
420: regular expression generation unit.
Detailed description of the embodiments
To explain the embodiments of the invention or the technical solutions of the prior art more clearly, specific embodiments of the invention are described below with reference to the drawings. Evidently, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings and other embodiments from them without creative effort.
For simplicity, each figure only schematically shows the parts relevant to the invention; the figures do not represent the actual structure of a product. In addition, to keep the figures easy to understand, where several components in a figure have the same structure or function, only one of them is drawn or labeled. Herein, "one" means not only "only this one" but may also mean "more than one".
According to the first embodiment provided by the invention, as shown in Fig. 1, a method for generating a regular expression comprises:
S100: obtain current corpus information;
S200: perform syntactic analysis on the current corpus information and extract the clause main body of the current corpus information;
S300: obtain the semantic slots of the words in the clause main body;
S400: generate a regular expression from the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Specifically, the invention obtains a large amount of corpus information and generates a large number of regular expressions from it. A regular expression describes or matches a series of character strings that conform to some syntactic rule. This embodiment uses a single piece of corpus information to illustrate the generation method.
Corpus information may be text, such as a sentence typed by a user or a sentence from a book. It may also be voice input or recorded audio from a user. This embodiment is described using the current corpus information that has been obtained.
After the current corpus information is obtained, syntactic analysis is performed on it and its clause main body is extracted, for example the subject, predicate, object, and attributive. For example, if the current corpus information is "why can whales spray water" (word by word in the original Chinese order: "whale / why / can / spray water"), the extracted clause main body is "whale spray water", where "whale" is the subject and "spray water" is the predicate.
After the clause main body is extracted, its words are converted into their corresponding semantic slots according to their parts of speech. A semantic slot may be all words with the same part of speech as the word, or the words with the same meaning as the word. For example, in the clause main body "whale spray water", "whale" is a noun and "spray water" is a verb, so the semantic slot for "whale" may be the noun library and the semantic slot for "spray water" may be the verb library.
Once the clause main body and the semantic slots of its words have been obtained, the regular expression corresponding to the current corpus information can be generated from the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
For example, the current corpus information is "why can whales spray water" and the extracted clause main body is "whale spray water". The semantic slot for "whale" is the noun library, the semantic slot for "spray water" is the verb library, and the remaining non-main-body part is "why can". The regular expression generated from this information is "##noun library##[why][can]##verb library##".
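As a rough illustration of how such a pattern could be put to use, the sketch below expands each semantic slot into an alternation of its library words and compiles the result with Python's `re` module. The slot contents, the English glosses standing in for the Chinese words, the whitespace-tolerant join, and the names `SLOTS` and `to_regex` are all assumptions made for illustration; the text does not specify how slot libraries are stored or how the pattern is executed.

```python
import re

# Hypothetical slot libraries; a real system would hold far larger word lists.
SLOTS = {
    "noun library": ["whale", "dolphin"],
    "verb library": ["spray water", "swim"],
}

def to_regex(items):
    """Turn a list of ("slot", name) / ("literal", word) items into a compiled regex."""
    parts = []
    for kind, text in items:
        if kind == "slot":
            # A slot matches any one word from its library.
            alternatives = "|".join(re.escape(word) for word in SLOTS[text])
            parts.append("(?:" + alternatives + ")")
        else:
            # A literal (non-main-body word) must appear as written.
            parts.append(re.escape(text))
    # Allow optional whitespace between items (needed for the English glosses).
    return re.compile(r"\s*".join(parts))

# Items for "##noun library##[why][can]##verb library##".
PATTERN = to_regex([
    ("slot", "noun library"),
    ("literal", "why"),
    ("literal", "can"),
    ("slot", "verb library"),
])
```

With these toy libraries, `PATTERN.fullmatch("dolphin why can swim")` succeeds even though that sentence was never seen, which is the point of replacing main-body words with slots.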
After obtaining corpus information, the invention first analyzes its sentence pattern and extracts the clause main body, such as the subject, predicate, and object. It then converts the words in the clause main body into their corresponding semantic slots, and finally generates a regular expression from those semantic slots and the remaining non-main-body part of the corpus information. Because the regular expression is generated automatically from the sentence structure and the parts of speech of the words, rules no longer have to be written by hand from the inferred meaning of a sentence, which saves labor cost and improves efficiency.
According to the second embodiment provided by the invention, as shown in Fig. 2, a method for generating a regular expression comprises:
S100: obtain current corpus information;
S210: segment the current corpus information to obtain its words and their parts of speech;
S220: analyze the sentence pattern of the current corpus information according to grammar rules and the parts of speech of its words, to obtain the corresponding sentence structure;
S230: extract the clause main body of the current corpus information according to the sentence structure;
S300: obtain the semantic slots of the words in the clause main body;
S400: generate a regular expression from the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Specifically, the clause main body in the first embodiment may be extracted as follows: first segment the current corpus information and obtain the parts of speech of its words; then derive the sentence structure of the current corpus information from grammar rules and those parts of speech; finally extract the clause main body according to that sentence structure.
Segmenting the current corpus information means dividing it into individual characters or words. For example, "I don't know what you are saying" is divided into "not, know, you, are, saying, what", and "why can whales spray water" is divided into "whale, why, can, spray water".
After segmentation, the resulting words are analyzed to obtain their parts of speech. For "why can whales spray water", the words are "whale" (noun), "why" (pronoun), "can" (auxiliary verb), and "spray water" (verb). The sentence pattern is then analyzed according to grammar rules and these parts of speech, giving the sentence structure "subject + adverbial + predicate". Finally, analyzing the current corpus information against this sentence structure shows that "whale spray water" is a subject-predicate structure, while "why spray water" and "can spray water" are both modifier-head phrases. The main structure of "why can whales spray water" is therefore the subject-predicate phrase "whale spray water", and this phrase is extracted as the clause main body.
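The segmentation and extraction steps can be sketched as follows. This is a minimal illustration, not the method of the text: it swaps in forward maximum matching against a tiny dictionary for segmentation, and a deliberately crude "keep the nouns and verbs" rule in place of real sentence-structure analysis. The Chinese sentence is an assumed back-translation of the whale example, and `LEXICON`, `segment`, and `extract_main_body` are hypothetical names.

```python
# Toy lexicon: word -> part of speech. The Chinese words are an assumed
# back-translation of the "whale / why / can / spray water" example.
LEXICON = {"鲸鱼": "noun", "为什么": "pronoun", "会": "auxiliary", "喷水": "verb"}

def segment(text, lexicon):
    """Forward maximum matching: take the longest lexicon word at each position."""
    longest = max(len(word) for word in lexicon)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if piece in lexicon or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens

def extract_main_body(tokens, lexicon):
    """Crude stand-in for sentence-structure analysis: keep the nouns and verbs."""
    return [token for token in tokens if lexicon.get(token) in ("noun", "verb")]

TOKENS = segment("鲸鱼为什么会喷水", LEXICON)
MAIN_BODY = extract_main_body(TOKENS, LEXICON)
```

A production system would more plausibly rely on a trained tokenizer, part-of-speech tagger, and parser; the part-of-speech filter here only happens to agree with the subject-predicate analysis for this one sentence.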
According to the third embodiment provided by the invention, as shown in Fig. 3, a method for generating a regular expression comprises:
S100: obtain current corpus information;
S210: segment the current corpus information to obtain its words and their parts of speech;
S220: analyze the sentence pattern of the current corpus information according to grammar rules and the parts of speech of its words, to obtain the corresponding sentence structure;
S230: extract the clause main body of the current corpus information according to the sentence structure;
S300: obtain the semantic slots of the words in the clause main body;
S410: replace the words of the clause main body in the current corpus information with their corresponding semantic slots;
S420: order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, and generate a regular expression.
Specifically, after the clause main body of the current corpus information has been extracted by the method of the above embodiments and the semantic slots of its words have been obtained, the non-main-body part of the segmented current corpus information is retained. The words of the clause main body are then replaced with their corresponding semantic slots. Finally, the semantic slots and the non-main-body part are ordered according to the sentence structure of the current corpus information itself, producing the corresponding regular expression.
For example, the current corpus information is "why can whales spray water" and the clause main body is "whale spray water". The semantic slot for "whale" is the noun library and the semantic slot for "spray water" is the verb library. The remaining non-main-body part of the segmented current corpus information is "why" and "can". Ordering the non-main-body words "why" and "can" and the semantic slots "noun library" and "verb library" according to the sentence structure of the current corpus information gives "noun library", "why", "can", "verb library". These four are the match items of the regular expression; adding the regular expression symbols between the match items yields the regular expression for the current corpus information, "##noun library##[why][can]##verb library##".
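The replace-and-order step above can be sketched as a single pass over the tagged words. The notation is taken from the examples in the text: semantic slots are wrapped in `##` and retained words in square brackets. The rule that adjacent slots share one `##` delimiter is an assumption inferred from those examples, and `SLOT_FOR_POS` and `build_pattern` are hypothetical names.

```python
# Hypothetical mapping from part of speech to the slot name used in the pattern.
SLOT_FOR_POS = {"noun": "noun library", "verb": "verb library"}

def build_pattern(tagged_tokens, main_body):
    """tagged_tokens: (word, pos) pairs in sentence order; main_body: set of words."""
    parts = []
    previous_was_slot = False
    for word, pos in tagged_tokens:
        if word in main_body:
            slot = SLOT_FOR_POS[pos]
            # Assumed convention: adjacent slots share one "##" delimiter.
            parts.append(slot + "##" if previous_was_slot else "##" + slot + "##")
            previous_was_slot = True
        else:
            # Non-main-body words are kept as bracketed literals.
            parts.append("[" + word + "]")
            previous_was_slot = False
    return "".join(parts)

TAGGED = [("whale", "noun"), ("why", "pronoun"), ("can", "auxiliary"), ("spray water", "verb")]
PATTERN = build_pattern(TAGGED, {"whale", "spray water"})
```

Because the loop follows the original word order, the output keeps the sentence structure of the input, which is what distinguishes this embodiment from the reordering variant described later.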
According to the fourth embodiment provided by the invention, as shown in Fig. 4, a method for generating a regular expression comprises:
S100: obtain current corpus information;
S210: segment the current corpus information to obtain its words and their parts of speech;
S220: analyze the sentence pattern of the current corpus information according to grammar rules and the parts of speech of its words, to obtain the corresponding sentence structure;
S230: extract the clause main body of the current corpus information according to the sentence structure;
S300: obtain the semantic slots of the words in the clause main body;
S410: replace the words of the clause main body in the current corpus information with their corresponding semantic slots;
S430: order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to grammatical structure, generating at least one regular expression that differs in order but is semantically identical.
Specifically, this embodiment differs from the third embodiment in its final step. After the clause main body of the current corpus information has been extracted by the method of the first or second embodiment and the semantic slots of its words have been obtained, the non-main-body part of the segmented current corpus information is retained, and the words of the clause main body are replaced with their corresponding semantic slots. The semantic slots and the non-main-body part are then ordered according to grammatical structure, generating at least one regular expression that differs in order but is semantically identical.
For example, the current corpus information is "why can whales spray water" and the clause main body is "whale spray water". The semantic slot for "whale" is the noun library and the semantic slot for "spray water" is the verb library. The remaining non-main-body part of the segmented current corpus information is "why" and "can". Provided the meaning is kept the same, ordering the non-main-body words "why" and "can" and the semantic slots "noun library" and "verb library" according to grammatical structure can give either "noun library", "why", "can", "verb library" or "why", "noun library", "can", "verb library".
The order "noun library", "why", "can", "verb library" gives the regular expression "##noun library##[why][can]##verb library##". The order "why", "noun library", "can", "verb library" gives the regular expression "[why]##noun library##[can]##verb library##". By permuting the match items of a regular expression, this embodiment can generate multiple semantically identical regular expressions from a single piece of corpus information, which improves the efficiency of regular expression generation.
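One way to picture the permutation step is to enumerate every ordering of the match items and keep only those that pass a set of ordering constraints. The sketch below does this with `itertools.permutations`. The three constraints in `keeps_meaning` are hand-written assumptions chosen so that the two orderings from the example survive; the text only says the ordering follows grammatical structure and does not give the rules.

```python
from itertools import permutations

# Match items for the whale example: N = noun-library slot, V = verb-library slot.
ITEMS = ["N", "why", "can", "V"]

def keeps_meaning(order):
    """Hypothetical hand-written constraints standing in for grammar rules."""
    position = {item: index for index, item in enumerate(order)}
    return (
        position["N"] < position["V"]             # subject precedes predicate
        and position["can"] + 1 == position["V"]  # auxiliary directly precedes the verb
        and position["why"] < position["can"]     # the adverbial precedes the auxiliary
    )

def render(order):
    """Write slots as ##name## and retained words as [word]."""
    slot_text = {"N": "##noun library##", "V": "##verb library##"}
    return "".join(slot_text.get(item, "[" + item + "]") for item in order)

# Every ordering that satisfies the constraints becomes one regular expression.
VARIANTS = [render(order) for order in permutations(ITEMS) if keeps_meaning(order)]
```

Of the 24 possible orderings, only the two named in the example pass these constraints. Enumerating all permutations is fine for a handful of items but grows factorially, so a longer sentence would need the grammar to prune candidates earlier.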
According to the fifth embodiment provided by the invention, as shown in Fig. 5, a method for generating a regular expression comprises:
S100: obtain current corpus information;
S210: segment the current corpus information to obtain its words and their parts of speech;
S220: analyze the sentence pattern of the current corpus information according to grammar rules and the parts of speech of its words, to obtain the corresponding sentence structure;
S230: extract the clause main body of the current corpus information according to the sentence structure;
S300: obtain the semantic slots of the words in the clause main body;
S410: replace the words of the clause main body in the current corpus information with their corresponding semantic slots;
S420: order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, and generate a regular expression;
S440: after the regular expression is generated, add a connective to the generated regular expression to produce another semantically identical regular expression.
Specifically, Chinese grammar also has sentence patterns that differ in form but share the same meaning, such as active and passive sentences. To take this into account, without changing the intent of the current corpus information, a connective (such as "把" (ba) or "被" (bei, "by")) is added to the generated regular expression, and the match items of the generated regular expression are then rearranged to produce another semantically identical regular expression.
For example, the current corpus information is "give the teacher a dictionary" (word by word: "give / teacher / dictionary"). "Give" is a verb, whose semantic slot is the verb library; "teacher" is a noun, whose semantic slot is the noun library; "dictionary" is a noun, whose semantic slot is the noun library. The generated regular expression is "##verb library##noun library##noun library##". After the connective "把" is added, the current corpus information becomes "把 dictionary give teacher" ("give the dictionary to the teacher"). Accordingly, adding the connective "把" to the generated regular expression "##verb library##noun library##noun library##" yields another semantically identical regular expression, "##noun library##verb library##noun library##".
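The reordering in this example can be sketched as a fixed transformation on the slot sequence. The function below covers only the single three-slot case shown (verb, recipient, object); the names `render_slots` and `fronted_object_variant` are hypothetical, and whether the connective itself should appear in the pattern as a literal is left open, since the pattern printed in the text shows only the slots.

```python
def render_slots(slots):
    """Join slot names in the ##slot## notation, sharing delimiters between neighbours."""
    return "##" + "##".join(slots) + "##"

def fronted_object_variant(slots):
    """Reorder (verb, recipient, object) into (object, verb, recipient).

    This mirrors the move from "give / teacher / dictionary" to
    "dictionary / give / teacher" once a connective fronts the object.
    """
    verb, recipient, obj = slots
    return [obj, verb, recipient]

ORIGINAL = ["verb library", "noun library", "noun library"]
VARIANT = fronted_object_variant(ORIGINAL)
```

Here `render_slots(ORIGINAL)` and `render_slots(VARIANT)` reproduce the two patterns quoted in the example. A general implementation would need one such rule per connective and sentence pattern.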
According to the sixth embodiment provided by the invention, as shown in Fig. 6, a method for generating a regular expression comprises:
S100: obtain current corpus information;
S210: segment the current corpus information to obtain its words and their parts of speech;
S220: analyze the sentence pattern of the current corpus information according to grammar rules and the parts of speech of its words, to obtain the corresponding sentence structure;
S230: extract the clause main body of the current corpus information according to the sentence structure;
S300: obtain the semantic slots of the words in the clause main body;
S410: replace the words of the clause main body in the current corpus information with their corresponding semantic slots;
S430: order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to grammatical structure, generating at least one regular expression that differs in order but is semantically identical;
S440: after the regular expression is generated, add a connective to the generated regular expression to produce another semantically identical regular expression.
Specifically, Chinese grammar also has sentence patterns that differ in form but share the same meaning, such as active and passive sentences. To take this into account, without changing the intent of the current corpus information, a connective (such as "把" (ba) or "被" (bei, "by")) is added to the generated regular expression, and the match items of the generated regular expression are then rearranged to produce another semantically identical regular expression.
For example, the current corpus information is "give the teacher a dictionary" (word by word: "give / teacher / dictionary"). "Give" is a verb, whose semantic slot is the verb library; "teacher" is a noun, whose semantic slot is the noun library; "dictionary" is a noun, whose semantic slot is the noun library. The generated regular expression is "##verb library##noun library##noun library##". After the connective "把" is added, the current corpus information becomes "把 dictionary give teacher" ("give the dictionary to the teacher"). Accordingly, adding the connective "把" to the generated regular expression "##verb library##noun library##noun library##" yields another semantically identical regular expression, "##noun library##verb library##noun library##".
According to the seventh embodiment provided by the invention, as shown in Fig. 7, a system for generating a regular expression comprises:
A corpus information acquisition module 100, for obtaining current corpus information;
A clause main body extraction module 200, for performing syntactic analysis on the current corpus information and extracting the clause main body of the current corpus information;
A semantic slot acquisition module 300, for obtaining the semantic slots of the words in the clause main body;
A regular expression generation module 400, for generating a regular expression from the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
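The four modules can be pictured as four pluggable functions wired together by one driver. The sketch below is only one possible arrangement: the class name `RegexGenerationSystem`, the field names, the function signatures, and the stub implementations (which hard-code the whale example) are all assumptions, not part of the described system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Tagged = List[Tuple[str, str]]  # (word, part of speech) pairs in sentence order

@dataclass
class RegexGenerationSystem:
    acquire: Callable[[], str]                                    # module 100
    extract_main_body: Callable[[str], Tuple[Tagged, List[str]]]  # module 200
    slot_for: Callable[[str, str], str]                           # module 300
    generate: Callable[[Tagged, Dict[str, str]], str]             # module 400

    def run(self) -> str:
        text = self.acquire()
        tagged, main_body = self.extract_main_body(text)
        # Look up a semantic slot for each main-body word.
        slots = {word: self.slot_for(word, pos) for word, pos in tagged if word in main_body}
        return self.generate(tagged, slots)

# Stub modules that hard-code the whale example (illustration only).
def _acquire() -> str:
    return "whale why can spray water"

def _extract(text: str) -> Tuple[Tagged, List[str]]:
    # A real module would segment, tag, and parse `text`; this stub ignores it.
    tagged = [("whale", "noun"), ("why", "pronoun"), ("can", "auxiliary"), ("spray water", "verb")]
    return tagged, ["whale", "spray water"]

def _slot_for(word: str, pos: str) -> str:
    return pos + " library"

def _generate(tagged: Tagged, slots: Dict[str, str]) -> str:
    return "".join("##" + slots[w] + "##" if w in slots else "[" + w + "]" for w, _ in tagged)

SYSTEM = RegexGenerationSystem(_acquire, _extract, _slot_for, _generate)
RESULT = SYSTEM.run()
```

Keeping each module behind a plain callable makes it easy to swap the stubs for a real tokenizer or slot lookup without touching the driver.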
Specifically, the present invention obtains a large amount of corpus information and then generates a large number of regular expressions from it. A regular expression is a string used to describe or match a series of strings that conform to some syntactic rule. This embodiment illustrates the generation method using a single piece of corpus information as an example.
Corpus information may be text information, such as a sentence typed by the user or a sentence from a book; it may also be voice information input by the user, recorded audio information, and the like. This embodiment is described using the currently obtained corpus information as an example.
After the current corpus information is obtained, syntactic analysis is performed on it to extract its clause main body, for example the subject, predicate, object, and attribute of the current corpus information. For example, if the current corpus information is "whales can spray water", the extracted clause main body is "whale spray water", where "whale" is the subject and "spray water" is the predicate.
After the clause main body is extracted, each of its words is converted into a corresponding semantic slot according to the word's part of speech. A semantic slot may be the set of all words that share the word's part of speech, or the set of words that have the same meaning as the word. For example, in the clause main body "whale spray water", "whale" is a noun and "spray water" is a verb, so the semantic slot for "whale" may be the noun library and the semantic slot for "spray water" may be the verb library.
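The mapping from part of speech to semantic slot can be sketched as a simple lookup table. This is a minimal illustration rather than the patent's implementation: the tag names `n` and `v`, the table `SLOT_BY_POS`, and the function `to_semantic_slots` are hypothetical.

```python
# Hypothetical part-of-speech tags ("n" = noun, "v" = verb) mapped to slot names.
SLOT_BY_POS = {
    "n": "noun library",
    "v": "verb library",
}


def to_semantic_slots(main_body):
    """Map each (word, pos) pair of the clause main body to (word, slot name)."""
    slots = []
    for word, pos in main_body:
        if pos not in SLOT_BY_POS:
            raise ValueError(f"no semantic slot defined for part of speech {pos!r}")
        slots.append((word, SLOT_BY_POS[pos]))
    return slots


main_body = [("whale", "n"), ("spray water", "v")]
print(to_semantic_slots(main_body))
```

A slot defined by meaning rather than part of speech (the second option mentioned above) would need a synonym table instead of `SLOT_BY_POS`; that variant is not shown here.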
After the clause main body and the semantic slots corresponding to its words are obtained, the regular expression corresponding to the current corpus information can be generated according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Illustratively, the current corpus information is "why can whales spray water". The extracted clause main body is "whale spray water", the semantic slot for "whale" is the noun library, the semantic slot for "spray water" is the verb library, and the remaining non-main-body part is "why" and "can". The regular expression generated from this information is "##noun library##[why][can]##verb library##".
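The assembly of the pattern string can be sketched as follows. The delimiter rule used here is an assumption inferred from the worked examples, not something the source states: `##` opens and closes the pattern and separates neighbouring items, except that two adjacent retained words are written as adjacent bracketed tokens. The names `Item` and `render_pattern`, and the slot labels `noun library` and `verb library`, are illustrative.

```python
from typing import List, Tuple

# Each item is ("slot", name) for a semantic slot or ("word", text) for a
# retained non-main-body word. These tags are illustrative only.
Item = Tuple[str, str]


def render_pattern(items: List[Item]) -> str:
    """Join items into a '##'-delimited pattern string.

    Assumed rule: '##' starts and ends the pattern and is placed between two
    neighbouring items unless both are literal words, which are written as
    adjacent bracketed tokens.
    """
    if not items:
        return ""
    parts = ["##"]
    for index, (kind, text) in enumerate(items):
        if index > 0:
            previous_kind = items[index - 1][0]
            if not (previous_kind == "word" and kind == "word"):
                parts.append("##")
        parts.append(f"[{text}]" if kind == "word" else text)
    parts.append("##")
    return "".join(parts)


items = [
    ("slot", "noun library"),
    ("word", "why"),
    ("word", "can"),
    ("slot", "verb library"),
]
print(render_pattern(items))  # ##noun library##[why][can]##verb library##
```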
After obtaining corpus information, the present invention first performs clause analysis on it and extracts the clause main body, such as the subject, predicate, and object; it then converts the words of the clause main body into the corresponding semantic slots; finally, it generates a regular expression from the semantic slots of the words in the clause main body and the remaining non-main-body part of the corpus information. Because the regular expression is generated automatically from the sentence structure and the parts of speech of the words, rules no longer have to be written by hand from the inferred meaning of each sentence, which saves labor cost and improves efficiency.
Preferably, the clause main body extraction module 200 comprises:
A word segmentation unit 210, configured to segment the current corpus information to obtain the words in the current corpus information and their corresponding parts of speech;
A clause analysis unit 220, configured to perform clause analysis on the current corpus information according to grammar rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;
A clause main body extraction unit 230, configured to extract the clause main body of the current corpus information according to the sentence structure.
Specifically, in the first embodiment above, the clause main body of the current corpus information may be extracted as follows: first, the current corpus information is segmented to obtain the parts of speech of its words; next, the sentence structure of the current corpus information is derived from grammar rules and those parts of speech; finally, the clause main body is extracted according to that sentence structure.
Segmenting the current corpus information means dividing it into individual characters or words. For example, "don't know what you are saying" is divided into its component words; as another example, "why can whales spray water" is divided into "whale", "why", "can", "spray water".
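The source does not say which segmentation algorithm is used. As one common option, forward maximum matching against a dictionary is sketched below. The dictionary `LEXICON`, the function `segment`, and the Chinese string (our assumed rendering of the whale example, not taken from the source) are all illustrative assumptions.

```python
# Toy dictionary; a real system would use a full lexicon with part-of-speech tags.
# The entries are an assumed Chinese rendering of "whale", "why", "can", "spray water".
LEXICON = {"鲸鱼", "为什么", "会", "喷水"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Forward maximum matching: at each position take the longest dictionary word."""
    words = []
    pos = 0
    while pos < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            # Fall back to a single character when nothing in the dictionary matches.
            if candidate in LEXICON or length == 1:
                words.append(candidate)
                pos += length
                break
    return words


print(segment("鲸鱼为什么会喷水"))  # ['鲸鱼', '为什么', '会', '喷水']
```

Forward maximum matching is simple but greedy; statistical or neural segmenters usually handle ambiguous boundaries better.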
After the current corpus information is segmented, the resulting words are analyzed to obtain their parts of speech. For example, segmenting "why can whales spray water" yields "whale" (noun), "why" (pronoun), "can" (auxiliary verb), and "spray water" (verb). Clause analysis is then performed on the current corpus information according to grammar rules and the parts of speech of its words, giving the sentence structure "subject + adverbial + predicate" for "why can whales spray water". Finally, analysis according to this sentence structure shows that "whale spray water" is a subject-predicate structure, while "why spray water" and "can spray water" are verb-centered endocentric phrases. The main structure of "why can whales spray water" is therefore the subject-predicate phrase "whale spray water", which is extracted as the clause main body.
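The split between clause main body and non-main-body words can be sketched with a few toy rules. This is a stand-in for real clause analysis, under stated assumptions: the part-of-speech tags (`n`, `v`, `r`, `aux`), the role names, and the functions `label_roles` and `split_main_body` are hypothetical, and the rules only cover simple sentences like the examples in the text.

```python
MAIN_ROLES = {"subject", "predicate", "object"}


def label_roles(tagged_words):
    """Assign a grammatical role to each (word, pos) pair with toy rules.

    Assumed rules: a noun before the first verb is the subject, a verb is the
    predicate, a noun after the verb is the object, and everything else
    (pronouns, auxiliary verbs, ...) is treated as an adverbial.
    """
    roles = []
    seen_verb = False
    for word, pos in tagged_words:
        if pos == "v":
            roles.append((word, "predicate"))
            seen_verb = True
        elif pos == "n":
            roles.append((word, "object" if seen_verb else "subject"))
        else:
            roles.append((word, "adverbial"))
    return roles


def split_main_body(tagged_words):
    """Return (main body words, non-main-body words), each in sentence order."""
    roles = label_roles(tagged_words)
    main = [word for word, role in roles if role in MAIN_ROLES]
    rest = [word for word, role in roles if role not in MAIN_ROLES]
    return main, rest


tagged = [("whale", "n"), ("why", "r"), ("can", "aux"), ("spray water", "v")]
print(split_main_body(tagged))  # (['whale', 'spray water'], ['why', 'can'])
```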
Preferably, the regular expression generation module 400 comprises:
A replacement unit 410, configured to replace the words of the clause main body in the current corpus information with the corresponding semantic slots;
A regular expression generation unit 420, configured to order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression.
Specifically, after the clause main body of the current corpus information is extracted according to the method of the above embodiment and the semantic slots corresponding to its words are obtained, the non-main-body part of the segmented current corpus information is retained, the words of the clause main body are replaced with their corresponding semantic slots, and finally the semantic slots and the non-main-body part are ordered according to the sentence structure of the current corpus information itself, producing the regular expression corresponding to the current corpus information.
Illustratively, the current corpus information is "why can whales spray water" and the clause main body is "whale spray water". The semantic slot for the word "whale" in the clause main body is the noun library, and the semantic slot for the word "spray water" is the verb library. The remaining non-main-body part of the segmented current corpus information is "why" and "can". Ordering the non-main-body words "why" and "can" together with the semantic slots "noun library" and "verb library" according to the sentence structure of the current corpus information gives "noun library", "why", "can", "verb library". These four are the match items of the regular expression; adding the regular expression's symbols between the match items yields the regular expression corresponding to the current corpus information, "##noun library##[why][can]##verb library##".
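To match text with such a template, each semantic slot has to be expanded into the words it stands for. The source does not describe this step, so the following is only a sketch under stated assumptions: the slot contents in `SLOT_WORDS`, the item tags `slot` and `word`, and the function `compile_template` are hypothetical, and English glosses stand in for the original words.

```python
import re

# Hypothetical slot contents; a real noun or verb library would be far larger.
SLOT_WORDS = {
    "noun library": ["whale", "dolphin"],
    "verb library": ["spray water", "swim"],
}


def compile_template(items):
    """Turn ("slot", name) / ("word", text) items into a compiled regex.

    Slots become alternations over their words, literal words are escaped,
    and optional whitespace is allowed between neighbouring items.
    """
    pieces = []
    for kind, text in items:
        if kind == "slot":
            alternatives = "|".join(re.escape(word) for word in SLOT_WORDS[text])
            pieces.append(f"(?:{alternatives})")
        else:
            pieces.append(re.escape(text))
    return re.compile(r"\s*".join(pieces))


template = [
    ("slot", "noun library"),
    ("word", "why"),
    ("word", "can"),
    ("slot", "verb library"),
]
pattern = compile_template(template)
print(bool(pattern.fullmatch("whale why can spray water")))  # True
print(bool(pattern.fullmatch("dolphin why can swim")))       # True
print(bool(pattern.fullmatch("why whale can swim")))         # False
```

Because the slots are expanded into alternations, one template covers every sentence that follows the same structure with different nouns and verbs.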
Preferably, the regular expression generation unit 420 is further configured to, after generating the regular expression, add a conjunction to the generated regular expression to generate another regular expression with the same semantics.
Specifically, Chinese grammar also contains sentence patterns, such as active and passive sentences, that differ in form but have the same meaning. To take such cases fully into account, a conjunction (for example, "by") is added to the generated regular expression without changing the intent of the current corpus information, and the match items of the generated regular expression are then recombined and reordered to generate another regular expression with the same semantics.
Illustratively, the current corpus information is "give the teacher the dictionary": "give" is a verb, and its semantic slot is the verb library; "teacher" is a noun, and its semantic slot is the noun library; "dictionary" is a noun, and its semantic slot is the noun library. The generated regular expression is "##verb library##noun library##noun library##". After a conjunction is added, the current corpus information "give the teacher the dictionary" becomes "give the dictionary to the teacher". Accordingly, adding the conjunction to the generated regular expression "##verb library##noun library##noun library##" yields another regular expression with the same semantics, "##noun library##verb library##noun library##".
According to an eighth embodiment of the present invention, as shown in Fig. 7, a system for generating regular expressions comprises:
A corpus information acquisition module 100, configured to obtain current corpus information;
A clause main body extraction module 200, configured to perform syntactic analysis on the current corpus information and extract the clause main body of the current corpus information;
A semantic slot acquisition module 300, configured to obtain the semantic slots of the words of the clause main body;
A regular expression generation module 400, configured to generate a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
The clause main body extraction module 200 comprises:
A word segmentation unit 210, configured to segment the current corpus information to obtain the words in the current corpus information and their corresponding parts of speech;
A clause analysis unit 220, configured to perform clause analysis on the current corpus information according to grammar rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;
A clause main body extraction unit 230, configured to extract the clause main body of the current corpus information according to the sentence structure.
The regular expression generation module 400 comprises:
A replacement unit 410, configured to replace the words of the clause main body in the current corpus information with the corresponding semantic slots;
A regular expression generation unit 420, configured to order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to syntactic structure, to generate at least one regular expression that differs in order but has the same semantics.
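The way modules 100 to 400 hand data to one another can be sketched as a small pipeline. This is only a structural illustration under stated assumptions: the class `RegexGenerationSystem`, the `Token` tuple layout, and the toy stand-in functions are hypothetical, and the toy generator renders items separated by single spaces rather than using the patent's notation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str, bool]  # (word, part of speech, belongs to main body?)


@dataclass
class RegexGenerationSystem:
    """Chains modules 100-400; each module is supplied as a plain callable."""

    acquire_corpus: Callable[[], str]                                       # module 100
    extract_main_body: Callable[[str], List[Token]]                         # module 200
    acquire_slots: Callable[[List[Token]], Dict[str, str]]                  # module 300
    generate_patterns: Callable[[List[Token], Dict[str, str]], List[str]]   # module 400

    def run(self) -> List[str]:
        corpus = self.acquire_corpus()
        tokens = self.extract_main_body(corpus)
        slots = self.acquire_slots(tokens)
        return self.generate_patterns(tokens, slots)


# Toy stand-ins so the wiring can be exercised end to end.
def toy_extract(corpus: str) -> List[Token]:
    # Ignores its input and returns a fixed analysis of the whale example.
    return [("whale", "n", True), ("why", "r", False),
            ("can", "aux", False), ("spray water", "v", True)]


def toy_slots(tokens: List[Token]) -> Dict[str, str]:
    names = {"n": "noun library", "v": "verb library"}
    return {word: names[pos] for word, pos, is_main in tokens if is_main}


def toy_generate(tokens: List[Token], slots: Dict[str, str]) -> List[str]:
    # Simplified rendering: every item is separated by a single space.
    return [" ".join(slots.get(word, f"[{word}]") for word, _, _ in tokens)]


system = RegexGenerationSystem(
    acquire_corpus=lambda: "why can whales spray water",
    extract_main_body=toy_extract,
    acquire_slots=toy_slots,
    generate_patterns=toy_generate,
)
print(system.run())  # ['noun library [why] [can] verb library']
```

Passing the modules in as callables keeps each one replaceable, for example swapping a rule-based extractor for a parser-based one without touching the others.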
Specifically, after the current corpus information is obtained, it is first segmented to obtain the parts of speech of its words; its sentence structure is then derived from grammar rules and those parts of speech; finally, its clause main body is extracted according to the sentence structure.
Segmenting the current corpus information means dividing it into individual characters or words. For example, "don't know what you are saying" is divided into its component words; as another example, "why can whales spray water" is divided into "whale", "why", "can", "spray water".
After the current corpus information is segmented, the resulting words are analyzed to obtain their parts of speech. For example, segmenting "why can whales spray water" yields "whale" (noun), "why" (pronoun), "can" (auxiliary verb), and "spray water" (verb). Clause analysis is then performed on the current corpus information according to grammar rules and the parts of speech of its words, giving the sentence structure "subject + adverbial + predicate" for "why can whales spray water". Finally, analysis according to this sentence structure shows that "whale spray water" is a subject-predicate structure, while "why spray water" and "can spray water" are verb-centered endocentric phrases. The main structure of "why can whales spray water" is therefore the subject-predicate phrase "whale spray water", which is extracted as the clause main body.
After the clause main body is extracted, each of its words is converted into a corresponding semantic slot according to the word's part of speech. A semantic slot may be the set of all words that share the word's part of speech, or the set of words that have the same meaning as the word. For example, in the clause main body "whale spray water", "whale" is a noun and "spray water" is a verb, so the semantic slot for "whale" may be the noun library and the semantic slot for "spray water" may be the verb library.
After the clause main body of the current corpus information is extracted and the semantic slots corresponding to its words are obtained, the non-main-body part of the segmented current corpus information is retained, the words of the clause main body are replaced with their corresponding semantic slots, and finally the semantic slots and the non-main-body part are ordered according to syntactic structure to generate at least one regular expression that differs in order but has the same semantics.
Illustratively, the current corpus information is "why can whales spray water" and the clause main body is "whale spray water". The semantic slot for the word "whale" in the clause main body is the noun library, and the semantic slot for the word "spray water" is the verb library. The remaining non-main-body part of the segmented current corpus information is "why" and "can". On the premise that the semantics stay the same, ordering the non-main-body words "why" and "can" together with the semantic slots "noun library" and "verb library" according to syntactic structure gives two orders: "noun library", "why", "can", "verb library" and "why", "noun library", "can", "verb library".
The order "noun library", "why", "can", "verb library" yields the regular expression "##noun library##[why][can]##verb library##". The order "why", "noun library", "can", "verb library" yields the regular expression "##[why]##noun library##[can]##verb library##". By permuting the match items of the regular expression, this embodiment generates multiple regular expressions with the same semantics from a single piece of corpus information, which improves the efficiency of regular expression generation.
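The reordering step can be sketched as enumerating permutations of the match items and keeping only those that pass a syntactic check. The source does not spell out that check, so `keeps_meaning` below is a toy constraint set chosen to reproduce the two orders in the example; it, `candidate_orders`, and the item labels are assumptions for illustration.

```python
from itertools import permutations


def keeps_meaning(order):
    """Toy constraint set standing in for the syntactic-structure check.

    Assumed rules: the subject slot precedes the predicate slot, the auxiliary
    "can" sits directly before the predicate, and the predicate comes last.
    """
    subject = order.index("noun library")
    auxiliary = order.index("[can]")
    predicate = order.index("verb library")
    return (
        subject < predicate
        and auxiliary + 1 == predicate
        and predicate == len(order) - 1
    )


def candidate_orders(items):
    """Enumerate every ordering of the match items that passes the constraints."""
    return [list(order) for order in permutations(items) if keeps_meaning(order)]


items = ["noun library", "[why]", "[can]", "verb library"]
for order in candidate_orders(items):
    print(order)
```

Enumerating all permutations grows factorially with the number of items, so a practical system would more likely generate only the orders its grammar rules allow.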
Preferably, the regular expression generation unit 420 is further configured to, after generating the regular expression, add a conjunction to the generated regular expression to generate another regular expression with the same semantics.
Specifically, Chinese grammar also contains sentence patterns, such as active and passive sentences, that differ in form but have the same meaning. To take such cases fully into account, a conjunction (for example, "by") is added to the generated regular expression without changing the intent of the current corpus information, and the match items of the generated regular expression are then recombined and reordered to generate another regular expression with the same semantics.
Illustratively, the current corpus information is "give the teacher the dictionary": "give" is a verb, and its semantic slot is the verb library; "teacher" is a noun, and its semantic slot is the noun library; "dictionary" is a noun, and its semantic slot is the noun library. The generated regular expression is "##verb library##noun library##noun library##". After a conjunction is added, the current corpus information "give the teacher the dictionary" becomes "give the dictionary to the teacher". Accordingly, adding the conjunction to the generated regular expression "##verb library##noun library##noun library##" yields another regular expression with the same semantics, "##noun library##verb library##noun library##".
It should be noted that the above embodiments may be freely combined as needed. The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for generating a regular expression, characterized by comprising:
obtaining current corpus information;
performing syntactic analysis on the current corpus information, and extracting the clause main body of the current corpus information;
obtaining the semantic slots of the words of the clause main body;
generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
2. The method for generating a regular expression according to claim 1, characterized in that performing syntactic analysis on the current corpus information and extracting the clause main body of the current corpus information specifically comprises:
segmenting the current corpus information to obtain the words in the current corpus information and their corresponding parts of speech;
performing clause analysis on the current corpus information according to grammar rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;
extracting the clause main body of the current corpus information according to the sentence structure.
3. The method for generating a regular expression according to claim 2, characterized in that generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information specifically comprises:
replacing the words of the clause main body in the current corpus information with the corresponding semantic slots;
ordering the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression.
4. The method for generating a regular expression according to claim 2, characterized in that generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information specifically comprises:
replacing the words of the clause main body in the current corpus information with the corresponding semantic slots;
ordering the remaining non-main-body part of the segmented current corpus information and the semantic slots according to syntactic structure, to generate at least one regular expression that differs in order but has the same semantics.
5. The method for generating a regular expression according to claim 3 or 4, characterized in that generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information further comprises:
after the regular expression is generated, adding a conjunction to the generated regular expression to generate another regular expression with the same semantics.
6. A system for generating a regular expression, characterized by comprising:
a corpus information acquisition module, configured to obtain current corpus information;
a clause main body extraction module, configured to perform syntactic analysis on the current corpus information and extract the clause main body of the current corpus information;
a semantic slot acquisition module, configured to obtain the semantic slots of the words of the clause main body;
a regular expression generation module, configured to generate a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
7. The system for generating a regular expression according to claim 6, characterized in that the clause main body extraction module comprises:
a word segmentation unit, configured to segment the current corpus information to obtain the words in the current corpus information and their corresponding parts of speech;
a clause analysis unit, configured to perform clause analysis on the current corpus information according to grammar rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;
a clause main body extraction unit, configured to extract the clause main body of the current corpus information according to the sentence structure.
8. The system for generating a regular expression according to claim 7, characterized in that the regular expression generation module comprises:
a replacement unit, configured to replace the words of the clause main body in the current corpus information with the corresponding semantic slots;
a regular expression generation unit, configured to order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression.
9. The system for generating a regular expression according to claim 7, characterized in that the regular expression generation module comprises:
a replacement unit, configured to replace the words of the clause main body in the current corpus information with the corresponding semantic slots;
a regular expression generation unit, configured to order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to syntactic structure, to generate at least one regular expression that differs in order but has the same semantics.
10. The system for generating a regular expression according to claim 8 or 9, characterized in that
the regular expression generation unit is further configured to, after generating the regular expression, add a conjunction to the generated regular expression to generate another regular expression with the same semantics.
CN201910046964.2A 2019-01-18 2019-01-18 Regular expression generation method and system Active CN109783819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910046964.2A CN109783819B (en) 2019-01-18 2019-01-18 Regular expression generation method and system

Publications (2)

Publication Number Publication Date
CN109783819A true CN109783819A (en) 2019-05-21
CN109783819B CN109783819B (en) 2023-10-20

Family

ID=66501654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910046964.2A Active CN109783819B (en) 2019-01-18 2019-01-18 Regular expression generation method and system

Country Status (1)

Country Link
CN (1) CN109783819B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system
CN105095186A (en) * 2015-07-28 2015-11-25 百度在线网络技术(北京)有限公司 Semantic analysis method and device
CN105512105A (en) * 2015-12-07 2016-04-20 百度在线网络技术(北京)有限公司 Semantic parsing method and device
CN107369443A (en) * 2017-06-29 2017-11-21 北京百度网讯科技有限公司 Dialogue management method and device based on artificial intelligence
CN107315737A (en) * 2017-07-04 2017-11-03 北京奇艺世纪科技有限公司 A kind of semantic logic processing method and system
CN107608949A (en) * 2017-10-16 2018-01-19 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device based on semantic model
CN107766560A (en) * 2017-11-03 2018-03-06 广州杰赛科技股份有限公司 The evaluation method and system of customer service flow
CN108563790A (en) * 2018-04-28 2018-09-21 科大讯飞股份有限公司 A kind of semantic understanding method and device, equipment, computer-readable medium
CN109063035A (en) * 2018-07-16 2018-12-21 哈尔滨工业大学 A kind of man-machine more wheel dialogue methods towards trip field

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021068683A1 (en) * 2019-10-11 2021-04-15 平安科技(深圳)有限公司 Method and apparatus for generating regular expression, server, and computer-readable storage medium
CN111159384A (en) * 2019-12-31 2020-05-15 苏州思必驰信息科技有限公司 Rule-based sentence generation method and device
CN111159384B (en) * 2019-12-31 2022-07-08 思必驰科技股份有限公司 Rule-based sentence generation method and device
CN111428469A (en) * 2020-02-27 2020-07-17 宋继华 Sentence pattern structure diagram analysis oriented interactive labeling method and system
CN111428469B (en) * 2020-02-27 2023-06-16 宋继华 Interactive labeling method and system for sentence-oriented structure graphic analysis
CN112115313A (en) * 2020-09-08 2020-12-22 北京百度网讯科技有限公司 Regular expression generation method, regular expression data extraction method, regular expression generation device, regular expression data extraction device, regular expression equipment and regular expression data extraction medium
CN112115313B (en) * 2020-09-08 2023-07-28 北京百度网讯科技有限公司 Regular expression generation and data extraction methods, devices, equipment and media

Also Published As

Publication number Publication date
CN109783819B (en) 2023-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant