CN109783819A - Method and system for generating regular expressions

Info

Publication number: CN109783819A
Application number: CN201910046964.2A
Authority: CN
Inventors: 魏誉荧
Original assignee: Guangdong Genius Technology Co Ltd
Current assignee: Guangdong Genius Technology Co Ltd
Priority date: 2019-01-18
Filing date: 2019-01-18
Publication date: 2019-05-21
Anticipated expiration: 2039-01-18
Also published as: CN109783819B

Abstract

The invention belongs to the technical field of data processing and discloses a method and system for generating regular expressions. The method includes: obtaining current corpus information; performing syntactic analysis on the current corpus information and extracting its clause main body; obtaining the semantic slots of the words of the clause main body; and generating a regular expression from the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information. The invention generates regular expressions automatically from the sentence structure and the parts of speech of the words, so they no longer have to be written by hand from rules inferred from the meaning of the sentence, which saves labor cost and improves efficiency.

Description

Method and system for generating regular expressions

Technical field

The invention belongs to the technical field of data processing, and in particular relates to a method and system for generating regular expressions.

Background art

With the rapid development of network technology, a large amount of information data is generated every day and needs to be processed. Traditional regular expressions are generally written by hand, following the steps "check the corpus → identify the keywords in the corpus → write the dictionary → write the regular expression"; that is, they are written manually from rules inferred from the meaning of the sentence. The process is complicated, and checking the corpus and writing expressions by hand is inefficient. Relying entirely on manually written regular expressions makes it impossible to handle the large volume of newly added information data accurately and promptly each day, and writing regular expressions by hand also places high demands on staff.

Therefore, a method is urgently needed by which a system can automatically write the regular expression corresponding to a corpus from the corpus information.

Summary of the invention

The object of the present invention is to provide a method and system for generating regular expressions that generate regular expressions automatically, which saves labor cost and improves efficiency.

The technical solution provided by the invention is as follows:

In one aspect, a method for generating a regular expression is provided, comprising:

obtaining current corpus information;

performing syntactic analysis on the current corpus information, and extracting the clause main body of the current corpus information;

obtaining the semantic slots of the words of the clause main body;

generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information.

Further preferably, performing syntactic analysis on the current corpus information and extracting the clause main body of the current corpus information specifically includes:

segmenting the current corpus information to obtain the words in the current corpus information and their parts of speech;

performing clause analysis on the current corpus information according to grammar rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;

extracting the clause main body of the current corpus information according to the sentence structure.

Further preferably, generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information specifically includes:

replacing the words of the clause main body in the current corpus information with the corresponding semantic slots;

ordering the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression.

Ordering the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the grammatical structure, to generate at least one regular expression that differs in order but has the same meaning.

Further preferably, generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information further includes:

after the regular expression is generated, adding a conjunction to the generated regular expression to generate another regular expression with the same meaning.

In another aspect, a system for generating a regular expression is also provided, comprising:

a corpus information acquisition module, configured to obtain current corpus information;

a clause main body extraction module, configured to perform syntactic analysis on the current corpus information and extract the clause main body of the current corpus information;

a semantic slot acquisition module, configured to obtain the semantic slots of the words of the clause main body;

a regular expression generation module, configured to generate a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information.

Further preferably, the clause main body extraction module includes:

a word segmentation unit, configured to segment the current corpus information to obtain the words in the current corpus information and their parts of speech;

a clause analysis unit, configured to perform clause analysis on the current corpus information according to grammar rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;

a clause main body extraction unit, configured to extract the clause main body of the current corpus information according to the sentence structure.

Further preferably, the regular expression generation module includes:

a replacement unit, configured to replace the words of the clause main body in the current corpus information with the corresponding semantic slots;

a regular expression generation unit, configured to order the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression.

Further preferably, the regular expression generation module includes:

a regular expression generation unit, configured to order the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the grammatical structure, to generate at least one regular expression that differs in order but has the same meaning.

Further preferably, the regular expression generation unit is further configured to, after the regular expression is generated, add a conjunction to the generated regular expression to generate another regular expression with the same meaning.

Compared with the prior art, the method and system for generating regular expressions provided by the invention have the following beneficial effects:

1. After obtaining corpus information, the invention first performs clause analysis on it and extracts the clause main body, such as the subject, predicate and object. It then converts the words in the clause main body into the corresponding semantic slots, and finally generates a regular expression from those semantic slots and the remaining non-main-body portion of the corpus information. Because the regular expression is generated automatically from the sentence structure and the parts of speech of the words, it does not have to be written by hand from rules inferred from the meaning of the sentence, which saves labor cost and improves efficiency.

2. In a preferred embodiment, permuting and combining the items of the regular expression makes it possible to generate several regular expressions with the same meaning from a single piece of corpus information, which improves the efficiency of generating regular expressions.

Brief description of the drawings

The preferred embodiments are described below in a clear and easily understood manner with reference to the drawings, to further explain the above characteristics, technical features, advantages and implementation of the method and system for generating regular expressions.

Fig. 1 is a flow diagram of the first embodiment of the method for generating regular expressions of the present invention;

Fig. 2 is a flow diagram of the second embodiment of the method for generating regular expressions of the present invention;

Fig. 3 is a flow diagram of the third embodiment of the method for generating regular expressions of the present invention;

Fig. 4 is a flow diagram of the fourth embodiment of the method for generating regular expressions of the present invention;

Fig. 5 is a flow diagram of the fifth embodiment of the method for generating regular expressions of the present invention;

Fig. 6 is a flow diagram of the sixth embodiment of the method for generating regular expressions of the present invention;

Fig. 7 is a structural block diagram of the system for generating regular expressions of the present invention.

Description of reference numerals

100, corpus information acquisition module; 200, clause main body extraction module;

210, word segmentation unit; 220, clause analysis unit;

230, clause main body extraction unit; 300, semantic slot acquisition module;

400, regular expression generation module; 410, replacement unit;

420, regular expression generation unit.

Detailed description of the embodiments

To explain the embodiments of the invention or the technical solutions of the prior art more clearly, specific embodiments of the invention are described below with reference to the drawings. Obviously, the drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings, and other embodiments, from them without creative effort.

For simplicity, each figure shows only the parts relevant to the invention schematically; the figures do not represent the actual structure of a product. In addition, to keep the figures simple and easy to understand, where several components in a figure have the same structure or function, only one of them is drawn or labeled. Herein, "one" means not only "only one" but may also mean "more than one".

According to the first embodiment provided by the invention, as shown in Fig. 1, a method for generating a regular expression comprises:

S100: obtain current corpus information;

S200: perform syntactic analysis on the current corpus information and extract the clause main body of the current corpus information;

S300: obtain the semantic slots of the words of the clause main body;

S400: generate a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information.

Specifically, the invention obtains a large amount of corpus information and generates a large number of regular expressions from it. A regular expression is a character string used to describe or match a series of strings that conform to a certain syntactic rule. This embodiment illustrates the generation method using a single piece of corpus information.

The corpus information may be text information, such as a sentence typed by the user or a sentence from a book; it may also be voice information input by the user, recorded audio information, and the like. This embodiment takes the obtained current corpus information as an example.

After the current corpus information is obtained, syntactic analysis is performed on it and its clause main body is extracted, for example the subject, predicate, object and attributive of the current corpus information. For example, if the current corpus information is "why can whales spray water", the extracted clause main body is "whale spray water", in which "whale" is the subject and "spray water" is the predicate.

After the clause main body is extracted, the words of the clause main body are converted into the corresponding semantic slots according to their parts of speech. A semantic slot may be all words of the same part of speech as the word, or the words that have the same meaning as the word. For example, in the clause main body "whale spray water", "whale" is a noun and "spray water" is a verb, so the semantic slot corresponding to "whale" may be the noun library and the semantic slot corresponding to "spray water" may be the verb library.
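The slot lookup described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the table names, the `slot_for` helper and the synonym group are assumptions introduced here, and only the "noun library" and "verb library" slot names come from the example in the text.

```python
# Hypothetical part-of-speech -> slot table; only the two slot names are from the text.
POS_SLOTS = {"noun": "noun library", "verb": "verb library"}

# Hypothetical synonym group, for the "words with the same meaning" variant of a slot.
SYNONYM_SLOTS = {"spray water": "spout synonyms"}


def slot_for(word, pos, prefer_synonyms=False):
    """Return the semantic slot name for one word of the clause main body."""
    if prefer_synonyms and word in SYNONYM_SLOTS:
        return SYNONYM_SLOTS[word]
    if pos not in POS_SLOTS:
        raise KeyError(f"no semantic slot defined for part of speech {pos!r}")
    return POS_SLOTS[pos]


# Clause main body of "why can whales spray water", already tagged.
main_body = [("whale", "noun"), ("spray water", "verb")]
slots = [slot_for(word, pos) for word, pos in main_body]
print(slots)
```

The `prefer_synonyms` flag only shows that the two kinds of slot mentioned in the text (same part of speech, same meaning) can share one lookup; how a real system would choose between them is not stated in the source.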

Once the clause main body and the semantic slots corresponding to its words have been obtained, the regular expression corresponding to the current corpus information can be generated from the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information.

For example, the current corpus information is "why can whales spray water", the extracted clause main body is "whale spray water", the semantic slot corresponding to "whale" is the noun library, the semantic slot corresponding to "spray water" is the verb library, and the remaining non-main-body portion is "why" and "can". The regular expression generated from this information is "##noun library##[why][can]##verb library##".

After obtaining corpus information, the invention first performs clause analysis on it and extracts the clause main body, such as the subject, predicate and object. It then converts the words in the clause main body into the corresponding semantic slots, and finally generates a regular expression from those semantic slots and the remaining non-main-body portion of the corpus information. Because the regular expression is generated automatically from the sentence structure and the parts of speech of the words, it does not have to be written by hand from rules inferred from the meaning of the sentence, which saves labor cost and improves efficiency.
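The source does not say how a slot expression such as "##noun library##[why][can]##verb library##" is evaluated against text. One plausible reading, shown here purely as an assumption, is that each slot expands to an alternation over the words in its library and each bracketed item matches literally. The library contents and the `compile_template` helper are hypothetical, and the English glosses stand in for the original Chinese tokens (optional whitespace is allowed between items because Chinese text has no spaces).

```python
import re

# Hypothetical library contents; the source only names the libraries.
LIBRARIES = {
    "noun library": ["whale", "dolphin"],
    "verb library": ["spray water", "swim"],
}

TEMPLATE = "##noun library##[why][can]##verb library##"


def compile_template(template, libraries):
    """Expand a slot template into a compiled Python regular expression."""
    pieces = []
    for chunk in template.strip("#").split("##"):
        if chunk.startswith("["):
            # One or more bracketed literals, e.g. "[why][can]".
            for literal in re.findall(r"\[([^\]]*)\]", chunk):
                pieces.append(re.escape(literal))
        else:
            # A slot name: any word from the named library may appear here.
            words = libraries[chunk]
            pieces.append("(?:" + "|".join(re.escape(w) for w in words) + ")")
    return re.compile(r"\s*".join(pieces))


pattern = compile_template(TEMPLATE, LIBRARIES)
print(bool(pattern.fullmatch("dolphin why can swim")))
```

Under this reading a single template covers every sentence that shares the sentence structure, which is consistent with the goal of reducing hand-written expressions.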

According to the second embodiment provided by the invention, as shown in Fig. 2, a method for generating a regular expression comprises:

S100: obtain current corpus information;

S210: segment the current corpus information to obtain the words in the current corpus information and their parts of speech;

S220: perform clause analysis on the current corpus information according to grammar rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;

S230: extract the clause main body of the current corpus information according to the sentence structure;

S300: obtain the semantic slots of the words of the clause main body;

Specifically, the clause main body of the current corpus information in the first embodiment may be extracted as follows: first segment the current corpus information and obtain the parts of speech of its words; then obtain the sentence structure of the current corpus information according to grammar rules and the parts of speech of the words; finally extract the clause main body according to that sentence structure.

Segmenting the current corpus information means dividing it into individual characters or words. For example, "do not know what you are saying" is divided into its individual words, and "why can whales spray water" is divided into "whale", "why", "can" and "spray water".

After segmentation, the resulting words are analyzed to obtain their parts of speech. For example, segmenting "why can whales spray water" yields "whale" (noun), "why" (pronoun), "can" (auxiliary verb) and "spray water" (verb). Clause analysis is then performed on the current corpus information according to the grammar rules and the parts of speech of the words, which gives the sentence structure of "why can whales spray water" as "subject + adverbial + predicate". Finally, analyzing the current corpus information according to this sentence structure shows that "whale spray water" is a subject-predicate structure, "why spray water" is a verbal endocentric phrase, and "can spray water" is a verbal endocentric phrase. From this analysis, the main structure of "why can whales spray water" is the subject-predicate phrase "whale spray water", so the clause main body extracted from "why can whales spray water" is the subject-predicate phrase "whale spray water".
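Steps S210 to S230 can be sketched as below. This is a toy illustration under several assumptions: the lexicon, the forward-maximum-matching segmenter and the role rules are stand-ins chosen here, not the segmenter or grammar rules of the source, and the input is an English gloss in the original word order rather than Chinese text.

```python
# Hypothetical lexicon mapping each known word to its part of speech.
LEXICON = {
    "whale": "noun",
    "why": "pronoun",
    "can": "auxiliary",
    "spray water": "verb",
}


def segment(text, lexicon, max_len=2):
    """S210: forward maximum matching over whitespace-separated gloss units."""
    units = text.split()
    tokens, i = [], 0
    while i < len(units):
        for n in range(min(max_len, len(units) - i), 0, -1):
            candidate = " ".join(units[i:i + n])
            if candidate in lexicon:
                tokens.append((candidate, lexicon[candidate]))
                i += n
                break
        else:
            raise ValueError(f"unknown unit: {units[i]!r}")
    return tokens


def assign_roles(tokens):
    """S220: toy grammar rules mapping parts of speech to grammatical roles."""
    verb_index = next(i for i, (_, pos) in enumerate(tokens) if pos == "verb")
    roles = []
    for i, (_, pos) in enumerate(tokens):
        if pos == "verb":
            roles.append("predicate")
        elif pos == "noun":
            roles.append("subject" if i < verb_index else "object")
        else:
            roles.append("adverbial")
    return roles


def sentence_structure(roles):
    """Collapse consecutive identical roles into a structure label."""
    collapsed = [r for i, r in enumerate(roles) if i == 0 or r != roles[i - 1]]
    return " + ".join(collapsed)


tokens = segment("whale why can spray water", LEXICON)
roles = assign_roles(tokens)
structure = sentence_structure(roles)
# S230: the clause main body keeps subject, predicate and object words only.
main_body = [w for (w, _), r in zip(tokens, roles) if r in {"subject", "predicate", "object"}]
print(structure, main_body)
```

A production system would replace the lexicon lookup and the role rules with a real segmenter, tagger and parser; the sketch only shows how the three steps feed one another.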

According to the third embodiment provided by the invention, as shown in Fig. 3, a method for generating a regular expression comprises:

S100: obtain current corpus information;

S300: obtain the semantic slots of the words of the clause main body;

S410: replace the words of the clause main body in the current corpus information with the corresponding semantic slots;

S420: order the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression.

Specifically, after the clause main body of the current corpus information has been extracted by the method of the above embodiments and the semantic slots corresponding to the words of the clause main body have been obtained, the non-main-body portion of the segmented current corpus information is retained, the words of the clause main body in the current corpus information are replaced with the corresponding semantic slots, and finally the semantic slots and the non-main-body portion are ordered according to the sentence structure of the current corpus information itself, producing the regular expression corresponding to the current corpus information.

For example, the current corpus information is "why can whales spray water" and the clause main body is "whale spray water". The semantic slot corresponding to the word "whale" in the clause main body is the noun library, and the semantic slot corresponding to the word "spray water" is the verb library. The remaining non-main-body portion of the segmented current corpus information is "why" and "can". Ordering the non-main-body words "why" and "can" and the semantic slots "noun library" and "verb library" according to the sentence structure of the current corpus information gives "noun library", "why", "can", "verb library". These four are the items of the regular expression; adding regular-expression symbols between the items produces the regular expression corresponding to the current corpus information, "##noun library##[why][can]##verb library##".
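Steps S410 and S420 can be sketched as follows. The delimiter rule in `render` is inferred from the example expressions in this document (a leading and trailing "##", a "##" next to every slot, and no separator between two adjacent bracketed literals); it is an assumption, since the source does not define the notation formally. The function names and the pre-annotated roles are likewise hypothetical.

```python
MAIN_ROLES = {"subject", "predicate", "object"}
POS_TO_SLOT = {"noun": "noun library", "verb": "verb library"}


def to_items(tokens):
    """S410: swap main-body words for slots; keep the other words as literals."""
    items = []
    for word, pos, role in tokens:
        if role in MAIN_ROLES:
            items.append(("slot", POS_TO_SLOT[pos]))
        else:
            items.append(("literal", word))
    return items


def render(items):
    """S420: join the items, already in sentence order, with the inferred delimiters."""
    out = ["##"]
    prev_kind = None
    for kind, text in items:
        both_literals = prev_kind == "literal" and kind == "literal"
        if prev_kind is not None and not both_literals:
            out.append("##")
        out.append(f"[{text}]" if kind == "literal" else text)
        prev_kind = kind
    out.append("##")
    return "".join(out)


# Segmented and analyzed sentence, in its original word order.
tokens = [
    ("whale", "noun", "subject"),
    ("why", "pronoun", "adverbial"),
    ("can", "auxiliary", "adverbial"),
    ("spray water", "verb", "predicate"),
]
expression = render(to_items(tokens))
print(expression)
```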

According to the fourth embodiment provided by the invention, as shown in Fig. 4, a method for generating a regular expression comprises:

S100: obtain current corpus information;

S300: obtain the semantic slots of the words of the clause main body;

S430: order the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the grammatical structure, to generate at least one regular expression that differs in order but has the same meaning.

Specifically, this embodiment differs from the third embodiment as follows: after the clause main body of the current corpus information has been extracted by the method of the first or second embodiment and the semantic slots corresponding to the words of the clause main body have been obtained, the non-main-body portion of the segmented current corpus information is retained, the words of the clause main body in the current corpus information are replaced with the corresponding semantic slots, and finally the semantic slots and the non-main-body portion are ordered according to the grammatical structure to generate at least one regular expression that differs in order but has the same meaning.

For example, the current corpus information is "why can whales spray water" and the clause main body is "whale spray water". The semantic slot corresponding to the word "whale" in the clause main body is the noun library, and the semantic slot corresponding to the word "spray water" is the verb library. The remaining non-main-body portion of the segmented current corpus information is "why" and "can". Provided the meaning stays the same, ordering the non-main-body words "why" and "can" and the semantic slots "noun library" and "verb library" according to the grammatical structure can give either "noun library", "why", "can", "verb library" or "why", "noun library", "can", "verb library".

The order "noun library", "why", "can", "verb library" gives the regular expression "##noun library##[why][can]##verb library##". The order "why", "noun library", "can", "verb library" gives the regular expression "##[why]##noun library##[can]##verb library##". By permuting and combining the items of the regular expression, this embodiment can generate several regular expressions with the same meaning from a single piece of corpus information, which improves the efficiency of generating regular expressions.
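The permutation step can be sketched as a filter over all orderings of the items. The two constraints in `keeps_meaning` are hypothetical stand-ins for the grammatical-structure check, chosen only so that the sketch reproduces the two orderings given in the example; the source does not state which rules decide that an ordering keeps the meaning.

```python
from itertools import permutations

# Items of the expression for "why can whales spray water".
ITEMS = ("noun library", "why", "can", "verb library")


def keeps_meaning(order):
    """Hypothetical constraints: predicate last, auxiliary directly before it."""
    predicate_last = order[-1] == "verb library"
    aux_before_predicate = order.index("can") + 1 == order.index("verb library")
    return predicate_last and aux_before_predicate


orderings = [order for order in permutations(ITEMS) if keeps_meaning(order)]
for order in orderings:
    print(order)
```

Each surviving ordering would then be rendered into its own expression, so one corpus sentence yields several expressions with the same meaning.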

According to the fifth embodiment provided by the invention, as shown in Fig. 5, a method for generating a regular expression comprises:

S100: obtain current corpus information;

S300: obtain the semantic slots of the words of the clause main body;

S420: order the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression;

S440: after the regular expression is generated, add a conjunction to the generated regular expression to generate another regular expression with the same meaning.

Specifically, Chinese grammar also contains sentences, such as active and passive sentences, whose form differs but whose meaning is the same. To take such cases fully into account, a conjunction (for example, "by") is added to the generated regular expression without changing the intent of the current corpus information, and the items of the generated regular expression are then recombined and reordered to generate another regular expression with the same meaning.

For example, the current corpus information is "give teacher dictionary": "give" is a verb whose semantic slot is the verb library; "teacher" is a noun whose semantic slot is the noun library; "dictionary" is a noun whose semantic slot is the noun library; and the generated regular expression is "##verb library##noun library##noun library##". After a conjunction is added to the current corpus information "give teacher dictionary", it becomes "dictionary give teacher". Accordingly, adding the conjunction to the generated regular expression "##verb library##noun library##noun library##" yields another regular expression with the same meaning, "##noun library##verb library##noun library##".
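The reordering in this example can be sketched as below. The sketch assumes that adding the conjunction has the effect of moving the last item (the direct object) to the front, which is what the example shows; the example expression contains only slots and no literal for the conjunction itself, so the sketch does the same. The helper names are hypothetical.

```python
def render_slots(slots):
    """Join slot names with the "##" delimiter used in the examples."""
    return "##" + "##".join(slots) + "##"


def front_object(items):
    """Move the last item (the direct object) to the front of the sentence."""
    *rest, obj = items
    return [obj, *rest]


# (word, semantic slot) pairs for "give teacher dictionary", in sentence order.
items = [
    ("give", "verb library"),
    ("teacher", "noun library"),
    ("dictionary", "noun library"),
]

original = render_slots([slot for _, slot in items])
variant_items = front_object(items)
variant = render_slots([slot for _, slot in variant_items])
print(original)
print(variant)
```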

According to the sixth embodiment provided by the invention, as shown in Fig. 6, a method for generating a regular expression comprises:

S100: obtain current corpus information;

S300: obtain the semantic slots of the words of the clause main body;

S430: order the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the grammatical structure, to generate at least one regular expression that differs in order but has the same meaning;

According to the seventh embodiment provided by the invention, as shown in Fig. 7, a system for generating a regular expression comprises:

a corpus information acquisition module 100, configured to obtain current corpus information;

a clause main body extraction module 200, configured to perform syntactic analysis on the current corpus information and extract the clause main body of the current corpus information;

a semantic slot acquisition module 300, configured to obtain the semantic slots of the words of the clause main body;

a regular expression generation module 400, configured to generate a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information.

Preferably, the clause main body extraction module 200 includes:

a word segmentation unit 210, configured to segment the current corpus information to obtain the words in the current corpus information and their parts of speech;

a clause analysis unit 220, configured to perform clause analysis on the current corpus information according to grammar rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;

a clause main body extraction unit 230, configured to extract the clause main body of the current corpus information according to the sentence structure.

Preferably, the regular expression generation module 400 includes:

a replacement unit 410, configured to replace the words of the clause main body in the current corpus information with the corresponding semantic slots;

a regular expression generation unit 420, configured to order the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression.

Preferably, the regular expression generation unit 420 is further configured to, after the regular expression is generated, add a conjunction to the generated regular expression to generate another regular expression with the same meaning.

According to the eighth embodiment provided by the invention, as shown in Fig. 7, a system for generating a regular expression comprises:

The clause main body extraction module 200 includes:

The regular expression generation module 400 includes:

a regular expression generation unit 420, configured to order the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the grammatical structure, to generate at least one regular expression that differs in order but has the same meaning.

Specifically, after the current corpus information is obtained, it is first segmented to obtain the parts of speech of its words; the sentence structure of the current corpus information is then obtained according to grammar rules and the parts of speech of the words; finally, the clause main body of the current corpus information is extracted according to that sentence structure.

After the clause main body of the current corpus information has been extracted and the semantic slots corresponding to the words of the clause main body have been obtained, the non-main-body portion of the segmented current corpus information is retained, the words of the clause main body in the current corpus information are replaced with the corresponding semantic slots, and finally the semantic slots and the non-main-body portion are ordered according to the grammatical structure to generate at least one regular expression that differs in order but has the same meaning.
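How modules 100 to 400 hand data to one another can be sketched as a small class with pluggable callables. The class name, field names and the stub callables in the usage example are assumptions made for illustration; the source describes the modules only functionally. For brevity, the acquisition stub returns text that is already segmented and tagged, and the generation step returns the ordered items rather than the final expression string.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Tagged = Tuple[str, str]  # (word, part of speech)
Item = Tuple[str, str]    # ("slot" or "literal", text)


@dataclass
class RegexGenerationSystem:
    acquire: Callable[[], List[Tagged]]                      # module 100
    extract_main_body: Callable[[List[Tagged]], List[str]]   # module 200
    get_slot: Callable[[str, str], str]                      # module 300

    def generate(self) -> List[Item]:                        # module 400
        tokens = self.acquire()
        main_body = set(self.extract_main_body(tokens))
        items: List[Item] = []
        for word, pos in tokens:
            if word in main_body:
                # Replacement unit 410: main-body word -> semantic slot.
                items.append(("slot", self.get_slot(word, pos)))
            else:
                # Non-main-body words are kept as literals, in sentence order.
                items.append(("literal", word))
        return items


# Stub callables, hypothetical and deliberately simplistic.
system = RegexGenerationSystem(
    acquire=lambda: [("whale", "noun"), ("why", "pronoun"),
                     ("can", "auxiliary"), ("spray water", "verb")],
    extract_main_body=lambda tokens: [w for w, pos in tokens if pos in ("noun", "verb")],
    get_slot=lambda word, pos: f"{pos} library",
)
items = system.generate()
print(items)
```

Keeping each module behind a callable mirrors the statement that the embodiments can be freely combined: a different ordering or conjunction strategy can be swapped in without touching the other modules.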

It should be noted that the above embodiments can be freely combined as needed. The above are only preferred embodiments of the invention. It should be pointed out that a person of ordinary skill in the art can make several improvements and refinements without departing from the principle of the invention, and these improvements and refinements should also be regarded as falling within the scope of protection of the invention.

Claims

1. A method for generating a regular expression, characterized by comprising:

obtaining current corpus information;

obtaining the semantic slots of the words of the clause main body;

generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information.

2. The method for generating a regular expression according to claim 1, characterized in that performing syntactic analysis on the current corpus information and extracting the clause main body of the current corpus information specifically includes:

performing clause analysis on the current corpus information according to grammar rules and the parts of speech of the words in the current corpus information, to obtain the corresponding sentence structure;

3. The method for generating a regular expression according to claim 2, characterized in that generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information specifically includes:

ordering the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression.

4. The method for generating a regular expression according to claim 2, characterized in that generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information specifically includes:

ordering the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the grammatical structure, to generate at least one regular expression that differs in order but has the same meaning.

5. The method for generating a regular expression according to claim 3 or 4, characterized in that generating a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information further includes:

after the regular expression is generated, adding a conjunction to the generated regular expression to generate another regular expression with the same meaning.

6. A system for generating a regular expression, characterized by comprising:

a corpus information acquisition module, configured to obtain current corpus information;

a clause main body extraction module, configured to perform syntactic analysis on the current corpus information and extract the clause main body of the current corpus information;

a regular expression generation module, configured to generate a regular expression according to the clause main body, the semantic slots, and the remaining non-main-body portion of the current corpus information.

7. The system for generating a regular expression according to claim 6, characterized in that the clause main body extraction module includes:

a word segmentation unit, configured to segment the current corpus information to obtain the words in the current corpus information and their parts of speech;

8. The system for generating a regular expression according to claim 7, characterized in that the regular expression generation module includes:

a regular expression generation unit, configured to order the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, to generate a regular expression.

9. The system for generating a regular expression according to claim 7, characterized in that the regular expression generation module includes:

a regular expression generation unit, configured to order the remaining non-main-body portion of the segmented current corpus information and the semantic slots according to the grammatical structure, to generate at least one regular expression that differs in order but has the same meaning.

10. The system for generating a regular expression according to claim 8 or 9, characterized in that

the regular expression generation unit is further configured to, after the regular expression is generated, add a conjunction to the generated regular expression to generate another regular expression with the same meaning.