CN111159497A

CN111159497A - Regular expression generation method and regular expression-based data extraction method

Info

Publication number: CN111159497A
Application number: CN201911417417.7A
Authority: CN
Inventors: 孙洪亮; 张勇
Original assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Current assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-05-15
Anticipated expiration: 2039-12-31
Also published as: CN111159497B

Abstract

The invention provides a regular expression generation method and a regular expression-based data extraction method. The regular expression generation method comprises the following steps: determining the fields to be extracted and the fields not to be extracted in an original data character string; performing wildcard filtering on each field not to be extracted to obtain its regular expression, and, for each field to be extracted, traversing the character string and mapping characters to expressions one by one to obtain its regular expression; and combining the regular expressions of the fields to be extracted and of the fields not to be extracted in the order in which those fields appear in the original data character string to obtain the regular expression of the original data. The method reduces the difficulty of writing regular expressions and improves data extraction efficiency.

Description

Regular expression generation method and regular expression-based data extraction method

Technical Field

The invention relates to the technical field of data processing, in particular to a regular expression generation method and a regular expression-based data extraction method.

Background

In the existing data extraction technology, extraction by using a regular expression is a common way. In the implementation process of data extraction, a writer needs to write a regular expression manually according to a data sample.

However, regular expressions have obscure semantics, so the writer needs highly specialized skills, and once written the expression must still be verified against new data. Even a practitioner proficient in regular expressions can take tens of minutes or even hours to complete this process. Writing regular expressions therefore costs the writer considerable time and effort, which in turn makes extracting data from raw data with regular expressions inefficient.

In the related art, regular expressions must be written manually, which is difficult and time-consuming and in turn reduces data processing efficiency; no effective solution to this problem has yet been proposed.

Disclosure of Invention

The invention aims to provide a regular expression generation method, a regular expression-based data extraction device, computer equipment and a storage medium, which are used for solving the problems that in the prior art, a regular expression needs to be manually written, the difficulty is high, the time is long, and the data processing efficiency is further influenced.

According to an aspect of the embodiments of the present invention, to achieve the above object, the present invention provides a method for generating a regular expression.

The regular expression generation method comprises the following steps: determining the fields to be extracted and the fields not to be extracted in the original data character string; performing wildcard filtering on each field not to be extracted to obtain its regular expression, and, for each field to be extracted, traversing the character string and mapping characters to expressions one by one to obtain its regular expression; and combining the regular expressions of the fields to be extracted and of the fields not to be extracted in the order in which those fields appear in the original data character string to obtain the regular expression of the original data.

Further, performing wildcard filtering on the fields not to be extracted to obtain their regular expressions, and traversing the character string for the fields to be extracted to map characters to expressions one by one to obtain their regular expressions, includes: judging, according to the determined fields to be extracted, whether the original data character string belongs to a special sample; if the original data character string does not belong to a special sample, for each field not to be extracted, acquiring the character immediately preceding the field to be extracted that adjoins it at the rear as the separation character T of that field, and performing wildcard filtering to obtain a regular expression of the form [^T] followed by T; and, for each field to be extracted, traversing the character string and mapping characters to expressions one by one to obtain its regular expression, wherein the digits 0 to 9 correspond to the expression \d, word characters correspond to \w, and the escape character \ or \\ is added in front of special characters.

Further, judging whether the original data character string belongs to a special sample according to the determined fields to be extracted includes: judging whether the characters before and after the field to be extracted are both special characters, and if not, judging that the string belongs to a special sample; and/or judging whether the characters before and after the field to be extracted are adjacent, and if not, judging that the string belongs to a special sample.

Further, when the characters before and after the field to be extracted are not both special characters, the method comprises: taking the last character of the first field not to be extracted as the starting position and searching towards the front character by character, taking the first special character found as the separation character T, and performing wildcard filtering on the part of the first field not to be extracted that precedes this special character to obtain a regular expression of the form [^T] followed by T; counting the number n of non-special characters between this special character and the first field to be extracted that adjoins the first field not to be extracted at the rear, and generating the corresponding regular expression \w{n}; for a field not to be extracted located between two fields to be extracted, taking its first character as the starting position and searching towards the rear character by character, taking the first special character found as the separation character T, and performing wildcard filtering on the part of this field that follows the special character to obtain a regular expression of the form [^T] followed by T; and counting the number m of non-special characters between this special character and the field to be extracted that adjoins the field at the front, and generating the corresponding regular expression \w{m}.

Further, when the characters before and after the fields to be extracted are adjacent, the method comprises: judging whether the junction of two adjacent fields to be extracted is a non-special character; if so, generating the regular expression of the rear one of the two adjacent fields to be extracted by exact matching, generating the regular expression of the front one by traversing the character string and mapping characters to expressions one by one, and obtaining the regular expressions of the fields not to be extracted by wildcard filtering; if not, generating the regular expressions of the fields to be extracted by traversing the character string and mapping characters to expressions one by one, and obtaining the regular expressions of the fields not to be extracted by wildcard filtering.

According to an aspect of an embodiment of the present invention, to achieve the above object, the present invention provides a regular expression-based data extraction method, including: acquiring the original data from which data is to be extracted; analyzing the original data to generate a corresponding regular expression; and extracting data from the original data according to the generated regular expression; wherein the corresponding regular expression is generated by the regular expression generation method described above.

According to an aspect of the embodiments of the present invention, to achieve the above object, the present invention provides a regular expression generation apparatus.

The regular expression generation device comprises: a determining module, used for determining the fields to be extracted and the fields not to be extracted in the original data character string; a first generation control module, used for performing wildcard filtering on the fields not to be extracted to obtain their regular expressions, and for traversing the character string for the fields to be extracted and mapping characters to expressions one by one to obtain their regular expressions; and a second generation control module, used for combining the regular expressions of the fields to be extracted and of the fields not to be extracted in the order in which those fields appear in the original data character string to obtain the regular expression of the original data.

Further, the first generation control module includes: a judging unit, used for judging, according to the determined fields to be extracted, whether the original data character string belongs to a special sample; a first generation control unit, used for, when the judging unit determines that the original data character string does not belong to a special sample, acquiring for each field not to be extracted the character immediately preceding the field to be extracted that adjoins it at the rear as the separation character T of that field, and performing wildcard filtering to obtain a regular expression of the form [^T] followed by T; and a second generation control unit, used for, when the judging unit determines that the original data character string does not belong to a special sample, traversing the character string and mapping characters to expressions one by one to obtain the regular expression of each field to be extracted, wherein the digits 0 to 9 correspond to the expression \d, word characters correspond to \w, and the escape character \ or \\ is added in front of special characters.

Further, the judging unit includes: a first judging subunit, used for judging whether the characters before and after the field to be extracted are both special characters, and if not, judging that the string belongs to a special sample; and/or a second judging subunit, used for judging whether the characters before and after the field to be extracted are adjacent, and if not, judging that the string belongs to a special sample.

Further, when the characters before and after the field to be extracted are not both special characters, the device comprises: a third generation control unit, used for, when the first judging subunit determines that the characters before and after the field to be extracted are not both special characters, taking the last character of the first field not to be extracted as the starting position and searching towards the front character by character, taking the first special character found as the separation character T, and performing wildcard filtering on the part of the first field not to be extracted that precedes this special character to obtain a regular expression of the form [^T] followed by T; a fourth generation control unit, used for, under the same judgment result, counting the number n of non-special characters between this special character and the first field to be extracted that adjoins the first field not to be extracted at the rear, and generating the corresponding regular expression \w{n}; a fifth generation control unit, used for, under the same judgment result and for a field not to be extracted located between two fields to be extracted, taking the first character of that field as the starting position and searching towards the rear character by character, taking the first special character found as the separation character T, and performing wildcard filtering on the part of the field that follows this special character to obtain a regular expression of the form [^T] followed by T; and a sixth generation control unit, used for, under the same judgment result, counting the number m of non-special characters between this special character and the field to be extracted that adjoins the field at the front, and generating the corresponding regular expression \w{m}.

Further, when the characters before and after the fields to be extracted are adjacent, the device comprises: a third judging subunit, used for judging whether the junction of two adjacent fields to be extracted is a non-special character when the second judging subunit determines that the characters before and after the fields to be extracted are adjacent; a seventh generation control unit, used for, when the judgment result of the third judging subunit is yes, controlling the rear one of the two adjacent fields to be extracted to generate its regular expression by exact matching, the front one to generate its regular expression by traversing the character string and mapping characters to expressions one by one, and the fields not to be extracted to obtain their regular expressions by wildcard filtering; and an eighth generation control unit, used for, when the judgment result of the third judging subunit is no, controlling the fields to be extracted to generate their regular expressions by traversing the character string and mapping characters to expressions one by one, and the fields not to be extracted to obtain their regular expressions by wildcard filtering.

According to an aspect of the embodiments of the present invention, to achieve the above object, the present invention provides a data extraction apparatus based on regular expressions.

The regular expression-based data extraction device comprises: an acquisition module, used for acquiring the original data from which data is to be extracted; a third generation module, used for analyzing the original data and generating a corresponding regular expression; and a data extraction module, used for extracting data from the original data according to the generated regular expression; wherein the third generation module generates the regular expression by means of the regular expression generation device described above.

According to an aspect of the embodiments of the present invention, to achieve the above object, the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.

According to an aspect of embodiments of the present invention, to achieve the above object, the present invention further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method.

According to the regular expression generation method, the regular expression-based data extraction device, the computer equipment and the storage medium, when the regular expression generation method is realized, after a user provides a section of original data and determines a target field to be extracted, the regular expression is automatically generated through analysis of the original data and extraction of the field to be extracted, the user does not need to be familiar with relevant rules of the regular expression to obtain the required regular expression, and the writing difficulty of the regular expression is reduced; meanwhile, after the regular expression is obtained, the log files of the same type are extracted on the basis of the regular expression, and the data extraction efficiency can be greatly improved.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

fig. 1 is an optional flowchart of a method for generating a regular expression according to an embodiment of the present invention;

fig. 2 is a schematic diagram of a sample of original data in a regular expression generation method according to an embodiment of the present invention;

fig. 3 is an optional flowchart of the regular expression-based data extraction method according to the second embodiment of the present invention;

fig. 4 is an optional structural block diagram of a regular expression generation apparatus according to a third embodiment of the present invention;

fig. 5 is an optional block diagram of the regular expression-based data extraction apparatus according to the fourth embodiment of the present invention;

fig. 6 is a hardware configuration diagram of a computer device according to a fifth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In existing data extraction technology, extraction by regular expression is the most common approach: a writer composes a regular expression from a data sample. However, regular expressions have obscure semantics and demand highly specialized skills, and once written the expression must be verified against new data. Even a worker proficient in regular expressions needs at least 30 minutes to complete this process, so extracting data from original data with regular expressions is inefficient. In the embodiments of the invention, the regular expression is generated automatically by analyzing the original data and marking out the fields to be extracted, achieving a what-you-see-is-what-you-get data extraction effect, reducing the difficulty of writing regular expressions and improving data extraction efficiency. Verification shows that a regular expression can be generated automatically for more than 90% of logs. For logs of ordinary complexity the user does not need to be familiar with regular expressions: the expression is generated automatically from the marked-out fields to be extracted in the original log and is automatically checked against a new log, and the whole process takes only a few minutes.

Specific embodiments of the regular expression generation method, the regular expression-based data extraction device, the computer device, and the storage medium provided by the present invention will be described in detail below.

Example one

The embodiment of the invention provides a regular expression generation method, which can be applied to the field of data processing, such as data extraction, data matching and filtering, and the like. Specifically, fig. 1 is a flowchart of a method for generating a regular expression according to an embodiment of the present invention, and as shown in fig. 1, the method for generating a regular expression according to the embodiment includes steps S101 to S103 as follows.

Step S101: determining fields to be extracted and fields not to be extracted in the original data character string;

In implementation, the user first provides, as an example, the original data for which a regular expression is to be generated, and then selects the target content to be extracted, which corresponds to the fields to be extracted. The user may select several pieces of target content, in which case the method generates a regular expression that extracts all of them. Once the fields to be extracted are determined, they divide the provided original data into several parts, each remaining part being a field not to be extracted; there may therefore be several fields not to be extracted. For example, in the original data sample of fig. 2 the selected target contents are "770", "info" and "eth1". These three fields to be extracted divide the original data sample into fields 1 to 7, comprising three fields to be extracted (fields 2, 4 and 6 in the figure) and four fields not to be extracted (fields 1, 3, 5 and 7 in the figure).
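The division performed in step S101 can be sketched as follows. This is only an illustration: the function name `split_fields`, the offset-based representation of the user's selections, and the log line (which stands in for the sample of fig. 2, not reproduced here) are all assumptions rather than part of the described method.

```python
def split_fields(raw, spans):
    """Split raw into (text, is_target) pieces.

    spans: sorted, non-overlapping (start, end) offsets of the user's selections.
    """
    fields, pos = [], 0
    for start, end in spans:
        if start > pos:                      # text between selections is not extracted
            fields.append((raw[pos:start], False))
        fields.append((raw[start:end], True))
        pos = end
    if pos < len(raw):                       # trailing remainder is not extracted
        fields.append((raw[pos:], False))
    return fields


# Hypothetical log line standing in for the sample of fig. 2.
sample = "kernel[770]: <info> device (eth1): link up"
spans = [(sample.index(s), sample.index(s) + len(s)) for s in ("770", "info", "eth1")]
fields = split_fields(sample, spans)
for number, (text, is_target) in enumerate(fields, start=1):
    print(number, repr(text), "extract" if is_target else "skip")
```

With three selections that neither touch each other nor sit at the ends of the string, the sketch yields seven alternating fields, matching the 1 to 7 numbering used in the text.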

Step S102: performing wildcard filtering on the fields not to be extracted to obtain their regular expressions, and, for the fields to be extracted, traversing the character string and mapping characters to expressions one by one to obtain their regular expressions;

In this embodiment, two basic strategies for generating regular expressions are provided, a segmentation strategy and a traversal strategy, which are described as follows:

Segmentation strategy: taking a special character T (such as a space, '$', '\\' and the like) as the separator, the characters in the character string are divided into two categories, [^T] and T, and the generated expression has a form such as "[^\s]+\s" (\s represents spaces, tabs and the like).
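A minimal sketch of this strategy, assuming Python's `re` module; the helper name `segment_expr` is hypothetical, and `re.escape` is used here in place of whatever escaping routine an actual implementation applies.

```python
import re

def segment_expr(sep):
    # Build the "[^T]*T" form for a field that ends with separator T.
    t = re.escape(sep)                 # escape T so that it is taken literally
    return "(?:[^%s]*%s)" % (t, t)


line = 'May 30 14:20:58 localhost dockerd:time="2018-05-30T14:20:58.069595211+08:00"'
m = re.match(segment_expr('"'), line)
rest = line[m.end():]                  # what follows the skipped field
print(rest)
```

Matching only has to compare each character against T, so everything up to and including the first double quote is skipped in one pass and `rest` begins at the timestamp.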

Traversal strategy: the character string is traversed and each character is mapped to its corresponding expression. For example, the digit characters '0-9' correspond to the expression '\d', word characters correspond to the expression '\w', and special characters are preceded by the escape character '\' or '\\'; the finally generated expression has a form such as "\d\s\w\d\d".
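The character-by-character mapping can be sketched as below. The function name `traverse_expr` is hypothetical, and mapping whitespace to `\s` is an assumption suggested by the example expression rather than a rule stated in the text.

```python
import re

def traverse_expr(text):
    # Map every character to one expression token, in order.
    tokens = []
    for ch in text:
        if "0" <= ch <= "9":
            tokens.append(r"\d")           # digits 0-9
        elif re.fullmatch(r"\w", ch):
            tokens.append(r"\w")           # remaining word characters
        elif ch.isspace():
            tokens.append(r"\s")           # assumption: whitespace maps to \s
        else:
            tokens.append("\\" + ch)       # special character, escaped
    return "".join(tokens)


expr = traverse_expr("7 x-1")
print(expr)
```

Because the mapping is one to one, the resulting expression matches any string with the same character classes at the same positions, for example "3 y-9".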

It should be noted here that a regular expression is a logical formula that operates on character strings. In the current state of the art, the characters of a string can be divided into ordinary characters (for example, the letters a to z) and special characters, which are also called metacharacters; this is a basic concept well known to those skilled in the art.

Specifically, the segmentation and traversal strategies are illustrated with the following log sample to provide a better understanding of the embodiments of the invention.

May 30 14:20:58 localhost dockerd:time="2018-05-30T14:20:58.069595211+08:00"

It should be noted here that, for the sake of distinction, the applicant marks an underline below the field to be extracted in the above example, and the underline is merely used to help the reader distinguish the field to be extracted from the field not to be extracted, and the log raw data itself is not underlined.

One function of the segmentation strategy in the embodiment of the present invention is to locate the field to be extracted, and it is mostly used to generate the expressions of the fields not to be extracted. In the above sample, the time field (the underlined field) needs to be extracted, and the expression of the field not to be extracted generated by the segmentation strategy is as follows:

(?:[^\"]*\")

The advantage is that the generated expression is simple and matching is efficient: during matching each character only needs to be compared with the special character \".

The traversal strategy is used to generate the expression of the field to be extracted. For the above sample, it generates:

(?<Time>\d+\-\d+\-\w+\:\d+\:\d+\.\d+\+\d+\:\d+)

The complete expression finally generated with the segmentation and traversal strategies is as follows:

^(?:[^\"]*\")(?<Time>\d+\-\d+\-\w+\:\d+\:\d+\.\d+\+\d+\:\d+)
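The complete expression can be checked against the log sample with a short script. Two points are assumptions of this sketch: Python's `re` module writes named groups as `(?P<Time>...)` rather than `(?<Time>...)`, and the token after the second hyphen is taken as `\w+` so that the run "30T14" is covered in full.

```python
import re

log = 'May 30 14:20:58 localhost dockerd:time="2018-05-30T14:20:58.069595211+08:00"'

# Segmentation part skips everything up to the first double quote;
# traversal part describes the timestamp field to be extracted.
pattern = re.compile(
    r'^(?:[^\"]*\")'
    r'(?P<Time>\d+\-\d+\-\w+\:\d+\:\d+\.\d+\+\d+\:\d+)'
)

m = pattern.search(log)
print(m.group("Time"))  # 2018-05-30T14:20:58.069595211+08:00
```

The same compiled pattern can then be applied to further log lines of the same type.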

step S103: and combining the regular expressions of the fields to be extracted and the regular expressions of the fields not to be extracted according to the arrangement sequence of the fields to be extracted and the fields not to be extracted in the original data character string to obtain the regular expressions of the original data.

The original data is divided into a plurality of fields; after an expression has been generated for each field by the corresponding method, the expressions are assembled in the order in which the fields appear in the original data, finally yielding the regular expression of the original data sample. In the above example, the expression (?:[^\"]*\") of the field not to be extracted and the expression of the time field are concatenated in that order, with the anchor ^ at the front.

The following description is made of a method for generating a regular expression in a case where a plurality of fields to be extracted are provided, so as to better understand the embodiment of the present invention:

Taking the data provided in fig. 2 as the original data sample, "770", "info" and "eth1" are the three fields to be extracted and are named type1, type2 and type3 respectively. Using the segmentation and traversal strategies described above, the character immediately preceding each field to be extracted is first taken as the special character that delimits the preceding field not to be extracted, and the fields to be extracted are then generated by traversal. The specific steps for automatically generating the regular expression are:

1. The whole character string is divided into seven fields, 1 to 7, and a regular expression is generated for each.

2. A regular expression is generated for field 1. The character immediately preceding field 2, which is to be extracted, is '[', so field 1 is delimited with this character as the special character. The resulting expression matches a run of characters other than the left square bracket, ending with a left square bracket: (?:[^\[]*\[). The expression occurs once: (?:[^\[]*\[){1} (the {1} is omitted below).

3. Field 2 is a field to be extracted, so all of its characters are examined directly by traversal, giving the expression: \d+.

4. For field 3, it is preferably first judged whether its beginning character equals the separation character of the subsequent field to be extracted; if so, an expression segment for that character is added. That is not the case here, so the expression is generated in the same way as for field 1: (?:[^\<]+\<).

5. Field 4 is generated in the same way as field 2: \w+.

6. Likewise, field 5 is generated by the method used for field 1: (?:[^\(]+\().

7. Field 6 generates the regular expression: \w+.

8. Preferably, since field 7 is not a field to be extracted, lies at the end, and is separated from field 6 by a special character, no regular expression needs to be generated for it.

9. The expression finally obtained is:

^(?:[^\[]*\[)(?<type1>\d+)(?:[^\<]+\<)(?<type2>\w+)(?:[^\(]+\()(?<type3>\w+)
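The nine steps can be sketched end to end as follows. This is a simplified illustration for regular samples only: the names `traverse_merged` and `generate`, the log lines, the merging of character runs into `\d+` and `\w+`, the use of `*` for every skipped field, and the `{k}` repetition count are assumptions of the sketch, and Python's `(?P<name>...)` replaces `(?<name>...)`.

```python
import re

def traverse_merged(text):
    # Digit runs become \d+, other word runs \w+, whitespace runs \s+,
    # and every remaining (special) character is escaped with a backslash.
    parts = []
    for run in re.findall(r"\w+|\s+|.", text, flags=re.S):
        if re.fullmatch(r"[0-9]+", run):
            parts.append(r"\d+")
        elif re.fullmatch(r"\w+", run):
            parts.append(r"\w+")
        elif run.isspace():
            parts.append(r"\s+")
        else:
            parts.append("\\" + run)
    return "".join(parts)


def generate(raw, targets):
    # targets: sorted (name, start, end) offsets; every target must be
    # preceded by a field not to be extracted (regular samples only).
    parts, pos = ["^"], 0
    for name, start, end in targets:
        if start <= pos:
            raise ValueError("target has no preceding field not to be extracted")
        sep = raw[start - 1]                  # separation character T
        t = re.escape(sep)
        k = raw[pos:start].count(sep)         # occurrences of T in the skipped field
        parts.append("(?:[^%s]*%s)%s" % (t, t, "" if k == 1 else "{%d}" % k))
        parts.append("(?P<%s>%s)" % (name, traverse_merged(raw[start:end])))
        pos = end
    return "".join(parts)                     # trailing field needs no expression


# Hypothetical log lines standing in for the sample of fig. 2.
sample = "kernel[770]: <info> device (eth1): link up"
targets = [(name, sample.index(s), sample.index(s) + len(s))
           for name, s in (("type1", "770"), ("type2", "info"), ("type3", "eth1"))]
pattern = generate(sample, targets)
other = "kernel[1234]: <warn> device (wlan0): link down"
result = re.search(pattern, other).groupdict()
print(pattern)
print(result)
```

The expression generated from one sample is applied unchanged to a second line of the same type, which is the reuse the embodiment relies on for efficient extraction.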

according to the embodiment provided by the invention, the regular expression is automatically generated by analyzing the original data and dividing the field to be extracted, so that a user can obtain the required regular expression without being familiar with the relevant rules of the regular expression, and the writing difficulty of the regular expression is reduced; meanwhile, after the regular expression is obtained, the log files of the same type are extracted on the basis of the regular expression, and the data extraction efficiency can be greatly improved.

Considering that the original data may be irregular, and in order to improve the generality and compatibility of the generated regular expressions, this embodiment further optimizes the above scheme. Specifically, before the regular expression is generated, a judgment step is added to determine whether the provided original data sample is an irregular special case; if not, it is processed as described above, and if so, the generation logic corresponding to that special case is invoked. The details are as follows:

First, judgment logic is provided for determining whether an original data character string belongs to a special sample, comprising:

judging whether the characters before and after the field to be extracted are both special characters, and if not, judging that the string belongs to a special sample; and/or judging whether the characters before and after the field to be extracted are adjacent, and if not, judging that the string belongs to a special sample. This judgment logic determines directly whether a sample is a special sample; if a special case exists, the generation logic corresponding to it is invoked, which improves the compatibility of the generated regular expression.
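One possible reading of this judgment can be sketched as follows. The sketch assumes that a special character is any character that is not a word character, that the start and end of the string count as acceptable boundaries, and that a sample is treated as special when a neighbouring character is not special or when two fields to be extracted touch each other, which is how the processing scenarios are framed. The function names and the input strings are hypothetical.

```python
import re

def is_special(ch):
    # Assumption: a special character is anything that is not a word character.
    return re.fullmatch(r"\w", ch) is None


def is_special_sample(raw, spans):
    # spans: sorted, non-overlapping (start, end) offsets of the fields to be extracted.
    for i, (start, end) in enumerate(spans):
        before_ok = start == 0 or is_special(raw[start - 1])
        after_ok = end == len(raw) or is_special(raw[end])
        if not (before_ok and after_ok):
            return True                   # a neighbouring character is not special
        if i + 1 < len(spans) and spans[i + 1][0] == end:
            return True                   # two fields to be extracted are adjacent
    return False


regular = 'time="2018"'
irregular = 'session_id=e9b2a9f18ffaa150e167d41922e59756card="G1/1"'
start = irregular.index("167d41922e")
print(is_special_sample(regular, [(6, 10)]))
print(is_special_sample(irregular, [(start, start + 10)]))
```

A result of `False` routes the sample to the basic segmentation and traversal logic, while `True` routes it to the special-case generation logic.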

Different special cases correspond to different processing logic. Specifically, in one processing scenario provided by the present invention, when the characters before and after the field to be extracted are not both special characters, the processing logic includes:

taking the last character of the first field not to be extracted as the starting position and searching towards the front character by character, taking the first special character found as the separation character T, and performing wildcard filtering on the part of the first field not to be extracted that precedes this special character to obtain a regular expression of the form [^T] followed by T; counting the number n of non-special characters between this special character and the first field to be extracted that adjoins the first field not to be extracted at the rear, and generating the corresponding regular expression \w{n}; for a field not to be extracted located between two fields to be extracted, taking its first character as the starting position and searching towards the rear character by character, taking the first special character found as the separation character T, and performing wildcard filtering on the part of this field that follows the special character to obtain a regular expression of the form [^T] followed by T; and counting the number m of non-special characters between this special character and the field to be extracted that adjoins the field at the front, and generating the corresponding regular expression \w{m}.

The following is described in connection with specific examples in order to better understand the present solution:

Log sample: oplog result=1' session_id=e9b2a9f18ffaa150e167d41922e59756 card="G1/1"

The sample contains two fields to be extracted, 167d41922e and G1/1 (named type1 and type2). The characters immediately before and after type1 are not special characters. If the length of the field to be extracted is not fixed in other samples, it is difficult to obtain a valid expression by the segmentation and traversal methods described above, so the algorithm must handle this special case. The module adopts the following method:

First, a search is performed toward the beginning of the string, starting from the character immediately before the field to be extracted type1, until a special character is encountered. In the above example, the first special character encountered is =. The part of the non-extracted field up to and including that special character is then converted into a regular expression by the segmentation method:

^(?:[^\=]*\=){2}

The remaining part of the non-extracted field, "e9b2a9f18ffaa150e", consists of the 17 non-special characters counted during the search, and the corresponding expression is:

\w{17}

The expression generated for type1 by the traversal method is:

(?<type1>\w+)

Then, in the same way, the non-extracted field between type1 and type2 is searched from front to back, starting from the non-special characters adjacent to type1 and stopping at the first special character (a space in this example); 6 non-special characters are counted. The remaining part of the non-extracted field is processed by the segmentation method, and the regular expression for this part is:

\w{6}(?:[^\"]*\")

The expression generated for type2 is:

(?<type2>\w+\/\w)

The final complete expression is:

^(?:[^\=]*\=){2}\w{17}(?<type1>\w+)\w{6}(?:[^\"]*\")(?<type2>\w+\/\w)

In other words, within a non-extracted field, the non-special characters adjacent to a field to be extracted are split off and matched by count, while the rest of the non-extracted field is converted in the same way as when no special case is present.

This special-case processing works best when the number of non-special characters before and after the field to be extracted is fixed, and is particularly suitable for scenarios such as extracting the date of birth from an identity card number.
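The backward search of this scenario can be sketched in Python for the text that precedes the first field to be extracted. This is an illustrative sketch only: the function name leading_regex is hypothetical, a "special character" is assumed to be any character not matched by \w, and the prefix string is the part of the log sample above that precedes type1.

```python
import re


def is_special(ch):
    # Assumption: a character is "special" when it is not a word character (\w).
    return re.fullmatch(r"\w", ch) is None


def leading_regex(prefix):
    # Walk back from the last character until the first special character.
    pos = len(prefix) - 1
    while pos >= 0 and not is_special(prefix[pos]):
        pos -= 1
    n = len(prefix) - 1 - pos  # non-special characters next to the field
    if pos < 0:
        # No special character at all: only the counted part remains.
        return r"^\w{%d}" % n
    sep = prefix[pos]                 # separation character T
    k = prefix[: pos + 1].count(sep)  # occurrences of T up to and including pos
    t = "\\" + sep                    # escape T with a backslash
    return "^(?:[^%s]*%s){%d}" % (t, t, k) + r"\w{%d}" % n


PREFIX = "oplog result=1' session_id=e9b2a9f18ffaa150e"
print(leading_regex(PREFIX))  # ^(?:[^\=]*\=){2}\w{17}
```

The count n is taken from the one sample provided, which is why the approach suits formats where that count does not vary between records.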

In another processing scenario provided by the present invention, when a field to be extracted is directly adjacent to another field to be extracted, the processing logic includes:

judging whether the junction of two adjacent fields to be extracted consists of non-special characters; if so, the later of the two adjacent fields generates its regular expression by exact matching, the earlier of the two generates its regular expression by traversing the character string and mapping characters to expressions one by one, and the fields not to be extracted obtain their regular expressions by wildcard filtering; if not, the fields to be extracted generate their regular expressions by traversing the character string and mapping characters to expressions one by one, and the fields not to be extracted obtain their regular expressions by wildcard filtering.

Log sample: leave: false network peers:1entries:2Queue qLen:0 netMsg/s80r

It should be noted that, for the sake of distinction, the applicant underlines the fields to be extracted in the above sample. Because the two fields to be extracted are adjacent, they are distinguished by underline style: the single-underlined net is one field to be extracted, and the double-underlined Msg/s80r is the other. The underlining only helps the reader tell the fields to be extracted apart from each other and from the fields not to be extracted; the raw log data is not underlined.

The two fields to be extracted (named type1 and type2) are adjacent, and their junction consists of non-special characters. If the expressions for the two fields are generated by the optimized traversal method, the result is:

(?<type1>\w+)(?<type2>\w+\/\w+)

If this expression is matched against the sample, the two fields obtained are:

type1:netMs

type2:g/s80r

Clearly the result does not meet expectations: because the junction of the two fields consists of non-special characters, "\w+" cannot define the field boundary.

The preferred processing method provided by this embodiment is as follows: when two fields are adjacent, it is judged whether the characters at the junction are all non-special characters. If so, the latter field generates its regular expression by exact matching during traversal; if not, the latter field continues to use the optimized traversal method. For this sample, the regular expression generated for the two fields is:

(?<type1>\w+)(?<type2>\w{3}\/\w{4})
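The difference between the fuzzy and the exact form can be checked with a short Python sketch. The helper traverse is a hypothetical simplification of the traversal method (every run of word characters is mapped to \w, without a separate \d class), and Python's re module writes named groups as (?P<name>...) rather than (?<name>...). The string netMsg/s80r is the pair of adjacent fields from the sample above.

```python
import re


def traverse(value, exact):
    # Map each run of word characters to \w+ (fuzzy) or \w{n} (exact)
    # and escape every other character with a backslash.
    parts = []
    for token in re.findall(r"\w+|\W", value):
        if re.fullmatch(r"\w+", token):
            parts.append(r"\w{%d}" % len(token) if exact else r"\w+")
        else:
            parts.append("\\" + token)
    return "".join(parts)


SAMPLE = "netMsg/s80r"  # type1 = net, type2 = Msg/s80r
fuzzy = r"(?P<type1>\w+)(?P<type2>%s)" % traverse("Msg/s80r", exact=False)
exact = r"(?P<type1>\w+)(?P<type2>%s)" % traverse("Msg/s80r", exact=True)

print(re.match(fuzzy, SAMPLE).groupdict())  # {'type1': 'netMs', 'type2': 'g/s80r'}
print(re.match(exact, SAMPLE).groupdict())  # {'type1': 'net', 'type2': 'Msg/s80r'}
```

With the fuzzy form, the greedy \w+ of type1 gives back only the single character that type2 needs, which reproduces the unexpected split; fixing the run lengths of type2 pins the boundary.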

It should be noted that a regular expression is a logical formula that operates on character strings. Regular expressions support both exact matching and fuzzy matching, and exact matching is a basic concept well known to those skilled in the art.

This special-case processing works best when the lengths of the two adjacent fields are fixed, and is generally applied to samples with a fixed format, such as extracting the region field and the date-of-birth field from an identity card number: the two fields are adjacent and their junction consists of non-special characters (digits), but because the lengths of both fields are fixed, the method can still extract them effectively.

In an optional embodiment of the present invention, a further way of judging and processing a special case is provided, for example:

Log sample: leave: false network peers:1entries:2Queue qLen:0 netMsg/s80r

Here, too, the applicant underlines the fields to be extracted in the sample and, because the two fields are adjacent, distinguishes them by underline style: the single-underlined netMsg/ is one field to be extracted, and the double-underlined s80r is the other.

The two fields to be extracted are type1 and type2 respectively, and the two fields are adjacent.

The preferred processing method provided by this embodiment is as follows: when there are multiple fields, the starting position of the next field is compared with the ending position of the previous field. If they do not coincide, the segmentation method and traversal method introduced above are applied as usual; if they coincide, the traversal method is called directly to generate the expression of the next field. The expression generated by this method in this case is:

^(?:[^\s]*\s){5}(?<type1>\w+\/)(?<type2>\w+)
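A minimal Python sketch of the position comparison and of applying the resulting expression follows. It assumes that the two adjacent fields in the sample are netMsg/ and s80r and that a single space precedes them, which is consistent with the five \s repetitions in the expression; fields_adjacent is a hypothetical helper name, and the named groups use Python's (?P<name>...) syntax.

```python
import re

SAMPLE = "leave: false network peers:1entries:2Queue qLen:0 netMsg/s80r"


def fields_adjacent(prev_span, next_span):
    # Spans are (start, end) with end exclusive; the fields touch when the
    # next field starts exactly where the previous one ends.
    return next_span[0] == prev_span[1]


start = SAMPLE.index("netMsg/")
type1_span = (start, start + len("netMsg/"))
type2_span = (type1_span[1], len(SAMPLE))
print(fields_adjacent(type1_span, type2_span))  # True

# No wildcard-filtered segment is emitted between the two groups.
PATTERN = r"^(?:[^\s]*\s){5}(?P<type1>\w+\/)(?P<type2>\w+)"
print(re.match(PATTERN, SAMPLE).groupdict())  # {'type1': 'netMsg/', 'type2': 's80r'}
```

Here the boundary is well defined because type1 ends in the special character /, so the fuzzy \w+ forms are sufficient.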

When the regular expression generation method provided by this embodiment is implemented, after a user provides a piece of original data and determines the target fields to be extracted, the regular expression is generated automatically by analyzing the original data and extracting the fields to be extracted. The user can obtain the required regular expression without being familiar with regular expression syntax, which reduces the difficulty of writing it; once the regular expression has been obtained, log files of the same type can be processed with it, which can greatly improve data extraction efficiency.

Example two

Based on the first embodiment, the second embodiment of the present invention provides a data extraction method based on a regular expression, and referring to fig. 3, the data extraction method based on the regular expression includes the following steps S301 to S303:

S301, acquiring original data from which data is to be extracted;

S302, analyzing the original data to generate a corresponding regular expression;

S303, performing data extraction on the original data to be extracted according to the generated regular expression;

the regular expression corresponding to step S302 is generated by the regular expression generating method in the first embodiment.
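Steps S301 to S303 can be pictured with the following Python sketch, which applies an already generated expression to several log lines of the same type. The function name extract, the second log line, and the malformed line are hypothetical; the expression is the one from the adjacent-field example of the first embodiment, rewritten with Python's (?P<name>...) group syntax.

```python
import re


def extract(lines, pattern):
    # S303: apply one generated expression to every line of the same type.
    compiled = re.compile(pattern)
    results = []
    for line in lines:
        match = compiled.match(line)
        if match:  # lines that do not fit the format are skipped
            results.append(match.groupdict())
    return results


# S302 is assumed to have produced this expression from the first line.
GENERATED = r"^(?:[^\s]*\s){5}(?P<type1>\w+\/)(?P<type2>\w+)"

# S301: original data; the second and third lines are made up for illustration.
LOGS = [
    "leave: false network peers:1entries:2Queue qLen:0 netMsg/s80r",
    "leave: true network peers:3entries:7Queue qLen:2 netAck/q17z",
    "malformed line",
]

for row in extract(LOGS, GENERATED):
    print(row)
```

Compiling the expression once and reusing it for every line is where the efficiency gain for same-type log files comes from in this sketch.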

In a specific implementation, an operation interface may be provided. The user enters, at a first designated location of the operation interface, the sample for which a regular expression is to be generated as the original data. After the background server receives the instruction to generate a regular expression sent by the user through the operation interface, it generates the regular expression corresponding to the sample entered by the user, according to the processing logic of the regular expression generation method in the first embodiment, and displays it at a second designated location of the operation interface. Preferably, the user may also send an instruction to hide the regular expression; after the background server receives this instruction, the generated regular expression is hidden on the operation interface. The user can thus control whether the regular expression is displayed according to their needs, which improves operational flexibility.

For the part of regular expression generation, the related technical features and the corresponding technical effects can refer to the first embodiment, and are not described herein again.

Example three

Corresponding to the first embodiment, a third embodiment of the present invention provides a regular expression generation apparatus; for related technical features and corresponding technical effects, reference may be made to the first embodiment, and they are not described here again. Fig. 4 is a block diagram of the regular expression generation apparatus according to the third embodiment of the present invention. As shown in fig. 4, the regular expression generation apparatus includes: a determining module 401, configured to determine the fields to be extracted and the fields not to be extracted in an original data character string; a first generation control module 402, configured to perform wildcard filtering on the fields not to be extracted to obtain their regular expressions, and to traverse the character strings, mapping characters to expressions one by one, to obtain the regular expressions of the fields to be extracted; and a second generation control module 403, configured to combine the regular expressions of the fields to be extracted and the regular expressions of the fields not to be extracted according to the order in which these fields are arranged in the original data character string, so as to obtain the regular expression of the original data.
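As an illustration only, the division of work between the modules can be sketched in Python for the ordinary (non-special) case. All names are hypothetical, the sample line is made up, and the traversal is simplified so that every run of word characters becomes \w+; wildcard_filter and traverse stand in for the first generation control module, and generate performs the combination done by the second generation control module, with the field positions supplied by the caller in place of the determining module.

```python
import re


def wildcard_filter(segment, anchored):
    # Field not to be extracted: its last character (the one just before
    # the next field to be extracted) is the separation character T.
    sep = "\\" + segment[-1]
    count = segment.count(segment[-1])
    return "%s(?:[^%s]*%s){%d}" % ("^" if anchored else "", sep, sep, count)


def traverse(value):
    # Field to be extracted: word-character runs become \w+, every other
    # character is escaped with a backslash.
    return "".join(
        r"\w+" if re.fullmatch(r"\w+", token) else "\\" + token
        for token in re.findall(r"\w+|\W", value)
    )


def generate(sample, fields):
    # fields: (name, start, end) triples, end exclusive, sorted by start.
    # The pieces are joined in the order the fields occur in the sample.
    parts, cursor = [], 0
    for name, start, end in fields:
        if start > cursor:
            parts.append(wildcard_filter(sample[cursor:start], cursor == 0))
        parts.append("(?P<%s>%s)" % (name, traverse(sample[start:end])))
        cursor = end
    return "".join(parts)


SAMPLE = "user=alice ip=10.0.0.1 action=login"
PATTERN = generate(SAMPLE, [("user", 5, 10), ("ip", 14, 22)])
print(PATTERN)
print(re.match(PATTERN, "user=bob ip=192.168.1.20 action=logout").groupdict())
```

The sketch leaves out the special-sample check, so it is only meaningful when every field to be extracted is bounded by special characters.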

When the regular expression generating device provided in the embodiment is implemented, after a user provides a section of original data and determines a target field to be extracted, the regular expression is automatically generated through analysis of the original data and extraction of the field to be extracted, the user does not need to be familiar with relevant rules of the regular expression to obtain the required regular expression, and the writing difficulty of the regular expression is reduced; meanwhile, after the regular expression is obtained, the log files of the same type are extracted on the basis of the regular expression, and the data extraction efficiency can be greatly improved.

Example four

Corresponding to the second embodiment, a fourth embodiment of the present invention provides a data extraction apparatus based on a regular expression; for related technical features and corresponding technical effects, reference may be made to the second embodiment and the first embodiment, and they are not described here again. Fig. 5 is a block diagram of the regular expression-based data extraction apparatus according to the fourth embodiment of the present invention. As shown in fig. 5, the data extraction apparatus includes: an obtaining module 502, configured to obtain original data from which data is to be extracted; a third generating module 504, configured to analyze the original data and generate a corresponding regular expression; and a data extraction module 506, configured to perform data extraction on the original data to be extracted according to the generated regular expression; wherein the third generating module generates the regular expression by the regular expression generation method provided in the first embodiment.

Example five

The fifth embodiment further provides a computer device capable of executing programs, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of multiple servers). As shown in fig. 6, the computer device 01 of the present embodiment includes at least, but is not limited to: a memory 011 and a processor 012, which are communicatively connected to each other via a system bus. It is noted that fig. 6 only shows the computer device 01 with the memory 011 and the processor 012, but it is to be understood that not all of the shown components are required, and that more or fewer components may be implemented instead.

In this embodiment, the memory 011 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 011 can be an internal storage unit of the computer device 01, such as a hard disk or a memory of the computer device 01. In other embodiments, the memory 011 can also be an external storage device of the computer device 01, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the computer device 01. Of course, the memory 011 can also include both internal and external storage units of the computer device 01. In this embodiment, the memory 011 is generally used for storing an operating system and various application software installed in the computer device 01, for example, the program code of the regular expression generation apparatus in the third embodiment, the data extraction apparatus based on the regular expression in the fourth embodiment, and the like. Further, the memory 011 can also be used to temporarily store various kinds of data that have been output or are to be output.

The processor 012 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or other data Processing chip in some embodiments. The processor 012 is generally used to control the overall operation of the computer device 01. In the present embodiment, the processor 012 is configured to run program codes or processing data stored in the memory 011, for example, a method of generating a regular expression, or the like.

Example six

The sixth embodiment further provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by a processor implements corresponding functions. The computer-readable storage medium of this embodiment is used for a regular expression generation apparatus, and when executed by a processor, implements the regular expression generation method of the first embodiment.

In another implementation, the computer-readable storage medium of this embodiment is used for a regular expression-based data extraction apparatus, and when executed by a processor, the computer-readable storage medium implements the regular expression-based data extraction method of the second embodiment.

It can be seen from the above description that, in the method for generating a regular expression, the method for extracting data based on a regular expression, the apparatus, the computer device and the storage medium provided in the above embodiments of the present invention, when implemented, after a user provides a segment of original data and determines a target field to be extracted, the regular expression is automatically generated by analyzing the original data and extracting the field to be extracted, and the user does not need to be familiar with the relevant rules of the regular expression to obtain the required regular expression, thereby reducing the writing difficulty of the regular expression; meanwhile, after the regular expression is obtained, the log files of the same type are extracted on the basis of the regular expression, and the data extraction efficiency can be greatly improved.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for generating a regular expression, comprising:

determining fields to be extracted and fields not to be extracted in the original data character string;

performing wildcard filtering on the fields not to be extracted to obtain regular expressions of the fields, and traversing the character strings to enable the characters to correspond to the expressions one by one to obtain the regular expressions of the fields to be extracted;

and combining the regular expression of the field to be extracted and the regular expression of the field not to be extracted according to the arrangement sequence of the field to be extracted and the field not to be extracted in the original data character string to obtain the regular expression of the original data.

2. The method for generating a regular expression according to claim 1, wherein performing wildcard filtering on the fields not to be extracted to obtain their regular expressions, and traversing the character strings to make the characters correspond to the expressions one by one to obtain the regular expressions of the fields to be extracted, includes:

judging whether the original data character string belongs to a special sample according to the determined field to be extracted;

if the original data character string does not belong to a special sample, for each field not to be extracted, acquiring the character immediately before the field to be extracted that follows the field not to be extracted as the separation character T of the field not to be extracted, and performing wildcard filtering to obtain a regular expression composed of [^T] and T;

and traversing the character strings so that the characters correspond to the expressions one by one to obtain the regular expressions of the fields to be extracted, wherein the digits 0 to 9 correspond to the expression \d, word characters correspond to \w, and the escape character \ or \\ is added in front of special characters.

3. The method for generating a regular expression according to claim 2, wherein the determining whether the original data character string belongs to a special sample according to the determined field to be extracted includes:

judging whether the characters before and after the field to be extracted are special characters, and if not, judging that the original data character string belongs to a special sample; and/or,

judging whether the characters before and after the field to be extracted are adjacent, and if not, judging that the original data character string belongs to a special sample.

4. The method for generating a regular expression according to claim 3, wherein, when the characters before and after the field to be extracted are not all special characters, the method comprises:

taking the last character of the first field not to be extracted as the starting position and searching toward the beginning of the string character by character, taking the first special character found as the separation character T, and performing wildcard filtering on the part of the first field not to be extracted up to that special character to obtain a regular expression composed of [^T] and T;

counting the number n of non-special characters between the first special character found in the first field not to be extracted and the field to be extracted immediately following the first field not to be extracted, and generating the corresponding regular expression \w{n};

for a field not to be extracted located between two fields to be extracted, searching toward the end of the string character by character from the first character of that field, taking the first special character found as the separation character T, and performing wildcard filtering on the part of that field after the special character to obtain a regular expression composed of [^T] and T;

and counting the number m of non-special characters between the field to be extracted immediately preceding that field not to be extracted and the first special character found in it, and generating the corresponding regular expression \w{m}.

5. The method for generating a regular expression according to claim 3, wherein, when fields to be extracted are adjacent, the method comprises:

judging whether the junction of two adjacent fields to be extracted consists of non-special characters;

if so, the later of the two adjacent fields to be extracted generates its regular expression by exact matching; the earlier of the two adjacent fields to be extracted generates its regular expression by traversing the character string so that characters correspond to expressions one by one, and the fields not to be extracted obtain their regular expressions by wildcard filtering;

if not, the fields to be extracted generate their regular expressions by traversing the character string so that characters correspond to expressions one by one, and the fields not to be extracted obtain their regular expressions by wildcard filtering.

6. A data extraction method based on regular expressions is characterized by comprising the following steps:

acquiring original data from which data is to be extracted;

analyzing the original data to generate a corresponding regular expression;

performing data extraction on original data needing data extraction according to the generated regular expression;

wherein the corresponding regular expression is generated by the regular expression generation method according to any one of claims 1 to 5.

7. An apparatus for generating a regular expression, comprising:

the determining module is used for determining fields to be extracted and fields not to be extracted in the original data character string;

the first generation control module is used for performing wildcard filtering on the fields not to be extracted to obtain their regular expressions, and traversing the character strings so that the characters and the expressions correspond one by one to obtain the regular expressions of the fields to be extracted;

and the second generation control module is used for combining the regular expressions of the fields to be extracted and the regular expressions of the fields not to be extracted according to the arrangement sequence of the fields to be extracted and the fields not to be extracted in the original data character string so as to obtain the regular expressions of the original data.

8. A regular expression-based data extraction device, comprising:

the acquisition module is used for acquiring original data from which data is to be extracted;

the third generation module is used for analyzing the original data and generating a corresponding regular expression;

the data extraction module is used for extracting data of original data needing data extraction according to the generated regular expression;

wherein the third generation module generates the regular expression by the regular expression generation method of any one of claims 1 to 5.

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 5 are implemented by the processor when executing the computer program.

10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 5.