CN112115313A - Regular expression generation method, regular expression data extraction method, regular expression generation device, regular expression data extraction device, regular expression equipment and regular expression data extraction medium - Google Patents

Publication number
CN112115313A
Authority
CN
China
Prior art keywords
data
list
processed
regular expression
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010935977.8A
Other languages
Chinese (zh)
Other versions
CN112115313B (en)
Inventor
吕亮亮
冯智
宋传园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010935977.8A
Publication of CN112115313A
Application granted
Publication of CN112115313B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Abstract

The application discloses regular expression generation and data extraction methods, devices, equipment and media, relates to the technical field of data processing, and further relates to a data extraction and classification technology, wherein the regular expression generation method comprises the following steps: acquiring a sample data list; the sample data list comprises a plurality of sample data; generating a public data tree corresponding to the sample data list according to each sample data; generating a data type list according to the public data tree; and generating a plurality of regular expressions matched with the sample data list according to the data type list. The regular expression generation method and the regular expression generation device can automatically generate the regular expression, so that the generation efficiency of the regular expression is improved.

Description

Regular expression generation method, regular expression data extraction method, regular expression generation device, regular expression data extraction device, regular expression equipment and regular expression data extraction medium
Technical Field
The application relates to the technical field of data processing, in particular to a data extraction and classification technology.
Background
In the formulation of data standards, it is often necessary to abstract a regular expression from a certain class of data. The regular expression can abstract common characteristics of similar data, and can automatically extract and classify the data. Because data has the characteristics of diversified data types and huge quantity, manual design of the regular expression wastes manpower and time, and automatic generation of the regular expression is necessary.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a medium for generating and extracting regular expressions, so that the regular expressions are automatically generated, and the generation efficiency of the regular expressions is improved.
In a first aspect, an embodiment of the present application provides a method for generating a regular expression, including:
acquiring a sample data list; the sample data list comprises a plurality of sample data;
generating a public data tree corresponding to the sample data list according to each sample data;
generating a data type list according to the public data tree;
and generating a plurality of regular expressions matched with the sample data list according to the data type list.
In a second aspect, an embodiment of the present application provides a data extraction method, including:
acquiring data to be processed;
analyzing the data to be processed to generate a regular expression matched with the data to be processed;
performing data extraction on the data to be processed according to the generated regular expression;
wherein the regular expression is generated by the regular expression generation method of any one of claims 1-13.
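The extraction step of the second aspect can be illustrated with a minimal sketch. It assumes a regular expression has already been generated; the pattern and the input text below are hypothetical stand-ins chosen for illustration, not output of the method described here.

```python
import re

# Hypothetical stand-in for a generated regular expression.
generated_pattern = re.compile(r"https?://www\.[a-z]+\.com")

# Hypothetical data to be processed.
data_to_process = "see http://www.cbidu.com and https://www.za.com, not ftp://x.org"

# Extract every substring that matches the generated expression.
extracted = generated_pattern.findall(data_to_process)
print(extracted)
```

Because the pattern contains no capture groups, `findall` returns the full matched strings.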
In a third aspect, an embodiment of the present application provides an apparatus for generating a regular expression, including:
the sample data list acquisition module is used for acquiring a sample data list; the sample data list comprises a plurality of sample data;
the public data tree generating module is used for generating a public data tree corresponding to the sample data list according to each sample data;
the data type list generating module is used for generating a data type list according to the public data tree;
and the first regular expression generating module is used for generating a plurality of regular expressions matched with the sample data list according to the data type list.
In a fourth aspect, an embodiment of the present application provides a data extraction apparatus, including:
the data to be processed acquisition module is used for acquiring data to be processed;
the second regular expression generating module is used for analyzing the data to be processed and generating a regular expression matched with the data to be processed;
the data extraction module is used for extracting data of the data to be processed according to the generated regular expression;
wherein the regular expression is generated by the regular expression generation method of any one of claims 1-13.
In a fifth aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method for generating the regular expression provided by the first aspect embodiment or the method for extracting data provided by the second aspect embodiment.
In a sixth aspect, embodiments of the present application further provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the method for generating a regular expression provided in the first aspect, or execute the method for extracting data provided in the second aspect.
According to the method and the device, the public data tree corresponding to the sample data list is generated according to the sample data in the sample data list, the data type list is generated according to the public data tree, the regular expressions matched with the sample data list are generated according to the data type list, data to be processed are extracted and classified quickly by using the generated regular expressions, the automatic generation of the regular expressions is achieved, and the generation efficiency of the regular expressions and the extraction and classification efficiency of the data are improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flowchart of a method for generating a regular expression according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for generating a regular expression according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for generating a regular expression according to an embodiment of the present application;
FIG. 4 is a flowchart of a data extraction method provided in an embodiment of the present application;
FIG. 5 is a structural diagram of a regular expression generation apparatus according to an embodiment of the present application;
FIG. 6 is a structural diagram of a data extraction apparatus according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device for implementing a regular expression generation method or a data extraction method according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In an example, FIG. 1 is a flowchart of a method for generating a regular expression provided in an embodiment of the present application. This embodiment is applicable to the automatic generation of regular expressions. The method may be performed by a regular expression generation apparatus, which may be implemented in software and/or hardware and may generally be integrated in an electronic device, such as a computer device. As shown in FIG. 1, the method comprises the following operations:
S110, acquiring a sample data list; the sample data list comprises a plurality of sample data.
The sample data list may be a list composed of the sample data required to generate the regular expressions. Optionally, the data type of the sample data may be a character string type or a Chinese character type; the embodiment of the present application does not limit the data type of the sample data.
In the embodiment of the present invention, before generating the regular expression, a sample data list for generating the regular expression may be first obtained. Illustratively, data screening can be performed from batch data, and sample data obtained through screening is constructed to form a sample data list. For example, a plurality of pieces of website data are screened from the batch data to construct a sample data list. Or, the sample data can be independently constructed directly according to the data screening requirement, and a sample data list can be constructed and formed according to the constructed sample data. For example, according to the screening requirement of the special webpage link character string, a corresponding special webpage link character string sample is independently constructed or obtained to be used as sample data to construct a sample data list. The embodiment of the present application does not limit the specific obtaining manner of the sample data list.
S120, generating a public data tree corresponding to the sample data list according to each sample data.
The common data tree may record the common data sequences among the sample data. A common data sequence is data that is identical across the sample data.
Correspondingly, after the sample data list is obtained, each sample data of the sample data list can be analyzed to determine a common data sequence of each sample data, and a common data tree corresponding to the sample data list is generated according to the common data sequence of each sample data.
Illustratively, the sample data "www.cbidu.com.cn" and "www.za.com" have the common data sequences "www." and ".com", so a common data tree for the sample data list ["www.cbidu.com.cn", "www.za.com"] may be constructed from "www." and ".com". Each node in the common data tree may be a common data sequence. For example, the common data tree for this sample data list may be: the root node is "www." and its child node is ".com".
S130, generating a data type list according to the public data tree.
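The search for a common data sequence can be sketched as a longest-common-substring lookup. The sketch below is a brute-force illustration only; the function name `longest_common_substring` and the leftmost tie-breaking rule are assumptions, and a suffix tree structure would be the more efficient choice for large inputs.

```python
def longest_common_substring(strings):
    """Return the longest contiguous substring shared by every string.

    Brute force over substrings of the shortest string, longest first.
    Ties are broken by the leftmost position in that shortest string,
    which is an assumption of this sketch.
    """
    if not strings:
        return ""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""


samples = ["www.cbidu.com.cn", "www.za.com"]
root = longest_common_substring(samples)
print(root)
```

Applied again to what remains to the right of the root ("cbidu.com.cn" and "za.com"), the same function yields ".com".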
The data type list can be used for recording data types of related data in the sample data list, and the data types can be used for judging variable characteristics of the related data of each sample data.
Correspondingly, after the public data tree corresponding to the sample data list is generated, a data type list can be further generated according to the generated public data tree, and the variable characteristics of the related data of the sample data can be judged through the data type list.
S140, generating a plurality of regular expressions matched with the sample data list according to the data type list.
In the embodiment of the invention, after the data type list is generated aiming at the sample data list, a plurality of regular expressions matched with the sample data list can be generated according to the data type list. Optionally, the component of the regular expression corresponding to each piece of relevant data may be generated according to the result of evaluating the variable characteristic of the relevant data of the sample data by the data type list and the specific data content of each piece of relevant data, so as to automatically generate the regular expression matched with the sample data list.
According to the method and the device, the public data tree corresponding to the sample data list is generated according to the sample data in the obtained sample data list, the data type list is generated according to the public data tree, and therefore the regular expressions matched with the sample data list are generated according to the data type list, automatic generation of the regular expressions is achieved, and the generation efficiency of the regular expressions is improved.
In an example, FIG. 2 is a flowchart of a method for generating a regular expression provided in an embodiment of the present application. This embodiment optimizes and improves on the technical solutions of the above embodiments, and provides several specific implementations of generating a common data tree corresponding to the sample data list according to each sample data and of generating a data type list according to the common data tree.
A method for generating a regular expression as shown in fig. 2 includes:
S210, acquiring a sample data list.
S220, generating a public data tree corresponding to the sample data list according to each sample data.
In an optional embodiment of the present application, the generating a common data tree corresponding to the sample data list according to each sample data may include: taking the sample data list as a current data list; generating a current target common continuous subsequence of each sample data in the current data list through a suffix tree data structure; taking the current target public continuous subsequence as a root node of a current public data tree, and sequentially determining a temporary child node of each sample data according to the target public continuous subsequence and each sample data in the current data list; wherein the temporary child nodes include a first temporary child node and a second temporary child node; constructing a target data list according to each temporary child node, and updating the current data list according to the target data list; and returning to execute the operation of generating the current target public continuous subsequence of each sample data in the current data list through a suffix tree data structure, and updating the child nodes of the current public data tree according to the root nodes of the target data list until the current target public continuous subsequence is empty.
The current target common continuous subsequence may be a longest common continuous subsequence of each sample data in the current data list. The temporary child node may be data obtained by splitting the sample data by using a root node, and the data may be used for constructing a new data list. The first temporary child node may be a left child node and the second temporary child node may be a right child node. Alternatively, the first temporary child node may be a right child node and the second temporary child node may be a left child node. The node types of the first temporary child node and the second temporary child node are not limited in the embodiments of the present application. The target data list may be a new data list constructed according to child nodes of each child public data tree, and sample data in the data list is partial data of original sample data.
In an optional embodiment of the present application, the constructing a target data list according to child nodes of each of the current child public data trees may include: constructing a first target data list according to each first temporary child node; constructing a second target data list according to each second temporary child node; the updating the child nodes of the current common data tree according to the root node of the target data list may include: taking a root node of each first target data list as a first child node of the current public data tree; and taking the root node of each second target data list as a second child node of the current public data tree.
The first target data list may be a new data list generated according to the first temporary child node, and the second target data list may be a new data list generated according to the second temporary child node. When the first temporary child node is a left child node and the second temporary child node is a right child node, the first child node may be the left child node and the second child node may be the right child node. When the first temporary child node is a right child node and the second temporary child node is a left child node, the first child node may be the right child node and the second child node may be the left child node.
In this embodiment, the common data tree is generated recursively, one sub-tree at a time. Specifically, the sample data list is taken as the current data list, the current target common continuous subsequence of the sample data in the current data list is generated through a suffix tree data structure and used as the root node of the current common data tree, and the temporary child nodes of each sample data are determined in turn according to the target common continuous subsequence and each sample data in the current data list. A target data list is then constructed from the temporary child nodes, and the current data list is updated according to the target data list. The current common data tree may have the structure of one root node, one left child node and one right child node. Accordingly, after the current common data tree is generated for the current data list, the target data lists can be constructed according to the root node of the current common data tree and each sample data: a first target data list is constructed from the first temporary child nodes, and a second target data list is constructed from the second temporary child nodes. The first target data list and the second target data list are then each taken as the current data list, and the operation of generating the current target common continuous subsequence of each sample data in the current data list through a suffix tree data structure is executed again, until the current target common continuous subsequence is empty.
It should be noted that, after the first target data list generates the corresponding current target common continuous subsequence, the current target common continuous subsequence corresponding to the first target data list may be used as the first child node of the current common data tree. Similarly, after the second target data list generates the corresponding current target public continuous subsequence, the current target public continuous subsequence corresponding to the second target data list may be used as the second child node of the current public data tree. Thus, a complete child common data tree can be finally generated for the sample data list.
In one illustrative example, assume the sample data list is ["http://www.cbidu.com", "https://www.za.com", "http://www.alucaaa.com"], whose sample data are "http://www.cbidu.com", "https://www.za.com" and "http://www.alucaaa.com". The sample data list is taken as the current data list, and the current target common continuous subsequence of the sample data in the current data list, generated through a suffix tree data structure, is "://www.". Then "://www." is used as the root node of the current common data tree, and each sample data in the current data list is split by the root node into three parts: first temporary child node + root node + second temporary child node. For "http://www.cbidu.com", the first temporary child node is "http", the root node is "://www." and the second temporary child node is "cbidu.com". Similarly, for "https://www.za.com", the first temporary child node is "https", the root node is "://www." and the second temporary child node is "za.com". For "http://www.alucaaa.com", the first temporary child node is "http", the root node is "://www." and the second temporary child node is "alucaaa.com". The first temporary child nodes of the sample data then form the first target data list, namely ["http", "https", "http"], and the second temporary child nodes form the second target data list, namely ["cbidu.com", "za.com", "alucaaa.com"]. After the first and second target data lists are generated, each is taken in turn as the current data list, and the step of generating the current target common continuous subsequence of each sample data in the current data list through a suffix tree data structure is invoked again.
The root node of the first target data list, that is, its current target common continuous subsequence, may serve as the left child node of the current common data tree, and the root node of the second target data list, that is, its current target common continuous subsequence, may serve as the right child node of the current common data tree. That is, ["http", "https", "http"] is taken as the first target data list and ["cbidu.com", "za.com", "alucaaa.com"] as the second target data list, and each is updated to be the current data list in turn to generate its current target common continuous subsequence. In this example, the current target common continuous subsequence of the first target data list is empty, and that of the second target data list is ".com". Taking the current target common continuous subsequence of the first target data list as the first child node of the current common data tree and that of the second target data list as the second child node, the final common data tree is obtained: the root node is "://www.", the first child node is null, and the second child node is ".com".
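The three-way split around the root node can be sketched with Python's `str.partition`. The helper name `split_by_root` is hypothetical, and the root is written here as "://www.", the form for which the three parts concatenate back to each original string. Splitting at the first occurrence of the root is an assumption of this sketch.

```python
def split_by_root(samples, root):
    """Split each sample into (first temporary child, root, second temporary child).

    Returns the first target data list and the second target data list.
    The split happens at the first occurrence of the root (an assumption).
    """
    first_targets, second_targets = [], []
    for sample in samples:
        first, found, second = sample.partition(root)
        if not found:
            raise ValueError(f"{root!r} not found in {sample!r}")
        first_targets.append(first)
        second_targets.append(second)
    return first_targets, second_targets


samples = ["http://www.cbidu.com", "https://www.za.com", "http://www.alucaaa.com"]
first_list, second_list = split_by_root(samples, "://www.")
print(first_list)
print(second_list)
```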
With this technique, the common data of the sample data can be obtained piece by piece through recursion, and a common data tree is constructed from the obtained common data.
S230, generating a public data full list according to the public data tree and the sample data list.
S240, generating the data type list according to the public data full list.
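The recursion described above can be sketched as follows. This is a minimal illustration, not the method as implemented: a brute-force longest-common-substring search is swapped in for the suffix tree data structure, and the names `Node`, `longest_common_substring` and `build_common_tree` are hypothetical. It is run on the shorter two-string example. For the three-URL example, a plain longest-common-substring search would also find "http" in the first target data list, so this sketch is not claimed to reproduce that worked example node for node.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def longest_common_substring(strings: List[str]) -> str:
    """Brute-force stand-in for the suffix tree lookup (assumption)."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""


def build_common_tree(strings: List[str]) -> Optional[Node]:
    """Recursively build the tree; stop when no common substring remains."""
    common = longest_common_substring(strings)
    if not common:
        return None
    lefts, rights = [], []
    for s in strings:
        left, _, right = s.partition(common)  # split at first occurrence
        lefts.append(left)
        rights.append(right)
    return Node(common, build_common_tree(lefts), build_common_tree(rights))


tree = build_common_tree(["www.cbidu.com.cn", "www.za.com"])
print(tree)
```

Each recursive call works on strictly shorter strings, because a non-empty common substring is removed from every string at each level, so the recursion terminates.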
The common data full list may be a list generated for the sample data of the sample data list, which represents the same feature data and the different feature data among the sample data.
In the embodiment of the present invention, when the data type list is generated according to the common data tree, the common data full list may be generated according to the common data tree and the sample data list, and then the data type list may be generated according to the common data full list.
In the above scheme, the common data full list may represent the same feature data and different feature data between each sample data, and then may respectively determine corresponding data types for the same feature data and the different feature data, so as to generate a final data type list.
In an optional embodiment of the present application, the generating a public data full list according to the public data tree and the sample data list may include: traversing the public data tree, and constructing a public data intermediate list according to a traversal result; forming a corresponding sub data list according to the non-public data included in the sample data list; and expanding the public data intermediate list according to each subdata list to obtain the public data full list.
The common data intermediate list may be a list containing the same feature data, that is, the common data. Correspondingly, the different feature data is the non-common data. A sub data list may be a data list formed by abstracting the non-common data.
Specifically, the common data tree may be traversed to construct the common data intermediate list according to the traversal result. Optionally, the common data tree may be traversed in order (left child, root, right child). After the common data intermediate list is obtained, the non-common data included in the sample data list can be used to form the corresponding sub data lists, and each sub data list is then used to expand the common data intermediate list, so that the final common data full list is obtained.
In one illustrative example, assume the sample data list is list1, with the specific sample data ["http://www.cbidu.com", "https://www.za.com", "http://www.alucaaa.com"], and the corresponding common data tree is: the root node is "://www.", the first child node is null, and the second child node is ".com". First, an in-order traversal of the common data tree yields the ordered common data intermediate list list2: ["://www.", ".com"]. Then the corresponding sub data lists list4 and list5 are formed from the non-common data included in the sample data list, where list4 is ["http", "https", "http"] and list5 is ["cbidu", "za", "alucaaa"]. Finally, the common data intermediate list list2 is expanded with the sub data lists list4 and list5, and the final common data full list list3 is obtained: [list4, "://www.", list5, ".com"].
In the above scheme, the common data intermediate list is constructed first and is then expanded with the sub data lists formed from the non-common data, which yields a common data full list containing nested lists; each sub data list is one such nested list. The common data full list clearly separates common data from non-common data, so that the data type of each part can be determined.
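The traversal and expansion can be sketched as below. This is an illustration under assumptions: the tree is written by hand as nested `(value, left, right)` tuples with the root taken as "://www.", the helper names `in_order` and `build_full_list` are hypothetical, each common part is located at its first occurrence after the previous one, and sub data lists consisting only of empty strings are omitted.

```python
def in_order(node):
    """In-order traversal of a (value, left, right) tuple tree; None is empty."""
    if node is None:
        return []
    value, left, right = node
    return in_order(left) + [value] + in_order(right)


def build_full_list(samples, common_parts):
    """Interleave sub data lists of non-common data with the common parts."""
    # gaps[k] holds, per sample, the text found just before common_parts[k];
    # the last entry holds the text after the final common part.
    gaps = [[] for _ in range(len(common_parts) + 1)]
    for sample in samples:
        pos = 0
        for k, part in enumerate(common_parts):
            idx = sample.index(part, pos)  # raises ValueError if absent
            gaps[k].append(sample[pos:idx])
            pos = idx + len(part)
        gaps[-1].append(sample[pos:])
    full = []
    for k, part in enumerate(common_parts):
        if any(gaps[k]):  # skip sub data lists that are entirely empty
            full.append(gaps[k])
        full.append(part)
    if any(gaps[-1]):
        full.append(gaps[-1])
    return full


tree = ("://www.", None, (".com", None, None))
list1 = ["http://www.cbidu.com", "https://www.za.com", "http://www.alucaaa.com"]
list2 = in_order(tree)
list3 = build_full_list(list1, list2)
print(list3)
```

Plain strings in the result stand for common data and nested lists stand for non-common data, which is what lets a later step assign a type to each position.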
In an optional embodiment of the present application, the generating the data type list according to the public data full list may include: determining public data of the public data full list as a first data type; calculating the length information entropy of each subdata list of the public data full list; and determining the data type of each sub data list according to the numerical relationship between the length information entropy of each sub data list and a first set threshold.
In an optional embodiment of the present application, the determining a data type of each sub data list according to a numerical relationship between length information entropy of each sub data list and a first set threshold may include: determining the data type of the sub data list as a second data type under the condition that the length information entropy of the sub data list is larger than the first set threshold; and determining the data type of the sub data list as the first data type under the condition that the length information entropy of the sub data list is less than or equal to the first set threshold.
The first data type may be a constant type, that is, a fixed constant, and the second data type may be a variable type, that is, a non-fixed variable. The length information entropy may be an information entropy calculated for each sub data list, reflecting the uncertainty as to whether its data is of the constant type or the variable type. The first set threshold may be set according to actual requirements or specified in advance, for example 1.3 or 2.4; the embodiment of the present application does not limit its specific value.
In the embodiment of the invention, the data type list is generated according to the public data full list, and the corresponding data types are mainly determined for each part of data in the public data full list. Specifically, since the common data in the common data full list is the same characteristic data of each sample data, the data type of the common data can be directly determined as the first data type, that is, the constant data type. For the non-common data portion of each sub data list, the data type of each sub data list may be determined by means of length information entropy.
The length information entropy may be defined as
$$H = -\sum_{i} p_i \log p_i$$
where $p_i$ is the number of data items having the $i$-th distinct length divided by the total number of data items. For example, the data length list corresponding to ["cbidu", "za", "alucaaa"] is [5, 2, 7], so the $p_i$ for this list are [1/3, 1/3, 1/3]. The length information entropy measures the uncertainty among the data.
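As a worked instance, and assuming the length information entropy takes the usual Shannon form (the logarithm base is not stated in the text and is left open here), the list ["cbidu", "za", "alucaaa"] has three distinct lengths, each with $p_i = 1/3$, which gives:

```latex
H = -\sum_{i=1}^{3} \frac{1}{3}\log\frac{1}{3} = \log 3
```

With a base-2 logarithm this is about 1.585, which would exceed an example threshold of 1.3 but not one of 2.4.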
Specifically, when the data type of each sub-data list is determined in a length information entropy manner, the length information entropy of each sub-data list may be calculated, and the length information entropy may be compared with the first set threshold. If the length information entropy of the sub data list is greater than the first set threshold, it indicates that the uncertainty of the sub data list is greater than the preset value, and the data type of the sub data list can be determined as a second data type, that is, a variable type. If the length information entropy of the sub data list is less than or equal to the first set threshold, it indicates that the uncertainty of the sub data list is less than or equal to the preset value, and the data type of the sub data list may be determined as the first data type, that is, as the constant type.
It should be noted that the data type of the sub data list may also be determined without depending on the length information entropy, that is, the data type of the sub data list is defined as a first data type and a second data type, and then, for each data type, the corresponding regular expression result may be determined according to the length information entropy.
According to the technical scheme, the specific data type is determined for the non-public data of each sample data in a length information entropy mode, and the regular expression corresponding to each sample data can be determined according to actual data abstraction requirements. When the values of the first set thresholds for determining the non-public data are different, the determination results of the data types corresponding to the non-public data are also different, so that the required regular expression can be automatically generated according to the actual data abstraction requirement.
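The classification of a sub data list by its length information entropy can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `length_entropy` and `classify_sub_list` are hypothetical, and the base-2 logarithm is an assumption, since the text does not state the base.

```python
import math
from collections import Counter


def length_entropy(items):
    """Entropy of the distribution of string lengths in `items`.

    Assumption: base-2 logarithm (the source text does not specify a base).
    """
    counts = Counter(len(s) for s in items)
    total = len(items)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def classify_sub_list(items, first_threshold):
    """Return "variable" when the entropy exceeds the first set threshold,
    otherwise "constant" (entropy less than or equal to the threshold)."""
    if length_entropy(items) > first_threshold:
        return "variable"
    return "constant"


list5 = ["cbidu", "za", "alucaaa"]      # lengths 5, 2, 7
list4 = ["http", "https", "http"]       # lengths 4, 5, 4
```

With the example thresholds from the text, `list5` would be classified as variable under a threshold of 1.3 but as constant under 2.4, which illustrates how the threshold controls the level of abstraction.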
S250, generating a plurality of regular expressions matched with the sample data list according to the data type list.
According to the above technical solution, the public data tree corresponding to the sample data list is generated from the sample data, so that the public data content of the sample data is extracted in sequence and represented as a public data tree. The public data full list is then generated from the public data tree and the sample data, and the data type list is generated from the public data full list by means of the length information entropy. This can improve the generation efficiency of regular expressions and enrich the ways in which regular expressions are generated.
In an example, fig. 3 is a flowchart of a method for generating a regular expression provided in the embodiment of the present application, and the embodiment of the present application performs optimization and improvement on the basis of the technical solutions of the above embodiments, and provides a plurality of specific implementation manners for generating a plurality of regular expressions matched with the sample data list according to the data type list.
A method for generating a regular expression as shown in fig. 3 includes:
S310, acquiring a sample data list.
S320, generating a public data tree corresponding to the sample data list according to each sample data.
S330, generating a data type list according to the public data tree.
S340, generating a plurality of regular expressions matched with the sample data list according to the data type list and the public data full list.
In the embodiment of the application, a plurality of regular expressions matched with the sample data list can be generated according to the data type list and the public data full list.
Correspondingly, S340 may specifically include the following operations:
S341, acquiring the current data to be processed of the public data full list according to the data sorting order.
The data sorting order may be the order in which the data are arranged in the public data full list. The current data to be processed is the data in the public data full list for which regular expression content currently needs to be generated.
Taking the public data full list list3: [list4, "://www.", list5, ".com"] in the above example, when processing of list3 starts, the data "list4" of list3 is obtained as the current data to be processed according to the data sorting order. After processing of "list4" is completed, the data "://www." of list3 is obtained as the current data to be processed according to the data sorting order, and so on until all data have been processed.
S342, judging whether the data type of the current data to be processed is a first data type, if so, executing S343; otherwise, S344 is performed.
S343, generating a sub regular expression matched with the current data to be processed according to the quantity of the data to be processed included in the current data to be processed.
The quantity of the data to be processed is the number of distinct data items included in the current data to be processed. A sub regular expression is the part of the regular expression content that is generated for one piece of data.
In the embodiment of the present application, if the data type of the current data to be processed is the first data type, that is, the constant type, a sub-regular expression matched with the current data to be processed needs to be generated according to the quantity of the data to be processed included in the current data to be processed.
In an optional embodiment of the present application, the generating, according to the amount of to-be-processed data included in the current to-be-processed data, a sub-regular expression matched with the current to-be-processed data may include: when the quantity of the data to be processed is determined to be a first quantity, directly taking the current data to be processed as a sub regular expression matched with the current data to be processed; and when the quantity of the data to be processed is determined to be not the first quantity, combining all data of the current data to be processed as a sub regular expression matched with the current data to be processed.
The first quantity may be 1. Correspondingly, a quantity other than the first quantity is a positive integer greater than 1.
Optionally, if the number of the data to be processed is the first number, the current data to be processed may be directly used as a sub regular expression matched with the current data to be processed; otherwise, combining all data of the current data to be processed as a sub regular expression matched with the current data to be processed.
Take the public data full list list3: [list4, "://www.", list5, ".com"] in the above example, where list4 is ["http", "https", "http"] and list5 is ["cbidu", "za", "alucaaa"]. Suppose "://www." is the current data to be processed and its data type is the first data type. Since the quantity of data in the current data to be processed is 1, the sub regular expression corresponding to "://www." is "://www." itself. Suppose instead that list4 is the current data to be processed and its data type is the first data type. Since the current data to be processed includes the data "http" and "https", its quantity of data is 2, that is, greater than 1. Thus, the sub regular expression corresponding to list4 may be "http|https", where the symbol "|" means "or".
In the above scheme, the matched sub regular expressions are generated for the current data to be processed of the first data type according to the number of the data to be processed included in the current data to be processed, so that public data can be retained to the greatest extent, that is, common features of the data are extracted.
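The handling of constant-type data described above can be sketched as follows. The function name `constant_sub_regex` is hypothetical. Escaping metacharacters with `re.escape` and wrapping the alternation in parentheses are additions made here so that the result is a valid pattern; the text itself shows the unescaped forms.

```python
import re


def constant_sub_regex(items):
    """Build the sub regular expression for constant-type data.

    One distinct item: the item itself (escaped).
    Several distinct items: an alternation joined with "|".
    """
    # Deduplicate while keeping first-seen order, so ["http", "https", "http"]
    # counts as two items, as in the example in the text.
    distinct = list(dict.fromkeys(items))
    escaped = [re.escape(s) for s in distinct]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


list4 = ["http", "https", "http"]
print(constant_sub_regex(list4))        # (http|https)
print(constant_sub_regex(["://www."]))  # the literal, with "." escaped
```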
S344, generating a sub regular expression matched with the current data to be processed according to the length information entropy of the current data to be processed.
Correspondingly, if the data type of the current data to be processed is the second data type, the sub regular expression matched with the current data to be processed can be generated according to the length information entropy of the current data to be processed.
In an optional embodiment of the present application, the generating, according to the length information entropy of the current data to be processed, a sub-regular expression matched with the current data to be processed may include: under the condition that the length information entropy of the current data to be processed is greater than or equal to a second set threshold, taking a preset character as a sub regular expression matched with the current data to be processed; under the condition that the length information entropy of the current data to be processed is smaller than a second set threshold and larger than a third set threshold, taking the first length information and the second length information of each data of the current data to be processed as sub regular expressions matched with the current data to be processed; and under the condition that the length information entropy of the current data to be processed is smaller than or equal to the third set threshold, taking the third length information of each piece of data of the current data to be processed as a sub regular expression matched with the current data to be processed.
The second set threshold may be set according to actual requirements or may be pre-specified, for example as 1, 2, or 2.5; the embodiment of the present application does not limit the specific value of the second set threshold. The third set threshold may be 0. The preset character may be set according to actual requirements, for example "+", "-", or "."; the embodiment of the present application does not limit the specific content of the preset character. The first length information may be the minimum data length among the data, the second length information may be the maximum data length among the data, and the third length information may be the common data length when all the data have the same length.
In the embodiment of the present application, if the data type of the current data to be processed is the second data type, that is, the variable type, a sub regular expression matched with the current data to be processed needs to be generated according to the length information entropy of the current data to be processed. Optionally, if the length information entropy of the current data to be processed is greater than or equal to a second set threshold, the preset character may be used as a sub-regular expression matched with the current data to be processed. If the length information entropy of the current data to be processed is smaller than the second set threshold and larger than the third set threshold, the first length information and the second length information of each piece of data of the current data to be processed can be used as sub regular expressions matched with the current data to be processed, so as to abstract the characteristics of the current data to be processed on the data length, namely, to embody the range interval of the data length of the current data to be processed. If the length information entropy of the current data to be processed is smaller than or equal to the third set threshold, the third length information of each piece of data of the current data to be processed can be used as a sub regular expression matched with the current data to be processed, so as to abstract the characteristics of the current data to be processed on the data length, namely to embody the specific numerical value of each data length in the current data to be processed.
Take the public data full list list3: [list4, "://www.", list5, ".com"] in the above example, where list4 is ["http", "https", "http"] and list5 is ["cbidu", "za", "alucaaa"]. Suppose list5 is the current data to be processed and its data type is the second data type. The length information entropy of list5 is calculated using the formula

$$H = -\sum_{i} p_i \log p_i$$

and is then compared with the second set threshold and with zero. If the length information entropy of list5 is greater than or equal to the second set threshold, the uncertainty of list5 is relatively large, and list5 may be assigned to the MAX class. If the length information entropy of list5 is less than the second set threshold and greater than zero, the uncertainty of list5 is relatively small, and list5 may be assigned to the MID class. If the length information entropy of list5 is less than or equal to 0, the data lengths of all data in list5 are identical, and list5 may be assigned to the MIN class.
The length information entropy of list5 is fixed, whereas the second set threshold can be set according to actual requirements. Therefore, when the value of the second set threshold differs, the class finally assigned to list5 may also differ. If the final class of list5 is the MAX class, list5 can be translated into the preset character "+", that is, "+" is used as the sub regular expression matched with list5. If the final class of list5 is the MID class, list5 can be translated into {minlen,maxlen}, where minlen is the first length information, that is, the minimum length, and maxlen is the second length information, that is, the maximum length. In this case {2,7} is used as the sub regular expression matched with list5. If the final class of list5 is the MIN class, list5 can be translated into {len}, where len is the data length of each piece of data in list5. For example, assuming that list5 is ["cbidu", "zauca", "aluca"], so that the data length of each piece of data in list5 is 5, then {5} can be used as the sub regular expression matched with list5.
In the above scheme, the certainty judgment is performed on the current data to be processed of the second data type by setting the second setting threshold and the third setting threshold, so as to generate the sub regular expression corresponding to the current data to be processed according to the judgment result, and the required regular expression can be automatically generated according to the actual data abstraction requirement.
It should be noted that, in the embodiment of the present application, both the first setting threshold and the second setting threshold may be set according to actual requirements, that is, different values may be set for different types of sample data, and accordingly, different types of regular expressions may be obtained.
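The three-way decision for variable-type data (MAX, MID, and MIN classes) can be sketched as follows. The names `length_entropy` and `variable_sub_regex` are hypothetical, the base-2 logarithm is an assumption, and the default preset character "+" and third threshold 0 follow the examples in the text.

```python
import math
from collections import Counter


def length_entropy(items):
    """Entropy of the string-length distribution (base 2 is an assumption)."""
    counts = Counter(len(s) for s in items)
    total = len(items)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def variable_sub_regex(items, second_threshold, third_threshold=0.0, preset="+"):
    """Return the quantifier-style sub regular expression for variable data."""
    h = length_entropy(items)
    lengths = [len(s) for s in items]
    if h >= second_threshold:
        # MAX class: high uncertainty, use the preset character.
        return preset
    if h > third_threshold:
        # MID class: express the range {minlen,maxlen}.
        return "{%d,%d}" % (min(lengths), max(lengths))
    # MIN class: all lengths are identical, express the exact length {len}.
    return "{%d}" % lengths[0]


list5 = ["cbidu", "za", "alucaaa"]
print(variable_sub_regex(list5, second_threshold=1.0))   # +
print(variable_sub_regex(list5, second_threshold=2.0))   # {2,7}
print(variable_sub_regex(["cbidu", "zauca", "aluca"], second_threshold=1.0))  # {5}
```

Changing only `second_threshold` moves `list5` between the MAX and MID classes, which is how different candidate expressions can arise from the same sample data.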
S345, judging whether all the data to be processed are processed completely, if so, executing S347; otherwise, S346 is performed.
S346, obtaining next data to be processed according to the data sorting sequence, updating the current data to be processed according to the next data to be processed, and returning to execute S342.
Correspondingly, after the current data to be processed generates the corresponding sub regular expressions, the next data to be processed can be obtained according to the data sorting sequence, the next data to be processed is taken as the current data to be processed, and the operation of generating the sub regular expressions matched with the current data to be processed is returned to be executed until all the data to be processed in the public data full list are processed, that is, all the data to be processed in the public data full list are correspondingly generated into the sub regular expressions.
S347, generating a plurality of regular expressions matched with the sample data list according to the sub regular expressions.
Correspondingly, after all the data to be processed in the public data full list are correspondingly generated into the regular sub-expressions, all the regular sub-expressions can be spliced according to the data sorting sequence to generate a plurality of regular expressions matched with the sample data list.
Take the public data full list list3: [list4, "://www.", list5, ".com"] in the above example, where list4 is ["http", "https", "http"] and list5 is ["cbidu", "za", "alucaaa"]. Through the settings of the first set threshold and the second set threshold, a corresponding sub regular expression can be generated for each piece of data in list3. The sub regular expressions are then spliced according to the data sorting order, and the regular expressions matched with list1 may be: (https|http)(://www.)+(.com), (https|http)(://www.){2,7}(.com), or {4,5}(://www.){2,7}(.com), etc.
According to the technical scheme, the sub-regular expressions corresponding to the data are determined according to the data type of each piece of data in the public data full list, so that common characteristics among sample data can be effectively reserved, and the required regular expressions can be automatically generated according to actual data abstraction requirements aiming at different characteristic parts.
S348, obtaining the expression composition type of the target sub regular expressions in each regular expression.
S349, identifying each target sub regular expression by using a preset identifier according to the expression composition type.
The target sub regular expressions are those sub regular expressions that need an additional identifier. The preset identifier may be set according to actual requirements, for example "\w" or "\D"; the embodiment of the present application does not limit the specific type of the preset identifier.
In the embodiment of the application, after the sub regular expression of each piece of data has been generated for the public data full list, some of the sub regular expressions can be taken as target sub regular expressions and identified further. Optionally, the expression composition type of each target sub regular expression may be determined first, and each target sub regular expression is then identified by a preset identifier according to the determined expression composition type. The expression composition type of a target sub regular expression may be, for example, that all characters of the data are letters, or that not all characters of the data are letters.
Take the public data full list list3: [list4, "://www.", list5, ".com"] in the above example. If the regular expressions generated from list3 to match list1 are (https|http)(://www.)+(.com), (https|http)(://www.){2,7}(.com), or {4,5}(://www.){2,7}(.com), the sub regular expressions generated for data of the second data type can be identified by a preset identifier, for example using "\w" to indicate that all characters in the data are letters and "\D" to indicate that not all characters in the data are letters. Accordingly, the regular expressions that finally match list1 may be: (https|http)(://www.)\w+(.com), (https|http)(://www.)\w{2,7}(.com), and \w{4,5}(://www.)\w{2,7}(.com).
By identifying each target sub regular expression by using the preset identification according to the expression composition type, the similar characteristics can be further extracted from the data of the second data type, so that the finally generated regular expression can reflect the characteristics of the sample data to the maximum extent.
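Identification of variable-type sub regular expressions and the final splicing step can be sketched as follows. The function names `identify` and `splice` are hypothetical. The choice between "\w" and "\D" follows the convention given in the text, and the escaped dots in the constant parts are an addition made here so that the result is a valid Python pattern.

```python
import re


def identify(items, quantifier):
    """Prefix a quantifier with a preset identifier.

    Uses the "\\w" identifier when every item consists only of letters,
    otherwise the "\\D" identifier, following the convention in the text.
    """
    identifier = r"\w" if all(s.isalpha() for s in items) else r"\D"
    return identifier + quantifier


def splice(sub_regexes):
    """Concatenate sub regular expressions in data sorting order."""
    return "".join(sub_regexes)


list5 = ["cbidu", "za", "alucaaa"]
regex = splice([
    "(https|http)",            # constant part from list4
    r"(://www\.)",             # constant literal
    identify(list5, "{2,7}"),  # variable part from list5, MID class
    r"(\.com)",                # constant literal
])
print(regex)
print(bool(re.fullmatch(regex, "https://www.za.com")))
```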
S350, calculating the intimacy of each regular expression by using an intimacy function.
Wherein, the affinity function can be used for calculating the affinity of each regular expression.
S360, determining the regular expression corresponding to the target intimacy as a target regular expression.
Wherein the target affinity may be the highest value of affinity.
In the embodiment of the application, after a plurality of matched regular expressions are generated for the sample data list, in order to further screen the regular expressions which meet the requirements, the intimacy degree of each regular expression can be calculated by utilizing the intimacy degree function, so that the regular expression corresponding to the intimacy degree with the maximum value is screened out and determined as the target regular expression. The screened target regular expression can meet the abstract requirement of sample data to the maximum extent.
For example, the intimacy function may be:
[Formula: Figure BDA0002671932500000161]
where regex represents the generated regular expression, intimacy represents the intimacy function, and LIST[i] represents sample data in the sample data list. i may represent the order of the sample data, e.g., LIST[1] represents the first sample data. The intimacy is calculated through f(regex) and each sample data in the sample data list, and the regular expression with the maximum value is optimal. The intimacy can be defined according to actual requirements, for example by using cosine similarity to calculate the similarity between the sample data and a regular expression as the intimacy.
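Selecting the target regular expression by maximum intimacy can be sketched as follows. This is only one possible reading: the text leaves the definition of intimacy open and mentions cosine similarity as an example, so the sketch assumes cosine similarity over character-count vectors and assumes that the per-sample scores are summed. The names `cosine_similarity`, `total_intimacy`, and `select_target` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between the character-count vectors of two strings."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def total_intimacy(regex, samples):
    """Assumed aggregate: sum of similarities between regex and each sample."""
    return sum(cosine_similarity(regex, s) for s in samples)


def select_target(candidates, samples):
    """Return the candidate regular expression with the highest intimacy."""
    return max(candidates, key=lambda r: total_intimacy(r, samples))


samples = ["http://www.cbidu.com", "https://www.za.com"]
candidates = [r"(https|http)(://www\.)\w+(\.com)",
              r"\w{4,5}(://www\.)\w{2,7}(\.com)"]
print(select_target(candidates, samples))
```

A different intimacy definition, for example one based on how many samples the expression actually matches, could be substituted for `cosine_similarity` without changing `select_target`.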
According to the technical scheme, the data type list and the public data full list are combined with the length information entropy mode to generate the regular expressions matched with the sample data list, the generation efficiency of the regular expressions can be improved, and the generation mode of the regular expressions can be enriched.
In an example, fig. 4 is a flowchart of a data extraction method provided in an embodiment of the present application, and this embodiment may be applied to a case where data is extracted and classified according to an automatically generated regular expression, and the method may be performed by a data extraction apparatus, which may be implemented by software and/or hardware, and may be generally integrated in an electronic device. The electronic device may be a computer device or the like. Accordingly, as shown in fig. 4, the method includes the following operations:
S410, acquiring data to be processed.
S420, analyzing the data to be processed, and generating a regular expression matched with the data to be processed.
The data to be processed is the original data that needs to be extracted and classified by using the regular expression.
In this embodiment of the present application, after the data to be processed is obtained, the data to be processed may be analyzed, for example, several sample data are selected from the data to be processed, and then the regular expression for data extraction of the data to be processed is generated according to the selected sample data by using the method for generating a regular expression described in any of the above embodiments.
For example, an operation interface may be provided, on which a user may provide the data to be processed and input, at a specified position of the operation interface, samples for generating a regular expression as sample data. After the background server receives an instruction for generating a regular expression sent by the user through the operation interface, the regular expression corresponding to the sample data is generated directly according to the processing logic of the regular expression generation method in any of the above embodiments, and is displayed at another specified position of the operation interface.
S430, extracting data of the data to be processed according to the generated regular expression.
Correspondingly, after the regular expression used for extracting the data to be processed is generated, the data to be processed can be extracted according to the generated regular expression so as to obtain the data meeting the requirements.
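The extraction step can be sketched with Python's standard `re` module. The function name `extract`, the log text, and the pattern are made up for illustration; the pattern has the shape of the expressions generated in the earlier example, written with a non-capturing group and escaped dots.

```python
import re


def extract(regex, data):
    """Return every substring of `data` that matches `regex`."""
    pattern = re.compile(regex)
    return [m.group(0) for m in pattern.finditer(data)]


# Hypothetical log data containing web page links to be screened out.
log = ("GET http://www.za.com 200\n"
       "GET https://www.cbidu.com 404\n"
       "POST ftp://x.org 500")
regex = r"(?:https|http)://www\.\w{2,7}\.com"

print(extract(regex, log))
# ['http://www.za.com', 'https://www.cbidu.com']
```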
The data extraction method can be applied to application scenarios of various data extraction and classification, for example, screening webpage link data from log data, or screening metaphorical sentences or ranking sentences from training corpus, and the like.
According to the method and the device, the public data tree corresponding to the sample data list is generated according to the sample data in the sample data list, the data type list is generated according to the public data tree, the regular expressions matched with the sample data list are generated according to the data type list, data to be processed are extracted and classified quickly by using the generated regular expressions, the automatic generation of the regular expressions is achieved, and the generation efficiency of the regular expressions and the extraction and classification efficiency of the data are improved.
In an example, fig. 5 is a structural diagram of an apparatus for generating a regular expression according to an embodiment of the present application, where the apparatus is implemented by software and/or hardware and is specifically configured in an electronic device. The electronic device may be a computer device or the like.
An apparatus 500 for generating a regular expression as shown in fig. 5 comprises: a sample data list acquisition module 510, a common data tree generation module 520, a data type list generation module 530, and a first regular expression generation module 540, wherein:
a sample data list obtaining module 510, configured to obtain a sample data list; the sample data list comprises a plurality of sample data;
a common data tree generating module 520, configured to generate a common data tree corresponding to the sample data list according to each sample data;
a data type list generating module 530, configured to generate a data type list according to the public data tree;
a first regular expression generating module 540, configured to generate, according to the data type list, a plurality of regular expressions matched with the sample data list.
According to the method and the device, the public data tree corresponding to the sample data list is generated according to the sample data in the obtained sample data list, the data type list is generated according to the public data tree, and therefore the regular expressions matched with the sample data list are generated according to the data type list, automatic generation of the regular expressions is achieved, and the generation efficiency of the regular expressions is improved.
Optionally, the common data tree generating module 520 is specifically configured to: taking the sample data list as a current data list; generating a current target common continuous subsequence of each sample data in the current data list through a suffix tree data structure; taking the current target public continuous subsequence as a root node of a current public data tree, and sequentially determining a temporary child node of each sample data according to the target public continuous subsequence and each sample data in the current data list; wherein the temporary child nodes include a first temporary child node and a second temporary child node; constructing a target data list according to each temporary child node, and updating the current data list according to the target data list; and returning to execute the operation of generating the current target public continuous subsequence of each sample data in the current data list through a suffix tree data structure, and updating the child nodes of the current public data tree according to the root nodes of the target data list until the current target public continuous subsequence is empty.
Optionally, the common data tree generating module 520 is specifically configured to: constructing a first target data list according to each first temporary child node; constructing a second target data list according to each second temporary child node; taking a root node of each first target data list as a first child node of the current public data tree; and taking the root node of each second target data list as a second child node of the current public data tree.
Optionally, the data type list generating module 530 is specifically configured to: generating a public data full list according to the public data tree and the sample data list; and generating the data type list according to the public data full list.
Optionally, the data type list generating module 530 is specifically configured to: traversing the public data tree, and constructing a public data intermediate list according to a traversal result; forming a corresponding sub data list according to the non-public data included in the sample data list; and expanding the public data intermediate list according to each subdata list to obtain the public data full list.
Optionally, the data type list generating module 530 is specifically configured to: determining public data of the public data full list as a first data type; calculating the length information entropy of each subdata list of the public data full list; and determining the data type of each sub data list according to the numerical relationship between the length information entropy of each sub data list and a first set threshold.
Optionally, the data type list generating module 530 is specifically configured to: determining the data type of the sub data list as a second data type under the condition that the length information entropy of the sub data list is larger than the first set threshold; and determining the data type of the sub data list as the first data type under the condition that the length information entropy of the sub data list is less than or equal to the first set threshold.
Optionally, the first regular expression generating module 540 is specifically configured to: and generating a plurality of regular expressions matched with the sample data list according to the data type list and the public data full list.
Optionally, the first regular expression generating module 540 is specifically configured to: acquiring current data to be processed of the public data full list according to a data sorting sequence; under the condition that the data type of the current data to be processed is determined to be a first data type, generating a sub regular expression matched with the current data to be processed according to the quantity of the data to be processed included in the current data to be processed; under the condition that the data type of the current data to be processed is determined to be a second data type, generating a sub regular expression matched with the current data to be processed according to the length information entropy of the current data to be processed; acquiring next data to be processed according to the data sorting sequence, and updating the current data to be processed according to the next data to be processed; returning to execute the operation of generating the sub regular expression matched with the current data to be processed until all the data to be processed of the public data full list are processed; and generating a plurality of regular expressions matched with the sample data list according to each sub regular expression.
Optionally, the first regular expression generating module 540 is specifically configured to: when the quantity of the data to be processed is determined to be a first quantity, directly taking the current data to be processed as a sub regular expression matched with the current data to be processed; and when the quantity of the data to be processed is determined to be not the first quantity, combining all data of the current data to be processed as a sub regular expression matched with the current data to be processed.
Optionally, the first regular expression generating module 540 is specifically configured to: under the condition that the length information entropy of the current data to be processed is greater than or equal to a second set threshold, taking a preset character as a sub regular expression matched with the current data to be processed; under the condition that the length information entropy of the current data to be processed is smaller than a second set threshold and larger than a third set threshold, taking the first length information and the second length information of each data of the current data to be processed as sub regular expressions matched with the current data to be processed; and under the condition that the length information entropy of the current data to be processed is smaller than or equal to the third set threshold, taking the third length information of each piece of data of the current data to be processed as a sub regular expression matched with the current data to be processed.
Optionally, the first regular expression generating module 540 is specifically configured to: obtain an expression composition type of each target sub regular expression among the sub regular expressions; and mark each target sub regular expression with a preset identifier according to the expression composition type.
Optionally, the apparatus for generating a regular expression further includes: the intimacy degree calculating module is used for calculating intimacy degree of each regular expression by utilizing an intimacy degree function; and the target regular expression determining module is used for determining the regular expression corresponding to the target intimacy as the target regular expression.
The regular expression generation device can execute the regular expression generation method provided by any embodiment of the application, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to a method for generating a regular expression provided in any embodiment of the present application.
Since the above-described regular expression generation device is a device capable of executing the regular expression generation method in the embodiment of the present application, a person skilled in the art can, based on the regular expression generation method described in the embodiment of the present application, understand the specific implementation of the regular expression generation device of the present embodiment and its various variations. How the regular expression generation device implements the regular expression generation method in the embodiment of the present application is therefore not described in detail here. Any device used by those skilled in the art to implement the regular expression generation method in the embodiments of the present application is within the scope of the present application.
In an example, fig. 6 is a structural diagram of a data extraction apparatus provided in an embodiment of the present application, and the embodiment of the present application is applicable to a case where data is extracted and classified according to an automatically generated regular expression, and the apparatus is implemented by software and/or hardware and is specifically configured in an electronic device. The electronic device may be a computer device or the like.
A data extraction apparatus 600, as shown in fig. 6, comprises: a to-be-processed data acquisition module 610, a second regular expression generation module 620 and a data extraction module 630, wherein:
a to-be-processed data acquisition module 610, configured to acquire to-be-processed data;
a second regular expression generating module 620, configured to analyze the data to be processed, and generate a regular expression matched with the data to be processed;
a data extraction module 630, configured to perform data extraction on the data to be processed according to the generated regular expression;
wherein the regular expression is generated by the regular expression generation method of any one of claims 1-13.
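The extraction and classification performed by the data extraction module 630 can be sketched as follows. This is a minimal illustration assuming that the generated regular expressions are already available as named pattern strings. The function name `extract_and_classify`, the category names and the sample pattern are hypothetical and do not come from the present application.

```python
import re


def extract_and_classify(records, patterns):
    """Classify each record under the first pattern that fully matches it.

    `patterns` maps a category name to a regular expression string.
    Returns (classified, unmatched): `classified` maps each category to
    the list of captured-group tuples extracted from matching records.
    """
    compiled = {name: re.compile(p) for name, p in patterns.items()}
    classified = {name: [] for name in patterns}
    unmatched = []
    for record in records:
        for name, regex in compiled.items():
            match = regex.fullmatch(record)
            if match:
                # Extraction: keep the captured fields of the record.
                classified[name].append(match.groups())
                break
        else:
            # No pattern matched this record.
            unmatched.append(record)
    return classified, unmatched
```

For example, with the hypothetical pattern `(GET|POST) (/\S*) (\d{3})` registered under the name `access`, the record `GET /index.html 200` is classified as `access` and its three fields are extracted, while a record that fits no pattern is returned in the unmatched list.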
In the embodiments of the present application, a public data tree corresponding to the sample data list is generated from the sample data in the sample data list, a data type list is generated from the public data tree, and a plurality of regular expressions matched with the sample data list are generated from the data type list. The generated regular expressions are then used to extract and classify the data to be processed quickly. Regular expressions are thus generated automatically, which improves both the efficiency of regular expression generation and the efficiency of data extraction and classification.
The data extraction device can execute the data extraction method provided by any embodiment of the application, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to a data extraction method provided in any embodiment of the present application.
Since the data extraction device described above is a device capable of executing the data extraction method in the embodiment of the present application, based on the data extraction method described in the embodiment of the present application, a person skilled in the art can understand the specific implementation of the data extraction device of the present embodiment and various variations thereof, and therefore, how to implement the data extraction method in the embodiment of the present application by the data extraction device is not described in detail herein. The device used by those skilled in the art to implement the data extraction method in the embodiments of the present application is within the scope of the present application.
In one example, the present application also provides an electronic device and a readable storage medium.
Fig. 7 is a block diagram of an electronic device for implementing a regular expression generation method or a data extraction method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 7, the electronic device includes: one or more processors 701, a memory 702, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 7, one processor 701 is taken as an example.
The memory 702 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the regular expression generation method or the data extraction method provided by the application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the generation method or the data extraction method of the regular expression provided in the present application.
The memory 702, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the regular expression generation method or the data extraction method in the embodiments of the present application (for example, the sample data list acquisition module 510, the common data tree generation module 520, the data type list generation module 530, and the first regular expression generation module 540 shown in fig. 5, or the to-be-processed data acquisition module 610, the second regular expression generation module 620, and the data extraction module 630 shown in fig. 6). The processor 701 executes various functional applications and data processing of the server, that is, implements the generation method of the regular expression or the data extraction method in the above-described method embodiments, by running the non-transitory software program, instructions, and modules stored in the memory 702.
The memory 702 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by use of an electronic device that implements a generation method or a data extraction method of a regular expression, or the like. Further, the memory 702 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 702 optionally includes memory located remotely from the processor 701, and such remote memory may be connected over a network to an electronic device implementing the regular expression generation method or the data extraction method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device implementing the regular expression generation method or the data extraction method may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or other means, and fig. 7 illustrates an example of a connection by a bus.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device implementing the regular expression generation method or the data extraction method. Examples of the input device include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, and a joystick. The output device 704 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. The client may be a smart phone, a notebook computer, a desktop computer, a tablet computer, a smart speaker, etc., but is not limited thereto. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud computing, cloud service, a cloud database, cloud storage and the like. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In the embodiments of the present application, a public data tree corresponding to the sample data list is generated from the sample data in the sample data list, a data type list is generated from the public data tree, and a plurality of regular expressions matched with the sample data list are generated from the data type list. The generated regular expressions are then used to extract and classify the data to be processed quickly. Regular expressions are thus generated automatically, which improves both the efficiency of regular expression generation and the efficiency of data extraction and classification.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (30)

1. A method for generating a regular expression comprises the following steps:
acquiring a sample data list; the sample data list comprises a plurality of sample data;
generating a public data tree corresponding to the sample data list according to each sample data;
generating a data type list according to the public data tree;
and generating a plurality of regular expressions matched with the sample data list according to the data type list.
2. The method of claim 1, wherein said generating a common data tree corresponding to said list of sample data from each said sample data comprises:
taking the sample data list as a current data list;
generating a current target common continuous subsequence of each sample data in the current data list through a suffix tree data structure;
taking the current target public continuous subsequence as a root node of a current public data tree, and sequentially determining a temporary child node of each sample data according to the target public continuous subsequence and each sample data in the current data list; wherein the temporary child nodes include a first temporary child node and a second temporary child node;
constructing a target data list according to each temporary child node, and updating the current data list according to the target data list;
and returning to execute the operation of generating the current target public continuous subsequence of each sample data in the current data list through a suffix tree data structure, and updating the child nodes of the current public data tree according to the root nodes of the target data list until the current target public continuous subsequence is empty.
3. The method of claim 2, wherein said constructing a target data list according to each temporary child node comprises:
constructing a first target data list according to each first temporary child node; and
constructing a second target data list according to each second temporary child node;
the updating child nodes of the current public data tree according to the root node of the target data list includes:
taking a root node of each first target data list as a first child node of the current public data tree;
and taking the root node of each second target data list as a second child node of the current public data tree.
4. The method of claim 1, wherein the generating a list of data types from the common data tree comprises:
generating a public data full list according to the public data tree and the sample data list;
and generating the data type list according to the public data full list.
5. The method of claim 4, wherein said generating a common full list of data from said common data tree and said sample data list comprises:
traversing the public data tree, and constructing a public data intermediate list according to a traversal result;
forming a corresponding sub data list according to the non-public data included in the sample data list;
and expanding the public data intermediate list according to each subdata list to obtain the public data full list.
6. The method of claim 4 or 5, wherein the generating the list of data types from the full list of common data comprises:
determining public data of the public data full list as a first data type;
calculating the length information entropy of each subdata list of the public data full list;
and determining the data type of each sub data list according to the numerical relationship between the length information entropy of each sub data list and a first set threshold.
7. The method of claim 6, wherein the determining the data type of each sub data list according to the numerical relationship between the length information entropy of each sub data list and a first set threshold comprises:
determining the data type of the sub data list as a second data type under the condition that the length information entropy of the sub data list is larger than the first set threshold;
and determining the data type of the sub data list as the first data type under the condition that the length information entropy of the sub data list is less than or equal to the first set threshold.
8. The method of claim 1, wherein said generating a plurality of regular expressions from said list of data types that match said list of sample data comprises:
and generating a plurality of regular expressions matched with the sample data list according to the data type list and the public data full list.
9. The method of claim 8, wherein said generating a plurality of regular expressions matching the sample data list from the list of data types and a full list of common data comprises:
acquiring current data to be processed of the public data full list according to a data sorting sequence;
under the condition that the data type of the current data to be processed is determined to be a first data type, generating a sub regular expression matched with the current data to be processed according to the quantity of the data to be processed included in the current data to be processed;
under the condition that the data type of the current data to be processed is determined to be a second data type, generating a sub regular expression matched with the current data to be processed according to the length information entropy of the current data to be processed;
acquiring next data to be processed according to the data sorting sequence, and updating the current data to be processed according to the next data to be processed;
returning to execute the operation of generating the sub regular expression matched with the current data to be processed until all the data to be processed of the public data full list are processed;
and generating a plurality of regular expressions matched with the sample data list according to each sub regular expression.
10. The method according to claim 9, wherein the generating a sub regular expression matched with the current data to be processed according to the quantity of the data to be processed included in the current data to be processed comprises:
when the quantity of the data to be processed is determined to be a first quantity, directly taking the current data to be processed as a sub regular expression matched with the current data to be processed;
and when the quantity of the data to be processed is determined to be not the first quantity, combining all data of the current data to be processed as a sub regular expression matched with the current data to be processed.
11. The method according to claim 9, wherein the generating a sub regular expression matched with the current data to be processed according to the length information entropy of the current data to be processed comprises:
under the condition that the length information entropy of the current data to be processed is greater than or equal to a second set threshold, taking a preset character as a sub regular expression matched with the current data to be processed;
under the condition that the length information entropy of the current data to be processed is smaller than a second set threshold and larger than a third set threshold, taking the first length information and the second length information of each data of the current data to be processed as sub regular expressions matched with the current data to be processed;
and under the condition that the length information entropy of the current data to be processed is smaller than or equal to the third set threshold, taking the third length information of each piece of data of the current data to be processed as a sub regular expression matched with the current data to be processed.
12. The method of claim 11, further comprising:
obtaining an expression composition type of a target sub regular expression in each sub regular expression;
and identifying each target sub regular expression by using a preset identification according to the expression composition type.
13. The method of claim 1, further comprising:
calculating the intimacy of each regular expression by utilizing an intimacy function;
and determining the regular expression corresponding to the target intimacy as a target regular expression.
14. A method of data extraction, comprising:
acquiring data to be processed;
analyzing the data to be processed to generate a regular expression matched with the data to be processed;
performing data extraction on the data to be processed according to the generated regular expression;
wherein the regular expression is generated by the regular expression generation method of any one of claims 1-13.
15. An apparatus for generating a regular expression, comprising:
the sample data list acquisition module is used for acquiring a sample data list; the sample data list comprises a plurality of sample data;
the public data tree generating module is used for generating a public data tree corresponding to the sample data list according to each sample data;
the data type list generating module is used for generating a data type list according to the public data tree;
and the first regular expression generating module is used for generating a plurality of regular expressions matched with the sample data list according to the data type list.
16. The apparatus of claim 15, wherein the common data tree generation module is specifically configured to:
taking the sample data list as a current data list;
generating a current target common continuous subsequence of each sample data in the current data list through a suffix tree data structure;
taking the current target public continuous subsequence as a root node of a current public data tree, and sequentially determining a temporary child node of each sample data according to the target public continuous subsequence and each sample data in the current data list; wherein the temporary child nodes include a first temporary child node and a second temporary child node;
constructing a target data list according to each temporary child node, and updating the current data list according to the target data list;
and returning to execute the operation of generating the current target public continuous subsequence of each sample data in the current data list through a suffix tree data structure, and updating the child nodes of the current public data tree according to the root nodes of the target data list until the current target public continuous subsequence is empty.
17. The apparatus of claim 16, wherein the common data tree generation module is specifically configured to:
constructing a first target data list according to each first temporary child node; and
constructing a second target data list according to each second temporary child node;
taking a root node of each first target data list as a first child node of the current public data tree;
and taking the root node of each second target data list as a second child node of the current public data tree.
18. The apparatus according to claim 15, wherein the data type list generation module is specifically configured to:
generating a public data full list according to the public data tree and the sample data list;
and generating the data type list according to the public data full list.
19. The apparatus of claim 18, wherein the data type list generation module is specifically configured to:
traversing the public data tree, and constructing a public data intermediate list according to a traversal result;
forming a corresponding sub data list according to the non-public data included in the sample data list;
and expanding the public data intermediate list according to each subdata list to obtain the public data full list.
20. The apparatus according to claim 18 or 19, wherein the data type list generation module is specifically configured to:
determining public data of the public data full list as a first data type;
calculating the length information entropy of each subdata list of the public data full list;
and determining the data type of each sub data list according to the numerical relationship between the length information entropy of each sub data list and a first set threshold.
21. The apparatus according to claim 20, wherein the data type list generation module is specifically configured to:
determining the data type of the sub data list as a second data type under the condition that the length information entropy of the sub data list is larger than the first set threshold;
and determining the data type of the sub data list as the first data type under the condition that the length information entropy of the sub data list is less than or equal to the first set threshold.
22. The apparatus according to claim 15, wherein the first regular expression generation module is specifically configured to:
and generating a plurality of regular expressions matched with the sample data list according to the data type list and the public data full list.
23. The apparatus of claim 22, wherein the first regular expression generation module is specifically configured to:
acquiring current data to be processed of the public data full list according to a data sorting sequence;
under the condition that the data type of the current data to be processed is determined to be a first data type, generating a sub regular expression matched with the current data to be processed according to the quantity of the data to be processed included in the current data to be processed;
under the condition that the data type of the current data to be processed is determined to be a second data type, generating a sub regular expression matched with the current data to be processed according to the length information entropy of the current data to be processed;
acquiring next data to be processed according to the data sorting sequence, and updating the current data to be processed according to the next data to be processed;
returning to execute the operation of generating the sub regular expression matched with the current data to be processed until all the data to be processed of the public data full list are processed;
and generating a plurality of regular expressions matched with the sample data list according to each sub regular expression.
24. The apparatus according to claim 23, wherein the first regular expression generation module is specifically configured to:
when the quantity of the data to be processed is determined to be a first quantity, directly taking the current data to be processed as a sub regular expression matched with the current data to be processed;
and when the quantity of the data to be processed is determined to be not the first quantity, combining all data of the current data to be processed as a sub regular expression matched with the current data to be processed.
25. The apparatus according to claim 23, wherein the first regular expression generation module is specifically configured to:
under the condition that the length information entropy of the current data to be processed is greater than or equal to a second set threshold, taking a preset character as a sub regular expression matched with the current data to be processed;
under the condition that the length information entropy of the current data to be processed is smaller than a second set threshold and larger than a third set threshold, taking the first length information and the second length information of each data of the current data to be processed as sub regular expressions matched with the current data to be processed;
and under the condition that the length information entropy of the current data to be processed is smaller than or equal to the third set threshold, taking the third length information of each piece of data of the current data to be processed as a sub regular expression matched with the current data to be processed.
26. The apparatus according to claim 25, wherein the first regular expression generation module is specifically configured to:
obtaining an expression composition type of a target sub regular expression in each sub regular expression;
and identifying each target sub regular expression by using a preset identification according to the expression composition type.
27. The apparatus of claim 15, further comprising:
an intimacy degree calculating module, configured to calculate the intimacy degree of each regular expression by using an intimacy degree function;
and a target regular expression determining module, configured to determine the regular expression corresponding to a target intimacy degree as the target regular expression.
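The selection step of claim 27 can be illustrated with the Python sketch below. The claim does not define the intimacy degree function in this section, so the sketch substitutes a placeholder score: the fraction of samples that a candidate matches in full. Treating the target intimacy degree as the maximum score is also an assumption, and the names `intimacy` and `select_target` are hypothetical.

```python
import re
from typing import Callable, Sequence


def intimacy(pattern: str, samples: Sequence[str]) -> float:
    """Placeholder score: share of samples fully matched by the pattern."""
    compiled = re.compile(pattern)
    matched = sum(1 for s in samples if compiled.fullmatch(s))
    return matched / len(samples)


def select_target(
    patterns: Sequence[str],
    samples: Sequence[str],
    score: Callable[[str, Sequence[str]], float] = intimacy,
) -> str:
    """Return the candidate with the highest score (first one wins ties)."""
    return max(patterns, key=lambda p: score(p, samples))


samples = ["123", "4567"]
best = select_target([r"\d{3}", r"\d+"], samples)
```

A coverage-only score cannot distinguish a precise expression from an overly general one such as `.*`, so a practical function would also need to penalise generality. That refinement is outside this sketch.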
28. A data extraction apparatus, comprising:
a to-be-processed data acquisition module, configured to acquire data to be processed;
a second regular expression generation module, configured to analyze the data to be processed and generate a regular expression matching the data to be processed;
a data extraction module, configured to extract data from the data to be processed according to the generated regular expression;
wherein the regular expression is generated by the regular expression generation method of any one of claims 1-13.
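The acquire, generate and extract flow of claim 28 can be sketched in Python as follows. The generator shown here is a deliberately simple stand-in and is not the generation method of claims 1-13: it turns every run of digits in one sample into a capture group and keeps all other text as a literal. The function names `generate_regex` and `extract` are hypothetical.

```python
import re
from typing import List, Optional, Sequence, Tuple


def generate_regex(sample: str) -> str:
    """Stand-in generator: digit runs become groups, other text is literal."""
    # With one capture group, re.split alternates literal text (even indices)
    # and captured digit runs (odd indices).
    parts = re.split(r"(\d+)", sample)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            pieces.append(r"(\d+)")
        else:
            pieces.append(re.escape(part))
    return "".join(pieces)


def extract(
    records: Sequence[str], pattern: str
) -> List[Optional[Tuple[str, ...]]]:
    """Apply the generated expression to each record; None means no match."""
    compiled = re.compile(pattern)
    results: List[Optional[Tuple[str, ...]]] = []
    for record in records:
        match = compiled.fullmatch(record)
        results.append(match.groups() if match else None)
    return results


pattern = generate_regex("order 12 cost 30")
rows = extract(["order 7 cost 115", "refund 3"], pattern)
```

Here the first record yields the extracted fields `("7", "115")`, while the second record does not fit the generated expression and yields `None`.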
29. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of generating a regular expression of any one of claims 1-13 or to perform the method of data extraction of claim 14.
30. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the method of generating a regular expression of any one of claims 1 to 13 or the method of extracting data of claim 14.
CN202010935977.8A 2020-09-08 2020-09-08 Regular expression generation and data extraction methods, devices, equipment and media Active CN112115313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010935977.8A CN112115313B (en) 2020-09-08 2020-09-08 Regular expression generation and data extraction methods, devices, equipment and media

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010935977.8A CN112115313B (en) 2020-09-08 2020-09-08 Regular expression generation and data extraction methods, devices, equipment and media

Publications (2)

Publication Number Publication Date
CN112115313A (en) 2020-12-22
CN112115313B (en) 2023-07-28

Family

ID=73802612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010935977.8A Active CN112115313B (en) 2020-09-08 2020-09-08 Regular expression generation and data extraction methods, devices, equipment and media

Country Status (1)

Country Link
CN (1) CN112115313B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609459A (en) * 2012-01-12 2012-07-25 神州数码网络(北京)有限公司 Method and device for string matching based on regular expression
CN105512105A (en) * 2015-12-07 2016-04-20 百度在线网络技术(北京)有限公司 Semantic parsing method and device
CN105868166A (en) * 2015-01-22 2016-08-17 阿里巴巴集团控股有限公司 Regular expression generation method and system
CN107992481A (en) * 2017-12-25 2018-05-04 中科鼎富(北京)科技发展有限公司 A kind of matching regular expressions method, apparatus and system based on multiway tree
US20180268081A1 (en) * 2015-01-28 2018-09-20 British Telecommunications Public Limited Company Data extraction
CN109783819A (en) * 2019-01-18 2019-05-21 广东小天才科技有限公司 A kind of generation method and system of regular expression
CN111222022A (en) * 2020-01-15 2020-06-02 奇安信科技集团股份有限公司 Regular expression-based matching method and device
US20200210467A1 (en) * 2018-12-26 2020-07-02 Oath Inc. Template generation using directed acyclic word graphs

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343715A (en) * 2021-06-29 2021-09-03 深圳前海微众银行股份有限公司 Method, device and equipment for automatically generating regular expression and storage medium
CN114741469A (en) * 2022-04-11 2022-07-12 上海弘玑信息技术有限公司 Regular expression generation method and electronic equipment
CN115269939A (en) * 2022-09-28 2022-11-01 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Regular expression generation method and device, intelligent terminal and computer storage medium
CN115269939B (en) * 2022-09-28 2023-02-17 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Regular expression generation method and device, intelligent terminal and computer storage medium

Also Published As

Publication number Publication date
CN112115313B (en) 2023-07-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant