WO2023185377A1

WO2023185377A1 - Multi-granularity data pattern mining method and related device

Info

Publication number: WO2023185377A1
Application number: PCT/CN2023/079655
Authority: WO
Inventors: 魏子恒; 郝诗源; 龙江; 吕红
Original assignee: 华为云计算技术有限公司
Priority date: 2022-03-30
Filing date: 2023-03-03
Publication date: 2023-10-05
Also published as: CN116932604A

Abstract

Provided in the present application are a multi-granularity data pattern mining method and a related device. The method comprises: reading data to be processed, and performing multi-granularity pattern mining on the data to be processed; according to a multi-granularity pattern mining result, generating a multi-granularity data pattern corresponding to the data to be processed; outputting and displaying the multi-granularity data pattern corresponding to the data to be processed, wherein the multi-granularity data pattern comprises a basic pattern corresponding to the data to be processed, the basic pattern comprises a first-level data pattern and a second-level data pattern, and each level of data pattern comprises a data pattern sample, the amount of data matching the data pattern sample, and the proportion of the data in the data to be processed. The method can enrich the mining granularities of a data pattern, helps a user to comprehensively and effectively recognize data features, and can display, in multiple dimensions, the data features of the data and service insights.

Description

A multi-granularity data pattern mining method and related equipment

This application claims priority to the Chinese patent application filed with the China Patent Office on March 30, 2022, with application number 2022103260772 and the application title "A multi-granularity data pattern mining method and related equipment", the entire content of which is incorporated by reference in this application.

Technical field

The present invention relates to the technical field of data mining, and in particular to a multi-granularity data pattern mining method and related equipment.

Background

A data pattern is an important means of displaying data content and reflecting the distribution of that content. It is an important component of products for data preparation, data asset management, and data warehousing (extract-transform-load, ETL), and it is also an important basis for intelligent data governance algorithms such as automatic ETL, data feature extraction, and operator recommendation.

Pattern mining (PM) is the main technical means of obtaining data patterns. Currently, most data governance vendors, such as Informatica, Trifacta, and Talend, provide pattern mining functions and integrate them into multiple modules, such as data preparation, data cleaning, data catalog, and data overview, to mine the data patterns of the acquired data. This helps users complete major data management tasks such as data content analysis, data cleaning, data format conversion, and data integration. However, current pattern mining algorithms support only basic patterns and a few special data patterns (such as date and address templates) and cannot identify other content characteristics and business characteristics of the data. For encoded data in particular, existing pattern mining algorithms cannot identify the encoding features at all, so it is difficult to provide users with multi-dimensional data insights and business insights.

Therefore, how to enrich the mining granularity of data patterns, help users comprehensively and effectively identify the characteristics of data, and display the data characteristics and business insights of data in multiple dimensions is an urgent problem that needs to be solved.

Summary of the invention

Embodiments of the present invention disclose a multi-granularity data pattern mining method and related equipment, which can enrich the mining granularity of data patterns, help users comprehensively and effectively identify data characteristics, and display data characteristics and business insights of data in multiple dimensions.

In a first aspect, the present application provides a multi-granularity data pattern mining method, which includes: reading data to be processed, and performing multi-granularity pattern mining on the data to be processed; generating, according to the multi-granularity pattern mining result, a multi-granularity data pattern corresponding to the data to be processed; and outputting and displaying the multi-granularity data pattern corresponding to the data to be processed, wherein the multi-granularity data pattern includes a basic pattern corresponding to the data to be processed, the basic pattern includes a first-level data pattern and a second-level data pattern, and each level of data pattern includes a data pattern sample, the number of data items matching the data pattern sample, and their proportion in the data to be processed.

In the solution provided by this application, when the data processing system processes data, it is not limited to mining basic patterns. Instead, it mines the data at multiple granularities across multiple dimensions to obtain the data patterns at different levels and displays them to users. This helps users comprehensively and effectively identify the characteristics of the data, so that they can subsequently perform data cleaning, format conversion, data integration, and other work based on those characteristics.

In conjunction with the first aspect, in a possible implementation of the first aspect, the first-level data pattern includes at least one sub-level data pattern, the mining granularity of the at least one sub-level data pattern is smaller than that of the first-level data pattern, and each sub-level data pattern in the at least one sub-level data pattern has a common substring with the first-level data pattern.

In the solution provided by this application, while mining multi-granularity patterns in the data, the data processing system can mine the data pattern at a given level more deeply based on common substrings, thereby obtaining the sub-level data patterns under the data pattern at that level. These sub-level data patterns display the data characteristics in more detail and help users better identify the data.

In conjunction with the first aspect, in a possible implementation of the first aspect, based on the multi-granularity data pattern, each level of data pattern in the multi-granularity data pattern is retrieved and matched through a knowledge base, where the knowledge base includes regular expressions corresponding to different business patterns; according to the retrieval results, a multi-granularity business pattern corresponding to the data to be processed is output and displayed, where the multi-granularity business pattern includes business patterns at multiple levels, each level of business pattern matches one data pattern in the multi-granularity data pattern, and each level of business pattern corresponds to a business insight.

In the solution provided by this application, regular expressions corresponding to different business patterns are stored in the knowledge base, which makes it possible to provide corresponding business insights for the data. After the data processing system mines the multi-granularity data pattern of the data, it can retrieve and match the data pattern at each level against the knowledge base to determine the business pattern corresponding to that level. Finally, the multi-granularity business pattern corresponding to the data is obtained, helping users identify the business meaning expressed by the data.
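The knowledge base lookup can be illustrated with a minimal sketch. The knowledge base contents, the label for the date-time entry, the function name `match_business_pattern`, and the `min_share` parameter are all hypothetical. The source only states that the knowledge base holds regular expressions for business patterns, so matching sample values against those expressions is an assumption about how retrieval might work.

```python
import re

# Hypothetical knowledge base: business pattern label -> regular expression.
# The entries are illustrative and not taken from the application.
KNOWLEDGE_BASE = {
    "{year}{month}{date}{hour}{minute}{second}": re.compile(
        r"(19|20)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])"
        r"([01]\d|2[0-3])[0-5]\d[0-5]\d"
    ),
    "{year}{month}{date}": re.compile(
        r"(19|20)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])"
    ),
}


def match_business_pattern(samples, min_share=1.0):
    """Return the first label whose regex fully matches at least
    `min_share` of the sample values, or None if nothing matches."""
    if not samples:
        return None
    for label, regex in KNOWLEDGE_BASE.items():
        hits = sum(1 for s in samples if regex.fullmatch(s))
        if hits / len(samples) >= min_share:
            return label
    return None


timestamps = ["20200625053258", "20200412132640", "20210421185832"]
print(match_business_pattern(timestamps))
```

In this sketch the same lookup would be repeated for the sample values under each level of the multi-granularity data pattern, giving one business label per level.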

Combined with the first aspect, in a possible implementation of the first aspect, the data to be processed is parsed to obtain the basic pattern corresponding to the data to be processed; a common substring mining algorithm is used to iteratively mine common substrings from the data to be processed that has the same basic pattern; based on the common substrings obtained after each mining iteration, atomic patterns corresponding to the data to be processed are generated; and the atomic patterns corresponding to the data to be processed are merged to obtain the multi-granularity data pattern corresponding to the data to be processed.

In the solution provided by this application, the data processing system first parses the data to obtain the basic pattern, then uses the common substring mining algorithm to iteratively mine common substrings from data with the same basic pattern, generates the corresponding atomic patterns from the common substrings mined in each iteration, and finally merges the atomic patterns to obtain the multi-granularity data pattern. This achieves finer-grained and deeper mining of the data and therefore displays the data characteristics more comprehensively.

Combined with the first aspect, in a possible implementation of the first aspect, based on a suffix array of the data to be processed that has the same basic pattern, substrings whose occurrence frequency is greater than a preset threshold are obtained; all substrings whose occurrence frequency is greater than the preset threshold are then filtered to determine the common substrings generated after each mining iteration.

In the solution provided by this application, the data processing system mines common substrings from data with the same basic pattern based on a suffix array, thereby finding the common substrings obtained after each mining iteration. This achieves multi-granularity mining of the data, after which the corresponding multi-granularity data pattern is displayed.
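A minimal sketch of suffix-based frequent substring mining follows. It builds a naive sorted list of all suffixes (a simple generalized suffix array, not an optimized construction) and counts how many rows contain each candidate substring by binary search. The function names, the `min_len` parameter, and the filter rule (keep only substrings not contained in a longer frequent substring) are assumptions, since the source does not specify how the filtering is done.

```python
from bisect import bisect_left

MAX_CHAR = chr(0x10FFFF)  # sentinel that sorts after any ordinary character


def build_suffix_array(values):
    """Sorted list of (suffix, row index) over all rows (naive construction)."""
    return sorted((v[i:], row) for row, v in enumerate(values) for i in range(len(v)))


def support(suffixes, sub):
    """Number of distinct rows containing `sub`: all suffixes that start
    with `sub` form one contiguous block in the sorted list."""
    lo = bisect_left(suffixes, (sub, -1))
    hi = bisect_left(suffixes, (sub + MAX_CHAR, -1))
    return len({row for _, row in suffixes[lo:hi]})


def frequent_substrings(values, threshold, min_len=2):
    """Substrings found in more than `threshold` rows, filtered (assumed
    rule) to those not contained in a longer frequent substring."""
    suffixes = build_suffix_array(values)
    candidates = {s[:k] for s, _ in suffixes for k in range(min_len, len(s) + 1)}
    counted = {c: support(suffixes, c) for c in candidates}
    frequent = {c: n for c, n in counted.items() if n > threshold}
    return {c: n for c, n in frequent.items()
            if not any(c != d and c in d for d in frequent)}


codes = ["CSBI0005568", "CSBI0008729", "BMI0002930", "BMI0003187"]
print(frequent_substrings(codes, threshold=1, min_len=3))
```

The candidate enumeration is quadratic in the string length, which is acceptable for short codes such as these. A production implementation would more likely derive frequent substrings from a longest-common-prefix array.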

In conjunction with the first aspect, in a possible implementation of the first aspect, the data to be processed that has the same basic pattern is aligned and compared position by position to check whether the data at the same positions is the same; the common substrings generated after each mining iteration are determined based on the comparison results.

In the solution provided by this application, for data with the same basic pattern, it is assumed that common substrings appear at the same positions. The comparison is performed under this assumption, and all common substrings can be identified from the comparison results, thereby achieving multi-granularity mining of the data and then displaying the multi-granularity data pattern corresponding to the data.
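The position-by-position comparison can be sketched as follows, assuming that values with the same basic pattern have equal length. The helper names `char_class` and `aligned_pattern` and the output notation are illustrative. Positions where every value carries the same character become literal tokens, and the remaining positions are summarized by character class and run length.

```python
from itertools import groupby


def char_class(ch):
    """Coarse character class label (the labels are illustrative)."""
    if ch.isdigit():
        return "number"
    if ch.isascii() and ch.isupper():
        return "English uppercase"
    if ch.isascii() and ch.islower():
        return "English lowercase"
    return "symbol"


def aligned_pattern(values):
    """Compare equal-length values column by column and emit literal
    tokens for constant positions and class tokens for varying ones."""
    if len({len(v) for v in values}) != 1:
        raise ValueError("values with the same basic pattern must have equal length")
    # One cell per position: (is constant?, character class, first row's character)
    cells = [(len(set(col)) == 1, char_class(col[0]), col[0]) for col in zip(*values)]
    tokens = []
    for (constant, label), run in groupby(cells, key=lambda c: (c[0], c[1])):
        run = list(run)
        if constant:
            tokens.append("{" + "".join(ch for _, _, ch in run) + "}")
        else:
            tokens.append(f"{{{label}}}[{len(run)}]")
    return "".join(tokens)


print(aligned_pattern(["CSBI0005568", "CSBI0008729"]))
print(aligned_pattern(["BMI0002930", "BMI0003187"]))
```

With only a few rows, a varying position can look constant by coincidence, so a real implementation would presumably apply a frequency threshold per position rather than require exact agreement.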

Combined with the first aspect, in a possible implementation of the first aspect, a frequent itemset tree (FP-tree) is constructed based on the data to be processed that has the same basic pattern; according to the FP-tree, the common substrings generated after each mining iteration are determined.

In the solution provided by this application, the data processing system builds an FP-tree for data with the same basic pattern based on association relationships, then finds the common substrings and generates the substring set by means of the common substring processing procedure. This achieves multi-granularity mining of the data, after which the multi-granularity data pattern corresponding to the data is displayed.

In combination with the first aspect, in a possible implementation of the first aspect, the edit distance between any two atomic patterns among all the atomic patterns is calculated based on a dynamic programming algorithm; the atomic patterns corresponding to the data to be processed are merged according to the edit distance calculation results and a preset merging strategy.

In the solution provided by this application, the data processing system uses a dynamic programming algorithm to calculate the edit distance between any two atomic patterns and then merges the atomic patterns according to the preset merging strategy to generate the final multi-granularity data pattern. Similar data patterns are thereby merged, which makes the final display more concise and better helps users identify the characteristics of the data.
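The edit distance step can be sketched with the classic dynamic programming table, applied here to atomic patterns represented as token lists. The token representation, the `max_distance` parameter, and the merging rule (join differing literal tokens with "/") are assumptions chosen to reproduce the {CSBI/BMI} example in this description. The source does not define the preset merging strategy.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[-1][-1]


def merge_pair(a, b, max_distance=1):
    """Assumed strategy: merge equal-length patterns within `max_distance`
    by joining differing literal tokens with '/'. Returns None otherwise."""
    if len(a) != len(b) or edit_distance(a, b) > max_distance:
        return None
    merged = []
    for x, y in zip(a, b):
        if x == y:
            merged.append(x)
        elif x.endswith("}") and y.endswith("}"):  # both are literal tokens
            merged.append("{" + x[1:-1] + "/" + y[1:-1] + "}")
        else:
            return None
    return merged


p1 = ["{CSBI}", "{000}", "{number}[4]"]
p2 = ["{BMI}", "{000}", "{number}[4]"]
print(edit_distance(p1, p2), "".join(merge_pair(p1, p2)))
```

Computing the distance on tokens rather than on raw characters keeps a single differing literal at distance 1 regardless of its length, which is one plausible way to decide that two atomic patterns are similar enough to merge.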

Combined with the first aspect, in a possible implementation of the first aspect, a context-free grammar (CFG) is used to parse the data to be processed to obtain a regular expression corresponding to the basic pattern; the basic pattern corresponding to the data to be processed is generated according to the regular expression corresponding to the basic pattern.

In the solution provided by this application, the data processing system parses the data based on the CFG, obtains the regular expression corresponding to the basic pattern, and then generates the basic pattern corresponding to the data. This completes the preliminary mining of the data and prepares for the subsequent multi-granularity mining.

In a second aspect, this application provides a multi-granularity data pattern mining device, including: a reading and parsing module, configured to read data to be processed; a processing module, configured to perform multi-granularity pattern mining on the data to be processed and to generate, according to the multi-granularity pattern mining result, a multi-granularity data pattern corresponding to the data to be processed; and an output display module, configured to output and display the multi-granularity data pattern corresponding to the data to be processed, wherein the multi-granularity data pattern includes a basic pattern corresponding to the data to be processed, the basic pattern includes a first-level data pattern and a second-level data pattern, and each level of data pattern includes a data pattern sample, the number of data items matching the data pattern sample, and their proportion in the data to be processed.

In conjunction with the second aspect, in a possible implementation of the second aspect, the first-level data pattern includes at least one sub-level data pattern, the mining granularity of the at least one sub-level data pattern is smaller than that of the first-level data pattern, and each sub-level data pattern in the at least one sub-level data pattern has a common substring with the first-level data pattern.

In conjunction with the second aspect, in a possible implementation of the second aspect, the processing module is further configured to: based on the multi-granularity data pattern, retrieve and match each level of data pattern in the multi-granularity data pattern through a knowledge base, where the knowledge base includes regular expressions corresponding to different business patterns; and, according to the retrieval results, output and display a multi-granularity business pattern corresponding to the data to be processed, where the multi-granularity business pattern includes business patterns at multiple levels, each level of business pattern matches one data pattern in the multi-granularity data pattern, and each level of business pattern corresponds to a business insight.

In conjunction with the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to: parse the data to be processed to obtain the basic pattern corresponding to the data to be processed; use a common substring mining algorithm to iteratively mine common substrings from the data to be processed that has the same basic pattern; generate, based on the common substrings obtained after each mining iteration, atomic patterns corresponding to the data to be processed; and merge the atomic patterns corresponding to the data to be processed to obtain the multi-granularity data pattern corresponding to the data to be processed.

In conjunction with the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to: obtain, based on a suffix array of the data to be processed that has the same basic pattern, substrings whose occurrence frequency is greater than a preset threshold; and filter all substrings whose occurrence frequency is greater than the preset threshold to determine the common substrings generated after each mining iteration.

In conjunction with the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to: align the data to be processed that has the same basic pattern and compare it position by position to check whether the data at the same positions is the same; and determine the common substrings generated after each mining iteration based on the comparison results.

In conjunction with the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to: construct a frequent itemset tree (FP-tree) based on the data to be processed that has the same basic pattern; and determine, according to the FP-tree, the common substrings generated after each mining iteration.

In conjunction with the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to: calculate the edit distance between any two atomic patterns among all the atomic patterns based on a dynamic programming algorithm; and merge the atomic patterns corresponding to the data to be processed according to the edit distance calculation results and a preset merging strategy.

In conjunction with the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to: use a context-free grammar CFG to parse the data to be processed to obtain a regular expression corresponding to the basic pattern; According to the regular expression corresponding to the basic pattern, a basic pattern corresponding to the data to be processed is generated.

In a third aspect, the present application provides a computing device. The computing device includes a processor and a memory that are connected through an internal bus. Instructions are stored in the memory, and the processor calls the instructions in the memory to execute the method provided in the first aspect and in any implementation of the first aspect.

In a fourth aspect, the present application provides a computer storage medium that stores a computer program. When the computer program is executed by a processor, the process of the method provided in the first aspect and in any implementation of the first aspect can be implemented.

In a fifth aspect, the present application provides a computer program product that includes instructions. When the computer program product is executed by a computer, the computer can execute the process of the method provided in the first aspect and in any implementation of the first aspect.

Description of drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.

Figure 1 is a schematic diagram of a multi-granularity data pattern provided by an embodiment of the present application;

Figure 2 is a schematic structural diagram of a multi-granularity data pattern mining system provided by an embodiment of the present application;

Figure 3 is a schematic flowchart of a multi-granularity data pattern mining method provided by an embodiment of the present application;

Figure 4 is a schematic diagram of customer number data provided by an embodiment of the present application;

Figure 5 is a schematic diagram of an atomic pattern generation example provided by an embodiment of the present application;

Figure 6 is a schematic diagram of atomic pattern merging provided by an embodiment of the present application;

Figure 7 is a schematic diagram of a multi-granularity business model provided by an embodiment of the present application;

Figure 8 is a schematic structural diagram of a multi-granularity data pattern mining device provided by an embodiment of the present application;

Figure 9 is a schematic structural diagram of a computing device provided by an embodiment of the present application.

Detailed description

The following is a clear and complete description of the technical solutions in the embodiments of the present application with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of this application.

Reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.

First, some terms and related technologies involved in this application will be explained with reference to the accompanying drawings to facilitate understanding by those skilled in the art.

A data pattern is an important means of displaying data content and reflecting the distribution of that content. It is also an important basis for intelligent data governance algorithms such as automatic ETL, data feature extraction, and operator recommendation. For example, if the data content of a certain column is [123, 121, 34, 58, 1], then the data patterns corresponding to the data in this column are: {number}[3], accounting for 40%; {number}[2], accounting for 40%; {number}[1], accounting for 20%.
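The percentages in this example can be reproduced with a short sketch. The function names and the character class labels are illustrative assumptions. The code run-length encodes the character classes of each value and then counts how often each resulting pattern occurs in the column.

```python
from collections import Counter
from itertools import groupby


def char_class(ch):
    """Map one character to a coarse class label (the labels are illustrative)."""
    if ch.isdigit():
        return "number"
    if ch.isascii() and ch.isupper():
        return "English uppercase"
    if ch.isascii() and ch.islower():
        return "English lowercase"
    if "\u4e00" <= ch <= "\u9fff":
        return "Chinese"
    if ch.isspace():
        return "space"
    return "symbol"


def basic_pattern(value):
    """Run-length encode the character classes, e.g. '123' -> '{number}[3]'."""
    return "".join(
        f"{{{label}}}[{len(list(run))}]" for label, run in groupby(value, key=char_class)
    )


def pattern_distribution(column):
    """Return (pattern, count, proportion) triples, most frequent first."""
    values = [str(v) for v in column]
    counts = Counter(basic_pattern(v) for v in values)
    return [(p, n, n / len(values)) for p, n in counts.most_common()]


for pattern, count, share in pattern_distribution([123, 121, 34, 58, 1]):
    print(f"{pattern}: count={count}, share={share:.0%}")
```

Each triple corresponds to the three attributes a data pattern carries in this description: the pattern itself, the number of matching values, and their proportion in the column.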

The basic pattern is a relatively simple and fundamental type of data pattern. It has a simple hierarchy and displays only numbers, English, Chinese, and symbols, together with their corresponding quantities and proportions.

A multi-granularity pattern is a data pattern that can display the numerical characteristics within the data. It can express the hierarchical relationship between data patterns, add numerical constraints, and express data characteristics in multiple dimensions, bringing more insight to data format conversion, feature expression, data cleaning, and so on. For example, if the data content of a certain column is [CSBI0005568, CSBI0008729, BMI0002930, BMI0003187], the basic patterns of the data in this column are: {English uppercase}[4]{number}[7], accounting for 50%; {English uppercase}[3]{number}[7], accounting for 50%. This column of data has multiple levels of granularity, and the corresponding fine-grained layer is: {CSBI}{000}{number}[4], accounting for 50%; {BMI}{000}{number}[4], accounting for 50%. These can be merged to obtain {CSBI/BMI}{000}{number}[4], accounting for 100%. The final multi-granularity data pattern is shown in Figure 1, from which the data pattern at each level of granularity and the hierarchical relationships between the data patterns can be clearly seen.
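One way to hold such a hierarchy in memory is a small tree in which every node carries a pattern sample, a count, and a proportion. The class name `PatternNode` and the parent/child arrangement below are assumptions made for illustration. The exact layout of Figure 1 is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatternNode:
    """One level of a multi-granularity data pattern (hypothetical structure)."""
    pattern: str          # data pattern sample, e.g. "{CSBI}{000}{number}[4]"
    count: int            # number of values matching the pattern
    proportion: float     # share of the values in the column
    children: List["PatternNode"] = field(default_factory=list)

    def render(self, depth: int = 0) -> List[str]:
        """Return indented text lines for this node and its sub-level patterns."""
        lines = [f"{'  ' * depth}{self.pattern}: count={self.count}, share={self.proportion:.0%}"]
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines


# Assumed arrangement: each basic pattern owns its finer-grained sub-level pattern.
column = [
    PatternNode("{English uppercase}[4]{number}[7]", 2, 0.5,
                [PatternNode("{CSBI}{000}{number}[4]", 2, 0.5)]),
    PatternNode("{English uppercase}[3]{number}[7]", 2, 0.5,
                [PatternNode("{BMI}{000}{number}[4]", 2, 0.5)]),
]
report = [line for node in column for line in node.render()]
print("\n".join(report))
```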

A multi-granularity business pattern is used to display data patterns that carry business meaning. It can be understood as assigning actual business meaning on top of the multi-granularity data pattern, for example, {region}{,}{area code}, {year}{month}{date}, {region}{region}{number}{cloud service}{IP address}, and other data patterns with actual business insights.

Context-free grammar (CFG) is an important formal grammar in computer science. The grammatical category (or grammatical unit) it defines is completely independent of the environment in which that category may appear. For example, in a programming language, when an arithmetic expression is encountered, only the arithmetic expression itself needs to be considered, without considering its context. This differs from natural language, where the same word may have different meanings and implications in different contexts. The syntax of today's programming languages is largely defined by context-free grammars.

The longest common substring (LCS) problem is the problem of finding the longest common substring of a given set of strings. There are currently many algorithms for finding the longest common substring of multiple strings, such as the exhaustive method, the Knuth-Morris-Pratt (KMP) algorithm, the generalized suffix tree algorithm, and the frequent pattern tree (FP-tree) algorithm.
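As a concrete illustration of the problem, the following sketch uses the exhaustive method named above: it tries every substring of the shortest string, from longest to shortest, and returns the first one contained in all strings. The function name is illustrative. This simple approach is quadratic in the length of the shortest string, so it suits short codes rather than long texts.

```python
def longest_common_substring(strings):
    """Exhaustive search for the longest substring shared by all strings.
    Ties are resolved in favor of the leftmost candidate in the shortest string."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""


codes = ["CSBI0005568", "CSBI0008729", "BMI0002930", "BMI0003187"]
print(longest_common_substring(codes))      # shared by all four codes
print(longest_common_substring(codes[:2]))  # shared by the two CSBI codes
```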

In a typical pattern mining scenario, the data is first read and then split. The splitting rules are based on the system's built-in data patterns. Currently, the data patterns built in by most vendors only support recognition of English, numbers, and symbols, and a few vendors also support recognition of dates, times, and regions. However, the number of data pattern templates is limited, and only complete matching is supported. During data splitting, statistics are collected iteratively for each data pattern, generating three attributes: pattern, quantity (support), and frequency. Finally, the statistical results are output and displayed, for example in the form of a bar chart. It can be seen that this data pattern mining solution relies heavily on the system's built-in data patterns and works only on individual strings. It does not perform in-depth mining and analysis from the perspective of the characteristics of the entire column of data, and it cannot comprehensively display data characteristics or provide effective data insights for users.

For example, assume that there is a set of data: 20200625053258, 20200412132640, 2021061821031235, 20210421185832, .... When this set of data is split and analyzed with current solutions, only one data pattern can be generated, namely: {number}[14], quantity 1000, accounting for 100%. Users cannot identify the characteristics of this set of data from this data pattern and can only rely on accumulated experience to determine that the data represents dates and times with a precision of seconds. With the technical solution provided by this application, however, multi-granularity data patterns can be generated: {2020}{04}{number}[8], quantity 150, proportion 15%; {2020}{06}{number}[8], quantity 250, proportion 25%; {2021}{04}{number}[8], quantity 320, proportion 32%; {2021}{06}{number}[8], quantity 280, proportion 28%. These are further merged to obtain: {2020/2021}{04/06}{number}[8], quantity 1000, proportion 100%. Users can easily identify the coding characteristics of this set of data from the multi-granularity data pattern and can therefore quickly determine that the data represents dates and times with a precision of seconds.

This application provides a multi-granularity data pattern mining method and related equipment. The method is executed by a data processing system. The data processing system first reads the data stream and then uses the basic pattern mining component to parse the data stream based on the built-in basic pattern settings, obtaining the basic pattern corresponding to the data stream. Based on the obtained basic pattern, the multi-granularity data pattern mining component then mines the data further: it uses a common substring mining algorithm to iteratively mine common substrings from data with the same basic pattern, generates the corresponding atomic patterns from the common substrings obtained after each mining iteration, and finally merges the atomic patterns to obtain the multi-granularity data pattern. Optionally, based on the obtained multi-granularity data pattern and the business types pre-stored in the knowledge base, the data processing system uses the multi-granularity business pattern mining component to mine business patterns from the data and finally obtains the multi-granularity business pattern corresponding to the data. Executing this multi-granularity data pattern mining method enriches the mining granularity of data patterns and helps users comprehensively and effectively identify the characteristics of the data. The method is not limited to individual strings and can display the data characteristics and business insights of the data in multiple dimensions.

The technical solutions of the embodiments of this application can also be applied to various scenarios that require data content processing and display, including but not limited to data cleaning, automatic ETL, structuring of semi-structured data, similar column/table mining, data standard identification, and confidential privacy label delivery.

The data processing system is used to read the data stream in the business system, parse and process the data stream, obtain the multi-granularity data pattern corresponding to the data, and display the data characteristics and business insights of the data to the user in multiple dimensions. As shown in Figure 2, the data processing system 210 can connect to and communicate with the client 220 and the client 230 through the Internet or an internal bus, where the client 220 and the client 230 provide the business data to be processed and the data processing system 210 processes the business data. They can be deployed on the same physical entity, such as the same server, or on different servers, which is not limited in this application. The data processing system 210 includes a data reader 2110, a basic pattern mining component 2120, a multi-granularity data pattern mining component 2130, a multi-granularity business pattern mining component 2140, a knowledge base 2150, and an output presenter 2160. The data reader 2110 reads data from the client 220 or the client 230 and sends the read business data to the basic pattern mining component 2120. The basic pattern mining component 2120 parses the business data based on the built-in basic pattern settings to obtain the corresponding basic pattern. The multi-granularity data pattern mining component 2130 then performs further mining based on the basic pattern mined by the basic pattern mining component 2120 to obtain the corresponding multi-granularity data pattern. Finally, the multi-granularity business pattern mining component 2140 mines business patterns based on the multi-granularity data pattern mined by the multi-granularity data pattern mining component 2130 and the business types stored in the knowledge base 2150 to obtain the corresponding multi-granularity business pattern. In particular, the output presenter 2160 can output the results obtained by the basic pattern mining component 2120, the multi-granularity data pattern mining component 2130, and the multi-granularity business pattern mining component 2140 respectively and display them to the user. The display form can be a bar chart, a tree diagram, and so on, which is not limited in this application.

Based on the above, the multi-granularity data pattern mining method and related equipment provided by the embodiments of the present application are described below. Referring to Figure 3, Figure 3 is a schematic flow chart of a multi-granularity data pattern mining method provided by an embodiment of the present application. As shown in Figure 3, the method includes but is not limited to the following steps:

S301: The data processing system reads the data to be processed and performs multi-granularity pattern mining on the data to be processed.

Specifically, a variety of data pattern identifiers are preset in the data processing system. They can include general, conventional identifiers such as English uppercase, English lowercase, symbol, number, Chinese, other languages and space, and can also include other identifiers defined by the user. After obtaining the data to be processed, the data processing system compares and matches the data against the preset data pattern identifiers one by one, thereby parsing the data to be processed and obtaining its basic pattern.

In one possible implementation, the data processing system uses a context-free grammar (CFG) to parse the data to be processed, obtains a regular expression corresponding to the basic pattern, and generates the basic pattern corresponding to the data to be processed from that regular expression.

For example, assume that the data read by the data processing system is "expression", which contains multiple elements (terms). When the data processing system parses the data, it finds that an element of the data can be replaced by an English lowercase letter, that is, the element and {English lowercase} are interchangeable. The data processing system then parses further and finds that the second element, x, is also interchangeable with {English lowercase}. Because it is interchangeable in the same way as the first element, the two are merged, that is, {English lowercase} and {English lowercase} are merged to obtain {English lowercase}[2]. The data is parsed and merged iteratively according to this rule, and the final basic pattern of the data is {English lowercase}[10].

It can be understood that the above description is only based on the example that the data to be processed contains only one data mode identifier. For data that contains multiple data mode identifiers at the same time, it can also be analyzed using the same parsing rules mentioned above to obtain its corresponding Basic mode. For example, the data to be processed is "Beijing Beijing-01032145680". After CFG analysis and processing, the final basic pattern generated is: {Chinese}[2]{English uppercase}[1]{English lowercase}[6]{symbol}[1 ]{number}[11]. In addition, if the data The processing system reads a set of data at the same time, performs the above-mentioned CFG analysis processing on each data in the set of data, and finally obtains the basic pattern corresponding to the set of data through simple statistics and calculations, as shown in Figure 4 , the data processing system obtains a column of customer number data, uses the above method to parse the column of data, and finally generates the basic pattern corresponding to the column of data: {English uppercase}[1]{English lowercase}[1]{Number}[4 ]: [4, 0.571]; {empty}: [2, 0.286]; {symbol}[1]{number}[1]: [1, 0.143].
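The parsing and per-column statistics described above can be sketched as follows. This is a minimal illustration, not the CFG parser of the application: it classifies each character with a few hard-coded rules and run-length encodes the result. The function names `char_class`, `basic_pattern` and `column_patterns`, the Unicode range used for Chinese, and the fallback of every unrecognized character to "symbol" are assumptions made for the example.

```python
from collections import Counter
from itertools import groupby


def char_class(ch: str) -> str:
    """Map one character to a data pattern identifier (simplified, assumed rules)."""
    if "\u4e00" <= ch <= "\u9fff":
        return "Chinese"
    if "A" <= ch <= "Z":
        return "English uppercase"
    if "a" <= ch <= "z":
        return "English lowercase"
    if ch.isdigit():
        return "number"
    if ch.isspace():
        return "space"
    return "symbol"  # everything else, including other languages, in this sketch


def basic_pattern(value: str) -> str:
    """Merge adjacent characters of the same class, e.g. '0571' -> {number}[4]."""
    if value == "":
        return "{empty}"
    runs = groupby(value, key=char_class)
    return "".join(f"{{{cls}}}[{len(list(run))}]" for cls, run in runs)


def column_patterns(values):
    """Return (pattern, count, proportion) for a column, most frequent first."""
    counts = Counter(basic_pattern(v) for v in values)
    total = len(values)
    return [(p, n, round(n / total, 3)) for p, n in counts.most_common()]
```

For instance, `basic_pattern("expression")` yields `{English lowercase}[10]`. Calling `column_patterns` on a seven-row column with four values shaped like `Ab1234`, two empty values and one value shaped like `-1` (made-up values, not those of Figure 4) produces counts and proportions of the same form as the Figure 4 result.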

In another possible implementation, the data processing system uses a common substring mining algorithm to iteratively mine common substrings for data to be processed with the same basic pattern.

Specifically, the data processing system performs common substring mining on data with the same basic pattern, so that data sharing the same substring can be found. It is easy to understand that when two pieces of data share a common substring, they very probably belong to the same data level and have the same data characteristics, so the data characteristics hidden in the data can be displayed at a finer granularity through this common substring.

Optionally, the data processing system uses the suffix array of the data with the same basic pattern to find substrings whose occurrence frequency is greater than a preset threshold, filters these substrings, and determines the common substring generated after each iteration of mining.

Specifically, when the data processing system solves for the longest common substring of N pieces of data with the same basic pattern, the problem can be converted into finding the maximum longest common prefix among suffixes that together belong to all N pieces of data. For example, assume the N pieces of data are S1, S2, S3, ..., SN. First, create a string S by connecting the N pieces of data with distinct separators, that is, S = S1[P1]S2[P2]...SN-1[PN-1]SN, where P1, P2, ..., PN-1 are N-1 distinct characters that do not appear in the character set of the data. Then use the doubling algorithm or the DC3 algorithm to compute the suffix array and the height array of S. Next, binary-search the answer A (that is, assume that the N pieces of data have a common substring of length A) and verify whether A is feasible. Based on the verification results, the longest common substring of the N pieces of data is found.
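The suffix-array procedure above can be sketched as follows. To keep the example short, the suffix array is built by naively sorting the suffixes rather than with the doubling or DC3 algorithm, so it is only suitable for small inputs. The function name `longest_common_substring` and the use of Unicode private-use characters as separators are assumptions of this sketch; the inputs are assumed not to contain those characters.

```python
def longest_common_substring(strings):
    """Longest substring shared by ALL input strings (suffix array + binary search)."""
    if not strings:
        return ""
    if len(strings) == 1:
        return strings[0]
    n = len(strings)

    # S = S1[P1]S2[P2]...SN, with a distinct separator after each string but the last.
    parts, owner = [], []
    for i, text in enumerate(strings):
        parts.append(text)
        owner.extend([i] * len(text))
        if i < n - 1:
            parts.append(chr(0xE000 + i))  # assumed-unused private-use character
            owner.append(-1)
    s = "".join(parts)

    # Naive suffix array; the doubling or DC3 algorithm would replace this line.
    sa = sorted(range(len(s)), key=lambda i: s[i:])

    def lcp(a, b):
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        return k

    # height[i] = longest common prefix of the suffixes ranked i-1 and i.
    height = [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]

    def feasible(length):
        """Return a start position of a common substring of this length, else None."""
        seen = set()
        for i, pos in enumerate(sa):
            if height[i] < length:
                seen = set()  # a new group of suffixes starts here
            if owner[pos] >= 0:
                seen.add(owner[pos])
            if len(seen) == n:
                return pos
        return None

    lo, hi, best = 1, min(len(t) for t in strings), ""
    while lo <= hi:
        mid = (lo + hi) // 2
        pos = feasible(mid)
        if pos is not None:
            best, lo = s[pos:pos + mid], mid + 1
        else:
            hi = mid - 1
    return best
```

Because every separator occurs only once in S, a common prefix of two different suffixes can never contain a separator, which is why the feasibility check only needs to look at the height array and at which input each suffix starts in.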

It can be seen that by executing the above algorithm, data sharing a common substring can be found among data with the same basic pattern. All of this data has the data characteristic represented by the common substring, so the data characteristics can be displayed at a finer granularity. In particular, when multiple pieces of data with the same basic pattern have more than one common substring, all the common substrings can be found by executing the above algorithm iteratively.

Optionally, the data processing system aligns data with the same basic pattern, compares the data position by position to check whether the characters at the same position are the same, and determines the common substring generated after each iteration of mining based on the comparison results.

Specifically, when the basic patterns are the same, common substrings are more likely to appear at the same position. Based on this assumption, data with the same basic pattern is compared position by position iteratively, so that all the common substrings can be found.
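The position-by-position comparison can be sketched as below. The function name `aligned_common_runs` and the `min_len` parameter (used to discard very short coincidental matches) are assumptions of this example, not part of the application.

```python
def aligned_common_runs(values, min_len=1):
    """Return (start, text) for each run of positions where all values agree."""
    if not values:
        return []
    width = min(len(v) for v in values)
    runs, start = [], None
    # Iterate one step past the end so that a run reaching the end is closed.
    for i in range(width + 1):
        same = i < width and all(v[i] == values[0][i] for v in values)
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_len:
                runs.append((start, values[0][start:i]))
            start = None
    return runs
```

For the two values 0571-45571233 and 0571-89654432, the sketch reports the run 0571- at position 0 and a single coincidental 3 at position 11; raising `min_len` to 2 keeps only 0571-.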

Optionally, the data processing system constructs an FP-tree based on data with the same basic pattern, and determines the common substrings generated after each iteration of mining based on the FP-tree.

Specifically, the data processing system first scans the data with the same basic pattern and obtains the counts of all frequent 1-itemsets. An itemset is a set of items, where the items are the characters contained in each piece of data, and a frequent itemset is an itemset whose support is greater than or equal to the minimum support. The support is the probability that an itemset appears across all transactions. For example, suppose there is an itemset A and a table with N rows, M of which contain A; then the absolute support of A is M and the relative support is M/N. The data processing system then deletes the items whose support is lower than the threshold, puts the frequent 1-itemsets into the item header table, and sorts them in descending order of support. Next, it scans the data again, removes the non-frequent items from each piece of original data that it reads, and sorts the remaining items in descending order of support. It then inserts each piece of sorted data into the FP-tree in the sorted order, so that items ranked earlier become ancestor nodes and items ranked later become descendant nodes. If a piece of data shares a common ancestor with data already in the tree, the count of the corresponding common ancestor node is increased by 1. If a new node is created during insertion, the corresponding entry in the item header table is linked to the new node through the node linked list. Once all the data has been inserted, the FP-tree is complete.

After the FP-tree is constructed, the conditional pattern base of each header table item can be found by working upwards from the bottom of the header table. Recursively mining the conditional pattern bases yields the frequent itemsets of each header table item, and the common substrings of the data with the same basic pattern can be found from these mining results.
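The two-scan construction and the lookup of a conditional pattern base can be sketched as follows. This sketch covers only tree building and one level of conditional pattern bases, not the full recursive mining. The names `Node`, `build_fp_tree` and `conditional_pattern_base` are assumptions, the header table is kept as a plain list of nodes per item instead of a linked list, and ties in support are broken alphabetically.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(rows, min_support):
    """Build an FP-tree whose items are the distinct characters of each row."""
    # First scan: absolute support of every single item.
    counts = Counter(ch for row in rows for ch in set(row))
    frequent = {ch: c for ch, c in counts.items() if c >= min_support}

    root = Node(None, None)
    header = {ch: [] for ch in frequent}  # item -> nodes holding that item

    # Second scan: drop infrequent items, sort by descending support, insert.
    for row in rows:
        items = sorted((ch for ch in set(row) if ch in frequent),
                       key=lambda ch: (-frequent[ch], ch))
        node = root
        for ch in items:
            child = node.children.get(ch)
            if child is None:
                child = Node(ch, node)
                node.children[ch] = child
                header[ch].append(child)  # link the new node from the header table
            child.count += 1  # shared ancestors simply get their count increased
            node = child
    return root, header


def conditional_pattern_base(header, item):
    """Prefix paths (root to parent) of every node of `item`, with that node's count."""
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        base.append((path[::-1], node.count))
    return base
```

With the rows `abc`, `abd`, `ab` and `ae` and a minimum absolute support of 2, only `a` (support 4) and `b` (support 3) are frequent, the tree is the single path a then b, and the conditional pattern base of `b` is the path `['a']` with count 3.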

Furthermore, the data processing system generates atomic patterns corresponding to the data to be processed based on the common substrings obtained after each iteration of mining.

Specifically, after the data processing system completes the iterative mining of common substrings for data with the same basic pattern, it can generate finer-grained data patterns from the common substrings obtained in each round of mining. These finer-grained data patterns are called atomic patterns, that is, an atomic pattern is a more detailed data pattern under a basic pattern.

For example, see Figure 5, which is a schematic diagram of an atomic pattern generation example provided by an embodiment of the present application. Assume that a data table contains the following set of data: 057112345123, 057187541123, 123456789521, 0571-45571233, 0571-89654432. First, the data is parsed to obtain the basic patterns: {number}[12]: [3, 0.6] and {number}[4]{symbol}[1]{number}[8]: [2, 0.4]. For data with the same basic pattern, the data processing system can use any of the common substring mining algorithms described above to mine common substrings iteratively. For the data whose basic pattern is {number}[12], the longest common substring found in the first round of mining is 0571, which is shared by 057112345123 and 057187541123, so the first-layer atomic pattern 0571{number}[8] is obtained. Common substring mining then continues on the basis of the first-layer atomic pattern. For the data matching 0571{number}[8], the longest common substring found in the second round of mining is 123, so the second-layer atomic pattern 0571{number}[5]123 is obtained. For the data whose basic pattern is {number}[4]{symbol}[1]{number}[8], the longest common substring found in the first round of mining is 0571-, so the first-layer atomic pattern 0571-{number}[8] is obtained. Common substring mining then continues on the basis of this first-layer atomic pattern; because no further common substring is found in the second round, there is no finer-grained atomic pattern and the mining ends.
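How an atomic pattern is written once the common substrings are known can be sketched as follows: the mined substrings are kept as literal text, and every other position is generalized to its data pattern identifier. The function name `atomic_pattern`, its `(start, text)` argument format and the reduced set of character classes are assumptions of this example.

```python
from itertools import groupby


def char_class(ch):
    """Reduced, assumed classification used only for this sketch."""
    if ch.isdigit():
        return "number"
    if "A" <= ch <= "Z":
        return "English uppercase"
    if "a" <= ch <= "z":
        return "English lowercase"
    return "symbol"


def atomic_pattern(value, literals):
    """Keep the given (start, text) spans literal and generalize the rest."""
    keep = [False] * len(value)
    for start, text in literals:
        if value[start:start + len(text)] != text:
            raise ValueError(f"{text!r} does not occur at position {start}")
        for i in range(start, start + len(text)):
            keep[i] = True

    def group_key(pair):
        ch, is_literal = pair
        return (is_literal, None if is_literal else char_class(ch))

    out = []
    for (is_literal, cls), run in groupby(zip(value, keep), key=group_key):
        chars = "".join(ch for ch, _ in run)
        out.append(chars if is_literal else f"{{{cls}}}[{len(chars)}]")
    return "".join(out)
```

With the substring 0571 kept at position 0, the value 057187541123 becomes 0571{number}[8]; keeping 123 at position 9 as well turns 057112345123 into 0571{number}[5]123, matching the two layers described for Figure 5.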

S302: The data processing system generates a multi-granularity data pattern corresponding to the data to be processed based on the multi-granularity pattern mining result.

Specifically, after the data processing system obtains the atomic patterns corresponding to the data to be processed, it merges the atomic patterns and, based on the merging result, generates the multi-granularity data pattern corresponding to the data to be processed.

In one possible implementation, the data processing system calculates the edit distance between any two of the atomic patterns based on a dynamic programming algorithm, and merges the atomic patterns corresponding to the data to be processed based on the edit distance calculation results and a preset merging strategy.

Specifically, the data processing system first applies an edit distance strategy. A deletion or insertion operation has an edit distance of 2. For a replacement, the edit distance depends on the basic regular expression corresponding to each character, on whether the character is a literal, and on its specific value. When the basic regular expressions are the same but one character is a literal and the other is not, the edit distance is 1; when the basic regular expressions are different, the edit distance is 3; when the basic regular expressions are the same and both characters are literals with different values, the edit distance is 0.5; when the two are exactly the same, the edit distance is 0.

For example, assume that there are two atomic patterns, 0571{number}[5]123 and 0571-{number}[8], which are called pattern one and pattern two respectively for ease of description. To calculate the edit distance of pattern one relative to pattern two, the edit distance of each character is calculated and the results are accumulated. First, for the 0 in pattern one and the 0 in pattern two, the basic regular expressions are the same and both are literals with the same value, so the edit distance is 0. Similarly, the edit distances for 5, 7 and 1 are also 0. Because pattern two has an extra "-" compared with pattern one, a "-" must be inserted into pattern one, and the edit distance of this insertion is 2. The {number}[8] in pattern two can be regarded as {number}[5]{number}[3], so the edit distance of the {number}[5] part is 0. For the last part, pattern one has 123 and pattern two has {number}[3]. Both are numbers, so the basic regular expressions are the same, but one is a literal and the other is not, so the edit distance is 1 for each of the three positions, 3 in total. Finally, all the edit distances are accumulated, and the edit distance between pattern one and pattern two is 0 + 2 + 0 + 3 = 5.

The data processing system can generate a distance matrix based on the edit distance calculation. The number of rows of the distance matrix equals the number of positions of one atomic pattern (such as pattern one above), and the number of columns equals the number of positions of the other atomic pattern (such as pattern two above). The matrix is filled in with the edit distances corresponding to operations such as deletion, insertion and replacement. A shortest path algorithm is then used to find the shortest of all feasible paths through the distance matrix. The shortest path is the lowest-cost sequence of operations for converting one atomic pattern into the other.
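The weighted edit distance and the recovery of the cheapest conversion path can be sketched with a standard dynamic programming table. This is an illustration under stated assumptions rather than the application's implementation: patterns are expanded to one element per position, a literal digit is treated as class "number" and any other literal as "symbol", and the table has one extra row and column for the empty prefix. The names `expand`, `substitution_cost` and `edit_distance` are hypothetical.

```python
import re

TOKEN = re.compile(r"\{([^}]+)\}\[(\d+)\]|(.)")
INDEL = 2.0  # cost of one insertion or deletion


def expand(pattern):
    """'0571{number}[2]' -> one (class, literal or None) pair per position."""
    out = []
    for m in TOKEN.finditer(pattern):
        if m.group(1):
            out.extend([(m.group(1), None)] * int(m.group(2)))
        else:
            ch = m.group(3)
            out.append(("number" if ch.isdigit() else "symbol", ch))
    return out


def substitution_cost(a, b):
    (cls_a, lit_a), (cls_b, lit_b) = a, b
    if cls_a != cls_b:
        return 3.0  # different basic regular expressions
    if lit_a == lit_b:
        return 0.0  # exactly the same
    if lit_a is None or lit_b is None:
        return 1.0  # same class, only one side is a literal
    return 0.5      # same class, both literal, different values


def edit_distance(p1, p2):
    """Return (distance, operations) for converting p1 into p2."""
    a, b = expand(p1), expand(p2)
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * INDEL
    for j in range(1, n + 1):
        d[0][j] = j * INDEL
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + INDEL,
                          d[i][j - 1] + INDEL,
                          d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]))
    # Walk back from the bottom-right corner along a cheapest path.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        cost = substitution_cost(a[i - 1], b[j - 1]) if i > 0 and j > 0 else None
        if cost is not None and d[i][j] == d[i - 1][j - 1] + cost:
            ops.append("keep" if cost == 0 else "replace")
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + INDEL:
            ops.append("insert")
            j -= 1
        else:
            ops.append("delete")
            i -= 1
    return d[m][n], ops[::-1]
```

For 0571{number}[5]123 and 0571-{number}[8], the sketch returns a distance of 5 with one insertion (the "-") and three replacements (the literals 1, 2 and 3 against {number}), in agreement with the worked example above.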

Further, the data processing system calculates the edit distance between the atomic patterns in pairs and, each time, merges the two atomic patterns with the smallest edit distance. During merging, the conversion actions are generated from the shortest path found above, and the merge is performed according to these conversion actions. Merging continues until all merges are completed, or stops when the basic regular expressions of the two elements at the same position are different. Finally, the data processing system generates the multi-granularity data pattern corresponding to the data to be processed based on the merging results. After the multi-granularity data pattern is generated, the data pattern at any level can be compiled forward to generate a user-readable data pattern, or compiled in reverse to generate a machine-readable regular expression, which can be used to check whether newly added data matches, to perform format conversion, and so on.

For example, see Figure 6, which is a schematic diagram of atomic pattern merging provided by an embodiment of the present application. Taking the atomic patterns generated in Figure 5 above as an example, to merge them further according to the merging method and strategy described above, the atomic pattern 0571{number}[5]123 and the atomic pattern 0571-{number}[8] are merged first, giving the data pattern 0571{-}{number}[8]. This data pattern is then merged with the atomic pattern {number}[12] to obtain the data pattern {number}[4]{-}{number}[8], which finally yields the multi-granularity data pattern.
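The reverse compilation of a data pattern into a machine-readable regular expression, and its use for checking newly added data, can be sketched as follows. The mapping from data pattern identifiers to regular expression fragments in `CLASS_REGEX` is an assumption of this example, as are the names `to_regex` and `matches`. The merged notation {-} that appears in Figure 6 is not interpreted by this sketch.

```python
import re

TOKEN = re.compile(r"\{([^}]+)\}\[(\d+)\]|(.)")

# Assumed mapping from data pattern identifiers to regex fragments.
CLASS_REGEX = {
    "number": r"\d",
    "English uppercase": "[A-Z]",
    "English lowercase": "[a-z]",
    "Chinese": "[\u4e00-\u9fff]",
    "symbol": r"[^\w\s]",
    "space": r"\s",
}


def to_regex(pattern):
    """Compile e.g. '0571-{number}[8]' into a regular expression string."""
    parts = []
    for m in TOKEN.finditer(pattern):
        if m.group(1):
            parts.append(f"{CLASS_REGEX[m.group(1)]}{{{m.group(2)}}}")
        else:
            parts.append(re.escape(m.group(3)))  # literal character
    return "".join(parts)


def matches(pattern, value):
    """Check whether a newly added value conforms to a data pattern."""
    return re.fullmatch(to_regex(pattern), value) is not None
```

A value such as 0571-45571233 matches 0571-{number}[8], while 057112345123 does not, so the second value would be flagged for conversion or further inspection.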

S303: The data processing system outputs and displays the multi-granularity data pattern corresponding to the data to be processed.

Specifically, after the data processing system generates multi-granularity data patterns, it needs to output them and display the data characteristics corresponding to the data in multiple dimensions, so as to help users comprehensively and effectively identify the data.

In a possible implementation, the data processing system outputs and displays the multi-granularity data patterns in a tree structure. The tree structure includes multiple levels, and each level corresponds to one type of granularity data pattern.

It is easy to understand that, in order to show the hierarchical relationship between data patterns more intuitively and to help users identify data characteristics more effectively and comprehensively, the data processing system outputs the multi-granularity data pattern as a tree structure, in which the connections between the nodes of the tree clearly show the hierarchical relationship between the data patterns.

It should be understood that in the displayed multi-granularity data pattern, the basic pattern may include data patterns at different levels, and the data pattern at each level includes a data pattern sample, the number of pieces of data matching the data pattern sample, and their proportion in the data to be processed. In particular, the data pattern at each level can include multiple sub-level data patterns. The mining granularity of each sub-level data pattern is finer than that of the level above it, and each sub-level data pattern shares a common substring with that level. For example, in the multi-granularity data pattern obtained after the merging shown in Figure 6 above, the basic pattern is {number}[4]{-}{number}[8], with a count of 5 and a proportion of 100%. It includes a first-level data pattern and a second-level data pattern. The first-level data pattern is 0571{-}{number}[8], with a count of 4 and a proportion of 80%, and the second-level data pattern is {number}[12], with a count of 1 and a proportion of 20%. The first-level data pattern includes a first sub-level data pattern and a second sub-level data pattern, both of which contain the common substring 0571. The first sub-level data pattern is 0571{number}[5]123, with a count of 2 and a proportion of 40%, and the second sub-level data pattern is 0571-{number}[8], with a count of 2 and a proportion of 40%.
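The tree-structured output can be sketched with a small node type and an indented text rendering. The class name `PatternNode`, the function `render` and the text layout are assumptions of this example; the application itself only requires that each level shows the pattern sample, the matching count and the proportion.

```python
from dataclasses import dataclass, field


@dataclass
class PatternNode:
    pattern: str   # data pattern sample
    count: int     # number of values matching the pattern
    children: list = field(default_factory=list)


def render(node, total, depth=0):
    """Return one indented line per node: pattern, count and proportion."""
    line = f"{'  ' * depth}{node.pattern}  [{node.count}, {node.count / total:.0%}]"
    lines = [line]
    for child in node.children:
        lines.extend(render(child, total, depth + 1))
    return lines


# The hierarchy described for Figure 6.
tree = PatternNode("{number}[4]{-}{number}[8]", 5, [
    PatternNode("0571{-}{number}[8]", 4, [
        PatternNode("0571{number}[5]123", 2),
        PatternNode("0571-{number}[8]", 2),
    ]),
    PatternNode("{number}[12]", 1),
])
```

Calling `print("\n".join(render(tree, total=5)))` prints the root with 100%, the first-level pattern with 80% and its two sub-level patterns with 40% each, and the second-level pattern with 20%.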

In another possible implementation, the data processing system uses a knowledge base to retrieve and match the multi-granularity data patterns corresponding to the data to be processed, and determine the multi-granularity business patterns corresponding to the multi-granularity data patterns.

Specifically, the knowledge base of the data processing system stores regular expressions corresponding to business patterns, such as regular expressions for Internet Protocol (IP) addresses and uniform resource locators (URLs), or the possible values of regions, zip codes, area codes, surnames, orientations, and so on. The knowledge base can be updated and can learn at any time to enrich the business patterns it stores. After the data processing system obtains the multi-granularity data pattern, it searches the knowledge base for the business pattern that matches the data pattern at each level of granularity, thereby generating the multi-granularity business pattern.

For example, see Figure 7, which is a schematic diagram of a multi-granularity business pattern provided by an embodiment of the present application. As shown in Figure 7, the business pattern matching the data pattern 0571{number}[8] is {Hangzhou area code}[4]{number}[8], the business pattern matching the data pattern 0571-{number}[8] is {area code}[4]-{number}[8], and the business pattern matching the data pattern {number}[11] is {telephone number}[11]. Similarly, the matching business pattern is looked up in the knowledge base for every data pattern, so that the corresponding multi-granularity business pattern is obtained.

It can be seen that using the knowledge base to find the business pattern that matches each data pattern in the multi-granularity data pattern, and thereby generating a multi-granularity business pattern, helps users better identify the business characteristics of the data and gives users more, and more effective, data insights.

It should be noted that in addition to mining and displaying multi-granularity data patterns and multi-granularity business patterns, the data processing system can be applied to other scenarios that require processing of data content. For example, if -999999999 is mixed into the data shown in Figure 7 above, the data processing system can identify that the data pattern of this value is {symbol}[1]{number}[9], whose similarity to the other data patterns is lower than the threshold. The data pattern is therefore determined to be an outlier pattern, and the value can be cleaned. Similarly, for existing data with different data patterns, the data processing system can convert and unify the patterns based on the similarity between them. As shown in Figure 7 above, most of the data patterns do not contain "-" and only a small part contain "-", while the edit distance between them is very small and the similarity is high. Following the majority principle, the data patterns that contain "-" can be converted into data patterns that do not contain "-" by removing the "-", which completes the conversion and unification between data patterns. In addition, when new data needs to be imported, pattern recognition can be performed on the new data, and the similarity between the recognized data pattern and the existing data pattern can be calculated. If they are not similar, the new data can be converted with the existing data pattern as the standard, so that the converted data pattern is consistent with the existing data pattern. Of course, the method can also be applied to scenarios such as structuring semi-structured data, mining similar columns or tables, and identifying data standards, which are not described again here.
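The knowledge base lookup can be sketched as follows. The contents of `WHOLE_PATTERN_RULES` and `LITERAL_RULES` are hypothetical stand-ins for the knowledge base 2150, reduced to the two facts used in the Figure 7 example (0571 is the Hangzhou area code, and an 11-digit number is a telephone number); the function name `business_pattern` is also an assumption.

```python
import re

# Hypothetical knowledge base entries.
WHOLE_PATTERN_RULES = {"{number}[11]": "{telephone number}[11]"}
LITERAL_RULES = [("Hangzhou area code", re.compile(r"0571"))]


def business_pattern(data_pattern):
    """Replace recognized parts of a data pattern with business names."""
    if data_pattern in WHOLE_PATTERN_RULES:
        return WHOLE_PATTERN_RULES[data_pattern]
    leading = re.match(r"[^{]+", data_pattern)  # literal text before the first {..}
    if leading:
        literal = leading.group(0)
        for name, rule in LITERAL_RULES:
            hit = rule.match(literal)
            if hit:
                rest = literal[hit.end():] + data_pattern[leading.end():]
                return f"{{{name}}}[{hit.end()}]{rest}"
    return data_pattern  # no matching business pattern found
```

With these entries, 0571{number}[8] maps to {Hangzhou area code}[4]{number}[8] and {number}[11] maps to {telephone number}[11]. Note that the sketch also labels 0571-{number}[8] with the Hangzhou-specific name, whereas Figure 7 uses the more general {area code} label for that pattern; choosing between specific and general labels would be part of the knowledge base design.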

It can be understood that by executing the multi-granularity data pattern mining method described in Figure 3, the data characteristics and business characteristics of the data can be displayed in multiple dimensions. This helps users identify data characteristics comprehensively, especially the coding characteristics of the data, brings users more, and more effective, data insights, and improves flexibility of use. In addition, the method can be applied to a variety of business scenarios and can assist other data management capabilities, which effectively expands the applicable scenarios.

The methods of the embodiments of the present application are described in detail above. In order to facilitate better implementation of the above solutions of the embodiments of the present application, correspondingly, the following also provides relevant equipment for cooperating with the implementation of the above solutions.

Referring to Figure 8, Figure 8 is a schematic structural diagram of a multi-granularity data pattern mining device provided by an embodiment of the present application. The device can be the data processing system in the method embodiment described in Figure 3 above, can execute steps S301-S303 of that method embodiment, and can optionally execute the optional methods in steps S301-S303. As shown in Figure 8, the device 800 includes a reading and parsing module 810, a processing module 820 and an output display module 830, wherein:

The reading and parsing module 810 is used to read the data to be processed;

The processing module 820 is configured to perform multi-granularity pattern mining on the data to be processed, and generate multi-granularity data patterns corresponding to the data to be processed based on the multi-granularity pattern mining results;

The output display module 830 is used to output and display the multi-granularity data pattern corresponding to the data to be processed, wherein the multi-granularity data pattern includes a basic pattern corresponding to the data to be processed, the basic pattern includes a first-level data pattern and a second-level data pattern, and the data pattern at each level includes a data pattern sample, the number of pieces of data matching the data pattern sample, and their proportion in the data to be processed.

The above three modules can transmit data to each other through communication channels. It should be understood that each module included in the multi-granularity data pattern mining device 800 can be a software unit, a hardware unit, or part of a software unit and part of a hardware unit.

As an embodiment, the first-level data pattern includes at least one sub-level data pattern. The mining granularity of the at least one sub-level data pattern is finer than that of the first-level data pattern, and each of the at least one sub-level data pattern shares a common substring with the first-level data pattern.

As an embodiment, the processing module 820 is further configured to: based on the multi-granularity data pattern, retrieve and match the data pattern at each level of the multi-granularity data pattern through a knowledge base, where the knowledge base includes regular expressions corresponding to different business patterns; and, according to the retrieval results, output and display the multi-granularity business pattern corresponding to the data to be processed, wherein the multi-granularity business pattern includes business patterns at multiple levels, the business pattern at each level matches one data pattern in the multi-granularity data pattern, and the business pattern at each level corresponds to one business insight.

As an embodiment, the processing module 820 is specifically configured to: parse the data to be processed to obtain the basic pattern corresponding to the data to be processed; use a common substring mining algorithm to iteratively mine common substrings from the data to be processed with the same basic pattern; generate the atomic patterns corresponding to the data to be processed based on the common substring obtained after each iteration of mining; and merge the atomic patterns corresponding to the data to be processed to obtain the multi-granularity data pattern corresponding to the data to be processed.

As an embodiment, the processing module 820 is specifically configured to: obtain, based on the suffix array of the data to be processed with the same basic pattern, substrings whose occurrence frequency is greater than a preset threshold; and filter the substrings whose occurrence frequency is greater than the preset threshold to determine the common substring generated after each iteration of mining.

As an embodiment, the processing module 820 is specifically configured to: align the data to be processed with the same basic pattern, and compare the data position by position to check whether the characters at the same position are the same; and determine the common substring generated after each iteration of mining based on the comparison results.

As an embodiment, the processing module 820 is specifically configured to: construct a frequent itemset tree (FP-tree) based on the data to be processed with the same basic pattern; and determine the common substring generated after each iteration of mining according to the FP-tree.

As an embodiment, the processing module 820 is specifically configured to: calculate the edit distance between any two atomic patterns in all the atomic patterns based on a dynamic programming algorithm; based on the edit distance calculation result and the preset merging strategy , merge the atomic patterns corresponding to the data to be processed.

As an embodiment, the processing module 820 is specifically configured to: use a context-free grammar (CFG) to parse the data to be processed to obtain a regular expression corresponding to the basic pattern; and generate the basic pattern corresponding to the data to be processed according to the regular expression corresponding to the basic pattern.

It should be noted that the structure of the above-mentioned multi-granularity data pattern mining device is only an example and should not constitute a specific limitation. Each module in the above-mentioned device can be added, reduced or combined as needed. In addition, the operation and/or function of each module in the above-mentioned device is to implement the corresponding process of the method described in FIG. 3, and for the sake of brevity, the details will not be described again.

Referring to Figure 9, Figure 9 is a schematic structural diagram of a computing device provided by an embodiment of the present application. As shown in FIG. 9 , the computing device 900 includes a processor 910 , a communication interface 920 , and a memory 930 . The processor 910 , the communication interface 920 , and the memory 930 are connected to each other through an internal bus 940 .

The processor 910 may be composed of one or more general-purpose processors, such as a central processing unit (CPU), or a combination of a CPU and a hardware chip. The hardware chip can be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The bus 940 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc. The bus 940 can be divided into an address bus, a data bus, a control bus, etc. For ease of presentation, only one thick line is used in Figure 9, but it does not mean that there is only one bus or one type of bus.

The memory 930 may include volatile memory, such as random access memory (RAM); the memory 930 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 930 may also include a combination of the above types.

It should be noted that the memory 930 of the computing device 900 stores the code corresponding to each module of the multi-granularity data pattern mining device 800. The processor 910 executes this code to implement the functions of each module of the device 800, that is, to execute S301-S303 and the optional methods in S301-S303.

This application also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program. When the computer program is executed by a processor, some or all of the steps of any of the methods described in the above method embodiments can be implemented.

An embodiment of the present invention also provides a computer program. The computer program includes instructions. When the computer program is executed by a computer, the computer can perform some or all of the steps of any of the above methods.

In the above embodiments, each embodiment is described with its own emphasis. For parts that are not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

It should be noted that for the sake of simple description, the foregoing method embodiments are expressed as a series of action combinations. However, those skilled in the art should know that the present application is not limited by the described action sequence. Because according to this application, certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily necessary for this application.

In the several embodiments provided in this application, it should be understood that the disclosed device can be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the above units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be in electrical or other forms.

The units described above as separate components may or may not be physically separated. The components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit. The above integrated units can be implemented in the form of hardware or software functional units.

Claims

A multi-granularity data pattern mining method, characterized by including:

Read the data to be processed and perform multi-granularity pattern mining on the data to be processed;

According to the multi-granularity pattern mining results, generate a multi-granularity data pattern corresponding to the data to be processed;

Output and display the multi-granularity data pattern corresponding to the data to be processed, where the multi-granularity data pattern includes a basic pattern corresponding to the data to be processed, the basic pattern includes a first-level data pattern and a second-level data pattern, and each level of data pattern includes a data pattern sample, the amount of data matching the data pattern sample, and the proportion of that data in the data to be processed.
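
As a non-limiting illustration of the output described above, the following minimal Python sketch reports, for each pattern, a sample, the number of matching values, and their proportion. The character-class encoding (`D` for digit, `L` for letter) and the names `basic_pattern` and `pattern_report` are assumptions made for this example and are not taken from the application.

```python
from collections import Counter

def basic_pattern(value: str) -> str:
    """Map each character to a coarse class symbol (hypothetical encoding)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("L")
        else:
            out.append(ch)  # punctuation and symbols are kept literally
    return "".join(out)

def pattern_report(values):
    """Return (pattern, sample, count, proportion) rows, most frequent first."""
    counts = Counter()
    samples = {}
    for v in values:
        p = basic_pattern(v)
        counts[p] += 1
        samples.setdefault(p, v)  # the first value seen serves as the sample
    total = len(values)
    return [(p, samples[p], n, n / total) for p, n in counts.most_common()]

if __name__ == "__main__":
    for row in pattern_report(["2022-03-30", "2023-10-05", "AB12"]):
        print(row)
```

In this sketch the two date-like values fall under one pattern and the remaining value under another, so each row carries the three quantities named in the claim.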
The method of claim 1, further comprising:

The first-level data pattern includes at least one sub-level data pattern, the mining granularity of the at least one sub-level data pattern is smaller than that of the first-level data pattern, and each sub-level data pattern in the at least one sub-level data pattern has a common substring with the first-level data pattern.
The method according to claim 1 or 2, characterized in that the method further includes:

Based on the multi-granularity data pattern, each level of data pattern in the multi-granularity data pattern is retrieved and matched through a knowledge base, where the knowledge base includes regular expressions corresponding to different business patterns;

According to the retrieval results, output and display the multi-granularity business pattern corresponding to the data to be processed, wherein the multi-granularity business pattern includes multiple levels of business patterns, the business pattern of each level matches one data pattern in the multi-granularity data pattern, and each level of business pattern corresponds to a business insight.
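
As a non-limiting illustration of the knowledge base matching described above, the following minimal Python sketch looks up the sample of each data pattern level in a small table of regular expressions. The table contents, the level names, and the function `match_business_patterns` are hypothetical; the application does not specify how the knowledge base is stored or queried.

```python
import re

# Hypothetical knowledge base: business pattern name -> regular expression.
KNOWLEDGE_BASE = {
    "ISO date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "IPv4 address": re.compile(r"(\d{1,3}\.){3}\d{1,3}"),
}

def match_business_patterns(samples_by_level):
    """Map each level's sample to the first knowledge-base entry that fully matches it."""
    result = {}
    for level, sample in samples_by_level.items():
        result[level] = next(
            (name for name, rx in KNOWLEDGE_BASE.items() if rx.fullmatch(sample)),
            None,  # no business pattern found for this level
        )
    return result

if __name__ == "__main__":
    print(match_business_patterns({"level 1": "2022-03-30", "level 2": "abc"}))
```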
The method according to any one of claims 1 to 3, characterized in that the multi-granularity pattern mining of the data to be processed includes:

Analyze the data to be processed to obtain the basic pattern corresponding to the data to be processed;

Using a common substring mining algorithm to iteratively mine common substrings on the data to be processed with the same basic pattern;

Based on the common substring obtained after each iteration of mining, generate the atomic pattern corresponding to the data to be processed;

The atomic patterns corresponding to the data to be processed are merged to obtain a multi-granularity data pattern corresponding to the data to be processed.
The method of claim 4, wherein using a common substring mining algorithm to iteratively mine common substrings on the data to be processed with the same basic pattern includes:

Based on the suffix data of the data to be processed having the same basic pattern, obtain the substrings whose occurrence frequency is greater than a preset threshold;

All substrings whose occurrence frequency value is greater than the preset threshold are screened to determine the common substrings generated after each iterative mining.
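
As a non-limiting illustration of the suffix-based mining described above, the following minimal Python sketch enumerates the prefixes of every suffix of each value, counts how many values contain each substring, keeps those whose count exceeds a threshold, and then screens out substrings contained in a longer frequent substring. The brute-force enumeration, the `min_len` parameter, and the maximality screening rule are assumptions made for this example; a suffix array or suffix tree could serve the same purpose more efficiently.

```python
from collections import Counter

def frequent_substrings(values, threshold, min_len=2):
    """Count, per substring, how many values contain it; keep counts > threshold."""
    support = Counter()
    for v in values:
        seen = set()
        for start in range(len(v)):                  # each suffix v[start:]
            suffix = v[start:]
            for end in range(min_len, len(suffix) + 1):
                seen.add(suffix[:end])               # each prefix of that suffix
        support.update(seen)                         # one vote per value
    return {s: n for s, n in support.items() if n > threshold}

def screen_maximal(frequent):
    """Drop any frequent substring that is contained in a longer frequent substring."""
    keep = []
    for s in frequent:
        if not any(s != t and s in t for t in frequent):
            keep.append(s)
    return sorted(keep)

if __name__ == "__main__":
    data = ["user_001@a.com", "user_002@a.com", "user_107@a.com"]
    print(screen_maximal(frequent_substrings(data, threshold=2)))
```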
The method of claim 4, wherein using a common substring mining algorithm to iteratively mine common substrings on the data to be processed with the same basic pattern includes:

Align the data to be processed having the same basic pattern, and compare the aligned data position by position to determine whether the data at the same position are the same;

The common substring generated after each iteration of mining is determined based on the comparison results.
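
As a non-limiting illustration of the position-by-position comparison described above, the following minimal Python sketch aligns the values at index 0 and collects maximal runs of positions at which every value holds the same character. Left alignment, truncation to the shortest value, and the name `positional_common_substrings` are assumptions made for this example.

```python
def positional_common_substrings(values, min_len=1):
    """Return (start, substring) runs where every value has the same character."""
    if not values:
        return []
    length = min(len(v) for v in values)  # assumption: compare up to the shortest value
    runs, start = [], None
    for i in range(length + 1):
        # The extra iteration at i == length closes a run that reaches the end.
        same = i < length and all(v[i] == values[0][i] for v in values)
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_len:
                runs.append((start, values[0][start:i]))
            start = None
    return runs

if __name__ == "__main__":
    print(positional_common_substrings(["2022-03-30", "2023-10-05"]))
```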
The method of claim 4, wherein using a common substring mining algorithm to iteratively mine common substrings on the data to be processed with the same basic pattern includes:

Based on the data to be processed having the same basic pattern, construct a frequent pattern tree (FP-tree);

According to the FP-tree, the common substring generated after each iteration of mining is determined.
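
As a non-limiting illustration of the FP-tree variant described above, the following minimal Python sketch treats each value as a transaction of (position, character) items, builds an FP-tree over the items that meet a minimum support, and reads common substrings off the best-supported branch. The choice of items, the `min_support` parameter, and the branch-following rule are assumptions made for this example; the application does not state how the common substring is derived from the tree.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, and its children keyed by item."""
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(values, min_support):
    """Build an FP-tree whose items are (position, character) pairs."""
    transactions = [list(enumerate(v)) for v in values]
    freq = Counter(item for t in transactions for item in t)
    root = Node()
    for t in transactions:
        kept = [it for it in t if freq[it] >= min_support]
        kept.sort(key=lambda it: (-freq[it], it))  # most frequent items first
        node = root
        for it in kept:
            node = node.children.setdefault(it, Node(it))
            node.count += 1
    return root

def common_substrings_from_tree(root, min_support):
    """Follow the best-supported branch and join adjacent positions into substrings."""
    items, node = [], root
    while node.children:
        child = max(node.children.values(), key=lambda n: n.count)
        if child.count < min_support:
            break
        items.append(child.item)
        node = child
    items.sort()  # restore positional order
    runs = []
    for pos, ch in items:
        if runs and runs[-1][0] + len(runs[-1][1]) == pos:
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)  # extend the current run
        else:
            runs.append((pos, ch))
    return runs

if __name__ == "__main__":
    data = ["2022-03-30", "2023-10-05", "2021-12-31"]
    print(common_substrings_from_tree(build_fp_tree(data, 3), 3))
```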
The method according to any one of claims 4 to 7, characterized in that merging the atomic patterns corresponding to the data to be processed includes:

Based on a dynamic programming algorithm, calculate the edit distance between any two atomic patterns in all the atomic patterns;

According to the edit distance calculation result and the preset merging strategy, the atomic patterns corresponding to the data to be processed are merged.
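
As a non-limiting illustration of the merging step described above, the following minimal Python sketch computes the Levenshtein edit distance by dynamic programming and then groups atomic patterns whose distance to a group's first member does not exceed a bound. The greedy grouping rule, the `max_distance` parameter, and the placeholder notation `<D>` in the sample patterns are assumptions made for this example; the preset merging strategy itself is not specified here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rolling rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def merge_patterns(patterns, max_distance):
    """Greedily group patterns that lie within max_distance of a group's first member."""
    groups = []
    for p in patterns:
        for g in groups:
            if edit_distance(p, g[0]) <= max_distance:
                g.append(p)
                break
        else:
            groups.append([p])  # no close group found, start a new one
    return groups

if __name__ == "__main__":
    print(merge_patterns(["user_<D>@a.com", "user_<D>@b.com", "ORD-<D>"], 1))
```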
The method according to any one of claims 4 to 8, characterized in that said parsing the data to be processed to obtain the basic pattern corresponding to the data to be processed includes:

Use a context-free grammar (CFG) to parse the data to be processed and obtain the regular expression corresponding to the basic pattern;

According to the regular expression corresponding to the basic pattern, a basic pattern corresponding to the data to be processed is generated.
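
As a non-limiting illustration of deriving a regular expression for a basic pattern, the following minimal Python sketch scans a value, classifies each character, and collapses runs of the same class into a quantified regular expression. A simple character-class scanner stands in here for the CFG parser, and the names `char_class` and `basic_pattern_regex` are assumptions made for this example.

```python
import re
from itertools import groupby

def char_class(ch: str) -> str:
    """Return the regex fragment for one character (hypothetical classes)."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)  # other characters are matched literally

def basic_pattern_regex(value: str) -> str:
    """Collapse runs of the same character class into one quantified fragment."""
    parts = []
    for cls, run in groupby(char_class(ch) for ch in value):
        n = len(list(run))
        parts.append(cls if n == 1 else f"{cls}{{{n}}}")
    return "".join(parts)

if __name__ == "__main__":
    rx = basic_pattern_regex("2022-03-30")
    print(rx, bool(re.fullmatch(rx, "2023-10-05")))
```

Values that yield the same regular expression can then be treated as sharing one basic pattern.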
A multi-granularity data pattern mining device, characterized by including:

Reading and parsing module, used to read data to be processed;

A processing module, configured to perform multi-granularity pattern mining on the data to be processed, and generate multi-granularity data patterns corresponding to the data to be processed based on the multi-granularity pattern mining results;

The output display module is used to output and display the multi-granularity data pattern corresponding to the data to be processed, wherein the multi-granularity data pattern includes a basic pattern corresponding to the data to be processed, the basic pattern includes a first-level data pattern and a second-level data pattern, and each level of data pattern includes a data pattern sample, the amount of data matching the data pattern sample, and the proportion of that data in the data to be processed.
The device of claim 10, wherein the first-level data pattern includes at least one sub-level data pattern, the mining granularity of the at least one sub-level data pattern is smaller than that of the first-level data pattern, and each sub-level data pattern in the at least one sub-level data pattern has a common substring with the first-level data pattern.
The device according to claim 10 or 11, characterized in that the processing module is also used to:

Based on the multi-granularity data pattern, each level of data pattern in the multi-granularity data pattern is retrieved and matched through a knowledge base, where the knowledge base includes regular expressions corresponding to different business patterns;

According to the retrieval results, output and display the multi-granularity business pattern corresponding to the data to be processed, wherein the multi-granularity business pattern includes multiple levels of business patterns, the business pattern of each level matches one data pattern in the multi-granularity data pattern, and each level of business pattern corresponds to a business insight.
The device according to any one of claims 10 to 12, characterized in that the processing module is specifically used for:

Analyze the data to be processed to obtain the basic pattern corresponding to the data to be processed;

Using a common substring mining algorithm to iteratively mine common substrings on the data to be processed with the same basic pattern;

Based on the common substring obtained after each iteration of mining, generate the atomic pattern corresponding to the data to be processed;

The atomic patterns corresponding to the data to be processed are merged to obtain a multi-granularity data pattern corresponding to the data to be processed.
The device according to claim 13, characterized in that the processing module is specifically used for:

Based on the suffix data of the data to be processed having the same basic pattern, obtain the substrings whose occurrence frequency is greater than a preset threshold;

All substrings whose occurrence frequency is greater than the preset threshold are screened to determine the common substrings generated after each iteration of mining.
The device according to claim 13, characterized in that the processing module is specifically used for:

Align the data to be processed having the same basic pattern, and compare the aligned data position by position to determine whether the data at the same position are the same;

The common substring generated after each iteration of mining is determined based on the comparison results.
The device according to claim 13, characterized in that the processing module is specifically used for:

Based on the data to be processed having the same basic pattern, construct a frequent pattern tree (FP-tree);

According to the FP-tree, the common substring generated after each iteration of mining is determined.
The device according to any one of claims 13-16, characterized in that the processing module is specifically used for:

Based on a dynamic programming algorithm, calculate the edit distance between any two atomic patterns in all the atomic patterns;

According to the edit distance calculation result and the preset merging strategy, the atomic patterns corresponding to the data to be processed are merged.
The device according to any one of claims 13-17, characterized in that the processing module is specifically used for:

Use a context-free grammar (CFG) to parse the data to be processed and obtain the regular expression corresponding to the basic pattern;

According to the regular expression corresponding to the basic pattern, a basic pattern corresponding to the data to be processed is generated.
A computing device, characterized in that the computing device includes a processor and a memory, and the processor executes computer instructions stored in the memory, so that the computing device executes the method described in any one of claims 1-9.
A computer-readable storage medium stores a computer program. When the computer program is executed by a processor, the processor executes the method described in any one of claims 1-9.