WO2023185377A1 - Multi-granularity data pattern mining method and related device

Multi-granularity data pattern mining method and related device

Info

Publication number
WO2023185377A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
pattern
processed
granularity
mining
Prior art date
Application number
PCT/CN2023/079655
Other languages
French (fr)
Chinese (zh)
Inventor
魏子恒
郝诗源
龙江
吕红
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Publication of WO2023185377A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Provided in the present application are a multi-granularity data pattern mining method and a related device. The method comprises: reading data to be processed, and performing multi-granularity pattern mining on the data to be processed; according to a multi-granularity pattern mining result, generating a multi-granularity data pattern corresponding to the data to be processed; outputting and displaying the multi-granularity data pattern corresponding to the data to be processed, wherein the multi-granularity data pattern comprises a basic pattern corresponding to the data to be processed, the basic pattern comprises a first-level data pattern and a second-level data pattern, and each level of data pattern comprises a data pattern sample, the amount of data matching the data pattern sample, and the proportion of the data in the data to be processed. The method can enrich the mining granularities of a data pattern, helps a user to comprehensively and effectively recognize data features, and can display, in multiple dimensions, the data features of the data and service insights.

Description

A multi-granularity data pattern mining method and related device
This application claims priority to Chinese Patent Application No. 2022103260772, filed with the China Patent Office on March 30, 2022 and entitled "A multi-granularity data pattern mining method and related device", the entire content of which is incorporated herein by reference.
Technical field
The present invention relates to the field of data mining, and in particular to a multi-granularity data pattern mining method and related device.
Background
A data pattern is an important means of presenting data content and showing how that content is distributed. It is a key component of products for data preparation, data asset management, and data warehousing (extract-transform-load, ETL), and it is also an important basis for intelligent data governance algorithms such as automatic ETL, data feature extraction, and operator recommendation.
Pattern mining (PM) is the main technique for obtaining data patterns. Most data governance vendors today, such as Informatica, Trifacta, and Talend, offer pattern mining and integrate it into modules for data preparation, data cleaning, data cataloging, and data overview. These modules mine the data patterns of a data set to help users with the main data governance tasks: data content profiling, data cleaning, data format conversion, and data integration. However, current pattern mining algorithms support only basic patterns and a few special data patterns (such as date and address templates). They cannot identify other content features or business features of the data. For coded data in particular, existing algorithms cannot recognize the coding features at all, so they can hardly provide users with multi-dimensional data insights and business insights.
Therefore, how to enrich the mining granularity of data patterns, help users identify data features comprehensively and effectively, and present the data features and business insights of the data in multiple dimensions is a problem that urgently needs to be solved.
Summary of the invention
Embodiments of the present invention disclose a multi-granularity data pattern mining method and related device, which can enrich the mining granularity of data patterns, help users identify data features comprehensively and effectively, and present the data features and business insights of the data in multiple dimensions.
In a first aspect, this application provides a multi-granularity data pattern mining method, including: reading data to be processed, and performing multi-granularity pattern mining on the data to be processed; generating, according to the multi-granularity pattern mining result, a multi-granularity data pattern corresponding to the data to be processed; and outputting and displaying the multi-granularity data pattern corresponding to the data to be processed, where the multi-granularity data pattern includes a basic pattern corresponding to the data to be processed, the basic pattern includes a first-level data pattern and a second-level data pattern, and each level of data pattern includes a data pattern sample, the number of data items matching the data pattern sample, and their proportion in the data to be processed.
In the solution provided by this application, the data processing system is not limited to mining basic patterns when it processes data. Instead, it mines the data at multiple granularities along several dimensions, obtains data patterns at different levels, and displays them to the user. This helps the user identify data features comprehensively and effectively, so that the user can later perform data cleaning, format conversion, data integration, and similar work based on those features.
With reference to the first aspect, in a possible implementation of the first aspect, the first-level data pattern includes at least one sub-level data pattern, the mining granularity of the at least one sub-level data pattern is smaller than that of the first-level data pattern, and each sub-level data pattern in the at least one sub-level data pattern shares a common substring with the first-level data pattern.
In the solution provided by this application, during multi-granularity pattern mining the data processing system can mine a data pattern at a given level more deeply based on common substrings, obtaining the sub-level data patterns under that level. This presents the data features in finer detail and helps the user identify the data better.
With reference to the first aspect, in a possible implementation of the first aspect, based on the multi-granularity data pattern, each level of data pattern in the multi-granularity data pattern is retrieved and matched through a knowledge base, where the knowledge base includes regular expressions corresponding to different business patterns; and according to the retrieval result, a multi-granularity business pattern corresponding to the data to be processed is output and displayed, where the multi-granularity business pattern includes business patterns at multiple levels, the business pattern at each level matches one data pattern in the multi-granularity data pattern, and the business pattern at each level corresponds to one business insight.
In the solution provided by this application, the knowledge base stores regular expressions corresponding to different business patterns and thereby provides business insights for the data. After mining the multi-granularity data pattern, the data processing system can retrieve and match each level of data pattern in the knowledge base to determine the business pattern corresponding to that level. The result is the multi-granularity business pattern corresponding to the data, which helps the user identify the business meaning the data expresses.
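The knowledge-base lookup described above can be sketched as follows. This is a minimal illustration, not the method claimed in the application: the knowledge base entries, their regular expressions, the function name `match_business_pattern`, and the choice to match a level's sample values (rather than the pattern string itself) are all assumptions made for the example.

```python
import re

# Hypothetical knowledge base: business pattern name -> regular expression.
KNOWLEDGE_BASE = {
    "{year}{month}{day}": r"\d{4}-\d{2}-\d{2}",
    "{IP address}": r"(?:\d{1,3}\.){3}\d{1,3}",
    "{area code}{phone number}": r"0\d{2,3}-\d{7,8}",
}

def match_business_pattern(samples):
    """Return the first business pattern whose regex fully matches every sample."""
    for name, regex in KNOWLEDGE_BASE.items():
        if all(re.fullmatch(regex, s) for s in samples):
            return name
    return None  # no business insight available for this level

# Each mined data pattern level is represented here by a few sample values.
levels = {
    "{number}[4]{symbol}[1]{number}[2]{symbol}[1]{number}[2]": ["2022-03-30", "2023-03-02"],
    "{English uppercase}[3]{number}[7]": ["BMI0002930", "BMI0003187"],
}
insights = {pattern: match_business_pattern(s) for pattern, s in levels.items()}
print(insights)
```

A level with no matching entry simply keeps its data pattern without a business label, which is one plausible way to handle values the knowledge base does not cover.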
With reference to the first aspect, in a possible implementation of the first aspect, the data to be processed is parsed to obtain the basic pattern corresponding to the data to be processed; a common substring mining algorithm is used to iteratively mine common substrings from the data to be processed that share the same basic pattern; atomic patterns corresponding to the data to be processed are generated based on the common substrings obtained after each mining iteration; and the atomic patterns corresponding to the data to be processed are merged to obtain the multi-granularity data pattern corresponding to the data to be processed.
In the solution provided by this application, the data processing system first parses the data to obtain the basic pattern, then uses a common substring mining algorithm to iteratively mine common substrings from data that shares the same basic pattern, generates the corresponding atomic patterns from the common substrings found in each iteration, and finally merges the atomic patterns to obtain the multi-granularity data pattern. This mines the data at more granularities and greater depth, so the data features can be presented more comprehensively.
With reference to the first aspect, in a possible implementation of the first aspect, substrings whose occurrence frequency is greater than a preset threshold are obtained based on suffix data of the data to be processed that shares the same basic pattern; and all substrings whose occurrence frequency is greater than the preset threshold are filtered to determine the common substrings generated after each mining iteration.
In the solution provided by this application, the data processing system mines common substrings from data that shares the same basic pattern based on a suffix array, finding the common substrings produced by each mining iteration. This achieves multi-granularity mining of the data and allows the corresponding multi-granularity data pattern to be displayed.
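The suffix-based step can be illustrated with a deliberately naive sketch. It enumerates every suffix of every value and counts, for each prefix of each suffix, how many values contain it. A real implementation would presumably use a proper suffix array; this quadratic enumeration is a stand-in chosen for readability. The function names, the `min_share` and `min_len` parameters, and the "keep only maximal substrings" filter are assumptions, not details from the application.

```python
from collections import defaultdict

def frequent_substrings(values, min_share=0.5, min_len=2):
    """Substrings that occur in more than `min_share` of the values."""
    owners = defaultdict(set)  # substring -> indices of values containing it
    for idx, value in enumerate(values):
        for start in range(len(value)):                  # each suffix value[start:]
            suffix = value[start:]
            for end in range(min_len, len(suffix) + 1):  # each prefix of that suffix
                owners[suffix[:end]].add(idx)
    total = len(values)
    return {s: len(ids) for s, ids in owners.items() if len(ids) / total > min_share}

def keep_maximal(frequent):
    """Drop a substring when a longer kept substring contains it with equal support."""
    return {
        s: n for s, n in frequent.items()
        if not any(s != t and s in t and n == m for t, m in frequent.items())
    }

values = ["CSBI0005568", "CSBI0008729", "CSBI0001234"]
common = keep_maximal(frequent_substrings(values, min_share=0.9))
print(common)
```

The filtering step matters because every substring of a frequent substring is itself frequent; without it the result would list "CS", "SB", "CSB" and so on alongside the one substring that is actually informative.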
With reference to the first aspect, in a possible implementation of the first aspect, the data to be processed that shares the same basic pattern is aligned and compared position by position to check whether the data at the same position is identical; and the common substrings generated after each mining iteration are determined based on the comparison result.
In the solution provided by this application, for data that shares the same basic pattern, common substrings are assumed to appear at the same positions. A position-by-position comparison is performed under this assumption, and all common substrings can be identified from the comparison result. This achieves multi-granularity mining of the data and allows the corresponding multi-granularity data pattern to be displayed.
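A minimal sketch of the position-by-position comparison follows. It assumes that values sharing a basic pattern have the same length, which holds when the basic pattern records run lengths as in {English uppercase}[4]{number}[7]. The function names and the segment representation are illustrative choices, not taken from the application.

```python
def aligned_segments(values):
    """Split equal-length values into constant and varying segments by position."""
    length = len(values[0])
    if any(len(v) != length for v in values):
        raise ValueError("values must share a basic pattern and therefore a length")
    segments = []  # list of (is_constant, start, end)
    for pos in range(length):
        constant = len({v[pos] for v in values}) == 1
        if segments and segments[-1][0] == constant:
            # Extend the current segment while the constant/varying state holds.
            segments[-1] = (constant, segments[-1][1], pos + 1)
        else:
            segments.append((constant, pos, pos + 1))
    return segments

def common_substrings(values):
    """Return the text of every constant segment, left to right."""
    return [values[0][s:e] for is_const, s, e in aligned_segments(values) if is_const]

print(common_substrings(["CSBI0005568", "CSBI0008729"]))  # ['CSBI000']
print(common_substrings(["BMI0002930", "BMI0003187"]))    # ['BMI000']
```

This sketch reports "CSBI000" as a single constant segment. Splitting it further into "CSBI" and "000" at the letter/digit boundary would be an additional step, for example by cutting constant segments wherever the character class changes.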
With reference to the first aspect, in a possible implementation of the first aspect, a frequent pattern tree (FP-tree) is constructed based on the data to be processed that shares the same basic pattern; and the common substrings generated after each mining iteration are determined according to the FP-tree.
In the solution provided by this application, the data processing system builds an FP-tree, based on association relationships, for data that shares the same basic pattern, then finds the common substrings and generates a substring set with the common substring processing procedure. This achieves multi-granularity mining of the data and allows the corresponding multi-granularity data pattern to be displayed.
With reference to the first aspect, in a possible implementation of the first aspect, the edit distance between any two of the atomic patterns is calculated based on a dynamic programming algorithm; and the atomic patterns corresponding to the data to be processed are merged according to the edit distance calculation result and a preset merging strategy.
In the solution provided by this application, the data processing system uses a dynamic programming algorithm to calculate the edit distance between any two atomic patterns and then merges the atomic patterns according to the preset merging strategy to generate the final multi-granularity data pattern. Similar data patterns are thereby merged, which makes the displayed result more concise and helps the user identify the data features.
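The edit-distance step can be sketched with the classic Levenshtein dynamic program applied to token sequences. The application does not specify the merging strategy, so the rule used here (merge two atomic patterns of equal length whose distance is at most `max_distance`, joining differing tokens with "/") is an assumption chosen to reproduce the {CSBI/BMI}{000}{number}[4] example. The token-list representation of an atomic pattern is likewise illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def merge(a, b, max_distance=1):
    """Assumed strategy: merge equal-length patterns that are within max_distance."""
    if len(a) != len(b) or edit_distance(a, b) > max_distance:
        return None
    return [x if x == y else f"{x}/{y}" for x, y in zip(a, b)]

p1 = ["CSBI", "000", "{number}[4]"]
p2 = ["BMI", "000", "{number}[4]"]
merged = merge(p1, p2)
print(edit_distance(p1, p2), merged)
```

Working on tokens rather than characters keeps the distance meaningful at the pattern level: replacing "CSBI" with "BMI" counts as one substitution instead of several character edits.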
With reference to the first aspect, in a possible implementation of the first aspect, a context-free grammar (CFG) is used to parse the data to be processed to obtain a regular expression corresponding to the basic pattern; and the basic pattern corresponding to the data to be processed is generated according to the regular expression corresponding to the basic pattern.
In the solution provided by this application, the data processing system parses the data based on a CFG to obtain the regular expression corresponding to the basic pattern, and then generates the basic pattern corresponding to the data. This completes the preliminary mining of the data and prepares for the subsequent, deeper multi-granularity mining.
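As a rough illustration of turning a value into a regular expression for its basic pattern, the sketch below uses a toy grammar of the form value -> segment+, where each segment is a run of one terminal class. All productions in this toy grammar are regular, so it does not exercise the full power of a CFG, and the application does not disclose its actual grammar. The class table, function names, and the decision to keep symbols as literals are assumptions.

```python
import re
from itertools import groupby

# Assumed terminal classes: (name, regex fragment, membership test).
CLASSES = [
    ("number", r"[0-9]", lambda c: "0" <= c <= "9"),
    ("English uppercase", r"[A-Z]", lambda c: "A" <= c <= "Z"),
    ("English lowercase", r"[a-z]", lambda c: "a" <= c <= "z"),
]

def classify(ch):
    """Return (class name, regex fragment) for one character."""
    for name, fragment, test in CLASSES:
        if test(ch):
            return name, fragment
    return "symbol", re.escape(ch)  # anything else is kept as a literal

def to_regex(value):
    """Apply value -> segment+ and emit one regex fragment per segment."""
    parts = []
    for (_name, fragment), run in groupby(value, key=classify):
        n = len(list(run))
        parts.append(fragment if n == 1 else f"{fragment}{{{n}}}")
    return "".join(parts)

regex = to_regex("CSBI0005568")
print(regex)
```

Values that share a basic pattern produce the same regular expression, so the expression can serve as the grouping key for the later common-substring mining.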
In a second aspect, this application provides a multi-granularity data pattern mining apparatus, including: a reading and parsing module, configured to read data to be processed; a processing module, configured to perform multi-granularity pattern mining on the data to be processed and to generate, according to the multi-granularity pattern mining result, a multi-granularity data pattern corresponding to the data to be processed; and an output and display module, configured to output and display the multi-granularity data pattern corresponding to the data to be processed, where the multi-granularity data pattern includes a basic pattern corresponding to the data to be processed, the basic pattern includes a first-level data pattern and a second-level data pattern, and each level of data pattern includes a data pattern sample, the number of data items matching the data pattern sample, and their proportion in the data to be processed.
With reference to the second aspect, in a possible implementation of the second aspect, the first-level data pattern includes at least one sub-level data pattern, the mining granularity of the at least one sub-level data pattern is smaller than that of the first-level data pattern, and each sub-level data pattern in the at least one sub-level data pattern shares a common substring with the first-level data pattern.
With reference to the second aspect, in a possible implementation of the second aspect, the processing module is further configured to: based on the multi-granularity data pattern, retrieve and match each level of data pattern in the multi-granularity data pattern through a knowledge base, where the knowledge base includes regular expressions corresponding to different business patterns; and according to the retrieval result, output and display a multi-granularity business pattern corresponding to the data to be processed, where the multi-granularity business pattern includes business patterns at multiple levels, the business pattern at each level matches one data pattern in the multi-granularity data pattern, and the business pattern at each level corresponds to one business insight.
With reference to the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to: parse the data to be processed to obtain the basic pattern corresponding to the data to be processed; use a common substring mining algorithm to iteratively mine common substrings from the data to be processed that shares the same basic pattern; generate atomic patterns corresponding to the data to be processed based on the common substrings obtained after each mining iteration; and merge the atomic patterns corresponding to the data to be processed to obtain the multi-granularity data pattern corresponding to the data to be processed.
With reference to the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to: obtain, based on suffix data of the data to be processed that shares the same basic pattern, substrings whose occurrence frequency is greater than a preset threshold; and filter all substrings whose occurrence frequency is greater than the preset threshold to determine the common substrings generated after each mining iteration.
With reference to the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to: align the data to be processed that shares the same basic pattern and compare it position by position to check whether the data at the same position is identical; and determine the common substrings generated after each mining iteration based on the comparison result.
With reference to the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to: construct a frequent pattern tree (FP-tree) based on the data to be processed that shares the same basic pattern; and determine, according to the FP-tree, the common substrings generated after each mining iteration.
With reference to the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to: calculate, based on a dynamic programming algorithm, the edit distance between any two of the atomic patterns; and merge the atomic patterns corresponding to the data to be processed according to the edit distance calculation result and a preset merging strategy.
With reference to the second aspect, in a possible implementation of the second aspect, the processing module is specifically configured to: use a context-free grammar (CFG) to parse the data to be processed to obtain a regular expression corresponding to the basic pattern; and generate the basic pattern corresponding to the data to be processed according to the regular expression corresponding to the basic pattern.
In a third aspect, this application provides a computing device. The computing device includes a processor and a memory connected through an internal bus. The memory stores instructions, and the processor invokes the instructions in the memory to perform the method provided in the first aspect or in any implementation of the first aspect.
In a fourth aspect, this application provides a computer storage medium storing a computer program. When the computer program is executed by a processor, the procedure of the method provided in the first aspect or in any implementation of the first aspect can be implemented.
In a fifth aspect, this application provides a computer program product. The computer program includes instructions which, when the computer program is executed by a computer, enable the computer to perform the procedure of the method provided in the first aspect or in any implementation of the first aspect.
Description of drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is a schematic diagram of a multi-granularity data pattern provided by an embodiment of this application;
Figure 2 is a schematic structural diagram of a multi-granularity data pattern mining system provided by an embodiment of this application;
Figure 3 is a schematic flowchart of a multi-granularity data pattern mining method provided by an embodiment of this application;
Figure 4 is a schematic diagram of customer number data provided by an embodiment of this application;
Figure 5 is a schematic diagram of an atomic pattern generation example provided by an embodiment of this application;
Figure 6 is a schematic diagram of atomic pattern merging provided by an embodiment of this application;
Figure 7 is a schematic diagram of a multi-granularity business pattern provided by an embodiment of this application;
Figure 8 is a schematic structural diagram of a multi-granularity data pattern mining apparatus provided by an embodiment of this application;
Figure 9 is a schematic structural diagram of a computing device provided by an embodiment of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. A person skilled in the art understands, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
First, some terms and related technologies involved in this application are explained with reference to the drawings, to aid understanding by a person skilled in the art.
A data pattern is an important means of presenting data content and showing how that content is distributed, and it is also an important basis for intelligent data governance algorithms such as automatic ETL, data feature extraction, and operator recommendation. For example, if the content of a column is [123,121,34,58,1], the data pattern of that column is: {number}[3], accounting for 40%; {number}[2], accounting for 40%; {number}[1], accounting for 20%.
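The column example above can be reproduced with a short sketch that maps each character to a coarse class, collapses runs of the same class into {class}[run length], and counts how often each resulting pattern occurs. The class names follow the notation of the example; the helper names `char_class`, `basic_pattern`, and `profile`, and the exact class boundaries, are assumptions for illustration.

```python
from collections import Counter
from itertools import groupby

def char_class(ch: str) -> str:
    """Map one character to a coarse class name (names are illustrative)."""
    if "0" <= ch <= "9":
        return "number"
    if "A" <= ch <= "Z":
        return "English uppercase"
    if "a" <= ch <= "z":
        return "English lowercase"
    if "\u4e00" <= ch <= "\u9fff":
        return "Chinese"
    return "symbol"

def basic_pattern(value: str) -> str:
    """Collapse runs of the same class into {class}[run length]."""
    return "".join(
        f"{{{cls}}}[{len(list(run))}]" for cls, run in groupby(value, key=char_class)
    )

def profile(column):
    """Return (pattern, count, proportion) rows, most frequent first."""
    counts = Counter(basic_pattern(str(v)) for v in column)
    total = sum(counts.values())
    return [(p, n, n / total) for p, n in counts.most_common()]

rows = profile([123, 121, 34, 58, 1])
for pattern, count, share in rows:
    print(f"{pattern}: {count} values, {share:.0%}")
```

Each row carries the three items the method displays for a data pattern level: the pattern itself, the number of matching values, and their proportion of the column.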
The basic pattern is a relatively simple, fundamental type of data pattern. Its hierarchy is simple: it shows only digits, English letters, Chinese characters, and symbols, together with their counts and proportions.
A multi-granularity data pattern (multi-granularity pattern) is a data pattern that can show the value characteristics present inside the data. It can express the hierarchical relationships between data patterns and add value constraints, so it expresses data features in multiple dimensions and brings more insight to data format conversion, feature expression, data cleaning, and similar tasks. For example, suppose the content of a column is [CSBI0005568,CSBI0008729,BMI0002930,BMI0003187]. The basic pattern of this column is: {English uppercase}[4]{number}[7], accounting for 50%; {English uppercase}[3]{number}[7], accounting for 50%. The column has several levels of granularity. The corresponding fine-grained level is: {CSBI}{000}{number}[4], accounting for 50%; {BMI}{000}{number}[4], accounting for 50%. These can be merged, giving: {CSBI/BMI}{000}{number}[4], accounting for 100%. The final multi-granularity data pattern is shown in Figure 1, which makes clear the data pattern at each level of granularity and the hierarchical relationships between the data patterns.
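One way to check the proportions quoted in this example is to write each pattern as a regular expression and count the values that match it. The regular expressions below are hand-written equivalents assumed for this illustration; the application does not state how its patterns map to regular expressions.

```python
import re

column = ["CSBI0005568", "CSBI0008729", "BMI0002930", "BMI0003187"]

# Hand-written regex equivalents of the patterns in the example (assumed mapping).
patterns = {
    "{English uppercase}[4]{number}[7]": r"[A-Z]{4}[0-9]{7}",
    "{English uppercase}[3]{number}[7]": r"[A-Z]{3}[0-9]{7}",
    "{CSBI}{000}{number}[4]": r"CSBI000[0-9]{4}",
    "{BMI}{000}{number}[4]": r"BMI000[0-9]{4}",
    "{CSBI/BMI}{000}{number}[4]": r"(?:CSBI|BMI)000[0-9]{4}",
}

def coverage(regex, values):
    """Return (number of fully matching values, proportion of the column)."""
    hits = sum(1 for v in values if re.fullmatch(regex, v))
    return hits, hits / len(values)

report = {name: coverage(rx, column) for name, rx in patterns.items()}
for name, (hits, share) in report.items():
    print(f"{name}: {hits} values, {share:.0%}")
```

The two basic patterns and the two fine-grained patterns each cover half of the column, while the merged pattern covers all four values, in line with the percentages given in the text.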
A multi-granularity business pattern is a data pattern that carries business meaning. It can be understood as a multi-granularity data pattern to which actual business meaning has been assigned, for example {region}{,}{area code}, {year}{month}{day}, or {zone}{territory}{number}{cloud service}{IP address}.
A context-free grammar (CFG) is an important kind of formal grammar in computer science. The syntactic categories (or syntactic units) it defines are completely independent of the context in which they may appear. For example, when an arithmetic expression is encountered in a programming language, the expression can be analyzed on its own without regard to its surroundings. This differs from natural language, where the same word or character may have different meanings in different contexts. The syntax of today's programming languages is, for the most part, described by context-free grammars.
The longest common substring (LCS) problem is the problem of finding the longest substring shared by all strings in a given set. Several algorithms exist for finding the longest common substring of multiple strings, for example exhaustive search, the Knuth-Morris-Pratt (KMP) algorithm, the generalized suffix tree algorithm, and the frequent pattern tree (FP-tree) algorithm.
In a typical pattern mining and processing scenario, the data is first read and then split. The splitting rules are based on the data patterns built into the system. At present, the built-in patterns of most vendors recognize only English letters, digits, and symbols. A few vendors also recognize dates, times, and regions, but the number of pattern templates is limited and only exact matches are supported. During splitting, each data pattern is counted iteratively to produce three attributes: pattern, support (count), and frequency (proportion). The statistics are then output, for example as a bar chart. Such a scheme depends heavily on the built-in patterns and works on one string at a time. It does not mine or analyze the features of the column as a whole, so it cannot give the user a complete view of the data features or useful insight into the data.
By way of example, assume a set of data: 20200625053258, 20200412132640, 2021061821031235, 20210421185832, …. With current approaches, splitting and analyzing this data yields a single data pattern: {digit}[14], count 1000, proportion 100%. The user cannot recognize the features of the data from this pattern and must rely on experience to conclude that the data represents a date and time with second precision. With the technical solution of this application, multi-granularity data patterns can be generated: {2020}{04}{digit}[8], count 150, proportion 15%; {2020}{06}{digit}[8], count 250, proportion 25%; {2021}{04}{digit}[8], count 320, proportion 32%; {2021}{06}{digit}[8], count 280, proportion 28%. Merging these gives {2020/2021}{04/06}{digit}[8], count 1000, proportion 100%. From the multi-granularity data pattern the user can readily recognize the encoding of the data and quickly determine that it represents a date and time with second precision.
This application provides a multi-granularity data pattern mining method and a related device. The method is executed by a data processing system. The system first reads a data stream and uses a basic pattern mining component, driven by the built-in basic pattern settings, to parse the stream and obtain its basic patterns. A multi-granularity pattern mining component then mines the result further: a common-substring mining algorithm iteratively mines common substrings from data that shares the same basic pattern, an atomic pattern is generated from the common substring found in each iteration, and the atomic patterns are finally merged into a multi-granularity data pattern. Optionally, the system combines the multi-granularity data pattern with the business types pre-stored in a knowledge base and uses a multi-granularity business pattern mining component to obtain the multi-granularity business pattern of the data. The method enriches the granularity at which data patterns are mined and helps users identify data features comprehensively and effectively. It is not limited to a single string and can present both data features and business insights in multiple dimensions.
The technical solutions of the embodiments of this application can also be applied to any scenario that requires processing and presenting data content, including but not limited to data cleaning, automatic ETL, structuring of semi-structured data, mining of similar columns and tables, data standard identification, and propagation of confidentiality and privacy labels.
The data processing system reads a data stream from a business system, parses it to obtain the multi-granularity data patterns of the data, and presents the corresponding data features and business insights to the user in multiple dimensions. As shown in Figure 2, the data processing system 210 can connect to and communicate with client 220 and client 230 over the Internet or an internal bus. Client 220 and client 230 provide the business data to be processed, and the data processing system 210 processes that data. They may be deployed on the same physical entity, for example on the same server, or on different servers; this application does not limit this. The data processing system 210 includes a data reader 2110, a basic pattern mining component 2120, a multi-granularity data pattern mining component 2130, a multi-granularity business pattern mining component 2140, a knowledge base 2150, and an output presenter 2160. The data reader 2110 reads business data from client 220 or client 230 and sends it to the basic pattern mining component 2120, which parses the business data according to the built-in basic pattern settings to obtain the corresponding basic patterns. The multi-granularity data pattern mining component 2130 then mines those basic patterns further to obtain the corresponding multi-granularity data patterns. Finally, the multi-granularity business pattern mining component 2140 performs business pattern mining based on the multi-granularity data patterns produced by component 2130 and the business types stored in the knowledge base 2150, and obtains the corresponding multi-granularity business patterns. In particular, the output presenter 2160 can output the results of components 2120, 2130, and 2140 separately and present them to the user, for example as bar charts or tree diagrams; this application does not limit the presentation form.
Based on the above, the multi-granularity data pattern mining method and related device provided by the embodiments of this application are described below. Figure 3 is a schematic flowchart of a multi-granularity data pattern mining method provided by an embodiment of this application. As shown in Figure 3, the method includes but is not limited to the following steps:
S301: The data processing system reads the data to be processed and performs multi-granularity pattern mining on the data to be processed.
Specifically, a number of data pattern identifiers are preset in the data processing system. They may include general-purpose identifiers such as uppercase English, lowercase English, symbol, digit, Chinese, other languages, and space, and may also include identifiers defined by the user. After obtaining the data to be processed, the system compares and matches the data against the preset identifiers one by one, which completes the parsing of the data and yields its basic pattern.
In one possible implementation, the data processing system parses the data to be processed with a CFG to obtain the regular expression corresponding to the basic pattern, and generates the basic pattern of the data from that regular expression.
By way of example, suppose the data read by the data processing system is the string expression, which consists of several elements (terms). While parsing, the system finds that every element can be replaced by a lowercase English letter, that is, any element of the data is interchangeable with {lowercase}. The first element e is therefore replaced by {lowercase}. The second element x is also interchangeable with {lowercase}, so it is merged with the first: {lowercase} and {lowercase} merge into {lowercase}[2]. Continuing to parse and merge iteratively under the same rule gives the final basic pattern of the data, {lowercase}[10].
The description above uses data that contains only one kind of data pattern identifier. Data that contains several kinds can be parsed with the same rules to obtain its basic pattern. For example, if the data to be processed is "北京Beijing-01032145680", CFG parsing produces the basic pattern {Chinese}[2]{uppercase}[1]{lowercase}[6]{symbol}[1]{digit}[11]. In addition, if the data processing system reads a set of data at once, it applies the CFG parsing to each item and then obtains the basic patterns of the set through simple counting and calculation. As shown in Figure 4, the system obtains a column of customer numbers, parses the column in this way, and generates the basic patterns of the column: {uppercase}[1]{lowercase}[1]{digit}[4]: [4, 0.571]; {empty}: [2, 0.286]; {symbol}[1]{digit}[1]: [1, 0.143].
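The parsing and counting described above can be sketched as follows. This is a minimal illustration rather than the CFG-based parser of the embodiment: it assumes a basic pattern is the run-length encoding of per-character classes, and the names `char_class`, `basic_pattern`, and `column_patterns`, as well as the English identifier labels, are hypothetical. Characters outside the listed classes (for example other languages) are simply labeled as symbols here.

```python
from collections import Counter
from itertools import groupby


def char_class(ch: str) -> str:
    """Map one character to a basic-pattern identifier (labels are illustrative)."""
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        return "Chinese"
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isupper():
        return "uppercase"
    if ch.isascii() and ch.islower():
        return "lowercase"
    if ch.isspace():
        return "space"
    return "symbol"  # simplification: everything else counts as a symbol


def basic_pattern(value: str) -> str:
    """Collapse runs of same-class characters into {class}[run length]."""
    if value == "":
        return "{empty}"
    return "".join(
        f"{{{cls}}}[{len(list(run))}]" for cls, run in groupby(value, key=char_class)
    )


def column_patterns(values):
    """Return (pattern, support, frequency) rows, most frequent first."""
    counts = Counter(basic_pattern(v) for v in values)
    total = len(values)
    return [(p, n, round(n / total, 3)) for p, n in counts.most_common()]
```

Calling `column_patterns(["123", "121", "34", "58", "1"])` yields one row per pattern with its support and frequency, which corresponds to the pattern, count, and proportion attributes described in this document.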
In another possible implementation, the data processing system uses a common-substring mining algorithm to iteratively mine common substrings from the data to be processed that shares the same basic pattern.
Specifically, the data processing system mines common substrings from data with the same basic pattern in order to find the data items that share a substring. When two data items share a common substring, they are very likely to sit at the same data level and to share a data feature, so the common substring exposes hidden data features at a finer granularity.
Optionally, the data processing system uses the suffix array of the data with the same basic pattern to find substrings whose occurrence frequency exceeds a preset threshold, filters all such substrings, and determines the common substring generated in each mining iteration.
Specifically, to find the longest common substring of N data items with the same basic pattern, the data processing system can convert the problem into finding the maximum of the longest common prefixes of certain suffixes that together belong to all N items. For example, let the N items be S1, S2, S3, …, SN. First build a single string S by joining the N items with distinct separators, that is, S = S1[P1]S2[P2]…SN-1[PN-1]SN, where P1, P2, …, PN-1 are N-1 distinct characters that do not occur in the character set. The suffix array and the height array of S (the lengths of the longest common prefixes of adjacent suffixes in sorted order) can then be computed with the doubling algorithm or the DC3 algorithm. Next, binary search is applied to the answer A, that is, the hypothesis that the N items share a common substring of length A, and the feasibility of each candidate A is verified. The verification results determine the longest common substring of the N items.
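The binary search over the answer length A can be sketched as follows. Note the substitution: for brevity this sketch checks feasibility by intersecting hash sets of length-A substrings instead of building the suffix and height arrays described above, so it illustrates the search structure only and is slower on long inputs. The function name `longest_common_substring` is hypothetical.

```python
def longest_common_substring(strings):
    """Longest substring shared by every input string ("" if there is none)."""
    if not strings:
        return ""

    def common_of_length(a):
        # Feasibility check for length a: intersect the sets of
        # length-a substrings of all strings.
        shared = None
        for s in strings:
            subs = {s[i:i + a] for i in range(len(s) - a + 1)}
            shared = subs if shared is None else shared & subs
            if not shared:
                return None
        return min(shared)  # deterministic choice when several candidates tie

    # If a common substring of length a exists, one of length a - 1 exists
    # as well, so feasibility is monotone and binary search applies.
    lo, hi, best = 1, min(len(s) for s in strings), ""
    while lo <= hi:
        mid = (lo + hi) // 2
        found = common_of_length(mid)
        if found is not None:
            best, lo = found, mid + 1
        else:
            hi = mid - 1
    return best
```

When several substrings of maximal length exist, this sketch returns the lexicographically smallest one; a real implementation might instead prefer the candidate by position or frequency.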
As can be seen, this algorithm finds, among data with the same basic pattern, the items that share a common substring. All of these items have the data feature that the common substring represents, so the features of the data can be shown at a finer granularity. In particular, when several data items with the same basic pattern share more than one common substring, all of the common substrings can be found by running the algorithm iteratively.
Optionally, the data processing system aligns the data with the same basic pattern, compares the items position by position to check whether the characters at each position are identical, and determines the common substring generated in each mining iteration from the comparison results.
Specifically, when the basic patterns are the same, common substrings are more likely to appear at the same positions. On this assumption, data with the same basic pattern is compared iteratively, one position at a time, so that all common substrings can be found.
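A minimal sketch of the position-by-position comparison follows. It assumes that values with the same basic pattern have the same length, and it introduces a hypothetical `min_len` parameter so that isolated single-character coincidences are not reported as common substrings; neither the function name nor the parameter comes from the embodiment.

```python
def aligned_common_runs(values, min_len=2):
    """Return (start, literal) runs where every value has the same characters.

    Assumes all values share one basic pattern and therefore one length.
    """
    if not values:
        return []
    length = len(values[0])
    if any(len(v) != length for v in values):
        raise ValueError("values must have equal length")
    runs, start = [], None
    # Iterate one step past the end so that a run reaching the last
    # character is closed as well.
    for i in range(length + 1):
        same = i < length and all(v[i] == values[0][i] for v in values)
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_len:
                runs.append((start, values[0][start:i]))
            start = None
    return runs
```

For the two values 057112345123 and 057187541123 this reports the runs 0571 (at position 0) and 123 (at position 9); the single matching character at position 7 is dropped because it is shorter than `min_len`.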
Optionally, the data processing system builds an FP-tree from the data with the same basic pattern and uses the FP-tree to determine the common substring generated in each mining iteration.
Specifically, the data processing system first scans the data with the same basic pattern to obtain the counts of all frequent 1-itemsets. An itemset is a set of items, where the items are the characters contained in each data item. A frequent itemset is an itemset whose support is greater than or equal to the minimum support, and support is the probability that the itemset appears among all transactions. For example, for an itemset A and a table of N rows of which M rows contain A, the absolute support is M and the relative support is M/N. The system then deletes the items whose support is below the threshold, places the frequent 1-itemsets in the header table, and sorts them in descending order of support. Next it scans the data again, removes the infrequent items from each original data item, and sorts the remaining items in descending order of support. The sorted data is then inserted into the FP-tree in that order: items earlier in the order become ancestor nodes and later items become descendant nodes. If a shared ancestor already exists, the count of that ancestor node is increased by 1. If an insertion creates a new node, the corresponding entry in the header table is linked to the new node through the node linked list. The FP-tree is complete once all data has been inserted.
After the FP-tree has been built, the conditional pattern base of each header table entry can be found by working upward from the bottom of the header table. Recursive mining of the conditional pattern bases yields the frequent itemsets of the header table entries, and the common substrings of the data with the same basic pattern can be found from the mining results.
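The two-pass construction described above can be sketched as follows. This is a sketch of the tree-building stage only, under the usual FP-growth convention that each transaction is treated as a set of items (here, the distinct characters of a value); the recursive mining of conditional pattern bases is omitted. The names `FPNode` and `build_fp_tree`, and the alphabetical tie-break between items of equal support, are assumptions made for illustration.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}
        self.link = None  # next node in the tree that holds the same item


def build_fp_tree(transactions, min_support):
    """Build an FP-tree; each transaction is an iterable of items."""
    # Pass 1: count 1-itemsets and keep the frequent ones.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_support}
    # Header table: item -> [support, first node of that item's linked list],
    # in descending order of support (ties broken by item for determinism).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    header = {i: [frequent[i], None] for i in order}
    rank = {item: r for r, item in enumerate(order)}

    root = FPNode(None)
    # Pass 2: drop infrequent items, sort by support order, insert.
    for t in transactions:
        node = root
        for item in sorted({i for i in t if i in rank}, key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # Append the new node to the header table's linked list.
                if header[item][1] is None:
                    header[item][1] = child
                else:
                    last = header[item][1]
                    while last.link is not None:
                        last = last.link
                    last.link = child
            child.count += 1  # shared prefixes just increase the count
            node = child
    return root, header
```

Because set-based transactions discard character order and position, a substring-oriented use of the tree would need items that encode position as well (for example, pairs of index and character); that design choice is left open here.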
Further, the data processing system generates the atomic patterns of the data to be processed from the common substring obtained in each mining iteration.
Specifically, after completing the iterative common-substring mining of data with the same basic pattern, the data processing system can generate more refined data patterns from the common substring found in each iteration. These refined data patterns are called atomic patterns; that is, an atomic pattern is a more refined data pattern beneath a basic pattern.
By way of example, Figure 5 is a schematic diagram of atomic pattern generation provided by an embodiment of this application. Assume the data table contains the following data: 057112345123, 057187541123, 123456789521, 0571-45571233, 0571-89654432. The data is first parsed to obtain its basic patterns: {digit}[12]: [3, 0.6] and {digit}[4]{symbol}[1]{digit}[8]: [2, 0.4]. For data with the same basic pattern, the data processing system can apply any of the common-substring mining algorithms described above to mine common substrings iteratively. For the data under the basic pattern {digit}[12]: [3, 0.6], the longest common substring found in the first iteration is 0571, which gives the first-layer atomic pattern 0571{digit}[8]. Mining continues on top of the first-layer atomic pattern: for the data under the pattern 0571{digit}[8], the longest common substring found in the second iteration is 123, which gives the second-layer atomic pattern 0571{digit}[5]123. For the data under the basic pattern {digit}[4]{symbol}[1]{digit}[8]: [2, 0.4], the longest common substring found in the first iteration is 0571-, which gives the first-layer atomic pattern 0571-{digit}[8]. Mining continues on top of this pattern, but the second iteration finds no further common substring, so there is no more refined atomic pattern and the mining ends.
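The layer-by-layer refinement in this example can be sketched as follows. The sketch assumes the common substrings have already been mined and are supplied as (start, text) pairs in the order in which the iterations found them; it only renders the resulting atomic patterns. The names `render_atomic`, `atomic_layers`, and `simple_class` are hypothetical.

```python
from itertools import groupby


def simple_class(ch):
    """Illustrative classifier: digits versus everything else."""
    return "digit" if ch.isdigit() else "symbol"


def render_atomic(value, literal_spans, classify=simple_class):
    """Render `value`, keeping the literal spans verbatim and abstracting the rest.

    literal_spans: non-overlapping (start, text) pairs of common substrings.
    """
    keep = set()
    for start, text in literal_spans:
        keep.update(range(start, start + len(text)))
    # Tag each position with (is literal, class, character), then collapse
    # consecutive positions that share the first two tags.
    keyed = [(i in keep, classify(ch), ch) for i, ch in enumerate(value)]
    out = []
    for (is_lit, cls), run in groupby(keyed, key=lambda k: (k[0], k[1])):
        chars = [ch for _, _, ch in run]
        out.append("".join(chars) if is_lit else f"{{{cls}}}[{len(chars)}]")
    return "".join(out)


def atomic_layers(value, spans_by_iteration, classify=simple_class):
    """One atomic pattern per mining iteration, each adding one more literal."""
    return [
        render_atomic(value, spans_by_iteration[: k + 1], classify)
        for k in range(len(spans_by_iteration))
    ]
```

With the value 057112345123 and the spans (0, "0571") and (9, "123"), `atomic_layers` returns the first-layer and second-layer atomic patterns of the example, one list entry per iteration.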
S302: The data processing system generates the multi-granularity data pattern of the data to be processed from the result of the multi-granularity pattern mining.
Specifically, after obtaining the atomic patterns of the data to be processed, the data processing system merges the atomic patterns and generates the multi-granularity data pattern of the data from the merged result.
In one possible implementation, the data processing system uses a dynamic programming algorithm to compute the edit distance between every pair of atomic patterns, and merges the atomic patterns of the data to be processed according to the computed edit distances and a preset merging strategy.
Specifically, the data processing system first defines an edit-distance policy. A deletion or an insertion has an edit distance of 2. For a substitution, the edit distance depends on the base regular expression of the characters involved, on whether each is a literal, and on their concrete values: if the base regular expressions are the same but one is a literal and the other is not, the edit distance is 1; if the base regular expressions differ, the edit distance is 3; if the base regular expressions are the same and both are literals with different values, the edit distance is 0.5; if the two are identical, the edit distance is 0.
By way of example, assume two atomic patterns, 0571{digit}[5]123 and 0571-{digit}[8], referred to below as pattern 1 and pattern 2. To compute the edit distance of pattern 1 relative to pattern 2, the edit distance of each character is computed and the results are summed. First, the 0 in pattern 1 and the 0 in pattern 2 have the same base regular expression, are both literals, and have the same value, so their edit distance is 0. The same holds for 5, 7, and 1. Pattern 2 has one more character than pattern 1, the "-", so a "-" must be inserted into pattern 1 at that position, and the insertion has an edit distance of 2. The {digit}[8] in pattern 2 can be regarded as {digit}[5]{digit}[3], so the edit distance of the {digit}[5] part is 0. The last part is 123 in pattern 1 and {digit}[3] in pattern 2. Both are digits, so the base regular expressions are the same, but one is a literal and the other is not, so the edit distance is 1 for each of the three positions, 3 in total. Summing all of these gives the edit distance between pattern 1 and pattern 2, which is 5.
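The cost policy and the worked example above can be checked with a small dynamic-programming sketch. It assumes each atomic pattern is expanded into one token per character position, where a token is either a literal character or a wildcard of some base class; the `Tok` type and the helpers `lits` and `wild` are hypothetical, and two wildcards of the same class are treated as identical (cost 0).

```python
from typing import NamedTuple, Optional


class Tok(NamedTuple):
    cls: str                   # base class, e.g. "digit" or "symbol"
    lit: Optional[str] = None  # concrete character, or None for a wildcard


INDEL = 2.0  # cost of one insertion or deletion


def sub_cost(a: Tok, b: Tok) -> float:
    """Substitution cost following the edit-distance policy in the text."""
    if a.cls != b.cls:
        return 3.0                      # different base regular expressions
    if a.lit is None and b.lit is None:
        return 0.0                      # identical wildcards
    if a.lit is None or b.lit is None:
        return 1.0                      # one literal, one wildcard
    return 0.0 if a.lit == b.lit else 0.5


def pattern_distance(p, q) -> float:
    """Weighted edit distance between two token sequences."""
    rows, cols = len(p) + 1, len(q) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * INDEL
    for j in range(1, cols):
        d[0][j] = j * INDEL
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + INDEL,                             # delete from p
                d[i][j - 1] + INDEL,                             # insert into p
                d[i - 1][j - 1] + sub_cost(p[i - 1], q[j - 1]),  # substitute
            )
    return d[-1][-1]


def lits(text, cls):
    return [Tok(cls, ch) for ch in text]


def wild(cls, n):
    return [Tok(cls)] * n


pattern_1 = lits("0571", "digit") + wild("digit", 5) + lits("123", "digit")
pattern_2 = lits("0571", "digit") + lits("-", "symbol") + wild("digit", 8)
```

Here `pattern_distance(pattern_1, pattern_2)` reproduces the total of the example: one insertion (2) plus three literal-versus-wildcard substitutions (1 each).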
The data processing system can build a distance matrix from the edit-distance calculation. The number of rows of the matrix equals the number of positions in one atomic pattern (such as pattern 1 above), and the number of columns equals the number of positions in the other atomic pattern (such as pattern 2 above). The edit distances of the required deletions, insertions, and substitutions are filled in at the corresponding positions. A shortest-path algorithm is then used to find the shortest of all feasible paths through the matrix. This shortest path is the sequence of operations that converts one atomic pattern into the other at the lowest cost.
Further, after computing the edit distance for every pair of atomic patterns, the data processing system repeatedly takes the two atomic patterns with the smallest edit distance and merges them. During merging, a conversion action can be generated from the shortest path found above, and the merge is carried out according to that action. Merging stops when everything has been merged or when the elements at the same position in the two patterns have different base regular expressions. Finally, the data processing system generates the multi-granularity data pattern of the data to be processed from the merged result. Once the multi-granularity data pattern exists, the data pattern at any level can be compiled in the forward direction into a user-readable data pattern, or compiled in the reverse direction into a machine-readable regular expression, which can be used to check whether newly added data matches, to perform format conversion, and for other operations.
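The two compilation directions mentioned above can be sketched as follows, assuming a pattern is held as a list of tokens that are either literals or counted character classes. The token layout, the `CLASS_REGEX` mapping from identifiers to regular-expression character classes, and the function names are all assumptions made for illustration.

```python
import re

# Assumed mapping from basic-pattern identifiers to regex character classes.
CLASS_REGEX = {
    "digit": "[0-9]",
    "uppercase": "[A-Z]",
    "lowercase": "[a-z]",
    "Chinese": "[\u4e00-\u9fff]",
}


def to_readable(tokens):
    """Forward direction: a user-readable data pattern."""
    return "".join(
        tok[1] if tok[0] == "lit" else f"{{{tok[1]}}}[{tok[2]}]" for tok in tokens
    )


def to_regex(tokens):
    """Reverse direction: a machine-readable regular expression.

    tokens: list of ("lit", text) or ("class", name, count) tuples.
    """
    parts = []
    for tok in tokens:
        if tok[0] == "lit":
            parts.append(re.escape(tok[1]))  # literals must match verbatim
        else:
            _, name, count = tok
            parts.append(f"{CLASS_REGEX[name]}{{{count}}}")
    return re.compile("".join(parts))


def matches(tokens, value):
    """Check whether a newly added value conforms to the pattern."""
    return to_regex(tokens).fullmatch(value) is not None
```

For example, with `tokens = [("lit", "0571-"), ("class", "digit", 8)]`, `matches(tokens, "0571-45571233")` accepts the value while a 12-digit value without the hyphen is rejected.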
By way of example, Figure 6 is a schematic diagram of atomic pattern merging provided by an embodiment of this application. Taking the atomic patterns generated in Figure 5 and merging them with the method and strategy described above, the atomic pattern 0571{digit}[5]123 is first merged with the atomic pattern 0571-{digit}[8], which gives the data pattern 0571{-}{digit}[8]. This is then merged with the atomic pattern {digit}[12], which gives the data pattern {digit}[4]{-}{digit}[8]. The multi-granularity data pattern is generated as a result.
S303: The data processing system outputs and presents the multi-granularity data pattern of the data to be processed.
Specifically, after generating the multi-granularity data pattern, the data processing system outputs it and presents the data features of the data in multiple dimensions, which helps the user identify the data comprehensively and effectively.
In one possible implementation, the data processing system outputs the multi-granularity data pattern as a tree structure. The tree structure has multiple levels, and each level corresponds to the data patterns of one granularity.
容易理解,数据处理系统为了更直观的向用户展示数据模式之间的层次关系,帮助用户更加有效全面的识别数据特征,所以选用树状结构输出多粒度数据模式,通过树的各个节点之间的连接关系清晰的展示各个数据模式之间的层级关系。It is easy to understand that in order to more intuitively display the hierarchical relationship between data patterns to users and help users identify data features more effectively and comprehensively, the data processing system chooses a tree structure to output multi-granularity data patterns, through the connections between each node of the tree. The connection relationship clearly shows the hierarchical relationship between each data model.
应理解,在输出展示的多粒度数据模式中,基础模式可以包括不同层级的数据模式,每一层级的数据模式包括数据模式样例和与所述数据模式样例匹配的数据的数量以及在所述待 处理数据中所占的比例,特别的,对于每一层级的数据模式,其可以包括多个子层级数据模式,每个子层级数据模式的挖掘粒度小于该层级的数据模式,且每个子层级数据模式与该层级数据模式具有公共子字符串。例如在上述图6所示的合并之后得到的多粒度数据模式图中,基础模式为{数字}[4]{-}{数字}[8],数量为5个,占比是100%,其包括第一层级数据模式和第二层级数据模式,其中,第一层级数据模式为0571{-}{数字}[8],数量为4个,占比是80%,第二层级数据模式为{数字}[12],数量为1个,占比是20%,第一层级数据模式包括第一子层级数据模式和第二子层级数据模式,它们都包含公共子字符串0571,第一子层级数据模式为0571{数字}[5]123,数量为2个,占比是40%,第二子层级数据模式为0571-{数字}[8],数量为2个,占比是40%。It should be understood that in the multi-granularity data pattern displayed in the output, the basic pattern may include data patterns at different levels, and the data patterns at each level include data pattern samples and the number of data matching the data pattern samples and the number of data patterns matching the data pattern samples. treat The proportion of the processed data. In particular, for each level of data model, it can include multiple sub-level data models. The mining granularity of each sub-level data model is smaller than the data model of this level, and each sub-level data model has the same This hierarchical data pattern has common substrings. For example, in the multi-granularity data pattern diagram obtained after the merger shown in Figure 6 above, the basic pattern is {number}[4]{-}{number}[8], the number is 5, and the proportion is 100%. Including the first-level data mode and the second-level data mode. Among them, the first-level data mode is 0571{-}{number}[8], the number is 4, accounting for 80%, and the second-level data mode is { Number}[12], the quantity is 1, accounting for 20%, the first-level data pattern includes the first sub-level data pattern and the second sub-level data pattern, both of which contain the common substring 0571, the first sub-level The data pattern is 0571{number}[5]123, the quantity is 2, accounting for 40%, the second sub-level data pattern is 0571-{number}[8], the quantity is 2, accounting for 40%.
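The tree-shaped output described above can be pictured with a short Python sketch. The class name `PatternNode`, the function `render`, and the text layout are hypothetical choices made for illustration; only the patterns and counts come from the Figure 6 example.

```python
from dataclasses import dataclass, field


@dataclass
class PatternNode:
    """One node of the pattern tree: a pattern sample and its match count."""
    pattern: str
    count: int
    children: list = field(default_factory=list)


def render(node, total, depth=0):
    """Return indented lines showing pattern, count, and share of all data."""
    share = 100 * node.count / total
    lines = [f"{'  ' * depth}{node.pattern}  count={node.count}  share={share:.0f}%"]
    for child in node.children:
        lines.extend(render(child, total, depth + 1))
    return lines


# Values taken from the Figure 6 example in the text.
root = PatternNode("{number}[4]{-}{number}[8]", 5, [
    PatternNode("0571{-}{number}[8]", 4, [
        PatternNode("0571{number}[5]123", 2),
        PatternNode("0571-{number}[8]", 2),
    ]),
    PatternNode("{number}[12]", 1),
])

for line in render(root, root.count):
    print(line)
```

Each level of indentation corresponds to one level of granularity, so the coarser patterns appear nearer the root and the finer atomic patterns appear as leaves.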
In another possible implementation, the data processing system uses a knowledge base to search for and match the multi-granularity data pattern corresponding to the data to be processed, and determines the multi-granularity business pattern corresponding to the multi-granularity data pattern.
Specifically, the knowledge base of the data processing system stores regular expressions corresponding to business patterns, for example regular expressions for Internet Protocol (IP) addresses and uniform resource locators (URLs), or the permitted values of regions, postal codes, area codes, surnames, directions, and the like. The knowledge base can be updated and can learn at any time to enrich the business patterns it stores. After obtaining the multi-granularity data pattern, the data processing system searches the knowledge base for a business pattern that matches the data pattern at each level of granularity, thereby generating the multi-granularity business pattern.
For example, refer to Figure 7, which is a schematic diagram of a multi-granularity business pattern according to an embodiment of this application. As shown in Figure 7, the business pattern matching the data pattern 0571{number}[8] is {Hangzhou area code}[4]{number}[8], the business pattern matching the data pattern 0571-{number}[8] is {area code}[4]-{number}[8], and the business pattern matching the data pattern {number}[11] is {telephone number}[11]. Similarly, a matching business pattern is found in the knowledge base for each data pattern, yielding the corresponding multi-granularity business pattern.
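A knowledge-base lookup of this kind might be sketched in Python as follows. The three regular expressions, the dictionary `KNOWLEDGE_BASE`, and the function `match_business_pattern` are hypothetical: the application states only that the knowledge base stores regular expressions for business patterns, not what they look like. For simplicity the sketch matches sample values of a data pattern rather than the pattern itself.

```python
import re

# Hypothetical knowledge base: business pattern name -> regular expression.
KNOWLEDGE_BASE = {
    "{Hangzhou area code}[4]{number}[8]": re.compile(r"0571\d{8}"),
    "{area code}[4]-{number}[8]": re.compile(r"0\d{3}-\d{8}"),
    "{telephone number}[11]": re.compile(r"1\d{10}"),
}


def match_business_pattern(samples):
    """Return the first business pattern whose regex fully matches every sample."""
    if not samples:
        return None
    for name, regex in KNOWLEDGE_BASE.items():
        if all(regex.fullmatch(sample) for sample in samples):
            return name
    return None


print(match_business_pattern(["057112345678", "057187654321"]))
print(match_business_pattern(["0571-12345678"]))
print(match_business_pattern(["13800000000"]))
```

Because more specific entries are listed first, a value that fits several entries is labelled with the most specific one; a real knowledge base would need an explicit ranking rule.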
It can be seen that using the knowledge base to find a business pattern matching each data pattern in the multi-granularity data pattern, and thereby generating the multi-granularity business pattern, helps the user better identify the business features of the data and gives the user more, and more useful, data insights.
It should be noted that, in addition to mining and displaying multi-granularity data patterns and multi-granularity business patterns, the data processing system can apply them to other scenarios that require processing of data content. For example, if the value -999999999 is mixed into the data shown in Figure 7, the data processing system can identify that its data pattern is {symbol}[1]{number}[9]. Because the similarity between this pattern and the other data patterns is below a threshold, the pattern is determined to be an outlier pattern and the value can be cleaned. Similarly, for existing data with different data patterns, the data processing system can convert and unify the patterns according to the similarity between them. In Figure 7, for instance, most data patterns do not contain "-" and only a few do, and the edit distance between them is small, so their similarity is high. Following the majority rule, the data patterns that contain "-" can therefore be converted into patterns that do not by removing the "-", which completes the conversion and unification. In addition, when new data is imported, pattern recognition can be performed on it and the similarity between the recognized data pattern and the existing data patterns can be computed. If they are not similar, the new data can be converted using the existing data pattern as the reference, so that the converted data pattern is consistent with the existing one. The approach can also be applied to scenarios such as structuring semi-structured data, mining similar columns or tables, and identifying data standards, which are not described further here.
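The cleaning and unification scenario can be sketched in Python under simplifying assumptions. Here a base pattern is derived by run-length encoding character classes, and exact equality with the majority pattern stands in for the similarity threshold mentioned in the text. The function names `base_pattern` and `unify`, the three character classes, and the sample values are all hypothetical.

```python
from collections import Counter
from itertools import groupby


def char_class(ch):
    """Map a character to a coarse class (assumed classes: number, letter, symbol)."""
    if ch.isdigit():
        return "number"
    if ch.isalpha():
        return "letter"
    return "symbol"


def base_pattern(value):
    """Run-length encode the character classes, e.g. '-99' -> '{symbol}[1]{number}[2]'."""
    return "".join(
        f"{{{kind}}}[{len(list(run))}]" for kind, run in groupby(value, key=char_class)
    )


def unify(values, separator="-"):
    """Convert minority values to the majority pattern by dropping the separator.

    Values that still do not fit the majority pattern are reported as outliers.
    """
    majority, _ = Counter(base_pattern(v) for v in values).most_common(1)[0]
    unified, outliers = [], []
    for value in values:
        candidate = value.replace(separator, "")
        if base_pattern(candidate) == majority:
            unified.append(candidate)
        else:
            outliers.append(value)
    return majority, unified, outliers


values = ["057112345678", "057187654321", "057100001111", "0571-22223333", "-999999999"]
print(base_pattern("-999999999"))
print(unify(values))
```

A fuller implementation would replace the equality check with an edit-distance-based similarity and a configurable threshold, as the text describes.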
It can be understood that, by performing the multi-granularity data pattern mining method described in Figure 3, the data features and business features of the data can be displayed in multiple dimensions. This helps the user comprehensively identify data features, in particular the encoding features of the data, gives the user more, and more useful, data insights, and makes the method more flexible to use. The method can also be applied to a variety of business scenarios and can support other data governance capabilities, which effectively broadens the applicable scenarios.
The method of the embodiments of this application has been described in detail above. To facilitate implementation of the foregoing solutions, related devices for implementing them are provided below.
Refer to Figure 8, which is a schematic structural diagram of a multi-granularity data pattern mining apparatus according to an embodiment of this application. The apparatus may be the data processing system in the method embodiment described in Figure 3; it can perform steps S301 to S303 of that method embodiment and, optionally, the optional methods within steps S301 to S303. As shown in Figure 8, the apparatus 800 includes a reading and parsing module 810, a processing module 820, and an output display module 830, where:
the reading and parsing module 810 is configured to read the data to be processed;
the processing module 820 is configured to perform multi-granularity pattern mining on the data to be processed and to generate, according to the multi-granularity pattern mining result, the multi-granularity data pattern corresponding to the data to be processed; and
the output display module 830 is configured to output and display the multi-granularity data pattern corresponding to the data to be processed, where the multi-granularity data pattern includes a base pattern corresponding to the data to be processed, the base pattern includes a first-level data pattern and a second-level data pattern, and each level of data pattern includes a data pattern sample, the number of data items matching the data pattern sample, and their proportion of the data to be processed.
The three modules can exchange data with one another over communication channels. It should be understood that each module of the multi-granularity data pattern mining apparatus 800 may be a software unit, a hardware unit, or partly a software unit and partly a hardware unit.
As an embodiment, the first-level data pattern includes at least one sub-level data pattern, the mining granularity of the at least one sub-level data pattern is smaller than that of the first-level data pattern, and each of the at least one sub-level data pattern shares a common substring with the first-level data pattern.
As an embodiment, the processing module 820 is further configured to: based on the multi-granularity data pattern, search for and match, by using a knowledge base, the data pattern at each level of the multi-granularity data pattern, where the knowledge base includes regular expressions corresponding to different business patterns; and output and display, according to the search result, the multi-granularity business pattern corresponding to the data to be processed, where the multi-granularity business pattern includes business patterns at multiple levels, the business pattern at each level matches one data pattern in the multi-granularity data pattern, and the business pattern at each level corresponds to one business insight.
As an embodiment, the processing module 820 is specifically configured to: parse the data to be processed to obtain the base pattern corresponding to the data to be processed; use a common substring mining algorithm to iteratively mine common substrings from the data to be processed that has the same base pattern; generate, based on the common substring obtained after each mining iteration, the atomic patterns corresponding to the data to be processed; and merge the atomic patterns corresponding to the data to be processed to obtain the multi-granularity data pattern corresponding to the data to be processed.
As an embodiment, the processing module 820 is specifically configured to: obtain, based on the suffix data of the data to be processed that has the same base pattern, substrings whose occurrence frequency is greater than a preset threshold; and filter all the substrings whose occurrence frequency is greater than the preset threshold to determine the common substring generated after each mining iteration.
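As a rough illustration of the suffix-based variant, the Python sketch below enumerates the prefixes of every suffix of each value, counts in how many values each substring occurs, keeps those above a threshold, and then filters for the longest one. This brute-force enumeration is a stand-in chosen for brevity; a suffix array or suffix tree would serve the same purpose more efficiently. The function names, the `min_len` parameter, the longest-first filtering rule, and the sample values are assumptions.

```python
from collections import Counter


def frequent_substrings(values, min_count, min_len=2):
    """Count, per substring, the number of values containing it; keep counts > min_count."""
    support = Counter()
    for value in values:
        seen = set()
        for start in range(len(value)):          # every suffix value[start:]
            suffix = value[start:]
            for end in range(min_len, len(suffix) + 1):
                seen.add(suffix[:end])           # every prefix of that suffix
        support.update(seen)                     # each value counts at most once
    return {sub: n for sub, n in support.items() if n > min_count}


def pick_common_substring(values, min_count):
    """Filter the frequent substrings: prefer the longest, then the most frequent."""
    frequent = frequent_substrings(values, min_count)
    if not frequent:
        return None
    return max(frequent, key=lambda sub: (len(sub), frequent[sub], sub))


values = ["057112345123", "057167890123", "057111112222", "057133334444"]
print(pick_common_substring(values, min_count=3))
```

With a threshold of 3, only substrings present in all four hypothetical values survive, and the longest of them is the shared prefix.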
As an embodiment, the processing module 820 is specifically configured to: align the data to be processed that has the same base pattern, and compare that data position by position to check whether the data at the same position is the same; and determine, based on the comparison result, the common substring generated after each mining iteration.
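The position-by-position comparison can be sketched in Python as follows, assuming that values sharing a base pattern also share a length and that every varying position holds a digit. The function name `atomic_pattern`, the `placeholder` parameter, and the sample values are hypothetical.

```python
from itertools import groupby


def atomic_pattern(values, placeholder="number"):
    """Keep characters shared at a position; collapse varying runs to {placeholder}[n]."""
    if len({len(v) for v in values}) != 1:
        raise ValueError("values must be non-empty and share one length")
    # True where every value has the same character at that position.
    same = [len(set(column)) == 1 for column in zip(*values)]
    parts, pos = [], 0
    for is_same, run in groupby(same):
        n = len(list(run))
        if is_same:
            parts.append(values[0][pos:pos + n])       # common substring kept verbatim
        else:
            parts.append(f"{{{placeholder}}}[{n}]")    # varying run generalized
        pos += n
    return "".join(parts)


print(atomic_pattern(["057112345123", "057167890123"]))
print(atomic_pattern(["0571-12345678", "0571-87654321"]))
```

For these two hypothetical pairs the sketch yields patterns of the same form as the atomic patterns 0571{number}[5]123 and 0571-{number}[8] used in the Figure 6 example.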
As an embodiment, the processing module 820 is specifically configured to: construct a frequent itemset tree (FP-tree) based on the data to be processed that has the same base pattern; and determine, according to the FP-tree, the common substring generated after each mining iteration.
As an embodiment, the processing module 820 is specifically configured to: compute, based on a dynamic programming algorithm, the edit distance between any two of the atomic patterns; and merge the atomic patterns corresponding to the data to be processed according to the edit distance results and a preset merging strategy.
As an embodiment, the processing module 820 is specifically configured to: parse the data to be processed by using a context-free grammar (CFG) to obtain the regular expression corresponding to the base pattern; and generate, according to the regular expression corresponding to the base pattern, the base pattern corresponding to the data to be processed.
It should be noted that the structure of the multi-granularity data pattern mining apparatus described above is only an example and does not constitute a specific limitation; modules of the apparatus may be added, removed, or combined as needed. In addition, the operations and/or functions of the modules of the apparatus implement the corresponding procedures of the method described in Figure 3, and for brevity they are not described again here.
Refer to Figure 9, which is a schematic structural diagram of a computing device according to an embodiment of this application. As shown in Figure 9, the computing device 900 includes a processor 910, a communication interface 920, and a memory 930, which are connected to one another through an internal bus 940.
The processor 910 may consist of one or more general-purpose processors, for example a central processing unit (CPU), or a combination of a CPU and a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The bus 940 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus 940 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is used in Figure 9, but this does not mean that there is only one bus or only one type of bus.
The memory 930 may include volatile memory, for example random access memory (RAM). The memory 930 may also include non-volatile memory, for example read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD). The memory 930 may also include a combination of the above types.
It should be noted that the memory 930 of the computing device 900 stores the code corresponding to each module of the multi-granularity data pattern mining apparatus 800, and the processor 910 executes this code to implement the functions of each module of the apparatus 800, that is, to perform steps S301 to S303 and the optional methods within steps S301 to S303.
This application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, some or all of the steps of any one of the foregoing method embodiments can be implemented.
An embodiment of the present invention further provides a computer program. The computer program includes instructions, and when the computer program is executed by a computer, the computer can perform some or all of the steps of any one of the methods.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. However, a person skilled in the art should know that this application is not limited by the described order of actions, because according to this application some steps may be performed in another order or simultaneously. A person skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by this application.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

Claims (20)

  1. A multi-granularity data pattern mining method, comprising:
    reading data to be processed, and performing multi-granularity pattern mining on the data to be processed;
    generating, according to the multi-granularity pattern mining result, a multi-granularity data pattern corresponding to the data to be processed; and
    outputting and displaying the multi-granularity data pattern corresponding to the data to be processed, wherein the multi-granularity data pattern includes a base pattern corresponding to the data to be processed, the base pattern includes a first-level data pattern and a second-level data pattern, and each level of data pattern includes a data pattern sample, the number of data items matching the data pattern sample, and their proportion of the data to be processed.
  2. The method according to claim 1, wherein the method further comprises:
    the first-level data pattern includes at least one sub-level data pattern, the mining granularity of the at least one sub-level data pattern is smaller than that of the first-level data pattern, and each of the at least one sub-level data pattern shares a common substring with the first-level data pattern.
  3. The method according to claim 1 or 2, wherein the method further comprises:
    based on the multi-granularity data pattern, searching for and matching, by using a knowledge base, the data pattern at each level of the multi-granularity data pattern, wherein the knowledge base includes regular expressions corresponding to different business patterns; and
    outputting and displaying, according to the search result, a multi-granularity business pattern corresponding to the data to be processed, wherein the multi-granularity business pattern includes business patterns at multiple levels, the business pattern at each level matches one data pattern in the multi-granularity data pattern, and the business pattern at each level corresponds to one business insight.
  4. The method according to any one of claims 1 to 3, wherein the performing multi-granularity pattern mining on the data to be processed comprises:
    parsing the data to be processed to obtain the base pattern corresponding to the data to be processed;
    using a common substring mining algorithm to iteratively mine common substrings from the data to be processed that has the same base pattern;
    generating, based on the common substring obtained after each mining iteration, atomic patterns corresponding to the data to be processed; and
    merging the atomic patterns corresponding to the data to be processed to obtain the multi-granularity data pattern corresponding to the data to be processed.
  5. The method according to claim 4, wherein the using a common substring mining algorithm to iteratively mine common substrings from the data to be processed that has the same base pattern comprises:
    obtaining, based on the suffix data of the data to be processed that has the same base pattern, substrings whose occurrence frequency is greater than a preset threshold; and
    filtering all the substrings whose occurrence frequency is greater than the preset threshold to determine the common substring generated after each mining iteration.
  6. The method according to claim 4, wherein the using a common substring mining algorithm to iteratively mine common substrings from the data to be processed that has the same base pattern comprises:
    aligning the data to be processed that has the same base pattern, and comparing that data position by position to check whether the data at the same position is the same; and
    determining, based on the comparison result, the common substring generated after each mining iteration.
  7. The method according to claim 4, wherein the using a common substring mining algorithm to iteratively mine common substrings from the data to be processed that has the same base pattern comprises:
    constructing a frequent itemset tree (FP-tree) based on the data to be processed that has the same base pattern; and
    determining, according to the FP-tree, the common substring generated after each mining iteration.
  8. The method according to any one of claims 4 to 7, wherein the merging the atomic patterns corresponding to the data to be processed comprises:
    computing, based on a dynamic programming algorithm, the edit distance between any two of the atomic patterns; and
    merging the atomic patterns corresponding to the data to be processed according to the edit distance results and a preset merging strategy.
  9. The method according to any one of claims 4 to 8, wherein the parsing the data to be processed to obtain the base pattern corresponding to the data to be processed comprises:
    parsing the data to be processed by using a context-free grammar (CFG) to obtain the regular expression corresponding to the base pattern; and
    generating, according to the regular expression corresponding to the base pattern, the base pattern corresponding to the data to be processed.
  10. A multi-granularity data pattern mining apparatus, comprising:
    a reading and parsing module, configured to read data to be processed;
    a processing module, configured to perform multi-granularity pattern mining on the data to be processed and to generate, according to the multi-granularity pattern mining result, a multi-granularity data pattern corresponding to the data to be processed; and
    an output display module, configured to output and display the multi-granularity data pattern corresponding to the data to be processed, wherein the multi-granularity data pattern includes a base pattern corresponding to the data to be processed, the base pattern includes a first-level data pattern and a second-level data pattern, and each level of data pattern includes a data pattern sample, the number of data items matching the data pattern sample, and their proportion of the data to be processed.
  11. The apparatus according to claim 10, wherein the first-level data pattern includes at least one sub-level data pattern, the mining granularity of the at least one sub-level data pattern is smaller than that of the first-level data pattern, and each of the at least one sub-level data pattern shares a common substring with the first-level data pattern.
  12. The apparatus according to claim 10 or 11, wherein the processing module is further configured to:
    based on the multi-granularity data pattern, search for and match, by using a knowledge base, the data pattern at each level of the multi-granularity data pattern, wherein the knowledge base includes regular expressions corresponding to different business patterns; and
    output and display, according to the search result, a multi-granularity business pattern corresponding to the data to be processed, wherein the multi-granularity business pattern includes business patterns at multiple levels, the business pattern at each level matches one data pattern in the multi-granularity data pattern, and the business pattern at each level corresponds to one business insight.
  13. The apparatus according to any one of claims 10 to 12, wherein the processing module is specifically configured to:
    parse the data to be processed to obtain the base pattern corresponding to the data to be processed;
    use a common substring mining algorithm to iteratively mine common substrings from the data to be processed that has the same base pattern;
    generate, based on the common substring obtained after each mining iteration, atomic patterns corresponding to the data to be processed; and
    merge the atomic patterns corresponding to the data to be processed to obtain the multi-granularity data pattern corresponding to the data to be processed.
  14. The apparatus according to claim 13, wherein the processing module is specifically configured to:
    obtain, based on suffix data of the data to be processed that share the same base pattern, substrings whose occurrence frequency is greater than a preset threshold; and
    filter all the substrings whose occurrence frequency is greater than the preset threshold to determine the common substrings generated after each mining iteration.
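The two steps of claim 14 can be illustrated with a minimal Python sketch. It is not the claimed implementation: it enumerates suffixes naively rather than using a suffix array or suffix tree, counts frequency as the number of values that contain a substring, and uses "keep only maximal substrings" as the filtering rule. The function names, the `min_len` parameter, and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def frequent_substrings(values, threshold, min_len=2):
    """Return substrings contained in more than `threshold` of the values."""
    counts = Counter()
    for value in values:
        seen = set()
        # Every substring is a prefix of some suffix of the value.
        for start in range(len(value)):
            suffix = value[start:]
            for end in range(min_len, len(suffix) + 1):
                seen.add(suffix[:end])
        counts.update(seen)  # count each substring once per value
    return {s: c for s, c in counts.items() if c > threshold}


def keep_maximal(frequent):
    """Hypothetical filter: drop substrings contained in a longer frequent one."""
    return sorted(
        s for s in frequent
        if not any(s != t and s in t for t in frequent)
    )


# Hypothetical values that share one base pattern.
values = ["ERR-2023-001", "ERR-2023-417", "ERR-2024-009"]
common = keep_maximal(frequent_substrings(values, threshold=2))
```

With these sample values the only maximal substring present in all three is `ERR-202`. A production version would likely replace the quadratic enumeration with a generalized suffix structure.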
  15. The apparatus according to claim 13, wherein the processing module is specifically configured to:
    align the data to be processed that share the same base pattern, and compare them position by position to check whether the data at the same position are identical; and
    determine, based on the comparison result, the common substrings generated after each mining iteration.
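The alignment variant of claim 15 can be sketched as follows. The sketch assumes left alignment and compares only the width shared by all values; the claim does not specify the alignment rule, so that choice, the function name `common_runs`, and the sample dates are illustrative assumptions.

```python
def common_runs(values, min_len=1):
    """Return (start, text) for each maximal run of positions where all values agree."""
    if not values:
        return []
    width = min(len(v) for v in values)  # left-align; compare the shared width only
    first = values[0]
    runs, start = [], None
    # Iterate one step past the end so a run that reaches the end is closed.
    for i in range(width + 1):
        same = i < width and all(v[i] == first[i] for v in values)
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_len:
                runs.append((start, first[start:i]))
            start = None
    return runs


# Hypothetical values that share one base pattern.
dates = ["2023-03-01", "2023-04-15", "2023-11-30"]
runs = common_runs(dates)
```

Here the values agree on positions 0 to 4 (`2023-`) and on position 7 (`-`), so those two runs are the common substrings found by the comparison.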
  16. The apparatus according to claim 13, wherein the processing module is specifically configured to:
    construct a frequent itemset tree (FP-tree) based on the data to be processed that share the same base pattern; and
    determine, according to the FP-tree, the common substrings generated after each mining iteration.
  17. The apparatus according to any one of claims 13 to 16, wherein the processing module is specifically configured to:
    calculate, based on a dynamic programming algorithm, the edit distance between any two atomic patterns among all the atomic patterns; and
    merge the atomic patterns corresponding to the data to be processed according to the edit distance calculation result and a preset merging strategy.
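A minimal Python sketch of claim 17 follows. The edit distance is the standard Levenshtein distance computed by dynamic programming. The merging strategy is not specified in the claim, so the sketch assumes one: patterns whose pairwise distance is at most `max_distance` are placed in the same group (single-link grouping via union-find). The pattern notation such as `ERR-<D>{3}` is also hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, keeping only one previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def merge_patterns(patterns, max_distance):
    """Assumed strategy: group patterns linked by pairs within max_distance."""
    parent = list(range(len(patterns)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare every pair of atomic patterns, as the claim describes.
    for i, j in combinations(range(len(patterns)), 2):
        if edit_distance(patterns[i], patterns[j]) <= max_distance:
            parent[find(j)] = find(i)

    groups = {}
    for i, pattern in enumerate(patterns):
        groups.setdefault(find(i), []).append(pattern)
    return list(groups.values())


# Hypothetical atomic patterns.
atomic = ["ERR-<D>{3}", "ERR-<D>{4}", "WARN-<D>{3}"]
merged = merge_patterns(atomic, max_distance=1)
```

With a threshold of 1, the two `ERR` patterns (distance 1) fall into one group and the `WARN` pattern stays on its own. Other strategies, such as merging only the closest pair per round, would fit the claim wording equally well.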
  18. The apparatus according to any one of claims 13 to 17, wherein the processing module is specifically configured to:
    parse the data to be processed using a context-free grammar (CFG) to obtain a regular expression corresponding to the base pattern; and
    generate, according to the regular expression corresponding to the base pattern, the base pattern corresponding to the data to be processed.
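To make the idea of a base pattern expressed as a regular expression concrete, the following Python sketch derives one from a single value. It deliberately swaps in a much simpler technique than the claimed CFG parsing: each character is mapped to a character class (ASCII digit, ASCII letter, or the literal character itself) and consecutive characters of the same class are collapsed into a counted group. The function names and the class choices are assumptions.

```python
import re
import string
from itertools import groupby


def char_class(ch):
    """Map one character to a regex fragment (assumed classes, not from the source)."""
    if ch in string.digits:
        return "[0-9]"
    if ch in string.ascii_letters:
        return "[A-Za-z]"
    return re.escape(ch)  # punctuation and other characters stay literal


def base_pattern(value):
    """Collapse runs of the same class into class{n} and join them into a regex."""
    parts = []
    for cls, run in groupby(value, key=char_class):
        n = len(list(run))
        parts.append(cls if n == 1 else f"{cls}{{{n}}}")
    return "".join(parts)


# Values with the same shape yield the same base pattern.
pattern = base_pattern("ERR-2023")
same_shape = base_pattern("WRN-0001") == pattern
```

Values can then be grouped by their base pattern before common substring mining, since `re.fullmatch(pattern, value)` succeeds for every value of that shape.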
  19. A computing device, comprising a processor and a memory, wherein the processor executes computer instructions stored in the memory to cause the computing device to perform the method according to any one of claims 1 to 9.
  20. A computer-readable storage medium storing a computer program, wherein, when the computer program is executed by a processor, the processor performs the method according to any one of claims 1 to 9.
PCT/CN2023/079655 2022-03-30 2023-03-03 Multi-granularity data pattern mining method and related device WO2023185377A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210326077.2A CN116932604A (en) 2022-03-30 2022-03-30 Multi-granularity data pattern mining method and related equipment
CN202210326077.2 2022-03-30

Publications (1)

Publication Number Publication Date
WO2023185377A1 (en) 2023-10-05

Family

ID=88198963

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/079655 WO2023185377A1 (en) 2022-03-30 2023-03-03 Multi-granularity data pattern mining method and related device

Country Status (2)

Country Link
CN (1) CN116932604A (en)
WO (1) WO2023185377A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103220274A (en) * 2013-03-25 2013-07-24 北京东方网信科技股份有限公司 Operator network outlet network message pattern matching method and system
US9542446B1 (en) * 2015-12-17 2017-01-10 Palantir Technologies, Inc. Automatic generation of composite datasets based on hierarchical fields
CN106992969A (en) * 2017-03-03 2017-07-28 南京理工大学 DGA based on domain name character string statistical nature generates the detection method of domain name
CN107818079A (en) * 2017-09-05 2018-03-20 苏州大学 More granularity participle labeled data automatic obtaining methods and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HELEN PINTO ; JIAWEI HAN ; JIAN PEI ; KE WANG ; QIMING CHEN ; UMESHWAR DAYAL: "Multi-dimensional sequential pattern mining", PROCEEDINGS OF THE 2001 ACM CIKM 10TH. INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT. ATLANTA, GA, NOV. 5 - 10, 2001., NEW YORK, NY : ACM., US, 5 October 2001 (2001-10-05) - 10 October 2001 (2001-10-10), US , pages 81 - 88, XP058105105, ISBN: 978-1-58113-436-0, DOI: 10.1145/502585.502600 *

Also Published As

Publication number Publication date
CN116932604A (en) 2023-10-24

Similar Documents

Publication Title
US10338977B2 (en) Cluster-based processing of unstructured log messages
US11238069B2 (en) Transforming a data stream into structured data
JP2021527878A (en) Regular expression generation using the longest common subsequence algorithm on the regular expression code
WO2021068547A1 (en) Log schema extraction method and apparatus
US8103705B2 (en) System and method for storing text annotations with associated type information in a structured data store
US9390176B2 (en) System and method for recursively traversing the internet and other sources to identify, gather, curate, adjudicate, and qualify business identity and related data
US8793120B1 (en) Behavior-driven multilingual stemming
US9430520B2 (en) Semantic reflection storage and automatic reconciliation of hierarchical messages
US9706005B2 (en) Providing automatable units for infrastructure support
CN111708805A (en) Data query method and device, electronic equipment and storage medium
JP6781820B2 (en) Distributed computing framework and distributed computing method
JP2022538704A (en) Regular expression generation with span highlight alignment
CN112069305B (en) Data screening method and device and electronic equipment
WO2016093839A1 (en) Structuring of semi-structured log messages
US9984108B2 (en) Database joins using uncertain criteria
US10360208B2 (en) Method and system of process reconstruction
WO2023185377A1 (en) Multi-granularity data pattern mining method and related device
US10229105B1 (en) Mobile log data parsing
US9824140B2 (en) Method of creating classification pattern, apparatus, and recording medium
US20180046656A1 (en) Constructing filterable hierarchy based on multidimensional key
CN113934430A (en) Data retrieval analysis method and device, electronic equipment and storage medium
JP2022538703A (en) User interface commands for regular expression generation
CN111078727A (en) Brief description generation method and device and computer readable storage medium
JP2022538705A (en) Regular expression generation for negative examples with context
CN117472940A (en) Data blood relationship construction method and device, electronic equipment and storage medium

Legal Events

Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 23777755; Country of ref document: EP; Kind code of ref document: A1)