Data Exploration Method, Apparatus, Electronic Device, and Storage Medium
This application claims priority to Chinese Patent Application No. 202011036080.8, filed with the China Patent Office on September 27, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of big data processing, for example, to a data exploration method, apparatus, electronic device, and storage medium.
Background
With the development of the Internet of Everything, we have entered the era of big data, and data storage structures are changing rapidly. Data that is huge in volume, fast-growing, and diverse in kind has become a reality that enterprises must face in the big data era. Being able to grasp the data structure information of data from multiple sources quickly and accurately can greatly reduce the cost invested in data analysis and allows the value of the data to be mined quickly. Data exploration is a major trend in the field of big data processing.
A data exploration method usually acquires data to be processed from different data sources and explores it using the fields of that data. The related art has at least the following disadvantage: it can only explore data according to the default fields of the data itself, and for some data sources exploring the default fields does not yield useful data information, which reduces data utilization efficiency.
Summary
Embodiments of the present application provide a data exploration method, apparatus, electronic device, and storage medium, so as to explore data from different data sources and improve data utilization efficiency.
An embodiment of the present application provides a data exploration method, including: acquiring a data extraction rule, a segmentation rule, and exploration requirements corresponding to a data source; determining data to be processed based on the data extraction rule and the data source; segmenting the data to be processed based on the segmentation rule to obtain segmented data; and exploring the segmented data based on the exploration requirements to obtain an exploration result.
An embodiment of the present application further provides a data exploration apparatus, including: an acquisition module, configured to acquire a data extraction rule, a segmentation rule, and exploration requirements corresponding to a data source; a to-be-processed data determination module, configured to determine data to be processed based on the data extraction rule and the data source; a segmentation module, configured to segment the data to be processed based on the segmentation rule to obtain segmented data; and an exploration module, configured to explore the segmented data based on the exploration requirements to obtain an exploration result.
An embodiment of the present application further provides an electronic device, including: one or more processors; and a storage device configured to store one or more programs, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data exploration method described in any embodiment of the present application.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the data exploration method provided by any embodiment of the present application.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a data exploration method provided in Embodiment 1 of the present application;
FIG. 2 is a schematic flowchart of a data exploration method provided in Embodiment 2 of the present application;
FIG. 3 is a schematic diagram of a method for exploring multi-source heterogeneous data provided in Embodiment 3 of the present application;
FIG. 4 is a structural block diagram of a data exploration apparatus provided in Embodiment 4 of the present application;
FIG. 5 is a schematic structural diagram of an electronic device provided in Embodiment 5 of the present application.
Detailed Description
The present application is described below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described herein are only used to explain the present application and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present application rather than the entire structure.
Embodiment 1
FIG. 1 is a schematic flowchart of a data exploration method provided in Embodiment 1 of the present application. This embodiment is applicable to exploring data of different structures from different data sources. The method may be performed by a data exploration apparatus, which may be implemented in software and/or hardware and integrated into the electronic device provided in the embodiments of the present application; for example, it may be deployed on a computer, which is not limited herein.
As shown in FIG. 1, the data exploration method provided by this embodiment includes the following steps.
S110. Acquire a data extraction rule, a segmentation rule, and exploration requirements corresponding to a data source.
Data from different data sources corresponds to different data extraction rules. A data extraction rule may include the data extraction method for the data source, the amount of data to extract, and so on.
Optionally, the data extraction rule is determined according to the type of the data source.
Different types of data sources have different data structures, and the corresponding data extraction rule is determined according to the data structure.
For example, when the type of the data source is a database table, the data is structured data, and the extraction rule for the database table may be to use a join method to specify extraction of the data of the database tables to be extracted and/or the data of the fields to be extracted in a database table, where the tables and fields to be extracted can be selected and set according to the exploration requirements. When the type of the data source is a file, the data is unstructured data, and the extraction rule for the file may be to use a machine-learning-based extraction method to determine, from the data files, the number of data files to read: when each file contains little data, more data files are read, and when the files contain a large amount of data, fewer data files are read. The embodiments of the present application do not limit the data extraction rule.
The data extraction method corresponding to a data source can be set in the data extraction rule, so that the configured extraction method is read when data is extracted from that data source; the amount of data to read can be set when the data extraction method is configured.
Determining the data extraction rule according to the type of the data source allows data from different data sources to be extracted quickly.
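As a rough illustration of choosing an extraction rule by data source type, the following minimal Python sketch may help. The names `ExtractionRule` and `choose_extraction_rule`, the method labels, and the threshold of 1000 records per file are all hypothetical. The application describes a machine-learning-based method for deciding how many files to read; a fixed threshold is substituted here purely to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class ExtractionRule:
    method: str     # hypothetical label for the extraction method
    max_items: int  # how many tables or files to read (assumed meaning)


def choose_extraction_rule(source_type, avg_records_per_file=0):
    """Pick an extraction rule from the data source type (illustrative only)."""
    if source_type == "database_table":
        # Structured data: extract the selected tables and fields.
        return ExtractionRule(method="join_selected_tables_and_fields", max_items=1)
    if source_type == "file":
        # Unstructured data: small files -> read more of them, large files -> fewer.
        # The threshold and counts below are arbitrary stand-ins for the learned rule.
        file_count = 20 if avg_records_per_file < 1000 else 5
        return ExtractionRule(method="read_files", max_items=file_count)
    raise ValueError(f"unsupported data source type: {source_type}")
```

Under these assumptions, a source with small files is assigned a larger file count than a source with large files, which mirrors the trade-off described above.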
The segmentation rule specifies how the extracted data is split into rows and columns. Segmenting the extracted data according to the segmentation rule facilitates subsequent data exploration. For example, the segmentation rule may split the data into rows and columns by setting different separators, or may set the separators according to the required attributes contained in the data. The embodiments of the present application do not limit the segmentation rule.
Exploration requirements are the data attribute information that needs to be known about the data of a given data source. By acquiring the exploration requirements, the data attribute information of the corresponding data source can be determined. For example, for a database table, the exploration requirements may be information such as field names and field types; for file data, the exploration requirements may be field length, semantics, and so on. The embodiments of the present application do not limit the exploration requirements.
The data extraction rule, segmentation rule, and exploration requirements can be set in advance according to the type of the data source, and they may differ between data sources.
The data extraction rule, segmentation rule, and exploration requirements are set in advance and stored in a system configuration file, from which they can be read directly during data exploration.
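One possible way to keep the three kinds of settings per data source is a small JSON document, sketched below. The file layout, the key names (`extraction`, `segmentation`, `exploration`), and the source names are assumptions made for illustration; the application does not prescribe a configuration format.

```python
import json

# Hypothetical configuration content; in practice this would live in a file.
CONFIG_TEXT = """
{
  "orders_table": {
    "extraction": {"method": "full", "limit": 1000},
    "segmentation": {"row_separator": "\\n", "column_separator": "\\t"},
    "exploration": ["field_name", "field_type"]
  },
  "log_files": {
    "extraction": {"method": "incremental", "limit": 200},
    "segmentation": {"row_separator": ".", "column_separator": " "},
    "exploration": ["field_length"]
  }
}
"""


def load_rules(config_text, source_name):
    """Return (extraction rule, segmentation rule, exploration requirements)."""
    entry = json.loads(config_text)[source_name]
    return entry["extraction"], entry["segmentation"], entry["exploration"]


extraction, segmentation, requirements = load_rules(CONFIG_TEXT, "orders_table")
```

Keeping the rules outside the code means a new data source can be supported by adding an entry rather than changing the exploration logic.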
S120. Determine the data to be processed based on the data extraction rule and the data source.
Based on the data extraction rules corresponding to different data sources, Java multithreading may be used to extract data from the different data sources, and the extracted data is determined as the data to be processed. The data to be processed is used for data exploration.
For example, when the corresponding data extraction method is applied to data of different structures in different systems, the extraction may be a full extraction, in which the data in the data source is extracted from the data system unchanged, or an incremental extraction, in which only the data added, modified, or deleted in the data system since the last extraction is extracted. The embodiments of the present application do not limit the manner of data extraction.
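The difference between full and incremental extraction can be sketched as follows. The change-log layout (dictionaries with `id`, `op`, and `changed_at` keys) and the function names are assumptions for illustration; real systems track changes in many different ways.

```python
def extract_full(records):
    """Full extraction: return every record unchanged."""
    return list(records)


def extract_incremental(change_log, last_extracted_at):
    """Incremental extraction: keep only changes made after the last extraction.

    Each change-log entry is assumed to carry an operation type
    ("insert", "update", or "delete") and a numeric timestamp.
    """
    return [entry for entry in change_log if entry["changed_at"] > last_extracted_at]


change_log = [
    {"id": 1, "op": "insert", "changed_at": 100},
    {"id": 2, "op": "update", "changed_at": 205},
    {"id": 3, "op": "delete", "changed_at": 310},
]
recent = extract_incremental(change_log, last_extracted_at=200)
```

Here `recent` holds the update and the delete, while `extract_full(change_log)` would return all three entries.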
S130. Segment the data to be processed based on the segmentation rule to obtain segmented data.
Data extracted from different data sources may have different data structures, so data to be processed with different structures needs to be normalized into the same structural form. The segmentation rule is the rule used for this normalization: it segments data to be processed with different structures into the same form, yielding the segmented data corresponding to each data source.
S140. Explore the segmented data based on the exploration requirements to obtain an exploration result.
The exploration requirements corresponding to different data sources contain the data attribute information of interest for those data sources. The different pieces of data attribute information can be defined as exploration items, and the segmented data is explored according to these items. Exploring the segmented data can be regarded as exploring it according to an exploration rule, where the exploration rule may include the exploration items and the segmentation position information contained in the segmentation rule. For example, if the segmentation rule segments the data to be processed by separators, the exploration rule may contain the position information of the separators, and the segmented data is explored according to the separator positions and the exploration items to obtain the data exploration information of the different data sources.
Data exploration mainly provides the ability to explore and analyze table data, file data, and the like. Requirements related to data quality are built in as exploration items, and the data is explored to check whether it meets the user's data quality requirements.
Optionally, after the segmented data is explored based on the exploration requirements to obtain the exploration result, the method further includes: compiling statistics on the exploration result to obtain a statistical result, and outputting an exploration report based on the exploration requirements and the statistical result.
Because the data to be processed is first segmented according to the segmentation rule and the segmented data is then explored according to the segmentation position information contained in the exploration rule, the information obtained describes the individual fields delimited by the separators, not the extracted data as a whole. Statistics therefore need to be compiled on the explored information to obtain an overall exploration result for the extracted data, and the exploration report is then output based on this overall result and the exploration items contained in the exploration requirements. The exploration report may contain exploration information for multiple fields, statistical results for the entire data source, or distribution charts of the exploration items. The embodiments of the present application do not limit the content of the exploration report.
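The step from per-field exploration information to an overall result can be sketched as below. The record layout (`column`, `length`, `is_null`), the statistic names, and the function `summarize` are hypothetical; the sketch only shows field-level results being grouped by column and then filtered by the requested exploration items.

```python
from collections import defaultdict


def summarize(field_results, requested_items):
    """Aggregate per-field results into per-column statistics.

    Only statistics named in `requested_items` appear in the report.
    """
    by_column = defaultdict(list)
    for result in field_results:
        by_column[result["column"]].append(result)

    report = {}
    for column, results in sorted(by_column.items()):
        stats = {
            "count": len(results),
            "null_rate": sum(r["is_null"] for r in results) / len(results),
            "max_length": max(r["length"] for r in results),
        }
        report[column] = {k: v for k, v in stats.items() if k in requested_items}
    return report


field_results = [
    {"column": 0, "length": 3, "is_null": False},
    {"column": 0, "length": 0, "is_null": True},
    {"column": 1, "length": 5, "is_null": False},
]
report = summarize(field_results, {"null_rate", "max_length"})
```

In this sketch the report is a plain dictionary keyed by column index; rendering it as a document or a distribution chart would be a separate step.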
The data exploration method provided by the embodiments of the present application can extract, explore, analyze, and process data with different data structures, such as various databases and data files, and compile and output statistics on the exploration results, so that the data content, data quality, and data structure can be grasped quickly, providing an important basis for data analysis and data cleaning in big data scenarios. The method can also support functions such as previewing exploration results, viewing exploration result details, and generating data distribution charts. This helps data analysts understand the content, structure, and characteristics of the source data faster and better, discover abnormal data in the source data quickly, and carry out data analysis and data cleaning more effectively, which can greatly speed up the extraction of value from the data.
In the data exploration method provided by this embodiment, a data extraction rule, a segmentation rule, and exploration requirements corresponding to a data source are acquired; data to be processed is determined based on the data extraction rule and the data source; the data to be processed is segmented based on the segmentation rule to obtain segmented data; and the segmented data is explored based on the exploration requirements to obtain an exploration result. Data from different data sources is thus segmented by its own segmentation rule and then explored according to the segmentation result. This solves the problem that exploring only the default fields of the source data may fail to yield useful data information, and it improves data utilization efficiency.
Embodiment 2
FIG. 2 is a schematic flowchart of a data exploration method provided in Embodiment 2 of the present application. On the basis of the above embodiments, this embodiment refines the segmentation rule into a row segmentation rule and a column segmentation rule, performs row exploration according to the row segmentation rule and column exploration according to the column segmentation rule, and thereby allows data from different data sources to be explored on user-defined fields according to the exploration requirements. Terms that are the same as or correspond to those in the above embodiments are not explained again here.
Referring to FIG. 2, the data exploration method provided by this embodiment includes the following steps.
S210. Acquire a data extraction rule, a segmentation rule, and exploration requirements corresponding to a data source.
S220. Determine the data to be processed based on the data extraction rule and the data source.
S230. Based on the row segmentation rule, segment the data to be processed into rows to obtain row-segmented data.
The row segmentation rule can be regarded as the rule for splitting into rows the data to be processed, which is being normalized into row-and-column form. For example, according to the row separator in the loaded row segmentation rule, the data to be processed is split into individual rows, and the result, represented row by row, is the row-segmented data. Optionally, the row separator may default to the Tab character, in which case the Tab character serves as the row segmentation rule. For example, for file data, a period may be used as the separator so that each sentence becomes one row; for data in a database table, a tab character may be used as the separator to obtain the row-segmented data. The embodiments of the present application do not limit the manner of row segmentation.
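Row segmentation by a configurable separator can be sketched in a few lines. The function name `split_rows` and the choice to drop empty rows are assumptions; the Tab default follows the optional default mentioned above.

```python
def split_rows(data, row_separator="\t"):
    """Split raw text into rows at the row separator, dropping empty rows."""
    return [row for row in data.split(row_separator) if row]


# Default separator: Tab.
table_rows = split_rows("1,alice,30\t2,bob,\t3,carol,50")

# File data: a period as the separator, so each sentence becomes one row.
sentence_rows = split_rows("A.B.", row_separator=".")
```

Dropping empty rows avoids a spurious trailing row when the text ends with the separator, as in the second call.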
S240. Based on the column segmentation rule, segment the row-segmented data into columns to obtain a row-column array.
After the data to be processed has been split into rows according to the row segmentation rule, each row can be split into columns according to the column separator in the column segmentation rule, which finally yields the row-column array. The row-column array is therefore the array obtained by splitting the row-and-column-form data to be processed with the row separator and the column separator. Each array element can be regarded as a field, and different fields may have different lengths.
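Continuing in the same spirit, column segmentation turns a list of rows into a two-dimensional array of fields. The function name `split_columns` and the comma separator are illustrative assumptions.

```python
def split_columns(rows, column_separator=","):
    """Split each row at the column separator, yielding a list of field lists."""
    return [row.split(column_separator) for row in rows]


rows = ["1,alice,30", "2,bob,", "3,carol,50"]
array = split_columns(rows)
# array[1][2] is the third field of the second row; here it is an empty string.
```

Empty fields are kept as empty strings in this sketch so that column positions stay aligned across rows.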
S250. Perform row exploration on the row-column array based on the exploration requirements to obtain a row exploration result.
Each row of the row-column array can be explored according to the row exploration items contained in the exploration requirements, yielding an exploration result for each row with respect to those items.
Optionally, at least one of the following row explorations is performed on the row-column array: abnormal data filtering, data sorting, data duplication rate, and number of data records.
Each row of the row-column array is filtered and sorted, the data duplication rate is recorded, and information such as the number of records in the data to be processed is counted. Data filtering may identify abnormal data and filter out the problematic records. Data sorting may reorder the rows by the length of each row. For the data duplication rate, the number of duplicated rows is determined first, and the rate is then derived from the total number of rows and the number of duplicated rows. Each row is one piece of data and also one record; the number of records is the total number of rows, which indicates how many pieces of data were extracted from the corresponding data source. The embodiments of the present application do not limit the content of the row exploration items.
Performing row exploration on the row-column array according to the row exploration rule determines the information of each row, from which the quality of each row of data can be analyzed.
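The row exploration items listed above can be sketched as one function over the row-column array. The function name, the result keys, and in particular the abnormality criterion (a row whose field count differs from the expected column count) are assumptions; the application does not define what counts as abnormal data.

```python
def explore_rows(table, expected_columns):
    """Compute record count, duplication rate, abnormal rows, and a length ordering."""
    record_count = len(table)

    # Count rows that repeat an earlier row exactly.
    seen, duplicates = set(), 0
    for row in table:
        key = tuple(row)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    duplicate_rate = duplicates / record_count if record_count else 0.0

    # Assumed criterion: a row is abnormal if its field count is unexpected.
    abnormal = [row for row in table if len(row) != expected_columns]

    # Sort rows by total character length of their fields.
    by_length = sorted(table, key=lambda row: sum(len(field) for field in row))

    return {
        "record_count": record_count,
        "duplicate_rate": duplicate_rate,
        "abnormal_rows": abnormal,
        "sorted_by_length": by_length,
    }


table = [["1", "a"], ["2", "bb"], ["1", "a"], ["3"]]
row_result = explore_rows(table, expected_columns=2)
```

With this input there are four records, one of which duplicates an earlier row, and one row with a missing field.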
S260. Perform column exploration on the row-column array based on the exploration requirements to obtain a column exploration result.
Each column of the row-column array can be explored separately according to the column exploration items contained in the exploration requirements, yielding an exploration result for each column with respect to those items. Column exploration in effect explores each column of data, as delimited by the column separator, with respect to the column exploration items, where the fields are those determined in the preceding step by splitting each row at the column separator.
Optionally, at least one of the following column explorations is performed on the row-column array: format, type, length, value, null rate, maximum, minimum, average, and value range distribution.
After row exploration, each column of each row, as delimited by the column separator, is explored for information such as the data format, data type, data length, value, null rate, maximum, minimum, average, and value range distribution of that column.
The null rate may refer to the ratio of unknown data items among the values of the column to the total number of data items. The value range distribution may refer to the range of values spanned by the minimum and maximum of each column.
For example, the data format may be decimal, binary, and so on, and the data type may be numeric, character, or code-type data. For numeric data, exploration items such as the specific value, null rate, and value range distribution can be explored. The embodiments of the present application do not limit the content of the column exploration items.
Performing column exploration on the row-column array obtains the information of each column and makes it possible to explore every column of every row, from which the quality of each column of data, as split by the column segmentation rule, can be analyzed.
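A minimal sketch of column exploration over the row-column array follows. It assumes that an empty string stands for a null (unknown) value and that a column is numeric when every non-null value parses as a float; both conventions, like the function name and result keys, are illustrative and not taken from the application.

```python
def explore_column(table, index):
    """Explore one column: type, null rate, max length, and numeric statistics."""
    values = [row[index] if index < len(row) else "" for row in table]
    non_null = [v for v in values if v != ""]

    numbers = []
    for v in non_null:
        try:
            numbers.append(float(v))
        except ValueError:
            pass
    is_numeric = bool(non_null) and len(numbers) == len(non_null)

    result = {
        "type": "numeric" if is_numeric else "string",
        "null_rate": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "max_length": max((len(v) for v in values), default=0),
    }
    if is_numeric:
        # The value range is reported as the span from minimum to maximum.
        result.update(
            minimum=min(numbers),
            maximum=max(numbers),
            average=sum(numbers) / len(numbers),
        )
    return result


table = [["1", "alice", "30"], ["2", "bob", ""], ["3", "carol", "50"]]
age_stats = explore_column(table, 2)
name_stats = explore_column(table, 1)
```

Numeric statistics are only attached when the column is judged numeric, so a text column such as the names reports type, null rate, and length only.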
S270. Compile statistics on the exploration results to obtain a statistical result, and output an exploration report based on the exploration requirements and the statistical result.
The embodiments provided in this application extract, explore, analyze, and process data with different data structures, such as various databases and data files, and compile and output statistics on the exploration results, so that the data content, data quality, and data structure can be grasped quickly, providing an important basis for data analysis and data cleaning in big data scenarios.
In the technical solution of this embodiment, a data extraction rule, a segmentation rule, and exploration requirements corresponding to a data source are acquired; the data to be processed is determined based on the data extraction rule and the data source; the data to be processed is split into rows based on the row segmentation rule to obtain row-segmented data; the row-segmented data is split into columns based on the column segmentation rule to obtain a row-column array; row exploration and column exploration are performed on the row-column array based on the exploration requirements to obtain a row exploration result and a column exploration result; statistics are compiled on the exploration results to obtain a statistical result; and an exploration report is output based on the exploration requirements and the statistical result. Data from different data sources can thus be explored on user-defined fields according to the exploration requirements, which improves the efficiency of data use.
Embodiment 3
FIG. 3 is a schematic diagram of a method for exploring multi-source heterogeneous data provided in Embodiment 3 of the present application. Terms that are the same as or correspond to those in the above embodiments are not explained again here.
As shown in FIG. 3, an example process for exploring multi-source heterogeneous data is as follows.
S310. Data extraction. For data of different structures in databases, file systems, File Transfer Protocol (FTP) systems, and message middleware systems, extract the data according to the corresponding data extraction rules.
S320. Read the exploration rules. According to the row exploration rules among the rules that were read, explore items such as the duplication rate of the rows and the number of rows of recorded data. According to the field exploration rules, that is, the column exploration rules, perform field exploration, which may include exploring the completeness, uniqueness, correctness, consistency, or validity of a field. Optionally, the completeness of a field can be determined by checking whether it contains null values; it is also possible to check whether the values of a field are unique, whether the length and type of a field are correct, whether the field length is consistent with the length specified by the field rule, and whether the value of a field is valid with respect to the exploration requirements.
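Some of the field checks named in S320 can be sketched as simple predicates over the values of one field. The function name `check_field`, the result keys, and the convention that an empty string is a null value are assumptions; correctness of type and validity against the exploration requirements are omitted because the application does not specify how they are judged.

```python
def check_field(values, rule_length):
    """Check completeness, uniqueness, and length consistency of one field."""
    non_empty = [v for v in values if v != ""]
    return {
        # Complete: the field has no null (empty) values.
        "complete": len(non_empty) == len(values),
        # Unique: no non-null value occurs more than once.
        "unique": len(set(non_empty)) == len(non_empty),
        # Consistent: every non-null value has the length the field rule specifies.
        "length_consistent": all(len(v) == rule_length for v in non_empty),
    }


good = check_field(["A01", "A02", "A03"], rule_length=3)
bad = check_field(["A01", "", "A01", "A1234"], rule_length=3)
```

The second call fails all three checks: it contains an empty value, a repeated value, and a value longer than the rule allows.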
S330. Data operations. During data exploration, data operations such as analysis, sorting, data structure exploration, and filtering are performed on the data based on the data exploration rules.
S340. Quality statistics. Quality is assessed from the exploration results by means of quality statistics, which may include row statistics and field statistics. Row statistics may cover the duplication rate and the number of records. Field statistics may cover field types, the number of field values, field lengths, field null counts and their proportion, field maximums, field minimums, field averages, field value duplication rates, and field value range analysis.
S350. Exploration report output. Finally, the final data exploration report is output according to the statistics from the actual data exploration and the exploration requirements, for use in data analysis and data cleaning.
The data exploration method provided by this embodiment can explore data from different data sources according to the row exploration rules and the field exploration rules and compile quality statistics from the exploration results, which improves data utilization efficiency.
The implementation of the technical features described in any embodiment of the present disclosure can be applied to this embodiment provided that it does not conflict with this embodiment, and is not repeated here.
实施例四Embodiment 4
图4是本申请实施例四提供的一种数据探查装置的结构框图,本实施例可适用于对不同数据源的不同结构的数据进行探查的情况。应用数据探查的装置可以实现本申请任一实施例所提供的数据探查方法。如图4所示,数据探查装置包括:获取模块410,设置为获取与数据源对应的数据抽取规则、分割规则和探查需求;待处理数据确定模块420,设置为基于所述数据抽取规则和所述数据源,确定待处理数据;分割模块430,设置为基于所述分割规则对所述待处理数据进行分割,得到分割数据;探查模块440,设置为基于所述探查需求对所述分割数据进行探查,得到探查结果。FIG. 4 is a structural block diagram of a data detection apparatus provided in Embodiment 4 of the present application, and this embodiment may be applied to the case where data of different structures from different data sources is probed. The apparatus for applying data detection can implement the data detection method provided by any embodiment of the present application. As shown in FIG. 4 , the data detection device includes: an acquisition module 410, configured to acquire data extraction rules, segmentation rules and detection requirements corresponding to the data source; a data to be processed determination module 420, configured to be based on the data extraction rules and all The data source is used to determine the data to be processed; the segmentation module 430 is configured to segment the data to be processed based on the segmentation rules to obtain segmented data; the detection module 440 is configured to perform segmentation on the segmented data based on the detection requirements. Probe and get the probing results.
Optionally, acquiring the data extraction rule corresponding to the data source includes: determining the data extraction rule according to the type of the data source.
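One simple way to determine the extraction rule from the data source type is a lookup table. The sketch below is only an assumption about how this might look: the source type names (`mysql`, `csv_file`, `kafka`), the fields of `ExtractionRule` and the helper `rule_for_source` are hypothetical and are not specified in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRule:
    """Hypothetical extraction rule: how to pull raw data from a source."""
    method: str       # for example "query", "read_file" or "consume"
    sample_size: int  # maximum number of records to extract


# Assumed mapping from data source type to extraction rule (illustrative only).
RULES_BY_SOURCE_TYPE = {
    "mysql": ExtractionRule(method="query", sample_size=1000),
    "csv_file": ExtractionRule(method="read_file", sample_size=5000),
    "kafka": ExtractionRule(method="consume", sample_size=500),
}


def rule_for_source(source_type: str) -> ExtractionRule:
    """Determine the extraction rule according to the type of the data source."""
    try:
        return RULES_BY_SOURCE_TYPE[source_type]
    except KeyError:
        raise ValueError(f"no extraction rule for source type {source_type!r}") from None
```

An unknown source type raises an error here rather than falling back to a default rule; whether a default is appropriate depends on the deployment.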
The segmentation rule includes a row segmentation rule and a column segmentation rule.
The segmentation module 430 includes: a row segmentation unit, configured to perform row segmentation on the data to be processed based on the row segmentation rule to obtain row-segmented data; and a column segmentation unit, configured to perform column segmentation on the row-segmented data based on the column segmentation rule to obtain a row-column segmented array.
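The two-stage segmentation can be illustrated with delimiter-based rules. This is a sketch under the assumption that the row segmentation rule is a line delimiter and the column segmentation rule is a field delimiter; the application does not limit the rules to delimiters, and the function names `split_rows` and `split_columns` are hypothetical.

```python
from typing import List


def split_rows(raw: str, row_delimiter: str = "\n") -> List[str]:
    """Row segmentation: cut the raw data into records, dropping blank ones."""
    return [row for row in raw.split(row_delimiter) if row.strip()]


def split_columns(rows: List[str], column_delimiter: str = ",") -> List[List[str]]:
    """Column segmentation: cut each record into fields, trimming whitespace."""
    return [[field.strip() for field in row.split(column_delimiter)] for row in rows]


raw_data = "1,Alice,30\n2,Bob,\n\n3,Carol,27\n"
row_data = split_rows(raw_data)   # row-segmented data
array = split_columns(row_data)   # row-column segmented array
```

In this sketch an empty field is kept as an empty string, so that a later column exploration step can still count it as a null value.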
The exploration module 440 includes: a row exploration module, configured to perform row exploration on the row-column segmented array based on the exploration requirement to obtain a row exploration result; and a column exploration module, configured to perform column exploration on the row-column segmented array based on the exploration requirement to obtain a column exploration result.
At least one of the following row explorations is performed on the row-column segmented array: abnormal data filtering, data sorting, data repetition rate, and number of data records. At least one of the following column explorations is performed on the row-column segmented array: format, type, length, value, null rate, null proportion, maximum, minimum, average, and value range distribution.
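A few of the listed row and column explorations can be computed as sketched below. The definitions used here are assumptions made for illustration: an abnormal row is one whose field count differs from the expected count, the repetition rate is the share of records that repeat an earlier record, and a null value is an empty string. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def explore_rows(array: List[List[str]], expected_fields: int) -> Dict:
    """Row exploration: record count, abnormal rows and repetition rate."""
    rows = [tuple(row) for row in array]
    abnormal = [row for row in rows if len(row) != expected_fields]
    repeats = sum(count - 1 for count in Counter(rows).values())
    return {
        "record_count": len(rows),
        "abnormal_rows": abnormal,
        "repetition_rate": repeats / len(rows) if rows else 0.0,
    }


def explore_column(array: List[List[str]], index: int) -> Dict:
    """Column exploration: null rate, maximum length and numeric statistics."""
    values = [row[index] for row in array if index < len(row)]
    non_null = [value for value in values if value != ""]
    result = {
        "null_rate": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "max_length": max((len(value) for value in non_null), default=0),
    }
    try:
        numbers = [float(value) for value in non_null]
    except ValueError:
        numbers = []  # at least one value is not numeric: skip numeric statistics
    if numbers:
        result["min"] = min(numbers)
        result["max"] = max(numbers)
        result["mean"] = sum(numbers) / len(numbers)
    return result


sample = [
    ["1", "Alice", "30"],
    ["2", "Bob", ""],
    ["3", "Carol", "27"],
    ["3", "Carol", "27"],  # repeats the previous record
    ["4", "Dave"],         # abnormal: one field is missing
]
row_result = explore_rows(sample, expected_fields=3)
age_result = explore_column(sample, index=2)
```

Rows that are too short to contain the requested column are left out of the column statistics instead of being counted as nulls; the opposite choice would be equally defensible.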
Optionally, the data exploration apparatus further includes: a statistics module, configured to compile statistics on the exploration result to obtain a statistical result; and an output module, configured to output an exploration report based on the exploration requirement and the statistical result.
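The statistics and output steps might look like the following sketch. The choice of aggregate figures, the JSON report format and the names `summarize` and `build_report` are assumptions; the application does not prescribe a report format. The exploration requirement is modelled here as a list of the statistics to include in the report.

```python
import json
from typing import Dict, List


def summarize(column_results: Dict[str, Dict]) -> Dict:
    """Statistics module: aggregate per-column results into overall figures."""
    null_rates = [result["null_rate"] for result in column_results.values()]
    return {
        "column_count": len(column_results),
        "mean_null_rate": sum(null_rates) / len(null_rates) if null_rates else 0.0,
        "columns_with_nulls": sorted(
            name for name, result in column_results.items() if result["null_rate"] > 0
        ),
    }


def build_report(requirement: List[str], statistics: Dict) -> str:
    """Output module: report only the statistics named in the requirement."""
    selected = {key: statistics[key] for key in requirement if key in statistics}
    return json.dumps(selected, indent=2, sort_keys=True)


column_results = {"id": {"null_rate": 0.0}, "age": {"null_rate": 0.5}}
statistics = summarize(column_results)
report = build_report(["mean_null_rate", "columns_with_nulls"], statistics)
```

Filtering by the requirement keeps the report limited to what was asked for, which is one reading of "based on the exploration requirement and the statistical result".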
In the data exploration apparatus provided by the embodiment of the present application, the acquisition module is configured to acquire the data extraction rule, the segmentation rule and the exploration requirement corresponding to the data source; the to-be-processed data determination module is configured to determine the data to be processed based on the data extraction rule and the data source; the segmentation module is configured to segment the data to be processed based on the segmentation rule to obtain segmented data; and the exploration module is configured to explore the segmented data based on the exploration requirement to obtain the exploration result.
The data exploration apparatus provided by the embodiment of the present application can execute the data exploration method provided by any embodiment of the present application, and has the functional modules corresponding to the executed method. For technical details not described in this embodiment, refer to the data exploration method provided by any embodiment of the present application.
Embodiment 5
FIG. 5 is a schematic structural diagram of an electronic device provided in Embodiment 5 of the present application. FIG. 5 shows a block diagram of an exemplary electronic device 12 suitable for implementing any embodiment of the present application. The electronic device 12 shown in FIG. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 5, the electronic device 12 takes the form of a general-purpose computing device. Components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a memory 28, and a bus 18 connecting the different components (including the memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus structures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The electronic device 12 includes a variety of computer-readable media. These media may be any available media that can be accessed by the electronic device 12, including volatile and non-volatile media, and removable and non-removable media.
The memory 28 may include computer-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The electronic device 12 may also include other removable/non-removable, volatile/non-volatile computer storage media. By way of example only, a storage system 34 may be configured to read from and write to a non-removable, non-volatile magnetic medium (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in FIG. 5, the storage system 34 may provide a magnetic disk drive configured to read from and write to a removable non-volatile magnetic disk (for example, a "floppy disk"), and an optical disc drive configured to read from and write to a removable non-volatile optical disc (for example, a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc Read-Only Memory (DVD-ROM) or other optical media). In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product 40 having a set of program modules 42 configured to perform the functions of the embodiments of the present application. The program product 40 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, one or more application programs, other program modules and program data, and each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present application.
The electronic device 12 may also communicate with one or more external devices 14 (for example, a keyboard, a mouse, a camera, a display and the like), with one or more devices that enable a user to interact with the electronic device 12, and/or with any device (for example, a network card or a modem) that enables the electronic device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. In addition, the electronic device 12 may communicate with one or more networks (for example, a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network such as the Internet) through a network adapter 20. As shown in FIG. 5, the network adapter 20 communicates with the other modules of the electronic device 12 through the bus 18. It should be understood that, although not shown in FIG. 5, other hardware and/or software modules may be used in conjunction with the electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) devices, tape drives, and data backup storage devices.
The processor 16 executes a variety of functional applications and data processing by running the programs stored in the memory 28, for example implementing the data exploration method provided by the above embodiments of the present application. The method includes: acquiring a data extraction rule, a segmentation rule and an exploration requirement corresponding to a data source; determining data to be processed based on the data extraction rule and the data source; segmenting the data to be processed based on the segmentation rule to obtain segmented data; and exploring the segmented data based on the exploration requirement to obtain an exploration result.
Of course, those skilled in the art will understand that the processor may also implement the data exploration method provided by any embodiment of the present application.
Embodiment 6
Embodiment 6 of the present application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the data exploration method provided by any embodiment of the present application. The method includes: acquiring a data extraction rule, a segmentation rule and an exploration requirement corresponding to a data source; determining data to be processed based on the data extraction rule and the data source; segmenting the data to be processed based on the segmentation rule to obtain segmented data; and exploring the segmented data based on the exploration requirement to obtain an exploration result.
Of course, the computer program stored on the computer-readable storage medium provided by the embodiments of the present application is not limited to the method instructions above, and may also execute the data exploration method provided by any embodiment of the present application.
The computer storage medium of the embodiments of the present application may be any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor apparatus or device, or any combination of the above. Examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer disk, a hard disk, a RAM, a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM) or flash memory, an optical fiber, a portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution apparatus or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in conjunction with an instruction execution apparatus or device.
The program code contained on the computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), or any suitable combination of the above.
Computer program code for executing the instructions of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a LAN or a WAN, or may be connected to an external computer (for example, through the Internet using an Internet service provider).