CN113779003A - Information processing method and device

Info

Publication number: CN113779003A
Application number: CN202110175697.6A
Authority: CN
Inventors: 刘欢
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2021-02-09
Filing date: 2021-02-09
Publication date: 2021-12-10
Anticipated expiration: 2041-02-09
Also published as: CN113779003B

Abstract

The application discloses an information processing method and device. The specific implementation scheme is as follows: in response to receiving an information processing request sent by a user, extracting each piece of data cleansing information and the cleansing code from a configuration file, wherein each piece of data cleansing information comprises: the type of the source data source, the connection information of the source data source, the type of the destination data source, and the connection information of the destination data source; acquiring each piece of data on each source data source according to the connection information of each source data source, and generating a data table corresponding to each piece of data cleansing information; cleansing the data in each data table by using the cleansing code, based on the type of the source data source and the type of the destination data source; and writing the cleansed data in the data table corresponding to each piece of data cleansing information into the destination data source according to the connection information of each destination data source. The scheme provides a simple and efficient information processing method that removes the differences between data sources.

Description

Information processing method and device

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to the technical field of data processing, and particularly relates to an information processing method and device.

Background

With the rise and explosive growth of the mobile internet, data has become a core foundation supporting the continued growth of business, and a variety of data-centric products have come into use. Owing to business needs and to factors such as business domains and system architectures, most data is stored and managed in a dispersed manner across different media and physical domains. When data analysis and logical computation are performed, the underlying base data must be re-aggregated from these different media for large-scale processing, so a fast and simple way of aggregating the required data is needed.

ETL (Extract-Transform-Load) data warehouse technology is the process of extracting data from business systems, cleansing and transforming it, and loading it into a data warehouse. Its purpose is to integrate the scattered, disordered, and inconsistently standardized data within an enterprise and to provide a basis for analysis in enterprise decision making. In addition to the ETL solutions published by open source organizations, individual companies may develop their own ETL solutions. Because open source solutions are highly abstract by nature, they generally can be used only after secondary development. Proprietary ETL solutions, in turn, generalize poorly and cannot be used out of the box: ETL is strongly tied to data, and data is strongly tied to the business, so products cannot be shared between companies. Such products are also usually bundled with other related products, which makes deployment and maintenance costly.

Disclosure of Invention

The application provides an information processing method, an information processing device, information processing equipment and a storage medium.

According to a first aspect of the present application, there is provided an information processing method including: in response to receiving an information processing request sent by a user, extracting each piece of data cleansing information and the cleansing code from a configuration file, wherein the configuration file is a normalized file in which the user performs general configuration of the data cleansing process, and each piece of data cleansing information comprises: the type of the source data source that provides the data, the connection information of the source data source, the type of the destination data source into which the data is written, and the connection information of the destination data source; acquiring each piece of data on each source data source according to the connection information of each source data source, and generating a data table corresponding to each piece of data cleansing information, wherein the data table is used for storing the acquired pieces of data corresponding to each source data source; cleansing the data in each data table by using the cleansing code, based on the type of the source data source and the type of the destination data source in each piece of data cleansing information, wherein the cleansing code is used for converting the format of the data in the data table so that the converted data matches the type of the corresponding destination data source; and writing the cleansed data in the data table corresponding to each piece of data cleansing information into the destination data source according to the connection information of the destination data source in that piece of data cleansing information.

In some embodiments, the data table is obtained by registering in Spark memory based on the obtained data of each source data source.

In some embodiments, the cleansing code is SQL that describes the data cleansing process between different data types; cleansing the data in each data table by using the cleansing code, based on the type of the source data source and the type of the destination data source in each piece of data cleansing information, includes: executing the SQL on each data table by using Spark SQL, based on the type of the source data source and the type of the destination data source in each piece of data cleansing information.

In some embodiments, the method further comprises: operating on each data table based on a first program scanned in the configuration file, wherein the first program represents a data implementation scheme customized in advance according to user requirements.

In some embodiments, acquiring each piece of data on each source data source according to the connection information of each source data source includes: acquiring each piece of data on each source data source by using a reading program according to the connection information of each source data source, wherein the reading program represents an interface program for acquiring the data of each type of data source; and writing the cleansed data in the data table corresponding to each piece of data cleansing information into the destination data source according to the connection information of the destination data source in that piece of data cleansing information includes: writing the cleansed data into the destination data source by using a writing program according to the connection information of the destination data source, wherein the writing program represents an interface program for writing the data of each type of data source.

In some embodiments, acquiring each piece of data on each source data source by using the reading program according to the connection information of each source data source includes: judging, based on a connection instance of each source data source and a data judgment function in the reading program, whether data to be read exists on the source data source, wherein the connection instance is generated based on the implementation scheme corresponding to each source data source, and the implementation scheme is obtained by scanning in the configuration file; and in response to data to be read existing on the source data source, acquiring one piece of that data by using a data acquisition function in the reading program, and then returning to the step of judging, based on the connection instance and the data judgment function in the reading program, whether data to be read exists on the source data source.

In some embodiments, the method further comprises: in response to no data to be read existing on the source data source, stopping the operation on the source data source and closing the connection to the source data source by using a data closing function in the reading program.

In some embodiments, writing the cleansed data in the data table corresponding to each piece of data cleansing information into the destination data source by using the writing program, according to the connection information of the destination data source in that piece of data cleansing information, includes: writing the cleansed data into the destination data source by using a data writing function in the writing program, according to the connection information of the destination data source; in response to the cleansed data in the data table being written successfully, stopping the operation on the destination data source and closing the connection to the destination data source by using a data closing function in the writing program; and in response to the cleansed data in the data table not being written successfully, performing fault-tolerant processing on that data by using a fault-tolerant function in the writing program.
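
The write path described in the preceding paragraph can be sketched as follows. This is a minimal illustration only: the class names `Writer` and `ListWriter`, the method names `write`, `close`, and `on_failure`, and the "set rows aside for retry" behavior are all assumptions made for this sketch; the application names a data writing function, a data closing function, and a fault-tolerant function but does not specify their signatures or what fault tolerance consists of.

```python
class Writer:
    """Hypothetical interface of a writing program for one data-source type."""

    def write(self, rows):
        """Data writing function."""
        raise NotImplementedError

    def close(self):
        """Data closing function."""
        raise NotImplementedError

    def on_failure(self, rows, error):
        """Fault-tolerant function."""
        raise NotImplementedError


class ListWriter(Writer):
    """Toy destination backed by Python lists (stand-in for a real data source)."""

    def __init__(self, fail=False):
        self.fail = fail          # simulate an unavailable destination
        self.stored = []          # rows that were written successfully
        self.set_aside = []       # rows kept by the fault-tolerant function
        self.closed = False

    def write(self, rows):
        if self.fail:
            raise IOError("destination unavailable")
        self.stored.extend(rows)

    def close(self):
        self.closed = True

    def on_failure(self, rows, error):
        # Assumed policy: keep the rows so they can be retried later.
        self.set_aside.extend(rows)


def write_table(writer, rows):
    """Write a cleansed table; close on success, run fault tolerance on failure."""
    try:
        writer.write(rows)
    except Exception as error:
        writer.on_failure(rows, error)
        return False
    # As in the text above, the connection is closed once writing has succeeded.
    writer.close()
    return True
```

Calling `write_table(ListWriter(), rows)` stores the rows and closes the writer, while `write_table(ListWriter(fail=True), rows)` routes the rows to the fault-tolerant function and leaves the connection open, since the text describes closing only for the success case.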

According to a second aspect of the present application, there is provided an information processing apparatus comprising: an extracting unit configured to extract each piece of data cleansing information and the cleansing code from a configuration file in response to receiving an information processing request sent by a user, wherein the configuration file is a normalized file in which the user performs general configuration of the data cleansing process, and each piece of data cleansing information comprises: the type of the source data source that provides the data, the connection information of the source data source, the type of the destination data source into which the data is written, and the connection information of the destination data source; an acquisition unit configured to acquire each piece of data on each source data source according to the connection information of each source data source, and to generate a data table corresponding to each piece of data cleansing information, wherein the data table is used for storing the acquired pieces of data corresponding to each source data source; a first processing unit configured to cleanse the data in each data table by using the cleansing code, based on the type of the source data source and the type of the destination data source in each piece of data cleansing information, wherein the cleansing code is used for converting the format of the data in the data table so that the converted data matches the type of the corresponding destination data source; and a writing unit configured to write the cleansed data in the data table corresponding to each piece of data cleansing information into the destination data source according to the connection information of the destination data source in that piece of data cleansing information.

In some embodiments, the data table in the obtaining unit is obtained by registering in Spark memory based on the obtained data of each source data source.

In some embodiments, the cleansing code in the extracting unit is SQL that describes the data cleansing process between different data types; the first processing unit is further configured to execute the SQL on each data table by using Spark SQL, based on the type of the source data source and the type of the destination data source in each piece of data cleansing information.

In some embodiments, the apparatus further comprises: a second processing unit configured to operate on each data table based on the first program scanned in the configuration file, wherein the first program represents a data implementation scheme customized in advance according to user requirements.

In some embodiments, the obtaining unit is further configured to obtain, by using a reading program, pieces of data on each source data source according to the connection information of each source data source, where the reading program is used to characterize an interface program for obtaining corresponding data of each type of data source; the writing unit is further configured to write data in the data table corresponding to the cleaned corresponding data cleaning information into the destination data source by using a writing program according to the connection information of the destination data source in each data cleaning information, wherein the writing program is used for representing an interface program for writing the corresponding data of each type of data source.

In some embodiments, the acquisition unit comprises: a judging module configured to judge, based on a connection instance of each source data source and a data judgment function in the reading program, whether data to be read exists on the source data source, wherein the connection instance is generated based on the implementation scheme corresponding to each source data source, and the implementation scheme is obtained by scanning in the configuration file; and an acquisition module configured to, in response to data to be read existing on the source data source, acquire one piece of that data by using a data acquisition function in the reading program, and then return to judging, based on the connection instance and the data judgment function in the reading program, whether data to be read exists on the source data source.

In some embodiments, the apparatus further comprises: a closing module configured to, in response to no data to be read existing on the source data source, stop the operation on the source data source and close the connection to the source data source by using a data closing function in the reading program.

In some embodiments, the writing unit comprises: a writing module configured to write the cleansed data in the data table corresponding to each piece of data cleansing information into the destination data source by using a data writing function in the writing program, according to the connection information of the destination data source in that piece of data cleansing information; a closing module configured to, in response to the cleansed data in the data table being written successfully, stop the operation on the destination data source and close the connection to the destination data source by using a data closing function in the writing program; and a fault-tolerant module configured to, in response to the cleansed data in the data table not being written successfully, perform fault-tolerant processing on that data by using a fault-tolerant function in the writing program.

According to a third aspect of the present application, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.

According to a fourth aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions, wherein the computer instructions are for causing a computer to perform the method as described in any one of the implementations of the first aspect.

According to the technology of the present application, in response to receiving an information processing request sent by a user, each piece of data cleansing information and the cleansing code are extracted from a configuration file, wherein the configuration file is a normalized file in which the user performs general configuration of the data cleansing process, and each piece of data cleansing information comprises: the type of the source data source that provides the data, the connection information of the source data source, the type of the destination data source into which the data is written, and the connection information of the destination data source. Each piece of data on each source data source is acquired according to the connection information of each source data source, and a data table corresponding to each piece of data cleansing information is generated. The data in each data table is cleansed by using the cleansing code, based on the type of the source data source and the type of the destination data source in each piece of data cleansing information, wherein the cleansing code is used for converting the format of the data in the data table so that the converted data matches the type of the corresponding destination data source. The cleansed data in the data table corresponding to each piece of data cleansing information is then written into the destination data source according to the connection information of the destination data source in that piece of data cleansing information. This addresses the poor data generality and high maintenance cost of the existing data storage technology. Data-source-specific handling is confined to the data extraction and writing stages; once the data arrives in memory or in a database, the differences between data sources are eliminated and all data is simply a table. This avoids the drawback that data dispersed across physical media cannot be directly associated and analyzed, provides a simple and efficient information processing method that removes such differences, and in turn provides a data source engine capable of processing data across physical engines. The data sources may be of many types, and all data sources are treated equally: the product follows a mesh model rather than the existing star model and supports mutual data synchronization among data sources, which resolves the problem of limited data source support and makes system processing simpler. Because the data cleansing information and the cleansing code are extracted from a configuration file that is a normalized file for general configuration of the data cleansing process, a user can gain access merely by configuring the data sources to be used in the configuration file. The underlying implementation is transparent to the user, who need not be concerned with it, and a more efficient and convenient data source integration scheme is achieved.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application.

Fig. 1 is a schematic diagram of a first embodiment of an information processing method according to the present application;

Fig. 2 is a scene diagram of an information processing method that can implement an embodiment of the present application;

Fig. 3 is a schematic diagram of a second embodiment of an information processing method according to the present application;

Fig. 4 is a schematic block diagram of one embodiment of an information processing apparatus according to the present application;

Fig. 5 is a block diagram of an electronic device for implementing the information processing method according to the embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 shows a schematic diagram 100 of a first embodiment of an information processing method according to the present application. The information processing method comprises the following steps:

Step 101, in response to receiving an information processing request sent by a user, extracting each piece of data cleansing information and the cleansing code from a configuration file.

In this embodiment, when the execution subject (for example, a terminal device or a server) receives an information processing request sent by a user, the execution subject may extract each piece of data cleansing information and the cleansing code from a configuration file acquired locally or remotely over a wired or wireless connection. The configuration file may be a normalized file in which the user performs general configuration of the data cleansing process, and each piece of data cleansing information may include: the type of the source data source that provides the data, the connection information of the source data source, the type of the destination data source into which the data is written, and the connection information of the destination data source. A data source is a medium that provides data or into which data is written; the source data source is the medium that provides the data, and the destination data source is the medium into which the data is written. The types of data sources may include MySQL, ClickHouse, Jimdb, ES, JSS, HBase, HDFS, Hive, Kafka, Jmq2, Dgraph, Neo4j, and the like, which cover relational data sources, in-memory data sources, file-based data sources, graph data sources, and so on. The connection information of a data source may include the method of connecting to the data source, the method of extracting data, and the like. The cleansing code may be used to convert the format of the data in the data table so that the converted data matches the type of the corresponding destination data source. It should be noted that the wireless connection may include, but is not limited to, a 3G, 4G, or 5G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra-wideband) connection, and other wireless connection means now known or developed in the future.
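
As an illustration only, the extraction in step 101 could look like the sketch below. The JSON layout, the key names (`jobs`, `source_type`, `source_connection`, `dest_type`, `dest_connection`, `cleansing_code`), the example host and table names, and the helper `load_config` are assumptions made for this sketch; the application does not specify a concrete file format.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CleansingInfo:
    """One piece of data cleansing information (field names are hypothetical)."""
    source_type: str          # type of the source data source, e.g. "mysql"
    source_connection: dict   # connection information of the source data source
    dest_type: str            # type of the destination data source, e.g. "hive"
    dest_connection: dict     # connection information of the destination data source


def load_config(text: str) -> Tuple[List[CleansingInfo], str]:
    """Extract every piece of cleansing information plus the cleansing code."""
    raw = json.loads(text)
    infos = [
        CleansingInfo(
            source_type=job["source_type"],
            source_connection=job["source_connection"],
            dest_type=job["dest_type"],
            dest_connection=job["dest_connection"],
        )
        for job in raw["jobs"]
    ]
    return infos, raw["cleansing_code"]


# Hypothetical configuration file content.
EXAMPLE_CONFIG = """
{
  "jobs": [
    {
      "source_type": "mysql",
      "source_connection": {"host": "db.example.internal", "table": "orders"},
      "dest_type": "hive",
      "dest_connection": {"database": "dw", "table": "orders_clean"}
    }
  ],
  "cleansing_code": "SELECT id, UPPER(status) AS status FROM orders"
}
"""

infos, cleansing_code = load_config(EXAMPLE_CONFIG)
```

Keeping the four fields together in one record mirrors the text: each piece of data cleansing information pairs one source with one destination, so the later steps can be driven entirely by the list returned here.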

Step 102, acquiring each piece of data on each source data source according to the connection information of each source data source, and generating a data table corresponding to each piece of data cleansing information.

In this embodiment, the execution subject may obtain, from the local or remote server, each piece of data on each source data source according to the connection information of each source data source obtained in step 101, and generate a data table corresponding to each piece of data cleansing information based on each piece of obtained data on each source data source. The data table may be used to store the pieces of data corresponding to each source data source obtained.

Step 103, cleansing the data in each data table by using the cleansing code, based on the type of the source data source and the type of the destination data source in each piece of data cleansing information.

In this embodiment, the execution subject may perform cleansing on data in each data table by using cleansing codes based on the type of the source data source and the type of the destination data source in each data cleansing information acquired in step 101. The cleansing code may be used to convert the format of the data in the data table so that the converted data in the data table matches the type of the corresponding destination data source.

Step 104, writing the cleansed data in the data table corresponding to each piece of data cleansing information into the destination data source according to the connection information of the destination data source in that piece of data cleansing information.

In this embodiment, the execution subject may write the data in the data table corresponding to the cleaned corresponding data cleaning information into the destination data source according to the connection information of the destination data source in each data cleaning information acquired in step 101.
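
Steps 102 to 104 can be pictured as one small driver function per piece of data cleansing information. The sketch below is a simplified model under stated assumptions: data sources are replaced by in-memory dictionaries, the cleansing code is modeled as a plain Python function rather than SQL, and every name (`run_job`, `memory_reader`, `memory_writer`, `stringify`, the `"memory"` and `"kv"` type labels) is hypothetical.

```python
def run_job(info, readers, writers, clean):
    """Run steps 102-104 for one piece of data cleansing information."""
    # Step 102: read every row from the source and hold it as a "data table".
    table = list(readers[info["source_type"]](info["source_connection"]))
    # Step 103: cleanse the table according to the source and destination types.
    cleaned = clean(table, info["source_type"], info["dest_type"])
    # Step 104: write the cleansed table to the destination data source.
    writers[info["dest_type"]](info["dest_connection"], cleaned)


# --- toy stand-ins for real data sources (assumptions for this demo) ---
SOURCE = {"orders": [{"id": 1, "status": "paid"}, {"id": 2, "status": "new"}]}
DEST = {}


def memory_reader(conn):
    """Yield the rows of the table named in the connection information."""
    return iter(SOURCE[conn["table"]])


def memory_writer(conn, rows):
    """Store the rows under the table named in the connection information."""
    DEST[conn["table"]] = rows


def stringify(rows, source_type, dest_type):
    """Example format conversion: a destination type that stores only strings."""
    if dest_type == "kv":
        return [{key: str(value) for key, value in row.items()} for row in rows]
    return rows


job = {
    "source_type": "memory", "source_connection": {"table": "orders"},
    "dest_type": "kv", "dest_connection": {"table": "orders_clean"},
}
run_job(job, {"memory": memory_reader}, {"kv": memory_writer}, stringify)
```

Looking up the reader and writer by type keeps all source-specific code at the two ends of the pipeline, while the middle step operates on a uniform table regardless of where the data came from.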

In some optional implementations of this embodiment, the method further includes: operating on each data table based on a first program scanned in the configuration file, wherein the first program represents a data implementation scheme customized in advance according to user requirements. The configuration file thus offers developers a secondary-development path based on an open interface, and also offers ordinary users a product-level configuration in which default implementations are set based on experience. Operating on the data tables through a first program predefined by the user increases the flexibility of the whole information processing process, and the first program can serve as an extension point for finer-grained control of a specific process.
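
One way such a user-defined first program might be plugged in is a small registry keyed by name. This is a sketch under assumptions: the application does not say how the first program is located or invoked, so the decorator `first_program`, the registry `FIRST_PROGRAMS`, the config key `"first_program"`, and the example operation `drop_empty_rows` are all hypothetical.

```python
FIRST_PROGRAMS = {}  # name -> user-defined table operation


def first_program(name):
    """Register a user-defined table operation under a name usable in the config."""
    def decorator(func):
        FIRST_PROGRAMS[name] = func
        return func
    return decorator


@first_program("drop_empty_rows")
def drop_empty_rows(table):
    """Keep only rows that contain at least one non-None value."""
    return [row for row in table if any(v is not None for v in row.values())]


def apply_first_program(config, tables):
    """Run the configured program on every data table; no-op if none is configured."""
    name = config.get("first_program")
    if name is None:
        return tables
    program = FIRST_PROGRAMS[name]
    return {table_name: program(rows) for table_name, rows in tables.items()}


tables = {"orders": [{"id": 1}, {"id": None}]}
result = apply_first_program({"first_program": "drop_empty_rows"}, tables)
```

With no program configured, the tables pass through unchanged, which corresponds to the default behavior intended for ordinary users; developers opt in to custom behavior by registering and naming a program.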

With continued reference to Fig. 2, the information processing method 200 of the present embodiment is executed in the electronic device 201. When the electronic device 201 receives an information processing request sent by a user, it extracts each piece of data cleansing information and the cleansing code from a configuration file 202. The electronic device 201 then acquires each piece of data on each source data source according to the connection information of each source data source and generates a data table corresponding to each piece of data cleansing information 203. Next, the electronic device 201 cleanses the data in each data table by using the cleansing code, based on the type of the source data source and the type of the destination data source in each piece of data cleansing information 204. Finally, the electronic device 201 writes the cleansed data in the data table corresponding to each piece of data cleansing information into the destination data source according to the connection information of the destination data source in that piece of data cleansing information 205. Here, the configuration file is a normalized file in which the user performs general configuration of the data cleansing process, and each piece of data cleansing information comprises: the type of the source data source that provides the data, the connection information of the source data source, the type of the destination data source into which the data is written, and the connection information of the destination data source. The cleansing code is used for converting the format of the data in the data table so that the converted data matches the type of the corresponding destination data source.

In the information processing method provided by the above embodiment of the present application, in response to receiving an information processing request sent by a user, each piece of data cleansing information and the cleansing code are extracted from a configuration file, wherein the configuration file is a normalized file in which the user performs general configuration of the data cleansing process, and each piece of data cleansing information comprises: the type of the source data source that provides the data, the connection information of the source data source, the type of the destination data source into which the data is written, and the connection information of the destination data source. Each piece of data on each source data source is acquired according to the connection information of each source data source, and a data table corresponding to each piece of data cleansing information is generated. The data in each data table is cleansed by using the cleansing code, based on the type of the source data source and the type of the destination data source in each piece of data cleansing information, wherein the cleansing code is used for converting the format of the data in the data table so that the converted data matches the type of the corresponding destination data source. The cleansed data in the data table corresponding to each piece of data cleansing information is then written into the destination data source according to the connection information of the destination data source in that piece of data cleansing information. This addresses the poor data generality and high maintenance cost of the existing data storage technology. Data-source-specific handling is confined to the data extraction and writing stages; once the data arrives in memory or in a database, the differences between data sources are eliminated and all data is simply a table. This avoids the drawback that data dispersed across physical media cannot be directly associated and analyzed, provides a simple and efficient information processing method that removes such differences, and in turn provides a data source engine capable of processing data across physical engines. The data sources may be of many types, and all data sources are treated equally: the product follows a mesh model rather than the existing star model and supports mutual data synchronization among data sources, which resolves the problem of limited data source support and makes system processing simpler. Because the data cleansing information and the cleansing code are extracted from a configuration file that is a normalized file for general configuration of the data cleansing process, a user can gain access merely by configuring the data sources to be used in the configuration file. The underlying implementation is transparent to the user, who need not be concerned with it, and a more efficient and convenient data source integration scheme is achieved.

With further reference to fig. 3, a schematic diagram 300 of a second embodiment of an information processing method is shown. The process of the method comprises the following steps:

Step 301, in response to receiving an information processing request sent by a user, extracting each piece of data cleaning information and a cleaning code from a configuration file.

Step 302, according to the connection information of each source data source, obtaining each piece of data on each source data source by using a reading program, and generating a data table corresponding to each piece of data cleaning information.

In this embodiment, the execution subject may use the reading program to obtain each piece of data on each source data source according to the connection information of each source data source obtained in step 301, and generate a data table corresponding to each piece of data cleaning information based on the data obtained. The reading program is an interface program for acquiring data from the various types of data sources. The data table may be used to store the pieces of data obtained from each source data source, and is obtained by registering the acquired data of each source data source in Spark memory. Spark is a fast, general-purpose distributed computing engine designed for large-scale data processing and maintained under the Apache Software Foundation. Spark memory refers to the memory of the nodes of the Spark cluster.

In some optional implementations of this embodiment, acquiring each piece of data on each source data source by using a reading program according to the connection information of each source data source includes: judging, based on a connection instance of each source data source and a data judgment function in the reading program, whether the source data source has data that needs to be read, wherein the connection instance is generated based on the implementation scheme corresponding to each source data source, and the implementation scheme is obtained by scanning the configuration file; and in response to the source data source having data that needs to be read, acquiring one piece of that data by using a data acquisition function in the reading program, and then returning to the step of judging, based on the connection instance and the data judgment function, whether the source data source has data that needs to be read. In this way, the pieces of data on the source data source are acquired one by one through the interface program.

In some optional implementations of this embodiment, the method further includes: in response to the source data source having no data that needs to be read, stopping the operation on the source data source and closing the connection to the source data source by using a data closing function in the reading program. Stopping the operation and closing the connection promptly improves overall operating efficiency.
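
The judge, acquire, and close cycle of the reading program described above can be sketched as follows. This is only an illustration under stated assumptions: the class `ListReader`, the method names `has_next`, `next`, and `close`, and the use of an in-memory list in place of a real source connection are hypothetical and are not taken from the application.

```python
from typing import Iterable, Iterator, List, Optional


class ListReader:
    """Hypothetical reading program backed by an in-memory list of records."""

    def __init__(self, records: Iterable[dict]):
        # The iterator stands in for the connection to the source data source.
        self._it: Optional[Iterator[dict]] = iter(records)
        self._pending: Optional[dict] = None
        self.closed = False

    def has_next(self) -> bool:
        """Data judgment function: is there another record to read?"""
        if self._pending is None and self._it is not None:
            self._pending = next(self._it, None)
        return self._pending is not None

    def next(self) -> dict:
        """Data acquisition function: return exactly one record."""
        if not self.has_next():
            raise LookupError("no data left to read")
        record, self._pending = self._pending, None
        return record

    def close(self) -> None:
        """Data closing function: stop operating on the source and release it."""
        self._it = None
        self.closed = True


def read_all(reader: ListReader) -> List[dict]:
    """Read records one by one, then close the reader once none remain."""
    rows: List[dict] = []
    try:
        while reader.has_next():        # judge whether data remains
            rows.append(reader.next())  # acquire one record, then judge again
    finally:
        reader.close()                  # nothing left: stop and close
    return rows
```

A reader for a different type of data source would, in this sketch, only need to supply the same three functions; the loop in `read_all` stays unchanged.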

Step 303, cleaning the data in each data table by using the cleaning code based on the type of the source data source and the type of the destination data source in each piece of data cleaning information.

In some optional implementations of this embodiment, the cleaning code is SQL that describes a data cleaning process between different data types, and cleaning the data in each data table by using the cleaning code, based on the type of the source data source and the type of the destination data source in each piece of data cleaning information, includes: executing the SQL on each data table by using Spark SQL, based on the type of the source data source and the type of the destination data source in each piece of data cleaning information. The cleaning code is SQL describing the associated data cleaning process; SQL (Structured Query Language) is flexible and powerful. Spark SQL is a subcomponent of Spark that provides the ability to analyze in-memory data interactively using SQL. By relying on Spark SQL, the SQL provided by the user can be dynamically optimized, which shortens the job run time and improves operating efficiency.
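
To illustrate the idea of executing user-supplied SQL against registered data tables, the fragment below uses Python's built-in `sqlite3` module with an in-memory database. This is purely a stand-in chosen so the example runs without a cluster; in the embodiment the tables would be registered in Spark memory and the SQL executed by Spark SQL. The table names, columns, sample rows, and the cleaning SQL itself are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Register two "data tables", each holding records read from a different source.
conn.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1250), (2, 990)])
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, " alice "), (2, "BOB")])

# Hypothetical cleaning code: associate the two tables and convert formats so
# the result matches what the destination expects (text amounts, tidy names).
CLEANING_SQL = """
SELECT o.id,
       LOWER(TRIM(u.name)) AS name,
       printf('%.2f', o.amount_cents / 100.0) AS amount
FROM orders AS o
JOIN users AS u ON u.id = o.id
ORDER BY o.id
"""

cleaned = conn.execute(CLEANING_SQL).fetchall()
conn.close()
# cleaned == [(1, 'alice', '12.50'), (2, 'bob', '9.90')]
```

Once both sources are visible as tables in the same engine, the association and the format conversion are expressed in a single SQL statement, regardless of where each table's data originally came from.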

Step 304, writing the cleaned data in the data table corresponding to each piece of data cleaning information into the destination data source by using a writing program, according to the connection information of the destination data source in each piece of data cleaning information.

In this embodiment, the execution subject may use the writing program to write the cleaned data in the data table corresponding to each piece of data cleaning information into the destination data source, according to the connection information of the destination data source in each piece of data cleaning information acquired in step 301. The writing program is an interface program for writing data to the various types of data sources.

In some optional implementations of this embodiment, writing the cleaned data in the data table corresponding to each piece of data cleaning information into the destination data source by using a writing program, according to the connection information of the destination data source in each piece of data cleaning information, includes: writing the cleaned data in the data table corresponding to each piece of data cleaning information into the destination data source by using a data writing function in the writing program, according to the connection information of the destination data source in each piece of data cleaning information; in response to the data in the cleaned data table being written successfully, stopping the operation on the destination data source and closing the connection to the destination data source by using a data closing function in the writing program; and in response to the data in the cleaned data table not being written successfully, performing fault-tolerant processing on the data in the cleaned data table by using a fault-tolerant function in the writing program. Writing data into the data source through the interface program, closing the operation and the connection promptly, and handling erroneous data improve the accuracy and efficiency of the overall operation.
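
The write, close-on-success, and fault-tolerance-on-failure behavior of the writing program can be sketched as below. The class `InMemoryWriter`, its method names, the dictionary used in place of a destination data source, and the choice to simply set failed rows aside are all assumptions for illustration; the application does not specify what the fault-tolerant function does.

```python
from typing import Dict, List


class InMemoryWriter:
    """Hypothetical writing program that 'writes' rows into an in-memory dict."""

    def __init__(self, store: Dict[int, dict]):
        self.store = store               # stands in for the destination data source
        self.failed: List[dict] = []     # rows set aside by fault-tolerant processing
        self.closed = False

    def write(self, rows: List[dict]) -> None:
        """Data writing function: raises KeyError if any row lacks an 'id'."""
        staged = {}
        for row in rows:
            staged[row["id"]] = row      # a missing key marks the write as failed
        self.store.update(staged)        # commit only when every row was valid

    def handle_failure(self, rows: List[dict]) -> None:
        """Fault-tolerant function: here, keep the rows for later inspection."""
        self.failed.extend(rows)

    def close(self) -> None:
        """Data closing function: stop operating on the destination."""
        self.closed = True


def write_table(writer: InMemoryWriter, rows: List[dict]) -> bool:
    """Write a cleaned table; close on success, fall back on failure."""
    try:
        writer.write(rows)
    except KeyError:
        writer.handle_failure(rows)      # unsuccessful write: fault-tolerant processing
        return False
    writer.close()                       # successful write: stop and close
    return True
```

In this sketch the rows are staged before being committed, so a failed write leaves the destination untouched; that is a design choice of the example rather than something stated in the text.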

In this embodiment, the specific operations of steps 301 and 303 are substantially the same as the operations of steps 101 and 103 in the embodiment shown in fig. 1, and are not described again here.

As can be seen from fig. 3, compared with the embodiment corresponding to fig. 1, the information processing method shown in schematic diagram 300 of this embodiment uses a reading program to obtain each piece of data on each source data source according to the connection information of each source data source and generates a data table corresponding to each piece of data cleaning information, and uses a writing program to write the cleaned data in the data table corresponding to each piece of data cleaning information into the destination data source according to the connection information of the destination data source in that piece of data cleaning information. The data source interfaces are thus encapsulated at a higher level, with only the key steps left open for extension; this removes the need to develop write code, reduces development cost, unifies the data sources at the interface level, and facilitates higher-level calls. Because the data table is obtained by registering the acquired data of each source data source in Spark memory, the data does not need to be copied to disk, which saves disk resources. Using an in-memory storage and computing engine makes fusion processing of the data more efficient than disk I/O and improves system processing efficiency. The memory can also be reclaimed and reused, so data copying becomes transparent to the user and closer to real requirements.

The method runs in a Spark distributed environment and makes use of the Spark distributed computing engine. Because Spark is highly scalable, the memory used can be expanded dynamically in a linear relationship with the data volume, without extra maintenance work; and because Spark is an established distributed computing engine with many production deployments, cluster maintenance is simple and the workload can be greatly reduced. The method achieves cross-physical-domain fusion of data sets that are not in the same physical domain and therefore could not otherwise be fused directly: the data does not need to be physically relocated, and remote fusion of data from multiple data sources is performed directly in memory. A data cleaning approach based on in-memory computation offers higher performance than one based on processing intermediate data on disk. Owing to Spark's distributed nature, data can be extracted and written out by multiple nodes of the cluster in parallel, which greatly increases data processing capacity, and Spark keeps all nodes coordinated and consistent during distributed processing; this is an important reason why ETL tasks are processed faster with this scheme than with conventional approaches. Because the memory is reusable, data from the data sources is not persisted and is automatically reclaimed by the cluster nodes once processing is complete, which reduces disk cost and improves processing efficiency.

With further reference to fig. 4, as an implementation of the method shown in fig. 1 to 3, the present application provides an embodiment of an information processing apparatus, which corresponds to the embodiment of the method shown in fig. 1, and which is specifically applicable to various electronic devices.

As shown in fig. 4, the information processing apparatus 400 of the present embodiment includes: an extracting unit 401, an obtaining unit 402, a first processing unit 403 and a writing unit 404. The extracting unit is configured to extract each piece of data cleaning information and a cleaning code from a configuration file in response to receiving an information processing request sent by a user, wherein each piece of data cleaning information includes: the type of the source data source that provides the data, the connection information of the source data source, the type of the destination data source to which the data is written, and the connection information of the destination data source, and the configuration file is a normalized file with which the user performs general configuration of the data cleaning process; the obtaining unit is configured to acquire each piece of data on each source data source according to the connection information of each source data source, and generate a data table corresponding to each piece of data cleaning information, wherein the data table is used for storing the pieces of data acquired from each source data source; the first processing unit is configured to clean the data in each data table by using the cleaning code based on the type of the source data source and the type of the destination data source in each piece of data cleaning information, wherein the cleaning code is used for converting the format of the data in the data table so that the converted data in the data table matches the type of the corresponding destination data source; and the writing unit is configured to write the cleaned data in the data table corresponding to each piece of data cleaning information into the destination data source according to the connection information of the destination data source in each piece of data cleaning information.

In this embodiment, specific processing of the extracting unit 401, the obtaining unit 402, the first processing unit 403, and the writing unit 404 of the information processing apparatus 400 and technical effects thereof may refer to the related descriptions of step 101 to step 104 in the embodiment corresponding to fig. 1, and are not described herein again.

In some optional implementation manners of this embodiment, the data table in the obtaining unit is obtained by registering in the Spark memory based on each piece of obtained data on each source data source.

In some optional implementations of this embodiment, the cleaning code in the extracting unit is SQL describing a data cleaning process between different data types; and the first processing unit is further configured to execute the SQL on each data table by using Spark SQL, based on the type of the source data source and the type of the destination data source in each piece of data cleaning information.

In some optional implementations of this embodiment, the apparatus further includes: and the second processing unit is configured to operate each data table based on the first program scanned in the configuration file, wherein the first program is used for characterizing a data implementation scheme customized in advance based on user requirements.

In some optional implementations of this embodiment, the obtaining unit is further configured to obtain, according to the connection information of each source data source, each piece of data on each source data source by using a reading program, where the reading program is used to characterize an interface program for obtaining data corresponding to each type of data source; the writing unit is further configured to write data in the data table corresponding to the cleaned corresponding data cleaning information into the destination data source by using a writing program according to the connection information of the destination data source in each data cleaning information, wherein the writing program is used for representing an interface program for writing the corresponding data of each type of data source.

In some optional implementations of this embodiment, the obtaining unit includes: a judging module configured to judge, based on a connection instance of each source data source and a data judgment function in the reading program, whether the source data source has data that needs to be read, wherein the connection instance is generated based on the implementation scheme corresponding to each source data source, and the implementation scheme is obtained by scanning the configuration file; and an acquisition module configured to, in response to the source data source having data that needs to be read, acquire one piece of that data by using a data acquisition function in the reading program, and then return to judging, based on the connection instance and the data judgment function in the reading program, whether the source data source has data that needs to be read.

In some optional implementations of this embodiment, the apparatus further includes: and the closing module is configured to respond to the absence of data required to be read by the source data source, stop the operation on the source data source and close the connection of the source data source by utilizing a data closing function in the reading program.

In some optional implementations of this embodiment, the writing unit includes: the writing module is configured to write the data in the data table corresponding to the cleaned corresponding data cleaning information into the target data source by using a data writing function in the writing program according to the connection information of the target data source in each data cleaning information; the closing module is configured to respond to successful data writing in the cleaned data table, stop the operation on the target data source and close the connection of the target data source by utilizing a data closing function in the writing program; and the fault-tolerant module is configured to respond to unsuccessful data writing in the cleaned data table and utilize a fault-tolerant function in the writing program to perform fault-tolerant processing on the data in the cleaned data table.

According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.

Fig. 5 is a block diagram of an electronic device for the information processing method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 5, the electronic device includes: one or more processors 501, memory 502, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories and multiple types of memory, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 5, one processor 501 is taken as an example.

Memory 502 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor, so that the at least one processor executes the information processing method provided by the application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the information processing method provided by the present application.

The memory 502, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the information processing method in the embodiment of the present application (for example, the extraction unit 401, the acquisition unit 402, the first processing unit 403, and the writing unit 404 shown in fig. 4). The processor 501 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 502, that is, implements the information processing method in the above-described method embodiments.

The memory 502 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the information processing electronic device, and the like. Further, the memory 502 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 502 optionally includes memory located remotely from processor 501, which may be connected to information handling electronics over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device of the information processing method may further include: an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503 and the output device 504 may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.

The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the information processing electronic apparatus, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 504 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

According to the technical solution of the embodiments of the present application, in response to receiving an information processing request sent by a user, each piece of data cleaning information and a cleaning code are extracted from a configuration file. The configuration file is a standardized file with which the user performs general configuration of the data cleaning process, and each piece of data cleaning information includes: the type of the source data source that provides the data, the connection information of the source data source, the type of the destination data source to which the data is written, and the connection information of the destination data source. Each piece of data on each source data source is acquired according to the connection information of that source data source, and a data table corresponding to each piece of data cleaning information is generated. The data in each data table is then cleaned by using the cleaning code, based on the type of the source data source and the type of the destination data source in each piece of data cleaning information; the cleaning code converts the format of the data in the data table so that the converted data matches the type of the corresponding destination data source. Finally, the cleaned data in the data table corresponding to each piece of data cleaning information is written into the destination data source according to the connection information of the destination data source in that piece of data cleaning information.

This solves the problems of poor data universality and high maintenance cost in existing data storage technology. Handling that is specific to a data source is confined to the data extraction and writing stages; once the data arrives in memory or in a database, the differences between data sources are eliminated and every data set is simply a table, which avoids the defect that data dispersed across physical media cannot be directly associated and analyzed. The result is a simple, efficient, difference-eliminating information processing method and, further, a data source engine capable of processing data across physical engines. The data sources may be of various types and all are treated equally, so the product follows a mesh model rather than the existing star model; mutual data synchronization among the various data sources is supported, the problem of limited data source support is solved, and system processing is simpler. Because the data cleaning information is extracted from the configuration file, which is a standardized file for general configuration of the data cleaning process, a user can access a data source merely by configuring it in the configuration file. The underlying implementation is transparent to the user, who does not need to care about it, which yields a more efficient and convenient data source integration scheme.

It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. An information processing method, the method comprising:

in response to receiving an information processing request sent by a user, extracting each piece of data cleaning information and a cleaning code from a configuration file, wherein each piece of data cleaning information comprises: the type of a source data source, connection information of the source data source, the type of a destination data source, and connection information of the destination data source, and the configuration file is a normalized file used by the user for general configuration of the data cleaning process;

acquiring each piece of data on each source data source according to the connection information of each source data source, and generating a data table corresponding to each piece of data cleaning information, wherein the data table is used for storing each piece of acquired data corresponding to each source data source;

based on the type of a source data source and the type of a destination data source in each piece of data cleaning information, cleaning data in each data table by using the cleaning code, wherein the cleaning code is used for converting the format of the data in the data table so as to enable the converted data in the data table to be matched with the type of the corresponding destination data source;

and writing the data in the data table corresponding to the cleaned corresponding data cleaning information into the target data source according to the connection information of the target data source in each data cleaning information.

2. The method of claim 1, wherein the data table is obtained by registering in Spark memory based on the obtained data of each source data source.

3. The method of claim 2, wherein the cleaning code is SQL describing a data cleaning process between different data types; the cleaning the data in each data table by using the cleaning code based on the type of the source data source and the type of the destination data source in each piece of data cleaning information includes:

executing the SQL on each data table by using Spark SQL based on the type of the source data source and the type of the destination data source in each piece of data cleaning information.

4. The method of claim 1, further comprising:

and operating each data table based on a first program scanned in the configuration file, wherein the first program is used for characterizing a data implementation scheme customized in advance based on the user requirements.

5. The method of claim 1, wherein the obtaining the pieces of data on each source data source according to the connection information of each source data source comprises:

acquiring each piece of data on each source data source by using a reading program according to the connection information of each source data source, wherein the reading program is used for representing an interface program for acquiring corresponding data of each type of data source;

the writing, according to connection information of a destination data source in each piece of data cleaning information, data in the data table corresponding to the cleaned corresponding data cleaning information into the destination data source includes:

and writing the data in the data table corresponding to the cleaned corresponding data cleaning information into the target data source by using a writing program according to the connection information of the target data source in each piece of data cleaning information, wherein the writing program is used for representing an interface program for writing the corresponding data of various data sources.

6. The method of claim 5, wherein the obtaining the pieces of data on each source data source by using the reading program according to the connection information of each source data source comprises:

judging whether data to be read from the source data source exists, based on a connection instance of each source data source and a data judgment function in the reading program, wherein the connection instance is generated based on an implementation scheme corresponding to each source data source, and the implementation scheme is obtained by scanning the configuration file;

in response to data to be read from the source data source existing, acquiring one piece of that data by using a data acquisition function in the reading program, and returning to the step of judging, based on the connection instance of each source data source and the data judgment function in the reading program, whether data to be read from the source data source exists.

7. The method of claim 6, further comprising:

in response to no data to be read from the source data source existing, stopping the operation on the source data source and closing the connection to the source data source by using a data closing function in the reading program.
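The loop described in claims 6 and 7 (judge, acquire, judge again, then close) can be sketched as below. The class `ListReader` and the method names `has_next`, `next` and `close` are hypothetical stand-ins for the data judgment, data acquisition and data closing functions; a real reading program would wrap a connection instance rather than a Python list.

```python
class ListReader:
    """Hypothetical reading program over an in-memory list."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0
        self.closed = False

    def has_next(self):
        # Data judgment function: is there still data to be read?
        return self._pos < len(self._rows)

    def next(self):
        # Data acquisition function: return one piece of data.
        row = self._rows[self._pos]
        self._pos += 1
        return row

    def close(self):
        # Data closing function: release the connection.
        self.closed = True


def read_all(reader):
    """Acquire data piece by piece until the judgment function reports none left."""
    rows = []
    try:
        while reader.has_next():
            rows.append(reader.next())
    finally:
        # Closing in `finally` also covers read errors, which goes slightly
        # beyond the claim (it only requires closing when no data remains).
        reader.close()
    return rows


source = ListReader(["a", "b", "c"])
rows = read_all(source)
```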

8. The method according to claim 5, wherein writing, by using a writing program, data in the data table corresponding to the cleaned corresponding data cleaning information into the destination data source according to connection information of the destination data source in each piece of data cleaning information, includes:

writing, according to the connection information of the destination data source in each piece of data cleaning information, the cleaned data in the corresponding data table into the destination data source by using a data writing function in the writing program;

in response to the cleaned data in the data table being written successfully, stopping the operation on the destination data source and closing the connection to the destination data source by using a data closing function in the writing program;

and in response to the cleaned data in the data table not being written successfully, performing fault tolerance processing on the cleaned data in the data table by using a fault tolerance function in the writing program.
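The write path of claim 8 branches on whether the write succeeded. The sketch below follows that branching literally; everything else is an assumption. In particular, the claim does not say what the fault tolerance function does, so parking the failed rows in a `dead_letter` list is purely a hypothetical strategy, and `FlakyWriter` simulates failures instead of talking to a real destination.

```python
class FlakyWriter:
    """Hypothetical writing program whose first `failures` writes fail."""

    def __init__(self, failures):
        self.failures = failures
        self.stored = []
        self.dead_letter = []
        self.closed = False

    def write(self, rows):
        # Data writing function.
        if self.failures > 0:
            self.failures -= 1
            raise OSError("simulated write failure")
        self.stored.extend(rows)

    def handle_failure(self, rows):
        # Fault tolerance function (assumed strategy: park rows for later).
        self.dead_letter.extend(rows)

    def close(self):
        # Data closing function.
        self.closed = True


def write_table(writer, rows):
    """Write cleaned rows; close on success, run fault tolerance on failure."""
    try:
        writer.write(rows)
    except OSError:
        writer.handle_failure(rows)
        return False
    writer.close()
    return True


ok_writer = FlakyWriter(failures=0)
ok = write_table(ok_writer, [1, 2])

bad_writer = FlakyWriter(failures=1)
bad = write_table(bad_writer, [3])
```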

9. An information processing apparatus, the apparatus comprising:

an extracting unit configured to extract pieces of data cleaning information and cleaning code from a configuration file in response to receiving an information processing request sent by a user, wherein each piece of data cleaning information includes: the type of a source data source, the connection information of the source data source, the type of a destination data source and the connection information of the destination data source; and the configuration file is a normalized file used by the user to carry out universal configuration of the data cleaning process;

an acquisition unit configured to acquire each piece of data on each source data source according to the connection information of each source data source, and to generate a data table corresponding to each piece of data cleaning information, wherein the data table is used for storing the acquired pieces of data corresponding to each source data source;

a first processing unit configured to clean the data in each data table with the cleaning code based on the type of the source data source and the type of the destination data source in each piece of data cleaning information, wherein the cleaning code is used to convert the format of the data in the data table so that the converted data matches the type of the corresponding destination data source;

and a writing unit configured to write the cleaned data in the data table corresponding to each piece of data cleaning information into the destination data source according to the connection information of the destination data source in that piece of data cleaning information.
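As an illustration of what the extracting unit might consume, the sketch below parses a configuration file and checks that every piece of data cleaning information carries the four items named in the abstract (source type, source connection information, destination type, destination connection information). The JSON format, the key names, the host names and the SQL are all hypothetical; the claims do not fix a file format.

```python
import json

# Hypothetical configuration file content.
CONFIG_TEXT = """
{
  "cleaning_code": "SELECT id, TRIM(name) AS name FROM t",
  "cleaning_info": [
    {
      "source_type": "mysql",
      "source_conn": {"host": "src.example", "table": "users"},
      "dest_type": "hive",
      "dest_conn": {"host": "dst.example", "table": "users_clean"}
    }
  ]
}
"""

REQUIRED = ("source_type", "source_conn", "dest_type", "dest_conn")


def extract(config_text):
    """Return (list of data cleaning information, cleaning code)."""
    config = json.loads(config_text)
    infos = config["cleaning_info"]
    for i, info in enumerate(infos):
        missing = [key for key in REQUIRED if key not in info]
        if missing:
            raise ValueError(f"cleaning_info[{i}] is missing {missing}")
    return infos, config["cleaning_code"]


infos, code = extract(CONFIG_TEXT)
```

Validating the entries at extraction time means a malformed configuration is rejected before any connection to a data source is opened.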

10. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.

11. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.