CN113806434A

CN113806434A - Big data processing method, device, equipment and medium

Info

Publication number: CN113806434A
Application number: CN202111107162.1A
Authority: CN
Inventors: 潘康; 廖仁巍; 王威凌
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-09-22
Filing date: 2021-09-22
Publication date: 2021-12-17
Anticipated expiration: 2041-09-22
Also published as: CN113806434B

Abstract

The invention relates to the field of big data and provides a big data processing method, device, equipment and medium. Dictionary information is loaded in a preset language to generate a data extraction statement, and data is extracted from a source system as data to be processed according to the data type of the source system and the data extraction statement; extracting in a manner targeted to each data type makes the obtained data more accurate. The data to be processed is preprocessed based on basic cleaning conversion logic to obtain intermediate data, which realizes a first, basic cleaning of the data. The intermediate data is then processed based on target cleaning conversion logic to obtain target data, and the target data is transmitted to a target system; this secondary cleaning and conversion follows the personalized data requirements of different application scenarios, so that the data matches its usage scenario and the method has higher applicability. In addition, the invention also relates to blockchain technology: the target data can be stored in a blockchain node.

Description

Big data processing method, device, equipment and medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a big data processing method, a big data processing device, big data processing equipment and a big data processing medium.

Background

ETL (Extract-Transform-Load, a data warehouse technology) is a general solution for data operations in the current big data industry. At present, the major open-source communities and commercial companies provide many tool libraries for individual functions such as acquisition or cleaning conversion. However, there is no framework that is robust enough, integrates the complete ETL function, and automatically converts the mapping logic of the demand side into the code of the development side. As a result, ETL has difficulty adapting to increasingly complex business data scenarios and cannot keep pace with the increasingly rapid iteration of technical tools.

Because framework integration and automatic conversion from logic to code have not been realized, each ETL task must be customized and developed manually and independently, and the ETL flows must be chained together by hand. ETL is therefore performed as point-to-point data acquisition, cleaning and conversion, which is inefficient and lacks generality.

Disclosure of Invention

In view of the above, it is necessary to provide a big data processing method, apparatus, device and medium, aiming at solving the problem of big data cleansing.

A big data processing method, comprising:

responding to a data processing instruction, and acquiring data from the data processing instruction to generate dictionary information;

defining a preset language, loading the dictionary information in the preset language, and generating a data extraction statement;

determining a source system and a target system according to the data processing instruction;

identifying the data type of the source system, and extracting data from the source system according to the data type of the source system and the data extraction statement to serve as data to be processed;

extracting basic cleaning conversion logic from the data to be processed, and preprocessing the data to be processed based on the basic cleaning conversion logic to obtain intermediate data;

and acquiring target cleaning conversion logic of the target system, processing the intermediate data based on the target cleaning conversion logic to obtain target data, and transmitting the target data to the target system.

According to a preferred embodiment of the present invention, the acquiring data from the data processing instruction to generate dictionary information includes:

acquiring the position information and the data structure of the field to be acquired from the data processing instruction;

constructing a data acquisition function according to the position information of the field to be acquired;

generating a drop table model according to the data structure;

and packaging the data acquisition function and the drop table model to obtain the dictionary information.

According to a preferred embodiment of the present invention, the defining the preset language includes:

carrying out rule definition based on a Scala parser, wherein the rule definition is used for calculating an initial numerical value and obtaining a target numerical value;

defining a mapping rule based on a Scala parser, wherein the mapping rule is used for representing the mapping relation between source data and target data;

and loading grammar writing rules of the Groovy language, and generating a statement converter according to the grammar writing rules, the rule definitions and the mapping rules, wherein the statement converter is used for converting any statement into a Groovy statement.

According to a preferred embodiment of the present invention, the extracting data from the source system as to-be-processed data according to the data type of the source system and the data extraction statement includes:

when the data type of the source system is the database type, acquiring a database connection string and a login certificate from the data processing instruction, connecting to the source system according to the database connection string and the login certificate, and extracting data from the source system by using the data extraction statement to obtain the data to be processed; or

when the data type of the source system is a file type, extracting metadata from the source system by using the data extraction statement to obtain the data to be processed.

According to a preferred embodiment of the present invention, the preprocessing the data to be processed based on the basic cleaning conversion logic to obtain intermediate data includes:

carrying out duplicate removal processing on the data to be processed to obtain first data;

clustering the first data by adopting a clustering algorithm to obtain a plurality of sub-regions;

calculating the upper limit distance and the lower limit distance of all points in each sub-area in the plurality of sub-areas relative to other points;

acquiring a configuration threshold, and determining a sub-region of which the upper limit distance is greater than or equal to the configuration threshold as a sub-region to be screened;

obtaining isolated points from the sub-region to be screened;

deleting the isolated points from the first data to obtain second data;

acquiring a missing value in the second data;

and filling the missing value based on a configuration filling mechanism to obtain the intermediate data.

According to a preferred embodiment of the present invention, the filling the missing value based on the configuration filling mechanism to obtain the intermediate data includes:

calculating the conditional probability of each subdata in the second data;

sorting the conditional probabilities of the subdata in descending order to obtain a target sequence;

acquiring the first element from the target sequence as a filling value;

and replacing the missing value by the filling value to obtain the intermediate data.

According to a preferred embodiment of the present invention, before the preprocessing the data to be processed based on the basic cleaning conversion logic, the method further comprises:

packaging the basic cleaning conversion logic into a class library, wherein the class library stores a plurality of cleaning conversion logics, and each cleaning conversion logic corresponds to an adapter;

configuring a target adapter corresponding to the basic cleaning conversion logic for the class library;

after the basic cleaning conversion logic is extracted from the data to be processed, connecting the data to the class library by using the target adapter;

and acquiring processing logic from the class library based on the basic cleaning conversion logic to preprocess the data to be processed.

A big data processing apparatus, the big data processing apparatus comprising:

the generating unit is used for responding to a data processing instruction, and acquiring data from the data processing instruction to generate dictionary information;

the generating unit is further used for defining a preset language, loading the dictionary information in the preset language and generating a data extraction statement;

the determining unit is used for determining a source system and a target system according to the data processing instruction;

the extraction unit is used for identifying the data type of the source system and extracting data from the source system according to the data type of the source system and the data extraction statement to serve as data to be processed;

the preprocessing unit is used for extracting basic cleaning conversion logic from the data to be processed and preprocessing the data to be processed based on the basic cleaning conversion logic to obtain intermediate data;

and the processing unit is used for acquiring the target cleaning conversion logic of the target system, processing the intermediate data based on the target cleaning conversion logic to obtain target data, and transmitting the target data to the target system.

A computer device, the computer device comprising:

a memory storing at least one instruction; and

a processor that executes the instructions stored in the memory to implement the big data processing method.

A computer-readable storage medium having stored therein at least one instruction, the at least one instruction being executable by a processor in a computer device to implement the big data processing method.

According to the above technical scheme, the invention can respond to a data processing instruction, acquire data from the data processing instruction to generate dictionary information, define a preset language, and load the dictionary information in the preset language to generate a data extraction statement. It determines a source system and a target system according to the data processing instruction, identifies the data type of the source system, and extracts data from the source system as data to be processed according to that data type and the data extraction statement; extracting in a manner targeted to each data type makes the acquired data more accurate. It then extracts basic cleaning conversion logic from the data to be processed and preprocesses the data to be processed based on that logic to obtain intermediate data, realizing a first, basic cleaning of the data. Finally, it acquires the target cleaning conversion logic of the target system, processes the intermediate data based on that logic to obtain target data, and transmits the target data to the target system. This secondary cleaning and conversion follows the personalized data requirements of different application scenarios, which ensures that the data matches its usage scenario and gives the method higher applicability.

Drawings

FIG. 1 is a flow chart of a big data processing method according to a preferred embodiment of the present invention.

FIG. 2 is a functional block diagram of a preferred embodiment of the big data processing apparatus according to the present invention.

FIG. 3 is a schematic structural diagram of a computer device for implementing a big data processing method according to a preferred embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a flow chart of a big data processing method according to a preferred embodiment of the present invention. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.

The big data processing method is applied to one or more computer devices. A computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The computer device may be any electronic product capable of performing human-computer interaction with a user, for example, a personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an Internet Protocol Television (IPTV), an intelligent wearable device, and the like.

The computer device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a cloud based on cloud computing and consisting of a large number of hosts or network servers.

The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.

Artificial Intelligence (AI) is the theory, methods, technologies and application systems that use a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best result.

The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

The network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a Virtual Private Network (VPN), and the like.

S10, in response to a data processing instruction, acquiring data from the data processing instruction to generate dictionary information.

In at least one embodiment of the present invention, the data processing instruction may be triggered by a relevant worker, such as a developer, a tester, a salesperson, and the like.

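In at least one embodiment of the invention, the acquiring data from the data processing instruction to generate dictionary information comprises:

acquiring the position information and the data structure of the field to be acquired from the data processing instruction;

constructing a data acquisition function according to the position information of the field to be acquired;

generating a drop table model according to the data structure;

and packaging the data acquisition function and the drop table model to obtain the dictionary information.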

For example, the data acquisition function may be a select() function or a regular expression, which is not limited in the present invention.

Through the data acquisition function, fields at specified positions can be acquired; for example, data is collected for field positions 1-20.

The drop table model can be used to clarify the specific meaning of each field in the finally generated data; for example, fields 0-11 represent the user's mobile phone number and fields 12-18 represent the user's information.
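As an illustration only, the dictionary information described above can be sketched as a small object that packages a data acquisition function with a drop table model. The class name `DictionaryInfo`, the helper names, the fixed-width record layout and the sample record are all assumptions made for this sketch; the field ranges follow the example in the text and are treated as inclusive character positions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class DictionaryInfo:
    """Hypothetical container: acquisition function plus drop table model."""
    acquire: Callable[[str], str]
    drop_table_model: Dict[str, Tuple[int, int]]  # field name -> (start, end), inclusive


def build_dictionary_info(position, structure):
    """Package an acquisition function and a drop table model together."""
    start, end = position

    def acquire(record: str) -> str:
        # Collect only the characters at the requested positions.
        return record[start:end + 1]

    return DictionaryInfo(acquire=acquire, drop_table_model=dict(structure))


def parse(record, info):
    """Acquire the raw span, then name each field using the drop table model."""
    raw = info.acquire(record)
    return {name: raw[s:e + 1] for name, (s, e) in info.drop_table_model.items()}


# Fields 0-11 hold the phone number, fields 12-18 hold the user information.
info = build_dictionary_info((0, 18), {"phone": (0, 11), "user_info": (12, 18)})
record = "138" + "0" * 9 + "ABCDEFG" + "trailing"
row = parse(record, info)
```

A regular expression could replace the slice inside `acquire` without changing the rest of the sketch, which matches the text's remark that the acquisition function is not limited to one form.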

S11, defining a preset language, loading the dictionary information in the preset language, and generating a data extraction statement.

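In at least one embodiment of the present invention, the defining the preset language includes:

carrying out rule definition based on a Scala parser, wherein the rule definition is used for calculating an initial numerical value and obtaining a target numerical value;

defining a mapping rule based on a Scala parser, wherein the mapping rule is used for representing the mapping relation between source data and target data;

and loading grammar writing rules of the Groovy language, and generating a statement converter according to the grammar writing rules, the rule definitions and the mapping rules, wherein the statement converter is used for converting any statement into a Groovy statement.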

For example, the rule definition may be:

    if (value == 'RR')
        'RR_MAPPER_123'
    else if (value == 'ATM')
        'ATM_MAPPER_123'
    else
        "EEROR_MAPPER"

The mapping rule may be:

    tagContext["tagModeName1"] = srcContext["srcModeName1"]
    tagContext["tagModeName8"] = hardCode("GALAXY-AQUILA")

Through this embodiment, the Groovy language is improved upon, so that the resulting language has a lower learning cost and is easy to use.
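To make the semantics of the example rule definition and mapping rule concrete, the following is a rough Python rendering. It is not the preset language itself: the function names `rule`, `hard_code` and `apply_mapping` are assumptions for this sketch, and the string "EEROR_MAPPER" is kept exactly as it is spelled in the example.

```python
def rule(value):
    """Rule definition: compute a target value from an initial value."""
    if value == "RR":
        return "RR_MAPPER_123"
    elif value == "ATM":
        return "ATM_MAPPER_123"
    return "EEROR_MAPPER"  # spelled as in the source example


def hard_code(constant):
    """Return a mapper that ignores the source context and yields a constant."""
    return lambda src_context: constant


# Mapping rule: each target field names the way its value is obtained.
MAPPING = {
    "tagModeName1": lambda src_context: src_context["srcModeName1"],
    "tagModeName8": hard_code("GALAXY-AQUILA"),
}


def apply_mapping(src_context):
    """Build the target context from the source context."""
    return {field: mapper(src_context) for field, mapper in MAPPING.items()}


tag_context = apply_mapping({"srcModeName1": "ATM"})
```

Keeping the mapping as plain data (a field name paired with a mapper) is one way a demand-side description could be turned into executable logic without hand-written code per task.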

In at least one embodiment of the present invention, the loading the dictionary information in the preset language, and generating the data extraction statement includes:

generating Java bytecode according to the dictionary information, based on the dynamic compilation technology of the Groovy language, to obtain the data extraction statement.

S12, determining a source system and a target system according to the data processing instruction.

In at least one embodiment of the present invention, said determining a source system and a target system from said data processing instructions comprises:

analyzing the data processing instruction to obtain a source system identification code and a target system identification code;

and determining the source system according to the source system identification code, and determining the target system according to the target system identification code.

It should be noted that the source system identification code and the target system identification code have uniqueness, so that one source system can be uniquely determined by the source system identification code, and one target system can be uniquely determined by the target system identification code.
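A minimal sketch of this lookup, assuming the parsed instruction is a dictionary and the known systems are held in a registry keyed by identification code. The registry contents, key names and identification codes are hypothetical.

```python
# Hypothetical registry: identification code -> system description.
# Dictionary keys are unique, so each code resolves to exactly one system.
SYSTEM_REGISTRY = {
    "SRC-001": {"name": "business-system", "type": "database"},
    "TGT-001": {"name": "downstream-app", "type": "file"},
}


def resolve_systems(instruction, registry=SYSTEM_REGISTRY):
    """Return (source system, target system) for a parsed instruction."""
    source_id = instruction["source_system_id"]
    target_id = instruction["target_system_id"]
    try:
        return registry[source_id], registry[target_id]
    except KeyError as missing:
        raise ValueError(f"unknown system identification code: {missing}") from None


source, target = resolve_systems(
    {"source_system_id": "SRC-001", "target_system_id": "TGT-001"}
)
```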

In this embodiment, the source system may include a business system, and the target system may include a downstream system of an application.

S13, identifying the data type of the source system, and extracting data from the source system as data to be processed according to the data type of the source system and the data extraction statement.

In at least one embodiment of the invention, the data types of the source system may include, but are not limited to: database type, file type.

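In at least one embodiment of the present invention, the extracting data from the source system as to-be-processed data according to the data type of the source system and the data extraction statement includes:

when the data type of the source system is the database type, acquiring a database connection string and a login certificate from the data processing instruction, connecting to the source system according to the database connection string and the login certificate, and extracting data from the source system by using the data extraction statement to obtain the data to be processed; or

when the data type of the source system is a file type, extracting metadata from the source system by using the data extraction statement to obtain the data to be processed.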

In this embodiment, the login credentials may include, but are not limited to: a user name and a password.

Through the embodiment, data can be extracted from the source system in a targeted manner according to different data types, so that the acquired data is more accurate.
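The type-dependent extraction can be sketched as a simple dispatch. Everything concrete here is an assumption: the statement is modelled as a dictionary holding an SQL string and a column list, the file source is CSV text, and `demo_connect` is a stand-in connector that uses an in-memory SQLite database (SQLite takes no user name or password, so the credentials are accepted and ignored).

```python
import csv
import io
import sqlite3


def demo_connect(connection_string, username, password):
    """Stand-in connector; a real one would authenticate with the credentials."""
    conn = sqlite3.connect(connection_string)
    conn.execute("CREATE TABLE users (id INTEGER, phone TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "138"), (2, "139")])
    return conn


def extract(source, statement, connect=demo_connect):
    """Extract the data to be processed according to the source's data type."""
    if source["type"] == "database":
        conn = connect(source["connection_string"], source["username"], source["password"])
        try:
            return conn.execute(statement["sql"]).fetchall()
        finally:
            conn.close()
    if source["type"] == "file":
        reader = csv.DictReader(io.StringIO(source["content"]))
        return [tuple(row[c] for c in statement["columns"]) for row in reader]
    raise ValueError(f"unsupported data type: {source['type']}")


statement = {"sql": "SELECT id, phone FROM users ORDER BY id", "columns": ["id", "phone"]}
db_rows = extract(
    {"type": "database", "connection_string": ":memory:", "username": "u", "password": "p"},
    statement,
)
file_rows = extract({"type": "file", "content": "id,phone,extra\n1,138,x\n"}, statement)
```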

S14, extracting basic cleaning conversion logic from the data to be processed, and preprocessing the data to be processed based on the basic cleaning conversion logic to obtain intermediate data.

In at least one embodiment of the invention, the basic cleaning conversion logic may be configured according to historical processing logic to implement a first, basic cleaning of the data.

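In at least one embodiment of the present invention, the preprocessing the data to be processed based on the basic cleaning conversion logic to obtain intermediate data includes:

carrying out duplicate removal processing on the data to be processed to obtain first data;

clustering the first data by adopting a clustering algorithm to obtain a plurality of sub-regions;

calculating the upper limit distance and the lower limit distance of all points in each sub-region of the plurality of sub-regions relative to other points;

acquiring a configuration threshold, and determining a sub-region of which the upper limit distance is greater than or equal to the configuration threshold as a sub-region to be screened;

obtaining isolated points from the sub-region to be screened;

deleting the isolated points from the first data to obtain second data;

acquiring a missing value in the second data;

and filling the missing value based on a configuration filling mechanism to obtain the intermediate data.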

Through the above implementation, basic cleaning and conversion of the data to be processed can be realized, and screening out the isolated points simplifies subsequent calculation, thereby improving data processing efficiency.
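The de-duplication and isolated-point screening of this embodiment can be sketched on one-dimensional numeric data. Several choices below are assumptions that the text does not specify: fixed-width bucketing stands in for the clustering algorithm, the upper and lower limit distances are taken as the largest and smallest pairwise distances in a sub-region, a point counts as isolated when its nearest neighbour in the sub-region is at least the threshold away, and sub-regions with a single point are skipped.

```python
from collections import defaultdict


def deduplicate(values):
    """Remove duplicates while keeping the original order (first data)."""
    return list(dict.fromkeys(values))


def cluster(values, width):
    """Stand-in clustering: group values into fixed-width sub-regions."""
    regions = defaultdict(list)
    for v in values:
        regions[int(v // width)].append(v)
    return list(regions.values())


def distance_bounds(region):
    """Upper and lower limit of the pairwise distances within a sub-region."""
    dists = [abs(a - b) for i, a in enumerate(region) for b in region[i + 1:]]
    return max(dists), min(dists)


def isolated_points(region, threshold):
    """Points whose nearest neighbour in the sub-region is far away."""
    return [
        p for i, p in enumerate(region)
        if min(abs(p - q) for j, q in enumerate(region) if j != i) >= threshold
    ]


def remove_isolated(values, width, threshold):
    """Return the second data: first data with isolated points deleted."""
    first = deduplicate(values)
    outliers = set()
    for region in cluster(first, width):
        if len(region) < 2:
            continue  # no pairwise distance to evaluate
        upper, _lower = distance_bounds(region)
        if upper >= threshold:  # sub-region to be screened
            outliers.update(isolated_points(region, threshold))
    return [v for v in first if v not in outliers]


second = remove_isolated([1.0, 1.0, 1.2, 1.4, 9.0, 20.0, 20.3, 20.5], width=10, threshold=5)
```

Only sub-regions whose upper limit distance reaches the threshold are examined point by point, which is where the reduction in subsequent calculation comes from.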

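Specifically, the filling the missing value based on the configuration filling mechanism to obtain the intermediate data includes:

calculating the conditional probability of each subdata in the second data;

sorting the conditional probabilities of the subdata in descending order to obtain a target sequence;

acquiring the first element from the target sequence as a filling value;

and replacing the missing value with the filling value to obtain the intermediate data.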

In the above embodiment, discrete missing values are filled with the value having the maximum conditional probability; compared with conventional filling with a fixed value (e.g., 0), the filled value has a higher correlation with the other values.
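A minimal sketch of filling by maximum conditional probability, assuming records are dictionaries, missing values are `None`, and the probability of each candidate value is conditioned on one other field of the same record. The field names `city` and `channel` and the choice of conditioning field are hypothetical; the text does not state what the probability is conditioned on.

```python
from collections import Counter


def fill_missing(rows, target, given):
    """Fill missing `target` values with the most probable value given `given`."""
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        # Observed target values among records sharing the same condition.
        peers = [r[target] for r in rows if r[given] == row[given] and r[target] is not None]
        counts = Counter(peers)
        total = sum(counts.values())
        # Target sequence: conditional probabilities sorted from large to small.
        ranked = sorted(((n / total, v) for v, n in counts.items()), reverse=True)
        fill_value = ranked[0][1] if ranked else None
        filled.append({**row, target: fill_value})
    return filled


rows = [
    {"city": "SZ", "channel": "ATM"},
    {"city": "SZ", "channel": "ATM"},
    {"city": "SZ", "channel": "RR"},
    {"city": "SZ", "channel": None},
    {"city": "BJ", "channel": "RR"},
    {"city": "BJ", "channel": None},
]
intermediate = fill_missing(rows, target="channel", given="city")
```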

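In at least one embodiment of the present invention, before the preprocessing the data to be processed based on the basic cleaning conversion logic, the method further includes:

packaging the basic cleaning conversion logic into a class library, wherein the class library stores a plurality of cleaning conversion logics, and each cleaning conversion logic corresponds to an adapter;

configuring a target adapter corresponding to the basic cleaning conversion logic for the class library;

after the basic cleaning conversion logic is extracted from the data to be processed, connecting the data to the class library by using the target adapter;

and acquiring processing logic from the class library based on the basic cleaning conversion logic to preprocess the data to be processed.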

Through the above implementation, the cleaning conversion logic is encapsulated; when new processing logic appears, the cleaning conversion logic can be changed directly in the class library and a new adapter configured, which facilitates calling.
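One way to picture the class library and its adapters is a registry of named cleaning functions, with one adapter object per registered logic. The class names `CleaningLibrary` and `Adapter`, the method names and the sample logics are assumptions for this sketch.

```python
class CleaningLibrary:
    """Hypothetical class library holding named cleaning conversion logics."""

    def __init__(self):
        self._logics = {}

    def register(self, name, logic):
        self._logics[name] = logic

    def lookup(self, name):
        return self._logics[name]


class Adapter:
    """Connects data to one cleaning conversion logic in the library."""

    def __init__(self, library, logic_name):
        self._library = library
        self._logic_name = logic_name

    def process(self, records):
        logic = self._library.lookup(self._logic_name)
        return [logic(record) for record in records]


library = CleaningLibrary()
library.register(
    "trim",
    lambda r: {k: v.strip() if isinstance(v, str) else v for k, v in r.items()},
)
adapters = {"trim": Adapter(library, "trim")}
cleaned = adapters["trim"].process([{"name": "  Ann "}])

# New processing logic: register it in the library and configure a new adapter.
library.register("upper_name", lambda r: {**r, "name": r["name"].upper()})
adapters["upper_name"] = Adapter(library, "upper_name")
shouted = adapters["upper_name"].process(cleaned)
```

Callers only hold adapters, so adding or changing logic touches the library and the adapter table but not the calling code.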

S15, acquiring the target cleaning conversion logic of the target system, processing the intermediate data based on the target cleaning conversion logic to obtain target data, and transmitting the target data to the target system.

In this embodiment, the target cleaning conversion logic refers to a data processing mode adapted to an actual application scenario corresponding to the target system.

It can be understood that each application scenario may have different requirements on data structures such as data formats. Therefore, after the basic data cleaning and conversion, secondary cleaning and conversion may be further performed according to the personalized data requirements of different application scenarios, which ensures that the data matches its usage scenario and gives the method higher applicability.

For example: when the target cleaning conversion logic of the actual usage scenario requires data to have consistency, a consistency detection requirement can be obtained from the target cleaning conversion logic, and the intermediate data is processed according to the consistency detection requirement to obtain the target data.
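As a sketch of such secondary cleaning, suppose the consistency detection requirement is that each listed field has one consistent type. This interpretation of "consistency", the `consistent_types` key and the sample records are assumptions; values are coerced to the required type and records that cannot be coerced are set aside.

```python
def apply_target_logic(records, target_logic):
    """Coerce fields to the types the target requires; collect failures."""
    types = target_logic["consistent_types"]
    accepted, rejected = [], []
    for record in records:
        try:
            coerced = {field: cast(record[field]) for field, cast in types.items()}
        except (KeyError, TypeError, ValueError):
            rejected.append(record)
        else:
            accepted.append({**record, **coerced})
    return accepted, rejected


target_logic = {"consistent_types": {"amount": float}}
target_data, rejected = apply_target_logic(
    [{"amount": "12.5"}, {"amount": 3}, {"amount": "n/a"}],
    target_logic,
)
```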

Further, the target data is transmitted to the target system for use by the target system.

It should be noted that, in order to further improve the security of the data and avoid malicious tampering with the data, the target data may be stored in a blockchain node.

FIG. 2 is a functional block diagram of a big data processing device according to a preferred embodiment of the present invention. The big data processing device 11 comprises a generating unit 110, a determining unit 111, an extracting unit 112, a preprocessing unit 113 and a processing unit 114. The module/unit referred to in the present invention refers to a series of computer program segments that can be executed by the processor 13 and that can perform a fixed function, and that are stored in the memory 12. In the present embodiment, the functions of the modules/units will be described in detail in the following embodiments.

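In response to a data processing instruction, the generating unit 110 acquires data from the data processing instruction to generate dictionary information.

In at least one embodiment of the present invention, the generating unit 110 acquiring data from the data processing instruction to generate dictionary information includes:

acquiring the position information and the data structure of the field to be acquired from the data processing instruction;

constructing a data acquisition function according to the position information of the field to be acquired;

generating a drop table model according to the data structure;

and packaging the data acquisition function and the drop table model to obtain the dictionary information.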

The generating unit 110 defines a preset language, and loads the dictionary information in the preset language to generate a data extraction statement.

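In at least one embodiment of the present invention, the generating unit 110 defining the preset language includes:

carrying out rule definition based on a Scala parser, wherein the rule definition is used for calculating an initial numerical value and obtaining a target numerical value;

defining a mapping rule based on a Scala parser, wherein the mapping rule is used for representing the mapping relation between source data and target data;

and loading grammar writing rules of the Groovy language, and generating a statement converter according to the grammar writing rules, the rule definitions and the mapping rules, wherein the statement converter is used for converting any statement into a Groovy statement.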

For example, the rule definition may be:

    if (value == 'RR')
        'RR_MAPPER_123'
    else if (value == 'ATM')
        'ATM_MAPPER_123'
    else
        "EEROR_MAPPER"

The mapping rule may be:

    tagContext["tagModeName1"] = srcContext["srcModeName1"]
    tagContext["tagModeName8"] = hardCode("GALAXY-AQUILA")

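In at least one embodiment of the present invention, the generating unit 110 loading the dictionary information in the preset language and generating the data extraction statement includes:

generating Java bytecode according to the dictionary information, based on the dynamic compilation technology of the Groovy language, to obtain the data extraction statement.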

The determining unit 111 determines a source system and a target system according to the data processing instruction.

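In at least one embodiment of the present invention, the determining unit 111 determining the source system and the target system according to the data processing instruction includes:

analyzing the data processing instruction to obtain a source system identification code and a target system identification code;

and determining the source system according to the source system identification code, and determining the target system according to the target system identification code.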

The extraction unit 112 identifies the data type of the source system, and extracts data from the source system as data to be processed according to the data type of the source system and the data extraction statement.

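In at least one embodiment of the present invention, the extraction unit 112 extracting data from the source system as the data to be processed according to the data type of the source system and the data extraction statement includes:

when the data type of the source system is the database type, acquiring a database connection string and a login certificate from the data processing instruction, connecting to the source system according to the database connection string and the login certificate, and extracting data from the source system by using the data extraction statement to obtain the data to be processed; or

when the data type of the source system is a file type, extracting metadata from the source system by using the data extraction statement to obtain the data to be processed.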

The preprocessing unit 113 extracts a basic cleaning conversion logic from the data to be processed, and preprocesses the data to be processed based on the basic cleaning conversion logic to obtain intermediate data.

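In at least one embodiment of the present invention, the preprocessing unit 113 preprocessing the data to be processed based on the basic cleaning conversion logic to obtain intermediate data includes:

carrying out duplicate removal processing on the data to be processed to obtain first data;

clustering the first data by adopting a clustering algorithm to obtain a plurality of sub-regions;

calculating the upper limit distance and the lower limit distance of all points in each sub-region of the plurality of sub-regions relative to other points;

acquiring a configuration threshold, and determining a sub-region of which the upper limit distance is greater than or equal to the configuration threshold as a sub-region to be screened;

obtaining isolated points from the sub-region to be screened;

deleting the isolated points from the first data to obtain second data;

acquiring a missing value in the second data;

and filling the missing value based on a configuration filling mechanism to obtain the intermediate data.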

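Specifically, the preprocessing unit 113 filling the missing value based on the configuration filling mechanism to obtain the intermediate data includes:

calculating the conditional probability of each subdata in the second data;

sorting the conditional probabilities of the subdata in descending order to obtain a target sequence;

acquiring the first element from the target sequence as a filling value;

and replacing the missing value with the filling value to obtain the intermediate data.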

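In at least one embodiment of the present invention, before the preprocessing of the data to be processed based on the basic cleaning conversion logic, the basic cleaning conversion logic is encapsulated into a class library, wherein the class library stores a plurality of cleaning conversion logics, and each cleaning conversion logic corresponds to one adapter;

a target adapter corresponding to the basic cleaning conversion logic is configured for the class library;

after the basic cleaning conversion logic is extracted from the data to be processed, the data is connected to the class library by using the target adapter;

and processing logic is acquired from the class library based on the basic cleaning conversion logic to preprocess the data to be processed.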

The processing unit 114 obtains a target cleaning conversion logic of the target system, processes the intermediate data based on the target cleaning conversion logic to obtain target data, and transmits the target data to the target system.

FIG. 3 is a schematic structural diagram of a computer device for implementing a big data processing method according to a preferred embodiment of the present invention.

The computer device 1 may comprise a memory 12, a processor 13 and a bus, and may further comprise a computer program, such as a big data processing program, stored in the memory 12 and executable on the processor 13.

It will be understood by those skilled in the art that the schematic diagram is merely an example of the computer device 1 and does not constitute a limitation on it. The computer device 1 may have a bus-type structure or a star-shaped structure, and may include more or fewer hardware or software components than those shown, or a different arrangement of components; for example, the computer device 1 may further include an input and output device, a network access device, and the like.

It should be noted that the computer device 1 is only an example; other electronic products that exist now or may appear in the future, if they can be adapted to the present invention, should also be included in the scope of protection of the present invention and are incorporated herein by reference.

The memory 12 includes at least one type of readable storage medium, which includes flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 12 may in some embodiments be an internal storage unit of the computer device 1, for example a removable hard disk of the computer device 1. The memory 12 may also be an external storage device of the computer device 1 in other embodiments, such as a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the computer device 1. Further, the memory 12 may also include both an internal storage unit and an external storage device of the computer device 1. The memory 12 can be used not only for storing application software installed in the computer apparatus 1 and various types of data such as a code of a big data processing program, etc., but also for temporarily storing data that has been output or is to be output.

The processor 13 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 13 is the control unit of the computer device 1: it connects the various components of the entire computer device 1 by using various interfaces and lines, and executes the various functions of the computer device 1 and processes its data by running or executing programs or modules stored in the memory 12 (e.g., executing a big data processing program) and calling data stored in the memory 12.

The processor 13 executes the operating system of the computer device 1 and various installed application programs. The processor 13 executes the application program to implement the steps in the above-mentioned various embodiments of big data processing method, such as the steps shown in fig. 1.

Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to accomplish the present invention. The one or more modules/units may be a series of computer readable instruction segments capable of performing certain functions, which are used to describe the execution of the computer program in the computer device 1. For example, the computer program may be divided into a generation unit 110, a determination unit 111, an extraction unit 112, a pre-processing unit 113, a processing unit 114.

The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a network device, or the like) or a processor to execute parts of the big data processing method according to the embodiments of the present invention.

The integrated modules/units of the computer device 1 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented.

The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), or a Random-Access Memory (RAM).

Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.

A blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
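The idea of data blocks linked by a cryptographic method can be shown with a minimal single-process sketch. It assumes SHA-256 as the hash function and a simple JSON block layout, neither of which is specified in this description, and it leaves out consensus, peer-to-peer distribution, and the platform layers entirely.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_PREV_HASH = "0" * 64  # assumed placeholder for the first block


def block_hash(block: Dict[str, Any]) -> str:
    """Hash the block's index, previous hash, and data with SHA-256."""
    payload = json.dumps(
        {k: block[k] for k in ("index", "prev_hash", "data")}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain: List[Dict[str, Any]], data: Any) -> Dict[str, Any]:
    """Create a block that stores the hash of its predecessor, then append it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    block = {"index": len(chain), "prev_hash": prev_hash, "data": data}
    block["hash"] = block_hash(block)
    chain.append(block)
    return block


def is_valid(chain: List[Dict[str, Any]]) -> bool:
    """Check every block's own hash and its link to the previous block."""
    for i, block in enumerate(chain):
        if block["hash"] != block_hash(block):
            return False
        expected_prev = chain[i - 1]["hash"] if i > 0 else GENESIS_PREV_HASH
        if block["prev_hash"] != expected_prev:
            return False
    return True
```

Because each block stores the hash of the block before it, changing the data of an earlier block makes `is_valid` return `False` unless every later block is recomputed as well. This is the property that lets the chain verify the validity of stored information such as the target data.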

The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one line is shown in FIG. 3, but this does not mean that there is only one bus or one type of bus. The bus is arranged to enable communication between the memory 12, the at least one processor 13, and other components.

Although not shown, the computer device 1 may further include a power supply (such as a battery) for supplying power to each component. Preferably, the power supply is logically connected to the at least one processor 13 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any of the following components: one or more DC or AC power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 1 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described here.

Further, the computer device 1 may include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is generally used for establishing a communication connection between the computer device 1 and other computer devices.

Optionally, the computer device 1 may further comprise a user interface, which may include a display and an input unit such as a keyboard, and optionally a standard wired interface and a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, displays information processed in the computer device 1 and presents a visual user interface.

It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.

FIG. 3 shows only the computer device 1 with the components 12-13. A person skilled in the art will understand that the structure shown in FIG. 3 does not limit the computer device 1, which may comprise fewer or more components than shown, a combination of certain components, or a different arrangement of components.

With reference to FIG. 1, the memory 12 of the computer device 1 stores a plurality of instructions to implement a big data processing method, and the processor 13 can execute the plurality of instructions to implement:

Specifically, for how the processor 13 implements these instructions, refer to the description of the relevant steps in the embodiment corresponding to FIG. 1; the details are not repeated here.

In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the modules is only one logical functional division, and other divisions are possible in an actual implementation.

The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units; that is, they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.

The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.

Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the present invention may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A big data processing method is characterized by comprising the following steps:

2. The big data processing method of claim 1, wherein the obtaining data generation dictionary information from the data processing instruction comprises:

generating a drop table model according to the data structure;

3. The big data processing method of claim 1, wherein the defining the preset language comprises:

4. The big data processing method according to claim 1, wherein the extracting data from the source system as the data to be processed according to the data type of the source system and the data extraction statement comprises:

5. The big data processing method of claim 1, wherein the preprocessing the data to be processed based on the base cleansing conversion logic to obtain intermediate data comprises:

obtaining isolated points from the sub-region to be screened;

deleting the isolated points from the first data to obtain second data;

acquiring a missing value in the second data;

6. The big data processing method according to claim 5, wherein the padding the missing value based on the configuration padding mechanism to obtain the intermediate data comprises:

calculating the conditional probability of each subdata in the second data;

acquiring a first element from the target sequence as a padding value;

7. The big data processing method of claim 1, wherein prior to the pre-processing the data to be processed based on the base cleansing conversion logic, the method further comprises:

8. A big data processing apparatus, characterized in that the big data processing apparatus comprises:

9. A computer device, characterized in that the computer device comprises:

a memory storing at least one instruction; and

a processor executing instructions stored in the memory to implement a big data processing method according to any of claims 1 to 7.

10. A computer-readable storage medium characterized by: the computer-readable storage medium stores at least one instruction which is executed by a processor in a computer device to implement the big data processing method according to any one of claims 1 to 7.