WO2016197852A1

WO2016197852A1 - Data processing method and device

Info

Publication number: WO2016197852A1
Application number: PCT/CN2016/084442
Authority: WO
Inventors: 戢洋; 甘云锋; 肖禹
Original assignee: 阿里巴巴集团控股有限公司
Priority date: 2015-06-09
Filing date: 2016-06-02
Publication date: 2016-12-15
Also published as: CN106294498A

Abstract

A data processing method and device. The method comprises: acquiring original data (101); classifying the acquired original data (102); and when a service to be processed is received, extracting needed data from the classified data according to a need of the service to be processed (103). The data processing method and device automate data processing without manual intervention, so that calculation results can be generally used and reused, thereby improving efficiency.

Description

Data processing method and device

The present application claims priority to a Chinese patent application having a priority date of June 9, 2015.

Technical field

The embodiments of the present invention relate to the field of communications technologies, and in particular, to a data processing method and device.

Background

The traditional approach to data modeling is to extract data from the source systems and then manually write SQL (Structured Query Language) to integrate the extracted data into the standard dimension table structure of the data warehouse, at which point the modeling of the entire data warehouse is complete. Under Internet business models, two types of requirements then generally arise:

First, the standard dimension tables of the data warehouse are integrated into a business wide table by manually writing SQL;

Second, multiple dimension tables of the data warehouse are integrated into the input sample set required by an algorithm model by manually writing SQL.

It can be seen that in the prior art, whichever type of requirement applies, the data must be integrated manually for each requirement. As a result, the calculation results cannot be generally reused, efficiency is low, and the manual maintenance cost is relatively high.

Summary of the invention

In view of the deficiencies in the prior art, the present application proposes a data processing method, including:

Acquiring original data;

Classifying the acquired original data;

When a to-be-processed service is received, extracting the required data from the classified data according to the needs of the to-be-processed service.

Optionally, the original data includes: new data, updated data, and specific-domain data;

Acquiring the original data includes:

Regularly obtaining new data from a plurality of preset databases;

Regularly obtaining updated data from the plurality of preset databases;

Regularly obtaining data of a specific domain based on keywords.

Optionally, before classifying the acquired original data, the method further includes:

Storing the acquired original data in an operational data store (ODS), and integrating the original data already in the ODS with the acquired original data.

Optionally, classifying the acquired original data includes:

Setting classification configuration parameters according to preset classification rules and classification needs;

Integrating all classification configuration parameters to generate classification integration template data;

Generating SQL code based on the classification integration template data and a multi-source data integration framework;

Obtaining the original data from the ODS by using the SQL code, and classifying the obtained original data by object;

Storing the classified data in a data warehouse (DW), and integrating the original data in the DW with the obtained classified data;

The objects include: time, place, event, person, and relationship.

Optionally, extracting the required data from the classified data according to the needs of the to-be-processed service, when the to-be-processed service is received, specifically includes:

After receiving the to-be-processed service, analyzing the needs of the to-be-processed service based on a preset rule to determine the data required to process the to-be-processed service;

Extracting the determined data from the classified data and storing the extracted data in a data mart (DM).

The application also proposes a data processing device, comprising:

An obtaining module, configured to acquire original data;

A classification module, configured to classify the acquired original data;

And an extracting module, configured to extract the required data from the classified data according to the needs of a to-be-processed service when the to-be-processed service is received.

The obtaining module is specifically configured to:

Regularly obtaining new data from a preset plurality of databases;

Regularly obtaining updated data from a plurality of preset databases;

Regularly obtaining data of a specific domain based on keywords.

Optionally, the device further includes:

An integration module, configured to store the acquired original data in an operational data store (ODS), and to integrate the original data already in the ODS with the acquired original data.

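Optionally, the classification module is specifically configured to:

Set classification configuration parameters according to preset classification rules and classification needs;

Integrate all classification configuration parameters to generate classification integration template data;

Generate SQL code based on the classification integration template data and a multi-source data integration framework;

Obtain the original data from the ODS by using the SQL code, and classify the obtained original data by object;

Store the classified data in a data warehouse (DW), and integrate the original data in the DW with the obtained classified data;

The objects include: time, place, event, person, and relationship.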

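Optionally, the extraction module is specifically configured to:

After receiving the to-be-processed service, analyze the needs of the to-be-processed service based on a preset rule to determine the data required to process the to-be-processed service;

Extract the determined data from the classified data and store the extracted data in a data mart (DM).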

Compared with the prior art, the present application classifies the acquired original data so that, when a to-be-processed service is received, the required data is extracted from the classified data according to the needs of that service. Data processing is thereby automated without manual processing, so that the calculation results can be generally reused, improving efficiency.

Description of the drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a data processing method according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a data processing device according to an embodiment of the present application.

Detailed description

In view of the problems of the prior art described in the background, the present application proposes a data processing method which, as shown in FIG. 1, includes the following steps:

Step 101: Obtain original data.

Specifically, the original data may be any of various kinds of data and may be selected as needed. The original data may be obtained from any database as required. For example, as shown in FIG. 2, it may be obtained from the following databases: a hotel accommodation reservation record database, a railway ticket purchase and travel record database, a civil aviation flight reservation record database, a census record database, a criminal record database, and so on. The databases can be set as needed, and the original data can also be obtained from other databases.

As time goes on, new data is continually generated and old data is continually updated. In addition, data of a specific domain may be needed for certain purposes. The original data may therefore include new data, updated data, and specific-domain data, and the specific acquisition process can include:

Regularly obtaining new data from a preset plurality of databases;

Regularly obtaining updated data from a plurality of preset databases;

Regularly obtaining data of a specific domain based on keywords.

The preset plurality of databases may include the databases described above, and data may also be obtained from other databases as needed. For example, to query the online shopping activity of a person (for example, user A), the online shopping record database needs to be queried to obtain user A's account records on Taobao and thereby learn about user A's online shopping on Taobao. Shopping records on other websites, such as Tmall, are handled similarly.
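The three acquisition modes can be illustrated with a minimal Python sketch. It assumes that each source row carries `created_at` and `updated_at` timestamps and a free-text `text` field; these field names, the `acquire` function, and the sample rows are hypothetical and do not come from the application. Running the function on a timer would correspond to the "regular" acquisition.

```python
from datetime import datetime

def acquire(rows, last_run, keywords):
    """Split source rows into new, updated, and specific-domain data.

    rows     -- list of dicts with hypothetical 'created_at', 'updated_at', 'text' keys
    last_run -- time of the previous acquisition run
    keywords -- keywords that identify the domain of interest
    """
    # New data: created after the previous run.
    new = [r for r in rows if r["created_at"] > last_run]
    # Updated data: existed before the previous run but was modified since.
    updated = [r for r in rows
               if r["created_at"] <= last_run < r["updated_at"]]
    # Specific-domain data: any row whose text mentions one of the keywords.
    domain = [r for r in rows if any(k in r["text"] for k in keywords)]
    return new, updated, domain

rows = [
    {"id": 1, "created_at": datetime(2015, 1, 1),
     "updated_at": datetime(2015, 1, 1), "text": "hotel booking"},
    {"id": 2, "created_at": datetime(2015, 1, 1),
     "updated_at": datetime(2015, 3, 1), "text": "train ticket"},
    {"id": 3, "created_at": datetime(2015, 4, 1),
     "updated_at": datetime(2015, 4, 1), "text": "online shopping order"},
]
new, updated, domain = acquire(rows, datetime(2015, 2, 1), ["shopping"])
```

With a previous run on 2015-02-01, row 3 is new, row 2 is updated, and row 3 also matches the keyword "shopping".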

After the original data is obtained, it needs to be processed. Specifically, the acquired original data may be stored in an ODS (Operational Data Store), and the acquired original data is integrated with the original data already in the ODS. For example, suppose the acquired original data includes data 1, data 2, and data 3, and the ODS already contains data 3. The two copies of data 3 are duplicates, so one can be deleted: for example, the data 3 already in the ODS can be retained and the data 3 in the acquired original data deleted. This ensures that the data is complete and comprehensive while avoiding redundant data.
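The data 1 / data 2 / data 3 example can be sketched in Python as follows. The sketch assumes that each record is identified by a unique key; the `merge_into_ods` function and the dictionary representation of the ODS are illustrative assumptions, not part of the application.

```python
def merge_into_ods(ods, acquired):
    """Merge newly acquired records into the ODS, keeping the ODS copy on duplicates."""
    merged = dict(ods)                   # copy so the caller's ODS is left untouched
    for key, record in acquired.items():
        merged.setdefault(key, record)   # add only the records the ODS lacks
    return merged

ods = {"data3": "ODS copy of data 3"}
acquired = {
    "data1": "data 1",
    "data2": "data 2",
    "data3": "acquired copy of data 3",
}
merged = merge_into_ods(ods, acquired)
```

After the merge, the ODS holds data 1, data 2, and data 3 exactly once, and the retained data 3 is the copy that was already in the ODS.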

Step 102: Classify the obtained original data.

Specifically, step 101 only acquires the data, and the volume of data is large. For this reason, the present application classifies the acquired data. The specific process includes: setting classification configuration parameters according to preset classification rules and classification needs; integrating all classification configuration parameters to generate classification integration template data; generating SQL code based on the classification integration template data and a multi-source data integration framework; obtaining the original data from the ODS by using the SQL code, and classifying the obtained original data by object; and storing the classified data in the data warehouse (DW), and integrating the original data in the DW with the obtained classified data. The objects include: time, place, event, person, and relationship. Classifying the data in this way allows subsequent extraction to be performed quickly when needed. The specific classification process can be as follows:

Classification configuration parameters are set according to the preset classification rules and the classification needs. The classification rules cover the individual steps of classification. For example, the steps may include: extracting the original data; scanning the original data to determine the multi-dimensional features of each piece of original data; and selecting specific features, based on the classification needs, by which to classify and integrate the original data. Each step is configured with its corresponding classification configuration parameters, and all the classification configuration parameters are integrated into one complete classification process, that is, the corresponding classification integration template data. The template data is then fed into the multi-source data integration framework (which generates SQL code) to produce the corresponding SQL code. If a later classification has the same needs, the generated SQL code can be reused directly; if the needs differ, only the corresponding configuration parameters need to be adjusted.
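The path from per-step configuration parameters to template data to generated SQL can be sketched in Python as follows. The application does not disclose the internals of the multi-source data integration framework, so `generate_sql` is only a hypothetical stand-in; the table name `ods_orders`, the column names, and the configuration keys are likewise assumptions made for illustration.

```python
def build_template(step_configs):
    """Integrate the configuration parameters of every step into one template dict."""
    template = {}
    for config in step_configs:
        template.update(config)
    return template

def generate_sql(template):
    """Render a SELECT statement from the template.

    A stand-in for the SQL-generating framework; the template is assumed to come
    from trusted configuration, since values are interpolated directly.
    """
    columns = ", ".join(template["features"])
    sql = f"SELECT {columns} FROM {template['source_table']}"
    if template.get("filter"):
        sql += f" WHERE {template['filter']}"
    return sql

step_configs = [
    {"source_table": "ods_orders"},                    # extraction step
    {"features": ["order_time", "buyer", "seller"]},   # feature selection step
    {"filter": "order_time >= '2014-01-01'"},          # classification need
]
template = build_template(step_configs)
sql = generate_sql(template)

# A different need only requires adjusting the configuration, not rewriting SQL.
adjusted_sql = generate_sql({**template, "filter": "seller = 'A'"})
```

The same template yields the same SQL every time, which is what allows the generated code to be reused for identical classification needs.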

Where the original data is stored in the ODS, the original data is obtained from the ODS and classified by using the SQL code.

The obtained original data is classified by object, where the objects include: time, place, event, person, and relationship. Classifying by object presents events along multiple dimensions and thus better meets the needs of the services. The classified data is subsequently stored in a DW (Data Warehouse), and the original data in the DW is integrated with the obtained classified data.
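Classification by object can be sketched in Python as follows. The sketch assumes that every record has an `id` and that each classified entry keeps that id, so the parts of one record remain linked after they are separated; the `classify_by_object` function, the `record_id` field, and the sample record are hypothetical.

```python
OBJECTS = ("time", "place", "event", "person", "relationship")

def classify_by_object(records):
    """Split each record into per-object entries that share the record id."""
    classified = {obj: [] for obj in OBJECTS}
    for record in records:
        for obj in OBJECTS:
            if obj in record:
                classified[obj].append(
                    {"record_id": record["id"], "value": record[obj]}
                )
    return classified

records = [
    {"id": 1, "time": "time 1", "person": ["user 1", "user 2"],
     "relationship": "transaction", "event": "user 1 sells goods 1 to user 2"},
]
dw = classify_by_object(records)

# Starting from the relationship entry, the matching person entry is found by id.
rid = dw["relationship"][0]["record_id"]
linked_persons = [e["value"] for e in dw["person"] if e["record_id"] == rid]
```

Because each entry carries the shared id, the other parts of a record can be found from any one part.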

The specific classification process is shown in FIG. 2. The original data is obtained by using the SQL code, and the obtained original data is classified by time, place, event, person, and relationship.

For time, the data can be arranged in chronological order and divided by a set time interval. For example, given the times 2012.03.06, 2015.05.04, 2013.03.05, 2014.06.03, and 2013.02.04, the time interval can be set to 1 year, and the times are divided into interval 1 (2012.03.06), interval 2 (2013.02.04, 2013.03.05), interval 3 (2014.06.03), and interval 4 (2015.05.04). Places can be divided into countries, provinces, cities, counties, and so on, or divided by latitude and longitude. Events can be divided, as needed, into transactions, transfers, crimes, travel, and so on. Persons can be divided based on the ID card, name, mobile phone number, email address, and other information related to the person. For example, if there are three persons A, B, and C, the classification for A can include A's ID card, name, mobile phone number, and email address; B and C are handled similarly and are not described again here. Relationships can include interpersonal relationships, such as friends, classmates, and fellows, and can also include people sharing the same vehicle, members of the same gang, and so on.

The connections within the original data still exist after classification; only the data is divided. For example, suppose the original data records that user 1 traded with user 2 at time 1, with user 1 selling goods 1 to user 2. After classification, the time is time 1, the persons are user 1 and user 2, and the relationship is a transaction (specifically, user 1 sold goods 1 to user 2). The data is divided into 3 parts, but the other parts can be found from any one part.
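The time-interval example can be checked with a short Python sketch. It assumes that a 1-year interval means a calendar year and that intervals are numbered in chronological order, skipping years with no data; the `bucket_by_year` function is illustrative only.

```python
from collections import defaultdict
from datetime import date

def bucket_by_year(dates):
    """Sort dates and group them into calendar-year intervals."""
    buckets = defaultdict(list)
    for d in sorted(dates):
        buckets[d.year].append(d)
    # Number the non-empty intervals 1, 2, 3, ... in chronological order.
    return {i: ds for i, (_, ds) in enumerate(sorted(buckets.items()), start=1)}

dates = [date(2012, 3, 6), date(2015, 5, 4), date(2013, 3, 5),
         date(2014, 6, 3), date(2013, 2, 4)]
intervals = bucket_by_year(dates)
```

For the five dates in the example this gives four intervals, with the two 2013 dates sharing interval 2.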

Step 103: When a to-be-processed service is received, extract the required data from the classified data according to the needs of the to-be-processed service.

The operation of extracting the data specifically includes:

After receiving the to-be-processed service, analyzing the needs of the to-be-processed service based on the preset rule to determine the data required for processing the to-be-processed service; and extracting the determined data from the classified data and storing the extracted data in the DM (Data Mart).

For example, suppose the service is to evaluate the performance of merchant A on Taobao in 2014 in order to give a rating. First, the data needed by the service is determined based on a preset rule, for example: the prices of the various commodities sold by merchant A; the sales of those commodities in 2014; whether the sold commodities were reviewed and the proportion that were reviewed; the number and proportion of good and bad ratings among the reviews; and the number and proportion of reviews that include pictures. The corresponding data can then be obtained from the classified data. For example, the person data includes each buyer's account number, mobile phone number, and so on; the relationship is a transaction with merchant A; and the specific data are the transaction data and the buyers' reviews of the commodities sold by merchant A from January 1, 2014 to December 1, 2014. The data obtained in this way is used together to evaluate the performance of merchant A on Taobao in 2014.
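The extraction step for the merchant rating example can be sketched in Python as follows. The `RULES` table, the service description, and the row fields (`seller`, `date`, `price`, `rating`) are hypothetical; the application only states that a preset rule determines the required data and that the extracted data is stored in the DM.

```python
# Hypothetical preset rules: service type -> the data that service needs.
RULES = {
    "merchant_rating": {"relationship": "transaction",
                        "fields": ["price", "rating"]},
}

def extract_for_service(service, classified):
    """Determine the needed data from the preset rule, then pull it from classified data."""
    rule = RULES[service["type"]]
    return [
        {field: row[field] for field in rule["fields"]}
        for row in classified
        if row["relationship"] == rule["relationship"]
        and row["seller"] == service["merchant"]
        and row["date"].startswith(str(service["year"]))
    ]

classified = [
    {"relationship": "transaction", "seller": "A", "date": "2014-03-01",
     "price": 10.0, "rating": "good"},
    {"relationship": "transaction", "seller": "A", "date": "2013-12-30",
     "price": 12.0, "rating": "bad"},
    {"relationship": "transaction", "seller": "B", "date": "2014-05-05",
     "price": 8.0, "rating": "good"},
]
service = {"type": "merchant_rating", "merchant": "A", "year": 2014}

# The data mart is modeled here as a dict keyed by service type.
data_mart = {service["type"]: extract_for_service(service, classified)}
```

Only the 2014 transaction of merchant A survives the filter, and only the fields named in the rule are copied into the data mart.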

In order to further explain the present application, the present application also discloses a data processing device, as shown in FIG. 3, including:

An obtaining module 301, configured to acquire original data;

A classification module 302, configured to classify the acquired original data;

An extracting module 303, configured to extract the required data from the classified data according to the needs of a to-be-processed service when the to-be-processed service is received.
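The division of the device into three modules can be sketched in Python as follows. The class and method names are hypothetical, and each module is reduced to a toy implementation; the sketch is only meant to show how modules 301, 302, and 303 could be chained, not how the application implements them.

```python
class ObtainingModule:            # corresponds to module 301
    def acquire(self, sources):
        # Flatten the records of every source into one list of original data.
        return [record for source in sources for record in source]

class ClassificationModule:       # corresponds to module 302
    def classify(self, records, key):
        # Group the original data by the value of the chosen key.
        groups = {}
        for record in records:
            groups.setdefault(record[key], []).append(record)
        return groups

class ExtractingModule:           # corresponds to module 303
    def extract(self, classified, needed):
        # Return only the groups that the to-be-processed service needs.
        return {k: classified[k] for k in needed if k in classified}

class DataProcessingDevice:
    def __init__(self):
        self.obtaining = ObtainingModule()
        self.classification = ClassificationModule()
        self.extracting = ExtractingModule()

    def handle(self, sources, key, needed):
        original = self.obtaining.acquire(sources)
        classified = self.classification.classify(original, key)
        return self.extracting.extract(classified, needed)

device = DataProcessingDevice()
result = device.handle(
    sources=[[{"event": "travel", "who": "A"}],
             [{"event": "transaction", "who": "B"}]],
    key="event",
    needed=["transaction"],
)
```

Keeping the modules separate mirrors the statement that modules may be combined into one or split further into sub-modules without changing the overall flow.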

The obtaining module 301 is specifically configured to:

Regularly obtaining new data from a preset plurality of databases;

Regularly obtaining updated data from a plurality of preset databases;

Regularly obtaining data of a specific domain based on keywords.

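Optionally, the data processing device further includes:

An integration module, configured to store the acquired original data in the operational data store (ODS), and to integrate the original data already in the ODS with the acquired original data.

Optionally, the classification module 302 is specifically configured to:

Set classification configuration parameters according to preset classification rules and classification needs;

Integrate all classification configuration parameters to generate classification integration template data;

Generate SQL code based on the classification integration template data and a multi-source data integration framework;

Obtain the original data from the ODS by using the SQL code, and classify the obtained original data by object;

Store the classified data in the data warehouse (DW), and integrate the original data in the DW with the obtained classified data;

The objects include: time, place, event, person, and relationship.

Optionally, the extraction module 303 is specifically configured to:

After receiving the to-be-processed service, analyze the needs of the to-be-processed service based on a preset rule to determine the data required to process the to-be-processed service;

Extract the determined data from the classified data and store the extracted data in the data mart (DM).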

Through the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by hardware, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and which includes several instructions for causing a computer device (such as a personal computer, a server, or a network device) to perform the methods described in the various implementation scenarios of the present application.

A person skilled in the art can understand that the drawings are only schematic diagrams of a preferred implementation scenario, and that the modules or processes in the drawings are not necessarily required to implement the present application.

A person skilled in the art can understand that the modules of the device in an implementation scenario may be distributed in the device as described for that implementation scenario, or may be located, with corresponding changes, in one or more devices different from that of the implementation scenario. The modules of the above implementation scenarios may be combined into one module, or may be further split into multiple sub-modules.

The above serial numbers are for description only and do not indicate the relative merits of the implementation scenarios.

The above disclosure covers only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any changes that can be conceived by those skilled in the art should fall within the protection scope of the present application.

Claims

1. A data processing method, comprising:

Acquiring original data;

Classifying the acquired original data;

When a to-be-processed service is received, extracting the required data from the classified data according to the needs of the to-be-processed service.

2. The method of claim 1, wherein the original data comprises: new data, updated data, and specific-domain data;

Acquiring the original data includes:

Regularly obtaining new data from a plurality of preset databases;

Regularly obtaining updated data from the plurality of preset databases;

Regularly obtaining data of a specific domain based on keywords.

3. The method according to claim 1, wherein before classifying the acquired original data, the method further comprises:

Storing the acquired original data in an operational data store (ODS), and integrating the original data already in the ODS with the acquired original data.

4. The method of claim 3, wherein classifying the acquired original data comprises:

Setting classification configuration parameters according to preset classification rules and classification needs;

Integrating all classification configuration parameters to generate classification integration template data;

Generating SQL code based on the classification integration template data and a multi-source data integration framework;

Obtaining the original data from the ODS by using the SQL code, and classifying the obtained original data by object;

Storing the classified data in a data warehouse (DW), and integrating the original data in the DW with the obtained classified data;

The objects include: time, place, event, person, and relationship.

5. The method according to claim 1, wherein extracting the required data from the classified data according to the needs of the to-be-processed service, when the to-be-processed service is received, specifically includes:

After receiving the to-be-processed service, analyzing the needs of the to-be-processed service based on a preset rule to determine the data required to process the to-be-processed service;

Extracting the determined data from the classified data and storing the extracted data in a data mart (DM).

6. A data processing device, comprising:

An obtaining module, configured to acquire original data;

A classification module, configured to classify the acquired original data;

And an extracting module, configured to extract the required data from the classified data according to the needs of a to-be-processed service when the to-be-processed service is received.

7. The device according to claim 6, wherein the original data comprises: new data, updated data, and specific-domain data;

The obtaining module is specifically configured to:

Regularly obtain new data from a plurality of preset databases;

Regularly obtain updated data from the plurality of preset databases;

Regularly obtain data of a specific domain based on keywords.

8. The device of claim 6, further comprising:

An integration module, configured to store the acquired original data in an operational data store (ODS), and to integrate the original data already in the ODS with the acquired original data.

9. The device according to claim 8, wherein the classification module is specifically configured to:

Set classification configuration parameters according to preset classification rules and classification needs;

Integrate all classification configuration parameters to generate classification integration template data;

Generate SQL code based on the classification integration template data and a multi-source data integration framework;

Obtain the original data from the ODS by using the SQL code, and classify the obtained original data by object;

Store the classified data in a data warehouse (DW), and integrate the original data in the DW with the obtained classified data;

The objects include: time, place, event, person, and relationship.

10. The device of claim 6, wherein the extraction module is specifically configured to:

After receiving the to-be-processed service, analyze the needs of the to-be-processed service based on a preset rule to determine the data required to process the to-be-processed service;

Extract the determined data from the classified data and store the extracted data in a data mart (DM).