WO2022205938A1 - Data acquisition method and apparatus, computer device, and storage medium - Google Patents

Data acquisition method and apparatus, computer device, and storage medium

Info

Publication number
WO2022205938A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
field information
information
time interval
collected
Prior art date
Application number
PCT/CN2021/131752
Other languages
French (fr)
Chinese (zh)
Inventor
孙岩
董光杰
顾永飞
杭军
吴金迎
钱津津
Original Assignee
苏宁易购集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏宁易购集团股份有限公司
Publication of WO2022205938A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0633 Lists, e.g. purchase orders, compilation or processing
    • G06Q30/0635 Processing of requisition or of purchase orders

Definitions

  • The present application relates to the technical field of data processing, and in particular to a data acquisition method and apparatus, a computer device, and a storage medium.
  • Data collection generally uses collection tools to collect source data from the source database into the data warehouse of the big data platform.
  • The data update time is usually used as a filter condition to collect incremental data.
  • In some scenarios, however, the newly updated data alone is not enough: for example, the data of a return order and the data of the original order corresponding to the return order must be collected together into the data warehouse before they can be used for downstream statistical analysis.
  • The big data center platform then needs to use the full calculation method to calculate the historical stock data, which not only consumes computing resources but also reduces the efficiency of data collection.
  • a data processing method comprising:
  • the target field information that does not match the time interval information is filtered from the field information stored in the intermediate table.
  • the intermediate table includes the field information of the characteristic fields of the data to be collected.
  • the data to be collected is the data determined from the business database according to the preset collection logic;
  • Data integration processing is performed on the first data and the second data to obtain target data.
  • the above method further includes:
  • the above-mentioned intermediate table is a data table set in a data warehouse.
  • before obtaining the specified time interval information, the above method further includes: obtaining preset collection logic information; determining data in the business database that conforms to the collection logic information as the data to be collected; and extracting field information from the characteristic fields of the data to be collected and storing it in the intermediate table.
  • filtering the target field information that does not match the time interval information from the field information stored in the intermediate table includes: acquiring task parameter information, reading the field information matching the task parameter information from the intermediate table, and storing it in a temporary table; and filtering the target field information that does not match the time interval information from the field information stored in the temporary table.
  • collecting the second data from the business database according to the target field information includes: generating a structured query language statement by using the target field information as the value corresponding to a query condition; and collecting the second data from the business database according to the structured query language statement.
  • the above method further includes: performing data deduplication processing on the data after the data integration processing.
  • the above method further includes: comparing the data after the data integration processing with the data to be collected, and removing data different from the data to be collected.
  • a data acquisition device comprising:
  • a first data acquisition module configured to acquire specified time interval information, and collect the updated first data in the corresponding time interval from the business database according to the time interval information
  • the field information acquisition module is used to filter the target field information that does not match the time interval information from the field information stored in the intermediate table.
  • the intermediate table includes the field information of the characteristic fields of the data to be collected, and the data to be collected is the data determined from the business database according to preset collection logic;
  • the second data collection module is used for collecting the second data from the business database according to the target field information
  • the data integration processing module is used for performing data integration processing on the first data and the second data to obtain target data.
  • a computer device includes a memory, a processor, and a computer program stored in the memory and running on the processor.
  • the processor implements the steps of the data acquisition method when the processor executes the computer program.
  • The above data collection method, apparatus, computer device, and storage medium collect the corresponding first data through a specified time interval, filter the target field information that is not in the specified time interval through an intermediate table containing the field information of the predetermined data to be collected, collect the corresponding second data according to the target field information, and finally integrate the first data and the second data to obtain the target data of this collection task.
  • In this way, the data updated within the specified time interval and the historical data related to it can be collected quickly, and a full calculation over the historical data is no longer needed, which improves the efficiency of data collection.
  • FIG. 1 is an application environment diagram of the data acquisition method in one embodiment;
  • FIG. 2 is a schematic flowchart of a data collection method in one embodiment;
  • FIG. 3 is a technical framework diagram of distributed data acquisition task execution in an application example;
  • FIG. 4 is a schematic flowchart of a data acquisition method in an application example;
  • FIG. 5 is a structural block diagram of a data acquisition device in one embodiment;
  • FIG. 6 is a diagram of the internal structure of a computer device in one embodiment.
  • the data collection method provided in this application can be applied to the application environment shown in FIG. 1 .
  • the server 102 obtains the specified time interval information, and collects the first data updated in the corresponding time interval from the business database 104 according to the time interval information; and selects the target fields that do not match the time interval information from the field information stored in the intermediate table 106.
  • the intermediate table 106 includes the field information of the characteristic fields of the data to be collected, the data to be collected is the data determined from the business database according to the preset collection logic; the second data is collected from the business database 104 according to the target field information; Data integration processing is performed on the first data and the second data to obtain target data.
  • the server 102 may be implemented by an independent server or a server cluster composed of multiple servers.
  • a data collection method is provided, which is described by taking the method applied to the server in FIG. 1 as an example, including the following steps:
  • Step S202 Acquire specified time interval information, and collect the first data updated in the corresponding time interval from the business database according to the time interval information.
  • the business database is a database that stores business data, and may be a relational database or a non-relational database.
  • the business database may contain at least one business data table.
  • the first data is data updated to a certain data table in the business database within a specified time interval.
  • the user can collect data by using the time interval of data update as a condition for data filtering.
  • the time interval information may be information indicating any valid time period or point in time.
  • the server obtains the time interval information specified by the user, uses the time interval information as a condition for data screening, and collects the data updated in the time interval or at the time point corresponding to the time interval information in the business database as the first data.
  • Step S204 Filter the target field information that does not match the time interval information from the field information stored in the intermediate table. The intermediate table includes the field information of the characteristic fields of the data to be collected, and the data to be collected is the data determined from the business database according to the preset collection logic.
  • the intermediate table is a data table in the database for storing intermediate calculation results.
  • the data to be collected is the data to be collected determined from the data of the business database according to the user-defined or preset collection logic, and the purpose of determining the data to be collected is to frame the scope of data collection.
  • the characteristic fields can be set adaptively according to different data types. For example, for return order data, the characteristic fields can include at least one of order number, order line number, stock number, table number, and order time.
  • the field information can be the field value in the field.
  • the server matches the field information stored in the intermediate table against the time interval information, selects the field information of the data to be collected that does not fall within the time interval corresponding to the time interval information, and uses the selected field information as the target field information. For example, if the time interval information is yesterday, the field information of the data to be collected that was not updated yesterday is selected from the intermediate table as the target field information.
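The selection in step S204 can be illustrated with a minimal sketch. It assumes the intermediate table rows are already loaded as dictionaries and that the time interval is a half-open range; the field names `order_no` and `order_time` and the function name `select_target_fields` are hypothetical, chosen only to mirror the characteristic fields listed above.

```python
from datetime import datetime

def select_target_fields(intermediate_rows, start, end):
    """Return the order numbers whose time falls outside [start, end)."""
    return [
        row["order_no"]
        for row in intermediate_rows
        if not (start <= row["order_time"] < end)
    ]

# Hypothetical intermediate-table content: a return order and its original order.
intermediate_rows = [
    {"order_no": "A1", "order_time": datetime(2021, 11, 10, 9, 0)},
    {"order_no": "R1", "order_time": datetime(2021, 11, 19, 14, 30)},
]

# "Yesterday" is taken to be 2021-11-19 for this example.
targets = select_target_fields(
    intermediate_rows, datetime(2021, 11, 19), datetime(2021, 11, 20)
)
print(targets)
```

Only the original order `A1` is selected here, because the return order `R1` already falls inside the interval and would be picked up by the incremental collection of step S202.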
  • Step S206 Collect second data from the business database according to the target field information.
  • the second data refers to the data queried from the business database according to the target field information.
  • the server can use the target field information as a data screening condition and query the business data containing the target field information from the business database, using a query language that matches the business database. For example, an IN query in SQL (Structured Query Language) can be used, and the queried business data is collected as the second data.
  • Step S208 Perform data integration processing on the first data and the second data to obtain target data.
  • data integration processing is performed on the collected first data and second data, and all the data in the data set obtained after the data integration is used as the target data of this data collection task.
  • The corresponding first data is collected through a designated time interval, the target field information that is not in the designated time interval is filtered through an intermediate table containing the field information of the predetermined data to be collected, and the corresponding second data is collected according to the target field information.
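Steps S202 to S208 can be sketched end to end as follows. This is an illustration under stated assumptions, not the applicant's implementation: two in-memory SQLite databases stand in for the business database and for the data warehouse holding the intermediate table, and the schema, the sample rows, and the function name `collect` are all hypothetical.

```python
import sqlite3

# Stand-in for the business database (hypothetical schema and rows).
business = sqlite3.connect(":memory:")
business.executescript("""
CREATE TABLE orders (order_no TEXT PRIMARY KEY, kind TEXT, updated_at TEXT);
INSERT INTO orders VALUES
  ('A1', 'sale',   '2021-11-10'),
  ('A2', 'sale',   '2021-11-12'),
  ('R1', 'return', '2021-11-19');
""")

# Stand-in for the intermediate table kept in the data warehouse.
# Assumed collection logic: return order R1 plus its original order A1.
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
CREATE TABLE intermediate (order_no TEXT, order_time TEXT);
INSERT INTO intermediate VALUES ('R1', '2021-11-19'), ('A1', '2021-11-10');
""")

def collect(start, end):
    """Collect target data for the half-open interval [start, end).

    ISO date strings are compared as text, which preserves date order.
    """
    # S202: first data, i.e. rows updated inside the interval.
    first = business.execute(
        "SELECT order_no, kind, updated_at FROM orders "
        "WHERE updated_at >= ? AND updated_at < ?",
        (start, end),
    ).fetchall()

    # S204: target field information, i.e. intermediate rows outside the interval.
    targets = [
        row[0]
        for row in warehouse.execute(
            "SELECT order_no FROM intermediate "
            "WHERE NOT (order_time >= ? AND order_time < ?)",
            (start, end),
        )
    ]

    # S206: second data, fetched with an IN query on the target field values.
    second = []
    if targets:
        marks = ", ".join("?" for _ in targets)
        second = business.execute(
            f"SELECT order_no, kind, updated_at FROM orders "
            f"WHERE order_no IN ({marks})",
            targets,
        ).fetchall()

    # S208: integrate both sets; the set union also drops exact duplicates.
    return sorted(set(first) | set(second))

target_data = collect("2021-11-19", "2021-11-20")
print(target_data)
```

In this example the sale `A2` is never read, since it is neither updated in the interval nor listed in the intermediate table, which is the point of avoiding a full pass over the historical data.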
  • the above method further includes: storing the target data in the partition table of the corresponding partition in the data warehouse.
  • the partition table can be a data table in the Hive database. Data can be written to a partition table in a custom format or in the default format, and a Hive data table in a custom format can prevent the data corruption that occurs when the content of some fields contains line breaks.
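The line-break problem mentioned above can be reproduced outside Hive. The sketch below is only an analogy, assuming a newline-delimited text layout for the default case; it uses Python's `csv` module as a stand-in for a custom format with field quoting and does not show Hive's actual storage formats. The sample rows are invented.

```python
import csv
import io

# Hypothetical rows; the remark field of the first row contains a line break.
rows = [["A1", "leave at\nfront door"], ["A2", "none"]]

# Naive newline-delimited layout: the embedded line break splits one record in two.
naive_text = "\n".join("\t".join(row) for row in rows)
naive_records = naive_text.split("\n")
print(len(naive_records))

# Layout with quoting: the embedded line break stays inside its field.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
parsed = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(parsed == rows)
```

The naive layout yields three records for two rows, while the quoted layout round-trips the rows unchanged.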
  • the intermediate table is a data table set in a data warehouse.
  • the intermediate table is set in the data warehouse of the big data platform, which may be one or more data tables in the data warehouse, and its format is not limited, for example, it may be a hive data table.
  • In traditional data collection, an intermediate table is created in each sub-database of the business database, and when the data of a certain table is extracted, it is collected through an inner join with the intermediate table. For example, when collecting return table data, the return table is queried through an inner join with the intermediate table, and the newly added return order data and the original order data corresponding to the newly added return orders are collected into the data warehouse for downstream statistical analysis of sales data.
  • With the intermediate table set in the data warehouse, the intermediate table of each sub-database of the business database in the business system can be removed. Writing data to an intermediate table in the business database (the data source) requires the data source to be configured with read and write permissions, so traditional data collection can only use the main database of the business database, which reduces system performance during collection and affects the normal operation of the business; the data writing operation also reduces the security of the database.
  • Once the intermediate table created in the business database is removed, there is no need to query through an inner join with the intermediate table. The standby business database can therefore be used for data collection without affecting the main business database, which decouples collection from the business system and ensures system security.
  • before acquiring the specified time interval information, the above method further includes:
  • obtaining preset collection logic information; determining data in the business database that conforms to the collection logic information as the data to be collected; and extracting field information from the characteristic fields of the data to be collected and storing it in an intermediate table.
  • the collection logic information that conforms to the business rules can be preset according to the business rules of each collection task, and the scope of data collection can be determined according to the preset collection logic information; that is, the data that conforms to the collection logic is determined as the data to be collected, and the field information in the characteristic fields of the data to be collected is extracted and stored in the intermediate table.
  • the characteristic fields can be specified in advance according to different collection tasks; for example, fields such as the order number, order line number, stock number, table number, or order time can be specified as characteristic fields.
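As a rough illustration of how the intermediate table could be populated, the sketch below applies one assumed piece of collection logic: for every return order, record the return order itself and the original order it refers to. The input layout, the field names, and the function `build_intermediate_rows` are hypothetical and not taken from the application.

```python
def build_intermediate_rows(return_orders):
    """Extract characteristic-field values for each return and its original order."""
    rows = []
    for ret in return_orders:
        rows.append({"order_no": ret["order_no"], "order_time": ret["order_time"]})
        rows.append(
            {
                "order_no": ret["original_order_no"],
                "order_time": ret["original_order_time"],
            }
        )
    return rows

# Hypothetical return order found by the collection logic.
return_orders = [
    {
        "order_no": "R1",
        "order_time": "2021-11-19",
        "original_order_no": "A1",
        "original_order_time": "2021-11-10",
    }
]

intermediate_rows = build_intermediate_rows(return_orders)
print(intermediate_rows)
```

Only the characteristic fields are stored, so the intermediate table stays small even when the underlying business rows are wide.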
  • filtering the target field information that does not match the time interval information from the field information stored in the intermediate table includes: acquiring task parameter information, reading the field information matching the task parameter information from the intermediate table, and storing it in a temporary table; and filtering the target field information that does not match the time interval information from the field information stored in the temporary table.
  • the task parameters refer to the parameters corresponding to the collection task configured by the user before the collection task is started.
  • data collection can be performed through a Spark task of the big data platform.
  • the server obtains the task parameters configured by the user and loads the task parameters into the Spark task.
  • Task parameters can include information such as specifying the business database to be queried, specifying the field information of the source table to be collected, and specifying the partition table to be written.
  • the intermediate table may include data to be collected that is determined in advance according to the collection logic information of different collection tasks. By acquiring and loading the task parameters configured by the user before the collection task is started, the data to be collected for the current collection task can be obtained from the intermediate table and stored in a temporary table for subsequent processing.
  • distributed task execution can be performed, which solves the problem of single-point tasks and improves the efficiency of data collection.
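The use of task parameters to narrow the intermediate table down to a temporary table can be sketched as follows. The parameter names follow the three kinds of information listed above, but the `TaskParams` class, the `table_name` field, and the matching rule are assumptions made for illustration; a real Spark job would operate on DataFrames rather than Python lists.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskParams:
    """Hypothetical task parameters configured before the collection task starts."""
    business_db: str
    source_table: str
    target_partition_table: str

def load_temp_table(intermediate_rows, params):
    """Keep only the intermediate rows that belong to this task's source table."""
    return [row for row in intermediate_rows if row["table_name"] == params.source_table]

# Hypothetical intermediate table shared by several collection tasks.
intermediate_rows = [
    {"table_name": "return_order", "order_no": "R1"},
    {"table_name": "exchange_order", "order_no": "E5"},
    {"table_name": "return_order", "order_no": "A1"},
]

params = TaskParams("orders_standby", "return_order", "dw.return_order_partitioned")
temp_table = load_temp_table(intermediate_rows, params)
print([row["order_no"] for row in temp_table])
```

Because each task reads only its own slice, several tasks with different parameters could run side by side against the same intermediate table.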
  • collecting the second data from the business database according to the target field information includes: generating a structured query language by using the target field information as a value corresponding to a query condition; collecting the second data from the business database according to the structured query language .
  • the corresponding data can be quickly located in the business database according to the target field information, which improves the efficiency of data collection.
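One way to generate such a statement is to emit one placeholder per target value and pass the values separately, as in the minimal sketch below. The helper `build_in_query`, the `orders` table, and its rows are hypothetical; the table and column names are interpolated directly, so they are assumed to come from trusted task configuration rather than user input.

```python
import sqlite3

def build_in_query(table, column, values):
    """Build a parameterised IN query and its parameter list."""
    values = list(values)
    if not values:
        raise ValueError("at least one target value is required")
    placeholders = ", ".join("?" for _ in values)
    return f"SELECT * FROM {table} WHERE {column} IN ({placeholders})", values

# In-memory stand-in for the business database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_no TEXT PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("A1", 10.0), ("A2", 20.0), ("A3", 30.0)],
)

sql, params = build_in_query("orders", "order_no", ["A1", "A3"])
second_data = conn.execute(sql + " ORDER BY order_no", params).fetchall()
print(sql)
print(second_data)
```

Keeping the values out of the SQL text avoids quoting problems when a field value contains special characters.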
  • the above method further includes: performing data deduplication processing on the data after the data integration processing.
  • by performing deduplication processing on the data, redundant data can be removed and the accuracy of data collection can be improved.
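A minimal sketch of such deduplication is shown below, assuming that a record is identified by its order number and order line number; the key choice and the function name `deduplicate` are assumptions, not details given in the application.

```python
def deduplicate(rows, key_fields=("order_no", "line_no")):
    """Keep the first occurrence of each key, preserving the input order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# The same order line can arrive through both the incremental and the IN query.
merged = [
    {"order_no": "A1", "line_no": 1, "qty": 2},
    {"order_no": "R1", "line_no": 1, "qty": 1},
    {"order_no": "A1", "line_no": 1, "qty": 2},
]
deduplicated = deduplicate(merged)
print(len(deduplicated))
```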
  • the above method further includes: comparing the data after the data integration processing with the data to be collected, and removing the data that is different from the data to be collected.
  • the data in the non-collection range can be excluded, and the accuracy of data collection can be further improved.
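The comparison against the data to be collected can be sketched as a membership check on a characteristic field. The sketch assumes the order number alone identifies a record in scope; the names `keep_in_scope` and `scope_keys` are hypothetical.

```python
def keep_in_scope(rows, scope_keys):
    """Drop integrated rows whose order number is not in the data to be collected."""
    return [row for row in rows if row["order_no"] in scope_keys]

# Order numbers recorded in the intermediate table (hypothetical values).
scope_keys = {"A1", "R1"}

integrated = [
    {"order_no": "A1", "kind": "sale"},
    {"order_no": "R1", "kind": "return"},
    {"order_no": "B9", "kind": "sale"},  # updated in the interval but out of scope
]
in_scope = keep_in_scope(integrated, scope_keys)
print([row["order_no"] for row in in_scope])
```

This matters because the incremental query of the first step picks up every row updated in the interval, including rows the collection logic never asked for.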
  • FIG. 3 shows a technical framework diagram of the execution of distributed data collection tasks in an application example, and
  • FIG. 4 shows a schematic flowchart of the data collection method in the application example, which specifically includes the following steps:
  • Step 1 Data collection in the intermediate table.
  • the collection logic in the intermediate table can be the integration of multiple scenario logics, such as the return table, the exchange table, etc., and the corresponding collection logic can be defined according to business requirements.
  • Data in related tables, such as the expansion table and the payment table, also needs to be re-collected to the latest partition.
  • Because the intermediate table covers the data to be collected for the collection task, the data will be collected into the latest partition after the service is started.
  • Step 2 Read the intermediate table data. The collection Spark task reads the intermediate table data into memory, which is convenient for subsequent data processing.
  • Step 3 Incremental collection of the return table. This step is the first step in the collection of business table data. The collection is performed according to the update time of the business table data. The incremental data collected is the data that was added or changed yesterday, and it is stored in memory to facilitate the subsequent summary.
  • Step 4 Collect data that was not added yesterday. This step is the second step of business table data collection. Part of the stock data in the business table is queried with an IN query and stored in memory for the subsequent summary.
  • Step 5 Combine and filter the data. Read the incremental data and the partial stock data collected in the previous two steps, summarize and deduplicate them, and then filter the result against the intermediate table to exclude data that is not covered by the intermediate table.
  • Step 6 Write to the target table. The data produced in the previous step is automatically matched to the table format according to the target Hive table and target table format configured by the user, and is then written to the target partition table.
  • Although the steps in the flowcharts of FIGS. 2 and 4 are shown in sequence according to the arrows, these steps are not necessarily executed in the sequence shown by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and these steps may be performed in other orders. Moreover, at least a part of the steps in FIGS. 2 and 4 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed and completed at the same time, but may be executed at different times. The order of execution of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
  • a data collection device including: a first data collection module 510, a field information acquisition module 520, a second data collection module 530, and a data integration processing module 540, wherein:
  • the first data collection module 510 is configured to obtain the specified time interval information, and collect the first data updated in the corresponding time interval from the business database according to the time interval information;
  • the field information acquisition module 520 is used to filter the target field information that does not match the time interval information from the field information stored in the intermediate table, where the intermediate table includes the field information of the characteristic fields of the data to be collected, and the data to be collected is the data determined from the business database according to preset collection logic;
  • the second data collection module 530 is configured to collect the second data from the business database according to the target field information
  • the data integration processing module 540 is configured to perform data integration processing on the first data and the second data to obtain target data.
  • the data integration processing module 540 is further configured to store the target data in the partition table of the corresponding partition in the data warehouse.
  • the first data collection module 510 is further configured to, before obtaining the specified time interval information, obtain preset collection logic information; determine the data in the business database that conforms to the collection logic information as the data to be collected; and extract field information from the characteristic fields of the data to be collected and store it in the intermediate table.
  • the field information acquisition module 520 acquires task parameter information, reads the field information that matches the task parameter information from the intermediate table, and stores it in the temporary table; it then filters the target field information that does not match the time interval information from the field information stored in the temporary table.
  • the second data collection module 530 uses the target field information as a value corresponding to the query condition to generate a structured query language; and collects the second data from the business database according to the structured query language.
  • the data integration processing module 540 is further configured to perform data deduplication processing on the data after the data integration processing after performing the data integration processing on the first data and the second data.
  • the data integration processing module 540 is further configured to, after performing data integration processing on the first data and the second data, compare the integrated data with the data to be collected and remove the data that is different from the data to be collected.
  • Each module in the above-mentioned data acquisition device can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • in one embodiment, a computer device is provided; the computer device may be a server, and its internal structure diagram may be as shown in FIG. 6.
  • the computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the nonvolatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer device is used to store business data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program when executed by the processor, implements a data acquisition method.
  • FIG. 6 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer device which includes a memory, a processor, and a computer program stored in the memory and running on the processor.
  • when the processor executes the computer program, the processor implements the following steps: acquiring specified time interval information, and collecting the first data updated in the corresponding time interval from the business database according to the time interval information; filtering the target field information that does not match the time interval information from the field information stored in the intermediate table, where the intermediate table includes the field information of the characteristic fields of the data to be collected, and the data to be collected is the data determined from the business database according to the preset collection logic; collecting the second data from the business database according to the target field information; and performing data integration processing on the first data and the second data to obtain target data.
  • the processor further implements the following steps when executing the computer program: storing the target data in the partition table of the corresponding partition in the data warehouse.
  • before the processor executes the computer program to acquire the specified time interval information, it further implements the following steps: acquiring preset collection logic information; determining data in the business database that conforms to the collection logic information as the data to be collected; and extracting field information from the characteristic fields of the data to be collected and storing it in the intermediate table.
  • when the processor executes the computer program to filter the target field information that does not match the time interval information from the field information stored in the intermediate table, it specifically implements the following steps: acquiring task parameter information, reading the field information that matches the task parameter information from the intermediate table, and storing it in the temporary table; and filtering the target field information that does not match the time interval information from the field information stored in the temporary table.
  • when the processor executes the computer program to collect the second data from the business database according to the target field information, the following steps are specifically implemented: generating a structured query language statement by using the target field information as the value corresponding to the query condition; and collecting the second data from the business database according to the structured query language statement.
  • after the processor executes the computer program to perform data integration processing on the first data and the second data, the processor further implements the following step: performing data deduplication processing on the data after the data integration processing.
  • after the processor executes the computer program to perform data integration processing on the first data and the second data, it also implements the following steps: comparing the data after the data integration processing with the data to be collected, and removing the data that is different from the data to be collected.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented: acquiring specified time interval information, and collecting the first data updated in the corresponding time interval from the business database according to the time interval information; filtering the target field information that does not match the time interval information from the field information stored in the intermediate table, where the intermediate table includes the field information of the characteristic fields of the data to be collected, and the data to be collected is the data determined from the business database according to the preset collection logic; collecting the second data from the business database according to the target field information; and performing data integration processing on the first data and the second data to obtain the target data.
  • the computer program further implements the following steps when executed by the processor: storing the target data in the partition table of the corresponding partition in the data warehouse.
  • the following steps are also implemented: acquiring preset collection logic information; determining data in the business database that conforms to the collection logic information as the data to be collected; and extracting field information from the characteristic fields of the data to be collected and storing it in the intermediate table.
  • the following steps are specifically implemented: acquiring task parameter information, reading from the intermediate table the field information that matches the task parameter information, and storing it in a temporary table; and filtering, from the field information stored in the temporary table, the target field information that does not match the time interval information.
  • the following steps are specifically implemented: generating a structured query language statement by using the target field information as the value corresponding to a query condition; and collecting the second data from the business database according to the structured query language statement.
  • the following step is further implemented: performing data deduplication processing on the data after the data integration processing.
  • the following steps are also implemented: comparing the data after the data integration processing with the data to be collected, and removing data that differs from the data to be collected.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data acquisition method and apparatus, a computer device, and a storage medium. The method comprises: acquiring specified time interval information, and acquiring, from a service database according to the time interval information, first data updated in a corresponding time interval (S202); screening, from field information stored in an intermediate table, target field information that does not match the time interval information, the intermediate table comprising field information of a feature field of data to be acquired, and said data being data determined from the service database according to preset acquisition logic (S204); acquiring second data from the service database according to the target field information (S206); and integrating the first data and the second data to obtain target data (S208). Using the method can improve the efficiency of data acquisition.

Description

Data acquisition method, apparatus, computer device, and storage medium
Technical Field
The present application relates to the technical field of data processing, and in particular to a data collection method, apparatus, computer device, and storage medium.
Background
With the development of data processing technology, data collection technology has emerged. Data collection generally uses a collection tool to collect source data from a source database into the data warehouse of a big data platform.
In traditional data collection methods, the data update time is usually used as the filter condition for collecting incremental data. However, incremental data is sometimes related to part of the existing stock data. For example, when collecting return-table data, the data of newly updated return orders and the data of the original orders corresponding to those return orders must be collected into the data warehouse together before downstream statistical analysis is possible. After the incremental data has been collected according to the data update time, obtaining the stock data that corresponds to it requires the big data platform to run a full computation over the historical stock data, which both consumes computing resources and reduces the efficiency of data collection.
Technical Solution
Based on this, it is necessary to address the above technical problem by providing a data collection method, apparatus, computer device, and storage medium that can improve the efficiency of data collection.
A data processing method, comprising:
acquiring specified time interval information, and collecting, from a business database according to the time interval information, first data updated in the corresponding time interval;
filtering, from field information stored in an intermediate table, target field information that does not match the time interval information, where the intermediate table includes field information of characteristic fields of data to be collected, and the data to be collected is data determined from the business database according to preset collection logic;
collecting second data from the business database according to the target field information; and
performing data integration processing on the first data and the second data to obtain target data.
In one embodiment, the above method further includes:
storing the target data in the partition table of the corresponding partition in a data warehouse.
In one embodiment, the intermediate table is a data table set in the data warehouse.
In one embodiment, before acquiring the specified time interval information, the above method further includes: acquiring preset collection logic information; determining data in the business database that conforms to the collection logic information as the data to be collected; and extracting field information from the characteristic fields of the data to be collected and storing it in the intermediate table.
In one embodiment, filtering the target field information that does not match the time interval information from the field information stored in the intermediate table includes: acquiring task parameter information, reading from the intermediate table the field information that matches the task parameter information, and storing it in a temporary table; and filtering, from the field information stored in the temporary table, the target field information that does not match the time interval information.
In one embodiment, collecting the second data from the business database according to the target field information includes: generating a structured query language statement by using the target field information as the value corresponding to a query condition; and collecting the second data from the business database according to the structured query language statement.
In one embodiment, after performing data integration processing on the first data and the second data, the above method further includes: performing data deduplication processing on the data after the data integration processing.
In one embodiment, after performing data integration processing on the first data and the second data, the above method further includes: comparing the data after the data integration processing with the data to be collected, and removing data that differs from the data to be collected.
A data collection apparatus, comprising:
a first data collection module, configured to acquire specified time interval information and to collect, from a business database according to the time interval information, first data updated in the corresponding time interval;
a field information acquisition module, configured to filter, from field information stored in an intermediate table, target field information that does not match the time interval information, where the intermediate table includes field information of characteristic fields of data to be collected, and the data to be collected is data determined from the business database according to preset collection logic;
a second data collection module, configured to collect second data from the business database according to the target field information; and
a data integration processing module, configured to perform data integration processing on the first data and the second data to obtain target data.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above data collection method when executing the computer program.
A computer-readable storage medium on which a computer program is stored, where the computer program implements the steps of the above data collection method when executed by a processor.
Beneficial Effects
The above data collection method, apparatus, computer device, and storage medium collect the corresponding first data for a specified time interval, use an intermediate table containing the field information of the predetermined data to be collected to filter out the target field information that falls outside the specified time interval, collect the corresponding second data according to the target field information, and finally integrate the first data and the second data to obtain the target data of the current collection task. With this solution, the data updated within the specified time interval and the historical data related to it can be collected quickly, and a full computation over the historical data is no longer required, which improves the efficiency of data collection.
Description of Drawings
FIG. 1 is a diagram of the application environment of the data collection method in one embodiment;
FIG. 2 is a schematic flowchart of the data collection method in one embodiment;
FIG. 3 is a technical architecture diagram of distributed data collection task execution in an application example;
FIG. 4 is a schematic flowchart of the data collection method in an application example;
FIG. 5 is a structural block diagram of the data collection apparatus in one embodiment;
FIG. 6 is a diagram of the internal structure of the computer device in one embodiment.
Embodiments of the Present Invention
To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present application and not to limit it.
The data collection method provided by the present application can be applied in the application environment shown in FIG. 1. The server 102 acquires specified time interval information and collects, from the business database 104 according to the time interval information, first data updated in the corresponding time interval; filters, from the field information stored in the intermediate table 106, target field information that does not match the time interval information, where the intermediate table 106 includes field information of characteristic fields of data to be collected, and the data to be collected is data determined from the business database according to preset collection logic; collects second data from the business database 104 according to the target field information; and performs data integration processing on the first data and the second data to obtain target data. The server 102 may be implemented as a standalone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a data collection method is provided. Taking its application to the server in FIG. 1 as an example, the method includes the following steps:
Step S202: Acquire specified time interval information, and collect, from the business database according to the time interval information, first data updated in the corresponding time interval.
The business database is a database that stores business data; it may be a relational database, a non-relational database, or the like. The business database may contain at least one business data table. The first data is data that was updated in a data table of the business database within the specified time interval.
Specifically, the user can collect data by using the time interval of the data update as the data filter condition. The time interval information may be information indicating any valid time period or point in time. The server acquires the time interval information specified by the user, uses it as the data filter condition, and collects as the first data the data in the business database that was updated within the time interval, or at the point in time, corresponding to the time interval information.
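The incremental collection in step S202 can be sketched as a time-bounded query. This is a minimal illustration, not the implementation described in the application: the in-memory SQLite database stands in for the business database, and the table `return_orders`, its columns, and the function `collect_first_data` are hypothetical names chosen for the example.

```python
import sqlite3

def collect_first_data(conn, start, end):
    """Return rows updated in the half-open interval [start, end)."""
    sql = (
        "SELECT order_no, update_time FROM return_orders "
        "WHERE update_time >= ? AND update_time < ? "
        "ORDER BY order_no"
    )
    return conn.execute(sql, (start, end)).fetchall()

# Hypothetical business table with an update-time column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE return_orders (order_no TEXT, update_time TEXT)")
conn.executemany(
    "INSERT INTO return_orders VALUES (?, ?)",
    [
        ("R1", "2021-03-30 09:00:00"),
        ("R2", "2021-03-31 10:00:00"),
        ("R3", "2021-03-31 23:59:59"),
    ],
)

# Time interval information: the whole of 2021-03-31.
first_data = collect_first_data(conn, "2021-03-31 00:00:00", "2021-04-01 00:00:00")
print(first_data)
```

ISO-formatted timestamps compare correctly as strings, which is why the sketch can store them as text; a production database would normally use a native timestamp type.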
Step S204: Filter, from the field information stored in the intermediate table, target field information that does not match the time interval information, where the intermediate table includes field information of characteristic fields of data to be collected, and the data to be collected is data determined from the business database according to preset collection logic.
The intermediate table is a data table in a database that is used to store intermediate computation results. The data to be collected is the data awaiting collection, determined from the data of the business database according to user-defined or preset collection logic; the purpose of determining it is to delimit the scope of data collection. The characteristic fields can be set adaptively for different data types. For example, for return order data, the characteristic fields may include at least one of the order number, order line number, database number, table number, and order time. The field information may be the field values of those fields.
Specifically, the server matches the time interval information against the field information stored in the intermediate table, selects the field information of the data to be collected that falls outside the time interval corresponding to the time interval information, and uses the selected field information as the target field information. For example, if the time interval information is yesterday, the field information of the data to be collected that was not updated yesterday is selected from the intermediate table as the target field information.
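The filtering in step S204 amounts to keeping the intermediate-table entries whose time falls outside the specified interval. The sketch below assumes the intermediate table is held as a list of dictionaries with hypothetical fields `order_no` and `order_date`; the function name `select_target_fields` is likewise illustrative.

```python
from datetime import date

# Hypothetical intermediate table: characteristic fields of the data to be collected.
intermediate_table = [
    {"order_no": "R2", "order_date": date(2021, 3, 31)},
    {"order_no": "O7", "order_date": date(2021, 3, 12)},
    {"order_no": "O9", "order_date": date(2021, 2, 2)},
]

def select_target_fields(rows, start, end):
    """Return order numbers whose date lies outside [start, end)."""
    return [r["order_no"] for r in rows if not (start <= r["order_date"] < end)]

# Time interval information: 2021-03-31 ("yesterday" in the example above).
targets = select_target_fields(intermediate_table, date(2021, 3, 31), date(2021, 4, 1))
print(targets)
```

Entries inside the interval are skipped because the incremental query of step S202 already covers them.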
Step S206: Collect second data from the business database according to the target field information.
The second data is the data queried from the business database according to the target field information. Specifically, after acquiring the target field information, the server can use it as the data filter condition and query the business database for the business data that contains the target field information. The query can be written in a query language that matches the business database, for example an IN query in SQL (Structured Query Language), and the business data returned by the query is collected as the second data.
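One way to generate the IN query of step S206 is to emit one placeholder per target value and bind the values as parameters. The sketch below is an assumption-laden illustration: SQLite stands in for the business database, and the table `orders`, its columns, and the helpers `build_in_query` and `collect_second_data` are hypothetical.

```python
import sqlite3

def build_in_query(n_values):
    """Build a parameterized IN query with one placeholder per value."""
    placeholders = ", ".join("?" for _ in range(n_values))
    return (
        "SELECT order_no, amount FROM orders "
        f"WHERE order_no IN ({placeholders}) ORDER BY order_no"
    )

def collect_second_data(conn, target_values):
    """Fetch the stock rows whose key appears in the target field information."""
    if not target_values:  # an empty IN list is not portable SQL, so skip the query
        return []
    return conn.execute(build_in_query(len(target_values)), target_values).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_no TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("O7", 120.0), ("O8", 80.0), ("O9", 35.5)],
)

second_data = collect_second_data(conn, ["O7", "O9"])
print(second_data)
```

Binding the values as parameters rather than concatenating them into the statement avoids quoting problems. Databases limit the number of bound parameters, so a large target list would need to be split into batches.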
Step S208: Perform data integration processing on the first data and the second data to obtain target data.
Specifically, data integration processing is performed on the collected first data and second data, and all the data in the resulting data set is used as the target data of the current data collection task.
The above data collection method collects the corresponding first data for a specified time interval, uses an intermediate table containing the field information of the predetermined data to be collected to filter out the target field information that falls outside the specified time interval, collects the corresponding second data according to the target field information, and finally integrates the first data and the second data to obtain the target data of the current collection task. With this solution, the data updated within the specified time interval and the historical data related to it can be collected quickly, and a full computation over the historical data is no longer required, which improves the efficiency of data collection.
In one embodiment, the above method further includes: storing the target data in the partition table of the corresponding partition in the data warehouse. In this embodiment, storing the collected target data in the corresponding partition table allows the data to be partitioned and split into tables quickly, reduces the computing resources that the data warehouse of the big data platform spends on data partitioning, and improves the efficiency of data processing. The partition table may be a data table in a Hive database. Data can be written to a partition table in either a custom format or the default format; a Hive table in a custom format can prevent the data corruption that occurs when the content of some fields contains line breaks.
In one embodiment, the intermediate table is a data table set in the data warehouse. In this embodiment, the intermediate table is set in the data warehouse of the big data platform. It may be one or more data tables in the data warehouse, and its format is not limited; for example, it may be a Hive table. In the traditional collection method, an intermediate table is created in each sub-database of the business database, and the data of a given table is then collected by inner-joining that table with the intermediate table. For example, when collecting return-table data, the return table is inner-joined with the intermediate table, and the newly added return order data and the original order data corresponding to the newly added return orders are collected into the data warehouse together for downstream statistical analysis of sales data.
In this embodiment, setting the intermediate table that stores the intermediate data of data collection directly in the data warehouse makes it possible to remove the intermediate tables from the sub-databases of the business database in the business system. Writing data to an intermediate table in the business database (the data source) requires the data source to be configured with read and write permissions. As a result, traditional data collection can only use the primary business database, which lowers system performance during collection and affects normal business operation, while the write operations also reduce the security of the database. With the method of this embodiment, the intermediate tables created in the business database are removed and no inner join with an intermediate table is needed for querying. Data collection can therefore use the standby business database without affecting the primary business database, which decouples collection from the business system and keeps the system secure.
In one embodiment, before acquiring the specified time interval information, the above method further includes:
acquiring preset collection logic information; determining data in the business database that conforms to the collection logic information as the data to be collected; and extracting field information from the characteristic fields of the data to be collected and storing it in the intermediate table.
In this embodiment, before each collection task starts, collection logic information that conforms to the business rules of that task can be preset. The preset collection logic information determines the scope of data collection: the data that conforms to the collection logic is determined as the data to be collected, and the field information in the characteristic fields of the data to be collected is extracted and stored in the intermediate table. The characteristic fields can be specified in advance for each collection task; for example, fields such as the order number, order line number, database number, table number, or order time can be specified as characteristic fields.
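As an illustration of how collection logic might populate the intermediate table, the sketch below uses the return-order scenario from the background section: the collection logic selects return orders together with the original orders they refer to, and only the characteristic fields are kept. The record layout, the `collection_logic` rule, and the chosen characteristic fields are assumptions made for the example, not details given in the application.

```python
# Hypothetical business records; "original_order_no" links a return to its original order.
business_rows = [
    {"order_no": "R2", "type": "return", "original_order_no": "O7", "order_time": "2021-03-31"},
    {"order_no": "O7", "type": "sale", "original_order_no": None, "order_time": "2021-03-12"},
    {"order_no": "O8", "type": "sale", "original_order_no": None, "order_time": "2021-03-15"},
]

# Characteristic fields chosen for this collection task (an assumption).
FEATURE_FIELDS = ("order_no", "order_time")

def collection_logic(rows):
    """Select return orders and the original orders they correspond to."""
    originals = {r["original_order_no"] for r in rows if r["type"] == "return"}
    return [r for r in rows if r["type"] == "return" or r["order_no"] in originals]

def build_intermediate_table(rows):
    """Keep only the characteristic fields of the data to be collected."""
    return [{f: r[f] for f in FEATURE_FIELDS} for r in collection_logic(rows)]

intermediate = build_intermediate_table(business_rows)
print(intermediate)
```

Order `O8` is unrelated to any return and therefore stays outside the collection scope.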
In one embodiment, filtering the target field information that does not match the time interval information from the field information stored in the intermediate table includes: acquiring task parameter information, reading from the intermediate table the field information that matches the task parameter information, and storing it in a temporary table; and filtering, from the field information stored in the temporary table, the target field information that does not match the time interval information.
In this embodiment, the task parameters are the parameters that the user configures for a collection task before the task starts. For example, data collection can be performed by a Spark task on the big data platform; when the Spark task starts, the server acquires the task parameters configured by the user and loads them into the Spark task. The task parameters may include information such as the business database to be queried, the source table field information to be collected, and the partition table to be written.
In this embodiment, the intermediate table may include data to be collected that was predetermined according to the collection logic information of different collection tasks. By acquiring and loading the task parameters configured by the user before a collection task starts, the data to be collected for the current collection task can be read from the intermediate table and stored in a temporary table for subsequent processing. Setting task parameters allows tasks to be executed in a distributed manner, which avoids the single-point task problem and improves the efficiency of data collection.
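The two-stage filtering described in this embodiment can be sketched as follows: the task parameters first narrow a shared intermediate table down to a per-task temporary table, and the time filter is then applied to that temporary table. The `TaskParams` fields mirror the parameter kinds listed above, but their names, and the `source_table` column used to match entries to a task, are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskParams:
    source_database: str          # business database to query
    source_table: str             # source table whose fields are collected
    target_partition_table: str   # partition table to write to

# Hypothetical shared intermediate table covering several collection tasks.
intermediate_table = [
    {"source_table": "return_orders", "order_no": "O7", "order_date": "2021-03-12"},
    {"source_table": "exchange_orders", "order_no": "E3", "order_date": "2021-03-20"},
    {"source_table": "return_orders", "order_no": "R2", "order_date": "2021-03-31"},
]

def load_temp_table(rows, params):
    """Read the entries that match the task parameters into a temporary table."""
    return [r for r in rows if r["source_table"] == params.source_table]

def select_targets(temp_rows, day):
    """From the temporary table, keep keys whose date does not match the given day."""
    return [r["order_no"] for r in temp_rows if r["order_date"] != day]

params = TaskParams("sales_db", "return_orders", "dw.return_orders_daily")
temp_table = load_temp_table(intermediate_table, params)
targets = select_targets(temp_table, "2021-03-31")
print(targets)
```

Because each task only touches its own slice of the intermediate table, several tasks with different parameters can run independently of one another.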
In one embodiment, collecting the second data from the business database according to the target field information includes: generating a structured query language statement by using the target field information as the value corresponding to a query condition; and collecting the second data from the business database according to the structured query language statement.
In this embodiment, the target field information is used as the value corresponding to the query condition to generate a structured query language statement, for example an IN query statement of a relational database. The corresponding data can then be located quickly in the business database according to the target field information, which improves the efficiency of data collection.
In one embodiment, after performing data integration processing on the first data and the second data, the above method further includes: performing data deduplication processing on the data after the data integration processing. In this embodiment, deduplicating the data removes redundant duplicate records and improves the accuracy of data collection.
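Integration followed by deduplication can be sketched as a keyed merge. The choice of `order_no` as the deduplication key and the rule of keeping the first-seen record (so the incremental copy wins over the stock copy) are assumptions for illustration; the application does not specify how duplicates are identified.

```python
def integrate_and_dedupe(first_data, second_data, key="order_no"):
    """Merge both data sets and keep one record per key (first occurrence wins)."""
    merged = {}
    for row in [*first_data, *second_data]:
        merged.setdefault(row[key], row)
    return list(merged.values())

first_data = [{"order_no": "R2", "amount": 50.0}, {"order_no": "O7", "amount": 120.0}]
second_data = [{"order_no": "O7", "amount": 120.0}, {"order_no": "O9", "amount": 35.5}]

target_data = integrate_and_dedupe(first_data, second_data)
print([row["order_no"] for row in target_data])
```

A duplicate arises here because order `O7` was both updated in the interval and referenced by the target field information.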
In one embodiment, after performing data integration processing on the first data and the second data, the above method further includes: comparing the data after the data integration processing with the data to be collected, and removing data that differs from the data to be collected. In this embodiment, comparing the integrated target data with the predetermined data to be collected excludes data outside the collection scope and further improves the accuracy of data collection.
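The comparison against the data to be collected can be sketched as a membership check on the keys recorded in the intermediate table. Comparing by a single key field, here the hypothetical `order_no`, is an assumption; the application only states that data differing from the data to be collected is removed.

```python
def keep_in_scope(integrated_rows, intermediate_keys, key="order_no"):
    """Drop integrated rows whose key is not among the data to be collected."""
    allowed = set(intermediate_keys)
    return [row for row in integrated_rows if row[key] in allowed]

integrated = [
    {"order_no": "R2", "amount": 50.0},
    {"order_no": "X1", "amount": 10.0},  # updated in the interval but outside the scope
    {"order_no": "O7", "amount": 120.0},
]

in_scope = keep_in_scope(integrated, ["R2", "O7"])
print([row["order_no"] for row in in_scope])
```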
下面,结合一个应用实例对本申请的数据采集方法进行进一步说明,如图3至图4所示,图3示出了一个应用实例中分布式数据采集任务执行的技术构架图,4示出了一个应用实例中数据采集方法的流程示意图,具体包括以下步骤:Below, the data collection method of the present application will be further described in conjunction with an application example, as shown in Figures 3 to 4, Figure 3 shows a technical framework diagram of the execution of distributed data collection tasks in an application example, and Figure 4 shows a A schematic flowchart of the data collection method in the application example, which specifically includes the following steps:
Step 1: Collect the intermediate table data. The collection logic in the intermediate table may combine the logic of multiple scenarios; for example, corresponding collection logic can be defined for the return table, the exchange table, and so on according to business requirements. In addition, if the data in the order line table is updated, the data in the related extension table, payment table, and so on must all be re-collected into the latest partition. As long as the intermediate table covers the data to be collected for the collection task, the data will be collected into the latest partition once the service is subsequently started.
Step 2: Read the intermediate table data. The Spark collection task reads the intermediate table data into memory to facilitate subsequent data processing.
Step 3: Incrementally collect the return table. This is the first step of business table data collection. Data is collected according to the update time of the business table data; the incremental data collected is the data that was added or changed yesterday, and it is stored in memory to facilitate subsequent statistical aggregation.
Step 4: Collect data that was not added yesterday. This is the second step of business table data collection. Part of the stock (existing) data in the business table is queried by means of an IN condition and stored in memory to facilitate subsequent statistical aggregation.
Step 5: Merge and filter the data. The incremental data and the partial stock data collected in the previous two steps are read, combined, and deduplicated, and the result is then filtered against the intermediate table data to exclude any data that is not covered by the intermediate table.
Step 6: Write to the target table. Finally, the data from the previous step is automatically matched to the table format according to the target Hive table and target table format configured by the user, and the data is written to the target partition table.
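Steps 3 to 5 above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation described in the application: it uses an in-memory SQLite database in place of the business database and Spark, and the table name `return_order`, its columns, and the function `collect` are all hypothetical. `wanted_keys` stands for the keys covered by the intermediate table, and `target_keys` for the stock keys to be fetched with the IN query.

```python
import sqlite3

COLUMNS = "order_id, status, update_time"  # hypothetical business-table columns


def collect(conn, wanted_keys, target_keys, start, end):
    """Return merged, deduplicated rows restricted to wanted_keys."""
    # Step 3: incremental rows updated in the half-open interval [start, end).
    incremental = conn.execute(
        f"SELECT {COLUMNS} FROM return_order "
        "WHERE update_time >= ? AND update_time < ?",
        (start, end),
    ).fetchall()

    # Step 4: stock rows fetched by key with an IN query.
    stock = []
    keys = sorted(set(target_keys))
    if keys:
        placeholders = ",".join("?" for _ in keys)
        stock = conn.execute(
            f"SELECT {COLUMNS} FROM return_order "
            f"WHERE order_id IN ({placeholders})",
            keys,
        ).fetchall()

    # Step 5: merge, deduplicate on order_id, and drop rows that the
    # intermediate table does not list.
    wanted = set(wanted_keys)
    merged = {}
    for row in incremental + stock:
        if row[0] in wanted:
            merged[row[0]] = row
    return [merged[key] for key in sorted(merged)]
```

Keying the merged rows by `order_id` means a row that appears in both the incremental result and the IN result is kept only once, which is the deduplication that Step 5 calls for.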
It should be understood that although the steps in the flowcharts of FIG. 2 and FIG. 4 are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order the arrows indicate. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 2 and FIG. 4 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different moments, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 5, a data collection apparatus is provided, including a first data collection module 510, a field information acquisition module 520, a second data collection module 530, and a data integration processing module 540, wherein:
The first data collection module 510 is configured to acquire specified time interval information and, according to the time interval information, collect from the business database first data updated within the corresponding time interval;
The field information acquisition module 520 is configured to filter, from the field information stored in the intermediate table, target field information that does not match the time interval information, where the intermediate table includes field information of characteristic fields of the data to be collected, and the data to be collected is data determined from the business database according to preset collection logic;
The second data collection module 530 is configured to collect second data from the business database according to the target field information;
The data integration processing module 540 is configured to perform data integration processing on the first data and the second data to obtain target data.
In one embodiment, the data integration processing module 540 is further configured to store the target data in the partition table of the corresponding partition in the data warehouse.
In one embodiment, the first data collection module 510 is further configured to, before acquiring the specified time interval information: acquire preset collection logic information; determine the data in the business database that conforms to the collection logic information as the data to be collected; and extract field information from the characteristic fields of the data to be collected and store it in the intermediate table.
In one embodiment, the field information acquisition module 520 acquires task parameter information, reads from the intermediate table the field information that matches the task parameter information, and stores it in a temporary table; it then filters, from the field information stored in the temporary table, the target field information that does not match the time interval information.
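The two-stage selection performed by the field information acquisition module can be sketched as follows. This is a minimal sketch under stated assumptions: the application does not specify the layout of the intermediate table, so the `FieldInfo` record, its `task_id`, `key`, and `update_time` attributes, and the function `select_target_keys` are hypothetical, and "does not match the time interval information" is read here as "the update time lies outside the interval".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldInfo:
    task_id: str       # hypothetical task parameter stored with each entry
    key: int           # characteristic-field value, e.g. an order ID
    update_time: str   # ISO-8601 timestamp of the last known update


def select_target_keys(intermediate, task_id, start, end):
    """Return keys for the task whose update time lies outside [start, end)."""
    # "Temporary table": only the entries that match the task parameter.
    temp = [info for info in intermediate if info.task_id == task_id]
    # Target field information: entries the incremental window does not cover.
    return sorted(
        {info.key for info in temp if not (start <= info.update_time < end)}
    )
```

ISO-8601 timestamps of equal precision compare correctly as strings, which is why the sketch can use plain string comparison for the interval test.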
In one embodiment, the second data collection module 530 generates a Structured Query Language (SQL) statement using the target field information as the values of the query condition, and collects the second data from the business database according to the SQL statement.
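One way to turn the target field information into query-condition values is to generate parameterized IN statements. The sketch below is an assumption-laden illustration rather than the application's own code: the function name `build_in_queries`, the `chunk_size` parameter, and the `?` placeholder style are hypothetical choices. Splitting the values into chunks is added here because many databases limit the length of an IN list or the number of bound parameters; the application itself does not mention chunking.

```python
def build_in_queries(table, column, values, chunk_size=500):
    """Yield (sql, params) pairs whose IN lists hold at most chunk_size values.

    `table` and `column` are interpolated into the statement, so they must be
    trusted identifiers; only the values are passed as bound parameters.
    """
    unique = sorted(set(values))  # deduplicate the query-condition values
    for i in range(0, len(unique), chunk_size):
        chunk = unique[i:i + chunk_size]
        placeholders = ", ".join("?" for _ in chunk)
        yield f"SELECT * FROM {table} WHERE {column} IN ({placeholders})", chunk
```

Each yielded pair can be passed to a DB-API cursor, for example `cursor.execute(sql, params)`, and the results of all chunks concatenated to form the second data.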
In one embodiment, the data integration processing module 540 is further configured to, after performing the data integration processing on the first data and the second data, perform data deduplication on the data obtained from the data integration processing.
In one embodiment, the data integration processing module 540 is further configured to, after performing the data integration processing on the first data and the second data, compare the data obtained from the data integration processing with the data to be collected and remove any data that differs from the data to be collected.
For the specific limitations of the data collection apparatus, reference may be made to the limitations of the data collection method above, which are not repeated here. Each module in the above data collection apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of the processor of the computer device in the form of hardware, or stored in the memory of the computer device in the form of software, so that the processor can invoke them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store business data. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a data collection method.
Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the following steps: acquiring specified time interval information, and collecting from the business database, according to the time interval information, first data updated within the corresponding time interval; filtering, from the field information stored in the intermediate table, target field information that does not match the time interval information, where the intermediate table includes field information of characteristic fields of the data to be collected, and the data to be collected is data determined from the business database according to preset collection logic; collecting second data from the business database according to the target field information; and performing data integration processing on the first data and the second data to obtain target data.
In one embodiment, when executing the computer program, the processor further implements the following step: storing the target data in the partition table of the corresponding partition in the data warehouse.
In one embodiment, before acquiring the specified time interval information, the processor executing the computer program further implements the following steps: acquiring preset collection logic information; determining the data in the business database that conforms to the collection logic information as the data to be collected; and extracting field information from the characteristic fields of the data to be collected and storing it in the intermediate table.
In one embodiment, when the processor executes the computer program to filter, from the field information stored in the intermediate table, the target field information that does not match the time interval information, it specifically implements the following steps: acquiring task parameter information, reading from the intermediate table the field information that matches the task parameter information, and storing it in a temporary table; and filtering, from the field information stored in the temporary table, the target field information that does not match the time interval information.
In one embodiment, when the processor executes the computer program to collect the second data from the business database according to the target field information, it specifically implements the following steps: generating a Structured Query Language (SQL) statement using the target field information as the values of the query condition; and collecting the second data from the business database according to the SQL statement.
In one embodiment, after performing the data integration processing on the first data and the second data, the processor executing the computer program further implements the following step: performing data deduplication on the data obtained from the data integration processing.
In one embodiment, after performing the data integration processing on the first data and the second data, the processor executing the computer program further implements the following step: comparing the data obtained from the data integration processing with the data to be collected, and removing any data that differs from the data to be collected.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the following steps: acquiring specified time interval information, and collecting from the business database, according to the time interval information, first data updated within the corresponding time interval; filtering, from the field information stored in the intermediate table, target field information that does not match the time interval information, where the intermediate table includes field information of characteristic fields of the data to be collected, and the data to be collected is data determined from the business database according to preset collection logic; collecting second data from the business database according to the target field information; and performing data integration processing on the first data and the second data to obtain target data.
In one embodiment, when executed by the processor, the computer program further implements the following step: storing the target data in the partition table of the corresponding partition in the data warehouse.
In one embodiment, before the specified time interval information is acquired, the computer program executed by the processor further implements the following steps: acquiring preset collection logic information; determining the data in the business database that conforms to the collection logic information as the data to be collected; and extracting field information from the characteristic fields of the data to be collected and storing it in the intermediate table.
In one embodiment, when the computer program is executed by the processor to filter, from the field information stored in the intermediate table, the target field information that does not match the time interval information, it specifically implements the following steps: acquiring task parameter information, reading from the intermediate table the field information that matches the task parameter information, and storing it in a temporary table; and filtering, from the field information stored in the temporary table, the target field information that does not match the time interval information.
In one embodiment, when the computer program is executed by the processor to collect the second data from the business database according to the target field information, it specifically implements the following steps: generating a Structured Query Language (SQL) statement using the target field information as the values of the query condition; and collecting the second data from the business database according to the SQL statement.
In one embodiment, after the data integration processing is performed on the first data and the second data, the computer program executed by the processor further implements the following step: performing data deduplication on the data obtained from the data integration processing.
In one embodiment, after the data integration processing is performed on the first data and the second data, the computer program executed by the processor further implements the following step: comparing the data obtained from the data integration processing with the data to be collected, and removing any data that differs from the data to be collected.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (10)

  1. A data collection method, the method comprising:
    acquiring specified time interval information, and collecting from a business database, according to the time interval information, first data updated within the corresponding time interval;
    filtering, from field information stored in an intermediate table, target field information that does not match the time interval information, wherein the intermediate table includes field information of characteristic fields of data to be collected, and the data to be collected is data determined from the business database according to preset collection logic;
    collecting second data from the business database according to the target field information; and
    performing data integration processing on the first data and the second data to obtain target data.
  2. The method according to claim 1, wherein the method further comprises:
    storing the target data in a partition table of a corresponding partition in a data warehouse.
  3. The method according to claim 2, wherein the intermediate table is a data table provided in the data warehouse.
  4. The method according to claim 1, wherein, before the acquiring of the specified time interval information, the method further comprises:
    acquiring preset collection logic information;
    determining data in the business database that conforms to the collection logic information as the data to be collected; and
    extracting field information from the characteristic fields of the data to be collected and storing it in the intermediate table.
  5. The method according to claim 1, wherein the filtering, from the field information stored in the intermediate table, of target field information that does not match the time interval information comprises:
    acquiring task parameter information, reading from the intermediate table field information that matches the task parameter information, and storing it in a temporary table; and
    filtering, from the field information stored in the temporary table, the target field information that does not match the time interval information.
  6. The method according to claim 1, wherein the collecting of second data from the business database according to the target field information comprises:
    generating a Structured Query Language (SQL) statement using the target field information as values of a query condition; and
    collecting the second data from the business database according to the SQL statement.
  7. The method according to any one of claims 1 to 6, wherein, after the data integration processing is performed on the first data and the second data, the method further comprises:
    performing data deduplication on the data obtained from the data integration processing; and/or
    comparing the data obtained from the data integration processing with the data to be collected, and removing any data that differs from the data to be collected.
  8. A data collection apparatus, wherein the apparatus comprises:
    a first data collection module, configured to acquire specified time interval information and collect from a business database, according to the time interval information, first data updated within the corresponding time interval;
    a field information acquisition module, configured to filter, from field information stored in an intermediate table, target field information that does not match the time interval information, wherein the intermediate table includes field information of characteristic fields of data to be collected, and the data to be collected is data determined from the business database according to preset collection logic;
    a second data collection module, configured to collect second data from the business database according to the target field information; and
    a data integration processing module, configured to perform data integration processing on the first data and the second data to obtain target data.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2021/131752 2021-03-30 2021-11-19 Data acquisition method and apparatus, computer device, and storage medium WO2022205938A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110338940.1 2021-03-30
CN202110338940.1A CN112948504B (en) 2021-03-30 2021-03-30 Data acquisition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022205938A1 true WO2022205938A1 (en) 2022-10-06

Family

ID=76227393

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/131752 WO2022205938A1 (en) 2021-03-30 2021-11-19 Data acquisition method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112948504B (en)
WO (1) WO2022205938A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948504B (en) * 2021-03-30 2022-12-02 苏宁易购集团股份有限公司 Data acquisition method and device, computer equipment and storage medium
CN114791915B (en) * 2022-06-22 2022-09-27 深圳高灯计算机科技有限公司 Data aggregation method and device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017166644A1 (en) * 2016-03-31 2017-10-05 乐视控股(北京)有限公司 Data acquisition method and system
CN107329998A (en) * 2017-06-09 2017-11-07 广州虎牙信息科技有限公司 User's increment class data capture method, device and equipment
CN110262969A (en) * 2019-06-13 2019-09-20 泰康保险集团股份有限公司 Report test method, device, electronic equipment and computer readable storage medium
CN110704523A (en) * 2019-09-06 2020-01-17 中国平安财产保险股份有限公司 Data export method, device, equipment and computer readable storage medium
CN112100219A (en) * 2020-09-22 2020-12-18 平安养老保险股份有限公司 Report generation method, device, equipment and medium based on database query processing
CN112948504A (en) * 2021-03-30 2021-06-11 苏宁易购集团股份有限公司 Data acquisition method and device, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408535B (en) * 2018-09-28 2024-04-09 中国平安财产保险股份有限公司 Large data volume matching method, device, computer equipment and storage medium
CN110046168B (en) * 2019-03-28 2021-03-26 南京苏宁软件技术有限公司 Incremental data consistency implementation method and device
CN110674154B (en) * 2019-09-26 2023-04-07 浪潮软件股份有限公司 Spark-based method for inserting, updating and deleting data in Hive
CN112182104A (en) * 2020-09-25 2021-01-05 中国建设银行股份有限公司 Data synchronization method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112948504B (en) 2022-12-02
CN112948504A (en) 2021-06-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21934574

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21934574

Country of ref document: EP

Kind code of ref document: A1