CN117421337A - Data acquisition method, device, equipment and computer readable medium - Google Patents


Info

Publication number
CN117421337A
Authority
CN
China
Prior art keywords
acquisition
target
data
task
data source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311254931.XA
Other languages
Chinese (zh)
Other versions
CN117421337B (en
Inventor
杨月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongtu Science And Technology Yichang Co ltd
Original Assignee
Dongtu Science And Technology Yichang Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongtu Science And Technology Yichang Co ltd filed Critical Dongtu Science And Technology Yichang Co ltd
Priority to CN202311254931.XA priority Critical patent/CN117421337B/en
Publication of CN117421337A publication Critical patent/CN117421337A/en
Application granted granted Critical
Publication of CN117421337B publication Critical patent/CN117421337B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a data acquisition method, device, equipment and computer readable medium. The method comprises the following steps: receiving a target acquisition task, wherein the target acquisition task is used for indicating to acquire data of a target data source; acquiring an incremental acquisition rule configured for a target acquisition task in advance, wherein the incremental acquisition rule comprises a first acquisition rule for incremental acquisition based on a position offset and a second acquisition rule for incremental acquisition based on a metadata mark; the target acquisition task is executed according to a first acquisition rule to acquire the incremental data of the target data source based on the position offset of each data field in the target data source, or the target acquisition task is executed according to a second acquisition rule to acquire the incremental data of the target data source based on the metadata mark in the target data source. The method and the device solve the technical problem that repeated collection of a large amount of identical data causes serious waste of computing resources.

Description

Data acquisition method, device, equipment and computer readable medium
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a data acquisition method, apparatus, device, and computer readable medium.
Background
Today, as digital information grows exponentially, acquiring data with big data technology has become a common means of obtaining information. However, when multiple data acquisition tasks are performed, a large amount of identical data is acquired repeatedly, while the updated data actually makes up only a small portion of all the acquired data. This results in serious waste of computing resources.
Aiming at the problem of serious waste of computing resources caused by repeated collection of a large amount of identical data, no effective solution is proposed at present.
Disclosure of Invention
The application provides a data acquisition method, a device, equipment and a computer readable medium, so as to solve the technical problem that repeated acquisition of a large amount of identical data causes serious waste of computing resources.
According to an aspect of an embodiment of the present application, there is provided a data acquisition method, including: receiving a target acquisition task, wherein the target acquisition task is used for indicating to acquire data of a target data source; acquiring an incremental acquisition rule configured for the target acquisition task in advance, wherein the incremental acquisition rule comprises a first acquisition rule for incremental acquisition based on a position offset and a second acquisition rule for incremental acquisition based on a metadata mark; and executing the target acquisition task according to the first acquisition rule to acquire the incremental data of the target data source based on the position offset of each data field in the target data source, or executing the target acquisition task according to the second acquisition rule to acquire the incremental data of the target data source based on the metadata mark in the target data source.
Optionally, the executing the target acquisition task according to the first acquisition rule to acquire incremental data of the target data source based on the position offset of each data field in the target data source includes: acquiring a task record of a data acquisition task executed on the target data source for the last time; determining target fields required to be acquired by the target acquisition task, and acquiring the latest position offset of each target field in the task record, wherein the position offset is an offset from the initial position of a data source and is used for representing the acquired part; determining the position indicated by the position offset as the starting position of the data of the target field; and starting data acquisition from the initial position to obtain incremental data of the target data source.
Optionally, the method further comprises: synchronously updating the position offset of the target field in the process of collecting the incremental data of the target data source; and writing the finally updated position offset into a task record of the current acquisition task to indicate the starting position of data acquisition on the target field of the target data source next time.
Optionally, the executing the target acquisition task according to the second acquisition rule to acquire incremental data of the target data source based on the metadata tag in the target data source includes: acquiring a historical task record of a data acquisition task executed on the target data source; acquiring first metadata marked in the data acquisition task in the past from the historical task record, wherein the first metadata comprises at least one of a file name and a modification timestamp of an acquired part; comparing metadata of all data in the target data source with the marked first metadata to find unmarked second metadata; and acquiring data corresponding to the second metadata to obtain incremental data of the target data source.
Optionally, the method further comprises: determining a file directory of the target data source according to a storage path of the data in the target data source in the acquisition process; when the target acquisition task is completed, closing an acquisition channel, and creating a corresponding monitoring thread for each file directory; monitoring the corresponding file directory in the target data source through the monitoring thread; and under the condition that the data under the file directory is monitored to be changed, initiating a new data acquisition task for the target data source.
Optionally, after the collection is completed, the method further includes: determining a target record and a splitting field which need to be split in the acquired data records; splitting the target record into a plurality of data records according to the value of the splitting field, wherein the number of the data records obtained by splitting is the same as the number of the values of the splitting field; and inheriting other fields except the split field and corresponding field values in the target record into each split data record.
Optionally, after the collection is completed, the method further includes: creating a synchronous monitoring thread when executing a data synchronous task on the acquired data; monitoring a synchronization process of data through the synchronization monitoring thread; under the condition that the synchronous monitoring thread monitors the synchronous failure, re-executing the data synchronous task until the retry times reach a preset threshold value, recording a failure log and generating alarm information based on the failure log; and sending out the alarm information.
Optionally, after the collection is completed, the method further includes: calling a target callback function when executing a data import task based on the acquired data; transferring the execution result of the data import task through the target callback function; when the execution result is that the execution fails, calling a target processing function; and executing error processing through the target processing function.
According to another aspect of the embodiments of the present application, there is provided a data acquisition device, including: the task receiving module is used for receiving a target acquisition task, wherein the target acquisition task is used for indicating to acquire data of a target data source; the rule acquisition module is used for acquiring an increment acquisition rule configured for the target acquisition task in advance, wherein the increment acquisition rule comprises a first acquisition rule for performing increment acquisition based on a position offset and a second acquisition rule for performing increment acquisition based on a metadata mark; and the increment acquisition module is used for executing the target acquisition task according to the first acquisition rule so as to acquire the increment data of the target data source based on the position offset of each data field in the target data source, or executing the target acquisition task according to the second acquisition rule so as to acquire the increment data of the target data source based on the metadata mark in the target data source.
According to another aspect of the embodiments of the present application, there is provided an electronic device including a memory, a processor, a communication interface, and a communication bus, where the memory stores a computer program executable on the processor, the memory and the processor communicate with each other through the communication bus and the communication interface, and the processor executes the steps of the above method when executing the computer program.
According to another aspect of embodiments of the present application, there is also provided a computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the above-described method.
Compared with the related art, the technical scheme provided by the embodiment of the application has the following advantages:
The application provides a data acquisition method, which comprises the following steps: receiving a target acquisition task, wherein the target acquisition task is used for indicating to acquire data of a target data source; acquiring an incremental acquisition rule configured for the target acquisition task in advance, wherein the incremental acquisition rule comprises a first acquisition rule for incremental acquisition based on a position offset and a second acquisition rule for incremental acquisition based on a metadata mark; and executing the target acquisition task according to the first acquisition rule to acquire the incremental data of the target data source based on the position offset of each data field in the target data source, or executing the target acquisition task according to the second acquisition rule to acquire the incremental data of the target data source based on the metadata mark in the target data source. By setting adapted incremental acquisition rules for different data sources and different data acquisition tasks, the application ensures that each execution of a data acquisition task directly acquires only the incremental data of the data source. Repeated acquisition of the same data, and hence the waste of computing resources, is avoided, the efficiency of data acquisition is improved, and the technical problem that repeated acquisition of a large amount of identical data causes serious waste of computing resources is solved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is obvious to those skilled in the art that other drawings can be obtained according to these drawings without inventive effort.
FIG. 1 is a schematic diagram of an alternative hardware environment for a data acquisition method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of an alternative data acquisition method according to an embodiment of the present application;
FIG. 3 is a block diagram of an alternative data acquisition device according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present application based on the embodiments herein.
In the following description, suffixes such as "module", "component", or "unit" for representing elements are used only for facilitating the description of the present application, and are not of specific significance per se. Thus, "module" and "component" may be used in combination.
To solve the problems mentioned in the background art, according to an aspect of the embodiments of the present application, an embodiment of a data acquisition method is provided.
Optionally, in the embodiment of the present application, the above data acquisition method may be applied to a hardware environment formed by the terminal 101 and the server 103 as shown in FIG. 1. As shown in FIG. 1, the server 103 is connected to the terminal 101 through a network and may provide services to the terminal or to a client installed on the terminal. A database 105 may be provided on the server or independently of the server to provide data storage services for the server 103. The network includes, but is not limited to, a wide area network, a metropolitan area network, or a local area network, and the terminal 101 includes, but is not limited to, a PC, a mobile phone, a tablet computer, and the like.
A data collection method in the embodiment of the present application may be performed by the server 103, or may be performed by the server 103 and the terminal 101 together, as shown in fig. 2, and the method may include the following steps:
Step S202, receiving a target acquisition task, wherein the target acquisition task is used for indicating to acquire data of a target data source;
Step S204, acquiring an incremental acquisition rule configured for the target acquisition task in advance, wherein the incremental acquisition rule comprises a first acquisition rule for incremental acquisition based on a position offset and a second acquisition rule for incremental acquisition based on a metadata mark;
Step S206, executing the target acquisition task according to the first acquisition rule to acquire incremental data of the target data source based on the position offset of each data field in the target data source, or executing the target acquisition task according to the second acquisition rule to acquire incremental data of the target data source based on the metadata mark in the target data source.
Through the steps S202 to S206, the adaptive incremental acquisition rules are set for different data sources and different data acquisition tasks, so that the incremental data of the data sources are directly acquired when the data acquisition tasks are executed each time, the repeated acquisition of the same data is avoided, namely the waste of computing resources is avoided, the data acquisition efficiency is improved, and the technical problem that a large amount of the same data are repeatedly acquired to cause serious waste of computing resources is solved.
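Steps S202 to S206 can be pictured as a small dispatch routine: the task arrives, the preconfigured rule is looked up, and the matching collector runs. The following Python sketch is illustrative only. The names `RuleType`, `AcquisitionTask` and `run_task`, and the idea of keying the rule table by task ID, are assumptions made for the example and do not come from the application.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class RuleType(Enum):
    """Hypothetical labels for the two incremental acquisition rules."""
    POSITION_OFFSET = "position_offset"  # first acquisition rule
    METADATA_MARK = "metadata_mark"      # second acquisition rule


@dataclass
class AcquisitionTask:
    """Hypothetical task object: which task, and which data source it targets."""
    task_id: str
    source_id: str


# A collector takes a task and returns the incremental data it gathered.
Collector = Callable[[AcquisitionTask], List[str]]


def run_task(
    task: AcquisitionTask,
    rules: Dict[str, RuleType],
    collectors: Dict[RuleType, Collector],
) -> List[str]:
    """S202: receive the task; S204: look up its rule; S206: execute it."""
    rule = rules.get(task.task_id)
    if rule is None:
        raise KeyError(f"no incremental rule configured for task {task.task_id!r}")
    return collectors[rule](task)
```

Keeping the rule table separate from the collectors means a new data source type only needs a new table entry, which matches the per-source, per-task configuration described above.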
The application supports access to multiple types of data sources, such as local files (Text/CSV/Excel), FTP files, relational databases (MySQL/Oracle/SQL Server), message queues (Kafka/RocketMQ), Elasticsearch, HBase/Hive/ClickHouse, and HTTP/HTTPS (third-party API data). When a data source is newly added, the background system selects the policy mode defined for the type chosen by the user, and a field mapping relationship is established between the fields of the source data source and the fields of the target data source. Depending on the synchronization strategy, incremental or full synchronization may be selected. For example, in an incremental synchronization strategy, the data to be synchronized may be determined from a timestamp or an incremental mark; in a full synchronization strategy, all data is synchronized. The specific policies and field mapping definitions may vary with the system and its requirements. To ensure the accuracy of the data source, the background system also defines service parameters for checking and testing the configured field mapping relationship, so that configuration errors can be found and corrected in time and the source data is mapped accurately into the target data source. This improves the stability of the system and ensures the correctness and consistency of the data source.
In summary, adding a data source involves several steps: selecting a policy mode, establishing a field mapping relationship, setting a synchronization strategy, and checking and testing the mapping through the service parameters defined by the background.
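The application does not specify how the field mapping relationship is checked. As one possible reading, the Python sketch below validates a mapping against the known source and target fields and then applies it to a record. The function names `check_field_mapping` and `apply_mapping` and the particular checks performed are assumptions for illustration.

```python
from typing import Any, Dict, Iterable, List


def check_field_mapping(
    mapping: Dict[str, str],
    source_fields: Iterable[str],
    target_fields: Iterable[str],
) -> List[str]:
    """Return a list of configuration errors; an empty list means the mapping passed."""
    source_set, target_set = set(source_fields), set(target_fields)
    errors: List[str] = []
    seen_targets: Dict[str, str] = {}
    for src, dst in mapping.items():
        if src not in source_set:
            errors.append(f"unknown source field: {src}")
        if dst not in target_set:
            errors.append(f"unknown target field: {dst}")
        if dst in seen_targets:
            # Two source fields writing to one target field would overwrite each other.
            errors.append(f"target field {dst} is mapped from both {seen_targets[dst]} and {src}")
        else:
            seen_targets[dst] = src
    return errors


def apply_mapping(record: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Rename the fields of a source record to their target field names."""
    return {dst: record[src] for src, dst in mapping.items()}
```

Running the check before any synchronization task starts is what allows a configuration error to surface at setup time rather than in the middle of a transfer.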
The incremental acquisition rule is the key to solving the technical problem that repeated acquisition of a large amount of identical data causes serious waste of computing resources. For different data sources, it comprises a first acquisition rule for incremental acquisition based on a position offset and a second acquisition rule for incremental acquisition based on metadata marks. For example, text files and log files use the first acquisition rule, and stream data uses the second acquisition rule. The two rules may also be used in combination. The task execution methods of the two rules are described below.
In an alternative embodiment, the performing the target acquisition task according to the first acquisition rule to acquire incremental data of the target data source based on the positional offsets of the respective data fields in the target data source includes: acquiring a task record of a data acquisition task executed on the target data source for the last time; determining target fields required to be acquired by the target acquisition task, and acquiring the latest position offset of each target field in the task record, wherein the position offset is an offset from the initial position of a data source and is used for representing the acquired part; determining the position indicated by the position offset as the starting position of the data of the target field; and starting data acquisition from the initial position to obtain incremental data of the target data source.
Further, the method further comprises: synchronously updating the position offset of the target field in the process of collecting the incremental data of the target data source; and writing the finally updated position offset into a task record of the current acquisition task to indicate the starting position of data acquisition on the target field of the target data source next time.
In the embodiment of the application, in the process of collecting the data file, the position offset of the file content can be set, and the position offset corresponds to the field. For text files and log files, incremental acquisitions may be made based on acquisition location offsets. The offset is an indicator of from which location in the file data is to be collected. At the beginning of each acquisition task, the system will locate a specific location in the file according to the previously acquired location offset and begin reading data from that location. By recording, storing and updating the offset, only the content of the newly added part of the file can be acquired in the next acquisition, thereby realizing incremental acquisition. Thus, only the newly added data part can be collected, and the workload of repeatedly collecting the processed data is reduced. For text files and log files, the acquisition position offset may be recorded according to the bytes or the number of lines of the file.
By setting the position offset of the file content and performing incremental acquisition based on the position offset, the efficiency and accuracy of data acquisition can be improved, and meanwhile, the occupation of system resources is reduced. This is particularly important for scenes where a large number of data files need to be collected and processed.
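A minimal Python sketch of offset-based incremental acquisition for a log file follows. It simplifies the scheme described above: it keeps one byte offset per file rather than one offset per target field, it stores the task record as a small JSON file, and it does not handle file truncation or rotation. The function names and the JSON layout are assumptions for illustration.

```python
import json
import os
from typing import List


def load_offset(record_path: str) -> int:
    """Read the offset saved by the previous task; 0 means nothing was collected yet."""
    if not os.path.exists(record_path):
        return 0
    with open(record_path, "r", encoding="utf-8") as f:
        return int(json.load(f).get("offset", 0))


def save_offset(record_path: str, offset: int) -> None:
    """Write the final offset into the task record for the next run."""
    with open(record_path, "w", encoding="utf-8") as f:
        json.dump({"offset": offset}, f)


def collect_increment(data_path: str, record_path: str) -> List[bytes]:
    """Collect only the complete lines appended since the last recorded offset."""
    offset = load_offset(record_path)
    lines: List[bytes] = []
    with open(data_path, "rb") as f:
        f.seek(offset)  # start position = position indicated by the offset
        for line in f:
            if not line.endswith(b"\n"):
                break  # partial line still being written; leave it for the next run
            lines.append(line.rstrip(b"\r\n"))
            offset += len(line)  # update the offset while collecting
    save_offset(record_path, offset)
    return lines
```

The offset is counted in bytes here; counting in lines, which the description also allows, would avoid reading in binary mode but would require rescanning the file to find the start position.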
In an alternative embodiment, the performing the target acquisition task according to the second acquisition rule to acquire incremental data of the target data source based on metadata tags in the target data source includes: acquiring a historical task record of a data acquisition task executed on the target data source; acquiring first metadata marked in the data acquisition task in the past from the historical task record, wherein the first metadata comprises at least one of a file name and a modification timestamp of an acquired part; comparing metadata of all data in the target data source with the marked first metadata to find unmarked second metadata; and acquiring data corresponding to the second metadata to obtain incremental data of the target data source.
Further, the method further comprises: determining a file directory of the target data source according to a storage path of the data in the target data source in the acquisition process; when the target acquisition task is completed, closing an acquisition channel, and creating a corresponding monitoring thread for each file directory; monitoring the corresponding file directory in the target data source through the monitoring thread; and under the condition that the data under the file directory is monitored to be changed, initiating a new data acquisition task for the target data source.
In the embodiment of the application, when the server acquires the data file, the server marks the acquired file so as to avoid repeated acquisition of the same file. This may be accomplished by recording metadata of the file (e.g., filename, modification time, etc.). When a file is successfully acquired, the file is marked to indicate that it has been acquired in order to avoid repeated acquisitions. Thus, when the file is acquired next time, the marked file can be skipped, and only the newly added file is acquired. When the collection operation is completed, namely all files to be collected are collected, the server side can close the collection channel and stop further collection operation. This ensures the integrity and accuracy of the acquisition job and frees up resources and avoids unnecessary runs. This way, the process of data file collection can be effectively managed and controlled. Through a plurality of monitoring threads and a file marking mechanism, a plurality of files can be acquired in parallel, and the acquisition efficiency is improved. Meanwhile, the acquisition channel is closed after the acquisition operation is completed, so that the acquisition process can be stopped in time, and the waste of resources and the wrong acquisition operation are avoided.
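The marking mechanism can be sketched in Python as a comparison between the metadata of the files currently in a directory and the set of marks recorded by earlier tasks. In this sketch a mark is the pair (file name, modification time in nanoseconds), and the marked set is held in memory. Both choices, and the names `scan_marks` and `collect_new_files`, are assumptions for illustration; a real task record would be persisted.

```python
import os
from typing import Dict, Set, Tuple

# A mark identifies one collected file version: (file name, modification time in ns).
Mark = Tuple[str, int]


def scan_marks(directory: str) -> Dict[Mark, str]:
    """Map the metadata mark of every regular file in the directory to its path."""
    marks: Dict[Mark, str] = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                marks[(entry.name, entry.stat().st_mtime_ns)] = entry.path
    return marks


def collect_new_files(directory: str, marked: Set[Mark]) -> Dict[str, bytes]:
    """Collect files whose mark is not yet recorded, then record their marks."""
    collected: Dict[str, bytes] = {}
    for mark, path in sorted(scan_marks(directory).items()):
        if mark in marked:
            continue  # already collected by an earlier task: skip
        with open(path, "rb") as f:
            collected[mark[0]] = f.read()
        marked.add(mark)  # mark only after the file was read successfully
    return collected
```

Because the modification time is part of the mark, a file that is rewritten after collection counts as unmarked and is collected again, while an unchanged file is skipped.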
In an alternative embodiment, after the collection is completed, the method further comprises: determining a target record and a splitting field which need to be split in the acquired data records; splitting the target record into a plurality of data records according to the value of the splitting field, wherein the number of the data records obtained by splitting is the same as the number of the values of the splitting field; and inheriting other fields except the split field and corresponding field values in the target record into each split data record.
In the embodiment of the application, typically, a record includes a plurality of fields, and each field stores different information. In some cases, further processing or manipulation is required depending on the value of a certain field. Therefore, one record can be split into a plurality of records according to the value of a certain field in the data file acquisition process. These split records will automatically inherit the values of other fields in the original record.
For example, assume a data file containing order information in which each record has fields such as order number, product name and quantity. If records are split according to quantity, a record whose quantity is greater than 1 is split into several records, each representing one unit of the product. For example, the original record (order number A, product name X, quantity 3) yields three records after splitting, each of which reads: order number A, product name X, quantity 1. The values of fields such as the order number and product name in the original record are inherited automatically by every split record. Splitting one record into several according to business requirements makes the data more fine-grained and better serves specific business scenarios, such as counting the sales of each product or managing inventory.
As another example, assume that a record contains three fields (name, age and gender) and needs to be split into two records according to the gender field, one for male and one for female. After splitting, every resulting record contains the same fields as the original. The name and age values are inherited unchanged from the original record, and only the value of the gender field differs between the split records. In this way a single record can be split conveniently into multiple records according to the values of a field while the values of the other fields are preserved, which avoids repeated input and redundant data and better serves the data processing needs of different scenarios.
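Both splitting examples can be expressed in a few lines of Python. The sketch below assumes that records are plain dictionaries and that a multi-valued split field is stored as a list. The function names and the field names used in the usage note are hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def split_record(record: Record, split_field: str) -> List[Record]:
    """Split on a multi-valued field: one output record per value of that field."""
    value = record[split_field]
    values = value if isinstance(value, (list, tuple)) else [value]
    # Every other field and its value is inherited unchanged by each split record.
    return [{**record, split_field: v} for v in values]


def split_by_quantity(record: Record, quantity_field: str = "quantity") -> List[Record]:
    """Split a record with quantity N > 1 into N records, each with quantity 1."""
    count = int(record[quantity_field])
    if count <= 1:
        return [dict(record)]
    return [{**record, quantity_field: 1} for _ in range(count)]
```

With these helpers, `split_by_quantity({"order_no": "A", "product": "X", "quantity": 3})` yields three records that each carry order number A, product X and quantity 1, and `split_record` applied to a record whose `gender` field holds two values yields two records that differ only in `gender`.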
In an alternative embodiment, after the collection is completed, the method further comprises: creating a synchronous monitoring thread when executing a data synchronous task on the acquired data; monitoring a synchronization process of data through the synchronization monitoring thread; under the condition that the synchronous monitoring thread monitors the synchronous failure, re-executing the data synchronous task until the retry times reach a preset threshold value, recording a failure log and generating alarm information based on the failure log; and sending out the alarm information.
In the embodiment of the application, if the file synchronization monitoring thread detects a synchronization failure during file synchronization, a retry mechanism is started. Synchronization of the failed data is retried up to the number of retries configured in advance. If the data still cannot be synchronized after these attempts, the data is put into a designated queue, corresponding log information is recorded, and an alarm message is sent.
Specifically, during the file synchronization process, some abnormal situations may occur, which may cause synchronization failure, such as network connection interruption, unavailability of the target file server, and the like. In order to ensure the integrity and accuracy of the data, a retry operation is required to try to synchronize the data again. In this process, the number of retries may be configured in advance, i.e., when synchronization fails, a specified number of retries may be performed. The number of retries may be set according to the actual situation to ensure that the synchronization data is tried as many times as possible. If multiple attempts still fail to synchronize data, the system will place the data that failed to synchronize in a designated queue. In this way, the data can be separated from other successfully synchronized data, so that subsequent processing and tracking are facilitated. Meanwhile, under the condition of synchronization failure, the system can record corresponding log information so as to conduct fault investigation and problem analysis. The log records may include information on the specific cause of the synchronization failure, the number of retries, time stamps, etc. to facilitate subsequent problem location and handling. In addition, in order to know the condition of synchronization failure in time, the system also sends an alarm message. The alarm message may be sent to the relevant personnel by mail, text message or other notification means so that they can take timely action to solve the synchronization problem. Through the processing mechanism, the possible synchronization failure condition in the file synchronization process can be effectively handled, and the integrity and accuracy of data are ensured. Meanwhile, through recording logs and sending alarm messages, the synchronization problem can be found and solved in time, and smooth data synchronization is ensured.
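The retry mechanism described above can be sketched as follows. This is an illustration rather than the implementation of the application: the function `sync_with_retry`, its parameters, and the two sample synchronization functions are hypothetical, the designated queue is modeled with the standard `queue.Queue`, and the alarm channel is modeled as a plain callable.

```python
import logging
import queue
from typing import Any, Callable

logger = logging.getLogger("sync")


def sync_with_retry(item: Any,
                    sync_fn: Callable[[Any], None],
                    max_retries: int,
                    failed_queue: "queue.Queue[Any]",
                    send_alert: Callable[[str], None]) -> bool:
    """Try to synchronize `item`; retry up to `max_retries` times on failure."""
    last_error = None
    for attempt in range(1 + max_retries):  # one initial attempt plus the retries
        try:
            sync_fn(item)
            return True
        except Exception as exc:  # e.g. network interruption, target unavailable
            last_error = exc
            logger.warning("sync attempt %d failed for %r: %s", attempt + 1, item, exc)
    # All attempts failed: park the item, record a log entry, and raise an alarm.
    failed_queue.put(item)
    message = f"sync failed for {item!r} after {max_retries} retries: {last_error}"
    logger.error(message)
    send_alert(message)
    return False


calls = {"flaky": 0, "broken": 0}


def flaky(item: Any) -> None:
    """Fails twice, then succeeds (simulates a transient network problem)."""
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise ConnectionError("network interrupted")


def broken(item: Any) -> None:
    """Always fails (simulates an unavailable target server)."""
    calls["broken"] += 1
    raise ConnectionError("target server unavailable")


failed: "queue.Queue[Any]" = queue.Queue()
alerts: list = []
flaky_ok = sync_with_retry("file-1", flaky, 3, failed, alerts.append)
broken_ok = sync_with_retry("file-2", broken, 3, failed, alerts.append)
```

In this sketch the transient failure recovers within the retry budget, while the permanently failing item ends up in `failed` with exactly one alarm message. In a real deployment the alert callable might send mail or a text message.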
In this embodiment of the present application, the specific synchronization target needs to be determined according to the system design and configuration. Examples include: another server: files can be synchronized to a designated directory or file system on another server, which provides backup and redundant storage of the files and improves the reliability and availability of the data; database: the file content can be parsed, converted into structured data, and written into a database, so that data query, analysis, and processing can be conveniently performed; queue or message middleware: the file content can be converted into messages and sent to a message queue or message middleware, which enables asynchronous processing and decoupling and allows the data to be sent to other systems or modules for further processing; cloud storage service: files can be synchronized to cloud storage services such as Amazon S3 and Google Cloud Storage, which provide scalable storage space and data backup and are suitable for large-scale data storage and access; distributed file system: files can be synchronized to distributed file systems such as Hadoop HDFS and GlusterFS, which provide high availability, high performance, and fault tolerance and are suitable for large-scale data storage and analysis.
A suitable synchronization target can be selected according to the specific application scenario and requirements. Meanwhile, according to the system design and configuration, data that failed to synchronize is put into a designated queue, a log entry is recorded, and an alarm message is sent, so that subsequent processing and troubleshooting can be carried out.
In an alternative embodiment, after the collection is completed, the method further comprises: calling a target callback function when executing a data import task based on the acquired data; transferring the execution result of the data import task through the target callback function; when the execution result is that the execution fails, calling a target processing function; and executing error processing through the target processing function.
In the embodiment of the application, a callback function can be set for the data import task through an error callback processing mechanism and used to process the result of each task. The callback function is called after the task is completed, and the result of the task is passed to it as a parameter. Through the callback function, success or failure feedback can be given for the result of each task: if the task is executed successfully, the corresponding success processing logic is executed; if the task fails, the corresponding failure processing logic is executed. When failed tasks are processed, the error and exception methods can be used for the corresponding handling. The error method can be used to handle general error conditions, such as a task execution timeout or a connection interruption; the exception method can be used to handle more serious abnormal situations, such as exceptions raised during task execution. With the error callback processing mechanism, the execution result of each task can be obtained in time and handled accordingly, which enhances the robustness and reliability of the system and improves the success rate of data import tasks. It should be noted that the specific error callback processing mechanism and the implementation of the callback function may vary between systems and frameworks. In practical applications, a suitable mechanism and implementation need to be selected according to the specific requirements and technology choices to process the results of data import tasks.
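A minimal sketch of such a callback mechanism is shown below. It is one possible reading of the paragraph, not the mechanism of any particular framework: `run_import_task`, `on_result`, `handle_error`, and the two sample tasks are hypothetical names introduced for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Signature of the result callback: (success flag, imported row count, error).
ResultCallback = Callable[[bool, Optional[int], Optional[Exception]], None]

events: List[Tuple[str, object]] = []


def handle_error(error: Exception) -> None:
    """Target processing function: invoked only when an import task fails."""
    events.append(("error", str(error)))


def on_result(success: bool, rows: Optional[int], error: Optional[Exception]) -> None:
    """Target callback function: receives the execution result of each task."""
    if success:
        events.append(("ok", rows))
    else:
        handle_error(error)


def run_import_task(task: Callable[[], int], callback: ResultCallback) -> None:
    """Run one import task and pass its result to the callback."""
    try:
        rows = task()
    except Exception as exc:
        callback(False, None, exc)
    else:
        callback(True, rows, None)


def good_task() -> int:
    return 10  # pretend that ten rows were imported


def bad_task() -> int:
    raise ConnectionError("connection interrupted")


run_import_task(good_task, on_result)
run_import_task(bad_task, on_result)
```

Keeping the error handling in a separate function means the failure path can be replaced, for example with retry or alerting logic, without touching the task runner.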
In an alternative embodiment, the local thread may periodically clean up files that are archived out of date, preventing excessive consumption of disk space.
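As a sketch of such a cleanup step, the function below deletes archive files whose modification time is older than a retention period. The function name `remove_expired_archives`, the seven-day retention period, and the file names are assumptions made for illustration; a local thread or timer could call the function at a fixed interval.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def remove_expired_archives(archive_dir: Path,
                            max_age_seconds: float,
                            now: Optional[float] = None) -> List[Path]:
    """Delete regular files in `archive_dir` older than `max_age_seconds`."""
    current = time.time() if now is None else now
    removed = []
    for path in sorted(archive_dir.iterdir()):
        if path.is_file() and current - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path)
    return removed


DAY = 86400
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_file = root / "old.log"
    new_file = root / "new.log"
    old_file.write_text("old")
    new_file.write_text("new")
    now = time.time()
    # Backdate one file by ten days so that it exceeds the retention period.
    os.utime(old_file, (now - 10 * DAY, now - 10 * DAY))
    removed = remove_expired_archives(root, max_age_seconds=7 * DAY, now=now)
    removed_names = [p.name for p in removed]
    remaining_names = sorted(p.name for p in root.iterdir())
```

Only the backdated file is deleted; the recently written file stays in place.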
In the present application, adaptive incremental acquisition rules are set for different data sources and different data acquisition tasks, which ensures that each execution of a data acquisition task directly acquires only the incremental data of the data source. Repeated acquisition of the same data, and the resulting waste of computing resources, is thereby avoided, the efficiency of data acquisition is improved, and the technical problem that repeated acquisition of large amounts of identical data seriously wastes computing resources is solved.
According to still another aspect of the embodiments of the present application, as shown in fig. 3, there is provided a data acquisition device, including:
the task receiving module 301 is configured to receive a target acquisition task, where the target acquisition task is used to instruct data acquisition on a target data source;
the rule acquisition module 303 is configured to acquire an incremental acquisition rule configured for the target acquisition task in advance, where the incremental acquisition rule includes a first acquisition rule for performing incremental acquisition based on a position offset and a second acquisition rule for performing incremental acquisition based on a metadata tag;
the incremental acquisition module 305 is configured to execute the target acquisition task according to the first acquisition rule to acquire incremental data of the target data source based on the position offset of each data field in the target data source, or execute the target acquisition task according to the second acquisition rule to acquire the incremental data of the target data source based on the metadata tag in the target data source.
It should be noted that, the task receiving module 301 in this embodiment may be configured to perform step S202 in the embodiment of the present application, the rule obtaining module 303 in this embodiment may be configured to perform step S204 in the embodiment of the present application, and the increment collecting module 305 in this embodiment may be configured to perform step S206 in the embodiment of the present application.
It should be noted that the examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the above embodiments. The above modules may run as part of the apparatus in the hardware environment shown in fig. 1, and may be implemented in software or in hardware.
Optionally, the incremental acquisition module is specifically configured to: acquiring a task record of a data acquisition task executed on the target data source for the last time; determining target fields required to be acquired by the target acquisition task, and acquiring the latest position offset of each target field in the task record, wherein the position offset is an offset from the initial position of a data source and is used for representing the acquired part; determining the position indicated by the position offset as the starting position of the data of the target field; and starting data acquisition from the initial position to obtain incremental data of the target data source.
Optionally, the incremental acquisition module further includes an offset recording unit, specifically configured to: synchronously updating the position offset of the target field in the process of collecting the incremental data of the target data source; and writing the finally updated position offset into a task record of the current acquisition task to indicate the starting position of data acquisition on the target field of the target data source next time.
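Offset-based incremental acquisition, together with the offset recording unit, can be sketched as follows. The sketch makes several simplifying assumptions: the data source is an append-only text file written in whole lines, the task record is an in-memory dictionary, and one byte offset is stored per key. The function name `collect_increment` and the key `orders` are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def collect_increment(source: Path, task_record: Dict[str, int], key: str) -> List[str]:
    """Read only the part of `source` that lies after the recorded offset.

    The offset is counted in bytes from the start of the file. After reading,
    the new offset is written back into the task record so that the next run
    starts where this one stopped.
    """
    start = task_record.get(key, 0)  # no record yet: start from the beginning
    with source.open("rb") as handle:
        handle.seek(start)
        data = handle.read()
        task_record[key] = handle.tell()
    return data.decode("utf-8").splitlines()


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "orders.log"
    source.write_bytes(b"a\nb\n")
    record: Dict[str, int] = {}
    first = collect_increment(source, record, "orders")
    with source.open("ab") as handle:
        handle.write(b"c\n")  # new data arrives between two acquisition tasks
    second = collect_increment(source, record, "orders")
    third = collect_increment(source, record, "orders")
```

The second run returns only the appended line, and the third run returns nothing because the offset already points to the end of the file. A persistent implementation would store the task record in a file or database instead of a dictionary.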
Optionally, the incremental acquisition module is further configured to: acquiring a historical task record of a data acquisition task executed on the target data source; acquiring first metadata marked in the data acquisition task in the past from the historical task record, wherein the first metadata comprises at least one of a file name and a modification timestamp of an acquired part; comparing metadata of all data in the target data source with the marked first metadata to find unmarked second metadata; and acquiring data corresponding to the second metadata to obtain incremental data of the target data source.
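The metadata comparison can be illustrated with the sketch below. It assumes that the data source is a directory of files and that each collected file is marked by the pair of file name and modification timestamp; the function name `collect_unmarked` and the sample file names are hypothetical, and the marked set stands in for the historical task record.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Set, Tuple

Mark = Tuple[str, int]  # (file name, modification time in nanoseconds)


def collect_unmarked(source_dir: Path, marked: Set[Mark]) -> List[str]:
    """Return the names of files whose metadata is not yet marked, then mark them."""
    fresh = []
    for path in sorted(source_dir.iterdir()):
        if not path.is_file():
            continue
        mark = (path.name, path.stat().st_mtime_ns)
        if mark not in marked:
            fresh.append(path.name)  # new or modified since the last task
            marked.add(mark)
    return fresh


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    t0 = 1_700_000_000 * 10**9  # fixed timestamp, in nanoseconds, for the demo
    for name in ("a.csv", "b.csv"):
        (root / name).write_text(name)
        os.utime(root / name, ns=(t0, t0))
    marked: Set[Mark] = set()
    first_run = collect_unmarked(root, marked)
    # Between tasks: one file is modified and one new file appears.
    os.utime(root / "a.csv", ns=(t0 + 60 * 10**9, t0 + 60 * 10**9))
    (root / "c.csv").write_text("c")
    second_run = collect_unmarked(root, marked)
    third_run = collect_unmarked(root, marked)
```

The second run picks up only the modified file and the new file, and the third run finds nothing left to collect.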
Optionally, the incremental acquisition module further includes a data source monitoring unit, specifically configured to: determining a file directory of the target data source according to a storage path of the data in the target data source in the acquisition process; when the target acquisition task is completed, closing an acquisition channel, and creating a corresponding monitoring thread for each file directory; monitoring the corresponding file catalogue in the target data source through the monitoring thread; and under the condition that the data under the file directory is monitored to be changed, initiating a new data acquisition task for the target data source.
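A monitoring thread of this kind can be sketched with a simple polling loop. This is an assumption-laden illustration: the application does not state how changes are detected, so the sketch compares directory snapshots at a fixed interval, and the class `DirectoryMonitor`, the helper `snapshot`, and the callback are hypothetical names. The callback stands in for initiating a new data acquisition task.

```python
import tempfile
import threading
from pathlib import Path
from typing import Callable, Dict


def snapshot(directory: Path) -> Dict[str, int]:
    """Map each file name in the directory to its modification time (ns)."""
    return {p.name: p.stat().st_mtime_ns for p in directory.iterdir() if p.is_file()}


class DirectoryMonitor(threading.Thread):
    """Polls one directory and calls `on_change` whenever its content changes."""

    def __init__(self, directory: Path, on_change: Callable[[Path], None],
                 interval: float = 0.05) -> None:
        super().__init__(daemon=True)
        self.directory = directory
        self.on_change = on_change
        self.interval = interval
        self._stop_event = threading.Event()
        self._last = snapshot(directory)  # baseline taken before the thread starts

    def run(self) -> None:
        # Event.wait returns False on timeout, so the loop runs once per interval.
        while not self._stop_event.wait(self.interval):
            current = snapshot(self.directory)
            if current != self._last:
                self._last = current
                self.on_change(self.directory)

    def stop(self) -> None:
        self._stop_event.set()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    triggered = threading.Event()
    monitor = DirectoryMonitor(root, lambda changed_dir: triggered.set())
    monitor.start()
    (root / "new.csv").write_text("x")  # simulate new data in the directory
    change_detected = triggered.wait(timeout=5.0)
    monitor.stop()
    monitor.join(timeout=5.0)
```

One such thread would be created per file directory. A production version would also need to tolerate files that disappear while a snapshot is being taken.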
Optionally, the data acquisition device further includes a splitting module, specifically configured to: determining a target record and a splitting field which need to be split in the acquired data records; splitting the target record into a plurality of data records according to the value of the splitting field, wherein the number of the data records obtained by splitting is the same as the number of the values of the splitting field; and inheriting other fields except the split field and corresponding field values in the target record into each split data record.
Optionally, the data acquisition device further includes a synchronization module, specifically configured to: creating a synchronous monitoring thread when executing a data synchronous task on the acquired data; monitoring a synchronization process of data through the synchronization monitoring thread; under the condition that the synchronous monitoring thread monitors the synchronous failure, re-executing the data synchronous task until the retry times reach a preset threshold value, recording a failure log and generating alarm information based on the failure log; and sending out the alarm information.
Optionally, the data acquisition device further includes an importing module, specifically configured to: calling a target callback function when executing a data import task based on the acquired data; transferring the execution result of the data import task through the target callback function; when the execution result is that the execution fails, calling a target processing function; and executing error processing through the target processing function.
According to another aspect of the embodiments of the present application, as shown in fig. 4, the present application provides an electronic device, including a memory 401, a processor 403, a communication interface 405 and a communication bus 407, where the memory 401 stores a computer program that can be executed on the processor 403, and the memory 401 and the processor 403 communicate with each other through the communication interface 405 and the communication bus 407, and the processor 403 executes the steps of the method.
The memory and the processor in the electronic device communicate with the communication interface through a communication bus. The communication bus may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus, or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The communication bus may be classified as an address bus, a data bus, a control bus, or the like.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
There is also provided, in accordance with yet another aspect of embodiments of the present application, a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps of any of the embodiments described above.
Optionally, in an embodiment of the present application, the computer readable medium is configured to store program code for the processor to perform the steps of:
receiving a target acquisition task, wherein the target acquisition task is used for indicating to acquire data of a target data source;
acquiring an incremental acquisition rule configured for the target acquisition task in advance, wherein the incremental acquisition rule comprises a first acquisition rule for incremental acquisition based on a position offset and a second acquisition rule for incremental acquisition based on a metadata mark;
and executing the target acquisition task according to the first acquisition rule to acquire the incremental data of the target data source based on the position offset of each data field in the target data source, or executing the target acquisition task according to the second acquisition rule to acquire the incremental data of the target data source based on the metadata mark in the target data source.
Optionally, for specific examples of this embodiment, reference may be made to the examples described in the foregoing embodiments, which are not repeated here.
In specific implementation, the embodiments of the present application may refer to the above embodiments, which have corresponding technical effects.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application specific integrated circuits (Application Specific Integrated Circuits, ASIC), digital signal processors (Digital Signal Processor, DSP), digital signal processing devices (DSP devices, DSPD), programmable logic devices (Programmable Logic Device, PLD), field programmable gate arrays (Field-Programmable Gate Array, FPGA), general purpose processors, controllers, microcontrollers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or the like. It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the application to enable one skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method of data acquisition, comprising:
receiving a target acquisition task, wherein the target acquisition task is used for indicating to acquire data of a target data source;
acquiring an incremental acquisition rule configured for the target acquisition task in advance, wherein the incremental acquisition rule comprises a first acquisition rule for incremental acquisition based on a position offset and a second acquisition rule for incremental acquisition based on a metadata mark;
and executing the target acquisition task according to the first acquisition rule to acquire the incremental data of the target data source based on the position offset of each data field in the target data source, or executing the target acquisition task according to the second acquisition rule to acquire the incremental data of the target data source based on the metadata mark in the target data source.
2. The method of claim 1, wherein performing the target acquisition task according to the first acquisition rule to acquire incremental data of the target data source based on the positional offset of each data field in the target data source comprises:
acquiring a task record of a data acquisition task executed on the target data source for the last time;
determining target fields required to be acquired by the target acquisition task, and acquiring the latest position offset of each target field in the task record, wherein the position offset is an offset from the initial position of a data source and is used for representing the acquired part;
determining the position indicated by the position offset as the starting position of the data of the target field;
and starting data acquisition from the initial position to obtain incremental data of the target data source.
3. The method according to claim 2, wherein the method further comprises:
synchronously updating the position offset of the target field in the process of collecting the incremental data of the target data source;
and writing the finally updated position offset into a task record of the current acquisition task to indicate the starting position of data acquisition on the target field of the target data source next time.
4. The method of claim 1, wherein the performing the target acquisition task according to the second acquisition rule to acquire incremental data of the target data source based on metadata tags in the target data source comprises:
acquiring a historical task record of a data acquisition task executed on the target data source;
acquiring first metadata marked in the data acquisition task in the past from the historical task record, wherein the first metadata comprises at least one of a file name and a modification timestamp of an acquired part;
comparing metadata of all data in the target data source with the marked first metadata to find unmarked second metadata;
and acquiring data corresponding to the second metadata to obtain incremental data of the target data source.
5. The method according to claim 4, wherein the method further comprises:
determining a file directory of the target data source according to a storage path of the data in the target data source in the acquisition process;
when the target acquisition task is completed, closing an acquisition channel, and creating a corresponding monitoring thread for each file directory;
monitoring the corresponding file directory in the target data source through the monitoring thread;
and under the condition that the data under the file directory is monitored to be changed, initiating a new data acquisition task for the target data source.
6. The method of any one of claims 1 to 5, wherein after the acquisition is completed, the method further comprises:
determining a target record and a splitting field which need to be split in the acquired data records;
splitting the target record into a plurality of data records according to the value of the splitting field, wherein the number of the data records obtained by splitting is the same as the number of the values of the splitting field;
and inheriting other fields except the split field and corresponding field values in the target record into each split data record.
7. The method of any one of claims 1 to 5, wherein after the acquisition is completed, the method further comprises:
creating a synchronous monitoring thread when executing a data synchronous task on the acquired data;
monitoring a synchronization process of data through the synchronization monitoring thread;
under the condition that the synchronous monitoring thread monitors the synchronous failure, re-executing the data synchronous task until the retry times reach a preset threshold value, recording a failure log and generating alarm information based on the failure log;
and sending out the alarm information.
8. The method of any one of claims 1 to 5, wherein after the acquisition is completed, the method further comprises:
calling a target callback function when executing a data import task based on the acquired data;
transferring the execution result of the data import task through the target callback function;
when the execution result is that the execution fails, calling a target processing function;
and executing error processing through the target processing function.
9. A data acquisition device, comprising:
the task receiving module is used for receiving a target acquisition task, wherein the target acquisition task is used for indicating to acquire data of a target data source;
the rule acquisition module is used for acquiring an increment acquisition rule configured for the target acquisition task in advance, wherein the increment acquisition rule comprises a first acquisition rule for performing increment acquisition based on a position offset and a second acquisition rule for performing increment acquisition based on a metadata mark;
and the increment acquisition module is used for executing the target acquisition task according to the first acquisition rule so as to acquire the increment data of the target data source based on the position offset of each data field in the target data source, or executing the target acquisition task according to the second acquisition rule so as to acquire the increment data of the target data source based on the metadata mark in the target data source.
10. An electronic device comprising a memory, a processor, a communication interface and a communication bus, said memory storing a computer program executable on said processor, said memory, said processor communicating with said communication interface via said communication bus, characterized in that said processor, when executing said computer program, implements the steps of the method of any of the preceding claims 1 to 8.
11. A computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method of any one of claims 1 to 8.
CN202311254931.XA 2023-09-26 2023-09-26 Data acquisition method, device, equipment and computer readable medium Active CN117421337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311254931.XA CN117421337B (en) 2023-09-26 2023-09-26 Data acquisition method, device, equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311254931.XA CN117421337B (en) 2023-09-26 2023-09-26 Data acquisition method, device, equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN117421337A true CN117421337A (en) 2024-01-19
CN117421337B CN117421337B (en) 2024-05-28

Family

ID=89525510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311254931.XA Active CN117421337B (en) 2023-09-26 2023-09-26 Data acquisition method, device, equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN117421337B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001025962A1 (en) * 1999-10-05 2001-04-12 S.C. Medicarom Group S.R.L. Database organization for increasing performance by splitting tables
CN109508355A (en) * 2018-10-19 2019-03-22 平安科技(深圳)有限公司 A kind of data pick-up method, system and terminal device
CN110032594A (en) * 2019-03-21 2019-07-19 厦门市美亚柏科信息股份有限公司 The data pick-up method, apparatus and storage medium of the Various database of customizable
CN111176645A (en) * 2019-12-30 2020-05-19 国电南瑞科技股份有限公司 Power grid big data application-oriented data integration management system and implementation method thereof
US20200409977A1 (en) * 2017-09-08 2020-12-31 Guangdong Construction Information Center Generic Multi-Source Heterogeneous Large-Scale Data Synchronization Client-Server Method
CN113485962A (en) * 2021-06-30 2021-10-08 中国民航信息网络股份有限公司 Log file storage method, device, equipment and storage medium
CN115712623A (en) * 2022-11-22 2023-02-24 中国司法大数据研究院有限公司 Batch data fault-tolerant acquisition method based on capture metadata change

Also Published As

Publication number Publication date
CN117421337B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
CN110661659B (en) Alarm method, device and system and electronic equipment
US8938421B2 (en) Method and a system for synchronizing data
US10303795B2 (en) Read descriptors at heterogeneous storage systems
CN112131237B (en) Data synchronization method, device, equipment and computer readable medium
CN112559475B (en) Data real-time capturing and transmitting method and system
US11954123B2 (en) Data processing method and device for data integration, computing device and medium
US20160294651A1 (en) Method, apparatus, and computer program product for monitoring an electronic data exchange
WO2013148488A1 (en) A method and system for centralized issue tracking
US11036590B2 (en) Reducing granularity of backup data over time
CN112905323B (en) Data processing method, device, electronic equipment and storage medium
JP6633642B2 (en) Method and device for processing data blocks in a distributed database
AU2020203735A1 (en) Automated generation and dynamic update of rules
US20200250188A1 (en) Systems, methods and data structures for efficient indexing and retrieval of temporal data, including temporal data representing a computing infrastructure
CN114048217A (en) Incremental data synchronization method and device, electronic equipment and storage medium
CN110717130B (en) Dotting method, dotting device, dotting terminal and storage medium
JP4928480B2 (en) Job processing system and job management method
CN117421337B (en) Data acquisition method, device, equipment and computer readable medium
CN112751722A (en) Data transmission quality monitoring method and system
CN111209138A (en) Operation and maintenance method and device of data storage system
CN115952227A (en) Data acquisition system and method, electronic device and storage medium
CN114238018B (en) Method, system and device for detecting integrity of log collection file and storage medium
US11782873B2 (en) System and method for managing timeseries data
US20130290385A1 (en) Durably recording events for performing file system operations
CN112448840B (en) Communication data quality monitoring method, device, server and storage medium
JP2009181494A (en) Job processing system and job information acquisition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant