CN113918631A - Data writing method and device, computer equipment and storage medium - Google Patents

Publication number
CN113918631A
Authority
CN
China
Prior art keywords
current data
current
record
data record
mark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111105250.8A
Other languages
Chinese (zh)
Inventor
陈晓欣
郭小龙
孙迁
李成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Suning Software Technology Co ltd
Original Assignee
Nanjing Suning Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Suning Software Technology Co ltd filed Critical Nanjing Suning Software Technology Co ltd
Priority to CN202111105250.8A priority Critical patent/CN113918631A/en
Publication of CN113918631A publication Critical patent/CN113918631A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a data writing method, a data writing device, computer equipment and a storage medium. The method comprises: obtaining a current data set of a current real-time batch, wherein the current data set comprises at least one current data record; performing business logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record; writing the current data record into a target business database according to the current data mark; when a write switch of a third-party storage engine is in an open state, operating on the current data record according to the current data mark to obtain a new current data set; and submitting the new current data set to a thread pool and writing it into the third-party storage engine. By introducing the third-party storage engine and writing the business data of each batch into it in real time, the method effectively reduces data delay in the ingestion process and removes the dependence on an IDE scheduling platform.

Description

Data writing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data writing method and apparatus, a computer device, and a storage medium.
Background
In the internet era, information develops at high speed and data expands extremely fast, and the continuous growth of enterprise business generates a large amount of business data. How to extract information useful for enterprise analysis and decision-making from this mass of data has become a primary problem for enterprise decision makers, and as market competition intensifies day by day, enterprises place increasing emphasis on the accuracy and timeliness of decisions. OLAP (online analytical processing) therefore emerged and has risen rapidly; its main purpose is to support decision management and analysis by providing analysts with efficient, rapid and accurate decision information.
Traditional OLAP real-time non-time-series data has no main time dimension and needs to be frequently inserted, updated and deleted according to business logic, so it is written into a PostgreSQL (PG) database in batches. For export, an external IDE (Integrated Development Environment) scheduling platform generates a corresponding task, and the task performs a full synchronization of the PG database on a fixed schedule and writes partial files. This approach suffers from a data delay of more than one hour, the task scheduling consumes the computing resources of the platform and is limited by the concurrency of the IDE, and when the amount of data is very large, the pressure on the PG database can become so high that the database is unusable.
Disclosure of Invention
Therefore, in view of the above technical problems, it is necessary to provide a data writing method and device, computer equipment and a storage medium that introduce a third-party storage engine into which service data is written in real time, batch by batch. This effectively reduces data delay in the ingestion process, removes the dependence on an IDE scheduling platform and thereby eliminates the influence of platform concurrency limits, and separates data export from data query so that use of the PG database by existing services is not affected.
A method of writing data, the method comprising:
acquiring a current data set of a current real-time batch, wherein the current data set comprises at least one current data record;
performing service logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record;
writing the current data record into a target service database according to the current data mark, and operating the current data record according to the current data mark when a write-in switch of a third-party storage engine is in an open state to obtain a new current data set;
and submitting the new current data set to a thread pool and writing the new current data set into a third-party storage engine.
In one embodiment, the method further comprises the following steps: receiving a current query request through a query interface provided by a third-party storage engine, packaging the current query request into a current query task, splicing according to a current model name of the current query task to obtain a current storage engine data file writing path, generating a current temporary table name according to the current model name, the third-party storage engine name and a current timestamp, replacing the current model name in the current storage engine data file writing path with the current temporary table name, generating a new current storage engine data file writing path, and obtaining a target query result corresponding to the current query request according to the new current storage engine data file writing path.
In one embodiment, the method further comprises the following steps: the method comprises the steps of receiving a current export request through an export interface provided by a third-party storage engine, obtaining a default distributed file system name according to the current export request, obtaining a current model system name and a current model name, generating a current storage engine data export path according to the distributed file system name, the current model system name and the current model name, and exporting target export data corresponding to the current export request according to the current storage engine data export path when the current storage engine data export path exists in the third-party storage engine.
In one embodiment, obtaining the current data set of the current real-time batch comprises: acquiring a non-time-series data set within a preset time period, and parsing and widening the non-time-series data set to obtain the current data set of the current real-time batch.
In one embodiment, writing the current data record into the target service database according to the current data mark comprises: determining, according to the primary key of the current data record, whether a record with the same primary key exists in the target business database; when a record with the same primary key exists in the target business database, determining whether the operation field of the current data record is a delete field; when the operation field of the current data record is not a delete field, acquiring a first data record version number corresponding to the record with the same primary key in the target business database; when the version number of the current data record is greater than the first data record version number, determining that the current data mark corresponding to the current data record is a current data update mark; and replacing the record with the same primary key in the target business database with the current data record according to the current data update mark.
In one embodiment, the method further comprises: when no record with the same primary key exists in the target service database, determining that the current data mark corresponding to the current data record is a current data addition mark, and adding the current data record to the target service database according to the current data addition mark; and when the operation field of the current data record is a delete field, determining that the current data mark corresponding to the current data record is a current data deletion mark, and deleting the current data record from the target service database according to the current data deletion mark.
In one embodiment, there are a plurality of current data records, and operating on the current data records according to the current data marks to obtain a new current data set comprises: obtaining a first current data set from the current data records whose current data mark is a current data update mark or a current data addition mark, obtaining a second current data set from the current data records whose current data mark is a current data deletion mark, and determining the first current data set and the second current data set as the new current data set.
A data writing apparatus, the apparatus comprising:
the acquisition module is used for acquiring a current data set of a current real-time batch, wherein the current data set comprises at least one current data record;
the judging module is used for carrying out service logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record;
the first writing module is used for writing the current data record into the target business database according to the current data mark, and when a writing switch of the third-party storage engine is in an open state, operating the current data record according to the current data mark to obtain a new current data set;
and the second writing module is used for submitting the new current data set to the thread pool and writing the new current data set into the third-party storage engine.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring a current data set of a current real-time batch, wherein the current data set comprises at least one current data record;
performing service logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record;
writing the current data record into a target service database according to the current data mark, and operating the current data record according to the current data mark when a write-in switch of a third-party storage engine is in an open state to obtain a new current data set;
and submitting the new current data set to a thread pool and writing the new current data set into a third-party storage engine.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a current data set of a current real-time batch, wherein the current data set comprises at least one current data record;
performing service logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record;
writing the current data record into a target service database according to the current data mark, and operating the current data record according to the current data mark when a write-in switch of a third-party storage engine is in an open state to obtain a new current data set;
and submitting the new current data set to a thread pool and writing the new current data set into a third-party storage engine.
The data writing method, the data writing device, the computer equipment and the storage medium acquire a current data set of a current real-time batch, wherein the current data set comprises at least one current data record, business logic judgment is carried out on the current data record in the current data set to obtain a current data mark corresponding to the current data record, the current data record is written into a target business database according to the current data mark, when a write-in switch of a third-party storage engine is in an open state, the current data record is operated according to the current data mark to obtain a new current data set, and the new current data set is submitted to a thread pool and written into the third-party storage engine.
The method introduces a third-party storage engine into which business data is written in real time, batch by batch. This effectively reduces data delay in the ingestion process, removes the need for an IDE scheduling platform and thereby eliminates the influence of platform concurrency limits, and separates data export from data query so that use of the PG database by existing services is not affected. It also solves the problem of traditional real-time non-time-series export, in which the PG database must be fully synchronized on an hourly schedule: the data volume is huge and operations are frequent, so the PG database is put under heavy pressure and PG queries are seriously affected.
Drawings
FIG. 1 is a diagram of an application environment of the data writing method in one embodiment;
FIG. 2 is a flow chart illustrating a data writing method according to an embodiment;
FIG. 3 is a flow chart illustrating a data writing method according to an embodiment;
FIG. 4 is a flow chart illustrating a data writing method according to an embodiment;
FIG. 5 is a flowchart illustrating the step of writing to the target business database in one embodiment;
FIG. 6 is a flow diagram that illustrates the steps of the current data set operation in one embodiment;
FIG. 7 is a flow diagram illustrating writing to the PG database and the third-party storage engine in one embodiment;
FIG. 8 is a block diagram showing the structure of a data writing apparatus according to one embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The data writing method provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
Specifically, the terminal 102 obtains a current data set of the current real-time batch, where the current data set includes at least one current data record, and sends the current data set to the server 104 through the network. The server 104 performs service logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record, writes the current data record into the target service database according to the current data mark, operates the current data record according to the current data mark when a write-in switch of the third-party storage engine is in an on state to obtain a new current data set, submits the new current data set to the thread pool, and writes the new current data set into the third-party storage engine.
In one embodiment, as shown in fig. 2, a data writing method is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:
Step 202, obtaining a current data set of the current real-time batch, wherein the current data set comprises at least one current data record.
The current data set of the current real-time batch is the set of data records corresponding to the batch currently being processed. The current data set comprises at least one current data record, and each data record has multiple fields, such as name, age, gender and home address.
In one embodiment, obtaining the current data set of the current real-time batch comprises: acquiring a non-time-series data set within a preset time period, and parsing and widening the non-time-series data set to obtain the current data set of the current real-time batch.
The current real-time batch may be a batch within a preset time period; for example, the data of every 15 seconds forms one batch. The preset time period may be set in advance according to actual service requirements, product requirements or the actual application scenario. Specifically, a non-time-series data set within the preset time period is obtained, where a non-time-series data set is a set of data records that are not recorded in chronological order against a single unified index. The non-time-series data set is then parsed and widened to generate the current data set corresponding to the current real-time batch. The non-time-series data set may, for example, be a set of non-time-series consumption data records.
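As a non-limiting illustration of the parsing and widening step, the sketch below flattens the nested records of one batch into single-level ("wide") rows. The application does not specify how widening is performed, so the JSON input format, the underscore-joined column names and the function names widen_record and build_current_data_set are all assumptions made for this example.

```python
import json
from typing import Any, Dict, List


def widen_record(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested dict into one level (hypothetical widening rule)."""
    wide: Dict[str, Any] = {}
    for key, value in record.items():
        # Nested keys are joined with "_" (an assumed naming convention).
        column = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            wide.update(widen_record(value, column))
        else:
            wide[column] = value
    return wide


def build_current_data_set(raw_messages: List[str]) -> List[Dict[str, Any]]:
    """Parse the raw messages of one batch and widen each into a flat record."""
    return [widen_record(json.loads(message)) for message in raw_messages]


# One batch containing a single nested message.
batch = ['{"id": 1, "user": {"name": "a", "age": 30}}']
current_data_set = build_current_data_set(batch)
```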
Step 204, performing service logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record.
The service logic judgment evaluates the relevant service rules. It may be performed separately on every current data record in the current data set, yielding a current data mark for each current data record. The current data mark is the operation mark of the current data record and includes, but is not limited to, a current data update mark, a current data addition mark and a current data deletion mark. Different current data marks correspond to different operations on the current data record, and the corresponding current data record can be operated on according to its current data mark.
Step 206, writing the current data record into the target service database according to the current data mark, and when a write switch of the third-party storage engine is in an open state, operating on the current data record according to the current data mark to obtain a new current data set.
Step 208, submitting the new current data set to a thread pool, and writing the new current data set into the third-party storage engine.
The target business database is a PostgreSQL (PG) database, and the third-party storage engine is Apache Hudi (Hadoop Upserts Deletes and Incrementals), which may also be called the Hudi data lake.
Specifically, after the current data mark corresponding to the current data record is obtained, the current data record may be written into the target service database, that is, into the PG database, according to the current data mark. At the same time, the state of the write switch of the third-party storage engine is acquired, and when the write switch is in the open state, the current data record is operated on according to the current data mark to generate a new current data set.
Finally, the new current data set is submitted to the thread pool and written into the third-party storage engine. The thread pool includes a plurality of threads, through which the new current data set can be written into the Hudi data lake.
After each batch of data is written into the PG database, the same batch can be synchronously written into the third-party storage engine, namely the Hudi data lake. This keeps the data consistent, makes data writing near real-time with essentially no delay, and removes the dependence on an external platform. It also solves the problem of traditional real-time non-time-series export, in which the PG database must be fully synchronized on an hourly schedule: the data volume is huge and operations are frequent, so the PG database is put under heavy pressure and PG queries are seriously affected.
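Steps 206 and 208 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the PG database and the Hudi data lake are replaced by in-memory dictionaries, the mark values "add", "update" and "delete", the flag HUDI_WRITE_SWITCH_ON and all function names are hypothetical, and the standard-library ThreadPoolExecutor stands in for the thread pool.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-ins for the PG database and the Hudi data lake.
pg_table: Dict[int, dict] = {}
hudi_table: Dict[int, dict] = {}
HUDI_WRITE_SWITCH_ON = True  # assumed configuration flag for the write switch


def apply_to_store(store: Dict[int, dict], record: dict, mark: str) -> None:
    """Apply one marked record to a key-value store keyed by record["id"]."""
    if mark == "delete":
        store.pop(record["id"], None)
    else:  # "add" or "update"
        store[record["id"]] = record


def write_to_hudi(upserts: List[dict], deletes: List[dict]) -> int:
    """Write the regrouped data set to the stand-in lake; runs on a pool thread."""
    for record in upserts:
        apply_to_store(hudi_table, record, "update")
    for record in deletes:
        apply_to_store(hudi_table, record, "delete")
    return len(upserts) + len(deletes)


def process_batch(
    marked: List[Tuple[dict, str]], pool: ThreadPoolExecutor
) -> Optional[Future]:
    # Write every record to the business database according to its mark.
    for record, mark in marked:
        apply_to_store(pg_table, record, mark)
    # Skip the second write entirely when the write switch is off.
    if not HUDI_WRITE_SWITCH_ON:
        return None
    # Regroup the batch into upserts and deletes to form the new data set.
    upserts = [r for r, m in marked if m in ("add", "update")]
    deletes = [r for r, m in marked if m == "delete"]
    return pool.submit(write_to_hudi, upserts, deletes)


with ThreadPoolExecutor(max_workers=4) as pool:
    future = process_batch(
        [({"id": 1, "v": 1}, "add"), ({"id": 2, "v": 1}, "add")], pool
    )
    written = future.result()
```

Submitting the second write to a pool keeps the batch loop from blocking on the lake write; whether the original system waits for the result before starting the next batch is not stated in the application.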
The data writing method includes the steps of obtaining a current data set of a current real-time batch, wherein the current data set comprises at least one current data record, conducting business logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record, writing the current data record into a target business database according to the current data mark, operating the current data record according to the current data mark when a write-in switch of a third-party storage engine is in an open state to obtain a new current data set, submitting the new current data set to a thread pool, and writing the new current data set into the third-party storage engine.
In one embodiment, as shown in fig. 3, the method further comprises:
Step 302, receiving a current query request through a query interface provided by the third-party storage engine, and encapsulating the current query request into a current query task.
Step 304, splicing a current storage engine data file write path according to the current model name of the current query task.
After the current data set is written into the third-party storage engine, in order to verify whether the current data set is correctly written according to the business logic, a query interface of the third-party storage engine can be provided, and whether the current data set is correctly written according to the business logic is verified through the query interface.
Specifically, the current query request is received through the query interface provided by the third-party storage engine and is encapsulated into a current query task, for example into a QueryHudiTask(modelName: String, querySql: String), and a Worker (a small application) is then selected to execute the task.
Further, the current model name of the current query task is acquired, and the current storage engine data file write path is spliced according to the current model name. For example, the Worker splices the specific engine data file write path according to the model name modelName in the QueryHudiTask.
Step 306, generating a current temporary table name according to the current model name, the third-party storage engine name and the current timestamp.
Step 308, replacing the current model name in the current storage engine data file write path with the current temporary table name to generate a new current storage engine data file write path.
Step 310, obtaining a target query result corresponding to the current query request according to the new current storage engine data file write path.
Specifically, the current temporary table name may be generated from the current model name, the third-party storage engine name and the current timestamp, so that a unique temporary table name composed of the model name, the engine name and the timestamp is obtained. The engine file write path is then read and registered as the temporary table, the current model name in the current storage engine data file write path is replaced with the current temporary table name, and a new current storage engine data file write path is generated. For example, the model name modelName referenced in querySql (the query statement) is replaced with the current temporary table name, generating a new querySql.
Finally, a target query result corresponding to the current query request is obtained according to the new current storage engine data file write path, which verifies whether the current data set was written into the Hudi data lake according to the correct service logic. For example, the third-party storage engine executes the new querySql statement and returns the query result to the query end.
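The temporary table naming and query rewriting described above might look like the following sketch. The underscore separator, the millisecond timestamp, the whole-word replacement rule and the function names are assumptions for illustration; the application states only that the name combines the model name, the engine name and a timestamp. Reading the engine files and registering the temporary table are omitted here.

```python
import re


def make_temp_table_name(model_name: str, engine_name: str, timestamp_ms: int) -> str:
    """Combine model name, engine name and timestamp (the separator is assumed)."""
    return f"{model_name}_{engine_name}_{timestamp_ms}"


def rewrite_query(query_sql: str, model_name: str, temp_table: str) -> str:
    """Replace whole-word occurrences of the model name with the temp table name."""
    pattern = r"\b" + re.escape(model_name) + r"\b"
    # A callable replacement avoids any special handling of backslashes.
    return re.sub(pattern, lambda _match: temp_table, query_sql)


temp_table = make_temp_table_name("orders", "hudi", 1700000000000)
new_sql = rewrite_query(
    "SELECT count(*) FROM orders WHERE amount > 10", "orders", temp_table
)
```

Including the timestamp in the name keeps concurrent queries against the same model from colliding on one temporary table, which is presumably why the application requires the name to be unique.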
In one embodiment, as shown in fig. 4, the method further comprises:
Step 402, receiving a current export request through an export interface provided by the third-party storage engine, and obtaining a default distributed file system name according to the current export request.
Step 404, acquiring a current model system name and a current model name, and generating a current storage engine data export path according to the distributed file system name, the current model system name and the current model name.
Step 406, when the current storage engine data export path exists in the third-party storage engine, exporting target export data corresponding to the current export request according to the current storage engine data export path.
For data export from the third-party storage engine, only one export interface needs to be provided externally, which returns the engine file write path corresponding to the model. Specifically, the current export request is received through the export interface provided by the third-party storage engine, and the default distributed file system name is obtained according to the current export request. For example, the default HDFS (Hadoop Distributed File System) file system name, such as hdfs://routerprd, is obtained as follows: sparkContext.hadoopConfiguration.getRaw("fs.defaultFS").
Further, the current model system name and the current model name are acquired, and the distributed file system name, the current model system name and the current model name are spliced to obtain the current storage engine data export path. For example, the model system name systemId and the model name modelName passed to the interface are combined according to the engine file path generation rule to form the specific file write path /user/bigquery/hudi/systemId_modelName. The HDFS file system name and the engine file write path are then spliced to generate the absolute path returned externally (namely the current storage engine data export path): hdfs://routerprd/user/bigquery/hudi/systemId_modelName.
Finally, because the current storage engine data export path is assembled according to a fixed rule rather than looked up, the external caller needs to check whether the path really exists after obtaining it; if it does, the caller can read the path directly to export the data.
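The path assembly and the caller-side existence check can be sketched as below. The base directory /user/bigquery/hudi and the systemId_modelName rule come from the example above; the function names, the sample identifiers and the use of a plain set in place of a real file system lookup are assumptions for illustration.

```python
from typing import Optional, Set


def build_export_path(
    default_fs: str,
    system_id: str,
    model_name: str,
    base_dir: str = "/user/bigquery/hudi",
) -> str:
    """Splice the file system name and the engine file write path."""
    return f"{default_fs.rstrip('/')}{base_dir}/{system_id}_{model_name}"


def resolve_export_path(path: str, existing_paths: Set[str]) -> Optional[str]:
    """Return the path only if it exists; the set stands in for an HDFS check."""
    return path if path in existing_paths else None


export_path = build_export_path("hdfs://routerprd", "sys1", "orders")
known_paths = {"hdfs://routerprd/user/bigquery/hudi/sys1_orders"}
readable = resolve_export_path(export_path, known_paths)
missing = resolve_export_path(
    build_export_path("hdfs://routerprd", "sys1", "refunds"), known_paths
)
```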
Export and query are thus separated and do not affect the use of the PG database; data can be exported or queried using only the corresponding path.
In an embodiment, as shown in fig. 5, the writing of the current data record into the target service database according to the current data mark includes:
Step 502, determining whether the same primary key record exists in the target service database according to the current data primary key.
Step 504, when the same primary key record exists in the target service database, determining whether the current data record operation field is a deletion field.
Step 506, when the current data record operation field is a non-deleted field, acquiring a first data record version number corresponding to the same primary key record in the target service database.
The current data record comprises a current data record primary key, a current data record operation field and a current data record version number. The current data record primary key uniquely identifies the current data record, and different current data records correspond to different primary keys. The current data record operation field is the field that indicates the operation associated with the current data record, and the current data record version number is the version number of the current data record; every data record has a version number. The version number is generally derived from time: the later the time, the newer the data, and the newest data is what the user most wants.
Specifically, it is checked whether a data record with the same primary key as the current data record exists in the target service database (the PG database). The primary key is a database concept: it uniquely identifies one record in the database, and the database rejects an attempt to insert another record with the same primary key. If a data record with the same primary key exists in the target business database, it is then checked whether the operation field of the current data record is a delete field.
If the operation field of the current data record is not a delete field, the first data record version number is acquired, that is, the version number of the data record in the target service database whose primary key is the same as that of the current data record.
Step 508, when the version number of the current data record is greater than the first data record version number, determining that the current data mark corresponding to the current data record is the current data update mark.
Step 510, replacing the data record with the same primary key in the target service database with the current data record according to the current data update mark.
Specifically, it is determined whether the version number of the current data record is greater than the first data record version number. If it is, the current data record is more recent than the data record in the target service database. The current data update mark indicates that the stored record is to be updated with the current data record, so the data record with the same primary key in the target service database is replaced by the current data record according to the current data update mark, thereby updating the data record in the target service database.
Step 512, when no record with the same primary key exists in the target service database, determining that the current data mark corresponding to the current data record is the current data newly added mark.
Step 514, adding the current data record to the target service database according to the current data newly added mark.
Specifically, if no record with the same primary key exists in the target service database, the current data record is not yet present there, so the current data mark corresponding to the current data record can be determined directly as the current data newly added mark. The current data newly added mark indicates that the current data record is to be added to the target service database, so the current data record is added to the target service database as a new data record according to that mark.
Step 516, when the current data record operation field is a deletion field, determining that the current data mark corresponding to the current data record is the current data deletion mark.
Step 518, deleting the current data record from the target service database according to the current data deletion mark.
Specifically, if the current data record operation field is a deletion field, the current data record needs to be deleted from the target service database, so the current data mark corresponding to the current data record is determined to be the current data deletion mark. The current data deletion mark indicates that the current data record is to be deleted from the target service database, so the current data record is deleted from the target service database according to that mark.
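The judgment flow of steps 502 to 518 can be illustrated with a minimal Python sketch. It is not the implementation of this application: the `Record` class, the mark strings "insert", "update" and "delete", the encoding of the deletion field as `op == "delete"`, and the in-memory dictionary standing in for the target service database are all assumptions made for illustration. How a record whose version number is not greater than the stored one should be handled is not stated above; the sketch simply skips it.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Record:
    primary_key: str   # uniquely identifies the record
    op: str            # operation field; "delete" is an assumed encoding of the deletion field
    version: int       # version number, e.g. derived from a timestamp
    payload: str = ""  # remaining business columns, collapsed into one string here


def mark_record(current: Record, db: Dict[str, Record]) -> Optional[Tuple[str, Record]]:
    """Return (mark, record) following steps 502-518, or None if nothing is to be written."""
    existing = db.get(current.primary_key)  # step 502: look up the same primary key
    if existing is None:
        # Steps 512/514: following the flow as written, a record with no match
        # is marked as newly added regardless of its operation field.
        return ("insert", current)
    if current.op == "delete":              # steps 504/516: deletion field
        return ("delete", current)
    if current.version > existing.version:  # steps 506/508: compare version numbers
        return ("update", current)
    return None  # assumption: a stale or equal version is skipped


def apply_mark(mark: Optional[Tuple[str, Record]], db: Dict[str, Record]) -> None:
    """Write one marked record into the stand-in database (steps 510, 514, 518)."""
    if mark is None:
        return
    action, record = mark
    if action == "delete":
        db.pop(record.primary_key, None)
    else:  # "insert" adds the record, "update" replaces the stored one
        db[record.primary_key] = record
```

In a real deployment the dictionary lookup would be a primary-key query against the pg database, and `apply_mark` would issue the corresponding insert, update or delete statement.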
In one embodiment, as shown in fig. 6, there are a plurality of current data records, and operating the current data record according to the current data mark to obtain a new current data set includes:
Step 602, obtaining a first current data set from the current data records whose current data mark is the current data update mark or the current data newly added mark.
Step 604, obtaining a second current data set from the current data records whose current data mark is the current data deletion mark, and determining the first current data set and the second current data set as the new current data set.
If there are a plurality of current data records, each current data record has a corresponding current data mark, and the current data set can be divided according to these marks. Specifically, the current data records whose mark is the current data update mark or the current data newly added mark form the first current data set, the current data records whose mark is the current data deletion mark form the second current data set, and the first current data set and the second current data set together form the new current data set.
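The division into the two sets can be sketched as follows. This is only an illustration: the `(mark, record)` tuple layout and the mark strings are assumptions, and records are treated as opaque values.

```python
from typing import Iterable, List, Tuple, TypeVar

R = TypeVar("R")


def split_by_mark(marked: Iterable[Tuple[str, R]]) -> Tuple[List[R], List[R]]:
    """Split marked records into (first set: added or updated, second set: deleted)."""
    first_set: List[R] = []
    second_set: List[R] = []
    for action, record in marked:
        if action in ("insert", "update"):
            first_set.append(record)
        elif action == "delete":
            second_set.append(record)
        else:
            raise ValueError(f"unknown mark: {action!r}")
    return first_set, second_set
```

Keeping deletions in a separate set lets the downstream writer handle them with a different operation from the one used for additions and updates.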
In one embodiment, FIG. 7 illustrates a flow diagram of writing data to the pg database and the third-party storage engine. Real-time data in a database is consumed, with one batch taken every 15 seconds. The data of each batch is analyzed and widened to generate the data set of the current real-time batch, and this data set is grouped by primary key so that the record with the largest version number in each group is obtained. The pg database is initialized, and service logic judgment is then performed on each record in the data set of the current real-time batch. Specifically, it is judged whether a record with the same primary key exists in the pg database. If so, it is judged whether the record is marked as deleted by the service; if it is, the record is marked as ("delete", record). Otherwise, if the record is not marked as deleted by the service, it is judged whether the version number of the record is greater than the version number of the record with the same primary key in the pg database; if so, the record is marked as ("update", record).
If no record with the same primary key exists in the pg database, the record is marked as ("insert", record). This yields the current data set, in which every data record carries its mark.
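The grouping step, which keeps only the record with the largest version number for each primary key in a batch, can be sketched as below. The `(primary_key, version, payload)` tuple layout is an assumption for illustration, as is the choice to keep the first record seen when two records share the same version number.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, str]  # (primary_key, version, payload); hypothetical layout


def latest_per_key(batch: Iterable[Row]) -> List[Row]:
    """Group a batch by primary key and keep the row with the largest version."""
    latest: Dict[str, Row] = {}
    for row in batch:
        key, version, _payload = row
        kept = latest.get(key)
        # Strictly greater: on equal versions the first row seen is kept (assumption).
        if kept is None or version > kept[1]:
            latest[key] = row
    return list(latest.values())
```

Reducing the batch this way before the judgment means each primary key is compared against the pg database at most once per batch.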
Further, the marked current data set is persisted through persist (provided by the system framework), and each record in the current data set is written into the pg database according to its mark (delete/add/update). Meanwhile, it is judged whether the export switch for the hudi data lake is on. If the switch is on, an appendRdd (the data set of newly added or updated records) and a deleteRdd (the data set of deleted records) are constructed according to the record marks and submitted to the thread pool to be written into the hudi data lake, which completes the writing of every record in the current data set of the current real-time batch.
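The switch check and the hand-off to the thread pool can be sketched with Python's standard `concurrent.futures` module. This is a sketch under stated assumptions, not the writer used in this application: `write_fn` is a hypothetical callable standing in for the actual write to the data lake, and no Hudi API is used or implied.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def submit_lake_write(
    pool: ThreadPoolExecutor,
    switch_on: bool,
    append_set: Sequence[str],
    delete_set: Sequence[str],
    write_fn: Callable[[List[str], List[str]], None],
) -> Optional[Future]:
    """Submit the two sets to the pool when the switch is on; otherwise do nothing."""
    if not switch_on:
        return None  # switch closed: only the service database is written
    # Copy the sets so later changes by the caller cannot affect the background write.
    return pool.submit(write_fn, list(append_set), list(delete_set))
```

Because the write runs on a pool thread, the write to the service database does not wait for the slower write to the third-party storage engine; a caller that needs confirmation can call `result()` on the returned future.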
It should be understood that, although the steps in the above-described flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the above-described flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a data writing apparatus 800 comprising: an obtaining module 802, a determining module 804, a first writing module 806, and a second writing module 808, wherein:
The obtaining module 802 is configured to obtain a current data set of a current real-time batch, where the current data set includes at least one current data record.
The determining module 804 is configured to perform service logic determination on the current data record in the current data set, so as to obtain a current data mark corresponding to the current data record.
The first writing module 806 is configured to write the current data record into the target service database according to the current data flag, and when a write switch of the third-party storage engine is in an on state, operate the current data record according to the current data flag to obtain a new current data set.
The second writing module 808 is configured to submit the new current data set to the thread pool and write the new current data set into the third-party storage engine.
In one embodiment, the data writing apparatus 800 receives a current query request through a query interface provided by a third-party storage engine, encapsulates the current query request into a current query task, obtains a current storage engine data file writing path by splicing according to a current model name of the current query task, generates a current temporary table name according to the current model name, the third-party storage engine name and a current timestamp, replaces the current model name in the current storage engine data file writing path with the current temporary table name, generates a new current storage engine data file writing path, and obtains a target query result corresponding to the current query request according to the new current storage engine data file writing path.
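The path handling for a query can be sketched as below. The base directory, the "/" separator and the underscore-joined format of the temporary table name are assumptions for illustration; the text above only states which inputs each name is derived from.

```python
import time
from typing import Optional, Tuple


def build_query_paths(
    base_dir: str,
    model_name: str,
    engine_name: str,
    timestamp_ms: Optional[int] = None,
) -> Tuple[str, str, str]:
    """Return (write path, temporary table name, new write path) for one query task."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # Write path spliced from the model name (base_dir is a hypothetical root).
    write_path = f"{base_dir.rstrip('/')}/{model_name}"
    # Temporary table name from model name, engine name and timestamp (assumed format).
    temp_table = f"{model_name}_{engine_name}_{timestamp_ms}"
    # Replace only the final path segment, so a base directory that happens to
    # contain the model name is left untouched.
    prefix, _sep, _last = write_path.rpartition("/")
    new_path = f"{prefix}/{temp_table}"
    return write_path, temp_table, new_path
```

Including the timestamp in the temporary table name keeps the names generated for different query requests on the same model distinct, as long as the requests do not fall in the same millisecond.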
In one embodiment, the data writing apparatus 800 receives a current export request through an export interface provided by a third-party storage engine, obtains a default distributed file system name according to the current export request, obtains a current model system name and a current model name, generates a current storage engine data export path according to the distributed file system name, the current model system name, and the current model name, and exports target export data corresponding to the current export request according to the current storage engine data export path when the third-party storage engine has the current storage engine data export path.
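The export path construction and the existence check can be sketched as follows. The "/"-joined layout and the two callables `path_exists` and `export_fn` are hypothetical stand-ins; the text above only states that the path is generated from the distributed file system name, the model system name and the model name, and that export happens when the path exists.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def build_export_path(fs_name: str, model_system: str, model_name: str) -> str:
    """Join file system name, model system name and model name into one path."""
    if not (fs_name and model_system and model_name):
        raise ValueError("all three path components are required")
    return "/".join([fs_name.rstrip("/"), model_system.strip("/"), model_name.strip("/")])


def export_if_present(
    path: str,
    path_exists: Callable[[str], bool],
    export_fn: Callable[[str], T],
) -> Optional[T]:
    """Export only when the path exists in the storage engine; otherwise return None."""
    if not path_exists(path):
        return None
    return export_fn(path)
```

A caller would pass functions backed by the actual file system client for `path_exists` and `export_fn`.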
In one embodiment, the obtaining module 802 obtains a non-time-series data set within a preset time period, and obtains the current data set of the current real-time batch by analyzing and widening the non-time-series data set.
In one embodiment, the current data record includes a current data record primary key, a current data record operation field, and a current data record version number. The determining module 804 determines, according to the current data record primary key, whether a record with the same primary key already exists in the target service database; when such a record exists, determines whether the current data record operation field is a deletion field; when the current data record operation field is a non-deletion field, acquires a first data record version number corresponding to the record with the same primary key in the target service database; when the version number of the current data record is greater than the first data record version number, determines that the current data mark corresponding to the current data record is the current data update mark; and replaces the data record with the same primary key in the target service database with the current data record according to the current data update mark.
In one embodiment, when no record with the same primary key exists in the target service database, the determining module 804 determines that the current data mark corresponding to the current data record is the current data newly added mark, and adds the current data record to the target service database according to the current data newly added mark. When the current data record operation field is a deletion field, the determining module 804 determines that the current data mark corresponding to the current data record is the current data deletion mark, and deletes the current data record from the target service database according to the current data deletion mark.
In one embodiment, there are a plurality of current data records. The second writing module 808 obtains a first current data set from the current data records whose current data mark is the current data update mark or the current data newly added mark, obtains a second current data set from the current data records whose current data mark is the current data deletion mark, and determines the first current data set and the second current data set as the new current data set.
For specific limitations of the data writing apparatus, reference may be made to the above limitations of the data writing method, which are not repeated here. Each module in the data writing apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store the current data set. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data writing method.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: obtaining a current data set of a current real-time batch, wherein the current data set comprises at least one current data record; performing service logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record; writing the current data record into a target service database according to the current data mark, and, when a write switch of a third-party storage engine is in an open state, operating the current data record according to the current data mark to obtain a new current data set; and submitting the new current data set to a thread pool and writing the new current data set into the third-party storage engine.
In one embodiment, the processor, when executing the computer program, further performs the steps of: receiving a current query request through a query interface provided by a third-party storage engine, packaging the current query request into a current query task, splicing according to a current model name of the current query task to obtain a current storage engine data file writing path, generating a current temporary table name according to the current model name, the third-party storage engine name and a current timestamp, replacing the current model name in the current storage engine data file writing path with the current temporary table name, generating a new current storage engine data file writing path, and obtaining a target query result corresponding to the current query request according to the new current storage engine data file writing path.
In one embodiment, the processor, when executing the computer program, further performs the steps of: receiving a current export request through an export interface provided by a third-party storage engine, and obtaining a default distributed file system name according to the current export request; obtaining a current model system name and a current model name, and generating a current storage engine data export path according to the distributed file system name, the current model system name and the current model name; and, when the current storage engine data export path exists in the third-party storage engine, exporting target export data corresponding to the current export request according to the current storage engine data export path.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a non-time-series data set within a preset time period, and analyzing and widening the non-time-series data set to obtain the current data set of the current real-time batch.
In one embodiment, the current data record includes a current data record primary key, a current data record operation field, and a current data record version number, and the processor, when executing the computer program, further performs the steps of: determining, according to the current data record primary key, whether a record with the same primary key exists in the target service database; when such a record exists, determining whether the current data record operation field is a deletion field; when the current data record operation field is a non-deletion field, acquiring a first data record version number corresponding to the record with the same primary key in the target service database; when the current data record version number is greater than the first data record version number, determining that the current data mark corresponding to the current data record is the current data update mark; and replacing the data record with the same primary key in the target service database with the current data record according to the current data update mark.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the same primary key record does not exist in the target service database, determining that the current data mark corresponding to the current data record is a current data adding mark, adding the current data record into the target service database according to the current data adding mark, determining that the current data mark corresponding to the current data record is a current data deleting mark when the current data record operation field is a deleting field, and deleting the current data record from the target service database according to the current data deleting mark.
In one embodiment, there are a plurality of current data records, and the processor, when executing the computer program, further performs the steps of: obtaining a first current data set from the current data records whose current data mark is the current data update mark or the current data newly added mark, obtaining a second current data set from the current data records whose current data mark is the current data deletion mark, and determining the first current data set and the second current data set as the new current data set.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, performs the steps of: obtaining a current data set of a current real-time batch, wherein the current data set comprises at least one current data record; performing service logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record; writing the current data record into a target service database according to the current data mark, and, when a write switch of a third-party storage engine is in an open state, operating the current data record according to the current data mark to obtain a new current data set; and submitting the new current data set to a thread pool and writing the new current data set into the third-party storage engine.
In one embodiment, the processor, when executing the computer program, further performs the steps of: receiving a current query request through a query interface provided by a third-party storage engine, packaging the current query request into a current query task, splicing according to a current model name of the current query task to obtain a current storage engine data file writing path, generating a current temporary table name according to the current model name, the third-party storage engine name and a current timestamp, replacing the current model name in the current storage engine data file writing path with the current temporary table name, generating a new current storage engine data file writing path, and obtaining a target query result corresponding to the current query request according to the new current storage engine data file writing path.
In one embodiment, the processor, when executing the computer program, further performs the steps of: receiving a current export request through an export interface provided by a third-party storage engine, and obtaining a default distributed file system name according to the current export request; obtaining a current model system name and a current model name, and generating a current storage engine data export path according to the distributed file system name, the current model system name and the current model name; and, when the current storage engine data export path exists in the third-party storage engine, exporting target export data corresponding to the current export request according to the current storage engine data export path.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a non-time-series data set within a preset time period, and analyzing and widening the non-time-series data set to obtain the current data set of the current real-time batch.
In one embodiment, the current data record includes a current data record primary key, a current data record operation field, and a current data record version number, and the processor, when executing the computer program, further performs the steps of: determining, according to the current data record primary key, whether a record with the same primary key exists in the target service database; when such a record exists, determining whether the current data record operation field is a deletion field; when the current data record operation field is a non-deletion field, acquiring a first data record version number corresponding to the record with the same primary key in the target service database; when the current data record version number is greater than the first data record version number, determining that the current data mark corresponding to the current data record is the current data update mark; and replacing the data record with the same primary key in the target service database with the current data record according to the current data update mark.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the same primary key record does not exist in the target service database, determining that the current data mark corresponding to the current data record is a current data adding mark, adding the current data record into the target service database according to the current data adding mark, determining that the current data mark corresponding to the current data record is a current data deleting mark when the current data record operation field is a deleting field, and deleting the current data record from the target service database according to the current data deleting mark.
In one embodiment, there are a plurality of current data records, and the processor, when executing the computer program, further performs the steps of: obtaining a first current data set from the current data records whose current data mark is the current data update mark or the current data newly added mark, obtaining a second current data set from the current data records whose current data mark is the current data deletion mark, and determining the first current data set and the second current data set as the new current data set.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of the technical features involves no contradiction, it should be considered to fall within the scope of the present specification.
The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of writing data, the method comprising:
acquiring a current data set of a current real-time batch, wherein the current data set comprises at least one current data record;
performing service logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record;
writing the current data record into a target service database according to the current data mark, and operating the current data record according to the current data mark when a write-in switch of a third-party storage engine is in an open state to obtain a new current data set;
and submitting the new current data set to a thread pool, and writing the new current data set into the third-party storage engine.
2. The method of claim 1, further comprising:
receiving a current query request through a query interface provided by the third-party storage engine, and packaging the current query request into a current query task;
splicing according to the current model name of the current query task to obtain a current storage engine data file writing path;
generating a current temporary table name according to the current model name, the third-party storage engine name and the current timestamp;
replacing the current model name in the current storage engine data file writing path with the current temporary table name to generate a new current storage engine data file writing path;
and obtaining a target query result corresponding to the current query request according to the new current storage engine data file write-in path.
3. The method of claim 1, further comprising:
receiving a current export request through an export interface provided by the third-party storage engine, and acquiring a default distributed file system name according to the current export request;
acquiring a current model system name and a current model name, and generating a current storage engine data export path according to the distributed file system name, the current model system name and the current model name;
and when the third-party storage engine has the current storage engine data export path, exporting target export data corresponding to the current export request according to the current storage engine data export path.
4. The method of claim 1, wherein obtaining the current dataset for the current real-time batch comprises:
acquiring a non-time sequence data set in a preset time period;
and analyzing and widening the non-time sequence data set to obtain the current data set of the current real-time batch.
5. The method of claim 1, wherein the current data record includes a current data record primary key, a current data record operation field, and a current data record version number, and wherein performing service logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record, and writing the current data record into a target service database according to the current data mark, comprises:
determining, according to the current data record primary key, whether a record with the same primary key exists in the target service database;
when a record with the same primary key exists in the target service database, determining whether the current data record operation field is a deletion field;
when the current data record operation field is a non-deletion field, acquiring a first data record version number corresponding to the record with the same primary key in the target service database;
when the version number of the current data record is greater than the first data record version number, determining that the current data mark corresponding to the current data record is a current data update mark;
and replacing the data record with the same primary key in the target service database with the current data record according to the current data update mark.
6. The method of claim 5, further comprising:
when the same primary key record does not exist in the target service database, determining that the current data mark corresponding to the current data record is a current data newly-added mark;
newly adding the current data record into the target service database according to the current data newly added mark;
when the current data record operation field is a deletion field, determining that a current data mark corresponding to the current data record is a current data deletion mark;
and deleting the current data record from the target service database according to the current data deletion mark.
7. The method according to claim 5 or 6, wherein there are a plurality of current data records, and operating the current data record according to the current data mark to obtain a new current data set comprises:
obtaining a first current data set from the current data records whose current data mark is the current data update mark or the current data newly added mark;
and obtaining a second current data set from the current data records whose current data mark is the current data deletion mark, and determining the first current data set and the second current data set as the new current data set.
8. A data writing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire a current data set of a current real-time batch, wherein the current data set comprises at least one current data record;
a judging module, configured to perform service logic judgment on the current data record in the current data set to obtain a current data mark corresponding to the current data record;
a first writing module, configured to write the current data record into a target service database according to the current data mark and, when a writing switch of a third-party storage engine is in an open state, to operate on the current data record according to the current data mark to obtain a new current data set;
and a second writing module, configured to submit the new current data set to a thread pool and write the new current data set into the third-party storage engine.
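The role of the writing switch and the thread pool in the second writing module can be sketched with Python's standard `concurrent.futures.ThreadPoolExecutor`. Everything else here is an assumption for illustration: `InMemoryEngine` is a stand-in for the unnamed third-party storage engine, `bulk_write` and `write_to_engine` are hypothetical names, and the `"upserts"`/`"deletes"` layout of the new current data set is one possible representation.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List, Optional


class InMemoryEngine:
    """Hypothetical stand-in for the third-party storage engine."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}

    def bulk_write(self, data_set: Dict[str, List[dict]]) -> None:
        # Apply updates and insertions, then deletions.
        for rec in data_set["upserts"]:
            self.docs[rec["id"]] = rec
        for rec in data_set["deletes"]:
            self.docs.pop(rec["id"], None)


def write_to_engine(
    new_data_set: Dict[str, List[dict]],
    engine: InMemoryEngine,
    pool: ThreadPoolExecutor,
    switch_on: bool,
) -> Optional[Future]:
    """Submit the new current data set to the pool when the switch is open."""
    if not switch_on:
        return None  # switch closed: nothing is sent to the engine
    # The write runs on a pool thread, so the caller is not blocked by it.
    return pool.submit(engine.bulk_write, new_data_set)
```

A caller would typically create the pool once, for example `ThreadPoolExecutor(max_workers=2)`, and call `result()` on the returned future only when it needs to confirm that the engine write finished.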
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111105250.8A CN113918631A (en) 2021-09-22 2021-09-22 Data writing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113918631A true CN113918631A (en) 2022-01-11

Family

ID=79235417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111105250.8A Pending CN113918631A (en) 2021-09-22 2021-09-22 Data writing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113918631A (en)

Similar Documents

Publication Publication Date Title
CN108920698B (en) Data synchronization method, device, system, medium and electronic equipment
CN111680008B (en) Log processing method and system, readable storage medium and intelligent device
WO2021217846A1 (en) Interface data processing method and apparatus, and computer device and storage medium
CN109446065A (en) User tag test method, device, computer equipment and storage medium
CN106445815B (en) A kind of automated testing method and device
CN109597979B (en) List table generation method and device, computer equipment and storage medium
CN110046155B (en) Method, device and equipment for updating feature database and determining data features
US11914574B2 (en) Generation of inconsistent testing data
CN114385722A (en) Interface attribute consistency checking method and device, electronic equipment and storage medium
CN114138907A (en) Data processing method, computer device, storage medium, and computer program product
CN107391539B (en) Transaction processing method, server and storage medium
CN107392560A (en) A kind of Excel list datas issue acquisition method and system based on internet
CN110515970B (en) Service processing method, device, computer equipment and storage medium
CN115329011A (en) Data model construction method, data query method, data model construction device and data query device, and storage medium
CN111090701B (en) Service request processing method, device, readable storage medium and computer equipment
US11556519B2 (en) Ensuring integrity of records in a not only structured query language database
CN113515518A (en) Data storage method and device, computer equipment and storage medium
CN113918631A (en) Data writing method and device, computer equipment and storage medium
CN115587247A (en) Method, device and equipment for monitoring user label and storage medium
CN113010550B (en) Batch object generation and batch processing method and device for structured data
CN110222290B (en) Page generation method and device, computer equipment and storage medium
CN114327377B (en) Method and device for generating demand tracking matrix, computer equipment and storage medium
CN110866036B (en) Data processing method, system, device, terminal and readable storage medium
CN115687512A (en) Risk data processing method, apparatus, device, medium, and computer program product
CN116738000A (en) Data storage relationship processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination