CN117880214A

CN117880214A - Data processing method and related device

Info

Publication number: CN117880214A
Application number: CN202410052596.3A
Authority: CN
Inventors: 王玲; 廖梓鸿; 何海锋
Original assignee: Guangzhou Dianjinshi Information Technology Co ltd
Current assignee: Guangzhou Dianjinshi Information Technology Co ltd
Priority date: 2024-01-12
Filing date: 2024-01-12
Publication date: 2024-04-12

Abstract

The invention relates to a data processing method and a related device. The method comprises: acquiring source data and determining the Kafka message queue topic to which the source data belongs; performing Avro serialization processing on the source data based on the schema matched with the topic to obtain serialized data, the serialized data containing identification information of the schema; publishing the serialized data to Kafka; obtaining the serialized data from Kafka; and determining the schema corresponding to the identification information in the serialized data and parsing the serialized data based on that schema to obtain the source data. The method reduces the storage space required for data storage, saves data transmission bandwidth, and shortens serialization time, thereby improving performance.

Description

Data processing method and related device

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and a related device.

Background

As a service grows rapidly, its user base grows with it, and the volume of reported data such as user behavior events keeps increasing, which poses a great challenge to the system that receives and processes these messages. At present, Kafka is used to transmit data in JSON format, which is simple, direct, and easy to understand, and makes it convenient to adopt the relevant features of Kafka quickly in the early stages of a business. This approach also performs well at small data volumes. At large data volumes, however, it readily causes problems such as consumption backlog, heavy storage usage, and program exceptions.

Disclosure of Invention

The present invention has been made in view of the above-mentioned problems, and it is an object of the present invention to provide a data processing method and related apparatus that overcomes or at least partially solves the above-mentioned problems.

In a first aspect, an embodiment of the present invention provides a data processing method, including the steps of:

acquiring source data, and determining a Kafka message queue topic to which the source data belongs;

performing Avro serialization processing on the source data based on the schema matched with the topic to obtain serialized data; the identification information of the schema exists in the serialized data;

publishing the serialized data to Kafka;

obtaining serialized data from Kafka;

determining a schema corresponding to the identification information in the serialized data, and analyzing the serialized data based on the schema to obtain the source data.

In one embodiment, performing Avro serialization processing on the source data based on the schema matched with the topic to obtain serialized data includes the steps of:

performing version verification on the schema matched with the topic to determine the schema of the corresponding version;

performing Avro serialization processing on the source data based on the schema of the corresponding version to obtain serialized data; and the identification information of the schema of the corresponding version exists in the serialized data.

In one embodiment, performing version verification on the schema matched with the topic to determine the schema of the corresponding version includes the step of:

if version verification of the schema matched with the topic fails, creating corresponding version information and registering the schema.

In one embodiment, before the Avro serialization process is performed on the source data based on the schema matched with the topic, the method further comprises the steps of:

registering the schema if no schema matched with the topic to which the source data belongs exists.

In a second aspect, an embodiment of the present invention provides a data processing apparatus, including:

the first acquisition module is used for acquiring source data and determining a Kafka message queue topic to which the source data belongs;

the serialization module is used for carrying out Avro serialization processing on the source data based on the schema matched with the topic to obtain serialized data; the identification information of the schema exists in the serialized data;

the publishing module is used for publishing the serialized data to Kafka;

a second acquisition module for acquiring the serialized data from Kafka;

the analysis module is used for determining a schema corresponding to the identification information in the serialized data, and analyzing the serialized data based on the schema to obtain the source data.

In one embodiment, the serialization module comprises:

the version verification sub-module is used for carrying out version verification on the schema matched with the topic and determining the schema of the corresponding version;

the serialization submodule is used for carrying out Avro serialization processing on the source data based on the schema of the corresponding version to obtain serialized data; and the identification information of the schema of the corresponding version exists in the serialized data.

In one embodiment, the version verification sub-module is further configured to create corresponding version information and register the schema if version verification of the schema matched with the topic fails.

In one embodiment, the apparatus further comprises:

the registration module is used for registering the schema if no schema matched with the topic to which the source data belongs exists.

In a third aspect, an embodiment of the present invention provides a computer apparatus, including:

one or more processors;

a memory for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the data processing method according to any embodiment of the first aspect.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium.

The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the data processing method according to any embodiment of the first aspect.

In the embodiments of the invention, source data is acquired and the Kafka message queue topic to which the source data belongs is determined; Avro serialization processing is performed on the source data based on the schema matched with the topic to obtain serialized data, the serialized data containing identification information of the schema; the serialized data is published to Kafka; the serialized data is obtained from Kafka; and the schema corresponding to the identification information in the serialized data is determined and the serialized data is parsed based on that schema to obtain the source data. This reduces the storage space required for data storage, saves data transmission bandwidth, and shortens serialization time, thereby improving performance.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a data processing method according to a first embodiment of the present invention;

FIG. 2 is a schematic diagram of a data processing apparatus according to a second embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.

To overcome or at least partially solve the above-mentioned problems, embodiments of the present application provide a data processing method, by which a storage space required for data storage can be reduced, and a bandwidth for data transmission can be saved. The following is a detailed description of examples.

Example I

Fig. 1 is a flowchart of a data processing method according to a first embodiment of the present invention, where the method may be performed by a data processing apparatus, and the data processing apparatus may be implemented by software and/or hardware, and may be configured in a computer device, for example, a server, a personal computer, or the like. The data processing method specifically comprises the following steps:

Step 101, acquiring source data, and determining a Kafka message queue topic to which the source data belongs.

Kafka is a distributed message queue (MQ) based on the publish/subscribe model and is mainly applied to real-time big data processing. The source data is generated by a message producer (Producer), which specifies a message queue topic when sending the source data to the Kafka message queue.

In this step, after the source data is acquired, the Kafka message queue topic to which the source data belongs is determined.

Step 102, performing Avro serialization processing on the source data based on the schema matched with the topic to obtain serialized data; the identification information of the schema exists in the serialized data.

Avro is a binary format. Compared with the JSON and XML formats, the Avro data format omits the tags or field names that those formats repeat in every data node, which reduces the storage space of the data, greatly saves data transmission bandwidth, and shortens serialization time to improve performance. It also compares favorably with these formats in parsing performance and platform compatibility.
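The size difference can be illustrated with a minimal sketch that hand-encodes one record using the Avro binary rules for `long` (zigzag plus variable-length encoding) and `string` (length prefix plus UTF-8 bytes), and compares it with the JSON text of the same record. The field names, the sample record, and the helper functions are assumptions made for illustration; a production system would use an Avro library rather than this hand-written encoder.

```python
import json


def encode_long(n: int) -> bytes:
    """Avro int/long encoding: zigzag, then variable-length base-128."""
    z = (n << 1) ^ (n >> 63)  # zigzag mapping (valid for 64-bit signed values)
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string encoding: byte length as a long, then the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw


# Hypothetical record schema; only the field order and types are needed here.
FIELDS = [("user_id", "long"), ("event", "string"), ("ts", "long")]


def encode_record(record: dict) -> bytes:
    """Concatenate the field values in schema order; no field names are written."""
    out = bytearray()
    for name, avro_type in FIELDS:
        value = record[name]
        out += encode_long(value) if avro_type == "long" else encode_string(value)
    return bytes(out)


record = {"user_id": 42, "event": "click", "ts": 1705017600}
json_size = len(json.dumps(record).encode("utf-8"))
avro_size = len(encode_record(record))
print(f"JSON: {json_size} bytes, Avro-style binary: {avro_size} bytes")
```

Because the field names live in the schema rather than in every message, the binary form carries only the values, which is where the storage and bandwidth savings described above come from.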

In this step, the source data is serialized according to the schema to obtain the serialized data. Later, when a message consumer (Consumer) consumes the message, it determines the same schema from the schema identification information contained in the serialized data and deserializes the serialized data with that schema to recover the source data.

For example, a user-defined serialization class can obtain, from the schema registry, the schema ID of the schema matched with the topic, allocate 4 bytes with ByteBuffer.allocate(4) to hold the schema ID, write those bytes to the data stream, and then perform the serialization processing, thereby implementing custom serialization of the source data.
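A minimal sketch of this framing step follows. It uses Python's `struct` module in place of the Java `ByteBuffer` mentioned above; the format `">i"` writes a big-endian signed 32-bit integer, which is the byte order a Java `ByteBuffer` uses by default. The function name, the schema ID value, and the sample payload are assumptions for illustration.

```python
import struct


def frame_with_schema_id(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an already serialized Avro payload with a 4-byte schema ID."""
    header = struct.pack(">i", schema_id)  # 4 bytes, big-endian
    return header + avro_payload


# Hypothetical payload: long 42 followed by the string "click" in Avro binary form.
payload = bytes([84, 10]) + b"click"
framed = frame_with_schema_id(7, payload)
print(framed.hex())
```

The consumer only needs to know that the first 4 bytes are the schema ID; everything after them is the Avro body.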

In one embodiment, before performing Avro serialization processing on the source data based on the schema matched with the topic in step 102, the method further includes the following step:

if no schema matched with the topic to which the source data belongs exists, registering the schema in the schema registry.
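The register-if-absent behavior can be sketched with an in-memory stand-in for the schema registry. The class `InMemorySchemaRegistry`, its methods, and the helper `ensure_schema` are hypothetical names introduced for illustration and do not correspond to any real registry client API.

```python
import json


class InMemorySchemaRegistry:
    """Hypothetical stand-in for a schema registry, keyed by topic."""

    def __init__(self):
        self._by_topic = {}  # topic -> (schema_id, canonical schema text)
        self._next_id = 1

    def lookup(self, topic):
        """Return (schema_id, schema_text) for the topic, or None if absent."""
        return self._by_topic.get(topic)

    def register(self, topic, schema: dict) -> int:
        """Store the schema for the topic and return its newly assigned ID."""
        schema_id = self._next_id
        self._next_id += 1
        self._by_topic[topic] = (schema_id, json.dumps(schema, sort_keys=True))
        return schema_id


def ensure_schema(registry, topic, schema: dict) -> int:
    """Register the schema only when the topic has none yet; return the ID."""
    entry = registry.lookup(topic)
    if entry is None:
        return registry.register(topic, schema)
    return entry[0]


registry = InMemorySchemaRegistry()
event_schema = {
    "type": "record",
    "name": "Event",
    "fields": [{"name": "user_id", "type": "long"}],
}
first_id = ensure_schema(registry, "user-events", event_schema)
second_id = ensure_schema(registry, "user-events", event_schema)
print(first_id, second_id)
```

Calling `ensure_schema` repeatedly for the same topic registers the schema once and returns the same ID afterwards.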

In one embodiment, the step 102 includes:

Substep A: performing version verification on the schema matched with the topic to determine the schema of the corresponding version;

Substep B: performing Avro serialization processing on the source data based on the schema of the corresponding version to obtain serialized data; the identification information of the schema of the corresponding version exists in the serialized data.

When multiple versions of the schema exist for the same topic, version verification must be performed on the schema matched with the topic to determine the schema of the corresponding version. Avro serialization processing is then performed on the source data based on the schema of that version to obtain the serialized data, so that messages of multiple versions remain compatible.

In substep A, if version verification of the schema matched with the topic fails, corresponding version information is created and the schema is registered.

For example, if version verification of the schema matched with the topic fails, that is, the schema registry holds no schema of the corresponding version, corresponding version information must be created in the schema registry and the schema registered in the schema registry based on that version information.
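The version check can be sketched as follows. The text does not specify how a version is verified, so this sketch assumes that verification means looking for an identical canonical schema text among the versions already registered for the topic; the class `VersionedRegistry` and its methods are hypothetical.

```python
import json


class VersionedRegistry:
    """Hypothetical registry that keeps several schema versions per topic."""

    def __init__(self):
        self._versions = {}  # topic -> list of canonical schema texts (index + 1 = version)
        self._ids = {}       # (topic, version) -> schema ID
        self._next_id = 1

    def check_version(self, topic, schema: dict):
        """Return the matching version number, or None if verification fails."""
        canonical = json.dumps(schema, sort_keys=True)
        versions = self._versions.get(topic, [])
        if canonical in versions:
            return versions.index(canonical) + 1
        return None

    def register_version(self, topic, schema: dict):
        """Create the next version entry for the topic and assign a schema ID."""
        versions = self._versions.setdefault(topic, [])
        versions.append(json.dumps(schema, sort_keys=True))
        version = len(versions)
        schema_id = self._next_id
        self._next_id += 1
        self._ids[(topic, version)] = schema_id
        return version, schema_id

    def resolve(self, topic, schema: dict):
        """Substep A: verify the version, registering a new one on failure."""
        version = self.check_version(topic, schema)
        if version is None:
            return self.register_version(topic, schema)
        return version, self._ids[(topic, version)]


reg = VersionedRegistry()
v1 = {"type": "record", "name": "Event",
      "fields": [{"name": "user_id", "type": "long"}]}
v2 = {"type": "record", "name": "Event",
      "fields": [{"name": "user_id", "type": "long"},
                 {"name": "event", "type": "string"}]}
print(reg.resolve("user-events", v1))
print(reg.resolve("user-events", v2))
print(reg.resolve("user-events", v1))
```

Each distinct schema gets its own version and ID, so messages written with an older version can still be decoded after a newer version has been registered.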

Step 103, publishing the serialized data to Kafka.

In this step, the serialized data is published to Kafka so that the message consumer (Consumer) can obtain the data promptly, which improves the timeliness of data acquisition.

Step 104, obtaining the serialized data from Kafka.

In this step, the message consumer (Consumer) obtains the serialized data from Kafka based on its specified topic.

Step 105, determining a schema corresponding to the identification information in the serialized data, and analyzing the serialized data based on the schema to obtain the source data.

In this step, the corresponding schema is determined according to the correspondence between the identification information in the serialized data and the schema, and the serialized data is parsed based on that schema to obtain the source data.

For example, the schema ID is read with getInt() of the ByteBuffer, the schema metadata is then obtained from the schema cache, and the remaining data stream is parsed.
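The parsing side can be sketched as below, with `struct.unpack_from` standing in for the Java `getInt()` call and a plain dictionary standing in for the schema cache. The cache contents, the schema ID 7, the field list, and the helper names are assumptions for illustration; the body decoding follows the Avro binary rules for `long` and `string` only.

```python
import struct

# Hypothetical schema cache: schema ID -> ordered (field name, Avro type) pairs.
SCHEMA_CACHE = {7: [("user_id", "long"), ("event", "string")]}


def read_long(buf: bytes, pos: int):
    """Decode one Avro long (varint + zigzag); return (value, next position)."""
    acc, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:  # no continuation bit: last byte of this number
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos


def parse_message(message: bytes) -> dict:
    """Read the 4-byte schema ID, look up the schema, then parse the rest."""
    (schema_id,) = struct.unpack_from(">i", message, 0)
    pos = 4
    record = {}
    for name, avro_type in SCHEMA_CACHE[schema_id]:
        value, pos = read_long(message, pos)
        if avro_type == "string":
            length = value  # for strings the long just read is the byte length
            value = message[pos:pos + length].decode("utf-8")
            pos += length
        record[name] = value
    return record


message = struct.pack(">i", 7) + bytes([84, 10]) + b"click"
print(parse_message(message))
```

If the schema ID is missing from the cache, a real consumer would presumably fall back to the schema registry; this sketch simply raises `KeyError` in that case.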

In this embodiment, source data is acquired and the Kafka message queue topic to which the source data belongs is determined; Avro serialization processing is performed on the source data based on the schema matched with the topic to obtain serialized data, the serialized data containing identification information of the schema; the serialized data is published to Kafka; the serialized data is obtained from Kafka; and the schema corresponding to the identification information in the serialized data is determined and the serialized data is parsed based on that schema to obtain the source data. This reduces the storage space required for data storage, saves data transmission bandwidth, and shortens serialization time, thereby improving performance.
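Steps 101 to 105 can be traced end to end in a small self-contained sketch. To keep it runnable without a Kafka cluster, an in-memory `FakeBroker` stands in for Kafka, and the schema table, topic name, and helper functions are all assumptions for illustration; a real deployment would use a Kafka client and an Avro library instead.

```python
import struct
from collections import defaultdict, deque

# Hypothetical schema table and topic mapping.
SCHEMAS = {1: [("user_id", "long"), ("event", "string")]}
TOPIC_TO_SCHEMA_ID = {"user-events": 1}


class FakeBroker:
    """In-memory stand-in for Kafka topics, used only to keep the sketch runnable."""

    def __init__(self):
        self._topics = defaultdict(deque)

    def publish(self, topic, value: bytes):
        self._topics[topic].append(value)

    def poll(self, topic):
        queue = self._topics[topic]
        return queue.popleft() if queue else None


def write_long(n: int) -> bytes:
    z = (n << 1) ^ (n >> 63)  # zigzag
    out = bytearray()
    while True:
        low7, z = z & 0x7F, z >> 7
        out.append(low7 | 0x80 if z else low7)
        if not z:
            return bytes(out)


def read_long(buf: bytes, pos: int):
    acc, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (acc >> 1) ^ -(acc & 1), pos
        shift += 7


def serialize(topic: str, record: dict) -> bytes:
    """Step 102: 4-byte schema ID followed by the Avro-style body."""
    schema_id = TOPIC_TO_SCHEMA_ID[topic]
    out = bytearray(struct.pack(">i", schema_id))
    for name, avro_type in SCHEMAS[schema_id]:
        if avro_type == "long":
            out += write_long(record[name])
        else:
            raw = record[name].encode("utf-8")
            out += write_long(len(raw)) + raw
    return bytes(out)


def deserialize(message: bytes) -> dict:
    """Step 105: read the schema ID, then parse the body with that schema."""
    (schema_id,) = struct.unpack_from(">i", message, 0)
    pos, record = 4, {}
    for name, avro_type in SCHEMAS[schema_id]:
        value, pos = read_long(message, pos)
        if avro_type == "string":
            length = value
            value = message[pos:pos + length].decode("utf-8")
            pos += length
        record[name] = value
    return record


broker = FakeBroker()
source = {"user_id": 42, "event": "click"}
topic = "user-events"                              # step 101: topic of the source data
broker.publish(topic, serialize(topic, source))    # steps 102 and 103
received = broker.poll(topic)                      # step 104
restored = deserialize(received)                   # step 105
print(restored == source)
```

The consumer never receives the schema itself, only its ID, which is why the producer and consumer must share the same schema table or registry.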

It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.

Example II

Fig. 2 is a schematic structural diagram of a data processing device according to a second embodiment of the present invention, where the data processing device may specifically include the following modules:

a first acquisition module 201, configured to acquire source data, and determine a Kafka message queue topic to which the source data belongs;

a serialization module 202, configured to perform Avro serialization processing on the source data based on a schema matched with the topic, so as to obtain serialized data; the identification information of the schema exists in the serialized data;

a publishing module 203, configured to publish the serialized data to Kafka;

a second acquisition module 204 for acquiring the serialized data from Kafka;

the parsing module 205 is configured to determine a schema corresponding to the identification information in the serialized data, and parse the serialized data based on the schema to obtain source data.

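In one embodiment, the serialization module 202 comprises:

a version verification sub-module, configured to perform version verification on the schema matched with the topic and determine the schema of the corresponding version;

a serialization sub-module, configured to perform Avro serialization processing on the source data based on the schema of the corresponding version to obtain serialized data; the identification information of the schema of the corresponding version exists in the serialized data.

In one embodiment, the version verification sub-module is further configured to create corresponding version information and register the schema if version verification of the schema matched with the topic fails.

In one embodiment, the apparatus further comprises:

a registration module, configured to register the schema if no schema matched with the topic to which the source data belongs exists.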

The data processing device provided by the embodiment of the invention can execute the data processing method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Example III

Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. FIG. 3 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in fig. 3 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.

As shown in FIG. 3, computer device 12 is in the form of a general purpose computing device. Components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.

Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.

The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 3, commonly referred to as a "hard disk drive"). Although not shown in fig. 3, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.

A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.

The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.

The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the data processing method provided by the embodiment of the present invention.

Example IV

The fourth embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, where the computer program when executed by a processor implements each process of the data processing method described above, and the same technical effects can be achieved, and for avoiding repetition, a detailed description is omitted herein.

The computer readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims

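1. A data processing method, comprising:

acquiring source data, and determining a Kafka message queue topic to which the source data belongs;

performing Avro serialization processing on the source data based on the schema matched with the topic to obtain serialized data; the identification information of the schema exists in the serialized data;

publishing the serialized data to Kafka;

obtaining the serialized data from Kafka;

determining a schema corresponding to the identification information in the serialized data, and parsing the serialized data based on the schema to obtain the source data.

2. The method of claim 1, wherein performing Avro serialization processing on the source data based on the schema matched with the topic to obtain serialized data comprises:

performing version verification on the schema matched with the topic to determine the schema of the corresponding version;

performing Avro serialization processing on the source data based on the schema of the corresponding version to obtain serialized data; the identification information of the schema of the corresponding version exists in the serialized data.

3. The method of claim 2, wherein performing version verification on the schema matched with the topic to determine the schema of the corresponding version comprises:

if version verification of the schema matched with the topic fails, creating corresponding version information and registering the schema.

4. The method according to any one of claims 1 to 3, further comprising, prior to performing Avro serialization processing on the source data based on the schema matched with the topic:

registering the schema if no schema matched with the topic to which the source data belongs exists.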

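5. A data processing apparatus, comprising:

a first acquisition module for acquiring source data and determining a Kafka message queue topic to which the source data belongs;

a serialization module for performing Avro serialization processing on the source data based on the schema matched with the topic to obtain serialized data; the identification information of the schema exists in the serialized data;

a publishing module for publishing the serialized data to Kafka;

a second acquisition module for acquiring the serialized data from Kafka;

a parsing module for determining a schema corresponding to the identification information in the serialized data, and parsing the serialized data based on the schema to obtain the source data.

6. The apparatus of claim 5, wherein the serialization module comprises:

a version verification sub-module for performing version verification on the schema matched with the topic and determining the schema of the corresponding version;

a serialization sub-module for performing Avro serialization processing on the source data based on the schema of the corresponding version to obtain serialized data; the identification information of the schema of the corresponding version exists in the serialized data.

7. The apparatus according to claim 6, wherein the version verification sub-module is further used for creating corresponding version information and registering the schema if version verification of the schema matched with the topic fails.

8. The apparatus according to any one of claims 5 to 7, further comprising:

a registration module for registering the schema if no schema matched with the topic to which the source data belongs exists.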

9. A computer device, the computer device comprising:

one or more processors;

a memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data processing method of any of claims 1-4.

10. A computer-readable storage medium, characterized by: the computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data processing method according to any of claims 1-4.