CN116244258A - Semi-structured data processing method, processor, system and storage medium
- Publication number
- CN116244258A (application number CN202211552360.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- service
- field
- object data
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
- G06F16/116—Details of conversion of file system types or formats
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
- G06F16/164—File meta data generation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present disclosure relates to the field of computer technologies, and in particular to a method, a processor, a system, and a storage medium for processing semi-structured data. The method comprises: acquiring updated service data of a service system; pushing the service data to a kafka message queue; consuming the service data in the kafka message queue through a flink engine to serialize it; converting the serialized service data into service data in the standard format of a data lake; parsing the service data in the standard format to generate first object data corresponding to the service data; when the first object data contains a field that is new relative to the second object data of the service system, appending the new field to the end of the second object data; storing the updated second object data to a distributed file system to update the service files stored in the distributed file system; and updating the file directory of a metadata management center according to the updated service files.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, a processor, a system, and a storage medium for processing semi-structured data.
Background
In the prior art, when structured data is written into a file, the object data corresponding to the structured data, namely the schema, must also be stored. The structured data is stored according to the object data so that users can query it in the data management system. However, most service data is semi-structured: its structure is not fixed, so it cannot be imported directly into the data management system. It is therefore mostly imported manually, which is labor-intensive and error-prone.
Semi-structured data is also difficult to import because it may contain object data that does not yet exist in the data management system. When this happens, the data management system must be rewritten and repaired before files can be imported or retrieved again, which involves a heavy workload and takes a long time.
Disclosure of Invention
The embodiment of the application aims to provide a semi-structured data processing method, a processor, a system and a storage medium.
To achieve the above object, a first aspect of the present application provides a method for processing semi-structured data, including:
acquiring service data updated by a service system, wherein the service data is semi-structured data;
pushing the service data to a kafka message queue;
consuming the service data in the kafka message queue through a flink engine to perform serialization processing on the service data in the kafka message queue;
converting the data format of the service data after the serialization processing into a preset format;
converting the service data in the preset format into service data in the standard format of a data lake according to a preset standard table;
parsing the service data in the standard format to generate first object data corresponding to the service data;
when it is determined that the first object data contains a field that is new relative to the second object data of the service system, appending the new field to the end of the second object data;
storing the updated second object data to a distributed file system to update the service files stored in the distributed file system;
and updating the file directory of a metadata management center according to the updated service files.
In an embodiment of the present application, the processing method further includes: after updating the file directory of the metadata management center according to the updated service file, sending an update notification to the demander through the metadata management center, wherein the update notification carries partition information of the updated service data; and, when the demander calls the corresponding data interface of the metadata management center according to the partition information, querying the distributed file system for the updated service file corresponding to the partition information.
In an embodiment of the present application, consuming the service data in the kafka message queue through the flink engine to perform serialization processing on the service data includes: creating a JAVA object, wherein the JAVA object includes a schema field and a data field; and running the JAVA object to extract the schema from the service data and store it in the schema field, and to extract the service fields and the field value of each service field from the service data and store them in the data field.
In an embodiment of the present application, the preset format is the rowdata format, in which the order of the service fields contained in the data field is determined by the sequence number of each service field.
In an embodiment of the present application, appending the new field to the end of the second object data when it is determined that the first object data contains a field that is new relative to the second object data of the service system includes: acquiring aggregate data of the service system from a memory; parsing the aggregate data to obtain the second object data of the aggregate data; comparing the fields contained in the first object data with those contained in the second object data; when the first object data contains a field that is new relative to the second object data, appending the new field to the end of the second object data to generate third object data; and updating the aggregate data according to the third object data so that the third object data corresponds to the positions of the aggregate data.
In an embodiment of the present application, updating the file directory of the metadata management center according to the updated service file includes: acquiring, at intervals of a preset time period, fourth object data corresponding to the service file data; and, when the fourth object data contains a field that is new relative to the fifth object data corresponding to the file directory of the metadata management center, modifying the file directory of the metadata management center according to the new field.
A second aspect of the present application provides a processor configured to perform the above-described method of processing semi-structured data.
A third aspect of the present application provides a processing system for semi-structured data, comprising: a kafka message queue for temporarily storing service data; a flink engine for consuming the service data in the kafka message queue so as to perform serialization processing on it; a data lake for storing service data in the standard format; a distributed file system for storing service files; a metadata management center for storing the file directory of the service files; and a processor as described above.
In this embodiment of the present application, the metadata management center is further configured to invoke a corresponding data interface, and send an update notification to the demander after updating the file directory of the metadata management center according to the updated service file.
A fourth aspect of the present application provides a programmable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to be configured to perform the above-described method of processing semi-structured data.
Through the above technical solution, the processor can acquire service data updated by the service system, the service data being semi-structured data; push the service data to a kafka message queue; consume the service data in the kafka message queue through a flink engine to serialize it; convert the serialized service data into a preset format; convert the service data in the preset format into service data in the standard format of the data lake according to a preset standard table; parse the service data in the standard format to generate first object data corresponding to the service data; append any field of the first object data that is new relative to the second object data of the service system to the end of the second object data; store the updated second object data to the distributed file system to update the service files stored there; and update the file directory of the metadata management center according to the updated service files. In this way, semi-structured service data is converted into service files in a specific format and stored in the distributed file system, where users can retrieve them. This improves data processing efficiency and avoids the errors that manual data entry may introduce. In addition, the processor can import service data into the distributed file system in real time, and the data management system does not need to be updated and repaired again, which ensures its stable operation.
Additional features and advantages of embodiments of the present application will be set forth in the detailed description that follows.
Drawings
The accompanying drawings are included to provide a further understanding of embodiments of the present application and are incorporated in and constitute a part of this specification, illustrate embodiments of the present application and together with the description serve to explain, without limitation, the embodiments of the present application. In the drawings:
FIG. 1 schematically illustrates a flow diagram of a method of processing semi-structured data according to an embodiment of the present application;
FIG. 2 schematically illustrates a block diagram of a semi-structured data processing system according to an embodiment of the present application;
FIG. 3 schematically illustrates an internal structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. It should be understood that the specific implementations described herein only illustrate and explain the embodiments of the present application and are not intended to limit them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein, without creative effort, fall within the scope of protection of the present application.
Fig. 1 schematically shows a flow diagram of a method of processing semi-structured data according to an embodiment of the present application. As shown in fig. 1, in an embodiment of the present application, a method for processing semi-structured data is provided, including the following steps:
S102, acquiring service data updated by a service system, wherein the service data is semi-structured data.
S104, pushing the service data to the kafka message queue.
S106, consuming the service data in the kafka message queue through the flink engine to perform serialization processing on the service data in the kafka message queue.
S108, converting the data format of the serialized service data into a preset format.
S110, converting the service data in the preset format into service data in the standard format of the data lake according to the preset standard table.
S112, parsing the service data in the standard format to generate first object data corresponding to the service data.
S114, when it is determined that the first object data contains a field that is new relative to the second object data of the service system, appending the new field to the end of the second object data.
S116, storing the updated second object data to the distributed file system to update the service files stored in the distributed file system.
S118, updating the file directory of the metadata management center according to the updated service file.
Semi-structured data lies between fully structured data (e.g., relational data) and fully unstructured data (e.g., audio, video). It does not conform to the data model of a relational database or other data table, but it contains tags that separate semantic elements and impose a hierarchy on records and fields. Common examples are HTML, XML, and JSON documents. The service data in the present application is also semi-structured: it is data recorded during business operations and may contain object data that the structured data in the data management system does not contain. It can be imported into the database only after it has been parsed, confirmed, and converted into service data in the data lake format. Therefore, the semi-structured data is serialized and converted in format, the object data and the content data in the service data are parsed, and the converted data is then imported into the distributed file system. The distributed file system stores service files, which users can retrieve in a specific manner. The data lake stores service data in the standard format. The flink engine is a real-time computing engine that can acquire data in real time for analysis and computation. The metadata management center generates a file directory of the service files in the distributed file system according to the object data corresponding to the distributed file system, so that users can retrieve the service files through a specific interface.
First, the processor acquires service data from the service system. The service data exported from the service system is semi-structured and cannot be imported directly into a data lake or a distributed file system. The processor pushes the service data to the kafka message queue for temporary storage. The processor then consumes the service data in the kafka message queue through the flink engine to serialize it, so that the object data and the content data in the service data are stored separately. After obtaining the serialized service data, the processor converts it into a preset format to facilitate further processing. The processor then converts the service data in the preset format into service data in the standard format of the data lake according to the preset standard table corresponding to the data lake, so that the service data can be imported into the data lake. The processor parses the service data in the standard format to obtain the first object data corresponding to the service data; the first object data may include a plurality of object fields. The processor compares the fields in the first object data with the second object data of the service system and, when it determines that the first object data contains a field that is new relative to the second object data, appends the new field to the end of the second object data. The second object data of the service system is the object data established by earlier imports from the service system into the distributed file system, before the current acquisition, and covers the object data corresponding to all previous service data.
If no semi-structured service data has previously been written into the distributed file system, the second object data is empty, and the processor may add all of the first object data to the second object data to generate the updated second object data. Finally, the processor stores the updated second object data in the distributed file system, updates the data in the distributed file system according to the updated second object data, and writes the service data in the standard format to the service file in the distributed file system. The processor may further update the file directory of the metadata management center according to the updated service file, so that the file directory stored in the metadata management center always reflects the latest service files.
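The conversion against the preset standard table described above can be illustrated with a minimal Python sketch. The source does not specify what the standard table contains; the sketch assumes, purely for illustration, that it maps each source field name to a standard field name and a target type. The table contents, the field names, and the function `to_standard_format` are all hypothetical.

```python
# Hypothetical standard table: source field -> (standard name, type caster).
# The patent does not describe the table's layout; this is an assumption.
STANDARD_TABLE = {
    "orderId": ("order_id", int),
    "amt": ("amount", float),
    "ts": ("event_time", str),
}


def to_standard_format(record: dict, standard_table: dict = STANDARD_TABLE) -> dict:
    """Rename and cast known fields; pass unknown fields through unchanged."""
    standardized = {}
    for name, value in record.items():
        if name in standard_table:
            std_name, caster = standard_table[name]
            standardized[std_name] = caster(value)
        else:
            # Unknown (possibly newly added) fields are kept as-is so that
            # a later schema comparison can still detect them.
            standardized[name] = value
    return standardized


converted = to_standard_format({"orderId": "7", "amt": "1.5", "note": "x"})
```

Passing unknown fields through rather than dropping them is a design choice of this sketch: it keeps newly added fields visible to the later comparison between the first and second object data.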
In one embodiment, the processing method further includes: after updating the file directory of the metadata management center according to the updated service file, sending an update notification to the demander through the metadata management center, wherein the update notification carries partition information of the updated service data; and, when the demander calls the corresponding data interface of the metadata management center according to the partition information, querying the distributed file system for the updated service file corresponding to the partition information. In other words, after the file directory is updated, the metadata management center notifies the demander, and the demander can use the partition information in the notification to query the corresponding service file.
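The notify-then-query interaction can be sketched as follows. This is a toy in-memory model, not the patent's implementation: the class `MetadataCenter`, its method names, the callback-based notification, and the partition string `dt=2022-12-05` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataCenter:
    """Toy stand-in for the metadata management center (hypothetical API)."""
    catalog: dict = field(default_factory=dict)      # partition -> file paths
    subscribers: list = field(default_factory=list)  # demander callbacks

    def update_catalog(self, partition: str, path: str) -> None:
        """Register an updated service file, then notify every demander."""
        self.catalog.setdefault(partition, []).append(path)
        for notify in self.subscribers:
            # The notification carries only the partition information.
            notify({"partition": partition})

    def query(self, partition: str) -> list:
        """Data interface: return the files registered under a partition."""
        return list(self.catalog.get(partition, []))


center = MetadataCenter()
received = []                               # notifications seen by the demander
center.subscribers.append(received.append)
center.update_catalog("dt=2022-12-05", "/lake/orders/dt=2022-12-05/part-0")
files = center.query(received[0]["partition"])
```

Because the notification carries only partition information, the demander looks up just the affected partition instead of scanning the whole file directory.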
In one embodiment, consuming the service data in the kafka message queue through the flink engine to perform serialization processing on the service data includes: creating a JAVA object, wherein the JAVA object includes a schema field and a data field; and running the JAVA object to extract the schema from the service data and store it in the schema field, and to extract the service fields and the field value of each service field from the service data and store them in the data field. The processor parses the data in the kafka message queue through the flink engine and serializes it. During serialization, a JAVA object containing a schema field and a data field is created first. After parsing the service data in the kafka message queue, the processor stores the schema of the service data in the schema field of the JAVA object and stores the service fields and their values in the data field of the JAVA object, which completes the serialization of the service data.
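The split into a schema field and a data field can be sketched as below. The source describes a JAVA object; Python is used here only for brevity. The sketch assumes that each message is a flat JSON document and that the schema can be inferred from the value types, neither of which the source states. The names `Envelope` and `serialize` are hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Envelope:
    """Analogue of the object holding a schema field and a data field."""
    schema: dict = field(default_factory=dict)  # field name -> type name
    data: dict = field(default_factory=dict)    # field name -> field value


def serialize(raw_message: str) -> Envelope:
    """Parse one message and store its schema and its values separately."""
    record = json.loads(raw_message)
    envelope = Envelope()
    for name, value in record.items():
        # Assumption: the schema is inferred from the Python value type.
        envelope.schema[name] = type(value).__name__
        envelope.data[name] = value
    return envelope


env = serialize('{"id": 1, "name": "widget", "price": 9.5}')
```

Keeping the schema apart from the values lets later stages compare schemas without touching the payload.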
In one embodiment, the preset format is the rowdata format, in which the order of the service fields contained in the data field is determined by the sequence number of each service field. The processor converts the serialized service data into the rowdata format and numbers the service fields in the serialized service data. In the present application, the service fields in the data field are ordered by sequence number, from smallest to largest, and the resulting rowdata allows the flink engine to process the data more efficiently.
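The ordering rule can be sketched as a conversion from a name-keyed record to a positional row. The sketch assumes ascending order by sequence number; the numbering shown and the function `to_row` are hypothetical, and a plain tuple stands in for the rowdata structure.

```python
def to_row(data: dict, sequence_numbers: dict) -> tuple:
    """Arrange field values positionally, smallest sequence number first."""
    ordered_names = sorted(data, key=lambda name: sequence_numbers[name])
    return tuple(data[name] for name in ordered_names)


sequence_numbers = {"id": 0, "name": 1, "price": 2}  # hypothetical numbering
row = to_row({"price": 9.5, "id": 1, "name": "widget"}, sequence_numbers)
```

With a fixed field order, values can be read by position rather than by name, which is one plausible reason the source reports faster processing for this format.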
In one embodiment, appending the new field to the end of the second object data when it is determined that the first object data contains a field that is new relative to the second object data of the service system includes: acquiring aggregate data of the service system from a memory; parsing the aggregate data to obtain the second object data of the aggregate data; comparing the fields contained in the first object data with those contained in the second object data; when the first object data contains a field that is new relative to the second object data, appending the new field to the end of the second object data to generate third object data; and updating the aggregate data according to the third object data so that the third object data corresponds to the positions of the aggregate data. After obtaining the first object data, the processor compares it with the second object data of the aggregate data in the distributed file system. When the first object data contains a field that does not exist in the second object data, the processor appends the new field to the end of the second object data to generate the third object data, which corresponds to the latest service data. The processor may then update the aggregate data in the distributed file system according to the third object data and write the latest service data in the standard format to the service file in the distributed file system.
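The comparison and append step can be sketched as follows, representing object data as an ordered list of field names. This representation and the helpers `merge_schema` and `pad_row` are assumptions for illustration. `pad_row` shows one way previously stored rows could be aligned with the third object data, which the source does not spell out.

```python
def merge_schema(first: list, second: list) -> tuple:
    """Return (third_schema, new_fields); new fields go at the end."""
    known = set(second)
    new_fields = [name for name in first if name not in known]
    return second + new_fields, new_fields


def pad_row(row: tuple, schema_length: int) -> tuple:
    """Align a row written under the older schema with the longer schema."""
    return row + (None,) * (schema_length - len(row))


third, added = merge_schema(["id", "price", "coupon"], ["id", "name", "price"])
old_row = pad_row((1, "widget", 9.5), len(third))
```

Appending at the end leaves the positions of existing fields unchanged, so in this sketch older rows only need trailing padding rather than a rewrite.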
In one embodiment, updating the file directory of the metadata management center according to the updated service file includes: acquiring, at intervals of a preset time period, fourth object data corresponding to the service file data; and, when the fourth object data contains a field that is new relative to the fifth object data corresponding to the file directory of the metadata management center, modifying the file directory of the metadata management center according to the new field. At intervals of the preset time period, the processor acquires the fourth object data corresponding to the service file data in the distributed file system and the fifth object data corresponding to the file directory of the metadata management center. When the fourth object data contains a field that is new relative to the fifth object data, the processor modifies the file directory of the metadata management center according to the new field, so that users can obtain the latest file directory through a specific interface and, in turn, the latest service data.
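The periodic check can be sketched as a loop that compares the two schemas once per period. The function names, the list-based schema representation, and the injected `wait` callable are assumptions of this sketch. In a real deployment `wait` might be `lambda: time.sleep(period_seconds)`.

```python
def sync_catalog(file_schema: list, catalog_schema: list) -> list:
    """Append to the catalog schema any field found only in the file schema."""
    added = [name for name in file_schema if name not in catalog_schema]
    catalog_schema.extend(added)
    return added


def run_periodically(get_file_schema, catalog_schema: list, rounds: int, wait) -> None:
    """Check once per preset period; `wait` is injected so tests need not sleep."""
    for _ in range(rounds):
        sync_catalog(get_file_schema(), catalog_schema)
        wait()


catalog = ["id", "name"]                                   # fifth object data
snapshots = iter([["id", "name"], ["id", "name", "coupon"]])  # fourth object data
run_periodically(lambda: next(snapshots), catalog, rounds=2, wait=lambda: None)
```

The first round finds no difference and leaves the catalog untouched; the second round detects `coupon` and adds it.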
In a specific embodiment, the processor acquires service data from the service system; the acquired service data is semi-structured data. The processor pushes the acquired service data to the kafka message queue and consumes the service data in the kafka message queue through the flink engine so as to perform serialization processing on it. Serializing the service data means creating a JAVA object that includes a schema field and a data field. The processor runs the JAVA object to extract the schema from the service data and store it in the schema field, and to extract the service fields and the field value of each service field and store them in the data field. The processor converts the serialized service data into the rowdata format and orders the service fields according to their sequence numbers. The processor converts the service data in the preset format into service data in the standard format of the data lake according to the preset standard table, and parses the service data in the standard format to generate the first object data corresponding to the service data. The processor then acquires the aggregate data of the distributed file system from the memory and parses it to obtain the second object data corresponding to the aggregate data. The processor compares the fields contained in the first object data with the fields contained in the second object data and, when it determines that the first object data contains a field that is new relative to the second object data, appends the new field to the end of the second object data to generate the third object data.
The processor updates the aggregate data according to the third object data so that the third object data corresponds to the positions of the aggregate data. At intervals of a preset time period, the processor acquires the fourth object data corresponding to the service file data; when the fourth object data contains a field that is new relative to the fifth object data corresponding to the file directory of the metadata management center, the file directory of the metadata management center is modified according to the new field. After updating the file directory of the metadata management center according to the updated service file, the processor sends an update notification to the demander through the metadata management center, wherein the update notification carries partition information of the updated service data. When the demander calls the corresponding data interface of the metadata management center according to the partition information, the updated service file corresponding to the partition information can be queried in the distributed file system.
With this method, semi-structured service data can be converted into service files in a specific format and stored in the distributed file system, where users can retrieve them. This improves data processing efficiency and avoids the errors that manual data entry may introduce. In addition, the processor can import service data into the distributed file system in real time, and the data management system does not need to be updated and repaired again, which ensures its stable operation.
FIG. 1 is a flow diagram of a method for processing semi-structured data in one embodiment. It should be understood that, although the steps in the flow diagram of FIG. 1 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to this order and may be performed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily performed at the same time or in sequence; they may be performed at different times, in turn, or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 2, a semi-structured data processing system is provided, comprising a kafka message queue 201, a flink engine 202, a data lake 203, a distributed file system 204, a metadata management center 205, and a processor 206, wherein:
The kafka message queue 201 is configured to temporarily store service data.
The flink engine 202 is configured to consume the service data in the kafka message queue, so as to perform serialization processing on the service data in the kafka message queue.
The data lake 203 is configured to store the service data in a standard format.
The distributed file system 204 is configured to store the service files.
The metadata management center 205 is configured to store a file directory of the service files.
The processor 206 is configured to perform the above-described method of processing semi-structured data.
In one embodiment, the metadata management center is further configured to invoke a corresponding data interface and send an update notification to the requestor after updating the file directory of the metadata management center according to the updated service file.
The semi-structured data processing system includes a processor and a memory. The kafka message queue 201, the flink engine 202, the data lake 203, the distributed file system 204, the metadata management center 205, and so on are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided, and the method of processing semi-structured data is implemented by adjusting kernel parameters.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or nonvolatile memory, for example read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
The embodiment of the application provides a storage medium, on which a program is stored, which when executed by a processor, implements the above-described method for processing semi-structured data.
The embodiment of the application provides a processor, which is used for running a program, wherein the processing method of the semi-structured data is executed when the program runs.
In one embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in FIG. 3. The computer device includes a processor A01, a network interface A02, a memory (not shown), and a database (not shown) connected by a system bus. The processor A01 of the computer device provides computing and control capabilities. The memory of the computer device includes an internal memory A03 and a nonvolatile storage medium A04. The nonvolatile storage medium A04 stores an operating system B01, a computer program B02, and a database (not shown). The internal memory A03 provides an environment for running the operating system B01 and the computer program B02 stored in the nonvolatile storage medium A04. The network interface A02 of the computer device is used to communicate with an external terminal through a network connection. The computer program B02, when executed by the processor A01, implements the method of processing semi-structured data.
It will be appreciated by those skilled in the art that the structure shown in fig. 3 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
An embodiment of the application provides a device comprising a processor, a memory, and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the following steps: acquiring service data updated by a service system, wherein the service data is semi-structured data; pushing the service data to a kafka message queue; consuming the service data in the kafka message queue through a flink engine to perform serialization processing on the service data in the kafka message queue; converting the data format of the serialized service data into a preset format; converting the service data in the preset format into service data in a standard format of a data lake according to a preset standard table; parsing the service data in the standard format to generate first object data corresponding to the service data; in the case where it is determined that a field in the first object data is newly added relative to second object data of the service system, adding the newly added field to the end of the second object data; storing the updated second object data to a distributed file system to update the service file stored in the distributed file system; and updating the file directory of the metadata management center according to the updated service file.
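The sequence of steps above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation described in this application: the message queue, the distributed file system, and the metadata management center are replaced by in-memory stand-ins (a `deque` and two dictionaries), the service data is assumed to arrive as JSON strings, and the names `run_pipeline`, `known_fields`, `dfs`, `catalog`, and the path `orders/part-0` are hypothetical.

```python
import json
from collections import deque


def run_pipeline(raw_records, known_fields, dfs, catalog, path="orders/part-0"):
    """Push records to a queue, consume them, and store them as ordered rows.

    All arguments are hypothetical in-memory stand-ins: `dfs` plays the role of
    the distributed file system and `catalog` the metadata file directory.
    """
    queue = deque(raw_records)  # stand-in for pushing to the message queue
    while queue:
        # Consume one message and parse the semi-structured payload.
        record = json.loads(queue.popleft())
        # Fields not seen before are appended to the END of the known fields,
        # so rows written earlier remain a valid prefix of the new layout.
        for name in record:
            if name not in known_fields:
                known_fields.append(name)
        # Build one row whose values follow the current field order.
        row = [record.get(name) for name in known_fields]
        dfs.setdefault(path, []).append(row)
    # Update the file directory kept by the metadata stand-in.
    catalog[path] = list(known_fields)
    return known_fields
```

Because new fields are only ever appended, a reader of the stand-in file system can treat shorter, older rows as having empty values for the later fields.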
In one embodiment, the processing method further comprises: after updating the file directory of the metadata management center according to the updated service file, sending an update notification to the requestor through the metadata management center, wherein the update notification carries partition information of the updated service data; and, when the requestor calls the corresponding data interface of the metadata management center according to the partition information, querying the distributed file system for the updated service file corresponding to the partition information.
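The notification and query flow in this embodiment might look roughly like the following Python sketch. It assumes a simple callback mechanism and a dictionary standing in for the distributed file system; the class `MetadataCenter`, its methods, and the partition key `dt=2022-12-05` are hypothetical names chosen for illustration and are not taken from this application.

```python
class MetadataCenter:
    """Hypothetical stand-in for the metadata management center."""

    def __init__(self, dfs):
        # `dfs` maps a partition key to the service files stored under it.
        self.dfs = dfs
        self.subscribers = []

    def subscribe(self, callback):
        """Register a requestor that wants update notifications."""
        self.subscribers.append(callback)

    def notify_update(self, partition):
        """Send an update notification carrying the partition information."""
        for callback in self.subscribers:
            callback({"partition": partition})

    def query(self, partition):
        """Data interface: look up the service files for one partition."""
        return self.dfs.get(partition, [])
```

A requestor would subscribe once, receive the partition information in the notification, and then call `query` with that partition to obtain only the updated service files.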
In one embodiment, consuming, by a flink engine, the service data in the kafka message queue to perform serialization processing on the service data in the kafka message queue includes: creating a JAVA object, wherein the JAVA object comprises a schema field and a data field; and running the JAVA object to extract the schema in the service data and store the schema in the schema field, and to extract the service fields in the service data and the field value of each service field and store them in the data field.
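The object with a schema field and a data field can be pictured with the following Python analogue. This is a sketch only: the embodiment refers to a JAVA object, whereas this example uses a Python dataclass, assumes the service data is a flat JSON object, and infers the schema from the Python type of each value, which is an assumption made for illustration. The names `SerializedRecord` and `serialize` are hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SerializedRecord:
    """Python analogue of an object holding a schema field and a data field."""

    schema: dict = field(default_factory=dict)  # service field name -> type name
    data: dict = field(default_factory=dict)    # service field name -> field value


def serialize(message: str) -> SerializedRecord:
    """Split one JSON message into its schema and its field values."""
    payload = json.loads(message)
    record = SerializedRecord()
    for name, value in payload.items():
        # Assumption: the schema is inferred from the value's Python type.
        record.schema[name] = type(value).__name__
        record.data[name] = value
    return record
```

Keeping the schema next to the data lets later steps detect newly added fields without re-reading the raw message.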
In one embodiment, the preset format is a rowdata format, wherein the order of the service fields contained in the data fields in the rowdata format is determined according to the sequence number of each service field.
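Ordering service fields by sequence number can be sketched as follows. The example assumes that each service field has an integer sequence number and that a missing value is represented by `None`; the function `to_row` and the sample field names are hypothetical, and the tuple returned here is only a stand-in for a row in the rowdata format.

```python
def to_row(data, sequence_numbers):
    """Arrange field values in the order given by each field's sequence number.

    `data` maps field names to values; `sequence_numbers` maps field names to
    their position. Fields absent from `data` yield None in the row.
    """
    ordered_names = sorted(sequence_numbers, key=sequence_numbers.get)
    return tuple(data.get(name) for name in ordered_names)
```

For instance, `to_row({"name": "a", "id": 1}, {"id": 0, "name": 1})` places the `id` value before the `name` value regardless of the order in which the fields arrived.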
In one embodiment, in the case where it is determined that a field in the first object data is newly added relative to the second object data of the service system, adding the newly added field to the end of the second object data comprises: acquiring aggregate data of the service system from a memory; parsing the aggregate data to obtain the second object data of the aggregate data; comparing the fields contained in the first object data and the second object data; in the case where it is determined that a field in the first object data is newly added relative to the second object data, adding the newly added field to the end of the second object data to generate third object data; and updating the aggregate data according to the third object data so that the third object data corresponds to the positions of the aggregate data.
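The comparison and append step can be sketched in Python as below. The sketch assumes that the first and second object data can each be reduced to an ordered list of field names; the function `merge_fields` and the sample field names are hypothetical.

```python
def merge_fields(first_fields, second_fields):
    """Append fields that are new in `first_fields` to the end of `second_fields`.

    Returns the merged field list (the "third object data" in this sketch)
    together with the list of newly added fields. Existing fields keep their
    original positions.
    """
    existing = set(second_fields)
    # dict.fromkeys removes repeated names while preserving first-seen order.
    added = [name for name in dict.fromkeys(first_fields) if name not in existing]
    return list(second_fields) + added, added
```

Appending rather than inserting keeps the position of every existing field stable, so data already written in the old layout does not need to be rewritten.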
In one embodiment, updating the file directory of the metadata management center based on the updated business file includes: acquiring fourth object data corresponding to the business file data every preset time period; in the case that the fourth object data has a newly added field compared with the fifth object data corresponding to the file directory of the metadata management center, the file directory of the metadata management center is modified according to the newly added field.
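The periodic check in this embodiment might be sketched as follows. The example assumes that the fields of a service file can be read through a caller-supplied function and that the file directory is a dictionary from path to field list; `sync_directory_once`, `sync_periodically`, and their parameters are hypothetical names, and a production system would more likely use a scheduler than a sleep loop.

```python
import time


def sync_directory_once(read_file_fields, directory, path):
    """Compare the file's fields with the directory entry and add new ones.

    `read_file_fields(path)` returns the field names found in the service file
    (the "fourth object data" in this sketch); `directory[path]` holds the
    fields recorded in the file directory (the "fifth object data").
    """
    file_fields = read_file_fields(path)
    recorded = directory.setdefault(path, [])
    new_fields = [name for name in file_fields if name not in recorded]
    recorded.extend(new_fields)
    return new_fields


def sync_periodically(read_file_fields, directory, path, interval_seconds, rounds):
    """Run the check a fixed number of times, waiting between rounds."""
    for _ in range(rounds):
        sync_directory_once(read_file_fields, directory, path)
        time.sleep(interval_seconds)
```

Running the check on a fixed interval means the file directory catches up with schema changes even if an individual update was missed.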
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or nonvolatile memory, for example read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.
Claims (10)
1. A method of processing semi-structured data, the method comprising:
acquiring service data updated by a service system, wherein the service data is semi-structured data;
pushing the service data to a kafka message queue;
consuming the service data in the kafka message queue through a flink engine to perform serialization processing on the service data in the kafka message queue;
converting the data format of the service data after the serialization processing into a preset format;
converting the service data in the preset format into service data in a standard format of a data lake according to a preset standard table;
analyzing the service data in the standard format to generate first object data corresponding to the service data;
in the case where it is determined that a field in the first object data is newly added relative to second object data of the service system, adding the newly added field to the end of the second object data;
storing the updated second object data to a distributed file system to update a service file stored in the distributed file system;
and updating the file directory of the metadata management center according to the updated service file.
2. The method of processing semi-structured data according to claim 1, further comprising:
after the file directory of the metadata management center is updated according to the updated service file, sending an update notification to a requestor through the metadata management center, wherein the update notification carries partition information of the updated service data;
and, in the case where the requestor calls the corresponding data interface of the metadata management center according to the partition information, querying the distributed file system for the updated service file corresponding to the partition information.
3. The method of processing semi-structured data according to claim 1, wherein said consuming the service data in the kafka message queue through a flink engine to perform serialization processing on the service data in the kafka message queue comprises:
creating a JAVA object, wherein the JAVA object comprises a schema field and a data field;
and running the JAVA object to extract the schema in the service data and store the schema in the schema field, and extracting the service field in the service data and the field value of each service field and storing the field value in the data field.
4. The method of processing semi-structured data according to claim 3, wherein the preset format is a rowdata format, and wherein the order of the service fields contained in the data field in the rowdata format is determined according to the sequence number of each service field.
5. The method according to claim 1, wherein in a case where it is determined that a field in the first object data is newly added with respect to second object data of the service system, adding the newly added field to an end of the second object data includes:
acquiring the aggregate data of the service system from a memory;
analyzing the aggregate data to obtain second object data of the aggregate data;
comparing the fields contained in the first object data and the second object data;
in the case that the field in the first object data is determined to be newly added relative to the second object data, adding the newly added field to the tail end of the second object data to generate third object data;
and updating the aggregate data according to the third object data so that the third object data corresponds to the positions of the aggregate data.
6. The method for processing semi-structured data according to claim 1, wherein updating the file directory of the metadata management center according to the updated service file comprises:
acquiring fourth object data corresponding to the business file data every preset time period;
and when the fourth object data has a newly added field compared with the fifth object data corresponding to the file directory of the metadata management center, modifying the file directory of the metadata management center according to the newly added field.
7. A processor configured to perform the method of processing semi-structured data according to any one of claims 1 to 6.
8. A system for processing semi-structured data, comprising:
a kafka message queue for temporarily storing service data;
a flink engine for consuming the service data in the kafka message queue to perform serialization processing on the service data in the kafka message queue;
a data lake for storing the business data in the standard format;
a distributed file system for storing service files;
a metadata management center for storing a file directory of the service files; and
the processor of claim 7.
9. The system of claim 8, wherein the metadata management center is further configured to invoke a corresponding data interface and to send an update notification to the requestor after updating the file directory of the metadata management center based on the updated business file.
10. A programmable storage medium having instructions stored thereon, which when executed by a processor cause the processor to be configured to perform a method of processing semi-structured data according to any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211552360.3A CN116244258A (en) | 2022-12-05 | 2022-12-05 | Semi-structured data processing method, processor, system and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211552360.3A CN116244258A (en) | 2022-12-05 | 2022-12-05 | Semi-structured data processing method, processor, system and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116244258A (en) | 2023-06-09 |
Family
ID=86631989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211552360.3A Pending CN116244258A (en) | 2022-12-05 | 2022-12-05 | Semi-structured data processing method, processor, system and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116244258A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||