CN110674220B

CN110674220B - Data heterogeneous method, device and equipment

Info

Publication number: CN110674220B
Application number: CN201910911575.1A
Authority: CN
Inventors: 杨森; 王彭
Original assignee: Enyike Beijing Data Technology Co ltd
Current assignee: Enyike Beijing Data Technology Co ltd
Priority date: 2019-09-25
Filing date: 2019-09-25
Publication date: 2022-09-09
Anticipated expiration: 2039-09-25
Also published as: CN110674220A

Abstract

A data heterogeneous processing method, apparatus, device and computer-readable storage medium. The method comprises: setting a data heterogeneous rule according to data input source information and data output source information; heterogeneously processing historical data according to the data heterogeneous rule, and heterogeneously processing real-time data according to the data heterogeneous rule by listening to the database update log (binLog); and outputting the data obtained by the heterogeneous processing to a search engine. The method improves the stability and real-time performance of data heterogeneous processing, supports configurable heterogeneous rules, and guarantees data availability.

Description

Data heterogeneous method, device and equipment

Technical Field

The present disclosure relates to the field of data processing, and more particularly, to a method, an apparatus, a device, and a computer-readable storage medium for data heterogeneous processing.

Background

In the current big data era, data is distributed in a fragmented and decentralized way. Data heterogeneous processing is a prerequisite for data search and data analysis. At present, most data is stored as structured data in a structured database such as MySQL. Structured data has natural advantages for implementing most business logic, but its shortcomings are magnified in business scenarios such as search, recommendation, and data reporting over large amounts of data, where query speed drops in proportion to the growth of the data.

In the related art, data heterogeneous processing generally relies on periodic, scheduled data synchronization. Full and incremental data are processed by time-based synchronization, that is, the data from the last synchronization time to the current synchronization time is synchronized so that data consistency is guaranteed. However, this approach is not real-time, and under a large data volume it drags down the structured database, which affects the stability of the normal business system.

Disclosure of Invention

The application provides a data heterogeneous processing method, apparatus, device and computer-readable storage medium, so as to improve the stability and real-time performance of data heterogeneous processing.

The embodiment of the application provides a data heterogeneous method, which comprises the following steps:

setting data heterogeneous rules according to data input source information and data output source information;

heterogeneously processing historical data according to the data heterogeneous rule, and heterogeneously processing real-time data according to the data heterogeneous rule by listening to the database update log (binLog);

and outputting the data obtained by the heterogeneous processing to a search engine.

In an embodiment, the data input source information includes a data input source table structure, the data output source information includes a data output source index, and the setting of the data heterogeneous rule according to the data input source information and the data output source information includes:

obtaining the data input source table structure and the data output source index, mapping each field in the data output source index to a data input source field according to the data input source table structure, generating a mapping relation rule model, and storing the mapping relation rule model in an unstructured database.

In one embodiment, the mapping relationship rule model includes a plurality of rule groups, and the data input source table structures and the rule groups are in a many-to-many relationship.

In an embodiment, heterogeneously processing the historical data according to the data heterogeneous rule includes:

creating a historical data initialization task (Job) according to the data heterogeneous rule, setting the last modification time of the initialization data, and executing the Job.

In an embodiment, the method further comprises:

after the Job finishes, setting the offset for real-time consumption according to the last modification time of the data and starting real-time data heterogeneous processing, the historical data heterogeneous processing being complete when the data consumption time equals the latest data time.

In an embodiment, heterogeneously processing the real-time data according to the data heterogeneous rule by listening to the binLog includes:

obtaining binLog entries by listening to the binLog, setting a unique data ID for each binLog entry, sending the data ID and the corresponding real-time data to a kafka data pipeline as a kafka message, and heterogeneously processing the real-time data through a plurality of parallel heterogeneous services according to the data heterogeneous rule.

In an embodiment, the method further comprises:

managing the plurality of parallel heterogeneous services through Zookeeper.

In an embodiment, the managing the plurality of parallel heterogeneous services through Zookeeper includes:

when a heterogeneous service is added or removed, closing the data heterogeneous switch through Zookeeper, re-hashing the kafka messages, and reporting the service state of each heterogeneous service;

and after all the heterogeneous services are in the no-data-processing state, opening the data heterogeneous switch through Zookeeper to perform data heterogeneous processing.

In an embodiment, the method further comprises:

replaying and reprocessing the real-time data according to the kafka offset through Zookeeper.

The embodiment of the present application further provides a data heterogeneous device, including:

the data model rule module is used for setting data heterogeneous rules according to the information of the data input source and the information of the data output source;

the historical data heterogeneous module is used for heterogeneously processing the historical data according to the data heterogeneous rule;

and the data distribution module is used for heterogeneously processing the real-time data according to the data heterogeneous rule by listening to the database update log (binLog), and outputting the data obtained by the heterogeneous processing to the search engine.

An embodiment of the present application further provides a device for data heterogeneous processing, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the data heterogeneous method when executing the program.

The embodiment of the application also provides a computer-readable storage medium, which stores computer-executable instructions, wherein the computer-executable instructions are used for executing the data heterogeneous method.

Compared with the related art, the application comprises: setting a data heterogeneous rule according to data input source information and data output source information; heterogeneously processing historical data according to the data heterogeneous rule, and heterogeneously processing real-time data according to the data heterogeneous rule by listening to the database update log (binLog); and outputting the data obtained by the heterogeneous processing to a search engine. This improves the stability and real-time performance of data heterogeneous processing, supports configurable heterogeneous rules, and guarantees data availability.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. Other advantages of the application may be realized and attained by the instrumentalities and combinations particularly pointed out in the specification, claims, and drawings.

Drawings

The accompanying drawings are included to provide an understanding of the present disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the examples serve to explain the principles of the disclosure and not to limit the disclosure.

FIG. 1 is a flowchart of a data heterogeneous method according to an embodiment of the present application;

FIG. 2 is a flowchart of a data heterogeneous method according to an application example of the present application;

FIG. 3 is a data flow diagram of an application example of the present application;

FIG. 4 is a schematic diagram of a data heterogeneous apparatus according to an embodiment of the present application.

Detailed Description

The description herein describes embodiments, but is intended to be exemplary, rather than limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible that are within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or instead of any other feature or element in any other embodiment, unless expressly limited otherwise.

The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements disclosed in this application may also be combined with any conventional features or elements to form a unique inventive concept as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive aspects to form yet another unique inventive aspect, as defined by the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not to be restricted except in light of the attached claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.

Further, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other orders of steps are possible as will be understood by those of ordinary skill in the art. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Furthermore, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.

At present, most heterogeneous strategies cannot simultaneously address real-time performance, data consistency, configurable heterogeneous rules, heterogeneous data switching, system disaster tolerance, data recovery and the like, and therefore fall short in the availability and stability of data heterogeneous processing.

The embodiment of the application provides a new data heterogeneous method that starts from stability and real-time performance and, on that basis, provides a data heterogeneous strategy that improves the stability and real-time performance of data heterogeneous processing, supports configurable heterogeneous rules, optimizes the disaster recovery strategy, and ensures data recovery and data availability.

As shown in FIG. 1, a data heterogeneous method according to an embodiment of the present application includes:

Step 101, setting data heterogeneous rules according to data input source information and data output source information.

In one embodiment, the data input source information includes a data input source table structure, the data output source information includes a data output source index, and the step 101 includes:

obtaining the data input source table structure and the data output source index (Index), mapping each field in the data output source Index to a data input source field according to the data input source table structure, generating a mapping relation rule model, and storing the mapping relation rule model in an unstructured database.

The unstructured database may be Redis.

In this embodiment, the heterogeneous rules may be set on a page based on the data input source information and the data output source information: an Index of the output source is selected, and each field of the Index is mapped to a data input source field. Finally, a mapping relationship rule model with a Map structure is generated and stored in the unstructured database Redis, which makes it easy to modify rules while data heterogeneous processing is running. A record of the rule creation is stored in the structured database MySQL so that the user can view it and configure subsequent processes.

In one embodiment, the mapping relationship rule model includes a plurality of rule groups, and the data input source table structures and the rule groups are in a many-to-many relationship. That is, one data table structure may belong to a plurality of rule groups, one rule group may comprise a plurality of table structures, and one rule group is heterogeneously processed into a data structure that may include parent-child relationships and the like.

A rule group supports unifying rules for multiple data input sources, multiple tables and multiple fields into one data output source, and is therefore flexible. That is, data from multiple data input sources may be mapped to one output source by a single rule group.
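The mapping relationship rule model described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the implementation of the application. The rule group names, database names, table names and field names are all hypothetical, and an in-memory dictionary stands in for the Map structure that the embodiment stores in Redis.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical rule model: for each rule group, every output Index field is
# mapped to (input source, table, column). All names are illustrative only.
RuleGroup = Dict[str, Tuple[str, str, str]]

RULE_MODEL: Dict[str, RuleGroup] = {
    "order_index": {
        "order_id": ("shop_db", "orders", "id"),
        "amount": ("shop_db", "orders", "total_amount"),
        "buyer_name": ("user_db", "users", "name"),
    },
    "user_index": {
        "user_id": ("user_db", "users", "id"),
        "name": ("user_db", "users", "name"),
    },
}


def groups_for_table(model: Dict[str, RuleGroup], source: str, table: str) -> List[str]:
    """Return the rule groups that reference a table (one table, many groups)."""
    return sorted(
        group
        for group, fields in model.items()
        if any((s, t) == (source, table) for s, t, _ in fields.values())
    )


def apply_rule_group(
    group: RuleGroup, rows: Dict[Tuple[str, str], Dict[str, Any]]
) -> Dict[str, Any]:
    """Build one output document from input rows keyed by (source, table)."""
    doc: Dict[str, Any] = {}
    for out_field, (source, table, column) in group.items():
        row = rows.get((source, table))
        if row is not None and column in row:
            doc[out_field] = row[column]
    return doc
```

In this sketch the `users` table belongs to both rule groups, and the `order_index` group draws on two tables from two input sources, which is one way to picture the many-to-many relationship.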

Step 102, heterogeneously processing historical data according to the data heterogeneous rule, and heterogeneously processing real-time data according to the data heterogeneous rule by listening to the binLog (database update log).

In an embodiment, heterogeneously processing the historical data according to the data heterogeneous rule includes:

creating a historical data initialization Job (task) according to the data heterogeneous rule, setting the last modification time of the initialization data, and executing the Job.

Here, the Job is a scheduled database task.

The historical data Job heterogeneously processes the historical data and ensures that data existing before the heterogeneous processing started is not lost.

The binLog is the database update log, a binary-format file that records the SQL statements with which users update the database.

In an embodiment, heterogeneously processing the real-time data according to the data heterogeneous rule by listening to the binLog includes:

obtaining binLog entries by listening to the binLog, setting a unique data ID for each binLog entry, sending the data ID and the corresponding real-time data to a kafka data pipeline as a kafka message, and heterogeneously processing the real-time data through a plurality of parallel heterogeneous services according to the data heterogeneous rule.

The unique ID is formed as database link + database name + table name + primary key ID; it is set as the ID of the kafka message, which is then sent into the kafka data pipeline.

In this embodiment, the binLog may be listened to on the MySQL side, and simple data filtering may be performed, mainly to filter out operations that do not affect data and thereby reduce the amount of data to be processed. A data ID is then generated in order to classify the data: to preserve data order, it is only necessary to keep the order of the changes to one row of one table in one database. To this end, an ID is set for each binLog entry according to the rule database link + database name + table name + primary key ID.
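The ID rule and the filtering step can be sketched as follows, in Python and under stated assumptions. The `|` separator, the event type names, and the use of CRC32 as a stand-in for the hash applied to a message key are all assumptions made for illustration; the source only specifies which four parts make up the ID.

```python
import zlib


def build_data_id(db_link: str, db_name: str, table: str, primary_key) -> str:
    """Concatenate database link + database name + table name + primary key ID.

    The '|' separator is an assumption; the source does not name one.
    """
    return "|".join([db_link, db_name, table, str(primary_key)])


def pick_partition(data_id: str, num_partitions: int) -> int:
    """Map a data ID to a partition.

    CRC32 is only a stand-in for whatever hash the message pipeline applies
    to the message ID; the point is that equal IDs land on the same partition.
    """
    return zlib.crc32(data_id.encode("utf-8")) % num_partitions


def is_data_change(event: dict) -> bool:
    """Keep only operations that affect row data (hypothetical event types)."""
    return event.get("type") in {"INSERT", "UPDATE", "DELETE"}
```

Because every change to the same row yields the same ID, all of its messages hash to the same partition, so their relative order is preserved without ordering the whole stream.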

In one embodiment, after the Job finishes, the offset for real-time consumption is set according to the last modification time of the data and real-time data heterogeneous processing is started; the historical data heterogeneous processing is complete when the data consumption time equals the latest data time.

The offset for real-time consumption is the last modification time of the initialization data. When the data consumption time equals the latest data time, the historical data heterogeneous processing is complete and the new data (i.e. the real-time heterogeneous data) is running normally.
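The handoff from the historical Job to real-time consumption can be pictured with a small simulation, sketched here in Python. The in-memory table, the event tuples and the integer timestamps are hypothetical stand-ins for the database, the binLog-derived messages and real modification times; the sketch assumes that writes to the output are idempotent upserts.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]         # (last modification time, value)
Event = Tuple[int, int, str]  # (modification time, primary key, value)


def run_history_job(table: Dict[int, Row], index: Dict[int, str], cutoff: int) -> None:
    """Copy every row last modified at or before `cutoff` into the output index."""
    for pk, (mtime, value) in table.items():
        if mtime <= cutoff:
            index[pk] = value


def consume_from(events: List[Event], index: Dict[int, str], start_time: int) -> int:
    """Apply change events with time >= start_time as idempotent upserts.

    Returns the time of the last consumed event, i.e. the data consumption time.
    """
    consumed = start_time
    for mtime, pk, value in events:
        if mtime >= start_time:
            index[pk] = value
            consumed = mtime
    return consumed
```

With a cutoff of 4, for example, rows modified at times 1 and 2 are loaded by the Job, and the events at times 5 and 7 are applied by real-time consumption. Once the returned consumption time equals the time of the newest event, the two paths have converged.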

The new data (i.e. the real-time heterogeneous data) is then given the same alias as the historical heterogeneous data, the alias of the historical heterogeneous data is deleted, the new data is put into use, and the obsolete data is deleted.
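The alias switch can be sketched with an in-memory model, written in Python. The `AliasRegistry` class and the index names are hypothetical, and the sketch does not call any search engine API; it only shows the order of operations: point the alias at the new data, then drop the old data.

```python
from typing import Dict, Optional


class AliasRegistry:
    """In-memory stand-in for a search engine's indices and aliases."""

    def __init__(self) -> None:
        self.indices: Dict[str, Dict[int, str]] = {}  # index name -> documents
        self.aliases: Dict[str, str] = {}             # alias -> index name

    def create_index(self, name: str, docs: Dict[int, str]) -> None:
        self.indices[name] = dict(docs)

    def switch_alias(self, alias: str, new_index: str) -> Optional[str]:
        """Point `alias` at `new_index`, delete the previous index, return its name."""
        if new_index not in self.indices:
            raise KeyError(new_index)
        old = self.aliases.get(alias)
        self.aliases[alias] = new_index
        if old is not None and old != new_index:
            del self.indices[old]
        return old

    def search(self, alias: str, doc_id: int) -> Optional[str]:
        """Readers always go through the alias, never the index name."""
        return self.indices[self.aliases[alias]].get(doc_id)
```

Since readers only ever use the alias, they see the old data until the switch and the new data afterwards, without having to change the name they query.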

The kafka messages are pulled, and the data is heterogeneously processed according to the data rule.

Step 103, outputting the data obtained by the heterogeneous processing to a search engine.

The search engine may be Elasticsearch.

In an embodiment, the method further comprises: managing the plurality of parallel heterogeneous services through Zookeeper.

When a heterogeneous service is added or removed, the data heterogeneous switch is closed through Zookeeper, the kafka messages are re-hashed, and the service state of each heterogeneous service is reported; after all the heterogeneous services are in the no-data-processing state, the data heterogeneous switch is opened through Zookeeper to perform data heterogeneous processing.

Zookeeper is used to implement the heterogeneous registry and heterogeneous service management. Each time a heterogeneous service starts, it registers with the heterogeneous registry. The heterogeneous service management stores a switch indicating whether data heterogeneous processing is enabled, together with the current state of each service: data heterogeneous processing in progress, or no data processing. When a heterogeneous service is added or removed, data heterogeneous processing is switched off, the kafka messages are re-hashed, and service state reporting is triggered; when all services are in the no-data-processing state, the data heterogeneous switch is opened and data heterogeneous processing starts. This solves the problem that, when kafka messages are re-hashed while services are being added or removed, messages with the same data ID are distributed to different machines and a later message is processed before an earlier one.
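The switch logic can be sketched as a small state machine, written in Python. This is an in-memory illustration only: the `HeterogeneousCoordinator` class, the state names and the CRC32-based assignment are hypothetical, and no Zookeeper client is used. In the embodiment the switch and the service states are held in Zookeeper.

```python
import zlib
from typing import Dict


class HeterogeneousCoordinator:
    """In-memory stand-in for the registry and switch that Zookeeper holds."""

    def __init__(self) -> None:
        self.switch_open = False
        self.states: Dict[str, str] = {}  # service id -> "unknown" | "processing" | "idle"

    def _membership_changed(self) -> None:
        # Close the switch and require every service to report its state again.
        self.switch_open = False
        for service_id in self.states:
            self.states[service_id] = "unknown"

    def register(self, service_id: str) -> None:
        self.states[service_id] = "unknown"
        self._membership_changed()

    def unregister(self, service_id: str) -> None:
        self.states.pop(service_id, None)
        self._membership_changed()

    def report(self, service_id: str, state: str) -> None:
        """Record a state report; reopen the switch once every service is idle."""
        if service_id not in self.states:
            raise KeyError(service_id)
        self.states[service_id] = state
        if all(s == "idle" for s in self.states.values()):
            self.switch_open = True

    def owner_of(self, data_id: str) -> str:
        """Re-hash: assign a data ID to one of the currently registered services."""
        services = sorted(self.states)
        return services[zlib.crc32(data_id.encode("utf-8")) % len(services)]
```

Keeping the switch closed until every service has reported the idle state means no service is still working on a message for a data ID that the new hashing assigns to another service.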

In an embodiment, the method further comprises:

replaying and reprocessing the real-time data according to the kafka offset through Zookeeper.

In this embodiment, Zookeeper manages the kafka offset, so data can be restored to a certain point in time and heterogeneously processed again.

The data can be replayed and reprocessed based on the kafka offset: by setting the kafka offset in the form of a point in time, message consumption can be rewound to the set time, so that the data from that time onward is consumed again and refreshed.
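Replay from a point in time can be sketched as follows, in Python. The list-based log and integer timestamps are hypothetical stand-ins for a message partition; the sketch only shows how a time point is translated into an offset and how consumption restarts from there. The Kafka consumer API offers a comparable lookup of offsets by timestamp (`offsetsForTimes`), but nothing here calls it.

```python
from bisect import bisect_left
from typing import Callable, List, Tuple


def offset_for_time(timestamps: List[int], time_point: int) -> int:
    """Return the first offset whose timestamp is >= time_point.

    `timestamps` must be sorted ascending; returns len(timestamps) if no
    message is that recent.
    """
    return bisect_left(timestamps, time_point)


def replay(log: List[Tuple[int, str]], time_point: int, handler: Callable[[str], None]) -> int:
    """Re-consume every message from `time_point` onward and return the start offset."""
    timestamps = [ts for ts, _ in log]
    start = offset_for_time(timestamps, time_point)
    for _, payload in log[start:]:
        handler(payload)
    return start
```

If the handler writes idempotently, as an upsert keyed by data ID would, replaying the same messages again refreshes the output without creating duplicates.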

The embodiment of the application can improve the stability and effectiveness of the heterogeneous services, effectively reduce the pressure that data heterogeneous processing puts on the structured database, adapt to data changes through configurable data rules, and enable smooth switching between new and old data, ensuring that the system remains compatible with data changes.

The following is a description of an application example.

In the application example, kafka serves as the carrier for data flow, Zookeeper manages the data heterogeneous processing services and the kafka offset, and Elasticsearch serves as the carrier for the heterogeneous result.

The example is implemented in Java. Data is distributed to kafka, and the data heterogeneous processing services listen to the kafka messages, clean and heterogeneously process the data, and write it to Elasticsearch.

As shown in FIG. 2 and FIG. 3, the method comprises the following steps:

Step 201, setting data input source information.

There may be a plurality of data input sources.

Step 202, obtaining the data input source table structure, then executing step 205.

Step 203, setting data output source information.

Step 204, obtaining the data output source Index structure, then executing step 205.

Steps 201 to 202 and steps 203 to 204 are executed in parallel.

Step 205, setting mapping rules.

The mapping rules are set according to the data input source table structure and the data output source Index structure, and are stored in the unstructured database Redis.

Step 206, creating the historical data Job.

The historical data Job is created according to the mapping rules.

Step 207, creating the real-time data heterogeneous task.

The real-time data heterogeneous task is created according to the mapping rules.

Step 208, starting the real-time heterogeneous rule and verifying the correctness of the real-time heterogeneous processing.

Step 209, starting the historical data Job.

The last modification time of the initialization data (namely the Job start time) is set, and the Job is executed.

Step 210, the historical data Job is completed.

Step 211, rewinding the real-time data to the start time of the historical data Job and performing real-time heterogeneous processing again.

Step 212, the new rule mapping data is validated.

The new rule mapping data refers to current real-time heterogeneous data.

Step 213, stopping the old rule heterogeneous data.

The old rule heterogeneous data refers to the real-time heterogeneous data produced previously.

Step 214, deleting the old rule data Index.

Step 215, the heterogeneous switchover is completed.

As shown in fig. 4, the data heterogeneous apparatus according to the embodiment of the present invention includes:

the data model rule module 41 is used for setting data heterogeneous rules according to the information of the data input source and the information of the data output source;

the historical data heterogeneous module 42 is used for heterogeneously processing the historical data according to the data heterogeneous rule;

and the data distribution module 43 is configured to heterogeneously process the real-time data according to the data heterogeneous rule by listening to the database update log (binLog), and to output the data obtained by the heterogeneous processing to the search engine.

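In an embodiment, the data input source information includes a data input source table structure, the data output source information includes a data output source index, and the data model rule module 41 is configured to:

obtain the data input source table structure and the data output source index, map each field in the data output source index to a data input source field according to the data input source table structure, generate a mapping relation rule model, and store the mapping relation rule model in an unstructured database.

In an embodiment, the historical data heterogeneous module 42 is configured to:

create a historical data initialization task (Job) according to the data heterogeneous rule, set the last modification time of the initialization data, and execute the Job.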

In an embodiment, the data distribution module 43 is configured to:

obtain binLog entries by listening to the binLog, set a unique data ID for each binLog entry, send the data ID and the corresponding real-time data to a kafka data pipeline as a kafka message, and heterogeneously process the real-time data through a plurality of parallel heterogeneous services according to the data heterogeneous rule.

In one embodiment, the apparatus further comprises:

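a data heterogeneous processing service management module 44, configured to:

manage the plurality of parallel heterogeneous services through Zookeeper.

In an embodiment, the data heterogeneous processing service management module 44 is configured to:

when a heterogeneous service is added or removed, close the data heterogeneous switch through Zookeeper, re-hash the kafka messages, and report the service state of each heterogeneous service; and after all the heterogeneous services are in the no-data-processing state, open the data heterogeneous switch through Zookeeper to perform data heterogeneous processing.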

In one embodiment, the apparatus further comprises:

a data playback recovery module 45, configured to replay and reprocess the real-time data according to the kafka offset through Zookeeper.

In an embodiment, the data playback recovery module 45 is further configured to, after the Job is completed, set the offset for real-time consumption according to the last modification time of the data and start real-time data heterogeneous processing, the historical data heterogeneous processing being complete when the data consumption time equals the latest data time.

The embodiment of the application improves the stability and real-time performance of data heterogeneous processing, supports configurable heterogeneous rules, and guarantees data availability.

The embodiment of the present application further provides a device for data heterogeneous processing, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the data heterogeneous method when executing the program.

The embodiment of the application also provides a computer-readable storage medium, which stores computer-executable instructions, wherein the computer-executable instructions are used for executing the data heterogeneous method.

In this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Claims

1. A method for data heterogeneity, comprising:

outputting the data obtained by the heterogeneous processing to a search engine;

the data input source information includes a data input source table structure, the data output source information includes a data output source index, and the setting of the data heterogeneous rule according to the data input source information and the data output source information includes:

acquiring a data input source table structure and a data output source index, performing data input source field mapping on each field in the data output source index according to the data input source table structure, generating a mapping relation rule model, and storing the mapping relation rule model in an unstructured database;

the mapping relation rule model comprises a plurality of rule groups, and the data input source table structure and the rule groups are in a many-to-many relation;

the method further comprises the following steps: managing the plurality of parallel heterogeneous services through a Zookeeper, comprising:

and after all the heterogeneous services are in the no-data-processing state, opening the data heterogeneous switch through Zookeeper to perform data heterogeneous processing.

2. The method of claim 1, wherein heterogeneously processing the historical data according to the data heterogeneous rule comprises:

3. The method of claim 2, further comprising:

4. The method of claim 1, wherein heterogeneously processing the real-time data according to the data heterogeneous rule by listening to the binLog comprises:

5. The method of claim 1, further comprising:

6. An apparatus for data heterogeneity, comprising:

the data distribution module is used for heterogeneously processing the real-time data according to the data heterogeneous rule by listening to the database update log (binLog), and outputting the data obtained by the heterogeneous processing to a search engine;

the data distribution module is further configured to: managing the plurality of parallel heterogeneous services through a Zookeeper, comprising:

7. A device for data heterogeneity, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 5 when executing the program.

8. A computer-readable storage medium storing computer-executable instructions for performing the method of any one of claims 1-5.