CN117278572A - Data transmission method and device, electronic equipment and readable storage medium - Google Patents

Data transmission method and device, electronic equipment and readable storage medium Download PDF

Info

Publication number
CN117278572A
CN117278572A (application number CN202311208783.8A)
Authority
CN
China
Prior art keywords
data
machine room
target
copy
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311208783.8A
Other languages
Chinese (zh)
Inventor
宗诚
赵玉辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai, Zhuhai Lianyun Technology Co Ltd filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN202311208783.8A priority Critical patent/CN117278572A/en
Publication of CN117278572A publication Critical patent/CN117278572A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1095Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/60Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the invention provides a data transmission method, a data transmission device, electronic equipment and a readable storage medium. In the method, main job data are determined; a replication service is initiated and remote machine room copy data are generated for the main job data; the remote machine room copy data are sent to the target resource scheduling platform YARN; a user IP is acquired, and when the user IP corresponds to the direct network of the target machine room, a first target server closest to the user IP is determined from the target machine room based on the user IP; and the remote machine room copy data are sent to the first target server through the target Router HDFS Router. This solves the machine room capacity bottleneck and the insufficient network bandwidth of a multi-machine-room scheme, reduces the impact of network jitter or disconnection in the multi-machine-room scheme, and ensures the service performance and stability of the multi-machine-room architecture.

Description

Data transmission method and device, electronic equipment and readable storage medium
Technical Field
The embodiment of the invention relates to the technical field of multi-machine-room data interaction, in particular to a data transmission method, a data transmission device, electronic equipment and a computer readable storage medium.
Background
With the rapid development of big data services, business data is produced ever faster, offline clusters expand quickly, and rack positions in existing machine rooms are consumed rapidly; the capacity ceiling of a single machine room will be reached in the foreseeable future and will block business growth. A multi-machine-room (scale-out) scheme addresses this capacity problem: when the capacity of one machine room is exhausted, it is expanded into several machine rooms, and the existing framework is adapted so that, from the user's point of view, the system still behaves like a single machine room. Capacity can be added flexibly according to business demand, which avoids capacity redundancy to a certain extent. However, when related technologies perform data interaction across machine rooms, the network bandwidth between machine rooms generally also becomes a bottleneck; meanwhile, network jitter or disconnection may cause anomalies in cross-machine-room services.
Disclosure of Invention
The embodiment of the invention provides a data transmission method, a data transmission device, electronic equipment and a computer readable storage medium, which are used for solving the problem of insufficient network bandwidth in a multi-machine-room scheme.
The embodiment of the invention discloses a data transmission method applied to a service deployment global control layer, wherein the service deployment global control layer is configured with a corresponding target machine room, the target machine room is configured with a target resource scheduling platform YARN and a target file storage system HDFS, and the target file storage system HDFS has a corresponding target Router HDFS Router; the method comprises the following steps:
Determining main job data;
initiating a replication service and generating remote machine room copy data for the main job data;
sending the remote machine room copy data to the target resource scheduling platform YARN;
acquiring a user IP, and determining a first target server closest to the user IP from the target machine room based on the user IP when the user IP corresponds to the direct network of the target machine room;
and sending the remote machine room copy data to the first target server by using the target Router HDFS Router.
Optionally, the service deployment global control layer is provided with a dependency relationship solving program, the service deployment global control layer is further configured with a corresponding initial machine room, the initial machine room is used for storing initial data, and the step of determining main job data includes:
generating a job relation chain division model by using the dependency relationship solving program;
generating a directed acyclic graph DAG for the initial data by using the job relation chain division model, and determining the data dependency relationships in the initial data based on the directed acyclic graph DAG; job data with different data dependency relationships are different types of job data;
and determining the main job data through the data dependency relationships.
Optionally, the service deployment global control layer includes a data manager DataManager, and the data manager DataManager is used for storing the main job data.
Optionally, the data manager DataManager is provided with a corresponding data replication server, the service deployment global control layer includes a job management platform, the job management platform is provided with a corresponding Hive metadata store (Hive MetaStore), and the step of generating the remote machine room copy data for the main job data includes:
acquiring activity information Event generated by the Hive MetaStore;
and determining, from the activity information Event, the paths that match the rules in the rule base of the data manager DataManager, and, when it is detected that a new partition of a hot-spot table is generated for the activity information Event, calling the data replication server to generate a remote machine room copy.
Optionally, the step of generating the remote machine room copy data for the main job data includes:
and generating a data snapshot for the main job data, and determining the data snapshot as the remote machine room copy.
Optionally, the initial machine room is configured with an initial resource scheduling platform YARN and an initial file storage system HDFS, where the initial file storage system HDFS has a corresponding initial Router HDFS Router, and the method further includes:
when the user IP corresponds to the direct network of the initial machine room, determining a second target server closest to the user IP from the initial machine room based on the user IP;
and adopting the initial Router HDFS Router to send the main job data to the second target server.
Optionally, the method further comprises:
judging whether the target resource scheduling platform YARN finishes data copy preparation operation or not; wherein the data copy preparation operation includes verifying version information of the data copy;
and when it is determined that the target resource scheduling platform YARN has completed the data copy preparation operation, executing the step of initiating a replication service and generating the remote machine room copy data for the main job data.
Optionally, the method further comprises:
judging whether an access token corresponding to the user IP is received or not;
and when it is determined that the access token is received and meets a preset verification rule, calling the Router HDFS Router to send the remote machine room copy data or the main job data to a target server.
Optionally, the target file storage system HDFS and the initial file storage system HDFS respectively have corresponding engine-side programs and data node-side programs, and the method further includes:
determining planned traffic information for the remote machine room copy data;
acquiring unplanned traffic information other than the planned traffic information;
optimizing the engine-side program and the data node-side program based on the unplanned traffic information.
The embodiment of the invention also discloses a data transmission device, which is applied to a service deployment global control layer, wherein the service deployment global control layer is configured with a corresponding target machine room, the target machine room is configured with a target resource scheduling platform YARN and a target file storage system HDFS, the target file storage system HDFS is provided with a corresponding target Router HDFS Router, and the data transmission device comprises:
the main job data determining module, used for determining main job data;
the remote machine room copy data generation module is used for initiating copy service and generating remote machine room copy data aiming at the main job data;
the remote machine room copy data sending module is used for sending the remote machine room copy data to the target resource scheduling platform YARN;
The user IP acquisition module is used for acquiring a user IP and determining a first target server closest to the user IP from the target machine room based on the user IP when the user IP corresponds to the direct network of the target machine room;
and the remote machine room copy data transmitting module is used for transmitting the remote machine room copy data to the first target server by adopting the target Router HDFS Router.
The embodiment of the invention also discloses electronic equipment, which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the method according to the embodiment of the present invention when executing the program stored in the memory.
The embodiment of the invention also discloses a computer program product which is stored in a storage medium and is executed by at least one processor to realize the method according to the embodiment of the invention.
Embodiments of the present invention also disclose a computer-readable storage medium having instructions stored thereon, which when executed by one or more processors, cause the processors to perform the method according to the embodiments of the present invention.
The embodiment of the invention has the following advantages:
according to the embodiment of the invention, the main operation data are determined; initiating a replication service and generating remote machine room replication data for the master job data; sending the remote machine room copy data to the target resource scheduling platform YARN; acquiring a user IP, and determining a first target server closest to the user IP from the target machine room based on the user IP when the user IP corresponds to the direct network of the target machine room; the target Router HDFS Router is adopted to send the copy data of the different-place machine room to the first target server, so that the problems of machine room capacity bottleneck and insufficient network bandwidth in a multi-machine room scheme are solved, the influence caused by network jitter/network disconnection problems of the multi-machine room scheme is reduced, and the service performance and stability of the multi-machine room architecture are ensured.
Drawings
Fig. 1 is a flowchart of steps of a data transmission method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a service deployment global control layer provided in an embodiment of the present invention;
fig. 3 is a flow chart of a data transmission method according to an embodiment of the present invention;
FIG. 4 is a directed acyclic graph of job data dependencies provided in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a data replication process according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a routing process provided in an embodiment of the present invention;
FIG. 7 is a schematic diagram of another routing process provided in an embodiment of the present invention;
FIG. 8 is a schematic diagram of a version verification process provided in an embodiment of the present invention;
FIG. 9 is a schematic diagram of an access token issuance process according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a log collection flow provided in an embodiment of the present invention;
fig. 11 is a block diagram of a data transmission apparatus provided in an embodiment of the present invention;
FIG. 12 is a schematic diagram of a hardware architecture of an electronic device implementing various embodiments of the invention;
fig. 13 is a schematic diagram of a computer readable medium provided in an embodiment of the invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
The offline scenario is mainly a batch-processing scenario in which massive historical data are analyzed or processed offline; it is insensitive to latency, but because the volume of processed data is huge it consumes a large amount of network bandwidth and other resources. In addition, in production scenarios the number of jobs is generally large and their execution times are not controlled. If the hosts of two machine rooms are simply combined into one cluster, a large number of cross-machine-room accesses will occur, and the resulting random traffic will fill up the limited cross-machine-room bandwidth, which affects not only offline work but also other cross-machine-room services. Therefore, how to prevent random cross-machine-room traffic from flooding the cross-machine-room bandwidth is an important issue that the multi-machine-room approach must address.
The quality of a cross-machine-room network is far lower than that of the CLOS switching fabric inside a machine room; it is affected by the operator's service quality (or construction work) and may suffer jitter or even outages. If the hosts of two machine rooms are operated as one cluster, network jitter not only increases cross-machine-room read/write latency but also affects processes such as the IBR (incremental block report) of the data node DN (Datanode), reducing service performance and stability. When a serious problem cuts off the network, the data of the remote machine room become unavailable, the remote machine room DNs cannot be reached, a large number of Blocks fall below their expected replica count, and the name node NN (Namenode) is triggered to replicate a large number of copies, among other effects. Therefore, how to reduce the impact of network jitter and network connectivity problems is another problem that a multi-machine-room solution cannot ignore.
As described above, the main contradiction is between insufficient and unstable cross-machine-room bandwidth on the one hand and the efficient output of offline mass data processing tasks on the other; the key questions are how to reduce cross-machine-room bandwidth consumption and how to reduce the impact of network stability problems.
In the inventors' investigation it was found that the unitized architecture is a deployment architecture that evolved to solve the multi-site multi-center problem. A unit is a self-contained set able to complete all business operations: it contains all the services the business needs and the data allocated to that unit. Following the unitized idea, in a multi-machine-room scenario each machine room can act as a unit that provides all the services and data needed for job execution, so that jobs are guaranteed to complete inside the unit and the core problem of multiple machine rooms is addressed. After unitized splitting, the failure of any one unit affects only that part and does not paralyze the whole. Once a multi-machine-room scheme is designed with the unitized idea, its core problem narrows to how to decide job and data placement and how to let jobs access data over a short distance, so as to reduce cross-machine-room bandwidth consumption and the impact of network stability problems.
Therefore, the embodiment of the invention designs a multi-machine-room scheme with a limited unitized idea. Each machine room is provided with an independent and complete resource scheduling platform YARN and HDFS (Hadoop Distributed File System), which gives jobs the most basic service guarantee for execution inside one machine room and reduces the impact range when an anomaly occurs on the cross-machine-room network. Meanwhile, through reasonable job placement and planned data replication, the problems of random cross-machine-room access traffic and repeated cross-machine-room data consumption are solved, reducing bandwidth consumption. In addition, considering the internal infrastructure and the requirements of both table and non-table scenarios, the scheme uses multiple mount points based on an extended HDFS Router (RBF, Router-Based Federation) to implement data copy management and data routing, and automatically routes data requests to the closer machine room through Client IP awareness.
In practice, the scheme largely resolves the contradiction between the insufficient and unstable cross-machine-room network bandwidth and the efficient output of offline tasks. It is notably effective against network connectivity risks, further reduces the impact range of network problems, and at the same time gives some high-priority jobs "dual-active" capability.
Referring to fig. 1, which is a flowchart of the steps of a data transmission method provided in an embodiment of the present invention, the method may specifically include the following steps:
step 101, determining main job data;
step 102, initiating a replication service and generating remote machine room copy data for the main job data;
step 103, sending the remote machine room copy data to the target resource scheduling platform YARN;
step 104, acquiring a user IP, and, when the user IP corresponds to the direct network of the target machine room, determining a first target server closest to the user IP from the target machine room based on the user IP;
and step 105, sending the remote machine room copy data to the first target server by using the target Router HDFS Router.
In a specific implementation, the embodiment of the present invention may be applied to a service deployment global control layer Global controller layer (Archer). Referring to fig. 2, fig. 2 is a schematic structural diagram of the service deployment global control layer provided in the embodiment of the present invention. The service deployment global control layer may have a plurality of corresponding machine rooms, and the target machine room may be any machine room that receives remote machine room copy data; each machine room may be configured with a resource scheduling platform YARN and a file storage system HDFS, where the file storage system HDFS has a corresponding target Router HDFS Router. The embodiment of the invention deploys a set of independent and complete clusters (YARN and HDFS) for each machine room, which provides the most basic service guarantee for job execution inside one machine room and reduces the impact range when an anomaly occurs on the cross-machine-room network; meanwhile, through reasonable job placement and planned data replication, the problems of random cross-machine-room access traffic and repeated cross-machine-room data consumption are solved, reducing bandwidth consumption; in addition, considering the internal infrastructure and the requirements of both table and non-table scenarios, multiple mount points based on the extended HDFS Router implement data copy management and data routing, and data requests can be automatically routed to the closer machine room through Client IP awareness.
Referring to fig. 3, fig. 3 is a flow chart of a data transmission method according to an embodiment of the present invention;
Illustratively, the service deployment global control layer Global controller layer (Archer) can periodically analyze the dependency relationships between jobs and the size of the data they depend on, determine job placement information, and persist it in the data manager DataManager, which manages job placement information and the like. When a job is submitted by the job scheduling platform, the placement machine room (initial machine room) information of the job is first acquired, and it is checked whether the data copy in the expected placement machine room (target machine room) is ready; if it is ready, the job can be submitted, otherwise submission is blocked until the data replication service finishes copying the data. Secondly, after the Job is submitted by scheduling, the Hive Driver is pulled up to generate an executable plan, and the Job is submitted to the YARN cluster of the expected direct-connect network DC (Direct Connect). Meanwhile, modifications are also made at the YARN level: when the YARN cluster pulls up the Job, the pulled-up Job requests HDFS data, and the HDFS Router, according to the DC to which the Client IP (user IP) belongs, automatically routes the request to the NS of the machine room holding the data copy closest to the Client and returns the result to the Client.
In this way, the multi-machine-room scheme solves the capacity bottleneck of a single machine room. The multi-machine-room scheme is designed with a limited unitized idea: each machine room has an independent and complete cluster (YARN & HDFS), data copy management and data routing are implemented through multiple mount points based on the extended HDFS Router (RBF), and data requests are automatically routed to the closer machine room through Client IP awareness. This solves the machine room capacity bottleneck and the insufficient network bandwidth of the multi-machine-room scheme, reduces the impact of network jitter or disconnection, and ensures the service performance and stability of the multi-machine-room architecture.
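To make the submission gate described above concrete, the following is a minimal plain-Java sketch; DataManager, YarnGateway and the method names are hypothetical stand-ins introduced for illustration only, not the actual implementation of the embodiment. The scheduler looks up the planned machine room for a Job, blocks while the remote machine room copy is not ready, and only then submits the Job to that machine room's YARN cluster.

    // Hypothetical sketch of the "check copy ready, then submit" gate (all names are assumptions).
    import java.util.concurrent.TimeUnit;

    interface DataManager {                           // persists job placement and replication paths
        String placementRoom(String jobId);           // expected machine room (DC) for the job
        boolean copyReady(String jobId, String room); // is the data copy in that room ready?
    }

    interface YarnGateway {
        void submit(String jobId, String room);       // submit the job to that room's YARN cluster
    }

    public class JobSubmitGate {
        private final DataManager dm;
        private final YarnGateway yarn;

        JobSubmitGate(DataManager dm, YarnGateway yarn) { this.dm = dm; this.yarn = yarn; }

        /** Blocks submission until the remote machine room copy is ready, then submits. */
        public void submit(String jobId) throws InterruptedException {
            String room = dm.placementRoom(jobId);    // 1. look up the planned placement
            while (!dm.copyReady(jobId, room)) {      // 2. block while the copy is not ready
                TimeUnit.SECONDS.sleep(30);           //    wait for the data replication service
            }
            yarn.submit(jobId, room);                 // 3. submit to the YARN cluster of that room
        }
    }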
According to the embodiment of the invention, main job data are determined; a replication service is initiated and remote machine room copy data are generated for the main job data; the remote machine room copy data are sent to the target resource scheduling platform YARN; a user IP is acquired, and when the user IP corresponds to the direct network of the target machine room, a first target server closest to the user IP is determined from the target machine room based on the user IP; and the remote machine room copy data are sent to the first target server through the target Router HDFS Router. This solves the machine room capacity bottleneck and the insufficient network bandwidth of a multi-machine-room scheme, reduces the impact of network jitter or disconnection, and ensures the service performance and stability of the multi-machine-room architecture.
On the basis of the above embodiments, modified embodiments of the above embodiments are proposed, and it is to be noted here that only the differences from the above embodiments are described in the modified embodiments for the sake of brevity of description.
The data interaction flow provided by the embodiment of the invention comprises several stages around the Job submission flow, such as Job placement, data replication, data routing, version control, data throttling, and cross-machine-room traffic analysis; the implementation of each part is described in detail below.
In an optional embodiment of the present invention, the service deployment global control layer carries a dependency relationship solving program, and the service deployment global control layer is further configured with a corresponding initial machine room, where the initial machine room is used for storing initial data, and the step of determining main job data includes:
generating a job relation chain division model by using the dependency relationship solving program;
generating a directed acyclic graph DAG for the initial data by using the job relation chain division model, and determining the data dependency relationships in the initial data based on the directed acyclic graph DAG; job data with different data dependency relationships are different types of job data;
and determining the main job data through the data dependency relationships.
In practical applications, big data offline scenarios involve a large number of jobs with complex dependencies. For example, in a big data offline report processing business, a complete business process from data collection and cleaning, through the summarization jobs of reports at various levels, to the final export of data to an external business system may involve hundreds or thousands of mutually dependent associated jobs. For job placement, managing and analyzing these complex job dependencies is critical, so a focused job dependency analysis is required to determine the business to be migrated.
In a specific implementation, the service deployment global control layer of the embodiment of the invention can carry a dependency relationship solving program, and is further configured with a corresponding initial machine room for storing initial data. Based on community discovery (Community Detection), the dependency relationship solving program can generate a job relation chain division model that takes cross-machine-room bandwidth cost into account, according to the dependencies between jobs and the size of the data to be processed. The job relation chain division model first builds a directed acyclic graph DAG (Directed Acyclic Graph) from the dependencies between the jobs managed by the scheduling system, then circles out relatively cohesive (relatively closed-loop) business subunits from the DAG, and finally selects the migratable subunits in combination with the data volume exchanged between the interdependent subunits.
Referring to fig. 4, fig. 4 is a schematic diagram of job data dependencies provided in the embodiment of the present invention. Assume that in the figure a square represents computation, a circle represents data, and the size of a circle represents the size of the data; the dashed line can then be used as a dividing boundary to split the DAG into two subunits, which are scheduled to two machine rooms respectively, meeting the goal of low data transmission cost. The whole process considers not only the cross-machine-room data access cost but also whether the computing and storage resources of each machine room can meet the demand.
For example, periodic scheduling jobs such as ETL (Extract-Transform-Load) in actual production are generally relatively stable and do not change frequently, and some jobs never change, so the dependency analysis that decides in which machine room a Job is placed can be generated periodically by offline computation in units of days or weeks. In addition, from a management perspective, a company usually has several relatively independent business departments, each of which is vertically divided into several business subunits, and the association within a business is far tighter than that between businesses; meanwhile, the business (unit) is also the unit of resource management and communication when the multi-machine-room scheme is put into practice. In practice, therefore, the dependency-based division is usually bounded by business units.
Of course, the above examples are merely examples, and those skilled in the art may also use any other division manner as the dependency division, which is not limited to the embodiments of the present invention.
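As a rough illustration of how the division boundary in fig. 4 can be evaluated, the following plain-Java toy sketch weights each dependency edge by the data volume it carries and computes the cross-machine-room cost of a candidate split; the real scheme uses community detection and also checks compute and storage capacity, and all class and method names here are assumptions for illustration.

    import java.util.*;

    /** Toy job-DAG partitioner: an edge carries the data volume (GB) passed between two jobs. */
    public class DagPartitioner {
        // edges.get(a).get(b) = GB of data that job b reads from job a
        private final Map<String, Map<String, Long>> edges = new HashMap<>();

        public void addEdge(String from, String to, long gb) {
            edges.computeIfAbsent(from, k -> new HashMap<>()).put(to, gb);
        }

        /** Cross-room traffic if the given jobs are migrated to room B and the rest stay in room A. */
        public long crossTraffic(Set<String> roomB) {
            long total = 0;
            for (Map.Entry<String, Map<String, Long>> from : edges.entrySet())
                for (Map.Entry<String, Long> to : from.getValue().entrySet())
                    if (roomB.contains(from.getKey()) != roomB.contains(to.getKey()))
                        total += to.getValue();          // this edge crosses the room boundary
            return total;
        }

        public static void main(String[] args) {
            DagPartitioner p = new DagPartitioner();
            p.addEdge("clean", "report_daily", 500);     // tightly coupled reporting chain
            p.addEdge("report_daily", "report_month", 50);
            p.addEdge("clean", "export", 5);             // loosely coupled job
            // migrating the whole reporting chain only costs the 5 GB "export" edge:
            System.out.println(p.crossTraffic(Set.of("clean", "report_daily", "report_month")));
        }
    }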
Optionally, the service deployment global control layer includes a data manager DataManager, and the data manager DataManager is used for storing the main job data.
In a production environment there are usually multiple Job scheduling platforms, and maintaining the machine room placement information of jobs on any single platform cannot cover all jobs. Referring to fig. 2, the embodiment of the present invention therefore introduces a data manager DataManager service as an access layer to manage the IDC (Internet Data Center) information of job placement and the path information that requires data replication; a platform can access the multi-machine-room system by connecting to this service.
In an alternative embodiment of the present invention, the data manager DataManager is provided with a corresponding data replication server, the service deployment global control layer includes a job management platform, the job management platform is provided with a corresponding Hive metadata store (Hive MetaStore), and the step of generating the remote machine room copy data for the main job data includes:
Acquiring activity information Event generated by the Hive MetaStore;
and determining, from the activity information Event, the paths that match the rules in the rule base of the data manager DataManager, and, when it is detected that a new partition of a hot-spot table is generated for the activity information Event, calling the data replication server to generate a remote machine room copy.
In practical applications, job placement generally places closely related Jobs in the same machine room, reducing cross-machine-room access and therefore cross-machine-room network bandwidth consumption. For cross-machine-room dependencies that cannot be removed, especially data used more than once in the remote machine room, a data copy needs to exist in the remote machine room to reduce network bandwidth consumption; the embodiment of the present invention therefore provides a data replication service for copy replication.
The data replication service is implemented on the basis of the data replication tool DistCp and enhanced in terms of correctness, atomicity, idempotency, transmission efficiency and the like; it also supports flow control, multi-tenant transmission priority (high-priority jobs can obtain a larger quota of cross-machine-room traffic and computing resources), copy life cycle management and other functions.
Specifically, data replication is mainly performed for regular periodic scheduling jobs. These jobs are generally fixed, and by analyzing their historical run records the input and output of a job, including the data paths and the range of data used, can be deduced, which prevents a large amount of replication caused by long-span backfill tasks. Therefore, after the jobs to be migrated are determined, the data path rules (rules) can be extracted and persisted into the rule base of the data manager DataManager; the rule base is updated periodically as job placement changes.
A rule base is used for path extraction in different scenarios. Taking the data warehouse tool Hive table scenario as an example, referring to fig. 5, fig. 5 is a schematic diagram of the data replication process provided in an embodiment of the present invention: first, the table/partition related Event information of the Hive MetaStore is collected into a Kafka service; then a real-time link task filters out the paths that match the rules in the rule base; when it is detected that a new partition of a hot-spot table is generated, the Data Replication Service (DRS) transmits the path to generate a remote machine room copy. The DRS is essentially a management service for DistCp jobs. After the transmission is completed, the data replication service persists the copy information (including path, version, TTL and the like) so as to perform full life cycle management of the copy data (expired cross-machine-room copies are deleted to release storage space).
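The trigger logic of fig. 5 can be pictured with the short plain-Java sketch below; ReplicaService, the rule-prefix matching and the method names are hypothetical simplifications (the actual service matches persisted rule paths and manages DistCp jobs with version/TTL bookkeeping).

    import java.util.List;

    /** Hypothetical sketch of the hot-partition replication trigger (names are assumptions). */
    public class ReplicationTrigger {
        interface ReplicaService { void copyToRemoteRoom(String path); }  // wraps a managed DistCp job

        private final List<String> rulePrefixes;  // rule paths persisted in the DataManager rule base
        private final ReplicaService drs;

        ReplicationTrigger(List<String> rulePrefixes, ReplicaService drs) {
            this.rulePrefixes = rulePrefixes;
            this.drs = drs;
        }

        /** Called for each ADD_PARTITION style Event consumed from Kafka. */
        public void onAddPartition(String table, String partitionPath) {
            boolean hot = rulePrefixes.stream().anyMatch(partitionPath::startsWith);
            if (hot) {
                drs.copyToRemoteRoom(partitionPath);  // launch the copy; the service then persists
            }                                         // path/version/TTL for life cycle management
        }
    }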
Optionally, the step of generating the remote machine room copy data for the main job data includes:
and generating a data snapshot for the main job data, and determining the data snapshot as the remote machine room copy.
The replication flow adopts a strategy of automatically discovering active replication, so data copies can be captured and prepared quickly, which effectively meets the service requirements of offline scenarios. This strategy of automatically discovering active replication effectively solves the problem of incremental data replication; however, a job to be migrated may also depend on stock data over a long period. For this problem, besides preparing the stock data by starting the replication process in advance, a data migration strategy based on Snapshot snapshots can be introduced for initial replication in scenarios that require fast migration.
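A minimal Java sketch of the Snapshot-based initial replication is shown below; FileSystem.createSnapshot is the standard HDFS API, while ReplicaService is a hypothetical stand-in for the DistCp-based replication service, and the directory is assumed to have been made snapshottable by an administrator beforehand.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Sketch: freeze the stock data with an HDFS snapshot, then bulk-copy the frozen view. */
    public class SnapshotMigration {
        interface ReplicaService { void copyToRemoteRoom(Path src); }  // hypothetical DistCp wrapper

        public static void migrate(String dir, ReplicaService drs) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Create a read-only, point-in-time view of the stock data; the directory must
            // already be snapshottable (e.g. enabled by "hdfs dfsadmin -allowSnapshot <dir>").
            Path snapshot = fs.createSnapshot(new Path(dir), "initial-migration");
            drs.copyToRemoteRoom(snapshot);  // copy the immutable snapshot to the remote machine room
        }
    }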
In an optional embodiment of the present invention, the initial machine room is configured with an initial resource scheduling platform YARN and an initial file storage system HDFS, where the initial file storage system HDFS has a corresponding initial Router HDFS Router, and the method further includes:
when the user IP corresponds to the direct network of the initial machine room, determining a second target server closest to the user IP from the initial machine room based on the user IP;
and adopting the initial Router HDFS Router to send the main job data to the second target server.
In practical applications, after data replication both machine rooms may hold data copies under a given path; how to locate the correct data once copies have been placed in multiple IDCs is the key problem the data routing service needs to solve.
On the basis of MergeFs implemented with HDFS Router multi-mount points, a mirror mount point is implemented to provide the data routing function. For convenience of description, it is agreed that the original data is the main job data and the data transmitted to the remote machine room is the remote machine room copy data (also called mirror data, which may only be read and deleted); among the mirror mount points, the first mount point is the main job data and the subsequent mount points are remote machine room copy data (in theory this can be extended to multiple machine rooms). To make routing transparent to the user, an IP location awareness function for the request source is added to the processing logic of the mirror mount point; by acquiring the location information of the request source IP, this function determines the DC of the request source and routes the request to the HDFS of the corresponding DC.
Referring to FIG. 6, FIG. 6 is a schematic diagram of a routing process provided in an embodiment of the present invention: if a data request comes from DC1, the Router redirects it to the HDFS cluster of DC1; if it comes from DC2, to the HDFS cluster of DC2. To reduce cross-machine-room bandwidth consumption, in principle all read operations on data are allowed only in the local machine room (i.e. the machine room where the Client is located); otherwise the data are first copied to the local machine room. However, there is a special case: referring to fig. 7, fig. 7 is a schematic diagram of another routing process provided in the embodiment of the present invention, and if the data replication service Data Replication Service is abnormal and cannot be repaired in a short time, or the server NS is abnormal for a long time, degradation to a throttled cross-machine-room read is allowed (the copy is not ready, and if the data cannot be read in the target machine room within a certain time, degradation is allowed).
Optionally, there may be a special Temporary library in production that manages Temporary tables with a short life cycle created in user SQL (Structured Query Language) jobs (for example, tables cleaned automatically after seven days). The temporary table names are not fixed (for example, some ETL jobs append a date suffix to the temporary table name), so the table paths are not fixed either. Since paths that are not fixed cannot be managed with mirror mount points, a multi-mount point named IDC_FOLLOW can be introduced to mount the temporary library paths in multiple machine rooms; when a temporary table is read or written, the HDFS NS mount path inside the DC where the user Client is located is selected to access the data, which solves the problem of cross-machine-room traffic for temporary tables.
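The routing decision on a mirror mount point can be summarized by the following plain-Java sketch; it is not the actual HDFS Router (RBF) code, and the Mount record, namespace names and the copyReady flag are illustrative assumptions. The first mount is the main data; a request is served from the client's own DC when the main data or a ready copy is there, otherwise it degrades to a throttled cross-machine-room read of the main data.

    import java.util.List;

    /** Hypothetical sketch of the mirror-mount-point routing decision (not actual RBF code). */
    public class MirrorMountRouter {
        record Mount(String dc, String namespace) {}     // first entry = main data, rest = copies

        private final List<Mount> mounts;                // e.g. [ (DC1, ns-dc1), (DC2, ns-dc2) ]

        MirrorMountRouter(List<Mount> mounts) { this.mounts = mounts; }

        /** Resolve a request to the namespace of the client's own DC when possible. */
        public String resolve(String clientDc, boolean copyReady) {
            for (Mount m : mounts) {
                boolean isMain = m.equals(mounts.get(0));
                if (m.dc().equals(clientDc) && (isMain || copyReady)) {
                    return m.namespace();                // local read: main data or a ready copy
                }
            }
            return mounts.get(0).namespace();            // degrade: throttled cross-room read of main data
        }

        public static void main(String[] args) {
            MirrorMountRouter r = new MirrorMountRouter(
                    List.of(new Mount("DC1", "ns-dc1"), new Mount("DC2", "ns-dc2")));
            System.out.println(r.resolve("DC2", true));  // -> ns-dc2 (copy in the client's DC)
            System.out.println(r.resolve("DC2", false)); // -> ns-dc1 (degraded cross-room read)
        }
    }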
In an alternative embodiment of the present invention, further comprising:
judging whether the target resource scheduling platform YARN finishes data copy preparation operation or not; wherein the data copy preparation operation includes verifying version information of the data copy;
and when the target resource scheduling platform YARN is judged to complete the data copy preparation operation, executing the copy initiating service and generating the remote computer room copy data aiming at the main job data.
In a specific implementation, in a distributed scenario copies are produced by data replication, which inevitably raises consistency problems; therefore, when data copies exist in multiple machine rooms, the data version consistency problem must be considered in addition to the routing problem.
The embodiment of the present invention solves this problem by introducing a Version service. To simplify the version service design, and given the write-less/read-more characteristic of big data offline scenarios, a trade-off is made for the mirror mount point implementation according to the CAP theorem: all operations may be performed on the main data, while only read/delete operations are allowed on the copy data. On this premise, a version service based on the HDFS edit log Editlog is introduced. Referring to fig. 8, fig. 8 is a schematic diagram of the version verification flow provided in an embodiment of the present invention: the service, acting as an observer, subscribes to the behavior of paths on the HDFS JN (JournalNodes) and identifies the data version with an operation ID (transaction id). If the data under a subscribed path change, the change is conducted to the JN through the editlog, and the JN then notifies the Version plug-in to update the version. Because every change operation on the data is recorded in the editlog, the version service can capture data changes in both SQL and non-SQL scenarios, effectively ensuring data consistency.
As described in the general flow of the first section, when a job is submitted, after the expected placement machine room of the job is obtained, checking whether the dependent data are Ready also includes a version check. When the job needs copy data, the data transmission service checks whether the version of the transmitted data copy is consistent with the latest version subscribed in the version service; if they are consistent, the job is allowed to be submitted and to use the data copy; otherwise the job is temporarily blocked, and it is allowed to be submitted after the transmission service has updated the copy data. If the data cannot be read in the target machine room within a certain time, the operation degrades to reading the main data.
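The version check before job submission amounts to comparing the transaction id of the copy with the latest transaction id observed from the editlog; the plain-Java sketch below is a simplified illustration with assumed method names, not the actual Version service.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical sketch of the editlog-based version check used at job submission time. */
    public class VersionService {
        // latest transaction id observed per data path (fed by the JournalNode editlog subscription)
        private final Map<String, Long> latestTxId = new ConcurrentHashMap<>();
        // transaction id that the remote machine room copy was built from, per data path
        private final Map<String, Long> copyTxId = new ConcurrentHashMap<>();

        /** Called when an editlog entry for a subscribed path is observed. */
        public void onEdit(String path, long txId) { latestTxId.merge(path, txId, Math::max); }

        /** Called by the data replication service after it finishes copying a path. */
        public void onCopyFinished(String path, long txId) { copyTxId.put(path, txId); }

        /** A job may use the copy only if its version matches the latest subscribed version. */
        public boolean copyUsable(String path) {
            Long copy = copyTxId.get(path);
            Long latest = latestTxId.get(path);
            return copy != null && copy.equals(latest);  // otherwise block the job or refresh the copy
        }
    }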
In an alternative embodiment of the present invention, further comprising:
judging whether an access token corresponding to the user IP is received or not;
and when it is determined that the access token is received and meets a preset verification rule, calling the Router HDFS Router to send the remote machine room copy data or the main job data to a target server.
In practical applications, cross-machine-room bandwidth is limited in the current scenario and is shared with services that are more sensitive to latency, such as online services and real-time services; it is therefore necessary to prevent offline cross-machine-room traffic (especially unplanned cross-machine-room traffic) from saturating the bandwidth and affecting online services.
Referring to fig. 9, fig. 9 is a schematic diagram of an access token issuing process according to an embodiment of the present invention. The core idea of token bucket throttling is that when an operation needs tokens, it must take the corresponding number of tokens out of the token bucket; if the tokens are obtained the operation continues, otherwise it is blocked, and tokens are not returned to the bucket once consumed. Based on this idea, a global central throttling service is designed. On top of the HDFS distributed file system, a throttling distributed file system with read/write throttling is implemented: when a user reads or writes HDFS files through this class, the throttling file system judges, from the user IDC information and the block IDC information in the LocatedBlock returned by RBF, whether the read/write traffic will cross machine rooms; if so, it first tries to request cross-machine-room bandwidth (Token) from the throttling service, and only after the Token is obtained does it perform the subsequent HDFS read/write; when the requested traffic is used up, it applies to the throttling service (ThrottleServer) for new bandwidth Tokens. Besides the inherent characteristics of the token bucket, queue priority and weighted fairness are implemented on top of it, and the queue priorities of the throttling service are mapped one-to-one to the job priorities in the scheduling system, so that important services can obtain Tokens preferentially under multi-tenancy. In terms of stability, to reduce the pressure on the throttling service, each Token is set to represent a relatively large traffic unit, which reduces the performance impact of acquiring Tokens too frequently; to prevent job blocking caused by downtime of the throttling service, a strategy of degrading to a locally fixed bandwidth is added; and, as computing engines are continuously connected to the throttling service, the stability of the service and the request level become bottlenecks (a single instance at 100K+ QPS), so the performance of the throttling service is enhanced by horizontal scaling.
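The token bucket at the heart of the throttling service can be sketched as follows in plain Java; the real service additionally implements queue priority, weighted fairness and degradation to a fixed local bandwidth, and the class name, refill rate and token granularity here are illustrative assumptions.

    import java.util.concurrent.atomic.AtomicLong;

    /** Simplified sketch of the cross-machine-room bandwidth token bucket (coarse-grained tokens). */
    public class CrossRoomThrottle {
        private final long tokensPerSecond;   // refill rate, i.e. the planned cross-room bandwidth
        private final long capacity;
        private final AtomicLong available;
        private long lastRefillMs = System.currentTimeMillis();

        public CrossRoomThrottle(long tokensPerSecond, long capacity) {
            this.tokensPerSecond = tokensPerSecond;
            this.capacity = capacity;
            this.available = new AtomicLong(capacity);
        }

        /** Each token stands for a relatively large traffic unit (e.g. 64 MB) to keep request QPS low. */
        public synchronized boolean tryAcquire(long tokens) {
            long now = System.currentTimeMillis();
            long refill = (now - lastRefillMs) * tokensPerSecond / 1000;
            if (refill > 0) {
                available.set(Math.min(capacity, available.get() + refill));
                lastRefillMs = now;
            }
            if (available.get() >= tokens) {   // enough budget: the cross-room read/write may proceed
                available.addAndGet(-tokens);
                return true;
            }
            return false;                      // caller blocks and retries, or degrades
        }
    }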
In an optional embodiment of the invention, the target file storage system HDFS and the initial file storage system HDFS respectively have corresponding engine-side programs and data node-side programs, and the method further includes:
determining planned traffic information for the remote machine room copy data;
acquiring unplanned traffic information other than the planned traffic information;
optimizing the engine-side program and the data node-side program based on the unplanned traffic information.
In practical applications, as the multi-machine-room project advances, cross-machine-room traffic also grows gradually, and the dedicated line bandwidth is occasionally saturated at peak times. To manage cross-machine-room bandwidth effectively, it is necessary to know which jobs contribute the most cross-machine-room traffic so that targeted governance can be performed. From the perspective of an offline job, network traffic has three main sources: reading data from upstream tables, writing data to downstream tables, and the Shuffle of data between different Executors/Tasks.
In the offline multi-machine-room scenario, because each machine room adopts a unitized architecture and has an independent YARN cluster, jobs do not run across machine rooms and there is no cross-machine-room data Shuffle. Therefore, only the cross-machine-room traffic generated when HDFS files are read and written needs to be considered, and this traffic can be divided into two major categories, planned traffic and unplanned traffic: (1) planned traffic, i.e. the traffic generated by the data replication service when it replicates data, which with high probability is used multiple times; (2) unplanned traffic, i.e. data traffic generated by something other than the data replication service, used once (or a few times).
The main sources of unplanned traffic are the following: (1) long-time-span historical backfill of planned scheduling tasks, whose dependent data copies have been destroyed because they expired; (2) periodic scheduling tasks whose placement is unreasonable (missing/misplaced/newly added, etc.), which can be eliminated by optimizing job placement; (3) unplanned Adhoc queries, which are bursty, used once (or a few times), arise from temporary production requirements that cannot be predicted, and whose required data cannot be prepared in advance.
Referring to fig. 10, fig. 10 is a schematic diagram of a log collection flow provided in an embodiment of the present invention, which can be used to govern cross-machine-room traffic, in particular unplanned traffic. The embodiment of the present invention introduces a cross-machine-room traffic analysis tool and makes the following modifications on the engine side and the DN (data node) side: on the engine side, the Job ID is injected into the ClientName when the HDFS Client is initialized; on the DataNode side, instrumentation is added in the DataXceiver to parse the Job ID from the ClientName, read/write traffic is aggregated by Job ID and client IP network segment, and the statistics are output to a traffic log at preset intervals.
Finally, the cross-machine-room traffic logs on each DN are collected in real time and aggregated into ClickHouse through Flink; aggregation analysis yields the cross-machine-room traffic of each job per time period, the jobs are sorted by traffic in descending order, and a preset number of top-ranked jobs are identified, which facilitates the governance of cross-machine-room traffic (including relocation, urgent investigation, job optimization and the like).
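The per-job traffic attribution described above can be pictured with the following plain-Java sketch; the ClientName convention, class and method names are assumptions for illustration, and the real implementation instruments the DataXceiver and ships the aggregated log to Flink/ClickHouse.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical sketch of per-job cross-machine-room traffic accounting on the DataNode side. */
    public class TrafficAccounting {
        // assumed client-name convention: "<engineClientName>_jobId=<JOB_ID>"
        static String jobIdOf(String clientName) {
            int i = clientName.lastIndexOf("jobId=");
            return i < 0 ? "unknown" : clientName.substring(i + "jobId=".length());
        }

        private final Map<String, Long> bytesByJobAndSubnet = new ConcurrentHashMap<>();

        /** Called from the instrumented read/write path with the bytes just transferred. */
        public void record(String clientName, String clientIp, long bytes) {
            String subnet = clientIp.substring(0, clientIp.lastIndexOf('.'));  // /24 network segment
            bytesByJobAndSubnet.merge(jobIdOf(clientName) + "@" + subnet, bytes, Long::sum);
        }

        /** Periodically flushed to the traffic log, then collected by Flink into ClickHouse. */
        public Map<String, Long> snapshotAndReset() {
            Map<String, Long> out = new HashMap<>(bytesByJobAndSubnet);
            bytesByJobAndSubnet.clear();
            return out;
        }
    }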
For unplanned traffic of the Adhoc type, the data replication / job placement / data routing pattern of the multi-machine-room system is not applicable because of its randomness. Therefore some other optimization means are adopted: with saving multi-machine-room bandwidth as the main objective, an SQL Scan at runtime obtains the size and location of the dependent data, and in combination with the actual cluster load it is decided to which machine room the SQL is scheduled, as in the sketch after this paragraph. When a single table is accessed, the job is scheduled to the machine room where the data are located; when multiple tables are accessed and they are in the same machine room, the job is scheduled to that machine room; when the tables are in different machine rooms, the job is scheduled to the machine room holding the table with the larger data volume, and the smaller tables are read with throttling, or the job is blocked while the replication service is notified to copy them.
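The single-table / same-room / different-room placement rule for ad-hoc SQL reduces to "follow the room that holds the larger data volume", as in this toy plain-Java sketch (table names, sizes and the TableRef record are illustrative assumptions; the real decision also takes cluster load into account).

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Toy version of the ad-hoc placement rule: schedule the SQL to the room holding more of its data. */
    public class AdhocPlacement {
        record TableRef(String name, String room, long sizeGb) {}

        public static String chooseRoom(List<TableRef> tables) {
            Map<String, Long> sizeByRoom = new HashMap<>();
            for (TableRef t : tables) {
                sizeByRoom.merge(t.room(), t.sizeGb(), Long::sum);
            }
            // single table, or all tables in one room -> that room; otherwise the room with the
            // larger data volume (smaller tables are read with throttling or copied on demand)
            return sizeByRoom.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .orElseThrow()
                    .getKey();
        }

        public static void main(String[] args) {
            System.out.println(chooseRoom(List.of(
                    new TableRef("orders", "DC1", 800),
                    new TableRef("dim_user", "DC2", 3))));   // -> DC1
        }
    }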
Optionally, in the handling of unplanned traffic, for Presto, an engine with multi-source query capability, each machine room can be regarded as a Connector by using its multi-source query function; in a multi-table access scenario, sub-queries are pushed down and sent to the remote machine room for processing, which reduces the cross-machine-room traffic bandwidth.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to fig. 11, a block diagram of a data transmission device provided in an embodiment of the present invention is shown, which may specifically include the following modules:
a master job data determination module 1101 for determining master job data;
the remote machine room copy data generating module 1102, configured to initiate a replication service and generate remote machine room copy data for the main job data;
the remote machine room copy data sending module 1103 is configured to send the remote machine room copy data to the target resource scheduling platform YARN;
a user IP obtaining module 1104, configured to obtain a user IP, and determine, when the user IP corresponds to a direct network of the target machine room, a first target server closest to the user IP from the target machine room based on the user IP;
And the remote machine room copy data sending module 1105, configured to send the remote machine room copy data to the first target server by using the target Router HDFS Router.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In addition, the embodiment of the invention also provides electronic equipment, which comprises: a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements each process of the above data transmission method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, realizes the processes of the above data transmission method embodiment, and can achieve the same technical effects, and in order to avoid repetition, the description is omitted here. Wherein the computer readable storage medium is selected from Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk.
The embodiment of the present invention further provides a computer program product, which is stored in a storage medium, and the program product is executed by at least one processor to implement the respective processes of the above-mentioned embodiments of the data transmission method, and achieve the same technical effects, so that repetition is avoided, and a detailed description is omitted herein.
Fig. 12 is a schematic hardware structure of an electronic device implementing various embodiments of the present invention.
The electronic device 500 includes, but is not limited to: a radio frequency unit 501, a network module 502, an audio output unit 503, an input unit 504, a sensor 505, a display unit 506, a user input unit 507, an interface unit 508, a memory 509, a processor 510, and a power supply 511. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 12 does not limit the electronic device, and the electronic device may include more or fewer components than shown, or combine certain components, or use a different arrangement of components. In the embodiment of the invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer and the like.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 501 may be used to receive and send signals during the sending and receiving of information or during a call; specifically, downlink data received from a base station is delivered to the processor 510 for processing, and uplink data is sent to the base station. Typically, the radio frequency unit 501 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 501 may also communicate with networks and other devices through a wireless communication system.
The electronic device provides wireless broadband internet access to the user through the network module 502, such as helping the user to send and receive e-mail, browse web pages, access streaming media, and the like.
The audio output unit 503 may convert audio data received by the radio frequency unit 501 or the network module 502 or stored in the memory 509 into an audio signal and output as sound. Also, the audio output unit 503 may also provide audio output (e.g., a call signal reception sound, a message reception sound, etc.) related to a specific function performed by the electronic device 500. The audio output unit 503 includes a speaker, a buzzer, a receiver, and the like.
The input unit 504 is used for receiving audio or video signals. The input unit 504 may include a graphics processing unit (Graphics Processing Unit, GPU) 5041 and a microphone 5042; the graphics processor 5041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 506. The image frames processed by the graphics processor 5041 may be stored in the memory 509 (or another storage medium) or transmitted via the radio frequency unit 501 or the network module 502. The microphone 5042 may receive sound and process it into audio data. In a phone call mode, the processed audio data may be converted into a format that can be sent to a mobile communication base station via the radio frequency unit 501 for output.
The electronic device 500 also includes at least one sensor 505, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor; the ambient light sensor can adjust the brightness of the display panel 5061 according to the brightness of ambient light, and the proximity sensor can turn off the display panel 5061 and/or the backlight when the electronic device 500 is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in various directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for recognizing the posture of the electronic device (such as switching between horizontal and vertical screens, related games, and magnetometer posture calibration) and for vibration-recognition related functions (such as a pedometer and tapping); the sensor 505 may further include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and the like, which are not described herein.
The display unit 506 is used to display information input by a user or information provided to the user. The display unit 506 may include a display panel 5061, and the display panel 5061 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 507 is operable to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 507 includes a touch panel 5071 and other input devices 5072. The touch panel 5071, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations of the user on or near the touch panel 5071 using any suitable object or accessory such as a finger or a stylus). The touch panel 5071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 510, and receives and executes commands sent by the processor 510. In addition, the touch panel 5071 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 5071, the user input unit 507 may include other input devices 5072. In particular, the other input devices 5072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.
Further, the touch panel 5071 may be overlaid on the display panel 5061. When the touch panel 5071 detects a touch operation on or near it, the touch operation is transmitted to the processor 510 to determine the type of the touch event, and the processor 510 then provides a corresponding visual output on the display panel 5061 according to the type of the touch event. Although in fig. 12 the touch panel 5071 and the display panel 5061 are two independent components for implementing the input and output functions of the electronic device, in some embodiments the touch panel 5071 and the display panel 5061 may be integrated to implement the input and output functions of the electronic device, which is not limited herein.
The interface unit 508 is an interface for connecting an external device to the electronic apparatus 500. For example, the external devices may include a wired or wireless headset port, an external power (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 508 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic apparatus 500 or may be used to transmit data between the electronic apparatus 500 and an external device.
The memory 509 may be used to store software programs as well as various data. The memory 509 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the electronic device (such as audio data and a phonebook), and the like. In addition, the memory 509 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 510 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 509, and calling data stored in the memory 509, thereby performing overall monitoring of the electronic device. Processor 510 may include one or more processing units; preferably, the processor 510 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 510.
The electronic device 500 may also include a power supply 511 (e.g., a battery) for powering the various components, and preferably the power supply 511 may be logically connected to the processor 510 via a power management system that performs functions such as managing charging, discharging, and power consumption.
In addition, the electronic device 500 further includes some functional modules that are not shown and will not be described herein.
In yet another embodiment provided by the present invention, as shown in fig. 13, there is further provided a computer-readable storage medium 1301 having instructions stored therein, which when run on a computer, cause the computer to perform the data transmission method described in the above embodiment.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware alone, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, or optical disk) and comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods according to the embodiments of the present invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Many forms may be made by those having ordinary skill in the art, in light of the present invention and without departing from its spirit and the scope of the claims, and all such forms fall within the protection of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, in the part contributing to the prior art, or as a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or the like.
The foregoing is merely a specific embodiment of the present invention, and the present invention is not limited thereto; any variation or substitution that can be readily conceived by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A data transmission method, characterized in that the method is applied to a service deployment global control layer, the service deployment global control layer is configured with a corresponding target machine room, the target machine room is configured with a target resource scheduling platform YARN and a target file storage system HDFS, the target file storage system HDFS has a corresponding target Router HDFS Router, and the method comprises the following steps:
determining main job data;
initiating a replication service and generating remote machine room replication data for the master job data;
sending the remote machine room copy data to the target resource scheduling platform YARN;
acquiring a user IP, and determining a first target server closest to the user IP from the target machine room based on the user IP when the user IP corresponds to the direct network of the target machine room;
and sending the remote machine room copy data to the first target server by using the target Router HDFS Router.
2. The method according to claim 1, wherein the service deployment global control layer is loaded with a dependency resolution program, the service deployment global control layer is further configured with a corresponding initial machine room, the initial machine room is used for storing initial data, and the step of determining main job data includes:
generating a job relation chain division model by using the dependency resolution program;
generating a directed acyclic graph DAG for the initial data by using the job relation chain division model, and determining data dependency relationships in the initial data based on the directed acyclic graph DAG; wherein job data with different data dependency relationships are different types of job data;
and determining main job data through the data dependency relationship.
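By way of non-limiting illustration only, and not forming part of the claims, one possible Java sketch of the job relation chain division is given below: jobs and their dependencies form a directed acyclic graph, and the main job is assumed here, purely for the example, to be the job on which the largest number of other jobs transitively depend. The selection rule and all names (JobRelationChain, addDependency, mainJob) are assumptions for this sketch.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative DAG sketch; the main-job selection rule is an assumption. */
public final class JobRelationChain {

    // downstream.get(a) = jobs that directly depend on the output of a
    private final Map<String, Set<String>> downstream = new HashMap<>();

    /** Records that job 'to' depends on (reads the output of) job 'from'. */
    public void addDependency(String from, String to) {
        downstream.computeIfAbsent(from, k -> new HashSet<>()).add(to);
        downstream.computeIfAbsent(to, k -> new HashSet<>());
    }

    /** Picks the job whose output the largest number of other jobs transitively depend on. */
    public String mainJob() {
        String best = null;
        int bestCount = -1;
        for (String job : downstream.keySet()) {
            int count = reachable(job).size();
            if (count > bestCount) { bestCount = count; best = job; }
        }
        return best;
    }

    // All jobs transitively reachable downstream of 'start'.
    private Set<String> reachable(String start) {
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>(downstream.get(start));
        while (!stack.isEmpty()) {
            String next = stack.pop();
            if (seen.add(next)) stack.addAll(downstream.get(next));
        }
        return seen;
    }
}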
3. The method of claim 1 or 2, wherein the service deployment global control layer comprises a data manager DataManager, the data manager DataManager being used for storing the main job data.
4. A method according to claim 3, wherein the data manager DataManager is provided with a corresponding data replication server, the service deployment global control layer comprises a job management platform, the job management platform is provided with a corresponding Hive metadata store Hive MetaStore, and the step of generating remote machine room copy data for the main job data comprises:
Acquiring activity information Event generated by the Hive MetaStore;
and determining, through the activity information Event, a rule path conforming to the rule base of the data manager DataManager, and calling the data replication server to generate a remote machine room copy when it is detected that a new partition of a hot spot table corresponding to the activity information Event is generated.
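By way of non-limiting illustration only, and not forming part of the claims, a Java sketch of a Hive MetaStore listener reacting to new partitions of hot spot tables is given below. It uses the public MetaStoreEventListener / AddPartitionEvent API of Apache Hive; the hot-spot table set and the ReplicationClient interface are assumptions for the example, and in a real MetaStore deployment a listener is instantiated by reflection from a single Configuration, so the constructor here is simplified for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.MetaStoreEventListener;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.events.AddPartitionEvent;

import java.util.Set;

/** Illustrative listener sketch; ReplicationClient and hotSpotTables are assumptions. */
public class HotPartitionReplicationListener extends MetaStoreEventListener {

    /** Hypothetical client that triggers the remote machine room copy. */
    public interface ReplicationClient { void replicate(String db, String table); }

    private final Set<String> hotSpotTables; // entries formatted as "db.table"
    private final ReplicationClient client;

    public HotPartitionReplicationListener(Configuration conf,
                                           Set<String> hotSpotTables,
                                           ReplicationClient client) {
        super(conf);
        this.hotSpotTables = hotSpotTables;
        this.client = client;
    }

    @Override
    public void onAddPartition(AddPartitionEvent event) throws MetaException {
        String db = event.getTable().getDbName();
        String table = event.getTable().getTableName();
        if (hotSpotTables.contains(db + "." + table)) {
            // A new partition of a hot spot table was created: start the off-site copy.
            client.replicate(db, table);
        }
    }
}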
5. A method according to claim 3, wherein the step of generating remote machine room copy data for the main job data comprises:
generating a data snapshot for the main job data, and determining the data snapshot as the remote machine room copy data.
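By way of non-limiting illustration only, and not forming part of the claims, a minimal Java sketch of generating a data snapshot for the main job data with the standard HDFS snapshot API is given below, assuming the main job data lives under one snapshottable directory; the paths and the snapshot naming scheme are assumptions for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

/** Minimal sketch, assuming a single snapshottable main-job directory. */
public final class MainJobSnapshot {

    /** Creates a read-only snapshot of the main job directory and returns its path. */
    public static Path snapshotMainJob(URI namenode, String mainJobDir, String tag)
            throws IOException {
        FileSystem fs = FileSystem.get(namenode, new Configuration());
        // The directory must have been made snapshottable beforehand,
        // e.g. with: hdfs dfsadmin -allowSnapshot <dir>
        return fs.createSnapshot(new Path(mainJobDir), "main-job-" + tag);
    }
}

The snapshot path returned here could then be handed to whatever mechanism copies it to the remote machine room; that transfer step is outside the scope of this sketch.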
6. The method of claim 2, wherein the initial machine room is configured with an initial resource scheduling platform YARN and an initial file storage system HDFS having a corresponding initial Router HDFS Router, further comprising:
when the user IP corresponds to the direct network of the initial machine room, determining a second target server closest to the user IP from the initial machine room based on the user IP;
and sending the main job data to the second target server by using the initial Router HDFS Router.
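By way of non-limiting illustration only, and not forming part of the claims, one possible Java sketch of selecting the server closest to the user IP is given below. The claim does not fix a proximity metric; this sketch assumes, purely for illustration, that closeness is approximated by the longest shared bit prefix between the user's IPv4 address and each candidate server address.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;

/** Illustrative proximity rule only; the longest-prefix metric is an assumption. */
public final class NearestServerSelector {

    /** Returns the IPv4 server address sharing the longest bit prefix with the user IP. */
    public static String nearestServer(String userIp, List<String> serverIps)
            throws UnknownHostException {
        int userBits = toBits(userIp);
        String best = null;
        int bestPrefix = -1;
        for (String server : serverIps) {
            int prefix = Integer.numberOfLeadingZeros(userBits ^ toBits(server));
            if (prefix > bestPrefix) { bestPrefix = prefix; best = server; }
        }
        return best;
    }

    // Packs a dotted-quad IPv4 address into a 32-bit integer.
    private static int toBits(String ipv4) throws UnknownHostException {
        byte[] b = InetAddress.getByName(ipv4).getAddress();
        return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16) | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
    }
}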
7. The method as recited in claim 1, further comprising:
determining whether the target resource scheduling platform YARN has completed a data copy preparation operation, wherein the data copy preparation operation includes verifying version information of the data copy;
and when it is determined that the target resource scheduling platform YARN has completed the data copy preparation operation, performing the steps of initiating the copy service and generating the remote machine room copy data for the main job data.
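By way of non-limiting illustration only, and not forming part of the claims, the version-based readiness check of this claim could be sketched in Java as follows; where the expected and prepared version strings come from is left open and is an assumption of the sketch.

import java.util.Objects;

/** Minimal sketch of the data copy preparation check. */
public final class CopyPreparationCheck {

    /** Returns true when the version prepared on the target YARN side matches the expected one. */
    public static boolean isCopyPrepared(String expectedVersion, String preparedVersion) {
        return preparedVersion != null && Objects.equals(expectedVersion, preparedVersion);
    }
}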
8. The method as recited in claim 5, further comprising:
determining whether an access token corresponding to the user IP is received;
and when it is determined that the access token is received and conforms to a preset check rule, calling the Router HDFS Router to send the remote machine room copy data or the main job data to a target server.
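By way of non-limiting illustration only, and not forming part of the claims, a Java sketch of one possible preset check rule for the access token is given below. The claim does not specify the rule; this sketch assumes a token of the form "payload.signature", where the signature is an HMAC-SHA256 of the payload under a shared secret, verified in constant time (Java 17 for HexFormat).

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

/** Illustrative token check; the token format and shared-secret scheme are assumptions. */
public final class AccessTokenCheck {

    /** Returns true when the token's HMAC-SHA256 signature matches its payload. */
    public static boolean matchesPresetRule(String token, byte[] sharedSecret) {
        try {
            int dot = token.lastIndexOf('.');
            if (dot <= 0) return false;
            String payload = token.substring(0, dot);
            byte[] claimed = HexFormat.of().parseHex(token.substring(dot + 1));
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(sharedSecret, "HmacSHA256"));
            byte[] expected = mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
            return MessageDigest.isEqual(expected, claimed); // constant-time comparison
        } catch (Exception e) {
            return false; // malformed token or unavailable algorithm: reject
        }
    }
}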
9. The method of claim 6, wherein the target file storage system HDFS and the initial file storage system HDFS have corresponding engine-side programs and data-node-side programs, respectively, further comprising:
determining planned traffic information for the remote machine room copy data;
acquiring out-of-plan traffic information other than the planned traffic information;
and optimizing the engine-side program and the data-node-side program based on the out-of-plan traffic information.
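By way of non-limiting illustration only, and not forming part of the claims, a Java sketch of separating in-plan from out-of-plan traffic and feeding the out-of-plan totals to a tuning hook is given below. The plan representation (a set of planned paths) and the TuningHook interface are assumptions for the example, and the actual optimization of the engine-side and DataNode-side programs is left abstract.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative traffic accounting sketch; the tuning hook itself is an assumption. */
public final class ReplicaTrafficAccounting {

    /** Hypothetical hook that drives engine-side / DataNode-side optimization. */
    public interface TuningHook { void onOutOfPlanBytes(String path, long totalBytes); }

    private final Map<String, LongAdder> outOfPlan = new ConcurrentHashMap<>();
    private final Set<String> plannedPaths;
    private final TuningHook hook;

    public ReplicaTrafficAccounting(Set<String> plannedPaths, TuningHook hook) {
        this.plannedPaths = plannedPaths;
        this.hook = hook;
    }

    /** Records one transfer; unplanned paths accumulate an out-of-plan running total. */
    public void record(String path, long bytes) {
        if (plannedPaths.contains(path)) return;                 // in-plan traffic
        LongAdder total = outOfPlan.computeIfAbsent(path, k -> new LongAdder());
        total.add(bytes);
        hook.onOutOfPlanBytes(path, total.sum());                // surface to the tuning step
    }
}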
10. A data transmission device, characterized in that the device is applied to a service deployment global control layer, the service deployment global control layer is configured with a corresponding target machine room, the target machine room is configured with a target resource scheduling platform YARN and a target file storage system HDFS, the target file storage system HDFS has a corresponding target Router HDFS Router, and the device comprises:
the main job data determining module is used for determining main job data;
the remote machine room copy data generation module is used for initiating copy service and generating remote machine room copy data aiming at the main job data;
the remote machine room copy data sending module is used for sending the remote machine room copy data to the target resource scheduling platform YARN;
the user IP acquisition module is used for acquiring a user IP and determining a first target server closest to the user IP from the target machine room based on the user IP when the user IP corresponds to the direct network of the target machine room;
and the remote machine room copy data transmitting module is used for sending the remote machine room copy data to the first target server by using the target Router HDFS Router.
11. An electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is used for storing a computer program;
the processor being configured to implement the method of any of claims 1-9 when executing a program stored on a memory.
12. A computer-readable storage medium having instructions stored thereon, which when executed by one or more processors, cause the processors to perform the method of any of claims 1-9.
CN202311208783.8A 2023-09-18 2023-09-18 Data transmission method and device, electronic equipment and readable storage medium Pending CN117278572A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311208783.8A CN117278572A (en) 2023-09-18 2023-09-18 Data transmission method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311208783.8A CN117278572A (en) 2023-09-18 2023-09-18 Data transmission method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN117278572A true CN117278572A (en) 2023-12-22

Family

ID=89209849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311208783.8A Pending CN117278572A (en) 2023-09-18 2023-09-18 Data transmission method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN117278572A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination