CN109408537A

CN109408537A - Data processing method and apparatus, storage medium, and computing device based on Spark SQL

Info

Publication number: CN109408537A
Application number: CN201811214789.5A
Authority: CN
Inventors: 姚琴
Original assignee: Netease Hangzhou Network Co Ltd
Current assignee: Netease Hangzhou Network Co Ltd
Priority date: 2018-10-18
Filing date: 2018-10-18
Publication date: 2019-03-01

Abstract

Embodiments of the present invention provide a data processing method based on Spark SQL. The method comprises: in response to the initiation of a session, searching a preset relation set for the Spark context variable instance corresponding to the user name of the proxy user that initiated the session; if none is found, creating and instantiating a corresponding Spark context variable, and adding to the preset relation set a correspondence at least between the user name and the corresponding Spark context variable instance; and, according to the Spark context variable instance corresponding to the user name of the proxy user that initiated the session, creating a corresponding runtime environment to perform the corresponding data processing. By running a single application instance on one server, the method can serve multiple tenants and thus realizes a multi-tenant function. In addition, embodiments of the present invention provide a data processing apparatus based on Spark SQL, a storage medium, and a computing device.

Description

Data processing method and apparatus, storage medium, and computing device based on Spark SQL

Technical field

Embodiments of the present invention relate to the field of data processing and, more specifically, to a data processing method and apparatus based on Spark SQL, a storage medium, and a computing device.

Background art

Big data technology is currently a popular technology; it refers to techniques for querying, analyzing, and processing massive amounts of data. With the arrival of the big data era, big-data-related applications such as data warehousing, data security, data analysis, and data mining have increasingly become research hotspots in the IT industry.

For example, Apache Spark, which originated at the AMPLab of the University of California, Berkeley, is an in-memory big data computing framework. Spark is an alternative to MapReduce (MR) that aims to provide more efficient data processing. It is compatible with the HDFS distributed storage layer and with the Apache Hive metadata warehouse, and can be integrated into the Hadoop ecosystem to make up for the shortcomings of MapReduce. In general, a Spark program has a master/slave structure: the driver (Driver), acting as the master (the party that actively initiates requests), is responsible for scheduling tasks, the smallest unit of computation, while the executors (Executor) are responsible for running the tasks. MapReduce, however, cannot satisfy ad hoc queries in most big data scenarios.

As another example, Spark SQL is one of the SQL on Hadoop technologies. Its role is to translate SQL query statements, through its built-in query optimizer, into Spark's underlying computation logic, thereby providing efficient SQL query capability. Spark SQL implements computation logic comparable to that of products such as Apache Hive and, relative to MapReduce, can improve processing performance.

Summary of the invention

However, the big data computing frameworks described above cannot serve multiple tenants by running a single application instance on one server; that is, they lack a multi-tenant (multi-tenancy) function.

For example, HiveServer2 (hereinafter, technique one), shown in Fig. 1A, provides an SQL on Hadoop multi-tenant scheme based on the Hive query engine. For each client (Client) request from a user, HiveServer2 creates a session (Session) for the request and allocates an execution context environment, which corresponds to one round of MR tasks. In this scheme, the execution environments started at the computation layer correspond one-to-one with the number of clients and cannot be reused, which hurts efficiency. The goal of serving multiple tenants by running a single application instance on one server is not achieved, so the scheme does not provide a true multi-tenant function.

As another example, SparkThriftServer (hereinafter, technique two), shown in Fig. 1B, provides an SQL on Hadoop scheme based on the Spark SQL query engine. Because a single SparkThriftServer has no multi-tenant capability, a separate server must be started for each user so that the user can access the corresponding data stored in HDFS; that is, user User2 cannot access his or her own resources through User1's server. This scheme therefore has no multi-tenant capability either. Moreover, provisioning one server per specific user increases the complexity of system maintenance and reduces the concurrency and utilization of server resources.

Therefore, in the prior art, technique one and technique two are often deployed together, but the two cannot be made seamlessly compatible, which makes this a very troublesome process.

There is therefore a strong need for an improved data processing method based on Spark SQL that can serve multiple tenants by running a single application instance on one server.

In this context, embodiments of the present invention are intended to provide a data processing method and apparatus based on Spark SQL, a storage medium, and a computing device.

In a first aspect of embodiments of the present invention, a data processing method based on Spark SQL is provided, comprising: in response to the initiation of a session, searching a preset relation set, according to the user name of the proxy user that initiated the session, for the Spark context variable instance corresponding to the user name; if no Spark context variable instance corresponding to the user name is found, creating a Spark context variable corresponding to the user name, instantiating the Spark context variable to form the Spark context variable instance corresponding to the user name, and adding to the preset relation set a correspondence at least between the user name and the corresponding Spark context variable instance; and, according to the Spark context variable instance corresponding to the user name of the proxy user that initiated the session, creating a corresponding runtime environment to perform the corresponding data processing.

In one embodiment of the present invention, the preset relation set comprises a one-to-one mapping from a first set, consisting of the user names of one or more proxy users, to a second set, consisting of one or more Spark context variable instances.

In another embodiment of the present invention, the preset relation set comprises a one-to-one mapping from a first set, consisting of the user names of one or more proxy users, to a third set, wherein the third set comprises one or more elements, each of which comprises a Spark context variable instance and the connection count corresponding to that instance.

In yet another embodiment of the present invention, the one-to-one mapping is built on a thread-safe HashMap.

In yet another embodiment of the present invention, the correspondence at least between the user name and the corresponding Spark context variable instance comprises: a correspondence between the user name and the corresponding Spark context variable instance.

In yet another embodiment of the present invention, the correspondence at least between the user name and the corresponding Spark context variable instance comprises: a correspondence among the user name, the corresponding Spark context variable instance, and the connection count corresponding to that instance.

In yet another embodiment of the present invention, the method further comprises: if the Spark context variable instance corresponding to the user name of the proxy user that initiated the session is found, updating the connection count corresponding to that instance in the preset relation set to the current connection count plus 1.

In yet another embodiment of the present invention, the method further comprises: when the session is closed, updating the connection count corresponding to the Spark context variable instance of the proxy user that initiated the session to the current connection count minus 1.

In yet another embodiment of the present invention, the method further comprises: periodically, or in response to the closing of the session, recycling the resources occupied by Spark context variable instances according to the LRU (least recently used) principle.

In yet another embodiment of the present invention, the step of recycling the resources occupied by Spark context variable instances according to the LRU principle comprises: determining whether the preset relation set contains a correspondence in which the connection count of the Spark context variable instance is 0; and, when such a correspondence exists, deleting it from the preset relation set and releasing the resources occupied by the Spark context variable to which it corresponds.
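The recycling step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: `FakeContext`, `Entry`, `recycle`, and the `last_used` timestamp are hypothetical names introduced here, and visiting entries from least to most recently used is one assumed reading of the LRU principle.

```python
from dataclasses import dataclass, field
import time


class FakeContext:
    """Hypothetical stand-in for a Spark context variable instance."""

    def __init__(self, user):
        self.user = user
        self.stopped = False

    def stop(self):
        # Releasing the occupied resources is modelled as flipping a flag.
        self.stopped = True


@dataclass
class Entry:
    """One correspondence: context instance plus its connection count."""

    context: FakeContext
    connections: int = 0
    last_used: float = field(default_factory=time.monotonic)


def recycle(relations):
    """Delete every entry whose connection count is 0 and release its context.

    Entries are visited from least to most recently used (assumed ordering).
    Returns the user names that were recycled, in the order visited.
    """
    recycled = []
    # sorted() builds a new list, so deleting from the dict inside the loop is safe.
    for user, entry in sorted(relations.items(), key=lambda kv: kv[1].last_used):
        if entry.connections == 0:
            del relations[user]
            entry.context.stop()
            recycled.append(user)
    return recycled
```

A caller could invoke `recycle` from a periodic timer or from the session-close handler, which matches the two triggers named in the text.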

In yet another embodiment of the present invention, the same proxy user shares the same Spark context variable instance across sessions initiated from different clients.

In yet another embodiment of the present invention, performing the corresponding data processing comprises performing corresponding data query processing.

In yet another embodiment of the present invention, the step of creating the corresponding runtime environment comprises: creating a Driver RPC communication environment.

In yet another embodiment of the present invention, the step of creating the corresponding runtime environment comprises: submitting a resource request to a resource manager, so as to obtain, through the resource manager, the corresponding computing resources from the queue corresponding to the proxy user that initiated the session, and starting executors bound to those computing resources.

In yet another embodiment of the present invention, before the step of searching for the Spark context variable instance corresponding to the user name of the proxy user that initiated the session, the method further comprises: if the authentication information of the proxy user that initiated the session is invalid, ending the processing of the session.

In yet another embodiment of the present invention, before the step of searching for the Spark context variable instance corresponding to the user name of the proxy user that initiated the session, the method further comprises: if the proxy user that initiated the session is not a trust grantor of the process user that started the server, ending the processing of the session.

In a second aspect of embodiments of the present invention, a storage medium storing a program is provided; when executed by a processor, the program implements the above data processing method based on Spark SQL.

In a third aspect of embodiments of the present invention, a data processing apparatus based on Spark SQL is provided, comprising: a lookup unit adapted to search, in response to the initiation of a session and according to the user name of the proxy user that initiated the session, a preset relation set for the Spark context variable instance corresponding to the user name; a processing unit adapted, if no Spark context variable instance corresponding to the user name is found, to create a Spark context variable corresponding to the user name, instantiate it to form the Spark context variable instance corresponding to the user name, and add to the preset relation set a correspondence at least between the user name and the corresponding Spark context variable instance; and an execution unit adapted to create, according to the Spark context variable instance corresponding to the user name of the proxy user that initiated the session, a corresponding runtime environment to perform the corresponding data processing.

In yet another embodiment of the present invention, the processing unit is further adapted to: if the Spark context variable instance corresponding to the user name of the proxy user that initiated the session is found, update the connection count corresponding to that instance in the preset relation set to the current connection count plus 1.

In yet another embodiment of the present invention, the processing unit is further adapted to: when the session is closed, update the connection count corresponding to the Spark context variable instance of the proxy user that initiated the session to the current connection count minus 1.

In yet another embodiment of the present invention, the processing unit is further adapted to: periodically, or in response to the closing of the session, recycle the resources occupied by Spark context variable instances according to the LRU principle.

In yet another embodiment of the present invention, the processing unit is adapted to: determine whether the preset relation set contains a correspondence in which the connection count of the Spark context variable instance is 0; and, when such a correspondence exists, delete it from the preset relation set and release the resources occupied by the Spark context variable to which it corresponds.

In yet another embodiment of the present invention, the lookup unit is adapted such that the same proxy user shares the same Spark context variable instance across sessions initiated from different clients.

In yet another embodiment of the present invention, the corresponding data processing performed by the execution unit comprises corresponding data query processing.

In yet another embodiment of the present invention, the corresponding runtime environment created by the execution unit comprises a corresponding Driver RPC communication environment.

In yet another embodiment of the present invention, the execution unit is adapted to create the corresponding runtime environment as follows: submitting a resource request to a resource manager, so as to obtain, through the resource manager, the corresponding computing resources from the queue corresponding to the proxy user that initiated the session, and starting executors bound to those computing resources.

In yet another embodiment of the present invention, the lookup unit is further adapted to determine, before searching for the Spark context variable instance corresponding to the user name of the proxy user that initiated the session, whether the authentication information of the proxy user that initiated the session is valid, and, if the authentication information is invalid, to end the processing of the session.

In yet another embodiment of the present invention, the lookup unit is further adapted to determine, before searching for the Spark context variable instance corresponding to the user name of the proxy user that initiated the session, whether the proxy user that initiated the session is a trust grantor of the process user that started the server, and, if it is not, to end the processing of the session.

In a fourth aspect of embodiments of the present invention, a computing device is provided, comprising the above storage medium.

The data processing method and apparatus, storage medium, and computing device based on Spark SQL according to embodiments of the present invention can serve multiple tenants by running a single application instance on one server, thereby realizing a multi-tenant function.

Brief description of the drawings

The above and other objects, features, and advantages of exemplary embodiments of the present invention will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the invention are shown by way of example and not limitation, in which:

Fig. 1A is a schematic diagram showing the existing HiveServer2 scheme;

Fig. 1B is a schematic diagram showing the existing SparkThriftServer scheme;

Fig. 1C is a schematic diagram showing the principle of an existing native Spark program;

Fig. 1D is a schematic diagram showing the architecture of the data processing method and apparatus based on Spark SQL according to an embodiment of the present invention;

Fig. 2 is a flowchart schematically showing an exemplary process of the data processing method based on Spark SQL according to an embodiment of the present invention;

Fig. 3A is a UML sequence diagram schematically showing the working principle of an existing Spark program;

Fig. 3B is a UML sequence diagram schematically showing the working principle of a preferred application example of the data processing method based on Spark SQL according to an embodiment of the present invention;

Fig. 4 is a structural block diagram schematically showing an example of the data processing apparatus based on Spark SQL according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a computer according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present invention.

In the drawings, identical or corresponding reference numerals denote identical or corresponding parts.

Detailed description of embodiments

The principles and spirit of the present invention are described below with reference to several exemplary embodiments. It should be understood that these embodiments are given only to enable those skilled in the art to better understand and thereby implement the invention, and not to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Those skilled in the art will appreciate that embodiments of the present invention can be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms: entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.

According to embodiments of the present invention, a data processing method and apparatus based on Spark SQL, a storage medium, and a computing device are proposed.

It should be understood that the number of any element in the drawings is by way of example rather than limitation, and that any naming is used only for distinction and carries no limiting meaning.

The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.

Overview of the invention

The inventors found that the Thrift Server module bundled with Apache Spark simply inherits the HiveServer2 module of Apache Hive. Apart from being functionally similar and offering better performance, however, it has, owing to the limitations of Spark's overall architecture, dropped many practically significant features, such as multi-tenancy and high availability. Enterprise-grade scenarios generally require both resource sharing and data isolation, and this practical requirement cannot be met without multi-tenancy. Likewise, for a resident (long-running) service, robustness drops considerably without high availability. The present invention, by modifying the Spark kernel so that it can run multiple instances, implements a highly available Thrift Server system that provides multi-tenant service, achieving resource sharing and user data isolation.

Fig. 1C shows the architecture of a native Spark program. As shown in Fig. 1C, the driver program (Driver Program) is the master of the Spark program. In the original architecture design, a Spark context variable (SparkContext), once instantiated, creates its corresponding runtime environment, which includes creating the Driver RPC communication environment and submitting a resource request to the resource manager (YARN). When another SparkContext is instantiated in the same Driver process, this environment is created once more; because the environment exists as a globally unique variable, the later one overwrites everything in the earlier one, leaving the earlier environment effectively unusable.
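The overwriting problem can be illustrated with a toy model. The sketch below is plain Python and does not reproduce Spark's actual code: `Env`, `NativeContext`, and `_GLOBAL_ENV` are hypothetical names that only mimic the situation in which the runtime environment is held in one process-wide variable.

```python
class Env:
    """Toy runtime environment that records which context created it."""

    def __init__(self, owner):
        self.owner = owner


# Process-wide slot standing in for the globally unique environment variable.
_GLOBAL_ENV = None


class NativeContext:
    """Toy context whose environment lives in the global slot."""

    def __init__(self, name):
        global _GLOBAL_ENV
        self.name = name
        # Each instantiation creates the environment again and overwrites the slot.
        _GLOBAL_ENV = Env(owner=name)

    def env(self):
        # Every context reads the same global slot rather than its own environment.
        return _GLOBAL_ENV


first = NativeContext("first")
second = NativeContext("second")
```

After both objects are created, `first.env()` returns the environment built by `second`, which is the sense in which the earlier environment becomes unusable.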

It follows that, on the one hand, because of the limitations of the Spark core architecture, the Driver process of a Spark program can correspond to only one SparkContext instance, which can request the corresponding computing resources from a particular queue of the resource manager (Hadoop YARN) and start the corresponding number of Executors. These resources belong to a single user and cannot be shared with other users: they can access only that user's data, and other users cannot use them to access their own data. On the other hand, because Thrift Server is in essence a Spark program, starting it also starts the corresponding SparkContext and the corresponding resources. Owing to permission and related issues, such a program cannot serve different users; the only option is to start a separate such service for each user, and this architecture is clearly impractical.

To address the above problems, the present invention provides a data processing method and apparatus based on Spark SQL, a storage medium, and a computing device. As shown in Fig. 1D, the Spark core architecture is modified to support multiple SparkContext instances, so that multiple non-interfering SparkContexts can be instantiated in the Driver process. The runtime environments of these SparkContexts are isolated from one another at user granularity: each SparkContext obtains the corresponding computing resources from its user's queue in the resource manager and starts the Executors bound to those resources.

In addition, as shown in Fig. 1D, under the approach of "implementing a Multi Spark Thrift Server through multiple SparkContext instances", the server and the SparkContext can first be decoupled: when the server starts, only the service itself is started, without starting a SparkContext and its corresponding computing cluster. Second, when a user initiates a session, the server can, for example, check whether the user already has a fully initialized SparkContext instance, reusing it if so and creating one if not. When a user closes a session, the server can, for example, recycle SparkContexts in a unified manner according to the LRU principle, maintaining a balance between performance and resource usage.

It follows that the technical solution provided by embodiments of the present invention can build on Hadoop's impersonation mechanism: the process user impersonates (doAs) the proxy user to perform the SparkContext instantiation, so that all subsequent operations that the SparkContext carries out on the cluster are performed as the proxy user. In addition, a <user, runtime environment> (<user, env>) mapping, that is, a one-to-one mapping between proxy users and runtime environments, is implemented, and the runtime environment variables of the SparkContext instance corresponding to each user are stored in this mapping.

Here, the process user refers, for example, to the user that started the process, or to the login (Login) user in a Kerberos environment. The proxy user may be, for example, the user name carried in the user information of the client instance, or a user name specified through a configuration item. During execution, the process user runs the relevant functions within the process under the identity of that user (the proxy user).

Having introduced the basic principles of the present invention, various non-limiting embodiments of the invention are described in detail below.

Exemplary method

The data processing method based on Spark SQL according to an exemplary embodiment of the present invention is described below with reference to Fig. 2.

Fig. 2 schematically shows an exemplary process flow 200 of the data processing method based on Spark SQL according to an embodiment of the present disclosure.

As shown in Fig. 2, after process flow 200 starts, step S210 is performed first.

Step S210: in response to the initiation of a session, search the preset relation set, according to the user name of the proxy user that initiated the session, for the SparkContext instance corresponding to that user name.

As an example, the user name of the proxy user includes, but is not limited to, any one of a registered account, a nickname, a mobile phone number, an email address, or other contact information.

As an example, the preset relation set may initially be empty, or may contain pre-stored mappings.

As an example, the preset relation set in embodiments of the present invention may comprise a one-to-one mapping from a first set to a second set, where the first set consists, for example, of the user names of one or more proxy users, and the second set consists, for example, of one or more SparkContext instances. In other words, in this example the preset relation set may comprise multiple correspondences, each being a correspondence between the user name of one proxy user and the corresponding SparkContext instance.

As an example, the preset relation set in embodiments of the present invention may also comprise a one-to-one mapping from a first set, consisting of the user names of one or more proxy users, to a third set, wherein the third set comprises one or more elements, each of which comprises a SparkContext instance and the connection count corresponding to that instance. In other words, in this example the preset relation set may comprise multiple correspondences, each being a correspondence among the user name of one proxy user, the corresponding SparkContext instance, and the connection count of that SparkContext instance (the number of all proxy-user connections attached to it).

It should be understood that new correspondences can be added to the preset relation set as needed, that is, a correspondence between the user name of a new proxy user and the corresponding SparkContext instance can be added, and that the correspondences already in the set can be deleted or changed as needed.

The above one-to-one mapping from the first set to the second set and/or from the first set to the third set is built, for example, on a thread-safe HashMap.

As an example, before the step of searching for the SparkContext instance corresponding to the user name of the proxy user that initiated the session, the method may further comprise: if the authentication information of the proxy user that initiated the session is invalid, ending the processing of the session (hereinafter, authentication information determination processing).

For example, after process flow 200 starts, when a proxy user initiates a session, it is first determined whether the authentication information of that proxy user is valid. If it is valid, the preset relation set can be searched, according to the user name of the proxy user that initiated the session, for the Spark context variable instance corresponding to the user name. If it is invalid, the processing of this session ends and the next session initiation is awaited.

As an example, before the step of searching for the SparkContext instance corresponding to the user name of the proxy user that initiated the session, the method may further comprise: if the proxy user that initiated the session is not a trust grantor of the process user that started the server, ending the processing of the session (hereinafter, trust determination processing).

For example, after process flow 200 starts, when a proxy user initiates a session, it is first determined whether that proxy user is a trust grantor of the process user that started the server. If so, the preset relation set can be searched, according to the user name of the proxy user that initiated the session, for the Spark context variable instance corresponding to the user name. Otherwise, the processing of this session ends and the next session initiation is awaited.

In another example, both the authentication information determination processing and the trust determination processing may be performed before the step of searching for the SparkContext instance corresponding to the user name of the proxy user that initiated the session. The order of the two is not limited: the authentication information determination processing may be performed first and the trust determination processing second, or vice versa.

For example, after process flow 200 starts, when a proxy user initiates a session, it is first determined whether the authentication information of that proxy user is valid. If the authentication information is valid, it is then determined whether the proxy user is a trust grantor of the process user that started the server: if it is, the preset relation set is searched, according to the user name of the proxy user that initiated the session, for the Spark context variable instance corresponding to the user name; if it is not, the processing of this session ends. If the authentication information is invalid, the processing of this session ends.
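The two pre-checks can be sketched as a single gate function. This is an assumption-laden Python illustration: `may_proceed` and its parameters are hypothetical, credentials are compared as plain strings, and the set of trust grantors is passed in directly, whereas a real deployment would rely on its own authentication and authorization services.

```python
def may_proceed(proxy_user, credentials, valid_credentials, trust_grantors):
    """Return True only if both checks pass; False means the session is ended.

    valid_credentials: dict mapping user name -> expected credential (toy check).
    trust_grantors: set of proxy users that are trust grantors of the process
        user that started the server.
    """
    # Authentication information determination processing.
    if proxy_user not in valid_credentials:
        return False
    if valid_credentials[proxy_user] != credentials:
        return False
    # Trust determination processing.
    if proxy_user not in trust_grantors:
        return False
    return True
```

The text allows the two checks in either order; swapping them changes only which failure is detected first, not the overall outcome.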

Step S220: if the SparkContext instance corresponding to the user name of the proxy user initiating the session is not found in step S210, a SparkContext corresponding to the user name is created and instantiated to form the SparkContext instance corresponding to the user name, and a correspondence at least between the user name and the corresponding SparkContext instance is added to the preset relation set.

As an example, the correspondence at least between the user name and the corresponding SparkContext instance includes, for example, the correspondence between the user name and the corresponding SparkContext instance.

As an example, the correspondence at least between the user name and the corresponding SparkContext instance may also include, for example, the correspondence between the user name on the one hand and, on the other hand, the corresponding SparkContext instance together with the connection count corresponding to that SparkContext instance.

As an example, if the SparkContext instance corresponding to the user name of the proxy user initiating the session is found in step S210, step S220 can be skipped and step S230 executed directly.
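
The lookup of step S210 and the conditional creation of step S220 amount to a get-or-create operation on the preset relation set. A minimal sketch in plain Python, assuming the simpler <user name, instance> form of the correspondence; `FakeContext` and `RelationSet` are hypothetical stand-ins and not part of any Spark API:

```python
import threading


class FakeContext:
    """Stand-in for a per-user SparkContext instance (not the real class)."""

    def __init__(self, user_name):
        self.user_name = user_name


class RelationSet:
    """Hypothetical preset relation set: user name -> context instance."""

    def __init__(self):
        self._guard = threading.Lock()
        self._by_user = {}

    def get_or_create(self, user_name):
        with self._guard:
            # S210: look up the instance registered for this user name.
            ctx = self._by_user.get(user_name)
            if ctx is None:
                # S220: not found, so create, instantiate and register it.
                ctx = FakeContext(user_name)
                self._by_user[user_name] = ctx
            # Found or just created: the caller continues with S230.
            return ctx


relations = RelationSet()
first = relations.get_or_create("user1")
second = relations.get_or_create("user1")
other = relations.get_or_create("user2")
print(first is second, first is other)
```

Holding one lock across the lookup and the registration keeps two concurrent sessions of the same user from each creating their own instance.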

Step S230: according to the SparkContext instance corresponding to the user name of the proxy user initiating the session, a corresponding runtime environment is created to execute the corresponding data processing.

As an example, executing the corresponding data processing includes, for example, executing corresponding data query processing.

As an example, the step of creating the corresponding runtime environment may include: creating a Driver RPC communication environment.

As an example, the step of creating the corresponding runtime environment may also include: submitting a resource request to the resource manager, so that the resource manager obtains the corresponding computing resources from the queue corresponding to the proxy user initiating the session, and starting the executors bound to those computing resources.

As an example, process flow 200 may also include the following step: if the SparkContext instance corresponding to the user name of the proxy user initiating the session is found, the connection count corresponding to that SparkContext instance in the preset relation set is updated to the current connection count plus 1.

For example, if the SparkContext instance corresponding to the user name of the proxy user initiating the session is found, and the current connection count corresponding to that SparkContext instance is n₁, then the updated connection count is n₁+1.

As an example, process flow 200 may also include the following step: when the session is closed, the connection count corresponding to the SparkContext instance of the proxy user that initiated the session is updated to the current connection count minus 1.

For example, when the session is closed, if the current connection count corresponding to the SparkContext instance is n₂, then the updated connection count is n₂-1.
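
The two updates can be illustrated with a small model of the <user name, (instance, connection count)> form of the relation set. This is a sketch in plain Python under stated assumptions: `CountedRelationSet` and its methods are hypothetical names, `object()` stands in for a SparkContext instance, and a newly registered instance starts at a connection count of 1, as in the preferred application example described later.

```python
import threading


class CountedRelationSet:
    """Hypothetical relation set: user name -> [instance, connection count]."""

    def __init__(self):
        self._guard = threading.Lock()
        self._entries = {}

    def open_session(self, user_name):
        with self._guard:
            entry = self._entries.get(user_name)
            if entry is None:
                # Not found: register a new instance with one connection.
                entry = [object(), 1]
                self._entries[user_name] = entry
            else:
                entry[1] += 1  # found: n1 becomes n1 + 1
            return entry[0]

    def close_session(self, user_name):
        with self._guard:
            self._entries[user_name][1] -= 1  # n2 becomes n2 - 1

    def connections(self, user_name):
        with self._guard:
            return self._entries[user_name][1]


counted = CountedRelationSet()
counted.open_session("user1")
counted.open_session("user1")
counted.close_session("user1")
print(counted.connections("user1"))
```

An entry whose count has dropped to 0 stays in the set until it is reclaimed; closing a session does not by itself release the instance.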

As an example, process flow 200 may also include the following step: periodically, or in response to the closing of a session, the resources occupied by SparkContext instances are reclaimed according to the LRU principle.

For example, the step of reclaiming the resources occupied by SparkContext instances according to the LRU principle can be implemented as follows: it is determined whether the preset relation set contains a correspondence in which the connection count of the SparkContext instance is 0; when such a correspondence exists, it is deleted from the preset relation set and the resources occupied by the SparkContext of that correspondence are released.
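
A sketch of this reclamation rule in plain Python follows. It implements exactly the rule stated above (delete entries whose connection count is 0 and release their context) and does not track recency of use. `ReleasableContext` and `reclaim_idle` are hypothetical names, and the `stop()` method merely models the release of resources.

```python
class ReleasableContext:
    """Stand-in for a SparkContext instance; stop() models releasing resources."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


def reclaim_idle(entries):
    """Delete every entry whose connection count is 0 and release its context.

    `entries` maps user name -> (context, connection count).
    Returns the list of user names that were removed.
    """
    idle_users = [user for user, (_, count) in entries.items() if count == 0]
    for user in idle_users:
        context, _ = entries.pop(user)
        context.stop()
    return idle_users


busy, idle = ReleasableContext(), ReleasableContext()
entries = {"user1": (busy, 2), "user2": (idle, 0)}
removed = reclaim_idle(entries)
print(removed, sorted(entries))
```

Collecting the idle user names first and deleting afterwards avoids mutating the dictionary while it is being iterated.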

As an example, sessions initiated by the same proxy user on different clients share the same SparkContext instance.

For example, proxy user U_A initiates a session S₁ on client P₁, and the SparkContext instance corresponding to proxy user U_A is SC_A. While session S₁ is still in progress, proxy user U_A initiates another session S₂ on client P₂; session S₂ then shares SparkContext instance SC_A with session S₁. The clients may be, without limitation, mobile phone clients, computer clients and the like.

Preferred application example

Before this preferred application example is described, the application scenario of the prior art is first described with reference to Fig. 3A. In the prior art, one Spark Thrift Server can serve only one user; when requests arrive from different users, certain requests associated with a user's data cannot be executed normally. When a user needs to access his or her own data, a server belonging to that user must be set up in advance.

As shown in Fig. 3A, in the prior art, after user 1 starts the server, a SparkContext is started and instantiated. After the instantiation succeeds, if user 1 initiates a new session, the SparkContext of user 1 is bound to the server, and user 1 can subsequently perform operations such as queries. Another user 2, however, cannot perform corresponding operations, such as queries, by accessing the SparkContext of that server.

Fig. 3 B shows one of the embodiment of the present invention preferably using example.As shown in Figure 3B, example is preferably applied at this In, for different user's connection requests, SparkContext can be instantiated by user, the connection request of different user is real The different SparkContext of exampleization realizes that Server concurrently responds the request from different user.Same subscriber is come from The connection request of different clients then shares a SparkContext example, these examples all can be in respective resource manager Effective queue in complete initialization, and the data resource that subsequent access HDFS accumulation layer has permission, to realize multi-tenant.

As shown in Fig. 3B, user 1 and user 2 act as proxy users.

The complete flow for one proxy user is described below taking user 1 as an example; user 2 can be handled in a similar way, which is not described again in detail.

In Fig. 3B, the SC cache is a set of <user name, (SparkContext instance, connection count)> mappings built on a thread-safe HashMap (or, alternatively, a set of <user name, SparkContext instance> mappings), and supports insert, delete, update and lookup operations on key-value pairs.

As shown in Fig. 3B, the server is started by the process user. After the SC cache (SparkContext Cache) is initialized (Init) successfully, user 1 initiates a new session (referred to as session 1), and the SparkContext instance corresponding to the user name of user 1 is looked up in the SC cache. If it is found, user 1 is bound to that SparkContext instance and the current connection count is incremented by 1, that is, <user name of user 1, (SparkContext instance of user 1, connection count)> is updated to <user name of user 1, (SparkContext instance of user 1, connection count + 1)>. Otherwise, a SparkContext corresponding to the user name of user 1 is created and instantiated; provided the token is valid, the instantiation succeeds, the mapping <user name of user 1, (SparkContext instance of user 1, 1)> is written into the SC cache to complete registration, the user session is then created, and user 1 is bound to the SparkContext instance. User 1 can subsequently perform operations such as queries. When session 1 ends, session 1 itself is cleared and the current connection count is decremented by 1, that is, <user name of user 1, (SparkContext instance of user 1, connection count)> is updated to <user name of user 1, (SparkContext instance of user 1, connection count - 1)>.

In addition, as shown in Fig. 3B, when user 2 initiates a new session (referred to as session 2), the SparkContext instance corresponding to the user name of user 2 is looked up in the preset relation set held in the SC cache. If it is found, user 2 is bound to that SparkContext instance and the current connection count is incremented by 1, that is, <user name of user 2, (SparkContext instance of user 2, connection count)> is updated to <user name of user 2, (SparkContext instance of user 2, connection count + 1)>. Otherwise, a SparkContext corresponding to the user name of user 2 is created and instantiated; provided the token is valid, the instantiation succeeds, the mapping <user name of user 2, (SparkContext instance of user 2, 1)> is written into the SC cache to complete registration, the user session is then created, and user 2 is bound to the SparkContext instance. User 2 can subsequently perform operations such as queries. When session 2 ends, session 2 itself is cleared and the current connection count is decremented by 1, that is, <user name of user 2, (SparkContext instance of user 2, connection count)> is updated to <user name of user 2, (SparkContext instance of user 2, connection count - 1)>.

In addition, the resources occupied by SparkContext instances can be reclaimed periodically according to the LRU principle. As shown in Fig. 3B, this function can be implemented, for example, by an SC cache cleaner thread, which is a periodically executing thread started on the server side. Its main task is to determine whether the SC cache contains a mapping of the form <user name, (SparkContext instance, 0)> (such a mapping indicates that no active user is currently connected to that SparkContext instance); if such a mapping exists, the record is deleted completely and the SparkContext is reclaimed.
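
A periodically executing cleaner of this kind could look roughly like the following plain-Python sketch. It only models the behaviour described above: `CacheCleaner`, its attributes and the sweep interval are hypothetical, and the cache is represented as a dictionary from user name to a two-element list of instance and connection count.

```python
import threading


class CacheCleaner(threading.Thread):
    """Hypothetical periodic cleaner for a user -> [context, count] cache."""

    def __init__(self, cache, cache_lock, interval_seconds):
        super().__init__(daemon=True)
        self.cache = cache
        self.cache_lock = cache_lock
        self.interval_seconds = interval_seconds
        self.stop_event = threading.Event()
        self.reclaimed = []

    def sweep(self):
        # Delete every record whose connection count is 0.
        with self.cache_lock:
            idle = [user for user, (_, n) in self.cache.items() if n == 0]
            for user in idle:
                del self.cache[user]
                self.reclaimed.append(user)

    def run(self):
        # wait() returns False on timeout and True once shutdown() is called.
        while not self.stop_event.wait(self.interval_seconds):
            self.sweep()

    def shutdown(self):
        self.stop_event.set()


cache = {"user1": [object(), 1], "user2": [object(), 0]}
cleaner = CacheCleaner(cache, threading.Lock(), interval_seconds=0.01)
cleaner.sweep()  # a single sweep, called directly for illustration
print(sorted(cache))
```

On a server, `cleaner.start()` would run `sweep()` once per interval in the background, and `cleaner.shutdown()` would end the loop. The same lock would have to guard every other access to the cache for the sweep to be safe.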

As can be seen from the above description, the data processing method based on Spark SQL according to embodiments of the present invention makes it possible to hold multiple SparkContext instances, isolated by user, inside the Java Virtual Machine (JVM) of the driver of a single Spark program. This has two main benefits. First, SparkContexts can be scheduled at thread level, which is more efficient than the process-level scheduling generally used in the prior art. Second, a SparkContext can be shared among the sessions of the same user, which improves concurrent processing capability; such sharing cannot be achieved in the prior art. Here, thread-level scheduling of SparkContexts means that what Fig. 3B shows is a single process instance running in a single JVM, inside which the SparkContexts belonging to different users are instantiated, without starting another process/JVM for each instantiation (a conventional scheme, being limited by the Spark framework itself, can achieve a similar function only by starting another process/JVM).

In some embodiments, in the data processing method based on Spark SQL according to embodiments of the present invention, a SparkContext is bound to a user: the instance needs to be created only when a request from that user arrives, multiple requests of the same user can share the instance, and different users are bound to different instances. The SparkContext is thereby decoupled from the server itself, which allows dynamic scheduling and a more efficient and reasonable use of back-end cluster resources. In addition, implementing the multi-tenant engine on Spark can greatly improve SQL query performance compared with HiveServer2.

Exemplary device

Having described the data processing method based on Spark SQL of exemplary embodiments of the present invention, the data processing device based on Spark SQL of exemplary embodiments of the present invention is described next with reference to Fig. 4.

Referring to Fig. 4, a schematic structural diagram of a data processing device based on Spark SQL according to an embodiment of the present invention is shown. The device may be provided in a terminal device, for example in an intelligent electronic device such as a desktop computer, a notebook computer, a smartphone or a tablet computer; of course, the device of embodiments of the present invention may also be provided in a server. The device 400 of embodiments of the present invention may include the following units: a searching unit 410, a processing unit 420 and an execution unit 430.

The searching unit 410 is adapted to, in response to the initiation of a session, look up the SparkContext instance corresponding to the user name in the preset relation set according to the user name of the proxy user initiating the session.

The processing unit 420 is adapted to, if the SparkContext instance corresponding to the user name is not found, create a SparkContext corresponding to the user name and instantiate it to form the SparkContext instance corresponding to the user name, and add to the preset relation set a correspondence at least between the user name and the corresponding SparkContext instance.

The execution unit 430 is adapted to create, according to the SparkContext instance corresponding to the user name of the proxy user initiating the session, a corresponding runtime environment to execute the corresponding data processing.

As an example, the preset relation set in embodiments of the present invention includes, for example, a one-to-one mapping from a first set consisting of the user names of one or more proxy users to a second set consisting of one or more SparkContext instances.

As an example, the preset relation set in embodiments of the present invention may also include, for example, a one-to-one mapping from a first set consisting of the user names of one or more proxy users to a third set, where the third set includes one or more elements and each element of the third set includes one SparkContext instance and the connection count corresponding to that SparkContext instance.

As an example, the one-to-one mapping in embodiments of the present invention is built, for example, on a thread-safe HashMap.

As an example, the correspondence at least between the user name and the corresponding SparkContext instance in embodiments of the present invention includes, for example, the correspondence between the user name and the corresponding SparkContext instance.

As an example, the correspondence at least between the user name and the corresponding SparkContext instance in embodiments of the present invention includes, for example, the correspondence between the user name on the one hand and, on the other hand, the corresponding SparkContext instance together with the connection count corresponding to that SparkContext instance.

As an example, the processing unit 420 in embodiments of the present invention is, for example, further adapted to: if the SparkContext instance corresponding to the user name of the proxy user initiating the session is found, update the connection count corresponding to that SparkContext instance in the preset relation set to the current connection count plus 1.

As an example, the processing unit 420 in embodiments of the present invention is, for example, further adapted to: when the session is closed, update the connection count corresponding to the SparkContext instance of the proxy user that initiated the session to the current connection count minus 1.

As an example, the processing unit 420 in embodiments of the present invention is, for example, further adapted to: periodically, or in response to the closing of a session, reclaim the resources occupied by SparkContext instances according to the LRU principle.

As an example, the processing unit 420 in embodiments of the present invention is, for example, adapted to: determine whether the preset relation set contains a correspondence in which the connection count of the SparkContext instance is 0 and, when such a correspondence exists, delete it from the preset relation set and release the resources occupied by the SparkContext of that correspondence.

As an example, the searching unit 410 in embodiments of the present invention is, for example, adapted so that sessions initiated by the same proxy user on different clients share the same SparkContext instance.

As an example, the corresponding data processing performed by the execution unit 430 in embodiments of the present invention includes, for example, corresponding data query processing.

As an example, the corresponding runtime environment created by the execution unit 430 in embodiments of the present invention includes, for example, a corresponding Driver RPC communication environment.

As an example, the execution unit 430 in embodiments of the present invention is, for example, adapted to create the corresponding runtime environment by the following processing: submitting a resource request to the resource manager, so that the resource manager obtains the corresponding computing resources from the queue corresponding to the proxy user initiating the session, and starting the executors bound to those computing resources.

As an example, the searching unit 410 in embodiments of the present invention is, for example, further adapted to determine, before looking up the SparkContext instance corresponding to the user name of the proxy user initiating the session, whether the authentication information of the proxy user initiating the session is valid, and to end processing of the session if the authentication information is invalid.

As an example, the searching unit 410 in embodiments of the present invention is, for example, further adapted to determine, before looking up the SparkContext instance corresponding to the user name of the proxy user initiating the session, whether the proxy user initiating the session is a party authorized by the process user that started the server, and to end processing of the session if it is not.

It should be noted that each unit of the above data processing device based on Spark SQL can perform the same processing as the corresponding step of the data processing method based on Spark SQL described above and can achieve similar functions and technical effects, which are not repeated here.

Fig. 5 shows a block diagram of an exemplary computer system/server 50 suitable for implementing embodiments of the present invention. The computer system/server 50 shown in Fig. 5 is only an example and should not impose any limitation on the functions or scope of use of embodiments of the present invention.

As shown in Fig. 5, the computer system/server 50 takes the form of a general-purpose computing device. The components of the computer system/server 50 may include, but are not limited to: one or more processors 501, a system memory 502, and a bus 503 connecting the different system components (including the system memory 502 and the processor 501).

The computer system/server 50 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer system/server 50, including volatile and non-volatile media and removable and non-removable media.

The system memory 502 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 5021 and/or a cache memory 5022. The computer system/server 50 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the ROM 5023 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 5, commonly referred to as a "hard disk drive"). Although not shown in Fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM or other optical medium) may be provided. In these cases, each drive may be connected to the bus 503 through one or more data media interfaces. The system memory 502 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.

A program/utility 5025 having a set of (at least one) program modules 5024 may be stored, for example, in the system memory 502. Such program modules 5024 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 5024 generally perform the functions and/or methods of the embodiments described in the present invention.

The computer system/server 50 may also communicate with one or more external devices 504 (such as a keyboard, a pointing device, a display and so on). This communication may take place through an input/output (I/O) interface 505. The computer system/server 50 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 506. As shown in Fig. 5, the network adapter 506 communicates with the other modules of the computer system/server 50 (such as the processor 501) through the bus 503. It should be understood that, although not shown in Fig. 5, other hardware and/or software modules may be used in conjunction with the computer system/server 50.

The processor 501 runs the programs stored in the system memory 502 and thereby performs various functional applications and data processing, for example executing each step of the data processing method based on Spark SQL: in response to the initiation of a session, looking up, in the preset relation set and according to the user name of the proxy user initiating the session, the Spark context variable instance corresponding to the user name; if the Spark context variable instance corresponding to the user name is not found, creating a Spark context variable corresponding to the user name and instantiating it to form the Spark context variable instance corresponding to the user name, and adding to the preset relation set a correspondence at least between the user name and the corresponding Spark context variable instance; and creating, according to the Spark context variable instance corresponding to the user name of the proxy user initiating the session, a corresponding runtime environment to execute the corresponding data processing.

A specific example of a computer-readable storage medium according to embodiments of the present invention is shown in Fig. 6.

The computer-readable storage medium of Fig. 6 is an optical disc 600 on which a computer program (that is, a program product) is stored. When executed by a processor, the program implements the steps recorded in the above method embodiments, for example: in response to the initiation of a session, looking up, in the preset relation set and according to the user name of the proxy user initiating the session, the Spark context variable instance corresponding to the user name; if the Spark context variable instance corresponding to the user name is not found, creating a Spark context variable corresponding to the user name and instantiating it to form the Spark context variable instance corresponding to the user name, and adding to the preset relation set a correspondence at least between the user name and the corresponding Spark context variable instance; and creating, according to the Spark context variable instance corresponding to the user name of the proxy user initiating the session, a corresponding runtime environment to execute the corresponding data processing. The specific implementation of each step is not repeated here.

It should be noted that, although several units, modules or sub-modules of the data processing device based on Spark SQL are mentioned in the above detailed description, this division is merely exemplary and not mandatory. In fact, according to embodiments of the present invention, the features and functions of two or more of the modules described above may be embodied in one module. Conversely, the features and functions of one module described above may be further divided so as to be embodied by multiple modules.

In addition, although the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the operations shown must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.

Although the spirit and principles of the present invention have been described with reference to several specific embodiments, it should be understood that the invention is not limited to the specific embodiments disclosed, and that the division into aspects does not mean that features in those aspects cannot be combined to advantage; this division is merely for convenience of presentation. The present invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A data processing method based on Spark SQL, characterized by comprising:

in response to the initiation of a session, looking up, in a preset relation set and according to the user name of the proxy user initiating the session, the Spark context variable instance corresponding to the user name;

if the Spark context variable instance corresponding to the user name is not found, creating a Spark context variable corresponding to the user name and instantiating the Spark context variable to form the Spark context variable instance corresponding to the user name, and adding to the preset relation set a correspondence at least between the user name and the corresponding Spark context variable instance; and

creating, according to the Spark context variable instance corresponding to the user name of the proxy user initiating the session, a corresponding runtime environment to execute corresponding data processing.

2. The data processing method according to claim 1, characterized in that the preset relation set includes: a one-to-one mapping from a first set consisting of the user names of one or more proxy users to a second set consisting of related information of one or more Spark context variable instances.

3. The data processing method according to claim 1, characterized in that the preset relation set includes: a one-to-one mapping from a first set consisting of the user names of one or more proxy users to a third set;

wherein the third set includes one or more elements, and each element of the third set includes related information of one Spark context variable instance and the connection count corresponding to that Spark context variable instance.

4. The data processing method according to any one of claims 1 to 3, characterized by further comprising:

periodically, or in response to the closing of the session, reclaiming the resources occupied by Spark context variable instances according to the LRU principle.

5. The data processing method according to any one of claims 1 to 3, characterized in that sessions initiated by the same proxy user on different clients share the same Spark context variable instance.

6. The data processing method according to any one of claims 1 to 3, characterized by further comprising, before the step of looking up the Spark context variable instance corresponding to the user name of the proxy user initiating the session: if the authentication information of the proxy user initiating the session is invalid, ending processing of the session.

7. The data processing method according to any one of claims 1 to 3, characterized by further comprising, before the step of looking up the Spark context variable instance corresponding to the user name of the proxy user initiating the session: if the proxy user initiating the session is not a party authorized by the process user that started the server, ending processing of the session.

8. A data processing device based on Spark SQL, characterized by comprising:

a searching unit adapted to, in response to the initiation of a session, look up, in a preset relation set and according to the user name of the proxy user initiating the session, the Spark context variable instance corresponding to the user name;

a processing unit adapted to, if the Spark context variable instance corresponding to the user name is not found, create a Spark context variable corresponding to the user name and instantiate the Spark context variable to form the Spark context variable instance corresponding to the user name, and add to the preset relation set a correspondence at least between the user name and the corresponding Spark context variable instance; and

an execution unit adapted to create, according to the Spark context variable instance corresponding to the user name of the proxy user initiating the session, a corresponding runtime environment to execute corresponding data processing.

9. A storage medium storing a program which, when executed by a processor, implements the data processing method based on Spark SQL according to any one of claims 1 to 7.

10. A computing device, comprising the storage medium according to claim 9.