CN106452899A - Distributed data mining system and method

Info

Publication number: CN106452899A
Application number: CN201610957904.2A
Authority: CN
Inventors: 丁贤; 金焰; 王备
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2016-10-27
Filing date: 2016-10-27
Publication date: 2017-02-22
Anticipated expiration: 2036-10-27
Also published as: CN106452899B

Abstract

The invention provides a distributed data mining system and method, and relates to the technical field of data mining. In the distributed data mining system, a first control server serves as the working node, and a second control server monitors it in real time; when the second control server determines that the first control server has failed, it sends a working node substitution request to a front server to complete the replacement of the working node. After the working node has been switched, the whole distributed data mining system resumes normal operation. The system and method thereby avoid the problem that, once the JobTracker server acting as the master node fails, the task scheduling system of the whole Hadoop framework is paralyzed and task scheduling and processing cannot be completed.

Description

Distributed data mining system and method

Technical field

The present invention relates to the technical field of data mining, and in particular to a distributed data mining system and method.

Background art

In recent years, the rise of a new generation of information technology, represented by big data, cloud computing, and the mobile Internet, has set off a "third IT revolution" worldwide. With the arrival of the big data era, how to further improve data analysis capabilities, mine the business value of data in depth, drive changes in product innovation, business processes, and management systems, and truly "let the data speak" has become a new problem. The essence of data mining is to extract from massive data the implicit relationships, patterns, and trends that have potential value for decision-making. Because this involves computation over massive data, it places high demands on the system architecture.

At present, mainstream big data mining systems use a distributed mining structure: a large data analysis task is decomposed, computed in parallel by multiple servers, and the partial results are finally aggregated. For example, distributed systems based on the Hadoop framework are widely used. In the Hadoop framework, the JobTracker server plays the key role of overall scheduler: it receives the big data analysis jobs submitted by user terminals, decomposes them into tasks according to the map/reduce algorithm, and then distributes the tasks to idle calculation servers (TaskTrackers) according to how busy each calculation server is. The JobTracker server also monitors the execution of tasks on the calculation servers (TaskTrackers) and reassigns a task if its execution is abnormal. This architecture has an obvious defect: the risk of a single point of failure. If the JobTracker server acting as the master node fails, the task scheduling system of the whole Hadoop framework is paralyzed. Specifically, user terminals cannot submit tasks because the master node, which is the service entry point, has failed; execution on the calculation servers (TaskTrackers) can no longer be monitored, so failed tasks cannot be reassigned; and the calculation servers (TaskTrackers) sit idle because the task distributor has been lost.

Summary of the invention

Embodiments of the invention provide a distributed data mining system and method to solve the problem that, at present, if the JobTracker server acting as the master node fails, the task scheduling system of the whole Hadoop framework is paralyzed and task scheduling and processing cannot be completed.

To achieve the above purpose, the present invention adopts the following technical solution:

A distributed data mining system, comprising: a user terminal group, a front server group, a first control server, a second control server, and a calculation server group. The user terminal group includes multiple user terminals; the front server group includes multiple front servers; the calculation server group includes multiple calculation servers. The user terminal group is communicatively connected to the front server group; the front server group is communicatively connected to the first control server, the second control server, and the calculation server group respectively;

The user terminal is configured to send a data mining task request to the front server;

The front server is configured to parse the domain name information of the data mining task request and, according to that domain name information, submit the data mining task request to the first control server acting as the working node;

The first control server is configured to decompose the data mining task corresponding to the data mining task request into multiple data mining subtasks, and to send the multiple data mining subtasks to the front server;

The front server is further configured to distribute the multiple data mining subtasks to multiple calculation servers for processing, to receive task feedback information from the calculation servers, and to send the task feedback information to the first control server;

The first control server is further configured to synchronize the task feedback information to the second control server in real time;

The second control server is configured to monitor the first control server in real time and, on confirming that the first control server has failed, to send a working node substitution request to the front server;

The front server is further configured to update its record to the network address of the second control server according to the working node substitution request, so that the second control server acts as the working node;

The second control server is further configured to broadcast a task information request to each calculation server;

The calculation server is further configured to feed back task status information to the front server on receiving the broadcast task information request;

The front server is further configured to send the task status information to the second control server;

The second control server is further configured to compare the task status information with the task feedback information, determine the difference information between them, and process the difference information according to a preset processing strategy.

Further, the front server is also configured to obtain the running state information of the other front servers in the front server group and, when the running state of another front server is a fault state, to receive connection requests from the user terminals that were connected to that front server and establish communication connections with them.

Further, the front server is also configured to record the network address of the first control server acting as the working node, or the network address of the second control server acting as the working node.

In addition, the front server is specifically configured to receive heartbeat information from the calculation servers, where the heartbeat information of a calculation server includes the task feedback information for the data mining subtasks it is processing and the CPU resource information of that calculation server, and to send the heartbeat information of the calculation servers to the first control server.

In addition, the first control server is specifically configured to send data synchronization information to the second control server when sending the multiple data mining subtasks to the front server, where the data synchronization information includes the task number of each data mining subtask and the IP address of the calculation server corresponding to each data mining subtask;

and, after receiving the heartbeat information of a calculation server, to synchronize that heartbeat information to the second control server in real time.

In addition, the second control server is specifically configured to send heartbeat requests to the first control server periodically at a preset time interval; if, after n consecutive heartbeat requests to the first control server, no heartbeat response message has been received from the first control server, it determines that the first control server has failed and sends a working node substitution request to the front server, where n is a preset count threshold.

In addition, the second control server is specifically configured to:

generate two task lists according to the task status information and the task feedback information, where each task list includes the IP addresses of the calculation servers and the CPU resource information of the calculation servers;

determine the difference information according to the two task lists;

if the difference information is a task that the first control server had assigned to a calculation server but, because the first control server then failed, had not synchronized to the second control server, update the data synchronization information of the second control server according to the task status information;

if the difference information is a task that the first control server had assigned to a calculation server and that the calculation server failed to process, but that was not synchronized to the second control server because the first control server failed, obtain the task failure information from the task status information and reassign the data mining subtask corresponding to the task failure information;

if the difference information is a data mining subtask that the first control server had not yet assigned, distribute the unassigned data mining subtask to a calculation server for processing via the front server.

In addition, the first control server is specifically configured to assign a data mining subtask, according to the CPU resource information of the calculation servers, to the calculation server with the most CPU resources among the calculation servers.

A distributed data mining method, applied to the above distributed data mining system, the system comprising: a user terminal group, a front server group, a first control server, a second control server, and a calculation server group. The user terminal group includes multiple user terminals; the front server group includes multiple front servers; the calculation server group includes multiple calculation servers. The user terminal group is communicatively connected to the front server group; the front server group is communicatively connected to the first control server, the second control server, and the calculation server group respectively;

The method includes:

A user terminal sends a data mining task request to the front server;

The front server parses the domain name information of the data mining task request and, according to that domain name information, submits the data mining task request to the first control server acting as the working node;

The first control server decomposes the data mining task corresponding to the data mining task request into multiple data mining subtasks, and sends the multiple data mining subtasks to the front server;

The front server distributes the multiple data mining subtasks to multiple calculation servers for processing, receives task feedback information from the calculation servers, and sends the task feedback information to the first control server;

The first control server synchronizes the task feedback information to the second control server in real time;

The second control server monitors the first control server in real time and, on confirming that the first control server has failed, sends a working node substitution request to the front server;

The front server updates its record to the network address of the second control server according to the working node substitution request, so that the second control server acts as the working node;

The second control server broadcasts a task information request to each calculation server;

The calculation server feeds back task status information to the front server on receiving the broadcast task information request;

The front server sends the task status information to the second control server;

The second control server compares the task status information with the task feedback information, determines the difference information between them, and processes the difference information according to a preset processing strategy.

In addition, the distributed data mining method further includes:

The front server obtains the running state information of the other front servers in the front server group; when the running state of another front server is a fault state, it receives connection requests from the user terminals that were connected to that front server and establishes communication connections with them.

In addition, the distributed data mining method further includes:

The front server records the network address of the first control server acting as the working node, or the network address of the second control server acting as the working node.

Specifically, the front server distributing the multiple data mining subtasks to multiple calculation servers for processing, receiving task feedback information from the calculation servers, and sending the task feedback information to the first control server includes:

The front server receives heartbeat information from the calculation servers; the heartbeat information of a calculation server includes the task feedback information for the data mining subtasks it is processing and the CPU resource information of that calculation server;

The heartbeat information of the calculation servers is sent to the first control server.

In addition, the distributed data mining method further includes:

When sending the multiple data mining subtasks to the front server, the first control server sends data synchronization information to the second control server; the data synchronization information includes the task number of each data mining subtask and the IP address of the calculation server corresponding to each data mining subtask;

The first control server synchronizing the task feedback information to the second control server in real time includes:

After receiving the heartbeat information of a calculation server, the first control server synchronizes that heartbeat information to the second control server in real time.

Specifically, the second control server monitoring the first control server in real time and, on confirming that the first control server has failed, sending a working node substitution request to the front server includes:

The second control server sends heartbeat requests to the first control server periodically at a preset time interval;

If, after n consecutive heartbeat requests to the first control server, no heartbeat response message has been received from the first control server, the second control server determines that the first control server has failed and sends a working node substitution request to the front server, where n is a preset count threshold.

Specifically, the second control server comparing the task status information with the task feedback information, determining the difference information between them, and processing the difference information according to a preset processing strategy includes:

The second control server generates two task lists according to the task status information and the task feedback information; each task list includes the IP addresses of the calculation servers and the CPU resource information of the calculation servers;

The second control server determines the difference information according to the two task lists;

Further, the distributed data mining method also includes:

The first control server assigns a data mining subtask, according to the CPU resource information of the calculation servers, to the calculation server with the most CPU resources among the calculation servers.

In the distributed data mining system and method provided by the embodiments of the present invention, when the first control server acting as the working node fails, the second control server can take its place and become the new working node, so that the whole distributed data mining system resumes normal operation once the working node has been switched. Likewise, when the second control server fails, the first control server can take its place. The two control servers thus form a master-slave hot standby arrangement, which avoids the problem that, at present, if the JobTracker server acting as the master node fails, the task scheduling system of the whole Hadoop framework is paralyzed and task scheduling and processing cannot be completed.

Description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a first structural schematic diagram of a distributed data mining system provided by an embodiment of the present invention;

Fig. 2 is a first flowchart of a distributed data mining method provided by an embodiment of the present invention;

Fig. 3 is a second flowchart of a distributed data mining method provided by an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

As shown in Fig. 1, an embodiment of the present invention provides a distributed data mining system 10, comprising: a user terminal group 11, a front server group 12, a first control server 13, a second control server 14, and a calculation server group 15. The user terminal group 11 includes multiple user terminals 111; the front server group 12 includes multiple front servers 121; the calculation server group 15 includes multiple calculation servers 151. The user terminal group 11 is communicatively connected to the front server group 12; the front server group 12 is communicatively connected to the first control server 13, the second control server 14, and the calculation server group 15 respectively.

The user terminal 111 is configured to send a data mining task request to the front server 121.

Here, the user terminal 111 may run a Hadoop client, a program script for a data mining task, or the like.

The front server 121 is configured to parse the domain name information of the data mining task request and, according to that domain name information, submit the data mining task request to the first control server 13 acting as the working node.

The first control server 13 is configured to decompose the data mining task corresponding to the data mining task request into multiple data mining subtasks, and to send the multiple data mining subtasks to the front server 121.

The front server 121 is further configured to distribute the multiple data mining subtasks to multiple calculation servers 151 for processing, to receive task feedback information from the calculation servers 151, and to send the task feedback information to the first control server 13.

The first control server 13 is further configured to synchronize the task feedback information to the second control server 14 in real time.

The second control server 14 is configured to monitor the first control server 13 in real time and, on confirming that the first control server 13 has failed, to send a working node substitution request to the front server 121.

The front server 121 is further configured to update its record to the network address of the second control server 14 according to the working node substitution request, so that the second control server 14 acts as the working node.

The second control server 14 is further configured to broadcast a task information request to each calculation server 151.

The calculation server 151 is further configured to feed back task status information to the front server 121 on receiving the broadcast task information request.

The front server 121 is further configured to send the task status information to the second control server 14.

The second control server 14 is further configured to compare the task status information with the task feedback information, determine the difference information between them, and process the difference information according to a preset processing strategy.

Further, the front server 121 is also configured to obtain the running state information of the other front servers 121 in the front server group 12. When the running state of another front server 121 is a fault state, it receives connection requests from the user terminals that were connected to that front server 121 and establishes communication connections with them.

Further, the front server 121 is also configured to record the network address of the first control server 13 acting as the working node, or the network address of the second control server 14 acting as the working node.
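The address bookkeeping described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the patented implementation: the class name `FrontServer`, its method names, and the example addresses are all hypothetical, and real network I/O is omitted.

```python
from typing import Dict, Tuple


class FrontServer:
    """Hypothetical front server that remembers which control server is the working node."""

    def __init__(self, working_node_addr: str) -> None:
        # Network address of the control server currently acting as working node.
        self.working_node_addr = working_node_addr

    def handle_substitution_request(self, new_addr: str) -> None:
        # On a working node substitution request, overwrite the recorded address
        # so that later requests go to the substituting control server.
        self.working_node_addr = new_addr

    def route(self, task_request: Dict) -> Tuple[str, Dict]:
        # Return (destination address, request); sending is left out of the sketch.
        return self.working_node_addr, task_request


front = FrontServer("10.0.0.13")                 # assumed address of the first control server
front.handle_substitution_request("10.0.0.14")   # assumed address of the second control server
```

Because the front server holds the only record of the working node's address, user terminals do not need to know that a switch has taken place.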

In addition, the front server 121 is specifically configured to receive heartbeat information from the calculation servers 151. The heartbeat information of a calculation server 151 includes the task feedback information for the data mining subtasks it is processing and the CPU resource information of that calculation server 151. The front server 121 sends the heartbeat information of the calculation servers 151 to the first control server 13.

In addition, the first control server 13 is specifically configured to send data synchronization information to the second control server 14 when sending the multiple data mining subtasks to the front server 121. The data synchronization information includes the task number of each data mining subtask and the IP address of the calculation server 151 corresponding to each data mining subtask.

After receiving the heartbeat information of a calculation server 151, the first control server 13 synchronizes that heartbeat information to the second control server 14 in real time.
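One way to picture the state that the second control server accumulates from these two synchronization paths is the sketch below. It assumes Python, and the class name `StandbyState`, its fields, and the heartbeat payload keys are hypothetical; the source specifies only what the synchronized information contains, not how it is stored.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StandbyState:
    """Hypothetical state kept by the second control server while it is on standby."""

    # Task number -> IP address of the calculation server the subtask was assigned to.
    assignments: Dict[str, str] = field(default_factory=dict)
    # Calculation server IP -> most recently synchronized heartbeat payload.
    heartbeats: Dict[str, dict] = field(default_factory=dict)

    def on_data_sync(self, task_number: str, server_ip: str) -> None:
        # Called when the first control server hands subtasks to the front server.
        self.assignments[task_number] = server_ip

    def on_heartbeat_sync(self, server_ip: str, heartbeat: dict) -> None:
        # Called when the first control server forwards a calculation server heartbeat.
        self.heartbeats[server_ip] = heartbeat


state = StandbyState()
state.on_data_sync("task-001", "10.0.1.5")
state.on_heartbeat_sync("10.0.1.5", {"task_feedback": "running", "cpu": 0.6})
```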

In addition, the second control server 14 is specifically configured to send heartbeat requests to the first control server 13 periodically at a preset time interval. If, after n consecutive heartbeat requests to the first control server 13, no heartbeat response message has been received from the first control server 13, the second control server 14 determines that the first control server 13 has failed and sends a working node substitution request to the front server 121, where n is a preset count threshold.
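The failure-detection rule above (n consecutive unanswered heartbeat requests) can be sketched as a small counter. This is an illustrative Python sketch only; `HeartbeatMonitor`, `send_heartbeat`, and `request_substitution` are hypothetical names, and the two callbacks stand in for the real network calls.

```python
from typing import Callable


class HeartbeatMonitor:
    """Hypothetical monitor run by the second control server."""

    def __init__(self, send_heartbeat: Callable[[], bool],
                 request_substitution: Callable[[], None], n: int) -> None:
        self.send_heartbeat = send_heartbeat              # True if a response arrived
        self.request_substitution = request_substitution  # notifies the front server
        self.n = n                                        # preset count threshold
        self.missed = 0                                   # consecutive missed responses
        self.failed_over = False

    def tick(self) -> None:
        """Run once per preset time interval."""
        if self.failed_over:
            return
        if self.send_heartbeat():
            self.missed = 0  # any response resets the consecutive count
        else:
            self.missed += 1
            if self.missed >= self.n:
                self.failed_over = True
                self.request_substitution()
```

A scheduler or timer loop would call `tick()` at the preset interval. Resetting the counter on every response is what makes the threshold apply to consecutive misses rather than to the total number of misses.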

In addition, the second control server 14 is specifically configured to:

Generate two task lists according to the task status information and the task feedback information. Each task list includes the IP addresses of the calculation servers 151 and the CPU resource information of the calculation servers 151.

Determine the difference information according to the two task lists.

If the difference information is a task that the first control server 13 had assigned to a calculation server 151 but, because the first control server 13 then failed, had not synchronized to the second control server 14, update the data synchronization information of the second control server 14 according to the task status information.

If the difference information is a task that the first control server 13 had assigned to a calculation server 151 and that the calculation server 151 failed to process, but that was not synchronized to the second control server 14 because the first control server 13 failed, obtain the task failure information from the task status information and reassign the data mining subtask corresponding to the task failure information.

If the difference information is a data mining subtask that the first control server 13 had not yet assigned, distribute the unassigned data mining subtask to a calculation server 151 for processing via the front server 121.
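The three difference cases above can be sketched as a comparison of two task maps. This Python sketch rests on several assumptions that the source does not state: that each list can be keyed by task number, that each entry carries a `status` field, that the second control server knows the full set of subtask numbers, and that the action labels `update_sync`, `reassign`, and `assign` are adequate stand-ins for the preset processing strategy. All names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def reconcile(all_tasks: Set[str],
              synced: Dict[str, dict],
              reported: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Compare the synchronized view with the calculation servers' own reports.

    synced:   task number -> {"ip": ..., "status": ...} as last synchronized
              from the first control server (task feedback information).
    reported: task number -> {"ip": ..., "status": ...} as reported by the
              calculation servers after the broadcast (task status information).
    Returns (action, task number) pairs ordered by task number.
    """
    actions: List[Tuple[str, str]] = []
    for task in sorted(all_tasks):
        if task in reported:
            rep = reported[task]
            known = synced.get(task)
            if rep["status"] == "failed" and (known is None or known["status"] != "failed"):
                # Assigned, failed on the calculation server, failure never synchronized.
                actions.append(("reassign", task))
            elif known is None:
                # Assigned, but the assignment was never synchronized.
                actions.append(("update_sync", task))
        elif task not in synced:
            # Never assigned by the first control server.
            actions.append(("assign", task))
    return actions
```

Tasks that appear identically in both views produce no action, since they are not part of the difference information.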

In addition, the first control server 13 is specifically configured to assign a data mining subtask, according to the CPU resource information of the calculation servers 151, to the calculation server 151 with the most CPU resources among the calculation servers 151.
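The selection rule above amounts to taking a maximum over the reported CPU resource values. The Python sketch below assumes that "CPU resources" means the available share reported in each heartbeat and that it can be expressed as a single number per server; the function name `pick_server` and the example figures are hypothetical.

```python
from typing import Dict


def pick_server(cpu_available: Dict[str, float]) -> str:
    """Return the IP of the calculation server reporting the largest CPU resource value."""
    if not cpu_available:
        raise ValueError("no calculation server has reported heartbeat information")
    # max() over the dict iterates its keys; the key function looks up each value.
    return max(cpu_available, key=cpu_available.get)


# Assumed example values: IP address -> available CPU share from the latest heartbeat.
target = pick_server({"10.0.1.5": 0.20, "10.0.1.6": 0.70, "10.0.1.7": 0.50})
```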

In the distributed data mining system provided by the embodiment of the present invention, when the first control server acting as the working node fails, the second control server can take its place and become the new working node, so that the whole distributed data mining system resumes normal operation once the working node has been switched. Likewise, when the second control server fails, the first control server can take its place. The two control servers thus form a master-slave hot standby arrangement, which avoids the problem that, at present, if the JobTracker server acting as the master node fails, the task scheduling system of the whole Hadoop framework is paralyzed and task scheduling and processing cannot be completed.

As shown in Fig. 2, an embodiment of the present invention provides a distributed data mining method, applied to the distributed data mining system shown in Fig. 1. The method includes:

Step 201: A user terminal sends a data mining task request to the front server.

Step 202: The front server parses the domain name information of the data mining task request and, according to that domain name information, submits the data mining task request to the first control server acting as the working node.

Step 203: The first control server decomposes the data mining task corresponding to the data mining task request into multiple data mining subtasks, and sends the multiple data mining subtasks to the front server.

Step 204: The front server distributes the multiple data mining subtasks to multiple calculation servers for processing, receives task feedback information from the calculation servers, and sends the task feedback information to the first control server.

Step 205: The first control server synchronizes the task feedback information to the second control server in real time.

Step 206: The second control server monitors the first control server in real time and, on confirming that the first control server has failed, sends a working node substitution request to the front server.

Step 207: The front server updates its record to the network address of the second control server according to the working node substitution request, so that the second control server acts as the working node.

Step 208: The second control server broadcasts a task information request to each calculation server.

Step 209: The calculation server feeds back task status information to the front server on receiving the broadcast task information request.

Step 210: The front server sends the task status information to the second control server.

Step 211: The second control server compares the task status information with the task feedback information, determines the difference information between them, and processes the difference information according to a preset processing strategy.

In the distributed data mining method provided by the embodiment of the present invention, when the first control server acting as the working node fails, the second control server can take its place and become the new working node, so that the whole distributed data mining system resumes normal operation once the working node has been switched. Likewise, when the second control server fails, the first control server can take its place. The two control servers thus form a master-slave hot standby arrangement, which avoids the problem that, at present, if the JobTracker server acting as the master node fails, the task scheduling system of the whole Hadoop framework is paralyzed and task scheduling and processing cannot be completed.

To help those skilled in the art better understand the present invention, a more detailed embodiment is set out below. As shown in Fig. 3, an embodiment of the present invention provides a distributed data mining method, including:

Step 301: A user terminal establishes a communication connection with a front server.

Step 302: Each front server obtains the running state information of the other front servers in the front server group; when the running state of another front server is a fault state, it receives connection requests from the user terminals that were connected to that front server and establishes communication connections with them.

Here, to achieve load balancing of the front server group, the present invention uses ZooKeeper as the distributed application coordination service. A ZooKeeper ensemble generally consists of multiple nodes (here, the front servers). The nodes obtain each other's running state information through heartbeats, and the full data set is kept in the memory of each node, so the failure of a single node does not affect the service capability of the whole cluster. In the present invention, each front server acts as a ZooKeeper node. Once a client connects to a front server, it keeps that connection and uses it to send requests, receive event notifications, and send heartbeats. If the connected node fails, the client automatically connects to another available node. Together these form a highly available service cluster that coordinates the multiple distributed parallel processes.
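The client-side behavior described above, moving to another available node when the connected one fails, can be illustrated with the generic sketch below. It is written in plain Python and deliberately does not use any ZooKeeper client API; the function `connect_with_failover`, the `try_connect` callback, and the node names are hypothetical stand-ins for whatever connection logic a real client library performs internally.

```python
from typing import Callable, Iterable, Optional


def connect_with_failover(nodes: Iterable[str],
                          try_connect: Callable[[str], bool]) -> Optional[str]:
    """Return the first node that accepts a connection, or None if every node fails."""
    for node in nodes:
        if try_connect(node):
            return node
    return None


# Assumed example: front-1 is down, so the client ends up on front-2.
down = {"front-1"}
connected = connect_with_failover(
    ["front-1", "front-2", "front-3"],
    lambda node: node not in down,
)
```

In practice the list of candidate nodes would come from the client's configuration, and the same routine would be rerun whenever the current connection is lost.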

Step 303: the user terminal sends a data mining task request to the front server.

Step 304: the front server parses the domain name information of the data mining task request and, according to that domain name information, submits the data mining task request to the first control server acting as the working node.

The front server records the network address of the first control server acting as the working node. At this point, the second control server acts as the secondary node.

Step 305: the first control server decomposes the data mining task corresponding to the data mining task request into multiple data mining subtasks and sends the multiple data mining subtasks to the front server.

Step 306: when sending the multiple data mining subtasks to the front server, the first control server sends data synchronization information to the second control server; the data synchronization information includes the task number of each data mining subtask and the IP address of the calculation server corresponding to each data mining subtask.
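The data synchronization information of step 306 can be pictured as a list of (task number, IP address) records. The following is a minimal sketch under that assumption; the names `SyncRecord`, `build_sync_info`, and `StandbyControlServer`, and the example addresses, are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncRecord:
    """One entry of the data synchronization information (hypothetical layout)."""
    task_number: int  # task number of a data mining subtask
    server_ip: str    # IP address of the calculation server assigned to it


def build_sync_info(assignments):
    """Turn a {task_number: server_ip} mapping into sync records."""
    return [SyncRecord(n, ip) for n, ip in sorted(assignments.items())]


class StandbyControlServer:
    """Hypothetical second control server that stores what it is told."""

    def __init__(self):
        self.synced = {}

    def receive_sync(self, records):
        # Remember which calculation server holds which subtask.
        for record in records:
            self.synced[record.task_number] = record.server_ip


standby = StandbyControlServer()
standby.receive_sync(build_sync_info({2: "10.0.0.12", 1: "10.0.0.11"}))
```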

Step 307: the front server distributes the multiple data mining subtasks to multiple calculation servers for processing.

Step 308: the front server receives the heartbeat information of the calculation servers and sends it to the first control server.

The heartbeat information of a calculation server includes the task feedback information on its processing of data mining subtasks and the CPU resource information of the calculation server.

Step 309: according to the CPU resource information of the calculation servers, the first control server may go on to assign a data mining subtask to the calculation server with the largest CPU resource among the calculation servers.
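The selection rule of step 309 can be sketched as follows. The patent does not define how the CPU resource is measured; this sketch assumes, purely for illustration, that each heartbeat carries a `cpu_free` value (fraction of CPU currently available) and that the server with the largest value is chosen. The function name `pick_server` and the field name are hypothetical.

```python
def pick_server(heartbeats):
    """Return the IP of the calculation server with the largest CPU resource.

    heartbeats: {server_ip: {"cpu_free": float}} -- assumed heartbeat layout.
    """
    if not heartbeats:
        raise ValueError("no calculation server has reported a heartbeat")
    # Compare servers by their reported CPU resource value.
    return max(heartbeats, key=lambda ip: heartbeats[ip]["cpu_free"])


heartbeats = {
    "10.0.0.11": {"cpu_free": 0.20},
    "10.0.0.12": {"cpu_free": 0.75},
    "10.0.0.13": {"cpu_free": 0.40},
}
chosen = pick_server(heartbeats)
```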

Step 310: after receiving the heartbeat information of a calculation server, the first control server synchronizes it to the second control server in real time.

Step 311: the second control server periodically sends heartbeat requests to the first control server at a preset time interval; if, after n consecutive heartbeat requests have been sent to the first control server, no heartbeat response message has been received from the first control server, the second control server determines that the first control server has failed and sends a working node replacement request to the front server.

Here, n is a preset count threshold.
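The failure detection of step 311 amounts to counting consecutive unanswered heartbeat requests. Below is a minimal sketch under stated assumptions: the `HeartbeatMonitor` class, the `probe` callable (returns True when a heartbeat response arrives), and the `on_failure` callback are hypothetical, and the preset time interval is represented by one call to `tick()` rather than by a real timer.

```python
class HeartbeatMonitor:
    """Hypothetical monitor run by the second control server."""

    def __init__(self, probe, n, on_failure):
        self.probe = probe            # sends one heartbeat request, True if answered
        self.n = n                    # preset count threshold
        self.on_failure = on_failure  # e.g. send the working node replacement request
        self.misses = 0
        self.failed = False

    def tick(self):
        """Call once per preset time interval."""
        if self.failed:
            return
        if self.probe():
            self.misses = 0           # any response resets the consecutive count
            return
        self.misses += 1
        if self.misses >= self.n:
            self.failed = True
            self.on_failure()


responses = iter([True, False, True, False, False, False])
requests_sent = []
monitor = HeartbeatMonitor(
    probe=lambda: next(responses),
    n=3,
    on_failure=lambda: requests_sent.append("replace-working-node"),
)
for _ in range(6):
    monitor.tick()
```

Resetting the counter on every response means that only n misses in a row trigger the replacement request, so an isolated lost response does not cause a switch.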

Step 312: according to the working node replacement request, the front server updates its record to the network address of the second control server, so that the second control server acts as the working node.

At this point, the front server records the network address of the second control server acting as the working node.
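The address switch of step 312 can be sketched as a single recorded value that all later task requests are routed to. The class `FrontServerRouter`, its method names, and the example addresses are hypothetical; the sketch only shows that updating the record is enough to redirect subsequent requests.

```python
class FrontServerRouter:
    """Hypothetical front server that records the working node's address."""

    def __init__(self, working_address):
        self.working_address = working_address

    def handle_replacement_request(self, new_address):
        # Update the recorded address so the new server becomes the working node.
        self.working_address = new_address

    def route(self, task_request):
        # Submit the task request to whichever server is recorded as working node.
        return (self.working_address, task_request)


router = FrontServerRouter("10.0.1.1")           # first control server (assumed address)
before = router.route("mine-dataset-A")
router.handle_replacement_request("10.0.1.2")    # second control server (assumed address)
after = router.route("mine-dataset-B")
```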

Step 313: the second control server broadcasts a task information request to each calculation server.

Step 314: upon detecting the broadcast task information request, each calculation server feeds back task situation information to the front server.

Step 315: the front server sends the task situation information to the second control server.

Step 316: the second control server generates two task lists according to the task situation information and the task feedback information.

Each task list includes the IP addresses of the calculation servers and the CPU resource information of the calculation servers.

Step 317: the second control server determines the difference information according to the two task lists.

After step 317, step 318, step 319, or step 320 may be executed.

Step 318: if the difference information is a task that the first control server had assigned to a calculation server but that was not synchronized to the second control server because the first control server failed, the data synchronization information of the second control server is updated according to the task situation information.

Step 319: if the difference information is a task that the first control server had assigned to a calculation server and that the calculation server failed to process, but that was not synchronized to the second control server because the first control server failed, the task failure information is obtained from the task situation information and the data mining subtask corresponding to the task failure information is reassigned.

Step 320: if the difference information is a data mining subtask that the first control server had not yet assigned, the unassigned data mining subtask is distributed through the front server to a calculation server for processing.
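Steps 316 to 320 can be read as a reconciliation between what the second control server already knows and what the calculation servers report. The following is a minimal sketch under several assumptions that the patent does not state: the second control server knows the full set of subtask numbers (`all_tasks`), its synchronized knowledge is a `{task_number: ip}` mapping (`synced`), and the task situation information is a `{task_number: (ip, status)}` mapping (`reported`) whose status is `"running"` or `"failed"`. The function name `reconcile` is hypothetical.

```python
def reconcile(all_tasks, synced, reported):
    """Classify the differences between synced and reported task data.

    Returns (updated_sync, to_reassign, to_assign).
    """
    updated_sync = dict(synced)
    to_reassign = []  # step 319: assigned, failed, never synchronized
    to_assign = []    # step 320: never assigned at all
    for task in sorted(all_tasks):
        if task in reported:
            ip, status = reported[task]
            if task not in synced:
                if status == "failed":
                    to_reassign.append(task)
                else:
                    # Step 318: assigned but not synchronized; record it now.
                    updated_sync[task] = ip
        elif task not in synced:
            to_assign.append(task)
    return updated_sync, to_reassign, to_assign


updated, reassign, assign = reconcile(
    all_tasks={1, 2, 3, 4},
    synced={1: "10.0.0.11"},
    reported={
        1: ("10.0.0.11", "running"),
        2: ("10.0.0.12", "running"),
        3: ("10.0.0.13", "failed"),
    },
)
```

Only the three cases named in steps 318 to 320 are handled here; a task that was both synchronized and later reported as failed is outside those cases and is left untouched by this sketch.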

After step 318, step 319, or step 320 is completed, the second control server is fully in place as the working node, and the flow returns to step 304 and continues with the second control server as the working node. The working process of the second control server is then the same as that of the first control server, and subsequent monitoring of the second control server is carried out by the first control server. The two control servers thus alternate cyclically, forming a master-slave hot-standby arrangement. It is worth noting that "first" and "second" in the first control server and the second control server serve only to distinguish the two control servers; their structures and functions are essentially the same, and when one acts as the working node, the other acts as the secondary node.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Specific embodiments have been used herein to explain the principles and implementations of the present invention. The description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A distributed data mining system, characterized by comprising: a user terminal group, a front server group, a first control server, a second control server, and a calculation server group; the user terminal group includes multiple user terminals; the front server group includes multiple front servers; the calculation server group includes multiple calculation servers; the user terminal group is communicatively connected to the front server group; the front server group is communicatively connected to the first control server, the second control server, and the calculation server group, respectively;

the user terminal is configured to send a data mining task request to the front server;

the front server is configured to parse the domain name information of the data mining task request and, according to the domain name information of the data mining task request, submit the data mining task request to the first control server acting as the working node;

the first control server is configured to decompose the data mining task corresponding to the data mining task request into multiple data mining subtasks, and to send the multiple data mining subtasks to the front server;

the front server is further configured to distribute the multiple data mining subtasks to multiple calculation servers for processing, to receive the task feedback information of the calculation servers, and to send the task feedback information to the first control server;

the first control server is further configured to synchronize the task feedback information to the second control server in real time;

the second control server is configured to monitor the first control server in real time and, upon confirming that the first control server has failed, to send a working node replacement request to the front server;

the front server is further configured to update, according to the working node replacement request, its record to the network address of the second control server, so that the second control server acts as the working node;

the calculation server is further configured to feed back task situation information to the front server upon detecting a broadcast task information request;

the second control server is further configured to compare the task situation information with the task feedback information, to determine the difference information between the task situation information and the task feedback information, and to process the difference information according to a preset processing strategy.

2. The distributed data mining system according to claim 1, characterized in that the front server is further configured to obtain the running state information of the other front servers in the front server group and, when the running state of another front server is a fault state, to receive the connection requests of the user terminals connected to that front server and establish communication connections.

3. The distributed data mining system according to claim 2, characterized in that the front server is further configured to record the network address of the first control server acting as the working node or the network address of the second control server acting as the working node.

4. The distributed data mining system according to claim 3, characterized in that the front server is specifically configured to receive the heartbeat information of the calculation server, the heartbeat information of the calculation server including the task feedback information on the calculation server's processing of data mining subtasks and the CPU resource information of the calculation server, and to send the heartbeat information of the calculation server to the first control server.

5. The distributed data mining system according to claim 4, characterized in that the first control server is specifically configured to: when sending the multiple data mining subtasks to the front server, send data synchronization information to the second control server, the data synchronization information including the task number of each data mining subtask and the IP address of the calculation server corresponding to each data mining subtask;

and, after receiving the heartbeat information of a calculation server, synchronize the heartbeat information of the calculation server to the second control server in real time.

6. The distributed data mining system according to claim 5, characterized in that the second control server is specifically configured to periodically send heartbeat requests to the first control server at a preset time interval and, if no heartbeat response message has been received from the first control server after n consecutive heartbeat requests have been sent to the first control server, to determine that the first control server has failed and send a working node replacement request to the front server, where n is a preset count threshold.

7. The distributed data mining system according to claim 6, characterized in that the second control server is specifically configured to:

generate two task lists according to the task situation information and the task feedback information, each task list including the IP addresses of the calculation servers and the CPU resource information of the calculation servers;

if the difference information is a task that the first control server had assigned to a calculation server but that was not synchronized to the second control server because the first control server failed, update the data synchronization information of the second control server according to the task situation information;

if the difference information is a task that the first control server had assigned to a calculation server and that the calculation server failed to process, but that was not synchronized to the second control server because the first control server failed, obtain the task failure information from the task situation information and reassign the data mining subtask corresponding to the task failure information;

if the difference information is a data mining subtask that the first control server had not yet assigned, distribute the unassigned data mining subtask through the front server to a calculation server for processing.

8. The distributed data mining system according to claim 7, characterized in that the first control server is specifically configured to assign, according to the CPU resource information of the calculation servers, a data mining subtask to the calculation server with the largest CPU resource among the calculation servers.

9. A distributed data mining method, characterized in that it is applied to the distributed data mining system according to any one of claims 1 to 8, the system comprising: a user terminal group, a front server group, a first control server, a second control server, and a calculation server group; the user terminal group includes multiple user terminals; the front server group includes multiple front servers; the calculation server group includes multiple calculation servers; the user terminal group is communicatively connected to the front server group; the front server group is communicatively connected to the first control server, the second control server, and the calculation server group, respectively;

the method comprises:

the user terminal sends a data mining task request to the front server;

the front server parses the domain name information of the data mining task request and, according to the domain name information of the data mining task request, submits the data mining task request to the first control server acting as the working node;

the first control server decomposes the data mining task corresponding to the data mining task request into multiple data mining subtasks and sends the multiple data mining subtasks to the front server;

the front server distributes the multiple data mining subtasks to multiple calculation servers for processing, receives the task feedback information of the calculation servers, and sends the task feedback information to the first control server;

the second control server monitors the first control server in real time and, upon confirming that the first control server has failed, sends a working node replacement request to the front server;

the front server updates, according to the working node replacement request, its record to the network address of the second control server, so that the second control server acts as the working node;

the calculation server feeds back task situation information to the front server upon detecting a broadcast task information request;

the second control server compares the task situation information with the task feedback information, determines the difference information between the task situation information and the task feedback information, and processes the difference information according to a preset processing strategy.

10. The distributed data mining method according to claim 9, characterized by further comprising:

the front server obtains the running state information of the other front servers in the front server group; when the running state of another front server is a fault state, it receives the connection requests of the user terminals connected to that front server and establishes communication connections.

11. The distributed data mining method according to claim 10, characterized by further comprising:

the front server records the network address of the first control server acting as the working node or the network address of the second control server acting as the working node.

12. The distributed data mining method according to claim 11, characterized in that the front server distributing the multiple data mining subtasks to multiple calculation servers for processing, receiving the task feedback information of the calculation servers, and sending the task feedback information to the first control server comprises:

the front server receives the heartbeat information of the calculation server; the heartbeat information of the calculation server includes the task feedback information on the calculation server's processing of data mining subtasks and the CPU resource information of the calculation server;

13. The distributed data mining method according to claim 12, characterized by further comprising:

when sending the multiple data mining subtasks to the front server, the first control server sends data synchronization information to the second control server; the data synchronization information includes the task number of each data mining subtask and the IP address of the calculation server corresponding to each data mining subtask;

the first control server synchronizing the task feedback information to the second control server in real time comprises:

after receiving the heartbeat information of a calculation server, the first control server synchronizes the heartbeat information of the calculation server to the second control server in real time.

14. The distributed data mining method according to claim 13, characterized in that the second control server monitoring the first control server in real time and, upon confirming that the first control server has failed, sending a working node replacement request to the front server comprises:

the second control server periodically sends heartbeat requests to the first control server at a preset time interval;

if no heartbeat response message has been received from the first control server after n consecutive heartbeat requests have been sent to the first control server, the second control server determines that the first control server has failed and sends a working node replacement request to the front server, where n is a preset count threshold.

15. The distributed data mining method according to claim 14, characterized in that the second control server comparing the task situation information with the task feedback information, determining the difference information between the task situation information and the task feedback information, and processing the difference information according to a preset processing strategy comprises:

the second control server generates two task lists according to the task situation information and the task feedback information; each task list includes the IP addresses of the calculation servers and the CPU resource information of the calculation servers;

16. The distributed data mining method according to claim 15, characterized by further comprising:

the first control server assigns, according to the CPU resource information of the calculation servers, a data mining subtask to the calculation server with the largest CPU resource among the calculation servers.