CN115934328B

CN115934328B - Train fault automatic identification cluster system and method based on fine-grained task disassembly

Info

Publication number: CN115934328B
Application number: CN202211558948.XA
Authority: CN
Inventors: 朱润光; 苏培旺; 贾凡; 杨宝雨
Original assignee: Harbin Kejia General Mechanical and Electrical Co Ltd
Current assignee: Harbin Kejia General Mechanical and Electrical Co Ltd
Priority date: 2022-12-06
Filing date: 2022-12-06
Publication date: 2023-08-08
Anticipated expiration: 2042-12-06
Also published as: CN115934328A

Abstract

A train fault automatic identification cluster system and method based on fine-grained task disassembly, belonging to the technical field of detection. To solve the problems of long detection time and high hardware requirements in existing single-machine automatic train fault identification, the system deploys a plurality of identification servers as a cluster. An identification service unit is deployed on every identification server, while a task distribution service unit and a result collection service unit are deployed on only a specific few identification servers. The task distribution service unit disassembles the task to be detected and delivers the resulting tasks, specifying the fault type to be identified with each delivery. When an identification server receives an identification task, its identification service unit selects the identification module corresponding to that fault type to identify the train pictures, and on completion sends the identification result to the result collection service unit (also referred to as the result recovery service unit).

Description

Train fault automatic identification cluster system and method based on fine-grained task disassembly

Technical Field

The invention belongs to the technical field of detection, and particularly relates to a train fault automatic identification cluster system and method.

Background

In a train fault detection system, linear-array (line-scan) cameras are installed beside the train track. Photos of multiple parts of a passing train are collected on a server, and a deep learning algorithm detects and identifies faults in the original pictures, so that possible faults are identified and marked and train accidents are avoided.

Because trains come in many models, the deep learning models used for fault detection must support many different vehicle types and detect many different faults, which places high demands on software and hardware resources. Trains also generally pass the checkpoint at high speed, so detection of a passing train must be completed quickly, which places high demands on detection time as well.

An intelligent train detection system that performs fault identification by deep learning places very high demands on software and hardware resources. When a train passes a detection station, more than 1 GB of pictures to be identified can be generated, and actual service requirements dictate that fault identification for the train be completed within a few minutes. Very high-end software and hardware must therefore currently be configured for train detection, at very high cost, and in practice the detection requirements are still not well met. The software and hardware requirements are higher still for fine-grained detection.

The detection task can therefore be completed by distributing the fault identification modules across multiple servers: parallel identification improves the overall identification speed, and multiple detection stations can carry out fault detection on multiple trains simultaneously. In this case, however, all distributed nodes require multi-copy cluster deployment. A software or hardware fault on any node in the cluster can affect the train fault identification conclusion, lengthening the identification time or preventing fault identification from completing for all vehicles. To complete train detection within the specified time limit, a task must be retried immediately after any error occurs, and the granularity of the retried task must be controlled so that the retry can start immediately in an error scenario and finish within the required time limit, with fault identification completed within a few minutes after the train passes. Current deployments of railway train detection systems do not address these requirements well.

Disclosure of Invention

The invention aims to solve the problems of long detection time and high hardware requirement in the existing single-machine type train fault automatic identification and the problem of low processing efficiency caused by the fact that the processing mode of the identification task cannot be effectively adjusted in the existing distributed train fault automatic identification.

The train fault automatic identification cluster system based on fine-grained task disassembly adopts clustered deployment of a plurality of identification servers, an identification service unit is deployed on each identification server, and a task distribution service unit and a result collection service unit are deployed on only a specific few identification servers; the identification service unit of each identification server comprises a plurality of identification modules, and each identification module correspondingly identifies one fault type;

the task distribution service unit disassembles the task to be detected and performs task delivery, the type of the fault to be identified is pointed out when the task is delivered, the identification server receives the identification task, the identification service unit selects a corresponding identification module to identify the train picture according to the type of the fault to be identified, and the identification service unit sends the identification result to the result recovery service unit after the completion;

the task distribution service unit performs task delivery including a process of dividing a resource pool and a process of routing to a specific identification server; wherein,

the resource pool division process is as follows:

logically dividing the identification servers with the same identification modules into one resource pool, wherein one resource pool comprises n identification servers, that is, n tasks can be executed simultaneously; if K identification servers are needed to accommodate the full set of identification modules, the whole cluster is divided into K resource pools;

the process of routing to a particular recognition server is as follows:

firstly, searching a specific resource pool through a bitmap mapping algorithm, and then searching a specific identification server in the resource pool to execute an identification task;

the result recovery service unit recovers the identification results of all tasks and merges them, and merges the final detection result locally after all fault detection tasks for the whole train are completed; if some task identifications fail during identification of the train, or a result has not been returned when the identification duration of the tasks to be detected for the whole train reaches a time threshold, the recovery service unit automatically informs the task distribution service unit that the fine-grained tasks that have not returned identification results need to be re-delivered, and all re-delivered tasks are processed again by the cluster system.

Further, the task distribution service unit disassembles the task to be detected according to the number of train cars and the number of fault identification modules.

Further, the bitmap construction process in the bitmap mapping algorithm is as follows:

bitmap encoding is performed on the identification modules: the T types of identification modules are arranged in sequence into an I × J matrix, namely the bitmap, wherein I is the number of identification modules contained in a single identification server, J is the number of identification servers needed to contain all the identification modules, and each identification module corresponds to one coordinate point (x, y) of the matrix;

according to the physical deployment, the identification servers in one resource pool deploy the same identification modules, so the coordinate points in the bitmap and the resource pools are in a many-to-one relationship, and the value of each coordinate point is set to the ID of its resource pool.

Further, the structure of the bitmap is realized through a Map data structure.

Further, the process of finding a specific identification server in the resource pool includes the following steps:

step 2.1, calculating an identification server value in the resource pool:

identification server value = identification server physical address (network card physical address) % identification server lookup radix;

there are N identification servers in the resource pool; the N identification server values in the resource pool are calculated, stored in an identification server value list and arranged in ascending order;

step 2.2, calculating task values:

task value = identification server selection ID % identification server lookup radix;

step 2.3, selecting according to a selection rule:

when a certain task is to be executed, the identification server value list is traversed in ascending order with the task value calculated in step 2.2, the first identification server value greater than or equal to the task value is found, and the identification server corresponding to that value executes the task.

Further, the physical address of the identification server for calculating the identification server value adopts the physical address of the network card.

Further, the identification server lookup radix = 65535.

Further, when the task distribution service unit performs task delivery, the state of the identification servers is monitored; when a certain identification server is busy, its identification server value is looked up in the identification server value list, and the identification server selection IDs of tasks that would be routed to that identification server are weighted in the task distribution service unit, so that the calculated task values are not adjacent to that identification server value.

Further, in the process of monitoring the state of the identification servers, when a server is idle, the reverse of the processing used for a busy identification server is applied.

The train fault automatic identification method based on fine-grained task disassembly and failure retry comprises the following steps:

s1, using the train fault automatic identification cluster system based on fine-grained task disassembly described above, first grouping by the number of train cars, with every 10 cars forming one identification group;

s2, creating identification tasks, each identification group together with one fault type forming one identification task;

s3, traversing the identification task list and delivering each identification task to an identification service node according to how the identification modules are deployed in the cluster of the system;

s4, waiting to receive the returned identification conclusions;

s5, if all identification results have been returned, merging the identification results and ending the detection task;

s6, when half of the maximum identification duration limit has elapsed, traversing the identification task list and re-delivering to the task queue every unfinished task whose identification result has not been correctly returned; if all identification tasks are completed, merging the identification conclusions and ending the task;

s7, if the identification duration limit is reached and not all tasks have been identified, returning a failure message; if all identification tasks are completed, merging the identification conclusions and ending the task.

The beneficial effects are that:

1. The speed at which the system detects and identifies train faults is improved. The detection task is split into a large number of tiny tasks that are put evenly into the cluster system, where multiple identification nodes process detection work simultaneously; by scaling out the deployment of the system, the fault detection time for a single train passing can be greatly shortened. Fine-grained splitting benefits load balancing and improves the parallel processing capacity of identification, so that the computing power of the cluster is exploited to the greatest extent, and the identification speed can be greatly improved when the overall load of the cluster system is low.

2. The reliability of the system is improved. Because the invention splits tasks at fine granularity, failed fine-grained tasks can be retried quickly, which ensures that the total identification time can be kept within the limited time interval.

Drawings

Fig. 1 is a schematic deployment diagram of the train fault automatic identification cluster system based on fine-grained task disassembly.

Fig. 2 is a schematic diagram of the entire cluster divided into K resource pools.

Fig. 3 is a schematic diagram of the matrix coordinates of each identification module.

Fig. 4 is a diagram of the mapping relationship between coordinate points in the bitmap and resource pools.

Figs. 5 and 6 are schematic diagrams of the values in the identification server value list logically forming a ring.

Fig. 7 is a schematic diagram of finding the identification server value that is greater than or equal to the task value.

Fig. 8 is a schematic diagram of selecting the next identification server when the identification server clockwise adjacent to the task is removed.

Fig. 9 is a schematic diagram of selecting the identification server when an identification server not clockwise adjacent to the task is removed.

Fig. 10 is a schematic diagram of selecting the identification server when an identification server clockwise adjacent to the task is added.

Fig. 11 is a schematic diagram of selecting the identification server when an identification server not clockwise adjacent to the task is added.

Fig. 12 is a schematic diagram of selecting the next identification server by weighting the identification server selection ID.

Fig. 13 is a flow chart of the train fault automatic identification method based on fine-grained task disassembly and failure retry.

Detailed Description

It should be noted in particular that, without conflict, the various embodiments disclosed herein may be combined with each other.

The first embodiment is as follows:

This embodiment is a train fault automatic identification cluster system based on fine-grained task disassembly. The system adopts clustered deployment. To reuse servers, the task distribution service units, result collection service units and identification service units are all deployed on identification servers, and there are fewer task distribution service units and result collection service units than identification service units: every identification server is provided with an identification service unit, and only a specific number of identification servers are also provided with a task distribution service unit and a result collection service unit. Note in particular that the identification service unit of each identification server comprises a plurality of identification modules, and each identification module identifies one fault type. A deployment schematic is shown in Fig. 1.

The task disassembly process is as follows:

Task splitting is completed by the task distribution service unit deployed on a specific identification server. The task to be detected is split at fine granularity in two dimensions. The first dimension: every Q cars of the train form one identification group (Q = 10 in this embodiment), so one train can be divided into several identification groups. The second dimension: one task runs only one identification module and detects only one fault type. The fault type to be identified is specified when the task is delivered; after the identification server receives the identification task, its identification service unit selects the corresponding identification module to identify the train pictures according to that fault type, sends the identification result to the result recovery service unit when identification is complete, and does not detect any other type of fault.

Assuming that a train is split into M identification groups and there are N fault identification modules, the detection task is split into M × N identification tasks.
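The two-dimensional split can be illustrated with a minimal Python sketch. The class name `IdentificationTask`, the function `split_tasks` and the fault type names are hypothetical and are not taken from the patent; only the grouping by Q = 10 cars and the one-fault-type-per-task rule come from the description above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class IdentificationTask:
    group_index: int   # which identification group of the train
    car_range: tuple   # (first_car, last_car), 1-based and inclusive
    fault_type: str    # the single fault type this task checks


def split_tasks(num_cars, fault_types, group_size=10):
    """Split one train-passing detection job into M x N fine-grained tasks."""
    # First dimension: every `group_size` cars form one identification group.
    groups = [
        (start, min(start + group_size - 1, num_cars))
        for start in range(1, num_cars + 1, group_size)
    ]
    # Second dimension: one task per (group, fault type) pair.
    return [
        IdentificationTask(index, car_range, fault)
        for (index, car_range), fault in product(enumerate(groups), fault_types)
    ]


# Hypothetical fault type names, used only for illustration.
tasks = split_tasks(45, ["bolt_missing", "hose_detached", "spring_broken"])
print(len(tasks))  # 5 groups x 3 fault types = 15
```

With 45 cars the last group is shorter than Q (cars 41 to 45); how a real deployment treats a partial group is not stated in the description, so this is an assumption of the sketch.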

The task delivery process is as follows:

Dividing resource pools: task delivery is also completed by the task distribution service unit on a specific identification server. Because of how the identification cluster is physically deployed, the identification units deployed on different identification servers differ, and so do the services they execute. Identification servers with the same identification modules are therefore logically divided into one resource pool; one resource pool contains n identification servers, that is, n tasks can be executed simultaneously. If K identification servers are needed to accommodate the full set of identification modules, the complete cluster is divided into K resource pools (resource pool ID ∈ [0, K)), as shown in Fig. 2.
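As a rough illustration of the logical division, the following Python sketch groups servers whose deployed module sets are identical. The function name `divide_resource_pools`, the server names and the module names are hypothetical; the patent does not specify how pool IDs are assigned, so the deterministic ordering used here is an assumption.

```python
from collections import defaultdict


def divide_resource_pools(deployments):
    """Group servers whose deployed module sets are identical into one pool.

    `deployments` maps a server name to the collection of module names on it.
    Returns a dict: pool ID (0..K-1) -> sorted list of server names.
    """
    by_modules = defaultdict(list)
    for server, modules in deployments.items():
        by_modules[frozenset(modules)].append(server)
    # Assumption: pool IDs are handed out in a deterministic order within [0, K).
    ordered = sorted(by_modules.items(), key=lambda item: sorted(item[0]))
    return {pool_id: sorted(servers) for pool_id, (_, servers) in enumerate(ordered)}


deployments = {
    "srv-a": ["m0", "m1"],
    "srv-b": ["m2", "m3"],
    "srv-c": ["m1", "m0"],
    "srv-d": ["m2", "m3"],
}
pools = divide_resource_pools(deployments)
print(pools)  # {0: ['srv-a', 'srv-c'], 1: ['srv-b', 'srv-d']}
```

Here K = 2 and each pool holds n = 2 servers, so two tasks for the same module set can run at the same time.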

Task routing algorithm: routing is initiated by the task distribution service unit, and the routing algorithm routes each task to a specific identification server for execution.

The routing algorithm is divided into two processes. Process 1 finds a specific resource pool through the bitmap (BitMap) mapping algorithm; process 2, building on process 1, finds a specific identification server within that resource pool through the identification server selection algorithm to execute the identification task.

The ID used by the task distribution service to schedule a task includes two parts: a bitmap ID and an identification server selection ID. The bitmap ID is the coordinate value (x, y) of the identification module after bitmap encoding and is used to select a specific resource pool; the identification server selection ID is a discrete random value (generated by the task distribution service unit) used to select a specific identification server within the resource pool.

(1) Bitmap mapping algorithm:

Bitmap encoding is performed on the identification modules: the T identification modules (T ∈ [0, +∞)) are arranged in sequence into an I × J matrix (i.e. the bitmap), where I is the number of identification modules contained in a single identification server, J is the number of identification servers needed to contain all the identification modules, and each identification module corresponds to one coordinate point (x, y) of the matrix, as shown in Fig. 3.

According to the physical deployment, the identification servers in one resource pool deploy the same identification modules, so the coordinate points in the bitmap and the resource pools are in a many-to-one relationship, and the value of each coordinate point is set to the ID of its resource pool. For efficiency, this structure can be implemented with a Map data structure; the mapping relationship is shown in Fig. 4.
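One way to picture the Map-based bitmap is the Python sketch below. It assumes that module number t sits at coordinate (t % I, t // I) and that all modules in column y are hosted by resource pool y; neither the coordinate convention nor the pool numbering is fixed by the description, and the function name `build_bitmap` is hypothetical.

```python
def build_bitmap(num_modules, modules_per_server):
    """Map each module's (x, y) coordinate to the ID of the pool hosting it."""
    bitmap = {}
    for t in range(num_modules):
        x = t % modules_per_server   # position inside one server (0..I-1)
        y = t // modules_per_server  # which server slot holds it (0..J-1)
        bitmap[(x, y)] = y           # assumption: slot y is served by pool y
    return bitmap


bitmap = build_bitmap(num_modules=6, modules_per_server=2)
print(bitmap[(1, 2)])  # 2
```

Several coordinate points share one pool ID, which is the many-to-one relationship described above, and a dictionary lookup finds the pool in constant time on average.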

(2) Identification server selection algorithm:

Once bitmap mapping is complete, that is, once a specific resource pool has been determined, a calculation on the identification server selection ID selects a specific identification server to execute the identification task.

The identification server lookup radix = 65535 is first defined for the following calculation. The identification server selection algorithm comprises the following steps:

step 2.1, calculating an identification server value in the resource pool:

identification server value = identification server physical address (network card physical address) % identification server lookup radix. There are N identification servers in the resource pool (N ∈ (0, +∞), determined by the traffic scale), and N identification server values are calculated with this formula and stored in an identification server value list in ascending order. Since each value is taken modulo 65535, the values in the list lie in the range [0, 65535) and logically form a ring, as shown in Figs. 5 and 6.

Step 2.2, calculating task values:

task value = identification server selection ID % identification server lookup radix.

Step 2.3, determining a selection rule:

When a certain task is to be executed, the identification server value list is traversed in ascending order with the task value calculated in step 2.2 (i.e. the logical ring is searched clockwise), the first identification server value greater than or equal to the task value is found, and the identification server corresponding to that value executes the task, as shown in Fig. 7.
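Steps 2.1 to 2.3 can be sketched in Python as follows. The function names and the example MAC addresses are hypothetical. The description says the values form a ring but does not spell out what happens when the task value exceeds every server value; wrapping around to the smallest value is an assumption of this sketch.

```python
from bisect import bisect_left

LOOKUP_RADIX = 65535


def server_value(mac_address, radix=LOOKUP_RADIX):
    """Step 2.1: identification server value = NIC physical address % radix."""
    return int(mac_address.replace(":", ""), 16) % radix


def select_server(selection_id, sorted_values, radix=LOOKUP_RADIX):
    """Steps 2.2 and 2.3: first server value >= task value, searching clockwise."""
    task_value = selection_id % radix
    index = bisect_left(sorted_values, task_value)
    # Assumption: past the largest value, the ring wraps to the smallest one.
    return sorted_values[index % len(sorted_values)]


# Hypothetical MAC addresses chosen so the values match the figures' 500 and 32768.
values = sorted(server_value(mac) for mac in ["00:00:00:00:80:00", "00:00:00:00:01:f4"])
print(values)                        # [500, 32768]
print(select_server(460, values))    # 500
print(select_server(501, values))    # 32768
print(select_server(40000, values))  # 500 (wrapped around the ring)
```

Because the value list is kept in ascending order, a binary search (`bisect_left`) gives the same answer as the ascending traversal described in step 2.3.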

The above processing gives the invention the following characteristics:

(A) Adding or removing identification servers in the resource pool does not affect task execution: under the rule of step 2.3 of the identification server selection algorithm, increasing or decreasing the number of identification servers in the resource pool does not affect the execution of tasks. The advantage is that neither the failure of an identification server in the resource pool nor an expansion of the number of identification servers in the resource pool affects the service.

(A1) An identification server in the resource pool is removed

If the identification server clockwise adjacent to the task is removed, the algorithm selects the next identification server to execute the task: when the identification server with value 500 in Fig. 8 is removed, the algorithm selects the identification server with value 32768 in Fig. 8. If a node not adjacent to the task value is removed, the task is unaffected: when the identification server with value 32768 in Fig. 9 is removed, the algorithm still selects the identification server with value 500 in Fig. 9.

(A2) An identification server is added to the resource pool

If an identification server adjacent to the task value is added, the algorithm directly selects the newly added identification server to execute the task: when an identification server with value 460 is added in Fig. 10, the algorithm selects that identification server. If an identification server not adjacent to the task value is added, the algorithm still selects the original identification server and the task is unaffected: when an identification server with value 1000 is added to the resource pool in Fig. 11, the original task is not affected and the algorithm still selects the identification server with value 500.
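The four cases of Figs. 8 to 11 can be replayed with a short Python sketch. The task value 450 is a hypothetical choice (the figures' actual task value is not given in the text), and the wrap-around at the end of the ring is an assumption.

```python
from bisect import bisect_left

RADIX = 65535


def select_server(task_value, sorted_values):
    """First server value >= task value; wraps to the smallest (assumption)."""
    index = bisect_left(sorted_values, task_value % RADIX)
    return sorted_values[index % len(sorted_values)]


task_value = 450  # hypothetical task value lying just before server 500
pool = [500, 32768]

original = select_server(task_value, pool)                          # 500
adjacent_removed = select_server(task_value, [32768])               # 32768 (Fig. 8)
other_removed = select_server(task_value, [500])                    # 500 (Fig. 9)
adjacent_added = select_server(task_value, sorted(pool + [460]))    # 460 (Fig. 10)
other_added = select_server(task_value, sorted(pool + [1000]))      # 500 (Fig. 11)

print(original, adjacent_removed, other_removed, adjacent_added, other_added)
```

Only tasks whose clockwise neighbour changes are rerouted; every other task keeps its original server, which is why scaling the pool up or down does not disturb the rest of the service.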

(B) The identification server selection algorithm in the resource pool has the load balancing characteristic:

(B1) Since the identification server selection ID is a random value, the calculated task value is also random and relatively discrete.

(B2) The physical address of the identification server is also relatively random, so the calculated identification server value is also relatively random.

In conclusion, the probability that different task values select the same identification server is low; that is, different tasks tend to select different identification servers to execute, which achieves load balancing.
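Under two assumptions that the description does not state explicitly (task values are uniformly distributed over [0, 65535), and the ring wraps from the largest server value back to the smallest), the share of tasks each server receives equals the length of the arc that ends at its value. The hypothetical helper `arc_shares` below computes these shares for distinct server values.

```python
def arc_shares(sorted_values, radix=65535):
    """Fraction of all possible task values routed to each server value."""
    shares = {}
    # Predecessor of the first value, seen across the wrap-around point.
    previous = sorted_values[-1] - radix
    for value in sorted_values:
        shares[value] = (value - previous) / radix
        previous = value
    return shares


shares = arc_shares([500, 32768])
print(round(shares[500] * 65535), round(shares[32768] * 65535))  # 33267 32268
```

For the two example servers the split is close to even (33267 versus 32268 of the 65535 possible task values). In general the balance depends on how evenly the server values happen to be spread around the ring.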

(C) Precise flow control is realized through the algorithm. When the monitoring service finds that some servers are excessively busy or idle, the task distribution service unit can weight the task ID to control routing precisely: tasks are routed away from busy identification servers and can be placed precisely on idle identification servers.

(C1) When a certain identification server is too busy: the value of that identification server can be looked up in the identification server value list, and the identification server selection IDs of tasks that would be routed to it can be weighted in the task distribution service unit so that the calculated task values are not adjacent to that identification server value. For example, if the identification server with value 500 is busy and the identification server with value 32768 is relatively idle, the identification server selection ID can be weighted so that the calculated task value is greater than 500, as illustrated in Fig. 12.

(C2) When a server is too idle: the principle is the reverse of (C1) and is not described in detail here.

In conclusion, the flow to a specific server in the whole resource pool can be controlled through simple weighting; the algorithm is simple and effective, and plays a vital role in rate limiting and service governance across the whole cluster.
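The description does not define the weighting function itself. The Python sketch below uses one hypothetical weighting, chosen only to reproduce the example of case (C1): if a task would land on the busy server, its selection ID is moved just past the busy server's value so that the clockwise search reaches the next server. The names `steer_away` and `select_server` are illustrative.

```python
from bisect import bisect_left

RADIX = 65535


def select_server(selection_id, sorted_values):
    """First server value >= task value; wraps to the smallest (assumption)."""
    index = bisect_left(sorted_values, selection_id % RADIX)
    return sorted_values[index % len(sorted_values)]


def steer_away(selection_id, busy_value, sorted_values):
    """Return a selection ID that no longer routes to the busy server."""
    if select_server(selection_id, sorted_values) != busy_value:
        return selection_id  # already routed elsewhere, leave unchanged
    # Hypothetical weighting: move the task value just past the busy server,
    # so the clockwise search lands on the next server on the ring.
    return (busy_value + 1) % RADIX


values = [500, 32768]
new_id = steer_away(460, busy_value=500, sorted_values=values)
print(new_id, select_server(new_id, values))  # 501 32768
```

This simple rule sends every diverted task to the next server clockwise, and it cannot divert anything when the pool holds a single server. A production weighting would presumably spread diverted tasks more evenly; that choice is outside what the description specifies.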

The failed retry procedure is as follows:

The result recovery service unit recovers the identification results of all tasks and merges them. After all fault detection tasks for the whole train are completed, the final detection result is merged locally and sent to the human-computer interaction interface for manual checking. If some task identifications fail during identification of the train, or an identification conclusion is not returned for a long time, the final identification result would fail. Therefore, when the identification duration of the tasks to be detected for the whole train reaches half of the maximum duration limit, the system assumes that any identification task that has not returned a result is abnormal or has timed out. The recovery service unit automatically informs the task distribution service unit that the fine-grained tasks that have not returned identification results need to be re-delivered; all re-delivered tasks are processed again by the cluster system, and tasks that encountered errors can be processed correctly under this retry strategy, which improves the reliability of the system. This strategy exploits the capability of the cluster as far as possible and completes identification of the whole train to the greatest extent, while the fine-grained tasks consume fewer cluster computing resources.
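The bookkeeping performed by the result recovery service unit might look like the following Python sketch. The class name `ResultCollector`, its methods and the task IDs are hypothetical; only the half-of-maximum-duration rule and the retry of unreturned fine-grained tasks come from the description.

```python
class ResultCollector:
    """Tracks which fine-grained tasks have returned a result."""

    def __init__(self, task_ids):
        self.pending = set(task_ids)
        self.results = {}

    def record(self, task_id, result):
        # Ignore unknown IDs and duplicate answers (e.g. from a retried task).
        if task_id in self.pending:
            self.pending.discard(task_id)
            self.results[task_id] = result

    def tasks_to_retry(self, elapsed, max_duration):
        """At half of the maximum duration, report every unreturned task."""
        if elapsed >= max_duration / 2:
            return sorted(self.pending)
        return []

    def merged(self):
        """Merged results once every task has returned, otherwise None."""
        if self.pending:
            return None
        return [self.results[task_id] for task_id in sorted(self.results)]


collector = ResultCollector([(0, "bolt"), (0, "hose"), (1, "bolt")])
collector.record((0, "bolt"), "ok")
collector.record((1, "bolt"), "fault found")
print(collector.tasks_to_retry(elapsed=100, max_duration=300))  # []
print(collector.tasks_to_retry(elapsed=150, max_duration=300))  # [(0, 'hose')]
```

Only the single missing task is reported for re-delivery, so the retry costs one fine-grained task rather than a full re-detection of the train.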

In summary, the invention provides a fine-grained identification task splitting method that splits what was originally a single train-passing detection task into a large number of fine-grained detection tasks. A train is first divided by convention into several identification groups of 10 cars each, and each identification task is responsible for identifying one specific fault type in one identification group. As a result, a single train-passing detection task can be split into M × N tasks, where M is the number of identification groups and N is the number of fault types to be detected. After splitting the train-passing detection task, the task initiator delivers the fine-grained tasks to the fault detection nodes in the cluster according to how the fault identification modules are deployed in the distributed cluster, and waits to receive the detection results. If anything unexpected during identification causes an identification to fail, the system automatically initiates a retry task that retries only the failed fine-grained tasks, without re-detecting all the train-passing pictures. This reduces the retry workload, speeds up the retry, allows error retries to complete within the time allowed for train-passing detection, and improves the reliability of the system.

The second embodiment is as follows: this embodiment is described with reference to Fig. 13.

the embodiment is a train fault automatic identification method based on fine-grained task disassembly and failure retry, which is realized by a train fault automatic identification cluster system based on fine-grained task disassembly, and specifically comprises the following steps:

1. first grouping by the number of train cars, with every 10 cars forming one identification group;

2. creating identification tasks, each identification group together with one fault type forming one identification task;

3. traversing the identification task list and delivering each identification task to an identification service node according to how the identification modules are deployed in the cluster of the train fault automatic identification cluster system based on fine-grained task disassembly;

4. waiting to receive the returned identification conclusions;

5. if all identification results have been returned, merging the identification results and ending the detection task;

6. when half of the maximum identification duration limit has elapsed, traversing the identification task list and re-delivering to the task queue every unfinished task whose identification result has not been correctly returned; if all identification tasks are completed, merging the identification conclusions and ending the task;

7. if the identification duration limit is reached and not all tasks have been identified, returning a failure message.
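Steps 4 to 7 amount to a small decision rule evaluated while waiting for results. The Python sketch below expresses it as a pure function; the name `next_action`, the action labels and the `retried` flag are hypothetical, and retrying only once at the halfway point is an assumption, since the steps do not say whether re-delivery may repeat.

```python
def next_action(elapsed, max_duration, pending_count, retried):
    """Decide what the task initiator does next (steps 4 to 7)."""
    if pending_count == 0:
        return "merge"   # steps 5 and 6: all results returned, merge and end
    if elapsed >= max_duration:
        return "fail"    # step 7: time limit reached with tasks outstanding
    if elapsed >= max_duration / 2 and not retried:
        return "retry"   # step 6: re-deliver unfinished tasks at the halfway point
    return "wait"        # step 4: keep waiting for identification conclusions


print(next_action(100, 300, 4, False))  # wait
print(next_action(150, 300, 4, False))  # retry
print(next_action(200, 300, 0, True))   # merge
print(next_action(300, 300, 1, True))   # fail
```

Checking for completion first means a train whose last result arrives exactly at the deadline is still merged rather than reported as failed; the description leaves that boundary case open.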

The above examples of the present invention only describe the calculation model and calculation flow of the invention in detail and do not limit its embodiments. Other variations and modifications based on the above description will be apparent to those of ordinary skill in the art; not all embodiments can be listed exhaustively here, and all such variations and modifications fall within the scope of the invention.

Claims

1. A train fault automatic identification cluster system based on fine-grained task disassembly, characterized in that the system adopts clustered deployment of a plurality of identification servers; an identification service unit is deployed on each identification server, a task distribution service unit and a result collection service unit are deployed on only several of the identification servers, and the identification servers on which the task distribution service unit and the result collection service unit are deployed are denoted specific identification servers; the identification service unit of each identification server comprises a plurality of identification modules, and each identification module identifies one fault type;

the resource pool division process is as follows:

the process of routing to a particular identification server is as follows:

firstly, searching a corresponding resource pool through a bitmap mapping algorithm, and then searching a specific identification server in the resource pool to execute an identification task;

2. The train fault automatic identification cluster system based on fine-grained task disassembly according to claim 1, wherein the task distribution service unit disassembles the task to be detected according to the number of train cars and the number of fault identification modules.

3. The train fault automatic identification cluster system based on fine-grained task disassembly according to claim 1 or 2, wherein the bitmap construction process in the bitmap mapping algorithm is as follows:

4. The train fault automatic identification cluster system based on fine-grained task disassembly according to claim 3, wherein the structure of the bitmap is implemented through a Map data structure.

5. The train fault automatic identification cluster system based on fine-grained task disassembly according to claim 4, wherein the process of finding a specific identification server within the resource pool comprises the following steps:

s2.1, calculating the identification server values in the resource pool:

identification server value = identification server physical address % identification server lookup radix;

s2.2, calculating task values:

s2.3, selecting according to a selection rule:

when a certain task is to be executed, the identification server value list is traversed in ascending order with the task value calculated in step s2.2, the first identification server value greater than or equal to the task value is found, and the identification server corresponding to that value executes the task.

6. The train fault automatic identification cluster system based on fine-grained task disassembly according to claim 5, wherein the identification server physical address used to calculate the identification server value is the network card physical address.

7. The train fault automatic identification cluster system based on fine-grained task disassembly according to claim 6, wherein the identification server lookup radix = 65535.

8. The train fault automatic identification cluster system based on fine-grained task disassembly according to claim 6, wherein the task distribution service unit monitors the state of the identification servers when performing task delivery; when a certain identification server is busy, its identification server value is looked up in the identification server value list, and the identification server selection IDs of tasks that would be routed to that identification server are weighted in the task distribution service unit, so that the calculated task values are not adjacent to that identification server value.

9. The train fault automatic identification cluster system based on fine-grained task disassembly according to claim 8, wherein, in the process of monitoring the state of the identification servers, when a server is idle, the reverse of the processing used for a busy identification server is applied.

10. The train fault automatic identification method based on fine-granularity task disassembly and failure retry is characterized by comprising the following steps of:

s1, using the train fault automatic identification cluster system based on fine-grained task disassembly according to any one of claims 1 to 9, first grouping by the number of train cars, with every 10 cars forming one identification group;

s4, waiting for receiving a returned identification conclusion;