CN113361326B - Intelligent power plant management and control system based on computer vision target detection - Google Patents


Info

Publication number
CN113361326B
Authority
CN
China
Prior art keywords
image
module
face
frame
anchor frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110480594.0A
Other languages
Chinese (zh)
Other versions
CN113361326A (en)
Inventor
曾晨
沈迎迎
陈石明
李学钧
戴相龙
蒋勇
刘浩
瞿丹波
万人韬
曹龙辉
范炜
武申
王天烁
赵思婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Haohan Information Technology Co ltd
Zhejiang Ninghai Power Generation Co ltd
Original Assignee
Jiangsu Haohan Information Technology Co ltd
Zhejiang Ninghai Power Generation Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Haohan Information Technology Co ltd, Zhejiang Ninghai Power Generation Co ltd filed Critical Jiangsu Haohan Information Technology Co ltd
Priority to CN202110480594.0A priority Critical patent/CN113361326B/en
Publication of CN113361326A publication Critical patent/CN113361326A/en
Application granted granted Critical
Publication of CN113361326B publication Critical patent/CN113361326B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Water Supply & Treatment (AREA)
  • Public Health (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an intelligent power plant management and control system based on computer vision target detection, which comprises: an image acquisition module for capturing images of various scenes in the intelligent power plant and acquiring the image frames and video information corresponding to those scenes; a Retinex image enhancement module for enhancing the required image information in the image frames and weakening or eliminating the unneeded image information; a deep learning target detection module for performing target detection on the enhanced images by combining single-stage and two-stage methods; a deep learning face recognition module for recognizing faces in the enhanced images through a face recognition algorithm based on a deep convolutional neural network; a person fall detection module for estimating human pose and judging whether a person has fallen from the positions corresponding to the detected final joint locations; and an image retrieval module for obtaining image retrieval results through a hash function.

Description

Intelligent power plant management and control system based on computer vision target detection
Technical Field
The invention provides an intelligent power plant management and control system based on computer vision target detection, and belongs to the technical field of intelligent power plants.
Background
With the development of the power industry, improving the intelligence of power plants has become increasingly important. In the related art, a power plant management and control system consists of a device-terminal-layer data acquisition module and a control module: the data acquisition module collects device data, the collected data are analyzed manually, and control instructions are triggered manually in the control module according to the analysis results to control the equipment in the intelligent power plant. In existing power plant management and control systems, the plant's operating conditions are still managed by watching video feeds manually. Because plant equipment must be monitored by hand, the degree of system intelligence is reduced; at the same time, management gaps caused by observations missed by control personnel, and monitoring oversights, occur easily.
Disclosure of Invention
The invention provides an intelligent power plant management and control system based on computer vision target detection, to solve the problems of management gaps and monitoring oversights caused by missed observations by control personnel in existing intelligent power plant management and control systems. The following technical scheme is adopted:
An intelligent power plant management and control system based on computer vision target detection, the system comprising:
the image acquisition module, used for capturing images of various scenes in the intelligent power plant and acquiring the image frames and video information corresponding to those scenes;
the Retinex image enhancement module, used for enhancing the required image information in the image frames with the SSR algorithm and weakening or eliminating the unneeded image information to obtain enhanced images;
the deep learning target detection module, used for performing target detection on the enhanced images with an RPN (Region Proposal Network), the FPN (Feature Pyramid Network) algorithm and the SSD (Single Shot MultiBox Detector) algorithm, combining single-stage and two-stage detection;
the deep learning face recognition module, used for extracting the face features to be recognized from the enhanced images with a ResNet-based face recognition algorithm built on a deep convolutional neural network, comparing them with the face features in a sample library using cosine distance as the metric function, and finding the most similar face as the recognition result;
the person fall detection module, used for learning the prior distribution of the joints relative to the upper half of the human body from a training image set and performing human pose estimation on the enhanced images; the upper body is then detected with an upper-body detector, the joint distribution areas are determined from the joint priors, the final positioning probabilities of the human joints are calculated, and the final joint locations are determined; whether a person has fallen is judged from the positions corresponding to the detected final joint locations;
and the image retrieval module, used for mapping each enhanced image to a binary code through a hash function, comparing Hamming distances against the binary codes of all pictures in the database, and obtaining the image retrieval results in ascending order of Hamming distance.
Further, the Retinex image enhancement module comprises:
the conversion module is used for reading the original image S(x, y) and judging whether it is a grayscale image; if the original image is grayscale, the gray value of each pixel is converted from integer (int) to floating point (float) and transformed to the logarithmic domain; if the original image is a color image, it is processed channel by channel, and each component pixel value is converted from integer (int) to floating point (float) and transformed to the logarithmic domain;
the scale acquisition module is used for inputting a Gaussian surround scale, converting the discretized integral operation into a summation, and obtaining the scale λ from the input Gaussian surround scale and the normalization condition the scale must satisfy;
the final image acquisition module is used for obtaining the final image through the image model of the reflected light entering the human eye; if the original image is grayscale, only one final image is generated; if the original image is a color image, each channel forms a corresponding final image;
the output image acquisition module is used for converting the final image from the logarithmic domain to the real domain to obtain the output image;
and the display module is used for linearly stretching the output image and converting it to the corresponding format for output and display.
Further, the image processing process of the Retinex image enhancement module includes:
step 1, reading the original image S(x, y) and judging whether it is a grayscale image; if the original image is grayscale, the gray value of each pixel is converted from integer (int) to floating point (float) and transformed to the logarithmic domain; if the original image is a color image, it is processed channel by channel, and each component pixel value is converted from integer (int) to floating point (float) and transformed to the logarithmic domain;
step 2, inputting the Gaussian surround scale c, converting the discretized integral operation into a summation, and determining the value of λ through the following formulas:
F(x, y) = λ·e^(−(x² + y²)/c²)
∬F(x, y) dx dy = 1
wherein F(x, y) represents the center-surround function; x and y represent the coordinates of each image point on the x and y axes; c represents the Gaussian surround scale; and λ represents a scale whose value satisfies the normalization condition ∬F(x, y) dx dy = 1;
step 3, obtaining the final image through the image model of the reflected light entering the human eye; if the original image is grayscale, only one final image is generated; if the original image is a color image, each channel forms a corresponding final image, wherein the image model of the reflected light entering the human eye is:
r(x, y) = Σ_{k=1}^{K} w_k { log S(x, y) − log[F_k(x, y) * S(x, y)] }
wherein r(x, y) represents the final image and S(x, y) the original image; K is the number of Gaussian center-surround functions, with k taking the values 1, 2, 3; and the weights satisfy:
w_1 + w_2 + w_3 = 1, with w_k = 1/3;
step 4, converting r(x, y) from the logarithmic domain to the real domain to obtain the output image R(x, y);
and step 5, linearly stretching the output image R(x, y), converting it to the corresponding format, and outputting it for display.
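To make the step sequence concrete, the following is a minimal sketch of steps 1 to 5 for one channel, assuming NumPy and OpenCV are available; the surround scale value, the per-channel handling and the final stretch to the 8-bit range are illustrative choices rather than text from the patent:

```python
# A minimal single-scale Retinex (SSR) sketch following steps 1-5 above.
import cv2
import numpy as np

def ssr(image: np.ndarray, c: float = 80.0) -> np.ndarray:
    """Single-scale Retinex on one channel or a full color image."""
    img = image.astype(np.float64) + 1.0          # int -> float, avoid log(0)
    log_s = np.log(img)                           # original image in the log domain
    # Gaussian surround F(x, y); with ksize=(0, 0) OpenCV derives the kernel
    # from sigma, and the kernel is normalized so that it sums to 1
    blurred = cv2.GaussianBlur(img, (0, 0), c)
    log_l = np.log(blurred)                       # estimated low-frequency illumination
    r = log_s - log_l                             # reflectance in the log domain
    # Step 5: linear stretch back to a displayable 8-bit range
    r = (r - r.min()) / (r.max() - r.min() + 1e-12) * 255.0
    return r.astype(np.uint8)

# Color images are handled per channel, as described in step 1:
# enhanced = cv2.merge([ssr(ch) for ch in cv2.split(bgr_image)])
```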
Further, the deep learning target detection module comprises:
a candidate region module for generating candidate regions for the enhanced image, the candidate regions containing the position information of the target;
the frame information acquisition module, used for obtaining first-level frame information through an RPN (Region Proposal Network) and further regressing it through a regression branch to obtain the second-level frame information corresponding to the first-level frame information; the first-level frame information is the coarse frame information, and the second-level frame information is the more accurate frame information corresponding to it;
and the detection module, used for performing target detection on the enhanced image through the detection network of the SSD framework combined with the second-level frame information.
Further, the detection network of the SSD framework comprises an anchor frame refinement module, a target detection module and a transmission connection structure;
the anchor frame refinement module is used for setting the scale and aspect-ratio parameters of the anchor frames, then matching the anchor frames with the real target frames and marking the anchor frame with the highest matching degree with a real target frame as a positive sample; the unmatched anchor frames are then matched against the real target frames, and if the IOU value is greater than a preset threshold th, the anchor frame is also taken as a positive sample, where the threshold th is 0.5;
the transmission connection structure is used for fusing the high-level features with the low-level output features of the anchor frame refinement module through element-wise operations on the corresponding convolution layers, converting them into output features that serve as the input features of the target detection module;
and the target detection module is used for regressing more accurate object positions, taking as input the output features of the transmission connection structure together with the refined anchor frames generated by the anchor frame refinement module.
Further, the process of setting the scale and aspect-ratio parameters of the anchor frames comprises:
first, obtaining the scale of the anchor frames through the following formula:
s_k = s_min + ((s_max − s_min)/(m − 1))·(k − 1), k ∈ [1, m]
wherein s_k represents the scale of the k-th anchor frame; s_min represents the anchor frame scale of the bottom layer, generally taking the value 0.2; s_max represents the anchor frame scale of the highest layer, generally taking the value 0.9; and m represents the total number of anchor frame scales;
second, obtaining the width and height of the anchor frames through the following formulas:
w_k^a = s_k·√(a_r)
h_k^a = s_k/√(a_r)
wherein w_k^a represents the width of the anchor frame; h_k^a represents its height; and a_r represents the aspect ratio of the anchor frame.
Further, the loss function of the detection network of the SSD framework combines the losses of the anchor frame refinement module and the target detection module, and takes the form:
L({p_i}, {x_i}, {c_i}, {t_i}) = (1/N_ARM)·Σ_i [ L_b(p_i, [l_i* ≥ 1]) + [l_i* ≥ 1]·L_r(x_i, g_i*) ] + (1/N_ODM)·Σ_i [ L_m(c_i, l_i*) + [l_i* ≥ 1]·L_r(t_i, g_i*) ]
wherein N_ARM represents the number of foreground objects, i.e. positive samples, in the anchor frame refinement module, and N_ODM represents the number of positive samples in the target detection module; p_i denotes the confidence; x_i the coordinates of the predicted foreground object after refinement by the anchor frame refinement module; c_i the object class predicted by the target detection module; t_i the object coordinates predicted by the target detection module; l_i* the true class label of the object; and g_i* the true position and size of the target; L_b represents the binary classification loss function in the anchor frame refinement module; L_r(x_i, g_i*) the regression loss function in the anchor frame refinement module; L_m the multi-class classification loss in the target detection module; and L_r(t_i, g_i*) the regression loss function in the target detection module.
Further, the model training process of the detection network of the SSD framework includes:
step A1, organizing 5000 intrusion sample images as the data set, of which 4000 intrusion sample images are used as training samples and 1000 as test samples;
step A2, annotating each intrusion sample image in the data set, marking the bounding box in which the object in the image is located, the bounding box consisting of the four floating point numbers [y_min, x_min, y_max, x_max];
step A3, introducing first-order momentum to update the algorithm parameters, which are updated as follows (a code sketch follows these steps):
v_dW = β·v_dW + (1 − β)·dW
v_db = β·v_db + (1 − β)·db
W = W − α·v_dW, b = b − α·v_db
wherein v_dW and v_db respectively represent the exponentially weighted average parameters; W and b represent the weights and biases being trained; α represents the learning rate; and β represents the exponential weighting factor, with β = 0.9;
and step A4, iterating at most 50 times per training run until the optimal model is found, and outputting a training log.
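A minimal sketch of the step-A3 momentum update, assuming the parameters and gradients are NumPy arrays or plain floats; the learning rate value is an illustrative assumption, while β = 0.9 follows the text above:

```python
# One momentum step: W, b are a layer's weights and biases, dW, db their
# current gradients, v_dW, v_db the running exponentially weighted averages.
def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.001, beta=0.9):
    v_dW = beta * v_dW + (1 - beta) * dW   # exponentially weighted average of dW
    v_db = beta * v_db + (1 - beta) * db   # exponentially weighted average of db
    W = W - alpha * v_dW                   # parameter update with learning rate alpha
    b = b - alpha * v_db
    return W, b, v_dW, v_db
```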
Further, the deep learning face recognition module comprises a video reading module, a face detection module, a face calibration module and a face recognition module:
the video reading module is used for reading a video stream in video information and extracting a plurality of key frames corresponding to the video aiming at the video stream;
the face detection module is used for carrying out face identification on the video image corresponding to the key frame and judging whether a face exists in the video image corresponding to the key frame; if the video image corresponding to the key frame has a face, starting a face calibration module;
the face calibration module is used for mapping a face into a front face by a 68-point face calibration method, judging whether face correction is successful or not, starting the face recognition module if the face correction is successful, and extracting key frames corresponding to the video image again if the face correction is unsuccessful;
and the face recognition module is used for extracting the face features to be recognized, comparing them with the face features in the sample library using cosine distance as the metric function, finding the most similar face as the recognition result, and storing the recognition result, as sketched after this list.
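A minimal sketch of the comparison step, assuming the ResNet feature extraction has already produced embedding vectors; the helper name and the library layout (a name-to-feature dictionary) are assumptions for illustration:

```python
# Compare an extracted face embedding against a sample library by cosine
# distance and return the most similar identity and its distance.
import numpy as np

def match_face(query: np.ndarray, library: dict[str, np.ndarray]) -> tuple[str, float]:
    best_name, best_dist = "", float("inf")
    for name, feat in library.items():
        cos_sim = np.dot(query, feat) / (np.linalg.norm(query) * np.linalg.norm(feat))
        dist = 1.0 - cos_sim                 # cosine distance as the metric function
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist
```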
Further, the person fall detection module comprises:
the prior distribution module, used for learning the prior distribution of the joints relative to the upper half of the human body from the training image set;
the estimation module, used for detecting the upper body with the upper-body detector when performing human pose estimation on the image to be processed, and determining the distribution area of each joint from the joint priors, so that the search space of a joint is reduced from the whole image to a smaller image region;
the positioning probability acquisition module, used for performing a convolution of the joint appearance model over the joint distribution area to obtain the positioning probability of the joint within that area;
the joint positioning module, used for calculating the final positioning probability of each joint with the human body model and determining the final joint locations, where the positioning probability is:
p_i(x) = Σ_u p_{i|u}(x − u)·p_u(u)
wherein p_i(x) represents the positioning probability of joint i at pixel x; p_{i|u} is the conditional distribution of joint i given that joint u is located at the image position u, and this distribution can be learned from the relative positions of the two joints in the training images; p_u(x) represents the positioning probability of joint u at pixel x. The location probabilities of the elbow and wrist joints can be solved in the same manner (see the sketch after this list).
And the judging module, used for judging whether a person has fallen according to the positions corresponding to the detected final joint locations.
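A hedged sketch of the positioning-probability computation, assuming the learned prior p_{i|u} and the probability map p_u of the reference joint are given as 2-D arrays and that their combination is the 2-D convolution described above; SciPy's fftconvolve and the final fall rule in the comment are illustrative assumptions:

```python
# Convolve the learned prior with the reference-joint probability map to
# obtain the positioning probability p_i over the whole image region.
import numpy as np
from scipy.signal import fftconvolve

def joint_probability(p_u: np.ndarray, prior_i_given_u: np.ndarray) -> np.ndarray:
    """p_i(x) = sum_u p_{i|u}(x - u) * p_u(u), i.e. a 2-D convolution."""
    p_i = fftconvolve(p_u, prior_i_given_u, mode="same")
    return np.clip(p_i, 0.0, None) / (p_i.sum() + 1e-12)   # renormalize

# A fall could then be flagged from the located joints, e.g. when the head
# joint drops to roughly the same image height as the hip joints (an assumed
# rule; the patent only states that the final joint positions are judged).
```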
The invention has the beneficial effects that:
the intelligent power plant management and control system based on the computer vision target detection provided by the invention can be used for automatically extracting the image characteristics of the target equipment, personnel, scenes and other elements in the monitored area and automatically detecting abnormal conditions, so that the working efficiency of the workers is effectively improved, the monitoring quality is effectively improved, the monitoring personnel do not need to stare at a video picture for a long time to carry out manual monitoring, and the problem of monitoring negligence caused by reduced attention due to longer manual monitoring time can be effectively reduced by carrying out automatic monitoring through the intelligent power plant management and control system based on the computer vision target detection provided by the invention. The monitoring degree and the monitoring accuracy of the operation of the power plant are effectively improved, and the monitoring error rate and the monitoring negligence rate are greatly reduced.
Drawings
FIG. 1 is a system block diagram of the system of the present invention;
FIG. 2 is a schematic diagram of a network structure of a deep learning target detection model according to the present invention;
FIG. 3 is a schematic structural diagram of the resnet module according to the present invention;
FIG. 4 is a schematic structural diagram of a transmission connection structure according to the present invention;
FIG. 5 is a schematic flow chart of the face recognition of the present invention;
fig. 6 is a schematic diagram of the network structure corresponding to the person fall detection module according to the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings; it should be understood that they are presented here only for illustration and explanation, not limitation.
The embodiment of the invention provides an intelligent power plant management and control system based on computer vision target detection, as shown in fig. 1, the system comprises:
the image acquisition module is used for capturing images of various scenes in the intelligent power plant and acquiring the image frames and video information corresponding to those scenes;
the Retinex image enhancement module is used for enhancing the required image information in the image frames with the SSR algorithm and weakening or eliminating the unneeded image information to obtain enhanced images;
the deep learning target detection module is used for performing target detection on the enhanced images with an RPN (Region Proposal Network), the FPN (Feature Pyramid Network) algorithm and the SSD (Single Shot MultiBox Detector) algorithm, combining single-stage and two-stage detection; the targets in the images can be equipment, personnel and designated scenes, and detecting these targets and coupling the detection with subsequent target recognition effectively realizes real-time monitoring of equipment, personnel and designated scenes (such as fire, water leakage and other operating environments);
the deep learning face recognition module is used for extracting the face features to be recognized from the enhanced images with a ResNet-based face recognition algorithm built on a deep convolutional neural network, comparing them with the face features in a sample library using cosine distance as the metric function, and finding the most similar face as the recognition result;
the person fall detection module is used for learning the prior distribution of the joints relative to the upper half of the human body from a training image set and performing human pose estimation on the enhanced images; the upper body is then detected with an upper-body detector, the joint distribution areas are determined from the joint priors, the final positioning probabilities of the human joints are calculated, and the final joint locations are determined; whether a person has fallen is judged from the positions corresponding to the detected final joint locations;
and the image retrieval module is used for mapping each enhanced image to a binary code through a hash function, comparing Hamming distances against the binary codes of all pictures in the database, and obtaining the image retrieval results in ascending order of Hamming distance.
The working principle of the technical scheme is as follows: the intelligent power plant management and control system based on computer vision target detection first uses the image acquisition module to capture images of various scenes in the intelligent power plant and acquire the corresponding image frames and video information; the Retinex image enhancement module then enhances the required image information in the frames with the SSR algorithm and weakens or eliminates the unneeded information to obtain enhanced images; the deep learning target detection module performs target detection on the enhanced images with the RPN, the FPN algorithm and the SSD algorithm, combining single-stage and two-stage detection; the deep learning face recognition module then extracts the face features to be recognized from the enhanced images in a ResNet fashion through a deep convolutional neural network face recognition algorithm, compares them with the face features in the sample library using cosine distance as the metric function, and finds the most similar face as the recognition result; next, the person fall detection module learns the prior distribution of the joints relative to the upper half of the human body from the training image set and performs human pose estimation on the enhanced images; the upper body is detected with the upper-body detector, the joint distribution areas are determined from the joint priors, the final positioning probabilities of the human joints are calculated, and the final joint locations are determined; whether a person has fallen is judged from the positions corresponding to the detected final joint locations; finally, the image retrieval module maps each enhanced image to a binary code through a hash function, compares Hamming distances against the binary codes of all pictures in the database, and obtains the image retrieval results in ascending order of Hamming distance.
The effect of the above technical scheme is as follows: automatic monitoring by the intelligent power plant management and control system based on computer vision target detection effectively reduces the monitoring oversights caused by the drop in attention that comes with long manual monitoring. The monitoring coverage and accuracy of power plant operation are effectively improved, and the monitoring error rate and oversight rate are greatly reduced.
In an embodiment of the present invention, the Retinex image enhancement module includes:
the conversion module is used for reading the original image S(x, y) and judging whether it is a grayscale image; if the original image is grayscale, the gray value of each pixel is converted from integer (int) to floating point (float) and transformed to the logarithmic domain; if the original image is a color image, it is processed channel by channel, and each component pixel value is converted from integer (int) to floating point (float) and transformed to the logarithmic domain;
the scale acquisition module is used for inputting a Gaussian surround scale, converting the discretized integral operation into a summation, and obtaining the scale λ from the input Gaussian surround scale and the normalization condition the scale must satisfy;
the final image acquisition module is used for obtaining the final image through the image model of the reflected light entering the human eye; if the original image is grayscale, only one final image is generated; if the original image is a color image, each channel forms a corresponding final image;
the output image acquisition module is used for converting the final image from the logarithmic domain to the real domain to obtain the output image;
and the display module is used for linearly stretching the output image and converting it to the corresponding format for output and display.
Wherein, the image processing process of the Retinex image enhancement module comprises:
step 1, reading the original image S(x, y) and judging whether it is a grayscale image; if the original image is grayscale, the gray value of each pixel is converted from integer (int) to floating point (float) and transformed to the logarithmic domain; if the original image is a color image, it is processed channel by channel, and each component pixel value is converted from integer (int) to floating point (float) and transformed to the logarithmic domain;
step 2, inputting the Gaussian surround scale c, converting the discretized integral operation into a summation, and determining the value of λ through the following formulas:
F(x, y) = λ·e^(−(x² + y²)/c²)
∬F(x, y) dx dy = 1
wherein F(x, y) represents the center-surround function; x and y represent the coordinates of each image point on the x and y axes; c represents the Gaussian surround scale; and λ represents a scale whose value satisfies the normalization condition ∬F(x, y) dx dy = 1;
step 3, obtaining the final image through the image model of the reflected light entering the human eye; if the original image is grayscale, only one final image is generated; if the original image is a color image, each channel forms a corresponding final image, wherein the image model of the reflected light entering the human eye is:
r(x, y) = Σ_{k=1}^{K} w_k { log S(x, y) − log[F_k(x, y) * S(x, y)] }
wherein r(x, y) represents the final image and S(x, y) the original image; K is the number of Gaussian center-surround functions, with k taking the values 1, 2, 3; and the weights satisfy:
w_1 + w_2 + w_3 = 1, with w_k = 1/3;
step 4, converting r(x, y) from the logarithmic domain to the real domain to obtain the output image R(x, y);
and step 5, linearly stretching the output image R(x, y), converting it to the corresponding format, and outputting it for display.
The working principle of the technical scheme is as follows: the realization process of the SSR (simple sequence repeat) of the single-scale Retinex algorithm can be summarized as follows:
(1) reading original S (x, y): if the original image is a gray scale image: converting the gray value of each pixel of the image from integer (int) to floating point (float) and converting to logarithmic domain; if the original image is a color image: color is divided into channels to be processed, each component pixel value is converted into a floating point number (float) from an integer (int) and is converted into a logarithm domain;
(2) inputting a Gaussian surrounding scale C, discretizing integral operation, converting the discretized integral operation into summation operation, and determining the value of lambda through the above formulas (4) and (5);
(3) by
Figure GDA0003649211810000084
Obtaining r (x, y); if the original image is a gray scale image, only one r (x, y); if the original image is a color image, each channel has a corresponding r (x, y);
(4) converting R (x, y) from a logarithmic domain to a real domain to obtain an output image R (x, y);
(5) in this case, the range of R (x, y) values is not 0 to 255, and therefore, linear stretching and conversion into a corresponding format are also required to output a display.
The center-surround function F(x, y) is a low-pass function, through which the algorithm estimates the low-frequency part of the incident image corresponding to the original image. Removing the low-frequency illumination from the original image leaves the high-frequency components of the original image. These high-frequency components are valuable: in the human visual system, the eye is quite sensitive to the high-frequency information of edges, so the SSR algorithm is good at enhancing the edge information in an image.
At the same time, owing to the characteristics of the Gaussian function chosen in the SSR algorithm, the two indexes of large dynamic-range compression and contrast enhancement cannot both be guaranteed at once. To balance the two enhancement effects, a suitable Gaussian scale constant C must be chosen; its value generally lies between 80 and 100.
Further considering keeping high image fidelity while compressing the dynamic range of the image, so as to realize color enhancement, color constancy, local dynamic-range compression and global dynamic-range compression, the image enhancement formula is improved as:
r(x, y) = Σ_{k=1}^{K} w_k { log S(x, y) − log[F_k(x, y) * S(x, y)] }
where K is the number of Gaussian center-surround functions. When K = 1, the improved formula degenerates into the SSR formula. Generally, to preserve the advantages of the SSR at the high, medium and low scales simultaneously, K is taken as 3, with:
w_1 = w_2 = w_3 = 1/3
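A sketch of this multi-scale extension, assuming equal weights w_k = 1/3; the three surround scales are illustrative values in the commonly used small/medium/large range, not values fixed by the patent:

```python
# Multi-scale Retinex as a weighted sum of single-scale log-domain results.
import cv2
import numpy as np

def ssr_log(img: np.ndarray, c: float) -> np.ndarray:
    """Log-domain reflectance for one surround scale c."""
    img = img.astype(np.float64) + 1.0
    return np.log(img) - np.log(cv2.GaussianBlur(img, (0, 0), c))

def msr(image: np.ndarray, scales=(15.0, 80.0, 200.0)) -> np.ndarray:
    weights = [1.0 / len(scales)] * len(scales)   # w_1 = w_2 = w_3 = 1/3
    r = sum(w * ssr_log(image, c) for w, c in zip(weights, scales))
    r = (r - r.min()) / (r.max() - r.min() + 1e-12) * 255.0   # linear stretch
    return r.astype(np.uint8)
```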
the effect of the above technical scheme is as follows: by means of the method, the edge information in the image can be effectively enhanced, and then the accuracy and the precision of subsequent target extraction are improved. Meanwhile, through the improvement of the image enhancement formula, the high fidelity of the image and the compression efficiency of compressing the dynamic range of the image can be effectively improved, and color enhancement, color constancy, local dynamic range compression and global dynamic range compression are realized. Moreover, the balance degree between the two indexes can be effectively compressed and contrast-enhanced within a too high dynamic range, so that the two indexes have higher enhancement effects.
In one embodiment of the present invention, the deep learning target detection module includes:
a candidate region module for generating a candidate region for the enhanced image, the candidate region including position information of the target;
the frame information acquisition module, used for obtaining first-level frame information through an RPN (Region Proposal Network) and further regressing it through a regression branch to obtain the second-level frame information corresponding to the first-level frame information;
and the detection module, used for performing target detection on the enhanced image through the detection network of the SSD framework combined with the second-level frame information.
The working principle of the technical scheme is as follows: firstly, generating a candidate region for the enhanced image through a candidate region module, wherein the candidate region comprises position information of a target; then, acquiring primary frame information through an RPN by using a frame information acquisition module, and further regressing the primary frame information in a regression branch mode to acquire secondary frame information corresponding to the primary frame information; the first-level frame information is first-level frame information, and the second-level frame information is more accurate frame information corresponding to the first-level frame information; and finally, the detection module is adopted to carry out target detection on the enhanced image by combining the detection network of the SSD frame and the information of the secondary frame. Wherein the model of the detection network of the SSD framework is shown in fig. 2.
The effect of the above technical scheme is as follows: the double-stage target detection algorithm can effectively improve the detection speed. The single-stage target detection algorithm can continuously improve the detection precision of the model by utilizing the speed advantage, and meets the requirements on the detection speed and precision. The RPN, the FPN algorithm and the SSD algorithm are combined, and the first-stage and second-stage target detection algorithms are combined, so that the detection precision is greatly improved. In addition, in the second stage target detection algorithm, more accurate frame information can be obtained through a coarse-to-fine regression idea of the frame, namely, the first stage frame information is obtained through an RPN network, and then further regression is carried out through a general regression branch. The FPN algorithm and other functions are introduced to realize fusion operation, so that the detection effect of small objects can be effectively improved, network detection is facilitated, and the occurrence rate of detection omission conditions is avoided. Meanwhile, the detection network takes the SSD as a frame, so that the detection speed of the model can be effectively improved.
According to one embodiment of the invention, the detection network of the SSD framework comprises an anchor frame refinement module, a target detection module and a transmission connection structure;
the anchor frame refinement module is used for setting the scale and aspect-ratio parameters of the anchor frames, then matching the anchor frames with the real target frames and marking the anchor frame with the highest matching degree with a real target frame as a positive sample; the unmatched anchor frames are then matched against the real target frames, and if the IOU value is greater than a preset threshold th, the anchor frame is also taken as a positive sample, where the threshold th is 0.5;
the transmission connection structure is used for fusing the high-level features with the low-level output features of the anchor frame refinement module through element-wise operations on the corresponding convolution layers, converting them into output features that serve as the input features of the target detection module;
and the target detection module is used for regressing more accurate object positions, taking as input the output features of the transmission connection structure together with the refined anchor frames generated by the anchor frame refinement module.
The working principle of the technical scheme is as follows: the network structure mainly comprises three modules, namely an anchor frame refining module, a target detection module and a transmission connection structure. In addition, the network structure further includes a network common feature extraction module, specifically, a resnet34 module is adopted, and a final pooling layer and a full connection layer of a conventional resnet34 module are removed, as shown in fig. 3, the resnet module is an improved resnet module, and the problem of gradient dispersion or gradient explosion caused by the increase of the number of layers of convolution is effectively solved by adopting identity mapping and residual mapping. The transmission connection structure mainly carries out feature conversion, finely adjusts the anchor frame to the low-layer output feature of the module through the elements of the convolution layer corresponding to the convolution layer, fuses the high-layer feature and then converts the anchor frame into the input feature of the target detection module. The target detection module returns to a more accurate object location using the tessellation anchor frame generated by the fixed-frame tessellation module as an input based on the output characteristics of the transport connection structure. Wherein, concretely:
and aiming at an anchor frame refining module: object detection algorithms typically sample many regions in an input image and then check whether the object to be searched is contained in the region and adjust the region edges to more accurately predict the actual bounding box of the object. The area sampling method used may be different, and the size and aspect ratio of the bounding box may be different, depending on the model. These bounding boxes are commonly referred to as "anchor boxes". The processing object of the anchor frame refinement module in the algorithm is the initial anchor frame.
In the anchor box refinement module, the network infrastructure module extracts the network for the Resnet34 feature, and the size and aspect ratio settings of the anchor box depend on the modification range of the "actual box" (ground route). The higher the matching degree of the anchor box and the actual box is, the better the matching degree is, so that the influence of repeated background noise on the precision can be avoided. In addition, the smaller the difference between the anchor frame and the actual box, the easier the position regression. This is because the anchor frame and the actual box are linear regression models when they are close to each other, and if the difference is large, a complicated nonlinear regression model needs to be provided to solve the problem.
The process of matching the anchor frame with the actual frame is a process of adding a label to the anchor frame. Whether the anchor frame is a positive or negative sample is measured by the intersection of the two boxes (IOU) and, if evaluated with a positive sample, is used to learn the location regression relationship.
The specific matching rules are as follows:
a) the anchor frame with the highest matching degree with a real target frame is marked as a positive sample;
b) the remaining unmatched anchor frames are matched against the real target frames in turn, and if the IOU is greater than th (usually set to 0.5), the anchor frame is also taken as a positive sample. If the threshold th is set too small, background noise is easily introduced, which hurts detection precision; if th is set too large, the recall becomes too low.
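A minimal sketch of rules a) and b) for a single anchor, with th = 0.5 as stated above; the (x1, y1, x2, y2) box layout is an assumption for illustration:

```python
# Intersection over union between two boxes, and the rule-b) positive test.
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2). Returns intersection-over-union."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def is_positive(anchor, gt_box, th=0.5):
    return iou(anchor, gt_box) > th
```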
For the target detection module: the role of the target detection module is to further generate the object classes and precise object positions. In this embodiment, convolution is used to extract detection results from different feature maps. The target detection module convolves the outputs of the three convolution layers ConvP4, ConvP5 and ConvP6 in the network model (shown in fig. 2), each with two 3 × 3 convolution kernels: one output is used for classification predictions, with each anchor frame containing 2 predicted values; the other output is used to regress the positioning information, with each anchor box yielding the 4 coordinate values (x, y, w, h) of the target.
The prediction frames generated by conv4_3, conv5_3 and fc7 in the anchor frame refinement module are merged and sorted against the confidence threshold; the lower-confidence prediction frames are filtered out, the first 500 prediction frames are retained, and the remaining prediction frames are decoded. Finally, overlapping or incorrect prediction frames are eliminated by the Non-Maximum Suppression (NMS) method to obtain the final detection result.
The non-maximum suppression method assumes a prediction bounding box B, and the model calculates a prediction probability for each class. If the maximum prediction probability is P, the class corresponding to that probability is the predicted class of B, and P is called the confidence of the prediction bounding box B. Within the same image, the prediction bounding boxes (excluding those predicted as background) are sorted by confidence from high to low, yielding a list L. The prediction bounding box B1 with the highest confidence is then selected from L as the reference, and all non-reference prediction bounding boxes whose intersection-over-union with B1 exceeds a certain threshold are removed from L; the threshold here is a preset hyper-parameter. At this point L keeps the prediction bounding box with the highest confidence and has dropped the other prediction bounding boxes similar to it. Next, the prediction bounding box B2 with the second-highest confidence is selected from L as the reference, and all non-reference prediction bounding boxes whose intersection-over-union with B2 exceeds the threshold are removed from L. This process repeats until every prediction bounding box in L has served as a reference. At that point the intersection-over-union of any pair of prediction bounding boxes in L is below the threshold, and all prediction bounding boxes in the list L are output. Finally, the detection result with the highest confidence value is output.
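A minimal sketch of the loop just described, reusing the iou() helper from the previous sketch; the threshold value is an illustrative hyper-parameter:

```python
# Greedy non-maximum suppression: keep boxes in confidence order, dropping
# any box that overlaps an already-kept reference box too strongly.
def nms(boxes, scores, threshold=0.5):
    """boxes: list of (x1, y1, x2, y2); scores: list of confidences."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        ref = order.pop(0)                 # B1, B2, ... in confidence order
        keep.append(ref)
        order = [i for i in order if iou(boxes[i], boxes[ref]) <= threshold]
    return keep                            # indices of the retained prediction boxes
```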
The specific structure of the transmission connection structure is shown in fig. 4. By establishing connections between the anchor frame refinement module and the target detection module, it converts the features of each layer of the anchor frame refinement module into the form required by the target detection module, so that the target detection module shares the features of the anchor frame refinement module. To match the dimensions between them, the module uses a deconvolution module (Deconv) to enlarge the high-dimensional feature map. The transmission connection structure uses an Eltwise module to relate the semantic information of the context directly: the elements of the convolution layer and the deconvolution layer are added correspondingly, which increases the feature parameters of the sample, strengthens the network's ability to learn from the sample, and improves training precision. The three basic operations supported by the Eltwise layer are dot product, summation and maximum; summation is the default operation, fusing context information, increasing the feature parameters and enhancing the learning ability of the network.
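A hedged sketch of the transmission connection structure, assuming PyTorch; the channel count, kernel sizes, and the assumption that the low-level map is exactly twice the spatial size of the high-level map are illustrative choices, with element-wise summation as the Eltwise default operation:

```python
# High-level features are enlarged by deconvolution (Deconv) and summed
# element-wise with the convolved low-level features of the refinement module.
import torch
import torch.nn as nn

class TransferConnection(nn.Module):
    def __init__(self, ch: int = 256):
        super().__init__()
        self.conv_low = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.deconv_high = nn.ConvTranspose2d(ch, ch, kernel_size=2, stride=2)
        self.relu = nn.ReLU(inplace=True)
        self.conv_out = nn.Conv2d(ch, ch, kernel_size=3, padding=1)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        fused = self.conv_low(low) + self.deconv_high(high)   # element-wise sum (Eltwise)
        return self.conv_out(self.relu(fused))
```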
The effect of the above technical scheme is as follows: by the method, the accuracy and the efficiency of target detection in the image can be effectively improved. And then the target to be identified in the enhanced image can be effectively extracted.
In an embodiment of the present invention, the process of performing parameter setting on the dimension and the aspect ratio of the anchor frame includes:
first, obtaining the scale of the anchor frames through the following formula:
s_k = s_min + ((s_max − s_min)/(m − 1))·(k − 1), k ∈ [1, m]
wherein s_k represents the scale of the k-th anchor frame; s_min represents the anchor frame scale of the bottom layer, generally taking the value 0.2; s_max represents the anchor frame scale of the highest layer, generally taking the value 0.9; and m represents the total number of anchor frame scales;
second, obtaining the width and height of the anchor frames through the following formulas:
w_k^a = s_k·√(a_r)
h_k^a = s_k/√(a_r)
wherein w_k^a represents the width of the anchor frame; h_k^a represents its height; and a_r represents the aspect ratio of the anchor frame.
The loss function of the detection network of the SSD framework combines the losses of the anchor frame refinement module and the target detection module, and takes the form:
L({p_i}, {x_i}, {c_i}, {t_i}) = (1/N_ARM)·Σ_i [ L_b(p_i, [l_i* ≥ 1]) + [l_i* ≥ 1]·L_r(x_i, g_i*) ] + (1/N_ODM)·Σ_i [ L_m(c_i, l_i*) + [l_i* ≥ 1]·L_r(t_i, g_i*) ]
wherein N_ARM represents the number of foreground objects, i.e. positive samples, in the anchor frame refinement module, and N_ODM represents the number of positive samples in the target detection module; p_i denotes the confidence; x_i the coordinates of the predicted foreground object after refinement by the anchor frame refinement module; c_i the object class predicted by the target detection module; t_i the object coordinates predicted by the target detection module; l_i* the true class label of the object; and g_i* the true position and size of the target; L_b represents the binary classification loss function in the anchor frame refinement module; L_r(x_i, g_i*) the regression loss function in the anchor frame refinement module; L_m the multi-class classification loss in the target detection module; and L_r(t_i, g_i*) the regression loss function in the target detection module. This loss function effectively improves the refinement precision and accuracy of the anchor frames.
The working principle of the technical scheme is as follows:
the anchor frame requires two parameters, scale and aspect ratio. The scale setting follows a linear increase rule: as the feature map size shrinks, the anchor frame scale should increase linearly, defined as:
s_k = s_min + ((s_max − s_min)/(m − 1))·(k − 1), k ∈ [1, m]
wherein s_min represents the scale of the bottom layer, generally set to 0.2, and s_max represents the scale of the highest layer, generally set to 0.9. With a_r denoting the aspect ratio of the anchor frame, set in this embodiment to the 3 different values 0.5, 1.0 and 2, the width w and height h of the anchor frame can be calculated by:
w_k^a = s_k·√(a_r)
h_k^a = s_k/√(a_r)
Each feature map usually has one anchor frame with a_r = 1 and scale s_k, and is additionally provided with an anchor frame of scale
s_k' = √(s_k·s_{k+1})
so that each feature map carries two square anchor frames with aspect ratio 1 but different sizes. In total there are thus 4 anchor frames per feature map cell; the position of each anchor frame is fixed and their sizes differ.
Before training, the three feature maps conv4_3, conv5_3 and fc7 in fig. 2 are extracted, and 4 anchor frames with different scales and aspect ratios are generated on the feature maps centered at each pixel; each anchor frame is regarded as a training sample. The four offset values of each anchor frame are then predicted and a classification score for the candidate frame is obtained. Candidate frames with a large background confidence are deleted; candidate frames with a high foreground confidence, as well as those whose background status is uncertain, are kept and passed to the target detection module for further screening.
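A minimal sketch of anchor generation per feature-map cell following the formulas above, with aspect ratios {0.5, 1.0, 2.0} and the extra square anchor of scale √(s_k·s_{k+1}); treating k = m by extrapolating the next scale is an assumption:

```python
# Generate the 4 anchor (width, height) pairs for feature map k of m.
import math

def anchor_sizes(k: int, m: int, s_min: float = 0.2, s_max: float = 0.9):
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)     # linear scale rule
    s_k1 = s_min + (s_max - s_min) * k / (m - 1)          # scale of layer k+1
    anchors = [(s_k * math.sqrt(ar), s_k / math.sqrt(ar)) for ar in (0.5, 1.0, 2.0)]
    anchors.append((math.sqrt(s_k * s_k1), math.sqrt(s_k * s_k1)))  # extra square anchor
    return anchors  # (width, height) pairs, relative to the image size

# e.g. anchor_sizes(k=1, m=3) gives the 4 anchors of the lowest feature map
```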
The effect of the above technical scheme is as follows: by means of the method, the acquisition accuracy and the acquisition speed of the anchor frame can be effectively improved, the acquisition accuracy and the acquisition accuracy of the candidate area are effectively improved, the existence rate of interference images in the candidate frame is reduced, and the accuracy and the target detection efficiency of subsequent target detection are further improved.
In an embodiment of the present invention, the model training process of the detection network of the SSD framework includes:
step A1, organizing 5000 intrusion sample images as the data set, of which 4000 intrusion sample images are used as training samples and 1000 as test samples;
step A2, annotating each intrusion sample image in the data set, marking the bounding box in which the object in the image is located, the bounding box consisting of the four floating point numbers [y_min, x_min, y_max, x_max];
step A3, introducing first-order momentum to update the algorithm parameters, which are updated as follows:
v_dW = β·v_dW + (1 − β)·dW
v_db = β·v_db + (1 − β)·db
W = W − α·v_dW, b = b − α·v_db
wherein v_dW and v_db respectively represent the exponentially weighted average parameters; W and b represent the weights and biases being trained; α represents the learning rate; and β represents the exponential weighting factor, with β = 0.9;
and step A4, iterating at most 50 times per training run until the optimal model is found, and outputting a training log.
The working principle of the technical scheme is as follows: in this embodiment, in order to train the model, the intrusion sample images of the seal sample are firstly sorted out as the data set, and 5000 images are collected in total in this embodiment, of which 4000 images are used for training and 1000 images are used for testing. Each picture in the data set is manually marked, and a boundary box where an object in the picture is located is marked, wherein the boundary box consists of four floating point numbers [ ymin, xmin, ymax, xmax ]. In order to reduce the problem that the training cannot be converged due to few samples, model training and parameter updating are performed through the content of the step a 3. Each training is iterated a maximum of 50 times until an optimal model is found. And meanwhile, a training sunrise is output, the log outputs a Loss value in training, and the lower the Loss value is, the closer the model is to convergence is.
The effect of the above technical scheme is as follows: in this embodiment, a sample data set of the targets to be detected is first constructed, and a target detection model is then trained with the target detection neural network. At deployment time, the trained model is used to detect input images, and the average detection accuracy exceeds 98%. The method also allows the training to converge quickly, improving both model training efficiency and training speed.
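For concreteness, the step-A3 update can be sketched as follows; the gradients dW and db are assumed to be computed elsewhere (e.g. by backpropagation), and the default learning rate is an illustrative value.

```python
import numpy as np

def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.001, beta=0.9):
    """One parameter update with first-order momentum (step A3): the
    exponentially weighted averages of the gradients drive the update."""
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db
```

In a training loop this step would be repeated for at most 50 iterations per run, logging the Loss value after each iteration as described in step A4.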
In an embodiment of the present invention, the deep learning face recognition module includes a video reading module, a face detection module, a face calibration module, and a face recognition module:
the video reading module is used for reading a video stream in video information and extracting a plurality of key frames corresponding to the video aiming at the video stream;
the face detection module is used for performing face detection on the video image corresponding to the key frame and judging whether a face is present in it; if a face is present in the video image corresponding to the key frame, the face calibration module is started;
the face calibration module is used for mapping the face to a frontal face by a 68-point face calibration method and judging whether the face correction succeeded; if the correction succeeded, the face recognition module is started, and if it did not, key frames are extracted again from the corresponding video;
and the face recognition module is used for extracting the face features to be recognized, comparing the face features with the face features in the sample library by taking the cosine distance as a measurement function, finding out the most similar face as a recognition result and storing the recognition result.
For key-frame extraction, the video reading module can extract the key frames at video monitoring time points, specifically as follows:
a time interval between the extraction of two adjacent key frames is set, obtained by the following formula:
[Formula presented as an image in the original, giving T as a function of $T_0$, C, S, $S_z$, $\lambda_1$, $\lambda_2$, $\lambda_3$ and $C_i$.]
where T denotes the key-frame collection interval over the 24 hours (0:00 to 24:00) of the current natural day of power plant operation; $T_0$ is a preset initial time interval, usually set to 1 min; C denotes the number of abnormal monitoring pictures in the previous natural day of power plant operation; S denotes the number of days with abnormal monitoring pictures during the intelligent management and control system's monitoring of plant operation; $S_z$ denotes the total number of days the system has monitored plant operation; $\lambda_1$, $\lambda_2$ and $\lambda_3$ are time adjustment coefficients, $\lambda_1$ ranging over 0.82-0.97 (preferably 0.89), $\lambda_2$ over 0.41-0.53 (preferably 0.43), and $\lambda_3$ over 0.36-0.43 (preferably 0.39); and $C_i$ denotes the number of abnormal monitoring pictures on the i-th day of plant operation.
This time interval setting adjusts the key-frame acquisition interval of the currently monitored video according to the abnormalities observed in the previous day of plant monitoring. Through this adjustment, the current image acquisition process adapts to the abnormal conditions of the plant's overall operation, so that the key-frame acquisition frequency matches the actual operating state of the plant. It prevents the inaccuracy that arises when key frames are collected at a uniform frequency that does not match the changing state of plant equipment and the actual working state of personnel, and thus effectively reduces the probability of monitoring omissions. At the same time, the interval obtained from the formula improves the reasonableness of the day's key-frame acquisition frequency and the efficiency of key-frame capture, preventing critical video images from being missed because of an unreasonable acquisition frequency, and thereby strengthening the safety monitoring of plant operation. Adjusting the interval in this way also keeps the number of collected key frames under control, avoiding the drop in system response caused by processing too many video images. An acquisition frequency that matches the day's equipment and personnel monitoring needs reduces the volume of image data to be processed, ensuring a double improvement in both the monitoring strength and the monitoring efficiency of the system.
The working principle of the technical scheme is as follows: the detection principle of the face recognition module is shown in fig. 5. First, the video reading module reads the video stream in the video information and extracts a plurality of key frames from the stream. The face detection module then performs face detection on the video image corresponding to each key frame and judges whether a face is present; if a face is present, the face calibration module is started. The face calibration module maps the face to a frontal face by the 68-point face calibration method and judges whether the correction succeeded; if it did, the face recognition module is started, and if not, key frames are extracted again from the corresponding video. Finally, the face recognition module extracts the face features to be recognized, compares them with the face features in the sample library using the cosine distance as the measurement function, finds the most similar face as the recognition result, and stores the result.
The effect of the above technical scheme is as follows: this method effectively improves the speed and efficiency of face recognition. At the same time, combining the 68-point face calibration method with the key-frame re-extraction strategy effectively improves the accuracy of face recognition and largely avoids recognition errors.
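The comparison step can be pictured with the short sketch below, which assumes face embeddings have already been extracted (e.g. by a ResNet-based network); the function and variable names are illustrative.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity between feature vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def recognize(query_feat, library):
    """library: dict mapping person id -> stored face feature vector.
    Returns the identity whose stored feature is closest to the query
    under cosine distance, together with that distance."""
    best = min(library, key=lambda pid: cosine_distance(query_feat, library[pid]))
    return best, cosine_distance(query_feat, library[best])
```

In deployment, the library features would come from the registered-person sample library described above.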
In one embodiment of the invention, the personal fall detection module comprises:
the prior distribution module is used for learning the prior distribution of the joints relative to the upper half of the human body by utilizing the training image set;
the estimation module is used, when human posture estimation is performed on the image to be processed, for detecting the upper body with the human upper-body detector and determining the joint distribution area from the joint prior distribution, so that the joint search space is reduced from the whole image to a smaller image region;
the positioning probability acquisition module is used for performing a convolution calculation of the joint appearance model over the joint distribution area to obtain the positioning probability of the joint within that area;
the joint positioning module is used for calculating the final positioning probability of the joint by using the human body model and determining the final positioning of the joint, wherein the positioning probability is as follows:
$p_i(x) = \sum_{x_u} p_{iu}(x - x_u)\, p_u(x_u)$
where $p_i(x)$ denotes the positioning probability of the joint i at the pixel point x; $p_{iu}$ denotes the conditional distribution of joint i assumed when joint u is positioned at the center of the image, a distribution that can be learned from the relative positions of the two joints in the training images; and $p_u(x)$ denotes the positioning probability of joint u at the pixel point x. The positioning probabilities of the elbow and wrist joints can be solved in the same manner.
And the judging module is used for judging whether the person falls down according to the position corresponding to the final positioning of the detected joint of the person.
The working principle of the technical scheme is as follows: the network structure corresponding to the personal fall detection module is shown in fig. 6. The limbs, trunk and head of the human body are connected together by joints, and regardless of posture changes the joints are distributed within a relatively fixed range around the trunk. Based on this, the prior distribution of the joints relative to the upper body is learned from the training image set; when human posture estimation is performed on the image to be processed, the upper body is first detected with the human upper-body detector, and the joint distribution area is then determined from the joint prior distribution, so that the joint search space is reduced from the whole image to a smaller image region.
The effect of the above technical scheme is as follows: this method effectively reduces the joint search space, which in turn improves the efficiency and accuracy of the joint search, the recognition of human postures, and the speed and accuracy of fall detection. It also avoids missed fall detections that occur when a worker falls and gets up so quickly that the fall posture cannot be recognized in time.
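A rough sketch of how the learned prior narrows the search, under the assumption that a per-joint appearance score map and a prior map (aligned to the detected upper body) are available; the threshold is an illustrative value.

```python
import numpy as np

def localize_joint(score_map, prior_map, prior_thresh=0.01):
    """score_map: appearance-model response for one joint (H x W).
    prior_map: prior probability of the joint relative to the detected
    upper body, resampled to the same grid. Pixels with negligible
    prior are excluded, shrinking the search to the prior region."""
    masked = np.where(prior_map > prior_thresh, score_map * prior_map, -np.inf)
    return np.unravel_index(np.argmax(masked), masked.shape)  # (row, col)
```

The returned position would then feed the judging module, which decides whether the detected joint configuration corresponds to a fall.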
In an embodiment of the present invention, the image retrieval module operates as follows:
First, the output of the hidden layer H is extracted as the picture label, denoted Out(H). The hidden-layer activations are thresholded to obtain binary codes. For each bit j = 1, ..., h (h being the number of hidden-layer nodes), the binary code of the hidden-layer output is:
$H^j = \begin{cases} 1, & \mathrm{Out}^j(H) \ge 0.5 \\ 0, & \text{otherwise} \end{cases}$
Suppose there are n pictures to be selected, $\{I_1, I_2, \ldots, I_n\}$, with associated binary codes $\{H_1, H_2, \ldots, H_n\}$, $H_i \in \{0,1\}^h$. Given a query image $I_q$ and its binary code $H_q$, its m candidate pictures $\{I_1^c, I_2^c, \ldots, I_m^c\}$ are identified as those for which the Hamming distance between $H_q$ and $H_i$ is less than a certain threshold.
Given a picture to be retrieved $I_q$ and a candidate set P, the top k most similar pictures are found through F7-layer feature extraction. Let $V_q$ denote the F7-layer feature of the picture to be retrieved and $V_i^P$ the F7-layer feature of the i-th picture in the candidate set. Similarity is obtained by comparing the Euclidean distance between the picture to be retrieved and each picture in the candidate set; the smaller the distance, the more similar the pictures:

$s_i = \lVert V_q - V_i^P \rVert$
The candidate pictures are then arranged in ascending order of Euclidean distance (descending similarity), and the top-k ranked images are identified as the retrieval result.
The effect of the above technical scheme is as follows: by means of the image retrieval method, the calculation amount in the image retrieval process can be effectively reduced, and the image retrieval efficiency is effectively improved.
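A compact sketch of this two-stage retrieval, assuming the hidden-layer outputs and F7 features have already been computed by the network; the binarization threshold of 0.5, the Hamming radius and k are illustrative values.

```python
import numpy as np

def binarize(hidden_out, thresh=0.5):
    """Threshold hidden-layer activations into a binary hash code."""
    return (hidden_out >= thresh).astype(np.uint8)

def retrieve(query_code, query_f7, codes, f7_feats, radius=8, k=10):
    """Coarse stage: keep pictures whose Hamming distance to the query
    code is below `radius`. Fine stage: rank the survivors by Euclidean
    distance between F7 features, ascending, and return the top k."""
    hamming = np.count_nonzero(codes != query_code, axis=1)
    cand = np.where(hamming < radius)[0]
    dists = np.linalg.norm(f7_feats[cand] - query_f7, axis=1)
    return cand[np.argsort(dists)[:k]]
```

The coarse Hamming filter is what reduces the calculation amount: the expensive Euclidean comparison is only run on the small candidate set.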
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (5)

1. A smart power plant management and control system based on computer vision target detection, characterized in that the system comprises:
the image acquisition module is used for shooting images aiming at various scenes in the intelligent power plant and acquiring image pictures and video information corresponding to the scenes;
the Retinex image enhancement module is used for enhancing the image information required in the image picture by utilizing an SSR algorithm, weakening or eliminating the image information which is not required and obtaining an enhanced image;
the deep learning target detection module is used for carrying out target detection on the image subjected to the enhancement processing by using a Region Proposal Network (RPN), a Feature Pyramid Network (FPN) algorithm and a Single Shot MultiBox Detector (SSD) algorithm in a mode combining a single stage and a double stage;
the deep learning face recognition module is used for extracting face features to be recognized aiming at the enhanced images by adopting a Resnet mode through a face recognition algorithm of a deep convolutional neural network, comparing the face features with the face features in a sample library by adopting a cosine distance as a measurement function, and finding out the most similar face as a recognition result;
the personnel falling detection module is used for learning the prior distribution of the joints relative to the upper half of the human body by utilizing the training image set and carrying out human posture estimation on the enhanced image; then, detecting the upper half of the body by using a human body upper half detector, determining a distribution area of joints according to the prior distribution of the joints, calculating the final positioning probability of the joints of the human body, and determining the final positioning of the joints; judging whether the person falls down or not according to the detected position corresponding to the final positioning of the joint of the person;
the image retrieval module is used for mapping each enhanced image into a binary code through a hash function, comparing the Hamming distance with the binary codes of all pictures in the database, and obtaining the result of the image retrieval according to the sequence of the Hamming distance from small to large;
the deep learning target detection module comprises
A candidate region module for generating a candidate region for the enhanced image, the candidate region including position information of the target;
the frame information acquisition module is used for acquiring primary frame information through a Region Proposal Network (RPN) and further regressing the primary frame information in a regression-branch manner to acquire the secondary frame information corresponding to the primary frame information;
the detection module is used for carrying out target detection on the enhanced image by combining the detection network of the SSD frame with the information of the secondary frame;
the detection network of the SSD frame comprises an anchor frame refining module, a target detection module and a transmission connection structure;
the anchor frame refining module is used for setting the parameters of the scale and the length-width ratio of the anchor frame, then matching the anchor frames with the real target frames, and marking the anchor frame with the highest matching degree with a real target frame as a positive sample; the remaining unmatched anchor frames are then compared against the real target frames, and an anchor frame whose IOU value is greater than a preset threshold th is also taken as a positive sample, the threshold th being 0.5;
the transmission connection structure is used for converting the low-layer output features of the anchor frame refining module into the input features of the target detection module by fusing the corresponding high-layer features with the low-layer output features of the convolution layer through element-wise addition;
the target detection module is used for regressing more accurate object positions, taking as input the output features of the transmission connection structure together with the refined anchor frames generated by the anchor frame refining module;
the process of setting the parameters of the dimension and the aspect ratio of the anchor frame comprises the following steps:
firstly, obtaining the size of an anchor frame through the following formula:
$s_k = s_{min} + \dfrac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$
where $s_k$ denotes the scale of the k-th anchor frame; $s_{min}$ denotes the anchor frame scale of the bottommost layer; $s_{max}$ denotes the anchor frame scale of the topmost layer; and m denotes the total number of anchor frames;
secondly, the width and the length of the anchor frame are obtained through the following formulas:
$w_k^a = s_k \sqrt{a_r}$

$h_k^a = s_k / \sqrt{a_r}$
where $w_k^a$ denotes the width of the anchor frame; $h_k^a$ denotes the length of the anchor frame; and $a_r$ denotes the aspect ratio of the anchor frame (a numerical sketch of these formulas follows this claim);
the loss function of the detection network of the SSD framework comprises the loss of an anchor frame refining module and a target detection module, and the loss function is in the form of:
$L(\{p_i\},\{x_i\},\{c_i\},\{t_i\}) = \dfrac{1}{N_{ARM}}\Big(\sum_i L_b\big(p_i, [l_i^* \ge 1]\big) + \sum_i [l_i^* \ge 1]\, L_r\big(x_i, g_i^*\big)\Big) + \dfrac{1}{N_{ODM}}\Big(\sum_i L_m\big(c_i, l_i^*\big) + \sum_i [l_i^* \ge 1]\, L_r\big(t_i, g_i^*\big)\Big)$
where $N_{ARM}$ denotes the number of foreground targets, i.e. positive samples, in the anchor frame refining module; $N_{ODM}$ denotes the number of positive samples in the target detection module; $p_i$ denotes the confidence; $x_i$ denotes the coordinates of the predicted foreground target after refinement by the anchor frame refining module; $c_i$ denotes the target class predicted by the target detection module; $t_i$ denotes the target coordinates predicted by the target detection module; $l_i^*$ denotes the true class label of the target; $g_i^*$ denotes the true position and size of the target; $L_b$ denotes the binary classification loss function in the anchor frame refining module; $L_r(x_i, g_i^*)$ denotes the regression loss function in the anchor frame refining module; $L_m$ denotes the multi-class classification loss in the target detection module; and $L_r(t_i, g_i^*)$ denotes the regression loss function in the target detection module;
the model training process of the detection network of the SSD framework comprises the following steps:
step A1, arranging 5000 sealed sample intrusion sample images as a data set, wherein 4000 sealed sample intrusion sample images are used as training samples, and 1000 sealed sample intrusion sample images are used as test samples;
step A2, annotating each sealed-sample intrusion image in the data set and marking the bounding box of the object in the image, the bounding box consisting of the four floating-point numbers $[y_{min}, x_{min}, y_{max}, x_{max}]$;
step A3, introducing a first-order momentum to update algorithm parameters, wherein the algorithm parameters are updated as follows:
$v_{dW} = \beta v_{dW} + (1-\beta)\,dW$

$v_{db} = \beta v_{db} + (1-\beta)\,db$

$W = W - \alpha v_{dW}, \quad b = b - \alpha v_{db}$
where $v_{dW}$ and $v_{db}$ denote the exponentially weighted moving averages of the gradients $dW$ and $db$; W and b denote the weights and biases being updated; $\alpha$ denotes the learning rate; and $\beta$ denotes the exponential weighting (momentum) coefficient, with $\beta$ set to 0.9;
and step A4, iterating at most 50 times in each training run until the optimal model is found, and outputting a training log.
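The anchor scale and shape formulas of claim 1 can be checked numerically with the short sketch below; $s_{min}=0.2$, $s_{max}=0.9$, m=6 and the aspect ratios are the usual SSD defaults, used here only as illustrative values.

```python
import math

def anchor_shapes(s_min=0.2, s_max=0.9, m=6, aspect_ratios=(1.0, 2.0, 0.5)):
    """Compute the scale s_k for each of the m levels and the width and
    length of each anchor frame, following the formulas in claim 1."""
    shapes = []
    for k in range(1, m + 1):
        s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)
        for a_r in aspect_ratios:
            w = s_k * math.sqrt(a_r)   # w_k^a = s_k * sqrt(a_r)
            h = s_k / math.sqrt(a_r)   # h_k^a = s_k / sqrt(a_r)
            shapes.append((k, a_r, round(w, 3), round(h, 3)))
    return shapes
```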
2. The system of claim 1, wherein the Retinex image enhancement module comprises:
the conversion module is used for reading the original image S (x, y) and judging whether the original image S (x, y) is a gray scale image or not; if the original image is a gray image, converting the gray value of each pixel of the image from an integer (int) to a floating point (float) and converting the gray value to a logarithmic domain; if the original image is a color image, the color is divided into channels for processing, and each component pixel value is converted into a floating point number (float) from an integer type (int) and is converted into a logarithmic domain;
the scale acquisition module is used for inputting a Gaussian surround scale, converting the integral operation into a discrete summation, and acquiring the scale λ such that the input Gaussian surround scale and λ satisfy the normalization condition;
the final formed image acquisition module is used for acquiring a final formed image through an image model formed by the reflected light entering human eyes, and only one final formed image is generated if the original image is a gray scale image; if the original image is a color image, each channel forms a corresponding final formed image;
the output image acquisition module is used for converting the finally formed image from a logarithmic domain to a real domain to obtain an output image;
and the display module is used for linearly stretching the output image and converting the output image into a corresponding format for output display.
3. The system according to claim 1, wherein the image processing procedure of the Retinex image enhancement module comprises:
step 1, reading an original image S (x, y), and judging whether the original image S (x, y) is a gray scale image; if the original image is a gray image, converting the gray value of each pixel of the image from an integer type to a floating point and converting the gray value to a logarithmic domain; if the original image is a color image, the color is subjected to channel division processing, and each component pixel value is converted into a floating point number from an integer and is converted into a logarithmic domain;
step 2, inputting a Gaussian surrounding scale C, converting discretization of integral operation into summation operation, and determining a lambda value through the following formula;
$F(x,y) = \lambda e^{-(x^2+y^2)/C^2}$

$\iint F(x,y)\,dx\,dy = 1$

where F(x, y) denotes the center-surround function; x and y denote the coordinate values of each point in the image on the x and y axes; C denotes the Gaussian surround scale; and λ denotes the scale, whose value is determined so that the condition $\iint F(x,y)\,dx\,dy = 1$ is satisfied;
step 3, obtaining a final formed image by an image model formed by the reflected light entering human eyes, and only generating one final formed image if the original image is a gray scale image; if the original image is a color image, each channel forms a corresponding final formed image, wherein the image model formed by the reflected light entering human eyes is as follows:
$r(x,y) = \sum_{k=1}^{K} w_k \big( \log S(x,y) - \log[F_k(x,y) * S(x,y)] \big)$

where r(x, y) denotes the final formed image; S(x, y) denotes the original image; * denotes convolution; $w_k$ denotes the weight of the k-th Gaussian center-surround function; and K is the number of Gaussian center-surround functions, with k taking the values 1, 2 and 3; and with:

$F_k(x,y) = \lambda_k e^{-(x^2+y^2)/c_k^2}$
step 4, converting r(x, y) from the logarithmic domain to the real domain to obtain the output image R(x, y);
and step 5, linearly stretching the output image R(x, y), converting it into the corresponding format, and outputting it for display (a sketch of steps 1-4 follows this claim).
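A minimal single-channel sketch of steps 1-4, assuming OpenCV is available for the Gaussian surround convolution; the sigma value is an illustrative choice, not a value from the patent.

```python
import numpy as np
import cv2  # assumed available for the Gaussian surround convolution

def ssr_channel(channel, sigma=80):
    """Single-scale Retinex for one channel: convert to float, move to
    the log domain, subtract the log of the Gaussian-surround-convolved
    image, then linearly stretch the result to 0-255 for display."""
    img = channel.astype(np.float64) + 1.0           # avoid log(0)
    surround = cv2.GaussianBlur(img, (0, 0), sigma)  # F(x, y) * S(x, y)
    r = np.log(img) - np.log(surround)               # log-domain output
    r = (r - r.min()) / (r.max() - r.min()) * 255.0  # linear stretch
    return r.astype(np.uint8)
```

For a color image this would be applied per channel, as the claim describes; a gray-scale image yields a single output image.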
4. The system of claim 1, wherein the deep learning face recognition module comprises a video reading module, a face detection module, a face calibration module, and a face recognition module:
the video reading module is used for reading a video stream in video information and extracting a plurality of key frames corresponding to the video aiming at the video stream;
the face detection module is used for carrying out face identification on the video image corresponding to the key frame and judging whether a face exists in the video image corresponding to the key frame; if the video image corresponding to the key frame has a face, starting a face calibration module;
the face calibration module is used for mapping the face to a frontal face by a 68-point face calibration method and judging whether the face correction succeeded; if the correction succeeded, the face recognition module is started, and if it did not, key frames are extracted again from the corresponding video;
and the face recognition module is used for extracting the face features to be recognized, comparing the face features with the face features in the sample library by taking the cosine distance as a measurement function, finding out the most similar face as a recognition result and storing the recognition result.
5. The system of claim 1, wherein the personal fall detection module comprises:
the prior distribution module is used for learning the prior distribution of the joints relative to the upper half of the human body by utilizing the training image set;
the estimation module is used for detecting the upper half of the body by using a human body upper half detector when the human body posture estimation is carried out on the image to be processed, and determining the distribution area of the joints according to the prior distribution of the joints;
the positioning probability acquisition module is used for performing a convolution calculation of the joint appearance model over the joint distribution area to obtain the positioning probability of the joint within that area;
the joint positioning module is used for calculating the final positioning probability of the joint by using the human body model and determining the final positioning of the joint, wherein the positioning probability is as follows:
$p_i(x) = \sum_{x_u} p_{iu}(x - x_u)\, p_u(x_u)$
where $p_i(x)$ denotes the positioning probability of the joint i at the pixel point x; $p_{iu}$ denotes the conditional distribution of joint i assumed when joint u is positioned at the center of the image; and $p_u(x)$ denotes the positioning probability of joint u at the pixel point x;
and the judging module is used for judging whether the person falls down according to the position corresponding to the final positioning of the detected joint of the person.