CN111931550A - Infant monitoring method based on intelligent distance perception technology - Google Patents


Info

Publication number
CN111931550A
CN111931550A
Authority
CN
China
Prior art keywords
image
distance
camera
pooling
baby
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010443078.6A
Other languages
Chinese (zh)
Inventor
吕森
陈仁海
王晋超
冯志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010443078.6A priority Critical patent/CN111931550A/en
Publication of CN111931550A publication Critical patent/CN111931550A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85 Stereo camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B21/00 Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
    • G08B21/02 Alarms for ensuring the safety of persons
    • G08B21/04 Alarms for ensuring the safety of persons responsive to non-activity, e.g. of elderly persons
    • G08B21/0438 Sensor means for detecting
    • G08B21/0476 Cameras to detect unsafe condition, e.g. video cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Gerontology & Geriatric Medicine (AREA)
  • Business, Economics & Management (AREA)
  • Emergency Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an infant monitoring method based on intelligent distance perception technology, which captures the environment around the infant on video, segments the infant in the acquired image, and determines the distance of the segmented infant from the bed. A pooling-index method is proposed to improve the upsampling part of the image segmentation algorithm, so that more feature information is obtained during upsampling and the accuracy of semantic segmentation along image boundary contours is improved. Experiments show that the pooling-index method outperforms currently popular semantic segmentation algorithms in edge-contour accuracy; meanwhile, a stereoscopic vision algorithm intelligently senses the actual distance of the infant in the real scene, improving the efficiency of infant monitoring.

Description

Infant monitoring method based on intelligent distance perception technology
Technical Field
The invention relates to the technical field of computer vision, in particular to stereoscopic vision and image segmentation, and specifically to an infant monitoring method based on intelligent distance perception technology.
Background
With the rollout of the two-child policy, there is a growing need for more intelligent techniques that ease the strain on parents while reducing risks to the baby. An infant's organs are not fully developed, its self-care ability is poor, and it cannot express emotion in words; at the same time, parents cannot accompany the infant at all times, so the infant is exposed to a higher risk of accidents. There is therefore a need to strengthen the monitoring of infants to reduce this risk, and this application proposes an intelligent infant monitoring scheme based on stereoscopic vision using cameras.
Infant monitors currently on the market simply connect a sensor or camera remotely to a device such as a mobile phone. Parents observe the infant's condition through the terminal device, while temperature, humidity and air quality are monitored by sensors. Some non-contact, camera-based monitors prevent a child from kicking off the quilt: using an infrared scope, the state of the quilt on the child's body is judged from the difference in infrared energy emitted when the child is covered, uncovered or partially covered. The known infant monitoring software BabbyCam uses image recognition to detect whether the infant's face appears in the crib, and parents can watch the infant's real-time state through a mobile phone app.
To keep an infant from rolling over the edge of the bed, the main existing approach uses a camera to monitor whether the infant is in bed. The main idea is: images captured by the camera are saved periodically and the difference between two adjacent images is compared; when the number of changed points reaches a certain level, an alarm is issued. Another proposal mounts an infrared sensing device on the bed body to warn when the infant tries to climb out of the crib. However, these schemes for preventing the infant from rolling over the bed edge cannot act preventively, are costly, and are unsuitable for wide adoption. It has also been proposed to segment the infant from the background with a conventional image segmentation method, and then monitor the infant according to the spatial distance between the segmented infant image and the crib.
To separate the target from the rest of the image, the main traditional image segmentation methods are edge-based, graph-theory-based and clustering-based segmentation.
1) Edge-based segmentation. Based on the principle that the gray values on the two sides of the contour pixels of different object boundaries change markedly, the contour of an object is found by testing whether the second derivative of the gray image crosses zero. Common edge-detection differential operators include the Sobel and Laplace operators, but these methods cannot guarantee that the segmented object edges are continuous or form a closed loop, and they perform poorly on small regions of the image.
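The zero-crossing test described above can be illustrated with a minimal sketch (a generic background illustration, not part of the claimed method; the kernel and helper names are ours). A synthetic step edge produces a positive then negative Laplacian response, and the sign change marks the edge:

```python
import numpy as np

# 3x3 discrete Laplacian kernel (approximates the second derivative).
LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]])

def laplacian_response(gray):
    """Convolve a grayscale image with the 3x3 Laplacian kernel (valid region only)."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(gray[i:i+3, j:j+3] * LAPLACIAN)
    return out

def zero_crossings(resp):
    """Mark pixels where the Laplacian response changes sign horizontally."""
    return (resp[:, :-1] * resp[:, 1:]) < 0

# Synthetic image with a vertical step edge between columns 3 and 4.
img = np.zeros((6, 8))
img[:, 4:] = 10.0
resp = laplacian_response(img)
edges = zero_crossings(resp)  # True only along the step edge
```

The single True column in `edges` sits exactly where the response flips from +10 to -10, which is the zero-crossing criterion the paragraph describes.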
2) Graph-theory-based segmentation. Using graph theory, the image is regarded as a weighted undirected graph and is divided into different subsets by iterating over the weight graph. The GraphCut algorithm adds two so-called terminal vertices on this basis and converts the segmentation problem into a minimum-cut problem. There are also algorithms such as GrabCut that use the gray level of the image together with region boundary information.
3) Clustering-based segmentation. Pixels are first grouped into superpixel blocks, and regions are then fused. The Meanshift algorithm, based on kernel density gradient estimation, forms superpixel blocks by building a high-dimensional sphere and iterating along the mean-shift vector toward the density maximum of the image. The SLIC (simple linear iterative clustering) algorithm is an application of the K-means algorithm to image segmentation: each pixel becomes a five-dimensional vector, and a distance metric is constructed from the spatial distance between pixels and the similarity of their colors. However, clustering algorithms are sensitive to the choice of initial points, which affects the segmentation result, and their robustness to interference is poor.
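The five-dimensional distance metric mentioned for SLIC can be sketched as follows (an assumed form of the metric combining color and spatial distance, with an illustrative compactness weight `m`; this is background illustration, not code from the patent):

```python
import math

def slic_distance(p, q, S, m=10.0):
    """SLIC-style distance between two pixels given as (l, a, b, x, y)
    five-dimensional vectors. Color distance and spatial distance are
    combined as described above; S is the expected superpixel grid
    spacing and m weights spatial compactness against color similarity."""
    d_color = math.sqrt(sum((p[i] - q[i]) ** 2 for i in range(3)))
    d_space = math.sqrt((p[3] - q[3]) ** 2 + (p[4] - q[4]) ** 2)
    return math.sqrt(d_color ** 2 + (d_space / S) ** 2 * m ** 2)
```

With a larger `m`, spatial proximity dominates and the superpixels become more compact; with a smaller `m`, color similarity dominates and superpixels follow image boundaries more closely.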
Disclosure of Invention
The invention provides an infant monitoring method based on intelligent distance perception technology, aiming to optimize the image semantic segmentation method and to improve the efficiency of the infant monitoring system by means of stereoscopic vision.
To achieve this purpose, the infant monitoring method based on intelligent distance perception technology provided by the invention comprises the following steps. Step one: capture the environment around the infant on video. Step two: segment the infant in the acquired image. Step three: judge the distance between the segmented infant and the bed, so as to judge whether the infant is in danger of falling; when the judged distance is smaller than a preset threshold, raise an alarm. The acquired image is segmented by a semantic segmentation method.
Furthermore, the acquired image is segmented by a semantic segmentation method based on a fully convolutional neural network encoder-decoder structure: a deep convolutional neural network is improved, the convolution-pooling part of the VGG-19 network is used as the feature extractor of the image, the pooling-index structure is improved, and the infant image is segmented semantically.
Furthermore, the fully convolutional neural network is divided into two stages, encoding and decoding. In the encoding stage, the architecture of the deep convolutional neural network VGG-19 model is used with the fully connected layers removed, and the information recording the max-pooling positions during pooling is retained; each Encoder layer corresponds to a Decoder layer. In the decoding stage, the maximum-value information of the corresponding positions is restored using the max-pooling information saved during encoding; the image is upsampled in this way until the resolution of the feature map is enlarged to the same size as the original image. A softmax classifier attached to the back end of the decoder network generates a class probability for each pixel, forming a probability map whose number of channels equals the number of classes, and the predicted class is the one with the maximum network output probability.
Furthermore, during upsampling, the weight at the index position recorded in the pooling index is set equal to the value being upsampled, and the weight at the other positions covered by the pooling kernel is set to half of the maximum feature value, so that although pooling filtered the image down to the maximum-feature positions, the features around each maximum are also restored.
Further, ranging is performed with a binocular camera. In the projection geometry of the binocular camera on the X-O-Y plane of the world coordinate system, O_R and O_T are the optical centers of the two cameras; the projections of any spatial point P onto the two image planes are p and p'; X_R and X_T are their distances from the origins of the respective image planes; the distance between the two cameras, i.e. the baseline length, is b; Z is the distance from point P to the baseline; and F is the distance from the optical center to the image plane, i.e. the focal length. Since \Delta Ppp' \sim \Delta PO_RO_T, the following relationship holds:
\frac{b - (X_R - X_T)}{Z - F} = \frac{b}{Z}
Letting D = X_R - X_T, we obtain:
Z = \frac{b \cdot F}{D}
similarly, other coordinate information of the point P can be obtained:
X = \frac{b \cdot x_R}{D}
Y = \frac{b \cdot y_R}{D}
where (x_R, y_R) and (x_T, y_T) are the coordinates of the projections of the spatial point P on the two image planes of the binocular camera, and D is the disparity of the two images. The distance of the object from the camera can thus be calculated from the disparity. The camera is then calibrated; calibration mainly acquires the intrinsic parameters, extrinsic parameters and distortion coefficients of the camera. The spatial three-dimensional information of the object is computed by establishing a geometric model of camera imaging through calibration; stereo calibration computes the geometric relationship between the two cameras at different positions in space, obtains the rotation and translation between them, and finally yields the three-dimensional spatial information of the object.
Furthermore, a MATLAB toolbox is used for calibration, computing the intrinsic matrix, radial distortion, tangential distortion, rotation matrix and translation vector of the camera. During calibration, the selected calibration-board images are distributed evenly over all corners of the image, so that the radial and tangential distortion of the lens can be computed effectively.
Further, the distances of different objects in the image are calculated as follows:
First, the specific coordinate values of a spatial object are obtained through the binocular camera:
\begin{cases} u - u_0 = f_x \, X_w / Z_w \\ v - v_0 = f_x \, Y_w / Z_w \\ d = u_L - u_R = f_x \, b / Z_w \end{cases}
where u_L, u_R denote the pixel distances of the projections in the left and right cameras from the left edge of their respective image planes; f_x is an intrinsic value obtained through calibration; d = u_L - u_R; and b is the optical-center distance of the two cameras. Taking the optical center of the left camera as the origin of the world coordinate system, u and v denote the coordinates of the point in the pixel coordinate system, and u_0, v_0 are the coordinates of the center origin of the left camera image plane in the pixel coordinate system. Converting the simultaneous equations into homogeneous coordinates gives the matrix form:
\begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & -u_0 \\ 0 & 1 & 0 & -v_0 \\ 0 & 0 & 0 & f_x \\ 0 & 0 & -\frac{1}{T_x} & \frac{u_0 - u_0'}{T_x} \end{bmatrix} \begin{bmatrix} u \\ v \\ d \\ 1 \end{bmatrix}
where X = W \cdot X_w, Y = W \cdot Y_w, Z = W \cdot Z_w. Since the x-component of the translation vector T between the binocular cameras represents the amount of translation of camera 2 in the horizontal direction of camera 1, b = -T_x. Let u_0, u_0' be the coordinate values of the horizontal component of the origins of the left and right camera image planes in the pixel coordinate system; since the difference between the two values is small, it is taken as approximately 0, and the following equation is obtained:
Let
Q = \begin{bmatrix} 1 & 0 & 0 & -u_0 \\ 0 & 1 & 0 & -v_0 \\ 0 & 0 & 0 & f_x \\ 0 & 0 & \frac{1}{b} & 0 \end{bmatrix}
Then it can be deduced that:
\begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} = Q \begin{bmatrix} u \\ v \\ d \\ 1 \end{bmatrix}
Q is usually called the reprojection matrix. Through the 4 x 4 reprojection matrix Q, two-dimensional plane coordinates can be converted into spatial three-dimensional coordinates, the coordinates of the spatial point being (X_w = X/W, Y_w = Y/W, Z_w = Z/W). By combining stereoscopic vision with image segmentation, the spatial distance between different objects in the image can be determined.
Furthermore, the expression features of the infant are detected by a convolutional neural network model to judge the infant's state; when crying occurs, the alarm unit is triggered to give a timely warning.
Compared with the prior art, the beneficial effect of the method is that the pooling-index method improves the upsampling part of the image segmentation algorithm, so that more feature information is obtained during upsampling and the accuracy of semantic segmentation along image boundary contours is improved. Experiments show that the pooling-index method outperforms currently popular semantic segmentation algorithms in edge-contour accuracy; meanwhile, a stereoscopic vision algorithm intelligently senses the actual distance of the infant in the real scene, improving the efficiency of infant monitoring.
Drawings
FIG. 1 is a schematic diagram of an infant monitoring protocol arrangement according to the present application;
FIG. 2 is a general flow chart of the operation of the infant monitoring system set forth herein;
FIG. 3 is a network architecture employed by the present application;
FIG. 4 is an improved pooling process proposed herein;
FIG. 5 is a flow chart of the binocular camera stereo ranging of the present application;
FIG. 6 is a two-dimensional block diagram of the binocular range finding of the present application;
FIG. 7 is a schematic view of the epipolar constraint of the present application;
FIG. 8 is an original image captured by a binocular camera according to the present application;
FIG. 9 is a corrected image of the present application;
FIG. 10 is a comparison of an original image and a semantically segmented image according to the present application;
FIG. 11 is a line inspection of the baby contour convex hull and bed contour of the present application;
FIG. 12 is a graph of experimental data and error calculations for the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in figs. 1 to 11, in this application a camera is mounted above the infant lying in bed and used to capture the environment around the infant. Whether the infant is in danger of falling is judged from the distance between the infant and the bed in the image. A semantic segmentation method is adopted to classify the target object pixel by pixel, rather than marking its position with a detection box (Bounding Box) as in image detection; pixel-level segmentation captures the positional relationship between the target subject and other objects in the image more precisely. Judging the distance from the infant's pixels to the bed is more accurate than judging it from a detection box, so this method reduces misjudgment of the actual distance.
To design the critical alarm condition, the actual height is measured with the camera and the actual minimum distance is obtained by the similar-triangle principle, so as to judge whether the infant's state is dangerous. Meanwhile, the expression features of the infant are detected by a convolutional neural network model to judge the infant's state; when crying occurs, an early warning is given promptly to prevent accidents.
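The similar-triangle conversion from a pixel distance to an actual distance can be sketched as follows (function name and all numeric values are illustrative; the calibrated focal length and measured depth of the actual system are not given in this paragraph):

```python
def pixel_to_real_distance(pixel_dist, depth, focal_px):
    """Convert an on-image pixel distance to a real-world distance at a
    given depth, by similar triangles: real / pixel = depth / focal length.
    focal_px is the focal length expressed in pixels."""
    return pixel_dist * depth / focal_px

# E.g. 50 px measured between the infant and the bed edge, at 1400 mm
# depth, with a 700 px focal length (illustrative numbers only):
gap = pixel_to_real_distance(50.0, 1400.0, 700.0)  # 100.0 mm
```

The alarm condition then reduces to comparing `gap` against the preset threshold.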
The method is based on a fully convolutional network (FCN) encoder-decoder structure: a deep convolutional neural network is improved, the convolution-pooling part of the VGG-19 network is used as the image feature extractor, the pooling-index structure is improved, and the infant image is segmented semantically. The minimum pixel distance in the image is obtained by image operations, the distance of the infant from the camera is then obtained through the stereoscopic vision of the binocular camera, and finally the actual size of the minimum boundary distance is obtained indirectly.
The design of the monitoring system is described in more detail below, beginning with the image segmentation method. In this application, images of the specific scene contain only three categories: infant, bed and background. The target objects are of moderate size relative to the whole image, so the multi-scale problem of targets in images does not need to be handled; the aim is to make the model process the segmentation task as fast as possible while keeping a certain fine granularity. Therefore, the design uses a pretrained VGG network as the backbone. In the FCN model, the feature maps of the encoding stage are fused with the upsampled feature maps, and a skip-connection structure reduces the coarseness of the segmentation. However, in the pooling layers of the convolution stage, downsampling continually loses semantic information of the image, and the resulting segmentation is coarse.
The network adopted in this application is divided into two stages: an encoding (Encoder) stage and a decoding (Decoder) stage. In the encoding stage, the architecture of the deep convolutional neural network VGG-19 model is used with the fully connected layers removed, and information recording the max-pooling positions during pooling is retained. Each Encoder layer corresponds to a Decoder layer. In the decoding stage, the maximum-value information at the corresponding positions is restored using the max-pooling information saved during encoding; in this way the image is upsampled and the feature-map resolution is enlarged to the size of the original image. A softmax classifier attached to the back end of the decoder network generates a class probability for each pixel, forming a probability map with as many channels as classes; the predicted class is the one with the maximum output probability. In the decoding stage, the image must be upsampled so that it can be restored to the same size as the original. There are generally three ways to upsample. One is interpolation, e.g. bilinear interpolation, which scales the image directly with an interpolation algorithm. Another is transposed convolution (Transposed Convolution), which upsamples the image and restores the convolved feature map to the spatial size of the original image. The last is the unpooling operation, which keeps the maximum-value information from the original feature map and enlarges the resolution by zero-filling the other positions.
When transposed convolution is used for upsampling, the exact positional relationships between pixels must be learned through matrix computation, which adds many parameters to the network, greatly affects its performance, and increases the space overhead of storing low-level feature-map information. Another method restores pixel values during upsampling by storing the positions of the features kept during pooling and setting the nearest-neighbor positions to the same value. This simply restores the positional information of the original image, but because adjacent positions are set to the same value, the restored image cannot effectively describe or distinguish edge details, and overall detail is poor.
This application adopts the pooling-index method to improve upsampling and thereby the segmentation result. During upsampling, the weight at the index position recorded in the pooling index is set equal to the value being upsampled.
At the other positions covered by the pooling kernel, the weight is set to half of the maximum feature value. In this way, although pooling filtered the image down to the maximum-feature positions, the features around each maximum are also restored, so more feature information is obtained during upsampling and semantic segmentation is more accurate along the boundary contours of the image.
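The pooling-index upsampling described above can be sketched in a minimal form (a NumPy toy on a single-channel map; in the real network this runs per channel on learned feature maps, and the function names are ours):

```python
import numpy as np

def max_pool_with_index(x, k=2):
    """k x k max pooling that also records the argmax position inside
    each window (the pooling index)."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    index = np.zeros((h // k, w // k), dtype=int)  # flat index within window
    for i in range(h // k):
        for j in range(w // k):
            win = x[i*k:(i+1)*k, j*k:(j+1)*k]
            index[i, j] = int(win.argmax())
            pooled[i, j] = win.max()
    return pooled, index

def unpool_half_fill(pooled, index, k=2):
    """Upsampling as described above: the stored index position receives
    the pooled value; the other positions of the pooling kernel receive
    half of that maximum (instead of the zeros of plain unpooling)."""
    h, w = pooled.shape
    out = np.zeros((h * k, w * k))
    for i in range(h):
        for j in range(w):
            out[i*k:(i+1)*k, j*k:(j+1)*k] = pooled[i, j] / 2.0
            di, dj = divmod(index[i, j], k)
            out[i*k + di, j*k + dj] = pooled[i, j]
    return out

x = np.array([[1., 2, 0, 1],
              [3, 4, 1, 0],
              [0, 0, 2, 1],
              [1, 0, 1, 8]])
pooled, idx = max_pool_with_index(x)
restored = unpool_half_fill(pooled, idx)
```

Compared with zero-filling unpooling, the half-max fill keeps a coarse version of the surroundings of each maximum, which is the extra feature information the text claims improves boundary contours.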
The ranging method adopted in this design is described next. First, the type of ranging camera is selected; considering the ranging principle, precision, hardware cost and other factors, a binocular-camera-based approach is finally chosen.
Consider the projection geometry of the binocular camera on the X-O-Y plane of the world coordinate system. O_R and O_T are the optical centers of the two cameras. The projections of any spatial point P onto the two image planes are p and p'; X_R and X_T are their distances from the origins of the respective image planes. The distance between the two cameras, i.e. the baseline length, is b. Z is the distance from point P to the baseline. F is the distance from the optical center to the image plane, i.e. the focal length. It can be seen that \Delta Ppp' \sim \Delta PO_RO_T, so the following relationship holds:
\frac{b - (X_R - X_T)}{Z - F} = \frac{b}{Z}
Letting D = X_R - X_T, we obtain:
Z = \frac{b \cdot F}{D}
similarly, other coordinate information of the point P can be obtained:
X = \frac{b \cdot x_R}{D}
Y = \frac{b \cdot y_R}{D}
where (x_R, y_R) and (x_T, y_T) are the coordinates of the projections of the spatial point P on the two image planes of the binocular camera, and D is the disparity of the two images. The distance of the object from the camera can be calculated from the disparity.
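The relation Z = bF/D derived above can be checked numerically with a small sketch (the focal length, baseline and disparity values are illustrative only, not the calibrated values of the actual system):

```python
def depth_from_disparity(f, b, d):
    """Distance Z of a point from the baseline, from Z = f * b / d as
    derived above. f: focal length in pixels, b: baseline, d: disparity
    in pixels (must be positive for a finite depth)."""
    if d <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return f * b / d

# Illustrative numbers: a 700 px focal length, 60 mm baseline and
# 30 px disparity give a depth of 1400 mm.
Z = depth_from_disparity(f=700.0, b=60.0, d=30.0)
```

Note the inverse relationship: halving the disparity doubles the computed depth, which is why ranging precision degrades for distant objects.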
The camera is then calibrated. Calibration mainly acquires the intrinsic parameters, extrinsic parameters and distortion coefficients of the camera; the spatial three-dimensional information of the object is computed by establishing a geometric model of camera imaging through calibration. Stereo calibration computes the geometric relationship between the two cameras at different positions in space, obtains the rotation and translation between them, and finally yields the three-dimensional spatial information of the object. Calibration methods include manual calibration, OpenCV-function-based and MATLAB-toolbox-based approaches. MATLAB is more accurate and more robust than the other methods, so this design calibrates the camera with the MATLAB toolbox, computing parameters such as the intrinsic matrix (Intrinsic Matrix), radial distortion (Radial Distortion), tangential distortion (Tangential Distortion), rotation matrices (Rotation Matrices) and translation vectors (Translation Vectors). During calibration, the selected calibration-board images are distributed evenly over all corners of the image, so that the radial and tangential distortion of the lens can be computed effectively. The obtained parameters are used to correct the image and avoid optical distortion. In practice, the images captured by the binocular camera should be strictly aligned horizontally; however, because of mechanical errors a perfectly aligned binocular camera does not exist, so the two captured images are stereo-rectified. By studying the relationship between the two camera images, the three-dimensional scene acquires a geometric relationship on the two-dimensional image planes, which is called epipolar geometry.
First, the specific coordinate values of a spatial object are obtained through the binocular camera:
\begin{cases} u - u_0 = f_x \, X_w / Z_w \\ v - v_0 = f_x \, Y_w / Z_w \\ d = u_L - u_R = f_x \, b / Z_w \end{cases}
where u_L, u_R denote the pixel distances of the projections in the left and right cameras from the left edge of their respective image planes. f_x is an intrinsic value obtained through calibration. d = u_L - u_R. b is the optical-center distance of the two cameras. Taking the optical center of the left camera as the origin of the world coordinate system, u and v denote the coordinates of the point in the pixel coordinate system, and u_0, v_0 are the coordinates of the center origin of the left camera image plane in the pixel coordinate system. Converting the simultaneous equations into homogeneous coordinates gives the matrix form:
\begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & -u_0 \\ 0 & 1 & 0 & -v_0 \\ 0 & 0 & 0 & f_x \\ 0 & 0 & -\frac{1}{T_x} & \frac{u_0 - u_0'}{T_x} \end{bmatrix} \begin{bmatrix} u \\ v \\ d \\ 1 \end{bmatrix}
where X = W \cdot X_w, Y = W \cdot Y_w, Z = W \cdot Z_w. Since the x-component of the translation vector T between the binocular cameras represents the amount of translation of camera 2 in the horizontal direction of camera 1, b = -T_x. Let u_0, u_0' be the coordinate values of the horizontal component of the origins of the left and right camera image planes in the pixel coordinate system; since the difference between the two values is small, it is taken as approximately 0. The following equation is obtained:
order to
Figure BDA0002504858990000103
Figure BDA0002504858990000104
Then it can be deduced that:
Figure BDA0002504858990000111
usually Q is called a reprojection matrix (reprojection matrix). By means of a 4 x 4 reprojection matrix Q, we can convert two-dimensional planar coordinates into spatial three-dimensional coordinates. The coordinates of the space point are (X)w=X/W,Yw=Y/W,ZwZ/W). By combining stereo vision with image segmentation, the spatial position distance between different objects in the image can be determined.
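Reprojection through $Q$ can be sketched numerically in a few lines of pure Python (the focal length, baseline, principal point, and pixel values below are illustrative calibration numbers, not the patent's):

```python
# Reproject a pixel (u, v) with disparity d to 3-D through the 4x4
# reprojection matrix Q. f_x, b, u0, v0 are example values only.

f_x, b = 800.0, 0.06          # focal length in pixels, baseline in metres
u0, v0 = 320.0, 240.0         # principal point of the left image

Q = [
    [1.0, 0.0, 0.0, -u0],
    [0.0, 1.0, 0.0, -v0],
    [0.0, 0.0, 0.0, f_x],
    [0.0, 0.0, 1.0 / b, 0.0],
]

def reproject(u, v, d):
    """Return world coordinates (Xw, Yw, Zw) for pixel (u, v) at disparity d."""
    vec = [u, v, d, 1.0]
    X, Y, Z, W = [sum(row[i] * vec[i] for i in range(4)) for row in Q]
    return X / W, Y / W, Z / W   # divide out the homogeneous coordinate

# Depth must agree with the closed form Zw = f_x * b / d.
Xw, Yw, Zw = reproject(400.0, 260.0, 16.0)
assert abs(Zw - f_x * b / 16.0) < 1e-9   # 3.0 m
```

The same $Q$ is what OpenCV's stereo rectification produces for its `reprojectImageTo3D` step, which is one reason this homogeneous formulation is convenient in practice.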
In the invention, an intelligent distance-perception infant monitoring system is designed based on the stereoscopic-vision principle. A pooling-index method is proposed to improve the upsampling part of the image segmentation algorithm, so that more feature information is recovered during upsampling and the accuracy of semantic segmentation along object boundary contours is improved. Experiments show that the pooling-index method outperforms currently popular semantic segmentation algorithms in edge-contour accuracy; at the same time, a stereoscopic-vision algorithm intelligently perceives the actual distance of the baby in the real scene, improving the efficiency of infant monitoring.
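The described upsampling rule, restore the pooled value at the remembered max position and fill the rest of each pooling kernel with half of that value rather than zeros, can be sketched on a toy single-channel map (2×2 kernels; the function names and example values are mine, not the patent's):

```python
# Max pooling with stored indices, plus the pooling-index upsampling
# described above: the remembered max position gets the pooled value
# back exactly, the other positions of each 2x2 kernel get half of it.

def max_pool_2x2(fmap):
    """Return (pooled map, index map) for non-overlapping 2x2 max pooling."""
    pooled, idx = [], []
    for i in range(0, len(fmap), 2):
        prow, irow = [], []
        for j in range(0, len(fmap[0]), 2):
            cells = [(fmap[i + di][j + dj], (i + di, j + dj))
                     for di in (0, 1) for dj in (0, 1)]
            val, pos = max(cells)
            prow.append(val)
            irow.append(pos)
        pooled.append(prow)
        idx.append(irow)
    return pooled, idx

def unpool_with_index(pooled, idx, h, w):
    """Upsample back to h x w using the stored indices and half-max fill."""
    out = [[0.0] * w for _ in range(h)]
    for pi, row in enumerate(pooled):
        for pj, val in enumerate(row):
            for di in (0, 1):
                for dj in (0, 1):
                    out[2 * pi + di][2 * pj + dj] = val / 2.0
            mi, mj = idx[pi][pj]
            out[mi][mj] = val          # restore the exact max position
    return out

fmap = [[1.0, 3.0],
        [2.0, 0.0]]
pooled, idx = max_pool_2x2(fmap)
# The max (3.0 at (0, 1)) is restored exactly; its neighbours get
# half the max instead of the zeros plain index unpooling would leave.
assert unpool_with_index(pooled, idx, 2, 2) == [[1.5, 3.0], [1.5, 1.5]]
```

Compared with standard SegNet-style unpooling, which leaves all non-max positions at zero, the half-max fill keeps some context around each restored maximum, which is what the text credits for the sharper boundary contours.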
The experimental environment of fig. 12 is briefly described below. The camera used in the experiment is a 1080p binocular camera module with a factory focal length of 2.6 mm and a field of view of 100°. The sensor is a 1/3" CMOS with a pixel size of 2.0 µm (H) × 2.0 µm (V). The adjustable range of the distance between the lens centre points is 25 mm–70 mm. The experimental equipment runs the Ubuntu 18.04.3 LTS operating system on an Intel i5-9600K processor with a main frequency of 3.70 GHz and 16 GB of RAM. The GPU is a GeForce RTX 2070 with 8 GB of video memory; the CUDA version is 9.0.176, the cuDNN version is 7.3.1, and the tensorflow-gpu version is 1.12.3. Over the experimental data, the average error between the real minimum distance and the calculated value is 2.52%, and the error during the experiments stays essentially within three millimetres, which meets the standard for judging the relative position of the baby and the bed. When the shortest distance falls below a preset size, an early warning is issued, improving the efficiency of infant monitoring.
The technical means not described in detail in the present application are known techniques.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims (8)

1. An infant monitoring method based on an intelligent distance perception technology is characterized by comprising the following steps:
the method comprises the following steps: step one: capturing images of the environment around the infant; step two: segmenting the infant in the acquired image; step three: judging the distance between the segmented infant and the bed so as to judge whether the infant is in danger of falling, and raising an alarm when the judged distance is smaller than a preset threshold value; wherein the acquired image is segmented by a semantic segmentation method.
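The decision logic of step three reduces to a per-frame threshold check. A minimal sketch, assuming a hypothetical 0.05 m threshold and made-up distance readings (segmentation and distance measurement are treated as already done):

```python
# Step-three decision logic of claim 1: alarm when the measured minimum
# baby-to-bed-edge distance drops below a preset threshold.
# The 0.05 m threshold and the distance sequence are illustrative only.

ALARM_THRESHOLD_M = 0.05

def check_frame(min_distance_m):
    """Return True if this frame should trigger a fall-risk alarm."""
    return min_distance_m < ALARM_THRESHOLD_M

distances = [0.40, 0.21, 0.08, 0.03]   # metres, hypothetical per-frame values
assert [check_frame(d) for d in distances] == [False, False, False, True]
```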
2. The infant monitoring method based on intelligent distance perception technology as claimed in claim 1, wherein the acquired image is segmented by a semantic segmentation method based on a fully convolutional encoder-decoder structure; the semantic segmentation of the infant image is performed by an improved deep convolutional neural network that uses the convolution and pooling part of the VGG-19 network as the feature extractor of the image, the improvement residing in the pooling-index structure.
3. The infant monitoring method based on the intelligent distance perception technology according to claim 2, characterized in that the full convolutional neural network is divided into two stages, an encoding stage and a decoding stage; the encoding stage uses the deep convolutional neural network VGG-19 model architecture with the fully connected layers removed, and the positions of the maxima are stored during each pooling operation; each layer of the Encoder corresponds to a layer of the Decoder; in the decoding stage, the stored max-pooling information is used to restore the maximum value to its original position, the image is upsampled in this way until the resolution of the feature map is enlarged to that of the original image, a softmax classifier connected to the rear end of the decoder network generates a class probability for each pixel, forming a probability map with as many channels as classes, and the class corresponding to the maximum network output probability is taken as the prediction.
4. The infant monitoring method based on the intelligent distance perception technology according to claim 3, wherein in the upsampling process, by means of the pooling index, the weight at the recorded index position in the feature map is set equal to the value to be upsampled, and the weights at the other positions of the corresponding pooling kernel are set to half of that maximum feature value, so that the features around the maximum-feature position are restored in addition to the maximum-position information that pooling preserved.
5. The infant monitoring method based on the intelligent distance perception technology according to claim 3, characterized in that distance measurement is performed by means of a binocular camera; in the projection geometry of the binocular camera on the X-O-Y plane of the world coordinate system, $O_R$ and $O_T$ are the optical centres of the two cameras, the projections of an arbitrary spatial point $P$ on the two image planes are $p$ and $p'$, and $X_R$, $X_T$ are the distances of the two projections from the origins of their respective image planes; the distance between the two cameras, i.e. the baseline length, is $b$, $Z$ is the distance from the point $P$ to the baseline, and $f$ is the distance from the optical centre to the image plane, i.e. the focal length; since $\Delta P p p' \sim \Delta P O_R O_T$, the following relationship holds:

$$\frac{b - (X_R - X_T)}{b} = \frac{Z - f}{Z}$$

Letting $D = X_R - X_T$, the following can be obtained:

$$Z = \frac{f\, b}{D}$$

Similarly, the other coordinate information of the point $P$ can be obtained:

$$X = \frac{b\, x_R}{D}$$

$$Y = \frac{b\, y_R}{D}$$

wherein $(x_R, y_R)$ and $(x_T, y_T)$ are the coordinates of the projections of the spatial point $P$ on the two image planes of the binocular camera, and $D$ is the disparity between the two images. The distance of the object from the camera can thus be calculated from the disparity; the camera is then calibrated, the calibration mainly serving to acquire the intrinsic parameters, extrinsic parameters and distortion coefficients of the camera; the spatial three-dimensional information of the object is calculated by establishing a geometric model of camera imaging through calibration, and stereo calibration computes the geometric relationship between the two cameras at different spatial positions, obtains the rotation and translation between them, and finally yields the three-dimensional spatial information of the object.
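The similar-triangle relations of claim 5 can be checked numerically. A sketch with illustrative values (a 2.6 mm focal length, a 60 mm baseline, and made-up image-plane projections, all in metres; none of these numbers come from the patent's experiments):

```python
# Similar-triangle triangulation of claim 5: depth from disparity
# D = X_R - X_T, then the remaining coordinates from the left-image
# projection. All numeric values below are example inputs.

def triangulate(x_r, y_r, x_t, f, b):
    """Return (X, Y, Z) of point P from its projections on the two image planes."""
    D = x_r - x_t                 # disparity between the two images
    Z = f * b / D                 # depth from the similar-triangle relation
    X = b * x_r / D
    Y = b * y_r / D
    return X, Y, Z

X, Y, Z = triangulate(x_r=0.004, y_r=0.002, x_t=0.001, f=0.0026, b=0.06)
# D = 0.003 m, so Z = 0.0026 * 0.06 / 0.003 = 0.052 m
assert abs(Z - 0.052) < 1e-9
```

Note the inverse relationship: halving the disparity doubles the computed depth, which is why depth error grows with distance from the camera.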
6. The infant monitoring method based on the intelligent distance perception technology according to claim 5, characterized in that calibration is performed with the MATLAB toolbox, the intrinsic matrix, radial distortion, tangential distortion, rotation matrix and translation vector of the camera being obtained by calculation; during the calibration process, the selected calibration-board images are distributed evenly across the corners of the image, so that the radial and tangential distortion of the lens can be calculated effectively.
7. The infant monitoring method based on intelligent distance perception technology as claimed in claim 5, wherein the calculation of the distance between different objects in the image is performed by the following method:
firstly, obtaining specific coordinate values of a space object through a binocular camera:
$$Z_w = \frac{f_x\, b}{d}, \qquad X_w = \frac{b\,(u - u_0)}{d}, \qquad Y_w = \frac{b\,(v - v_0)}{d}$$

wherein $u_L$ and $u_R$ represent the pixel distances of the projected point from the left edge of the left and right image planes; $f_x$ is an intrinsic parameter obtained through calibration; $d = u_L - u_R$ is the disparity; $b$ is the distance between the two optical centres; the optical centre of the left camera is taken as the origin of the world coordinate system, $(u, v)$ represent the coordinates of the point in the pixel coordinate system, and $(u_0, v_0)$ are the pixel coordinates of the centre of the left camera image plane; the equations are combined and converted into homogeneous coordinates, obtaining the matrix form:

$$\begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & -u_0 \\ 0 & 1 & 0 & -v_0 \\ 0 & 0 & 0 & f_x \\ 0 & 0 & -\dfrac{1}{T_x} & \dfrac{u_0 - u_0'}{T_x} \end{bmatrix} \begin{bmatrix} u \\ v \\ d \\ 1 \end{bmatrix}$$

wherein $X = W X_w$, $Y = W Y_w$, $Z = W Z_w$; since the $x$ component $T_x$ of the translation between the binocular cameras represents the displacement of camera 2 along the horizontal direction of camera 1, $b = -T_x$; letting $u_0$ and $u_0'$ be the horizontal pixel coordinates of the image-plane origins of the left and right cameras, and noting that the difference between the two values is small, $(u_0 - u_0')/T_x \approx 0$; letting

$$Q = \begin{bmatrix} 1 & 0 & 0 & -u_0 \\ 0 & 1 & 0 & -v_0 \\ 0 & 0 & 0 & f_x \\ 0 & 0 & \dfrac{1}{b} & 0 \end{bmatrix}$$

it can then be deduced that:

$$\begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} = Q \begin{bmatrix} u \\ v \\ d \\ 1 \end{bmatrix}$$

$Q$ is usually called the reprojection matrix; by means of the 4 × 4 reprojection matrix $Q$, two-dimensional plane coordinates can be converted into spatial three-dimensional coordinates, the coordinates of the spatial point being $(X_w = X/W,\ Y_w = Y/W,\ Z_w = Z/W)$; by combining stereo vision with image segmentation, the spatial position distance between different objects in the image can be determined.
8. The infant monitoring method based on the intelligent distance perception technology as claimed in claim 1, wherein the expression characteristics of the infant are detected through a convolutional neural network model to judge the state of the infant, and when crying and screaming occur, an alarm unit is triggered to give an early warning in time.
CN202010443078.6A 2020-05-22 2020-05-22 Infant monitoring method based on intelligent distance perception technology Pending CN111931550A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010443078.6A CN111931550A (en) 2020-05-22 2020-05-22 Infant monitoring method based on intelligent distance perception technology


Publications (1)

Publication Number Publication Date
CN111931550A true CN111931550A (en) 2020-11-13

Family

ID=73317273


Country Status (1)

Country Link
CN (1) CN111931550A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354975A (en) * 2015-08-26 2016-02-24 马晓阳 Infanette monitoring and alarming system capable of real time communication with mobile terminal
CN108009629A (en) * 2017-11-20 2018-05-08 天津大学 A kind of station symbol dividing method based on full convolution station symbol segmentation network
CN108682112A (en) * 2018-05-15 2018-10-19 京东方科技集团股份有限公司 A kind of infant monitoring device, terminal, system, method and storage medium
CN110049471A (en) * 2018-01-16 2019-07-23 南京邮电大学 A kind of long-range infant monitoring system based on wireless sense network
CN110334678A (en) * 2019-07-12 2019-10-15 哈尔滨理工大学 A kind of pedestrian detection method of view-based access control model fusion


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
大奥特曼打小怪兽: "Spatial coordinate calculation in binocular vision" (in Chinese), blog post, www.cnblogs.com/zyly/p/9373991.html *
路文平 (Lu Wenping): "Face detection and recognition technology and its application in infant food monitoring" (in Chinese), China Masters' Theses Full-text Database *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113034613A (en) * 2021-03-25 2021-06-25 中国银联股份有限公司 External parameter calibration method of camera and related device
CN113034613B (en) * 2021-03-25 2023-09-19 中国银联股份有限公司 External parameter calibration method and related device for camera
CN113538350A (en) * 2021-06-29 2021-10-22 河北深保投资发展有限公司 Method for identifying depth of foundation pit based on multiple cameras
CN113538350B (en) * 2021-06-29 2022-10-04 河北深保投资发展有限公司 Method for identifying depth of foundation pit based on multiple cameras
CN114209509A (en) * 2021-12-30 2022-03-22 常德职业技术学院 Timely response system suitable for mother and infant nurses
CN114209509B (en) * 2021-12-30 2023-04-14 常德职业技术学院 Timely response system suitable for mother and infant nurses

Similar Documents

Publication Publication Date Title
CN111931550A (en) Infant monitoring method based on intelligent distance perception technology
CN110799991B (en) Method and system for performing simultaneous localization and mapping using convolution image transformations
WO2020035002A1 (en) Methods and devices for acquiring 3d face, and computer readable storage media
CN111462172B (en) Three-dimensional panoramic image self-adaptive generation method based on driving scene estimation
CN107274483A (en) A kind of object dimensional model building method
US8917317B1 (en) System and method for camera calibration
CN102714695A (en) Image processing device, image processing method and program
WO2007139067A1 (en) Image high-resolution upgrading device, image high-resolution upgrading method, image high-resolution upgrading program and image high-resolution upgrading system
WO2012028866A1 (en) Physical three-dimensional model generation apparatus
CN111160232B (en) Front face reconstruction method, device and system
JP2015100066A (en) Imaging apparatus, control method of the same, and program
CN111480164A (en) Head pose and distraction estimation
CN107809610B (en) Camera parameter set calculation device, camera parameter set calculation method, and recording medium
CN112083403B (en) Positioning tracking error correction method and system for virtual scene
JP6192507B2 (en) Image processing apparatus, control method thereof, control program, and imaging apparatus
JP5901447B2 (en) Image processing apparatus, imaging apparatus including the same, image processing method, and image processing program
KR20150074544A (en) Method of tracking vehicle
JPWO2019107180A1 (en) Encoding device, coding method, decoding device, and decoding method
CN114998328A (en) Workpiece spraying defect detection method and system based on machine vision and readable storage medium
JP6803570B2 (en) Camera parameter set calculation device, camera parameter set calculation method, and program
Buquet et al. Evaluating the impact of wide-angle lens distortion on learning-based depth estimation
CN117058183A (en) Image processing method and device based on double cameras, electronic equipment and storage medium
CN113647093A (en) Image processing device, 3D model generation method, and program
Wang et al. RGB-guided depth map recovery by two-stage coarse-to-fine dense CRF models
AU2020294259A1 (en) Object association method, apparatus and system, electronic device, storage medium and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201113