CN109766856A - Method for recognizing lactating sow postures with a dual-stream RGB-D Faster R-CNN - Google Patents

Method for recognizing lactating sow postures with a dual-stream RGB-D Faster R-CNN

Info

Publication number
CN109766856A
Authority
CN
China
Prior art keywords
rgb
image
roi
cnn
depth image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910040870.4A
Other languages
Chinese (zh)
Other versions
CN109766856B (en)
Inventor
薛月菊
朱勋沐
郑婵
杨晓帆
陈畅新
王卫星
甘海明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Agricultural University
Original Assignee
South China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Agricultural University filed Critical South China Agricultural University
Priority to CN201910040870.4A priority Critical patent/CN109766856B/en
Publication of CN109766856A publication Critical patent/CN109766856A/en
Application granted granted Critical
Publication of CN109766856B publication Critical patent/CN109766856B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method for recognizing lactating sow postures with a dual-stream RGB-D Faster R-CNN. An end-to-end dual-stream RGB-D Faster R-CNN algorithm that fuses RGB-D image features in the feature extraction stage is proposed, for recognizing five posture classes of lactating sows in free-pen scenes: standing, sitting, prone lying, ventral lying and lateral lying. Based on Faster R-CNN, two CNN networks first extract the RGB image features and the depth image features respectively; then, using the mapping relation of the RGB-D images, a single RPN network generates the regions of interest of both the RGB and the depth feature maps; after region-of-interest pooling, an independent network layer fuses the RGB-D features by concatenation; finally, in the Fast R-CNN stage, an NOC structure continues to convolve the fused features before they are fed to the classifier and the regressor. The invention provides an end-to-end, high-precision, small-model and real-time sow posture recognition method that fuses RGB-D data information, laying a foundation for further analysis of sow behavior.

Description

Method for recognizing the posture of a lactating sow with a dual-stream RGB-D Faster R-CNN
Technical Field
The invention relates to the technical field of multi-modal target detection and recognition in computer vision, and in particular to an end-to-end lactating sow posture recognition method based on the Faster R-CNN detection algorithm, in which RGB-D features extracted from RGB-D data by a two-stream CNN are fused in the feature extraction stage.
Background
The behavior of pigs in a pig farm is an important manifestation of their welfare and health status and directly affects the farm's economic returns. In animal behavior monitoring, compared with traditional manual monitoring and sensor-based technology, automatic recognition by computer vision is a low-cost, efficient and contact-free approach that can continuously provide valuable behavioral information.
In recent years, pig behavior recognition based on computer vision has been studied extensively. For example, the 2018 patent with publication number CN108830144A (Xue Yueju et al., South China Agricultural University) uses depth image data and a Faster R-CNN improved with a residual structure and center loss to automatically recognize five postures of lactating sows in free pens. In 2017, the same team's patent CN201710890676 used depth images, first detecting the sow with a DPM algorithm and then recognizing its posture inside the detection box with a CNN, and their patent CN107527351A uses RGB images and an FCN algorithm to automatically segment the sow from the scene. The 2018 patent CN108717523A of China Agricultural University discloses a sow oestrus behavior detection method based on machine vision. The 2016 patent CN104881636A of China Agricultural University discloses a method and device for recognizing the lying behavior of pigs. In addition, the patent CN107679463A discloses an analysis method that uses machine vision to recognize aggressive behavior in group-housed pigs, and the patent CN107437069A discloses a contour-based method for recognizing pig drinking behavior.
Most existing computer-vision studies of pig behavior use only RGB images or only depth images. This makes robust feature representations hard to obtain in real scenes and quickly runs into an accuracy bottleneck. A camera maps the 3-dimensional world into a 2-dimensional RGB image, which inevitably loses information; using the depth image to compensate for this loss is feasible. Conversely, the depth image lacks the texture and color information of the RGB image and therefore misses the fine detail of the target, so targets with highly similar shapes are hard to recognize accurately from depth alone. In the top-view sow posture recognition task in particular, the height of the target is important evidence for discriminating postures but is not reflected in the RGB image, while some sow postures (for example prone lying and ventral lying) are similar in both height and shape and are hard to distinguish accurately with depth information alone.
Therefore, providing a high-precision method for recognizing lactating sow postures with a dual-stream RGB-D Faster R-CNN is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a method for recognizing the posture of a lactating sow with a dual-stream RGB-D Faster R-CNN, achieving automatic, higher-precision and real-time posture recognition of lactating sows in free pens. The specific scheme for achieving this purpose is as follows:
The invention discloses a method for recognizing the posture of a lactating sow with a dual-stream RGB-D Faster R-CNN, comprising the following steps:
S1, collecting RGB-D video images of lactating sows, including RGB images and depth images, and establishing a sow posture recognition RGB-D video image library;
S2, calculating the mapping relation between the RGB image and the depth image by a camera calibration method;
S3, based on the Faster R-CNN algorithm, convolving the RGB image and the depth image with two separate CNN networks to obtain an RGB image feature map and a depth image feature map;
S4, using only one RPN network, generating regions of interest D-ROI based on the depth image feature map, and generating the corresponding regions of interest RGB-ROI of the RGB image feature map for each D-ROI in one-to-one correspondence through the mapping relation between the RGB image and the depth image;
S5, pooling each D-ROI and RGB-ROI to a fixed size using an ROI Pooling layer, and fusing each group of pooled D-ROI and RGB-ROI feature maps by concatenation fusion;
S6, further extracting fused features from the fused feature map with a Fast R-CNN head of NOC structure, passing them through a global average pooling layer and then a classifier and a regressor, obtaining the dual-stream RGB-D Faster R-CNN sow posture recognition model and outputting the recognition result;
S7, acquiring a training set and a test set from the sow posture recognition RGB-D video image library, training the dual-stream RGB-D Faster R-CNN sow posture recognition model with the training set, testing the model performance with the test set, and finally selecting the best-performing model.
Preferably, the specific process of step S1 is as follows:
S11, fixing an RGB-D sensor to capture top-view RGB-D video images of the pig pen; the RGB-D sensor captures not only the color information of the scene but also the depth information of the target: the captured RGB image contains color, shape and texture information, while the depth image contains clear edge information and depth information robust to lighting;
S12, sampling a training set and a test set from the collected RGB-D video image data, the training set accounting for 70% and the test set accounting for 30% for testing model performance;
S13, preprocessing the depth images in the training and test sets, the preprocessing including filtering, denoising and image enhancement, then annotating the targets in the preprocessed depth images, i.e. marking a bounding box around the target and its posture class, the RGB images needing no processing; and then augmenting the processed training set data by rotation and mirroring for model training.
Preferably, the specific process of step S2 is as follows:
S21, obtaining the intrinsic matrix K_rgb of the RGB image and the intrinsic matrix K_d of the depth image by the camera calibration method, and, for the same checkerboard image used for camera calibration, obtaining the extrinsic matrices R_rgb and T_rgb of the RGB image and the extrinsic matrices R_d and T_d of the depth image; let the homogeneous pixel coordinates of the RGB image be P_rgb = [U_rgb, V_rgb, 1]^T and those of the depth image be P_d = [U_d, V_d, 1]^T; then the rotation matrix R and the translation matrix T that map depth image coordinates to RGB image coordinates are respectively:
R = K_rgb · R_rgb · R_d^(-1) · K_d^(-1)
T = K_rgb · T_rgb - R · K_d · T_d
S22, the mapping relation between the pixel coordinates of the depth image and those of the RGB image is:
P_rgb = (R · Z_d · P_d + T) / Z_rgb
From the above equation, given the coordinates P_d of a point in the depth image, its pixel value Z_d and the shooting distance Z_rgb, the corresponding RGB image coordinates P_rgb can be obtained.
Preferably, the specific process of step S3 is as follows:
In the shared convolutional layer part of the Faster R-CNN, two identical CNN networks are used: the CNN network taking the depth image as input is Conv-D, and the CNN network taking the RGB image as input is Conv-RGB.
Preferably, the specific process of step S4 is as follows:
S41, in the RPN stage of the Faster R-CNN algorithm, using only the depth image feature map output by the Conv-D network as input to generate the regions of interest D-ROI of the depth image;
S42, using the mapping relation between the RGB image and the depth image to generate, for each D-ROI in one-to-one correspondence, the RGB image regions of interest RGB-ROI of the RGB image feature map output by Conv-RGB.
Preferably, the specific process of step S5 is as follows:
S51, pooling each group of D-ROI and RGB-ROI to a fixed size using an ROI Pooling layer;
S52, fusing the pooled D-ROI and RGB-ROI feature maps by series stacking: stacking the channels d (1 ≤ d ≤ D) at the same spatial position (i, j), where 1 ≤ i ≤ H and 1 ≤ j ≤ W, so that for two input feature maps of D channels each the stacked output feature map has 2D channels:
y_(i,j,2d-1) = x^rgb_(i,j,d),  y_(i,j,2d) = x^d_(i,j,d)
where x^rgb, x^d and y denote respectively the RGB-ROI and D-ROI feature maps before fusion and the feature map after fusion.
Preferably, the specific process of step S6 is as follows:
S61, continuing to convolve the fused feature map with an NOC structure composed of several convolutional layers to further extract the fused features;
S62, pooling with the global average pooling layer and feeding the pooled result to the classifier and regressor of Fast R-CNN.
Preferably, the RGB-D sensor of step S11 is a Kinect 2.0 sensor.
Preferably, the Conv-D and Conv-RGB of step S31 have the same convolutional structure, both being ZF structures.
Preferably, in step S51, a spatial pyramid pooling method divides each feature map into a 6 × 6 grid, and max pooling within each grid cell generates a fixed-size feature map of 6 × 6 with 256 channels.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
(1) The invention provides a dual-stream RGB-D Faster R-CNN algorithm that combines the respective strengths of the RGB image and the depth image, greatly improving recognition accuracy without adding much time cost.
(2) The fully convolutional structural design greatly compresses the model size while maintaining real-time performance.
(3) The method establishes an RGB-D video image database of lactating sows, providing a data source for subsequent algorithm design and model training based on RGB-D video images.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a flow chart of the method for recognizing the posture of a lactating sow with a dual-stream RGB-D Faster R-CNN of the present invention;
FIG. 2 is a structural diagram of the dual-stream RGB-D Faster R-CNN sow posture recognition model of the present invention;
FIG. 3 is a schematic diagram of the recognition result of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Deep learning can obtain better image feature representations by fusing different image features. In a target recognition task, fusing RGB-D features allows the complementary properties of RGB and depth image features to be extracted, improving the robustness of feature learning and yielding features with stronger discriminative power. The invention provides an end-to-end fusion strategy for the RGB-D feature extraction stage: first, two CNN networks separately extract the distinctive features of the two modalities; after the two sets of features are fused, further CNN layers extract the complementary information inherent between them. Finally, based on the Faster R-CNN algorithm, a dual-stream RGB-D Faster R-CNN algorithm that fully exploits RGB-D data information is proposed for high-precision recognition of lactating sow postures.
In the first stage shown in FIG. 1, to generate the ROIs of the RGB-D video image, the CNN of the depth image stream, Conv-D, takes the depth image as input and extracts the depth image features, and the CNN of the RGB image stream, Conv-RGB, takes the RGB image as input and extracts the RGB image features. The RPN network then takes the depth image feature map as input to generate the regions of interest D-ROI, and the mapping relation between the RGB image and the depth image (the RGB-D mapping) generates the RGB-ROIs in one-to-one correspondence. An ROI Pooling layer pools the D-ROI and RGB-ROI feature maps to a fixed size, after which they are fused by concatenation. The second stage is the Fast R-CNN stage that classifies and refines the ROIs: an NOC structure continues to convolve the fused features to further fuse the RGB and depth image features and extract robust RGB-D features, which are finally processed by the classifier and regressor to output the recognition result.
The sow posture recognition model is trained and tested on an Nvidia GTX 980Ti GPU hardware platform, with the Caffe deep learning framework built on the Ubuntu 14.04 operating system and Python as the programming language.
The concrete implementation is as follows:
step one, collecting RGB-D video images of lactating sows, including RGB images and depth images, and establishing a sow posture recognition RGB-D video image library;
step two, obtaining the mapping relation between the RGB image and the depth image through camera calibration;
step three, based on the Faster R-CNN algorithm, convolving the RGB image and the depth image with two separate CNN networks;
step four, using only one RPN network, generating regions of interest D-ROI based on the depth image feature map, and generating the corresponding regions of interest RGB-ROI of the RGB image feature map for each D-ROI in one-to-one correspondence through the mapping relation between the RGB image and the depth image;
step five, pooling each D-ROI and RGB-ROI to a fixed size using an ROI Pooling layer, and fusing each group of pooled D-ROI and RGB-ROI features by concatenation fusion;
step six, continuing to convolve the fused features, i.e. the RGB-D features, with an NOC structure composed of several convolutional layers, passing them through a global average pooling layer and then a classifier and a regressor, obtaining the dual-stream RGB-D Faster R-CNN sow posture recognition model and outputting the recognition result;
step seven, training the dual-stream RGB-D Faster R-CNN sow posture recognition model with the training set of the sow posture recognition RGB-D video image library, testing the model performance with the test set, and finally selecting the best-performing model.
The database establishing method of the first step specifically comprises the following steps:
1) Data were collected from 28 pens; each pen measures about 3.8 m × 2.0 m and houses one lactating sow and 8-10 piglets. A Microsoft Kinect v2.0 sensor, mounted 190-270 cm above the pen floor and looking straight down, acquired RGB-D data at 5 frames per second. The RGB images were acquired at 1080 × 1920 pixels and scaled to 540 × 960 resolution to save video memory (GPU memory) and speed up processing in subsequent use of the algorithm. The depth images were acquired at 424 × 512 pixels; the pixel values of a depth image reflect the distance of the object from the sensor.
2) From the 21 pens recorded in the first three shooting sessions, groups of consecutive video images were selected at random intervals of 10-40 frames, and RGB-D image groups of the 5 posture classes were randomly sampled: 2522 standing, 2568 sitting, 2505 prone lying, 2497 ventral lying and 2508 lateral lying groups, giving 12600 RGB-D image groups as the original training set. From the 7 pens of the fourth session, 1127 standing, 1033 sitting, 1151 prone lying, 1076 ventral lying and 1146 lateral lying groups were randomly sampled, giving 5533 RGB-D image groups as the test set for evaluating model performance. Each RGB-D image group comprises an RGB image and its corresponding depth image. In the overall data set, the training set accounts for about 70% and the test set for about 30%.
3) The sampled depth images were first processed with median filtering and adaptive histogram equalization to improve contrast; the RGB images were not preprocessed. In the manual annotation stage, the depth image of each RGB-D video image group was annotated: a bounding box around the sow was marked on every depth image in the data set, giving the coordinate position of the sow in the image. To enhance the generalization capability and robustness of subsequent model training, the original training set data were augmented by clockwise rotations of 90°, 180° and 270° and by horizontal and vertical mirroring. The processed RGB-D data reached 75600 groups, used as the training data set for training the model.
Table 1. The five posture classes of lactating sows
The method of obtaining the RGB-D mapping relation by camera calibration in step two specifically comprises:
obtaining the intrinsic matrix K_rgb of the RGB image and the intrinsic matrix K_d of the depth image by the camera calibration method, and, for the same checkerboard image, obtaining the extrinsic matrices R_rgb and T_rgb of the RGB image and the extrinsic matrices R_d and T_d of the depth image; here the checkerboard image is the printed checkerboard used for camera calibration in the experiment. Let the homogeneous pixel coordinates of the RGB image be P_rgb = [U_rgb, V_rgb, 1]^T and those of the depth image be P_d = [U_d, V_d, 1]^T. Then the rotation matrix R and the translation matrix T that map depth image coordinates to RGB image coordinates are respectively:
R = K_rgb · R_rgb · R_d^(-1) · K_d^(-1)
T = K_rgb · T_rgb - R · K_d · T_d
Therefore, the mapping relation between the pixel coordinates of the depth image and those of the RGB image is:
P_rgb = (R · Z_d · P_d + T) / Z_rgb
From the above equation, given the coordinates P_d of a point in the depth image, its pixel value Z_d and the shooting distance Z_rgb, the corresponding RGB image coordinates P_rgb can be obtained.
The step-three method of convolving the RGB image and the depth image with two separate CNN networks based on the Faster R-CNN algorithm specifically comprises:
1) Based on the Faster R-CNN algorithm and taking the ZF network as an example, the network structure first processes the two inputs independently with a series of convolutional and max pooling layers of the ZF structure to extract the features of the two image modalities. Conv1-Conv5 together with Pool1 and Pool2 form Conv-D for extracting depth image features, and Conv1_1-Conv5_1 together with Pool1_1 and Pool2_1 form Conv-RGB for extracting RGB image features. Conv-D takes a 512 × 424 × 1 depth image as input and outputs a feature map of size 33 × 28 with 256 channels; Conv-RGB takes a 960 × 540 × 3 RGB image as input and outputs a feature map of size 61 × 35 with 256 channels, as shown in FIG. 2.
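For illustration only, a compact PyTorch sketch of two such independent stems is shown below; the layer widths follow the ZF structure only loosely and are assumptions for exposition, not the patent's exact configuration.

    import torch
    import torch.nn as nn

    def zf_stem(in_channels):
        # A loose ZF-style stem: 5 convolutions and 2 max-pooling layers,
        # ending in a 256-channel feature map (overall stride 16).
        return nn.Sequential(
            nn.Conv2d(in_channels, 96, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),                       # Pool1
            nn.Conv2d(96, 256, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),                       # Pool2
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True),
        )

    conv_d, conv_rgb = zf_stem(1), zf_stem(3)         # depth and RGB streams
    d_feat = conv_d(torch.zeros(1, 1, 424, 512))      # -> (1, 256, 27, 32)
    rgb_feat = conv_rgb(torch.zeros(1, 3, 540, 960))  # -> (1, 256, 34, 60)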
The step-four method of using only one RPN network to generate the ROIs of the RGB-D data features specifically comprises:
1) In the RPN stage, the feature maps output by the two streams share one RPN network: the D-ROIs are generated from the depth image feature map, and the RGB-ROIs are generated for the RGB image feature map through the RGB-D mapping relation. For the RPN network, 9 anchors are taken at each sliding-window position, combining 3 area scales {96, 192, 384} and 3 aspect ratios {1:1, 1:3, 3:1}.
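A minimal sketch of this box mapping might look as follows; using only the two box corners and a single representative depth Z_d for the whole box is a simplifying assumption for illustration, not a detail stated in the patent.

    import numpy as np

    def d_roi_to_rgb_roi(box_d, Z_d, Z_rgb, R, T):
        # Map a D-ROI (x1, y1, x2, y2) in depth-image pixels to the
        # corresponding RGB-ROI by mapping its two corner points through
        # P_rgb = (R * Z_d * P_d + T) / Z_rgb.
        x1, y1, x2, y2 = box_d
        corners = np.array([[x1, y1, 1.0], [x2, y2, 1.0]])
        mapped = (Z_d * corners @ R.T + T.reshape(1, 3)) / Z_rgb
        (u1, v1), (u2, v2) = mapped[:, :2]
        return min(u1, u2), min(v1, v2), max(u1, u2), max(v1, v2)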
The step-five method of pooling each group of D-ROI and RGB-ROI to a fixed size and then fusing their features specifically comprises:
1) Two ROI Pooling layers (spatial pyramid pooling layers) divide each D-ROI and RGB-ROI, whatever its size, into an H × W grid (H and W are set to 6) and apply max pooling within each grid cell, pooling every ROI feature map to a fixed 6 × 6 feature map with 256 channels (a code sketch of this pooling follows item 2) below).
2) In the feature fusion stage, define the fusion function F of the ROI fusion layer, F: X^rgb_t, X^d_t → Y_t, where X^rgb_t is the feature map of an RGB-ROI, X^d_t is the feature map of the corresponding D-ROI, t indexes the t-th group of ROI feature maps (t is 128 in the experiments here), H and W are the height and width of the feature map, and D is the number of channels; after ROI Pooling, the RGB-ROI and D-ROI sizes in this method are identical (set to 6 × 6 in the experiments). The output fused feature map Y_t has the same spatial size H × W as the inputs and D' channels. For ease of discussion the subscript t is omitted in the analysis, regardless of the number of features in a group (each group uses the same feature fusion).
The concatenation fusion formula is Y_cat = F_cat(X^rgb, X^d), i.e. the two feature maps are stacked in series: the channels d (1 ≤ d ≤ D) at the same spatial position (i, j) (1 ≤ i ≤ H, 1 ≤ j ≤ W) are stacked, and for two input feature maps of D channels each, the stacked output feature map has 2D channels:
y_(i,j,2d-1) = x^rgb_(i,j,d),  y_(i,j,2d) = x^d_(i,j,d)
Serial stacking does not by itself exchange information between the two feature maps, but the subsequent convolutional layers realize the information flow and fusion of the two modalities, as shown in FIG. 2.
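For illustration, the two operations above (single-level spatial pyramid pooling of an ROI to 6 × 6, then concatenation fusion) can be sketched in PyTorch as follows; the function names are chosen for exposition, and the interleaved channel ordering of the formula differs from a plain concatenation only by a fixed permutation, which the subsequent convolutions can absorb.

    import torch
    import torch.nn.functional as F

    def roi_pool_6x6(feature_map, box):
        # Crop an ROI from a (256, H, W) feature map and max-pool it to a
        # fixed 6 x 6 grid, as in single-level spatial pyramid pooling.
        x1, y1, x2, y2 = [int(round(c)) for c in box]
        roi = feature_map[:, y1:y2 + 1, x1:x2 + 1]        # (256, h, w)
        return F.adaptive_max_pool2d(roi, output_size=6)  # (256, 6, 6)

    def concat_fuse(x_rgb, x_d):
        # Concatenation fusion of pooled ROI features:
        # (T, D, 6, 6) and (T, D, 6, 6) -> (T, 2D, 6, 6)
        return torch.cat([x_rgb, x_d], dim=1)

    fused = concat_fuse(torch.randn(128, 256, 6, 6), torch.randn(128, 256, 6, 6))
    print(fused.shape)  # torch.Size([128, 512, 6, 6])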
The step-six method of applying the Fast R-CNN with NOC structure to the fused features specifically comprises:
In the Fast R-CNN stage, an NOC structure consisting of four convolutional layers, Conv6, Conv7, Conv8 and Conv9, continues to convolve the fused feature map to promote information flow among the fused feature channels and thereby further abstract the RGB-D features; finally, after a global average pooling layer, the classifier and regressor of Fast R-CNN are attached, as shown in FIG. 2.
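A PyTorch sketch of such an NOC head follows; the kernel sizes, channel widths and output dimensions (5 posture classes plus background giving 6 class outputs) are assumptions for illustration, since the patent specifies only the four convolutional layers, the global average pooling and the classifier and regressor.

    import torch.nn as nn

    class NOCHead(nn.Module):
        # NOC-style per-ROI head: Conv6-Conv9 over the fused 512-channel
        # 6 x 6 ROI features, then global average pooling, then the
        # Fast R-CNN classifier and bounding-box regressor.
        def __init__(self, in_ch=512, mid_ch=512, num_classes=6):
            super().__init__()
            self.convs = nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),   # Conv6
                nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),  # Conv7
                nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),  # Conv8
                nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),  # Conv9
            )
            self.gap = nn.AdaptiveAvgPool2d(1)             # global average pooling
            self.cls = nn.Linear(mid_ch, num_classes)      # posture classifier
            self.reg = nn.Linear(mid_ch, 4 * num_classes)  # box regressor

        def forward(self, x):                              # x: (T, 512, 6, 6)
            f = self.gap(self.convs(x)).flatten(1)
            return self.cls(f), self.reg(f)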
The step-seven process of training the dual-stream RGB-D Faster R-CNN model with the training set, testing model performance with the test set and finally selecting the best-performing model specifically comprises:
Model training uses the training set of the prepared RGB-D database. The mini-batch of image input is set to 1, the momentum to 0.9 and the weight decay coefficient to 5 × 10^-4; the maximum number of iterations is 14 × 10^5, the base learning rate is 10^-4 with a decay step of 6 × 10^5 and a decay coefficient gamma of 0.1. After 8 × 10^5 iterations, a model is saved every 1 × 10^5 iterations, and the model with the highest test-set accuracy is selected by comparison. The optimal model is taken as the final model.
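Under the Caffe framework these settings correspond to an SGD solver configuration; the sketch below merely restates them as a Python dictionary together with the implied step-decay rule, for illustration.

    solver = {
        "base_lr": 1e-4,        # base learning rate
        "momentum": 0.9,        # impulse (momentum)
        "weight_decay": 5e-4,   # weight decay coefficient
        "lr_policy": "step",
        "stepsize": int(6e5),   # learning-rate decay step
        "gamma": 0.1,           # decay coefficient
        "max_iter": int(14e5),  # maximum number of iterations
        "iter_size": 1,         # mini-batch of one image
    }

    def learning_rate(it):
        # Step-decay schedule implied by the solver settings.
        return solver["base_lr"] * solver["gamma"] ** (it // solver["stepsize"])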
The experimental results of the present invention are explained in detail below:
Three evaluation indexes accepted in the field are used to assess the sow posture recognition results on the test set. The proposed method is compared with a method using only the depth image, a method using only the RGB image, an early-fusion method (RGB-D early fusion) that simply stacks the RGB-D data as a four-channel image input, and a late-fusion method (RGB-D late fusion) that uses two CNNs, two RPNs and two Fast R-CNN heads and fuses their output results. For the early-fusion and late-fusion methods, the depth image is scaled to 540 × 960 resolution and registered with the RGB image as input. The results are as follows:
Evaluation uses AP (Average Precision), MAP (Mean Average Precision), recognition speed and model size, as shown in Table 2 below:
Table 2. Recognition performance comparison of the models
After the RGB data and the depth image data are fused, the AP (Average Precision) of the five postures (standing, sitting, prone lying, ventral lying and lateral lying) reaches 99.74%, 96.49%, 90.77%, 90.91% and 99.45% respectively, and the MAP (Mean Average Precision) over the five postures reaches 95.47%, which is 7.11 percentage points higher than the method using only RGB images, 5.36 points higher than the method using only depth images, 1.55 points higher than the early-fusion method and 0.15 points higher than the late-fusion method. The recognition speed of 12.3 FPS meets the requirement of real-time recognition. The model size is only 70.1 MB, far smaller than the other methods, showing a great advantage. In conclusion, the proposed method performs excellently in recognition accuracy and model size while maintaining real-time recognition.
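As a quick sanity check, the reported MAP is simply the mean of the five per-class APs:

    aps = [99.74, 96.49, 90.77, 90.91, 99.45]  # per-class AP (%)
    print(round(sum(aps) / len(aps), 2))       # 95.47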
The method for recognizing the posture of a lactating sow with a dual-stream RGB-D Faster R-CNN provided by the invention has been described in detail above. A specific example is used herein to explain the principle and implementation of the invention, and the description of the embodiment is only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in specific implementation and application scope according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A method for recognizing the posture of a lactating sow with a dual-stream RGB-D Faster R-CNN, characterized by comprising the following steps:
S1, collecting RGB-D video images of lactating sows, including RGB images and depth images, and establishing a sow posture recognition RGB-D video image library;
S2, calculating the mapping relation between the RGB image and the depth image by a camera calibration method;
S3, based on the Faster R-CNN algorithm, convolving the RGB image and the depth image with two separate CNN networks to obtain an RGB image feature map and a depth image feature map;
S4, using only one RPN network, generating regions of interest D-ROI based on the depth image feature map, and generating the corresponding regions of interest RGB-ROI of the RGB image feature map for each D-ROI in one-to-one correspondence through the mapping relation between the RGB image and the depth image;
S5, pooling each D-ROI and RGB-ROI to a fixed size using an ROI Pooling layer, and fusing each group of pooled D-ROI and RGB-ROI feature maps by concatenation fusion;
S6, further extracting fused features from the fused feature map with a Fast R-CNN head of NOC structure, passing them through a global average pooling layer and then a classifier and a regressor, obtaining the dual-stream RGB-D Faster R-CNN sow posture recognition model and outputting the recognition result;
S7, acquiring a training set and a test set from the sow posture recognition RGB-D video image library, training the dual-stream RGB-D Faster R-CNN sow posture recognition model with the training set, testing the model performance with the test set, and finally selecting the best-performing model.
2. The method for recognizing the posture of a lactating sow according to claim 1, wherein the specific process of step S1 is as follows:
S11, fixing an RGB-D sensor to capture top-view RGB-D video images of the pig pen;
S12, sampling a training set and a test set from the collected RGB-D video image data, the training set accounting for 70% and the test set accounting for 30% for testing model performance;
S13, preprocessing the depth images in the training and test sets, the preprocessing including filtering, denoising and image enhancement, then annotating the targets in the preprocessed depth images, i.e. marking a bounding box around the target and its posture class, the RGB images needing no processing; and then augmenting the processed training set data by rotation and mirroring for model training.
3. The method for recognizing the posture of a lactating sow according to claim 1, wherein the specific process of step S2 is as follows:
S21, obtaining the intrinsic matrix K_rgb of the RGB image and the intrinsic matrix K_d of the depth image by the camera calibration method, and, for the same checkerboard image used for camera calibration, obtaining the extrinsic matrices R_rgb and T_rgb of the RGB image and the extrinsic matrices R_d and T_d of the depth image; letting the homogeneous pixel coordinates of the RGB image be P_rgb = [U_rgb, V_rgb, 1]^T and those of the depth image be P_d = [U_d, V_d, 1]^T, the rotation matrix R and the translation matrix T that map depth image coordinates to RGB image coordinates are respectively:
R = K_rgb · R_rgb · R_d^(-1) · K_d^(-1)
T = K_rgb · T_rgb - R · K_d · T_d
S22, the mapping relation between the pixel coordinates of the depth image and those of the RGB image is:
P_rgb = (R · Z_d · P_d + T) / Z_rgb
from which, given the coordinates P_d of a point in the depth image, its pixel value Z_d and the shooting distance Z_rgb, the corresponding RGB image coordinates P_rgb are obtained.
4. The method for recognizing the posture of a lactating sow according to claim 1, wherein the specific process of step S3 is as follows:
in the shared convolutional layer part of the Faster R-CNN, two identical CNN networks are used with the depth image and the RGB image as inputs respectively, the CNN network taking the depth image as input being Conv-D and the CNN network taking the RGB image as input being Conv-RGB.
5. The method for recognizing the posture of a lactating sow according to claim 4, wherein the specific process of step S4 is as follows:
S41, in the RPN stage of the Faster R-CNN algorithm, using only one RPN network, which takes the depth image feature map output by Conv-D as input, to generate the regions of interest D-ROI of the depth image;
S42, using the mapping relation between the RGB image and the depth image to generate, for each D-ROI in one-to-one correspondence, the RGB image regions of interest RGB-ROI of the RGB image feature map output by Conv-RGB.
6. The method for recognizing the posture of a lactating sow according to claim 1, wherein the specific process of step S5 is as follows:
S51, pooling each group of D-ROI and RGB-ROI to a fixed size using an ROI Pooling layer;
S52, fusing the pooled D-ROI and RGB-ROI feature maps by series stacking: stacking the channels d (1 ≤ d ≤ D) at the same spatial position (i, j), where 1 ≤ i ≤ H and 1 ≤ j ≤ W, so that for two input feature maps of D channels each the stacked output feature map has 2D channels:
y_(i,j,2d-1) = x^rgb_(i,j,d),  y_(i,j,2d) = x^d_(i,j,d)
where x^rgb, x^d and y denote respectively the RGB-ROI and D-ROI feature maps before fusion and the feature map after fusion.
7. The method for recognizing the posture of a lactating sow according to claim 1, wherein the specific process of step S6 is as follows:
S61, continuing to convolve the fused feature map with an NOC structure composed of several convolutional layers to further extract the fused features;
S62, pooling with the global average pooling layer and feeding the pooled result to the classifier and regressor of Fast R-CNN.
8. The method for recognizing the posture of a lactating sow according to claim 2, wherein the RGB-D sensor of step S11 is a Kinect 2.0 sensor.
9. The method for recognizing the posture of a lactating sow according to claim 4, wherein the Conv-D and Conv-RGB of step S31 have the same convolutional structure, both being ZF structures.
10. The method of claim 6, wherein in step S51 a spatial pyramid pooling method divides each feature map into a 6 × 6 grid, and max pooling within each grid cell generates a fixed-size feature map of 6 × 6 with 256 channels.
CN201910040870.4A 2019-01-16 2019-01-16 Method for recognizing postures of lactating sows through double-current RGB-D Faster R-CNN Active CN109766856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910040870.4A CN109766856B (en) 2019-01-16 2019-01-16 Method for recognizing postures of lactating sows through double-current RGB-D Faster R-CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910040870.4A CN109766856B (en) 2019-01-16 2019-01-16 Method for recognizing postures of lactating sows through double-current RGB-D Faster R-CNN

Publications (2)

Publication Number Publication Date
CN109766856A true CN109766856A (en) 2019-05-17
CN109766856B CN109766856B (en) 2022-11-15

Family

ID=66452306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910040870.4A Active CN109766856B (en) 2019-01-16 2019-01-16 Method for recognizing postures of lactating sows through double-current RGB-D Faster R-CNN

Country Status (1)

Country Link
CN (1) CN109766856B (en)



Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102282570A (en) * 2008-10-30 2011-12-14 聪慧系统公司 System and method for stereo-view multiple animal behavior characterization
CN102521563A (en) * 2011-11-19 2012-06-27 江苏大学 Method for indentifying pig walking postures based on ellipse fitting
CN106456057A (en) * 2014-03-21 2017-02-22 凯耐特赛斯公司 Motion capture and analysis system for assessing mammalian kinetics
CN104881636A (en) * 2015-05-08 2015-09-02 中国农业大学 Method and device for identifying lying behavior of pig
CN106295558A (en) * 2016-08-08 2017-01-04 华南农业大学 A kind of pig Behavior rhythm analyzes method
CN108074224A (en) * 2016-11-09 2018-05-25 环境保护部环境规划院 A kind of terrestrial mammal and the monitoring method and its monitoring device of birds
CN106778784A (en) * 2016-12-20 2017-05-31 江苏大学 Pig individual identification and drinking behavior analysis method based on machine vision
CN106815579A (en) * 2017-01-22 2017-06-09 深圳市唯特视科技有限公司 A kind of motion detection method based on multizone double fluid convolutional neural networks model
CN107527351A (en) * 2017-08-31 2017-12-29 华南农业大学 A kind of fusion FCN and Threshold segmentation milking sow image partition method
CN107844797A (en) * 2017-09-27 2018-03-27 华南农业大学 A kind of method of the milking sow posture automatic identification based on depth image
CN108830144A (en) * 2018-05-03 2018-11-16 华南农业大学 A kind of milking sow gesture recognition method based on improvement Faster-R-CNN
CN108846326A (en) * 2018-05-23 2018-11-20 盐城工学院 The recognition methods of pig posture, device and electronic equipment
CN108921037A (en) * 2018-06-07 2018-11-30 四川大学 A kind of Emotion identification method based on BN-inception binary-flow network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHAN ZHENG et al.: "Automatic recognition of lactating sow postures from depth images by deep learning detector", Computers and Electronics in Agriculture *
薛月菊 (Xue Yueju) et al.: "Lactating sow postures recognition from depth video images based on improved Faster R-CNN", Transactions of the Chinese Society of Agricultural Engineering *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309786A (en) * 2019-07-03 2019-10-08 华南农业大学 A kind of milking sow posture conversion identification method based on deep video
CN110532854B (en) * 2019-07-11 2021-11-26 中国农业大学 Live pig crawling and crossing behavior detection method and system
CN110532854A (en) * 2019-07-11 2019-12-03 中国农业大学 A kind of live pig mounting behavioral value method and system
CN110378953A (en) * 2019-07-17 2019-10-25 重庆市畜牧科学院 A kind of method of spatial distribution behavior in intelligent recognition swinery circle
CN110598658A (en) * 2019-09-18 2019-12-20 华南农业大学 Convolutional network identification method for sow lactation behaviors
CN110598658B (en) * 2019-09-18 2022-03-01 华南农业大学 Convolutional network identification method for sow lactation behaviors
CN111104921A (en) * 2019-12-30 2020-05-05 西安交通大学 Multi-mode pedestrian detection model and method based on Faster rcnn
CN111368666A (en) * 2020-02-25 2020-07-03 上海蠡图信息科技有限公司 Living body detection method based on novel pooling and attention mechanism double-current network
CN111368666B (en) * 2020-02-25 2023-08-18 上海蠡图信息科技有限公司 Living body detection method based on novel pooling and attention mechanism double-flow network
CN111753658A (en) * 2020-05-20 2020-10-09 高新兴科技集团股份有限公司 Post sleep warning method and device and computer equipment
CN112088795A (en) * 2020-07-07 2020-12-18 南京农业大学 Method and system for identifying postures of piggery with limiting fence based on laser positioning
CN112088795B (en) * 2020-07-07 2022-04-29 南京农业大学 Method and system for identifying postures of piggery with limiting fence based on laser positioning
CN112101259A (en) * 2020-09-21 2020-12-18 中国农业大学 Single pig body posture recognition system and method based on stacked hourglass network
CN113313688A (en) * 2021-05-28 2021-08-27 武汉乾峯智能科技有限公司 Energetic material medicine barrel identification method and system, electronic equipment and storage medium
CN113313688B (en) * 2021-05-28 2022-08-05 武汉乾峯智能科技有限公司 Energetic material medicine barrel identification method and system, electronic equipment and storage medium
CN113822185A (en) * 2021-09-09 2021-12-21 安徽农业大学 Method for detecting daily behavior of group health pigs
CN113869271A (en) * 2021-10-13 2021-12-31 南京华捷艾米软件科技有限公司 Face detection method and device and electronic equipment
CN115019391A (en) * 2022-05-27 2022-09-06 南京农业大学 Piglet milk eating behavior detection system based on YOLOv5 and C3D
CN116519106A (en) * 2023-06-30 2023-08-01 中国农业大学 Method, device, storage medium and equipment for determining weight of live pigs
CN116519106B (en) * 2023-06-30 2023-09-15 中国农业大学 Method, device, storage medium and equipment for determining weight of live pigs

Also Published As

Publication number Publication date
CN109766856B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN109766856B (en) Method for recognizing postures of lactating sows through double-current RGB-D Faster R-CNN
CN108830144B (en) Lactating sow posture identification method based on improved Faster-R-CNN
Tian et al. Automated pig counting using deep learning
Kamilaris et al. Deep learning in agriculture: A survey
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN105335725B (en) A kind of Gait Recognition identity identifying method based on Fusion Features
CN106897673B (en) Retinex algorithm and convolutional neural network-based pedestrian re-identification method
CN107844797A (en) A kind of method of the milking sow posture automatic identification based on depth image
Nielsen et al. Vision-based 3D peach tree reconstruction for automated blossom thinning
CN114241031A (en) Fish body ruler measurement and weight prediction method and device based on double-view fusion
CN113762009B (en) Crowd counting method based on multi-scale feature fusion and double-attention mechanism
Badhan et al. Real-time weed detection using machine learning and stereo-vision
Wang et al. An efficient attention module for instance segmentation network in pest monitoring
CN108664942A (en) The extracting method and video classification methods of mouse video multidimensional characteristic value
CN109684941A (en) One kind picking region partitioning method based on MATLAB image procossing litchi fruits
CN110969182A (en) Convolutional neural network construction method and system based on farmland image
Zine-El-Abidine et al. Assigning apples to individual trees in dense orchards using 3D colour point clouds
CN113011404A (en) Dog leash identification method and device based on time-space domain features
CN115100688A (en) Fish resource rapid identification method and system based on deep learning
CN109166127B (en) Wearable plant phenotype sensing system
Sujatha et al. Enhancing Object Detection with Mask R-CNN: A Deep Learning Perspective
CN111985472A (en) Trough hay temperature image processing method based on artificial intelligence and active ball machine
CN117197836A (en) Traditional Chinese medicine physique identification method based on multi-modal feature depth fusion
CN116935296A (en) Orchard environment scene detection method and terminal based on multitask deep learning
CN108967246B (en) Shrimp larvae positioning method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant