CN114419489A - Training method and device for feature extraction network, terminal equipment and medium

Training method and device for feature extraction network, terminal equipment and medium

Info

Publication number
CN114419489A
CN114419489A
Authority
CN
China
Prior art keywords
video frame
foreground
foreground image
sample
target video
Prior art date
Legal status
Pending
Application number
CN202111614706.3A
Other languages
Chinese (zh)
Inventor
李百双
胡文泽
黄坤
Current Assignee
Shenzhen Intellifusion Technologies Co Ltd
Original Assignee
Shenzhen Intellifusion Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Intellifusion Technologies Co Ltd
Priority to CN202111614706.3A
Publication of CN114419489A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application relate to the technical field of image processing and provide a training method and apparatus for a feature extraction network, a terminal device, and a medium. The method includes: determining a background interval of each pixel point in a video frame sequence; determining foreground pixel points in a target video frame according to the background interval, where the target video frame is any video frame in the video frame sequence; obtaining a foreground image in each target video frame according to the foreground pixel points; determining positive and negative samples of the foreground image of the target video frame to construct a positive sample pair and a negative sample pair; and performing self-supervised training on a preset feature extraction network using the positive sample pair and the negative sample pair. A feature extraction network trained with this method achieves a high degree of matching with downstream target detection and segmentation tasks.

Description

Training method and device for feature extraction network, terminal equipment and medium
Technical Field
The present application belongs to the field of image processing technologies, and in particular, to a training method and apparatus for a feature extraction network, a terminal device, and a medium.
Background
Self-supervised learning mainly uses an auxiliary task to mine supervision information from large-scale unlabeled data itself, and trains the network with this constructed supervision information, so that features valuable to downstream tasks can be learned.
Self-supervised learning methods are widely used in the field of target detection, but the regions cropped by existing self-supervised learning methods when sampling the original image either fail to capture the complete information of an object or contain a large amount of irrelevant information. As a result, the features learned at present are incomplete or redundant, so the corresponding downstream target detection and segmentation tasks are poorly matched.
Disclosure of Invention
In view of this, embodiments of the present application provide a training method, an apparatus, a terminal device, and a medium for a feature extraction network, so as to solve the problems in the prior art that features in an image extracted in a self-supervision training process are incomplete, and a matching degree of a downstream task for target detection and segmentation is low.
A first aspect of an embodiment of the present application provides a method for training a feature extraction network, including:
determining a background interval of each pixel point in a video frame sequence;
determining foreground pixel points in a target video frame according to the background interval, wherein the target video frame is any video frame in the video frame sequence;
obtaining a foreground image in each target video frame according to the foreground pixel points;
determining positive and negative samples of a foreground image of the target video frame to construct a positive sample pair and a negative sample pair;
and performing self-supervision training on a preset feature extraction network by adopting the positive sample pair and the negative sample pair.
A second aspect of the embodiments of the present application provides a training apparatus for a feature extraction network, including:
the background interval determining module is used for determining the background interval of each pixel point in the video frame sequence;
a foreground pixel point determining module, configured to determine a foreground pixel point in a target video frame according to the background interval, where the target video frame is any video frame in the video frame sequence;
the foreground image acquisition module is used for acquiring a foreground image in each target video frame according to the foreground pixel points;
a sample pair determining module, configured to determine a positive sample and a negative sample of a foreground image of the target video frame to construct a positive sample pair and a negative sample pair;
and the training module is used for carrying out self-supervision training on the preset feature extraction network by adopting the positive sample pair and the negative sample pair.
A third aspect of embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for training a feature extraction network according to the first aspect.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program, which, when executed by a processor, implements the method for training a feature extraction network according to the first aspect.
A fifth aspect of embodiments of the present application provides a computer program product, which, when running on a terminal device, causes the terminal device to execute the training method for a feature extraction network according to the first aspect.
Compared with the prior art, the embodiment of the application has the following advantages:
In the embodiments of the present application, during the self-supervised training of the feature extraction network, the background interval of the pixel points in the video frame sequence is first determined, and the foreground pixel points of each video frame are then determined according to the background interval. Based on the foreground pixel points, the foreground image in the video frame can be determined, so the video frame can be sampled purposefully according to the foreground image and the sampling result contains the target of interest. The features learned by the trained feature extraction network are therefore more complete, which improves the matching degree between the feature extraction network and the downstream target detection and segmentation tasks; in addition, when the feature extraction network trained in this way is used for feature extraction, the extracted features have better consistency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic flow chart illustrating steps of a method for training a feature extraction network according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating steps of another method for training a feature extraction network according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a method for determining a foreground object frame according to an embodiment of the present application;
FIG. 4 is a schematic diagram of image feature extraction according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus of a feature extraction network according to an embodiment of the present application;
fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
Generally, in the process of target segmentation and detection, a feature extraction model is required to be used for feature extraction; the feature extraction model can be obtained through self-supervision training. In the self-supervision training process, the feature extraction network can be trained through an auxiliary task, so that the features extracted from the images by the feature extraction network meet expectations.
Generally, for the same target in different images, features obtained by using a feature extraction network should have certain similarity; the extracted features should have a certain degree of discrimination for different targets. If one feature network can meet the requirements, the feature extraction network can be considered to be capable of deeply learning the information in the image, and the purpose of training is achieved.
Based on the above, in the embodiment of the present application, the auxiliary task may be set as: features of different images of the same object have expected similarity; the features of the images of different objects have a certain degree of distinction.
In the embodiments of the present application, sampling is not random; instead, the foreground and the background of an image are distinguished. The foreground of an image is generally the region of interest when performing tasks such as target detection, while the background of an image is typically the fixed view of the shooting area.
In this embodiment, the foreground is separated from the picture and sampled, so that the sample contains the complete information of the foreground. This avoids information loss and redundancy and is conducive to the consistency of the extracted features.
The technical solution of the present application will be described below by way of specific examples.
Referring to fig. 1, a schematic flow chart illustrating steps of a training method for a feature extraction network according to an embodiment of the present application is shown, which may specifically include the following steps:
s101, determining a background interval of each pixel point in a video frame sequence.
The execution subject of the embodiment is a terminal device, such as a monitoring device. The method in the embodiment can obtain an image feature extraction algorithm, and is applied to the target segmentation and detection process.
Specifically, the video frame sequence may be obtained by processing surveillance data. Surveillance data generally include a plurality of consecutive video frames arranged in chronological order, but special situations during shooting may cause image information in some video frames to be missing. For example, in a dark environment the picture captured by the monitoring device is unclear, or when the lens of the monitoring device is blocked the captured picture does not contain the target content. Therefore, when the surveillance data are used for training, the unusable pictures need to be removed from the surveillance data, and the usable pictures are collected and arranged in chronological order to form the video frame sequence.
The background interval refers to the variation range of the pixel value of each pixel point in the background portion of the video frame picture. The video frames in the video frame sequence have the same size, and the pixel points in each video frame have the same number and the same arrangement; for the pixel point at the same position in each video frame of the sequence, the variation range can be calculated, thereby determining the background interval of that pixel point. In this embodiment, the background interval is used to describe the part of an image that is generally not of interest during processing. For example, when determining the targets that pass through the surveillance video within a certain period of time, the background captured by the surveillance video, such as walls and lawns, is generally not of interest. These parts generally do not change, so if no one passes by, the corresponding regions of the captured video frames are essentially the same; however, changes such as temperature and weather introduce small errors, so the value of the pixel point at each corresponding position fluctuates within a certain range. In this embodiment, the background interval is used to determine whether a foreground passes through the shooting area of the surveillance video.
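As a rough illustration of this step (not part of the patent text; Python with NumPy is assumed and all function and variable names are hypothetical), a per-pixel background interval can be estimated from a stack of frames using the per-pixel mean and standard deviation, in the spirit of the Gaussian fitting described later in step S201:

```python
import numpy as np

def estimate_background_interval(frames, k=3.0):
    """Estimate a per-pixel background interval from a sequence of video frames.

    frames: array-like of shape (T, H, W), e.g. grayscale frames of equal size.
    Returns (low, high), each of shape (H, W), bounding the normal fluctuation range.
    """
    stack = np.asarray(frames, dtype=np.float32)
    mean = stack.mean(axis=0)   # per-pixel average over time
    std = stack.std(axis=0)     # per-pixel fluctuation
    return mean - k * std, mean + k * std
```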
S102, determining foreground pixel points in a target video frame according to the background interval, wherein the target video frame is any video frame in the video frame sequence.
Specifically, the target video frame refers to a video frame in the video frame sequence. For a target video frame, each pixel point in it can be compared with the background interval of the corresponding pixel point: if the pixel value of the pixel point is within the background interval, the pixel point is taken as a background pixel point; if the pixel value of the pixel point is not within the background interval, the pixel point is taken as a foreground pixel point. For example, if the background interval of the pixel point in the tenth row and tenth column is 345 to 567, and the pixel value of that pixel point in the target video frame is 789, then the pixel point in the tenth row and tenth column of the target video frame is a foreground pixel point. If the background interval of the pixel point in the third row and fifth column is 123 to 367, and the pixel value of that pixel point in the target video frame is 234, then the pixel point in the third row and fifth column of the target video frame is not a foreground pixel point but a background pixel point.
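Continuing the sketch above (same hypothetical names), classifying the pixels of a target frame then reduces to a per-pixel comparison against the interval:

```python
def foreground_mask(frame, low, high):
    """Mark pixels whose values fall outside the background interval as foreground."""
    frame = np.asarray(frame, dtype=np.float32)
    return (frame < low) | (frame > high)   # True marks a foreground pixel point
```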
S103, obtaining foreground images in the target video frames according to the foreground pixel points.
Specifically, cameras for monitoring videos are generally aligned to the same shooting area, the scene of the shooting area is fixed, and the fixed scene in the shooting area can be considered as a background; when a target passes through the shooting area, the shot target in the video frame is the foreground in the video frame. In the last step, a foreground pixel point in the target video frame is determined according to the background interval, and the foreground pixel point can be a pixel point in the foreground, so that a foreground image in the video frame can be determined according to the foreground pixel point.
Specifically, for each video frame, the foreground image in the target video frame may be determined according to an image region formed by the foreground pixel points.
Specifically, the image binarization processing may be performed on the target video frame. For example, if a pixel point is a foreground pixel point, the pixel value of the pixel point is set to 1; if the pixel is not a foreground pixel, the pixel value of the pixel can be set to 0. Therefore, the pixel point with the pixel value of 1 can form a connected image in the target video frame, and the connected image is the foreground image.
Specifically, a video frame may include a plurality of foreground images, and each foreground image may correspond to a connected region. For example, pixel points with a pixel value of 1 in a video frame are combined into 3 connected regions, and then it can be considered that the 3 connected regions respectively correspond to one foreground image.
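As an illustration of splitting the binarized frame into foreground images (OpenCV is assumed; the helper below is a sketch, not the patent's implementation):

```python
import cv2
import numpy as np

def split_connected_foregrounds(mask):
    """Split a binarized foreground mask into one mask per connected region.

    mask: boolean array (H, W); True (pixel value 1) marks foreground pixels.
    Returns a list of boolean masks, one per connected region, i.e. one candidate foreground image each.
    """
    num, labels = cv2.connectedComponents(mask.astype(np.uint8), connectivity=8)
    return [labels == i for i in range(1, num)]   # label 0 is the background
```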
And S104, determining a positive sample and a negative sample of the foreground image of the target video frame to construct a positive sample pair and a negative sample pair.
Specifically, a positive sample refers to an image region that shares many of the same features as the foreground image; a negative sample refers to an image region whose features are completely different from those of the foreground image.
Specifically, the foreground image corresponds to a target, and other images of the target in other video frames can be used as positive samples of the foreground image in the target video frame; and taking the images of different targets in other video frames as negative samples of the foreground images in the target video frame. For example, adjacent video frames of the video frame may be selected, the same foreground in the target video frame and the adjacent video frame may be used as a positive sample pair, and the different foreground in the target video frame and the adjacent video frame may be used as a negative sample pair.
Specifically, one video frame separated from the target video frame by a certain number may be selected, and then the same foreground may be selected as a positive sample, and different foreground may be selected as a negative sample. The same foreground corresponds to the same target and the different foreground corresponds to different targets.
Illustratively, one foreground image Z1 is included in the target video frame, which corresponds to a target Z; in a sample video frame that is 5 video frames ahead of the target video frame, two foreground images Z2 and Y1 are included, corresponding to Z and Y, respectively. Then Z2 may be taken as a positive sample of Z1 and Y1 as a negative sample of Z1; that is, Z1 and Z2 constitute a positive sample pair, and Z1 and Y1 constitute a negative sample pair.
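A minimal sketch of this pair construction (hypothetical names; it assumes each foreground image has already been matched to a target identity, for example by the overlap test described in step S207 below):

```python
def build_sample_pairs(target_foregrounds, sample_foregrounds):
    """Build positive and negative pairs from the foreground crops of two frames.

    Both arguments map a target identity to its foreground crop,
    e.g. {"Z": Z1_crop} for the target frame and {"Z": Z2_crop, "Y": Y1_crop} for the sample frame.
    """
    positive_pairs, negative_pairs = [], []
    for tid, crop in target_foregrounds.items():
        for sid, sample_crop in sample_foregrounds.items():
            if sid == tid:
                positive_pairs.append((crop, sample_crop))   # same target -> positive pair
            else:
                negative_pairs.append((crop, sample_crop))   # different targets -> negative pair
    return positive_pairs, negative_pairs
```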
And S105, performing self-supervision training on a preset feature extraction network by adopting the positive sample pair and the negative sample pair.
In particular, in the self-supervision learning training, an auxiliary task is generally required to be constructed for training. And continuously training the feature extraction network based on the auxiliary task by adopting the positive sample pair and the negative sample pair, so that the feature extraction network obtained by training has feature consistency.
In one possible implementation, the positive sample pairs and negative sample pairs extracted from the video frames are input into the feature extraction network as training samples to train it. After the feature extraction network has been trained with sample images, the trained network can be used to extract features from the foreground image of the target video frame, from the positive sample corresponding to that foreground image, and from the negative sample corresponding to that foreground image, obtaining the foreground image feature, the positive sample feature, and the negative sample feature. The positive sample feature difference between the foreground image feature and the positive sample feature, and the negative sample feature difference between the foreground image feature and the negative sample feature, are then calculated. When the negative sample feature difference is much larger than the positive sample feature difference, the feature extraction network extracts the same features for the same object and different features for different images, and training of the feature extraction network is complete. When the gap between the negative sample feature difference and the positive sample feature difference does not reach a preset value, the feature extraction network cannot yet distinguish the same image from different images, and training must continue with the images in the positive sample pairs and negative sample pairs. In this application, whether the features within a positive sample pair are the same and whether the features within a negative sample pair are different can be checked to determine whether the model has finished training.
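A hedged sketch of this check (PyTorch is assumed; the encoder, the Euclidean distance, and the margin are illustrative choices, not fixed by the patent):

```python
import torch
import torch.nn.functional as F

def feature_differences(encoder, anchor, positive, negative):
    """Compare the anchor foreground against its positive and negative samples."""
    fa = encoder(anchor)     # foreground image features
    fp = encoder(positive)   # positive sample features
    fn = encoder(negative)   # negative sample features
    pos_diff = F.pairwise_distance(fa, fp)   # positive sample feature difference
    neg_diff = F.pairwise_distance(fa, fn)   # negative sample feature difference
    return pos_diff, neg_diff

def training_converged(pos_diff, neg_diff, margin=1.0):
    """Consider training done when negative differences exceed positive ones by a margin."""
    return bool(((neg_diff - pos_diff) > margin).all())
```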
In addition, the auxiliary tasks may also include various tasks, such as disordering the sequence in the video frame sequence, and if the video frame sequence can be reordered based on an algorithm, it may be stated that the content in the image can be identified and the training is completed.
In this embodiment, during sampling, an area in an image is not randomly selected, but an area with a foreground is selected, so that a trained network can extract similar features for different images of the same target, extract different features for images of different targets, and further adapt to downstream tasks better in subsequent target detection and object segmentation.
Referring to fig. 2, a schematic flow chart illustrating steps of another training method for a feature extraction network according to an embodiment of the present application is shown, which may specifically include the following steps:
s201, performing Gaussian fitting on pixel points at each corresponding position of each video frame in the video frame sequence to obtain the fluctuation range of the pixel values of the pixel points.
The execution subject of the embodiment is a terminal device. By adopting the method in the embodiment, a feature extraction algorithm can be obtained; the feature extraction algorithm can be applied to the fields of image analysis, video analysis and the like and is used for target segmentation and detection.
Specifically, the fluctuation range of the pixel value of the pixel point can be obtained by performing gaussian fitting on the pixel value of the pixel point at each corresponding position in a plurality of video frames in the video frame sequence.
In another possible implementation manner, a video frame image only containing a background image in a surveillance video can be directly selected, and a maximum pixel value and a minimum pixel value corresponding to each pixel point are determined; and determining the fluctuation range of the pixel value of the pixel point according to the maximum pixel value and the minimum pixel value. The background interval is, equivalently, a change interval of pixel values of the pixels in the background image.
S202, taking the fluctuation range of the pixel value of each pixel point as the background interval of each pixel point.
Specifically, the calculated fluctuation range may be used as the background interval corresponding to that pixel point.
S203, judging whether the pixel value of the pixel point in the target video frame is located in the background interval corresponding to the pixel point.
Specifically, the background interval refers to the range within which the pixel value of each pixel point of the background image captured by the surveillance video may vary under normal conditions, taking into account changes such as illumination and weather. When the pixel value of a pixel point is not within the background interval, the pixel point corresponds to a foreground pixel point.
Specifically, for a currently processed target video frame, whether the pixel value of a pixel point is in a background interval can be determined, so that whether a certain pixel point in the target video frame is a foreground pixel point is determined.
S204, if the pixel value of the pixel point is not in the background interval corresponding to the pixel point, determining the pixel point as a foreground pixel point.
Specifically, if the pixel value of the pixel point is not in the background interval, it is indicated that the pixel point does not belong to the background part, belongs to the foreground part, and is a foreground pixel point.
And S205, acquiring foreground images in the target video frames according to the foreground pixel points.
Specifically, the region formed by connected foreground pixel points can be framed as a foreground object frame, and the position of the frame is recorded, so that the complete foreground image can be selected according to the frame.
Fig. 3 is a schematic diagram of a method for determining a foreground object frame according to an embodiment of the present application. As shown in fig. 3, when obtaining the foreground object frame, Gaussian fitting is performed on each pixel point in the video frame sequence to obtain the background interval threshold T of each pixel point. If a pixel point of the processed video frame does not fall within the background interval threshold T, it is judged to be a foreground pixel point; a pixel point that falls within the background interval threshold T is judged to be a background pixel point. The connected domains formed by the connected foreground pixel points are then framed to obtain the foreground object frames. The coordinate position of each foreground object frame is recorded, so that the foreground image in the video frame can be acquired using that coordinate position.
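Continuing the OpenCV sketch above (names hypothetical), framing the connected regions, recording the frame coordinates, and cropping the foreground images could look like this:

```python
def foreground_boxes_and_crops(frame, mask, min_area=50):
    """Frame each connected foreground region, record its coordinates, and crop it.

    Returns a list of ((x, y, w, h), crop) pairs, one per foreground object frame.
    """
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask.astype(np.uint8), connectivity=8)
    results = []
    for i in range(1, num):          # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:         # ignore tiny noise regions
            results.append(((x, y, w, h), frame[y:y + h, x:x + w].copy()))
    return results
```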
S206, extracting a preset number of sample video frames from the video frame sequence according to a preset interval.
Specifically, the preset interval may be 1, for example taking the video frame immediately preceding the target video frame as the sample video frame. Several foreground images can be extracted from the sample video frame; some of them are the same as the foreground images in the target video frame and some are different.
Of course, only one foreground image may be included in one sample video frame, and in this case, the plurality of sample video frames may be extracted at different intervals, so that the plurality of video frames may include a plurality of foreground images, thereby facilitating determination of the positive and negative samples.
And S207, judging whether the foreground image of each sample video frame is the same as the foreground image of the target video frame.
Specifically, a foreground image of the sample video frame being the same as a foreground image of the target video frame means that the two foreground images correspond to the same target; for example, they may be images of the same person captured in different video frames. The size and position of the two images may differ, but since they correspond to the same person, they are considered to be the same.
Specifically, each sample video frame may include corresponding foreground images. For a foreground image in the current target video frame, the foreground image corresponding to the same target is determined from the foreground images in the sample video frame, and the foreground images corresponding to different targets are likewise identified from the foreground images in the sample video frame.
In one possible implementation, the foreground image in the sample video frame and the foreground image in the target video frame may be transformed so that the sizes of the two images become the same; then comparing the shapes of the two objects, if the shapes are the same, the two objects can be considered to correspond to the same object.
In another possible implementation manner, a first image frame of a foreground image in the sample video frame may be determined, a second image frame of the foreground image in the target video frame may be determined, and the time difference between the sample video frame and the target video frame may be determined. If the time difference is smaller than a preset duration and the overlapping area of the first image frame and the second image frame is larger than a preset area, it can be determined that the foreground image of the sample video frame is the same as the foreground image of the target video frame. For example, the sample video frame may be the video frame next to the target video frame; the time difference between the two frames is small, so the foreground image cannot have moved far, and if the first image frame and the second image frame largely overlap, the foreground image of the sample video frame and the foreground image of the target video frame correspond to the same target.
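A hedged sketch of this matching rule (the thresholds and the overlap measure are illustrative; the patent only requires a small time difference and a sufficiently large overlapping area):

```python
def box_overlap_area(box_a, box_b):
    """Intersection area of two (x, y, w, h) image frames."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    return iw * ih

def is_same_foreground(box_a, box_b, frame_gap, max_gap=5, min_overlap_ratio=0.5):
    """Treat two foreground frames as the same target if they are close in time
    and their boxes overlap by a large enough fraction of the smaller box."""
    if frame_gap > max_gap:
        return False
    overlap = box_overlap_area(box_a, box_b)
    smaller = min(box_a[2] * box_a[3], box_b[2] * box_b[3])
    return smaller > 0 and overlap / smaller >= min_overlap_ratio
```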
In the embodiment of the application, positive and negative samples for training the feature extraction network can be obtained by judging whether the foreground image of the sample video frame is the same as the foreground image of the target video frame. Specifically, if the foreground image in the sample video frame and the foreground image in the target video frame correspond to the same target, S208 may be executed to use the foreground image in the sample video frame as a positive sample. If the foreground image in the sample video frame and the foreground image in the target video frame correspond to different targets, S209 may be executed to use the foreground images of the different targets in the sample video frame as negative samples.
And S208, taking the foreground image of the sample video frame as a positive sample.
And S209, taking the foreground image of the sample video frame as a negative sample.
And S210, forming a positive sample pair according to the foreground image of the target video frame and the positive sample.
Specifically, the foreground image and its corresponding positive sample form a positive sample pair, and if the corresponding objects of the two images in the positive sample pair are the same, they should have the same characteristics or similar characteristics.
In another possible implementation manner, a background area in the target video frame and the sample video frame, which does not contain the foreground image, may also be selected as a positive sample pair. Specifically, a background region may be selected from the target video frame, the position coordinates of the background region may be determined, and based on the position coordinates, a corresponding region may be selected from the sample video frame, and then the two may be used as a positive sample pair. Both are the same background section and should also have the same characteristics. For example, in a monitoring picture, the camera can shoot flowers, roads and mountains all the time, and then the flowers, the roads and the mountains are background images and appear in the video frames all the time. Two image areas corresponding to flowers in two different video frames can be selected as a positive sample pair. Since two image regions correspond to the same flower, both should correspond to the same feature.
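As a sketch of such a background positive pair (the region is an arbitrary illustration and must be chosen so that it contains no foreground box):

```python
def background_positive_pair(target_frame, sample_frame, region):
    """Crop the same background region from two frames to form a positive sample pair.

    region: (x, y, w, h) coordinates of an area that does not intersect any foreground image.
    """
    x, y, w, h = region
    return target_frame[y:y + h, x:x + w], sample_frame[y:y + h, x:x + w]
```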
In particular, for one video frame, multiple positive sample pairs may be selected when processing.
And S211, forming a negative sample pair according to the foreground image of the target video frame and the negative sample.
Specifically, the foreground image and its corresponding negative example form a negative example pair, and the objects corresponding to the two images in the negative example pair are different, so that the two images should have completely different features, or the feature difference between the two images should be relatively large.
And S212, respectively extracting the features in the positive sample pair and the negative sample pair by adopting the feature extraction network, wherein the features comprise foreground image features, positive sample features and negative sample features.
Specifically, a feature extraction network needing training is adopted to respectively extract image features from the positive sample pair and the negative sample pair, and the extracted image features comprise foreground image features, positive sample features and negative sample features.
In addition, the extracted features can be transformed to be in the same dimension, so that analysis and comparison are facilitated.
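One common way to bring feature blocks of differently sized regions to the same dimension is ROI alignment, as in the pipeline of FIG. 4 below; the sketch uses torchvision's roi_align, and the 7x7 output size and 1/16 feature stride are assumptions rather than values given in the patent:

```python
import torch
from torchvision.ops import roi_align

def region_features(feature_map, boxes, stride=16, output_size=(7, 7)):
    """Cut fixed-size feature blocks for foreground/background boxes.

    feature_map: backbone output of shape (1, C, H/stride, W/stride).
    boxes: tensor of shape (K, 4) holding (x1, y1, x2, y2) in image coordinates.
    Returns a tensor of shape (K, C, 7, 7), so all regions share one dimensionality.
    """
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes.float()], dim=1)  # prepend batch index 0
    return roi_align(feature_map, rois, output_size=output_size,
                     spatial_scale=1.0 / stride, aligned=True)
```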
S213, respectively calculating a positive sample feature difference of the positive sample pair and a negative sample feature difference of the negative sample pair, where the positive sample feature difference is a difference between the foreground image feature and the positive sample feature, and the negative sample feature difference is a difference between the foreground image feature and the negative sample feature.
Specifically, if the feature extraction network can accurately identify the image features, the foreground image features and the positive sample features should have a relatively large similarity; the difference between the foreground image features and the negative sample features is large.
An auxiliary task may be constructed when determining whether the feature extraction network is trained. In this embodiment, the auxiliary task may be whether image features of different images of the same object are the same; whether the image characteristics of different objects are different.
In another possible implementation, other auxiliary tasks may be used to determine whether the network is trained completely.
S214, when the distance between the positive sample feature difference and the negative sample feature difference reaches a preset distance value, determining that the feature extraction network training is finished; otherwise, adjusting the parameters of the feature extraction network, and continuing to train the feature extraction network.
Specifically, when the negative sample feature difference is much larger than the positive sample feature difference, the features of the positive sample pair may be considered similar, and the features of the negative sample pair may be different, and at this time, the network training may be considered to be completed.
When the distance between the negative sample feature difference and the positive sample feature difference does not reach the preset distance value, training of the feature network is not finished. The parameters of the feature network can then be adjusted; for example, if the feature extraction network is a convolutional network, the weight of each convolutional layer in the feature extraction network can be adjusted in a preset manner to adjust the hyper-parameters of the neurons.
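A minimal sketch of one parameter update under this criterion (PyTorch assumed; the triplet margin loss and the optimizer are illustrative, since the patent does not fix a specific loss function):

```python
import torch
import torch.nn.functional as F

def train_step(encoder, optimizer, anchor, positive, negative, margin=1.0):
    """One self-supervised update: pull positive pairs together, push negative pairs apart."""
    optimizer.zero_grad()
    fa, fp, fn = encoder(anchor), encoder(positive), encoder(negative)
    loss = F.triplet_margin_loss(fa, fp, fn, margin=margin)
    loss.backward()
    optimizer.step()
    return loss.item()
```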
For example, for two adjacent frames, the corresponding same foreground can be extracted as a positive sample pair and different foregrounds as negative sample pairs; features extracted in this way improve robustness to the angle and pose of the object. In addition, background regions at the same positions that contain no foreground can be selected from two adjacent video frames as a positive sample pair. For two frames separated by a certain time interval, background blocks are extracted by the same method as a positive sample pair; since the illumination conditions change over the course of a day, the extracted features gain robustness to illumination and to the color of objects. The feature pairs are converted to the same dimensionality after ROI alignment processing, and, on the basis of feature consistency, the features of positive sample pairs are made as consistent as possible while the feature difference of negative sample pairs is increased, so as to train the network. Self-supervised training on this auxiliary task enables the network to extract the same features for the same object region and different features for different object regions, so that it better adapts to the downstream tasks in subsequent target detection and object segmentation work.
FIG. 4 is a schematic diagram of image feature extraction according to an embodiment of the present application. Referring to fig. 4, two adjacent video frames are taken from the video frame sequence, and background subtraction is performed to obtain the foreground image in each video frame. Each frame is then passed through the backbone network fθ to obtain a corresponding feature block, and the features at the positions corresponding to the foreground and the background are cut out of the feature block. Features of different sizes are transformed to the same scale using ROIAlign; training then reduces the feature gap for positive sample pairs and increases it for negative sample pairs.
As shown in fig. 4, the three people framed in the two adjacent video frames are the foreground of those frames. The three persons appear in both video frames, and the feature blocks extracted through the backbone network are denoted Vn and Vn+1. As shown in the figure, the three boxes A1, A2 and A3 in Vn correspond to the three foregrounds, and the three boxes B1, B2 and B3 in Vn+1 correspond to the same three foregrounds; A1 and B1 correspond to the same target, A2 and B2 correspond to the same target, and A3 and B3 correspond to the same target. Thus, positive sample pairs can be obtained: A1 and B1, A2 and B2, A3 and B3. Two feature blocks of different targets can be regarded as negative sample pairs, for example A1 and B2, A2 and B3, A3 and B1, and so on.
In addition, the boxes A4 and A5 in Vn are background parts that do not contain foreground images; the boxes B4 and B5 in Vn+1 are at the same positions as A4 and A5 respectively and show the same background, so further positive sample pairs can be obtained: A4 and B4, A5 and B5.
The regions of interest of the cut-out feature blocks are aligned, and self-supervised training is then performed so that the features of the positive sample pairs become as similar as possible and the features of the negative sample pairs become different, thereby training the network.
In particular, the self-supervised trained network can be applied to the downstream tasks of target detection and segmentation, and the effect of the network on COCO and Pascal VOC data sets is verified.
In this embodiment, the method makes full use of the inter-frame information of the surveillance video and collects the same foreground and background regions to train for feature consistency, achieving good results on detection and segmentation data sets. The results of performing downstream tasks on each data set using the method of this embodiment are as follows:
the results of target detection on the COCO were:
AP      AP50    AP75
37.443  56.068  40.088
the result of the object segmentation on the COCO is:
(segmentation results table provided as an image in the original document)
the results of the target detection on Pascal VOCs were:
AP       AP50     AP75
54.8224  81.9173  59.2730
after the effect of the method in the embodiment is verified on the data set, downstream tasks such as target segmentation and detection can be completed by the method in the embodiment.
In this embodiment, the foreground and the background in the video frames are distinguished, so that complete foreground images are selected as sample images. The samples are complete and contain no redundancy, which facilitates training, and the algorithm obtained by training shows a clear improvement in target segmentation and detection on each data set.
It should be noted that, the sequence numbers of the steps in the foregoing embodiments do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Referring to fig. 5, a schematic diagram of a training apparatus for a feature extraction network according to an embodiment of the present application is shown, and specifically may include a background interval determining module 51, a foreground pixel point determining module 52, a foreground image obtaining module 53, a sample pair determining module 54, and a network training module 55, where:
a background interval determining module 51, configured to determine a background interval of each pixel in the video frame sequence;
a foreground pixel point determining module 52, configured to determine a foreground pixel point in a target video frame according to the background interval, where the target video frame is any video frame in the video frame sequence;
a foreground image obtaining module 53, configured to obtain a foreground image in each target video frame according to the foreground pixel;
a sample pair determining module 54, configured to determine a positive sample and a negative sample of a foreground image of the target video frame to construct a positive sample pair and a negative sample pair;
and the network training module 55 is configured to perform self-supervision training on the preset feature extraction network by using the positive sample pair and the negative sample pair.
The background interval determination module 51 includes:
the fluctuation range determining submodule is used for carrying out Gaussian fitting on pixel points at each corresponding position of each video frame in the video frame sequence to obtain the fluctuation range of the pixel values of the pixel points;
and the background interval determining submodule is used for taking the fluctuation range of the pixel value of each pixel point as the background interval of each pixel point.
The foreground pixel point determining module 52 includes:
the first judgment submodule is used for judging whether the pixel value of a pixel point in the target video frame is positioned in a background interval corresponding to the pixel point;
and the foreground pixel point determining submodule is used for determining the pixel point as a foreground pixel point if the pixel value of the pixel point is not in the background interval corresponding to the pixel point.
The foreground image obtaining module 53 includes:
the setting submodule is used for setting the pixel value of a foreground pixel point in the target video frame as a preset value;
the coordinate determination submodule is used for determining the coordinates of the frame of the closed area communicated by a plurality of pixel points with the pixel values being the preset values;
and the foreground image acquisition sub-module is used for acquiring the foreground image from the target video frame according to the coordinates.
The above-mentioned sample pair determination module 54 includes:
the sample video extraction sub-module is used for extracting a preset number of sample video frames from the video frame sequence according to a preset interval;
the second judgment submodule is used for judging whether the foreground image of each sample video frame is the same as the foreground image of the target video frame;
a sample determining submodule, configured to, if a foreground image of the sample video frame is the same as a foreground image of the target video frame, take the foreground image of the sample video frame as a positive sample, and otherwise, take the foreground image of the sample video frame as a negative sample;
a positive sample pair construction submodule, configured to construct a positive sample pair according to the foreground image of the target video frame and the positive sample;
and the negative sample pair construction submodule is used for constructing a negative sample pair according to the foreground image of the target video frame and the negative sample.
The above-mentioned device still includes:
a target area determination module, configured to select a target area that does not include the foreground image from the target video frame;
and the positive sample determining module is used for taking an image area in the sample video frame, which has the same position as the target area, as a positive sample of the target area, and the target area and the positive sample of the target area form the positive sample pair.
The network training module 55 includes:
a feature extraction submodule, configured to respectively extract features in the positive sample pair and the negative sample pair by using the feature extraction network, where the features include a foreground image feature, a positive sample feature, and a negative sample feature;
a calculating sub-module, configured to calculate a positive sample feature difference of the positive sample pair and a negative sample feature difference of the negative sample pair, respectively, where the positive sample feature difference is a difference between the foreground image feature and the positive sample feature, and the negative sample feature difference is a difference between the foreground image feature and the negative sample feature;
a third judging submodule, configured to determine that the feature extraction network training is completed when a distance between the positive sample feature difference and the negative sample feature difference reaches a preset distance value; otherwise, adjusting the parameters of the feature extraction network, and continuing to train the feature extraction network.
For the apparatus embodiment, since it is substantially similar to the method embodiment, it is described relatively simply, and reference may be made to the description of the method embodiment section for relevant points.
Fig. 6 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 6, the terminal device 6 of this embodiment includes: at least one processor 60 (only one shown in fig. 6), a memory 61, and a computer program 62 stored in the memory 61 and executable on the at least one processor 60, the processor 60 implementing the steps in any of the various method embodiments described above when executing the computer program 62.
The terminal device 6 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor 60, a memory 61. Those skilled in the art will appreciate that fig. 6 is only an example of the terminal device 6, and does not constitute a limitation to the terminal device 6, and may include more or less components than those shown, or combine some components, or different components, such as an input/output device, a network access device, and the like.
The Processor 60 may be a Central Processing Unit (CPU), and the Processor 60 may be other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 61 may in some embodiments be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are equipped on the terminal device 6. Further, the memory 61 may also include both an internal storage unit and an external storage device of the terminal device 6. The memory 61 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 61 may also be used to temporarily store data that has been output or is to be output.
The embodiment of the application also discloses a terminal device, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the training method for a feature extraction network in the foregoing embodiments.
The embodiment of the application also discloses a computer-readable storage medium, which stores a computer program; when the computer program is executed by a processor, the training method for a feature extraction network according to the foregoing embodiments is implemented.
The embodiment of the application also discloses a computer program product which, when run on a terminal device, causes the terminal device to execute the training method for a feature extraction network in each of the foregoing embodiments.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal apparatus, a recording medium, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are merely illustrative; the division into modules or units is only one logical division, and other divisions are possible in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for training a feature extraction network, comprising:
determining a background interval of each pixel point in a video frame sequence;
determining foreground pixel points in a target video frame according to the background interval, wherein the target video frame is any video frame in the video frame sequence;
obtaining a foreground image in each target video frame according to the foreground pixel points;
determining positive and negative samples of a foreground image of the target video frame to construct a positive sample pair and a negative sample pair;
and performing self-supervised training on a preset feature extraction network by adopting the positive sample pair and the negative sample pair.
2. The method of claim 1, wherein determining a background interval for each pixel point in the sequence of video frames comprises:
performing Gaussian fitting on pixel points at each corresponding position of each video frame in the video frame sequence to obtain the fluctuation range of the pixel values of the pixel points;
and taking the fluctuation range of the pixel value of each pixel point as the background interval of each pixel point.
3. The method of claim 1, wherein determining foreground pixel points in a target video frame based on the background interval comprises:
judging whether the pixel value of a pixel point in the target video frame is located in a background interval corresponding to the pixel point;
and if the pixel value of the pixel point is not in the background interval corresponding to the pixel point, determining the pixel point as a foreground pixel point.
4. The method according to any one of claims 1 to 3, wherein the obtaining a foreground image in each of the target video frames according to the foreground pixel points comprises:
setting the pixel value of a foreground pixel point in the target video frame as a preset value;
determining the border coordinates of a closed region formed by a plurality of connected pixel points whose pixel values are the preset value;
and acquiring the foreground image from the target video frame according to the coordinates.
5. The method of any one of claims 1-3, wherein the determining positive and negative samples of the foreground image of the target video frame to construct pairs of positive and negative samples comprises:
extracting a preset number of sample video frames from the video frame sequence according to a preset interval;
judging whether the foreground image of each sample video frame is the same as the foreground image of the target video frame;
if the foreground image of the sample video frame is the same as the foreground image of the target video frame, taking the foreground image of the sample video frame as a positive sample, otherwise, taking the foreground image of the sample video frame as a negative sample;
forming a positive sample pair according to the foreground image of the target video frame and the positive sample;
and forming a negative sample pair according to the foreground image of the target video frame and the negative sample.
6. The method of claim 5, wherein after the determining of the positive and negative samples of the foreground image of the target video frame, the method further comprises:
selecting a target area which does not contain the foreground image from the target video frame;
and taking an image area in the sample video frame, which is at the same position as the target area, as a positive sample of the target area, wherein the target area and the positive sample of the target area form the positive sample pair.
7. The method of claim 5, wherein the self-supervised training of the preset feature extraction network with the positive and negative sample pairs comprises:
respectively extracting features in the positive sample pair and the negative sample pair by using the feature extraction network, wherein the features comprise foreground image features, positive sample features and negative sample features;
calculating a positive sample feature difference of the positive sample pair and a negative sample feature difference of the negative sample pair respectively, the positive sample feature difference being a difference between the foreground image feature and the positive sample feature, the negative sample feature difference being a difference between the foreground image feature and the negative sample feature;
when the distance between the positive sample feature difference and the negative sample feature difference reaches a preset distance value, determining that the training of the feature extraction network is finished; otherwise, adjusting the parameters of the feature extraction network and continuing to train the feature extraction network.
8. An apparatus for training a feature extraction network, comprising:
the background interval determining module is used for determining the background interval of each pixel point in the video frame sequence;
a foreground pixel point determining module, configured to determine a foreground pixel point in a target video frame according to the background interval, where the target video frame is any video frame in the video frame sequence;
the foreground image acquisition module is used for acquiring a foreground image in each target video frame according to the foreground pixel points;
a sample pair determining module, configured to determine a positive sample and a negative sample of a foreground image of the target video frame to construct a positive sample pair and a negative sample pair;
and the network training module is used for performing self-supervised training on the preset feature extraction network by adopting the positive sample pair and the negative sample pair.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the training method of the feature extraction network according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a method of training a feature extraction network according to any one of claims 1 to 7.
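For reference, the method steps recited in claims 1 to 7 can be sketched in code; the sketches below are illustrative only and are not part of the claims. First, a minimal sketch of the per-pixel Gaussian background interval of claims 2 and 3, assuming grayscale frames stacked as a (T, H, W) NumPy array and a three-sigma fluctuation range; the factor k and all function names are illustrative choices, not taken from the application.

import numpy as np

def background_interval(frames, k=3.0):
    # Fit a Gaussian to each pixel position over the frame sequence and
    # return the (low, high) fluctuation range of its pixel values (claim 2).
    stack = np.stack(frames).astype(np.float32)   # (T, H, W) grayscale frames
    mean = stack.mean(axis=0)
    std = stack.std(axis=0)
    return mean - k * std, mean + k * std         # per-pixel background interval

def foreground_mask(frame, low, high):
    # A pixel whose value falls outside its background interval is foreground (claim 3).
    f = frame.astype(np.float32)
    return (f < low) | (f > high)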
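The foreground image extraction of claim 4 - setting foreground pixels to a preset value, locating the closed connected regions and cropping their borders from the target video frame - might be read as in the following sketch. scipy.ndimage is used here only as a convenient connected-component routine; the claim names no library, and PRESET and min_area are illustrative assumptions.

import numpy as np
from scipy import ndimage

PRESET = 255  # the "preset value" assigned to foreground pixel points

def crop_foreground_images(frame, mask, min_area=64):
    marked = np.where(mask, PRESET, 0).astype(np.uint8)   # preset-valued foreground map
    labels, _ = ndimage.label(marked == PRESET)           # closed regions of connected pixels
    crops = []
    for ys, xs in ndimage.find_objects(labels):           # border coordinates of each region
        if (ys.stop - ys.start) * (xs.stop - xs.start) >= min_area:
            crops.append(frame[ys.start:ys.stop, xs.start:xs.stop])  # foreground image
    return crops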
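Claims 5 and 6 build positive and negative sample pairs from sample video frames taken at a preset interval. The claims leave open how to decide that the foreground image of a sample frame is "the same as" that of the target frame; the sketch below uses bounding-box IoU between the two foreground images purely as a stand-in for that test, and background_positive_pair illustrates the same-position background pair of claim 6. The threshold and all names are assumptions.

def iou(a, b):
    # Intersection over union of two (x1, y1, x2, y2) boxes.
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def build_pairs(target_crop, target_box, samples, iou_same=0.5):
    # samples: list of (foreground crop, box) from the sample video frames.
    positives, negatives = [], []
    for crop, box in samples:
        if iou(target_box, box) >= iou_same:       # treated as the "same" foreground
            positives.append((target_crop, crop))  # positive sample pair (claim 5)
        else:
            negatives.append((target_crop, crop))  # negative sample pair (claim 5)
    return positives, negatives

def background_positive_pair(target_frame, sample_frame, area):
    # Claim 6: an area of the target frame containing no foreground and the
    # same-position area of a sample frame form an additional positive pair.
    y1, x1, y2, x2 = area
    return target_frame[y1:y2, x1:x2], sample_frame[y1:y2, x1:x2]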
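One possible reading of the self-supervised training of claim 7, written with PyTorch purely for illustration (the application names no framework): the feature extraction network produces the foreground image feature, the positive sample feature and the negative sample feature; the positive and negative sample feature differences are compared, a triplet-style margin pushes their gap toward the preset distance value, and training is treated as finished once that gap is reached. The margin value and the stopping statistic are assumptions.

import torch
import torch.nn.functional as F

def train_step(net, optimizer, anchor, positive, negative, margin=1.0):
    f_a, f_p, f_n = net(anchor), net(positive), net(negative)
    d_pos = (f_a - f_p).norm(dim=1)               # positive sample feature difference
    d_neg = (f_a - f_n).norm(dim=1)               # negative sample feature difference
    loss = F.relu(d_pos - d_neg + margin).mean()  # push the gap toward the preset distance
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    done = bool((d_neg - d_pos).mean() >= margin) # "preset distance value" reached
    return loss.item(), done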
CN202111614706.3A 2021-12-25 2021-12-25 Training method and device for feature extraction network, terminal equipment and medium Pending CN114419489A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111614706.3A CN114419489A (en) 2021-12-25 2021-12-25 Training method and device for feature extraction network, terminal equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111614706.3A CN114419489A (en) 2021-12-25 2021-12-25 Training method and device for feature extraction network, terminal equipment and medium

Publications (1)

Publication Number Publication Date
CN114419489A true CN114419489A (en) 2022-04-29

Family

ID=81270451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111614706.3A Pending CN114419489A (en) 2021-12-25 2021-12-25 Training method and device for feature extraction network, terminal equipment and medium

Country Status (1)

Country Link
CN (1) CN114419489A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818989A (en) * 2022-06-21 2022-07-29 中山大学深圳研究院 Gait-based behavior recognition method and device, terminal equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106960195B (en) Crowd counting method and device based on deep learning
CN104978567B (en) Vehicle checking method based on scene classification
CN109685045B (en) Moving target video tracking method and system
CN107909081B (en) Method for quickly acquiring and quickly calibrating image data set in deep learning
CN106683119B (en) Moving vehicle detection method based on aerial video image
CN112634326A (en) Target tracking method and device, electronic equipment and storage medium
CN108564579B (en) Concrete crack detection method and detection device based on time-space correlation
CN111723773B (en) Method and device for detecting carryover, electronic equipment and readable storage medium
CN103530638A (en) Method for matching pedestrians under multiple cameras
CN110825900A (en) Training method of feature reconstruction layer, reconstruction method of image features and related device
CN111179302B (en) Moving target detection method and device, storage medium and terminal equipment
WO2013075295A1 (en) Clothing identification method and system for low-resolution video
CN110765903A (en) Pedestrian re-identification method and device and storage medium
CN108961262B (en) Bar code positioning method in complex scene
CN113409360A (en) High altitude parabolic detection method and device, equipment and computer storage medium
CN117994987B (en) Traffic parameter extraction method and related device based on target detection technology
CN111275040A (en) Positioning method and device, electronic equipment and computer readable storage medium
CN113780110A (en) Method and device for detecting weak and small targets in image sequence in real time
CN111160107B (en) Dynamic region detection method based on feature matching
Su et al. A new local-main-gradient-orientation HOG and contour differences based algorithm for object classification
CN114419489A (en) Training method and device for feature extraction network, terminal equipment and medium
CN116704490B (en) License plate recognition method, license plate recognition device and computer equipment
CN106778675B (en) A kind of recognition methods of target in video image object and device
CN110276260B (en) Commodity detection method based on depth camera
CN115018886B (en) Motion trajectory identification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination