WO2020135554A1 - Picture processing method, apparatus, device and storage medium - Google Patents

Picture processing method, apparatus, device and storage medium

Info

Publication number
WO2020135554A1
Authority
WO
WIPO (PCT)
Prior art keywords
video frame
picture
visual task
task processing
processing model
Prior art date
Application number
PCT/CN2019/128573
Other languages
English (en)
French (fr)
Inventor
张壮辉
梁柱锦
王俊东
梁德澎
张树业
Original Assignee
广州市百果园信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州市百果园信息技术有限公司
Priority to US17/418,692 (published as US20220083808A1)
Priority to SG11202107121VA
Priority to RU2021120968A (published as RU2770748C1)
Publication of WO2020135554A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/30 - Determination of transform parameters for the alignment of images, i.e. image registration
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Definitions

  • The embodiments of the present application relate to the field of computer vision technology, for example, to a picture processing method, apparatus, device, and storage medium.
  • Computer vision is a science that studies how to use machines to simulate human and biological vision processing functions.
  • Computer vision uses cameras in place of the human eye to collect visual information and computers in place of the brain to process and analyze that information, so as to complete tasks such as image classification, image segmentation, object detection, key point positioning, pose estimation, and face recognition.
  • With the improvement of computer hardware performance and the emergence of large-scale image data, deep learning has been widely used in the field of computer vision. Deep learning stems from the study of artificial neural networks, is an important branch of machine learning, and has formed a new end-to-end model. Its motivation is to simulate the learning mechanism of the human brain; in the embodiments of the present application, deep learning refers to deep convolutional neural networks.
  • In contrast, the traditional computer vision recognition method extracts handcrafted features based on the different colors, textures, and edges perceived in the picture.
  • A deep convolutional neural network is composed of many different linear layers and nonlinear layers.
  • Such a deep network structure can extract features from shallow to deep and from concrete to abstract. The high-level features automatically extracted by the network have strong expressive power and can capture many abstract concepts and much semantic information in the picture, for example, the target object in the picture and the location of the target object.
  • The related art has at least the following problem: although deep learning is widely used in image classification, image segmentation, object detection, key point positioning, pose estimation, face recognition, and the like, a visual task processing model trained with deep learning still has low prediction accuracy when processing visual tasks, owing to complex and changeable scenes and/or objects that are difficult to recognize.
  • Embodiments of the present application provide a picture processing method, apparatus, device, and storage medium to improve the prediction accuracy of a visual task processing model.
  • an embodiment of the present application provides a picture processing method.
  • the method includes:
  • acquiring an original picture and auxiliary information of the original picture; inputting the original picture into a main path of a first visual task processing model to obtain an object feature map, and inputting the auxiliary information into a branch of the first visual task processing model to obtain an auxiliary feature map; and fusing the object feature map and the auxiliary feature map and inputting the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
  • An embodiment of the present application further provides a picture processing apparatus, including:
  • an original picture and auxiliary information acquisition module, configured to obtain the original picture and the auxiliary information of the original picture;
  • the feature map acquisition module is configured to input the original picture into the main path of the first visual task processing model to obtain an object feature map, and input the auxiliary information into the branch of the first visual task processing model to obtain an auxiliary feature map;
  • a response map acquisition module of the original picture, configured to fuse the object feature map and the auxiliary feature map and input the fused result into the main path of the first visual task processing model to obtain the response map of the original picture.
  • an embodiment of the present application further provides a device, which includes:
  • one or more processors;
  • a memory configured to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in the embodiments of the present application.
  • an embodiment of the present application further provides a computer-readable storage medium.
  • The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the method described in the embodiments of the present application is implemented.
  • FIG. 1 is a flowchart of a picture processing method in an embodiment of the present application
  • FIG. 2 is a flowchart of another image processing method in an embodiment of the present application.
  • FIG. 3 is an application schematic diagram of a picture processing method in an embodiment of the present application.
  • FIG. 4 is a flowchart of yet another picture processing method in an embodiment of the present application.
  • FIG. 5 is an application schematic diagram of another picture processing method in an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a picture processing device in an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a device in an embodiment of the present application.
  • Prior knowledge can be understood as auxiliary information related to the original picture. The above content will be described below in conjunction with the embodiments.
  • FIG. 1 is a flowchart of a picture processing method provided by an embodiment of the present application. This embodiment may be applicable to a case of processing a visual task.
  • The method may be executed by a picture processing apparatus, which may be implemented in software and/or hardware.
  • The apparatus can be configured in a device, such as a computer or a mobile terminal. As shown in FIG. 1, the method includes steps 110 to 130.
  • Step 110: Acquire an original picture and auxiliary information of the original picture.
  • When the original picture is collected, the auxiliary information related to the original picture is also collected; the auxiliary information related to the original picture can serve as prior knowledge.
  • The original picture can be understood as a picture on which a visual task needs to be performed, and the visual task may include image classification, image segmentation, object detection, key point positioning, and pose estimation.
  • the original picture may be a single picture or a video frame in the video.
  • If the original picture is a single picture, the auxiliary information of the original picture may include the background picture corresponding to the original picture, which can be understood as follows: the original picture includes the target object, while the background picture is a picture that does not include the target object.
  • From another perspective, the background picture is a picture obtained by removing the target object from the original picture. For example, a picture obtained by a camera shooting a sleeping kitten in a corner of a room is an original picture, and a picture obtained by the camera shooting the same corner without the kitten is the background picture; here the target object is the sleeping kitten.
  • If the original picture is a video frame, the auxiliary information of the original picture may include the previous video frame of the current video frame and the response map of the previous video frame.
  • Step 120: Input the original picture into the main path of the first visual task processing model to obtain the object feature map, and input the auxiliary information of the original picture into the branch of the first visual task processing model to obtain the auxiliary feature map.
  • Step 130: Fuse the object feature map and the auxiliary feature map and input the fused result into the main path of the first visual task processing model to obtain the response map of the original picture.
  • The response map of the original picture can be understood as the result obtained after performing the corresponding type of visual task on the original picture.
  • The representation form of the response map of the original picture is determined by the type of visual task. For example, if the visual task is image segmentation (which classifies each pixel in the picture according to its category), the response map of the original picture can be the probability map of the category to which each pixel in the original picture belongs, or the semantic segmentation map of the image converted from that probability map by setting a probability threshold (a conversion sketched below); if the visual task is object detection, the response map of the original picture is a map containing a preselection box into which the target object falls; if the visual task is key point positioning, the response map of the original picture is a heat map generated based on the positions of the key points.
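  • As a minimal illustration of the threshold conversion mentioned above (PyTorch assumed; the threshold value and tensor shapes are illustrative, not values from this application):

```python
import torch

prob_map = torch.rand(1, 1, 224, 224)    # probability of the target class per pixel
threshold = 0.5                          # assumed probability threshold
seg_map = (prob_map > threshold).long()  # semantic segmentation map: 1 = target, 0 = background
```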
  • the first visual task processing model may be generated based on convolutional neural network training, and the first visual task processing model may include a main path and a branch.
  • A convolutional neural network is a multi-layer neural network, which can include convolutional layers, pooling layers, nonlinear activation layers, and fully connected layers. Each layer is composed of multiple feature maps, and each pixel in a feature map represents a neuron.
  • A feature map can be represented by W × H × K, where W indicates the width of the feature map, H indicates the height of the feature map, K indicates the number of channels, and W × H indicates the size of the feature map.
  • the number of channels refers to the number of convolution kernels in each convolutional layer.
  • The above convolutional layers, pooling layers, nonlinear activation layers, and fully connected layers form the network structure of the convolutional neural network.
  • This network structure is relatively complicated and has a large number of parameters, so a lightweight convolutional neural network, such as a fully convolutional neural network, can be used; a fully convolutional neural network is a convolutional neural network that contains no fully connected layers.
  • The following takes a first visual task processing model generated by training a fully convolutional neural network as an example to describe the structure of the first visual task processing model.
  • The main path of the first visual task processing model includes a first downsampling module and an upsampling module, with the output end of the first downsampling module connected to the input end of the upsampling module. The branch of the first visual task processing model includes a second downsampling module, and the first downsampling module is arranged in parallel with the second downsampling module.
  • Each downsampling module can include M convolutional layers, and the upsampling module can include M transposed convolutional layers; each convolutional layer can be followed by a batch normalization layer and a nonlinear activation layer. After the first downsampling module or the second downsampling module, a downsampled feature map is obtained.
  • The downsampled feature map contains the feature information of the input picture, and because its size is reduced relative to the input picture, the downsampled feature map has a larger receptive field and can provide more contextual information.
  • After the upsampling module, the size of the upsampled feature map is the same as the size of the input picture.
  • The specific structural form of the first visual task processing model may be designed according to actual conditions; one possible form is sketched below.
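  • The following is a minimal PyTorch sketch of the fully convolutional structure described above (a first downsampling module and an upsampling module on the main path, a parallel second downsampling module on the branch). The layer counts, channel widths, strides, and the choice of element-wise addition for fusion are illustrative assumptions, not values given in this application:

```python
import torch
import torch.nn as nn

def down_module(in_ch, ch, m=3):
    """M convolutional layers, each followed by batch normalization and a nonlinearity."""
    layers = []
    for i in range(m):
        layers += [nn.Conv2d(in_ch if i == 0 else ch, ch, 3, stride=2, padding=1),
                   nn.BatchNorm2d(ch),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

def up_module(ch, out_ch, m=3):
    """M transposed convolutional layers that restore the input resolution."""
    layers = []
    for i in range(m):
        last = (i == m - 1)
        layers += [nn.ConvTranspose2d(ch, out_ch if last else ch, 4, stride=2, padding=1)]
        if not last:
            layers += [nn.BatchNorm2d(ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class FirstVisualTaskModel(nn.Module):
    def __init__(self, img_ch=3, aux_ch=3, ch=32, out_ch=1):
        super().__init__()
        self.main_down = down_module(img_ch, ch)    # first downsampling module (main path)
        self.branch_down = down_module(aux_ch, ch)  # second downsampling module (branch)
        self.main_up = up_module(ch, out_ch)        # upsampling module (main path)

    def forward(self, picture, auxiliary):
        obj_feat = self.main_down(picture)      # object feature map
        aux_feat = self.branch_down(auxiliary)  # auxiliary feature map (same size, same channels)
        fused = obj_feat + aux_feat             # bitwise (element-wise) fusion
        return self.main_up(fused)              # response map, same size as the input picture
```

  • With three stride-2 convolutions mirrored by three stride-2 transposed convolutions, the upsampled response map recovers the input resolution, matching the description above.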
  • Input the original picture into the main path of the first visual task processing model to obtain the object feature map; the object feature map described here may be the downsampled feature map output by the first downsampling module described above, and it contains the feature information of the original picture. Input the auxiliary information of the original picture into the branch of the first visual task processing model to obtain the auxiliary feature map; the auxiliary feature map may be the downsampled feature map output by the second downsampling module described above.
  • The auxiliary feature map contains the feature information of the auxiliary information of the original picture.
  • Fuse the object feature map and the auxiliary feature map and input the fused result into the main path of the first visual task processing model to obtain the response map of the original picture; the response map described here may be the upsampled feature map obtained after passing through the upsampling module described above.
  • Since the auxiliary information of the original picture, that is, the prior knowledge, also participates in the process of generating the response map of the original picture, it helps improve the prediction accuracy of the model during that process. Therefore, a response map of the original picture generated with the participation of the auxiliary information is more accurate than one generated from the original picture alone, without the auxiliary information.
  • For fusion, the size of the object feature map and that of the auxiliary feature map are the same, and the number of channels of the object feature map is the same as the number of channels of the auxiliary feature map. To this end, the first downsampling module and the second downsampling module can be set to have the same structure and the same number of convolution kernels, that is, to include the same number of convolutional layers and the same number of convolution kernels.
  • The object feature map and the auxiliary feature map can be fused in either of two ways: in method one, the object feature map and the auxiliary feature map are fused in a bitwise (element-wise) manner; in method two, the object feature map and the auxiliary feature map are fused through channel interaction. Which method to use can be set according to the actual situation; both are sketched below.
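  • Both fusion options can be illustrated as follows (PyTorch assumed; "channel interaction" is sketched here as channel concatenation followed by a 1x1 convolution, which is one possible reading rather than a construction stated in this application):

```python
import torch
import torch.nn as nn

obj_feat = torch.randn(1, 32, 28, 28)  # object feature map, K = 32 channels
aux_feat = torch.randn(1, 32, 28, 28)  # auxiliary feature map, same size and channel count

# Method one: bitwise (element-wise) fusion
fused_bitwise = obj_feat + aux_feat                          # shape unchanged: 1 x 32 x 28 x 28

# Method two: fusion through channel interaction
mix = nn.Conv2d(64, 32, kernel_size=1)                       # 1x1 convolution mixes the channels
fused_channel = mix(torch.cat([obj_feat, aux_feat], dim=1))  # back to 1 x 32 x 28 x 28
```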
  • When the original picture is the current video frame, the auxiliary information of the original picture includes the previous video frame and the response map of the previous video frame.
  • The response map of the previous video frame may be the map obtained by inputting the previous video frame into the first visual task processing model as an input variable.
  • Since the auxiliary information of the original picture serves as prior knowledge to improve the prediction accuracy of the model, the response map of the previous video frame contained in that auxiliary information should be as accurate as possible.
  • Therefore, for a previous video frame that meets preset conditions, the previous video frame is not input as an input variable into the first visual task processing model; instead, a visual task processing model with higher prediction accuracy than the first visual task processing model is selected.
  • However, the higher the prediction accuracy of a model, the more complex its structure, the larger its parameter amount, and the lower its prediction efficiency. Selecting the higher-accuracy model therefore improves the accuracy of the response map of the previous video frame but reduces the prediction efficiency of the model. Based on the above, whether to obtain the response map of the previous video frame from the first visual task processing model or from a model with higher prediction accuracy can be determined according to the actual situation, in either of the following two ways:
  • Method 1: When the previous video frame belongs to one of the first N video frames of the video, input the previous video frame as an input variable into a model with higher prediction accuracy than the first visual task processing model to obtain the response map of the previous video frame; when the previous video frame does not belong to one of the first N video frames of the video, input the previous video frame as an input variable into the first visual task processing model to obtain the response map of the previous video frame.
  • Here, N is a positive integer.
  • Because adjacent video frames are usually correlated, using the higher-accuracy model for the first N frames guarantees the accuracy of the response map of the previous video frame that serves as prior knowledge in the auxiliary information of the original picture.
  • The above Method 1 thus decides how to acquire the response map of the previous video frame in units of the whole video; a sketch follows.
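  • A hedged sketch of Method 1 (the model objects, the 0-based frame-index convention, and the value of N are illustrative assumptions):

```python
def prev_frame_response(prev_frame, frame_index, first_model, second_model, n=5):
    """frame_index: 0-based position of the previous frame within the video."""
    if frame_index < n:                  # one of the first N video frames of the video
        return second_model(prev_frame)  # the model with higher prediction accuracy
    return first_model(prev_frame)       # lightweight first visual task processing model
```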
  • Method 2: If the duration of the video is greater than or equal to a duration threshold, the accuracy achieved by Method 1 may not meet actual requirements.
  • In that case, the video frames can be divided in chronological order into two or more video frame sequences; the multiple video frame sequences do not overlap, and the number of video frames in each sequence may be the same or different, determined according to actual conditions.
  • Within each video frame sequence, the frames can be numbered in chronological order as the first video frame, the second video frame, ..., the P-th video frame.
  • The previous video frame will belong to one of these multiple video frame sequences.
  • Compared with Method 1, the unit for deciding how to acquire the response map of the previous video frame thus changes from the whole video to the video frame sequence.
  • When the previous video frame belongs to one of the first T video frames of the video frame sequence corresponding to the previous video frame, the previous video frame is input as an input variable into a model with higher prediction accuracy than the first visual task processing model to obtain the response map of the previous video frame; if the previous video frame does not belong to one of the first T video frames of that sequence, the previous video frame is input as an input variable into the first visual task processing model to obtain the response map of the previous video frame, where T is a positive integer.
  • The above processing works because multiple video frames within a video sequence are usually correlated, so obtaining the response maps of the first T frames of each sequence from the model with higher prediction accuracy than the first visual task processing model guarantees the accuracy of the response map of the previous video frame that serves as prior knowledge in the auxiliary information of the original picture.
  • For such videos, deciding how to acquire the response map of the previous video frame in units of video frame sequences rather than the whole video improves the accuracy of the response map of the previous video frame in the auxiliary information of the original picture; a sketch follows.
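  • A hedged sketch of Method 2, under the simplifying assumption of equal-length, non-overlapping sequences (the sequence length and T are illustrative):

```python
def prev_frame_response_by_sequence(prev_frame, frame_index, first_model,
                                    second_model, seq_len=30, t=3):
    """frame_index: 0-based position of the previous frame within the video."""
    pos_in_sequence = frame_index % seq_len  # position inside the frame's own sequence
    if pos_in_sequence < t:                  # one of the first T frames of the sequence
        return second_model(prev_frame)      # the model with higher prediction accuracy
    return first_model(prev_frame)
```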
  • When the original picture is the current video frame and the auxiliary information of the original picture includes the previous video frame of the current video frame and the response map of the previous video frame, whether to adjust the representation form of the response map of the previous video frame is determined according to the type of visual task.
  • If the visual task is image segmentation, the response map of the previous video frame is the probability map of the category to which each pixel in the previous video frame belongs, or the semantic segmentation map of the image converted from that probability map by setting a probability threshold.
  • For image segmentation, the response map of the previous video frame can be input directly as an input variable into the branch of the first visual task processing model, without adjusting its representation form. If the visual task is object detection, the response map of the previous video frame is a map containing a preselection box and needs to be adjusted: for example, set the pixel value of the pixels inside the preselection box to 1 and the pixel value of the pixels outside the preselection box to 0 (the exact pixel values inside and outside the preselection box can be set according to the actual situation), then input the adjusted response map of the previous video frame as an input variable into the branch of the first visual task processing model; this conversion is sketched below. If the visual task is key point positioning, the response map of the previous video frame is a heat map generated based on the positions of the key points and can be input directly as an input variable into the branch of the first visual task processing model, without adjusting its representation form.
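  • The object-detection adjustment can be sketched as follows (PyTorch assumed; the tensor shape and the (x1, y1, x2, y2) box convention are assumptions):

```python
import torch

def box_to_mask(box, height, width):
    """Convert a preselection-box response map into a mask-form input for the branch."""
    x1, y1, x2, y2 = box
    mask = torch.zeros(1, 1, height, width)
    mask[:, :, y1:y2, x1:x2] = 1.0  # pixel values inside the preselection box set to 1
    return mask                     # pixel values outside the box remain 0
```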
  • In this embodiment, the original picture is input into the main path of the first visual task processing model to obtain the object feature map, the auxiliary information is input into the branch of the first visual task processing model to obtain the auxiliary feature map, and the object feature map and the auxiliary feature map are fused and input into the main path of the first visual task processing model to obtain the response map of the original picture.
  • The auxiliary information of the original picture can provide strong prior knowledge, and this prior knowledge helps to overcome the complex and changeable scenes and/or hard-to-recognize objects that degrade the prediction accuracy of the visual task processing model; therefore, the prediction accuracy of the visual task processing model is improved.
  • In an embodiment, when the original picture is a single picture, the auxiliary information of the original picture includes the background picture corresponding to the original picture.
  • The background picture corresponding to the original picture can be understood as follows: the original picture includes the target object, while the background picture is a picture that does not include the target object. From another perspective, the background picture is the picture obtained by removing the target object from the original picture. The role of the background picture under this understanding is described below.
  • Without this prior knowledge, the following may occur when processing visual tasks: when the visual task is image segmentation (where the response map of the original picture can be the semantic segmentation map of the image), the foreground and background may be confused, or the edges of the generated response map of the original picture may be rough; when the visual task is object detection, the generated preselection box may jitter severely; when the visual task is key point positioning, key points may go unrecognized or may jitter.
  • The above situations indicate that the prediction accuracy of the model is not high, and the reason is not that the target object itself is difficult to identify, but that the scene is complex and changeable; relative to the target object, a complex and changeable scene can be understood as background interference information.
  • Because the background picture is the picture with the target object removed, compared to the original picture it contains only the background interference information. The background picture is used as an input variable to enter the branch of the first visual task processing model and obtain the auxiliary feature map, so the auxiliary feature map extracts the features of the background interference information.
  • When this auxiliary feature map participates in the process of generating the response map of the original picture, the generated response map of the original picture is a response map that suppresses background interference.
  • In other words, because the background picture is a picture obtained by removing the target object from the original picture, the background picture serves as prior knowledge that suppresses background interference, thereby improving the prediction accuracy of the model.
  • In an embodiment, when the original picture is the current video frame, the auxiliary information of the original picture includes the previous video frame of the current video frame and the response map of the previous video frame.
  • Without this prior knowledge, the following may occur when processing visual tasks on video: when the visual task is image segmentation, the segmentation mask may flicker between different video frames; when the visual task is object detection, the preselection boxes generated in several consecutive video frames may jitter severely; when the visual task is key point positioning, key points may jitter between adjacent video frames.
  • The above situations indicate that the prediction accuracy of the model is not high, and here the reason is that objects and/or scenes are difficult to recognize.
  • Because adjacent frames are correlated, the response map of the previous video frame is a strong reference for the response map of the current video frame, so it can be used as prior knowledge that participates in generating the response map of the current video frame.
  • Concretely, the previous video frame and its response map are input as input variables into the branch of the first visual task processing model to obtain the auxiliary feature map, which extracts their features; the auxiliary feature map then participates in the process of generating the response map of the current video frame.
  • In other words, the response map of the previous video frame serves as prior knowledge that enhances inter-frame continuity, thereby improving the prediction accuracy of the model.
  • To improve the prediction efficiency of the model, the structure of the first visual task processing model generated by training the convolutional neural network can be simplified as much as possible; correspondingly, the response map of the previous video frame can be obtained by inputting the previous video frame as an input variable into a model with higher prediction accuracy than the first visual task processing model.
  • In an embodiment, the response map of the previous video frame can be obtained as follows: when the previous video frame belongs to one of the first N video frames of the video, the response map of the previous video frame is the response map obtained by inputting the previous video frame into the second visual task processing model; when the previous video frame does not belong to one of the first N video frames of the video, the response map of the previous video frame is the response map obtained by inputting the previous video frame into the first visual task processing model.
  • The second visual task processing model has higher prediction accuracy than the first visual task processing model, and N is a positive integer.
  • In other words, the previous video frame serving as an input variable is not always input into the first visual task processing model; a visual task processing model with higher prediction accuracy than the first visual task processing model may be selected instead.
  • As noted above, the higher the prediction accuracy of a model, the more complex its structure and the larger its parameter amount, so selecting the higher-accuracy model improves the accuracy of the response map of the previous video frame but reduces the prediction efficiency of the model. Based on the above, whether to obtain the response map of the previous video frame from the first visual task processing model or from the model with higher prediction accuracy can be determined according to the actual situation.
  • That is, if the previous video frame is one of the first N video frames of the video, the previous video frame is input as an input variable into the second visual task processing model to obtain the response map of the previous video frame; if the previous video frame is not one of the first N video frames of the video, the previous video frame is input as an input variable into the first visual task processing model to obtain the response map of the previous video frame.
  • The second visual task processing model has higher prediction accuracy than the first visual task processing model, and N is a positive integer.
  • The above method determines how to acquire the response map of the previous video frame in units of the whole video.
  • This processing works because two adjacent video frames in a video are usually correlated: obtaining the response maps of the first N video frames by inputting them into the second visual task processing model ensures the accuracy of the response map that serves as prior knowledge, and hence the prediction accuracy of the model.
  • Because the prediction accuracy of the second visual task processing model is higher than that of the first visual task processing model, the structure of the second visual task processing model is more complicated and its parameter amount larger than those of the first visual task processing model. The computational cost increases with the complexity of the model structure and the parameter amount, and an increased computational cost means decreased model prediction efficiency.
  • The above method therefore balances model prediction accuracy against model prediction efficiency: it ensures the accuracy of the response map of the previous video frame that serves as prior knowledge while keeping the computational efficiency of the model at a high level.
  • When the visual task object is a video, the above method also enhances inter-frame consistency in terms of visual effect; in other words, because the prediction accuracy of the model is improved, inter-frame consistency is also achieved to a certain extent.
  • In an embodiment, the response map of the previous video frame may be obtained as follows: when the previous video frame belongs to one of the first T video frames of the video frame sequence corresponding to the previous video frame, the response map of the previous video frame is the response map obtained by inputting the previous video frame into the second visual task processing model; when the previous video frame does not belong to one of the first T video frames of that sequence, the response map of the previous video frame is the response map obtained by inputting the previous video frame into the first visual task processing model.
  • The video frame sequence is one of multiple video frame sequences obtained by dividing the video frames of the video; the second visual task processing model has higher prediction accuracy than the first visual task processing model, and T is a positive integer.
  • When the duration of the video is greater than or equal to a duration threshold, obtaining the response map of the previous video frame in units of the whole video may not meet actual requirements.
  • In that case, the video frames are divided in chronological order into two or more video frame sequences; the multiple video frame sequences do not overlap, and the number of video frames in each sequence may be the same or different, determined according to the actual situation.
  • Within each video frame sequence, the frames can be numbered in chronological order as the first video frame, the second video frame, ..., the P-th video frame.
  • The previous video frame will belong to one of these multiple video frame sequences, so the unit for deciding how to acquire the response map of the previous video frame changes from the whole video to the video frame sequence.
  • If the previous video frame belongs to one of the first T video frames of the video frame sequence corresponding to the previous video frame, the previous video frame is input as an input variable into the second visual task processing model to obtain the response map of the previous video frame; if not, the previous video frame is input as an input variable into the first visual task processing model to obtain the response map of the previous video frame, where the second visual task processing model has higher prediction accuracy than the first visual task processing model and T is a positive integer.
  • Obtaining the response maps of the first T frames of each sequence from the second visual task processing model guarantees the accuracy of the response map of the previous video frame that serves as prior knowledge in the auxiliary information of the original picture.
  • For long videos, deciding how to acquire the response map of the previous video frame in units of video frame sequences rather than the whole video improves the accuracy of the response map of the previous video frame in the auxiliary information of the original picture.
  • As before, because the prediction accuracy of the second visual task processing model is higher than that of the first visual task processing model, the structure of the second visual task processing model is more complicated and its parameter amount larger; the computational cost increases with both, and an increased computational cost means decreased model prediction efficiency.
  • the first visual task processing model may be trained by acquiring the original training picture, the labeling information of the original training picture, and the auxiliary training information of the original training picture.
  • The original training picture is input into the main path of the convolutional neural network to obtain the object training feature map, and the auxiliary training information is input into the branch of the convolutional neural network to obtain the auxiliary training feature map.
  • The object training feature map and the auxiliary training feature map are fused and input into the main path of the convolutional neural network to obtain the response map of the original training picture.
  • Based on the response map of the original training picture and the labeling information, the loss function of the convolutional neural network is obtained. The network parameters of the convolutional neural network are adjusted according to the loss function until the output value of the loss function is less than or equal to a preset threshold, and the trained convolutional neural network is used as the first visual task processing model.
  • The auxiliary training information, which can serve as prior knowledge, participates as an input variable in the training process of the first visual task processing model, entering through the branch of the first visual task processing model.
  • Accordingly, the path into which the original training picture is input as an input variable is called the main path of the first visual task processing model, and the path into which the auxiliary training information is input as an input variable is called the branch of the first visual task processing model.
  • During training, the path into which the original training picture is input as an input variable is the main path of the convolutional neural network, and the path into which the auxiliary training information is input as an input variable is the branch of the convolutional neural network.
  • The labeling information of the original training picture differs according to the type of visual task: for image segmentation, the labeling information is the real label of each pixel in the picture, where the real label indicates the category to which the pixel belongs; for object detection, the labeling information is the target box, which contains the target object; for key point positioning, the labeling information is the coordinate information of the key points.
  • If the original training picture is a training video frame, the auxiliary training information of the original training picture may include the previous training video frame and the response map of the previous training video frame; if the original training picture is a single picture, the auxiliary training information of the original training picture may include a background training picture.
  • The response map of the previous training video frame can be obtained by inputting the previous training video frame as an input variable into the second visual task processing model.
  • the object training feature map and the auxiliary training feature map are fused and input into the main path of the convolutional neural network to obtain the response map of the original training picture.
  • The loss function of the convolutional neural network is obtained (for example, calculated) based on the labeling information of the original training picture and the response map of the original training picture.
  • The loss function of the convolutional neural network can be a cross-entropy loss function, a 0-1 loss function, a square loss function, an absolute loss function, a log loss function, etc., which can be set according to the actual situation.
  • The training process of the convolutional neural network calculates the loss function through forward propagation and computes the partial derivatives of the loss function with respect to the network parameters.
  • The reverse gradient propagation method is then used to adjust the network parameters of the convolutional neural network until the output value of the loss function is less than or equal to the preset threshold. When the output value of the loss function is less than or equal to the preset threshold, the convolutional neural network has been trained and its network parameters are determined; on this basis, the trained convolutional neural network can be used as the first visual task processing model. A condensed sketch of this loop follows.
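  • The following is a condensed, hedged sketch of this training loop, reusing the two-input model form sketched earlier (PyTorch assumed; the dataset interface, the choice of loss, the optimizer, and the preset threshold value are illustrative, not values given in this application):

```python
import torch
import torch.nn as nn

def train_first_model(model, loader, preset_threshold=0.05, lr=1e-3):
    """Train until the loss output value falls to the preset threshold."""
    criterion = nn.BCEWithLogitsLoss()  # one possible loss; cross-entropy, square loss, etc. also fit
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    while True:
        # loader yields (picture, auxiliary, label); label is a float tensor
        # shaped like the response map (the labeling information).
        for picture, auxiliary, label in loader:
            response = model(picture, auxiliary)  # main path + branch, fused inside
            loss = criterion(response, label)     # forward propagation computes the loss
            optimizer.zero_grad()
            loss.backward()                       # partial derivatives w.r.t. network parameters
            optimizer.step()                      # reverse gradient propagation adjusts parameters
            if loss.item() <= preset_threshold:
                return model                      # usable as the first visual task processing model
```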
  • the convolutional neural network described in the embodiment of the present application may be a fully convolutional neural network, that is, the fully convolutional neural network described above, and the structural form of the fully convolutional neural network may be designed according to actual conditions .
  • the content of the auxiliary training information of the original training picture will be different according to the form of the original training picture.
  • the first visual task processing model obtained by training in the above manner will also be different.
  • the difference mentioned here may refer to the difference in network parameters of the first visual task processing model.
  • The auxiliary training information of the original training picture, acting as prior knowledge, raises the prediction accuracy of the first visual task processing model obtained during training.
  • Therefore, a first visual task processing model generated with the participation of the auxiliary training information of the original training picture has higher prediction accuracy than one generated from the original training picture alone, without the auxiliary training information.
  • The second visual task processing model described in the embodiments of the present application is a model that has been trained separately in advance.
  • The second visual task processing model may be configured to generate the response map of the previous training video frame and the response map of the previous video frame.
  • the auxiliary training information is auxiliary training information obtained through data enhancement processing.
  • the visual task processing model is generated based on the training of the convolutional neural network.
  • One of the advantages of the convolutional neural network is its ability to absorb data and convert it into continuous learning and updating of parameters, yielding a model with good prediction performance and generalization ability.
  • At the same time, the convolutional neural network places requirements on the number and quality of training samples; in other words, the number and quality of training samples significantly influence the prediction performance and generalization ability of the model. Based on the above, data enhancement methods can be used to process the training samples to increase their number and improve their quality, thereby improving the prediction performance and generalization ability of the model.
  • The training samples mentioned here refer to the auxiliary training information; that is, the embodiment of the present application applies the data enhancement method to the auxiliary training information, so the auxiliary training information used is the auxiliary training information obtained through data enhancement processing.
  • Using the data enhancement method to process the auxiliary training information can improve the quality of the auxiliary training information, which can be understood as follows:
  • The original training picture and the background training picture in the auxiliary training information are not taken at the same time but separately, so their shooting angle, brightness, deformation, and hue cannot be kept consistent, and the degree of inconsistency may differ from situation to situation. To reflect this difference and stay as close as possible to the actual situation, the difference is imposed on the background training picture in the auxiliary training information, which is what the data enhancement method realizes: after data enhancement processing, the background training picture can reflect the inconsistencies in shooting angle, brightness, deformation, and hue relative to the original training picture in different situations.
  • If the original training picture is the current training video frame and the auxiliary training information of the original training picture includes the previous training video frame and the response map of the previous training video frame, the response map of the previous training video frame is also subjected to data enhancement processing, so that the response map of the previous training video frame stays consistent with the previous training video frame.
  • Compared with a visual task processing model trained with the original training picture and auxiliary training information that have not undergone data enhancement processing as input variables, the model trained with data-enhanced inputs has better prediction performance and generalization ability. As a result, when this model is used to process visual tasks, it imposes fewer restrictions on the original picture and the auxiliary information of the original picture: there is no need to keep the brightness, deformation, and hue of the two identical, and even if the two are inconsistent in these respects, prediction results with higher accuracy can still be obtained.
  • the data enhancement processing includes at least one of translation, rotation, cropping, non-rigid transformation, noise disturbance, and color transformation.
  • A rigid transformation refers to a transformation in which only the position and orientation of the picture change while its shape does not change; a non-rigid transformation is more complicated than a rigid transformation and may include chamfering, distortion, and perspective transformation.
  • the noise disturbance may include Gaussian noise, and the color transformation may include saturation enhancement, brightness enhancement, contrast enhancement, and so on.
  • The data enhancement processing method can be selected according to the actual situation; an illustrative pipeline is sketched below.
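  • The following is an illustrative augmentation pipeline covering the listed operations (torchvision assumed; the input is taken to be a float tensor image, and every parameter value is a placeholder, not a value from this application):

```python
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),    # rotation and translation
    transforms.RandomResizedCrop(224),                            # cropping
    transforms.RandomPerspective(distortion_scale=0.2),           # non-rigid (perspective) transformation
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2),                       # color transformation
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # Gaussian noise disturbance
])
```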
  • FIG. 2 is a flowchart of a picture processing method provided by an embodiment of the present application. This embodiment can be applied to a case of processing a visual task.
  • The method can be executed by a picture processing apparatus, which can be implemented in software and/or hardware.
  • the device can be configured in a device, such as a computer or a mobile terminal. As shown in FIG. 2, the method includes steps 210 to 230.
  • Step 210: Acquire the original picture and the background picture of the original picture.
  • Step 220: Input the original picture into the main path of the first visual task processing model to obtain the object feature map, and input the background picture into the branch of the first visual task processing model to obtain the auxiliary feature map.
  • Step 230: Fuse the object feature map and the auxiliary feature map and input the fused result into the main path of the first visual task processing model to obtain the response map of the original picture.
  • In FIG. 3, an application schematic diagram of the above picture processing method is given.
  • As shown in FIG. 3, the original picture is input into the main path of the first visual task processing model to obtain the object feature map, and the background picture is input into the branch of the first visual task processing model to obtain the auxiliary feature map; the object feature map and the auxiliary feature map are fused, and the fused feature map is input into the main path of the first visual task processing model to obtain the response map of the original picture, that is, the semantic segmentation map of the image.
  • In this embodiment, the original picture and the background picture are obtained, the original picture is input into the main path of the first visual task processing model to obtain the object feature map, the background picture is input into the branch of the first visual task processing model to obtain the auxiliary feature map, and the object feature map and the auxiliary feature map are fused and input into the main path of the first visual task processing model to obtain the response map of the original picture.
  • The above process involves the background picture in the process of generating the response map of the original picture.
  • The background picture can provide strong prior knowledge, and this prior knowledge helps to overcome the complex and changeable scenes and/or hard-to-recognize objects that degrade the prediction accuracy of the visual task processing model, thereby improving the prediction accuracy of the visual task processing model.
  • FIG. 4 is a flowchart of yet another image processing method provided by an embodiment of the present application. This embodiment may be applicable to processing a visual task.
  • The method may be executed by a picture processing apparatus, which may be implemented in software and/or hardware.
  • the device can be configured in a device, such as a computer or a mobile terminal. As shown in FIG. 4, the method includes steps 310 to 330.
  • Step 310: Acquire the current video frame, the previous video frame, and the response map of the previous video frame.
  • Step 320: Input the current video frame into the main path of the first visual task processing model to obtain the object feature map, and input the previous video frame and the response map of the previous video frame into the branch of the first visual task processing model to obtain the auxiliary feature map.
  • Step 330: Fuse the object feature map and the auxiliary feature map and input the fused result into the main path of the first visual task processing model to obtain the response map of the original picture.
  • In this embodiment, the response map of the previous video frame can be obtained in either of the following two ways.
  • Method 1: When the previous video frame belongs to one of the first N video frames of the video, the response map of the previous video frame is the response map obtained by inputting the previous video frame into the second visual task processing model; when the previous video frame does not belong to one of the first N video frames of the video, the response map of the previous video frame is the response map obtained by inputting the previous video frame into the first visual task processing model. The second visual task processing model has higher prediction accuracy than the first visual task processing model, and N is a positive integer.
  • Method 2: When the previous video frame belongs to one of the first T video frames of the video frame sequence corresponding to the previous video frame, the response map of the previous video frame is the response map obtained by inputting the previous video frame into the second visual task processing model; when it does not, the response map of the previous video frame is the response map obtained by inputting the previous video frame into the first visual task processing model. The video frame sequence is one of multiple video frame sequences obtained by dividing the video frames of the video; the second visual task processing model has higher prediction accuracy than the first visual task processing model, and T is a positive integer.
  • the method of acquiring the response map of the previous video frame may be selected according to the actual situation.
  • In FIG. 5, an application schematic diagram of another picture processing method is given.
  • As shown in FIG. 5, the current video frame is input into the main path of the first visual task processing model to obtain the object feature map, and the previous video frame and the response map of the previous video frame are input into the branch of the first visual task processing model to obtain the auxiliary feature map, where the response map of the previous video frame is obtained by inputting the previous video frame into the second visual task processing model. The object feature map and the auxiliary feature map are fused to obtain the fused feature map, and the fused feature map is input into the main path of the first visual task processing model to obtain the response map of the original picture, that is, the semantic segmentation map of the image. An end-to-end sketch of this video case follows.
  • the current video frame, the previous video frame, and the response map of the previous video frame are obtained, and the current video frame is input into the main path of the first visual task processing model to obtain the object feature map, and the previous The response frame of the video frame and the previous video frame is input to the branch of the first visual task processing model to obtain the auxiliary feature map.
  • the object feature map and the auxiliary feature map are fused to the main path of the first visual task processing model to obtain the original picture
  • the response graph of the previous frame and the previous video frame is involved in the process of generating the response map of the current video frame.
  • prior knowledge helps to solve the problems of complex and changeable scenes and/or objects that are difficult to identify, which affect the prediction accuracy of the visual task processing model, thereby improving the prediction accuracy of the visual task processing model.
  • FIG. 6 is a schematic structural diagram of a picture processing apparatus provided by an embodiment of the present application. This embodiment may be configured for processing visual tasks.
  • The apparatus may be implemented in software and/or hardware.
  • The apparatus may be configured in a device, typically a computer or a mobile terminal. As shown in FIG. 6, the apparatus includes: an original picture and auxiliary information acquisition module 410, a feature map acquisition module 420, and a response map acquisition module 430.
  • The original picture and auxiliary information acquisition module 410 is configured to obtain the original picture and the auxiliary information of the original picture.
  • The feature map acquisition module 420 is configured to input the original picture into the main path of the first visual task processing model to obtain an object feature map, and to input the auxiliary information into the branch of the first visual task processing model to obtain an auxiliary feature map.
  • The response map acquisition module 430 is configured to fuse the object feature map and the auxiliary feature map and input the fused result into the main path of the first visual task processing model to obtain the response map of the original picture.
  • The original picture is input into the main path of the first visual task processing model to obtain an object feature map.
  • The auxiliary information is input into the branch of the first visual task processing model to obtain the auxiliary feature map; the object feature map and the auxiliary feature map are fused and input into the main path of the first visual task processing model to obtain the response map of the original picture.
  • The auxiliary information of the original picture can provide strong prior knowledge, and prior knowledge helps to solve the problems of complex and changeable scenes and/or hard-to-identify objects that affect the prediction accuracy of the visual task processing model, thereby improving that prediction accuracy.
  • The picture processing apparatus configured in a device provided by the embodiments of the present application can execute the method provided by any embodiment of the present application and has the function modules and effects corresponding to the executed method.
  • FIG. 7 is a schematic structural diagram of a device provided by an embodiment of the present application. FIG. 7 shows a block diagram of an exemplary device 512 suitable for implementing embodiments of the present application. The device 512 shown in FIG. 7 is only an example.
  • The device 512 is represented in the form of a general-purpose computing device.
  • The components of the device 512 may include one or more processors 516, a system memory 528, and a bus 518 connecting the different system components (including the system memory 528 and the processors 516).
  • The system memory 528 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 530 and/or a cache 532.
  • The storage system 534 may be configured to read and write non-removable, non-volatile magnetic media.
  • The system memory 528 may include at least one program product having a set of (e.g., at least one) program modules configured to perform the functions of various embodiments of the present application.
  • A program/utility 540 having a set of (at least one) program modules 542 may be stored, for example, in the memory 528.
  • The program modules 542 generally perform the functions and/or methods of the embodiments described in this application.
  • The device 512 may also communicate with one or more external devices 514 (e.g., a keyboard, a pointing device, a display 524, etc.). Such communication may be performed through an input/output (I/O) interface 522.
  • The device 512 may also communicate with one or more networks through a network adapter 520.
  • The processor 516 runs programs stored in the system memory 528 to execute various functional applications and data processing, for example to implement the method provided in the embodiments of the present application, the method including:
  • obtaining an original picture and auxiliary information of the original picture; inputting the original picture into the main path of the first visual task processing model to obtain an object feature map, and inputting the auxiliary information into the branch of the first visual task processing model to obtain an auxiliary feature map; and fusing the object feature map and the auxiliary feature map and inputting the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
  • The processor may also implement the solution of the picture processing method applied to a device provided by any embodiment of the present application.
  • The hardware structure and functions of the device can be explained with reference to the content of the embodiments.
  • An embodiment of the present application further provides a computer-readable storage medium.
  • A computer program is stored on the computer-readable storage medium.
  • When the program is executed by a processor, the method as provided in the embodiments of the present application is implemented.
  • The method includes:
  • obtaining an original picture and auxiliary information of the original picture; inputting the original picture into the main path of the first visual task processing model to obtain an object feature map, and inputting the auxiliary information into the branch of the first visual task processing model to obtain an auxiliary feature map; and fusing the object feature map and the auxiliary feature map and inputting the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
  • The computer-executable instructions of the computer-readable storage medium provided by an embodiment of the present application include the method operations described above and may also perform related operations of the method provided by any embodiment of the present application.

Abstract

A picture processing method, apparatus, device and storage medium. The method includes: obtaining an original picture and auxiliary information of the original picture (110); inputting the original picture into a main path of a first visual task processing model to obtain an object feature map, and inputting the auxiliary information into a branch of the first visual task processing model to obtain an auxiliary feature map (120); and fusing the object feature map and the auxiliary feature map and inputting the fused result into the main path of the first visual task processing model to obtain a response map of the original picture (130).

Description

Picture processing method, apparatus, device and storage medium
This application claims priority to Chinese patent application No. 201811648151.2, filed with the Chinese Patent Office on December 29, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of this application relate to the field of computer vision technology, for example, to a picture processing method, apparatus, device and storage medium.
Background
Computer vision is the science of using machines to simulate human and biological visual processing functions. Computer vision replaces the human eye with cameras to collect visual information and replaces the brain with computers to process and analyze that information, thereby completing tasks such as image classification, image segmentation, object detection, keypoint localization, pose estimation and face recognition.
With the improvement of computer hardware performance and the emergence of large-scale image data, deep learning has been widely applied in computer vision. Deep learning originated from research on artificial neural networks and is an important branch of machine learning that forms a new end-to-end paradigm; its motivation is to build deep convolutional neural networks that imitate the way the human brain learns so as to understand data. Deep learning here refers to deep convolutional neural networks. Traditional computer vision recognition methods extract handcrafted features from the perception of the different colors, textures and edges in a picture, whereas a deep convolutional neural network is a deep structure composed of a variety of linear and non-linear layers that extracts features from shallow to deep and from concrete to abstract. The high-level features extracted automatically by the network have strong expressive power and can distill many abstract concepts and semantics from a picture, such as the target object in the picture and its location.
The related art has at least the following problem: although deep learning is widely used in image classification, image segmentation, object detection, keypoint localization, pose estimation and face recognition, visual task processing models trained by deep learning have low prediction accuracy when scenes are complex and changeable and/or objects are difficult to identify.
Summary
Embodiments of this application provide a picture processing method, apparatus, device and storage medium to improve the prediction accuracy of a visual task processing model.
In an implementation, an embodiment of this application provides a picture processing method, including:
obtaining an original picture and auxiliary information of the original picture;
inputting the original picture into a main path of a first visual task processing model to obtain an object feature map, and inputting the auxiliary information into a branch of the first visual task processing model to obtain an auxiliary feature map; and
fusing the object feature map and the auxiliary feature map and inputting the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
In an implementation, an embodiment of this application further provides a picture processing apparatus, including:
an original picture and auxiliary information acquisition module, configured to obtain an original picture and auxiliary information of the original picture;
a feature map acquisition module, configured to input the original picture into a main path of a first visual task processing model to obtain an object feature map, and to input the auxiliary information into a branch of the first visual task processing model to obtain an auxiliary feature map; and
a response map acquisition module, configured to fuse the object feature map and the auxiliary feature map and input the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
In an embodiment, an embodiment of this application further provides a device, including:
one or more processors; and
a memory configured to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in the embodiments of this application.
In an embodiment, an embodiment of this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method described in the embodiments of this application.
Brief Description of the Drawings
FIG. 1 is a flowchart of a picture processing method in an embodiment of this application;
FIG. 2 is a flowchart of another picture processing method in an embodiment of this application;
FIG. 3 is an application schematic diagram of a picture processing method in an embodiment of this application;
FIG. 4 is a flowchart of yet another picture processing method in an embodiment of this application;
FIG. 5 is an application schematic diagram of another picture processing method in an embodiment of this application;
FIG. 6 is a structural schematic diagram of a picture processing apparatus in an embodiment of this application;
FIG. 7 is a structural schematic diagram of a device in an embodiment of this application.
Detailed Description
This application is described below with reference to the drawings and embodiments. The embodiments described here are intended only to explain this application, not to limit it. The drawings show only the parts related to this application rather than the complete structures.
Embodiments
To solve the above problem that visual processing models trained by deep learning have low prediction accuracy when processing visual tasks, prior knowledge can be added. Prior knowledge can be understood as auxiliary information related to the original picture, as explained in the following embodiments.
FIG. 1 is a flowchart of a picture processing method provided by an embodiment of this application. This embodiment is applicable to processing visual tasks. The method may be executed by a picture processing apparatus, which may be implemented in software and/or hardware and configured in a device such as a computer or a mobile terminal. As shown in FIG. 1, the method includes steps 110 to 130.
Step 110: Obtain an original picture and auxiliary information of the original picture.
In the embodiments of this application, to improve the prediction accuracy of the visual task processing model, the original picture is collected together with auxiliary information related to the original picture, where that auxiliary information can serve as prior knowledge.
The original picture is a picture on which a visual task is to be performed; visual tasks may include image classification, image segmentation, object detection, keypoint localization, pose estimation, and so on. In an embodiment, the original picture may be a single picture or a video frame of a video.
If the original picture is a single picture, its auxiliary information may include a background picture corresponding to the original picture, understood as follows: the original picture contains a target object, while the background picture does not contain the target object. In an embodiment, the background picture is the picture obtained by removing the target object from the original picture. For example, a photo taken by a camera of a kitten sleeping in a corner of a room is the original picture, and a photo of the same corner of the room is the background picture; the target object is the sleeping kitten.
If the original picture is a video frame of a video, taken as the current video frame, and the current video frame is not the first frame of the video, the auxiliary information of the original picture may include the previous video frame of the current video frame and the response map of the previous video frame.
Step 120: Input the original picture into the main path of the first visual task processing model to obtain an object feature map, and input the auxiliary information of the original picture into a branch of the first visual task processing model to obtain an auxiliary feature map.
Step 130: Fuse the object feature map and the auxiliary feature map and input the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
In the embodiments of this application, the response map of the original picture is the result of performing the corresponding type of visual task on the original picture, and its form depends on the type of visual task. For example, if the visual task is image segmentation (classifying every pixel of a picture into its category), the response map may be a probability map giving the category of each pixel, or a semantic segmentation map obtained from the probability map by setting a probability threshold. If the visual task is object detection, the response map is a map containing a candidate box into which the target object falls. If the visual task is keypoint localization, the response map is a heat map generated from the keypoint positions.
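As a concrete illustration of the segmentation case, the following sketch shows how a per-pixel probability map could be turned into a semantic segmentation map by setting a probability threshold. It is a minimal example: the array shapes and the threshold value 0.5 are assumptions for illustration, not values fixed by this application.

    import numpy as np

    def probability_map_to_segmentation(prob_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
        """Convert a per-pixel foreground-probability map (H x W, values in [0, 1])
        into a binary semantic segmentation map by thresholding."""
        return (prob_map >= threshold).astype(np.uint8)

    # Example: a 2 x 3 probability map.
    prob = np.array([[0.9, 0.2, 0.7],
                     [0.1, 0.6, 0.4]])
    print(probability_map_to_segmentation(prob))  # [[1 0 1], [0 1 0]]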
The first visual task processing model may be trained on the basis of a convolutional neural network and may include a main path and a branch. A convolutional neural network is a multi-layer neural network that may include convolutional layers, pooling layers, non-linear activation layers and fully connected layers. Every layer consists of multiple feature maps, and every pixel of a feature map represents a neuron. A feature map can be denoted W×H×K, where W is the width of the feature map, H is its height, K is the number of channels, and W×H is the size of the feature map. In a convolutional neural network, the number of channels is the number of convolution kernels in each convolutional layer. The network structure just described (convolutional, pooling, non-linear activation and fully connected layers) is relatively complex and has a large number of parameters. To simplify the structure and reduce the parameter count, a lightweight convolutional neural network such as a fully convolutional network (a convolutional neural network without fully connected layers) may be used. Taking a first visual task processing model trained on a fully convolutional network as an example, in an embodiment the main path of the model includes a first downsampling module and an upsampling module, with the output of the first downsampling module connected to the input of the upsampling module; the branch includes a second downsampling module connected in parallel with the first downsampling module. Each downsampling module may include M convolutional layers, each upsampling module may include M transposed convolutional layers, and each convolutional layer may be followed by a batch normalization layer and a non-linear activation layer. After a picture passes through the first and second downsampling modules, downsampled feature maps are obtained that contain the picture's feature information; because a downsampled feature map is smaller than the input picture, it has a larger receptive field and can provide more contextual information. The downsampled feature map is fed into the upsampling module to obtain an upsampled feature map of the same size as the input picture. In an embodiment, the concrete form of the structure of the first visual task processing model can be designed according to the actual situation.
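To make this two-path structure concrete, the following PyTorch sketch builds a small fully convolutional model of the kind described: a first downsampling module (main path) and a parallel second downsampling module (branch) with identical structure and kernel counts, followed by an upsampling module of transposed convolutions on the main path. All concrete numbers (M = 3 layers, channel widths, kernel sizes) and the class and argument names are illustrative assumptions, not the structure fixed by this application; fusion here uses element-wise addition, one of the two options discussed below.

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        # A convolutional layer followed by batch normalization and a
        # non-linear activation, as described for the downsampling modules.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def deconv_block(in_ch, out_ch):
        # A transposed convolutional layer for the upsampling module.
        return nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    class TwoPathFCN(nn.Module):
        """Main path: first downsampling module + upsampling module.
        Branch: second downsampling module, parallel to the first."""
        def __init__(self, img_channels=3, aux_channels=3, num_classes=2, width=32):
            super().__init__()
            # M = 3 convolutional layers per downsampling module (illustrative).
            self.down_main = nn.Sequential(
                conv_block(img_channels, width),
                conv_block(width, 2 * width),
                conv_block(2 * width, 4 * width),
            )
            self.down_branch = nn.Sequential(  # same structure and kernel counts
                conv_block(aux_channels, width),
                conv_block(width, 2 * width),
                conv_block(2 * width, 4 * width),
            )
            # M = 3 transposed convolutional layers restore the input size.
            self.up = nn.Sequential(
                deconv_block(4 * width, 2 * width),
                deconv_block(2 * width, width),
                nn.ConvTranspose2d(width, num_classes, kernel_size=4, stride=2, padding=1),
            )

        def forward(self, original, auxiliary):
            obj_feat = self.down_main(original)     # object feature map
            aux_feat = self.down_branch(auxiliary)  # auxiliary feature map
            fused = obj_feat + aux_feat             # element-wise fusion
            return self.up(fused)                   # response map, input-sized

    # Shape check: a 3-channel picture and a 3-channel auxiliary input.
    model = TwoPathFCN()
    out = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
    print(out.shape)  # torch.Size([1, 2, 64, 64])

For the video case, the auxiliary input would stack the previous frame with its response map, so aux_channels would be set accordingly (e.g. 4 for a 3-channel frame plus a 1-channel response map).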
Inputting the original picture into the main path of the first visual task processing model yields the object feature map, which may be the downsampled feature map produced by the first downsampling module described above and contains the feature information of the original picture. Inputting the auxiliary information of the original picture into the branch yields the auxiliary feature map, which may be the downsampled feature map produced by the second downsampling module and contains the feature information of the auxiliary information of the original picture.
The object feature map and the auxiliary feature map are fused and input into the main path of the first visual task processing model to obtain the response map of the original picture, which may be the upsampled feature map produced by the upsampling module described above. In an embodiment, because the auxiliary information of the original picture also participates in generating the response map, i.e., the prior knowledge participates in that process, the auxiliary information plays the role of improving the model's prediction accuracy; a response map generated with the participation of the auxiliary information is therefore more accurate than one generated from the original picture alone.
In an embodiment, the object feature map and the auxiliary feature map have the same size and the same number of channels. To achieve this, the first and second downsampling modules described above can be given the same structure and the same number of convolution kernels, i.e., the same number of convolutional layers and the same number of kernels. The two feature maps can be fused in either of two ways: (1) element-wise addition; (2) channel interaction. Which way to use can be set according to the actual situation.
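The two fusion options could look as follows in PyTorch. Element-wise addition requires identical shapes, as noted above. For "channel interaction" this application does not pin down the exact operation, so the sketch assumes one common reading: concatenation along the channel dimension followed by a 1×1 convolution that lets the channels of the two maps interact.

    import torch
    import torch.nn as nn

    def fuse_by_addition(obj_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
        # Way 1: element-wise addition; sizes and channel counts must match.
        return obj_feat + aux_feat

    class ChannelInteractionFusion(nn.Module):
        # Way 2 (assumed form): concatenate along channels, then mix with a
        # 1x1 convolution so channels from the two feature maps interact.
        def __init__(self, channels: int):
            super().__init__()
            self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, obj_feat, aux_feat):
            return self.mix(torch.cat([obj_feat, aux_feat], dim=1))

    obj = torch.randn(1, 128, 8, 8)
    aux = torch.randn(1, 128, 8, 8)
    print(fuse_by_addition(obj, aux).shape)               # torch.Size([1, 128, 8, 8])
    print(ChannelInteractionFusion(128)(obj, aux).shape)  # torch.Size([1, 128, 8, 8])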
In an embodiment, when the original picture is a video frame of a video, the auxiliary information of the original picture includes the previous video frame and its response map, and the response map of the previous video frame may be the map obtained by feeding the previous video frame into the first visual task processing model as an input variable. In addition, since the auxiliary information of the original picture serves as prior knowledge to improve the model's prediction accuracy, the accuracy of the previous frame's response map within that auxiliary information must be ensured; the higher, the better. To improve it, a visual processing model with higher prediction accuracy may be chosen, i.e., a previous video frame meeting a preset condition is fed not into the first visual task processing model but into a visual processing model whose prediction accuracy is higher than that of the first visual task processing model. Usually, the higher a model's prediction accuracy, the more complex its structure and the larger its parameter count; greater complexity and more parameters mean higher computational cost and, correspondingly, lower prediction efficiency. Choosing the more accurate model to obtain a more accurate response map of the previous frame therefore raises accuracy at the cost of prediction efficiency. On this basis, whether the previous video frame is fed into the first visual task processing model or into the more accurate model to obtain its response map can be decided according to the actual situation, in the following two ways:
Way 1: When the previous video frame is one of the first N video frames of the video, feed it into the model with higher prediction accuracy than the first visual task processing model to obtain its response map; when it is not one of the first N video frames, feed it into the first visual task processing model, where N is a positive integer. The reason this works is that consecutive frames of a video are usually correlated, so obtaining the response maps of the first N frames from the more accurate model guarantees the accuracy of the previous frame's response map used as prior knowledge. In an embodiment, Way 1 decides how to obtain the previous frame's response map on a per-video basis.
Way 2: If the duration of the video is greater than or equal to a duration threshold, the accuracy obtained by Way 1 may not meet practical requirements. In that case, the video frames can be divided in temporal order into two or more non-overlapping video frame sequences; the sequences may contain the same or different numbers of frames, determined according to the actual situation. In an embodiment, within each sequence the frames are ordered in time as the first frame, the second frame, ..., the P-th frame, and the previous video frame belongs to one of the sequences. After this division, the decision is made per video frame sequence instead of per video: when the previous video frame is one of the first T frames of the sequence corresponding to it, feed it into the model with higher prediction accuracy than the first visual task processing model to obtain its response map; when it is not one of the first T frames of that sequence, feed it into the first visual task processing model, where T is a positive integer. Because frames within a sequence are usually correlated, obtaining the response maps of the first T frames of each sequence from the more accurate model guarantees the accuracy of the previous frame's response map used as prior knowledge. Deciding per sequence rather than per video further improves that accuracy.
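The per-video and per-sequence decision rules reduce to simple index arithmetic. The sketch below expresses both, with the two models represented by plain callables; the sequence length seq_len and the way sequences are cut from the video are illustrative assumptions.

    def pick_model_per_video(frame_idx, n, accurate_model, fast_model):
        """Way 1: the first N frames of the video (indices 0..N-1) go to the
        more accurate (second) model; later frames go to the first model."""
        return accurate_model if frame_idx < n else fast_model

    def pick_model_per_sequence(frame_idx, t, seq_len, accurate_model, fast_model):
        """Way 2: the video is cut into consecutive, non-overlapping sequences
        of seq_len frames; the first T frames of each sequence go to the
        second model."""
        return accurate_model if (frame_idx % seq_len) < t else fast_model

    accurate = lambda frame: "response from second (accurate) model"
    fast = lambda frame: "response from first (lightweight) model"

    # With sequences of 30 frames and T = 2, frames 0, 1, 30, 31, 60, ... use
    # the accurate model and every other frame uses the lightweight one.
    for idx in (0, 1, 2, 30, 31, 59):
        model = pick_model_per_sequence(idx, t=2, seq_len=30,
                                        accurate_model=accurate, fast_model=fast)
        print(idx, model(None))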
In an embodiment, when the original picture is the current video frame, the current video frame is not the first frame of the video, and the auxiliary information of the original picture includes the previous video frame and its response map, whether the form of the previous frame's response map needs adjusting is decided by the type of visual task. For example, if the visual task is image segmentation, the previous frame's response map, either the per-pixel class probability map or the semantic segmentation map derived from it by setting a probability threshold, can be fed directly into the branch of the first visual task processing model as an input variable without adjustment. If the visual task is object detection, the previous frame's response map contains a candidate box and is adjusted: optionally, the pixel values inside the candidate box are set to 1 and those outside to 0, and the adjusted response map is fed into the branch of the first visual task processing model as an input variable. In an embodiment, the pixel values inside and outside the candidate box can be set according to the actual situation. If the visual task is keypoint localization, the previous frame's response map is a heat map generated from the keypoint positions and can be fed directly into the branch of the first visual task processing model without adjusting its form.
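For the object-detection case, the adjustment just described, setting pixels inside the candidate box to 1 and pixels outside it to 0, could be implemented as below. The box format (x0, y0, x1, y1) in pixel coordinates is an assumption for the sketch.

    import numpy as np

    def box_to_prior_map(height, width, box):
        """Rasterize a candidate box (x0, y0, x1, y1) into a binary map:
        1 inside the box, 0 outside, as described for the detection task."""
        x0, y0, x1, y1 = box
        prior = np.zeros((height, width), dtype=np.uint8)
        prior[y0:y1, x0:x1] = 1
        return prior

    prior = box_to_prior_map(6, 8, box=(2, 1, 6, 4))
    print(prior)  # rows 1..3, columns 2..5 are 1, the rest 0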
In the technical solution of this embodiment, the original picture and its auxiliary information are obtained; the original picture is input into the main path of the first visual task processing model to obtain the object feature map, and the auxiliary information is input into the branch to obtain the auxiliary feature map; the two are fused and input into the main path to obtain the response map of the original picture. Because the auxiliary information of the original picture participates in generating the response map and can provide strong prior knowledge, and prior knowledge helps with the complex, changeable scenes and/or hard-to-identify objects that affect prediction accuracy, the prediction accuracy of the visual task processing model is improved.
Optionally, on the basis of the above technical solution, the auxiliary information of the original picture includes the background picture corresponding to the original picture.
In the embodiments of this application, the auxiliary information of the original picture may include the background picture corresponding to the original picture. As described above, the background picture can be understood as follows: the original picture contains the target object, while the background picture does not. Equivalently, the background picture is the picture obtained by removing the target object from the original picture. The role the background picture plays under this understanding is explained below.
In an embodiment, the following situations may arise when processing visual tasks: when the visual task is image segmentation, foreground and background may be confused, or the edges of the generated response map of the original picture (here a semantic segmentation map) may be rough; when the visual task is object detection, the generated candidate box may jitter severely; when the visual task is keypoint localization, keypoints may fail to be detected or may jitter. These situations indicate low prediction accuracy, caused not by the target object itself being hard to identify but by the complex and changeable scene, which relative to the target object can be regarded as background interference. Since the background picture is the picture with the target object removed, compared with the original picture it contains only the background interference. Feeding the background picture into the branch of the first visual task processing model yields an auxiliary feature map that captures the features of the background interference; when this auxiliary feature map participates in generating the response map of the original picture, the resulting response map suppresses the background interference. In an embodiment, when the background picture is the original picture with the target object removed, its role as prior knowledge is to suppress background interference and thus improve the model's prediction accuracy.
Optionally, on the basis of the above technical solution, when the original picture is the current video frame and the current video frame is not the first frame of the video, the auxiliary information of the original picture includes the previous video frame of the current video frame and the response map of the previous video frame.
In the embodiments of this application, for the case where the original picture is the current video frame of a video and its auxiliary information includes the previous video frame and the response map of the previous video frame, the following situations may arise while processing visual tasks: when the visual task is image segmentation, the segmentation mask may flicker severely between frames; when the visual task is object detection, the candidate boxes generated in several consecutive frames may jitter severely; when the visual task is keypoint localization, keypoints may jitter between adjacent frames. These situations indicate low prediction accuracy, caused by objects and/or scenes that are hard to identify. Since two adjacent video frames are correlated to some degree, their response maps are also correlated; in other words, the previous frame's response map is a strong reference for generating the current frame's response map, i.e., it can serve as prior knowledge and participate in generating the current frame's response map. In an embodiment, this means feeding the previous frame's response map into the branch of the first visual task processing model as an input variable to obtain an auxiliary feature map that captures the previous frame's features, with this auxiliary feature map participating in generating the current frame's response map. As prior knowledge, the previous frame's response map enhances inter-frame continuity and thus improves the model's prediction accuracy. In an embodiment, because the previous frame and its response map provide strong prior knowledge, the structure of the first visual task model trained on the convolutional neural network can be kept as simple as possible to improve prediction efficiency.
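Put together, processing a video then becomes a loop in which each frame's prediction is fed back as part of the next frame's auxiliary input. The sketch below assumes the TwoPathFCN model sketched earlier, instantiated with aux_channels = 4 (the 3-channel previous frame stacked with its 1-channel response map) and a single foreground class; it illustrates the data flow only, and the all-zero auxiliary input for the first frame is an assumption of the sketch, not a rule of this application.

    import torch

    # `model` is assumed to be a TwoPathFCN(aux_channels=4, num_classes=1)
    # instance from the earlier sketch; `frames` is a list of (3, H, W) tensors.
    def process_video(model, frames):
        responses = []
        prev_frame, prev_resp = None, None
        for i, frame in enumerate(frames):
            if i == 0:
                # The first frame has no predecessor; a neutral all-zero
                # auxiliary input is used here for illustration.
                prev_frame = torch.zeros_like(frame)
                prev_resp = torch.zeros(1, *frame.shape[1:])
            aux = torch.cat([prev_frame, prev_resp], dim=0)        # (4, H, W)
            resp = torch.sigmoid(model(frame[None], aux[None]))[0]  # (1, H, W)
            responses.append(resp)
            prev_frame, prev_resp = frame, resp  # feed back as prior knowledge
        return responses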
In an embodiment, as described above, whether the previous video frame is fed as an input variable into the first visual task processing model to obtain its response map, or into a model with higher prediction accuracy than the first visual task processing model, can be decided according to the actual situation.
Optionally, on the basis of the above technical solution, the response map of the previous video frame can be obtained as follows: when the previous video frame is one of the first N video frames of the video, its response map is the response map obtained by inputting the previous video frame into the second visual task processing model; when the previous video frame is not one of the first N video frames of the video, its response map is the response map obtained by inputting the previous video frame into the first visual task processing model. The second visual task processing model has higher prediction accuracy than the first visual task processing model, and N is a positive integer.
In the embodiments of this application, since the auxiliary information of the original picture serves as prior knowledge to improve the model's prediction accuracy, the accuracy of the previous frame's response map within that auxiliary information must be ensured; the higher, the better. To improve it, a visual processing model with higher prediction accuracy may be chosen, i.e., the previous video frame is fed as an input variable not into the first visual task processing model but into a visual processing model with higher prediction accuracy. Usually, higher prediction accuracy means a more complex structure and a larger parameter count, hence higher computational cost and, correspondingly, lower prediction efficiency. Choosing the more accurate visual model to obtain a more accurate response map of the previous frame therefore raises accuracy at the cost of prediction efficiency, so the choice between the two models can be made according to the actual situation.
If the previous video frame is one of the first N frames of the video, it is fed as an input variable into the second visual task processing model to obtain its response map; if it is not, it is fed into the first visual task processing model, where the second visual task processing model has higher prediction accuracy than the first and N is a positive integer. In an embodiment, this way decides per video how the previous frame's response map is obtained.
The reason this works is that adjacent frames of a video are usually correlated, so obtaining the response maps of the first N frames by feeding them into the second visual task processing model guarantees the accuracy of the response maps used as prior knowledge, i.e., the model's prediction accuracy. Moreover, since the second visual task processing model has higher prediction accuracy than the first, its structure is more complex and, in an embodiment, its parameter count larger. Computational cost grows with structural complexity and parameter count, and higher cost means lower prediction efficiency. This way therefore both guarantees the accuracy of the previous frame's response map used as prior knowledge and keeps the model's computational efficiency at a high level, balancing prediction accuracy against prediction efficiency.
In an embodiment, when the object of the visual task is a video, processing in this way visually enhances inter-frame consistency; in other words, because the model's prediction accuracy is improved, a degree of inter-frame consistency is also achieved.
Optionally, on the basis of the above technical solution, the response map of the previous video frame can be obtained as follows: when the previous video frame is one of the first T video frames of the video frame sequence corresponding to the previous video frame, its response map is the response map obtained by inputting the previous video frame into the second visual task processing model.
When the previous video frame is not one of the first T video frames of the video frame sequence corresponding to it, its response map is the response map obtained by inputting the previous video frame into the first visual task processing model. The video frame sequence is one of multiple video frame sequences obtained by dividing the video frames of the video; the second visual task processing model has higher prediction accuracy than the first visual task processing model, and T is a positive integer.
In the embodiments of this application, if the duration of the video is greater than or equal to a duration threshold, deciding per video how to obtain the previous frame's response map may not meet practical requirements. In that case, the video frames can be divided in temporal order into two or more non-overlapping video frame sequences, each containing the same or a different number of frames as determined by the actual situation. In an embodiment, within each sequence the frames are ordered in time as the first frame, the second frame, ..., the P-th frame, and the previous video frame belongs to one of the sequences. After this division, the decision is made per video frame sequence instead of per video: if the previous video frame is one of the first T frames of its sequence, it is fed as an input variable into the second visual task processing model to obtain its response map; if not, it is fed into the first visual task processing model, where the second visual task processing model has higher prediction accuracy than the first and T is a positive integer.
The reason this works is that frames within a video sequence are usually correlated, so obtaining the response maps of the first T frames of each sequence by feeding them into the second visual task processing model guarantees the accuracy of the previous frame's response map used as prior knowledge. In an embodiment, deciding per video frame sequence rather than per video improves the accuracy of the previous frame's response map within the auxiliary information. Moreover, since the second visual task processing model has higher prediction accuracy than the first, its structure is more complex and its parameter count larger, and computational cost grows with complexity and parameter count, lowering prediction efficiency. This way therefore both guarantees the accuracy of the prior knowledge and keeps the model's computational efficiency at a high level, balancing prediction accuracy against prediction efficiency.
Optionally, on the basis of the above technical solution, the first visual task processing model can be trained as follows: obtain an original training picture, annotation information of the original training picture and auxiliary training information of the original training picture; input the original training picture into the main path of a convolutional neural network to obtain an object training feature map, and input the auxiliary training information into a branch of the convolutional neural network to obtain an auxiliary training feature map; fuse the two feature maps and input the result into the main path of the convolutional neural network to obtain a response map of the original training picture; obtain the loss function of the convolutional neural network from the annotation information and the response map of the original training picture; and adjust the network parameters of the convolutional neural network according to the loss function until the output value of the loss function is less than or equal to a preset threshold, taking the convolutional neural network as the first visual task processing model.
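A minimal training loop matching these steps might look as follows, assuming the TwoPathFCN sketch above as the convolutional neural network, cross-entropy as the loss (one of the options listed below) with a per-pixel class annotation, and Adam as the optimizer; the optimizer choice, learning rate and stopping threshold are illustrative assumptions.

    import torch
    import torch.nn as nn

    def train_first_model(network, dataset, loss_threshold=0.05, lr=1e-3, max_epochs=100):
        """dataset yields (original_picture, auxiliary_info, annotation) triples,
        with annotation a (B, H, W) long tensor of per-pixel classes; training
        stops once the loss falls to or below the preset threshold."""
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(network.parameters(), lr=lr)
        for epoch in range(max_epochs):
            for original, auxiliary, annotation in dataset:
                response = network(original, auxiliary)  # forward propagation
                loss = criterion(response, annotation)   # loss from annotation + response map
                optimizer.zero_grad()
                loss.backward()                          # gradients of loss w.r.t. parameters
                optimizer.step()                         # backward-gradient parameter update
                if loss.item() <= loss_threshold:
                    return network  # trained network becomes the first visual task model
        return network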
In the embodiments of this application, to improve the prediction accuracy of the first visual task processing model, the auxiliary training information that can serve as prior knowledge is used as an input variable of the first visual task processing model and participates in its training, entering as the input variable of one branch of the model. In an embodiment, the branch that takes the original training picture as input is called the main path of the first visual task processing model, and the branch that takes the auxiliary training information as input is called the branch of the first visual task processing model. In an embodiment, since the first visual task processing model is trained on the basis of a convolutional neural network, during training the branch that takes the original training picture as input is the main path of the convolutional neural network and the branch that takes the auxiliary training information as input is the branch of the convolutional neural network.
The annotation information of the original picture varies with the type of visual task. For example, when the visual task is image segmentation, the annotation information is the ground-truth label of every pixel in the original picture, indicating the pixel's category; when the visual task is object detection, the annotation information is the target box containing the target object; when the visual task is keypoint localization, the annotation information is the coordinate information of the keypoints.
The original training picture is input into the main path of the convolutional neural network to obtain the object training feature map, and the auxiliary training information into the branch to obtain the auxiliary training feature map. In an embodiment, if the original training picture is the current training video frame, its auxiliary training information may include the previous training video frame and the response map of the previous training video frame; if the original training picture is a single picture, its auxiliary training information may include a background training picture. When the original training picture is the current training video frame and its auxiliary training information includes the previous training video frame and that frame's response map, the response map may be obtained by feeding the previous training video frame as an input variable into the second visual task processing model.
The object training feature map and the auxiliary training feature map are fused and input into the main path of the convolutional neural network to obtain the response map of the original training picture, and the loss function of the convolutional neural network is obtained (e.g., computed) from the annotation information and the response map of the original training picture. The loss function may be a cross-entropy loss, 0-1 loss, squared loss, absolute loss, logarithmic loss, etc., set according to the actual situation.
Training the convolutional neural network consists of computing its loss function by forward propagation, computing the partial derivatives of the loss function with respect to the network parameters, and adjusting the network parameters by backward gradient propagation until the output value of the loss function is less than or equal to a preset threshold. When the output value of the loss function is less than or equal to the preset threshold, the convolutional neural network has finished training and its network parameters are determined. On this basis, the trained convolutional neural network can be taken as the first visual task processing model.
In an embodiment, the convolutional neural network described in the embodiments of this application may be a fully convolutional network, i.e., the fully convolutional network described above, whose structural form can be designed according to the actual situation.
In an embodiment, the content of the auxiliary training information differs with the form of the original training picture, and on this basis the first visual task processing model obtained by the above training also differs; the difference here can refer to differences in the model's network parameters.
In an embodiment, because the auxiliary training information of the original training picture also participates in the training process and, as prior knowledge, makes the trained first visual task processing model more accurate, the first visual task processing model generated with the participation of the auxiliary training information has higher prediction accuracy than one generated from the original training pictures alone without the auxiliary training information.
In addition, the second visual task processing model described in the embodiments of this application is a model that has itself already been trained; it can be configured to generate the response map of the previous training video frame and the response map of the previous video frame.
Optionally, on the basis of the above technical solution, the auxiliary training information is auxiliary training information obtained through data augmentation.
In the embodiments of this application, the visual task processing model is trained on the basis of a convolutional neural network. A major advantage of convolutional neural networks lies in their capacity to absorb data and turn it into continual learning and updating of parameters, yielding a model with good predictive performance and generalization. To obtain such a model, the convolutional neural network places requirements on both the quantity and the quality of training samples; in other words, the quantity and quality of training samples strongly influence the model's predictive performance and generalization. On this basis, data augmentation can be applied to the training samples to increase their quantity and improve their quality, thereby improving the model's predictive performance and generalization.
In an embodiment, since the auxiliary training information serves as the prior knowledge that improves the model's predictive performance, the training samples referred to here are the auxiliary training information. That is, the embodiments of this application apply data augmentation to the auxiliary training information; in other words, the auxiliary training information is auxiliary training information obtained through data augmentation.
Applying data augmentation to the auxiliary training information can improve its quality, understood as follows: in practice the camera is usually not fixed, and the original training picture and the background training picture in the auxiliary training information are shot separately rather than simultaneously, so their shooting angle, brightness, deformation, hue and so on cannot be kept consistent, and the degree of this inconsistency may differ from case to case. To reflect these differences and match reality as closely as possible, they are reproduced on the background picture in the auxiliary training information; data augmentation is a way of achieving this. That is, after data augmentation the background training picture can reflect, across different situations, the inconsistency with the original training picture in shooting angle, brightness, deformation and hue, with the degree of inconsistency matching the actual situation as closely as possible. In addition, if the original training picture is the current training video frame and the auxiliary training information includes the previous training video frame and its response map, the same data augmentation is applied to the response map of the previous training video frame so that it stays consistent with the previous training video frame.
On this basis, a visual task processing model trained with the original training pictures and data-augmented auxiliary training information as input variables has better predictive performance and generalization than one trained with un-augmented auxiliary training information. When the former is later used to process visual tasks, it places fewer restrictions on the original picture and its auxiliary information; "fewer restrictions" can mean that the two need not be kept consistent in brightness, deformation, hue and so on, and accurate predictions can be obtained even when they are inconsistent in these respects.
Optionally, on the basis of the above technical solution, the data augmentation includes at least one of translation, rotation, cropping, non-rigid transformation, noise perturbation and color transformation.
In the embodiments of this application, a rigid transformation changes only the position and orientation of a picture, not its shape; a non-rigid transformation is more complex than a rigid one and may include shearing, warping, perspective, etc. Noise perturbation may include Gaussian noise, and color transformation may include saturation enhancement, brightness enhancement, contrast enhancement, etc. In an embodiment, the data augmentation method can be chosen according to the actual situation.
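A sketch of such an augmentation, applied to a background training picture and, in the video case, repeated on the previous frame's response map so the pair stays consistent, is given below using torchvision's functional transforms. The parameter ranges and the specific transforms chosen are illustrative assumptions.

    import random
    import torchvision.transforms.functional as TF

    def augment_pair(background, response_map=None):
        """Randomly rotate, translate and color-jitter a background training
        picture (PIL Image or tensor); geometric changes are repeated on the
        response map, if given, so the two stay consistent."""
        angle = random.uniform(-10.0, 10.0)                     # rotation, in degrees
        dx, dy = random.randint(-8, 8), random.randint(-8, 8)   # translation
        background = TF.affine(background, angle=angle,
                               translate=[dx, dy], scale=1.0, shear=0.0)
        background = TF.adjust_brightness(background, random.uniform(0.8, 1.2))
        background = TF.adjust_saturation(background, random.uniform(0.8, 1.2))
        if response_map is not None:
            # Same geometric transform, so the response map keeps matching.
            response_map = TF.affine(response_map, angle=angle,
                                     translate=[dx, dy], scale=1.0, shear=0.0)
        return background, response_map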
FIG. 2 is a flowchart of a picture processing method provided by an embodiment of this application. This embodiment is applicable to processing visual tasks. The method may be executed by a picture processing apparatus, which may be implemented in software and/or hardware and configured in a device such as a computer or a mobile terminal. As shown in FIG. 2, the method includes steps 210 to 230.
Step 210: Obtain an original picture and a background picture of the original picture.
Step 220: Input the original picture into the main path of the first visual task processing model to obtain an object feature map, and input the background picture into a branch of the first visual task processing model to obtain an auxiliary feature map.
Step 230: Fuse the object feature map and the auxiliary feature map and input the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
In the embodiments of this application, to aid understanding of the provided technical solution, image segmentation is taken as the example visual task below.
As shown in FIG. 3, an application schematic diagram of another picture processing method is given. In FIG. 3, the original picture is input into the main path of the first visual task processing model to obtain the object feature map, and the background picture is input into the branch of the first visual task processing model to obtain the auxiliary feature map; the object feature map and the auxiliary feature map are fused, and the fused feature map is input into the main path of the first visual task processing model to obtain the response map of the original picture, i.e., the semantic segmentation map.
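Under the same assumptions as the architecture sketch above, the FIG. 3 flow could be exercised as follows: the original picture enters the main path, the background picture enters the branch, and the output is thresholded into a segmentation map. The shapes, the single foreground class and the 0.5 threshold are illustrative.

    import torch

    # Assumes the TwoPathFCN sketch from above is in scope.
    model = TwoPathFCN(img_channels=3, aux_channels=3, num_classes=1)
    original = torch.randn(1, 3, 64, 64)    # picture containing the target object
    background = torch.randn(1, 3, 64, 64)  # same scene with the object removed

    with torch.no_grad():
        prob = torch.sigmoid(model(original, background))  # per-pixel probability map
    segmentation = (prob > 0.5).to(torch.uint8)            # semantic segmentation map
    print(segmentation.shape)  # torch.Size([1, 1, 64, 64])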
In the technical solution of this embodiment, the original picture and the background picture are obtained; the original picture is input into the main path of the first visual task processing model to obtain the object feature map, and the background picture is input into the branch to obtain the auxiliary feature map; the two are fused and input into the main path to obtain the response map of the original picture. Because the background picture participates in generating the response map and can provide strong prior knowledge, and prior knowledge helps with the complex, changeable scenes and/or hard-to-identify objects that affect prediction accuracy, the prediction accuracy of the visual task processing model is improved.
FIG. 4 is a flowchart of yet another picture processing method provided by an embodiment of this application. This embodiment is applicable to processing visual tasks. The method may be executed by a picture processing apparatus, which may be implemented in software and/or hardware and configured in a device such as a computer or a mobile terminal. As shown in FIG. 4, the method includes steps 310 to 330.
Step 310: Obtain the current video frame, the previous video frame and the response map of the previous video frame.
Step 320: Input the current video frame into the main path of the first visual task processing model to obtain an object feature map, and input the previous video frame and the response map of the previous video frame into a branch of the first visual task processing model to obtain an auxiliary feature map.
Step 330: Fuse the object feature map and the auxiliary feature map and input the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
In the embodiments of this application, in an embodiment, the response map of the previous video frame can be obtained in the following two ways.
Way 1: When the previous video frame is one of the first N video frames of the video, its response map is the response map obtained by inputting the previous video frame into the second visual task processing model; when it is not one of the first N video frames, its response map is the response map obtained by inputting the previous video frame into the first visual task processing model. The second visual task processing model has higher prediction accuracy than the first visual task processing model, and N is a positive integer.
Way 2: When the previous video frame is one of the first T video frames of the video frame sequence corresponding to it, its response map is the response map obtained by inputting the previous video frame into the second visual task processing model; when it is not one of the first T video frames of that sequence, its response map is the response map obtained by inputting the previous video frame into the first visual task processing model. The video frame sequence is one of multiple video frame sequences obtained by dividing the video frames of the video; the second visual task processing model has higher prediction accuracy than the first visual task processing model, and T is a positive integer.
In an embodiment, the way of obtaining the response map of the previous video frame can be chosen according to the actual situation.
Image segmentation is taken as the example visual task below.
As shown in FIG. 5, an application schematic diagram of another picture processing method is given. In FIG. 5, the current video frame is input into the main path of the first visual task processing model to obtain the object feature map, and the previous video frame and its response map are input into the branch to obtain the auxiliary feature map, where the response map of the previous video frame is obtained by feeding the previous video frame into the second visual task processing model; the object feature map and the auxiliary feature map are fused, and the fused feature map is input into the main path of the first visual task processing model to obtain the response map of the original picture, i.e., the semantic segmentation map.
In the technical solution of this embodiment, the current video frame, the previous video frame and the previous frame's response map are obtained; the current video frame is input into the main path of the first visual task processing model to obtain the object feature map, and the previous video frame and its response map are input into the branch to obtain the auxiliary feature map; the two are fused and input into the main path to obtain the response map of the original picture. Because the previous frame and its response map participate in generating the current frame's response map and can provide strong prior knowledge, and prior knowledge helps with the complex, changeable scenes and/or hard-to-identify objects that affect prediction accuracy, the prediction accuracy of the visual task processing model is improved.
FIG. 6 is a structural schematic diagram of a picture processing apparatus provided by an embodiment of this application. This embodiment can be configured for processing visual tasks. The apparatus may be implemented in software and/or hardware and configured in a device, typically a computer or a mobile terminal. As shown in FIG. 6, the apparatus includes: an original picture and auxiliary information acquisition module 410, a feature map acquisition module 420, and a response map acquisition module 430.
The original picture and auxiliary information acquisition module 410 is configured to obtain an original picture and auxiliary information of the original picture.
The feature map acquisition module 420 is configured to input the original picture into the main path of the first visual task processing model to obtain an object feature map, and to input the auxiliary information into a branch of the first visual task processing model to obtain an auxiliary feature map.
The response map acquisition module 430 is configured to fuse the object feature map and the auxiliary feature map and input the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
In the technical solution of this embodiment, the original picture and its auxiliary information are obtained; the original picture is input into the main path of the first visual task processing model to obtain the object feature map, and the auxiliary information is input into the branch to obtain the auxiliary feature map; the two are fused and input into the main path to obtain the response map of the original picture. Because the auxiliary information participates in generating the response map and can provide strong prior knowledge, and prior knowledge helps with the complex, changeable scenes and/or hard-to-identify objects that affect prediction accuracy, the prediction accuracy of the visual task processing model is improved.
The picture processing apparatus configured in a device provided by the embodiments of this application can execute the method provided by any embodiment of this application and has the function modules and effects corresponding to the executed method.
FIG. 7 is a structural schematic diagram of a device provided by an embodiment of this application. FIG. 7 shows a block diagram of an exemplary device 512 suitable for implementing the embodiments of this application. The device 512 shown in FIG. 7 is only an example.
As shown in FIG. 7, the device 512 takes the form of a general-purpose computing device. The components of the device 512 may include one or more processors 516, a system memory 528, and a bus 518 connecting the different system components (including the system memory 528 and the processors 516).
The system memory 528 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 530 and/or a cache 532. The storage system 534 may be configured to read and write non-removable, non-volatile magnetic media. The system memory 528 may include at least one program product having a set of (e.g., at least one) program modules configured to perform the functions of the embodiments of this application.
A program/utility 540 having a set of (at least one) program modules 542 may be stored, for example, in the memory 528; the program modules 542 generally perform the functions and/or methods of the embodiments described in this application.
The device 512 may also communicate with one or more external devices 514 (e.g., a keyboard, a pointing device, a display 524, etc.). Such communication may proceed through an input/output (I/O) interface 522. The device 512 may also communicate with one or more networks through a network adapter 520.
The processor 516 executes a variety of functional applications and data processing by running programs stored in the system memory 528, for example implementing the method provided by the embodiments of this application, which includes:
obtaining an original picture and auxiliary information of the original picture;
inputting the original picture into the main path of the first visual task processing model to obtain an object feature map, and inputting the auxiliary information into a branch of the first visual task processing model to obtain an auxiliary feature map; and
fusing the object feature map and the auxiliary feature map and inputting the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
The processor can also implement the solution of the picture processing method applied to a device provided by any embodiment of this application. The hardware structure and functions of the device can be explained with reference to the content of the embodiments.
An embodiment of this application further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the method provided by the embodiments of this application, which includes:
obtaining an original picture and auxiliary information of the original picture;
inputting the original picture into the main path of the first visual task processing model to obtain an object feature map, and inputting the auxiliary information into a branch of the first visual task processing model to obtain an auxiliary feature map; and
fusing the object feature map and the auxiliary feature map and inputting the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
For the computer-readable storage medium provided by the embodiments of this application, its computer-executable instructions include the method operations described above and can also perform related operations of the method provided by any embodiment of this application.

Claims (11)

  1. A picture processing method, comprising:
    obtaining an original picture and auxiliary information of the original picture;
    inputting the original picture into a main path of a first visual task processing model to obtain an object feature map, and inputting the auxiliary information into a branch of the first visual task processing model to obtain an auxiliary feature map; and
    fusing the object feature map and the auxiliary feature map and inputting the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
  2. The method of claim 1, wherein the auxiliary information of the original picture comprises a background picture corresponding to the original picture.
  3. The method of claim 1, wherein, when the original picture is a current video frame and the current video frame is not the first frame of a video, the auxiliary information of the original picture comprises a previous video frame of the current video frame and a response map of the previous video frame.
  4. The method of claim 3, wherein the response map of the previous video frame is obtained as follows:
    when the previous video frame is one of the first N video frames of the video, the response map of the previous video frame is a response map obtained by inputting the previous video frame into a second visual task processing model;
    when the previous video frame is not one of the first N video frames of the video, the response map of the previous video frame is a response map obtained by inputting the previous video frame into the first visual task processing model;
    wherein the second visual task processing model has higher prediction accuracy than the first visual task processing model, and N is a positive integer.
  5. The method of claim 3, wherein the response map of the previous video frame is obtained as follows:
    when the previous video frame is one of the first T video frames of a video frame sequence corresponding to the previous video frame, the response map of the previous video frame is a response map obtained by inputting the previous video frame into a second visual task processing model;
    when the previous video frame is not one of the first T video frames of the video frame sequence corresponding to the previous video frame, the response map of the previous video frame is a response map obtained by inputting the previous video frame into the first visual task processing model;
    wherein the video frame sequence is one of multiple video frame sequences obtained by dividing multiple video frames of the video; the second visual task processing model has higher prediction accuracy than the first visual task processing model, and T is a positive integer.
  6. The method of claim 1, wherein the first visual task processing model is trained as follows:
    obtaining an original training picture, annotation information of the original training picture and auxiliary training information of the original training picture;
    inputting the original training picture into a main path of a convolutional neural network to obtain an object training feature map, and inputting the auxiliary training information into a branch of the convolutional neural network to obtain an auxiliary training feature map;
    fusing the object training feature map and the auxiliary training feature map and inputting the fused result into the main path of the convolutional neural network to obtain a response map of the original training picture;
    obtaining a loss function of the convolutional neural network according to the annotation information of the original training picture and the response map of the original training picture; and
    adjusting network parameters of the convolutional neural network according to the loss function until an output value of the loss function is less than or equal to a preset threshold, and taking the convolutional neural network as the first visual task processing model.
  7. The method of claim 6, wherein the auxiliary training information is auxiliary training information obtained through data augmentation.
  8. The method of claim 7, wherein the data augmentation comprises at least one of translation, rotation, cropping, non-rigid transformation, noise perturbation and color transformation.
  9. A picture processing apparatus, comprising:
    an original picture and auxiliary information acquisition module, configured to obtain an original picture and auxiliary information of the original picture;
    a feature map acquisition module, configured to input the original picture into a main path of a first visual task processing model to obtain an object feature map, and to input the auxiliary information into a branch of the first visual task processing model to obtain an auxiliary feature map; and
    a response map acquisition module, configured to fuse the object feature map and the auxiliary feature map and input the fused result into the main path of the first visual task processing model to obtain a response map of the original picture.
  10. A device, comprising:
    one or more processors; and
    a memory configured to store one or more programs,
    wherein the one or more programs are executed by the one or more processors, causing the one or more processors to implement the method of any one of claims 1-8.
  11. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1-8.
PCT/CN2019/128573 2018-12-29 2019-12-26 Picture processing method, apparatus, device and storage medium WO2020135554A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US17/418,692 US20220083808A1 (en) 2018-12-29 2019-12-26 Method and apparatus for processing images, device and storage medium
SG11202107121VA SG11202107121VA (en) 2018-12-29 2019-12-26 Method and apparatus for processing images, device and storage medium
RU2021120968A RU2770748C1 (ru) 2018-12-29 2019-12-26 Способ и аппарат для обработки изображений, устройство и носитель данных

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811648151.2 2018-12-29
CN201811648151.2A CN111382647B (zh) 2018-12-29 2018-12-29 Picture processing method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2020135554A1 true WO2020135554A1 (zh) 2020-07-02

Family

ID=71128768

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/128573 WO2020135554A1 (zh) Picture processing method, apparatus, device and storage medium

Country Status (5)

Country Link
US (1) US20220083808A1 (zh)
CN (1) CN111382647B (zh)
RU (1) RU2770748C1 (zh)
SG (1) SG11202107121VA (zh)
WO (1) WO2020135554A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991494A (zh) * 2021-01-28 2021-06-18 腾讯科技(深圳)有限公司 Image generation method and apparatus, computer device and computer-readable storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255761A (zh) * 2021-05-21 2021-08-13 深圳共形咨询企业(有限合伙) Feedback neural network system, training method and apparatus therefor, and computer device
CN115963917B (zh) * 2022-12-22 2024-04-16 北京百度网讯科技有限公司 Visual data processing device and visual data processing method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154518A (zh) * 2017-12-11 2018-06-12 广州华多网络科技有限公司 Image processing method and apparatus, storage medium and electronic device
US20180225828A1 (en) * 2016-05-09 2018-08-09 Tencent Technology (Shenzhen) Company Limited Image processing method and processing system
CN108447078A (zh) * 2018-02-28 2018-08-24 长沙师范学院 Interference-aware tracking algorithm based on visual saliency
CN108492319A (zh) * 2018-03-09 2018-09-04 西安电子科技大学 Moving object detection method based on a deep fully convolutional neural network
CN108961220A (zh) * 2018-06-14 2018-12-07 上海大学 Image co-saliency detection method based on multi-layer convolutional feature fusion

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5793975B2 (ja) * 2010-08-03 2015-10-14 株式会社リコー Image processing apparatus, image processing method, program, and recording medium
CN108229455B (zh) * 2017-02-23 2020-10-16 北京市商汤科技开发有限公司 Object detection method, neural network training method, apparatus and electronic device
KR101983684B1 (ko) * 2017-08-25 2019-05-30 광운대학교 산학협력단 People counting method on an embedded platform using a convolutional neural network
CN108876813B (zh) * 2017-11-01 2021-01-26 北京旷视科技有限公司 Image processing method, apparatus and device for object detection in video
CN107886093B (zh) * 2017-11-07 2021-07-06 广东工业大学 Character detection method, system, device and computer storage medium
CN108288035A (zh) * 2018-01-11 2018-07-17 华南理工大学 Human action recognition method based on deep-learning multi-channel image feature fusion
US10747811B2 (en) * 2018-05-22 2020-08-18 Adobe Inc. Compositing aware digital image search
CN108846332B (zh) * 2018-05-30 2022-04-29 西南交通大学 CLSTA-based railway driver behavior recognition method
CN108875654B (zh) * 2018-06-25 2021-03-05 深圳云天励飞技术有限公司 Face feature collection method and apparatus

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180225828A1 (en) * 2016-05-09 2018-08-09 Tencent Technology (Shenzhen) Company Limited Image processing method and processing system
CN108154518A (zh) * 2017-12-11 2018-06-12 广州华多网络科技有限公司 Image processing method and apparatus, storage medium and electronic device
CN108447078A (zh) * 2018-02-28 2018-08-24 长沙师范学院 Interference-aware tracking algorithm based on visual saliency
CN108492319A (zh) * 2018-03-09 2018-09-04 西安电子科技大学 Moving object detection method based on a deep fully convolutional neural network
CN108961220A (zh) * 2018-06-14 2018-12-07 上海大学 Image co-saliency detection method based on multi-layer convolutional feature fusion

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991494A (zh) * 2021-01-28 2021-06-18 腾讯科技(深圳)有限公司 Image generation method and apparatus, computer device and computer-readable storage medium
CN112991494B (zh) * 2021-01-28 2023-09-15 腾讯科技(深圳)有限公司 Image generation method and apparatus, computer device and computer-readable storage medium

Also Published As

Publication number Publication date
RU2770748C1 (ru) 2022-04-21
SG11202107121VA (en) 2021-07-29
US20220083808A1 (en) 2022-03-17
CN111382647A (zh) 2020-07-07
CN111382647B (zh) 2021-07-30

Similar Documents

Publication Publication Date Title
WO2020238560A1 (zh) Video target tracking method and apparatus, computer device and storage medium
CN109583340B (zh) Video object detection method based on deep learning
WO2020135554A1 (zh) Picture processing method, apparatus, device and storage medium
CN111696110B (zh) Scene segmentation method and system
CN113052755A (zh) Intelligent matting method for high-resolution images based on deep learning
CN110852199A (zh) Foreground extraction method based on a two-frame encoding-decoding model
CN112802197A (zh) Visual SLAM method and system based on a fully convolutional neural network in dynamic scenes
CN117197624A (zh) Infrared-visible image fusion method based on an attention mechanism
Zhang et al. Video extrapolation in space and time
CN116597144A (zh) Image semantic segmentation method based on an event camera
CN110889858A (zh) Point-regression-based automobile part segmentation method and apparatus
CN116342377A (zh) Adaptive generation method and system for camouflaged target images in degraded scenes
Wang et al. Research on gesture recognition and classification based on attention mechanism
CN114372931A (zh) Target object blurring method and apparatus, storage medium and electronic device
Yan et al. Small Objects Detection Method for UAVs Aerial Image Based on YOLOv5s
Xue et al. An end-to-end multi-resolution feature fusion defogging network
CN113744141B (zh) Image enhancement method and apparatus, and autonomous driving control method and apparatus
Chen et al. Automatic 2d-to-3d video conversion using 3d densely connected convolutional networks
Horita et al. SSA-GAN: End-to-end time-lapse video generation with spatial self-attention
CN114581448B (zh) Image detection method and apparatus, terminal device and storage medium
Wang et al. CNN-based Super-resolution Reconstruction for Traffic Sign Detection
Ren et al. Research on Anomaly Suppression Correlation Filtering Algorithm
Li et al. Underwater Image Clearing Algorithm Based on the Laplacian Edge Detection Operator
Fu et al. A Three-Stage Low-Illumination Image Enhancement Method Based on Feature Refining and Its Application in Inspection Robot for High-Voltage Substation Room
CN114743002A (zh) Video object segmentation method based on weakly supervised learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19904063; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 19904063; Country of ref document: EP; Kind code of ref document: A1)