CN116758365A - Video processing method, machine learning model training method, related device and equipment

Info

Publication number
CN116758365A
CN116758365A
Authority
CN
China
Prior art keywords
clothes
feature
model
images
clothing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210214986.7A
Other languages
Chinese (zh)
Inventor
余世杰
李石华
乔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202210214986.7A priority Critical patent/CN116758365A/en
Publication of CN116758365A publication Critical patent/CN116758365A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

The embodiment of the application discloses a video processing method, a machine learning model training method, and related devices and equipment. The method relates to artificial intelligence and comprises the following steps: inputting a plurality of detection images into a first model to obtain clothes attributes of a target object in the plurality of detection images, wherein the first model is used for respectively extracting clothes features of the target object in the plurality of detection images to obtain a plurality of clothes feature maps and for identifying the clothes attributes of the target object according to the plurality of clothes feature maps, and the detection images are images of the target object in a video to be processed; extracting masks from the plurality of clothes feature maps respectively, wherein the masks are used for indicating the positions of clothes; processing the detection image corresponding to each clothes feature map through the mask corresponding to that clothes feature map to obtain non-clothes feature maps; and inputting the non-clothes feature maps corresponding to the detection images into a second model to obtain non-clothes attributes of the target object.

Description

Video processing method, machine learning model training method, related device and equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a video processing method, a machine learning model training method, and related devices and apparatuses.
Background
In recent years, research on pedestrian re-identification has made great progress, and clothes-changing pedestrian re-identification is currently an urgent problem to be solved. In the prior art, clothes-changing pedestrian re-identification datasets are mainly picture data, that is, clothes-changing pedestrian re-identification is performed on single pictures. However, the clothes-independent features available in a single picture are weak and not sufficiently discriminative, so the retrieval precision of current pedestrian re-identification methods is low.
Disclosure of Invention
The embodiment of the application provides a video processing method, a machine learning model training method, and related devices and equipment, in which non-clothes attributes are obtained by cross-erasing the clothes attributes of a target object in a plurality of detection images of a video to be processed. On the one hand, the video contains richer non-clothes attributes that are sufficiently discriminative, which is beneficial to improving the accuracy of re-identifying the target object. On the other hand, erasing the clothes attributes of the target object in the plurality of detection images reduces interference from clothes, which further improves the accuracy of re-identifying the target object.
The first aspect of the embodiment of the application discloses a video processing method, which comprises the following steps: inputting a plurality of detection images into a first model to obtain clothes attributes of a target object in the plurality of detection images, wherein the first model is used for respectively extracting clothes features of the target object in the plurality of detection images to obtain a plurality of clothes feature maps and for identifying the clothes attributes of the target object according to the plurality of clothes feature maps, and the detection images are images of the target object in the video to be processed; extracting masks from the plurality of clothes feature maps respectively, wherein the masks are used for indicating the positions of clothes; processing the detection image corresponding to each clothes feature map through the mask corresponding to that clothes feature map to obtain non-clothes feature maps; and inputting the non-clothes feature maps corresponding to the detection images into a second model to obtain non-clothes attributes of the target object. According to this method, on the one hand, the video contains richer non-clothes attributes that are sufficiently discriminative, which is beneficial to improving the accuracy of re-identifying the target object. On the other hand, erasing the clothes attributes of the target object in the plurality of detection images reduces interference from clothes, which further improves the accuracy of re-identifying the target object. In addition, combining the global attribute with the non-clothes attribute can improve the accuracy of pedestrian re-identification both in clothes-changing retrieval and in same-clothes retrieval.
With reference to the first aspect, in one possible implementation manner, the extracting masks from the multiple garment feature maps respectively specifically includes:
calculating a thermodynamic diagram (heat map) corresponding to each clothes feature map; binarizing the pixel value of each pixel point in each thermodynamic diagram to 0 or 1 to obtain the mask corresponding to each clothes feature map, where 1 represents a non-clothes position and 0 represents a clothes position.
With reference to the first aspect, in one possible implementation, binarizing the pixel value of each pixel point in each thermodynamic diagram to 0 or 1 includes:
binarizing pixel points with pixel values larger than a threshold value in each thermodynamic diagram to 0;
and binarizing the pixel points with the pixel values smaller than or equal to the threshold value in each thermodynamic diagram into 1.
With reference to the first aspect, in one possible implementation manner, the processing, by using a mask corresponding to each clothing feature map, an image corresponding to each clothing feature map to obtain a non-clothing feature map specifically includes:
respectively extracting features of the plurality of detection images to obtain a plurality of first feature images; the first feature map, the garment feature map and the mask are all the same in size;
And performing dot multiplication between the mask corresponding to each clothes feature map and the first feature map corresponding to that clothes feature map to obtain a non-clothes feature map, wherein the mask is a binary map in which 1 represents a non-clothes position and 0 represents a clothes position.
With reference to the first aspect, in one possible implementation, the method further includes:
and inputting the plurality of detection images into a third model to obtain a global attribute of the target object in the plurality of detection images. The global attribute is used for splicing with the non-clothes attribute to obtain an identification attribute of the target object in the plurality of detection images, and the identification attribute is used for re-identifying the target object. In the process of erasing the clothes attribute of the target object, important non-clothes attributes may be deleted by mistake; extracting the global attribute of the target object in the plurality of detection images therefore helps to supplement the non-clothes attribute of the target object and ensures completeness.
The second aspect of the embodiment of the application discloses a machine learning model training method, which comprises the following steps: inputting a plurality of sample images into a first model to obtain predicted clothes attributes of a target object in the plurality of sample images, wherein the first model is used for respectively extracting clothes features of the target object in the plurality of sample images to obtain a plurality of clothes feature maps and for identifying the predicted clothes attributes of the target object according to the plurality of clothes feature maps, and the sample images are images of the target object in a sample video;
Extracting masks from the plurality of clothes feature maps respectively, wherein the masks are used for indicating the positions of clothes;
processing a sample image corresponding to each clothes feature map through a mask corresponding to each clothes feature map to obtain a non-clothes feature map;
inputting the non-clothing feature graphs corresponding to the sample images respectively into a second model to obtain the predicted non-clothing attribute of the target object;
model parameters of the first model and the second model are adjusted based on errors between predicted clothing properties, predicted non-clothing properties, and tag properties of the target object in the plurality of sample images.
With reference to the second aspect, in one possible implementation, the inputting the plurality of sample images into the first model, to obtain predicted clothing properties of the target object in the plurality of sample images includes:
and inputting the plurality of sample images into the first model, and performing supervised learning by taking the person ID and the clothes ID of the target object as labels to obtain the predicted clothes attribute of the target object in the plurality of sample images.
With reference to the second aspect, in one possible implementation manner, the inputting the non-clothing feature maps corresponding to the plurality of sample images into a second model to obtain the predicted non-clothing attribute of the target object includes:
And inputting the non-clothing feature graphs corresponding to the sample images into the second model, and performing supervised learning by taking the person ID of the target object as a tag to obtain the predicted non-clothing attribute of the target object.
With reference to the second aspect, in one possible implementation, extracting masks from the multiple garment feature maps respectively includes:
inputting the multiple clothes feature images into a clothes erasing model respectively to obtain masks respectively corresponding to the multiple clothes feature images;
the method further comprises the steps of: model parameters of the garment erasure model are adjusted based on errors between predicted garment properties, predicted non-garment properties, and tag properties of a target object in the plurality of sample images.
With reference to the second aspect, in one possible implementation, the method further includes:
respectively inputting the plurality of sample images into a feature extraction model to obtain first feature images respectively corresponding to the plurality of sample images; the first feature map, the garment feature map and the mask are all the same in size;
the method further comprises the steps of: model parameters of the feature extraction model are adjusted based on errors between predicted clothing properties, predicted non-clothing properties, and tag properties of the target object in the plurality of sample images.
With reference to the second aspect, in one possible implementation manner, the inputting the plurality of sample images into a feature extraction model respectively, to obtain first feature graphs corresponding to the plurality of sample images respectively includes:
and respectively inputting the plurality of sample images into the feature extraction model, and performing supervised learning by taking the character ID of the target object as a tag to obtain first feature images respectively corresponding to the plurality of sample images.
With reference to the second aspect, in one possible implementation, the method further includes:
inputting the plurality of sample images into a third model to obtain a predicted global attribute of the target object;
model parameters of the third model are adjusted based on errors between predicted clothing properties, predicted non-clothing properties, predicted global properties, and tag properties of the target object in the plurality of sample images.
With reference to the second aspect, in a possible implementation manner, the inputting the plurality of sample images into a third model, to obtain a predicted global attribute of the target object includes:
and inputting the plurality of sample images into the third model, and performing supervised learning by taking the person ID of the target object as a tag to obtain the predicted global attribute of the target object.
A third aspect of an embodiment of the present application discloses a video processing apparatus, including:
the first information extraction module is used for extracting clothes features from the plurality of detection images to obtain a plurality of clothes feature images; identifying clothing properties of a target object in the plurality of detection images according to the plurality of clothing feature maps; the detection image is an image of the target object in the video to be processed;
the first feature extraction module is used for carrying out feature extraction on the plurality of detection images to obtain a plurality of first feature images;
a first clothes erasing module for respectively extracting masks from the plurality of clothes feature diagrams, wherein the masks are used for indicating the positions of clothes;
the first clothes erasing module is further used for processing the detection image corresponding to each clothes feature map through the mask corresponding to each clothes feature map to obtain a non-clothes feature map;
and the second information extraction module is used for extracting the non-clothing attribute of the target object from the non-clothing feature graphs corresponding to the detection images respectively.
In a fourth aspect, an embodiment of the present application discloses a machine learning model training apparatus, including:
The first acquisition module is used for extracting clothes features from the plurality of sample images to obtain a plurality of clothes feature images; identifying predicted garment attributes of a target object in the plurality of sample images according to the plurality of garment feature maps; the sample image is an image of the target object in a sample video;
the second feature extraction module is used for carrying out feature extraction on the plurality of sample images to obtain a plurality of first feature images;
a second clothes erasing module for respectively extracting masks from the plurality of clothes feature diagrams, wherein the masks are used for indicating the positions of clothes;
the second clothes erasing module is further used for processing the sample image corresponding to each clothes feature map through the mask corresponding to each clothes feature map to obtain a non-clothes feature map;
the second acquisition module is used for extracting predicted non-clothing attributes of the target object from the non-clothing feature graphs corresponding to the sample images respectively;
and the information processing module is used for adjusting model parameters of the first model and the second model based on the predicted clothes attribute, the predicted non-clothes attribute and the label attribute of the target object in the plurality of sample images.
A fifth aspect of an embodiment of the present application discloses a computer device, including: a processor and a memory;
the processor is connected to a memory, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any of claims 1-7.
A sixth aspect of the embodiments of the present application discloses a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect described above.
A seventh aspect of the embodiments of the present application discloses a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a server, which executes the computer instructions, causing the server to perform the method of the first aspect described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic architecture diagram of a video processing system 100 according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a video processing apparatus or an electronic device according to an embodiment of the present application;
fig. 3A and fig. 3B are schematic flow diagrams of a machine learning model training method according to an embodiment of the present application;
FIG. 3C is a flow chart of training a machine learning model based on a sample video set and a sample data set provided by an embodiment of the present application;
FIG. 3D is a flow chart of a method for erasing a clothing feature of a target object by a clothing erasing module according to an embodiment of the application;
fig. 4 is a schematic flow chart of a video processing method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a training device for machine learning model according to an embodiment of the present application;
table 1 is a performance comparison table of the pedestrian re-recognition method based on video and other traditional pedestrian re-recognition methods based on video on same clothing retrieval according to the embodiment of the application;
table 2 is a performance comparison table of the pedestrian re-recognition method based on the video and other traditional pedestrian re-recognition methods based on the video in the clothing changing search according to the embodiment of the application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Technical terms related to the embodiment of the present application are first described:
(1) Artificial intelligence
It should be appreciated that artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, including both hardware-level technologies and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, machine learning/deep learning and other directions.
The scheme provided by the embodiment of the application mainly relates to artificial intelligence natural language processing (NLP) technology and machine learning (ML) technology.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Among them, machine Learning (Machine Learning) is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, induction learning, and the like.
With research and advancement of artificial intelligence technology, research and application of artificial intelligence technology is being developed in various fields, such as common smart home, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned, automatic driving, unmanned aerial vehicles, robots, smart medical treatment, smart customer service, etc., and it is believed that with the development of technology, artificial intelligence technology will be applied in more fields and with increasing importance value.
(2) Neural network
Neural Networks (NN): an artificial neural network (artificial neural network, ANN), abbreviated as neural network or neural-like network, is a mathematical or computational model that mimics the structure and function of biological neural networks (the central nervous system of animals, particularly the brain) for estimating or approximating functions in the field of machine learning and cognitive sciences.
(1) Language model training method
Language model training method (bidirectional encoder representations from transformers, BERT): a training method that uses massive amounts of text and is widely used for various natural language processing tasks, such as text classification, text matching and machine reading comprehension.
(2) Model parameters
Model parameters: is a quantity that uses common variables to establish relationships between functions and variables. In artificial neural networks, the model parameters are typically real matrices.
(3) Model training
Model training: multi-class learning is performed on an image dataset. The model can be built with deep learning frameworks such as TensorFlow and PyTorch, and a multi-classification model is formed by combining multiple neural network layers such as CNN layers. The input of the model is a three-channel or original-channel matrix obtained by reading an image with tools such as OpenCV; the model outputs multi-classification probabilities, and the category is finally output through algorithms such as softmax. During training, the model is driven toward the correct target through an objective function such as cross entropy.
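As an illustration of the training procedure described above, the following is a minimal sketch assuming PyTorch; the network layout, number of classes and hyperparameters are illustrative assumptions rather than the configuration used in this application.

```python
import torch
import torch.nn as nn

class SimpleClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):            # x: (N, 3, H, W) image batch
        feat = self.backbone(x).flatten(1)
        return self.fc(feat)         # raw logits; softmax is applied inside the loss

model = SimpleClassifier(num_classes=10)
criterion = nn.CrossEntropyLoss()    # cross-entropy objective, as described above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 224, 224)          # stand-in for images read via OpenCV
labels = torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```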
(4) Convolutional neural network
Convolutional neural networks (CNN) are a type of feedforward neural network that contains convolutional computation and has a deep structure, and are one of the representative algorithms of deep learning. Convolutional neural networks have the capability of representation learning and can perform shift-invariant classification of input information through their hierarchical structure.
(5) Circulating neural network
A recurrent neural network (RNN) is a class of neural networks used to process sequence data. It differs from other neural networks in that an RNN can better process sequential information, i.e., it recognizes that there are relationships between preceding and following inputs. In NLP, understanding the words that make up a sentence in isolation is obviously not enough to understand the whole sentence; the entire sequence formed by these words needs to be processed as a whole.
(6) Long-short term memory network
Long short-term memory networks (LSTM) are a special kind of recurrent neural network capable of capturing long-term dependencies. They are specifically designed to avoid the long-term dependency problem: retaining information over long periods is their default behavior rather than something that must be learned with difficulty.
(7) Bidirectional attention neural network
The model architecture of the bidirectional attention neural network (bidirectional encoder representations from transformers, BERT) is based on multi-layer bidirectional Transformer encoding. Because the model cannot directly see the information to be predicted, its main innovation lies in the pre-training method, namely the two tasks of Masked LM and Next Sentence Prediction, which capture word-level and sentence-level representations respectively.
Here, "bidirectional" means that the model can use information from both the preceding and following words when processing a word. This bidirectionality comes from the fact that BERT, unlike traditional language models, does not predict the most likely current word given all preceding words; instead it randomly masks some words and uses all of the unmasked words for prediction.
(8) Depth residual network
The deep residual network (ResNet) adopts a skip (shortcut) structure as the basic structure of the network, which alleviates the problem that, as the network deepens, learning efficiency drops and accuracy cannot be effectively improved (also called network degradation). By superimposing layers of y = x (called identity mappings) on a shallow network, the network can grow in depth without degrading.
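A minimal sketch of such a residual (skip-connection) block, assuming PyTorch; the channel count and layer choices are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = x                      # identity mapping y = x
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + residual)  # skip connection counters network degradation

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))     # output keeps the input shape
```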
(9) Encoder-decoder
The encoder-decoder architecture is a network architecture commonly used in machine translation technology. The method comprises two parts of an encoder and a decoder, wherein the encoder converts input text into a series of context vectors capable of expressing input text characteristics, and the decoder receives the output result of the encoder as own input and outputs a corresponding text sequence in another language.
Referring to fig. 1, fig. 1 is a schematic architecture diagram of a video processing system 100 according to an embodiment of the present application, where the system may include: server 200, network 300, and terminal 400 (terminal 400-1 and terminal 400-2 are illustrated), wherein terminal 400 (terminal 400-1 and terminal 400-2 are illustrated) is connected to server 200 via network 300, and network 300 may be a wide area network or a local area network, or a combination of both.
The terminal 400-1 belongs to the photographer of the video and is used to upload the video to be processed to the background server of the video processing system 100 (that is, the server 200) and to retrieve videos, so that the video can be transmitted through the server 200 to other terminals on the network 300, such as the terminal 400-2.
The server 200 is a background server of the video processing system 100, belongs to an administrator of the video processing system 100, and is configured to receive the video to be processed uploaded by the terminal 400-1, and store the video in the database 500. Carrying out global feature extraction processing on the video to be processed to obtain a global feature vector corresponding to the video to be processed; carrying out first clothes irrelevant feature extraction processing on the video to be processed to obtain a first clothes irrelevant feature vector corresponding to the video to be processed; performing clothes feature extraction processing on the video to be processed to obtain clothes feature vectors corresponding to the video to be processed; inputting the first clothes-independent feature vector and the clothes feature vector into a clothes erasing module to obtain a target clothes-independent feature vector; and splicing the target clothes irrelevant feature vector and the global feature vector to obtain the target feature vector. Further, the server 200 performs retrieval in other monitoring videos based on the target characteristics of pedestrians in the video to be processed.
The terminal 400-2 may receive the video transmitted by the terminal 400-1.
In some embodiments, the server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms.
The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a camera, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
In the embodiment of the application, the server that trains the machine learning model and the server that performs video processing may be the same or different, which is not limited by the application.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a video processing apparatus or an electronic device according to an embodiment of the present application, where the video processing apparatus may be the server 200 or the terminal 400 in fig. 1, and includes: at least one processor 210, a memory 220, at least one network interface 230, and a user interface 240. The various components in server 200 are coupled together by bus system 250. It is understood that the bus system 250 is used to enable connected communications between these components. The bus system 250 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 250 in fig. 2.
It should be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the video processing apparatus or the electronic device. In other embodiments of the application, the video processing apparatus or electronic device may include more or fewer components than shown, or may combine certain components, or may split certain components, or may have a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 210 may be an integrated circuit chip with signal processing capability, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a digital signal processor (DSP), another programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
The memory 220 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 220 optionally includes one or more storage devices physically located remote from processor 210.
Memory 220 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a random access memory (random access memory, RAM). The memory 220 described in embodiments of the present application is intended to comprise any suitable type of memory.
The user interface 240 includes one or more output devices 241 that enable presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 240 also includes one or more input devices 242, including user interface components that facilitate input of objects, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The video processing method provided by the embodiment of the present application will be described in conjunction with exemplary applications and implementations of the server provided by the embodiment of the present application.
Example 1
This embodiment introduces a training method of a machine learning model according to an embodiment of the present application. The training method may be implemented by the server 200 in fig. 1 and fig. 2.
Specifically, as exemplarily shown in fig. 3A and 3B, a flow chart of a machine learning model training method is shown, and the method may include, but is not limited to, some or all of the following steps:
s301: based on the sample video set, a sample data set is acquired.
The sample video set may include one or more sample videos.
The sample video may be from the photographer terminal 400-1 of the video in fig. 1, or from other approaches, as the application is not limited in this regard.
A sample data set, which may also be referred to as a training data set, includes one or more sample data, each of which may include, but is not limited to, one or a combination of the following: sample video, target objects in sample video, global properties of target objects in sample video, clothing properties of target objects in sample video, and non-clothing properties of target objects in sample video.
S302: a machine learning model is trained based on the sample video set and the sample data set.
In some embodiments, when training the machine learning model, supervised learning is performed on the first model with the person ID of the target object (reflecting, e.g., the face shape, hairstyle and body shape of the target object) and the clothes ID jointly as labels, while supervised learning is performed on the feature extraction model, the second model and the third model with the person ID of the target object as the label.
In other embodiments, other tags may be used for supervised learning, as the application is not limited in this regard.
Specifically, as exemplarily shown in fig. 3C, S302 may include some or all of the steps in S3021 to S3027:
S3021: and inputting the plurality of sample images into the first model to obtain predicted clothes attributes of the target object in the plurality of sample images.
The sample video may be from the video capturing terminal 400-1 in fig. 1, or may be from other paths, which the present application is not limited to.
A method for acquiring a plurality of sample images from a sample video is described below.
The sample image is an image of a target object in the sample video.
In one possible implementation, multiple intermediate images may be acquired from a sample video; and then detecting the image of the target object from the plurality of intermediate images to obtain a plurality of sample images.
In another possible implementation, the target object may be detected from the sample video, resulting in an intermediate video; next, a plurality of sample images are selected in the intermediate video.
The embodiment of the application does not limit the method for acquiring a plurality of images from a video, and one of the methods is taken as an example for description.
The video may be divided into one or more video segments, and at least one image is selected from each video segment according to a certain time interval; the at least one image selected from a video segment constitutes an image set, so that one image set corresponds to one video segment, and the one or more image sets together contain the plurality of images.
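The segment-and-sample scheme above can be sketched as follows, assuming OpenCV; the video file name, number of segments and sampling interval are illustrative assumptions.

```python
import cv2

def sample_frames(video_path: str, num_segments: int = 4, frames_per_segment: int = 2):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    images = []
    if total <= 0:
        cap.release()
        return images
    seg_len = max(total // num_segments, 1)
    for s in range(num_segments):                      # one image set per video segment
        step = max(seg_len // frames_per_segment, 1)   # fixed time interval inside the segment
        for k in range(frames_per_segment):
            idx = min(s * seg_len + k * step, total - 1)
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                images.append(frame)
    cap.release()
    return images                                      # the plurality of images used as samples

frames = sample_frames("sample_video.mp4")
```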
In some embodiments, the first model may be a deep residual network (ResNet). ResNet can effectively improve the accuracy of model training as the network depth increases. The application can also obtain the video feature vector through other network models, such as a recurrent neural network or a long short-term memory network, which is not limited here.
The Resnet may include at least one convolution layer for extracting features of the input data and at least one pooling layer; the pooling layer is used for sampling input data. Both the convolution layer and the pooling layer include an activation function.
In an embodiment of the application, the clothes attributes of the target object can be obtained based on at least one convolution layer and at least one pooling layer. In the first step, vector conversion is performed on the plurality of sample images respectively to obtain a plurality of sample image vectors, which can be combined into an image vector matrix. In the second step, the image vector matrix is input into at least one first convolution layer (for example, 3 layers), and a convolution operation is performed between a convolution kernel and the image vector matrix, that is, an inner product of the image vector matrix and the convolution kernel is computed to obtain the convolution result corresponding to the image vector matrix; the convolution result is then nonlinearly transformed by the activation function and a bias vector is added to obtain initial feature vectors. In the third step, the initial feature vectors are input into the pooling layer, which samples the initial feature vectors; the sampling result is then nonlinearly transformed by the activation function and a bias vector is added to obtain the plurality of clothes feature maps. In the fourth step, the clothes attributes of the target object are identified based on the plurality of clothes feature maps.
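The convolution, activation, pooling and classification steps described above can be sketched as follows, assuming PyTorch; the layer sizes, attribute count and pooling scheme are illustrative assumptions, not the exact configuration of the first model.

```python
import torch
import torch.nn as nn

class ClothesFeatureNet(nn.Module):
    def __init__(self, num_clothes_attrs: int = 8):
        super().__init__()
        self.conv = nn.Sequential(                     # convolution + nonlinear activation (bias included)
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.pool = nn.MaxPool2d(2)                    # pooling layer samples the features
        self.head = nn.Linear(128, num_clothes_attrs)  # identifies clothes attributes

    def forward(self, images):                         # images: (T, 3, H, W), T sample/detection images
        feat_maps = self.pool(self.conv(images))       # the plurality of clothes feature maps
        pooled = feat_maps.mean(dim=(2, 3))            # average over spatial positions per image
        logits = self.head(pooled.mean(dim=0, keepdim=True))  # attributes from all feature maps jointly
        return feat_maps, logits

net = ClothesFeatureNet()
feature_maps, clothes_logits = net(torch.randn(6, 3, 256, 128))
```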
S3022: and respectively inputting the plurality of sample images into the feature extraction model to obtain first feature images respectively corresponding to the plurality of sample images.
In the embodiment of the application, the feature extraction model may be a depth residual network ResNet or other network models, which is not limited in this aspect of the application. Taking Resnet as an example in this embodiment, the Resnet may include at least one convolution layer for extracting features of the input data and at least one pooling layer; the pooling layer is used for sampling input data. Both the convolution layer and the pooling layer include an activation function.
The feature extraction model is used for extracting global features of target objects in the plurality of sample images to obtain a plurality of first feature images. Each first characteristic diagram corresponds to each clothes characteristic diagram one by one, and the two characteristic diagrams are the same in size.
The method for obtaining the plurality of first feature maps from the feature extraction model may refer to the first to third steps in the method for obtaining the clothing attribute of the target object from the first model in S3021, and will not be described herein.
S3023: masks are extracted from the plurality of clothing feature maps, respectively, the masks being used to indicate the positions of the clothing.
As shown in fig. 3D, the plurality of clothes feature maps are input to the clothes erasure model, respectively, to obtain masks corresponding to the plurality of clothes feature maps, respectively.
Specifically, first, a thermodynamic diagram corresponding to each of the garment features in the plurality of garment feature maps is calculated. The thermodynamic diagram can be calculated by referring to the formula (1) and the formula (2):
wherein K is the number of channels of the clothes feature map, t indexes the t-th clothes feature map among the plurality of clothes feature maps, $A_t(i,j)$ is the pixel value of the pixel point at position (i, j) in the thermodynamic diagram, and $F_t(k,i,j)$ is the pixel value of the pixel point at position (k, i, j) in the clothes feature map:
$$A_t(i,j)=\sum_{k=1}^{K} F_t(k,i,j)^2 \qquad (1)$$
The thermodynamic diagram $A_t$ can further be normalized over its spatial extent:
$$\hat{A}_t(i,j)=\frac{A_t(i,j)}{\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}A_t(i,j)} \qquad (2)$$
wherein H is the height of the clothes feature map and W is the width of the clothes feature map.
Then, binarizing the pixel value of each pixel point in each thermodynamic diagram into 0 or 1 to obtain a mask corresponding to each clothing feature diagram; where 1 represents a non-clothing position and 0 represents a clothing position.
Pixel points with pixel values greater than the threshold in each thermodynamic diagram are binarized to 0, and pixel points with pixel values less than or equal to the threshold in each thermodynamic diagram are binarized to 1.
In some embodiments, the threshold value corresponding to each thermodynamic diagram may be an average of pixel values of all pixels on the thermodynamic diagram.
In other embodiments, the threshold value corresponding to each thermodynamic diagram may be a maximum pixel value, a minimum pixel value or any pixel value on the thermodynamic diagram, which is not limited by the present application.
In the embodiment of the application, the clothes erasure model may be a deep residual network (ResNet) or another network model, which is not limited by the application. Taking ResNet as an example in this embodiment, the ResNet may include at least one fully connected layer that classifies the pixel points of the thermodynamic diagrams to obtain the clothes positions and non-clothes positions on each thermodynamic diagram. In addition, the fully connected layer may include an activation function, a weight matrix and a bias constant.
Specifically, the multiple thermodynamic diagrams can be respectively input into the full-connection layer, nonlinear transformation is respectively carried out on the multiple thermodynamic diagrams based on the weight matrix and the bias vector of the activation function, and then the clothes position and the non-clothes position are determined through normalization, so that masks corresponding to the multiple thermodynamic diagrams are obtained respectively.
The clothing feature maps, the thermodynamic diagrams and the masks are in one-to-one correspondence and have the same size.
In the present application, the mask may be obtained by other means, which the present application is not limited to.
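A minimal sketch of turning clothes feature maps into binary masks via a thermodynamic (heat) map, assuming PyTorch. The mean-value threshold follows the description above; the channel aggregation and normalization details are assumptions for illustration.

```python
import torch

def clothes_masks(feature_maps: torch.Tensor) -> torch.Tensor:
    """feature_maps: (T, K, H, W) clothes feature maps -> (T, 1, H, W) binary masks."""
    heat = (feature_maps ** 2).sum(dim=1, keepdim=True)          # aggregate channels into a heat map
    heat = heat / (heat.mean(dim=(2, 3), keepdim=True) + 1e-6)   # normalize over the H x W extent
    threshold = heat.mean(dim=(2, 3), keepdim=True)              # average pixel value as the threshold
    mask = (heat <= threshold).float()                           # 1 = non-clothes position, 0 = clothes position
    return mask

masks = clothes_masks(torch.randn(6, 128, 64, 32))
```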
S3024: and processing the first characteristic map corresponding to each clothes characteristic map through the mask corresponding to each clothes characteristic map to obtain a non-clothes characteristic map.
Each first feature map, each clothes feature map and each mask have the same size and correspond to one another one-to-one.
In some embodiments, the plurality of masks may be respectively subjected to a dot product operation with the corresponding first feature map, so as to obtain a plurality of non-clothing feature maps. Multiple non-garment signatures may be obtained in the same manner as described above, and the application is not limited in this regard.
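A minimal sketch of this erasing step, assuming PyTorch; the tensor shapes are illustrative.

```python
import torch

first_feature_maps = torch.randn(6, 128, 64, 32)     # (T, C, H, W) first feature maps
masks = torch.randint(0, 2, (6, 1, 64, 32)).float()  # binary masks: 1 = non-clothes, 0 = clothes

# Broadcasting the single-channel mask over all channels keeps non-clothes positions
# and zeroes out clothes positions, yielding the non-clothes feature maps.
non_clothes_feature_maps = first_feature_maps * masks
```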
S3025: and inputting the non-clothes feature graphs corresponding to the sample images into a second model to obtain the predicted non-clothes attribute of the target object.
In the embodiment of the present application, the second model may be a depth residual network res net, or may be other network models, which is not limited in this aspect of the present application. Taking Resnet as an example in this embodiment, the Resnet may include at least one convolution layer for extracting features of the input data and at least one pooling layer; the pooling layer is used for sampling input data. Both the convolution layer and the pooling layer include an activation function.
The second model is used for extracting non-clothes features of the target object in the non-clothes feature map to obtain predicted non-clothes properties of the target object.
The method for obtaining the predicted non-clothing attribute of the target object from the second model may refer to the method for obtaining the clothing attribute of the target object from the first model in S3021, and will not be described herein.
The non-clothing attribute may include at least one of time information, face, body shape, gait and hairstyle, and may be other clothing independent body biometric characteristics, as the application is not limited in this respect.
S3026: and inputting the plurality of sample images into a third model to obtain the predicted global attribute of the target object.
In some embodiments, the machine model training method may further comprise S3026.
In the embodiment of the present application, the third model may be a depth residual network res net, or may be other network models, which is not limited in this aspect of the present application. Taking Resnet as an example in this embodiment, the Resnet may include at least one convolution layer for extracting features of the input data and at least one pooling layer; the pooling layer is used for sampling input data. Both the convolution layer and the pooling layer include an activation function.
And the third model is used for extracting the global features of the target object in the non-clothes feature map to obtain the predicted global attribute of the target object.
The method for obtaining the predicted global attribute of the target object from the third model may refer to the method for obtaining the clothing attribute of the target object from the first model in S3021, and will not be described herein.
S3027: model parameters of the first model and the second model are adjusted based on the predicted clothing properties, the errors between the predicted non-clothing properties and the tag properties of the target object in the plurality of sample images.
In implementation 1, the initial machine learning model includes a first model, a second model, a feature extraction model, a garment erasure model, and a third model. During training, model parameters of the first model, the second model, the feature extraction model, the garment erasure model, and the third model may be adjusted based on errors between predicted garment properties, predicted non-garment properties, predicted global properties, and tag properties of the target object in the plurality of sample images.
Illustratively, a loss function representing the error between the predicted clothes attributes, the predicted non-clothes attributes and the tag attributes of the target object in the plurality of sample images is constructed, the loss is iteratively reduced by gradient descent, and the final machine learning model is output when the loss satisfies the model convergence condition.
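A minimal sketch of such a joint objective, assuming PyTorch; the equal weighting of the branches and the label names are assumptions, not the exact loss used in this application.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def joint_loss(clothes_logits, non_clothes_logits, global_logits,
               clothes_labels, person_labels):
    # clothes branch: supervised with person ID and clothes ID jointly (combined label assumed)
    loss_clothes = criterion(clothes_logits, clothes_labels)
    # non-clothes and global branches: supervised with the person ID only
    loss_non_clothes = criterion(non_clothes_logits, person_labels)
    loss_global = criterion(global_logits, person_labels)
    return loss_clothes + loss_non_clothes + loss_global            # equal weighting assumed

# optimizer = torch.optim.SGD(all_model_parameters, lr=0.01)
# loss = joint_loss(...); loss.backward(); optimizer.step()   # repeated until convergence
```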
In implementation 2, the initial machine learning model includes a first model, a second model, and a feature extraction model. During training, model parameters of the first model, the second model and the feature extraction model can be adjusted based on the predicted clothes attribute, the predicted non-clothes attribute and the label attribute of the target object in the plurality of sample images.
In implementation 3, the initial machine learning model includes a first model, a second model, and a third model. During training, model parameters of the first model, the second model and the third model can be adjusted based on errors among predicted clothes attributes, predicted non-clothes attributes, predicted global attributes and tag attributes of target objects in the plurality of sample images.
In implementation 4, the initial machine learning model may also include only the first model and the second model. During training, model parameters of the first model and the second model can be adjusted based on the predicted clothes attribute, the predicted non-clothes attribute and the label attribute of the target object in the plurality of sample images.
S303: and processing the video to be processed through the trained machine learning model to obtain the target characteristics of the target object in the video to be processed.
The trained machine learning model can be applied to processing video to predict the clothes attributes, non-clothes attributes and global attributes of a target object in the video; see the specific description in the second embodiment.
Example two
The following describes a video processing method according to an embodiment of the present application.
In some embodiments, a video processing method provided by the embodiments of the present application may be implemented by the server 200 or the terminal 400 in fig. 1 and 2.
In the embodiment of the present application, the execution subject is taken as the server 200 as an example.
Specifically, as exemplarily shown in fig. 4, a flow chart of a video processing method is shown, and the method may include, but is not limited to, some or all of the following steps:
s401: and inputting the plurality of detection images into the first model to obtain the clothes attribute of the target object in the plurality of detection images.
The video to be processed may be from the video capturing terminal 400-1 of fig. 1, or may be from other paths, which is not limited by the present application.
The method for acquiring a plurality of detection images from a video to be processed is described below.
The detected image is an image of a target object in the video to be processed.
In one possible implementation, a plurality of images to be processed may be acquired from a video to be processed; and detecting the image of the target object from the plurality of images to be processed to obtain a plurality of detection images.
In another possible implementation, the target object may be detected from the video to be processed, resulting in a detected video; next, a plurality of detection images are selected from the detection video.
The embodiment of the application does not limit the method for acquiring a plurality of images to be processed from the video to be processed or acquiring a plurality of detection images from the detection video, and takes one method as an example for description.
The video to be processed can be divided into one or more video clips, and at least one image is selected from each video clip of the one or more video clips according to a certain time interval; at least one image constitutes an image set; one image set corresponds to one video clip; the one or more image sets include a plurality of images to be processed.
The method for obtaining the clothing attribute of the target object from the first model may refer to the method for obtaining the predicted clothing attribute of the target object from the first model in the machine model training method S3021, and will not be described herein.
S402: masks are extracted from the plurality of clothing feature maps, respectively, the masks being used to indicate the positions of the clothing.
The method for extracting masks from the plurality of clothes feature maps may refer to the mask extraction method in S3023 of the machine learning model training method, which is not repeated here.
S403: and processing the detection image corresponding to each clothes feature map through the mask corresponding to each clothes feature map to obtain a non-clothes feature map.
Specifically, firstly, respectively extracting features of a plurality of detection images to obtain a plurality of first feature images; the first feature map, the garment feature map and the mask are all the same size.
The feature extraction model is used for extracting features of the plurality of detection images respectively.
In the embodiment of the application, the feature extraction model may be a depth residual network ResNet or other network models, which is not limited in this aspect of the application. Taking Resnet as an example in this embodiment, the Resnet may include at least one convolution layer for extracting features of the input data and at least one pooling layer; the pooling layer is used for sampling input data. Both the convolution layer and the pooling layer include an activation function.
The feature extraction model is used for extracting global features of target objects in the plurality of sample images to obtain a plurality of first feature images. Each first characteristic diagram corresponds to each clothes characteristic diagram one by one, and the two characteristic diagrams are the same in size.
The method for obtaining the first feature map from the feature extraction model may refer to the first to third steps in the method for obtaining the clothing attribute of the target object from the first model in S3021, which are not described herein.
And then, carrying out dot multiplication on the mask corresponding to each clothes characteristic diagram and the first characteristic diagram corresponding to each clothes characteristic diagram to obtain a non-clothes characteristic diagram.
Multiple non-garment signatures may be obtained in the same manner as described above, and the application is not limited in this regard.
S404: and inputting the non-clothes feature graphs corresponding to the detection images into a second model to obtain the non-clothes attribute of the target object.
The method for obtaining the non-clothing attribute of the target object from the second model may refer to the method for obtaining the predicted non-clothing attribute of the target object from the second model in the machine model training method S3025, which is not described herein.
S405: and inputting the plurality of detection images into a third model to obtain global attributes of target objects in the plurality of detection images.
In some embodiments, the video processing method may further include S405.
The method for obtaining the global attribute of the target object from the third model may refer to the method for obtaining the predicted global attribute of the target object from the third model in the machine learning model training method S3026, and is not repeated here.
Further, in some embodiments, the video processing method may further include:
splicing the global attribute of the target object with the non-clothes attribute of the target object to obtain an identification attribute of the target object, and re-identifying the target object according to the identification attribute.
Because important non-clothing attributes may be erased by mistake when the clothing attribute of the target object is erased, the global attribute of the target object in the plurality of detection images is extracted and spliced with the non-clothing attribute to obtain the identification attribute; this supplements the non-clothing attribute of the target object and preserves its integrity, as in the sketch below.
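A minimal sketch of this splicing and retrieval step, assuming the global and non-clothing attributes are PyTorch feature vectors and that cosine similarity is used for ranking; the similarity measure is an assumption, not specified by the disclosure.

```python
import torch
import torch.nn.functional as F

def identification_attribute(global_attr, non_clothes_attr):
    """Splice (concatenate) the global attribute with the non-clothes attribute."""
    ident = torch.cat([global_attr, non_clothes_attr], dim=-1)
    return F.normalize(ident, dim=-1)          # unit length so the dot product is cosine similarity

def re_identify(query_ident, gallery_idents):
    """Rank gallery identities by similarity to the query identification attribute."""
    scores = gallery_idents @ query_ident      # (G, D) @ (D,) -> (G,)
    return scores.argsort(descending=True)
```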
The embodiment of the application provides a video-based clothes-changing pedestrian re-identification method, which obtains the non-clothing features of a target object by cross-erasing the clothing features of the target object in a plurality of detection images of a video to be processed. On the one hand, the video contains richer non-clothing features with sufficient discriminative power, which helps to improve the accuracy of pedestrian re-identification. On the other hand, erasing the clothing features of the target object in the plurality of detection images further improves the accuracy of pedestrian re-identification.
As shown in Table 1, the performance of the video-based pedestrian re-identification method of the embodiment of the application is compared with that of other conventional video-based pedestrian re-identification methods in same-clothes retrieval.
TABLE 1
As shown in Table 2, the performance of the video-based pedestrian re-identification method of the embodiment of the application is compared with that of other conventional video-based pedestrian re-identification methods in clothes-changing retrieval.
TABLE 2
The baseline, AP3D and TCLNet are all conventional video-based pedestrian re-identification methods. mAP denotes the mean average precision, an evaluation index of each video-based pedestrian re-identification method. TOP1 refers to the accuracy with which the top-ranked attribute matches the corresponding tag attribute; TOP5 refers to the accuracy with which any of the top five ranked attributes matches the corresponding tag attribute; TOP10 refers to the accuracy with which any of the top ten ranked attributes matches the corresponding tag attribute, as in the sketch below.
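A minimal sketch of the TOP-k accuracy used in Tables 1 and 2, assuming NumPy arrays of gallery labels already sorted by similarity; the mAP computation is omitted for brevity.

```python
import numpy as np

def top_k_accuracy(ranked_gallery_labels, query_labels, k):
    """ranked_gallery_labels: (num_queries, gallery_size) labels sorted by descending similarity."""
    hits = [query in ranking[:k] for ranking, query in zip(ranked_gallery_labels, query_labels)]
    return float(np.mean(hits))

# TOP1, TOP5 and TOP10 correspond to top_k_accuracy(..., k=1), k=5 and k=10 respectively.
```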
As can be seen from Tables 1 and 2, the conventional video-based pedestrian re-identification methods improve retrieval accuracy over the baseline in same-clothes retrieval, but their accuracy drops noticeably below the baseline in clothes-changing retrieval; this indicates that, because they do not consider the clothes-changing scenario, the conventional video-based methods are unsatisfactory in clothes-changing retrieval. Compared with the conventional methods, the method of the application achieves a large improvement in clothes-changing retrieval, which demonstrates the effectiveness of clothing erasure in the video-based pedestrian re-identification method; in addition, the method also improves over the baseline in same-clothes retrieval, which demonstrates the effectiveness and generality of the video-based pedestrian re-identification method.
Further, referring to fig. 5, fig. 5 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application. The video processing apparatus may include: a first information extraction module 10, a first feature extraction module 20, a first clothes erasing module 30, and a second information extraction module 40.
In some embodiments, the video processing apparatus further comprises a third information extraction module 50.
The first information extraction module 10 is configured to input a plurality of detection images into a first model to obtain clothes attributes of target objects in the plurality of detection images, wherein the first model is used for respectively extracting clothes features of the target objects in the plurality of detection images to obtain a plurality of clothes feature maps and identifying the clothes attributes of the target objects according to the plurality of clothes feature maps; the detection image is an image of the target object in the video to be processed.
The first feature extraction module 20 is configured to perform feature extraction on the plurality of detected images, so as to obtain a plurality of first feature maps.
A first clothing erasure module 30 for extracting masks from the plurality of clothing feature maps, respectively, the masks being used to indicate the positions of the clothing.
And a second information extraction module 40, configured to extract non-clothing properties of the target object from non-clothing feature maps corresponding to the plurality of detection images respectively.
In some embodiments, the third information extraction module 50 is configured to extract global attributes of the target objects from the plurality of detected images respectively.
The first clothes erasing module 30 is further configured to calculate a thermodynamic diagram corresponding to each clothes feature map, and then binarize the pixel value of each pixel point in each thermodynamic diagram to 0 or 1 to obtain the mask corresponding to each clothes feature map, where 1 represents a non-clothing position and 0 represents a clothing position.
The first clothes erasing module 30 is further configured to binarize pixel points whose pixel values are greater than a threshold value in each thermodynamic diagram to 0, and to binarize pixel points whose pixel values are less than or equal to the threshold value to 1.
The first clothes erasing module 30 is further configured to dot-multiply a mask corresponding to each clothes feature map and the first feature map corresponding to each clothes feature map to obtain a non-clothes feature map, where the mask is a binary map, and 1 represents a non-clothes position and 0 represents a clothes position.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a machine learning model training device according to an embodiment of the application. The machine learning model training device may include: a first acquisition module 10, a second feature extraction module 20, a second clothes erasing module 30, a second acquisition module 40, and an information processing module 50.
In some embodiments, the machine learning model training device further comprises a third acquisition module 60.
A first obtaining module 10, configured to perform clothes feature extraction on a plurality of sample images, so as to obtain predicted clothes attributes of a target object in the plurality of sample images; the first model is used for respectively extracting clothes features of a target object in the plurality of sample images to obtain a plurality of clothes feature graphs, and identifying predicted clothes properties of the target object according to the plurality of clothes feature graphs; the sample image is an image of the target object in a sample video.
A second clothing erasure module 30 for extracting masks from the plurality of clothing feature maps, respectively, the masks being used to indicate the positions of the clothing.
The second clothes erasing module 30 is further configured to process the sample image corresponding to each clothes feature map through a mask corresponding to each clothes feature map, so as to obtain a non-clothes feature map.
And a second obtaining module 40, configured to extract non-clothing features of the target object from the non-clothing feature maps corresponding to the plurality of sample images respectively, so as to obtain predicted non-clothing properties of the target object.
An information processing module 50, configured to adjust model parameters of the first model and the second model based on errors between the predicted clothes attribute, the predicted non-clothes attribute, and the tag attribute of the target object in the plurality of sample images.
In some embodiments, the second feature extraction module 20 is configured to perform feature extraction on the plurality of sample images to obtain a plurality of first feature maps; the first feature map, the clothes feature map and the mask all have the same size.
In some embodiments, the third obtaining module 60 is configured to extract the predicted global attribute of the target object from the plurality of sample images respectively.
In some embodiments, the information processing module 50 is further configured to adjust model parameters of the garment erasure model based on errors between predicted garment properties, predicted non-garment properties, and tag properties of the target object in the plurality of sample images.
In some embodiments, the information processing module 50 is further configured to adjust model parameters of the feature extraction model based on an error between a predicted clothing attribute, a predicted non-clothing attribute, and a tag attribute of the target object in the plurality of sample images.
In some embodiments, the information processing module 50 is further configured to adjust model parameters of the third model based on an error between a predicted clothing attribute, a predicted non-clothing attribute, a predicted global attribute, and a tag attribute of the target object in the plurality of sample images.
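As a minimal sketch of one such parameter-adjustment step, the following assumes PyTorch models whose predictions are classification logits, cross-entropy losses on the person ID and clothes ID labels, and a single optimizer over all trainable parameters; the loss form and weight are illustrative assumptions, not values from the disclosure.

```python
import torch
import torch.nn.functional as F

def training_step(pred_clothes, pred_non_clothes, pred_global,
                  clothes_id, person_id, optimizer, w=1.0):
    """One joint update of the models based on errors between predictions and tag attributes."""
    loss = (F.cross_entropy(pred_clothes, clothes_id)         # first model: predicted clothes attribute
            + F.cross_entropy(pred_non_clothes, person_id)    # second model: predicted non-clothes attribute
            + w * F.cross_entropy(pred_global, person_id))    # third model: predicted global attribute
    optimizer.zero_grad()
    loss.backward()                                           # errors propagate to all models being trained
    optimizer.step()                                          # adjust the model parameters jointly
    return loss.item()
```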
The embodiment of the application also provides a computer storage medium storing program instructions which, when executed, may perform some or all of the steps of the method in the first embodiment or the second embodiment.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the order of actions described, as some steps may be performed in another order or simultaneously according to the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the server reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the server performs the steps performed in the embodiments of the methods described above.
The video processing method, the machine learning model training method, and the related devices and equipment provided by the embodiments of the application are described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the above embodiments is only intended to help understand the method and the core idea of the application. Meanwhile, those skilled in the art may make variations to the specific embodiments and the application scope in accordance with the idea of the application; in view of the above, the contents of this specification should not be construed as limiting the application.

Claims (17)

1. A video processing method, comprising:
inputting a plurality of detection images into a first model to obtain clothes attributes of target objects in the plurality of detection images, wherein the first model is used for respectively extracting clothes features of the target objects in the plurality of detection images to obtain a plurality of clothes feature images and identifying the clothes attributes of the target objects according to the plurality of clothes feature images; the detection image is an image of the target object in the video to be processed;
extracting masks from the plurality of clothes feature maps respectively, wherein the masks are used for indicating the positions of clothes;
processing the detection image corresponding to each clothes feature map through the mask corresponding to each clothes feature map to obtain a non-clothes feature map;
and inputting the non-clothing feature graphs corresponding to the detection images into a second model to obtain the non-clothing attribute of the target object.
2. The method according to claim 1, wherein the extracting masks from the plurality of clothing feature maps respectively comprises:
calculating a corresponding thermodynamic diagram of each garment feature map;
binarizing a pixel value of each pixel point in each thermodynamic diagram into 0 or 1 to obtain a mask corresponding to each clothing feature diagram; where 1 represents a non-clothing position and 0 represents a clothing position.
3. The method of claim 2, wherein binarizing the pixel value of each pixel point in each thermodynamic diagram to 0 or 1 comprises:
binarizing pixel points with pixel values larger than a threshold value in each thermodynamic diagram to 0;
and binarizing the pixel points with the pixel values smaller than or equal to the threshold value in each thermodynamic diagram into 1.
4. A method according to any one of claims 1 to 3, wherein the processing, by using a mask corresponding to each clothing feature map, the image corresponding to each clothing feature map to obtain a non-clothing feature map specifically includes:
respectively extracting features of the plurality of detection images to obtain a plurality of first feature images; the first feature map, the garment feature map and the mask are all the same in size;
and carrying out dot multiplication on a mask corresponding to each clothes characteristic diagram and a first characteristic diagram corresponding to each clothes characteristic diagram to obtain a non-clothes characteristic diagram, wherein the mask is a binary diagram, 1 represents a non-clothes position, and 0 represents a clothes position.
5. The method according to claim 1, wherein the method further comprises:
inputting the plurality of detection images into a third model to obtain global attributes of target objects in the plurality of detection images; the global attribute is used for splicing with the non-clothing attribute to obtain identification attributes of target objects in the plurality of detection images, and the identification attributes are used for re-identifying the target objects.
6. A machine learning model training method, comprising:
inputting a plurality of sample images into a first model to obtain predicted clothes properties of a target object in the plurality of sample images; the first model is used for respectively extracting clothes features of a target object in the plurality of sample images to obtain a plurality of clothes feature graphs, and identifying predicted clothes properties of the target object according to the plurality of clothes feature graphs; the sample image is an image of the target object in a sample video;
extracting masks from the plurality of clothes feature maps respectively, wherein the masks are used for indicating the positions of clothes;
processing a sample image corresponding to each clothes feature map through a mask corresponding to each clothes feature map to obtain a non-clothes feature map;
inputting the non-clothing feature graphs corresponding to the sample images respectively into a second model to obtain the predicted non-clothing attribute of the target object;
model parameters of the first model and the second model are adjusted based on errors between predicted clothing properties, predicted non-clothing properties, and tag properties of the target object in the plurality of sample images.
7. The method of claim 6, wherein inputting the plurality of sample images into the first model results in predicted clothing properties of the target object in the plurality of sample images, comprising:
inputting the plurality of sample images into the first model, and performing supervised learning by taking the person ID and the clothes ID of the target object as labels, to obtain the predicted clothes attribute of the target object in the plurality of sample images.
8. The method of claim 6, wherein inputting the non-clothing feature map corresponding to each of the plurality of sample images to a second model results in the predicted non-clothing properties of the target object, comprising:
and inputting the non-clothing feature graphs corresponding to the sample images into the second model, and performing supervised learning by taking the person ID of the target object as a tag to obtain the predicted non-clothing attribute of the target object.
9. The method of claim 6, wherein extracting masks from the plurality of garment feature maps, respectively, comprises:
inputting the multiple clothes feature images into a clothes erasing model respectively to obtain masks respectively corresponding to the multiple clothes feature images;
the method further comprises the steps of: model parameters of the garment erasure model are adjusted based on errors between predicted garment properties, predicted non-garment properties, and tag properties of a target object in the plurality of sample images.
10. The method of claim 6, wherein the method further comprises:
respectively inputting the plurality of sample images into a feature extraction model to obtain first feature images respectively corresponding to the plurality of sample images; the first feature map, the garment feature map and the mask are all the same in size;
the method further comprises the steps of: model parameters of the feature extraction model are adjusted based on errors between predicted clothing properties, predicted non-clothing properties, and tag properties of the target object in the plurality of sample images.
11. The method according to claim 10, wherein the inputting the plurality of sample images into the feature extraction model to obtain the first feature maps respectively corresponding to the plurality of sample images includes:
and respectively inputting the plurality of sample images into the feature extraction model, and performing supervised learning by taking the character ID of the target object as a tag to obtain first feature images respectively corresponding to the plurality of sample images.
12. The method of claim 6, wherein the method further comprises:
inputting the plurality of sample images into a third model to obtain a predicted global attribute of the target object;
Model parameters of the third model are adjusted based on errors between predicted clothing properties, predicted non-clothing properties, predicted global properties, and tag properties of the target object in the plurality of sample images.
13. The method of claim 12, wherein said inputting the plurality of sample images into a third model results in a predicted global property of the target object, comprising:
and inputting the plurality of sample images into the third model, and performing supervised learning by taking the person ID of the target object as a tag to obtain the predicted global attribute of the target object.
14. A video processing apparatus, comprising:
the first information extraction module is used for extracting clothes features from the plurality of detection images to obtain a plurality of clothes feature images; identifying clothing properties of a target object in the plurality of detection images according to the plurality of clothing feature maps; the detection image is an image of the target object in the video to be processed;
the first feature extraction module is used for carrying out feature extraction on the plurality of detection images to obtain a plurality of first feature images;
a first clothes erasing module for respectively extracting masks from the plurality of clothes feature diagrams, wherein the masks are used for indicating the positions of clothes;
The first clothes erasing module is further used for processing the detection image corresponding to each clothes feature map through the mask corresponding to each clothes feature map to obtain a non-clothes feature map;
and the second information extraction module is used for extracting the non-clothing attribute of the target object from the non-clothing feature graphs corresponding to the detection images respectively.
15. A machine learning model training apparatus, comprising:
the first acquisition module is used for extracting clothes features from the plurality of sample images to obtain a plurality of clothes feature images; identifying predicted garment attributes of a target object in the plurality of sample images according to the plurality of garment feature maps; the sample image is an image of the target object in a sample video;
the second feature extraction module is used for carrying out feature extraction on the plurality of sample images to obtain a plurality of first feature images;
a second clothes erasing module for respectively extracting masks from the plurality of clothes feature diagrams, wherein the masks are used for indicating the positions of clothes;
the second clothes erasing module is further used for processing the sample image corresponding to each clothes feature map through the mask corresponding to each clothes feature map to obtain a non-clothes feature map;
The second acquisition module is used for extracting predicted non-clothing attributes of the target object from the non-clothing feature graphs corresponding to the sample images respectively;
and the information processing module is used for adjusting model parameters of the first model and the second model based on the predicted clothes attribute, the predicted non-clothes attribute and the label attribute of the target object in the plurality of sample images.
16. A computer device, comprising: a processor and a memory;
the processor is connected to a memory, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any of claims 1-13.
17. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-13.
CN202210214986.7A 2022-03-04 2022-03-04 Video processing method, machine learning model training method, related device and equipment Pending CN116758365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210214986.7A CN116758365A (en) 2022-03-04 2022-03-04 Video processing method, machine learning model training method, related device and equipment

Publications (1)

Publication Number Publication Date
CN116758365A true CN116758365A (en) 2023-09-15

Family

ID=87957714



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination