CN114005053A - Video processing method, video processing device, computer equipment and computer-readable storage medium - Google Patents

Video processing method, video processing device, computer equipment and computer-readable storage medium

Info

Publication number
CN114005053A
CN114005053A (application CN202111089769.1A)
Authority
CN
China
Prior art keywords
video
features
images
target
frame images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111089769.1A
Other languages
Chinese (zh)
Inventor
何天宇
金鑫
沈旭
黄建强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202111089769.1A
Publication of CN114005053A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video processing method, a video processing device, computer equipment and a computer-readable storage medium. The method comprises: acquiring multiple frames of images in a video; respectively extracting image features of the multiple frames of images; processing the image features of the multiple frames of images with a first attention layer to obtain a first target feature; extracting representative features respectively corresponding to the multiple frames of images based on their image features; and processing the first target feature and the representative features respectively corresponding to the multiple frames of images with a second attention layer to obtain description features of the video. The invention solves the technical problem of low recognition accuracy in video recognition in the related art.

Description

Video processing method, video processing device, computer equipment and computer-readable storage medium
Technical Field
The present invention relates to the field of computers, and in particular, to a video processing method, apparatus, computer device, and computer-readable storage medium.
Background
Nowadays, deep learning technology is widely applied in various fields and brings great convenience to people's daily life and travel. Video re-identification is one such important technology. Taking vehicle re-identification as an example, it aims to match the appearance information of a given input vehicle across different cameras, so as to achieve image-to-image retrieval. In general, a single image is susceptible to noise such as detection errors and false detections. To address this, the related art uses a video segment as input for feature extraction.
In the related art, neural-network-based vehicle video re-identification methods can be roughly divided into three categories: 1) RNN (Recurrent Neural Network) based methods; 2) 3D-CNN (Convolutional Neural Network) based methods; 3) Graph-structure based methods. However, when these schemes are used for vehicle re-identification, problems such as high computational complexity, limited usage scenarios and low recognition accuracy can arise.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a video processing method, a video processing device, computer equipment and a computer-readable storage medium, to at least solve the technical problem of low recognition accuracy in video recognition in the related art.
According to an aspect of an embodiment of the present invention, there is provided a video processing method, including: acquiring a plurality of frame images in a video; respectively extracting the image characteristics of the multiple frames of images; processing the image characteristics of the multi-frame images by adopting a first attention layer to obtain first target characteristics; extracting representative features respectively corresponding to the multiple frames of images based on the image features of the multiple frames of images; and processing the first target feature and the representative features respectively corresponding to the multi-frame images by adopting a second attention layer to obtain the description features of the video.
Optionally, the processing, by using the second attention layer, the first target feature and the representative features respectively corresponding to the multiple frames of images to obtain the descriptive feature of the video includes: embedding the time sequence characteristics and the space characteristics of the multi-frame images into the first target characteristics to obtain second target characteristics; and processing the second target feature and the representative features respectively corresponding to the multiple frames of images by adopting the second attention layer to obtain the description features of the video.
Optionally, before embedding the time-series feature and the spatial feature of the multiple frames of images in the first target feature to obtain a second target feature, the method further includes: dividing a single frame image of the video into regions to obtain a plurality of partial regions; performing average pooling on the plurality of partial regions respectively to obtain region features respectively representing the plurality of partial regions; determining the spatial feature of the multiple frames of images based on the region features of the plurality of partial regions, and determining the time-series feature of the multiple frames of images based on the time sequence of the video.
Optionally, before the processing, by using the second attention layer, the second target feature and the representative features respectively corresponding to the multiple frames of images to obtain the description feature of the video, the method further includes: respectively extracting attention features of corresponding frame images in the multi-frame images, wherein the attention features are features of which importance parameters are larger than a preset threshold value in the corresponding frame images; respectively obtaining the weights of the attention features of the corresponding frame images; and applying the weight of the attention feature of the corresponding frame image to the attention feature to obtain the representative feature of the corresponding frame image.
Optionally, after obtaining the description feature of the video, the method further includes: identifying a target object in the video based on the descriptive features; and displaying the identified target object.
According to an aspect of an embodiment of the present invention, there is provided a video processing method, including: receiving a vehicle video, and acquiring a multi-frame image from the vehicle video; obtaining description features of the vehicle in the vehicle video based on the multi-frame images, wherein the description features are obtained by processing first target features and representative features respectively corresponding to the multi-frame images through a second attention layer, the first target features are obtained by processing image features of the multi-frame images through a first attention layer, and the representative features respectively corresponding to the multi-frame images are obtained by extraction based on the image features of the multi-frame images; and matching the description characteristics with the vehicle characteristics of the target vehicle to obtain a matching result, wherein the matching result is used for identifying whether the vehicle in the vehicle video is the target vehicle.
Optionally, the method further comprises: outputting the matching result in a predetermined manner, wherein the predetermined manner comprises at least one of: the display screen displays, the printing equipment prints and the alarm equipment alarms.
According to another aspect of the embodiments of the present invention, there is provided a video processing method, including: receiving a person video, and acquiring a plurality of frames of images from the person video; obtaining description features of people in the people video based on the multi-frame images, wherein the description features are obtained by processing first target features and representative features respectively corresponding to the multi-frame images through a second attention layer, the first target features are obtained by processing image features of the multi-frame images through a first attention layer, and the representative features respectively corresponding to the multi-frame images are obtained by extraction based on the image features of the multi-frame images; and matching the description characteristics with the character characteristics of the target character to obtain a matching result, wherein the matching result is used for identifying whether the character in the character video is the target character.
According to another aspect of the embodiments of the present invention, there is provided a video processing apparatus including: the first acquisition module is used for acquiring multi-frame images in a video; the first extraction module is used for respectively extracting the image characteristics of the multiple frames of images; the first processing module is used for processing the image characteristics of the multi-frame images by adopting a first attention layer to obtain first target characteristics; the second extraction module is used for extracting representative features respectively corresponding to the multiple frames of images based on the image features of the multiple frames of images; and the second processing module is used for processing the first target feature and the representative features respectively corresponding to the multi-frame images by adopting a second attention layer to obtain the description features of the video.
According to another aspect of the embodiments of the present invention, there is provided a video processing apparatus including: the first receiving module is used for receiving a vehicle video and acquiring a plurality of frames of images from the vehicle video; the third processing module is used for obtaining description characteristics of the vehicle in the vehicle video based on the multi-frame images, wherein the description characteristics are obtained by processing first target characteristics and representative characteristics corresponding to the multi-frame images respectively through a second attention layer, the first target characteristics are obtained by processing image characteristics of the multi-frame images through the first attention layer, and the representative characteristics corresponding to the multi-frame images are obtained by extraction based on the image characteristics of the multi-frame images; and the fourth processing module is used for matching the description characteristics with the vehicle characteristics of the target vehicle to obtain a matching result, wherein the matching result is used for identifying whether the vehicle in the vehicle video is the target vehicle.
According to another aspect of the embodiments of the present invention, there is provided a video processing apparatus including: the second receiving module is used for receiving the character video and acquiring a plurality of frames of images from the character video; the fifth processing module is used for obtaining description characteristics of people in the people video based on the multi-frame images, wherein the description characteristics are obtained by processing first target characteristics and representative characteristics corresponding to the multi-frame images respectively through a second attention layer, the first target characteristics are obtained by processing image characteristics of the multi-frame images through the first attention layer, and the representative characteristics corresponding to the multi-frame images are extracted based on the image characteristics of the multi-frame images; and the sixth processing module is used for matching the description characteristics with the character characteristics of the target character to obtain a matching result, wherein the matching result is used for identifying whether the character in the character video is the target character.
According to another aspect of an embodiment of the present invention, there is provided a computer apparatus including: a memory and a processor, the memory storing a computer program; the processor is configured to execute the computer program stored in the memory, and when the computer program runs, the processor is enabled to execute any one of the video processing methods.
According to another aspect of embodiments of the present invention, there is provided a computer-readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform any one of the video processing methods.
According to another aspect of embodiments of the present invention, there is provided a computer program product comprising a computer program which, when executed by a processor, implements any of the video processing methods described herein.
In the embodiment of the invention, the image features of multiple frames of images in a video are extracted, the obtained image features are processed through a first attention layer to obtain first target features, the representative features corresponding to the multiple frames of images are extracted based on the image features, and the first target features and the representative features are processed through a second attention layer, so that the description features of the video are obtained. Because the description characteristics of the video are obtained by comprehensively processing the first target characteristics and the representative characteristics, the obtained description characteristics of the video are more accurate and detailed, and the technical problem of low identification precision during video identification in the related technology is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 shows a block diagram of a hardware structure of a computer terminal for implementing a video processing method;
fig. 2 is a flowchart of a first video processing method according to embodiment 1 of the present invention;
fig. 3 is a flowchart of a second video processing method according to embodiment 1 of the present invention;
fig. 4 is a flowchart of a third video processing method according to embodiment 1 of the present invention;
FIG. 5 is a flow chart of a vehicle video re-identification method provided in accordance with an alternative embodiment of the present invention;
FIG. 6 is a schematic illustration of spatiotemporal position embedding in a vehicle video re-identification method provided in accordance with an alternative embodiment of the present invention;
FIG. 7 is a schematic diagram of various implementations of a vehicle video re-identification method according to an alternative embodiment of the invention;
fig. 8 is a block diagram of a first video processing apparatus according to embodiment 2 of the present invention;
fig. 9 is a block diagram of a second video processing apparatus according to embodiment 3 of the present invention;
fig. 10 is a block diagram of a third video processing apparatus according to embodiment 4 of the present invention;
fig. 11 is an apparatus block diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some terms or terms appearing in the description of the embodiments of the present application are applicable to the following explanations:
Deep Learning (Deep Learning): deep learning refers to a family of algorithms that apply machine learning on multi-layer neural networks to solve problems in domains such as images and text. Deep learning broadly falls under the category of neural networks, but there are many variations in its concrete implementation. The core of deep learning is feature learning, which aims to obtain hierarchical feature information through layered networks, thereby removing the previous need to design features manually.
Artificial Neural Networks (Artificial Neural Networks): artificial neural networks have been a research hotspot in the field of artificial intelligence since the 1980s. An artificial neural network is a computational model that abstracts the neuron network of the human brain from the perspective of information processing, and different networks are formed according to different connection modes.
Video-based Vehicle Re-identification: vehicle video re-identification takes a video clip containing a vehicle as input and outputs a high-dimensional feature vector of the vehicle by methods such as machine learning.
Example 1
In accordance with an embodiment of the present invention, a video processing method embodiment is provided. It should be noted that the steps illustrated in the flowcharts of the accompanying drawings may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one here.
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a hardware structure block diagram of a computer terminal (or mobile device) for implementing the video processing method. As shown in fig. 1, the computer terminal 10 (or mobile device) may include one or more processors 102 (shown as 102a, 102b, …, 102n; these may include, but are not limited to, processing devices such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, it may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the BUS), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied, in whole or in part, in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuitry serves as a processor control (e.g., selecting a variable-resistance termination path connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the video processing method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, implementing the video processing method of the application program. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
Under the above operating environment, the present application provides a video processing method as shown in fig. 2. Fig. 2 is a flowchart of a first video processing method according to embodiment 1 of the present invention, as shown in fig. 2, the method includes the following steps:
step S202, acquiring multi-frame images in a video;
step S204, respectively extracting the image characteristics of a plurality of frames of images;
step S206, processing the image characteristics of the multi-frame images by adopting the first attention layer to obtain a first target characteristic;
step S208, extracting representative features respectively corresponding to the multiple frames of images based on the image features of the multiple frames of images;
and step S210, processing the first target feature and the representative features respectively corresponding to the multiple frames of images by adopting a second attention layer to obtain the description features of the video.
Through the steps, the image features of the multiple frames of images in the video are extracted, the obtained image features are processed through the first attention layer to obtain the first target features, the representative features corresponding to the multiple frames of images are extracted based on the image features, and the first target features and the representative features are processed through the second attention layer, so that the description features of the video are obtained. Because the description features of the video are obtained by comprehensively processing the first target features and the representative features, the representative features can respectively reflect more detailed features in a multi-frame image to a certain extent, so that the obtained description features of the video are more accurate, more detailed and more abundant, and the technical problem of low identification precision during video identification in the related technology is solved.
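To make the flow of steps S202 to S210 concrete, the following is a minimal sketch in PyTorch of how the five steps could be wired together; all module names and shapes here are illustrative assumptions, not structures fixed by this embodiment.

    import torch.nn as nn

    class VideoDescriptor(nn.Module):
        def __init__(self, backbone, first_attn, repr_extractor, second_attn):
            super().__init__()
            self.backbone = backbone              # per-frame image features (S204)
            self.first_attn = first_attn          # first attention layer   (S206)
            self.repr_extractor = repr_extractor  # representative features (S208)
            self.second_attn = second_attn        # second attention layer  (S210)

        def forward(self, frames):                # frames: (T, C, H, W), from S202
            feats = self.backbone(frames)                 # image features
            first_target = self.first_attn(feats)         # first target feature
            reprs = self.repr_extractor(feats)            # one per frame
            return self.second_attn(first_target, reprs)  # description feature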
As an alternative embodiment, multiple frames of images in a video are acquired: the video is decomposed into multiple frames of images so that the images can be processed. The video may be of various types, such as segments captured by road monitoring, highway speed-measurement cameras or ordinary cameras, as well as recorded video, streaming video, and so on. After the multiple frames of images are acquired, they may be further preprocessed, which may include at least one of the following: adjusting image tone (for example, brightness or contrast); resizing the image (for example, zooming in or out, or adjusting length, width and height); cropping different positions of the image; and so on. Moreover, images of different applicable types may be selected based on different scenes. For example, when the purpose of acquiring the multiple frames of images is to obtain the description feature of a certain target object in the video, only video segments in which the target object appears may be selected to acquire the frames related to that object, where the target object may be any object appearing in the video, and the images obtained by preprocessing are retained. For example, in a vehicle-monitoring traffic scene, only the monitoring segments through which the vehicle passes are selected, and different positions of the image can be cropped and retained through preprocessing, e.g., multiple frames of the license plate and multiple frames of the vehicle body. This can be applied to recognizing the license plate, the vehicle body and other vehicle-related image content in such a scene. Thus, for different types of multi-frame images in various videos, the frames can be acquired with respect to the target object, enabling recognition and application for the relevant target-object video, which effectively alleviates the technical problem of low recognition accuracy in video recognition in the related art. This alternative embodiment imposes no limitation on the type or number of target objects, giving the video recognition good generality.
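As a hedged illustration of frame acquisition and preprocessing, the sketch below uses OpenCV to decode a video, keep every few frames, and resize them; the stride and target size are illustrative choices, not values prescribed by this embodiment.

    import cv2

    def sample_frames(video_path, stride=5, size=(224, 224)):
        """Decode a video and keep every `stride`-th frame, resized for later feature extraction."""
        cap = cv2.VideoCapture(video_path)
        frames, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % stride == 0:
                frames.append(cv2.resize(frame, size))  # simple preprocessing: resize
            idx += 1
        cap.release()
        return frames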
As an alternative embodiment, the image features of the multiple frames of images are extracted respectively. The image features may take various forms, such as feature vectors. The image features of different types of images need not be of the same kind; extraction is performed in a targeted manner for the multi-frame images to obtain their image features.
As an alternative embodiment, the first attention layer is used to process the image features of the multiple frames of images to obtain the first target feature. That is, the first attention layer processes the multiple frames of images obtained from the video to obtain the first target feature of the video. For example, a CNN network may be used to obtain the global features of the video: the multi-frame images in the video segment are input, and a CNN neural network model extracts the global features of the video. Global features of the video are obtained through the first attention layer so that the video can be analyzed globally.
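The embodiment does not fix a particular CNN; as one assumed example, a torchvision ResNet-50 with its classification head removed can serve as the per-frame feature extractor:

    import torch
    import torchvision.models as models

    # Assumed backbone: ResNet-50 without avgpool/fc, kept only as an example.
    backbone = models.resnet50(weights=None)
    backbone = torch.nn.Sequential(*list(backbone.children())[:-2])

    frames = torch.randn(8, 3, 224, 224)   # 8 frames treated as one batch
    feature_maps = backbone(frames)        # (8, 2048, 7, 7) spatial feature maps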
As an alternative embodiment, representative features respectively corresponding to the multiple frames of images are extracted based on the image features of the multiple frames of images; that is, each frame image can correspond to one representative feature. By determining the representative features, more attention can be paid to the detailed parts of each frame image, and hence to the detailed regions of the video, making the video recognition result more fine-grained.
As an alternative embodiment, the second attention layer is adopted to process the first target feature and the representative features respectively corresponding to the multiple frames of images, so as to obtain the description features of the video. The second attention layer may be a Dense Attention layer. By using the second attention layer, the first target feature of the video and the representative feature of each frame image, i.e., the global feature and each frame's representative feature, are combined to obtain a more detailed and comprehensive description feature of the video, so that the description feature can represent the video's characteristics more richly and provide a basis for subsequently identifying the target object in the video more accurately.
As an alternative embodiment, the description features of the video may be obtained by processing the first target feature and the representative features corresponding to the multiple frames of images with the second attention layer as follows: embedding the time-series feature and the spatial feature of the multiple frames of images into the first target feature to obtain a second target feature; and processing the second target feature and the representative features corresponding to the multiple frames of images with the second attention layer to obtain the description features of the video. It should be noted that the second target feature in this embodiment embeds the time-series feature and the spatial feature of the multi-frame images, so the obtained second target feature carries temporal and spatial order. Processing according to temporal and spatial order ensures the spatio-temporal validity of image processing. Processing can also be performed selectively by time period and region, e.g., within the same region during the same time period, according to the needs of practical applications, so that the processing of the multi-frame images in the video is associated with the corresponding time and space. This improves the accuracy and precision of video processing, and the description features of the video can be obtained more accurately, effectively and conveniently.
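A minimal sketch of the embedding step, assuming learned position vectors (one per time step and one per spatial block) that are added to the first-target-feature tokens; the shapes and the four-block split are illustrative assumptions.

    import torch
    import torch.nn as nn

    T, R, D = 8, 4, 256                    # frames, regions per frame, feature dim
    tokens = torch.randn(T, R, D)          # first-target-feature tokens

    time_emb = nn.Parameter(torch.zeros(T, 1, D))    # time-series embedding
    space_emb = nn.Parameter(torch.zeros(1, R, D))   # spatial embedding

    second_target = tokens + time_emb + space_emb    # broadcast add over (T, R, D)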
As an alternative embodiment, before embedding the time-series feature and spatial feature of the multiple frames of images into the first target feature to obtain the second target feature, the following processing may be performed: a single frame image of the video is divided into regions to obtain a plurality of partial regions. When dividing a single frame image into regions, the image may be equally divided into four parts, i.e., 4 blocks may serve as the spatio-temporal position embedding step length; of course, the image may also be divided into another number of blocks as the embedding step length. Average pooling is performed on each of the partial regions to obtain region features representing the respective partial regions; the spatial feature of the multi-frame images is determined based on the region features of the partial regions, and the time-series feature of the multi-frame images is determined based on the time sequence of the video. By partitioning and aggregating the spatial features of each frame image, the spatial and time-series features obtained from the region features of the partial regions are more reliable. Through such spatio-temporal interaction, scale changes of the target object (e.g., a fixed camera position with the target object moving and changing in size), motion changes (e.g., the same position with the target object in different postures, such as squatting versus standing) and occlusion problems (e.g., a fixed camera position with the target object occluded during motion) can be captured, so that the obtained description features of the video are more faithful to its content.
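The region split and average pooling can be sketched as follows, assuming the four parts are horizontal strips of the frame's feature map (the strip direction is an assumption; the embodiment only specifies an equal four-way split):

    import torch
    import torch.nn.functional as F

    fmap = torch.randn(1, 2048, 8, 8)                # one frame's CNN feature map
    parts = torch.chunk(fmap, 4, dim=2)              # equal 4-way split along height
    region_feats = torch.stack(
        [F.adaptive_avg_pool2d(p, 1).flatten(1) for p in parts], dim=1
    )                                                # (1, 4, 2048) region features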
As an alternative embodiment, before processing the second target feature and the representative features respectively corresponding to the multiple frames of images with the second attention layer to obtain the description features of the video, the method further includes: respectively extracting attention features of the corresponding frame images in the multi-frame images, where the attention features are features whose importance parameters in the corresponding frame image are greater than a preset threshold; respectively obtaining the weights of the attention features of the corresponding frame images; and applying the weight of the attention features of the corresponding frame image to those attention features to obtain the representative feature of the corresponding frame image. The representative features are obtained by combining the attention features in each frame with their corresponding weights, shifting attention to the important feature parts of the multi-frame images, so that the obtained representative features can truly represent the important parts of each frame image, enabling accurate recognition of each frame.
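A sketch of this weighting step, assuming a small learned scoring head; the head itself and the thresholding of importance parameters are illustrative assumptions, not prescribed by the embodiment:

    import torch
    import torch.nn as nn

    score = nn.Linear(2048, 1)                       # assumed importance scorer

    def representative_feature(attn_feats):          # (N, 2048) attention features of one frame
        w = torch.softmax(score(attn_feats), dim=0)  # weight of each attention feature
        return (w * attn_feats).sum(dim=0)           # weighted sum -> (2048,) representative feature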
As an alternative embodiment, after obtaining the description features of the video, the method further includes: identifying a target object in the video based on the description features; and displaying the identified target object. The description features may also be used to locate the target object identified in the video, to monitor the target object identified based on the description features, and so on. The identified target object may be used in a variety of scenarios for a variety of purposes, such as querying and mining the target object, searching and comparing the target object, tracking the target object, and analysis and enhanced management based on the target object; for example, when the target object is a vehicle, public-security management can be strengthened through the vehicle's driving state. This strengthens the functional applications in different scenes and enables multiple purposes to be realized more flexibly and accurately.
Through the above processing, after the first target feature describing the video globally is obtained from the multiple frames of the video, the representative feature obtained from each frame image is combined with the first target feature. This effectively avoids the problem in the related art that, when the frames of a video are recognized sequentially in time order, the earlier frames are forgotten.
Fig. 3 is a flowchart of a second video processing method according to embodiment 1 of the present invention, as shown in fig. 3, the method includes the following steps:
step S302, receiving a vehicle video, and acquiring a plurality of frames of images from the vehicle video;
step S304, obtaining description characteristics of the vehicle in the vehicle video based on the multi-frame images, wherein the description characteristics are obtained by processing the first target characteristics and the representative characteristics corresponding to the multi-frame images respectively through a second attention layer, the first target characteristics are obtained by processing the image characteristics of the multi-frame images through the first attention layer, and the representative characteristics corresponding to the multi-frame images are obtained by extracting the image characteristics of the multi-frame images;
and step S306, matching the description characteristics with the vehicle characteristics of the target vehicle to obtain a matching result, wherein the matching result is used for identifying whether the vehicle in the vehicle video is the target vehicle.
Through the above steps, multiple frames of images of the vehicle video are acquired to obtain the description features of the vehicle in the vehicle video, and the description features can then be matched against the vehicle features of the target vehicle to determine whether the vehicle in the vehicle video is the target vehicle. Since the description features of the vehicle are obtained by processing the first target feature and the representative features corresponding to the multiple frames of images with the second attention layer, the obtained description features are more accurate and detailed. This solves the technical problem of low recognition accuracy in video recognition in the related art, and achieves the technical effect of identifying whether the vehicle in the video is the target vehicle to be matched.
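The matching step is not tied to a specific metric in this embodiment; a common choice, shown here as an assumption, is cosine similarity against a stored target-vehicle feature with an illustrative threshold:

    import torch
    import torch.nn.functional as F

    def is_target_vehicle(desc_feat, target_feat, threshold=0.8):
        # threshold is an illustrative value, not given by this embodiment
        sim = F.cosine_similarity(desc_feat, target_feat, dim=0)
        return bool(sim >= threshold)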
As an alternative embodiment, after the vehicle is identified, that is, after determining whether the vehicle in the video is the target vehicle to be matched, the matching result may be output in a predetermined manner, where the predetermined manner includes at least one of: display on a screen, printing by a printing device, and an alarm by an alarm device. Selecting a suitable form to present the matching result based on the actual application scene makes the result easier for the user to perceive, improving the user experience and the convenience of the application.
Fig. 4 is a flowchart of a third video processing method according to embodiment 1 of the present invention, and as shown in fig. 4, the method includes the following steps:
step S402, receiving a person video, and acquiring a plurality of frames of images from the person video;
step S404, obtaining description characteristics of the person in the person video based on the multi-frame images, wherein the description characteristics are obtained by processing the first target characteristics and the representative characteristics corresponding to the multi-frame images respectively through a second attention layer, the first target characteristics are obtained by processing the image characteristics of the multi-frame images through the first attention layer, and the representative characteristics corresponding to the multi-frame images are obtained by extracting the image characteristics of the multi-frame images;
step S406, matching the description characteristics with the character characteristics of the target character to obtain a matching result, wherein the matching result is used for identifying whether the character in the character video is the target character.
Through the above steps, multiple frames of images of the person video are acquired to obtain the description features of the person in the person video, and the description features can then be matched against the person features of the target person to determine whether the person in the person video is the target person. Since the description features of the person are obtained by processing the first target feature and the representative features corresponding to the multiple frames of images with the second attention layer, the obtained description features are more accurate and detailed. This solves the technical problem of low recognition accuracy in video recognition in the related art, and achieves the technical effect of identifying whether the person in the video is the target person to be matched.
Based on the above embodiment and the alternative embodiment, an alternative implementation is provided, which is based on a scene of vehicle weight recognition, and is specifically described below.
In the related art, neural-network-based vehicle video re-identification methods can be roughly divided into three categories: 1) RNN-based methods, which use a CNN to extract single-frame features and then use an RNN to model how the vehicle changes over the time sequence, finally outputting one description feature; 2) 3D-CNN-based methods, which extend common 2D CNNs to 3D scenarios; 3) Graph-structure-based methods, which extract features from video frames with a CNN, model the extracted features with a Graph, and output the final features.
However, when vehicle re-identification is performed with these schemes, the following problems arise: 1) RNN-based methods are susceptible to catastrophic forgetting and similar issues, so the output features only attend to the last input video frames; 2) 3D-CNN-based methods have high computational complexity and limited usage scenarios; 3) Graph-based methods build the Graph on top-level CNN features and therefore lack attention to detailed information.
In view of this, in an alternative embodiment of the present invention, a vehicle video re-identification method based on intensive interactive learning is provided. The following is a detailed description of alternative embodiments of the invention.
Fig. 5 is a flowchart of a vehicle video re-identification method according to an alternative embodiment of the present invention. As shown in fig. 5, features of the input video segment are first extracted by a Convolutional Neural Network block (CNN Block), which is then followed by the Self-Attention mechanism and the Dense Attention mechanism of this alternative embodiment.
Self-Attention can be described as:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
q, K, V respectively represents Query, Key, and Value, and are feature vectors with length d, and in Self-orientation, the feature vectors are obtained by the following formula:
Q = Z W_Q,  K = Z W_K,  V = Z W_V
wherein Z is the feature output by the previous layer, and W is a learnable parameter.
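The formulas above can be sketched directly in PyTorch; this is a plain single-head implementation of the scaled dot-product Self-Attention as written, with W_Q, W_K and W_V as learnable linear projections:

    import math
    import torch
    import torch.nn as nn

    class SelfAttention(nn.Module):
        def __init__(self, d):
            super().__init__()
            self.Wq = nn.Linear(d, d, bias=False)    # W_Q
            self.Wk = nn.Linear(d, d, bias=False)    # W_K
            self.Wv = nn.Linear(d, d, bias=False)    # W_V
            self.d = d

        def forward(self, Z):                        # Z: (N, d) features from the previous layer
            Q, K, V = self.Wq(Z), self.Wk(Z), self.Wv(Z)
            A = torch.softmax(Q @ K.t() / math.sqrt(self.d), dim=-1)
            return A @ V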
The Dense Attention proposed by the alternative embodiment of the present invention differs in how Q, K and V are constructed:
Q = PPool(Z) W_Q,  K = PPool(Z) W_K,  V = PPool(Z) W_V
where PPool denotes lateral (average) pooling; after pooling, each region forms one feature vector.
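Since the original Q/K/V equations are reproduced only as images in the source text, the sketch below is a hedged reading of the description above: Q, K and V are built from part-pooled (PPool) region features rather than directly from Z.

    import torch

    def ppool(fmap, parts=4):
        # lateral (average) pooling: split along height, pool each part to one vector
        chunks = torch.chunk(fmap, parts, dim=-2)
        return torch.stack([c.mean(dim=(-2, -1)) for c in chunks], dim=1)  # (B, parts, C)

    def dense_attention(fmap, Wq, Wk, Wv):
        # inferred construction: Q = PPool(Z) W_Q, K = PPool(Z) W_K, V = PPool(Z) W_V
        Z = ppool(fmap).flatten(0, 1)                # region feature vectors, (B*parts, C)
        Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
        A = torch.softmax(Q @ K.t() / Q.shape[-1] ** 0.5, dim=-1)
        return A @ V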
Fig. 6 is a schematic diagram of spatio-temporal position embedding in the vehicle video re-identification method according to an alternative embodiment of the present invention. As shown in fig. 6, it includes temporal position embedding and spatial position embedding; that is, the alternative embodiment adds the proposed spatio-temporal position embedding (STEP-Emb) before the Dense Attention input, generating embeddings that vary with time and space and are superimposed on the input of each layer.
In addition, fig. 7 is a schematic diagram of different implementation manners of the vehicle video re-identification method according to the alternative embodiment of the present invention. As shown in fig. 7, the alternative embodiment further provides three implementation variants so that a flexible selection can be made for different scenes.
Through the above optional embodiment, the following beneficial effects can be achieved:
(1) compared with the RNN-based method, the method can achieve remarkable performance improvement;
(2) compared with 3D-CNN-based methods, it adds almost no extra computational complexity;
(3) compared with Graph-based methods, it can attend to detailed regions, achieves higher performance, and is simpler to implement;
(4) it can model the interactions among high-level features, automatically attend to features at different scales according to those high-level features, and finally form a multi-scale description feature with a larger amount of information;
(5) it includes spatio-temporal position embedding, which explicitly indicates relative positions within the network's input video clip, thereby improving recognition performance.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the video processing method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
According to an embodiment of the present invention, there is further provided a first apparatus for implementing the video processing method, and fig. 8 is a block diagram of a first video processing apparatus according to an embodiment 2 of the present invention, and as shown in fig. 8, the apparatus includes: a first obtaining module 802, a first extracting module 804, a first processing module 806, a second extracting module 808, and a second processing module 810, which are described below.
A first obtaining module 802, configured to obtain multiple frames of images in a video; a first extracting module 804, connected to the first obtaining module 802, for respectively extracting image features of multiple frames of images; a first processing module 806, connected to the first extracting module 804, configured to process image features of multiple frames of images by using a first attention layer to obtain a first target feature; a second extracting module 808, connected to the first processing module 806, configured to extract representative features corresponding to the multiple frames of images based on image features of the multiple frames of images; and a second processing module 810, connected to the second extracting module 808, configured to process the first target feature and the representative features corresponding to the multiple frames of images by using a second attention layer, so as to obtain description features of the video.
It should be noted that, the first obtaining module 802, the first extracting module 804, the first processing module 806, the second extracting module 808 and the second processing module 810 correspond to steps S202 to S210 in embodiment 1, and a plurality of modules are the same as examples and application scenarios realized by the corresponding steps, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above as part of the apparatus may be run in the computer terminal 10 provided in the first embodiment.
Example 3
According to an embodiment of the present invention, there is further provided a second apparatus for implementing the video processing method, and fig. 9 is a block diagram of a second video processing apparatus according to embodiment 3 of the present invention, and as shown in fig. 9, the second video processing apparatus includes: a first receiving module 902, a third processing module 904 and a fourth processing module 906, which are explained below.
The first receiving module 902 is configured to receive a vehicle video and obtain a plurality of frames of images from the vehicle video; a third processing module 904, connected to the first receiving module 902, configured to obtain description features of the vehicle in the vehicle video based on multiple frames of images, where the description features are obtained by processing, by using the second attention layer, the first target features and the representative features corresponding to the multiple frames of images, respectively, the first target features are obtained by processing, by using the first attention layer, the image features of the multiple frames of images, and the representative features corresponding to the multiple frames of images are obtained by extracting the image features of the multiple frames of images; and a fourth processing module 906, connected to the third processing module 904, configured to match the description features with vehicle features of the target vehicle to obtain a matching result, where the matching result is used to identify whether the vehicle in the vehicle video is the target vehicle.
It should be noted that, the first receiving module 902, the third processing module 904 and the fourth processing module 906 correspond to steps S302 to S306 in embodiment 1, and a plurality of modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above as part of the apparatus may be run in the computer terminal 10 provided in the first embodiment.
Example 4
According to an embodiment of the present invention, there is further provided a third apparatus for implementing the video processing method, and fig. 10 is a block diagram of a third video processing apparatus according to embodiment 4 of the present invention, and as shown in fig. 10, the apparatus includes: a second receiving module 1002, a fifth processing module 1004 and a sixth processing module 1006, which will be described below.
The second receiving module 1002 is configured to receive a person video and obtain a plurality of frames of images from the person video; a fifth processing module 1004, connected to the second receiving module 1002, configured to obtain description features of a person in a person video based on multiple frames of images, where the description features are obtained by processing, by using a second attention layer, a first target feature and representative features respectively corresponding to multiple frames of images, the first target feature is obtained by processing, by using the first attention layer, image features of multiple frames of images, and the representative features respectively corresponding to multiple frames of images are obtained by extracting image features of multiple frames of images; a sixth processing module 1006, connected to the fifth processing module 1004, is configured to match the descriptive characteristics with the characteristics of the target person to obtain a matching result, where the matching result is used to identify whether the person in the person video is the target person.
It should be noted that the second receiving module 1002, the fifth processing module 1004 and the sixth processing module 1006 correspond to steps S402 to S406 in embodiment 1, and the modules are the same as the corresponding steps in terms of implementation examples and application scenarios, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above, as part of the apparatus, may run in the computer terminal 10 provided in the first embodiment.
Example 5
The embodiment of the invention can provide a computer terminal which can be any computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the computer terminal may execute program codes of the following steps in the video processing method of the application program: acquiring a plurality of frame images in a video; respectively extracting the image characteristics of a plurality of frames of images; processing the image characteristics of the multi-frame image by adopting a first attention layer to obtain first target characteristics; based on the image characteristics of the multiple frames of images, extracting representative characteristics corresponding to the multiple frames of images respectively; and processing the first target feature and the representative features respectively corresponding to the multi-frame images by adopting the second attention layer to obtain the description features of the video.
The memory may be configured to store software programs and modules, such as the program instructions/modules corresponding to the video processing method and apparatus in the embodiments of the present invention. The processor executes various functional applications and performs data processing by running the software programs and modules stored in the memory, thereby implementing the video processing method. The memory may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor; such remote memory may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor may call, through a transmission device, the information and application program stored in the memory to perform the following steps: acquiring multiple frames of images from a video; extracting image features of the multiple frames of images respectively; processing the image features of the multiple frames of images with a first attention layer to obtain a first target feature; extracting, based on the image features of the multiple frames of images, representative features respectively corresponding to the multiple frames of images; and processing the first target feature and the representative features with a second attention layer to obtain description features of the video.
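For illustration only, the following is a minimal sketch of the two-stage attention pipeline described above; it is not the patented implementation. The choice of PyTorch, the module names, the embedding dimension, the mean pooling of the first attention output, and the use of raw per-frame features as the representative features are all assumptions made here for brevity.

```python
# Hypothetical sketch of the two-stage attention pipeline; names and
# dimensions are assumptions, not taken from the embodiment.
import torch
import torch.nn as nn

class VideoDescriptor(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # Per-frame feature refinement; a CNN backbone would normally sit here.
        self.frame_encoder = nn.Linear(d_model, d_model)
        # First attention layer: fuses image features across all frames.
        self.first_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Second attention layer: fuses the first target feature with the
        # representative features of the individual frames.
        self.second_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, frames):  # frames: (batch, num_frames, d_model)
        feats = self.frame_encoder(frames)
        # First target feature: self-attention over the frame features,
        # pooled to one vector per video.
        fused, _ = self.first_attn(feats, feats, feats)
        first_target = fused.mean(dim=1, keepdim=True)
        # Representative features: the per-frame features themselves in this
        # sketch; the embodiment derives them from weighted attention features.
        representative = feats
        # Second attention layer: the first target feature queries the
        # representative features to yield the video description feature.
        desc, _ = self.second_attn(first_target, representative, representative)
        return desc.squeeze(1)  # (batch, d_model)
```

Feeding a (batch, num_frames, d_model) tensor of per-frame features produces one description feature per video, which the downstream steps can match against a gallery of target features.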
Optionally, the processor may further execute program code for the following steps: the processing of the first target feature and the representative features respectively corresponding to the multiple frames of images with the second attention layer to obtain the description features of the video includes: embedding temporal features and spatial features of the multiple frames of images into the first target feature to obtain a second target feature; and processing the second target feature and the representative features respectively corresponding to the multiple frames of images with the second attention layer to obtain the description features of the video.
Optionally, the processor may further execute program code for the following steps: before the temporal features and spatial features of the multiple frames of images are embedded into the first target feature to obtain the second target feature, the method further includes: dividing a single frame image of the video into regions to obtain a plurality of partial regions; and performing average pooling on each of the plurality of partial regions to obtain region features respectively representing the partial regions; where the spatial features of the multiple frames of images are determined based on the region features of the plurality of partial regions, and the temporal features of the multiple frames of images are determined based on the time sequence of the video.
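As a sketch of how such embeddings might be built, the snippet below average-pools a frame's feature map over a fixed grid of partial regions to obtain the spatial features, and indexes a learnable table by frame position to obtain the temporal features. The 4x4 grid, the learnable temporal table, and adding the embeddings to the first target feature are assumptions; the embodiment does not fix these choices.

```python
# Hypothetical sketch of the spatial and temporal embeddings; grid size and
# the learnable temporal table are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def spatial_features(frame_map, grid=4):
    """Average-pool one frame's feature map (C, H, W) over a grid x grid
    partition, yielding one region feature per partial region."""
    regions = F.adaptive_avg_pool2d(frame_map, (grid, grid))  # (C, grid, grid)
    return regions.flatten(1).transpose(0, 1)                 # (grid*grid, C)

class TemporalEmbedding(nn.Module):
    """Learnable embedding indexed by a frame's position in the video's
    time sequence."""
    def __init__(self, max_frames, d_model):
        super().__init__()
        self.emb = nn.Embedding(max_frames, d_model)

    def forward(self, num_frames):
        idx = torch.arange(num_frames)
        return self.emb(idx)                                  # (num_frames, d_model)

# The second target feature could then be formed by adding projections of
# these embeddings to the first target feature before the second attention layer.
```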
Optionally, the processor may further execute program code for the following steps: before the second target feature and the representative features respectively corresponding to the multiple frames of images are processed with the second attention layer to obtain the description features of the video, the method further includes: extracting attention features of each corresponding frame image in the multiple frames of images, where the attention features are features whose importance parameters in the corresponding frame image are greater than a preset threshold; obtaining weights of the attention features of the corresponding frame images respectively; and applying the weights of the attention features of the corresponding frame images to the attention features to obtain the representative features of the corresponding frame images.
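One way to realize this step, sketched below, is to score each feature of a frame with an importance parameter, keep the features whose scores exceed a preset threshold as attention features, and aggregate them by softmax weights. The linear scorer, the 0.5 threshold, and the softmax weighting are assumptions rather than the disclosed design.

```python
# Hypothetical sketch of deriving a frame's representative feature; the
# scorer, threshold, and weighting scheme are assumptions.
import torch
import torch.nn as nn

class RepresentativeFeature(nn.Module):
    def __init__(self, d_model, threshold=0.5):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)  # importance parameter per feature
        self.threshold = threshold

    def forward(self, frame_feats):  # frame_feats: (num_feats, d_model)
        scores = torch.sigmoid(self.scorer(frame_feats)).squeeze(-1)
        # Keep only features whose importance exceeds the preset threshold.
        keep = scores > self.threshold
        attn_feats, attn_scores = frame_feats[keep], scores[keep]
        if attn_feats.numel() == 0:  # fall back if no feature passes
            attn_feats, attn_scores = frame_feats, scores
        # Weight each attention feature and aggregate into one vector.
        weights = torch.softmax(attn_scores, dim=0)
        return (weights.unsqueeze(-1) * attn_feats).sum(dim=0)  # (d_model,)
```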
Optionally, the processor may further execute program code for the following steps: after the description features of the video are obtained, the method further includes: identifying a target object in the video based on the description features; and displaying the identified target object.
The processor may also call, through the transmission device, the information and application program stored in the memory to perform the following steps: receiving a vehicle video and acquiring multiple frames of images from the vehicle video; obtaining description features of a vehicle in the vehicle video based on the multiple frames of images, where the description features are obtained by using the second attention layer to process the first target feature and the representative features respectively corresponding to the multiple frames of images, the first target feature is obtained by using the first attention layer to process the image features of the multiple frames of images, and the representative features are extracted based on the image features of the multiple frames of images; and matching the description features against vehicle features of a target vehicle to obtain a matching result, where the matching result is used to identify whether the vehicle in the vehicle video is the target vehicle.
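A minimal sketch of the matching step follows, assuming cosine similarity and a fixed decision threshold of 0.8; the embodiment does not specify the similarity measure. The same routine would apply to the person-video case described below.

```python
# Hypothetical matching sketch; the similarity measure and threshold
# are assumptions.
import torch
import torch.nn.functional as F

def is_target(description: torch.Tensor, target: torch.Tensor,
              thresh: float = 0.8) -> bool:
    """Return True if the object in the video is judged to be the target,
    based on the similarity of their feature vectors."""
    sim = F.cosine_similarity(description, target, dim=0)
    return bool(sim > thresh)
```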
Optionally, the processor may further execute program code for the following steps: outputting the matching result in a predetermined manner, where the predetermined manner includes at least one of: displaying on a display screen, printing by a printing device, and alarming by an alarm device.
The processor may likewise call, through the transmission device, the information and application program stored in the memory to perform the following steps: receiving a person video and acquiring multiple frames of images from the person video; obtaining description features of a person in the person video based on the multiple frames of images, where the description features are obtained by using the second attention layer to process the first target feature and the representative features respectively corresponding to the multiple frames of images, the first target feature is obtained by using the first attention layer to process the image features of the multiple frames of images, and the representative features are extracted based on the image features of the multiple frames of images; and matching the description features against person features of a target person to obtain a matching result, where the matching result is used to identify whether the person in the person video is the target person.
The embodiment of the present invention provides a video processing scheme. By acquiring multiple frames of images of a vehicle video, description features of the vehicle in the vehicle video are obtained, and these description features can then be matched against the vehicle features of a target vehicle to determine whether the vehicle in the vehicle video is the target vehicle. Because the description features are obtained by using the second attention layer, based on the multiple frames of images, to process the first target feature and the representative features respectively corresponding to the multiple frames of images, the resulting description features of the vehicle are more accurate and detailed, which solves the technical problem of low recognition accuracy in video recognition in the related art.
Those skilled in the art will understand that the structure shown in the drawings is only illustrative; the computer terminal may also be a terminal device such as a smartphone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD. Fig. 11 illustrates a structure of such an electronic device. For example, the computer terminal may include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 11, or may have a configuration different from that shown in fig. 11.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with a terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Example 6
An embodiment of the present invention further provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store program code for executing the video processing method provided in the first embodiment.
Optionally, in this embodiment, the storage medium may be located in any computer terminal of a group of computer terminals in a computer network, or in any mobile terminal of a group of mobile terminals.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring multiple frames of images from a video; extracting image features of the multiple frames of images respectively; processing the image features of the multiple frames of images with a first attention layer to obtain a first target feature; extracting, based on the image features of the multiple frames of images, representative features respectively corresponding to the multiple frames of images; and processing the first target feature and the representative features with a second attention layer to obtain description features of the video.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: the processing of the first target feature and the representative features respectively corresponding to the multiple frames of images with the second attention layer to obtain the description features of the video includes: embedding temporal features and spatial features of the multiple frames of images into the first target feature to obtain a second target feature; and processing the second target feature and the representative features with the second attention layer to obtain the description features of the video.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: before the temporal features and spatial features of the multiple frames of images are embedded into the first target feature to obtain the second target feature, the method further includes: dividing a single frame image of the video into regions to obtain a plurality of partial regions; and performing average pooling on each of the plurality of partial regions to obtain region features respectively representing the partial regions; where the spatial features of the multiple frames of images are determined based on the region features of the plurality of partial regions, and the temporal features of the multiple frames of images are determined based on the time sequence of the video.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: before the second target feature and the representative features respectively corresponding to the multiple frames of images are processed with the second attention layer to obtain the description features of the video, the method further includes: extracting attention features of each corresponding frame image in the multiple frames of images, where the attention features are features whose importance parameters in the corresponding frame image are greater than a preset threshold; obtaining weights of the attention features of the corresponding frame images respectively; and applying the weights to the attention features to obtain the representative features of the corresponding frame images.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: after the description features of the video are obtained, the method further includes: identifying a target object in the video based on the description features; and displaying the identified target object.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: receiving a vehicle video and acquiring multiple frames of images from the vehicle video; obtaining description features of a vehicle in the vehicle video based on the multiple frames of images, where the description features are obtained by using the second attention layer to process the first target feature and the representative features respectively corresponding to the multiple frames of images, the first target feature is obtained by using the first attention layer to process the image features of the multiple frames of images, and the representative features are extracted based on the image features of the multiple frames of images; and matching the description features against vehicle features of a target vehicle to obtain a matching result, where the matching result is used to identify whether the vehicle in the vehicle video is the target vehicle.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: outputting the matching result in a predetermined manner, where the predetermined manner includes at least one of: displaying on a display screen, printing by a printing device, and alarming by an alarm device.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: receiving a person video and acquiring multiple frames of images from the person video; obtaining description features of a person in the person video based on the multiple frames of images, where the description features are obtained by using the second attention layer to process the first target feature and the representative features respectively corresponding to the multiple frames of images, the first target feature is obtained by using the first attention layer to process the image features of the multiple frames of images, and the representative features are extracted based on the image features of the multiple frames of images; and matching the description features against person features of a target person to obtain a matching result, where the matching result is used to identify whether the person in the person video is the target person.
The serial numbers of the above embodiments of the present invention are merely for description and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function; in an actual implementation there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or take another form.
Units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also fall within the protection scope of the present invention.

Claims (14)

1. A video processing method, comprising:
acquiring multiple frames of images from a video;
extracting image features of the multiple frames of images respectively;
processing the image features of the multiple frames of images by using a first attention layer to obtain a first target feature;
extracting representative features respectively corresponding to the multiple frames of images based on the image features of the multiple frames of images; and
processing the first target feature and the representative features respectively corresponding to the multiple frames of images by using a second attention layer to obtain description features of the video.
2. The method according to claim 1, wherein the processing the first target feature and the representative features respectively corresponding to the multiple frames of images by using the second attention layer to obtain the description features of the video comprises:
embedding temporal features and spatial features of the multiple frames of images into the first target feature to obtain a second target feature; and
processing the second target feature and the representative features respectively corresponding to the multiple frames of images by using the second attention layer to obtain the description features of the video.
3. The method according to claim 2, wherein before the temporal features and the spatial features of the multiple frames of images are embedded into the first target feature to obtain the second target feature, the method further comprises:
dividing a single frame image of the video into regions to obtain a plurality of partial regions; and
performing average pooling on the plurality of partial regions respectively to obtain region features respectively representing the plurality of partial regions;
wherein the spatial features of the multiple frames of images are determined based on the region features of the plurality of partial regions, and the temporal features of the multiple frames of images are determined based on the time sequence of the video.
4. The method according to claim 2, wherein before the second target feature and the representative features respectively corresponding to the multiple frames of images are processed by using the second attention layer to obtain the description features of the video, the method further comprises:
extracting attention features of corresponding frame images in the multiple frames of images respectively, wherein the attention features are features whose importance parameters in the corresponding frame images are greater than a preset threshold;
obtaining weights of the attention features of the corresponding frame images respectively; and
applying the weights of the attention features of the corresponding frame images to the attention features to obtain the representative features of the corresponding frame images.
5. The method according to any one of claims 1 to 4, further comprising, after the description features of the video are obtained:
identifying a target object in the video based on the description features; and
displaying the identified target object.
6. A video processing method, comprising:
receiving a vehicle video, and acquiring multiple frames of images from the vehicle video;
obtaining description features of a vehicle in the vehicle video based on the multiple frames of images, wherein the description features are obtained by processing a first target feature and representative features respectively corresponding to the multiple frames of images through a second attention layer, the first target feature is obtained by processing image features of the multiple frames of images through a first attention layer, and the representative features respectively corresponding to the multiple frames of images are extracted based on the image features of the multiple frames of images; and
matching the description features with vehicle features of a target vehicle to obtain a matching result, wherein the matching result is used for identifying whether the vehicle in the vehicle video is the target vehicle.
7. The method of claim 6, further comprising:
outputting the matching result in a predetermined manner, wherein the predetermined manner comprises at least one of: displaying on a display screen, printing by a printing device, and alarming by an alarm device.
8. A video processing method, comprising:
receiving a person video, and acquiring multiple frames of images from the person video;
obtaining description features of a person in the person video based on the multiple frames of images, wherein the description features are obtained by processing a first target feature and representative features respectively corresponding to the multiple frames of images through a second attention layer, the first target feature is obtained by processing image features of the multiple frames of images through a first attention layer, and the representative features respectively corresponding to the multiple frames of images are extracted based on the image features of the multiple frames of images; and
matching the description features with person features of a target person to obtain a matching result, wherein the matching result is used for identifying whether the person in the person video is the target person.
9. A video processing apparatus, comprising:
a first acquisition module, configured to acquire multiple frames of images from a video;
a first extraction module, configured to extract image features of the multiple frames of images respectively;
a first processing module, configured to process the image features of the multiple frames of images by using a first attention layer to obtain a first target feature;
a second extraction module, configured to extract representative features respectively corresponding to the multiple frames of images based on the image features of the multiple frames of images; and
a second processing module, configured to process the first target feature and the representative features respectively corresponding to the multiple frames of images by using a second attention layer to obtain description features of the video.
10. A video processing apparatus, comprising:
a first receiving module, configured to receive a vehicle video and acquire multiple frames of images from the vehicle video;
a third processing module, configured to obtain description features of a vehicle in the vehicle video based on the multiple frames of images, wherein the description features are obtained by processing a first target feature and representative features respectively corresponding to the multiple frames of images through a second attention layer, the first target feature is obtained by processing image features of the multiple frames of images through a first attention layer, and the representative features respectively corresponding to the multiple frames of images are extracted based on the image features of the multiple frames of images; and
a fourth processing module, configured to match the description features with vehicle features of a target vehicle to obtain a matching result, wherein the matching result is used for identifying whether the vehicle in the vehicle video is the target vehicle.
11. A video processing apparatus, comprising:
a second receiving module, configured to receive a person video and acquire multiple frames of images from the person video;
a fifth processing module, configured to obtain description features of a person in the person video based on the multiple frames of images, wherein the description features are obtained by processing a first target feature and representative features respectively corresponding to the multiple frames of images through a second attention layer, the first target feature is obtained by processing image features of the multiple frames of images through a first attention layer, and the representative features respectively corresponding to the multiple frames of images are extracted based on the image features of the multiple frames of images; and
a sixth processing module, configured to match the description features with person features of a target person to obtain a matching result, wherein the matching result is used for identifying whether the person in the person video is the target person.
12. A computer device, comprising: a memory and a processor, wherein
the memory stores a computer program; and
the processor is configured to execute the computer program stored in the memory, the computer program, when executed, causing the processor to perform the video processing method of any one of claims 1 to 8.
13. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video processing method of any one of claims 1 to 8.
14. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the video processing method of any one of claims 1 to 8.
CN202111089769.1A 2021-09-17 2021-09-17 Video processing method, video processing device, computer equipment and computer-readable storage medium, Pending, CN114005053A (en)

Priority Applications (1)

Application Number: CN202111089769.1A; Priority date: 2021-09-17; Filing date: 2021-09-17; Title: Video processing method, video processing device, computer equipment and computer-readable storage medium

Publications (1)

Publication Number: CN114005053A; Publication date: 2022-02-01

Family

ID: 79921509

Family Applications (1)

Application Number: CN202111089769.1A; Title: Video processing method, video processing device, computer equipment and computer-readable storage medium; Priority date: 2021-09-17; Filing date: 2021-09-17

Country Status (1)

Country: CN; Link: CN (1) CN114005053A (en)

Cited By (1)

* Cited by examiner, † Cited by third party

CN116189028A (en) * Priority date: 2022-11-29; Publication date: 2023-05-30; Assignee: Beijing Baidu Netcom Science and Technology Co., Ltd.; Title: Image recognition method, device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20240204
Address after: Room 553, 5th Floor, Building 3, No. 969 Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province, 311121
Applicant after: Hangzhou Alibaba Cloud Feitian Information Technology Co.,Ltd.
Country or region after: China
Address before: 310023 Room 516, Floor 5, Building 3, No. 969 Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province
Applicant before: Alibaba Dharma Institute (Hangzhou) Technology Co.,Ltd.
Country or region before: China