CN115100725A - Object recognition method, object recognition apparatus, and computer storage medium - Google Patents

Object recognition method, object recognition apparatus, and computer storage medium Download PDF

Info

Publication number
CN115100725A
Authority
CN
China
Prior art keywords
video
features
processed
feature
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211014858.4A
Other languages
Chinese (zh)
Other versions
CN115100725B (en)
Inventor
廖紫嫣
邸德宁
张姜
郝敬松
朱树磊
殷俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202211014858.4A
Publication of CN115100725A
Application granted
Publication of CN115100725B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The application discloses a target recognition method, a target recognition apparatus, and a computer storage medium. The target recognition method includes: clustering all video frames of a video to be processed based on features of at least one modality, so as to divide the video to be processed into a plurality of to-be-processed sub-videos; encoding original video features of multiple modalities of each to-be-processed sub-video to obtain multi-modal coding features of each to-be-processed sub-video; constructing a graph network based on the multi-modal coding features of each to-be-processed sub-video and the neighboring video features; and fusing the multi-modal coding features of each to-be-processed sub-video with the neighboring video features by using the graph network to obtain final fusion features, and recognizing the target object based on the final fusion features. Through a new modeling approach, the target recognition method can realize adaptive fusion of information at three different levels, namely the neighbor level, the video level, and the multi-modal level, thereby improving the feature recognition effect.

Description

Object recognition method, object recognition apparatus, and computer storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a target recognition method, a target recognition apparatus, and a computer storage medium.
Background
With the year-on-year growth in the scale of deployed surveillance cameras, identity recognition technology has been widely applied. Traditional person identification is realized mainly through face recognition. However, in real unconstrained scenes, problems such as low face resolution and face blurring that often occur in structured scenes, as well as large face deflection angles and face occlusion in non-cooperative settings, mean that recognition using the face modality alone is not ideal. There are also methods that recognize people through human-body features or voice features. However, human-body features depend on the clothing, posture, height, etc. of the target person, and voice features are susceptible to physical condition, age, emotion, etc., for example changes in the target person's vocal cords due to a cold, or interfering noise in the environment. Therefore, using the human-body modality or the voice modality alone for recognition also has its respective limitations.
Disclosure of Invention
The application provides a target recognition method, a target recognition device and a computer storage medium.
One technical solution adopted by the present application is to provide a target identification method, including:
acquiring a video to be processed and its neighboring video features, wherein the neighboring video features are determined based on the features of a plurality of neighboring videos, the neighboring videos are videos having features of at least one of the different modalities, and the neighboring videos are selected according to the similarity between the video features of the videos in a video library and the video features of the video to be processed;
clustering all video frames of the video to be processed based on the features of at least one modality, so as to divide the video to be processed into a plurality of to-be-processed sub-videos;
encoding original video characteristics of multiple modes of each to-be-processed sub video to obtain multi-mode encoding characteristics of each to-be-processed sub video;
constructing a graph network based on the multi-modal coding features of each to-be-processed sub-video and the neighbor video features;
and fusing the multi-modal coding features of each to-be-processed sub-video and the neighboring video features by using the graph network to obtain final fusion features, and identifying the target object based on the final fusion features.
Wherein the constructing a graph network based on the multi-modal coding features of each to-be-processed sub-video and the neighboring video features comprises:
determining a feature matrix composed of the neighboring video features and the multi-modal coding features together;
acquiring an adjacency matrix corresponding to the feature matrix; the adjacency matrix represents the connection relations used for fusing different features in the feature matrix;
constructing the graph network based on the feature matrix and the adjacency matrix;
the fusing of the multi-modal coding features of each to-be-processed sub-video and the neighboring video features by using the graph network to obtain final fusion features includes:
obtaining the fusion features of the video to be processed by aggregating the feature matrix and the adjacency matrix.
Wherein, the obtaining of the adjacency matrix corresponding to the feature matrix includes:
determining a connection weight for fusing every two of the plurality of features in the feature matrix;
and obtaining an adjacency matrix formed by the determined connection weights according to the determined connection weights.
Wherein the determining of the connection weight for fusing every two of the plurality of features in the feature matrix includes:
acquiring a first distance between multi-modal coding features of different sub-videos to be processed, and determining a first connection weight between the multi-modal coding features of the different sub-videos to be processed based on the first distance and a preset exponential function;
acquiring a second distance between the neighbor video feature and the multi-modal coding feature of the sub-video to be processed, and determining a second connection weight between the neighbor video feature and the multi-modal coding feature of the sub-video to be processed based on the second distance, the preset exponential function and a balance factor.
Wherein the obtaining of the fusion feature of the video to be processed by aggregating the feature matrix and the adjacency matrix includes:
aggregating the feature matrix and the adjacency matrix to obtain the graph features of the video to be processed;
and performing feature fusion on the graph features of the video to be processed by using a preset feature updating mechanism to obtain fusion features of the video to be processed.
The encoding of the original video features of multiple modes of each to-be-processed sub-video to obtain the multi-mode encoding features of each to-be-processed sub-video includes:
performing first pooling operation on the original video characteristics of at least one mode in each to-be-processed sub video to obtain first pooled video characteristics;
performing second pooling operation on original video features of other modalities in each to-be-processed sub-video to obtain second pooled video features;
stitching the first pooled video features with the second pooled video features;
and coding the spliced video features to obtain the multi-modal coding features of each to-be-processed sub video.
Wherein the first pooling operation is an average pooling operation and the second pooling operation is a global pooling operation.
Wherein the at least one modality is a face modality;
the clustering all video frames of the video to be processed based on the features of the at least one modality so as to make the video to be processed into a plurality of sub-videos to be processed comprises:
dividing the video to be processed into a plurality of to-be-processed sub-videos according to face quality, wherein the to-be-processed sub-videos are defined as a high-quality region video and a low-quality region video according to face quality.
The target identification method further comprises the following steps:
respectively calculating similarity values between each feature of the multiple features of the video to be processed and each feature of a preset video to obtain multiple similarity values of each feature of the multiple features;
arranging the plurality of similarity values of each of the plurality of features in order of magnitude, and taking the preset video corresponding to the similarity value at the target position as a neighboring video;
and extracting the features of different modes in each neighboring video, and taking the extracted features as the neighboring video features corresponding to the video to be processed.
Another technical solution adopted by the present application is to provide an object recognition apparatus,
the object recognition apparatus includes: a video acquisition module, a video clustering module, a feature encoding module, a feature fusion module, and a target identification module; wherein:
the video acquisition module is used for acquiring a video to be processed and its neighboring video features, wherein the neighboring video features are determined based on the features of a plurality of neighboring videos, the neighboring videos are videos having features of at least one of the different modalities, and the neighboring videos are selected according to the similarity between the video features of the videos in a video library and the video features of the video to be processed;
the video clustering module is used for clustering all video frames of the video to be processed based on the characteristics of at least one mode, so that the video to be processed is divided into a plurality of sub-videos to be processed;
the feature coding module is used for coding original video features of multiple modes of each to-be-processed sub video to obtain multi-mode coding features of each to-be-processed sub video;
the feature fusion module is used for constructing a graph network based on the multi-modal coding features of each to-be-processed sub video and the features of the adjacent videos;
the feature fusion module is further configured to fuse the multi-modal coding features of each to-be-processed sub-video and the neighboring video features by using the graph network to obtain final fusion features;
and the target identification module is used for identifying the target object based on the final fusion characteristics.
Another technical solution adopted by the present application is to provide an object recognition apparatus, which includes a memory and a processor coupled to the memory;
wherein the memory is configured to store program data and the processor is configured to execute the program data to implement the object recognition method as described above.
Another technical solution adopted by the present application is to provide a computer storage medium for storing program data, which when executed by a computer, is used to implement the object recognition method as described above.
The beneficial effects of this application are as follows: the target recognition device acquires a video to be processed and its neighboring video features, wherein the neighboring video features are determined based on the features of a plurality of neighboring videos, the neighboring videos are videos having features of at least one of the different modalities, and the neighboring videos are selected according to the similarity between the video features of the videos in a video library and the video features of the video to be processed; clusters all video frames of the video to be processed based on the features of at least one modality, so as to divide the video to be processed into a plurality of to-be-processed sub-videos; encodes the original video features of multiple modalities of each to-be-processed sub-video to obtain the multi-modal coding features of each to-be-processed sub-video; constructs a graph network based on the multi-modal coding features of each to-be-processed sub-video and the neighboring video features; and fuses the multi-modal coding features of each to-be-processed sub-video with the neighboring video features by using the graph network to obtain final fusion features, and identifies the target object based on the final fusion features. Through a new modeling approach, the target recognition method can realize adaptive fusion of information at three different levels, namely the neighbor level, the video level, and the multi-modal level, thereby improving the feature recognition effect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a target identification method provided herein;
FIG. 2 is a schematic diagram of a general flow of a target identification method provided herein;
FIG. 3 is a detailed flowchart of step S13 of the object recognition method shown in FIG. 1;
fig. 4 is a specific flowchart of step S14 of the object identification method provided in the present application;
FIG. 5 is a schematic diagram illustrating an embodiment of an object recognition device;
FIG. 6 is a schematic diagram of another embodiment of an object recognition device provided in the present application;
FIG. 7 is a schematic structural diagram of an embodiment of a computer storage medium provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
Multi-modal video person recognition confirms the identity of a target person in a video by utilizing various kinds of modal information of the target person, such as the face, body, and voice. Because different modal information is complementary, multi-modal fusion can solve complex-scene identity recognition problems that a single biometric modality cannot. In addition, compared with the picture data commonly used in traditional identity recognition tasks, video information has much richer content, so fusing multi-modal information with video information yields better reliability and recognition performance.
Based on this technical principle of multi-modal video person recognition, the modeling approach and target recognition method provided by the present application are described with reference to FIG. 1 and FIG. 2: FIG. 1 is a schematic flowchart of an embodiment of the target recognition method provided by the present application, and FIG. 2 is a schematic diagram of the overall flow of the target recognition method provided by the present application.
As shown in fig. 1, the target identification method according to the embodiment of the present application includes the following steps:
step S11: the method comprises the steps of obtaining a video to be processed and neighbor video characteristics of the video to be processed, wherein the neighbor video characteristics are determined based on the characteristics of a plurality of neighbor videos, the neighbor videos are videos with characteristics of at least one mode in different modes, and the neighbor videos are selected according to the similarity between the video characteristics of the videos in a video library and the video characteristics of the video to be processed.
In the embodiment of the application, the video to be processed can be a monitoring video acquired by a video monitoring system, such as a traffic monitoring system, in real time, or can be a monitoring video stored in a memory, and is used for identifying and tracking a target in the monitoring video.
Further, in order to improve target recognition efficiency, the target recognition device may extract a plurality of images from the monitoring video to form the video to be processed. Specifically, the extraction may be: extracting a plurality of images from the monitoring video at equal intervals and taking the extracted images as the images of the video to be processed. The equal interval here may be a preset time period or a preset number of images, and is not limited here.
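As a rough illustration of this equal-interval extraction, the following minimal Python sketch samples every N-th frame of a clip; the function name and the default step are illustrative assumptions, not part of the patent.

```python
def sample_frames(frames, step=5):
    """Take every `step`-th frame of a surveillance clip as the to-be-processed video.

    `frames` is a list of decoded frames; `step` can equally be derived from a preset
    time period (e.g. step = fps * period_seconds) or a preset image count.
    """
    return frames[::step]
```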
Furthermore, the target recognition device uses a convolutional neural network to extract various visual features from each frame of the video to be processed, and uses a long short-term memory (LSTM) network to extract voice features. The various visual features include features of different modalities, such as face features, head features, body features, and other modality features.
Specifically, the target recognition device performs a weighted summation over the features of the same modality extracted from the individual images, obtaining one feature for that modality; repeating this for each modality yields a plurality of features corresponding to the different modalities, which are taken as the multi-modal features of the video to be processed.
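A minimal sketch of this per-modality aggregation is shown below, assuming NumPy arrays of per-frame features; the use of per-frame quality scores as the weights is an assumption made only for illustration.

```python
import numpy as np

def aggregate_modality_features(frame_features, weights=None):
    """Fuse the per-frame features of one modality into a single video-level feature.

    frame_features: (num_frames, dim) array of features of the same modality.
    weights: optional (num_frames,) array, e.g. per-frame quality scores; uniform if None.
    Returns the weighted sum (normalised to a weighted average) of the frame features.
    """
    feats = np.asarray(frame_features, dtype=np.float32)
    if weights is None:
        weights = np.ones(len(feats), dtype=np.float32)
    weights = np.asarray(weights, dtype=np.float32)
    return (weights[:, None] * feats).sum(axis=0) / weights.sum()
```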
Specifically, two or more different modalities characterize the target object. In the embodiment of the present application, the different modalities may be divided based on different forms of the target object: for example, when the target object is a person, information such as voice, body/motion, gait, and clothing may each be treated as one modality. The different modalities may also be divided based on different components or parts of the target object: for example, when the target object is a person, information about the face, hands, body, head, shoulders, legs, and so on may each be treated as one modality. Moreover, the different modalities in the embodiment of the present application may include any combination of the modalities divided according to different methods; for example, when the target object is a person, the modalities may include the face, head, body, voice, and so on of the target object.
Here, a corresponding modality feature may be extracted for each modality, for example, a face feature, a head feature, a body feature, a voice feature, and the like of the target object are extracted.
After determining the video to be processed, the target recognition device further needs to determine neighboring videos according to the video to be processed, i.e., the neighbor selection step shown in FIG. 2.
Specifically, after extracting a plurality of features of different modalities of the target object in the video to be processed, the target recognition device calculates the similarity values between each of these features and each feature of the preset videos, obtaining a plurality of similarity values for each feature. It then arranges the plurality of similarity values of each feature in order of magnitude, takes the preset videos corresponding to the similarity values at the target positions as neighboring videos, and finally extracts the features of different modalities in each neighboring video, taking the extracted features as the neighboring video features corresponding to the video to be processed.
Here, the neighboring video feature is a feature determined based on features of a plurality of neighboring videos, and the neighboring video is a video having a feature of at least one of different modalities of the video to be processed.
Specifically, after a plurality of features of different modalities of a target object in a video to be processed are extracted, a preset video meeting requirements is selected from a preset database based on the different modalities of the extracted target object.
For example, if three features of three modalities, namely a face, a human body and a voice, of the target object in the video to be processed are extracted, a preset video with the features of any one or more of the three modalities, namely the face, the human body and the voice, is extracted from a preset database.
For example, if the preset video 1 has features of one modality, i.e., a human face, the preset video 2 has features of two modalities, i.e., a human face and a human body, the preset video 3 has features of four modalities, i.e., a human face, a human body, a voice and a head, and the preset video 4 has features of one modality, i.e., a head, the preset video 1, the preset video 2 and the preset video 3 may be extracted.
After the preset video is determined, similarity values between each feature of the multiple features of the video to be processed and each feature of the preset video are respectively calculated, and the multiple similarity values of each feature of the multiple features of the video to be processed are obtained.
Then, according to the calculated similarity values, the similarity values of each of the plurality of features are arranged, the specific arrangement mode can be that the similarity values are sorted from small to large, or the similarity values are sorted from large to small, and the preset video corresponding to the similarity value arranged at the target position is taken as the neighboring video.
It should be noted that the target position may be a preset position, may be a position determined by a preset threshold, or may be a position determined according to actual application.
Specifically, the neighboring videos can be screened by a K-nearest-neighbor method. The specific screening method is as follows: a search is performed for each of the plurality of modality features of the video to be processed, and the intersection of the top K2 results of each search is taken as the K neighboring videos of the video to be processed.
Here, K2 may be preset (that is, it corresponds to the target position), whereas the number K of neighboring videos finally determined is not fixed when different modalities are searched, and needs to be determined according to the actual intersection result.
Further, to make the intersection of results easier to understand, take the search results {2, 3, 4} for modality feature A and {3, 4, 5} for modality feature B as an example, where each number identifies the video to which the retrieved feature belongs. The intersection of the results is {3, 4}; that is, for modality A the retained results are {3, 4}, and for modality B the retained results are also {3, 4}.
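The per-modality top-K2 search and the intersection described above might be sketched as follows; cosine similarity, the gallery layout, and the value of K2 are assumptions used only for illustration.

```python
import numpy as np

def cosine_similarity(query, gallery):
    """Cosine similarity between one query vector and each row of a gallery matrix."""
    q = query / (np.linalg.norm(query) + 1e-12)
    g = gallery / (np.linalg.norm(gallery, axis=1, keepdims=True) + 1e-12)
    return g @ q

def select_neighbor_videos(query_feats, gallery_feats, k2=10):
    """Neighboring videos = intersection of the per-modality top-K2 retrieval results.

    query_feats:   dict  modality -> (dim,) feature of the video to be processed
    gallery_feats: dict  modality -> (num_videos, dim) features of the preset videos
    Returns indices of the preset videos kept as neighbors (the final count K may be < K2).
    """
    candidate_sets = []
    for modality, q in query_feats.items():
        sims = cosine_similarity(q, gallery_feats[modality])
        topk = np.argsort(-sims)[:k2]              # sort similarities from large to small
        candidate_sets.append(set(topk.tolist()))
    neighbors = set.intersection(*candidate_sets) if candidate_sets else set()
    return sorted(neighbors)
```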
It is worth noting that determining the neighboring videos in the K-nearest-neighbor manner is only one possible way; neighboring videos may also be determined in other manners. The purpose of determining the neighboring videos is to extract neighboring video features from them and to fuse the plurality of features of the video to be processed based on these neighboring video features, so that the fused features have higher discrimination and robustness and the quality of the fused features is improved.
It should be noted that, since face features have the best recognition effect and robustness compared with other modalities, the target recognition apparatus of the embodiment of the present application may perform neighbor selection using the face modality. The specific operation is as follows: the target recognition device extracts the features of the image with the highest face quality in the video to be processed, performs a similarity comparison, and screens out the top K2 neighboring videos with the highest similarity to the video to be processed, which is not repeated here. By introducing neighbor-level information, the method can use neighbor subspace information to help the video to be processed repair its feature expression, thereby improving its discrimination capability in the global space and further improving the overall recognition effect.
Face quality can be evaluated through an image quality score, which can characterize the sharpness of an image and/or the proportion of the target object region that is occluded: if the image is sharper and/or a smaller proportion of the target object region is occluded, the calculated image quality score is higher; if the image is more blurred and/or a larger proportion of the target object region is occluded, the calculated image quality score is lower.
Here, when the target object is a person, the image quality scores of the respective images may specifically include, but are not limited to: the human face quality scores of the human face images in the images and the human body quality scores of the human body images in the images.
Taking the face quality score as an example, the target object region can be the region of the image where the face of the target object is located; if the face is sharper and/or more complete, the face quality score calculated for the image is higher; if the face is more blurred, the face deflection angle is larger, and/or more of the face is occluded, the face quality score calculated for the image is lower.
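The patent does not give a concrete scoring formula; the sketch below merely illustrates combining a sharpness term with an occlusion term as described above, and both the measures and the weights are assumptions.

```python
import cv2

def face_quality_score(face_crop_gray, occluded_ratio, w_sharp=0.7, w_occ=0.3):
    """Toy quality score: sharper faces and smaller occluded proportions score higher."""
    sharpness = cv2.Laplacian(face_crop_gray, cv2.CV_64F).var()   # variance of the Laplacian
    sharpness = min(sharpness / 1000.0, 1.0)                      # crude normalisation
    return w_sharp * sharpness + w_occ * (1.0 - occluded_ratio)
```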
It should be noted that, after the to-be-processed video and the neighboring video are obtained, the to-be-processed video and the neighboring video may undergo the same processing procedure, that is, all feature processing operations in this embodiment, so as to obtain the multi-modal encoding feature of the to-be-processed video and the neighboring video feature of the neighboring video. In the description of the present embodiment, the feature processing of the video to be processed is taken as an example, and the feature processing is also applicable to the neighboring video.
Step S12: and clustering all video frames of the video to be processed based on the characteristics of at least one mode, so that the video to be processed is divided into a plurality of sub videos to be processed.
Further, in order to improve the accuracy of video feature encoding, the target recognition device in the embodiment of the present application may use at least one modality feature as an encoding factor for video feature encoding, that is, partition the video to be processed using that modality feature, and then adopt different encoding modes for the partitioned to-be-processed sub-videos according to differences in the modality feature, so as to improve the specificity and accuracy of the video feature encoding.
Specifically, the target recognition device may cluster all video frames of the video to be processed using the features of the face modality, so as to merge video frames with similar face features into one to-be-processed sub-video. In addition, the target recognition device can also calculate the face quality of each video frame from its face modality features, and then divide the video to be processed according to face quality, for example into a high-quality region video and a low-quality region video.
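A minimal sketch of such a face-quality split is given below; the threshold value is a placeholder, and clustering by face-feature similarity could be substituted for the thresholding without changing the overall flow.

```python
def split_by_face_quality(frames, face_quality_scores, threshold=0.6):
    """Divide the frames of the to-be-processed video into a high-quality region video
    and a low-quality region video according to a face-quality threshold."""
    high = [f for f, q in zip(frames, face_quality_scores) if q >= threshold]
    low = [f for f, q in zip(frames, face_quality_scores) if q < threshold]
    return high, low
```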
Step S13: and encoding the original video characteristics of the multiple modes of each to-be-processed sub video to obtain the multi-mode encoding characteristics of each to-be-processed sub video.
In the embodiment of the present application, because the similarity between adjacent frames of a video is high while frames with a longer interval differ more, it is desirable to extract representative features of the video from the original video features in some way, so as to reduce redundancy while ensuring the integrity and diversity of the video information.
Therefore, the embodiment of the present application provides a way of encoding a video to achieve the above effect. Specifically, the target recognition device may encode the original video features of the multiple modalities of each to-be-processed sub-video, and then fuse the encoded original video features to obtain the multi-modal coding features of each to-be-processed sub-video.
There are various ways to fuse the encoded original video features, including but not limited to: taking the average of the original video features of the multiple modalities of each to-be-processed sub-video as the fused multi-modal coding feature; taking the median of the original video features of the multiple modalities of each to-be-processed sub-video as the fused multi-modal coding feature; or taking the mode of the original video features of the multiple modalities of each to-be-processed sub-video as the fused multi-modal coding feature, etc. In the embodiment of the present application, the video feature fusion manner is not particularly limited.
Specifically, face quality is an important factor for measuring the recognition effect, and there is a strong correlation between face quality and feature similarity. Therefore, the specific operation of the video feature encoding process of the embodiment of the present application may be: dividing the video into a high-quality region and a low-quality region according to a certain face quality threshold, and performing an average pooling operation on the video frames of the two regions and on all video frames of the video, obtaining K1 (K1 = 3) video coding features representing different qualities. Features of other modalities are not as discriminative as the face, so this operation is performed only on face features; for the other modalities, a global pooling operation is performed directly on all frames.
It should be noted that, in other embodiments, the target recognition apparatus may also use other modality features as the encoding factor of the video encoding, which is not listed here.
Referring to fig. 3, fig. 3 is a schematic flowchart of step S13 of the target identification method shown in fig. 1.
As shown in fig. 3, the target identification method according to the embodiment of the present application includes the following steps:
step S131: and performing first pooling operation on the original video features of at least one mode in each to-be-processed sub video to obtain first pooled video features.
In the embodiment of the present application, the target recognition apparatus performs an average pooling process on the original video features of at least one modality as the encoding factor, thereby obtaining a first pooled video feature.
Specifically, the target recognition device may further divide the video to be processed according to the modality, for example, when the encoding factor is a face modality, the target recognition device may divide the video frames higher than or equal to the face quality threshold into high-quality area videos and divide the video frames lower than the face quality threshold into low-quality area videos according to the face quality threshold. At this time, the target recognition apparatus may obtain three videos regarding the face modality: pending video, high quality region video, and low quality region video.
Further, the target recognition device performs an average pooling operation on each of the three videos related to the face modality, obtaining three video coding features representing different face qualities. By dividing the videos by quality in this way, videos of different qualities are average-pooled separately, which improves their respective feature representations.
Step S132: and performing second pooling operation on the original video characteristics of other modalities in each to-be-processed sub video to obtain second pooled video characteristics.
In the embodiment of the application, because the discrimination of the features of other modalities is inferior to that of the human face modality, the target recognition device can directly perform global pooling operation on the original video features of the other modalities of the video to be processed to obtain the second pooled video features.
It should be noted that, in other embodiments, the first pooling operation and the second pooling operation may also be other alternative pooling solutions or combination solutions, which are not listed here.
Step S133: and splicing the first pooled video features with the second pooled video features.
Step S134: and coding the spliced video features to obtain the multi-modal coding features of each sub-video to be processed.
In the embodiment of the present application, the target recognition apparatus obtains a plurality of pooled video features representing different face qualities and pooled video features of other modalities through steps S131 and S132, and then performs a stitching operation on the above pooled video features. Then, the target recognition device carries out multi-mode coding on the spliced video features containing the multi-mode information through a full connection layer.
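A PyTorch-style sketch of this encoding step is shown below, assuming pre-extracted per-frame features; the feature dimensions, and the choice to pair each of the K1 face-quality codes with the globally pooled codes of the other modalities, are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    """Sketch of the video-feature and multi-modal coding step.

    Face features are average-pooled over all frames, over the high-quality region and over
    the low-quality region (K1 = 3 codes); other modalities are globally pooled over all
    frames; each face code is concatenated with the other-modality codes and encoded by a
    fully connected layer, yielding K1 multi-modal coding features (graph node features).
    """

    def __init__(self, face_dim=512, other_dims=(512, 256), out_dim=512):
        super().__init__()
        self.fc = nn.Linear(face_dim + sum(other_dims), out_dim)

    def forward(self, face_feats, high_mask, other_feats):
        # face_feats: (T, face_dim); high_mask: (T,) bool (both regions assumed non-empty);
        # other_feats: list of (T, d_m) tensors, one per additional modality
        face_codes = [face_feats.mean(dim=0),                 # all frames
                      face_feats[high_mask].mean(dim=0),      # high-quality region
                      face_feats[~high_mask].mean(dim=0)]     # low-quality region
        others = torch.cat([f.mean(dim=0) for f in other_feats])   # global pooling
        return torch.stack([self.fc(torch.cat([c, others])) for c in face_codes])
```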
Specifically, the role of the multi-modal feature coding layer is mainly two-fold. From the perspective of multi-modal information fusion, the coded features can make full use of the complementarity among the modality features, improving expression capability and reducing redundancy among them. From the perspective of multi-level information fusion, the association between the video to be processed and its neighboring videos is constructed in the coded multi-modal feature space, laying the foundation for the model to realize adaptive fusion of the three types of information. A plurality of coded features is obtained for each video in the above manner, and these coded features are taken as the graph node features of the video to be processed.
Step S14: and constructing a graph network based on the multi-modal coding features of each sub-video to be processed and the features of the adjacent videos.
In the embodiment of the present application, the target identification apparatus may perform feature fusion on the neighboring video features and the multi-modal coding features, and the fusion may adopt common techniques such as concat (serial feature concatenation) or add (parallel element-wise addition).
Further, the object recognition device may also use graph data to fuse the neighboring video features and the multi-modal coding features; refer to FIG. 4 for details. FIG. 4 is a specific flowchart of step S14 of the object recognition method provided by the present application.
As shown in fig. 4, the target identification method according to the embodiment of the present application includes the following steps:
step S141: a feature matrix is determined that is composed of neighboring video features and multi-modal coding features together.
In an embodiment of the application, the target recognition device combines the neighboring video features and the multi-modal coding features into one feature matrix.
Step S142: acquiring an adjacent matrix corresponding to the characteristic matrix; the adjacency matrix represents the connection relation for fusing different features in the feature matrix.
In the embodiment of the present application, the target identification apparatus determines, for every two features in the feature matrix, the connection weight used for fusing them, and obtains the adjacency matrix composed of the determined connection weights.
Specifically, a node where the multi-modal coding features of the video to be processed are located is called a master node, and a node where the neighboring video features of the neighboring video are located is called a neighboring node. The adjacency matrix is constructed in the following specific manner:
the target recognition device obtains cosine distances among different multi-modal coding features, and determines first connection weights among the different multi-modal coding features based on the cosine distances and a preset exponential function. The target recognition device obtains a cosine distance between the neighbor video features and the multi-modal coding features, and determines second connection weights between the neighbor video features and the multi-modal coding features based on the cosine distance, a preset index function and a balance factor. The connection weight between the neighboring video feature and the neighboring video feature is set to a fixed value, for example, "1".
The construction process of the adjacency matrix is represented by the following formula:

$$
a_{ij}=\begin{cases}
\exp(-d_{ij}/\tau), & i\in V_m,\ j\in V_m\\
\lambda\cdot\exp(-d_{ij}/\tau), & i\in V_m,\ j\in V_n\\
1, & i=j,\ i\in V_n\\
0, & \text{otherwise}
\end{cases}
$$

where $A=(a_{ij})$ is the adjacency matrix of the feature matrix, $a_{ij}$ represents the connection weight between the $i$-th node and the $j$-th node in the feature matrix, $d_{ij}$ represents the cosine distance between the node features $x_i$ and $x_j$, $V_m$ is the set of master nodes, $V_n$ is the set of neighboring nodes, $x_i$ is the multi-modal coding feature of the $i$-th node, $\tau$ is the temperature parameter, and $\lambda$ is the balance factor.
The construction scheme for designing the adjacency matrix mainly considers the following three points:
1. the degree of connectivity between reliable nodes is improved and the degree of connectivity between unreliable nodes is reduced by the exponential function exp.
2. It is desirable that the GCN (Graph Convolutional Network) focuses only on information fusion between master nodes; therefore, information transfer from other graph nodes to a neighboring node is suppressed by setting the corresponding connection weight to 0, and only the information of the neighboring node itself is retained.
3. Considering the difference between the two different levels of information, namely the video itself and its neighbors, a balance factor $\lambda$ is set to adjust the degree to which the model fuses the two types of information.
Further, the adjacency matrix obtained by the above formula is an asymmetric matrix, and the degree matrix is calculated as follows:

$$D_{ii}=\sum_{j} a_{ij}$$

The normalized graph Laplacian (propagation) matrix is then calculated as follows:

$$\tilde{A}=D^{-1}A$$
step S143: and constructing a graph network based on the feature matrix and the adjacency matrix.
In the embodiment of the application, the object recognition device can use a local graph $G=(V,E)$ to model the video to be processed. Here $V$ represents the graph nodes composed of the video to be processed and its neighboring videos. For the graph constructed for a certain video to be processed, its K2 neighboring videos are screened out according to the method in step S11, and the K1 encoded features of the video to be processed and of each neighboring video are extracted as graph nodes according to the method in step S12. Thus, the number of graph nodes is $|V| = K1\times(K2+1)$, where the 1 represents the video to be processed itself. $E$ represents the set of connecting edges between graph nodes, computed by the adjacency matrix construction scheme above.
The target recognition device inputs the constructed local graph into the GCN, and performs feature fusion on the constructed local graph through a feature updating mechanism of the GCN, wherein a specific updating formula is as follows:
$$
H^{(l+1)}=\sigma\Big(\big((1-\alpha)\tilde{A}H^{(l)}+\alpha H^{(0)}\big)\big((1-\beta_l)I+\beta_l W^{(l)}\big)\Big)
$$

where $\alpha$ and $\beta_l$ are two hyper-parameters, $\sigma$ is the activation function, $H^{(0)}$ is the initial input feature, $H^{(l)}$ is the feature updated to the $l$-th layer, and $W^{(l)}$ is a learnable weight matrix. The first term in the formula is the concatenation of the initial residual, adjusted by the parameter $\alpha$. The second term weights the identity matrix $I$ and the weight matrix $W^{(l)}$ by the parameter $\beta_l$, so that the attenuation of the weight matrix increases adaptively as the number of layers increases.

Further, $\beta_l$ is calculated as follows:

$$\beta_l=\log\big(\tfrac{\gamma}{l}+1\big)$$

where $\gamma$ is a hyper-parameter.
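One propagation layer following the update rule above could be sketched in PyTorch as follows; the hyper-parameter values and the layer width are placeholders, and the rule itself mirrors the initial-residual and identity-mapping form described above.

```python
import math
import torch
import torch.nn as nn

class GraphFusionLayer(nn.Module):
    """One GCN propagation layer with an initial residual and identity mapping."""

    def __init__(self, dim=512, alpha=0.1, gamma=0.5):
        super().__init__()
        self.alpha, self.gamma = alpha, gamma
        self.weight = nn.Linear(dim, dim, bias=False)

    def forward(self, A_norm, H, H0, layer_idx):
        # A_norm: (n, n) normalised propagation matrix; H: (n, dim) current node features;
        # H0: (n, dim) initial input features; layer_idx: 1-based layer number l
        beta = math.log(self.gamma / layer_idx + 1.0)                 # beta_l = log(gamma/l + 1)
        support = (1 - self.alpha) * (A_norm @ H) + self.alpha * H0   # initial residual term
        out = (1 - beta) * support + beta * self.weight(support)      # identity + weight mapping
        return torch.relu(out)
```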
Step S15: and fusing the multi-modal coding features of each to-be-processed sub-video and the neighboring video features by using a graph network to obtain final fusion features, and identifying the target object based on the final fusion features.
In the embodiment of the present application, the target recognition apparatus obtains the fused features of the video to be processed according to step S14, cascades the pre-fusion graph node coding features onto the fused features to obtain the input features of the final classification layer, and classifies each graph node through an FC layer and a Softmax layer. In order to accelerate network convergence, joint supervision is applied over the target nodes during training, while during testing only the prediction result corresponding to the high-quality video feature of the target video is selected as the final result.
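A sketch of the classification head described here is given below; the feature dimension, the identity count, and the exact way the pre-fusion node features are concatenated are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class NodeClassifier(nn.Module):
    """FC + Softmax head over the fused graph features concatenated with the
    corresponding pre-fusion graph node features."""

    def __init__(self, dim=512, num_identities=1000):
        super().__init__()
        self.fc = nn.Linear(2 * dim, num_identities)

    def forward(self, fused_nodes, pre_fusion_nodes):
        # fused_nodes, pre_fusion_nodes: (n, dim) tensors aligned node-by-node
        logits = self.fc(torch.cat([fused_nodes, pre_fusion_nodes], dim=-1))
        return torch.softmax(logits, dim=-1)   # per-node identity probabilities
```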
For the multi-modal video person recognition task, the target recognition method provided by the application proposes a brand-new modeling approach: a plurality of video features of a target video and of its neighboring videos, with multi-modal information encoded, are used as graph nodes. In this way, three types of information at different levels, namely neighbor information, video information, and multi-modal information, are constructed within one graph, and the relationship-mining and information-aggregation mechanisms of the GCN (Graph Convolutional Network) make full use of the information at each level and the correlations among them, realizing adaptive fusion of the three levels and improving the overall recognition effect. The target recognition method provided by the application was tested on the iQIYI-VID-2019 dataset and achieves the best effect to date.
In addition, based on this modeling approach, the application also designs a corresponding graph node encoding scheme and an adjacency matrix calculation scheme. The video feature encoding and multi-modal feature encoding modules generate, for each video, graph node features that fully contain the target video information while reducing the redundancy of the video information. The designed adjacency matrix calculation scheme adjusts the degree of connection between nodes, thereby improving the reliability of the fused features. The above construction provides simpler and more reliable input for the GCN fusion and helps improve the overall recognition effect.
The above embodiments are only one of the common cases of the present application and do not limit the technical scope of the present application, so that any minor modifications, equivalent changes or modifications made to the above contents according to the essence of the present application still fall within the technical scope of the present application.
With continuing reference to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of the object recognition device provided in the present application. The object recognition apparatus 400 according to the embodiment of the present application includes: a video acquisition module 41, a video clustering module 42, a feature encoding module 43, a feature fusion module 44, and a target identification module 45.
The video obtaining module 41 is configured to obtain a video to be processed and a neighboring video feature thereof, where the neighboring video feature is determined based on features of multiple neighboring videos, the neighboring video is a video having features of at least one of the different modalities, and the neighboring video is selected according to a similarity between video features of videos in a video library and video features of the video to be processed.
The video clustering module 42 is configured to cluster all video frames of the to-be-processed video based on features of at least one modality, so as to divide the to-be-processed video into a plurality of to-be-processed sub-videos.
The feature encoding module 43 is configured to encode the original video features of the multiple modalities of each to-be-processed sub video to obtain the multi-modal encoding features of each to-be-processed sub video.
The feature fusion module 44 is configured to construct a graph network based on the multi-modal encoding features of each to-be-processed sub-video and the neighboring video features.
The feature fusion module 44 is further configured to fuse, by using the graph network, the multi-modal coding feature of each to-be-processed sub-video and the feature of the neighboring video to obtain a final fusion feature.
The target identification module 45 is configured to identify the target object based on the final fusion feature.
With continuing reference to fig. 6, fig. 6 is a schematic structural diagram of another embodiment of the object recognition device provided in the present application. The object recognition apparatus 500 of the embodiment of the present application includes a processor 51, a memory 52, an input-output device 53, and a bus 54.
The processor 51, the memory 52 and the input/output device 53 are respectively connected to the bus 54, the memory 52 stores program data, and the processor 51 is used for executing the program data to realize the target identification method described in the above embodiment.
In the embodiment of the present application, the processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip having signal processing capabilities. The processor 51 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components. The general purpose processor may be a microprocessor or the processor 51 may be any conventional processor or the like.
Fig. 7 is a schematic structural diagram of an embodiment of the computer storage medium provided in the present application, and the computer storage medium 600 stores program data 61, and when the program data 61 is executed by a processor, the program data is used to implement the object recognition method of the embodiment.
Embodiments of the present application may be implemented in software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
The above description is only intended to illustrate embodiments of the present application and is not intended to limit the scope of the present application; any equivalent structures or equivalent process transformations made using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included within the scope of protection of the present application.

Claims (12)

1. An object recognition method, characterized in that the object recognition method comprises:
acquiring a video to be processed and neighbor video characteristics thereof, wherein the neighbor video characteristics are determined based on the characteristics of a plurality of neighbor videos, the neighbor videos are videos with characteristics of at least one of different modalities, and the neighbor videos are selected according to the similarity between the video characteristics of the videos in a video library and the video characteristics of the video to be processed;
clustering all video frames of the video to be processed based on the characteristics of at least one mode, so as to divide the video to be processed into a plurality of sub-videos to be processed;
encoding original video characteristics of multiple modes of each to-be-processed sub video to obtain multi-mode encoding characteristics of each to-be-processed sub video;
constructing a graph network based on the multi-modal coding features of each to-be-processed sub-video and the neighbor video features;
and fusing the multi-modal coding features of each to-be-processed sub-video and the neighboring video features by using the graph network to obtain final fusion features, and identifying the target object based on the final fusion features.
2. The object recognition method of claim 1,
the constructing of the graph network based on the multi-modal coding features of each to-be-processed sub-video and the neighbor video features comprises:
determining a feature matrix composed of the neighbor video features and the multi-modal coding features together;
acquiring an adjacency matrix corresponding to the feature matrix, wherein the adjacency matrix represents the connection relations used for fusing different features in the feature matrix;
constructing the graph network based on the feature matrix and the adjacency matrix;
the fusing the multi-modal coding features of each to-be-processed sub-video and the neighbor video features by using the graph network to obtain final fusion features comprises:
and aggregating the feature matrix and the adjacency matrix to obtain the fusion feature of the video to be processed.
3. The object recognition method according to claim 2, wherein the acquiring of the adjacency matrix corresponding to the feature matrix comprises:
determining a connection weight for fusing every two features in the feature matrix;
and forming the adjacency matrix from the determined connection weights.
4. The object recognition method according to claim 3, wherein the determining of the connection weight for fusing every two features in the feature matrix comprises:
acquiring a first distance between the multi-modal coding features of different to-be-processed sub-videos, and determining a first connection weight between the multi-modal coding features of the different to-be-processed sub-videos based on the first distance and a preset exponential function;
and acquiring a second distance between the neighbor video feature and the multi-modal coding feature of a to-be-processed sub-video, and determining a second connection weight between the neighbor video feature and the multi-modal coding feature of the to-be-processed sub-video based on the second distance, the preset exponential function, and a balance factor.
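The following hedged sketch shows how the connection weights of claims 3 and 4 could be assembled into the adjacency matrix. The sigma scale of the exponential, the value of the balance factor, and the choice to weight neighbor-to-neighbor edges the same way as neighbor-to-sub-video edges are assumptions, not taken from the patent.

    import numpy as np

    def build_adjacency(sub_codes, neighbor_feats, sigma=1.0, balance=0.5):
        """Feature matrix and adjacency matrix for the graph network (illustrative)."""
        X = np.vstack([sub_codes, neighbor_feats])     # feature matrix of claim 2
        n_sub, n = len(sub_codes), len(X)
        A = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                w = np.exp(-np.linalg.norm(X[i] - X[j]) / sigma)   # preset exponential function
                if i < n_sub and j < n_sub:
                    A[i, j] = w                # first connection weight: sub-video to sub-video
                else:
                    A[i, j] = balance * w      # second connection weight: scaled by the balance factor
        return X, A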
5. The object recognition method according to any one of claims 2 to 4, wherein
the aggregating of the feature matrix and the adjacency matrix to obtain the fusion feature of the video to be processed comprises:
aggregating the feature matrix and the adjacency matrix to obtain a graph feature of the video to be processed;
and performing feature fusion on the graph feature of the video to be processed by using a preset feature updating mechanism to obtain the fusion feature of the video to be processed.
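One possible reading of claim 5 in code, continuing the previous sketch: the feature matrix is aggregated with a degree-normalized adjacency matrix, and a residual-plus-ReLU step stands in for the "preset feature updating mechanism", which the claim does not specify.

    import numpy as np

    def aggregate_and_fuse(X, A, n_sub):
        """Aggregate the feature matrix with the adjacency matrix, then update (illustrative)."""
        deg = np.clip(A.sum(axis=1, keepdims=True), 1e-12, None)
        graph_feat = (A / deg) @ X                 # graph feature of the video to be processed
        updated = np.maximum(graph_feat + X, 0.0)  # assumed update: residual connection + ReLU
        return updated[:n_sub].mean(axis=0)        # fusion feature over the sub-video nodes

    # Example use together with the previous sketch:
    # X, A = build_adjacency(codes, neighbors)
    # fusion_feature = aggregate_and_fuse(X, A, n_sub=len(codes))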
6. The object recognition method of claim 1,
the encoding of the original video features of the multiple modalities of each to-be-processed sub-video to obtain the multi-modal coding features of each to-be-processed sub-video comprises:
performing a first pooling operation on the original video features of the at least one modality in each to-be-processed sub-video to obtain a first pooled video feature;
performing a second pooling operation on the original video features of the other modalities in each to-be-processed sub-video to obtain a second pooled video feature;
splicing the first pooled video feature with the second pooled video feature;
and encoding the spliced video features to obtain the multi-modal coding features of each to-be-processed sub-video.
7. The object recognition method of claim 6,
the first pooling operation is an average pooling operation and the second pooling operation is a global pooling operation.
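To illustrate claims 6 and 7, the sketch below average-pools the face-modality frame features of a sub-video, applies a global pooling (read here as max pooling, which is an assumption) to the other modality, splices the two, and projects the result; the random projection stands in for whatever learned encoder an implementation would use.

    import numpy as np

    def encode_sub_video(face_frames, other_frames, out_dim=256, seed=0):
        """Multi-modal coding feature of one to-be-processed sub-video (illustrative)."""
        pooled_face = face_frames.mean(axis=0)    # first pooling operation: average pooling
        pooled_other = other_frames.max(axis=0)   # second pooling operation: global (max) pooling, assumed
        spliced = np.concatenate([pooled_face, pooled_other])
        rng = np.random.default_rng(seed)
        proj = rng.standard_normal((spliced.size, out_dim)) / np.sqrt(spliced.size)
        return spliced @ proj                     # stand-in for the learned encoder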
8. The object recognition method according to claim 6 or 7,
the at least one modality is a face modality;
the clustering of all video frames of the video to be processed based on the features of the at least one modality so as to divide the video to be processed into a plurality of to-be-processed sub-videos comprises:
and dividing the video to be processed into a plurality of to-be-processed sub-videos according to face quality, wherein the to-be-processed sub-videos are defined as a high-quality region video and a low-quality region video according to the face quality.
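A minimal sketch of the face-quality split in claim 8; the quality scores are assumed to come from an upstream face-quality model, and the 0.6 threshold is an invented example value.

    import numpy as np

    def split_by_face_quality(frame_indices, face_quality, threshold=0.6):
        """Divide frames into high-quality and low-quality region sub-videos (illustrative)."""
        frame_indices = np.asarray(frame_indices)
        face_quality = np.asarray(face_quality)
        high = frame_indices[face_quality >= threshold]   # high-quality region video
        low = frame_indices[face_quality < threshold]     # low-quality region video
        return [s for s in (high, low) if s.size]         # drop empty sub-videos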
9. The object recognition method according to claim 1, further comprising:
respectively calculating similarity values between each of a plurality of features of the video to be processed and the feature of each preset video, so as to obtain a plurality of similarity values for each of the plurality of features;
ranking the plurality of similarity values of each of the plurality of features by magnitude, and taking the preset video corresponding to a similarity value ranked at a target position as a neighbor video;
and extracting the features of different modalities in each neighbor video, and taking the extracted features as the neighbor video features corresponding to the video to be processed.
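The following hedged sketch mirrors claim 9: each query-video feature is compared against the preset (library) videos, the similarity values are ranked, and the videos at the leading positions are kept as neighbors. Cosine similarity, top_k = 5, and the per-modality feature lookup are all assumptions.

    import numpy as np

    def select_neighbor_features(query_feats, library_feats, library_modal_feats, top_k=5):
        """Pick neighbor videos by ranked similarity and return their per-modality features (illustrative)."""
        picked = set()
        for q in query_feats:                      # one query feature per modality
            sims = library_feats @ q / (
                np.linalg.norm(library_feats, axis=1) * np.linalg.norm(q) + 1e-12)
            picked.update(int(i) for i in np.argsort(-sims)[:top_k])   # target positions = leading ranks
        return {i: library_modal_feats[i] for i in sorted(picked)}     # neighbor video features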
10. An object recognition apparatus, characterized in that the object recognition apparatus comprises a video acquisition module, a video clustering module, a feature coding module, a feature fusion module, and a target identification module; wherein,
the video acquisition module is used for acquiring a video to be processed and neighbor video features of the video to be processed, wherein the neighbor video features are determined based on features of a plurality of neighbor videos, the neighbor videos are videos having features of at least one of a plurality of different modalities, and the neighbor videos are selected according to the similarity between the video features of videos in a video library and the video features of the video to be processed;
the video clustering module is used for clustering all video frames of the video to be processed based on the features of at least one modality, so as to divide the video to be processed into a plurality of to-be-processed sub-videos;
the feature coding module is used for encoding original video features of multiple modalities of each to-be-processed sub-video to obtain multi-modal coding features of each to-be-processed sub-video;
the feature fusion module is used for constructing a graph network based on the multi-modal coding features of each to-be-processed sub-video and the neighbor video features;
the feature fusion module is further configured to fuse the multi-modal coding features of each to-be-processed sub-video and the neighbor video features by using the graph network to obtain final fusion features;
and the target identification module is used for identifying the target object based on the final fusion characteristics.
11. An object recognition apparatus, comprising a memory and a processor coupled to the memory;
wherein the memory is adapted to store program data and the processor is adapted to execute the program data to implement the object recognition method as claimed in any one of claims 1-9.
12. A computer storage medium for storing program data, wherein the program data, when executed by a computer, implements the object recognition method as claimed in any one of claims 1 to 9.
CN202211014858.4A 2022-08-23 2022-08-23 Object recognition method, object recognition apparatus, and computer storage medium Active CN115100725B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211014858.4A CN115100725B (en) 2022-08-23 2022-08-23 Object recognition method, object recognition apparatus, and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211014858.4A CN115100725B (en) 2022-08-23 2022-08-23 Object recognition method, object recognition apparatus, and computer storage medium

Publications (2)

Publication Number Publication Date
CN115100725A true CN115100725A (en) 2022-09-23
CN115100725B CN115100725B (en) 2022-11-22

Family

ID=83301616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211014858.4A Active CN115100725B (en) 2022-08-23 2022-08-23 Object recognition method, object recognition apparatus, and computer storage medium

Country Status (1)

Country Link
CN (1) CN115100725B (en)

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210012128A1 (en) * 2019-03-18 2021-01-14 Beijing Sensetime Technology Development Co., Ltd. Driver attention monitoring method and apparatus and electronic device
CN110096950A (en) * 2019-03-20 2019-08-06 西北大学 A kind of multiple features fusion Activity recognition method based on key frame
CN110147548A (en) * 2019-04-15 2019-08-20 浙江工业大学 The emotion identification method initialized based on bidirectional valve controlled cycling element network and new network
CN110334753A (en) * 2019-06-26 2019-10-15 Oppo广东移动通信有限公司 Video classification methods, device, electronic equipment and storage medium
CN110674350A (en) * 2019-09-23 2020-01-10 网易(杭州)网络有限公司 Video character retrieval method, medium, device and computing equipment
CN110796100A (en) * 2019-10-31 2020-02-14 浙江大华技术股份有限公司 Gait recognition method and device, terminal and storage device
CN110855905A (en) * 2019-11-29 2020-02-28 联想(北京)有限公司 Video processing method and device and electronic equipment
WO2021159896A1 (en) * 2020-02-13 2021-08-19 华为技术有限公司 Video processing method, video processing device, and storage medium
CN113627218A (en) * 2020-05-08 2021-11-09 北京邮电大学 Figure identification method and device based on video data
CN111507311A (en) * 2020-05-22 2020-08-07 南京大学 Video character recognition method based on multi-mode feature fusion depth network
CN112417970A (en) * 2020-10-22 2021-02-26 北京迈格威科技有限公司 Target object identification method, device and electronic system
WO2022141533A1 (en) * 2020-12-31 2022-07-07 深圳市大疆创新科技有限公司 Video processing method, video processing apparatus, terminal device, and storage medium
CN112733764A (en) * 2021-01-15 2021-04-30 天津大学 Method for recognizing video emotion information based on multiple modes
CN113569610A (en) * 2021-02-09 2021-10-29 腾讯科技(深圳)有限公司 Video content identification method and device, storage medium and electronic equipment
CN113762322A (en) * 2021-04-22 2021-12-07 腾讯科技(北京)有限公司 Video classification method, device and equipment based on multi-modal representation and storage medium
CN114299321A (en) * 2021-08-04 2022-04-08 腾讯科技(深圳)有限公司 Video classification method, device, equipment and readable storage medium
CN113723209A (en) * 2021-08-05 2021-11-30 浙江大华技术股份有限公司 Target identification method, target identification device, electronic equipment and computer-readable storage medium
CN114359796A (en) * 2021-12-29 2022-04-15 浙江大华技术股份有限公司 Target identification method and device and electronic equipment
CN114387567A (en) * 2022-03-23 2022-04-22 长视科技股份有限公司 Video data processing method and device, electronic equipment and storage medium
CN114817543A (en) * 2022-05-05 2022-07-29 昆明理工大学 Text clustering method based on contrast learning and dynamic adjustment mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JINGWEN HU et al.: "MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation", arXiv *
VIVEK SHARMA et al.: "Clustering based Contrastive Learning for Improving Face Representations", arXiv *
LUO, Wenwen: "Research and Application of Multi-modal Feature Learning Algorithms Based on Graph Networks", China Master's Theses Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN115100725B (en) 2022-11-22

Similar Documents

Publication Publication Date Title
Li et al. Spatio-temporal unity networking for video anomaly detection
CN109389151B (en) Knowledge graph processing method and device based on semi-supervised embedded representation model
CN113361334B (en) Convolutional pedestrian re-identification method and system based on key point optimization and multi-hop intention
CN113806546B (en) Graph neural network countermeasure method and system based on collaborative training
CN111767847B (en) Pedestrian multi-target tracking method integrating target detection and association
CN111178208A (en) Pedestrian detection method, device and medium based on deep learning
CN111754472A (en) Pulmonary nodule detection method and system
US20110235901A1 (en) Method, apparatus, and program for generating classifiers
CN115273244B (en) Human body action recognition method and system based on graph neural network
CN113239801B (en) Cross-domain action recognition method based on multi-scale feature learning and multi-level domain alignment
CN113255714A (en) Image clustering method and device, electronic equipment and computer readable storage medium
CN112801063B (en) Neural network system and image crowd counting method based on neural network system
CN111079539A (en) Video abnormal behavior detection method based on abnormal tracking
CN111639230B (en) Similar video screening method, device, equipment and storage medium
CN113095370A (en) Image recognition method and device, electronic equipment and storage medium
CN116049467A (en) Non-supervision image retrieval method and system based on label visual joint perception
CN111696136A (en) Target tracking method based on coding and decoding structure
CN113705596A (en) Image recognition method and device, computer equipment and storage medium
CN108537235B (en) Method for extracting image features by low-complexity scale pyramid
CN114821299A (en) Remote sensing image change detection method
CN115100725B (en) Object recognition method, object recognition apparatus, and computer storage medium
CN114882288B (en) Multi-view image classification method based on hierarchical image enhancement stacking self-encoder
CN111291785A (en) Target detection method, device, equipment and storage medium
CN114359796A (en) Target identification method and device and electronic equipment
CN111626098B (en) Method, device, equipment and medium for updating parameter values of model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant