CN112580599B - Video identification method, device and computer readable storage medium - Google Patents

Info

Publication number
CN112580599B
Authority
CN
China
Prior art keywords
image
reference image
video
features
text
Prior art date
Legal status
Active
Application number
CN202011607400.0A
Other languages
Chinese (zh)
Other versions
CN112580599A (en)
Inventor
刘鹏
陈益如
丁文奎
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011607400.0A
Publication of CN112580599A
Application granted
Publication of CN112580599B
Legal status: Active
Anticipated expiration

Classifications

    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F18/25 Pattern recognition; analysing; fusion techniques
    • G06N3/044 Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/084 Learning methods; backpropagation, e.g. using gradient descent
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V2201/07 Indexing scheme relating to image or video recognition or understanding; target detection
    • G06V30/10 Character recognition

Abstract

The present disclosure relates to a video recognition method, apparatus, and computer-readable storage medium. The method comprises the steps of obtaining a reference image in a video to be identified and text information corresponding to the reference image; performing target detection on the reference image, obtaining an image feature vector for representing pixel features of an area where a target object is located in the reference image, and performing fusion processing on the image feature vector and preset relative position information to obtain fusion image features; and extracting the characteristics of the text information to obtain text characteristics corresponding to the text information, and carrying out fusion processing on the fusion image characteristics and the text characteristics to obtain semantic information for identifying video content of the video to be identified. After the image feature vector is obtained, the image feature vector can be directly fused with the preset relative position information to obtain the fused image feature, so that the efficiency and the accuracy of identifying the video content are improved.

Description

Video identification method, device and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a video identification method, apparatus, and computer readable storage medium.
Background
With the popularization of mobile terminals and faster networks, the content released on network platforms has gradually shifted from standalone text, pictures, audio and the like to short videos: videos of roughly 5 minutes or less that are distributed over internet media. Short videos are better suited to users watching while on the move or during short periods of leisure.
At present, the coverage of short videos is expanding rapidly and their influence keeps growing; tens of millions of videos are uploaded and hundreds of millions of users watch them every day. To give users a better viewing experience, a network platform pushes video content according to a user's historical search records or the types of anchors the user follows, and when recommending videos to the user, the platform does so based on the video content. In the related art, video content is identified according to the video tag of the video to be identified; however, that tag is customized by the user when publishing the short video through a client and cannot reflect the real content of the video. Consequently, current methods for identifying video content have low accuracy and low identification efficiency.
Disclosure of Invention
The disclosure provides a video identification method, a video identification device and a computer readable storage medium, which are used to improve the accuracy and efficiency of identifying the video content of a video to be identified. The technical scheme of the present disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a video recognition method, including:
acquiring a reference image in a video to be identified and text information corresponding to the reference image;
Performing target detection on the reference image, obtaining an image feature vector for representing the pixel feature of the region where the target object is located in the reference image, and performing fusion processing on the image feature vector and preset relative position information to obtain fusion image features; the preset relative position information is used for representing the relative position of each characteristic value in the image characteristic vector in the reference image; and
Extracting features of the text information to obtain text features corresponding to the text information;
and carrying out fusion processing on the fusion image features and the text features to obtain semantic information for identifying the video content of the video to be identified.
An optional implementation manner is that the target detection is performed on the reference image to obtain an image feature vector for representing a pixel feature of an area where a target object is located in the reference image, including:
Performing target detection on the reference image, and identifying an area where the target object is located in the reference image;
Extracting image features of the region where the target object is located according to pixel values of the region where the target object is located in the reference image, so as to obtain a plurality of feature values used for representing the pixel features of the region where the target object is located in the reference image;
and generating the image feature vector according to the obtained feature values.
An optional implementation manner, the fusing processing of the image feature vector and the preset relative position information to obtain the fused image feature includes:
mapping the image feature vector and preset relative position information to obtain a first embedded vector;
and according to the attention weight parameter, carrying out fusion processing on each element in the first embedded vector to obtain the fusion image characteristic.
An optional implementation manner is that the feature extraction is performed on the text information to obtain text features corresponding to the text information, including:
Extracting character vectors and/or word vectors in the text information;
Mapping the extracted character vectors and/or word vectors to obtain a second embedded vector;
and according to the attention weight parameter, carrying out fusion processing on each element in the second embedded vector to obtain the text feature.
An optional implementation manner is that the fusing processing is performed on the fused image features and the text features to obtain semantic information for identifying video content of the video to be identified, including:
Respectively carrying out embedding treatment on the fusion image features and the text features to respectively obtain a third embedded vector and a fourth embedded vector;
Based on a first attention mechanism module, according to attention weight parameters corresponding to the first attention mechanism module, carrying out fusion processing on each element in the third embedded vector to obtain intermediate fusion image characteristics; based on a second attention mechanism module, according to attention weight parameters corresponding to the second attention mechanism module, carrying out fusion processing on each element in the fourth embedded vector to obtain intermediate text features;
And fusing part of the features in the intermediate fused image features with part of the features in the intermediate text features to obtain the semantic information.
An optional implementation manner is that the acquiring the reference image in the video to be identified includes:
Taking the cover image of the video to be identified as the reference image; or
And extracting at least one frame of image from the video to be identified as the reference image according to a preset time interval.
According to a second aspect of embodiments of the present disclosure, there is provided a video recognition apparatus, including:
An acquisition unit configured to acquire a reference image in a video to be identified and text information corresponding to the reference image;
the detection unit is configured to perform target detection on the reference image, acquire an image feature vector used for representing pixel features of an area where a target object is located in the reference image, and perform fusion processing on the image feature vector and preset relative position information to obtain fusion image features; the preset relative position information is used for representing the relative position of each characteristic value in the image characteristic vector in the reference image;
The extraction unit is configured to perform feature extraction on the text information to obtain text features corresponding to the text information;
And the processing unit is configured to perform fusion processing on the fusion image characteristics and the text characteristics to obtain semantic information for identifying the video content of the video to be identified.
An alternative embodiment is that the detection unit is configured to perform:
Performing target detection on the reference image, and identifying an area where the target object is located in the reference image;
Extracting image features of the region where the target object is located according to pixel values of the region where the target object is located in the reference image, so as to obtain a plurality of feature values used for representing the pixel features of the region where the target object is located in the reference image;
and generating the image feature vector according to the obtained feature values.
An alternative embodiment is that the detection unit is further configured to perform:
mapping the image feature vector and preset relative position information to obtain a first embedded vector;
and according to the attention weight parameter, carrying out fusion processing on each element in the first embedded vector to obtain the fusion image characteristic.
An alternative embodiment is that the extraction unit is configured to perform:
Extracting character vectors and/or word vectors in the text information;
Mapping the extracted character vectors and/or word vectors to obtain a second embedded vector;
and according to the attention weight parameter, carrying out fusion processing on each element in the second embedded vector to obtain the text feature.
An alternative embodiment is that the processing unit is configured to perform:
Respectively carrying out embedding treatment on the fusion image features and the text features to respectively obtain a third embedded vector and a fourth embedded vector;
Based on a first attention mechanism module, according to attention weight parameters corresponding to the first attention mechanism module, carrying out fusion processing on each element in the third embedded vector to obtain intermediate fusion image characteristics; based on a second attention mechanism module, according to attention weight parameters corresponding to the second attention mechanism module, carrying out fusion processing on each element in the fourth embedded vector to obtain intermediate text features;
And fusing part of the features in the intermediate fused image features with part of the features in the intermediate text features to obtain the semantic information.
An alternative embodiment is that the acquisition unit is configured to perform:
Taking the cover image of the video to be identified as the reference image; or
And extracting at least one frame of image from the video to be identified as the reference image according to a preset time interval.
According to a third aspect of embodiments of the present disclosure, there is provided a video recognition apparatus, including:
A processor;
A memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video recognition method according to the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium storing instructions which, when executed by a processor of a video recognition device, cause the video recognition device to perform the video recognition method as described in the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product which, when executed by a processor, implements the video recognition method as described in the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
Because the embodiment of the disclosure provides a scheme for automatically identifying video content, after a reference image in a video to be identified and the text information corresponding to the reference image are acquired, target detection can be performed on the reference image to obtain an image feature vector representing the pixel features of the area where a target object is located in the reference image; the image feature vector is then fused with preset relative position information to obtain a fused image feature, and feature extraction is performed on the text information to obtain the text features corresponding to the text information. After the fused image feature of the target object in the reference image and the text features corresponding to the text information are obtained, the fused image feature and the text features can be fused to obtain semantic information for identifying the video content of the video to be identified. When target detection is performed on the reference image in the video to be identified, once the image feature vector representing the pixel features of the region where the target object is located has been obtained, it can be fused directly with the preset relative position information to obtain the fused image feature. The fused image feature contains both the pixel features and the position information of the reference image, so identifying the video content from it improves accuracy, and fusing the preset relative position information with the image feature vector improves the efficiency of obtaining the fused image feature of the reference image. In addition, after feature extraction is performed on the text information to obtain the text features, the text features and the fused image feature can be fused to obtain the semantic information used for identifying the video content of the video to be identified; because the fused image feature and the text features of the reference image are combined when identifying the video content, the video content can be identified accurately through cross-modal features.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram of a first video recognition system, shown in accordance with an exemplary embodiment;
FIG. 2 is a schematic diagram of a second video recognition system, shown in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram of a third video recognition system, shown in accordance with an exemplary embodiment;
FIG. 4 is a flow chart of a video recognition method, according to an exemplary embodiment;
FIG. 5 is a schematic diagram of the structure of a DETR model, according to an exemplary embodiment;
FIG. 6 is a schematic diagram illustrating detection of a target object in a reference image according to an exemplary embodiment;
FIG. 7 is a schematic diagram illustrating one method of generating an image feature vector according to an exemplary embodiment;
FIG. 8 is a flow diagram illustrating processing of a first embedded vector based on an attention mechanism according to an exemplary embodiment;
FIG. 9 is a flow diagram illustrating processing of fused image features and text features based on a mutual attention mechanism according to an exemplary embodiment;
FIG. 10 is a complete block diagram of a video recognition system, shown in accordance with an exemplary embodiment;
FIG. 11 is a block diagram of a video recognition device, according to an exemplary embodiment;
FIG. 12 is a block diagram illustrating a video recognition device, according to an exemplary embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
In the following, some terms in the embodiments of the present disclosure are explained for easy understanding by those skilled in the art.
(1) The term "plurality" in the embodiments of the present disclosure means two or more, and other adjectives are similar thereto.
(2) "And/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
(3) A server serves the terminal: the service it provides supplies resources to the terminal and stores terminal data. The server corresponds to the application program installed on the terminal and operates in cooperation with that application program.
(4) A client may refer to a software application (APP) or to a terminal device. It has a visual display interface and can interact with the user; it corresponds to the server and provides local services on the client side. Except for applications that run only locally, software applications are generally installed on an ordinary client terminal and need to run in conjunction with a server. Since the development of the internet, commonly used applications have included e-mail clients, such as e-mail receiving clients, and instant messaging clients. For this type of application, a corresponding server and service program must exist in the network to provide services such as database services and configuration parameter services, and a specific communication connection needs to be established between the client terminal and the server terminal to ensure the normal operation of the application.
In the related art, short video applications are becoming increasingly popular, more and more short videos are published on network platforms, and the variety of video types keeps growing. To give users a better viewing experience, a network platform generally pushes video content according to a user's historical search records or the types of anchors the user follows, and recommends videos to the user based on the video content; a scheme for identifying the video content is therefore needed.
Based on the above problems, the embodiments of the present disclosure introduce several optional application scenarios of the video recognition method:
scene 1: and determining the content label of the short video released by the user in the process of releasing the short video by the user.
As shown in fig. 1, the video recognition system in this scenario includes a user 10, a terminal device 11, and a server 12.
The user 10 issues a short video through a client installed on the terminal device 11; after the client acquires the short video uploaded by the user 10, the short video uploaded by the user 10 is sent to the server 12; after receiving the short video uploaded by the user 10, the server 12 acquires a reference image of the short video issued by the user 10 and text information corresponding to the reference image; the server 12 performs target detection on the reference image, acquires an image feature vector for representing the pixel feature of the region where the target object is located in the reference image, and performs fusion processing on the image feature vector and preset relative position information to obtain fusion image features; the preset relative position information is used for representing the relative position of each characteristic value in the image characteristic vector in the reference image; the server 12 performs feature extraction on the text information to obtain text features corresponding to the text information; the server 12 performs fusion processing on the fusion image characteristics and the text characteristics to obtain semantic information for identifying video content of the released short video; the content tag of the short video issued by the user 10 is determined based on the obtained semantic information.
Scene 2: and when the user searches the short videos through the keywords, determining the video content of the candidate short videos in the candidate short video resource pool.
As shown in fig. 2, the video recognition system in this scenario includes a user 20, a terminal device 21, and a server 22.
The user 20 operates a client installed on the terminal device 21; the client acquires a search keyword input by the user 20 in the client search box and transmits it to the server 22. The server 22 needs to determine, from the candidate short video resource pool, short videos matching the search keyword and recommend them to the user; to determine the matching degree between a candidate short video and the search keyword, the server 22 must first determine the video content of the candidate short video, and the matching degree is then determined according to that video content. For each candidate short video in the candidate short video resource pool, the server 22 acquires the reference image of the candidate short video and the text information corresponding to the reference image; the server 22 performs target detection on the reference image, acquires an image feature vector for representing the pixel features of the region where the target object is located in the reference image, and performs fusion processing on the image feature vector and preset relative position information to obtain a fused image feature, wherein the preset relative position information is used for representing the relative position of each feature value in the image feature vector in the reference image; the server 22 performs feature extraction on the text information to obtain text features corresponding to the text information; the server 22 performs fusion processing on the fused image feature and the text features to obtain semantic information for identifying the video content of the candidate short video; finally, the server 22 matches the search keyword with the semantic information of the video content and recommends short videos with a high matching degree to the user 20 through the client.
Scene 3: and recommending the short video to the user from the candidate short video resource pool when the user logs in the short video client.
As shown in fig. 3, the video recognition system in this scenario includes a user 30, a terminal device 31, and a server 32. The user 30 operates a client installed on the terminal device 31; after the user logs in to the client through an account, the client sends a page display request to the server 32. The server 32 obtains the account features of the user 30, which may include the video types historically watched by the user 30, the types of anchors followed, historical search keywords, and the like; according to these account features, the server 32 determines short videos matching them from the candidate short video resource pool and returns the determined short videos to the client, which displays the recommended short videos in its display page. When determining the matching degree between the account features of the user 30 and a short video in the candidate short video resource pool, the server 32 acquires, for any short video in the pool, the reference image of that short video and the text information corresponding to the reference image; the server 32 performs target detection on the reference image, acquires an image feature vector for representing the pixel features of the region where the target object is located in the reference image, and performs fusion processing on the image feature vector and preset relative position information to obtain a fused image feature, wherein the preset relative position information is used for representing the relative position of each feature value in the image feature vector in the reference image; the server 32 performs feature extraction on the text information to obtain text features corresponding to the text information; the server 32 performs fusion processing on the fused image feature and the text features to obtain semantic information for identifying the video content of the short video; finally, the server 32 matches the account features of the user 30 with the semantic information of the video content and returns short videos with a high matching degree to the client, which recommends the received short videos to the user 30 in the display page.
A video recognition method provided by an exemplary embodiment of the present disclosure is described below with reference to fig. 4 to 10 in conjunction with the application scenario described above. It should be noted that the above application scenario is only shown for the convenience of understanding the spirit and principle of the present application, and the embodiments of the present disclosure are not limited in any way in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
Fig. 4 is a flow chart of a video recognition method according to an exemplary embodiment, which may include the steps of:
step S401, obtaining a reference image in a video to be identified and text information corresponding to the reference image;
Step S402, performing target detection on a reference image, obtaining an image feature vector for representing pixel features of an area where a target object is located in the reference image, and performing fusion processing on the image feature vector and preset relative position information to obtain fusion image features; the preset relative position information is used for representing the relative position of each characteristic value in the image characteristic vector in the reference image;
Step S403, extracting features of the text information to obtain text features corresponding to the text information;
And step S404, fusing the fused image features and the text features to obtain semantic information for identifying video content of the video to be identified.
The embodiment of the disclosure provides a scheme for automatically identifying video content. After the reference image in the video to be identified and the text information corresponding to the reference image are acquired, target detection can be performed on the reference image to obtain an image feature vector representing the pixel features of the area where the target object is located in the reference image. Once the image feature vector of the target object in the reference image is obtained, it can be fused directly with the preset relative position information to obtain the fused image feature, where the preset relative position information is used for representing the relative position of each feature value in the image feature vector in the reference image. Since the fused image feature contains both the pixel features and the position information of the reference image, identifying the video content from the fused image feature improves accuracy, and fusing the preset relative position information with the image feature vector improves the efficiency of obtaining the fused image feature of the reference image. In addition, after feature extraction is performed on the text information to obtain the text features, the text features and the fused image feature can be fused to obtain semantic information for identifying the video content of the video to be identified; because the fused image feature and the text features of the reference image are combined when identifying the video content, the video content can be identified accurately through cross-modal features.
The embodiment of the disclosure can take the cover image of the video to be identified as the reference image, or extract at least one frame of image from the video to be identified as the reference image according to a preset time interval.
It should be noted that, the cover image of the video to be identified may be preset.
According to the embodiment of the disclosure, the cover image of the video to be identified can be used as the reference image. The cover image is an important part of the video to be identified: it is intended to catch the user's eye and, being chosen with a full understanding of the video's content, it is representative and presents the most appealing part of that content. Using the cover image as the reference image therefore makes the identified content more representative and accurate, which improves the accuracy of identifying the video content. In addition, the embodiment of the disclosure can extract at least one frame of image from the video to be identified as the reference image according to a preset time interval; because the video to be identified contains multiple frames, extracting at least one frame from them at a preset time interval lets the reference images cover the content of the video to be identified as much as possible, so that the video content can be identified accurately.
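By way of illustration only, the frame-sampling variant could be realized as follows; the disclosure does not mandate any particular decoder, so the use of OpenCV, the 2-second interval and the helper name below are assumptions.

```python
import cv2  # assumed dependency; any decoder that exposes FPS and sequential reads would do

def sample_reference_images(video_path: str, interval_s: float = 2.0):
    """Extract one frame every `interval_s` seconds as candidate reference images."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0        # fall back if FPS metadata is missing
    step = max(int(round(fps * interval_s)), 1)    # number of frames between samples
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)                   # BGR ndarray, H x W x 3
        idx += 1
    cap.release()
    return frames
```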
The text information corresponding to the reference image may be text displayed in the reference image, including but not limited to:
caption information and text comment information in the reference image.
In the process of identifying the video content of the video to be identified, a reference image corresponding to the video to be identified can be obtained in an image storage space corresponding to the video to be identified, and text information corresponding to the reference image can be obtained in a text storage space corresponding to the video to be identified;
The text information corresponding to the reference image stored in the text storage space of the video to be identified may be text information that was input for the reference image when the user uploaded the video; alternatively, it may be obtained from the voice data corresponding to the reference image frame, where the voice data is converted into text data through voice recognition and the converted text data is stored as the text information corresponding to the reference image.
Alternatively, after the reference image corresponding to the video to be identified is acquired from the image storage space corresponding to the video to be identified, the corresponding text information can be identified from the reference image itself;
an alternative embodiment is to recognize the reference image by optical character recognition (OCR) so as to identify the text information in the reference image.
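A minimal sketch of this OCR alternative, assuming Tesseract (through the pytesseract wrapper) as the recognition engine; the disclosure does not name a specific OCR implementation, so the library and the language codes are illustrative assumptions.

```python
import pytesseract            # assumed OCR engine; the disclosure only requires some OCR capability
from PIL import Image

def extract_text_from_reference_image(image_path: str) -> str:
    """Recognize subtitle/caption text rendered in the reference image."""
    image = Image.open(image_path)
    # "chi_sim+eng" covers simplified Chinese captions with embedded English; adjust as needed
    return pytesseract.image_to_string(image, lang="chi_sim+eng").strip()
```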
After the reference image of the video to be identified and the text information corresponding to the reference image are obtained, the embodiment of the disclosure needs to perform target detection on the reference image to obtain the fusion image characteristics of the target object in the reference image, and performs characteristic extraction on the text information to obtain the text characteristics corresponding to the text information;
The following describes a method of performing object detection with respect to a reference image and a method of performing feature extraction with respect to text information, respectively:
1. And performing target detection on the reference image.
When the embodiment of the disclosure detects the target of the reference image, the region where the target object in the reference image is located needs to be identified;
An alternative implementation manner is that the embodiment of the disclosure performs target detection on the reference image based on the DETR model, and identifies an area where a target object is located in the reference image.
The detailed schematic of the DETR model shown in fig. 5 includes a convolutional neural network (CNN) and a Transformer network; as shown in fig. 6, when the target object in the reference image is detected, the area where the target object is located in the reference image is identified by means of a detection frame.
It should be noted that, in the process of detecting the target object in the reference image through the DETR model, the area where the target object is located and the background area can be distinguished; one reference image may contain one or more target objects;
target objects in the reference image include, but are not limited to:
People in the image, animals in the image, buildings in the image, plants in the image, roads in the image.
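For orientation, the sketch below obtains detection frames for target objects with an off-the-shelf pretrained DETR detector; the Hugging Face API and the public checkpoint name are assumptions and are not the model trained in this disclosure.

```python
import torch
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image

# assumed public checkpoint; the patent's DETR model is trained separately on its own data
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

def detect_target_objects(reference_image: Image.Image, threshold: float = 0.7):
    """Return (label, score, box) tuples for regions where target objects are located."""
    inputs = processor(images=reference_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([reference_image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, target_sizes=target_sizes, threshold=threshold)[0]
    return list(zip(results["labels"].tolist(),
                    results["scores"].tolist(),
                    results["boxes"].tolist()))
```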
After determining the region where the target object in the reference image is located, the image feature vector may be generated according to the following manner;
Extracting image features of the region where the target object is located according to the pixel values of the region where the target object is located in the reference image, so as to obtain a plurality of feature values used for representing the pixel features of the region where the target object is located in the reference image; and generating an image feature vector according to the obtained feature values.
When acquiring the image features of the reference image, the embodiment of the disclosure first detects the target object in the reference image and extracts the image features of the area where the target object is located, so that the image features acquired from the reference image are more targeted. When the image features of the region where the target object is located are extracted, the feature values representing the pixel features of that region in the reference image are obtained according to the pixel values of the region, and the image feature vector is generated from the obtained feature values.
According to the embodiment of the disclosure, the CNN network in the DETR model performs image feature extraction on the area where the target object is located according to the pixel values of that area in the reference image, so as to obtain a plurality of feature values; an n×m matrix is then generated from these feature values, and the n×m matrix is converted into a one-dimensional vector. The one-dimensional vector can be represented as a matrix with 1 row and n×m columns, or as a matrix with n×m rows and 1 column; this one-dimensional vector is the generated image feature vector.
For example, in the embodiment of the present disclosure, image feature extraction is performed on the area where the target object is located based on the CNN network in the DETR model, and a matrix U7x7 is generated from the feature values obtained by the extraction, where U7x7 has 7 rows and 7 columns. The elements of row 1 are U11, U12, U13, U14, U15, U16, U17; row 2: U21, U22, U23, U24, U25, U26, U27; row 3: U31, U32, U33, U34, U35, U36, U37; row 4: U41, U42, U43, U44, U45, U46, U47; row 5: U51, U52, U53, U54, U55, U56, U57; row 6: U61, U62, U63, U64, U65, U66, U67; row 7: U71, U72, U73, U74, U75, U76, U77. The matrix U7x7 is then converted into a matrix V with 1 row and 7x7 = 49 columns that represents the image features. Fig. 7 is a schematic diagram illustrating the generation of an image feature vector according to an exemplary embodiment.
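The matrix-to-vector conversion described above amounts to a simple flattening step; the sketch below assumes a 7x7 feature map per detected region purely for illustration.

```python
import torch

def region_features_to_vector(feature_map: torch.Tensor) -> torch.Tensor:
    """Flatten an n x m feature map (e.g. 7 x 7) into a 1 x (n*m) image feature vector."""
    n, m = feature_map.shape
    return feature_map.reshape(1, n * m)   # row-major order: U11, U12, ..., U17, U21, ...

# example: a 7 x 7 map becomes a 1 x 49 vector V
V = region_features_to_vector(torch.randn(7, 7))
```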
According to the embodiment of the disclosure, after the image feature vector used for representing the pixel feature of the region where the target object is located in the reference image is obtained through the CNN in the DETR model, fusion processing is required to be carried out on the image feature vector and preset relative position information, so that fusion image features of the target object in the reference image are obtained;
specifically, mapping the image feature vector and preset relative position information to obtain a first embedded vector; and according to the attention weight parameters, carrying out fusion processing on each element in the first embedded vector to obtain fusion image characteristics.
According to the embodiment of the disclosure, after the image feature vector of the target object in the reference image is obtained, the relative position of each feature value in the image feature vector in the reference image can be artificially set, the image feature vector and the preset relative position information representing each feature value in the reference image are mapped to obtain the first embedded vector, after the first embedded vector is obtained, each element in the first embedded vector can be fused according to the attention weight parameter to obtain the fused image feature, and the obtained fused image feature can more accurately reflect the content of the reference image, so that the accuracy of identification can be improved in the process of identifying the content of the video to be identified.
It should be noted that, in the embodiment of the present disclosure, the CNN network in the DETR model performs image feature extraction on the region where the target object is located, and after an n×m matrix is generated from the feature values obtained by the extraction, the n×m matrix is converted into a one-dimensional vector. At this point the relative positional relationship, within the reference image, of the feature values contained in the n×m matrix is lost, and the relative position of each feature value in the reference image cannot be recovered from the converted one-dimensional vector; therefore, the relative positional relationship of each feature value in the image feature vector in the reference image can be preset manually.
For example, as shown by the matrix U7x7 in fig. 7, after the matrix U7x7, which contains 49 feature values, is converted into the matrix V with 1 row and 49 columns, that is, after the image feature vector is generated, the relative position of each feature value in the matrix V in the reference image may be preset manually. Taking the feature values U11, U12, U13, U14, U15, U16, U17 as an example, the manually preset relative positional relationship of these feature values in the reference image is briefly introduced below.
The relative position of the feature value U12 in the reference image is the first position directly to the right of the feature value U11, taking U11 as the reference position in the reference image; the relative position of the feature value U13 is the second position directly to the right of U11; the relative position of the feature value U14 is the first position directly to the right of the feature value U13; the relative position of U15 is the first position directly to the left of the feature value U16, and it may also be described as the second position directly to the left of the feature value U17; the relative position of U17 is the second position directly to the right of the feature value U15.
It should be noted that, when the relative positions of the feature values in the reference image are preset, if the feature value chosen as the reference position in the reference image changes, the described relative positions of the other feature values change accordingly; depending on the manual presetting, the relative positional relationship may be described as directly left, directly right, directly above, directly below, upper left, lower right, and so on, which is not limited herein.
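One possible realization of the preset relative position information is a learned positional embedding added to each projected feature value, as sketched below; the embedding dimension and the use of learned (rather than fixed) position codes are implementation assumptions, not requirements of the disclosure.

```python
import torch
import torch.nn as nn

class ImagePositionEmbedding(nn.Module):
    """Map a flattened image feature vector plus preset relative positions to a first embedded vector."""
    def __init__(self, num_positions: int = 49, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(1, d_model)                      # lift each scalar feature value
        self.pos_embed = nn.Embedding(num_positions, d_model)  # one embedding per preset relative position

    def forward(self, feature_vector: torch.Tensor) -> torch.Tensor:
        # feature_vector: (1, num_positions), e.g. the 1 x 49 vector V from the flattening step
        feats = self.proj(feature_vector.transpose(0, 1))      # (num_positions, d_model)
        positions = torch.arange(feature_vector.shape[1])
        return feats + self.pos_embed(positions)               # first embedded vector
```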
An optional implementation manner is that, in the embodiment of the disclosure, the first Transformer network in the DETR model may perform fusion processing on the image feature vector and the preset relative position information, so as to obtain the fused image feature of the target object in the reference image;
The first Transformer network is an Encoder-Decoder network model based on an attention mechanism;
The image feature vector output by the CNN network is mapped together with the preset relative position information to obtain a first embedded vector; as shown in the flow chart of processing the first embedded vector based on the attention mechanism in fig. 8, each element in the first embedded vector is fused according to the attention weight parameters to obtain the fused image feature. The attention weight parameters include the parameters in the three weight matrices obtained by the first Transformer network during training and an attention coefficient w;
Specifically, the first embedded vector is multiplied by each of the three weight matrices obtained by the first Transformer network during training, yielding a query matrix Wf1, a key matrix Wf2 and a value matrix Wf3; the query matrix Wf1 and the key matrix Wf2 are dot-multiplied and normalized through a Softmax function to obtain a first dot-product matrix; the first dot-product matrix and the value matrix Wf3 are dot-multiplied to obtain a second dot-product matrix; and the second dot-product matrix is multiplied by the attention coefficient w to obtain an output vector;
According to the embodiment of the disclosure, after the output vector is obtained based on the attention mechanism in the first Transformer network, further feature extraction is performed on the output vector through the feedforward neural network in the first Transformer network, so as to obtain the fused image feature of the reference image output by the first Transformer network.
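Read literally, this fusion step is a single attention head followed by a feed-forward layer. The sketch below mirrors the description (three learned weight matrices, a Softmax-normalized dot product and an attention coefficient w); it is an interpretation of the text rather than the exact code of the first Transformer network.

```python
import torch
import torch.nn.functional as F

def attention_fusion(E: torch.Tensor, W1: torch.Tensor, W2: torch.Tensor,
                     W3: torch.Tensor, w: float = 1.0) -> torch.Tensor:
    """Fuse the elements of an embedded vector E (seq_len x d_model) as described above."""
    Q = E @ W1                 # query matrix (Wf1 in the description)
    K = E @ W2                 # key matrix (Wf2)
    V = E @ W3                 # value matrix (Wf3)
    scores = F.softmax(Q @ K.transpose(0, 1), dim=-1)  # first dot-product matrix, Softmax-normalized
    fused = scores @ V                                  # second dot-product matrix
    return w * fused                                    # scaled by the attention coefficient w

# example usage with assumed dimensions (49 positions, 256-dimensional model)
E = torch.randn(49, 256)
W1, W2, W3 = (torch.randn(256, 256) for _ in range(3))
fused_image_feature = attention_fusion(E, W1, W2, W3, w=1.0)
```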
In addition, the reference image in the video to be identified often contains text information corresponding to it, such as subtitle information and annotation information contained in the reference image, and the video content of the video to be identified can also be identified according to that text information. The embodiment of the disclosure can therefore combine the fused image feature of the reference image with the text information corresponding to the reference image for cross-modal fusion, so as to further identify the video content of the video to be identified.
2. And extracting the characteristics of the text information corresponding to the reference image.
Extracting character vectors and/or word vectors in the text information; mapping the extracted character vectors and/or word vectors to obtain a second embedded vector; and fusing each element in the second embedded vector according to the attention weight parameters to obtain the text features.
It should be noted that, in a piece of text information, the meaning expressed by a character or a word is usually related to its context, and each character or word plays a different role in understanding the text information. Therefore, to better understand the text information, it needs to be represented in vector form: the character vectors or word vectors in the text information are extracted, and feature extraction is performed on the text information based on each character vector or word vector.
Because the embodiment of the disclosure generates the embedded vector from the character vectors and/or word vectors in the text information when extracting the features of the text information, and fuses the elements in the embedded vector using the attention weight parameters, the fused text features can better reflect the real information of the text, which further improves the accuracy of identifying the video content.
An optional implementation manner is that, in an embodiment of the present disclosure, the text features corresponding to the text information may be extracted by a trained second Transformer network;
In implementation, the text information is input into the trained second Transformer network, which performs feature extraction on the text information, and the text features corresponding to the text information output by the second Transformer network are obtained.
The second Transformer network is an Encoder network model based on the attention mechanism;
When feature extraction is performed on the text information through the second Transformer network, the character vectors and/or word vectors corresponding to the text information are determined, and the second embedded vector input into the second Transformer network is generated from these character vectors and/or word vectors;
As shown in the flow chart of processing the embedded vector based on the attention mechanism in fig. 8, each element in the second embedded vector is fused according to the attention weight parameters to obtain the text features. The attention weight parameters include the parameters in the three weight matrices obtained by the second Transformer network during training and an attention coefficient w;
Specifically, the second embedded vector generated from the character vectors and/or word vectors corresponding to the text information is multiplied by each of the three weight matrices obtained by the second Transformer network during training, yielding a query matrix Wf1, a key matrix Wf2 and a value matrix Wf3; the query matrix Wf1 and the key matrix Wf2 are dot-multiplied and normalized through a Softmax function to obtain a first dot-product matrix; the first dot-product matrix and the value matrix Wf3 are dot-multiplied to obtain a second dot-product matrix; and the second dot-product matrix is multiplied by the attention coefficient w to obtain an output vector;
According to the embodiment of the disclosure, after the output vector is obtained based on the attention mechanism in the second Transformer network, further feature extraction is performed on the output vector through the feedforward neural network in the second Transformer network, so as to obtain the text features corresponding to the text information output by the second Transformer network.
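A compact sketch of the text branch, assuming a plain vocabulary lookup for the character/word vectors and a stock Transformer encoder in place of the trained second Transformer network; vocabulary size, model dimension and tokenization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """Text-branch sketch: character/word embeddings -> attention-based fusion -> text features."""
    def __init__(self, vocab_size: int = 20000, d_model: int = 256, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # produces the second embedded vector
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer ids of characters and/or words
        second_embedded = self.embed(token_ids)
        return self.encoder(second_embedded)             # text features, (batch, seq_len, d_model)
```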
After the fusion image characteristics of the reference image and the text characteristics of the text information corresponding to the reference image are obtained, the fusion image characteristics and the text characteristics can be fused in the following manner to obtain semantic information for identifying video content of the video to be identified;
respectively carrying out embedding treatment on the fusion image features and the text features to respectively obtain a third embedded vector and a fourth embedded vector;
Based on the first attention mechanism module, according to attention weight parameters corresponding to the first attention mechanism module, carrying out fusion processing on each element in the third embedded vector to obtain intermediate fusion image characteristics; based on the second attention mechanism module, according to attention weight parameters corresponding to the second attention mechanism module, carrying out fusion processing on each element in the fourth embedded vector to obtain intermediate text features;
And fusing part of the features in the intermediate fused image features with part of the features in the intermediate text features to obtain semantic information.
When semantic information for identifying the video content of the video to be identified is obtained, the embodiment of the disclosure performs embedding processing on the fused image feature to obtain a third embedded vector, performs embedding processing on the text features to obtain a fourth embedded vector, and then performs fusion processing based on the attention mechanism modules. During fusion, each element in the third embedded vector is fused using the attention weight parameters corresponding to the first attention mechanism module to obtain the intermediate fused image feature, and each element in the fourth embedded vector is fused using the attention weight parameters corresponding to the second attention mechanism module to obtain the intermediate text features; finally, part of the features in the intermediate fused image feature and part of the features in the intermediate text features are fused to obtain the semantic information for identifying the video content of the video to be identified. The semantic information obtained by the embodiment of the disclosure is a feature obtained by fusing the fused image feature and the text features, i.e. a cross-modal feature, so when the semantic information is used to represent the video content, both the image features and the text features of the video are fully considered, and the identified video content is more accurate.
An alternative implementation manner is that the fused image feature and the text features of the target object are fused based on a mutual attention mechanism through a trained third Transformer network that includes a mutual attention mechanism module, so as to obtain the semantic information output by the third Transformer network for identifying the video content of the video to be identified.
The third Transformer network is an Encoder network model based on a mutual attention mechanism;
As shown in fig. 9, which is a flow chart of processing the fused image feature and the text features based on the mutual attention mechanism, the fused image feature is input into the first attention mechanism module on the fused-image-feature side of the third Transformer network, and the text features are input into the second attention mechanism module on the text-feature side of the third Transformer network;
based on the first attention mechanism module, embedding the fused image features to obtain a third embedded vector; based on the second attention mechanism module, embedding the text features to obtain a fourth embedded vector;
Specifically, in the first attention mechanism module on the fused-image-feature side, the third embedded vector is respectively multiplied by three weight matrices obtained by the third Transformer network during training to obtain the intermediate fused image features, which are processed using the query of the first attention mechanism module together with the key and value of the second attention mechanism module; in the second attention mechanism module on the text-feature side, the fourth embedded vector is respectively multiplied by three weight matrices obtained by the third Transformer network during training to obtain the intermediate text features, which are processed using the query of the second attention mechanism module together with the key and value of the first attention mechanism module. The features output by the first attention mechanism module are further processed by the feedforward neural network module on the fused-image-feature side of the third Transformer network to obtain image output features; the features output by the second attention mechanism module are further processed by the feedforward neural network module on the text-feature side of the third Transformer network to obtain text output features. Cross-modal feature interaction is thereby achieved, so that the image output features contain text information and the text output features contain reference image information.
It should be noted that, in the embodiment of the present disclosure, after the intermediate fused image features and the intermediate text features are determined, part of the features in the intermediate fused image features are fused with part of the features in the intermediate text features, so that the image output features output on the fused-image-feature side and the text output features output on the text-feature side are the same and can both be used to represent the semantic information of the video content of the video to be identified; that is, either the image output features or the text output features may be selected to represent the semantic information.
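The sketch below illustrates the mutual-attention exchange described above in NumPy: each side projects its embedded vector with its own weight matrices, then the image-side query attends to the text-side key/value while the text-side query attends to the image-side key/value. The weight shapes, random stand-in parameters, and the final pooling of one output into a semantic vector are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def project(x, Wq, Wk, Wv):
    return x @ Wq, x @ Wk, x @ Wv

def cross_attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def mutual_attention(img_emb, txt_emb, img_W, txt_W):
    """img_emb: (num_regions, d) third embedded vector; txt_emb: (num_tokens, d) fourth embedded vector."""
    Qi, Ki, Vi = project(img_emb, *img_W)   # intermediate fused image features (Q/K/V form)
    Qt, Kt, Vt = project(txt_emb, *txt_W)   # intermediate text features (Q/K/V form)
    img_out = cross_attend(Qi, Kt, Vt)      # image-side query over text-side key/value
    txt_out = cross_attend(Qt, Ki, Vi)      # text-side query over image-side key/value
    return img_out, txt_out

rng = np.random.default_rng(1)
d = 64
img_emb = rng.normal(size=(6, d))    # e.g. one row per detected region
txt_emb = rng.normal(size=(12, d))   # e.g. one row per text token
img_W = tuple(rng.normal(size=(d, d)) for _ in range(3))
txt_W = tuple(rng.normal(size=(d, d)) for _ in range(3))
img_out, txt_out = mutual_attention(img_emb, txt_emb, img_W, txt_W)
# After the per-side feedforward layers, either output can stand for the
# cross-modal semantic information, as noted above.
semantic = img_out.mean(axis=0)
```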
Fig. 10 shows the complete structure of a video recognition system: a reference image of the video to be identified is input into a DETR model; a CNN network in the DETR model performs target detection on the reference image, recognizes the target object, and obtains an image feature vector representing the pixel features of the region where the target object is located in the reference image; a first Transformer network in the DETR model fuses the image feature vector with the preset relative position information to obtain the fused image features; the trained second Transformer network extracts the text features corresponding to the text information; and the trained third Transformer network, which includes the mutual attention mechanism module, fuses the fused image features and the text features based on the mutual attention mechanism to obtain the semantic information for identifying the video content of the video to be identified.
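As a minimal sketch of the "image feature vector plus preset relative position information" step in this pipeline, and assuming a DETR-style fixed 2D sinusoidal encoding (the patent only requires that the relative position of each feature value in the reference image be represented, so the sinusoidal form and the dimensions below are assumptions), the position information can be added to the flattened feature map before the first Transformer stage:

```python
import numpy as np

def relative_position_encoding(h, w, d_model):
    """Fixed 2D sinusoidal encoding: one d_model-dim vector per (row, col) cell."""
    def encode(pos, dim):
        i = np.arange(dim)
        angles = pos[:, None] / np.power(10000.0, (2 * (i // 2)) / dim)
        enc = np.zeros((len(pos), dim))
        enc[:, 0::2] = np.sin(angles[:, 0::2])
        enc[:, 1::2] = np.cos(angles[:, 1::2])
        return enc
    row = encode(np.arange(h), d_model // 2)                 # (h, d/2) row-position part
    col = encode(np.arange(w), d_model // 2)                 # (w, d/2) column-position part
    return np.concatenate([np.repeat(row, w, axis=0),
                           np.tile(col, (h, 1))], axis=1)    # (h*w, d)

h, w, d_model = 7, 7, 64
feature_map = np.random.default_rng(2).normal(size=(h * w, d_model))  # flattened CNN output
first_embedded = feature_map + relative_position_encoding(h, w, d_model)
# first_embedded would then pass through the first Transformer's attention fusion
# to yield the fused image features.
```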
Fig. 11 is a block diagram of a video recognition apparatus 1100 according to an exemplary embodiment, and referring to fig. 11, the apparatus includes an acquisition unit 1101, a detection unit 1102, an extraction unit 1103, and a processing unit 1104.
An acquisition unit configured to perform acquisition of a reference image in a video to be identified, and text information corresponding to the reference image;
The detection unit is configured to perform target detection on the reference image, acquire an image feature vector used for representing pixel features of an area where a target object is located in the reference image, and perform fusion processing on the image feature vector and preset relative position information to obtain fusion image features; the preset relative position information is used for representing the relative position of each characteristic value in the image characteristic vector in the reference image;
The extraction unit is configured to perform feature extraction on the text information to obtain text features corresponding to the text information;
And the processing unit is configured to perform fusion processing on the fusion image features and the text features to obtain semantic information for identifying video content of the video to be identified.
In an alternative embodiment, the detecting unit 1102 is configured to perform target detection on the reference image, and identify the area where the target object is located in the reference image;
Extracting image features of the region where the target object is located according to pixel values of the region where the target object is located in the reference image, so as to obtain a plurality of feature values used for representing the pixel features of the region where the target object is located in the reference image;
and generating the image feature vector according to the obtained feature values.
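The toy sketch below illustrates this feature extraction under the assumption that the detector has already returned a bounding box (x, y, w, h) for the target object: the pixel values inside that region are pooled over a small grid, and the resulting feature values are flattened into the image feature vector. The pooling scheme and grid size are illustrative stand-ins, not the CNN features used in the embodiments.

```python
import numpy as np

def region_feature_vector(image, box, grid=(4, 4)):
    """image: (H, W) array of pixel values; box: (x, y, w, h) region of the target object."""
    x, y, w, h = box
    region = image[y:y + h, x:x + w]
    gh, gw = grid
    rows = np.array_split(region, gh, axis=0)
    cells = [np.array_split(r, gw, axis=1) for r in rows]
    # One mean pixel value per grid cell, flattened into the feature vector.
    return np.array([c.mean() for row in cells for c in row])

image = np.random.default_rng(3).integers(0, 256, size=(240, 320)).astype(float)
vec = region_feature_vector(image, box=(40, 30, 64, 64))   # 16 feature values
```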
In an alternative embodiment, the detecting unit 1102 is configured to map the image feature vector with preset relative position information to obtain a first embedded vector;
and according to the attention weight parameter, carrying out fusion processing on each element in the first embedded vector to obtain the fusion image characteristic.
In an alternative embodiment, the extracting unit 1103 is configured to perform extraction of character vectors and/or word vectors in the text information;
Mapping the extracted character vectors and/or word vectors to obtain a second embedded vector;
and according to the attention weight parameter, carrying out fusion processing on each element in the second embedded vector to obtain the text feature.
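A small sketch of this text path follows: the text information is split into character-level and word-level tokens, each token is looked up in an embedding table, and the vectors are stacked into the second embedded vector. The vocabulary handling, embedding size, and whitespace tokenization are assumptions for illustration; trained embedding tables would come from the second Transformer network.

```python
import numpy as np

rng = np.random.default_rng(4)
d_model = 64
char_table = {}   # character-vector lookup, filled lazily with random stand-ins
word_table = {}   # word-vector lookup

def lookup(table, token):
    if token not in table:
        table[token] = rng.normal(size=d_model)
    return table[token]

def second_embedded_vector(text):
    char_vecs = [lookup(char_table, ch) for ch in text if not ch.isspace()]
    word_vecs = [lookup(word_table, wd) for wd in text.split()]
    return np.stack(char_vecs + word_vecs)   # (num_tokens, d_model)

emb = second_embedded_vector("funny cat video")
# emb would then be fused element-wise according to the attention weight
# parameters to give the text features.
```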
In an alternative embodiment, the processing unit 1104 is configured to perform embedding processing on the fused image feature and the text feature to obtain a third embedded vector and a fourth embedded vector respectively;
Based on a first attention mechanism module, according to attention weight parameters corresponding to the first attention mechanism module, carrying out fusion processing on each element in the third embedded vector to obtain intermediate fusion image characteristics; based on a second attention mechanism module, according to attention weight parameters corresponding to the second attention mechanism module, carrying out fusion processing on each element in the fourth embedded vector to obtain intermediate text features;
And fusing part of the features in the intermediate fused image features with part of the features in the intermediate text features to obtain the semantic information.
In an alternative embodiment, the obtaining unit 1101 is configured to take the cover image of the video to be identified as the reference image; or
And extracting at least one frame of image from the video to be identified as the reference image according to a preset time interval.
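A hedged OpenCV sketch of this acquisition step is given below: either the cover image is used as the reference image, or frames are sampled from the video at a preset time interval. The file paths and the interval value are assumptions for illustration.

```python
import cv2

def reference_images(video_path, cover_path=None, interval_seconds=5.0):
    if cover_path is not None:
        return [cv2.imread(cover_path)]           # use the cover image directly
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0       # fall back if FPS is unavailable
    step = max(int(round(fps * interval_seconds)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                      # keep one frame per interval
            frames.append(frame)
        index += 1
    cap.release()
    return frames

refs = reference_images("video_to_identify.mp4", interval_seconds=5.0)
```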
Fig. 12 is a block diagram of a video recognition device 1200, according to an example embodiment, comprising:
a processor 1201 and a memory 1202 for storing instructions executable by the processor 1201;
wherein the processor 1201 is configured to execute the instructions to implement the video recognition method as in the above embodiments.
In an exemplary embodiment, a computer readable storage medium is also provided, such as a memory 1202, comprising instructions executable by the processor 1201 of the video recognition apparatus 1200 to perform the video recognition method described above. Alternatively, the storage medium may be a non-transitory computer readable storage medium, such as ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided which, when run on an electronic device, causes the electronic device to perform the video recognition method of the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A method of video recognition, comprising:
acquiring a reference image in a video to be identified and text information corresponding to the reference image;
Performing target detection on the reference image, obtaining an image feature vector for representing the pixel feature of the region where the target object is located in the reference image, and performing fusion processing on the image feature vector and preset relative position information to obtain fusion image features; the preset relative position information is used for representing the relative position of each characteristic value in the image characteristic vector in the reference image; and
Extracting features of the text information to obtain text features corresponding to the text information;
Carrying out fusion processing on the fusion image features and the text features to obtain semantic information for identifying video content of the video to be identified;
The fusing processing is performed on the fused image features and the text features to obtain semantic information for identifying video content of the video to be identified, including:
Respectively carrying out embedding treatment on the fusion image features and the text features to respectively obtain a third embedded vector and a fourth embedded vector;
Based on a first attention mechanism module, according to attention weight parameters corresponding to the first attention mechanism module, carrying out fusion processing on each element in the third embedded vector to obtain intermediate fusion image characteristics; based on a second attention mechanism module, according to attention weight parameters corresponding to the second attention mechanism module, carrying out fusion processing on each element in the fourth embedded vector to obtain intermediate text features;
And fusing part of the features in the intermediate fused image features with part of the features in the intermediate text features to obtain the semantic information.
2. The video recognition method according to claim 1, wherein the performing target detection on the reference image to obtain an image feature vector for representing the pixel features of the area where the target object is located in the reference image includes:
Performing target detection on the reference image, and identifying an area where the target object is located in the reference image;
Extracting image features of the region where the target object is located according to pixel values of the region where the target object is located in the reference image, so as to obtain a plurality of feature values used for representing the pixel features of the region where the target object is located in the reference image;
and generating the image feature vector according to the obtained feature values.
3. The video recognition method according to claim 1 or 2, wherein the fusing the image feature vector and preset relative position information to obtain a fused image feature includes:
mapping the image feature vector and preset relative position information to obtain a first embedded vector;
and according to the attention weight parameter, carrying out fusion processing on each element in the first embedded vector to obtain the fusion image characteristic.
4. The method for identifying video according to claim 1, wherein the feature extraction of the text information to obtain the text feature corresponding to the text information includes:
Extracting character vectors and/or word vectors in the text information;
Mapping the extracted character vectors and/or word vectors to obtain a second embedded vector;
and according to the attention weight parameter, carrying out fusion processing on each element in the second embedded vector to obtain the text feature.
5. The video recognition method according to any one of claims 1, 2, and 4, wherein the acquiring a reference image in the video to be recognized includes:
Taking the cover image of the video to be identified as the reference image; or
And extracting at least one frame of image from the video to be identified as the reference image according to a preset time interval.
6. A video recognition apparatus, comprising:
An acquisition unit configured to perform acquisition of a reference image in a video to be identified, and text information corresponding to the reference image;
The detection unit is configured to perform target detection on the reference image, acquire an image feature vector used for representing pixel features of an area where a target object is located in the reference image, and perform fusion processing on the image feature vector and preset relative position information to obtain fusion image features; the preset relative position information is used for representing the relative position of each characteristic value in the image characteristic vector in the reference image;
The extraction unit is configured to perform feature extraction on the text information to obtain text features corresponding to the text information;
The processing unit is configured to perform fusion processing on the fusion image features and the text features to obtain semantic information for identifying video content of the video to be identified;
the processing unit is configured to perform:
Respectively carrying out embedding treatment on the fusion image features and the text features to respectively obtain a third embedded vector and a fourth embedded vector;
Based on a first attention mechanism module, according to attention weight parameters corresponding to the first attention mechanism module, carrying out fusion processing on each element in the third embedded vector to obtain intermediate fusion image characteristics; based on a second attention mechanism module, according to attention weight parameters corresponding to the second attention mechanism module, carrying out fusion processing on each element in the fourth embedded vector to obtain intermediate text features;
And fusing part of the features in the intermediate fused image features with part of the features in the intermediate text features to obtain the semantic information.
7. The video recognition device of claim 6, wherein the detection unit is configured to perform:
Performing target detection on the reference image, and identifying an area where the target object is located in the reference image;
Extracting image features of the region where the target object is located according to pixel values of the region where the target object is located in the reference image, so as to obtain a plurality of feature values used for representing the pixel features of the region where the target object is located in the reference image;
and generating the image feature vector according to the obtained feature values.
8. The video recognition device according to claim 6 or 7, wherein the detection unit is configured to perform:
mapping the image feature vector and preset relative position information to obtain a first embedded vector;
and according to the attention weight parameter, carrying out fusion processing on each element in the first embedded vector to obtain the fusion image characteristic.
9. The video recognition device of claim 6, wherein the extraction unit is configured to perform:
Extracting character vectors and/or word vectors in the text information;
Mapping the extracted character vectors and/or word vectors to obtain a second embedded vector;
and according to the attention weight parameter, carrying out fusion processing on each element in the second embedded vector to obtain the text feature.
10. The video recognition device of claim 6, wherein the acquisition unit is configured to perform:
Taking the cover image of the video to be identified as the reference image; or
And extracting at least one frame of image from the video to be identified as the reference image according to a preset time interval.
11. A video recognition apparatus, comprising:
A processor;
A memory for storing the processor-executable instructions;
Wherein the processor is configured to execute the instructions to implement the video recognition method of any one of claims 1-5.
12. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of a video recognition device, enable the video recognition device to perform the video recognition method of any one of claims 1-5.
13. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the video recognition method of any one of claims 1 to 5.
CN202011607400.0A 2020-12-30 2020-12-30 Video identification method, device and computer readable storage medium Active CN112580599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011607400.0A CN112580599B (en) 2020-12-30 2020-12-30 Video identification method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011607400.0A CN112580599B (en) 2020-12-30 2020-12-30 Video identification method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112580599A CN112580599A (en) 2021-03-30
CN112580599B true CN112580599B (en) 2024-05-14

Family

ID=75144381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011607400.0A Active CN112580599B (en) 2020-12-30 2020-12-30 Video identification method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112580599B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114220063B (en) * 2021-11-17 2023-04-07 浙江大华技术股份有限公司 Target detection method and device
CN114880517A (en) * 2022-05-27 2022-08-09 支付宝(杭州)信息技术有限公司 Method and device for video retrieval
CN114996514A (en) * 2022-05-31 2022-09-02 北京达佳互联信息技术有限公司 Text generation method and device, computer equipment and medium
CN116704405B (en) * 2023-05-22 2024-06-25 阿里巴巴(中国)有限公司 Behavior recognition method, electronic device and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763325A (en) * 2018-05-04 2018-11-06 北京达佳互联信息技术有限公司 A kind of network object processing method and processing device
CN111191078A (en) * 2020-01-08 2020-05-22 腾讯科技(深圳)有限公司 Video information processing method and device based on video information processing model
CN111581437A (en) * 2020-05-07 2020-08-25 腾讯科技(深圳)有限公司 Video retrieval method and device
CN111626293A (en) * 2020-05-21 2020-09-04 咪咕文化科技有限公司 Image text recognition method and device, electronic equipment and storage medium
CN111626116A (en) * 2020-04-21 2020-09-04 泉州装备制造研究所 Video semantic analysis method based on fusion of multi-attention mechanism and Graph
CN111737458A (en) * 2020-05-21 2020-10-02 平安国际智慧城市科技股份有限公司 Intention identification method, device and equipment based on attention mechanism and storage medium
CN111753600A (en) * 2019-03-29 2020-10-09 北京市商汤科技开发有限公司 Text recognition method, device and storage medium
CN111767726A (en) * 2020-06-24 2020-10-13 北京奇艺世纪科技有限公司 Data processing method and device
CN111832620A (en) * 2020-06-11 2020-10-27 桂林电子科技大学 Image emotion classification method based on double-attention multilayer feature fusion
CN112101165A (en) * 2020-09-07 2020-12-18 腾讯科技(深圳)有限公司 Interest point identification method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112580599A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN112580599B (en) Video identification method, device and computer readable storage medium
CN110781347B (en) Video processing method, device and equipment and readable storage medium
CN111611436A (en) Label data processing method and device and computer readable storage medium
CN109829064B (en) Media resource sharing and playing method and device, storage medium and electronic device
EP3885966B1 (en) Method and device for generating natural language description information
CN109299399B (en) Learning content recommendation method and terminal equipment
CN113392236A (en) Data classification method, computer equipment and readable storage medium
CN110121108B (en) Video value evaluation method and device
CN112752153A (en) Video playing processing method, intelligent device and storage medium
CN111491209A (en) Video cover determining method and device, electronic equipment and storage medium
CN109816023B (en) Method and device for generating picture label model
CN113220940B (en) Video classification method, device, electronic equipment and storage medium
CN111597361B (en) Multimedia data processing method, device, storage medium and equipment
CN114528474A (en) Method and device for determining recommended object, electronic equipment and storage medium
KR102522989B1 (en) Apparatus and method for providing information related to product in multimedia contents
CN113031813A (en) Instruction information acquisition method and device, readable storage medium and electronic equipment
CN111933133A (en) Intelligent customer service response method and device, electronic equipment and storage medium
CN112822539A (en) Information display method, device, server and storage medium
CN114501163B (en) Video processing method, device and storage medium
CN110163043B (en) Face detection method, device, storage medium and electronic device
CN114390306A (en) Live broadcast interactive abstract generation method and device
CN115098729A (en) Video processing method, sample generation method, model training method and device
CN114329049A (en) Video search method and device, computer equipment and storage medium
CN110955799A (en) Face recommendation movie and television method based on target detection
CN112966173B (en) Classification operation method and device for information comments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant