CN116012871A - Object recognition method, device, computer equipment, storage medium and product - Google Patents

Object recognition method, device, computer equipment, storage medium and product

Info

Publication number
CN116012871A
Authority
CN
China
Prior art keywords
information
video
processed
feature
target object
Prior art date
Legal status
Pending
Application number
CN202111624652.9A
Other languages
Chinese (zh)
Inventor
李振阳 (Li Zhenyang)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Publication of CN116012871A


Abstract

The embodiment of the application discloses an object identification method, an object identification device, computer equipment, a storage medium and a product. A video to be processed and the video description information corresponding to the video to be processed are obtained; object detection is performed on video frames in the video to be processed to obtain an object image containing a target object; object feature extraction is performed on the object image to obtain object feature information of the target object; feature fusion processing is performed on the object feature information and the video description information to obtain feature fusion information corresponding to the target object; and identity recognition is performed on the target object based on the feature fusion information to obtain the identity information of the target object. By fusing in the video description information, the scheme enhances the object feature information available for identity recognition, so that the identity information of the target object can be recognized accurately based on the feature fusion information, improving the accuracy of identifying the identity information of the target object. The scheme can be applied to various scenes such as cloud technology, artificial intelligence and intelligent traffic.

Description

Object recognition method, device, computer equipment, storage medium and product
Technical Field
The present application relates to the field of communications technologies, and in particular, to an object identification method, an object identification device, a computer device, a storage medium, and a product.
Background
When identifying persons in a video, feature information of the persons is generally extracted, and the identity of each person is determined according to that feature information. Because the extracted feature information comes only from the features contained in the video frames, few features are available for identification, so the accuracy of identifying persons in the video is low.
Disclosure of Invention
The embodiment of the application provides an object identification method, an object identification device, computer equipment, a storage medium and a product, which can improve the accuracy of identification of objects in video.
The object identification method provided by the embodiment of the application comprises the following steps:
acquiring video description information corresponding to a video to be processed;
performing object detection on video frames in the video to be processed to obtain an object image containing a target object;
extracting object characteristics of the object image to obtain object characteristic information of the target object;
performing feature fusion processing on the object feature information and the video description information to obtain feature fusion information corresponding to the target object;
and performing identity recognition on the target object based on the feature fusion information to obtain the identity information of the target object.
Correspondingly, the embodiment of the application also provides an object recognition device, which comprises:
the acquisition unit is used for acquiring a video to be processed and video description information corresponding to the video to be processed;
the detection unit is used for carrying out object detection on the video frames in the video to be processed to obtain an object image containing a target object;
the extraction unit is used for extracting object characteristics of the object image to obtain object characteristic information of the target object;
the fusion unit is used for carrying out feature fusion processing on the object feature information and the video description information to obtain feature fusion information corresponding to the target object;
and the identification unit is used for carrying out identity identification on the target object based on the characteristic fusion information to obtain the identity information of the target object.
In an embodiment, the object recognition apparatus further comprises:
the information acquisition unit is used for acquiring initial video description information;
the attribute extraction unit is used for extracting content attributes of the video to be processed to obtain video meta-information corresponding to the video to be processed;
and the processing unit is used for performing selection processing on the initial video description information according to the video meta-information to obtain the video description information of the video to be processed.
In an embodiment, the attribute extraction unit includes:
the text acquisition subunit is used for acquiring video text information of the video to be processed;
and the screening subunit is used for screening the video text information to determine the video meta-information of the video to be processed.
In an embodiment, the text obtaining subunit includes:
the video frame acquisition module is used for acquiring key video frames of the video to be processed;
and the text recognition module is used for recognizing text content of the key video frames to obtain video text information corresponding to the video to be processed.
In one embodiment, the screening subunit comprises:
the statistics module is used for performing word frequency statistics on the video text information to obtain frequency information of at least one keyword in the video text information;
and the determining module is used for determining video meta-information from at least one keyword according to the frequency information.
In an embodiment, the detection unit includes:
the video screening subunit is used for carrying out video frame screening processing on the video to be processed to obtain a video frame to be processed corresponding to the video to be processed;
the position detection subunit is used for performing object position detection on the video frame to be processed to obtain position information of the target object in the video frame to be processed;
and the image acquisition subunit is used for acquiring an object image containing the target object from the video frame to be processed according to the position information.
In one embodiment, the fusion unit comprises:
a mode determining subunit, configured to determine at least one feature fusion mode;
the first feature fusion subunit is used for performing feature fusion processing on the object feature information and the video description information based on the at least one feature fusion mode to obtain at least one piece of sub-feature fusion information corresponding to the target object;
and an information determination subunit configured to determine feature fusion information based on the at least one piece of sub-feature fusion information.
In an embodiment, the first feature fusion subunit includes:
the superposition module is used for performing feature superposition processing on the object feature information and the video description information to obtain superposition feature information;
and the blending module is used for performing feature blending processing on the object feature information and the video description information to obtain blended feature information.
In one embodiment, the fusion unit comprises:
the dimension conversion subunit is used for performing dimension conversion on the video description information to obtain processed video description information with the same dimension as the object feature information;
and the second feature fusion subunit is used for performing feature fusion processing on the processed video description information and the object feature information to obtain feature fusion information corresponding to the target object.
In an embodiment, the identification unit comprises:
the feature mining subunit is used for carrying out feature mining on the character feature information of the target object based on the feature fusion information to obtain the identity feature information of the target object;
and the identity recognition subunit is used for carrying out identity recognition on the target object according to the identity characteristic information to obtain the identity information of the target object.
In an embodiment, the identification unit comprises:
the prediction subunit is used for predicting, for each object identity according to the feature fusion information, the prediction probability that the target object is that object identity;
and the identity determination subunit is used for determining the identity information of the target object according to the prediction probability.
Correspondingly, the embodiment of the application also provides computer equipment, which comprises a memory and a processor; the memory stores a computer program, and the processor is configured to run the computer program in the memory to perform any one of the object recognition methods provided in the embodiments of the present application.
Accordingly, embodiments of the present application also provide a computer readable storage medium for storing a computer program loaded by a processor to perform any of the object recognition methods provided by the embodiments of the present application.
Accordingly, embodiments of the present application also provide a computer program product comprising a computer program/instructions loaded by a processor to perform any of the object recognition methods provided by the embodiments of the present application.
According to the embodiment of the application, the video to be processed and the video description information corresponding to the video to be processed are obtained; object detection is performed on video frames in the video to be processed to obtain an object image containing a target object; object features are extracted from the object image to obtain object feature information of the target object; feature fusion processing is performed on the object feature information and the video description information to obtain feature fusion information corresponding to the target object; and identity recognition is performed on the target object based on the feature fusion information to obtain the identity information of the target object. Because the video description information of the video to be processed is fused with the object feature information of the target object, the object feature information available for identity recognition is enhanced, and the identity information of the target object can be recognized more accurately based on the feature fusion information obtained by the feature fusion processing, thereby improving the accuracy of recognizing the identity information of the target object in the video to be processed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a scene graph of an object recognition method provided by an embodiment of the present application;
FIG. 2 is a flow chart of an object recognition method provided by an embodiment of the present application;
FIG. 3 is another flow chart of an object recognition method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an identification model of an object identification method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an object recognition device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a terminal provided in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Embodiments of the present application provide an object recognition method, apparatus, computer device, and computer-readable storage medium. The object recognition device may be integrated in a computer device, which may be a server or a device such as a terminal.
The terminal may include a mobile phone, a wearable intelligent device, a tablet computer, a notebook computer, a personal computer (PC, personal Computer), a car-mounted computer, and the like.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data and artificial intelligence platforms.
For example, as shown in fig. 1, the computer device obtains a video to be processed and the video description information corresponding to the video to be processed, performs object detection on video frames in the video to be processed to obtain an object image containing a target object, and extracts object features from the object image to obtain object feature information of the target object. Feature fusion processing is then performed on the object feature information and the video description information: for example, feature superposition processing yields superposition feature information and feature blending processing yields blended feature information; the superposition feature information and the blended feature information are spliced, and the spliced feature information is taken as the feature fusion information corresponding to the target object. Finally, identity recognition is performed on the target object based on the feature fusion information to obtain the identity information of the target object. By fusing the video description information of the video to be processed with the object feature information of the target object, the scheme enhances the object feature information available for identity recognition, so that the identity information of the target object can be recognized more accurately based on the feature fusion information, improving the accuracy of recognizing the identity information of the target object in the video to be processed.
The following will describe each of these in detail. Note that the order in which the following embodiments are described is not intended as a limitation on the preferred order of the embodiments.
The present embodiment will be described from the viewpoint of an object recognition apparatus, which may be integrated in a computer device, which may be a server or a terminal, or the like.
An embodiment of the present application provides an object recognition method, as shown in fig. 2, a specific flow of the object recognition method may be as follows:
101. and acquiring the video to be processed and video description information corresponding to the video to be processed.
The video to be processed may be a video that needs to be subject identified, for example, the video to be processed may be a video published on a platform, or may be a video stored in a database or in a blockchain.
The video description information may include video meta information describing attributes of the video content to be processed, and may be used to enhance object feature information, for example, the video description information may include keywords of the video to be processed, for example, history, law, family, biography, and the like, and the video description information may also be a meta information feature vector corresponding to the video meta information of the video to be processed.
For example, the video to be processed may be obtained, and the corresponding video description information may be obtained according to a video classification result of the video to be processed; for instance, for a spoken-language teaching film classified under science-and-education videos, the video description information may be determined as "science and education" and "spoken language teaching" according to the classification result. Optionally, the video description information may be obtained according to a video profile of the video to be processed, or a preset keyword corresponding to the video to be processed may be used as the video description information, where the preset keyword may come from a video tag added to the video to be processed by a user. Optionally, the video description information corresponding to the video to be processed may be obtained from data found by searching the Internet, or a similar video of the video to be processed may be searched for and the video description information determined according to the data corresponding to the similar video.
Optionally, the video meta-information of the video to be processed may be determined according to the video classification result, the video profile or a preset keyword, and the video description information may then be determined according to the video meta-information, for example by performing one-hot encoding or embedding processing on the video meta-information. In an embodiment, the video description information may be unified into one data format to improve data processing efficiency. That is, before the step of acquiring the video to be processed and the video description information corresponding to the video to be processed, the object identification method may specifically further include:
Acquiring initial video description information;
extracting content attributes of the video to be processed to obtain video meta-information corresponding to the video to be processed;
and selecting the initial video description information according to the video meta information to obtain the video description information of the video to be processed.
The initial video description information may include a plurality of preset video meta-information, where the plurality of preset video meta-information may be obtained by collecting known video meta-information in a database, for example, each video meta-information may correspond to an information identifier, and the initial video description information may be information obtained according to the information identifier of each video meta-information.
The video meta information may be information describing content attributes of the video to be processed, such as history, law, home, and biography, among others.
For example, content attribute extraction may specifically be performed on the cover, video name, subtitles, brief introduction and the like of the video to be processed to obtain the video meta-information of the video to be processed. The preset video meta-information corresponding to the initial video description information may be ordered according to a preset rule; each preset video meta-information has an information identifier with an initial value of 0, and the initial video description information is obtained from the information identifiers of all the preset video meta-information. The video meta-information of the video to be processed is compared with the preset video meta-information, and if they are the same, the information identifier at the corresponding position of the initial video description information is set to 1; the video description information is then obtained from the information identifiers contained in the initial video description information. The information related to the video to be processed, such as the cover, video name, subtitles and introduction, may be obtained from a database or a blockchain; alternatively, the video to be processed may be searched for on the Internet and this related information extracted from the search results.
Optionally, a video meta-information set is obtained by collecting video meta-information; the set contains the preset video meta-information, which may be ordered according to a preset rule. The video meta-information set may be expressed as KS = {ks_1, ks_2, …, ks_k}, with set size k, where ks_i represents the i-th video meta-information, i = 1, 2, 3, …, k.
Each preset video meta-information corresponds to an information identifier. If the initial value of each information identifier is 0, a zero vector of length k, kw_0 = {0_1, 0_2, 0_3, 0_4, …, 0_{k-1}, 0_k}, can be obtained as the initial video description information, where k is the number of video meta-information entries in the video meta-information set.
The video meta-information may be determined by acquiring video text information of the video to be processed and screening it according to the video text information, so as to improve the accuracy of identifying the target object in the video to be processed. That is, in an embodiment, the step of extracting content attributes of the video to be processed to obtain the video meta-information corresponding to the video to be processed may specifically include:
acquiring video text information of a video to be processed;
and screening the video text information to determine video meta-information of the video to be processed.
The video text information may include text information related to the video to be processed, for example, may include a subtitle, a video brief, a video name, etc. of the video to be processed, and if the video to be processed is included in an article, the video text information may also include the article.
For example, the method specifically includes obtaining an audio file of the video to be processed, performing text conversion on the audio file to obtain text information corresponding to the audio file, obtaining the video profile, video name and the like of the video to be processed, and taking the different pieces of text information thus obtained as the video text information.
Some videos may not include information such as a video name and a video profile, or the audio file may be lost or difficult to acquire. In such cases, text content recognition may be performed on video frames of the video to obtain the video text information of the video to be processed. That is, the step of acquiring the video text information of the video to be processed may specifically include:
acquiring a key video frame of a video to be processed;
and carrying out text content identification on the key video frames to obtain video text information corresponding to the video to be processed.
The key video frames may be all video frames of the video to be processed, or video frames of the cover video of the video to be processed, or video frames containing subtitles, or video frames containing video names.
For example, a key video frame of the video to be processed is obtained, optical character recognition (Optical Character Recognition, OCR) is performed on the key video frame, and video text information of the video to be processed is obtained according to the recognition result.
Optionally, the position of the caption in the key video frame may first be located through a text detection algorithm, and the image area where the caption is located is intercepted to obtain a caption image. Image convolution processing is performed on the caption image to obtain image convolution feature information of the caption image; text feature extraction is then performed on the image convolution feature information through an LSTM network to obtain text feature information of the caption image, and the text content contained in the caption image is recognized according to the text feature information. Performing text content recognition on each key video frame in this way yields the video text information of the video to be processed.
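For illustration only, a minimal Python sketch of text content recognition on key video frames is given below; it substitutes an off-the-shelf OCR engine (pytesseract) for the convolution-plus-LSTM pipeline described above, and the subtitle region coordinates and function name are assumptions rather than part of the disclosed method:

```python
# Minimal sketch: OCR over key video frames (pytesseract stands in for the
# text-detection + CRNN pipeline described in the disclosure).
import cv2
import pytesseract

def recognize_video_text(key_frames, subtitle_box=None):
    """key_frames: list of frames (numpy arrays) from the video.
    subtitle_box: optional (x, y, w, h) area already located by a text detector.
    """
    texts = []
    for frame in key_frames:
        region = frame
        if subtitle_box is not None:
            x, y, w, h = subtitle_box
            region = frame[y:y + h, x:x + w]             # intercept the caption area
        gray = cv2.cvtColor(region, cv2.COLOR_BGR2GRAY)  # grayscale helps OCR
        text = pytesseract.image_to_string(gray, lang="chi_sim+eng").strip()
        if text:
            texts.append(text)
    return " ".join(texts)  # video text information for later screening
```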
The obtained video text information is then screened, and the words that occur more frequently in the video text information are taken as the video meta-information of the video to be processed; for example, the five words with the highest occurrence counts are taken as the video meta-information of the video to be processed.
Besides filtering video meta-information according to the frequency of words within the video text information itself, the frequency with which those words appear in the video text information of other videos may also be used as a filtering condition. A word that appears often in this video's text information but rarely in other videos' text information better reflects the characteristics of this video, so video meta-information obtained by also considering occurrence in other video text information better reflects the characteristics of the video to be processed. That is, the step of screening the video text information to determine the video meta-information of the video to be processed may specifically include:
Performing word frequency statistics on the video text information to obtain frequency information of at least one keyword in the video text information;
video meta information is determined from the at least one keyword based on the frequency information.
The keywords may be words contained in the video text information.
For example, the frequency information of each keyword may be calculated by the TF-IDF (term frequency-inverse document frequency) algorithm. First, the total number of words in the video text information is obtained, and the number of occurrences of each keyword in the video text information is counted; for each keyword, the word frequency is calculated from the number of occurrences and the total number of words as TF = number of occurrences / total number of words. Then, the total number of videos in a video database containing the video to be processed is obtained, and the inverse document frequency of the keyword is calculated as IDF = log(total number of videos / (number of videos whose video text information contains the keyword + 1)). The frequency information of the keyword is then calculated from its word frequency and inverse document frequency as TF-IDF = TF × IDF. Video meta-information is determined from the video text information according to the TF-IDF value of each keyword; for example, the 8 keywords with the largest TF-IDF values are taken as the video meta-information of the video to be processed.
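A minimal sketch of the TF-IDF screening just described follows, for illustration; whitespace tokenization and the function name are assumptions:

```python
# Minimal sketch: select video meta-information as the top-k TF-IDF keywords.
import math
from collections import Counter

def select_meta_info(video_text, corpus_texts, top_k=8):
    """video_text: video text information of the video to be processed.
    corpus_texts: video text information of every video in the database."""
    words = video_text.split()
    counts = Counter(words)
    total_words = len(words)
    total_videos = len(corpus_texts)

    scores = {}
    for word, occurrences in counts.items():
        tf = occurrences / total_words                   # TF = occurrences / total words
        containing = sum(1 for text in corpus_texts if word in text)
        idf = math.log(total_videos / (containing + 1))  # IDF = log(total / (containing + 1))
        scores[word] = tf * idf                          # TF-IDF = TF x IDF

    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]                                # video meta-information
```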
In this way, keywords of the video to be processed are incorporated, as video meta-information, into the identification of the identity information of the target object, so that important feature information related to the identity information of the target object can be mined according to the information provided by the video meta-information, improving the accuracy of identifying the identity information of the target object. Applied to video auditing services, the method can also improve video auditing efficiency.
102. And performing object detection on video frames in the video to be processed to obtain an object image containing the target object.
The target object may include persons, animals and other objects in the video to be processed.
For example, the object detection may be specifically performed on each frame of video frame in the video to be processed, the video frame containing the target object is obtained from the video to be processed, and the area where the target object is located is cut from the video frame, so as to obtain the object image containing the target object.
It can be understood that when a plurality of target objects exist in a video frame, an object image corresponding to each target object can be obtained; if the same target object exists in a plurality of video frames, a plurality of object images of that target object can be obtained, and an object image sequence of the target object can be formed from the plurality of object images.
In general, because video content is continuous, the objects in several consecutive video frames do not change greatly. The video frames of the video to be processed are therefore screened to reduce the number of video frames used for object recognition and thus the time required for recognition. That is, the step of "performing object detection on video frames in the video to be processed to obtain an object image containing the target object" may specifically include:
performing video frame screening treatment on the video to be treated to obtain a video frame to be treated corresponding to the video to be treated;
performing object position detection on the video frame to be processed to obtain position information of the target object in the video frame to be processed;
and acquiring an object image containing the target object from the video frame to be processed according to the position information.
The position information may be information about the position of the target object in the video frame, for example, coordinate information or the like.
For example, a plurality of video frames are extracted from the video to be processed in an equidistant manner to obtain the video frames to be processed; the video frames to be processed are detected frame by frame through a Faster Region Convolutional Neural Network (Faster-RCNN) to obtain the position information of the target object in the video frames to be processed, and an image region containing the target object is intercepted from each video frame to be processed according to the position information to obtain the object images corresponding to the video to be processed.
Alternatively, the video frames to be processed may be extracted from the video to be processed in an equidistant manner to obtain the video sequence frames VF = {vf_1, vf_2, …, vf_n}, where vf_i represents the i-th video frame to be processed; for example, one video frame is extracted as a video frame to be processed every a frames, or every time interval b, to obtain the video sequence frames.
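For illustration, a minimal sketch of the equidistant frame extraction with OpenCV follows; the interval a is an assumed parameter:

```python
# Minimal sketch: keep one frame every `a` frames to form the video sequence frames VF.
import cv2

def sample_frames(video_path, a=30):
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % a == 0:        # equidistant sampling
            frames.append(frame)
        index += 1
    cap.release()
    return frames                 # VF = {vf_1, vf_2, ..., vf_n}
```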
103. And extracting object characteristics of the object image to obtain object characteristic information of the target object.
Wherein the object feature information may include information identifying a target object feature, and the object feature information may include an object feature vector.
For example, image convolution processing, batch data normalization, maximum pooling and other processing may be performed on the object image to obtain the object feature information of the target object. Optionally, object feature extraction may be performed on the object image through a 50-layer residual neural network (ResNet50) to obtain the object feature information of the target object.
It can be understood that when there are a plurality of target objects, object feature extraction can be performed for an object image corresponding to each target object, so as to obtain object feature information of the target object.
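A minimal PyTorch sketch of the ResNet50 feature extraction step follows; the torchvision API (version 0.13 or later) and the choice of dropping the final classification layer so that one feature vector is produced per object image are assumptions for illustration:

```python
# Minimal sketch: extract a 2048-d object feature vector per object image with ResNet50.
import torch
import torchvision.models as models
import torchvision.transforms as T

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()   # drop the classifier head, keep the features
resnet.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_object_features(object_images):
    batch = torch.stack([preprocess(img) for img in object_images])
    return resnet(batch)          # PV: one feature vector per object image
```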
104. And carrying out feature fusion processing on the object feature information and the video description information to obtain feature fusion information corresponding to the target object.
For example, the object feature information and the video description information may be fused by adding them, by multiplying them, by subtracting them, or by splicing the video description information and the object feature information, to obtain the feature fusion information.
The feature fusion manner may be determined according to the requirements of the application scene, and feature fusion processing is performed on the object feature information and the video description information according to that manner. This improves the flexibility of feature fusion and makes the obtained feature fusion information more representative. That is, in an embodiment, the step of performing feature fusion processing on the object feature information and the video description information to obtain the feature fusion information corresponding to the target object may specifically include:
determining at least one feature fusion mode;
performing feature fusion processing on the object feature information and the video description information based on the at least one feature fusion mode to obtain at least one piece of sub-feature fusion information corresponding to the target object;
Feature fusion information is determined based on the at least one sub-feature fusion information.
The feature fusion mode may include processing modes such as feature stitching, feature superposition, feature fusion and the like, the feature stitching may be to stitch two pieces of feature information to obtain feature information with more dimensions or more channels, the feature superposition may be to increase information quantity under each dimension or under each channel, and the feature fusion may be to multiply features and map object feature information into other feature spaces.
For example, a corresponding feature fusion mode may be determined according to the type of the object feature information, or a feature fusion mode may be determined according to a preset condition; feature fusion processing is performed on the object feature information and the video description information according to each feature fusion mode to obtain the sub-feature fusion information corresponding to that mode, and the obtained at least one piece of sub-feature fusion information is taken as the feature fusion information corresponding to the target object.
In an embodiment, the at least one feature fusion mode may include feature superposition and feature blending, the corresponding sub-feature fusion information being superposition feature information and blended feature information, and the step of performing feature fusion processing on the object feature information and the video description information based on the at least one feature fusion mode to obtain at least one piece of sub-feature fusion information corresponding to the target object may specifically include:
performing feature superposition processing on the object feature information and the video description information to obtain superposition feature information;
and performing feature blending processing on the object feature information and the video description information to obtain blended feature information.
For example, the method specifically includes performing feature superposition processing on the object feature information and the video description information to obtain superposition feature information, performing feature blending processing by multiplying the object feature information and the video description information to obtain blended feature information, and splicing the superposition feature information and the blended feature information to obtain the feature fusion information.
The video description information differs between videos to be processed, and so does the obtained object feature information. To give the video description information and the object feature information the same dimension and so facilitate the calculations of the feature fusion processing, in an embodiment the step of performing feature fusion processing on the object feature information and the video description information to obtain the feature fusion information corresponding to the target object may specifically include:
performing dimension conversion on the video description information to obtain processed video description information with the same dimension as the object feature information;
And carrying out feature fusion processing on the processed video description information and the object feature information to obtain feature fusion information corresponding to the target object.
For example, the dimension conversion may be performed on the video description information through linear transformation to obtain processed video description information with the same dimension as the object feature information; and carrying out feature fusion processing on the processed video description information and the object feature information to obtain feature fusion information corresponding to the target object.
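For illustration, the following minimal PyTorch sketch combines the dimension conversion with the two fusion modes (multiplication and addition) described above and splices the results into the feature fusion information; the class and parameter names are assumptions:

```python
# Minimal sketch: dimension conversion + blend (multiply) + superpose (add) + splice.
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, desc_dim, feat_dim=2048):
        super().__init__()
        self.proj = nn.Linear(desc_dim, feat_dim)   # dimension conversion of kw

    def forward(self, object_feats, video_desc):
        kw_d = self.proj(video_desc)                # processed video description info
        blended = object_feats * kw_d               # feature blending
        superposed = object_feats + kw_d            # feature superposition
        return torch.cat([blended, superposed], dim=-1)  # feature fusion information
```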
105. And carrying out identity recognition on the target object based on the feature fusion information to obtain the identity information of the target object.
The identity information may include occupation information, character information and the like.
For example, identity recognition may specifically be performed on the target object according to the feature fusion information to determine the identity information of the target object.
Optionally, the probability that the target object has each object identity may be predicted according to the feature fusion information, and the identity information of the target object determined according to these probabilities. That is, the step of "performing identity recognition on the target object based on the feature fusion information to obtain the identity information of the target object" may specifically include:
predicting the prediction probability of the target object as the object identity according to the feature fusion information aiming at each object identity;
And determining the identity information of the target object according to the prediction probability.
For example, specifically, for each object identity, the prediction probability of the target object for the object identity may be predicted according to the feature fusion information, and the object identity with the highest prediction probability is used as the identity information of the target object.
Optionally, in an embodiment, the feature fusion information may be mined through a neural network model to obtain feature information that is more relevant to the target object, and identity information of the target object is more accurately identified, that is, step "identify the target object based on the feature fusion information to obtain identity information of the target object", including:
feature mining is carried out on character feature information of the target object based on the feature fusion information, so that identity feature information of the target object is obtained;
and carrying out identity recognition on the target object according to the identity characteristic information to obtain the identity information of the target object.
The character feature information may be feature information related to identity information in the object feature information.
For example, the feature fusion information provides a hint about the identity information: if the video description information is "spoken language teaching", the identity of the target object is likely to be a teacher, a student or the like. The neural network model can therefore perform feature mining on the character features used for identity recognition based on the feature fusion information, weakening or discarding the feature information in the object feature information that is irrelevant to the identity of the target object, to obtain the identity feature information of the target object; the identity of the target object is then recognized according to the identity feature information, and the identity information of the target object is determined.
As can be seen from the above, in the embodiment of the present application, the video to be processed and the video description information corresponding to the video to be processed are obtained; object detection is performed on video frames in the video to be processed to obtain an object image containing a target object; object features are extracted from the object image to obtain object feature information of the target object; feature fusion processing is performed on the object feature information and the video description information to obtain feature fusion information corresponding to the target object; and identity recognition is performed on the target object based on the feature fusion information to obtain the identity information of the target object. Because the video description information of the video to be processed is fused with the object feature information of the target object, the object feature information available for identity recognition is enhanced, and the identity information of the target object can be recognized more accurately based on the feature fusion information obtained by the feature fusion processing, thereby improving the accuracy of recognizing the identity information of the target object in the video to be processed.
On the basis of the above embodiments, examples will be described in further detail below.
The present embodiment will be described from the perspective of an object recognition apparatus, which may be integrated in a computer device, which may be a server.
As shown in fig. 3, a specific flow of the object identification method provided in the embodiment of the present application may be as follows:
201. and acquiring a training sample set to train the identity recognition model to obtain the trained identity recognition model.
For example, the training sample set may include at least one training sample, where each training sample includes a video sample, video meta information corresponding to the video sample, and an object identity tag, for example, the video meta information corresponding to the video sample a is "scenario", "history", and "family", and the object identity tag is "worker", "merchant", and "farmer"; the video meta information corresponding to the video sample B is pronunciation, spoken language and teaching, and the object identity label is teacher, student and school; the video meta information corresponding to the video sample C is "doing work", "working" and "scenario", and the object identity tag is "worker", "boss" and "manager", etc.
The server may obtain the video description information according to the video meta information, and the specific process refers to the description related to step 101 or step 202, which is not described herein.
The server takes the video sample and the video description information as the input of the identity recognition model, predicts the identity information of the target object in the video sample according to the video description information corresponding to the video sample and the video sample through the identity recognition model, and trains the identity recognition model according to the object identity tag and the predicted identity information to obtain the trained identity recognition model.
After the identity recognition model is trained, only the video to be processed and the video meta-information related to it need to be provided when performing identity recognition on a video target object. The video meta-information is introduced into the trained identity recognition model during identity recognition of the target object, and the introduced video meta-information enhances the object feature information that the trained model uses for identity recognition, so that identity recognition can be performed more accurately.
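A minimal training sketch is given below for illustration, assuming a PyTorch model that maps object images and a video description vector to identity logits; the optimizer, loss function and hyperparameters are assumptions not specified in the disclosure:

```python
# Minimal sketch: supervised training against the object identity tags.
import torch
import torch.nn as nn

def train_identity_model(model, loader, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()   # predicted identity vs. identity tag
    model.train()
    for _ in range(epochs):
        for object_images, video_desc, identity_labels in loader:
            logits = model(object_images, video_desc)
            loss = criterion(logits, identity_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```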
202. And acquiring the video to be processed and video description information corresponding to the video to be processed.
For example, the server may obtain the video to be processed, perform optical character recognition (Optical Character Recognition, OCR) on the video frame of the video to be processed to extract the subtitle of the video to be processed, obtain the video text information of the video to be processed according to the recognition result, and optionally obtain the audio file of the video to be processed, perform text conversion on the audio file, and obtain the video text information of the video to be processed according to the conversion result.
The server calculates the frequency information of each keyword in the video text information through the TF-IDF (term frequency-inverse document frequency) algorithm. First, the total number of words in the video text information is obtained and the number of occurrences of each keyword is counted; for each keyword, the word frequency is calculated from the number of occurrences and the total number of words as TF = number of occurrences / total number of words. Then, the total number of videos in a video database containing the video to be processed is obtained, and the inverse document frequency of the keyword is calculated as IDF = log(total number of videos / (number of videos whose video text information contains the keyword + 1)). The frequency information of the keyword is then calculated from its word frequency and inverse document frequency as TF-IDF = TF × IDF. Video meta-information is determined from the video text information according to the TF-IDF value of each keyword; for example, the 8 keywords with the largest TF-IDF values are taken as the video meta-information of the video to be processed.
The video meta-information is collected to obtain a video meta-information set, which contains the preset video meta-information; the video meta-information in the set may be ordered according to a preset rule. The set may be expressed as KS = {ks_1, ks_2, …, ks_k}, with set size k, where ks_i represents the i-th video meta-information, i = 1, 2, 3, …, k.
Each preset video meta-information corresponds to an information identifier. If the initial value of each information identifier is 0, a zero vector of length k, kw_0 = {0_1, 0_2, 0_3, 0_4, …, 0_{k-1}, 0_k}, can be obtained as the initial video description information, where k is the number of video meta-information entries in the video meta-information set.
Or, using a Bag-of-Words algorithm, a zero vector kw_0 of length k is given based on the video meta-information set; for each video meta-information of the video to be processed that is contained in the video meta-information set, a 1 is set at the corresponding position of the vector kw_0, so that the video description information feature vector kw of the video to be processed is obtained.
The server compares the video meta-information of the video to be processed with the preset video meta-information; if they are the same, the information identifier at the corresponding position of the initial video description information is set to 1, and otherwise to 0, so as to obtain the video description information, recorded as kw = {kw_1, kw_2, kw_3, …, kw_{k-1}, kw_k}. For example, the video description information may be {0, 0, 1, 1, …, 1, 0}.
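For illustration, a minimal sketch of this Bag-of-Words encoding follows; the function name and the example meta-information are assumptions:

```python
# Minimal sketch: Bag-of-Words encoding of video meta-information into kw.
def encode_description(video_meta_info, meta_info_set):
    """meta_info_set: ordered preset set KS = [ks_1, ..., ks_k]."""
    kw = [0] * len(meta_info_set)        # kw_0, the initial video description
    for i, ks_i in enumerate(meta_info_set):
        if ks_i in video_meta_info:      # same as a preset entry -> set 1
            kw[i] = 1
    return kw

# e.g. encode_description({"teaching", "pronunciation"},
#                         ["history", "law", "teaching", "pronunciation"])
# returns [0, 0, 1, 1]
```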
The video meta-information can help the identity recognition model understand the video content of the video to be processed. For example, if the video meta-information of a video to be processed is "teaching" and "pronunciation", the video meta-information provides the identity recognition model with some implicit hints: the occupation of the persons in the video may be teacher, school, student and the like. This certainly enhances the object feature information available for identity recognition, and based on the enhanced object feature information, the identity recognition model can more accurately recognize the identity of the target object of the video to be processed.
203. And carrying out video frame serialization on the video to be processed to obtain video sequence frames.
For example, the server may extract the video frames to be processed from the video to be processed in an equidistant manner to obtain the video sequence frames VF = {vf_1, vf_2, …, vf_n}, where vf_i represents the i-th video frame to be processed; for example, one video frame is extracted as a video frame to be processed every a frames, or every time interval b, to obtain the video sequence frames.
204. And carrying out object detection on the video sequence frame through the trained identity recognition model, and determining the position information of the target object in the video sequence frame.
For example, the server may specifically perform object detection on the video sequence frame by using a fast-RCNN network structure through the trained identification model, so as to determine the position information of the target object in the video sequence frame.
205. And intercepting an image area containing the target object from the video sequence frame according to the position information to obtain an object image sequence.
For example, the server may specifically intercept the image areas containing the target object from the video sequence frames according to the position information; if a video frame to be processed in the video sequence frames does not contain the target object, no interception is performed on it. Intercepting the image areas containing the target object according to the position information yields an object image sequence, denoted person_list = {person_1, person_2, …, person_m}.
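A minimal sketch of the object position detection and interception step with torchvision's Faster-RCNN follows; the COCO "person" class filter, the score threshold and the assumption of RGB frames are illustrative choices:

```python
# Minimal sketch: detect target objects with Faster-RCNN and intercept them.
import torch
import torchvision
import torchvision.transforms.functional as F

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
detector.eval()

@torch.no_grad()
def crop_objects(frames, score_threshold=0.8):
    person_list = []
    for frame in frames:                       # frame: HxWx3 RGB numpy array
        output = detector([F.to_tensor(frame)])[0]   # boxes, labels, scores
        for box, label, score in zip(output["boxes"],
                                     output["labels"], output["scores"]):
            if label.item() == 1 and score.item() >= score_threshold:  # COCO person
                x1, y1, x2, y2 = box.int().tolist()
                person_list.append(frame[y1:y2, x1:x2])   # intercepted image area
    return person_list                         # object image sequence
```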
206. And extracting object characteristics of the object image sequence through the trained identity recognition model to obtain object characteristic information of the target object.
For example, the server may specifically perform object feature extraction on the object image sequence through a pre-trained 50-layer residual network (ResNet50) to obtain the object feature information PV = {pv_1, pv_2, …, pv_m}, where pv_i represents the object feature information of target object i.
207. And carrying out feature fusion processing on the object feature information and the video description information through the trained identity recognition model to obtain feature fusion information.
For example, through the trained identity recognition model, the server may perform dimension conversion on the video description information based on a fully connected layer (FC layer), so that the dimension of the video description information is the same as that of the object feature information; the dimension-converted video description information is recorded as kw_d.
Then, feature fusion processing is performed on the video description information and the object feature information by feature multiplication and feature addition to obtain two pieces of sub-feature fusion information, and the two pieces of sub-feature fusion information are spliced to obtain the feature fusion information. This enlarges the dimension of the object feature information, allows more of the information contained in the object feature information to be exploited, and enables the identity of the target object to be recognized more accurately.
The feature fusion information obtained after the feature fusion processing of the video description information and the object feature information is recorded as:
PV_S = {pv_s_1, pv_s_2, …, pv_s_m}, where pv_s_i = {pv_i × kw_d, pv_i + kw_d}, i = 1, 2, …, m, and pv_s_i represents the feature fusion information of target object i.
208. And identifying the identity information of the target object based on the feature fusion information through the trained identity identification model.
For example, the server may input the feature fusion information into the fully connected layer. Through the trained identity recognition model, the server performs a nonlinear transformation y = f(Wx + b) on the input feature fusion information of each target object and outputs the result, where f is the activation function of the node, W is a weight matrix and b is a bias constant. The fully connected layer may include a plurality of nodes, the number of nodes being the number of classification categories, that is, the number of object identities.
Through the trained identity recognition model, the server performs identity recognition on the feature fusion information of each target object based on a Softmax layer, which converts the output of the FC layer into a prediction probability for each object identity:

p_j = e^(z_j) / Σ_{i=1}^{k} e^(z_i)

where z_j is the output value of the j-th node of the fully connected layer, W and b are the network parameters of that layer, and k is the number of output nodes, that is, the number of classification categories. The object identity with the highest prediction probability is determined as the identity information of the target object.
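A sketch of this identification head, assuming a PyTorch implementation; the number of object identities is a placeholder:

import torch
import torch.nn as nn

class IdentityHead(nn.Module):
    def __init__(self, fused_dim=4096, num_identities=1000):
        super().__init__()
        self.fc = nn.Linear(fused_dim, num_identities)  # Wx + b, one node per object identity

    def forward(self, pv_s):
        logits = self.fc(pv_s)                 # z: FC-layer output values
        probs = torch.softmax(logits, dim=-1)  # p_j = e^(z_j) / Σ_i e^(z_i)
        return probs.argmax(dim=-1), probs     # highest-probability identity per object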
In one embodiment, as shown in fig. 4, the recognition process based on the trained identity recognition model may be divided into three modules: a video object positioning and feature extraction module, a video description information extraction module, and a video meta-information enhanced object feature module. In the video object positioning and feature extraction module, the trained identity recognition model positions the target objects to be identified, obtains the position information of each target object in the video frames of the video frame sequence, and intercepts object images from the video frames as indicated by the position information, obtaining the object image of each target object, such as object image 1 and object image 2 in fig. 4, where object image 1 represents the object image corresponding to target object 1 and object image 2 represents the object image corresponding to target object 2; object feature extraction is then performed on the object image sequence to obtain the object feature information of each target object, such as object feature information 1 and object feature information 2 in fig. 4, where object feature information 1 represents the object feature information corresponding to target object 1 and object feature information 2 represents the object feature information corresponding to target object 2. In the video description information extraction module, the trained identity recognition model extracts keyword sets related to the video to be processed to obtain the video meta-information of the video to be processed, and derives the video description information of the video to be processed from the video meta-information. In the video meta-information enhanced object feature module, the object feature information of each target object is fused with and enhanced by the video description information obtained from the video meta-information, so that the model's identification of the persons in the video is more accurate.
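Reusing the sketches above, the three modules of fig. 4 might chain together as follows; every helper name here is an illustrative assumption rather than the patent's implementation:

def identify_objects(frames, detections, kw_embedding, fusion, head):
    person_list = crop_object_images(frames, detections)  # positioning and interception
    pv = extract_object_features(person_list)             # object feature extraction
    pv_s = fusion(pv, kw_embedding)                       # video meta-information enhancement
    identities, probs = head(pv_s)                        # identity recognition
    return identities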
As can be seen from the foregoing, the server in the embodiments of the present application trains the identity recognition model on an acquired training sample set to obtain the trained identity recognition model, and acquires the video to be processed and the video description information corresponding to the video to be processed. The server serializes the video to be processed into video frames and screens them to obtain video sequence frames; performs object detection on the video sequence frames through the trained identity recognition model to determine the position information of the target object in the video sequence frames; intercepts the image areas containing the target object from the video sequence frames according to the position information to obtain the object image sequence; performs object feature extraction on the object image sequence through the trained identity recognition model to obtain the object feature information of the target object; performs feature fusion processing on the object feature information and the video description information through the trained identity recognition model; and recognizes the identity information of the target object based on the feature fusion information through the trained identity recognition model. In this scheme, the video description information of the video to be processed and the object feature information of the target object undergo feature fusion processing; the video description information thereby enhances the object feature information available for identity recognition, so that the identity information of the target object can be recognized more accurately based on the resulting feature fusion information, improving the accuracy of recognizing the identity information of the target object in the video to be processed.
In order to facilitate better implementation of the object recognition method provided in the embodiments of the present application, an embodiment further provides an object recognition apparatus. The meanings of the terms are the same as those in the object recognition method described above; for specific implementation details, reference may be made to the description of the method embodiments.
The object recognition apparatus may be integrated in a computer device, as shown in fig. 5, and may include an acquisition unit 301, a detection unit 302, an extraction unit 303, a fusion unit 304, and an identification unit 305, specifically as follows:

(1) The acquisition unit 301: configured to acquire the video to be processed and the video description information corresponding to the video to be processed.
Optionally, the object recognition apparatus may further include an information acquisition unit, an attribute extraction unit, and a processing unit, specifically:

An information acquisition unit: configured to acquire initial video description information;

An attribute extraction unit: configured to extract content attributes of the video to be processed to obtain the video meta-information corresponding to the video to be processed;

A processing unit: configured to perform selection processing on the initial video description information according to the video meta-information to obtain the video description information of the video to be processed.
Optionally, the attribute extraction unit may include a text acquisition subunit and a screening subunit, specifically:

A text acquisition subunit: configured to acquire the video text information of the video to be processed;

A screening subunit: configured to screen the video text information to determine the video meta-information of the video to be processed.
Optionally, the text acquisition subunit may include a video frame acquisition module and a text recognition module, specifically:

A video frame acquisition module: configured to acquire the key video frames of the video to be processed;

A text recognition module: configured to perform text content recognition on the key video frames to obtain the video text information corresponding to the video to be processed.
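As a hedged illustration of the text recognition module, the sketch below uses the open-source pytesseract OCR wrapper; the patent does not name an OCR engine, so this choice and the language setting are assumptions:

import pytesseract

def recognize_video_text(key_frames):
    """key_frames: list of PIL.Image key video frames of the video to be processed."""
    texts = [pytesseract.image_to_string(frame, lang="chi_sim+eng")
             for frame in key_frames]
    return "\n".join(texts)  # video text information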
Optionally, the screening subunit may include a statistics module and a determination module, specifically:

A statistics module: configured to perform word frequency statistics on the video text information to obtain frequency information of at least one keyword in the video text information;

A determination module: configured to determine the video meta-information from the at least one keyword according to the frequency information.
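A minimal sketch of the statistics and determination modules, assuming whitespace-tokenised video text and a simple top-n frequency rule (the selection rule is an assumption):

from collections import Counter

def select_video_meta_info(video_text, top_n=10):
    freq = Counter(video_text.split())  # word frequency statistics per keyword
    return [word for word, count in freq.most_common(top_n)]  # video meta-information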
(2) The detection unit 302: configured to perform object detection on video frames in the video to be processed to obtain an object image containing the target object.
Optionally, the detection unit 302 may include a video screening subunit, a position detection subunit, and an image acquisition subunit, specifically:

A video screening subunit: configured to perform video frame screening processing on the video to be processed to obtain the video frames to be processed corresponding to the video to be processed;

A position detection subunit: configured to perform object position detection on the video frames to be processed to obtain the position information of the target object in the video frames to be processed;

An image acquisition subunit: configured to acquire the object image containing the target object from the video frames to be processed according to the position information.
(3) The extraction unit 303: configured to perform object feature extraction on the object image to obtain the object feature information of the target object.

(4) The fusion unit 304: configured to perform feature fusion processing on the object feature information and the video description information to obtain the feature fusion information corresponding to the target object.
Optionally, the fusion unit 304 may include a mode determination subunit, a first feature fusion subunit, and an information determination subunit, specifically:

A mode determination subunit: configured to determine at least one feature fusion mode;

A first feature fusion subunit: configured to perform feature fusion processing on the object feature information and the video description information based on the at least one feature fusion mode to obtain at least one piece of sub-feature fusion information corresponding to the target object;

An information determination subunit: configured to determine the feature fusion information based on the at least one piece of sub-feature fusion information.
Optionally, the first feature fusion subunit may include a splicing module and a blending module, specifically:

A splicing module: configured to perform feature superposition processing on the object feature information and the video description information to obtain superposition feature information;

A blending module: configured to perform feature blending processing on the object feature information and the video description information to obtain blending feature information.
Optionally, the fusion unit 304 may include a dimension conversion subunit and a second feature fusion subunit, specifically:

A dimension conversion subunit: configured to perform dimension conversion on the video description information to obtain processed video description information with the same dimension as the object feature information;

A second feature fusion subunit: configured to perform feature fusion processing on the processed video description information and the object feature information to obtain the feature fusion information corresponding to the target object.
(5) The identification unit 305: configured to perform identity recognition on the target object based on the feature fusion information to obtain the identity information of the target object.
Optionally, the identification unit 305 may include a feature mining subunit and an identity recognition subunit, specifically:

A feature mining subunit: configured to perform feature mining on the character feature information of the target object based on the feature fusion information to obtain the identity feature information of the target object;

An identity recognition subunit: configured to perform identity recognition on the target object according to the identity feature information to obtain the identity information of the target object.
Optionally, the identification unit 305 may include a prediction subunit and an identity determination subunit, specifically:

A prediction subunit: configured to predict, for each object identity, the probability that the target object has that object identity according to the feature fusion information;

An identity determination subunit: configured to determine the identity information of the target object according to the prediction probability.
As can be seen from the above, the object recognition apparatus in the embodiments of the present application acquires, through the acquisition unit 301, the video to be processed and the video description information corresponding to the video to be processed; the detection unit 302 performs object detection on video frames in the video to be processed to obtain an object image containing the target object; the extraction unit 303 performs object feature extraction on the object image to obtain the object feature information of the target object; the fusion unit 304 performs feature fusion processing on the object feature information and the video description information to obtain the feature fusion information corresponding to the target object; finally, the identification unit 305 performs identity recognition on the target object based on the feature fusion information to obtain the identity information of the target object. In this scheme, the video description information of the video to be processed and the object feature information of the target object undergo feature fusion processing; the video description information thereby enhances the object feature information available for identity recognition, so that the identity information of the target object can be recognized more accurately based on the resulting feature fusion information, improving the accuracy of recognizing the identity information of the target object in the video to be processed.
The embodiment of the present application further provides a computer device, which may be a terminal or a server, as shown in fig. 6, and shows a schematic structural diagram of the computer device according to the embodiment of the present application, specifically:
The computer device may include a processor 1001 with one or more processing cores, a memory 1002 with one or more computer-readable storage media, a power supply 1003, an input unit 1004, and other components. Those skilled in the art will appreciate that the computer device structure shown in fig. 6 does not limit the computer device; it may include more or fewer components than shown, combine certain components, or arrange components differently. Wherein:
the processor 1001 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 1002, and calling data stored in the memory 1002. Optionally, the processor 1001 may include one or more processing cores; preferably, the processor 1001 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, a computer program, and the like, and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 1001.
The memory 1002 may be used to store software programs and modules, and the processor 1001 executes various functional applications and data processing by running the software programs and modules stored in the memory 1002. The memory 1002 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the computer programs required for at least one function (such as a sound playing function or an image playing function), while the data storage area may store data created according to the use of the computer device, and so on. In addition, the memory 1002 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 1002 may also include a memory controller to provide the processor 1001 with access to the memory 1002.
The computer device also includes a power supply 1003 for powering the various components, preferably, the power supply 1003 is logically connected to the processor 1001 by a power management system, such that charge, discharge, and power consumption management functions are performed by the power management system. The power supply 1003 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The computer device may also include an input unit 1004, which input unit 1004 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 1001 in the computer device loads executable files corresponding to the processes of one or more computer programs into the memory 1002 according to the following instructions, and the processor 1001 executes the computer programs stored in the memory 1002, so as to implement various functions, as follows:
acquiring a video to be processed and video description information corresponding to the video to be processed;
object detection is carried out on video frames in the video to be processed, and an object image containing a target object is obtained; extracting object characteristics of the object image to obtain object characteristic information of a target object;
performing feature fusion processing on the object feature information and the video description information to obtain feature fusion information corresponding to the target object;
and carrying out identity recognition on the target object based on the feature fusion information to obtain the identity information of the target object.
The specific implementation of each operation may be referred to the previous embodiments, and will not be described herein.
As can be seen from the above, the computer device in the embodiments of the present application may acquire the video to be processed and the video description information corresponding to the video to be processed; perform object detection on video frames in the video to be processed to obtain an object image containing the target object; perform object feature extraction on the object image to obtain the object feature information of the target object; perform feature fusion processing on the object feature information and the video description information to obtain the feature fusion information corresponding to the target object; and perform identity recognition on the target object based on the feature fusion information to obtain the identity information of the target object. In this scheme, the video description information of the video to be processed and the object feature information of the target object undergo feature fusion processing; the video description information thereby enhances the object feature information available for identity recognition, so that the identity information of the target object can be recognized more accurately based on the resulting feature fusion information, improving the accuracy of recognizing the identity information of the target object in the video to be processed.
According to one aspect of the present application, a computer program product or computer program is provided, comprising computer instructions stored in a computer readable storage medium (which may also be referred to simply as storage medium). The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the methods provided in the various alternative implementations of the above embodiments.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of the various methods of the above embodiments may be completed by a computer program, or by a computer program controlling related hardware; the computer program may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, the present embodiments provide a storage medium (also referred to as a computer-readable storage medium) in which a computer program is stored, which is capable of being loaded by a processor to perform any one of the object recognition methods provided by the embodiments of the present application.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
The computer-readable storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disk, optical disk, and the like.
Since the computer program stored in the computer-readable storage medium can execute any of the object recognition methods provided in the embodiments of the present application, it can achieve the beneficial effects of any of those methods; for details, refer to the previous embodiments, which are not repeated herein.
The object recognition method, apparatus, computer device, storage medium, and product provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and application scope according to the ideas of the present application. In summary, the contents of this description should not be construed as limiting the present application.

Claims (15)

1. An object recognition method, comprising:
acquiring video description information corresponding to a video to be processed;
performing object detection on video frames in the video to be processed to obtain an object image containing a target object;
extracting object characteristics of the object image to obtain object characteristic information of the target object;
performing feature fusion processing on the object feature information and the video description information to obtain feature fusion information corresponding to the target object;
and carrying out identity recognition on the target object based on the feature fusion information to obtain the identity information of the target object.
2. The method according to claim 1, wherein the performing feature fusion processing on the object feature information and the video description information to obtain feature fusion information corresponding to the target object includes:
determining at least one feature fusion mode;
performing feature fusion processing on the object feature information and the video description information based on the at least one feature fusion mode to obtain at least one piece of sub-feature fusion information corresponding to the target object;
the feature fusion information is determined based on the at least one sub-feature fusion information.
3. The method according to claim 2, wherein the sub-feature fusion information includes superposition feature information and blending feature information, the feature fusion processing is performed on the object feature information and the video description information based on the at least one feature fusion manner, so as to obtain at least one sub-feature fusion information corresponding to the target object, including:
performing feature superposition processing on the object feature information and the video description information to obtain superposition feature information;
and carrying out feature blending processing on the object feature information and the video description information to obtain blending feature information.
4. The method according to claim 1, wherein the performing feature fusion processing on the object feature information and the video description information to obtain feature fusion information corresponding to the target object includes:
performing dimension conversion on the video description information to obtain processed video description information with the same dimension as the object feature information;
and carrying out feature fusion processing on the processed video description information and the object feature information to obtain feature fusion information corresponding to the target object.
5. The method according to claim 1, wherein before the acquiring video description information corresponding to a video to be processed, the method further comprises:
acquiring initial video description information;
extracting content attributes of the video to be processed to obtain video meta-information corresponding to the video to be processed;
and selecting the initial video description information according to the video meta information to obtain the video description information of the video to be processed.
6. The method of claim 5, wherein the extracting the content attribute of the video to be processed to obtain the video meta-information corresponding to the video to be processed includes:
acquiring video text information of the video to be processed;
And screening the video text information to determine video meta-information of the video to be processed.
7. The method of claim 6, wherein the obtaining video text information of the video to be processed comprises:
acquiring a key video frame of the video to be processed;
and carrying out text content identification on the key video frames to obtain video text information corresponding to the video to be processed.
8. The method of claim 6, wherein the screening the video text information to determine video meta-information of the video to be processed comprises:
performing word frequency statistics on the video text information to obtain frequency information of at least one keyword in the video text information;
and determining the video meta information from the at least one keyword according to the frequency information.
9. The method according to claim 1, wherein the object detection of the video frame in the video to be processed to obtain an object image containing the target object includes:
performing video frame screening processing on the video to be processed to obtain a video frame to be processed corresponding to the video to be processed;
detecting the object position of the video frame to be processed to obtain position information of the target object in the video frame to be processed;

and acquiring an object image containing the target object from the video frame to be processed according to the position information.
10. The method according to claim 1, wherein the identifying the target object based on the feature fusion information to obtain the identity information of the target object includes:
feature mining is carried out on character feature information of the target object based on the feature fusion information, so that identity feature information of the target object is obtained;
and carrying out identity recognition on the target object according to the identity characteristic information to obtain the identity information of the target object.
11. The method according to any one of claims 1 to 10, wherein the identifying the target object based on the feature fusion information to obtain the identity information of the target object includes:
predicting the prediction probability of the target object as the object identity according to the feature fusion information for each object identity;
and determining the identity information of the target object according to the prediction probability.
12. An object recognition apparatus, comprising:
an acquisition unit, configured to acquire a video to be processed and video description information corresponding to the video to be processed;
a detection unit, configured to perform object detection on video frames in the video to be processed to obtain an object image containing a target object;

an extraction unit, configured to perform object feature extraction on the object image to obtain object feature information of the target object;

a fusion unit, configured to perform feature fusion processing on the object feature information and the video description information to obtain feature fusion information corresponding to the target object;

and an identification unit, configured to perform identity recognition on the target object based on the feature fusion information to obtain identity information of the target object.
13. A computer device comprising a memory and a processor; the memory stores a computer program, and the processor is configured to execute the computer program in the memory to perform the object recognition method according to any one of claims 1 to 10.
14. A storage medium storing a computer program to be loaded by a processor to perform the object recognition method of any one of claims 1 to 10.
15. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps in the object recognition method of any one of claims 1 to 10.
CN202111624652.9A 2021-10-22 2021-12-28 Object recognition method, device, computer equipment, storage medium and product Pending CN116012871A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111235727 2021-10-22
CN2021112357274 2021-10-22

Publications (1)

Publication Number Publication Date
CN116012871A true CN116012871A (en) 2023-04-25

Family

ID=86027246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111624652.9A Pending CN116012871A (en) 2021-10-22 2021-12-28 Object recognition method, device, computer equipment, storage medium and product

Country Status (1)

Country Link
CN (1) CN116012871A (en)


Legal Events

PB01: Publication
REG: Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40083970; Country of ref document: HK)
SE01: Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination