CN116630842A - Object detection method, device, electronic equipment and storage medium - Google Patents

Object detection method, device, electronic equipment and storage medium Download PDF

Info

Publication number
CN116630842A
Authority
CN
China
Prior art keywords
target
text content
information
image
target video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310376483.4A
Other languages
Chinese (zh)
Inventor
钟华松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202310376483.4A
Publication of CN116630842A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The disclosure relates to an object detection method and apparatus, an electronic device, and a storage medium, in the field of computer technology, and addresses the weak specificity and redundant results of existing object detection methods. The method includes the following steps: determining feature information of a plurality of objects in a target video and text content of the target video; determining a matching degree between the text content and each object according to the feature information of the objects and feature information of the text content, to obtain a plurality of target matching degrees; and determining a subject object of the target video from the plurality of objects according to the plurality of target matching degrees, where the target matching degree corresponding to the subject object is greater than or equal to a preset threshold. In this way, the subject object of a video can be detected, improving the specificity of video detection.

Description

Object detection method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular, to an object detection method, an object detection device, electronic equipment and a storage medium.
Background
With the rise of neural networks and the growth of computing power, object detection has become a central problem in computer vision. Object detection determines whether an image contains an object to be detected and, if so, locates that object within the image.
Mainstream object detection is mostly applied to single-frame images, and the results are broad: all objects in the image are usually framed. For short videos containing many kinds of objects, this yields a large number of detections, so the detection result is redundant and poorly targeted.
Disclosure of Invention
The disclosure provides an object detection method and apparatus, an electronic device, and a storage medium, to address the weak specificity and redundant results of existing object detection methods. The technical solution of the present disclosure is as follows:
According to a first aspect of embodiments of the present disclosure, there is provided an object detection method, the method including: determining feature information of a plurality of objects in a target video and text content of the target video; determining a matching degree between the text content and each object according to the feature information of the plurality of objects and feature information of the text content, to obtain a plurality of target matching degrees; and determining a subject object of the target video from the plurality of objects according to the plurality of target matching degrees, where the target matching degree corresponding to the subject object is greater than or equal to a preset threshold.
Optionally, determining feature information of a plurality of objects in image frames of the target video includes: acquiring a plurality of image frames of the target video, and performing object detection on the objects included in each image frame to obtain a plurality of first object images; clustering the plurality of first object images according to a preset clustering algorithm to obtain at least one object image set, where the object images in one set represent the same object; and determining one second object image from each object image set to obtain a plurality of second object images, and taking the feature information of the plurality of second object images as the feature information of the plurality of objects in the image frames of the target video, where a second object image is the highest-quality image in its object image set.
Optionally, the method further includes: acquiring audio information of the target video, and performing text conversion on the audio information to obtain first text information; acquiring description information of the target video as second text information, where the description information includes a title and/or a video type of the target video; and obtaining the text content of the target video based on the first text information and the second text information.
Optionally, obtaining the text content of the target video based on the first text information and the second text information includes: extracting target keywords from the first text information and the second text information according to a preset word stock to obtain at least one target keyword, where the similarity between a target keyword and a word in the preset word stock is greater than or equal to a preset similarity; and taking the at least one target keyword as the text content of the target video.
Optionally, determining the matching degree between the text content and each object according to the feature information of the plurality of objects and the feature information of the text content to obtain a plurality of target matching degrees includes: inputting the text content and one second object image into a preset matching model, and outputting the matching degree between the text content and the second object image as one target matching degree, where the matching model is trained on a plurality of sample text contents, a plurality of sample images, and a plurality of sample labels, one sample label characterizing the sample text content corresponding to one sample image.
Optionally, inputting the text content and one second object image into a preset matching model and outputting the matching degree between them includes: inputting the text content and the second object image into the matching model, to extract through the matching model the feature information of the text content and the feature information of the second object image; and calculating the similarity between the two, and outputting the similarity as the matching degree between the text content and the second object image.
Optionally, the method further includes: inputting the target video into a preset classification model, and outputting the video type of the target video, where the classification model is trained on a plurality of sample videos and a plurality of sample video types, one sample video corresponding to one sample video type.
According to a second aspect of embodiments of the present disclosure, there is provided an object detection apparatus, the apparatus including a determining unit and a processing unit. The determining unit is configured to determine feature information of a plurality of objects in image frames of a target video, and to determine feature information of text content of the target video, the text content describing a key object. The processing unit is configured to determine the matching degree between the text content and each object according to the feature information of the plurality of objects and the feature information of the text content, to obtain a plurality of target matching degrees. The determining unit is further configured to determine a subject object of the target video from the plurality of objects according to the plurality of target matching degrees, where the target matching degree corresponding to the subject object is greater than or equal to a preset threshold.
Optionally, the determining unit is specifically configured to: acquire a plurality of image frames of the target video, and perform object detection on the objects included in each image frame to obtain a plurality of first object images; cluster the plurality of first object images according to a preset clustering algorithm to obtain at least one object image set, where the object images in one set represent the same object; and determine one second object image from each object image set to obtain a plurality of second object images, and take the feature information of the plurality of second object images as the feature information of the plurality of objects in the image frames of the target video, where a second object image is the highest-quality image in its object image set.
Optionally, the apparatus further includes an acquiring unit configured to: acquire audio information of the target video, and perform text conversion on the audio information to obtain first text information; acquire description information of the target video as second text information, where the description information includes a title and/or a video type of the target video; and obtain the text content of the target video based on the first text information and the second text information.
Optionally, the acquiring unit is specifically configured to: extract target keywords from the first text information and the second text information according to a preset word stock to obtain at least one target keyword, where the similarity between a target keyword and a word in the preset word stock is greater than or equal to a preset similarity; and take the at least one target keyword as the text content of the target video.
Optionally, the processing unit is specifically configured to: input the text content and one second object image into a preset matching model, and output the matching degree between them as one target matching degree, where the matching model is trained on a plurality of sample text contents, a plurality of sample images, and a plurality of sample labels, one sample label characterizing the sample text content corresponding to one sample image.
Optionally, the processing unit is specifically configured to: input the text content and one second object image into the preset matching model, extract through the matching model the feature information of the text content and the feature information of the second object image, calculate the similarity between them, and output the similarity as the matching degree between the text content and the second object image.
Optionally, the processing unit is further configured to: input the target video into a preset classification model and output the video type of the target video, where the classification model is trained on a plurality of sample videos and a plurality of sample video types, one sample video corresponding to one sample video type.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: a processor, and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the object detection method of the first aspect described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having instructions stored thereon which, when executed by a processor of an electronic device, enable the electronic device to perform the object detection method of the first aspect described above.
The technical solution provided by the disclosure yields at least the following beneficial effects. The object detection apparatus determines feature information of a plurality of objects in the image frames of the target video, thereby identifying the objects the target video contains. It determines the text content of the target video, which reveals the intended item the target video is meant to present. It then determines the matching degree between the text content and each object according to the feature information of the objects and of the text content, obtaining a plurality of target matching degrees, and determines the subject object of the target video from the plurality of objects according to those matching degrees. Because the target matching degree corresponding to the subject object is greater than or equal to the preset threshold, the subject object matches the key object described by the text content more closely than the other objects in the target video do. The detection result is therefore targeted, and the intended commodity in a short video, that is, the subject object of the short video, can be found, so that the detection result meets the user's needs.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram of a subject detection system, according to an exemplary embodiment;
FIG. 2 is a first flow diagram of an object detection method according to an exemplary embodiment;
FIG. 3 is a second flow diagram of an object detection method according to an exemplary embodiment;
FIG. 4 is a schematic diagram of single-frame image processing according to an exemplary embodiment;
FIG. 5 is a schematic diagram of multi-frame image processing according to an exemplary embodiment;
FIG. 6 is a third flow diagram of an object detection method according to an exemplary embodiment;
FIG. 7 is a schematic diagram of a subject detection flow according to an exemplary embodiment;
FIG. 8 is a schematic structural diagram of an object detection apparatus according to an exemplary embodiment;
FIG. 9 is a schematic diagram of an electronic device according to an exemplary embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
In addition, in the description of the embodiments of the present disclosure, "/" means or, unless otherwise indicated, for example, a/B may mean a or B. "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, in the description of the embodiments of the present disclosure, "a plurality" means two or more than two.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, user behavior information, etc.) and the data (including, but not limited to, program code, etc.) related to the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
Before explaining the embodiments of the present disclosure in detail, some related terms and related techniques related to the embodiments of the present disclosure are described.
Object detection is a central problem in computer vision. It detects whether a picture contains an object to be detected and, if so, determines the object's position and type.
An object detection method in the related art proceeds as follows: first, a plurality of anchor points are determined, centered on feature points in a feature map; then each anchor point is examined, and if a detection object exists at the anchor point, the position information and type of the detection object are output.
In practical applications, object detection may be performed with an object detection model. For example, an image is input into the model, all objects in the image are found using the model, and the objects are then framed in the image.
Mainstream object detection technology is thus generally applied to single-frame images, and the detection result frames all objects in the image.
However, for videos containing many objects, such a result cannot meet the user's needs. For example, in an e-commerce short video, the user often wants to detect the intended commodity the short video actually presents, that is, the subject of the short video.
In view of this, the embodiments of the present disclosure provide an object detection method that aims to find the intended commodity in a short video, that is, the commodity the video really intends to recommend, so as to satisfy users' subject detection needs.
The object detection method provided by the embodiment of the present disclosure is described in detail below with reference to the accompanying drawings.
The object detection method provided by the embodiments of the present disclosure may be applied to a subject detection system; FIG. 1 shows a schematic structural diagram of such a system. As shown in FIG. 1, the subject detection system 10 includes an electronic device 11 and a server 12, connected to each other. The connection between the electronic device 11 and the server 12 may be wired or wireless, which is not limited in the embodiments of the present disclosure.
The electronic device 11 is configured to determine feature information of a plurality of objects in the image frames of the target video and to determine feature information of the text content of the target video. The electronic device 11 is further configured to determine the matching degree between the text content and each object according to the feature information of the objects and of the text content, to obtain a plurality of target matching degrees, and to determine the subject object of the target video from the plurality of objects according to those matching degrees.
Alternatively, the electronic device 11 may acquire the target video from the server 12, and perform subject detection on the acquired target video.
Optionally, the electronic device 11 is configured to store and play the target video. For example, the electronic device 11 is deployed with multimedia software, and the electronic device 11 downloads short videos from a server or plays short videos through the multimedia software.
The electronic device 11 may be a terminal of various forms. For example, the electronic device 11 may be a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a netbook, a personal digital assistant (personal digital assistant, PDA), a desktop computer, a cloud server, etc., and the embodiments of the present disclosure are not limited to a specific type of electronic device.
In the following embodiments of the present disclosure, the description takes as an example the case where the electronic device 11 and the server 12 are provided independently of each other.
Fig. 2 is a flow diagram illustrating an object detection method according to some example embodiments. In some embodiments, the above-described object detection method may be applied to the electronic device shown in fig. 1, and may also be applied to other similar devices.
As shown in fig. 2, the object detection method provided in the embodiment of the present disclosure includes the following S201 to S204.
S201, the electronic device determines feature information of a plurality of objects in the target video.
As one possible implementation, the electronic device acquires a plurality of image frames of the target video and performs object detection on each image frame separately, detecting all the objects each image frame contains. The electronic device then crops the detected object regions from the image frames to obtain a plurality of object images, and inputs the object images into a neural network model to obtain the corresponding feature information.
As another possible implementation, after cropping the detected object regions from the image frames to obtain a plurality of object images, the electronic device filters the object images to keep those that meet requirements (for example, relatively high quality). The electronic device then inputs the filtered object images into the neural network model to obtain the corresponding feature information.
It should be noted that the neural network model in this embodiment may be a convolutional neural network (Convolutional Neural Network, CNN) model, a Visual Geometry Group (VGG) model, or another model with image processing capability; the embodiments of the present disclosure are not limited to a specific neural network model.
The neural network model includes multiple stages of convolutional layers. After an object image is input, the model convolves it stage by stage, converting the object image into a feature vector (an array of numbers), which the electronic device takes as the feature information of the object.
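Illustratively, this feature-extraction step can be sketched as follows. This is a minimal illustration only, assuming a torchvision ResNet-50 backbone with its classification head removed as a stand-in for the unspecified CNN/VGG model; the disclosure does not mandate any particular network.

```python
# Hedged sketch of S201's feature extraction, assuming a ResNet-50 backbone.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Strip the classification head so the model emits a feature vector.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def object_features(crop: Image.Image) -> torch.Tensor:
    """Map one cropped object image to a 2048-d feature vector."""
    with torch.no_grad():
        return backbone(preprocess(crop).unsqueeze(0)).squeeze(0)
```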
S202, the electronic device determines the text content of the target video.
As one possible implementation, the electronic device obtains the text content of the target video and inputs it into a neural network model to obtain the corresponding feature information.
It will be appreciated that video, as a multimedia asset, typically carries information in several modalities, such as audio, images, and video description information (including the video title and comment content). Because a video comprises many image frames rather than a single image, its visual content is large and the subject is hard to single out from images alone. The text content of a video, such as the title, comments, or the transcript of its audio, generally reflects the intention the video really expresses.
For example, an e-commerce short video presenting a T-shirt usually also shows trousers, shoes, and other items; object detection alone detects all of them and cannot tell which object the video really intends to present. The audio of such a video, however, usually introduces the T-shirt specifically, or the video title is simply "T-shirt".
It should be noted that the embodiments of the present disclosure do not limit the execution order of S201 and S202: the electronic device may execute S201 before S202, S202 before S201, or both simultaneously.
S203, the electronic device determines the matching degree between the text content and each object according to the feature information of the plurality of objects and the feature information of the text content, obtaining a plurality of target matching degrees.
As one possible implementation, the electronic device computes the matching degree between the feature information of the text content and the feature information of each object, obtaining a plurality of target matching degrees.
In practical applications, the electronic device can compute the matching degree between text and image with a pre-trained matching model.
The electronic device inputs the text content of the target video and an object image into the matching model, which extracts the feature information of the text content and of the object image. The electronic device then computes, through the matching model, the similarity between the two feature vectors and outputs the similarity as the matching degree.
It can be appreciated that a pre-trained matching model yields the matching degree between the text content and an object quickly and accurately.
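Illustratively, this cross-modal matching can be sketched with OpenAI's CLIP standing in for the pre-trained matching model. The disclosure does not name a specific model, so the choice of CLIP and of cosine similarity as the matching degree are assumptions of this sketch.

```python
# Hedged sketch of S203, using CLIP as a stand-in for the matching model.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, clip_preprocess = clip.load("ViT-B/32", device=device)

def matching_degree(text_content: str, object_crop: Image.Image) -> float:
    """Cosine similarity between text and image embeddings, in [-1, 1]."""
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([text_content]).to(device))
        image_feat = model.encode_image(
            clip_preprocess(object_crop).unsqueeze(0).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    return float((text_feat @ image_feat.T).item())
```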
S204, the electronic device determines the subject object of the target video from the plurality of objects according to the plurality of target matching degrees.
The target matching degree corresponding to the subject object is greater than or equal to a preset threshold.
As one possible implementation, the electronic device compares each target matching degree with the preset threshold, and determines the object whose target matching degree is greater than or equal to the preset threshold as the subject object of the target video.
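Illustratively, the threshold comparison reduces to a simple filter. The threshold value of 0.25 below is an arbitrary placeholder, since the disclosure leaves the preset value open.

```python
# Minimal sketch of S204, assuming matching degrees were already computed
# (e.g. with the CLIP-based sketch above); 0.25 is an illustrative threshold.
def select_subjects(candidates, degrees, threshold=0.25):
    """Keep every candidate whose target matching degree clears the threshold."""
    return [obj for obj, degree in zip(candidates, degrees) if degree >= threshold]
```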
In some embodiments, after determining the subject object of the target video from the plurality of objects, the electronic device retains only the subject object and discards all the remaining objects.
It can be understood that, through this video subject detection, the electronic device can find the intended commodity in an e-commerce short video, that is, the commodity that is really being promoted, and can then search for or recommend the same commodity.
The technical solution provided by the disclosure yields at least the following beneficial effects. The electronic device determines feature information of a plurality of objects in the image frames of the target video, identifying the objects the target video contains. It determines the feature information of the text content of the target video, the text content describing the key object, so that the electronic device learns the intended item the target video is meant to present. It then determines the matching degree between the text content and each object according to the feature information of the objects and of the text content, obtaining a plurality of target matching degrees, and determines the subject object of the target video from the plurality of objects according to those matching degrees. Because the target matching degree corresponding to the subject object is greater than or equal to the preset threshold, the subject object matches the key object described by the text content more closely than the other objects in the target video do. The detection result is therefore targeted, and the intended commodity in a short video, that is, the subject of the short video, can be found, so that the detection result meets the user's needs.
In one design, in order to determine the feature information of a plurality of objects in the image frames of the target video, as shown in FIG. 3, S201 provided in the embodiments of the disclosure specifically includes:
S2011, the electronic device acquires a plurality of image frames of the target video.
As a possible implementation, the electronic device samples the image frames of the target video at a preset frequency to obtain a plurality of image frames.
Illustratively, the electronic device samples the target video at one frame per second, obtaining a plurality of image frames.
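Illustratively, such sampling can be sketched with OpenCV; the 1 fps rate follows the example above and is otherwise an arbitrary choice.

```python
# Sketch of S2011: sample roughly one frame per second from the target video.
import cv2

def sample_frames(video_path: str, every_n_seconds: float = 1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if fps is unknown
    step = max(1, int(round(fps * every_n_seconds)))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```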
S2012, the electronic device performs object detection on the objects included in each image frame, obtaining a plurality of first object images.
As a possible implementation, the electronic device performs object detection on each frame separately, detecting the objects in each frame, and then crops each frame to cut out the detected objects, obtaining a plurality of first object images.
The objects corresponding to different first object images may be the same or different.
Alternatively, the electronic device may perform object detection on the objects in each image frame with an image detector, which identifies regions of interest in the image and marks the objects in the image frame.
Illustratively, the electronic device inputs an image frame into a YOLOv5 detector, obtains the position information and type of each object in the frame, and marks each detected object in the frame with a bounding box.
As shown in FIG. 4, image 1 is input into an image detector trained to detect clothing; the detector boxes all clothing objects in image 1 (including a T-shirt, trousers, and shoes), yielding image 2 with box annotations. The electronic device then crops image 2 to obtain images containing only single objects, that is, the several images 3.
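Illustratively, detection and cropping with the YOLOv5 detector named above can be sketched via torch.hub; the 0.4 confidence cutoff is an assumption of this sketch, not a value from the disclosure.

```python
# Sketch of S2012: detect objects with YOLOv5 and crop them out of the frame.
import cv2
import numpy as np
import torch

detector = torch.hub.load("ultralytics/yolov5", "yolov5s")  # pretrained weights

def detect_and_crop(frame, min_conf: float = 0.4):
    """Return (crop, class_id) pairs for objects detected in one BGR frame."""
    results = detector(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # model expects RGB
    crops = []
    for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
        if conf >= min_conf:
            crop = np.ascontiguousarray(frame[int(y1):int(y2), int(x1):int(x2)])
            crops.append((crop, int(cls)))
    return crops
```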
S2013, the electronic device clusters the plurality of first object images according to a preset clustering algorithm, obtaining at least one object image set.
The object images in one object image set represent the same object.
It will be appreciated that a short video typically displays a commodity that appears simultaneously in many frames without significant change in spatial position. As shown in FIG. 5, after object detection is performed on each of three image frames of the target video, the several first object images reflect the same object.
In practical applications, the electronic device may cluster the plurality of first object images with the density-based DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, that is, aggregate different images of the same commodity. This yields a plurality of commodity sets, each representing one commodity.
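Illustratively, the DBSCAN step can be sketched with scikit-learn over the crop feature vectors; the eps and min_samples values below are assumptions that would need tuning.

```python
# Sketch of S2013: group crop embeddings so that crops of the same commodity
# fall into one set.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_object_images(embeddings: np.ndarray):
    """embeddings: (n_crops, dim) array, e.g. from object_features above."""
    labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(embeddings)
    clusters = {}
    for idx, label in enumerate(labels):
        if label != -1:               # -1 marks noise points
            clusters.setdefault(label, []).append(idx)
    return clusters                   # cluster id -> indices of crops
```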
S2014, the electronic device determines one second object image from each object image set, obtaining a plurality of second object images, and takes the feature information of the plurality of second object images as the feature information of the plurality of objects in the image frames of the target video.
A second object image is the image with the highest image quality in its object image set.
As a possible implementation, the electronic device determines the image with the highest image quality in each object image set, obtaining a plurality of second object images, and takes their feature information as the feature information of the plurality of objects in the image frames of the target video.
It should be noted that image quality can be measured by factors such as sharpness and the position of the object in the image (for example, whether the object is centered, or whether the commodity is shown from the front).
It can be understood that the embodiments of the present disclosure exploit the spatio-temporal track association in video: only one image is used per object, which prevents the same object from being detected repeatedly, avoiding extra computation and preserving the efficiency of the subsequent matching.
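Illustratively, the quality criterion of S2014 can be sketched by combining the two factors just listed, sharpness and centering. The particular score and weighting below are assumptions, since the disclosure only names the factors.

```python
# Hedged sketch of S2014's quality-based selection.
import cv2
import numpy as np

def quality_score(crop, frame_shape, box):
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()   # variance of Laplacian
    # Distance of the box centre from the frame centre, normalised to [0, 1].
    fh, fw = frame_shape[:2]
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    offset = np.hypot(cx / fw - 0.5, cy / fh - 0.5) / np.hypot(0.5, 0.5)
    return sharpness * (1.0 - offset)

def best_image(crops_with_meta):
    """crops_with_meta: iterable of (crop, frame_shape, box) for one cluster."""
    return max(crops_with_meta, key=lambda m: quality_score(*m))
```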
In one design, in order to obtain the text content of the target video, as shown in FIG. 6, the object detection method provided in the embodiments of the disclosure further includes:
S301, the electronic device acquires the audio information of the target video and performs text conversion on it to obtain first text information.
As one possible implementation, the electronic device extracts the audio information from the target video and converts it to text using automatic speech recognition (Automatic Speech Recognition, ASR), obtaining the first text information.
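Illustratively, this conversion can be sketched with openai-whisper standing in for the unspecified ASR engine; any speech-to-text system would serve equally well.

```python
# Sketch of S301: transcribe the video's audio track to first text information.
import whisper

asr_model = whisper.load_model("base")

def audio_to_text(audio_path: str) -> str:
    """Whisper accepts audio (or video) file paths and returns a transcript."""
    return asr_model.transcribe(audio_path)["text"]
```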
S302, the electronic device acquires the description information of the target video and takes it as second text information.
The description information includes the title and/or video type of the target video.
As a possible implementation, the electronic device obtains the title and/or video type of the target video from locally stored data, takes them as the description information of the target video, and uses the description information as the second text information.
In some embodiments, to determine the video type of the target video, the electronic device may also input the target video into a preset classification model and output the video type. The classification model is trained on a plurality of sample videos and sample video types, one sample video corresponding to one sample video type.
Illustratively, the electronic device inputs multimodal information of the target video (including its audio and image information) into the trained classification model, which outputs the video type of the target video.
It can be appreciated that when a video lacks a textual type description, its type can still be determined by the trained classification model, supplementing the text content.
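Illustratively, since the disclosure does not specify the classification model's architecture, the following sketch assumes a simple linear head over mean-pooled frame features; a real implementation would likely fuse audio features in the same way.

```python
# Hedged sketch of the preset classification model (architecture assumed).
import torch
import torch.nn as nn

class VideoTypeClassifier(nn.Module):
    def __init__(self, feature_dim: int = 2048, num_types: int = 20):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_types)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (n_frames, feature_dim); mean-pool over time,
        # then score each candidate video type.
        return self.head(frame_features.mean(dim=0))
```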
S303, the electronic device obtains the text content of the target video based on the first text information and the second text information.
As one possible implementation, the electronic device merges the first text information and the second text information and takes the merged text information as the text content of the target video.
As another possible implementation, the electronic device merges the first and second text information, extracts keywords from the merged text, and takes the extracted keywords as the text content of the target video.
For example, the electronic device matches the merged text information against a preset word stock (for example, a word stock of 20,000 entity words); if the merged text contains a word identical or similar to a word in the stock, the electronic device extracts that word as a target keyword.
In another example, the electronic device computes the similarity between each word in the merged text information and the words in the preset word stock, obtaining a plurality of word similarities, and takes as target keywords the words whose similarity is greater than or equal to the preset similarity.
It can be understood that by extracting keywords from the text information according to the preset word stock, the electronic device obtains more concise and representative words, simplifying the text content and reducing the subsequent computation.
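Illustratively, the keyword extraction against the preset word stock can be sketched with difflib similarity; the three-word stock, the 0.8 cutoff, and whitespace tokenization are all simplifying assumptions (a Chinese-text implementation would need a proper word segmenter).

```python
# Hedged sketch of S303's keyword extraction against a preset word stock.
from difflib import SequenceMatcher

WORD_STOCK = {"T-shirt", "trousers", "shoes"}   # stands in for ~20,000 entity words
PRESET_SIMILARITY = 0.8

def extract_target_keywords(merged_text: str):
    keywords = []
    for word in merged_text.split():
        best = max(SequenceMatcher(None, word.lower(), entry.lower()).ratio()
                   for entry in WORD_STOCK)
        if best >= PRESET_SIMILARITY:
            keywords.append(word)
    return keywords
```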
For ease of understanding, FIG. 7 shows the detection flow of subject detection in an embodiment of the present disclosure. In the first step, the electronic device acquires the image frames and related text content of the short video. In the second step, the electronic device performs salient-object detection on each frame; as shown in FIG. 7, the more salient objects in the image are boxed. In the third step, spatio-temporal tracks are associated: because an item appears simultaneously in multiple frames and the similarity between its appearances is high, those appearances are easily associated and aggregated. In the fourth step, cross-modal matching (matching between images and text) is performed: the electronic device extracts related entity words from the video's text information, for example the trending short-sleeve "T-shirt" in FIG. 7, and then checks the consistency between the text and each image obtained in the third step, that is, performs pairwise cross-modal matching. The "T-shirt" is related to the object the short video actually presents, which yields the video subject of the fifth step.
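Illustratively, the five steps of FIG. 7 can be tied together using the sketches above (sample_frames, detect_and_crop, object_features, cluster_object_images, matching_degree). Picking the first member of each cluster as its representative is a simplification of the quality-based choice described earlier.

```python
# End-to-end sketch composing the earlier hedged snippets.
import cv2
import numpy as np
from PIL import Image

def detect_video_subject(video_path: str, text_content: str, threshold: float = 0.25):
    frames = sample_frames(video_path)                                   # step 1
    crops = [c for f in frames for c, _ in detect_and_crop(f)]           # step 2
    if not crops:
        return []
    pil_crops = [Image.fromarray(cv2.cvtColor(c, cv2.COLOR_BGR2RGB)) for c in crops]
    feats = np.stack([object_features(p).numpy() for p in pil_crops])
    clusters = cluster_object_images(feats)                              # step 3
    subjects = []
    for members in clusters.values():
        representative = pil_crops[members[0]]                           # simplified pick
        if matching_degree(text_content, representative) >= threshold:   # step 4
            subjects.append(representative)                              # step 5
    return subjects
```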
The foregoing embodiments have mainly described the solutions provided by the embodiments of the present disclosure from the viewpoint of apparatuses (devices). It should be understood that, in order to implement the above method, the apparatus or device includes hardware structures and/or software modules that perform respective method flows, where the hardware structures and/or software modules that perform respective method flows may form an electronic device. Those of skill in the art will readily appreciate that the algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The embodiments of the present disclosure may divide the functional modules of the apparatus or device according to the above method examples, for example, the apparatus or device may divide each functional module corresponding to each function, or may integrate two or more functions into one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present disclosure, the division of the modules is merely a logic function division, and other division manners may be implemented in actual practice.
Fig. 8 is a schematic structural view of an object detection apparatus according to an exemplary embodiment. Referring to fig. 8, an object detection apparatus 40 provided in an embodiment of the present disclosure includes a determination unit 401 and a processing unit 402.
The determining unit 401 is configured to determine feature information of a plurality of objects in the image frames of a target video, and to determine feature information of text content of the target video, the text content describing a key object. The processing unit 402 is configured to determine the matching degree between the text content and each object according to the feature information of the plurality of objects and of the text content, to obtain a plurality of target matching degrees. The determining unit 401 is further configured to determine the subject object of the target video from the plurality of objects according to the plurality of target matching degrees, where the target matching degree corresponding to the subject object is greater than or equal to a preset threshold.
Optionally, the determining unit 401 is specifically configured to: acquire a plurality of image frames of the target video, and perform object detection on the objects included in each image frame to obtain a plurality of first object images; cluster the plurality of first object images according to a preset clustering algorithm to obtain at least one object image set, where the object images in one set represent the same object; and determine one second object image from each object image set to obtain a plurality of second object images, and take the feature information of the plurality of second object images as the feature information of the plurality of objects in the image frames of the target video, where a second object image is the highest-quality image in its object image set.
Optionally, the apparatus further includes an acquiring unit 403 configured to: acquire audio information of the target video, and perform text conversion on the audio information to obtain first text information; acquire description information of the target video as second text information, where the description information includes a title and/or a video type of the target video; and obtain the text content of the target video based on the first text information and the second text information.
Optionally, the acquiring unit 403 is specifically configured to: extract target keywords from the first text information and the second text information according to a preset word stock to obtain at least one target keyword, where the similarity between a target keyword and a word in the preset word stock is greater than or equal to a preset similarity; and take the at least one target keyword as the text content of the target video.
Optionally, the processing unit 402 is specifically configured to: input the text content and one second object image into a preset matching model, and output the matching degree between them as one target matching degree, where the matching model is trained on a plurality of sample text contents, a plurality of sample images, and a plurality of sample labels, one sample label characterizing the sample text content corresponding to one sample image.
Optionally, the processing unit 402 is specifically configured to: input the text content and one second object image into the preset matching model, extract through the matching model the feature information of the text content and the feature information of the second object image, calculate the similarity between them, and output the similarity as the matching degree between the text content and the second object image.
Optionally, the processing unit 402 is further configured to: input the target video into a preset classification model and output the video type of the target video, where the classification model is trained on a plurality of sample videos and a plurality of sample video types, one sample video corresponding to one sample video type.
Fig. 9 is a schematic structural diagram of an electronic device provided in the present disclosure. As shown in fig. 9, the electronic device 11 may include at least one processor 501 and a memory 502 for storing processor executable instructions, wherein the processor 501 is configured to execute the instructions in the memory 502 to implement the object detection method in the above-described embodiments.
In addition, the electronic device 11 may also include a communication bus 503 and at least one communication interface 504.
The processor 501 may be a central processing unit (CPU), a micro-processing unit, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the present disclosure.
The communication bus 503 may include a path that transfers information between the above components.
The communication interface 504 uses any transceiver-like device for communicating with other devices or communication networks, such as Ethernet, a radio access network (radio access network, RAN), or a wireless local area network (wireless local area networks, WLAN).
The memory 502 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (random access memory, RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer. The memory may be standalone and coupled to the processor via a bus, or may be integrated with the processor.
The memory 502 stores the instructions for executing the solutions of the present disclosure, and their execution is controlled by the processor 501. The processor 501 is configured to execute the instructions stored in the memory 502 to implement the functions of the object detection method of the present disclosure.
As an example, with reference to FIG. 8, the functions of the determining unit 401, the processing unit 402, and the acquiring unit 403 in the object detection apparatus 40 are implemented by the processor 501 in FIG. 9.
In a specific implementation, as one embodiment, the processor 501 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 9.
In a specific implementation, as one embodiment, the electronic device 11 may include multiple processors, such as the processor 501 and the processor 507 in FIG. 9. Each of these processors may be a single-core (single-CPU) or multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In a specific implementation, as one embodiment, the electronic device 11 may further include an output device 505 and an input device 506. The output device 505 communicates with the processor 501 and may display information in a variety of ways; for example, it may be a liquid crystal display (liquid crystal display, LCD), a light-emitting diode (light emitting diode, LED) display device, a cathode-ray tube (CRT) display device, or a projector. The input device 506 communicates with the processor 501 and may accept user input in a variety of ways; for example, it may be a mouse, a keyboard, a touch-screen device, or a sensing device.
Those skilled in the art will appreciate that the structure shown in fig. 9 is not limiting of the electronic device 11 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
In addition, the present disclosure also provides a computer-readable storage medium whose instructions, when executed by a processor of an electronic device, enable the electronic device to perform the object detection method provided in the above embodiments.
In addition, the present disclosure also provides a computer program product comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the object detection method provided in the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any adaptations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. An object detection method, the method comprising:
determining feature information of a plurality of objects in a target video and text content of the target video;
determining a matching degree between the text content and each object according to the feature information of the plurality of objects and feature information of the text content, to obtain a plurality of target matching degrees;
determining a subject object of the target video from the plurality of objects according to the plurality of target matching degrees; the target matching degree corresponding to the subject object being greater than or equal to a preset threshold.
2. The object detection method according to claim 1, wherein determining feature information of a plurality of objects in the target video includes:
acquiring a plurality of image frames of the target video, and performing object detection on the objects included in each image frame to obtain a plurality of first object images;
clustering the plurality of first object images to obtain at least one object image set; the object images in one object image set representing the same object;
determining one second object image from each object image set to obtain a plurality of second object images; a second object image being the image with the highest image quality in its object image set;
and taking the feature information of the plurality of second object images as the feature information of the plurality of objects in the target video.
3. The object detection method according to claim 1, characterized in that the method further comprises:
acquiring audio information of the target video, and performing text conversion on the audio information to obtain first text information;
acquiring description information of the target video as second text information; the description information comprising a title and/or a video type of the target video;
and obtaining the text content of the target video based on the first text information and the second text information.
4. The method according to claim 3, wherein obtaining the text content of the target video based on the first text information and the second text information includes:
extracting target keywords from the first text information and the second text information according to a preset word stock to obtain at least one target keyword; the preset word stock comprising a plurality of words representing object names, and the similarity between a target keyword and a word in the preset word stock being greater than or equal to a preset similarity;
and taking the at least one target keyword as the text content of the target video.
5. The object detection method according to claim 2, wherein determining the matching degree between the text content and each object according to the feature information of the plurality of objects and the feature information of the text content to obtain a plurality of target matching degrees includes:
inputting the text content and one second object image into a preset matching model, and outputting the matching degree between the text content and the second object image as one target matching degree; the matching model being trained on a plurality of sample text contents, a plurality of sample object images and a plurality of sample labels; one sample label characterizing the matching degree between one sample object image and one sample text content.
6. The object detection method according to claim 5, wherein inputting the text content and one second object image into a preset matching model and outputting the matching degree between the text content and the second object image comprises:
inputting the text content and the second object image into the matching model, to extract through the matching model the feature information of the text content and the feature information of the second object image;
and calculating the similarity between the feature information of the text content and the feature information of the second object image, and outputting the similarity as the matching degree between the text content and the second object image.
7. The object detection method according to claim 3, characterized in that the method further comprises:
inputting the target video into a preset classification model, and outputting the video type of the target video; the classification model being trained on a plurality of sample videos and a plurality of sample video types; one sample video corresponding to one sample video type.
8. An object detection apparatus, characterized in that the apparatus comprises a determining unit and a processing unit;
the determining unit is configured to determine feature information of a plurality of objects in a target video and text content of the target video;
the processing unit is configured to determine a matching degree between the text content and each object according to the feature information of the plurality of objects and feature information of the text content, to obtain a plurality of target matching degrees;
the determining unit is further configured to determine a subject object of the target video from the plurality of objects according to the plurality of target matching degrees; the target matching degree corresponding to the subject object being greater than or equal to a preset threshold.
9. An electronic device, comprising: a processor, and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the object detection method of any one of claims 1-7.
10. A computer-readable storage medium having instructions stored thereon which, when executed by a processor of an electronic device, enable the electronic device to perform the object detection method according to any one of claims 1-7.
CN202310376483.4A 2023-04-10 2023-04-10 Object detection method, device, electronic equipment and storage medium Pending CN116630842A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310376483.4A | 2023-04-10 | 2023-04-10 | Object detection method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116630842A (en) 2023-08-22



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination