CN114896455A - Video tag generation method and device, electronic equipment and storage medium - Google Patents

Video tag generation method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN114896455A
Authority
CN
China
Prior art keywords
target
image
objects
video
video resource
Prior art date
Legal status
Pending
Application number
CN202210510803.6A
Other languages
Chinese (zh)
Inventor
迟至真
成乐乐
李思则
王仲远
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210510803.6A priority Critical patent/CN114896455A/en
Publication of CN114896455A publication Critical patent/CN114896455A/en
Pending legal-status Critical Current

Classifications

    • G06F16/7867: Information retrieval of video data; retrieval characterised by metadata generated manually, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G06N3/045: Neural network architectures; combinations of networks
    • G06T7/11: Image analysis; region-based segmentation
    • G06T7/194: Image analysis; segmentation involving foreground-background segmentation
    • G06T7/90: Image analysis; determination of colour characteristics
    • G06V10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06V10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; coarse-fine approaches, e.g. multi-scale approaches; using context analysis; selection of dictionaries
    • G06V10/761: Proximity, similarity or dissimilarity measures
    • G06V10/762: Image or video recognition using machine learning with clustering, e.g. of similar faces in social networks
    • G06V10/82: Image or video recognition using machine learning with neural networks
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V30/153: Segmentation of character regions using recognition of characters or words
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06V2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a video tag generation method and apparatus, an electronic device and a storage medium, and relates to the field of network technologies. The scheme includes: acquiring a target image from a target video resource, where the target video resource contains a target object and the target object is the object corresponding to the information reflected by the target video resource; performing object detection processing on the target image to obtain a plurality of first objects, where the first objects are objects in the target image; determining the target object from the plurality of first objects according to feature information of the target video resource; and determining a target tag of the target object, where the target tag corresponds to the video tag of the target video resource. This technical scheme can improve the accuracy of identifying video tags.

Description

Video tag generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of network technologies, and in particular, to a method and an apparatus for generating a video tag, an electronic device, and a storage medium.
Background
With the development of network technology, more and more users browse multimedia resources on network platforms through terminals (such as mobile phones, notebook computers, tablet computers and the like). For example, the multimedia resource may be a video.
Currently, to facilitate management, videos may be classified by tags. The network platform can identify the tag of a video according to the text information in the video (such as the title information of the video, the voice content in the video, and the like). For example, if the video contains the text "today I recommend a very tasty snack", the tag of the video may be "snack" (or "food", etc.).
However, the text information is typically provided by the user who uploaded the video. When the text information is expressed inaccurately, the tag of the video may not be identified accurately from it. How to improve the accuracy of identifying video tags has therefore become an urgent problem.
Disclosure of Invention
The present disclosure provides a video tag generation method, apparatus, electronic device, and storage medium, which can improve the accuracy of identifying a video tag. The technical scheme of the disclosure is as follows:
according to a first aspect of the present disclosure, a method for generating a video tag is provided, the method including:
and acquiring a target image from a target video resource, wherein the target video resource comprises a target object, and the target object is an object corresponding to the information reflected by the target video resource. And carrying out object detection processing on the target image to obtain a plurality of first objects, wherein the first objects are object objects in the target image. And determining the target object from the plurality of first objects according to the characteristic information of the target video resource. And determining a target label of the target object, wherein the target label corresponds to the video label of the target video resource.
Optionally, in a case that the number of frames of the target image is multiple frames, the method for generating the video tag further includes: and carrying out object detection processing on each frame of target image to obtain a plurality of second objects contained in each frame of target image. And clustering a plurality of second objects in the multi-frame target image to obtain a plurality of first objects.
Optionally, the feature information includes a target area, and the target area includes an area where the target object is located. The method for generating the video label further comprises the step of determining a plurality of target coincidence degrees according to the characteristic information of the target video resource, wherein the target coincidence degrees are coincidence degrees of the region where the first object is located and the target region. And determining a target object according to the target overlap ratios of the plurality of first objects, wherein the target object is the first object meeting the preset condition in the plurality of target overlap ratios.
Optionally, the target area includes an area where the person image is located. The method for generating the video tag further comprises the following steps: when the target image has the character image, the target contact ratios corresponding to the first objects are determined according to the character image.
Optionally, the target region includes a foreground region. The method for generating the video tag further comprises the following steps: and when the target image does not have the character image, performing image segmentation processing on the target image to determine a foreground region of the target image. And determining the target coincidence degrees corresponding to the plurality of first objects according to the foreground area.
Optionally, the feature information includes: and target text information, wherein the target text information is used for indicating the information reflected by the target video resource. The method for generating the video tag further comprises the following steps: and identifying target text information and determining the category of the target video resource. And determining the target object according to the category of the target video and the categories of the plurality of first objects.
Optionally, the method for generating a video tag further includes: and acquiring color information of the image frame in the target video source. According to the color information of the image frames, the color information difference degree between the image frames is determined. Determining a target image from the image frames, wherein the target image comprises a plurality of images, and the difference degree of color information between adjacent image frames in the target image is greater than a preset difference degree threshold value.
According to a second aspect of the present disclosure, there is provided a video tag generation apparatus including: an acquisition unit and a processing unit.
The acquisition unit is configured to acquire a target image from a target video resource, where the target video resource contains a target object and the target object is the object corresponding to the information reflected by the target video resource. The processing unit is configured to perform object detection processing on the target image to obtain a plurality of first objects, where the first objects are objects in the target image. The processing unit is further configured to determine the target object from the plurality of first objects according to feature information of the target video resource. The processing unit is further configured to determine a target tag of the target object, where the target tag corresponds to the video tag of the target video resource.
Optionally, the target image consists of multiple frames. The processing unit is further configured to perform object detection processing on the target images to obtain a plurality of second objects contained in each frame of the target image, and to cluster the plurality of second objects in the multiple frames to obtain the plurality of first objects.
Optionally, the feature information includes a target region, and the target region includes the region where the target object is located. The processing unit is further configured to determine a plurality of target coincidence degrees according to the feature information of the target video resource, where a target coincidence degree is the degree of coincidence between the region where a first object is located and the target region. The processing unit is further configured to determine the target object according to the target coincidence degrees of the plurality of first objects, where the target object is the first object whose target coincidence degree satisfies a preset condition.
Optionally, the target region includes the region where a person image is located. The processing unit is further configured to determine the target coincidence degrees corresponding to the plurality of first objects according to the person image when a person image exists in the target image.
Optionally, the target region includes a foreground region. The processing unit is further configured to perform image segmentation processing on the target image to determine the foreground region of the target image when no person image exists in the target image, and to determine the target coincidence degrees corresponding to the plurality of first objects according to the foreground region.
Optionally, the feature information includes target text information, where the target text information indicates the information reflected by the target video resource. The processing unit is further configured to recognize the target text information and determine the category of the target video resource, and to determine the target object according to the category of the target video resource and the categories of the plurality of first objects.
Optionally, the processing unit is further configured to acquire color information of the image frames in the target video resource, determine the color information difference degree between the image frames according to the color information of the image frames, and determine the target image from the image frames, where the target image includes a plurality of images and the color information difference degree between adjacent image frames in the target image is greater than a preset difference threshold.
According to a third aspect of the present disclosure, there is provided an electronic apparatus comprising:
a processor; and a memory for storing instructions executable by the processor; where the processor is configured to execute the instructions to implement the video tag generation method of any one of the first aspect and its optional implementations.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having instructions stored thereon which, when executed by a processor of an electronic device, enable the electronic device to perform the video tag generation method of any one of the first aspect and its optional implementations.
According to a fifth aspect of the present disclosure, there is provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the video tag generation method of any one of the first aspect and its optional implementations.
According to a sixth aspect of the present disclosure, there is provided a chip comprising a processor and a communication interface, the communication interface being coupled to the processor, the processor being configured to execute a computer program or instructions to implement the method for generating a video tag as described in the first aspect and any possible implementation manner of the first aspect.
The technical scheme provided by the present disclosure brings at least the following beneficial effects: a target image is acquired from a target video resource, where the target video resource contains a target object and the target object is the object corresponding to the information reflected by the target video resource; object detection processing is performed on the target image to obtain a plurality of first objects, where the first objects are objects in the target image; the target object is determined from the plurality of first objects according to feature information of the target video resource; and a target tag of the target object is determined, where the target tag corresponds to the video tag of the target video resource. Since the target object is an object in the target video resource and corresponds to the feature information, once the target tag of the target object is determined, the tag of the target video resource can be determined accordingly, which improves the accuracy of identifying video tags.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram of a communication system shown in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method of video tag generation in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating another method of video tag generation in accordance with an illustrative embodiment;
FIG. 4 is a diagram illustrating an example of a model in accordance with an exemplary embodiment;
FIG. 5 is an example diagram of another model shown in accordance with an example embodiment;
FIG. 6 is an example diagram of another model shown in accordance with an example embodiment;
FIG. 7 is a flow chart illustrating another method of video tag generation in accordance with an exemplary embodiment;
FIG. 8 is a diagram illustrating an example of a person image in accordance with one illustrative embodiment;
FIG. 9 is a diagram illustrating an example of a foreground image in accordance with an illustrative embodiment;
FIG. 10 is a schematic structural diagram illustrating a video tag generation apparatus according to an exemplary embodiment;
FIG. 11 is a schematic structural diagram illustrating another video tag generation apparatus according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) referred to in the present disclosure is information authorized by the user or sufficiently authorized by each party.
First, an application scenario of the embodiment of the present disclosure is described.
The video tag generation method provided by the embodiments of the present disclosure is applied to scenarios in which video resources are identified. In the related art, the network platform may identify the tag of a short video according to the text information in the video (e.g., the title information of the short video, the voice content of the short video, etc.). However, the text information is typically provided by the user who uploaded the short video. When the text information is expressed inaccurately, the tag of the short video may not be identified accurately from it. How to improve the accuracy of identifying the tags of short videos has therefore become an urgent problem.
To solve the above problem, an embodiment of the present disclosure provides a video tag generation method: a target image is acquired, where the target image is an image in a target video resource, the target video resource contains a target object, and the target object is the object corresponding to the information reflected by the target video resource; the target image is processed to determine the target object; and a target tag of the target object is determined, where the target tag corresponds to the target video resource. In this way, the target tag can be determined from the target object in the target video resource. Since the target object actually exists in the target video resource, using its target tag as the tag of the target video resource improves the accuracy of identifying the tag of the target video resource.
Fig. 1 is a schematic diagram of a communication system according to an embodiment of the present disclosure, as shown in fig. 1, the communication system may include: a server 01 and a terminal 02, wherein the server 01 can be connected with the terminal 02 through a wired network or a wireless network.
The server 01 may be a data server of a multimedia resource service platform and may be used to store and process multimedia resources. For example, the multimedia resource service platform may be a short video application service platform, a news service platform, a live broadcast service platform, a shopping service platform, a take-away service platform, a sharing service platform, a functional website, and the like. The multimedia resources provided by the short video application service platform may be short video works, the multimedia resources provided by the news service platform may be news information, the multimedia resources provided by the live broadcast service platform may be live broadcast works, and so on; the remaining cases are not enumerated one by one. The present disclosure does not limit the specific type of the multimedia resource service platform.
As a possible implementation manner, the server 01 may include a plurality of multimedia resource service platforms, and each multimedia resource service platform uniquely corresponds to one application program. The application is installed on the terminal and displays the multimedia resource on the content display interface of the terminal 02. The server 01 is mainly used for storing relevant data of the content community application installed on the terminal 02, and can send corresponding data (which may be called a file to be transmitted) to the terminal when receiving a data acquisition request sent by the terminal 02.
In some embodiments, the server 01 may be a single server, or may be a server cluster composed of a plurality of servers. In some embodiments, the server cluster may also be a distributed cluster. The present disclosure is also not limited to the specific implementation of the server 01.
In still other embodiments, the server 01 may further comprise or be connected to a database, and the multimedia resources of the multimedia resource service platform may be stored in the database. The terminal 02 can realize the access operation of the multimedia resources in the database through the server 01.
The terminal 02 may be a mobile phone, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) device, a virtual reality (VR) device, or the like on which a content community application (e.g., Kuaishou) can be installed and used; the present disclosure does not specifically limit the form of the terminal. The terminal can interact with the user through one or more of a keyboard, a touchpad, a touchscreen, a remote control, voice interaction, a handwriting device, or the like.
Alternatively, in the communication system shown in fig. 1 described above, the server 01 may be connected to at least one terminal 02. The present disclosure does not limit the number or types of the terminals 02.
Optionally, in the embodiment of the present disclosure, both the server 01 and the terminal 02 may be referred to as electronic devices. In the embodiment of the present disclosure, the server 01 and the terminal 02 may both serve as execution subjects for executing the embodiment of the present disclosure. The following describes an embodiment of the present disclosure by taking the server 01 as an execution subject.
When the server 01 is the execution subject, the data processed by the server 01 may be data stored in the server 01, or data received from the terminal 02.
After the application scenario and the implementation environment of the embodiment of the present disclosure are introduced, a detailed description is given below to a method for generating a video tag according to the embodiment of the present disclosure in conjunction with the implementation environment shown in fig. 1.
Fig. 2 is a flow chart illustrating a method of generating a video tag in accordance with an exemplary embodiment. As shown in fig. 2, the method may include steps 201-204.
201. Acquire a target image from the target video resource.
The target image is an image in the target video resource, and the target object is an object corresponding to the information reflected by the target video resource.
It should be noted that the target video resource is composed of multiple frames of images, and the target image may include multiple frames of images in the target video resource. The embodiment of the present disclosure does not limit the type of the target video resource. For example, the target video resource may be a game-type video resource. As another example, the target video resource may be a video resource of a recommended-item (e.g., sweater) type. As another example, the target video resource may be a travel-type (e.g., playing at an amusement park) video resource.
In the embodiment of the present disclosure, the target video resource includes a target object, and the target object is the object corresponding to the information reflected by the target video resource. Optionally, the target object may also be referred to as a subject object, that is, the target video resource includes the subject object.
For example, if the target video resource is a video resource recommending a sweater, the information reflected by the target video resource may be the recommended sweater, and the target object may be the sweater. If the target video resource is a game-type video resource, the information reflected by the target video resource may be a game character in action, and the target object may be the game character, an identification of the game, or the like. If the target video resource is a travel-type video resource (e.g., playing in an amusement park), the information reflected by the target video resource may be the entertainment items played in the amusement park, and the target object may be equipment in the amusement park (e.g., a pirate ship, a Ferris wheel, a merry-go-round, etc.).
In one implementation, the target video resource is composed of image frames, which include every frame of image in the target video resource. The target image may be any of the image frames. The server may obtain the image frames in the target video resource and then determine some of the image frames as target images.
Illustratively, suppose the target video resource is composed of image A, image B, image C, image D, ..., image Z. The target image may include: image A, image C, image D, image M, and image W.
In one possible implementation, the target video resource is composed of image frames, which include every frame of image in the target video resource. The server may obtain color information of the image frames in the target video resource. Then, the server may determine the color information difference degree between the image frames according to the color information of the image frames. Then, the server may determine the target image from the image frames, where the color information difference degree between adjacent image frames in the target image is greater than a preset difference threshold.
Wherein the difference degree is used for reflecting the difference of the color information between the two frames of images. The larger the degree of difference, the larger the difference in color information between the two frame images. The smaller the degree of difference, the smaller the difference in color information between the two frame images.
When the difference in color information between two frames is large, it indicates that the contents displayed in the two frames differ from each other.
The color information of the image is not limited in the embodiments of the present disclosure. For example, the color information of the image may be the red-green-blue (RGB) values of the image. As another example, the color information of the image may be the hue-saturation-value (HSV) values of the image. As another example, the color information of the image may be the color histogram of the image. In the following embodiments, the color information of an image is described taking the color histogram of the image as an example.
In one possible implementation, the server obtains the color histograms of the image frames. Then, the server determines the difference degree between every two adjacent image frames according to their color histograms. Then, the server compares the difference degree of every two adjacent image frames with a preset difference threshold. When the difference degree of two adjacent image frames is greater than the preset difference threshold, the two adjacent image frames are taken as target images.
For example, suppose the image frames include image A and image B, image A and image B are adjacent, and the preset difference threshold is 0.05. If the difference degree between the color histogram of image A and the color histogram of image B is 0.1, image A and image B are target images.
When the difference degree of two adjacent image frames is smaller than the preset difference threshold, the two adjacent image frames are not taken as target images.
For example, suppose the image frames include image A and image B, image A and image B are adjacent, and the preset difference threshold is 0.05. If the difference degree between the color histogram of image A and the color histogram of image B is 0.01, image A and image B are not target images.
The technical scheme provided by this embodiment has at least the following beneficial effects: color information of the image frames in the target video resource is acquired; the color information difference degree between the image frames is determined according to the color information of the image frames; and the target image is determined from the image frames, where the target image includes a plurality of images and the color information difference degree between adjacent image frames in the target image is greater than a preset difference threshold. It can be understood that when the color information difference within the target image is large, different objects are likely to be contained in the target image. Selecting frames in this way ensures that different objects are present in the target image, reduces the number of target images, and avoids missing objects in the target video resource.
202. Object detection processing is performed on the target image to obtain a plurality of first objects.
The first object is an object in the target image.
As one possible implementation, each frame of the target image is processed through a convolutional neural network to obtain the plurality of first objects.
As another possible implementation, the objects contained in each frame of the target image may be identified through a trained object detection model, where the object detection model is used to identify the objects contained in an image; the plurality of first objects are thereby obtained.
Optionally, the plurality of first objects include a plurality of same objects and a plurality of different objects.
In one possible implementation, object detection processing is performed on the target image to obtain a plurality of second objects, and the second objects are clustered to obtain the plurality of first objects.
Optionally, when the target image consists of multiple frames, object detection processing is performed on each frame of the target image to obtain the plurality of second objects contained in each frame, and the plurality of second objects in the multiple frames are clustered to obtain the plurality of first objects.
It should be noted that, in the following embodiments, the plurality of second objects may be M second objects, where M is a positive integer. The plurality of first objects may be N-class first objects. The M second objects comprise N types of first objects, N is a positive integer, and M is larger than or equal to N.
In the embodiment of the disclosure, each of the N classes of first objects includes at least one second object, and the similarity of all the second objects in each class of first objects is greater than a preset similarity threshold.
That is, in the embodiment of the present disclosure, all the second objects in each class of the first objects may be regarded as the same object, that is, N classes of the first objects correspond to N second objects.
In a possible implementation manner, the M second objects are clustered by a clustering algorithm, so as to obtain N types of first objects.
For example, suppose M is 7 and the 7 second objects are object A, object B, object C, object D, object E, object F and object G. After the 7 second objects are clustered, 2 classes of first objects (e.g., a white top and black high-heeled shoes) can be obtained, where the white top includes object A, object B and object D, and the black high-heeled shoes include object C, object E, object F and object G.
Optionally, multi-target tracking may be implemented by combining the video stream with image-level general detection. The multiple targets appearing in the video are tracked to form a motion track (tracklet) for each target, and each tracklet is finally used as a candidate subject target; a minimal sketch of such linking is given after the example below.
Illustratively, suppose the video stream includes images at 3 time points (image A, image B and image C), the objects in image A include object a, object b and object c, the objects in image B include object a and object c, and the objects in image C include object b and object d. Four tracklets can be constructed: a tracklet of object a, a tracklet of object b, a tracklet of object c, and a tracklet of object d.
In this way, the same object keeps the same ID (identification) across frames, ensuring the temporal consistency of the targets.
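One way to realize the tracklet linking sketched above is to greedily associate each detection with the track whose last box in the previous frame overlaps it most. The following Python sketch assumes axis-aligned bounding boxes and ignores exclusivity constraints between detections; both are simplifications, not the tracker of the present disclosure.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def build_tracklets(frames_detections, iou_threshold=0.5):
    """Greedily link per-frame detections into tracklets.

    frames_detections: list over frames, each a list of (x1, y1, x2, y2) boxes.
    Returns a list of tracklets, each a list of (frame_index, box) pairs; every
    tracklet plays the role of one candidate subject target (one ID).
    """
    tracklets = []
    for t, boxes in enumerate(frames_detections):
        for box in boxes:
            best, best_iou = None, iou_threshold
            for track in tracklets:
                last_t, last_box = track[-1]
                if last_t == t - 1:  # only extend tracks seen in the previous frame
                    overlap = iou(last_box, box)
                    if overlap > best_iou:
                        best, best_iou = track, overlap
            if best is not None:
                best.append((t, box))
            else:
                tracklets.append([(t, box)])  # unmatched detection starts a new ID
    return tracklets
```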
It should be noted that the clustering algorithm is not limited in the embodiments of the present disclosure. For example, the clustering algorithm may be the k-means clustering algorithm. As another example, the clustering algorithm may be a hierarchical clustering algorithm. As another example, the clustering algorithm may be the self-organizing map (SOM) clustering algorithm. As another example, the clustering algorithm may be the fuzzy C-means (FCM) clustering algorithm.
It can be understood that, in the embodiments of the present disclosure, the plurality of second objects may be obtained from multiple frames of images, and the same object may appear in different images; that is, the same object may appear multiple times among the second objects. Clustering the second objects therefore yields the distinct first objects present in the multiple frames.
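As an illustration of the clustering step, the sketch below groups the per-detection appearance features of the second objects so that detections of the same object fall into one cluster (one first object). The use of scikit-learn's agglomerative clustering with a distance threshold, and the synthetic embeddings in the toy usage, are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def group_into_first_objects(embeddings: np.ndarray, distance_threshold: float = 0.5):
    """Cluster detection embeddings so detections of the same object share a label.

    embeddings: (M, D) array, one appearance feature per detected second object.
    Returns M cluster labels; each of the resulting N clusters is one first object.
    """
    clusterer = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold, linkage="average"
    )
    return clusterer.fit_predict(embeddings)

# Toy usage: 7 detections falling into 2 first objects (cf. the white-top /
# black-heels example above); the embeddings here are synthetic placeholders.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(0.0, 0.05, size=(3, 8)),  # three detections of the first object
    rng.normal(1.0, 0.05, size=(4, 8)),  # four detections of the second object
])
print(group_into_first_objects(features))
```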
Optionally, the object detection processing is performed on the target image to obtain a first object.
That is to say, in the embodiment of the present disclosure, the object detection processing is performed on the target image, and at least one first object may be obtained. In general, a plurality of first objects may be included in the target image. In the following embodiments, a plurality of first objects are taken as examples to describe the embodiments of the present disclosure.
203. Determine the target object from the plurality of first objects according to the feature information of the target video resource.
The feature information of the target video resource can reflect the information of the target video resource. The object detection model is constructed based on a target detection network and a feature pyramid network.
As a possible implementation, the target object is determined from the plurality of first objects based on the object detection model and the feature information of the target video resource.
It can be understood that the object detection model is constructed based on the target detection network and the feature pyramid network. The target detection network can detect objects in an image, and the feature pyramid improves detection performance on small objects; accordingly, the object detection model can identify the objects in the image more accurately.
In one possible implementation, as shown in FIG. 3, step 203 includes steps 301-304.
301. Detecting whether target text information exists.
The target text information is used for indicating information reflected by the target video resource.
Illustratively, the target text information may be: a sweater is recommended today. Alternatively, the target text information may be: snow boots are today recommended. Alternatively, the target text information may be: today, a car is recommended.
The present disclosure is not limited to the representation form of the target text information. For example, the target text information may be represented by a title of the target video asset. As another example, the target text information may be represented by speech in the target video asset. As another example, the target text information may be represented by subtitles in the target video asset.
In one possible design, the target text information includes: the title information of the target video resource, and/or the text information on the cover of the target video resource, and/or the text information converted from the voice information of the target video resource.
The technical scheme provided by this embodiment has at least the following beneficial effects: the target text information includes the title information of the target video resource, and/or the text information on the cover of the target video resource, and/or the text information converted from the voice information of the target video resource. Therefore, the tag of the target video resource can be determined from the target text information, and the tag of the target video resource is then further determined in combination with the target tag of the target object, improving the accuracy of identifying the tag of the target video resource.
As one possible implementation, the target video resource is analyzed through speech recognition to determine whether the target text information exists.
As another possible implementation, the target video resource is analyzed through text recognition to determine whether the target text information exists.
In one possible implementation, step 302 is performed when the target text information is not present. When the target text information is present, step 304 is performed.
302. Determine a plurality of target coincidence degrees.
The target coincidence degree is used for indicating the coincidence degree of the area where the first object is located and the target area, and the characteristic information comprises the target area.
As a possible implementation, when the target text information does not exist in the target video resource, the target region is determined, and N target coincidence degrees are determined according to the area of the target region and the area of each of the N classes of first objects.
Illustratively, suppose the area of the target region is 20 and the 3 classes of first objects (i.e., N is 3) are object A, object B and object C, where the area of object A is 10, the area of object B is 15, and the area of object C is 19. Then the coincidence degree of object A with the target region is 50%, the coincidence degree of object B with the target region is 75%, and the coincidence degree of object C with the target region is 95%.
Alternatively, the target coincidence degree may be determined according to the position, shape, etc. of the target region and the first object.
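For bounding-box regions, the coincidence degree could be computed as in the following sketch. Normalising the intersection by the area of the target region matches the numeric example above (an object of area 10 lying inside a target region of area 20 has a coincidence degree of 50%), but that normalisation is an assumption rather than a definition given in the present disclosure.

```python
def coincidence_degree(object_box, target_box):
    """Fraction of the target region covered by the first object's region.

    Boxes are (x1, y1, x2, y2) in pixel coordinates.
    """
    x1 = max(object_box[0], target_box[0]); y1 = max(object_box[1], target_box[1])
    x2 = min(object_box[2], target_box[2]); y2 = min(object_box[3], target_box[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    target_area = (target_box[2] - target_box[0]) * (target_box[3] - target_box[1])
    return inter / target_area if target_area > 0 else 0.0

# Toy usage: an object box covering half of a 10 x 2 target region.
print(coincidence_degree((0, 0, 5, 2), (0, 0, 10, 2)))  # 0.5
```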
As another possible implementation, the plurality of target coincidence degrees are determined based on the object detection model and the feature information of the target video resource.
In the embodiment of the present disclosure, the object detection model is constructed based on a target detection network and a Feature Pyramid Network (FPN).
It should be noted that the target detection network is not limited in the embodiments of the present disclosure. For example, the target detection network may be CenterNet. As another example, the target detection network may be Fast R-CNN (a type of target detection network). As another example, the target detection network may be Mask R-CNN (a type of target detection network).
Optionally, the object detection model is constructed based on the target detection network and the feature pyramid network (and/or the attention mechanism).
It should be noted that the FPN constructs the feature pyramid using the hierarchical semantic features of the convolutional network itself. The FPN comprises two parts: a bottom-up pathway, and a fusion process of top-down and lateral connections. Detection performance on small objects can be improved through the FPN.
In one implementation, the object detection model may be trained on a training set to obtain a trained object detection model, where the loss of the trained object detection model is less than a preset loss threshold.
Illustratively, the training set may include tops, skirts, clothing accessories, luggage, toys, and the like.
Illustratively, fig. 4 shows a schematic structural diagram of the object detection model. The data layer (e.g., data) receives a data set (e.g., the training set). The pooling layer (e.g., max pool) performs pooling. The residual network layer (e.g., res-block) implements the residual structure of the neural network. The first output layer (e.g., P3), the second output layer (e.g., P4), the third output layer (e.g., P5) and the fourth output layer (e.g., P6) implement the outputs of the feature pyramid network layers. The first convolutional layer (e.g., scale1-conv), the second convolutional layer (e.g., scale2-conv), the third convolutional layer (e.g., scale3-conv) and the fourth convolutional layer (e.g., conv1-1) extract different features of the input. The deconvolution layer (e.g., deconv) implements upsampling. The first loss function layer (e.g., head-hm), the second loss function layer (e.g., head-wh) and the third loss function layer (e.g., head-reg) represent differences in the detection results.
The output of the feature pyramid network layers is described taking the third output layer (P5) and the fourth output layer (P6) as an example. With reference to fig. 4, as shown in fig. 5, the third output layer (P5) and the fourth output layer (P6) may each receive the input of a two-layer network, and the two outputs are concatenated (concat). For example, if P5 is 1 × 128 and P6 is 1 × 128, concatenation yields a 1 × 256 output. The concatenated result (e.g., P5_out) may then be processed by mean pooling (e.g., AvgPool, which takes the average over a 3 × 3 neighborhood as the layer output), convolution (e.g., Conv), upsampling (e.g., Upsample, such as a 2 × 2 output upsampled to a 4 × 4 output), and an output probability layer (e.g., Sigmoid, which constrains the output range to 0 to 1).
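A reading of the P5/P6 fusion described above could be sketched in PyTorch as follows. The channel count, the assumption that the two inputs share a spatial size, and the layer hyperparameters are all illustrative; this is not the exact network of fig. 4 and fig. 5.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate two pyramid outputs, then average-pool, convolve, upsample and squash to [0, 1]."""

    def __init__(self, channels: int = 128):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)  # 3 x 3 mean pooling
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")         # e.g. 2 x 2 -> 4 x 4

    def forward(self, p5: torch.Tensor, p6: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([p5, p6], dim=1)  # 1 x 128 and 1 x 128 channels -> 1 x 256
        out = self.pool(fused)
        out = self.conv(out)
        out = self.up(out)
        return torch.sigmoid(out)           # output probability in the range 0 to 1

# Toy shapes only; real feature maps would come from the backbone / FPN.
p5 = torch.randn(1, 128, 8, 8)
p6 = torch.randn(1, 128, 8, 8)
print(FusionHead()(p5, p6).shape)  # torch.Size([1, 128, 16, 16])
```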
It should be noted that the embodiment of the present disclosure is not limited to this concatenation manner, and a better output effect can be achieved by updating the network parameters. In the embodiment of the present disclosure, object classification, location regression and hard-sample mining can be unified into one network through multitask learning, so as to perform end-to-end (e2e) learning.
Optionally, the object detection model may be trained over multiple scales.
It should be noted that, in multi-scale (MultiScale) training, multi-scale deconvolution operations are used to upsample the features of the deep convolutional layers, and a fused feature map is constructed after concatenation with the shallow network. The fused feature map produces fewer candidate regions while achieving a higher recall rate.
It should be noted that, when the object detection model is constructed and trained based on the target detection network, the predicted center point occupies only one pixel and the remaining pixels are negative samples. Moreover, the current loss function (e.g., focal loss) has the following problem: when the label is 1 and the predicted value is 0.7, the loss is reduced sharply due to the influence of alpha, which in turn lowers the predicted score at the center point.
It should be noted that, in the embodiment of the present disclosure, the focal loss may be adjusted: alpha is tuned downward (for example, set to 0.5), which preserves the effect of focal loss (focused learning on hard samples) while avoiding the sharp reduction in loss that occurs when the predicted score at a positive sample's center point is relatively low. Meanwhile, a weight parameter is added to the positive samples, increasing the loss penalty on positives.
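The adjustment described above could look like the following binary focal loss sketch, where alpha is lowered and an extra positive-sample weight is applied. The binary formulation, the default values and the normalisation by the number of positives are assumptions for illustration.

```python
import torch

def weighted_focal_loss(pred, target, alpha=0.5, gamma=2.0, pos_weight=2.0, eps=1e-6):
    """Focal loss with a down-tuned alpha and an extra weight on positive samples.

    pred:   predicted center-point heatmap, values in (0, 1).
    target: ground-truth heatmap, 1 at center-point pixels, 0 elsewhere.
    """
    pred = pred.clamp(eps, 1.0 - eps)
    pos = target.eq(1).float()
    neg = 1.0 - pos
    # alpha = 0.5 keeps hard-sample focusing without crushing the positive loss;
    # pos_weight further raises the penalty on the (very few) positive pixels.
    pos_loss = -alpha * pos_weight * torch.pow(1.0 - pred, gamma) * torch.log(pred) * pos
    neg_loss = -(1.0 - alpha) * torch.pow(pred, gamma) * torch.log(1.0 - pred) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos
```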
303. Determine the target object according to the plurality of target coincidence degrees.
In the embodiment of the present disclosure, the target object is the first object whose target coincidence degree, among the plurality of target coincidence degrees, satisfies the preset condition.
In one possible design, the preset condition is: the first object having the largest target coincidence degree among the N target coincidence degrees.
As a possible implementation, the N target coincidence degrees are compared and the largest target coincidence degree is determined; the first object corresponding to the largest target coincidence degree is then taken as the target object.
Illustratively, if the target coincidence degree of object A is 90%, the target coincidence degree of object B is 30%, and the target coincidence degree of object C is 95%, the target object is object C.
In another possible design, the preset condition is: the first object whose target coincidence degree, among the N target coincidence degrees, is greater than a preset coincidence threshold.
For example, if the preset coincidence threshold is 60%, the target coincidence degree of object A is 50%, the target coincidence degree of object B is 30%, and the target coincidence degree of object C is 95%, the target object is object C.
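The two preset conditions above amount to a simple selection rule over precomputed coincidence degrees; a minimal sketch (with the object names and the threshold value as illustrative inputs) follows.

```python
def pick_target_object(objects_with_degrees, min_degree=None):
    """Select the first object whose target coincidence degree satisfies the preset condition.

    objects_with_degrees: list of (object_name, target_coincidence_degree) pairs.
    If min_degree is given, objects at or below that threshold are discarded first.
    """
    candidates = objects_with_degrees
    if min_degree is not None:
        candidates = [item for item in candidates if item[1] > min_degree]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[1])[0]

# Matches the examples above: object C (95%) is selected in both designs.
print(pick_target_object([("A", 0.90), ("B", 0.30), ("C", 0.95)]))
print(pick_target_object([("A", 0.50), ("B", 0.30), ("C", 0.95)], min_degree=0.60))
```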
It can be understood that the feature information reflects the information of the target video resource, and the feature information includes the target region, which means the target region can reflect the target video resource. The greater the target coincidence degree, the more likely the first object is the target object of the target video resource. Taking the first object with the largest target coincidence degree among the plurality of target coincidence degrees as the target object therefore improves the accuracy of identifying the target object.
304. A target object is determined from the plurality of first objects based on the target text information.
As one possible implementation, the target text information and the N classes of first objects are input into a preset recognition model to obtain N predicted values, where a predicted value reflects the degree of similarity between an object and the target text information. The first object corresponding to the largest of the N predicted values is then taken as the target object.
Illustratively, as shown in fig. 6, the preset recognition model includes an object feature module, a text feature module, a matching module and an output module. The object feature module analyzes the objects (e.g., the N classes of first objects), the text feature module analyzes the text information, the matching module determines a predicted value between the target text information and each first object, and the output module outputs the predicted values and determines the target object. For example, the target text information may be "These slippers are particularly fashionable. Moreover, they are not very expensive." The first objects include a top, slippers and trousers. If the predicted value of the top is 0.6, the predicted value of the slippers is 0.79, and the predicted value of the trousers is 0.2, the slippers are determined as the target object.
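The matching step could be sketched as scoring each first object's feature against the text feature, for instance with cosine similarity. The encoders are left abstract here and the embeddings in the usage example are synthetic; the present disclosure does not specify the internals of the matching module.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def pick_by_text(text_embedding: np.ndarray, object_embeddings: dict):
    """Return the name of the first object whose feature best matches the text feature.

    text_embedding:    (D,) vector produced by a text-feature module.
    object_embeddings: mapping of object name to (D,) vector from an object-feature module.
    """
    predicted_values = {name: cosine(text_embedding, emb) for name, emb in object_embeddings.items()}
    best = max(predicted_values, key=predicted_values.get)
    return best, predicted_values

# Toy usage with synthetic embeddings (a real system would use trained encoders).
rng = np.random.default_rng(1)
text_vec = rng.normal(size=16)
objects = {
    "top": rng.normal(size=16),
    "slippers": text_vec + rng.normal(0.0, 0.1, size=16),  # deliberately close to the text
    "trousers": rng.normal(size=16),
}
name, scores = pick_by_text(text_vec, objects)
print(name, {k: round(v, 2) for k, v in scores.items()})
```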
The technical scheme provided by the embodiment at least has the following beneficial effects: and under the condition that the target text information exists in the target video resource, determining the target object from the N types of first objects according to the target text information. The target text information is used for indicating the information reflected by the target video resource, and the target object is determined from the first object by combining the target text information, so that the accuracy of determining the target object can be improved, and the accuracy of identifying the label of the target video resource is improved.
204. A target tag of the target object is determined.
Wherein the target tag corresponds to the target video resource.
For example, if the target tag of the target object is "car", the target video resource is a video associated with cars. If the target tag of the target object is "mountain", the target video resource is a video related to scenery. If the target tag of the target object is a game character, the target video resource is a video associated with the game.
As a possible implementation manner, the target image is input into the trained object detection model, and M first objects and M target labels are determined, where the M first objects correspond to the M target labels.
Illustratively, 4 objects are determined: object A, object B, object C and object D, where object A is a skirt, object B is a boot, object C is a car, and object D is a flower. If the target object is object C, the target tag is "car".
The technical scheme provided by this embodiment has at least the following beneficial effects: a target image is acquired from a target video resource, where the target video resource contains a target object and the target object is the object corresponding to the information reflected by the target video resource; object detection processing is performed on the target image to obtain a plurality of first objects, where the first objects are objects in the target image; the target object is determined from the plurality of first objects according to the feature information of the target video resource; and a target tag of the target object is determined, where the target tag corresponds to the target video resource. Since the target object is an object in the target video resource and corresponds to the feature information, once the target tag of the target object is determined, the tag of the target video resource can be determined accordingly, which improves the accuracy of identifying the tag of the video.
In an implementable manner, as shown in fig. 7, after step 301, step 701 may be included.
701. Whether a person image exists in the target image is detected.
As a possible implementation manner, when target text information does not exist in the target video resource, whether a person image exists in the target image or not is detected through a human body recognition algorithm.
Illustratively, the human body recognition algorithm may be a single-shot detection algorithm, Fast R-CNN (an object detection algorithm), Faster R-CNN (an object detection algorithm), or the like.
In another possible implementation manner, whether a person exists among the N types of first objects is detected. When a person exists among the N types of first objects, it is determined that a person image exists in the target image. When no person exists among the N types of first objects, it is determined that no person image exists in the target image.
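A minimal sketch of this second branch, assuming the class names of the N types of first objects are available as plain strings:

```python
# Minimal sketch: decide whether a person image exists by checking whether "person"
# is among the classes of the N types of first objects (class names are assumptions).
def person_image_exists(first_object_classes):
    """Return True if any detected first object belongs to the class 'person'."""
    return any(cls == "person" for cls in first_object_classes)

print(person_image_exists(["jacket", "person", "skirt"]))   # True  -> go to step 702
print(person_image_exists(["car", "flower"]))               # False -> go to steps 703-704
```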
In one possible implementation, step 302 includes step 702. When a person image exists in the target image, step 702 is executed.
In another possible implementation, step 302 includes step 702 or steps 703-704. When a person image exists in the target image, step 702 is executed. When no person image exists in the target image, steps 703 and 704 are executed.
702. A plurality of first coincidence degrees are determined.
In one possible design, the target area includes an area in which the image of the person is located. The target coincidence degree comprises a first coincidence degree, and the first coincidence degree is used for indicating the coincidence degree of the region where the first object is located and the region where the person image is located.
Illustratively, referring to fig. 8, assume that the area of the person image 801 is 20 and the 3 types of first objects (i.e., N is 3) are a jacket 802, a skirt 803, and a sports shoe 804. If the area of the jacket 802 is 15, the area of the skirt 803 is 4, and the area of the sports shoe 804 is 1, then the coincidence degree between the jacket 802 and the person image is 75%, the coincidence degree between the skirt 803 and the person image is 20%, and the coincidence degree between the sports shoe 804 and the person image is 5%.
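The coincidence degrees of fig. 8 can be reproduced with the following sketch, assuming each first object lies entirely within the person region so that the coincidence degree reduces to the ratio of the object area to the person-image area (a real implementation would use the intersection of the two regions instead):

```python
# Illustrative computation of the first coincidence degrees of fig. 8, assuming each
# first object lies inside the person region so the degree is object area / person area.
person_area = 20
object_areas = {"jacket": 15, "skirt": 4, "sports shoe": 1}

first_coincidence = {name: area / person_area for name, area in object_areas.items()}
# {'jacket': 0.75, 'skirt': 0.2, 'sports shoe': 0.05}

# Steps 303-304: the first object with the largest coincidence degree is the target object.
target_object = max(first_coincidence, key=first_coincidence.get)
assert target_object == "jacket"
```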
The technical scheme provided by this embodiment at least has the following beneficial effects: whether a person image exists in the target image is detected. When no target text information exists in the target video resource and a person image exists in the target image, a plurality of first coincidence degrees are determined, where the target coincidence degrees include the first coincidence degrees, and the first coincidence degree indicates the degree of coincidence between the region where the first object is located and the region where the person image is located. It can be understood that, since no target text information exists in the target video resource while a person image exists in the target image, the person image can be used as a reference, and an object with a higher degree of coincidence with the person image can be taken as the target object, thereby improving the accuracy of determining the target object.
703. Image segmentation processing is performed on the target image to determine a foreground region of the target image.
As a possible implementation manner, when no target text information exists in the target video resource and no person image exists in the target image, the target image is processed according to an image segmentation algorithm to determine the foreground region of the target image.
It should be noted that the image segmentation algorithm is not limited in the embodiments of the present disclosure. For example, the image segmentation algorithm may be a threshold-based segmentation algorithm. For another example, it may be a region-based segmentation algorithm. For yet another example, it may be an edge-detection-based segmentation algorithm.
For example, as shown in fig. 9, if the target image 901 includes a car 902, the foreground region of the target image is a region 903 where the car 902 is located.
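As an illustrative sketch of step 703 only, a threshold-based segmentation (Otsu's method via OpenCV) could be used to obtain the foreground region; the disclosure does not mandate a specific algorithm, and the file name below is hypothetical.

```python
# Illustrative sketch of step 703 using a threshold-based segmentation algorithm
# (Otsu's method via OpenCV); the algorithm choice and file name are assumptions.
import cv2
import numpy as np

def foreground_mask(image_bgr: np.ndarray) -> np.ndarray:
    """Return a binary mask (255 = assumed foreground) from a global Otsu threshold,
    assuming the subject is brighter than the background."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask

target_image = cv2.imread("target_frame.jpg")   # hypothetical key frame, e.g. fig. 9
mask = foreground_mask(target_image)            # foreground region such as region 903
```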
704. A plurality of second coincidence degrees are determined.
The target coincidence degree comprises a second coincidence degree, and the second coincidence degree is used for indicating the coincidence degree of the area where the first object is located and the foreground area.
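A minimal sketch of step 704, assuming the region where a first object is located is given as a bounding box and the foreground region as a binary mask; the mask and boxes below are illustrative assumptions.

```python
# Sketch of step 704: the second coincidence degree of a first object is taken as the
# fraction of its bounding box covered by foreground pixels. The mask and the boxes
# (x1, y1, x2, y2) below are illustrative assumptions.
import numpy as np

mask = np.zeros((240, 320), dtype=np.uint8)
mask[60:160, 40:200] = 255                       # pretend foreground region (the car)

def second_coincidence(mask: np.ndarray, box) -> float:
    """Fraction of the box area overlapping the foreground mask (mask > 0)."""
    x1, y1, x2, y2 = box
    region = mask[y1:y2, x1:x2]
    return float((region > 0).mean()) if region.size else 0.0

boxes = {"car": (40, 60, 200, 160), "tree": (0, 0, 30, 120)}    # hypothetical boxes
degrees = {name: second_coincidence(mask, box) for name, box in boxes.items()}
target_object = max(degrees, key=degrees.get)                    # 'car'
```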
The technical scheme provided by this embodiment at least has the following beneficial effects: in the case that no target text information exists in the target video resource and no person image exists in the target image, image segmentation processing is performed on the target image to determine the foreground region of the target image, and a plurality of second coincidence degrees are determined, where the target coincidence degrees include the second coincidence degrees, and the second coincidence degree indicates the degree of coincidence between the region where the first object is located and the foreground region. It can be appreciated that the foreground region is typically the region of the image in which the subject object is located, so the second coincidence degrees improve the accuracy of determining the target object.
In another implementable manner, the feature information includes a target area, and the target area includes the area where the target object is located. Determining the target object from the plurality of first objects according to the feature information of the target video resource (i.e., step 203) may include: determining a plurality of target coincidence degrees according to the feature information of the target video resource, where the target coincidence degrees are the degrees of coincidence between the region where each first object is located and the target region; and then determining the target object according to the target coincidence degrees corresponding to the plurality of first objects, where the target object is the first object corresponding to the largest of the plurality of target coincidence degrees. Optionally, the object detection model is further configured to determine the target coincidence degree corresponding to each first object based on the feature information of the target video resource.
That is, step 302 is performed without performing step 301, i.e., step 203 may further include steps 302-304.
The technical scheme provided by this embodiment at least has the following beneficial effects: since the feature information reflects the information of the target video resource, a greater target coincidence degree means a greater correlation between the first object and the target video resource and a higher probability that the first object is the target object. The first object corresponding to the largest of the plurality of target coincidence degrees is taken as the target object. In this way, the accuracy of identifying the target object can be improved.
In one possible implementation, the target area includes an area where a preset object image is located. When the preset object image exists in the target image, the target coincidence degrees corresponding to the plurality of first objects are determined according to the preset object image.
It should be noted that, the preset object image is not limited in the embodiment of the present disclosure. The preset object image may be a human image, an animal image, a cartoon image, or the like.
The technical scheme provided by this embodiment at least has the following beneficial effects: when a person image exists in the target video resource, a plurality of target coincidence degrees are determined according to the person image. It can be understood that, in the case that a person image exists in the target video resource, the person image can be used as a reference, and an object with a higher degree of coincidence with the person image can be taken as the target object, thereby improving the accuracy of determining the target object.
In one implementable approach, the target region includes a foreground region. When no person image exists in the target image, image segmentation processing is performed on the target image to determine the foreground region of the target image, and the target coincidence degrees corresponding to the plurality of first objects are determined according to the foreground region.
The technical scheme provided by this embodiment at least has the following beneficial effects: in the case that no person image exists in the target video resource, image segmentation processing is performed on the target image to determine the foreground region of the target image, and the target coincidence degrees corresponding to the plurality of first objects are determined according to the foreground region. It can be appreciated that the foreground region is typically the region of the image in which the subject object is located, so determining the coincidence degrees with respect to the foreground region improves the accuracy of determining the target object.
In one possible implementation, the feature information includes target text information, where the target text information indicates the information reflected by the target video resource. The target text information can be identified to determine the category of the target video resource. Then, the target object is determined according to the category of the target video resource and the categories of the plurality of first objects.
Illustratively, the target text information may be: "A sweater is recommended today." In this case, the category of the target video resource is sweater. If the categories of the plurality of first objects are sweater (object A), shoes (object B), and bag (object C), respectively, then object A is the target object.
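A minimal sketch of this text-based branch, in which a keyword lookup stands in for the actual text recognition and category determination (which would normally be a learned classifier); the names below are assumptions mirroring the example.

```python
# Minimal sketch of the text-based branch; the keyword lookup below stands in for the
# actual text recognition / category classification, and the names are assumptions.
def video_category(text, known_categories):
    """Return the first known category mentioned in the text, or None."""
    lowered = text.lower()
    return next((c for c in known_categories if c in lowered), None)

first_objects = {"object A": "sweater", "object B": "shoes", "object C": "bag"}
category = video_category("A sweater is recommended today.", set(first_objects.values()))

target_object = next((obj for obj, cat in first_objects.items() if cat == category), None)
assert target_object == "object A"
```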
The technical scheme provided by this embodiment at least has the following beneficial effects: the target text information is identified and the category of the target video resource is determined; the target object is then determined according to the category of the target video resource and the categories of the plurality of first objects. Since the target text information can reflect the category of the target video resource, determining the target object from the first objects in combination with the target text information improves the accuracy of determining the target object and, in turn, the accuracy of identifying the label of the target video resource.
In one practicable manner, the target image may be input into the trained object detection model to determine the target object. The object detection model includes an identification module, a classification module, a fusion module, and an output module. The identification module identifies M first objects in the target image, and the classification module classifies the M first objects to obtain N types of first objects. The fusion module determines the target object (i.e., steps 301-304 and steps 701-704) in combination with information in the target video resource (e.g., the target text information, person image, foreground region, etc.). The output module determines the target tag of the target object (i.e., step 204).
It is understood that the above method may be implemented by a video tag generation apparatus. To realize the above functions, the video tag generation apparatus includes corresponding hardware structures and/or software modules for performing the respective functions. Those skilled in the art will readily appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments.
The embodiments of the present disclosure may divide the video tag generation apparatus into functional modules according to the above method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of the modules in the embodiments of the present disclosure is illustrative and is only one manner of dividing logical functions; other manners of division are possible in actual implementation.
Fig. 10 is a block diagram illustrating a structure of a video tag generation apparatus according to an exemplary embodiment. Referring to fig. 10, the video tag generation apparatus 100 includes an acquisition unit 1001 and a processing unit 1002.
An acquisition unit 1001 is configured to perform acquiring a target image from a target video resource, the target video resource including a target object, the target object being an object corresponding to information reflected by the target video resource. A processing unit 1002 is configured to perform object detection processing on the target image to obtain a plurality of first objects, the first objects being objects contained in the target image. The processing unit 1002 is further configured to determine the target object from the plurality of first objects according to the feature information of the target video resource. The processing unit 1002 is further configured to determine a target tag of the target object, the target tag corresponding to a video tag of the target video resource.
Optionally, the number of frames of the target image is multiple frames. The processing unit 1002 is further configured to perform object detection processing on the target images, resulting in a plurality of second objects included in each frame of the target images. The processing unit 1002 is further configured to perform clustering on a plurality of second objects in the multi-frame target images, so as to obtain a plurality of first objects.
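As an illustrative sketch only, grouping the per-frame second objects by class name is one simple way to obtain the deduplicated first objects; a real implementation might instead cluster appearance features, and the per-frame detections below are assumptions.

```python
# Illustrative sketch: per-frame detections (second objects) are grouped by class
# name to obtain the deduplicated first objects; a real implementation might cluster
# appearance features instead. The detections below are assumptions.
from collections import Counter

frame_detections = [
    ["jacket", "slippers"],   # second objects detected in frame 1
    ["slippers", "pants"],    # second objects detected in frame 2
    ["slippers"],             # second objects detected in frame 3
]

counts = Counter(obj for frame in frame_detections for obj in frame)
first_objects = list(counts)          # ['jacket', 'slippers', 'pants']
```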
Optionally, the feature information includes a target area, and the target area includes an area where the target object is located. The processing unit 1002 is further configured to determine a plurality of target coincidence degrees according to the feature information of the target video resource, where the target coincidence degree is the degree of coincidence between the region where the first object is located and the target region. The processing unit 1002 is further configured to determine the target object according to the target coincidence degrees of the plurality of first objects, where the target object is the first object whose target coincidence degree meets a preset condition.
Optionally, the target area includes an area where the person image is located. The processing unit 1002 is further configured to determine the target coincidence degrees corresponding to the plurality of first objects according to the person image when the person image exists in the target image.
Optionally, the target region includes a foreground region. The processing unit 1002 is further configured to perform image segmentation processing on the target image to determine the foreground region of the target image when no person image exists in the target image. The processing unit 1002 is further configured to determine the target coincidence degrees corresponding to the plurality of first objects according to the foreground region.
Optionally, the feature information includes target text information, where the target text information indicates the information reflected by the target video resource. The processing unit 1002 is further configured to identify the target text information and determine the category of the target video resource. The processing unit 1002 is further configured to determine the target object according to the category of the target video resource and the categories of the plurality of first objects.
Optionally, the processing unit 1002 is further configured to acquire color information of the image frames in the target video resource. The processing unit 1002 is further configured to determine the color information difference degree between image frames according to the color information of the image frames. The processing unit 1002 is further configured to determine the target image from the image frames, where the target image includes a plurality of images, and the color information difference degree between adjacent image frames in the target image is greater than a preset difference threshold.
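A minimal sketch of this key-frame selection, using colour histograms as the color information and the Euclidean distance between adjacent histograms as the difference degree; the metric and the threshold value are assumptions rather than the disclosed implementation.

```python
# Sketch of the key-frame selection: colour histograms stand in for the "color
# information", and the Euclidean distance between adjacent histograms is used as
# the difference degree. The metric and threshold are assumptions.
import cv2
import numpy as np

def color_hist(frame_bgr):
    """Normalised 8x8x8 BGR colour histogram of one frame."""
    hist = cv2.calcHist([frame_bgr], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
    return cv2.normalize(hist, hist).flatten()

def select_target_images(frames, diff_threshold=0.3):
    """Keep frames whose colour difference to the previous frame exceeds the threshold."""
    selected, prev = [], None
    for frame in frames:
        hist = color_hist(frame)
        if prev is None or np.linalg.norm(hist - prev) > diff_threshold:
            selected.append(frame)
        prev = hist
    return selected
```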
With regard to the video tag generation apparatus in the above embodiment, the specific manner in which each module performs operations has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 11 is a schematic structural diagram of a video tag generation apparatus 110 provided in the present disclosure. As shown in fig. 11, the video tag generating apparatus 110 may include at least one processor 1101 and a memory 1103 for storing instructions executable by the processor 1101. Wherein the processor 1101 is configured to execute instructions in the memory 1103 to implement the video tag generation method in the above-described embodiments.
In addition, the video tag generation apparatus 110 may further include a communication bus 1102 and at least one communication interface 1104.
The processor 1101 may be a GPU, a micro-processing unit, an ASIC, or one or more integrated circuits for controlling the execution of programs in accordance with the disclosed aspects.
Communication bus 1102 may include a path that transfers information between the aforementioned components.
The communication interface 1104, which may be any transceiver-like device, is used for communicating with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The memory 1103 may be a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disc, laser disc, optical disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these. The memory may be self-contained and connected to the processing unit by a bus. The memory may also be integrated with the processing unit as a volatile storage medium in the GPU.
The memory 1103 is used for storing instructions for executing the disclosed solution, and the processor 1101 controls the execution. The processor 1101 is configured to execute instructions stored in the memory 1103 to thereby implement the functions in the disclosed method.
In particular implementations, processor 1101 may include one or more GPUs, such as GPU0 and GPU1 in fig. 11, as one embodiment.
In a specific implementation, as an embodiment, the video tag generation apparatus 110 may include a plurality of processors, such as the processor 1101 and the processor 1107 in fig. 11. Each of these processors may be a single-core processor or a multi-core processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In a specific implementation, the video tag generation apparatus 110 may further include an output device 1105 and an input device 1106, as an embodiment. The output device 1105 is in communication with the processor 1101 and may display information in a variety of ways. For example, the output device 1105 may be a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display device, a Cathode Ray Tube (CRT) display device, a projector (projector), or the like. The input device 1106 is in communication with the processor 1101 and can accept user input in a variety of ways. For example, the input device 1106 may be a mouse, keyboard, touch screen device or sensing device, etc.
Those skilled in the art will appreciate that the configuration shown in fig. 11 does not constitute a limitation of the video tag generating means 110, and may include more or fewer components than those shown, or combine certain components, or employ a different arrangement of components.
The present disclosure also provides a computer-readable storage medium having instructions stored thereon. When the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the video tag generation method provided by the embodiments of the present disclosure.
The embodiment of the present disclosure further provides a computer program product containing instructions, which when run on an electronic device, causes the electronic device to execute the method for generating a video tag provided by the embodiment of the present disclosure.
The embodiment of the present disclosure also provides a communication system, as shown in fig. 1, the system includes a server 01 and a terminal 02. The server 01 and the terminal 02 are respectively configured to execute corresponding steps in the foregoing embodiments of the present disclosure, so that the communication system solves the technical problem solved by the embodiments of the present disclosure and achieves the technical effect achieved by the embodiments of the present disclosure, which is not described herein again.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for generating a video tag, the method comprising:
acquiring a target image from a target video resource, wherein the target video resource comprises a target object, and the target object is an object corresponding to information reflected by the target video resource;
carrying out object detection processing on the target image to obtain a plurality of first objects, wherein the first objects are objects contained in the target image;
determining the target object from the plurality of first objects according to the characteristic information of the target video resource;
determining a target label of the target object, wherein the target label corresponds to a video label of the target video resource.
2. The method according to claim 1, wherein, in a case where the number of frames of the target image is multiple frames, performing object detection processing on the target image to obtain multiple first objects comprises:
carrying out object detection processing on each frame of target image to obtain a plurality of second objects contained in each frame of target image;
and clustering the plurality of second objects in the target images of the plurality of frames to obtain a plurality of first objects.
3. The method of claim 1, wherein the feature information comprises a target area, the target area comprising an area in which the target object is located; the determining the target object from the plurality of first objects according to the characteristic information of the target video resource includes:
determining a plurality of target coincidence degrees according to the characteristic information of the target video resource, wherein the target coincidence degrees are coincidence degrees of the region where the first object is located and the target region;
and determining the target object according to the target coincidence degrees of the plurality of first objects, wherein the target object is the first object whose target coincidence degree meets a preset condition.
4. The method of claim 3, wherein the target area comprises an area in which a person image is located;
the determining a plurality of target coincidence degrees comprises:
when it is determined that a person image exists in the target image, determining the target coincidence degrees corresponding to the plurality of first objects according to the person image.
5. The method of claim 3 or 4, wherein the target region comprises a foreground region, the method further comprising:
when no person image exists in the target image, performing image segmentation processing on the target image to determine the foreground region of the target image;
and determining the target coincidence degrees corresponding to the plurality of first objects according to the foreground region.
6. The method of claim 1, wherein the feature information comprises: target text information, wherein the target text information is used for indicating information reflected by the target video resource;
the determining the target object from the plurality of first objects according to the characteristic information of the target video resource includes:
identifying the target text information and determining the category of the target video resource;
and determining the target object according to the category of the target video resource and the categories of the plurality of first objects.
7. An apparatus for generating a video tag, comprising:
an acquisition unit configured to perform acquisition of a target image from a target video resource including a target object that is an object corresponding to information reflected by the target video resource;
a processing unit configured to perform object detection processing on the target image to obtain a plurality of first objects, wherein the first objects are objects contained in the target image;
the processing unit is further configured to determine the target object from the plurality of first objects according to the feature information of the target video resource;
the processing unit is further configured to perform determining a target tag of the target object, the target tag corresponding to a video tag of the target video resource.
8. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of generating a video tag of any of claims 1-6.
9. A computer-readable storage medium having instructions stored thereon, wherein the instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of generating a video tag of any of claims 1-6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the method of generating a video tag of any of claims 1-6.
CN202210510803.6A 2022-05-11 2022-05-11 Video tag generation method and device, electronic equipment and storage medium Pending CN114896455A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210510803.6A CN114896455A (en) 2022-05-11 2022-05-11 Video tag generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114896455A true CN114896455A (en) 2022-08-12

Family

ID=82721967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210510803.6A Pending CN114896455A (en) 2022-05-11 2022-05-11 Video tag generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114896455A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103780973A (en) * 2012-10-17 2014-05-07 三星电子(中国)研发中心 Video label adding method and video label adding device
CN104573706A (en) * 2013-10-25 2015-04-29 Tcl集团股份有限公司 Object identification method and system thereof
CN104715023A (en) * 2015-03-02 2015-06-17 北京奇艺世纪科技有限公司 Commodity recommendation method and system based on video content
CN110163076A (en) * 2019-03-05 2019-08-23 腾讯科技(深圳)有限公司 A kind of image processing method and relevant apparatus
CN114399699A (en) * 2021-12-06 2022-04-26 北京达佳互联信息技术有限公司 Target recommendation object determination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination