CN116775938B - Method, device, electronic equipment and storage medium for retrieving comment video


Info

Publication number: CN116775938B
Application number: CN202311026762.4A
Authority: CN (China)
Prior art keywords: candidate, video, comment, target, image
Other languages: Chinese (zh)
Other versions: CN116775938A
Inventor: 张皓 (Zhang Hao)
Current Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Original Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Legal status: Active (granted); the legal status is an assumption and is not a legal conclusion
Events: application filed by Tencent Technology (Shenzhen) Co., Ltd.; priority to CN202311026762.4A; publication of CN116775938A; grant and publication of CN116775938B

Classifications

    • G06F16/7837: Information retrieval of video data; retrieval characterised by using metadata automatically derived from the content, using objects detected or recognised in the video content
    • G06F16/784: Information retrieval of video data; retrieval using objects detected or recognised in the video content, the detected or recognised objects being people
    • G06F16/7844: Information retrieval of video data; retrieval using original textual content or text extracted from visual content or transcript of audio data
    • G06V10/761: Image or video pattern matching; proximity, similarity or dissimilarity measures in feature spaces
    • G06V10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V40/172: Recognition of human faces; classification, e.g. identification

Abstract

The embodiment of the application discloses a method, a device, electronic equipment and a storage medium for retrieving a comment video. The method includes: acquiring published candidate comment videos from the same video category of a short video platform; performing video shot segmentation on the candidate comment videos to obtain a plurality of candidate comment fragments, and extracting fragment images of the frames in the candidate comment fragments; extracting candidate image features of each fragment image, and constructing a candidate feature library based on the candidate image features; and receiving a target image uploaded by a client based on the short video platform, extracting target image features of the target image, performing similarity matching in the candidate feature library based on the target image features, determining a target comment fragment from the candidate comment fragments according to the matching result, and obtaining a retrieval result of the target image according to the target comment fragment. This improves the construction efficiency of the retrieval database. The method can be applied to scenes such as cloud technology, video retrieval, artificial intelligence, intelligent traffic and assisted driving.

Description

Method, device, electronic equipment and storage medium for retrieving comment video
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, an electronic device and a storage medium for retrieving comment videos.
Background
Currently, relevant video data may be searched based on images, for example, corresponding video data may be retrieved from a pre-built retrieval database based on images to be retrieved.
In the related art, the data in the retrieval database is usually crawled by a search engine according to keywords. However, data crawled in this way generally contains a large amount of noise, which increases the workload of data cleaning and sorting and thus reduces the construction efficiency of the retrieval database.
Disclosure of Invention
The following is a summary of the subject matter of the detailed description of the application. This summary is not intended to limit the scope of the claims.
The embodiment of the application provides a method, a device, electronic equipment and a storage medium for retrieving comment videos, which can improve the construction efficiency of a retrieval database.
In one aspect, an embodiment of the present application provides a method for retrieving a comment video, including:
acquiring published candidate comment videos from the same video category of the short video platform;
performing video shot segmentation on the candidate comment video to obtain a plurality of candidate comment fragments, and extracting fragment images of the frames in the candidate comment fragments;
extracting candidate image features of each fragment image, and constructing a candidate feature library based on the candidate image features;
and receiving a target image uploaded by a client based on the short video platform, extracting target image features of the target image, performing similarity matching in the candidate feature library based on the target image features, determining a target comment fragment from the candidate comment fragments according to a matching result, and obtaining a retrieval result of the target image according to the target comment fragment.
In another aspect, an embodiment of the present application further provides an apparatus for retrieving comment videos, including:
a first acquisition module, configured to acquire published candidate comment videos from the same video category of a short video platform;
a first processing module, configured to perform video shot segmentation on the candidate comment video to obtain a plurality of candidate comment fragments, and extract fragment images of the frames in the candidate comment fragments;
a second processing module, configured to extract candidate image features of each fragment image, and construct a candidate feature library based on the candidate image features;
and a third processing module, configured to receive a target image uploaded by a client based on the short video platform, extract target image features of the target image, perform similarity matching in the candidate feature library based on the target image features, determine a target comment fragment from the candidate comment fragments according to a matching result, and obtain a retrieval result of the target image according to the target comment fragment.
Further, the second processing module is further configured to:
detecting the face area of the segment image to obtain a plurality of target face frames;
setting the pixel value of each pixel point in the target face frames to zero, or determining the pixel mean value of the pixel points outside the target face frames and setting the pixel value of each pixel point in the target face frames to that pixel mean value.
Further, the second processing module is further configured to:
scaling the fragment image for a plurality of times to obtain an image pyramid;
Detecting face areas of all images in the image pyramid to obtain a plurality of first candidate face frames, and performing non-maximum suppression on the first candidate face frames to obtain second candidate face frames;
Performing binary (face or non-face) classification on the second candidate face frames, removing the second candidate face frames that do not contain a face according to the classification results, and performing regression calibration and non-maximum suppression on the remaining second candidate face frames to obtain third candidate face frames;
And carrying out regression calibration and non-maximum suppression on the third candidate face frame to obtain a plurality of target face frames.
Further, the second processing module is further configured to:
Determining attention weights corresponding to the candidate feature elements, and weighting the candidate feature elements according to the attention weights to obtain attention elements corresponding to the candidate feature elements;
the attention features of the segment images are constructed based on the individual attention elements, and a candidate feature library is constructed based on the attention features.
Further, the second processing module is further configured to:
Determining feature average values of all the candidate feature elements in the candidate image features;
And determining a characteristic difference value between the candidate characteristic element and the characteristic mean value, and determining the attention weight corresponding to the candidate characteristic element according to the characteristic difference value.
Further, the second processing module is further configured to:
transposing the feature difference value to obtain a transposed difference value, and generating a covariance matrix of the candidate image feature according to the feature difference value and the transposed difference value;
constructing a diagonal matrix of the covariance matrix, and performing eigenvalue decomposition on the covariance matrix based on the diagonal matrix to obtain a reference feature;
and extracting the first reference feature element of the reference feature, and normalizing the product of the transposed difference value and the reference feature element to obtain the attention weight corresponding to each candidate feature element (a numerical sketch of this weighting is given below).
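The following is a minimal numerical sketch of this attention weighting, assuming the candidate image feature is an N x D matrix whose rows are the candidate feature elements; the use of NumPy, the softmax normalization, and the choice of the principal eigenvector as the "first reference feature element" are illustrative assumptions rather than details taken from the patent.

    import numpy as np

    def attention_feature(candidate_feature):
        # candidate_feature: (N, D) matrix of candidate feature elements
        mean = candidate_feature.mean(axis=0, keepdims=True)        # feature mean value
        diff = candidate_feature - mean                              # feature difference values
        cov = diff.T @ diff / max(len(candidate_feature) - 1, 1)     # covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)                       # eigenvalue decomposition
        first_ref = eigvecs[:, -1]                                   # assumed first reference feature element
        scores = diff @ first_ref                                    # product of difference values and reference element
        weights = np.exp(scores - scores.max())                      # normalized attention weights (softmax assumed)
        weights = weights / weights.sum()
        return (weights[:, None] * candidate_feature).sum(axis=0)    # attention feature of the fragment image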
Further, the first acquisition module is further configured to:
acquiring published candidate short videos from the same video category of a short video platform, and acquiring the video tags with which the candidate short videos are labeled;
when a video tag indicates that a candidate short video comments on the object being commented on, determining the candidate short video to be a candidate comment video.
Further, the first processing module is further configured to:
extracting features of the candidate comment videos to obtain candidate video features of the candidate comment videos;
splicing the candidate video features with the histogram features of the candidate comment video to obtain spliced video features;
predicting the boundary frame probability of each frame in the candidate comment video according to the spliced video characteristics;
and determining the frames whose boundary frame probability is greater than or equal to a preset probability threshold as video shot boundary frames, and segmenting the candidate comment video into a plurality of candidate comment fragments according to the video shot boundary frames (a sketch of this segmentation is given below).
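As an illustration only, cutting a candidate comment video at frames whose predicted boundary probability reaches the threshold might be sketched as follows; the function and parameter names are hypothetical and not taken from the patent.

    def split_by_boundary_probability(boundary_probs, prob_threshold=0.5):
        # boundary_probs[i]: predicted probability that frame i is a video shot boundary frame
        boundaries = [i for i, p in enumerate(boundary_probs) if p >= prob_threshold]
        fragments, start = [], 0
        for b in boundaries:
            if b > start:
                fragments.append((start, b))      # candidate comment fragment covering frames [start, b)
            start = b
        fragments.append((start, len(boundary_probs)))
        return fragments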
Further, the first processing module is further configured to:
Extracting segment features of the candidate comment segments from the candidate video features;
Classifying the candidate comment fragments according to the fragment features to obtain object tags of the candidate comment fragments, where an object tag indicates the sub-object of the object being commented on that is contained in the candidate comment fragment;
and marking the corresponding candidate comment fragments by using the object labels.
Further, the third processing module is further configured to:
Acquiring the object labels of the candidate comment fragments;
Determining, among the candidate comment segments other than the target comment segment, the reference comment segments labeled with the same object tag as the target comment segment;
and taking the target comment fragment and the reference comment fragment as a retrieval result of the target image.
Further, the third processing module is further configured to:
Sorting the candidate image features in descending order of similarity between the target image feature and the candidate image features, and determining the candidate comment fragments corresponding to the candidate image features ranked within a preset ranking threshold as target comment fragments;
or sorting the candidate image features in ascending order of similarity between the target image feature and the candidate image features, and determining the candidate comment fragments corresponding to the candidate image features ranked after a preset ranking threshold as target comment fragments (a sketch of the descending-order variant is given below).
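A minimal sketch of the descending-order variant, assuming the candidate feature library stores (candidate image feature, fragment id) pairs and using cosine similarity purely as an example; none of the names below come from the patent.

    import numpy as np

    def top_ranked_fragments(target_feature, feature_library, rank_threshold=5):
        # feature_library: iterable of (candidate_feature, fragment_id) pairs
        def cos(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        ranked = sorted(feature_library, key=lambda e: cos(target_feature, e[0]), reverse=True)
        # keep fragments whose candidate image features rank within the preset ranking threshold
        return [fragment_id for _, fragment_id in ranked[:rank_threshold]]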
In another aspect, an embodiment of the present application further provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor implements the above comment video retrieval method when executing the computer program.
In another aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the above comment video retrieval method.
In another aspect, an embodiment of the present application further provides a computer program product including a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device performs the above comment video retrieval method.
The embodiment of the application has at least the following beneficial effects. Published candidate comment videos are acquired from the same video category of a short video platform; video shot segmentation is then performed on the candidate comment videos to obtain a plurality of candidate comment fragments, fragment images of the frames in the candidate comment fragments are extracted, candidate image features of each fragment image are extracted, and a candidate feature library is constructed based on the candidate image features. Because the candidate comment videos come from the same video category of the short video platform, that is, the acquired candidate comment videos have already been sorted and classified by the short video platform, the data cleaning and sorting needed to construct the candidate feature library are reduced, and the construction efficiency of the retrieval database is improved. In addition, performing video shot segmentation on the candidate comment videos refines the data granularity in the candidate feature library, which better supports subsequent comment video retrieval and further improves its accuracy. On this basis, the retrieval result is obtained by receiving a target image uploaded by a client based on the short video platform, which is equivalent to integrating the comment video retrieval function into the short video platform; this makes it convenient to acquire candidate comment videos from the short video platform and also diversifies the functions of the short video platform.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The accompanying drawings are included to provide a further understanding of the application, are incorporated in and constitute a part of this specification, and serve to explain the application without limiting it.
FIG. 1 is a schematic diagram of an alternative implementation environment according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of an alternative comment video retrieval method according to an embodiment of the present application;
FIG. 3 is an alternative schematic diagram of obtaining candidate comment videos from a short video platform according to an embodiment of the present application;
FIG. 4 is an alternative schematic diagram of refining a candidate comment video according to an embodiment of the present application;
FIG. 5 is an alternative schematic diagram of constructing a candidate feature library according to an embodiment of the present application;
FIG. 6 is an alternative structural schematic diagram of a depth feature extraction model according to an embodiment of the present application;
FIG. 7 is an alternative schematic diagram of interference removal on the pixel values of a fragment image according to an embodiment of the present application;
FIG. 8 is an alternative schematic diagram of face region detection on a fragment image according to an embodiment of the present application;
FIG. 9 is an alternative schematic diagram of determining attention weights according to an embodiment of the present application;
FIG. 10 is an alternative schematic diagram of video shot segmentation of a candidate comment video according to an embodiment of the present application;
FIG. 11 is an alternative schematic diagram of the effect of video shot segmentation using video shot boundary frames according to an embodiment of the present application;
FIG. 12 is an alternative schematic diagram of object tags of candidate comment fragments according to an embodiment of the present application;
FIG. 13 is an alternative schematic diagram of expanding a retrieval result according to an embodiment of the present application;
FIG. 14 is an alternative schematic diagram of the comment video retrieval process according to an embodiment of the present application;
FIG. 15 is an alternative practical flowchart of a comment video retrieval method according to an embodiment of the present application;
FIG. 16 is an alternative overall flowchart of a comment video retrieval method according to an embodiment of the present application;
FIG. 17 is a schematic structural diagram of an alternative comment video retrieval apparatus according to an embodiment of the present application;
FIG. 18 is a partial block diagram of a terminal according to an embodiment of the present application;
FIG. 19 is a partial block diagram of a server according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In the embodiments of the present application, when related processing is performed according to data related to characteristics of a target object, such as attribute information or attribute information set of the target object, permission or consent of the target object is obtained first, and related laws and regulations and standards are complied with for collection, use, processing, etc. of the data. Wherein the target object may be a user. In addition, when the embodiment of the application needs to acquire the attribute information of the target object, the independent permission or independent consent of the target object is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the target object is explicitly acquired, the necessary target object related data for enabling the embodiment of the application to normally operate is acquired.
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained here:
A short video is an internet content distribution form: video content, typically from a few seconds to a few minutes long, that is pushed at high frequency on various new media platforms and is suitable for viewing while on the move or during short breaks.
A short video platform is an internet new media service carrier that provides users with short videos (e.g., videos within 5 minutes in duration) propagated on internet new media.
Cloud technology refers to a hosting technology that unifies hardware, software, network and other resources in a wide area network or a local area network to realize the computation, storage, processing and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and so on applied under the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The background services of technical network systems require a large amount of computing and storage resources, for example video websites, picture websites and other portal websites. As the internet industry develops, every item may have its own identification mark in the future, which will need to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data require strong back-end system support, which can only be realized through cloud computing.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to identify and measure targets and to perform further graphic processing, so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, autonomous driving, intelligent traffic and other directions.
In the related art, the data in the retrieval database is usually crawled by a search engine according to keywords. However, data crawled in this way generally contains a large amount of noise, which increases the workload of data cleaning and sorting and thus reduces the construction efficiency of the retrieval database.
In order to solve the above problems, embodiments of the present application provide a method, an apparatus, an electronic device and a storage medium for retrieving comment videos, which can improve the efficiency of constructing a retrieval database.
Referring to fig. 1, fig. 1 is a schematic diagram of an alternative implementation environment provided in an embodiment of the present application, where the implementation environment includes a terminal 101 and a server 102, where the terminal 101 and the server 102 are connected through a communication network.
The server 102 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data and artificial intelligence platforms. In addition, the server 102 may also be a node server in a blockchain network. Alternatively, the server 102 may acquire the published candidate comment videos from the same video category of the short video platform, construct a candidate feature library for comment video retrieval from the candidate comment videos, and perform comment video retrieval according to the target image sent by the terminal 101 to obtain the retrieval result of the target image.
The terminal 101 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart television, a smart watch, a vehicle-mounted terminal, etc. The terminal 101 and the server 102 may be directly or indirectly connected through wired or wireless communication, and embodiments of the present application are not limited herein. Alternatively, the terminal 101 may be provided with a short video platform or a client that may be used for data interaction with the short video platform, through which a target image for performing object recognition search may be uploaded to the server 102, and a search result obtained by the server 102 according to the target image search may be received.
For example, the server 102 may obtain candidate commentary videos that have been published from the same video category of the short video platform. Then, the server 102 may perform video shot segmentation on the candidate comment video to obtain a plurality of candidate comment fragments, and extract fragment images of each frame in the candidate comment fragments; server 102 may then extract candidate image features for each of the fragment images and construct a candidate feature library based on the candidate image features. The terminal 101 with the client can upload the target image for object recognition to the server 102, after receiving the target image uploaded by the terminal 101, the server 102 can extract the target image characteristics of the target image, perform similarity matching in a candidate feature library constructed in advance based on the target image characteristics, determine a target comment fragment from the candidate comment fragments according to the matching result, and obtain a retrieval result of the target image according to the target comment fragment, so that the terminal 101 can receive the retrieval result returned by the server 102, and further play the target comment fragment in the retrieval result. Because the candidate comment videos are positioned under the same video category of the short video platform, that is, the acquired candidate comment videos are sorted by the short video platform in advance, the requirements of data cleaning and sorting for constructing a candidate feature library are reduced, and the construction efficiency of a search database is improved; in addition, the candidate comment video is subjected to video shot segmentation, so that the data granularity in the candidate feature library can be thinned, the subsequent comment video retrieval can be better supported, and the accuracy of the comment video retrieval is further improved; on the basis, the target image uploaded by the client based on the short video platform is received to obtain a search result, which is equivalent to integrating the function of the comment video search into the short video platform, so that candidate comment videos can be conveniently obtained from the short video platform on the one hand, and the functions of the short video platform can be diversified on the other hand.
The method provided by the embodiment of the application can be applied to different technical fields, including but not limited to various scenes such as cloud technology, video retrieval, artificial intelligence, intelligent traffic, auxiliary driving and the like.
Referring to fig. 2, fig. 2 is a schematic flowchart of an alternative comment video retrieval method provided in an embodiment of the present application. The method may be performed by a terminal, by a server, or by a terminal and a server in combination; in the embodiment of the present application, the method is described using the server as an example. The comment video retrieval method includes, but is not limited to, the following steps 201 to 204.
Step 201: candidate commentary videos that have been published are obtained from the same video category of the short video platform.
In one possible implementation, the video category is used to classify the short videos on the short video platform, and a candidate comment video is a video that comments on an object being commented on, where the object being commented on is the subject that appears and is explained in the candidate comment video. The video categories may include travel, food, games, movies, music, electronic devices, and so on. For example, a candidate comment video may be a video commenting on a scenic spot or landmark, a video commenting on the functions of a home appliance, a video commenting on a game event triggered in a game screen, or a video commenting on a dish; correspondingly, the object being commented on may be a scenic spot or landmark, a home appliance, a game event, a dish, and so on.
The video category of the candidate comment videos may be determined according to the needs of the candidate feature library to be constructed subsequently; for example, if comment videos about scenic spots need to be retrieved, the video category may be travel.
In one possible implementation, the published candidate comment videos acquired from the short video platform may be comment videos uploaded through the short video platform, comment videos uploaded to the short video platform from another new media platform for publication, or comment videos published on another new media platform that can be viewed through the short video platform. In other words, the published candidate comment videos acquired from the short video platform are not limited to videos published directly on the short video platform: the short video platform may exchange data with other video platforms and obtain their published data through queries, so that the candidate comment videos may include video data published on the short video platform as well as video data published on other video platforms.
Because the candidate comment videos are located under the same video category of the short video platform, that is, the acquired candidate comment videos are sorted by the short video platform in advance, the requirements of data cleaning and sorting for constructing the candidate feature library are reduced, and the construction efficiency of the search database is improved.
In one possible implementation, in the process of acquiring published candidate comment videos from the same video category of the short video platform, the published candidate short videos may be acquired from the same video category of the short video platform, the video tags with which the candidate short videos are labeled may be acquired, and, when a video tag indicates that a candidate short video comments on the object being commented on, the candidate short video is determined to be a candidate comment video.
After a candidate short video is published, the short video platform may label it with video tags, which classify the candidate short video more finely. A video tag may indicate whether the candidate short video is a comment video; for example, under the video category "travel", the video tag may be "tour guide" or "comment". A video tag may also indicate the specific name of the object being commented on; for example, under the video category "travel", the video tag may be "Sight A", and so on.
In one possible implementation, there may be several video tags, used respectively to indicate whether the candidate short video is a comment video and to indicate the specific name of the object being commented on, so that the candidate short video is labeled more precisely; in this case, the video tags can be combined to determine whether the candidate short video comments on a specific object.
The short video platform hosts a large amount of short video data. It recognizes, classifies and sorts short videos of various presentation forms in advance according to their video categories and presentation forms, and labels the short videos under the different video categories with different video tags. Therefore, when published candidate comment videos are acquired through the short video platform, the published candidate short videos can be acquired from the corresponding video category of the short video platform, using the platform's video categories for an initial screening; candidate comment videos are then determined from the candidate short videos according to the video tags that the short video platform has attached to each candidate short video, which provides a second screening of the candidate short videos and effectively improves the accuracy and refinement of acquiring candidate comment videos.
For example, referring to fig. 3, fig. 3 is an alternative schematic diagram of obtaining candidate comment videos from a short video platform according to an embodiment of the present application. All short videos under the "travel" video category may be selected from the video categories as candidate short videos, and the candidate short videos whose video tags are "comment" and "Sight A" are then determined as candidate comment videos.
It will be appreciated that, for example, when the short video platform has "scenic spot commentary" as a video category, all candidate short videos under that video category are candidate comment videos.
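A minimal sketch of this two-stage screening, first by video category and then by video tags; the record fields and the tag values "comment" and "Sight A" are hypothetical placeholders, not part of the patent.

    def collect_candidate_comment_videos(short_videos, category="travel",
                                         required_tags=("comment", "Sight A")):
        # first screening: keep short videos under the target video category
        in_category = [v for v in short_videos if v.get("category") == category]
        # second screening: keep videos whose tags mark them as comment videos on the target object
        return [v for v in in_category
                if all(tag in v.get("tags", ()) for tag in required_tags)]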
Step 202: and video shot segmentation is carried out on the candidate comment video to obtain a plurality of candidate comment fragments, and fragment images of frames in the candidate comment fragments are extracted.
In the related art, since a candidate comment video may contain commentary on several objects, for example a candidate comment video formed by splicing video clips of several different objects being commented on, directly recognizing the whole candidate comment video is likely to lead to missed or erroneous recognition. Therefore, video shot segmentation can be performed on the candidate comment videos, which refines the data granularity in the candidate feature library, better supports subsequent comment video retrieval, and further improves the accuracy of comment video retrieval.
Splitting a candidate comment video according to changes of video shots may mean splitting the candidate comment video according to factors such as actions, scenes, angles and features, so as to distinguish the different video shots in the candidate comment video. The resulting video segments may be obtained by dividing the presentation time according to partial state changes of an action of the photographed object being commented on or according to the display of different local details, or by showing several local details of the photographed object separately. For example, for comment videos shot at scenic spots, the duration of the comment video of one scenic spot may be between a few minutes and ten minutes, one scenic spot may include several sub-scenic spots, and some of them are difficult to shoot with the same lens. Video shot segmentation of the candidate comment video therefore separates out the candidate comment segments corresponding to the sub-scenic spots, or the candidate comment segments showing different local details of the same scenic spot, so that the candidate comment video is refined and retrieval can be performed at the granularity of candidate comment segments with different display contents, which helps to improve the accuracy of comment video retrieval.
Specifically, referring to fig. 4, fig. 4 is an alternative schematic diagram of refining a candidate comment video according to an embodiment of the present application. A candidate comment video can be subjected to video shot segmentation to obtain candidate comment fragment A, candidate comment fragment B and candidate comment fragment C; the candidate comment fragments are then further refined by extracting the fragment images of the frames in each candidate comment fragment. Taking candidate comment fragment B as an example, fragment images of six frames may be extracted for subsequent processing.
In one possible implementation, depending on the processing capability of the server and in addition to the video shot segmentation of the candidate comment video, a video segmentation rule based on duration proportion, a fixed time interval or the number of segments can be added, and the candidate comment video can be segmented once or twice accordingly, which reduces the amount of data processed for a single comment fragment and improves the processing efficiency of the server. In addition, frame extraction for a candidate comment fragment may extract every frame of the fragment, or may sample frames at a preset sampling frequency.
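A minimal frame-extraction sketch using OpenCV, assuming each candidate comment fragment is available as a small video file; extracting every frame corresponds to sample_every=1, and all names here are illustrative.

    import cv2

    def extract_fragment_images(fragment_path, sample_every=1):
        cap = cv2.VideoCapture(fragment_path)
        images, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % sample_every == 0:          # keep every frame, or sample at a preset frequency
                images.append(frame)
            idx += 1
        cap.release()
        return images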
Step 203: and extracting candidate image features of each fragment image, and constructing a candidate feature library based on the candidate image features.
In one possible implementation, the candidate image features characterize the information carried by a fragment image, so that fragment images, comment fragments and candidate comment videos can be distinguished through the candidate image features. Therefore, constructing a candidate feature library from the candidate image features provides a refined video retrieval capability and improves the accuracy of comment video retrieval. Accordingly, the candidate feature library contains the candidate image features of each fragment image, and each candidate image feature is associated with its corresponding fragment image, so that the corresponding fragment image can be determined from a candidate image feature.
For example, referring to fig. 5, fig. 5 is an alternative schematic diagram of constructing a candidate feature library according to an embodiment of the present application. A pre-trained depth feature extraction network model may be used to extract the depth features of all fragment images, such as fragment image A to fragment image N, one by one, obtaining the candidate image feature of each fragment image. The candidate feature library is then constructed by associating each candidate image feature with its fragment image, candidate comment fragment and candidate comment video, so that retrieval matching based on fragment images can be realized and retrieval accuracy improved.
In one possible implementation, the extraction of candidate image features from the fragment images may be performed using a pre-trained depth feature extraction model based on computer vision techniques. Referring to fig. 6, fig. 6 is an alternative structural schematic diagram of a depth feature extraction model according to an embodiment of the present application. The depth feature extraction model may adopt an encoder-decoder architecture trained with a contrastive loss and a captioning loss, in which the decoder is decoupled into two parts, a single-mode decoder and a multi-mode decoder; cross-attention is omitted in the single-mode decoder so that it encodes a plain text representation, while the encoder output is fed into the multi-mode decoder through cross-attention to learn the multi-modal image-text representation. The contrastive loss is applied between the output of the encoder and the single-mode text decoder, and the captioning loss is applied at the output of the multi-mode decoder. Furthermore, by simply treating all labels as text, the depth feature extraction model can be trained with labeled image data and noisy image-text data. During training, the depth feature extraction model can be pre-trained with a manually annotated standard data set (i.e., positive sample data) and a large number of noisy open-source image-text pairs (i.e., negative sample data), so as to increase the similarity between images and texts in the positive sample data and reduce the similarity between images and texts in the negative sample data, thereby improving the feature extraction accuracy of the depth feature extraction model.
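A highly simplified sketch of combining the two training objectives described above, written with PyTorch; the symmetric InfoNCE-style contrastive term, the temperature and the loss weighting are illustrative assumptions rather than details taken from the patent.

    import torch
    import torch.nn.functional as F

    def pretraining_loss(image_emb, text_emb, caption_logits, caption_targets,
                         temperature=0.07, caption_weight=1.0):
        # contrastive loss between the encoder output and the single-mode text representation
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature
        labels = torch.arange(len(image_emb), device=logits.device)
        contrastive = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
        # captioning loss on the multi-mode decoder output (token-level cross entropy)
        captioning = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())
        return contrastive + caption_weight * captioning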
Step 204: and receiving a target image uploaded by a client based on the short video platform, extracting target image characteristics of the target image, performing similarity matching in a candidate feature library based on the target image characteristics, determining a target comment fragment from candidate comment fragments according to a matching result, and obtaining a retrieval result of the target image according to the target comment fragment.
In one possible implementation, the target image, that is, the image to be retrieved, may be image data uploaded by the client through the short video platform. It may be image data stored in the client in advance, image data captured in real time by calling the capture function of the client through the short video platform, or a screenshot of a short video being played by the client on the short video platform. The data format of the target image is the same as, or similar to, the data format of the fragment images of the candidate comment videos, so the processing requirements on the data are low, no correction is needed for data differences caused by different data sources, the processing efficiency is high, and the data compatibility is strong. The same image feature extraction process can therefore be used to extract features of the target image, the resulting target image feature has the same data format as the candidate image features, and similarity matching can be performed directly in the candidate feature library based on the target image feature, which improves the feature matching rate.
Moreover, obtaining the retrieval result by receiving the target image uploaded by a client based on the short video platform is equivalent to integrating the comment video retrieval function into the short video platform: on the one hand, candidate comment videos can be conveniently acquired from the short video platform; on the other hand, the functions of the short video platform become more diversified.
In one possible implementation, the client can exchange data with the server of the short video platform. The client uploads the target image to the server, and after receiving it, the server can extract features of the target image in the same way as features are extracted from fragment images, obtaining the target image feature of the target image. Similarity matching is then performed in the pre-built candidate feature library according to the target image feature to obtain a matching result, that is, candidate image features similar to the target image feature are matched; the similarity matching can be realized by calculating cosine similarity, Euclidean distance and the like. A target comment fragment is then determined from the candidate comment fragments according to the matched candidate image features: specifically, when the similarity between the candidate image feature of any fragment image of a candidate comment fragment and the target image feature is greater than or equal to a preset similarity threshold, that candidate comment fragment is determined to be a target comment fragment. The retrieval result of the target image is then determined according to the target comment fragment, and after obtaining the retrieval result, the server returns it to the client.
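A minimal sketch of this threshold-based matching, assuming cosine similarity and a feature library of (candidate image feature, fragment id) pairs; the threshold value and names are illustrative.

    import numpy as np

    def match_target_comment_fragments(target_feature, feature_library, sim_threshold=0.8):
        def cos(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        matched = set()
        for candidate_feature, fragment_id in feature_library:
            # a fragment is a target comment fragment if any of its fragment-image features is similar enough
            if cos(target_feature, candidate_feature) >= sim_threshold:
                matched.add(fragment_id)
        return matched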
In one possible implementation, before the candidate image features of each fragment image are extracted, face region detection may be performed on the fragment image to obtain a plurality of target face frames; then the pixel value of each pixel point in the target face frames is set to zero, or the pixel mean value of the pixel points outside the target face frames is determined and the pixel value of each pixel point in the target face frames is set to that pixel mean value. For comment videos other than those in which a person is the subject of the commentary, images of people appear in some fragment images without being the object being commented on; in other words, people are interfering objects, and their images may occlude the commented object and affect the recognition result. For example, a scenic spot comment video is usually recorded on site at the scenic spot, so the fragment images generally contain the commentator, whose image occupies part of the fragment image, and a scenic spot often contains a large number of tourists, so many person images appear in the fragment images, occluding parts of the scenic spot and degrading its recognition.
Therefore, the regions of the person images in the fragment image, namely the interference regions, are determined by face detection on the fragment image, and the interference regions are then processed to remove the interference, which improves the recognition of the object being commented on. Setting the pixel value of each pixel point in the target face frames to the pixel mean value preserves, to the greatest extent, the information related to the object being commented on in the fragment image, and therefore further improves its recognition.
For example, referring to fig. 7, fig. 7 is an alternative schematic diagram of removing interference from the pixel values of a fragment image according to an embodiment of the present application. By performing face region detection on the fragment image, the target face frames are determined from the fragment image; the fragment image is pixelated, and the pixel points of the interference region are determined according to the target face frames. As shown in fig. 7, the region of a target face frame may be represented by pixel points with pixel value X, and the region outside the target face frames by pixel points with pixel value Y. To remove the interference of the target face frames in the fragment image, their pixel values can be set to zero, or set to the pixel mean value of the pixel points outside the target face frames, which can be represented by pixel points with pixel value Z, i.e., the mean of the pixel points with pixel value Y in the fragment image. Face region detection includes recognizing the facial features of a person as well as the overall image of a person; correspondingly, a target face frame may contain a face image and may also contain the whole image of a person.
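A minimal sketch of this interference-removal step on an H x W x 3 image array, with both the zero-fill and mean-fill variants; the (x1, y1, x2, y2) box format is an assumption for illustration.

    import numpy as np

    def mask_face_frames(image, face_boxes, use_mean=True):
        masked = image.copy()
        box_mask = np.zeros(image.shape[:2], dtype=bool)
        for x1, y1, x2, y2 in face_boxes:
            box_mask[y1:y2, x1:x2] = True                 # pixels inside the target face frames
        if use_mean and (~box_mask).any():
            fill = image[~box_mask].mean(axis=0)          # per-channel mean of pixels outside the face frames
        else:
            fill = 0                                      # zero-fill variant
        masked[box_mask] = fill
        return masked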
In one possible implementation, in the process of performing face region detection on the fragment image to obtain a plurality of target face frames, the fragment image may first be scaled several times to obtain an image pyramid; face region detection is then performed on each image in the image pyramid to obtain a plurality of first candidate face frames, and non-maximum suppression is performed on the first candidate face frames to obtain second candidate face frames; the second candidate face frames are then classified as face or non-face, the second candidate face frames without faces are removed according to the classification results, and regression calibration and non-maximum suppression are performed on the remaining second candidate face frames to obtain third candidate face frames; finally, regression calibration and non-maximum suppression are performed on the third candidate face frames to obtain the target face frames.
In one possible implementation, face detection is a computer vision problem, namely locating one or more faces in a fragment image. Locating a face in the fragment image means finding the coordinates of the face in the fragment image and delimiting its extent with a bounding box around the face, forming a candidate face frame. Thus, face region detection can be performed on the fragment image using a face detection model.
For example, referring to fig. 8, fig. 8 is an alternative schematic diagram of face region detection on a fragment image according to an embodiment of the present application. First, the fragment image may be rescaled to different sizes, producing scaled images of different sizes, i.e., an image pyramid. The face detection model is a deep cascaded multi-task framework and may include several neural network sub-models, such as a Proposal Network (P-Net), a region filtering network (R-Net) and an Output Network (O-Net). In the first stage, the P-Net detects face regions in each scaled image of the image pyramid: a face classifier judges whether a face exists in each region, and bounding-box regression together with a facial keypoint locator makes a preliminary proposal of the face regions, yielding a plurality of first candidate face frames in which faces may exist. Also in the first stage, non-maximum suppression can be performed on the preliminarily generated first candidate face frames, calibrating first candidate face frames that are spatially close in the fragment image and merging those with high overlap, thereby obtaining the second candidate face frames. The second candidate face frames are then input to the R-Net for further processing: the R-Net performs face/non-face classification on the second candidate face frames and keeps only those that contain faces. Bounding-box regression calibration and non-maximum suppression are then applied to the second candidate face frames containing faces to suppress false positives, obtaining the third candidate face frames. The third candidate face frames are then input to the O-Net, which again performs face classification, bounding-box regression and facial keypoint localization; regression is performed using the facial keypoints, and non-maximum suppression is applied again after this regression, so that more image features are retained and accurate target face frames are determined from the third candidate face frames. It should be noted that the three neural network sub-models are not directly connected; instead, the output of one stage is fed as input to the next, and additional processing may be performed between stages. For example, the first candidate face frames proposed by the P-Net in the first stage may be filtered by non-maximum suppression to obtain the second candidate face frames before being provided to the R-Net in the second stage.
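The non-maximum suppression step used after each stage can be sketched as follows; this is a generic IoU-based implementation, not code from the patent, and boxes are assumed to be given as (x1, y1, x2, y2) with one score per box.

    import numpy as np

    def non_max_suppression(boxes, scores, iou_threshold=0.5):
        boxes = np.asarray(boxes, dtype=float)
        order = np.argsort(scores)[::-1]                  # highest-scoring candidate face frames first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(int(i))
            if order.size == 1:
                break
            rest = order[1:]
            x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
            y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
            x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
            y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
            inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
            iou = inter / (area_i + area_r - inter + 1e-12)
            order = rest[iou <= iou_threshold]            # drop frames that overlap the kept frame too much
        return keep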
In one possible implementation, in the process of constructing the candidate feature library based on the candidate image features, the attention weight corresponding to each candidate feature element can first be determined, and the candidate feature elements are weighted according to the attention weights to obtain the attention element corresponding to each candidate feature element; the attention feature of the segment image is then constructed based on these attention elements, and the candidate feature library is constructed based on the attention features. Here, each candidate image feature comprises a plurality of candidate feature elements. Since each segment image corresponds to one candidate image feature, the candidate image feature can be regarded as an overall image description of the segment image, while each candidate feature element can be regarded as an image description of one of the objects in the segment image, the segment image also containing a plurality of candidate objects. However, a segment image usually presents content related to a plurality of objects: part of the content is related to the object to be commented on, and part of it is not.
For candidate image features obtained by deep feature extraction, feature averaging could be applied directly to all candidate image features, i.e., the mean of all candidate image features could be taken directly as the basic feature for constructing the candidate feature library. However, this approach ignores the image information that the candidate image features are intended to express, and content unrelated to the object to be commented on is easily fused into the basic feature, which affects the data accuracy of the basic features in the candidate feature library and, in turn, the accuracy of comment video retrieval. Therefore, the attention weight corresponding to each candidate feature element can be determined and used to apply weighted correction to the candidate feature elements, highlighting the image information expressed in the segment image and helping to suppress information content unrelated to the object to be commented on, so that the accuracy of the data in the candidate feature library can be improved.
In one possible implementation, the attention weight corresponding to a candidate feature element may be determined according to the correlation between the object corresponding to that candidate feature element and the object to be commented on. The higher this correlation, the larger the attention weight assigned to the candidate feature element. The attention weight can then be used to apply weighted correction to the candidate feature element to obtain its corresponding attention element, so that the influence of candidate feature elements highly correlated with the object to be commented on is increased while the influence of weakly correlated elements is suppressed, thereby improving the accuracy of comment video retrieval.
In one possible implementation, the attention weight corresponding to a candidate feature element may be determined according to the proportion of the area occupied in the segment image by the object corresponding to that candidate feature element (objects other than interference objects such as person images). When this proportion is larger, the object corresponding to the candidate feature element can be considered more important as the image information to be expressed in the current segment image, so a larger attention weight is assigned; this highlights the main image information to be expressed in each segment image, suppresses interference from irrelevant content, and improves the accuracy of comment video retrieval. For example, the candidate feature elements may be ranked according to the magnitude of their feature values, and a candidate feature element with a larger feature value may be considered more important and assigned a greater attention weight, so as to emphasize the subject image information to be expressed in the segment image.
In one possible implementation manner, in the process of determining the attention weights corresponding to the candidate feature elements, feature average values of all candidate feature elements in the candidate image features may be determined first; and then, determining a characteristic difference value between the candidate characteristic element and the characteristic mean value, and determining the attention weight corresponding to the candidate characteristic element according to the characteristic difference value. By calculating the feature difference between the candidate feature elements and the feature mean, the deviation degree of each candidate feature element relative to the average level of the segment image can be determined, namely the significance of each candidate feature element is determined, so that corresponding attention weight can be allocated according to the significance of each candidate feature element in the segment image, and the main information content to be expressed in the segment image can be emphasized.
In one possible implementation, in the process of determining the attention weight corresponding to each candidate feature element according to the feature difference value, after the feature difference value between each candidate feature element and the feature mean is calculated, the ratio of that feature difference value to the sum over all candidate feature elements may be used as the attention weight of the candidate feature element. Alternatively, after the feature difference value between each candidate feature element and the feature mean is calculated, the attention weight of each candidate feature element may be determined as the ratio of its feature difference value to its own feature value.
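A minimal NumPy sketch of these two ratio-based weighting schemes follows; the variable names and the use of absolute differences are illustrative assumptions rather than details given in this embodiment:

```python
import numpy as np

def ratio_attention_weights(elements, mode="sum"):
    """elements: (M,) feature values of one segment image's candidate feature elements.
    Returns one attention weight per element, based on its deviation from the mean."""
    mean = elements.mean()
    diff = np.abs(elements - mean)            # feature difference value per element
    if mode == "sum":
        # proportion of the difference relative to the sum over all elements
        return diff / (np.abs(elements).sum() + 1e-9)
    # alternative: difference relative to the element's own feature value
    return diff / (np.abs(elements) + 1e-9)
```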
In one possible implementation manner, in the process of determining the attention weight corresponding to the candidate feature element according to the feature difference value, the feature difference value may be transposed to obtain a transposed difference value, and a covariance matrix of the candidate image feature is generated according to the feature difference value and the transposed difference value; then, constructing a diagonal matrix of the covariance matrix, and carrying out eigenvalue decomposition on the covariance matrix based on the diagonal matrix to obtain reference characteristics; and extracting the first reference characteristic element in the reference characteristic, and normalizing the product between the transposed difference value and the reference characteristic element to obtain the attention weight corresponding to the candidate characteristic element.
For example, referring to fig. 9, fig. 9 is an alternative schematic diagram for determining attention weights according to an embodiment of the present application. Suppose $N$ segment images are extracted from the candidate comment fragments, and candidate image features are extracted for all segment images to obtain $N$ candidate image features, each candidate image feature including $M$ candidate feature elements, where the $j$-th candidate feature element of the $i$-th segment image is denoted $x_{i,j}$. First, the mean of all candidate feature elements is calculated as:

$$\mu = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} x_{i,j}$$

where $\mu$ is the feature mean, $N$ is the total number of extracted segment images, $M$ is the total number of candidate feature elements in one segment image, $x_{i,j}$ is the $j$-th candidate feature element of the $i$-th segment image, and $N$, $M$, $i$ and $j$ are all positive integers. Then, the covariance matrix of the candidate image features is calculated as:

$$\Sigma = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} (x_{i,j}-\mu)(x_{i,j}-\mu)^{T}$$

where $\Sigma$ is the covariance matrix. Since the covariance matrix is a real symmetric matrix, eigenvalue decomposition can be performed as shown in the following formula:

$$\Sigma = U \Lambda U^{T}$$

where the matrix $U$ is the matrix of eigenvectors of the covariance matrix $\Sigma$, the matrix $\Lambda$ is the diagonal matrix of the covariance matrix $\Sigma$, the elements on the diagonal of $\Lambda$ are the eigenvalues of $\Sigma$, and these diagonal elements are arranged from large to small from top-left to bottom-right. After the eigenvalue decomposition of the covariance matrix $\Sigma$, the reference feature can be obtained; the reference feature is the matrix $U$. The first reference feature element in the reference feature, namely the first column of feature elements in $U$, denoted $u_{1}$, is extracted. The reference feature element represents the direction of maximum variance of the centered features in the new coordinate system, so it can represent the information of high importance in the segment images. For the $i$-th segment image, each candidate feature element is normalized using the product between the transposed difference value and the reference feature element to obtain the attention weight corresponding to the candidate feature element:

$$\alpha_{i,j} = \sigma\left((x_{i,j}-\mu)^{T} u_{1}\right)$$

where $u_{1}$ is the reference feature element and $\alpha_{i,j}$ is the attention weight of the $j$-th candidate feature element of the $i$-th segment image. It is noted that $\sigma(\cdot)$ is a sigmoid function used to normalize the data. Based on the attention weights, each candidate feature element of the segment image is weighted to obtain the attention elements of the segment image:

$$a_{i,j} = \alpha_{i,j}\, x_{i,j}$$

where $a_{i,j}$ is the attention element corresponding to the $j$-th candidate feature element. The attention feature of the $i$-th segment image is then constructed from its attention elements, i.e., $y_{i} = \{a_{i,1}, a_{i,2}, \ldots, a_{i,M}\}$, and the candidate feature library can then be constructed using the attention features of the segment images.
It will be appreciated that a similar approach may be taken when extracting the target image features of the target image. Denoting the $j$-th target feature element of the target image as $z_{j}$ and the corresponding feature mean as $\mu_{z} = \frac{1}{M}\sum_{j=1}^{M} z_{j}$, the attention weight of each target feature element is:

$$\beta_{j} = \sigma\left((z_{j}-\mu_{z})^{T} v_{1}\right)$$

where $v_{1}$ is the reference feature element obtained from the eigenvalue decomposition of the covariance matrix of the target image features. The attention element corresponding to each target feature element is:

$$b_{j} = \beta_{j}\, z_{j}$$

Finally, the attention feature of the target image is obtained from these attention elements, and similarity matching is performed based on the attention feature of the target image and the attention feature of each segment image in the candidate feature library.
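The following Python sketch illustrates the attention weighting described above for a set of candidate feature elements. It is an illustrative reading of the formulas (the array shapes, variable names, and the use of the principal eigenvector are assumptions), not the concrete implementation of this embodiment:

```python
import numpy as np

def attention_weighted_features(features):
    """features: (N, M, D) array of N segment images, M candidate feature
    elements per image, each element a D-dimensional vector.
    Returns (N, M, D) attention elements and (N, M) attention weights."""
    flat = features.reshape(-1, features.shape[-1])        # all elements of all images
    mu = flat.mean(axis=0)                                  # feature mean
    diff = flat - mu                                        # feature difference values
    cov = diff.T @ diff / flat.shape[0]                     # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)                  # symmetric eigendecomposition
    u1 = eigvecs[:, np.argmax(eigvals)]                     # first reference feature element
    scores = diff @ u1                                      # product of transposed diff and u1
    weights = 1.0 / (1.0 + np.exp(-scores))                 # sigmoid normalization
    weights = weights.reshape(features.shape[:2])
    attention = features * weights[..., None]               # weighted attention elements
    return attention, weights
```

The attention feature of each segment image can then be formed from its attention elements, for example by concatenating or pooling them, and stored in the candidate feature library.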
In one possible implementation, in the process of performing video shot segmentation on the candidate comment video, feature extraction may first be performed on the candidate comment video to obtain candidate video features of the candidate comment video; the candidate video features are then spliced with the histogram features of the candidate comment video to obtain spliced video features; the boundary-frame probability of each frame in the candidate comment video is predicted from the spliced video features; frames whose boundary-frame probability is greater than or equal to a preset probability threshold are determined as video shot boundary frames, and the candidate comment video is segmented into a plurality of candidate comment fragments according to the video shot boundary frames. After the candidate video features of the candidate comment video are obtained through feature extraction, the candidate video features can also be segmented into a plurality of feature segments, where each feature segment can represent a shot segment of one object in the candidate comment video. Each feature segment is then classified by a classification module, the classified feature segments are localized, and the candidate comment video is segmented into a plurality of candidate comment fragments according to the localization result, so that the video segment corresponding to each object can be cut out of the candidate comment video.
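As a minimal sketch of the thresholding step (the threshold value and the source of the per-frame probabilities are assumptions), the candidate comment video can be cut into fragments at frames whose predicted boundary probability meets the threshold:

```python
def split_into_fragments(boundary_probs, prob_threshold=0.5):
    """boundary_probs: list of per-frame shot-boundary probabilities.
    Returns (start, end) frame-index ranges of candidate comment fragments."""
    boundary_frames = [i for i, p in enumerate(boundary_probs) if p >= prob_threshold]
    fragments, start = [], 0
    for b in boundary_frames:
        if b > start:                      # close the current fragment at the boundary frame
            fragments.append((start, b))
        start = b
    fragments.append((start, len(boundary_probs)))  # tail fragment after the last boundary
    return fragments
```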
For example, referring to fig. 10, fig. 10 is an optional schematic diagram of video shot segmentation for candidate comment videos according to an embodiment of the present application. The candidate comment video is input into a 64-channel depth decoupling convolutional neural network (Depth Decoupling Convolutional Neural Network, DD-CNN) for feature extraction to obtain a first video primary feature; the first video primary feature is then input into a 64-channel DD-CNN for feature extraction to obtain a first video secondary feature; the first video primary feature and the first video secondary feature are then mixed and average-pooled to obtain a first pooling feature.
After the first pooling feature is obtained, it is input into a 128-channel DD-CNN for feature extraction to obtain a second video primary feature; the second video primary feature is then input into a 128-channel DD-CNN for feature extraction to obtain a second video secondary feature; the second video primary feature and the second video secondary feature are then mixed and average-pooled to obtain a second pooling feature.

After the second pooling feature is obtained, it is input into a 256-channel DD-CNN for feature extraction to obtain a third video primary feature; the third video primary feature is then input into a 256-channel DD-CNN for feature extraction to obtain a third video secondary feature; the third video primary feature and the third video secondary feature are then mixed and average-pooled to obtain a third pooling feature.
After the third pooling feature is obtained, the first pooling feature, the second pooling feature and the third pooling feature are input to a learnable similarity module for similarity measurement, which outputs similarity scoring features. These can represent the commonalities and differences among the first, second and third pooling features, so that similar feature data lie closer together in the feature space while dissimilar feature data are more dispersed, improving prediction accuracy. The histogram features of the candidate comment video after compression processing, the similarity scoring features and the third pooling feature are then input to a fully connected layer for connection processing to obtain fully connected data, and the fully connected data are input to a classification module for classification, yielding single transition frames from local prediction and transition segments (or transition frames) from overall prediction in the candidate comment video.
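A rough PyTorch sketch of this cascaded feature pipeline is shown below. The internal structure of a DD-CNN block is not specified here, so it is stood in for by an ordinary 3D convolution; the kernel sizes, activation and input layout are likewise assumptions made only for illustration:

```python
import torch
import torch.nn as nn

class ShotFeaturePipeline(nn.Module):
    """Three cascaded stages (64, 128, 256 channels); each stage applies two
    convolutional blocks, mixes their outputs and average-pools the result."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_channels
        for c in (64, 128, 256):
            self.stages.append(nn.ModuleDict({
                "primary": nn.Conv3d(prev, c, kernel_size=3, padding=1),   # stand-in for a DD-CNN block
                "secondary": nn.Conv3d(c, c, kernel_size=3, padding=1),    # second block of the stage
                "pool": nn.AvgPool3d(kernel_size=(1, 2, 2)),               # spatial average pooling
            }))
            prev = c

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        pooled_features = []
        x = video
        for stage in self.stages:
            primary = torch.relu(stage["primary"](x))
            secondary = torch.relu(stage["secondary"](primary))
            mixed = primary + secondary            # mix primary and secondary features
            x = stage["pool"](mixed)
            pooled_features.append(x)
        return pooled_features   # first, second and third pooling features
```

The three pooled outputs would then feed the similarity module and the fully connected classification head described above.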
For example, referring to fig. 11, fig. 11 is a schematic diagram of an optional effect of video shot segmentation using video shot boundary frames according to an embodiment of the present application, the video shot boundary frames of candidate comment videos may be determined by combining local transition frames and whole transition segments, and one candidate comment video may include a plurality of video shot boundary frames, so that the candidate comment video may be segmented into a plurality of candidate comment segments by using the video shot boundary frames, where the duration of each candidate comment segment may be the same or different.
In one possible implementation, after video shot segmentation is performed on the candidate comment video to obtain a plurality of candidate comment fragments, segment features of the candidate comment fragments may be extracted from the candidate video features; the candidate comment fragments are then classified according to the segment features to obtain object tags of the candidate comment fragments, where the object tags are used to indicate sub-objects of the object to be commented on that are contained in the candidate comment fragments; the corresponding candidate comment fragments are then marked with the object tags.
Specifically, after the candidate video features are obtained, the candidate video features may be segmented by a time-sequence segmentation module to form segment features of a plurality of candidate comment fragments. The time-sequence segmentation module may segment the video according to a preset duration threshold or a preset proportion of the duration of the candidate comment video, or segment it according to changes in the state or local details of the actions of the object to be commented on in the candidate comment video, so as to distinguish different presentation periods and events. The object to be commented on is the main object in the candidate comment video; since it may include a plurality of sub-objects, and each segment feature can represent the feature corresponding to a sub-object, the candidate comment fragments can be classified based on the segment features to obtain the object label of each candidate comment fragment, and the object labels are then used to mark the corresponding candidate comment fragments. This facilitates retrieving and matching the candidate comment videos by means of the object labels and improves the accuracy of comment video retrieval.
For example, referring to fig. 12, fig. 12 is an alternative schematic diagram of object tags of candidate comment fragments according to an embodiment of the application. The object tag added to a candidate comment fragment may include sub-object tag information for indicating a sub-object of the object contained in the candidate comment fragment, and may also include object tag information for indicating the object contained in the candidate comment fragment; that is, an object tag may include at least one of the above sub-object tag information and object tag information. As shown in fig. 12, candidate comment video 1 is a comment video about sight A under the "travel" video category. After video shot segmentation of candidate comment video 1, three candidate comment fragments are obtained: the first candidate comment fragment is a video segment commenting on sight A, the second is a video segment commenting on sub-sight B within sight A, and the third is a video segment commenting on sub-sight C within sight A. Thus, the object tag of the first candidate comment fragment may include the object tag information "sight A"; the second candidate comment fragment may include the object tag information "sight A" and the sub-object tag information "sub-sight B"; and the third candidate comment fragment may include the object tag information "sight A" and the sub-object tag information "sub-sight C". The object tags therefore make it convenient to distinguish different candidate comment fragments, improving video retrieval efficiency.
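As a minimal illustration (the field names and tag values are assumptions chosen to mirror the example above), the object tags attached to candidate comment fragments could be represented as simple records:

```python
from dataclasses import dataclass, field

@dataclass
class CommentFragment:
    """A candidate comment fragment with its object tag information."""
    fragment_id: str
    object_tags: set = field(default_factory=set)      # objects commented on, e.g. {"sight A"}
    sub_object_tags: set = field(default_factory=set)  # sub-objects, e.g. {"sub-sight B"}

fragments = [
    CommentFragment("video1-seg1", object_tags={"sight A"}),
    CommentFragment("video1-seg2", object_tags={"sight A"}, sub_object_tags={"sub-sight B"}),
    CommentFragment("video1-seg3", object_tags={"sight A"}, sub_object_tags={"sub-sight C"}),
]
```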
In one possible implementation, in the process of obtaining the retrieval result of the target image according to the target comment fragment, the object tag of each candidate comment fragment may first be obtained; among the candidate comment fragments other than the target comment fragment, reference comment fragments marked with the same object tag as the target comment fragment are determined; the target comment fragment and the reference comment fragments are then used as the retrieval result of the target image. By obtaining the object tags of the candidate comment fragments and comparing them with the object tag of the target comment fragment, candidate comment fragments related to the target comment fragment can be identified, so that comment fragments highly relevant to the target comment fragment are screened out, the search range is narrowed and redundant information is reduced. Determining, among the other candidate comment fragments, reference comment fragments marked with the same object tag as the target comment fragment provides diversified reference information: a reference comment fragment may offer a viewing angle, viewpoint or supplementary information different from the target comment fragment, or comment information on a similar object, thereby enriching the retrieval result of the target image. Using both the target comment fragment and the related reference comment fragments as the retrieval result improves the accuracy and richness of the result: the target comment fragment provides a specific description of the target image, while the reference comment fragments provide a more comprehensive or more detailed description and verify the content of the target image from multiple angles. This helps to improve the retrieval quality for the target image, as well as the diversity and accuracy of the retrieval result and its relevance to the target image. By accurately matching comment fragments and comprehensively expanding to similar fragment content, a more accurate and comprehensive search result can be obtained.
For example, referring to fig. 13, fig. 13 is an optional schematic diagram of an extended search result provided in an embodiment of the present application. As shown in fig. 13, there may be a plurality of target comment fragments: the object tag of the first target comment fragment may include the tag information "sight A" and "sub-sight B", and the object tag of the second target comment fragment may include the tag information "sight A" and "sub-sight C", where sight A includes sub-sight B, sub-sight C and sub-sight D. The candidate comment fragments other than the target comment fragments include a first, a second, a third and a fourth candidate comment fragment: the object tag of the first candidate comment fragment may include the tag information "sight A" and "sub-sight D", the object tag of the second candidate comment fragment may include the tag information "sight X" and "sub-sight Y", the object tag of the third candidate comment fragment may include "sub-sight Y" and "sub-sight C", and the object tag of the fourth candidate comment fragment may include "sub-sight Z" and "sub-sight D", where sight X includes sub-sight Y and sub-sight Z. When determining the reference comment fragments marked with the same object tag as a target comment fragment, "the same object tag" may mean that the tag information in the object tags is completely or partially identical. As shown in fig. 13, the tag information of the object tag of the first candidate comment fragment is partially identical to that of the first target comment fragment, so the object tag of the first candidate comment fragment is regarded as the same as the object tag of the first target comment fragment; likewise, the object tag of the third candidate comment fragment is the same as the object tag of the second target comment fragment. Therefore, the first and third candidate comment fragments can be used as reference comment fragments, and the first target comment fragment, the second target comment fragment, the first candidate comment fragment and the third candidate comment fragment together form the search result of the target image.
In addition, "marked with the same object tag" may also cover the case where the tags identify sub-objects belonging to the same object, or the object to which those sub-objects belong. As shown in fig. 13, the object tag of the fourth candidate comment fragment includes the tag information "sub-sight D", which is a sub-object of the object "sight A", while the object tag of the first target comment fragment includes the tag information of the object "sight A" itself and of another of its sub-objects, "sub-sight B"; both tags therefore relate to the same object "sight A", so the fourth candidate comment fragment may also be used as a reference comment fragment.
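Continuing the hypothetical record type sketched earlier, reference comment fragments could be selected by this kind of tag comparison. The notion of "same object" used here (direct tag overlap, or sub-object tags mapping to a shared parent object) is an illustrative assumption:

```python
def find_reference_fragments(target, candidates, sub_to_parent):
    """target: a CommentFragment chosen as the target comment fragment.
    candidates: other CommentFragment instances.
    sub_to_parent: dict mapping a sub-object tag to its parent object tag,
    e.g. {"sub-sight B": "sight A", "sub-sight D": "sight A"}."""
    # objects the target fragment relates to, including parents of its sub-objects
    target_objects = set(target.object_tags)
    target_objects |= {sub_to_parent[s] for s in target.sub_object_tags if s in sub_to_parent}

    references = []
    for cand in candidates:
        cand_objects = set(cand.object_tags)
        cand_objects |= {sub_to_parent[s] for s in cand.sub_object_tags if s in sub_to_parent}
        # partially identical tag information, or sub-objects of the same parent object
        if (target.object_tags & cand.object_tags or
                target.sub_object_tags & cand.sub_object_tags or
                target_objects & cand_objects):
            references.append(cand)
    return references
```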
In one possible implementation, in the process of determining the target comment fragment from the candidate comment fragments according to the matching result, the candidate image features may be ranked in order of decreasing similarity to the target image features, and the candidate comment fragments corresponding to the candidate image features ranked before a preset ranking threshold are determined as target comment fragments; alternatively, the candidate image features may be ranked in order of increasing similarity to the target image features, and the candidate comment fragments corresponding to the candidate image features ranked after a preset ranking threshold are determined as target comment fragments.
Ranking by the similarity between the target image features and the candidate image features places candidate comment fragments with high similarity at the front, so comment fragments highly similar to the target image can be screened out, improving the retrieval accuracy of the target comment fragments. Choosing different ranking thresholds satisfies different matching requirements and controls the sensitivity, and hence the accuracy, of the matching: a lower ranking threshold emphasizes precise similarity matching, while a higher ranking threshold emphasizes broad relevance matching. When the candidate image features are ranked from high to low similarity, selecting the candidate comment fragments ranked before the preset ranking threshold as target comment fragments gives priority to the most similar comment fragments and improves the accuracy of the search result; when the candidate image features are ranked from low to high similarity, selecting the candidate comment fragments ranked after the preset ranking threshold as target comment fragments preferentially removes comment fragments with low similarity (i.e., irrelevant ones) while retaining more potentially relevant comment fragments, improving the richness of the search result.
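A minimal sketch of this ranking step follows; cosine similarity and the threshold value are assumptions, since the embodiment only requires some similarity measure and a preset ranking threshold:

```python
import numpy as np

def select_target_fragments(target_feature, candidate_features, ranking_threshold=5):
    """target_feature: (D,) attention feature of the target image.
    candidate_features: (K, D) attention features of candidate comment fragments.
    Returns indices of the fragments ranked before the ranking threshold."""
    t = target_feature / (np.linalg.norm(target_feature) + 1e-9)
    c = candidate_features / (np.linalg.norm(candidate_features, axis=1, keepdims=True) + 1e-9)
    similarity = c @ t                               # cosine similarity per candidate
    order = np.argsort(-similarity)                  # rank from high to low similarity
    return order[:ranking_threshold].tolist()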
Referring to fig. 14, fig. 14 is a schematic diagram illustrating an alternative overall process of comment video retrieval according to an embodiment of the present application. The server can obtain, through the short video platform, short videos belonging to the scenic-spot comment video category as candidate comment videos, and then perform video shot segmentation on the candidate comment videos to obtain a plurality of candidate comment fragments. Frame extraction is then performed on each candidate comment fragment to obtain the segment image of each frame in the candidate comment fragment. Face detection is then performed on all segment images, and the pixels of the target face frames are set to zero, so that the influence of the target face frames on the segment images is suppressed, realizing de-interference processing of the segment images. Feature extraction is performed on the de-interfered segment images to obtain candidate image features, the feature difference between each candidate feature element in the candidate image features and the feature mean of all candidate feature elements is determined, and the covariance matrix of the candidate image features is generated from these feature differences. After the covariance matrix is constructed, eigenvalue decomposition can be performed on it to obtain the reference feature element, and the attention weight corresponding to each candidate feature element is calculated using the reference feature element. The candidate feature elements are then weighted with their attention weights to obtain attention elements, the attention features of the segment images are constructed from the attention elements, and the candidate feature library is then constructed based on the attention features.
The client can upload the target image to the server. After receiving the target image, the server can perform face region detection on the target image to obtain a plurality of target face frames; then, the pixel value of each pixel in the target face frames is set to zero, or the pixel mean of the pixels outside the target face frames is determined and the pixel value of each pixel in the target face frames is set to that pixel mean. After the influence of the target face frames in the target image has been suppressed in this way, the target image features can be extracted with a pre-trained depth feature extraction model. After the target image features are determined, the attention weight corresponding to each target feature element in the target image features is determined, and the target feature elements are weighted according to the attention weights to obtain the attention elements corresponding to the target feature elements; the target attention feature of the target image is constructed from these attention elements. Similarity matching can then be performed in the candidate feature library based on the target attention feature, a target comment fragment is determined from the candidate comment fragments according to the matching result, and the retrieval result of the target image is obtained from the target comment fragment. The target image may contain partial images of people, which are not the retrieval object of the target image; that is, they are interference objects and may occlude the retrieval object and affect the retrieval result. For example, when a scenic-spot comment video is retrieved using a photo of a scenic spot, the photo is taken at the scene, where large numbers of tourists are often present, so many partial images of people may appear in the target image, occluding part of the features of the scenic spot and affecting its recognition. Face region detection here includes identifying the facial features of a person as well as the overall image features of a person; correspondingly, a target face frame may contain a face image and may also contain the overall image of a person.
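A minimal sketch of the suppression step described above follows; the box format and variable names are assumptions, while the choice between zeroing and mean-filling mirrors the two alternatives in the text:

```python
import numpy as np

def suppress_face_frames(image, face_boxes, use_mean_fill=False):
    """image: (H, W, C) array; face_boxes: list of [x1, y1, x2, y2] target face frames.
    Either zeroes the pixels inside each frame or fills them with the mean
    of the pixels outside all frames."""
    result = image.copy()
    if use_mean_fill:
        mask = np.ones(image.shape[:2], dtype=bool)
        for x1, y1, x2, y2 in face_boxes:
            mask[y1:y2, x1:x2] = False                 # exclude pixels inside face frames
        fill_value = image[mask].mean(axis=0)          # mean of pixels outside the frames
    else:
        fill_value = 0
    for x1, y1, x2, y2 in face_boxes:
        result[y1:y2, x1:x2] = fill_value
    return result
```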
In one possible implementation, in the process of performing face region detection on the target image to obtain a plurality of target face frames, the target image may first be scaled multiple times to obtain an image pyramid; face region detection is then performed on each image in the image pyramid to obtain a plurality of first candidate face frames, and non-maximum suppression is performed on the first candidate face frames to obtain second candidate face frames; the second candidate face frames are then classified, the second candidate face frames that contain no face are removed according to the classification result, and regression calibration and non-maximum suppression are performed on the remaining second candidate face frames to obtain third candidate face frames; finally, regression calibration and non-maximum suppression are performed on the third candidate face frames to obtain a plurality of target face frames.
In one possible implementation, the face detection model may be used to perform face region detection on the target image. The target image usually presents content related to a plurality of objects, for example content related to the retrieval object and content unrelated to it. For target image features obtained by deep feature extraction, if feature averaging is applied directly to all target image features, i.e., the mean of all target image features is taken directly as the feature used for comment video retrieval, the retrieval information that the target image features are intended to express is easily ignored, and content unrelated to the retrieval object is easily fused into the basic feature, which affects the data accuracy of the target image features and, in turn, the accuracy of comment video retrieval. Therefore, the attention weight corresponding to each target feature element can be determined and used to apply weighted correction to the target feature elements, highlighting the image retrieval information expressed in the target image and helping to suppress information content unrelated to the retrieval object, so that the accuracy of comment video retrieval can be improved.
In one possible implementation, the attention weight corresponding to a target feature element may be determined according to the proportion of the area occupied in the target image by the object corresponding to that target feature element (objects other than interference objects such as person images). When this proportion is larger, the object corresponding to the target feature element can be considered more important as the image retrieval information to be expressed in the current target image, so a larger attention weight is assigned; this highlights the main retrieval information to be expressed in the target image, suppresses interference from irrelevant content, and improves the accuracy of comment video retrieval. For example, the target feature elements may be ranked according to the magnitude of their feature values, and a target feature element with a larger feature value may be considered more important and assigned a greater attention weight, so as to emphasize the subject retrieval information to be expressed in the target image.
In one possible implementation manner, in the process of determining the attention weight corresponding to the target feature element, feature average values of all target feature elements in the target image feature may be determined first; and then, determining a characteristic difference value between the target characteristic element and the characteristic mean value, and determining the attention weight corresponding to the target characteristic element according to the characteristic difference value. By calculating the characteristic difference between the target characteristic elements and the characteristic mean value, the deviation degree of each target characteristic element relative to the average level of the target image can be determined, namely the significance of each target characteristic element is determined, so that corresponding attention weight can be allocated according to the significance of each target characteristic element in the target image, and the main retrieval information content to be expressed in the target image can be emphasized.
In one possible implementation, in the process of determining the attention weight corresponding to each target feature element according to the feature difference value, after the feature difference value between each target feature element and the feature mean is calculated, the ratio of that feature difference value to the sum over all target feature elements may be used as the attention weight of the target feature element. Alternatively, after the feature difference value between each target feature element and the feature mean is calculated, the attention weight of each target feature element may be determined as the ratio of its feature difference value to its own feature value.
In one possible implementation, in the process of determining the attention weight corresponding to the target feature element according to the feature difference value, the feature difference value may be transposed to obtain a transposed difference value, and the covariance matrix of the target image features is generated according to the feature difference value and the transposed difference value; a diagonal matrix of the covariance matrix is then constructed, and eigenvalue decomposition is performed on the covariance matrix based on the diagonal matrix to obtain the reference feature; the first reference feature element in the reference feature is then extracted, and the product between the transposed difference value and the reference feature element is normalized to obtain the attention weight corresponding to the target feature element. After the attention weight corresponding to each target feature element is obtained, the target feature elements can be weighted according to the attention weights to obtain the attention elements corresponding to the target feature elements, and the target attention feature of the target image is constructed based on these attention elements.
The principle of the illustrative video search method provided by the embodiment of the present application is described in detail below with specific examples.
Referring to fig. 15, fig. 15 is an alternative practical flowchart of a comment video retrieval method according to an embodiment of the present application. The server 102 may obtain, through the short video platform, all short videos under the travel category as candidate comment videos, and then perform video shot segmentation on all candidate comment videos related to scenic-spot comment to obtain a plurality of candidate comment fragments. Frame extraction is then performed on each candidate comment fragment to obtain the segment image of each frame in the candidate comment fragment. Face detection is then performed on all segment images, and the pixels of the target face frames are set to zero or replaced with the pixel mean of the pixels outside the target face frames, so that the influence of the target face frames on the segment images is suppressed, realizing de-interference processing of the segment images. Feature extraction is performed on the de-interfered segment images to obtain candidate image features, the feature difference between each candidate feature element in the candidate image features and the feature mean of all candidate feature elements is determined, and the covariance matrix of the candidate image features is generated from these feature differences. After the covariance matrix is constructed, eigenvalue decomposition can be performed on it to obtain the reference feature element, and the attention weight corresponding to each candidate feature element is calculated using the reference feature element. The candidate feature elements are then weighted with their attention weights to obtain attention elements, the attention features of the segment images are constructed from the attention elements, and the candidate feature library is then constructed based on the attention features; that is, feature information about scenic spots and landmarks can be stored in the candidate feature library.
The terminal 101 may be installed with a client of the short video platform, and the terminal 101 is equipped with a camera assembly. After the terminal 101 runs the client of the short video platform, the client interface 1501 is displayed; at this time, triggering the "scan-to-identify" control 1502 in the client interface 1501 may invoke the camera assembly of the terminal 101 to capture the target image, and after the client obtains the target image, it may upload the target image to the server 102.
After receiving the target image, the server 102 may perform face region detection on the target image to obtain a plurality of target face frames; then, the pixel value of each pixel in the target face frames is set to zero, or the pixel mean of the pixels outside the target face frames is determined and the pixel value of each pixel in the target face frames is set to that pixel mean. After the influence of the target face frames in the target image has been suppressed, the target image features can be extracted with a pre-trained depth feature extraction model (such as a CoCa model). After the target image features of the target image are determined, the attention weight corresponding to each target feature element in the target image features is determined, and the target feature elements are weighted according to the attention weights to obtain the attention elements corresponding to the target feature elements; the target attention feature of the target image is constructed from these attention elements. After the target attention feature of the target image is obtained, similarity matching can be performed in the candidate feature library based on the target attention feature, a target comment fragment is determined from the candidate comment fragments according to the matching result, and the retrieval result of the target image is obtained from the target comment fragment. The server 102 then pushes the retrieval result to the corresponding client through the short video platform, so that the retrieval result of the target image can be displayed in the client interface 1501 of the terminal 101; the retrieval result is a comment video or comment fragment matched with the target image.
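Putting the pieces above together, a simplified query-side flow might look as follows. It reuses the hypothetical helpers sketched earlier (suppress_face_frames, attention_weighted_features, select_target_fragments) together with a generic face detector and feature extractor, all of which are illustrative assumptions rather than the concrete modules of this embodiment:

```python
import numpy as np

def retrieve_comment_fragments(target_image, face_detector, feature_extractor,
                               library_features, ranking_threshold=5):
    """target_image: (H, W, C) array uploaded by the client.
    face_detector(image) -> list of [x1, y1, x2, y2] target face frames.
    feature_extractor(image) -> (M, D) target feature elements.
    library_features: (K, D) attention features of candidate comment fragments."""
    # 1. suppress interference from person images
    boxes = face_detector(target_image)
    cleaned = suppress_face_frames(target_image, boxes)

    # 2. extract target feature elements and apply attention weighting
    elements = feature_extractor(cleaned)                     # (M, D)
    attention, _ = attention_weighted_features(elements[None, ...])
    target_attention_feature = attention[0].mean(axis=0)      # pooled target attention feature

    # 3. similarity matching in the candidate feature library
    return select_target_fragments(target_attention_feature, library_features,
                                   ranking_threshold=ranking_threshold)
```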
When the search result includes a plurality of target comment segments, the client interface 1501 of the terminal 101 may display the target comment segment corresponding to the target attention feature with the highest similarity first, and after the target comment segment is played, the client interface 1501 may further display the remaining target comment segments for the user to select for playing. Or the candidate comment video corresponding to the target comment fragment can be displayed for the user to select and play.
As can be seen, when the subject object displayed in the target image is a landmark scenic spot, the server 102 can determine the target comment fragment from the candidate comment fragments related to scenic spots and landmark objects (i.e., the candidate comment fragments under the same "travel" video category), and obtain a scenic-spot comment video or comment fragment matched with the target image based on the target comment fragment. The user can thus obtain a short video introducing the corresponding scenic spot simply by taking a photo of it, which makes it convenient to quickly understand related information about the scenic spot; moreover, an introduction in short-video form is more vivid than an introduction based on text data.
In addition, although the example shown in fig. 15 takes the retrieval of scenic-spot comment videos as an example, the comment video retrieval method provided in the embodiment of the application may be applied to other scenarios.
For example, when the candidate feature library is constructed, the published candidate comment video may be obtained from the video category "science and technology", and at this time, the candidate comment video may be a comment video for an electronic device such as a smart phone, a home appliance, or the like. When the terminal 101 collects the target image, it can collect the image for a certain household appliance, and further upload the target image to the server 102, and the server 102 can obtain the corresponding search result. At this time, the search result may include a target comment fragment for the home appliance, so that the information such as the function and the use method of the home appliance can be quickly known through the target comment fragment, and the related specification or data does not need to be checked.
For another example, when the candidate feature library is constructed, the candidate comment video that has been released may be obtained from the video category "agriculture", and in this case, the candidate comment video may be a comment video for a plant such as a flower or a pot plant. When the terminal 101 collects the target image, it can collect the image for a plant, and further upload the target image to the server 102, and the server 102 can obtain the corresponding search result. At this time, the search result may include a target comment fragment for the plant, so that information such as a growth characteristic and a planting method of the plant may be quickly known through the target comment fragment without viewing related specifications or data.
The following describes in detail an explanatory video retrieval method provided by an embodiment of the present application.
Referring to fig. 16, fig. 16 is an optional overall flowchart of an illustrative video search method according to an embodiment of the present application, wherein the illustrative video search method includes, but is not limited to, the following steps 1601 to 1615:
Step 1601: and acquiring the published candidate short videos from the same video category of the short video platform.
Step 1602: and acquiring video tags of the candidate short video labels, and determining the candidate short videos as candidate comment videos when the video tags indicate to be used for comment of the object to be comment.
Step 1603: and extracting the characteristics of the candidate comment videos to obtain candidate video characteristics of the candidate comment videos.
Step 1604: and splicing the candidate video features with the histogram features of the candidate comment video to obtain spliced video features.
Step 1605: and predicting the boundary frame probability of each frame in the candidate interpretation video according to the characteristics of the spliced video.
Step 1606: and determining frames with the boundary frame probability being greater than or equal to a preset probability threshold as video shot boundary frames, and segmenting the candidate comment video into a plurality of candidate comment fragments according to the video shot boundary frames.
Step 1607: and carrying out face region detection on the fragment images to obtain a plurality of target face frames, and extracting candidate image features of each fragment image after setting the pixel value of each pixel point in the target face frames to zero.
Step 1608: and determining the feature average value of all candidate feature elements in the candidate image features, and determining the feature difference value between the candidate feature elements and the feature average value.
Step 1609: generating a covariance matrix, and decomposing eigenvalues of the covariance matrix to obtain reference characteristics; and extracting a reference characteristic element, and normalizing the product between the transposed difference value and the reference characteristic element to obtain the attention weight corresponding to the candidate characteristic element.
In the step, a covariance matrix of candidate image features is generated according to the feature difference value and the transposition difference value; the transposed difference value is obtained by transposing the characteristic difference value; after constructing a diagonal matrix of the covariance matrix, carrying out eigenvalue decomposition on the covariance matrix based on the diagonal matrix; the reference feature element is the first element in the reference feature.
Step 1610: and weighting the candidate feature elements according to the attention weight to obtain attention elements corresponding to the candidate feature elements.
Step 1611: attention features of the segment images are constructed based on the individual attention elements, and a candidate feature library is constructed based on the attention features.
Step 1612: and receiving the target image uploaded by the client based on the short video platform.
Step 1613: extracting target image characteristics of a target image, performing similarity matching in a candidate feature library based on the target image characteristics, determining a target comment fragment from candidate comment fragments according to a matching result, acquiring an object tag marked by the target comment fragment, determining a corresponding reference comment fragment according to the object tag, and taking the target comment fragment and the reference comment fragment as a retrieval result.
Step 1614: and sending the retrieval result of the target image to the client.
Step 1615: ending the step flow.
According to the comment video retrieval method provided by the embodiment of the application, the published candidate comment videos are acquired from the same video category of the short video platform, then video shot segmentation is carried out on the candidate comment videos to obtain a plurality of candidate comment fragments, fragment images of each frame in the candidate comment fragments are extracted, candidate image features of each fragment image are extracted, and a candidate feature library is constructed based on the candidate image features, namely, the acquired candidate comment videos are subjected to sorting and classification in advance by the short video platform, so that the requirements of data cleaning and sorting for constructing a candidate feature library are reduced, and the construction efficiency of a retrieval database is improved; in addition, the candidate comment video is subjected to video shot segmentation, so that the data granularity in the candidate feature library can be thinned, the subsequent comment video retrieval can be better supported, and the accuracy of the comment video retrieval is further improved; on the basis, the target image uploaded by the client based on the short video platform is received to obtain a search result, which is equivalent to integrating the function of the comment video search into the short video platform, so that candidate comment videos can be conveniently obtained from the short video platform on the one hand, and the functions of the short video platform can be diversified on the other hand.
It will be appreciated that, although the steps in the flowcharts described above are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated in this embodiment, the order of the steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments; these sub-steps or stages are not necessarily executed sequentially, and may be executed in turn or alternately with at least part of the sub-steps or stages of other steps.
Referring to fig. 17, fig. 17 is an optional structural diagram of an illustrative video retrieving apparatus 1700 according to an embodiment of the present application, the illustrative video retrieving apparatus 1700 includes:
a first obtaining module 1701, configured to obtain candidate comment videos that have been published from the same video category of the short video platform;
A first processing module 1702 configured to perform video shot segmentation on a candidate comment video to obtain a plurality of candidate comment fragments, and extract fragment images of each frame in the candidate comment fragments;
a second processing module 1703, configured to extract candidate image features of each segment image, and construct a candidate feature library based on the candidate image features;
The third processing module 1704 is configured to receive a target image uploaded by the client based on the short video platform, extract a target image feature of the target image, perform similarity matching in the candidate feature library based on the target image feature, determine a target comment fragment from the candidate comment fragments according to the matching result, and obtain a search result of the target image according to the target comment fragment.
Further, the second processing module 1703 is further configured to:
detecting the face area of the segment image to obtain a plurality of target face frames;
Setting the pixel value of each pixel point in the target face frame to zero, or determining the pixel average value of the pixel points except the pixel points in the target face frame, and setting the pixel value of each pixel point in the target face frame as the pixel average value.
Further, the second processing module 1703 is further configured to:
scaling the segment images for multiple times to obtain an image pyramid;
Detecting face areas of all images in the image pyramid to obtain a plurality of first candidate face frames, and performing non-maximum suppression on the first candidate face frames to obtain second candidate face frames;
performing binary classification on the second candidate face frames, removing the second candidate face frames without faces according to the classification result, and performing regression calibration and non-maximum suppression on the remaining second candidate face frames to obtain third candidate face frames;
and carrying out regression calibration and non-maximum suppression on the third candidate face frame to obtain a plurality of target face frames.
Further, the second processing module 1703 is further configured to:
Determining attention weights corresponding to the candidate feature elements, and weighting the candidate feature elements according to the attention weights to obtain attention elements corresponding to the candidate feature elements;
attention features of the segment images are constructed based on the individual attention elements, and a candidate feature library is constructed based on the attention features.
Further, the second processing module 1703 is further configured to:
determining the feature average value of all candidate feature elements in the candidate image features;
And determining a characteristic difference value between the candidate characteristic element and the characteristic mean value, and determining the attention weight corresponding to the candidate characteristic element according to the characteristic difference value.
Further, the second processing module 1703 is further configured to:
transposing the characteristic difference value to obtain a transposed difference value, and generating a covariance matrix of the candidate image characteristic according to the characteristic difference value and the transposed difference value;
Constructing a diagonal matrix of the covariance matrix, and performing eigenvalue decomposition on the covariance matrix based on the diagonal matrix to obtain reference characteristics;
And extracting the first reference characteristic element in the reference characteristic, and normalizing the product between the transposed difference value and the reference characteristic element to obtain the attention weight corresponding to the candidate characteristic element.
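Read end to end, this weighting scheme measures how strongly each candidate feature element's deviation from the mean projects onto the principal direction of the covariance matrix, and uses the normalized projections as attention weights. The following minimal numpy sketch follows that reading; treating the candidate image feature as an N x D matrix of feature elements and using a softmax as the normalization are illustrative assumptions:

import numpy as np

def attention_feature(candidate_feature: np.ndarray) -> np.ndarray:
    """Weight candidate feature elements by covariance-based attention and pool them.

    candidate_feature: (N, D) matrix, one row per candidate feature element.
    Returns a (D,) attention feature for the segment image.
    """
    mean = candidate_feature.mean(axis=0)              # feature mean over all elements
    diffs = candidate_feature - mean                   # feature difference values
    # Covariance built from the difference values and their transpose
    # (the 1/(N-1) scale does not change the eigenvectors).
    cov = diffs.T @ diffs / max(len(diffs) - 1, 1)
    # Eigenvalue decomposition; eigh returns eigenvalues in ascending order,
    # so the last column is the principal ("first") reference direction.
    _, eigvecs = np.linalg.eigh(cov)
    reference = eigvecs[:, -1]
    # Project each difference onto the reference direction, then normalize the
    # projections into attention weights (softmax is used here as the
    # normalization -- an assumption, the text only says "normalizing").
    projections = diffs @ reference
    weights = np.exp(projections - projections.max())
    weights /= weights.sum()
    # Weight each candidate feature element and pool into the attention feature.
    return (weights[:, None] * candidate_feature).sum(axis=0)

The attention features produced this way for every segment image are what get stored in the candidate feature library and later matched against the target image feature.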
Further, the first obtaining module 1701 is further configured to:
Acquiring published candidate short videos from the same video category of the short video platform;
and acquiring the video tags with which the candidate short videos are labeled, and determining a candidate short video as a candidate comment video when its video tag indicates that it is used for commenting on the object to be commented on.
Further, the first processing module 1702 is further configured to:
extracting features of the candidate comment videos to obtain candidate video features of the candidate comment videos;
Splicing the candidate video features with the histogram features of the candidate comment video to obtain spliced video features;
predicting the boundary frame probability of each frame in the candidate comment video according to the characteristics of the spliced video;
and determining frames with the boundary frame probability being greater than or equal to a preset probability threshold as video shot boundary frames, and segmenting the candidate comment video into a plurality of candidate comment fragments according to the video shot boundary frames.
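Here splicing means feature concatenation: per-frame features and histogram features are concatenated, a boundary probability is predicted for every frame from the concatenated feature, and frames whose probability reaches the threshold become video shot boundary frames. A minimal sketch of that flow follows; the logistic scoring function, its weights, and the 0.5 threshold are stand-ins for whatever boundary predictor the embodiment actually trains:

import numpy as np

def split_into_segments(frame_feats: np.ndarray,
                        hist_feats: np.ndarray,
                        weights: np.ndarray,
                        bias: float = 0.0,
                        prob_thresh: float = 0.5):
    """Segment a candidate comment video into candidate comment fragments.

    frame_feats: (T, D1) per-frame candidate video features.
    hist_feats:  (T, D2) per-frame histogram features.
    weights:     (D1 + D2,) parameters of a stand-in logistic boundary predictor.
    Returns a list of (start_frame, end_frame) index pairs, end exclusive.
    """
    spliced = np.concatenate([frame_feats, hist_feats], axis=1)   # "spliced video features"
    logits = spliced @ weights + bias
    boundary_prob = 1.0 / (1.0 + np.exp(-logits))                 # per-frame boundary probability
    boundary_frames = np.flatnonzero(boundary_prob >= prob_thresh)

    segments, start = [], 0
    for b in boundary_frames:
        if b > start:                     # close the current fragment at the boundary frame
            segments.append((start, int(b)))
        start = int(b)
    segments.append((start, len(spliced)))
    return segments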
Further, the first processing module 1702 is further configured to:
extracting segment features of candidate comment segments from the candidate video features;
Classifying the candidate comment fragments according to fragment characteristics to obtain object tags of the candidate comment fragments, wherein the object tags are used for indicating sub-objects of the objects contained in the candidate comment fragments;
and marking the corresponding candidate comment fragments by using the object labels.
Further, the third processing module 1704 is further configured to:
obtaining object labels of each candidate comment fragment;
Among the candidate comment fragments other than the target comment fragment, determining a reference comment fragment marked with the same object tag as the target comment fragment;
And taking the target comment fragment and the reference comment fragment as retrieval results of the target image.
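In other words, once a target comment fragment has been found by image matching, any other fragment carrying the same object tag (for example, the same sub-scenic spot) is pulled in as a reference comment fragment so that the retrieval result covers the whole sub-object. A minimal sketch of that expansion in plain Python follows; the dictionary-based representation of fragments and tags is an illustrative assumption:

def expand_with_reference_fragments(target_ids, fragment_tags):
    """Add fragments sharing an object tag with any target comment fragment.

    target_ids: iterable of fragment ids picked by similarity matching.
    fragment_tags: dict mapping fragment id -> object tag (e.g. a sub-scenic-spot name).
    Returns the retrieval result: target fragments first, then reference fragments.
    """
    target_ids = list(target_ids)
    target_tags = {fragment_tags[fid] for fid in target_ids}
    picked = set(target_ids)
    references = [fid for fid, tag in fragment_tags.items()
                  if tag in target_tags and fid not in picked]
    return target_ids + references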
Further, the third processing module 1704 is further configured to:
according to the sequence of the similarity between the target image features and the candidate image features from high to low, sequencing the candidate image features, and determining candidate comment fragments corresponding to the candidate image features ranked in front of a preset ranking threshold as target comment fragments;
Or sorting the candidate image features according to the sequence of the similarity between the target image features and the candidate image features from low to high, and determining the candidate comment fragments corresponding to the candidate image features ranked behind a preset ranking threshold as the target comment fragments.
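Both orderings select the same thing: the candidate comment fragments whose images are most similar to the target image feature. The sketch below ranks by cosine similarity and keeps the top k fragments; cosine similarity and the value of k are illustrative assumptions, since the embodiment only requires some similarity measure and a preset ranking threshold:

import numpy as np

def top_k_fragments(target_feat: np.ndarray,
                    candidate_feats: np.ndarray,
                    fragment_ids,
                    k: int = 5):
    """Return the candidate comment fragments whose images best match the target image.

    target_feat:     (D,) target image feature.
    candidate_feats: (M, D) candidate image features from the candidate feature library.
    fragment_ids:    length-M list mapping each feature to its candidate comment fragment.
    """
    t = target_feat / (np.linalg.norm(target_feat) + 1e-12)
    c = candidate_feats / (np.linalg.norm(candidate_feats, axis=1, keepdims=True) + 1e-12)
    similarity = c @ t                           # cosine similarity per candidate image
    order = np.argsort(-similarity)[:k]          # highest similarity first
    # Deduplicate while preserving rank order: several images may come from one fragment.
    seen, targets = set(), []
    for idx in order:
        fid = fragment_ids[idx]
        if fid not in seen:
            seen.add(fid)
            targets.append(fid)
    return targets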
The comment video retrieval device 1700 described above and the comment video retrieval method are based on the same inventive concept: published candidate comment videos are acquired from the same video category of the short video platform, video shot segmentation is performed on the candidate comment videos to obtain a plurality of candidate comment fragments, fragment images of each frame in the candidate comment fragments are extracted, candidate image features of each fragment image are extracted, and a candidate feature library is constructed based on the candidate image features. Because the candidate comment videos are located under the same video category of the short video platform, that is, the acquired candidate comment videos have already been sorted by the short video platform in advance, the data cleaning and sorting required to construct the candidate feature library are reduced, and the construction efficiency of the retrieval database is improved. In addition, performing video shot segmentation on the candidate comment videos refines the data granularity of the candidate feature library, which better supports subsequent comment video retrieval and further improves the accuracy of the comment video retrieval. On this basis, receiving the target image uploaded by the client based on the short video platform to obtain a retrieval result is equivalent to integrating the comment video retrieval function into the short video platform, which on the one hand makes it convenient to acquire candidate comment videos from the short video platform and on the other hand diversifies the functions of the short video platform.
The electronic device for executing the comment video retrieval method according to the embodiment of the present application may be a terminal. Referring to fig. 18, fig. 18 is a partial block diagram of the terminal according to the embodiment of the present application, where the terminal includes: a camera assembly 1810, a first memory 1820, an input unit 1830, a display unit 1840, a sensor 1850, audio circuitry 1860, a wireless fidelity (Wi-Fi) module 1870, a first processor 1880, a first power supply 1890, and the like. It will be appreciated by those skilled in the art that the terminal structure shown in fig. 18 does not limit the terminal, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The camera assembly 1810 may be used to capture images or video. Optionally, the camera assembly 1810 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, Virtual Reality (VR) shooting, or other fused shooting functions.
The first memory 1820 may be used to store software programs and modules, and the first processor 1880 may execute the software programs and modules stored in the first memory 1820 to perform various functional applications and data processing of the terminal.
The input unit 1830 may be used to receive input numerical or character information and generate key signal inputs related to the setting and function control of the terminal. In particular, input unit 1830 may include touch panel 1818 and other input devices 1832.
The display unit 1840 may be used to display information entered by the user or information provided to the user, as well as the various menus of the terminal. The display unit 1840 may include a display panel 1841.
Audio circuitry 1860, speaker 1861, and microphone 1862 may provide an audio interface.
The first power supply 1890 may be an alternating-current power supply, a direct-current power supply, a disposable battery, or a rechargeable battery.
The number of sensors 1850 may be one or more, the one or more sensors 1850 including, but not limited to: acceleration sensors, gyroscopic sensors, pressure sensors, optical sensors, etc. Wherein:
the acceleration sensor may detect the magnitudes of accelerations on three coordinate axes of a coordinate system established with the terminal. For example, an acceleration sensor may be used to detect the components of gravitational acceleration in three coordinate axes. The first processor 1880 may control the display unit 1840 to display the user interface in a lateral view or a longitudinal view according to the gravitational acceleration signal acquired by the acceleration sensor. The acceleration sensor may also be used for the acquisition of motion data of a game or a user.
The gyroscope sensor can detect the body direction and rotation angle of the terminal, and can cooperate with the acceleration sensor to collect the user's 3D actions on the terminal. The first processor 1880 may implement the following functions based on the data collected by the gyroscope sensor: motion sensing (e.g., changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor may be disposed at a side frame of the terminal and/or at the lower layer of the display unit 1840. When the pressure sensor is disposed at a side frame of the terminal, it can detect the user's grip signal on the terminal, and the first processor 1880 can perform left/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor. When the pressure sensor is disposed at the lower layer of the display unit 1840, the first processor 1880 controls the operability controls on the UI according to the user's pressure operations on the display unit 1840. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor is used to collect the ambient light intensity. In one embodiment, first processor 1880 may control the display brightness of display unit 1840 based on the intensity of ambient light collected by the optical sensor. Specifically, when the intensity of the ambient light is high, the display luminance of the display unit 1840 is turned up; when the ambient light intensity is low, the display brightness of the display unit 1840 is turned down. In another embodiment, the first processor 1880 may also dynamically adjust the shooting parameters of the camera assembly 1810 according to the ambient light intensity collected by the optical sensor.
In this embodiment, the first processor 1880 included in the terminal may perform the comment video retrieval method of the foregoing embodiments.
The electronic device for performing the comment video retrieval method according to the embodiment of the present application may also be a server. Referring to fig. 19, fig. 19 is a partial block diagram of a server according to the embodiment of the present application. The server 1900 may vary considerably in configuration or performance, and may include one or more second processors 1922, a second memory 1932, and one or more storage media 1930 (such as one or more mass storage devices) storing application programs 1942 or data 1944. The second memory 1932 and the storage medium 1930 may be transitory or persistent storage. The program stored on the storage medium 1930 may include one or more modules (not shown), each of which may include a series of command operations on the server 1900. Still further, the second processor 1922 may be configured to communicate with the storage medium 1930 and execute, on the server 1900, the series of command operations in the storage medium 1930.
The server 1900 may also include one or more second power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
A processor in the server 1900 may be used to perform the comment video retrieval method.
The embodiments of the present application also provide a computer-readable storage medium storing a computer program for executing the comment video retrieval method of the foregoing embodiments.
The embodiments of the present application also provide a computer program product comprising a computer program stored on a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device performs the comment video retrieval method described above.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of" and similar expressions mean any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be singular or plural.
It should be understood that in the description of the embodiments of the present application, plural (or multiple) means two or more, and that "greater than", "less than", "exceeding", and the like are understood as not including the number itself, while "or more", "or less", "within", and the like are understood as including the number itself.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It should also be appreciated that the various embodiments provided by the embodiments of the present application may be arbitrarily combined to achieve different technical effects.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the above embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit and scope of the present application, and these equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.

Claims (14)

1. A comment video retrieval method, comprising:
Acquiring published candidate comment videos from the travel category of the short video platform; wherein the candidate comment videos comprise videos for commenting on scenic spots;
Video shot segmentation is carried out on the candidate comment video to obtain a plurality of candidate comment fragments, fragment characteristics of the candidate comment fragments are extracted from candidate video characteristics of the candidate comment fragments, the candidate comment fragments are classified according to the fragment characteristics to obtain object labels of the candidate comment fragments, the object labels are used for indicating sub-scenic spots of scenic spots contained in the candidate comment fragments, the object labels comprise object label information and sub-object label information, or the object labels comprise a plurality of sub-object label information;
extracting segment images of frames in the candidate comment segments;
Extracting candidate image features of each fragment image, and constructing a candidate feature library based on the candidate image features; wherein the candidate image features comprise a plurality of candidate feature elements, the constructing a candidate feature library based on the candidate image features comprises: determining feature average values of all the candidate feature elements in the candidate image features; determining a characteristic difference value between the candidate characteristic element and the characteristic mean value, transposing the characteristic difference value to obtain a transposed difference value, and generating a covariance matrix of the candidate image characteristic according to the characteristic difference value and the transposed difference value; constructing a diagonal matrix of the covariance matrix, and carrying out eigenvalue decomposition on the covariance matrix based on the diagonal matrix to obtain reference characteristics; extracting first reference feature elements in the reference feature, normalizing the product between the transposed difference value and the reference feature elements to obtain attention weights corresponding to the candidate feature elements, and weighting the candidate feature elements according to the attention weights to obtain attention elements corresponding to the candidate feature elements; constructing attention features of the fragment images based on the attention elements, and constructing a candidate feature library based on the attention features;
Receiving a target image uploaded by a client based on the short video platform, extracting target image features of the target image, performing similarity matching in the candidate feature library based on the target image features, determining target comment fragments from the candidate comment fragments according to a matching result, acquiring the object labels of the candidate comment fragments, determining, among the candidate comment fragments other than the target comment fragments, reference comment fragments that are marked with the same object labels as the target comment fragments, and taking the target comment fragments and the reference comment fragments as retrieval results of the target image, wherein being marked with the same object label means that the label information in the object labels is partially consistent, and the target image is captured by invoking a camera assembly and uploaded to a server by the client after a control in a client interface of the client is triggered.
2. The comment video retrieval method according to claim 1, wherein before said extracting candidate image features of each of said segment images, the comment video retrieval method further comprises:
detecting the face area of the segment image to obtain a plurality of target face frames;
Setting the pixel value of each pixel point in the target face frame to zero, or determining the average pixel value of the pixel points outside the target face frame and setting the pixel value of each pixel point in the target face frame to that average value.
3. The comment video retrieval method according to claim 2, wherein the performing face region detection on the segment image to obtain a plurality of target face frames includes:
scaling the fragment image for a plurality of times to obtain an image pyramid;
Detecting face areas of all images in the image pyramid to obtain a plurality of first candidate face frames, and performing non-maximum suppression on the first candidate face frames to obtain second candidate face frames;
Performing binary face/non-face classification on the second candidate face frames, removing the second candidate face frames that contain no face according to the classification results, and performing regression calibration and non-maximum suppression on the remaining second candidate face frames to obtain third candidate face frames;
And carrying out regression calibration and non-maximum suppression on the third candidate face frame to obtain a plurality of target face frames.
4. The comment video retrieval method according to claim 1, wherein the acquiring published candidate comment videos from the travel category of the short video platform comprises:
Acquiring published candidate short videos from the travel category of a short video platform, and acquiring the video tags with which the candidate short videos are labeled;
And when the video tag indicates that the candidate short video is used for commenting on scenic spots, determining the candidate short video as a candidate comment video.
5. The comment video retrieval method according to claim 1, wherein the performing video shot segmentation on the candidate comment video to obtain a plurality of candidate comment fragments includes:
extracting features of the candidate comment video to obtain candidate video features;
splicing the candidate video features with the histogram features of the candidate comment video to obtain spliced video features;
predicting the boundary frame probability of each frame in the candidate comment video according to the spliced video characteristics;
And determining the frames with the boundary frame probability larger than or equal to a preset probability threshold as video shot boundary frames, and segmenting the candidate comment video into a plurality of candidate comment fragments according to the video shot boundary frames.
6. The comment video retrieval method according to claim 1, wherein the determining a target comment fragment from the candidate comment fragments according to the matching result includes:
Sorting the candidate image features according to the sequence of the similarity between the target image features and the candidate image features from high to low, and determining the candidate comment fragments corresponding to the candidate image features ranked in front of a preset ranking threshold as target comment fragments;
Or sorting the candidate image features according to the sequence of the similarity between the target image features and the candidate image features from low to high, and determining the candidate comment fragments corresponding to the candidate image features ranked behind a preset ranking threshold as target comment fragments.
7. A comment video retrieval apparatus, comprising:
The first acquisition module is used for acquiring published candidate comment videos from the travel category of the short video platform; wherein the candidate comment videos comprise videos for commenting on scenic spots;
A first processing module, configured to perform video shot segmentation on the candidate comment videos to obtain a plurality of candidate comment segments, extract segment features of the candidate comment segments from candidate video features of the candidate comment segments, classify the candidate comment segments according to the segment features to obtain object labels of the candidate comment segments, mark the corresponding candidate comment segments with the object labels, and extract segment images of frames in the candidate comment segments, where the object labels are used to indicate sub-scenic spots of the scenic spots included in the candidate comment segments, and the object labels include object label information and sub-object label information, or the object labels include a plurality of sub-object label information;
The second processing module is used for extracting candidate image features of each fragment image and constructing a candidate feature library based on the candidate image features; wherein the candidate image features comprise a plurality of candidate feature elements, the second processing module is further configured to: determining feature average values of all the candidate feature elements in the candidate image features; determining a characteristic difference value between the candidate characteristic element and the characteristic mean value, transposing the characteristic difference value to obtain a transposed difference value, and generating a covariance matrix of the candidate image characteristic according to the characteristic difference value and the transposed difference value; constructing a diagonal matrix of the covariance matrix, and carrying out eigenvalue decomposition on the covariance matrix based on the diagonal matrix to obtain reference characteristics; extracting first reference feature elements in the reference feature, normalizing the product between the transposed difference value and the reference feature elements to obtain attention weights corresponding to the candidate feature elements, and weighting the candidate feature elements according to the attention weights to obtain attention elements corresponding to the candidate feature elements; constructing attention features of the fragment images based on the attention elements, and constructing a candidate feature library based on the attention features;
And a third processing module, configured to receive a target image uploaded by a client based on the short video platform, extract target image features of the target image, perform similarity matching in the candidate feature library based on the target image features, determine target comment segments from the candidate comment segments according to a matching result, acquire the object tags of the candidate comment segments, determine, from the candidate comment segments other than the target comment segments, a reference comment segment that is marked with the same object tag as the target comment segment, and take the target comment segment and the reference comment segment as a retrieval result of the target image, where being marked with the same object tag means that the tag information in the object tags is partially consistent, and the target image is captured by the client invoking a camera assembly and uploaded to a server after a control in the client interface of the client is triggered.
8. The comment video retrieval apparatus according to claim 7, wherein the second processing module is further configured to:
detecting the face area of the segment image to obtain a plurality of target face frames;
Setting the pixel value of each pixel point in the target face frame to zero, or determining the average pixel value of the pixel points outside the target face frame and setting the pixel value of each pixel point in the target face frame to that average value.
9. The comment video retrieval apparatus according to claim 8, wherein the second processing module is further configured to:
scaling the fragment image for a plurality of times to obtain an image pyramid;
Detecting face areas of all images in the image pyramid to obtain a plurality of first candidate face frames, and performing non-maximum suppression on the first candidate face frames to obtain second candidate face frames;
Performing binary face/non-face classification on the second candidate face frames, removing the second candidate face frames that contain no face according to the classification results, and performing regression calibration and non-maximum suppression on the remaining second candidate face frames to obtain third candidate face frames;
And carrying out regression calibration and non-maximum suppression on the third candidate face frame to obtain a plurality of target face frames.
10. The comment video retrieval apparatus according to claim 7, wherein the first acquisition module is further configured to:
Acquiring published candidate short videos from the travel category of a short video platform, and acquiring the video tags with which the candidate short videos are labeled;
When the video tag indicates that the candidate short video is used for commenting on scenic spots, determining the candidate short video as a candidate comment video.
11. The comment video retrieval apparatus according to claim 7, wherein the first processing module is further configured to:
extracting features of the candidate comment video to obtain candidate video features;
splicing the candidate video features with the histogram features of the candidate comment video to obtain spliced video features;
predicting the boundary frame probability of each frame in the candidate comment video according to the spliced video characteristics;
And determining the frames with the boundary frame probability larger than or equal to a preset probability threshold as video shot boundary frames, and segmenting the candidate comment video into a plurality of candidate comment fragments according to the video shot boundary frames.
12. The comment video retrieval apparatus according to claim 7, wherein the third processing module is further configured to:
Sorting the candidate image features according to the sequence of the similarity between the target image features and the candidate image features from high to low, and determining the candidate comment fragments corresponding to the candidate image features ranked in front of a preset ranking threshold as target comment fragments;
Or sorting the candidate image features according to the sequence of the similarity between the target image features and the candidate image features from low to high, and determining the candidate comment fragments corresponding to the candidate image features ranked behind a preset ranking threshold as target comment fragments.
13. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the comment video retrieval method of any one of claims 1 to 6 when executing the computer program.
14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the comment video retrieval method of any one of claims 1 to 6.
CN202311026762.4A 2023-08-15 2023-08-15 Method, device, electronic equipment and storage medium for retrieving comment video Active CN116775938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311026762.4A CN116775938B (en) 2023-08-15 2023-08-15 Method, device, electronic equipment and storage medium for retrieving comment video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311026762.4A CN116775938B (en) 2023-08-15 2023-08-15 Method, device, electronic equipment and storage medium for retrieving comment video

Publications (2)

Publication Number Publication Date
CN116775938A CN116775938A (en) 2023-09-19
CN116775938B true CN116775938B (en) 2024-05-17

Family

ID=88011836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311026762.4A Active CN116775938B (en) 2023-08-15 2023-08-15 Method, device, electronic equipment and storage medium for retrieving comment video

Country Status (1)

Country Link
CN (1) CN116775938B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140129994A (en) * 2013-04-29 2014-11-07 중앙대학교 산학협력단 Apparatus and method for texture transfer for video animation
JP2017220238A (en) * 2016-06-07 2017-12-14 株式会社Nttドコモ Method and device for providing answer in question answering system
CN109145854A (en) * 2018-08-31 2019-01-04 东南大学 A kind of method for detecting human face based on concatenated convolutional neural network structure
CN110175953A (en) * 2019-05-24 2019-08-27 鹏城实验室 A kind of image super-resolution method and system
CN111209431A (en) * 2020-01-13 2020-05-29 上海极链网络科技有限公司 Video searching method, device, equipment and medium
CN113779303A (en) * 2021-11-12 2021-12-10 腾讯科技(深圳)有限公司 Video set indexing method and device, storage medium and electronic equipment
CN113806588A (en) * 2021-09-22 2021-12-17 北京百度网讯科技有限公司 Method and device for searching video
CN113850829A (en) * 2021-09-28 2021-12-28 深圳万兴软件有限公司 Video shot segmentation method and device based on efficient deep network and related components
CN115616546A (en) * 2022-10-28 2023-01-17 西北工业大学 Spatial anti-aliasing orientation estimation method and system based on frequency difference

Also Published As

Publication number Publication date
CN116775938A (en) 2023-09-19

Similar Documents

Publication Publication Date Title
US20230063920A1 (en) Content navigation with automated curation
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
CN110674350B (en) Video character retrieval method, medium, device and computing equipment
RU2533441C2 (en) Method and apparatus for facilitating content-based image search
CN109657533A (en) Pedestrian recognition methods and Related product again
CN111491187B (en) Video recommendation method, device, equipment and storage medium
CN112101329B (en) Video-based text recognition method, model training method and model training device
WO2013086257A1 (en) Clustering objects detected in video
WO2010053513A1 (en) Event recognition using image and location information
CN111368101B (en) Multimedia resource information display method, device, equipment and storage medium
CN113766299B (en) Video data playing method, device, equipment and medium
CN112188306B (en) Label generation method, device, equipment and storage medium
CN113395542A (en) Video generation method and device based on artificial intelligence, computer equipment and medium
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN114332679A (en) Video processing method, device, equipment, storage medium and computer program product
CN113255625A (en) Video detection method and device, electronic equipment and storage medium
US11106949B2 (en) Action classification based on manipulated object movement
CN112052352B (en) Video ordering method, device, server and storage medium
CN113569613A (en) Image processing method, image processing apparatus, image processing device, and storage medium
Othman et al. Challenges and Limitations in Human Action Recognition on Unmanned Aerial Vehicles: A Comprehensive Survey.
CN116775938B (en) Method, device, electronic equipment and storage medium for retrieving comment video
CN114827702B (en) Video pushing method, video playing method, device, equipment and medium
CN115393755A (en) Visual target tracking method, device, equipment and storage medium
KR20150101846A (en) Image classification service system based on a sketch user equipment, service equipment, service method based on sketch and computer readable medium having computer program recorded therefor
CN114842488A (en) Image title text determination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40094474

Country of ref document: HK

GR01 Patent grant