CN112115299A - Video searching method and device, recommendation method, electronic device and storage medium


Info

Publication number
CN112115299A
Authority
CN
China
Prior art keywords
video
information
search
key frame
label
Legal status
Pending
Application number
CN202010979533.4A
Other languages
Chinese (zh)
Inventor
冯博豪
庞敏辉
谢国斌
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010979533.4A
Publication of CN112115299A

Classifications

    • G06F 16/735: Information retrieval of video data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F 16/75: Information retrieval of video data; clustering; classification
    • G06F 16/7844: Information retrieval of video data; retrieval characterised by metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F 18/22: Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F 18/23: Pattern recognition; analysing; clustering techniques
    • G06F 18/24: Pattern recognition; analysing; classification techniques
    • G06F 40/211: Handling natural language data; natural language analysis; parsing; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06N 3/02: Computing arrangements based on biological models; neural networks

Abstract

Embodiments of the present application disclose a video search method and device, a video recommendation method, an electronic device, and a storage medium, relating to computer vision and video analysis technologies. The method includes: obtaining a search request carrying search information, where the search request is used to request a search for a target video corresponding to the search information; if the search information includes text information, determining a first similarity between the text information and the preset tag information of each video, where the tag information includes a text tag and a video frame tag; and selecting and outputting the target video from the videos according to the first similarity. Because combining the video frame tag describes a video in more detail and more accurately, the accuracy of the similarity calculation can be improved, which improves the accuracy and reliability of video search and the user's search experience.

Description

Video searching method and device, recommendation method, electronic device and storage medium
Technical Field
The present application relates to computer vision, image technology, and video analysis technology within artificial intelligence and computing technology, and in particular to a video search method and device, a recommendation method, an electronic device, and a storage medium.
Background
With the development of the internet, video applications have become more and more popular. Short-video applications in particular bring speed and convenience to people. However, as the number of videos grows, how to improve the accuracy of search has become an urgent problem to be solved.
In the prior art, when a user uploads a video to the server of a video application, the user can annotate the uploaded video, and staff of the video application can also annotate videos uploaded by users, generating a text tag corresponding to each video.
Disclosure of Invention
A video search method, a video search device, a video recommendation method, an electronic device, and a storage medium are provided for improving the accuracy of video search.
According to a first aspect, a video search method is provided, including: obtaining a search request carrying search information, where the search request is used to request a search for a target video corresponding to the search information;
if the search information includes text information, determining a first similarity between the text information and the preset tag information of each video, where the tag information includes a text tag and a video frame tag;
and selecting and outputting the target video from the videos according to the first similarity.
In this embodiment, computing the similarity between the text information and tag information that includes both a text tag and a video frame tag improves the accuracy and reliability of the search results.
According to a second aspect, an embodiment of the present application provides a video search apparatus, including:
an acquisition module, configured to obtain a search request carrying search information, where the search request is used to request a search for a target video corresponding to the search information;
a first determining module, configured to, if the search information includes text information, determine a first similarity between the text information and the preset tag information of each video, where the tag information includes a text tag and a video frame tag;
a selecting module, configured to select the target video from the videos according to the first similarity;
and an output module, configured to output the target video.
According to a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor, where
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as in any one of the embodiments above.
According to a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method according to any one of the above embodiments.
According to a fifth aspect, an embodiment of the present application provides a video recommendation method, including:
acquiring a history record of videos accessed by a user;
determining text information corresponding to the history record;
determining a third similarity between the text information corresponding to the history record and the preset tag information of each video, where the tag information includes a text tag and a video frame tag;
and selecting, from the videos according to the third similarity, videos to recommend to the user.
The present application provides a video search method and device, a video recommendation method, an electronic device, and a storage medium. The video search method includes: obtaining a search request carrying search information, where the search request is used to request a search for a target video corresponding to the search information; if the search information includes text information, determining a first similarity between the text information and the preset tag information of each video, where the tag information includes a text tag and a video frame tag; and selecting and outputting the target video from the videos according to the first similarity. In this embodiment, the tag information includes content of two dimensions: one dimension is the text tag, the other is the video frame tag. Compared with related-art schemes that determine the target video based only on a text tag, the embodiments of the present application combine the video frame tag to describe the video in more detail and more accurately, which improves the accuracy of the similarity calculation and thereby the accuracy and reliability of video search. In particular, when the content of the text tag differs greatly from the content of the video, searching with tag information that includes a video frame tag can still accurately determine the target video corresponding to the user's search intention, improving the accuracy and reliability of video search and the user's search experience.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic view of an application scenario of a video search method according to an embodiment of the present application;
FIG. 2 is a schematic view of a video search interface according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a video search method according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a video search method according to another embodiment of the present application;
FIG. 5 is a diagram of a video frame image according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a video search method according to another embodiment of the present application;
FIG. 7 is a diagram illustrating an exemplary video search apparatus according to the present application;
FIG. 8 is a diagram illustrating an apparatus for searching video according to another embodiment of the present application;
FIG. 9 is a block diagram of an electronic device of an embodiment of the application;
fig. 10 is a flowchart illustrating a video recommendation method according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below with reference to the accompanying drawings, in which various details of the embodiments of the application are included to assist understanding, and which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the embodiments of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a video search method according to an embodiment of the present application.
In the application scenario shown in fig. 1, the user equipment 100 has installed on it an application program with a video playing function, such as Douyin, Kuaishou, or Weishi. Through this application installed on the user equipment 100, the user sends to the server 200 a search request for requesting a search for a target video corresponding to the search information.
A schematic diagram of the search interface of the user device 100 can be seen in fig. 2.
As shown in fig. 2, the user may enter, in an input box (e.g., the "input search content" box shown in fig. 2), information about the video the user wishes to find (which may be referred to as the target video). This information characterizes the user's search intention, such as the time the target video was uploaded to the server 200, the uploader of the target video, the content of the target video, and so on.
When the user clicks "search" as shown in fig. 2, the user equipment 100 is triggered to send a search request to the server.
After receiving the search request, the server 200 may obtain, from a database, the target video corresponding to the search information and transmit it to the user equipment 100, which displays the target video.
The user equipment 100 is a device on which various applications can be installed and which can display the objects provided by those applications. It may be mobile or fixed, for example a mobile phone, a tablet computer, a wearable device, an in-vehicle device, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a device capable of short video recommendation, or another electronic device that can implement the above functions.
The server 200 may be any device capable of providing internet services.
The user equipment 100 and the server 200 are communicatively connected through a network, which may be a local area network, a wide area network, and the like.
It should be noted that the above embodiments are only used for exemplarily illustrating application scenarios to which the video search method of the present embodiment can be applied, and are not to be construed as a limitation on the application scenarios.
Based on the above description of the application scenario, the user equipment may generate a search request carrying search information based on the information about the target video input by the user, and transmit the search request to the server through the communication network. In the related art, the search information is generally text information, such as text entered by the user in the input box of fig. 2; the server matches the text information against preset text tags and returns the target video according to the matching result.
However, the text information may deviate from the content of the video, resulting in low accuracy of the matching result and a degraded search experience for the user.
Through creative work, the inventors arrived at the inventive concept of the present application: match the text information against tag information that includes both a text tag and a video frame tag, and determine the target video based on the matching result.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
According to one aspect of the embodiments of the present application, a video search method is provided. The method is applied to artificial intelligence and computer technology, and in particular to computer vision, image technology, and video analysis technology, so as to improve the reliability and accuracy of video search.
Referring to fig. 3, fig. 3 is a flowchart illustrating a video search method according to an embodiment of the present application.
As shown in fig. 3, the method includes:
s101: and acquiring a search request carrying search information, wherein the search request is used for requesting to search a target video corresponding to the search information.
The execution body in this embodiment may be a video search device, which may specifically be a server (including a cloud server or a local server); for a description of the server, reference may be made to the above example, which is not repeated here.
In connection with the application scenario shown in fig. 1, the search request may be generated by the user equipment based on the search information input by the user (i.e., the user's search intention).
S102: and if the search information comprises text information, determining first similarity between the text information and preset label information of each video, wherein the label information comprises a text label and a video frame label.
The text information may be understood as information related to text or speech. For example, the text information may be the word "food" entered by the user in the input box shown in fig. 2. As another example, a voice component may be provided on the user equipment; the user may input voice information through the voice component, and the text information is then the text generated by the user equipment by parsing that voice information.
The tag information may be understood as description information of each video; that is, based on the tag information, the related content of each video can be determined. For example, if the tag information is "food", the corresponding video should be a food-related video, such as a cooking video.
Here, the "first" in the first similarity is used to distinguish from the second similarity hereinafter, and is not to be construed as a limitation on the content of the similarity.
In this embodiment, the first similarity may be understood as a degree of similarity between the text information and the tag information.
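The patent does not commit to a specific similarity measure, so the following is only an illustrative sketch: it computes the first similarity as the best cosine match between the search text and a video's tags, with a deliberately crude placeholder encoder standing in for whatever text representation a real system would use. All function names and defaults here are assumptions.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Placeholder encoder (hashed bag of character bigrams); a real system
    # would use a learned sentence-embedding model instead.
    v = np.zeros(dim)
    for i in range(len(text) - 1):
        v[hash(text[i:i + 2]) % dim] += 1.0
    return v

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def first_similarity(query: str, tag_info: list[str]) -> float:
    # tag_info holds one video's tag information: its text tags
    # plus its video frame tags.
    q = embed(query)
    return max(cosine(q, embed(tag)) for tag in tag_info)
```

For example, first_similarity("cooking", ["food", "the puppy played with a ball on the grass"]) scores the query against both kinds of tags and keeps the best match.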
It should be noted that, in this embodiment, the tag information may include two dimensions of content, where the content in one dimension is a text tag, and the content in the other dimension is a video frame tag.
The text tag is described as follows:
In one possible implementation, when a user uploads a video to the server, the user can annotate the video's name, approximate content, type (such as story, comedy, music, and so on), and the like, and the server can generate the text tag of the video from the user's annotation information.
In another possible implementation, the server may preset some tags (such as video types); when a user uploads a video, the user may select from the preset tags and also annotate the video, and the server generates the text tag from the tags the user selected and the user's annotations.
That is, the server generates, based on the user's textual description of the uploaded video and/or the selected server-preset textual description, a tag describing the relevant information of the video, which is called the text tag.
The video frame tag is described as follows:
A video comprises multiple frames of images, and the video frame tag characterizes those frames; that is, a video frame tag is a tag, determined from the frames of the video, that is used to describe the video.
It should be noted that, in this embodiment, since the tag information includes both the text tag and the video frame tag, the tag information is richer and describes the video more clearly and in more detail. When the similarity is calculated based on this tag information and the target video is determined accordingly, the accuracy and reliability of the similarity calculation can be improved, thereby improving the accuracy of video search.
S103: and selecting and outputting the target video from the videos according to the first similarity.
In one possible implementation, a video with high similarity may be selected from the videos as the target video based on the first similarity. For example, the similarities greater than a preset threshold may be determined, and the videos corresponding to those similarities determined as target videos; alternatively, the n largest of the first similarities may be selected, and the corresponding videos determined as target videos. A sketch of both strategies follows.
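A minimal sketch of the two selection strategies, with assumed function and parameter names (the patent specifies neither the threshold value nor n):

```python
def select_targets(similarities: dict[str, float],
                   threshold: float | None = None,
                   top_n: int = 10) -> list[str]:
    # Rank video ids by their first similarity, then apply either strategy.
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    if threshold is not None:
        return [v for v in ranked if similarities[v] > threshold]
    return ranked[:top_n]
```

For example, select_targets(sims, threshold=0.8) implements the threshold variant, while select_targets(sims, top_n=5) keeps the five most similar videos.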
Based on the above analysis, an embodiment of the present application provides a video search method, including: obtaining a search request carrying search information, where the search request is used to request a search for a target video corresponding to the search information; if the search information includes text information, determining a first similarity between the text information and the preset tag information of each video, where the tag information includes a text tag and a video frame tag; and selecting and outputting the target video from the videos according to the first similarity. In this embodiment, the tag information includes content of two dimensions: the text tag and the video frame tag. Compared with related-art schemes that determine the target video based only on a text tag, combining the video frame tag describes the video in more detail and more accurately, which improves the accuracy of the similarity calculation and thereby the accuracy and reliability of video search. In particular, when the content of the text tag differs greatly from the content of the video, searching with tag information that includes a video frame tag can still accurately determine the target video corresponding to the user's search intention, improving the accuracy and reliability of the search and the user's search experience.
The video search method of the embodiment of the present application will now be described in more detail with reference to the principle of determining video frame tags.
Specifically, referring to fig. 4, fig. 4 is a schematic flowchart of a video search method according to another embodiment of the present application.
As shown in fig. 4, the method includes:
s201: and slicing any video to obtain a slice set corresponding to any video.
It should be understood that the server may receive videos uploaded by users, and each user may upload multiple videos, so the server stores multiple videos; the server may slice any one of the videos it has received.
In some embodiments, the videos stored in the server have been filtered based on their video information and audio information.
For example, when the server receives a video uploaded by a user, it can review the video along the dimension of its video information and the dimension of its audio information, and store the video only if the review passes.
It is worth noting that reviewing videos along the two dimensions of video information and audio information avoids the low review accuracy and the missed reviews caused in the related art by reviewing along the single dimension of video information, thereby achieving reliable and accurate video review.
For example, a video classification algorithm (LPCG) may be deployed in the server; when the server receives a video uploaded by a user, it may invoke the video classification algorithm to classify the video, thereby implementing the review.
Based on the above analysis, a slice can be understood as a cut of a video into a plurality of frames of images. Accordingly, a slice set may be understood to include a set of multiple frame images.
In some embodiments, S201 may include:
s2011: and slicing any video by taking time as a slicing unit to obtain an image corresponding to any video.
That is, the video may be sliced based on time; for example, with 0.1 second as the slice unit, the video is sliced to obtain the images corresponding to it.
It is worth noting that videos are generally color videos; in one possible implementation, the video may be converted to grayscale before slicing.
Converting the video to grayscale avoids color interference and improves the reliability and accuracy of slicing, subsequent clustering, and so on. A sketch of this step follows.
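A minimal sketch of time-based slicing plus the grayscale step, assuming OpenCV is available; the 0.1-second slice unit comes from the example above, and the remaining names are illustrative:

```python
import cv2

def slice_video(path: str, step_seconds: float = 0.1) -> list:
    # Read the video, convert each kept frame to grayscale, and sample
    # one frame per `step_seconds` of footage.
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if fps is unknown
    step = max(1, round(fps * step_seconds))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        idx += 1
    cap.release()
    return frames
```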
S2012: the image is sliced in units of slices of the object, and a slice set is obtained.
The object may be a living human or animal, or a building, and the embodiment is not limited.
That is to say, in this embodiment, the video may be sliced based on time first, and on this basis, the video may be sliced based on the object, and by slicing the video based on two layers of time and the object, which is equivalent to slicing the video from a large range first, and then slicing the video again from a small focus point, the technical effects of accuracy and reliability of slicing may be improved, so that the video frame tag generated based on the slice set has higher reliability.
S202: and clustering the slice set to obtain a target key frame image of any video.
In this embodiment, the slice set is processed by clustering, so that images with high similarity in the slice set are grouped into the same category. This avoids repeatedly annotating the same images and reduces the impact of image noise on the accuracy of the video frame tags, making it possible to determine video frame tags that describe the video closely.
In some embodiments, S202 may include:
s2021: and clustering the slice set by taking the object as a category unit.
That is, in the clustering process, the images in the slice set may be classified on an object basis, such as by person, animal, building, and so on.
In one possible implementation, temporally discontinuous images in the slice set are assigned to different categories, which avoids omission of video frame tags and improves the comprehensiveness and integrity of the video frame tags. A sketch of this grouping follows.
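A toy sketch of this grouping, assuming a hypothetical object detector has already labeled every frame; frames fall into the same category only when they show the same objects and are temporally contiguous:

```python
def cluster_by_object(frame_objects: list[tuple[int, frozenset]]) -> list[list[int]]:
    # frame_objects: (frame_index, detected_objects) pairs in time order.
    clusters: list[list[int]] = []
    current: list[int] = []
    prev_idx, prev_objs = None, None
    for idx, objs in frame_objects:
        contiguous = prev_idx is not None and idx == prev_idx + 1
        # A new category starts when the objects change or time is discontinuous.
        if current and (objs != prev_objs or not contiguous):
            clusters.append(current)
            current = []
        current.append(idx)
        prev_idx, prev_objs = idx, objs
    if current:
        clusters.append(current)
    return clusters
```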
S2022: and determining the image with the largest image information entropy in any category as a candidate key frame image for any category in the clustering results.
The image information entropy can be used to characterize the information content of an image: the larger the entropy, the larger the information content of the image; conversely, the smaller the entropy, the smaller the information content.
In this embodiment, the image with the largest image information entropy is determined as the candidate key frame image. Since this is the image with the largest information content, the candidate key frame image is representative of its category, so the video frame tag derived from it is strongly representative of the video. A sketch of this selection follows.
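A minimal sketch of this selection, assuming 8-bit grayscale frames and reading "image information entropy" as the Shannon entropy of the intensity histogram (one common interpretation; the patent does not define the formula):

```python
import numpy as np

def image_entropy(gray: np.ndarray) -> float:
    # Shannon entropy of the intensity histogram of an 8-bit grayscale image.
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def candidate_key_frame(category: list[np.ndarray]) -> np.ndarray:
    # The candidate key frame of a category is its highest-entropy image.
    return max(category, key=image_entropy)
```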
S2023: and carrying out redundancy processing on the candidate key frame images to obtain target key frame images.
To reduce the amount of labeling, after the candidate key frame images are obtained, the candidate key frame images may be subjected to redundancy processing.
In some embodiments, S2023 may comprise:
s20231: and determining edge histograms corresponding to any two candidate key frame images adjacent to the time aiming at any two candidate key frame images adjacent to the time.
The edge histogram embodies the edge and texture features of an image. Determining the edge histogram of a candidate key frame image may include: applying an edge operator to the candidate key frame image; calculating the edge direction of each pixel of the candidate key frame image; quantizing the edge directions to obtain edge direction values; and performing histogram statistics and normalization on the edge direction values to obtain the edge histogram of the candidate key frame image.
S20232: difference information between the respective corresponding edge histograms is determined.
S20233: and carrying out redundancy processing on the candidate key frame images based on the difference information to obtain target key frame images.
The difference information may be a difference value. For example, if candidate key frame image A and candidate key frame image B are temporally adjacent, the difference information may be the difference between the normalized values of the edge histogram of image A and the edge histogram of image B.
Correspondingly, the difference value can be compared with a preset threshold. If the difference value is greater than or equal to the threshold, the difference between candidate key frame images A and B is large, and to ensure the integrity and comprehensiveness of the video frame tag, both are determined as target key frame images. If the difference value is smaller than the threshold, the difference between images A and B is small; to reduce the amount of annotation and save the server's storage space, either image A or image B may be determined as the target key frame image, or the one of the two with the larger image information entropy may be chosen. A sketch of S20231 to S20233 follows.
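The sketch below follows these steps with assumed specifics: a Sobel operator as the edge operator, eight direction bins, an L1 distance between normalized histograms, and an illustrative threshold of 0.2; it reuses image_entropy from the earlier sketch to keep the more informative of two similar images.

```python
import cv2
import numpy as np

def edge_histogram(gray: np.ndarray, bins: int = 8) -> np.ndarray:
    # Edge operator -> per-pixel edge direction -> quantized, normalized histogram.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    angles = np.arctan2(gy, gx)
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)

def prune_redundant(candidates: list, threshold: float = 0.2) -> list:
    # Compare temporally adjacent candidates; below the threshold, keep only
    # the image with the larger information entropy.
    if not candidates:
        return []
    kept = [candidates[0]]
    for img in candidates[1:]:
        diff = float(np.abs(edge_histogram(kept[-1]) - edge_histogram(img)).sum())
        if diff >= threshold:
            kept.append(img)
        elif image_entropy(img) > image_entropy(kept[-1]):
            kept[-1] = img
    return kept
```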
S203: and generating a video frame label according to the target key frame image.
Based on the above analysis, the target key frame images are the images that remain after redundant images are removed while images with larger information content are retained; therefore, the video frame tag generated from the target key frame images has high integrity, accuracy, and reliability.
In some embodiments, S203 may include:
s2031: description information for describing the target key frame image is generated.
The description information represents the content of the target key frame image.
In some embodiments, S2031 may comprise:
s20311: a phrase describing an object in the target key frame image is determined.
S20312: and connecting the phrases based on preset connecting words to obtain the description information.
For example, semantic analysis and syntactic analysis may be performed on the target keyframe image by the server, resulting in descriptive information.
Specifically, semantic analysis and syntactic analysis may be performed on the target key frame image based on a preset network model, such as a Neural Composite Portion for Image Capture (NCPIC) model, to obtain the description information.
Generation of the description information is now illustrated with reference to fig. 5:
The server may identify the objects in the target key frame image and form phrases describing them: "puppy", "rug", and "grass". The server may be preconfigured with a corpus for storing connecting words. Having determined the phrases describing the objects, the server may select connecting words from the corpus so that the phrases and connecting words form reasonable sentences. If there are multiple sentences, each may be semantically compared with the target key frame image to obtain the final sentence (or sentences), for example: "the puppy played with a ball on the grass".
It should be noted that, in this embodiment, the phrases are formed first and then combined with connecting words to obtain the description information, so that the description information describes the target key frame image in an appropriate way, achieving a reliable and accurate video frame tag. A toy sketch of this mechanism follows.
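A brute-force toy sketch of the phrase-and-connector mechanism; the connector list is an assumed sample corpus, and semantic_score is a hypothetical stand-in for the semantic comparison against the key frame image (the patent names a network model but not its interface):

```python
from itertools import permutations

CONNECTORS = ["on", "with", "near"]  # assumed sample corpus of connecting words

def best_description(phrases: list[str], image, semantic_score) -> str:
    # Combine the phrases with connecting words into candidate sentences,
    # then keep the candidate that scores highest against the key frame image.
    candidates = []
    for order in permutations(phrases):
        for conn in CONNECTORS:
            rest = " and ".join(order[1:])
            if rest:
                candidates.append(f"{order[0]} {conn} {rest}")
    return max(candidates, key=lambda s: semantic_score(s, image))
```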
S2032: and generating a video frame label according to the description information.
In this embodiment, the description information describes the target key frame image, so the video frame tag generated from the description information can describe the video more comprehensively, completely, and accurately, improving the reliability and accuracy of the similarity calculation and thereby the reliability and accuracy of video search.
In this step, the video frame tag may be generated directly from the description information; in connection with the above embodiment, "the puppy played with a ball on the grass" may be determined as the video frame tag. The description information can also be expanded and synonyms substituted to enrich the video frame tag, for example expanding "the puppy played with a ball on the grass" to "the dog played with a ball on the grass", and so on.
In some embodiments, the target key frame images include multiple frames, and the description information includes the description information corresponding to each of those frames; S2032 then includes:
s20321: and determining the occurrence times of the description information corresponding to each of the plurality of frame images.
S20322: and selecting the video frame tags from the description information respectively corresponding to the multiple frames of images based on preset selection parameters and occurrence times.
The selection parameter may be a percentage or a threshold; the percentage and the selection threshold may be set by the server based on requirements, history, testing, and the like, which is not limited in this embodiment.
For example, the description information whose number of occurrences falls in the top 5% may be selected to determine the video frame tags; alternatively, the description information whose number of occurrences is greater than a preset count threshold (i.e., the selection threshold) may be selected.
In some embodiments, the number of occurrences of each piece of description information may be counted and ranked based on a label ranking (EmbedRank) algorithm, and the video frame tags determined based on the percentage and the numbers of occurrences.
In this embodiment, selecting the video frame tags by the selection parameter and the numbers of occurrences reduces the storage space occupied by the video frame tags and the computing resources consumed by the similarity calculation. Both strategies are sketched below.
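Both selection strategies fit in a short sketch; the top-5% default mirrors the example above, and the remaining names and defaults are illustrative:

```python
from collections import Counter

def select_frame_tags(descriptions: list[str],
                      top_fraction: float = 0.05,
                      min_count: int | None = None) -> list[str]:
    # Count how often each description occurs across the key frames.
    counts = Counter(descriptions)
    ranked = [d for d, _ in counts.most_common()]
    if min_count is not None:          # threshold strategy
        return [d for d in ranked if counts[d] > min_count]
    k = max(1, int(len(ranked) * top_fraction))
    return ranked[:k]                  # top-percentage strategy
```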
S204: and acquiring a search request carrying search information, wherein the search request is used for requesting to search a target video corresponding to the search information.
For the description of S204, reference may be made to S101, which is not described herein again.
S205: and if the search information comprises text information, determining first similarity between the text information and preset label information of each video, wherein the label information comprises a text label and a video frame label.
For the description of S205, reference may be made to S102, which is not described herein again.
In some embodiments, the server presets a database for storing the tag information and determines the first similarity between the text information and the tag information in the database.
It should be noted that pre-constructing a database for storing the tag information allows the tag information to be stored uniformly, which makes the similarity calculation convenient and improves the efficiency of determining the first similarity.
In this embodiment, the text tag may be obtained by the server expanding an original text tag (for example, text information provided when the user uploads a video, or a text tag annotated by the server or its staff for a video uploaded by the user).
For example, the server may perform semantic analysis on the original text tags to obtain similar tags that supplement and refine them, rank the supplemented original text tags (for example, based on a label ranking (EmbedRank) algorithm), and screen the ranked tags through a maximal marginal relevance (MMR) model to obtain the text tags. A sketch of the MMR screening follows.
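A compact sketch of the MMR screening step; relevance and similarity are assumed scoring callables, and lam and k are illustrative parameters trading relevance off against redundancy among the tags already selected:

```python
def mmr_select(tags: list[str], relevance, similarity,
               k: int = 10, lam: float = 0.7) -> list[str]:
    # Greedily pick the tag with the best marginal score: relevant to the
    # video, but not too similar to tags that were already chosen.
    selected: list[str] = []
    pool = list(tags)
    while pool and len(selected) < k:
        best = max(pool, key=lambda t: lam * relevance(t) - (1 - lam) *
                   max((similarity(t, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```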
S206: and selecting and outputting the target video from the videos according to the first similarity.
For the description of S206, reference may be made to S103, which is not described herein again.
It should be noted that, to support diversified video search, this embodiment may also support searching for videos by picture, which is described below with reference to fig. 6.
Fig. 6 is a schematic flow chart of a video search method according to another embodiment of the present application.
As shown in fig. 6, the method includes:
s301: and acquiring a search request carrying search information, wherein the search request is used for requesting to search a target video corresponding to the search information.
For the description of S301, reference may be made to S101, which is not described herein again.
S302: judging whether the search information comprises text information and/or pictures, and if the search information comprises the text information, executing S303 to S304; if the search information includes a picture, executing S305 to S306; if the search information includes text information and pictures, S307 to S309 are performed.
S303: and determining a first similarity between the text information and preset label information of each video, wherein the label information comprises a text label and a video frame label.
For the description of S303, reference may be made to S102, which is not described herein again.
S304: and selecting and outputting the target video from the videos according to the first similarity.
For the description of S304, reference may be made to S103, which is not described herein again.
In some embodiments, the target video may be filtered again in combination with the historical search record, the historical viewing record and the historical comment information of the user, and the filtered target video may be output.
S305: a second similarity of the picture to the images of the videos is determined.
The image in this step may be a target key frame image of each video.
S306: and selecting and outputting the target video from the videos according to the second similarity.
The target video is determined according to the second similarity on the same principle as it is determined according to the first similarity in the foregoing embodiments, which is not repeated here.
S307: and determining a first similarity between the text information and preset label information of each video, wherein the label information comprises a text label and a video frame label.
For the description of S307, reference may be made to S102, which is not described herein again.
S308: a second similarity of the picture to the images of the videos is determined.
S309: and determining and outputting the target video according to the intersection of the first similarity and the second similarity.
That is, the server may determine the videos corresponding to the first similarity (hereinafter the first video set) and the videos corresponding to the second similarity (hereinafter the second video set), determine the videos included in both sets, and take those videos as the target videos.
Similarly, only some of the videos in the intersection may be selected as target videos; the implementation principle may be found in the above embodiments and is not repeated here.
In other embodiments, weighting coefficients may be assigned in advance to the first similarity and the second similarity, and the target video determined based on the weighted similarities.
In this embodiment, the server supports video search based on text information as well as on pictures, which improves the flexibility and diversity of video search; determining the target video by combining the first similarity and the second similarity improves the accuracy of the target video. A sketch of these variants follows.
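A sketch combining the intersection and weighting variants just described; the weight and cutoff values are assumptions, since the patent leaves them configurable:

```python
def combined_targets(first_sim: dict[str, float],
                     second_sim: dict[str, float],
                     w_text: float = 0.5, top_n: int = 10) -> list[str]:
    # Only videos present in both result sets (the intersection) are scored,
    # by a weighted sum of the text-based and picture-based similarities.
    common = first_sim.keys() & second_sim.keys()
    score = {v: w_text * first_sim[v] + (1 - w_text) * second_sim[v]
             for v in common}
    return sorted(score, key=score.get, reverse=True)[:top_n]
```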
According to another aspect of the embodiments of the present application, there is also provided a video search apparatus, configured to perform the video search method according to any of the above embodiments, such as performing the method shown in any of fig. 3, fig. 4, and fig. 6.
Referring to fig. 7, fig. 7 is a schematic diagram of a video search apparatus according to an embodiment of the present application.
As shown in fig. 7, the apparatus includes:
the acquisition module 11 is configured to acquire a search request carrying search information, where the search request is used to request to search for a target video corresponding to the search information;
a first determining module 12, configured to determine, if the search information includes text information, a first similarity between the text information and preset tag information of each video, where the tag information includes a text tag and a video frame tag;
a selecting module 13, configured to select the target video from the videos according to the first similarity;
and the output module 14 is used for outputting the target video.
As shown in fig. 8, in some embodiments, the apparatus further includes:
the slicing module 15 is configured to slice any one of the videos to obtain a slice set corresponding to the any one of the videos;
a clustering module 16, configured to perform clustering processing on the slice set to obtain a target key frame image of any one of the videos;
and a generating module 17, configured to generate the video frame tag according to the target key frame image.
In some embodiments, the slicing module 15 is configured to slice any one of the videos by taking time as a slicing unit, obtain an image corresponding to the any one of the videos, slice the image by taking an object as a slicing unit, and obtain the slice set.
In some embodiments, the clustering module 16 is configured to perform clustering processing on the slice set by using the object as a category unit, determine, for any category in the clustering result, an image with the largest image information entropy in any category as a candidate key frame image, and perform redundancy processing on the candidate key frame image to obtain the target key frame image.
In some embodiments, the clustering module 16 is configured to, for any two temporally adjacent candidate key frame images, determine edge histograms corresponding to the any two temporally adjacent candidate key frame images, determine difference information between the respective corresponding edge histograms, and perform redundancy processing on the candidate key frame images based on the difference information to obtain the target key frame image.
In some embodiments, the generating module 17 is configured to generate description information for describing the target key frame image, and generate the video frame tag according to the description information.
In some embodiments, the target key frame image includes multiple frame images, the description information includes description information corresponding to each of the multiple frame images, and the generating module 17 is configured to determine the number of occurrences of the description information corresponding to each of the multiple frame images, and select the video frame tag from the description information corresponding to each of the multiple frame images based on a preset selection parameter and the number of occurrences.
In some embodiments, the generating module 17 is configured to determine a phrase for describing the object in the target key frame image, connect the phrase based on a preset connecting word, and obtain the description information.
As shown in fig. 8, in some embodiments, if the search information further includes a picture, the apparatus further includes:
a second determining module 18, configured to determine a second similarity between the picture and the image of each video;
and the selecting module 13 is further configured to determine the target video according to the intersection of the first similarity and the second similarity.
In some embodiments, each of the videos is obtained by filtering based on video information and audio information of each of the videos.
As shown in fig. 8, in some embodiments, the apparatus further includes:
a processing module 19, configured to perform gray processing on any one of the videos;
and the slicing module 15 is further configured to slice the gray-processed video to obtain the slice set.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 9, is a block diagram of an electronic device according to an embodiment of the application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of embodiments of the present application described and/or claimed herein.
As shown in fig. 9, the electronic apparatus includes: one or more processors 101, memory 102, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 9 illustrates an example of one processor 101.
The memory 102 is a non-transitory computer readable storage medium provided by the embodiments of the present application. The memory stores instructions executable by at least one processor, so that the at least one processor executes the video search method provided by the embodiment of the application. The non-transitory computer-readable storage medium of the embodiments of the present application stores computer instructions for causing a computer to perform the video search method provided by the embodiments of the present application.
Memory 102, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules in embodiments of the present application. The processor 101 executes various functional applications of the server and data processing, i.e., implements the video search method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 102.
The memory 102 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 102 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 102 may optionally include memory located remotely from processor 101, which may be connected to an electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, Block-chain-Based Service Networks (BSNs), mobile communication networks, and combinations thereof.
The electronic device may further include: an input device 103 and an output device 104. The processor 101, the memory 102, the input device 103, and the output device 104 may be connected by a bus or other means, and the bus connection is exemplified in fig. 9.
The input device 103 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus, such as a touch screen, keypad, mouse, track pad, touch pad, pointer stick, one or more mouse buttons, track ball, joystick, or other input device. The output devices 104 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Block-chain-Based Service Networks (BSNs), Wide Area Networks (WANs), and the internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host; it is a host product in the cloud computing service system and addresses the defects of high management difficulty and weak service expansibility in conventional physical host and virtual private server (VPS) services.
According to another aspect of the embodiment of the present application, a video recommendation method is further provided.
Referring to fig. 10, fig. 10 is a flowchart illustrating a video recommendation method according to an embodiment of the present application.
As shown in fig. 10, the method includes:
S401: acquiring a history record of videos accessed by the user.
S402: determining text information corresponding to the history record.
The text information corresponding to the history record can be understood along two dimensions. One dimension comes from the history record itself, such as the text labels of the videos the user has watched, their authors, their upload times, and their comment information (including comments by the user and by other users). The other dimension is information expanded on the basis of the history record, such as related information about videos similar to those the user is interested in, determined from the history record.
S403: determining a third similarity between the text information corresponding to the history record and the preset label information of each video, wherein the label information comprises a text label and a video frame label.
For the description of the label information, reference may be made to the above embodiments; details are not repeated here.
That is to say, in this embodiment, the server may recommend videos to the user according to the history record and the label information. Because the recommendation fully considers content from both dimensions, the text label and the video frame label, the accuracy and reliability of video recommendation can be improved, and the user's video watching experience can be improved as well. A minimal sketch of this flow follows step S404 below.
S404: selecting videos from the videos according to the third similarity and recommending the selected videos to the user.
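By way of illustration only, and not as part of the claimed method, the flow S401-S404 can be sketched as follows. The sketch assumes scikit-learn is available, uses TF-IDF cosine similarity as a stand-in for the third similarity, and all field and function names (history, text_label, frame_labels, recommend) are hypothetical:

    # A minimal sketch of S401-S404: build a text profile from the user's
    # history record, score each video's label information against it, and
    # return the top matches. Data layout and similarity are assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def recommend(history, videos, top_k=10):
        # S401/S402: flatten the history records into one text profile,
        # drawing on the first dimension above (labels, authors, comments).
        profile = " ".join(
            f"{r['text_label']} {r['author']} {' '.join(r['comments'])}"
            for r in history
        )

        # Each candidate video's label information: text label + frame labels.
        docs = [f"{v['text_label']} {' '.join(v['frame_labels'])}" for v in videos]

        # S403: third similarity between the history text and each video's labels.
        matrix = TfidfVectorizer().fit_transform([profile] + docs)
        scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

        # S404: recommend the most similar videos.
        ranked = sorted(zip(videos, scores), key=lambda p: p[1], reverse=True)
        return [video for video, _ in ranked[:top_k]]

An embedding-based distance could replace TF-IDF here without changing the flow; the choice above only keeps the sketch self-contained.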
It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution of the present application can be achieved; the present application is not limited in this respect.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (25)

1. A video search method, comprising:
acquiring a search request carrying search information, wherein the search request is used for requesting a search for a target video corresponding to the search information;
if the search information comprises text information, determining a first similarity between the text information and preset label information of each video, wherein the label information comprises a text label and a video frame label;
and selecting and outputting the target video from the videos according to the first similarity.
2. The method of claim 1, further comprising:
slicing any one of the videos to obtain a slice set corresponding to the any one video;
clustering the slice set to obtain a target key frame image of any one video;
and generating the video frame label according to the target key frame image.
3. The method of claim 2, wherein slicing any one of the videos to obtain a slice set corresponding to the any one video comprises:
slicing the any one video by taking time as a slice unit to obtain an image corresponding to the any one video;
and slicing the image by taking the object as a slice unit to obtain the slice set.
4. The method of claim 3, wherein clustering the set of slices comprises:
clustering the slice set by taking the object as a category unit;
for any category in the clustering result, determining an image with the largest image information entropy in the any category as a candidate key frame image;
and carrying out redundancy processing on the candidate key frame image to obtain the target key frame image.
5. The method of claim 4, wherein redundantly processing the candidate key frame images to obtain the target key frame image comprises:
for any two temporally adjacent candidate key frame images, determining edge histograms respectively corresponding to the two candidate key frame images;
determining difference information between the respective corresponding edge histograms;
and carrying out redundancy processing on the candidate key frame image based on the difference information to obtain the target key frame image.
6. The method of claim 2, wherein generating the video frame tag from the target key frame image comprises:
generating description information for describing the target key frame image;
and generating the video frame label according to the description information.
7. The method of claim 6, wherein the target key frame image comprises a plurality of frame images, the description information comprises description information corresponding to each of the plurality of frame images, and generating the video frame tag according to the description information comprises:
determining the number of occurrences of the description information corresponding to each of the multiple frames of images;
and selecting the video frame tags from the description information corresponding to the multiple frames of images based on a preset selection parameter and the numbers of occurrences.
8. The method of claim 6, wherein generating description information describing the target key frame image comprises:
determining phrases describing objects in the target key frame image;
and connecting the phrases based on preset connecting words to obtain the description information.
9. The method according to any one of claims 1 to 8, wherein if the search information further includes a picture, further comprising:
determining a second similarity of the picture to the images of the videos;
and selecting and outputting the target video from the videos according to the first similarity comprises: determining and outputting the intersection of the first similarity and the second similarity.
10. The method according to any one of claims 1 to 8, wherein each of the videos is obtained by filtering based on video information and audio information of each of the videos.
11. The method of any of claims 2 to 8, further comprising: carrying out gray-level processing on the any one video;
wherein slicing the any one video to obtain a slice set corresponding to the any one video comprises: slicing the any one video subjected to the gray-level processing to obtain the slice set.
12. A video search apparatus, comprising:
an acquisition module, configured to acquire a search request carrying search information, wherein the search request is used for requesting a search for a target video corresponding to the search information;
a first determining module, configured to determine, if the search information comprises text information, a first similarity between the text information and preset label information of each video, wherein the label information comprises a text label and a video frame label;
a selecting module, configured to select the target video from the videos according to the first similarity;
and an output module, configured to output the target video.
13. The apparatus of claim 12, further comprising:
a slicing module, configured to slice any one of the videos to obtain a slice set corresponding to the any one video;
a clustering module, configured to cluster the slice set to obtain a target key frame image of the any one video;
and a generating module, configured to generate the video frame label according to the target key frame image.
14. The apparatus of claim 13, wherein the slicing module is configured to slice the any one of the videos in units of time slices to obtain an image corresponding to the any one of the videos, and slice the image in units of object slices to obtain the slice set.
15. The apparatus according to claim 14, wherein the clustering module is configured to perform clustering processing on the slice set by using the object as a category unit, determine, for any category in the clustering results, an image with the largest image information entropy in any category as a candidate key frame image, and perform redundancy processing on the candidate key frame image to obtain the target key frame image.
16. The apparatus according to claim 15, wherein the clustering module is configured to, for any two temporally adjacent candidate key frame images, determine edge histograms corresponding to the any two temporally adjacent candidate key frame images, determine difference information between the respective corresponding edge histograms, and perform redundancy processing on the candidate key frame images based on the difference information to obtain the target key frame image.
17. The apparatus of claim 13, wherein the generating module is configured to generate description information describing the target key frame image, and generate the video frame tag according to the description information.
18. The apparatus according to claim 17, wherein the target key frame image includes multiple frame images, the description information includes description information corresponding to each of the multiple frame images, and the generating module is configured to determine a number of occurrences of the description information corresponding to each of the multiple frame images, and select the video frame tag from the description information corresponding to each of the multiple frame images based on a preset selection parameter and the number of occurrences.
19. The apparatus of claim 17, wherein the generating module is configured to determine phrases describing objects in the target key frame image, and obtain the description information by connecting the phrases based on preset connecting words.
20. The apparatus according to any one of claims 12 to 19, wherein if the search information further includes a picture, the apparatus further includes:
a second determining module, configured to determine a second similarity between the picture and the images of the videos;
and the selecting module is configured to determine the intersection of the first similarity and the second similarity as the target video.
21. The apparatus according to any one of claims 12 to 19, wherein each of the videos is obtained by filtering based on video information and audio information of each of the videos.
22. The apparatus of any of claims 13 to 19, further comprising:
a processing module, configured to perform gray-level processing on any one of the videos;
and the slicing module is configured to slice the any one video after the gray-level processing to obtain the slice set.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-11.
25. A video recommendation method, comprising:
acquiring a history record of a user accessing a video;
determining text information corresponding to the history record;
determining a third similarity between the text information corresponding to the history record and preset label information of each video, wherein the label information comprises a text label and a video frame label;
and selecting and recommending videos for the user from the videos according to the third similarity.
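
For illustration only, and not as a definitive implementation of the claims, the key frame selection recited in claims 4 and 5 can be sketched as follows. OpenCV and NumPy are assumed; the cluster input, its temporal ordering, and the diff_threshold value are hypothetical simplifications:

    # A sketch of entropy-based candidate selection (claim 4) and
    # edge-histogram redundancy removal (claim 5). `clusters` is assumed
    # to be a temporally ordered list of clusters, each a list of
    # gray-level frames (claim 11's gray processing is assumed upstream).
    import cv2
    import numpy as np

    def image_entropy(gray):
        # Shannon entropy of the gray-level distribution.
        hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
        p = hist / hist.sum()
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    def edge_histogram(gray):
        # Normalized histogram of Canny edge responses, a simple stand-in
        # for the edge histogram the claim refers to.
        edges = cv2.Canny(gray, 100, 200)
        hist = cv2.calcHist([edges], [0], None, [2], [0, 256]).ravel()
        return hist / hist.sum()

    def select_key_frames(clusters, diff_threshold=0.1):
        # Claim 4: within each category, keep the frame of maximum entropy.
        candidates = [max(frames, key=image_entropy) for frames in clusters]

        # Claim 5: compare edge histograms of temporally adjacent candidates
        # and drop a frame when the difference is negligible (redundancy).
        kept = [candidates[0]]
        for frame in candidates[1:]:
            diff = np.abs(edge_histogram(kept[-1]) - edge_histogram(frame)).sum()
            if diff > diff_threshold:
                kept.append(frame)
        return kept

The 0.1 threshold and the use of Canny edges are placeholders; the claims themselves do not fix these choices.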
CN202010979533.4A 2020-09-17 2020-09-17 Video searching method and device, recommendation method, electronic device and storage medium Pending CN112115299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010979533.4A CN112115299A (en) 2020-09-17 2020-09-17 Video searching method and device, recommendation method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN112115299A (en) 2020-12-22

Family

ID=73799793

Country Status (1)

Country Link
CN (1) CN112115299A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070286528A1 (en) * 2006-06-12 2007-12-13 D&S Consultants, Inc. System and Method for Searching a Multimedia Database using a Pictorial Language
TW200937227A (en) * 2008-02-26 2009-09-01 Era Digital Media Co Electronic video mail system and method having video search
US20130117780A1 (en) * 2011-11-04 2013-05-09 Rahul Sukthankar Video synthesis using video volumes
WO2018023649A1 (en) * 2016-08-05 2018-02-08 汤隆初 Mobile phone-based video push method and push system
CN106919652A (en) * 2017-01-20 2017-07-04 东北石油大学 Short-sighted frequency automatic marking method and system based on multi-source various visual angles transductive learning
CN107147926A (en) * 2017-05-05 2017-09-08 中广热点云科技有限公司 A kind of method of digital TV direct video inter-cut advertisement
WO2019114405A1 (en) * 2017-12-13 2019-06-20 北京市商汤科技开发有限公司 Video recognition and training method and apparatus, electronic device and medium
CN110110144A (en) * 2018-01-12 2019-08-09 天津三星通信技术研究有限公司 The processing method and equipment of video
US20190253744A1 (en) * 2018-02-13 2019-08-15 Ernest Huang Systems and methods for content management of live or streaming broadcasts and video publishing systems
CN109325148A (en) * 2018-08-03 2019-02-12 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN110446063A (en) * 2019-07-26 2019-11-12 腾讯科技(深圳)有限公司 Generation method, device and the electronic equipment of video cover
CN110781347A (en) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 Video processing method, device, equipment and readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BOCHEN GUAN et al.: "Target image video search based on local features [arXiv]", ARXIV, 10 August 2018 (2018-08-10) *
俞辉; 苏博览: "Video retrieval technology based on multi-modal information mining and fusion" (基于多模态信息挖掘融合的视频检索技术), Computer Applications and Software (计算机应用与软件), no. 08 *
宁培阳: "Research on video description methods based on deep learning" (基于深度学习的视频描述方法研究), China Master's Theses Full-text Database (中国优秀硕士学位论文全文数据库), no. 1, 15 January 2020 (2020-01-15) *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699272B (en) * 2021-01-06 2024-01-30 北京有竹居网络技术有限公司 Information output method and device and electronic equipment
CN112699272A (en) * 2021-01-06 2021-04-23 北京有竹居网络技术有限公司 Information output method and device and electronic equipment
CN112733823A (en) * 2021-03-31 2021-04-30 南昌虚拟现实研究院股份有限公司 Method and device for extracting key frame for gesture recognition and readable storage medium
CN112733823B (en) * 2021-03-31 2021-06-22 南昌虚拟现实研究院股份有限公司 Method and device for extracting key frame for gesture recognition and readable storage medium
CN113127663B (en) * 2021-04-01 2024-02-27 深圳力维智联技术有限公司 Target image searching method, device, equipment and computer readable storage medium
CN113127663A (en) * 2021-04-01 2021-07-16 深圳力维智联技术有限公司 Target image searching method, device, equipment and computer readable storage medium
CN112989076A (en) * 2021-04-15 2021-06-18 北京字节跳动网络技术有限公司 Multimedia content searching method, apparatus, device and medium
CN113490057A (en) * 2021-06-30 2021-10-08 海信电子科技(武汉)有限公司 Display device and media asset recommendation method
CN113490057B (en) * 2021-06-30 2023-03-24 海信电子科技(武汉)有限公司 Display device and media asset recommendation method
CN113609176A (en) * 2021-08-06 2021-11-05 北京百度网讯科技有限公司 Information generation method, device, equipment and storage medium
CN113609176B (en) * 2021-08-06 2024-03-19 北京百度网讯科技有限公司 Information generation method, device, equipment and storage medium
CN113806588B (en) * 2021-09-22 2024-04-12 北京百度网讯科技有限公司 Method and device for searching video
CN113806588A (en) * 2021-09-22 2021-12-17 北京百度网讯科技有限公司 Method and device for searching video
CN113901263A (en) * 2021-09-30 2022-01-07 宿迁硅基智能科技有限公司 Label generating method and device for video material
CN113901330A (en) * 2021-12-09 2022-01-07 北京达佳互联信息技术有限公司 Video searching method and device, electronic equipment and storage medium
CN114390366B (en) * 2022-01-19 2024-02-06 北京百度网讯科技有限公司 Video processing method and device
CN114390366A (en) * 2022-01-19 2022-04-22 北京百度网讯科技有限公司 Video processing method and device
WO2023159765A1 (en) * 2022-02-22 2023-08-31 平安科技(深圳)有限公司 Video search method and apparatus, electronic device and storage medium
CN114697761B (en) * 2022-04-07 2024-02-13 脸萌有限公司 Processing method, processing device, terminal equipment and medium
CN114697761A (en) * 2022-04-07 2022-07-01 脸萌有限公司 Processing method, processing device, terminal equipment and medium
CN114866818A (en) * 2022-06-17 2022-08-05 深圳壹账通智能科技有限公司 Video recommendation method and device, computer equipment and storage medium
CN114866818B (en) * 2022-06-17 2024-04-26 深圳壹账通智能科技有限公司 Video recommendation method, device, computer equipment and storage medium
CN116069971B (en) * 2022-11-30 2023-12-26 读书郎教育科技有限公司 Educational video data pushing system based on big data
CN116069971A (en) * 2022-11-30 2023-05-05 读书郎教育科技有限公司 Educational video data pushing system based on big data

Similar Documents

Publication Publication Date Title
CN112115299A (en) Video searching method and device, recommendation method, electronic device and storage medium
US11810576B2 (en) Personalization of experiences with digital assistants in communal settings through voice and query processing
CN111125435B (en) Video tag determination method and device and computer equipment
US9740746B2 (en) Question answer system using physical distance data
CN111797226B (en) Conference summary generation method and device, electronic equipment and readable storage medium
CN110740389B (en) Video positioning method, video positioning device, computer readable medium and electronic equipment
US11429405B2 (en) Method and apparatus for providing personalized self-help experience
CN113811884A (en) Retrieval aggregation of cognitive video and audio
CN109271509B (en) Live broadcast room topic generation method and device, computer equipment and storage medium
CN111831854A (en) Video tag generation method and device, electronic equipment and storage medium
CN112148881B (en) Method and device for outputting information
CN113094550A (en) Video retrieval method, device, equipment and medium
CN111866610A (en) Method and apparatus for generating information
CN113806588B (en) Method and device for searching video
CN112541120B (en) Recommendation comment generation method, device, equipment and medium
CN114254158B (en) Video generation method and device, and neural network training method and device
CN113704507B (en) Data processing method, computer device and readable storage medium
CN111967599B (en) Method, apparatus, electronic device and readable storage medium for training model
CN111177462B (en) Video distribution timeliness determination method and device
CN111309200B (en) Method, device, equipment and storage medium for determining extended reading content
CN112650919A (en) Entity information analysis method, apparatus, device and storage medium
CN111949820A (en) Video associated interest point processing method and device and electronic equipment
CN111291184A (en) Expression recommendation method, device, equipment and storage medium
CN111385188A (en) Recommendation method and device for dialog elements, electronic equipment and medium
US20230237093A1 (en) Video recommender system by knowledge based multi-modal graph neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination