CN110209881B - Video searching method, device and storage medium

Info

Publication number
CN110209881B
Authority
CN
China
Prior art keywords
image
video
image feature
feature string
similarity
Prior art date
Legal status
Active
Application number
CN201811322381.XA
Other languages
Chinese (zh)
Other versions
CN110209881A (en)
Inventor
庄钟鑫
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201811322381.XA
Publication of CN110209881A
Application granted
Publication of CN110209881B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Abstract

Embodiments of the present application disclose a video searching method. The method includes the following steps: in response to a video search request message about a first video sent by a terminal device, acquiring a first image feature string of the first video, where the first image feature string includes image features of one or more image frames of the first video; determining a target second video from a plurality of second videos according to the similarity between the first image feature string and second image feature strings of the plurality of second videos; and returning video information of the target second video to the terminal device as a search result.

Description

Video searching method, device and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a video searching method, apparatus, and storage medium.
Background
With the continuous development of online video playback technology, video has become a major traffic entry point on the Internet. Faced with massive amounts of video content, an accurate video search engine can greatly improve the user experience.
Video search engines typically provide text-based video search: a user enters or selects keywords associated with a video, such as the director or the year of shooting, on a search page, and the relevant information is sent to a background search engine. The background search engine uses a text fuzzy-matching algorithm to find the best-matching video title in its database, retrieves other related information about that video, and feeds the results back to the user.
Text-based video search requires the user to have some knowledge of the content of the video to be searched, such as who the lead actor is or when it was shot, and the accuracy of the search depends largely on the user's knowledge of the video's content and background. The more the user knows, and the more accurate that knowledge is, the more accurate the search results. Conversely, if the user knows little about the video's content and background, an effective search is hardly possible. For example, a user may have seen a shared short video (e.g., a highlight clip from a movie), be very interested in its content, and want to search for the original movie, yet know nothing about the short video: not the actors, not the director; the video is too short to identify the people in it. In this case, text-based video search can hardly produce an effective result.
Summary
Some embodiments of the present application provide a video searching method, apparatus and storage medium, so that a user can accurately search for desired video information without knowing the background information of the video.
The video searching method provided by the embodiment of the application comprises the following steps:
responding to a video search request message about a first video sent by a terminal device, and acquiring a first image feature string of the first video, wherein the first image feature string comprises image features of one or more image frames of the first video;
determining a target second video from the plurality of second videos according to the similarity of the first image feature string and the second image feature strings of the plurality of second videos;
and returning the video information of the target second video to the terminal equipment as a search result.
The video searching device provided by the embodiment of the application comprises:
an acquisition module, configured to acquire a first image feature string of a first video in response to a video search request message about the first video sent by a terminal device, where the first image feature string includes image features of one or more image frames of the first video;
the determining module is used for determining a target second video from the plurality of second videos according to the similarity between the first image characteristic string and the second image characteristic strings of the plurality of second videos;
and the feedback module is used for returning the video information of the target second video to the terminal equipment as a search result.
Embodiments of the present application also provide a non-transitory computer readable storage medium having stored therein machine readable instructions executable by a processor to perform the above-described method.
In the technical solutions provided by the embodiments of the present application, the image feature string of a video is extracted and compared with the image feature strings in a database, so that one or more videos with similar image features can be found and presented to the user. Thus, the user can accurately search for the desired video information even when the background information of the video is completely unknown.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a schematic illustration of an operating environment in some embodiments of the present application;
fig. 2 is a flowchart of a video searching method according to some embodiments of the present application;
FIG. 3 is another flow chart of a video search method in some embodiments of the present application;
FIG. 4 is a schematic diagram of a user interface in some embodiments of the present application;
FIG. 5 is a schematic diagram of a process of acquiring an image feature string of a video according to an embodiment of the present application;
FIG. 6 is a schematic diagram of image features extracted from an image frame in an embodiment of the present application;
fig. 7 is a schematic diagram of two images included in a first video according to an embodiment of the present application;
FIG. 8 is an interactive schematic diagram of a video searching method according to some embodiments of the present application;
Fig. 9 is a schematic structural diagram of a video searching apparatus according to some embodiments of the present application; and
Fig. 10 is another schematic structural diagram of a video searching apparatus according to some embodiments of the present application.
Detailed Description
In order to make the technical solution and advantages of the present application more apparent, the present application will be further described in detail below by referring to the accompanying drawings and examples.
For simplicity and clarity of description, the following sets forth aspects of the present application through several representative embodiments; not all embodiments are shown. The embodiments include numerous details to aid understanding, but the technical solutions of the present application may be implemented without being limited to these details. Some embodiments are not described in detail, and only a framework is presented, in order to avoid unnecessarily obscuring aspects of the present application. Hereinafter, "comprising" means "including but not limited to", and "according to ..." means "according to at least ..., but not limited to only ...". Where the description and the claims state that something "comprises" certain features, this should be interpreted as meaning that other features may be present in addition to those mentioned.
In text-based video search techniques, users are required to have some knowledge of the content of the video to be searched, and the accuracy of the search depends largely on the user's knowledge of the video content and the background. If the user has little knowledge of the video content and the background, then little search can be performed.
For this reason, the embodiments of the present application provide a video searching method. In this method, image feature strings are extracted one by one from collected videos, and the extracted image feature strings are stored in a database to build an image feature string database. An image feature string is then extracted from the video provided by the user and compared with the image feature strings in the database, so as to find videos with similar features and present them to the user.
In the embodiment of the present application, the image feature string refers to a set of image features. For example, for a plurality of image frames in a video, the image features of each image frame may be extracted separately, the image feature string being composed of the image features of the plurality of image frames.
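To make the notion concrete, the following minimal Python sketch shows one possible in-memory representation of an image feature string; the names (ImageFeature, objects) and the example values are illustrative assumptions, not taken from the patent:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class ImageFeature:
        # Image features of one image frame: recognized object names and
        # their positions (relative coordinates) within the frame.
        objects: List[Tuple[str, Tuple[float, float]]]

    # An image feature string is an ordered collection of per-frame features.
    feature_string: List[ImageFeature] = [
        ImageFeature(objects=[("plant", (0.25, 0.5)), ("person", (0.5, 0.5))]),
        ImageFeature(objects=[("plant", (0.25, 0.5)), ("person", (0.55, 0.5))]),
    ]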
By the video searching method provided by the embodiment of the application, the video related information can be searched under the condition that a user does not know the video content and the background information, and the dependence of a text-based video searching technology on the video content and the background information is avoided.
FIG. 1 is a schematic diagram of an operating environment 100 in some embodiments of the present application. As shown in fig. 1, respective terminal devices (e.g., terminal devices 104-a through 104-c) of a plurality of users are each connected to a server 112 via a network 106.
In some embodiments of the present application, each user connects to the server 112 through an application (108-a through 108-c) executing on the terminal device 104. For example, the application 108 may be a video application.
The server 112 is configured to provide video services to the terminal device 104, for example, services such as video playback, video search, etc. may be provided to the terminal device. The user may search for various video programs to play, etc.
In some embodiments of the present application, the server 112 may maintain a database 118, where the database 118 stores image feature strings corresponding to each of the plurality of videos collected in advance, and related information of each video, such as content introduction of each video, and so on.
After seeing the first video, the user may issue a video search request for the first video to the server 112 through the terminal device 104, requesting the server 112 to search for a target video related to the first video. For example, the first video may be a short video, and the target video may be a complete video corresponding to the short video.
After the server 112 receives the video search request message about the first video sent by the terminal device 104, an image feature string of the first video may be obtained according to the video search request message, and then an image feature string matched with the image feature string of the video may be searched for in the image feature strings stored in the database 118, so as to determine a target video corresponding to the first video, and relevant information of the target video may be fed back to the terminal device 104 as a search result.
In some embodiments, examples of terminal device 104 include, but are not limited to, a palmtop computer, a wearable computing device, a Personal Digital Assistant (PDA), a tablet computer, a notebook computer, a desktop computer, a smart phone, or any number or combination of these data processing devices.
In some embodiments, network 106 may include a Local Area Network (LAN) and a Wide Area Network (WAN) such as the Internet. The network 106 may be implemented using any well-known network protocol, including various wired or wireless protocols.
In some embodiments, the server 112 and database 118 may each be implemented on one or more separate data processing devices or a distributed computer network.
The video searching method provided in the embodiment of the present application is described below with reference to fig. 2. Fig. 2 is a flowchart of a video searching method according to some embodiments of the present application. The method may be performed by the server 112 shown in fig. 1. As shown in fig. 2, the method includes the following operations:
s201, a first image feature string of a first video is obtained in response to a video search request message about the first video sent by a terminal device, wherein the first image feature string contains at least one image feature of the first video.
In some embodiments, the user may have seen the first video through some way, and if the user wants to find the target video corresponding to the first video, a video search request message is triggered. For example, the first video is a short video, and the target video is an original video corresponding to the short video.
After receiving the video search request message of the user, the server acquires a first image feature string of the first video, so that a target video corresponding to the first video is searched out through a searching method based on the image feature.
For example, image recognition techniques may be employed to identify target objects, such as common objects (faces, cars, planes, clouds, etc.), in a certain image frame of the first video, as well as location information (e.g., location coordinates) of such objects in that image frame. In some embodiments, the names of these target objects and their location information may be regarded as "image features". Using the image features, a set of similar images can be searched from the image library.
In some embodiments, the first image feature string may include an image feature of one image frame of the first video, which may reduce the amount of computation and increase the search speed.
However, many different images may have similar features due to the limited image feature discrimination of a single image. Therefore, to improve the accuracy of the search, in some embodiments, the first image feature string may also contain image features of a plurality of image frames of the first video. At this time, the first image feature string is obtained by combining the image features of the plurality of image frames, and the recognition degree is greatly improved, so that the accuracy of searching is also improved.
S202, determining target second videos from the plurality of second videos according to the similarity between the first image feature strings and the second image feature strings of the plurality of second videos.
In some embodiments, a database maintains the respective second image feature strings of a plurality of second videos. The server may collect the second videos in advance, extract their second image feature strings one by one through feature extraction, store the extracted second image feature strings in the database, and associate them with the collected detailed information of each video. The detailed information in the database may be updated periodically.
After the first image feature string of the first video is extracted, the first image feature string and the second image feature strings in the database can be respectively compared to obtain the similarity of the first image feature string and each second image feature string.
Then, according to the similarity between the first image feature string and each second image feature string, a set of second image feature strings with the highest similarity is selected, and the second videos corresponding to these second image feature strings are taken as target videos. Here, the selected set may contain one or more second image feature strings.
In other embodiments, a similarity threshold may be preset, and the first image feature string is compared with each second image feature string in the database in turn. If the similarity between the first image feature string and the current second image feature string is higher than the similarity threshold, the second video corresponding to the current second image feature string is taken as a target second video, and the comparison continues with the next second image feature string until an end condition is met. For example, the end condition may be that the current second image feature string is the last one, i.e., all second image feature strings in the database have been compared; or that the number of second image feature strings whose similarity is higher than the similarity threshold has reached a preset threshold. For example, if the number of target second videos is preset to 10, the related information of 10 target second videos is returned to the user as the search result; once 10 second image feature strings with similarity above the threshold have been found during the comparison, the second videos corresponding to those 10 strings are taken as the target second videos.
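As a rough illustration of this thresholded comparison loop (a sketch, not the patent's implementation; the string-level similarity function is passed in, and the 10-result cap mirrors the example above):

    from typing import Callable, Iterable, List, Tuple

    def search_targets(first_string,
                       database: Iterable[Tuple[str, object]],
                       similarity: Callable[[object, object], float],
                       sim_threshold: float = 0.9,
                       max_results: int = 10) -> List[str]:
        # Compare the first image feature string with each second image
        # feature string in turn; stop when the database is exhausted or
        # enough matching target videos have been found.
        targets = []
        for video_id, second_string in database:
            if similarity(first_string, second_string) > sim_threshold:
                targets.append(video_id)
            if len(targets) >= max_results:
                break  # end condition: preset number of targets reached
        return targets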
S203, the video information of the target second video is used as a search result and returned to the terminal equipment which sends the video search request message.
After the server returns the search results to the terminal device, the user of the terminal device may choose to view a certain second video in the search results, and may also view various detailed information of the video.
In some embodiments, in addition to storing a plurality of second image feature strings, video information corresponding to each second image feature string is stored in association in the database. After the target second video is determined, corresponding video information can be acquired from the database, and the video information is fed back to the terminal equipment which sends the video search request message.
In some embodiments, the video information may include a play link of the target second video. In addition, the video information may further include detailed information of the target second video, such as title, director, actors, scenario introduction, etc.
By the video searching method provided by the embodiment of the application, the video related information can be searched under the condition that a user does not know the video content and the background information, and the dependence of a text-based video searching technology on the video content and the background information is avoided.
Fig. 3 is another flowchart of a video searching method provided in some embodiments of the present application. The method may be performed by the server 112 shown in fig. 1. As shown in fig. 3, the method includes the following operations:
s301, acquiring a first video in response to a video search request message about the first video sent by a terminal device.
In some embodiments, the user may have seen a short video (i.e., the first video) through some approach (e.g., from content shared by other users), and the user is very interested in the content of this short video, wanting to find the original video to which this short video corresponds. At this time, the user may provide the short video to the background server, so that the background server searches for the original video corresponding to the short video through a search method based on image features.
In some embodiments, the short video may be provided to the background server in different ways. For example, the following steps S3011 and S3012 give two different processing manners.
S3011, obtaining a playing link of the first video from the video search request message, and downloading the first video according to the playing link.
In some embodiments, when a user is watching a video (e.g., a short video, a video clip, etc.) through a video application, the user may want to search for the original video to which the video corresponds. At this time, a button for "searching for an original video" may be set on the play page of the video.
For example, FIG. 4 is a schematic diagram of a user interface in some embodiments of the present application. As shown in fig. 4, on a play page 400 of a short video, a user can search for an original video corresponding to the short video by clicking a "search for original video" button 401.
After the user clicks the "search original video" button 401, the video play application may send a play link of the current short video (i.e., the first video) to the background server through a search request message. After receiving the search request message, the server downloads the content of all or part of the short video (for example, 5 seconds of playing time) according to the playing link carried in the search request message, so as to perform subsequent video searching based on image characteristics.
S3012, responding to the video search request message, and taking the video uploaded by the user as the first video.
In some embodiments, the user may upload the viewed short video to a server for the server to conduct a video search based on image features. For example, the user may be watching the short video from another approach than the video application. At this time, the user may upload the short video to the server.
S302, a first image feature string of the first video is obtained, wherein the first image feature string contains one or more image features of the first video.
In some embodiments, the first image feature string may include an image feature of one image frame of the first video, which may reduce the amount of computation and increase the search speed. Or, the first image feature string may also include image features of a plurality of image frames of the first video, so as to improve the recognition degree of the image feature string and improve the accuracy of searching.
In the following embodiments, image features of a plurality of image frames of the first video included in the first image feature string are described as an example.
In the embodiment of the application, the first image feature string of the first video may be acquired through different methods. For example, the image features of each frame of image in the first video may be extracted and combined to obtain the first image feature string.
However, directly adjacent image frames in a video are often highly similar. If the image features of all image frames are simply combined into the image feature string, the resulting string has low discriminability and a large data volume.
For example, suppose the 1st to 100th image frames of a video are selected to generate an image feature string, and all 100 frames show two people standing by the sea talking. The objects in the extracted image features are then identical, with only slight changes in coordinates, so the discriminability of the image feature string is very low. Therefore, to improve discriminability, the image features of two adjacent image frames may be compared first, and if their similarity reaches a certain threshold, only the image features of the earlier frame are retained. This improves the discriminability of the image feature string while reducing its total data volume.
In some embodiments, step S302 may include:
s3021, determining image characteristics of a first image frame in the first video, and adding the image characteristics of the first image frame into the first image characteristic string.
S3022, acquiring a second image frame after the first image frame, comparing the second image frame with the first image frame, and determining whether to add the image features of the second image frame into the first image feature string according to the comparison result.
In some embodiments, steps S3021 and S3022 may be repeatedly performed; for example, after step S3022, the current second image frame may be taken as a new first image frame, a new second image frame may be selected from the first video, and the process returns to step S3022 to determine whether to add the image features of the new second image frame to the first image feature string.
After the first image feature string corresponding to the first video is obtained, the first image feature string can be matched with the second image feature string stored in the database, so that the second image feature string matched with the first image feature string is searched.
S303, selecting a second image characteristic string from the database.
In some embodiments, the second image feature strings may be sequentially selected from the database, and for each second image feature string, a similarity to the first image feature string is calculated, thereby obtaining a similarity of the first image feature string to each second image feature string in the database.
For each second image feature string, the similarity between it and the first image feature string can be determined by the following steps S304 to S306:
s304, selecting a plurality of groups of image features in the second image feature string, wherein each group of image features comprises m continuous image features.
Here, assume that the first image feature string includes m image features (image feature 1, image feature 2, …, image feature m). To compare the first image feature string with the second image feature string, the same number of image features must be selected from the second image feature string, i.e., m consecutive image features.
In some embodiments, such m consecutive image features may be referred to as a set of image features. Assuming that the second image feature string contains n image features in total, it can be derived that the second image feature string contains (n-m+1) sets of image features in total.
Each set of image features may be consecutive m image features starting from the 1 st image feature of the second image feature string or consecutive m image features starting from other positions.
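Enumerating these (n - m + 1) groups is a simple sliding window; a small illustrative Python sketch (variable names are assumptions):

    def feature_windows(second_string, m):
        # All groups of m consecutive image features in the second string.
        # For a string of n features this yields n - m + 1 groups.
        n = len(second_string)
        return [second_string[k:k + m] for k in range(n - m + 1)]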
S305, calculating the similarity of each group of image features of the first image feature string and the second image feature string respectively.
In some embodiments, step S305 may include:
s3051, a set of image features in the second image feature string is selected.
In some embodiments, after a set of image features in the second image feature string is selected, for each set of image features of the second image feature string, the similarity of the first image feature string to it may be calculated by the following steps.
S3052, determining a similarity between an i-th image feature in the first image feature string and an i-th image feature in the set of image features.
Here, assume that the m image features of the first image feature string are A_1 to A_m, where A_i denotes the i-th of the m image features of the first image feature string, B_i denotes the i-th of the m image features selected from the second image feature string, and i ∈ [1, m]. The initial value of i is 1.
In this step, the m image features in the first image feature string are compared pairwise with the m selected features in the second image feature string: when i = 1, the similarity between A_1 and B_1 is compared; when i = 2, the similarity between A_2 and B_2 is compared; and so on.
S3053, if the similarity between A_i and B_i is lower than the second similarity threshold, return to step S3051.
In some embodiments, when the similarity between A_i and B_i is below the second similarity threshold, the comparison with this set of image features in the second image feature string may be ended.
For example, first compare the similarity between A_1 and B_1. If the similarity between A_1 and B_1 is not less than the second similarity threshold, continue to compare the similarity between A_2 and B_2.
If the similarity between A_1 and B_1 is lower than the second similarity threshold, this set of m image features in the second video does not match the positions of the m image features of the first video, so the similarities between A_2 and B_2, …, A_m and B_m need not be compared, which reduces the amount of computation and improves search efficiency.
In that case, the process may return directly to step S3051 to select another m image features in the second image feature string. For example, if the 1st to m-th image features were selected before, the 2nd to (m+1)-th image features may now be selected, and steps S3052 and S3053 are then repeated.
S3054, if the similarity between A_i and B_i is not lower than the second similarity threshold, record the similarity between A_i and B_i. If i = m, all image features have been compared and step S3055 is performed; otherwise, let i = i + 1 and return to step S3052.
S3055, according to the m similarities recorded in step S3054 between the m image features of the first image feature string and the current set of m image features in the second image feature string, calculate the similarity between the first image feature string and this set of image features.
In some embodiments, the sum of the m recorded pairwise similarities may be taken as the similarity between the first image feature string and the current set of m image features in the second image feature string.
S3056, judging whether all (n-m+1) groups of image features in the second image feature string are compared, if so, executing step S306, otherwise, returning to step S3051, and selecting the next group of image features for comparison.
And S306, determining the maximum value in the similarity corresponding to each group as the similarity between the first image characteristic string and the second image characteristic string.
Among the similarities calculated in step S305 between the first image feature string and the sets of image features of the second image feature string, the maximum value is selected as the similarity between the first image feature string and the second image feature string.
The similarity between the first image feature string and a second image feature string in the database can be obtained through the steps S304 to S306. Thereafter, step S303 may be returned to select a next second image feature string in the database, and the operations of steps S304 to S306 may be repeated for the second image feature string to obtain a similarity between the first image feature string and the second image feature string.
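Putting steps S304 to S306 together, a minimal Python sketch of the string-to-string comparison might look as follows (the per-feature similarity function and the concrete threshold are assumptions; the sum-of-similarities scoring follows step S3055 and the early exit follows step S3053):

    def string_similarity(first, second, feature_sim, second_threshold=0.8):
        # Similarity between the first image feature string (m features)
        # and a second image feature string, per steps S304-S306.
        m = len(first)
        best = 0.0
        for k in range(len(second) - m + 1):   # each group of m consecutive features
            group = second[k:k + m]
            sims = []
            for a_i, b_i in zip(first, group): # pairwise comparison of A_i and B_i
                s = feature_sim(a_i, b_i)
                if s < second_threshold:       # early exit (step S3053)
                    break
                sims.append(s)
            else:                              # all m pairs passed the threshold
                best = max(best, sum(sims))    # step S3055, then max (step S306)
        return best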
S307, according to the similarity between each second image characteristic string and the first image characteristic string, determining the second video corresponding to the second image characteristic string meeting the condition as the target second video.
For example, in some embodiments, all second image feature strings may be ranked according to similarity from high to low, and the second videos corresponding to the first N second image feature strings are determined to be target second videos, where N is a natural number, for example, n=10.
Alternatively, among all the second image feature strings, the second image feature strings having the similarity higher than the second similarity threshold may be determined, and the second videos corresponding to the second image feature strings may be determined as the target second videos. For example, a second video corresponding to a second image feature string having a similarity higher than 90% may be used as the target second video.
And S308, returning the video information of the target second video to the terminal equipment sending the video search request message as a search result.
In some embodiments, the video information of the target second video includes at least: and playing the link of the second video. In addition, the video information may further include detailed information of the target second video, such as title, director, actors, scenario introduction, etc.
The process of generating an image feature string is described below with reference to fig. 5. The server may extract the first image feature string of the first video by using the method shown in fig. 5, or may extract the corresponding second image feature string of the second video collected in advance according to the method shown in fig. 5.
Fig. 5 is a schematic diagram of a process of acquiring an image feature string of a video according to an embodiment of the present application. In the embodiment shown in fig. 5, the image feature string of the video is obtained by comparing the image features of two adjacent image frames.
S501, determining the image characteristics of a first image frame in a video, and adding the image characteristics of the first image frame into an image characteristic string.
In this step, a first image frame in the video may be selected first, and the image features of the first image frame are added to the image feature string, and for each subsequent image frame, it is determined whether to add the image features thereof to the image feature string according to the similarity with the previous adjacent image frame.
In some embodiments, the first image frame may be an image frame corresponding to any playing time of the video, for example, may be an image frame corresponding to 0 th second (i.e. start time) of the video; or may be an image frame during playback.
In some embodiments, the image features may include: the name of the object contained in the image frame, and the position information of the object in the image frame.
In order to acquire the names of the objects and the position information thereof contained in the image frames, the images can be identified through an image identification model, so that the objects contained in each frame of image and the position information of each object in the image frames are identified, and the image characteristics of each frame of image are obtained.
In addition, if the image recognition model has the face recognition capability, the existence of the celebrities in the image can be recognized, and the name corresponding to the face can be used as a part of the image characteristics, so that the searching accuracy is further improved, and the searching range is reduced.
The process of extracting image features from an image frame will be described below taking an example in which the image features include an object name and position information thereof.
Fig. 6 is a schematic diagram of image features extracted from an image frame in an embodiment of the present application. As shown in fig. 6, the object names identified in the image frame 600 include: plants and humans. In the embodiment shown in fig. 6, with the lower left corner of the image frame 600 as the origin of coordinates, then the coordinates of the plant are (480,540) and the coordinates of the person are (960,540). Thus, the resulting image features may be: plants, (480, 540); human, (960,540).
In the above example, the position information of the identified object is represented by absolute pixel coordinates. Considering that the resolution of the same video may be different, there may be, for example, a version with a resolution of 1920 (wide) x 1080 (high) and a version with a resolution of 1280 (wide) x 720 (high). At this time, the positional information of each object may be represented using relative coordinates, thereby solving the problem of comparison between images of the same content but different resolutions.
The term "relative coordinates" refers to coordinate values expressed in terms of a ratio of width or height to the entire image, not absolute pixel values. For example, the relative coordinates of the position where the plant of fig. 6 is located are (480/1920,540/1080) = (1/4, 1/2). Similarly, the relative coordinates of the person in image 600 are (1/2 ).
In the above example, the positional information of the object is expressed in the form of relative coordinates. In some embodiments, the location information of the object may also be expressed in other ways. For example, a rectangular coordinate system may be established, and then the angle of the center point of each object relative to the origin of coordinates, and the distance between the center point of each object and the origin of coordinates, may be determined. Then, the angle and the distance are taken as position information of the object.
After determining the image features of a first image frame in the video, the image features of the first image frame are added to the image feature string. Image features of the next image frame may then be extracted and compared to image features of the first image frame.
S502, acquiring a second image frame in the video.
After extracting the image features of the first image frame, a second image frame may be extracted and then compared with the first image frame to determine whether to add the image features of the second image frame to the image feature string.
In some embodiments, the second image frame may be a next image frame adjacent to the first image frame. Thus, adjacent image frames in the video can be compared in pairs in sequence, and if the similarity between the image of the next frame and the image of the previous frame is higher than a predetermined similarity threshold, the image frame of the next frame is ignored. And if the similarity between the image of the next frame and the image of the previous frame is not higher than a preset similarity threshold value, adding the image characteristics of the image of the next frame into the image characteristic string.
Alternatively, the second image frame may be an image frame not adjacent to the first image frame in order to reduce the amount of data. At this time, the second image frame may be determined according to a predetermined time interval. For example, the predetermined time interval is 0.5 seconds, then the first image frame is the 0 th second image frame, the second image frame is the 0.5 th second image frame, and so on.
S503, comparing the second image frame with the first image frame, and determining whether to add the image features of the second image frame into the first image feature string according to the comparison result.
After the second image frame is acquired, it may be determined whether to add image features of the second image frame to the image feature string by comparing the second image frame with the first image frame.
In some embodiments, the first similarity threshold may be predetermined, for example 90%. And if the similarity between the second image frame and the first image frame is smaller than a first similarity threshold value, adding the image features of the second image frame into the image feature string.
In some embodiments, step S503 may include the following steps S5031 to S5034.
S5031, for each object identified in the first image frame, determine whether the same object exists in the corresponding region of the second image frame. If so, perform steps S5032 to S5034 to determine the similarity between the first image frame and the second image frame. If, for any object in the first image frame, the same object does not exist in the corresponding region of the second image frame, the first image frame is determined to be dissimilar to the second image frame; step S504 is performed, and the image features of the second image frame are added to the first image feature string.
In some embodiments, for each object i in the first image frame, it may be determined in turn whether the same object exists in the corresponding region of the second image frame, where i ∈ [1, n], n represents the number of objects in the first image frame, and the initial value of i may be 1.
That is, starting from the 1 st object in the first image frame, determining whether the same object exists in the corresponding area in the second image frame, and if so, continuing to judge the 2 nd object in the first image frame; if not, the determination of step S5031 is ended, and the process proceeds to step S504.
The corresponding region may be the same region as the object is located in the first image frame, or may be a region similar to the object is located in the first image frame.
For example, as shown in fig. 6, image frame 610 is a first image frame and image frame 620 is a second image frame. The region 602 in the second image frame 620 is a region similar to the region 601 in the first image frame 610. If within the area 602 of the second image frame there is a plant that is the same as the plant in the area 601, then the corresponding area in the second image frame is considered to have the same object.
In some embodiments, an error range may be set to achieve this determination, and if the difference between the relative coordinates of the object in the first image frame and the relative coordinates of the corresponding object in the second image frame is within the error range, then the same object may be considered to be present in the corresponding region in the second image frame.
S5032, calculating a maximum positional deviation of a position of each object in the first image frame and the second image frame.
In some embodiments, assume that the relative coordinates of object i in the first image frame are (x_1, y_1) and its relative coordinates in the second image frame are (x_2, y_2). The maximum positional deviation of object i can be calculated by the following formula:
Delta_i = MAX(abs(x_1 - x_2), abs(y_1 - y_2))
where Delta_i represents the maximum positional deviation of the i-th object in the first image frame; MAX() takes the maximum of two values; and abs() takes the absolute value.
In the embodiment of the present application, according to the above formula, the maximum positional deviation of each object in the first image frame may be calculated.
S5033, calculating an average deviation from the maximum positional deviation of each object in the first image frame.
After calculating the maximum position deviation of each object in the first image frame, an average deviation of all the objects can be calculated according to the maximum position deviation.
For example, the average deviation may be calculated according to the following formula:
Delta = (Delta_1 + Delta_2 + … + Delta_n) / n
where Delta represents the average deviation and n represents the number of objects in the first image frame.
And S5034, obtaining the similarity between the first image frame and the second image frame according to the average deviation.
Here, the similarity between the first image frame and the second image frame may be calculated from the average deviation.
For example, the similarity between the first image frame and the second image frame may be calculated according to the following formula:
S = 100 - (Delta * 100)
where S represents the similarity between the first image frame and the second image frame. Since Delta is calculated from relative coordinates, it must be a number less than 1. It can be seen that the larger the average deviation, the lower the similarity between the two image frames.
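The three formulas above translate directly into code. A minimal sketch (the dict-based frame representation mapping object names to relative coordinates is an assumption, and it presumes the check of step S5031 has already passed):

    def frame_similarity(frame1, frame2):
        # frame1, frame2: dicts mapping object name -> (x, y) relative coordinates.
        # Per-object maximum positional deviation (Delta_i), averaged (Delta),
        # then S = 100 - (Delta * 100), as in steps S5032-S5034.
        deviations = []
        for name, (x1, y1) in frame1.items():
            x2, y2 = frame2[name]              # same object assumed present (S5031)
            deviations.append(max(abs(x1 - x2), abs(y1 - y2)))  # Delta_i
        delta = sum(deviations) / len(deviations)               # average deviation
        return 100 - delta * 100                                # similarity S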
S5035, determining whether the similarity between the second image frame and the first image frame is smaller than a first similarity threshold, and if the similarity between the second image frame and the first image frame is smaller than the first similarity threshold, executing step S504.
In some embodiments, if the similarity between the second image frame and the first image frame is greater than or equal to the first similarity threshold, the second image frame may be ignored, and step S502 is returned, where the current second image frame is taken as a new first image frame, and the next image frame of the current second image frame is taken as a new second image frame to determine.
The method of calculating the similarity between two image frames is described above. In some embodiments, other ways of calculating the similarity between two image frames may also be employed.
For example, whether objects contained in two image frames are identical or not may be compared respectively to obtain object similarity, then positions of the respective objects in the two image frames are compared to obtain position similarity, and then the similarity between the two image frames is determined by combining the object similarity and the position similarity.
For example, fig. 7 is a schematic diagram of two images included in a first video in an embodiment of the present application. As shown in fig. 7, the image features of the first image frame 710 obtained through the image recognition operation are: person, relative coordinates (x_1, y_1); table, relative coordinates (x_2, y_2); wavy line, relative coordinates (x_3, y_3). The image features of the second image frame 720 obtained through the image recognition operation are: person, relative coordinates (x_1, y_1); table, relative coordinates (x_2, y_2); wavy line, relative coordinates (x_4, y_4).
Here, the objects contained in the first image frame 710 and the second image frame 720 may be compared first. The first image frame 710 contains a person, a table, and a wavy line; the second image frame 720 contains a person, a table, and a wavy line. The two frames contain identical objects, so the object similarity receives the full score, for example 40 points.
And then comparing the relative coordinate value of the object in the first image frame with the relative coordinate value of the corresponding object in the second image frame to obtain the position similarity.
In comparing the relative coordinate values, an error range may be set, and if a difference between the relative coordinates of the object in the first image frame and the relative coordinates of the corresponding object in the second image frame is within the error range, the positions of the two may be considered to be the same.
For example, in the first image frame 710 and the second image frame 720, the relative coordinates of the person are identical and the relative coordinates of the table are identical; only the relative coordinates of the wavy line deviate. Thus, the position similarity may be 60 × 2/3 = 40 points.
And obtaining the similarity of the second image frame and the first image frame according to the object similarity and the position similarity.
After the object similarity and the position similarity of the two images are calculated, the total similarity can be obtained according to a predefined calculation method.
For example, the sum of the object similarity and the position similarity may be taken as the total similarity, for example 40 + 40 = 80 points. The total similarity may also be calculated in other ways, for example as a weighted sum, which the embodiments of the present application do not limit.
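A sketch of this alternative scoring (the 40/60 point split follows the example above; the set-based object comparison and the error range eps are assumptions):

    def combined_similarity(frame1, frame2, eps=0.05):
        # frame1, frame2: lists of (name, (x, y)) pairs, as in fig. 7.
        objs1 = {name for name, _ in frame1}
        objs2 = {name for name, _ in frame2}
        # Object similarity: full 40 points when both frames contain the
        # same objects; otherwise scaled by set overlap (an assumption).
        object_score = 40.0 * len(objs1 & objs2) / len(objs1 | objs2)

        # Position similarity: 60 points scaled by the fraction of objects
        # whose relative coordinates agree within the error range.
        pos2 = dict(frame2)
        same = sum(
            1 for name, (x1, y1) in frame1
            if name in pos2
            and abs(x1 - pos2[name][0]) <= eps
            and abs(y1 - pos2[name][1]) <= eps
        )
        position_score = 60.0 * same / len(frame1)  # e.g. 60 * 2/3 = 40
        return object_score + position_score        # e.g. 40 + 40 = 80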
S504, adding the image characteristics of the second image frame into the first image characteristic string.
In some embodiments, if the similarity between the second image frame and the first image frame is greater than or equal to a first similarity threshold, the following step S505 may be further performed.
S505, if the similarity between the image features of the second image frame and the image features of the first image frame is greater than or equal to the first similarity threshold, and the playing time interval between the second image frame and the first image frame is greater than or equal to the time threshold, adding the image features of the second image frame into the first image feature string.
Here, if the similarity of the two images is equal to or greater than the first similarity threshold, but the play time interval of the two images is equal to or greater than the time threshold, for example, 1 second, the image features of the second image frame are added to the first image feature string.
In some embodiments, steps S502 to S505 may be repeatedly performed, i.e. the current second image frame is taken as the new first image frame, the next image frame of the current second image frame is taken as the new second image frame, and then steps S502 to S505 are repeatedly performed until an end condition is met, e.g. the current second image frame is the last image frame of the first video, or the number of image features in the first image feature string reaches a predefined threshold.
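Combining steps S501 to S505, one possible reading of the extraction loop is sketched below in Python (the frame-level similarity function is like the one sketched above; the 90% threshold and 1-second gap follow the examples in the text, and comparing against the most recently kept frame is an assumption):

    def build_feature_string(frames, frame_similarity,
                             sim_threshold=90.0, time_threshold=1.0):
        # frames: list of (timestamp_seconds, features) in playback order.
        t0, feats0 = frames[0]
        feature_string = [feats0]       # step S501: always keep the first frame
        last_kept_t = t0
        for t, feats in frames[1:]:     # steps S502-S505, repeated
            # Keep the frame if it differs enough from the last kept frame,
            # or if at least time_threshold seconds have passed (step S505).
            if (frame_similarity(feature_string[-1], feats) < sim_threshold
                    or t - last_kept_t >= time_threshold):
                feature_string.append(feats)
                last_kept_t = t
        return feature_string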
The method of acquiring the first image feature string of the first video has been described above; those skilled in the art may adopt other methods. For example, in some embodiments, image frames may be selected from the first video at predetermined time intervals, and the image features of the selected frames combined to form the image feature string. If the predetermined time interval is 5 seconds, the video frames at the 0th, 5th, 10th, … seconds of the first video can be taken, and their image features combined to obtain the image feature string. The video frame at the 0th second may be the first frame of the first video.
Fig. 8 is an interaction schematic diagram of a video searching method according to an embodiment of the present application. The video searching method provided in the embodiment of the present application is described below with reference to fig. 8. As shown in fig. 8, the method includes the steps of:
s801, the server extracts image characteristic strings of the collected videos and stores the extracted image characteristic strings in a database.
In some embodiments, the server may collect some original videos in advance, and extract image feature strings from the original videos. The detailed information about the original video (e.g., title, director, introduction of actors, scenario, etc.) may be further stored in the database.
The method for extracting the image feature strings may refer to the embodiment described in fig. 5, and will not be described herein.
S802, the server responds to a video search request message of a user about a first video to acquire the first video.
In some embodiments, the video search request message may include a link to the short video, and the server may download the short video based on the link.
Or the terminal device can upload the short video to the server, and the server performs subsequent searching according to the short video uploaded by the user.
S803, the server extracts an image feature string from the first video.
Here, the method for extracting the image feature string by the server may refer to the previous embodiment, and will not be described herein.
S804, the server matches the image feature string extracted in step S803 with the image feature string in the database.
Here, the method for matching the image feature strings by the server may refer to the foregoing embodiments, which are not described herein.
S805, the server determines a group of videos with similar characteristics according to the matching result.
S806, the server acquires the related information of the group of videos with similar characteristics from the database.
And S807, the server feeds the relevant information back to the terminal equipment as a search result.
In the above process, the specific operation of each step may be referred to the foregoing method embodiments, which are not described herein.
The video searching method provided by the embodiment of the application is described above.
The following describes a video search device provided in an embodiment of the present application with reference to the accompanying drawings.
Fig. 9 is a schematic structural diagram of a video searching apparatus according to some embodiments of the present application. As shown in fig. 9, the apparatus 900 includes:
an obtaining module 902, configured to obtain a first image feature string of a first video in response to a video search request message about the first video sent by a terminal device, where the first image feature string includes image features of one or more image frames of the first video;
a determining module 904, configured to determine a target second video from the plurality of second videos according to a similarity between the first image feature string and the second image feature strings of the plurality of second videos in the database;
and the feedback module 906 is configured to return the video information of the target second video to the terminal device as a search result.
In some embodiments, the obtaining module 902 is further configured to:
determining image features of a first image frame in the first video, and adding the image features of the first image frame into the first image feature string;
And acquiring a second image frame after the first image frame, comparing the second image frame with the first image frame, and determining whether to add the image features of the second image frame into the first image feature string according to a comparison result.
In some embodiments, the obtaining module 902 is further configured to:
for each object in the first image frame, determining a similarity of the first image frame and the second image frame if the same object is contained in the corresponding region of the second image frame;
and if the similarity of the first image frame and the second image frame is smaller than a first similarity threshold value, adding the image features of the second image frame into the first image feature string.
In some embodiments, the obtaining module 902 is further configured to:
and if the similarity between the second image frame and the first image frame is greater than or equal to the first similarity threshold value and the playing time interval between the second image frame and the first image frame is greater than or equal to the time threshold value, adding the image features of the second image frame into the first image feature string.
In some embodiments, the obtaining module 902 is further configured to:
for at least one object in the first image frame, if the corresponding region of the second image frame does not contain the same object, adding the image features of the second image frame into the first image feature string.
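For illustration, the following Python sketch assembles a feature string according to the logic above: a new frame's image features are added when an object of the reference frame has disappeared, when the two frames are dissimilar enough, or when the frames remain similar but enough playing time has passed. The helpers frame_features() and frame_similarity(), the convention that frame_similarity() returns None when an object is missing, the threshold values, and the choice to advance the reference frame to the last added frame are all assumptions of the sketch.

# A sketch of the feature-string construction described above. The helpers
# frame_features() and frame_similarity() are assumed to exist (e.g. backed
# by an object detector); frame_similarity() is assumed to return None when
# some object of the reference frame is missing from the compared frame.

def build_feature_string(frames, frame_features, frame_similarity,
                         first_sim_threshold=0.9, time_threshold=5.0):
    """frames: list of (timestamp_seconds, frame) pairs in playing order."""
    feature_string = []
    ref_time, ref_frame = frames[0]
    feature_string.append(frame_features(ref_frame))    # always keep frame 1
    for cur_time, cur_frame in frames[1:]:
        sim = frame_similarity(ref_frame, cur_frame)
        if (sim is None                                 # an object disappeared
                or sim < first_sim_threshold            # frames differ enough
                or cur_time - ref_time >= time_threshold):  # long still scene
            feature_string.append(frame_features(cur_frame))
            ref_time, ref_frame = cur_time, cur_frame   # new reference frame
    return feature_string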
In some embodiments, the determining module 904 is further to:
calculating the similarity between the first image feature string and each second image feature string;
and determining, according to the similarity between the first image feature string and each second image feature string, a second video whose second image feature string meets the conditions as the target second video.
In some embodiments, the first image feature string includes m image features, where m is a positive integer;
the determining module 904 is further configured to:
selecting a plurality of sets of image features in the second image feature string, wherein each set of image features comprises m consecutive image features of the second image feature string;
calculating, respectively, the similarity between the first image feature string and each set of image features of the second image feature string;
and determining the maximum value among the calculated similarities as the similarity between the first image feature string and the second image feature string.
In some embodiments, the determining module 904 is further configured to:
determining a similarity between an ith image feature in the first image feature string and an ith image feature in the set of image features of the second image feature string; wherein i is a positive integer and the initial value is 1;
recording the determined similarity between the ith image feature in the first image feature string and the ith image feature in the set of image features of the second image feature string if the similarity between the ith image feature in the first image feature string and the ith image feature in the set of image features of the second image feature string is not less than a second similarity threshold; letting i=i+1 and returning to the step of determining the similarity between the i-th image feature in the first image feature string and the i-th image feature in the set of image features of the second image feature string until i=m;
and determining the similarity between the first image feature string and the set of image features of the second image feature string according to the recorded m similarities between the m image features of the first image feature string and the m image features of the set.
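For illustration, the following Python sketch implements the sliding-window matching just described: every set of m consecutive image features of a second image feature string is compared feature by feature against the first image feature string, and the maximum set similarity is taken. The helper feature_similarity() is assumed; scoring a set as the mean of its m pair similarities, and discarding a set when any pair falls below the second similarity threshold, are one possible reading of the embodiment.

# A sketch of the sliding-window matching described above. feature_similarity()
# is an assumed helper; the mean-based set score and the rejection of a set
# containing a below-threshold pair are illustrative readings only.

def string_similarity(first, second, feature_similarity, second_threshold=0.7):
    """first: list of m image features; second: a (usually longer) feature string."""
    m = len(first)
    best = 0.0
    for start in range(len(second) - m + 1):      # each set of m consecutive
        group = second[start:start + m]           # features in the second string
        scores = []
        for a, b in zip(first, group):            # ith feature vs ith feature
            s = feature_similarity(a, b)
            if s < second_threshold:              # pair too dissimilar:
                scores = None                     # discard this set
                break
            scores.append(s)
        if scores:
            best = max(best, sum(scores) / m)     # set similarity = mean
    return best                                   # maximum over all sets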
In some embodiments, the image features of each image frame include at least an object contained in the image frame and location information of the object in the image frame.
The location information may include relative coordinate values of a center point of the object, determined according to the ratio of the pixel coordinate values of the center point to the width and the height of the image frame.
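As a brief illustration of this rule (a sketch with hypothetical names only):

# Relative coordinates of an object's center point: its pixel coordinates
# divided by the frame's width and height. All names are illustrative.

def relative_center(cx_px, cy_px, frame_width, frame_height):
    """E.g. center (320, 180) in a 640x360 frame -> (0.5, 0.5)."""
    return cx_px / frame_width, cy_px / frame_height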
In some embodiments, the apparatus 900 further comprises: a download module 901, configured to:
acquiring a play link of the first video from the video search request message;
and downloading the first video according to the play link.
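For illustration, a minimal Python sketch of the download module's behavior, assuming the request message has been parsed into a dict with a hypothetical "play_link" field and using the third-party requests library:

# Assumes the video search request message has been parsed into a dict with
# a hypothetical "play_link" field; uses the third-party `requests` library.

import requests

def download_first_video(request_message, dest_path):
    play_link = request_message["play_link"]          # link from the request
    resp = requests.get(play_link, stream=True, timeout=30)
    resp.raise_for_status()
    with open(dest_path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)                            # stream video to disk
    return dest_path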
In some embodiments, the video information of the target second video includes at least: a play link of the target second video.
Fig. 10 is another schematic structural diagram of a video search apparatus in some embodiments of the present application. The video search apparatus 1000 may be the server 112 shown in Fig. 1, or may be a component integrated into the server 112.
As shown in Fig. 10, the video search apparatus 1000 includes one or more processors (CPUs) 1002, a network interface 1004, a memory 1006, and a communication bus 1008 for interconnecting these components.
In some embodiments, the network interface 1004 is configured to implement a network connection between the video search apparatus 1000 and an external device, for example, receiving a video search request message from a terminal device, querying a database, feeding back search results to the terminal device, and so on.
The video search apparatus 1000 may further include one or more output devices 1012 (e.g., one or more visual displays), and/or include one or more input devices 1014 (e.g., a keyboard, mouse, or other input controls, etc.).
Memory 1006 may be a high-speed random access memory such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; or non-volatile memory such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices.
The memory 1006 includes:
an operating system 1016 including programs for handling various basic system services and for performing hardware-related tasks;
a video search application 1018 for obtaining a first image feature string of a first video in response to a video search request message sent by a terminal device regarding the first video, the first image feature string containing image features of one or more image frames of the first video;
determining a target second video from a plurality of second videos according to the similarity between the first image characteristic string and the second image characteristic strings of the plurality of second videos in the database;
and returning the video information of the target second video to the terminal device as a search result.
Specific functions and implementations of the video search application 1018 can be found in the above-described method embodiments, and are not described herein.
In the technical solution provided by the embodiments of the present application, the image feature string of a video is extracted and compared with the image feature strings in the database, so that one or more videos with similar image features can be found and presented to the user. Thus, the user can accurately find the desired video information even when the background information of the video is entirely unknown.
In addition, each functional module in each embodiment of the present application may be integrated in one processing unit, or each module may exist alone physically, or two or more modules may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units. The functional modules of the embodiments may be located in one terminal or network node, or may be distributed across multiple terminals or network nodes.
In addition, each of the embodiments of the present invention may be implemented by a data processing program executed by a data processing apparatus such as a computer; such a data processing program evidently constitutes the present invention. In addition, a data processing program is typically stored on a storage medium and is executed either by reading the program directly out of the storage medium or by installing or copying the program onto a storage device (such as a hard disk and/or a memory) of the data processing apparatus. Therefore, such a storage medium also constitutes the present invention. The storage medium may use any type of recording means, such as a paper storage medium (e.g., paper tape), a magnetic storage medium (e.g., floppy disk, hard disk, flash memory), an optical storage medium (e.g., CD-ROM), or a magneto-optical storage medium (e.g., MO).
The present invention thus also provides a storage medium in which a data processing program is stored for performing any one of the embodiments of the above-described method of the present invention.
Those of ordinary skill in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include: a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (11)

1. A video search method, comprising:
responding to a video search request message about a first video sent by a terminal device, and acquiring a first image feature string of the first video, wherein the first image feature string comprises image features of one or more image frames of the first video; wherein the image features of each of the one or more image frames of the first video comprise: the name of an object contained in the image frame, and location information of the object in the image frame; and the first image feature string comprises m image features, wherein m is a positive integer;
for each of respective second image feature strings of a plurality of second videos,
selecting a plurality of sets of image features in the second image feature string, wherein each set of image features comprises m consecutive image features in the second image feature string;
calculating, respectively, the similarity between the first image feature string and each set of image features of the second image feature string;
determining the maximum value among the calculated similarities as the similarity between the first image feature string and the second image feature string;
determining a target second video from the plurality of second videos according to the similarity between the first image feature string and the respective second image feature strings of the plurality of second videos; wherein the second image feature strings of the plurality of second videos are maintained in a database in advance, and video information corresponding to each second image feature string is also stored in the database in an associated manner; each second image feature string contains image features of one or more image frames of the corresponding second video; wherein the image features of each of the one or more image frames of the second video comprise: the name of an object contained in the image frame, and location information of the object in the image frame; the location information comprises: a relative coordinate value of a center point of the object relative to a reference point in the image frame, the relative coordinate value being determined according to a ratio of the pixel coordinate values of the center point of the object to the width and the height of the image frame; and the video information of the target second video is returned to the terminal device as a search result;
The calculating, respectively, the similarity between the first image feature string and each set of image features of the second image feature string comprises:
determining a similarity between an ith image feature in the first image feature string and an ith image feature in the set of image features of the second image feature string; wherein i is a positive integer and the initial value is 1;
recording the similarity between the ith image feature in the first image feature string and the ith image feature in the group of image features of the second image feature string if the similarity between the ith image feature in the first image feature string and the ith image feature in the group of image features of the second image feature string is not lower than a second similarity threshold; letting i=i+1 and returning to the step of determining the similarity between the i-th image feature in the first image feature string and the i-th image feature in the set of image features of the second image feature string until i=m;
and determining the similarity between the first image feature string and the set of image features of the second image feature string according to the recorded m similarities between the m image features of the first image feature string and the m image features of the set.
2. The method of claim 1, wherein the acquiring the first image feature string of the first video comprises:
determining image features of a first image frame in the first video, and adding the image features of the first image frame into the first image feature string;
and acquiring a second image frame after the first image frame, comparing the second image frame with the first image frame, and determining whether to add the image features of the second image frame into the first image feature string according to a comparison result.
3. The method of claim 2, wherein the comparing the second image frame with the first image frame, and determining whether to add image features of the second image frame to the first image feature string based on the comparison result, comprises:
for each object in the first image frame, determining a similarity of the first image frame and the second image frame if the same object is contained in the corresponding region of the second image frame; and if the similarity of the first image frame and the second image frame is smaller than a first similarity threshold value, adding the image features of the second image frame into the first image feature string.
4. The method of claim 3, further comprising:
and if the similarity between the second image frame and the first image frame is greater than or equal to the first similarity threshold value and the playing time interval between the second image frame and the first image frame is greater than or equal to the time threshold value, adding the image features of the second image frame into the first image feature string.
5. The method of claim 3, further comprising:
for at least one object in the first image frame, if the corresponding region of the second image frame does not contain the same object, the image features of the second image frame are added into the first image feature string.
6. The method of claim 1, wherein the determining the target second video from the plurality of second videos based on the similarity of the first image feature string to the respective second image feature strings of the plurality of second videos comprises:
calculating the similarity between the first image feature string and each second image feature string;
and determining, according to the similarity between the first image feature string and each second image feature string, a second video whose second image feature string meets the conditions as the target second video.
7. The method of claim 1, further comprising, prior to acquiring the first image feature string of the first video:
acquiring a play link of the first video from the video search request message;
and downloading the first video according to the play link.
8. The method of claim 1, wherein the video information of the target second video includes at least: a play link of the target second video.
9. A video search apparatus, comprising:
an acquisition module, configured to acquire a first image feature string of a first video in response to a video search request message about the first video sent by a terminal device, wherein the first image feature string comprises image features of one or more image frames of the first video; wherein the image features of each of the one or more image frames of the first video comprise: the name of an object contained in the image frame, and location information of the object in the image frame; and the first image feature string comprises m image features, wherein m is a positive integer;
a determining module, configured to:
for each of respective second image feature strings of a plurality of second videos,
selecting a plurality of sets of image features in the second image feature string, wherein each set of image features comprises m consecutive image features in the second image feature string;
calculating, respectively, the similarity between the first image feature string and each set of image features of the second image feature string;
determining the maximum value among the calculated similarities as the similarity between the first image feature string and the second image feature string;
determining a target second video from the plurality of second videos according to the similarity between the first image feature string and the respective second image feature strings of the plurality of second videos; wherein the second image feature strings of the plurality of second videos are maintained in a database in advance, and video information corresponding to each second image feature string is also stored in the database in an associated manner; each second image feature string contains image features of one or more image frames of the corresponding second video; wherein the image features of each of the one or more image frames of the second video comprise: the name of an object contained in the image frame, and location information of the object in the image frame; the location information comprises: a relative coordinate value of a center point of the object relative to a reference point in the image frame, the relative coordinate value being determined according to a ratio of the pixel coordinate values of the center point of the object to the width and the height of the image frame;
a feedback module, configured to return the video information of the target second video to the terminal device as a search result;
the determination module is further to:
determining a similarity between an ith image feature in the first image feature string and an ith image feature in the set of image features of the second image feature string; wherein i is a positive integer and the initial value is 1;
recording the similarity between the ith image feature in the first image feature string and the ith image feature in the group of image features of the second image feature string if the similarity between the ith image feature in the first image feature string and the ith image feature in the group of image features of the second image feature string is not lower than a second similarity threshold; letting i=i+1 and returning to the step of determining the similarity between the i-th image feature in the first image feature string and the i-th image feature in the set of image features of the second image feature string until i=m;
and determining the similarity between the first image feature string and the set of image features of the second image feature string according to the recorded m similarities between the m image features of the first image feature string and the m image features of the set.
10. The apparatus of claim 9, wherein the acquisition module is further to:
determining image features of a first image frame in the first video, and adding the image features of the first image frame into the first image feature string;
and acquiring a second image frame after the first image frame, comparing the second image frame with the first image frame, and determining whether to add the image features of the second image frame into the first image feature string according to a comparison result.
11. A non-transitory computer readable storage medium having stored therein machine readable instructions executable by a processor to perform the method of any of claims 1-8.
CN201811322381.XA 2018-11-08 2018-11-08 Video searching method, device and storage medium Active CN110209881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811322381.XA CN110209881B (en) 2018-11-08 2018-11-08 Video searching method, device and storage medium


Publications (2)

Publication Number Publication Date
CN110209881A CN110209881A (en) 2019-09-06
CN110209881B (en) 2023-05-12

Family

ID=67779821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811322381.XA Active CN110209881B (en) 2018-11-08 2018-11-08 Video searching method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110209881B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112468843A (en) * 2020-10-26 2021-03-09 国家广播电视总局广播电视规划院 Video duplicate removal method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103475935A (en) * 2013-09-06 2013-12-25 北京锐安科技有限公司 Method and device for retrieving video segments
CN107025275A (en) * 2017-03-21 2017-08-08 腾讯科技(深圳)有限公司 Video searching method and device
US20180012075A1 (en) * 2015-02-03 2018-01-11 Sony Interactive Entertainment Inc. Video processing device


Also Published As

Publication number Publication date
CN110209881A (en) 2019-09-06

Similar Documents

Publication Publication Date Title
US10922350B2 (en) Associating still images and videos
CN108304435B (en) Information recommendation method and device, computer equipment and storage medium
US9436707B2 (en) Content-based image ranking
EP2124159B1 (en) Image learning, automatic annotation, retrieval method, and device
JP6196316B2 (en) Adjusting content distribution based on user posts
US8666927B2 (en) System and method for mining tags using social endorsement networks
US9076069B2 (en) Registering metadata apparatus
US10210179B2 (en) Dynamic feature weighting
US20160267178A1 (en) Video retrieval based on optimized selected fingerprints
Ye et al. Ranking optimization for person re-identification via similarity and dissimilarity
CN111062871A (en) Image processing method and device, computer equipment and readable storage medium
US20160210367A1 (en) Transition event detection
CN114339360B (en) Video processing method, related device and equipment
CN110765286A (en) Cross-media retrieval method and device, computer equipment and storage medium
CN106407268B (en) Content retrieval method and system based on coverage optimization method
CN112765387A (en) Image retrieval method, image retrieval device and electronic equipment
TW201931163A (en) Image search and index building
CN110209780B (en) Question template generation method and device, server and storage medium
CN110209881B (en) Video searching method, device and storage medium
CN111918104A (en) Video data recall method and device, computer equipment and storage medium
CN111597444B (en) Searching method, searching device, server and storage medium
US20200074218A1 (en) Information processing system, information processing apparatus, and non-transitory computer readable medium
CN110147488B (en) Page content processing method, processing device, computing equipment and storage medium
CN113806588B (en) Method and device for searching video
CN112434533B (en) Entity disambiguation method, entity disambiguation device, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant