CN113190695A - Multimedia data searching method and device, computer equipment and medium

Info

Publication number: CN113190695A
Application number: CN202110492379.2A
Authority: CN
Inventors: 刘俊启
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-05-06
Filing date: 2021-05-06
Publication date: 2021-07-30
Anticipated expiration: 2041-05-06
Also published as: CN113190695B

Abstract

The disclosure provides a multimedia data search method and apparatus, a computer device, and a medium, and relates to the technical field of artificial intelligence, in particular to intelligent search. The scheme is implemented as follows: acquiring multimedia data to be searched, which comprises a video to be searched and an audio to be searched that are synchronized in time; extracting a first key frame sequence from the video to be searched according to a preset rule, based on the change of a sound feature of the audio to be searched in the time domain; acquiring a second key frame sequence corresponding to each of a plurality of candidate multimedia data, wherein each second key frame sequence is extracted according to the same preset rule; and determining, among the plurality of candidate multimedia data, the multimedia data that matches the multimedia data to be searched, based on the first key frame sequence and the second key frame sequence corresponding to each candidate multimedia data.

Description

Multimedia data searching method and device, computer equipment and medium

Technical Field

The present disclosure relates to the field of artificial intelligence technology, in particular to the field of intelligent search, and more particularly to a method, apparatus, electronic device, computer-readable storage medium, and computer program product for multimedia data searching.

Background

Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.

Compared with information transmission media such as text and images, multimedia can provide richer information content. In some scenarios, a user may wish to search for particular multimedia data. Existing multimedia data search methods are generally text-based, i.e., search results are obtained by matching search words input by the user against the text labels of the respective multimedia data in a multimedia database. This search mode is independent of the content of the multimedia data and depends only on the accuracy of the search words input by the user and of the labels annotated on the multimedia data, so the search results often fail to satisfy the user.

The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.

Disclosure of Invention

The present disclosure provides a method, an apparatus, a computer device, a computer readable storage medium, and a computer program product for multimedia data searching.

According to an aspect of the present disclosure, there is provided a multimedia data search method including: acquiring multimedia data to be searched, wherein the multimedia data to be searched comprises a video to be searched and an audio to be searched which are synchronized in time; based on the change of the sound characteristics of the audio to be searched in the time domain, extracting a first key frame sequence from the video to be searched of the multimedia data to be searched according to a preset rule; acquiring a second key frame sequence corresponding to each candidate multimedia data in the plurality of candidate multimedia data, wherein the second key frame sequence of each candidate multimedia data is obtained by extraction according to the preset rule; and determining the multimedia data matched with the multimedia data to be searched in the candidate multimedia data based on the first key frame sequence and the second key frame sequence corresponding to each candidate multimedia data in the candidate multimedia data.

According to another aspect of the present disclosure, there is provided a multimedia data search apparatus including: a first obtaining unit configured to obtain multimedia data to be searched, wherein the multimedia data to be searched comprises a video to be searched and an audio to be searched which are synchronized in time; a first extraction unit configured to extract a first key frame sequence from the video to be searched of the multimedia data to be searched according to a preset rule, based on the change of the sound characteristics of the audio to be searched in the time domain; a second obtaining unit configured to obtain a second key frame sequence corresponding to each candidate multimedia data in a plurality of candidate multimedia data, wherein the second key frame sequence of each candidate multimedia data is extracted according to the preset rule; and a determining unit configured to determine the multimedia data matched with the multimedia data to be searched in the plurality of candidate multimedia data, based on the first key frame sequence and the second key frame sequence corresponding to each candidate multimedia data in the plurality of candidate multimedia data.

According to another aspect of the present disclosure, there is provided a computer device including: a memory, a processor and a computer program stored on the memory, wherein the processor is configured to execute the computer program to implement the steps of the above method.

According to another aspect of the present disclosure, a non-transitory computer readable storage medium is provided, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method described above.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program realizes the steps of the above-described method when executed by a processor.

According to one or more embodiments of the present disclosure, a multimedia data search scheme of "searching multimedia with multimedia" based on sound characteristics is provided. The matching between the multimedia data is executed based on the key frame sequences respectively extracted from the multimedia data to be searched and each candidate multimedia data, so that the data amount required to be processed in the process of searching the multimedia data is effectively reduced, the computing resources are saved, and the multimedia data searching efficiency is improved.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.

FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;

FIG. 2 illustrates a flowchart of a multimedia data search method according to an embodiment of the present disclosure;

FIG. 3 illustrates a schematic diagram of the types of matching between multimedia data to be searched and candidate multimedia data according to an embodiment of the present disclosure;

FIG. 4 illustrates a block diagram of a multimedia data search apparatus according to an embodiment of the present disclosure;

FIG. 5 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.

The terminology used in the description of the various examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.

In an embodiment of the present disclosure, the server 120 may run one or more services or software applications that enable the multimedia data search method to be performed.

In some embodiments, the server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.

In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.

The user may use the client device 101, 102, 103, 104, 105, and/or 106 to obtain multimedia data to be searched or to present page information regarding search results. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.

Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as Microsoft Windows, Apple iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., Google Chrome OS); or include various mobile operating systems, such as Microsoft Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.

Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.

The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.

The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.

In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.

In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or a smart cloud computing server or a smart cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services.

The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The databases 130 may reside in various locations. For example, a database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The databases 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.

In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.

The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.

For purposes of the disclosed embodiments, client applications for conducting multimedia data searches may be included in the client devices 101, 102, 103, 104, 105, and 106 in the example of fig. 1. The client application may be, for example, an application program that needs to be downloaded and installed before running, a video website that can be accessed through a browser, a lightweight applet that runs in a host application, and the like. The client application may provide various functions based on the multimedia data, such as searching, viewing, uploading, downloading, clipping, etc. of the multimedia data. Accordingly, server 120 may be a server for use with the client application. The server 120 may provide multimedia services to client applications running in the client devices 101, 102, 103, 104, 105 and 106 based on stored multimedia data assets, multimedia clipping tools, etc. Specifically, the server 120 may execute the multimedia data searching method 200 according to the embodiment of the present disclosure based on the stored multimedia resources, provide a multimedia data searching service to the user, and implement a fast and accurate multimedia data search.

Fig. 2 is a flowchart illustrating a multimedia data searching method 200 according to an exemplary embodiment. The method 200 may be performed at a server (e.g., the server 120 shown in fig. 1), that is, the execution subject of the steps of the method 200 may be the server 120 shown in fig. 1.

As shown in fig. 2, the method 200 includes:

step 201, acquiring multimedia data to be searched, wherein the multimedia data to be searched comprises a video to be searched and an audio to be searched which are synchronized in time;

step 202, based on the change of the sound characteristics of the audio to be searched in the time domain, extracting a first key frame sequence from the video to be searched of the multimedia data to be searched according to a preset rule;

step 203, acquiring a second key frame sequence corresponding to each candidate multimedia data in the plurality of candidate multimedia data, wherein the second key frame sequence of each candidate multimedia data is extracted according to a preset rule; and

step 204, determining multimedia data matched with the multimedia data to be searched in the candidate multimedia data based on the first key frame sequence and a second key frame sequence corresponding to each candidate multimedia data in the candidate multimedia data.
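The four steps above can be read as a simple pipeline. The following is a minimal sketch only, assuming each key frame has already been reduced to a feature vector; the names `search_multimedia`, `sequences_match`, and `share_a_frame` are hypothetical and do not come from the disclosure, and the toy matching condition stands in for whatever preset matching condition an implementation actually uses.

```python
from typing import Callable, Dict, List, Sequence

Frame = Sequence[float]  # hypothetical: a key frame reduced to a feature vector


def search_multimedia(
    query_key_frames: List[Frame],
    candidate_key_frames: Dict[str, List[Frame]],
    sequences_match: Callable[[List[Frame], List[Frame]], bool],
) -> List[str]:
    """Return the ids of candidates whose key frame sequence matches the query.

    query_key_frames     -- first key frame sequence (output of step 202)
    candidate_key_frames -- second key frame sequences keyed by candidate id (step 203)
    sequences_match      -- the preset matching condition applied in step 204
    """
    return [
        candidate_id
        for candidate_id, second_sequence in candidate_key_frames.items()
        if sequences_match(query_key_frames, second_sequence)
    ]


# Toy matching condition, for illustration only: the sequences share a frame.
def share_a_frame(first: List[Frame], second: List[Frame]) -> bool:
    return any(frame in second for frame in first)


candidates = {"movie_a": [(1.0,), (2.0,)], "movie_b": [(9.0,)]}
result = search_multimedia([(2.0,), (3.0,)], candidates, share_a_frame)
```

Only key frame sequences are compared here, never full videos, which is where the reduction in processed data described in the disclosure comes from.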

According to an embodiment of the present disclosure, there is provided a multimedia data search scheme of "searching multimedia with multimedia" based on sound characteristics in the multimedia data. The matching can be executed based on the key frame sequences respectively extracted from the multimedia data to be searched and each candidate multimedia data, so that the data amount required to be processed in the multimedia data searching process is effectively reduced, the computing resources are saved, and the multimedia data searching efficiency is improved.

The multimedia data searching method 200 of the embodiment of the disclosure relates to the technical field of multimedia data processing, in particular to artificial intelligence and computer vision technology. The method 200 can be applied in a multimedia data understanding scenario, for example, for searching for multimedia data matching with user-specified multimedia data to be searched, or making a multimedia data recommendation to a user based on multimedia data content, etc.

For example, in some scenarios, a user may view a highlight segment of a movie or television show through some channel and wish to search for the movie or television episode containing that segment. In this case, the multimedia data to be searched is the highlight segment; the candidate multimedia data may be all multimedia data stored in a server or an associated database, or all movie/television videos, or movie/television videos of a certain genre (e.g., comedy, suspense, etc.) or starring certain actors, etc.; the multimedia data matched with the multimedia data to be searched is the multimedia data, found among the candidate multimedia data, that matches the highlight segment.

For another example, in other scenarios, the user may view a highlight segment of a sporting event through some channel, such as a goal in a football match or a dunk in a basketball game, and may wish to search for a more complete video of the event that includes the highlight segment, such as multimedia data of the full event or of one half of the event. In this case, the multimedia data to be searched is the highlight segment of the sporting event; the candidate multimedia data may be all multimedia data stored in the server or an associated database, or all sporting event multimedia data, multimedia data of a certain type of sporting event, or the like; the multimedia data matched with the multimedia data to be searched is the event multimedia data, found among the candidate multimedia data, that matches the highlight segment.

The various steps of method 200 are described in detail below.

Referring to fig. 2, the multimedia data to be searched in step 201 may be acquired in various ways.

According to some embodiments, a user may upload multimedia data to be searched through a client device and initiate a multimedia data search request requesting a search for multimedia data matching the multimedia data to be searched. Accordingly, in step 201, the server may directly obtain the multimedia data to be searched uploaded by the user.

According to other embodiments, a user may specify an address of multimedia data to be searched through a client device and initiate a multimedia data search request. Accordingly, the server may acquire multimedia data to be searched from the corresponding address in step 201.

According to still other embodiments, the server may treat any multimedia data viewed by the user as the multimedia data to be searched without specification by the user. In this case, the server may determine multimedia data matching the multimedia data to be searched through the method 200, and push the multimedia data to the client device to provide the multimedia intelligent recommendation service to the user.

According to some embodiments, the video to be searched and the audio to be searched included in the multimedia data to be searched are kept synchronized by respective time stamps.

In step 202, according to some embodiments, extracting a first key frame sequence from the video to be searched of the multimedia data to be searched according to a preset rule based on the change of a sound feature of the audio to be searched in the time domain may include: determining at least one silent period in the multimedia data to be searched based on the change of the sound feature of the audio to be searched in the time domain; and, for each silent period of the at least one silent period, determining at least one video frame close to that silent period in the video to be searched as a first key frame, so as to form the first key frame sequence. Therefore, the key frames can be extracted according to the silent periods in the multimedia data, and the data volume required to be processed for the subsequent matching between multimedia data is reduced.

Because the non-silent periods in the multimedia data usually carry richer information content, extracting video frames with reference to the silent periods makes the extraction targeted: frames that fall within silent periods are avoided, so that the extracted key frames capture more of the information in the multimedia data.

According to some embodiments, a video frame adjacent to one of two ends of the silent period in the video to be searched can be determined as the first key frame.

According to some embodiments, a video frame having a predetermined distance from one of the two ends of the silent period in the video to be searched may be determined as the first key frame.
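The two placement rules above (a frame adjacent to an end of the silent period, or a frame at a predetermined distance from an end) can be sketched as follows. This is a minimal illustration, assuming frame timestamps in seconds sorted in ascending order and silent periods given as (start, end) pairs on the same timeline; the function name `key_frames_near_silences` and the parameters `offset_s` and `use_end` are hypothetical and not part of the disclosure.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def key_frames_near_silences(
    frame_times: List[float],
    silent_periods: List[Tuple[float, float]],
    offset_s: float = 0.0,
    use_end: bool = True,
) -> List[int]:
    """Return one key frame index per silent period.

    offset_s = 0 selects the frame adjacent to the chosen end of the period;
    offset_s > 0 selects a frame at that distance outside the period.
    """
    key_frames: List[int] = []
    for start, end in silent_periods:
        if use_end:
            # First frame at or after (end + offset).
            idx = bisect_left(frame_times, end + offset_s)
        else:
            # Last frame at or before (start - offset).
            idx = bisect_right(frame_times, start - offset_s) - 1
        # Skip periods whose target falls outside the video, and duplicates.
        if 0 <= idx < len(frame_times) and idx not in key_frames:
            key_frames.append(idx)
    return key_frames


# Ten frames at 4 fps: 0.0, 0.25, ..., 2.25 seconds.
frame_times = [i * 0.25 for i in range(10)]
silences = [(0.3, 0.6), (1.4, 2.0)]
adjacent = key_frames_near_silences(frame_times, silences)
shifted = key_frames_near_silences(frame_times, silences, offset_s=0.25)
before = key_frames_near_silences(frame_times, silences, use_end=False)
```

Whichever rule is chosen, the same rule has to be applied to the query and to every candidate, since the disclosure requires both key frame sequences to be extracted according to the same preset rule.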

According to some embodiments, determining at least one silent period in the multimedia data to be searched based on the change of the sound feature of the audio to be searched in the time domain may include: identifying at least one silent period in the multimedia data to be searched using voice activity detection (VAD), based on the change of the sound feature of the audio to be searched in the time domain. Thereby, silent periods in the multimedia data to be searched can be identified easily.

According to some embodiments, the sound feature includes an energy value, and determining at least one silent period in the multimedia data to be searched based on the change of the energy value of the audio to be searched in the time domain may include: for any time period within the duration of the multimedia data to be searched, in response to the energy value of the audio to be searched in that time period being not greater than a first preset energy threshold and the energy value of the audio to be searched at a time point close to that time period being greater than a second preset energy threshold, determining the time period as a silent period in the multimedia data to be searched, wherein the first preset energy threshold is not greater than the second preset energy threshold. Accordingly, silent periods in the multimedia data can be effectively identified based on the change of the real-time energy value of the audio to be searched.
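The two-threshold rule above can be sketched over a list of per-window energy values. This is an illustrative sketch under stated assumptions: the audio has already been reduced to one energy value per fixed-length window, "a time point close to the time period" is interpreted as the window immediately before or after a quiet run, and the names `find_silent_periods`, `low`, `high`, and `window_s` are hypothetical.

```python
from typing import List, Tuple


def find_silent_periods(
    energies: List[float],
    low: float,
    high: float,
    window_s: float = 1.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of quiet runs bordered by loud audio.

    low  -- first preset energy threshold (a window is quiet if energy <= low)
    high -- second preset energy threshold (a neighbor is loud if energy > high)
    """
    if low > high:
        raise ValueError("the first threshold must not exceed the second")
    periods: List[Tuple[float, float]] = []
    i, n = 0, len(energies)
    while i < n:
        if energies[i] > low:
            i += 1
            continue
        # Extend the quiet run over windows i .. j-1.
        j = i
        while j < n and energies[j] <= low:
            j += 1
        loud_before = i > 0 and energies[i - 1] > high
        loud_after = j < n and energies[j] > high
        if loud_before or loud_after:
            periods.append((i * window_s, j * window_s))
        i = j
    return periods


energies = [0.9, 0.8, 0.05, 0.02, 0.7, 0.3, 0.04, 0.3]
periods = find_silent_periods(energies, low=0.1, high=0.5, window_s=0.5)
```

In this example the second quiet window (index 6) is not reported, because neither neighboring window exceeds the second threshold; requiring a loud neighbor filters out stretches that are merely quiet throughout.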

According to some embodiments, the sound feature may further include loudness, and determining at least one silent period in the multimedia data to be searched based on the change of the loudness of the audio to be searched in the time domain may include: for any time period within the duration of the multimedia data to be searched, in response to the loudness of the audio to be searched in that time period being not greater than a third preset energy threshold and the loudness of the audio to be searched at a time point close to that time period being greater than a fourth preset energy threshold, determining the time period as a silent period in the multimedia data to be searched, wherein the third preset energy threshold is not greater than the fourth preset energy threshold.
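The disclosure does not say how loudness is measured. As a rough illustration only, the sketch below uses the RMS level of each window in decibels relative to full scale as a crude stand-in for loudness; this is an assumption, not a perceptual loudness measure, and the function name `window_levels_db` and the -40 dB threshold are hypothetical.

```python
import math
from typing import List


def window_levels_db(samples: List[float], window: int) -> List[float]:
    """RMS level of each full window, in dB relative to full scale (1.0)."""
    levels: List[float] = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        rms = math.sqrt(sum(s * s for s in chunk) / window)
        # Clamp to avoid log10(0) for digital silence.
        levels.append(20.0 * math.log10(max(rms, 1e-10)))
    return levels


# Three windows of four samples: moderate level, digital silence, full scale.
samples = [0.5, -0.5, 0.5, -0.5] + [0.0] * 4 + [1.0, -1.0, 1.0, -1.0]
levels = window_levels_db(samples, window=4)
quiet = [level <= -40.0 for level in levels]
```

The resulting per-window levels can then be compared against the third and fourth thresholds in the same way as the energy values are compared against the first and second.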

According to some embodiments, extracting the first key frame sequence from the video to be searched of the multimedia data to be searched according to the preset rule based on the change of the sound feature of the audio to be searched in the time domain may include: performing voice recognition on the audio to be searched based on the change of the sound characteristics of the audio to be searched in the time domain; in response to the fact that the voice recognition result of the audio to be searched comprises preset characters, determining time points corresponding to the recognized preset characters in the audio to be searched as key time points; and extracting a first key frame sequence from the video to be searched based on the determined key time point. Therefore, the key frame can be extracted according to the characters recognized by the voice recognition technology, and the data volume required to be processed for matching between subsequent multimedia data is reduced.
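The speech recognition variant can be sketched as a filter over a timestamped recognition result. This is a minimal sketch, assuming the recognizer returns (token, time in seconds) pairs; no particular speech recognition library is implied, and the names `key_time_points` and `frame_indices` as well as the sample data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def key_time_points(
    recognized: Iterable[Tuple[str, float]],
    preset_tokens: Set[str],
) -> List[float]:
    """Return the sorted times at which any preset token was recognized."""
    return sorted(t for token, t in recognized if token in preset_tokens)


def frame_indices(time_points: List[float], fps: float) -> List[int]:
    """Map each key time point to the nearest video frame index."""
    return [round(t * fps) for t in time_points]


# Hypothetical recognition output: (token, time in seconds).
recognized = [("I", 0.4), ("want", 0.7), ("home", 1.32), ("you", 2.8)]
points = key_time_points(recognized, {"you", "I", "home"})
frames = frame_indices(points, fps=25.0)
```

Mapping a time point to a frame index relies on the audio and video sharing one timeline, which the disclosure provides through synchronized time stamps.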

According to some embodiments, the preset character may be a specific character designated in advance, for example, a character that is frequently used in daily life such as "you", "I", "home", and the like.

According to some embodiments, the preset character may also be a high frequency character in the audio to be searched. Specifically, according to the voice recognition result of the audio to be searched, at least one character with a higher frequency of occurrence in the recognition result is determined as a preset character.
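Selecting high-frequency characters from a recognition result amounts to a frequency count. The sketch below is illustrative only: it assumes the recognition result is available as a list of tokens, and the function name `high_frequency_tokens` and the cutoff `top_n` are hypothetical, since the disclosure does not define how many characters count as "higher frequency".

```python
from collections import Counter
from typing import List, Set


def high_frequency_tokens(tokens: List[str], top_n: int = 2) -> Set[str]:
    """Return the top_n most frequent tokens in a recognition result."""
    counts = Counter(tokens)
    return {token for token, _count in counts.most_common(top_n)}


tokens = ["go", "team", "go", "ball", "go", "team"]
preset = high_frequency_tokens(tokens, top_n=2)
```

One point to note about this variant: a preset derived from the query audio has to be reused when extracting key frames from the candidates, otherwise the two sequences would not follow the same preset rule.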

In step 203, a corresponding second key frame sequence may be extracted from each candidate multimedia data of the plurality of candidate multimedia data based on the same preset rule. The specific manner of extracting the second key frame sequence is the same as the manner of extracting the first key frame sequence, and is not described herein again.

According to some embodiments, for each candidate multimedia data in the plurality of candidate multimedia data, the second key frame sequence corresponding to the candidate multimedia data may be extracted from the candidate multimedia data in advance based on a preset rule. Therefore, by extracting the second key frame sequence corresponding to each candidate multimedia data in advance, the extraction of key frames of the candidate multimedia data can be avoided during each search, and the search efficiency is improved.

According to some embodiments, the second key frame sequence corresponding to each candidate multimedia data is stored in the designated database in advance, and the stored second key frame sequence is associated with the candidate multimedia data corresponding to the stored second key frame sequence.
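Storing the second key frame sequences in advance could look like the following. This is a sketch under assumptions: an SQLite table stands in for the designated database, each key frame is represented by a feature vector serialized as JSON, and the table name `key_frames` and the functions `build_index` and `load_sequence` are hypothetical.

```python
import json
import sqlite3
from typing import Dict, List, Optional

Sequence2 = List[List[float]]  # hypothetical: one feature vector per key frame


def build_index(conn: sqlite3.Connection, sequences: Dict[str, Sequence2]) -> None:
    """Store each candidate's key frame sequence under its candidate id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS key_frames ("
        "candidate_id TEXT PRIMARY KEY, sequence TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO key_frames VALUES (?, ?)",
        [(cid, json.dumps(seq)) for cid, seq in sequences.items()],
    )
    conn.commit()


def load_sequence(conn: sqlite3.Connection, candidate_id: str) -> Optional[Sequence2]:
    """Fetch a stored sequence, or None if the candidate is not indexed."""
    row = conn.execute(
        "SELECT sequence FROM key_frames WHERE candidate_id = ?", (candidate_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")
build_index(conn, {"movie_a": [[0.1, 0.2], [0.3, 0.4]]})
stored = load_sequence(conn, "movie_a")
```

With such an index, a search only extracts key frames from the query; the candidates' sequences are read back by id.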

In step 204, according to some embodiments, determining the multimedia data matching the multimedia data to be searched from the plurality of candidate multimedia data, based on the first key frame sequence and the second key frame sequence corresponding to each candidate multimedia data, may include: for each candidate multimedia data in the plurality of candidate multimedia data, determining the candidate multimedia data as multimedia data matched with the multimedia data to be searched in response to the second key frame sequence corresponding to the candidate multimedia data and the first key frame sequence satisfying a preset matching condition. Therefore, the matching relation between the multimedia data to be searched and the candidate multimedia data can be judged based on the matching result between the first key frame sequence and the second key frame sequence, and the calculation amount of multimedia data searching is effectively reduced.

A match between the candidate multimedia data and the multimedia data to be searched may mean that the two have portions that overlap each other. Specifically, the match may be of various types. For example, as shown in fig. 3, candidate multimedia data A is contained in the multimedia data to be searched, candidate multimedia data B contains the entire multimedia data to be searched, and candidate multimedia data C partially overlaps the multimedia data to be searched. Candidate multimedia data A, B, and C above all match the multimedia data to be searched.

According to some embodiments, the second key frame sequence corresponding to the candidate multimedia data and the first key frame sequence satisfying the preset matching condition may include: for a second subsequence of a first preset length in the second key frame sequence, a first subsequence of the first preset length corresponding to the second subsequence exists in the first key frame sequence, wherein the similarity between each pair of sequentially corresponding frames in the second subsequence and the first subsequence is greater than a preset threshold.

It is to be understood that the first subsequence in the first sequence of key frames and the second subsequence in the second sequence of key frames may be located at different relative positions in the first sequence of key frames and the second sequence of key frames, respectively. By comparing the similarity between every two frames sequentially corresponding to the second subsequence and the first subsequence, whether the candidate multimedia data and the multimedia data to be searched have an overlapping part with a first preset length can be judged, and whether the candidate multimedia data is matched with the multimedia data to be searched can be further determined.

It is understood that the actual overlapping portion between the candidate multimedia data and the multimedia data to be searched may be longer than the first preset length. In the exemplary embodiment of the disclosure, once the similarity between each pair of sequentially corresponding frames in the first subsequence and the second subsequence of the first preset length is greater than the preset threshold, it can be determined that the candidate multimedia data and the multimedia data to be searched have an overlapping portion, without comparing additional video frames, so that the multimedia data search efficiency can be effectively improved.
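
The subsequence comparison described above may be sketched as follows. This is an illustration only: it assumes that a subsequence is a run of consecutive key frames and that frame similarity is supplied by a caller-provided function, and the name `has_matching_subsequence` is hypothetical.

```python
from typing import Callable, Sequence, TypeVar

F = TypeVar("F")


def has_matching_subsequence(first: Sequence[F], second: Sequence[F], length: int,
                             similarity: Callable[[F, F], float],
                             threshold: float) -> bool:
    """True if some run of `length` consecutive frames in `second` lines up,
    frame by frame, with some run of the same length in `first`."""
    if length <= 0 or length > len(first) or length > len(second):
        return False
    for j in range(len(second) - length + 1):      # start of run in second sequence
        for i in range(len(first) - length + 1):   # start of run in first sequence
            if all(similarity(first[i + k], second[j + k]) > threshold
                   for k in range(length)):
                return True  # stop early: no further frames need comparing
    return False
```

The two start positions vary independently, which reflects that the subsequences may sit at different relative positions in their respective sequences.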

According to some embodiments, the second key frame sequence corresponding to the candidate multimedia data and the first key frame sequence satisfying the preset matching condition may include: for a second subsequence of a second preset length located at one end of the second key frame sequence, a first subsequence of the second preset length corresponding to the second subsequence exists at the opposite end of the first key frame sequence, wherein the similarity between each pair of sequentially corresponding frames in the second subsequence and the first subsequence is greater than a preset threshold, and the second preset length is smaller than the first preset length.

When the matching relationship between the candidate multimedia data and the multimedia data to be searched is a partial overlap, for example, as shown in the relationship between the candidate multimedia data C and the multimedia data to be searched in fig. 3, the overlap portion between the candidate multimedia data and the multimedia data to be searched is limited, and may not reach the first preset length. Therefore, in this case, it is possible to avoid missing candidate multimedia data partially overlapping with the multimedia data to be searched in the process of multimedia data search by reducing the requirement for the length of the overlapping portion, i.e., making the second preset length smaller than the first preset length.
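
The end-to-end comparison for the partial-overlap case may be sketched as follows, again assuming runs of consecutive key frames and a caller-provided similarity function. The name `ends_overlap` is hypothetical; `length` plays the role of the second preset length.

```python
from typing import Callable, Sequence, TypeVar

F = TypeVar("F")


def ends_overlap(first: Sequence[F], second: Sequence[F], length: int,
                 similarity: Callable[[F, F], float], threshold: float) -> bool:
    """True if one end of `second` lines up with the opposite end of `first`."""
    if length <= 0 or length > min(len(first), len(second)):
        return False

    def aligned(a: Sequence[F], b: Sequence[F]) -> bool:
        return all(similarity(x, y) > threshold for x, y in zip(a, b))

    # Tail of the second sequence against the head of the first sequence ...
    tail_to_head = aligned(first[:length], second[-length:])
    # ... or head of the second sequence against the tail of the first sequence.
    head_to_tail = aligned(first[-length:], second[:length])
    return tail_to_head or head_to_tail
```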

According to some embodiments, the similarity of the images may be determined by a structural similarity algorithm (SSIM), a peak signal-to-noise ratio algorithm (PSNR), or various machine learning methods, which are not limited herein.
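
For illustration, PSNR between two frames may be computed as sketched below. The sketch assumes 8-bit grayscale frames flattened into equal-length sequences of pixel intensities; the function name `psnr` and the default `max_value` are assumptions rather than details from the disclosure. A higher value indicates more similar frames.

```python
import math
from typing import Sequence


def psnr(frame_a: Sequence[int], frame_b: Sequence[int], max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equal-sized frames
    given as flat sequences of pixel intensities."""
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and of equal size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```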

According to some embodiments, in a case where the first key frame sequence includes a timestamp of each first key frame in the multimedia data to be searched, and the second key frame sequence includes a timestamp of each second key frame in the corresponding candidate multimedia data, the second key frame sequence corresponding to the candidate multimedia data and the first key frame sequence satisfying the preset matching condition may include: for a fourth subsequence of a third preset length in the second key frame sequence, a third subsequence of the third preset length corresponding to the fourth subsequence exists in the first key frame sequence, wherein for any two frames in the fourth subsequence, the time difference between the two frames is the same as the time difference between the two frames in the third subsequence that sequentially correspond to them.
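
The timestamp-based condition may be sketched as follows. Within a run of consecutive key frames, all pairwise time differences agree exactly when the gaps between neighboring frames agree, so the sketch compares neighboring gaps only. The `tolerance` parameter and the function names are assumptions added for illustration; the disclosure speaks of the time differences being the same.

```python
from typing import List, Sequence


def gaps(timestamps: Sequence[float]) -> List[float]:
    """Time differences between neighboring key frames."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


def has_matching_gaps(first_ts: Sequence[float], second_ts: Sequence[float],
                      length: int, tolerance: float = 1e-3) -> bool:
    """True if some run of `length` key frames in each sequence has the same spacing."""
    if length < 2 or length > min(len(first_ts), len(second_ts)):
        return False
    first_gaps, second_gaps = gaps(first_ts), gaps(second_ts)
    n = length - 1  # a run of `length` frames has length - 1 neighboring gaps
    for j in range(len(second_gaps) - n + 1):
        for i in range(len(first_gaps) - n + 1):
            if all(abs(first_gaps[i + k] - second_gaps[j + k]) <= tolerance
                   for k in range(n)):
                return True
    return False
```

Because only time differences are compared, the check is unaffected by the absolute offset at which the overlapping portion occurs in either item.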

According to some embodiments, whether the second key frame sequence corresponding to the candidate multimedia data and the first key frame sequence satisfy the preset matching condition may be determined based on a combination of the methods described in the above embodiments.

According to some embodiments, after the multimedia data matching the multimedia data to be searched is determined among the plurality of candidate multimedia data, page information is fed back based on the determined multimedia data matching the multimedia data to be searched, thereby enabling the user to intuitively acquire search result information.

According to some embodiments, matching type information between the candidate multimedia data and the multimedia data to be searched may be included in the page information. For example, the user may be visually presented with the overlapping portion between the multimedia data to be searched and the candidate multimedia data by way of illustration (as shown in fig. 3).

According to another aspect of the present disclosure, there is also provided a multimedia data searching apparatus 400, as shown in fig. 4, the apparatus 400 including:

a first obtaining unit 401 configured to obtain multimedia data to be searched, wherein the multimedia data to be searched includes a video to be searched and an audio to be searched that are synchronized in time;

a first extraction unit 402 configured to extract a first key frame sequence from a video to be searched of multimedia data to be searched according to a preset rule based on a change of a sound feature of the audio to be searched in a time domain;

a second obtaining unit 403, configured to obtain a second key frame sequence corresponding to each candidate multimedia data in the multiple candidate multimedia data, where the second key frame sequence of each candidate multimedia data is extracted according to a preset rule; and

a determining unit 404 configured to determine, based on the first key frame sequence and the second key frame sequence corresponding to each candidate multimedia data of the plurality of candidate multimedia data, multimedia data matching the multimedia data to be searched among the plurality of candidate multimedia data.

According to some embodiments, the first extraction unit comprises: a first determining subunit configured to determine at least one silent period in the multimedia data to be searched based on a change in a sound characteristic of the audio to be searched in a time domain; and a second determining subunit configured to determine, for each of the at least one silent period, at least one video frame close to the silent period in the video to be searched as a first key frame to constitute the first key frame sequence.
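
One way the second determining subunit could select frames is sketched below. The sketch assumes a constant frame rate and picks the single frame nearest to the start of each silent period; both choices, as well as the name `key_frame_indices`, are assumptions, since the disclosure only requires at least one frame close to the silent period.

```python
from typing import List, Sequence, Tuple


def key_frame_indices(silent_periods: Sequence[Tuple[float, float]],
                      fps: float, frame_count: int) -> List[int]:
    """Map each (start_s, end_s) silent period to the index of the video frame
    nearest to its start, clamped to the valid frame range."""
    if frame_count <= 0 or fps <= 0:
        raise ValueError("fps and frame_count must be positive")
    indices: List[int] = []
    for start_s, _end_s in silent_periods:
        idx = min(max(round(start_s * fps), 0), frame_count - 1)
        if not indices or indices[-1] != idx:  # skip consecutive duplicates
            indices.append(idx)
    return indices
```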

According to some embodiments, the first determining subunit is further configured to: identify at least one silent period in the multimedia data to be searched using voice activity detection (VAD), based on a change in a sound characteristic of the audio to be searched in a time domain.

According to some embodiments, the sound feature comprises an energy value, and the first determining subunit is further configured to: for any time period within the duration of the multimedia data to be searched, determine the time period as a silent period in the multimedia data to be searched in response to the energy value of the audio to be searched in the time period being not greater than a first preset energy threshold and the energy value of the audio to be searched at a time point close to the time period being greater than a second preset energy threshold, wherein the first preset energy threshold is not greater than the second preset energy threshold.
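
The two-threshold rule may be sketched as follows. The sketch assumes that the audio has already been reduced to one energy value per short analysis window and treats the window immediately before or after a quiet run as the "time point close to the time period"; these assumptions and the name `find_silent_periods` are illustrative only.

```python
from typing import List, Sequence, Tuple


def find_silent_periods(energy: Sequence[float], low: float,
                        high: float) -> List[Tuple[int, int]]:
    """Return [start, end) window-index ranges that count as silent periods.

    `low` is the first preset energy threshold, `high` the second."""
    if low > high:
        raise ValueError("first threshold must not exceed second threshold")
    periods: List[Tuple[int, int]] = []
    i, n = 0, len(energy)
    while i < n:
        if energy[i] > low:
            i += 1
            continue
        start = i
        while i < n and energy[i] <= low:  # extend the quiet run
            i += 1
        before_loud = start > 0 and energy[start - 1] > high
        after_loud = i < n and energy[i] > high
        if before_loud or after_loud:      # quiet run borders loud audio
            periods.append((start, i))
    return periods
```

Requiring loud audio next to the quiet run keeps long stretches of low-level background from being reported as silent periods.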

According to some embodiments, the first extraction unit comprises: an identification subunit configured to perform voice recognition on the audio to be searched based on the change of the sound characteristics of the audio to be searched in the time domain; a third determining subunit configured to determine, in response to the voice recognition result of the audio to be searched including a preset character, the time point corresponding to the recognized preset character in the audio to be searched as a key time point; and a fourth determining subunit configured to extract the first key frame sequence from the video to be searched based on the determined key time point.
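
The selection of key time points may be sketched as follows, assuming the voice recognition result is available as a list of (text, time in seconds) pairs. That result format and the name `key_time_points` are hypothetical; no particular recognition engine is implied.

```python
from typing import Iterable, List, Sequence, Tuple


def key_time_points(recognized: Sequence[Tuple[str, float]],
                    preset_characters: Iterable[str]) -> List[float]:
    """Return the time points at which any preset character was recognized."""
    wanted = set(preset_characters)
    return [time_s for text, time_s in recognized if text in wanted]
```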

According to some embodiments, the determining unit is further configured to: for each candidate multimedia data in the plurality of candidate multimedia data, determine the candidate multimedia data as multimedia data matching the multimedia data to be searched in response to the second key frame sequence corresponding to the candidate multimedia data and the first key frame sequence satisfying a preset matching condition.

According to some embodiments, the second key frame sequence corresponding to the candidate multimedia data and the first key frame sequence satisfying the preset matching condition includes: for a second subsequence of a preset length in the second key frame sequence, a first subsequence of the preset length corresponding to the second subsequence exists in the first key frame sequence, wherein the similarity between each pair of sequentially corresponding frames in the second subsequence and the first subsequence is greater than a preset threshold.

According to some embodiments, the multimedia data search apparatus further comprises: a second extraction unit configured to, for each candidate multimedia data in the plurality of candidate multimedia data, extract the second key frame sequence corresponding to the candidate multimedia data from the candidate multimedia data in advance based on the preset rule.

According to another aspect of the present disclosure, there is also provided a computer device comprising: a memory, a processor and a computer program stored on the memory, wherein the processor is configured to execute the computer program to implement the steps of the above method.

According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the above-described method.

According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program realizes the steps of the above-mentioned method when executed by a processor.

Referring to fig. 5, a block diagram of a computer device 500, which may be a server or a client of the present disclosure and which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 5, the device 500 comprises a computing unit 501 which may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506, an output unit 507, a storage unit 508, and a communication unit 509. The input unit 506 may be any type of device capable of inputting information to the device 500. The input unit 506 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote controller. The output unit 507 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 508 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth(TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.

The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the various methods and processes described above, such as multimedia data searching. For example, in some embodiments, the multimedia data search may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the multimedia data searching method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform a multimedia data search by any other suitable means (e.g., by way of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems and apparatus are merely exemplary embodiments or examples and that the scope of the present invention is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims

1. A multimedia data search method, comprising:

acquiring multimedia data to be searched, wherein the multimedia data to be searched comprises a video to be searched and an audio to be searched which are synchronous in time;

based on the change of the sound characteristics of the audio to be searched in the time domain, extracting a first key frame sequence from the video to be searched of the multimedia data to be searched according to a preset rule;

acquiring a second key frame sequence corresponding to each candidate multimedia data in the plurality of candidate multimedia data, wherein the second key frame sequence of each candidate multimedia data is obtained by extraction according to the preset rule; and

determining the multimedia data matched with the multimedia data to be searched in the plurality of candidate multimedia data based on the first key frame sequence and the second key frame sequence corresponding to each candidate multimedia data in the plurality of candidate multimedia data.

2. The method of claim 1, wherein the extracting a first key frame sequence from the video to be searched of the multimedia data to be searched according to a preset rule based on the change of the sound feature of the audio to be searched in the time domain comprises:

determining at least one silent period in the multimedia data to be searched based on the change of the sound characteristics of the audio to be searched in the time domain; and

for each silent period in the at least one silent period, determining at least one video frame close to the silent period in the video to be searched as a first key frame to form the first key frame sequence.

3. The method of claim 2, wherein the determining at least one silent period in the multimedia data to be searched based on the change of the sound characteristic of the audio to be searched in the time domain comprises:

identifying at least one silent period in the multimedia data to be searched by using voice activity detection (VAD) based on the change of the sound characteristics of the audio to be searched in the time domain.

4. The method of claim 2, wherein the sound feature comprises an energy value, and determining at least one silent period in the multimedia data to be searched based on a variation of the energy value of the audio to be searched in a time domain comprises:

for any time period within the duration of the multimedia data to be searched, in response to the energy value of the audio to be searched in the time period being not greater than a first preset energy threshold and the energy value of the audio to be searched at a time point close to the time period being greater than a second preset energy threshold, determining the time period as a silent period in the multimedia data to be searched, wherein the first preset energy threshold is not greater than the second preset energy threshold.

5. The method of claim 1, wherein the extracting a first key frame sequence from the video to be searched of the multimedia data to be searched according to a preset rule based on the change of the sound feature of the audio to be searched in the time domain comprises:

performing voice recognition on the audio to be searched based on the change of the sound characteristics of the audio to be searched in a time domain;

in response to the fact that the voice recognition result of the audio to be searched comprises preset characters, determining time points corresponding to the recognized preset characters in the audio to be searched as key time points; and

extracting the first key frame sequence from the video to be searched based on the determined key time points.

6. The method of claim 1, wherein the determining the multimedia data of the candidate multimedia data that matches the multimedia data to be searched based on the first sequence of key frames and a second sequence of key frames corresponding to each candidate multimedia data of the candidate multimedia data comprises:

for each candidate multimedia data in the plurality of candidate multimedia data, determining the candidate multimedia data as the multimedia data matched with the multimedia data to be searched in response to the second key frame sequence corresponding to the candidate multimedia data and the first key frame sequence satisfying a preset matching condition.

7. The method of claim 6, wherein the step of satisfying the preset matching condition between the second key frame sequence corresponding to the candidate multimedia data and the first key frame sequence comprises:

for a second subsequence of a preset length in the second key frame sequence, a first subsequence of the preset length corresponding to the second subsequence exists in the first key frame sequence, wherein the similarity between each pair of sequentially corresponding frames in the second subsequence and the first subsequence is greater than a preset threshold.

8. The method of claim 1, further comprising:

for each candidate multimedia data in the plurality of candidate multimedia data, extracting the second key frame sequence corresponding to the candidate multimedia data from the candidate multimedia data in advance based on the preset rule.

9. The method of claim 1, further comprising:

after determining the multimedia data matched with the multimedia data to be searched in the candidate multimedia data, feeding back page information based on the determined multimedia data matched with the multimedia data to be searched.

10. A multimedia data search apparatus comprising:

a first obtaining unit configured to obtain multimedia data to be searched, wherein the multimedia data to be searched comprises a video to be searched and an audio to be searched which are synchronous in time;

a first extraction unit configured to extract a first key frame sequence from the video to be searched of the multimedia data to be searched according to a preset rule based on the change of the sound characteristics of the audio to be searched in a time domain;

a second obtaining unit configured to obtain a second key frame sequence corresponding to each candidate multimedia data in a plurality of candidate multimedia data, wherein the second key frame sequence of each candidate multimedia data is obtained by extraction according to the preset rule; and

a determining unit, configured to determine, based on the first sequence of key frames and a second sequence of key frames corresponding to each candidate multimedia data in the plurality of candidate multimedia data, multimedia data in the plurality of candidate multimedia data that matches the multimedia data to be searched.

11. The apparatus of claim 10, wherein the first extraction unit comprises:

a first determining subunit, configured to determine at least one silent period in the multimedia data to be searched based on a change of a sound characteristic of the audio to be searched in a time domain; and

a second determining subunit, configured to determine, for each of the at least one silent period, at least one video frame close to the silent period in the video to be searched as a first key frame to form the first key frame sequence.

12. The apparatus of claim 11, wherein the first determining subunit is further configured to:

identify at least one silent period in the multimedia data to be searched using voice activity detection (VAD) based on a change of a sound feature of the audio to be searched in a time domain.

13. The apparatus of claim 11, wherein the sound feature comprises an energy value, the first determining subunit further configured to:

for any time period within the duration of the multimedia data to be searched, determine the time period as a silent period in the multimedia data to be searched in response to the energy value of the audio to be searched in the time period being not greater than a first preset energy threshold and the energy value of the audio to be searched at a time point close to the time period being greater than a second preset energy threshold, wherein the first preset energy threshold is not greater than the second preset energy threshold.

14. The apparatus of claim 10, wherein the first extraction unit comprises:

an identification subunit configured to perform voice recognition on the audio to be searched based on the change of the sound characteristics of the audio to be searched in the time domain;

a third determining subunit configured to determine, in response to the voice recognition result of the audio to be searched including preset characters, time points corresponding to the recognized preset characters in the audio to be searched as key time points; and

a fourth determining subunit, configured to extract the first key frame sequence from the video to be searched based on the determined key time point.

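15. The apparatus of claim 10, wherein the determining unit is further configured to:

for each candidate multimedia data in the plurality of candidate multimedia data, determine the candidate multimedia data as the multimedia data matched with the multimedia data to be searched in response to the second key frame sequence corresponding to the candidate multimedia data and the first key frame sequence satisfying a preset matching condition.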

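16. The apparatus of claim 15, wherein the second key frame sequence corresponding to the candidate multimedia data and the first key frame sequence satisfying the preset matching condition comprises:

for a second subsequence of a preset length in the second key frame sequence, a first subsequence of the preset length corresponding to the second subsequence exists in the first key frame sequence, wherein the similarity between each pair of sequentially corresponding frames in the second subsequence and the first subsequence is greater than a preset threshold.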

17. The apparatus of claim 10, further comprising:

a second extraction unit configured to, for each candidate multimedia data in the plurality of candidate multimedia data, extract the second key frame sequence corresponding to the candidate multimedia data from the candidate multimedia data in advance based on the preset rule.

18. A computer device, comprising:

a memory, a processor, and a computer program stored on the memory,

wherein the processor is configured to execute the computer program to implement the steps of the method of any one of claims 1-9.

19. A non-transitory computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method of any of claims 1-9.

20. A computer program product comprising a computer program, wherein the computer program realizes the steps of the method of any one of claims 1-9 when executed by a processor.