CN111694984A

CN111694984A - Video searching method and device, electronic equipment and readable storage medium

Info

Publication number: CN111694984A
Application number: CN202010535144.2A
Authority: CN
Inventors: 王璐; 杨羿; 裴中佑; 李�一; 贺翔; 朱延峰; 陈晓冬; 刘林
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-06-12
Filing date: 2020-06-12
Publication date: 2020-09-22
Anticipated expiration: 2040-06-12
Also published as: CN111694984B

Abstract

The application discloses a video searching method, a video searching device, electronic equipment and a readable storage medium, and relates to the field of computer vision. The specific implementation scheme is as follows: acquiring information to be searched; searching according to the information to be searched to obtain a target video; analyzing the target video to obtain video abstract information; the video summary information comprises at least one of: key text information, key frame images; and displaying the video abstract information. According to the scheme in the application, when the video is searched, the video abstract information of the target video obtained by searching can be directly displayed, so that the exposure of the effective content in the target video is improved, a user does not need to obtain the required content by watching the complete video, and the time overhead of the user is reduced.

Description

Video searching method and device, electronic equipment and readable storage medium

Technical Field

The present application relates to the field of computer technology, and more particularly, to the field of computer vision.

Background

Currently, when searching for a video, a search is usually performed according to information to be searched for a corresponding video. However, since the video itself has diversified contents and differentiated expressions, such as pictures, texts, actions, audio, etc., for the searched video, if the user wants to obtain the required key contents from the video, the user is required to watch the video completely and select the contents from the video. This will cause a large time overhead for the user.

Disclosure of Invention

The present disclosure provides a method, apparatus, device, and storage medium for video search.

According to an aspect of the present disclosure, there is provided a video search method including:

acquiring information to be searched;

searching according to the information to be searched to obtain a target video;

analyzing the target video to obtain video abstract information; wherein the video summary information comprises at least one of: key text information, key frame images;

and displaying the video abstract information.

Therefore, when the video search is carried out, the video abstract information of the target video obtained by the search can be directly displayed, so that the exposure of the effective content in the target video is improved, a user does not need to obtain the required content by watching the complete video, and the time overhead of the user is reduced.

According to another aspect of the present disclosure, there is provided a video search apparatus including:

the acquisition module is used for acquiring information to be searched;

the searching module is used for searching according to the information to be searched to obtain a target video;

the analysis module is used for analyzing the target video to obtain video abstract information; wherein the video summary information comprises at least one of: key text information, key frame images;

and the first display module is used for displaying the video summary information.

According to another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.

According to the technology of the application, the problem that a user is greatly subjected to time overhead when the content in the searched video is obtained at present is solved, and the user can obtain the required content without watching the complete video by directly displaying the video abstract information of the searched target video when the video is searched, so that the time overhead of the user is reduced.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

fig. 1 is a flowchart of a video search method according to an embodiment of the present application;

FIG. 2 is one of the contents presentation diagrams of an embodiment of the present application;

FIG. 3 is a second schematic illustration of a content display of an embodiment of the present application;

FIG. 4 is a block diagram of a video search apparatus according to an embodiment of the present application;

fig. 5 is a block diagram of an electronic device for implementing a video search method according to an embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The terms first, second and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. In the description and in the claims "and/or" means at least one of the connected objects.

Referring to fig. 1, fig. 1 is a flowchart of a video search method provided in an embodiment of the present application, where the method is applied to an electronic device, and as shown in fig. 1, the method includes the following steps:

step 101: and acquiring information to be searched.

In this embodiment, according to different application scenarios, the information to be searched can be acquired in different manners. For example, in a search scenario, content input by a user in an information input box may be determined as information to be searched; or in a recommendation scenario, the information to be searched may be determined based on user history search information, user browsing records, and/or user information, etc. Note that, as to the manner of acquiring the information to be searched, the present embodiment is not limited.

Step 102: and searching according to the information to be searched to obtain the target video.

In this embodiment, when searching is performed according to the information to be searched, searching may be performed in a preset video library. And the obtained target video is the video which is matched with the information to be searched in an associated manner. For example, if the information to be searched is the production method of the XX food, the corresponding target video is a video introducing the production method of the XX food. For another example, if the information to be searched is a YY toy, the corresponding target video may be an advertisement video related to the YY toy.

Step 103: and analyzing the target video to obtain video abstract information.

Optionally, the video summary information may include at least one of: key text information, key frame images. For example, the video summary information includes key text information parsed from the target video; or the video abstract information comprises one or more key frame images analyzed from the target video; alternatively, the video summary information includes key text information and key frame images, such as one or more images, parsed from the target video.

Step 104: and displaying the video abstract information.

Optionally, when the video summary information is displayed, the video summary information may be displayed on the search result display page.

According to the video searching method in the embodiment of the application, after the information to be searched is obtained, searching is carried out according to the information to be searched to obtain the target video, the target video is analyzed to obtain the video abstract information comprising the key text information and/or the key frame image, and the video abstract information is displayed. Therefore, when video searching is carried out, the video abstract information of the searched target video can be directly displayed, so that the exposure of effective content in the target video is improved, a user does not need to obtain required content by watching a complete video, and the time overhead of the user is reduced.

In the embodiment of the application, in order to enable the displayed video summary information to fully and exactly embody the target video, the target video can be analyzed from multiple angles, so that the content in the target video is displayed based on multi-mode video understanding, and a user can conveniently know the content in the target video. The plurality of angles includes, but is not limited to, text, audio, images, and the like.

Optionally, in a case that the video summary information includes the key text information, the process of obtaining the video summary information may include: identifying at least one of the following items in the target video to obtain key text information associated with the information to be searched: text content, audio, video frame images. Therefore, the target video can be analyzed from multiple angles, and the displayed video abstract information can fully and exactly represent the target video.

The following describes a process of parsing a target video by taking a text content tag, an audio tag, and a video frame image tag as examples.

< text content tag >

For text content in video, it can be generally classified into title (title), subtitle (subtitle), roll title (roll _ titles), advertisement logo (logo), and other text (docs) classes. The location, duration, meaning, etc. of each type of textual content typically differ. For example, the duration of the headline and the advertisement Logo is generally longer, while the duration of the subtitle and the roll-to-roll subtitle is shorter. For example, a caption generally indicates the subject of a video, and a caption, a scroll caption, or the like generally indicates the detailed content of the video. For another example, for subtitles, standard subtitles are a horizontal text sequence located below an image, and non-standard subtitles have vertical, non-uniform size, and native background text. Therefore, when the text content in the target video is identified, the text content can be classified and identified, so that the identification accuracy is improved.

Optionally, in a case that the video summary information includes the key text information, the process of obtaining the video summary information may include:

selecting a text box in a target video; the text box can be understood as an area containing text content on each frame of image in the target video, and can be detected by using a target detection method, such as a detection method based on a reference box anchor; further, after the text box is selected, a preset rule (for example, different image positions correspond to different preset scores, and/or different durations correspond to different preset scores, etc.) may be adopted to score each frame of text box (whether the text box is in the same frame may not be distinguished), and the text box with the score (i.e., confidence) larger than a preset threshold (for example, 0.85, or 0.9, etc.) is selected for subsequent processing;

classifying the selected text box based on the duration; in order to accurately classify the text boxes, the selected text boxes can be labeled by using the duration field, the text boxes with the same position, height and/or font are clustered, and the clustered text boxes are scored according to different durations so as to be classified; wherein each cluster is a text box of the same category as much as possible;

respectively carrying out text recognition on each type of text box to obtain text contents in each type of text box; for example, text Recognition may be performed using Optical Character Recognition (OCR); preferably, different recognition modes can be adopted according to different types of text boxes in consideration of respective characteristics;

based on the information to be searched, performing correlation analysis on the text content in each type of text box to obtain corresponding key text information; in order to enable the displayed key text information to reflect the target video more exactly, word segmentation processing can be performed on text contents in the text box by using a proper name recognition operator based on Natural Language Processing (NLPC) to obtain a key word group, and further key text information is obtained.

Therefore, the text content in the target video is identified by classification, and the identification accuracy can be improved.

In addition, in order to avoid the interference of too much useless text content, such as scroll subtitles, other text (docs) and even text content in the advertisement Logo, the importance degree is low, and the importance degree of the text content in the standard subtitles is high, text recognition can be performed only for the standard subtitles. The text recognition process in this case may include: firstly, detection optimization is carried out, because the standard subtitles are a horizontal character sequence located below the image, and non-standard subtitles have vertical characters, different sizes, original background characters and the like, a text box of the standard subtitles can be selected by using a target detection method (such as based on Anchor), namely the text box with an unnecessary length-width ratio is removed, and the detection range is narrowed by using the detection (Detect) to Detect the position where the standard subtitles possibly appear based on the relative position where the standard subtitles possibly appear in the image; then, recognition optimization is carried out, because the text box information is relatively fixed based on the position of a subtitle, duplication removal between frames can be carried out by introducing inter-frame time sequence information, and further, post-processing is carried out on the image and the text at the same time, wherein the post-processing of the image comprises reducing the size deformation of the image, and the post-processing of characters comprises filtering repeated contents through a sliding time window; and finally, recognizing the text content of the text box in the processed image.

< Audio tag >

From an audio modality, video content breaking can be performed based on Automatic Speech Recognition technology (ASR). Due to the influence of a complex environment, local accent, a speaking mode and the like in a video, noise signals of an ASR recognition result are very much, so that in order to improve the accuracy of speech recognition, before performing ASR recognition, an audio interrupt signal can be removed in an audio extraction stage, for example, a Voice Activity Detection (VAD) model based on webRTC for web page instant messaging is used for performing mute Detection, and whether an inactive FRAME exists is judged (for example, a FRAME extraction duration is default to 10s, an active represents a sound, and an inactive represents a mute), so as to remove the interference of the audio interrupt signal on the recognition result. Secondly, when the ASR model is used to convert speech into text, due to the influence of noise environment, accent, speaking mode, scene domain, etc. in the video content, ASR recognition results can be further optimized, including but not limited to: 1) splicing single short strings by positioning time points of audio tracks; 2) eliminating accent words such as spoken words, exclamation words, word-of-speech words, etc.; 3) carrying out currency judgment on an ASR recognition result, for example, carrying out currency judgment by using a confusion ppl value of a NLPC neural network DNN language model; 4) and performing text error correction on the ASR recognition result.

< video frame image tag >

From the video frame image modality, video content breaking can be performed based on continuous frame OCR. Since the video is a time-series atlas, the interference of a large number of repeated frames needs to be removed. For example, in the stage of extracting the image frame, a key frame may be selected based on a locate model, specifically, the locate clustering features are a Hue, Saturation, and brightness histogram and a gradient histogram of the frame, and a frame at the center of a cluster (for example, K shots may be clustered first, and then K subshots are clustered for each shot) is used as a key frame, and OCR recognition is performed on the key frame. Secondly, when an image is converted into a text based on the OCR model, due to the unique timing information of the video, optimization processing can be performed, including but not limited to: 1) de-duplication based on temporal clustering, filtering by sliding time window, such as filtering by 50% overlap of frames with frames every 5 seconds; 2) based on spatial clustering and de-duplication, aggregating the text boxes with similar coordinate positions, for example, clustering and identifying the texts with the similar coordinate positions of the boxes by adopting a K-nearest neighbor (kNN) algorithm; 3) text duplication removal (for example, duplication removal according to the position in the frame, the frame number and the edit distance (Levenshtein distance)), smoothness filtering, text error correction and the like are performed on the text conversion result, and text keywords are extracted.

Optionally, in a case that the video summary information includes a key frame image, the process of obtaining the video summary information may include: and extracting the video frame images in the target video to obtain the key frame images related to the information to be searched. Alternatively, the key frame image may be one or more images that are most closely associated and matched with the information to be searched. For example, if the information to be searched is the XX food production method, the extracted key frame image may be an effect image of the XX food after production is completed; or, if the information to be searched is a YY toy, the extracted key frame image may be a plurality of images of the YY toy at different viewing angles. Therefore, the user can know the video content quickly and conveniently by displaying the key frame image.

In the embodiment of the application, the display form of the video abstract information can be standardized for the convenience of the user to check. Optionally, after the video summary information is obtained, the video summary information may be further processed to obtain target display content meeting a preset rule, and the target display content is displayed. For example, the processing is text importance induction to remove some contents, so that the displayed contents are more compact; and/or the processing is adjusting the display content, and the like. The preset rules are, for example, to highlight important text content, extract key content as a title, preferentially display a certain type of content, and/or display in a left-image and right-text manner, and so on. Therefore, the required content can be conveniently searched by the user through the standard display form without acquiring the required content from the disordered display content.

Further, after the video summary information is presented, the method further includes: receiving the input of a user aiming at the video summary information; in response to the input, a detail page of the target video is presented. Wherein the input can be selected as click operation, sliding operation or the like on a display page or a display area of the video summary information. Therefore, the user can know the complete video content conveniently, and the user experience is improved.

It should be noted that, scenarios to which the embodiments of the present application are applicable may include, but are not limited to, a search scenario, a recommendation scenario, and the like. For example, taking a video advertisement in a search scene as an example, when the video advertisement is delivered, a video is often represented by using static elements such as a video playing label, a cover page image, and/or a video label. The user can know the advertisement content only by clicking the video playing object or the video label to enter the landing page and then watching the video advertisement. However, the video advertisements are represented in the form of video playing marks and video labels, and a user does not know what value the video can bring to the user and whether the appeal can be met, so that the user may hesitate and unwilling to open the video advertisements, and further, the user cannot directly know the video content of the landing page. By means of the displayed video abstract information in the embodiment of the application, effective contents in the video advertisements can be exposed, user search requirements can be met more directly, more contents in unopened video advertisements are transmitted to the users in unit time, and the users can know whether the video advertisements can meet the requirements of the users. Under the condition that the requirement is determined to be met, the video advertisement can be further opened through input operation to know the complete content in the video advertisement, so that conversion and clicking are brought to the advertiser, and the opportunity of fully showing the content is provided for the advertiser.

For example, in a search scenario, if the information to be searched is "a method of making shredded pork with fish flavor", the video summary information displayed on the search result page may be as shown in fig. 2; or if the information to be searched is "where to treat freckles," the video summary information shown on the search result page may be as shown in fig. 2. Further, when the user clicks the search result page of the "method of making shredded pork with fish" the landing page of the corresponding video can be displayed, so that the user can know the complete method of making shredded pork with fish.

For another example, taking a video advertisement in a recommended scene as an example, when it is determined that the information to be searched is a car based on the user history search information and the user browsing record, video summary information of a certain video obtained by searching may be directly displayed when recommendation is performed by means of the embodiment of the present application, as shown in fig. 3. Further, when the user clicks the search result page showing the video summary information, the detail page of the corresponding video can be displayed so that the user can know the content of the complete video advertisement.

Referring to fig. 4, fig. 4 is a schematic structural diagram of a video search apparatus according to an embodiment of the present application, and as shown in fig. 4, the video search apparatus 40 includes:

an obtaining module 41, configured to obtain information to be searched;

the searching module 42 is configured to perform searching according to the information to be searched to obtain a target video;

the analysis module 43 is configured to analyze the target video to obtain video summary information; wherein the video summary information comprises at least one of: key text information, key frame images;

the first display module 44 is configured to display the video summary information.

Optionally, in a case that the video summary information includes key text information, the parsing module 43 is specifically configured to:

identifying at least one of the following target videos to obtain key text information associated with the information to be searched: text content, audio, video frame images.

Optionally, in a case that the video summary information includes a key frame image, the parsing module 43 is specifically configured to:

and extracting the video frame images in the target video to obtain the key frame images related to the information to be searched.

Optionally, the parsing module 43 includes:

the selecting unit is used for selecting a text box in the target video;

the classification unit is used for classifying the selected text boxes on the basis of the duration;

the recognition unit is used for respectively carrying out text recognition on each type of text box to obtain text contents in each type of text box;

and the analysis unit is used for performing correlation analysis on the text content in each type of text box based on the information to be searched to obtain corresponding key text information.

Optionally, the video search apparatus 40 further includes:

the processing module is used for processing the video abstract information to obtain target display content meeting a preset rule;

the first display module 44 is specifically configured to:

and displaying the target display content.

Optionally, the video search apparatus 40 further includes:

the receiving module is used for receiving the input of the user aiming at the video abstract information;

a second presentation module for presenting a detail page of the target video in response to the input.

It can be understood that the video search apparatus 40 according to the embodiment of the present application can implement each process implemented in the method embodiment shown in fig. 1 and achieve the same beneficial effects, and for avoiding repetition, details are not repeated here.

According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.

As shown in fig. 5, the electronic device is a block diagram of an electronic device according to the video search method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 5, the electronic apparatus includes: one or more processors 501, memory 502, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories and multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 5, one processor 501 is taken as an example.

Memory 502 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the video search methods provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the video search method provided by the present application.

The memory 502, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the video search method in the embodiment of the present application (for example, the obtaining module 41, the search module 42, the parsing module 43, and the first presentation module 44 shown in fig. 4). The processor 501 executes various functional applications of the server and data processing, i.e., implements the video search method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 502.

The memory 502 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the electronic device for video search, and the like. Further, the memory 502 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 502 optionally includes memory located remotely from processor 501, which may be connected to the video-searching electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device of the video search method may further include: an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503 and the output device 504 may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.

The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the video-searched electronic apparatus, such as an input device of a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 504 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

According to the technical scheme of the embodiment of the application, when the video is searched, the video abstract information of the target video obtained by searching can be directly displayed, so that the exposure of the effective content in the target video is improved, a user does not need to obtain the required content by watching the complete video, and the time overhead of the user is reduced.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A video search method, comprising:

acquiring information to be searched;

searching according to the information to be searched to obtain a target video;

and displaying the video abstract information.

2. The method of claim 1, wherein in the case that the video summary information includes key text information, the parsing the target video to obtain video summary information comprises:

3. The method of claim 1, wherein in the case that the video summary information includes key frame images, the parsing the target video to obtain video summary information comprises:

4. The method of claim 1, wherein in the case that the video summary information includes key text information, the parsing the target video to obtain video summary information comprises:

selecting a text box in the target video;

classifying the selected text box based on the duration;

respectively carrying out text recognition on each type of text box to obtain text contents in each type of text box;

and performing correlation analysis on the text content in each type of text box based on the information to be searched to obtain corresponding key text information.

5. The method of claim 1, wherein after parsing the target video and obtaining video summary information, the method further comprises:

processing the video abstract information to obtain target display content meeting preset rules;

the displaying the video summary information comprises:

and displaying the target display content.

6. The method of claim 1, after said presenting said video summary information, further comprising:

receiving the input of a user aiming at the video summary information;

in response to the input, a detail page of the target video is presented.

7. A video search apparatus, comprising:

the acquisition module is used for acquiring information to be searched;

8. The apparatus of claim 7, wherein, in the case that the video summary information includes key text information, the parsing module is specifically configured to:

9. The apparatus of claim 7, wherein, in the case that the video summary information includes key frame images, the parsing module is specifically configured to:

10. The apparatus of claim 7, wherein the parsing module comprises:

the selecting unit is used for selecting a text box in the target video;

11. The apparatus of claim 7, further comprising:

the first display module is specifically configured to:

and displaying the target display content.

12. The apparatus of claim 7, further comprising:

13. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.

14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.