CN113596601A - Video picture positioning method, related device, equipment and storage medium - Google Patents

Video picture positioning method, related device, equipment and storage medium

Info

Publication number
CN113596601A
Authority
CN
China
Prior art keywords
text
associated text
target
video
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110071179.XA
Other languages
Chinese (zh)
Inventor
郭洋
朱明清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110071179.XA priority Critical patent/CN113596601A/en
Publication of CN113596601A publication Critical patent/CN113596601A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/482End-user interface for program selection
    • H04N21/4828End-user interface for program selection for searching program descriptors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47217End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for controlling playback functions for recorded or on-demand content, e.g. using progress bars, mode or play-point indicators or bookmarks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a video picture positioning method based on artificial intelligence technology and storage technology, which comprises the following steps: receiving search information sent by terminal equipment; matching the search information with a target associated text of a target video in an index library to obtain a matching score; if the matching score meets the matching condition, determining time information corresponding to the target associated text; and sending the time information corresponding to the target associated text and the target associated text to the terminal equipment so that the terminal equipment displays the picture positioning result of the target video. The application also provides a related device, equipment and a storage medium. According to the method and the device, scenes containing the related content can be found quickly, interference from thumbnails of other similar video pictures is avoided, and the accuracy of video picture positioning is improved. In addition, the user can conveniently and intuitively view all video pictures related to the searched content in the target video, thereby improving searching efficiency.

Description

Video picture positioning method, related device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, a related apparatus, a device, and a storage medium for positioning a video frame.
Background
With social development and technological innovation, multimedia technology has also developed rapidly. Multimedia technology, which integrates computer technology, communication technology and television technology, is inseparable from people's daily life. Video is a common multimedia form with good entertainment value and wide dissemination.
When a user needs to view a certain segment of a video, the user can drag the progress bar provided by the video player, and a video picture thumbnail corresponding to the playing position is displayed. In this way, the video picture thumbnails help the user quickly locate a certain time point.
However, the way of positioning by video picture thumbnails is cumbersome and easily misses a video picture thumbnail desired by the user, resulting in low positioning accuracy. In addition, if the video is a video with little difference in picture content, such as a lecture video or a conference video, it is difficult to locate an accurate time point through a video picture thumbnail.
Disclosure of Invention
The embodiment of the application provides a video picture positioning method, a related device, equipment and a storage medium. A scene containing the related content can be found quickly by text search or voice search, interference from thumbnails of other similar video pictures can be avoided by searching within a single video, and the accuracy of video picture positioning is improved. In addition, the method and the device enable the user to view, more intuitively, all video pictures related to the searched content in the target video, thereby improving the searching efficiency.
In view of the above, an aspect of the present application provides a method for positioning a video frame, including:
receiving search information sent by terminal equipment, wherein the search information is a search text or a search voice;
matching the search information with a target associated text of a target video in an index library to obtain a matching score, wherein the index library comprises an associated text of each video in K videos and time information corresponding to the associated text of each video, the K videos comprise the target video, and K is an integer greater than or equal to 1;
if the matching score meets the matching condition, determining time information corresponding to the target associated text;
and sending the time information corresponding to the target associated text and the target associated text to the terminal equipment so that the terminal equipment displays the picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text.
Another aspect of the present application provides a method for positioning a video frame, including:
acquiring search information, wherein the search information is a search text or a search voice;
sending search information to a server to enable the server to match the search information with a target associated text of a target video in an index library to obtain a matching score, wherein the index library comprises the associated text of each video in K videos and time information corresponding to the associated text of each video, the K videos comprise the target video, and K is an integer greater than or equal to 1;
if the matching score meets the matching condition, receiving time information corresponding to the target associated text and the target associated text sent by the server;
and displaying the picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text.
Another aspect of the present application provides a video frame positioning apparatus, including:
the acquisition module is used for receiving search information sent by the terminal equipment, wherein the search information is a search text or a search voice;
the matching module is used for matching the search information with a target associated text of a target video in an index library to obtain a matching score, wherein the index library comprises the associated text of each video in K videos and time information corresponding to the associated text of each video, the K videos comprise the target video, and K is an integer greater than or equal to 1;
the determining module is used for determining the time information corresponding to the target associated text if the matching score meets the matching condition;
and the sending module is used for sending the time information corresponding to the target associated text and the target associated text to the terminal equipment so that the terminal equipment displays the picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text.
In one possible design, in another implementation manner of another aspect of the embodiment of the present application, the video picture positioning apparatus further includes an identification module and a storage module;
the identification module is used for, before the matching module matches the search information with the target associated text of the target video in the index library to obtain a matching score, performing Optical Character Recognition (OCR) processing on the subtitle information in the target video to obtain the associated text for the target video, if the target video comprises the subtitle information;
the acquisition module is also used for acquiring the time information corresponding to the associated text;
and the storage module is used for storing the associated text and the time information corresponding to the associated text in the index database.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the matching module is specifically used for generating a first text sequence according to the search information, wherein the first text sequence comprises M characters, and M is an integer greater than or equal to 1;
generating a second text sequence according to a target associated text of a target video, wherein the second text sequence comprises N characters, and N is an integer greater than or equal to 1;
constructing a character matrix according to the first text sequence and the second text sequence;
determining the accumulated operand corresponding to the maximum path from the character matrix;
the ratio between the cumulative operand and M is taken as the match score.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the matching module is specifically used for generating a first text sequence according to the search information, wherein the first text sequence comprises R words, and R is an integer greater than or equal to 1;
generating a second text sequence according to the target associated text of the target video, wherein the second text sequence comprises T words, and T is an integer greater than or equal to 1;
determining a word set according to the first text sequence and the second text sequence, wherein the word set is a union set of R words and T words;
determining a first word frequency vector according to the word set and the first text sequence;
determining a second word frequency vector according to the word set and the second text sequence;
and taking the cosine similarity between the first word frequency vector and the second word frequency vector as a matching score.
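As an illustration of this word-frequency matching flow, the following Python sketch builds the union word set, the two word-frequency vectors, and their cosine similarity. It assumes a simple whitespace tokenizer in place of the (unspecified) word segmentation step, and the example strings are hypothetical:

```python
from collections import Counter
from math import sqrt

def cosine_match_score(search_text: str, associated_text: str) -> float:
    # Tokenise both texts; a real system would use a proper word segmenter.
    first_seq = search_text.split()        # R words from the search information
    second_seq = associated_text.split()   # T words from the target associated text

    # Word set: the union of the R words and the T words.
    vocab = sorted(set(first_seq) | set(second_seq))

    # First and second word-frequency vectors over the shared word set.
    tf1, tf2 = Counter(first_seq), Counter(second_seq)
    v1 = [tf1[w] for w in vocab]
    v2 = [tf2[w] for w in vocab]

    # Cosine similarity between the two word-frequency vectors is the matching score.
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = sqrt(sum(x * x for x in v1)) * sqrt(sum(y * y for y in v2))
    return dot / norm if norm else 0.0

print(cosine_match_score("follow the turn pen", "follow the turn pen closely"))
```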
In one possible design, in another implementation manner of another aspect of the embodiment of the present application, the video picture positioning apparatus further includes an identification module and a storage module;
the recognition module is used for, before the matching module matches the search information with the target associated text of the target video in the index library to obtain a matching score, performing Automatic Speech Recognition (ASR) processing on the speech information in the target video to obtain the associated text for the target video, if the target video comprises the speech information;
the acquisition module is also used for acquiring the time information corresponding to the associated text;
and the storage module is used for storing the associated text and the time information corresponding to the associated text in the index database.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the matching module is specifically used for generating a first phoneme sequence according to the search information, wherein the first phoneme sequence comprises P phonemes, and P is an integer greater than or equal to 1;
generating a second phoneme sequence according to the target associated text of the target video, wherein the second phoneme sequence comprises Q phonemes, and Q is an integer greater than or equal to 1;
constructing a phoneme matrix according to the first phoneme sequence and the second phoneme sequence;
determining a cumulative operand corresponding to the maximum path from the phoneme matrix;
the ratio between the accumulated operand and P is taken as the match score.
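A minimal sketch of this phoneme-based matching, under the assumption that a grapheme-to-phoneme converter (not specified here, e.g. a pinyin tool for Chinese) is available; the to_phonemes stand-in and the pinyin-like example strings are purely illustrative. The phoneme matrix is filled with the usual insert/delete/replace dynamic programming, and the accumulated operand (here, the final cell of the matrix) divided by P gives the score:

```python
def edit_distance(seq_a, seq_b):
    # Phoneme matrix: cell (i, j) holds the minimum number of insert/delete/replace
    # operations between the first i items of seq_a and the first j items of seq_b.
    m, n = len(seq_a), len(seq_b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def to_phonemes(text):
    # Placeholder: a real system would run grapheme-to-phoneme conversion here.
    return text.split()

def phoneme_match_score(search_info: str, target_associated_text: str) -> float:
    first = to_phonemes(search_info)               # P phonemes
    second = to_phonemes(target_associated_text)   # Q phonemes
    ops = edit_distance(first, second)             # accumulated operand
    return ops / len(first) if first else 1.0      # smaller score = closer match

print(phoneme_match_score("zhuan bi", "zhuan bei"))  # -> 0.5
```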
In one possible design, in another implementation manner of another aspect of the embodiment of the present application, the video picture positioning apparatus further includes an identification module and a storage module;
the identification module is used for, before the matching module matches the search information with the target associated text of the target video in the index database to obtain a matching score, performing image recognition processing on video frames in the target video to obtain the associated text for the target video;
the acquisition module is also used for acquiring the time information corresponding to the associated text;
and the storage module is used for storing the associated text and the time information corresponding to the associated text in the index database.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the matching module is specifically used for acquiring a first word vector through an input layer included in the semantic matching model based on the search information;
acquiring a second word vector through an input layer included in a semantic matching model based on a target associated text of a target video;
based on the first word vector, obtaining a first semantic vector through a presentation layer included in a semantic matching model;
based on the second word vector, a second semantic vector is obtained through a presentation layer included in the semantic matching model;
based on the first semantic vector and the second semantic vector, the cosine distance is obtained through a matching layer included in the semantic matching model, and the cosine distance is used as a matching score.
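The following numpy sketch mirrors this input layer / presentation layer / matching layer structure in a DSSM-style fashion. The vocabulary, dimensions and randomly initialised weights are stand-ins for a trained semantic matching model, so the score it prints only illustrates the data flow, not a meaningful similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {"follow": 0, "the": 1, "turn": 2, "pen": 3, "closely": 4}
EMB_DIM, SEM_DIM = 8, 4

# Randomly initialised parameters stand in for a trained semantic matching model.
embedding = rng.normal(size=(len(VOCAB), EMB_DIM))   # input layer: word vectors
w_repr = rng.normal(size=(EMB_DIM, SEM_DIM))         # presentation layer weights

def input_layer(text: str) -> np.ndarray:
    # Word vectors for the tokens of the text (unknown tokens are skipped).
    ids = [VOCAB[t] for t in text.split() if t in VOCAB]
    return embedding[ids]

def presentation_layer(word_vectors: np.ndarray) -> np.ndarray:
    # Pool the word vectors and project them into a semantic vector.
    return np.tanh(word_vectors.mean(axis=0) @ w_repr)

def matching_layer(sem1: np.ndarray, sem2: np.ndarray) -> float:
    # Cosine distance between the two semantic vectors, used as the matching score
    # (a smaller distance means a closer semantic match).
    cos_sim = sem1 @ sem2 / (np.linalg.norm(sem1) * np.linalg.norm(sem2))
    return float(1.0 - cos_sim)

score = matching_layer(presentation_layer(input_layer("turn pen")),
                       presentation_layer(input_layer("follow the turn pen")))
print(score)
```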
Another aspect of the present application provides a video frame positioning apparatus, including:
the acquisition module is used for acquiring search information, wherein the search information is search text or search voice;
the sending module is used for sending search information to the server so that the server matches the search information with a target associated text of a target video in an index library to obtain a matching score, wherein the index library comprises the associated text of each video in K videos and time information corresponding to the associated text of each video, the K videos comprise the target video, and K is an integer greater than or equal to 1;
the acquisition module is further used for receiving time information corresponding to the target associated text and the target associated text sent by the server if the matching score meets the matching condition;
and the display module is used for displaying the picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for providing a text input area;
receiving a search text for a target video through a text input area;
or the like, or, alternatively,
starting a voice acquisition device;
and receiving search voice aiming at the target video through the voice acquisition equipment.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the display module is used for providing a playing progress bar;
displaying a time point identifier on the playing progress bar according to time information corresponding to the target associated text, wherein the time point identifier belongs to a picture positioning result;
and highlighting the target associated text in the text display area corresponding to the time point identification, wherein the target associated text belongs to the picture positioning result.
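A small sketch of how the display module might place the time point identifier on the playing progress bar; the pixel width and frame numbers are hypothetical, and a real player would use its own layout API:

```python
def marker_position(frame_index: int, total_frames: int, bar_width_px: int) -> int:
    # Proportional position of the time point identifier on the playing progress bar.
    return round(frame_index / total_frames * bar_width_px)

# e.g. an associated text appearing at frame 12683 of a 20000-frame video,
# shown on a progress bar 600 pixels wide:
print(marker_position(12683, 20000, 600))  # -> 380
```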
Another aspect of the present application provides a computer device, comprising: a memory, a processor, and a bus system;
wherein, the memory is used for storing programs;
a processor for executing the program in the memory, the processor for performing the above-described aspects of the method according to instructions in the program code;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
Another aspect of the present application provides a server, including: a memory, a processor, and a bus system;
wherein, the memory is used for storing programs;
a processor for executing the program in the memory, the processor for performing the above-described aspects of the method according to instructions in the program code;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
Another aspect of the present application provides a terminal device, including: a memory, a processor, and a bus system;
wherein, the memory is used for storing programs;
a processor for executing the program in the memory, the processor for performing the above-described aspects of the method according to instructions in the program code;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
Another aspect of the present application provides a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above-described aspects.
In another aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by the above aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
the embodiment of the application provides a video frame positioning method. The method first receives search information sent by terminal equipment, and then matches the search information with a target associated text of a target video in an index library to obtain a matching score, wherein the index library comprises the associated text of each video in K videos and the time information corresponding to the associated text of each video, and the K videos comprise the target video. If the matching score meets a matching condition, the time information corresponding to the target associated text is determined. Finally, the time information corresponding to the target associated text and the target associated text are sent to the terminal equipment, and the terminal equipment displays the frame positioning result of the target video according to the time information corresponding to the target associated text and the target associated text. By this method, scenes containing the related content can be found quickly by text search or voice search, interference from thumbnails of other similar video pictures can be avoided by searching within a single video, and the accuracy of video picture positioning is improved. In addition, the appearance time, the appearance position and the like of the target associated text can be retrieved, so that the user can conveniently and intuitively view all video pictures related to the searched content in the target video, thereby improving the searching efficiency.
Drawings
FIG. 1 is a block diagram of an embodiment of a video frame positioning system;
FIG. 2 is a flowchart illustrating a video frame positioning method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an embodiment of a method for locating a video frame in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a semantic matching model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of another embodiment of a method for locating a video frame in an embodiment of the present application;
FIG. 6 is a schematic diagram of an interface showing a text entry area in an embodiment of the present application;
FIG. 7 is a schematic diagram of an interface showing voice input prompts in an embodiment of the present application;
FIG. 8 is a schematic diagram of an interface showing a result of positioning a frame in an embodiment of the present application;
FIG. 9 is a schematic diagram of an interface of a video frame corresponding to a skip-to-time point identifier in the embodiment of the present application;
FIG. 10 is a schematic diagram of a video frame positioning apparatus according to an embodiment of the present application;
FIG. 11 is another schematic diagram of a video frame positioning apparatus according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a server in an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a terminal device in an embodiment of the present application.
Detailed Description
The embodiment of the application provides a video picture positioning method, a related device, equipment and a storage medium. A scene containing the related content can be found quickly by text search or voice search, interference from thumbnails of other similar video pictures can be avoided by searching within a single video, and the accuracy of video picture positioning is improved. In addition, the method and the device enable the user to view, more intuitively, all video pictures related to the searched content in the target video, thereby improving the searching efficiency.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Multimedia data on the internet, including video, music, and text, keeps growing and remains a popular research subject. In particular, the rapid growth of video makes it important to browse video content efficiently and to mine its potential value. Video generally refers to various techniques for capturing, recording, processing, storing, transmitting, and reproducing a series of still images as electrical signals. Common video content includes, but is not limited to, dramas, movies, fantasy, sporting events, animations, documentaries, news, music movies, and the like. While watching a video, a user can view the video frame corresponding to a certain time point by fast forwarding or dragging the progress bar, that is, continue playing the video content from that time point.
In order to improve the accuracy of video frame positioning and improve the efficiency of searching, the present application provides a method for positioning a video frame, which is applied to a video frame positioning system shown in fig. 1, where as shown in the figure, the video frame positioning system includes a server and a terminal device, and a client is deployed on the terminal device, and the client may be a player specifically, including but not limited to a web player, a mobile player, a television player, a computer player, and the like. The server related to the application can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and can also be a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, safety service, Content Delivery Network (CDN), big data and an artificial intelligence platform. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a palm computer, a personal computer, a smart television, a smart watch, and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein. The number of servers and terminal devices is not limited.
It is understood that the video frame positioning method provided by the present application may employ Computer Vision (CV), Speech Technology (Speech Technology) and Natural Language Processing (NLP) based on Artificial Intelligence (AI). The artificial intelligence is a theory, a method, a technology and an application system which simulate, extend and expand human intelligence by using a digital computer or a machine controlled by the digital computer, sense the environment, acquire knowledge and obtain the best result by using the knowledge. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Computer vision is a science for researching how to make a machine "see", and further, it means that a camera and a computer are used to replace human eyes to perform machine vision such as identification, tracking and measurement on a target, and further image processing is performed, so that the computer processing becomes an image more suitable for human eyes to observe or transmitted to an instrument to detect. As a scientific discipline, computer vision research-related theories and techniques attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, synchronous positioning, map construction, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of speech technology include Automatic Speech Recognition (ASR) technology, Text To Speech (TTS) technology, and voiceprint recognition technology. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, research in this field involves natural language, i.e. the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Based on the above description and the video frame positioning system corresponding to fig. 1, please refer to fig. 2, and fig. 2 is a schematic flow chart of a video frame positioning method according to an embodiment of the present application, and as shown in the figure, specifically:
in step S1, for the video with subtitles, the server may match the subtitle text in the video using OCR technology and determine the corresponding screen time point.
In step S2, for the video with subtitles, the server may employ ASR technology to convert the speech content in the video into text, then match the text, and determine the corresponding picture time point. It should be noted that step S1 and step S2 are two optional processing manners, and are not limited herein.
In step S3, for the video without subtitles, the server may employ ASR technology to convert the speech content in the video into text, then match the text, and determine the corresponding picture time point.
In step S4, the server separates out the subtitle text, which carries timeline information.
In step S5, in the video in which the text recognition is completed, by embedding a search component in the client deployed in the terminal device, text input and search can be performed based on the search component.
In step S6, the user inputs search information, and when the server receives the search information, the server searches through the stored associated text to find the entry or sentence in the video that is related to the search information and the associated time information, and the server transmits the information to the terminal device.
In step S7, the terminal device presents the retrieved subtitle text on the progress bar through the player.
In step S8, the terminal device displays a time node corresponding to the associated text on the progress bar based on the time information corresponding to the associated text.
In step S9, the user clicks the progress bar to jump to a time point of the associated text, where the associated text may specifically be a subtitle or a voice.
With reference to fig. 3, a method for positioning a video frame in the present application will be described below, where an embodiment of the method for positioning a video frame in the present application includes:
101. the method comprises the steps that a server receives search information sent by terminal equipment, wherein the search information is a search text or a search voice;
in this embodiment, the search information acquired by the terminal device may be a search text or a search voice. The search information is sent by the terminal device to the server. Specifically, in one example, when the terminal device plays the target video, the user may directly input the search information on the terminal device. In another example, the user may select a video in the list as the target video, and after selecting the target video, enter search information on the terminal device. In another example, the user may enter search information, and the server selects any one of the videos from the video database as a target video and matches the search information with the target video.
102. The server matches the search information with a target associated text of a target video in an index database to obtain a matching score, wherein the index database comprises an associated text of each video in K videos and time information corresponding to the associated text of each video, the K videos comprise the target video, and K is an integer greater than or equal to 1;
in this embodiment, the server stores an index library, where an associated text of each of the K videos and time information corresponding to the associated text of each video are stored in the index library, where the associated text may be a subtitle text recognized by an OCR technology, a subtitle text recognized by an ASR technology, or a text recognized by an image recognition technology. The server may match the search information with each associated text corresponding to the target video in the index library, and obtain a matching score between the search information and each associated text, for convenience of description, the present application takes a target associated text in a plurality of associated texts as an example, in an actual situation, the target associated text may be any one associated text, or may be one associated text with the highest matching degree with the search information, and this is not limited here.
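A hedged sketch of this server-side matching step: the index library is modelled as an in-memory mapping from a video to its (associated text, time information) pairs, and match_score is a crude stand-in (larger means more similar) for the edit-distance, word-frequency or semantic-model scoring described later; all names and the sample entries are hypothetical:

```python
# Hypothetical in-memory index library: for each video, a list of
# (associated text, time information) pairs.
index_library = {
    "target_video_id": [
        ("the ocean sports meeting is about to start again", 12683),
        ("let us focus the lens on the beluga", 13232),
    ],
}

def match_score(search_info: str, associated_text: str) -> float:
    # Crude stand-in for the real scoring schemes (edit distance, word-frequency
    # cosine, semantic model ...); here, larger means more similar.
    query_words = set(search_info.split())
    return len(query_words & set(associated_text.split())) / max(len(query_words), 1)

def locate(search_info: str, video_id: str, threshold: float = 0.5):
    # Score every associated text of the target video and keep those whose
    # matching score satisfies the matching condition.
    hits = []
    for associated_text, time_info in index_library.get(video_id, []):
        if match_score(search_info, associated_text) >= threshold:
            hits.append((associated_text, time_info))
    return hits  # returned to the terminal device for display

print(locate("focus on the beluga", "target_video_id"))
```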
It can be understood that the index library is a database, which can be regarded as an electronic filing cabinet, in short, a place for storing electronic files, and a user can add, query, update, and delete data in the files. A "database" is a collection of data that is stored together in a manner that can be shared by multiple users, has as little redundancy as possible, and is independent of the application. A Database Management System (DBMS) is computer software designed for managing a database, and generally has basic functions such as storage, retrieval, security assurance, and backup. A database management system may be classified according to the database model it supports, such as relational or XML (Extensible Markup Language); or according to the type of computer supported, e.g., server cluster or mobile phone; or according to the query language used, such as Structured Query Language (SQL) or XQuery; or according to performance emphasis, e.g., maximum size or maximum operating speed; or by other classification schemes. Regardless of the classification used, some DBMSs span categories, for example, supporting multiple query languages simultaneously.
It should be noted that, because the number of videos is large and each video often includes a large amount of associated texts, the associated texts and the corresponding time information thereof in each video can be stored in the Cloud based on Cloud technology (Cloud technology), so as to implement Cloud storage (Cloud storage). The cloud technology is a hosting technology for unifying series resources such as hardware, software, network and the like in a wide area network or a local area network to realize data calculation, storage, processing and sharing. The cloud technology is a general term of network technology, information technology, integration technology, management platform technology, application technology and the like applied based on a cloud computing business model, can form a resource pool, is used as required, and is flexible and convenient. Cloud computing technology will become an important support. Background services of the technical network system require a large amount of computing and storage resources, such as video websites, picture-like websites and more web portals. With the high development and application of the internet industry, each article may have its own identification mark and needs to be transmitted to a background system for logic processing, data in different levels are processed separately, and various industrial data need strong system background support and can only be realized through cloud computing.
A distributed cloud storage system (hereinafter, referred to as a storage system) refers to a storage system that integrates a large number of storage devices (storage devices are also referred to as storage nodes) of different types in a network through application software or application interfaces to cooperatively work by using functions such as cluster application, grid technology, and a distributed storage file system, and provides a data storage function and a service access function to the outside. At present, a storage method of a storage system is as follows: logical volumes are created, and when created, each logical volume is allocated physical storage space, which may be the disk composition of a certain storage device or of several storage devices. The client stores data on a certain logical volume, that is, the data is stored on a file system, the file system divides the data into a plurality of parts, each part is an object, the object not only contains the data but also contains additional information such as data Identification (ID), the file system writes each object into a physical storage space of the logical volume, and the file system records storage location information of each object, so that when the client requests to access the data, the file system can allow the client to access the data according to the storage location information of each object.
The process of allocating physical storage space for the logical volume by the storage system specifically includes: physical storage space is divided in advance into stripes according to a group of capacity measures of objects stored in a logical volume (the measures often have a large margin with respect to the capacity of the actual objects to be stored) and Redundant Array of Independent Disks (RAID), and one logical volume can be understood as one stripe, thereby allocating physical storage space to the logical volume.
103. If the matching score meets the matching condition, the server determines time information corresponding to the target associated text;
in this embodiment, the server matches the search information with the target associated text in the index library to obtain a matching score, where the matching score may be a score from 0 to 1, or may adopt other expression manners, which is not limited herein. If the matching score is a cosine distance or a Levenshtein distance, the smaller the matching score, the higher the matching degree; in this case, if the matching score is less than or equal to the matching threshold, the matching score meets the matching condition. If the matching score is a cosine similarity, the larger the matching score, the higher the matching degree; in this case, if the matching score is greater than or equal to the matching threshold, the matching score meets the matching condition.
Specifically, when the matching score meets the matching condition, the target associated text is determined to be successfully matched with the search information. Then, the server obtains, based on the index library, the occurrence time of the target associated text in the target video, where the occurrence time is the time information corresponding to the target associated text, for example, the time information of the target associated text may be represented as "15 minutes 09 seconds", and for example, each frame of picture in the target video is numbered in sequence, and the time information corresponding to the target associated text may be represented as "25691", that is, the 25691 th frame in the target video.
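The frame-number form of the time information can be converted to a playback timestamp once the frame rate is known; the text does not state a frame rate, so the sketch below assumes 25 fps purely for illustration (the two example values above are independent and are not meant to correspond):

```python
def frame_to_timestamp(frame_index: int, fps: float = 25.0) -> str:
    # Convert a frame index into a "MM min SS s" style playback timestamp.
    seconds = frame_index / fps
    return f"{int(seconds // 60)} min {int(seconds % 60):02d} s"

print(frame_to_timestamp(25691))  # at the assumed 25 fps -> "17 min 07 s"
```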
104. And the server sends the time information corresponding to the target associated text and the target associated text to the terminal equipment, so that the terminal equipment displays the picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text.
In this embodiment, the server sends a picture positioning result for the target video to the terminal device, where the picture positioning result includes the target associated text and the time information corresponding to the target associated text, and thus, the terminal device may display the target associated text and the time information corresponding to the target associated text through the client (or the player). The user can directly view the screen positioning result related to the search information on the client.
Specifically, for ease of understanding, an example follows. Assume that the search information input by a user through the terminal device while viewing a target video (e.g., documentary A) is a search text, and that the search text is "turn pen". The search text is matched with each associated text of the target video in the index library. Taking the target associated text "follow the turn pen" as an example, the search text is successfully matched with the target associated text, so the server sends the time information corresponding to the target associated text and the target associated text to the terminal device, and the terminal device displays the target associated text and the time information through the player.
According to the method, scenes containing the related content can be found quickly by text search or voice search, interference from thumbnails of other similar video pictures can be avoided by searching within a single video, and the accuracy of video picture positioning is improved. In addition, the appearance time, the appearance position and the like of the target associated text can be retrieved, so that the user can conveniently and intuitively view all video pictures related to the searched content in the target video, thereby improving the searching efficiency.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided by the embodiment of the present application, before the server matches the search information with the target associated text of the target video in the index library to obtain the matching score, the method may further include:
aiming at a target video, if the target video comprises subtitle information, performing Optical Character Recognition (OCR) processing on the subtitle information in the target video by a server to obtain an associated text;
the server acquires time information corresponding to the associated text;
and the server stores the associated text and the time information corresponding to the associated text in an index database.
In this embodiment, a method for recognizing subtitle information in a target video by using an OCR recognition technology is described. For convenience of description, the target video is taken as an example for description, and if the target video includes subtitle information, the server recognizes the subtitle information by using an OCR technology, so as to obtain an associated text. OCR is mainly composed of the following parts.
Firstly, inputting and preprocessing an image;
for different image formats, different storage formats and different compression modes exist. Pretreatment: mainly comprising binarization, noise removal and the like. Most of pictures shot by a camera are color images, the information content of the color images is huge, the contents of the pictures can be simply divided into foreground and background, in order to enable a computer to recognize characters more quickly and better, the color images need to be processed firstly, so that only the foreground information and the background information of the pictures are obtained, the foreground information is set to be black, and the background information is set to be white, so that a binary image is obtained. The denoising is performed according to the characteristics of the noise, and is called noise removal.
Secondly, analyzing the layout;
the document pictures are segmented, and the line segmentation process is called layout analysis.
Thirdly, character cutting;
due to the limitation of photographing conditions, the characters are often adhered or broken, so that the performance of the recognition system is greatly limited, and the character cutting function is required.
Fourthly, character recognition;
template matching can be adopted, or feature extraction is mainly adopted, and due to the influence of factors such as displacement of characters, thickness of strokes, pen breakage, adhesion, rotation and the like, the difficulty of feature extraction is greatly influenced.
Fifthly, post-processing and proofreading;
and correcting the recognition result according to the relation of the specific language context.
Specifically, assuming that the target video includes 200 sentences of caption information, each sentence of caption information is identified respectively, and 200 associated texts are obtained, where 200 associated texts include the target associated text, and the target associated text may be any one of the 200 associated texts, that is, an associated text subsequently matched with the search information. Next, the server needs to obtain time information corresponding to each associated text, so as to store the associated text and the time information corresponding to the associated text in the index library.
For ease of understanding, please refer to table 1, where table 1 is an illustration between the associated texts in the index database and the time information thereof.
TABLE 1
Associated text | Time information
The ten-year and one-degree ocean sports can start again | 12683
This is a traditional "custom" left by marine animal ancestors | 12793
Let us focus the lens on the beluga | 13232
It is about to participate in the fast swimming competition of giant whale group in the ocean sports meeting | 13555
Blue whale is one of several large animals in the ocean | 14000
Wenzhou woolen cloth with large size and capable of being shaped like lattice | 14385
When it catches food, it sucks a bite suddenly and sucks in a lot of water | 15675
As can be seen from table 1, each associated text corresponds to a time information, and the time information in table 1 indicates a time point at which the associated text appears in the target video, for example, the associated text is "ten years and one degree of ocean motion would start again" appearing at 12683 th frame of the target video, which may last for 50 frames, and it should be noted that the present application focuses on the first frame at which the associated text appears.
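A sketch of how such index entries could be produced from a subtitled video is given below. The ocr() function is a stand-in for a real OCR engine (here it simply returns the string it is given, so the demo runs), and only the first frame in which each associated text appears is recorded, as in Table 1:

```python
def ocr(frame) -> str:
    # Stand-in for a real OCR engine; here each "frame" is already the subtitle
    # string it would recognise, so that the demo below runs as-is.
    return frame

def build_subtitle_index(frames):
    """frames: iterable of (frame_index, frame) pairs in playback order.
    Returns a list of (associated text, time information) index entries."""
    entries, last_text = [], None
    for frame_index, frame in frames:
        text = ocr(frame)
        if text and text != last_text:
            # Record only the first frame in which each associated text appears.
            entries.append((text, frame_index))
        last_text = text
    return entries

demo = [(12683, "the ocean sports meeting starts again"),
        (12684, "the ocean sports meeting starts again"),
        (12793, "this is a traditional custom")]
print(build_subtitle_index(demo))
```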
Secondly, in the embodiment of the application, a mode of recognizing the subtitle information in the target video by using an OCR recognition technology is provided, and in the above mode, for the video with the subtitle text, the content of the subtitle text can be preferentially recognized and is used as the associated text to perform subsequent matching processing, so that the matching accuracy is improved.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, the matching the search information with the target associated text of the target video in the index library by the server to obtain the matching score may specifically include:
the server generates a first text sequence according to the search information, wherein the first text sequence comprises M characters, and M is an integer greater than or equal to 1;
the server generates a second text sequence according to the target associated text of the target video, wherein the second text sequence comprises N characters, and N is an integer greater than or equal to 1;
the server constructs a character matrix according to the first text sequence and the second text sequence;
the server determines the accumulated operand corresponding to the maximum path from the character matrix;
the server will accumulate the ratio between the operands and M as the match score.
In this embodiment, a method of matching by using the Levenshtein distance based on OCR recognition is described. Considering that OCR may have a certain error rate, during subtitle retrieval the search information may be converted into a first text sequence and aligned with the second text sequence corresponding to an associated text; after fuzzy matching is performed under the condition that a certain error is allowed, the corresponding target associated text and its time information are returned to the terminal device.
The similarity between two strings can be quantified by the Levenshtein distance, that is, the minimum number of edits required to convert one string into another through inserting (Insertion), deleting (Deletion) and replacing (Substitution) characters, where each of the three operations has a cost of 1. The smaller the edit distance, the more similar the two strings. The Levenshtein distance thus indicates the size of the difference between the two strings. For ease of understanding, the Levenshtein distance is calculated with the following formula (1):

$$
\mathrm{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j), & \text{if } \min(i,j)=0,\\[4pt]
\min\!\begin{cases}
\mathrm{lev}_{a,b}(i-1,j)+1\\
\mathrm{lev}_{a,b}(i,j-1)+1\\
\mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}
\end{cases}, & \text{otherwise,}
\end{cases}
\tag{1}
$$

wherein a denotes the first text sequence, b denotes the second text sequence, lev_{a,b}(i,j) denotes the distance between the first i characters of the first text sequence and the first j characters of the second text sequence, and 1_{(a_i≠b_j)} is an indicator function whose value is 1 when a_i ≠ b_j and 0 otherwise.
Specifically, the first text sequence comprises M characters, the second text sequence comprises N characters, and the algorithm core lies in aligning each character in the first text sequence with each character in the second text sequence. Assuming that the search information is "how good each colleague is", and the target associated text is "how good each colleague is", based on this, calculation is performed by using the above formula (1), and a character matrix shown in table 2 is obtained.
TABLE 2
[Table 2 (the character matrix between the first text sequence and the second text sequence) is provided as an image in the original publication.]
The purpose of this character matrix is to find how to change from the second text sequence to the first text sequence by the minimum number of operations, the smaller the number of operations, the more similar the search information is to the target associated text, the number in the character matrix represents the number of operations (1 operation for insertion, deletion and replacement), for example, row 2, column 6 "2" indicates that 2 operations are required from "big classmates" to "colleagues", the first operation is to replace the "student" word, and the second operation is to insert the "big" word.
In the first text sequence a and the second text sequence b, when a[i] = b[j], the comparison continues with a[i+1] and b[j+1]. In the process of calculating the Levenshtein distance between the first text sequence and the second text sequence, there are the following four operation modes:
Firstly, no operation;
If the last characters of the two strings are the same, no edit is needed for them. For example, if the first text sequence is "puppy" and the second text sequence is "dog", the last character "dog" of the two strings matches, so the Levenshtein distance between "puppy" and "dog" is equal to the Levenshtein distance between the two strings with this matching last character removed.
Secondly, replacing operation;
If the last characters of the two strings differ, one option is to replace the last character of one string with that of the other; the Levenshtein distance is then equal to the Levenshtein distance between the two strings with their last characters removed, plus 1.
Thirdly, inserting operation;
Another option is to insert, at the end of one string, the last character of the other string; the Levenshtein distance is then equal to the Levenshtein distance between that string and the other string with its last character removed, plus 1.
Fourthly, deleting operation;
The remaining option is to delete the last character of one string; the Levenshtein distance is then equal to the Levenshtein distance between that string with its last character removed and the other string, plus 1.
Based on the above description, the maximum path is obtained by tracing, in the character matrix, the path of steepest descent from the lower-right corner to the upper-left corner. Taking Table 2 as an example, the maximum path passes through the cells where the characters of the two sequences intersect: "each"/"each", "same"/"same", "things"/"science", "big"/"big", "family"/"family", "good"/"good", and "good"/"mu". The accumulated number of operations on the maximum path is the value 2 in the cell where "good" in the first text sequence intersects the final character of the second text sequence.
Based on this, the first text sequence includes M characters. Continuing with Table 2 as an example, M is 7, which includes a padding character "0" used to align the first text sequence with the second text sequence. The server uses the ratio between the accumulated number of operations and M as the matching score. As discussed above, the accumulated number of operations obtained from the character matrix in Table 2 is 2 and M is 7, so the matching score is 2/7 ≈ 0.28. Assuming that the matching threshold is 0.3, the matching score is smaller than the matching threshold, and the matching score is therefore considered to satisfy the matching condition.
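For illustration, the character-matrix computation and the matching score described above can be sketched in Python. This is a minimal sketch rather than the embodiment's implementation: the function names, the English example strings, and the threshold value are assumptions, and the bottom-right cell of the matrix is used as the accumulated number of operations on the optimal path.

```python
def levenshtein_matrix(a: str, b: str) -> list:
    """Build the (len(a)+1) x (len(b)+1) edit-distance matrix for strings a and b.
    Insertion, deletion, and substitution each cost 1, as in formula (1)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # indicator 1(a_i != b_j)
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution or no operation
    return d


def match_score(search_text: str, associated_text: str) -> float:
    """Matching score = accumulated number of operations divided by the length of the search text."""
    d = levenshtein_matrix(search_text, associated_text)
    return d[len(search_text)][len(associated_text)] / max(len(search_text), 1)


# Hypothetical usage: a smaller score means a closer match.
MATCH_THRESHOLD = 0.3   # assumed value, mirroring the example above
score = match_score("hello colleagues", "hello classmates")
print(score, score <= MATCH_THRESHOLD)
```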
In the embodiment of the application, a method of matching with the Levenshtein distance based on OCR recognition is provided. With this method, for a video that carries subtitle text, the subtitle content can be recognized preferentially and used as the associated text for subsequent matching, and the search information is matched against the associated text based on the Levenshtein distance. Using the Levenshtein distance to calculate the similarity between texts has the advantage of high accuracy: the smaller the Levenshtein distance, the higher the text similarity, which improves the feasibility of the scheme.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, the matching the search information with the target associated text of the target video in the index library by the server to obtain the matching score may specifically include:
the server generates a first text sequence according to the search information, wherein the first text sequence comprises R words, and R is an integer greater than or equal to 1;
the server generates a second text sequence according to the target associated text of the target video, wherein the second text sequence comprises T words, and T is an integer greater than or equal to 1;
the server determines a word set according to the first text sequence and the second text sequence, wherein the word set is a union set of R words and T words;
the server determines a first word frequency vector according to the word set and the first text sequence;
the server determines a second word frequency vector according to the word set and the second text sequence;
and the server takes the cosine similarity between the first word frequency vector and the second word frequency vector as a matching score.
In this embodiment, a method for determining text similarity for matching is introduced. During subtitle retrieval, the search information can be converted into a first text sequence, the target associated text can be converted into a second text sequence, and after fuzzy matching is carried out under the condition that certain errors are allowed, the corresponding target associated text and time information thereof can be returned to the terminal equipment.
Specifically, the first text sequence includes R words and the second text sequence includes T words. Assume that the search information is "this piece of clothes is a big number, that one is suitable" and the target associated text is "this piece of clothes is not a little number, that one is more suitable". Chinese word segmentation is performed on each, giving a first text sequence "this piece / clothes / number / big / that / number / suitable" and a second text sequence "this piece / clothes / number / not / little / that / more / suitable". Based on this, all words in the first and second text sequences are listed to form a word set, namely the union {this piece, clothes, number, big, that, more, suitable, not, little}. The word frequency of the first text sequence and the word frequency of the second text sequence are then calculated against this word set.
The word frequency of the first text sequence is:
this piece (1 time), clothes (1 time), number (2 times), big (1 time), that (1 time), more (0 times), suitable (1 time), not (0 times), little (0 times).
Thus, the first word frequency vector is (1,1,2,1,1,0,1,0, 0).
The word frequency of the second text sequence is:
this piece (1 time), clothes (1 time), number (1 time), big (0 time), that (1 time), more (1 time), suitable (1 time), not (1 time), little (1 time).
Thus, the second word frequency vector is obtained as (1,1,1,0,1,1,1,1, 1).
The server calculates the cosine similarity from the first word frequency vector and the second word frequency vector as follows:

$$
\cos(\theta)=\frac{\sum_{i=1}^{n} x_i\,y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\;\sqrt{\sum_{i=1}^{n} y_i^{2}}}
\qquad (2)
$$
Here x denotes the first word frequency vector, y denotes the second word frequency vector, n is the size of the word set, and the cosine similarity cos(θ) serves as the matching score. In this example the matching score is approximately 0.71. Assuming that the matching threshold is 0.7, the matching score is greater than the matching threshold, and the matching score is therefore considered to satisfy the matching condition.
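A minimal sketch of this word-frequency cosine matching, assuming the texts have already been segmented into words (for Chinese, a segmenter such as jieba would typically be used; that dependency is not shown here). The token lists and threshold are illustrative stand-ins for the worked example above.

```python
import math
from collections import Counter

def cosine_match_score(search_tokens, associated_tokens) -> float:
    """Build term-frequency vectors over the union vocabulary and return their cosine similarity."""
    vocab = sorted(set(search_tokens) | set(associated_tokens))   # union of the R and T words
    tf1, tf2 = Counter(search_tokens), Counter(associated_tokens)
    v1 = [tf1[w] for w in vocab]   # first word frequency vector
    v2 = [tf2[w] for w in vocab]   # second word frequency vector
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    return dot / norm if norm else 0.0

# Token lists mirroring the worked example (English stand-ins for the segmented Chinese words).
query_tokens = ["this piece", "clothes", "number", "big", "that", "number", "suitable"]
text_tokens = ["this piece", "clothes", "number", "not", "little", "that", "more", "suitable"]
print(round(cosine_match_score(query_tokens, text_tokens), 2))   # ≈ 0.71, above an assumed 0.7 threshold
```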
In the embodiment of the application, a method for determining text similarity for matching is provided. With this method, for a search text, the text similarity between the search text and the associated text can be calculated directly; for a search voice, the voice is first converted into text form and the text similarity between that text and the associated text is then calculated. This provides a feasible way of implementing the scheme and improves its feasibility and operability.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided by the embodiment of the present application, before the server matches the search information with the target associated text of the target video in the index library to obtain the matching score, the method may further include:
aiming at a target video, if the target video includes voice information, the server performs automatic speech recognition (ASR) processing on the voice information in the target video to obtain an associated text;
the server acquires time information corresponding to the associated text;
and the server stores the associated text and the time information corresponding to the associated text in an index database.
In this embodiment, a method for recognizing speech information in a target video by using an ASR technique is described. For convenience of description, the target video is taken as an example. Whether or not the target video includes subtitle information, as long as voice information is available the server can recognize the voice information using ASR technology to obtain the associated text. ASR mainly consists of the following parts.
Firstly, extracting acoustic features;
After the analog speech signal is sampled to obtain waveform data, the waveform data is first input into a feature extraction module, and suitable acoustic feature parameters are extracted for subsequent acoustic model training. The acoustic features should take three factors into account. First, they should have good discriminative power so that the different modeling units of the acoustic model can be modeled conveniently and accurately. Second, feature extraction can be regarded as a compression coding of the speech information: it needs to remove channel and speaker factors while keeping content-related information, and to use as low a parameter dimensionality as possible without losing too much useful information, so that the model can be trained efficiently and accurately. Finally, robustness, i.e. immunity to environmental noise, needs to be considered.
Secondly, an acoustic model;
At present, mainstream speech recognition systems adopt a Hidden Markov Model (HMM) as the acoustic model. The state-jump structure of the HMM fits the short-time stationary characteristic of human speech well, so continuously generated speech signals can be conveniently modeled statistically. The HMM has a wide range of applications: modeling with an HMM is possible as long as suitable generation probability densities, discrete distributions, or continuous distributions are selected.
Thirdly, processing a language model and a language;
the language model includes a grammar network formed by recognizing voice commands or a language model formed by a statistical method, and the language processing can perform grammar and semantic analysis.
Specifically, assume that the target video includes 200 sentences of dialogue and each sentence is recognized separately, so that 200 associated texts are obtained. These 200 associated texts include the target associated text, which may be any one of the 200 associated texts, that is, the associated text that is subsequently matched against the search information. Next, the server needs to obtain the time information corresponding to each associated text, so that the associated text and its corresponding time information can be stored in the index library.
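A sketch of how the ASR results could be written into the index library as just described; the `segments` input and the index layout are assumptions standing in for whatever ASR engine and storage the embodiment actually uses.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    associated_text: str   # one recognized sentence of dialogue
    time_info: int         # time at which the text appears (e.g. millisecond or frame offset)

def build_index_from_asr(video_id: str, segments) -> dict:
    """Store each ASR-recognized sentence together with its time information in the index library.
    `segments` is assumed to be an iterable of (text, start_time) pairs produced by an ASR engine."""
    index_library = {video_id: []}
    for text, start_time in segments:
        index_library[video_id].append(IndexEntry(text, start_time))
    return index_library

# Hypothetical usage with made-up ASR output:
segments = [("hello everyone", 1200), ("let's get started", 5400)]
index = build_index_from_asr("video_001", segments)
print(index["video_001"][0])
```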
Secondly, in the embodiment of the application, a method for recognizing the speech in the target video by using the ASR technology is provided. With this method, whether or not the video carries subtitle text, the speech content can be recognized and converted into associated text for subsequent matching, which improves the matching accuracy.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, the matching the search information with the target associated text of the target video in the index library by the server to obtain the matching score may specifically include:
the server generates a first phoneme sequence according to the search information, wherein the first phoneme sequence comprises P phonemes, and P is an integer greater than or equal to 1;
the server generates a second phoneme sequence according to the target associated text of the target video, wherein the second phoneme sequence comprises Q phonemes, and Q is an integer greater than or equal to 1;
the server constructs a phoneme matrix according to the first phoneme sequence and the second phoneme sequence;
the server determines the accumulated operand corresponding to the maximum path from the phoneme matrix;
the server will accumulate the ratio between the operands and P as the match score.
In this embodiment, a manner of matching with the Levenshtein distance based on ASR recognition is introduced. Considering that ASR may have a certain error rate, subtitle retrieval does not perform simple text matching; instead, the search information is converted into the phonemes used in speech recognition (i.e. initials and finals) and then aligned with the phonemes corresponding to the associated text. The search information is converted into a first phoneme sequence, the first phoneme sequence is aligned with the second phoneme sequence corresponding to the associated text, fuzzy matching is performed while a certain number of errors is allowed, and the corresponding target associated text and its time information are then returned to the terminal device.
The similarity between two phoneme strings can be quantified by the Levenshtein distance, that is, the minimum number of edits required to convert one phoneme string into the other through insertion, deletion, and substitution of phonemes; this application defines the cost of each of the three operations as the value 1. The smaller the edit distance, the more similar the two phoneme strings are. The Levenshtein distance allows insertion, deletion, and substitution operations and indicates the size of the difference between the two phoneme strings. For ease of understanding, the Levenshtein distance is calculated with the following formula:
$$
\mathrm{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j), & \text{if } \min(i,j)=0,\\[4pt]
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1,j)+1\\
\mathrm{lev}_{a,b}(i,j-1)+1\\
\mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}
\end{cases}, & \text{otherwise}
\end{cases}
\qquad (3)
$$

where a denotes the first phoneme sequence, b denotes the second phoneme sequence, lev_{a,b}(i, j) denotes the distance between the first i phonemes of the first phoneme sequence and the first j phonemes of the second phoneme sequence, and 1_{(a_i ≠ b_j)} is an indicator function whose value is 1 when a_i ≠ b_j and 0 otherwise.
Specifically, the first phoneme sequence includes P phonemes and the second phoneme sequence includes Q phonemes, and the core of the algorithm is to align each phoneme in the first phoneme sequence with each phoneme in the second phoneme sequence. Assume that the search information is "just the same as turning a pen" and the target associated text is "just the same as hitting the wall"; after conversion, the first phoneme sequence is "j iu g en zh u an b i y i y ang" and the second phoneme sequence is "j iu g en zh u ang b i y i y ang". Based on this, calculation is performed with formula (3) above to obtain the phoneme matrix shown in Table 3.
TABLE 3
[Phoneme matrix between the first phoneme sequence (search information) and the second phoneme sequence (target associated text); the matrix is reproduced only as an image in the original filing.]
The purpose of this phoneme matrix is to find how to transform the second phoneme sequence into the first phoneme sequence with the minimum number of operations; the fewer the operations, the more similar the search information is to the target associated text. Each number in the phoneme matrix represents a number of operations (insertion, deletion, and substitution each count as 1 operation).
Note that, for the first phoneme sequence a and the second phoneme sequence b, when a[i] = b[j], the comparison continues with a[i+1] and b[j+1]. In calculating the Levenshtein distance between the first phoneme sequence and the second phoneme sequence, there are the following four operation modes:
Firstly, no operation;
Assuming that the first phoneme sequence is "j iu" and the second phoneme sequence is "en j iu", the last phoneme "iu" is the same in both phoneme strings, so no operation needs to be chosen for it; that is, the Levenshtein distance between "j iu" and "en j iu" equals the Levenshtein distance between "j" and "en j".
Secondly, substitution operation;
Assuming that the first phoneme sequence is "j iu a" and the second phoneme sequence is "j iu e", for the last phonemes "a" and "e" of the two phoneme strings, a substitution operation can be chosen for the last phoneme; that is, the Levenshtein distance between "j iu a" and "j iu e" equals the Levenshtein distance between "j iu" and "j iu" plus 1.
Thirdly, insertion operation;
Assuming that the first phoneme sequence is "j iu a" and the second phoneme sequence is "j iu", for the last phonemes "a" and "iu" of the two phoneme strings, an insertion operation can be chosen after "iu" in the second sequence; that is, the Levenshtein distance between "j iu a" and "j iu" equals the Levenshtein distance between "j iu" and "j iu" plus 1.
Fourthly, deletion operation;
Assuming that the first phoneme sequence is "j iu sh" and the second phoneme sequence is "j iu sh i", for the last phonemes "sh" and "i" of the two phoneme strings, a deletion operation can be chosen for the trailing "i" in the second sequence; that is, the Levenshtein distance between "j iu sh" and "j iu sh i" equals the Levenshtein distance between "j iu sh" and "j iu sh" plus 1.
Based on the above description, the maximum path is obtained by tracing, in the phoneme matrix, the path of steepest descent from the lower-right corner to the upper-left corner. Taking Table 3 as an example, the maximum path passes through the cells where the phonemes of the two sequences intersect: "j"/"j", "iu"/"iu", "g"/"g", "en"/"en", "zh"/"zh", "u"/"u", "an"/"ang", "b"/"b", "i"/"i", "y"/"y", "i"/"i", "y"/"y", and "ang"/"ang". The accumulated number of operations on the maximum path is the value "1" in the cell where "ang" in the first phoneme sequence intersects "ang" in the second phoneme sequence.
Based on this, the first phoneme sequence includes P phonemes; continuing with Table 3 as an example, P is 13. The server uses the ratio between the accumulated number of operations and P as the matching score. As discussed above, the accumulated number of operations obtained from the phoneme matrix in Table 3 is 1 and P is 13, so the matching score is 1/13 ≈ 0.08. Assuming that the matching threshold is 0.3, the matching score is smaller than the matching threshold, and the matching score is therefore considered to satisfy the matching condition.
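The phoneme-level variant reuses the same edit-distance recurrence over phoneme lists instead of characters; converting Chinese text into initials and finals would in practice rely on a pinyin tool, which is not shown here. The sequences below are the ones from the worked example, and the function name is illustrative.

```python
def phoneme_edit_distance(p1, p2) -> int:
    """Levenshtein distance over phoneme lists; insertion, deletion, and substitution each cost 1."""
    m, n = len(p1), len(p2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p1[i - 1] == p2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

# Phoneme sequences from the worked example (search information vs. target associated text).
first = "j iu g en zh u an b i y i y ang".split()
second = "j iu g en zh u ang b i y i y ang".split()
score = phoneme_edit_distance(first, second) / len(first)   # 1 / 13 ≈ 0.08
print(round(score, 2))
```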
In the embodiment of the application, a method of matching with the Levenshtein distance based on ASR recognition is provided. With this method, whether or not the video carries subtitle text, the speech content can be recognized by ASR and used as the associated text for subsequent matching, and the search information is matched against the associated text based on the Levenshtein distance. Using the Levenshtein distance to calculate the similarity between texts has the advantage of high accuracy: the smaller the Levenshtein distance, the higher the text similarity, which improves the feasibility of the scheme.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided by the embodiment of the present application, before the server matches the search information with the target associated text of the target video in the index library to obtain the matching score, the method may further include:
aiming at a target video, the server carries out image recognition processing on a video frame in the target video to obtain an associated text;
the server acquires time information corresponding to the associated text;
and the server stores the associated text and the time information corresponding to the associated text in an index database.
In this embodiment, a method for recognizing video frames in a target video by using an image recognition technology is described. For convenience of description, the target video is taken as an example. Whether or not the target video includes subtitle information, as long as picture content is available the server can recognize the picture content using image recognition technology to obtain the associated text. Image recognition mainly consists of the following parts.
Firstly, acquiring an image;
two-dimensional images in video are captured.
Secondly, preprocessing an image;
This mainly refers to processing of the image, including image binarization, smoothing, transformation, enhancement, restoration, filtering, and the like.
Thirdly, extracting and selecting features;
Feature extraction and selection are required; for example, a 64 × 64 image yields 4096 raw values, and the raw data in the measurement space is transformed to obtain the features in the feature space that best reflect the nature of the classification.
Fourthly, designing a classifier;
the main function of the classifier design is to determine the decision rule through training, so that the error rate is lowest when classifying according to the decision rule.
Fifthly, classification decision making;
the identified objects are classified in a feature space.
Specifically, assume that the target video includes 100,000 video frames and each video frame is recognized separately, so that at least one associated text is obtained. The at least one associated text includes the target associated text, which may be any one of them, that is, the associated text that is subsequently matched against the search information. Next, the server needs to obtain the time information corresponding to each associated text, so that the associated text and its corresponding time information can be stored in the index library.
For ease of understanding, please refer to Table 4, which illustrates the associated texts in the index library and their time information.
TABLE 4

Associated text              | Time information (frame number)
Pen with writing-in function | 11212
Pen and drum set             | 25568 and 33050
Character, pen and drum set  | 44661
Television receiver          | 73216 and 88000
Computer with a display      | 99532
As can be seen from Table 4, each associated text corresponds to time information, and the time information in Table 4 indicates the time points (frame numbers) at which the associated text appears in the target video. For example, the associated text "pen and drum set" appears at frame 25568 and frame 33050 of the target video and may last for at least one frame; it should be noted that this application focuses on the first frame at which the associated text appears.
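The frame-level index of Table 4 can be represented, for example, as a mapping from associated text to the frame numbers at which it appears; `recognize_frame` is a hypothetical placeholder for the image recognition routine, and only the first appearance is of interest, as noted above.

```python
from __future__ import annotations

def build_index_from_frames(frames, recognize_frame) -> dict:
    """Map each recognized associated text to the frame numbers at which it appears.
    `recognize_frame` is a hypothetical callable: frame image -> recognized text (or None)."""
    index = {}
    for frame_no, frame in enumerate(frames):
        text = recognize_frame(frame)
        if text:
            index.setdefault(text, []).append(frame_no)
    return index

def first_appearance(index: dict, associated_text: str):
    """Return the first frame at which the associated text appears (the focus of this embodiment)."""
    frames = index.get(associated_text)
    return frames[0] if frames else None

# Hypothetical usage with placeholder frames and a toy recognizer.
frames = ["frame0", "frame1", "frame2"]
index = build_index_from_frames(frames, recognize_frame=lambda f: "pen" if f != "frame0" else None)
print(first_appearance(index, "pen"))   # 1
```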
Secondly, in the embodiment of the application, a manner of recognizing video frames in the target video by using image recognition technology is provided. In this manner, whether or not the video carries subtitle text, the content of the video frames can be recognized and converted into associated text for subsequent matching, which improves the matching accuracy.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, the matching the search information with the target associated text of the target video in the index library by the server to obtain the matching score may specifically include:
based on the search information, the server obtains a first word vector through an input layer included in the semantic matching model;
based on a target associated text of a target video, the server acquires a second word vector through an input layer included by a semantic matching model;
based on the first word vector, the server obtains a first semantic vector through a presentation layer included in a semantic matching model;
based on the second word vector, the server obtains a second semantic vector through a presentation layer included in the semantic matching model;
based on the first semantic vector and the second semantic vector, the server obtains the cosine distance through a matching layer included in the semantic matching model, and takes the cosine distance as a matching score.
In this embodiment, a keyword matching method based on image recognition technology is introduced. The server may output the matching score using a semantic matching model, which may be a Deep Structured Semantic Model (DSSM) or another type of model.
Specifically, for ease of understanding, please refer to fig. 4, which is a schematic structural diagram of the semantic matching model in the embodiment of the present application. As shown in the figure, the search information (the search text, or the search voice after ASR conversion) and the target associated text (any associated text) are input into the input layer of the semantic matching model respectively. The input layer maps each sentence into a vector space and feeds it into a Deep Neural Network (DNN), producing a first word vector corresponding to the search information and a second word vector corresponding to the target associated text. The first word vector and the second word vector are then input into the presentation layer, which processes them in a bag-of-words manner to obtain a first semantic vector and a second semantic vector respectively. Finally, the first semantic vector and the second semantic vector are input into the matching layer, and the cosine distance output by the matching layer can be used as the matching score.
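A minimal two-tower sketch of this input layer / presentation layer / matching layer structure, written with PyTorch purely as an assumed framework (the embodiment does not prescribe one); the layer sizes, the bag-of-words inputs, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticTower(nn.Module):
    """Input layer + presentation layer: map a bag-of-words vector to a semantic vector."""
    def __init__(self, vocab_size: int, hidden: int = 300, out: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.Tanh(),
            nn.Linear(hidden, out), nn.Tanh(),
        )

    def forward(self, bow: torch.Tensor) -> torch.Tensor:
        return self.net(bow)

def semantic_match_score(tower: SemanticTower, query_bow: torch.Tensor, text_bow: torch.Tensor) -> torch.Tensor:
    """Matching layer: cosine similarity between the two semantic vectors."""
    q, t = tower(query_bow), tower(text_bow)
    return F.cosine_similarity(q, t, dim=-1)

# Hypothetical usage with random bag-of-words vectors over an assumed 5000-word vocabulary.
tower = SemanticTower(vocab_size=5000)
score = semantic_match_score(tower, torch.rand(1, 5000), torch.rand(1, 5000))
print(score.item())
```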
In the embodiment of the application, a method for performing keyword matching based on image recognition technology is provided. With this method, a neural network model can be used to mine the relevance between the search information and the target associated text: even if the user does not input exactly the same content as the associated text when entering the search information, the content the user may want to find can still be located through the DSSM, which improves the diversity and flexibility of video search.
With reference to fig. 5, another embodiment of a method for positioning a video frame in the present application includes:
201. the method comprises the steps that terminal equipment obtains search information, wherein the search information is a search text or a search voice;
in this embodiment, the terminal device obtains search information, where the search information may be a search text or a search voice. Specifically, in one example, when the terminal device plays the target video, the user may directly input the search information on the terminal device. In another example, the user may select a video in the list as the target video, and after selecting the target video, enter search information on the terminal device. In another example, the user may enter search information, and the server selects any one of the videos from the video database as a target video and matches the search information with the target video.
202. The method comprises the steps that terminal equipment sends search information to a server, so that the server matches the search information with a target associated text of a target video in an index base to obtain a matching score, wherein the index base comprises the associated text of each video in K videos and time information corresponding to the associated text of each video, the K videos comprise the target video, and K is an integer greater than or equal to 1;
In this embodiment, the terminal device sends the search information to the server, and the server stores an index library in which the associated text of each of the K videos and the time information corresponding to the associated text of each video are stored. The associated text may be subtitle text recognized by OCR technology, subtitle text recognized by ASR technology, or text recognized by image recognition technology. The server may match the search information with each associated text corresponding to the target video in the index library and obtain a matching score between the search information and each associated text. For convenience of description, the present application takes one target associated text among the plurality of associated texts as an example; in practice, the target associated text may be any one associated text, or the associated text with the highest matching degree with the search information, which is not limited here.
203. If the matching score meets the matching condition, the terminal equipment receives time information corresponding to the target associated text and the target associated text sent by the server;
In this embodiment, the server matches the search information with the target associated text in the index library to obtain a matching score. The matching score may be a score from 0 to 1, or may be expressed in other ways, which is not limited here. If the matching score is a cosine distance or a Levenshtein distance, a smaller matching score means a higher matching degree; in this case, the matching score satisfies the matching condition if it is less than or equal to the matching threshold. If the matching score is a cosine similarity, a larger matching score means a higher matching degree; in this case, the matching score satisfies the matching condition if it is greater than or equal to the matching threshold.
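The matching-condition rule just described can be expressed compactly as follows; the function name and the threshold values are illustrative.

```python
def satisfies_match_condition(score: float, threshold: float, score_is_distance: bool) -> bool:
    """Distance-type scores (Levenshtein-based, cosine distance): match when score <= threshold.
    Similarity-type scores (cosine similarity): match when score >= threshold."""
    return score <= threshold if score_is_distance else score >= threshold

# Examples mirroring the thresholds used earlier in the text (assumed values).
print(satisfies_match_condition(0.28, 0.3, score_is_distance=True))    # True
print(satisfies_match_condition(0.71, 0.7, score_is_distance=False))   # True
```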
Specifically, when the matching score meets the matching condition, the target associated text is determined to be successfully matched with the search information. Then, the server obtains the appearance time of the target associated text in the target video based on the index library, wherein the appearance time is the time information corresponding to the target associated text. And then the server sends the time information corresponding to the target associated text and the target associated text to the terminal equipment.
204. And the terminal equipment displays the picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text.
In this embodiment, the server sends a picture positioning result for the target video to the terminal device, where the picture positioning result includes the target associated text and the time information corresponding to the target associated text, and thus, the terminal device may display the target associated text and the time information corresponding to the target associated text through the client (or the player). The user can directly view the screen positioning result related to the search information on the client.
According to the method, scenes containing the related content can be found quickly by text search or voice search, searching within a single video avoids interference from thumbnails of other similar video pictures, and the accuracy of video picture positioning is improved. In addition, the appearance time, appearance position, and the like of the target associated text can be retrieved, so that the user can intuitively view all video pictures related to the searched content in the target video, which improves the search efficiency.
Optionally, on the basis of the embodiment corresponding to fig. 5, in another optional embodiment provided in the embodiment of the present application, the acquiring, by the terminal device, the search information may specifically include:
the terminal equipment provides a text input area;
the terminal equipment receives a search text aiming at the target video through the text input area;
or the like, or, alternatively,
the terminal equipment starts the voice acquisition equipment;
the terminal equipment receives search voice aiming at the target video through the voice acquisition equipment.
In this embodiment, two ways of obtaining search information are introduced. The first way is that the user directly inputs the text content, which is the search text. The second way is that the user starts the voice collecting device and collects the voice input by the user through the voice collecting device, and the voice is the search voice.
Specifically, the manner in which the search information is acquired will be described below with reference to fig. 6 and 7. For easy understanding, please refer to fig. 6, fig. 6 is an interface diagram illustrating a text input area in an embodiment of the present application, and as shown in the drawing, a text input area is indicated by a reference character a1, a text input area may be added to an internal component or an external component of a player provided in a terminal device, and a user inputs a search text through the text input area. It should be noted that fig. 6 shows that the user may input a search text when the terminal device plays the target video, and optionally, the user may also input the search text when the terminal device does not play the video, at this time, the terminal device sends the search text to the server, the server selects one video from the video database as the target video, and performs matching processing on the search text based on the target video, where the matching manner is the content described in the foregoing embodiment, and details are not repeated here.
For easy understanding, please refer to fig. 7, fig. 7 is an interface schematic diagram illustrating a voice input prompt in the embodiment of the present application, as shown in the drawing, a voice trigger module is indicated by B1, a voice trigger module is added to an internal component or an external component of a player provided by a terminal device, and after the terminal device activates a voice collecting device (e.g., a microphone), a user inputs a voice through the voice collecting device, where the voice is a search voice. It should be noted that fig. 7 shows that the user can input the search voice when the terminal device plays the target video, and optionally, the user can also input the search voice when the terminal device does not play the video, at this time, the terminal device sends the search voice to the server, the server selects one video from the video database as the target video, and performs matching processing on the search voice based on the target video, where the matching manner is the content described in the foregoing embodiment, and details are not repeated here.
Secondly, in the embodiment of the application, two modes of obtaining the search information are provided, through the modes, a user can directly input text content as the search information and can also select a voice input mode to speak the search information, and the two modes can be realized, so that the flexibility of the scheme is improved.
Optionally, on the basis of the embodiment corresponding to fig. 5, in another optional embodiment provided in the embodiment of the present application, the displaying, by the terminal device, the picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text specifically may include:
the terminal equipment provides a playing progress bar;
the terminal equipment displays a time point identifier on the playing progress bar according to the time information corresponding to the target associated text, wherein the time point identifier belongs to the picture positioning result;
and in the text display area corresponding to the time point identification, the terminal equipment highlights a target associated text, wherein the target associated text belongs to the picture positioning result.
In this embodiment, a manner of displaying a picture positioning result is described. The terminal equipment can provide a playing progress bar through a player interface, then displays a time point identifier on the playing progress bar based on the time information corresponding to the target associated text, and in addition, can also highlight the target associated text.
Specifically, for ease of understanding, please refer to fig. 8, which is an interface diagram showing a picture positioning result in the embodiment of the present application. As shown in the figure, assume that the search information is "pen-turning"; after matching, three target associated texts that successfully match the search information are obtained. Based on this, the terminal device displays time point identifiers on the play progress bar, that is, as shown in fig. 8, three time point identifiers are displayed on the play progress bar, each corresponding to one target associated text. The three target associated texts are displayed in the text display areas corresponding to the time point identifiers: from left to right, the first text display area shows "I will not turn a pen", the second shows "turning a pen is so hard", and the third shows "just as if turning a pen". Further, the two characters of "turn a pen" can be highlighted.
Further, please refer to fig. 9, where fig. 9 is an interface diagram illustrating that the video frame corresponding to the time point identifier is skipped to in the embodiment of the present application, as shown in the figure, the user may click the "up icon" or the "down icon" or directly click the time point identifier to quickly enter the corresponding video frame.
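Displaying the time point identifiers amounts to mapping each piece of time information to an offset on the play progress bar; the sketch below assumes millisecond timestamps and a pixel-width bar, which are illustrative assumptions rather than details specified by the embodiment.

```python
def marker_positions(time_info_ms, video_duration_ms: int, bar_width_px: int):
    """Map each associated text's time information to a pixel offset on the play progress bar."""
    return [round(t / video_duration_ms * bar_width_px) for t in time_info_ms]

# Hypothetical usage: three matched associated texts in a 10-minute video on a 600-pixel bar.
print(marker_positions([73216, 254100, 420000], 600000, 600))   # [73, 254, 420]
```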
Secondly, in the embodiment of the present application, a manner of displaying the picture positioning result is provided. First, in a long-video scene, the user can quickly retrieve related content in the video and quickly locate the target video picture. Second, in audio-only video without subtitles, the user can quickly locate the target video picture by means of text search. Third, on a video platform, a series of video information related to lines or subtitles can be retrieved through text retrieval, which helps to gather related material. Fourth, for a video file, similar content information can be quickly integrated, and the frequency and time points at which important information is mentioned in the video can be quickly located.
Referring to fig. 10, fig. 10 is a schematic view of an embodiment of a video frame positioning apparatus in an embodiment of the present application, and the video frame positioning apparatus 30 includes:
the acquisition module 301 is configured to receive search information sent by a terminal device, where the search information is a search text or a search voice;
the matching module 302 is configured to match the search information with a target associated text of a target video in an index library to obtain a matching score, where the index library includes an associated text of each video of K videos and time information corresponding to the associated text of each video, the K videos include the target video, and K is an integer greater than or equal to 1;
the determining module 303 is configured to determine time information corresponding to the target associated text if the matching score meets the matching condition;
the sending module 304 is configured to send the time information corresponding to the target associated text and the target associated text to the terminal device, so that the terminal device displays the picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text.
With the video picture positioning apparatus, scenes containing the related content can be found quickly by text search or voice search, searching within a single video avoids interference from thumbnails of other similar video pictures, and the accuracy of video picture positioning is improved. In addition, the appearance time, appearance position, and the like of the target associated text can be retrieved, so that the user can intuitively view all video pictures related to the searched content in the target video, which improves the search efficiency.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video picture positioning apparatus 30 provided in the embodiment of the present application, the video picture positioning apparatus 30 further includes an identification module 305 and a storage module 306;
the identifying module 305 is configured to, before the matching module 302 matches the search information with a target associated text of a target video in the index library to obtain a matching score, perform Optical Character Recognition (OCR) processing on subtitle information in the target video to obtain an associated text for the target video if the target video includes the subtitle information;
the obtaining module 301 is further configured to obtain time information corresponding to the associated text;
the storage module 306 is configured to store the associated text and the time information corresponding to the associated text in an index library.
In the embodiment of the application, a video picture positioning device is provided, and by adopting the device, for a video with a subtitle text, the content of the subtitle text can be preferentially identified, and the content is used as a related text to perform subsequent matching processing, so that the matching accuracy is improved.
Alternatively, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video picture positioning device 30 provided in the embodiment of the present application,
a matching module 302, configured to generate a first text sequence according to the search information, where the first text sequence includes M characters, and M is an integer greater than or equal to 1;
generating a second text sequence according to a target associated text of a target video, wherein the second text sequence comprises N characters, and N is an integer greater than or equal to 1;
constructing a character matrix according to the first text sequence and the second text sequence;
determining the accumulated operand corresponding to the maximum path from the character matrix;
the ratio between the cumulative operand and M is taken as the match score.
In the embodiment of the application, a video picture positioning apparatus is provided. With this apparatus, for a video that carries subtitle text, the subtitle content can be recognized preferentially and used as the associated text for subsequent matching, and the search information is matched against the associated text based on the Levenshtein distance. Using the Levenshtein distance to calculate the similarity between texts has the advantage of high accuracy: the smaller the Levenshtein distance, the higher the text similarity, which improves the feasibility of the scheme.
Alternatively, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video picture positioning device 30 provided in the embodiment of the present application,
a matching module 302, configured to generate a first text sequence according to the search information, where the first text sequence includes R words, and R is an integer greater than or equal to 1;
generating a second text sequence according to the target associated text of the target video, wherein the second text sequence comprises T words, and T is an integer greater than or equal to 1;
determining a word set according to the first text sequence and the second text sequence, wherein the word set is a union set of R words and T words;
determining a first word frequency vector according to the word set and the first text sequence;
determining a second word frequency vector according to the word set and the second text sequence;
and taking the cosine similarity between the first word frequency vector and the second word frequency vector as a matching score.
In the embodiment of the application, the video picture positioning device is provided, and by adopting the device, for a search text, the text similarity between the search text and an associated text can be calculated, and for the search speech, the search speech can be firstly converted into a text form, and then the text similarity between the text and the associated text can be calculated. Therefore, a feasible mode is provided for implementation of the scheme, and feasibility and operability of the scheme are improved.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video picture positioning apparatus 30 provided in the embodiment of the present application, the video picture positioning apparatus 30 further includes an identification module 305 and a storage module 306;
the recognition module 305 is configured to, before the matching module 302 matches the search information with a target associated text of a target video in the index library to obtain a matching score, perform, for the target video, if the target video includes voice information, automatic speech recognition ASR processing on the voice information in the target video to obtain an associated text;
the obtaining module 301 is further configured to obtain time information corresponding to the associated text;
the storage module 306 is configured to store the associated text and the time information corresponding to the associated text in an index library.
In the embodiment of the application, a video picture positioning device is provided, and by adopting the device, for videos with or without caption texts, contents of voice can be recognized, and the contents are converted into associated texts to perform subsequent matching processing, so that matching accuracy is improved.
Alternatively, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video picture positioning device 30 provided in the embodiment of the present application,
a matching module 302, configured to generate a first phoneme sequence according to the search information, where the first phoneme sequence includes P phonemes, and P is an integer greater than or equal to 1;
generating a second phoneme sequence according to the target associated text of the target video, wherein the second phoneme sequence comprises Q phonemes, and Q is an integer greater than or equal to 1;
constructing a phoneme matrix according to the first phoneme sequence and the second phoneme sequence;
determining a cumulative operand corresponding to the maximum path from the phoneme matrix;
the ratio between the accumulated operand and P is taken as the match score.
In the embodiment of the application, a video picture positioning apparatus is provided. With this apparatus, whether or not the video carries subtitle text, the speech content can be recognized by ASR and used as the associated text for subsequent matching, and the search information is matched against the associated text based on the Levenshtein distance. Using the Levenshtein distance to calculate the similarity between texts has the advantage of high accuracy: the smaller the Levenshtein distance, the higher the text similarity, which improves the feasibility of the scheme.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video picture positioning apparatus 30 provided in the embodiment of the present application, the video picture positioning apparatus 30 further includes an identification module 305 and a storage module 306;
the identifying module 305 is configured to, before the matching module 302 matches the search information with a target associated text of a target video in the index library to obtain a matching score, perform image identification processing on a video frame in the target video for the target video to obtain an associated text;
the obtaining module 301 is further configured to obtain time information corresponding to the associated text;
the storage module 306 is configured to store the associated text and the time information corresponding to the associated text in an index library.
In the embodiment of the application, a video picture positioning device is provided, and by adopting the device, for videos with or without caption texts, the content of a video frame can be identified, and the content is converted into a related text to perform subsequent matching processing, so that the matching accuracy is improved.
Alternatively, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video picture positioning device 30 provided in the embodiment of the present application,
a matching module 302, configured to obtain a first word vector through an input layer included in a semantic matching model based on the search information;
acquiring a second word vector through an input layer included in a semantic matching model based on a target associated text of a target video;
based on the first word vector, obtaining a first semantic vector through a presentation layer included in a semantic matching model;
based on the second word vector, a second semantic vector is obtained through a presentation layer included in the semantic matching model;
based on the first semantic vector and the second semantic vector, the cosine distance is obtained through a matching layer included in the semantic matching model, and the cosine distance is used as a matching score.
In the embodiment of the application, the video picture positioning device is provided, and by adopting the device, the relevance between the search information and the target associated text can be mined by adopting a neural network model, and even if a user does not input the same content when inputting the search information, the content which the user may want to search can be found through the DSSM, so that the diversity and flexibility of video search are improved.
Referring to fig. 11, fig. 11 is a schematic view of another embodiment of the video frame positioning apparatus in the present application, in which the video frame positioning apparatus 40 includes:
an obtaining module 401, configured to obtain search information, where the search information is a search text or a search voice;
a sending module 402, configured to send search information to a server, so that the server matches the search information with a target associated text of a target video in an index library to obtain a matching score, where the index library includes an associated text of each video of K videos and time information corresponding to the associated text of each video, the K videos include the target video, and K is an integer greater than or equal to 1;
the obtaining module 401 is further configured to receive time information and a target associated text corresponding to the target associated text sent by the server if the matching score meets the matching condition;
the display module 403 is configured to display a picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text.
With the video picture positioning apparatus, scenes containing the related content can be found quickly by text search or voice search, searching within a single video avoids interference from thumbnails of other similar video pictures, and the accuracy of video picture positioning is improved. In addition, the appearance time, appearance position, and the like of the target associated text can be retrieved, so that the user can intuitively view all video pictures related to the searched content in the target video, which improves the search efficiency.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the video picture positioning apparatus 40 provided in the embodiment of the present application,
an obtaining module 401, specifically configured to provide a text input area;
receiving a search text for a target video through a text input area;
or the like, or, alternatively,
starting a voice acquisition device;
and receiving search voice aiming at the target video through the voice acquisition equipment.
In the embodiment of the application, the video picture positioning device is provided, and by adopting the device, a user can directly input text content as search information and can also select a voice input mode to speak the search information, and the two modes can be realized, so that the flexibility of the scheme is improved.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the video picture positioning apparatus 40 provided in the embodiment of the present application,
a display module 403, configured to provide a play progress bar;
displaying a time point identifier on the playing progress bar according to time information corresponding to the target associated text, wherein the time point identifier belongs to a picture positioning result;
and highlighting the target associated text in the text display area corresponding to the time point identification, wherein the target associated text belongs to the picture positioning result.
In the embodiment of the application, a video picture positioning device is provided, and by adopting the device, firstly, in a long video scene, a user can quickly search related contents in a video and quickly position a target video picture. Second, in audio-only video without subtitles, a user can quickly locate a target video picture by means of text search. Thirdly, in the video platform, a series of video information related to the lines or subtitles can be retrieved through text retrieval, which is beneficial to better gathering related materials. Fourthly, for a video file, similar content information can be quickly integrated, and the frequency and the time point of reference of important information in the video can be quickly positioned.
The embodiment of the application also provides another video picture positioning device, and the video picture positioning device is deployed in the server. Fig. 12 is a schematic diagram of a server structure provided by an embodiment of the present application, where the server 500 may have a relatively large difference due to different configurations or performances, and may include one or more Central Processing Units (CPUs) 522 (e.g., one or more processors) and a memory 532, and one or more storage media 530 (e.g., one or more mass storage devices) for storing applications 542 or data 544. Memory 532 and storage media 530 may be, among other things, transient storage or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 522 may be configured to communicate with the storage medium 530, and execute a series of instruction operations in the storage medium 530 on the server 500.
The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input-output interfaces 558, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 12.
The embodiment of the present application further provides another video picture positioning device, which is deployed in a terminal device. As shown in fig. 13, for convenience of explanation, only the parts related to the embodiments of the present application are shown; for specific technical details that are not disclosed, please refer to the method part of the embodiments of the present application. The terminal device may be any terminal device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Point of Sale (POS) terminal, a vehicle-mounted computer, and the like. The following takes a mobile phone as an example of the terminal device:
fig. 13 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 13, the handset includes: a Radio Frequency (RF) circuit 610, a memory 620, an input unit 630, a display unit 640, a sensor 650, an audio circuit 660, a wireless fidelity (WiFi) module 670, a processor 680, and a power supply 690. Those skilled in the art will appreciate that the handset configuration shown in fig. 13 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
The following describes each component of the mobile phone in detail with reference to fig. 13:
the RF circuit 610 may be used for receiving and transmitting signals during information transmission and reception or during a call. In particular, after receiving downlink information from a base station, the RF circuit 610 forwards it to the processor 680 for processing; in addition, uplink data is transmitted to the base station. In general, the RF circuit 610 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 610 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
The memory 620 may be used to store software programs and modules, and the processor 680 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 620. The memory 620 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. Further, the memory 620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid state storage device.
The input unit 630 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 630 may include a touch panel 631 and other input devices 632. The touch panel 631, also referred to as a touch screen, may collect touch operations of a user on or near it (e.g., operations performed by the user on or near the touch panel 631 using any suitable object or accessory such as a finger or a stylus) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 631 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch orientation of the user, detects a signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 680, and can receive and execute commands sent by the processor 680. In addition, the touch panel 631 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 630 may include other input devices 632 in addition to the touch panel 631. In particular, the other input devices 632 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 640 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The display unit 640 may include a display panel 641; optionally, the display panel 641 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 631 may cover the display panel 641, and when the touch panel 631 detects a touch operation on or near it, the touch operation is transmitted to the processor 680 to determine the type of the touch event, and the processor 680 then provides a corresponding visual output on the display panel 641 according to the type of the touch event. Although in fig. 13 the touch panel 631 and the display panel 641 are two independent components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 631 and the display panel 641 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 650, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 641 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 641 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
The audio circuit 660, the speaker 661, and the microphone 662 can provide an audio interface between the user and the mobile phone. The audio circuit 660 may transmit the electrical signal converted from received audio data to the speaker 661, and the speaker 661 converts the electrical signal into a sound signal for output; on the other hand, the microphone 662 converts a collected sound signal into an electrical signal, which is received by the audio circuit 660 and converted into audio data; the audio data is then output to the processor 680 for processing, and subsequently transmitted via the RF circuit 610 to, for example, another mobile phone, or output to the memory 620 for further processing.
WiFi is a short-range wireless transmission technology. The mobile phone can help the user to receive and send e-mails, browse web pages, access streaming media, and the like through the WiFi module 670, which provides wireless broadband Internet access for the user. Although fig. 13 shows the WiFi module 670, it is understood that it is not an essential component of the handset and may be omitted entirely as needed without changing the essence of the invention.
The processor 680 is a control center of the mobile phone, and connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 620 and calling data stored in the memory 620, thereby performing overall monitoring of the mobile phone. Optionally, processor 680 may include one or more processing units; optionally, the processor 680 may integrate an application processor and a modem processor, wherein the application processor mainly handles operating systems, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 680.
The handset also includes a power supply 690 (e.g., a battery) for powering the various components. Optionally, the power supply may be logically connected to the processor 680 via a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
The steps performed by the terminal device in the above-described embodiment may be based on the terminal device structure shown in fig. 13.
Embodiments of the present application also provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product including a program, which, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A method for locating a video frame, comprising:
receiving search information sent by terminal equipment, wherein the search information is a search text or a search voice;
matching the search information with a target associated text of a target video in an index library to obtain a matching score, wherein the index library comprises an associated text of each video in K videos and time information corresponding to the associated text of each video, the K videos comprise the target video, and K is an integer greater than or equal to 1;
if the matching score meets the matching condition, determining time information corresponding to the target associated text;
and sending the time information corresponding to the target associated text and the target associated text to the terminal equipment so that the terminal equipment displays the picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text.
2. The method of claim 1, further comprising:
for the target video, if the target video comprises subtitle information, performing Optical Character Recognition (OCR) processing on the subtitle information in the target video to obtain the associated text;
acquiring time information corresponding to the associated text;
and storing the associated text and the time information corresponding to the associated text in the index database.
3. The method according to claim 1 or 2, wherein the matching of the search information with the target associated text of the target video in the index library to obtain a matching score comprises:
generating a first text sequence according to the search information, wherein the first text sequence comprises M characters, and M is an integer greater than or equal to 1;
generating a second text sequence according to a target associated text of the target video, wherein the second text sequence comprises N characters, and N is an integer greater than or equal to 1;
constructing a character matrix according to the first text sequence and the second text sequence;
determining the accumulated operand corresponding to the maximum path from the character matrix;
and taking a ratio between the accumulated operand and M as the matching score.
4. The method according to claim 1 or 2, wherein the matching of the search information with the target associated text of the target video in the index library to obtain a matching score comprises:
generating a first text sequence according to the search information, wherein the first text sequence comprises R words, and R is an integer greater than or equal to 1;
generating a second text sequence according to a target associated text of the target video, wherein the second text sequence comprises T words, and T is an integer greater than or equal to 1;
determining a word set according to the first text sequence and the second text sequence, wherein the word set is a union of the R words and the T words;
determining a first word frequency vector according to the word set and the first text sequence;
determining a second word frequency vector according to the word set and the second text sequence;
and taking the cosine similarity between the first word frequency vector and the second word frequency vector as the matching score.
5. The method of claim 1, further comprising:
for the target video, if the target video comprises voice information, performing Automatic Speech Recognition (ASR) processing on the voice information in the target video to obtain the associated text;
acquiring time information corresponding to the associated text;
and storing the associated text and the time information corresponding to the associated text in the index database.
6. The method according to claim 1 or 5, wherein the matching of the search information with the target associated text of the target video in the index library to obtain a matching score comprises:
generating a first phoneme sequence according to the search information, wherein the first phoneme sequence comprises P phonemes, and P is an integer greater than or equal to 1;
generating a second phoneme sequence according to the target associated text of the target video, wherein the second phoneme sequence comprises Q phonemes, and Q is an integer greater than or equal to 1;
constructing a phoneme matrix according to the first phoneme sequence and the second phoneme sequence;
determining the accumulated operand corresponding to the maximum path from the phoneme matrix;
and taking a ratio between the accumulated operand and P as the matching score.
7. The method of claim 1, further comprising:
for the target video, performing image recognition processing on a video frame in the target video to obtain the associated text;
acquiring time information corresponding to the associated text;
and storing the associated text and the time information corresponding to the associated text in the index database.
8. The method according to claim 1 or 7, wherein the matching of the search information with the target associated text of the target video in the index library to obtain a matching score comprises:
based on the search information, acquiring a first word vector through an input layer included in a semantic matching model;
based on the target associated text of the target video, acquiring a second word vector through an input layer included by the semantic matching model;
based on the first word vector, obtaining a first semantic vector through a presentation layer included in the semantic matching model;
based on the second word vector, a second semantic vector is obtained through a presentation layer included by the semantic matching model;
based on the first semantic vector and the second semantic vector, a cosine distance is obtained through a matching layer included in the semantic matching model, and the cosine distance is used as the matching score.
9. A method for locating a video frame, comprising:
acquiring search information, wherein the search information is a search text or a search voice;
sending the search information to a server to enable the server to match the search information with a target associated text of a target video in an index library to obtain a matching score, wherein the index library comprises the associated text of each video in K videos and time information corresponding to the associated text of each video, the K videos comprise the target video, and K is an integer greater than or equal to 1;
if the matching score meets the matching condition, receiving time information corresponding to the target associated text and the target associated text which are sent by the server;
and displaying the picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text.
10. The method of claim 9, wherein the obtaining search information comprises:
providing a text input area;
receiving search text for the target video through a text input area;
or, alternatively,
starting a voice acquisition device;
receiving, by a voice capture device, a search voice for the target video.
11. The method according to claim 9, wherein the displaying the screen positioning result of the target video according to the time information corresponding to the target associated text and the target associated text comprises:
providing a playing progress bar;
displaying a time point identifier on the playing progress bar according to the time information corresponding to the target associated text, wherein the time point identifier belongs to the picture positioning result;
and highlighting the target associated text in a text display area corresponding to the time point identifier, wherein the target associated text belongs to the picture positioning result.
12. A video picture positioning apparatus, comprising:
an acquisition module, configured to receive search information sent by terminal equipment, wherein the search information is a search text or a search voice;
the matching module is used for matching the search information with a target associated text of a target video in an index library to obtain a matching score, wherein the index library comprises an associated text of each video in K videos and time information corresponding to the associated text of each video, the K videos comprise the target video, and K is an integer greater than or equal to 1;
the determining module is used for determining the time information corresponding to the target associated text if the matching score meets the matching condition;
and the sending module is used for sending the time information corresponding to the target associated text and the target associated text to the terminal equipment so that the terminal equipment displays the picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text.
13. A video picture positioning apparatus, comprising:
an acquisition module, configured to acquire search information, wherein the search information is a search text or a search voice;
a sending module, configured to send the search information to a server, so that the server matches the search information with a target associated text of the target video in an index library to obtain a matching score, where the index library includes an associated text of each video of K videos and time information corresponding to the associated text of each video, the K videos include the target video, and K is an integer greater than or equal to 1;
the acquisition module is further used for receiving the time information corresponding to the target associated text and the target associated text sent by the server if the matching score meets the matching condition;
and the display module is used for displaying the picture positioning result of the target video according to the time information corresponding to the target associated text and the target associated text.
14. A computer device, comprising: a memory, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is configured to execute the program in the memory, and to perform the method according to any one of claims 1 to 8 or the method according to any one of claims 9 to 11 according to instructions in the program;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
15. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any of claims 1 to 8, or perform the method of any of claims 9 to 11.
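The indexing steps recited in claims 2, 5 and 7 above (obtaining an associated text from subtitle information by OCR, from voice information by ASR, or from video frames by image recognition, together with its time information, and storing both in the index library) can be made more concrete with an illustrative Python sketch. The function name build_index_entries, the recognize_subtitle_text callable, and the dictionary-based index library are placeholders introduced here for illustration; the claims do not prescribe any particular recognition engine or storage layout.

    def build_index_entries(video_id, frames, recognize_subtitle_text):
        """Illustrative sketch: turn (timestamp, frame) pairs into index entries.
        `recognize_subtitle_text` stands in for OCR on subtitles (claim 2); an ASR
        pass over the audio track (claim 5) or generic image recognition on video
        frames (claim 7) would produce entries of the same shape."""
        entries = []
        for timestamp_s, frame in frames:
            associated_text = recognize_subtitle_text(frame)
            if associated_text:
                entries.append({
                    "video_id": video_id,
                    "associated_text": associated_text,
                    "time_s": timestamp_s,  # time information stored with the text
                })
        return entries

    # A toy index library keyed by video identifier.
    index_library = {}
    index_library["video_001"] = build_index_entries(
        "video_001",
        [(12.0, "frame_a"), (34.5, "frame_b")],
        recognize_subtitle_text=lambda frame: "example subtitle text",
    )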
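One plausible reading of the character-matrix matching in claim 3 is a longest-common-subsequence style dynamic program: matching characters accumulate along the maximum path through an M x N matrix, and the accumulated value is divided by M, the length of the search-text sequence. Treating the maximum path as a longest common subsequence is an assumption; the claim itself only specifies the matrix, the maximum path, and the ratio. A minimal Python sketch under that assumption:

    def character_matrix_score(search_text, associated_text):
        """Sketch of the claim 3 matching, read as a longest-common-subsequence
        dynamic program over an M x N character matrix (this reading is an
        assumption). The accumulated value of the best path is divided by M."""
        m, n = len(search_text), len(associated_text)
        if m == 0:
            return 0.0
        # dp[i][j] = best accumulated count using the first i and j characters.
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if search_text[i - 1] == associated_text[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[m][n] / m  # ratio between the accumulated value and M

    print(character_matrix_score("video search", "how video search works"))  # 1.0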
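The word-frequency matching of claim 4 builds a vocabulary from the union of the words in the two sequences, counts term frequencies against that vocabulary, and uses the cosine similarity of the two count vectors as the matching score. A minimal sketch, assuming whitespace tokenization (the claim itself does not fix any segmentation method, and Chinese text would need a word segmenter):

    import math
    from collections import Counter

    def word_frequency_score(search_text, associated_text):
        """Sketch of claim 4: cosine similarity between term-frequency vectors
        built over the union of the words of both sequences."""
        first = search_text.split()
        second = associated_text.split()
        vocabulary = sorted(set(first) | set(second))  # the word set (union)
        counts_a = Counter(first)
        counts_b = Counter(second)
        vec_a = [counts_a[w] for w in vocabulary]      # first word frequency vector
        vec_b = [counts_b[w] for w in vocabulary]      # second word frequency vector
        dot = sum(a * b for a, b in zip(vec_a, vec_b))
        norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
        return dot / norm if norm else 0.0

    print(word_frequency_score("locate the target picture", "locate a target video picture"))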
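The phoneme-matrix matching of claim 6 follows the same maximum-path construction as claim 3, only over phoneme sequences rather than characters, which makes the comparison tolerant of homophone errors in ASR transcripts. The sketch below assumes the character_matrix_score function from the claim 3 sketch is in scope and uses a placeholder text_to_phonemes converter; neither the converter nor this reuse is specified by the claims.

    def phoneme_matrix_score(search_text, associated_text, text_to_phonemes):
        """Sketch of claim 6: convert both texts to phoneme sequences and reuse
        the maximum-path matching from the claim 3 sketch. P is the length of
        the first phoneme sequence, so the ratio is taken against it."""
        first = text_to_phonemes(search_text)       # P phonemes
        second = text_to_phonemes(associated_text)  # Q phonemes
        return character_matrix_score(first, second)

    # With a placeholder converter that treats each character as one "phoneme":
    print(phoneme_matrix_score("shi pin", "sou suo shi pin", text_to_phonemes=list))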
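Claim 8 describes a twin-tower semantic matching model: an input layer produces word vectors for the search information and for the associated text, a presentation layer maps each to a semantic vector, and a matching layer computes a cosine measure used as the matching score. A minimal PyTorch-style Python sketch follows; the embedding size, the mean pooling, and the single linear presentation layer are illustrative assumptions, not details taken from the application.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SemanticMatchingModel(nn.Module):
        """Twin-tower sketch of claim 8 (layer sizes and choices are assumptions)."""
        def __init__(self, vocab_size, embed_dim=64, semantic_dim=32):
            super().__init__()
            self.input_layer = nn.Embedding(vocab_size, embed_dim)    # word vectors
            self.presentation_layer = nn.Linear(embed_dim, semantic_dim)

        def encode(self, token_ids):
            word_vectors = self.input_layer(token_ids)           # (batch, seq, embed)
            pooled = word_vectors.mean(dim=1)                     # simple mean pooling
            return torch.tanh(self.presentation_layer(pooled))   # semantic vector

        def forward(self, query_ids, text_ids):
            query_vec = self.encode(query_ids)
            text_vec = self.encode(text_ids)
            # Matching layer: cosine measure used as the matching score
            # (the claim phrases this as a cosine distance).
            return F.cosine_similarity(query_vec, text_vec, dim=-1)

    model = SemanticMatchingModel(vocab_size=1000)
    score = model(torch.tensor([[3, 7, 42]]), torch.tensor([[3, 7, 42, 99]]))
    print(score.item())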
CN202110071179.XA 2021-01-19 2021-01-19 Video picture positioning method, related device, equipment and storage medium Pending CN113596601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110071179.XA CN113596601A (en) 2021-01-19 2021-01-19 Video picture positioning method, related device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113596601A true CN113596601A (en) 2021-11-02

Family

ID=78238101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110071179.XA Pending CN113596601A (en) 2021-01-19 2021-01-19 Video picture positioning method, related device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113596601A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170195737A1 (en) * 2015-12-30 2017-07-06 Le Holdings (Beijing) Co., Ltd. Method for video search and electronic device
CN107071542A (en) * 2017-04-18 2017-08-18 百度在线网络技术(北京)有限公司 Video segment player method and device
CN108401189A (en) * 2018-03-16 2018-08-14 百度在线网络技术(北京)有限公司 A kind of method, apparatus and server of search video
CN111159343A (en) * 2019-12-26 2020-05-15 上海科技发展有限公司 Text similarity searching method, device, equipment and medium based on text embedding

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Giancarlo Zaccone et al.: "TensorFlow Deep Learning: Mathematical Principles and Advanced Python Practice", 28 February 2020 *
Cao Lin, Liu Yu: "Research on Address Matching for Urban Disease Mapping and Analysis", 30 June 2019, Yanbian University Press *
Han Rui: "Alibaba B2B E-commerce Algorithms in Practice", 31 July 2020, China Machine Press *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114979745A (en) * 2022-05-06 2022-08-30 维沃移动通信有限公司 Video processing method and device, electronic equipment and readable storage medium
CN114782879A (en) * 2022-06-20 2022-07-22 腾讯科技(深圳)有限公司 Video identification method and device, computer equipment and storage medium
CN114782879B (en) * 2022-06-20 2022-08-23 腾讯科技(深圳)有限公司 Video identification method and device, computer equipment and storage medium
WO2024036945A1 (en) * 2022-08-16 2024-02-22 华为技术有限公司 Broadcast-directing control method and apparatus
CN115396627A (en) * 2022-08-24 2022-11-25 易讯科技股份有限公司 Positioning management method and system for screen recording video conference

Similar Documents

Publication Publication Date Title
US7853582B2 (en) Method and system for providing information services related to multimodal inputs
CN108701161B (en) Providing images for search queries
US9411830B2 (en) Interactive multi-modal image search
CN113596601A (en) Video picture positioning method, related device, equipment and storage medium
US8577882B2 (en) Method and system for searching multilingual documents
CN109189879B (en) Electronic book display method and device
CN109196496A (en) The translater of unknown word fallout predictor and content integration
CN110334344A (en) A kind of semanteme intension recognizing method, device, equipment and storage medium
US9639633B2 (en) Providing information services related to multimodal inputs
CN113378556A (en) Method and device for extracting text keywords
CN112163428A (en) Semantic tag acquisition method and device, node equipment and storage medium
CN111491123A (en) Video background processing method and device and electronic equipment
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN111428522A (en) Translation corpus generation method and device, computer equipment and storage medium
CN111314771B (en) Video playing method and related equipment
CN112052784A (en) Article searching method, device, equipment and computer readable storage medium
CN112995757B (en) Video clipping method and device
CN110929137B (en) Article recommendation method, device, equipment and storage medium
CN113822038A (en) Abstract generation method and related device
CN112328783A (en) Abstract determining method and related device
CN116758362A (en) Image processing method, device, computer equipment and storage medium
KR20210120203A (en) Method for generating metadata based on web page
KR20150135059A (en) Method for Searching and Device Thereof
CN111428523B (en) Translation corpus generation method, device, computer equipment and storage medium
CN111339359B (en) Sudoku-based video thumbnail automatic generation method

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40055314

Country of ref document: HK

SE01 Entry into force of request for substantive examination