CN113068077B - Subtitle file processing method and device - Google Patents

Subtitle file processing method and device

Info

Publication number
CN113068077B
Authority
CN
China
Prior art keywords
video
keywords
user
information
subtitle
Prior art date
Legal status
Active
Application number
CN202010000750.4A
Other languages
Chinese (zh)
Other versions
CN113068077A (en)
Inventor
阳萍
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010000750.4A
Publication of CN113068077A
Application granted
Publication of CN113068077B
Legal status: Active

Classifications

All classifications fall under H04N21/00 (Selective content distribution, e.g. interactive television or video on demand [VOD]):

    • H04N21/4884: Data services, e.g. news ticker, for displaying subtitles
    • H04N21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/234336: Reformatting operations of video signals for distribution or compliance with end-user requests or device requirements, by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
    • H04N21/435: Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/8133: Monomedia components involving additional data specifically related to the content, e.g. biography of the actors in a movie, detailed information about an article seen in a video program

Abstract

The application relates to the technical field of artificial intelligence, and provides a subtitle file processing method and device, which are used for improving the keyword searching efficiency of a user while watching videos. The method comprises the following steps: displaying subtitle information of the video and keywords in the subtitle during video playing, wherein the keywords in the subtitle are matched from the subtitle information according to a user portrait of the user, and each keyword is associated with target information; and in response to a selection operation by the user on a target keyword among the displayed keywords, displaying the target information associated with the target keyword.

Description

Subtitle file processing method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a subtitle file processing method and device.
Background
Ordinarily, a user may come across subtitle information of interest while watching a video through a terminal device. At this time, the user needs to pause the video, open a search engine, and search for the subtitle information before the corresponding content can be obtained. The interaction process between the user and the terminal device is complex, and the interaction efficiency is low.
Disclosure of Invention
The embodiment of the application provides a subtitle file processing method and device, which are used for improving the keyword searching efficiency of a user in the video watching process.
In a first aspect, a subtitle file processing method is provided, including:
displaying subtitle information of the video and keywords in the subtitle in the video playing process; wherein the keywords in the subtitle are matched from the subtitle information according to a user portrait of the user, and each keyword is associated with target information;
and in response to a selection operation by the user on a target keyword among the displayed keywords, displaying the target information associated with the target keyword.
In a possible embodiment, the video subtitle file further includes keyword indication information of video and audio contents associated with subtitles, the keyword indication information of the video and audio contents being used for indicating keywords in the video and audio contents, wherein the keywords in the video and audio contents are matched from the video and audio contents according to the user portrait, and the video and audio contents are identified from the video and audio file; and
the method further comprises the steps of: keywords of video and audio contents associated with the subtitles are displayed in a set format while the subtitles are displayed.
In one possible embodiment, the video-audio content comprises video content, and the keywords of the video-audio content comprise video content keywords indicating people and/or things in the video content; and/or
The video-audio content includes audio content, and the keywords of the video-audio content include audio content keywords indicating musical composition information in the audio content.
In a second aspect, a subtitle file processing method is provided, including:
determining keywords matched with user portraits of users in subtitle information; wherein, each keyword is associated with target information;
and generating a video subtitle file, wherein the video subtitle file comprises keyword indication information in subtitles and the subtitle information, and the keyword indication information is used for indicating keywords included in the subtitle information.
In one possible embodiment, the method further comprises:
identifying video and audio content from a video and audio file associated with the subtitle;
determining keywords of the video and audio content matched with the user portrait from the video and audio content; and
the generated video subtitle file further includes: the keyword indication information of the video and audio content is used for indicating keywords in the video and audio content.
In one possible embodiment, the video-audio content comprises video content, and the keywords of the video-audio content comprise video content keywords indicating people and/or things in the video content; and/or
The video-audio content includes audio content, and the keywords of the video-audio content include audio content keywords indicating musical composition information in the audio content.
In one possible embodiment, determining keywords in the subtitle information that match the user's user portraits includes:
extracting a plurality of keywords in the subtitle information;
determining a keyword set matched with the user portrait;
and determining the keywords with click frequency meeting preset conditions as keywords matched with the user portrait from the keyword set.
In a third aspect, there is provided a subtitle file processing apparatus including:
the display module is used for displaying subtitle information of the video and keywords in the subtitle in the video playing process; wherein the keywords in the subtitle are matched from the subtitle information according to a user portrait of the user, and each keyword is associated with target information;
and the response module is used for responding to the selection operation of the user from the target keywords in the displayed keywords and displaying target information associated with the target keywords.
In a possible embodiment, the apparatus further comprises a transceiver module, wherein:
the response module is also used for responding to the video playing operation of the user and requesting the video subtitle file from the video server before displaying the subtitles of the video and the keywords in the subtitles in the video playing process;
the receiving and transmitting module is used for receiving the video subtitle file sent by the video server; the video subtitle file comprises subtitle information and keyword indication information in the subtitle, wherein the keyword indication information in the subtitle is used for indicating keywords included in the subtitle information.
In a possible embodiment, the video subtitle file further includes keyword indication information of video and audio contents associated with subtitles, the keyword indication information of the video and audio contents being used for indicating keywords in the video and audio contents, wherein the keywords in the video and audio contents are matched from the video and audio contents according to the user portrait, and the video and audio contents are identified from the video and audio file; and
the display module is also used for displaying keywords of the associated video and audio contents in the subtitles according to a set format while displaying the subtitles.
In one possible embodiment, the video-audio content comprises video content, and the keywords of the video-audio content comprise video content keywords indicating people and/or things in the video content; and/or
The video-audio content includes audio content, and the keywords of the video-audio content include audio content keywords indicating musical composition information in the audio content.
In a possible embodiment, the display module is specifically configured to:
while displaying the subtitles, keywords in the subtitles are displayed differently in accordance with a set format in the subtitles.
In one possible embodiment, the setting format includes: displaying the keywords as a super search link mode; the response module is specifically configured to:
obtaining a search link corresponding to the selected target keyword;
invoking a search engine, and obtaining target information associated with the selected target keywords from an engine server according to the search link;
and carrying out association display on the obtained target information and the target keywords.
In a possible embodiment, the video subtitle file further includes target information associated with each keyword; the response module is specifically configured to:
obtaining the target information associated with the target keyword from the video subtitle file;
and carrying out association display on the obtained target information and the target keywords.
In a fourth aspect, there is provided a subtitle file processing apparatus including:
the determining module is used for determining keywords matched with the user portrait of the user in the subtitle information; wherein, each keyword is associated with target information;
the generation module is used for generating a video subtitle file, wherein the video subtitle file comprises keyword indication information in subtitles and the subtitle information, and the keyword indication information is used for indicating keywords included in the subtitle information.
In a possible embodiment, the determining module is further configured to: identifying video and audio content from a video and audio file associated with the subtitle; and determining keywords of the video and audio content matched with the user portrait from the video and audio content;
the generating module is further configured to: the keyword indication information of the video and audio content is used for indicating keywords in the video and audio content.
In a possible embodiment, the apparatus further comprises a transceiver module, wherein:
The receiving and transmitting module is used for responding to a video subtitle file request sent by a user through a video client and sending the video subtitle file to the client.
In one possible embodiment, the video-audio content comprises video content, and the keywords of the video-audio content comprise video content keywords indicating people and/or things in the video content; and/or
The video-audio content includes audio content, and the keywords of the video-audio content include audio content keywords indicating musical composition information in the audio content.
In one possible embodiment, the user portrait is generated from user attributes and operation behavior data of the user; the operation behavior data of the user comprises a video set operated by the user in a set time period before the current time and a clicked keyword set among all keywords displayed to the user in the video playing process in the set time period before the current time.
In a possible embodiment, the determining module is specifically configured to:
extracting a plurality of keywords in the subtitle information;
determining a keyword set matched with the user portrait;
And determining the keywords with click frequency meeting preset conditions as keywords matched with the user portrait from the keyword set.
In a possible embodiment, a ratio of the number of words of the keyword matched with the user portrait to the number of words included in the subtitle information is smaller than or equal to a preset ratio.
In a possible embodiment, the video subtitle file further includes target information associated with each keyword.
In a fifth aspect, there is provided an electronic device comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the at least one processor implements the method of any one of the first or second aspects by executing the instructions stored in the memory.
In a sixth aspect, there is provided a computer readable storage medium storing computer instructions that, when run on a computer, cause the computer to perform the method of any one of the first or second aspects.
Due to the adoption of the technical scheme, the embodiment of the application has at least the following technical effects:
In the embodiment of the application, the subtitle information and the keywords in the subtitle are displayed in the video playing process; once a user needs to look up a certain keyword, the user directly selects the keyword, and the corresponding target information can be displayed. Further, in the embodiment of the application, the keywords in the subtitle information are matched based on the user portrait, so that the personalized subtitle search requirements of different users can be met, the number of keywords displayed in the subtitles can be relatively reduced, and the amount of data transmitted between the video server and the terminal device can be relatively reduced.
Drawings
FIG. 1 is a schematic diagram of an interaction process between a user and a terminal device provided in the prior art;
fig. 2 is a schematic view of an application scenario of a subtitle file processing method according to an embodiment of the present application;
fig. 3 is an interaction schematic diagram of a subtitle file processing method according to an embodiment of the present application;
Fig. 4 is an interface schematic diagram of a video to be played of a video client according to an embodiment of the present application;
FIG. 5 is a schematic diagram of playing the same video frame for two different users according to an embodiment of the present application;
FIG. 6 is a schematic diagram of identifying video content in a video frame according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a target keyword and target information associated display provided in an embodiment of the present application;
fig. 8 is an interaction schematic diagram of a subtitle file processing method according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a subtitle file processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic diagram of a subtitle file processing apparatus according to a second embodiment of the present application;
fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a video server according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solutions provided by the embodiments of the present application, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
In order to facilitate a better understanding of the present application by those skilled in the art, the following description is given of the generic terms related to the embodiments of the present application.
Artificial intelligence (Artificial Intelligence, AI): a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML): a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use daily, so it is closely related to research in linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Video and audio files: include an audio file and a video file. The audio file includes dubbing data of the video and musical composition data of the video, and the video file includes the video frames of the video. Some silent videos may contain no audio, in which case the video and audio file is effectively equivalent to the video file.
Keywords in subtitle information: in the present application, this refers to some or all of the keywords that are matched from the subtitle information based on a user portrait.
Keyword indication information of subtitles: used for indicating the keywords in the subtitle information. The keyword indication information may be the selected keywords themselves, the super search links corresponding to the keywords, or position information of the keywords in the subtitle information; the position information may be a time period on the subtitle time axis, for example 00:02s-00:04s, or a specific position in the subtitle information, for example the 3rd to 6th words of the subtitle sentence. The keyword indication information may also be identification information for the corresponding keyword in the subtitle information.
Video and audio content: content recognized from the video and audio file; it can be understood as the text content obtained after certain processing of the video and audio. For example, image description information, such as people and/or things in an image frame, is recognized from the image frames of a video file based on machine learning, and text content, such as the musical composition or dubbing in an audio file, is recognized from the audio file based on NLP.
Keywords of video and audio content: refers to some or all of the keywords that are matched from the video and audio content based on the user portraits.
Keyword indication information of video and audio content: used for indicating keywords in the video and audio content. The keyword indication information may be the keyword itself together with its time period on the time axis corresponding to the image frame or the audio, or the super search link corresponding to the keyword together with that time period.
Video subtitle file: in the present application, it includes subtitle information and keyword indication information in the subtitle. The subtitle information specifically includes the subtitle sentences, and may also include the time axis corresponding to the subtitle. The video subtitle file may further include keyword indication information of the video and audio content.
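As a concrete illustration of the terms defined above, the following is a minimal sketch of how a video subtitle file carrying keyword indication information might be organized; the field names, time values and link format are assumptions made for illustration and are not prescribed by the application.

```python
# Hypothetical layout of a video subtitle file; all field names are illustrative.
video_subtitle_file = {
    "subtitle_info": [
        # subtitle sentences together with their positions on the time axis
        {"start": "00:02", "end": "00:04",
         "text": "this is a photo taken with the latest model of camera G"},
    ],
    "subtitle_keyword_indications": [
        # a keyword, an optional super search link, and its time period in the subtitle
        {"keyword": "camera G",
         "search_link": "https://search.example.com/?q=camera+G",  # assumed link format
         "start": "00:02", "end": "00:04"},
    ],
    "av_content_keyword_indications": [
        # keywords recognized from video frames or audio, tied to a time period
        {"keyword": "XX mountain", "start": "00:10", "end": "00:15"},
    ],
}
```

Under such a layout, the client only needs the indication entries to decide which words in the subtitle to display differently and where to attach search links.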
Target keywords: refer to a keyword selected by the user from the subtitle during video playing; the target keyword may be a keyword in the subtitle or a keyword in the video and audio content.
Target information: can be understood as information related to a keyword, obtained, for example, by searching for the target keyword or by matching against content pre-stored in a database, where some or all of the results may be taken as the target information. The target information includes one or a combination of text, pictures or videos, and the application does not limit the specific form of the target information.
The following describes the design idea of the subtitle file processing method in the embodiment of the present application.
The prior art to which embodiments of the present application relate is exemplified below.
When the user views a video through the video client in the terminal device and finds an unfamiliar place name in the subtitle, for example when the video is currently displayed as the video interface 110 shown in (1) in fig. 1 and the user does not know "place B" in the subtitle and wants to find out more, the user needs to pause the current picture and then open a browser in the terminal device desktop 120, for example browser b, at which point the terminal device displays the browser homepage 130 corresponding to browser b shown in (3) of fig. 1. The user inputs the keyword "place B" in the browser homepage 130, and the terminal device displays the input keyword page 140 shown in (4) of fig. 1. The user then clicks on search, at which point the terminal device displays the search results page 150 shown in (5) of fig. 1, and the user can view the final search results.
It can be seen that, with the current interaction, the user needs to close the video client or move it to the background, then open the browser and type in a search; the operation process is quite complex, and the user's keyword searching efficiency is relatively low.
In view of this, the present inventors devised a subtitle file processing method that, after a user performs a video playing operation, first determines keywords in subtitles that the user may search for based on a user portrait, and displays the keywords that the user may click on differently in the subtitles. Therefore, the user can select the keywords in the subtitles, the terminal equipment can display the target information related to the keywords, the user does not need to independently open the browser to search the corresponding keywords, the interaction process between the user and the terminal equipment is simplified, the keyword searching efficiency of the user in the video watching process is improved, and the interaction efficiency between the user and the terminal equipment is improved.
The inventor further considers that the more keywords are displayed differently, the more target information the user can view, but if the keywords are displayed differently in the video playing process, the watching experience of the user may be affected, the processing amount of a video playing client in the video playing process may be increased, and the smoothness of video playing is further reduced. Therefore, the inventor considers that the keywords can be further screened from the keyword set matched with the user based on the clicking frequency of the keywords and the ratio of the word number of the keywords to the word number in the subtitle information, so that some keywords with high frequency and the maximum possibility of clicking by the user are screened.
The inventor further considers that, in the process of watching video, a user may not only have a requirement of searching information in subtitles, but also may have a requirement of searching for a person or object in video pictures or a requirement of searching for a song in video and audio, and in the case that the user has no clear object at all, the user has difficulty in converting the images or the audio into characters, so that the user has difficulty in searching for effective information even though searching is performed through a browser. Accordingly, the present inventors considered to recognize video and audio contents from a video and audio file, then screen out keywords matching with a user portrait from the video and audio contents, and then add the keywords to subtitle information. Therefore, the user can select the keywords of the video and audio content and directly check the corresponding content, so that the interaction efficiency between the user and the terminal equipment is further improved.
The inventors further contemplate that the super search links of these keywords may be associated with corresponding keywords in the caption information. Therefore, the data interaction amount between the server and the video client can be relatively reduced, and excessive information displayed in the video playing picture can be avoided, so that the video watching experience of a user is influenced.
After the design concept of the embodiment of the present application is introduced, an application scenario related to the embodiment of the present application is described below as an example. Referring to fig. 2, the application scenario includes: terminal device 210, video client 220 installed in terminal device 210, video server 230 corresponding to the video client, database 240, engine server 250, and the like.
Wherein the terminal device 210 is, for example, a cell phone, a personal computer, etc. In fig. 2, two terminal devices 210 are taken as an example, and the number of terminal devices 210 is not limited in practice. The video client 220 may be understood as a video client installed in the terminal device 210, or a software module embedded in a third party application, or may be a web page client accessed through a web page, and the video client 220 in the embodiment of the present application generally refers to a client that a user can watch video, such as a short video playing client or a video playing client. In fig. 2, the first user and the second user are taken as examples.
The video server 230 and the engine server 250 may be physical servers or virtual servers. Video server 230 may be a single server or a cluster of servers. Database 240 may be implemented by one or more storage devices, such as disks, etc. Database 240 may be a stand alone database or may be part of video server 230. The database 240 may store a large number of subtitle files corresponding to videos, a large number of video files corresponding to videos, a large number of audio files corresponding to videos, and the like. When the video server 230 needs to obtain a relevant file of a certain video, the corresponding file may be obtained from the database 240.
First application scenario:
when a user performs a video playing operation, the video client 220 may request a video subtitle file from the video server 230, and the video server 230 may be capable of matching at least one keyword matched with a user portrait of the user based on subtitle information corresponding to the video, and obtaining keyword indication information corresponding to each keyword. The video server 230 then transmits a video subtitle file including subtitle information and keyword indication information for indicating at least one keyword to the video client 220.
The video client 220 may differentially display at least one keyword based on the video subtitle file. After the video client 220 obtains at least one keyword, when the keyword is displayed in the super search link mode, the user may click on a search link corresponding to the keyword, at this time, the video client 220 may generate a search request, request corresponding content from the engine server 250, and after receiving the found content sent by the engine server 250, perform corresponding display in the video playing process.
The second application scenario:
The video server 230 also obtains the keywords matching the user portrait. The video server 230 can communicate with the engine server 250, obtain the target information corresponding to the keywords from the engine server 250 in advance, and send the target information corresponding to the keywords to the video client 220 in advance. The video client 220 associates the target information with the keywords after receiving the video subtitle file. When the user performs an operation on a certain keyword among the keywords displayed in the subtitle, the video client 220 can directly display the target information associated with the keyword; the target information may be, for example, one or a combination of introduction information of the keyword, a web page link, hotspot information associated with the keyword, survey information associated with the keyword, and the like. The survey information can be understood as conducting a survey around the keyword, for example initiating a vote on the keyword to acquire more information about the current user, so that more services can be provided for the user later.
The keywords may be characters or objects, the content of the keywords is different, and the corresponding target information may be different. When the keyword is an object, the target information associated with the keyword may be one or a combination of several of introduction information of the object, purchase information of the object, history sources of the object, information of a place where the object is located, information between the place where the object is located and a place where the current user is located, hotspot information associated with the object, advertisement placement information associated with the object, and the like. For example, when the keyword is a song, the target information may be audio of the song, authored video of the song, or the like. When the keyword is a place, the target information may be information of the place where the current user is located and between the places, such as distance, travel mode of transportation, travel time, and the like.
When the keyword is a person, the target information associated with the keyword may be one or a combination of several of introduction information of the person, other person information associated with the person, work information of the person, feature video of the captured video of the person, wearing makeup of the person, article information corresponding to wearing makeup of the person, hotspot information of the person, advertisement placement information associated with the person, and the like. When the keyword is a person, the target information may be hot spot information of the person within a preset duration, so that the user can quickly learn about the latest hot spot, and the like.
No matter which application scenario is, the user only needs to select the keyword to obtain the information of a certain keyword in the caption which the user wants to know, so that the interaction process between the user and the terminal device 210 is simplified, and the interaction efficiency between the user and the terminal device 210 in the video playing process is improved.
Based on the first application scenario discussed in fig. 2, a subtitle file processing method according to an embodiment of the present application is described below. Referring to fig. 3, the method includes:
s301, responding to video playing operation of a user, and generating a video subtitle file request.
Specifically, when a user wants to watch a certain video, a video playing operation can be performed. The video playing operation is, for example, a user click operation, a user gesture operation, or a user voice operation; the application is not limited to the specific form of the video playing operation. After the user performs the video playing operation, the video client 220 may generate a video subtitle file request according to the operation. The video subtitle file request is used for requesting the subtitle file corresponding to the video.
Further, in order to facilitate the video server 230 to identify the video that the user performs the play operation, the video subtitle file request may include an identification of the video that the user performs the play operation.
As an embodiment, in order for the video server 230 to obtain the relevant information of the user later, the video subtitle file request may also carry the identity of the user, for example, the user's account ID for the video client, etc.
For example, referring to fig. 4, the video client 220 includes a plurality of videos, and the user clicks the play control 410 in the second video in fig. 4 (i.e. the video corresponding to ID2 in fig. 4), which is equivalent to performing the video play operation. The video client 220 generates a video subtitle file request according to the clicking operation of the user.
S302, the video client 220 transmits the video subtitle file request to the video server 230.
Specifically, after the video client 220 sends the video subtitle file request, the video server 230 obtains the request and can thereby determine which video the user is about to play.
S303, the video server 230 determines a keyword matched with the user portrait from a plurality of keywords of the caption information, and obtains keyword indication information.
Before matching with a user portrait of a user, the video server 230 needs to obtain a plurality of keywords of the caption information, and the user portrait, and a manner of obtaining the plurality of keywords of the caption information is described below as an example.
A1:
The video server 230 may perform keyword extraction on the caption information to obtain a plurality of keywords in the caption information.
Specifically, after the video server 230 obtains the subtitle file request, related information of the video may be obtained from the database 240 based on the video ID, and the related information may include subtitle information. The caption information includes all caption sentences corresponding to the video, time axes corresponding to the captions, and the like.
As an example, the related information may further include video-audio content of the video, which may be understood as one or both of a video picture and an audio file.
After the video server 230 obtains the caption information of the video, the video server 230 may segment the caption sentences by using natural language processing and remove nonsensical words in the caption to obtain a plurality of keywords in the caption information. Nonsensical words include modal particles, adverbs, prepositions, conjunctions, and the like.
For example, continuing with fig. 4, the user opens the second video in fig. 4, that is, the video corresponding to ID2. At this time the video server 230 obtains the caption sentence in the caption information of the video, which reads "this is a photo taken with the latest model of camera G; the shooting location is place B, the real filming location of Zhang San's movie A". The video server removes the nonsensical words in the caption sentence, for example "this is a photo taken with", "the latest model", "the shooting location is", "the movie", "the real filming location", and so on, to obtain the keywords in the caption sentence: camera G, Zhang San, and place B.
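A minimal sketch of this keyword extraction step is given below, assuming a part-of-speech tokenizer is available (for Chinese subtitles this would typically be a word segmenter); the tag set and the helper name are assumptions made for illustration only.

```python
# Part-of-speech tags treated as "nonsensical" words; the tag set is an assumption.
FUNCTION_WORD_TAGS = {"u", "d", "p", "c"}  # modal particles/auxiliaries, adverbs, prepositions, conjunctions

def extract_subtitle_keywords(caption_sentences, pos_tokenize):
    """pos_tokenize(sentence) is assumed to yield (word, pos_tag) pairs."""
    keywords = []
    for sentence in caption_sentences:
        for word, tag in pos_tokenize(sentence):
            # drop function words and single characters, keep content words
            if tag not in FUNCTION_WORD_TAGS and len(word) > 1:
                keywords.append(word)
    return keywords
```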
A2:
The video server 230 may screen out, in advance, keywords from the subtitle information of each video and store the keywords in association with the video. After determining that the user clicks on a certain video, the keywords of the caption information can be called up directly for that video.
For example, the plurality of keywords corresponding to the ID2 video stored in the database 240 include "camera G, Zhang San and place B", and the video server 230 may directly obtain the plurality of keywords from the database 240 after determining that the user clicks on the video corresponding to ID2.
The manner in which video server 230 obtains a user representation is illustrated below.
The user portraits may be generated by video server 230 from previous operational behavior data of the user as well as user attributes.
Specifically, the operation behavior data of the user includes the video set operated by the user in a set time period before the current time and the clicked keyword set in all keywords displayed by the user in the video playing process in the set time period before the current time. User attributes such as user gender, age, occupation, region, and video preference categories.
B1: the embedded vector learning can be performed on the previous operation behavior data of the user and the user attributes to obtain the user portrait.
B2: the method can perform embedded vector learning on the previous operation behavior data and user attributes of the user, and input the learned vector into a trained user portrait prediction model so as to obtain the user portrait. The user portraits may be generated by video server 230 in real-time or periodically by video server 230.
It should be noted that the order in which the video server 230 obtains the plurality of keywords in the user portrait and the subtitle information may be arbitrary, and the present application is not particularly limited.
After the user representation is obtained by the video server 230, keywords that match the user representation to a high degree may be selected from a plurality of keywords based on the user representation, and the matched keywords represent keywords that may be of interest to the user, i.e., keywords that the user may want to learn further during viewing of the video. Since the user portraits of each user are different, keywords matched according to the user portraits may be different, so that subtitle information can be displayed for each user more specifically later.
For example, referring to fig. 5, both user a and user b watch the video corresponding to ID2. The video server 230 determines, based on the user portrait of user a, that user a is a digital product fan, and determines, based on the user portrait of user b, that user b loves geography. The video server 230 then determines, from among the plurality of keywords in the video of ID2, that the keyword "camera G" matches user a, and that the keyword "place B" matches user b.
In one possible embodiment, there may be many keywords that match the user representation, at which time these keywords may be further filtered.
Specifically, the click frequency of the screened keywords in the video is obtained, and the keywords with the click frequency meeting the preset condition are selected.
For example, suppose the click frequency score of camera G is 80 points, that of Zhang San is 70 points, and that of place B is 85 points; when these keywords all match the user portrait of a certain user, the keywords with a relatively high click frequency can be selected, for example "camera G" and "place B".
Further, after the keywords whose click frequency meets the preset condition are screened out, the ratio of the number of words in the finally screened keywords to the number of words included in the subtitle information can be controlled to be smaller than or equal to a preset ratio, so as to obtain the finally screened keywords. For example, the number of words in the keywords does not exceed 10% of the number of words in the subtitle information.
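To make the two-stage screening concrete, here is a rough sketch that first keeps keywords whose click frequency meets the preset condition and then caps the total keyword length at the preset ratio; the threshold of 80 points and the 10% ratio mirror the examples above but are otherwise assumptions.

```python
def screen_keywords(portrait_matched_keywords, click_score, subtitle_word_count,
                    min_click_score=80, max_ratio=0.10):
    """click_score(keyword) is assumed to return the keyword's click-frequency score."""
    # keep keywords whose click frequency meets the preset condition, highest first
    popular = sorted((k for k in portrait_matched_keywords if click_score(k) >= min_click_score),
                     key=click_score, reverse=True)
    # cap the total keyword length at a preset ratio of the subtitle length
    selected, used_words = [], 0
    for keyword in popular:
        if (used_words + len(keyword)) / subtitle_word_count <= max_ratio:
            selected.append(keyword)
            used_words += len(keyword)
    return selected
```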
After determining keywords that match the user representation, keyword indication information for the matched keywords is obtained. The specific form of the keyword indication information may refer to the content discussed above, and will not be described herein.
S304, obtaining the keywords matched with the user portrait from the video and audio content and the keyword indication information of the video and audio content.
Specifically, the video server 230 may obtain the video and audio content and screen out, from the video and audio content, keywords matching the user portrait. The manners in which the video server 230 obtains the video content and the audio content are illustrated below.
One way to obtain the audio-visual content is:
for video files:
the video file includes a plurality of video frames, and the video server 230 may perform object recognition on each video frame to obtain a recognition result. Recognition may be performed, for example, by a recognizer or a detector; the application is not particularly limited in this respect. The video server 230 may also match the recognized results against information stored in network resources or the database 240 to determine what the target object specifically is, for example which star, which animal, or which object it is.
For audio files:
the audio file may include musical composition data of the video and may include dubbing data of the video. For the musical composition data, the video server 230 may convert the music data into text and, according to the text, match corresponding music information, such as one or more of the music name, the singer, the lyricist, the composer, and the like. For the dubbing data, the video server 230 may identify the sound information corresponding to the dubbing data, identify the dubbing actor corresponding to the dubbing data, and the like.
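The following sketch covers the video-frame side of this recognition flow; detect_objects() and lookup_entity() are hypothetical helpers standing in for an object detector and for the matching against network resources or the database 240.

```python
def recognize_video_content(timed_frames, detect_objects, lookup_entity):
    """timed_frames is assumed to be an iterable of (timestamp, frame) pairs."""
    content = []
    for timestamp, frame in timed_frames:
        for label in detect_objects(frame):      # e.g. "actor A", "writing brush"
            entity = lookup_entity(label)        # resolve to a concrete star/animal/object
            if entity is not None:
                content.append({"keyword": entity, "time": timestamp})
    return content
```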
Another way to obtain the audio-visual content is:
the video server 230 may store the video content and the audio content of a video in advance. The video content and the audio content may be obtained by the video server 230 in advance according to the above manner, or may be obtained by other devices and stored in the database 240, and the video server 230 may call them directly.
After invoking the video and audio contents, the video server 230 matches out related keywords based on the user's portraits, thereby obtaining keywords of the video and audio contents.
For example, for the video frame shown in fig. 6, the video content obtained from the video frame by the video server 230 includes "water, actor A, and writing brush"; if the video server 230 determines, based on the user portrait, that the user is interested in stars, the keyword obtained from the video content is "actor A".
After obtaining the keywords matching the user portrait from the audio-visual content, corresponding keyword indication information can be generated. Keywords of the video and audio content may not literally appear in the subtitles; however, the image frames or audio corresponding to these keywords do correspond to time periods on the time axis of the subtitles, so the keyword indication information may be the keyword itself together with its time period on that time axis. Since the keyword indication information includes the time period on the time axis, the keywords of the video and audio content can be associated with the corresponding video frames or audio.
S305, obtaining the video subtitle file according to the keyword indication information and the subtitle information.
Specifically, the keyword indication information herein includes keyword indication information in subtitles and keyword indication information in video and audio contents, and the video server 230 may use the keyword indication information and the subtitle information as video subtitle files.
S306, the video server 230 transmits the video subtitle file to the video client 220.
Specifically, after obtaining the video subtitle file, the video server 230 sends the video subtitle file to the video client 220, and the video client 220 can determine which keywords are based on the keyword indication information after receiving the video subtitle file.
Further, the video client 220 may readjust the caption information according to the keyword indication information and the caption information, for example associate the keyword indication information in the caption with the super search links corresponding to the keywords, and add the keywords matched from the video and audio content at the corresponding positions of the caption information. The client may also adjust these keywords to a set format. The set format can, for example, highlight the keywords, display them in colors different from non-keywords, bold or underline them, display them semi-transparently, apply special effects to them, or frame them with borders; the application is not limited to the specific type of set format.
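The sketch below shows one way a client might mark the matched keywords in a subtitle line so that the renderer can apply the set format; the markup scheme and class name are assumptions made for illustration.

```python
def annotate_subtitle_line(subtitle_text, keyword_links):
    """keyword_links is assumed to map each matched keyword to its super search link."""
    annotated = subtitle_text
    for keyword, link in keyword_links.items():
        # wrap the keyword so the renderer can highlight, underline or color it
        annotated = annotated.replace(
            keyword, f'<a class="subtitle-keyword" href="{link}">{keyword}</a>')
    return annotated
```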
As an embodiment, since the keywords in the subtitle are actually the contents existing in the original subtitle information, and the keywords of the video and audio contents may not exist in the original subtitle information, in order to facilitate the user to distinguish the keywords in the subtitle from the keywords in the video and audio contents, the keywords in the subtitle and the keywords in the video and audio contents may be displayed in different set formats, so that the user can distinguish the original subtitle and the keywords in the newly added video and audio contents.
The keywords are adjusted into the set format so that the keywords corresponding to the keyword indication information are distinguished from other words in the subtitle; the keywords displayed differently include the keywords in the subtitle and the keywords corresponding to the video and audio content.
In one possible embodiment, in addition to the keyword in the subtitle and the keyword in the video and audio content being displayed differently, when the target information of the keyword is displayed, the target information may include various types of information, and different types of target information of a keyword may be displayed differently. For example, the type of information preferred by the user may be determined based on the user portrait, the target information of the type preferred by the user may be displayed in an enhanced manner, or the target information of the type not preferred by the user may be displayed in a reduced manner. Weakening the display refers to making the displayed content harder for the user to find.
For example, for introduction information of the keywords and advertisement placement information associated with the keywords, the video client 220 may enhance the introduction information of the keywords in the target information or weaken the advertisement placement information, to avoid excessive advertisement placement information affecting the user's viewing experience.
S307, the video client 220 obtains a search link in response to the user' S selection operation for the target keyword.
Specifically, the video client 220 displays the keywords corresponding to the keyword indication information discussed above in a differentiated manner. When a user has a search requirement, any keyword may be selected, and the keyword selected by the user is regarded as the target keyword. The selection operation is, for example, clicking the position of the keyword in the subtitle information, letting the cursor stay on the keyword for a preset duration, or issuing a voice operation through a voice assistant. After the user performs the selection operation, the video client 220 can obtain the search link corresponding to the target keyword.
S308, the video client 220 transmits the search request to the engine server 250.
Specifically, the video client 220 invokes a search engine through the search link to generate a corresponding search request, and sends the search request to the engine server 250.
S309, the engine server 250 transmits the target information to the video client 220.
Specifically, the engine server 250 may perform a corresponding search according to the search request, obtain target information, and feed back the target information to the video client 220.
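A minimal sketch of S307 to S309 from the client's perspective is given below, assuming each keyword's search link can be fetched directly over HTTP; the link resolution and the use of urllib here are illustrative assumptions rather than the application's prescribed mechanism.

```python
import urllib.request

def fetch_target_info(target_keyword, keyword_links):
    """keyword_links is assumed to map each displayed keyword to its search link."""
    search_link = keyword_links[target_keyword]            # S307: resolve the search link
    with urllib.request.urlopen(search_link) as response:  # S308: send the search request
        return response.read()                             # S309: target information returned
```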
And S310, carrying out associated display on the target information and the target keywords.
Specifically, after obtaining the target information, the video client 220 may perform association display on the target information and the target keywords, so that the user may intuitively view the target information corresponding to the target keywords.
As an example, since the target information occupies part of the display area, the video may be paused while the target information is displayed. After the user finishes viewing and clicks the screen again, the video client 220 continues to play the video; alternatively, after the target information has been displayed for a preset duration, the video client 220 continues to play the video.
As one embodiment, the video client 220 may split screen display the target information along with the original video frame so that the user may view the original video frame and the target information in contrast.
For example, continuing to take fig. 5 as an example, user a and user b both watch the video corresponding to ID2, where user a matches the keyword "camera G" and user b matches the keyword "place B". When user a is watching the video, "camera G" may be clicked; the video client 220 obtains the target information corresponding to "camera G", including a picture, the price, core parameters, main performance, and the like, and displays the screen shown in (1) in fig. 7. When user b views the video, "place B" may be clicked; the video client 220 obtains the target information corresponding to "place B", including basic information, some pictures, and the like, and displays the screen shown in (2) of fig. 7.
The user a and the user b may also both click on "XX mountain", which is a keyword obtained from video content, and after clicking, target information related to the XX mountain is displayed.
As an embodiment, S304 belongs to an optional step. When S304 is not performed, the keyword indication information in S305 is the keyword indication information in the subtitle, and at this time, the target keyword is the keyword in the subtitle.
Based on the second application scenario discussed in fig. 2, a subtitle file processing method according to an embodiment of the present application is described below. Referring to fig. 8, the method includes:
s801, the video client 220 generates a video subtitle file request in response to a video play operation of a user.
The video playing operation, the video subtitle file request, the manner in which the video client 220 generates the video subtitle file request, etc. may refer to the foregoing discussion, and will not be repeated here.
S802, the video client 220 transmits a video subtitle file request to the video server 230.
S803, the video server 230 determines a keyword matching the user portrait from a plurality of keywords of the subtitle information, and obtains keyword indication information.
The subtitle information, the manner of determining the keywords matching the user portrait from the plurality of keywords in the subtitle information, the user portrait, and the like may refer to the foregoing discussion and are not repeated here.
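A minimal sketch (Python), assuming a simple user portrait with interest tags derived from user attributes and a set of previously clicked keywords taken from the operation behavior data; the field names are illustrative and not defined by this application:

    from dataclasses import dataclass, field

    @dataclass
    class UserPortrait:
        interest_tags: set = field(default_factory=set)      # assumed: derived from user attributes
        clicked_keywords: set = field(default_factory=set)   # assumed: from recent operation behavior

    def match_keywords(subtitle_keywords, portrait: UserPortrait):
        """Return the subset of subtitle keywords that match the user portrait."""
        return [kw for kw in subtitle_keywords
                if kw in portrait.clicked_keywords or kw in portrait.interest_tags]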
S804, the video server 230 transmits the first search request to the engine server 250.
Specifically, after obtaining the keywords in the subtitle information that match the user portrait, the video server 230 may search for these keywords and send a corresponding search request to the engine server 250.
S805, the engine server 250 transmits the target information to the video server 230.
Specifically, the engine server 250 feeds back the corresponding search results to the video server 230, so that the video server 230 obtains the target information corresponding to each keyword.
As one embodiment, the engine server 250 may send a plurality of search results to the video server 230, and the video server 230 may select, according to the user portrait, the search result most likely to be of interest to the user and use that search result as the target information.
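By way of illustration only (the result structure and its "summary" field are assumptions), selecting the result that overlaps most with the portrait's interest tags might look like:

    def pick_target_info(search_results, interest_tags):
        """Pick, from several search results, the one overlapping most with the portrait."""
        def score(result):
            summary = result.get("summary", "")
            return sum(tag in summary for tag in interest_tags)
        return max(search_results, key=score) if search_results else None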
S806, the video server 230 obtains, from the video and audio content, the keywords matching the user portrait and the keyword indication information of the video and audio content.
The video and audio content, the user portrait, and the keywords matching the user portrait may refer to the foregoing discussion and are not described in detail here.
S807, the video server 230 transmits a search request for the keywords of the video and audio content to the engine server 250.
S808, the video server 230 receives the target information of the keywords of the video-audio content transmitted by the engine server 250.
Specifically, the engine server 250 feeds back the corresponding search results to the video server 230, so that the video server 230 obtains the target information corresponding to each keyword of the video and audio content.
As one embodiment, the engine server 250 may send a plurality of search results to the video server 230, and the video server 230 may select, according to the user portrait, the search result most likely to be of interest to the user and use that search result as the target information.
As an example, the video server 230 may send the search requests to the engine server 250 after performing both S803 and S806, so as to obtain the corresponding target information.
In one possible embodiment, the database 240 may pre-store the target information corresponding to each keyword; after the video server 230 determines the keywords in the subtitle and the keywords of the video and audio content, the corresponding target information may be retrieved directly from the database 240.
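A hedged sketch of this database path (Python; the table name keyword_target_info and its columns are assumptions for illustration, not a schema defined by this application):

    import sqlite3

    def fetch_target_info(db_path, keywords):
        """Look up pre-stored target information for each keyword, if present."""
        conn = sqlite3.connect(db_path)
        try:
            cur = conn.cursor()
            info = {}
            for kw in keywords:
                cur.execute("SELECT info FROM keyword_target_info WHERE keyword = ?", (kw,))
                row = cur.fetchone()
                if row is not None:
                    info[kw] = row[0]
            return info
        finally:
            conn.close()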
S809, the video server 230 obtains a video subtitle file according to the keyword indication information, the subtitle information, and the target information.
Specifically, the video server 230 may directly use the keyword indication information, the subtitle information and the target information as the video subtitle file, or integrate them to obtain the video subtitle file.
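One possible way (not mandated by this application) to integrate the three parts is to serialize them together, for example as JSON; the key names below are illustrative assumptions:

    import json

    def build_subtitle_file(subtitle_info, keyword_indication, target_info):
        payload = {
            "subtitle_info": subtitle_info,            # subtitle text entries
            "keyword_indication": keyword_indication,  # e.g. keyword -> position in the subtitle
            "target_info": target_info,                # keyword -> associated target information
        }
        return json.dumps(payload, ensure_ascii=False)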
S810, the video server 230 transmits the video subtitle file to the video client 220.
S811, the video client 220 displays target information associated with the target keyword in response to a selection operation of the user for the target keyword.
Specifically, after receiving the video subtitle file, the video client 220 may associate each piece of target information with the corresponding keyword according to the keyword indication information, and display the keywords in the set format. After the user performs a selection operation on a target keyword, the video client 220 displays the target information associated with that target keyword.
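The client-side counterpart of the sketch above, again illustrative only and assuming the same JSON key names:

    import json

    def bind_keywords(video_subtitle_file: str):
        """Bind each keyword to its position and target information for later display."""
        data = json.loads(video_subtitle_file)
        indication = data["keyword_indication"]     # keyword -> position in the subtitle
        targets = data["target_info"]               # keyword -> target information
        return {kw: {"position": pos, "target": targets.get(kw)}
                for kw, pos in indication.items()}

    def on_select(bound, target_keyword):
        entry = bound.get(target_keyword)
        return entry["target"] if entry is not None else None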
As an example, S806-S808 are optional steps.
Based on the same inventive concept, an embodiment of the present application provides a subtitle file processing apparatus, which corresponds to the terminal device 210 discussed above, referring to fig. 9, the apparatus 900 includes:
The display module 901 is configured to display subtitle information of a video and keywords in the subtitle during video playing; wherein the keywords in the subtitle are matched from the subtitle information according to the user portrait of the user, and each keyword is associated with target information;
and a response module 902, configured to display, in response to a selection operation of the user on a target keyword among the displayed keywords, the target information associated with the target keyword.
In one possible embodiment, the apparatus further comprises a transceiver module 903, wherein:
the response module 902 is further configured to request a video subtitle file from the video server in response to a video playing operation of the user, before the subtitle of the video and the keywords in the subtitle are displayed during video playing;
the transceiver module 903 is configured to receive a video subtitle file sent by the video server; the video subtitle file comprises subtitle information and keyword indication information in the subtitle, wherein the keyword indication information in the subtitle is used for indicating keywords included in the subtitle information.
In one possible embodiment, the video subtitle file further includes keyword indication information of video and audio contents associated with the subtitle, the keyword indication information of the video and audio contents being used to indicate keywords in the video and audio contents, wherein the keywords in the video and audio contents are matched from the video and audio contents according to the user portrait, and the video and audio contents are identified from the video and audio file; and
The display module 901 is further configured to display keywords of the video and audio content associated with the subtitle according to a set format while displaying the subtitle.
In one possible embodiment, the video-audio content comprises video content, and the keywords of the video-audio content comprise video content keywords indicating people and/or things in the video content; and/or
The video-audio content includes audio content, and the keywords of the video-audio content include audio content keywords indicating musical composition information in the audio content.
In one possible embodiment, the display module 901 is specifically configured to:
while displaying the subtitles, keywords in the subtitles are displayed differently in accordance with a set format in the subtitles.
In one possible embodiment, the set format includes displaying the keywords as super search links; the response module 902 is specifically configured to perform the following (a sketch follows this list):
obtaining a search link corresponding to the selected target keyword;
invoking a search engine, and obtaining target information associated with the selected target keywords from an engine server according to the search links;
and carrying out association display on the obtained target information and the target keywords.
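A hedged illustration of the super-search-link format (Python; the engine URL pattern and the plain HTTP fetch are assumptions, not interfaces defined by this application):

    from urllib.parse import quote
    import urllib.request

    def as_search_link(keyword, engine="https://example-engine/search?q="):
        # render the keyword as a search link (the URL pattern is assumed)
        return engine + quote(keyword)

    def on_keyword_selected(keyword):
        link = as_search_link(keyword)
        with urllib.request.urlopen(link) as resp:   # invoke the search engine via the link
            target_info = resp.read().decode("utf-8")
        return keyword, target_info                  # then displayed in association with the keyword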
In one possible embodiment, the video subtitle file further includes target information associated with each keyword; the response module 902 is specifically configured to:
Obtaining target information associated with target keywords from a video file;
and carrying out association display on the obtained target information and the target keywords.
Based on the same inventive concept, an embodiment of the present application provides a subtitle file processing apparatus, which corresponds to the video server 230 discussed above, referring to fig. 10, the apparatus 1000 includes:
a determining module 1001, configured to determine keywords in the subtitle information that match the user portrait of the user; wherein each keyword is associated with target information;
the generating module 1002 is configured to generate a video subtitle file, where the video subtitle file includes keyword indication information and subtitle information in a subtitle, and the keyword indication information is used to indicate keywords included in the subtitle information.
In one possible embodiment of the present application,
the determining module 1001 is further configured to: identifying video and audio content from a video and audio file associated with the subtitle; and determining keywords of the video and audio contents matched with the user portrait from the video and audio contents;
the generating module 1002 is further configured to: generate keyword indication information of the video and audio content, where the keyword indication information of the video and audio content is used for indicating the keywords in the video and audio content.
In one possible embodiment, the apparatus further comprises a transceiver module 1003, wherein:
the transceiver module 1003 is configured to send the video subtitle file to the video client in response to a video subtitle file request sent by the user through the video client.
In one possible embodiment, the video-audio content comprises video content, and the keywords of the video-audio content comprise video content keywords indicating people and/or things in the video content; and/or
The video-audio content includes audio content, and the keywords of the video-audio content include audio content keywords indicating musical composition information in the audio content.
In one possible embodiment, the user portrait is generated from user attributes and operation behavior data of the user; the operation behavior data of the user includes a set of videos operated by the user within a set time period before the current time, and a set of keywords clicked by the user, within that time period, among all keywords displayed during video playing.
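A minimal sketch of assembling such a portrait (Python; the field names are illustrative assumptions):

    from dataclasses import dataclass

    @dataclass
    class OperationBehavior:
        operated_videos: list      # videos the user operated on within the set time period
        clicked_keywords: list     # keywords the user clicked among all displayed keywords

    def build_user_portrait(user_attributes: dict, behavior: OperationBehavior) -> dict:
        return {
            "attributes": user_attributes,                      # e.g. assumed demographic attributes
            "video_set": set(behavior.operated_videos),
            "clicked_keyword_set": set(behavior.clicked_keywords),
        }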
In one possible embodiment, the determining module 1001 is specifically configured to perform the following (a sketch follows this list):
extracting a plurality of keywords in the subtitle information;
determining a keyword set matched with the user portrait;
and determining, from the keyword set, the keywords whose click frequency meets a preset condition as the keywords matched with the user portrait.
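An illustrative filter for these three steps (Python; the click-count threshold is an assumption standing in for the unspecified preset condition):

    def select_portrait_keywords(subtitle_keywords, portrait_keywords, click_counts,
                                 min_clicks=10):
        # steps 1 and 2: keep subtitle keywords that also appear in the portrait-matched set
        candidates = [kw for kw in subtitle_keywords if kw in portrait_keywords]
        # step 3: keep only those whose click frequency meets the preset condition
        return [kw for kw in candidates if click_counts.get(kw, 0) >= min_clicks]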
In one possible embodiment, the ratio of the word count of the keywords matched with the user portrait to the word count included in the subtitle information is less than or equal to a preset ratio.
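A sketch of enforcing this constraint (Python; character count stands in for word count, and the preset ratio of 0.2 is chosen purely for illustration):

    def limit_by_ratio(matched_keywords, subtitle_text, preset_ratio=0.2):
        kept, used = [], 0
        budget = preset_ratio * len(subtitle_text)   # allowed total keyword length
        for kw in matched_keywords:
            if used + len(kw) <= budget:
                kept.append(kw)
                used += len(kw)
        return kept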
In one possible embodiment, the video subtitle file further includes target information associated with each keyword.
Based on the same inventive concept, the embodiment of the application further provides a terminal device 210, where the terminal device 210 may be an electronic device such as a smart phone, a tablet computer, a laptop or a PC.
Referring to fig. 11, the terminal device 210 includes a display unit 1140, a processor 1180 and a memory 1120, wherein the display unit 1140 includes a display panel 1141 for displaying information input by a user or provided to the user, various operation interfaces of the terminal device 210, and the like, and is mainly used for displaying interfaces, shortcut windows, and the like of the installed client 220 in the terminal device 210 in the embodiment of the present application. Alternatively, the display panel 1141 may be configured in the form of an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode) or the like.
The processor 1180 is used to read a computer program and then execute the method defined by the computer program; for example, the processor 1180 reads an application, thereby running the application on the terminal device 210 and displaying the interface of the application on the display unit 1140. The processor 1180 may include one or more general-purpose processors and may also include one or more DSPs (Digital Signal Processor, digital signal processors) for performing related operations to implement the technical solutions provided by the embodiments of the present application.
Memory 1120 typically includes an internal memory and an external memory; the internal memory may be a Random Access Memory (RAM), a Read Only Memory (ROM), a CACHE memory (CACHE), and the like, and the external memory may be a hard disk, an optical disk, a USB disk, a floppy disk, a tape drive, or the like. The memory 1120 is used to store computer programs, including the application corresponding to the client 220 and the like, and other data, which may include data generated after the operating system or the applications are run, including system data (e.g., configuration parameters of the operating system) and user data. The program instructions in the embodiments of the present application are stored in the memory 1120, and the processor 1180 executes the program instructions stored in the memory 1120 to implement the subtitle file processing method discussed above, or to implement the functions of the client 220 discussed above.
In addition, the display unit 1140 of the terminal device 210 may also be used to receive input digital information, character information, or touch operations/non-contact gestures, and to generate signal inputs related to user settings and function control of the terminal device 210. Specifically, in the embodiment of the present application, the display unit 1140 may include the display panel 1141. The display panel 1141, for example a touch screen, may collect touch operations of the user on or near it (for example, operations performed by the user on or near the display panel 1141 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the display panel 1141 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch orientation of the user, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 1180, and can receive and execute commands sent by the processor 1180. In the embodiment of the present application, if the user performs a selection operation on a target keyword, the touch detection device in the display panel 1141 detects the touch operation, the touch controller converts the corresponding signal into touch point coordinates and sends them to the processor 1180, and the processor 1180 determines, according to the received touch point coordinates, the target keyword selected by the user and displays the target information in association with the target keyword.
The display panel 1141 may be implemented by various types such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the display unit 1140, the terminal device 210 may further include an input unit 1130, and the input unit 1130 may include, but is not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, etc. In fig. 11, the input unit 1130 includes an image input device 1131 and other input devices 1132 as an example.
In addition to the above, the terminal device 210 may also include a power supply 1190 for powering the other modules, an audio circuit 1160, a near field communication module 1170, and an RF circuit 1110. The terminal device 210 may also include one or more sensors 1150, such as an acceleration sensor, a light sensor, and a pressure sensor. The audio circuit 1160 specifically includes a speaker 1161, a microphone 1162, and the like; for example, when the user uses voice control, the terminal device 210 may collect the user's voice through the microphone 1162 and respond to the voice operation, and may play a corresponding alert sound through the speaker 1161 when the user needs to be prompted.
Based on the same inventive concept, an embodiment of the present application provides a video server, that is, the video server 230 discussed above; referring to fig. 12, it includes a processor 1210 and a memory 1220.
Components of the server may include, but are not limited to: at least one processor 1210, at least one memory 1220, and a bus 1230 connecting the different system components (including the processor 1210 and the memory 1220).
Bus 1230 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, and a local bus using any of a variety of bus architectures.
Memory 1220 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 1221 and/or cache memory 1222, and may further include Read Only Memory (ROM) 1223.
Memory 1220 may also include a program/utility 1226 having a set (at least one) of program modules 1225; such program modules 1225 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, and each or some combination of these examples may include an implementation of a network environment. The processor 1210 is configured to execute the program instructions and the like stored in the memory 1220 to implement the subtitle file processing method discussed above.
Video server 230 can also communicate with one or more external devices 1240 (e.g., keyboard, pointing device, etc.), one or more devices that enable terminal device 210 to interact with video server 230, and/or any device (e.g., router, modem, etc.) that enables video server 230 to communicate with one or more other devices. Such communication may occur through an input/output (I/O) interface 1250. Also, video server 230 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet via network adapter 1260. As shown, network adapter 1260 communicates with other modules for video server 230 via bus 1230. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with video server 230, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium storing computer instructions that, when run on a computer, cause the computer to perform the subtitle file processing method discussed above.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (15)

1. A subtitle file processing method, comprising:
displaying subtitle information of a video, keywords in the subtitle, and keywords of video and audio content during video playing; wherein the keywords in the subtitle are matched from the subtitle information according to a user portrait of a user, each keyword is associated with target information, the keywords of the video and audio content are obtained by matching, according to the user portrait, target objects identified from video content in the video and audio content, the target objects comprise at least one of a person, an animal or an object, the user portrait is generated according to previous operation behavior data of the user and user attributes, and the ratio of the word count of the keywords in the subtitle and of the keywords of the video and audio content to the word count included in the subtitle information is less than or equal to a preset ratio;
and responding to the selection operation of the user from target keywords in the displayed keywords, and displaying target information associated with the target keywords.
2. The method of claim 1, wherein prior to displaying the subtitles of the video and keywords in the subtitles during video playback, comprising:
Responding to video playing operation of a user, and requesting a video subtitle file from a video server;
receiving a video subtitle file sent by the video server; the video subtitle file comprises subtitle information and keyword indication information in the subtitle, wherein the keyword indication information in the subtitle is used for indicating keywords included in the subtitle information.
3. The method of claim 2, wherein the video subtitle file further includes keyword indication information of the video and audio content associated with the subtitle, the keyword indication information of the video and audio content being used to indicate the keywords in the video and audio content, wherein the keywords in the video and audio content further include keywords matched from audio content of the video and audio content according to the user portrait, and the video and audio content is identified from a video and audio file; and displaying the keywords of the video and audio content includes:
and displaying the keywords of the video and audio contents associated with the subtitles according to a set format.
4. A method according to any one of claims 1-3, wherein displaying subtitle information and keywords in the subtitles of the video during the video playing process comprises:
While displaying the subtitles, keywords in the subtitles are displayed differently in accordance with a set format in the subtitles.
5. The method of claim 4, wherein the setting the format comprises: displaying the keywords as a super search link mode; and
the displaying, in response to the selection operation of the user on a target keyword among the displayed keywords, target information associated with the target keyword comprises:
obtaining a search link corresponding to the selected target keyword;
invoking a search engine, and obtaining target information associated with the selected target keywords from an engine server according to the search link;
and carrying out association display on the obtained target information and the target keywords.
6. The method of any of claims 1-3, wherein the video subtitle file further includes target information associated with each keyword; and
the displaying the target information associated with the target keywords specifically comprises:
obtaining target information associated with target keywords from a video file;
and carrying out association display on the obtained target information and the target keywords.
7. A subtitle file processing method, comprising:
determining, in subtitle information, keywords matching a user portrait of a user and keywords matching target objects of video content in video and audio content; wherein each keyword is associated with target information, the user portrait is generated according to previous operation behavior data of the user and user attributes, and the ratio of the word count of the keywords in the subtitle and of the keywords in the video and audio content to the word count included in the subtitle information is less than or equal to a preset ratio;
Generating a video subtitle file, wherein the video subtitle file comprises keyword indication information in subtitles, the subtitle information and keyword indication information of the video and audio contents, the keyword indication information is used for indicating keywords included in the subtitle information, the keyword indication information of the video and audio contents is used for indicating keywords matched with a target object of the video contents in the video and audio contents, and the target object comprises at least one of a person, an animal or an object.
8. The method of claim 7, wherein determining keywords that match target objects of video content in video-audio content comprises:
identifying the video and audio content from a video and audio file associated with the subtitle;
determining keywords of the video and audio contents matched with the user portrait from the video contents and the audio contents in the video and audio contents; and
the keyword indication information of the video and audio content is also used for indicating keywords in the video and audio content.
9. The method as recited in claim 7, further comprising:
and responding to a video subtitle file request sent by a user through a video client, and sending the video subtitle file to the client.
10. The method of any one of claims 7 to 9, wherein the user representation is generated from user attributes and operational behavior data of the user; the operation behavior data of the user comprise a video set operated by the user in a set time period before the current time and a clicked keyword set in all keywords displayed by the user in the video playing process in the set time period before the current time.
11. The method of claim 10, wherein the word count of the keyword matched with the user portrait and the word count included in the subtitle information have a ratio of less than or equal to a preset ratio.
12. The method of any one of claims 7 to 9, wherein the video subtitle file further includes target information associated with each keyword.
13. A subtitle file processing apparatus, comprising:
the display module is configured to display subtitle information of a video, keywords in the subtitle, and keywords of video and audio content during video playing; wherein the keywords in the subtitle are matched from the subtitle information according to a user portrait of a user, each keyword is associated with target information, the keywords of the video and audio content are obtained by matching, according to the user portrait, target objects of video content in the video and audio content, the target objects comprise at least one of a person, an animal or an object, the user portrait is generated according to previous operation behavior data of the user and user attributes, and the ratio of the word count of the keywords in the subtitle and of the keywords of the video and audio content to the word count included in the subtitle information is less than or equal to a preset ratio;
And the response module is used for responding to the selection operation of the user from the target keywords in the displayed keywords and displaying target information associated with the target keywords.
14. A subtitle file processing apparatus, comprising:
the determining module is configured to determine, in subtitle information, keywords matching a user portrait of a user and keywords matching target objects of video content in video and audio content; wherein each keyword is associated with target information, the user portrait is generated according to previous operation behavior data of the user and user attributes, and the ratio of the word count of the keywords in the subtitle and of the keywords in the video and audio content to the word count included in the subtitle information is less than or equal to a preset ratio;
the generating module is used for generating a video subtitle file, wherein the video subtitle file comprises keyword indication information in subtitles, the subtitle information and keyword indication information of the video and audio contents, the keyword indication information is used for indicating keywords included in the subtitle information, the keyword indication information of the video and audio contents is used for indicating keywords matched with a target object of the video contents in the video and audio contents, and the target object comprises at least one of a person, an animal or an object.
15. A computer readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 6 or 7 to 12.
CN202010000750.4A 2020-01-02 2020-01-02 Subtitle file processing method and device Active CN113068077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010000750.4A CN113068077B (en) 2020-01-02 2020-01-02 Subtitle file processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010000750.4A CN113068077B (en) 2020-01-02 2020-01-02 Subtitle file processing method and device

Publications (2)

Publication Number Publication Date
CN113068077A CN113068077A (en) 2021-07-02
CN113068077B true CN113068077B (en) 2023-08-25

Family

ID=76558409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010000750.4A Active CN113068077B (en) 2020-01-02 2020-01-02 Subtitle file processing method and device

Country Status (1)

Country Link
CN (1) CN113068077B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113596562B (en) * 2021-08-06 2023-03-28 北京字节跳动网络技术有限公司 Video processing method, apparatus, device and medium
CN116634218B (en) * 2023-05-25 2024-04-02 优酷网络技术(北京)有限公司 Popularization information display method, device, equipment and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102683A (en) * 2013-04-05 2014-10-15 联想(新加坡)私人有限公司 Contextual queries for augmenting video display
CN104679755A (en) * 2013-11-27 2015-06-03 中兴通讯股份有限公司 Voice frequency searching method, voice frequency searching device and terminal
JP2016012894A (en) * 2014-06-30 2016-01-21 株式会社東芝 Electronic apparatus, and control method and control program of the same
CN107180058A (en) * 2016-03-11 2017-09-19 百度在线网络技术(北京)有限公司 A kind of method and apparatus for being inquired about based on caption information
CN107679083A (en) * 2017-08-31 2018-02-09 无锡天脉聚源传媒科技有限公司 A kind of method and device of intelligent information push
CN108924598A (en) * 2018-06-29 2018-11-30 北京优酷科技有限公司 Video caption display methods and device
CN109474847A (en) * 2018-10-30 2019-03-15 百度在线网络技术(北京)有限公司 Searching method, device, equipment and storage medium based on video barrage content
CN110620960A (en) * 2018-06-20 2019-12-27 北京优酷科技有限公司 Video subtitle processing method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080186810A1 (en) * 2007-02-06 2008-08-07 Kumaran O R Senthil System and Method for Audiovisual Content Search
CN102789385B (en) * 2012-08-15 2016-03-23 魔方天空科技(北京)有限公司 The processing method that video file player and video file are play
CN108833952B (en) * 2018-06-20 2021-06-11 阿里巴巴(中国)有限公司 Video advertisement putting method and device
CN110650378A (en) * 2019-09-27 2020-01-03 北京奇艺世纪科技有限公司 Information acquisition method, device, terminal and storage medium
CN111107422B (en) * 2019-12-26 2021-08-24 腾讯科技(深圳)有限公司 Image processing method and device, electronic equipment and computer readable storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102683A (en) * 2013-04-05 2014-10-15 联想(新加坡)私人有限公司 Contextual queries for augmenting video display
CN104679755A (en) * 2013-11-27 2015-06-03 中兴通讯股份有限公司 Voice frequency searching method, voice frequency searching device and terminal
JP2016012894A (en) * 2014-06-30 2016-01-21 株式会社東芝 Electronic apparatus, and control method and control program of the same
CN107180058A (en) * 2016-03-11 2017-09-19 百度在线网络技术(北京)有限公司 A kind of method and apparatus for being inquired about based on caption information
CN107679083A (en) * 2017-08-31 2018-02-09 无锡天脉聚源传媒科技有限公司 A kind of method and device of intelligent information push
CN110620960A (en) * 2018-06-20 2019-12-27 北京优酷科技有限公司 Video subtitle processing method and device
CN108924598A (en) * 2018-06-29 2018-11-30 北京优酷科技有限公司 Video caption display methods and device
CN109474847A (en) * 2018-10-30 2019-03-15 百度在线网络技术(北京)有限公司 Searching method, device, equipment and storage medium based on video barrage content

Also Published As

Publication number Publication date
CN113068077A (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN111143610B (en) Content recommendation method and device, electronic equipment and storage medium
US20210225380A1 (en) Voiceprint recognition method and apparatus
US9760541B2 (en) Systems and methods for delivery techniques of contextualized services on mobile devices
US11158206B2 (en) Assisting learners based on analytics of in-session cognition
CN110020411B (en) Image-text content generation method and equipment
WO2017124116A1 (en) Searching, supplementing and navigating media
US20140255003A1 (en) Surfacing information about items mentioned or presented in a film in association with viewing the film
JP2019212290A (en) Method and device for processing video
US20150243279A1 (en) Systems and methods for recommending responses
WO2018045646A1 (en) Artificial intelligence-based method and device for human-machine interaction
JP6361351B2 (en) Method, program and computing system for ranking spoken words
CN106462640B (en) Contextual search of multimedia content
JP7056359B2 (en) Methods, systems and programs for topic guidance in video content using sequence pattern mining
CN109348275A (en) Method for processing video frequency and device
CN111970257B (en) Manuscript display control method and device, electronic equipment and storage medium
CN113068077B (en) Subtitle file processing method and device
US20230376531A1 (en) Media contextual information for a displayed resource
US20230244709A1 (en) Systems and methods for leveraging acoustic information of voice queries
WO2017083205A1 (en) Provide interactive content generation for document
US11698927B2 (en) Contextual digital media processing systems and methods
CN111343508B (en) Information display control method and device, electronic equipment and storage medium
CN116910302A (en) Multi-mode video content effectiveness feedback visual analysis method and system
CN114095782A (en) Video processing method and device, computer equipment and storage medium
US20210118009A1 (en) Method and system for enabling an interaction of a user with one or more advertisements within a podcast
CN116668757A (en) Video playing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40048315

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant