CN112752134B

CN112752134B - Video processing method and device, storage medium and electronic device

Info

Publication number: CN112752134B
Application number: CN202010693888.7A
Authority: CN
Inventors: 田元
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-07-17
Filing date: 2020-07-17
Publication date: 2023-09-22
Anticipated expiration: 2040-07-17
Also published as: CN112752134A

Abstract

The invention discloses a video processing method and device, a storage medium and an electronic device. Wherein the method comprises the following steps: displaying the target video in a display interface on the client; responding to the received conversion instruction, and converting voice content in the target video into text content; displaying text content in a target control contained in a display interface; and responding to a trigger instruction executed on the target control, and executing a target function corresponding to the target control, wherein the target function of the target control is determined according to the type of the client. The invention solves the technical problem of poor processing flexibility of video content in the related technology.

Description

Video processing method and device, storage medium and electronic device

Technical Field

The present invention relates to the field of computers, and in particular, to a video processing method and apparatus, a storage medium, and an electronic apparatus.

Background

In the prior art, after a user receives video content, the user can view the video content. However, the user cannot further process the content of interest in the video content, limited to viewing the video content. If the user wishes to further process the content of interest, he needs to remember the content of interest by watching the video and then use a specific application or function to process the content of interest.

That is, in the related art, there is a problem that processing efficiency of content of interest in video is low.

In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiment of the invention provides a video processing method and device, a storage medium and an electronic device, which at least solve the technical problem of low processing efficiency of interested contents in videos in the related technology.

According to an aspect of an embodiment of the present invention, there is provided a video processing method including: displaying the target video in a display interface on the client; responding to the received conversion instruction, and converting voice content in the target video into text content; displaying text content in a target control contained in a display interface; and responding to a trigger instruction executed on the target control, and executing a target function corresponding to the target control, wherein the target function of the target control is determined according to the type of the client.

According to another aspect of the embodiment of the present invention, there is also provided a video processing apparatus including: a first display unit for displaying a target video in a display interface on a client; the conversion unit is used for responding to the received conversion instruction and converting the voice content in the target video into text content; the second display unit is used for displaying the text content in a target control contained in the display interface; and the execution unit is used for responding to the trigger instruction executed on the target control and executing the target function corresponding to the target control, wherein the target function of the target control is determined according to the type of the client.

As an alternative example, the second display unit includes: the third display module is used for displaying a plurality of target controls in the display interface; and the fourth display module is used for displaying one vocabulary of the text content in each target control.

As an alternative example, the apparatus further includes: and the third display unit is used for displaying a target result obtained after the target function is executed after the target function corresponding to the target control is executed in response to the trigger instruction executed on the target control, wherein the target result is a result obtained after the target function is executed on the text content in the target control.

As an alternative example, the apparatus further includes: the acquisition unit is used for acquiring the type of the client before responding to the trigger instruction executed on the target control and executing the target function corresponding to the target control; a first determining unit configured to determine a plurality of functions of the client that match the type; and a second determining unit configured to determine one function from the plurality of functions as the target function.

As an alternative example, the conversion unit includes: the input module is used for inputting the voice content into a target neural network model, wherein the target neural network model is a model obtained by training an original neural network model by using sample voice, and the target neural network model is used for outputting text content corresponding to the voice content after inputting the voice content; and the acquisition module is used for acquiring the text content output by the target neural network model.

As an alternative example, the execution unit includes: the processing module is used for searching the text content in the target control when the target function is a search function, sharing the text content in the target control when the target function is a sharing function, translating the text content in the target control when the target function is a translation function, and displaying the meaning of the text content in the target control when the target function is an interpretation function.

According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the video processing method described above.

According to still another aspect of the embodiments of the present invention, there is further provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the video processing method described above through the computer program.

In the embodiment of the invention, the target video is displayed in a display interface on the client; responding to the received conversion instruction, and converting voice content in the target video into text content; displaying text content in a target control contained in a display interface; in response to a trigger instruction executed on the target control, executing a target function corresponding to the target control, wherein the target function of the target control is determined according to the type of the client.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a schematic illustration of an application environment for an alternative video processing method according to an embodiment of the application;

FIG. 2 is a schematic diagram of an application environment of another alternative video processing method according to an embodiment of the present application;

FIG. 3 is a flow chart of an alternative video processing method according to an embodiment of the application;

FIG. 4 is an interface schematic diagram of an alternative video processing method according to an embodiment of the application;

FIG. 5 is an interface schematic of another alternative video processing method according to an embodiment of the application;

FIG. 6 is an interface schematic diagram of yet another alternative video processing method according to an embodiment of the application;

FIG. 7 is an interface schematic diagram of yet another alternative video processing method according to an embodiment of the application;

FIG. 8 is an interface schematic diagram of yet another alternative video processing method according to an embodiment of the application;

FIG. 9 is an interface schematic diagram of yet another alternative video processing method according to an embodiment of the application;

Fig. 10 is a schematic structural view of an alternative video processing apparatus according to an embodiment of the present invention;

fig. 11 is a schematic structural view of an alternative electronic device according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

According to an aspect of the embodiment of the present invention, there is provided a video processing method, optionally, as an alternative implementation, the video processing method may be applied, but not limited to, in the environment shown in fig. 1.

Man-machine interaction between the user 102 and the user device 104 may be performed in fig. 1. The user device 104 includes a memory 106 for storing interaction data and a processor 108 for processing the interaction data. User device 104 may interact with data via network 110 and server 112. The server 112 includes a database 114 for storing interaction data and a processing engine 116 for processing the interaction data. The user equipment 104 may operate a client, display a target video in a display interface of the client, display a target control in the display interface when receiving the conversion instruction, and execute a target function corresponding to the target control when receiving a trigger instruction executed on the target control.

As an alternative embodiment, the video processing method described above may be applied, but not limited to, in the environment shown in fig. 2.

Human-machine interaction between user 202 and user device 204 may be performed in fig. 2. The user device 204 includes a memory 206 for storing interaction data and a processor 208 for processing the interaction data. The user equipment 204 may operate a client, display a target video in a display interface of the client, display a target control in the display interface when receiving the conversion instruction, and execute a target function corresponding to the target control when receiving a trigger instruction executed on the target control.

Alternatively, the user device 104 or the user device 204 may be, but not limited to, a mobile phone, a tablet computer, a notebook computer, a PC, etc., and the network 110 may be, but not limited to, a wireless network or a wired network. Wherein the wireless network comprises: WIFI and other networks that enable wireless communications. The wired network may include, but is not limited to: wide area network, metropolitan area network, local area network. The server 112 may include, but is not limited to, any hardware device that can perform calculations.

Optionally, as an optional embodiment, as shown in fig. 3, the video processing method includes:

s302, displaying a target video in a display interface on a client;

s304, responding to the received conversion instruction, and converting the voice content in the target video into text content;

s306, displaying the text content in a target control contained in the display interface;

and S308, responding to a trigger instruction executed on the target control, and executing a target function corresponding to the target control, wherein the target function of the target control is determined according to the type of the client.

Alternatively, the video processing method described above may be applied, but is not limited to, in any client. For example, the client is a video applet, or a live application, or a mailbox, and may also be applied to a client with chat functionality. The chat function may be a real-time chat function. The client may have other functions, such as a transfer function, a search function, a forwarding function, etc., while having a chat function. That is, the client in the present application is not limited to the real-time communication client, and may be other clients having a chat function. For example, communication chat between friends may also be performed in a payment device.

The client in the application is a client capable of displaying the target video, and can display the address of the target video in the client or display a video identifier, wherein the video identifier corresponds to the address of the target video. The target video may be played by clicking on the address or video identification. The application is not limited to the type of client, and all clients that can display the target video or display the video identifier of the target video are within the scope of the application.

Taking a live broadcast process as an example, in the live broadcast process, when a live video stream is displayed, live broadcast voice content can be converted into text content, and the text content is displayed in a control mode. If the user clicks on the control, the target function is performed. The target function is a function matching the type of live application.

Or, taking the client with the chat function as an example, displaying the target video in the chat window, and converting the voice content of the target video into text content and displaying the text content in a control mode. If the user clicks the control, a target function corresponding to the type of client is executed.

Or taking a common client, such as a news client, the news client can display the target video, convert the voice content of the target video into text content, and display the text content in a control mode. If the user clicks the control, a target function corresponding to the type of client is executed.

Or taking the mail as an example, after the mail is received, the target video is in the mail, so that the voice content of the target video can be converted into the text content, and the text content is displayed in a control mode. If the user clicks the control, a target function corresponding to the type of client is executed. Such as forwarding, etc.

Optionally, in the present application, the target video may be displayed through a chat window, where the target video may be a video sent by another user to the current user, and the chat window may be a chat window between two users or a chat window between multiple users. And displaying the target video in the chat window of the client, wherein the target video can be displayed through a video frame. For example, as shown in fig. 4, fig. 4 shows a display interface of a client of one user, in which a target video 402 is displayed, when two users chat using a chat window.

After the target video is displayed, the target video can be automatically played or a playing instruction of a user is received for playing. If the video is automatically played, before the video is automatically played, voice content in the target video can be acquired, then the voice content is converted into text content, and then the text content is displayed through the target control. The user may click on the control to perform the target function on the text content.

There are a number of ways to display the target control. The target control can be tied in the target video in the process of playing the target video, or can be displayed around the target video in the process of not playing the target video.

For example, as shown in fig. 5, in the video playing process, text information 502 may be displayed in the video, where text information 502 is text information obtained by converting the audio content of the video. The text information may be segmented and then the target control presented, as shown in fig. 6, with the target control 602 shown in fig. 6 and the text information shown in the target control 602. Fig. 5 and 6 show the case where a target control is displayed in a video or text information is displayed.

As shown in fig. 7 and 8, in fig. 7, text information is displayed below the video content, and in fig. 8, after the text information is segmented, a target control 802 is displayed below the video content. The video need not be played.

If the target control is displayed in the video, the original subtitle needs to be replaced by the target control. That is, if the video band has subtitles, when the target control is determined, the target control replaces the original subtitles when the text content is displayed by using the target control. The replacement may be to delete the original subtitle or to overlay the original subtitle. And displaying the target control in the time period of displaying the original subtitle.

If the target video has no subtitle, after the audio content in the target video is converted into the text content, the corresponding relation between the audio content and the text content is correspondingly recorded, for example, the corresponding relation between the target voice content and the target text content is recorded, the target voice content is a section of voice in the audio, and the target text content is the text content converted from the target voice content. And acquiring a starting time point and an ending time point of the target semantic content, and displaying the target text content in the starting time point and the ending time point. And when the target text content is displayed, displaying a target control, and displaying the target text content in the target control.

When the target control is displayed, a plurality of target controls can be displayed. That is, after converting the audio content in the video into the text content, the text content may be segmented into a plurality of words, and then one word is displayed using each of the plurality of target controls.

The target function in the application can be a function of the client. Such as any of search, interpretation, translation, forwarding, etc.

After the target control is displayed, after a trigger instruction is received and the corresponding function of the control is executed, a result obtained by executing the function of the control can be displayed. For example, search results are presented, or sharing results are presented, or translation results are presented, or meaning of text content is presented. For example, as shown in fig. 9, taking searching as an example, after clicking the target control, text content in the target control is searched, and a search result is displayed. The search may be a search within the client or a full web search with an interface that invokes a search engine.

Optionally, in the application, when the target control is generated, a function needs to be given to the target control, so that after the target control is triggered, a corresponding function can be executed. The functionality to which the target control is assigned may be determined based on the type of client. For example, the client is a search engine, and the target control may be given the function of searching. The client is translation software, and a translation function can be given to the target control. If a client has multiple functions, a function assignment target control can be selected from the multiple functions. Of course, it is also possible to select multiple functions, with one target control assigned to each function. The functionality of each target control needs to be displayed.

In the application, the voice content is converted into the text content by using a target neural network model. The target neural network model is a model obtained by training an original neural network model by using sample voice, and is used for outputting text content corresponding to the voice content after inputting the voice content.

According to the application, the sample voice can be obtained and then input into the original neural network model, and the original neural network model is trained. And determining whether to adjust weights and parameters in the original neural network model by calculating the loss of the original neural network model, and determining the original neural network model as a target neural network model when the identification accuracy of the original neural network model is greater than a first threshold value, such as greater than 99%.

In the above process, the process of automatically converting the voice into the text and performing word segmentation is adopted. The application also provides a method for word segmentation according to the user wish. Different from the above, in the application, after the target video is obtained and the voice content of the target video is displayed as the text content, the text content can be displayed first, then when the user clicks the text content, the content selected by the user can be generated as the target control, the text content selected by the user is displayed in the target control, or the user can segment the text content, and the word segmentation result is generated as the target control. The method can generate target controls in a targeted manner, and execute target functions on content generation interested by the user.

The application is explained below in connection with a specific example. For example, the application is applied to a client with chat function, and the user receives the friend information, wherein the information comprises the video information. The video message may be displayed as shown in fig. 4. However, at this point the video message has not yet been played. The user can select to start the function of converting the sound in the video into the subtitle, if the function is started, the voice content is converted into the text content in the playing process of the video, the text content is segmented, and after the text content is segmented, the word segmentation result can be displayed in the form of a target control, and the corresponding function is not bound by the control. When the user clicks on the target control, the corresponding function may be performed. Or, the user can select to start the function of converting the sound in the video into the subtitle, if the function is started, the voice content is converted into the text content in the playing process of the video, the text content is displayed, the user performs word segmentation or clicks the word of interest, and the system converts the word clicked by the user into a target control and displays the target control. When the user clicks on the target control, the corresponding function may be performed. When the target control is displayed, the target control can replace the original text content. The result may be as shown in fig. 5, where the target control is displayed during video playback.

In the above process, the target control is displayed in the video playing process. The application can also display the target control under the condition that the video is not played. When the user displays the target video on the display interface of the client, the user selects the long-press target video and then selects the voice-to-text function, so that text contents can be displayed below the target video. When the text content is displayed, the target control generated after word segmentation can be directly displayed, the text content can also be directly displayed, the user performs word segmentation on the text content or selects the word of interest, and then the system generates and displays the target control. The displayed target control replaces the original subtitle or text content. When the user clicks the target control, the function of the target control is executed. Such as searching for words within the target control, or forwarding or translating, etc. And displays the result.

The client in the application can be a receiving end, and the target video is sent by a sending end, and is sent to the receiving end through a server. The receiving end obtains the unique identification code VID of the target video, then the VID is sent to the server, the server retrieves video data according to the VID, the video is subjected to voice-to-text processing, and the text is sent to the receiving end. After the receiving end receives the text data, refreshing the display front end to display the text data. The user can long press the text data of the receiving end, select word segmentation in the pop-up menu, and the word segmentation module carries out word segmentation. Of course, the word segmentation module may be disposed in a server, and the server performs word segmentation. And after the word segmentation is successful, generating a control for each word segmentation, and displaying each control by a receiving end. If the control is clicked, the vocabulary is used as input to call up the corresponding information association function in the application. Such as retrieval, or translation, or forwarding or paraphrasing, etc.

Or, the receiving end server side in the application sends the video and sound subtitle conversion request and brings the unique identification code VID of the video, the server side retrieves the video data stored in the server side according to the VID, carries out voice and text conversion processing on the video and adds a time axis to compress the video into subtitle files (text information). The server transmits the video corresponding subtitle file to the client of the receiving end. The client of the receiving end loads the subtitle file when the video is played, and the subtitle is displayed in the video; after a receiving end user clicks a subtitle, the video pauses to play, the subtitle becomes a subtitle word segmentation control, after the user clicks the subtitle word segmentation control, the section of subtitle sentences are transmitted to a word segmentation module to carry out word segmentation processing, the word segmentation module carries out word segmentation processing on text information and returns data to a client, the client refreshes the effect of displaying the words to the original position of the subtitle, each word which is segmented generates a clickable control, and the receiving end user clicks the control corresponding to the word which is segmented, so that the word can be used as an input to call up the corresponding information association function in the application.

As an optional implementation manner, the displaying the text content in the target control included in the display interface includes:

Under the condition that the target video is not played, displaying the target control in an area outside the target video in the display interface, and displaying the text content in the target control;

and displaying the target control in the target video under the condition that the target video is being played, and displaying the text content in the target control.

Optionally, the target video may be played, and when playing, the target control is displayed at the original subtitle position. Or when the target video is not played, the target control is displayed below the target video, so that the flexibility of displaying the target control is improved.

As an optional implementation, the displaying the target control in the target video and displaying the text content in the target control when the target video is playing includes:

under the condition that the target video contains subtitle content, replacing the subtitle content in the target video with the target control;

and displaying the target control in the time period of displaying the caption content.

By the method, repeated display of subtitles can be avoided, and the accuracy of displaying the target control is improved.

acquiring a starting time point and an ending time point of target voice content in the target video under the condition that the target video does not comprise subtitle content, wherein the target voice content is a section of content in the voice content;

starting to display the target control at the starting time point, and displaying the text content corresponding to the target voice content in the target control;

and ending displaying the target control at the ending time point, and canceling displaying the text content corresponding to the target voice content.

That is, in the case where the target video in the present application does not include subtitles, the text content into which the voice content is converted can be displayed with the target video. The starting time point and the ending time point of the target voice content can be determined, so that the text content corresponding to the target voice content is displayed in the starting time point and the ending time point, and the aim that the text content is played along with the voice content is fulfilled.

displaying a plurality of target controls in the display interface;

and displaying a vocabulary of the text content in each target control.

That is, the application can segment word information, and then each target control of the plurality of target controls displays a word segmentation result, thereby realizing the effect of improving the efficiency of displaying the target control.

As an optional embodiment, after executing the target function corresponding to the target control in response to the trigger instruction executed on the target control, the method further includes:

and displaying a target result obtained after the target function is executed, wherein the target result is obtained after the target function is executed on the text content in the target control.

Optionally, the record corresponding to the display target control may be to jump to other pages to display the result, or directly display the result on the current page. By the embodiment, the flexibility of processing the video is improved.

As an alternative embodiment, before executing the target function corresponding to the target control in response to the trigger instruction executed on the target control, the method further includes:

Acquiring the type of the client;

determining a plurality of functions of the client that match the type;

and determining one function from the plurality of functions as the target function.

That is, in the present application, one function can be selected from a plurality of functions of the client to process the converted text information of the video, thereby improving the flexibility of processing the video.

As an alternative embodiment, said converting the voice content in the target video into text content in response to the received conversion instruction includes:

and inputting the voice content into a target neural network model, wherein the target neural network model is a model obtained by training an original neural network model by using sample voice, and the target neural network model is used for outputting text content corresponding to the voice content after inputting the voice content.

According to the application, the semantic content is identified through the target neural network model, and the text content is obtained through conversion, so that the effect of automatically, accurately and efficiently converting the voice content into the text content is realized.

As an optional implementation manner, the executing the target function corresponding to the target control in response to the trigger instruction executed on the target control includes:

Under the condition that the target function is a search function, searching text content in the target control;

under the condition that the target function is a sharing function, the text content in the target control is shared;

under the condition that the target function is a translation function, translating the text content in the target control;

and displaying the meaning of the text content in the target control when the target function is an interpretation function.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.

According to another aspect of the embodiment of the present invention, there is also provided a video processing apparatus for implementing the video processing method described above. As shown in fig. 10, the apparatus includes:

a first display unit 1002 for displaying a target video in a display interface on a client;

A conversion unit 10024, configured to convert the voice content in the target video into text content in response to the received conversion instruction;

a second display unit 1006, configured to display the text content in a target control included in the display interface;

and the execution unit 1008 is configured to execute a target function corresponding to the target control in response to a trigger instruction executed on the target control, where the target function of the target control is determined according to the type of the client.

Alternatively, the video processing apparatus described above may be applied, but not limited to, to any client that can receive and display messages. For example, the client is a video applet, or a live application, or a mailbox, and may also be applied to a client with chat functionality. The chat function may be a real-time chat function. The client may have other functions, such as a transfer function, a search function, a forwarding function, etc., while having a chat function. That is, the client in the present application is not limited to the real-time communication client, and may be other clients having a chat function. For example, communication chat between friends may also be performed in a payment device.

As an alternative embodiment, the second display unit includes:

the first display module is used for displaying the target control in the area outside the target video in the display interface and displaying the text content in the target control under the condition that the target video is not played;

and the second display module is used for displaying the target control in the target video and displaying the text content in the target control under the condition that the target video is being played.

As an alternative embodiment, the second display module includes:

a replacing sub-module, configured to replace, when the target video contains subtitle content, the subtitle content in the target video with the target control;

and the first display sub-module is used for displaying the target control in the time period for displaying the caption content.

As an alternative embodiment, the second display module includes:

the acquisition sub-module is used for acquiring a starting time point and an ending time point of target voice content in the target video under the condition that the target video does not comprise subtitle content, wherein the target voice content is a section of content in the voice content;

and the second display sub-module is used for starting to display the target control at the starting time point, displaying the text content corresponding to the target voice content in the target control, ending to display the target control at the ending time point, and canceling to display the text content corresponding to the target voice content.

As an alternative embodiment, the second display unit includes:

the third display module is used for displaying a plurality of target controls in the display interface;

and the fourth display module is used for displaying one vocabulary of the text content in each target control.

As an alternative embodiment, the device further comprises:

and the third display unit is used for displaying a target result obtained after the target function is executed after the target function corresponding to the target control is executed in response to the trigger instruction executed on the target control, wherein the target result is a result obtained after the target function is executed on the text content in the target control.

As an alternative embodiment, the device further comprises:

the acquisition unit is used for acquiring the type of the client before responding to the trigger instruction executed on the target control and executing the target function corresponding to the target control;

a first determining unit configured to determine a plurality of functions of the client that match the type;

and a second determining unit configured to determine one function from the plurality of functions as the target function.

As an alternative embodiment, the conversion unit comprises:

the input module is used for inputting the voice content into a target neural network model, wherein the target neural network model is a model obtained by training an original neural network model by using sample voice, and the target neural network model is used for outputting text content corresponding to the voice content after inputting the voice content;

And the acquisition module is used for acquiring the text content output by the target neural network model.

As an alternative embodiment, the execution unit includes:

the processing module is used for searching the text content in the target control when the target function is a search function, sharing the text content in the target control when the target function is a sharing function, translating the text content in the target control when the target function is a translation function, and displaying the meaning of the text content in the target control when the target function is an interpretation function.

According to a further aspect of embodiments of the present application there is also provided an electronic device for implementing the above-described video processing method, as shown in fig. 11, the electronic device comprising a memory 1102 and a processor 1104, the memory 1102 having stored therein a computer program, the processor 1104 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.

Alternatively, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of the computer network.

Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:

displaying the target video in a display interface on the client;

responding to the received conversion instruction, and converting voice content in the target video into text content;

displaying the text content in a target control contained in the display interface;

and responding to a trigger instruction executed on the target control, and executing a target function corresponding to the target control, wherein the target function of the target control is determined according to the type of the client.

Alternatively, it will be understood by those skilled in the art that the structure shown in fig. 11 is only schematic, and the electronic device may also be a terminal device such as a smart phone (e.g. an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, and a mobile internet device (Mobile Internet Devices, MID), a PAD, etc. Fig. 11 is not limited to the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in FIG. 11, or have a different configuration than shown in FIG. 11.

The memory 1102 may be used to store software programs and modules, such as program instructions/modules corresponding to the video processing methods and apparatuses in the embodiments of the present invention, and the processor 1104 executes the software programs and modules stored in the memory 1102 to perform various functional applications and data processing, i.e., implement the video processing methods described above. Memory 1102 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 1102 may further include memory located remotely from processor 1104, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1102 may be used to store, but is not limited to, information about a target video, a target control, and the like. As an example, as shown in fig. 11, the memory 1102 may include, but is not limited to, a first display unit 1002, a conversion unit 1004, a second display unit 1006, and an execution unit 1008 in the video processing apparatus. In addition, other module units in the video processing apparatus may be included, but are not limited to, and are not described in detail in this example.

Optionally, the transmission device 1106 is used to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission device 1106 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 1106 is a Radio Frequency (RF) module for communicating wirelessly with the internet.

In addition, the electronic device further includes: a display 1108 for displaying a target video with a target control; and a connection bus 1110 for connecting the respective module parts in the above-described electronic apparatus.

According to a further aspect of embodiments of the present invention, there is also provided a computer readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.

Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:

Displaying the target video in a display interface on the client;

Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present invention.

In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.

In several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, such as the division of the units, is merely a logical function division, and may be implemented in another manner, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims

1. A video processing method, comprising:

displaying the target video in a display interface on the client;

responding to a trigger instruction executed on the target control, executing a target function corresponding to the target control, wherein the target function of the target control is determined according to the type of the client;

The method further comprises the steps of: under the condition that the client side has multiple functions, assigning one target control for each function, and displaying the function of each target control;

the displaying the text content in the target control included in the display interface comprises the following steps:

after converting the voice content in the target video into the text content, displaying the text content; when the user clicks the text content, the text content selected by the user is generated into the target control, and the text content selected by the user is displayed in the target control.

2. The method of claim 1, wherein displaying the text content in a target control contained in the display interface comprises:

3. The method of claim 2, wherein the displaying the target control in the target video and the text content in the target control while the target video is being played comprises:

4. The method of claim 2, wherein the displaying the target control in the target video and the text content in the target control while the target video is being played comprises:

5. The method of claim 1, wherein displaying the text content in a target control contained in the display interface comprises:

Displaying a plurality of target controls in the display interface;

and displaying a vocabulary of the text content in each target control.

6. The method of claim 1, wherein after executing the target function corresponding to the target control in response to the trigger instruction executed on the target control, the method further comprises:

7. The method of claim 1, wherein prior to executing the target function corresponding to the target control in response to the trigger instruction executed on the target control, the method further comprises:

acquiring the type of the client;

determining a plurality of functions of the client that match the type;

8. The method of claim 1, wherein converting the voice content in the target video to text content in response to the received conversion instruction comprises:

9. The method according to any one of claims 1 to 8, wherein the executing the target function corresponding to the target control in response to the trigger instruction executed on the target control includes:

searching the text content in the target control under the condition that the target function is a search function;

translating the text content in the target control under the condition that the target function is a translation function;

and displaying the meaning of the text content in the target control when the target function is an explanation function.

10. A video processing apparatus, comprising:

a first display unit for displaying a target video in a display interface on a client;

The conversion unit is used for responding to the received conversion instruction and converting the voice content in the target video into text content;

the second display unit is used for displaying the text content in a target control contained in the display interface;

the execution unit is used for responding to a trigger instruction executed on the target control and executing a target function corresponding to the target control, wherein the target function of the target control is determined according to the type of the client;

the device is also for: under the condition that the client side has multiple functions, assigning one target control for each function, and displaying the function of each target control;

the device is used for displaying the text content in a target control contained in the display interface in the following way: after converting the voice content in the target video into the text content, displaying the text content; when the user clicks the text content, the text content selected by the user is generated into the target control, and the text content selected by the user is displayed in the target control.

11. The apparatus of claim 10, wherein the second display unit comprises:

12. The apparatus of claim 11, wherein the second display module comprises:

13. The apparatus of claim 11, wherein the second display module comprises:

14. The apparatus of claim 10, wherein the second display unit comprises:

the third display module is used for displaying a plurality of target controls in the display interface; and the fourth display module is used for displaying one vocabulary of the text content in each target control.

15. The apparatus of claim 10, wherein the apparatus further comprises:

16. The apparatus of claim 10, wherein the apparatus further comprises:

The acquisition unit is used for acquiring the type of the client before responding to the trigger instruction executed on the target control and executing the target function corresponding to the target control; a first determining unit configured to determine a plurality of functions of the client that match the type; and a second determining unit configured to determine one function from the plurality of functions as the target function.

17. The apparatus of claim 10, wherein the conversion unit comprises:

the input module is used for inputting the voice content into a target neural network model, wherein the target neural network model is a model obtained by training an original neural network model by using sample voice, and the target neural network model is used for outputting text content corresponding to the voice content after inputting the voice content; and the acquisition module is used for acquiring the text content output by the target neural network model.

18. The apparatus according to any one of claims 10 to 17, wherein the execution unit comprises:

19. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 9.

20. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program which, when executed by the processor, implements the method of any of claims 1 to 9.