CN116010654A - Method, device, equipment and storage medium for adding audio annotation to video - Google Patents

Method, device, equipment and storage medium for adding audio annotation to video

Info

Publication number
CN116010654A
Authority
CN
China
Prior art keywords
video
time period
time periods
audio
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310020575.9A
Other languages
Chinese (zh)
Inventor
王聪
陈天峰
林义圣
马泽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Lemon Inc Cayman Island
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Lemon Inc Cayman Island
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd, Lemon Inc Cayman Island filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202310020575.9A priority Critical patent/CN116010654A/en
Publication of CN116010654A publication Critical patent/CN116010654A/en
Pending legal-status Critical Current

Abstract

According to embodiments of the present disclosure, methods, apparatuses, devices, and storage media for adding audio annotations to video are provided. The method comprises: presenting an indication of one or more time periods of the video, the one or more time periods being determined as candidate time periods that can be used to add audio; in response to receiving a selection of a first time period of the one or more time periods, presenting visual information of a video segment of the video associated with the first time period; and receiving input for the video segment, the input to be used to generate an audio annotation for the video segment. Thus, according to embodiments of the present disclosure, audio annotations can be added to a video quickly, improving annotation efficiency.

Description

Method, device, equipment and storage medium for adding audio annotation to video
Technical Field
Example embodiments of the present disclosure relate generally to the field of computers, and more particularly, relate to a method, apparatus, device, and computer-readable storage medium for adding audio annotations to video.
Background
Videos of all kinds have become an important part of people's work and daily life. In some cases, in order to help viewers, especially visually impaired viewers, understand non-speech (e.g., non-dialogue, non-narration) plot elements in a video more clearly and conveniently, it is necessary to add audio annotations to the video. An audio annotation describes a non-speech plot element in the video in spoken form. For example, when producing a barrier-free (accessible) version of a video work for visually impaired people, it is desirable to add such audio annotations to the work. With audio annotations, the user can better understand the content of the video.
Disclosure of Invention
In a first aspect of the present disclosure, a method of adding an audio annotation to video is provided. The method comprises the following steps: presenting an indication of one or more time periods of the video, the one or more time periods being determined as candidate time periods that can be used to add audio; in response to receiving a selection of a first time period of the one or more time periods, presenting visual information of a video segment of the video associated with the first time period; and receiving input for the video clip, the input to be used to generate an audio annotation for the video clip.
In a second aspect of the present disclosure, an apparatus for adding audio annotations to video is provided. The device comprises: a time period determination module configured to present an indication of one or more time periods of the video, the one or more time periods being determined as candidate time periods that can be used to add audio; a visual information presentation module configured to present visual information of a video segment in the video associated with a first time period in response to receiving a selection of the first time period of the one or more time periods; and a receiving module configured to receive an input for the video clip, the input to be used to generate an audio annotation for the video clip.
In a third aspect of the present disclosure, an electronic device is provided. The electronic device comprises at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the electronic device to perform the method of the first aspect.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The computer readable storage medium has stored thereon a computer program executable by a processor to implement the method of the first aspect.
It should be understood that what is described in this section is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, wherein like or similar reference numerals denote like or similar elements, in which:
FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a flow chart of a method of adding audio annotations to a video in accordance with some embodiments of the present disclosure;
FIG. 3 illustrates a schematic diagram of a user interface for presenting an indication of one or more time periods, according to some embodiments of the present disclosure;
FIG. 4 illustrates a schematic diagram for detecting one or more non-speech segments in a video, according to some embodiments of the present disclosure;
FIG. 5 illustrates a schematic diagram of a user interface for presenting video clips, according to some embodiments of the present disclosure;
FIG. 6 illustrates a schematic diagram of a user interface for inputting information related to audio annotations, in accordance with some embodiments of the present disclosure;
FIG. 7 illustrates a schematic diagram of a user interface for selecting auditory effects according to some embodiments of the present disclosure;
FIG. 8 illustrates a block diagram of an apparatus for adding audio annotations to video in accordance with some embodiments of the present disclosure; and
fig. 9 illustrates a block diagram of an apparatus capable of implementing various embodiments of the present disclosure.
Detailed Description
It will be appreciated that, prior to using the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained, in an appropriate manner in accordance with relevant laws and regulations.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly inform the user that the requested operation requires the acquisition and use of the user's personal information. The user can then autonomously decide, based on the prompt information, whether to provide personal information to the software or hardware (such as an electronic device, application, server, or storage medium) that performs the operations of the technical solution of the present disclosure.
As an optional but non-limiting implementation, in response to receiving an active request from a user, the prompt information may be sent to the user in the form of, for example, a pop-up window, in which the prompt information may be presented as text. The pop-up window may also carry a selection control allowing the user to choose "agree" or "disagree" to providing personal information to the electronic device.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
It will be appreciated that the data involved in the present technical solution (including but not limited to the data itself and its acquisition or use) should comply with the requirements of applicable laws and regulations.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been illustrated in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather, these embodiments are provided so that this disclosure will be more thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be noted that any section/subsection headings provided herein are not limiting. Various embodiments are described throughout this document, and any type of embodiment may be included under any section/subsection. Furthermore, the embodiments described in any section/subsection may be combined in any manner with any other embodiment described in the same section/subsection and/or in a different section/subsection.
In describing embodiments of the present disclosure, the term "comprising" and similar terms should be taken to be open-ended, i.e., "including, but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
As used herein, the term "model" refers to a construct that can learn associations between respective inputs and outputs from training data, so that, after training is completed, a corresponding output can be generated for a given input. The generation of the model may be based on machine learning techniques. Deep learning is a machine learning algorithm that processes inputs and provides corresponding outputs by using multiple layers of processing units. A "model" may also be referred to herein as a "machine learning model," "machine learning network," or "network," and these terms are used interchangeably herein. A model may in turn comprise different types of processing units or networks.
As mentioned briefly above, in some cases it is desirable to add audio annotations to a video (e.g., a movie, a television show, etc.). For this purpose, a narration script (manuscript) describing the non-dialogue, non-narration plot in the video needs to be created first. Conventionally, there are two schemes for producing such a manuscript. In one scheme, an annotator searches for non-dialogue, non-narration episodes in the video, and then enters the contents of the manuscript into a document or spreadsheet according to the playback time periods of those episodes (also called the time periods at which the manuscript is to be inserted), thereby annotating the video. In another scheme, the annotator converts the manuscript into an annotation for the video by using specialized production tools. For example, the contents of the manuscript are edited using the subtitle function of video editing software, and the editing software is then used in post-production to dub the manuscript.
However, these conventional schemes have some problems. On the one hand, the time periods in which the manuscript can be inserted must be searched for manually, so annotation efficiency is low. On the other hand, the whole video can only be annotated by a single person following the chronological order of the episodes, and collaborative annotation by multiple people cannot be realized. In the scheme using professional software, the work must be performed by personnel with audio/video training, so it is limited by the availability of such personnel and cannot be carried out at scale.
Embodiments of the present disclosure propose a solution for adding audio annotations to video, to address one or more of the above problems as well as other potential problems. According to various embodiments of the present disclosure, one or more time periods that can be used to add audio annotations are determined, and an indication of those time periods is presented. If a selection of a time period of the one or more time periods is received, visual information of the video segment corresponding to the selected time period is presented (e.g., one or more frames of the video segment, or the video segment itself). Input (e.g., text information, voice information) is then received for the video segment, which is to be used to generate an audio annotation for the video segment.
In embodiments of the present disclosure, annotators are not required to manually find the time periods at which annotations can be added, and the relevant segments can be viewed after a time period is selected. Therefore, with embodiments of the present disclosure, audio annotations can be added to a video quickly, improving annotation efficiency.
The scheme for adding audio annotations to video can effectively help users understand the episodes and content in the video, especially people with permanent or temporary visual impairment (also called visually impaired users or visually impaired people). It should be appreciated that the schemes provided by embodiments of the present disclosure may be provided to facilitate a particular population, but this does not imply any discrimination against that population.
Some example embodiments of the present disclosure are described below with reference to the accompanying drawings.
FIG. 1 illustrates a schematic diagram of an example environment 100 in which embodiments of the present disclosure may be implemented. In environment 100, service device 110 may be used to provide a video annotation platform. Service device 110 may be any type of device having computing capabilities. For example, the service device 110 may include a computing system/server, such as a mainframe, edge computing node, computing device in a cloud environment, and so forth.
The environment 100 also includes a plurality of user devices 120-1, 120-2, 120-3 (individually or collectively referred to as user devices 120) associated, respectively, with a plurality of users 130-1, 130-2, 130-3 (individually or collectively referred to as users 130). The service device 110 is communicatively connected to the user devices 120. In environment 100, a user device 120 may be any type of terminal device having computing capabilities. The terminal device may be any type of mobile terminal, fixed terminal, or portable terminal, including a mobile handset, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, media computer, multimedia tablet, Personal Communication System (PCS) device, personal navigation device, Personal Digital Assistant (PDA), audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, electronic book device, game device, or any combination of the preceding, including accessories and peripherals of these devices, or any combination thereof. A user 130 may be an annotator who adds annotations to the video, whether a professional or a volunteer.
In embodiments of the present disclosure, the scheme of adding audio annotations to video is implemented by the service device 110 and/or the user device 120. In some embodiments, the service device 110 and the user device 120 may cooperatively implement a scheme for adding audio annotations to video. For example, the service device 110 may present a user interface of a browser or application to the corresponding user 130 via the user device 120. The user interface may be displayed with information needed to add the annotation. User 130 may provide input for audio annotations through interaction with a user interface. It should be understood that the number of users 130 and user devices 120 in fig. 1 is merely exemplary and is not intended to limit the scope of the present disclosure. The number of users 130 and user devices 120 may also be any suitable other number in embodiments of the present disclosure, as the present disclosure is not particularly limited in this respect.
It should be understood that the structure and function of environment 100 are described for illustrative purposes only and are not meant to suggest any limitation as to the scope of the disclosure. For example, in some embodiments, the scheme of adding audio annotations to video of embodiments of the present disclosure may be implemented by a single device (e.g., service device 110 or user device 120).
Fig. 2 illustrates a flow chart of a method 200 of adding audio annotations to a video in accordance with some embodiments of the present disclosure. In some embodiments, the method 200 may be implemented at the service device 110 and/or the user device 120 shown in fig. 1, for example. The method 200 is described below with reference to fig. 1 for illustrative purposes only.
At block 210, an indication of one or more time periods of the video is presented. The one or more time periods are determined as candidate time periods that can be used to add audio annotations, and are therefore also referred to as candidate time periods hereinafter. These may be time periods during which adding additional audio does not interfere with the viewer's listening to the original audio of the video.
The added audio annotation may include any suitable sound form and auditory effect. The audio annotation may include speech, for example a spoken description of a scene in the video, or a spoken description of a character's actions, clothing, expressions, mental activity, and so on. Alternatively or additionally, the audio annotation may include atmosphere-setting background music, sound effects, and the like that fit the scene in the video.
Service device 110 may present an indication of one or more time periods of the video via user device 120. For example, the service device 110 may send information regarding one or more time periods to the user device 120. Upon receiving such information, the user device 120 may present an indication of one or more time periods to the corresponding user 130. Such an indication may be presented to the corresponding user 130 through, for example, a browser of the user device 120 or a user interface of an application.
In the example of fig. 1, the service device 110 may send information regarding candidate time periods of the video to any one or more of the user devices 120-1, 120-2, 120-3. Accordingly, the user device 120 may present an indication of one or more time periods based on the received information. For example, the user device 120-1 may present an indication of the one or more time periods to the user 130-1; the user device 120-2 may present an indication of the one or more time periods to the user 130-2; and the user device 120-3 may present an indication of the one or more time periods to the user 130-3. Presenting the indication to multiple users 130 enables the users 130 to add audio annotations to the video concurrently, further improving the efficiency of adding video annotations.
The indication of the one or more time periods may be presented in any suitable visual pattern. Fig. 3 illustrates a schematic diagram of a user interface 300 for presenting an indication of one or more time periods, according to some embodiments of the present disclosure.
As shown in FIG. 3, the user interface 300 includes a region 309-1 and a region 309-2 (individually or collectively referred to as regions 309) for adding audio annotations to the video. It should be appreciated that the number of regions 309 in FIG. 3 is by way of example only, and that the user interface 300 may include any number of regions that fit the size of the user interface 300. The present disclosure is not particularly limited in this respect. Furthermore, if indications of multiple time periods cannot be presented on the same page, the indications may be presented across multiple pages.
Each area 309 corresponds to a time period and has an indication of the time period, also referred to as a time period indication, displayed therein. As an example, the regions 309-1 and 309-2 are displayed with a time period indication 310-1 and a time period indication 310-2, respectively (also referred to individually or collectively as time period indications 310). The area 309-1 is used to add video annotations for the candidate time period 00:12-01:23, and the time period indication 310-1 shows the start time 00:12 and end time 01:23 of the candidate time period. The area 309-2 is used to add video annotations for the candidate time periods 03:05-03:67, and the time period indication 310-2 shows the start time 03:05 and the end time 03:67 of the candidate time period.
In some embodiments, the area 309 may also display an input box and indications of a plurality of candidate auditory effects. As an example, the area 309-1 also displays an input box 330, an indication 350-1 of a deep male voice, an indication 350-3 of a neutral female voice, and an indication 350-5 of a clear female voice (these indications are collectively referred to as indications of candidate auditory effects 350), which will be described in detail below.
It should be appreciated that the presentation style of the plurality of regions 309 may be the same or different. The presentation style of region 309 shown in fig. 3 is by way of example only and the disclosure is not particularly limited in this regard.
In embodiments of the present disclosure, the candidate time periods may be determined in any suitable manner. In some embodiments, where the video has metadata (e.g., production data for a movie work), the candidate time periods may be determined based on the metadata.
In some embodiments, the candidate time periods may be determined by detecting non-speech segments in the video. For example, the service device 110 may detect one or more non-speech segments in the video, i.e., segments containing an amount of speech below a predetermined threshold. The service device 110 may determine whether the amount of speech contained in a video segment is below the predetermined threshold. The amount of speech may be measured by any suitable metric; for example, it may include the volume of the speech and/or the duration of the speech. If the amount of speech is below the predetermined threshold, the service device 110 may determine the video segment to be a non-speech segment. If the amount of speech is above the predetermined threshold, the service device 110 determines the video segment to be a speech segment.
As an example, a non-speech segment in a video may have no sound at all, or may include background sounds such as sounds produced by a person's actions (e.g., moving a table, hammering a nail), sounds of objects (such as a car engine), or environmental sounds (such as rain). Speech segments, in contrast, may include segments containing human voices such as dialogue, narration, or singing.
Further, the service device 110 may determine the candidate time period based on one or more non-speech segments. For example, a time period corresponding to a non-speech segment or a portion thereof may be determined as a candidate time period.
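By way of illustration, the following Python sketch implements the volume-based detection described above: it scans a decoded audio track for low-energy stretches and returns them as candidate time periods. It should be understood that this is a minimal sketch only; the function name, window size, RMS threshold, and minimum duration are illustrative assumptions and are not prescribed by this disclosure.

```python
# Minimal sketch of non-speech segment detection by volume thresholding.
# Assumes the audio track has been decoded to a mono float array `samples`
# at `sample_rate` Hz; thresholds and window sizes are illustrative only.
import numpy as np

def find_non_speech_segments(samples, sample_rate,
                             window_s=0.5, rms_threshold=0.02,
                             min_duration_s=2.0):
    """Return (start, end) times, in seconds, of low-volume stretches."""
    window = int(window_s * sample_rate)
    segments, seg_start = [], None
    for offset in range(0, len(samples), window):
        frame = samples[offset:offset + window]
        rms = float(np.sqrt(np.mean(frame ** 2))) if len(frame) else 0.0
        t = offset / sample_rate
        if rms < rms_threshold:            # quiet enough to count as "non-speech"
            if seg_start is None:
                seg_start = t
        elif seg_start is not None:        # loud frame closes an open quiet stretch
            if t - seg_start >= min_duration_s:
                segments.append((seg_start, t))
            seg_start = None
    total = len(samples) / sample_rate
    if seg_start is not None and total - seg_start >= min_duration_s:
        segments.append((seg_start, total))
    return segments

# Each returned segment (or a portion of it) can then be offered to the
# annotator as a candidate time period for inserting an audio annotation.
```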
Fig. 4 illustrates a schematic diagram for detecting one or more non-speech segments in a video, according to some embodiments of the present disclosure. As shown in fig. 4, the service device 110 may detect the audio 430 of the video using the audio event detection system 450 to obtain one or more non-speech segments 470.
As an example, the audio event detection system 450 may be an AED (audio event detection) system. The AED system can detect the presence or absence of a target sound event in a continuous audio stream. In some embodiments, the target sound event may be a speech event, such as speaking or singing. By detecting speech events, the non-speech segments 470 can be determined from the segments without speech events. Alternatively, in some embodiments, the target sound event may be a non-speech event, and the corresponding video segments are then the non-speech segments. It should be appreciated that the audio event detection system 450 may also be another model capable of detecting the non-speech segments 470 in the video, and the present disclosure is not particularly limited in this respect.
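If the AED system reports where speech events occur, the non-speech segments 470 can be obtained as the complement of those intervals. The sketch below assumes the detected speech events are already available as (start, end) pairs in seconds; the AED call itself is assumed and not shown, and the minimum gap length is an illustrative parameter.

```python
# Illustrative complement computation: given speech-event intervals reported
# by an audio event detection (AED) system, the gaps between them are taken
# as non-speech segments.
def non_speech_from_speech_events(speech_events, total_duration, min_gap=1.0):
    """speech_events: list of (start, end) tuples in seconds; returns the gaps."""
    gaps, cursor = [], 0.0
    for start, end in sorted(speech_events):
        if start - cursor >= min_gap:      # a usable quiet gap before this event
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if total_duration - cursor >= min_gap: # trailing gap after the last event
        gaps.append((cursor, total_duration))
    return gaps

# Example: speech detected at 0-12 s and 83-180 s in a 200 s video yields
# candidate gaps (12, 83) and (180, 200).
```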
The foregoing describes the service device 110 automatically detecting one or more non-speech segments and their corresponding time periods in the video, and presenting an indication of the one or more time periods to one or more users 130 via one or more user devices 120. On the one hand, the non-speech segments and their corresponding time periods do not need to be searched for manually, which speeds up locating them. On the other hand, the distributed user devices 120 may concurrently receive selections of time periods of the video from multiple users 130, who can then add information for the audio annotations corresponding to the selected time periods, thereby improving the efficiency of adding audio.
Next, at block 220, a determination is made as to whether a selection of a first time period of the one or more time periods is received. The user 130, having knowledge of the candidate time periods in the video, may select a time period therein to add information about the audio annotation.
If a selection of the first time period is received, the method 200 proceeds to block 230. At block 230, visual information of a video segment in the video associated with the first time period is presented. In some embodiments, the user device 120 receives a selection of the first time period and presents visual information of the video segment of the video associated with the first time period. For example, the user device 120, after receiving the selection of the first time period, transmits the selection to the service device 110. The service device 110 determines the video clip associated with the first time period from the video according to the selection and presents visual information of the video clip via the user device 120.
The video clip associated with the first time period may have any suitable length. In some embodiments, the video clip associated with the first time period may begin at the beginning of the first time period and end at the end of the first time period.
In some embodiments, the video clip associated with the first time period may begin at the beginning of the first time period and end at the beginning of a second time period that immediately follows the first time period among the one or more time periods. To give a comprehensive view of the video clip's content, the clip may contain, in addition to the video content corresponding to the first time period, the video content of a period of time after the first time period, that is, the video content between the end of the first time period and the start of the second time period. In this way, the annotator can fully understand the episode and content, so that the added annotation fully interprets the video clip.
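A minimal sketch of this boundary rule follows. The helper name and the fallback used when no later candidate period exists are assumptions for illustration only.

```python
# Hedged sketch of the clip-boundary rule described above: the presented clip
# starts at the beginning of the selected (first) time period and, when a later
# candidate period exists, extends to the start of that second period so the
# annotator also sees the context that follows. Ending at the selected period's
# own end when no later period exists is an assumption of this sketch.
def clip_bounds(selected, candidates):
    """selected and candidates are (start, end) pairs in seconds."""
    start, end = selected
    later_starts = [s for s, _ in candidates if s > start]
    return start, (min(later_starts) if later_starts else end)

# Example: with candidate periods (12, 83) and (185, 227), selecting the first
# yields a clip spanning 12 s to 185 s.
```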
The visual information of the video clip may include one or more frames of the video clip. For example, presentation may be performed every predetermined number of frames. Alternatively or additionally, the visual information may include the video clip itself. Fig. 5 illustrates a schematic diagram of a user interface 500 for presenting video clips, according to some embodiments of the present disclosure. In fig. 5, user device 120 receives a selection of time periods 00:12-01:23 corresponding to time period indication 310-1. Accordingly, the region 309-1 is presented with a different visual effect than the region 309-2 of the unselected time period. In response to the selection of time period 00:12-01:23, user device 120 presents a video clip associated with time period 00:12-01:23 in region 510. It should be understood that the video shown in fig. 5 is by way of example only, and the disclosure is not particularly limited herein.
At block 240, input is received for the video clip, the input to be used to generate an audio annotation for the video clip. The user device 120 may receive input for the video clip from the user 130 and send the input to the service device 110. That is, the service device 110 receives the input for the video clip from the user device 120. In some embodiments, such input may include text information, such as a manuscript, describing the video clip. Alternatively or additionally, in some embodiments, such input may include voice information describing the video clip. For example, the user 130 may speak a description of an event or content occurring in the video clip.
In embodiments that receive text information, the user device 120 may present the received text information, e.g., display it in the area 309. Fig. 6 illustrates a schematic diagram of a user interface 600 for presenting text information, according to some embodiments of the present disclosure. In fig. 6, after viewing the visual information of the video clip presented in region 510, the user 130 enters text information for the video clip in the input box 330, e.g., the text information "A pushes the door open, looks at B, and speaks". In some embodiments, the text information may further be added to the video for display as subtitles. For example, in the example of fig. 6, the text information entered by the user 130 is displayed in region 510. It should be understood that the text information shown in fig. 6 is merely an example, and the present disclosure is not particularly limited in this respect.
In some embodiments, the service device 110 may convert the text information in the input into voice information. For example, the text information "A pushes the door open, looks at B, and speaks" in fig. 6 may be converted into voice information. It should be appreciated that the service device 110 may convert the text information into voice information using any suitable text-to-speech (TTS) technology, and the present disclosure is not particularly limited in this respect. Further, the service device 110 can generate an audio annotation for the video clip based at least in part on the voice information. For example, the service device 110 may generate audio annotations for the video based on the voice information of one or more video clips and their corresponding time periods.
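By way of illustration, the sketch below uses the pyttsx3 library as one possible offline TTS backend. The disclosure does not prescribe any particular TTS technology, so the library choice, speaking rate, and file name here are assumptions for the sketch.

```python
# Minimal text-to-speech sketch using pyttsx3 as an assumed offline TTS
# backend. The synthesized file can later be mixed into the video at the
# start of the corresponding candidate time period.
import pyttsx3

def synthesize_annotation(text, out_path, rate=170):
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)      # speaking speed, in words per minute
    engine.save_to_file(text, out_path)   # queue synthesis of the narration to a file
    engine.runAndWait()                   # run the queued command synchronously
    return out_path

# Example: the manuscript for the 00:12-01:23 candidate period becomes an audio
# file that is later inserted at 00:12 during final assembly.
annotation_audio = synthesize_annotation(
    "A pushes the door open, looks at B, and speaks.",
    "annotation_0012.wav",
)
```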
In some embodiments, an audition (preview listening) function may be provided for annotators. For example, indications of a plurality of candidate auditory effects may be presented, the plurality of candidate auditory effects being used to generate the audio annotation for the video clip. The candidate auditory effects may include multiple types of auditory effects, or combinations of multiple types of auditory effects. For example, an auditory effect may be any one or more of timbre, pitch, and volume. Alternatively, the auditory effects may also include other specific sound effects. The present disclosure is not particularly limited in this respect.
As an example, indications of a plurality of candidate auditory effects may be displayed in the region 309. Fig. 7 illustrates a schematic diagram of a user interface 700 for selecting auditory effects according to some embodiments of the present disclosure. As shown in figs. 3, 5, 6, and 7, indications 350-1, 350-3, 350-5 of a plurality of candidate auditory effects (individually or collectively referred to as auditory effect indications 350) may be displayed in the region 309. In this example, the candidate auditory effects may include different timbres. For example, the timbres may include a deep male voice, a neutral female voice, a clear female voice, and so on. The timbres may also include other types, such as a lively male voice. The indications of the candidate auditory effects corresponding to these timbres are, respectively, the indication 350-1 of the deep male voice, the indication 350-3 of the neutral female voice, and the indication 350-5 of the clear female voice. It should be understood that the auditory effect indications 350 shown in figs. 3, 5, 6, and 7 are by way of example only, and the present disclosure is not particularly limited in this respect.
Further, the user device 120 may receive a selection of one of the plurality of candidate auditory effects. The user device 120 and/or the service device 110 generates an audio clip having the selected auditory effect based on the input, and the user device 120 in turn presents the generated audio clip. Such an audio clip may be a candidate for the audio annotation of the video clip. By listening to the audio clip, the user 130 can confirm whether the entered text information adequately describes the video clip. The user 130 may then modify the entered information and confirm his or her input.
In the example of fig. 7, the user 130 has selected the deep male voice effect. Accordingly, the user device 120 presents audio of "A pushes the door open, looks at B, and speaks" read in a deep male voice.
In the above, the selection and application of an auditory effect is described using text information as an example. It should be understood that if the input includes voice information, the received voice may be updated to have the selected auditory effect. In such embodiments, the annotator can listen to the result immediately after writing, which improves the audition experience and facilitates the generation of more accurate audio annotations.
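As a rough illustration of how a selected candidate auditory effect might be applied during audition, the sketch below maps the three example timbres to TTS parameters, again using pyttsx3 as an assumed backend. The preset values and the gender-based voice lookup are illustrative assumptions; which voices are available depends on the host system.

```python
# Illustrative mapping from the candidate auditory effects shown in the UI
# (deep male voice, neutral female voice, clear female voice) to TTS
# parameters. Property values are assumptions for this sketch.
import pyttsx3

EFFECT_PRESETS = {
    "deep_male":      {"rate": 150, "voice_hint": "male"},
    "neutral_female": {"rate": 170, "voice_hint": "female"},
    "clear_female":   {"rate": 185, "voice_hint": "female"},
}

def preview_with_effect(text, effect_key):
    preset = EFFECT_PRESETS[effect_key]
    engine = pyttsx3.init()
    engine.setProperty("rate", preset["rate"])
    # Pick the first installed voice whose metadata matches the hint, if any;
    # voice gender metadata may be missing on some platforms.
    for voice in engine.getProperty("voices"):
        if preset["voice_hint"] in (voice.gender or "").lower():
            engine.setProperty("voice", voice.id)
            break
    engine.say(text)       # play the candidate audio clip for the annotator
    engine.runAndWait()
```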
In some embodiments, the user device 120 may present an indication of one or more time periods of the video to a first user and a second user of the plurality of users 130. The first user and the second user may be any two users of the plurality of users 130. For example, the first user may be user 130-1 and the second user may be user 130-2. It should be understood that such first and second users are by way of example only and that the present disclosure is not particularly limited in this respect.
The first user and the second user may each select, based on the indication of the one or more time periods presented by their respective user devices, a time period from the one or more time periods to which they wish to add audio. In some embodiments, if it is determined that a selection of one of the one or more time periods is received from the first user, the service device 110 may prohibit the second user from selecting that time period. Since the plurality of user devices 120 each present indications of one or more time periods of the same video to the plurality of users 130, to avoid multiple users 130 repeatedly selecting the same time period to add audio annotations, the service device 110, upon receiving the first user's selection of a time period, prohibits the second user from selecting that time period. For example, the indication of the time period may be presented to the second user in a non-selectable fashion, or the second user may be prompted that the time period is not available when attempting to select it. For example, in the example of fig. 3, the service device 110 prohibits the second user from selecting candidate time period 00:12-01:23 in response to receiving a selection of candidate time period 00:12-01:23 from the first user.
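One way to realize this exclusion on the service device is a simple reservation registry, sketched below. The class name, in-memory storage, and thread lock are implementation assumptions for illustration; the disclosure only requires that a period selected by one user be made unavailable to others.

```python
# Minimal sketch of a server-side reservation that prevents two annotators
# from selecting the same candidate time period.
import threading

class TimePeriodRegistry:
    def __init__(self, periods):
        self._lock = threading.Lock()
        self._owner = {p: None for p in periods}   # period -> user id or None

    def try_select(self, period, user_id):
        """Return True if the period was free and is now assigned to user_id."""
        with self._lock:
            if self._owner.get(period) is None:
                self._owner[period] = user_id
                return True
            return False                            # already taken: selection refused

# Example: the first user reserves 00:12-01:23; the second user's attempt fails,
# so that user's UI can grey the period out or show an "unavailable" prompt.
registry = TimePeriodRegistry([("00:12", "01:23"), ("03:05", "03:67")])
assert registry.try_select(("00:12", "01:23"), "user-130-1") is True
assert registry.try_select(("00:12", "01:23"), "user-130-2") is False
```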
In some embodiments, multiple user devices 120 may synchronously present the input for a video clip, so that each user 130 can observe the input of other users 130 in the user interface provided by the corresponding user device 120. Continuing with the example of fig. 6, if the first user entered the text information "A pushes the door open, looks at B, and speaks" in area 309-1, the user device 120-2 of the second user may display that text information in area 309-1 of the user interface it provides. In this way, the second user can see that time period 00:12-01:23 has already been selected, and may instead select time period 03:05-03:67 displayed in area 309-2. Repeated annotation of the same video segment can thus be effectively avoided when multiple users 130 collaboratively annotate the same video.
Alternatively, in some embodiments, different users 130 may be allowed to provide input for generating audio annotations for the same candidate time period. The service device 110 may then select from these inputs the input that is ultimately used to generate the audio annotation, or may integrate the inputs together. In this way, the sources of annotation information can be enriched.
Example embodiments of the present disclosure are described above with reference to figs. 2 through 7. From the above description, it can be seen that with embodiments of the present disclosure there is no technical-skill requirement for annotators, so ordinary annotators can also provide input for video annotation. This reduces the difficulty of video annotation tasks and expands the pool of people who can annotate, thereby facilitating faster production of more annotated video to assist more people in need (such as visually impaired people). Further, it should be understood that the operations described with reference to the service device 110 and the user device 120, respectively, may be implemented at a single device.
Fig. 8 illustrates a schematic block diagram of an apparatus 800 for adding audio annotations to video in accordance with some embodiments of the present disclosure. The apparatus 800 may be implemented as or included in the service device 110 and/or the user device 120. The various modules/components in apparatus 800 may be implemented in hardware, software, firmware, or any combination thereof.
As shown, the apparatus 800 includes a time period determination module 810 configured to present an indication of one or more time periods of the video, the one or more time periods being determined as candidate time periods that can be used to add audio annotations. The apparatus 800 further includes a visual information presentation module 820 configured to present visual information of a video segment of the video associated with a first time period of the one or more time periods in response to receiving a selection of the first time period. The apparatus 800 further comprises a receiving module 830 configured to receive an input for the video clip, the input to be used for generating an audio annotation for the video clip.
In some embodiments, the apparatus 800 further comprises: a conversion module configured to convert text information in input into voice information; and a video annotation generation module configured to generate an audio annotation for the video clip based at least in part on the speech information.
In some embodiments, the apparatus 800 further comprises: an auditory effect presentation module configured to present an indication of a plurality of candidate auditory effects for generating an audio annotation for a video clip; an auditory effect selection module configured to receive a selection of one of a plurality of candidate auditory effects; and an auditory effects application module configured to present an audio clip generated based on the input with the selected auditory effect, the audio clip being a candidate for an audio annotation for the video clip.
In some embodiments, the apparatus 800 further comprises: a detection module configured to detect one or more non-speech segments in the video, the amount of speech contained in the non-speech segments being below a predetermined threshold; a time period determination module configured to determine one or more time periods based on the one or more non-speech segments.
In some embodiments, the video clip begins at the beginning of the first time period and ends at the beginning of the second time period immediately after the first time period in one or more time periods.
In some embodiments, presenting the indication of the one or more time periods of the video includes presenting the indication to the first user and the second user, the apparatus 800 further comprising: the inhibit selection module is configured to inhibit the second user from selecting the first time period in response to receiving a selection of the first time period of the one or more time periods from the first user.
In some embodiments, the visual information presentation module 820 is further configured to present one or more frames of the video clip; and/or present the video clip itself.
Fig. 9 illustrates a block diagram of an electronic device 900 in which one or more embodiments of the present disclosure may be implemented. It should be understood that the electronic device 900 illustrated in fig. 9 is merely exemplary and should not be construed as limiting the functionality and scope of the embodiments described herein. The electronic device 900 illustrated in fig. 9 may be used to implement the service device 110 and/or the user device 120 of fig. 1.
As shown in fig. 9, the electronic device 900 is in the form of a general-purpose electronic device. Components of electronic device 900 may include, but are not limited to, one or more processors or processing units 910, memory 920, storage 930, one or more communication units 940, one or more input devices 950, and one or more output devices 960. The processing unit 910 may be an actual or virtual processor and is capable of performing various processes according to programs stored in the memory 920. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capabilities of electronic device 900.
The electronic device 900 typically includes multiple computer storage media. Such media may be any available media accessible by the electronic device 900, including but not limited to volatile and non-volatile media, and removable and non-removable media. The memory 920 may be volatile memory (e.g., registers, cache, Random Access Memory (RAM)), non-volatile memory (e.g., Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory), or some combination thereof. The storage device 930 may be a removable or non-removable medium and may include machine-readable media such as flash drives, magnetic disks, or any other media that can store information and/or data (e.g., training data) and can be accessed within the electronic device 900.
The electronic device 900 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in fig. 9, a magnetic disk drive for reading from or writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data medium interfaces. Memory 920 may include a computer program product 925 having one or more program modules configured to perform the various methods or acts of the various embodiments of the disclosure.
The communication unit 940 enables communication with other electronic devices via a communication medium. Additionally, the functionality of the components of the electronic device 900 may be implemented in a single computing cluster or in multiple computing machines capable of communicating over a communications connection. Thus, the electronic device 900 may operate in a networked environment using logical connections to one or more other servers, a network Personal Computer (PC), or another network node.
The input device 950 may be one or more input devices such as a mouse, keyboard, trackball, etc. The output device 960 may be one or more output devices such as a display, speakers, printer, etc. The electronic device 900 may also communicate with one or more external devices (not shown), such as storage devices, display devices, etc., with one or more devices that enable a user to interact with the electronic device 900, or with any device (e.g., network card, modem, etc.) that enables the electronic device 900 to communicate with one or more other electronic devices, as desired, via the communication unit 940. Such communication may be performed via an input/output (I/O) interface (not shown).
According to an exemplary implementation of the present disclosure, a computer-readable storage medium having stored thereon computer-executable instructions, wherein the computer-executable instructions are executed by a processor to implement the method described above is provided. According to an exemplary implementation of the present disclosure, there is also provided a computer program product tangibly stored on a non-transitory computer-readable medium and comprising computer-executable instructions that are executed by a processor to implement the method described above.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, devices, and computer program products implemented according to the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of implementations of the present disclosure has been provided for illustrative purposes, is not exhaustive, and is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various implementations described. The terminology used herein was chosen in order to best explain the principles of each implementation, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand each implementation disclosed herein.

Claims (10)

1. A method of adding audio annotations to video, comprising:
presenting an indication of one or more time periods of the video, the one or more time periods determined as candidate time periods that can be used to add the audio annotation;
in response to receiving a selection of a first time period of the one or more time periods, presenting visual information of a video segment of the video associated with the first time period; and
receiving an input for the video segment, the input to be used to generate an audio annotation for the video segment.
2. The method of claim 1, further comprising:
converting text information in the input into voice information; and
generating the audio annotation for the video segment based at least in part on the voice information.
3. The method of claim 1, further comprising:
presenting an indication of a plurality of candidate auditory effects for generating the audio annotation for the video segment;
receiving a selection of one of the plurality of candidate auditory effects; and
presenting an audio clip generated based on the input with the selected auditory effect, the audio clip being a candidate for the audio annotation for the video segment.
4. The method of claim 1, further comprising:
detecting one or more non-speech segments in the video, the amount of speech contained in the non-speech segments being below a predetermined threshold; and
determining the one or more time periods based on the one or more non-speech segments.
5. The method of claim 1, wherein the video clip begins at a beginning of the first time period and ends at a beginning of a second time period immediately after the first time period in the one or more time periods.
6. The method of claim 1, wherein presenting an indication of one or more time periods of video comprises presenting the indication to a first user and a second user, the method further comprising:
in response to receiving the selection of the first time period of the one or more time periods from the first user, prohibiting the second user from selecting the first time period.
7. The method of claim 1, wherein presenting visual information of a video segment of the video associated with the first time period comprises:
presenting one or more frames of the video segment; or
presenting the video segment itself.
8. An apparatus for adding audio annotations to video, comprising:
a time period determination module configured to present an indication of one or more time periods of video, the one or more time periods determined as candidate time periods that can be used to add the audio;
a visual information presentation module configured to present visual information of a video segment of the video associated with a first time period of the one or more time periods in response to receiving a selection of the first time period; and
A receiving module configured to receive input for the video segment, the input to be used to generate an audio annotation for the video segment.
9. An electronic device, comprising:
at least one processing unit; and
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions when executed by the at least one processing unit cause the electronic device to perform the method of any one of claims 1-7.
10. A computer readable storage medium having stored thereon a computer program executable by a processor to implement the method of any of claims 1 to 7.
CN202310020575.9A 2023-01-06 2023-01-06 Method, device, equipment and storage medium for adding audio annotation to video Pending CN116010654A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310020575.9A CN116010654A (en) 2023-01-06 2023-01-06 Method, device, equipment and storage medium for adding audio annotation to video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310020575.9A CN116010654A (en) 2023-01-06 2023-01-06 Method, device, equipment and storage medium for adding audio annotation to video

Publications (1)

Publication Number Publication Date
CN116010654A (en)

Family

ID=86024536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310020575.9A Pending CN116010654A (en) 2023-01-06 2023-01-06 Method, device, equipment and storage medium for adding audio annotation to video

Country Status (1)

Country Link
CN (1) CN116010654A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination