WO2024093442A1 - Method and apparatus for checking audiovisual content, and device and storage medium - Google Patents

Method and apparatus for checking audiovisual content, and device and storage medium

Info

Publication number
WO2024093442A1
WO2024093442A1 (PCT/CN2023/113406)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
timeline
content
audio
speaker
Prior art date
Application number
PCT/CN2023/113406
Other languages
French (fr)
Chinese (zh)
Inventor
郑康
张鼎
李继超
刘敬晖
和君
李想
Original Assignee
北京字跳网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字跳网络技术有限公司
Publication of WO2024093442A1 publication Critical patent/WO2024093442A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/22Interactive procedures; Man-machine interfaces
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/07User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
    • H04L51/10Multimedia information
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems

Definitions

  • Example embodiments of the present disclosure relate generally to the field of computers, and more particularly to methods, devices, apparatuses, and computer-readable storage media for viewing audiovisual content.
  • With the development of computer technology, the Internet has become the main platform for people to obtain and share content.
  • For example, people can use the Internet to publish a variety of content, or receive content shared by other users.
  • In Internet-based content sharing, the sharing of audiovisual content (e.g., audio content or video content) has become one of the most prominent forms.
  • For example, people can use a player to play a speech or a video or audio recording of a meeting shared by other users. However, during such playback it is difficult for people to quickly locate the part of the video or audio recording that corresponds to a specific speaker.
  • In a first aspect of the present disclosure, a method for viewing audio-visual content is provided. The method includes: receiving a selection of a plurality of text segments, the plurality of text segments corresponding to a plurality of parts of target audio-visual content, the plurality of parts at least including a first part and a second part that are not continuous in the target audio-visual content; causing segment audio-visual content to be created based on at least the plurality of parts of the target audio-visual content, wherein the first part and the second part are continuous in the segment audio-visual content; and presenting a sharing entry for sharing the segment audio-visual content.
  • In a second aspect of the present disclosure, an apparatus for viewing audio-visual content is provided. The apparatus includes: a receiving module configured to receive a selection of multiple text segments, the multiple text segments corresponding to multiple parts of the target audio-visual content, the multiple parts at least including a first part and a second part that are discontinuous in the target audio-visual content; a control module configured to cause segment audio-visual content to be created based on at least the multiple parts of the target audio-visual content, wherein the first part and the second part are continuous in the segment audio-visual content; and a presentation module configured to present a sharing entry for sharing the segment audio-visual content.
  • In a third aspect of the present disclosure, an electronic device is provided. The device includes at least one processing unit, and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the device to perform the method of the first aspect.
  • In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. A computer program is stored on the medium, and the program, when executed by a processor, implements the method of the first aspect.
  • In a fifth aspect of the present disclosure, a playback system is provided. The playback system includes: a main timeline that at least indicates the current playback position of audio-visual content; and at least one speech timeline used to indicate the temporal distribution of the speech content of at least one speaker associated with the audio-visual content.
  • FIG. 1 shows a schematic diagram of a conventional audio-visual content player;
  • FIGS. 2A to 2C show schematic diagrams of example playback systems according to some embodiments of the present disclosure;
  • FIGS. 3A and 3B illustrate example viewing interfaces for audiovisual content according to some embodiments of the present disclosure;
  • FIGS. 4A and 4B are schematic diagrams showing the sharing of segment audio-visual content according to some embodiments of the present disclosure;
  • FIG. 5 illustrates a flow chart of an example process for viewing audiovisual content according to some embodiments of the present disclosure;
  • FIG. 6 shows a block diagram of an apparatus for viewing audiovisual content according to some embodiments of the present disclosure; and
  • FIG. 7 shows a block diagram of a device capable of implementing various embodiments of the present disclosure.
  • Figure 1 shows a schematic diagram of a traditional audio-visual content player 100. As shown in Figure 1, in the player 100, people usually need to drag the timeline control to locate the desired playback moment.
  • For example, the audiovisual content may be more than one hour long, which makes it difficult for the user to quickly locate the desired playback position using the timeline.
  • To this end, embodiments of the present disclosure provide a system for playing audio-visual content (audio content or video content).
  • The system may include a main timeline that at least indicates the current playback position of the audio-visual content.
  • The system may also include at least one speech timeline, which indicates the temporal distribution of the speech content of at least one speaker associated with the audio-visual content.
  • Embodiments of the present disclosure also provide a solution for viewing audio-visual content.
  • A viewing interface for the audio-visual content can be provided, wherein the viewing interface includes a playback control for playing the audio-visual content.
  • At least one speech timeline can be presented in the playback control, the at least one speech timeline indicating the temporal distribution of the speech content of at least one speaker associated with the audio-visual content.
  • In this way, embodiments of the present disclosure provide a speech timeline in the playback system or playback control, showing how the speech content of each speaker associated with the audio-visual content is distributed over time.
  • This makes it easier for users to locate and view the parts corresponding to a specific speaker, thereby improving the efficiency with which users obtain the desired content.
  • In addition, embodiments of the present disclosure can utilize the timeline to provide richer information about the audio-visual content.
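As an illustration only, the playback system and speech timeline described above could be captured by a small data model along the following lines. All type and field names in this sketch are hypothetical and are not prescribed by the disclosure.

```typescript
// Illustrative data model only; all names are hypothetical and not from the disclosure.

interface SpeechSegment {
  startSec: number; // segment start, in seconds from the beginning of the content
  endSec: number;   // segment end, in seconds
}

interface SpeechTimeline {
  speakerId: string;         // identifier of the speaker (e.g. terminal or account)
  speakerName: string;       // text identifier such as a user name or nickname
  avatarUrl?: string;        // optional graphic identifier
  segments: SpeechSegment[]; // temporal distribution of this speaker's speech content
}

interface PlaybackSystemState {
  durationSec: number;               // total duration of the audio-visual content
  positionSec: number;               // current playback position shown on the main timeline
  speechTimelines: SpeechTimeline[]; // one speech timeline per speaker
}
```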
  • FIG. 2A shows a schematic diagram 200A of an example playback system 205 according to some embodiments of the present disclosure.
  • the playback system 205 (also referred to as a player 205 or a playback control 205) can be used to play corresponding audio-visual content.
  • the playback system 205 can be provided by, for example, an appropriate electronic device, examples of which can include, but are not limited to, a desktop computer, a laptop computer, a smart phone, a tablet computer, a personal digital assistant, or a smart wearable device.
  • The audiovisual content may include audio and video files stored locally in the playback system 205, audio and video files stored in the cloud, or audio and video streams.
  • For example, it may include a playback stream of recorded audio-visual content (e.g., a conference recording), or a live stream of real-time audio-visual content.
  • the playback system 205 may include a main timeline 210.
  • the main timeline 210 may indicate the current playback position of the audio-visual content, that is, the playback progress.
  • the main timeline 210 may include a playback position indicator 215 to indicate the time point at which the audio-visual content is currently being played.
  • In the case where the length of the audio-visual content is fixed, the total length of the timeline can correspond to the total duration of the audio-visual content.
  • The playback position indicator 215 can then be placed according to the correspondence between position on the timeline and playback time.
  • Alternatively, the playback position indicator 215 may, for example, always be set to the far right of the main timeline 210.
  • the user may, for example, jump back to the corresponding time point by moving the play position indicator 215.
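As a minimal sketch of the correspondence between a position on the main timeline and a playback time mentioned above, the helpers below assume the timeline is rendered as a horizontal bar of a known pixel width; the function names are hypothetical and not taken from the disclosure.

```typescript
// Hypothetical helpers mapping between a pixel offset on the main timeline
// and a playback time, assuming a horizontal bar of widthPx pixels.

function offsetToTime(offsetPx: number, widthPx: number, durationSec: number): number {
  const clamped = Math.min(Math.max(offsetPx, 0), widthPx);
  return (clamped / widthPx) * durationSec;
}

function timeToOffset(timeSec: number, widthPx: number, durationSec: number): number {
  const clamped = Math.min(Math.max(timeSec, 0), durationSec);
  return (clamped / durationSec) * widthPx;
}

// Example: a click 150 px into a 600 px timeline over a 3600 s recording
// maps to offsetToTime(150, 600, 3600) === 900, i.e. the 15-minute mark.
```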
  • the main timeline 210 may also present graphic information corresponding to the audio waveform of the audio-visual content.
  • In this way, the user can more conveniently understand which parts of the audio-visual content are worth paying attention to, and which parts, for example those with weaker audio waveforms, can be temporarily ignored.
  • Such a playback system 205 can improve the efficiency with which users obtain content.
  • The audiovisual content may be, for example, recorded content of an online conference. Accordingly, as shown in FIG. 2A, the main timeline 210 may also present, for example, an interaction identifier 220 corresponding to an interaction behavior in the online conference.
  • Such an interaction identifier 220 may be placed at a corresponding position on the main timeline 210 to indicate that a corresponding interaction behavior occurred at the corresponding moment.
  • Different graphics of the interaction identifier 220 may correspond to different interaction behaviors.
  • For example, the main timeline 210 may include an interaction identifier 220 for indicating file sharing in the online conference.
  • The playback system 205 can, for example, guide the user to obtain descriptive information about the file sharing. For example, when the user hovers over the interaction identifier 220 with a mouse, the playback system 205 can indicate information about the shared file, such as the file name, format, size, and sharer, in a floating window. In another example, if the user clicks the interaction identifier 220, the playback system 205 can guide the user to obtain the content of the shared file, for example by guiding the user to jump to an online viewing interface for the file.
  • In addition, the main timeline 210 may include, for example, an interaction identifier 220 for indicating an online chat in the online conference.
  • The online chat herein refers to any appropriate chat based on text, emoticons, images, and/or audio, for example conducted using an instant messaging tool of the online conference.
  • The graphic representation of the interaction identifier 220 may be determined, for example, based on the content of the online chat.
  • Alternatively, the graphic representation of the interaction identifier 220 may be determined, for example, by a graphic identifier (e.g., an avatar) of a user participating in the chat.
  • The playback system 205 can guide the user to obtain descriptive information about the online chat. For example, when the user hovers over the interaction identifier 220 with a mouse, the playback system 205 can indicate information about the online chat, such as its participants and the chat content, in a floating window. In another example, if the user clicks the interaction identifier 220, the playback system 205 can guide the user to obtain the complete content of the chat, for example by guiding the user to jump to an interface for viewing the in-meeting chat content.
  • Further, the main timeline 210 may include, for example, an interaction identifier 220 for indicating comments in the online conference.
  • The comments here may include, for example, any appropriate comments based on text, emoticons, images, and/or audio.
  • A user's "like" may also be understood as a comment on the corresponding content.
  • The graphic representation of the interaction identifier 220 may be determined, for example, based on the content and/or type of the comment. For example, for an emoticon-based comment, the graphic representation of the interaction identifier 220 may be generated based on the emoticon.
  • When the user selects the interaction identifier 220, the playback system 205 may, for example, guide the user to obtain descriptive information about the comment. For example, when the user hovers over the interaction identifier 220 with a mouse, the playback system 205 may indicate information about the comment, such as the commenter, the comment time, and comment replies, in a floating window. In another example, if the user clicks the interaction identifier 220, the playback system 205 may guide the user to jump to an interface for viewing comments to obtain richer information about the comment.
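The hover and click behaviors described for the interaction identifiers (file sharing, online chat, comments) could be dispatched as sketched below. The marker type and callback signatures are assumptions introduced for illustration; they are not part of the disclosure.

```typescript
// Hypothetical interaction markers on the main timeline and their hover/click handling.

type InteractionKind = 'fileShare' | 'chat' | 'comment';

interface InteractionMarker {
  kind: InteractionKind;
  timeSec: number;   // position of the marker on the main timeline
  summary: string;   // e.g. file name and sharer, chat participants, or commenter and time
  detailUrl: string; // e.g. online file viewer, in-meeting chat record, or comment view
}

function onMarkerHover(marker: InteractionMarker, showTooltip: (text: string) => void): void {
  // Hovering shows descriptive information in a floating window.
  showTooltip(`${marker.kind}: ${marker.summary}`);
}

function onMarkerClick(marker: InteractionMarker, navigate: (url: string) => void): void {
  // Clicking guides the user to the full content (shared file, chat record, comments).
  navigate(marker.detailUrl);
}
```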
  • the main timeline 210 may also present speaker information indicating the temporal distribution of speech content of at least one speaker associated with the audiovisual content.
  • In this case, the main timeline 210 may also be regarded as a type of speech timeline.
  • For example, the main timeline 210 may assign a corresponding color mark to each speaker. Accordingly, the color distribution on the main timeline 210 may be used to indicate which one or more speakers each time period corresponds to. It should be understood that other appropriate styles may also be used on the main timeline 210 to indicate the temporal distribution of the speakers' speech content.
  • The playback system 205 may further include a viewing entry 230 for viewing the speech timelines.
  • The viewing entry 230 may indicate a graphic identifier (e.g., an avatar) of one or more speakers associated with the audiovisual content.
  • The playback system 205 may present, for example, a speech timeline 240-1 and a speech timeline 240-2 (individually or collectively referred to as speech timelines 240).
  • The speech timeline 240 may be used to indicate the temporal distribution of the speech content of at least one speaker associated with the audio-visual content. For example, if the speaker spoke at a given moment, the speech timeline 240 may be filled with a first graphic at that position; conversely, if the speaker did not speak at that moment, the speech timeline 240 may be filled with a second graphic. In this way, the user can intuitively see at which moments each speaker spoke.
  • The speech timeline 240 may also similarly present graphic information corresponding to the audio waveform of the part of the audio-visual content that corresponds to the speaker. In this way, the user can intuitively see at which moments the speaker was silent and at which moments the speaker spoke frequently. Such information further helps users quickly obtain the desired content.
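As a sketch of how the "first graphic"/"second graphic" fill described above might be derived, the helper below converts a speaker's (assumed non-overlapping) speech segments into alternating speaking and silent intervals. The names and structure are assumptions for illustration only.

```typescript
// Hypothetical conversion of a speaker's (non-overlapping) speech segments into
// alternating intervals for rendering: speaking intervals get the "first graphic",
// silent intervals get the "second graphic".

interface TimelineInterval {
  startSec: number;
  endSec: number;
  speaking: boolean;
}

function buildIntervals(
  segments: { startSec: number; endSec: number }[],
  durationSec: number,
): TimelineInterval[] {
  const sorted = [...segments].sort((a, b) => a.startSec - b.startSec);
  const out: TimelineInterval[] = [];
  let cursor = 0;
  for (const seg of sorted) {
    if (seg.startSec > cursor) {
      out.push({ startSec: cursor, endSec: seg.startSec, speaking: false });
    }
    out.push({ startSec: seg.startSec, endSec: seg.endSec, speaking: true });
    cursor = Math.max(cursor, seg.endSec);
  }
  if (cursor < durationSec) {
    out.push({ startSec: cursor, endSec: durationSec, speaking: false });
  }
  return out;
}
```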
  • the number of speech timelines 240 may be determined based on the number of speakers participating in the audiovisual content. In some embodiments, the number of such speakers may be determined by the number of terminals participating in the online conference. For example, multiple conference participants may access the online conference through the same terminal (or use the same account), and such multiple participants may be identified as the same speaker, although they may include multiple different speakers.
  • the number of such speakers may be determined based on the number of speakers in the audiovisual content. It should be understood that any appropriate speaker recognition technology may be used to determine the corresponding speakers in the audiovisual content, and the present disclosure is not intended to be limited thereto.
  • the playback system 205 may present a speech timeline 240 corresponding to all speakers of the audio-visual content.
  • the audio-visual content may include two speakers (“speaker 1” and “speaker 2”). Accordingly, the presentation order of the corresponding speech timeline 240-1 and speech timeline 240-2 in the playback system 205 may be determined based on the information of the speakers.
  • For example, the presentation order of the speech timelines may be determined based on the text identifiers of the speakers, such as their user names or nicknames; for instance, the speech timelines may be presented in the order of the speakers' text identifiers.
  • Alternatively, the presentation order of the speech timelines can be determined based on the proportion of each speaker's speech content. For example, if the proportion of speech content of "Speaker 1" reaches "70%", which is greater than the "30%" proportion of "Speaker 2", then the speech timeline 240-1 can be presented ahead of the speech timeline 240-2.
  • The presentation order of the speech timelines can also be determined based on the start time of each speaker's speech content.
  • For example, the start time of the speech content of "Speaker 1" may be the first minute after the meeting starts, which is earlier than the start time of the speech content of "Speaker 2", for example the third minute after the meeting starts. Accordingly, the speech timeline 240-1 can be presented ahead of the speech timeline 240-2.
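The three ordering criteria above (text identifier, proportion of speech content, start time) could be implemented as a simple sort, as sketched below with hypothetical names that are not taken from the disclosure.

```typescript
// Hypothetical ordering of speech timelines by speaker name, share of speech
// content, or start time of the first speech, as discussed above.

interface SpeakerStats {
  name: string;          // text identifier (user name or nickname)
  speechRatio: number;   // proportion of speech content, e.g. 0.7 for 70%
  firstStartSec: number; // start time of this speaker's first speech
}

type OrderBy = 'name' | 'ratio' | 'start';

function orderTimelines(speakers: SpeakerStats[], by: OrderBy): SpeakerStats[] {
  const copy = [...speakers];
  switch (by) {
    case 'name':
      return copy.sort((a, b) => a.name.localeCompare(b.name));
    case 'ratio':
      return copy.sort((a, b) => b.speechRatio - a.speechRatio); // larger share first
    default:
      return copy.sort((a, b) => a.firstStartSec - b.firstStartSec); // earlier start first
  }
}

// Example: with ratios 0.7 ("Speaker 1") and 0.3 ("Speaker 2"),
// orderTimelines(speakers, 'ratio') places "Speaker 1" first.
```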
  • the playback system 205 may also present the description information of the corresponding speaker in association with the speech timeline 240.
  • the speech timeline 240-1 may have a text identifier (e.g., a user name or nickname) of the corresponding speaker.
  • the speech timeline 240-1 may also have a graphic identifier (e.g., an avatar) of the corresponding speaker.
  • the playback system 205 may also present the percentage information of the speech content of the corresponding speaker in association with at least one speech timeline.
  • the speech timeline 240-1 may include the percentage "XX%" of the speech content of "speaker 1".
  • the speech timeline 240 may also present an interaction identifier (not shown in FIG. 2B ) for indicating an interaction behavior associated with a corresponding speaker in the online conference.
  • Such interaction behaviors refer to interaction behaviors in which the corresponding speaker participated, such as the file sharing, online chatting, or commenting behaviors discussed above.
  • The interaction logic of the interaction identifiers presented on the speech timelines 240 may be similar to that of the interaction identifiers 220 discussed above, and is not described in detail here.
  • The speech timeline 240-1 may also be automatically collapsed or expanded in response to the user's selection of the viewing entry 230.
  • Alternatively, the playback system may always provide the speech timelines of all speakers by default, regardless of the selection of the viewing entry 230.
  • The playback system 205 may also provide a search entry 250 for the speech timelines. Using the search entry 250, a user may initiate a viewing request associated with a specific speaker.
  • For example, the playback system 205 may present visual elements for all speakers associated with the audio-visual content.
  • Such visual elements may include, for example, a text identifier of each speaker (e.g., a user name or nickname) or a graphic identifier (e.g., an avatar).
  • The playback system 205 may receive a user's selection of a specific visual element from among the multiple visual elements, to determine that the user wishes to view the speech timeline of the speaker corresponding to the selected visual element. For example, the user may click the avatar of "Speaker 1", so that the playback system 205 presents only the speech timeline 240-1 corresponding to "Speaker 1" and not the speech timeline 240-2.
  • Alternatively, the user may provide input indicating the target speaker via the search entry 250.
  • For example, the user may enter at least part of the nickname or user name of "Speaker 1" to automatically match "Speaker 1" and cause the playback system 205 to present the corresponding speech timeline 240-1 instead of the speech timeline 240-2.
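A possible way to match the input provided via the search entry against speaker names, so that only the matching speech timelines remain visible, is sketched below; the function and field names are hypothetical.

```typescript
// Hypothetical matching of the search entry's input against speaker names,
// keeping only the speech timelines of speakers whose name contains the query.

function matchSpeakers<T extends { speakerName: string }>(timelines: T[], query: string): T[] {
  const q = query.trim().toLowerCase();
  if (q.length === 0) {
    return timelines; // empty query: keep all speech timelines
  }
  return timelines.filter(t => t.speakerName.toLowerCase().includes(q));
}

// Example: matchSpeakers(timelines, "er 1") keeps "Speaker 1" but not "Speaker 2".
```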
  • In some embodiments, the search entry 250 may be provided independently of the viewing entry 230.
  • In that case, the playback system 205 provides the search entry 250 for viewing a specific speaker regardless of whether the viewing entry 230 has been selected.
  • Alternatively, the search entry 250 may be provided dependent on the viewing entry 230. That is, the search entry 250 is provided for quickly filtering or locating a specific speech timeline only when the viewing entry 230 is activated and the speech timelines of all speakers are presented.
  • The speech timelines 240 may also support various types of user interaction. For example, as shown in FIG. 2C, the user may click on a position 260 in the speech timeline 240-1 to indicate that the audiovisual content is expected to be played starting from that position.
  • the playback system 205 can play the audiovisual content from the time point 270 corresponding to the position 260.
  • the playback system 205 can play the audiovisual content continuously from the time point 270. For example, if the time point 270 is the moment "5 minutes and 30 seconds", the audiovisual content will be played continuously from "5 minutes and 30 seconds" until the end.
  • the playback system 205 may also play part of the audiovisual content corresponding to "Speaker 1" from time point 270. That is, the playback system 205 may only play part of the audiovisual content of "Speaker 1" corresponding to the speech timeline 240-1, and play it from time point 270, thereby achieving the effect of only listening to a specific speaker.
  • the playback system 205 can make the partial audio-visual content corresponding to "Speaker 1" play from the beginning, that is, only play the partial audio-visual content corresponding to "Speaker 1" in the audio-visual content.
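The two playback modes described above (continuous playback from the selected time point, or playback of only the selected speaker's parts) could look roughly like the sketch below. The Player interface is an assumption, and playRange is taken to queue each interval for sequential playback; none of these names come from the disclosure.

```typescript
// Hypothetical player interface; playRange is assumed to queue each interval
// for sequential playback.

interface Player {
  seek(sec: number): void;                           // jump to a time point and keep playing
  playRange(startSec: number, endSec: number): void; // queue one interval for playback
}

// Continuous playback from the selected time point until the end.
function playFrom(player: Player, timeSec: number): void {
  player.seek(timeSec);
}

// Play only the selected speaker's segments, starting from the time point.
function playSpeakerOnlyFrom(
  player: Player,
  segments: { startSec: number; endSec: number }[],
  timeSec: number,
): void {
  for (const seg of segments) {
    if (seg.endSec <= timeSec) {
      continue; // segment lies entirely before the selected time point
    }
    player.playRange(Math.max(seg.startSec, timeSec), seg.endSec);
  }
}
```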
  • the various features discussed above can be provided independently or in a combination different from that shown in FIGS. 2A to 2C .
  • For example, the timeline of the playback system can have a graphical style similar to the timeline of a conventional playback system, and does not necessarily have to indicate an audio waveform.
  • the playback system 205 can also be used to play real-time audio-visual content (e.g., live audio and video streams).
  • the speech timeline discussed above can be used to indicate the temporal distribution of the historical speech content of at least one speaker associated with the historical portion of the real-time audio-visual content.
  • For example, the speech timeline can graphically present the temporal distribution of the historical speech content of each speaker from the start of the live broadcast to the current moment.
  • the embodiments of the present disclosure may also provide a viewing interface for audio-visual content.
  • a viewing interface may be, for example, a playback interface for recorded content, or a live interface for real-time content.
  • the following uses the "meeting minutes" scenario as an example of viewing audio-visual content, but such a scenario is only exemplary, and the embodiments of the present disclosure may also be applied to other appropriate scenarios.
  • FIG. 3A shows an example viewing interface 300 according to some embodiments of the present disclosure.
  • the viewing interface 300 may include a playback control 310.
  • the playback control 310 may be implemented, for example, using the playback system 205 discussed above.
  • The playback control 310 may include, for example, a main timeline 312 and speech timelines 314-1 and 314-2 (individually or collectively referred to as speech timelines 314).
  • the viewing interface 300 further includes a text control 320 for presenting text content corresponding to the audio-visual content.
  • the text content may be generated based on the audio of the audio-visual content. Taking the audio-visual content as a meeting record as an example, the text content may be generated based on voice recognition of the audio of each speaker in the meeting. Taking the audio-visual content as a real-time live broadcast content as an example, the text content may be generated based on real-time voice recognition of each speaker.
  • The user may select the speech timeline 314-1, for example, and accordingly the text content 322 corresponding to "Speaker 1" may be adjusted to be highlighted in the text control 320 relative to the other text content 324 of other speakers.
  • Making the text content 322 highlighted relative to the other text content 324 may include, for example, increasing the prominence of the text content 322 displayed in the text control 320.
  • For example, the display style (e.g., text color, background color, boldness, font size, underline) of the text content 322 may be adjusted.
  • For example, the text content 322 may be bolded or highlighted.
  • Making the text content 322 highlighted relative to the text content 324 may also include, for example, reducing the prominence of the other text content 324 displayed in the text control.
  • For example, the display style (e.g., text color, background color, boldness, font size, underline) of the other text content 324 may be adjusted.
  • For example, the text color of the other text content 324 may be changed to gray, thereby forming a contrast with the black text content 322.
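The increase and decrease of prominence described above could be implemented, for example, by restyling the transcript paragraphs in the text control. The sketch below uses plain DOM styling with hypothetical names; it is one possible approach, not the disclosed implementation.

```typescript
// Hypothetical restyling of transcript paragraphs in the text control:
// paragraphs of the selected speaker are emphasised, others are dimmed.

interface TranscriptParagraph {
  speakerId: string;
  element: HTMLElement; // DOM node rendering this paragraph in the text control
}

function highlightSpeaker(paragraphs: TranscriptParagraph[], selectedSpeakerId: string): void {
  for (const p of paragraphs) {
    if (p.speakerId === selectedSpeakerId) {
      p.element.style.fontWeight = 'bold'; // increase prominence
      p.element.style.color = '#000000';
    } else {
      p.element.style.fontWeight = 'normal'; // reduce prominence
      p.element.style.color = '#999999';
    }
  }
}
```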
  • the user may also select a specific position in the speech timeline to trigger the audiovisual content to be played starting from the corresponding moment.
  • the text content corresponding to the specific position may also be adjusted to be highlighted in the text control 320.
  • the text content presented in the text control 320 always corresponds to the moment at which the audiovisual content is currently playing.
  • For example, the text content of the segment corresponding to that moment (e.g., the text corresponding to a certain paragraph spoken by the speaker) may be presented in the text control 320.
  • The display style of that segment of text content can also be adjusted to highlight it. For example, one or more words corresponding to the current time point can be highlighted.
  • embodiments of the present disclosure may also support sharing of audiovisual content segments based on speech timelines.
  • For example, a user may select one or more speech timelines (e.g., the speech timeline 430-1) among the multiple speech timelines in the playback control 410 (or playback system 410) for sharing.
  • Accordingly, segment audio-visual content corresponding to the speech timeline 430-1 can be generated for sharing.
  • For example, the entire speech content of "Speaker 1" can be used to generate independent audio-visual content, for example for sharing with other users or organizations.
  • The user can also select one or more time segments in the speech timelines 430-1 and 430-2, for example a time segment 440-1, a time segment 440-2, and a time segment 440-3. Accordingly, after the user clicks the sharing entry 420 (i.e., sends a sharing request), multiple discrete audio-visual content segments corresponding to the time segments 440-1, 440-2, and 440-3 can be combined to generate independent segment audio-visual content, for example for sharing with other users or organizations.
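One way to assemble the selected, possibly discontinuous time segments into a single clip specification for sharing is sketched below; the actual extraction and concatenation of the media is outside this sketch, and all names are hypothetical.

```typescript
// Hypothetical assembly of selected, possibly discontinuous time segments into a
// single ordered clip specification; the segments are sorted and overlaps merged,
// so that discontinuous parts become adjacent in the resulting clip.

interface TimeSegment {
  startSec: number;
  endSec: number;
}

function buildClipSpec(selected: TimeSegment[]): TimeSegment[] {
  const sorted = [...selected]
    .filter(s => s.endSec > s.startSec)
    .sort((a, b) => a.startSec - b.startSec);
  const merged: TimeSegment[] = [];
  for (const seg of sorted) {
    const last = merged[merged.length - 1];
    if (last !== undefined && seg.startSec <= last.endSec) {
      last.endSec = Math.max(last.endSec, seg.endSec); // merge overlapping selections
    } else {
      merged.push({ ...seg });
    }
  }
  return merged;
}
```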
  • In this way, the embodiments of the present disclosure enable users to share audio-visual content more efficiently by selecting a speech timeline or time segments, thereby improving the efficiency of audio-visual content sharing and the efficiency of information acquisition by the recipient.
  • In addition, the embodiments of the present disclosure also allow users to select non-continuous segments to create segment audio-visual content, which further improves the flexibility of sharing audio-visual content segments.
  • FIG. 5 shows a flow chart of an example process 500 for viewing audiovisual content according to some embodiments of the present disclosure.
  • Process 500 may be implemented at a suitable electronic device. Examples of such electronic devices may include, but are not limited to, desktop computers, laptop computers, smart phones, tablet computers, personal digital assistants, or smart wearable devices.
  • the electronic device provides a viewing interface for audio-visual content, where the viewing interface includes a play control for playing the audio-visual content.
  • the electronic device presents at least one speech timeline in the playback control, where the at least one speech timeline is used to indicate the temporal distribution of speech content of at least one speaker associated with the audio-visual content.
  • the viewing interface further includes a text control for presenting text content corresponding to the audiovisual content, the text content being generated based on the audio of the audiovisual content.
  • the method also includes: in response to selection of a first speech timeline in at least one speech timeline, causing first text content corresponding to a first speaker in the text content to be highlighted in the text control relative to second text content of other speakers, wherein the first speech timeline corresponds to the first speaker.
  • In some embodiments, causing the first text content corresponding to the first speaker to be highlighted in the text control relative to the second text content of other speakers includes at least one of the following: increasing the prominence of the first text content displayed in the text control; and/or reducing the prominence of the second text content displayed in the text control.
  • the method further includes: receiving a selection of a first position in a first speech timeline of at least one speech timeline; and causing text content in the text content corresponding to the first position to be highlighted in the text control.
  • presenting at least one speech timeline in the playback control includes: presenting a viewing entry for viewing the speech timeline in the playback control; and presenting at least one speech timeline in the playback control in response to a selection of the viewing entry.
  • In some embodiments, the at least one speech timeline includes multiple speech timelines.
  • The presentation order of the multiple speech timelines in the playback control is determined based on at least one of the following: the text identifiers of the multiple speakers corresponding to the multiple speech timelines, the proportion of the speech content of the multiple speakers, or the start time of the speech content of the multiple speakers.
  • presenting at least one speech timeline in a playback control includes: receiving a viewing request associated with a target speaker; and presenting a target speech timeline corresponding to the target speaker, the target speech timeline being used to indicate the temporal distribution of the target speech content of the target speaker.
  • receiving a viewing request associated with a target speaker includes: presenting multiple visual elements associated with multiple speakers associated with audio-visual content; and receiving a viewing request associated with the target speaker based on a preset operation of a target visual element corresponding to the target speaker among the multiple visual elements.
  • receiving a view request associated with the target speaker includes receiving a view request associated with the target speaker based on an input indicating the target speaker.
  • the method further includes: receiving a selection of a second position in a first speech timeline of at least one speech timeline; and causing a corresponding portion of the audiovisual content to be played from a time point corresponding to the second position.
  • the first speech timeline corresponds to the first speaker
  • causing at least part of the audio-visual content to be played from a time point corresponding to the second position includes: causing the audio-visual content to be played continuously from the time point; or causing part of the audio-visual content corresponding to the first speaker in the audio-visual content to be played from the time point.
  • the method further comprises: presenting description information of the corresponding speaker in association with at least one speech timeline, the description information being generated based on a text identifier and/or a graphic identifier of the speaker.
  • the method further includes: presenting the proportion information of the speech content of the corresponding speaker in association with at least one speech timeline.
  • the playback controls further include a main timeline for presenting graphical information corresponding to an audio waveform of the audiovisual content.
  • the audiovisual content is an audiovisual recording of an online meeting
  • the playback control further includes a main timeline, which is used to present a first interaction identifier corresponding to a first interaction behavior in the online meeting.
  • The at least one speech timeline also presents a second interaction identifier for indicating a second interaction behavior associated with the corresponding speaker in the online meeting.
  • the first interactive behavior and/or the second interactive behavior includes at least one of the following: file sharing, online chatting, and commenting.
  • the method further includes: in response to a first selection of the first interaction identifier, presenting first description information for the first interaction behavior; and/or in response to a second selection of the second interaction identifier, presenting second description information for the second interaction behavior.
  • the method further includes: receiving a selection of at least one time segment in at least one speech timeline; and based on a first sharing request associated with the at least one time segment, generating a first segment of audiovisual content corresponding to the at least one time segment for sharing.
  • the method also includes: receiving a selection of a group of speech timelines in at least one speech timeline, the group of timelines including one or more speech timelines; and based on a second sharing request associated with the group of timelines, causing a second segment of audio-visual content corresponding to the group of speech timelines to be generated for sharing.
  • the audiovisual content includes real-time audiovisual content
  • at least one speech timeline is used to indicate the temporal distribution of historical speech content of at least one speaker associated with the historical portion of the real-time audiovisual content.
  • Fig. 6 shows a schematic structural block diagram of a device 600 for viewing audio-visual content according to some embodiments of the present disclosure.
  • the apparatus 600 includes a providing module 610 configured to provide a viewing interface for audio-visual content, where the viewing interface includes a playback control for playing the audio-visual content.
  • the apparatus 600 further includes a presentation module 620 configured to present at least one speech timeline in the playback control, wherein the at least one speech timeline is used to indicate the temporal distribution of speech content of at least one speaker associated with the audio-visual content.
  • the viewing interface further includes a text control for presenting text content corresponding to the audiovisual content, the text content being generated based on the audio of the audiovisual content.
  • the presentation module 620 is further configured to: in response to selection of a first speech timeline in at least one speech timeline, cause first text content corresponding to a first speaker in the text content to be highlighted in the text control relative to second text content of other speakers, wherein the first speech timeline corresponds to the first speaker.
  • In some embodiments, causing the first text content corresponding to the first speaker to be highlighted in the text control relative to the second text content of other speakers includes at least one of the following: increasing the prominence of the first text content displayed in the text control; and/or reducing the prominence of the second text content displayed in the text control.
  • the presentation module 620 is further configured to: receive a selection of a first position in a first speech timeline of at least one speech timeline; and highlight the text content corresponding to the first position in the text content in the text control.
  • the presentation module 620 is further configured to: present a viewing entry for viewing the speech timeline in the playback control; and present at least one speech timeline in the playback control in response to a selection of the viewing entry.
  • In some embodiments, the at least one speech timeline includes multiple speech timelines.
  • the presentation order of the multiple speech timelines in the playback control is determined based on at least one of the following: text identifiers of the multiple speakers corresponding to the multiple speech timelines, the proportion of the speech content of the multiple speakers, or the start time of the speech content of the multiple speakers.
  • the presentation module 620 is further configured to: receive a viewing request associated with a target speaker; and present a target speech timeline corresponding to the target speaker, the target speech timeline being used to indicate the temporal distribution of the target speech content of the target speaker.
  • the presentation module 620 is further configured to: present multiple visual elements associated with multiple speakers associated with the audio-visual content; and receive a viewing request associated with a target speaker based on a preset operation of a target visual element corresponding to the target speaker among the multiple visual elements.
  • presentation module 620 is further configured to: based on the input indicating the target speaker, receive a viewing request associated with the target speaker.
  • the presentation module 620 is further configured to: receive a selection of a second position in a first speech timeline of at least one speech timeline; and cause the corresponding portion of the audiovisual content to be played from a time point corresponding to the second position.
  • the first speech timeline corresponds to the first speaker
  • the presentation module 620 is further configured to: cause the audiovisual content to be played continuously from the time point; or cause the part of the audiovisual content corresponding to the first speaker to be played from the time point.
  • the presentation module 620 is further configured to present description information of the corresponding speaker in association with at least one speech timeline, where the description information is generated based on a text identifier and/or a graphic identifier of the speaker.
  • the presentation module 620 is further configured to present the proportion information of the speech content of the corresponding speaker in association with at least one speech timeline.
  • the playback controls further include a main timeline for presenting graphical information corresponding to an audio waveform of the audiovisual content.
  • the audiovisual content is an audiovisual recording of an online meeting
  • the playback control further includes a main timeline, which is used to present a first interaction identifier corresponding to a first interaction behavior in the online meeting.
  • At least one speech timeline further presents a second interaction identifier for indicating a second interaction behavior associated with the corresponding speaker in the online conference.
  • the first interactive behavior and/or the second interactive behavior includes at least one of the following: file sharing, online chatting, and commenting.
  • the presentation module 620 is further configured to: present first description information for a first interaction behavior in response to a first selection of a first interaction identifier; and/or present second description information for a second interaction behavior in response to a second selection of a second interaction identifier.
  • the presentation module 620 is further configured to: receive a selection of at least one time segment in at least one speech timeline; and based on a first sharing request associated with the at least one time segment, generate a first segment of audio-visual content corresponding to the at least one time segment for sharing.
  • the presentation module 620 is further configured to: receive a selection of a group of speech timelines in the at least one speech timeline, the group of timelines including one or more speech timelines; and based on a second sharing request associated with the group of timelines, cause a second segment of audio-visual content corresponding to the group of speech timelines to be generated for sharing.
  • the audiovisual content includes real-time audiovisual content
  • at least one speech timeline is used to indicate the temporal distribution of historical speech content of at least one speaker associated with the historical portion of the real-time audiovisual content.
  • the units included in the device 600 can be implemented in various ways, including software, hardware, firmware, or any combination thereof.
  • one or more units can be implemented using software and/or firmware, such as machine executable instructions stored on a storage medium.
  • some or all of the units in the device 600 can be implemented at least in part by one or more hardware logic components.
  • exemplary types of hardware logic components include field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), and the like.
  • Figure 7 shows a block diagram of a computing device/server 700 in which one or more embodiments of the present disclosure may be implemented. It should be understood that the computing device/server 700 shown in Figure 7 is merely exemplary and should not constitute any limitation on the functionality and scope of the embodiments described herein.
  • computing device/server 700 is in the form of a general computing device.
  • the components of computing device/server 700 may include, but are not limited to, one or more processors or processing units 710, memory 720, storage device 730, one or more communication units 740, one or more input devices 750, and one or more output devices 760.
  • Processing unit 710 may be an actual or virtual processor and is capable of performing various processes according to a program stored in memory 720. In a multi-processor system, multiple processing units execute computer executable instructions in parallel to improve the parallel processing capabilities of computing device/server 700.
  • the computing device/server 700 typically includes a plurality of computer storage media. Such media can be any available media accessible to the computing device/server 700, including but not limited to volatile and nonvolatile media, removable and non-removable media.
  • the memory 720 can be a volatile memory (e.g., registers, cache, random access memory (RAM)), a non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof.
  • Storage device 730 may be removable or non-removable media and may include machine-readable media such as a flash drive, a disk, or any other media that may be capable of storing information and/or data (e.g., training data for training) and may be accessed within computing device/server 700.
  • the computing device/server 700 may further include additional removable/non-removable, volatile/non-volatile storage media.
  • For example, a disk drive for reading from or writing to a removable, non-volatile disk (e.g., a "floppy disk") may be provided.
  • Likewise, an optical drive for reading from or writing to a removable, non-volatile optical disk may be provided.
  • each drive may be connected to a bus (not shown) by one or more data media interfaces.
  • the memory 720 may include a computer program product 725 having one or more program modules configured to perform various methods or actions of various embodiments of the present disclosure.
  • the communication unit 740 enables communication with other computing devices via a communication medium. Additionally, the functions of the components of the computing device/server 700 can be implemented in a single computing cluster or multiple computing machines that can communicate via a communication connection. Thus, the computing device/server 700 can operate in a networked environment using a logical connection with one or more other servers, a network personal computer (PC), or another network node.
  • Input device 750 may be one or more input devices, such as a mouse, keyboard, trackball, etc.
  • Output device 760 may be one or more output devices, such as a display, speaker, printer, etc.
  • Computing device/server 700 may also communicate with one or more external devices (not shown) as needed, such as storage devices, display devices, etc., with one or more devices that enable users to interact with computing device/server 700, or with any device that enables computing device/server 700 to communicate with one or more other computing devices (e.g., network card, modem, etc.) through communication unit 740. Such communication may be performed via an input/output (I/O) interface (not shown).
  • a computer-readable storage medium on which one or more computer instructions are stored, wherein the one or more computer instructions are executed by a processor to implement the method described above.
  • These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing device, thereby producing a machine, so that when these instructions are executed by the processing unit of the computer or other programmable data processing device, a device that implements the functions/actions specified in one or more boxes in the flowchart and/or block diagram is generated.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium, and these instructions cause the computer, programmable data processing device, and/or other equipment to work in a specific manner, so that the computer-readable medium storing the instructions includes a manufactured product, which includes instructions for implementing various aspects of the functions/actions specified in one or more boxes in the flowchart and/or block diagram.
  • Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions executed on the computer, other programmable data processing apparatus, or other device to implement the functions/actions specified in one or more boxes in the flowchart and/or block diagram.
  • Each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function.
  • In some alternative implementations, the functions noted in the blocks may also occur in an order different from that noted in the figures. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved.
  • Each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

According to embodiments of the present disclosure, a method and apparatus for viewing audiovisual content, a device, and a storage medium are provided. The method comprises: providing a viewing interface for audiovisual content, wherein the viewing interface comprises a playback control for playing the audiovisual content; and presenting at least one speech timeline in the playback control, wherein the at least one speech timeline is used for indicating the temporal distribution of the speech content of at least one speaker associated with the audiovisual content. In this way, embodiments of the present disclosure help a user to obtain information about the distribution of each speaker's speech time in the audiovisual content, thereby helping the user to obtain the required information more conveniently.

Description

用于查看视听内容的方法、装置、设备和存储介质Method, apparatus, device and storage medium for viewing audiovisual content
本申请要求2022年10月31日递交的、标题为“用于查看视听内容的方法、装置、设备和存储介质”、申请号为202211352393.3的中国发明专利申请的优先权,该申请的全部内容通过引用结合在本申请中。This application claims priority to the Chinese invention patent application entitled “Methods, devices, equipment and storage media for viewing audiovisual content” filed on October 31, 2022 and application number 202211352393.3, the entire contents of which are incorporated by reference into this application.
技术领域Technical Field
本公开的示例实施例总体涉及计算机领域,特别地涉及用于查看视听内容的方法、装置、设备和计算机可读存储介质。Example embodiments of the present disclosure relate generally to the field of computers, and more particularly to methods, devices, apparatuses, and computer-readable storage media for viewing audiovisual content.
背景技术Background technique
随着计算机技术的发展,互联网已经成为人们获取和分享内容的主要平台。例如,人们可以利用互联网来发布各式各样的内容,或者接收其它用户分享的内容。With the development of computer technology, the Internet has become the main platform for people to obtain and share content. For example, people can use the Internet to publish a variety of content, or receive content shared by other users.
在基于互联网的内容分享中,视听内容(例如,音频内容或视频内容)的分享已经成为最主要的形式之一。人们例如可以利用播放器来播放其它用户所分享的一段演讲或者某个会议的视频或音频记录。然而,在这样的播放过程中,人们难以快速地定位某个特定的发言方在这样的视频或音频记录中的对应部分。In the content sharing based on the Internet, the sharing of audiovisual content (e.g., audio content or video content) has become one of the most important forms. For example, people can use a player to play a speech or a video or audio recording of a meeting shared by other users. However, during such a playback process, it is difficult for people to quickly locate the corresponding part of a specific speaker in such a video or audio recording.
发明内容Summary of the invention
In a first aspect of the present disclosure, a method for viewing audiovisual content is provided. The method includes: receiving a selection of a plurality of text segments, the plurality of text segments corresponding to a plurality of portions of target audiovisual content, the plurality of portions including at least a first portion and a second portion that are discontinuous in the target audiovisual content; causing segment audiovisual content to be created based at least on the plurality of portions of the target audiovisual content, wherein the first portion and the second portion are continuous in the segment audiovisual content; and presenting a sharing entry for sharing the segment audiovisual content.
In a second aspect of the present disclosure, an apparatus for viewing audiovisual content is provided. The apparatus includes: a receiving module configured to receive a selection of a plurality of text segments, the plurality of text segments corresponding to a plurality of portions of target audiovisual content, the plurality of portions including at least a first portion and a second portion that are discontinuous in the target audiovisual content; a control module configured to cause segment audiovisual content to be created based at least on the plurality of portions of the target audiovisual content, wherein the first portion and the second portion are continuous in the segment audiovisual content; and a presentation module configured to present a sharing entry for sharing the segment audiovisual content.
In a third aspect of the present disclosure, an electronic device is provided. The device includes at least one processing unit and at least one memory, the at least one memory being coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the device to perform the method of the first aspect.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. A computer program is stored on the medium, and the program, when executed by a processor, implements the method of the first aspect.
In a fifth aspect of the present disclosure, a playback system is provided. The playback system includes: a main timeline that indicates at least a current playback position of audiovisual content; and at least one speech timeline that indicates the temporal distribution of the speech content of at least one speaker associated with the audiovisual content.
It should be understood that the content described in this Summary is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements, in which:
FIG. 1 shows a schematic diagram of a conventional audiovisual content player;
FIGS. 2A to 2C show schematic diagrams of example playback systems according to some embodiments of the present disclosure;
FIGS. 3A and 3B show example viewing interfaces for audiovisual content according to some embodiments of the present disclosure;
FIGS. 4A and 4B show schematic diagrams of sharing segment audiovisual content according to some embodiments of the present disclosure;
FIG. 5 shows a flowchart of an example process for viewing audiovisual content according to some embodiments of the present disclosure;
FIG. 6 shows a block diagram of an apparatus for viewing audiovisual content according to some embodiments of the present disclosure; and
FIG. 7 shows a block diagram of a device capable of implementing various embodiments of the present disclosure.
DETAILED DESCRIPTION
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
In the description of the embodiments of the present disclosure, the term "including" and similar terms should be understood as open-ended inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". Other explicit and implicit definitions may also be included below.
As discussed above, people can use a player to obtain audiovisual content. FIG. 1 shows a schematic diagram of a conventional audiovisual content player 100. As shown in FIG. 1, in the player 100 a user typically has to drag a timeline control to locate the moment at which playback is desired.
Such playback control, however, is inefficient. In the example of FIG. 1, for instance, the audiovisual content is more than one hour long, which makes it difficult for the user to quickly locate the desired playback position on the timeline.
This is especially pronounced when reviewing audiovisual content such as meetings, lectures, or online classes. Such scenarios usually involve multiple speakers, and users may wish to quickly locate the portions in which a particular speaker is speaking.
Embodiments of the present disclosure provide a playback system for audiovisual content (audio content or video content). The system may include a main timeline that indicates at least the current playback position of the audiovisual content. In addition, the system may include at least one speech timeline that indicates the temporal distribution of the speech content of at least one speaker associated with the audiovisual content.
Embodiments of the present disclosure further provide a solution for viewing audiovisual content. According to the solution, a viewing interface for the audiovisual content may be provided, the viewing interface including a playback control for playing the audiovisual content. Further, at least one speech timeline may be presented in the playback control, the at least one speech timeline indicating the temporal distribution of the speech content of at least one speaker associated with the audiovisual content.
In this way, embodiments of the present disclosure can provide speech timelines in the playback system or playback control, showing how the speech content of the speakers associated with the audiovisual content is distributed over time. Implementations of the present disclosure thus make it convenient for users to view the portions corresponding to a particular speaker, thereby improving the efficiency with which users obtain the content they want.
Example solutions according to embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Example Playback System
In some embodiments, timelines can be used to provide richer information about the audiovisual content.
FIG. 2A shows a schematic diagram 200A of an example playback system 205 according to some embodiments of the present disclosure. As shown in FIG. 2A, the playback system 205 (also referred to as the player 205 or the playback control 205) may be used to play corresponding audiovisual content. The playback system 205 may be provided by a suitable electronic device; examples of such electronic devices include, but are not limited to, desktop computers, laptop computers, smartphones, tablet computers, personal digital assistants, and smart wearable devices.
In some embodiments, the audiovisual content may include audio/video files local to the playback system 205, audio/video files stored in the cloud, or audio/video streams. Such a stream may be, for example, a playback stream of already recorded audiovisual content (for example, a meeting recording), or a live stream of live audiovisual content.
Main Timeline
As shown in FIG. 2A, the playback system 205 may include a main timeline 210. In some embodiments, the main timeline 210 may indicate the current playback position, that is, the playback progress, of the audiovisual content. For example, the main timeline 210 may include a playback position indicator 215 that indicates the point in time at which the audiovisual content is currently being played.
For example, if the audiovisual content to be played has already been recorded, its length is fixed, and the full length of the timeline may correspond to the total duration of the content. The playback position indicator 215 may then be placed according to the correspondence between position and playback time.
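For illustration only, a minimal sketch of this position-to-time correspondence is given below. The `MainTimeline` shape and the function names are assumptions introduced for the example and are not part of the disclosed embodiments.

```typescript
// Hypothetical helper mapping between a pixel offset on the main timeline
// and a playback time of recorded content; names are illustrative only.
interface MainTimeline {
  widthPx: number;      // rendered width of the timeline, in pixels
  durationSec: number;  // total duration of the recorded audiovisual content
}

// Convert a click/drag position on the timeline into a playback time.
function positionToTime(t: MainTimeline, offsetPx: number): number {
  const clamped = Math.min(Math.max(offsetPx, 0), t.widthPx);
  return (clamped / t.widthPx) * t.durationSec;
}

// Convert the current playback time into the position of indicator 215.
function timeToPosition(t: MainTimeline, timeSec: number): number {
  const clamped = Math.min(Math.max(timeSec, 0), t.durationSec);
  return (clamped / t.durationSec) * t.widthPx;
}
```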
In still other examples, if the audiovisual content to be played is live content whose duration is still increasing, the playback position indicator 215 may, for example, always be placed at the far right of the main timeline 210. If the user wishes to review specific content that has already been broadcast, the user may jump back to the corresponding point in time by moving the playback position indicator 215.
In some embodiments, as shown in FIG. 2A, the main timeline 210 may also present graphical information corresponding to the audio waveform of the audiovisual content. In this way, the user can more easily see which parts of the audiovisual content deserve attention and which parts, for example those with little audio activity, can be skipped for the time being. Such a playback system 205 can thereby improve the efficiency with which users obtain content.
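One common way to produce such waveform graphics is to reduce the decoded audio samples to one amplitude value per timeline pixel. The following is a non-limiting sketch under the assumption that decoded samples are available; it is not a required implementation.

```typescript
// Downsample decoded audio samples into one peak amplitude per timeline
// pixel column, which can then be drawn as the waveform graphic.
function waveformBuckets(samples: Float32Array, bucketCount: number): number[] {
  const bucketSize = Math.ceil(samples.length / bucketCount);
  const peaks: number[] = [];
  for (let b = 0; b < bucketCount; b++) {
    let peak = 0;
    const start = b * bucketSize;
    const end = Math.min(start + bucketSize, samples.length);
    for (let i = start; i < end; i++) {
      peak = Math.max(peak, Math.abs(samples[i]));
    }
    peaks.push(peak); // amplitude in the range 0..1 for this pixel column
  }
  return peaks;
}
```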
In some embodiments, the audiovisual content may be, for example, a recording of an online meeting. Accordingly, as shown in FIG. 2A, the main timeline 210 may also present interaction markers 220 corresponding to interaction behaviors in the online meeting.
Such an interaction marker 220 may be placed at the corresponding position on the main timeline 210 to indicate that the corresponding interaction occurred at that moment. In some embodiments, different graphics of the interaction markers 220 may correspond to different interaction behaviors.
In some embodiments, the main timeline 210 may include, for example, an interaction marker 220 indicating file sharing in the online meeting. Accordingly, this interaction marker 220 may have a graphic corresponding to the format of the shared file, for example a thumbnail of the shared file.
In some embodiments, when the user selects this interaction marker 220, the playback system 205 may guide the user to obtain descriptive information about the file sharing. For example, when the user hovers the mouse over the interaction marker 220, the playback system 205 may show information about the shared file, such as its name, format, size, and sharer, in a floating window. In another example, if the user clicks the interaction marker 220, the playback system 205 may guide the user to the content of the shared file, for example by navigating the user to an online viewing interface for the file.
In some embodiments, the main timeline 210 may include, for example, an interaction marker 220 indicating an online chat in the online meeting. An online chat here refers to any suitable chat based on text, emoticons, images, and/or audio, for example one conducted with the instant messaging tool of the online meeting. Accordingly, the graphic of this interaction marker 220 may be determined based on the content of the online chat. Alternatively, the graphic of the interaction marker 220 may be determined based on the graphical identifiers (for example, avatars) of the users participating in the chat.
In some embodiments, when the user selects this interaction marker 220, the playback system 205 may guide the user to obtain descriptive information about the online chat. For example, when the user hovers the mouse over the interaction marker 220, the playback system 205 may show information about the online chat, such as its participants and content, in a floating window. In another example, if the user clicks the interaction marker 220, the playback system 205 may guide the user to the complete content of the chat, for example by navigating the user to an interface for viewing the chat content of the meeting.
In some embodiments, the main timeline 210 may include, for example, an interaction marker 220 indicating a comment in the online meeting. A comment here may include any suitable comment based on text, emoticons, images, and/or audio. For example, a user's "like" may also be understood as a comment on the corresponding content. Accordingly, the graphic of this interaction marker 220 may be determined based on the content and/or type of the comment. For example, for an emoticon-based comment, the graphical representation of the interaction marker 220 may be generated based on that emoticon.
In some embodiments, when the user selects this interaction marker 220, the playback system 205 may guide the user to obtain descriptive information about the comment. For example, when the user hovers the mouse over the interaction marker 220, the playback system 205 may show information about the comment, such as the commenter, the time of the comment, and replies to the comment, in a floating window. In another example, if the user clicks the interaction marker 220, the playback system 205 may navigate the user to an interface for viewing comments, where richer information about the comment is available.
In some embodiments, the main timeline 210 may also present speaker information indicating the temporal distribution of the speech content of at least one speaker associated with the audiovisual content. In this case, the main timeline 210 may itself be regarded as a type of speech timeline.
For example, the main timeline 210 may assign a corresponding color mark to each speaker. The color distribution on the main timeline 210 then indicates which speaker or speakers a given time period corresponds to. It should be understood that other suitable styles may also be used to indicate, on the main timeline 210, the temporal distribution of the speakers' speech content.
Speech Timeline
In some embodiments, as shown in FIG. 2A, the playback system 205 may further include a viewing entry 230 for viewing the speech timelines. In some embodiments, the viewing entry 230 may show, for example, graphical identifiers (such as avatars) of one or more speakers associated with the audiovisual content.
After receiving the user's selection of the viewing entry 230, the playback system 205 may, as shown in FIG. 2B, present a speech timeline 240-1 and a speech timeline 240-2 (individually or collectively referred to as speech timelines 240).
In some embodiments, a speech timeline 240 may indicate the temporal distribution of the speech content of at least one speaker associated with the audiovisual content. For example, if the speaker spoke at a given moment, the speech timeline 240 may be filled with a first graphic at that moment; conversely, if the speaker did not speak at that moment, the speech timeline 240 may be filled with a second graphic. The user can thus see at a glance when each speaker spoke.
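As a minimal, non-limiting sketch of this idea, a speaker's speech content can be represented as time intervals, and each pixel column of the speech timeline can then be filled with the first or second graphic depending on whether the column overlaps an interval. The `SpeechInterval` shape and function name below are assumptions for illustration only.

```typescript
// Illustrative sketch: a speaker's speech content as time intervals, and a
// per-pixel fill decision for a speech timeline 240.
interface SpeechInterval { startSec: number; endSec: number; }

function timelineFill(
  intervals: SpeechInterval[],
  durationSec: number,
  widthPx: number,
): boolean[] {
  // true  -> fill this pixel column with the "speaking" graphic
  // false -> fill it with the "not speaking" graphic
  const fill = new Array<boolean>(widthPx).fill(false);
  for (const { startSec, endSec } of intervals) {
    const from = Math.max(0, Math.floor((startSec / durationSec) * widthPx));
    const to = Math.min(widthPx, Math.ceil((endSec / durationSec) * widthPx));
    for (let x = from; x < to; x++) fill[x] = true;
  }
  return fill;
}
```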
In some embodiments, as shown in FIG. 2B, a speech timeline 240 may similarly present graphical information corresponding to the audio waveform of the portion of the audiovisual content associated with the speaker. In this way, the user can also see at a glance at which moments the speaker did not speak and at which moments the speaker spoke frequently, which further helps the user quickly find the desired content.
In some embodiments, the number of speech timelines 240 may be determined based on the number of speakers participating in the audiovisual content. In some embodiments, the number of such speakers may be determined by the number of terminals participating in the online meeting. For example, multiple meeting participants may join the online meeting through the same terminal (or with the same account); such participants may then be treated as a single speaker, even though they may include several different people speaking.
In some embodiments, the number of speakers may be determined based on the number of speakers detected in the audiovisual content. It should be understood that any suitable speaker recognition technique may be used to identify the corresponding speakers in the audiovisual content; the present disclosure is not intended to be limited in this respect.
In some embodiments, after receiving the selection of the viewing entry 230, the playback system 205 may present the speech timelines 240 corresponding to all speakers of the audiovisual content. Taking FIG. 2B as an example, the audiovisual content may involve two speakers ("Speaker 1" and "Speaker 2"). The order in which the corresponding speech timelines 240-1 and 240-2 are presented in the playback system 205 may be determined based on information about the speakers.
In some examples, the presentation order of the speech timelines may be determined based on the speakers' text identifiers. Such a text identifier may be, for example, the speaker's user name or nickname, and the presentation order may follow the ordering of these text identifiers.
In still other examples, the presentation order of the speech timelines may be determined based on each speaker's share of the speech content. For example, if the share of "Speaker 1" is "70%", which is greater than the "30%" share of "Speaker 2", the speech timeline 240-1 may be presented ahead of the speech timeline 240-2.
In still other examples, the presentation order of the speech timelines may be determined based on the start time of each speaker's speech content. For example, the speech content of "Speaker 1" may start in the first minute after the meeting begins, which is earlier than the start time of the speech content of "Speaker 2", for example the third minute after the meeting begins. Accordingly, the speech timeline 240-1 may be presented ahead of the speech timeline 240-2.
It should be understood that other suitable ordering strategies may also be used to order the multiple speech timelines 240, so that users can obtain the desired content more efficiently.
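For illustration, one possible way to combine the ordering criteria discussed above is sketched below. The `Speaker` shape and the particular priority among the criteria are assumptions; any of the strategies may equally be used on its own.

```typescript
// Illustrative comparator for ordering speech timelines by speaker.
interface Speaker {
  name: string;           // text identifier (user name or nickname)
  speechRatio: number;    // share of speech content, e.g. 0.7 for 70%
  firstSpeechSec: number; // start time of the speaker's first speech
}

function orderTimelines(speakers: Speaker[]): Speaker[] {
  return [...speakers].sort((a, b) =>
    b.speechRatio - a.speechRatio ||        // larger share of speech first
    a.firstSpeechSec - b.firstSpeechSec ||  // earlier first speech next
    a.name.localeCompare(b.name));          // finally, by text identifier
}
```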
In some embodiments, the playback system 205 may also present, in association with a speech timeline 240, descriptive information about the corresponding speaker. For example, the speech timeline 240-1 may carry the text identifier (for example, user name or nickname) of the corresponding speaker. Alternatively, the speech timeline 240-1 may carry the graphical identifier (for example, avatar) of the corresponding speaker.
In some embodiments, the playback system 205 may also present, in association with at least one speech timeline, the share of speech content of the corresponding speaker. For example, the speech timeline 240-1 may show the share "XX%" of the speech content of "Speaker 1".
In some embodiments, similarly to the main timeline 210, a speech timeline 240 may also present interaction markers (not shown in FIG. 2B) indicating interaction behaviors associated with the corresponding speaker in the online meeting.
In some embodiments, such interaction behaviors are the interactions in which the corresponding speaker took part, for example the file sharing, online chatting, or commenting discussed above. The interaction logic of the markers presented on a speech timeline 240 may be similar to that of the interaction markers 220 discussed above, and is not described again here.
In some embodiments, the speech timeline 240-1 may also be automatically collapsed or expanded in response to the user's selection of the viewing entry 230. Alternatively, the playback system may by default always provide the speech timelines of all speakers, independently of any selection of the viewing entry 230.
In some embodiments, the playback system 205 may also provide a search entry 250 for the speech timelines. With the search entry 250, the user can initiate a viewing request associated with a specific speaker.
In some embodiments, after receiving a selection of the search entry 250, the playback system 205 may present visual elements associated with all speakers associated with the audiovisual content. Such visual elements may include, for example, the speakers' text identifiers (for example, user names or nicknames) or graphical identifiers (for example, avatars).
Further, the playback system 205 may receive the user's selection of a particular visual element among the multiple visual elements, and thereby determine that the user wishes to view the speech timeline of the speaker corresponding to the selected visual element. For example, the user may click the avatar of "Speaker 1", so that the playback system 205 presents only the speech timeline 240-1 corresponding to "Speaker 1" and does not present the speech timeline 240-2.
As another example, the user may also provide input indicating the target speaker through the search entry 250. For example, the user may enter at least part of the nickname or user name of "Speaker 1", which is automatically matched to "Speaker 1", so that the playback system 205 presents the speech timeline 240-1 corresponding to "Speaker 1" and does not present the speech timeline 240-2.
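A minimal sketch of such partial matching is shown below. It is purely illustrative; the disclosure does not prescribe any particular matching algorithm, and the function name is an assumption.

```typescript
// Illustrative matching of a partial name typed into the search entry 250
// against the speakers of the audiovisual content.
function matchSpeakers(query: string, speakerNames: string[]): string[] {
  const q = query.trim().toLowerCase();
  if (q.length === 0) return speakerNames; // empty query: keep everyone
  return speakerNames.filter((name) => name.toLowerCase().includes(q));
}

// e.g. matchSpeakers("spea", ["Speaker 1", "Speaker 2"]) keeps both entries,
// while matchSpeakers("1", ["Speaker 1", "Speaker 2"]) keeps only "Speaker 1".
```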
In some embodiments, the search entry 250 may be provided independently of the viewing entry 230. For example, even when the viewing entry 230 has not been selected, the playback system 205 may still provide the search entry 250 for viewing a specific speaker.
Alternatively, the search entry 250 may be provided only in dependence on the viewing entry 230. That is, only when the viewing entry 230 has been activated and the speech timelines of all speakers are presented is the search entry 250 provided, for quickly filtering or locating a specific speech timeline.
In some embodiments, the speech timelines 240 may also support various types of user interaction. For example, as shown in FIG. 2C, the user may click a position 260 on the speech timeline 240-1 to indicate that playback of the audiovisual content is desired from that position.
Accordingly, the playback system 205 may cause the audiovisual content to be played from the time point 270 corresponding to the position 260. In some embodiments, the playback system 205 may cause the audiovisual content to be played continuously from the time point 270. For example, if the time point 270 is "5 minutes 30 seconds", the audiovisual content is played continuously from "5 minutes 30 seconds" until the end.
Alternatively, the playback system 205 may cause the portion of the audiovisual content corresponding to "Speaker 1" to be played from the time point 270. That is, the playback system 205 may play only the portion of the audiovisual content of "Speaker 1" corresponding to the speech timeline 240-1, starting from the time point 270, thereby achieving the effect of listening to a specific speaker only.
In some embodiments, if the user performs a preset operation on the speech timeline 240-1 (for example, double-clicking it), the playback system 205 may cause the portion of the audiovisual content corresponding to "Speaker 1" to be played from the beginning, that is, play only the portion of the audiovisual content that corresponds to "Speaker 1".
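The following is a sketch of the "listen to a specific speaker only" behavior described above. The `PlayerLike` interface is hypothetical and merely stands in for whatever playback API is actually used; the interval shape is likewise an assumption.

```typescript
// Illustrative sketch: play only the portions of the audiovisual content
// that belong to one speaker, starting from a chosen time point 270.
interface SpeakerInterval { startSec: number; endSec: number; }

interface PlayerLike {
  // Hypothetical API: plays the content between two time points, resolving
  // when playback of that range has finished.
  playRange(startSec: number, endSec: number): Promise<void>;
}

async function playSpeakerOnly(
  player: PlayerLike,
  intervals: SpeakerInterval[], // the speaker's speech intervals, sorted
  fromSec: number,              // the time point selected by the user
): Promise<void> {
  for (const { startSec, endSec } of intervals) {
    if (endSec <= fromSec) continue; // skip intervals before the chosen point
    await player.playRange(Math.max(startSec, fromSec), endSec);
  }
}
```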
It should be understood that, although various examples of the playback system have been discussed above with reference to FIGS. 2A to 2C for the purpose of description, the various features discussed above (for example, provision of the audio waveform, provision of interaction markers, provision of speech timelines, the style of the speech timelines, interaction with the speech timelines, and so on) may be provided independently or in combinations different from those shown in FIGS. 2A to 2C. For example, where the playback system provides the interaction-marker feature, its timeline may have a graphical style similar to the timeline of a conventional playback system and need not indicate the audio waveform.
In addition, although the examples shown in FIGS. 2A to 2C concern the review of already recorded content, the playback system 205 may also be used to play real-time audiovisual content (for example, a live audio/video stream). Accordingly, the speech timelines discussed above may indicate the temporal distribution of the historical speech content of at least one speaker associated with the historical portion of the real-time audiovisual content. For example, a speech timeline may graphically present how each speaker's historical speech content is distributed over time from the start of the live broadcast to the current moment.
Example Viewing Interface
In some embodiments, embodiments of the present disclosure may also provide a viewing interface for audiovisual content. Such a viewing interface may be, for example, a review interface for recorded content or a live interface for real-time content. It should be understood that, merely for convenience of description, the "meeting recording" scenario is used below as an example of viewing audiovisual content; this scenario is only exemplary, and embodiments of the present disclosure may also be applied to other suitable scenarios.
FIG. 3A shows an example viewing interface 300 according to some embodiments of the present disclosure. As shown in FIG. 3A, the viewing interface 300 may include a playback control 310. The playback control 310 may be implemented, for example, with the playback system 205 discussed above. As shown in FIG. 3A, the playback control 310 may include, for example, a main timeline 312 and speech timelines 314-1 and 314-2 (individually or collectively referred to as speech timelines 314).
In some embodiments, the viewing interface 300 further includes a text control 320 for presenting text content corresponding to the audiovisual content. In some embodiments, this text content may be generated based on the audio of the audiovisual content. Where the audiovisual content is a meeting recording, the text content may, for example, be generated by performing speech recognition on the audio of each speaker's speech in the meeting. Where the audiovisual content is real-time live content, the text content may, for example, be generated by real-time speech recognition of each speaker.
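For illustration, such speaker-attributed text content could be organized as time-aligned segments, as sketched below. The field names are assumptions; any speech-recognition pipeline that produces per-speaker, time-aligned text would serve the same purpose.

```typescript
// Illustrative shape of the text content shown in the text control 320:
// one entry per recognized utterance, attributed to a speaker and aligned
// to a time range of the audiovisual content.
interface TranscriptSegment {
  speakerId: string; // e.g. "speaker-1"
  startSec: number;  // where this utterance begins in the content
  endSec: number;    // where it ends
  text: string;      // speech-recognition result for this utterance
}

// Find the segment that corresponds to the current playback time.
function segmentAt(
  transcript: TranscriptSegment[],
  timeSec: number,
): TranscriptSegment | undefined {
  return transcript.find((s) => timeSec >= s.startSec && timeSec < s.endSec);
}
```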
In some embodiments, as shown in FIG. 3B, the user may, for example, select the speech timeline 314-1; accordingly, the text content 322 corresponding to "Speaker 1" may be adjusted to be emphasized in the text control 320 relative to the other text content 324 of the other speakers.
In some embodiments, emphasizing the text content 322 relative to the other text content 324 may include, for example, increasing the prominence with which the text content 322 is displayed in the text control 320. For example, the display style of the text content 322 (for example, text color, background color, boldness, font size, underlining) may be adjusted to stand out more, for instance by bolding or highlighting the text content 322.
Alternatively, emphasizing the text content 322 relative to the text content 324 may also include reducing the prominence with which the other text content 324 is displayed in the text control. For example, the display style of the text content 324 (for example, text color, background color, boldness, font size, underlining) may be adjusted to be less prominent. For example, as shown in FIG. 3B, the text color of the other text content 324 may be changed to gray, forming a contrast with the black text content 322.
As discussed with reference to FIG. 2C, the user may also select a specific position on a speech timeline to trigger playback of the audiovisual content from the corresponding moment. Alternatively or additionally, when a specific position on a speech timeline is selected, the text content corresponding to that position may also be adjusted to be presented prominently in the text control 320.
For example, the text content presented in the text control 320 always corresponds to the moment currently being played in the audiovisual content. When the user selects a specific moment on a speech timeline, the piece of text content corresponding to that moment (for example, the text corresponding to a passage spoken by the speaker) may be moved to the top of the text control 320 so as to be presented prominently. Alternatively or additionally, the display style of that piece of text content may be adjusted to present it prominently. For example, one or more words corresponding to that time point may be highlighted.
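One possible sketch of this emphasis logic is shown below. The class names and the exact styling are illustrative assumptions; raising the prominence of the selected speaker's text, lowering that of the others, or both achieves the described effect.

```typescript
// Illustrative sketch of the emphasis adjustment: given the speaker whose
// timeline was selected, decide how each transcript segment is styled in
// the text control 320.
interface Segment { speakerId: string; text: string; }

function styleFor(segment: Segment, selectedSpeakerId: string | null): string {
  if (selectedSpeakerId === null) return "segment--normal";
  return segment.speakerId === selectedSpeakerId
    ? "segment--emphasized" // e.g. bold or highlighted text content 322
    : "segment--dimmed";    // e.g. gray text for the other content 324
}
```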
Sharing of Segment Audiovisual Content
In some embodiments, embodiments of the present disclosure may also support sharing segment audiovisual content based on speech timelines. As shown in FIG. 4A, the user may select one or more of the multiple speech timelines in the playback control 410 (or the playback system 410), for example the speech timeline 430-1, for sharing.
After this selection is received, the segment audiovisual content corresponding to the speech timeline 430-1 may be generated for sharing. Taking FIG. 4A as an example, after the user selects the speech timeline 430-1 and clicks the sharing entry 420 (that is, issues a sharing request), the entire speech content of "Speaker 1" may be used to generate standalone segment audiovisual content, for example for sharing with other users or organizations.
As another example, as shown in FIG. 4B, the user may also select one or more time segments on the speech timelines 430-1 and 430-2, for example a time segment 440-1, a time segment 440-2, and a time segment 440-3. Accordingly, after the user clicks the sharing entry 420 (that is, issues a sharing request), the multiple discrete pieces of audiovisual content corresponding to the time segments 440-1, 440-2, and 440-3 may be combined to generate standalone segment audiovisual content, for example for sharing with other users or organizations.
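As a non-limiting sketch, the selected time segments could be normalized into an ordered, merged list that a downstream cutting and concatenation step then renders as the shared clip. The merging policy shown here is an assumption, not a required behavior.

```typescript
// Illustrative sketch: turn the selected time segments into a single clip
// description that a media pipeline could render as standalone segment
// audiovisual content.
interface TimeSegment { startSec: number; endSec: number; }

function buildClip(selected: TimeSegment[]): TimeSegment[] {
  // Sort by start time and merge overlapping selections, so the originally
  // discontinuous parts play back to back in the resulting clip.
  const sorted = [...selected].sort((a, b) => a.startSec - b.startSec);
  const merged: TimeSegment[] = [];
  for (const seg of sorted) {
    const last = merged[merged.length - 1];
    if (last && seg.startSec <= last.endSec) {
      last.endSec = Math.max(last.endSec, seg.endSec);
    } else {
      merged.push({ ...seg });
    }
  }
  return merged; // hand off to a cutting/concatenation step, e.g. an ffmpeg job
}
```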
In this way, embodiments of the present disclosure enable users to share segment audiovisual content more efficiently by selecting speech timelines or time segments, which improves the efficiency of sharing audiovisual content and the efficiency with which the recipients obtain information. In addition, embodiments of the present disclosure allow users to select discontinuous segments for creation, which further improves the flexibility of sharing segment audiovisual content.
Example Process
FIG. 5 shows a flowchart of an example process 500 for viewing audiovisual content according to some embodiments of the present disclosure. The process 500 may be implemented at a suitable electronic device. Examples of such electronic devices include, but are not limited to, desktop computers, laptop computers, smartphones, tablet computers, personal digital assistants, and smart wearable devices.
As shown in FIG. 5, at block 510 the electronic device provides a viewing interface for audiovisual content, the viewing interface including a playback control for playing the audiovisual content.
At block 520, the electronic device presents at least one speech timeline in the playback control, the at least one speech timeline indicating the temporal distribution of the speech content of at least one speaker associated with the audiovisual content.
In some embodiments, the viewing interface further includes a text control for presenting text content corresponding to the audiovisual content, the text content being generated based on the audio of the audiovisual content.
In some embodiments, the method further includes: in response to a selection of a first speech timeline of the at least one speech timeline, causing first text content in the text content that corresponds to a first speaker to be emphasized in the text control relative to second text content of other speakers, wherein the first speech timeline corresponds to the first speaker.
In some embodiments, causing the first text content corresponding to the first speaker to be emphasized in the text control relative to the second text content of the other speakers includes: increasing the prominence with which the first text content is displayed in the text control; and/or reducing the prominence with which the second text content is displayed in the text control.
In some embodiments, the method further includes: receiving a selection of a first position in a first speech timeline of the at least one speech timeline; and causing the text content corresponding to the first position to be presented prominently in the text control.
In some embodiments, presenting the at least one speech timeline in the playback control includes: presenting, in the playback control, a viewing entry for viewing the speech timelines; and in response to a selection of the viewing entry, presenting the at least one speech timeline in the playback control.
In some embodiments, the at least one speech timeline includes a plurality of speech timelines, and the order in which the plurality of speech timelines are presented in the playback control is determined based on at least one of: text identifiers of a plurality of speakers corresponding to the plurality of speech timelines, the proportions of the speech content of the plurality of speakers, or the start times of the speech content of the plurality of speakers.
In some embodiments, presenting the at least one speech timeline in the playback control includes: receiving a viewing request associated with a target speaker; and presenting a target speech timeline corresponding to the target speaker, the target speech timeline indicating the temporal distribution of the target speech content of the target speaker.
In some embodiments, receiving the viewing request associated with the target speaker includes: presenting a plurality of visual elements associated with a plurality of speakers associated with the audiovisual content; and receiving the viewing request associated with the target speaker based on a preset operation on a target visual element, among the plurality of visual elements, that corresponds to the target speaker.
In some embodiments, receiving the viewing request associated with the target speaker includes: receiving the viewing request associated with the target speaker based on an input indicating the target speaker.
In some embodiments, the method further includes: receiving a selection of a second position in a first speech timeline of the at least one speech timeline; and causing the corresponding portion of the audiovisual content to be played from the time point corresponding to the second position.
In some embodiments, the first speech timeline corresponds to a first speaker, and causing at least part of the audiovisual content to be played from the time point corresponding to the second position includes: causing the audiovisual content to be played continuously from the time point; or causing the portion of the audiovisual content corresponding to the first speaker to be played from the time point.
In some embodiments, the method further includes: presenting, in association with the at least one speech timeline, descriptive information about the corresponding speaker, the descriptive information being generated based on a text identifier and/or a graphical identifier of the speaker.
In some embodiments, the method further includes: presenting, in association with the at least one speech timeline, the proportion of the speech content of the corresponding speaker.
In some embodiments, the playback control further includes a main timeline for presenting graphical information corresponding to the audio waveform of the audiovisual content.
In some embodiments, the audiovisual content is an audiovisual recording of an online meeting, and the playback control further includes a main timeline for presenting a first interaction marker corresponding to a first interaction behavior in the online meeting.
In some embodiments, the at least one speech timeline also presents a second interaction marker indicating a second interaction behavior associated with the corresponding speaker in the online meeting.
In some embodiments, the first interaction behavior and/or the second interaction behavior include at least one of: file sharing, online chatting, and commenting.
In some embodiments, the method further includes: in response to a first selection of the first interaction marker, presenting first descriptive information about the first interaction behavior; and/or in response to a second selection of the second interaction marker, presenting second descriptive information about the second interaction behavior.
In some embodiments, the method further includes: receiving a selection of at least one time segment in the at least one speech timeline; and based on a first sharing request associated with the at least one time segment, causing first segment audiovisual content corresponding to the at least one time segment to be generated for sharing.
In some embodiments, the method further includes: receiving a selection of a group of speech timelines in the at least one speech timeline, the group of timelines including one or more speech timelines; and based on a second sharing request associated with the group of timelines, causing second segment audiovisual content corresponding to the group of speech timelines to be generated for sharing.
In some embodiments, the audiovisual content includes real-time audiovisual content, and the at least one speech timeline indicates the temporal distribution of the historical speech content of at least one speaker associated with the historical portion of the real-time audiovisual content.
Example Apparatus and Devices
Embodiments of the present disclosure also provide corresponding apparatuses for implementing the above methods or processes. FIG. 6 shows a schematic structural block diagram of an apparatus 600 for viewing audiovisual content according to some embodiments of the present disclosure.
As shown in FIG. 6, the apparatus 600 includes a providing module 610 configured to provide a viewing interface for audiovisual content, the viewing interface including a playback control for playing the audiovisual content.
The apparatus 600 further includes a presentation module 620 configured to present at least one speech timeline in the playback control, the at least one speech timeline indicating the temporal distribution of the speech content of at least one speaker associated with the audiovisual content.
In some embodiments, the viewing interface further includes a text control for presenting text content corresponding to the audiovisual content, the text content being generated based on the audio of the audiovisual content.
In some embodiments, the presentation module 620 is further configured to: in response to a selection of a first speech timeline of the at least one speech timeline, cause first text content in the text content that corresponds to a first speaker to be emphasized in the text control relative to second text content of other speakers, wherein the first speech timeline corresponds to the first speaker.
In some embodiments, causing the first text content corresponding to the first speaker to be emphasized in the text control relative to the second text content of the other speakers includes: increasing the prominence with which the first text content is displayed in the text control; and/or reducing the prominence with which the second text content is displayed in the text control.
In some embodiments, the presentation module 620 is further configured to: receive a selection of a first position in a first speech timeline of the at least one speech timeline; and cause the text content corresponding to the first position to be presented prominently in the text control.
In some embodiments, the presentation module 620 is further configured to: present, in the playback control, a viewing entry for viewing the speech timelines; and in response to a selection of the viewing entry, present the at least one speech timeline in the playback control.
In some embodiments, the at least one speech timeline includes a plurality of speech timelines, and the order in which the plurality of speech timelines are presented in the playback control is determined based on at least one of: text identifiers of a plurality of speakers corresponding to the plurality of speech timelines, the proportions of the speech content of the plurality of speakers, or the start times of the speech content of the plurality of speakers.
In some embodiments, the presentation module 620 is further configured to: receive a viewing request associated with a target speaker; and present a target speech timeline corresponding to the target speaker, the target speech timeline indicating the temporal distribution of the target speech content of the target speaker.
In some embodiments, the presentation module 620 is further configured to: present a plurality of visual elements associated with a plurality of speakers associated with the audiovisual content; and receive the viewing request associated with the target speaker based on a preset operation on a target visual element, among the plurality of visual elements, that corresponds to the target speaker.
In some embodiments, the presentation module 620 is further configured to: receive the viewing request associated with the target speaker based on an input indicating the target speaker.
In some embodiments, the presentation module 620 is further configured to: receive a selection of a second position in a first speech timeline of the at least one speech timeline; and cause the corresponding portion of the audiovisual content to be played from the time point corresponding to the second position.
In some embodiments, the first speech timeline corresponds to a first speaker, and the presentation module 620 is further configured to: cause the audiovisual content to be played continuously from the time point; or cause the portion of the audiovisual content corresponding to the first speaker to be played from the time point.
In some embodiments, the presentation module 620 is further configured to: present, in association with the at least one speech timeline, descriptive information about the corresponding speaker, the descriptive information being generated based on a text identifier and/or a graphical identifier of the speaker.
In some embodiments, the presentation module 620 is further configured to: present, in association with the at least one speech timeline, the proportion of the speech content of the corresponding speaker.
In some embodiments, the playback control further includes a main timeline for presenting graphical information corresponding to the audio waveform of the audiovisual content.
In some embodiments, the audiovisual content is an audiovisual recording of an online meeting, and the playback control further includes a main timeline for presenting a first interaction marker corresponding to a first interaction behavior in the online meeting.
In some embodiments, the at least one speech timeline also presents a second interaction marker indicating a second interaction behavior associated with the corresponding speaker in the online meeting.
In some embodiments, the first interaction behavior and/or the second interaction behavior include at least one of: file sharing, online chatting, and commenting.
In some embodiments, the presentation module 620 is further configured to: in response to a first selection of the first interaction marker, present first descriptive information about the first interaction behavior; and/or in response to a second selection of the second interaction marker, present second descriptive information about the second interaction behavior.
In some embodiments, the presentation module 620 is further configured to: receive a selection of at least one time segment in the at least one speech timeline; and based on a first sharing request associated with the at least one time segment, cause first segment audiovisual content corresponding to the at least one time segment to be generated for sharing.
In some embodiments, the presentation module 620 is further configured to: receive a selection of a group of speech timelines in the at least one speech timeline, the group of timelines including one or more speech timelines; and based on a second sharing request associated with the group of timelines, cause second segment audiovisual content corresponding to the group of speech timelines to be generated for sharing.
In some embodiments, the audiovisual content includes real-time audiovisual content, and the at least one speech timeline indicates the temporal distribution of the historical speech content of at least one speaker associated with the historical portion of the real-time audiovisual content.
The units included in the apparatus 600 may be implemented in various ways, including software, hardware, firmware, or any combination thereof. In some embodiments, one or more units may be implemented using software and/or firmware, such as machine-executable instructions stored on a storage medium. In addition to or instead of machine-executable instructions, some or all of the units in the apparatus 600 may be implemented, at least in part, by one or more hardware logic components. By way of example and not limitation, exemplary types of hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.
FIG. 7 shows a block diagram of a computing device/server 700 in which one or more embodiments of the present disclosure can be implemented. It should be understood that the computing device/server 700 shown in FIG. 7 is merely exemplary and shall not constitute any limitation on the functionality and scope of the embodiments described herein.
As shown in FIG. 7, the computing device/server 700 takes the form of a general-purpose computing device. Components of the computing device/server 700 may include, but are not limited to, one or more processors or processing units 710, a memory 720, a storage device 730, one or more communication units 740, one or more input devices 750, and one or more output devices 760. The processing unit 710 may be a physical or virtual processor and is capable of performing various processes according to programs stored in the memory 720. In a multi-processor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of the computing device/server 700.
The computing device/server 700 typically includes multiple computer storage media. Such media may be any available media accessible to the computing device/server 700, including, but not limited to, volatile and non-volatile media, and removable and non-removable media. The memory 720 may be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 730 may be removable or non-removable media, and may include machine-readable media, such as a flash drive, a magnetic disk, or any other media that can be used to store information and/or data (e.g., training data for training) and that can be accessed within the computing device/server 700.
The computing device/server 700 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in FIG. 7, a magnetic disk drive for reading from or writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk") and an optical disc drive for reading from or writing to a removable, non-volatile optical disc may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data media interfaces. The memory 720 may include a computer program product 725 having one or more program modules configured to perform the various methods or acts of the various embodiments of the present disclosure.
The communication unit 740 enables communication with other computing devices via communication media. Additionally, the functions of the components of the computing device/server 700 may be implemented as a single computing cluster or as multiple computing machines capable of communicating over communication connections. Thus, the computing device/server 700 may operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or another network node.
The input device 750 may be one or more input devices, such as a mouse, a keyboard, a trackball, and the like. The output device 760 may be one or more output devices, such as a display, a speaker, a printer, and the like. As needed, the computing device/server 700 may also communicate via the communication unit 740 with one or more external devices (not shown), such as storage devices and display devices, with one or more devices that enable a user to interact with the computing device/server 700, or with any device (e.g., a network card, a modem, etc.) that enables the computing device/server 700 to communicate with one or more other computing devices. Such communication may be performed via input/output (I/O) interfaces (not shown).
According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided, on which one or more computer instructions are stored, the one or more computer instructions being executed by a processor to implement the method described above.
Various aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products implemented according to the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other devices to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other devices implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to multiple implementations of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
Various implementations of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed implementations. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The terms used herein were chosen to best explain the principles of the implementations, their practical applications, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.

Claims (32)

  1. A method for viewing audiovisual content, comprising:
    providing a viewing interface for the audiovisual content, the viewing interface including a playback control for playing the audiovisual content; and
    presenting at least one speech timeline in the playback control, the at least one speech timeline being used to indicate a temporal distribution of speech content of at least one speaker associated with the audiovisual content.
  2. The method according to claim 1, wherein the viewing interface further includes a text control for presenting text content corresponding to the audiovisual content, the text content being generated based on audio of the audiovisual content.
  3. The method according to claim 2, further comprising:
    in response to a selection of a first speech timeline in the at least one speech timeline, causing first text content in the text content corresponding to a first speaker to be emphasized in the text control relative to second text content of other speakers, wherein the first speech timeline corresponds to the first speaker.
  4. The method according to claim 3, wherein causing the first text content in the text content corresponding to the first speaker to be emphasized in the text control relative to the second text content of other speakers comprises:
    increasing the prominence with which the first text content is displayed in the text control; and/or
    reducing the prominence with which the second text content is displayed in the text control.
  5. The method according to claim 2, further comprising:
    receiving a selection of a first position in a first speech timeline of the at least one speech timeline; and
    causing text content in the text content corresponding to the first position to be highlighted in the text control.
  6. The method according to claim 1, wherein presenting at least one speech timeline in the playback control comprises:
    presenting, in the playback control, a viewing entry for viewing speech timelines; and
    in response to a selection of the viewing entry, presenting the at least one speech timeline in the playback control.
  7. The method according to claim 1, wherein the at least one speech timeline includes a plurality of speech timelines, and a presentation order of the plurality of speech timelines in the playback control is determined based on at least one of:
    text identifiers of a plurality of speakers corresponding to the plurality of speech timelines,
    proportions of the speech content of the plurality of speakers, or
    start times of the speech content of the plurality of speakers.
  8. The method according to claim 1, wherein presenting at least one speech timeline in the playback control comprises:
    receiving a viewing request associated with a target speaker; and
    presenting a target speech timeline corresponding to the target speaker, the target speech timeline being used to indicate a temporal distribution of target speech content of the target speaker.
  9. The method according to claim 8, wherein receiving the viewing request associated with the target speaker comprises:
    presenting a plurality of visual elements associated with a plurality of speakers associated with the audiovisual content; and
    receiving the viewing request associated with the target speaker based on a preset operation on a target visual element, among the plurality of visual elements, corresponding to the target speaker.
  10. The method according to claim 8, wherein receiving the viewing request associated with the target speaker comprises:
    receiving the viewing request associated with the target speaker based on an input indicating the target speaker.
  11. The method according to claim 1, further comprising:
    receiving a selection of a second position in a first speech timeline of the at least one speech timeline; and
    causing a corresponding portion of the audiovisual content to be played from a time point corresponding to the second position.
  12. The method according to claim 11, wherein the first speech timeline corresponds to a first speaker, and causing at least a portion of the audiovisual content to be played from the time point corresponding to the second position comprises:
    causing the audiovisual content to be played continuously from the time point; or
    causing a portion of the audiovisual content corresponding to the first speaker to be played from the time point.
  13. The method according to claim 1, further comprising:
    presenting, in association with the at least one speech timeline, description information of the corresponding speaker, the description information being generated based on a text identifier and/or a graphic identifier of the speaker.
  14. The method according to claim 1, further comprising:
    presenting, in association with the at least one speech timeline, proportion information of the speech content of the corresponding speaker.
  15. The method according to claim 1, wherein the playback control further includes a main timeline for presenting graphical information corresponding to an audio waveform of the audiovisual content.
  16. The method according to claim 1, wherein the audiovisual content is an audiovisual recording of an online meeting, and the playback control further includes a main timeline for presenting a first interaction identifier corresponding to a first interaction behavior in the online meeting.
  17. The method according to claim 16, wherein the at least one speech timeline further presents a second interaction identifier for indicating a second interaction behavior associated with the corresponding speaker in the online meeting.
  18. The method according to claim 16 or 17, wherein the first interaction behavior and/or the second interaction behavior includes at least one of: file sharing, online chatting, and commenting.
  19. The method according to claim 16 or 17, further comprising:
    in response to a first selection of the first interaction identifier, presenting first description information for the first interaction behavior; and/or
    in response to a second selection of the second interaction identifier, presenting second description information for the second interaction behavior.
  20. The method according to claim 1, further comprising:
    receiving a selection of at least one time segment in the at least one speech timeline; and
    based on a first sharing request associated with the at least one time segment, causing a first segment of audiovisual content corresponding to the at least one time segment to be generated for sharing.
  21. The method according to claim 1, further comprising:
    receiving a selection of a group of speech timelines among the at least one speech timeline, the group of timelines including one or more speech timelines; and
    based on a second sharing request associated with the group of timelines, causing a second segment of audiovisual content corresponding to the group of speech timelines to be generated for sharing.
  22. The method according to claim 1, wherein the audiovisual content includes real-time audiovisual content, and the at least one speech timeline is used to indicate a temporal distribution of historical speech content of at least one speaker associated with a historical portion of the real-time audiovisual content.
  23. An apparatus for viewing audiovisual content, comprising:
    a providing module configured to provide a viewing interface for audiovisual content, the viewing interface including a playback control for playing the audiovisual content; and
    a presentation module configured to present at least one speech timeline in the playback control, the at least one speech timeline being used to indicate a temporal distribution of speech content of at least one speaker associated with the audiovisual content.
  24. A playback system, comprising:
    a main timeline, the main timeline indicating at least a current playback position of audiovisual content; and
    at least one speech timeline, the at least one speech timeline being used to indicate a temporal distribution of speech content of at least one speaker associated with the audiovisual content.
  25. The playback system according to claim 24, wherein the main timeline further presents graphical information corresponding to an audio waveform of the audiovisual content.
  26. The playback system according to claim 24, wherein the audiovisual content is an audiovisual recording of an online meeting, and the main timeline further presents a first interaction identifier corresponding to a first interaction behavior in the online meeting.
  27. The playback system according to claim 26, wherein the at least one speech timeline further presents a second interaction identifier for indicating a second interaction behavior associated with the corresponding speaker in the online meeting.
  28. The playback system according to claim 26 or 27, wherein the first interaction behavior and/or the second interaction behavior includes at least one of: file sharing, online chatting, and commenting.
  29. The playback system according to claim 26 or 27, wherein:
    a first selection of the first interaction identifier is used to trigger presentation of first description information for the first interaction behavior; and/or
    a second selection of the second interaction identifier is used to trigger presentation of second description information for the second interaction behavior.
  30. The playback system according to claim 24, wherein the at least one speech timeline includes a plurality of speech timelines, and a presentation order of the plurality of speech timelines in the playback system is determined based on at least one of:
    text identifiers of a plurality of speakers corresponding to the plurality of speech timelines,
    proportions of the speech content of the plurality of speakers, or
    start times of the speech content of the plurality of speakers.
  31. An electronic device, comprising:
    at least one processing unit; and
    at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the device to perform the method according to any one of claims 1 to 22.
  32. A computer-readable storage medium having a computer program stored thereon, the program, when executed by a processor, implementing the method according to any one of claims 1 to 22.
PCT/CN2023/113406 2022-10-31 2023-08-16 Method and apparatus for checking audiovisual content, and device and storage medium WO2024093442A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211352393.3A CN117956233A (en) 2022-10-31 2022-10-31 Method, apparatus, device and storage medium for viewing audio-visual content
CN202211352393.3 2022-10-31

Publications (1)

Publication Number Publication Date
WO2024093442A1 true WO2024093442A1 (en) 2024-05-10

Family

ID=90798805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/113406 WO2024093442A1 (en) 2022-10-31 2023-08-16 Method and apparatus for checking audiovisual content, and device and storage medium

Country Status (2)

Country Link
CN (1) CN117956233A (en)
WO (1) WO2024093442A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112151041A (en) * 2019-06-26 2020-12-29 北京小米移动软件有限公司 Recording method, device and equipment based on recorder program and storage medium
JP2021184189A (en) * 2020-05-22 2021-12-02 i Smart Technologies株式会社 Online conference system
CN113194349A (en) * 2021-04-25 2021-07-30 腾讯科技(深圳)有限公司 Video playing method, commenting method, device, equipment and storage medium
CN113326387A (en) * 2021-05-31 2021-08-31 引智科技(深圳)有限公司 Intelligent conference information retrieval method
CN114491087A (en) * 2022-01-13 2022-05-13 Oppo广东移动通信有限公司 Text processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN117956233A (en) 2024-04-30

Similar Documents

Publication Publication Date Title
US9154531B2 (en) Systems and methods for enhanced conference session interaction
US20180039951A1 (en) Computer-assisted agendas for videoconferences
JP2017229060A (en) Methods, programs and devices for representing meeting content
US12015683B2 (en) Method, apparatus and device for issuing and replying to multimedia content
US20240129263A1 (en) Shared Group Reactions Within A Video Communication Session
US20220182428A1 (en) Promotion of users in collaboration sessions
WO2024041549A1 (en) Method and apparatus for presenting session message, and device and storage medium
US10237082B2 (en) System and method for multimodal interaction aids
CN108845741A (en) A kind of generation method, client, terminal and the storage medium of AR expression
CN113574555A (en) Intelligent summarization based on context analysis of auto-learning and user input
US20190095392A1 (en) Methods and systems for facilitating storytelling using visual media
US11792468B1 (en) Sign language interpreter view within a communication session
WO2024083124A1 (en) Live streaming interface interaction method and apparatus, device and storage medium
US20240098362A1 (en) Method, apparatus, device and storage medium for content capturing
WO2024099452A1 (en) Video interaction method and apparatus, and device and storage medium
CN116368459A (en) Voice commands for intelligent dictation automated assistant
WO2023226853A1 (en) Method and apparatus for work reposting, and device and storage medium
WO2023226855A1 (en) Work forwarding method and apparatus, device and storage medium
WO2024093442A1 (en) Method and apparatus for checking audiovisual content, and device and storage medium
US20230353613A1 (en) Active speaker proxy presentation for sign language interpreters
CN115623133A (en) Online conference method and device, electronic equipment and readable storage medium
WO2024093937A1 (en) Method and apparatus for viewing audio-visual content, device, and storage medium
CN114915836A (en) Method, apparatus, device and storage medium for editing audio
US8572497B2 (en) Method and system for exchanging contextual keys
WO2023246395A1 (en) Method and apparatus for audio-visual content sharing, device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23884381

Country of ref document: EP

Kind code of ref document: A1