CN111711849A

CN111711849A - Method, device and storage medium for displaying multimedia data

Info

Publication number: CN111711849A
Application number: CN202010623694.XA
Authority: CN
Inventors: 马志强
Original assignee: Zhejiang Tonghuashun Intelligent Technology Co Ltd
Current assignee: Zhejiang Tonghuashun Intelligent Technology Co Ltd
Priority date: 2020-06-30
Filing date: 2020-06-30
Publication date: 2020-09-25

Abstract

The embodiments of the application disclose a method, device and storage medium for displaying multimedia data. The method comprises: obtaining at least one start point location and at least one end point location for current multimedia data, wherein any one of the at least one start point location can be matched with one of the at least one end point location, and the span from any start point location to its matched end point location corresponds to a segment of the multimedia data; intercepting the at least one segment from the current multimedia data; obtaining presentation data for the at least one segment; and displaying the at least one segment in association with the presentation data. This can at least make the display of multimedia data more intelligent and convenient, and can greatly improve the user experience.

Description

Method, device and storage medium for displaying multimedia data

Technical Field

The present application relates to audio and video data processing technologies, and in particular, to a method, device, and storage medium for displaying multimedia data.

Background

In the related art, when a user listens to audio or watches video, the user often encounters segments that the user wants to listen to or watch repeatedly, such as favorite segments or lesson segments that were not understood. Currently, repeated listening or watching of these segments requires manual selection by the user, for example dragging the progress bar to the desired segment, and the selection must be made again for every repetition. For a user who wants to listen or watch repeatedly, these repetitive, mechanical operations greatly degrade the user experience, and the process is neither intelligent nor convenient enough.

Disclosure of Invention

In order to solve the existing technical problem, embodiments of the present application provide a method, device, and storage medium for displaying multimedia data.

In a first aspect, an embodiment of the present application provides a method for presenting multimedia data, where the method includes:

obtaining at least one start point location and at least one end point location for current multimedia data; wherein any one of the at least one start point location is capable of matching one of the at least one end point location; and the span from any one start point location to the end point location matched with it corresponds to a segment of the multimedia data;

intercepting the at least one segment from the current multimedia data;

obtaining presentation data for the at least one segment;

and displaying the at least one segment in association with the presentation data.

In the above scheme, in the case that the current multimedia data includes at least audio data,

the obtaining presentation data for the at least one segment comprises:

identifying audio data in the at least one segment;

converting the audio data into first text data, wherein the first text data is text data in one-to-one correspondence with the audio data;

correspondingly, displaying the at least one segment in association with the presentation data includes:

displaying the first text data in correspondence with the at least one segment.

In the above scheme, the obtaining presentation data for the at least one segment includes:

receiving comment information for the at least one segment input in a comment area;

loading the at least one segment into the comment area;

and displaying, in the comment area, the at least one segment in correspondence with the comment information for the at least one segment.

In the above scheme, the method further comprises:

identifying key information in the first text data;

correspondingly, displaying the first text data in correspondence with the at least one segment includes:

displaying the identified key information in correspondence with the at least one segment.

In the above scheme, in the case that there are at least two intercepting intervals, each intercepting interval being the span on the current multimedia data from a start point location to the end point location matched with it, the method further comprises:

determining the interception order of the intercepting intervals;

correspondingly, displaying the first text data in correspondence with the at least one segment includes:

displaying, according to the interception order of the intercepting intervals, the segment corresponding to each intercepting interval in correspondence with the text data identified for that segment.

In the above scheme, the method further comprises:

detecting a trigger event aiming at the current multimedia data, wherein the trigger event is used for intercepting the current multimedia data;

correspondingly, the obtaining at least one start point position and at least one end point position for the current multimedia data includes:

when the trigger event is detected, detecting, for each intercepting interval, a start point location selection operation and an end point location selection operation performed on the playing progress bar of the current multimedia data.

In the above scheme, each intercepting interval is displayed on the playing progress bar;

or, in the case that each intercepting interval is displayed on the playing progress bar, the playing time information corresponding to each intercepting interval is also displayed on the playing progress bar.

In a second aspect, an embodiment of the present application provides an apparatus for displaying audio and video data, where the apparatus includes:

a first obtaining unit, configured to obtain at least one start point location and at least one end point location for current multimedia data; wherein any one of the at least one start point location is capable of matching one of the at least one end point location; and the span from any one start point location to the end point location matched with it corresponds to a segment of the multimedia data;

an intercepting unit for intercepting the at least one segment from the current multimedia data;

a second obtaining unit, configured to obtain presentation data for the at least one segment;

and a display unit, configured to display the at least one segment in association with the presentation data.

In a third aspect, an embodiment of the present application provides an apparatus for displaying audio and video data, including:

one or more processors;

a memory communicatively coupled to the one or more processors;

one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the method described above.

In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method described above.

The embodiments of the application provide a method, device and storage medium for displaying multimedia data, in which at least one start point location and at least one end point location for current multimedia data are obtained; for any start point location and the end point location matched with it, the segment spanning from that start point location to that end point location is intercepted from the current multimedia data; presentation data for the intercepted segment is obtained; and the intercepted segment is displayed in association with its presentation data.

According to the embodiments of the application, a segment of the multimedia data can be intercepted automatically based on two pieces of position information (the start point location and the end point location), so the desired segment is obtained quickly. The user can therefore conveniently intercept the segments that need to be listened to or watched repeatedly, and the intercepted segments and their presentation data are displayed automatically and intuitively. This addresses the poor user experience in the related art, where repeated listening or watching requires multiple manual operations. The display of multimedia data becomes more intelligent and convenient, and the user experience can be greatly improved.

Drawings

FIG. 1 is a first flowchart illustrating a method for displaying multimedia data according to an embodiment of the present disclosure;

FIG. 2 is a second flowchart illustrating a method for displaying multimedia data according to an embodiment of the present disclosure;

FIG. 3 is a third flowchart illustrating a method for displaying multimedia data according to an embodiment of the present application;

FIG. 4 is a fourth flowchart illustrating a method for displaying multimedia data according to an embodiment of the present application;

FIG. 5 is a first schematic diagram of an application of an embodiment of the present application;

FIG. 6 is a second schematic diagram of an application of an embodiment of the present application;

FIGS. 7(a) and 7(b) are third schematic diagrams of an application of an embodiment of the present application;

FIG. 8 is a fourth schematic diagram of an application of an embodiment of the present application;

FIG. 9 is a first schematic structural diagram of a device for displaying audio and video data according to an embodiment of the present application;

FIG. 10 is a second schematic structural diagram of a device for displaying audio and video data according to an embodiment of the present application;

FIG. 11 is a schematic diagram of a hardware structure of a device for displaying audio and video data according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Some of the flows described in the specification and claims of the present application and in the above figures include a number of operations that occur in a particular order, but it should be clearly understood that the flows may include more or fewer operations, and that the operations may be performed sequentially or in parallel.

Before the technical solutions of the embodiments of the present application are introduced, technical terms that may be used in the embodiments of the present application are described:

1) Multimedia data refers to at least one of audio, image, and video data.

2) Intercepting interval refers to the interval of the multimedia data, covering at least part of its playing duration, that runs from the playing start time of the intercepted part to its playing end time.

The embodiments of the application provide a method for displaying multimedia data, which can be applied to a device for displaying multimedia data. The device may be any device capable of outputting at least one of audio, image and video data, such as a mobile phone, a computer, a notebook, or an intelligent wearable device such as a smart watch or smart glasses. For convenience of the subsequent description, the device for displaying multimedia data is simply referred to as the device.

FIG. 1 shows a first embodiment of the method for displaying multimedia data according to an embodiment of the present application. As shown in FIG. 1, the method includes:

S101: obtaining at least one start point location and at least one end point location for current multimedia data; wherein any one of the at least one start point location is capable of matching one of the at least one end point location; and the span from any one start point location to the end point location matched with it corresponds to a segment of the multimedia data;

S102: intercepting the at least one segment from the current multimedia data;

S103: obtaining presentation data for the at least one segment;

S104: displaying the at least one segment in association with the presentation data.

S101 to S104 are performed by the device for displaying multimedia data.

In S101 to S104, at least one start point position and at least one end point position for the current multimedia data are obtained. For any obtained start point position and the end point position matched with it, the segment spanning from that start point position to that end point position is intercepted from the current multimedia data (this segment is called an intercepted segment), presentation data for the intercepted segment is obtained, and the intercepted segment is displayed in association with its presentation data. According to the embodiments of the application, a segment of the multimedia data can be intercepted automatically based on two pieces of position information (the start point position and the end point position), so the desired segment is obtained quickly and displayed together with its presentation data. The user can therefore conveniently intercept the segments that need to be listened to or watched repeatedly, and the intercepted segments and their presentation data are displayed automatically and intuitively. This addresses the poor user experience in the related art, where repeated listening or watching requires multiple manual operations. The display of multimedia data becomes more intelligent and convenient, and the user experience can be greatly improved.

It is understood that S101 may occur while the current multimedia data is playing or while playback is paused. The start point position and the end point position can be selected by the user while the multimedia data is playing or paused, either by manual operation or by voice operation.

Taking selection by voice operation as an example, as shown in FIG. 5, the currently played multimedia data is a 5-minute "turnover rate teaching" video. While the video is playing, an audio acquisition device in the device, such as a microphone, monitors in real time whether there is voice input from the viewer (user). If the viewer encounters a part that is not understood, or hears a part emphasized for an examination, and wants to intercept it for later repeated viewing, the user says, for example, "intercept 3:20-4:15". The microphone collects the user's input and the device recognizes it, taking 3:20 of the currently played video as the start point position and 4:15 as the end point position.
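As an illustration of the voice example above, the following is a minimal sketch of turning already-recognized speech text into a start point position and an end point position in seconds. The command grammar ("intercept M:SS-M:SS"), the function name `parse_intercept_command` and the rule of rejecting a start that is not before the end are assumptions made for this sketch; the application only gives the single example "intercept 3:20-4:15" and does not specify how the recognized text is parsed.

```python
import re

# Assumed command grammar: "intercept M:SS-M:SS" (optional spaces tolerated).
_COMMAND = re.compile(r"intercept\s+(\d+):\s*(\d{2})\s*-\s*(\d+):\s*(\d{2})")


def parse_intercept_command(text):
    """Return (start_seconds, end_seconds), or None if text is not a valid command."""
    match = _COMMAND.search(text.lower())
    if match is None:
        return None
    start_min, start_sec, end_min, end_sec = (int(group) for group in match.groups())
    start = start_min * 60 + start_sec
    end = end_min * 60 + end_sec
    # Assumed rule: a start point must come before its end point.
    if start >= end:
        return None
    return start, end
```

With this sketch, the spoken command from the example maps 3:20 to 200 seconds and 4:15 to 255 seconds.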

Taking manual operation as an example, as shown in FIG. 5, if the user needs to repeatedly watch the video segment between 3:20 and 4:15, the user may perform a selection operation at 3:20 and at 4:15 of the video, for example by clicking the 3:20 position and then the 4:15 position on the playing progress bar. The device detects the two selection operations and takes the first detected click position on the playing progress bar as the start point position and the second as the end point position. Alternatively, the user may slide or pull along the progress bar from the 3:20 position to the 4:15 position. The device detects this sliding or pulling operation and takes its starting point as the start point position, within the whole video, of the video segment to be intercepted, and its ending point as the end point position of that segment.

Of course, the start point position and the end point position can also be obtained by both collecting the user's voice input and detecting click, slide or pull operations. That is, in S101 the device obtains the start point position and the end point position by detecting the aforementioned voice input and/or selection operation (click, slide, or pull).
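The click example above can be sketched as a mapping from a horizontal offset on the progress bar to a playback time. The linear mapping, the pixel-based parameters and the names `position_to_seconds` and `selection_from_clicks` are assumptions for illustration; the application does not describe how a click position is converted into a time.

```python
def position_to_seconds(x, bar_width, duration_s):
    """Map a horizontal offset on the progress bar to a playback time in seconds."""
    if bar_width <= 0:
        raise ValueError("bar_width must be positive")
    # Clamp so that a click slightly outside the bar still maps into the media.
    x = min(max(x, 0), bar_width)
    return x * duration_s / bar_width


def selection_from_clicks(first_x, second_x, bar_width, duration_s):
    """Treat the first click as the start point and the second as the end point."""
    start = position_to_seconds(first_x, bar_width, duration_s)
    end = position_to_seconds(second_x, bar_width, duration_s)
    return start, end
```

For a hypothetical 600-pixel progress bar over the 5-minute (300-second) video, clicks at offsets 400 and 510 correspond to 3:20 and 4:15.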

For convenience of description, the aforementioned voice input and manual click, slide or pull operations are all regarded as intercepting operations. In practice, the user may want to repeatedly view two or more partial segments of the video shown in FIG. 5, such as 1:00-2:00 and 3:20-4:15. In that case the user selects the start point position and the end point position of one segment, and then the start point position and the end point position of the other segment. For example, the user first performs the intercepting operation on the 1:00-2:00 part and then on the 3:20-4:15 part, which amounts to selection operations on four positions. Following the user's conventional operating behavior, the device regards the odd-numbered position selection operations (the first, third, and so on) as selecting the start point positions of the different partial segments, and the even-numbered position selection operations (the second, fourth, and so on) as selecting their end point positions.

Each odd-numbered position selection operation is matched with the even-numbered position selection operation closest to it, which in the example below is the one that follows it. That is, the end point position selected by that even-numbered operation is the end point position matched with the start point position selected by the odd-numbered operation. For the intercepting operations on the 1:00-2:00 and 3:20-4:15 parts described above, the device detects the user's click on the 1:00 position of the video progress bar, then the click on the 2:00 position, and next the clicks on the 3:20 and 4:15 positions. According to the foregoing description, the device regards the position selected by the click on the 1:00 position as the start point position of the first intercepted segment, and the position selected by the click on the 2:00 position as its end point position. The 2:00 position is thus the end point position matched with the start point position at 1:00.
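The odd/even matching rule described above can be sketched as follows. The function name `pair_selections` and the choice to ignore a trailing start point that has no end point yet are assumptions for this sketch; the application does not say how an unmatched selection is handled.

```python
def pair_selections(positions):
    """Pair position selections, given in the order they were made, into intervals.

    The 1st, 3rd, ... selections are start points; the 2nd, 4th, ... are end points.
    """
    starts = positions[0::2]
    ends = positions[1::2]
    # zip stops at the shorter list, so a trailing unmatched start is ignored
    # (an assumption of this sketch).
    return list(zip(starts, ends))
```

For the four clicks in the example (1:00, 2:00, 3:20 and 4:15, expressed in seconds), this yields the intervals (60, 120) and (200, 255), already in interception order.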

In S102, the device intercepts the segment that the user desires based on the intercepting operation performed by the user. Technically, the desired segment of the currently played multimedia data may be read out; for example, the video content at 1:00-2:00 of the 5-minute video shown in FIG. 5 is read as the intercepted segment. The intercepted segments may be saved for subsequent use.

According to this scheme, by collecting or detecting the user's intercepting operation, the device in the embodiments of the application obtains the start point position and the end point position, within the whole multimedia data, of the segment that the user needs to listen to or watch. This amounts to automatically intercepting the segment the user wants according to the user's actual operation, so the desired segment is obtained quickly, which greatly improves the user experience and meets the user's actual needs.

It should be understood that the aforementioned presentation data may be any data related to the intercepted segment. It may be content of the intercepted segment itself, such as its text or images; it may also be content derived from the intercepted segment, such as the key meaning it expresses or the core meaning expressed by its images. The device in the embodiments of the application can obtain this information by performing text or image recognition on the characters, images and other content in the intercepted segment. Displaying the intercepted segment in association with its presentation data automatically presents the segment that the user wants to listen to or watch repeatedly, which greatly improves the user experience.

Displaying the intercepted segment in association with its presentation data means that the two are visibly linked when displayed, for example by displaying each intercepted segment together with its own presentation data, so that the user can tell which presentation data belongs to which of the currently displayed intercepted segments.
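Reading out the desired segment can be sketched for the simplest case, in which the media is held in memory as a sequence of samples or frames at a fixed rate. This in-memory representation and the name `intercept_segment` are assumptions for illustration; the application does not specify how the segment is read or stored, and real audio or video would normally be cut with a media library.

```python
def intercept_segment(samples, sample_rate, start_s, end_s):
    """Return the part of `samples` between start_s and end_s (in seconds)."""
    if not 0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    first = int(start_s * sample_rate)
    # Do not read past the end of the media.
    last = min(int(end_s * sample_rate), len(samples))
    return samples[first:last]
```

For a 5-minute recording at a hypothetical 10 samples per second, intercepting 1:00-2:00 returns the 600 samples from index 600 up to, but not including, index 1200.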

Specifically, the presentation data in the embodiment of the present application may be text data obtained by text-converting audio data included in the intercepted segment, or may be comment information generated by the user on the current intercepted segment.

For the implementation of the text data obtained by text conversion of the audio data included in the intercepted fragment, please refer to the description of the second embodiment. For implementation of the comment information, please refer to the description of the third embodiment.

FIG. 2 shows a second embodiment of the method for displaying multimedia data according to an embodiment of the present application. As shown in FIG. 2, the method includes:

S201: obtaining at least one start point location and at least one end point location for current multimedia data; wherein any one of the at least one start point location is capable of matching one of the at least one end point location; and the span from any one start point location to the end point location matched with it corresponds to a segment of the multimedia data;

S202: intercepting the at least one segment from the current multimedia data;

S203: identifying audio data in the at least one segment in the case that the current multimedia data includes at least audio data;

S204: converting the audio data into first text data, wherein the first text data is text data in one-to-one correspondence with the audio data;

S205: displaying the first text data in correspondence with the at least one segment.

S201 to S205 are performed by the device for displaying multimedia data. For S201 and S202, refer to the descriptions of S101 and S102 above; the repeated parts are not described again.

In the foregoing solution, when the current multimedia data is audio data or video data, the audio data in each intercepted segment is identified and converted into corresponding text data (the first text data), and the first text data is displayed in correspondence with the intercepted segment. From the displayed content, the user can clearly see the textual meaning expressed by the displayed intercepted segment, which improves the user experience.
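The flow of S202 to S205 can be sketched as a small pipeline that keeps each intercepted segment together with its text so that the two can be displayed in correspondence. The `DisplayItem` type, the function `present_segments` and the injected `transcribe` callable are hypothetical names introduced here; the application does not name a speech recognition method, so the sketch takes one as a parameter rather than assuming a particular library.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class DisplayItem:
    interval: Tuple[float, float]   # (start, end) in seconds
    segment: Sequence               # the intercepted samples or frames
    text: str                       # first text data for this segment


def present_segments(samples: Sequence, sample_rate: int,
                     intervals: List[Tuple[float, float]],
                     transcribe: Callable[[Sequence], str]) -> List[DisplayItem]:
    """Intercept each interval, convert its audio to text, and keep both together."""
    items = []
    for start_s, end_s in intervals:
        segment = samples[int(start_s * sample_rate):int(end_s * sample_rate)]
        # transcribe stands in for whatever speech-to-text step the device uses.
        text = transcribe(segment)
        items.append(DisplayItem((start_s, end_s), segment, text))
    return items
```

Because each item carries its own interval, segment and text, a display layer can place the text above or beside the segment without losing track of which text belongs to which segment.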

In the foregoing scheme, as can be seen from the schematic diagrams in FIGS. 7(a) and 7(b), two segments, intercepted segment 1 and intercepted segment 2, are intercepted from the 5-minute "turnover rate teaching" video shown in FIG. 5. The audio data in intercepted segment 1 and in intercepted segment 2 is identified and converted into text data in one-to-one correspondence with the respective audio content. Intercepted segment 1 is displayed in correspondence with the text data converted from its audio data, and intercepted segment 2 is displayed in correspondence with the text data converted from its audio data. For example, the one-to-one text conversion of the audio content of intercepted segment 1 in FIG. 7(a) yields "whether the turnover rate is high or low is relative, and the standard differs in different periods; in an ordinary bear market the average daily turnover rate is about 2%", and the one-to-one text conversion of the audio content of intercepted segment 2 in FIG. 7(b) yields "a stock with a turnover rate above 2% can be regarded as a high-turnover stock, and one below 2% as a low-turnover stock; a high-turnover stock rises more, or resists falling better, than a low-turnover stock. For each stock, the turnover rate is proportional to the price: the price rises as the turnover rate rises and falls as the turnover rate falls". In this way, the meaning expressed in the intercepted segment is shown to the user in text without omission, and the user can know everything the intercepted segment expresses.

That is, the first text data is obtained by performing verbatim (word-for-word) text conversion on the audio content in each intercepted segment. Furthermore, key information in the first text data may also be identified, and the identified key information is displayed in correspondence with the at least one segment. It can be understood that, as an alternative to verbatim text conversion of the audio content in each intercepted segment, the embodiments of the present application may convert only the key content of the audio content into text, skipping non-key vocabulary such as prepositions and auxiliary words; or, after the verbatim text conversion is performed, key information representing the core idea of the audio content may be extracted from the result. As shown in FIG. 7(b), key information is extracted from the text obtained by one-to-one text conversion of the audio content of intercepted segment 1 shown in FIG. 7(a), giving "the turnover rate is a relative value, and the standards differ in different periods". After text conversion of the key content in the audio content of intercepted segment 2 shown in FIG. 7(b), the key information obtained is "a turnover rate above 2% marks a high-turnover stock and below that a low-turnover stock; a high-turnover stock rises more or resists falling better than a low-turnover stock; the turnover rate is proportional to the price". In this way, the main or key meaning expressed in the intercepted segment is presented to the user simply and clearly, and the user can grasp the main meaning of the intercepted segment.
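As a rough illustration of dropping non-key words from the verbatim text, the sketch below filters out words from a small list. This is only a crude stand-in: the application does not say how key information is identified, and the word list `FILLER_WORDS` and the function `extract_key_text` are assumptions for this sketch. A practical system would more likely use a summarization or keyword extraction technique.

```python
# Hypothetical list of non-key words; the application does not enumerate them.
FILLER_WORDS = frozenset({"um", "uh", "well", "so", "the", "a", "an", "of"})


def extract_key_text(transcript, filler_words=FILLER_WORDS):
    """Drop non-key words from a verbatim transcript, keeping the word order."""
    kept = []
    for word in transcript.split():
        # Compare without case and without surrounding punctuation.
        if word.lower().strip(",.;:!?") not in filler_words:
            kept.append(word)
    return " ".join(kept)
```

The filtered text keeps the original word order, so it can be displayed in place of, or alongside, the verbatim first text data.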

In practical application, the one-to-one correspondence between the text data and the intercepted segments may be displayed as shown in FIG. 7(a). In FIG. 7(a), the intercepted segments and the text data are displayed separately as two types of data, with the text data displayed above the intercepted segments. In this case, paragraph division of the text data distinguishes which text data corresponds to which intercepted segment. For example, the text data converted from the audio data in intercepted segments 1 and 2 is divided into paragraphs: "whether the turnover rate is high or low is relative, and the standard differs in different periods; in an ordinary bear market the average daily turnover rate is about 2%" is one paragraph, and "a stock with a turnover rate above 2% can be regarded as a high-turnover stock, and one below 2% as a low-turnover stock; a high-turnover stock rises more, or resists falling better, than a low-turnover stock; the turnover rate is proportional to the price" is another paragraph. The two video items below the text data, intercepted segments 1 and 2, show which intercepted segment each paragraph was converted from. As shown in FIG. 7(b), the intercepted segments and the text data may also be displayed separately as two types of data with the text data displayed beside the intercepted segment: the text data converted from the audio data in intercepted segment 1 is displayed on the right side of intercepted segment 1, and the text data converted from the audio data in intercepted segment 2 is displayed on the right side of intercepted segment 2.

It is understood that FIGS. 7(a) and 7(b) are only specific examples and do not represent all possible layouts. It should be clear to a person skilled in the art that all reasonable layouts are covered by the embodiments of the present application. For example, in a variant of FIG. 7(a), the intercepted segments may be displayed above the text data; in a variant of FIG. 7(b), the text data converted from the audio data in an intercepted segment may be displayed on the left side of the segment rather than the right. Whichever corresponding display layout is adopted, the user can tell which textual meaning is expressed by which intercepted segment, which is convenient for the user and improves the user experience.

In an optional scheme, in the case that there are at least two intercepting intervals, where one intercepting interval runs from a start point position to the end point position matched with that start point position, the method further includes: determining the interception order of the intercepting intervals. Correspondingly, displaying the text data corresponding to the at least one segment includes: correspondingly displaying, according to the interception order of the intercepting intervals, the at least one segment corresponding to each intercepting interval and the identified text data of that segment. This scheme considers that, in practical applications, a user may intercept two or more segments from the same multimedia data. In this case, in order to identify the start point position and the end point position of each intercepted segment in the whole multimedia data play stream, the device regards the position (such as the time position) of each intercepted segment in the play stream as one intercepting interval, and also records the interception order of each segment, that is, the interception order of each intercepting interval. The intercepted segments and the text data are then displayed correspondingly according to the interception order. As shown in fig. 7(a) and (b), when the device recognizes the start point and end point position operations input by the user for intercepted segment 1, it regards intercepted segment 1 as the segment to be intercepted first. When the device then recognizes the start point and end point position operations input by the user for intercepted segment 2, it regards intercepted segment 2 as the segment to be intercepted afterwards. Thus, the determined interception order is segment 1 first and then segment 2. When the text data and the intercepted segments are correspondingly displayed, they are displayed according to the interception order. As shown in fig. 7(a) and (b), intercepted segment 1 is displayed in front of intercepted segment 2, and the text data obtained by text-converting the audio data in intercepted segment 1 is displayed in front of the text data obtained by text-converting the audio data in intercepted segment 2. Of course, the display can also follow the reverse of the interception order, for example with intercepted segment 1, which was intercepted first, displayed behind intercepted segment 2, which was intercepted later; to keep the display corresponding, the text data of intercepted segment 1 is then also displayed behind the text data of intercepted segment 2. This display mode makes it easy for the user to know which segment was intercepted first and which later. If the user intercepts the segments in order of playing time, the meaning that the whole multimedia data is intended to express can be inferred from the intercepted segments, which improves the user experience.
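The ordered display described above can be illustrated with a minimal Python sketch. The `Clip` class, its field names, and the `display_rows` helper are hypothetical names introduced only for illustration; the embodiment does not prescribe any data structure.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    order: int    # interception order (1 = intercepted first); assumed field
    start_s: int  # start point position, in seconds
    end_s: int    # end point position, in seconds
    text: str     # text data converted from the segment's audio


def display_rows(clips, reverse=False):
    """Pair each segment with its text, in interception order or its reverse."""
    ordered = sorted(clips, key=lambda c: c.order, reverse=reverse)
    return [(f"segment {c.order}", c.text) for c in ordered]


clips = [
    Clip(order=2, start_s=200, end_s=250, text="text of segment 2"),
    Clip(order=1, start_s=80, end_s=105, text="text of segment 1"),
]
rows = display_rows(clips)                        # segment 1 shown first
rows_reversed = display_rows(clips, reverse=True)  # segment 2 shown first
```

Because each row carries both the segment and its text, the correspondence is preserved whichever order is chosen.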

In the interfaces shown in fig. 7(a), (b), a cancel/delete function key, a confirm/save function key, and a delete function key are further provided for the convenience of the user.

The cancel/delete function key is operable to cancel or delete at least one of the intercepted segments and its text data presented by the current interface. Preferably, all the intercepted segments and their text data presented by the current interface are cancelled or deleted. When the user does not need the device to display an intercepted segment and its text data, or manually confirms that the segment was intercepted incorrectly or that the text data was converted incorrectly, the user may operate the cancel/delete function key, for example by clicking it; the device recognizes this operation and cancels or deletes the intercepted segment and its text data. The confirm/save function key can be used for saving the intercepted segment and its text data, for example into a notebook list; if the user later wants to listen to or view the segment, the segment and its text data can be viewed by entering the notebook list. When the user needs the device to save the displayed intercepted segment and its text data, the user may operate the confirm/save function key, for example by clicking it; the device recognizes this operation and saves the intercepted segment and its text data. The delete function key is provided for the text data. When the user considers that the text data was converted incorrectly, or wishes to input the text data of the intercepted segment himself or herself, the user operates the delete function key, and the device recognizes this operation and deletes the text data. It is to be understood that, in the case where one delete function key is provided in common for all the text data as shown in fig. 7(a), after the user operates the delete function key, the user can manually select the data to be deleted from the entire text data, and the device recognizes the manually selected text data and deletes it. In the case where a corresponding delete function key is provided for each piece of text data as shown in fig. 7(b), when the user operates one of the delete function keys, the device recognizes the operation and deletes the text data served by that delete function key. After the text data is deleted, the user can input his or her own insights or comments on the corresponding intercepted segment in the area where the deleted text data was located. In other words, the user may also add a text comment to the intercepted segment. This can greatly improve the user experience.
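As an illustration of the per-segment delete function key of fig. 7(b) followed by the user's own input, the following Python sketch keeps the text data in a dictionary keyed by segment. The function names and the rule that a note can only be entered into an emptied area are assumptions made for this sketch, not details stated in the embodiment.

```python
def delete_text(texts, segment_id):
    """Delete the text data served by one delete key (fig. 7(b) style)."""
    updated = dict(texts)      # copy so the caller's data is not mutated
    updated[segment_id] = ""   # the text area of this segment becomes empty
    return updated


def add_user_note(texts, segment_id, note):
    """Place the user's own insight or comment in the emptied text area."""
    if texts.get(segment_id):
        # Assumption of this sketch: the converted text must be deleted first.
        raise ValueError("text area is not empty; delete the text first")
    updated = dict(texts)
    updated[segment_id] = note
    return updated


texts = {1: "converted text of segment 1", 2: "converted text of segment 2"}
texts = delete_text(texts, 2)
texts = add_user_note(texts, 2, "my own understanding of segment 2")
```

Only the text served by the operated key changes; the text of segment 1 is left as converted.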

Applicable scenes of the embodiment of the application:

Application scenario one: a person who learns by watching an online course performs an intercepting operation on video clips that he or she does not understand or that contain key examination points. The mobile phone recognizes the intercepting operation, intercepts the desired video clips, performs text conversion on the audio content in the video clips, and correspondingly displays the video clips and the converted text on the mobile phone. This helps the user repeatedly study the video clips that were not understood or that are important for the examination. In addition, these video clips and their text data can be stored in a list such as a notebook list, and when they are needed later, the user can open them directly from the notebook list to continue learning. The user does not need to go through the whole online course again, nor search the whole course for the video clips that were not understood or that are important for the examination, which brings great convenience to the user. In addition, the user can also operate the delete function key and add text content of his or her own, such as his or her understanding of the content expressed by the intercepted segment, into the notebook list.

Application scenario two: during entertainment, for example, a person listening to a song or watching a video can generate an intercepting operation on a part of segments in the whole song or a part of segments in the whole video, the mobile phone recognizes the intercepting operation, intercepts the song segments or the video segments which are expected to be intercepted, performs text conversion on audio content in the song segments or the video segments, and correspondingly displays the audio content together with the song segments or the video segments on the mobile phone. Therefore, the interested part can be listened or watched repeatedly without listening to the whole song or the whole video, thereby greatly facilitating the entertainment of the user.

Fig. 3 is a third embodiment of a method for presenting multimedia data according to an embodiment of the present application, as shown in fig. 3, the method includes:

S301: obtaining at least one start point position and at least one end point position for current multimedia data; wherein any one of the at least one start point position is capable of matching one of the at least one end point position; the span from any start point position to its matched end point position corresponds to at least one segment of the multimedia data;

S302: intercepting the at least one segment from the current multimedia data;

S303: receiving comment information for the at least one segment input in a comment area;

S304: loading the at least one segment into the comment area;

S305: correspondingly displaying the at least one segment and the comment information for the at least one segment in the comment area.

The main body performing S301 to S305 is a device that presents multimedia data. For the description of S301 and S302, please refer to the related parts, and repeated parts are not described.

In the foregoing solution, taking the presentation data being comment information as an example, in the case that the device receives the comment information for the at least one segment input by the user in the comment area, the intercepted segment is loaded into the comment area, and the at least one segment and the comment information for it are correspondingly displayed in the comment area. In other words, the comment area may display the intercepted segment together with the opinion, like, or dislike that the user wants to express with respect to that segment. This avoids the defect in the related art that only text can be input in the comment area and multimedia data such as video cannot be loaded. In the embodiment of the application, the intercepted segment and the text comment on that segment can be displayed together in the comment area, so that the multimedia segment that the comment refers to is clearly indicated. The intercepted segment can further reflect the emotion of the user's comment, which improves the user experience. Other users in the comment area can also better feel the emotion or meaning the user wants to express through the intercepted segment.
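Steps S303 to S305 can be sketched in Python as follows. The `CommentEntry` class and the `post_comment` helper are hypothetical names, and the comment area is modelled simply as a list; this is a sketch under those assumptions rather than the implementation of the embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class CommentEntry:
    user: str
    comment: str                                   # comment information (S303)
    segments: list = field(default_factory=list)   # intercepted segments loaded (S304)


def post_comment(comment_area, user, comment, segments):
    """Load the segments into the comment area together with the comment (S304, S305)."""
    entry = CommentEntry(user=user, comment=comment, segments=list(segments))
    comment_area.append(entry)  # one entry is displayed as segments plus comment
    return entry


comment_area = []
entry = post_comment(
    comment_area,
    user="user A",
    comment="my comment on these two segments",
    segments=["song segment 1", "song segment 2"],
)
```

Keeping the segments and the comment in a single entry is what lets other users see which multimedia segment a comment refers to.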

In the interfaces shown in fig. 7(a), (b) described above, the user can operate the confirm/save function key to save a desired segment into the notebook list. Further, desired segments may also be saved to a comment list. When the display interface of the device enters the comment interface shown in fig. 8, the device may load the intercepted segments stored in the comment list into the region of the comment area available to the user, such as user A, and the user's emotion toward the intercepted segments can be expressed in combination with the text data input by the user in that region. As shown in fig. 8, user A intercepts two songs, loads them into the comment area, and posts his own comment information, "always listen to two songs in my dad's CD before, recently retrieve them again, and then continuously wash the brain". Other users, such as user B and user C, upon seeing the comment of user A, can click on the intercepted segments that have been loaded into the comment area to listen to or view them. Displaying the user's comment information together with the intercepted segments in the comment area clearly indicates which multimedia segments the comment refers to, increases interaction among users, and makes the interaction more interesting.

It can be understood that if the number of the intercepted clips saved in the comment list is large, the user can select the intercepted clip to be loaded into the comment area, and the device identifies the intercepted clip selected by the user and loads the intercepted clip into the comment area.

In the above scheme, before entering the comment interface to start commenting, the user intercepts the desired segments and stores them in the comment list, and loads them from the comment list when they are needed. Alternatively, the user can enter the comment interface first and then, for the content to be commented on, intercept the segment to be loaded into the comment area from the corresponding multimedia data; in this case, if the device detects the user's operation on the confirm/save function key, the device responds to the operation and loads the newly intercepted segment into the comment area. Either way, the intercepted segment and the user's text comment on it are displayed together in the comment area, so that the emotion of the user's comment is better reflected and the user experience is improved. Because the intercepted segment and the text comment are displayed together in the comment area for other users to see, good videos or songs are in effect recommended to other users, which promotes the spread of good content such as songs or videos.

Fig. 4 is a fourth embodiment of a method for presenting multimedia data according to an embodiment of the present application, as shown in fig. 4, the method includes:

S401: detecting a trigger event for the current multimedia data, wherein the trigger event is used for intercepting the current multimedia data;

S402: when the trigger event is detected, detecting a start point position selection operation and an end point position selection operation for each intercepting interval, which are generated on a play progress bar of the current multimedia data;

S403: intercepting the at least one segment from the current multimedia data;

S404: obtaining presentation data for the at least one segment;

S405: performing associated display on the at least one segment and the presentation data.

The main body performing S401 to S405 is a device that presents multimedia data. For the descriptions of S403-S405, please refer to the related parts, and the repeated parts are not repeated.

In the first to third embodiments, the device directly detects the position selection operation of the start point and the end point generated by the user to perform the fragment interception. Compared with the first to third embodiments, in this embodiment (embodiment four), in order to avoid the inconvenience caused by the user touching by mistake and the inconvenience caused by the fact that the device does not recognize the mistaken touching and directly performs the fragment interception, before the position selection operation is detected, whether a trigger event for intercepting the current multimedia data exists is detected, and if the trigger event exists, the position selection operation of the starting point and the ending point generated by the user is detected, so that the inconvenience caused by the mistaken touching can be greatly avoided, and the accurate interception is realized.

For accurate interception, a trigger function key is provided in fig. 5.

When the device detects that the user operates the trigger function key, for example by clicking it, a trigger signal is generated, the user is considered to need to intercept segments, and the start point and end point selection operations generated by the user can then be detected. Alternatively, an operation of the user on the current multimedia data picture, such as a pressing operation, can be detected and its duration calculated; if the duration reaches a preset duration, such as 5s, a trigger signal is generated, the user is considered to need to intercept segments, and the start point and end point selection operations generated by the user can then be detected. Correspondingly, once the user has completed the two position selection operations for every segment to be intercepted, the user may operate the trigger function key again; the device recognizes this operation, regards it as an operation for ending segment interception, and no longer responds to position selection operations generated by the user. Or, the user presses the multimedia data picture again, and in the case that the pressing duration is longer than the predetermined duration, such as 6s, the device considers that the segment interception operation has ended and no longer responds to position selection operations generated by the user. The operation of ending interception and the operation of preparing to start interception may be the same operation, such as a pressing operation whose duration is longer than a predetermined duration. It is also possible to start preparing for interception by such a pressing operation and to end interception by operating the trigger function key. This is not specifically limited. Where both starting to prepare for interception and ending interception are performed through a trigger function key, they may be implemented by the same trigger function key (as shown in fig. 5) or by different trigger function keys, which is not specifically limited.
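The gating of position selection by a trigger event can be sketched as a small state holder in Python. The class `InterceptionGate`, its method names, and the use of a single long-press threshold for both starting and ending are assumptions of this sketch; the embodiment leaves these choices open.

```python
class InterceptionGate:
    """Accept position selections only between a start trigger and an end trigger."""

    def __init__(self, long_press_s=5.0):
        self.long_press_s = long_press_s  # assumed preset duration, in seconds
        self.active = False               # True while interception is enabled
        self.selections = []              # accepted (start, end) positions

    def on_trigger_key(self):
        # The same key both starts and ends interception (as in fig. 5).
        self.active = not self.active

    def on_press(self, duration_s):
        # A press on the picture counts as a trigger only if it is long enough.
        if duration_s >= self.long_press_s:
            self.active = not self.active

    def on_position_selection(self, start_s, end_s):
        # Selections made without a trigger event are treated as mistaken touches.
        if not self.active:
            return False
        self.selections.append((start_s, end_s))
        return True


gate = InterceptionGate()
ignored = gate.on_position_selection(80, 105)   # no trigger yet: ignored
gate.on_trigger_key()                            # start interception
accepted = gate.on_position_selection(80, 105)  # now accepted
gate.on_press(6.0)                               # long press ends interception
after_end = gate.on_position_selection(200, 250)
```

In this sketch a short press (below the threshold) changes nothing, which is how mistaken touches are filtered out.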

It should be understood by those skilled in the art that one trigger event may allow the selection of only one segment to be intercepted, in which case a trigger event needs to be generated each time a segment is intercepted. Alternatively, one trigger event may allow the selection of all the segments to be intercepted, that is, all the segments the user wishes to intercept can be intercepted under a single trigger event.

In an alternative, after each intercepting interval obtained from the position selection operations generated by the user is identified, each intercepting interval may be displayed on the play progress bar; or, in the case that each intercepting interval is displayed on the play progress bar, the playing time information corresponding to each intercepting interval is also displayed on the play progress bar. The user can thus see whether the intercepted segment is correct, so that the user's actual interception requirement is met. The interception positions in the embodiment of the application are the start point position and the end point position of each intercepted segment generated by the user on the play progress bar, and each start point position together with its matched end point position is embodied as an intercepting interval of the multimedia data on the play progress bar. As shown in fig. 6, each rectangle displayed on the play progress bar represents the intercepting interval, within the entire multimedia data, of one intercepted segment selected by the user. When the device identifies each start point position and its matched end point position, it displays the corresponding intercepting interval on the play progress bar. In the case that each intercepting interval is displayed on the play progress bar, the playing time information corresponding to each intercepting interval is displayed on the play progress bar as well. For example, the playing time information corresponding to intercepting interval 1 is 01:20-01:45, which indicates that intercepted segment 1 is the data of the current multimedia data from 1 minute 20 seconds to 1 minute 45 seconds; the playing time information corresponding to intercepting interval 2 is 03:20-04:10, which indicates that intercepted segment 2 is the data of the current multimedia data from 3 minutes 20 seconds to 4 minutes 10 seconds. In fig. 6, the user first slides from 1 minute 20 seconds to 1 minute 45 seconds on the progress bar to select the data of segment 1, and then slides from 3 minutes 20 seconds to 4 minutes 10 seconds on the progress bar to select the data of segment 2. Taking the selection of segment 2 as an example, while the user slides from 3 minutes 20 seconds to 4 minutes 10 seconds on the progress bar, the playing time information of segment 2 within the whole multimedia data is displayed, and the displayed playing times appear one by one, from the time at which sliding starts to the time at which it ends, following the user's sliding on the progress bar and the pauses made during sliding. If the user pauses every 20s during the sliding, the displayed playing time changes as 03:20 → 03:40 → 04:00 → 04:10. Because the displayed time changes dynamically from the start of the slide to its end, the user can further confirm where to intercept, or whether the segment being intercepted is the one the user wished to intercept.
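The playing time information in the example above can be reproduced with a short Python sketch. The helper names `mmss`, `interval_label`, and `times_while_sliding` are hypothetical, and positions are assumed to be whole seconds from the start of the multimedia data.

```python
def mmss(seconds):
    """Format a position in seconds as MM:SS."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


def interval_label(start_s, end_s):
    """Playing time information of one intercepting interval."""
    return f"{mmss(start_s)}-{mmss(end_s)}"


def times_while_sliding(start_s, end_s, pause_step_s):
    """Times shown one by one as the user slides and pauses at a fixed step."""
    points = list(range(start_s, end_s, pause_step_s))
    points.append(end_s)  # the end point position is always shown last
    return [mmss(p) for p in points]


label_1 = interval_label(80, 105)            # intercepting interval 1
label_2 = interval_label(200, 250)           # intercepting interval 2
shown = times_while_sliding(200, 250, 20)    # pauses every 20 s
```

With these inputs the labels are 01:20-01:45 and 03:20-04:10, and the times shown during sliding are 03:20, 03:40, 04:00, 04:10, matching the example in the text.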

When the playing time information corresponding to each intercepting interval on the playing progress bar is displayed, the intercepting sequence of the intercepting intervals needs to be displayed. If the user selects the intercepting interval 1 first and then selects the intercepting interval 2, the intercepting sequence is 1 → 2. When displaying the playing time information, the playing time information corresponding to the capturing section 1 on the playing progress bar needs to be displayed in front of the playing time information corresponding to the capturing section 2 on the playing progress bar. Of course, the display order may be in reverse order of the interception order.

For the function key shown in fig. 6 for presenting the intercepted segments and their text data: after the user has finished selecting each segment to be intercepted, the device intercepts each such segment and performs text recognition to obtain each intercepted segment and its text data. When the device recognizes that the user has operated this function key, the display can jump from the interface shown in fig. 6 to the interface shown in fig. 7(a) or (b) to display the intercepted segments and their text data, which makes the display interface more user-friendly.

It should be understood by those skilled in the art that figs. 5-8 of the present application are merely specific examples and do not represent all implementations. Any reasonable implementation is covered by the scope of the embodiments of the present application. Among the function keys shown in figs. 5 to 7(a), (b), one is a key that can hide or open the play progress bar.

In general, the schemes shown in the first to third embodiments are equivalent to a scheme for quickly capturing and recording the segment content of the multimedia data. The scheme shown in the fourth embodiment is equivalent to a scheme for quickly capturing the segment content of the multimedia data and commenting.

In summary, at least the embodiments of the present application have the following advantages:

1) the device acquires or detects the interception operation of the user to obtain the starting point position and the ending point position of the segment in the whole multimedia data, which the user needs to listen to or watch repeatedly, which is equivalent to automatically intercepting the segment expected by the user according to the actual use operation of the user, and the interception of the expected intercepted segment is rapidly realized, so that the use experience of the user can be greatly improved. And the intercepted fragments and the text data thereof are visually displayed. Meanwhile, the text conversion of the audio data in the multimedia data is realized, the text content corresponding to the audio data does not need to be input by a user, the automatic recording of the text data is equivalently realized, the efficiency of recording the text data is improved, and the actual use requirements of the user are met.

2) The method can correspondingly display all text data or part of text data converted from the audio content in the intercepted segment, can provide clear display content for users, and is convenient for the users.

3) The comment information and the intercepted fragments of the user can be displayed together in the comment area, the multimedia fragments aiming at the comment contents can be clearly indicated, the comment content is visual, the efficiency of reading the comment information is improved for the user, and interaction among the users is facilitated.

An embodiment of the present application provides an apparatus for displaying audio and video data. As shown in fig. 9, the apparatus includes: a first obtaining unit 11, an intercepting unit 12, a second obtaining unit 13 and a display unit 14; wherein,

a first obtaining unit 11, configured to obtain at least one start point position and at least one end point position for current multimedia data; wherein any one of the at least one start point location is capable of matching one of the at least one end point location; at least one segment of the audio and video data is corresponding to the position from any one starting point to the matched end point;

an intercepting unit 12 for intercepting the at least one segment from the current multimedia data;

a second obtaining unit 13, configured to obtain presentation data for the at least one segment;

and the display unit 14 is used for performing associated display on the at least one segment and the display data.
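The cooperation of the four units can be sketched as a simple pipeline in Python. All function names are hypothetical, the multimedia data is modelled as a list with one item per second, start and end points are matched by their order of input, and `transcribe` stands in for whatever text conversion is used; each of these is an assumption of the sketch rather than part of the embodiment.

```python
def obtain_positions(starts, ends):
    """First obtaining unit: match each start point position to an end point position."""
    if len(starts) != len(ends):
        raise ValueError("every start point needs a matching end point")
    pairs = list(zip(starts, ends))  # assumed matching rule: pair by input order
    for start, end in pairs:
        if end <= start:
            raise ValueError("end point must come after its start point")
    return pairs


def intercept(media, pairs):
    """Intercepting unit: cut one segment per (start, end) pair."""
    return [media[start:end] for start, end in pairs]


def obtain_presentation_data(segments, transcribe):
    """Second obtaining unit: obtain presentation data, here text, per segment."""
    return [transcribe(segment) for segment in segments]


def display(segments, presentation):
    """Display unit: associate each segment with its presentation data."""
    return list(zip(segments, presentation))


media = [f"w{i}" for i in range(10)]        # toy data: one word per second
pairs = obtain_positions([1, 6], [3, 9])
segments = intercept(media, pairs)
texts = obtain_presentation_data(segments, " ".join)
shown = display(segments, texts)
```

Replacing `transcribe` with a function that returns comment information instead of converted text gives the comment-area variant of the second obtaining unit.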

In an alternative, the second obtaining unit 13 is configured to identify audio data in the at least one segment; converting the audio data into first text data, wherein the first text data are characterized as text data corresponding to the audio data one by one; correspondingly, the display unit 14 is configured to correspondingly display the first text data and the at least one segment.

In an optional scheme, the second obtaining unit 13 is configured to receive comment information for the at least one segment input in the comment area, and to load the at least one segment into the comment area; correspondingly, the display unit 14 is configured to correspondingly display the at least one segment and the comment information for the at least one segment in the comment area.

In an optional scheme, the second obtaining unit 13 is configured to identify key information in the first text data; correspondingly, the display unit 14 is configured to correspondingly display the identified key information and the at least one segment.

In an optional scheme, in the case that there are at least two truncation intervals, one of the truncation intervals is a position on the current multimedia data from a start point position to an end point position matched with the start point position; a first obtaining unit 11, configured to determine an interception order of each interception interval; the display unit 14 is configured to correspondingly display at least one segment corresponding to each intercepting interval and text data of at least one segment corresponding to each identified intercepting interval according to the intercepting order of each intercepting interval.

In an optional scheme, as shown in fig. 10, the apparatus further includes a detecting unit 15, configured to detect a trigger event for the current multimedia data, where the trigger event is used to intercept the current multimedia data; a first obtaining unit 11, configured to detect, when the detecting unit 15 detects the trigger event, a start point position selecting operation and an end point position selecting operation for each intercepting interval that are generated on the play progress bar of the current multimedia data.

In an optional scheme, the display unit 14 is configured to display each intercepting interval on the play progress bar; or, under the condition that each intercepting interval is displayed on the playing progress bar, displaying the corresponding playing time information of each intercepting interval on the playing progress bar.

In practical application, the first obtaining unit 11, the intercepting unit 12, the second obtaining unit 13, and the detecting unit 15 in the device for displaying audio and video data in the embodiment of the present application may be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Micro Control Unit (MCU) or a Field-Programmable Gate Array (FPGA). The display unit 14 may be implemented by a display screen.

It should be noted that, in the device for displaying audio and video data according to the embodiment of the present application, because the principle of the device for displaying audio and video data to solve the problem is similar to the method for displaying multimedia data, the implementation process and the implementation principle of the device for displaying audio and video data can be described by referring to the implementation process and the implementation principle of the method for displaying audio and video data, and repeated details are not repeated.

Here, it should be noted that: the descriptions of the embodiments of the apparatus are similar to the descriptions of the methods, and have the same advantages as the embodiments of the methods, and therefore are not repeated herein. For technical details that are not disclosed in the embodiments of the apparatus of the present invention, those skilled in the art should refer to the description of the embodiments of the method of the present invention to understand, and for brevity, will not be described again here.

The embodiment of the present application further provides a device for displaying audio and video data, including: one or more processors; a memory communicatively coupled to the one or more processors; one or more application programs; wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method described above.

In a specific example, the device for presenting audio and video data according to the embodiment of the present application may specifically have a structure as shown in fig. 11, where the device for presenting audio and video data at least includes a processor 51, a storage medium 52, and at least one external communication interface 53; the processor 51, the storage medium 52 and the external communication interface 53 are all connected by a bus 54. The processor 51 may be a microprocessor, a central processing unit, a digital signal processor, a programmable logic array, or other electronic components with processing functions. The storage medium has stored therein computer executable code capable of performing the method of any of the above embodiments. In practical applications, the first obtaining unit 11, the intercepting unit 12, the second obtaining unit 13, and the detecting unit 15 may be implemented by the processor 51.

Here, it should be noted that: the above description of the device embodiment showing audio and video data is similar to the above description of the method, and has the same beneficial effects as the method embodiment, and therefore, the details are not repeated. For technical details that are not disclosed in the embodiment of the device for displaying audio/video data of the present invention, those skilled in the art should refer to the description of the embodiment of the method of the present invention to understand that, for brevity, detailed description is not repeated here.

Embodiments of the present application also provide a computer-readable storage medium that stores a computer program; when executed by a processor, the computer program implements the method described above.
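
By way of illustration only, the steps performed by such a stored program (matching start point positions to end point positions, intercepting the corresponding segments, and associating presentation data with each segment) might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiments: the names `Segment`, `match_points`, `intercept`, and `attach_presentation` are hypothetical, the media is modeled as a plain list of samples at a fixed sample rate, and the matching rule (each start point is paired with the nearest later unused end point) is an assumption, since the application only requires that a start point can be matched with an end point.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Segment:
    start: float             # start point position, in seconds
    end: float               # matched end point position, in seconds
    samples: list            # intercepted slice of the media samples
    presentation: str = ""   # presentation data associated with the segment


def match_points(starts: Sequence[float],
                 ends: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair each start point with the nearest later unused end point.

    The pairing rule is an assumption made for this sketch.
    """
    remaining = sorted(ends)
    pairs = []
    for s in sorted(starts):
        later = [e for e in remaining if e > s]
        if not later:
            continue  # a start point with no later end point is skipped
        e = later[0]
        remaining.remove(e)
        pairs.append((s, e))
    return pairs


def intercept(samples: Sequence, rate: int,
              starts: Sequence[float],
              ends: Sequence[float]) -> List[Segment]:
    """Cut one segment out of the media for every matched (start, end) pair."""
    segments = []
    for s, e in match_points(starts, ends):
        lo, hi = int(s * rate), int(e * rate)
        segments.append(Segment(s, e, list(samples[lo:hi])))
    return segments


def attach_presentation(segments: List[Segment],
                        describe: Callable[[Segment], str]) -> List[Segment]:
    """Associate presentation data (here, a text label) with each segment."""
    for seg in segments:
        seg.presentation = describe(seg)
    return segments


# Stand-in for 10 seconds of media at 10 samples per second.
media = list(range(100))
segs = attach_presentation(
    intercept(media, rate=10, starts=[5.0, 1.0], ends=[3.0, 8.0]),
    describe=lambda seg: f"{seg.start:.1f}s-{seg.end:.1f}s",
)
```

In this sketch the `describe` callback stands in for whatever produces the presentation data, for example text recognized from the audio of the segment; the associated display itself is outside the scope of the sketch.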

A computer-readable storage medium can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Additionally, the computer-readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.

The embodiments described above are only a part of the embodiments of the present invention, and not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Claims

1. A method of presenting multimedia data, the method comprising:

intercepting the at least one segment from the current multimedia data;

obtaining presentation data for the at least one segment;

2. The method of claim 1, wherein, in the case that the current multimedia data includes at least audio data,

the obtaining presentation data for the at least one segment comprises:

identifying audio data in the at least one segment;

3. The method of claim 1, wherein the obtaining presentation data for the at least one segment comprises:

loading the at least one segment into a comment field;

4. The method of claim 2, further comprising:

identified key information in the first text data;

5. The method according to claim 2 or 3, wherein, in the case of at least two interception intervals, each interception interval is the span on the current multimedia data from a start point position to the end point position matched with that start point position; the method further comprises:

determining the interception sequence of each interception interval;

6. The method of claim 5, further comprising:

7. The method according to claim 6, wherein each interception location is displayed on the play progress bar;

8. A device for presenting audio-visual data, the device comprising:

9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

10. A device for presenting audio-visual data, comprising:

one or more processors;

a memory communicatively coupled to the one or more processors;

one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the steps of the method of any of claims 1 to 7.